BnPC: Benchmarking Bangla Paraphrase Detection with a Gold Standard Corpus

BnPC: A Gold Standard Corpus for Paraphrase Detection in Bangla, and its Evaluation

This work was accepted at the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024. The paper is available in the ACL Anthology: https://aclanthology.org/2024.bucc-1.8

Abstract

In this paper, we present a benchmark dataset for paraphrase detection in Bangla. Although Bangla is the sixth most spoken language in the world, paraphrase identification in Bangla has barely been explored. Our dataset contains 8,787 human-annotated sentence pairs collected from the headlines of 23 newspaper outlets across four categories. To benchmark the dataset, we explored several supervised modeling approaches, including similarity metrics, linguistic features, and fine-tuned BERT models. We also conducted a zero-shot analysis of pre-trained BERT models, and carried out both zero-shot and few-shot evaluations of the publicly accessible generative language model GPT-3.5 Turbo. In the few-shot setting, GPT-3.5 grasps paraphrases about as well as fine-tuned mBERT language models when given only a handful of examples. Among the benchmarking trials, the fine-tuned BanglaBERT performed best, achieving a weighted-F1 score of 87.91. Notably, GPT-3.5 attained weighted-F1 scores of 51.51 and 80.53 in the zero-shot and few-shot experiments, respectively. We also performed a cross-dataset analysis, and the results suggest that the model trained on our dataset exhibits both diversity and generalization when tested on the other dataset. Finally, we report a human evaluation experiment to better understand the limitations of the paraphrasing task.
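The abstract reports results as weighted-F1 scores. As a reminder of what that metric measures, here is a minimal sketch in plain Python: per-class F1 scores averaged with weights proportional to each class's share of the true labels. The function name `weighted_f1` and the label convention (1 = paraphrase, 0 = not a paraphrase) are assumptions made for illustration only; this is not the paper's evaluation code, and the paper reports the score on a 0 to 100 scale, whereas this sketch returns a value between 0 and 1.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (range 0.0 to 1.0).

    Hypothetical helper for illustration; labels can be any hashable values.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0

    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp  # true instances of this class that were missed

        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)

        # Weight each class's F1 by its share of the true labels.
        score += (n / total) * f1

    return score


if __name__ == "__main__":
    # Assumed convention: 1 = paraphrase, 0 = not a paraphrase.
    gold = [1, 1, 1, 0]
    predicted = [1, 1, 0, 0]
    print(round(100 * weighted_f1(gold, predicted), 2))
```

Because classes are weighted by support, the majority class dominates the score on an imbalanced dataset, which is worth keeping in mind when comparing against macro-averaged results.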

BibTeX

@inproceedings{saha-etal-2024-bnpc,
title = "{B}n{PC}: A Gold Standard Corpus for Paraphrase Detection in {B}angla, and its Evaluation",
author = "Saha, Sourav and
Nobin, Zeshan Ahmed and
Chowdhury, Mufassir Ahmad and
Mobin, Md. Shakirul Hasan Khan and
Amin, Mohammad Ruhul and
Kar, Sudipta",
editor = "Zweigenbaum, Pierre and
Rapp, Reinhard and
Sharoff, Serge",
booktitle = "Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.bucc-1.8",
pages = "69--84",
}
