English-Darija Bilingual Text (Moroccan Arabic) by Kaggle | Other

About this Dataset

English-Darija Bilingual Dataset

English-Darija Bilingual Corpus for Machine Translation

By M-A-D (From Huggingface)

The M-A-D/DarijaBridge dataset is a community-driven bilingual corpus created by the MAD-Community. Its main purpose is to facilitate machine translation tasks between Darija (Moroccan Arabic) and English, providing a valuable resource for developing and fine-tuning machine translation models. The dataset consists of sentences in Darija paired with their corresponding translations in English, allowing for the training of accurate and culturally relevant translation models. This dataset is particularly beneficial for underrepresented languages and dialects like Darija, as it aims to improve translation accuracy and promote cultural inclusivity.

The dataset has seven columns: sentence, the original sentence in either Darija or English; translation, the corresponding translation in the other language; translated, a boolean that indicates whether the translation is accurate; corrected, a boolean that indicates whether a correction has been made to the translation; correction, the corrected text of the translation, if any; quality, an integer rating of the overall quality of the translation and any correction; and metadata, additional information or context related to each sentence.

By using this dataset to train machine translation models, researchers can help bridge linguistic barriers and enable effective communication between speakers of Darija and English.

How to use the dataset

Understanding the Dataset Structure:
The dataset consists of a CSV file called 'train.csv', which contains several columns providing essential information:

'sentence' column: This column contains original sentences in either Darija or English.

'translation' column: Here, you can find the translations of the original sentences in the other language.

'translated' column: Indicates whether the provided translation is accurate or not (True/False).

'corrected' column: Indicates whether a correction has been made to the translation (True/False).

'correction' column: If applicable, this column presents the corrected version of the translation.

'quality' column: An integer rating of the overall quality of the translation and any correction.

'metadata' column: Additional information or context related to each sentence.
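
The column layout above can be read with Python's standard csv module. The following is a minimal sketch, not an official loader. It assumes the column types given in the table schema for train.csv ('corrected' as a boolean flag, 'correction' as text) and that booleans are serialized as "True"/"False". The helper names parse_bool and read_pairs are hypothetical.

```python
import csv


def parse_bool(value):
    # Assumption: booleans appear in the CSV as "True"/"False" (any letter case).
    return value.strip().lower() == "true"


def read_pairs(csv_lines):
    """Yield (sentence, translation) pairs from an iterable of CSV lines.

    Rows whose 'translated' flag is not true are skipped. When 'corrected'
    is true and 'correction' is non-empty, the corrected text replaces the
    original translation.
    """
    for row in csv.DictReader(csv_lines):
        if not parse_bool(row["translated"]):
            continue
        if parse_bool(row["corrected"]) and row["correction"].strip():
            target = row["correction"]
        else:
            target = row["translation"]
        yield row["sentence"], target
```

For the real file, pass an open file handle, for example open("train.csv", newline="", encoding="utf-8"). Whether to prefer the corrected text over the original translation is a design choice, so adjust it if you need both versions.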

Training Machine Translation Models:
This bilingual corpus can be used to train machine translation models that translate between Darija and English. The steps below outline one possible workflow:

a) Preprocessing Data:
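Before training your model, preprocess both the source (Darija) and target (English) sentences. Typical steps include text normalization and cleaning, removal of empty or duplicated pairs, and tokenization, usually with the subword tokenizer of the model you plan to train. Removing stop words or punctuation, which is common in text classification, is generally unsuitable for translation because the model has to reproduce complete sentences.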
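
A light cleaning pass might look like the sketch below. It is one possible set of steps rather than a prescribed pipeline: Unicode NFKC normalization, removal of the Arabic tatweel (elongation) character, whitespace collapsing, and filtering of empty, duplicated, or badly length-mismatched pairs. It leaves stop words and punctuation in place, since a translation model needs the full sentence. The function names and the max_ratio threshold are assumptions chosen for illustration.

```python
import re
import unicodedata

_WHITESPACE = re.compile(r"\s+")
# Tatweel (kashida) only stretches a word visually and carries no meaning.
_TATWEEL = "\u0640"


def normalize(text):
    """Apply NFKC normalization, drop tatweel, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace(_TATWEEL, "")
    return _WHITESPACE.sub(" ", text).strip()


def clean_pairs(pairs, max_ratio=3.0):
    """Yield normalized (source, target) pairs, skipping unusable ones.

    max_ratio is an illustrative threshold on the word-count ratio between
    the two sides; pairs that exceed it are often misaligned.
    """
    seen = set()
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not src or not tgt:
            continue
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        yield src, tgt
```

Subword tokenization is left out here on purpose, because it is normally handled by the tokenizer that ships with the model being fine-tuned.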

b) Training Data Split:
For supervised machine learning approaches, consider splitting your data into three subsets - training set, validation set, and test set. The training set should be used to train your model, the validation set for tuning hyperparameters and monitoring training progress, while the test set serves as a final evaluation measure.
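
A reproducible three-way split can be sketched as follows. The 90/5/5 proportions and the seed are arbitrary assumptions, and split_pairs is a hypothetical helper name.

```python
import random


def split_pairs(pairs, val_frac=0.05, test_frac=0.05, seed=13):
    """Shuffle pairs with a fixed seed and return (train, val, test) lists."""
    items = list(pairs)
    random.Random(seed).shuffle(items)
    n_val = int(len(items) * val_frac)
    n_test = int(len(items) * test_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test
```

Deduplicating the pairs before splitting helps keep identical sentences from landing in both the training and test sets.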

c) Training with Transformers:
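Models based on the Transformer architecture (such as BERT and GPT-2) have achieved state-of-the-art performance in a wide range of natural language processing tasks. For translation, encoder-decoder Transformers such as T5 or MarianMT are the more usual choice, since BERT is encoder-only and GPT-2 is decoder-only. Consider fine-tuning such a model on the M-A-D/DarijaBridge dataset.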
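
Fine-tuning code depends on the framework and model chosen, so the sketch below only covers getting the pairs into a form a training script can read. It assumes a JSON Lines layout with one "translation" object per line keyed by language code, a layout used for instance by the Hugging Face translation example scripts. The codes "ary" (the ISO 639-3 code for Moroccan Arabic) and "en", and the function name to_jsonl_lines, are assumptions. It also assumes the source side of each pair is Darija.

```python
import json


def to_jsonl_lines(pairs, src_lang="ary", tgt_lang="en"):
    """Yield one JSON line per (source, target) pair."""
    for src, tgt in pairs:
        record = {"translation": {src_lang: src, tgt_lang: tgt}}
        # ensure_ascii=False keeps Arabic-script text readable in the file.
        yield json.dumps(record, ensure_ascii=False) + "\n"
```

The lines can be written with writelines on a file opened with encoding="utf-8", one file per split.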

Evaluation and Iterative Process:
After training your machine translation model, evaluate its performance on the held-out test set using appropriate metrics such as BLEU or ROUGE.
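
To make the BLEU metric concrete, here is a simplified corpus-level implementation. It is a sketch under stated assumptions: whitespace tokenization, exactly one reference per hypothesis, n-grams up to length 4, and no smoothing. For results that need to be comparable with published numbers, an established tool such as sacreBLEU is the safer choice.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus BLEU on a 0-100 scale with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum(
                min(count, ref_counts[gram]) for gram, count in hyp_counts.items()
            )
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

The function takes two parallel lists of strings: the model outputs for the test set and the corresponding reference translations.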

Research Ideas

Building and improving machine translation models: The dataset can be used to train machine translation models that can accurately translate between Darija and English. By using the sentences and their corresponding translations, the models can learn to understand the nuances and specificities of both languages, leading to better translations.

Analyzing language patterns and cultural differences: Researchers can use this dataset to analyze language patterns, syntactical structures, and cultural differences between Darija and English. This analysis can lead to a better understanding of how languages differ from each other, facilitating cross-cultural communication.

Developing language learning resources: The dataset can also be used to create language learning resources for individuals interested in learning either Darija or English. By providing parallel sentences in both languages, learners can compare sentence structures, vocabulary usage, and idiomatic expressions, enhancing their understanding of the languages.

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: train.csv

Column name	Description
sentence	The original sentence, in either Darija or English. (Text)
translation	The translation of the sentence into the other language. (Text)
translated	Indicates whether or not the translation provided is accurate. (Boolean)
corrected	Indicates whether a correction has been made to improve accuracy. (Boolean)
correction	The corrected translation, if one has been made. Otherwise, it is left blank. (Text)
quality	An assessment of overall translation quality. (Integer)
metadata	Additional information or context related to each sentence. (Text)

Acknowledgements

If you use this dataset in your research, please credit M-A-D (From Huggingface).

Tables

Train

@kaggle.thedevastator_english_darija_bilingual_dataset.train

22.2 MB
1235091 rows
7 columns


CREATE TABLE train (
  "sentence" VARCHAR,
  "translation" VARCHAR,
  "translated" BOOLEAN,
  "corrected" BOOLEAN,
  "correction" VARCHAR,
  "quality" BIGINT,
  "metadata" VARCHAR
);
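
The schema above can be tried out locally with Python's built-in sqlite3 module. This is a sketch only: SQLite stands in for whatever engine serves the hosted table, the helper names are hypothetical, and the inserted rows are placeholders, not values from the dataset.

```python
import sqlite3

# Schema copied from the table definition above.
DDL = """
CREATE TABLE train (
  "sentence" VARCHAR,
  "translation" VARCHAR,
  "translated" BOOLEAN,
  "corrected" BOOLEAN,
  "correction" VARCHAR,
  "quality" BIGINT,
  "metadata" VARCHAR
)
"""


def build_demo_db(rows):
    """Create an in-memory database and insert rows of seven values each."""
    conn = sqlite3.connect(":memory:")
    conn.execute(DDL)
    conn.executemany("INSERT INTO train VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


def quality_counts(conn):
    """Return (quality, row count) for translated rows, ordered by quality."""
    query = (
        "SELECT quality, COUNT(*) FROM train "
        "WHERE translated GROUP BY quality ORDER BY quality"
    )
    return conn.execute(query).fetchall()


# Placeholder rows for illustration only; they are not taken from the dataset.
demo_rows = [
    ("source 1", "target 1", True, False, None, 1, "{}"),
    ("source 2", "target 2", True, True, "fixed target 2", 1, "{}"),
    ("source 3", None, False, False, None, 0, "{}"),
]
conn = build_demo_db(demo_rows)
print(quality_counts(conn))
```

SQLite stores the boolean columns as 0 and 1, so a condition such as WHERE corrected selects the rows that carry a correction.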