Baselight

English-Thai Translation Quality

English-to-Thai Translation Quality

@kaggle.thedevastator_english_thai_translation_quality_dataset

By generated_reviews_enth (From Huggingface) [source]


About this dataset

The English-to-Thai Translation Quality Estimation Dataset is a collection of product reviews intended for machine translation, sentiment analysis, and translation quality estimation. It is aimed at Natural Language Processing (NLP) researchers and developers who want to improve English-to-Thai translation models.

The dataset consists of product review pairs in English and Thai. Each pair contains the original English text and its Thai translation. Each translation is labeled as acceptable or not acceptable, based on its fluency and adequacy.

The dataset has three columns. The translation column contains the Thai translation of the product review. The review_star column holds the star rating assigned to the review, which serves as an indicator of its overall sentiment. The correct column holds the label that marks the translation as acceptable or not.
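As a minimal sketch of reading these columns, the snippet below parses a CSV with Python's standard csv module. It assumes the files have a header row with the column names translation, review_star, and correct, and that the two numeric columns hold plain integers; the helper name read_rows and the sample rows are made up for illustration.

```python
import csv
import io


def read_rows(fileobj):
    """Parse rows with the three documented columns from an open CSV file."""
    rows = []
    for record in csv.DictReader(fileobj):
        rows.append({
            "translation": record["translation"],
            # Assumption: both columns hold plain integers (the page lists them as BIGINT).
            "review_star": int(record["review_star"]),
            "correct": int(record["correct"]),
        })
    return rows


# Tiny in-memory stand-in for train.csv; these rows are invented for illustration.
sample = io.StringIO(
    "translation,review_star,correct\n"
    "สินค้าดีมาก,5,1\n"
    "แย่ มาก สินค้า,1,0\n"
)
rows = read_rows(sample)
print(len(rows), rows[0]["review_star"], rows[0]["correct"])
```

With a real file, the same helper could be called as `with open("train.csv", newline="", encoding="utf-8") as f: rows = read_rows(f)`.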

How to use the dataset

  • Understanding the Dataset: The dataset contains product review pairs in both English and Thai, along with corresponding labels indicating whether the translations are considered acceptable or not. Familiarize yourself with the dataset's columns: translation, review_star, and correct.

  • Quality Estimation: This dataset can be used for training and evaluating translation quality estimation models, which automatically assess the fluency and adequacy of translated text. Use the translation column as the input and the correct column as the target label, and train your model with any suitable classification algorithm.

  • Machine Translation: The dataset can also be used for machine translation from English to Thai. With access to both the source (English) and target (Thai) texts, you can train statistical or neural translation models.

  • Sentiment Analysis: Another application is sentiment analysis, where you build models that predict the sentiment expressed in product reviews written in English or Thai, using the star ratings (review_star) as the sentiment signal. Any text classification approach can be used, from classical NLP pipelines to deep learning architectures.

  • Model Training: The dataset is already split into three files: train.csv for training (141,369 rows), validation.csv for performance evaluation during training (15,708 rows), and test.csv for final evaluation after model development (17,453 rows), which is roughly an 81/9/10 split. If you re-split the data for your own task, make sure each subset is an unbiased sample of all classes and keeps a similar distribution of labels.

  • Model Validation & Evaluation: Use the validation set (validation.csv) during model development to tune hyperparameters, optimize performance, and ensure accuracy. Validate your model's predictions against the provided labels of this subset. Finally, evaluate your models using the test set (test.csv) to gauge their overall performance on unseen data.

  • Iterative Improvements: Based on the evaluation results, adjust your models or experiment with different algorithms and techniques. It is often beneficial to repeat the model training and validation steps several times until you achieve satisfactory performance.

  • Ethical Considerations: When using this dataset, adhere to ethical guidelines and respect privacy and user rights.
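The label-distribution check and the evaluation step described in the list above can be sketched with the standard library alone. This is an illustrative sketch, not the dataset authors' evaluation protocol: the function names are hypothetical, the labels are made up, and it assumes the correct column uses 1 for acceptable and 0 for not acceptable.

```python
from collections import Counter


def label_distribution(labels):
    """Return the share of each label value in a split."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 with respect to one positive class."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("need at least one prediction")
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    hits = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": hits / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented labels standing in for the `correct` column of validation.csv
# (assumption: 1 = acceptable, 0 = not acceptable).
y_true = [1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]
dist = label_distribution(y_true)
metrics = binary_metrics(y_true, y_pred)
print(dist)
print(metrics)
```

Comparing the output of label_distribution across the three files is a quick way to see whether the splits carry similar label shares before training. Reporting precision, recall, and F1 alongside accuracy helps when one label dominates.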

Research Ideas

  • Developing machine translation models: The dataset can be used to train and evaluate machine translation models for English-to-Thai translations. By using the labeled reviews as training data, the models can learn to generate accurate and fluent translations.
  • Sentiment analysis: The dataset can also be used for sentiment analysis in English and Thai. By pairing the translated product reviews with their star ratings, sentiment analysis models can be trained to classify reviews as positive or negative.
  • Translation quality estimation: The dataset can serve as a resource for training translation quality estimation models. These models aim to determine the quality of translated text by assessing factors like fluency and adequacy. By using the labeled translations in this dataset, such models can better predict the quality of future translations from English to Thai.
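For the sentiment analysis idea above, star ratings first have to be turned into sentiment classes. The sketch below shows one possible mapping. The thresholds, the three-way labels, and the function name star_to_sentiment are illustrative assumptions, as is the 1 to 5 rating scale; the dataset itself does not define a sentiment label.

```python
def star_to_sentiment(review_star, negative_max=2, positive_min=4):
    """Map a star rating to a coarse sentiment label.

    The thresholds are an illustrative assumption, not part of the dataset.
    """
    if review_star <= negative_max:
        return "negative"
    if review_star >= positive_min:
        return "positive"
    return "neutral"


# Assumed 1-5 rating scale, used here only for demonstration.
labels = [star_to_sentiment(star) for star in [1, 2, 3, 4, 5]]
print(labels)
```

For a strictly binary positive/negative task, one common option is to drop the middle ratings rather than force them into either class.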

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

Files: train.csv, validation.csv, test.csv (all three files share the same columns)

Column name    Type       Description
translation    Text       The translated product reviews in Thai.
review_star    Numeric    The star ratings assigned by reviewers for the product reviews.
correct        Boolean    Labels indicating the correctness of translations, whether they are considered acceptable or not based on fluency and adequacy criteria.

Acknowledgements

If you use this dataset in your research, please credit generated_reviews_enth (From Huggingface).

Tables

Test

@kaggle.thedevastator_english_thai_translation_quality_dataset.test
  • 8.69 MB
  • 17453 rows
  • 3 columns

CREATE TABLE test (
  "translation" VARCHAR,
  "review_star" BIGINT,
  "correct" BIGINT
);

Train

@kaggle.thedevastator_english_thai_translation_quality_dataset.train
  • 70.72 MB
  • 141369 rows
  • 3 columns

CREATE TABLE train (
  "translation" VARCHAR,
  "review_star" BIGINT,
  "correct" BIGINT
);

Validation

@kaggle.thedevastator_english_thai_translation_quality_dataset.validation
  • 7.87 MB
  • 15708 rows
  • 3 columns

CREATE TABLE validation (
  "translation" VARCHAR,
  "review_star" BIGINT,
  "correct" BIGINT
);
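To show how the three-column schema can be queried, the sketch below recreates the validation table in an in-memory SQLite database and computes the share of acceptable translations per star rating. The rows are invented, and the query assumes correct is stored as 0 or 1. On the hosting platform the table would likely need its fully qualified name, and the SQL dialect may differ from SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same three columns as the listed schema; SQLite accepts these type names.
conn.execute(
    'CREATE TABLE validation ("translation" VARCHAR, "review_star" BIGINT, "correct" BIGINT)'
)
# Invented rows for illustration only.
conn.executemany(
    "INSERT INTO validation VALUES (?, ?, ?)",
    [
        ("ตัวอย่าง 1", 5, 1),
        ("ตัวอย่าง 2", 5, 0),
        ("ตัวอย่าง 3", 1, 1),
        ("ตัวอย่าง 4", 1, 1),
    ],
)
# Acceptance rate per star rating (assumes correct is 0 or 1).
query = """
    SELECT review_star, COUNT(*) AS n, AVG(correct) AS acceptance_rate
    FROM validation
    GROUP BY review_star
    ORDER BY review_star
"""
result = conn.execute(query).fetchall()
conn.close()
print(result)
```

Each result row holds the star rating, the number of reviews with that rating, and the fraction of them labeled acceptable.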
