High-Quality Multilingual Translation Data
13 Languages for Machine Learning
@kaggle.thedevastator_high_quality_multilingual_translation_data
By Huggingface Hub [source]
This extensive collection of multilingual translation data is a resource for machine learning research. With language pairs spanning both English and non-English languages, the dataset offers a broad selection of high-quality text translations, with thousands of records per language pair. Each file for a given language pair contains two columns, `id` and `translation`: an identification number for each translation record and the corresponding translation text. The consistent structure makes the data straightforward to use across machine learning workflows.
How to Use this Dataset
This multilingual translation data can be used for a variety of tasks, from training machine learning models to understanding language nuances. Below are some steps to get you started using this dataset:
- Select the language pair that you would like to work with (e.g. English-Spanish). The pair is given in the filename (e.g. `en-es_train`).
- Download and extract the file containing your selected language pair from this Kaggle dataset. You will find two files in the folder, one for training and one for testing: `Training_File` and `Test_File`.
- Open your chosen file in a spreadsheet program such as Microsoft Excel or Google Sheets to explore the contents of the dataset. You will find two columns: `id` (a unique identifier for each translation pair) and `translation`, which holds the text of the pair in both languages side by side.
- With these files you can then train machine learning models, apply natural language processing techniques, or simply explore correlations between languages, among other research applications.
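The steps above can also be carried out in code instead of a spreadsheet. The sketch below is a minimal illustration, not a description of how the files are officially meant to be read. It assumes that each `translation` cell is a Python-literal dictionary keyed by language code (for example `{'en': ..., 'es': ...}`), which is not confirmed by this page. The helper names `parse_translation` and `read_pairs` are hypothetical, and the sample rows are made up so that the block runs without the real files.

```python
import ast
import csv
import io


def parse_translation(cell: str) -> dict:
    """Parse one `translation` cell into a {language_code: text} dict.

    Assumption: the cell is a Python-literal dict such as
    "{'en': 'Hello', 'es': 'Hola'}". Adjust if the real files differ.
    """
    value = ast.literal_eval(cell)
    if not isinstance(value, dict):
        raise ValueError(f"expected a dict, got {type(value).__name__}")
    return value


def read_pairs(handle, src: str, tgt: str):
    """Yield (id, source_text, target_text) tuples from an open CSV file."""
    for row in csv.DictReader(handle):
        pair = parse_translation(row["translation"])
        yield int(row["id"]), pair[src], pair[tgt]


# Tiny in-memory stand-in for en-es_train.csv (made-up rows, not real data).
sample = io.StringIO('''id,translation
0,"{'en': 'Good morning', 'es': 'Buenos días'}"
1,"{'en': 'Thank you', 'es': 'Gracias'}"
''')

pairs = list(read_pairs(sample, src="en", tgt="es"))
print(pairs)
```

To run this against a downloaded file, replace the `io.StringIO` sample with something like `open("en-es_train.csv", newline="", encoding="utf-8")` and pass the two language codes that appear in the filename.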
- Developing machine translation models: This dataset can be used to train and evaluate a variety of different machine translation models. The data could be used to optimize existing algorithms as well as train entirely new models tailored specifically for multilingual applications.
- Improving natural language understanding: This corpus could be used to help build better artificial intelligence systems with an enhanced ability to process natural language inputs, thus allowing them to rapidly translate and respond accurately in multiple languages.
- Translating web content dynamically: Web developers can use this dataset to build websites and applications that automatically detect a visitor's language and generate translations for the matching language pair. A fast response would spare users from switching languages manually.
If you use this dataset in your research, please credit the original authors.
Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: ca-en_train.csv
| Column name | Description |
|---|---|
| translation | Contains both English and non-English translations side by side. (String) |
File: en-fi_train.csv
| Column name | Description |
|---|---|
| translation | Contains both English and non-English translations side by side. (String) |
File: en-es_train.csv
| Column name | Description |
|---|---|
| translation | Contains both English and non-English translations side by side. (String) |
If you use this dataset in your research, please credit Huggingface Hub.
CREATE TABLE fr_ru_train (
"id" BIGINT,
"translation" VARCHAR
);CREATE TABLE fr_sv_train (
"id" BIGINT,
"translation" VARCHAR
);CREATE TABLE hu_it_train (
"id" BIGINT,
"translation" VARCHAR
);CREATE TABLE hu_nl_train (
"id" BIGINT,
"translation" VARCHAR
);CREATE TABLE hu_no_train (
"id" BIGINT,
"translation" VARCHAR
);CREATE TABLE hu_pl_train (
"id" BIGINT,
"translation" VARCHAR
);CREATE TABLE hu_pt_train (
"id" BIGINT,
"translation" VARCHAR
);CREATE TABLE hu_ru_train (
"id" BIGINT,
"translation" VARCHAR
);CREATE TABLE it_nl_train (
"id" BIGINT,
"translation" VARCHAR
);CREATE TABLE it_pt_train (
"id" BIGINT,
"translation" VARCHAR
);CREATE TABLE it_ru_train (
"id" BIGINT,
"translation" VARCHAR
);CREATE TABLE it_sv_train (
"id" BIGINT,
"translation" VARCHAR
);Anyone who has the link will be able to view this.
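
To try the two-column table layout locally, the following minimal sketch creates one of the tables in an in-memory SQLite database and pulls a single language out of the `translation` column. It is an illustration under assumptions: SQLite stands in for whatever engine hosts the tables, the rows are made up, and the cell format (a Python-literal dictionary keyed by language code) is not confirmed by this page.

```python
import ast
import sqlite3

# In-memory database; SQLite accepts the BIGINT and VARCHAR type names.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE fr_ru_train ("id" BIGINT, "translation" VARCHAR)')

# Made-up rows for illustration only; the cell format is an assumption.
rows = [
    (0, "{'fr': 'Bonjour', 'ru': 'Здравствуйте'}"),
    (1, "{'fr': 'Merci', 'ru': 'Спасибо'}"),
]
conn.executemany("INSERT INTO fr_ru_train VALUES (?, ?)", rows)

# Read the cells back in id order and keep only the French side of each pair.
french = [
    ast.literal_eval(cell)["fr"]
    for (cell,) in conn.execute(
        'SELECT "translation" FROM fr_ru_train ORDER BY "id"'
    )
]
print(french)
conn.close()
```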