Language Identification Dataset
This data is extracted from the WiLI-2018 Wikipedia dataset.
@kaggle.zarajamshaid_language_identification_datasst
WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages.
Each language in this dataset contains 1000 rows/paragraphs.
After data selection and preprocessing, I used 22 selected languages from the original dataset, which include the following:
⦁ English
⦁ Arabic
⦁ French
⦁ Hindi
⦁ Urdu
⦁ Portuguese
⦁ Persian
⦁ Pushto
⦁ Spanish
⦁ Korean
⦁ Tamil
⦁ Turkish
⦁ Estonian
⦁ Russian
⦁ Romanian
⦁ Chinese
⦁ Swedish
⦁ Latin
⦁ Indonesian
⦁ Dutch
⦁ Japanese
⦁ Thai
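
The language list above can be turned into a quick label check. The following is a minimal sketch, assuming the labels in the "language" column are spelled exactly as in the list; the names LANGUAGES and check_labels are hypothetical helpers and are not part of the dataset.

```python
from collections import Counter

# The 22 language labels listed above, spelled as in this document.
# Assumption: the dataset's "language" column uses these exact spellings.
LANGUAGES = [
    "English", "Arabic", "French", "Hindi", "Urdu", "Portuguese",
    "Persian", "Pushto", "Spanish", "Korean", "Tamil", "Turkish",
    "Estonian", "Russian", "Romanian", "Chinese", "Swedish", "Latin",
    "Indonesian", "Dutch", "Japanese", "Thai",
]


def check_labels(labels, expected_per_language=None):
    """Return (unknown, off_count) for an iterable of language labels.

    unknown:   sorted labels that are not in LANGUAGES.
    off_count: {language: actual_count} for every listed language whose
               count differs from expected_per_language (empty if no
               expected count is given).
    """
    counts = Counter(labels)
    unknown = sorted(set(counts) - set(LANGUAGES))
    off_count = {}
    if expected_per_language is not None:
        for lang in LANGUAGES:
            if counts.get(lang, 0) != expected_per_language:
                off_count[lang] = counts.get(lang, 0)
    return unknown, off_count
```

If each language really has 1000 rows, calling check_labels on the full label column with expected_per_language=1000 should return two empty results.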
CREATE TABLE dataset (
"text" VARCHAR,
"language" VARCHAR
);
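
The schema above can be tried out locally with Python's built-in sqlite3 module. This is only a sketch under stated assumptions: the in-memory database, the helper names build_demo_db and rows_per_language, and the three demo rows are all hypothetical placeholders and are not taken from the dataset.

```python
import sqlite3


def build_demo_db(rows):
    """Create an in-memory SQLite table matching the schema above and load rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE dataset ("text" VARCHAR, "language" VARCHAR)')
    conn.executemany(
        'INSERT INTO dataset ("text", "language") VALUES (?, ?)', rows
    )
    conn.commit()
    return conn


def rows_per_language(conn):
    """Return {language: row_count} for the dataset table."""
    cur = conn.execute(
        'SELECT "language", COUNT(*) FROM dataset '
        'GROUP BY "language" ORDER BY "language"'
    )
    return dict(cur.fetchall())


# Placeholder rows for illustration only; they are NOT from the dataset.
demo_rows = [
    ("This is a short English sentence.", "English"),
    ("Another English example.", "English"),
    ("Ceci est une phrase en français.", "French"),
]

conn = build_demo_db(demo_rows)
print(rows_per_language(conn))
```

Grouping by "language" is a simple way to confirm the per-language row counts once the real rows are loaded in place of the placeholders.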