Baselight
Sign In
kaggle

Language Identification Dataset

Kaggle

@kaggle.zarajamshaid_language_identification_datasst

Loading...
Loading...

This data is extract from WiLi-2018 wikipedia dataset

Dataset Description

WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235000 paragraphs of 235 languages.
Each language in this dataset contains 1000 rows/paragraphs.

After data selection and preprocessing I used the 22 selective languages from the original dataset Which Includes following Languages

⦁ English
⦁ Arabic
⦁ French
⦁ Hindi
⦁ Urdu
⦁ Portuguese
⦁ Persian
⦁ Pushto
⦁ Spanish
⦁ Korean
⦁ Tamil
⦁ Turkish
⦁ Estonian
⦁ Russian
⦁ Romanian
⦁ Chinese
⦁ Swedish
⦁ Latin
⦁ Indonesian
⦁ Dutch
⦁ Japanese
⦁ Thai


Related Datasets

Share link

Anyone who has the link will be able to view this.