Language Identification Dataset
This data is extracted from the WiLI-2018 Wikipedia dataset.
@kaggle.zarajamshaid_language_identification_datasst
WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages.
Each language in this dataset contains 1000 rows/paragraphs.
After data selection and preprocessing, I kept 22 selected languages from the original dataset. They are the following:
⦁	English
⦁	Arabic
⦁	French
⦁	Hindi
⦁	Urdu
⦁	Portuguese
⦁	Persian
⦁	Pushto
⦁	Spanish
⦁	Korean
⦁	Tamil
⦁	Turkish
⦁	Estonian
⦁	Russian
⦁	Romanian
⦁	Chinese
⦁	Swedish
⦁	Latin
⦁	Indonesian
⦁	Dutch
⦁	Japanese
⦁	Thai
CREATE TABLE dataset (
  "text" VARCHAR,
  "language" VARCHAR
);
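
A minimal sketch of how the schema above could be used to check the per-language row counts. It assumes an in-memory SQLite database from Python's standard library; the helper names (build_demo_db, rows_per_language) and the sample rows are made up for illustration and are not part of the dataset.

```python
import sqlite3

# Schema as given in the dataset description.
SCHEMA = '''
CREATE TABLE dataset (
  "text" VARCHAR,
  "language" VARCHAR
);
'''

# Placeholder rows invented for this sketch; NOT taken from the real data.
SAMPLE_ROWS = [
    ("This is a short English sentence.", "English"),
    ("Another English example.", "English"),
    ("Ceci est une phrase en français.", "French"),
]


def build_demo_db(rows):
    """Create an in-memory database with the schema and insert the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        'INSERT INTO dataset ("text", "language") VALUES (?, ?)', rows
    )
    conn.commit()
    return conn


def rows_per_language(conn):
    """Return a {language: row_count} mapping for the dataset table."""
    cur = conn.execute(
        'SELECT "language", COUNT(*) FROM dataset GROUP BY "language"'
    )
    return dict(cur.fetchall())


if __name__ == "__main__":
    conn = build_demo_db(SAMPLE_ROWS)
    for language, count in sorted(rows_per_language(conn).items()):
        print(language, count)
```

Run against the real data, and assuming the 1000-paragraphs-per-language figure carries over to the 22-language subset, the same query would be expected to report 22 languages with 1000 rows each.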