Baselight

Language Identification Dataset

This data is extracted from the WiLI-2018 Wikipedia dataset.

@kaggle.zarajamshaid_language_identification_datasst


About this Dataset

WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235000 paragraphs in 235 languages.
Each language in this dataset contributes 1000 rows (paragraphs).

After data selection and preprocessing, I kept 22 selected languages from the original dataset. They are the following:

⦁ English
⦁ Arabic
⦁ French
⦁ Hindi
⦁ Urdu
⦁ Portuguese
⦁ Persian
⦁ Pushto
⦁ Spanish
⦁ Korean
⦁ Tamil
⦁ Turkish
⦁ Estonian
⦁ Russian
⦁ Romanian
⦁ Chinese
⦁ Swedish
⦁ Latin
⦁ Indonesian
⦁ Dutch
⦁ Japanese
⦁ Thai
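
As a quick sanity check on the numbers above, the following sketch multiplies the 22 listed languages by the 1000 rows per language and compares the result with the 22000 rows reported for the table. It assumes the subset keeps all 1000 paragraphs for each language and that the label strings match the names in the list. The constant names are illustrative and not part of the dataset.

```python
# Languages as listed on this page (assumed to match the dataset's label strings).
LANGUAGES = [
    "English", "Arabic", "French", "Hindi", "Urdu", "Portuguese",
    "Persian", "Pushto", "Spanish", "Korean", "Tamil", "Turkish",
    "Estonian", "Russian", "Romanian", "Chinese", "Swedish", "Latin",
    "Indonesian", "Dutch", "Japanese", "Thai",
]

# Assumption: every selected language keeps its full 1000 paragraphs.
ROWS_PER_LANGUAGE = 1000

expected_rows = len(LANGUAGES) * ROWS_PER_LANGUAGE
print(len(LANGUAGES), expected_rows)
```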

Tables

Dataset

@kaggle.zarajamshaid_language_identification_datasst.dataset
  • 8.27 MB
  • 22000 rows
  • 2 columns

CREATE TABLE dataset (
  "text" VARCHAR,
  "language" VARCHAR
);
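
To illustrate how a table with this schema might be queried, here is a minimal sketch that uses Python's built-in `sqlite3` module with an in-memory database. The sample rows are invented for illustration and are not taken from the dataset, and the helper names `build_sample_db` and `rows_per_language` are hypothetical.

```python
import sqlite3

# Hypothetical sample rows for illustration only; the real table holds 22000 rows.
SAMPLE_ROWS = [
    ("The quick brown fox jumps over the lazy dog.", "English"),
    ("Le renard brun saute par-dessus le chien paresseux.", "French"),
    ("El zorro marrón salta sobre el perro perezoso.", "Spanish"),
    ("A second paragraph written in English.", "English"),
]


def build_sample_db(rows):
    """Create an in-memory table with the same two columns as the schema above."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE dataset ("text" VARCHAR, "language" VARCHAR)')
    conn.executemany("INSERT INTO dataset VALUES (?, ?)", rows)
    return conn


def rows_per_language(conn):
    """Return a mapping from language label to the number of rows with that label."""
    cur = conn.execute(
        'SELECT "language", COUNT(*) FROM dataset '
        'GROUP BY "language" ORDER BY "language"'
    )
    return dict(cur.fetchall())


conn = build_sample_db(SAMPLE_ROWS)
counts = rows_per_language(conn)
print(counts)
```

Run against the full table, the same `GROUP BY` query would be expected to return 1000 rows for each of the 22 languages, if the per-language counts stated above hold.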
