LinCE (Linguistic Code-switching Evaluation) by Kaggle | Other

About this Dataset

LinCE (Linguistic Code-switching Evaluation)

Data for training and evaluating NLP systems on code-switching tasks

By Huggingface Hub [source]

LinCE (Linguistic Code-switching Evaluation) is a benchmark for training and evaluating NLP systems on code-switched text. This copy covers four language pairs, identified in the file names as spaeng (Spanish-English), hineng (Hindi-English), nepeng (Nepali-English), and msaea (Modern Standard Arabic-Egyptian Arabic), and four tasks: language identification (LID), part-of-speech (POS) tagging, named entity recognition (NER), and sentiment analysis (SA). Not every task is available for every pair: LID covers all four pairs, NER covers spaeng, hineng, and msaea, POS covers spaeng and hineng, and SA covers spaeng only. Each task and pair combination comes with train, validation, and test files, so you can train per-pair models or combine pairs for cross-lingual experiments.

How to use the dataset

The dataset provides training, validation, and test data for language identification (LID), part-of-speech (POS) tagging, named entity recognition (NER), and sentiment analysis (SA) on code-switched text in four language pairs. The steps below outline one way to work with it.

Understand what is included in this dataset
This dataset contains annotated code-switched text for four language pairs: Spanish-English (spaeng), Hindi-English (hineng), Nepali-English (nepeng), and Modern Standard Arabic-Egyptian Arabic (msaea). Each file is named after its task, language pair, and split. For example, lid_msaea_test.csv holds the test split for language identification (LID) on the msaea pair and has 3 columns: idx, words, and lid. The POS, NER, and SA files add a pos, ner, or sa column, and most of them also keep the lid column (the msaea NER files do not). A brief summary of each file's contents is shown on its Kaggle page, and you can also inspect a loaded file with functions such as head() or describe(), depending on your tooling.
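
The naming convention and column layout described above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; parse_file_name and preview are hypothetical helper names, and SAMPLE is a made-up stand-in for a real file, since this page does not say how the words and lid cells are encoded.

```python
import csv
import io


def parse_file_name(file_name):
    """Split a name like 'lid_msaea_test.csv' into (task, language_pair, split)."""
    stem = file_name.rsplit(".", 1)[0]
    task, language_pair, split = stem.split("_")
    return task, language_pair, split


def preview(csv_text, n=2):
    """Return the column names and the first n rows of some CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [row for _, row in zip(range(n), reader)]
    return reader.fieldnames, rows


# Hypothetical placeholder content: only the header matches the documented
# schema (idx, words, lid); the cell values are invented for illustration.
SAMPLE = "idx,words,lid\n0,token_a token_b,lang1 lang2\n1,token_c,lang1\n"

columns, rows = preview(SAMPLE)
print(parse_file_name("lid_msaea_test.csv"))
print(columns)
```

To inspect a real file, read its text from disk and pass it to preview in place of SAMPLE.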

Decide what kind of analysis you want to do
Once you know what data is available, decide which model or analysis you want before writing any code. For example, to build a cross-lingual POS tagger you would combine training and validation sets from both language pairs that have POS annotations, selecting files such as pos_spaeng_train.csv and pos_hineng_validation.csv, so that the model can share knowledge across pairs during training. When designing the model architecture, tune task-specific hyperparameters together rather than in isolation, and choose a feature representation suited to code-switched text, since both affect performance.
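
As a rough illustration of the file selection step, the sketch below lists the files for one task and split across language pairs. TASK_PAIRS mirrors the tables listed on this page; files_for is a hypothetical helper, and the .csv file names are assumed to follow the pattern shown in the Columns section.

```python
# Language pairs available for each task, as listed in the Tables section.
TASK_PAIRS = {
    "lid": ["hineng", "msaea", "nepeng", "spaeng"],
    "ner": ["hineng", "msaea", "spaeng"],
    "pos": ["hineng", "spaeng"],
    "sa": ["spaeng"],
}
SPLITS = ("train", "validation", "test")


def files_for(task, split, pairs=None):
    """Return the CSV file names for a task and split.

    With pairs=None, every language pair available for the task is used.
    """
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split}")
    available = TASK_PAIRS[task]
    chosen = available if pairs is None else pairs
    missing = [pair for pair in chosen if pair not in available]
    if missing:
        raise ValueError(f"no {task} data for: {missing}")
    return [f"{task}_{pair}_{split}.csv" for pair in chosen]


# Files for a cross-lingual POS tagger trained on both annotated pairs.
print(files_for("pos", "train"))
print(files_for("pos", "validation"))
```

Raising an error for an unavailable combination (for example SA on hineng) catches a wrong file choice before any training code runs.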

Run appropriate algorithms on the data
With the data and task chosen, train your models with whatever tools you prefer and tune them using metrics such as accuracy and F1 score. Tune on the validation split only, then confirm that the system works reliably by evaluating it on the unseen test split. Hyperparameter tuning can play a significant role, and how much depends on the algorithm you choose.
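
The two metrics mentioned above can be computed as follows. This is a minimal token-level sketch in plain Python, with made-up POS tags as input; it uses macro-averaged F1, which is one of several common averaging choices, and NER is often scored at the entity level instead, which this sketch does not do.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was expected but missed
    labels = set(gold) | set(pred)
    # Each label occurs in gold or pred, so the denominator is never zero.
    scores = [
        2 * tp[label] / (2 * tp[label] + fp[label] + fn[label])
        for label in labels
    ]
    return sum(scores) / len(scores)


# Made-up example tags, not taken from the dataset.
gold = ["NOUN", "VERB", "NOUN", "ADJ"]
pred = ["NOUN", "NOUN", "NOUN", "ADJ"]
print(accuracy(gold, pred))
print(macro_f1(gold, pred))
```

Macro averaging gives rare labels the same weight as frequent ones, so it can differ noticeably from accuracy when the label distribution is skewed.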

Research Ideas

Developing a sentiment analysis system for code-switched text, starting from the Spanish-English files (the only pair with sentiment labels in this dataset).

Training a model to identify and classify named entities across multiple language pairs, such as recognizing proper nouns or locations regardless of language.

Developing a cross-lingual translator that translates text from one language to another with high accuracy.

Acknowledgements

If you use this dataset in your research, please credit the original authors.
Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: lid_msaea_test.csv

Column name	Description
words	The words in the text. (String)

File: lid_msaea_validation.csv

Column name	Description
words	The words in the text. (String)

File: pos_spaeng_train.csv

Column name	Description
words	The words in the text. (String)
pos	The part of speech tag for each word. (String)

File: lid_nepeng_validation.csv

Column name	Description
words	The words in the text. (String)

File: pos_hineng_test.csv

Column name	Description
words	The words in the text. (String)
pos	The part of speech tag for each word. (String)

File: sa_spaeng_train.csv

Column name	Description
words	The words in the text. (String)
sa	The sentiment label for the text. (String)

File: lid_spaeng_test.csv

Column name	Description
words	The words in the text. (String)

File: pos_hineng_validation.csv

Column name	Description
words	The words in the text. (String)
pos	The part of speech tag for each word. (String)

File: lid_hineng_test.csv

Column name	Description
words	The words in the text. (String)

Acknowledgements

If you use this dataset in your research, please credit Huggingface Hub.

Tables

Lid Hineng Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_hineng_test

174.77 KB
1854 rows
3 columns


CREATE TABLE lid_hineng_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Hineng Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_hineng_train

583.31 KB
4823 rows
3 columns


CREATE TABLE lid_hineng_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Hineng Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_hineng_validation

97.28 KB
744 rows
3 columns


CREATE TABLE lid_hineng_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Msaea Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_msaea_test

208.83 KB
1663 rows
3 columns


CREATE TABLE lid_msaea_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Msaea Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_msaea_train

1.19 MB
8464 rows
3 columns


CREATE TABLE lid_msaea_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Msaea Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_msaea_validation

167.63 KB
1116 rows
3 columns


CREATE TABLE lid_msaea_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Nepeng Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_nepeng_test

226.16 KB
3228 rows
3 columns


CREATE TABLE lid_nepeng_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Nepeng Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_nepeng_train

787.83 KB
8451 rows
3 columns


CREATE TABLE lid_nepeng_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Nepeng Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_nepeng_validation

130.58 KB
1332 rows
3 columns


CREATE TABLE lid_nepeng_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Spaeng Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_spaeng_test

494.24 KB
8289 rows
3 columns


CREATE TABLE lid_spaeng_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Spaeng Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_spaeng_train

1.54 MB
21030 rows
3 columns


CREATE TABLE lid_spaeng_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Spaeng Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_spaeng_validation

250.78 KB
3332 rows
3 columns


CREATE TABLE lid_spaeng_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Ner Hineng Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.ner_hineng_test

69.8 KB
522 rows
4 columns


CREATE TABLE ner_hineng_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "ner" VARCHAR
);

Ner Hineng Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.ner_hineng_train

171.86 KB
1243 rows
4 columns


CREATE TABLE ner_hineng_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "ner" VARCHAR
);

Ner Hineng Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.ner_hineng_validation

47.9 KB
314 rows
4 columns


CREATE TABLE ner_hineng_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "ner" VARCHAR
);

Ner Msaea Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.ner_msaea_test

134.93 KB
1110 rows
3 columns


CREATE TABLE ner_msaea_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "ner" VARCHAR
);

Ner Msaea Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.ner_msaea_train

1.39 MB
10103 rows
3 columns


CREATE TABLE ner_msaea_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "ner" VARCHAR
);

Ner Msaea Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.ner_msaea_validation

164.19 KB
1122 rows
3 columns


CREATE TABLE ner_msaea_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "ner" VARCHAR
);

Ner Spaeng Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.ner_spaeng_test

1.93 MB
23527 rows
4 columns


CREATE TABLE ner_spaeng_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "ner" VARCHAR
);

Ner Spaeng Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.ner_spaeng_train

2.87 MB
33611 rows
4 columns


CREATE TABLE ner_spaeng_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "ner" VARCHAR
);

Ner Spaeng Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.ner_spaeng_validation

883.35 KB
10085 rows
4 columns


CREATE TABLE ner_spaeng_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "ner" VARCHAR
);

Pos Hineng Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.pos_hineng_test

46.24 KB
299 rows
4 columns


CREATE TABLE pos_hineng_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "pos" VARCHAR
);

Pos Hineng Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.pos_hineng_train

190.4 KB
1030 rows
4 columns


CREATE TABLE pos_hineng_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "pos" VARCHAR
);

Pos Hineng Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.pos_hineng_validation

34.55 KB
160 rows
4 columns


CREATE TABLE pos_hineng_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "pos" VARCHAR
);

Pos Spaeng Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.pos_spaeng_test

394.96 KB
10720 rows
4 columns


CREATE TABLE pos_spaeng_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "pos" VARCHAR
);

Pos Spaeng Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.pos_spaeng_train

1.35 MB
27893 rows
4 columns


CREATE TABLE pos_spaeng_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "pos" VARCHAR
);

Pos Spaeng Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.pos_spaeng_validation

225.66 KB
4298 rows
4 columns


CREATE TABLE pos_spaeng_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "pos" VARCHAR
);

Sa Spaeng Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.sa_spaeng_test

499.63 KB
4736 rows
4 columns


CREATE TABLE sa_spaeng_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "sa" VARCHAR
);

Sa Spaeng Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.sa_spaeng_train

1.26 MB
12194 rows
4 columns


CREATE TABLE sa_spaeng_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "sa" VARCHAR
);

Sa Spaeng Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.sa_spaeng_validation

204.42 KB
1859 rows
4 columns


CREATE TABLE sa_spaeng_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "sa" VARCHAR
);
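
The table definitions above can be tried out locally. The sketch below assumes SQLite, which accepts these CREATE TABLE statements as written; it creates one table in an in-memory database and inserts two made-up rows, because the real cell contents are not shown on this page.

```python
import sqlite3

# DDL copied from the lid_hineng_test table definition.
DDL = """
CREATE TABLE lid_hineng_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)

# Hypothetical placeholder rows, invented for illustration only.
conn.executemany(
    "INSERT INTO lid_hineng_test (idx, words, lid) VALUES (?, ?, ?)",
    [(0, "token_a token_b", "lang1 lang2"), (1, "token_c", "lang1")],
)

# PRAGMA table_info returns one row per column; index 1 is the column name.
column_names = [row[1] for row in conn.execute("PRAGMA table_info(lid_hineng_test)")]
row_count = conn.execute("SELECT COUNT(*) FROM lid_hineng_test").fetchone()[0]
print(column_names, row_count)
conn.close()
```

The same pattern applies to the other tables; only the extra pos, ner, or sa column changes.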