Baselight

LinCE (Linguistic Code-switching Evaluation)

Data for training and evaluating NLP systems on code-switching tasks

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset


About this Dataset

By Huggingface Hub [source]


LinCE is a benchmark for linguistic code-switching: text in which speakers alternate between two languages within a single conversation or sentence. This copy of the dataset covers four language pairs: Spanish-English (spaeng), Hindi-English (hineng), Nepali-English (nepeng), and Modern Standard Arabic-Egyptian Arabic (msaea). It provides annotations for four tasks: language identification (LID), part-of-speech (POS) tagging, named entity recognition (NER), and sentiment analysis (SA). Not every task is available for every pair: LID covers all four pairs, NER covers Spanish-English, Hindi-English, and MSA-EA, POS covers Spanish-English and Hindi-English, and SA covers Spanish-English only. You can train a model for a single task and pair, or combine pairs to build cross-lingual models.

How to use the dataset

To get started with code-switched natural language processing (NLP) on LinCE, follow the three steps below. The dataset provides training, validation, and test splits for language identification (LID), part-of-speech (POS) tagging, named entity recognition (NER), and sentiment analysis (SA) across four language pairs.

Understand what is included in this dataset
This dataset includes annotated code-switched text for four language pairs: Spanish-English (spaeng), Hindi-English (hineng), Nepali-English (nepeng), and Modern Standard Arabic-Egyptian Arabic (msaea). Each file is named after its task, language pair, and split. For example, lid_msaea_test.csv contains the test split for language identification (LID) on the MSA-EA pair and has 3 columns: idx, words, and lid. POS, NER, and SA files carry an additional label column named after the task (pos, ner, or sa), and the ner_msaea files have no lid column. A brief summary of each file's contents is shown when you open the dataset on Kaggle, or you can inspect a file yourself with a call such as head() or describe(), depending on your tooling.
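The naming convention described above can be sketched in a few lines of Python. This is a minimal illustration rather than anything shipped with the dataset: `FileInfo`, `parse_filename`, and `expected_columns` are hypothetical helper names, and the column rules are inferred from the table schemas listed on this page.

```python
from typing import List, NamedTuple

# Task and pair codes as they appear in the file names on this page.
TASKS = {"lid", "pos", "ner", "sa"}
PAIRS = {"spaeng", "hineng", "nepeng", "msaea"}
SPLITS = {"train", "validation", "test"}


class FileInfo(NamedTuple):
    task: str
    pair: str
    split: str


def parse_filename(name: str) -> FileInfo:
    """Split a name such as 'lid_msaea_test.csv' into task, pair, and split."""
    stem = name[:-4] if name.endswith(".csv") else name
    parts = stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected file name: {name!r}")
    task, pair, split = parts
    if task not in TASKS or pair not in PAIRS or split not in SPLITS:
        raise ValueError(f"unknown task, pair, or split in {name!r}")
    return FileInfo(task, pair, split)


def expected_columns(info: FileInfo) -> List[str]:
    """Columns implied by the schemas on this page (an inference, not a spec)."""
    columns = ["idx", "words"]
    # Every table lists a lid column except the ner_msaea tables.
    if not (info.task == "ner" and info.pair == "msaea"):
        columns.append("lid")
    # POS, NER, and SA tables add one label column named after the task.
    if info.task != "lid":
        columns.append(info.task)
    return columns


info = parse_filename("lid_msaea_test.csv")
print(info, expected_columns(info))
```

Comparing `expected_columns` against the header of a downloaded file is a quick way to catch a mismatched or renamed file before training.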

Decide What Kind Of Analysis You Want To Do
Once you know what data is available, decide which task and model you want to build before writing any code. For example, to build a cross-lingual model for POS tagging, you would ideally train and validate on data covering three languages (Spanish, Hindi, and English) so that the model can share knowledge across them during training. POS annotations are available for the Spanish-English and Hindi-English pairs, so you would select files such as pos_spaeng_train, pos_hineng_train, pos_spaeng_validation, and pos_hineng_validation. When designing the model architecture, choose task-specific hyperparameters that work well together, and pick a feature representation suited to code-switched text, since both affect performance.
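As a rough sketch of the file selection step, the helper below lists the files that exist for a given task and split. The `AVAILABLE` mapping is transcribed from the tables on this page; the function name `files_for` and the `.csv` suffix on the returned names are assumptions made for illustration.

```python
from typing import List, Optional, Sequence

# Task -> language pairs, transcribed from the table list on this page.
AVAILABLE = {
    "lid": ["hineng", "msaea", "nepeng", "spaeng"],
    "ner": ["hineng", "msaea", "spaeng"],
    "pos": ["hineng", "spaeng"],
    "sa": ["spaeng"],
}
SPLITS = ("train", "validation", "test")


def files_for(task: str, split: str,
              pairs: Optional[Sequence[str]] = None) -> List[str]:
    """Return the file names for one task and split.

    If `pairs` is given, only pairs that actually have this task are kept.
    """
    if task not in AVAILABLE:
        raise ValueError(f"unknown task: {task!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    candidates = AVAILABLE[task]
    if pairs is not None:
        candidates = [p for p in candidates if p in pairs]
    return [f"{task}_{pair}_{split}.csv" for pair in candidates]


# Files for a cross-lingual POS tagger.
train_files = files_for("pos", "train")
validation_files = files_for("pos", "validation")
print(train_files)
print(validation_files)
```

Keeping the availability table explicit makes it obvious when a planned experiment, such as sentiment analysis on Hindi-English, has no data in this dataset.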

Run Appropriate Algorithms On The Data Provided In The Dataset
Once you understand the data and have chosen a task, train your model with whatever tools you prefer, and tune it on the validation split using metrics such as accuracy and F1 score. After tuning, confirm that the system works reliably by evaluating it on the unseen test split. Hyperparameter tuning can have a significant effect on results, and how much it matters depends on the algorithm you choose.
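For reference, the two metrics mentioned above can be computed over flat sequences of gold and predicted labels as follows. This is a minimal standard-library sketch; the labels "A" and "B" are placeholders rather than real LinCE tags, and the macro-averaged variant of F1 is one common choice, not necessarily the one used by the benchmark.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never matches.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Placeholder labels for illustration only.
gold_labels = ["A", "A", "B", "B"]
pred_labels = ["A", "B", "B", "B"]
print(accuracy(gold_labels, pred_labels), macro_f1(gold_labels, pred_labels))
```

Macro averaging weights rare labels as heavily as frequent ones, which is often useful for tagging tasks with skewed label distributions. Span-level F1 for NER requires grouping tokens into entities first and is not covered by this sketch.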

Research Ideas

  • Developing a sentiment analysis system for code-switched text, starting from the Spanish-English sentiment data.
  • Training a model to identify and classify named entities across multiple language pairs, such as recognizing proper nouns or locations regardless of language or coding scheme.
  • Developing an AI-powered cross-lingual translator that can translate code-switched text from one language to another with minimal errors.

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: lid_msaea_test.csv

Column name Description
words The words in the text. (String)

File: lid_msaea_validation.csv

Column name Description
words The words in the text. (String)

File: pos_spaeng_train.csv

Column name Description
words The words in the text. (String)
pos The part of speech tag for each word. (String)

File: lid_nepeng_validation.csv

Column name Description
words The words in the text. (String)

File: pos_hineng_test.csv

Column name Description
words The words in the text. (String)
pos The part of speech tag for each word. (String)

File: sa_spaeng_train.csv

Column name Description
words The words in the text. (String)
sa The sentiment label for the text. (String)

File: lid_spaeng_test.csv

Column name Description
words The words in the text. (String)

File: pos_hineng_validation.csv

Column name Description
words The words in the text. (String)
pos The part of speech tag for each word. (String)

File: lid_hineng_test.csv

Column name Description
words The words in the text. (String)

Acknowledgements

If you use this dataset in your research, please credit Huggingface Hub.

Tables

Lid Hineng Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_hineng_test
  • 174.77 KB
  • 1854 rows
  • 3 columns

CREATE TABLE lid_hineng_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Hineng Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_hineng_train
  • 583.31 KB
  • 4823 rows
  • 3 columns

CREATE TABLE lid_hineng_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Hineng Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_hineng_validation
  • 97.28 KB
  • 744 rows
  • 3 columns

CREATE TABLE lid_hineng_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Msaea Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_msaea_test
  • 208.83 KB
  • 1663 rows
  • 3 columns

CREATE TABLE lid_msaea_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Msaea Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_msaea_train
  • 1.19 MB
  • 8464 rows
  • 3 columns

CREATE TABLE lid_msaea_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Msaea Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_msaea_validation
  • 167.63 KB
  • 1116 rows
  • 3 columns

CREATE TABLE lid_msaea_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Nepeng Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_nepeng_test
  • 226.16 KB
  • 3228 rows
  • 3 columns

CREATE TABLE lid_nepeng_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Nepeng Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_nepeng_train
  • 787.83 KB
  • 8451 rows
  • 3 columns

CREATE TABLE lid_nepeng_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Nepeng Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_nepeng_validation
  • 130.58 KB
  • 1332 rows
  • 3 columns

CREATE TABLE lid_nepeng_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Spaeng Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_spaeng_test
  • 494.24 KB
  • 8289 rows
  • 3 columns

CREATE TABLE lid_spaeng_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Spaeng Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_spaeng_train
  • 1.54 MB
  • 21030 rows
  • 3 columns

CREATE TABLE lid_spaeng_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Lid Spaeng Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.lid_spaeng_validation
  • 250.78 KB
  • 3332 rows
  • 3 columns

CREATE TABLE lid_spaeng_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR
);

Ner Hineng Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.ner_hineng_test
  • 69.8 KB
  • 522 rows
  • 4 columns

CREATE TABLE ner_hineng_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "ner" VARCHAR
);

Ner Hineng Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.ner_hineng_train
  • 171.86 KB
  • 1243 rows
  • 4 columns

CREATE TABLE ner_hineng_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "ner" VARCHAR
);

Ner Hineng Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.ner_hineng_validation
  • 47.9 KB
  • 314 rows
  • 4 columns

CREATE TABLE ner_hineng_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "ner" VARCHAR
);

Ner Msaea Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.ner_msaea_test
  • 134.93 KB
  • 1110 rows
  • 3 columns

CREATE TABLE ner_msaea_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "ner" VARCHAR
);

Ner Msaea Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.ner_msaea_train
  • 1.39 MB
  • 10103 rows
  • 3 columns

CREATE TABLE ner_msaea_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "ner" VARCHAR
);

Ner Msaea Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.ner_msaea_validation
  • 164.19 KB
  • 1122 rows
  • 3 columns

CREATE TABLE ner_msaea_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "ner" VARCHAR
);

Ner Spaeng Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.ner_spaeng_test
  • 1.93 MB
  • 23527 rows
  • 4 columns

CREATE TABLE ner_spaeng_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "ner" VARCHAR
);

Ner Spaeng Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.ner_spaeng_train
  • 2.87 MB
  • 33611 rows
  • 4 columns

CREATE TABLE ner_spaeng_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "ner" VARCHAR
);

Ner Spaeng Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.ner_spaeng_validation
  • 883.35 KB
  • 10085 rows
  • 4 columns

CREATE TABLE ner_spaeng_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "ner" VARCHAR
);

Pos Hineng Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.pos_hineng_test
  • 46.24 KB
  • 299 rows
  • 4 columns

CREATE TABLE pos_hineng_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "pos" VARCHAR
);

Pos Hineng Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.pos_hineng_train
  • 190.4 KB
  • 1030 rows
  • 4 columns

CREATE TABLE pos_hineng_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "pos" VARCHAR
);

Pos Hineng Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.pos_hineng_validation
  • 34.55 KB
  • 160 rows
  • 4 columns

CREATE TABLE pos_hineng_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "pos" VARCHAR
);

Pos Spaeng Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.pos_spaeng_test
  • 394.96 KB
  • 10720 rows
  • 4 columns

CREATE TABLE pos_spaeng_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "pos" VARCHAR
);

Pos Spaeng Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.pos_spaeng_train
  • 1.35 MB
  • 27893 rows
  • 4 columns

CREATE TABLE pos_spaeng_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "pos" VARCHAR
);

Pos Spaeng Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.pos_spaeng_validation
  • 225.66 KB
  • 4298 rows
  • 4 columns

CREATE TABLE pos_spaeng_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "pos" VARCHAR
);

Sa Spaeng Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.sa_spaeng_test
  • 499.63 KB
  • 4736 rows
  • 4 columns

CREATE TABLE sa_spaeng_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "sa" VARCHAR
);

Sa Spaeng Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.sa_spaeng_train
  • 1.26 MB
  • 12194 rows
  • 4 columns

CREATE TABLE sa_spaeng_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "sa" VARCHAR
);

Sa Spaeng Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.sa_spaeng_validation
  • 204.42 KB
  • 1859 rows
  • 4 columns

CREATE TABLE sa_spaeng_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "sa" VARCHAR
);
