Baselight

LinCE (Linguistic Code-switching Evaluation)

Data for training and evaluating NLP systems on code-switching tasks

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset

By Huggingface Hub [source]


About this dataset

The LinCE dataset is a collection of code-switched text for training and evaluating NLP systems. It covers four language pairs, Spanish-English, Hindi-English, Nepali-English and Modern Standard Arabic-Egyptian Arabic (MSAEA), with annotations for language identification (LID), part-of-speech (POS) tagging, named-entity recognition (NER) and sentiment analysis (SA). You can train models to predict labels such as POS or NER tags for each language pair, or build cross-lingual models that span several pairs.

How to use the dataset

To get started with multilingual natural language processing (NLP) on the LinCE dataset, work through the steps below. The dataset provides training data for language identification (LID), part-of-speech (POS) tagging, named-entity recognition (NER) and sentiment analysis (SA) across four code-switched language pairs.

Understand what is included in this dataset
This dataset contains code-switching data for four language pairs: Spanish-English, Hindi-English, Nepali-English and Modern Standard Arabic-Egyptian Arabic (MSAEA). Each file is named for its task, language pair and split; for example, lid_msaea_test.csv holds the test split for language identification (LID) on MSAEA. Depending on the task, a file contains the words together with language-identification, part-of-speech or sentiment labels; the Columns and Tables sections list the columns of each file. A brief summary of each file's contents is shown when you open the dataset on Kaggle, and you can inspect a file yourself with functions such as head() or describe(), depending on the software you use.
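
As a dependency-free illustration of that inspection step, the sketch below reads a CSV and reports its column names, first rows and row count, roughly what head() gives in a data-frame library. The helper name preview_csv and the sample rows are invented for illustration; the column names follow the pos_spaeng_train table definition given in this document, and the cell contents of the real files may be formatted differently.

```python
import csv
import io


def preview_csv(handle, n=5):
    """Return (column_names, first_n_rows, total_rows) for an open CSV file."""
    reader = csv.DictReader(handle)
    rows = list(reader)  # acceptable for files of a few MB; stream for larger ones
    return list(reader.fieldnames or []), rows[:n], len(rows)


# Invented stand-in for a file such as pos_spaeng_train.csv (hypothetical values).
SAMPLE = io.StringIO(
    "idx,words,lid,pos\n"
    "0,hola,lang2,INTJ\n"
    "1,friend,lang1,NOUN\n"
    "2,gracias,lang2,INTJ\n"
)

columns, head, total = preview_csv(SAMPLE, n=2)
print(columns)  # column names taken from the header row
print(head)     # first two rows as dictionaries
print(total)    # number of data rows
```

To run it against a downloaded file, open the file with open(path, newline="", encoding="utf-8") and pass the handle to preview_csv.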

Decide what kind of analysis you want to do
Once you are familiar with the data provided, decide which kind of model or analysis you want before writing any code. For example, to build a cross-lingual model for POS tagging, it is ideal to have training and validation sets covering three different languages so that the model can share knowledge across them during training; files such as pos_spaeng_train and pos_hineng_validation are the relevant ones. When designing the model architecture, make sure the task-specific hyperparameters work well together, and choose a feature representation suited to the task, since both affect performance.
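
One way to pick out the files for a given task is to rely on the task_pair_split naming pattern seen in names such as pos_spaeng_train.csv. The sketch below assumes every file name follows that three-part pattern; the names LinceFile, parse_name and select_files are hypothetical helpers, and the file list is copied from the Columns section of this page.

```python
from typing import NamedTuple


class LinceFile(NamedTuple):
    task: str   # e.g. "lid", "pos", "sa"
    pair: str   # e.g. "spaeng", "hineng"
    split: str  # "train", "validation" or "test"


def parse_name(filename):
    """Split 'pos_spaeng_train.csv' into its task, language pair and split."""
    stem = filename.rsplit(".", 1)[0]
    task, pair, split = stem.split("_")  # ValueError if the pattern does not match
    return LinceFile(task, pair, split)


def select_files(filenames, task, split):
    """Return the sorted file names that match the given task and split."""
    chosen = []
    for name in filenames:
        info = parse_name(name)
        if info.task == task and info.split == split:
            chosen.append(name)
    return sorted(chosen)


FILES = [
    "lid_msaea_test.csv", "lid_msaea_validation.csv", "pos_spaeng_train.csv",
    "lid_nepeng_validation.csv", "pos_hineng_test.csv", "sa_spaeng_train.csv",
    "lid_spaeng_test.csv", "pos_hineng_validation.csv", "lid_hineng_test.csv",
]

print(select_files(FILES, task="pos", split="validation"))
print(select_files(FILES, task="lid", split="test"))
```

Selecting by task and split keeps the language pair free, which is what a cross-lingual setup needs: the same call returns one file per available pair.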

Run appropriate algorithms on the data provided in the dataset
With the data and the task understood, you can start running suitable algorithms with whatever tools you prefer, tuning your models against metrics such as accuracy and F1 score. Once a model is tuned, confirm that it works reliably by testing it on the unseen test set and checking that it gives the desired results. Hyperparameter tuning plays a significant role during optimization, and how much it matters depends on the algorithm chosen.
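
The two metrics named above can be computed without any library, which is handy for checking a model's output by hand. This is a minimal sketch: macro-averaged F1 is one common choice among several averaging schemes and is an assumption here, and the gold and predicted tags are invented examples rather than data from the dataset.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but wrong
            fn[g] += 1  # g was missed
    scores = []
    for label in set(gold) | set(pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented example tags for illustration only.
gold = ["NOUN", "VERB", "NOUN", "ADJ"]
pred = ["NOUN", "NOUN", "NOUN", "ADJ"]

print(accuracy(gold, pred))
print(round(macro_f1(gold, pred), 3))
```

Accuracy alone can look good when one label dominates; the macro average gives each label equal weight, so a label that is never predicted correctly pulls the score down.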

Research Ideas

  • Developing a multilingual sentiment analysis system that can analyze sentiment in any of the covered language pairs.
  • Training a model to identify and classify named entities across multiple languages, such as recognizing proper nouns or locations regardless of language or coding scheme.
  • Developing an AI-powered cross-lingual translator that can translate text from one language to another with minimal errors.

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: lid_msaea_test.csv

Column name   Description
words         The words in the text. (String)

File: lid_msaea_validation.csv

Column name   Description
words         The words in the text. (String)

File: pos_spaeng_train.csv

Column name   Description
words         The words in the text. (String)
pos           The part of speech tag for each word. (String)

File: lid_nepeng_validation.csv

Column name   Description
words         The words in the text. (String)

File: pos_hineng_test.csv

Column name   Description
words         The words in the text. (String)
pos           The part of speech tag for each word. (String)

File: sa_spaeng_train.csv

Column name   Description
words         The words in the text. (String)
sa            The sentiment analysis labels for each word in the text. (String)

File: lid_spaeng_test.csv

Column name   Description
words         The words in the text. (String)

File: pos_hineng_validation.csv

Column name   Description
words         The words in the text. (String)
pos           The part of speech tag for each word. (String)

File: lid_hineng_test.csv

Column name   Description
words         The words in the text. (String)

Acknowledgements

If you use this dataset in your research, please credit the original authors and Huggingface Hub.

Tables

Pos Spaeng Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.pos_spaeng_train
  • 1.35 MB
  • 27893 rows
  • 4 columns

CREATE TABLE pos_spaeng_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "pos" VARCHAR
);

Pos Spaeng Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.pos_spaeng_validation
  • 225.66 KB
  • 4298 rows
  • 4 columns

CREATE TABLE pos_spaeng_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "pos" VARCHAR
);

Sa Spaeng Test

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.sa_spaeng_test
  • 499.63 KB
  • 4736 rows
  • 4 columns

CREATE TABLE sa_spaeng_test (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "sa" VARCHAR
);

Sa Spaeng Train

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.sa_spaeng_train
  • 1.26 MB
  • 12194 rows
  • 4 columns

CREATE TABLE sa_spaeng_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "sa" VARCHAR
);

Sa Spaeng Validation

@kaggle.thedevastator_unlock_universal_language_with_the_lince_dataset.sa_spaeng_validation
  • 204.42 KB
  • 1859 rows
  • 4 columns

CREATE TABLE sa_spaeng_validation (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "sa" VARCHAR
);
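
The table definitions above can be tried out locally before querying the hosted tables. The sketch below recreates the sa_spaeng_train schema in an in-memory SQLite database and counts rows per sentiment label. SQLite is only a convenient stand-in, since the engine behind the hosted tables is not stated here; the inserted rows, the label values and the helper name sentiment_counts are all invented for illustration.

```python
import sqlite3

# Schema copied from the sa_spaeng_train definition on this page.
SCHEMA = """
CREATE TABLE sa_spaeng_train (
  "idx" BIGINT,
  "words" VARCHAR,
  "lid" VARCHAR,
  "sa" VARCHAR
);
"""

# Hypothetical rows; the real cell formats and label names may differ.
ROWS = [
    (0, "me encanta this song", "lang2 lang2 lang1 lang1", "positive"),
    (1, "que mal day", "lang2 lang2 lang1", "negative"),
    (2, "great fiesta", "lang1 lang2", "positive"),
]


def sentiment_counts(conn):
    """Return (label, row_count) pairs ordered by label."""
    query = (
        'SELECT "sa", COUNT(*) FROM sa_spaeng_train '
        'GROUP BY "sa" ORDER BY "sa"'
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO sa_spaeng_train VALUES (?, ?, ?, ?)", ROWS)
conn.commit()

print(sentiment_counts(conn))
```

A label count like this is a quick check for class imbalance before training a sentiment model on the real table.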
