Baselight

Tamil NLP

Datasets for Natural Language Processing in Tamil

@kaggle.sudalairajkumar_tamil_nlp


About this Dataset

Context

Indic NLP - Natural Language Processing for Indian Languages.

This dataset is a step towards the same for the Tamil language. Thanks to Malaikannan for the initiative and to Selva for getting the data from websites. The idea is to collect more datasets related to Tamil NLP in a single place.

Content

The dataset has the following files.

Tamil News Classification

This dataset has 14521 rows for training and 3631 rows for testing. It covers 6 news categories: "tamilnadu", "india", "cinema", "sports", "politics" and "world". The data is obtained from this link.

  • tamil_news_train.csv - Train dataset for Tamil news classification.
  • tamil_news_test.csv - Test dataset for Tamil news classification.
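A first look at the news data usually starts with the label distribution. The sketch below is one minimal way to count rows per category with the Python standard library. It assumes the CSV header uses the column names from the table schema listed for this dataset (`category` in particular), which may differ in the raw files. The three sample rows are placeholders for illustration, not rows from the dataset, and `count_categories` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Placeholder stand-in for tamil_news_train.csv (the real file has 14521 rows).
# Column names are assumed to match the table schema for this dataset.
SAMPLE_CSV = (
    "newsinenglish,newsintamil,category,categoryintamil\n"
    "placeholder headline 1,செய்தி ஒன்று,sports,விளையாட்டு\n"
    "placeholder headline 2,செய்தி இரண்டு,cinema,சினிமா\n"
    "placeholder headline 3,செய்தி மூன்று,sports,விளையாட்டு\n"
)


def count_categories(csv_file):
    """Return a Counter mapping each category label to its row count."""
    reader = csv.DictReader(csv_file)
    return Counter(row["category"] for row in reader)


counts = count_categories(io.StringIO(SAMPLE_CSV))
for label, n in counts.most_common():
    print(label, n)
```

To run this against the real file, pass `open("tamil_news_train.csv", encoding="utf-8", newline="")` instead of the in-memory sample; the explicit UTF-8 encoding matters because the text columns contain Tamil script.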

Tamil Movie Review Dataset

This dataset has 480 training samples and 121 testing samples. It contains the review text in Tamil and ratings from 1 to 5. The data is obtained from this link.

  • tamil_movie_reviews_train.csv - Train dataset for Tamil movie reviews.
  • tamil_movie_reviews_test.csv - Test dataset for Tamil movie reviews.

Thirukkural Dataset

According to Wikipedia, the Tirukkural, or the Kural for short, is a classic Tamil text consisting of 1,330 couplets, or kurals, dealing with the everyday virtues of an individual. It is one of the two oldest works now extant in Tamil literature.

I have split the data into train and test sets. The kural and/or the explanations can be used to predict which of the three parts a couplet belongs to: aram (virtue), porul (polity) or inbam (love). The dataset is obtained from this link.

  • tamil_thirukkural_train - train dataset having 1064 rows
  • tamil_thirukkural_test - test dataset having 266 rows
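The row counts quoted for the three datasets can be cross-checked with a little arithmetic. The sketch below uses only the numbers stated in this description; the dictionary keys and the helper name `split_summary` are made up for illustration. Running it suggests that each split holds out roughly 20% of the rows for testing, and that the two Thirukkural files together account for all 1,330 couplets.

```python
# (train rows, test rows) as quoted in the dataset description.
SPLITS = {
    "tamil_news": (14521, 3631),
    "tamil_movie_reviews": (480, 121),
    "tamil_thirukkural": (1064, 266),
}


def split_summary(train_rows, test_rows):
    """Return (total rows, fraction of rows held out for testing)."""
    total = train_rows + test_rows
    return total, test_rows / total


for name, (train_rows, test_rows) in SPLITS.items():
    total, test_fraction = split_summary(train_rows, test_rows)
    print(f"{name}: total={total}, test fraction={test_fraction:.3f}")
```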

More datasets will be added in future versions.

Acknowledgements

My sincere thanks to:

  • Malaikannan for starting this initiative
  • Selvakumar for getting the data
  • Vijay Anand for the Thirukkural data

Inspiration

Some questions that can be answered with this data:

  1. Can we do text classification for Tamil and get accuracies comparable to those for other languages?
  2. How well do language models perform for Tamil?

And many more interesting questions remain to be answered.
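As a starting point for the text classification question, a simple baseline helps put later results in context. The sketch below is one possible baseline, not a method described by the dataset author: a multinomial Naive Bayes classifier over character trigrams with add-one smoothing, written with the standard library only. The class name `NaiveBayes`, the helper `char_ngrams`, and the four short training phrases are all illustrative assumptions; real experiments would train on the text columns of the CSV files. Character n-grams are used here because they need no Tamil tokenizer, and the text is normalized to NFC so that visually identical strings share the same code points.

```python
import math
import unicodedata
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping n-grams of Unicode code points from text."""
    # Collapse whitespace, then normalize so equivalent sequences compare equal.
    text = unicodedata.normalize("NFC", " ".join(text.split()))
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayes:
    """Multinomial Naive Bayes over character n-grams, add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.class_docs = Counter()               # label -> training documents
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.class_docs[label] += 1
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_docs.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(doc_count / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Illustrative phrases only; these are not rows from the dataset.
TRAIN = [
    ("விளையாட்டு போட்டி", "sports"),
    ("கிரிக்கெட் போட்டி", "sports"),
    ("சினிமா படம்", "cinema"),
    ("புதிய படம்", "cinema"),
]
texts, labels = zip(*TRAIN)
model = NaiveBayes(n=3).fit(texts, labels)
print(model.predict("போட்டி"))
print(model.predict("படம்"))
```

With the real data, accuracy would be measured by fitting on the train file and comparing `predict` output with the labels in the test file.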

Check out this link to find similar and dissimilar words for Tamil.

Tables

Tamil Movie Reviews Test

@kaggle.sudalairajkumar_tamil_nlp.tamil_movie_reviews_test
  • 374.37 KB
  • 121 rows
  • 3 columns

CREATE TABLE tamil_movie_reviews_test (
  "reviewid" BIGINT,
  "reviewintamil" VARCHAR,
  "rating" DOUBLE
);

Tamil Movie Reviews Train

@kaggle.sudalairajkumar_tamil_nlp.tamil_movie_reviews_train
  • 1.41 MB
  • 480 rows
  • 3 columns

CREATE TABLE tamil_movie_reviews_train (
  "reviewid" BIGINT,
  "reviewintamil" VARCHAR,
  "rating" DOUBLE
);
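The schema above can be tried out locally before querying the hosted table. The sketch below loads the same three columns into an in-memory SQLite database and groups reviews by rating. SQLite is used here only as a convenient stand-in and is an assumption, not the engine behind this page; the inserted rows are placeholders, not data from the dataset.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for the hosted table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tamil_movie_reviews_train (
      "reviewid" BIGINT,
      "reviewintamil" VARCHAR,
      "rating" DOUBLE
    )
    """
)

# Placeholder rows for illustration only.
placeholder_rows = [
    (1, "placeholder review 1", 4.0),
    (2, "placeholder review 2", 2.5),
    (3, "placeholder review 3", 4.0),
]
conn.executemany(
    "INSERT INTO tamil_movie_reviews_train VALUES (?, ?, ?)",
    placeholder_rows,
)

# Count how many reviews carry each rating value.
rating_counts = conn.execute(
    """
    SELECT "rating", COUNT(*) AS n
    FROM tamil_movie_reviews_train
    GROUP BY "rating"
    ORDER BY "rating"
    """
).fetchall()
print(rating_counts)
conn.close()
```

The same `GROUP BY` query should carry over to other SQL engines with little change, since it uses only standard syntax.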

Tamil News Test

@kaggle.sudalairajkumar_tamil_nlp.tamil_news_test
  • 399.84 KB
  • 3631 rows
  • 4 columns

CREATE TABLE tamil_news_test (
  "newsinenglish" VARCHAR,
  "newsintamil" VARCHAR,
  "category" VARCHAR,
  "categoryintamil" VARCHAR
);

Tamil News Train

@kaggle.sudalairajkumar_tamil_nlp.tamil_news_train
  • 1.38 MB
  • 14521 rows
  • 4 columns

CREATE TABLE tamil_news_train (
  "newsinenglish" VARCHAR,
  "newsintamil" VARCHAR,
  "category" VARCHAR,
  "categoryintamil" VARCHAR
);

Tamil Thirukkural Test

@kaggle.sudalairajkumar_tamil_nlp.tamil_thirukkural_test
  • 155.07 KB
  • 266 rows
  • 10 columns

CREATE TABLE tamil_thirukkural_test (
  "number" BIGINT,
  "kural" VARCHAR,
  "explanation" VARCHAR,
  "adikaram_name" VARCHAR,
  "iyal_name" VARCHAR,
  "paul_name" VARCHAR,
  "paul_translation" VARCHAR,
  "mk" VARCHAR,
  "mv" VARCHAR,
  "sp" VARCHAR
);

Tamil Thirukkural Train

@kaggle.sudalairajkumar_tamil_nlp.tamil_thirukkural_train
  • 565.1 KB
  • 1064 rows
  • 10 columns

CREATE TABLE tamil_thirukkural_train (
  "number" BIGINT,
  "kural" VARCHAR,
  "explanation" VARCHAR,
  "adikaram_name" VARCHAR,
  "iyal_name" VARCHAR,
  "paul_name" VARCHAR,
  "paul_translation" VARCHAR,
  "mk" VARCHAR,
  "mv" VARCHAR,
  "sp" VARCHAR
);
