Rotten Tomatoes Movie Reviews by Kaggle | Ecommerce and Consumer Trends

About this Dataset

Rotten Tomatoes Movie Reviews

Predicting Movie Review Sentiment

Source

Huggingface Hub: link

About this dataset

The Rotten Tomatoes Movie Review Sentiment Analysis Dataset contains 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. Bo Pang and Lillian Lee first used this data in their paper "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales", published in Proceedings of the ACL in 2005. The data fields are identical across all splits. The text column contains the review itself, and the label column indicates whether the review is positive or negative.
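As a quick sanity check of the two-column layout described above, the following sketch counts how many rows carry each label. The sample rows are invented for illustration, the helper name `label_counts` is hypothetical, and the integer encoding (1 for positive, 0 for negative) is an assumption based on the BIGINT label type in the table schemas on this page.

```python
import csv
import io
from collections import Counter

# Invented sample rows in the same shape as train.csv (text, label).
# Assumption: label 1 = positive, 0 = negative.
SAMPLE_CSV = """text,label
"a warm , funny and surprising film .",1
"slow , clumsy and far too long .",0
"the cast is a joy to watch .",1
"a script with nothing to say .",0
"""


def label_counts(csv_file):
    """Return a Counter mapping each integer label to its number of rows."""
    reader = csv.DictReader(csv_file)
    return Counter(int(row["label"]) for row in reader)


counts = label_counts(io.StringIO(SAMPLE_CSV))
print(counts)
```

To run the same check on a downloaded split, pass an open file instead of the in-memory sample, for example `open("train.csv", newline="", encoding="utf-8")`.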

How to use the dataset

The Performance of Sentiment Analysis
In this post we take a look at the performance of different sentiment analysis systems on a movie review dataset from Rotten Tomatoes. This data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales", Proceedings of the ACL, 2005. The data fields are the same across all splits.

We will be using three different libraries for this post: 1) Scikit-learn, 2) NLTK, and 3) TextBlob. We will also compare the results of these systems with those from human raters. Each library takes different amounts of time and resources to run, so we will also be considering these factors in our comparisons.

NLTK

NLTK is a popular library for working with text data in Python. It includes many useful features for pre-processing text data, including tokenization, lemmatization, and part-of-speech tagging. NLTK also includes a number of helpful classes for building and evaluating predictive models (such as decision trees and maximum entropy classifiers).
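A minimal sketch of how an NLTK classifier might be trained on review sentences is shown here. It uses a Naive Bayes classifier rather than the decision tree or maximum entropy models mentioned above, simply because it needs no extra configuration. The training sentences are invented, and the 1/0 label encoding is an assumption. `wordpunct_tokenize` is used because it is regex-based and does not require downloading any NLTK corpora.

```python
import nltk
from nltk.tokenize import wordpunct_tokenize

# Invented training sentences; 1 = positive, 0 = negative (assumed encoding).
TRAIN = [
    ("a moving and beautifully acted film", 1),
    ("funny , warm and beautifully shot", 1),
    ("a dull and tedious mess", 0),
    ("tedious plot and dull acting", 0),
]


def bag_of_words(text):
    """Map a sentence to the {token: True} feature dict NLTK classifiers expect."""
    return {token: True for token in wordpunct_tokenize(text.lower())}


# NLTK classifiers train on a list of (feature_dict, label) pairs.
train_set = [(bag_of_words(text), label) for text, label in TRAIN]
classifier = nltk.NaiveBayesClassifier.train(train_set)

prediction = classifier.classify(bag_of_words("a dull and tedious film"))
print(prediction)
```

On the real data, the same `bag_of_words` features could be built from the text column of train.csv and scored on the held-out splits with `nltk.classify.accuracy`.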

TextBlob

TextBlob is a relatively new library that attempts to provide an easy-to-use interface for common text processing tasks (such as part-of-speech tagging, sentence parsing, and spelling correction). TextBlob is built on top of NLTK and Pattern, another Python library for web mining.

Scikit-learn

Scikit-learn is a popular machine learning library for Python that provides efficient implementations of common algorithms such as support vector machines, random forests, and k-nearest neighbors classifiers. It also includes helpful utilities for pre-processing data and assessing model performance.
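One way the pre-processing and evaluation utilities might fit together is sketched below: a TF-IDF vectorizer feeding a linear support vector machine, scored with accuracy. The sentences and labels are invented placeholders, and the choice of TF-IDF features with `LinearSVC` is an assumption for illustration, not a setup prescribed by this dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented placeholder data; 1 = positive, 0 = negative (assumed encoding).
train_texts = [
    "a moving and beautifully acted film",
    "funny , warm and full of charm",
    "a dull and tedious mess",
    "lifeless plot and wooden acting",
]
train_labels = [1, 1, 0, 0]
test_texts = ["a funny and moving film", "a tedious , lifeless mess"]
test_labels = [1, 0]

# The pipeline vectorizes raw text and then fits a linear SVM on the features.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
accuracy = accuracy_score(test_labels, predictions)
print(predictions, accuracy)
```

With the real files, `train_texts` and `train_labels` would come from train.csv, and the validation split could be used to compare models before touching the test split.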

Research Ideas

Identify positive and negative sentiment in movie reviews

Categorize movie reviews by rating

Cluster movie reviews to group together similar reviews
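The clustering idea could be prototyped along the following lines. This is only a sketch: the reviews are invented, and the use of TF-IDF vectors with k-means and two clusters is an assumption chosen for brevity. Cluster ids are arbitrary and do not correspond to sentiment labels.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented review snippets used only to show the mechanics.
reviews = [
    "a gripping thriller with a clever twist",
    "a tense thriller that keeps its twist hidden",
    "a sweet romantic comedy with real charm",
    "a charming comedy about an awkward romance",
    "the thriller drags before the final twist",
    "romance and comedy in equal measure",
]

# Turn each review into a sparse TF-IDF vector, ignoring common English words.
vectors = TfidfVectorizer(stop_words="english").fit_transform(reviews)

# Group the vectors into two clusters; random_state fixes the initialization.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(vectors)

for review, cluster_id in zip(reviews, cluster_ids):
    print(cluster_id, review)
```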

Acknowledgements

Huggingface Hub: link

License

> License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
> No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: validation.csv

Column name	Description
text	The text of the review. (String)
label	The sentiment label of the review: 1 for positive, 0 for negative. (Integer)

File: train.csv

Column name	Description
text	The text of the review. (String)
label	The sentiment label of the review: 1 for positive, 0 for negative. (Integer)

File: test.csv

Column name	Description
text	The text of the review. (String)
label	The sentiment label of the review: 1 for positive, 0 for negative. (Integer)

Tables

Test

@kaggle.thedevastator_movie_review_data_set_from_rotten_tomatoes.test

88.3 KB
1066 rows
2 columns


CREATE TABLE test (
  "text" VARCHAR,
  "label" BIGINT
);

Train

@kaggle.thedevastator_movie_review_data_set_from_rotten_tomatoes.train

675.94 KB
8530 rows
2 columns


CREATE TABLE train (
  "text" VARCHAR,
  "label" BIGINT
);

Validation

@kaggle.thedevastator_movie_review_data_set_from_rotten_tomatoes.validation

86.61 KB
1066 rows
2 columns


CREATE TABLE validation (
  "text" VARCHAR,
  "label" BIGINT
);
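The three splits listed above add up to 8,530 + 1,066 + 1,066 = 10,662 rows, which matches the 5,331 positive and 5,331 negative sentences in the dataset description. To experiment with these schemas locally, a sketch like the following recreates the train table in an in-memory SQLite database and counts rows per label. The inserted rows are invented, and SQLite stands in here for whatever engine actually hosts the tables.

```python
import sqlite3

# In-memory database; SQLite accepts the VARCHAR and BIGINT type names as-is.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE train ("text" VARCHAR, "label" BIGINT)')

# Invented rows; 1 = positive, 0 = negative (assumed encoding).
rows = [
    ("a warm and funny film", 1),
    ("a slow , clumsy bore", 0),
    ("nothing here works", 0),
]
conn.executemany("INSERT INTO train VALUES (?, ?)", rows)

# Count how many rows carry each label.
query = 'SELECT "label", COUNT(*) FROM train GROUP BY "label"'
counts_by_label = dict(conn.execute(query).fetchall())
conn.close()

print(counts_by_label)
```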