Context
Large Movie Review Dataset v1.0
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides a set of 25,000 highly polar movie reviews for training and 25,000 for testing, along with additional unlabeled data. Both raw text and an already-processed bag-of-words format are provided.
In the entire collection, no more than 30 reviews are allowed for any given movie, because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain a disjoint set of movies, so no significant performance gain can be obtained by memorizing movie-unique terms and their association with observed labels. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10; reviews with more neutral ratings are thus not included in the train/test sets. In the unsupervised set, reviews of any rating are included, and there are an even number of reviews > 5 and <= 5.
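As an illustrative sketch (not part of the dataset's own tooling), the labeling rule above can be written as a small hypothetical helper:

```python
def label_from_rating(rating: int):
    """Map a 1-10 IMDB star rating to the dataset's sentiment label.

    Illustrative only: ratings <= 4 are negative, ratings >= 7 are
    positive, and neutral ratings (5-6) are excluded from the labeled
    train/test sets.
    """
    if rating <= 4:
        return "neg"
    if rating >= 7:
        return "pos"
    return None  # neutral reviews appear only in the unsupervised set
```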
Reference:
http://ai.stanford.edu/~amaas/data/sentiment/
NOTE
A starter kernel is available here:
https://www.kaggle.com/atulanandjha/bert-testing-on-imdb-dataset-starter-kernel
A kernel exploring the dataset collection:
Content
Now let's understand the task at hand: given a movie review, predict whether it's positive or negative.
The dataset we use consists of 50,000 IMDB reviews (25,000 for training and 25,000 for testing), loaded from the PyTorch-NLP library.
Each review is tagged pos or neg.
There are 50% positive reviews and 50% negative reviews in both the train and test sets.
Columns:
text : the review text written by users.
sentiment : the positive or negative tag on the review (binary: pos or neg).
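A minimal loading sketch, assuming the torchnlp package (PyTorch-NLP) is installed; imdb_dataset returns each row as a dictionary with text and sentiment keys:

```python
from torchnlp.datasets import imdb_dataset

# Download (if needed) and load both splits; each split is a list-like
# dataset of dicts with 'text' and 'sentiment' keys.
train, test = imdb_dataset(train=True, test=True)

print(len(train), len(test))   # 25000 25000
example = train[0]
print(example['sentiment'])    # 'pos' or 'neg'
print(example['text'][:200])   # first 200 characters of the review
```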
Acknowledgements
When using this dataset, please cite the following ACL paper:
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
title = {Learning Word Vectors for Sentiment Analysis},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
year = {2011},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {142--150},
url = {http://www.aclweb.org/anthology/P11-1015}
}
Links to the reference dataset:
https://pytorchnlp.readthedocs.io/en/latest/_modules/torchnlp/datasets/imdb.html
https://www.samyzaf.com/ML/imdb/imdb.html
Inspiration
BERT and other Transformer-architecture models have drawn a great deal of attention recently, thanks to the breakthrough of transfer learning in NLP. So let's use this simple yet effective dataset to test these models and compare our results with those reported in the literature. I also invite fellow researchers to try out their state-of-the-art algorithms on this dataset.
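As a starting point, here is a minimal fine-tuning sketch using the Hugging Face transformers library (an assumption; the starter kernel linked above may use a different setup). The model name bert-base-uncased and the hyperparameters are illustrative choices, not tuned values:

```python
import torch
from torch.utils.data import Dataset
from torchnlp.datasets import imdb_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL = 'bert-base-uncased'  # illustrative model choice

class ImdbDataset(Dataset):
    """Wrap the torchnlp rows as tokenized tensors for the Trainer."""
    def __init__(self, rows, tokenizer):
        texts = [row['text'] for row in rows]
        self.labels = [1 if row['sentiment'] == 'pos' else 0 for row in rows]
        self.encodings = tokenizer(texts, truncation=True, max_length=256,
                                   padding='max_length')

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

# Load the 25K/25K splits described above.
train_rows, test_rows = imdb_dataset(train=True, test=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

args = TrainingArguments(output_dir='bert-imdb',       # illustrative
                         num_train_epochs=1,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=ImdbDataset(train_rows, tokenizer),
                  eval_dataset=ImdbDataset(test_rows, tokenizer))
trainer.train()
print(trainer.evaluate())  # eval loss on the held-out 25K test reviews
```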