This dataset is a processed version of the Sentiment140 corpus: 1.6 million tweets, each with a binary sentiment label. The original data has been cleaned and tokenized for use in natural language processing (NLP) and machine learning tasks such as sentiment analysis and text classification.
The dataset includes the full processed corpus (train-processed.csv) and a smaller sample of 10,000 tweets (train-processed-sample.csv) for quick experimentation and model prototyping.
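Because the column names are not listed here, a reasonable first step is to open the sample file and inspect its header. The following is a minimal sketch using only the Python standard library. It assumes the files are comma-separated, UTF-8 encoded, and begin with a header row. The helper name read_rows is hypothetical and is not part of the dataset.

```python
import csv
import os
from itertools import islice


def read_rows(path, limit=None):
    """Return up to `limit` rows of a CSV file as dicts keyed by the header row.

    Assumes a comma-separated, UTF-8 file with a header line (an assumption,
    not something the dataset description confirms).
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # islice with limit=None reads every row.
        return list(islice(reader, limit))


# Usage: peek at the sample file if it is present in the working directory.
sample_path = "train-processed-sample.csv"
if os.path.exists(sample_path):
    rows = read_rows(sample_path, limit=5)
    if rows:
        print("columns:", list(rows[0].keys()))
    for row in rows:
        print(row)
```

Starting with the 10,000-tweet sample keeps this inspection fast. The same function works on train-processed.csv, where passing a limit avoids reading all 1.6 million rows into memory.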
Key Features:
1.6 million labeled tweets
Binary sentiment classification (0 for negative, 1 for positive; the original Sentiment140 files encode positive as 4)
Preprocessed and tokenized text
Balanced class distribution
Suitable for various NLP tasks and model architectures
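The label encoding and class balance listed above can be checked directly before training. The sketch below counts the values in the label column and tests whether the classes are roughly equal in size. The column name "target", the 1% tolerance, and the helper names label_counts and is_balanced are illustrative assumptions. Substitute the actual label column from the file header.

```python
import csv
import os
from collections import Counter


def label_counts(path, label_column):
    """Count how often each label value appears in `label_column` of a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        return Counter(row[label_column] for row in csv.DictReader(handle))


def is_balanced(counts, tolerance=0.01):
    """Return True if every class size is within `tolerance` (as a fraction of
    all rows) of an even split across the classes."""
    total = sum(counts.values())
    if total == 0:
        return False
    expected = total / len(counts)
    return all(abs(n - expected) <= tolerance * total for n in counts.values())


# Usage: "target" is an assumed column name; check the header of your copy.
sample_path = "train-processed-sample.csv"
if os.path.exists(sample_path):
    counts = label_counts(sample_path, "target")
    print(counts, "balanced:", is_balanced(counts))
```

Label values are read as strings ("0" and "1"), so convert them to integers before passing them to a model. A random sample may deviate slightly from an exact even split, which is why the check uses a tolerance rather than strict equality.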
Citation
If you use this dataset in your research or project, please cite the original Sentiment140 dataset:
Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.