
Processed Twitter Sentiment Dataset | Added Tokens

Kaggle

@kaggle.halemogpa_processed

Tokenized and Sentiment-Labeled Tweets for NLP and Machine Learning

Dataset Description

This dataset is a processed version of the Sentiment140 corpus, containing 1.6 million tweets with binary sentiment labels. The original data has been cleaned, tokenized, and prepared for natural language processing (NLP) and machine learning tasks. It is intended for sentiment analysis, text classification, and other NLP applications.
The dataset includes the full processed corpus (train-processed.csv) and a smaller sample of 10,000 tweets (train-processed-sample.csv) for quick experimentation and model prototyping.
Key Features:

1.6 million labeled tweets
Binary sentiment classification (0 for negative, 1 for positive)
Preprocessed and tokenized text
Balanced class distribution
Suitable for various NLP tasks and model architectures
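
As a quick sanity check on the binary labels and the balanced class distribution listed above, the following is a minimal sketch using only the Python standard library. The column names sentiment and tokens are assumptions for illustration; this page does not list the actual CSV headers, so inspect the file and adjust label_column before relying on the result. The in-memory CSV is a made-up stand-in, not real rows from the dataset.

```python
import csv
import io
from collections import Counter


def label_distribution(csv_file, label_column="sentiment"):
    """Return the fraction of rows per label value in an open CSV file.

    label_column is a hypothetical header name; replace it with the
    actual label column found in the dataset's CSV files.
    """
    reader = csv.DictReader(csv_file)
    counts = Counter(row[label_column] for row in reader)
    total = sum(counts.values())
    if total == 0:
        return {}
    # csv.DictReader yields strings, so the keys are "0" and "1", not ints.
    return {label: n / total for label, n in counts.items()}


# Tiny in-memory stand-in for train-processed-sample.csv (made-up rows,
# assumed column names).
demo_csv = io.StringIO(
    "sentiment,tokens\n"
    "0,bad day\n"
    "1,great day\n"
    "0,so tired\n"
    "1,love it\n"
)
demo_distribution = label_distribution(demo_csv)
print(demo_distribution)
```

To run the same check on the 10,000-tweet sample, open train-processed-sample.csv with open(path, newline="", encoding="utf-8") and pass the file object to label_distribution. A balanced file should report roughly 0.5 for each of the two labels.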

Citation
If you use this dataset in your research or project, please cite the original Sentiment140 dataset:
Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.

