Tokenized and Sentiment-Labeled Tweets for NLP and Machine Learning

This dataset is a processed version of the Sentiment140 corpus, containing 1.6 million tweets with binary sentiment labels. The original data has been cleaned, tokenized, and prepared for natural language processing (NLP) and machine learning tasks. It is suited to sentiment analysis, text classification, and related NLP applications.
The dataset includes the full processed corpus (train-processed.csv) and a smaller sample of 10,000 tweets (train-processed-sample.csv) for quick experimentation and model prototyping.

Key Features:

1.6 million labeled tweets
Binary sentiment classification (0 for negative, 1 for positive)
Preprocessed and tokenized text
Balanced class distribution
Suitable for various NLP tasks and model architectures
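
As a quick sanity check on the label encoding and class balance listed above, the following minimal sketch counts rows per label using only the Python standard library. The column name `sentiment` and the helper name `label_counts` are assumptions made for illustration; this description does not list the actual CSV header, so adjust `label_column` to match the file.

```python
import csv
from collections import Counter
from typing import Iterable


def label_counts(lines: Iterable[str], label_column: str = "sentiment") -> Counter:
    """Count rows per label in CSV text that starts with a header row.

    `label_column` is an assumed column name, not one confirmed by the
    dataset description; pass the real header name if it differs.
    """
    reader = csv.DictReader(lines)
    counts: Counter = Counter()
    for row in reader:
        # Labels are described as 0 (negative) and 1 (positive).
        counts[int(row[label_column])] += 1
    return counts


def label_shares(counts: Counter) -> dict:
    """Convert raw label counts into fractions of the total row count."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

To run it against the sample file, open train-processed-sample.csv with `open(path, newline="", encoding="utf-8")` and pass the file handle to `label_counts`. If the class distribution is balanced as stated, both shares should come out close to 0.5.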

Citation
If you use this dataset in your research or project, please cite the original Sentiment140 dataset:
Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.

