2007 TREC Data - Spam Corpus
Dataset Description
This dataset contains 16,869 email messages in two classes: spam and ham (non-spam). Of these, 9,548 (about 56.6%) are labeled spam and 7,321 (about 43.4%) are labeled ham, so the classes are moderately imbalanced toward spam.
The dataset is a curated subset of the 2007 TREC Public Spam Corpus, a benchmark collection widely used in research on spam detection, text classification, and natural language processing (NLP). Each entry consists of the email text and its label, which makes the dataset suitable for building and evaluating machine learning models for binary email classification.
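As a rough illustration of the text-plus-label layout described above, the sketch below reads rows from a CSV source and reports each class's share. The column names `text` and `label`, the CSV format itself, and the sample rows are assumptions made for the example; the actual file layout is not stated here and may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data. The column names ("text", "label") and the
# CSV format are assumptions, not confirmed by the dataset description.
SAMPLE_CSV = """text,label
"Win a free prize now, click here",spam
"Meeting moved to 3pm tomorrow",ham
"Cheap meds, limited offer",spam
"""


def load_rows(handle):
    """Read (text, label) pairs from a CSV file-like object."""
    reader = csv.DictReader(handle)
    return [(row["text"], row["label"]) for row in reader]


def class_shares(rows):
    """Return each label's share of the rows as a fraction of the total."""
    counts = Counter(label for _, label in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


rows = load_rows(io.StringIO(SAMPLE_CSV))
shares = class_shares(rows)
print(shares)
```

To run this against a real file, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle and adjust the column names to match the file.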
This dataset can be used for:
- Training and testing spam detection models
- Practicing text preprocessing and feature extraction techniques
- Research in machine learning, deep learning, and NLP applications
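The first two uses listed above can be sketched with a small bag-of-words multinomial Naive Bayes classifier written in plain Python. This is only a minimal sketch under stated assumptions: the training messages are invented for the example, the tokenizer is deliberately crude, and the function names (`tokenize`, `train`, `predict`) are hypothetical helpers rather than part of any dataset tooling.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(examples, alpha=1.0):
    """Collect per-class word counts and document counts.

    alpha is the Laplace smoothing constant used at prediction time.
    """
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return {
        "word_counts": word_counts,
        "doc_counts": doc_counts,
        "vocab": vocab,
        "alpha": alpha,
    }


def predict(model, text):
    """Return the label with the highest log-posterior score."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    tokens = [t for t in tokenize(text) if t in model["vocab"]]
    best_label, best_score = None, -math.inf
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + alpha * vocab_size
        score = math.log(n_docs / total_docs)  # class prior
        for tok in tokens:
            score += math.log((counts[tok] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy messages, not drawn from the corpus.
train_set = [
    ("Win a free prize now", "spam"),
    ("Free offer, click now", "spam"),
    ("Meeting agenda for Monday", "ham"),
    ("Lunch on Monday?", "ham"),
]
model = train(train_set)
print(predict(model, "free prize offer"))
print(predict(model, "Monday meeting"))
```

On the full dataset, a held-out test split is needed for a meaningful evaluation. Because about 56.6% of the messages are spam, always predicting spam already reaches roughly that accuracy, so a trained model should be compared against that baseline.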