2007 TREC Data - Spam Corpus

This dataset contains a total of 16,869 email messages, categorized into two classes: spam and ham (non-spam). Among these, 9,548 emails are labeled as spam and 7,321 emails are labeled as ham.
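The class split above is moderately imbalanced, which matters when reading accuracy figures. As a small worked check using only the counts stated here, the snippet below computes each class's share and the accuracy of a trivial classifier that always predicts the majority class. The variable names are illustrative and not part of the dataset.

```python
# Class counts as stated in the dataset description.
n_spam = 9_548
n_ham = 7_321
n_total = n_spam + n_ham

spam_share = n_spam / n_total
ham_share = n_ham / n_total

# A classifier that always predicts the majority class reaches this
# accuracy, so it is a useful floor when judging a trained model.
majority_baseline = max(spam_share, ham_share)

print(f"total={n_total}, spam={spam_share:.1%}, ham={ham_share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Spam makes up roughly 56.6% of the messages, so a model should be judged against that floor rather than against 50%.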

The dataset is a curated subset of the 2007 TREC Public Spam Corpus, a benchmark collection widely used for research in spam detection, text classification, and natural language processing (NLP). Each entry consists of the email text and its label (spam or ham), making the dataset suitable for building and evaluating machine learning models for binary email classification.

This dataset can be used for:

Training and testing spam detection models
Practicing text preprocessing and feature extraction techniques
Research in machine learning, deep learning, and NLP applications
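To make the first two uses concrete, here is a minimal sketch of text preprocessing (lowercasing and tokenizing) followed by a multinomial Naive Bayes classifier with add-one smoothing, written with the Python standard library only. The four example messages are invented for illustration and are not taken from the dataset, and the function names (tokenize, train, predict) are hypothetical. With the real data, the (text, label) pairs would be read from the dataset file instead.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(samples):
    """Count words per class from (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in samples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab


def predict(text, model):
    """Return the label with the higher log-probability score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # Log prior: share of training messages with this label.
        score = math.log(doc_counts[label] / total_docs)
        # Add-one (Laplace) smoothing over the shared vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][token] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy messages, NOT drawn from the dataset.
toy_samples = [
    ("Win a free prize now", "spam"),
    ("Cheap pills, free shipping", "spam"),
    ("Meeting moved to Monday morning", "ham"),
    ("Please review the attached report", "ham"),
]

model = train(toy_samples)
print(predict("free prize inside", model))
print(predict("review the report on monday", model))
```

On the full dataset, a held-out test split and a comparison against the majority-class baseline would give a more meaningful evaluation than this toy run.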
