Spam-Ham Text Dataset (TREC 2007)

Kaggle

@kaggle.abhaykr0111_spam_ham_text_datasettrec_2007

2007 TREC Data - Spam Corpus

Dataset Description

This dataset contains a total of 16,869 email messages, categorized into two classes: spam and ham (non-spam). Among these, 9,548 emails are labeled as spam and 7,321 emails are labeled as ham.
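The class counts above imply a moderately imbalanced split. As a quick sanity check, the sketch below derives each class's share and the accuracy of a trivial classifier that always predicts the majority class, which is a useful floor when evaluating models. It uses only the counts stated in this description; the variable names are illustrative.

```python
# Class counts as stated in the dataset description.
counts = {"spam": 9548, "ham": 7321}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the majority class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"{label}: {counts[label]} messages ({share:.1%})")
print(f"majority-class baseline ({majority_label}): {baseline_accuracy:.1%}")
```

A model that does not clearly beat this baseline has learned little beyond the class prior.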

The dataset is a curated subset of the 2007 TREC Public Spam Corpus, a benchmark collection widely used for research in spam detection, text classification, and natural language processing (NLP). Each entry consists of the email text and its corresponding label, which makes the dataset suitable for building and evaluating machine learning models for binary email classification.

This dataset can be used for:

  • Training and testing spam detection models
  • Practicing text preprocessing and feature extraction techniques
  • Research in machine learning, deep learning, and NLP applications
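As a rough illustration of the first two uses, here is a minimal sketch of a bag-of-words multinomial Naive Bayes classifier in pure Python. It is one possible baseline, not a method prescribed by the dataset. The four messages in `toy_samples` are made up for the example and are not taken from the corpus, and the function names (`tokenize`, `train_naive_bayes`, `predict`) are hypothetical; to use the real data, replace `toy_samples` with (text, label) pairs read from the dataset file.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train_naive_bayes(samples):
    """Collect per-class document and word counts from (text, label) pairs."""
    doc_counts = Counter()   # number of messages per label
    word_counts = {}         # label -> Counter of token frequencies
    vocab = set()            # every token seen during training
    for text, label in samples:
        doc_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for tok in tokenize(text):
            counts[tok] += 1
            vocab.add(tok)
    return doc_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior for the given text."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    tokens = [tok for tok in tokenize(text) if tok in vocab]  # skip unseen words
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            # Add-one (Laplace) smoothing avoids log(0) for words unseen in a class.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up messages for illustration only; NOT drawn from the dataset.
toy_samples = [
    ("Win a free prize now, click here", "spam"),
    ("Cheap pills, limited offer, buy now", "spam"),
    ("Meeting moved to Tuesday at noon", "ham"),
    ("Please review the attached project report", "ham"),
]

model = train_naive_bayes(toy_samples)
print(predict(model, "free prize offer"))
print(predict(model, "project meeting on Tuesday"))
```

On the full corpus, a held-out test split would be needed to measure accuracy; real email text would also benefit from extra preprocessing such as stripping headers and HTML.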
