PHISHING EMAIL DATASET
This dataset was compiled by researchers to study phishing email tactics. It combines emails from several existing sources into a single resource for analysis.
Initial Datasets:
- Enron and Ling Datasets: These datasets focus on the core content of phishing emails, containing subject lines, email body text, and labels indicating whether the email is spam (phishing) or legitimate.
- CEAS, Nazario, Nigerian Fraud, and SpamAssassin Datasets: These datasets provide broader context for the emails, including sender information, recipient information, date, and labels for spam/legitimate classification.
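The two groups of source datasets expose different fields, so combining them means mapping every row onto one shared set of fields. The following is a minimal sketch of that idea, not the authors' actual merging procedure. The field names (`sender`, `receiver`, `date`, `subject`, `body`, `label`), the label encoding (1 for spam/phishing, 0 for legitimate), and the sample rows are all assumptions made for illustration. The sketch also assumes that the context-rich datasets carry subject and body text alongside their metadata.

```python
# Hypothetical unified field set; the real column names may differ.
UNIFIED_FIELDS = ("sender", "receiver", "date", "subject", "body", "label")


def harmonize(row: dict) -> dict:
    """Map a row from either source group onto the unified field set.

    Fields that a source does not provide are filled with None.
    """
    record = {field: row.get(field) for field in UNIFIED_FIELDS}
    # Assumed encoding: 1 = spam/phishing, 0 = legitimate.
    record["label"] = int(row["label"])
    return record


# Invented example rows, one per source group.
content_only_row = {  # Enron/Ling style: content and label only
    "subject": "Account notice",
    "body": "Verify your password now",
    "label": "1",
}
context_row = {  # CEAS/Nazario/Nigerian Fraud/SpamAssassin style: adds context
    "sender": "alice@example.com",
    "receiver": "bob@example.com",
    "date": "2008-08-05",
    "subject": "Meeting",
    "body": "See you at 10.",
    "label": 0,
}

combined = [harmonize(content_only_row), harmonize(context_row)]
```

Filling absent fields with `None` keeps content-only rows usable for text-based models while letting context-aware analyses filter for rows that have sender, recipient, and date information.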
Final Dataset:
The final dataset combines the information from the initial datasets into a single resource for analysis. This dataset contains:
- Approximately 82,500 emails (82,486 by the counts below)
- 42,891 spam emails
- 39,595 legitimate emails
This dataset allows researchers to study the content of phishing emails and the context in which they are sent to improve detection methods.
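The class counts listed above determine the class balance, which is worth checking before training a detector. The short sketch below uses only the two counts stated in this description; the variable names are illustrative.

```python
spam = 42_891        # spam (phishing) emails, as stated above
legitimate = 39_595  # legitimate emails, as stated above

total = spam + legitimate
spam_share = spam / total
legit_share = legitimate / total

print(f"total={total}, spam={spam_share:.1%}, legitimate={legit_share:.1%}")
# prints: total=82486, spam=52.0%, legitimate=48.0%
```

The classes are close to balanced, so a classifier that always predicts "spam" would score about 52% accuracy; any useful model should clearly exceed that baseline.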
Please cite the following two articles if you are using this dataset:
- Al-Subaiey, A., Al-Thani, M., Alam, N. A., Antora, K. F., Khandakar, A., & Zaman, S. A. U. (2024, May 19). Novel Interpretable and Robust Web-based AI Platform for Phishing Email Detection. ArXiv.org. https://arxiv.org/abs/2405.11619