Email-blog

Classified emails taken from the public Enron repository.

@kaggle.mikeschmidtavemac_emailblog

About this Dataset

Context

Supervised classification dataset produced as part of a blog series on classifying corporate email for morale and professional alignment.
The series covers raw data extraction, analysis, unsupervised topic discovery, and supervised model development.

The blog posts are available at:

Part 1. Raw email processing. https://www.avemacconsulting.com/2021/08/24/email-insights-from-data-science-techniques-part-1/
Part 2. Data analysis. https://www.avemacconsulting.com/2021/08/27/email-insights-from-data-science-part-2/
Part 3. Unsupervised topic classification (creates this dataset). https://www.avemacconsulting.com/2021/09/23/email-insights-from-data-science-part-3/
Part 4. Supervised modeling (uses this dataset). https://www.avemacconsulting.com/2021/10/12/email-insights-from-data-science-part-4/

Note: This data was produced as part of a blog series and has not been fully vetted. In particular, the unsupervised topic extraction step should be tuned further for accuracy.

Content

Original email content was taken from the public Enron email repository at https://www.cs.cmu.edu/~enron/.

The dataset contains email body text, various supporting features (email addresses, date/time, etc.), and multiple classification labels.

Three (3) sentiment labels were generated, each with three (3) classes: positive, negative, and neutral/unknown.
Three (3) alignment (business/personal) labels were also created, each with two (2) classes: fun and work.
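
The dataset description does not say how the three labels for each task should be combined. One simple option is a majority vote across the three labels, sketched below. The column names (sent_a, sent_b, sent_c) and the tie-breaking rule are assumptions made for illustration and are not taken from the dataset.

```python
from collections import Counter


def majority_label(labels, fallback="neutral"):
    """Return the most frequent label, or `fallback` when the top counts tie."""
    counts = Counter(labels).most_common()
    if not counts:
        return fallback
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        # No single label has a strict plurality (e.g. all three disagree).
        return fallback
    return counts[0][0]


# Hypothetical rows: the real column names in the dataset may differ.
rows = [
    {"sent_a": "positive", "sent_b": "positive", "sent_c": "neutral"},
    {"sent_a": "negative", "sent_b": "neutral", "sent_c": "positive"},
]

consensus = [
    majority_label([row["sent_a"], row["sent_b"], row["sent_c"]])
    for row in rows
]
print(consensus)
```

The same helper can be applied to the alignment labels. With two classes and three labels a tie cannot occur, so the fallback is never used in that case.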

Acknowledgements

Uses sentiment lexicon from http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews."
Proceedings of the ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.

Uses VADER from https://www.nltk.org/api/nltk.sentiment.html?highlight=vader#module-nltk.sentiment.vader

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text.
Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

Uses AFINN from http://corpustext.com/reference/sentiment_afinn.html

Finn Årup Nielsen. "A new ANEW: Evaluation of a word list for sentiment analysis in microblogs."
Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages,
CEUR Workshop Proceedings, vol. 718, pp. 93-98. May 2011.
