Enron Emails Dataset (Subsetted)
Database of 270,087 Enron emails, taken from the May 7, 2015 version of the dataset. Data was extracted from CMU.
Columns of the dataset
There are three columns in enron.csv:
- person: The person associated with the email.
- mail_num: Identifier for each person’s email.
- email: Each entry is a line in an email, including the email’s metadata (like subject, who it was sent to, etc.)
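Because each row holds a single line of an email, a common first step is to join the lines back into whole emails. The following is a minimal Python sketch of that step, not part of the original R workflow. It assumes that enron.csv has a header row with the three column names above and that the lines of each email appear in their original order. The function name group_emails and the sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def group_emails(rows):
    """Collect the per-line rows back into whole emails.

    Returns a dict mapping (person, mail_num) to the email text,
    with lines joined in the order they appear in the input.
    """
    lines = defaultdict(list)
    for row in rows:
        lines[(row["person"], row["mail_num"])].append(row["email"])
    return {key: "\n".join(parts) for key, parts in lines.items()}


# Tiny invented sample in the assumed layout (not real Enron data).
sample_csv = io.StringIO(
    "person,mail_num,email\n"
    "person-a,1,Subject: hello\n"
    "person-a,1,See you at noon.\n"
    "person-a,2,Subject: follow-up\n"
    "person-b,1,Subject: status\n"
)
emails = group_emails(csv.DictReader(sample_csv))
print(len(emails))  # one entry per (person, mail_num) pair
```

To run this on the real file, pass csv.DictReader an open handle to enron.csv instead of the in-memory sample. Note that csv.DictReader returns every value as a string, so mail_num is a string key here.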
Reproducing the dataset
To reproduce the dataset enron.csv:
- Download and unzip the file from the above CMU website.
- Ensure the resulting unzipped folder is called maildir.
- Place the R script, enron.R, beside maildir (that is, in the same directory as maildir).
- Run the R script.
Dependencies: R packages tidyverse and fs.
Main idea of how the dataset was created
The zip file held folders corresponding to people. Each person’s folder had many other folders, but I only selected the folders named _sent_mail, which themselves contained a number of files containing individual emails. I read in all emails in all _sent_mail folders. I then subsetted to the first 100 emails for each person (if more than 100 existed).
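The selection logic described above can be sketched as follows. This is an illustrative Python approximation, not the actual enron.R script, and it rests on several assumptions: "first 100" is taken to mean the first 100 files in sorted filename order, mail_num is taken to be a running counter within each person, and files are decoded as Latin-1. The function name build_rows and its parameters are hypothetical.

```python
from pathlib import Path


def build_rows(maildir, folder_name="_sent_mail", max_emails=100):
    """Yield (person, mail_num, line) tuples for each person's sent mail."""
    for person_dir in sorted(Path(maildir).iterdir()):
        sent_dir = person_dir / folder_name
        if not sent_dir.is_dir():
            continue  # this person has no folder with that name
        # Assumption: "first 100" means the first 100 files in sorted
        # (lexicographic) filename order; the R script may order differently.
        files = sorted(p for p in sent_dir.iterdir() if p.is_file())
        for mail_num, path in enumerate(files[:max_emails], start=1):
            # Assumption: Latin-1 decoding, chosen because it never fails.
            text = path.read_text(encoding="latin-1")
            for line in text.splitlines():
                yield person_dir.name, mail_num, line
```

Calling list(build_rows("maildir")) would give one tuple per email line, matching the three-column layout of enron.csv under the assumptions stated above.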