July 14, 2013 at 5:39 PM
In this article, I will show you how to use RapidMiner, one of the best open source data mining solutions available, to read email data from an IMAP or POP3 account for storage and processing. I'll cover the basic text mining operators: Transform Case, Filter Stopwords, Stem, and Tokenize.
Here is a picture of the main process.
This process allows me to mine email messages and write the information that I'm interested in to a SQL Server database.
The operator of interest at the beginning of the process is the Process Documents from Mail Store operator. This operator lets you specify host details and login credentials in order to bring email messages into RapidMiner.
The Process Documents from Mail Store operator contains a subprocess, where you can add further operators to help munge your data. Simply double-click the blue square boxes on the operator to enter the subprocess.
Here is a screenshot of the processing operators that I use inside the Process Documents operator.
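For readers who think in code, here is a rough Python sketch of what those four operators do to a message body. It is only an illustration, not RapidMiner's implementation: the stopword list is a tiny made-up sample, `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm, and the order of the steps is my assumption.

```python
import re

# Tiny illustrative stopword list (an assumption); a real stopword filter
# uses a much larger list.
STOPWORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "for", "on"}


def transform_case(text):
    """Lower-case the text so 'Server' and 'server' count as one word."""
    return text.lower()


def tokenize(text):
    """Split the text into runs of letters, dropping everything else."""
    return re.findall(r"[a-z]+", text)


def filter_stopwords(tokens):
    """Remove very common words that carry little meaning."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(token):
    """Strip a few common suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run the four steps in sequence and return the list of stems."""
    tokens = tokenize(transform_case(text))
    tokens = filter_stopwords(tokens)
    return [naive_stem(t) for t in tokens]


stems = preprocess("The servers are Rebooting for updates")
print(stems)
```

Stems do not have to be real words; they only need to map related word forms to the same string.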
The mail store operator has a host property (for example, webmail.yourdomain.com), username and password properties to authenticate with the mail server, and a protocol drop-down list. You can read only unread emails by ticking the "only unseen" checkbox, and you can mark emails as read by ticking the "mark seen" checkbox.
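If you want a feel for what these properties mean at the protocol level, here is a minimal sketch using Python's standard imaplib module. This is not how RapidMiner does it internally; the function names, the INBOX folder, and the use of an SSL connection are my assumptions for the illustration.

```python
import email
import imaplib


def search_criterion(only_unseen):
    """Map the 'only unseen' checkbox to an IMAP search key."""
    return "UNSEEN" if only_unseen else "ALL"


def fetch_item(mark_seen):
    """Map the 'mark seen' checkbox to an IMAP fetch item.

    Fetching RFC822 sets the \\Seen flag on the server;
    BODY.PEEK[] returns the same message without setting it.
    """
    return "(RFC822)" if mark_seen else "(BODY.PEEK[])"


def read_mail(host, username, password,
              only_unseen=True, mark_seen=False, folder="INBOX"):
    """Return the matching messages as email.message.Message objects."""
    conn = imaplib.IMAP4_SSL(host)  # assumes the server offers IMAP over SSL
    try:
        conn.login(username, password)
        conn.select(folder)
        _, data = conn.search(None, search_criterion(only_unseen))
        messages = []
        for num in data[0].split():
            _, parts = conn.fetch(num, fetch_item(mark_seen))
            messages.append(email.message_from_bytes(parts[0][1]))
        return messages
    finally:
        conn.logout()
```

A call would look like `read_mail("webmail.yourdomain.com", "me", "secret")`, with placeholder credentials replaced by your own.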
Here is an example of the properties for the Process Documents operator.
When you run your process, it may take a while depending on how many emails you plan to mine. The results that I'm writing to a database are as follows:
#1 The example set from the Select Attributes operator
I used the Select Attributes operator because, for some reason, there were two received columns in the result set, which gave me trouble when writing to the database. So I used Select Attributes to keep every attribute except the extra received attribute.
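The same idea can be sketched outside RapidMiner. The Python below drops a repeated column before writing rows to a table. It uses an in-memory SQLite database purely as a stand-in for SQL Server, and the column names and sample row are made up. I don't know whether the two received columns had identical names or differed only in case, so the sketch treats both situations as duplicates.

```python
import sqlite3


def drop_duplicate_columns(columns, rows):
    """Keep the first occurrence of each column name, ignoring case."""
    seen = set()
    keep = []
    for i, name in enumerate(columns):
        key = name.lower()
        if key not in seen:
            seen.add(key)
            keep.append(i)
    new_columns = [columns[i] for i in keep]
    new_rows = [tuple(row[i] for i in keep) for row in rows]
    return new_columns, new_rows


def write_rows(conn, table, columns, rows):
    """Create the table if needed and insert the rows."""
    col_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_list})')
    conn.executemany(
        f'INSERT INTO "{table}" ({col_list}) VALUES ({placeholders})', rows
    )
    conn.commit()


# Hypothetical example set with a repeated "received" column.
columns = ["From", "Subject", "Received", "received"]
rows = [("alice@example.com", "Hello", "2013-07-14", "2013-07-14")]

columns, rows = drop_duplicate_columns(columns, rows)
conn = sqlite3.connect(":memory:")  # stand-in for the SQL Server database
write_rows(conn, "emails", columns, rows)
stored = conn.execute('SELECT * FROM "emails"').fetchall()
print(columns, stored)
```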
#2 The example set from the WordList to Data operator
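Conceptually, a word list turned into data is a table with one row per word. Here is a small Python sketch of that idea. The three-column layout (word, total occurrences, number of documents containing the word) is my assumption for the illustration, and the sample documents are made up.

```python
from collections import Counter


def word_list_to_rows(documents):
    """Turn tokenized documents into (word, total, in_documents) rows."""
    total = Counter()      # how many times each word occurs overall
    doc_count = Counter()  # how many documents contain each word
    for tokens in documents:
        total.update(tokens)
        doc_count.update(set(tokens))
    return [(word, total[word], doc_count[word]) for word in sorted(total)]


# Two hypothetical emails, already tokenized and stemmed.
docs = [["server", "reboot", "server"], ["server", "update"]]
word_rows = word_list_to_rows(docs)
for row in word_rows:
    print(row)
```

Rows in this shape can be written to a database table the same way as any other example set.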
As you can see, this is an incredible source of data. The data also offers classification modeling opportunities (I'm working on an article about detecting spam with RapidMiner; check back soon).
Thank you for reading.