Text mining: How to mine e-mail data from an IMAP account using RapidMiner

July 14, 2013 at 5:39 PM, by Buddy James

In this article, I will show you how to use RapidMiner, one of the best open source data mining solutions available, to read email data from an IMAP or POP3 account for storage and processing. I'll cover basic text mining operations such as the Transform Case, Filter Stopwords, Stem, and Tokenize operators.

Here is a picture of the main process.

 

This process allows me to mine email messages and write the information that I'm interested in to a SQL Server database.

 

The operator of interest at the beginning of the process is the Process Documents from Mail Store operator. This operator lets you specify host details and login credentials in order to bring email messages into RapidMiner.

The Process Documents operator contains a sub process, where you can add operators that help with munging your data. Simply double-click the blue square boxes on the Process Documents operator to enter the sub process.

Here is a screenshot of the processing operators that I use inside the Process Documents operator.

RapidMiner process documents sub process operators
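
For readers who want to see what these four text operators do outside RapidMiner, here is a minimal Python sketch. It is not RapidMiner's implementation: the stopword set is a tiny hand-picked sample, `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter, and the order of the steps is an assumption.

```python
import re

# Tiny illustrative stopword sample (assumption); real stopword lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "for", "on", "you", "your"}


def tokenize(text):
    """Split text into tokens made of letters only (expects lower-cased input)."""
    return re.findall(r"[a-z]+", text)


def naive_stem(token):
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Transform case, tokenize, filter stopwords, then stem."""
    lowered = text.lower()                               # transform case
    tokens = tokenize(lowered)                           # tokenize
    kept = [t for t in tokens if t not in STOPWORDS]     # filter stopwords
    return [naive_stem(t) for t in kept]                 # stem


if __name__ == "__main__":
    print(preprocess("Reading the emails and mining messages"))
```

Stems such as "messag" are not dictionary words; that is normal for stemming, which only needs to map related word forms to the same token.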

 

The operator has a host property (for example, webmail.yourdomain.com), username and password properties to authenticate with the mail server, and a protocol drop-down list. You can choose to read only unread emails by ticking the "only unseen" checkbox, and you can mark emails as read by ticking the "mark seen" checkbox.
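
To make these properties concrete, here is a rough Python equivalent built on the standard library's `imaplib` and `email` modules. It is a sketch of the same idea, not what RapidMiner does internally. The function names and the `only_unseen` and `mark_seen` parameters are hypothetical and simply mirror the checkboxes described above.

```python
import email


def search_criterion(only_unseen):
    """Map the "only unseen" checkbox to an IMAP search key."""
    return "UNSEEN" if only_unseen else "ALL"


def fetch_messages(conn, only_unseen=True, mark_seen=False, mailbox="INBOX"):
    """Fetch messages over an already logged-in imaplib connection.

    Opening the mailbox read-only means fetching does not set the \\Seen flag,
    so mark_seen=False leaves unread emails unread.
    """
    conn.select(mailbox, readonly=not mark_seen)
    status, data = conn.search(None, search_criterion(only_unseen))
    if status != "OK":
        return []
    messages = []
    for num in data[0].split():
        status, parts = conn.fetch(num, "(RFC822)")
        if status != "OK":
            continue
        for part in parts:
            # Message payloads arrive as (header, raw_bytes) tuples.
            if isinstance(part, tuple):
                messages.append(email.message_from_bytes(part[1]))
    return messages


# Usage against a real server (host and credentials are placeholders):
#   import imaplib
#   conn = imaplib.IMAP4_SSL("webmail.yourdomain.com")
#   conn.login("username", "password")
#   for msg in fetch_messages(conn, only_unseen=True, mark_seen=False):
#       print(msg["From"], msg["Subject"])
```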

Here is an example of the properties for the Process Documents operator.

 

When you run your process, it may take a while depending on how many emails you plan to mine. The results that I'm writing to a database are as follows:

#1 The example set from the Select Attributes operator

I used the Select Attributes operator because, for some reason, there were two received columns in the result set, which gave me trouble when writing to the database. So I used Select Attributes to select all attributes except for the extra received attribute.
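
The same clean-up can be sketched in plain Python. The helper below is hypothetical and keeps only the first occurrence of each column name. It compares names case-insensitively, which is an assumption about why the two received columns clashed in the database.

```python
def drop_duplicate_columns(columns, rows):
    """Keep the first occurrence of each column name (case-insensitive)."""
    seen = set()
    keep = []
    for index, name in enumerate(columns):
        key = name.lower()
        if key in seen:
            continue  # a later duplicate, e.g. the extra "received" column
        seen.add(key)
        keep.append(index)
    kept_columns = [columns[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_columns, kept_rows


if __name__ == "__main__":
    cols = ["From", "Received", "Subject", "received"]
    data = [["a@example.com", "2013-07-14", "Hello", "2013-07-14"]]
    print(drop_duplicate_columns(cols, data))
```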

#2 The example set from the WordList to Data operator
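
As a rough illustration of what a word list written to a database looks like, the sketch below counts, for each word, how many documents contain it and how often it occurs in total, then stores the rows. It uses an in-memory SQLite database as a stand-in for SQL Server. The table name `word_list` and its column names are assumptions, not RapidMiner's actual output schema.

```python
import sqlite3
from collections import Counter


def word_list(documents):
    """Return sorted (word, in_documents, total) rows from tokenized documents."""
    total = Counter()
    in_documents = Counter()
    for tokens in documents:
        total.update(tokens)              # every occurrence counts
        in_documents.update(set(tokens))  # each document counts once per word
    return sorted((word, in_documents[word], total[word]) for word in total)


def write_word_list(conn, rows):
    """Store word list rows in a table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_list ("
        "word TEXT PRIMARY KEY, in_documents INTEGER, total INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO word_list VALUES (?, ?, ?)", rows)
    conn.commit()


if __name__ == "__main__":
    docs = [["email", "spam", "email"], ["spam", "offer"]]
    db = sqlite3.connect(":memory:")  # stand-in for the SQL Server database
    write_word_list(db, word_list(docs))
    for row in db.execute("SELECT * FROM word_list ORDER BY word"):
        print(row)
```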

 

As you can see, this is an incredible source of data. The data also offers classification modeling opportunities (I'm working on an article about detecting spam using RapidMiner, so check back soon).

Thank you for reading.



Posted in: Data mining | Text mining | Analytics | RapidMiner | Tutorial | SQL Server

Comments (8)

Please help
I'm unable to read emails from my Yahoo account. Are there any different parameters?

Reply

Buddy James (United States) says:

Roohi,

Thanks for reading!

Here are some details from Yahoo help:
help.yahoo.com/.../index

Incoming Mail (IMAP) Server - Requires SSL

Server: imap.mail.yahoo.com
Port: 993
Requires SSL: Yes

Outgoing Mail (SMTP) Server - Requires TLS

Server: smtp.mail.yahoo.com
Port: 465 or 587
Requires SSL: Yes
Requires authentication: Yes

Login info - Requires authentication

Email address: Your full email address (name@domain.com)
Password: Your account's password.
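
For anyone testing these settings outside RapidMiner, a minimal Python sketch using the standard `imaplib` module could look like the following. The `ImapSettings` class and the `open_imap` helper are hypothetical names used only for illustration. The host, port, and SSL values are the ones listed above.

```python
import imaplib
from dataclasses import dataclass


@dataclass(frozen=True)
class ImapSettings:
    host: str
    port: int
    use_ssl: bool


# Incoming mail values taken from the Yahoo help details quoted above.
YAHOO_IMAP = ImapSettings(host="imap.mail.yahoo.com", port=993, use_ssl=True)


def open_imap(settings, username, password):
    """Open and log in to an IMAP connection, with SSL if the settings require it."""
    connection_class = imaplib.IMAP4_SSL if settings.use_ssl else imaplib.IMAP4
    conn = connection_class(settings.host, settings.port)
    conn.login(username, password)  # username is the full email address
    return conn


# Usage (credentials are placeholders):
#   conn = open_imap(YAHOO_IMAP, "name@domain.com", "your-password")
#   conn.select("INBOX", readonly=True)
```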

Reply

Rizwan Ali (Islamic Republic of Pakistan) says:

Hi James,

Thanks for this informative article. I have a few questions.

1. "The operator of interest at the beginning of the process is the Process documents from mail store operator."

Is this operator already available in RapidMiner, or did you write it? If you wrote it, how can I write an operator?

Regards

Rizwan

Reply

Buddy James (United States) says:

Thanks for reading Rizwan.

The Process Documents from Mail Store operator is a RapidMiner operator (I don't think it's an extension, but I could be wrong). You can install extensions from within RapidMiner from the Help menu.

I've yet to write a custom operator (or extension), but I know it can be done. If you are having trouble finding the operator, then it probably is an extension. Also, keep in mind that I use RapidMiner version 5, because they removed the SQL operators from the new free version. Frankly, I find those operators essential, so I won't upgrade RapidMiner until they've released another free version that includes the Read/Write to database operators.

Thanks again for reading!

Buddy

Reply

Rizwan Ali (Islamic Republic of Pakistan) says:

Hi James,

Thanks for your reply to my comment. I have two queries.

1. Is there any way to load .eml files in RapidMiner? If there is, please do tell.

2. Nowadays there is a trend of sending a .gif, .jpeg, or other image in spam mail rather than writing something, to avoid data mining and spam filtering based on text mining. Is there a way around that?

Regards

Rizwan

Reply

Thank you for this article. I have not been able to connect to my Outlook Web App email: https://correo.xxx.cl/owa/ => "HOST": (RapidMiner) correo.xxx.cl and user xxx\lpez => "USER": (RapidMiner) lpez@xxx.cl with "PROTOCOL": (RapidMiner) imap. This happens with both "Text Processing" operators, "Read Documents (Mail)" and "Process Documents from Mail Store". Could you please give me some help? Thank you very much.

Reply

Hi, thanks for this information. Do you know how to apply the configuration of "Process Documents from Mail Store" with Outlook Web Access (HOST and USER)?

Reply

Nikhil V Mathew (India) says:

Hi James,

A sun.security.validator exception with a "PKIX path building failed" error occurs at run time. How can I solve this?

Reply
