August 19, 2013 at 11:51 PM
Email spam detection using support vector machine classification.
Hello and welcome to another tutorial on RapidMiner, one of the best open source data mining tools available. This article will discuss how to use Support Vector Machine (SVM) classification in RapidMiner to detect spam in your POP3 or IMAP email account. Since we will read email from a POP3 or IMAP account, you may want to check out my previous post, Text mining: How to mine e-mail data from an IMAP account using RapidMiner.
Unlike my other tutorials, this one explains a RapidMiner solution that consists of two processes. The first process creates a support vector machine classification model that is trained to identify two classes of text documents, spam and not spam. We then store the model in the RapidMiner repository for use in the second process. The second process reads through an email account that you specify and classifies each email as spam or not spam based on the SVM model that you created in the first process. The methods used in these processes can be used to classify many different types of documents.
Process one: Train the model using text files
In the first process, we will use the Process Documents operator to read two directories where you will create text files that represent examples of spam and non spam emails. It's important to remember that the more examples you provide, the better the model will perform when it's time to classify unlabeled emails.
As you may know, the Process Documents operator contains a subprocess (identified by the little blue box in the bottom right corner). When you double click on the blue box, you can add operators that will process the documents that are read through the main operator. When you use text mining to generate word vectors, there are several operators that you will find yourself using often. In this particular example, you will use Transform Case (changes all words to lowercase), Tokenize (separates the text into word tokens), Filter Tokens (by Length) (only keeps words that are at least 4 characters long), Stem (Snowball) (reduces each word to its "base word"), and finally Filter Stopwords, which removes common words that carry little meaning.
The Process Documents operator has a property called Text directories with a button titled "Edit List". Click this button and create two entries. The class name represents a class that you will use for classification. The directory is the folder that contains the training text files that you've created for that class. For our experiment, we will create a class called Spam and provide a directory that contains text files where you've copied and pasted the text from spam emails. The other class will be called not spam, and its directory will contain text files with the text from legitimate emails. Again, the accuracy of the model will depend on the number of examples that you provide.
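If it helps to see the text operators as code, here is a minimal Python sketch of the same idea. It is not what RapidMiner runs internally: the stopword list is a tiny made-up sample, `crude_stem` is a rough stand-in for the Snowball stemmer, and I filter stopwords before stemming (unlike the order listed above) so that the unstemmed stopword list still matches.

```python
import re

# Tiny illustrative stopword list (an assumption); real lists are much longer.
STOPWORDS = {"this", "that", "with", "have", "from", "your", "will", "been"}

def crude_stem(token):
    """Strip a few common English suffixes (a rough stand-in for Snowball)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text, min_length=4):
    lowered = text.lower()                                  # Transform Case
    tokens = re.findall(r"[a-z]+", lowered)                 # Tokenize on non-letters
    tokens = [t for t in tokens if len(t) >= min_length]    # Filter Tokens (by Length)
    tokens = [t for t in tokens if t not in STOPWORDS]      # Filter Stopwords
    return [crude_stem(t) for t in tokens]                  # Stem

tokens = preprocess("You WILL win FREE prizes by clicking this link!")
print(tokens)
```

Each email ends up as a short list of normalized tokens, which is the raw material for the word vector.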
Here is an image of the first process that we will use to train and store the model.
As you can see, we have a Process Documents operator, an X-Validation operator (cross validation) to calculate the performance of our model, and two Store operators. One Store operator saves the word vectors created by Process Documents, and the other saves the SVM model that comes from the output of the cross validation operator. The Process Documents and X-Validation operators are both nested processes (which means they contain sub processes). I will provide screen shots so you can see how each one is set up in its sub process area.
Run the process to save the model and to view the performance of the model.
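For readers who think in code, here is a rough sketch of what this process does conceptually: build word vectors, estimate performance with cross validation, then train and store a model. Everything here is an assumption for illustration. The training documents are made up, the linear SVM is a simple sub-gradient trainer rather than RapidMiner's SVM implementation, and `pickle` stands in for the Store operators.

```python
import pickle

def vectorize(tokens, vocabulary):
    """Binary bag-of-words vector over a fixed, ordered vocabulary."""
    present = set(tokens)
    return [1.0 if word in present else 0.0 for word in vocabulary]

def score(model, x):
    weights, bias = model
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def train_linear_svm(X, y, epochs=200, lam=0.01, lr=0.1):
    """Sub-gradient descent on the regularised hinge loss; labels are +1 / -1."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            violated = label * score((weights, bias), x) < 1
            # Shrink the weights (L2 regularisation) on every step.
            weights = [w * (1 - lr * lam) for w in weights]
            if violated:
                # Margin violated: move the hyperplane toward this example.
                weights = [w + lr * label * xi for w, xi in zip(weights, x)]
                bias += lr * label
    return weights, bias

def predict(model, x):
    return 1 if score(model, x) >= 0 else -1

def cross_validate(X, y, k):
    """k-fold cross validation; example i is held out in fold i % k."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        held_out = [i for i in range(len(X)) if i % k == fold]
        model = train_linear_svm([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in held_out)
    return correct / len(X)

# Made-up, already preprocessed documents (+1 = spam, -1 = not spam).
documents = [
    ["free", "prize", "click"], ["free", "offer", "click"],
    ["prize", "offer", "free"], ["click", "offer", "prize"],
    ["meeting", "report", "monday"], ["report", "lunch", "monday"],
    ["meeting", "lunch", "report"], ["monday", "lunch", "meeting"],
]
labels = [1, 1, 1, 1, -1, -1, -1, -1]
vocabulary = sorted({word for doc in documents for word in doc})
X = [vectorize(doc, vocabulary) for doc in documents]

accuracy = cross_validate(X, labels, k=4)      # performance estimate
final_model = train_linear_svm(X, labels)      # model trained on all examples
# Store the word list and the model together, like the two Store operators.
stored_bytes = pickle.dumps({"vocabulary": vocabulary, "model": final_model})
restored = pickle.loads(stored_bytes)
print("cross validation accuracy:", accuracy)
```

The word list is stored next to the model on purpose: a new email has to be turned into a vector with exactly the same vocabulary before the model can score it.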
In the next process, we will read the SVM model that we stored to the repository in process one and use it to classify the emails that come from the Process Documents from Mail Store operator. This process is pretty straightforward. Simply add a Read operator to read the model from the repository. Next, add a Process Documents from Mail Store operator to read the documents to be classified by the model. If you need to know how to set the properties on the Process Documents from Mail Store operator, please check out my previous article, Text mining: How to mine e-mail data from an IMAP account using RapidMiner.
Here is how the 2nd process should look:
The Read operator reads the previously trained SVM model from the repository, and we use the Apply Model operator to get the classification. We also add a Process Documents from Mail Store operator to read email messages and use them as the unlabeled input to Apply Model. The Process Documents from Mail Store sub process contains the same text mining operators that we used in the Process Documents operator in process one. Next, just run the process and you will see all emails and their predicted class.
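As a rough illustration of this apply step, the sketch below loads a stored model, pulls the subject and plain text body out of a raw email, and scores it. The stored model here is a hypothetical stand-in (a few hand-picked word weights and a bias), not something exported from RapidMiner, and the preprocessing is reduced to lowercasing and tokenizing.

```python
import pickle
import re
from email import message_from_string

# Hypothetical stored model: per-word weights and a bias (made-up values).
stored = pickle.dumps({
    "weights": {"free": 0.6, "prize": 0.5, "meeting": -0.7, "report": -0.5},
    "bias": -0.1,
})

def extract_text(raw_message):
    """Return the subject plus every text/plain part of a raw email."""
    msg = message_from_string(raw_message)
    parts = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
            charset = part.get_content_charset() or "utf-8"
            parts.append(payload.decode(charset, errors="replace"))
    return msg.get("Subject", "") + "\n" + "\n".join(parts)

def classify(raw_message, model):
    """Apply the same preprocessing as in training, then score the email."""
    tokens = set(re.findall(r"[a-z]+", extract_text(raw_message).lower()))
    total = model["bias"] + sum(model["weights"].get(t, 0.0) for t in tokens)
    return "spam" if total >= 0 else "not spam"

model = pickle.loads(stored)
spam_mail = "Subject: Claim your free prize\n\nClick now to win a free prize.\n"
work_mail = "Subject: Monday meeting\n\nPlease send the report before the meeting.\n"
print(classify(spam_mail, model), "/", classify(work_mail, model))
```

The important design point is the same one as in the RapidMiner process: the text handling at classification time has to match the text handling used during training.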
Remember, the class names used by the model are arbitrary; you could just as easily set them to Positive/Negative, Good/Bad, Happy/Sad, etc. This process can be used for many different applications, such as sentiment analysis.
Thanks for reading!
July 14, 2013 at 5:39 PM
In this article, I will show you how to use RapidMiner, one of the best open source data mining solutions on the internet, to read email data from an IMAP or POP3 account for storage and processing. I'll cover the basic text mining operators such as Transform Case, Filter Stopwords, Stem, and Tokenize.
Here is a picture of the main process.
This process allows me to mine email messages and write the information that I'm interested in to a SQL Server database.
The operator of interest at the beginning of the process is the Process documents from mail store operator. This operator allows you to specify host details and login credentials in order to bring email messages into RapidMiner.
The Process Documents operator contains a sub process where you can add operators that help munge your data. Simply double click on the blue square boxes on the Process Documents operator to enter the sub process.
Here is a screen shot of the processing operators that I use inside of the Process Documents operator.
The operator has a host property (for example, webmail.yourdomain.com), username and password properties to authenticate with the mail server, and a protocol drop down list. You can choose to read only unread emails by ticking the only unseen checkbox, and you can mark emails as read by ticking the mark seen checkbox.
Here is an example of the properties for the Process Documents operator.
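To show what those properties correspond to outside of RapidMiner, here is a small sketch using Python's standard `imaplib` module. The function and parameter names (`fetch_messages`, `only_unseen`, `mark_seen`) are my own labels that mirror the operator's checkboxes, and the sketch assumes an IMAP server that accepts SSL connections on the default port.

```python
import imaplib

def search_criterion(only_unseen):
    """Map the 'only unseen' checkbox to an IMAP search key."""
    return "UNSEEN" if only_unseen else "ALL"

def fetch_messages(host, username, password, only_unseen=True, mark_seen=False):
    """Return the raw bytes of each matching message in the INBOX."""
    conn = imaplib.IMAP4_SSL(host)
    try:
        conn.login(username, password)
        # A read-only mailbox leaves the \Seen flag untouched when fetching.
        conn.select("INBOX", readonly=not mark_seen)
        status, data = conn.search(None, search_criterion(only_unseen))
        messages = []
        for num in data[0].split():
            status, parts = conn.fetch(num, "(RFC822)")
            messages.append(parts[0][1])
        return messages
    finally:
        conn.logout()

# Example call (placeholder host and credentials, so it is left commented out):
# raw_emails = fetch_messages("webmail.yourdomain.com", "user", "secret")
```

Opening the mailbox read-only is how this sketch imitates leaving mark seen unticked; with `mark_seen=True` the server flags each fetched message as read.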
When you run your process, it may take a while depending on how many emails you plan to mine. The results that I'm writing to a database are as follows:
#1 The example set from the Select Attributes operator
I used the Select Attributes operator because, for some reason, there were two received columns in the result set, which gave me trouble when writing to the database. So I used Select Attributes to select all attributes except for the extra received attribute.
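Here is the same idea as a small sketch: drop the extra column, then write the rows to a table. I use an in-memory SQLite database as a stand-in for SQL Server, and the rows, the `received_1` column name, and the table layout are all made up for the example.

```python
import sqlite3

# Made-up rows; "received_1" plays the role of the unwanted duplicate column.
rows = [
    {"subject": "Monday meeting", "sender": "alice@example.com",
     "received": "2013-07-14", "received_1": "2013-07-14"},
    {"subject": "Claim your prize", "sender": "promo@example.com",
     "received": "2013-07-13", "received_1": "2013-07-13"},
]

def select_attributes(row, exclude):
    """Keep every attribute except the ones named in `exclude`."""
    return {name: value for name, value in row.items() if name not in exclude}

cleaned = [select_attributes(row, {"received_1"}) for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (subject TEXT, sender TEXT, received TEXT)")
conn.executemany(
    "INSERT INTO emails (subject, sender, received) "
    "VALUES (:subject, :sender, :received)",
    cleaned,
)
conn.commit()
stored_count = conn.execute("SELECT COUNT(*) FROM emails").fetchone()[0]
print("rows written:", stored_count)
```

Removing the duplicate before the insert keeps the column names unique, which is what the database write needs.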
#2 The example set from the WordList to Data operator
As you can see, this is an incredible source of data. The data also offers classification modeling opportunities. I'm working on an article about detecting spam using RapidMiner, so check back soon.
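To give a feel for what a word list turned into data can look like, here is a minimal sketch that counts, for each word, how many documents contain it and how often it occurs in total. The column names `in_documents` and `total` are my own choice for the example and the token lists are made up, so treat the exact layout as an assumption rather than RapidMiner's output format.

```python
from collections import Counter

def word_list_to_rows(documents):
    """Build one row per word from a list of tokenized documents."""
    total = Counter()
    in_documents = Counter()
    for tokens in documents:
        total.update(tokens)               # every occurrence counts
        in_documents.update(set(tokens))   # each document counts once per word
    return [
        {"word": word, "in_documents": in_documents[word], "total": total[word]}
        for word in sorted(total)
    ]

word_rows = word_list_to_rows([["free", "prize", "free"], ["free", "meeting"]])
for row in word_rows:
    print(row)
```

Rows like these are easy to write to a database table and query later.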
Thank you for reading.