RapidMiner Tutorial: Email Spam Detection Using Support Vector Machine Classification

August 19, 2013 at 11:51 PM, Buddy James


Hello and welcome to another tutorial on RapidMiner, one of the best open source data mining tools available.  This article will discuss how to use Support Vector Machine (SVM) classification in RapidMiner to detect spam in your POP3 or IMAP email account.  Since we will read email from a POP3 or IMAP account, you may want to check out my previous post, Text mining: How to mine e-mail data from an IMAP account using RapidMiner.

Process setup

Unlike my other tutorials, this tutorial will explain a RapidMiner solution that consists of two processes.  The first process will be used to create a support vector machine classification model that is trained to identify two classes of text documents, spam and not spam.  We will then store the model in the RapidMiner repository for use in the second process.  The second process will read through an email account that you specify and classify each email as spam or not spam based on the SVM model that you created in the first process.  The methods used in these processes can be used to classify many different types of documents.

Process one: Train the model using text files

In the first process, we will use the Process Documents from Files operator to read two directories in which you will create text files that represent examples of spam and non-spam emails.  Keep in mind that the quality of the model depends on the number of examples you provide: the more examples you use, the better the model is likely to perform when it's time to classify unlabeled emails.

As you may know, the Process Documents operator contains a subprocess (identified by the little blue box in the bottom right corner).  When you double click on the blue box, you can add operators that will process the documents that are read by the main operator.  When you use text mining to generate word vectors, there are several operators that you will find yourself using often.  In this particular example, you will use Transform Cases (to change all words to lowercase), Tokenize (to separate the text into tokens), Filter Tokens (by Length) (to keep only words that are at least 4 characters long), Stem (Snowball) (to reduce each word to its "base word"), and finally Filter Stopwords, which removes common words that carry little meaning.

The Process Documents operator has a property called Text directories with a button titled "Edit List".  Click this button and create two entries.  The class name represents the classes that you will use for classification.  The directory is the directory that contains the training text files that you've created for each class.  For our experiment, we will create a class called spam and provide a directory that contains text files full of spam email messages.  The other class will be called not spam, and its directory will contain text files where you've copied and pasted the text from legitimate emails.
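For readers who think better in code, here is a rough Python sketch of what the text-processing subprocess does to each document.  It is not what RapidMiner runs internally.  The stopword list is a tiny made-up sample, `crude_stem` is a simple suffix stripper standing in for the real Snowball stemmer, and the word vectors use plain term counts.  All function names are hypothetical and chosen only for illustration.

```python
import re

# Tiny illustrative stopword list (an assumption); real lists are much longer.
STOPWORDS = {"this", "that", "with", "from", "have", "your", "will", "they", "been", "were"}

# Suffixes removed by the stand-in stemmer. This is NOT the Snowball algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(token):
    """Strip one common suffix, keeping at least 4 characters of the word."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(text, min_length=4):
    """Apply the same steps, in the same order, as the subprocess described above."""
    lowered = text.lower()                                  # Transform Cases
    tokens = re.findall(r"[a-z]+", lowered)                 # Tokenize on non-letters
    tokens = [t for t in tokens if len(t) >= min_length]    # Filter Tokens (by Length)
    tokens = [crude_stem(t) for t in tokens]                # Stem (stand-in)
    return [t for t in tokens if t not in STOPWORDS]        # Filter Stopwords


def build_word_vectors(documents):
    """Return (vocabulary, vectors) where each vector holds term counts."""
    token_lists = [preprocess(doc) for doc in documents]
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    vectors = [[tokens.count(word) for word in vocabulary] for tokens in token_lists]
    return vocabulary, vectors


tokens = preprocess("Click HERE for free offers, this is amazing!")
vocabulary, vectors = build_word_vectors(
    ["Free offers, free prizes", "Meeting notes attached"]
)
```

Each document ends up as one row of numbers, with one column per word in the vocabulary.  That table is the kind of input a classifier such as an SVM learns from.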

Here is an image of the first process that we will use to train and store the model.

As you can see, we have a Process Documents from Files operator, an X-Validation operator (cross validation) to estimate the performance of our model, and two Store operators: one stores the word vectors created by Process Documents, and the other stores the SVM model that comes from the output of the cross validation operator.  The Process Documents and X-Validation operators are both nested operators (which means they contain subprocesses).  I will provide screenshots so you can see how each subprocess is set up.
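To make the idea of cross validation around an SVM more concrete, here is a small Python sketch.  It is an illustration under stated assumptions, not RapidMiner's implementation: the SVM is a plain linear one trained by sub-gradient descent on the hinge loss (RapidMiner's SVM operator uses its own solver), the folds are assigned by simple index arithmetic, and the toy data and all function names are hypothetical.

```python
def train_linear_svm(X, y, epochs=50, lr=0.1, reg=0.01):
    """Fit a linear SVM by sub-gradient descent on the hinge loss.

    X is a list of feature lists, y a list of labels in {-1, +1}.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # L2 regularisation: shrink the weights a little on every step.
            w = [wj * (1 - lr * reg) for wj in w]
            if margin < 1:
                # Example is inside the margin or misclassified: move towards it.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(model, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def cross_validate(X, y, k=2):
    """Mean accuracy over k folds; fold i tests the rows where index % k == i."""
    accuracies = []
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        model = train_linear_svm([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        accuracies.append(hits / len(test))
    return sum(accuracies) / k


# Toy word-count vectors: column 0 is a "spammy" word, column 1 a "normal" word.
X = [[2, 0], [3, 0], [0, 2], [0, 3]]
y = [1, 1, -1, -1]  # +1 = spam, -1 = not spam
mean_accuracy = cross_validate(X, y, k=2)
```

The point of the validation step is that every example is scored by a model that never saw it during training, so the reported accuracy is a fairer estimate of how the model will behave on new emails.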

Process documents 



Run the process to save the model and to view the performance of the model.
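If you were rebuilding this outside RapidMiner, the two Store operators would amount to saving two objects to disk: the word list and the trained model.  Below is a minimal Python sketch using the standard pickle module.  The file names, the `store` and `retrieve` helpers, and the example values are all hypothetical stand-ins.

```python
import pickle
import tempfile
from pathlib import Path


def store(obj, path):
    """Serialise obj to path, similar in spirit to a Store operator."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def retrieve(path):
    """Load a previously stored object from path."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical stand-ins for the word list and a trained (weights, bias) model.
vocabulary = ["free", "meeting", "prize"]
model = ([0.8, -0.6, 0.7], 0.0)

# A throwaway folder playing the role of the repository.
repository = Path(tempfile.mkdtemp())
store(vocabulary, repository / "wordlist.pkl")
store(model, repository / "svm_model.pkl")

restored_vocabulary = retrieve(repository / "wordlist.pkl")
restored_model = retrieve(repository / "svm_model.pkl")
```

Saving the word list next to the model matters because new emails have to be turned into vectors with the same columns the model was trained on.  Only unpickle files that you created yourself, since loading a pickle can execute arbitrary code.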

Process two: Classify emails using the stored model

In the next process, we will read the SVM model that we stored in the repository in process one and use it to classify emails read by the Process Documents from Mail Store operator.  This process is pretty straightforward.  Simply add a Retrieve operator to read the model from the repository.  Next, add a Process Documents from Mail Store operator to read the documents to be classified by the model.  If you need to know how to set the properties on the Process Documents from Mail Store operator, please check out my previous article, Text mining: How to mine e-mail data from an IMAP account using RapidMiner.

Here is how the 2nd process should look:

The Retrieve operator reads the previously trained SVM model from the repository, and we use the Apply Model operator to get the classification.  We also add a Process Documents from Mail Store operator to read email messages and use them as the unlabeled input to Apply Model.  The Process Documents from Mail Store subprocess contains the same text mining operators as the Process Documents operator in process one.  Next, just run the process and you will see all emails and their predicted class.
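The second process can be pictured in Python roughly as follows.  This is a sketch under assumptions, not a description of RapidMiner's internals: the emails are given as raw message strings instead of being fetched from a mail server, the model is a hypothetical (weights, bias) pair for a linear SVM, tokenizing is reduced to lowercasing and splitting on whitespace, and the function names are invented for illustration.

```python
from email import message_from_string


def body_text(raw_message):
    """Extract the plain-text body from a raw email (no transfer decoding)."""
    msg = message_from_string(raw_message)
    if msg.is_multipart():
        parts = [
            part.get_payload()
            for part in msg.walk()
            if part.get_content_type() == "text/plain"
        ]
        return "\n".join(parts)
    return msg.get_payload()


def vectorize(tokens, vocabulary):
    """Count how often each vocabulary word occurs, in vocabulary order."""
    return [tokens.count(word) for word in vocabulary]


def classify(raw_message, vocabulary, model, labels=("not spam", "spam")):
    """Apply a linear model (weights, bias) to one email and return a label."""
    weights, bias = model
    # Placeholder preprocessing; in practice reuse the exact training steps.
    tokens = body_text(raw_message).lower().split()
    x = vectorize(tokens, vocabulary)
    score = sum(wj * xj for wj, xj in zip(weights, x)) + bias
    return labels[1] if score >= 0 else labels[0]


# Hypothetical stored word list and model.
vocabulary = ["free", "meeting", "prize"]
model = ([1.0, -1.0, 1.0], 0.0)

spam_raw = "Subject: You won\n\nfree prize free\n"
ham_raw = "Subject: Agenda\n\nmeeting tomorrow at noon\n"

spam_label = classify(spam_raw, vocabulary, model)
ham_label = classify(ham_raw, vocabulary, model)
custom_label = classify(ham_raw, vocabulary, model, labels=("Good", "Bad"))
```

The important detail is that the new emails are vectorized against the stored vocabulary, which is why the subprocess in process two has to mirror the one used for training.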

Remember, the class names used by the model are arbitrary; you could just as easily set them to Positive/Negative, Good/Bad, Happy/Sad, etc.  This process can be used for many different applications such as sentiment analysis.

Thanks for reading!

Comments (11) -

Rizwan Ali
Islamic Republic of Pakistan Rizwan Ali says:

Hi James,

Is there any operator that verifies the links in the mail body?

I'm thinking that if we can identify links in the mail body, we can use them as one of the parameters to identify spam.

Also, can you share your thoughts on how we can test the model on the mail that comes into my inbox, the way it is usually checked for spam?

Kindly reply.




Buddy James
United States Buddy James says:

Thanks for reading Rizwan!

As far as validating the email addresses, I'm not aware of any "out of the box" operators.  However, I would probably write a RESTful web service in ASP.NET and then write a macro in RapidMiner that would extract the email address and call the web service to validate it.  It would be interesting and a bit of a challenge, but RapidMiner is a wonderful product, and I'm sure that you could find a way to make it happen.

As far as your second question, I believe you are asking how RapidMiner could check the email in your inbox and move mail that it considers spam out of it.  If that is what you are asking, I'm not really sure how that would work.  For something like that, I'd probably suggest another web service to handle the housekeeping of your mailbox and have RapidMiner call the web service and report the messages that should be marked as spam.  You can also extend RapidMiner if you know Java (RapidMiner 5 is open source).  I hope these suggestions help you on your quest to find answers.  Thanks again for reading.  Sincerely, Buddy.


You're so cool! I do not believe I've read something like this before. So wonderful to discover someone with a few genuine thoughts on this subject matter. Seriously, thanks for starting this up. This site is something that's needed on the internet, someone with some originality!


Hi Buddy James, may I have a copy of the data set that you used in this example? I hope that by using your sample I can better understand how exactly email is distinguished between spam and non-spam.

Thanks a lot.


"For our experiment, we will create a class called Spam and provide a directory that contains text files where you've copied and pasted the text from legitimate emails.  The other class will be called not spam and the directory will contain text files full of spam email messages."

I believe that is not correct. Why would you make a class called spam and fill it with legitimate emails?

If it's right, kindly explain.


Hi James
I am trying to do a test on spam and non-spam messages. I realised that after adding Filter Stopwords and Transform Cases, the accuracy drops by about 6 percentage points. Any idea what the reasons could be?


Nikhil V Mathew
India Nikhil V Mathew says:

In the screenshot for process 2, a Retrieve operator is used, but it can't read a model.
Is there a mistake in that screenshot for process 2?
Kindly explain.


Buddy James
United States Buddy James says:


Thanks for reading. I don't see the error that you are talking about. Can you elaborate?



Nikhil V Mathew
India Nikhil V Mathew says:

Hi James,
In the explanation for process 2 you mentioned about using read document from mail operator. But in the screen shot for process 2 I can't find that operator. Is it inside the process doc from mail operator?


Garen W
United States Garen W says:

Hi, thanks for your tutorial!

Could these exact same steps be used for other models such as naive bayes?

I tried both of them (just changed SVM(linear) in validation to Naive Bayes), but the results were slightly worse.

This wasn't expected since I'm also working with a spam classification problem and I read in different places that bayesian networks should probably achieve better results than any other machine learning algorithm. Do you know why this could be the case?


Buddy James
United States Buddy James says:

Hi Garen,

Thanks for reading!  You can use many different algorithms when attempting to solve a classification problem.  You can even use multiple algorithms in the same model (an ensemble).  

The reason that you've seen different results is that you need to know how to optimize the data and choose the algorithm(s) for the problem at hand.

Thanks again for reading!

Buddy James

