RapidMiner Tutorial: Email Spam Detection Using Support Vector Machine Classification

August 19, 2013 at 11:51 PM by Buddy James
Hello and welcome to another tutorial on RapidMiner, one of the best open source data mining tools available. This article will show how to use Support Vector Machine (SVM) classification in RapidMiner to detect spam in your POP3 email account. Since we will read email from a POP3 or IMAP account, you may want to check out my previous post, Text mining: How to mine e-mail data from an IMAP account using RapidMiner.

Process setup

Unlike my other tutorials, this one covers a RapidMiner solution that consists of two processes. The first process creates a support vector machine classification model trained to identify two classes of text documents: spam and not spam. We then store the model in the RapidMiner repository for use in the second process. The second process reads through an email account that you specify and classifies each email as spam or not spam based on the SVM model created in the first process. The same approach can be used to classify many other types of documents.

Process one: Train the model using text files

In the first process, we use the Process Documents from Files operator to read two directories containing text files that represent examples of spam and non-spam emails. It's important to remember that the more examples you provide, the better the model will perform when it's time to classify unlabeled emails.

As you may know, the Process Documents operator contains a subprocess (identified by the little blue box in the bottom right corner). When you double click the blue box, you can add operators that will process each document read by the main operator. When generating word vectors for text mining, there are several operators that you will find yourself using often. In this example, we use Transform Cases (to change all words to lowercase), Tokenize (to split the text into word tokens), Filter Tokens (by Length) (to keep only words at least 4 characters long), Stem (Snowball) (to reduce each word to its base form), and finally Filter Stopwords to remove common words that carry no meaning.

The Process Documents from Files operator has a parameter called text directories with a button titled "Edit List". Click this button and create two entries. The class name represents the class that you will use for classification, and the directory is the folder containing the training text files you've created for that class. For our experiment, we create a class called spam whose directory contains text files filled with the text of spam email messages, and a class called not spam whose directory contains text files with the text of legitimate emails. Again, the accuracy of the model will depend on the number of examples that you provide.

Here is an image of the first process that we will use to train and store the model. As you can see, we have a Process Documents from Files operator, an X-Validation operator (cross-validation) to estimate the performance of our model, and two Store operators: one to store the word vector output of Process Documents, and one to store the SVM model that comes from the output of the cross-validation operator.
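If you're curious how this training process maps to code, here is a minimal scikit-learn sketch of the same idea. This is not RapidMiner's API; the spam/ and not_spam/ directory names and the ten folds are assumptions mirroring the setup described above, and it skips the stemming step.

```python
# A minimal scikit-learn sketch of the training process described above.
# The spam/ and not_spam/ directory names are assumptions mirroring the
# two classes configured in Process Documents from Files.
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
import joblib

texts, labels = [], []
for label in ("spam", "not_spam"):
    for f in Path(label).glob("*.txt"):
        texts.append(f.read_text(errors="ignore"))
        labels.append(label)

# Lowercasing, tokenizing, length filtering, and stopword removal, roughly
# matching the Transform Cases / Tokenize / Filter Tokens (by Length) /
# Filter Stopwords operators (no Snowball stemming in this sketch).
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english",
                    token_pattern=r"(?u)\b\w{4,}\b"),  # keep >= 4-char tokens
    LinearSVC(),
)

# 10-fold cross-validation, the role X-Validation plays in the process.
print(cross_val_score(model, texts, labels, cv=10).mean())

model.fit(texts, labels)
joblib.dump(model, "svm_spam_model.joblib")  # the "Store" step
```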
The Process Documents and X-Validation operators are both nested (which means they contain subprocesses). I will provide screenshots so you can see how each subprocess is set up.

Process Documents

X-Validation

Run the process to save the model and to view its performance.

Process two: Classify email with the stored model

In the next process, we read the SVM model that we stored in the repository in process one and use it to classify emails coming from the Process Documents from Mail Store operator. This process is pretty straightforward. Simply add a Retrieve operator to read the model from the repository. Next, add a Process Documents from Mail Store operator to read the documents to be classified by the model. If you need to know how to set the properties on the Process Documents from Mail Store operator, please check out my previous article Text mining: How to mine e-mail data from an IMAP account using RapidMiner.

Here is how the second process should look: the Retrieve operator reads the previously trained SVM model from the repository, and an Apply Model operator produces the classification. The Process Documents from Mail Store operator reads the email messages and feeds them to Apply Model as the unlabeled input. Its subprocess contains the same text mining operators used in the Process Documents operator in process one.

Now just run the process and you will see all emails with their predicted class. Remember, the class names in the model are arbitrary; you could just as easily use Positive/Negative, Good/Bad, Happy/Sad, etc. The same technique can be used for many different applications, such as sentiment analysis.

Thanks for reading!
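And here is the matching sketch of the second process: load the stored model and score new messages. The sample emails are placeholders; in practice the text would come from your mail account.

```python
# A minimal sketch of the second process: load the stored model and
# classify new messages. new_emails stands in for text pulled from a
# mail account (see the IMAP article linked above).
import joblib

model = joblib.load("svm_spam_model.joblib")  # the "Retrieve" step

new_emails = [
    "Congratulations! You have won a free cruise, click here to claim.",
    "Hi Buddy, are we still on for the code review tomorrow morning?",
]

# The "Apply Model" step: predict a class for each unlabeled message.
for text, label in zip(new_emails, model.predict(new_emails)):
    print(f"{label!r:12} <- {text[:50]}")
```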


RapidMiner tips and tricks #1: How to use SQL Server named instances with RapidMiner Read/Write Database operators

July 14, 2013 at 7:29 PM by Buddy James
Hello and welcome to my first of many tips and tricks for RapidMiner. If you are unfamiliar with RapidMiner, it's an open source, Java-based data mining solution. You can visit the official RapidMiner website by clicking here. My plan is to write a short article whenever I find the solution to a problem I encounter as I learn more about this awesome application.

RapidMiner and database connectivity

There are many operators in RapidMiner that take input data sets and generate models for prediction and analysis. Often, you will want to write the result set of a model to a database, and to do this you use the "Write Database" operator.

I was using RapidMiner for web mining by way of the Crawl Web operator. The example set output of the Crawl Web operator was connected to the input of the Write Database operator. At the time I was using a SQL Server database that I pay for through my web hosting account. Like most everything in RapidMiner, the setup was easy and worked like a charm. However, my database size quota was 200MB on my hosting plan, and it became apparent that I would quickly run out of space. So I decided to use the local SQL Express 2012 named instance on my machine instead. This is where the problem was introduced: I couldn't figure out how to successfully set up the database connection in RapidMiner.

RapidMiner, named instances, and integrated security

The issues I encountered when trying to set up my local SQL Server 2012 named instance were as follows:

If I used the named instance for the server name (localhost\SQLExpress), I was unable to connect. I didn't have this problem with my hosting provider's database because it was a plain hostname (xxx.sqlserverdb.com) with no instance name, so the configuration was easy.

I wasn't sure how to specify integrated security, since that is something you usually specify in the connection string. I didn't have this problem with the hosting database server either, because I was given a username and password to connect with.

After some research and banging my head against my laptop, I finally figured out the resolution to both problems, and I'm here to save someone else the headache.

For the named instance issue, there is a trick that is not readily apparent. You set your database server name as usual (in my case, localhost), but when you specify the database name, you append a semicolon (;) followed by instance=<instance name>. So for my local named instance (localhost\sqlexpress), I set the Host value to localhost and the Database scheme value to mydatabasename;instance=sqlexpress. Under the hood this simply becomes part of the jTDS JDBC URL, e.g. jdbc:jtds:sqlserver://localhost/mydatabasename;instance=sqlexpress.

As for the integrated security requirement, all you need to do is make sure that you have the latest jTDS SQL Server driver from here. Once you download the zip file, extract the file jtds-1.3.0-dist.zip\x86\SSO\ntlmauth.dll and place it in your windows\system32 directory (if you run a 64-bit JVM, use the copy from the x64\SSO folder instead). This ensures that the driver can use integrated security. Once this file is in place, you simply leave the username and password values blank.

Here is a screenshot of the Manage Database Connections window in RapidMiner for your reference.

Well, that about wraps it up. Please leave a comment if you have any questions.

Until next time,

Buddy James
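If you want to sanity-check that URL outside of RapidMiner, here is a small sketch using the jaydebeapi Python package against the same jTDS jar. The database name, jar path, and query are placeholders, and I'm assuming jaydebeapi's behavior of passing no credentials to the driver when driver_args is None.

```python
# A quick way to verify the jTDS URL outside RapidMiner, using the
# jaydebeapi package (pip install jaydebeapi). Database name, jar path,
# and query are placeholders; adjust them for your setup.
import jaydebeapi

url = "jdbc:jtds:sqlserver://localhost/mydatabasename;instance=sqlexpress"

# No credentials are supplied, so jTDS attempts Windows single sign-on,
# provided ntlmauth.dll is in place as described above.
conn = jaydebeapi.connect(
    "net.sourceforge.jtds.jdbc.Driver",
    url,
    None,
    "jtds-1.3.0.jar",
)
curs = conn.cursor()
curs.execute("SELECT @@SERVERNAME")
print(curs.fetchall())
curs.close()
conn.close()
```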


Posted in: Analytics | BI | Data mining | RapidMiner | SQL Server | Tutorial


Machine Learning tutorial: How to create a decision tree in RapidMiner using the Titanic passenger data set

July 14, 2013 at 7:26 PM by Buddy James
Greetings! And welcome to another wam bam, thank you ma'am, mind blowing, flex showing, machine learning tutorial here at refactorthis.net!

This tutorial is based on a machine learning toolkit called RapidMiner by Rapid-I. RapidMiner is a full featured, Java-based, open source machine learning toolkit with support for all of the popular machine learning algorithms used in data analytics today. The library supports the following algorithms (to name a few):

k-NN
Naive Bayes (kernel)
Decision Tree (Weight-based, Multiway)
Decision Stump
Random Tree
Random Forest
Neural Networks
Perceptron
Linear Regression
Polynomial Regression
Vector Linear Regression
Gaussian Process
Support Vector Machine (Linear, Evolutionary, PSO)
Additive Regression
Relative Regression
k-Means (kernel, fast)
And much, much more!

Excited yet? I thought so!

How to create a decision tree using RapidMiner

When I first ran across screenshots of RapidMiner online, I thought to myself, "Oh boy.. I wonder how much this is going to cost...". The UI looked so amazing. It's like Visual Studio for data mining and machine learning! Much to my surprise, I found out that the application is open source and free! Here is a quote from the RapidMiner site:

RapidMiner is unquestionably the world-leading open-source system for data mining. It is available as a stand-alone application for data analysis and as a data mining engine for the integration into own products. Thousands of applications of RapidMiner in more than 40 countries give their users a competitive edge.

I've been trying some machine learning "challenges" recently to sharpen my skills as a data scientist, and I decided to use RapidMiner to tackle the kaggle.com machine learning challenge called "Titanic: Machine Learning from Disaster". The data set is a CSV file that contains information on many of the passengers of the infamous Titanic voyage. The goal of the challenge is to take one CSV file containing training data (which contains all attributes as well as the label, Survived) and a testing data file containing only the attributes (no Survived label), and to predict the Survived label of the testing set based on the training set.

Warning: Although I'm not going to provide the complete solution to this challenge, if you are working on it yourself, you should probably stop reading this tutorial. I do provide some insights into the survival data found in the training data set, and it's best to work the challenge out on your own. After all, we learn by TRYING, FAILING, TRYING AGAIN, THEN SUCCEEDING. I'd also like to say that I'm going to do my very best to go easy on the THEORY in this post. I know that some of my readers like to get straight to the action :) You have been warned..

Why a decision tree?

A decision tree model is a great way to visualize a data set and determine which attributes influenced a particular classification (label). A decision tree looks like a tree with branches, flipped upside down. Perhaps a (cheesy) image will illustrate..

After you are finished laughing at my drawing, we may proceed....... OK.

In my example, imagine that we have a data set related to lifestyle and heart disease. Each row has a person's sex, age, Smoker (y/n), Diet (good/poor), and a label, Risk (Less Risk/More Risk). The data indicates that the biggest influence on Risk turns out to be the Smoker attribute, so Smoker becomes the first branch in our tree.
For smokers, the next most influential attribute happens to be Age; for non-smokers, the data indicates that Diet has a bigger influence on Risk. The tree keeps branching like this until a classification is reached or the maximum "depth" that we establish is reached. So as you can see, a decision tree can be a great way to visualize how a decision is derived from the attributes in your data.

RapidMiner and data modeling

Ready to see how easy it is to create a prediction model using RapidMiner? I thought so!

Create a new process

When you are working in RapidMiner, your project is known as a process, so we will start by running RapidMiner and creating a new process. The version of RapidMiner used in this tutorial is 5.3. Once the application is open, you will be presented with the start screen, where you will click on New Process.

You are then presented with the main user interface for RapidMiner. One of the most compelling aspects of RapidMiner is its ease of use and intuitive user interface. The basic flow of this process is as follows:

Import your test and training data from CSV files into your RapidMiner repository. This can be found in the repository menu under Import CSV file.
Once your data has been imported into your repository, the datasets can be dragged onto your process surface so you can apply operators.
Add your training data to the process.
Next, add your testing data to the process.
Search the operators for Decision Tree and add the operator.
In order to use your training data to generate a prediction on your testing data using the Decision Tree model, add an "Apply Model" operator to the process. This operator has an input that you will associate with the output model of your Decision Tree operator, and another input that takes "unlabeled" data from the output of your testing dataset.
Attach the outputs of Apply Model to the result connectors on the right side of the process surface.

Once you have designed your model, RapidMiner will show you any problems with your process and, where possible, will offer "Quick fixes" that you can double click to resolve. Once all problems have been resolved, you can run your process and you will see the results that you wired up to the result connectors.

Here are screenshots of the entire process for your review:

Empty process.

Add the training data from the repository by dragging and dropping the dataset that you imported from your CSV file.

Repeat the process and add the testing data underneath the training data.

Now search the operators window for the Decision Tree operator and add it to your process. You associate the inputs and outputs of operators and data sets by clicking on the output of one item and then clicking on the input of another. Here we are connecting the output of the training dataset to the input of the Decision Tree operator.

Next we add the Apply Model operator, then create the appropriate connections for the model.

Observe the quick fixes in the problems window at the bottom; you can double click them to resolve the issues. You will be prompted to make a simple decision regarding each problem that was detected. Once you resolve one problem, other problems may appear, so be sure to resolve all of them before running your process.

Here is the process after resolving all problems.
Next, I select the Decision Tree operator and adjust the following parameters: change Maximum Depth from 20 to 5, and check both "no pruning" boxes so the tree is not pruned. Once this has been done, you can run your process and observe the results. Since we connected both the model and the labeled result to the output connectors of the process, we are presented with a visual display of our decision tree (the model) as well as the test data set with the prediction applied.

(Decision Tree Model)

(The example test result set with the predictions applied)

As you can see, RapidMiner makes complex data analysis and machine learning tasks extremely easy, with very little effort. This concludes my tutorial on creating decision trees in RapidMiner.

Until next time,

Buddy James
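For readers who want to compare, here is a rough scikit-learn equivalent of the process we just built. This is not RapidMiner; restricting the features to Pclass, Sex, and Age is my own simplification of the Kaggle train.csv and test.csv files.

```python
# A rough scikit-learn equivalent of the RapidMiner process above.
# train.csv and test.csv are the Kaggle Titanic files; limiting the
# features to Pclass, Sex, and Age is my own simplification.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

features = ["Pclass", "Sex", "Age"]

def prepare(df):
    X = df[features].copy()
    X["Sex"] = X["Sex"].map({"male": 0, "female": 1})  # encode as numeric
    X["Age"] = X["Age"].fillna(X["Age"].median())      # fill missing ages
    return X

# max_depth=5 mirrors the Maximum Depth parameter set in RapidMiner.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(prepare(train), train["Survived"])

# The "Apply Model" step: predict Survived for the unlabeled test set.
test = test.assign(prediction=tree.predict(prepare(test)))
print(test[["PassengerId", "prediction"]].head())
```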


Posted in: Analytics | BI | Data mining | machine learning | RapidMiner | Tutorial


RapidMiner tutorial: How to explore correlations in your data to discover the relevance of attributes

July 14, 2013 at 7:08 PM by Buddy James
What is correlation?

From Wikipedia:

In statistics, dependence refers to any statistical relationship between two random variables or two sets of data. Correlation refers to any of a broad class of statistical relationships involving dependence.

In layman's terms, correlation is a relationship between data attributes. For a quick refresher: in data mining, a dataset is made up of different attributes, and we use these attributes to classify or predict a label. Some attributes have more "meaning" or influence over the label's value than others. As you can imagine, if you can determine how much influence specific attributes have, you are in a better position to build a classification model, because you will know which attributes to focus on.

In this example, I will use the dataset from the kaggle.com Titanic data mining challenge. This post will not uncover any information that is not readily available in the tutorial posted on kaggle.com.

Here are two screenshots. The first shows some statistics about the dataset; the second shows a sample of the data.

Meta data view of the Titanic data mining challenge training dataset

A data view of the dataset

The correlation matrix

First, import the Titanic training dataset into RapidMiner. You can use Read CSV, Read Excel, or Read Database to achieve this step. Next, search for the "Correlation Matrix" operator and drag it onto the process surface. Connect the Titanic training dataset's output port to the Correlation Matrix operator's example input port. Your process should look like this.

Now run the process and observe the output. You are presented with several different result views. The first is the attribute weights view, which displays the "weight" of each attribute. The purpose of this tutorial, however, is to explain a different view, so click on the Correlation Matrix view. This is a matrix of correlation coefficients, which measure the strength of the relationship between attributes. An easy way to get oriented is to notice that wherever an attribute intersects with itself, you have a dark blue cell with the value 1, the strongest possible value, because any attribute matched with itself is a perfect correlation.

A correlation coefficient can be positive or negative. A negative value does not mean there is less of a relationship between the attributes; the sign indicates the direction of the relationship, while the magnitude indicates its strength. The larger the coefficient in either direction, the stronger the relationship between the two attributes. If we look at the matrix and follow along the top row (survived), we see the attributes that correlate most strongly with the label we are trying to predict. Just as the kaggle.com tutorial specifies, the attributes with the strongest correlation to the label (survived) are sex (0.295), pclass (0.115), and fare (0.66). Remember that the value as well as the color will help you visually identify the stronger correlations between attributes.

If you are working on a classification problem, I'm sure you can see how valuable the correlation matrix can be in showing you the relationships between your label and attributes. Such insights provide a great start on where to focus your attention when building your classification model.

Thanks for reading, and keep your eyes open for my next tutorial!
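If you'd like to reproduce this exploration outside RapidMiner, here is a small pandas sketch of the same correlation matrix. Encoding Sex as 0/1 is my own step, since a Pearson correlation matrix needs numeric columns, and the attribute subset is an assumption for illustration.

```python
# The same exploration in pandas, for comparison. train.csv is the Kaggle
# Titanic training file; mapping Sex to 0/1 is my own step, since a
# Pearson correlation matrix needs numeric columns.
import pandas as pd

train = pd.read_csv("train.csv")
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})

corr = train[["Survived", "Pclass", "Sex", "Age", "Fare"]].corr()

# Sort attributes by the magnitude of their correlation with the label,
# mirroring a read across the "survived" row of the matrix.
print(corr["Survived"].drop("Survived").abs().sort_values(ascending=False))
```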


Posted in: Analytics | BI | Data mining | machine learning | RapidMiner | Tutorial


Text mining: How to mine e-mail data from an IMAP account using RapidMiner

July 14, 2013 at 5:39 PM by Buddy James
In this article, I will show you how to use RapidMiner, one of the best open source data mining solutions on the internet, to read email data from an IMAP or POP3 account for storage and processing. I'll cover the basic text mining operators such as Transform Cases, Filter Stopwords, Stem, and Tokenize.

Here is a picture of the main process. This process allows me to mine email messages and write the information that I'm interested in to a SQL Server database.

The operator of interest at the beginning of the process is the Process Documents from Mail Store operator. This operator allows you to specify host details and login credentials in order to bring email messages into RapidMiner. Like the other Process Documents operators, it contains a subprocess where you can add operators to assist with munging your data; simply double click the blue boxes on the operator to enter the subprocess. Here is a screenshot of the processing operators that I use inside the Process Documents operator.

There is a host property (webmail.yourdomain.com) and username/password properties to authenticate with the mail server, as well as a protocol drop-down list. You can specify that you only want to read unread emails by ticking the checkbox that reads "only unseen", and you can mark emails as read by using the "mark seen" checkbox. Here is an example of the properties for the Process Documents operator.

When you run your process, it may take a while depending on how many emails you plan to mine. The results that I'm writing to a database are as follows:

#1 The example set from the Select Attributes operator. I used the Select Attributes operator because, for some reason, there were two received columns in the result set, which gave me trouble when writing to the database. So I used Select Attributes to select all attributes except the extra received attribute.

#2 The example set from the WordList to Data operator.

As you can see, this is an incredible source of data. The data also offers classification modeling opportunities (I'm working on an article on detecting spam using RapidMiner; check back soon).

Thank you for reading.
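For the curious, here is a minimal Python sketch of roughly what the Process Documents from Mail Store operator does over IMAP, using only the standard library. The host, credentials, and folder are placeholders, and this is an illustration of the protocol flow rather than RapidMiner's own implementation.

```python
# A minimal sketch of the IMAP flow behind Process Documents from Mail
# Store: connect, fetch unseen messages, and extract their text.
# Host, credentials, and folder are placeholders.
import email
import imaplib
from email.policy import default

with imaplib.IMAP4_SSL("webmail.yourdomain.com") as imap:
    imap.login("user@yourdomain.com", "password")
    imap.select("INBOX")

    # UNSEEN mirrors the "only unseen" checkbox described above.
    _, data = imap.search(None, "UNSEEN")
    for num in data[0].split():
        _, fetched = imap.fetch(num, "(RFC822)")
        msg = email.message_from_bytes(fetched[0][1], policy=default)
        body = msg.get_body(preferencelist=("plain",))
        text = body.get_content() if body else ""
        print(msg["Subject"], "->", text[:60])
        # Fetching RFC822 flags the message \Seen, much like "mark seen".
```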


Posted in: Data mining | Text mining | Analytics | RapidMiner | Tutorial | SQL Server
