RapidMiner 6 has been released

November 24, 2013 at 10:02 PM | Buddy James
The highly anticipated RapidMiner 6, named RapidMiner Studio, has been released. As I'm sure you can tell from my blog, I'm a huge fan of RapidMiner. It makes data mining tasks extremely easy for anyone to learn. This is made possible by the user interface design, which is based on a WYSIWYG editor: you drag and drop process operators in the design view and connect them through input and output ports, which provides a simple and effective way to model the flow of data through the process. While these are the primary reasons that I love RapidMiner, they are nothing new. So let's take a look at the new features added to this awesome application.

New logo for a new brand

The first thing that I noticed was the changes made to the RapidMiner website. The website has been completely redesigned. The changes include a new logo as well as a new presentation of the application, which looks more like a product advertisement than an open source project page. This is because there are now several different versions of the application, including the open source, free community edition.

Here is a list of the different versions from the RapidMiner website.

Edition | Price | Main memory | Data sources | Databases | Support | License term
STARTER | Free | 1 GB | CSV and Excel | None | Community support | Unlimited
PERSONAL | $999/yr. | 4 GB | Common types | Open source databases | Community support | 1 year
PROFESSIONAL (featured) | $2999/yr. | 8 GB | Common types | All database systems | Community support | 1 year
ENTERPRISE | Ask | Unlimited | Common types + SPSS, SAS, HDFS | All database systems | Enterprise support | 1 year

New features

Process templates

The new version of RapidMiner Studio contains process templates geared toward specific data mining problems. The templates and new tutorials look to be great additions to this already stellar package.

I plan to download a trial of the professional version and write a full review. Stay tuned!


Posted in: Analytics | BI | Data mining | RapidMiner | Review


RapidMiner tips and tricks #1 How to use SQL Server named instances with RapidMiner Read/Write database operators

July 14, 2013 at 7:29 PM | Buddy James
Hello and welcome to the first of many tips and tricks for RapidMiner. If you are unfamiliar with RapidMiner, it's an open source, Java-based data mining solution. You can visit the official RapidMiner website to learn more. My plan is to write short articles that provide solutions to the problems I encounter as I learn more about this awesome application.

RapidMiner and database connectivity

There are many operators in RapidMiner that take input data sets and generate models for prediction and analysis. Often, you will want to write the result set of the model to a database. To do this you use the "Write Database" operator.

I was using RapidMiner for web mining by way of the Crawl Web operator. The example set output of the Crawl Web operator was connected to the input of the Write Database operator. At the time I was using a SQL Server database that I pay for through my web hosting account. Just like most everything in RapidMiner, the setup was easy and worked like a charm. However, my hosting plan had a database size quota of 200 MB, and it became apparent that I would quickly run out of space. As such, I decided to use the local SQL Server Express 2012 named instance on my machine. This is where the problem was introduced: I couldn't figure out how to successfully set up the database connection in RapidMiner.

RapidMiner, Named Instances, and Integrated Security

The issues that I encountered when trying to set up my local SQL Server 2012 named instance were as follows:

#1 If I used the named instance as the server name (localhost\SQLExpress), I was unable to connect. I didn't encounter this problem with my hosting server's database because it was a direct hostname (xxx.sqlserverdb.com). There was no instance name, so the configuration was easy.

#2 I wasn't sure how to specify integrated security, as this is something that you usually specify in the connection string. I didn't encounter this problem with my hosting database server either, because I was given a user name and password to connect to the server.

After some research and banging my head against my laptop, I finally figured out the resolution to my problems, and I'm here to save someone else the headache.

For the named instance issue, there is a trick that is not readily apparent. You set your database server name as usual (in my case, localhost). However, when you specify the database name, you append a semicolon (;) followed by instance=<instance name>. So for my local server instance (localhost\sqlexpress), I set the Host value to localhost and the Database scheme value to mydatabasename;instance=sqlexpress.

As for the integrated security requirement, all you need to do is make sure that you have the latest jTDS SQL Server driver. Once you download the zip file, extract the file jtds-1.3.0-dist.zip\x86\SSO\ntlmauth.dll and place it in your windows\system32 directory. This ensures that the driver is able to use integrated security. Once this file is in place, you simply leave the username and password values blank.

Here is a screen shot of the Manage Database Connections window in RapidMiner for your reference.

Well, that about wraps it up. Please leave a comment if you have any questions.

Until next time,
Buddy James
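
To see why the semicolon trick works, it can help to look at the JDBC URL that the connection settings presumably turn into. The sketch below is only an illustration in Python: the helper names (split_named_instance, connection_fields, jtds_url) are made up for this post, and the idea that RapidMiner simply joins the Host and Database scheme values into a jTDS URL is my assumption, not something I checked in the RapidMiner source. The URL layout itself (jdbc:jtds:sqlserver://host/database;property=value) and the instance property are part of the jTDS driver.

```python
def split_named_instance(server):
    """Split SQL Server notation such as localhost\\SQLExpress into (host, instance)."""
    host, _, instance = server.partition("\\")
    return host, instance


def connection_fields(server, database):
    """Return the values to type into the Host and Database scheme fields.

    The field names mirror the Manage Database Connections window; this
    helper itself is hypothetical.
    """
    host, instance = split_named_instance(server)
    scheme = f"{database};instance={instance}" if instance else database
    return {"host": host, "database_scheme": scheme}


def jtds_url(host, database_scheme, port=None):
    """Assemble a jTDS-style JDBC URL (assumed to match what RapidMiner builds)."""
    authority = host if port is None else f"{host}:{port}"
    return f"jdbc:jtds:sqlserver://{authority}/{database_scheme}"


fields = connection_fields("localhost\\sqlexpress", "mydatabasename")
print(fields["host"])             # localhost
print(fields["database_scheme"])  # mydatabasename;instance=sqlexpress
print(jtds_url(fields["host"], fields["database_scheme"]))
# jdbc:jtds:sqlserver://localhost/mydatabasename;instance=sqlexpress
```

Under that assumption, everything after the first semicolon in the Database scheme field ends up as a jTDS connection property, which is how the instance name reaches the driver.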


Posted in: Analytics | BI | Data mining | RapidMiner | SQL Server | Tutorial


RapidMiner tutorial: How to explore correlations in your data to discover the relevance of attributes

July 14, 2013 at 7:08 PM | Buddy James
What is correlation?

From Wikipedia: "In statistics, dependence refers to any statistical relationship between two random variables or two sets of data. Correlation refers to any of a broad class of statistical relationships involving dependence."

In layman's terms, correlation is a relationship between data attributes. For a quick refresher, in data mining, a dataset is made up of different attributes. We use these attributes to classify or predict a label. Some attributes have more "meaning" or influence over the label's value. As you can imagine, if you can determine the influence that specific attributes have over your data, you are in a better position to build a classification model, because you will know which attributes to focus on.

In this example, I will use the kaggle.com Titanic data mining challenge dataset. This post will not uncover any information that is not readily available in the tutorial posted on kaggle.com.

Here are two screenshots. The first screenshot shows some statistics about the dataset. The second screenshot shows a sample of the data.

Meta data view of the Titanic data mining challenge training dataset

A data view of the dataset

The correlation matrix

First, start by importing the Titanic training dataset into RapidMiner. You can use the Read CSV, Read Excel, or Read Database operator for this step. Next, search for the "Correlation Matrix" operator and drag it onto the process surface. Connect the output port of the Titanic training dataset to the example set input port of the Correlation Matrix operator. Your process should look like this.

Now run the process and observe the output. You are presented with several different result views. The first view will be the Correlation Matrix Attribute Weights view, which displays the "weight" of each attribute. The purpose of this tutorial is to explain a different view of the correlation matrix.
Click on the Correlation Matrix view. This is a matrix of correlation coefficients, each of which measures the strength of the relationship between two of our attributes. An easy way to get started with the correlation matrix is to notice that where an attribute intersects with itself, you have a dark blue cell with the value 1, which is the strongest possible value. This is because any attribute matched with itself is a perfect correlation. A correlation coefficient can be positive or negative. The sign only tells you the direction of the relationship, so a negative value does not mean there is less of a relationship between the attributes. The larger the absolute value of the coefficient, the stronger the relationship between those two attributes.

If we look at the matrix and follow along the top row (survived), we will see the attributes that have the strongest correlation with the label that we are trying to predict. Just as the kaggle.com tutorial specifies, the attributes with the strongest correlation with the label (survived) are sex (0.295), pclass (0.115), and fare (0.66). Remember that the value as well as the color will help you to visually identify the stronger correlations between attributes.

If you are working on a classification problem, I'm sure you can see how valuable the correlation matrix can be in showing you the relationships between your label and attributes. Such insights can provide a great start on where to focus your attention when building your classification model.

Thanks for reading and keep your eyes open for my next tutorial!
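
If you want to see what sits behind the cells of such a matrix, here is a minimal sketch in plain Python. It assumes the coefficient shown is the Pearson correlation coefficient, and the six rows of data are made up for illustration; they are not taken from the Titanic dataset, so the numbers will not match the screenshots. The function names pearson and correlation_matrix are my own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Build a nested dict: matrix[a][b] is the correlation of attributes a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up sample rows (hypothetical values, NOT the real Titanic data).
data = {
    "survived": [0, 1, 1, 0, 1, 0],
    "pclass": [3, 1, 2, 3, 1, 3],
    "fare": [7.25, 71.28, 13.0, 8.05, 53.1, 8.46],
}

matrix = correlation_matrix(data)

# Rank the attributes by the absolute value of their correlation with the label,
# because the sign only gives the direction of the relationship.
ranked = sorted(
    (name for name in data if name != "survived"),
    key=lambda name: abs(matrix["survived"][name]),
    reverse=True,
)
for name in ranked:
    print(name, round(matrix["survived"][name], 3))
```

Sorting by absolute value is the code equivalent of reading along the survived row and picking the darkest cells, whichever direction the relationship goes.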


Posted in: Analytics | BI | Data mining | machine learning | RapidMiner | Tutorial


Text mining: How to mine e-mail data from an IMAP account using RapidMiner

July 14, 2013 at 5:39 PM | Buddy James
In this article, I will show you how to use RapidMiner, one of the best open source data mining solutions on the internet, to read email data from an IMAP or POP3 account for storage and processing. I'll cover the basic text mining operators: Transform Cases, Filter Stopwords, Stem, and Tokenize.

Here is a picture of the main process.

This process allows me to mine email messages and write the information that I'm interested in to a SQL Server database.

The operator of interest at the beginning of the process is the Process Documents from Mail Store operator. This operator allows you to specify host details and login credentials in order to bring email messages into RapidMiner.

The Process Documents operator contains a sub process, where you can add operators that assist with munging your data. Simply double click on the blue square boxes on the Process Documents operator to enter the sub process.

Here is a screen shot of the processing operators that I use inside of the Process Documents operator.

There is a host property (webmail.yourdomain.com), username and password properties to authenticate with the mail server, and a protocol drop down list. You can specify that you only want to read unread emails by ticking the "only unseen" checkbox, and you can mark emails as read by ticking the "mark seen" checkbox.

Here is an example of the properties for the Process Documents operator.

When you run your process, it may take a while depending on how many emails you plan to mine. The results that I'm writing to a database are as follows:

#1 The example set from the Select Attributes operator. I used the Select Attributes operator because for some reason there were two received columns in the result set, which gave me trouble when writing to the database. So I used Select Attributes to select all attributes except for the extra received attribute.

#2 The example set from the WordList to Data operator.

As you can see, this is an incredible source of data. The data also offers classification modeling opportunities (I'm working on an article on detecting spam using RapidMiner, so check back soon).

Thank you for reading.
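
For readers who want to see roughly what the mail store and text processing operators do, here is a rough Python sketch that uses only the standard library. It is not how RapidMiner implements these operators. The function names, the tiny stopword list, and the naive suffix-stripping stemmer are all my own simplifications (RapidMiner offers real stemmers such as Porter), and the only_unseen and mark_seen parameters simply mirror the checkboxes described above.

```python
import email
import imaplib
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; RapidMiner ships its own lists).
STOPWORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "for", "on"}


def body_text(message):
    """Return the first text/plain part of a parsed message as a string."""
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            charset = part.get_content_charset() or "utf-8"
            return part.get_payload(decode=True).decode(charset, errors="replace")
    return ""


def fetch_bodies(host, user, password, only_unseen=True, mark_seen=False):
    """Yield message bodies from an IMAP inbox (hypothetical helper, needs a real server)."""
    criterion = "UNSEEN" if only_unseen else "ALL"
    # BODY.PEEK[] leaves the Seen flag alone; a plain RFC822 fetch sets it.
    spec = "(RFC822)" if mark_seen else "(BODY.PEEK[])"
    with imaplib.IMAP4_SSL(host) as conn:
        conn.login(user, password)
        conn.select("INBOX")
        _, data = conn.search(None, criterion)
        for num in data[0].split():
            _, parts = conn.fetch(num, spec)
            yield body_text(email.message_from_bytes(parts[0][1]))


def stem(word):
    """Naive suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def process_text(text):
    """Transform case, tokenize, filter stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in STOPWORDS]


def word_list(texts):
    """Count stemmed tokens across documents, similar in spirit to a word list."""
    return Counter(token for text in texts for token in process_text(text))


print(process_text("The miners are mining data in the mines"))
# ['min', 'min', 'data', 'min']
```

With a real account, word_list(fetch_bodies("webmail.yourdomain.com", "user", "password")) would give a word count over the unread messages, which is roughly the kind of table that the WordList to Data operator hands to the Write Database operator.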


Posted in: Data mining | Text mining | Analytics | RapidMiner | Tutorial | SQL Server
