July 14, 2013 at 7:29 PM
Tips and tricks. Tip #1: How to use SQL Server named instances with the RapidMiner Read Database and Write Database operators
Hello and welcome to the first of many tips and tricks for RapidMiner. If you are unfamiliar with RapidMiner, it's an open source, Java-based data mining solution. You can visit the official RapidMiner website by clicking here. My plan is to write short articles that provide solutions to problems I encounter as I learn more about this awesome application.
RapidMiner and database connectivity
There are many operators in RapidMiner that take input data sets and generate models for prediction and analysis. Often, you will want to write the result set of the model to a database. To do this you use the "Write Database" operator.
I was using RapidMiner for web mining by way of the Crawl Web operator. The example set output of the Crawl Web operator was connected to the input of the Write Database operator. At the time, I was using a SQL Server database that I pay for through my web hosting account. Just like almost everything in RapidMiner, the setup was easy and worked like a charm. My database size quota was 200MB on my hosting plan, and it became apparent that I would quickly run out of space. So I decided to use the local SQL Express 2012 named instance on my machine. This is where the problem was introduced: I couldn't figure out how to set up the database connection in RapidMiner.
RapidMiner, Named Instances, and Integrated Security
The issues that I encountered when trying to set up my local SQL Server 2012 named instance were as follows:
If I used the named instance as the server name (localhost\SQLExpress), I was unable to connect. I didn't encounter this problem with my hosting server's database because it was a direct hostname (xxx.sqlserverdb.com). There was no instance name, so the configuration was easy.
I wasn't sure how to specify integrated security, as this is something that you usually specify in the connection string. I didn't encounter this problem with my hosting database server either, because I was given a user name and password to connect to the server.
After some research and banging my head against my laptop, I finally figured out the resolution to my problems and I'm here to save someone else the headache.
For the named instance issue, there is a trick that is not readily apparent. You set your database server name as usual (in my case, localhost). However, when you specify the database name, you append a semicolon (;) followed by instance=<instance name>. So for my local server instance (localhost\sqlexpress), I set the Host value to localhost and the Database scheme value to mydatabasename;instance=sqlexpress.
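To see why the semicolon trick works, it helps to look at the shape of a jTDS connection URL, where extra properties such as instance follow the database name after a semicolon. The sketch below is only an illustration: the helper names are hypothetical, and the assumption that RapidMiner joins the Host and Database scheme fields in roughly this way is mine, not something taken from RapidMiner's source.

```python
def split_database_scheme(database_scheme):
    """Split 'name;key=value;...' into the database name and its properties."""
    name, *props = database_scheme.strip().split(";")
    properties = dict(p.split("=", 1) for p in props if p)
    return name, properties


def build_jtds_url(host, database_scheme, port=None):
    """Assemble a jTDS-style URL from the connection dialog fields.

    jTDS URL shape:
    jdbc:jtds:sqlserver://<host>[:<port>][/<database>][;<property>=<value>...]
    """
    authority = host if port is None else f"{host}:{port}"
    return f"jdbc:jtds:sqlserver://{authority}/{database_scheme.strip()}"


# Values from the article: Host = localhost, Database scheme = name;instance=...
url = build_jtds_url("localhost", "mydatabasename;instance=sqlexpress")
print(url)
print(split_database_scheme("mydatabasename;instance=sqlexpress"))
```

Because the instance name travels as a URL property, the Host field can stay a plain host name with no backslash in it.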
As for the integrated security requirement, all you need to do is make sure that you have the latest jTDS SQL Server driver from here. Once you download the zip file, extract the file jtds-1.3.0-dist.zip\x86\SSO\ntlmauth.dll and place it in your windows\system32 directory. This DLL is what lets the driver use integrated security. Once the file is in place, you simply leave the username and password values blank. Here is a screen shot of the Manage Database Connections window in RapidMiner for your reference.
Well, that about wraps it up. Please leave a comment if you have any questions.
Until next time,
July 14, 2013 at 5:39 PM
In this article, I will show you how to use RapidMiner, one of the best open source data mining solutions on the internet, to read email data from an IMAP or POP3 account for storage and processing. I'll cover the basic text mining operators: Transform Case, Filter Stopwords, Stem, and Tokenize.
Here is a picture of the main process.
This process allows me to mine email messages and write the information that I'm interested in to a SQL Server database.
The operator of interest at the beginning of the process is the Process Documents from Mail Store operator. This operator lets you specify host details and login credentials in order to bring email messages into RapidMiner.
The Process Documents operator contains a sub process, where you can add further operators to assist with munging your data. Simply double click the blue square boxes on the Process Documents operator to enter the sub process.
Here is a screen shot of the processing operators that I use inside of the Process Documents operator.
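For readers who can't see the screen shot, here is a rough Python sketch of what a chain of Tokenize, Transform Case, Filter Stopwords, and Stem does to one message body. This is not RapidMiner's implementation: the order of the steps is an assumption, the stopword list is a tiny made-up sample, and the stemmer is a crude suffix stripper standing in for a real algorithm such as Porter.

```python
import re

# Tiny illustrative stopword list; a real English list is much larger.
STOPWORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "for", "on"}


def tokenize(text):
    """Split text into runs of letters; everything else acts as a separator."""
    return re.findall(r"[A-Za-z]+", text)


def transform_case(tokens):
    """Lower-case every token so 'The' and 'the' count as the same word."""
    return [t.lower() for t in tokens]


def filter_stopwords(tokens):
    """Drop very common words that carry little meaning on their own."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(token):
    """Strip one common suffix, keeping at least three leading characters."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def process_document(text):
    """Run the four steps in sequence on one message body."""
    tokens = filter_stopwords(transform_case(tokenize(text)))
    return [stem(t) for t in tokens]


print(process_document("The meetings are scheduled for Monday"))
```

The point of the chain is that each step shrinks the vocabulary a little, so that variants of the same word end up counted together.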
The mail store operator has a host property (webmail.yourdomain.com), username and password properties to authenticate with the mail server, and a protocol drop down list. You can specify that you only want to read unread emails by ticking the only unseen checkbox, and you can mark emails as read by ticking the mark seen checkbox.
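The same settings map fairly directly onto the IMAP protocol itself. The sketch below uses Python's standard imaplib to show what "only unseen" and "mark seen" mean at that level. It is an analogy, not what RapidMiner does internally; the function names are hypothetical, and it assumes the server accepts IMAP over SSL on the default port.

```python
import email
import imaplib


def search_criterion(only_unseen):
    """IMAP search key: unread messages only, or every message."""
    return "UNSEEN" if only_unseen else "ALL"


def fetch_spec(mark_seen):
    """BODY[] sets the \\Seen flag as a side effect; BODY.PEEK[] does not."""
    return "(BODY[])" if mark_seen else "(BODY.PEEK[])"


def read_mail(host, username, password, only_unseen=True, mark_seen=False,
              folder="INBOX"):
    """Return the selected messages as email.message.Message objects."""
    messages = []
    conn = imaplib.IMAP4_SSL(host)  # assumption: IMAP over SSL, default port
    try:
        conn.login(username, password)
        # A read-only mailbox guarantees that no flags are changed.
        conn.select(folder, readonly=not mark_seen)
        _, data = conn.search(None, search_criterion(only_unseen))
        for num in data[0].split():
            _, parts = conn.fetch(num, fetch_spec(mark_seen))
            messages.append(email.message_from_bytes(parts[0][1]))
    finally:
        conn.logout()
    return messages
```

A call such as read_mail("webmail.yourdomain.com", "user", "secret") would return the unread messages and leave them unread, which is a safe default while you are still experimenting.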
Here is an example of the properties for the Process Documents operator.
When you run your process, it may take a while depending on how many emails you plan to mine. The results that I'm writing to a database are as follows:
#1 The example set from the Select Attributes operator
I used the Select Attributes operator because, for some reason, there were two received columns in the result set, which gave me trouble when writing to the database. So I used Select Attributes to select all attributes except the extra received attribute.
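In plain Python terms, that clean-up amounts to keeping every column except the repeated one. The sketch below is only an illustration with made-up column names and data; comparing names without regard to case is my assumption, since I don't know exactly how the two columns differed.

```python
def drop_repeated_attribute(names, rows, attribute="received"):
    """Keep all columns, but only the first one whose name matches attribute."""
    keep = []
    seen = False
    for index, name in enumerate(names):
        if name.lower() == attribute.lower():
            if seen:
                continue  # skip the extra copy of the column
            seen = True
        keep.append(index)
    kept_names = [names[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_names, kept_rows


# Hypothetical example set with a repeated "received" column.
names = ["from", "received", "subject", "received"]
rows = [["a@example.com", "2013-07-14", "Hello", "2013-07-14"]]
print(drop_repeated_attribute(names, rows))
```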
#2 The example set from the WordList to Data operator
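Conceptually, this example set has one row per word along with its counts. Here is a minimal Python sketch of that idea. The column names are illustrative rather than the exact ones RapidMiner produces, and the input is assumed to be a list of already processed token lists, one per email.

```python
from collections import Counter


def wordlist_to_rows(documents):
    """Turn tokenized documents into one row per distinct word."""
    total = Counter()         # how often each word occurs overall
    in_documents = Counter()  # in how many documents each word occurs
    for tokens in documents:
        total.update(tokens)
        in_documents.update(set(tokens))
    return [
        {
            "word": word,
            "total_occurrences": total[word],
            "document_occurrences": in_documents[word],
        }
        for word in sorted(total)
    ]


# Two made-up, already processed emails.
docs = [["meeting", "monday", "meeting"], ["invoice", "monday"]]
for row in wordlist_to_rows(docs):
    print(row)
```

Rows in this shape are easy to write to a database table, with one column per key.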
As you can see, this is an incredible source of data. The data also offers classification modeling opportunities. (I'm working on an article about detecting spam with RapidMiner, so check back soon.)
Thank you for reading.