Twitter Sentiment Analysis – Part 2. Extracting and Mining Twitter Data Using Zapier, RapidMiner and Google/Microsoft Tools
In this post I would like to build on what was developed in the first part of this two-part series describing Twitter data extraction and sentiment analysis. In part one, I explained how anyone can extract Twitter data into a Google Docs spreadsheet and then transfer it into a local environment using two different methods. In this final post of the series, I would like to show you how this data can be analysed for sentiment i.e. whether a specific Twitter feed can be considered negative or positive. To do this, I will employ the free RapidMiner software which can be downloaded from HERE, two separate datasets of already pre-classified tweets for model training, and Microsoft SQL Server as the data scrubbing and storage engine. Most of the files I am using in this project can be downloaded from HERE (including all the SQL code, the YouTube video, sample data files etc.).
To get started, I created two tables to house the data: one for the reference data which already contains sentiment information and one for the data extracted from Twitter which hasn't been analysed yet, using the following code:
CREATE TABLE [dbo].[Twitter_Training_Data]
    ([ID] [int] IDENTITY(1,1) NOT NULL,
     [Feed] [varchar](max) NULL,
     [Sentiment] [varchar](50) NULL) ON [PRIMARY]
GO

CREATE TABLE [dbo].[Twitter_Test_Data]
    ([ID] [int] IDENTITY(1,1) NOT NULL,
     [Feed] [varchar](max) NULL) ON [PRIMARY]
GO
Next, I downloaded two files containing already sentiment-tagged Twitter feeds. These are quite large and probably overkill for this project but, in theory, the more data we provide for the model to learn to distinguish between NEGATIVE and POSITIVE categories, the better it should perform. Using a small SSIS package, I imported and percentage-sampled (to get a smaller dataset) the data from the two pre-processed files into my local MS SQL Server database, together with the file containing the not-yet-analysed Twitter feeds created using Zapier (see previous post). The records from the reference data went into the 'Twitter_Training_Data' table whereas the feeds we will mine were inserted into the 'Twitter_Test_Data' table. The whole SSIS solution is as per below (Control Flow and Data Flow).
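On the sampling step: if you would rather sample in SQL than through the SSIS Percentage Sampling transformation, something along these lines would achieve a similar result after the import. This is just a sketch; the 10 per cent figure and the staging table name 'Twitter_Training_Staging' are my assumptions, not part of the original package:

-- Insert a random 10 per cent sample of the staged reference feeds
-- into the training table (ORDER BY NEWID() shuffles the rows)
INSERT INTO dbo.Twitter_Training_Data (Feed, Sentiment)
SELECT TOP 10 PERCENT Feed, Sentiment
FROM dbo.Twitter_Training_Staging
ORDER BY NEWID();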
To make the solution more flexible, I also created two variables inside the package which hold the names of the tables the package populates.
These variables get passed into the stored procedure, which allows it to be applied to both tables. To pass the variables to the data cleansing stored procedure, I assigned each variable name, with its corresponding data type, to a parameter in the 'Parameter Mapping' property of each of the two Execute SQL Tasks as per below.
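For context, the SQLStatement of each Execute SQL Task would then look something like the line below, assuming an OLE DB connection where '?' marks the mapped parameter; treat this as a sketch of the setup rather than a screenshot-accurate copy:

-- Execute SQL Task statement; the '?' placeholder receives the
-- table-name variable configured on the Parameter Mapping page
EXEC dbo.usp_Massage_Twitter_Data ?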
The reason for storing this information in a database rather than a file is simply that, once in a table, the data can undergo further cleaning and formatting using logic implemented in SQL code. As I found out the hard way, a lot of the strings people put into their tweets contain tons of useless ASCII characters and shortened URL references which only pollute the test data and confuse the model. Also, reading data from a database can yield a substantial performance increase. To further clean the data, I created two functions: one to weed out non-alphanumeric characters and one to remove references to URLs. As the classification mining model relies on words which are greater than 2 characters in length, and web-application-generated URLs are mostly comprised of artificially created strings, these characters do not contribute to the overall outcome and in many cases can in fact skew the output. These two functions get used by the stored procedure which executes as the final step of the ETL package. The SQL code for those is as per below.
--CREATE STRING SPLITTING FUNCTION
CREATE FUNCTION [dbo].[Split]
(@sep VARCHAR(32), @s VARCHAR(MAX))
RETURNS @t TABLE (val VARCHAR(MAX))
AS
BEGIN
    DECLARE @xml XML
    -- Wrap the tokens in XML elements so they can be shredded with .nodes()
    SET @xml = N'<root><r>' + REPLACE(@s, @sep, '</r><r>') + '</r></root>'
    INSERT INTO @t (val)
    SELECT r.value('.', 'VARCHAR(1000)') AS Item
    FROM @xml.nodes('//root/r') AS RECORDS(r)
    RETURN
END
--CREATE FUNCTION TO REMOVE UNWANTED CHARACTERS
CREATE FUNCTION [dbo].[RemoveSpecialChars]
(@Input VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    DECLARE @Output VARCHAR(MAX)
    IF (ISNULL(@Input, '') = '')
        SET @Output = @Input
    ELSE
    BEGIN
        DECLARE @Len INT
        DECLARE @Counter INT
        DECLARE @CharCode INT
        SET @Output = ''
        SET @Len = LEN(@Input)
        SET @Counter = 1
        -- Walk the string character by character, keeping only spaces,
        -- digits and upper/lower-case letters
        WHILE @Counter <= @Len
        BEGIN
            SET @CharCode = ASCII(SUBSTRING(@Input, @Counter, 1))
            IF @CharCode = 32 OR @CharCode BETWEEN 48 AND 57
                OR @CharCode BETWEEN 65 AND 90 OR @CharCode BETWEEN 97 AND 122
                SET @Output = @Output + CHAR(@CharCode)
            SET @Counter = @Counter + 1
        END
    END
    RETURN @Output
END
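A quick sanity check of both functions against a made-up tweet (the input strings here are purely illustrative) shows the intended behaviour:

-- Returns: 'Great read httptcoXy1 winning' (punctuation stripped,
-- the URL collapsed into a single mangled token)
SELECT dbo.RemoveSpecialChars('Great read!!! http://t.co/Xy1 #winning');

-- Returns one row per word, with the mangled URL token still present;
-- the stored procedure below filters such tokens out via NOT LIKE '%http%'
SELECT val FROM dbo.Split(' ', 'Great read httptcoXy1 winning');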
--CREATE STRING CLEANING STORED PROCEDURE
CREATE PROCEDURE [dbo].[usp_Massage_Twitter_Data]
(@Table_Name NVARCHAR(100))
AS
BEGIN
    BEGIN TRY
        SET NOCOUNT ON

        --CREATE CLUSTERED INDEX SQL
        DECLARE @SQL_Drop_Index NVARCHAR(1000)
        SET @SQL_Drop_Index =
            'IF EXISTS (SELECT name FROM sys.indexes WHERE name = N''IX_Feed_ID'')
             DROP INDEX IX_Feed_ID ON ' + @Table_Name + '
             CREATE CLUSTERED INDEX IX_Feed_ID ON ' + @Table_Name + ' (ID)';

        --REMOVE UNWANTED CHARACTERS SQL
        DECLARE @SQL_Remove_Chars NVARCHAR(2000)
        SET @SQL_Remove_Chars =
            'UPDATE ' + @Table_Name + ' SET feed = (SELECT dbo.RemoveSpecialChars(feed))
             DECLARE @z INT
             DECLARE db_cursor CURSOR FOR SELECT id FROM ' + @Table_Name + '
             OPEN db_cursor
             FETCH NEXT FROM db_cursor INTO @z
             WHILE @@FETCH_STATUS = 0
             BEGIN
                 DECLARE @Combined_String VARCHAR(MAX);
                 WITH cte(id, val) AS
                     (SELECT a.id, fs.val FROM ' + @Table_Name + ' a
                      CROSS APPLY dbo.Split('' '', feed) AS fs
                      WHERE fs.val NOT LIKE ''%http%'' AND a.id = @z)
                 SELECT @Combined_String = COALESCE(@Combined_String + '' '', '''') + val FROM cte
                 UPDATE ' + @Table_Name + ' SET feed = LTRIM(RTRIM(@Combined_String))
                 WHERE ' + @Table_Name + '.id = @z
                 SELECT @Combined_String = ''''
                 FETCH NEXT FROM db_cursor INTO @z
             END
             CLOSE db_cursor
             DEALLOCATE db_cursor'

        --RESEED IDENTITY COLUMN & DELETE EMPTY RECORDS SQL
        DECLARE @SQL_Reseed_Delete NVARCHAR(1000)
        SET @SQL_Reseed_Delete =
            'IF EXISTS (SELECT c.is_identity FROM sys.tables t
                        JOIN sys.schemas s ON t.schema_id = s.schema_id
                        JOIN sys.columns c ON c.object_id = t.object_id
                        JOIN sys.types ty ON ty.system_type_id = c.system_type_id
                        WHERE t.name = ''' + @Table_Name + ''' AND s.name = ''dbo'' AND c.is_identity = 1)
             SET IDENTITY_INSERT ' + @Table_Name + ' ON
             DELETE FROM ' + @Table_Name + ' WHERE Feed IS NULL OR Feed = ''''
             DBCC CHECKIDENT (' + @Table_Name + ', RESEED, 1)
             SET IDENTITY_INSERT ' + @Table_Name + ' OFF
             ALTER INDEX IX_Feed_ID ON ' + @Table_Name + ' REORGANIZE'

        EXECUTE sp_executesql @SQL_Drop_Index
        EXECUTE sp_executesql @SQL_Remove_Chars
        EXECUTE sp_executesql @SQL_Reseed_Delete
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
        BEGIN
            ROLLBACK TRANSACTION
        END
        DECLARE @ErrorMessage NVARCHAR(4000);
        DECLARE @ErrorSeverity INT;
        DECLARE @ErrorState INT;
        SELECT @ErrorMessage = ERROR_MESSAGE(),
               @ErrorSeverity = ERROR_SEVERITY(),
               @ErrorState = ERROR_STATE();
        RAISERROR (@ErrorMessage,  -- Message text.
                   @ErrorSeverity, -- Severity.
                   @ErrorState);   -- State.
    END CATCH
    IF @@TRANCOUNT > 0
    BEGIN
        COMMIT TRANSACTION
    END
END
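To illustrate what the procedure does end to end, here is a hypothetical walk-through on a single made-up tweet (the sample text is an assumption for demonstration only):

-- Stage one fabricated, messy feed
INSERT INTO dbo.Twitter_Test_Data (Feed)
VALUES ('Loving the new phone!!! :-) http://t.co/Xy1 #happy');

-- Run the cleansing procedure against the test table
EXEC dbo.usp_Massage_Twitter_Data @Table_Name = N'Twitter_Test_Data';

-- The special characters are stripped first, turning the URL into the
-- token 'httptcoXy1', which the '%http%' filter then removes entirely,
-- leaving roughly: 'Loving the new phone happy'
SELECT Feed FROM dbo.Twitter_Test_Data;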
The package extracts Twitter feeds from each of the three CSV files (I downloaded them as XLSX and converted them into CSV simply by opening them in Excel and saving them again as CSV, as SSIS can be a bit temperamental using Excel as a data source), storing them in their assigned tables and performing some data cleansing using the above SQL code.
After both the training data and the new observations have been loaded into the database, it is time to fire up RapidMiner. If you have any experience with ETL applications relying on drag-and-drop transformation functionality, you should feel at home using RapidMiner. Using it is similar to building an SSIS routine on a development pane and linking individual transformations to create a logical data flow. To start, we need to create a database connection to read our tweets from SQL Server. There are numerous tutorials on the Internet, and a few particularly useful ones on YouTube, on how to install the jTDS driver on a Windows box to read data from a SQL Server instance, so I won't go into detail on how to accomplish that. Below is a screenshot of the connection I set up in RapidMiner on my machine.
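Once the connection works, reading the two tables into RapidMiner only needs a simple query each; something along these lines (the exact operator settings are covered in the video below, so treat this as an assumed sketch):

-- Training data: the Sentiment column becomes the label attribute
SELECT Feed, Sentiment FROM dbo.Twitter_Training_Data;

-- Unlabelled feeds to be scored by the trained model
SELECT Feed FROM dbo.Twitter_Test_Data;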
Finally, it is time to develop the model and the rest of the solution for Twitter data classification. As explaining it step by step with all the supporting screenshots would require at least another post, I decided to provide this overview as a short video hosted on YouTube – see below.
You can also download the original footage with all the complementary files e.g. SQL code, source data files etc. from HERE.
If you enjoyed this post, please also check out my earlier version of Twitter sentiment analysis using pure SQL and SSIS, which can be viewed from HERE. Finally, if you're after a more ad hoc/playful sentiment analysis tool for Twitter data and (optionally) have some basic knowledge of the Python programming language, check out my post on using the etcML web-based tool under THIS link.
This entry was posted on Friday, March 29th, 2013 at 4:54 pm and is filed under SQL, SSIS.
admin September 16th, 2013 at 10:46 pm
Hi there. Thanks for reading my blog and for your comments and questions. I have tried to address your queries as per below. Please feel free to contact me if you require further clarification.
Question 1 – Yes, that's correct. You could just use RapidMiner and import the data straight from a CSV file. The problem I encountered was that, given the number of records in the sample files provided as well as my machine's hardware limitations e.g. memory, CPU (I am only guessing here), I found the import process very painful and time consuming. RapidMiner would crash, become unresponsive or throw an error, so given that I already had a SQL Server database installed on my machine, it made sense to import the data into the database first. This decreased data retrieval time considerably even though it involves going through a few additional steps. In the end it was worthwhile as, depending on your technical skills, you can always process your data even further e.g. cleanse it, remove certain strings or characters etc.
As far as the three files you mentioned are concerned, one contains feeds which have not been analysed – no sentiment attached – whereas the other two contain already pre-classified feeds. In the simplest terms, in order to determine whether something is 'negative' or 'positive', we compare it to a similar entity which has already been tagged as such. For that purpose we are required to have two sample datasets with already pre-labelled feeds; otherwise RapidMiner and the algorithm applied would not be able to train/learn to distinguish whether the sample we wish to analyse has positive or negative connotations. In my case, the dataset containing feeds which have already been tagged comes from the Internet; I downloaded it from a university's website (cannot remember exactly which one). The dataset containing feeds which have not been tagged comes from Twitter; I downloaded those feeds using the methods and services described in the first post e.g. Zapier, SQL Server SSIS etc. Please also note that my sample feeds contained a lot of strings in foreign languages. If you were to do this analysis more thoroughly, I would recommend sticking to one language only – it would be much more accurate to have all three datasets with feeds in English only.
Question 2 – I am not an expert myself, actually far from it. RapidMiner was my second attempt at exploring data mining territory, so knowledge-wise we are probably on par if you're also just starting out. I wrote another post about mining Twitter data using pure SQL and a little bit of SSIS and .NET a while ago. If you're interested, it shouldn't be hard to find on my blog.
Hope that helps… Marcin