{"id":1564,"date":"2013-03-29T16:54:29","date_gmt":"2013-03-29T16:54:29","guid":{"rendered":"http:\/\/bicortex.com\/?p=1564"},"modified":"2014-01-07T11:55:31","modified_gmt":"2014-01-07T11:55:31","slug":"twitter-sentiment-analysis-part-2-extracting-and-mining-twitter-data-using-zapier-rapidminer-and-googlemicrosoft-tools","status":"publish","type":"post","link":"http:\/\/bicortex.com\/bicortex\/twitter-sentiment-analysis-part-2-extracting-and-mining-twitter-data-using-zapier-rapidminer-and-googlemicrosoft-tools\/","title":{"rendered":"Twitter Sentiment Analysis  &#8211; Part 2. Extracting and Mining Twitter Data Using Zapier, RapidMiner and Google\/Microsoft Tools"},"content":{"rendered":"<p style=\"text-align: justify;\">In this post I would like to build on what was developed in <a href=\"http:\/\/bicortex.com\/twitter-sentiment-analysis-mining-twitter-data-using-rapidminer-part-1\/\" target=\"_blank\">first iteration<\/a> to this two part series describing Twitter data extraction and sentiment analysis. In <a href=\"http:\/\/bicortex.com\/twitter-sentiment-analysis-mining-twitter-data-using-rapidminer-part-1\/\" target=\"_blank\">part one<\/a>, I explained how anyone can extract Twitter data into Google Docs spread sheet and then transfer is into a local environment using two different methods. In this series-final post, I would like to show you how this data can be analyzed for sentiment i.e. whether a specific Twitter feed can be considered as negative or positive. To do this, I will employ free software called RapidMiner which can be downloaded from <a href=\"http:\/\/rapid-i.com\/\" target=\"_blank\">HERE<\/a> as well as two separate data sets of already pre-classified tweets for model learning and Microsoft SQL Server for some data scrubbing and storage engine. Most of the files I am using in this project can be downloaded from <a href=\"https:\/\/skydrive.live.com\/redir?resid=715AEF07A82832E1!44704&amp;authkey=!AFBClNgpLPaMU2s\" target=\"_blank\">HERE<\/a> (including all the SQL code, YouTube video, sample data files etc.).<\/p>\n<p style=\"text-align: justify;\">To get started, I created two tables which will house the data, one for reference data which already contains sentiment information and one for the data extracted from Twitter which hasn\u2019t been analysed yet using the following code:<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\nCREATE TABLE &#x5B;dbo].&#x5B;Twitter_Training_Data]\r\n(&#x5B;ID] &#x5B;int] IDENTITY(1,1) NOT NULL,\r\n &#x5B;Feed] &#x5B;varchar](max) NULL,\r\n &#x5B;Sentiment] &#x5B;varchar](50) NULL) ON &#x5B;PRIMARY]\r\nGO\r\n\r\nCREATE TABLE &#x5B;dbo].&#x5B;Twitter_Test_Data]\r\n(&#x5B;ID] &#x5B;int] IDENTITY(1,1) NOT NULL,\r\n &#x5B;Feed] &#x5B;varchar](max) NULL) ON &#x5B;PRIMARY]\r\nGO\r\n<\/pre>\n<p style=\"text-align: justify;\">Next, I downloaded two files containing already sentiment-tagged Twitter feeds. These are quite large in size and probably overkill for this project but, in theory, the more data we provide for model to learn to distinguish between NEGATIVE and POSITIVE categories, the better it should perform. I imported and percentage sampled (for smaller dataset) the data from the two pre-processed files using a small SSIS package into my local MS SQL Server database together with the file containing Twitter feeds created using Zapier (see <a href=\"http:\/\/bicortex.com\/twitter-sentiment-analysis-mining-twitter-data-using-rapidminer-part-1\/\" target=\"_blank\">previous post<\/a>) which hasn\u2019t been analysed yet. The records from the reference data went into \u2018Twitter_Training_Data\u2019 table whereas the feeds we will mine were inserted into \u2018Twitter_Test_Data\u2019 table. The whole SSIS solution is as per below (Control and Data Flow)<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/03\/Twitter_Sentiment_Analysis_2_SSIS_1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-1593\" alt=\"Twitter_Sentiment_Analysis_2_SSIS_1\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/03\/Twitter_Sentiment_Analysis_2_SSIS_1.png\" width=\"580\" height=\"240\" srcset=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2013\/03\/Twitter_Sentiment_Analysis_2_SSIS_1.png 580w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2013\/03\/Twitter_Sentiment_Analysis_2_SSIS_1-300x124.png 300w\" sizes=\"auto, (max-width: 580px) 100vw, 580px\" \/><\/a><\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/03\/Twitter_Sentiment_Analysis_2_SSIS_2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter  wp-image-1594\" alt=\"Twitter_Sentiment_Analysis_2_SSIS_2\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/03\/Twitter_Sentiment_Analysis_2_SSIS_2.png\" width=\"580\" height=\"350\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">To be more flexible, I have also created two variables inside the package which hold the table names the package populates.<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/03\/Twitter_Sentiment_Analysis_2_SSIS_Variables1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-1595\" alt=\"Twitter_Sentiment_Analysis_2_SSIS_Variables1\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/03\/Twitter_Sentiment_Analysis_2_SSIS_Variables1.png\" width=\"580\" height=\"126\" srcset=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2013\/03\/Twitter_Sentiment_Analysis_2_SSIS_Variables1.png 580w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2013\/03\/Twitter_Sentiment_Analysis_2_SSIS_Variables1-300x65.png 300w\" sizes=\"auto, (max-width: 580px) 100vw, 580px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">These variables get passed into the stored procedure which allows it to be applied to both tables. To pass the variables to the data cleansing stored procedure I assigned the variable name in \u2018Parameters Mapping\u2019 property of each of the two Execute SQL Tasks with its corresponding data type to a parameter as per below.<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/03\/Twitter_Sentiment_Analysis_2_SSIS_3.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-1596\" alt=\"Twitter_Sentiment_Analysis_2_SSIS_3\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/03\/Twitter_Sentiment_Analysis_2_SSIS_3.png\" width=\"580\" height=\"235\" srcset=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2013\/03\/Twitter_Sentiment_Analysis_2_SSIS_3.png 580w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2013\/03\/Twitter_Sentiment_Analysis_2_SSIS_3-300x121.png 300w\" sizes=\"auto, (max-width: 580px) 100vw, 580px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">The reason for storing this information in a database rather than a file is simply because once in a table, the data can undergo further cleaning and formatting using the logic implemented in the SQL code. As I found out the hard way, a lot of the strings people put into their tweets contain tons of useless ACSII characters and shortened URL references which only pollute the test data and confuse the model. Also, reading data from a database can incur substantial performance increase. To further clean the data, I created two functions, one to weed out non-alphanumeric characters and one to remove references to URLs. As the classification mining model relies on words which are greater than 2 characters in length and web application generated URLs are mostly comprised of artificially created strings, these characters don not contribute to the overall outcome and in many cases can in fact skew the output. These two functions get used by the stored procedure which executes as the final step of the ETL package. The SQL code for those is as per below.<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\n--CREATE STRING SPLITTING FUNCTION\r\nCREATE FUNCTION &#x5B;dbo].&#x5B;Split] (@sep VARCHAR(32), @s VARCHAR(MAX))\r\nRETURNS @t TABLE\r\n (val VARCHAR(MAX))\r\nAS\r\n   BEGIN\r\n     DECLARE @xml XML\r\n     SET @XML = N'' + REPLACE(@s, @sep, '') + ''\r\n     INSERT INTO @t(val)\r\n     SELECT r.value('.','VARCHAR(1000)') as Item\r\n     FROM @xml.nodes('\/\/root\/r') AS RECORDS(r)\r\n     RETURN\r\n  END\r\n<\/pre>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\n--CREATE FUNCTION TO REMOVE UNWANTED CHARACTERS\r\nCREATE FUNCTION &#x5B;dbo].&#x5B;RemoveSpecialChars]\r\n(@Input VARCHAR(MAX))\r\nRETURNS VARCHAR(MAX)\r\nBEGIN\r\n   DECLARE @Output VARCHAR(MAX)\r\n   IF (ISNULL(@Input,'')='')\r\n      SET @Output = @Input\r\n   ELSE\r\n   BEGIN\r\n    DECLARE @Len INT\r\n    DECLARE @Counter INT\r\n    DECLARE @CharCode INT\r\n    SET @Output = ''\r\n    SET @Len = LEN(@Input)\r\n   SET @Counter = 1\r\n       WHILE @Counter &lt;= @Len\r\n       BEGIN\r\n          SET @CharCode = ASCII(SUBSTRING(@Input, @Counter, 1))\r\n          IF @CharCode=32 OR @CharCode BETWEEN 48 and 57 OR @CharCode BETWEEN 65 AND 90\r\n\t\t  OR @CharCode BETWEEN 97 AND 122\r\n             SET @Output = @Output + CHAR(@CharCode)\r\n          SET @Counter = @Counter + 1\r\n       END\r\n   END\r\n   RETURN @Output\r\nEND\r\n<\/pre>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\n--CREATE STRING CLEANING STORED PROCEDURE\r\nCREATE PROCEDURE &#x5B;dbo].&#x5B;usp_Massage_Twitter_Data] \r\n(@table_name NVARCHAR(100))\r\nAS\r\nBEGIN\r\n\tBEGIN TRY\r\n\t\tSET NOCOUNT ON\r\n\r\n\t\t--CREATE CLUSTERED INDEX SQL\r\n\t\tDECLARE @SQL_Drop_Index NVARCHAR(1000)\r\n\t\tSET @SQL_Drop_Index = \r\n\t\t\t\t'IF EXISTS (SELECT name FROM sys.indexes\r\n\t\t\t\tWHERE name = N''IX_Feed_ID'')\r\n\t\t\t\tDROP INDEX IX_Feed_ID ON ' + @table_name + '\r\n\t\t\t\tCREATE CLUSTERED INDEX IX_Feed_ID \r\n\t\t\t\tON  ' + @Table_Name + '  (ID)';\r\n\r\n\t\t--REMOVE UNWANTED CHARACTERS SQL\r\n\t\tDECLARE @SQL_Remove_Chars NVARCHAR(2000)\r\n\t\tSET @SQL_Remove_Chars = \r\n\t\t\t\t'UPDATE ' + @Table_Name + '\r\n\t\t\t\tSET feed = (select dbo.RemoveSpecialChars(feed))\r\n\t   \r\n\t\t\t\tDECLARE @z int\r\n\t\t\t\tDECLARE db_cursor CURSOR\r\n\t\t\t\tFOR\r\n\t\t\t\t\tSELECT id\r\n\t\t\t\t\tFROM ' + @Table_Name + '\r\n\t\t\t\t\tOPEN db_cursor\r\n\t\t\t\t\tFETCH NEXT\r\n\t\t\t\t\tFROM db_cursor INTO @z\r\n\t\t\t\t\tWHILE @@FETCH_STATUS = 0\r\n\t\t\t\t\t\tBEGIN\r\n\t\t\t\t\t\tDECLARE @Combined_String VARCHAR(max);\r\n\t\t\t\t\t\tWITH cte(id, val)  AS (\r\n\t\t\t\t\t\t\tSELECT a.id, fs.val\r\n\t\t\t\t\t\t\tFROM ' + @Table_Name + '  a\r\n\t\t\t\t\t\t\tCROSS APPLY  dbo.split('' '', feed)  AS fs\r\n\t\t\t\t\t\t\tWHERE fs.val NOT LIKE  ''%http%'' and a.id = @z)\r\n\t\t\t\t\t\tSELECT @Combined_String = \r\n\t\t\t\t\t\tCOALESCE(@Combined_String + '' '', '''')  + val\r\n\t\t\t\t\t\tFROM cte\r\n\r\n\t\t\t\t\t\tUPDATE ' + @Table_Name + '\r\n\t\t\t\t\t\tSET feed  = ltrim(rtrim(@Combined_String))\r\n\t\t\t\t\t\tWHERE ' + @Table_Name + '.id = @z\r\n\r\n\t\t\t\t\t\tSELECT @Combined_String = ''''\r\n\t\t\t\t\t\tFETCH NEXT\r\n\t\t\t\t\t\tFROM db_cursor INTO @z\r\n\t\t\t\t\t\tEND\r\n\t\t\t\t\tCLOSE db_cursor\r\n\t\t\t\t\tDEALLOCATE db_cursor'\r\n\r\n\t\t--RESEED IDENTITY COLUMN &amp; DELETE EMPTY RECORDS\tSQL\r\n\t\tDECLARE @SQL_Reseed_Delete NVARCHAR(1000)\r\n\r\n\t\tSET @SQL_Reseed_Delete = \r\n\t\t\t\t'IF EXISTS\t(SELECT   c.is_identity\r\n\t\t\t\tFROM   sys.tables t\r\n\t\t\t\tJOIN sys.schemas s\r\n\t\t\t\tON t.schema_id = s.schema_id\r\n\t\t\t\tJOIN sys.Columns c\r\n\t\t\t\tON c.object_id = t.object_id\r\n\t\t\t\tJOIN sys.Types ty\r\n\t\t\t\tON ty.system_type_id = c.system_type_id\r\n\t\t\t\tWHERE  t.name = ''+ @Table_Name +''\r\n\t\t\t\tAND s.Name = ''dbo''\r\n\t\t\t\tAND c.is_identity=1)\r\n\t\t\t\tSET IDENTITY_INSERT ' + @Table_Name + ' ON\r\n\r\n\t\t\t\tDELETE FROM ' + @Table_Name + '\r\n\t\t\t\tWHERE Feed IS NULL or Feed =''''\r\n\r\n\t\t\t\tDBCC CHECKIDENT (' + @Table_Name + ', reseed, 1)\r\n\t\t\t\tSET IDENTITY_INSERT ' + @Table_Name + ' OFF\r\n\r\n\t\t\t\tALTER INDEX IX_Feed_ID ON ' + @Table_Name + '\r\n\t\t\t\tREORGANIZE'\r\n\r\n\t\tEXECUTE sp_executesql @SQL_Drop_Index\r\n\t\tEXECUTE sp_executesql @SQL_Remove_Chars\r\n\t\tEXECUTE sp_executesql @SQL_Reseed_Delete\r\n\tEND TRY\r\n\tBEGIN CATCH\r\n\t\tIF @@TRANCOUNT &gt; 0\r\n\t\tBEGIN\r\n\t\t\tROLLBACK TRANSACTION\r\n\t\tEND\r\n\r\n\t\tDECLARE @ErrorMessage NVARCHAR(4000);\r\n\t\tDECLARE @ErrorSeverity INT;\r\n\t\tDECLARE @ErrorState INT;\r\n\t\tSELECT \t@ErrorMessage = ERROR_MESSAGE(),\r\n\t\t\t\t@ErrorSeverity = ERROR_SEVERITY(),\r\n\t\t\t\t@ErrorState = ERROR_STATE();\r\n\t\tRAISERROR (\r\n\t\t\t\t@ErrorMessage\r\n\t\t\t\t,-- Message text.\r\n\t\t\t\t@ErrorSeverity\r\n\t\t\t\t,-- Severity.\r\n\t\t\t\t@ErrorState -- State.\r\n\t\t\t\t);\r\n\tEND CATCH\r\n\t\tIF @@TRANCOUNT &gt; 0\r\n\t\t\tBEGIN\r\n\t\t\t\tCOMMIT TRANSACTION\r\n\tEND\r\nEND\r\n<\/pre>\n<p style=\"text-align: justify;\">The package extracts Twitter feeds from each of the three CSV files (I downloaded them as XLSX and converted into CSV simply by opening them up in Excel and saving again as CSV as SSIS can be a bit temperamental using Excel as a data source), storing them in assigned tables and performing some data cleansing using the above SQL code.<\/p>\n<p style=\"text-align: justify;\">After both training data and new observations have been loaded into the database it is time to fire up RapidMiner. If you have any experience with ETL applications relying on drag-and-drop transformations functionally you should feel at home using RapidMiner. Using it is similar to building an SSIS routine on a development pane and linking individual transformations to create a logical data flow. To start, we need to create a database connection to read our tweets from SQL Server. There are numerous tutorials on the Internet and a few particularly useful ones on YouTube on how to install JTDE driver on the Windows box to read data from SQL Server instance so I won\u2019t go into details on how to accomplish that. Below is a screen shot of the connection I set up in RapidMiner on my machine.<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/03\/Twitter_Sentiment_Analysis_2_DB_Conn_RapidMiner.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter  wp-image-1597\" alt=\"Twitter_Sentiment_Analysis_2_DB_Conn_RapidMiner\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2013\/03\/Twitter_Sentiment_Analysis_2_DB_Conn_RapidMiner.png\" width=\"580\" height=\"420\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">Finally, it is time to develop the model and the rest of the solution for Twitter data classification. As explaining it step by step with all supporting screen shots would require at least another post, I decided to provide this overview as a short footage hosted on YouTube \u2013 see below.<\/p>\n<p><iframe loading=\"lazy\" src=\"http:\/\/www.youtube.com\/embed\/4EWXf3_0ta4?feature=player_embedded\" height=\"340\" width=\"580\" frameborder=\"0\"><\/iframe><\/p>\n<p style=\"text-align: justify;\">You can also download the original footage with all the complementary files e.g. SQL code, source data files etc. from <a href=\"https:\/\/skydrive.live.com\/redir?resid=715AEF07A82832E1!44704&amp;authkey=!AFBClNgpLPaMU2s\" target=\"_blank\">HERE<\/a>.<\/p>\n<p style=\"text-align: justify;\">If you enjoyed this post please also check out my earlier version of Twitter sentiment analysis using pure SQL and SSIS which can be viewed from <a href=\"http:\/\/bicortex.com\/twitter-data-analysis-using-microsoft-sql-server-c-ssis-ssas-and-excel\/\" target=\"_blank\">HERE<\/a>. Finally, if you&#8217;re after a more adhoc\/playful sentiment analysis tool for Twitter data and (optionally) have some basic knowledge of Python programming language, check out my post on using etcML web based tool under <a href=\"http:\/\/bicortex.com\/twitter-data-sentiment-analysis-using-etcml-and-python\/\" target=\"_blank\">THIS<\/a> link.<\/p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this post I would like to build on what was developed in first iteration to this two part series describing Twitter data extraction and sentiment analysis. In part one, I explained how anyone can extract Twitter data into Google Docs spread sheet and then transfer is into a local environment using two different methods. [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5,50],"tags":[12,57,37,35,49,19,13],"class_list":["post-1564","post","type-post","status-publish","format-standard","hentry","category-sql","category-ssis","tag-code","tag-data-mining","tag-rapidminer","tag-social-networks","tag-sql","tag-sql-server","tag-ssis"],"aioseo_notices":[],"jetpack_featured_media_url":"","_links":{"self":[{"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts\/1564","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/comments?post=1564"}],"version-history":[{"count":54,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts\/1564\/revisions"}],"predecessor-version":[{"id":2050,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts\/1564\/revisions\/2050"}],"wp:attachment":[{"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/media?parent=1564"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/categories?post=1564"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/tags?post=1564"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}