In this post I would like to build on what was developed in the first instalment of this two-part series describing Twitter data extraction and sentiment analysis. In part one, I explained how anyone can extract Twitter data into a Google Docs spread sheet and then transfer it into a local environment using two different methods. In this final post of the series, I would like to show you how this data can be analysed for sentiment i.e. whether a specific Twitter feed can be considered negative or positive. To do this, I will employ free software called RapidMiner which can be downloaded from HERE, two separate data sets of already pre-classified tweets for model learning, and Microsoft SQL Server as the data scrubbing and storage engine. Most of the files I am using in this project can be downloaded from HERE (including all the SQL code, the YouTube video, sample data files etc.).
To get started, I created two tables which will house the data: one for the reference data which already contains sentiment information and one for the data extracted from Twitter which hasn’t been analysed yet. Both were created using the following code:
CREATE TABLE [dbo].[Twitter_Training_Data]
([ID] [int] IDENTITY(1,1) NOT NULL,
[Feed] [varchar](max) NULL,
[Sentiment] [varchar](50) NULL) ON [PRIMARY]
CREATE TABLE [dbo].[Twitter_Test_Data]
([ID] [int] IDENTITY(1,1) NOT NULL,
[Feed] [varchar](max) NULL) ON [PRIMARY]
Next, I downloaded two files containing already sentiment-tagged Twitter feeds. These are quite large and probably overkill for this project but, in theory, the more data we provide for the model to learn to distinguish between the NEGATIVE and POSITIVE categories, the better it should perform. Using a small SSIS package, I imported and percentage-sampled (to obtain a smaller dataset) the data from the two pre-processed files into my local MS SQL Server database, together with the file containing the not-yet-analysed Twitter feeds created using Zapier (see previous post). The records from the reference data went into the ‘Twitter_Training_Data’ table whereas the feeds we will mine were inserted into the ‘Twitter_Test_Data’ table. The whole SSIS solution is as per below (Control Flow and Data Flow).
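For readers who prefer to stay in the database, the percentage sampling performed in SSIS can be roughly approximated in T-SQL as well; a sketch (the 10 percent figure is purely illustrative):

```sql
-- Rough T-SQL equivalent of the SSIS Percentage Sampling transformation;
-- note TABLESAMPLE is page-based, so the returned fraction is approximate
SELECT Feed, Sentiment
FROM dbo.Twitter_Training_Data
TABLESAMPLE (10 PERCENT);
```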
To make the package more flexible, I have also created two variables inside it which hold the names of the tables the package populates.
These variables get passed into the stored procedure, which allows it to be applied to both tables. To pass the variables to the data cleansing stored procedure, I assigned each variable name, with its corresponding data type, to a parameter in the ‘Parameter Mapping’ property of each of the two Execute SQL Tasks as per below.
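Inside each Execute SQL Task the statement itself is just the procedure call with an OLE DB parameter marker; a sketch, assuming the procedure accepts the table name as its only parameter:

```sql
-- SQLStatement property of the Execute SQL Task; the '?' placeholder
-- is bound to the table-name variable under Parameter Mapping (ordinal 0)
EXEC dbo.usp_Massage_Twitter_Data ?
```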
The reason for storing this information in a database rather than a file is simply that, once in a table, the data can undergo further cleaning and formatting using logic implemented in SQL. As I found out the hard way, a lot of the strings people put into their tweets contain tons of useless ASCII characters and shortened URL references which only pollute the test data and confuse the model. Also, reading data from a database can yield a substantial performance increase. To further clean the data, I created two functions: one to weed out non-alphanumeric characters and one to remove references to URLs. As the classification mining model relies on words which are longer than 2 characters, and web-application-generated URLs are mostly comprised of artificially created strings, these characters do not contribute to the overall outcome and in many cases can in fact skew the output. These two functions get used by the stored procedure which executes as the final step of the ETL package. The SQL code for these objects is as per below.
--CREATE STRING SPLITTING FUNCTION
CREATE FUNCTION [dbo].[Split] (@sep VARCHAR(32), @s VARCHAR(MAX))
RETURNS @t TABLE (val VARCHAR(MAX))
AS
BEGIN
-- note: the input must be free of XML-special characters ('<', '>', '&'),
-- which RemoveSpecialChars guarantees when it is applied first
DECLARE @xml XML
SET @xml = N'<root><r>' + REPLACE(@s, @sep, '</r><r>') + '</r></root>'
INSERT INTO @t(val)
SELECT r.value('.','VARCHAR(1000)') AS Item
FROM @xml.nodes('//root/r') AS RECORDS(r)
RETURN
END
--CREATE FUNCTION TO REMOVE UNWANTED CHARACTERS
CREATE FUNCTION [dbo].[RemoveSpecialChars] (@Input VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Output VARCHAR(MAX)
DECLARE @Len INT
DECLARE @Counter INT
DECLARE @CharCode INT
SET @Output = ''
SET @Len = LEN(@Input)
SET @Counter = 1
WHILE @Counter <= @Len
BEGIN
SET @CharCode = ASCII(SUBSTRING(@Input, @Counter, 1))
-- keep spaces (32), digits (48-57) and letters (65-90, 97-122) only
IF @CharCode = 32 OR @CharCode BETWEEN 48 AND 57 OR @CharCode BETWEEN 65 AND 90
OR @CharCode BETWEEN 97 AND 122
SET @Output = @Output + CHAR(@CharCode)
SET @Counter = @Counter + 1
END
RETURN @Output
END
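To illustrate what the two helpers do, here is a hypothetical one-off check (the sample tweet text is made up):

```sql
-- RemoveSpecialChars keeps only letters, digits and spaces
SELECT dbo.RemoveSpecialChars('Great #BigData read! http://t.co/abc123') AS Cleaned;

-- Split tokenises a string on a separator, one row per token;
-- tokens containing 'http' can then be filtered out
SELECT val FROM dbo.Split(' ', 'Great BigData read httptcoabc123')
WHERE val NOT LIKE '%http%';
```

Note the ordering matters: the stored procedure removes special characters first, which collapses a URL into something like ‘httptcoabc123’ – the ‘http’ substring survives, so the subsequent filter can still catch and drop it.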
--CREATE STRING CLEANING STORED PROCEDURE
CREATE PROCEDURE [dbo].[usp_Massage_Twitter_Data] (@Table_Name VARCHAR(100))
AS
BEGIN
SET NOCOUNT ON
--CREATE CLUSTERED INDEX SQL
DECLARE @SQL_Drop_Index NVARCHAR(1000)
SET @SQL_Drop_Index =
'IF EXISTS (SELECT name FROM sys.indexes
WHERE name = N''IX_Feed_ID'')
DROP INDEX IX_Feed_ID ON ' + @Table_Name + '
CREATE CLUSTERED INDEX IX_Feed_ID
ON ' + @Table_Name + ' (ID)';
--REMOVE UNWANTED CHARACTERS SQL
DECLARE @SQL_Remove_Chars NVARCHAR(2000)
SET @SQL_Remove_Chars =
'UPDATE ' + @Table_Name + '
SET feed = dbo.RemoveSpecialChars(feed)
DECLARE @z INT
DECLARE db_cursor CURSOR FOR
SELECT id FROM ' + @Table_Name + '
OPEN db_cursor
FETCH NEXT FROM db_cursor INTO @z
WHILE @@FETCH_STATUS = 0
BEGIN
DECLARE @Combined_String VARCHAR(MAX);
WITH cte(id, val) AS (
SELECT a.id, fs.val
FROM ' + @Table_Name + ' a
CROSS APPLY dbo.Split('' '', feed) AS fs
WHERE fs.val NOT LIKE ''%http%'' AND a.id = @z)
SELECT @Combined_String =
COALESCE(@Combined_String + '' '', '''') + val
FROM cte
UPDATE ' + @Table_Name + '
SET feed = LTRIM(RTRIM(@Combined_String))
WHERE ' + @Table_Name + '.id = @z
SELECT @Combined_String = ''''
FETCH NEXT FROM db_cursor INTO @z
END
CLOSE db_cursor
DEALLOCATE db_cursor';
--RESEED IDENTITY COLUMN & DELETE EMPTY RECORDS SQL
DECLARE @SQL_Reseed_Delete NVARCHAR(1000)
SET @SQL_Reseed_Delete =
'IF EXISTS (SELECT c.is_identity
FROM sys.tables t
JOIN sys.schemas s
ON t.schema_id = s.schema_id
JOIN sys.columns c
ON c.object_id = t.object_id
JOIN sys.types ty
ON ty.system_type_id = c.system_type_id
WHERE t.name = ''' + @Table_Name + '''
AND s.name = ''dbo''
AND c.is_identity = 1)
BEGIN
SET IDENTITY_INSERT ' + @Table_Name + ' ON
DELETE FROM ' + @Table_Name + '
WHERE Feed IS NULL OR Feed = ''''
DBCC CHECKIDENT (''' + @Table_Name + ''', RESEED, 1)
SET IDENTITY_INSERT ' + @Table_Name + ' OFF
ALTER INDEX IX_Feed_ID ON ' + @Table_Name + ' REBUILD
END';
BEGIN TRY
BEGIN TRANSACTION
EXECUTE sp_executesql @SQL_Drop_Index
EXECUTE sp_executesql @SQL_Remove_Chars
EXECUTE sp_executesql @SQL_Reseed_Delete
IF @@TRANCOUNT > 0
COMMIT TRANSACTION
END TRY
BEGIN CATCH
DECLARE @ErrorMessage NVARCHAR(4000);
DECLARE @ErrorSeverity INT;
DECLARE @ErrorState INT;
SELECT @ErrorMessage = ERROR_MESSAGE(),
@ErrorSeverity = ERROR_SEVERITY(),
@ErrorState = ERROR_STATE();
IF @@TRANCOUNT > 0
ROLLBACK TRANSACTION
RAISERROR (@ErrorMessage, -- Message text.
@ErrorSeverity, -- Severity.
@ErrorState); -- State.
END CATCH
END
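With the procedure in place, the two Execute SQL Tasks effectively run the equivalent of the following (the parameter values normally come from the package variables):

```sql
-- Cleanse both the training table and the table holding the new observations
EXEC dbo.usp_Massage_Twitter_Data @Table_Name = 'Twitter_Training_Data';
EXEC dbo.usp_Massage_Twitter_Data @Table_Name = 'Twitter_Test_Data';
```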
The package extracts Twitter feeds from each of the three CSV files (I downloaded them as XLSX and converted them into CSV simply by opening them up in Excel and saving them again as CSV, as SSIS can be a bit temperamental when using Excel as a data source), stores them in the assigned tables and performs data cleansing using the above SQL code.
After both the training data and the new observations have been loaded into the database, it is time to fire up RapidMiner. If you have any experience with ETL applications relying on drag-and-drop transformation functionality, you should feel at home using RapidMiner. Using it is similar to building an SSIS routine on a development pane and linking individual transformations to create a logical data flow. To start, we need to create a database connection to read our tweets from SQL Server. There are numerous tutorials on the Internet, and a few particularly useful ones on YouTube, on how to install the jTDS driver on a Windows box to read data from a SQL Server instance, so I won’t go into detail on how to accomplish that. Below is a screenshot of the connection I set up in RapidMiner on my machine.
Finally, it is time to develop the model and the rest of the solution for Twitter data classification. As explaining it step by step with all the supporting screenshots would require at least another post, I decided to provide this overview as a short video hosted on YouTube – see below.
You can also download the original footage with all the complementary files e.g. SQL code, source data files etc. from HERE.
If you enjoyed this post please also check out my earlier version of Twitter sentiment analysis using pure SQL and SSIS which can be viewed from HERE.
This time I am a little late off the mark, but the news is still timely and topical. The 2013 Magic Quadrant for BI and Analytics Platforms was released by Gartner and it seems that the dominant theme for 2012 was data discovery and predictive analytics. In my opinion, this is likely to continue, if not intensify, in 2013 as self-service BI, coupled with predictive modelling, simulation and data mining, will slowly start making inroads into what are considered standard BI capabilities. Anyhow, a copy of the quadrant is as per the image below and the full report with a detailed description of each vendor can be accessed from HERE.
In this short series (two parts – the second part can be found HERE) I want to expand on the subject of sentiment analysis of Twitter data through data mining techniques. In the previous post I showed how to extract Twitter data using an SSIS package, load it into a relational database, and create a small cube to show the harvested tweets in a pivot table. For me, the best part of working on that solution was creating a stored procedure which determines Twitter feed sentiment based on a dictionary of the most commonly used words. It was a pretty rudimentary approach; nevertheless, it worked well and turned out to be an effective way of analysing social media data using bare SQL. In the next few posts I want to elaborate on the sentiment analysis part and, rather than using SQL, show how to conduct such analysis using a more sophisticated tool: free data mining software called RapidMiner.
This series will be split into two separate posts – the first about the methods of tweet extraction and collection and the second on the actual sentiment analysis. In this first post I want to focus on how to amass a decent pool of tweets in two different ways: using a service called Zapier, Google Docs and a handy little tool called GDocBackUpCMD, as well as SSIS and a little bit of C#. Each approach has its pros and cons, but both are more robust than using an RSS/ATOM feed as shown in my previous post.
Firstly, let’s look at Zapier. Zapier is a great service which allows web applications to ‘talk’ to each other by providing an integration platform for a number of popular web services e.g. YouTube, Gmail, Evernote, PayPal etc. It is even possible to integrate the data into MS SQL Server; however, this is classed as a premium service with a higher cost involved. We will use Zapier to extract Twitter feeds into a Google Docs spread sheet and then copy the data across to our local environment to mine it for sentiment trends. To get started, make sure you have a Twitter account and a Google account and have set up a Zapier account as well. Next, in Google Docs, create an empty spread sheet with three columns and their headings – I have labeled mine ‘UserName’ for the feed author, ‘Date’ for when the tweet was posted and ‘Feed’ for the content of the tweet. Also, make sure that you enter a dummy record in one of the cells – I have added ‘test row’ as per the screenshot below. For some reason Zapier requires that to validate the object it will be writing to.
Once all your accounts are active and you have a Google Docs spread sheet skeleton ready, building a zap is really easy. Just select the services you wish to integrate by dragging and dropping them onto the top process pane, configure the parameters and enable the integration. Here, I have placed Twitter as a source and Google Docs as a target.
Next, enable Zapier access to those services and configure the parameters. Below, on the left hand side, I have ‘Search Mentions’ option selected and a word ‘Bigdata’ to account for all tweets which contain this string. On the right hand side I have connected to my previously created ‘Twitter_Data’ spread sheet and populated each spread sheet column with a Twitter variable by dragging and dropping (left to right) individual tweet attributes e.g. Twitter attribute called ‘Text’, which is the actual tweet content, will go into the spread sheet column named ‘Feed’. When this task is completed you also have a choice to set up filters and pump some sample data into the document to ensure everything has been configured correctly.
Finally, enable the zap and bingo – every 15 minutes (on a free account) you should see your Twitter data populating the spread sheet. The complete zap should resemble the image below.
Providing your zap has run, we are ready to copy the Google Docs data into our local environment. As mentioned before, in this post I will explain how to achieve this using two different methods – a free tool called GDocBackUpCMD with a small batch file and SSIS for the database integration, as well as some C# hacking with SQL Server Integration Services.
Using Excel, SSIS and GDocBackUp Utility
Let’s start with the GDocBackUp utility. Download the CMD version from HERE (a GUI version is also available but will not work in this context) and unpack it into a ‘GDocBackUpCMD’ directory on your C:\ drive. Also on your C:\ drive, create an empty directory called ‘Tweets’ – this will be the place where our local spread sheet copy will be stored. Next, open up Notepad or another text editor and create a new file. Type the following content into your file, making sure you populate the ‘username’ and ‘password’ variables with your credentials.
Save the file on the C:\ drive as ‘Export_Tweets.bat’. The two directories and the batch file should look as per the image below.
Now we can execute the file to see if the process works as expected. Double-clicking on the batch file should execute it in the console view and, provided all the parameters were entered correctly, run the process. After completion we should also have a copy of our Google Docs spread sheet in our local ‘Tweets’ folder.
Finally, all there is left to do is to create a database table into which all the Twitter data will be exported and a small SSIS package which will help us populate it from the local spread sheet copy, as per below.
Naturally, if you wish to make it a continuous process, a more elaborate package would be required to handle truncation (of either the Google spread sheet or the database table), comparison against already-stored tweets and some sort of looping.
This is just one of the ways to export Google Docs spread sheet into our local environment. Another method involves a little bit of C# but in turn does not rely on other utilities and tools; it can all be done in SSIS.
Using C#, SSIS and Google API SDK
First, we need to download the Google Data API SDK for the .NET platform, which can be obtained from HERE. The purpose of the API SDK is to create an interface between Google’s online services and the client which will take advantage of them – in our case, the SSIS package.
By default, the Google API setup places the DLLs in C:\Program Files\Google\Google Data API SDK\Redist. When you open up this location you should see a number of DLL files; the ones we are interested in are Google.GData.Client.dll, Google.GData.Extensions.dll and Google.GData.Spreadsheets.dll. Next, we need to register those in the GAC, otherwise known as the Global Assembly Cache. For this purpose you can use the gacutil.exe executable called from a batch file or the command line, the GacView tool if you are afraid of the console, or simply copy those three DLL files into the C:\Windows\assembly folder. If for some reason you get an error, you should change the settings for UAC (User Account Control) by typing ‘UAC’ in the Windows search pane, adjusting the settings and rebooting the computer. Once the DLLs have been registered, we can create our database table which will store our Twitter feeds. As my spread sheet has three columns, we will create a table with three attributes by executing the following code:
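A minimal table matching the three spread sheet columns might look like this (the table name and column sizes here are my assumptions):

```sql
CREATE TABLE [dbo].[Twitter_Feeds]
([UserName] [varchar](255) NULL,
[Date] [varchar](50) NULL,
[Feed] [varchar](max) NULL)
```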
Next, let’s create an Integration Services package. To do this, in a blank project create two variables of String data type which will represent your Gmail user ID and Gmail password, and populate them with your credentials as per the image below.
Now, drop a Data Flow Task from the tools menu onto Control Flow tab pane and within it (in Data Flow tab) place a Script Component and an OLE DB Destination component, linking the two together as per below.
In the Script Component properties pane enter the two variables we defined earlier as ‘ReadOnlyVariables’.
Next, in the Inputs and Outputs property, create three output columns corresponding to the columns of the database table we created earlier, ensuring that the data types and lengths match your database table.
Just before we are ready to write the code, we should reference the DLLs from the Google API we downloaded previously. Go ahead and click on Edit Script, which should invoke the Visual Studio code environment, and under Solution Explorer on the right-hand side, right-click on References and select ‘Add Reference…’. From here, navigate to the directory where the DLLs are stored (on my machine they were saved under C:\Program Files (x86)\Google\Google Data API SDK\Redist) and select the Google.GData.Client.dll, Google.GData.Extensions.dll and Google.GData.Spreadsheets.dll files. When all three files have been added, click OK.
Finally, we are ready to write some code. Enter the following C# script (ensuring that appropriate adjustments are made to reflect your spread sheet/file name, sheet/tab name and column headings) and when finished, save the code and run the package.
using Google.GData.Client;
using Google.GData.Spreadsheets;

public class ScriptMain : UserComponent
{
    public override void PreExecute() { base.PreExecute(); }

    public override void PostExecute() { base.PostExecute(); }

    public override void CreateNewOutputRows()
    {
        // authenticate with the two package variables holding the Gmail credentials
        // (the variable names below must match those defined in your package)
        SpreadsheetsService GoogleExcelService = new SpreadsheetsService("Spreadsheet");
        GoogleExcelService.setUserCredentials(Variables.GmailUserID, Variables.GmailPassword);

        SpreadsheetQuery query = new SpreadsheetQuery();
        SpreadsheetFeed myFeed = GoogleExcelService.Query(query);

        foreach (SpreadsheetEntry mySpread in myFeed.Entries)
        {
            if (mySpread.Title.Text == "Twitter_Data")
            {
                WorksheetFeed wfeed = mySpread.Worksheets;
                foreach (WorksheetEntry wsheet in wfeed.Entries)
                {
                    if (wsheet.Title.Text == "Sheet1")
                    {
                        AtomLink atm = wsheet.Links.FindService(GDataSpreadsheetsNameTable.ListRel, null);
                        ListQuery Lquery = new ListQuery(atm.HRef.ToString());
                        ListFeed LFeed = GoogleExcelService.Query(Lquery);
                        foreach (ListEntry LmySpread in LFeed.Entries)
                        {
                            // one output row per spreadsheet row; columns map by position
                            Output0Buffer.AddRow();
                            Output0Buffer.UserName = LmySpread.Elements[0].Value;
                            Output0Buffer.Date = LmySpread.Elements[1].Value;
                            Output0Buffer.Feed = LmySpread.Elements[2].Value;
                        }
                    }
                }
            }
        }
    }
}
Hopefully, everything executed as expected and when you query the table you should see the tweets extracted from the Google spread sheet.
In the second part of this two-part series I want to explore sentiment analysis using the freely available RapidMiner tool, based on the data we have collected, and create a mining model which can be used for Twitter feed classification.
My name is Marcin and this site is a random collection of (after)thoughts, recipes, reflections and desultory posts about the world of BI, data analytics and everything that I fancy and categorize under the BI umbrella. I'm a native of Poland but since my university days I have lived in Melbourne, Australia, and worked as a DBA, BI Developer, Business Analyst and BI Consultant. My main interests lie in both the technical aspects of Business Intelligence (primarily the Microsoft BI stack i.e. MS SQL Server, SSIS, SSRS, SSAS, SharePoint, PowerPivot), data modeling and systems architecture, as well as the business applications of BI solutions (project management, corporate data management strategies, enterprise BI solution implementation). On the whole, I am very fond of anything closely or remotely related to data and as long as it can be represented as a string of ones and zeros and then analyzed and visualized, you've got my attention!
Outside sporadic updates to this site, I typically find myself fiddling with data, spending time with my kids, enjoying a good book (these days the odds are against the book), hitting the gym, or watching a good movie while eating Polish sausage with Zubrowka (best served on the rocks with apple juice and a lime twist). Please read on and if you find these posts of any interest, don't hesitate to leave me a comment!