Twitter Sentiment Analysis – Part 1. Extracting and Mining Twitter Data Using Zapier, RapidMiner and Google/Microsoft Tools

February 28th, 2013 / 22 Comments » / by admin

In this short series (two parts – second part can be found HERE) I want to expand on the subject of sentiment analysis of Twitter data through data mining techniques. In the previous post I showed how to extract Twitter data using an SSIS package, load it into a relational database, and create a small cube to show the harvested tweets in a pivot table. For me, the best part of working on that solution was creating a stored procedure which determines Twitter feed sentiment based on a dictionary of most commonly used words. It was a pretty rudimentary approach; nevertheless, it worked well and turned out to be an effective way of analysing social media data using bare SQL. In the next few posts I want to elaborate on the sentiment analysis part and, rather than using SQL, show how to conduct such analysis using a more sophisticated tool: RapidMiner, a free data mining application.

This series will be split into two separate posts – the first one about the methods of tweet extraction and collection and the second one on the actual sentiment analysis. In this first post I want to focus on how to amass a decent pool of tweets in two different ways: using a service called Zapier, Google Docs and a handy little tool called GDocBackUpCMD, as well as SSIS and a little bit of C#. Each way has its pros and cons, but both are more robust than using the RSS/ATOM feed approach shown in my previous post.

Firstly, let's look at Zapier. Zapier is a great service which allows web applications to 'talk' to each other by providing an integration platform for a number of popular web services e.g. YouTube, Gmail, Evernote, PayPal etc. It is even possible to integrate the data into MS SQL Server; however, this is classed as a premium service with a higher cost involved. We will use Zapier to extract Twitter feeds into a Google Docs spreadsheet and then copy the data across to our local environment to mine it for sentiment trends. To get started, make sure you have a Twitter account and a Google account, and have set up a Zapier account as well. Next, in Google Docs, create an empty spreadsheet with three columns and their headings – I have labeled mine 'UserName' for the feed author, 'Date' for when the tweet was posted and 'Feed' representing the content of the tweet. Also, make sure that you enter a dummy record in one of the cells – I have added 'test row' as per the screenshot below. For some reason Zapier requires that in order to validate the object it will be writing to.

Google_Spreadsheet_Twitter_Data

Once all your accounts are active and you have a Google Docs spreadsheet skeleton ready, building a zap is really easy. Just select the services you wish to integrate by dragging and dropping them onto the top process pane, configure the parameters and enable the integration. Here, I have placed Twitter as a source and Google Docs as a target.

Zapier_for_Twitter_1

Zapier_for_Twitter_2

Next, enable Zapier access to those services and configure the parameters. Below, on the left hand side, I have the 'Search Mentions' option selected and the word 'Bigdata' entered to account for all tweets which contain this string. On the right hand side I have connected to my previously created 'Twitter_Data' spreadsheet and populated each spreadsheet column with a Twitter variable by dragging and dropping (left to right) individual tweet attributes e.g. the Twitter attribute called 'Text', which is the actual tweet content, will go into the spreadsheet column named 'Feed'. When this task is completed you also have the choice to set up filters and pump some sample data into the document to ensure everything has been configured correctly.

Zapier_for_Twitter_3

Finally, enable the zap and bingo – every 15 minutes (using a free account) you should see your Twitter data populating the spreadsheet. The complete zap should resemble the image below.

Zapier_for_Twitter_4

Provided your zap has run, we are ready to copy the Google Docs data into our local environment. As mentioned before, in this post I will explain how to achieve this using two different methods – a free tool called GDocBackUpCMD with a small batch file and an SSIS package for the database load, and a little C# hacking inside SQL Server Integration Services.

Using Excel, SSIS and GDocBackUp Utility

Let's start with the GDocBackUp utility. Download the CMD version from HERE (a GUI version is also available but will not work in this context) and unpack it on your C:\ drive into a 'GDocBackUpCMD' directory. Also on your C:\ drive create an empty directory called 'Tweets' – this will be the place where our local spreadsheet copy will be stored. Next, open up Notepad or another text editor and create a new file. Type the following content into your file, making sure you populate the 'username' and 'password' variables with your credentials.

@echo off
echo Exporting Google Docs spreadsheets to c:\Tweets ...
c:\GDocBackupCMD\GDocBackupCMD.exe -mode=backup -username=YourUserName -password=YourPassword -destDir=c:\Tweets -docF=xls -sprsF=xls -presF=xls -drawF=xls

Save the file on the C:\ drive as 'Export_Tweets.bat'. The two directories and the batch file should look as per the image below.

Twitter_Export_Local_Dir_View

Now we can execute the file to see if the process works as expected. Double-clicking on the batch file should execute it in the console view and, provided all the parameters were entered correctly, run the export. After completion we should also have a copy of our Google Docs spreadsheet in our local 'Tweets' folder.

GDocBackUp_Console_View

Finally, all that is left to do is to create a database table into which all the Twitter data will be exported, and a small SSIS package which will populate it from the local spreadsheet copy as per below.

Twitter_Data_Export_SSIS_View_2

Naturally, if you wish to make it a continuous process, a more elaborate package would be required to handle truncation (of either the Google spreadsheet or the database table), data comparison against already stored tweets and some sort of looping, along the lines of the sketch below.
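As a rough illustration, and assuming the SSIS package first lands the spreadsheet contents in a staging table (a hypothetical dbo.Twitter_Feed_Staging with the same three columns as the dbo.Twitter_Feed table created later in this post), the 'only load what is new' part could be handled with T-SQL along these lines:

-- Minimal sketch: copy across only those staged tweets not already present in the target table.
-- dbo.Twitter_Feed_Staging is a hypothetical landing table for the spreadsheet load;
-- the (UserName, Date, Feed) combination serves as a crude natural key.
INSERT INTO dbo.Twitter_Feed ([UserName], [Date], [Feed])
SELECT s.[UserName], s.[Date], s.[Feed]
FROM dbo.Twitter_Feed_Staging AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.Twitter_Feed AS t
    WHERE t.[UserName] = s.[UserName]
      AND t.[Date]     = s.[Date]
      AND t.[Feed]     = s.[Feed]
);

-- Once the new rows have been copied across, clear the staging table down.
TRUNCATE TABLE dbo.Twitter_Feed_Staging;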

This is just one of the ways to export a Google Docs spreadsheet into our local environment. Another method involves a little bit of C# but in turn does not rely on other utilities and tools; it can all be done in SSIS.

Using C#, SSIS and Google API SDK

First, we need to download the Google Data API SDK for the .NET platform, which can be obtained from HERE. The purpose of the API SDK is to create an interface between Google's online services and the client which will take advantage of them – in our case, the SSIS package.

Google_Data_API_WebView

By default, the Google API setup places the DLLs in C:\Program Files\Google\Google Data API SDK\Redist. When you open up this location you should see a number of DLL files; the ones we are interested in are Google.GData.Client.dll, Google.GData.Extensions.dll and Google.GData.Spreadsheets.dll. Next, we need to register those in the GAC (Global Assembly Cache). For this purpose you can use the gacutil.exe executable called from a batch file or command line (e.g. gacutil /i Google.GData.Client.dll), the GacView tool if you are afraid of the console, or simply copy those three DLL files into the C:\Windows\assembly folder. If for some reason you get an error, you may need to change the settings for UAC (User Account Control) by typing 'UAC' in the Windows search pane, adjusting the settings and rebooting the computer. Once the DLLs have been registered we can create our database table which will store our Twitter feeds. As my spreadsheet has three columns, we will create a table with three attributes, executing the following code:

CREATE TABLE [dbo].[Twitter_Feed](
[Feed]     [varchar] (300) NULL,
[Date]     [varchar] (100) NULL,
[UserName] [varchar] (100) NULL)

Next, let's create an Integration Services package. To do this, in a blank project create two variables of string data type which will represent your Gmail user ID and Gmail password, and populate them with your credentials as per the image below.

Twitter_Data_Export_Job_Variables

Now, drop a Data Flow Task from the toolbox onto the Control Flow pane and within it (in the Data Flow tab) place a Script Component (configured as a source) and an OLE DB Destination component, linking the two together as per below.

Twitter_Data_Export_SSIS_View_1

In the Script Component properties pane enter the two variables we defined earlier as ‘ReadOnlyVariables’.

Twitter_Data_Export_Variables_Use

Next, in the Inputs and Outputs property page create three output columns corresponding to the columns of the database table we created earlier, ensuring that the data types and lengths match your database table.

Inputs_Outputs

Just before we are ready to write the code, we should reference the DLLs from the Google API we downloaded previously. Go ahead and click on Edit Script, which should invoke the Visual Studio code environment, and under Solution Explorer on the right hand side, right-click on References and select 'Add Reference…'. From here, navigate to the directory where the DLLs are stored (on my machine they were saved under C:\Program Files (x86)\Google\Google Data API SDK\Redist) and select the Google.GData.Client.dll, Google.GData.Extensions.dll and Google.GData.Spreadsheets.dll files. When all three files have been added, click OK.

Google_Data_DLL_Refs

Finally, we are ready to write some code. Enter the following C# script (ensuring that appropriate adjustments are made to reflect your spreadsheet/file name, sheet/tab name and column headings) and when finished, save the code and run the package.

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using Google.GData;
using Google.GData.Client;
using Google.GData.Extensions;
using Google.GData.Spreadsheets;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    public override void PreExecute()
    {
        base.PreExecute();
    }
    public override void PostExecute()
    {
        base.PostExecute();
    }
    public override void CreateNewOutputRows()
    {
        // Connect to the Google Spreadsheets service using the credentials stored in the package variables
        SpreadsheetsService GoogleExcelService;
        GoogleExcelService = new SpreadsheetsService("Spreadsheet");
        GoogleExcelService.setUserCredentials(Variables.GmailUserID, Variables.GmailPassword);

        // Retrieve the list of spreadsheets available to this account
        SpreadsheetQuery query = new SpreadsheetQuery();
        SpreadsheetFeed myFeed = GoogleExcelService.Query(query);

        // Locate the 'Twitter_Data' spreadsheet and its 'Sheet1' worksheet
        foreach (SpreadsheetEntry mySpread in myFeed.Entries)
        {
            if (mySpread.Title.Text == "Twitter_Data")
            {
                WorksheetFeed wfeed = mySpread.Worksheets;
                foreach (WorksheetEntry wsheet in wfeed.Entries)
                {
                    if (wsheet.Title.Text == "Sheet1")
                    {
                        // Query the worksheet's list feed and emit one output row per spreadsheet row
                        AtomLink atm = wsheet.Links.FindService(GDataSpreadsheetsNameTable.ListRel, null);
                        ListQuery Lquery = new ListQuery(atm.HRef.ToString());
                        ListFeed LFeed = GoogleExcelService.Query(Lquery);
                        foreach (ListEntry LmySpread in LFeed.Entries)
                        {
                            Output0Buffer.AddRow();
                            Output0Buffer.UserName = LmySpread.Elements[0].Value;
                            Output0Buffer.Date = LmySpread.Elements[1].Value;
                            Output0Buffer.Feed = LmySpread.Elements[2].Value;
                        }
                        }
                    }
                }
            }
        }
    }
}

Hopefully, everything executed as expected and when you query the table you should see the tweets extracted from the Google spreadsheet.
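For a quick sanity check, a simple query along these lines (against the table and columns created earlier) will show what has landed:

-- Preview a handful of the loaded tweets; note that [Date] is stored as plain text
-- in this simple schema, so no chronological ordering is attempted here
SELECT TOP (10) [UserName], [Date], [Feed]
FROM [dbo].[Twitter_Feed];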

Twitter_Data_Export_Table_Target_Tbl_View

In the second part of this two-part series I want to explore sentiment analysis using the freely available RapidMiner tool, based on the data we have collected, and create a mining model which can be used for Twitter feed classification. Finally, if you're after a more ad hoc/playful sentiment analysis tool for Twitter data and (optionally) have some basic knowledge of the Python programming language, check out my post on the etcML web based tool under THIS link.


Twitter Data Analysis using Microsoft SQL Server, C#, SSIS, SSAS and Excel

January 26th, 2013 / 11 Comments » / by admin

With the advent of social media, which seems to be penetrating more and more aspects of our lives, BI is also starting to look at the value it can derive from it. There are many providers offering insight into the data generated by social sites for a small fee, and with billions of people using Twitter, Facebook, Pinterest etc. you can see why this type of data has become a precious commodity. Microsoft's BI tools do not offer default integration with social media applications; however, with third-party plug-ins e.g. NodeXL for Excel, or a little hacking, anyone can create a basic solution that will provide just that. This post explains how to put together an SSIS package and build a small SSAS cube with a report on top of it to look at Twitter feeds. The premise is quite simple: create an ETL routine which will harvest Twitter data at certain intervals, store it in a fact table in a SQL Server database, try to determine the message sentiment (whether it is positive, negative or neutral), build an SSAS cube based on the data collected and display the tweets in an Excel based report.

I will not go through the entire solution step-by-step; however, I think it is worthwhile to go through some key components which constitute this project to clarify the reasons for the approach taken. Please also bear in mind that this is not a production-ready product but rather a 'quick and dirty' way to build a basic Twitter data analysis solution in a short space of time. If you are more serious about Twitter data, connecting to their API programmatically would be the preferred option. Also, if you are keen to replicate this method or create something similar, all the files used (SSIS, SSAS, SQL, TXT etc.) are located HERE for download.

Now let's take a look at how we can take advantage of the RSS/ATOM feed functionality to connect to Twitter and harvest the feeds using Microsoft's SQL Server Integration Services. For that purpose I have created a simple FOR LOOP Container (point 1 on the first image) which houses the rest of the transformations.

Twitter_Solution_SSIS_Control_Flow_Main_View

The purpose of this is to be able to periodically connect to the Twitter RSS feeds via a series of HTTP connections, load the feeds into a database and process the cube without having to create a SQL Server Agent job which would kick off at set intervals (nice and simple). This loop has the expression TRUE == TRUE in its EvalExpression property. As this expression always evaluates to TRUE, the loop never stops, so we can leave the package running (in a development environment only) until we are satisfied we have enough data to analyze. Next, we have a simple EXECUTE SQL TASK (point 2 on the first image) as the package starting point with a short SQL statement, WAITFOR DELAY '000:00:05'. This establishes the time interval between subsequent re-connections. You can change the value to whatever you consider appropriate, but since I didn't want to lose any feeds (by default, with this method you can only get the top 15 feeds for a given hashtag) I have it executing quite frequently. Next, we have two PRECEDENCE CONSTRAINTS (points 3 and 4 on the first image) which point in two different directions – on the left hand side is the Twitter feed harvesting logic, on the right hand side is the fact table population and SSAS cube processing logic. The logic in those PRECEDENCE CONSTRAINTS evaluates which 'way to go' based on an expression:

Twitter_Solution_SSIS_Precedence_Constraint_Logic

This allows the left hand side to run every time except when the minute part of the server time is 15, 30, 45 or 00. These are the times when the right hand side executes to process the fact table and the cube. This keeps the cube fairly up-to-date and does not permit the staging table, where the feeds are initially inserted, to grow too much (it gets truncated every time the fact table is updated as per the first image, point 9). The left hand side of the FOR LOOP CONTAINER is where most of the heavy lifting is done. The DATA FLOW TASK called 'Extract From All Feeds' (point 5 on the first image) is where the RSS Twitter connections are made using a C# script component and the data is routed to a staging table. The actual C# code can easily be found by searching on the Internet. This is what it looks like for the #MICROSOFT feed data extraction (all other feeds use the same code, the only difference being the HTTP connection assigned):

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.Xml;
using System.ServiceModel.Syndication;
using System.ServiceModel.Web;
using System.Text;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private SyndicationFeed myfeed = null;
    private XmlReader myxmlreader = null;

    public override void PreExecute()
    {
        base.PreExecute();
        // Open the RSS/ATOM feed defined by the package's HTTP connection manager and load it
        myxmlreader = XmlReader.Create(Connections.Connection.ConnectionString);
        myfeed = SyndicationFeed.Load(myxmlreader);
    }
    public override void PostExecute()
    {
        base.PostExecute();
    }
    public override void CreateNewOutputRows()
    {
        if (myfeed != null)
        {
            // Emit one output row per feed item (tweet)
            foreach (var item in myfeed.Items)
            {
                Output0Buffer.AddRow();
                Output0Buffer.TwitterID = item.Id.ToString();
                Output0Buffer.TwitterFeed = item.Title.Text;
                Output0Buffer.TwitterDate = item.PublishDate.DateTime;
                Output0Buffer.AuthorDetails = item.Authors[0].Email;
            }
            Output0Buffer.SetEndOfRowset();
        }
    }
}

When opened up you can see there are five HTTP connections, each dedicated to harvesting feeds relating only to a specific realm e.g. Microsoft, MS Office, SharePoint etc.

Twitter_Solution_SSIS_Connection_Manager

The connection string is formatted to take only feeds containing specific hashtags into consideration e.g. feeds containing the hashtag #BigData will be extracted using the following string:

Twitter_Solution_SSIS_HTTP_Connection_String

This allows us to target feeds pertaining only to specific content. The rest of the DATA FLOW TASK simply appends ID values (#MICROSOFT = 1, #BIGDATA = 2 etc.), unions the feeds together and compares the newly harvested feeds against what is already in the staging table using the MERGE component to extract only those feeds which are new.

Twitter_Solution_SSIS_Data_Flow_Main_View

Coming back to the CONTROL FLOW, connected to the 'Extract From All Feeds' DATA FLOW TASK we have an EXECUTE SQL TASK called 'Append Feed Sentiment' (point 6 on the first image), which executes a stored procedure responsible for determining whether the feed content is positive or negative (if neither, a default 'Neutral' tag is assigned). To understand the process you can have a look at the actual code (the whole solution, including SQL files, can be downloaded from HERE) but, in a nutshell, it breaks up the Twitter feed string into individual words using an SQL function, compares them to a dictionary stored in a control table (via a lookup) and, after ranking the matches, assigns the sentiment value based on that lookup. The dictionary itself is stored in the same database as a control table which is populated via a separate SSIS solution using two text files I found on the Internet (also downloadable as part of the whole solution). One file contains words which are typically viewed as negative; the other one stores words with positive connotations. The solution for sentiment control table population is quite simple as per the image below, and a sketch of what the control table itself might look like follows it.

Twitter_Solution_SSIS_Data_Flow_Term_Extraction
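To make the lookup more concrete, here is a minimal sketch of what the Ctl_Sentiment control table might look like. The table name and the Term/Term_Category columns come from the stored procedure further down; the data types and the sample rows are my own assumptions:

CREATE TABLE dbo.Ctl_Sentiment
    (
      Term          VARCHAR(100) NOT NULL , -- individual word taken from the positive/negative word lists
      Term_Category VARCHAR(20)  NOT NULL   -- e.g. 'Positive' or 'Negative'
    )

-- Illustrative rows only; in the actual solution the table is populated from the two
-- downloaded text files via the separate SSIS package shown above
INSERT INTO dbo.Ctl_Sentiment ( Term, Term_Category )
VALUES  ( 'GREAT', 'Positive' ),
        ( 'AWESOME', 'Positive' ),
        ( 'BROKEN', 'Negative' ),
        ( 'CLOUD', 'Negative' )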

The stored procedure also extracts e-mail addresses and author names from the feed content, as well as deleting duplicate feeds. This is what the SQL code for the string-splitting function and the sentiment analysis stored procedure looks like.

--Create string splitting function
CREATE FUNCTION [dbo].[Split]
    (
      @sep VARCHAR(32) ,
      @s VARCHAR(MAX)
    )
RETURNS @t TABLE ( val VARCHAR(MAX) )
AS
    BEGIN
        DECLARE @xml XML
        SET @XML = N'<root><r>' + REPLACE(@s, @sep, '</r><r>') + '</r></root>'

        INSERT  INTO @t
                ( val
                )
                SELECT  r.value('.', 'VARCHAR(50)') AS Item
                FROM    @xml.nodes('//root/r') AS RECORDS ( r )
        RETURN
    END
GO
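-- Example usage of the function above (illustrative only): tokenise a sample tweet into individual words
-- SELECT val FROM dbo.Split(' ', 'Loving the new #BigData features')
-- returns one row per word: Loving / the / new / #BigData / features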
--Create stored procedure for sentiment analysis and other clean up tasks
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[usp_Update_Twitter_Feeds_Staging_Data]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[usp_Update_Twitter_Feeds_Staging_Data]
GO
CREATE PROCEDURE usp_Update_Twitter_Feeds_Staging_Data AS

BEGIN
BEGIN TRY

IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[#Email]') AND type in (N'U'))
DROP TABLE [dbo].[#Email];

IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[#Author]') AND type in (N'U'))
DROP TABLE [dbo].[#Author];

IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[#Sentiment]') AND type in (N'U'))
DROP TABLE [dbo].[#Sentiment];


--Extract E-mail addresses
SELECT  Twitter_ID, CASE WHEN AtIndex=0 THEN '' --no email found 
           ELSE RIGHT(head, PATINDEX('% %', REVERSE(head) + ' ') - 1) 
        + LEFT(tail + ' ', PATINDEX('% %', tail + ' ')) 
        END Author_Email
INTO #Email
FROM (SELECT Twitter_ID,RIGHT(EmbeddedEmail, [len] - AtIndex) AS tail,
             LEFT(EmbeddedEmail, AtIndex) AS head, AtIndex
			 FROM (SELECT Twitter_ID,PATINDEX('%[A-Z0-9]@[A-Z0-9]%', EmbeddedEmail+' ') AS AtIndex,
					LEN(EmbeddedEmail+'|')-1 AS [len],
					EmbeddedEmail
					FROM   (SELECT Twitter_ID, Author_Details,Author_Email from [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data])
					AS a (Twitter_ID,EmbeddedEmail,Author_Email) WHERE a.[Author_Email] IS NULL
           )a
     )b


--Extract Author Names
SELECT Twitter_ID, CASE WHEN CHARINDEX ('(', Author_Details)>1 THEN
REPLACE(SUBSTRING (Author_Details, CHARINDEX ('(', Author_Details,0) +1, LEN(Author_Details)),')','') END AS Author
INTO #Author
FROM [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data]
WHERE [Author] IS NULL



--Extract Sentiment
DECLARE @0 nvarchar (1) SET @0 = '' 
DECLARE @1 nvarchar (1) SET @1 = '>' 
DECLARE @2 nvarchar (1) SET @2 = '<' 
DECLARE @3 nvarchar (1) SET @3 = '('
DECLARE @4 nvarchar (1) SET @4 = ')'
DECLARE @5 nvarchar (1) SET @5 = '!'
DECLARE @6 nvarchar (1) SET @6 = '?'
DECLARE @7 nvarchar (1) SET @7 = ','
DECLARE @8 nvarchar (1) SET @8 = '@'
DECLARE @9 nvarchar (1) SET @9 = '#'
DECLARE @10 nvarchar (1) SET @10 = '$'
DECLARE @11 nvarchar (1) SET @11 = '%'
DECLARE @12 nvarchar (1) SET @12 = '&';      

WITH temp_results as 
	(
	SELECT  tfsd.Twitter_ID, 
	upper(ltrim(rtrim(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace
	(fs.val,@1,@0),@2,@0),@3,@0),@4,@0),@5,@0),@6,@0),@7,@0),@8,@0),@9,@0),@10,@0),@11,@0),@12,@0)))) as val, se.Term_Category
	FROM [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data] tfsd
	CROSS APPLY dbo.Split(' ',tfsd.Twitter_Feed) as fs  
	LEFT JOIN Ctl_Sentiment se on 
	upper(ltrim(rtrim(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace
	(fs.val,@1,@0),@2,@0),@3,@0),@4,@0),@5,@0),@6,@0),@7,@0),@8,@0),@9,@0),@10,@0),@11,@0),@12,@0))))
	= se.term
	WHERE tfsd.Sentiment IS NULL
	) 
SELECT Twitter_ID, Term_Category as Sentiment 
INTO #Sentiment
FROM
	(SELECT  Twitter_ID, Term_Category,rnk FROM 
		(SELECT Counts, Twitter_ID, Term_Category, RANK() OVER (PARTITION BY Twitter_ID ORDER BY Counts DESC) AS rnk FROM	
			(SELECT COUNT(Term_Category) Counts, Twitter_ID, Term_Category FROM
				temp_results 
				GROUP BY Twitter_ID, Term_Category
			) a
		)b
	where b.rnk = 1) c			

--Update Twitter_Feeds_Staging_Data
UPDATE [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data]
SET 
Author_Email	=	CASE WHEN b.Author_Email = '' THEN 'Unknown' ELSE b.Author_Email END,
Author			=	ISNULL(c.Author,'Unknown'),
Sentiment		=	ISNULL(d.Sentiment,'Neutral')
FROM [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data] a
LEFT JOIN #Email b		ON a.Twitter_ID = b.Twitter_ID
LEFT JOIN #Author c		ON a.Twitter_ID = c.Twitter_ID 
LEFT JOIN #Sentiment d	ON a.Twitter_ID = d.Twitter_ID
WHERE a.Author_Email IS NULL OR a.Author IS NULL OR a.Sentiment IS NULL
OR a.Author_Email = '' OR a.Author = '' OR a.Sentiment = ''



--Delete Duplicate Twitter Feeds
DECLARE @ID varchar (200)
DECLARE @COUNT int
DECLARE CUR_DELETE CURSOR FOR
	SELECT [Twitter_ID],COUNT([Twitter_ID]) FROM [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data] 
	GROUP BY [Twitter_ID] HAVING COUNT([Twitter_ID]) > 1

	OPEN CUR_DELETE

		FETCH NEXT FROM CUR_DELETE INTO @ID, @COUNT
		WHILE @@FETCH_STATUS = 0
		BEGIN
		DELETE TOP(@COUNT -1) FROM [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data] WHERE [Twitter_ID] = @ID
		FETCH NEXT FROM CUR_DELETE INTO @ID, @COUNT
	END
CLOSE CUR_DELETE
DEALLOCATE CUR_DELETE


END TRY

BEGIN CATCH
		IF @@TRANCOUNT > 0
		BEGIN
			ROLLBACK TRANSACTION
		END	

		DECLARE @ErrorMessage NVARCHAR(4000);
		DECLARE @ErrorSeverity INT;
		DECLARE @ErrorState INT;

		SELECT 
			@ErrorMessage = ERROR_MESSAGE(),
			@ErrorSeverity = ERROR_SEVERITY(),
			@ErrorState = ERROR_STATE();
			
		RAISERROR (@ErrorMessage, -- Message text.
				   @ErrorSeverity, -- Severity.
				   @ErrorState -- State.
				   );
END CATCH

	IF @@TRANCOUNT > 0
	BEGIN
		COMMIT TRANSACTION
	END
END

It is not an ideal solution, as I found out. To mine text data properly it takes a lot more than a list of words and SQL code to perform a comparison. For example, 'cloud computing' has become a bit of a buzzword, yet 'cloud' is a negatively trending word on the list of words used for my sentiment dictionary. Therefore, by default, any Twitter feeds containing the word 'cloud' are tagged as negative, which is not necessarily correct. However, most of the time it seemed pretty reliable and good enough for this simple exercise. On the right hand side of the FOR LOOP CONTAINER, executing at roughly 15-minute intervals, are the two EXECUTE SQL TASKS (points 7 and 9 on the first image) as well as one ANALYSIS SERVICES PROCESSING TASK (point 8 on the first image). The first task is responsible for updating the fact table with fresh tweets using a stored procedure. Next, going down the PRECEDENCE CONSTRAINT lines, we have a standard Analysis Services cube processing task pointing to the SSAS cube and finally another SQL task which simply truncates the staging table once all the tweets have been incorporated into the fact table; in essence the fact table refresh boils down to the sketch below. That is it. As mentioned before, all this can be downloaded from HERE and, minus the actual server connections which rely on your specific environment, you shouldn't have any problems replicating this setup if you wish to.
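A minimal sketch of that right hand side in T-SQL terms is shown below. The staging table name comes from the stored procedure above; the fact table name and the exact column lists are my assumptions – the real definitions live in the downloadable solution:

-- Move newly processed tweets from the staging table into the fact table
-- (Fact_Twitter_Feeds and its column list are assumed names for illustration only)
INSERT INTO [Twitter_DB].[dbo].[Fact_Twitter_Feeds]
        ( Twitter_ID, Twitter_Feed, Twitter_Date, Author, Author_Email, Sentiment )
SELECT  s.Twitter_ID, s.Twitter_Feed, s.Twitter_Date, s.Author, s.Author_Email, s.Sentiment
FROM    [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data] s
WHERE   NOT EXISTS ( SELECT 1
                     FROM   [Twitter_DB].[dbo].[Fact_Twitter_Feeds] f
                     WHERE  f.Twitter_ID = s.Twitter_ID )

-- Once incorporated into the fact table, the staging table is truncated (point 9 on the first image)
TRUNCATE TABLE [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data]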

Based on the tweet data I managed to harvest through this simple SSIS package, and the SSAS cube I created as a separate solution, I was able to generate a basic Excel report which provides some rudimentary analytics on the volume of feeds and their sentiment per hashtag and the date/time they were generated. Here is the SSAS solution view.

Twitter_Solution_SSAS_Cube_View

And here is a simple pivot table with some of the cube's dimensions and measures displayed.

Twitter_Solution_Excel_Report

So there you go! A really elementary solution for extracting Twitter feeds using SSIS. Below is sample footage which shows the package in execution mode. Pay attention to how the left hand side executes multiple times first, searching for new tweets, until the server time reaches 10.00 P.M., which triggers the right hand side of the FOR LOOP CONTAINER to process the fact table and the SSAS cube.
