Twitter Data Analysis using Microsoft SQL Server, C#, SSIS, SSAS and Excel

January 26th, 2013 / by admin

With the advent of social media, which seems to be penetrating more and more aspects of our lives, BI is also starting to look at the value it can derive from it. There are many providers offering insight into the data generated by social sites for a small fee, and with billions of people using Twitter, Facebook, Pinterest etc. you can see why this type of data has become a precious commodity. Microsoft’s BI tools do not offer default integration with social media applications; however, with third-party plug-ins e.g. NodeXL for Excel, or a little hacking, anyone can create a basic solution that provides just that. This post explains how to put together an SSIS package and build a small SSAS cube with a report on top of it to look at Twitter feeds. The premise is quite simple: create an ETL routine which harvests Twitter data at certain intervals, stores it in a fact table in a SQL Server database, tries to determine each message’s sentiment (whether it is positive, negative or neutral), builds an SSAS cube based on the data collected and displays the tweets on an Excel based report.

I will not go through the entire solution step-by-step; however, I think it is worthwhile to go through some of the key components which constitute this project to clarify the reasons for the approach taken. Please also bear in mind that this is not a production-ready product but rather a ‘quick and dirty’ way to build a basic Twitter data analysis solution in a short space of time. If you are more serious about Twitter data, connecting to their API programmatically would be the preferred option. Also, if you are keen to replicate this method or create something similar, all the files used (SSIS, SSAS, SQL, TXT etc.) are located HERE for download.

Now let’s take a look at how we can take advantage of RSS/Atom feed functionality to connect to Twitter and harvest the feeds using Microsoft’s SQL Server Integration Services. For that purpose I have created a simple FOR LOOP Container (point 1 on the first image) which houses the rest of the transformations.

Twitter_Solution_SSIS_Control_Flow_Main_View

The purpose of this is to be able to periodically connect to Twitter RSS via a series of HTTP connections, load the feeds into a database and process the cube without having to create a SQL Server Agent job which would kick off at set intervals (nice and simple). The loop has the expression ‘TRUE == TRUE’ in its EvalExpression property. As this expression always evaluates to TRUE, the loop never stops, so we can leave the package running (in a development environment only) until we are satisfied we have enough data to analyze. Next, we have a simple EXECUTE SQL TASK (point 2 on the first image) as a package starting point with a short SQL statement: WAITFOR DELAY '000:00:05'. This enables us to establish time intervals between the subsequent re-connections. You can change the value to whatever you consider appropriate, but since I didn’t want to lose any feeds (by default, with this method you can only get the top 15 feeds for a given hashtag) I have it executing quite frequently. Next, we have two PRECEDENCE CONSTRAINTS (points 3 and 4 on the first image) which point in two different directions – on the left hand side is the Twitter feeds harvesting logic, on the right hand side is the fact table population and SSAS cube processing logic. The logic in those PRECEDENCE CONSTRAINTS evaluates which ‘way to go’ based on an expression:

Twitter_Solution_SSIS_Precedence_Constraint_Logic
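In case the screenshot above does not display: the expression on the constraint leading to the right hand side boils down to checking the minute part of the current server time, something along the lines of DATEPART("mi", GETDATE()) == 0 || DATEPART("mi", GETDATE()) == 15 || DATEPART("mi", GETDATE()) == 30 || DATEPART("mi", GETDATE()) == 45 in SSIS expression syntax, with the left hand side constraint being its negation (the exact expression used in the downloadable package may differ slightly).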

This allows the left hand side to run every time except when the minute part of the server time is 15, 30, 45 or 00. These are the times when the right hand side executes to process the fact table and the cube. This keeps the cube fairly up-to-date and does not permit the staging table, where the feeds are initially inserted, to grow too much (it gets truncated every time the fact table is updated, as per the first image, point 9). The left hand side of the FOR LOOP CONTAINER is where most of the heavy lifting is done. The DATA FLOW TASK called ‘Extract From All Feeds’ (point 5 on the first image) is where the RSS Twitter connections are made using a C# script component and the data is routed to a staging table. The actual C# code can easily be found searching on the Internet. This is what it looks like for #MICROSOFT feeds data extraction (all other feeds use the same code, the only difference is the HTTP connection assigned):

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.Xml;
using System.ServiceModel.Syndication;
using System.ServiceModel.Web;
using System.Text;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private SyndicationFeed myfeed = null;
    private XmlReader myxmlreader = null;

    public override void PreExecute()
    {
        base.PreExecute();
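        // Open the feed URL supplied by the package's HTTP connection manager and parse it into a SyndicationFeed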
        myxmlreader = XmlReader.Create(Connections.Connection.ConnectionString);
        myfeed = SyndicationFeed.Load(myxmlreader);
    }
    public override void PostExecute()
    {
        base.PostExecute();
    }
    public override void CreateNewOutputRows()
    {
        if (myfeed != null)
        {
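            // Emit one output row per feed item: id, title text, publish date and author details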
            foreach (var item in myfeed.Items)
            {
                Output0Buffer.AddRow();
                Output0Buffer.TwitterID = item.Id.ToString();
                Output0Buffer.TwitterFeed = item.Title.Text;
                Output0Buffer.TwitterDate = item.PublishDate.DateTime;
                Output0Buffer.AuthorDetails = item.Authors[0].Email;
            }
            Output0Buffer.SetEndOfRowset();
        }
    }
}

When opened up, you can see there are five HTTP connections, each dedicated to harvesting feeds relating only to a specific realm e.g. Microsoft, MS Office, SharePoint etc.

Twitter_Solution_SSIS_Connection_Manager

The connection string is formatted to take only feeds containing specific hashtags into consideration, e.g. feeds containing the hashtag #BigData will be extracted using the following string:

Twitter_Solution_SSIS_HTTP_Connection_String
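In case the screenshot is not visible: at the time of writing the connection manager simply pointed at Twitter’s search feed filtered on the hashtag, i.e. a URL along the lines of http://search.twitter.com/search.atom?q=%23BigData (the exact endpoint may have changed since).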

This allows us to target feeds pertaining only to specific content. The rest of the DATA FLOW TASK simply appends ID values (#MICROSOFT = 1, #BIGDATA = 2 etc.), unions the feeds together and does a comparison between the newly harvested feeds and what is already in the staging table using the MERGE COMPONENT to extract only those feeds which are new.

Twitter_Solution_SSIS_Data_Flow_Main_View
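I haven’t included the staging table DDL in this post, but to make the shape of the data concrete, here is a rough T-SQL sketch of the staging table and of the ‘new feeds only’ comparison expressed in plain SQL rather than SSIS components. The Hashtag_ID column name and the #New_Feeds table are hypothetical; the remaining names are inferred from the script component and the stored procedure further down.

--Sketch only: staging table shape inferred from the script component and stored procedure
CREATE TABLE [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data]
(
    Twitter_ID      varchar(200),
    Twitter_Feed    nvarchar(max),
    Twitter_Date    datetime,
    Author_Details  nvarchar(400),
    Author          nvarchar(200),
    Author_Email    nvarchar(200),
    Hashtag_ID      int,              --hypothetical name for the appended feed ID (#MICROSOFT = 1, #BIGDATA = 2 etc.)
    Sentiment       varchar(20)
)

--Equivalent of the SSIS comparison: only feeds not already staged get inserted
INSERT INTO [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data]
    (Twitter_ID, Twitter_Feed, Twitter_Date, Author_Details, Hashtag_ID)
SELECT n.Twitter_ID, n.Twitter_Feed, n.Twitter_Date, n.Author_Details, n.Hashtag_ID
FROM #New_Feeds n                     --hypothetical table holding the freshly harvested feeds
WHERE NOT EXISTS
    (SELECT 1 FROM [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data] s
     WHERE s.Twitter_ID = n.Twitter_ID)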

Coming back to the CONTROL FLOW, connected to the ‘Extract From All Feeds’ DATA FLOW TASK we have an EXECUTE SQL TASK called ‘Append Feed Sentiment’ (point 6 on the first image) which executes a stored procedure responsible for determining whether the feed content is positive or negative (if neither, a default ‘Neutral’ tag is assigned). To understand the process you can have a look at the actual code (the whole solution, including SQL files, can be downloaded from HERE) but, in a nutshell, it breaks up the Twitter feed string into individual words using a SQL function, compares them to a dictionary stored in a control table (via a lookup) and, after ranking the matches, assigns the sentiment value based on that lookup. The dictionary itself is stored in the same database as a control table which is populated via a separate SSIS solution using two text files I found on the Internet (also downloadable as part of the whole solution). One file contains words which are typically viewed as negative; the other one stores words with positive connotations. The solution for sentiment control table population is quite simple, as per the image below.

Twitter_Solution_SSIS_Data_Flow_Term_Extraction
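The control table itself is just a term-to-category mapping. Its name and the Term/Term_Category columns come from the stored procedure below; everything else in this sketch (data types, sample rows) is purely illustrative, as the real population happens in the separate SSIS package shown above.

--Sketch only: sentiment dictionary control table
CREATE TABLE dbo.Ctl_Sentiment
(
    Term            varchar(100),
    Term_Category   varchar(20)       --e.g. 'Positive' or 'Negative'
)

--A handful of illustrative rows; the actual lists come from the two downloadable text files
INSERT INTO dbo.Ctl_Sentiment (Term, Term_Category)
SELECT 'GREAT', 'Positive' UNION ALL
SELECT 'LOVE', 'Positive' UNION ALL
SELECT 'FAIL', 'Negative' UNION ALL
SELECT 'CLOUD', 'Negative'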

The stored procedure also extracts e-mail addresses and author names from the feed content, as well as deleting duplicate feeds. This is what the SQL code for the string splitting function and the sentiment analysis stored procedure looks like.

--Create string splitting function
CREATE FUNCTION [dbo].[Split]
    (
      @sep VARCHAR(32) ,
      @s VARCHAR(MAX)
    )
RETURNS @t TABLE ( val VARCHAR(MAX) )
AS
    BEGIN
        DECLARE @xml XML
        SET @xml = N'<root><r>' + REPLACE(@s, @sep, '</r><r>') + '</r></root>'

        INSERT  INTO @t
                ( val
                )
                SELECT  r.value('.', 'VARCHAR(50)') AS Item
                FROM    @xml.nodes('//root/r') AS RECORDS ( r )
        RETURN
    END
GO
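--Illustrative usage (not part of the original script): split a feed into individual words on spaces
--SELECT val FROM dbo.Split(' ', 'Loving the new #BigData features in SQL Server')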
--Create stored procedure for sentiment analysis and other clean up tasks
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[usp_Update_Twitter_Feeds_Staging_Data]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[usp_Update_Twitter_Feeds_Staging_Data]
GO
CREATE PROCEDURE usp_Update_Twitter_Feeds_Staging_Data AS

BEGIN
BEGIN TRY

--Temp tables live in tempdb, so check for them there
IF OBJECT_ID(N'tempdb..#Email') IS NOT NULL
DROP TABLE #Email;

IF OBJECT_ID(N'tempdb..#Author') IS NOT NULL
DROP TABLE #Author;

IF OBJECT_ID(N'tempdb..#Sentiment') IS NOT NULL
DROP TABLE #Sentiment;


--Extract E-mail addresses
SELECT  Twitter_ID, CASE WHEN AtIndex=0 THEN '' --no email found 
           ELSE RIGHT(head, PATINDEX('% %', REVERSE(head) + ' ') - 1) 
        + LEFT(tail + ' ', PATINDEX('% %', tail + ' ')) 
        END Author_Email
INTO #Email
FROM (SELECT Twitter_ID,RIGHT(EmbeddedEmail, [len] - AtIndex) AS tail,
             LEFT(EmbeddedEmail, AtIndex) AS head, AtIndex
			 FROM (SELECT Twitter_ID,PATINDEX('%[A-Z0-9]@[A-Z0-9]%', EmbeddedEmail+' ') AS AtIndex,
					LEN(EmbeddedEmail+'|')-1 AS [len],
					EmbeddedEmail
					FROM   (SELECT Twitter_ID, Author_Details,Author_Email from [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data])
					AS a (Twitter_ID,EmbeddedEmail,Author_Email) WHERE a.[Author_Email] IS NULL
           )a
     )b


--Extract Author Names
SELECT Twitter_ID, CASE WHEN CHARINDEX ('(', Author_Details)>1 THEN
REPLACE(SUBSTRING (Author_Details, CHARINDEX ('(', Author_Details,0) +1, LEN(Author_Details)),')','') END AS Author
INTO #Author
FROM [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data]
WHERE [Author] IS NULL



--Extract Sentiment
DECLARE @0 nvarchar (1) SET @0 = '' 
DECLARE @1 nvarchar (1) SET @1 = '>' 
DECLARE @2 nvarchar (1) SET @2 = '<' 
DECLARE @3 nvarchar (1) SET @3 = '('
DECLARE @4 nvarchar (1) SET @4 = ')'
DECLARE @5 nvarchar (1) SET @5 = '!'
DECLARE @6 nvarchar (1) SET @6 = '?'
DECLARE @7 nvarchar (1) SET @7 = ','
DECLARE @8 nvarchar (1) SET @8 = '@'
DECLARE @9 nvarchar (1) SET @9 = '#'
DECLARE @10 nvarchar (1) SET @10 = '$'
DECLARE @11 nvarchar (1) SET @11 = '%'
DECLARE @12 nvarchar (1) SET @12 = '&';      

WITH temp_results as 
	(
	SELECT  tfsd.Twitter_ID, 
	upper(ltrim(rtrim(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace
	(fs.val,@1,@0),@2,@0),@3,@0),@4,@0),@5,@0),@6,@0),@7,@0),@8,@0),@9,@0),@10,@0),@11,@0),@12,@0)))) as val, se.Term_Category
	FROM [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data] tfsd
	CROSS APPLY dbo.Split(' ',tfsd.Twitter_Feed) as fs  
	LEFT JOIN Ctl_Sentiment se on 
	upper(ltrim(rtrim(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace
	(fs.val,@1,@0),@2,@0),@3,@0),@4,@0),@5,@0),@6,@0),@7,@0),@8,@0),@9,@0),@10,@0),@11,@0),@12,@0))))
	= se.term
	WHERE tfsd.Sentiment IS NULL
	) 
SELECT Twitter_ID, Term_Category as Sentiment 
INTO #Sentiment
FROM
	(SELECT  Twitter_ID, Term_Category,rnk FROM 
		(SELECT Counts, Twitter_ID, Term_Category, RANK() OVER (PARTITION BY Twitter_ID ORDER BY Counts DESC) AS rnk FROM	
			(SELECT COUNT(Term_Category) Counts, Twitter_ID, Term_Category FROM
				temp_results 
				GROUP BY Twitter_ID, Term_Category
			) a
		)b
	where b.rnk = 1) c			

--Update Twitter_Feeds_Staging_Data
UPDATE [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data]
SET 
Author_Email	=	CASE WHEN b.Author_Email = '' THEN 'Unknown' ELSE b.Author_Email END,
Author			=	ISNULL(c.Author,'Unknown'),
Sentiment		=	ISNULL(d.Sentiment,'Neutral')
FROM [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data] a
LEFT JOIN #Email b		ON a.Twitter_ID = b.Twitter_ID
LEFT JOIN #Author c		ON a.Twitter_ID = c.Twitter_ID 
LEFT JOIN #Sentiment d	ON a.Twitter_ID = d.Twitter_ID
WHERE a.Author_Email IS NULL OR a.Author IS NULL OR a.Sentiment IS NULL
OR a.Author_Email = '' OR a.Author = '' OR a.Sentiment = ''



--Delete Duplicate Twitter Feeds
DECLARE @ID varchar (200)
DECLARE @COUNT int
DECLARE CUR_DELETE CURSOR FOR
	SELECT [Twitter_ID],COUNT([Twitter_ID]) FROM [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data] 
	GROUP BY [Twitter_ID] HAVING COUNT([Twitter_ID]) > 1

	OPEN CUR_DELETE

		FETCH NEXT FROM CUR_DELETE INTO @ID, @COUNT
		WHILE @@FETCH_STATUS = 0
		BEGIN
		DELETE TOP(@COUNT -1) FROM [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data] WHERE [Twitter_ID] = @ID
		FETCH NEXT FROM CUR_DELETE INTO @ID, @COUNT
	END
CLOSE CUR_DELETE
DEALLOCATE CUR_DELETE


END TRY

BEGIN CATCH
		IF @@TRANCOUNT > 0
		BEGIN
			ROLLBACK TRANSACTION
		END	

		DECLARE @ErrorMessage NVARCHAR(4000);
		DECLARE @ErrorSeverity INT;
		DECLARE @ErrorState INT;

		SELECT 
			@ErrorMessage = ERROR_MESSAGE(),
			@ErrorSeverity = ERROR_SEVERITY(),
			@ErrorState = ERROR_STATE();
			
		RAISERROR (@ErrorMessage, -- Message text.
				   @ErrorSeverity, -- Severity.
				   @ErrorState -- State.
				   );
END CATCH

	IF @@TRANCOUNT > 0
	BEGIN
		COMMIT TRANSACTION
	END
END
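As a side note, the cursor-based duplicate removal above could also be written set-based. A minimal sketch against the same staging table, functionally equivalent in that it keeps one row per Twitter_ID:

;WITH Dupes AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY Twitter_ID ORDER BY Twitter_ID) AS rn
    FROM [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data]
)
DELETE FROM Dupes WHERE rn > 1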

It is not an ideal solution, as I found out. Proper text mining takes a lot more than a list of words and some SQL code to perform a comparison. For example, as ‘cloud computing’ has become a bit of a buzzword and ‘cloud’ appears on the list of negatively trending words used for my sentiment lookup, any Twitter feed containing the word ‘cloud’ is by default tagged as negative, which is not necessarily correct. However, most of the time it seemed reliable enough for this simple exercise. On the right hand side of the FOR LOOP CONTAINER, executing in roughly 15 minute intervals, are the two EXECUTE SQL TASKS (points 7 and 9 on the first image) as well as one ANALYSIS SERVICES PROCESSING TASK (point 8 on the first image). The first task is responsible for updating the fact table with fresh tweets using a stored procedure. Next, going down the PRECEDENCE CONSTRAINT lines, we have a standard Analysis Services cube processing task pointing to the SSAS cube and finally another SQL task which simply truncates the staging table once all the tweets have been incorporated into the fact table. That is it. As mentioned before, all this can be downloaded from HERE and, minus the actual server connections which rely on your specific environment, you shouldn’t have any problems replicating this setup if you wish to.
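The fact table load and the staging clean-up scripts are part of the download; purely to illustrate the pattern, the two tasks boil down to something like the snippet below (the Fact_Twitter_Feeds name and its columns are hypothetical; only the staging table name is taken from the actual solution).

--Illustrative only: move new tweets from staging into the fact table, then clear staging
INSERT INTO dbo.Fact_Twitter_Feeds
    (Twitter_ID, Hashtag_ID, Twitter_Feed, Twitter_Date, Author, Author_Email, Sentiment)
SELECT s.Twitter_ID, s.Hashtag_ID, s.Twitter_Feed, s.Twitter_Date, s.Author, s.Author_Email, s.Sentiment
FROM [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data] s
WHERE NOT EXISTS
    (SELECT 1 FROM dbo.Fact_Twitter_Feeds f WHERE f.Twitter_ID = s.Twitter_ID)

TRUNCATE TABLE [Twitter_DB].[dbo].[Twitter_Feeds_Staging_Data]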

Based on the tweet data I managed to harvest through this simple SSIS package, and the SSAS cube I created as a separate solution, I was able to generate a basic Excel report which provides some rudimentary analytics on the volume of feeds and their sentiment per hashtag and the date/time they were generated. Here is the SSAS solution view.

Twitter_Solution_SSAS_Cube_View

And here is a simple pivot table with some of the cube’s dimensions and measures displayed.

Twitter_Solution_Excel_Report

So there you go! A really elementary solution for extracting Twitter feeds using SSIS. Below is some sample footage which shows the package in execution mode. Pay attention to how the left hand side is executed multiple times first, searching for new tweets, until the server time changes to 10.00 P.M., which triggers the right hand side of the FOR LOOP CONTAINER to process the fact table and SSAS cube.


How to implement Slowly Changing Dimensions – Part 4. Using MERGE T-SQL to load dimension data

August 26th, 2012 / by admin

In the first post I briefly outlined how to set up a Slowly Changing Dimensions process using default ETL functionality (the Slowly Changing Dimension component in SSIS, SQL Server 2012). This was followed by another two posts in this series outlining the usage of the Checksum SSIS transformation as well as the Dimension Merge SCD component, which also take advantage of Microsoft’s Integration Services environment. As much as I appreciate wizard-driven process setup and a quick and painless deployment cycle, on the downside some of those components have minimal customization options and, as is the case with the default SCD component, take a big performance hit when dealing with large volumes of data. Sometimes custom coding works best and, given that the MERGE statement has been part of the T-SQL repertoire since SQL Server 2008, it is a great alternative to the out-of-the-box, standard SSIS components.

In this example I will be using mostly the same objects and dummy dataset as I used for SSIS SCD in Part 1, Part 2 and Part 3, so if you’re keen to get your hands dirty just execute the SQL below to have something to work with.

--CREATE RELATIONAL TABLES AND CONSTRAINTS. POPULATE TABLES WITH SOME DUMMY DATA
CREATE TABLE Customer(
CustomerID int IDENTITY (1,1),
First_Name varchar (50) NULL,
Middle_Name varchar (50) NULL,
Last_Name varchar (50) NULL,
Email_Address varchar (100) NULL,
Phone_Number varchar (50) NULL,
Created_Date datetime NULL,
Modified_Date datetime NULL,
CONSTRAINT pk_CustID PRIMARY KEY (CustomerID))

CREATE TABLE Address(
AddressID int IDENTITY (1,1) NOT NULL,
Address1 varchar (50) NULL,
Address2 varchar (50) NULL,
City varchar (50) NULL,
Post_Code char (4) NULL,
Is_Current_Flag char(1) NULL,
CONSTRAINT pk_AddressID PRIMARY KEY (AddressID))

CREATE TABLE Customer_Address_Bridge(
CustomerID int NOT NULL,
AddressID int NOT NULL)

GO

INSERT INTO Customer
(First_Name, Middle_Name, Last_Name,Email_Address,Phone_Number,Created_Date,Modified_Date)
SELECT 'Mary', 'Joeanne', 'Black', 'mary0120@yahoo.com.au', '03-8573-9455', '2012-01-01', NULL UNION ALL
SELECT 'John', 'Lorance', 'Moore', 'johnnym@awol.com', '03-3423-1155', '2012-01-01', '2012-05-30' UNION ALL
SELECT 'Martin', NULL, 'Laser', 'mlaser91@aol.com', '03-2355-1109', '2012-01-01', '2012-05-12' UNION ALL
SELECT 'Spencer', 'Chris', 'McEvans', 'spencerdude@hotmail.com', '03-1122-0007', '2012-01-01', '2012-05-30' UNION ALL
SELECT 'Mark', NULL, 'King', 'mk038722@gmail.com', '03-3423-1155', '2012-01-01', '2012-05-30' UNION ALL
SELECT 'Mary', 'Susan', 'Grey', 'mmgrey@gmail.com', '03-1299-3859', '2012-01-01', NULL UNION ALL
SELECT 'Luis', 'Blake', 'Shimaro', 'shimarolou@yahoo.com.au', '03-0385-3999', '2012-01-01', NULL UNION ALL
SELECT 'Natalie', 'G', 'Chin', 'nataliechin@mediasmarts.com.au', '03-3759-1001', '2012-01-01', NULL UNION ALL
SELECT 'Marian', NULL, 'McErin', 'marianmcerin@gmail.com', '03-3400-3331', '2012-01-01', '2012-05-01' UNION ALL
SELECT 'Rick', 'Tony', 'Webster', 'rikky69@gmail.com', '03-9459-1112', '2012-01-01', NULL

INSERT INTO Address
(Address1, Address2, City, Post_Code, Is_Current_Flag)
SELECT '6 Agave Street', 'Apartment 4A', 'Launceston', '7250', 'Y' UNION ALL
SELECT '88 Dodge Street', NULL, 'Sunshine', '3020', 'Y' UNION ALL
SELECT '3 McKenzie Court', 'Level 9', 'Perthville', '2795', 'Y' UNION ALL
SELECT '5 Spencer Drive', 'Unit 9D', 'Melbourne', '3002', 'Y' UNION ALL
SELECT '8 Sunny Avenue', NULL, 'Sydney', '2000', 'Y' UNION ALL
SELECT '83 Mara Drive', NULL, 'Echuca', '3563', 'Y' UNION ALL
SELECT '1038 Mustang Street', NULL, 'Brisbane', '4000', 'Y' UNION ALL
SELECT '1 Bradman Street', NULL, 'Bendigo', '3550', 'Y' UNION ALL
SELECT '12 Cruger Drive', 'Block C', 'Cairns', '4870', 'Y' UNION ALL
SELECT '124 Lasting Court', NULL, 'Adelaide', '5000', 'Y'

INSERT INTO Customer_Address_Bridge
(CustomerID, AddressID)
SELECT 1, 1 UNION ALL
SELECT 2,2 UNION ALL
SELECT 3,3 UNION ALL
SELECT 4,4 UNION ALL
SELECT 5,5 UNION ALL
SELECT 6,6 UNION ALL
SELECT 7,7 UNION ALL
SELECT 8,8 UNION ALL
SELECT 9,9 UNION ALL
SELECT 10, 10

--CREATE CONSTRAINTS
ALTER TABLE Customer_Address_Bridge
ADD FOREIGN KEY (AddressID) REFERENCES Address(AddressID)
ALTER TABLE Customer_Address_Bridge
ADD FOREIGN KEY (CustomerID) REFERENCES Customer(CustomerID)

GO

--CREATE DIMENSION 'DimCustomer' TABLE
CREATE TABLE DimCustomer(
CustomerSK int IDENTITY (1,1) NOT NULL,
CustomerID int NULL,
First_Name varchar (50) NULL,
Last_Name varchar (50) NULL,
Email_Address varchar (100) NULL,
City varchar (50) NULL,
DT_Valid_From datetime NULL,
DT_Valid_To datetime NULL,
Is_Current bit NULL,
CONSTRAINT PK_DimCustomer PRIMARY KEY CLUSTERED (CustomerSK asc))

GO

--POPULATE DimCustomer TABLE WITH DUMMY DATA
INSERT INTO DimCustomer
(CustomerID, First_Name, Last_Name, City, Email_Address, DT_Valid_From, DT_Valid_To, Is_Current)
SELECT cu.CustomerID, cu.First_Name, cu.Last_Name, ad.City, cu.Email_Address, GetDate(), '9999-12-31', 1
FROM dbo.Address AS ad INNER JOIN
dbo.Customer_Address_Bridge AS br ON ad.AddressID = br.AddressID INNER JOIN
dbo.Customer AS cu ON br.CustomerID = cu.CustomerID

When executed, this code should give us a few tables populated with some dummy data. These, combined together, create the base for the last INSERT statement, which is responsible for the initial ‘DimCustomer’ table population. Provided you executed the code, the ‘DimCustomer’ table should have the content as per the image below.
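If the screenshot does not come through, a quick SELECT * FROM DimCustomer will show the same thing: the ten freshly inserted rows.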

As you can see, all the records have the ‘Is_Current’ flag set and a very distant expiry date (‘DT_Valid_To’), which is the default for a currently valid record. To see how MERGE SQL works for updating dimension tables, we first need to make some changes to our source data. Let’s assume that ‘Last_Name’, ‘First_Name’ and ‘City’ have been defined as Type 2 attributes, whereas ‘Email_Address’ will be treated as a Type 1 column. Let’s execute the following UPDATE statements to alter our source data and see how MERGE T-SQL works in practice.

--UPDATE SOURCE TABLES
UPDATE Customer
SET First_Name = 'zzzzzzz' WHERE customerid = 4

GO

UPDATE Customer
SET Email_Address = 'zzzzzzz@yyy.com' WHERE CustomerID = 8

Now that we have our source table updated, let’s first look at how to handle SCD Type 2 using the MERGE statement. As you recall, Type 2 SCD allows for tracking history by creating a new record in the dimension table (as opposed to Type 0, which ignores updates completely, or Type 1, which only keeps the latest record). Type 2 also handles record termination by expiring the rows, which is usually expressed either by populating date columns indicating insertion and expiry, or by setting a flag attribute to an appropriate Boolean value e.g. ‘Y’/’N’ or 0/1, or both. In order to process Type 2 SCD, the code needs to do the following:

  • Insert brand new customer rows with appropriate effective and end dates
  • Expire the old rows for those rows which have a Type 2 attribute change by setting the appropriate end date (the DT_Valid_To field in our code) and the Boolean flag (Is_Current in our code) to 0
  • Insert the changed Type 2 rows with appropriate effective and end dates and the Boolean flag (Is_Current in our code) set to 1

The T-SQL MERGE statement can only update a single row per incoming row, but there is a trick we can take advantage of by making use of the OUTPUT clause. MERGE can output the results of what it has done, which in turn can be consumed by a separate INSERT statement. We will therefore use the MERGE statement to update the existing record, terminating it, and then pass the relevant source rows out to the INSERT statement to create the new row. This may sound a bit convoluted but in reality it is rather simple. We have already established that the following columns are to be treated as Type 2 SCD: ‘First_Name’, ‘Last_Name’ and ‘City’. This means that for those 3 columns we will want an additional record created for each changed value detected in the source table. We will omit the ‘Email_Address’ attribute from this statement for the moment as it will be addressed when dealing with Type 1 SCD. The code starts with the outer INSERT and SELECT clause to handle the changed row inserts at the end of the MERGE statement. This has to come first because the MERGE is nested inside the INSERT. The code includes several references to GETDATE() as it presumes the change was effective as of today. Finally, following the code, there are comments that refer to its line numbers.

--UPDATE CUSTOMER TABLE - TYPE 2 SLOWLY CHANGING DIMENSION
INSERT INTO DimCustomer
(
CustomerID, First_Name, Last_Name, Email_Address, City, Is_Current, DT_Valid_From, DT_Valid_To
)
SELECT
CustomerID, First_Name, Last_Name, Email_Address, City, 1, GetDate(), '9999-12-31'
FROM
	(
	MERGE DimCustomer as Target
		USING
			(SELECT cu.CustomerID,
			cu.First_Name,
			cu.Last_Name,
			ad.City,
			cu.Email_Address
			FROM dbo.Address AS ad INNER JOIN
			dbo.Customer_Address_Bridge AS br ON ad.AddressID = br.AddressID INNER JOIN
			dbo.Customer AS cu ON br.CustomerID = cu.CustomerID)
			as [Source]
	                ON Target.CustomerID = Source.CustomerID AND Target.Is_Current = 1
	                WHEN MATCHED AND
			(
			Target.First_Name <> [Source].First_Name OR
			Target.Last_Name <> [Source].Last_Name OR
			Target.City <> [Source].City
			)
				THEN UPDATE SET
				Is_Current = 0,
				DT_Valid_To = GETDATE()
	WHEN NOT MATCHED BY TARGET
			THEN INSERT
			(
                        CustomerID,
                        First_Name,
                        Last_Name,
                        Email_Address,
                        City,
                        DT_Valid_From,
                        DT_Valid_To,
                        Is_Current
                        )
			VALUES
			(
                        [Source].CustomerID,
                        [Source].First_Name,
                        [Source].Last_Name,
                        [Source].Email_Address,
                        [Source].City,
                        GetDate(),
                        '9999-12-31',
                        1
                        )
	WHEN NOT MATCHED BY SOURCE AND Target.Is_Current = 1
			THEN UPDATE
			SET IS_Current = 0,
			DT_Valid_From = GetDate(),
			DT_Valid_To = GetDate()
	OUTPUT $action as Action,[Source].*) as MergeOutput
WHERE MergeOutput.Action = 'UPDATE'
AND CustomerID IS NOT NULL;

Now, let’s go through the code in detail and shed some light on what it is doing.

  • First 8 lines are just a standard INSERT from the subsequent MERGE statement
  • Line 10 qualifies ‘DimCustomer’ table as our destination table and aliases it as ‘Target’. This is where the actual MERGE function starts
  • Lines 11 – 20 create our dimensional data query, which will be used to populate ‘DimCustomer’, and alias it as ‘Source’
  • Line 21 specifies the link between the ‘Target’ and the ‘Source’
  • Lines 22 – 30 specify what needs to be done when a ‘Source’ row matches a ‘Target’ row based on the link set up in line 21, but one of the Type 2 attributes i.e. ‘First_Name’, ‘Last_Name’ or ‘City’ has changed. In that case the ‘Is_Current’ flag is set to 0 and DT_Valid_To is set to the current date and time.
  • Lines 31 – 53 specify what the query is required to do when the match between ‘Source’ and ‘Target’ is not found. In such instance the code inserts new records to the ‘Target’ dimension table.
  • Lines 54 – 58 specify that all rows of ‘Target’ table that do not match the rows returned by ‘Source’ and that satisfy any additional search condition are updated.
  • Line 59 returns a row for every row in ‘Target’ table that is updated, inserted or deleted, in no particular order. This line also includes $Action column of type NVARCHAR (10) which stores values for each row depending on what action was executed against it i.e. INSERT, UPDATE or DELETE. In our case we return everything from the ‘Source’ and qualify/alias this dataset as ‘MergeOutput’. This dataset will be further narrowed down by WHERE clause in the next 2 lines.
  • Finally, lines 60 – 61 provide the condition for the INSERT statement (see lines 1-8) which specifies that we only require records with the ‘Action’ attribute of ‘UPDATE’ and where ‘CustomerID’ is present.

That is Type 2 SCD handled. Now let’s look at how we handle Type 1 SCD. The Type 2 attributes excluded the ‘Email_Address’ field which, as mentioned before, was qualified to be overwritten by the new/updated version. For Type 1 we also want to know when the record was updated. This block of code updates the Type 1 attributes (in this case ‘Email_Address’). The check in line 8 (DimCustomer.Is_Current) is optional, depending on whether you want to update only the current records or all records.

--UPDATE CUSTOMER TABLE - TYPE 1 SLOWLY CHANGING DIMENSION
UPDATE DimCustomer
SET Email_Address = Customer.Email_Address,
DT_Valid_From = GetDate()
FROM DimCustomer INNER JOIN Customer
ON DimCustomer.CustomerID = Customer.CustomerID AND
DimCustomer.Email_Address <> Customer.Email_Address AND
DimCustomer.Is_Current = 1 -------> Optional

When both of those SQL statements are executed, the output in our dimension table should be as follows.
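If the screenshot is not available, you can see the same outcome by querying the dimension for the two customers we touched:

SELECT CustomerSK, CustomerID, First_Name, Last_Name, Email_Address, City, DT_Valid_From, DT_Valid_To, Is_Current
FROM DimCustomer
WHERE CustomerID IN (4, 8)
ORDER BY CustomerID, CustomerSK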

 

What we can observe here is that one row was added (bottom red rectangle) to the ‘DimCustomer’ table, which precisely reflects our previous UPDATE statement against the source data. As you may recall, we updated ‘CustomerID’ number 4 with a new first name, changing the value from ‘Spencer’ to ‘zzzzzzz’. As ‘First_Name’ is a Type 2 attribute, a new record was added and the previous one (top red rectangle) expired. Also, as per our second update statement, we updated the email address for ‘CustomerID’ number 8 from ‘nataliechin@mediasmarts.com.au’ to ‘zzzzzzz@yyy.com’. As this column was qualified as a Type 1 Slowly Changing Dimension, no extra row was created and the existing record was updated with a new ‘DT_Valid_From’ date (blue rectangle).

This is the last post in the series describing Slowly Changing Dimensions updates using a variety of methods. If using MERGE T-SQL does not float your boat, you can always try implementing other alternatives to SCD handling. Please see my other posts about managing SCD using the Checksum transformation, the default SCD SSIS component or the Dimension Merge SCD SSIS transformation.
