How to implement Slowly Changing Dimensions – Part 4. Using MERGE T-SQL to load dimension data
August 26th, 2012 / 1 Comment » / by admin
In the first post I briefly outlined how to set up Slowly Changing Dimensions process using default ETL functionality (Slowly Changing Dimensions Component in SSIS, SQL Server 2012). This was followed by another 2 posts to this series outlining the usage of Checksum SSIS transformation as well as Dimension Merge SCD component which also take advantage of Microsoft’s Integration Services environment. As much as I appreciate wizard driven process setup and a quick and painless deployment cycle, on the downside, some of those components have minimal customization options and as it is in case of default SCD component – take a big performance hit when dealing with large volumes of data. Sometimes, custom coding seems to work best and given the fact that MERGE function has been present in SQL code repertoire since version 2008, it is a great alternative for using out-of-the-box, standard SSIS component.
In this example I will be using mostly the same objects and dummy dataset as I used for SSIS SCD in Part 1, Part 2 and Part 3, so if you’re keen to get your hands dirty just execute the SQL below to have something to work with.
--CREATE RELATIONAL TABLES AND CONSTRAINTS. POPULATE TABLES WITH SOME DUMMY DATA CREATE TABLE Customer( CustomerID int IDENTITY (1,1), First_Name varchar (50) NULL, Middle_Name varchar (50) NULL, Last_Name varchar (50) NULL, Email_Address varchar (100) NULL, Phone_Number varchar (50) NULL, Created_Date datetime NULL, Modified_Date datetime NULL, CONSTRAINT pk_CustID PRIMARY KEY (CustomerID)) CREATE TABLE Address( AddressID int IDENTITY (1,1) NOT NULL, Address1 varchar (50) NULL, Address2 varchar (50) NULL, City varchar (50) NULL, Post_Code char (4) NULL, Is_Current_Flag char(1) NULL, CONSTRAINT pk_AddressID PRIMARY KEY (AddressID)) CREATE TABLE Customer_Address_Bridge( CustomerID int NOT NULL, AddressID int NOT NULL) GO INSERT INTO Customer (First_Name, Middle_Name, Last_Name,Email_Address,Phone_Number,Created_Date,Modified_Date) SELECT 'Mary', 'Joeanne', 'Black', 'mary0120@yahoo.com.au', '03-8573-9455', '2012-01-01', NULL UNION ALL SELECT 'John', 'Lorance', 'Moore', 'johnnym@awol.com', '03-3423-1155', '2012-01-01', '2012-05-30' UNION ALL SELECT 'Martin', NULL, 'Laser', 'mlaser91@aol.com', '03-2355-1109', '2012-01-01', '2012-05-12' UNION ALL SELECT 'Spencer', 'Chris', 'McEvans', 'spencerdude@hotmail.com', '03-1122-0007', '2012-01-01', '2012-05-30' UNION ALL SELECT 'Mark', NULL, 'King', 'mk038722@gmail.com', '03-3423-1155', '2012-01-01', '2012-05-30' UNION ALL SELECT 'Mary', 'Susan', 'Grey', 'mmgrey@gmail.com', '03-1299-3859', '2012-01-01', NULL UNION ALL SELECT 'Luis', 'Blake', 'Shimaro', 'shimarolou@yahoo.com.au', '03-0385-3999', '2012-01-01', NULL UNION ALL SELECT 'Natalie', 'G', 'Chin', 'nataliechin@mediasmarts.com.au', '03-3759-1001', '2012-01-01', NULL UNION ALL SELECT 'Marian', NULL, 'McErin', 'marianmcerin@gmail.com', '03-3400-3331', '2012-01-01', '2012-05-01' UNION ALL SELECT 'Rick', 'Tony', 'Webster', 'rikky69@gmail.com', '03-9459-1112', '2012-01-01', NULL INSERT INTO Address (Address1, Address2, City, Post_Code, Is_Current_Flag) SELECT '6 Agave Street', 'Apartment 4A', 'Launceston', '7250', 'Y' UNION ALL SELECT '88 Dodge Street', NULL, 'Sunshine', '3020', 'Y' UNION ALL SELECT '3 McKenzie Court', 'Level 9', 'Perthville', '2795', 'Y' UNION ALL SELECT '5 Spencer Drive', 'Unit 9D', 'Melbourne', '3002', 'Y' UNION ALL SELECT '8 Sunny Avenue', NULL, 'Sydney', '2000', 'Y' UNION ALL SELECT '83 Mara Drive', NULL, 'Echuca', '3563', 'Y' UNION ALL SELECT '1038 Mustang Street', NULL, 'Brisbane', '4000', 'Y' UNION ALL SELECT '1 Bradman Street', NULL, 'Bendigo', '3550', 'Y' UNION ALL SELECT '12 Cruger Drive', 'Block C', 'Cairns', '4870', 'Y' UNION ALL SELECT '124 Lasting Court', NULL, 'Adelaide', '5000', 'Y' INSERT INTO Customer_Address_Bridge (CustomerID, AddressID) SELECT 1, 1 UNION ALL SELECT 2,2 UNION ALL SELECT 3,3 UNION ALL SELECT 4,4 UNION ALL SELECT 5,5 UNION ALL SELECT 6,6 UNION ALL SELECT 7,7 UNION ALL SELECT 8,8 UNION ALL SELECT 9,9 UNION ALL SELECT 10, 10 --CREATE CONSTRAINTS ALTER TABLE Customer_Address_Bridge ADD FOREIGN KEY (AddressID) REFERENCES Address(AddressID) ALTER TABLE Customer_Address_Bridge ADD FOREIGN KEY (CustomerID) REFERENCES Customer(CustomerID) GO --CREATE DIMENSION 'DimCustomer' TABLE CREATE TABLE DimCustomer( CustomerSK int IDENTITY (1,1) NOT NULL, CustomerID int NULL, First_Name varchar (50) NULL, Last_Name varchar (50) NULL, Email_Address varchar (100) NULL, City varchar (50) NULL, DT_Valid_From datetime NULL, DT_Valid_To datetime NULL, Is_Current bit NULL, CONSTRAINT PK_DimCustomer PRIMARY KEY CLUSTERED (CustomerSK asc)) GO --POPULATE DimCustomer TABLE WITH DUMMY DATA INSERT INTO DimCustomer (CustomerID, First_Name, Last_Name, City, Email_Address, DT_Valid_From, DT_Valid_To, Is_Current) SELECT cu.CustomerID, cu.First_Name, cu.Last_Name, ad.City, cu.Email_Address, GetDate(), '9999-12-31', 1 FROM dbo.Address AS ad INNER JOIN dbo.Customer_Address_Bridge AS br ON ad.AddressID = br.AddressID INNER JOIN dbo.Customer AS cu ON br.CustomerID = cu.CustomerID
When executed, this code should give us few tables populated with some dummy data. These, combined together, create a base for the last insert statement which is responsible for the initial ‘DimCustomer’ table population. Provided you executed the code, the ‘DimCustomer’ table should have the content as per the image below.
As you can see, all the records have the status of ‘Is_Current’ and a very distant expiry date (‘DT_Valid_To’) which is a default for a currently valid record. To see how MERGE SQL works on updating dimension tables, first we need to make some changes to our source data. Let’s assume that ‘Last_Name’, ‘First_Name’ and ‘City’ have been defined as Type 2 attributes whereas ‘Email_Address’ will be treated as Type 1 column. Let’s execute the following UPDATE statements to alter our source data to see how MERGE T-SQL works in practice.
--UPDATE SOURCE TABLES UPDATE Customer SET First_Name = 'zzzzzzz' WHERE customerid = 4 GO UPDATE Customer SET Email_Address = 'zzzzzzz@yyy.com' WHERE CustomerID = 8
Now that we have our source table updated, first, let’s look at how to handle SCD Type 2 using MERGE function. As you recall, Type 2 SCD allows for tracking history by creating new record in a dimension table (as opposed to Type 0 which ignores updates completely or Type 1 which only keeps the latest record). Type 2 also handles records termination by expiring the rows and inserting identifiers which are usually expressed by either populating dates column indication insertion and expiry or setting the flag attribute to an appropriate boolean value e.g. ‘Y’/’N’ or number 0/1 or both. In order to process Type 2 SCD, the code needs to do the following:
- Insert brand new customer rows with appropriate effective and end dates
- Expire the old rows for those rows which have a type 2 attribute change by setting the appriopriate end date (DT_Valid_To field in our code) and Boolean flag (Is_Current in our code) to 0
- Insert the changed Type 2 rows with appropriate effective and end dates and Boolean flag (Is_Current in our code) to 1
The T-SQL MERGE statement can only update a single row per incoming row, but there is a trick that we can take advantage of by making use of the OUTPUT clause. Merge can output the results of what it has done, which in turn can be consumed by a separate INSERT statement. We will therefore use the MERGE statement to update the existing record, terminating it, and then pass the relevant source rows out to the INSERT statement to create the new row. This may sound a bit convoluted but in reality is rather simple. We have already established that the following columns to be treated as Type 2 SCD: ‘First_Name’, ‘Last_Name’, and ‘City’. This means that for those 3 columns we will want an additional record created for each changed value detected in the source table. We will omit ‘Email_Address’ attribute from this statement for the moment as this will be addressed when dealing with Type 1 SCD. The code starts with the outer INSERT and SELECT clause to handle the changed row inserts at the end of the MERGE statement. This has to come first because the MERGE is nested inside the INSERT. The code includes several references to ‘GETDATE()’ as it presumes the change was effective as of today. Finally, following the code, there are comments that refer to the line numbers.
--UPDATE CUSTOMER TABLE - TYPE 2 SLOWLY CHANGING DIMENSION INSERT INTO DimCustomer ( CustomerID, First_Name, Last_Name, Email_Address, City, Is_Current, DT_Valid_From, DT_Valid_To ) SELECT CustomerID, First_Name, Last_Name, Email_Address, City, 1, GetDate(), '9999-12-31' FROM ( MERGE DimCustomer as Target USING (SELECT cu.CustomerID, cu.First_Name, cu.Last_Name, ad.City, cu.Email_Address FROM dbo.Address AS ad INNER JOIN dbo.Customer_Address_Bridge AS br ON ad.AddressID = br.AddressID INNER JOIN dbo.Customer AS cu ON br.CustomerID = cu.CustomerID) as [Source] ON Target.CustomerID = Source.CustomerID AND Target.Is_Current = 1 WHEN MATCHED AND ( Target.First_Name <> [Source].First_Name OR Target.Last_Name <> [Source].Last_Name OR Target.City <> [Source].City ) THEN UPDATE SET Is_Current = 0, DT_Valid_To = GETDATE() WHEN NOT MATCHED BY TARGET THEN INSERT ( CustomerID, First_Name, Last_Name, Email_Address, City, DT_Valid_From, DT_Valid_To, Is_Current ) VALUES ( [Source].CustomerID, [Source].First_Name, [Source].Last_Name, [Source].Email_Address, [Source].City, GetDate(), '9999-12-31', 1 ) WHEN NOT MATCHED BY SOURCE AND Target.Is_Current = 1 THEN UPDATE SET IS_Current = 0, DT_Valid_From = GetDate(), DT_Valid_To = GetDate() OUTPUT $action as Action,[Source].*) as MergeOutput WHERE MergeOutput.Action = 'UPDATE' AND CustomerID IS NOT NULL;
Now, let’s go through the code in details and shed some light on what it is doing.
- First 8 lines are just a standard INSERT from the subsequent MERGE statement
- Line 10 qualifies ‘DimCustomer’ table as our destination table and aliases it as ‘Target’. This is where the actual MERGE function starts
- Lines 11 – 20 creates our dimensional data query which will be used to populate ‘DimCustomer’ and alases it as ‘Source’
- Line 21 specifies the link between the ‘Target’ and the ‘Source’
- Lines 22 – 30 specifies what needs to be done when selected Type 2 attributes i.e. ‘First_Name’, ‘Last_Name’ and ‘City’ from ‘Source’ and ‘Target’ match. This is based based on the link set up in line 21. If the match is found, ‘Is_Current’ flag is set to 0 and DT_Valid_To is set to the current date and time.
- Lines 31 – 53 specify what the query is required to do when the match between ‘Source’ and ‘Target’ is not found. In such instance the code inserts new records to the ‘Target’ dimension table.
- Lines 54 – 58 specify that all rows of ‘Target’ table that do not match the rows returned by ‘Source’ and that satisfy any additional search condition are updated.
- Line 59 returns a row for every row in ‘Target’ table that is updated, inserted or deleted, in no particular order. This line also includes $Action column of type NVARCHAR (10) which stores values for each row depending on what action was executed against it i.e. INSERT, UPDATE or DELETE. In our case we return everything from the ‘Source’ and qualify/alias this dataset as ‘MergeOutput’. This dataset will be further narrowed down by WHERE clause in the next 2 lines.
- Finally, lines 60 – 61 provide the condition for the INSERT statement (see lines 1-8) which specifies that we only require records with the ‘Action’ attribute of ‘UPDATE’ and where ‘CustomerID’ is present.
That is Type 2 SCD handled. Now, let’s look at how we handle Type 1 SCD. SCD Type 2 attributes excluded Email_Address fields which, as mentioned before, was qualified to be overwritten by the new/updated version. For Type 1 we also want to know when the record was updated. This block of code updates the Type 1 attributes (in this case, Email_Address). Line 8 (DimCustomer. Is_Current) check is optional depending on whether you only want to update current or all records.
--UPDATE CUSTOMER TABLE - TYPE 1 SLOWLY CHANGING DIMENSION UPDATE DimCustomer SET Email_Address = Customer.Email_Address, DT_Valid_From = GetDate() FROM DimCustomer INNER JOIN Customer ON DimCustomer.CustomerID = Customer.CustomerID AND DimCustomer.Email_Address <> Customer.Email_Address AND DimCustomer.Is_Current = 1 -------> Optional
When both of those SQL statements are executed, the output in our dimension table should be as follows (click on the image to enlarge).
What we can observe here is that one row was added (bottom red rectangle) to ‘DimCustomer’ table which precisely reflects our previous UPDATE statement to the source data. As you can recall we updated ‘CustomerID’ number 4 with a new first name changing the value from ‘Spencer’ to ‘zzzzzzz’. As ‘First_Name’ is a Type 2 attribute, a new record was added and the previous one (top red rectangle) expired. Also, as per our second update statement, we updated email address for ‘CustomerID’ number 8 from ‘nataliechin@mediasmarts.com’ to ‘zzzzzzz@yyy.com’. At this column was qualified as Type 1 Slowly Changing Dimension, no extra row was created and the existing record updated with a new ‘DT_Vlaid_From’ date (blue rectangle).
This is the last post to the series describing Slowly Changing Dimensions update using a variety of methods. If using MERGE T-SQL does not rock your boat you can always try implementing other alternatives to SCD handling. Please see my other posts about managing SCD using checksum transformation, SCD SSIS default component or SCD Dimension Merge SSIS transformation.