How to implement Slowly Changing Dimensions – Part 3. Using SSIS Dimension Merge SCD Component to load dimension data
In the first post to the series I explained how SSIS default component for handling Slowly Changing Dimensions can be used when incorporated into a package. This method was followed by a second post depicting managing SCD via ‘Checksum Transformation’ third party add-in into Integration Services. In this post to the series I will not be leaving SSIS environment just yet as I still have not exhausted the options for additional transformations which can potentially be used for SCD processing. Before I explain how Merge T-SQL function can be utilised outside SSIS (see subsequent post), I would like to explore one more add-in and outline SSIS Dimension Merge SCD Component (from here on called DMSC). The component can be downloaded from HERE. Unfortunately, at the time of writing this post there is no version released for SQL Server 2012 hence all screen captures are based on BIDS environment and not SQL Server Data Tools.
Again, if you want to get your hands dirty, after component installation, execute THIS SQL script to create all database objects I will be using throughout this outline and create a new SSIS project in BIDS.
--CREATE RELATIONAL TABLES AND CONSTRAINTS. POPULATE TABLES WITH SOME DUMMY DATA CREATE TABLE Customer( CustomerID int IDENTITY (1,1), First_Name varchar (50) NULL, Middle_Name varchar (50), Last_Name varchar (50) NULL, Email_Address varchar (100) NULL, Phone_Number varchar (50) NULL, Created_Date datetime NULL, Modified_Date datetime NULL, CONSTRAINT pk_CustID PRIMARY KEY (CustomerID)) CREATE TABLE Address( AddressID int IDENTITY (1,1) NOT NULL, Address1 varchar (50) NULL, Address2 varchar (50) NULL, City varchar (50) NULL, Post_Code char (4) NULL, Is_Current_Flag char(1) NULL, CONSTRAINT pk_AddressID PRIMARY KEY (AddressID)) CREATE TABLE Customer_Address_Bridge( CustomerID int NOT NULL, AddressID int NOT NULL) GO INSERT INTO Customer (First_Name, Middle_Name, Last_Name,Email_Address,Phone_Number,Created_Date,Modified_Date) SELECT 'Mary', 'Joeanne', 'Black', 'mary0120@yahoo.com.au', '03-8573-9455', '2012-01-01', NULL UNION ALL SELECT 'John', 'Lorance', 'Moore', 'johnnym@awol.com', '03-3423-1155', '2012-01-01', '2012-05-30' UNION ALL SELECT 'Martin', NULL, 'Laser', 'mlaser91@aol.com', '03-2355-1109', '2012-01-01', '2012-05-12' UNION ALL SELECT 'Spencer', 'Chris', 'McEvans', 'spencerdude@hotmail.com', '03-1122-0007', '2012-01-01', '2012-05-30' UNION ALL SELECT 'Mark', NULL, 'King', 'mk038722@gmail.com', '03-3423-1155', '2012-01-01', '2012-05-30' UNION ALL SELECT 'Mary', 'Susan', 'Grey', 'mmgrey@gmail.com', '03-1299-3859', '2012-01-01', NULL UNION ALL SELECT 'Luis', 'Blake', 'Shimaro', 'shimarolou@yahoo.com.au', '03-0385-3999', '2012-01-01', NULL UNION ALL SELECT 'Natalie', 'G', 'Chin', 'nataliechin@mediasmarts.com.au', '03-3759-1001', '2012-01-01', NULL UNION ALL SELECT 'Marian', NULL, 'McErin', 'marianmcerin@gmail.com', '03-3400-3331', '2012-01-01', '2012-05-01' UNION ALL SELECT 'Rick', 'Tony', 'Webster', 'rikky69@gmail.com', '03-9459-1112', '2012-01-01', NULL INSERT INTO Address (Address1, Address2, City, Post_Code, Is_Current_Flag) SELECT '6 Agave Street', 'Apartment 4A', 'Launceston', '7250', 'Y' UNION ALL SELECT '88 Dodge Street', NULL, 'Sunshine', '3020', 'Y' UNION ALL SELECT '3 McKenzie Court', 'Level 9', 'Perthville', '2795', 'Y' UNION ALL SELECT '5 Spencer Drive', 'Unit 9D', 'Melbourne', '3002', 'Y' UNION ALL SELECT '8 Sunny Avenue', NULL, 'Sydney', '2000', 'Y' UNION ALL SELECT '83 Mara Drive', NULL, 'Echuca', '3563', 'Y' UNION ALL SELECT '1038 Mustang Street', NULL, 'Brisbane', '4000', 'Y' UNION ALL SELECT '1 Bradman Street', NULL, 'Bendigo', '3550', 'Y' UNION ALL SELECT '12 Cruger Drive', 'Block C', 'Cairns', '4870', 'Y' UNION ALL SELECT '124 Lasting Court', NULL, 'Adelaide', '5000', 'Y' INSERT INTO Customer_Address_Bridge (CustomerID, AddressID) SELECT 1, 1 UNION ALL SELECT 2,2 UNION ALL SELECT 3,3 UNION ALL SELECT 4,4 UNION ALL SELECT 5,5 UNION ALL SELECT 6,6 UNION ALL SELECT 7,7 UNION ALL SELECT 8,8 UNION ALL SELECT 9,9 UNION ALL SELECT 10, 10 --CREATE CONSTRAINTS ALTER TABLE Customer_Address_Bridge ADD FOREIGN KEY (AddressID) REFERENCES Address(AddressID) ALTER TABLE Customer_Address_Bridge ADD FOREIGN KEY (CustomerID) REFERENCES Customer(CustomerID) GO --CREATE DIMENSION 'DimCustomer' TABLE CREATE TABLE DimCustomer( CustomerSK int IDENTITY (1,1) NOT NULL, CustomerID int NULL, First_Name varchar (50) NULL, Last_Name varchar (50) NULL, Email_Address varchar (100) NULL, City varchar (50) NULL, DT_Valid_From datetime NULL, DT_Valid_To datetime NULL, Is_Current bit NULL, CONSTRAINT PK_DimCustomer PRIMARY KEY CLUSTERED (CustomerSK asc))
Now that we have the groundwork laid out let’s see what the query and the query output for constructing our dimension table looks like – we will only be selecting a limited amount of columns needed to populate our dimension. Executing the following SQL code will give you an idea of what attributes we will be using to populate out ‘DimCustomer’ table.
SELECT cu.CustomerID, cu.First_Name, cu.Last_Name, ad.City, cu.Email_Address FROM dbo.Address AS ad INNER JOIN dbo.Customer_Address_Bridge AS br ON ad.AddressID = br.AddressID INNER JOIN dbo.Customer AS cu ON br.CustomerID = cu.CustomerID
Now the fun part. Let’s create a simple SSIS solution in BIDS. Drop Data Flow Task onto the Control Flow surface and double-click it to enter the Data Flow development pane. Here, drop 2 OLE DB Source containers (here called ‘Merge Input Source’ and ‘Merge Input Target’), 1 Dimension Merge Slowly Changing Dimension transformation, 2 OLE DB Command transformations (here called ‘Update SCD Type 1 Records’ and ‘Expire SCD Type 2 Records‘) and 2 OLE DB Destination containers (here called ‘Insert New Records’ and ‘Insert SCD Type 2 Records’) onto the surface as per below.
Let’s go to each individual ‘source’ components to adjust their properties. First, in ‘Merge Input Source’ OLE DB Source create a database connection, change the ‘Data Access Mode’ to ‘SQL Command’ and input the same query as we used before to construct dimensional data from our source tables.
In the second OLD DB Source which we called ‘Merge Input Target’, choose ‘Table or View’ data access mode and select ‘DimCustomer’ table as our target object. Finally, link the two OLE DB Sources to DMSC transform ensuring that the one with the query is linked as ‘Source System’ and the one with the dimension destination table is marked as ‘Existing Dimension’. Don’t connect the remaining components just yet.
These are our source tables/query configured. Now, let’s tweak the core transformation which will do all the heavy-duty lifting for us in this package – our DMSC transformation. Follow the screenshots below to adjust individual tab properties leaving Row Change Detection, Surrogate Key Handling, Inferred Member Behaviour, Auditing, Logging and Performance tabs on default settings. The remaining tab properties adjustments are depicted below with the key sections to pay attention to highlighted in red.
If you look just at the screenshot above, you will notice that in the ‘Output Column Selection’ we have removed ‘Deleted’ output from the options provided. This is because we will not be deleting any records out of the existing dimension but without excluding it from this section, the package would not allow us to go any further, asking for explicitly specifying the records for deletion. Don’t worry about the error message when you click ‘Finish’, this should disappear when we configure the outputs for this DMSC. To do this, first, link DMSC transformation to ‘Insert New Records’ OLE DB Source selecting ‘New’ from the ‘Input Output Selection’. Go into the newly linked OLE DB Source, select our dimension table and create the mapping ensuring that surrogate keys are omitted. We don’t want those populated by the package as ‘CustomerSK’ is an identity column and should increment automatically. For our second OLE DB Source, link DMSC transformation selecting ‘New SCD 2’ option from the ‘Input Output Selection’. As per previous OLE DB Source configuration, create columns mappings, leaving out surrogate keys. All we should have left unlinked in our Data Flow pane are the two OLE DB Commands. Link DMSC to the first one – ‘Update SCD Type 1 Records’ OLE DB Command, select ‘Updated SCD1’ from the option dropdown and proceed to adjusting its properties. Select our previously created database connection from the ‘Connection Managers’ tab, go to ‘Component Properties’ tab and in its ‘SQLCommand’ property input the following query.
UPDATE DimCustomer SET Email_Address = ?, DT_Valid_From = ? WHERE CustomerSK = ? AND Is_Current = 1
Go to the next tab and map the parameters and columns accordingly as per image below.
Lastly, configure the ‘Expire SCD Type 2 Records’ OLE DB Command. Link it to DMSC selecting ‘Expired SCD 2 with SCD 1 Changes’ from the selection available, input the database connection and in ‘Component Properties’ input the following string value.
UPDATE DimCustomer SET First_Name = ?, Last_Name = ?, City = ?, DT_Valid_From = ?, DT_Valid_To = ?, Is_Current = ? WHERE CustomerSK = ? AND Is_Current = 1
Complete component adjustments by mapping columns to their corresponding parameters as per below.
That is our SSIS package finished. Go ahead and run it. All it should do at this stage is simply populate the ‘DimCustomer’ table with new values, inserting exactly 10 rows as per below.
Now let’s try and see how the package handles new, SCD Type 1 and SCD Type 2 records when we commit an update to our source tables by executing the following SQL script.
--INSERT NEW VALUES INTO CUSTOMER INSERT INTO Customer (First_Name, Middle_Name, Last_Name, Email_Address, Phone_Number, Created_Date, Modified_Date) SELECT '11_xxx', NULL, '11_yyy', '11_some_email_address', NULL, GetDate(), NULL UNION ALL SELECT '12_xxx', NULL, '12_yyy', '12_some_email_address', NULL, GetDate(), NULL UNION ALL SELECT '13_xxx', NULL, '13_yyy', '13_some_email_address', NULL, GetDate(), NULL GO --INSERT NEW VALUES INTO ADDRESS INSERT INTO Address (City, Is_Current_Flag) SELECT '11_some_city', 'Y' UNION ALL SELECT '12_some_city', 'Y' UNION ALL SELECT '13_some_city', 'Y' GO --UPDATE TYPE 1 SCD RECORDS. DON'T WORRY ABOUT UPDATING 'MODIFIED_DATE' COLUMN UPDATE Customer SET Email_Address = '1_some_email_address' WHERE CustomerID = 1 UPDATE Customer SET Email_Address = '2_some_email_address' WHERE CustomerID = 2 UPDATE Customer SET Email_Address = '3_some_email_address' WHERE CustomerID = 3 GO --UPDATE TYPE 2 SCD RECORDS. DON'T WORRY ABOUT UPDATING 'MODIFIED_DATE' COLUMN UPDATE Customer SET First_Name = '4_some_first_Name', Last_Name = '4_some_last_Name' WHERE CustomerID = 4 UPDATE Customer SET First_Name = '5_some_first_Name', Last_Name = '5_some_last_Name' WHERE CustomerID = 5 UPDATE Customer SET First_Name = '6_some_first_Name', Last_Name = '6_some_last_Name' WHERE CustomerID = 6 --UPDATE BRIDGING TABLE INSERT INTO Customer_Address_Bridge (CustomerID, AddressID) VALUES (11,11) INSERT INTO Customer_Address_Bridge (CustomerID, AddressID) VALUES (12,12) INSERT INTO Customer_Address_Bridge (CustomerID, AddressID) VALUES (13,13)
Try running the package again and your output should be identical to the one below.
What we have done here is made the DMSC to recognize that 3 records needed to be inserted (new data), 3 records required a Type 1 update and finally 3 records had to be expired due to Type 2 SCD insert of 3 more rows. This means that a total of 6 records needed to be inserted into our dimension table – 3 new ones and 3 changed ones (Type 2), adjusting the total row number to 16. This should be synonymous with what I have done in Part 1, 2 and 4, the only exception using Dimension Merge Slowly Changing Dimension transformation rather than T-SQL Merge function, checksum transformation or a default SSIS component. The output to this second execution is as per image below (click on it to enlarge).
What you can observe here is that records 1, 2 and 3 (blue rectangle) were updated according to SCD Type 1 i.e. no new rows were created, records 4, 5 and 6 were added as new rows which previously did not exist in this table and finally rows 7, 8 and 9 were created as SCD 2 update to existing records i.e. new entries were created for rows which already exist in the table but were expired due to changes to Type 2 attributes i.e. ‘First_Name’ and ‘Last_Name’.
Next up, the final part to this series, using T-SQL Merge function to update dimensional data with Type 1 and Type 2 SCD.
http://scuttle.org/bookmarks.php/pass?action=addThis entry was posted on Thursday, August 23rd, 2012 at 1:19 pm and is filed under How To's, SQL, SSIS. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.