How To Build A Data Mart Using Microsoft BI Stack Part 3 – Data Mart Load Approach and Coding
September 9th, 2013 / No Comments » / by admin
In the FIRST POST to this series I analysed the schema, objects and the data they hold for our mini data mart. This was followed by my SECOND POST outlining some key development concepts and preliminary code base for data mart objects schema. In this post I would like to dig dipper into technical aspects of a data mart development and touch on SQL which is used to populate the data mart tables which will become a prelude to SSIS package structure and design in the subsequent post.
For the sake of simplicity (with the exception of one dimension and one fact table which I will describe in more details later) we will assume that every time data is loaded into the data mart, we will perform a full load i.e. data will not be loaded incrementally and all the objects will be emptied and repopulated from scratch. This is not a standard approach, particularly when large volumes of data are at stake and it somewhat violates the basic principle of creating a data mart i.e. the purpose is to keep all historical information, with all the changes included. In this way, a track record of all source data is kept and any changes in history can be easily accounted for. Therefore, not trying to make things excessively complex but also reflecting some more advanced concepts such as Slowly Changing Dimensions, I will try to depict the design in such way that majority of the objects will be completely repopulated (easy approach), leaving the core fact table (FactSales) as well as one dimension table (DimStores) to conform to typical development standards and perform an incremental load (harder approach).
The typical steps involved in data warehouse table loading include:
- Deciding on full or delta load (already established as per above)
- Isolating the data to be extracted
- Developing the transformation logic
- Loading the data into the table
- Repeating until all tables have been populated
As we have already established which objects will participate in incremental load versus full load we can proceed to SQL code development and empty the tables first. To empty tables out of their data we may use the DELETE SQL statement or alternatively use TRUNCATE SQL command (this however will require dropping foreign key relationships first). There are number of reasons why we would opt for TRUNCATE versus DELETE. DELETE, for example, can prove to be quite slow as it removes table records one by one while logging all operations. TRUNCATE command, on the other hand, de-allocates data pages that internally store the data in SQL Server with no logging, therefore is much quicker. An additional benefit of truncation over deletion is that if you have a table using the identity option to create integer key values, truncation will automatically reset the numbering scheme to its original value. Deletion, on the other hand, will not reset the number; therefore, when you insert a new row, the new integer value will continue from where the previous insertions left off before deletion. Normally this is not what you want, because the numbering will no longer start from 1, which may be confusing. Let’s TRUNCATE our database objects and remove any foreign key constraints.
/* STEP 1 Drop foreign key constraints */ USE DW_Sample --DROP FOREIGN KEYS ALTER TABLE [dbo].[DimTitles] DROP CONSTRAINT [FK_DimTitles_DimPublishers] ALTER TABLE [dbo].[FactTitlesAuthors] DROP CONSTRAINT [FK_FactTitlesAuthors_DimAuthors] ALTER TABLE [dbo].[FactTitlesAuthors] DROP CONSTRAINT [FK_FactTitlesAuthors_DimTitles] ALTER TABLE [dbo].[FactSales] DROP CONSTRAINT [FK_FactSales_DimStores] ALTER TABLE [dbo].[FactSales] DROP CONSTRAINT [FK_FactSales_DimTitles] ALTER TABLE [dbo].[FactSales] DROP CONSTRAINT [FK_FactSales_DimDates] ALTER TABLE [dbo].[DimTitles] DROP CONSTRAINT [FK_DimTitles_DimDates] GO /* STEP 2 Truncate all data mart tables */ --USE DW_Sample --DELETE FROM ALL TABLES WITHOUT RESETTING THEIR IDENTITY AUTO NUMBER --DELETE FROM [dbo].[FactSales] --DELETE FROM [dbo].[FactTitlesAuthors] --DELETE FROM [dbo].[DimTitles] --DELETE FROM [dbo].[DimPublishers] --DELETE FROM [dbo].[DimStores] --DELETE FROM [dbo].[DimAuthors] --DELETE FROM [dbo].[DimDates] --TRUNCATE ALL TABLES AND RESET THEIR IDENTITY AUTO NUMBER USE DW_Sample TRUNCATE TABLE [dbo].[FactSales] TRUNCATE TABLE [dbo].[FactTitlesAuthors] TRUNCATE TABLE [dbo].[DimTitles] TRUNCATE TABLE [dbo].[DimPublishers] TRUNCATE TABLE [dbo].[DimStores] TRUNCATE TABLE [dbo].[DimAuthors] TRUNCATE TABLE [dbo].[DimDates] GO
Next, we will start our data mart code development by inserting data into DimDates dimension. This will be the only dimension table which is populated by generating ‘synthetic’ data, rather than retrieving it from the source. The SQL code below is a simple way of populating date dimension which contains a few attributes derived from a calendar date – just enough for this project. For more comprehensive code please refer to THIS POST.
/* STEP 3 Populate DimDates table with date and its derrived values data */ USE DW_Sample --DECLARE DATE VARIABLES FOR DATE PERIOD DECLARE @StartDate datetime = '01/01/1990' DECLARE @EndDate datetime = '01/01/1995' DECLARE @DateInProcess datetime SET @DateInProcess = @StartDate WHILE @DateInProcess < = @EndDate BEGIN SET NOCOUNT ON --LOOP THROUGH INDIVIDUAL DATES DEFINED BY TIME PERIOD INSERT INTO DimDates ( [Date], [DATENAME], [Month], [MonthName], [QUARTER], [QUARTERName], [YEAR], [YEARName]) VALUES ( @DateInProcess, CONVERT(varchar(50), @DateInProcess, 110) + ', ' + DATENAME(WEEKDAY, @DateInProcess ), MONTH( @DateInProcess), CAST(YEAR(@DateInProcess) as nvarchar(4)) + ' - ' + DATENAME(MONTH, @DateInProcess ), DATENAME( QUARTER, @DateInProcess ), Cast(YEAR(@DateInProcess) as nvarchar(4)) + ' - ' + 'Q' + DATENAME(QUARTER, @DateInProcess ), YEAR(@DateInProcess), Cast(YEAR(@DateInProcess) as nvarchar(4))) SET @DateInProcess = DATEADD(DAY, 1, @DateInProcess) END
This is probably a good time to mention the issue of NULL values and they could be handled in data warehouse scenarios. Nulls are most often considered to be an unknown value, but what does the term ‘unknown’ really mean? Does it mean that it is unknowable and could never be known? Does it mean that is unknown at the moment but will soon be available? Does it mean that a value just does not apply in this particular instance? Any of these can be true: the value may not be known, may be missing, or may not be applicable. Because of the ambiguity of the term NULL, a decision has to be made as to which interpretation is most accurate, or you may face endless arguments over the validity of your reports. There are multiple ways of dealing with NULL values. The most common one (in case of dimensions) is to exclude those from the data and load only that data which has a valid values against it. This, however, jeopardises the project as in some cases records which have NULL against them may want to be accounted for. More common way is to translate NULLS into more descriptive values by replacing NULL with -1, Unknown, Corrupt, Not Applicable, Unavailable etc. values (depending on circumstances and data types). This will allow tables to be populated with NON-NULL values thus providing accurate results when aggregating e.g. average will bring back different results across the records when NULL values are encountered vs. when NULLS are replaced with more descriptive substitutes. There are other options which can be utilised to account for NULLS, however, given the adherence to simplicity across this series, we will be using the most standard approach – creating a dummy record in a dimension table to reference NULL values with more suitable substitutes.
If you have a closer look at the DimDates table population SQL script you will notice that we created the DimDates table with an IDENTITY option on the DateKey column. This option automatically inserts numeric values every time a new row is added. By default the IDENTITY option prevents values from being inserted into the column manually. However, we can manually insert a value if you enable SQL’s IDENTITY_INSERT option, by using the SET IDENTITY_INSERT < table name > ON SQL command. Once that is done, additional lookup values can be added to the table. Let’s update DimDates table with a record replacing NULL values if they happen to occur as per the code below.
/* STEP 4 Append Unknown values to DimDates table */ USE DW_Sample SET IDENTITY_INSERT [DW_Sample].[dbo].[DimDates] ON GO Insert Into [DW_Sample].[dbo].[DimDates] ([DateKey], [Date], [DateName], [Month], [MonthName], [Quarter], [QuarterName], [Year], [YearName]) SELECT [DateKey] = -1, [Date] = CAST( '01/01/1900' as nvarchar(50)), [DateName] = CAST('Unknown_Day' as nvarchar(50)), [Month] = -1, [MonthName] = CAST('Unknown_Month' as nvarchar(50)), [Quarter] = -1, [QuarterName] = CAST('Unknown_Quarter' as nvarchar(50)), [Year] = -1, [YearName] = CAST('Unknown_Year' as nvarchar(50)) GO SET IDENTITY_INSERT [DW_Sample].[dbo].[DimDates] OFF GO
Thanks to this small addition, whenever lookup against DimDates table is made and null values is found, we can easily replace it with a meaningful entry indication to the end user that the date name cannot be confirmed at the time the query is executed (replaced with -1 value) as per SQL below.
/* Show how to reference/look up dimension value to account for NULLs */ USE DW_Sample SELECT [Pubs].[dbo].[Sales].qty, [Order_Date_Key] = ISNULL([DW_Sample].[dbo].[DimDates].[DateKey], -1) FROM [Pubs].[dbo].[Sales] LEFT JOIN [DW_Sample].[dbo].[DimDates] ON [Pubs].[dbo].[Sales].[ord_date] = [DW_Sample].[dbo].[DimDates].[Date]
Now, let’s populate the next three data mart dimensions based on the data from pubs source database.
/* STEP 5 Populate DimAuthors dimensions with source data */ USE DW_Sample INSERT INTO [DW_Sample].[dbo].[DimAuthors] (AuthorId, AuthorName, AuthorState) SELECT [AuthorId] = CAST(au_id as nchar(12)), [AuthorName] = CAST((au_fname + ' ' + au_lname) as nvarchar(100)), [AuthorState] = CAST(state as nchar(12)) FROM Pubs.dbo.Authors GO SET IDENTITY_INSERT [DW_Sample].[dbo].[DimAuthors] ON GO INSERT INTO [DW_Sample].[dbo].[DimAuthors] (AuthorKey, AuthorID, AuthorName, AuthorState) SELECT -1, 'Unknown', 'Unknown', 'Unknown' GO SET IDENTITY_INSERT [DW_Sample].[dbo].[DimAuthors] OFF GO
/* STEP 6 Populate DimPublishers dimension with source data */ USE DW_Sample INSERT INTO [DW_Sample].[dbo].[DimPublishers] (PublisherId, PublisherName) SELECT [PublisherId] = CAST(pub_id as nchar(12)), [PublisherName] = CAST(pub_name as nvarchar(50)) FROM pubs.dbo.publishers GO SET IDENTITY_INSERT [DW_Sample].[dbo].[DimPublishers] ON GO INSERT INTO [DW_Sample].[dbo].[DimPublishers] (PublisherKey, PublisherId, PublisherName) SELECT -1, 'Unknown', 'Unknown' GO SET IDENTITY_INSERT [DW_Sample].[dbo].[DimPublishers] OFF GO
/* STEP 7 Populate DimTitles dimension with source data */ USE DW_Sample INSERT INTO [DW_Sample].[dbo].[DimTitles] (TitleId, TitleName, TitleType, PublisherKey, TitlePrice, PublishedDateKey) SELECT [TitleId] = CAST(ISNULL([title_id], -1) AS nvarchar(12)), [TitleName] = CAST(ISNULL([title], 'Unknown') AS nvarchar(100)), [TitleType] = CASE CAST(ISNULL([type], 'Unknown') AS nvarchar(50)) WHEN 'business' THEN N'Business' WHEN 'mod_cook' THEN N'Modern Cooking' WHEN 'popular_comp' THEN N'Popular Computing' WHEN 'psychology' THEN N'Psychology' WHEN 'trad_cook' THEN N'Traditional Cooking' WHEN 'UNDECIDED' THEN N'Undecided' END, [PublisherKey] = ISNULL([DW_Sample].[dbo].[DimPublishers].[PublisherKey], -1), [TitlePrice] = CAST(ISNULL([price], -1) AS DECIMAL(18, 4)), [PublishedDateKey] = ISNULL([DW_Sample].[dbo].[DimDates].[DateKey], -1) FROM [Pubs].[dbo].[Titles] Join [DW_Sample].[dbo].[DimPublishers] ON [Pubs].[dbo].[Titles].[pub_id] = [DW_Sample].[dbo].[DimPublishers].[PublisherId] Left Join [DW_Sample].[dbo].[DimDates] ON [Pubs].[dbo].[Titles].[pubdate] = [DW_Sample].[dbo].[DimDates].[Date] GO SET IDENTITY_INSERT [DW_Sample].[dbo].[DimTitles] ON GO INSERT INTO [DW_Sample].[dbo].[DimTitles] (TitleKey, TitleID, TitleName, TitleType, PublisherKey, TitlePrice, PublishedDateKey) SELECT -1, 'Unknown', 'Unknown', 'Unknown', -1, -1, -1 GO SET IDENTITY_INSERT [DW_Sample].[dbo].[DimTitles] OFF GO
At this stage the only dimension which has not been loaded is DimStore. This is because it is the only dimension in our data model which has been identified as requiring Slowly Changing Dimension approach. Slowly Changing Dimensions (from this point forward identified as SCD) is a crucial concept of any data mart. At this time, Wikipedia identifies six types of SCDs but in this project we will only use two types – Type 1 where old data is simply overwritten with new data and history is not kept and Type 2 where multiple records are created for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. In Type 2 SCD unlimited history is preserved for each insert. More on individual methods of SCDs handling can be viewed in my other posts, mainly HERE, HERE, HERE and HERE. For this exercise we will use a simple MERGE SQL statement, rather than resorting to third party components or SSIS native tool. As previously mentioned, DimStores table will be a mix of Type 1 and Type 2 SCD. Let’s assume that StoreAddress, StoreCity, StoreState and StoreZip attributes in DimStore table are defined as Type 1 SCD and every time source data is changed, old data is replaced with new one. Let’s also assume that StoreName attribute is defined as Type 2 SCD – every time new StoreName is assigned to the specific StoreID, we want to maintain the history and preserve old StoreName, making sure we can distinguish between retired and current StoreName. Both of those requirements can be handled by means of using MERGE SQL statement which takes care of both INSERT and UPDATE for Type 2 SCD operations depending on how the source data has been changed and which attribute it relates to. Below is the SQL for our DimStores table population.
/* STEP 8 Populate DimStores dimension table with source data assuming that some attributes have been defined as either Type 1 or Type 2 Slowly Changing Dimensions */ USE DW_Sample INSERT INTO [DW_Sample].[dbo].[DimStores] (StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, IsCurrent) SELECT StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, 1 FROM (MERGE DimStores AS Target USING (SELECT [StoreId] = CAST(stor_id as nchar(12)), [StoreName] = CAST(stor_name as nvarchar(50)), [StoreAddress] = CAST(stor_address as nvarchar (80)), [StoreCity] = CAST(city as nvarchar (40)), [StoreState] = CAST(state as nvarchar (12)), [StoreZip] = CAST(zip as nvarchar (12)) FROM Pubs.dbo.Stores) AS [Source] ON Target.StoreID = Source.StoreID AND Target.IsCurrent = 1 WHEN MATCHED AND (Target.StoreName <> [Source].StoreName) THEN UPDATE SET IsCurrent = 0 WHEN NOT MATCHED BY TARGET THEN INSERT (StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, IsCurrent) VALUES ([Source].StoreID, [Source].StoreName, [Source].StoreAddress, [Source].StoreCity, [Source].StoreState, [Source].StoreZip, 1) WHEN NOT MATCHED BY SOURCE AND Target.IsCurrent = 1 THEN UPDATE SET IsCurrent = 0 OUTPUT $action as Action,[Source].*) AS MergeOutput WHERE MergeOutput.Action = 'UPDATE' AND StoreName IS NOT NULL GO UPDATE [DW_Sample].[dbo].[DimStores] SET StoreAddress = source.stor_address FROM dimstores target INNER JOIN Pubs.dbo.Stores source ON target.StoreId = source.stor_id AND target.storeaddress <> source.stor_address AND target.IsCurrent = 1 UPDATE [DW_Sample].[dbo].[DimStores] SET StoreCity = source.city FROM dimstores target INNER JOIN Pubs.dbo.Stores source ON target.StoreId = source.stor_id AND target.storecity <> source.city AND target.IsCurrent = 1 UPDATE [DW_Sample].[dbo].[DimStores] SET StoreState = source.State FROM dimstores target INNER JOIN Pubs.dbo.Stores source ON target.StoreId = source.stor_id AND target.storeState <> source.state AND target.IsCurrent = 1 UPDATE [DW_Sample].[dbo].[DimStores] SET StoreZip = source.Zip FROM dimstores target INNER JOIN Pubs.dbo.Stores source ON target.StoreId = source.stor_id AND target.StoreZip <> source.Zip AND target.IsCurrent = 1 BEGIN IF EXISTS (SELECT 1 FROM [DW_Sample].[dbo].[DimStores] WHERE StoreKey = -1) BEGIN RETURN END ELSE BEGIN SET IDENTITY_INSERT [DW_Sample].[dbo].[DimStores] ON INSERT INTO [DW_Sample].[dbo].[DimStores] (StoreKey, StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, IsCurrent) SELECT -1, 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown',-1 SET IDENTITY_INSERT [DW_Sample].[dbo].[DimStores] OFF END END
To better understand how SCD and the above code logic work lets temporarily alter source data and run MERGE SQL for DimStores dimension again. To do this let’s run the following SQL and check how the records were affected.
/* Show how a combination of MERGE and UPDATE will work with SCD 1 and SCD 2 dimensions */ USE DW_Sample BEGIN BEGIN TRANSACTION USE Pubs UPDATE [pubs].[dbo].[stores] SET Stor_Name = 'New Store Name' WHERE stor_id = 7067 UPDATE [pubs].[dbo].[stores] SET stor_address = 'New Store Address' WHERE stor_id = 6380 UPDATE [pubs].[dbo].[stores] SET zip = '00000' WHERE stor_id = 7131 USE DW_Sample INSERT INTO [DW_Sample].[dbo].[DimStores] (StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, IsCurrent) SELECT StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, 1 FROM (MERGE DimStores AS Target USING (SELECT [StoreId] = CAST(stor_id as nchar(12)), [StoreName] = CAST(stor_name as nvarchar(50)), [StoreAddress] = CAST(stor_address as nvarchar (80)), [StoreCity] = CAST(city as nvarchar (40)), [StoreState] = CAST(state as nvarchar (12)), [StoreZip] = CAST(zip as nvarchar (12)) FROM Pubs.dbo.Stores) AS [Source] ON Target.StoreID = Source.StoreID AND Target.IsCurrent = 1 WHEN MATCHED AND (Target.StoreName <> [Source].StoreName) THEN UPDATE SET IsCurrent = 0 WHEN NOT MATCHED BY TARGET THEN INSERT (StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, IsCurrent) VALUES ([Source].StoreID, [Source].StoreName, [Source].StoreAddress, [Source].StoreCity, [Source].StoreState, [Source].StoreZip, 1) WHEN NOT MATCHED BY SOURCE AND Target.IsCurrent = 1 THEN UPDATE SET IsCurrent = 0 OUTPUT $action as Action,[Source].*) AS MergeOutput WHERE MergeOutput.Action = 'UPDATE' AND StoreName IS NOT NULL UPDATE [DW_Sample].[dbo].[DimStores] SET StoreAddress = source.stor_address FROM dimstores target INNER JOIN Pubs.dbo.Stores source ON target.StoreId = source.stor_id AND target.storeaddress <> source.stor_address AND target.IsCurrent = 1 UPDATE [DW_Sample].[dbo].[DimStores] SET StoreCity = source.city FROM dimstores target INNER JOIN Pubs.dbo.Stores source ON target.StoreId = source.stor_id AND target.storecity <> source.city AND target.IsCurrent = 1 UPDATE [DW_Sample].[dbo].[DimStores] SET StoreState = source.State FROM dimstores target INNER JOIN Pubs.dbo.Stores source ON target.StoreId = source.stor_id AND target.storeState <> source.state AND target.IsCurrent = 1 UPDATE [DW_Sample].[dbo].[DimStores] SET StoreZip = source.Zip FROM dimstores target INNER JOIN Pubs.dbo.Stores source ON target.StoreId = source.stor_id AND target.StoreZip <> source.Zip AND target.IsCurrent = 1 SELECT * FROM [DW_Sample].[dbo].[DimStores] ROLLBACK TRANSACTION END
After executing, what we can notice an additional record added to this small dimension in accordance with the business requirements i.e. every time the source data is changed for stor_name attribute in Stores table in pubs database, an old version will be retired and a new one added to the dimension in the data mart. This is signified by means of using a control column IsCurrent which changes its value from 1 to 0 every time an old value is replaced with an updated one. In this way any alterations to the source data are preserved in history. Therefore, the name for StoreID 7067 was changed from ‘News & Brews’ to ‘New Store Name’ and IsCurrent flag updated to the new value. Also, notice that StoreAddress and StoreZip, Type 1 SCD, changed (for StoreID 6380 and 7131 correspondingly) and these columns were overwritten with new values as per SCD Type 1 definition.
In the above example I used a control attribute called IsCurrent, but using expiry dates to signify whether a given record is current or not is also common. Again, for detailed description of how MERGE behaves in SCD deployment please have a look at one of my previous posts HERE. Alternatively, if you’re not comfortable with syntax intricacies of MERGE SQL command you can try using alternative approaches e.g. Microsoft’s own Slowly Changing Dimension component as per image below. It is a good way to start if you’re just trying to get your head around the concept of SCD types as it is wizard-driven and easy to configure.
Finally, all there is left to do is to populate fact tables and add foreign keys as per the SQL below.
/* STEP 9 Populate FactTitlesAuthors fact table with source data assuming that if historical load has already been run, only new values should be inserted. */ USE DW_Sample INSERT INTO [DW_Sample].[dbo].[FactTitlesAuthors] (TitleKey, AuthorKey, AuthorOrder) SELECT [TitleKey] = ISNULL(titles.TitleKey, -1), [AuthorKey] = ISNULL(authors.AuthorKey, -1), [AuthorOrder] = au_ord FROM pubs.dbo.titleauthor AS titleauthor JOIN DW_Sample.dbo.DimTitles AS titles ON titleauthor.Title_id = titles.TitleId JOIN DW_Sample.dbo.DimAuthors AS authors ON titleauthor.Au_id = authors.AuthorId WHERE NOT EXISTS (SELECT TOP 1 'I already exist' FROM [DW_Sample].[dbo].[FactTitlesAuthors] fact WHERE fact.AuthorKey = authors.AuthorKey AND fact.TitleKey = titles.TitleKey)
/* STEP 10 Populate FactSales fact table with source data assuming that if historical load has already been run, only new values should be inserted. */ USE DW_Sample INSERT INTO [DW_Sample].[dbo].[FactSales] (OrderNumber, OrderDateKey, TitleKey, StoreKey, SalesQuantity) SELECT [OrderNumber] = CAST(sales.ord_num AS nvarchar(50)), [OrderDateKey] = ISNULL(dates.DateKey, -1), [TitleKey] = ISNULL(titles.TitleKey, -1), [StoreKey] = ISNULL(stores.StoreKey, -1), [SalesQuantity] = sales.Qty FROM pubs.dbo.sales AS sales JOIN DW_Sample.dbo.DimDates AS dates ON sales.ord_date = dates.[date] JOIN DW_Sample.dbo.DimTitles AS titles ON sales.Title_id = titles.TitleId JOIN DW_Sample.dbo.DimStores AS stores ON sales.Stor_id = stores.StoreId WHERE NOT EXISTS (SELECT TOP 1 'I already exist' FROM [DW_Sample].[dbo].[FactSales] as fact WHERE sales.ord_num = fact.OrderNumber AND dates.DateKey = fact.OrderDateKey AND titles.TitleKey = fact.TitleKey AND stores.StoreKey = fact.StoreKey)
/* STEP 11 Recreate all foreign key constraints */ USE DW_Sample ALTER TABLE [dbo].[DimTitles] WITH CHECK ADD CONSTRAINT [FK_DimTitles_DimPublishers] FOREIGN KEY ([PublisherKey]) REFERENCES [dbo].[DimPublishers] ([PublisherKey]) ALTER TABLE [dbo].[FactTitlesAuthors] WITH CHECK ADD CONSTRAINT [FK_FactTitlesAuthors_DimAuthors] FOREIGN KEY ([AuthorKey]) REFERENCES [dbo].[DimAuthors] ([AuthorKey]) ALTER TABLE [dbo].[FactTitlesAuthors] WITH CHECK ADD CONSTRAINT [FK_FactTitlesAuthors_DimTitles] FOREIGN KEY ([TitleKey]) REFERENCES [dbo].[DimTitles] ([TitleKey]) ALTER TABLE [dbo].[FactSales] WITH CHECK ADD CONSTRAINT [FK_FactSales_DimStores] FOREIGN KEY ([StoreKey]) REFERENCES [dbo].[DimStores] ([Storekey]) ALTER TABLE [dbo].[FactSales] WITH CHECK ADD CONSTRAINT [FK_FactSales_DimTitles] FOREIGN KEY ([TitleKey]) REFERENCES [dbo].[DimTitles] ([TitleKey]) ALTER TABLE [dbo].[FactSales] WITH CHECK ADD CONSTRAINT [FK_FactSales_DimDates] FOREIGN KEY([OrderDateKey]) REFERENCES [dbo].[DimDates] ([DateKey]) ALTER TABLE [dbo].[DimTitles] WITH CHECK ADD CONSTRAINT [FK_DimTitles_DimDates] FOREIGN KEY([PublishedDateKey]) REFERENCES [dbo].[DimDates] ([DateKey])
Please note that both queries populating fact tables have NOT EXISTS statement as part of the WHARE clause. We talked about inserting new values into dimensions e.g. SCD Type 2, this time it’s executing incremental INSERT as part of fact table population. As we only wish to commit INSERT when new fact data has been found, NOT EXISTS and everything else that follows provides this functionality and excludes records which can already be found in a fact table. There are different types of fact tables and some of those e.g. cumulative snapshot facts allow for the same data to be inserted multiple times over a period of time. In our case, however, we are only interested in new records thus exclude all transactions which have already been accounted for. We can test this out by executing the following SQL where a new record is temporarily added to Sales source table in Pubs database and the code tries to INSERT the same record twice, with only one successful INSERT.
/* Show how NOT EXISTS will work to prevent duplicate data being inserted in fact table */ USE DW_Sample BEGIN BEGIN TRANSACTION USE pubs INSERT INTO sales (stor_id, ord_num, ord_date, qty, payterms, title_id) SELECT '7066', 'ZZZ999', '1994-09-14', 1, 'Net 30', 'PS2091' SELECT * FROM sales WHERE ord_num = 'ZZZ999' DECLARE @CountOfInserts int = 0 DECLARE @RowsAffected int = 0 WHILE @CountOfInserts < 2 BEGIN INSERT INTO [DW_Sample].[dbo].[FactSales] (OrderNumber, OrderDateKey, TitleKey, StoreKey, SalesQuantity) SELECT [OrderNumber] = CAST(sales.ord_num AS nvarchar(50)), [OrderDateKey] = ISNULL(dates.DateKey, -1), [TitleKey] = ISNULL(titles.TitleKey, -1), [StoreKey] = ISNULL(stores.StoreKey, -1), [SalesQuantity] = sales.Qty FROM pubs.dbo.sales AS sales JOIN DW_Sample.dbo.DimDates AS dates ON sales.ord_date = dates.[date] JOIN DW_Sample.dbo.DimTitles AS titles ON sales.Title_id = titles.TitleId JOIN DW_Sample.dbo.DimStores AS stores ON sales.Stor_id = stores.StoreId WHERE NOT EXISTS (SELECT TOP 1 'I already exist' FROM [DW_Sample].[dbo].[FactSales] as fact WHERE sales.ord_num = fact.OrderNumber AND dates.DateKey = fact.OrderDateKey AND titles.TitleKey = fact.TitleKey AND stores.StoreKey = fact.StoreKey) SET @RowsAffected = @@ROWCOUNT SELECT 'Pass number' + ' ' + CAST(@CountOfInserts + 1 as varchar (10)) + '. ' + 'Number of records affected = ' + CAST(@RowsAffected as varchar (10)) FROM [DW_Sample].[dbo].[FactSales] WHERE OrderNumber = 'ZZZ999' SET @CountOfInserts = @CountOfInserts + 1 END ROLLBACK TRANSACTION END
This last bit of code concludes this post and, as usual, all the code samples can be found and downloaded from HERE. Also, if you’re keen to follow along, please take note of the sequence number assigned to most of the above SQL snippets as these will be a required reference for ETL development in the NEXT POST. In the next iteration to this series I will explore SQL Server Integration Services functionality and how the above code samples will be integrated into developing an ETL routine to automate data mart loading.
Please also check other posts from this series:
- How To Build A Data Mart Using Microsoft BI Stack Part 1 – Introduction And OLTP Database Analysis
- How To Build A Data Mart Using Microsoft BI Stack Part 2 – OLAP Database Objects Modelling
- How To Build A Data Mart Using Microsoft BI Stack Part 4 – Data Mart Load Using SSIS
- How To Build A Data Mart Using Microsoft BI Stack Part 5 – Defining SSAS Project And Its Dimensions
- How To Build A Data Mart Using Microsoft BI Stack Part 6 – Refining SSAS Data Warehouse Dimensions
- How To Build A Data Mart Using Microsoft BI Stack Part 7 – Cube Creation, Cube Deployment And Data Validation
- How To Build A Data Mart Using Microsoft BI Stack Part 8 – Creating Sample SSRS Report
All SQL code and solution files can be found and downloaded from HERE.