How To Build A Data Mart Using Microsoft BI Stack Part 3 – Data Mart Load Approach and Coding

September 9th, 2013 / by admin

In the FIRST POST to this series I analysed the schema, objects and the data they hold for our mini data mart. This was followed by my SECOND POST outlining some key development concepts and the preliminary code base for the data mart objects schema. In this post I would like to dig deeper into the technical aspects of data mart development and touch on the SQL used to populate the data mart tables, which will become a prelude to SSIS package structure and design in the subsequent post.

For the sake of simplicity (with the exception of one dimension and one fact table, which I will describe in more detail later) we will assume that every time data is loaded into the data mart we will perform a full load, i.e. data will not be loaded incrementally and all the objects will be emptied and repopulated from scratch. This is not a standard approach, particularly when large volumes of data are at stake, and it somewhat violates a basic principle of building a data mart: the purpose is to keep all historical information, with all the changes included, so that a track record of all source data is kept and any change in history can easily be accounted for. Therefore, without making things excessively complex but still reflecting some more advanced concepts such as Slowly Changing Dimensions, I will design the load in such a way that the majority of the objects are completely repopulated (the easy approach), leaving the core fact table (FactSales) as well as one dimension table (DimStores) to conform to typical development standards and perform an incremental load (the harder approach).

The typical steps involved in data warehouse table loading include:

  • Deciding on full or delta load (already established as per above)
  • Isolating the data to be extracted
  • Developing the transformation logic
  • Loading the data into the table
  • Repeating until all tables have been populated

As we have already established which objects will participate in an incremental load versus a full load, we can proceed to SQL code development and empty the tables first. To empty tables of their data we may use the DELETE statement or, alternatively, the TRUNCATE command (the latter, however, requires dropping foreign key relationships first). There are a number of reasons why we would opt for TRUNCATE over DELETE. DELETE, for example, can prove quite slow as it removes table records one by one while fully logging every operation. TRUNCATE, on the other hand, de-allocates the data pages that internally store the data in SQL Server and logs only those de-allocations, which makes it much quicker. An additional benefit of truncation over deletion is that if a table uses the identity option to create integer key values, truncation automatically resets the numbering scheme to its original seed. Deletion will not reset the number, so when you insert a new row, the new integer value continues from where the previous insertions left off. Normally this is not what you want, because the numbering will no longer start from 1, which may be confusing. Let's drop the foreign key constraints and TRUNCATE our database objects.

/*
STEP 1
Drop foreign key constraints
*/
USE DW_Sample
--DROP FOREIGN KEYS
ALTER TABLE [dbo].[DimTitles] DROP CONSTRAINT [FK_DimTitles_DimPublishers]
ALTER TABLE [dbo].[FactTitlesAuthors] DROP CONSTRAINT [FK_FactTitlesAuthors_DimAuthors]
ALTER TABLE [dbo].[FactTitlesAuthors] DROP CONSTRAINT [FK_FactTitlesAuthors_DimTitles]
ALTER TABLE [dbo].[FactSales] DROP CONSTRAINT [FK_FactSales_DimStores]
ALTER TABLE [dbo].[FactSales] DROP CONSTRAINT [FK_FactSales_DimTitles]
ALTER TABLE [dbo].[FactSales] DROP CONSTRAINT [FK_FactSales_DimDates]
ALTER TABLE [dbo].[DimTitles] DROP CONSTRAINT [FK_DimTitles_DimDates]
GO

/*
STEP 2
Truncate all data mart tables
*/
--USE DW_Sample
--DELETE FROM ALL TABLES WITHOUT RESETTING THEIR IDENTITY AUTO NUMBER
--DELETE FROM [dbo].[FactSales]
--DELETE FROM [dbo].[FactTitlesAuthors]
--DELETE FROM [dbo].[DimTitles]
--DELETE FROM [dbo].[DimPublishers]
--DELETE FROM [dbo].[DimStores]
--DELETE FROM [dbo].[DimAuthors]
--DELETE FROM [dbo].[DimDates]
--TRUNCATE ALL TABLES AND RESET THEIR IDENTITY AUTO NUMBER

USE DW_Sample
TRUNCATE TABLE [dbo].[FactSales]
TRUNCATE TABLE [dbo].[FactTitlesAuthors]
TRUNCATE TABLE [dbo].[DimTitles]
TRUNCATE TABLE [dbo].[DimPublishers]
TRUNCATE TABLE [dbo].[DimStores]
TRUNCATE TABLE [dbo].[DimAuthors]
TRUNCATE TABLE [dbo].[DimDates]
GO
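As a side note, if you ever need to stick with DELETE (for example because dropping the foreign keys is not an option), you can still reset the identity seed manually with DBCC CHECKIDENT. A minimal sketch using one of the tables above:

/*
OPTIONAL (SKETCH): RESET THE IDENTITY SEED MANUALLY AFTER A DELETE
*/
USE DW_Sample
DELETE FROM [dbo].[DimPublishers]
--RESEEDING TO 0 MAKES THE NEXT INSERTED ROW START AT 1 AGAIN
DBCC CHECKIDENT ('dbo.DimPublishers', RESEED, 0)
GO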

Next, we will start our data mart code development by inserting data into the DimDates dimension. This will be the only dimension table populated by generating 'synthetic' data rather than retrieving it from the source. The SQL code below is a simple way of populating a date dimension which contains a few attributes derived from a calendar date – just enough for this project. For more comprehensive code please refer to THIS POST.

/*
STEP 3
Populate DimDates table with dates and their derived values
*/
USE DW_Sample
--DECLARE DATE VARIABLES FOR DATE PERIOD
DECLARE @StartDate datetime	= '01/01/1990'
DECLARE @EndDate datetime	= '01/01/1995'
DECLARE @DateInProcess datetime
SET @DateInProcess		= @StartDate
WHILE @DateInProcess	<= @EndDate
BEGIN
	SET NOCOUNT ON
--LOOP THROUGH INDIVIDUAL DATES DEFINED BY TIME PERIOD
		INSERT INTO DimDates (
		[Date],
		[DateName],
		[Month],
		[MonthName],
		[Quarter],
		[QuarterName],
		[Year],
		[YearName])
		VALUES (
		@DateInProcess,
		CONVERT(varchar(50), @DateInProcess, 110) + ', ' + DATENAME(WEEKDAY, @DateInProcess ),
		MONTH( @DateInProcess),
		CAST(YEAR(@DateInProcess) as nvarchar(4)) + ' - ' + DATENAME(MONTH, @DateInProcess ),
		DATENAME( QUARTER, @DateInProcess ),
		Cast(YEAR(@DateInProcess) as nvarchar(4)) + ' - ' + 'Q' + DATENAME(QUARTER, @DateInProcess ),
		YEAR(@DateInProcess),
		Cast(YEAR(@DateInProcess) as nvarchar(4)))
	SET @DateInProcess = DATEADD(DAY, 1, @DateInProcess)
END
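The WHILE loop above is easy to follow, but it inserts one row at a time. For longer date ranges a set-based load is considerably faster; below is a sketch of the same logic rewritten with a recursive common table expression (the MAXRECURSION 0 hint lifts the default limit of 100 recursion levels):

/*
ALTERNATIVE (SKETCH): SET-BASED DimDates LOAD VIA A RECURSIVE CTE
*/
USE DW_Sample
;WITH Dates AS (
	SELECT CAST('01/01/1990' AS datetime) AS DateInProcess
	UNION ALL
	SELECT DATEADD(DAY, 1, DateInProcess)
	FROM Dates
	WHERE DateInProcess < '01/01/1995'
)
INSERT INTO DimDates
([Date], [DateName], [Month], [MonthName], [Quarter], [QuarterName], [Year], [YearName])
SELECT
	DateInProcess,
	CONVERT(varchar(50), DateInProcess, 110) + ', ' + DATENAME(WEEKDAY, DateInProcess),
	MONTH(DateInProcess),
	CAST(YEAR(DateInProcess) AS nvarchar(4)) + ' - ' + DATENAME(MONTH, DateInProcess),
	DATENAME(QUARTER, DateInProcess),
	CAST(YEAR(DateInProcess) AS nvarchar(4)) + ' - Q' + DATENAME(QUARTER, DateInProcess),
	YEAR(DateInProcess),
	CAST(YEAR(DateInProcess) AS nvarchar(4))
FROM Dates
OPTION (MAXRECURSION 0)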

This is probably a good time to mention the issue of NULL values and how they can be handled in data warehouse scenarios. NULLs are most often considered to represent an unknown value, but what does the term 'unknown' really mean? Does it mean that the value is unknowable and could never be known? Does it mean that it is unknown at the moment but will soon be available? Does it mean that a value just does not apply in this particular instance? Any of these can be true: the value may not be known, may be missing, or may not be applicable. Because of this ambiguity, a decision has to be made as to which interpretation is most accurate, or you may face endless arguments over the validity of your reports. There are multiple ways of dealing with NULL values. The simplest one (in the case of dimensions) is to exclude such records and load only the data which has valid values against it. This, however, jeopardises the project, as in some cases records which have NULL against them still need to be accounted for. A more common way is to translate NULLs into more descriptive values by replacing them with -1, 'Unknown', 'Corrupt', 'Not Applicable', 'Unavailable' etc. (depending on circumstances and data types). This allows tables to be populated with non-NULL values, thus providing accurate results when aggregating; e.g. an average will return different results when NULL values are encountered than when NULLs are replaced with substitutes, as the quick illustration below shows. There are other options which can be utilised to account for NULLs; however, given the adherence to simplicity across this series, we will use the most standard approach – creating a dummy record in a dimension table to reference NULL values with more suitable substitutes.
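To make the aggregation point concrete, AVG() simply skips NULLs, so the result shifts depending on whether NULLs were loaded as-is or substituted:

/*
ILLUSTRATION: AVG() IGNORES NULLS, SO SUBSTITUTING THEM CHANGES THE RESULT
*/
DECLARE @SampleQty TABLE (qty int)
INSERT INTO @SampleQty VALUES (10), (NULL), (20)
SELECT AVG(qty) FROM @SampleQty				--RETURNS 15 (THE NULL ROW IS EXCLUDED)
SELECT AVG(ISNULL(qty, 0)) FROM @SampleQty	--RETURNS 10 (NULL TREATED AS 0)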

If you have a closer look at how the DimDates table was created, you will notice that it has an IDENTITY option on the DateKey column. This option automatically inserts incrementing numeric values every time a new row is added. By default the IDENTITY option prevents values from being inserted into the column manually. However, we can insert a value manually if we enable SQL Server's IDENTITY_INSERT option, using the SET IDENTITY_INSERT <table name> ON command. Once that is done, additional lookup values can be added to the table. Let's update the DimDates table with a record that will stand in for NULL values should they occur, as per the code below.

/*
STEP 4
Append Unknown values to DimDates table
*/
USE DW_Sample
SET IDENTITY_INSERT [DW_Sample].[dbo].[DimDates] ON
GO
Insert Into [DW_Sample].[dbo].[DimDates]
([DateKey],
[Date],
[DateName],
[Month],
[MonthName],
[Quarter],
[QuarterName],
[Year], [YearName])
SELECT
[DateKey]		= -1,
[Date]			= CAST( '01/01/1900' as nvarchar(50)),
[DateName]		= CAST('Unknown_Day' as nvarchar(50)),
[Month]			= -1,
[MonthName]		= CAST('Unknown_Month' as nvarchar(50)),
[Quarter]		= -1,
[QuarterName]	= CAST('Unknown_Quarter' as nvarchar(50)),
[Year]			= -1,
[YearName]		= CAST('Unknown_Year' as nvarchar(50))
GO
SET IDENTITY_INSERT [DW_Sample].[dbo].[DimDates] OFF
GO

Thanks to this small addition, whenever a lookup against the DimDates table is made and a NULL value is found, we can easily replace it with a meaningful entry (the -1 key), indicating to the end user that the date cannot be confirmed at the time the query is executed, as per the SQL below.

/*
Show how to reference/look up dimension value to account for NULLs
*/
USE DW_Sample
SELECT
[Pubs].[dbo].[Sales].qty,
[Order_Date_Key] = ISNULL([DW_Sample].[dbo].[DimDates].[DateKey], -1)
FROM [Pubs].[dbo].[Sales]
LEFT JOIN [DW_Sample].[dbo].[DimDates]
ON [Pubs].[dbo].[Sales].[ord_date] = [DW_Sample].[dbo].[DimDates].[Date]

Now, let’s populate the next three data mart dimensions based on the data from the pubs source database.

/*
STEP 5
Populate DimAuthors dimensions with source data
*/
USE DW_Sample
INSERT INTO [DW_Sample].[dbo].[DimAuthors]
(AuthorId, AuthorName, AuthorState)
SELECT
[AuthorId]		= CAST(au_id as nchar(12)),
[AuthorName]	= CAST((au_fname + ' ' + au_lname) as nvarchar(100)),
[AuthorState]	= CAST(state as nchar(12))
FROM Pubs.dbo.Authors
GO
SET IDENTITY_INSERT [DW_Sample].[dbo].[DimAuthors] ON
GO
INSERT INTO [DW_Sample].[dbo].[DimAuthors]
(AuthorKey, AuthorID, AuthorName, AuthorState)
SELECT -1, 'Unknown', 'Unknown', 'Unknown'
GO
SET IDENTITY_INSERT [DW_Sample].[dbo].[DimAuthors] OFF
GO
/*
STEP 6
Populate DimPublishers dimension with source data
*/
USE DW_Sample
INSERT INTO [DW_Sample].[dbo].[DimPublishers]
(PublisherId, PublisherName)
SELECT
[PublisherId]	= CAST(pub_id as nchar(12)),
[PublisherName] = CAST(pub_name as nvarchar(50))
FROM pubs.dbo.publishers
GO
SET IDENTITY_INSERT [DW_Sample].[dbo].[DimPublishers] ON
GO
INSERT INTO [DW_Sample].[dbo].[DimPublishers]
(PublisherKey, PublisherId, PublisherName)
SELECT -1, 'Unknown', 'Unknown'
GO
SET IDENTITY_INSERT [DW_Sample].[dbo].[DimPublishers] OFF
GO
/*
STEP 7
Populate DimTitles dimension with source data
*/
USE DW_Sample
INSERT INTO [DW_Sample].[dbo].[DimTitles]
(TitleId, TitleName, TitleType, PublisherKey, TitlePrice, PublishedDateKey)
SELECT
[TitleId]			= CAST(ISNULL([title_id], -1) AS nvarchar(12)),
[TitleName]			= CAST(ISNULL([title], 'Unknown') AS nvarchar(100)),
[TitleType]			= CASE CAST(ISNULL([type], 'Unknown') AS nvarchar(50))
						WHEN 'business'		THEN N'Business'
						WHEN 'mod_cook'		THEN N'Modern Cooking'
						WHEN 'popular_comp' THEN N'Popular Computing'
						WHEN 'psychology'	THEN N'Psychology'
						WHEN 'trad_cook'	THEN N'Traditional Cooking'
						WHEN 'UNDECIDED'	THEN N'Undecided'
						ELSE N'Unknown'
						END,
[PublisherKey]		= ISNULL([DW_Sample].[dbo].[DimPublishers].[PublisherKey], -1),
[TitlePrice]		= CAST(ISNULL([price], -1) AS DECIMAL(18, 4)),
[PublishedDateKey]	= ISNULL([DW_Sample].[dbo].[DimDates].[DateKey], -1)
FROM [Pubs].[dbo].[Titles] Join [DW_Sample].[dbo].[DimPublishers]
ON [Pubs].[dbo].[Titles].[pub_id] = [DW_Sample].[dbo].[DimPublishers].[PublisherId]
Left Join [DW_Sample].[dbo].[DimDates]
ON [Pubs].[dbo].[Titles].[pubdate] = [DW_Sample].[dbo].[DimDates].[Date]
GO
SET IDENTITY_INSERT [DW_Sample].[dbo].[DimTitles] ON
GO
INSERT INTO [DW_Sample].[dbo].[DimTitles]
(TitleKey, TitleID, TitleName, TitleType, PublisherKey, TitlePrice, PublishedDateKey)
SELECT -1, 'Unknown', 'Unknown', 'Unknown', -1, -1, -1
GO
SET IDENTITY_INSERT [DW_Sample].[dbo].[DimTitles] OFF
GO

At this stage the only dimension which has not been loaded is DimStores. This is because it is the only dimension in our data model which has been identified as requiring the Slowly Changing Dimension approach. Slowly Changing Dimensions (from this point forward referred to as SCD) are a crucial concept of any data mart. At the time of writing, Wikipedia identifies six types of SCDs, but in this project we will only use two: Type 1, where old data is simply overwritten with new data and no history is kept, and Type 2, where multiple records are created for a given natural key in the dimension table, with separate surrogate keys and/or different version numbers, so that unlimited history is preserved with each insert. More on individual methods of handling SCDs can be found in my other posts, mainly HERE, HERE, HERE and HERE. For this exercise we will use a plain MERGE SQL statement rather than resorting to third-party components or the native SSIS Slowly Changing Dimension tool. As previously mentioned, the DimStores table will be a mix of Type 1 and Type 2 SCD. Let's assume that the StoreAddress, StoreCity, StoreState and StoreZip attributes in DimStores are defined as Type 1 SCD: every time the source data changes, old data is replaced with new. Let's also assume that the StoreName attribute is defined as Type 2 SCD: every time a new StoreName is assigned to a specific StoreID, we want to preserve the old StoreName in history, making sure we can distinguish between retired and current names. Both of these requirements can be handled with a MERGE statement, which takes care of both the INSERT and the UPDATE for the Type 2 operations depending on how the source data has changed and which attribute the change relates to. Below is the SQL for our DimStores table population.

/*
STEP 8
Populate DimStores dimension table with source data assuming that some attributes
have been defined as either Type 1 or Type 2 Slowly Changing Dimensions
*/
USE DW_Sample
INSERT INTO [DW_Sample].[dbo].[DimStores]
(StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, IsCurrent)
SELECT
StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, 1
FROM
    (MERGE DimStores AS Target
        USING
            (SELECT
			[StoreId]		= CAST(stor_id as nchar(12)),
			[StoreName]		= CAST(stor_name as nvarchar(50)),
			[StoreAddress]	= CAST(stor_address as nvarchar (80)),
			[StoreCity]		= CAST(city as nvarchar (40)),
			[StoreState]	= CAST(state as nvarchar (12)),
			[StoreZip]		= CAST(zip as nvarchar (12))

			FROM Pubs.dbo.Stores)
            AS [Source]
                    ON Target.StoreID = Source.StoreID AND Target.IsCurrent = 1
                    WHEN MATCHED AND
            (Target.StoreName <> [Source].StoreName)
                THEN UPDATE SET
                IsCurrent = 0
    WHEN NOT MATCHED BY TARGET
            THEN INSERT
            (StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, IsCurrent)
            VALUES
            ([Source].StoreID,
			[Source].StoreName,
			[Source].StoreAddress,
			[Source].StoreCity,
			[Source].StoreState,
			[Source].StoreZip,
			1)
    WHEN NOT MATCHED BY SOURCE AND Target.IsCurrent = 1
            THEN UPDATE
            SET IsCurrent = 0
    OUTPUT $action as Action,[Source].*) AS MergeOutput
WHERE MergeOutput.Action = 'UPDATE'
AND StoreName IS NOT NULL
GO

UPDATE [DW_Sample].[dbo].[DimStores]
SET StoreAddress = source.stor_address
FROM dimstores target INNER JOIN Pubs.dbo.Stores source
ON target.StoreId = source.stor_id
AND target.storeaddress <> source.stor_address
AND target.IsCurrent = 1

UPDATE [DW_Sample].[dbo].[DimStores]
SET StoreCity = source.city
FROM dimstores target INNER JOIN Pubs.dbo.Stores source
ON target.StoreId = source.stor_id
AND target.storecity <> source.city
AND target.IsCurrent = 1

UPDATE [DW_Sample].[dbo].[DimStores]
SET StoreState = source.State
FROM dimstores target INNER JOIN Pubs.dbo.Stores source
ON target.StoreId = source.stor_id
AND target.storeState <> source.state
AND target.IsCurrent = 1

UPDATE [DW_Sample].[dbo].[DimStores]
SET StoreZip = source.Zip
FROM dimstores target INNER JOIN Pubs.dbo.Stores source
ON target.StoreId = source.stor_id
AND target.StoreZip <> source.Zip
AND target.IsCurrent = 1

BEGIN
	IF EXISTS (SELECT 1 FROM [DW_Sample].[dbo].[DimStores] WHERE StoreKey = -1)
		BEGIN
			RETURN
		END
	ELSE
		BEGIN
		SET IDENTITY_INSERT [DW_Sample].[dbo].[DimStores] ON
		INSERT INTO [DW_Sample].[dbo].[DimStores]
		(StoreKey, StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, IsCurrent)
		SELECT -1, 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown',-1
		SET IDENTITY_INSERT [DW_Sample].[dbo].[DimStores] OFF
		END
END
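Since DimStores now holds both retired and current versions of each store, most downstream queries will want to filter on IsCurrent = 1. A convenience view keeps that filter in one place (the view name below is my own suggestion, not part of the project code):

/*
OPTIONAL (SKETCH): VIEW EXPOSING ONLY THE CURRENT ROW PER STORE
*/
USE DW_Sample
GO
CREATE VIEW dbo.vwDimStoresCurrent
AS
SELECT StoreKey, StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip
FROM dbo.DimStores
WHERE IsCurrent = 1
GO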

To better understand how the SCD logic above works, let's temporarily alter the source data and run the MERGE SQL for the DimStores dimension again. To do this, let's run the following SQL and check how the records were affected.

/*
Show how a combination of MERGE and UPDATE will work with SCD 1 and SCD 2 dimensions
*/
USE DW_Sample
BEGIN
BEGIN TRANSACTION
	USE Pubs
	UPDATE [pubs].[dbo].[stores]
	SET Stor_Name = 'New Store Name'
	WHERE stor_id = 7067
	UPDATE [pubs].[dbo].[stores]
	SET stor_address = 'New Store Address'
	WHERE stor_id = 6380
	UPDATE [pubs].[dbo].[stores]
	SET zip = '00000'
	WHERE stor_id = 7131

	USE DW_Sample
	INSERT INTO [DW_Sample].[dbo].[DimStores]
	(StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, IsCurrent)
	SELECT
	StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, 1
	FROM
		(MERGE DimStores AS Target
			USING
				(SELECT
				[StoreId]		= CAST(stor_id as nchar(12)),
				[StoreName]		= CAST(stor_name as nvarchar(50)),
				[StoreAddress]	= CAST(stor_address as nvarchar (80)),
				[StoreCity]		= CAST(city as nvarchar (40)),
				[StoreState]	= CAST(state as nvarchar (12)),
				[StoreZip]		= CAST(zip as nvarchar (12))

				FROM Pubs.dbo.Stores)
				AS [Source]
						ON Target.StoreID = Source.StoreID AND Target.IsCurrent = 1
						WHEN MATCHED AND
				(Target.StoreName <> [Source].StoreName)
					THEN UPDATE SET
					IsCurrent = 0
		WHEN NOT MATCHED BY TARGET
				THEN INSERT
				(StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, IsCurrent)
				VALUES
				([Source].StoreID,
				[Source].StoreName,
				[Source].StoreAddress,
				[Source].StoreCity,
				[Source].StoreState,
				[Source].StoreZip,
				1)
		WHEN NOT MATCHED BY SOURCE AND Target.IsCurrent = 1
				THEN UPDATE
				SET IsCurrent = 0
		OUTPUT $action as Action,[Source].*) AS MergeOutput
	WHERE MergeOutput.Action = 'UPDATE'
	AND StoreName IS NOT NULL

	UPDATE [DW_Sample].[dbo].[DimStores]
	SET StoreAddress = source.stor_address
	FROM dimstores target INNER JOIN Pubs.dbo.Stores source
	ON target.StoreId = source.stor_id
	AND target.storeaddress <> source.stor_address
	AND target.IsCurrent = 1

	UPDATE [DW_Sample].[dbo].[DimStores]
	SET StoreCity = source.city
	FROM dimstores target INNER JOIN Pubs.dbo.Stores source
	ON target.StoreId = source.stor_id
	AND target.storecity <> source.city
	AND target.IsCurrent = 1

	UPDATE [DW_Sample].[dbo].[DimStores]
	SET StoreState = source.State
	FROM dimstores target INNER JOIN Pubs.dbo.Stores source
	ON target.StoreId = source.stor_id
	AND target.storeState <> source.state
	AND target.IsCurrent = 1

	UPDATE [DW_Sample].[dbo].[DimStores]
	SET StoreZip = source.Zip
	FROM dimstores target INNER JOIN Pubs.dbo.Stores source
	ON target.StoreId = source.stor_id
	AND target.StoreZip <> source.Zip
	AND target.IsCurrent = 1

	SELECT * FROM [DW_Sample].[dbo].[DimStores]
ROLLBACK TRANSACTION
END

After executing it, we notice an additional record added to this small dimension in accordance with the business requirements, i.e. every time the source data changes for the stor_name attribute in the Stores table in the pubs database, the old version is retired and a new one added to the dimension in the data mart. This is signified by means of the control column IsCurrent, which changes its value from 1 to 0 every time an old value is replaced with an updated one; in this way any alterations to the source data are preserved in history. Accordingly, the name for StoreID 7067 was changed from ‘News & Brews’ to ‘New Store Name’ and the IsCurrent flag updated. Also notice that StoreAddress and StoreZip, both Type 1 SCD, changed (for StoreID 6380 and 7131 respectively) and these columns were simply overwritten with new values as per the SCD Type 1 definition.
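If you want to see this directly rather than scanning the full SELECT * output, a query such as the one below (run before the transaction above is rolled back) should return two rows for store 7067 – the retired name with IsCurrent = 0 and the new one with IsCurrent = 1:

/*
VERIFICATION (SKETCH): BOTH VERSIONS OF THE RENAMED STORE
*/
SELECT StoreKey, StoreID, StoreName, IsCurrent
FROM [DW_Sample].[dbo].[DimStores]
WHERE StoreID = '7067'
ORDER BY StoreKey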

[Image: MERGE SQL output showing the retired and current DimStores records]

In the above example I used a control attribute called IsCurrent, but using expiry dates to signify whether a given record is current is also common (more on that after the image below). Again, for a detailed description of how MERGE behaves in SCD deployments, please have a look at one of my previous posts HERE. Alternatively, if you're not comfortable with the syntax intricacies of the MERGE command, you can try alternative approaches, e.g. Microsoft's own Slowly Changing Dimension SSIS component as per the image below. It is a good way to start if you're just trying to get your head around the concept of SCD types, as it is wizard-driven and easy to configure.

[Image: the native SSIS Slowly Changing Dimension component]
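Coming back to the expiry-date variant mentioned above: instead of an IsCurrent flag, each version of a row carries an effective date and an expiry date, and the current version is the one whose expiry date is still open. A minimal sketch of what the table shape could look like (the RowStartDate/RowEndDate names are illustrative, not part of this project):

/*
SKETCH: TRACKING SCD TYPE 2 WITH EFFECTIVE/EXPIRY DATES INSTEAD OF A FLAG
*/
CREATE TABLE dbo.DimStoresDated (
	StoreKey int NOT NULL PRIMARY KEY IDENTITY (1, 1),
	StoreId nchar(12) NOT NULL,
	StoreName nvarchar(50) NOT NULL,
	RowStartDate datetime NOT NULL,		--WHEN THIS VERSION BECAME ACTIVE
	RowEndDate datetime NULL			--NULL WHILE THE VERSION IS STILL CURRENT
)
--RETIRING A VERSION THEN MEANS STAMPING ITS EXPIRY DATE, E.G.:
--UPDATE dbo.DimStoresDated SET RowEndDate = GETDATE()
--WHERE StoreId = '7067' AND RowEndDate IS NULL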

Finally, all there is left to do is to populate fact tables and add foreign keys as per the SQL below.

/*
STEP 9
Populate FactTitlesAuthors fact table with source data assuming that
if historical load has already been run, only new values should be inserted.
*/
USE DW_Sample
INSERT INTO [DW_Sample].[dbo].[FactTitlesAuthors]
(TitleKey, AuthorKey, AuthorOrder)
SELECT
[TitleKey]		= ISNULL(titles.TitleKey, -1),
[AuthorKey]		= ISNULL(authors.AuthorKey, -1),
[AuthorOrder]	= au_ord
FROM pubs.dbo.titleauthor AS titleauthor
JOIN DW_Sample.dbo.DimTitles AS titles
ON titleauthor.Title_id = titles.TitleId
JOIN DW_Sample.dbo.DimAuthors AS authors
ON titleauthor.Au_id = authors.AuthorId
WHERE NOT EXISTS	(SELECT TOP 1 'I already exist' FROM
					[DW_Sample].[dbo].[FactTitlesAuthors] fact
					WHERE fact.AuthorKey	= authors.AuthorKey
					AND fact.TitleKey		= titles.TitleKey)
/*
STEP 10
Populate FactSales fact table with source data assuming that
if historical load has already been run, only new values should be inserted.
*/
USE DW_Sample
INSERT INTO [DW_Sample].[dbo].[FactSales]
(OrderNumber, OrderDateKey, TitleKey, StoreKey, SalesQuantity)
SELECT
[OrderNumber]	= CAST(sales.ord_num AS nvarchar(50)),
[OrderDateKey]	= ISNULL(dates.DateKey, -1),
[TitleKey]		= ISNULL(titles.TitleKey, -1),
[StoreKey]		= ISNULL(stores.StoreKey, -1),
[SalesQuantity] = sales.Qty
FROM pubs.dbo.sales AS sales
JOIN DW_Sample.dbo.DimDates AS dates
ON sales.ord_date = dates.[date]
JOIN DW_Sample.dbo.DimTitles AS titles
ON sales.Title_id = titles.TitleId
JOIN DW_Sample.dbo.DimStores AS stores
ON sales.Stor_id = stores.StoreId
WHERE NOT EXISTS	(SELECT TOP 1 'I already exist' FROM
					[DW_Sample].[dbo].[FactSales] as fact
					WHERE sales.ord_num	= fact.OrderNumber
					AND dates.DateKey	= fact.OrderDateKey
					AND titles.TitleKey = fact.TitleKey
					AND stores.StoreKey = fact.StoreKey)
/*
STEP 11
Recreate all foreign key constraints
*/
USE DW_Sample
ALTER TABLE [dbo].[DimTitles] WITH CHECK ADD CONSTRAINT [FK_DimTitles_DimPublishers]
FOREIGN KEY ([PublisherKey]) REFERENCES [dbo].[DimPublishers] ([PublisherKey])
ALTER TABLE [dbo].[FactTitlesAuthors] WITH CHECK ADD CONSTRAINT [FK_FactTitlesAuthors_DimAuthors]
FOREIGN KEY ([AuthorKey]) REFERENCES [dbo].[DimAuthors] ([AuthorKey])
ALTER TABLE [dbo].[FactTitlesAuthors] WITH CHECK ADD CONSTRAINT [FK_FactTitlesAuthors_DimTitles]
FOREIGN KEY ([TitleKey]) REFERENCES [dbo].[DimTitles] ([TitleKey])
ALTER TABLE [dbo].[FactSales] WITH CHECK ADD CONSTRAINT [FK_FactSales_DimStores]
FOREIGN KEY ([StoreKey]) REFERENCES [dbo].[DimStores] ([Storekey])
ALTER TABLE [dbo].[FactSales] WITH CHECK ADD CONSTRAINT [FK_FactSales_DimTitles]
FOREIGN KEY ([TitleKey]) REFERENCES [dbo].[DimTitles] ([TitleKey])
ALTER TABLE [dbo].[FactSales] WITH CHECK ADD CONSTRAINT [FK_FactSales_DimDates]
FOREIGN KEY([OrderDateKey]) REFERENCES [dbo].[DimDates] ([DateKey])
ALTER TABLE [dbo].[DimTitles] WITH CHECK ADD CONSTRAINT [FK_DimTitles_DimDates]
FOREIGN KEY([PublishedDateKey]) REFERENCES [dbo].[DimDates] ([DateKey])

Please note that both queries populating the fact tables have a NOT EXISTS statement as part of the WHERE clause. We have already talked about inserting new values into dimensions, e.g. SCD Type 2; this time it is the fact table population that performs an incremental INSERT. As we only wish to commit the INSERT when new fact data has been found, the NOT EXISTS clause and everything that follows it provides this functionality and excludes records which can already be found in the fact table. There are different types of fact tables, and some of them, e.g. periodic snapshot facts, deliberately record the same entities repeatedly over a period of time; in our case, however, we are only interested in new records and thus exclude all transactions which have already been accounted for. We can test this out by executing the following SQL, where a new record is temporarily added to the Sales source table in the Pubs database and the code tries to INSERT the same record twice, with only one successful INSERT.

/*
Show how NOT EXISTS will work to prevent duplicate data being inserted in fact table
*/
USE DW_Sample
BEGIN
BEGIN TRANSACTION
USE pubs
	INSERT INTO sales
	(stor_id, ord_num, ord_date, qty, payterms, title_id)
	SELECT
	'7066', 'ZZZ999', '1994-09-14', 1, 'Net 30', 'PS2091'
	SELECT * FROM sales WHERE ord_num = 'ZZZ999'

	DECLARE @CountOfInserts int = 0
	DECLARE @RowsAffected int = 0
	WHILE @CountOfInserts < 2
		BEGIN
			INSERT INTO [DW_Sample].[dbo].[FactSales]
			(OrderNumber, OrderDateKey, TitleKey, StoreKey, SalesQuantity)
			SELECT
			[OrderNumber]	= CAST(sales.ord_num AS nvarchar(50)),
			[OrderDateKey]	= ISNULL(dates.DateKey, -1),
			[TitleKey]		= ISNULL(titles.TitleKey, -1),
			[StoreKey]		= ISNULL(stores.StoreKey, -1),
			[SalesQuantity] = sales.Qty
			FROM pubs.dbo.sales AS sales
			JOIN DW_Sample.dbo.DimDates AS dates
			ON sales.ord_date = dates.[date]
			JOIN DW_Sample.dbo.DimTitles AS titles
			ON sales.Title_id = titles.TitleId
			JOIN DW_Sample.dbo.DimStores AS stores
			ON sales.Stor_id = stores.StoreId
			WHERE NOT EXISTS	(SELECT TOP 1 'I already exist' FROM
								[DW_Sample].[dbo].[FactSales] as fact
								WHERE sales.ord_num	= fact.OrderNumber
								AND dates.DateKey	= fact.OrderDateKey
								AND titles.TitleKey = fact.TitleKey
								AND stores.StoreKey = fact.StoreKey)
			SET @RowsAffected = @@ROWCOUNT

			SELECT 'Pass number' + ' ' + CAST(@CountOfInserts + 1 as varchar (10)) + '. ' +
			'Number of records affected = ' + CAST(@RowsAffected as varchar (10))
			FROM [DW_Sample].[dbo].[FactSales] WHERE OrderNumber = 'ZZZ999'
			SET @CountOfInserts = @CountOfInserts + 1
	END
ROLLBACK TRANSACTION
END

[Image: NOT EXISTS SQL output showing one successful and one skipped insert]

This last bit of code concludes this post and, as usual, all the code samples can be found and downloaded from HERE. Also, if you're keen to follow along, please take note of the sequence numbers assigned to most of the SQL snippets above, as these will be a required reference for the ETL development in the NEXT POST. In the next iteration of this series I will explore SQL Server Integration Services functionality and how the above code samples will be integrated into an ETL routine to automate the data mart load.



How To Build A Data Mart Using Microsoft BI Stack Part 2 – OLAP Database Objects Modelling

September 9th, 2013 / by admin

In the FIRST POST to this series I analysed the schema, objects and the data they hold for our mini data mart project. In this post I would like to start digging into the more nuts-and-bolts side of data warehouse development, emphasising some key concepts and developing the preliminary code base for the data mart objects.

Key Concepts

I was going to refrain from discussing the intricacies and principles of data mart design in more detail; however, there are three concepts I'd like to mention as they will form a linchpin of any further modelling or development activities we proceed with. Number one is the concept of de-normalization. As opposed to OLTP databases, which typically focus on insert and update performance while minimising reads by dividing large tables into smaller and less redundant objects and defining relationships between them, OLAP design leans towards optimizing read performance by adding redundant data or by grouping data, often referred to as de-normalizing. The focus here is on report performance and readability rather than lack of redundancy. A normalized design will often store different but related pieces of information in separate logical tables (called relations). If these relations are stored physically as separate disk files, completing a database query that draws information from several relations (a join operation) can be slow; if many relations are joined, it may be prohibitively slow. De-normalizing facilitates retrieval speed, making query performance much more efficient. Following this logic, if the operational database stores locations in three separate tables to provide a hierarchical view, e.g. dbo.country, dbo.state and dbo.city, for the purpose of dimensional modeling we can roll this data up into one table called dbo.location. That way, in order to retrieve sales data for any given post code stored in the dbo.city table, we will not need to traverse two other tables and incur unnecessary joins. Provided that sales data comes from a dbo.sales fact table which is directly connected to the dbo.location dimension, only one join will be involved rather than having to navigate through an additional two. As our OLTP schema in the pubs database is mostly designed in such a way that collapsing multiple tables into one dimension or fact is hardly necessary, this becomes a straightforward exercise. For a more detailed overview of the methodology involved in dimensional modeling you may want to read THIS DOCUMENT; alternatively, there is a copious amount of great publications available on the Internet.
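To put the location example above into code (all table and column names here are hypothetical, purely for illustration): the normalized OLTP shape needs to traverse the whole hierarchy, while the rolled-up dimension answers the same question with a single join.

/*
ILLUSTRATION (HYPOTHETICAL TABLES): NORMALIZED VS DE-NORMALIZED LOOKUP
*/
--NORMALIZED OLTP SHAPE: THREE JOINS TO RELATE SALES TO THE LOCATION HIERARCHY
SELECT s.sales_amount, ci.post_code, st.state_name, co.country_name
FROM dbo.sales s
JOIN dbo.city ci	ON s.city_id = ci.city_id
JOIN dbo.state st	ON ci.state_id = st.state_id
JOIN dbo.country co ON st.country_id = co.country_id

--DE-NORMALIZED OLAP SHAPE: ONE JOIN TO THE ROLLED-UP dbo.location DIMENSION
SELECT f.sales_amount, l.post_code, l.state_name, l.country_name
FROM dbo.sales f
JOIN dbo.location l ON f.location_key = l.location_key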

Another important concept in dimensional modeling is that of surrogate keys. The idea is that instead of using values that naturally occur in the OLTP environment, we key each dimension with a synthetic integer value. This surrogate key then becomes a foreign key in the fact table, creating the relationship between the two. It is considered best practice to implement surrogate keys on all dimensions, possibly with the exception of the date dimension which, depending on the school of thought you subscribe to, may store a converted date in a format such as DDMMYYYY as the table key, use an integer surrogate key, or simply store the date in a pure date format (either way the key should be made up of unique values only). Surrogate keys assist when merging data from many dispersed OLTP systems, improve query performance and allow for change tracking, e.g. the Slowly Changing Dimensions concept (more on SCD in subsequent posts to this series).
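As a quick illustration of the date 'smart key' idea mentioned above, an integer key can be derived from a calendar date with a single conversion; style 112 below yields the YYYYMMDD variant, which has the added benefit of sorting chronologically:

/*
SKETCH: DERIVING AN INTEGER 'SMART' KEY FROM A CALENDAR DATE
*/
DECLARE @SomeDate datetime = '1994-09-14'
SELECT CAST(CONVERT(char(8), @SomeDate, 112) AS int) AS DateKey	--RETURNS 19940914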

Finally, there is the concept of bridge tables, also called associative entities. A bridge table's purpose is to provide a connection between many-to-many dimension tables, and it is often referred to as a factless fact table. If we look at the image below with the objects highlighted and imagine that dbo.sales will become the primary fact table, whereas dbo.authors and dbo.titles will be converted into dimension tables, we will notice that in order to process authors' information we will need to go through titles' data and its associated dbo.titleauthor table. Therefore, dbo.titleauthor is a good candidate for a factless fact table – a fact table that serves as a link between two dimensions.

[Image: pubs OLTP schema with the bridge table dbo.titleauthor highlighted]

Based on the above information and the schema/data analysis from POST 1, we can conclude that the sales table in the pubs database will become our fact table; publishers, authors, titles and stores will become dimension tables, whereas the dbo.titleauthor bridge table will become a factless fact table. Therefore, once all objects have been created (see the SQL code further on in this post), the data mart schema should resemble the diagram in the image below.

[Image: data mart ERD for the DW_Sample database]
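Once those objects exist, the bridge table is what lets us aggregate facts across the many-to-many relationship. For example, sales quantity per author can be resolved by hopping from FactSales through DimTitles and FactTitlesAuthors to DimAuthors, as sketched below (note that a title with several authors contributes its full quantity to each of them – a known trait of bridge tables worth keeping in mind when reporting):

/*
SKETCH: AGGREGATING ACROSS THE BRIDGE - SALES QUANTITY PER AUTHOR
*/
SELECT a.AuthorName, SUM(f.SalesQuantity) AS TotalQuantity
FROM dbo.FactSales f
JOIN dbo.DimTitles t			ON f.TitleKey = t.TitleKey
JOIN dbo.FactTitlesAuthors fta	ON t.TitleKey = fta.TitleKey
JOIN dbo.DimAuthors a			ON fta.AuthorKey = a.AuthorKey
GROUP BY a.AuthorName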

There is a multitude of other concepts that are very relevant when it comes to data mart design and modelling. Taking dimension modelling alone, notions such as conformed dimensions, Slowly Changing Dimensions (I outlined the SCD approach HERE, HERE, HERE and HERE), junk dimensions, parent-child dimensions, role-playing dimensions, degenerate dimensions etc. come to mind; however, given the conciseness of this post and its intended purpose – technical details over paradigms and methodologies – I will refer to those and other concepts as we go through the remainder of this series.
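To give at least one of those a concrete shape: a role-playing dimension is a single physical table referenced several times in one query under different aliases. If FactSales also carried a hypothetical ShippedDateKey column, both date keys would resolve against the same DimDates table:

/*
SKETCH (HYPOTHETICAL ShippedDateKey COLUMN): DimDates PLAYING TWO ROLES
*/
SELECT f.OrderNumber,
	od.[Date] AS OrderDate,		--ROLE 1: ORDER DATE
	sd.[Date] AS ShippedDate	--ROLE 2: SHIP DATE (HYPOTHETICAL)
FROM dbo.FactSales f
JOIN dbo.DimDates od ON f.OrderDateKey = od.DateKey
JOIN dbo.DimDates sd ON f.ShippedDateKey = sd.DateKey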

SQL Code

Let’s dive into the less conceptual aspects of data mart design – scripting out our data mart objects based on the information from Post 1 and the concepts covered at the start of this post. First, I created two directories which I will use for storing the SQL Server data, log and backup files – DW_Sample_Files on the C:\ drive and a second DW_Sample_BackUp directory within it. Next, I will create a new DW_Sample database (this will be our relational warehouse database) and all its objects, as well as perform a backup to the nominated directory. All of this is done via T-SQL rather than the SQL Server Management Studio graphical interface, which should reduce the number of screenshots and steps involved. I have also included detailed descriptions of the individual sections of the script for further clarification and analysis.

/*
STEP 1
Check if the database we wish to create is already in place and if so
drop it and recreate it with files (log and data) located as per
description above. Next, assign login SA (administrator) as the database
owner and set the recovery model to BULK_LOGGED.
*/
USE [master]
GO
IF EXISTS (SELECT name FROM sys.databases WHERE name = N'DW_Sample')
BEGIN
-- Close connections to the DW_Sample database
ALTER DATABASE [DW_Sample] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [DW_Sample]
END
GO
CREATE DATABASE [DW_Sample] ON PRIMARY
( NAME = N'DW_Sample'
, FILENAME = N'C:\DW_Sample_Files\DW_Sample.mdf'
, SIZE = 10MB
, MAXSIZE = 1GB
, FILEGROWTH = 10MB )
LOG ON
( NAME = N'DW_Sample_log'
, FILENAME = N'C:\DW_Sample_Files\DW_Sample_log.LDF'
, SIZE = 1MB
, MAXSIZE = 1GB
, FILEGROWTH = 10MB)
GO
--Assign database ownership to login SA
EXEC [DW_Sample].dbo.sp_changedbowner @loginame = N'SA', @map = false
GO
--Change the recovery model to BULK_LOGGED
ALTER DATABASE [DW_Sample] SET RECOVERY BULK_LOGGED
GO

/*
STEP 2
Create individual objects i.e. fact and dimension tables and create
foreign keys to establish referential integrity between individual tables
*/
USE [DW_Sample]
GO
--Create the Dimension Tables
CREATE TABLE [dbo].[DimStores](
[StoreKey] [int] NOT NULL PRIMARY KEY IDENTITY (1, 1),
[StoreId] [nchar](12) NOT NULL,
[StoreName] [nvarchar](50) NOT NULL,
[StoreAddress] [nvarchar] (80) NOT NULL,
[StoreCity] [nvarchar] (40) NOT NULL,
[StoreState] [nvarchar] (12) NOT NULL,
[StoreZip] [nvarchar] (12) NOT NULL,
[IsCurrent] [int] NOT NULL
)
GO
CREATE TABLE [dbo].[DimPublishers](
[PublisherKey] [int] NOT NULL PRIMARY KEY IDENTITY (1, 1),
[PublisherId] [nchar](12) NOT NULL,
[PublisherName] [nvarchar](50) NOT NULL
)
GO
CREATE TABLE [dbo].[DimDates](
[DateKey] int NOT NULL PRIMARY KEY IDENTITY (1, 1),
[Date] datetime NOT NULL,
[DateName] nVarchar(50),
[Month] int NOT NULL,
[MonthName] nVarchar(50) NOT NULL,
[Quarter] int NOT NULL,
[QuarterName] nVarchar(50) NOT NULL,
[Year] int NOT NULL,
[YearName] nVarchar(50) NOT NULL
)
GO
CREATE TABLE [dbo].[DimAuthors](
[AuthorKey] [int] NOT NULL PRIMARY KEY IDENTITY (1, 1),
[AuthorId] [nchar](12) NOT NULL,
[AuthorName] [nvarchar](100) NOT NULL,
[AuthorState] [nchar](12) NOT NULL
)
GO
CREATE TABLE [dbo].[DimTitles](
[TitleKey] [int] NOT NULL PRIMARY KEY IDENTITY (1, 1),
[TitleId] [nvarchar](12) NOT NULL,
[TitleName] [nvarchar](100) NOT NULL,
[TitleType] [nvarchar](50) NOT NULL,
[PublisherKey] [int] NOT NULL,
[TitlePrice] [decimal](18, 4) NOT NULL,
[PublishedDateKey] [int] NOT NULL
)
GO
--Create the Fact Tables
CREATE TABLE [dbo].[FactTitlesAuthors](
[TitleKey] [int] NOT NULL,
[AuthorKey] [int] NOT NULL,
[AuthorOrder] [int] NOT NULL,
CONSTRAINT [PK_FactTitlesAuthors] PRIMARY KEY CLUSTERED
( [TitleKey] ASC, [AuthorKey] ASC )
)
GO
CREATE TABLE [dbo].[FactSales](
[OrderNumber] [nvarchar](50) NOT NULL,
[OrderDateKey] [int] NOT NULL,
[TitleKey] [int] NOT NULL,
[StoreKey] [int] NOT NULL,
[SalesQuantity] [int] NOT NULL,
CONSTRAINT [PK_FactSales] PRIMARY KEY CLUSTERED
( [OrderNumber] ASC,[OrderDateKey] ASC, [TitleKey] ASC, [StoreKey] ASC )
)
GO
--Add Foreign Keys
ALTER TABLE [dbo].[DimTitles] WITH CHECK ADD CONSTRAINT [FK_DimTitles_DimPublishers]
FOREIGN KEY([PublisherKey]) REFERENCES [dbo].[DimPublishers] ([PublisherKey])
GO
ALTER TABLE [dbo].[FactTitlesAuthors] WITH CHECK ADD CONSTRAINT
[FK_FactTitlesAuthors_DimAuthors]
FOREIGN KEY([AuthorKey]) REFERENCES [dbo].[DimAuthors] ([AuthorKey])
GO
ALTER TABLE [dbo].[FactTitlesAuthors] WITH CHECK ADD CONSTRAINT
[FK_FactTitlesAuthors_DimTitles]
FOREIGN KEY([TitleKey]) REFERENCES [dbo].[DimTitles] ([TitleKey])
GO
ALTER TABLE [dbo].[FactSales] WITH CHECK ADD CONSTRAINT [FK_FactSales_DimStores]
FOREIGN KEY([StoreKey]) REFERENCES [dbo].[DimStores] ([Storekey])
GO
ALTER TABLE [dbo].[FactSales] WITH CHECK ADD CONSTRAINT [FK_FactSales_DimTitles]
FOREIGN KEY([TitleKey]) REFERENCES [dbo].[DimTitles] ([TitleKey])
GO
ALTER TABLE [dbo].[FactSales] WITH CHECK ADD CONSTRAINT [FK_FactSales_DimDates]
FOREIGN KEY([OrderDateKey]) REFERENCES [dbo].[DimDates] ([DateKey])
GO
ALTER TABLE [dbo].[DimTitles] WITH CHECK ADD CONSTRAINT [FK_DimTitles_DimDates]
FOREIGN KEY([PublishedDateKey]) REFERENCES [dbo].[DimDates] ([DateKey])
GO

/*
STEP 3
Back up new database to disk placing the file in the directory nominated
and restore database using the back up file replacing the one already created
with the backed up one
*/
BACKUP DATABASE [DW_Sample]
TO DISK =
N'C:\DW_Sample_Files\DW_Sample_BackUp\DW_Sample.bak'
GO
USE [Master]
RESTORE DATABASE [DW_Sample]
FROM DISK =
N'C:\DW_Sample_Files\DW_Sample_BackUp\DW_Sample.bak'
WITH REPLACE
GO
ALTER DATABASE [DW_Sample]
SET MULTI_USER
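
Before relying on the backup file, it can also be sanity-checked without actually restoring anything:

/*
OPTIONAL: VERIFY THE BACKUP FILE WITHOUT RESTORING IT
*/
RESTORE VERIFYONLY
FROM DISK = N'C:\DW_Sample_Files\DW_Sample_BackUp\DW_Sample.bak'
GO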

After successful code execution you should be left with roughly the same message as per the image below.

[Image: DW_Sample restore completion message]

This concludes this post and, as usual, all code samples and solution files can be found and downloaded from HERE. In the NEXT POST I would like to dig deeper into the technical aspects of data mart development and touch on the SQL used to populate the data mart tables, which will later become a prelude to SSIS package structure and design in a subsequent post.

