How To Build A Data Mart Using Microsoft BI Stack Part 3 – Data Mart Load Approach and Coding

In the FIRST POST to this series I analysed the schema, objects and the data they hold for our mini data mart. This was followed by my SECOND POST outlining some key development concepts and the preliminary code base for the data mart objects schema. In this post I would like to dig deeper into the technical aspects of data mart development and touch on the SQL used to populate the data mart tables, which will become a prelude to SSIS package structure and design in the subsequent post.

For the sake of simplicity (with the exception of one dimension and one fact table, which I will describe in more detail later) we will assume that every time data is loaded into the data mart we will perform a full load, i.e. data will not be loaded incrementally and all the objects will be emptied and repopulated from scratch. This is not a standard approach, particularly when large volumes of data are at stake, and it somewhat violates a basic principle of building a data mart: the purpose is to keep all historical information, with all the changes included. In this way a track record of all source data is kept and any change in history can easily be accounted for. Therefore, trying not to make things excessively complex while still reflecting some more advanced concepts such as Slowly Changing Dimensions, I will design the load in such a way that the majority of the objects are completely repopulated (the easy approach), leaving the core fact table (FactSales) as well as one dimension table (DimStores) to conform to typical development standards and perform an incremental load (the harder approach).

The typical steps involved in data warehouse table loading include:

  • Deciding on full or delta load (already established as per above)
  • Isolating the data to be extracted
  • Developing the transformation logic
  • Loading the data into the table
  • Repeating until all tables have been populated
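
As a rough illustration, the steps above can be sketched as a simple driver loop. The table names and the load_table stub below are placeholder assumptions for illustration only, not project code:

```python
# Minimal sketch of the load sequence described above: each table is marked
# up front for a full or delta load, then processed in turn until all tables
# have been populated. The load_table body is a placeholder.

def load_table(name, mode):
    # placeholder for isolating, transforming and loading one table
    return f"{name} loaded ({mode})"

# full vs delta decided per table in advance, as per the first step
LOAD_PLAN = [
    ("DimDates",  "full"),
    ("DimStores", "delta"),
    ("FactSales", "delta"),
]

def run_load(plan):
    results = []
    for table, mode in plan:  # repeat until all tables have been populated
        results.append(load_table(table, mode))
    return results

results = run_load(LOAD_PLAN)
```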

As we have already established which objects will participate in incremental versus full load, we can proceed to SQL code development and empty the tables first. To empty tables of their data we may use the DELETE SQL statement or, alternatively, the TRUNCATE command (the latter, however, requires dropping foreign key relationships first). There are a number of reasons why we would opt for TRUNCATE over DELETE. DELETE, for example, can prove to be quite slow as it removes table records one by one while fully logging each operation. TRUNCATE, on the other hand, de-allocates the data pages that internally store the data in SQL Server and logs only those de-allocations, so it is much quicker. An additional benefit of truncation over deletion is that if a table uses the IDENTITY option to create integer key values, truncation automatically resets the numbering to its seed value. Deletion does not reset the number; when you insert a new row, the new integer value continues from where the previous insertions left off. Normally this is not what you want, because the numbering will no longer start from 1, which may be confusing. Let's remove the foreign key constraints and TRUNCATE our database objects.
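
To see the identity-reset difference in isolation, here is a small sketch using SQLite (chosen only because it is self-contained; it has no TRUNCATE, so clearing its internal sqlite_sequence entry stands in for the reset that TRUNCATE performs in SQL Server):

```python
import sqlite3

# DELETE keeps the auto-number counter; a reset (what TRUNCATE does in
# SQL Server) restarts it from the seed.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE DimDemo (DemoKey INTEGER PRIMARY KEY AUTOINCREMENT, Val TEXT)")
con.executemany("INSERT INTO DimDemo (Val) VALUES (?)", [("a",), ("b",), ("c",)])

con.execute("DELETE FROM DimDemo")                   # plain delete...
con.execute("INSERT INTO DimDemo (Val) VALUES ('d')")
after_delete = con.execute("SELECT DemoKey FROM DimDemo").fetchone()[0]  # numbering continues: 4

con.execute("DELETE FROM DimDemo")
con.execute("DELETE FROM sqlite_sequence WHERE name = 'DimDemo'")  # reset, akin to TRUNCATE
con.execute("INSERT INTO DimDemo (Val) VALUES ('e')")
after_reset = con.execute("SELECT DemoKey FROM DimDemo").fetchone()[0]   # numbering restarts: 1
```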

/*
STEP 1
Drop foreign key constraints
*/
USE DW_Sample
--DROP FOREIGN KEYS
ALTER TABLE [dbo].[DimTitles] DROP CONSTRAINT [FK_DimTitles_DimPublishers]
ALTER TABLE [dbo].[FactTitlesAuthors] DROP CONSTRAINT [FK_FactTitlesAuthors_DimAuthors]
ALTER TABLE [dbo].[FactTitlesAuthors] DROP CONSTRAINT [FK_FactTitlesAuthors_DimTitles]
ALTER TABLE [dbo].[FactSales] DROP CONSTRAINT [FK_FactSales_DimStores]
ALTER TABLE [dbo].[FactSales] DROP CONSTRAINT [FK_FactSales_DimTitles]
ALTER TABLE [dbo].[FactSales] DROP CONSTRAINT [FK_FactSales_DimDates]
ALTER TABLE [dbo].[DimTitles] DROP CONSTRAINT [FK_DimTitles_DimDates]
GO

/*
STEP 2
Truncate all data mart tables
*/
--USE DW_Sample
--DELETE FROM ALL TABLES WITHOUT RESETTING THEIR IDENTITY AUTO NUMBER
--DELETE FROM [dbo].[FactSales]
--DELETE FROM [dbo].[FactTitlesAuthors]
--DELETE FROM [dbo].[DimTitles]
--DELETE FROM [dbo].[DimPublishers]
--DELETE FROM [dbo].[DimStores]
--DELETE FROM [dbo].[DimAuthors]
--DELETE FROM [dbo].[DimDates]
--TRUNCATE ALL TABLES AND RESET THEIR IDENTITY AUTO NUMBER

USE DW_Sample
TRUNCATE TABLE [dbo].[FactSales]
TRUNCATE TABLE [dbo].[FactTitlesAuthors]
TRUNCATE TABLE [dbo].[DimTitles]
TRUNCATE TABLE [dbo].[DimPublishers]
TRUNCATE TABLE [dbo].[DimStores]
TRUNCATE TABLE [dbo].[DimAuthors]
TRUNCATE TABLE [dbo].[DimDates]
GO

Next, we will start our data mart code development by inserting data into the DimDates dimension. This will be the only dimension table populated by generating 'synthetic' data rather than retrieving it from the source. The SQL code below is a simple way of populating a date dimension which contains a few attributes derived from a calendar date – just enough for this project. For more comprehensive code please refer to THIS POST.

/*
STEP 3
Populate DimDates table with date and its derived values data
*/
USE DW_Sample
--DECLARE DATE VARIABLES FOR DATE PERIOD
DECLARE @StartDate datetime	= '01/01/1990'
DECLARE @EndDate datetime	= '01/01/1995'
DECLARE @DateInProcess datetime
SET @DateInProcess		= @StartDate
WHILE @DateInProcess	<= @EndDate
BEGIN
	SET NOCOUNT ON
--LOOP THROUGH INDIVIDUAL DATES DEFINED BY TIME PERIOD
		INSERT INTO DimDates (
		[Date],
		[DateName],
		[Month],
		[MonthName],
		[Quarter],
		[QuarterName],
		[Year],
		[YearName])
		VALUES (
		@DateInProcess,
		CONVERT(varchar(50), @DateInProcess, 110) + ', ' + DATENAME(WEEKDAY, @DateInProcess ),
		MONTH( @DateInProcess),
		CAST(YEAR(@DateInProcess) as nvarchar(4)) + ' - ' + DATENAME(MONTH, @DateInProcess ),
		DATENAME( QUARTER, @DateInProcess ),
		Cast(YEAR(@DateInProcess) as nvarchar(4)) + ' - ' + 'Q' + DATENAME(QUARTER, @DateInProcess ),
		YEAR(@DateInProcess),
		Cast(YEAR(@DateInProcess) as nvarchar(4)))
	SET @DateInProcess = DATEADD(DAY, 1, @DateInProcess)
END

This is probably a good time to mention the issue of NULL values and how they could be handled in data warehouse scenarios. NULLs are most often considered to represent an unknown value, but what does the term 'unknown' really mean? Does it mean that the value is unknowable and could never be known? That it is unknown at the moment but will soon be available? That a value simply does not apply in this particular instance? Any of these can be true: the value may not be known, may be missing, or may not be applicable. Because of this ambiguity, a decision has to be made as to which interpretation is most accurate, or you may face endless arguments over the validity of your reports. There are multiple ways of dealing with NULL values. One option (in the case of dimensions) is to exclude such records and load only the data which has valid values against it. This, however, jeopardises the project, as in some cases records which have NULLs against them may still need to be accounted for. A more common way is to translate NULLs into more descriptive values, replacing NULL with -1, Unknown, Corrupt, Not Applicable, Unavailable etc. (depending on circumstances and data types). This allows tables to be populated with non-NULL values and therefore produces predictable results when aggregating, e.g. an average will bring back different results when NULL values are left in the data versus when they are replaced with more descriptive substitutes. There are other options which can be utilised to account for NULLs; however, given the adherence to simplicity across this series, we will be using the most standard approach – creating a dummy record in each dimension table to reference NULL values with more suitable substitutes.
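
A quick sketch of why this matters for aggregates (SQLite used here purely for illustration, with made-up sample values): AVG silently skips NULLs, so the same column aggregates differently depending on whether NULLs were left in or replaced with a substitute such as -1.

```python
import sqlite3

# Three sample rows, one with an unknown (NULL) quantity.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Sales (Qty INTEGER)")
con.executemany("INSERT INTO Sales VALUES (?)", [(10,), (20,), (None,)])

# AVG ignores the NULL row entirely: (10 + 20) / 2
avg_raw = con.execute("SELECT AVG(Qty) FROM Sales").fetchone()[0]

# With the NULL translated to -1, all three rows count: (10 + 20 - 1) / 3
avg_subst = con.execute("SELECT AVG(COALESCE(Qty, -1)) FROM Sales").fetchone()[0]
```

Neither number is "wrong"; the point is that the treatment of NULLs must be a deliberate, documented decision before the data mart is loaded.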

If you have a closer look at the DimDates table population SQL script you will notice that we created the DimDates table with an IDENTITY option on the DateKey column. This option automatically generates numeric values every time a new row is added. By default the IDENTITY option prevents values from being inserted into the column manually. However, we can insert a value manually if we enable SQL Server's IDENTITY_INSERT option using the SET IDENTITY_INSERT <table name> ON command. Once that is done, additional lookup values can be added to the table. Let's update the DimDates table with a record substituting for NULL values, should they occur, as per the code below.

/*
STEP 4
Append Unknown values to DimDates table
*/
USE DW_Sample
SET IDENTITY_INSERT [DW_Sample].[dbo].[DimDates] ON
GO
Insert Into [DW_Sample].[dbo].[DimDates]
([DateKey],
[Date],
[DateName],
[Month],
[MonthName],
[Quarter],
[QuarterName],
[Year], [YearName])
SELECT
[DateKey]		= -1,
[Date]			= CAST( '01/01/1900' as nvarchar(50)),
[DateName]		= CAST('Unknown_Day' as nvarchar(50)),
[Month]			= -1,
[MonthName]		= CAST('Unknown_Month' as nvarchar(50)),
[Quarter]		= -1,
[QuarterName]	= CAST('Unknown_Quarter' as nvarchar(50)),
[Year]			= -1,
[YearName]		= CAST('Unknown_Year' as nvarchar(50))
GO
SET IDENTITY_INSERT [DW_Sample].[dbo].[DimDates] OFF
GO

Thanks to this small addition, whenever a lookup against the DimDates table is made and a NULL value is found, we can easily replace it with a meaningful entry (the -1 key) indicating to the end user that the date cannot be confirmed at the time the query is executed, as per the SQL below.

/*
Show how to reference/look up dimension value to account for NULLs
*/
USE DW_Sample
SELECT
[Pubs].[dbo].[Sales].qty,
[Order_Date_Key] = ISNULL([DW_Sample].[dbo].[DimDates].[DateKey], -1)
FROM [Pubs].[dbo].[Sales]
LEFT JOIN [DW_Sample].[dbo].[DimDates]
ON [Pubs].[dbo].[Sales].[ord_date] = [DW_Sample].[dbo].[DimDates].[Date]

Now, let’s populate the next three data mart dimensions based on the data from pubs source database.

/*
STEP 5
Populate DimAuthors dimensions with source data
*/
USE DW_Sample
INSERT INTO [DW_Sample].[dbo].[DimAuthors]
(AuthorId, AuthorName, AuthorState)
SELECT
[AuthorId]		= CAST(au_id as nchar(12)),
[AuthorName]	= CAST((au_fname + ' ' + au_lname) as nvarchar(100)),
[AuthorState]	= CAST(state as nchar(12))
FROM Pubs.dbo.Authors
GO
SET IDENTITY_INSERT [DW_Sample].[dbo].[DimAuthors] ON
GO
INSERT INTO [DW_Sample].[dbo].[DimAuthors]
(AuthorKey, AuthorID, AuthorName, AuthorState)
SELECT -1, 'Unknown', 'Unknown', 'Unknown'
GO
SET IDENTITY_INSERT [DW_Sample].[dbo].[DimAuthors] OFF
GO
/*
STEP 6
Populate DimPublishers dimension with source data
*/
USE DW_Sample
INSERT INTO [DW_Sample].[dbo].[DimPublishers]
(PublisherId, PublisherName)
SELECT
[PublisherId]	= CAST(pub_id as nchar(12)),
[PublisherName] = CAST(pub_name as nvarchar(50))
FROM pubs.dbo.publishers
GO
SET IDENTITY_INSERT [DW_Sample].[dbo].[DimPublishers] ON
GO
INSERT INTO [DW_Sample].[dbo].[DimPublishers]
(PublisherKey, PublisherId, PublisherName)
SELECT -1, 'Unknown', 'Unknown'
GO
SET IDENTITY_INSERT [DW_Sample].[dbo].[DimPublishers] OFF
GO
/*
STEP 7
Populate DimTitles dimension with source data
*/
USE DW_Sample
INSERT INTO [DW_Sample].[dbo].[DimTitles]
(TitleId, TitleName, TitleType, PublisherKey, TitlePrice, PublishedDateKey)
SELECT
[TitleId]			= CAST(ISNULL([title_id], -1) AS nvarchar(12)),
[TitleName]			= CAST(ISNULL([title], 'Unknown') AS nvarchar(100)),
[TitleType]			= CASE CAST(ISNULL([type], 'Unknown') AS nvarchar(50))
						WHEN 'business'		THEN N'Business'
						WHEN 'mod_cook'		THEN N'Modern Cooking'
						WHEN 'popular_comp' THEN N'Popular Computing'
						WHEN 'psychology'	THEN N'Psychology'
						WHEN 'trad_cook'	THEN N'Traditional Cooking'
						WHEN 'UNDECIDED'	THEN N'Undecided'
						END,
[PublisherKey]		= ISNULL([DW_Sample].[dbo].[DimPublishers].[PublisherKey], -1),
[TitlePrice]		= CAST(ISNULL([price], -1) AS DECIMAL(18, 4)),
[PublishedDateKey]	= ISNULL([DW_Sample].[dbo].[DimDates].[DateKey], -1)
FROM [Pubs].[dbo].[Titles] Join [DW_Sample].[dbo].[DimPublishers]
ON [Pubs].[dbo].[Titles].[pub_id] = [DW_Sample].[dbo].[DimPublishers].[PublisherId]
Left Join [DW_Sample].[dbo].[DimDates]
ON [Pubs].[dbo].[Titles].[pubdate] = [DW_Sample].[dbo].[DimDates].[Date]
GO
SET IDENTITY_INSERT [DW_Sample].[dbo].[DimTitles] ON
GO
INSERT INTO [DW_Sample].[dbo].[DimTitles]
(TitleKey, TitleID, TitleName, TitleType, PublisherKey, TitlePrice, PublishedDateKey)
SELECT -1, 'Unknown', 'Unknown', 'Unknown', -1, -1, -1
GO
SET IDENTITY_INSERT [DW_Sample].[dbo].[DimTitles] OFF
GO

At this stage the only dimension which has not been loaded is DimStores. This is because it is the only dimension in our data model which has been identified as requiring a Slowly Changing Dimension approach. Slowly Changing Dimensions (from this point forward referred to as SCD) are a crucial concept in any data mart. At this time Wikipedia identifies six types of SCD, but in this project we will only use two: Type 1, where old data is simply overwritten with new data and no history is kept, and Type 2, where multiple records are created for a given natural key in the dimension table, with separate surrogate keys and/or different version numbers, so that unlimited history is preserved for each change. More on individual methods of SCD handling can be found in my other posts, mainly HERE, HERE, HERE and HERE. For this exercise we will use a simple MERGE SQL statement rather than resorting to third-party components or SSIS's native Slowly Changing Dimension tool. As previously mentioned, the DimStores table will be a mix of Type 1 and Type 2 SCD. Let's assume that the StoreAddress, StoreCity, StoreState and StoreZip attributes in DimStores are defined as Type 1, so every time the source data changes, old data is replaced with the new. Let's also assume that the StoreName attribute is defined as Type 2: every time a new StoreName is assigned to a specific StoreID, we want to maintain history and preserve the old StoreName, making sure we can distinguish between retired and current values. In the code below, the MERGE statement takes care of the Type 2 logic (expiring the current record and inserting its replacement), while plain UPDATE statements handle the Type 1 overwrites. Below is the SQL for our DimStores table population.
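
Before diving into the T-SQL, the Type 1 / Type 2 split can be illustrated with a small in-memory Python sketch (the store values below are hypothetical sample data, not taken from pubs):

```python
# Type 2 attribute (StoreName): a change retires the current row (IsCurrent = 0)
# and inserts a new current row, preserving history.
# Type 1 attribute (StoreAddress): a change simply overwrites in place.

def apply_store_change(dim, store_id, name=None, address=None):
    current = next(r for r in dim if r["StoreID"] == store_id and r["IsCurrent"] == 1)
    if name is not None and name != current["StoreName"]:
        current["IsCurrent"] = 0                      # Type 2: retire the old row...
        new_row = dict(current, StoreName=name, IsCurrent=1)
        dim.append(new_row)                           # ...and keep both versions
        current = new_row
    if address is not None:
        current["StoreAddress"] = address             # Type 1: overwrite, no history

dim_stores = [{"StoreID": "7067", "StoreName": "News & Brews",
               "StoreAddress": "Old Address", "IsCurrent": 1}]

apply_store_change(dim_stores, "7067", name="New Store Name")       # adds a row
apply_store_change(dim_stores, "7067", address="New Store Address") # updates in place
```

The MERGE statement below implements the same pattern in set-based T-SQL, with the OUTPUT clause feeding the replacement rows back into an INSERT.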

/*
STEP 8
Populate DimStores dimension table with source data assuming that some attributes
have been defined as either Type 1 or Type 2 Slowly Changing Dimensions
*/
USE DW_Sample
INSERT INTO [DW_Sample].[dbo].[DimStores]
(StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, IsCurrent)
SELECT
StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, 1
FROM
    (MERGE DimStores AS Target
        USING
            (SELECT
			[StoreId]		= CAST(stor_id as nchar(12)),
			[StoreName]		= CAST(stor_name as nvarchar(50)),
			[StoreAddress]	= CAST(stor_address as nvarchar (80)),
			[StoreCity]		= CAST(city as nvarchar (40)),
			[StoreState]	= CAST(state as nvarchar (12)),
			[StoreZip]		= CAST(zip as nvarchar (12))

			FROM Pubs.dbo.Stores)
            AS [Source]
                    ON Target.StoreID = Source.StoreID AND Target.IsCurrent = 1
                    WHEN MATCHED AND
            (Target.StoreName <> [Source].StoreName)
                THEN UPDATE SET
                IsCurrent = 0
    WHEN NOT MATCHED BY TARGET
            THEN INSERT
            (StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, IsCurrent)
            VALUES
            ([Source].StoreID,
			[Source].StoreName,
			[Source].StoreAddress,
			[Source].StoreCity,
			[Source].StoreState,
			[Source].StoreZip,
			1)
    WHEN NOT MATCHED BY SOURCE AND Target.IsCurrent = 1
            THEN UPDATE
            SET IsCurrent = 0
    OUTPUT $action as Action,[Source].*) AS MergeOutput
WHERE MergeOutput.Action = 'UPDATE'
AND StoreName IS NOT NULL
GO

UPDATE [DW_Sample].[dbo].[DimStores]
SET StoreAddress = source.stor_address
FROM dimstores target INNER JOIN Pubs.dbo.Stores source
ON target.StoreId = source.stor_id
AND target.storeaddress <> source.stor_address
AND target.IsCurrent = 1

UPDATE [DW_Sample].[dbo].[DimStores]
SET StoreCity = source.city
FROM dimstores target INNER JOIN Pubs.dbo.Stores source
ON target.StoreId = source.stor_id
AND target.storecity <> source.city
AND target.IsCurrent = 1

UPDATE [DW_Sample].[dbo].[DimStores]
SET StoreState = source.State
FROM dimstores target INNER JOIN Pubs.dbo.Stores source
ON target.StoreId = source.stor_id
AND target.storeState <> source.state
AND target.IsCurrent = 1

UPDATE [DW_Sample].[dbo].[DimStores]
SET StoreZip = source.Zip
FROM dimstores target INNER JOIN Pubs.dbo.Stores source
ON target.StoreId = source.stor_id
AND target.StoreZip <> source.Zip
AND target.IsCurrent = 1

BEGIN
	IF EXISTS (SELECT 1 FROM [DW_Sample].[dbo].[DimStores] WHERE StoreKey = -1)
		BEGIN
			RETURN
		END
	ELSE
		BEGIN
		SET IDENTITY_INSERT [DW_Sample].[dbo].[DimStores] ON
		INSERT INTO [DW_Sample].[dbo].[DimStores]
		(StoreKey, StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, IsCurrent)
		SELECT -1, 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown',-1
		SET IDENTITY_INSERT [DW_Sample].[dbo].[DimStores] OFF
		END
END

To better understand how SCD and the above code logic work, let's temporarily alter the source data and run the MERGE SQL for the DimStores dimension again. To do this, let's run the following SQL and check how the records were affected.

/*
Show how a combination of MERGE and UPDATE will work with SCD 1 and SCD 2 dimensions
*/
USE DW_Sample
BEGIN
BEGIN TRANSACTION
	USE Pubs
	UPDATE [pubs].[dbo].[stores]
	SET Stor_Name = 'New Store Name'
	WHERE stor_id = 7067
	UPDATE [pubs].[dbo].[stores]
	SET stor_address = 'New Store Address'
	WHERE stor_id = 6380
	UPDATE [pubs].[dbo].[stores]
	SET zip = '00000'
	WHERE stor_id = 7131

	USE DW_Sample
	INSERT INTO [DW_Sample].[dbo].[DimStores]
	(StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, IsCurrent)
	SELECT
	StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, 1
	FROM
		(MERGE DimStores AS Target
			USING
				(SELECT
				[StoreId]		= CAST(stor_id as nchar(12)),
				[StoreName]		= CAST(stor_name as nvarchar(50)),
				[StoreAddress]	= CAST(stor_address as nvarchar (80)),
				[StoreCity]		= CAST(city as nvarchar (40)),
				[StoreState]	= CAST(state as nvarchar (12)),
				[StoreZip]		= CAST(zip as nvarchar (12))

				FROM Pubs.dbo.Stores)
				AS [Source]
						ON Target.StoreID = Source.StoreID AND Target.IsCurrent = 1
						WHEN MATCHED AND
				(Target.StoreName <> [Source].StoreName)
					THEN UPDATE SET
					IsCurrent = 0
		WHEN NOT MATCHED BY TARGET
				THEN INSERT
				(StoreID, StoreName, StoreAddress, StoreCity, StoreState, StoreZip, IsCurrent)
				VALUES
				([Source].StoreID,
				[Source].StoreName,
				[Source].StoreAddress,
				[Source].StoreCity,
				[Source].StoreState,
				[Source].StoreZip,
				1)
		WHEN NOT MATCHED BY SOURCE AND Target.IsCurrent = 1
				THEN UPDATE
				SET IsCurrent = 0
		OUTPUT $action as Action,[Source].*) AS MergeOutput
	WHERE MergeOutput.Action = 'UPDATE'
	AND StoreName IS NOT NULL

	UPDATE [DW_Sample].[dbo].[DimStores]
	SET StoreAddress = source.stor_address
	FROM dimstores target INNER JOIN Pubs.dbo.Stores source
	ON target.StoreId = source.stor_id
	AND target.storeaddress <> source.stor_address
	AND target.IsCurrent = 1

	UPDATE [DW_Sample].[dbo].[DimStores]
	SET StoreCity = source.city
	FROM dimstores target INNER JOIN Pubs.dbo.Stores source
	ON target.StoreId = source.stor_id
	AND target.storecity <> source.city
	AND target.IsCurrent = 1

	UPDATE [DW_Sample].[dbo].[DimStores]
	SET StoreState = source.State
	FROM dimstores target INNER JOIN Pubs.dbo.Stores source
	ON target.StoreId = source.stor_id
	AND target.storeState <> source.state
	AND target.IsCurrent = 1

	UPDATE [DW_Sample].[dbo].[DimStores]
	SET StoreZip = source.Zip
	FROM dimstores target INNER JOIN Pubs.dbo.Stores source
	ON target.StoreId = source.stor_id
	AND target.StoreZip <> source.Zip
	AND target.IsCurrent = 1

	SELECT * FROM [DW_Sample].[dbo].[DimStores]
ROLLBACK TRANSACTION
END

After executing it, we can notice an additional record added to this small dimension in accordance with the business requirements, i.e. every time the source data changes for the stor_name attribute in the Stores table in the pubs database, the old version is retired and a new one added to the dimension in the data mart. This is signified by means of the control column IsCurrent, which changes its value from 1 to 0 every time an old value is superseded by an updated one. In this way any alterations to the source data are preserved in history. Accordingly, the name for StoreID 7067 was changed from 'News & Brews' to 'New Store Name' and the IsCurrent flags were adjusted. Also notice that StoreAddress and StoreZip, both Type 1 SCD, changed (for StoreID 6380 and 7131 respectively) and these columns were simply overwritten with the new values, as per the SCD Type 1 definition.

[Image: MERGE SQL output]

In the above example I used a control attribute called IsCurrent, but using expiry dates to signify whether a given record is current is also common. Again, for a detailed description of how MERGE behaves in SCD deployment please have a look at one of my previous posts HERE. Alternatively, if you're not comfortable with the syntax intricacies of the MERGE SQL command, you can try alternative approaches, e.g. Microsoft's own Slowly Changing Dimension component as per the image below. It is a good way to start if you're just trying to get your head around the concept of SCD types, as it is wizard-driven and easy to configure.

[Image: SSIS default Slowly Changing Dimension component]

Finally, all that is left to do is to populate the fact tables and recreate the foreign keys as per the SQL below.

/*
STEP 9
Populate FactTitlesAuthors fact table with source data assuming that
if historical load has already been run, only new values should be inserted.
*/
USE DW_Sample
INSERT INTO [DW_Sample].[dbo].[FactTitlesAuthors]
(TitleKey, AuthorKey, AuthorOrder)
SELECT
[TitleKey]		= ISNULL(titles.TitleKey, -1),
[AuthorKey]		= ISNULL(authors.AuthorKey, -1),
[AuthorOrder]	= au_ord
FROM pubs.dbo.titleauthor AS titleauthor
JOIN DW_Sample.dbo.DimTitles AS titles
ON titleauthor.Title_id = titles.TitleId
JOIN DW_Sample.dbo.DimAuthors AS authors
ON titleauthor.Au_id = authors.AuthorId
WHERE NOT EXISTS	(SELECT TOP 1 'I already exist' FROM
					[DW_Sample].[dbo].[FactTitlesAuthors] fact
					WHERE fact.AuthorKey	= authors.AuthorKey
					AND fact.TitleKey		= titles.TitleKey)
/*
STEP 10
Populate FactSales fact table with source data assuming that
if historical load has already been run, only new values should be inserted.
*/
USE DW_Sample
INSERT INTO [DW_Sample].[dbo].[FactSales]
(OrderNumber, OrderDateKey, TitleKey, StoreKey, SalesQuantity)
SELECT
[OrderNumber]	= CAST(sales.ord_num AS nvarchar(50)),
[OrderDateKey]	= ISNULL(dates.DateKey, -1),
[TitleKey]		= ISNULL(titles.TitleKey, -1),
[StoreKey]		= ISNULL(stores.StoreKey, -1),
[SalesQuantity] = sales.Qty
FROM pubs.dbo.sales AS sales
JOIN DW_Sample.dbo.DimDates AS dates
ON sales.ord_date = dates.[date]
JOIN DW_Sample.dbo.DimTitles AS titles
ON sales.Title_id = titles.TitleId
JOIN DW_Sample.dbo.DimStores AS stores
ON sales.Stor_id = stores.StoreId
WHERE NOT EXISTS	(SELECT TOP 1 'I already exist' FROM
					[DW_Sample].[dbo].[FactSales] as fact
					WHERE sales.ord_num	= fact.OrderNumber
					AND dates.DateKey	= fact.OrderDateKey
					AND titles.TitleKey = fact.TitleKey
					AND stores.StoreKey = fact.StoreKey)
/*
STEP 11
Recreate all foreign key constraints
*/
USE DW_Sample
ALTER TABLE [dbo].[DimTitles] WITH CHECK ADD CONSTRAINT [FK_DimTitles_DimPublishers]
FOREIGN KEY ([PublisherKey]) REFERENCES [dbo].[DimPublishers] ([PublisherKey])
ALTER TABLE [dbo].[FactTitlesAuthors] WITH CHECK ADD CONSTRAINT [FK_FactTitlesAuthors_DimAuthors]
FOREIGN KEY ([AuthorKey]) REFERENCES [dbo].[DimAuthors] ([AuthorKey])
ALTER TABLE [dbo].[FactTitlesAuthors] WITH CHECK ADD CONSTRAINT [FK_FactTitlesAuthors_DimTitles]
FOREIGN KEY ([TitleKey]) REFERENCES [dbo].[DimTitles] ([TitleKey])
ALTER TABLE [dbo].[FactSales] WITH CHECK ADD CONSTRAINT [FK_FactSales_DimStores]
FOREIGN KEY ([StoreKey]) REFERENCES [dbo].[DimStores] ([Storekey])
ALTER TABLE [dbo].[FactSales] WITH CHECK ADD CONSTRAINT [FK_FactSales_DimTitles]
FOREIGN KEY ([TitleKey]) REFERENCES [dbo].[DimTitles] ([TitleKey])
ALTER TABLE [dbo].[FactSales] WITH CHECK ADD CONSTRAINT [FK_FactSales_DimDates]
FOREIGN KEY([OrderDateKey]) REFERENCES [dbo].[DimDates] ([DateKey])
ALTER TABLE [dbo].[DimTitles] WITH CHECK ADD CONSTRAINT [FK_DimTitles_DimDates]
FOREIGN KEY([PublishedDateKey]) REFERENCES [dbo].[DimDates] ([DateKey])

Please note that both queries populating the fact tables have a NOT EXISTS condition as part of the WHERE clause. We have already talked about inserting new values into dimensions, e.g. SCD Type 2; this time it is an incremental INSERT as part of fact table population. As we only wish to commit the INSERT when new fact data has been found, the NOT EXISTS condition and everything that follows it provides this functionality and excludes records which can already be found in the fact table. There are different types of fact tables, and some of them, e.g. periodic snapshot facts, allow data for the same entities to be inserted repeatedly over a period of time. In our case, however, we are only interested in new records and thus exclude all transactions which have already been accounted for. We can test this out by executing the following SQL, where a new record is temporarily added to the Sales source table in the Pubs database and the code tries to INSERT the same record twice, with only one successful INSERT.

/*
Show how NOT EXISTS will work to prevent duplicate data being inserted in fact table
*/
USE DW_Sample
BEGIN
BEGIN TRANSACTION
USE pubs
	INSERT INTO sales
	(stor_id, ord_num, ord_date, qty, payterms, title_id)
	SELECT
	'7066', 'ZZZ999', '1994-09-14', 1, 'Net 30', 'PS2091'
	SELECT * FROM sales WHERE ord_num = 'ZZZ999'

	DECLARE @CountOfInserts int = 0
	DECLARE @RowsAffected int = 0
	WHILE @CountOfInserts < 2
		BEGIN
			INSERT INTO [DW_Sample].[dbo].[FactSales]
			(OrderNumber, OrderDateKey, TitleKey, StoreKey, SalesQuantity)
			SELECT
			[OrderNumber]	= CAST(sales.ord_num AS nvarchar(50)),
			[OrderDateKey]	= ISNULL(dates.DateKey, -1),
			[TitleKey]		= ISNULL(titles.TitleKey, -1),
			[StoreKey]		= ISNULL(stores.StoreKey, -1),
			[SalesQuantity] = sales.Qty
			FROM pubs.dbo.sales AS sales
			JOIN DW_Sample.dbo.DimDates AS dates
			ON sales.ord_date = dates.[date]
			JOIN DW_Sample.dbo.DimTitles AS titles
			ON sales.Title_id = titles.TitleId
			JOIN DW_Sample.dbo.DimStores AS stores
			ON sales.Stor_id = stores.StoreId
			WHERE NOT EXISTS	(SELECT TOP 1 'I already exist' FROM
								[DW_Sample].[dbo].[FactSales] as fact
								WHERE sales.ord_num	= fact.OrderNumber
								AND dates.DateKey	= fact.OrderDateKey
								AND titles.TitleKey = fact.TitleKey
								AND stores.StoreKey = fact.StoreKey)
			SET @RowsAffected = @@ROWCOUNT

			SELECT 'Pass number' + ' ' + CAST(@CountOfInserts + 1 as varchar (10)) + '. ' +
			'Number of records affected = ' + CAST(@RowsAffected as varchar (10))
			FROM [DW_Sample].[dbo].[FactSales] WHERE OrderNumber = 'ZZZ999'
			SET @CountOfInserts = @CountOfInserts + 1
	END
ROLLBACK TRANSACTION
END

[Image: NOT EXISTS SQL output]

This last bit of code concludes this post and, as usual, all the code samples can be found and downloaded from HERE. Also, if you're keen to follow along, please take note of the sequence numbers assigned to most of the above SQL snippets, as these will be a required reference for the ETL development in the NEXT POST. In the next instalment of this series I will explore SQL Server Integration Services functionality and show how the above code samples will be integrated into an ETL routine to automate data mart loading.


This entry was posted on Monday, September 9th, 2013 at 6:53 am and is filed under Data Modelling, How To's, SQL, SSIS.
