How To Populate Date Dimension Table

May 20th, 2012 / 2 Comments » / by admin

I am confident that there is already a number (probably tens if not hundreds) of Internet posts which deal with the process of ‘DATE’ dimension table population and I thought long and hard whether I should spend my time to reiterate this information; however, how we customize this routine is very individual and there is no ‘one size fits all’ doctrine on how to structure the ETL or query to get the best results. Not trying to echo what has already been said, I thought it would be worthwhile to provide a quick rundown on what I have grown accustomed to when it comes to creating or updating the ‘DATE’ dimension, the linchpin of any dimensional data mart model. This process can be further customized to accommodate any changes or additions based on the business needs.

Nearly any dimensional table can be created using an SSIS package but in this case DTS job just adds to maintenance overhead and complexity of the whole solution, especially, that once populated, Dim_Date does not require further updates or ongoing maintenance. Therefore, this post demonstrates only how to do it via SQL. Also, many people use Excel spreadsheet (a good sample can be downloaded from the Kimbal Group’s website) and simply import this into a table via ‘Import Data’ database task or a job/package. It is a great solution; however, in spite of Excel powerful formulae pool, more complex date derivatives cannot be easily generated. Besides, once most of the Dim_Date table structure is set in concrete, it is easy, quick and powerful method which can be reused (partially or fully) in other projects.

Firstly, we need to create an empty table (Dim_Date) to populate with ‘datetime’ values and their derivatives. It will also be assigned a primary key (Date_Key) as a surrogate key which will become a foreign key in a fact table. We also create a unique index on a Date_Time column to enable faster lookups. The code for all these functions is as follows:

IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[Dim_Date]') AND type in (N'U'))
DROP TABLE [dbo].[Dim_Date]
GO
CREATE TABLE [dbo].[Dim_Date](
[Date_Key]						[int]			NOT NULL,
[Date_Time]						[datetime]		NOT NULL,
[Calendar_Date]					[date]			NOT NULL,
[Calendar_Year]					[int]			NOT NULL,
[Calendar_Quarter]				[char](8)		NOT NULL,
[Is_Leap_Year]					[int]			NOT NULL,
[Calendar_Quarter_Number]		[int]			NOT NULL,
[Calendar_Year_Quarter_Name]	[varchar](12)	NOT NULL,
[Calendar_Month]				[int]			NOT NULL,
[Calendar_Month_Number]			[int]			NOT NULL,
[Calendar_Month_Name]			[varchar](10)	NOT NULL,
[Calendar_Year_Month_Name]		[varchar](15)	NOT NULL,
[Calendar_Month_Start]			[int]			NOT NULL,
[Calendar_Month_End]			[int]			NOT NULL,
[Calendar_Week]					[int]			NOT NULL,
[Week_Day]						[int]			NOT NULL,
[Week_Day_Name]					[varchar](10)	NOT NULL,
[Week_Day_Type]					[varchar](8)	NOT NULL,
[Days_In_Calendar_Year]			[smallint]		NOT NULL,
[Days_In_Calendar_Month]		[smallint]		NOT NULL,
[Days_In_Calendar_Week]			[smallint]		NOT NULL,
[Weeks_In_Calendar_Month]		[smallint]		NOT NULL,
[Weeks_In_Calendar_Year]		[smallint]		NOT NULL,
[Public_Holiday_Flag]			[varchar] (8)	NULL,       	--Varies depending on the location e.g. country, region
[Financial_Year]				[int]			NOT NULL,		--Based on Australian standard financial year start and end dates
[Financial_Quarter]				[smallint]		NOT NULL    	--Based on Australian standard financial year start and end dates
 
CONSTRAINT [PK_DimDate] PRIMARY KEY CLUSTERED
(
[Date_Key] ASC
)
) ON [PRIMARY]
GO
CREATE UNIQUE INDEX Idx_DimDate ON Dim_Date(Date_Key)
GO
SET DATEFIRST 1    --If first day of week is a Monday. If first day of week = Saturday then set to 6
GO

Next, once the table has been created, we can populate it with the data which will be defined using a number of functions and formulae encapsulated in a simple stored procedure based around two variables determining the start and end dates for the table content i.e. how far back in history or how far into the future in time you would like to go. The below code creates stored procedure which we will execute to populate our Dim_Date table.

CREATE PROCEDURE dbo.usp_Populate_Date_Dimension
(
@Start_Day		date,
@End_Day		date,
@Final_Message  VARCHAR (150) OUTPUT
)
AS 

BEGIN
BEGIN TRY

--Create variable to handle record count for verification purposes
DECLARE  @Record_Count	int;

--Loop through all dates in the specified date range and insert dates and their derivatives into 'Dim_Date' dimension table
WITH DateCTE AS
(
  SELECT CAST(@Start_Day as datetime) AS Date_Value
  UNION ALL
  SELECT DATEADD(dd, 1, Date_Value)
  FROM DateCTE
  WHERE DATEADD(dd, 1, Date_Value) <= @End_Day
)
 
INSERT INTO dbo.Dim_Date
	SELECT 
	[Date_Key]						=	CAST(CONVERT(CHAR(8),CAST(Date_Value as DATETIME),112) as INT),
	[Date_Time]						=	Date_Value,
	[Calendar_Date]					=	CAST(Date_Value as DATE),
	[Calendar_Year]					=	YEAR(Date_Value),
	[Calendar_Quarter]				=	CAST(YEAR(Date_Value) AS CHAR(4)) + CAST(DATEPART(Quarter ,Date_Value) AS CHAR(1)),
	[Is_Leap_Year]					=	CASE WHEN DATEPART(day,DATEADD(day,-1,CAST(DATENAME(year,Date_Value) + '-03-01' AS date)))=29 THEN 1 ELSE 0 END,
	[Calendar_Quarter_Number]		=	DATEPART(Quarter ,Date_Value),
	[Calendar_Year_Quarter_Name]	=	DATENAME(YEAR,Date_Value) + ' Qtr' + DATENAME(QUARTER,Date_Value),
	[Calendar_Month]				=	CAST(CONVERT(CHAR(6),CAST(Date_Value AS DATETIME),112) AS INT),
	[Calendar_Month_Number]			=	DATEPART(m,Date_Value),
	[Calendar_Month_Name]			=	DATENAME(MONTH,Date_Value),
	[Calendar_Year_Month_Name]		=	CAST(YEAR(Date_Value) AS CHAR(4)) + ' ' + DATENAME(MONTH,Date_Value),
	[Calendar_Month_Start]			=	CONVERT(CHAR(6),CAST(Date_Value AS DATETIME),112) + '01',
	[Calendar_Month_End]			=	CAST(CONVERT(CHAR(8),CAST(CONVERT(VARCHAR(12), dateadd(day,-1 * DAY(DATEADD(MONTH,1,Date_Value)),
										DATEADD(month,1,Date_Value)),113) AS DATETIME),112) AS INT),
	[Calendar_Week]					=	CAST(YEAR(Date_Value) AS CHAR(4)) + CAST(DATEPART(wk ,Date_Value) AS VARCHAR(2)),
	[Week_Day]						=	DATEPART(dw, Date_Value),
	[Week_Day_Name]					=	DATENAME(dw, Date_Value),
	[Week_Day_Type]					=	CASE WHEN DATENAME(dw, Date_Value) IN ('Saturday','Sunday') 
										THEN 'Weekend' ELSE 'Weekday' END,  --Day_Type_Code function from previous post may also be used here
	[Days_In_Calendar_Year]			=	COUNT(DATEPART(wk ,Date_Value)) OVER (PARTITION BY YEAR(Date_Value)),
	[Days_In_Calendar_Month]		=	COUNT(*) OVER (PARTITION BY CAST(CONVERT(CHAR(6),CAST(Date_Value AS DATETIME),112) AS INT)),
	[Days_In_Calendar_Week]			=	COUNT(*) OVER (PARTITION BY CAST(YEAR(Date_Value) AS CHAR(4)) + CAST(DATEPART(wk ,Date_Value) AS VARCHAR(2)) ),
	[Weeks_In_Calendar_Month]		=	(SELECT DISTINCT COUNT(DATEPART(wk,b.Date_Value)) OVER (PARTITION BY DATEPART(m,b.Date_Value))
										FROM DateCTE b  
										WHERE
										DATEPART(m,a.Date_Value) = DATEPART(m,b.Date_Value)
										AND YEAR(a.Date_Value) = YEAR(b.Date_Value)
										AND DATEPART(dw, Date_Value) = (1)),									
	[Weeks_In_Calendar_Year]		=	(SELECT DISTINCT COUNT(DATEPART(wk ,Date_Value)) OVER (PARTITION BY YEAR(Date_Value))
										FROM DateCTE b
										WHERE YEAR(a.Date_Value) = YEAR(b.Date_Value)
										AND DATEPART(dw, Date_Value) = (1)),
	[Public_Holiday_Flag]			=	NULL,
	[Financial_Year]				=	CASE WHEN datepart(mm, (select dateadd(d,datediff(d,0,dateadd(s,-1,dateadd(m,datediff(m,0,Date_Value)+1,0))),0))) > 6
										THEN datepart(yy, (select dateadd(d,datediff(d,0,dateadd(s,-1,dateadd(m,datediff(m,0,Date_Value)+1,0))),0))) + 1
										ELSE datepart(yy,(select dateadd(d,datediff(d,0,dateadd(s,-1,dateadd(m,datediff(m,0,Date_Value)+1,0))),0))) END,
	[Finacial_Quarter]				=	CASE WHEN datepart(mm, Date_Value) BETWEEN 7 AND 9 THEN 1
										WHEN datepart(mm, Date_Value) BETWEEN 10 AND 12 THEN 2
										WHEN datepart(mm, Date_Value) BETWEEN 1 AND 3 THEN 3
										WHEN datepart(mm, Date_Value) BETWEEN 4 AND 6 THEN 4 END
	FROM DateCTE a
	ORDER BY Date_Key
OPTION (MAXRECURSION 0)

SET  @Record_Count = @@ROWCOUNT;

--Disable identity column to insert explicit values into Dim_Date table for handling unknown members
IF EXISTS	(SELECT   c.is_identity
			FROM   sys.tables t
            JOIN sys.schemas s
            ON t.schema_id = s.schema_id
            JOIN sys.Columns c
            ON c.object_id = t.object_id
            JOIN sys.Types ty
            ON ty.system_type_id = c.system_type_id
			WHERE  t.name = 'Dim_date'
            AND s.Name = 'dbo'
            AND c.is_identity=1)
SET IDENTITY_INSERT dbo.dim_date ON

DECLARE @Day date
SET		@Day='9999-12-31'

--Insert one records for handling unknown members during date lookups		  
INSERT INTO dbo.Dim_Date
(
    [Date_Key],
	[Date_Time],
	[Calendar_Date],
	[Calendar_Year],
	[Calendar_Quarter],
	[Is_Leap_Year],
	[Calendar_Quarter_Number],
	[Calendar_Year_Quarter_Name],
	[Calendar_Month],
	[Calendar_Month_Number],
	[Calendar_Month_Name],
	[Calendar_Year_Month_Name],
	[Calendar_Month_Start],
	[Calendar_Month_End],
	[Calendar_Week],
	[Week_Day],
	[Week_Day_Name],
	[Week_Day_Type],
	[Days_In_Calendar_Year],
	[Days_In_Calendar_Month],
	[Days_In_Calendar_Week],
	[Weeks_In_Calendar_Month],
	[Weeks_In_Calendar_Year],
	[Public_Holiday_Flag],			
	[Financial_Year],				
	[Financial_Quarter] 
)
SELECT 
	[Date_Key]						=	CAST(CONVERT(CHAR(8),CAST(@Day as DATETIME),112) as INT),
	[Date_Time]						=	@Day,
	[Calendar_Date]					=	CAST(@day as date),
	[Calendar_Year]					=	-1,
	[Calendar_Quarter]				=	'UNKNOWN',
	[Is_Leap_Year]					=	-1,
	[Calendar_Quarter_Number]		=	-1,
	[Calendar_Year_Quarter_Name]	=	'UNKNOWN',
	[Calendar_Month]				=	-1,
	[Calendar_Month_Number]			=	-1,
	[Calendar_Month_Name]			=	'UNKNOWN',
	[Calendar_Year_Month_Name]		=	'UNKNOWN',
	[Calendar_Month_Start]			=	-1,
	[Calendar_Month_End]			=	-1,
	[Calendar_Week]					=	-1,
	[Week_Day]						=	-1,
	[Week_Day_Name]					=	'UNKNOWN',
	[Week_Day_Type]					=	'UNKNOWN',
	[Days_In_Calendar_Year]			=	-1,
	[Days_In_Calendar_Month]		=	-1,
	[Days_In_Calendar_Week]			=	-1,
	[Weeks_In_Calendar_Month]		=	-1,
	[Weeks_In_Calendar_Year]		=	-1,
	[Public_Holiday_Flag]			=	-1,
	[Financial_Year]				=	-1,
	[Financial_Quarter]				=	-1
	
SET  @Record_Count = @Record_Count + @@ROWCOUNT	
	
--Enable identity column on the Dim_Date table
IF EXISTS	(SELECT   c.is_identity
			FROM   sys.tables t
            JOIN sys.schemas s
            ON t.schema_id = s.schema_id
            JOIN sys.Columns c
            ON c.object_id = t.object_id
            JOIN sys.Types ty
            ON ty.system_type_id = c.system_type_id
			WHERE  t.name = 'Dim_date'
            AND s.Name = 'dbo'
            AND c.is_identity=1)
SET IDENTITY_INSERT dbo.dim_date OFF	

IF @record_Count <> (SELECT CAST (DATEDIFF(DAY, @Start_Day, @End_Day) + 2 as INT))
	SET @Final_Message = 'Something went wrong! Your table has not been populated.'
	ELSE 
	SET @Final_Message = 'Everything went well! The data should now be inserted into your table.'
END TRY

BEGIN CATCH
		IF @@TRANCOUNT > 0
		BEGIN
			ROLLBACK TRANSACTION
		END	
		DECLARE @ErrorMessage NVARCHAR(4000);
		DECLARE @ErrorSeverity INT;
		DECLARE @ErrorState INT;
		SELECT 
			@ErrorMessage = ERROR_MESSAGE(),
			@ErrorSeverity = ERROR_SEVERITY(),
			@ErrorState = ERROR_STATE();
		RAISERROR (@ErrorMessage, -- Message text.
				   @ErrorSeverity, -- Severity.
				   @ErrorState -- State.
				   );
END CATCH
	IF @@TRANCOUNT > 0
	BEGIN
		COMMIT TRANSACTION
	END
END

The above query is pretty straightforward – create CTE (Common Table Expression), populate it with a range of dates also using query hint (MAXRECURSION) to provide looping and finally create all date derivatives i.e. attributes of the Dim_Date dimension that the business can potentially look for/need as far as reporting and analytics functions. It is essential to ensure that the most common or requested date derivatives are calculated during this process rather than during report query building or cube/fact table population as this will enable the creation of a central repository for date data, also ensuring that unnecessary server resources are not wasted during further ETL processes, cube population, query execution etc. In most cases this can be established early on i.e. during the business requirements gathering phase or looking at sample legacy reports. Also, as this information is fairly rigid and not susceptible to ongoing changes (unlike other dimensions requiring Slowly Changing Dimensions functionality to cater for future alterations), future updates and maintenance should be fairly easy.

OK, now that we’ve got the stored procedure compiled, let’s execute it. You can do that from SQL Server Management Studio or simply by running the following code (note the start and end dates as the table boundaries):

DECLARE	@return_value int,
		@Final_Message varchar(150)
EXEC	@return_value = [dbo].[usp_Populate_Date_Dimension]
		@Start_Day = 2012/01/01,
		@End_Day = 2013/12/31,
		@Final_Message = @Final_Message OUTPUT
SELECT	@Final_Message as N'@Final_Message'
SELECT	'Return Value' = @return_value
GO

You should now be able query the table content and see all the dates and related attributes. Notice that Public_Holiday_Flag attribute in the Dim_Date table can accommodate NULL values and that during the population process we insert NULLs into it. This is because this column will be populated separately due to its nature i.e. these values cannot be computed or derived and need to be inserted manually. The existence of this field is not applicable to all environments but anywhere where KPIs are calculated based on a number of days from or to an event and the exclusion of public holidays is taken into account, this flag is necessary. Unfortunately, this process requires manual updates to a separate lookup table where public holidays are stored although I have seen automated attempts to pull this information from an online feed instead. As much as I am an “automation junkie”, I tend to do it manually as it only takes 2 minutes to gather the date values for the whole year. To populate public holidays flag field I’ve created a control table which I update once a year based on the yearly public holidays calendar issued by the state government. As in Australia not all public holidays are celebrated in all states equally, this may require some modifications based on your business requirements e.g. you may have a number of dates pointing to public holidays for each state/province.

CREATE TABLE [dbo].[Public_Holidays](
[Public_Holiday_Date] [date] NOT NULL
) ON [PRIMARY]
GO

All there is left to finish the Dim_Date table population is to issue an UPDATE statement to populate the column according to the Public_Holidays dates as per below:

UPDATE Dim_Date
SET Public_Holiday_Flag =  'Y'
FROM Dim_Date a INNER JOIN Public_Holidays b
ON a.Calendar_Date = b.Public_Holiday_Date
GO
UPDATE Dim_Date
SET Public_Holiday_Flag =  'N'
WHERE Public_Holiday_Flag IS NULL
GO

This is just one of the ways to create a ‘DATE’ dimension table used for a data mart dimensional model. Some of those attributes we calculated and pushed into the table may be obsolete for your project and some may not be there at all. This is why ‘DATE’ dimension table, although common to all data marts, is an enterprise-specific object and which attributes you use to populate it are driven by your business requirements and needs. If a different data derivative or date format is required, there is no reason why it should not be implemented during this process as apposed to during query structuring for a given output (this really is a separate topic for a separate post). Remember, dimension tables are supposed to be wide, as opposed to fact tables, which are typically long. Having a few extra columns will not hurt a great deal and the CPU cycles we save by ensuring all required attributes are already pre-computed can go a long way during an intensive ETL or report execution process.

Tags: ,

2012 Information Management and BI trends report by InformationWeek

May 15th, 2012 / No Comments » / by admin

InformationWeek (www.informationweek.com) has not long ago released the 2012 Information Management and BI trends report. The purpose of the report was to examine adoption trends and strategies around business intelligence, analytics and information management by examining 542 respondents across North America. A copy of the report can be found HERE.

Tags: ,