How To Populate Date Dimension Table
May 20th, 2012 / 2 Comments » / by admin
I am confident that there is already a number (probably tens if not hundreds) of Internet posts which deal with the process of ‘DATE’ dimension table population and I thought long and hard whether I should spend my time to reiterate this information; however, how we customize this routine is very individual and there is no ‘one size fits all’ doctrine on how to structure the ETL or query to get the best results. Not trying to echo what has already been said, I thought it would be worthwhile to provide a quick rundown on what I have grown accustomed to when it comes to creating or updating the ‘DATE’ dimension, the linchpin of any dimensional data mart model. This process can be further customized to accommodate any changes or additions based on the business needs.
Nearly any dimensional table can be created using an SSIS package but in this case DTS job just adds to maintenance overhead and complexity of the whole solution, especially, that once populated, Dim_Date does not require further updates or ongoing maintenance. Therefore, this post demonstrates only how to do it via SQL. Also, many people use Excel spreadsheet (a good sample can be downloaded from the Kimbal Group’s website) and simply import this into a table via ‘Import Data’ database task or a job/package. It is a great solution; however, in spite of Excel powerful formulae pool, more complex date derivatives cannot be easily generated. Besides, once most of the Dim_Date table structure is set in concrete, it is easy, quick and powerful method which can be reused (partially or fully) in other projects.
Firstly, we need to create an empty table (Dim_Date) to populate with ‘datetime’ values and their derivatives. It will also be assigned a primary key (Date_Key) as a surrogate key which will become a foreign key in a fact table. We also create a unique index on a Date_Time column to enable faster lookups. The code for all these functions is as follows:
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[Dim_Date]') AND type in (N'U')) DROP TABLE [dbo].[Dim_Date] GO CREATE TABLE [dbo].[Dim_Date]( [Date_Key] [int] NOT NULL, [Date_Time] [datetime] NOT NULL, [Calendar_Date] [date] NOT NULL, [Calendar_Year] [int] NOT NULL, [Calendar_Quarter] [char](8) NOT NULL, [Is_Leap_Year] [int] NOT NULL, [Calendar_Quarter_Number] [int] NOT NULL, [Calendar_Year_Quarter_Name] [varchar](12) NOT NULL, [Calendar_Month] [int] NOT NULL, [Calendar_Month_Number] [int] NOT NULL, [Calendar_Month_Name] [varchar](10) NOT NULL, [Calendar_Year_Month_Name] [varchar](15) NOT NULL, [Calendar_Month_Start] [int] NOT NULL, [Calendar_Month_End] [int] NOT NULL, [Calendar_Week] [int] NOT NULL, [Week_Day] [int] NOT NULL, [Week_Day_Name] [varchar](10) NOT NULL, [Week_Day_Type] [varchar](8) NOT NULL, [Days_In_Calendar_Year] [smallint] NOT NULL, [Days_In_Calendar_Month] [smallint] NOT NULL, [Days_In_Calendar_Week] [smallint] NOT NULL, [Weeks_In_Calendar_Month] [smallint] NOT NULL, [Weeks_In_Calendar_Year] [smallint] NOT NULL, [Public_Holiday_Flag] [varchar] (8) NULL, --Varies depending on the location e.g. country, region [Financial_Year] [int] NOT NULL, --Based on Australian standard financial year start and end dates [Financial_Quarter] [smallint] NOT NULL --Based on Australian standard financial year start and end dates CONSTRAINT [PK_DimDate] PRIMARY KEY CLUSTERED ( [Date_Key] ASC ) ) ON [PRIMARY] GO CREATE UNIQUE INDEX Idx_DimDate ON Dim_Date(Date_Key) GO SET DATEFIRST 1 --If first day of week is a Monday. If first day of week = Saturday then set to 6 GO
Next, once the table has been created, we can populate it with the data which will be defined using a number of functions and formulae encapsulated in a simple stored procedure based around two variables determining the start and end dates for the table content i.e. how far back in history or how far into the future in time you would like to go. The below code creates stored procedure which we will execute to populate our Dim_Date table.
CREATE PROCEDURE dbo.usp_Populate_Date_Dimension ( @Start_Day date, @End_Day date, @Final_Message VARCHAR (150) OUTPUT ) AS BEGIN BEGIN TRY --Create variable to handle record count for verification purposes DECLARE @Record_Count int; --Loop through all dates in the specified date range and insert dates and their derivatives into 'Dim_Date' dimension table WITH DateCTE AS ( SELECT CAST(@Start_Day as datetime) AS Date_Value UNION ALL SELECT DATEADD(dd, 1, Date_Value) FROM DateCTE WHERE DATEADD(dd, 1, Date_Value) <= @End_Day ) INSERT INTO dbo.Dim_Date SELECT [Date_Key] = CAST(CONVERT(CHAR(8),CAST(Date_Value as DATETIME),112) as INT), [Date_Time] = Date_Value, [Calendar_Date] = CAST(Date_Value as DATE), [Calendar_Year] = YEAR(Date_Value), [Calendar_Quarter] = CAST(YEAR(Date_Value) AS CHAR(4)) + CAST(DATEPART(Quarter ,Date_Value) AS CHAR(1)), [Is_Leap_Year] = CASE WHEN DATEPART(day,DATEADD(day,-1,CAST(DATENAME(year,Date_Value) + '-03-01' AS date)))=29 THEN 1 ELSE 0 END, [Calendar_Quarter_Number] = DATEPART(Quarter ,Date_Value), [Calendar_Year_Quarter_Name] = DATENAME(YEAR,Date_Value) + ' Qtr' + DATENAME(QUARTER,Date_Value), [Calendar_Month] = CAST(CONVERT(CHAR(6),CAST(Date_Value AS DATETIME),112) AS INT), [Calendar_Month_Number] = DATEPART(m,Date_Value), [Calendar_Month_Name] = DATENAME(MONTH,Date_Value), [Calendar_Year_Month_Name] = CAST(YEAR(Date_Value) AS CHAR(4)) + ' ' + DATENAME(MONTH,Date_Value), [Calendar_Month_Start] = CONVERT(CHAR(6),CAST(Date_Value AS DATETIME),112) + '01', [Calendar_Month_End] = CAST(CONVERT(CHAR(8),CAST(CONVERT(VARCHAR(12), dateadd(day,-1 * DAY(DATEADD(MONTH,1,Date_Value)), DATEADD(month,1,Date_Value)),113) AS DATETIME),112) AS INT), [Calendar_Week] = CAST(YEAR(Date_Value) AS CHAR(4)) + CAST(DATEPART(wk ,Date_Value) AS VARCHAR(2)), [Week_Day] = DATEPART(dw, Date_Value), [Week_Day_Name] = DATENAME(dw, Date_Value), [Week_Day_Type] = CASE WHEN DATENAME(dw, Date_Value) IN ('Saturday','Sunday') THEN 'Weekend' ELSE 'Weekday' END, --Day_Type_Code function from previous post may also be used here [Days_In_Calendar_Year] = COUNT(DATEPART(wk ,Date_Value)) OVER (PARTITION BY YEAR(Date_Value)), [Days_In_Calendar_Month] = COUNT(*) OVER (PARTITION BY CAST(CONVERT(CHAR(6),CAST(Date_Value AS DATETIME),112) AS INT)), [Days_In_Calendar_Week] = COUNT(*) OVER (PARTITION BY CAST(YEAR(Date_Value) AS CHAR(4)) + CAST(DATEPART(wk ,Date_Value) AS VARCHAR(2)) ), [Weeks_In_Calendar_Month] = (SELECT DISTINCT COUNT(DATEPART(wk,b.Date_Value)) OVER (PARTITION BY DATEPART(m,b.Date_Value)) FROM DateCTE b WHERE DATEPART(m,a.Date_Value) = DATEPART(m,b.Date_Value) AND YEAR(a.Date_Value) = YEAR(b.Date_Value) AND DATEPART(dw, Date_Value) = (1)), [Weeks_In_Calendar_Year] = (SELECT DISTINCT COUNT(DATEPART(wk ,Date_Value)) OVER (PARTITION BY YEAR(Date_Value)) FROM DateCTE b WHERE YEAR(a.Date_Value) = YEAR(b.Date_Value) AND DATEPART(dw, Date_Value) = (1)), [Public_Holiday_Flag] = NULL, [Financial_Year] = CASE WHEN datepart(mm, (select dateadd(d,datediff(d,0,dateadd(s,-1,dateadd(m,datediff(m,0,Date_Value)+1,0))),0))) > 6 THEN datepart(yy, (select dateadd(d,datediff(d,0,dateadd(s,-1,dateadd(m,datediff(m,0,Date_Value)+1,0))),0))) + 1 ELSE datepart(yy,(select dateadd(d,datediff(d,0,dateadd(s,-1,dateadd(m,datediff(m,0,Date_Value)+1,0))),0))) END, [Finacial_Quarter] = CASE WHEN datepart(mm, Date_Value) BETWEEN 7 AND 9 THEN 1 WHEN datepart(mm, Date_Value) BETWEEN 10 AND 12 THEN 2 WHEN datepart(mm, Date_Value) BETWEEN 1 AND 3 THEN 3 WHEN datepart(mm, Date_Value) BETWEEN 4 AND 6 THEN 4 END FROM DateCTE a ORDER BY Date_Key OPTION (MAXRECURSION 0) SET @Record_Count = @@ROWCOUNT; --Disable identity column to insert explicit values into Dim_Date table for handling unknown members IF EXISTS (SELECT c.is_identity FROM sys.tables t JOIN sys.schemas s ON t.schema_id = s.schema_id JOIN sys.Columns c ON c.object_id = t.object_id JOIN sys.Types ty ON ty.system_type_id = c.system_type_id WHERE t.name = 'Dim_date' AND s.Name = 'dbo' AND c.is_identity=1) SET IDENTITY_INSERT dbo.dim_date ON DECLARE @Day date SET @Day='9999-12-31' --Insert one records for handling unknown members during date lookups INSERT INTO dbo.Dim_Date ( [Date_Key], [Date_Time], [Calendar_Date], [Calendar_Year], [Calendar_Quarter], [Is_Leap_Year], [Calendar_Quarter_Number], [Calendar_Year_Quarter_Name], [Calendar_Month], [Calendar_Month_Number], [Calendar_Month_Name], [Calendar_Year_Month_Name], [Calendar_Month_Start], [Calendar_Month_End], [Calendar_Week], [Week_Day], [Week_Day_Name], [Week_Day_Type], [Days_In_Calendar_Year], [Days_In_Calendar_Month], [Days_In_Calendar_Week], [Weeks_In_Calendar_Month], [Weeks_In_Calendar_Year], [Public_Holiday_Flag], [Financial_Year], [Financial_Quarter] ) SELECT [Date_Key] = CAST(CONVERT(CHAR(8),CAST(@Day as DATETIME),112) as INT), [Date_Time] = @Day, [Calendar_Date] = CAST(@day as date), [Calendar_Year] = -1, [Calendar_Quarter] = 'UNKNOWN', [Is_Leap_Year] = -1, [Calendar_Quarter_Number] = -1, [Calendar_Year_Quarter_Name] = 'UNKNOWN', [Calendar_Month] = -1, [Calendar_Month_Number] = -1, [Calendar_Month_Name] = 'UNKNOWN', [Calendar_Year_Month_Name] = 'UNKNOWN', [Calendar_Month_Start] = -1, [Calendar_Month_End] = -1, [Calendar_Week] = -1, [Week_Day] = -1, [Week_Day_Name] = 'UNKNOWN', [Week_Day_Type] = 'UNKNOWN', [Days_In_Calendar_Year] = -1, [Days_In_Calendar_Month] = -1, [Days_In_Calendar_Week] = -1, [Weeks_In_Calendar_Month] = -1, [Weeks_In_Calendar_Year] = -1, [Public_Holiday_Flag] = -1, [Financial_Year] = -1, [Financial_Quarter] = -1 SET @Record_Count = @Record_Count + @@ROWCOUNT --Enable identity column on the Dim_Date table IF EXISTS (SELECT c.is_identity FROM sys.tables t JOIN sys.schemas s ON t.schema_id = s.schema_id JOIN sys.Columns c ON c.object_id = t.object_id JOIN sys.Types ty ON ty.system_type_id = c.system_type_id WHERE t.name = 'Dim_date' AND s.Name = 'dbo' AND c.is_identity=1) SET IDENTITY_INSERT dbo.dim_date OFF IF @record_Count <> (SELECT CAST (DATEDIFF(DAY, @Start_Day, @End_Day) + 2 as INT)) SET @Final_Message = 'Something went wrong! Your table has not been populated.' ELSE SET @Final_Message = 'Everything went well! The data should now be inserted into your table.' END TRY BEGIN CATCH IF @@TRANCOUNT > 0 BEGIN ROLLBACK TRANSACTION END DECLARE @ErrorMessage NVARCHAR(4000); DECLARE @ErrorSeverity INT; DECLARE @ErrorState INT; SELECT @ErrorMessage = ERROR_MESSAGE(), @ErrorSeverity = ERROR_SEVERITY(), @ErrorState = ERROR_STATE(); RAISERROR (@ErrorMessage, -- Message text. @ErrorSeverity, -- Severity. @ErrorState -- State. ); END CATCH IF @@TRANCOUNT > 0 BEGIN COMMIT TRANSACTION END END
The above query is pretty straightforward – create CTE (Common Table Expression), populate it with a range of dates also using query hint (MAXRECURSION) to provide looping and finally create all date derivatives i.e. attributes of the Dim_Date dimension that the business can potentially look for/need as far as reporting and analytics functions. It is essential to ensure that the most common or requested date derivatives are calculated during this process rather than during report query building or cube/fact table population as this will enable the creation of a central repository for date data, also ensuring that unnecessary server resources are not wasted during further ETL processes, cube population, query execution etc. In most cases this can be established early on i.e. during the business requirements gathering phase or looking at sample legacy reports. Also, as this information is fairly rigid and not susceptible to ongoing changes (unlike other dimensions requiring Slowly Changing Dimensions functionality to cater for future alterations), future updates and maintenance should be fairly easy.
OK, now that we’ve got the stored procedure compiled, let’s execute it. You can do that from SQL Server Management Studio or simply by running the following code (note the start and end dates as the table boundaries):
DECLARE @return_value int, @Final_Message varchar(150) EXEC @return_value = [dbo].[usp_Populate_Date_Dimension] @Start_Day = 2012/01/01, @End_Day = 2013/12/31, @Final_Message = @Final_Message OUTPUT SELECT @Final_Message as N'@Final_Message' SELECT 'Return Value' = @return_value GO
You should now be able query the table content and see all the dates and related attributes. Notice that Public_Holiday_Flag attribute in the Dim_Date table can accommodate NULL values and that during the population process we insert NULLs into it. This is because this column will be populated separately due to its nature i.e. these values cannot be computed or derived and need to be inserted manually. The existence of this field is not applicable to all environments but anywhere where KPIs are calculated based on a number of days from or to an event and the exclusion of public holidays is taken into account, this flag is necessary. Unfortunately, this process requires manual updates to a separate lookup table where public holidays are stored although I have seen automated attempts to pull this information from an online feed instead. As much as I am an “automation junkie”, I tend to do it manually as it only takes 2 minutes to gather the date values for the whole year. To populate public holidays flag field I’ve created a control table which I update once a year based on the yearly public holidays calendar issued by the state government. As in Australia not all public holidays are celebrated in all states equally, this may require some modifications based on your business requirements e.g. you may have a number of dates pointing to public holidays for each state/province.
CREATE TABLE [dbo].[Public_Holidays]( [Public_Holiday_Date] [date] NOT NULL ) ON [PRIMARY] GO
All there is left to finish the Dim_Date table population is to issue an UPDATE statement to populate the column according to the Public_Holidays dates as per below:
UPDATE Dim_Date SET Public_Holiday_Flag = 'Y' FROM Dim_Date a INNER JOIN Public_Holidays b ON a.Calendar_Date = b.Public_Holiday_Date GO UPDATE Dim_Date SET Public_Holiday_Flag = 'N' WHERE Public_Holiday_Flag IS NULL GO
This is just one of the ways to create a ‘DATE’ dimension table used for a data mart dimensional model. Some of those attributes we calculated and pushed into the table may be obsolete for your project and some may not be there at all. This is why ‘DATE’ dimension table, although common to all data marts, is an enterprise-specific object and which attributes you use to populate it are driven by your business requirements and needs. If a different data derivative or date format is required, there is no reason why it should not be implemented during this process as apposed to during query structuring for a given output (this really is a separate topic for a separate post). Remember, dimension tables are supposed to be wide, as opposed to fact tables, which are typically long. Having a few extra columns will not hurt a great deal and the CPU cycles we save by ensuring all required attributes are already pre-computed can go a long way during an intensive ETL or report execution process.