How To Build A Data Mart Using Microsoft BI Stack Part 2 – OLAP Database Objects Modelling

September 9th, 2013 / by admin

In the FIRST POST of this series I analysed the schema, the objects and the data they hold for our mini data mart project. In this post I would like to start digging into the more nuts-and-bolts side of data warehouse development, emphasising a few key concepts and developing a preliminary code base for the data mart objects.

Key Concepts

I was going to refrain from discussing the intricacies and principles of data mart design in more detail; however, there are three concepts I would like to mention as they form the linchpin of any further modelling or development activities we will proceed with. Number one is the concept of de-normalization. As opposed to OLTP databases, which typically focus on insert and update performance whilst minimising reads by dividing large tables into smaller, less redundant objects and defining relationships between them, OLAP design leans towards optimising read performance by adding redundant data or by grouping data, a process often referred to as de-normalizing. The focus here is on report performance and readability rather than lack of redundancy. A normalized design will often store different but related pieces of information in separate logical tables (called relations). If these relations are stored physically as separate disk files, completing a database query that draws information from several relations (a join operation) can be slow; if many relations are joined, it may be prohibitively slow. De-normalizing speeds up retrieval, making query performance much more efficient. Following this logic, if the operational database stores locations in three separate tables to provide a hierarchical view e.g. dbo.country, dbo.state and dbo.city, for the purpose of dimensional modeling we can roll this data up into one table called dbo.location. That way, in order to retrieve sales data for any given post code stored in the dbo.city table, we will not need to traverse two other tables and incur unnecessary joins. Provided that sales data comes from the dbo.sales fact table, which is directly connected to the dbo.location dimension, only one join will be involved rather than having to navigate through an additional two joins. As our OLTP schema in the pubs database is mostly designed in such a way that collapsing multiple tables into one dimension or fact is hardly necessary, this becomes a straightforward exercise. For a more detailed overview of the methodology involved in dimensional modeling you may want to read THIS DOCUMENT. Alternatively, there is a copious amount of great publications available on the Internet.
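To make the roll-up a little more concrete, below is a minimal sketch of how the hypothetical dbo.country, dbo.state and dbo.city tables mentioned above could be collapsed into a single dbo.location table; the column names are assumptions used purely for illustration.

--De-normalization sketch: collapse the hypothetical country/state/city
--hierarchy into a single dbo.location table (column names are assumed)
SELECT ci.city_id AS LocationId
, ci.city_name AS City
, ci.post_code AS PostCode
, st.state_name AS [State]
, co.country_name AS Country
INTO dbo.location
FROM dbo.city ci
JOIN dbo.state st ON st.state_id = ci.state_id
JOIN dbo.country co ON co.country_id = st.country_id
GO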

Another important concept in dimensional modeling is that of surrogate keys. The idea is that, instead of having the values that naturally occur in the OLTP environment, we have a synthetic integer value that acts as the dimension key. This surrogate key then becomes a foreign key in the fact table, creating a relationship between the two. It is considered best practice to implement surrogate keys on all dimensions, possibly with the exception of the date dimension which, depending on the school of thought you subscribe to, may store the converted date in a YYYYMMDD format as the table key, use an integer surrogate key, or simply store the date in a pure date format (either way the key should be made up of unique values only). Surrogate keys assist when merging data from many dispersed OLTP systems, improve query performance and allow for change tracking e.g. the Slowly Changing Dimensions concept (more on SCD in subsequent posts to this series).
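As a quick illustration of how a surrogate key is used in practice, the sketch below (my own example, based on the DW_Sample schema created further down in this post and assuming pubs sits on the same instance) resolves the natural store id from pubs into the integer StoreKey that the fact table will ultimately store.

--Surrogate key lookup sketch: swap the natural key (stor_id) for the
--dimension's synthetic StoreKey during a fact load
SELECT ds.StoreKey --surrogate key becomes the fact table's foreign key
, s.ord_num
, s.qty
FROM pubs.dbo.sales AS s
JOIN dbo.DimStores AS ds ON ds.StoreId = s.stor_id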

Finally, the concept of bridge tables, also called associative entities. A bridge table’s purpose is to provide a connection between dimension tables in a many-to-many relationship, and it is often referred to as a factless fact table. If we look at the image below, with the relevant objects highlighted, and imagine that dbo.sales will become the primary fact table whereas dbo.authors and dbo.titles will be converted into dimension tables, we will notice that in order to process authors’ information we will need to go through the titles’ data and its associated dbo.titleauthor table. Therefore, dbo.titleauthor is a good candidate for a factless fact table – a fact table that serves as a link between two dimensions.
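To see why the bridge is needed, the short sketch below (using the pubs column names) shows that the only way to get from titles to their authors in the source is via dbo.titleauthor:

--Navigating the many-to-many relationship through the bridge table
SELECT t.title
, a.au_fname
, a.au_lname
FROM pubs.dbo.titles AS t
JOIN pubs.dbo.titleauthor AS ta ON ta.title_id = t.title_id
JOIN pubs.dbo.authors AS a ON a.au_id = ta.au_id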

[Image: pubs OLTP schema with the bridge table (dbo.titleauthor) highlighted]

Based on the above information and the schema/data analysis from POST 1, we can conclude that the sales table in the pubs database will become our fact table; publishers, authors, titles and stores will become dimension tables, whereas the dbo.titleauthor bridge table will become a factless fact table. Therefore, once all objects have been created (see the SQL code further on in this post), the data mart schema should resemble the diagram in the image below.

[Image: DW_Sample data mart ERD]

There is a multitude of other concepts that are very relevant when it comes to data mart design and modelling. Taking dimension modelling alone, notions such as conformed dimensions, Slowly Changing Dimensions (I outlined the SCD approach HERE, HERE, HERE and HERE), junk dimensions, parent-child dimensions, role-playing dimensions, degenerate dimensions etc. come to mind; however, given the conciseness of this post and its intended purpose – technical details over paradigms and methodologies – I will refer to those and other concepts as we go through the remainder of this series.

SQL Code

Let’s dive into the less conceptual aspects of data mart design – scripting out our data mart objects based on the information from Post 1 and the concepts covered at the start of this post. First, I created two directories which I will use for storing the SQL Server data, log and backup files – DW_Sample_Files on the C:\ drive and a DW_Sample_BackUp directory within it. Next, I will create a new DW_Sample database (this will be our relational warehouse database) and all its objects, as well as perform a backup to the nominated directory. All of this is done via T-SQL rather than the SQL Server Management Studio graphical interface, which should reduce the number of screenshots and steps involved. I have also included a detailed description of the individual sections of the script for further clarification and analysis.

/*
STEP 1
Check if the database we wish to create is already in place and if so
drop it and recreate it with files (log and data) located as per
description above. Next, assign login SA (administrator) as the database
owner and set the recovery model to BULK_LOGGED.
*/
USE [master]
GO
IF EXISTS (SELECT name FROM sys.databases WHERE name = N'DW_Sample')
BEGIN
-- Close connections to the DW_Sample database
ALTER DATABASE [DW_Sample] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE [DW_Sample]
END
GO
CREATE DATABASE [DW_Sample] ON PRIMARY
( NAME = N'DW_Sample'
, FILENAME = N'C:\DW_Sample_Files\DW_Sample.mdf'
, SIZE = 10MB
, MAXSIZE = 1GB
, FILEGROWTH = 10MB )
LOG ON
( NAME = N'DW_Sample_log'
, FILENAME = N'C:\DW_Sample_Files\DW_Sample_log.LDF'
, SIZE = 1MB
, MAXSIZE = 1GB
, FILEGROWTH = 10MB)
GO
--Assign database ownership to login SA
EXEC [DW_Sample].dbo.sp_changedbowner @loginame = N'SA', @map = false
GO
--Change the recovery model to BULK_LOGGED
ALTER DATABASE [DW_Sample] SET RECOVERY BULK_LOGGED
GO

/*
STEP 2
Create individual objects i.e. fact and dimension tables and create
foreign keys to establish referential integrity between individual tables
*/
USE [DW_Sample]
GO
--Create the Dimension Tables
CREATE TABLE [dbo].[DimStores](
[StoreKey] [int] NOT NULL PRIMARY KEY IDENTITY (1, 1),
[StoreId] [nchar](12) NOT NULL,
[StoreName] [nvarchar](50) NOT NULL,
[StoreAddress] [nvarchar] (80) NOT NULL,
[StoreCity] [nvarchar] (40) NOT NULL,
[StoreState] [nvarchar] (12) NOT NULL,
[StoreZip] [nvarchar] (12) NOT NULL,
[IsCurrent] [int] NOT NULL
)
GO
CREATE TABLE [dbo].[DimPublishers](
[PublisherKey] [int] NOT NULL PRIMARY KEY IDENTITY (1, 1),
[PublisherId] [nchar](12) NOT NULL,
[PublisherName] [nvarchar](50) NOT NULL
)
GO
CREATE TABLE [dbo].[DimDates](
[DateKey] [int] NOT NULL PRIMARY KEY IDENTITY (1, 1),
[Date] [datetime] NOT NULL,
[DateName] [nvarchar](50) NULL,
[Month] [int] NOT NULL,
[MonthName] [nvarchar](50) NOT NULL,
[Quarter] [int] NOT NULL,
[QuarterName] [nvarchar](50) NOT NULL,
[Year] [int] NOT NULL,
[YearName] [nvarchar](50) NOT NULL
)
GO
CREATE TABLE [dbo].[DimAuthors](
[AuthorKey] [int] NOT NULL PRIMARY KEY IDENTITY (1, 1),
[AuthorId] [nchar](12) NOT NULL,
[AuthorName] [nvarchar](100) NOT NULL,
[AuthorState] [nchar](12) NOT NULL
)
GO
CREATE TABLE [dbo].[DimTitles](
[TitleKey] [int] NOT NULL PRIMARY KEY IDENTITY (1, 1),
[TitleId] [nvarchar](12) NOT NULL,
[TitleName] [nvarchar](100) NOT NULL,
[TitleType] [nvarchar](50) NOT NULL,
[PublisherKey] [int] NOT NULL,
[TitlePrice] [decimal](18, 4) NOT NULL,
[PublishedDateKey] [int] NOT NULL
)
GO
--Create the Fact Tables
CREATE TABLE [dbo].[FactTitlesAuthors](
[TitleKey] [int] NOT NULL,
[AuthorKey] [int] NOT NULL,
[AuthorOrder] [int] NOT NULL,
CONSTRAINT [PK_FactTitlesAuthors] PRIMARY KEY CLUSTERED
( [TitleKey] ASC, [AuthorKey] ASC )
)
GO
CREATE TABLE [dbo].[FactSales](
[OrderNumber] [nvarchar](50) NOT NULL,
[OrderDateKey] [int] NOT NULL,
[TitleKey] [int] NOT NULL,
[StoreKey] [int] NOT NULL,
[SalesQuantity] [int] NOT NULL,
CONSTRAINT [PK_FactSales] PRIMARY KEY CLUSTERED
( [OrderNumber] ASC,[OrderDateKey] ASC, [TitleKey] ASC, [StoreKey] ASC )
)
GO
--Add Foreign Keys
ALTER TABLE [dbo].[DimTitles] WITH CHECK ADD CONSTRAINT [FK_DimTitles_DimPublishers]
FOREIGN KEY([PublisherKey]) REFERENCES [dbo].[DimPublishers] ([PublisherKey])
GO
ALTER TABLE [dbo].[FactTitlesAuthors] WITH CHECK ADD CONSTRAINT
[FK_FactTitlesAuthors_DimAuthors]
FOREIGN KEY([AuthorKey]) REFERENCES [dbo].[DimAuthors] ([AuthorKey])
GO
ALTER TABLE [dbo].[FactTitlesAuthors] WITH CHECK ADD CONSTRAINT
[FK_FactTitlesAuthors_DimTitles]
FOREIGN KEY([TitleKey]) REFERENCES [dbo].[DimTitles] ([TitleKey])
GO
ALTER TABLE [dbo].[FactSales] WITH CHECK ADD CONSTRAINT [FK_FactSales_DimStores]
FOREIGN KEY([StoreKey]) REFERENCES [dbo].[DimStores] ([StoreKey])
GO
ALTER TABLE [dbo].[FactSales] WITH CHECK ADD CONSTRAINT [FK_FactSales_DimTitles]
FOREIGN KEY([TitleKey]) REFERENCES [dbo].[DimTitles] ([TitleKey])
GO
ALTER TABLE [dbo].[FactSales] WITH CHECK ADD CONSTRAINT [FK_FactSales_DimDates]
FOREIGN KEY([OrderDateKey]) REFERENCES [dbo].[DimDates] ([DateKey])
GO
ALTER TABLE [dbo].[DimTitles] WITH CHECK ADD CONSTRAINT [FK_DimTitles_DimDates]
FOREIGN KEY([PublishedDateKey]) REFERENCES [dbo].[DimDates] ([DateKey])
GO

/*
STEP 3
Back up the new database to disk, placing the file in the nominated directory,
then restore the database from that backup file, replacing the existing copy
*/
BACKUP DATABASE [DW_Sample]
TO DISK =
N'C:\DW_Sample_Files\DW_Sample_BackUp\DW_Sample.bak'
GO
USE [Master]
RESTORE DATABASE [DW_Sample]
FROM DISK =
N'C:\DW_Sample_Files\DW_Sample_BackUp\DW_Sample.bak'
WITH REPLACE
GO
ALTER DATABASE [DW_Sample]
SET MULTI_USER

After successful code execution you should be left with roughly the same message as in the image below.

[Image: DW_Sample backup and restore output messages]
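As an optional sanity check (not part of the original script), a quick query such as the sketch below can confirm that the tables and foreign keys survived the backup/restore round trip:

--Optional check: list user tables and foreign keys in the restored database
USE [DW_Sample]
GO
SELECT name, type_desc
FROM sys.objects
WHERE type IN ('U', 'F')
ORDER BY type_desc, name
GO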

This concludes this post and, as usual, all code samples and solution files can be found and downloaded from HERE. In the NEXT POST I would like to dig deeper into the technical aspects of data mart development and touch on the SQL used to populate the data mart tables, which will later become a prelude to the SSIS package structure and design covered in a subsequent post.

Please also check other posts from this series:

All SQL code and solution files can be found and downloaded from HERE.


How To Build A Data Mart Using Microsoft BI Stack Part 1 – Introduction and OLTP Database Analysis

September 8th, 2013 / by admin

First of all, as a little introduction, I would like to give you the heads-up that if you are keen to understand the whole SDLC of data mart development, this post is just a prelude to a whole series of posts dealing with the task of creating a comprehensive BI project using Microsoft technology – starting with a little data reconnaissance, through creating ETL routines and SQL code logic, cubes and reports, and finally deployment and solution fine-tuning. If you follow along through the next few posts, you should gain a good understanding of how to take a similar project from start to finish using a similar approach (links to the remaining posts in this series will be added to the very bottom of this page as they accumulate, if you wish to jump ahead). Also, I will be placing all files relevant to each post in this series in my public SkyDrive folder (here is the LINK) if you wish to download any resources, code samples, solution files etc.

The reason I decided to come up with a series of posts outlining how to build a data mart from scratch is that, although there is a large pool of materials and publications dealing with the methodologies, approaches and general know-how of developing a robust data mart, I have always felt that there is not enough hands-on, pragmatic, nuts-and-bolts information on the subject. Yes, there are tons of articles and books on individual realms of BI that will teach you how to become a proficient ETL developer or an expert MDX programmer, but hardly any of them take the practical approach and apply it from beginning to end. Having said that, please also note that most data mart or data warehouse development projects are quite complex and deal with a large amount of business-specific intricacies, hence it is impossible for a short blog post to cover those. Therefore, what you will see over the next few posts relates to the most basic and rudimentary principles and approaches to creating a typical data mart. This, however, should be enough to give anyone a decent overview of the process, which can then be applied to other projects without muddying the water too much.

OLTP Database Analysis

Let’s envisage that all the prelude work that precedes the technical aspects is behind us. The feasibility study has been conducted, end users interviewed, the project signed off etc., and we, as developers, finally get a chance to get our hands dirty and look at the data. Without further ado, let’s explore the first aspect of data mart development – transactional data. Some would argue that you can build a data mart on data coming from sources other than an OLTP database and that this step should not be number one on your agenda i.e. you should be interviewing relevant parties, conducting feasibility analysis etc. I agree with all of those premises. However, it is much easier to explain the data flow if we have something familiar to work with (in this case our old and trusty PUBS database). Besides, all the non-technical prerequisite information can be found in other resources, so it will not be the focus of this post. Also, please note that medium to large projects will typically have a “data” person as part of the team; someone who has a fine understanding of the transactional systems as well as the business rules that govern them, strictly from the data point of view. Typically, data modellers working in conjunction with business analyst(s) will fill that role. However, it is not unusual for a developer to be tasked with defining and designing the dimensional database schema, so be prepared to pitch in, particularly when what is required is only a small data mart rather than a large enterprise data warehouse.

To follow along you will need a copy of the PUBS database restored on your MS SQL Server instance. Detailed explanations of how to achieve this can be found in numerous places on the internet (Google is your friend); however, for your convenience, I have also included it as part of the solution files which can be found in my SkyDrive folder HERE. The newly created database schema should look roughly as per the image below.

[Image: pubs OLTP database schema (ERD)]
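If you prefer to script out the restore rather than use the solution files, below is a minimal sketch of restoring pubs from a backup file; the disk path and logical file names are assumptions, so adjust them to your environment (alternatively, Microsoft’s instpubs.sql script will create and populate the database from scratch).

--Restore the pubs sample database (paths and logical file names are assumed)
USE [master]
GO
RESTORE DATABASE [pubs]
FROM DISK = N'C:\DW_Sample_Files\pubs.bak'
WITH MOVE 'pubs' TO N'C:\DW_Sample_Files\pubs.mdf',
MOVE 'pubs_log' TO N'C:\DW_Sample_Files\pubs_log.ldf',
REPLACE
GO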

The key table in the above ERD diagram is dbo.sales, which contains transactional records of each sale e.g. order number, order date, quantity and the id of the store where the transaction was made. Sales transactions will become the focal point of our solution; however, other tables are just as important. Let’s quickly run through some of the entities to see what value they present to the overall solution.

  • dbo.stores and dbo.titles tables contain information on which books are sold to which stores, therefore they are both useful and worth taking into consideration
  • dbo.publishers table contains publishers’ information. As each title has a publisher assigned to it, it may be worth including in the overall design as end users may want to use this information for further sales analysis
  • dbo.authors table contains information on which authors are responsible for writing each book. Sales data will most likely be analysed by title; however, given that authors’ information is easily accessible, we may also want to take it into consideration
  • dbo.employee table contains data about the employees who work for each of the publishers, which in the context of sales is irrelevant even though at face value it may seem otherwise
  • dbo.jobs table contains employment details for the publishers’ employees, which does not present any value at this stage
  • dbo.titleauthor table, apart from tracking the royalty percentage given to various authors, also acts as a bridge table (many-to-many relationship) between the dbo.titles and dbo.authors tables. Given that both will be used, this table needs to be included
  • dbo.roysched table tracks ranges of royalties based on the number of books sold. At this stage, even though this information affects profit data, the table can be treated as irrelevant to the overall sales information and will not be taken into consideration
  • dbo.pub_info table will not be used in this project as the data it holds does not have a direct impact on the sales information
  • dbo.discounts table seems incomplete (only three records, with many NULL values) so at this stage it will not be used as part of the project (see the quick data-profiling sketch below)
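Before settling on the final list, a quick bit of data profiling helps validate the observations above; the sketch below lists row counts per table and shows how sparsely dbo.discounts is populated.

--Row counts for all user tables in pubs
USE [pubs]
GO
SELECT t.name AS TableName, SUM(p.rows) AS [RowCount]
FROM sys.tables AS t
JOIN sys.partitions AS p
ON p.object_id = t.object_id AND p.index_id IN (0, 1)
GROUP BY t.name
ORDER BY t.name
GO
--dbo.discounts holds only three rows, most of the attributes being NULL
SELECT * FROM dbo.discounts
GO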

Once we have the relevant tables defined, it is time to look at the individual attributes and determine which columns contain important information we may consider bringing across into our data mart. Selecting from each table identified as a good source of information, let’s look at their content and what can be utilised to make the most out of the data stored, bearing in mind the difference between a measured value and a descriptive value. In a data warehouse, measured values are translated into measures and descriptive values are translated into dimension attributes.

Many developers will have differing opinions on the best way to design a data warehouse, but there are common characteristics you can expect to see in all of them. The first is a set of values used for reports, called measures. For example, Inventory Units and Sales Dollars can be considered measures. Another common characteristic found in data warehouses is a set of dimensions. Dimensions describe the measured data, for example the dates on which the Inventory Units were documented or the zip codes of the customers who bought a particular product. More information on the measures versus dimensions topic can be found on the Internet (describing their characteristics is not the purpose of this post); alternatively, you can read the widely regarded books and published materials from the two most prolific pioneers of data modelling and data warehousing concepts, Ralph Kimball and Bill Inmon. Also, since neither this post nor subsequent posts will cover topics such as granularity, degenerate, junk or role-playing dimensions, bridge tables etc. in extensive detail, please bear in mind that a typical data warehouse project is much more involved and complex and you will probably need to be familiar with such intricacies.
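Applied to pubs, the distinction is easy to see in a simple aggregation sketch like the one below: qty plays the role of the measure, while the store (and later the date or title) provides the dimensional context.

--qty is the measured value; stor_id supplies the descriptive context
USE [pubs]
GO
SELECT s.stor_id
, SUM(s.qty) AS TotalUnitsSold
FROM dbo.sales AS s
GROUP BY s.stor_id
ORDER BY TotalUnitsSold DESC
GO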

Back to our data: let’s explore what sort of information the previously described tables hold that may be of interest to data warehouse users.

  • Selecting a subset of records from the dbo.sales table reveals that the qty attribute is a good candidate for a measure as it is quantifiable, additive and can provide a business user with the information he/she would most likely want to analyse. Other fields are more descriptive and may be used to provide context to the sales quantity e.g. ord_date can tell us when the order happened, stor_id describes which location the order originated from etc. Some of those attributes will need to be included as dimensional keys as they form a linkage to other tables which in turn contain more descriptive data e.g. the stor_id attribute needs to reference the dbo.stores table in order for us to find out other store details such as location or name
  • Selecting from the dbo.titles table reveals a few interesting facts. First of all, the title column contains useful descriptions corresponding to each title_id, which is also grouped by individual type. This can potentially allow for rolling titles up by type classification. Moreover, we have a price column, which is imperative for finding out monetary amounts, not just the quantities we get from the dbo.sales table (see the exploratory query after this list)
  • Looking at the dbo.publishers table we can see that publishers’ names along with some location details make up the majority of the data. The publisher name attribute is something we may consider using in reports so we can dimension our data by it
  • dbo.authors table contains, as the name suggests, details pertaining to authors. It is a good example of a table whose data can be useful from a dimension point of view as most of its attributes are descriptive
  • Looking at the data from the dbo.titleauthor table reveals that this object is necessary in order to connect dbo.authors and dbo.titles together. Therefore, the two columns which make up the composite key – au_id and title_id – are essential to include
  • dbo.stores table, similarly to the dbo.authors table, contains mainly descriptive data referring to stores’ locations and names. This is another good example of a dimension, with all attributes presenting value for potential report slicing, dicing and drill-through
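The exploratory query below (a minimal sketch against the pubs source) ties the observations about dbo.sales and dbo.titles together, combining the qty measure with the price and type attributes; note that price is nullable in pubs, hence the ISNULL guard.

--Units and (approximate) revenue per title type, straight from the source
USE [pubs]
GO
SELECT t.type
, SUM(s.qty) AS TotalUnits
, SUM(s.qty * ISNULL(t.price, 0)) AS ApproxRevenue
FROM dbo.sales AS s
JOIN dbo.titles AS t ON t.title_id = s.title_id
GROUP BY t.type
ORDER BY ApproxRevenue DESC
GO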

Apart from the objects we have analysed, which may potentially turn/merge into either fact table(s) or dimension tables, nearly all data marts also include a ‘synthetic’ dimension containing dates and their corresponding formats, which gets populated based on specific business needs. Even though this table does not participate in the same process we go through when modelling dimensions and facts based on the OLTP database, it is imperative to consider its structure as part of the overall OLAP schema planning, as dates (along with their derived values) are almost always essential to providing valuable context for the facts analysed (more on how to build a DimDates table for this solution in a subsequent post in this series; alternatively check out THIS post for a more comprehensive solution).
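For completeness, below is a minimal sketch of how such a dimension could be seeded with a contiguous range of calendar days; the date range and name formats are assumptions, the column layout follows the DimDates table built in the next post, and the full version is covered later in the series.

--Seed a DimDates-style table with one row per calendar day (range assumed)
DECLARE @Date datetime = '1990-01-01'
WHILE @Date <= '1995-12-31'
BEGIN
INSERT INTO dbo.DimDates ([Date], [DateName], [Month], [MonthName],
[Quarter], [QuarterName], [Year], [YearName])
SELECT @Date
, CONVERT(nvarchar(50), @Date, 106) --e.g. 01 Jan 1990
, MONTH(@Date)
, DATENAME(month, @Date)
, DATEPART(quarter, @Date)
, 'Q' + CAST(DATEPART(quarter, @Date) AS nvarchar(2))
, YEAR(@Date)
, CAST(YEAR(@Date) AS nvarchar(50))
SET @Date = DATEADD(day, 1, @Date)
END
GO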

It seems that we now have enough conceptual details around the objects, the data they hold and whether they are useful to proceed with the data mart solution development. In the NEXT POST I will look at some architectural pitfalls when moving into the next phase of development and start to develop the code base for advancing into OLAP schema creation.

Please also check other posts from this series:

All SQL code and solution files can be found and downloaded from HERE.
