How To Build A Data Mart Using Microsoft BI Stack Part 5 – Defining SSAS Project And Its Dimensions

September 16th, 2013 / 2 Comments » / by admin

In the PREVIOUS POST I explored SQL Server Integration Services functionality and how the SQL code samples were integrated into an ETL routine to automate data mart loading. In this post I would like to start exploring the concept of using SQL Server Analysis Services as a storage engine for our dimensional data and how to start building a simple SSAS project with a Data Source View and SSAS dimensions defined.

Let’s open up BIDS or the SQL Server Data Tools application (depending on which SQL Server version you’re using) and create an empty SSAS project. The first thing to do is to set up a data source connection. This is a wizard-driven task and we can execute it by right-clicking on the Data Sources folder inside the Solution Explorer pane and selecting New Data Source from the options provided. If you’re following along from the PREVIOUS POST, the connection we established in order to create our SSIS package should already be available to you. If not, please complete the wizard selecting the DW_Sample database we created as our relational data mart for this project. Also, I recommend that you use the Windows User Name and Password option when providing login credentials.

Next, it is time to create a data source view. A data source view (DSV) is a logical layer of abstraction consisting of tables, with each table backed by a saved SQL statement. It provides an interface and separation layer between the relational database and cubes, which can be very beneficial, e.g. for adjusting the data model to fit the cube design without changing the underlying table structure. To start creating the DSV, right-click on Data Source Views under the Solution Explorer pane and select New Data Source View to kick off the wizard. Stepping through the wizard, after we have selected the previously defined data source, we should be presented with the Select Tables and Views pane as per image below.

Select_tables_and_views_DSV_wizard_HTDM

Let’s select all available objects, i.e. two fact tables and five dimension tables, by moving them from the left-hand side to the right-hand side by dragging and dropping or alternatively using the buttons between the two windows. On wizard completion we should be presented with a Data Source View table schema as per image below.

Data_source_view_SASS_HTDM

As we have foreign key constraints already defined, the wizard automatically maps logical relationships between tables based on these constraints. If the constraints were missing, we would need to define them in a separate step. In the Data Source View Designer, both SQL Server tables and views are presented as tables. Each of those tables on the designer surface represents an underlying SQL SELECT statement. For example, the DimAuthors table represents the following SQL statement: SELECT AuthorKey, AuthorId, AuthorName, AuthorState FROM DimAuthors. The Analysis Services server uses these SELECT statements to copy data into the cube.
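
The DSV is also where this abstraction pays off in practice: any table can be swapped for a named query, i.e. a hand-written SELECT that reshapes the data without touching the underlying schema (right-click the table in the designer and choose Replace Table > With New Named Query). Below is a minimal, hypothetical sketch for DimAuthors; the derived AuthorStateGroup column is purely illustrative and not part of the project.

-- A hypothetical named query for DimAuthors; the CASE column is an
-- illustrative addition, while the base columns come from the real table.
SELECT
    AuthorKey,
    AuthorId,
    AuthorName,
    AuthorState,
    CASE WHEN AuthorState = 'CA' THEN 'Local' ELSE 'Interstate' END AS AuthorStateGroup
FROM dbo.DimAuthors;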

The next step is to define dimensions, which describe attributes of measured values. In the DW_Sample_SSAS project, we currently have a single measure we need to describe: SalesQuantity. The SalesQuantity attributes are an order number, order date, title, and store. Accordingly, we create a dimension for each of these descriptors. To create a dimension, launch the Dimension Wizard by right-clicking the Dimensions folder in Solution Explorer and selecting New Dimension from the context menu. On the second screen, called Select Creation Method, let’s select the Use an Existing Table option (as we already have fact and dimension tables defined in the database). Following on, we will specify the Data Source View (we only have one in this project) and select a table from the Main Table dropdown box. Each dimension may have only one main table. In cases where the dimension is designed in a snowflake format, such as the Titles dimension in the DW_Sample_SSAS project, the main table is the one directly connected to the fact table. This means you do not have to select both the DimTitles and DimPublishers tables, only the main DimTitles table as per image below.

Source_info_dims_wizard_HTDM
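
Before filling in the rest of the wizard, it can help to confirm the snowflake shape with a quick query in SQL Server Management Studio. The sketch below joins the two dimension tables directly; TitleKey and PublisherKey are the project’s surrogate keys, while the name columns are assumptions based on the series’ naming convention.

-- Illustrative query tracing the DimTitles -> DimPublishers snowflake link;
-- TitleName and PublisherName are assumed column names.
SELECT t.TitleKey, t.TitleName, p.PublisherName
FROM dbo.DimTitles AS t
JOIN dbo.DimPublishers AS p
    ON p.PublisherKey = t.PublisherKey;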

Once you select a table, select one or more columns that define the table’s dimensional key. Typically, this is a single key column such as TitleKey as per image above. If you have a composite key, you can select a second column by clicking Add key column after you choose the initial key column. The last dropdown box on this page allows you to select a name column. The name column is the one that holds a label for the key column’s value. Clicking the Next button advances the wizard to one of two pages: the Select Dimension Attributes page or the Select Related Tables page. If the only relationship line you have in the data source view is to the fact table, you will advance to the Select Dimension Attributes page. If, however, you have tables in a snowflake pattern with relationship lines connecting to other dimensional tables, you will advance to the Select Related Tables page as per image below.

Related_tables_dims_wizard_HTDM

In our example, the DimTitles table is connected to both the DimPublishers and DimDates tables in the data source view. Therefore, the Select Related Tables page displays both of these tables with a checked checkbox. Including a related table is optional; the wizard allows you to uncheck the checkbox if you feel a related table is not appropriate for the dimension you are currently building. In the case of the DimTitles table, we will leave both tables checked because they include additional information that we want to include as part of the Titles dimension. Next, on the Select Dimension Attributes page, you can choose all or some of the available attributes. In the current example, the DimTitles table relates to both the DimPublishers table and the DimDates table to form a snowflake design within the data source view. Because of this, we see columns from multiple dimension tables on the Select Dimension Attributes page as per image below.

Dim_attributes_dims_wizard_1_HTDM

It makes sense to include some of these columns from each dimension table, but not necessarily all of them. Including publishers’ names, for example, will allow for creating reports that can group titles by publishers. The same is true of the title types. But the artificial publisher key or the artificial date key (without an associated name column) is not as useful, because it is unlikely a report will benefit from either one of these columns. They have no significance outside the context of the data warehouse design. In the same step we can also define whether the attribute is visible to end users, through the Enable Browsing option, and set the attribute type.

Dim_attributes_dims_wizard_2_HTDM

Most software ignores the Attribute Type setting; therefore, it has little to no impact on creating reports. Microsoft has included the setting for applications programmed to use it, but leaving the selection at Regular is most often the appropriate choice. Occasionally an attribute’s type must be more definitive. For example, when the dimensional attribute includes date data, the attribute type should always be configured to reflect the content. The date data determines how SSAS performs aggregations and how MDX functions are processed. In our current example, the date, month, quarter, and year attributes must be configured accordingly. Once all parameters have been defined we can save the dimension under a new name and proceed to further configuration steps before we can deploy it.

In order to cut down on the number of steps and screenshots, I have put together a short document which takes each project dimension and configures its attributes. It contains all the necessary steps to configure the dimensions as per the above recommendations and if you’re keen to follow along you can access/download it from HERE.

Believe it or not, there is still quite some work to be done to configure the dimensions and fact tables in order to create an SSAS cube. In the NEXT PART to this series, I will go over some of the dimension properties and hierarchies that still require attention before explaining the cube creation and deployment process. As usual, the code used in this series as well as the solution files for this SSAS project can be found and downloaded from HERE.

Please also check other posts from this series:

All SQL code and solution files can be found and downloaded from HERE.


How To Build A Data Mart Using Microsoft BI Stack Part 4 – Data Mart Load Using SSIS

September 15th, 2013 / No Comments » / by admin

In previous posts in this series (HERE, HERE and HERE) I outlined some of the key concepts of data warehouse design and modelling, analysed the source objects and the data they hold, and built the code base for data mart objects deployment and population. In this post I will explore SQL Server Integration Services functionality and how the code samples from the PREVIOUS POST can be integrated into an ETL routine to automate data mart loading.

Without further ado, let’s open up BIDS or the SQL Server Data Tools application (depending on which SQL Server version you’re using) and create an empty SSIS project. Next, let’s set up a database connection in the Connection Managers pane, pointing it to the DW_Sample database. The connection type we will be using here is an OLE DB connection; to kick off the connection setup wizard, just right-click inside the Connection Managers pane and select OLE DB from the list provided as per image below.

Conn_manager_setup_HTDM

Although there is only going to be one package doing all the work, for the sake of coherence and aesthetics, I would like to create logical boundaries and group similar tasks together, e.g. all tasks pertaining to populating dimension objects will be encapsulated into one logical group, all tasks responsible for database and objects preparation before the load commences will be grouped together, etc. To achieve this I will first lay out a number of Sequence Containers in the Control Flow pane, which will manage the groupings and provide greater control over task execution. Let’s create four groups of tasks and name them in accordance with how they will be utilised in this package as per image below. I have also joined them with default precedence constraint arrows and arranged them so that they execute in a specific order: prepare database objects, load dimension data, load fact data and, finally, restore foreign key constraints as the last group.

Initial_DW_sample_SSIS_HTDM

Let’s start filling in the gaps with ETL components and the corresponding code from the previous post. We begin by placing two Execute SQL Task (EST) components inside the first Sequence Container, joining them together using default precedence constraints. Next, let’s go through the remaining Sequence Containers and place more Execute SQL Task components inside them as per the following sequence: Populate Dimension Objects Sequence Container – 6 ESTs, Populate Fact Objects Sequence Container – 2 ESTs, Finish ETL Routine Sequence Container – 1 EST. We will also adjust the names of the ESTs to correspond to their function in this SSIS solution. So far the package should look as per image below.

DW_sample_SSIS_control_flow1_HTDM

Now that we have the skeleton of our package, we can start populating individual components with the SQL code we developed in the PREVIOUS POST. Let’s open up the first Execute SQL Task in the Prepare ETL Routine Sequence Container by double-clicking on it and adjust the Connection property by selecting the connection name of our database established earlier. Next, clicking on the ellipsis next to the SQLStatement property brings up the SQL editor where we paste the SQL for each task we want to accomplish. Let’s paste the SQL responsible for dropping foreign key constraints and click OK.

Insert_SQL_code_into_task_HTDM
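
For reference, the general shape of that first statement is sketched below. The exact constraint names were generated when the DW_Sample objects were built, so treat the ones here as assumptions and use the script from the PREVIOUS POST as the authoritative source.

-- A hedged sketch of the 'drop foreign key constraints' step; constraint
-- names are illustrative and may differ from those created in your build.
IF OBJECT_ID('dbo.FK_FactSales_DimStores', 'F') IS NOT NULL
    ALTER TABLE dbo.FactSales DROP CONSTRAINT FK_FactSales_DimStores;
IF OBJECT_ID('dbo.FK_FactTitlesAuthors_DimAuthors', 'F') IS NOT NULL
    ALTER TABLE dbo.FactTitlesAuthors DROP CONSTRAINT FK_FactTitlesAuthors_DimAuthors;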

If you remember the individual code snippets from the PREVIOUS POST, most of those had a number assigned to them for referencing when designing our SSIS package. This is going to make our life easier when looking up the SQL representing the functionality of each SSIS component, so rather than repeating the above process of adjusting individual components’ SQLStatement properties, I have numbered the transformations according to how they relate to the code from the PREVIOUS POST.

DW_sample_SSIS_control_flow2_HTDM

All we need to do now is to match each Execute SQL Task component to the corresponding code snippet from the PREVIOUS POST and repeat the process of selecting the database connection and pasting SQL code into the editor for each task. You can retrieve the SQL from the previous post HERE or download the whole script from HERE. If you choose to download it, just make sure to break it down according to the functionality each section provides and copy and paste only the applicable fragments. Once all tasks have been populated with the code, hopefully all warning icons on individual tasks will have disappeared and we can execute the package by pressing F5 or clicking on the Start Debugging icon in the top pane.

Run_DW_Sample_SSIS_HTDM

You can re-run the package as many times as you wish, however, please note two important things about this solution. Number one, the code section responsible for truncating the DimDates table, as well as the code which populates this dimension, can be disabled once DimDates has been populated. We do not need to repopulate the date dimension more than once as this data is not subject to change and inserting it once is enough (unless more dates need to be added).

Number two, please also note that, as explained in the previous post, one of the dimensions (DimStores) and the two fact tables follow incremental insert/update logic. This means that, in the case of DimStores, once this object is populated, only changed data and new records are permitted to filter through by virtue of how the SQL code was implemented (a sketch of this upsert pattern follows the list below). In a real-world scenario, e.g. in production, there is no need to truncate it every time the package runs. In fact, it may be detrimental to do so as all history of changes will be lost. Likewise, fact tables should not be emptied out, as the core purpose of a data warehouse is to collect historical information and store it in a central repository. In some environments, only inserts are allowed to ensure that all history is preserved. If the OLTP database (the data source) is occasionally ‘trimmed down’ to improve performance, a big chunk of that data would be lost when reloading the data mart, once more invalidating the core principle of a data warehouse – to keep all historical data for analytical purposes. If you wish to follow these principles, you can alter the package and its SQL code (once the package has run at least once and all objects have data in them) in the following manner.

  • Remove the following lines from Execute SQL Task number 2 (truncates database objects):
TRUNCATE TABLE [dbo].[FactSales]
TRUNCATE TABLE [dbo].[FactTitlesAuthors]
TRUNCATE TABLE [dbo].[DimStores]
TRUNCATE TABLE [dbo].[DimDates]
  • Disable (do not delete) Execute SQL Tasks number 3 and 4 (which populate the DimDates table). You can do this by right-clicking on the container and selecting Disable from the options available
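
For context, the incremental logic referenced above follows a standard upsert pattern. A minimal sketch for DimStores is shown below; the source table and column names are assumptions made for illustration only – the actual implementation lives in the PREVIOUS POST’s script.

-- A hedged upsert sketch for DimStores; StoreId/StoreName and the pubs
-- source are assumed names, shown only to illustrate the pattern.
MERGE dbo.DimStores AS tgt
USING (SELECT stor_id, stor_name FROM pubs.dbo.stores) AS src
    ON tgt.StoreId = src.stor_id
WHEN MATCHED AND tgt.StoreName <> src.stor_name
    THEN UPDATE SET tgt.StoreName = src.stor_name
WHEN NOT MATCHED BY TARGET
    THEN INSERT (StoreId, StoreName) VALUES (src.stor_id, src.stor_name);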

The package should now run without issues; the difference this time is that we are not removing anything from the two fact tables and the two dimensions – DimDates and DimStores. We also leave the DimDates table out of the process, as its data does not come from any source system and, once populated, the related tasks can be removed/disabled. The successfully executed package should look as per image below.

Final_DW_sample_SSIS_HTDM

This is one of many ways to use SSIS for data loading. Instead of using INSERT and SELECT SQL statements in every component, you could also use SELECT only and map the attributes produced onto the destination schema by means of a Data Flow Task as per image below.

Data_flow_sample_task_HTDM

You could also use views, stored procedures, user-defined functions, etc. – SSIS is quite flexible in how you pull your data in and I strongly suggest you explore other options. It is also worth highlighting that, due to the simplicity of this design, I only had a chance to use a very limited number of the components and options available in the SSIS development environment. There is a myriad of tools, options and features at your disposal that SSIS provides out of the box, and some transformations can be built with an extreme level of complexity. That is partially what is great about this tool – you can create a simple ETL routine and automate laborious data-centric processes with just a few drag-and-drop moves, but SSIS also provides enough flexibility and power to cater for the most advanced scenarios, e.g. scripting in .NET, database maintenance tasks, data mining processing etc.
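
To illustrate the stored-procedure route mentioned above, the sketch below wraps a single dimension load in a procedure that an Execute SQL Task could call with one EXEC statement. The procedure name, source table and column mappings are hypothetical and not part of the downloadable solution.

-- A hypothetical procedure wrapping the DimAuthors load; the pubs source
-- and the name concatenation are assumptions used purely for illustration.
CREATE PROCEDURE dbo.usp_LoadDimAuthors
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.DimAuthors (AuthorId, AuthorName, AuthorState)
    SELECT a.au_id, a.au_fname + ' ' + a.au_lname, a.state
    FROM pubs.dbo.authors AS a
    WHERE NOT EXISTS (SELECT 1 FROM dbo.DimAuthors AS d WHERE d.AuthorId = a.au_id);
END;

The matching Execute SQL Task would then contain nothing more than EXEC dbo.usp_LoadDimAuthors;.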

This concludes this post and, as usual, the code used in this series as well as the solution files for this SSIS project can be found and downloaded from HERE. In the NEXT ITERATION of this series I will dive into the topic of SQL Server Analysis Services (SSAS), cubes and OLAP databases.

Please also check other posts from this series:

All SQL code and solution files can be found and downloaded from HERE.
