Microsoft Azure SQL Data Warehouse Quick Review and Amazon Redshift Comparison – Part 2

January 15th, 2016 / 4 Comments / by admin

In the first part of this series I briefly explored the key differentiating features that set Microsoft Azure SQL Data Warehouse apart from the likes of AWS Redshift and outlined how to load the Azure DW with sample TICKIT database data. In this final post I will go over some of the findings from comparing query execution speed across the AWS and Azure DW offerings and briefly look at the reporting/analytics status quo.

My test ‘hardware’ set-up was characterized by the following specifications, where, as mentioned in PART ONE, Amazon adopts a somewhat less cloak-and-dagger approach to disclosing what actually runs under the hood.

Azure_SQL_DW_Review_Test_Instances_Specs

Graphing the results, I assigned the following monikers to each environment based on its specifications/price point.

  • Azure SQL DW 100 DWU – Azure Data Warehouse 100 Data Warehouse Units
  • Azure SQL DW 500 DWU – Azure Data Warehouse 500 Data Warehouse Units
  • Azure SQL DW 1000 DWU – Azure Data Warehouse 1000 Data Warehouse Units
  • Azure SQL DW 2000 DWU – Azure Data Warehouse 2000 Data Warehouse Units
  • Redshift 3 x dc1.large – AWS Redshift 3 x large dense compute nodes
  • Redshift 2 x dc1.8xlarge – AWS Redshift 2 x 8xlarge dense compute nodes

Testing Methodology and Configuration Provisions

Each table created on Azure DW also had a clustered columnstore index created on it which, as of December 2015, is the default behaviour when issuing a CREATE TABLE DDL statement unless specified otherwise, i.e. using the HEAP option with the table creation statement. Clustered columnstore indexes provide better performance and data compression than a heap or a rowstore clustered index. Also, each fact table created in Azure DW was distributed by hashing the values of its identity field, e.g. Sales table –> salesid column, Listing table –> listid column, whereas a round-robin algorithm was used to distribute the dimension tables’ data. A hash distributed table is a table whose rows are dispersed across multiple distributions based on a hash function applied to a column. When processing queries involving distributed tables, SQL DW instances execute multiple internal queries, in parallel, within each SQL instance, one per distribution. These separate processes (independent internal SQL queries) are executed to handle different distributions during query and load processing. A round-robin distributed table is a table where the data is distributed evenly (or as evenly as possible) among all the distributions. Buffers containing rows of data are allocated in turn (hence the name round robin) to each distribution. The process is repeated until all data buffers have been allocated. At no stage is the data sorted or ordered in a round-robin distributed table; a round-robin distribution is sometimes called a random hash for this reason.
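As a rough sketch of what the DDL behind these choices might look like (the column data types are assumed from the TICKIT sample schema and the column lists abbreviated, so treat this as illustrative rather than a copy of the actual scripts):

-- Fact table: hash-distributed on its identity column, with a clustered columnstore index (the default)
CREATE TABLE [dbo].[Sales]
(
    salesid     INT NOT NULL,
    listid      INT NOT NULL,
    sellerid    INT NOT NULL,
    buyerid     INT NOT NULL,
    eventid     INT NOT NULL,
    dateid      SMALLINT NOT NULL,
    qtysold     SMALLINT NOT NULL,
    pricepaid   DECIMAL(8,2),
    commission  DECIMAL(8,2),
    saletime    DATETIME
)
WITH
(
    DISTRIBUTION = HASH(salesid),
    CLUSTERED COLUMNSTORE INDEX
);

-- Dimension table: round-robin distributed, i.e. rows allocated evenly across all distributions
CREATE TABLE [dbo].[Venue]
(
    venueid     SMALLINT NOT NULL,
    venuename   VARCHAR(100),
    venuecity   VARCHAR(30),
    venuestate  CHAR(2),
    venueseats  INT
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
);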

Each table on the AWS Redshift database was assigned a distkey and a sortkey, which are a powerful set of tools for optimizing query performance. A table’s distkey is the column on which its rows are distributed to each node; rows with the same value in this column are guaranteed to be on the same node. The goal in selecting a table distribution style is to minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed. A table’s sortkey is the column by which its rows are sorted within each node; the Amazon Redshift query optimizer uses sort order when it determines optimal query plans.

The following allocation was made for each table in the TICKIT database schema on AWS Redshift database.

Azure_SQL_DW_Review_Tickit_Distkeys_and Sortkeys
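Purely as an illustration of the syntax (the concrete column assignments used for the test are the ones listed in the table above), a Redshift table with a distkey and a sortkey is declared along these lines:

CREATE TABLE sales
(
    salesid     INTEGER NOT NULL,
    listid      INTEGER NOT NULL,
    sellerid    INTEGER NOT NULL,
    buyerid     INTEGER NOT NULL,
    eventid     INTEGER NOT NULL,
    dateid      SMALLINT NOT NULL,
    qtysold     SMALLINT NOT NULL,
    pricepaid   DECIMAL(8,2),
    commission  DECIMAL(8,2),
    saletime    TIMESTAMP
)
DISTKEY (listid)    -- rows sharing a listid value end up on the same node
SORTKEY (dateid);   -- rows stored sorted by dateid within each node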

On Azure Data Warehouse, after populating the TICKIT schema and before running all queries, clustered columnstore indexes were created on each table. Also, at a minimum, Microsoft recommends creating single-column statistics on every column of every table, which in the case of SQL Data Warehouse are user-defined, i.e. they need to be created and managed manually. As SQL Data Warehouse does not have a system stored procedure equivalent to SQL Server’s ‘sp_create_stats’, the following stored procedure was used to create a single-column statistics object on every column of the database that didn’t already have one, with the FULLSCAN option passed for @create_type.

CREATE PROCEDURE    [dbo].[prc_sqldw_create_stats]
(   @create_type    tinyint -- 1 default 2 Fullscan 3 Sample
,   @sample_pct     tinyint
)
AS

IF @create_type NOT IN (1,2,3)
BEGIN
    THROW 151000,'Invalid value for @stats_type parameter. Valid range 1 (default), 2 (fullscan) or 3 (sample).',1;
END;

IF @sample_pct IS NULL
BEGIN;
    SET @sample_pct = 20;
END;

IF OBJECT_ID('tempdb..#stats_ddl') IS NOT NULL
BEGIN;
    DROP TABLE #stats_ddl;
END;

CREATE TABLE #stats_ddl
WITH    (   DISTRIBUTION    = HASH([seq_nmbr])
        ,   LOCATION        = USER_DB
        )
AS
WITH T
AS
(
SELECT      t.[name]                        AS [table_name]
,           s.[name]                        AS [table_schema_name]
,           c.[name]                        AS [column_name]
,           c.[column_id]                   AS [column_id]
,           t.[object_id]                   AS [object_id]
,           ROW_NUMBER()
            OVER(ORDER BY (SELECT NULL))    AS [seq_nmbr]
FROM        sys.[tables] t
JOIN        sys.[schemas] s         ON  t.[schema_id]       = s.[schema_id]
JOIN        sys.[columns] c         ON  t.[object_id]       = c.[object_id]
LEFT JOIN   sys.[stats_columns] l   ON  l.[object_id]       = c.[object_id]
                                    AND l.[column_id]       = c.[column_id]
                                    AND l.[stats_column_id] = 1
LEFT JOIN   sys.[external_tables] e ON  e.[object_id]       = t.[object_id]
WHERE       l.[object_id] IS NULL
AND         e.[object_id] IS NULL -- not an external table
)
SELECT  [table_schema_name]
,       [table_name]
,       [column_name]
,       [column_id]
,       [object_id]
,       [seq_nmbr]
,       CASE @create_type
        WHEN 1
        THEN    CAST('CREATE STATISTICS '+QUOTENAME('stat_'+table_schema_name+ '_' + table_name + '_'+column_name)+' ON '+QUOTENAME(table_schema_name)+'.'+QUOTENAME(table_name)+'('+QUOTENAME(column_name)+')' AS VARCHAR(8000))
        WHEN 2
        THEN    CAST('CREATE STATISTICS '+QUOTENAME('stat_'+table_schema_name+ '_' + table_name + '_'+column_name)+' ON '+QUOTENAME(table_schema_name)+'.'+QUOTENAME(table_name)+'('+QUOTENAME(column_name)+') WITH FULLSCAN' AS VARCHAR(8000))
        WHEN 3
        THEN    CAST('CREATE STATISTICS '+QUOTENAME('stat_'+table_schema_name+ '_' + table_name + '_'+column_name)+' ON '+QUOTENAME(table_schema_name)+'.'+QUOTENAME(table_name)+'('+QUOTENAME(column_name)+') WITH SAMPLE '+CAST(@sample_pct AS VARCHAR(3))+' PERCENT' AS VARCHAR(8000))
        END AS create_stat_ddl
FROM T
;

DECLARE @i INT              = 1
,       @t INT              = (SELECT COUNT(*) FROM #stats_ddl)
,       @s NVARCHAR(4000)   = N''
;

WHILE @i <= @t
BEGIN
    SET @s=(SELECT create_stat_ddl FROM #stats_ddl WHERE seq_nmbr = @i);

    PRINT @s
    EXEC sp_executesql @s
    SET @i+=1;
END

DROP TABLE #stats_ddl;
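The procedure was then executed with the FULLSCAN option, i.e. a call along the following lines (the @sample_pct argument only matters for the SAMPLE variant):

-- Create single-column statistics with a full scan on every column that does not have them yet
EXEC [dbo].[prc_sqldw_create_stats] @create_type = 2, @sample_pct = NULL;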

The AWS Redshift database, on the other hand, was issued a VACUUM REINDEX command for each table in the TICKIT database in order to analyse the distribution of the values in interleaved sort key columns, reclaim disk space and re-sort all rows. Amazon recommends running the ANALYZE command alongside the VACUUM statement whenever a significant number of rows has been added, deleted or modified, in order to update the statistics metadata; however, given that the COPY command automatically updates statistics after loading an empty table, the statistics should already have been up-to-date.
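For reference, both maintenance commands are issued per table and look as simple as this (using the sales table as an example):

-- Re-analyse the interleaved sort key distribution, reclaim space and re-sort the rows
VACUUM REINDEX sales;
-- Refresh the planner's statistics (normally redundant straight after a COPY into an empty table)
ANALYZE sales;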

Disclaimer

Please note that beyond the measures outlined above, no tuning, tweaking or any other optimization techniques were employed when making this comparison; therefore, the results are only indicative of the performance level in the context of the same data and configuration used. Also, all tests were run when Microsoft Azure SQL Data Warehouse was still in preview, so it is possible that most of the service’s performance deficiencies, or alternatively its performance-to-price ratio, have since been rectified/improved, thus altering the value proposition if implemented today. Finally, the queries executed in this comparison do not reflect any specific business questions; all SQL statements have a fairly arbitrary structure and were assembled only to test the computational performance of each service. As such, they do not represent typical business cases and can only be viewed as a rudimentary performance testing tool.

Test Results

When comparing execution speeds on the same data between these two services, only similar instance classes should be stacked up against each other (typically based on hardware specifications). Given the sometimes ambiguous and deceptive nature of the cloud and its providers’ statements regarding the actual hardware selection/implementation, in this case it is only fair to match them based on the actual service price points. As such, when analysing the results, I excluded the Azure SQL Data Warehouse 500 DWU and Azure SQL Data Warehouse 2000 DWU instances (displayed for reference only) and focused on the AWS Redshift instances and the similarly priced Azure SQL Data Warehouse 100 DWU and Azure SQL Data Warehouse 1000 DWU instances. Also, storage and network transfer fees were not taken into consideration when matching up the price points as these were negligible for the amount of data used in my case.

All SQL queries run to collect execution times (expressed in seconds) as well as graphical representation of their duration are as per below.

--QUERY 1
SELECT  COUNT(*)
FROM    ( SELECT    COUNT(*) AS counts ,
                    salesid ,
                    listid ,
                    sellerid ,
                    buyerid ,
                    eventid ,
                    dateid ,
                    qtysold ,
                    pricepaid ,
                    commission ,
                    saletime
          FROM      sales
          GROUP BY  salesid ,
                    listid ,
                    sellerid ,
                    buyerid ,
                    eventid ,
                    dateid ,
                    qtysold ,
                    pricepaid ,
                    commission ,
                    saletime
        ) sub;

Azure_SQL_DW_Review_Test_SQL1

--QUERY 2
WITH    cte
          AS ( SELECT   SUM(s.pricepaid) AS pricepaid ,
                        sub.* ,
                        u1.email ,
                        DENSE_RANK() OVER ( PARTITION BY sub.totalprice ORDER BY d.caldate DESC ) AS daterank
               FROM     sales s
                        JOIN ( SELECT   l.totalprice ,
                                        l.listid ,
                                        v.venuecity ,
                                        c.catdesc
                               FROM     listing l
                                        LEFT JOIN event e ON l.eventid = e.eventid
                                        INNER JOIN category c ON e.catid = c.catid
                                        INNER JOIN venue v ON e.venueid = v.venueid
                               GROUP BY v.venuecity ,
                                        c.catdesc ,
                                        l.listid ,
                                        l.totalprice
                             ) sub ON s.listid = sub.listid
                        LEFT JOIN users u1 ON s.buyerid = u1.userid
                        LEFT JOIN Date d ON d.dateid = s.dateid
               GROUP BY sub.totalprice ,
                        sub.listid ,
                        sub.venuecity ,
                        sub.catdesc ,
                        u1.email ,
                        d.caldate
             )
    SELECT  COUNT(*)
    FROM    cte;

Azure_SQL_DW_Review_Test_SQL2

--QUERY 3
SELECT  COUNT(*)
FROM    ( SELECT    s.* ,
                    u.*
          FROM      sales s
                    LEFT JOIN event e ON s.eventid = e.eventid
                    LEFT JOIN date d ON d.dateid = s.dateid
                    LEFT JOIN users u ON u.userid = s.buyerid
                    LEFT JOIN listing l ON l.listid = s.listid
                    LEFT JOIN category c ON c.catid = e.catid
                    LEFT JOIN venue v ON v.venueid = e.venueid
          INTERSECT
          SELECT    s.* ,
                    u.*
          FROM      sales s
                    LEFT JOIN event e ON s.eventid = e.eventid
                    LEFT JOIN date d ON d.dateid = s.dateid
                    LEFT JOIN users u ON u.userid = s.buyerid
                    LEFT JOIN listing l ON l.listid = s.listid
                    LEFT JOIN category c ON c.catid = e.catid
                    LEFT JOIN venue v ON v.venueid = e.venueid
          WHERE     d.holiday = 0
        ) sub;

Azure_SQL_DW_Review_Test_SQL3

--QUERY 4
SELECT  COUNT(*)
FROM    ( SELECT    SUM(sub1.pricepaid) AS price_paid ,
                    sub1.caldate AS date ,
                    sub1.username
          FROM      ( SELECT    s.pricepaid ,
                                d.caldate ,
                                u.username
                      FROM      Sales s
                                LEFT JOIN date d ON s.dateid = d.dateid ,
                                LEFT JOIN users u ON s.buyerid = u.userid
                      WHERE     s.saletime BETWEEN '2008-12-01' AND '2008-12-31'
                    ) sub1
          GROUP BY  sub1.caldate ,
                    sub1.username
        ) AS sub2;

Azure_SQL_DW_Review_Test_SQL4

--QUERY 5
WITH    CTE1 ( c1 )
          AS ( SELECT   COUNT(1) AS c1
               FROM     Sales s
               WHERE    saletime BETWEEN '2008-12-01' AND '2008-12-31'
               GROUP BY salesid
               HAVING   COUNT(*) = 1
             ),
        CTE2 ( c2 )
          AS ( SELECT   COUNT(1) AS c2
               FROM     Listing e
               WHERE    listtime BETWEEN '2008-12-01' AND '2008-12-31'
               GROUP BY listid
               HAVING   COUNT(*) = 1
             ),
        CTE3 ( c3 )
          AS ( SELECT   COUNT(1) AS c3
               FROM     Date d
               GROUP BY dateid
               HAVING   COUNT(*) = 1
             )
    SELECT  COUNT(*)
    FROM    CTE1
            RIGHT JOIN CTE2 ON CTE1.c1 <> CTE2.c2
            RIGHT JOIN CTE3 ON CTE3.c3 = CTE2.c2
                               AND CTE3.c3 <> CTE1.c1
    GROUP BY CTE1.c1
    HAVING  COUNT(*) > 1;

Azure_SQL_DW_Review_Test_SQL5

--QUERY 6
WITH    cte
          AS ( SELECT   s1.* ,
                        COALESCE(u.likesports, u.liketheatre, u.likeconcerts,
                                 u.likejazz, u.likeclassical, u.likeopera,
                                 u.likerock, u.likevegas, u.likebroadway,
                                 u.likemusicals) AS interests
               FROM     sales s1
                        LEFT JOIN users u ON s1.buyerid = u.userid
               WHERE    EXISTS ( SELECT s2.*
                                 FROM   sales s2
                                        LEFT JOIN event e ON s2.eventid = e.eventid
                                        LEFT JOIN date d ON d.dateid = s2.dateid
                                        LEFT JOIN listing l ON l.listid = s2.listid
                                                              AND s1.salesid = s2.salesid )
                        AND NOT EXISTS ( SELECT s3.*
                                         FROM   sales s3
                                                LEFT JOIN event e ON s3.eventid = e.eventid
                                                LEFT JOIN category c ON c.catid = e.catid
                                         WHERE  c.catdesc = 'National Basketball Association'
                                                AND s1.salesid = s3.salesid )
             )
    SELECT  COUNT(*)
    FROM    cte
    WHERE   cte.dateid IN (
            SELECT  dateid
            FROM    ( SELECT    ss.dateid ,
                                ss.buyerid ,
                                RANK() OVER ( PARTITION BY ss.dateid ORDER BY ss.saletime DESC ) AS daterank
                      FROM      sales ss
                    ) sub
            WHERE   sub.daterank = 1 );

Azure_SQL_DW_Review_Test_SQL6

--QUERY 7
WITH    Temp ( sum_pricepaid, sum_commission, sum_qtysold, eventname, caldate, username, Row1 )
          AS ( SELECT   SUM(s.pricepaid) AS sum_pricepaid ,
                        SUM(s.commission) AS sum_commission ,
                        SUM(s.qtysold) AS sum_qtysold ,
                        e.eventname ,
                        d.caldate ,
                        u.username ,
                        ROW_NUMBER() OVER ( PARTITION BY e.eventname ORDER BY e.starttime DESC ) AS Row1
               FROM     Sales s
                        JOIN Event e ON s.eventid = e.eventid
                        JOIN Date d ON s.dateid = d.dateid
                        JOIN Listing l ON l.listid = s.listid
                        JOIN Users u ON l.sellerid = u.userid
               WHERE    d.holiday <> 1
               GROUP BY e.eventname ,
                        d.caldate ,
                        e.starttime ,
                        u.username
             )
    SELECT  COUNT(*)
    FROM    Temp
    WHERE   Row1 = 1;

Azure_SQL_DW_Review_Test_SQL7

--QUERY 8
SELECT  COUNT(*)
FROM    ( SELECT    a.pricepaid ,
                    a.eventname ,
                    NTILE(10) OVER ( ORDER BY a.pricepaid DESC ) AS ntile_pricepaid ,
                    a.qtysold ,
                    NTILE(10) OVER ( ORDER BY a.qtysold DESC ) AS ntile_qtsold ,
                    a.commission ,
                    NTILE(10) OVER ( ORDER BY a.commission DESC ) AS ntile_commission
          FROM      ( SELECT    SUM(pricepaid) AS pricepaid ,
                                SUM(qtysold) AS qtysold ,
                                SUM(commission) AS commission ,
                                e.eventname
                      FROM      Sales s
                                JOIN Date d ON s.dateid = d.dateid
                                JOIN Event e ON s.eventid = e.eventid
                                JOIN Listing l ON l.listid = s.listid
                      WHERE     d.caldate > l.listtime
                      GROUP BY  e.eventname
                    ) a
        ) z;

Azure_SQL_DW_Review_Test_SQL8

Even taking Azure SQL DW 500 DWU and Azure SQL DW 2000 DWU out of the equation, at face value it is easy to conclude that Amazon Redshift’s MPP performed much better across most queries. With the exception of query 8, where Azure SQL DW came out performing a lot better than Redshift, Azure SQL DW took a lot longer to compute the results, in some instances 40 times slower than the comparably priced Redshift instance, e.g. query 2. The very last query’s results, however, show Azure SQL DW performing considerably better than Redshift, with Amazon’s service also having trouble executing query 6 (hence the lack of results for this particular execution) and indicating the lack of support for the sub-query used by throwing the following error message.

Azure_SQL_DW_Review_Redshift_Failed_SQL

This only goes to show that the performance achieved from these two cloud MPP databases heavily depends on the type of workload performed, with some queries favouring the AWS offering more than Azure and vice versa. Also, it is worth noting that for typical business scenarios, where standard queries issued by analysts mostly comprise some level of data aggregation across a few schema objects, e.g. building a summary report, the performance difference was marginal. Interestingly, query number 5 performed identically irrespective of the number of DWUs allocated to the instance. The majority of the time taken to compute the output was spent on a ‘ReturnOperation’, as shown by the optimizer, with the following query issued for this step.

Azure_SQL_DW_Review_SQL5_ReturnOperation_Step
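For anyone wanting to trace this kind of breakdown themselves, the distributed steps behind a given request can be inspected through the SQL DW (PDW) dynamic management views; a sketch is below, where the request_id value is just a placeholder:

-- Find the request of interest (most recent requests first)
SELECT TOP 10 request_id, command, total_elapsed_time, status
FROM   sys.dm_pdw_exec_requests
ORDER BY end_time DESC;

-- List the individual distributed steps (e.g. ReturnOperation) and their timings for that request
SELECT step_index, operation_type, location_type, total_elapsed_time, row_count
FROM   sys.dm_pdw_request_steps
WHERE  request_id = 'QID1234'
ORDER BY step_index;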

This leads me to another observation – the performance increase between the 100, 500, 1000 and 2000 DWU increments wasn’t as linear as I expected across all queries (average execution speed). With the exception of query 5, which looks like an outlier, the most significant improvements were registered when moving from 100 DWU to 500 DWU (which roughly corresponded to the price increments). Going above 500 DWU, however, yielded less noticeable speed increases for further changes to the instance scale.

Azure_SQL_DW_Review_Avg_Exec_Times_All_Queries

Data Connectivity and Visualization

Figuring out how to best store the data underpinning business transactions is only half the battle; making use of it through reporting and analytics is the ultimate goal of data collection. Using the same test data set, I briefly tried Microsoft’s Power BI Desktop and Tableau to create simple visualizations of the sales data stored in the fictitious TICKIT database. The Microsoft SQL DW Azure portal also offers a direct access link to the Power BI online analytics service (see image below), with Azure SQL Data Warehouse listed among the supported data sources, but I opted for a local install of Power BI Desktop instead for my quick analysis.

Azure_SQL_DW_Review_Azure_Portal_PowerBI_Link

Power BI Desktop offers an Azure SQL DW-specific connection as per the image below, but specifying a standard SQL Server instance (Azure or on-premises) also yielded a valid connection.

Azure_SQL_DW_Review_PowerBI_SQLDW_Conn_Selection

Building a sample ‘Sales Analysis’ report was a very straightforward affair but, most importantly, the execution speed when dragging measures and dimensions onto the report development pane was very fast in DirectQuery mode, even when the instance was configured with a minimal number of DWUs. Once all object relationships and their cardinality were configured, I could put together a simple report or visualize my data without being distracted by query performance lag or excessive wait times.

In the case of Tableau, you can connect to Azure SQL Data Warehouse simply by selecting Microsoft SQL Server from the available server options; there is no dedicated Azure SQL DW connection type. Performance-wise, Tableau executed all aggregation and sorting queries with minimal lag and I could easily perform basic exploratory data analysis on my test data set with little to no delay.

Below is a short example video clip depicting the process of connecting to Azure SQL DW (running on the minimal configuration from a performance standpoint, i.e. 100 DWU) and creating a few simple graphs aggregating sales data. You can immediately notice that, in spite of the minimal resource allocation, the queries run quite fast (the main table, dbo.Sales, comprises over 50,000,000 records) with little intermission between individual aggregation executions. Query execution speed improved considerably when the instance was scaled out to 500 DWU, with the same or similar queries recording sub-second execution times.

Conclusion

With the advent of big data and the ever-increasing need for more computational power and bigger storage, cloud deployments are poised to become the de facto standard for any data volume, velocity and variety-intensive workloads. Microsoft has already established itself as a formidable database software vendor, with its SQL Server platform gradually taking the lead away from its chief competitor in this space – Oracle (at least if you believe the 2015 Gartner Magic Quadrant for operational DBMSs) – and has had a robust and service-rich cloud (Azure) database strategy in place for some time. Yet it was Amazon, with its Redshift cloud MPP offering, that saw the window of opportunity to fill the market void for a cheap and scalable data warehouse solution, something that Microsoft had not offered in its Azure ecosystem until now.

On the whole, Microsoft’s first foray into cloud-enabling their MPP architecture looks promising. As stated before, this product is still in preview so chances are there are still a ton of quirks and imperfections to iron out; however, I generally like the features, flexibility and non-byzantine approach to scaling that it offers. I like the fact that there are differentiating factors that set it apart from other vendors, e.g. the ability to query seamlessly across both relational data in a relational database and non-relational data in common Hadoop formats (PolyBase), the hybrid architecture spanning on-premises and cloud and, most of all, the ability to pause, grow and shrink the data warehouse in a matter of seconds. The fact that, for example, Azure SQL DW can integrate with an existing on-premises infrastructure and allow for a truly hybrid approach should open up new options for streamlining, integration and cost-saving across the data storage and analytics stack, e.g. rather than provisioning a secondary on-premises APS (Analytics Platform System) for disaster recovery and business continuity best practices (quite an expensive exercise), one could simply utilize Azure SQL DW as a drop-in replacement. Given that storage and compute have been decoupled, Azure SQL DW could operate mostly in a paused state, allowing for considerable savings yet fulfilling the need for a robust, on-demand back-up system in case of primary appliance failure, as per the diagram below (click on image to expand).

Azure_SQL_DW_Review_DRandBC_Scenario

In terms of performance/cost, I think that once the service comes out of preview I would expect either the performance to improve or the price to be lowered to make it truly competitive with what’s on offer from AWS. It’s still early days and very much a work in progress, so I’m looking forward to seeing what this service can provide in a production-ready release, but Azure SQL DW looks like a solid attempt at disrupting the dominance of Amazon’s Redshift, the fastest growing service in AWS’s ever-expanding arsenal, in the cloud data warehouse space.


Microsoft Azure SQL Data Warehouse Quick Review and Amazon Redshift Comparison – Part 1

January 15th, 2016 / No Comments / by admin

Note: Part 2 can be found HERE and code files downloaded from HERE

Introduction

Last time I wrote about cloud-enabled data warehousing (see the 3-part series starting HERE), only one vendor and product qualified as a truly enterprise-grade PaaS offering – Amazon Redshift running on AWS. Yes, there were other providers coming out of the woodwork, boasting their readiness to take on this juggernaut of cloud computing, e.g. Bityota (since acquired by LifeLock) or SnowflakeDB (headed by former Microsoft executive Bob Muglia), but none of them made it past the TechCrunch mention and deep into CxOs’ wallets to capitalize on the data boom. Microsoft has been touting its Azure Data Warehouse as a formidable competitor to Redshift since mid-2015, with a host of features that may appeal to spending-cautious departments making their first foray into the cloud. In addition to competing on price, Microsoft also strives to differentiate itself with several features that Redshift does not offer. Some distinctive characteristics that separate Azure Data Warehouse from Redshift are:

  • Wide support of SQL and integration with other services – The SQL Data Warehouse extends the T-SQL constructs most developers are already familiar with to create indexes, partitions and stored procedures, which allow for an easy migration to the cloud. With native integration with Azure Data Factory, Azure Machine Learning and Power BI, Microsoft claims that customers are able to quickly ingest data, utilize learning algorithms, and visualize data born either in the cloud or on-premises
  • Separating compute and storage with ability to pause an instance – Azure SQL Data Warehouse independently scales compute and storage so customers only pay for the query performance they need. Dynamic pause enables businesses to optimize the utilization of the compute infrastructure by ramping down compute while persisting the data with no need to back up the data, delete the existing cluster, and, upon resume, generate a new cluster and restore data
  • PolyBase for structured and unstructured data blending – SQL Data Warehouse can query unstructured and semi-structured data stored in Azure Storage, Hortonworks Data Platform, or Cloudera using familiar T-SQL skills
  • Hybrid infrastructure supporting deployments on-premises and/or in the cloud

At its core, SQL Data Warehouse uses Microsoft’s massive parallel processing (MPP) architecture, originally designed to run some of the largest on-premises enterprise data warehouses. This architecture takes advantage of built-in data warehousing performance improvements and also allows SQL Data Warehouse to easily scale-out and parallelize computation of complex SQL queries. In addition, SQL Data Warehouse’s architecture is designed to take advantage of its presence in Azure. Combining these two aspects, the architecture breaks up into 4 key components:

  • Control node: You connect to the Control node when using SQL Data Warehouse with any development, loading, or business intelligence tools. In SQL Data Warehouse, the Control node is a SQL Database, and connecting to it looks and feels like connecting to a standard SQL Database. However, under the surface, it coordinates all of the data movement and computation that takes place in the system. When a command is issued to the Control node, it breaks it down into a set of queries that will be passed on to the compute nodes of the service
  • Compute Nodes: Like the Control node, the compute nodes of SQL Data Warehouse are powered by SQL Databases. Their job is to serve as the compute power of the service. Behind the scenes, any time data is loaded into SQL Data Warehouse, it is distributed across the nodes of the service. Then, any time the Control node receives a command, it breaks it into pieces for each compute node, and the compute nodes operate over their corresponding data. After completing their computation, compute nodes pass partial results to the Control node, which then aggregates results before returning an answer
  • Storage: All storage for SQL Data Warehouse is standard Azure Storage Blobs. This means that when interacting with data, compute nodes write and read directly to/from Blobs. Azure Storage’s ability to expand transparently and nearly limitlessly allows SQL Data Warehouse to scale storage automatically, and to do so separately from compute. Azure Storage also allows storage to persist while the warehouse is scaled or paused, streamlines the back-up and restore process, and provides safer, more fault-tolerant storage
  • Data Movement Services: The final piece holding everything together in SQL Data Warehouse is the Data Movement Service. It allows the Control node to communicate and pass data to all of the compute nodes. It also enables the compute nodes to pass data between each other, which gives them access to data on other compute nodes and allows them to get the data they need to complete joins and aggregations

Azure_SQL_DW_Review_SQLDW_Architecture_Overview

Microsoft provides a good intro to Azure Data Warehouse, which outlines all of the above features and more, in the video below.

All features and fancy bells and whistles aside, a typical business would also be interested in what sort of performance per dollar Microsoft’s offering brings to the table, as well as the level of integration the product can provide. As indicated in my previous post on AWS Redshift, building an enterprise data warehouse is a very costly and time-intensive activity regardless of which vendor or technology one chooses to pursue. In most, if not all, cases performance considerations constitute only a fraction of the overall project plan, so performance statistics alone should not become the sole merit of the technology choice, especially when coupled with a paradigm-shifting data warehouse architecture like cloud computing. With that in mind, query execution speed is one of the most basic and fastest ways of testing a vendor’s offering, so in this post I will provide a rudimentary analysis of query execution speed based on Amazon’s TICKIT database, which I used for my Redshift review a while ago, and try to compare and contrast Azure Data Warehouse (preview edition) with the AWS equivalent based on predefined price/performance points. I would also like to briefly touch on Azure Data Warehouse integration with other vendors/partners, mainly from an analytical and reporting angle, e.g. connecting from and analyzing the data in Tableau and Microsoft Power BI.

Pricing Overview

In the case of Azure Data Warehouse, compute and storage are billed separately. Compute usage is expressed in DWUs (Data Warehouse Units) and customers can scale the level of performance/DWUs they need up and down in 100 DWU blocks. The preview price for compute is USD 1.03/hr (~$766/mo), capped at 2000 DWU. When it comes to storage, Azure distinguishes between four types – Block Blobs, Page Blobs and Disks, Tables and Queues, and Files – where the total cost depends on how much you store, the volume of storage transactions and outbound data transfers, and which data redundancy option you choose. For Azure Data Warehouse, storage rates are based on standard RA-GRS Page Blob rates and cost around $0.1529 per GB (first 1 TB / month).
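Scaling the compute allocation is a one-liner in T-SQL run against the logical server’s master database; a sketch below assumes the warehouse is named TestDW (pausing and resuming compute is done through the Azure portal or PowerShell rather than T-SQL):

-- Scale the TestDW warehouse to 500 DWU
ALTER DATABASE TestDW MODIFY (SERVICE_OBJECTIVE = 'DW500');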

Microsoft is still quite clandestine about the actual hardware used to power Azure Data Warehouse, presumably due to the fact that it is still in preview. The DWU – the unit of performance/cost used to scale processing capabilities – is pretty much the only variable one can adjust to increase/decrease performance/price, which is not necessarily a bad thing – rather than choosing from a myriad of hardware configuration options I’d much rather select a unit of computational capability tied to a specific cost corresponding to the prospective workload. Quoting Microsoft here, ‘A DWU represents the power of the query performance and is quantified by workload objectives: how fast rows are scanned, loaded, and copied’. A bit cloak-and-dagger but simple and straightforward at the same time. Having said that, when attending one of Microsoft’s sessions at the Ignite conference in Gold Coast, Australia, a Microsoft representative reluctantly stated that the closest hardware equivalent of 100 DWU would be roughly one compute unit, i.e. 6 cores and 48GB of memory. Again, this is at the time of publishing this post, so these provisions will most likely scale up as updated versions get released. Besides, I can only see abstracting physical hardware capacity as a good thing – all I really want to do is to be as declarative as possible and tell the vendor how high (and how economically) I want them to jump, and they do it, without having to worry about how it all ‘hangs together’ at the back end. Isn’t that the beauty of cloud computing after all? The image below is the only thing I could find that provides a rough description of how DWU selection affects performance levels.

Azure_SQL_DW_Review_DTU_Measure

Amazon, on the other hand, provides a very high-level outline of some of the key features tied to each instance class, e.g. I/O speed for the direct-attached storage or memory size. The table below outlines the ‘testbed’ configurations I used to measure SQL execution performance on both services.

Azure_SQL_DW_Review_Test_Instances_Specs

Provisioning both services could not be any simpler; in both vendors’ cases it’s just a matter of selecting the desired computing capacity with its associated cost, deciding on the naming conventions for the designated servers and databases, and finally configuring the firewall to allow client access. In the case of Azure I stumbled across a small issue where the instance deployment ‘got stuck’ on the ‘In Progress’ status for over 19 hours, with no details provided by the portal on why the deployment was taking so long (see image below). Selecting a different region (the initial region allocated for the Azure DW deployment was Australia Southeast) seemed to rectify the problem, but I wish Azure provided more information when roadblocks like this one occur, other than just a progress bar and a sparse message.

Azure_SQL_DW_Review_Extended_Provisioning_Time

Test Data and Load

So what is Azure Data Warehouse performance like in comparison to the Amazon Redshift equivalent? To answer this question, I first needed to expand on the data provided by Amazon (the TICKIT database) and copy it across to both services. This sample database helps analysts track sales activity for the fictional TICKIT web site, where users buy and sell tickets online for sporting events, shows, and concerts. In particular, analysts can identify ticket movement over time, success rates for sellers, and the best-selling events, venues, and seasons. Analysts can use this information to provide incentives to buyers and sellers who frequent the site, to attract new users, and to drive advertising and promotions. The TICKIT database consists of seven tables: two fact tables and five dimensions, as per the schema below.

Azure_SQL_DW_Review_TICKIT_db_schema_ERD

Given that there was not nearly enough data in the flat files provided (you can download the text files used for this exercise from HERE), I needed a way to ‘grow’ the main fact table dataset to a more respectable size. The script I used for this purpose (saved in my OneDrive folder HERE) simply creates the table schemas, inserts the flat file data into the newly created database and ‘stretches out’ the main fact table to 50,000,000 rows. The last section also replaces the original sales comma-delimited file with a series of files generated after the sales table data expansion, each holding just over 5,000,000 records. This is to facilitate the subsequent data import, where smaller files are more ‘predictable’ and easier to manage when loading over the network. These 10 files containing sales data, in conjunction with the remaining dimension data files, will form the basis for my quick query execution analysis.
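The full script lives in the linked folder; the general idea behind the ‘stretching’ step, though, is nothing more exotic than multiplying the original rows through a cross join with a numbers set while offsetting the surrogate key, roughly along these lines (a sketch only, with the multiplier and key offset values assumed):

-- Sketch only: multiply the existing Sales rows ~300 times, keeping salesid unique per copy
INSERT INTO dbo.Sales (salesid, listid, sellerid, buyerid, eventid, dateid, qtysold, pricepaid, commission, saletime)
SELECT  s.salesid + (n.n * 200000),   -- assumed key offset; the original max salesid sits below 200,000
        s.listid, s.sellerid, s.buyerid, s.eventid, s.dateid,
        s.qtysold, s.pricepaid, s.commission, s.saletime
FROM    dbo.Sales s
CROSS JOIN (SELECT TOP 300 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
            FROM sys.all_objects a CROSS JOIN sys.all_objects b) n;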

Loading data into both services is a straightforward affair. In the case of Redshift I used the same method as last time (see my previous blog post HERE), i.e. Amazon’s recommended COPY command to move the text files across from an S3 bucket into Redshift. There are other ways to do it, but since this approach worked really well for me last time I decided to stick with the tried and tested methodology.
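For reference, a COPY invocation for the pipe-delimited sales files looks along these lines (the bucket name and credentials below are placeholders):

-- Load all pipe-delimited NewSales files under the given S3 prefix into the sales table
COPY sales
FROM 's3://<my-tickit-bucket>/NewSales'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
DELIMITER '|';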

SQL Data Warehouse presents numerous options for loading data including PolyBase, Azure Data Factory, BCP command-line utility, SQL Server Integration Services (SSIS) as well as 3rd party data loading tools.  Each method comes with its own pros and cons. Below are the slide deck shots from the Gold Coast Microsoft Ignite conference presentation roughly outlining key advantages/disadvantages and features characterizing those approaches.

Azure_SQL_DW_Review_Data_Load_Technique_1

Azure_SQL_DW_Review_Data_Load_Technique_2

Azure_SQL_DW_Review_Data_Load_Technique_3

The following table details the results of four separate Azure SQL Data Warehouse load tests using PolyBase, BCP, SQLBulkCopy/ADF and SSIS.

Azure_SQL_DW_Review_Data_Load_Performance

As you can see, the PolyBase method shows a significantly higher throughput rate compared to BCP, SQLBulkCopy, and SSIS Control-node client gated load methods. If PolyBase is not an option, however, BCP provides the next best load rate. Regarding loads that improved based on concurrent load (the third row in the chart), keep in mind that SQL Data Warehouse supports up to 32 concurrent queries (loads).
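Although I ended up using BCP for this exercise (more on that below), for completeness a PolyBase load boils down to defining an external data source, a file format and an external table over the files in Blob storage, then pulling the data in with a CTAS statement. A minimal sketch, with all names and storage locations being placeholders, could look like this:

-- One-off security objects for accessing the storage account
CREATE MASTER KEY;

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage-account-access-key>';

-- External data source pointing at the Blob storage container holding the TICKIT text files
CREATE EXTERNAL DATA SOURCE AzureBlobStorage
WITH ( TYPE = HADOOP,
       LOCATION = 'wasbs://tickit@<storage-account>.blob.core.windows.net',
       CREDENTIAL = AzureStorageCredential );

-- Pipe-delimited text format matching the exported files
CREATE EXTERNAL FILE FORMAT PipeDelimitedText
WITH ( FORMAT_TYPE = DELIMITEDTEXT,
       FORMAT_OPTIONS ( FIELD_TERMINATOR = '|' ) );

-- External table over the sales files
CREATE EXTERNAL TABLE dbo.SalesExternal
(   salesid INT, listid INT, sellerid INT, buyerid INT, eventid INT,
    dateid SMALLINT, qtysold SMALLINT, pricepaid DECIMAL(8,2),
    commission DECIMAL(8,2), saletime DATETIME )
WITH ( LOCATION = '/sales/', DATA_SOURCE = AzureBlobStorage, FILE_FORMAT = PipeDelimitedText );

-- CTAS loads the data in parallel straight into a hash-distributed columnstore table
CREATE TABLE dbo.Sales
WITH ( DISTRIBUTION = HASH(salesid), CLUSTERED COLUMNSTORE INDEX )
AS SELECT * FROM dbo.SalesExternal;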

For this demo, BCP looked like the most straightforward way to populate the TICKIT database schema, so I settled for the BCP utility. To partially automate BCP execution for each table I put together a simple batch file script (see code below) which, apart from loading the data from flat files, also creates the TestDW database schema via a range of DDL statements sourced from a SQL file, checks for record count discrepancies, and finally creates clustered columnstore indexes and statistics on all tables. If using it, just ensure that the parameters in the ‘User Defined Variables’ section have been populated with values corresponding to your environment and, as always, this code, along with the other scripts, can be downloaded from my OneDrive folder HERE.

REM TICKIT Database Objects Definition, Data Load and Stats Refresh Job

@echo off

REM ============================
REM Start User Defined Variables
REM ============================

set server=
set user=
set password=
set database=

REM ==========================
REM End User Defined Variables
REM ==========================

set tables=Users, Venue, Category, Date, Event, Listing, Sales
set views=vstats_columns
set storedprocs=prc_sqldw_create_stats
set ddl=tickit_ddl.sql
set validation=tickit_check.sql
set idxs_and_statistics=tickit_create_idxs_and_statistics.sql
set schema=dbo
set p1=.
set logs=%p1%\logs
set load="C:\Program Files\Microsoft SQL Server\110\Tools\Binn\bcp.exe"
set mode=reload
set login_timeout=120

mkdir %logs% 2> nul
del %logs%\*.log 2> nul

if "%server%"=="" (
    echo %date% %time% Server needs to be specified.
    goto eof
)

if "%user%"=="" (
    echo %date% %time% User needs to be specified.
    goto eof
)

if "%password%"=="" (
    echo %date% %time% Password needs to be specified.
    goto eof
)

if "%database%"=="" (
    echo %date% %time% Database needs to be specified.
    goto eof
)

if not exist %load% (
    echo %date% %time% Bcp must be installed.
    goto eof
)

echo %date% %time% Dropping Existing TICKIT Tables, Views and Stored Procedures

for %%t in (%tables%) do (
    sqlcmd -S "%server%" -U %user% -P %password% -d %database% -I -l %login_timeout% -e -Q "IF EXISTS (SELECT NULL FROM sys.tables WHERE name = '%%t') DROP TABLE %%t" >> %logs%\drop_tables.log
)

for %%v in (%views%) do (
    sqlcmd -S "%server%" -U %user% -P %password% -d %database% -I -l %login_timeout% -e -Q "IF EXISTS (SELECT NULL FROM sys.views WHERE name = '%%v') DROP VIEW %%v" >> %logs%\drop_views.log
)

for %%s in (%storedprocs%) do (
    sqlcmd -S "%server%" -U %user% -P %password% -d %database% -I -l %login_timeout% -e -Q "IF EXISTS (SELECT NULL FROM sys.objects WHERE type = 'P' AND OBJECT_ID = OBJECT_ID('%%s')) DROP PROCEDURE %%s" >> %logs%\drop_stored_procs.log
)

echo %date% %time% Existing TICKIT Tables and Views Dropped

echo %date% %time% Creating TICKIT Tables and Views

sqlcmd -S "%server%" -U %user% -P %password% -d %database% -I -i %p1%\%ddl% -l %login_timeout% -b -e >> %logs%\ddl.log

if %ERRORLEVEL% NEQ 0 (
   echo %date% %time% Create DDL statement failed. Please look at the file %logs%\ddl.log for errors.
   goto eof
)

echo %date% %time% TICKIT Tables and Views Created

echo %date% %time% Loading TICKIT Tables

for %%x in (%tables%) do (

    @echo on
    echo. >> %logs%\loads.log
    echo %date% %time% Loading %%x >> %logs%\loads.log

    IF "%%x"=="Sales" (
	%load% %schema%.%%x in "%p1%\NewSales1.txt" -S%server%  -U%user% -P%password% -d%database% -q -c -t  "|" >> %logs%\loads.log
	%load% %schema%.%%x in "%p1%\NewSales2.txt" -S%server%  -U%user% -P%password% -d%database% -q -c -t  "|" >> %logs%\loads.log
	%load% %schema%.%%x in "%p1%\NewSales3.txt" -S%server%  -U%user% -P%password% -d%database% -q -c -t  "|" >> %logs%\loads.log
	%load% %schema%.%%x in "%p1%\NewSales4.txt" -S%server%  -U%user% -P%password% -d%database% -q -c -t  "|" >> %logs%\loads.log
	%load% %schema%.%%x in "%p1%\NewSales5.txt" -S%server%  -U%user% -P%password% -d%database% -q -c -t  "|" >> %logs%\loads.log
	%load% %schema%.%%x in "%p1%\NewSales6.txt" -S%server%  -U%user% -P%password% -d%database% -q -c -t  "|" >> %logs%\loads.log
	%load% %schema%.%%x in "%p1%\NewSales7.txt" -S%server%  -U%user% -P%password% -d%database% -q -c -t  "|" >> %logs%\loads.log
	%load% %schema%.%%x in "%p1%\NewSales8.txt" -S%server%  -U%user% -P%password% -d%database% -q -c -t  "|" >> %logs%\loads.log
	%load% %schema%.%%x in "%p1%\NewSales9.txt" -S%server%  -U%user% -P%password% -d%database% -q -c -t  "|" >> %logs%\loads.log
	%load% %schema%.%%x in "%p1%\NewSales10.txt" -S%server%  -U%user% -P%password% -d%database% -q -c -t  "|" >> %logs%\loads.log
    ) ELSE (
    	%load% %schema%.%%x in "%p1%\%%x.txt" -S%server%  -U%user% -P%password% -d%database% -q -c -t  "|" >> %logs%\loads.log
	)
    @echo off

    if %ERRORLEVEL% NEQ 0 (
         echo %date% %time% Load for table %%x failed. Please look at the file %logs%\loads.log for errors.
    ) ELSE (
          echo %date% %time% Table %%x loaded
    )

)

echo %date% %time% TICKIT Tables Loaded

echo %date% %time% Validating Row Counts
sqlcmd -S "%server%" -U %user% -P %password% -d %database% -I -i %p1%\%validation% -l %login_timeout%

echo %date% %time% Creating Clustered ColumnStore Indexes and Statistics on all Tables
sqlcmd -S "%server%" -U %user% -P %password% -d %database% -I -i %p1%\%idxs_and_statistics% -l %login_timeout% -e -p >> %logs%\idxs_and_statistics.log
echo %date% %time% ColumnStore Indexes and Statistics Created on all Tables

:eof

Alternatively, provided you have already created all the necessary tables on your DW instance, you can simply copy and paste the following ‘one-liners’ into your command prompt, adjusting the file locations, server name, database name and credential parameters to match your environment.

bcp Users in c:\Tickit_Extended_DB_Files\users.txt -S <Server Name> -d <Database Name> -U <Username> -P <password> -q -c -t "|"
bcp Venue in c:\Tickit_Extended_DB_Files\venue.txt -S <Server Name> -d <Database Name> -U <Username> -P <password> -q -c -t  "|"
bcp Category in c:\Tickit_Extended_DB_Files\category.txt -S <Server Name> -d <Database Name> -U <Username> -P <password> -q -c -t  "|"
bcp Date in c:\Tickit_Extended_DB_Files\date.txt -S <Server Name> -d <Database Name> -U <Username> -P <password> -q -c -t  "|"
bcp Event in c:\Tickit_Extended_DB_Files\event.txt -S <Server Name> -d <Database Name> -U <Username> -P <password> -q -c -t  "|"
bcp Listing in c:\Tickit_Extended_DB_Files\listing.txt -S <Server Name> -d <Database Name> -U <Username> -P <password> -q -c -t  "|"
bcp Sales in c:\Tickit_Extended_DB_Files\NewSales1.txt -S <Server Name> -d <Database Name> -U <Username> -P <password> -q -c -t  "|"
bcp Sales in c:\Tickit_Extended_DB_Files\NewSales2.txt -S <Server Name> -d <Database Name> -U <Username> -P <password> -q -c -t  "|"
bcp Sales in c:\Tickit_Extended_DB_Files\NewSales3.txt -S <Server Name> -d <Database Name> -U <Username> -P <password> -q -c -t  "|"
bcp Sales in c:\Tickit_Extended_DB_Files\NewSales4.txt -S <Server Name> -d <Database Name> -U <Username> -P <password> -q -c -t  "|"
bcp Sales in c:\Tickit_Extended_DB_Files\NewSales5.txt -S <Server Name> -d <Database Name> -U <Username> -P <password> -q -c -t  "|"
bcp Sales in c:\Tickit_Extended_DB_Files\NewSales6.txt -S <Server Name> -d <Database Name> -U <Username> -P <password> -q -c -t  "|"
bcp Sales in c:\Tickit_Extended_DB_Files\NewSales7.txt -S <Server Name> -d <Database Name> -U <Username> -P <password> -q -c -t  "|"
bcp Sales in c:\Tickit_Extended_DB_Files\NewSales8.txt -S <Server Name> -d <Database Name> -U <Username> -P <password> -q -c -t  "|"
bcp Sales in c:\Tickit_Extended_DB_Files\NewSales9.txt -S <Server Name> -d <Database Name> -U <Username> -P <password> -q -c -t  "|"
bcp Sales in c:\Tickit_Extended_DB_Files\NewSales10.txt -S <Server Name> -d <Database Name> -U <Username> -P <password> -q -c -t  "|"

It is also worth mentioning that Microsoft provides a handy little utility to create import and export BCP files based on a configured set of parameters, driven by a wizard-like process. The tool is aptly called Data Warehouse Migration Utility and, even though still in preview, it provides automated schema and data migration script generation from SQL Server and Azure SQL Database to Azure SQL Data Warehouse.

Azure_SQL_DW_Review_DW_Migration_Utility

The download link as well as the tool overview page can be found HERE.

In the next post I will provide test results from a selection of sample queries executed on both platforms stacked up against each other and briefly touch on reporting from Azure SQL DW using Power BI and Tableau tools.
