In the previous post I started looking at some of the TPC-DS queries’ performance across single-node and multi-node environments. This post continues the performance analysis with the remaining set of queries and looks at the high-level interactions between Vertica and Tableau/PowerBI. First though, let’s look at the remaining ten queries and how their execution times fared across the three different data volumes and cluster sizes.
--Query 44
SELECT asceding.rnk,
i1.i_product_name best_performing,
i2.i_product_name worst_performing
FROM (SELECT *
FROM (SELECT item_sk,
Rank()
OVER (
ORDER BY rank_col ASC) rnk
FROM (SELECT ss_item_sk item_sk,
Avg(ss_net_profit) rank_col
FROM tpc_ds.store_sales ss1
WHERE ss_store_sk = 4
GROUP BY ss_item_sk
HAVING Avg(ss_net_profit) > 0.9 *
(SELECT Avg(ss_net_profit)
rank_col
FROM tpc_ds.store_sales
WHERE ss_store_sk = 4
AND ss_cdemo_sk IS
NULL
GROUP BY ss_store_sk))V1)
V11
WHERE rnk < 11) asceding,
(SELECT *
FROM (SELECT item_sk,
Rank()
OVER (
ORDER BY rank_col DESC) rnk
FROM (SELECT ss_item_sk item_sk,
Avg(ss_net_profit) rank_col
FROM tpc_ds.store_sales ss1
WHERE ss_store_sk = 4
GROUP BY ss_item_sk
HAVING Avg(ss_net_profit) > 0.9 *
(SELECT Avg(ss_net_profit)
rank_col
FROM tpc_ds.store_sales
WHERE ss_store_sk = 4
AND ss_cdemo_sk IS
NULL
GROUP BY ss_store_sk))V2)
V21
WHERE rnk < 11) descending,
tpc_ds.item i1,
tpc_ds.item i2
WHERE asceding.rnk = descending.rnk
AND i1.i_item_sk = asceding.item_sk
AND i2.i_item_sk = descending.item_sk
ORDER BY asceding.rnk
LIMIT 100;
--Query 46
SELECT c_last_name,
c_first_name,
ca_city,
bought_city,
ss_ticket_number,
amt,
profit
FROM (SELECT ss_ticket_number,
ss_customer_sk,
ca_city bought_city,
Sum(ss_coupon_amt) amt,
Sum(ss_net_profit) profit
FROM tpc_ds.store_sales,
tpc_ds.date_dim,
tpc_ds.store,
tpc_ds.household_demographics,
tpc_ds.customer_address
WHERE store_sales.ss_sold_date_sk = date_dim.d_date_sk
AND store_sales.ss_store_sk = store.s_store_sk
AND store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk
AND store_sales.ss_addr_sk = customer_address.ca_address_sk
AND ( household_demographics.hd_dep_count = 6
OR household_demographics.hd_vehicle_count = 0 )
AND date_dim.d_dow IN ( 6, 0 )
AND date_dim.d_year IN ( 2000, 2000 + 1, 2000 + 2 )
AND store.s_city IN ( 'Midway', 'Fairview', 'Fairview',
'Fairview',
'Fairview' )
GROUP BY ss_ticket_number,
ss_customer_sk,
ss_addr_sk,
ca_city) dn,
tpc_ds.customer,
tpc_ds.customer_address current_addr
WHERE ss_customer_sk = c_customer_sk
AND customer.c_current_addr_sk = current_addr.ca_address_sk
AND current_addr.ca_city <> bought_city
ORDER BY c_last_name,
c_first_name,
ca_city,
bought_city,
ss_ticket_number
LIMIT 100;
--Query 47
WITH v1
AS (SELECT i_category,
i_brand,
s_store_name,
s_company_name,
d_year,
d_moy,
Sum(ss_sales_price) sum_sales,
Avg(Sum(ss_sales_price))
OVER (
partition BY i_category, i_brand, s_store_name,
s_company_name,
d_year)
avg_monthly_sales,
Rank()
OVER (
partition BY i_category, i_brand, s_store_name,
s_company_name
ORDER BY d_year, d_moy) rn
FROM tpc_ds.item,
tpc_ds.store_sales,
tpc_ds.date_dim,
tpc_ds.store
WHERE ss_item_sk = i_item_sk
AND ss_sold_date_sk = d_date_sk
AND ss_store_sk = s_store_sk
AND ( d_year = 1999
OR ( d_year = 1999 - 1
AND d_moy = 12 )
OR ( d_year = 1999 + 1
AND d_moy = 1 ) )
GROUP BY i_category,
i_brand,
s_store_name,
s_company_name,
d_year,
d_moy),
v2
AS (SELECT v1.i_category,
v1.d_year,
v1.d_moy,
v1.avg_monthly_sales,
v1.sum_sales,
v1_lag.sum_sales psum,
v1_lead.sum_sales nsum
FROM v1,
v1 v1_lag,
v1 v1_lead
WHERE v1.i_category = v1_lag.i_category
AND v1.i_category = v1_lead.i_category
AND v1.i_brand = v1_lag.i_brand
AND v1.i_brand = v1_lead.i_brand
AND v1.s_store_name = v1_lag.s_store_name
AND v1.s_store_name = v1_lead.s_store_name
AND v1.s_company_name = v1_lag.s_company_name
AND v1.s_company_name = v1_lead.s_company_name
AND v1.rn = v1_lag.rn + 1
AND v1.rn = v1_lead.rn - 1)
SELECT *
FROM v2
WHERE d_year = 1999
AND avg_monthly_sales > 0
AND CASE
WHEN avg_monthly_sales > 0 THEN Abs(sum_sales - avg_monthly_sales)
/
avg_monthly_sales
ELSE NULL
END > 0.1
ORDER BY sum_sales - avg_monthly_sales,
3
LIMIT 100;
--Query 54
WITH my_customers
AS (SELECT DISTINCT c_customer_sk,
c_current_addr_sk
FROM (SELECT cs_sold_date_sk sold_date_sk,
cs_bill_customer_sk customer_sk,
cs_item_sk item_sk
FROM tpc_ds.catalog_sales
UNION ALL
SELECT ws_sold_date_sk sold_date_sk,
ws_bill_customer_sk customer_sk,
ws_item_sk item_sk
FROM tpc_ds.web_sales) cs_or_ws_sales,
tpc_ds.item,
tpc_ds.date_dim,
tpc_ds.customer
WHERE sold_date_sk = d_date_sk
AND item_sk = i_item_sk
AND i_category = 'Sports'
AND i_class = 'fitness'
AND c_customer_sk = cs_or_ws_sales.customer_sk
AND d_moy = 5
AND d_year = 2000),
my_revenue
AS (SELECT c_customer_sk,
Sum(ss_ext_sales_price) AS revenue
FROM my_customers,
tpc_ds.store_sales,
tpc_ds.customer_address,
tpc_ds.store,
tpc_ds.date_dim
WHERE c_current_addr_sk = ca_address_sk
AND ca_county = s_county
AND ca_state = s_state
AND ss_sold_date_sk = d_date_sk
AND c_customer_sk = ss_customer_sk
AND d_month_seq BETWEEN (SELECT DISTINCT d_month_seq + 1
FROM tpc_ds.date_dim
WHERE d_year = 2000
AND d_moy = 5) AND
(SELECT DISTINCT
d_month_seq + 3
FROM tpc_ds.date_dim
WHERE d_year = 2000
AND d_moy = 5)
GROUP BY c_customer_sk),
segments
AS (SELECT Cast(( revenue / 50 ) AS INT) AS segment
FROM my_revenue)
SELECT segment,
Count(*) AS num_customers,
segment * 50 AS segment_base
FROM segments
GROUP BY segment
ORDER BY segment,
num_customers
LIMIT 100;
--Query 57
WITH v1
AS (SELECT i_category,
i_brand,
cc_name,
d_year,
d_moy,
Sum(cs_sales_price) sum_sales
,
Avg(Sum(cs_sales_price))
OVER (
partition BY i_category, i_brand, cc_name, d_year)
avg_monthly_sales
,
Rank()
OVER (
partition BY i_category, i_brand, cc_name
ORDER BY d_year, d_moy) rn
FROM tpc_ds.item,
tpc_ds.catalog_sales,
tpc_ds.date_dim,
tpc_ds.call_center
WHERE cs_item_sk = i_item_sk
AND cs_sold_date_sk = d_date_sk
AND cc_call_center_sk = cs_call_center_sk
AND ( d_year = 2000
OR ( d_year = 2000 - 1
AND d_moy = 12 )
OR ( d_year = 2000 + 1
AND d_moy = 1 ) )
GROUP BY i_category,
i_brand,
cc_name,
d_year,
d_moy),
v2
AS (SELECT v1.i_brand,
v1.d_year,
v1.avg_monthly_sales,
v1.sum_sales,
v1_lag.sum_sales psum,
v1_lead.sum_sales nsum
FROM v1,
v1 v1_lag,
v1 v1_lead
WHERE v1.i_category = v1_lag.i_category
AND v1.i_category = v1_lead.i_category
AND v1.i_brand = v1_lag.i_brand
AND v1.i_brand = v1_lead.i_brand
AND v1.cc_name = v1_lag.cc_name
AND v1.cc_name = v1_lead.cc_name
AND v1.rn = v1_lag.rn + 1
AND v1.rn = v1_lead.rn - 1)
SELECT *
FROM v2
WHERE d_year = 2000
AND avg_monthly_sales > 0
AND CASE
WHEN avg_monthly_sales > 0 THEN Abs(sum_sales - avg_monthly_sales)
/
avg_monthly_sales
ELSE NULL
END > 0.1
ORDER BY sum_sales - avg_monthly_sales,
3
LIMIT 100;
--Query 64
WITH cs_ui
AS (SELECT cs_item_sk,
Sum(cs_ext_list_price) AS sale,
Sum(cr_refunded_cash + cr_reversed_charge
+ cr_store_credit) AS refund
FROM tpc_ds.catalog_sales,
tpc_ds.catalog_returns
WHERE cs_item_sk = cr_item_sk
AND cs_order_number = cr_order_number
GROUP BY cs_item_sk
HAVING Sum(cs_ext_list_price) > 2 * Sum(
cr_refunded_cash + cr_reversed_charge
+ cr_store_credit)),
cross_sales
AS (SELECT i_product_name product_name,
i_item_sk item_sk,
s_store_name store_name,
s_zip store_zip,
ad1.ca_street_number b_street_number,
ad1.ca_street_name b_streen_name,
ad1.ca_city b_city,
ad1.ca_zip b_zip,
ad2.ca_street_number c_street_number,
ad2.ca_street_name c_street_name,
ad2.ca_city c_city,
ad2.ca_zip c_zip,
d1.d_year AS syear,
d2.d_year AS fsyear,
d3.d_year s2year,
Count(*) cnt,
Sum(ss_wholesale_cost) s1,
Sum(ss_list_price) s2,
Sum(ss_coupon_amt) s3
FROM tpc_ds.store_sales,
tpc_ds.store_returns,
cs_ui,
tpc_ds.date_dim d1,
tpc_ds.date_dim d2,
tpc_ds.date_dim d3,
tpc_ds.store,
tpc_ds.customer,
tpc_ds.customer_demographics cd1,
tpc_ds.customer_demographics cd2,
tpc_ds.promotion,
tpc_ds.household_demographics hd1,
tpc_ds.household_demographics hd2,
tpc_ds.customer_address ad1,
tpc_ds.customer_address ad2,
tpc_ds.income_band ib1,
tpc_ds.income_band ib2,
tpc_ds.item
WHERE ss_store_sk = s_store_sk
AND ss_sold_date_sk = d1.d_date_sk
AND ss_customer_sk = c_customer_sk
AND ss_cdemo_sk = cd1.cd_demo_sk
AND ss_hdemo_sk = hd1.hd_demo_sk
AND ss_addr_sk = ad1.ca_address_sk
AND ss_item_sk = i_item_sk
AND ss_item_sk = sr_item_sk
AND ss_ticket_number = sr_ticket_number
AND ss_item_sk = cs_ui.cs_item_sk
AND c_current_cdemo_sk = cd2.cd_demo_sk
AND c_current_hdemo_sk = hd2.hd_demo_sk
AND c_current_addr_sk = ad2.ca_address_sk
AND c_first_sales_date_sk = d2.d_date_sk
AND c_first_shipto_date_sk = d3.d_date_sk
AND ss_promo_sk = p_promo_sk
AND hd1.hd_income_band_sk = ib1.ib_income_band_sk
AND hd2.hd_income_band_sk = ib2.ib_income_band_sk
AND cd1.cd_marital_status <> cd2.cd_marital_status
AND i_color IN ( 'cyan', 'peach', 'blush', 'frosted',
'powder', 'orange' )
AND i_current_price BETWEEN 58 AND 58 + 10
AND i_current_price BETWEEN 58 + 1 AND 58 + 15
GROUP BY i_product_name,
i_item_sk,
s_store_name,
s_zip,
ad1.ca_street_number,
ad1.ca_street_name,
ad1.ca_city,
ad1.ca_zip,
ad2.ca_street_number,
ad2.ca_street_name,
ad2.ca_city,
ad2.ca_zip,
d1.d_year,
d2.d_year,
d3.d_year)
SELECT cs1.product_name,
cs1.store_name,
cs1.store_zip,
cs1.b_street_number,
cs1.b_streen_name,
cs1.b_city,
cs1.b_zip,
cs1.c_street_number,
cs1.c_street_name,
cs1.c_city,
cs1.c_zip,
cs1.syear,
cs1.cnt,
cs1.s1,
cs1.s2,
cs1.s3,
cs2.s1,
cs2.s2,
cs2.s3,
cs2.syear,
cs2.cnt
FROM cross_sales cs1,
cross_sales cs2
WHERE cs1.item_sk = cs2.item_sk
AND cs1.syear = 2001
AND cs2.syear = 2001 + 1
AND cs2.cnt <= cs1.cnt
AND cs1.store_name = cs2.store_name
AND cs1.store_zip = cs2.store_zip
ORDER BY cs1.product_name,
cs1.store_name,
cs2.cnt;
--Query 74
WITH year_total
AS (SELECT c_customer_id customer_id,
c_first_name customer_first_name,
c_last_name customer_last_name,
d_year AS year1,
Sum(ss_net_paid) year_total,
's' sale_type
FROM tpc_ds.customer,
tpc_ds.store_sales,
tpc_ds.date_dim
WHERE c_customer_sk = ss_customer_sk
AND ss_sold_date_sk = d_date_sk
AND d_year IN ( 1999, 1999 + 1 )
GROUP BY c_customer_id,
c_first_name,
c_last_name,
d_year
UNION ALL
SELECT c_customer_id customer_id,
c_first_name customer_first_name,
c_last_name customer_last_name,
d_year AS year1,
Sum(ws_net_paid) year_total,
'w' sale_type
FROM tpc_ds.customer,
tpc_ds.web_sales,
tpc_ds.date_dim
WHERE c_customer_sk = ws_bill_customer_sk
AND ws_sold_date_sk = d_date_sk
AND d_year IN ( 1999, 1999 + 1 )
GROUP BY c_customer_id,
c_first_name,
c_last_name,
d_year)
SELECT t_s_secyear.customer_id,
t_s_secyear.customer_first_name,
t_s_secyear.customer_last_name
FROM year_total t_s_firstyear,
year_total t_s_secyear,
year_total t_w_firstyear,
year_total t_w_secyear
WHERE t_s_secyear.customer_id = t_s_firstyear.customer_id
AND t_s_firstyear.customer_id = t_w_secyear.customer_id
AND t_s_firstyear.customer_id = t_w_firstyear.customer_id
AND t_s_firstyear.sale_type = 's'
AND t_w_firstyear.sale_type = 'w'
AND t_s_secyear.sale_type = 's'
AND t_w_secyear.sale_type = 'w'
AND t_s_firstyear.year1 = 1999
AND t_s_secyear.year1 = 1999 + 1
AND t_w_firstyear.year1 = 1999
AND t_w_secyear.year1 = 1999 + 1
AND t_s_firstyear.year_total > 0
AND t_w_firstyear.year_total > 0
AND CASE
WHEN t_w_firstyear.year_total > 0 THEN t_w_secyear.year_total /
t_w_firstyear.year_total
ELSE NULL
END > CASE
WHEN t_s_firstyear.year_total > 0 THEN
t_s_secyear.year_total /
t_s_firstyear.year_total
ELSE NULL
END
ORDER BY 1,
2,
3
LIMIT 100;
--Query 75
WITH all_sales
AS (SELECT d_year,
i_brand_id,
i_class_id,
i_category_id,
i_manufact_id,
Sum(sales_cnt) AS sales_cnt,
Sum(sales_amt) AS sales_amt
FROM (SELECT d_year,
i_brand_id,
i_class_id,
i_category_id,
i_manufact_id,
cs_quantity - COALESCE(cr_return_quantity, 0) AS
sales_cnt,
cs_ext_sales_price - COALESCE(cr_return_amount, 0.0) AS
sales_amt
FROM tpc_ds.catalog_sales
JOIN tpc_ds.item
ON i_item_sk = cs_item_sk
JOIN tpc_ds.date_dim
ON d_date_sk = cs_sold_date_sk
LEFT JOIN tpc_ds.catalog_returns
ON ( cs_order_number = cr_order_number
AND cs_item_sk = cr_item_sk )
WHERE i_category = 'Men'
UNION
SELECT d_year,
i_brand_id,
i_class_id,
i_category_id,
i_manufact_id,
ss_quantity - COALESCE(sr_return_quantity, 0) AS
sales_cnt,
ss_ext_sales_price - COALESCE(sr_return_amt, 0.0) AS
sales_amt
FROM tpc_ds.store_sales
JOIN tpc_ds.item
ON i_item_sk = ss_item_sk
JOIN tpc_ds.date_dim
ON d_date_sk = ss_sold_date_sk
LEFT JOIN tpc_ds.store_returns
ON ( ss_ticket_number = sr_ticket_number
AND ss_item_sk = sr_item_sk )
WHERE i_category = 'Men'
UNION
SELECT d_year,
i_brand_id,
i_class_id,
i_category_id,
i_manufact_id,
ws_quantity - COALESCE(wr_return_quantity, 0) AS
sales_cnt,
ws_ext_sales_price - COALESCE(wr_return_amt, 0.0) AS
sales_amt
FROM tpc_ds.web_sales
JOIN tpc_ds.item
ON i_item_sk = ws_item_sk
JOIN tpc_ds.date_dim
ON d_date_sk = ws_sold_date_sk
LEFT JOIN tpc_ds.web_returns
ON ( ws_order_number = wr_order_number
AND ws_item_sk = wr_item_sk )
WHERE i_category = 'Men') sales_detail
GROUP BY d_year,
i_brand_id,
i_class_id,
i_category_id,
i_manufact_id)
SELECT prev_yr.d_year AS prev_year,
curr_yr.d_year AS year1,
curr_yr.i_brand_id,
curr_yr.i_class_id,
curr_yr.i_category_id,
curr_yr.i_manufact_id,
prev_yr.sales_cnt AS prev_yr_cnt,
curr_yr.sales_cnt AS curr_yr_cnt,
curr_yr.sales_cnt - prev_yr.sales_cnt AS sales_cnt_diff,
curr_yr.sales_amt - prev_yr.sales_amt AS sales_amt_diff
FROM all_sales curr_yr,
all_sales prev_yr
WHERE curr_yr.i_brand_id = prev_yr.i_brand_id
AND curr_yr.i_class_id = prev_yr.i_class_id
AND curr_yr.i_category_id = prev_yr.i_category_id
AND curr_yr.i_manufact_id = prev_yr.i_manufact_id
AND curr_yr.d_year = 2002
AND prev_yr.d_year = 2002 - 1
AND Cast(curr_yr.sales_cnt AS DECIMAL(17, 2)) / Cast(prev_yr.sales_cnt AS
DECIMAL(17, 2))
< 0.9
ORDER BY sales_cnt_diff
LIMIT 100;
--Query 97
WITH ssci
AS (SELECT ss_customer_sk customer_sk,
ss_item_sk item_sk
FROM tpc_ds.store_sales,
tpc_ds.date_dim
WHERE ss_sold_date_sk = d_date_sk
AND d_month_seq BETWEEN 1196 AND 1196 + 11
GROUP BY ss_customer_sk,
ss_item_sk),
csci
AS (SELECT cs_bill_customer_sk customer_sk,
cs_item_sk item_sk
FROM tpc_ds.catalog_sales,
tpc_ds.date_dim
WHERE cs_sold_date_sk = d_date_sk
AND d_month_seq BETWEEN 1196 AND 1196 + 11
GROUP BY cs_bill_customer_sk,
cs_item_sk)
SELECT Sum(CASE
WHEN ssci.customer_sk IS NOT NULL
AND csci.customer_sk IS NULL THEN 1
ELSE 0
END) store_only,
Sum(CASE
WHEN ssci.customer_sk IS NULL
AND csci.customer_sk IS NOT NULL THEN 1
ELSE 0
END) catalog_only,
Sum(CASE
WHEN ssci.customer_sk IS NOT NULL
AND csci.customer_sk IS NOT NULL THEN 1
ELSE 0
END) store_and_catalog
FROM ssci
FULL OUTER JOIN csci
ON ( ssci.customer_sk = csci.customer_sk
AND ssci.item_sk = csci.item_sk )
LIMIT 100;
--Query 99
SELECT Substr(w_warehouse_name, 1, 20),
sm_type,
cc_name,
Sum(CASE
WHEN ( cs_ship_date_sk - cs_sold_date_sk <= 30 ) THEN 1
ELSE 0
END) AS '30 days',
Sum(CASE
WHEN ( cs_ship_date_sk - cs_sold_date_sk > 30 )
AND ( cs_ship_date_sk - cs_sold_date_sk <= 60 ) THEN 1
ELSE 0
END) AS '31-60 days',
Sum(CASE
WHEN ( cs_ship_date_sk - cs_sold_date_sk > 60 )
AND ( cs_ship_date_sk - cs_sold_date_sk <= 90 ) THEN 1
ELSE 0
END) AS '61-90 days',
Sum(CASE
WHEN ( cs_ship_date_sk - cs_sold_date_sk > 90 )
AND ( cs_ship_date_sk - cs_sold_date_sk <= 120 ) THEN
1
ELSE 0
END) AS '91-120 days',
Sum(CASE
WHEN ( cs_ship_date_sk - cs_sold_date_sk > 120 ) THEN 1
ELSE 0
END) AS '>120 days'
FROM tpc_ds.catalog_sales,
tpc_ds.warehouse,
tpc_ds.ship_mode,
tpc_ds.call_center,
tpc_ds.date_dim
WHERE d_month_seq BETWEEN 1200 AND 1200 + 11
AND cs_ship_date_sk = d_date_sk
AND cs_warehouse_sk = w_warehouse_sk
AND cs_ship_mode_sk = sm_ship_mode_sk
AND cs_call_center_sk = cc_call_center_sk
GROUP BY Substr(w_warehouse_name, 1, 20),
sm_type,
cc_name
ORDER BY Substr(w_warehouse_name, 1, 20),
sm_type,
cc_name
LIMIT 100;
I didn’t set out on this experiment to compare Vertica’s performance to other vendors in the MPP category, so I cannot comment on how it would fare against other RDBMS systems. It would have been interesting to see how, for example, Pivotal’s Greenplum would match up when tested on identical hardware and data volumes, but this 4-part series was primarily focused on establishing the connection between two variables: the number of nodes and the data size. From that perspective, Vertica performs very well ‘out-of-the-box’, and without any tuning or tweaking it managed not only to execute the queries relatively fast (given the mediocre hardware specs) but also to take full advantage of the extra computing resources, distributing the load across the available nodes and cutting query execution times.
Looking at the results and the dichotomies they reveal across the multitude of different configurations, the first thing that jumps out is that the differences in performance across the various data volumes and node counts are very linear. On average, query execution time roughly doubles with every increase in scaling factor, and speed drops by roughly a factor of two for every node removed from the cluster. There are some slight variations, but on the whole Vertica’s performance is consistent and predictable, with all twenty queries exhibiting a similar pattern when dealing with more data and/or more computing power (additional nodes) thrown at them.
Looking at the side-by-side comparison of the execution results across single-node and multi-node configurations, the differences across scaling factors of 100, 200 and 300 are very consistent (click on image to expand).
Converting these differences (single vs multi-node deployment) into percentages yielded an average increase of 243% for the scaling factor of 100, 252% for the scaling factor of 200 and 217% for the scaling factor of 300, as per the chart below (click on image to expand).
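For transparency, these percentages were derived by simply relating each single-node runtime to its three-node counterpart. Below is a minimal sketch of that calculation, assuming the recorded timings had been loaded into a small helper table; the table, its columns and the exact arithmetic are illustrative assumptions on my part rather than a record of how the published figures were produced.
--Hypothetical helper table holding the recorded runtimes (in seconds)
CREATE TABLE IF NOT EXISTS bench_results
  (
     query_id       INT,
     scaling_factor INT,
     node_count     INT,
     runtime_sec    NUMERIC(10, 2)
  );
--Express each single-node runtime as a percentage of the corresponding three-node runtime
SELECT s.scaling_factor,
       s.query_id,
       Round(( s.runtime_sec / m.runtime_sec ) * 100, 0) AS single_vs_multi_node_pct
FROM   bench_results s,
       bench_results m
WHERE  s.query_id = m.query_id
       AND s.scaling_factor = m.scaling_factor
       AND s.node_count = 1
       AND m.node_count = 3
ORDER  BY s.scaling_factor,
          s.query_id;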
Finally, let’s look at how Vertica (three-node cluster with the 100 scaling factor TPC-DS data) performed against some ad hoc queries generated in PowerBI and Tableau. Below are two short video clips depicting a rudimentary analysis of store sales data (this fact table contains close to 300 million records) against the date, item, store and customer_address dimensions. It’s the type of rapid-fire, exploratory analysis one would conduct on retail data to answer some immediate questions. I also set up Tableau and PowerBI side by side with a process viewer/system monitor (htop), with four panes displaying core performance metrics e.g. CPU load, swap, memory status etc. across the three nodes and the machine hosting the visualisation application (bottom view), so that I could observe (and record) the systems’ behaviour when interactively aggregating data. This made it easy to see the load distribution and how each host reacted (performance-wise) to the queries issued during this analysis. With respect to PowerBI, you can see that the three-year-old work laptop I recorded this on was not up to the task, due to the large amount of CPU cycles consumed by the screen-recording software alone. On the flip side, Tableau ran on my dual-CPU Mac Pro so the whole process was stutter-free, which may simply reflect the fact that, given better hardware, PowerBI would also perform better.
I used Tableau Desktop version 2018.1 with native support for Vertica. As opposed to PowerBI, Tableau had no issues reading data from the ‘customer’ dimension, but in order to make the two recordings as comparable as possible I abandoned it and used ‘customer_address’ instead. In terms of raw performance, this mini-cluster actually turned out to be borderline usable, and in most cases I didn’t need to wait more than 10 seconds for an aggregation or filter to complete. Each node’s CPU spiked close to 100 percent during query execution but, given the volume of data and the mediocre hardware this data was crunched on, I would say that Vertica performed quite well.
PowerBI version 2.59 (used in the footage below) included Vertica support in Beta only, so chances are the final release will be much better supported and optimised. PowerBI wasn’t too cooperative when selecting data out of the ‘customer’ dimension (see footage below), but everything else worked as expected e.g. geospatial analysis, various graphs etc. Performance-wise, I still think that Tableau was more workable but, given that Vertica support was still in preview and the pedestrian-quality laptop I was testing it on, chances are that PowerBI would be a viable choice for this kind of exploratory analysis, especially when factoring in the price – free!
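To give a sense of the workload these tools generate, the interactions in both recordings boil down to simple star-schema joins and aggregations over the store_sales fact table. The query below is an illustrative sketch of that kind of ad hoc aggregate, not a capture of the actual SQL Tableau or PowerBI emitted.
--Illustrative example of the type of ad hoc aggregate issued during the exploratory analysis
SELECT d_year,
       d_moy,
       i_category,
       s_state,
       Sum(ss_ext_sales_price) AS sales_amt,
       Sum(ss_net_profit)      AS net_profit
FROM   tpc_ds.store_sales,
       tpc_ds.date_dim,
       tpc_ds.item,
       tpc_ds.store
WHERE  ss_sold_date_sk = d_date_sk
       AND ss_item_sk = i_item_sk
       AND ss_store_sk = s_store_sk
GROUP  BY d_year,
          d_moy,
          i_category,
          s_state
ORDER  BY d_year,
          d_moy,
          i_category,
          s_state;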
Conclusion
There isn’t much I can add to what I have already stated based on my observations and performance results. Vertica’s easy set-up and hassle-free, almost plug-and-play configuration made it very easy to work with. While I haven’t covered most of the ‘knobs and switches’ that come built in to fine-tune its functionality, the installation process was simple and more intuitive than some of the other commercial RDBMS products out there, even the ones which come with pre-built binaries and executable packages. I only wish Vertica came with support for the latest Ubuntu Server LTS release – at the time of writing, Ubuntu Bionic Beaver was the latest LTS release, while Vertica 9.1 Ubuntu support only extended to version 14, dating back to July 2016 and supported only until April 2019.
Performance-wise, Vertica exceeded my expectations. Running on hardware which I had no real use for and originally intended to sell on eBay for a few hundred dollars (if lucky), it managed to crunch through large volumes of data (by relative standards) in a very respectable time. Some of the more processing-intensive queries executed in seconds and scaled well (close to linearly) across additional nodes and increased data volumes. It would be a treat to compare it against the likes of Greenplum or CitusDB and to explore some of its other features e.g. machine learning or Hadoop integration (an idea for a future blog), as SQL query execution speed on structured data isn’t its only forte.
In Post 1 I outlined the key architectural principles behind Vertica’s design and what makes it one of the top MPP/analytical databases available today. In Post 2 I went over the installation process across one and multiple nodes, as well as some of Vertica’s ‘knobs and buttons’ which give the administrator a comprehensive snapshot of system performance. In this instalment I will focus on loading the data into Vertica’s cluster and running some analytical queries across the TPC-DS datasets to compare execution times for single and multi-node environments. The comparison does not attempt to contrast Vertica’s performance with any of its competing MPP database vendors e.g. Greenplum or AWS Redshift (that may come in future posts). Rather, the purpose of this post is to ascertain how query execution is affected by the number of nodes in the cluster and whether the performance gains or losses are linear. Also, given that Vertica is a somewhat esoteric DBMS, I wanted to highlight the fact that with a fairly minimal set-up and configuration, this MPP database can provide a lot of bang for your buck when dealing with analytical workloads. Yes, most developers or analysts would love nothing more than to jump on the cloud bandwagon and use ‘ops-free’ Google BigQuery or even AWS Redshift, but the reality is that most SMBs are not ready to pivot their operational model and store their customers’ data in the public cloud. Databases like Vertica provide a reasonable alternative to the long-established players in this market e.g. Oracle DB or IBM DB2 and allow so-called big data demands to be addressed with relative ease i.e. multi-model deployment, full-featured SQL API, MPP architecture, in-database machine learning etc.
Cluster Setup and Data Load
There are quite a few data sets and methodologies that can be used to gauge a data storage and processing system’s performance e.g. the TLC Trip Record Data released by the New York City Taxi & Limousine Commission has gained a lot of traction amongst big data specialists. However, TPC-approved benchmarks have long been considered the most objective, vendor-agnostic way to perform data-specific hardware and software comparisons, capturing the complexity of modern business processing to a much greater extent than their predecessors. I briefly outlined the data generation mechanism used in the TPC-DS suite of tools in my previous blog HERE, so I will skip the details and use three previously generated datasets for this demo – one with a scaling factor of 100 (~100GB), one with a scaling factor of 200 (~200GB) and another with a scaling factor of 300 (~300GB). The three datasets did not turn out to be perfectly uniform in terms of expected row counts (presumably due to selecting unsupported scaling factors), with large deviations recorded across some files e.g. the ‘inventory’ table containing fewer records in the 200GB dataset than in the 100GB one. However, the three datasets were quite linear in terms of raw data volume increases, which was also reflected in the runtimes across a number of the subsequent queries.
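If you generate your own datasets, the deviations are easy to spot by comparing per-table row counts against the figures published in the TPC-DS specification, either with a simple wc -l over the flat files or, once the data is loaded (see the next section), with a query along the lines of the sketch below; the table list is truncated for brevity.
--Quick sanity check of row counts across the loaded tables
SELECT 'store_sales' AS table_name,
       Count(*)      AS row_count
FROM   tpc_ds.store_sales
UNION ALL
SELECT 'inventory',
       Count(*)
FROM   tpc_ds.inventory
UNION ALL
SELECT 'catalog_sales',
       Count(*)
FROM   tpc_ds.catalog_sales
UNION ALL
SELECT 'web_sales',
       Count(*)
FROM   tpc_ds.web_sales;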
As mentioned in Post 2, the hardware used for this demo is not exactly what you would have in mind when provisioning an enterprise-class database cluster. With that in mind, the results of all the analytical queries I ran are only indicative of the performance level in the context of the same data and configuration used. Chances are that with proper, server-grade hardware this comparison would be like going from the Wright brothers’ first airplane to the Starship Enterprise, but at least I endeavoured to make it as empirical as possible. To be completely transparent, here is a view of how these tiny Lenovo desktops are configured.
Also, for good measure, I ran a quick test of the storage performance using the ‘measure_location_performance’ vsql function on one of the machines (all three nodes are identical in terms of their individual components). These units are each equipped with an SSD (as opposed to a slower mechanical drive); however, their SATA2 interface limits the transfer rates these Crucial MX300 drives are technically capable of to just over 360 MB/sec, as per the image below.
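For reference, the function takes the storage location path and the node name as its arguments and can be run from vsql. The call below is a sketch of the invocation; the path and node name follow Vertica’s default naming for my ‘vtest’ database and are assumptions that will differ on other setups.
--Measure throughput and latency of a node's storage location
SELECT MEASURE_LOCATION_PERFORMANCE('/home/dbadmin/vtest/v_vtest_node0001_data', 'v_vtest_node0001');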
Finally, the fact I’m pushing between 100 and 300 gigabytes of data through this cluster with only 16GB of memory available per node means that most of the data needs to be read from disk. Most analytical databases are designed to cache data in memory, providing minimal latency and fast response time. As this hardware does not do justice to the data volumes queried in this demo, with proper reference architecture for this type of system I am certain that the performance would increase by an order of magnitude.
Vertica documentation provides a detailed outline of the different techniques used to load, transform and monitor data acquisition. As reiterating all the options and approaches is outside the scope of this post, I will only touch on some of the core concepts – for a comprehensive guide on data loading best practices and the different options available, please refer to the documentation or THIS visual guide. By far the most effective way to load data into Vertica is the COPY command. The COPY statement loads data from a file stored on the host or client (or in a data stream) into a database table. You can pass the COPY statement many different parameters to define various options such as the format of the incoming data, metadata about the data load, which parser COPY should use, how to transform data as it is loaded, or how to handle errors. Vertica’s hybrid storage model provides a great deal of flexibility for loading and managing data.
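As an illustration of those parameters, the statement below sketches a single-table load with an explicit delimiter, NULL handling, rejected-row capture and the DIRECT load method; the file and log paths are made up for the example and the option set is far from exhaustive.
--Example COPY with format, error-handling and load-method options
COPY tpc_ds.store_sales
FROM LOCAL '/home/dbadmin/vertica_stuff/TPC_DS_Data/100GB/store_sales.dat'
DELIMITER '|'
NULL ''
REJECTMAX 100
REJECTED DATA '/tmp/store_sales_rejected.dat'
EXCEPTIONS '/tmp/store_sales_exceptions.log'
DIRECT;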
For this demo I copied and staged the three data sets – 100GB, 200GB and 300GB – on the m92p1 host (node no. 1 in my makeshift 3-node cluster) and used the following shell script to (1) create the necessary tables in the previously created ‘vtest’ database and ‘tpc_ds’ schema and (2) load the data into the newly created tables.
#!/bin/bash
# Connection details and file locations
VPASS="YourPassword"
DBNAME="vtest"
SCHEMANAME="tpc_ds"
DATAFILEPATH="/home/dbadmin/vertica_stuff/TPC_DS_Data/100GB/*.dat"
SQLFILEPATH="/home/dbadmin/vertica_stuff/TPC_DS_SQL/create_pgsql_tables.sql"
# (1) Create the TPC-DS tables from the DDL file
/opt/vertica/bin/vsql -f "$SQLFILEPATH" -U dbadmin -w "$VPASS" -d "$DBNAME"
# (2) Load each flat file into the table of the same name via COPY ... DIRECT
for file in $DATAFILEPATH
do
filename=$(basename "$file")
tblname=$(basename "$file" | cut -f 1 -d '.')
echo "Loading file $filename into a Vertica host..."
#single node only:
echo "COPY $SCHEMANAME.$tblname FROM LOCAL '/home/dbadmin/vertica_stuff/TPC_DS_Data/100GB/$filename' \
DELIMITER '|' DIRECT;" | \
/opt/vertica/bin/vsql \
-U dbadmin \
-w "$VPASS" \
-d "$DBNAME"
done
The SQL file with all the DDL statements for table creation can be downloaded from my OneDrive folder HERE. The script loaded the data using the DIRECT option, i.e. straight into the ROS (Read Optimised Store), to avoid engaging the Tuple Mover – the database optimiser component that moves data from memory (WOS) to disk (ROS). The load times (over 1000 Mbps Ethernet), along with some other supporting information, for each of the three datasets are listed below.
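For anyone adapting the script to smaller, incremental loads, DIRECT can be swapped for AUTO or TRICKLE so that the WOS and Tuple Mover are engaged instead, and each COPY can be verified within the same session. The snippet below is a sketch based on my reading of the Vertica 9.x load methods; the file path is illustrative.
--AUTO lets Vertica pick WOS or ROS; TRICKLE targets the WOS for small, frequent loads
COPY tpc_ds.store_sales FROM LOCAL '/home/dbadmin/vertica_stuff/TPC_DS_Data/100GB/store_sales.dat' DELIMITER '|' AUTO;
--Within the same session, confirm how many rows the last COPY accepted and rejected
SELECT GET_NUM_ACCEPTED_ROWS(),
       GET_NUM_REJECTED_ROWS();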
Testing Methodology and Results
A full TPC-DS benchmark comprises 99 queries and is governed by a very specific set of rules. For brevity, this demo only includes 20 queries (query numbers 5, 9, 10, 13, 17, 24, 31, 33, 34, 35, 44, 46, 47, 54, 57, 64, 74, 75, 97 and 99) as a representative subset, for two reasons.
The hardware, database setup, data volumes etc. analysed do not follow the characteristics and requirements outlined by the TPC organisation and under these circumstances would be considered an ‘unacceptable consideration’. For example, a typical benchmark submitted by a vendor needs to include a number of metrics beyond query throughput e.g. a price-performance ratio, data load times, the availability date of the complete configuration etc. The following is a sample of a compliant reporting of TPC-DS results: ‘At 10GB the RALF/3000 Server has a TPC-DS Query-per-Hour metric of 3010 when run against a 10GB database yielding a TPC-DS Price/Performance of $1,202 per query-per-hour and will be available 1-Apr-06’. As this demo and the results below only go skin-deep i.e. query execution times, as previously stated, are only indicative of the performance level in the context of the same data and configuration used, an enterprise-ready deployment would yield different (better) results.
As this post is more for fun than science, running 20 queries seems adequate to gauge the performance dichotomies between the single-node and multi-node environments.
Also, it is worth mentioning that no performance optimisation (database or OS level) was performed on the system. Vertica allows its administrator(s) to fine-tune various aspects of its operation through mechanisms such as statistics updates, table partitioning, creating query- or workload-specific projections, tuning execution plans etc. Some of those techniques require a deep knowledge of Vertica’s query execution engine, while others can be achieved with little effort e.g. running DBD (Database Designer) – a GUI-based tool which analyses the logical schema definition, sample data and sample queries, and creates a physical schema in the form of a SQL script that you can deploy automatically. While the initial scope of this series was to compare query execution times across both versions (tuned and untuned), given that Vertica is purpose-built to run on a cluster of hosts, I decided to focus on contrasting single vs multi-node deployments instead. Therefore, the results are only indicative of the performance level in the context of the same data and configuration used, with a good chance that further tuning, tweaking or other optimisation techniques would have yielded much better performance outcomes.
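To give a flavour of the low-effort end of that tuning spectrum, the statements below sketch a statistics refresh and a simple query-specific projection. The column choices, sort order and segmentation are purely illustrative and, as stated above, none of this was applied in the tests.
--Refresh optimiser statistics for the largest fact table
SELECT ANALYZE_STATISTICS('tpc_ds.store_sales');
--A query-specific projection sorted for date/store filters and segmented across all nodes
CREATE PROJECTION tpc_ds.store_sales_by_date_store
AS
SELECT ss_sold_date_sk,
       ss_store_sk,
       ss_item_sk,
       ss_ext_sales_price,
       ss_net_profit
FROM   tpc_ds.store_sales
ORDER  BY ss_sold_date_sk,
          ss_store_sk
SEGMENTED BY Hash(ss_item_sk) ALL NODES;
--Populate the new projection from the existing data
SELECT REFRESH('tpc_ds.store_sales');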
OK, now with this little disclaimer out of the way, let’s look at how the first ten queries performed across the two distinct setups i.e. a single-node cluster and a multi-node (3 nodes) cluster. For the additional queries’ results as well as a quick look at Tableau and PowerBI performance, please refer to Part 4 of this series.
--Query 5
WITH ssr AS
(
SELECT s_store_id,
Sum(sales_price) AS sales,
Sum(profit) AS profit,
Sum(return_amt) AS returns1,
Sum(net_loss) AS profit_loss
FROM (
SELECT ss_store_sk AS store_sk,
ss_sold_date_sk AS date_sk,
ss_ext_sales_price AS sales_price,
ss_net_profit AS profit,
Cast(0 AS DECIMAL(7,2)) AS return_amt,
Cast(0 AS DECIMAL(7,2)) AS net_loss
FROM tpc_ds.store_sales
UNION ALL
SELECT sr_store_sk AS store_sk,
sr_returned_date_sk AS date_sk,
Cast(0 AS DECIMAL(7,2)) AS sales_price,
Cast(0 AS DECIMAL(7,2)) AS profit,
sr_return_amt AS return_amt,
sr_net_loss AS net_loss
FROM tpc_ds.store_returns ) salesreturns,
tpc_ds.date_dim,
tpc_ds.store
WHERE date_sk = d_date_sk
AND d_date BETWEEN Cast('2002-08-22' AS DATE) AND (
Cast('2002-08-22' AS DATE) + INTERVAL '14' day)
AND store_sk = s_store_sk
GROUP BY s_store_id) , csr AS
(
SELECT cp_catalog_page_id,
sum(sales_price) AS sales,
sum(profit) AS profit,
sum(return_amt) AS returns1,
sum(net_loss) AS profit_loss
FROM (
SELECT cs_catalog_page_sk AS page_sk,
cs_sold_date_sk AS date_sk,
cs_ext_sales_price AS sales_price,
cs_net_profit AS profit,
cast(0 AS decimal(7,2)) AS return_amt,
cast(0 AS decimal(7,2)) AS net_loss
FROM tpc_ds.catalog_sales
UNION ALL
SELECT cr_catalog_page_sk AS page_sk,
cr_returned_date_sk AS date_sk,
cast(0 AS decimal(7,2)) AS sales_price,
cast(0 AS decimal(7,2)) AS profit,
cr_return_amount AS return_amt,
cr_net_loss AS net_loss
FROM tpc_ds.catalog_returns ) salesreturns,
tpc_ds.date_dim,
tpc_ds.catalog_page
WHERE date_sk = d_date_sk
AND d_date BETWEEN cast('2002-08-22' AS date) AND (
cast('2002-08-22' AS date) + INTERVAL '14' day)
AND page_sk = cp_catalog_page_sk
GROUP BY cp_catalog_page_id) , wsr AS
(
SELECT web_site_id,
sum(sales_price) AS sales,
sum(profit) AS profit,
sum(return_amt) AS returns1,
sum(net_loss) AS profit_loss
FROM (
SELECT ws_web_site_sk AS wsr_web_site_sk,
ws_sold_date_sk AS date_sk,
ws_ext_sales_price AS sales_price,
ws_net_profit AS profit,
cast(0 AS decimal(7,2)) AS return_amt,
cast(0 AS decimal(7,2)) AS net_loss
FROM tpc_ds.web_sales
UNION ALL
SELECT ws_web_site_sk AS wsr_web_site_sk,
wr_returned_date_sk AS date_sk,
cast(0 AS decimal(7,2)) AS sales_price,
cast(0 AS decimal(7,2)) AS profit,
wr_return_amt AS return_amt,
wr_net_loss AS net_loss
FROM tpc_ds.web_returns
LEFT OUTER JOIN tpc_ds.web_sales
ON (
wr_item_sk = ws_item_sk
AND wr_order_number = ws_order_number) ) salesreturns,
tpc_ds.date_dim,
tpc_ds.web_site
WHERE date_sk = d_date_sk
AND d_date BETWEEN cast('2002-08-22' AS date) AND (
cast('2002-08-22' AS date) + INTERVAL '14' day)
AND wsr_web_site_sk = web_site_sk
GROUP BY web_site_id)
SELECT
channel ,
id ,
sum(sales) AS sales ,
sum(returns1) AS returns1 ,
sum(profit) AS profit
FROM (
SELECT 'store channel' AS channel ,
'store'
|| s_store_id AS id ,
sales ,
returns1 ,
(profit - profit_loss) AS profit
FROM ssr
UNION ALL
SELECT 'catalog channel' AS channel ,
'catalog_page'
|| cp_catalog_page_id AS id ,
sales ,
returns1 ,
(profit - profit_loss) AS profit
FROM csr
UNION ALL
SELECT 'web channel' AS channel ,
'web_site'
|| web_site_id AS id ,
sales ,
returns1 ,
(profit - profit_loss) AS profit
FROM wsr ) x
GROUP BY rollup (channel, id)
ORDER BY channel ,
id
LIMIT 100;
--Query 9
SELECT CASE
WHEN (SELECT Count(*)
FROM tpc_ds.store_sales
WHERE ss_quantity BETWEEN 1 AND 20) > 3672 THEN
(SELECT Avg(ss_ext_list_price)
FROM tpc_ds.store_sales
WHERE
ss_quantity BETWEEN 1 AND 20)
ELSE (SELECT Avg(ss_net_profit)
FROM tpc_ds.store_sales
WHERE ss_quantity BETWEEN 1 AND 20)
END bucket1,
CASE
WHEN (SELECT Count(*)
FROM tpc_ds.store_sales
WHERE ss_quantity BETWEEN 21 AND 40) > 3392 THEN
(SELECT Avg(ss_ext_list_price)
FROM tpc_ds.store_sales
WHERE
ss_quantity BETWEEN 21 AND 40)
ELSE (SELECT Avg(ss_net_profit)
FROM tpc_ds.store_sales
WHERE ss_quantity BETWEEN 21 AND 40)
END bucket2,
CASE
WHEN (SELECT Count(*)
FROM tpc_ds.store_sales
WHERE ss_quantity BETWEEN 41 AND 60) > 32784 THEN
(SELECT Avg(ss_ext_list_price)
FROM tpc_ds.store_sales
WHERE
ss_quantity BETWEEN 41 AND 60)
ELSE (SELECT Avg(ss_net_profit)
FROM tpc_ds.store_sales
WHERE ss_quantity BETWEEN 41 AND 60)
END bucket3,
CASE
WHEN (SELECT Count(*)
FROM tpc_ds.store_sales
WHERE ss_quantity BETWEEN 61 AND 80) > 26032 THEN
(SELECT Avg(ss_ext_list_price)
FROM tpc_ds.store_sales
WHERE
ss_quantity BETWEEN 61 AND 80)
ELSE (SELECT Avg(ss_net_profit)
FROM tpc_ds.store_sales
WHERE ss_quantity BETWEEN 61 AND 80)
END bucket4,
CASE
WHEN (SELECT Count(*)
FROM tpc_ds.store_sales
WHERE ss_quantity BETWEEN 81 AND 100) > 23982 THEN
(SELECT Avg(ss_ext_list_price)
FROM tpc_ds.store_sales
WHERE
ss_quantity BETWEEN 81 AND 100)
ELSE (SELECT Avg(ss_net_profit)
FROM tpc_ds.store_sales
WHERE ss_quantity BETWEEN 81 AND 100)
END bucket5
FROM tpc_ds.reason
WHERE r_reason_sk = 1;
--Query 10
SELECT cd_gender,
cd_marital_status,
cd_education_status,
Count(*) cnt1,
cd_purchase_estimate,
Count(*) cnt2,
cd_credit_rating,
Count(*) cnt3,
cd_dep_count,
Count(*) cnt4,
cd_dep_employed_count,
Count(*) cnt5,
cd_dep_college_count,
Count(*) cnt6
FROM tpc_ds.customer c,
tpc_ds.customer_address ca,
tpc_ds.customer_demographics
WHERE c.c_current_addr_sk = ca.ca_address_sk
AND ca_county IN ( 'Lycoming County', 'Sheridan County',
'Kandiyohi County',
'Pike County',
'Greene County' )
AND cd_demo_sk = c.c_current_cdemo_sk
AND EXISTS (SELECT *
FROM tpc_ds.store_sales,
tpc_ds.date_dim
WHERE c.c_customer_sk = ss_customer_sk
AND ss_sold_date_sk = d_date_sk
AND d_year = 2002
AND d_moy BETWEEN 4 AND 4 + 3)
AND ( EXISTS (SELECT *
FROM tpc_ds.web_sales,
tpc_ds.date_dim
WHERE c.c_customer_sk = ws_bill_customer_sk
AND ws_sold_date_sk = d_date_sk
AND d_year = 2002
AND d_moy BETWEEN 4 AND 4 + 3)
OR EXISTS (SELECT *
FROM tpc_ds.catalog_sales,
tpc_ds.date_dim
WHERE c.c_customer_sk = cs_ship_customer_sk
AND cs_sold_date_sk = d_date_sk
AND d_year = 2002
AND d_moy BETWEEN 4 AND 4 + 3) )
GROUP BY cd_gender,
cd_marital_status,
cd_education_status,
cd_purchase_estimate,
cd_credit_rating,
cd_dep_count,
cd_dep_employed_count,
cd_dep_college_count
ORDER BY cd_gender,
cd_marital_status,
cd_education_status,
cd_purchase_estimate,
cd_credit_rating,
cd_dep_count,
cd_dep_employed_count,
cd_dep_college_count
LIMIT 100;
--Query 13
SELECT Avg(ss_quantity),
Avg(ss_ext_sales_price),
Avg(ss_ext_wholesale_cost),
Sum(ss_ext_wholesale_cost)
FROM tpc_ds.store_sales,
tpc_ds.store,
tpc_ds.customer_demographics,
tpc_ds.household_demographics,
tpc_ds.customer_address,
tpc_ds.date_dim
WHERE s_store_sk = ss_store_sk
AND ss_sold_date_sk = d_date_sk
AND d_year = 2001
AND ( ( ss_hdemo_sk = hd_demo_sk
AND cd_demo_sk = ss_cdemo_sk
AND cd_marital_status = 'U'
AND cd_education_status = 'Advanced Degree'
AND ss_sales_price BETWEEN 100.00 AND 150.00
AND hd_dep_count = 3 )
OR ( ss_hdemo_sk = hd_demo_sk
AND cd_demo_sk = ss_cdemo_sk
AND cd_marital_status = 'M'
AND cd_education_status = 'Primary'
AND ss_sales_price BETWEEN 50.00 AND 100.00
AND hd_dep_count = 1 )
OR ( ss_hdemo_sk = hd_demo_sk
AND cd_demo_sk = ss_cdemo_sk
AND cd_marital_status = 'D'
AND cd_education_status = 'Secondary'
AND ss_sales_price BETWEEN 150.00 AND 200.00
AND hd_dep_count = 1 ) )
AND ( ( ss_addr_sk = ca_address_sk
AND ca_country = 'United States'
AND ca_state IN ( 'AZ', 'NE', 'IA' )
AND ss_net_profit BETWEEN 100 AND 200 )
OR ( ss_addr_sk = ca_address_sk
AND ca_country = 'United States'
AND ca_state IN ( 'MS', 'CA', 'NV' )
AND ss_net_profit BETWEEN 150 AND 300 )
OR ( ss_addr_sk = ca_address_sk
AND ca_country = 'United States'
AND ca_state IN ( 'GA', 'TX', 'NJ' )
AND ss_net_profit BETWEEN 50 AND 250 ) );
--Query 17
SELECT i_item_id,
i_item_desc,
s_state,
Count(ss_quantity) AS
store_sales_quantitycount,
Avg(ss_quantity) AS
store_sales_quantityave,
Stddev_samp(ss_quantity) AS
store_sales_quantitystdev,
Stddev_samp(ss_quantity) / Avg(ss_quantity) AS
store_sales_quantitycov,
Count(sr_return_quantity) AS
store_returns_quantitycount,
Avg(sr_return_quantity) AS
store_returns_quantityave,
Stddev_samp(sr_return_quantity) AS
store_returns_quantitystdev,
Stddev_samp(sr_return_quantity) / Avg(sr_return_quantity) AS
store_returns_quantitycov,
Count(cs_quantity) AS
catalog_sales_quantitycount,
Avg(cs_quantity) AS
catalog_sales_quantityave,
Stddev_samp(cs_quantity) / Avg(cs_quantity) AS
catalog_sales_quantitystdev,
Stddev_samp(cs_quantity) / Avg(cs_quantity) AS
catalog_sales_quantitycov
FROM tpc_ds.store_sales,
tpc_ds.store_returns,
tpc_ds.catalog_sales,
tpc_ds.date_dim d1,
tpc_ds.date_dim d2,
tpc_ds.date_dim d3,
tpc_ds.store,
tpc_ds.item
WHERE d1.d_quarter_name = '1999Q1'
AND d1.d_date_sk = ss_sold_date_sk
AND i_item_sk = ss_item_sk
AND s_store_sk = ss_store_sk
AND ss_customer_sk = sr_customer_sk
AND ss_item_sk = sr_item_sk
AND ss_ticket_number = sr_ticket_number
AND sr_returned_date_sk = d2.d_date_sk
AND d2.d_quarter_name IN ( '1999Q1', '1999Q2', '1999Q3' )
AND sr_customer_sk = cs_bill_customer_sk
AND sr_item_sk = cs_item_sk
AND cs_sold_date_sk = d3.d_date_sk
AND d3.d_quarter_name IN ( '1999Q1', '1999Q2', '1999Q3' )
GROUP BY i_item_id,
i_item_desc,
s_state
ORDER BY i_item_id,
i_item_desc,
s_state
LIMIT 100;
--Query 24
WITH ssales
AS (SELECT c_last_name,
c_first_name,
s_store_name,
ca_state,
s_state,
i_color,
i_current_price,
i_manager_id,
i_units,
i_size,
Sum(ss_net_profit) netpaid
FROM tpc_ds.store_sales,
tpc_ds.store_returns,
tpc_ds.store,
tpc_ds.item,
tpc_ds.customer,
tpc_ds.customer_address
WHERE ss_ticket_number = sr_ticket_number
AND ss_item_sk = sr_item_sk
AND ss_customer_sk = c_customer_sk
AND ss_item_sk = i_item_sk
AND ss_store_sk = s_store_sk
AND c_birth_country = Upper(ca_country)
AND s_zip = ca_zip
AND s_market_id = 6
GROUP BY c_last_name,
c_first_name,
s_store_name,
ca_state,
s_state,
i_color,
i_current_price,
i_manager_id,
i_units,
i_size)
SELECT c_last_name,
c_first_name,
s_store_name,
Sum(netpaid) paid
FROM ssales
WHERE i_color = 'papaya'
GROUP BY c_last_name,
c_first_name,
s_store_name
HAVING Sum(netpaid) > (SELECT 0.05 * Avg(netpaid)
FROM ssales);
WITH ssales
AS (SELECT c_last_name,
c_first_name,
s_store_name,
ca_state,
s_state,
i_color,
i_current_price,
i_manager_id,
i_units,
i_size,
Sum(ss_net_profit) netpaid
FROM tpc_ds.store_sales,
tpc_ds.store_returns,
tpc_ds.store,
tpc_ds.item,
tpc_ds.customer,
tpc_ds.customer_address
WHERE ss_ticket_number = sr_ticket_number
AND ss_item_sk = sr_item_sk
AND ss_customer_sk = c_customer_sk
AND ss_item_sk = i_item_sk
AND ss_store_sk = s_store_sk
AND c_birth_country = Upper(ca_country)
AND s_zip = ca_zip
AND s_market_id = 6
GROUP BY c_last_name,
c_first_name,
s_store_name,
ca_state,
s_state,
i_color,
i_current_price,
i_manager_id,
i_units,
i_size)
SELECT c_last_name,
c_first_name,
s_store_name,
Sum(netpaid) paid
FROM ssales
WHERE i_color = 'chartreuse'
GROUP BY c_last_name,
c_first_name,
s_store_name
HAVING Sum(netpaid) > (SELECT 0.05 * Avg(netpaid)
FROM ssales);
--Query 31
WITH ss
AS (SELECT ca_county,
d_qoy,
d_year,
Sum(ss_ext_sales_price) AS store_sales
FROM tpc_ds.store_sales,
tpc_ds.date_dim,
tpc_ds.customer_address
WHERE ss_sold_date_sk = d_date_sk
AND ss_addr_sk = ca_address_sk
GROUP BY ca_county,
d_qoy,
d_year),
ws
AS (SELECT ca_county,
d_qoy,
d_year,
Sum(ws_ext_sales_price) AS web_sales
FROM tpc_ds.web_sales,
tpc_ds.date_dim,
tpc_ds.customer_address
WHERE ws_sold_date_sk = d_date_sk
AND ws_bill_addr_sk = ca_address_sk
GROUP BY ca_county,
d_qoy,
d_year)
SELECT ss1.ca_county,
ss1.d_year,
ws2.web_sales / ws1.web_sales web_q1_q2_increase,
ss2.store_sales / ss1.store_sales store_q1_q2_increase,
ws3.web_sales / ws2.web_sales web_q2_q3_increase,
ss3.store_sales / ss2.store_sales store_q2_q3_increase
FROM ss ss1,
ss ss2,
ss ss3,
ws ws1,
ws ws2,
ws ws3
WHERE ss1.d_qoy = 1
AND ss1.d_year = 2001
AND ss1.ca_county = ss2.ca_county
AND ss2.d_qoy = 2
AND ss2.d_year = 2001
AND ss2.ca_county = ss3.ca_county
AND ss3.d_qoy = 3
AND ss3.d_year = 2001
AND ss1.ca_county = ws1.ca_county
AND ws1.d_qoy = 1
AND ws1.d_year = 2001
AND ws1.ca_county = ws2.ca_county
AND ws2.d_qoy = 2
AND ws2.d_year = 2001
AND ws1.ca_county = ws3.ca_county
AND ws3.d_qoy = 3
AND ws3.d_year = 2001
AND CASE
WHEN ws1.web_sales > 0 THEN ws2.web_sales / ws1.web_sales
ELSE NULL
END > CASE
WHEN ss1.store_sales > 0 THEN
ss2.store_sales / ss1.store_sales
ELSE NULL
END
AND CASE
WHEN ws2.web_sales > 0 THEN ws3.web_sales / ws2.web_sales
ELSE NULL
END > CASE
WHEN ss2.store_sales > 0 THEN
ss3.store_sales / ss2.store_sales
ELSE NULL
END
ORDER BY ss1.d_year;
--Query 33
WITH ss
AS (SELECT i_manufact_id,
Sum(ss_ext_sales_price) total_sales
FROM tpc_ds.store_sales,
tpc_ds.date_dim,
tpc_ds.customer_address,
tpc_ds.item
WHERE i_manufact_id IN (SELECT i_manufact_id
FROM tpc_ds.item
WHERE i_category IN ( 'Books' ))
AND ss_item_sk = i_item_sk
AND ss_sold_date_sk = d_date_sk
AND d_year = 1999
AND d_moy = 3
AND ss_addr_sk = ca_address_sk
AND ca_gmt_offset = -5
GROUP BY i_manufact_id),
cs
AS (SELECT i_manufact_id,
Sum(cs_ext_sales_price) total_sales
FROM tpc_ds.catalog_sales,
tpc_ds.date_dim,
tpc_ds.customer_address,
tpc_ds.item
WHERE i_manufact_id IN (SELECT i_manufact_id
FROM tpc_ds.item
WHERE i_category IN ( 'Books' ))
AND cs_item_sk = i_item_sk
AND cs_sold_date_sk = d_date_sk
AND d_year = 1999
AND d_moy = 3
AND cs_bill_addr_sk = ca_address_sk
AND ca_gmt_offset = -5
GROUP BY i_manufact_id),
ws
AS (SELECT i_manufact_id,
Sum(ws_ext_sales_price) total_sales
FROM tpc_ds.web_sales,
tpc_ds.date_dim,
tpc_ds.customer_address,
tpc_ds.item
WHERE i_manufact_id IN (SELECT i_manufact_id
FROM tpc_ds.item
WHERE i_category IN ( 'Books' ))
AND ws_item_sk = i_item_sk
AND ws_sold_date_sk = d_date_sk
AND d_year = 1999
AND d_moy = 3
AND ws_bill_addr_sk = ca_address_sk
AND ca_gmt_offset = -5
GROUP BY i_manufact_id)
SELECT i_manufact_id,
Sum(total_sales) total_sales
FROM (SELECT *
FROM ss
UNION ALL
SELECT *
FROM cs
UNION ALL
SELECT *
FROM ws) tmp1
GROUP BY i_manufact_id
ORDER BY total_sales
LIMIT 100;
--Query 34
SELECT c_last_name,
c_first_name,
c_salutation,
c_preferred_cust_flag,
ss_ticket_number,
cnt
FROM (SELECT ss_ticket_number,
ss_customer_sk,
Count(*) cnt
FROM tpc_ds.store_sales,
tpc_ds.date_dim,
tpc_ds.store,
tpc_ds.household_demographics
WHERE store_sales.ss_sold_date_sk = date_dim.d_date_sk
AND store_sales.ss_store_sk = store.s_store_sk
AND store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk
AND ( date_dim.d_dom BETWEEN 1 AND 3
OR date_dim.d_dom BETWEEN 25 AND 28 )
AND ( household_demographics.hd_buy_potential = '>10000'
OR household_demographics.hd_buy_potential = 'unknown' )
AND household_demographics.hd_vehicle_count > 0
AND ( CASE
WHEN household_demographics.hd_vehicle_count > 0 THEN
household_demographics.hd_dep_count /
household_demographics.hd_vehicle_count
ELSE NULL
END ) > 1.2
AND date_dim.d_year IN ( 1999, 1999 + 1, 1999 + 2 )
AND store.s_county IN ( 'Williamson County', 'Williamson County',
'Williamson County',
'Williamson County'
,
'Williamson County', 'Williamson County',
'Williamson County',
'Williamson County'
)
GROUP BY ss_ticket_number,
ss_customer_sk) dn,
tpc_ds.customer
WHERE ss_customer_sk = c_customer_sk
AND cnt BETWEEN 15 AND 20
ORDER BY c_last_name,
c_first_name,
c_salutation,
c_preferred_cust_flag DESC;
--Query 35
SELECT ca_state,
cd_gender,
cd_marital_status,
cd_dep_count,
Count(*) cnt1,
Stddev_samp(cd_dep_count),
Avg(cd_dep_count),
Max(cd_dep_count),
cd_dep_employed_count,
Count(*) cnt2,
Stddev_samp(cd_dep_employed_count),
Avg(cd_dep_employed_count),
Max(cd_dep_employed_count),
cd_dep_college_count,
Count(*) cnt3,
Stddev_samp(cd_dep_college_count),
Avg(cd_dep_college_count),
Max(cd_dep_college_count)
FROM tpc_ds.customer c,
tpc_ds.customer_address ca,
tpc_ds.customer_demographics
WHERE c.c_current_addr_sk = ca.ca_address_sk
AND cd_demo_sk = c.c_current_cdemo_sk
AND EXISTS (SELECT *
FROM tpc_ds.store_sales,
tpc_ds.date_dim
WHERE c.c_customer_sk = ss_customer_sk
AND ss_sold_date_sk = d_date_sk
AND d_year = 2001
AND d_qoy < 4)
AND ( EXISTS (SELECT *
FROM tpc_ds.web_sales,
tpc_ds.date_dim
WHERE c.c_customer_sk = ws_bill_customer_sk
AND ws_sold_date_sk = d_date_sk
AND d_year = 2001
AND d_qoy < 4)
OR EXISTS (SELECT *
FROM tpc_ds.catalog_sales,
tpc_ds.date_dim
WHERE c.c_customer_sk = cs_ship_customer_sk
AND cs_sold_date_sk = d_date_sk
AND d_year = 2001
AND d_qoy < 4) )
GROUP BY ca_state,
cd_gender,
cd_marital_status,
cd_dep_count,
cd_dep_employed_count,
cd_dep_college_count
ORDER BY ca_state,
cd_gender,
cd_marital_status,
cd_dep_count,
cd_dep_employed_count,
cd_dep_college_count
LIMIT 100;
I decided to break up the queries’ performance analysis into two parts. Further TPC-DS query results, along with a quick look at how Vertica plays with Tableau and PowerBI, can be viewed in Part 4 of this series, but we can already see the execution pattern and how data volumes and their distribution across multiple nodes impact performance. Again, please refer to Post 4 for the remaining ten queries’ performance results and my final comments on this experiment.