Vertica MPP Database Overview and TPC-DS Benchmark Performance Analysis (Part 2)

February 1st, 2018 / 1 Comment » / by admin

In Post 1 I outlined the key architectural principles behind Vertica’s design and what makes it one of the top MPP/analytical databases available today. In this instalment, I will go through the database installation process (across single and multiple nodes), some of the issues I encountered (specific to the off-the-shelf hardware I used), as well as some other titbits regarding subsequent configuration and tuning. For a TLDR version of the Vertica installation process you can watch the following video instead.

Vertica can be installed on any compatible x86 architecture hardware, so for the purpose of this demo I decided to dust off a few of my old and trusty Lenovo m92p units. Even though these machines are typically used to run desktop-class applications in an office environment, they should still provide a decent evaluation platform if one wants to kick the tires on the Vertica installation and configuration process. Heck, if Greenplum is proud to run their MPP database on a cluster of Raspberry Pis (check out the video HERE), why not run Vertica on a stack of low-powered Lenovo desktops under Ubuntu Server. Below is a photo of the actual machines running Vertica version 9.1, each of the three nodes (bottom three units) fitted with an Intel Core i5-3470 CPU, 16GB of DDR3 memory and a 512GB Crucial MX300 SSD.

Installing Vertica

Before installing Vertica, the following key points should be taken into consideration.

  • Only one database instance can run per cluster. So, if we were to provision a three-node cluster, all three nodes will be dedicated to that single database.
  • Only one instance of Vertica is allowed to run on a host at any time.
  • Only the root user, or a user with sudo privileges, can run the install_vertica script.

The Vertica installation process also imposes some rigid requirements around available swap space, dynamic CPU scaling, network configuration, disk space and memory. As these are quite comprehensive in terms of the actual Linux-side configuration, I won’t go into detail on how each of these requirements and parameters needs to be tuned and adjusted (beyond the issues I encountered and the fixes I applied), as Vertica’s documentation, available on their website, outlines all of these options in great detail.

Below is a step-by-step overview of how to install Vertica (single-node configuration) on Ubuntu Server 14.04 LTS, including some of the problems encountered and the fixes implemented to rectify them. I will also provide details on how to add nodes to an existing installation towards the end of this post.

To install Vertica as a single-node cluster, the following steps apply:

  • Download the Vertica installation package from the vendor’s website according to the Linux OS applicable to your environment. In my case, I used the Ubuntu Server 14.04 LTS release, so the majority of the steps below apply to this particular Linux distribution.
  • Disable firewall.
    sudo ufw disable 
    
  • Install supporting packages.
    #required for interactivity with Administration Tools
    sudo apt-get install dialog
    #required for Administration Tools connectivity between nodes
    #(note: on Ubuntu the package is openssh-server rather than openssh)
    sudo apt-get install openssh-server
    
  • Install the downloaded package using the standard dpkg command.
    sudo dpkg -i /pathname/vertica_x.x.x.x.x.deb
    
  • Install Vertica by running the install_vertica script with the required parameters supplied.
    /opt/vertica/sbin/install_vertica --hosts node0001,node0002,node0003 \
    --deb /tmp/vertica_9.0.x.x86_64.deb \
    --dba-user dbadmin
    

    In my case the installation failed, reporting the following errors (the last three FAIL statements).

    The first issue related to the system reporting invalid CPU scaling. The installer only allows CPU frequency scaling to be enabled when the cpufreq scaling governor is set to performance. If the CPU scaling governor is set to ondemand and ignore_nice_load is 1 (true), the installer fails with error S0140. CPU scaling is a hardware and software feature that helps computers conserve energy by slowing the processor down when the system load is low, and speeding it up again when the system load increases. Typically, you disable CPU scaling in the host’s BIOS, but as was the case with my machines, tweaking those settings did nothing to discourage the Vertica installer from throwing the error, presumably because I was running the ‘T’ version of the Intel CPU, i.e. ‘power-optimised lifestyle’. The next best thing was to handle it through the Linux kernel by setting the CPU frequency governor to always run the CPU at full speed.

    This can be fixed by installing the cpufrequtils package and manually changing the CPU governor’s settings.

    sudo apt-get install cpufrequtils
    sudo cpufreq-set -c 0 -g performance
    sudo cpufreq-set -c 1 -g performance
    sudo cpufreq-set -c 2 -g performance
    sudo cpufreq-set -c 3 -g performance
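
    These settings do not persist across reboots. Assuming the stock Ubuntu/Debian cpufrequtils init script (which reads its defaults from /etc/default/cpufrequtils), the governor can be pinned permanently like so:

    echo 'GOVERNOR="performance"' | sudo tee /etc/default/cpufrequtils
    sudo /etc/init.d/cpufrequtils restart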
    

    Next, the installer raised an issue with the ntp daemon not running. The Network Time Protocol (NTP) daemon must be running on all of the nodes in the cluster so that their clocks are synchronised for timing purposes. By default, the NTP daemon is not installed on some Ubuntu and Debian systems. First, let’s install it and start the NTP process.

    sudo apt-get install ntp
    sudo /etc/init.d/ntp reload
    

    To verify that the Network Time Protocol daemon is operating correctly, we can issue the following command on all nodes in the cluster.

    ntpq -c rv | grep stratum
    

    A stratum level of 16 indicates that NTP is not synchronising correctly. If a stratum level of 16 is detected, wait 15 minutes and issue the command again. If NTP continues to report a stratum level of 16, verify that the NTP port (UDP port 123) is open on all firewalls between the cluster and the remote machine to which you are attempting to synchronise.
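
    To quickly confirm the daemon is actually bound to that port on a given node, checking for a listening UDP socket on port 123 makes for a handy sanity test (a sketch using the net-tools netstat shipped with Ubuntu 14.04):

    sudo netstat -ulnp | grep ':123'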

    Finally, the installer also complained that transparent hugepages was set to ‘always’, whereas it must be set to ‘never’ or ‘madvise’. This can be rectified by editing the boot loader configuration (for example /etc/grub.conf) or editing /etc/rc.local (on a system that supports rc.local) and adding the following script.

    if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
        echo never > /sys/kernel/mm/transparent_hugepage/enabled
    fi
    

    Regardless of the approach, you must reboot your system for the settings to take effect, or run the following echo line as root to proceed with the install without rebooting.

    sudo su
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
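
    To confirm the change took effect, read the same sysfs file back – the active value is the one displayed in square brackets:

    cat /sys/kernel/mm/transparent_hugepage/enabled
    #expected output once disabled: always madvise [never]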
    

    Next, when trying to run the installation script again, another issue (this time a warning) about swappiness came up as per the image below.

    The swappiness kernel parameter defines how aggressively the kernel copies RAM contents to swap space. Vertica recommends a value of 1. To fix this issue we can run the following command to alter the parameter value.

    echo 1 > /proc/sys/vm/swappiness
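
    As with the CPU governor, a value echoed into /proc does not survive a reboot. To make the change permanent, append the setting to /etc/sysctl.conf and reload:

    echo 'vm.swappiness = 1' | sudo tee -a /etc/sysctl.conf
    sudo sysctl -p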
    

    Fortunately, after rectifying the swappiness issue the rest of the installation went trouble-free.

  • After we have installed Vertica on all desired nodes, it’s time to create a database. Log in as the new user (dbadmin in default scenarios) and connect to the admin panel by running the following command.
    /opt/vertica/bin/adminTools
    
  • If you are connecting to admin tools for the first time, you will be prompted for a license key. Since we’re using the community edition we will just leave it blank and click OK.
  • After accepting the EULA we can navigate to the Administration Tools | Configuration Menu | Create Database menu and provide the database name and password.
  • Once the database is created we can connect to it using the vsql tool to perform admin tasks. Vertica also allows administrators to invoke most Administration Tools from the command line or a shell script. We can list all available tools, with their commands and options shown in individual help text, as per below.
    $ admintools -a
    Usage:
        adminTools [-t | --tool] toolName [options]
    Valid tools are:
                    command_host
                    connect_db
                    create_db
                    database_parameters
                    db_add_node
                    db_remove_node
                    db_replace_node
                    db_status
                    distribute_config_files
                    drop_db
                    host_to_node
                    install_package
                    install_procedure
                    kill_host
                    kill_node
                    license_audit
                    list_allnodes
                    list_db
                    list_host
                    list_node
                    list_packages
                    logrotate
                    node_map
                    re_ip
                    rebalance_data
                    restart_db
                    restart_node
                    return_epoch
                    revive_db
                    set_restart_policy
                    set_ssl_params
                    show_active_db
                    start_db
                    stop_db
                    stop_host
                    stop_node
                    uninstall_package
                    upgrade_license_key
                    view_cluster
    

    Each tool comes with its own parameters. For example, the following usage text displays the command options for the command_host, connect_db and create_db tools.

    Usage: command_host [options]
    
    Options:
      -h, --help            show this help message and exit
      -c CMD, --command=CMD
                            Command to run
    -------------------------------------------------------------------------
    Usage: connect_db [options]
    
    Options:
      -h, --help            show this help message and exit
      -d DB, --database=DB  Name of database to connect
      -p DBPASSWORD, --password=DBPASSWORD
                            Database password in single quotes
    -------------------------------------------------------------------------
    Usage: create_db [options]
    
    Options:
      -h, --help            show this help message and exit
      -D DATA, --data_path=DATA
                            Path of data directory[optional] if not using compat21
      -c CATALOG, --catalog_path=CATALOG
                            Path of catalog directory[optional] if not using
                            compat21
      --compat21            (deprecated) Use Vertica 2.1 method using node names
                            instead of hostnames
      -d DB, --database=DB  Name of database to be created
      -l LICENSEFILE, --license=LICENSEFILE
                            Database license [optional]
      -p DBPASSWORD, --password=DBPASSWORD
                            Database password in single quotes [optional]
      -P POLICY, --policy=POLICY
                            Database restart policy [optional]
      -s NODES, --hosts=NODES
                            comma-separated list of hosts to participate in
                            database
      --skip-fs-checks      Skip file system checks while creating a database (not
                            recommended).
    
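
    Putting it together, the same three-node database created earlier through the menus could be scripted non-interactively as follows – a sketch assuming the database name tpcds and the node names used during installation:

    admintools -t create_db -d tpcds -p 'password' -s node0001,node0002,node0003
    admintools -t list_allnodes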

The Management Console provides some, but not all, of the functionality of the Administration Tools. In addition, the MC provides extended functionality not available in the Administration Tools, such as a graphical view of your Vertica database and detailed monitoring charts and graphs.

To install MC, download the MC package from the myVertica portal and save it to a location on the target server, such as /tmp. On the target server, log in as root or a user with sudo privileges, change directory to the location where you saved the MC package and install MC using your local Linux distribution’s package management system (for example rpm, yum, zypper, apt, dpkg). In my case, since I run Ubuntu Server, that looks like the following.

sudo dpkg -i vertica-console-<current-version>.deb

Next, open a browser and enter the IP address or host name of the server on which you installed MC, as well as the default MC port 5450. For example, you’ll enter one of the following.

https://xx.xx.xx.xx:5450/ 
https://hostname:5450/

When the Configuration Wizard dialog box appears, proceed to Configuring Management Console.

Vertica, being a distributed database, stores and retrieves data from multiple nodes in a typical setup. There are many reasons for adding one or more nodes to a Vertica installation, e.g. making a database K-safe, swapping nodes for maintenance, or replacing/removing a node due to malfunctioning hardware. To add a new node, we can use the update_vertica script. However, before adding a node to an existing cluster, the following prerequisites have to be kept in mind.

  • Make sure that the database is running.
  • The newly added node should be reachable by all existing nodes in the cluster.
  • Generally, it is not necessary to shut down the Vertica database for expansion, but a shutdown is required if we are expanding from a single-node cluster.

Adding a new node is as simple as running the aforementioned update_vertica script, e.g.

/opt/vertica/sbin/update_vertica --add-hosts host(s) --deb package

The update_vertica script accepts all the same options as the install_vertica script and, beyond installing Vertica on the new host, it performs post-installation checks, modifies spread to encompass the larger cluster and configures the Admin Tools to work with the larger cluster.
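
For example, adding a fourth node to the three-node demo cluster described earlier would look something like the following (the new host name is hypothetical):

/opt/vertica/sbin/update_vertica --add-hosts node0004 --deb /tmp/vertica_9.0.x.x86_64.deb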

Once you have added one or more hosts to the cluster, you can add them as nodes to the database. This can be accomplished using either the Management Console or the Administration Tools interface, or alternatively the admintools command line (if you need to preserve the specific order of the nodes you add). As adding nodes in the MC is GUI-driven and thus very straightforward, the following process outlines how to use the Administration Tools instead.

  • Open the Administration Tools.
  • On the Main Menu, select View Database Cluster State to verify that the database is running. If it is not, start it.
  • From the Main Menu, select Advanced Tools Menu and click OK.
  • In the Advanced Menu, select Cluster Management and click OK.
  • In the Cluster Management menu, select Add Host(s) and click OK.
  • Select the database to which you want to add or remove hosts, and then select OK. A list of unused hosts is displayed.
  • Select the hosts you want to add to the database and click OK.
  • When prompted, click Yes to confirm that you want to add the hosts.
  • When prompted, enter the password for the database and then select OK.
  • When prompted that the hosts were successfully added, select OK.
  • Vertica now automatically starts the rebalancing process to populate the new node with data. When prompted, enter the path to a temporary directory that the Database Designer can use to rebalance the data in the database and select OK.
  • Either press enter to accept the default K-safety value, or enter a new higher value for the database and select OK.
  • Select whether Vertica should immediately start rebalancing the database, or whether it should create a script to rebalance the database later.
  • Review the summary of the rebalancing process and select Proceed.
  • If you chose to automatically rebalance, the rebalance process runs. If you chose to create a script, the script is generated and saved. In either case, you are shown a success screen, and prompted to select OK to end the Add Node process.

Since this demo uses three hosts (the maximum allowed to kick the tires and try Vertica without an enterprise license), once configured and set up, this is what the cluster looks like in the Management Console (database and hosts views).

Aside from the Management Console interface, various parts of a Vertica database can be administered and monitored via the system tables and log files. Vertica provides an API (application programming interface) for monitoring various features and functions within a database in the form of system tables. These tables provide a robust, stable set of views that let you monitor information about your system’s resources, background processes, workload, and performance, allowing you to more efficiently profile and diagnose your system, and to view historical data on load streams, query profiles, Tuple Mover operations, and more. Because Vertica collects and retains this information automatically, you don’t have to manually set anything. You can write queries against system tables with full SELECT support, the same way you perform query operations on base and temporary tables. You can query system tables using expressions, predicates, aggregates, analytics, subqueries, and joins. You can also save system table query results into a user table for future analysis.

System tables are grouped into the following schemas:

  • V_CATALOG – information about persistent objects in the catalog
  • V_MONITOR – information about transient system state

These schemas reside in the default search path, so there is no need to specify schema.table in your queries unless you change the search path to exclude V_MONITOR or V_CATALOG (or both). Most of the tables are grouped into the following areas:

  • System information
  • System resources
  • Background processes
  • Workload and performance

The following two queries outline metadata information for two tables – databases and cpu_usage – from their respective v_catalog and v_monitor schemas.
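
The original query screenshots are not reproduced here, but the same metadata can be pulled from the SYSTEM_TABLES catalog view – a minimal sketch, assuming a vsql session on the running database:

vsql -c "SELECT table_schema, table_name, table_description
         FROM v_catalog.system_tables
         WHERE table_name IN ('databases', 'cpu_usage');"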

Vertica also collects and logs various system events, which can help with troubleshooting and general system performance analysis. The events are collected in the following ways.

  • In the Vertica.log file
  • In the ACTIVE_EVENTS system table
  • Using SNMP
  • Using Syslog
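
Of these, the ACTIVE_EVENTS system table is the easiest to inspect ad hoc – a minimal sketch, assuming the column names documented for V_MONITOR.ACTIVE_EVENTS:

vsql -c "SELECT node_name, event_code, event_severity, event_problem_description
         FROM v_monitor.active_events;"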

This concludes the Vertica installation process. Even though I came across a few issues, these were mainly specific to the hardware configuration I used for this demo; in a ‘real-world scenario’ the Vertica installation process is usually trouble-free and can be accomplished in less than 30 minutes. In the next post I will go through the data used for the TPC-DS benchmark testing, the loading mechanism, some of the performance considerations imposed by its volume and the hardware specifications used, and provide an overview of the results of the first ten queries I ran. For the remaining ten queries’ results as well as PowerBI and Tableau interactions please see the final instalment of this series – post 4 HERE.


TPC-DS big data benchmark overview – how to generate and load sample data

January 11th, 2018 / 6 Comments » / by admin

Ever wanted to benchmark your data warehouse and see how it stacks up against the competition? Ever wanted to have a point of reference which would give you an industry-applicable and measurable set of metrics you could use to objectively quantify your data warehouse or decision support system’s performance? And most importantly, have you ever considered which platform and vendor will provide you with the most bang-for-your-buck setup?

Well, arguably one of the best ways to find out is to simulate a suite of TPC-DS workloads (queries and data maintenance) against your target platform to obtain a representative evaluation of the System Under Test’s (SUT) performance as a general-purpose decision support system. There are a few other data sets and methodologies that can be used to gauge a data storage and processing system’s performance, e.g. the TLC Trip Record Data released by the New York City Taxi & Limousine Commission has gained a lot of traction amongst big data specialists; however, TPC-approved benchmarks have long been considered the most objective, vendor-agnostic way to perform data-specific hardware and software comparisons, capturing the complexity of modern business processing to a much greater extent than their predecessors.

In this post I will explore how to generate test data and test queries using the dsdgen and dsqgen utilities on a Windows machine against the product supplier snowflake-type schema, as well as how to load the test data into the created database in order to run some or all of the 99 queries TPC-DS defines. I don’t want to go too much into detail as TPC already provides a very comprehensive overview of the database and schema, along with a detailed description of constraints and assumptions. This post is more focused on how to generate the components required to evaluate the target platform, i.e. the database schema, the TPC-DS data sets and the data loading.

In this post I will be referring to TPC-DS version 2.5.0rc2, the most current version at the time of publication. The download link can be found on the consortium’s website. The key components used for a system’s evaluation using the TPC-DS tools include the following:

  • A data generator called dsdgen used for data set creation
  • A query generator called dsqgen used for creating benchmark query sets
  • Three SQL files called tpcds.sql, tpcds_source.sql and tpcds_ri.sql which create a sample implementation of the logical schema for the data warehouse

There are other components present in the toolkit; however, for the sake of brevity I will not be discussing the data maintenance functionality, verification data, alternative query sets or verification queries in this post. Let’s start with the dsdgen tool – an executable used to create raw data sets in the form of files. The dsdgen solution supplied as part of the download needs to be compiled before use. For me it was as simple as firing up Microsoft Visual Studio, loading up the dbgen2 solution and building the final executable. Once compiled, the tool can be controlled from the command line by a combination of switches and environment variables, assumed to be single flags preceded by a minus sign, optionally followed by an argument. In its simplest form of execution, dsdgen can be called from PowerShell, e.g. I used the following combination of flags to generate the required data (1GB in size) in the C:\TPCDSdata\SourceFiles\ folder and to measure execution time.

Measure-Command {.\dsdgen /SCALE 1 /DIR C:\TPCDSdata\SourceFiles /FORCE /VERBOSE}

The captured output below shows that the execution time on my old Lenovo laptop (i7-4600U @ 2.1GHz, 12GB RAM, SSD HDD & Win10 Pro) was close to 10 minutes for 1GB of data generated sequentially.

The process can be sped up using the PARALLEL flag, which engages multiple CPU cores and distributes the workload across multiple processes. The VERBOSE parameter is pretty much self-explanatory – progress messages are displayed as the data is generated. The FORCE flag overwrites previously created files.
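
With PARALLEL, each invocation produces one chunk of the full data set, selected with the accompanying CHILD flag, so the chunks can be generated concurrently – a sketch for four streams, each line run in its own shell/process:

.\dsdgen /SCALE 1 /DIR C:\TPCDSdata\SourceFiles /PARALLEL 4 /CHILD 1 /FORCE
.\dsdgen /SCALE 1 /DIR C:\TPCDSdata\SourceFiles /PARALLEL 4 /CHILD 2 /FORCE
.\dsdgen /SCALE 1 /DIR C:\TPCDSdata\SourceFiles /PARALLEL 4 /CHILD 3 /FORCE
.\dsdgen /SCALE 1 /DIR C:\TPCDSdata\SourceFiles /PARALLEL 4 /CHILD 4 /FORCE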

Finally, the SCALE parameter requires a little more explanation. The TPC-DS benchmark defines a set of discrete scaling points – scale factors – based on the approximate size of the raw data produced by dsdgen. The scale factors defined for TPC-DS are 1TB, 3TB, 10TB, 30TB and 100TB, where a terabyte (TB) is defined as 2^40 bytes. The required row count for each permissible scale factor and each table in the test database is as follows.

Once the data has been generated, we can create a set of scripts used to simulate reporting/analytical business scenarios using another tool in the TPC-DS arsenal – dsqgen. As with the previous executable, dsqgen runs off the command line. Passing an applicable query template (in this case Microsoft SQL Server specific), it creates a single SQL file containing queries used across a multitude of scenarios, e.g. reporting queries, ad hoc queries, interactive OLAP queries and data mining queries. I used the following command to generate the SQL scripts file.

dsqgen /input .\query_templates\templates.lst /directory .\query_templates /dialect sqlserver /scale 1

Before I get to the data loading script and schema overview, a quick mention of the final set of SQL scripts: the tpcds.sql file, which creates a sample implementation of the logical schema for the data warehouse. This file contains the necessary DDLs, so it should be executed on the environment of your choice before the generated data is loaded into the data warehouse. Also, since the data files generated by the dsdgen utility come with a *.dat extension, it may be worthwhile to do some cleaning up before the target schema is populated. The following butchered Python script converts all *.dat files into *.csv files and breaks up some of the larger ones into a collection of smaller ones suffixed in an orderly fashion, i.e. filename_1, filename_2, filename_3 etc., based on the values set for some of the variables declared at the start. This step is entirely optional, but I found that some system utilities used for data loading find it difficult to deal with large files, e.g. at a scaling factor greater than 100GB. Splitting a large file into a collection of smaller ones may be an easy loading-speed optimisation, especially if the process can be parallelised.

from os import listdir, rename, remove, path, walk
from shutil import copy2, move
from csv import reader, writer
import win32file as win  # pywin32 - used to query free disk space
from tqdm import tqdm    # progress bar for the file-splitting loop

srcFilesLocation = path.normpath('C:/TPCDSdata/SourceFiles/') 
srcCSVFilesLocation = path.normpath('C:/TPCDSdata/SourceCSVFiles/') 
tgtFilesLocation = path.normpath('C:/TPCDSdata/TargetFiles/')    
delimiter = '|'
keepSrcFiles = True
keepSrcCSVFiles = True
rowLimit = 2000000
outputNameTemplate = '_%s.csv'
keepHeaders = False
            
def getSize(start_path):
    total_size = 0
    for dirpath, dirnames, filenames in walk(start_path):
        for f in filenames:
            fp = path.join(dirpath, f)
            total_size += path.getsize(fp)            
    return total_size 

def freeSpace(start_path):
    secsPerClus, bytesPerSec, nFreeClus, totClus = win.GetDiskFreeSpace(start_path)
    return secsPerClus * bytesPerSec * nFreeClus

def removeFiles(DirPath):
    filelist = [ f for f in listdir(DirPath)]
    for f in filelist:
        remove(path.join(DirPath, f))

def SplitAndMoveLargeFiles(file, rowCount, srcCSVFilesLocation, outputNameTemplate, keepHeaders=False):
    filehandler = open(path.join(srcCSVFilesLocation, file), 'r')
    csv_reader = reader(filehandler, delimiter=delimiter)
    current_piece = 1
    current_out_path = path.join(tgtFilesLocation, path.splitext(file)[0]+outputNameTemplate  % current_piece)
    current_out_writer = writer(open(current_out_path, 'w', newline=''), delimiter=delimiter)
    current_limit = rowLimit
    if keepHeaders:
        headers = next(csv_reader)
        current_out_writer.writerow(headers)
    pbar=tqdm(total=rowCount)        
    for i, row in enumerate(csv_reader):      
        pbar.update()             
        if i + 1 > current_limit:
            current_piece += 1
            current_limit = rowLimit * current_piece
            current_out_path = path.join(tgtFilesLocation, path.splitext(file)[0]+outputNameTemplate  % current_piece)
            current_out_writer = writer(open(current_out_path, 'w', newline=''), delimiter=delimiter)
            if keepHeaders:
                current_out_writer.writerow(headers)          
        current_out_writer.writerow(row)
    pbar.close()

def SourceFilesRename(srcFilesLocation, srcCSVFilesLocation):
    srcDirSize = getSize(srcFilesLocation)
    diskSize = freeSpace(srcFilesLocation)
    removeFiles(srcCSVFilesLocation)
    for file in listdir(srcFilesLocation):
        if file.endswith(".dat"):
            if keepSrcFiles:
                if srcDirSize >= diskSize:
                    print ('''Not enough space on the nominated disk to duplicate the files (you nominated to keep source files as they were generated i.e. with the .dat extension). 
                    Current source files directory size is {} MB. At least {} MB required to continue. Bailing out...'''
                    .format(round(srcDirSize/1024/1024, 2), round(2*srcDirSize/1024/1024, 2)))
                    exit(1)
                else:                    
                    copy2(path.join(srcFilesLocation , file), srcCSVFilesLocation)
                    rename(path.join(srcCSVFilesLocation , file), path.join(srcCSVFilesLocation , file[:-4]+'.csv'))
            else:
                move(path.join(srcFilesLocation, file), srcCSVFilesLocation)
                rename(path.join(srcCSVFilesLocation, file), path.join(srcCSVFilesLocation, file[:-4]+'.csv'))

def ProcessLargeFiles(srcCSVFilesLocation,outputNameTemplate,rowLimit):
    fileRowCounts = []
    for file in listdir(srcCSVFilesLocation):
        if file.endswith(".csv"):
            fileRowCounts.append([file,sum(1 for line in open(path.join(srcCSVFilesLocation , file), newline=''))])               
    removeFiles(tgtFilesLocation)
    for file in fileRowCounts: 
        if file[1]>rowLimit:
            print("Processing File:", file[0],)
            print("Measured Row Count:", file[1])
            SplitAndMoveLargeFiles(file[0], file[1], srcCSVFilesLocation, outputNameTemplate)
        else:
            #print(path.join(srcCSVFilesLocation,file[0]))
            move(path.join(srcCSVFilesLocation, file[0]), tgtFilesLocation)
    removeFiles(srcCSVFilesLocation)            

if __name__ == "__main__":
    SourceFilesRename(srcFilesLocation, srcCSVFilesLocation)
    ProcessLargeFiles(srcCSVFilesLocation, outputNameTemplate, rowLimit)

The following screenshot demonstrates the execution output when the file is run in PowerShell. Since there were only two tables with a record count greater than the threshold set in the rowLimit variable, i.e. 2,000,000 rows, only those two tables get broken up into smaller chunks.

Once the necessary data files have been created and the data warehouse is prepped for loading, we can proceed to load the data. Depending on the volume of data, which in the case of TPC-DS directly correlates with the scaling factor used when running the dsdgen utility, a number of different methods can be used to populate the schema. Looking at Microsoft SQL Server as a test platform, the most straightforward approach would be to use the BCP utility. The following Python script executes the bulk copy program utility across many cores/CPUs to load the data created in the previous step (persisted in the C:\TPCDSdata\TargetFiles\ directory). The script utilises Python’s multiprocessing library to make the most of the resources available.

from multiprocessing import Pool, cpu_count
from os import listdir,path, getpid,system
import argparse

odbcDriver = '{ODBC Driver 13 for SQL Server}'

parser = argparse.ArgumentParser(description='TPC-DS Data Loading Script by bicortex.com')
parser.add_argument('-S','--svrinstance', help='Server and Instance Name of the MSSQL installation', required=True)
parser.add_argument('-D','--db', help='Database Name', required=True)
parser.add_argument('-U','--username', help='User Name', required=True)
parser.add_argument('-P','--password', help='Authenticating Password', required=True)
parser.add_argument('-L','--filespath', help='UNC path where the files are located', required=True)
args = parser.parse_args()

if not args.svrinstance or not args.db or not args.username or not args.password or not args.filespath:
    parser.print_help()
    exit(1)

def loadFiles(tableName, fileName):    
    print("Loading from CPU: %s" % getpid())
    fullPath = path.join(args.filespath, fileName)
    bcp = 'bcp %s in %s -S %s -d %s -U %s -P %s -q -c -t "|" -r"|\\n"' % (tableName, fullPath, args.svrinstance, args.db, args.username, args.password)
    #bcp store_sales in c:\TPCDSdata\TargetFiles\store_sales_1.csv -S <servername>\<instancename> -d <dbname> -U <username> -P <password> -q -c -t  "|" -r "|\n"  
    print(bcp)
    system(bcp)
    print("Done loading data from CPU: %s" % getpid())

if __name__ == "__main__":    
    p = Pool(processes = 2*cpu_count())
    for file in listdir(args.filespath):
        if file.endswith(".csv"):
            tableName = ''.join([i for i in path.splitext(file)[0] if not i.isdigit()]).rstrip('_')
            p.apply_async(loadFiles, [tableName, file])
    p.close()
    p.join()
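
Assuming the script above is saved as load_tpcds_data.py (a hypothetical name), a sample invocation would look like the following, with the placeholders substituted for your environment:

python load_tpcds_data.py -S <servername>\<instancename> -D <dbname> -U <username> -P <password> -L C:\TPCDSdata\TargetFiles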

The following screenshot (you can also see a data load sample execution video HERE) depicts the Python script loading 10GB of data, with some files generated by the dsdgen utility further split up into smaller chunks by the split_files.py script. The data was loaded into a Microsoft SQL Server 2016 Enterprise instance running on an Azure virtual machine with 20 virtual cores (Intel E5-2673 v3 @ 2.40GHz) and 140GB of memory. Apart from the fact that it’s very satisfying to watch 20 cores being put to work simultaneously, spawning multiple BCP instances significantly reduces data load times, even when accounting for locking and contention.

I have not bothered to test the actual load times, as SQL Server has a number of knobs and switches which can be used to adjust and optimise the system’s performance. For example, in-memory optimised tables, introduced in SQL Server 2014, store their data in memory using multiple versions of each row’s data. This technique, characterised as ‘non-blocking multi-version optimistic concurrency control’, eliminates both locks and latches, so a separate post may be in order to do justice to this feature and conduct a full performance analysis in a fair and unbiased manner.

So there you go – a fairly rudimentary rundown of some of the features of the TPC-DS benchmark. For a more detailed overview of TPC-DS please refer to the TPC website and their online resources HERE.
