Snowflake Cloud Data Warehouse Review – TPC-DS Benchmark Performance Analysis and Why It’s Becoming the Post-Hadoop Big Data Nirvana

Note: All artifacts used in this demo can be downloaded from my shared OneDrive folder HERE.

Introduction

I have been working in the BI/Analytics field for a long time and have seen plenty of disruptive technologies in this space go through the hype cycle. The move from SMP technologies to MPP databases, the era of Hadoop, the meteoric rise of Python and R for data science, serverless architectures and data lakes on cloud platforms have all ensured that this industry stays more vibrant than ever. I see new frameworks, tools and paradigms rising to the surface almost daily and there are always new things to learn and explore. A few weeks ago, I had an interesting conversation with a recruiter about what’s trendy and interesting in the BI and Analytics space and what the businesses she has been helping with resourcing are looking for in terms of the IT talent pool. I was expecting to hear things like AI PaaS, synthetic data, edge analytics and all the interesting (and sometimes a bit over-hyped) stuff that the cool kids in Silicon Valley are getting into. Instead, she asserted that apart from public cloud platforms and the run-of-the-mill BI tools like Power BI and Looker, there is a growing demand for Snowflake projects. Mind you, this is the Australian market, so chances are your neck of the woods may look completely different, but suffice to say, I didn’t expect a company not affiliated with a major public cloud provider to become so popular, particularly in an already congested market for cloud data warehousing solutions.

Snowflake, a direct competitor to Amazon’s Redshift, Microsoft’s Synapse and Google’s BigQuery, has been around for some time, but it is only recently that I started hearing more about it becoming the go-to technology for storing and processing large volumes of data in the cloud. I was also surprised to see Snowflake positioned in the ‘Leaders’ quadrant of the Data Management Solutions for Analytics chart, dethroning well-known and respected industry stalwarts such as Vertica or Greenplum.

In this post, I want to look at Snowflake’s performance characteristics by creating a TPC-DS benchmark schema and executing a few analytical queries to see how it fares in comparison to some of the other MPP databases I have worked with and/or tested. It is not supposed to be an exhaustive overview or test and, as such, these benchmark results should not be taken as gospel. Having worked with databases all my career, I always look forward to exploring new technologies and, even though I don’t expect to utilise Snowflake in my current project, it’s always exciting to kick the tires on a new platform or service.

OK, let’s dive in and see what’s in store!

Snowflake Architecture Primer

Snowflake claims to be built for the cloud from the ground up and does not offer an on-premises deployment option – it is purely a data warehouse as a service (DWaaS) offering. Everything – from the database storage infrastructure to the compute resources used for analysis and optimisation of data – is handled by Snowflake. In this respect, Snowflake is not that much different from Google’s BigQuery, although it’s difficult to contrast the two, mainly due to their different pricing models (excluding storage cost) – BigQuery charges per volume of data scanned, whereas Snowflake went with a more traditional model i.e. compute size and resource time allocation.

Snowflake’s architecture is a hybrid of traditional shared-disk database architectures and shared-nothing database architectures. Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data that is accessible from all compute nodes in the data warehouse. But just like shared-nothing architectures, Snowflake processes queries using massively parallel processing compute clusters, where each node in the cluster stores a portion of the entire data set locally. This approach offers the data management simplicity of a shared-disk architecture, but with the performance and scale-out benefits of a shared-nothing architecture.

Snowflake’s unique architecture consists of three key layers:

  • Database Storage – When data is loaded into Snowflake, Snowflake reorganizes that data into its internally optimized, compressed, columnar format. Snowflake stores this optimized data in cloud storage. Snowflake manages all aspects of how this data is stored — the organization, file size, structure, compression, metadata, statistics, and other aspects of data storage are handled by Snowflake. The data objects stored by Snowflake are not directly visible or accessible by customers; they are only accessible through SQL query operations run using Snowflake
  • Query Processing – Query execution is performed in the processing layer. Snowflake processes queries using “virtual warehouses”. Each virtual warehouse is an MPP compute cluster composed of multiple compute nodes allocated by Snowflake from a cloud provider. Each virtual warehouse is an independent compute cluster that does not share compute resources with other virtual warehouses. As a result, each virtual warehouse has no impact on the performance of other virtual warehouses (see the short sketch after this list)
  • Cloud Services – The cloud services layer is a collection of services that coordinate activities across Snowflake. These services tie together all of the different components of Snowflake in order to process user requests, from login to query dispatch. The cloud services layer also runs on compute instances provisioned by Snowflake from the cloud provider
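
To make the compute layer a little more tangible, below is a minimal sketch showing how a virtual warehouse can be created, resized and suspended with plain SQL through the Python connector. The warehouse name, credentials and sizing values are placeholders, not part of the original demo.

import snowflake.connector

# placeholder credentials - substitute your own account details
conn = snowflake.connector.connect(
    user="username", password="password", account="account_name_on_a_public_cloud"
)
cs = conn.cursor()

# each virtual warehouse is an independent MPP compute cluster;
# creating a second one does not affect queries running on the first
cs.execute(
    """CREATE WAREHOUSE IF NOT EXISTS ADHOC_WH
       WITH WAREHOUSE_SIZE = XSMALL
            AUTO_SUSPEND = 300
            AUTO_RESUME = TRUE;"""
)

# compute scales up or down on demand, independently of the storage layer
cs.execute("ALTER WAREHOUSE ADHOC_WH SET WAREHOUSE_SIZE = LARGE;")

# suspending the warehouse stops credit consumption while the data stays in cloud storage
cs.execute("ALTER WAREHOUSE ADHOC_WH SUSPEND;")
cs.close()
conn.close()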

Additionally, Snowflake can be hosted on any of the top three public cloud providers i.e. Azure, AWS and GCP, across different regions, bearing in mind that Snowflake accounts do not support multi-region deployments.

Loading TPC-DS Data

Data can be loaded into Snowflake in a number of ways, for example using established ETL/ELT tools, Snowpipe for continuous loading, the Web UI or SnowSQL, and in a variety of formats e.g. JSON, Avro, Parquet, CSV or XML. There are a few things to keep in mind when loading data into Snowflake relating to file size, file splitting, file staging etc. The general recommendation is to produce compressed files roughly 10MB to 100MB in size. Data loads of large Parquet files e.g. greater than 3GB could time out, therefore it is best practice to split those into smaller files to distribute the load among the servers in an active warehouse.

In order to run the TPC-DS queries on the Snowflake warehouse, I loaded the sample CSV-format data (scale factors of 100 and 300) from my local storage into the newly created schema using a small Python script. Snowflake also supports staging files in public cloud object/blob storage e.g. AWS S3, but in this exercise I transferred those from my local hard drive and staged them internally in Snowflake. As illustrated in the diagram below, loading data from a local file system is performed in two separate steps:

  • Upload (i.e. stage) one or more data files to a Snowflake stage (named internal stage or table/user stage) using the PUT command
  • Use the COPY INTO command to load the contents of the staged file(s) into a Snowflake database table (a minimal example of both steps follows this list)
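
For illustration, a minimal version of those two steps could look like the snippet below; the file path, stage and table names are placeholders and assume the stage and table already exist.

import snowflake.connector

# placeholder connection details
with snowflake.connector.connect(
    user="username", password="password", account="account_name_on_a_public_cloud"
) as conn:
    cs = conn.cursor()
    cs.execute("USE WAREHOUSE TPCDS_WH;")
    cs.execute("USE SCHEMA TPCDS_DB.tpc_ds;")
    # step 1: upload (stage) the local file into a named internal stage;
    # the file is gzip-compressed on the fly
    cs.execute("PUT file:///tmp/store_sales_1.csv @tpc_ds_stage auto_compress=true;")
    # step 2: copy the staged (compressed) file into the target table
    cs.execute(
        """COPY INTO store_sales FROM @tpc_ds_stage/store_sales_1.csv.gz
           FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|');"""
    )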

As TPC-DS data does not conform to some of the requirements outlined above, the following script pre-processes those files to ensure the format and size are consistent: it renames the files (from DAT to CSV), breaks larger files up into smaller chunks and removes the trailing '|' characters from line endings. It also creates the required TPC-DS schema and loads the pre-processed data into its objects.

#!/usr/bin/python
import snowflake.connector
from pathlib import PurePosixPath
import os
import re
import sys
from timeit import default_timer as timer
from humanfriendly import format_timespan
from csv import reader, writer
from multiprocessing import Pool, cpu_count


# define file storage locations variables
sql_schema = PurePosixPath("/Users/UserName/Location/SQL/tpc_ds_schema.sql")
tpc_ds_files_raw = PurePosixPath("/Volumes/VolumeName/Raw/")
tpc_ds_files_processed = PurePosixPath("/Volumes/VolumeName/Processed/")

# define other variables and their values
encoding = "iso-8859-1"
snow_conn_args = {
    "user": "username",
    "password": "password",
    "account": "account_name_on_a_public_cloud",
}
warehouse_name = "TPCDS_WH"
db_name = "TPCDS_DB"
schema_name = "tpc_ds"
output_name_template = "_%s.csv"
file_split_row_limit = 10000000


def get_sql(sql_schema):
    """
    Acquire sql ddl statements from a file stored in the sql_schema variable location
    """
    table_name = []
    query_sql = []

    with open(sql_schema, "r") as f:
        for i in f:
            if i.startswith("----"):
                i = i.replace("----", "")
                table_name.append(i.rstrip("\n"))

    temp_query_sql = []
    with open(sql_schema, "r") as f:
        for i in f:
            temp_query_sql.append(i)
        l = [i for i, s in enumerate(temp_query_sql) if "----" in s]
        l.append((len(temp_query_sql)))
        for first, second in zip(l, l[1:]):
            query_sql.append("".join(temp_query_sql[first:second]))
    sql = dict(zip(table_name, query_sql))
    return sql


def process_files(
    file,
    tpc_ds_files_raw,
    tpc_ds_files_processed,
    output_name_template,
    file_split_row_limit,
    encoding,
):
    """
    Clean up TPC-DS files (remove '|' row terminating characters)
    and break them up into smaller chunks in parallel
    """
    file_name = file[0]
    line_buffer_size = 100000  # flush cleaned lines to disk in 100k-line batches
    lines = []
    with open(
        os.path.join(tpc_ds_files_raw, file_name), "r", encoding=encoding
    ) as r, open(
        os.path.join(tpc_ds_files_processed, file_name), "w", encoding=encoding
    ) as w:
        for line in r:
            if line.endswith("|\n"):
                lines.append(line[:-2] + "\n")
                if len(lines) == line_buffer_size:
                    w.writelines(lines)
                    lines = []
        w.writelines(lines)

    row_count = file[1]
    if row_count > file_split_row_limit:
        keep_headers = False
        delimiter = "|"
        file_handler = open(
            os.path.join(tpc_ds_files_processed, file_name), "r", encoding=encoding
        )
        csv_reader = reader(file_handler, delimiter=delimiter)
        current_piece = 1
        current_out_path = os.path.join(
            tpc_ds_files_processed,
            os.path.splitext(file_name)[0] +
            output_name_template % current_piece,
        )

        current_out_writer = writer(
            open(current_out_path, "w", encoding=encoding, newline=""),
            delimiter=delimiter,
        )
        current_limit = file_split_row_limit
        if keep_headers:
            headers = next(csv_reader)
            current_out_writer.writerow(headers)
        for i, row in enumerate(csv_reader):
            # pbar.update()
            if i + 1 > current_limit:
                current_piece += 1
                current_limit = file_split_row_limit * current_piece
                current_out_path = os.path.join(
                    tpc_ds_files_processed,
                    os.path.splitext(file_name)[0]
                    + output_name_template % current_piece,
                )
                current_out_writer = writer(
                    open(current_out_path, "w", encoding=encoding, newline=""),
                    delimiter=delimiter,
                )
                if keep_headers:
                    current_out_writer.writerow(headers)
            current_out_writer.writerow(row)
        file_handler.close()
        os.remove(os.path.join(tpc_ds_files_processed, file_name))


def build_schema(sql_schema, warehouse_name, db_name, schema_name, **snow_conn_args):
    """
    Build TPC-DS database schema based on the ddl sql returned from get_sql() function
    """
    with snowflake.connector.connect(**snow_conn_args) as conn:
        cs = conn.cursor()
        try:
            cs.execute("SELECT current_version()")
            one_row = cs.fetchone()
            if one_row:
                print("Building nominated warehouse, database and schema...")
                cs.execute("USE ROLE ACCOUNTADMIN;")
                cs.execute(
                    """CREATE OR REPLACE WAREHOUSE {wh} with WAREHOUSE_SIZE = SMALL 
                                                            AUTO_SUSPEND = 300 
                                                            AUTO_RESUME = TRUE 
                                                            INITIALLY_SUSPENDED = FALSE;""".format(
                        wh=warehouse_name
                    )
                )
                cs.execute("USE WAREHOUSE {wh};".format(wh=warehouse_name))
                cs.execute(
                    "CREATE OR REPLACE DATABASE {db};".format(db=db_name))
                cs.execute("USE DATABASE {db};".format(db=db_name))
                cs.execute(
                    "CREATE OR REPLACE SCHEMA {sch};".format(sch=schema_name))
                sql = get_sql(sql_schema)
                for k, v in sql.items():
                    try:
                        k = k.split()[-1]
                        cs.execute(
                            "drop table if exists {sch}.{tbl};".format(
                                sch=schema_name, tbl=k
                            )
                        )
                        cs.execute(v)
                    except Exception as e:
                        print(e)
                        print("Error while creating nominated database object", e)
                        sys.exit(1)
        except (Exception, snowflake.connector.errors.ProgrammingError) as error:
            print("Could not connect to the nominated snowflake warehouse.", error)
        finally:
            cs.close()


def load_data(encoding, schema_name, db_name, tpc_ds_files_processed, **snow_conn_args):
    """
    Create internal Snowflake staging area and load csv files into Snowflake schema
    """
    stage_name = "tpc_ds_stage"
    paths = [f.path for f in os.scandir(tpc_ds_files_processed) if f.is_file()]
    with snowflake.connector.connect(**snow_conn_args) as conn:
        cs = conn.cursor()
        cs.execute("USE ROLE ACCOUNTADMIN;")
        cs.execute("USE WAREHOUSE {wh};".format(wh=warehouse_name))
        cs.execute("USE DATABASE {db};".format(db=db_name))
        cs.execute("USE SCHEMA {sch};".format(sch=schema_name))
        cs.execute("drop stage if exists {st};".format(st=stage_name))
        cs.execute(
            """create stage {st} file_format = (TYPE = "csv" 
                                FIELD_DELIMITER = "|" 
                                RECORD_DELIMITER = "\\n"
                                SKIP_HEADER = 0 
                                FIELD_OPTIONALLY_ENCLOSED_BY = "NONE" 
                                TRIM_SPACE = FALSE 
                                ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE 
                                ESCAPE = "NONE" 
                                ESCAPE_UNENCLOSED_FIELD = "\\134" 
                                DATE_FORMAT = "AUTO" 
                                TIMESTAMP_FORMAT = "AUTO" 
                                NULL_IF = ("NULL"))""".format(
                st=stage_name
            )
        )
        for path in paths:
            try:
                file_row_counts = sum(
                    1
                    for line in open(
                        PurePosixPath(path), "r+", encoding="iso-8859-1", newline=""
                    )
                )
                file_name = os.path.basename(path)
                # table_name = file_name[:-4]
                table_name = re.sub(r"_\d+", "", file_name[:-4])
                cs.execute(
                    "TRUNCATE TABLE IF EXISTS {tbl_name};".format(
                        tbl_name=table_name)
                )
                print(
                    "Loading {file_name} file...".format(file_name=file_name),
                    end="",
                    flush=True,
                )
                sql = (
                    "PUT file://"
                    + path
                    + " @{st} auto_compress=true".format(st=stage_name)
                )
                start = timer()
                cs.execute(sql)
                sql = (
                    'copy into {tbl_name} from @{st}/{f_name}.gz file_format = (ENCODING ="iso-8859-1" TYPE = "csv" FIELD_DELIMITER = "|")'
                    ' ON_ERROR = "ABORT_STATEMENT" '.format(
                        tbl_name=table_name, st=stage_name, f_name=file_name
                    )
                )
                cs.execute(sql)
                end = timer()
                time = round(end - start, 1)
                db_row_counts = cs.execute(
                    """SELECT COUNT(1) FROM {tbl}""".format(tbl=table_name)
                ).fetchone()
            except (Exception, snowflake.connector.errors.ProgrammingError) as error:
                print(error)
                sys.exit(1)
            finally:
                if file_row_counts != db_row_counts[0]:
                    raise Exception(
                        "Table {tbl} failed to load correctly as record counts do not match: flat file: {ff_ct} vs database: {db_ct}.\
                            Please troubleshoot!".format(
                            tbl=table_name,
                            ff_ct=file_row_counts,
                            db_ct=db_row_counts[0],
                        )
                    )
                else:
                    print("OK...loaded in {t}.".format(
                        t=format_timespan(time)))


def main(
    sql_schema,
    warehouse_name,
    db_name,
    schema_name,
    tpc_ds_files_raw,
    tpc_ds_files_processed,
    output_name_template,
    file_split_row_limit,
    encoding,
    **snow_conn_args
):
    """
    Rename TPC-DS files, do a row count for each file to determine how many smaller files the larger ones need to be split into,
    and kick off the parallel file cleanup and split process. Finally, kick off the schema build and data load into Snowflake
    """
    for file in os.listdir(tpc_ds_files_raw):
        if file.endswith(".dat"):
            os.rename(
                os.path.join(tpc_ds_files_raw, file),
                os.path.join(tpc_ds_files_raw, file[:-4] + ".csv"),
            )
    fileRowCounts = []
    for file in os.listdir(tpc_ds_files_raw):
        if file.endswith(".csv"):
            fileRowCounts.append(
                [
                    file,
                    sum(
                        1
                        for line in open(
                            os.path.join(tpc_ds_files_raw, file),
                            encoding=encoding,
                            newline="",
                        )
                    ),
                ]
            )
    p = Pool(processes=cpu_count())
    for file in fileRowCounts:
        p.apply_async(
            process_files,
            [
                file,
                tpc_ds_files_raw,
                tpc_ds_files_processed,
                output_name_template,
                file_split_row_limit,
                encoding,
            ],
        )
    p.close()
    p.join()
    build_schema(sql_schema, warehouse_name, db_name,
                 schema_name, **snow_conn_args)
    load_data(encoding, schema_name, db_name,
              tpc_ds_files_processed, **snow_conn_args)


if __name__ == "__main__":
    main(
        sql_schema,
        warehouse_name,
        db_name,
        schema_name,
        tpc_ds_files_raw,
        tpc_ds_files_processed,
        output_name_template,
        file_split_row_limit,
        encoding,
        **snow_conn_args
    )

Once executed, the script produced 246 CSV files (up from 26 in the original data set) and, on load completion, the following data volume (uncompressed) and row count metadata was available in the Snowflake warehouse.
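
If you would like to verify those numbers yourself, per-table row counts and storage volumes can be pulled straight out of the standard INFORMATION_SCHEMA views. Below is a short sketch run against the TPCDS_DB database created by the script above (connection details are placeholders).

import snowflake.connector

# placeholder connection details
with snowflake.connector.connect(
    user="username", password="password", account="account_name_on_a_public_cloud"
) as conn:
    cs = conn.cursor()
    cs.execute("USE WAREHOUSE TPCDS_WH;")
    cs.execute("USE DATABASE TPCDS_DB;")
    # row counts and storage volume per table, largest first
    cs.execute(
        """SELECT table_name, row_count, bytes
           FROM information_schema.tables
           WHERE table_schema = 'TPC_DS' AND table_type = 'BASE TABLE'
           ORDER BY row_count DESC;"""
    )
    for table_name, row_count, table_bytes in cs.fetchall():
        print(table_name, row_count, table_bytes)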

TPC-DS Benchmark Sample

Although Snowflake conveniently provides 10TB and 100TB versions of TPC-DS data, along with samples of all 99 of the benchmark’s queries, I really wanted to go through the whole process myself to help me understand the service and the tooling it provides in more detail. In hindsight, it turned out to be an exercise in patience thanks to my ISP’s dismal upload speed, so if you’re just looking at kicking the tires and running a few queries yourself, it may be quicker to use Snowflake’s provided data sets.
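
For reference, the Snowflake-supplied TPC-DS data is exposed through a shared sample database (named SNOWFLAKE_SAMPLE_DATA at the time of writing), so querying it requires no loading at all. A quick sketch, assuming the share is available on your account:

import snowflake.connector

# placeholder connection details; assumes the SNOWFLAKE_SAMPLE_DATA share is enabled
with snowflake.connector.connect(
    user="username", password="password", account="account_name_on_a_public_cloud"
) as conn:
    cs = conn.cursor()
    cs.execute("USE WAREHOUSE TPCDS_WH;")
    # 10TB scale factor version of the TPC-DS schema provided by Snowflake
    cs.execute("SELECT COUNT(*) FROM snowflake_sample_data.tpcds_sf10tcl.store_sales;")
    print(cs.fetchone()[0])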

It is worth mentioning that no performance optimisation was performed on the queries, data or the system itself. Unlike other vendors, which allow their administrators to fine-tune various aspects of the engine’s operation through mechanisms such as statistics updates, table partitioning, query- or workload-specific projections, execution plan tuning etc., Snowflake abstracts a lot of those details away. Beyond elastically scaling compute up or down, or defining a clustering key to help with execution on very large tables, there are very few knobs and switches one can play with. That’s a good thing and, unless you’re a consultant making money helping your clients with performance tuning, many engineers will be happy to jump on the bandwagon for this reason alone.
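
The clustering key mentioned above is essentially the only tuning lever exposed and it boils down to a single DDL statement. The sketch below shows what it might look like on one of the larger TPC-DS fact tables; whether it actually helps depends entirely on the table size and query predicates, and I did not use it in these tests.

import snowflake.connector

# placeholder connection details
with snowflake.connector.connect(
    user="username", password="password", account="account_name_on_a_public_cloud"
) as conn:
    cs = conn.cursor()
    cs.execute("USE WAREHOUSE TPCDS_WH;")
    cs.execute("USE SCHEMA TPCDS_DB.tpc_ds;")
    # co-locate micro-partitions by sale date - typically only worthwhile on very large tables
    cs.execute("ALTER TABLE store_sales CLUSTER BY (ss_sold_date_sk);")
    # report how well the table is clustered on that key
    cs.execute("SELECT SYSTEM$CLUSTERING_INFORMATION('store_sales', '(ss_sold_date_sk)');")
    print(cs.fetchone()[0])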

Also, the data volumes analysed in this scenario do not follow the characteristics and requirements outlined by the TPC organisation and, under these circumstances, would be considered an ‘unacceptable consideration’. For example, a typical benchmark submitted by a vendor needs to include a number of metrics beyond query throughput e.g. a price-performance ratio, data load times, the availability date of the complete configuration etc. The following is a sample of compliant reporting of TPC-DS results: ‘At 10GB the RALF/3000 Server has a TPC-DS Query-per-Hour metric of 3010 when run against a 10GB database, yielding a TPC-DS Price/Performance of $1,202 per query-per-hour and will be available 1-Apr-06’. As this demo and the results below go only skin-deep i.e. query execution times, the results are only indicative of the performance level in the context of the same data and configuration used.

OK, now with this little disclaimer out of the way, let’s look at how the 24 queries performed across the two distinct setups i.e. a Small warehouse and a Large warehouse, across the 100GB and 300GB datasets. I used the following Python script to return query execution times.

#!/usr/bin/python
import sys
from pathlib import PurePosixPath
import snowflake.connector
from timeit import default_timer as timer
from humanfriendly import format_timespan

db_name = "TPCDS_DB"
schema_name = "tpc_ds"
warehouse_name = "TPCDS_WH"
snow_conn_args = {
    "user": "username",
    "password": "password",
    "account": "account_name_on_a_public_cloud",
}

sql_queries = PurePosixPath(
    "/Users/UserName/Location/SQL/tpc_ds_sql_queries.sql"
)


def get_sql(sql_queries):
    """
    Source operation types from the tpc_ds_sql_queries.sql SQL file.
    Each operation is denoted by the use of four dash characters 
    and a corresponding query number and store them in a dictionary 
    (referenced in the main() function). 
    """
    query_number = []
    query_sql = []

    with open(sql_queries, "r") as f:
        for i in f:
            if i.startswith("----"):
                i = i.replace("----", "")
                query_number.append(i.rstrip("\n"))

    temp_query_sql = []
    with open(sql_queries, "r") as f:
        for i in f:
            temp_query_sql.append(i)
        l = [i for i, s in enumerate(temp_query_sql) if "----" in s]
        l.append((len(temp_query_sql)))
        for first, second in zip(l, l[1:]):
            query_sql.append("".join(temp_query_sql[first:second]))
    sql = dict(zip(query_number, query_sql))
    return sql


def run_sql(
    query_sql, query_number, warehouse_name, db_name, schema_name
):
    """
    Execute SQL query as per the SQL text stored in the tpc_ds_sql_queries.sql
    file by its number e.g. 10, 24 etc. and report the returned row count and
    query processing time
    """
    with snowflake.connector.connect(**snow_conn_args) as conn:
        try:
            cs = conn.cursor()
            cs.execute("USE ROLE ACCOUNTADMIN;")
            cs.execute("USE WAREHOUSE {wh};".format(wh=warehouse_name))
            cs.execute("USE DATABASE {db};".format(db=db_name))
            cs.execute("USE SCHEMA {sch};".format(sch=schema_name))
            query_start_time = timer()
            rows = cs.execute(query_sql)
            rows_count = sum(1 for row in rows)
            query_end_time = timer()
            query_duration = query_end_time - query_start_time
            print(
                "Query {q_number} executed in {time}...{ct} rows returned.".format(
                    q_number=query_number,
                    time=format_timespan(query_duration),
                    ct=rows_count,
                )
            )
        except (Exception, snowflake.connector.errors.ProgrammingError) as error:
            print(error)
        finally:
            cs.close()


def main(param, sql):
    """
    Depending on the argv value i.e. query number or 'all' value, execute 
    sql queries by calling run_sql() function
    """
    if param == "all":
        for key, val in sql.items():
            query_sql = val
            query_number = key.replace("Query", "")
            run_sql(
                query_sql,
                query_number,
                warehouse_name,
                db_name,
                schema_name
            )
    else:
        query_sql = sql.get("Query" + str(param))
        query_number = param
        run_sql(
            query_sql,
            query_number,
            warehouse_name,
            db_name,
            schema_name
        )


if __name__ == "__main__":
    if len(sys.argv[1:]) == 1:
        sql = get_sql(sql_queries)
        param = sys.argv[1]
        query_numbers = [q.replace("Query", "") for q in sql]
        query_numbers.append("all")
        if param not in query_numbers:
            raise ValueError(
                "Incorrect argument given. Looking for <all> or <query number>. Specify <all> argument or choose from the following numbers:\n {q}".format(
                    q=", ".join(query_numbers[:-1])
                )
            )
        else:
            param = sys.argv[1]
            main(param, sql)
    else:
        raise ValueError(
            "Too many arguments given. Looking for 'all' or <query number>."
        )

First up is the 100GB dataset, with queries running on an empty cache (cached results were sub-second, therefore I did not bother to graph those here), across Small and Large warehouses. When transitioning from the Small to the Large warehouse, I also dropped the existing warehouse (rather than resizing it), created a new one and re-populated it to ensure that no data was returned from cache.
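
Dropping and recreating the warehouse is the blunt-instrument approach; a lighter-touch alternative, which I could equally have used, is to disable the result cache for the session and bounce the warehouse to flush its local disk cache. A rough sketch, using the warehouse name from the scripts above:

import snowflake.connector

# placeholder connection details; assumes the warehouse is currently running
with snowflake.connector.connect(
    user="username", password="password", account="account_name_on_a_public_cloud"
) as conn:
    cs = conn.cursor()
    # stop Snowflake from serving repeated queries out of the global result cache
    cs.execute("ALTER SESSION SET USE_CACHED_RESULT = FALSE;")
    # suspending and resuming the warehouse drops the data cached on its local SSDs
    cs.execute("ALTER WAREHOUSE TPCDS_WH SUSPEND;")
    cs.execute("ALTER WAREHOUSE TPCDS_WH RESUME;")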

While these results and test cases are probably not truly representative of ‘big data’ workloads, they provide a quick snapshot of how traditional analytical workloads would perform on the Snowflake platform. Scanning large tables with hundreds of millions of rows seemed to perform quite well and, for the most part, performance differences were consistent with the increase in compute capacity. Cached queries ran nearly instantaneously, so comparing the execution patterns for cached vs non-cached results there was simply no contest (and no point in comparing and graphing the results). Interestingly, I did a similar comparison for Google’s BigQuery platform (link HERE) and Snowflake seems to perform a lot better in this regard i.e. cached results are returned with low latency, usually in less than one second. Next up, I graphed the 300GB scaling factor, again repeating the same queries across Small and Large warehouse sizes.

Snowflake’s performance on the 300GB dataset was very good and the majority of the queries returned relatively fast. Side by side, it is easy to see the difference the larger compute allocation made to query execution times, albeit at the cost of increased credit usage. The current pricing model dictates that a Large warehouse is four times as expensive as a Small one. Given that the price of a credit (standard, single-cluster warehouse) in the AWS Sydney region is currently set to USD 2.75, i.e. 22 dollars per hour for the Large warehouse (not including storage and cloud services cost), it becomes a steep proposition at close to AUD 100K/year (assuming that only 12 out of 24 hours are charged and only on work days). Cost notwithstanding, Snowflake is a powerful platform and the ability to crunch a lot of data at this speed and on demand is commendable. Although a bit of an apples-to-oranges comparison, you can also view how BigQuery performed on identical datasets in one of my previous posts HERE. Also, just for reference, below is an image of the metadata maintained by Snowflake itself on all activities across the warehouse, filtered to display only SELECT statement information (un-cached 300GB Large warehouse run in this case).
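
For transparency, the back-of-the-envelope arithmetic behind that estimate is reproduced below; the credit consumption rates and per-credit price are the figures quoted above, while the AUD/USD exchange rate is an assumed value.

# rough annual cost estimate for a Large, single-cluster, Standard edition warehouse
credits_per_hour = {"small": 2, "large": 8}  # credits consumed per hour of runtime
credit_price_usd = 2.75                      # AWS Sydney region, Standard edition
aud_per_usd = 1.4                            # assumed exchange rate

hours_per_day = 12                           # assume the warehouse runs half the day
work_days_per_year = 5 * 52                  # weekdays only

hourly_cost_usd = credits_per_hour["large"] * credit_price_usd  # 22 USD per hour
annual_cost_aud = hourly_cost_usd * hours_per_day * work_days_per_year * aud_per_usd

print("Large warehouse: USD {0:.2f}/hour, roughly AUD {1:,.0f}/year".format(
    hourly_cost_usd, annual_cost_aud))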

Snowflake Reference Architectures

What I’m currently seeing in the industry, from a big data architecture patterns point of view, is that businesses tend to move towards deployments favored by either data engineering teams or BI/Analytics teams. The distinction between those two groups is heavily determined by the area of expertise and the tooling used, with DE shops more reliant on software engineering paradigms and BI folks’ skills skewed more towards SQL for most of their data processing needs. Snowflake allows teams not versed in software development to use vanilla SQL to provision and manage most of their data-related responsibilities, effectively replacing development-heavy tasks with the simplicity of its platform. Looking at what a junior analyst can achieve with just a few lines of SQL in Snowflake, many businesses can drastically simplify and, at the same time, supercharge their analytical capability, all without knowledge of programming in Scala or Python, distributed systems architecture, bespoke ETL frameworks or intricate cloud concepts. In the past, a plethora of tools and platforms would be required to shore up analytics pipelines and workloads. With Snowflake, many of those e.g. traditional, file-based data lakes, semi-structured data stores etc. can become redundant, with Snowflake becoming the linchpin of the enterprise information architecture.
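
As a flavour of what ‘a few lines of SQL’ can mean in practice, the sketch below loads a semi-structured JSON document into a VARIANT column and queries it with plain SQL – no Spark cluster or bespoke parsing code required. The table and JSON attributes are made up purely for illustration.

import snowflake.connector

# placeholder connection details; table and JSON attributes are hypothetical
with snowflake.connector.connect(
    user="username", password="password", account="account_name_on_a_public_cloud"
) as conn:
    cs = conn.cursor()
    cs.execute("USE WAREHOUSE TPCDS_WH;")
    cs.execute("USE SCHEMA TPCDS_DB.tpc_ds;")
    # raw JSON events land in a single VARIANT column...
    cs.execute("CREATE OR REPLACE TABLE clickstream_raw (payload VARIANT);")
    cs.execute(
        """INSERT INTO clickstream_raw
           SELECT PARSE_JSON('{"customer": "u1", "event": "view", "device": {"os": "ios"}}');"""
    )
    # ...and can be queried with path notation like any other column
    cs.execute(
        """SELECT payload:customer::string  AS customer_id,
                  payload:device.os::string AS os,
                  COUNT(*)                  AS events
           FROM clickstream_raw
           GROUP BY 1, 2;"""
    )
    print(cs.fetchall())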

The following are some examples of reference architectures used across several use cases e.g. IoT, streaming data stack, ML and data science, demonstrating Snowflake’s capabilities in supporting those varied scenarios with their platform.

Conclusion

The data warehousing realm has been dominated by a few formidable juggernauts of this industry and no business outside of a few start-ups would be willing to bet their data on an up-and-coming entrant that is yet to prove itself. A few have tried, but just as the public cloud space is starting to consolidate and the winners take it all, the data warehousing domain has always been difficult to disrupt. In the ever-expanding era of cloud computing, many vendors are rushing to retool and retrofit their existing offerings to take advantage of this new paradigm, but ultimately few offer innovative and fresh approaches. Starting from the ground up and with no legacy or technical debt to worry about, Snowflake is clearly trying to reinvent the approach and I have to admit I was quite surprised not only by how well Snowflake performed in my small number of tests but also by how coherent the vision is. They are not trying to position themselves as a one-stop shop for ETL, visualisation and data warehousing. There is no mention of running their workloads on GPUs or FPGAs, no reference to Machine Learning or AI (not yet anyway) and, at face value, the product is just a reinvented data warehouse cloud appliance. Some may consider that a shortcoming, as integration with other platforms, services and tools (something that Microsoft and AWS are doing so well) may be a paramount criterion. However, the focus on ensuring that the one domain they’re targeting gets the best engineering in its class is clearly evident. Being a jack of all trades and master of none may work for the Big Cheese, but Snowflake is clearly trying to come up with a superior product in a single vertical without diluting its value proposition – something it has been able to excel at so far.

After nearly 15 years of working in the information management field and countless encounters with different vendors, RDBMSs, platforms etc., Snowflake seems like a breath of fresh air to me. It supports a raft of different connectors and drivers, speaks plain SQL, is multi-cloud, separates storage from compute, scales automatically, requires little to no tuning and performs well. It also comes with some great features built in, features that with other vendors may require a lot of configuration, expensive upgrades or even bespoke development e.g. time travel, zero-copy cloning, materialized views, data sharing or continuous data loading using Snowpipe. On the other hand, there are a few issues which may become a deal-breaker for some. A selection of features is only available in the higher (more expensive) tiers e.g. time travel or column-level security, constraints are declared but not enforced and JavaScript-based UDFs/stored procedures are a pain to use. The Web UI is clean but a bit bare-bones and integration with some legacy tools may prove to be challenging. Additionally, the sheer fact that you’re not in control of the infrastructure and data (some industries are still limited in how and where their data is stored) may turn some interested parties away, especially since MPP engines like Greenplum, Vertica or ClickHouse can be deployed on both private and public clouds. Below is an excerpt from the 2020 Gartner Magic Quadrant for Cloud Database Management Systems, outlining Snowflake’s core strengths and weaknesses.
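
To make two of those built-in features a little more concrete, the sketch below exercises time travel and zero-copy cloning – each a single statement in Snowflake. The object names and the one-hour offset are illustrative only and assume default data retention settings.

import snowflake.connector

# placeholder connection details; object names are illustrative
with snowflake.connector.connect(
    user="username", password="password", account="account_name_on_a_public_cloud"
) as conn:
    cs = conn.cursor()
    cs.execute("USE WAREHOUSE TPCDS_WH;")
    cs.execute("USE SCHEMA TPCDS_DB.tpc_ds;")
    # time travel: query the table as it looked an hour ago
    cs.execute("SELECT COUNT(*) FROM store_sales AT(OFFSET => -3600);")
    print(cs.fetchone()[0])
    # zero-copy cloning: an instant, storage-free copy of the table for testing
    cs.execute("CREATE OR REPLACE TABLE store_sales_dev CLONE store_sales;")
    # ...and dropped objects can be restored within the retention period
    cs.execute("DROP TABLE store_sales_dev;")
    cs.execute("UNDROP TABLE store_sales_dev;")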

What I personally liked most about this service is how familiar and simple it is, while providing all the required features in a little-to-no-ops package. I liked the fact that, in spite of all the novel architecture it’s built on, it feels just like another RDBMS. If I could compare it to a physical thing, it would be a (semi)autonomous electric car. Just like any automobile, EVs sport four wheels, lights, doors and everything else that unmistakably makes them a vehicle. Likewise, Snowflake speaks SQL (a 40-year-old language), has tables, views and UDFs, and has all the features traditional databases have had for decades. However, just as the new breed of autonomous cars doesn’t come with a manual transmission, a gas-powered engine, detailed mechanical service requirements or even a steering wheel, Snowflake looks to be built for the future, blending old-school familiarity and pioneering technology together. It is a no-fuss, low-ops, security-built-in, cloud-only service which feels instantly recognizable and innovative at the same time. It’s not your grandpa’s data warehouse and, in my view, it has nearly all the features needed to rejuvenate and simplify a lot of traditional BI architectures. The following is a comparison of traditional vs contemporary data warehouse architectures, elevating Snowflake to the centerpiece of the data storage and processing platform.

Many vendors are rushing to adapt and transform their on-premise-first data warehouse software with varied success. Oracle, IBM and Microsoft come to mind when looking at popular RDBMS engines running predominantly in customers’ data centers. All of those vendors are now pushing hard to customize their legacy products and ensure that the new iteration of their software is cloud-first, but existing technical debt and the scope of investment already made may create unbreakable shackles, stifling innovation and compromising progress. Snowflake’s new approach to solving data management problems has gained quite a following and many companies are jumping on board in spite of the fact that Snowflake is not affiliated with any of the major cloud vendors and the competition is rife – only last week Google announced BigQuery Omni. Looking at one of the most recent data warehouse market analyses (source HERE), Snowflake has been making some serious headway among Fortune 100 companies and is quickly capitalizing on its popularity and accelerating adoption rate. The chart below is clear evidence of this and does not require any explanation – Snowflake is on a huge roll.

It’s still early days and time will tell if Snowflake can maintain the momentum, but having spent a little time with it, I understand the appeal and why the service is gaining in popularity. If I was looking at storing and processing large volumes of data with best-of-breed data warehousing capability, Snowflake would definitely be on my shopping list.

This entry was posted on Saturday, July 25th, 2020 at 5:24 pm and is filed under Cloud Computing, MPP RDBMS, Programming, SQL.

20 Responses to “Snowflake Cloud Data Warehouse Review – TPC-DS Benchmark Performance Analysis and Why It’s Becoming the Post-Hadoop Big Data Nirvana”

Boris Tyukin July 26th, 2020 at 7:11 am

Thanks for the great post! Really enjoyed reading your blog and subscribed to it not to miss your great content.

admin July 26th, 2020 at 3:41 pm

Thanks for taking the time mate and glad you found it useful!

Scott Marshall July 27th, 2020 at 2:51 pm

Great job on the post.

Have also been hearing about Snowflake a bit more compared to Redshift or Bigquery so it may be true that it’s getting pretty good or better than the alternatives. We did a comparison on our own datasets about a year ago (mainly based on our typical workloads and not synthetic benchmarks) and it was a bit too expensive so we went ahead with on-prem Greenplum – a lot more maintenance but fast enough for us (we already had a PostgreSQL DBA on staff to help out). Snowflake is super fast and not much tuning required but you have to go cloud-first.

Rocco Faust July 29th, 2020 at 5:21 pm

Great post! Snowflake rocks!

Jim L August 9th, 2020 at 11:58 am

Good work on the post. Any reason as to why you haven’t you included all TPC-DS queries in your benchmark? Also, it would have been interesting to see those queries executed on a multi-cluster warehouse to see how performance is affected. Anyway, bookmarked your post cause the Python scrips are very useful. Thanks, JL

admin August 9th, 2020 at 8:45 pm

Hi Jim

I did not include all queries as this post was meant to just scratch the surface of Snowflake performance and I had no intention to make it a comprehensive analysis or benchmark. This was also the reason why I did not use multi-cluster warehouses. Additionally, multi-cluster warehouses are great with concurrent database access issues so queries executed in a sequential manner (as they were in my tests) would not benefit from this scale-out setup.

Glad you found the Python scripts helpful!

Jerry Sook August 17th, 2020 at 6:58 pm

Hi Martin

Great post BTW.

How do you think Snowflake compares to other, more old-school MPP engines e.g. Vertica? We are looking at a project ATM (around 10TB of data) and already excluded Redshift and Synapse. Would like to go with something that has been around for a few years and has a decent customer base but not sure if Snowflake is mature enough ATM (we’re not fussed about the cloud but more about stability and long-term support).

Cheers,
Jerry

admin August 18th, 2020 at 7:46 am

Hi Jerry

From the short time I’ve spent with Snowflake I reckon it compares favorably, mainly due to the low maintenance cost. I am not just talking about the OPEX i.e. the money you pay to Snowflake, but also how much time (thus $$$) you would need to spend to set up, maintain and troubleshoot platforms such as Vertica or Netezza.

Vertica is one of my favorite MPP databases and despite its age it’s still a viable solution if you’d like to run mainly on-premise. It’s wickedly fast too. But that implies you need to have the technical know-how and invest some time to realize the true potential of its power. Not the case with Snowflake – you can be up and running in no time and focus on what’s important i.e. insights and analytics, rather than cluster setup and configuration.

I don’t know exactly what your project requirements are so I can’t recommend one over the other with certainty but I came away quite impressed with Snowflake. The market is heating up though and there are other vendors which may suit your business case better e.g. for true speed nothing can beat ClickHouse ATM or if open-source platform is important to you Greenplum looks like a good choice. Those are niche cases though so depending on what you need I suggest doing a small POC on your data and see if the takeaways are in line with your expectations.

Vinod September 3rd, 2020 at 9:36 am

Hi sir

Congrats on your post. I like Snowflake also and would like to practice to implement in my team. Do you have more code in repo I can use to customize. It would help me very much. My case I wish to extract from Azure blob and not local files. Thank you.

admin September 4th, 2020 at 8:13 am

Hi mate

Sorry, what you see is what you get.

Mario Lopez December 12th, 2020 at 1:49 pm

Hi there

Very insightful post, thanks for sharing.

Any thoughts on Snowflake vs Redshift (or Synapse)? We are re-evaluating our BI/DW stack and pretty much decided on Redshift but everyone I know is moving to Snowflake for new projects. Also, our project sponsor and PM are hang up on using Synapse (we’re heavily invested in Microsoft ecosystem but not Azure)…I just find it a bit clunky to use and overall prefer AWS tools.

Cheers, Mario

admin December 12th, 2020 at 6:03 pm

Hi Mario

I suggest that, to establish the suitability of any of those platforms, you do a little POC on your data. That’s the only way you will be able to tell which one is better. Try to stay agnostic and conduct proper research as Redshift, Synapse and Snowflake come with their own distinct pros and cons. For example, if you’re tightly integrated into the AWS ecosystem and have the engineering capabilities to manage Redshift clusters, it can be a good alternative to Snowflake. Likewise, if you’re a MSFT shop and thinking about moving into Azure (especially with Apache Spark as your data processing platform), Synapse may make more sense. Many are drinking the Snowflake Kool-Aid ATM and it is quickly becoming the go-to platform for data warehousing needs, but you will only know whether it’s right for you when you take it for a spin (on your data) and evaluate it on your own merits – I can’t stress this enough.

Timmy W December 28th, 2020 at 9:10 am

Hey

Great post buddy! Helped me in understanding pros and cons of the product. Keep up the good work.

TW

Pacman December 28th, 2020 at 9:17 am

Hi Martin

Insightful post and good content covered.

What ETL solution would you use to load data into Snowflake using batch mode?

Packman

admin December 28th, 2020 at 6:10 pm

Hi Packman

There are plenty of ETL options out there, ranging from new-ish cloud-only platforms e.g. Matillion to industry stalwarts e.g. Informatica. It really depends on your use case and specific factors e.g. budget, time, dev skills etc. I have a friend who works in the gaming industry (big corp) and they just moved from Cloudera to Snowflake and are using Apache Airflow powered by the Astronomer platform. Also heard good things about dbt (data build tool). Suggest you do your research and compare products/platforms as there is a lot of choice out there. Finally, I would go with the ELT instead of the ETL paradigm…Snowflake is powerful enough to crunch your data and can be used as a data lake.

Darius Mahan Jr February 24th, 2021 at 2:46 pm

Hi Martin

Not looking for free advice but it would be great if you could let me know what size of data is Snowflake ideally good at managing and querying (minimum, not maximum)? I’m working for a small org and we have tens of gigs of data (currently on Azure SQL Server) but the architect can’t stop talking about migrating off of MSSQL into cloud only database like Snowflake.

I personally don’t see the point of it. Query latency on our MSSQL DBs is reasonable and I don’t think we will grow to such extent to warrant migrating into MPP-style database. Additionally, we would need to change a lot of architecture around it as it’s been built to work with Microsoft technology exclusively.

Am I the only one who thinks that cloud is not the solution to all the problems?

Appreciate your input into it.

P.S. I work as a BI analyst, not a DBA or data engineer so my view may be a bit myopic.

admin February 26th, 2021 at 3:27 pm

Hi Darius

My go-to response for these types of questions has always been “you will only know if it’s the right service/technology/framework etc. for you or your use case when you take it for a spin (on your data) and evaluate it on your own merits”. I don’t want to sound like an armchair expert and give you advice without fully understanding your predicament. Having said that, based on what you wrote, tens of gigs (even hundreds) are not something which can’t be handled by most RDBMS engines (commercial or open source) with ease. Whilst I understand the drive to simplify and modernize, you will be hard-pressed to realize the true potential of today’s MPP-style database engines with small data sets. Taking gigabytes of data to Snowflake seems wasteful to me unless you’re expecting exponential-level growth patterns or engaging in CV-driven architecture. It would probably make more sense to try to optimize what you already have e.g. refactor your schema, add columnstore indexes etc. or move to the cloud but use something like Azure SQL DB. You mentioned that you’re already reliant on Microsoft technology, so maybe running a small POC in Azure i.e. an uplift to SQL DB, may solve your problems without having to boil the ocean and move to a different vendor.

I sometimes see businesses or teams buying into low-ops platforms like BigQuery or Snowflake because there’s nothing to manage, but remember that’s a double-edged sword i.e. you can’t peek behind the curtain whether you want to or not. Managing gigs of data in today’s platforms is fairly trivial, so taking the nuclear option and going for Hadoop (used to be quite trendy) or a costly MPP database like Snowflake, Redshift or BigQuery is a bit shortsighted IMHO.

Anyway, I don’t really know whether your use-case is specific enough to warrant using platforms like Snowflake, so please take my advice with caution. For all I know, your architect may be 100% right as I don’t see the full picture. However, as interesting and cool as Snowflake is, I don’t think it would be on my list based on the requirements you outlined. I also drank some of their post-IPO Kool-Aid but at the end of the day it’s just a tool for a specific purpose and yours is not its target market.

Rowan Keeton April 28th, 2021 at 2:11 pm

Hello Martin

Interesting post and good insight into Snowflake’s pros and cons.

Any chance you have something similar on Synapse in Azure?

Cheers, Rowan

admin May 5th, 2021 at 5:03 pm

Hi Rowan

Not yet but should be looking into it soon so stay tuned!

leroy martin August 29th, 2021 at 7:03 pm

Hi Martin

If building a brand new data warehouse would you recommend Snowflake or something from the actual cloud vendor, for example Redshift? My company has about 400GB of uncompressed data and I was leaning towards Snowflake but we’re AWS so I think something like Redshift may be easier to integrate with.
Great post BTW.

Leroy
