{"id":4536,"date":"2023-01-20T20:05:43","date_gmt":"2023-01-20T10:05:43","guid":{"rendered":"http:\/\/bicortex.com\/?p=4536"},"modified":"2023-02-06T15:55:31","modified_gmt":"2023-02-06T05:55:31","slug":"data-build-tool-dbt-the-emerging-standard-for-building-sql-first-data-transformation-pipelines-part-2","status":"publish","type":"post","link":"http:\/\/bicortex.com\/bicortex\/data-build-tool-dbt-the-emerging-standard-for-building-sql-first-data-transformation-pipelines-part-2\/","title":{"rendered":"Data Build Tool (DBT) &#8211; The Emerging Standard For Building SQL-First Data Transformation Pipelines &#8211; Part 2"},"content":{"rendered":"<p class=\"Standard\" style=\"text-align: justify;\">In <a href=\"http:\/\/bicortex.com\/data-build-tool-dbt-the-emerging-standard-for-building-sql-first-data-transformation-pipelines-part-1\/\" target=\"_blank\" rel=\"noopener\">Part 1<\/a> of this series, I went over the high-level introduction to dbt, stood up a small development SQL Server database in Azure, acquired a sample size of Google Analytics data to populate a few tables used in this post and finally installed and configured dbt on a local environment. In this post I\u2019d like to dive deeper into dbt functionality and outline some of its key features that make it an attractive proposition for data transformation and loading part of the ETL\/ELT process.<\/p>\n<h3 style=\"text-align: center;\">DBT Models<\/h3>\n<p class=\"Standard\" style=\"text-align: justify;\">In dbt framework, a model is simply a SELECT SQL statement. When executing dbt run command, dbt will build this model in our data warehouse by wrapping it in a CREATE VIEW AS or CREATE TABLE AS statement. By default dbt is configured to persist those SELECT SQL statements as views, however, this behaviour can be modified to take advantage of other materialization options e.g. table, ephemeral or incremental. Each type of materialization has its advantages and disadvantages and should be evaluated based on a specify use case and scenario. The following image depicts core pros and cons associated with each approach.<\/p>\n<p><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2022\/09\/DBT_Materialization_Types.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-4576\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2022\/09\/DBT_Materialization_Types.png\" alt=\"\" width=\"580\" height=\"648\" srcset=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_Materialization_Types.png 656w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_Materialization_Types-268x300.png 268w\" sizes=\"auto, (max-width: 580px) 100vw, 580px\" \/><\/a><\/p>\n<p class=\"Standard\" style=\"text-align: justify;\">Before we dive headfirst into creating dbt models, first, let\u2019s explore some high-level guiding principles around structuring our project files. The team at dbt recommends organizing your models into at least two different folders \u2013 staging and marts. In a simple project, these may be the only models we build; more complex projects may have a number of intermediate models that provide a better logical separation as well as accessories to these models.<\/p>\n<ul>\n<li style=\"text-align: justify;\">Staging models &#8211; the atomic unit of data modeling. Each model bears a one-to-one relationship with the source data table it represents. It has the same granularity, but the columns have been renamed, recast, or standardized into a consistent format<\/li>\n<li style=\"text-align: justify;\">Marts models &#8211; models that represent business processes and entities, abstracted from the data sources that they are based on. Where the work of staging models is limited to cleaning and preparing, fact tables are the product of substantive data transformation: choosing (and reducing) dimensions, date-spinning, executing business logic, and making informed decisions based on business requirements<\/li>\n<\/ul>\n<p class=\"Standard\" style=\"text-align: justify;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2022\/09\/DBT_Initial_Project_Structure.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-4541 alignleft\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2022\/09\/DBT_Initial_Project_Structure.png\" alt=\"\" width=\"275\" height=\"255\" \/><\/a>Sometimes, mainly due to the level of data complexity or additional security requirements, further logical separation is recommended. In this case &#8216;Sources&#8217; models layer is introduced before data is loaded into the Staging layer. Sources store schemas and tables in a source-conformed structure (i.e. tables and columns in a structure based on what an API returns), loaded by a third party tool.<\/p>\n<p class=\"Standard\" style=\"text-align: justify;\">Because we often work with multiple data sources, in our Staging and Marts directories, we create one folder per source \u2013 in our case, since we\u2019re only working with a single source, we will simply call these google_analytics. Conforming to the dbt minimum standards for project organization and layout i.e. Staging and Marts layers, let\u2019s create the required folders so that the overall structure looks as the one on the left.<\/p>\n<p class=\"Standard\" style=\"text-align: justify;\">At this point we should have everything in place to build our first model based on the table we created in the Azure SQL Server DB in the previous post. Creating simple models is dbt is a straightforward affair and in this case it\u2019s just a SELECT SQL statement. To begin, in our staging\\google_analytics folder we create a SQL file, name it after the source table we will be staging and save it with the following two-line statement.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n{{ config(materialized='table') }}\r\nSELECT * FROM ga_data\r\n<\/pre>\n<p class=\"Standard\" style=\"text-align: justify;\">The top line simply tells dbt to materialize the output as a physical table (default is a view) and in doing that select everything from our previously created dbo.ga_data table into a new stg.ga_data table. dbt uses Jinja templating language, making a dbt project an ideal programming environment for SQL. With Jinja, we can do transformations which are not typically possible in SQL, for example, using environment variables or macros to abstract snippets of SQL, which is analogous to functions in most programming languages. Whenever you see a {{ &#8230; }}, you&#8217;re already using Jinja.<\/p>\n<p class=\"Standard\" style=\"text-align: justify;\">To execute this model, we will simply issue &#8216;dbt run&#8217; command (here with an optional parameter &#8216;&#8211;select staging&#8217;, denoting the name of the model we want to compile) and the output should tell us that we successfully created a staging version of our ga_data table.<\/p>\n<p class=\"Standard\" style=\"text-align: justify;\"><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2022\/09\/DBT_First_Model_Execution_Output.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-4545\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2022\/09\/DBT_First_Model_Execution_Output.png\" alt=\"\" width=\"580\" height=\"188\" srcset=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_First_Model_Execution_Output.png 754w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_First_Model_Execution_Output-300x97.png 300w\" sizes=\"auto, (max-width: 580px) 100vw, 580px\" \/><\/a><\/p>\n<p class=\"Standard\" style=\"text-align: justify;\">Obviously, in order to build more complete analytics, we need to combine data from across multiple tables and data sources so let\u2019s create another table called \u2018ga_geo_ref_data\u2019 containing latitude, longitude and display name values using Geopy Python library. Geopy makes it easy to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders. This will provide us with an additional reference data which we will blend with the core &#8216;ga_data&#8217; table\/model and create a single, overarching dataset containing both: Google Analytics data and reference Geo data used to enrich it.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\nfrom pathlib import PureWindowsPath\r\nimport pyodbc\r\nimport pandas as pd\r\nfrom geopy.geocoders import Nominatim\r\n\r\n_SQL_SERVER_NAME = 'gademosqlserver2022.database.windows.net'\r\n_SQL_DB = 'sourcedb'\r\n_SQL_USERNAME = 'testusername'\r\n_SQL_PASSWORD = 'MyV3ry$trongPa$$word'\r\n_SQL_DRIVER = '{ODBC Driver 18 for SQL Server}'\r\n\r\ngeolocator = Nominatim(user_agent='testapp')\r\n\r\n\r\ndef enrich_with_geocoding_vals(row, val):\r\n    loc = geolocator.geocode(row, exactly_one=True, timeout=10)\r\n    if val == 'lat':\r\n        if loc is None:\r\n            return -1\r\n        else:\r\n            return loc.raw&#x5B;'lat']\r\n    if val == 'lon':\r\n        if loc is None:\r\n            return -1\r\n        else:\r\n            return loc.raw&#x5B;'lon']\r\n    if val == 'name':\r\n        if loc is None:\r\n            return 'Unknown'\r\n        else:\r\n            return loc.raw&#x5B;'display_name']\r\n    else:\r\n        pass\r\n\r\n\r\ntry:\r\n    with pyodbc.connect('DRIVER='+_SQL_DRIVER+';SERVER='+_SQL_SERVER_NAME+';PORT=1433;DATABASE='+_SQL_DB+';UID='+_SQL_USERNAME+';PWD=' + _SQL_PASSWORD) as conn:\r\n        with conn.cursor() as cursor:\r\n            if not cursor.tables(table='ga_geo_ref_data', tableType='TABLE').fetchone():\r\n                cursor.execute('''CREATE TABLE dbo.ga_geo_ref_data (ID INT IDENTITY (1,1),\r\n                                                            Country NVARCHAR (256),\r\n                                                            City NVARCHAR (256),\r\n                                                            Latitude DECIMAL(12,8),\r\n                                                            Longitude DECIMAL(12,8),\r\n                                                            Display_Name NVARCHAR (1024))''')\r\n            cursor.execute('TRUNCATE TABLE dbo.ga_geo_ref_data;')\r\n            query = &quot;SELECT country, city FROM dbo.ga_data WHERE city &lt;&gt; '' AND country &lt;&gt; '' GROUP BY country, city;&quot;\r\n            df = pd.read_sql(query, conn)\r\n            df&#x5B;'latitude'] = df&#x5B;'city'].apply(\r\n                enrich_with_geocoding_vals, val='lat')\r\n            df&#x5B;'longitude'] = df&#x5B;'city'].apply(\r\n                enrich_with_geocoding_vals, val='lon')\r\n            df&#x5B;'display_name'] = df&#x5B;'city'].apply(\r\n                enrich_with_geocoding_vals, val='name')\r\n            for index, row in df.iterrows():\r\n                cursor.execute('''INSERT INTO dbo.ga_geo_ref_data\r\n                                    (Country,\r\n                                    City,\r\n                                    Latitude,\r\n                                    Longitude,\r\n                                    Display_Name)\r\n                          values (?, ?, ?, ?, ?)''',\r\n                               row&#x5B;0], row&#x5B;1], row&#x5B;2], row&#x5B;3], row&#x5B;4])\r\n\r\n            cursor.execute('SELECT TOP (1) 1 FROM dbo.ga_geo_ref_data')\r\n            rows = cursor.fetchone()\r\n            if rows:\r\n                print('All Good!')\r\n            else:\r\n                raise ValueError(\r\n                    'No data generated in the source table. Please troubleshoot!'\r\n                )\r\nexcept pyodbc.Error as ex:\r\n    sqlstate = ex.args&#x5B;1]\r\n    print(sqlstate)\r\n<\/pre>\n<p class=\"Standard\" style=\"text-align: justify;\">We will also materialize this table using the same technique we tested before and now we should be in a position to create our first data mart object, combining ga_geo_ref_data and ga_data into a single table.<\/p>\n<p><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2022\/09\/DBT_Stg_to_Mart_Load_View.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-4636\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2022\/09\/DBT_Stg_to_Mart_Load_View.png\" alt=\"\" width=\"580\" height=\"363\" srcset=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_Stg_to_Mart_Load_View.png 677w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_Stg_to_Mart_Load_View-300x188.png 300w\" sizes=\"auto, (max-width: 580px) 100vw, 580px\" \/><\/a><\/p>\n<p class=\"Standard\" style=\"text-align: justify;\">This involves creating another SQL file, this time in our marts\\google_analytics folder, and using the following query to blend these two data sets together.<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\n{{ config(materialized='table') }}\r\n\r\nSELECT ga.*, ref_ga.Latitude, ref_ga.Longitude, ref_ga.Display_Name \r\nFROM {{ ref('ga_data') }} ga\r\nLEFT JOIN {{ ref('ga_geo_ref_data') }} ref_ga\r\nON ga.country = ref_ga.country \r\nAND ga.city = ref_ga.city\r\n<\/pre>\n<p class=\"Standard\" style=\"text-align: justify;\">As with one of the previous queries, we\u2019re using the familiar configuration syntax in the first line but there is also an additional reference configuration applied which uses the most important function in dbt &#8211; the ref() function. For building more complex models, ref() function is very handy as it allows us to refer to other models. ref() is, under the hood, actually doing two important things. First, it is interpolating the schema into our model file to allow us to change our deployment schema via configuration. Second, it is using these references between models to automatically build the dependency graph. This will enable dbt to deploy models in the correct order when using &#8216;dbt run&#8217; command.<\/p>\n<p class=\"Standard\" style=\"text-align: justify;\">If we were to run this model as is, dbt would concatenate our default schema name (as expressed in the profiles.yml file) with the schema we would like to output it into. It\u2019s a default behavior which we need to override using a macro. Therefore, to change the way dbt generates a schema name, we should add a macro named generate_schema_name to the project, where we can then define our own logic. We will place the following bit of code in the macros folder in our solution and define our custom schema name in the dbt_project.yml file as per below<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\n{% macro generate_schema_name(custom_schema_name, node) -%}\r\n    {%- set default_schema = target.schema -%}\r\n    {%- if custom_schema_name is none -%}\r\n        {{ default_schema }}\r\n    {%- else -%}\r\n        {{ custom_schema_name | trim }}\r\n    {%- endif -%}\r\n{%- endmacro %}\r\n<\/pre>\n<pre class=\"brush: yaml; title: ; notranslate\" title=\"\">\r\nname: 'azure_sql_demo'\r\nversion: '1.0.0'\r\nconfig-version: 2\r\n\r\n# This setting configures which 'profile' dbt uses for this project.\r\nprofile: 'azure_sql_demo'\r\n\r\n# These configurations specify where dbt should look for different types of files.\r\n# The 'model-paths' config, for example, states that models in this project can be\r\n# found in the 'models\/' directory. You probably won't need to change these!\r\nmodel-paths: &#x5B;'models']\r\nanalysis-paths: &#x5B;'analyses']\r\ntest-paths: &#x5B;'tests']\r\nseed-paths: &#x5B;'seeds']\r\nmacro-paths: &#x5B;'macros']\r\nsnapshot-paths: &#x5B;'snapshots']\r\n\r\ntarget-path: 'target'  # directory which will store compiled SQL files\r\nclean-targets:         # directories to be removed by `dbt clean`\r\n  - 'target'\r\n  - 'dbt_packages'\r\n\r\n# Configuring models\r\n# Full documentation: https:\/\/docs.getdbt.com\/docs\/configuring-models\r\nmodels:\r\n  azure_sql_demo:\r\n    staging:\r\n      +materialized: view\r\n      +schema: stg\r\n    marts:\r\n      +materialized: view\r\n      +schema: mart\r\n<\/pre>\n<p class=\"Standard\" style=\"text-align: justify;\">With everything in place, we can now save our SQL code into a file called ga_geo.sql and execute dbt to materialize it in a mart schema using &#8216;dbt run &#8211;select marts&#8217; command. On model built completion, our first mart table should be persisted in the database as per the image below (click to enlarge).<\/p>\n<p><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2022\/09\/DBT_First_Mart_Model_Content.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-4571\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2022\/09\/DBT_First_Mart_Model_Content.png\" alt=\"\" width=\"580\" height=\"389\" srcset=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_First_Mart_Model_Content.png 1065w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_First_Mart_Model_Content-300x201.png 300w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_First_Mart_Model_Content-1024x687.png 1024w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_First_Mart_Model_Content-768x515.png 768w\" sizes=\"auto, (max-width: 580px) 100vw, 580px\" \/><\/a><\/p>\n<p class=\"Standard\" style=\"text-align: justify;\">Another great feature of dbt is the ability to create snapshots which is synonymous with the concept of Slowly Changing Dimensions (SCD) in data warehousing. Snapshots implement Type-2 SCD logic, identifying how a row in a table changes over time. If we were to issue an ALTER statement to our &#8216;ga_data&#8217; table located in the staging schema and add one extra column denoting when the row was created or updated, we could track it\u2019s history using a typical SCD-2 pattern i.e. expiring and watermarking rows which were changed or added using a combination of GUIDs and date attributes.<\/p>\n<p class=\"Standard\" style=\"text-align: justify;\">For this demo let&#8217;s execute the following SQL statement to alter our ga_data object, adding a new field called UpdatedAt with a default value of current system timestamp.<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\nALTER TABLE stg.ga_data\r\nADD UpdatedAt DATETIME DEFAULT SYSDATETIME()\r\n<\/pre>\n<p class=\"Standard\" style=\"text-align: justify;\">Once we have our changes in place we can add the following SQL to our solution under the snapshots node and call it ga_data_snapshot.sql.<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">\r\n{% snapshot ga_data_snapshot %}\r\n    {{\r\n        config(\r\n          target_schema = 'staging',\r\n          unique_key = 'ID',\r\n          strategy = 'check',\r\n          check_col = 'all'\r\n        )\r\n    }}\r\n    SELECT * FROM ga_data\r\n{% endsnapshot %}\r\n<\/pre>\n<p class=\"Standard\" style=\"text-align: justify;\">Next, running &#8216;dbt snapshot&#8217; command a new table in the staging schema is created and a few additional attributes added to allow for SCD Type-2 tracking (click on image to enlarge).<\/p>\n<p><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2022\/09\/DBT_Snapshot_SQL_View.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-4623\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2022\/09\/DBT_Snapshot_SQL_View.png\" alt=\"\" width=\"580\" height=\"283\" srcset=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_Snapshot_SQL_View.png 1361w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_Snapshot_SQL_View-300x147.png 300w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_Snapshot_SQL_View-1024x500.png 1024w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_Snapshot_SQL_View-768x375.png 768w\" sizes=\"auto, (max-width: 580px) 100vw, 580px\" \/><\/a><\/p>\n<p class=\"Standard\" style=\"text-align: justify;\">Snapshots are a powerful feature in dbt that facilitate keeping track of our mutable data through time and generally, they&#8217;re as simple as creating any other type of model. This allows for very simple implementation of the SCD Type-2 pattern, with no complex MERGE or UPDATE and INSERT (upsert) SQL statements required.<\/p>\n<h3 style=\"text-align: center;\">DBT Tests<\/h3>\n<p class=\"Standard\" style=\"text-align: justify;\">Testing data solutions has been notoriously difficult and data validation and QA has always been an afterthought. Best case scenario, third party applications had to be used to guarantee data conformance and minimal standards. Worst case, tests were not developed at all and the job of validating the final output fell on analysts or report developers, eyeballing dashboards before pushing them into production. In extreme cases, I even saw customers being delegated to the roles of unsuspected testers, having raising support tickets due to dashboards coming up empty.<\/p>\n<p class=\"Standard\" style=\"text-align: justify;\">In dbt, tests are assertions we make about the models and other resources in our dbt project. When we run dbt test, dbt will tell us if each test in our project passes or fails. There are two ways of defining tests in dbt:<\/p>\n<ul>\n<li style=\"text-align: justify;\">A singular test is testing in its simplest form: If we can write a SQL query that returns failing rows, we can save that query in a .sql file within our test directory. It&#8217;s now a test, and it will be executed by the dbt test command<\/li>\n<li style=\"text-align: justify;\">A generic test is a parametrized query that accepts arguments. The test query is defined in a special test block (similar to a macro). Once defined, you we reference the generic test by name throughout our .yml files \u2014 define it on models, columns, sources, snapshots, and seeds. dbt ships with four generic tests built in: unique, not_null, accepted_values and relationships<\/li>\n<\/ul>\n<p class=\"Standard\" style=\"text-align: justify;\">In our scenario, we will use generic tests to ensure that:<\/p>\n<ul>\n<li style=\"text-align: justify;\">The id field on our ga_geo_ref_data is unique and does not contain any NULL values<\/li>\n<li style=\"text-align: justify;\">The DeviceCategory attribute should only contain a list of accepted values<\/li>\n<\/ul>\n<p class=\"Standard\" style=\"text-align: justify;\">Test definitions are stored in our staging directory, in a yml file called &#8216;schema.yml&#8217; and once we issue dbt test command, the following output is generated, denoting all test passed successfully.<\/p>\n<p><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2022\/09\/DBT_Test_YML_File_And_Execution_Output.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-4567\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2022\/09\/DBT_Test_YML_File_And_Execution_Output.png\" alt=\"\" width=\"580\" height=\"507\" srcset=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_Test_YML_File_And_Execution_Output.png 935w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_Test_YML_File_And_Execution_Output-300x262.png 300w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_Test_YML_File_And_Execution_Output-768x672.png 768w\" sizes=\"auto, (max-width: 580px) 100vw, 580px\" \/><\/a><\/p>\n<h3 style=\"text-align: center;\">DBT Docs<\/h3>\n<p class=\"Standard\" style=\"text-align: justify;\">One of the great features of dbt is the fact we can easily generate a fully self-documenting solution, together with a lineage graph that helps easily navigate through the nodes and understand the hierarchy. dbt provides a way to generate documentation for our dbt project and render it as a website. The documentation includes the following:<\/p>\n<ul>\n<li style=\"text-align: justify;\">Information about our project: including model code, a DAG of our project, any tests we&#8217;ve added to a column, and more<\/li>\n<li style=\"text-align: justify;\">Information about our data warehouse: including column data types, and table sizes. This information is generated by running queries against the information schema<\/li>\n<\/ul>\n<p class=\"Standard\" style=\"text-align: justify;\">Running &#8216;dbt docs generate&#8217; command instructs dbt to compiles all relevant information about our project and warehouse into manifest.json and catalog.json files. Next, executing &#8216;dbt docs serve&#8217; starts a local web server (http:\/\/127.0.0.1:8080) and allows dbt to use these JSON files to generate a local website. We can see a representation of the project structure, a markdown description for a model, and a list of all of the columns (with documentation) in the model. Additionally, we can click the green button in the bottom-right corner of the webpage to expand a &#8216;mini-map&#8217; of our DAG with, relevant lineage information (click on image to expand).<\/p>\n<p><a href=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2022\/09\/DBT_GA_Data_Docs_And_DAG.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-4569\" src=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/\/2022\/09\/DBT_GA_Data_Docs_And_DAG.png\" alt=\"\" width=\"580\" height=\"559\" srcset=\"http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_GA_Data_Docs_And_DAG.png 1270w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_GA_Data_Docs_And_DAG-300x289.png 300w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_GA_Data_Docs_And_DAG-1024x987.png 1024w, http:\/\/bicortex.com\/bicortex\/wp-content\/post_content\/2022\/09\/DBT_GA_Data_Docs_And_DAG-768x740.png 768w\" sizes=\"auto, (max-width: 580px) 100vw, 580px\" \/><\/a><\/p>\n<h3 style=\"text-align: center;\">Conclusion<\/h3>\n<p class=\"Standard\" style=\"text-align: justify;\">I barely scratched the surface of dbt can do to establish a full-fledged framework for data transformations using SQL and it looks like the company is not resting on its laurels, adding more features and partnering with other vendors. From my limited time with dbt, the key benefits that allow it to stand out in the sea of other tools include:<\/p>\n<ul>\n<li style=\"text-align: justify;\">Quickly and easily provide clean, transformed data ready for analysis: dbt enables data analysts to custom-write transformations through SQL SELECT statements. There is no need to write boilerplate code. This makes data transformation accessible for analysts that don\u2019t have extensive experience in other programming languages, as the initial learning curve is quite low<\/li>\n<li style=\"text-align: justify;\">Apply software engineering practices\u2014such as modular code, version control, testing, and continuous integration\/continuous deployment (CI\/CD)\u2014to analytics code. Continuous integration means less time testing and quicker time to development, especially with dbt Cloud. You don\u2019t need to push an entire repository when there are necessary changes to deploy, but rather just the components that change. You can test all the changes that have been made before deploying your code into production. dbt Cloud also has integration with GitHub for automation of your continuous integration pipelines, so you won\u2019t need to manage your own orchestration, which simplifies the process<\/li>\n<li style=\"text-align: justify;\">Build reusable and modular code using Jinja. dbt allows you to establish macros and integrate other functions outside of SQL&#8217;s capabilities for advanced use cases. Macros in Jinja are pieces of code that can be used multiple times. Instead of starting at the raw data with every analysis, analysts instead build up reusable data models that can be referenced in subsequent work<\/li>\n<li style=\"text-align: justify;\">Maintain data documentation and definitions within dbt as they build and develop lineage graphs: Data documentation is accessible, easily updated, and allows you to deliver trusted data across the organization. dbt automatically generates documentation around descriptions, models dependencies, model SQL, sources, and tests. dbt creates lineage graphs of the data pipeline, providing transparency and visibility into what the data is describing, how it was produced, as well as how it maps to business logic<\/li>\n<li style=\"text-align: justify;\">Perform simplified data refreshes within dbt Cloud: There is no need to host an orchestration tool when using dbt Cloud. It includes a feature that provides full autonomy with scheduling production refreshes at whatever cadence the business wants<\/li>\n<\/ul>\n<p class=\"Standard\" style=\"text-align: justify;\">Obviously, dbt is not a perfect solution and some of its current shortcomings include:<\/p>\n<ul>\n<li style=\"text-align: justify;\">It covers only the T of ETL, so you will need separate tools to perform Extraction and Load<\/li>\n<li style=\"text-align: justify;\">It&#8217;s SQL based; it might offer less readability compared with tools that have an interactive UI<\/li>\n<li style=\"text-align: justify;\">Sometimes circumstances necessitate rewriting of macros used at the backend. Overriding this standard behavior of dbt requires knowledge and expertise in handling source code<\/li>\n<li style=\"text-align: justify;\">Integrations with many (less popular) database engines are missing or handled by community-supported adapters e.g. Microsoft SQL Server, SQLite, MySQL<\/li>\n<\/ul>\n<p class=\"Standard\" style=\"text-align: justify;\">All in all, I really enjoyed the multitude of features dbt carries in its arsenal and understand why it&#8217;s become the favorite horse in the race to dominate minds and hearts of analytics and data engineers. It&#8217;s a breath of fresh air, as its main focus is SQL &#8211; a novel approach in the landscape dominated by Python and Scala, it runs in the cloud and on-premises and has good external vendors&#8217; support. Additionally, it has some bells and whistles which typically involve integrating with other 3rd party tooling e.g. tests and docs. Finally, at its core, it&#8217;s an open-source product and as such, anyone can take it for a spin and start building extensible, modular and reusable &#8216;plumbing&#8217; for your next project.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In Part 1 of this series, I went over the high-level introduction to dbt, stood up a small development SQL Server database in Azure, acquired a sample size of Google Analytics data to populate a few tables used in this post and finally installed and configured dbt on a local environment. In this post I\u2019d [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[61,36,56,5],"tags":[65,62,58,79,87,18,24,41,49,19],"class_list":["post-4536","post","type-post","status-publish","format-standard","hentry","category-cloud-computing","category-data-mining","category-programming","category-sql","tag-azure","tag-cloud-computing","tag-data-modelling","tag-data-warehouse","tag-dbt","tag-microsoft","tag-programming","tag-python","tag-sql","tag-sql-server"],"aioseo_notices":[],"jetpack_featured_media_url":"","_links":{"self":[{"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts\/4536","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/comments?post=4536"}],"version-history":[{"count":67,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts\/4536\/revisions"}],"predecessor-version":[{"id":4639,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/posts\/4536\/revisions\/4639"}],"wp:attachment":[{"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/media?parent=4536"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/categories?post=4536"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/bicortex.com\/bicortex\/wp-json\/wp\/v2\/tags?post=4536"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}