That’s dbt? WOW!

2022-10-13

If you’re a data enthusiast, it’s definitely worth trying. Dbt (data build tool) is a relatively fresh (v. 1.2), open-source tool that perfectly fills a gap in the data engineering stack. It’s dedicated to data engineers who love to code and hate to drag-and-drop; particularly to analytics engineers who work somewhere between data engineers and data analysts, or share both functions at the same time. Its main interface (dbt Core) is limited to the command line and works with my IDE (IntelliJ). It provides exactly what I love to work with. It’s up to me how I use it.

Out of the box we get many features; the best is the integration with a git repository. That elevates data project management to a new level and puts analytics engineers on par with software engineers. Dbt has a big community on GitHub, Slack and Stack Overflow that literally develops the tool there.

You can run a dbt model from the web interface, the command line, or any terminal with the dbt CLI installed:

dbt run --profile dev --target local --select dim_clients

The dbt run command executes the dim_clients model (--select dim_clients) using the dev profile (--profile dev) to connect to the local target (--target local). The syntax is exactly the same on every operating system, so you can work on Linux while others work on macOS or Windows. Objects are stored in a folder structure that represents dbt resource types, like models, seeds, snapshots and sources, and we can use sub-folders to group our models the way we like best, as in the layout sketched below.
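
A minimal project layout might look like this (the project and file names are my own illustration):

my_dbt_project/
├── dbt_project.yml          # project-level configuration
├── models/
│   ├── staging/
│   │   ├── sources.yml      # source definitions, e.g. src_dev
│   │   └── stg_clients.sql
│   └── marts/
│       └── dim_clients.sql
├── seeds/                   # static CSV data loaded with dbt seed
├── snapshots/               # SCD type 2 definitions
└── macros/                  # user-defined Jinja macros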

Take a look at a sample model:

{{
  config(
    materialized='incremental',
    tags=["hourly"],
    alias='clients',
    unique_key='client_id',
    post_hook=[
      "CREATE INDEX IF NOT EXISTS idx_{{ this.name }}_client_id ON {{ this }} (client_id)",
      "CREATE INDEX IF NOT EXISTS idx_{{ this.name }}_updated_at ON {{ this }} (updated_at DESC)"
    ]
  )
}}

SELECT
  id AS client_id,
  name AS client_name,
  created_at,
  updated_at,
  CURRENT_TIMESTAMP AS loaded_at
FROM
  {{ source('src_dev', 'clients') }}
{% if is_incremental() %}
  WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}

The model looks SQL-ish, with some Jinja injections and additional sections that are uncommon in other SQL projects I have worked on. It’s a bit confusing at first glance, but it all works together smoothly. As a result we get plain SQL code (with the Jinja rendered) that may be compiled (dbt compile) or executed (dbt run) against the database engine. I can compare it to smart dynamic SQL that gets all the benefits of functional programming. Jinja is a Python-ish templating engine used here to achieve DRY (don’t repeat yourself) coding. There are community packages with smart snippets and hacks for dbt available on the Internet. I believe it’s just the beginning; the potential of the tool and of the community that builds it goes way beyond that.
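
To see the DRY idea in action, here is a minimal sketch of a user-defined macro; the macro name, its logic and the stg_payments model are my own illustration, not part of the project above:

{% macro cents_to_dollars(column_name, precision=2) %}
  round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}

-- then, in any model:
SELECT {{ cents_to_dollars('amount_cents') }} AS amount_usd
FROM {{ ref('stg_payments') }}

Every model that needs the conversion calls the macro instead of repeating the expression, so the logic lives in exactly one place.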

Going back to the code: the first section plays an essential configuration role.

{{
  config(
    materialized='incremental',
    tags=["hourly"],
    alias='clients',
    unique_key='client_id',
    post_hook=[
      "CREATE INDEX IF NOT EXISTS idx_{{ this.name }}_client_id ON {{ this }} (client_id)",
      "CREATE INDEX IF NOT EXISTS idx_{{ this.name }}_updated_at ON {{ this }} (updated_at DESC)"
    ]
  )
}}

In the {{ config }} block we can specify model parameters, which gives us a lot of flexibility in how the model is executed, plus extra features like pre_hook or post_hook. These configurations are model-specific, but we can also set parameters at other levels. For example, dbt allows us to execute custom commands as hooks before or after a whole run. It’s like having a trigger, but at the database level. Smart! Some features are handled by generic macros, like the {% if is_incremental() %} block quoted below. Others may be developed by the user or found on Stack Overflow. Macros, models and files are available in open-source packages.
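
As for the run-level hooks: a minimal sketch of how they could be declared in dbt_project.yml, with illustrative SQL statements:

# dbt_project.yml (fragment)
on-run-start:
  - "CREATE SCHEMA IF NOT EXISTS staging"
on-run-end:
  - "GRANT SELECT ON ALL TABLES IN SCHEMA staging TO reporter"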

{% if is_incremental() %}
  WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}

This section is responsible for the incremental load. Likewise, other commonly used data warehousing features, such as slowly changing dimensions type 2 (called snapshots here), are managed by generic or user-defined macros.
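
For comparison, a minimal snapshot sketch built on the same clients source, assuming the timestamp strategy (the file would live in the snapshots/ folder):

{% snapshot clients_snapshot %}

{{
  config(
    target_schema='snapshots',
    unique_key='client_id',
    strategy='timestamp',
    updated_at='updated_at'
  )
}}

SELECT
  id AS client_id,
  name AS client_name,
  updated_at
FROM
  {{ source('src_dev', 'clients') }}

{% endsnapshot %}

Each dbt snapshot run compares incoming rows with the stored ones and opens or closes validity ranges (dbt_valid_from, dbt_valid_to) automatically.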

(Figure: model timing)

Besides extended coding, dbt gives us a kind of orchestrator and scheduler (in the dbt Cloud version). It builds the queue and sets the order of model execution based on dependencies captured by dbt. That’s why, instead of source.clients, we write {{ source('src_dev', 'clients') }}: it’s an internal reference to a model or source in our project, not a hard-coded database table. Dbt is so smart that when we run it with a graph operator, e.g. --select +dim_clients+, the engine picks up all upstream and downstream models and executes them in the correct order. When we add a new model to the project, the dependency graph puts it in the correct place in the queue without even telling us. That’s what dbt essentially is.
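
A few selection examples with the graph operator (dbt 1.x node selection syntax):

dbt run --select dim_clients     # just this model
dbt run --select +dim_clients    # dim_clients plus everything it depends on
dbt run --select dim_clients+    # dim_clients plus everything that depends on it
dbt run --select +dim_clients+   # the full lineage, upstream and downstream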

(Figure: DAG (Directed Acyclic Graph))

What’s more? Testing, both generic and custom. The tool provides a whole module responsible for it. By default, we may check that generated model columns are:

  • unique
  • not_null
  • contain accepted_values
  • consistent with related tables (the relationships test, i.e. referential integrity)

or run custom SQL scripts to verify integrity or business assumptions. Tests are run with dbt test, independently from models; a minimal sketch follows.
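
Generic tests for the model above are declared in YAML (the file name and location are my own choice):

# models/marts/dim_clients.yml (fragment)
version: 2

models:
  - name: dim_clients
    columns:
      - name: client_id
        tests:
          - unique
          - not_null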

Last, but not least, there is documentation. In dbt we can write documentation next to the code in .yml files. It’s nice to document while coding, in the same tool, using the same framework and filesystem. All docs are generated automatically as a web page or a JSON file containing all the metadata, our descriptions and DAG diagrams.
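
In dbt Core the two relevant commands are:

dbt docs generate    # compiles the project and writes the documentation artifacts (JSON)
dbt docs serve       # serves the generated documentation site locally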

These features make dbt a very universal tool. The steps we perform while data wrangling are not encapsulated in blocks that we drag-and-drop or connect with arrows, as we are used to; it’s all about writing SQL and Jinja code.

You might ask: what’s wrong with it? The answer: it is still fresh; version 1.2 was released only recently. However, the time when it was a novel tool that could be deemed too risky has already passed, and it is now commonly referred to as part of the modern data stack. There is rapidly growing interest in dbt in the data engineering community. One thing has to be stressed before finishing: not all database adapters are available at the moment, and some are only community-supported. Another controversial fact is that dbt features are developed independently across database adapters. It was a bit awkward when I found out that one really useful feature was implemented for just one adapter. It would be nice to have alignment across all platforms where possible.

I recommend dbt for all data projects that we do not wish to over-engineer. The entry level is relatively low, deployment is really simple, and it can be cloud-based; you get a lot in a very short time. However, the v1.2 label may cause some to perceive it as too immature to replace old-fashioned ETL tools. It’s more about the inertia of large organizations, which makes them resistant to adopting new tools as they appear. I believe decision makers need some time to understand its potential. Additionally, there are still relatively few engineers with hands-on dbt experience. I am looking forward to seeing what features come with new releases.

Originally posted on: https://wasimara.hashnode.dev/thats-dbt-wow