
That’s dbt? WOW!

If you’re a data enthusiast, it’s definitely worth trying. dbt (data build tool) is a relatively fresh (v1.2), open-source tool that perfectly fills a gap in the data engineering stack. It’s dedicated to data engineers who love to code and hate drag-and-drop. In particular, it is for analytics engineers, who work somewhere between data engineers and data analysts or share both functions at the same time. Its main interface (dbt Core) is limited to the command line and works with my IDE (IntelliJ). It provides exactly what I love to work with, and it’s up to me how I use it.

Out of the box we get many features; the best one is the integration with a git repository. That elevates data project management to a new level and puts analytics engineers on par with software engineers. dbt has a big community on GitHub, Slack and Stack Overflow that literally develops the tool there.

You can run a dbt model from the web interface, the command line or any terminal with the dbt CLI installed:

dbt run --profile dev --target local --select dim_clients

The dbt run command executes the model chosen with --select dim_clients, using --profile dev and --target local to pick the connection. The syntax is exactly the same for all operating systems: you can work on Linux while others work on macOS or anything else. Objects are stored in a folder structure that represents dbt relation types, like models, seeds, snapshots and sources. We can use sub-folders to group our models the way we like best.
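For illustration, a small dbt project could be laid out roughly like this (all names here are just examples):

my_dbt_project/
├── dbt_project.yml        -- project-level configuration
├── models/
│   ├── staging/
│   │   └── stg_clients.sql
│   └── marts/
│       ├── dim_clients.sql
│       └── schema.yml     -- tests and documentation
├── seeds/
├── snapshots/
└── macros/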

Take a look at a sample model:

{{
  config(
    materialized='incremental',
    tags=["hourly"],
    alias='clients',
    unique_key='client_id',
    post_hook=[
      "CREATE INDEX IF NOT EXISTS idx_{{ this.name }}_client_id ON {{ this }} (client_id)",
      "CREATE INDEX IF NOT EXISTS idx_{{ this.name }}_updated_at ON {{ this }} (updated_at DESC)"
    ]
  )
}}

SELECT
  id AS client_id,
  name AS client_name,
  created_at,
  updated_at,
  CURRENT_TIMESTAMP AS loaded_at
FROM
  {{ source('src_dev', 'clients') }}
{% if is_incremental() %}
  WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}

The model looks SQL-ish, has some Jinja injections and additional sections that are uncommon in other SQL projects I have worked on. It’s a bit confusing at first look, but it all works together smoothly. As a result we get SQL code (with the Jinja resolved) that may be compiled (dbt compile) or executed (dbt run) against the database engine. I can compare it to smart, dynamic SQL that gets all the benefits of macros and templating. Jinja is a Python-ish templating engine, used in this case to achieve DRY (don’t repeat yourself) coding. There are community packages containing smart snippets for dbt, and plenty of hacks are available on the Internet. I believe it’s just the beginning; the potential of the tool and of the community that builds it goes way beyond that.
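As a small, hypothetical illustration of the DRY idea, a repeated expression can be moved into a user-defined macro and reused across models (the macro name, the orders source and its columns below are made up):

-- macros/cents_to_dollars.sql (hypothetical)
{% macro cents_to_dollars(column_name, precision=2) %}
  round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}

-- used in any model instead of repeating the formula:
SELECT
  id AS order_id,
  {{ cents_to_dollars('amount_cents') }} AS amount_usd
FROM
  {{ source('src_dev', 'orders') }}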

Going back to the code: the first section plays an essential configuration role.

{{
  config(
    materialized='incremental',
    tags=["hourly"],
    alias='clients',
    unique_key='client_id',
    post_hook=[
      "CREATE INDEX IF NOT EXISTS idx_{{ this.name }}_client_id ON {{ this }} (client_id)",
      "CREATE INDEX IF NOT EXISTS idx_{{ this.name }}_updated_at ON {{ this }} (updated_at DESC)"
    ]
  )
}}

In the {{ config() }} block we can specify model parameters, which gives us a lot of flexibility in how the model is executed, plus other features like pre_hooks or post_hooks. These configurations are model-specific, but many parameters can also be applied at other levels, for a whole folder of models or the entire project. For example, dbt allows us to execute custom commands (hooks) before or after a model is built, or even before or after the transaction commits. It’s like having a trigger, but at the db level. Smart! Some features are handled by generic macros, like {% if is_incremental() %}. Others may be developed by the user or found on Stack Overflow. Macros, models and files are also available in open-source packages.
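Roughly the same configuration can also be applied to whole folders of models in dbt_project.yml; a minimal sketch, assuming a project called my_dbt_project with staging and marts sub-folders:

# dbt_project.yml (fragment, illustrative)
models:
  my_dbt_project:
    staging:
      +materialized: view
    marts:
      +materialized: table
      +tags: ["hourly"]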

{% if is_incremental() %}
  WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}

This section is responsible for the incremental load. Other commonly used data warehousing features, like slowly changing dimensions type 2 (called snapshots here), are likewise managed by generic or user-defined macros.
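For illustration, a minimal snapshot could look like this; it assumes the same src_dev.clients source as above, while the snapshot name and target_schema are just examples:

-- snapshots/clients_snapshot.sql
{% snapshot clients_snapshot %}

{{
  config(
    target_schema='snapshots',
    unique_key='id',
    strategy='timestamp',
    updated_at='updated_at'
  )
}}

SELECT * FROM {{ source('src_dev', 'clients') }}

{% endsnapshot %}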

An example of dbt model timing visualisation

Besides the extended coding experience, dbt gives us a kind of orchestrator, and a scheduler in the dbt Cloud version. It builds the queue and sets the order of model execution based on dependencies that dbt captures itself. That’s why, instead of referencing the table directly, we use {{ source('src_dev', 'clients') }}: it’s an internal reference to a source (or, via ref(), to another model) in our project, not a hard-coded database table. dbt is so smart that when we run it with a graph operator, e.g. --select dim_clients+, the engine picks up all the dependent models and executes them in the correct order. When we add a new model to the project, the dependencies put it in the correct place in the queue without even letting us know about it. That’s what dbt essentially is.
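For completeness: {{ source() }} resolves to an entry declared in a .yml file, while {{ ref() }} points at another model, and both give dbt the edges of its dependency graph. A sketch, assuming the source used in the model above and a hypothetical downstream model:

# models/staging/sources.yml (illustrative)
version: 2
sources:
  - name: src_dev
    schema: raw_dev          # physical schema where the raw data lands (example)
    tables:
      - name: clients

-- models/marts/rpt_active_clients.sql (hypothetical), built on top of dim_clients
SELECT client_id, client_name
FROM {{ ref('dim_clients') }}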

An example of a dbt DAG (Directed Acyclic Graph)

What’s more? Testing, both generic and custom. The tool provides a whole module responsible for it. Out of the box, we may check whether the columns of a generated model:

  • are unique
  • are not_null
  • contain only accepted_values
  • keep referential integrity (the relationships test, which checks that values match those in a related table)

or we can run custom SQL scripts to verify integrity or business assumptions; a sample test definition is sketched below. Tests are run independently from the models.
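Here is a sketch of how such tests could be declared in a .yml file next to the models; the fct_orders model and the client_type column are hypothetical, added only to show the accepted_values and relationships tests:

# models/marts/schema.yml (illustrative)
version: 2
models:
  - name: dim_clients
    description: "One row per client."
    columns:
      - name: client_id
        description: "Primary key of the client."
        tests:
          - unique
          - not_null
      - name: client_type        # hypothetical column
        tests:
          - accepted_values:
              values: ['standard', 'premium']
  - name: fct_orders             # hypothetical model referencing dim_clients
    columns:
      - name: client_id
        tests:
          - relationships:
              to: ref('dim_clients')
              field: client_id

Tests defined this way are executed with dbt test, optionally narrowed down to a single model, e.g. dbt test --select dim_clients.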

Last but not least, there is documentation. In dbt we can write documentation next to the code, in .yml files. It’s nice to document while coding, in the same tool, using the same framework and filesystem. All docs are generated automatically as a web page or a JSON file, containing all the metadata, our descriptions and the DAG diagrams.
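With descriptions like the ones in the schema.yml sketch above in place, generating and browsing the documentation comes down to two commands:

dbt docs generate
dbt docs serve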

These features make dbt a very universal tool. The steps we want to perform while wrangling data are not encapsulated in blocks that we need to drag-and-drop or connect with arrows, as we are used to elsewhere. It’s all about writing SQL and Jinja code.

You might ask: what’s wrong with it? The answer is that it is still fresh; version 1.2 was released only recently. However, the time when it was a novel tool that could be deemed too risky has already passed, and it is now commonly referred to as part of the modern data stack. There is rapidly growing interest in dbt in the data engineering community. One thing has to be stressed before finishing: not all database adapters are available at the moment, and some are only community-supported. Another controversial fact is that some dbt features are developed independently across database adapters. It was a bit awkward when I found out that one really useful feature was available for just one adapter. It would be nice to have alignment across all platforms where possible.

I recommend dbt for all data projects which we do not wish to over-engineer. The entry level is relatively low, deployment is really simple, and there is a cloud-based option. You get a lot in a very short time. However, the fact that we are only at v1.2 may cause some to perceive it as not mature enough to replace old-fashioned ETL tools. It’s more about the inertia of large organizations, which makes them resistant to adopting new tools as they appear. I believe that decision makers need some time to understand its potential. Additionally, there are still relatively few engineers with hands-on experience in dbt. I am looking forward to seeing what features come with the next release.

Originally posted on: https://wasimara.hashnode.dev/thats-dbt-wow

Bay near Degerby, Finland at dusk. A metaphor of a data lake.

Rough waters of the Data Lake

We wrote in another post that the term “data warehouse” is often misunderstood. The same goes for “data lake”, a newer term which also refers to data storage. The two are frequently confused. Another common misconception is to consider a data lake a replacement for, or a modern version of, a data warehouse. To add to the confusion, “the data lakehouse” appeared very recently, actively promoted as a mixture of a data warehouse and a data lake, supposedly taking the best of each of them.

When discussing the data lake we should start with why this concept appeared at all. Let’s go back to the early 2000s.

It starts from the source

The data warehouse had been a well-established concept back then. Yet many of the data warehouse projects were failures. How many, no one really knows; some sources report more than 50%. The exact number is neither important here nor possible to obtain. From experience we know that at least some of the large-scale projects declared a “success” were far from delivering on the initial expectations.

On the other hand, there appeared “big data”. “Big data” was another ill-defined term, used to describe anything which did not fit into the RDBMSs available back then. It might have been because of the sheer volume of the data or the difficulty of squeezing them into the relational model of a database. In these circumstances Hadoop was created. It came with the Hadoop Distributed File System (HDFS) as storage and MapReduce as the programming model. The data were stored as flat files, so there was no need to think of keys, attributes, columns and relations. Hadoop quickly became popular for processing data at scale, without investing in expensive high-end hardware and equally expensive database licences.

This was clearly an alternative way to collect data and process them for query and analysis. A few years later, in the community of Hadoop users, a new concept called “a data lake” was created. Similarly to “big data”, it lacked (and still lacks) a precise definition. The classic description comes from James Dixon’s blog article published in 2010 and is figurative rather than concrete:

If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.

James Dixon

The comparison between the small volume of a bottle of water and the vast capacity of a lake, with the seemingly endless possibilities of using it, easily creates an impression of the superiority of the data lake. If you got such an impression, we strongly encourage you to read the original article (it’s short and concise). One of the important points is that Dixon refers to data originating from a single system – hence he uses the term “datamart”, not “data warehouse”.

Dive in a lake, drink from a bottle

So, in its origins the data lake was not meant to be a simple replacement for a data warehouse, which typically integrates data from many sources. Dealing with unstructured data was not the main driving force either – nearly all of the data handled in practice were structured or semi-structured. It was conceived to process data of a volume not fitting (back then) into an RDBMS. More importantly, it was made to allow answering questions not yet known at the moment of designing the system.

The goal of answering questions not yet known has far-reaching consequences. One cannot simply make a selection: these data are of key business importance, the others are of no value beyond transient operational purposes; these we keep, the others we throw away. If we do not know the question, then we cannot know which data would be useful to answer it. So we store everything, just in case. Even if some data seem useless today, they may serve someone to answer a question which may appear tomorrow. Obviously, in this situation any data modelling effort would be futile. The only practical thing to do is to save the data as they are, without pre-processing, transformation or modelling. That will be the responsibility of whoever uses the data for some purpose, some time in the future.

It is the second substantial difference between the data warehouse and the data lake. The data warehouse is meant to answer (mostly) predetermined questions about well known business processes. The main use of a data warehouse is descriptive analytics bounded to a known domain. Consequently, the data are modelled according to the known business processes they describe. The data are integrated. Both integration and modelling entail pre-processing and data transformation. Indeed, it may resemble bottled water – extracted from the source, purified and delivered ready to drink, in a handy packaging. The data lake does not possess any of these features. It was meant to allow exploratory analysis, limited only by imagination and availability of the data. It stores raw, unprocessed data. Any (pre-)processing happens only as a part of the analytical process.

Eutrophication of the lake

The data lake concept gained quite some popularity. Unfortunately, the original idea was distorted shortly after it appeared. Many people, disappointed with large, expensive and often unsuccessful data warehouse projects, perceived the data lake as a replacement or an alternative to the data warehouse. “Big data” was trendy, “Hadoop” was trendy, so the “data lake” was trendy, too. The old-fashioned data warehouse was passé; there was a new, hip data warehouse 2.0, called the “data lake”. The data lake concept was very quickly simplified to statements like: “Hadoop replaces the role of OLAP (online analytical processing) in preparing data to answer specific questions”[1]. It boiled down to a strategy: “Let’s throw all our data into Hadoop, call it a data lake and we will live happily ever after”. Well, not quite…

Großes Moor bei Becklingen, as a metaphor of a data swamp

With all this hype on one side and common misunderstanding of the whole concept on the other, it is no wonder that disappointment and criticism appeared. A large part of the criticism was in fact based on false assumptions or on getting the idea wrong. The original author tried to address this criticism, and in particular the misconceptions, in his post “Data Lakes Revisited” as early as 2014. He wrote (again, we encourage you to read the full post):

A single data lake houses data from one source. You can have multiple lakes, but that does not equal a data mart or data warehouse.

A Data Lake is not a data warehouse housed in Hadoop. If you store data from many systems and join across them, you have a Water Garden, not a Data Lake.

James Dixon

But his voice was either not heard in the whole fuss about big data or it was already too late. The meaning of the term “data lake” had already drifted away from the original concept and there was no easy way back. The misunderstandings of the concept continued to be there, and so did the criticism based on them.

What some people got wrong about the idea was that storing all the data on Hadoop was already enough to make a data lake. The data lake was treated as a replacement for a data warehouse, even though it had never been meant to be one. There were data stores resembling data warehouses built on Hadoop, with Hive or Impala as the data access interface, called “data lakes”. They were used mainly to answer the same pre-defined business questions a data warehouse would normally answer. A proper data model was often missing, as was data governance. Some people must have assumed that if the data do not need a pre-defined schema, they also do not require data governance. The data were stored “raw”, as they came from the source, without any pre-processing or proper integration. Obviously, the systems built according to these misconceptions most often did not deliver on the expectations. These so-called “data lakes” were turning into what critics named data swamps.

Where is the water now?

Today there is still no single definition of what a data lake is. And if we look at what the popular definitions or descriptions say, we can see how different the data lake now is from the original concept:

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics (…).

Amazon

A data lake is a system or repository of data stored in its natural/raw format (…). A data lake is usually a single store of data including raw copies of source system data, (…) and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data (…), semi-structured data (…), unstructured data (…).

Wikipedia

A data lake is a concept consisting of a collection of storage instances of various data assets. These assets are stored in a near-exact, or even exact, copy of the source format and are in addition to the originating data stores.

Gartner

These descriptions focus on multiple sources, many purposes or applications, the raw format of the stored data, and the possibility of storing semi-structured or unstructured data. As we already know, the idea of storing data from many sources was not there initially, and the possibility of storing unstructured data was not the main concern either. What remained of the original concept is that the data are stored in raw format, as they come from their source, without any preprocessing or aggregations. There is no structure or schema defined beforehand. The data are supposed to serve various tasks, apparently including (simple) analytics and reporting. This is a change from the original goal of answering questions not yet known to answering all questions, including the pre-defined ones.

The raw storage format, the lack of a predetermined schema and, consequently, the lack of data pre-processing are the key differences between the data lake and the data warehouse.

The logical consequence of the above is that a data lake is not suitable for standard business analytics and reporting. Reports created from data which are not properly integrated, cleansed and curated will very likely contain discrepancies and errors. All the data pre-processing and integration would have to be done while building a particular report. It is a major effort which needs to be repeated for every analysis. Obviously, there is a risk it would be done slightly differently each time, thus producing different results.

And what if we do all these pre-processing, data cleansing and integration steps only once? And then store the clean, conformed, structured and perhaps aggregated data in a single place, readily available for trustworthy analytics and reporting? An excellent idea! This is (roughly) how a data warehouse is built…