
That’s dbt? WOW!

If you’re a data enthusiast, it’s definitely worth trying. dbt (data build tool) is a relatively fresh (v1.2), open-source tool that neatly fills a gap in the data engineering stack. It’s dedicated to data engineers who love to code and hate to drag-and-drop, and particularly to analytics engineers who work somewhere between data engineers and data analysts, or share both functions at the same time. Its main interface (dbt Core) is the command line, and it works with my IDE (IntelliJ). It provides exactly what I love to work with, and it’s up to me how I use it.

Out of the box we get many features; the best one is the integration with a Git repository. That elevates data project management to a new level and puts analytics engineers on par with software engineers. dbt has a big community on GitHub, Slack and Stack Overflow that literally develops the tool there.

You can run a dbt model from the web interface, the command line, or any terminal with the dbt CLI installed:

dbt run --profile dev --target local --select dim_clients

The dbt run command executes the model dim_clients, using the dev profile and the local target to connect to the database. The syntax is exactly the same on all operating systems: you can work on Linux while others work on macOS or anything else. Objects are stored in a folder structure that represents dbt relation types, like models, seeds, snapshots and sources. We can use sub-folders to group our models the way we like best, and those folders and tags can then drive what gets run, as shown below.
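Grouping by folders and tags pays off at run time, because the selection syntax can target a whole folder or tag at once. A couple of hedged examples (the staging folder name is just an illustration; the hourly tag comes from the model config shown below):

dbt run --select tag:hourly      # every model tagged "hourly"
dbt run --select models/staging  # every model under the staging folder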

Take a look at a sample model:

{{
  config(
    materialized='incremental',
    tags=["hourly"],
    alias='clients',
    unique_key='client_id',
    post_hook=[
      "CREATE INDEX IF NOT EXISTS idx_{{ this.name }}_client_id ON {{ this }} (client_id)",
      "CREATE INDEX IF NOT EXISTS idx_{{ this.name }}_updated_at ON {{ this }} (updated_at DESC)"
    ]
  )
}}

SELECT
  id AS client_id,
  name AS client_name,
  created_at,
  updated_at,
  CURRENT_TIMESTAMP AS loaded_at
FROM
  {{ source('src_dev', 'clients') }}
{% if is_incremental() %}
  WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}

The model looks SQL-ish, with some Jinja injected and a few sections that are uncommon in other SQL projects I have worked on. It is a bit confusing at first glance, but it all works together smoothly. As a result we get SQL code (with the Jinja resolved) that may be compiled (compile) or executed (run) against the database engine. I would compare it to smart dynamic SQL that gets many of the benefits of templating and macros. Jinja is a Python-ish templating engine, used here to achieve DRY (don’t repeat yourself) coding. There are community packages with smart snippets and hacks for dbt available on the Internet. I believe this is just the beginning; the potential of the tool and of the community building it is far greater.
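As a small illustration of the DRY idea, a repeated expression can be wrapped in a user-defined macro and reused across models. This is a hypothetical sketch, not part of the project above:

-- macros/cents_to_eur.sql
{% macro cents_to_eur(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}

-- any model can now reuse the same logic
SELECT
  order_id,
  {{ cents_to_eur('amount_cents') }} AS amount_eur
FROM {{ source('src_dev', 'orders') }}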

Going back to the code: the first section plays an essential configuration role.

{{
  config(
    materialized='incremental',
    tags=["hourly"],
    alias='clients',
    unique_key='client_id',
    post_hook=[
      "CREATE INDEX IF NOT EXISTS idx_{{ this.name }}_client_id ON {{ this }} (client_id)",
      "CREATE INDEX IF NOT EXISTS idx_{{ this.name }}_updated_at ON {{ this }} (updated_at DESC)"
    ]
  )
}}

In the {{ config }} block we can specify model parameters, which gives us a lot of flexibility in how the model is executed, plus other features like pre_hook or post_hook. These configurations are model-specific, but there are more parameters applicable at other levels. For example, dbt allows us to execute custom commands in hooks before or after a transaction commits. It’s like having a trigger, but on the database level. Smart! Some features are handled by generic macros, like {% if is_incremental() %}. Others may be developed by the user or found on Stack Overflow. Macros, models and other files are also available in open-source packages.

{% if is_incremental() %}
  WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}

This section is responsible for the incremental load. Other commonly used data warehousing features, like Type 2 slowly changing dimensions (called snapshots in dbt), are also managed by generic or user-defined macros.
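For illustration, a minimal snapshot definition could look like the sketch below (schema and column names are assumptions). Once it sits in the snapshots folder, dbt snapshot keeps the history of changed rows for us:

{% snapshot clients_snapshot %}

{{
  config(
    target_schema='snapshots',
    unique_key='id',
    strategy='timestamp',
    updated_at='updated_at'
  )
}}

SELECT * FROM {{ source('src_dev', 'clients') }}

{% endsnapshot %}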

dbt model timing visualisation example
model timing

Beyond extended coding, dbt gives us a kind of orchestrator, and a scheduler in the dbt Cloud version. It builds the queue and sets the order of model execution based on dependencies that dbt captures itself. That’s why instead of src_dev.clients we write {{ source('src_dev', 'clients') }} – an internal reference to a source (or, with ref(), to another model) in our project, not a hard-coded database table. dbt is smart enough that when we run it with graph operators, e.g. --select +dim_clients+, the engine picks up all upstream and downstream models and executes them in the correct order. When we add a new model to the project, the dependencies put it in the right place in the queue without us even noticing. That’s what dbt essentially is.
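For example, a downstream model only needs to reference its parent with ref() and dbt wires the dependency into the DAG by itself. The report model below is purely illustrative:

-- models/marts/clients_report.sql
SELECT
  client_id,
  client_name,
  loaded_at
FROM {{ ref('dim_clients') }}

Running dbt run --select +dim_clients+ would then execute dim_clients together with everything upstream and downstream of it, clients_report included.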

An example of dbt DAG
DAG (Directed Acyclic Graph)

What’s more? Testing, both generic and custom. The tool provides a whole module responsible for it. By default, we may check if generated model columns are:

  • unique
  • not_null
  • contain accepted_values
  • referentially integral (relationships – values match those in a related table)

or run custom SQL scripts to verify integrity or business assumptions, as in the sketch below. Tests are run independently from models.
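A custom (singular) test is just a SQL file in the tests folder that returns the rows violating our assumption; the test passes when the query returns nothing. A sketch against the clients model from above:

-- tests/assert_no_future_updates.sql
-- fail if any client row claims to be updated in the future
SELECT
  client_id,
  updated_at
FROM {{ ref('dim_clients') }}
WHERE updated_at > CURRENT_TIMESTAMP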

Last but not least, there is documentation. In dbt we can write documentation next to the code, in .yml files. It’s nice to document while coding, in the same tool, using the same framework and filesystem. The docs are then generated automatically as a web page or a JSON file containing all the metadata, our descriptions and the DAG diagrams.
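Generating and browsing the documentation boils down to two commands:

dbt docs generate   # compiles the project and writes the metadata (JSON) files
dbt docs serve      # serves the documentation website locally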

These features make dbt a very universal tool. The steps we want to perform while wrangling data are not encapsulated in blocks that we drag-and-drop and connect with arrows, as we may be used to. It’s all about writing SQL and Jinja code.

You may ask: what’s wrong with it? The answer is that it is still fresh; version 1.2 was released only recently. However, the time when it was a novel tool and could be deemed too risky has already passed, and it is now commonly referred to as part of the modern data stack. Interest in dbt is growing rapidly in the data engineering community. One thing has to be stressed before finishing: not all database adapters are available at the moment, and some are only community-supported. Another controversial fact is that some dbt features are developed independently for different database systems. It was a bit awkward to find out that one really useful feature existed for just one adapter. It would be nice to have alignment across all platforms where possible.

I recommend dbt for all data projects we do not wish to over-engineer. The entry barrier is relatively low, deployment is really simple, and there is a cloud option. You get a lot in a very short time. However, the fact that we are at v1.2 may cause some to perceive it as too immature to replace old-fashioned ETL tools. In my view it is more about the inertia of large organizations, which makes them resistant to adopting new tools as they appear. I believe decision makers need some time to understand its potential. Additionally, there are still relatively few engineers with hands-on experience in dbt. I am looking forward to seeing what features come with the next release.

Originally posted on: https://wasimara.hashnode.dev/thats-dbt-wow

Bay near Degerby, Finland at dusk. A metaphor of a data lake.

Rough waters of the Data Lake

We wrote in another post that the term “data warehouse” is often misunderstood. The same goes for “data lake” – a newer term which also refers to data storage. The two are frequently confused. Another common misconception is to consider a data lake a replacement for, or a modern version of, a data warehouse. To add to the confusion, the “data lakehouse” appeared very recently – actively promoted as a mixture of a data warehouse and a data lake, supposedly taking the best of each.

When discussing the data lake we should start with why this concept appeared at all. Let’s go back to the early 2000s.

It starts from the source

The data warehouse had been a well-established concept back then. Yet, many data warehouse projects were failures. How many, no one really knows; some sources report more than 50%. The exact number is neither important here nor possible to obtain. From experience we know that at least some of the large-scale projects declared a “success” were far from delivering on the initial expectations.

On the other hand, there appeared “big data”. “Big data” was another ill-defined term, used to describe anything which did not fit into the RDBMSs available back then. It might have been because of the sheer volume of the data or the difficulty of squeezing them into the relational model of a database. In these circumstances Hadoop was created. It came with the Hadoop Distributed File System (HDFS) as storage and MapReduce as the programming model. The data were stored as flat files, so there was no need to think of keys, attributes, columns and relations. Hadoop quickly became popular for processing data at scale, without investing in expensive high-end hardware and equally expensive database licences.

This was clearly an alternative way to collect data and process them for query and analysis. A few years later, in the community of Hadoop users, a new concept called “the data lake” was created. Similarly to “big data” it lacked (and still lacks) a precise definition. The classic description comes from James Dixon’s blog article published in 2010 and is figurative rather than concrete:

If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.

James Dixon

The comparison between the small volume of a bottle of water and the vast capacity of a lake, with the seemingly endless possibilities of using it, easily creates an impression of the superiority of the data lake. If you got such an impression we strongly encourage you to read the original article (it’s short and concise). One of the important points is that Dixon refers to the data originating from a single system – hence he uses the term “datamart”, not “data warehouse”.

Dive in a lake, drink from a bottle

So, in its origins a data lake was not meant to be a simple replacement for a data warehouse, which typically integrates data from many sources. Dealing with unstructured data was not the main driving force either – nearly all of the data handled in practice were structured or semi-structured. It was conceived to process the data of a volume not fitting (back then) into RDBMS. More importantly, it was made to allow answering the questions not yet known at the moment of designing the system.

The goal of answering the questions not yet known has far-reaching consequences. One cannot simply make a selection: these data are of key business importance, the others are of no value beyond transient operational purposes; these we keep, the others we throw away. If we do not know the question, we cannot know which data would be useful to answer it. So, we store everything, just in case. Even if some data seem useless today, they may serve someone to answer a question which may appear tomorrow. Obviously, in this situation any data modelling effort at this stage would be futile. The only practical thing to do now is to save the data as they are, without pre-processing, transformation or modelling. That will be the responsibility of someone who will use the data for some purpose, some time in the future.

It is the second substantial difference between the data warehouse and the data lake. The data warehouse is meant to answer (mostly) predetermined questions about well known business processes. The main use of a data warehouse is descriptive analytics bounded to a known domain. Consequently, the data are modelled according to the known business processes they describe. The data are integrated. Both integration and modelling entail pre-processing and data transformation. Indeed, it may resemble bottled water – extracted from the source, purified and delivered ready to drink, in a handy packaging. The data lake does not possess any of these features. It was meant to allow exploratory analysis, limited only by imagination and availability of the data. It stores raw, unprocessed data. Any (pre-)processing happens only as a part of the analytical process.

Eutrophication of the lake

The data lake concept gained quite some popularity. Unfortunately, the original idea was distorted shortly after it appeared. Many people, disappointed with large, expensive and often unsuccessful data warehouse projects, perceived the data lake as a replacement or an alternative to the data warehouse. “Big data” was trendy, “Hadoop” was trendy, so the “data lake” was trendy, too. The old-fashioned data warehouse was passé; there was a new, hip data warehouse 2.0, called the “data lake”. The concept was very quickly simplified to statements like: “Hadoop replaces the role of OLAP (online analytical processing) in preparing data to answer specific questions”[1]. It boiled down to a strategy: “Let’s throw all our data into Hadoop, call it a data lake and we will live happily ever after“. Well, not quite…

Großes Moor bei Becklingen - as a metaphor of a data swamp.
Großes Moor bei Becklingen

With all this hype on one side and common misunderstandings of the whole concept on the other, it is no wonder that disappointment and criticism appeared. A large part of the criticism was in fact based on false assumptions or on getting the idea wrong. The original author tried to address this criticism, and in particular the misconceptions, in his post “Data Lakes Revisited” already in 2014. He wrote (again, we encourage you to read the full post):

A single data lake houses data from one source. You can have multiple lakes, but that does not equal a data mart or data warehouse.

A Data Lake is not a data warehouse housed in Hadoop. If you store data from many systems and join across them, you have a Water Garden, not a Data Lake.

James Dixon

But his voice was either not heard in the whole fuss about big data or it was already too late. The meaning of the term “data lake” had already drifted away from the original concept and there was no easy way back. The misunderstandings of the concept persisted, and so did the criticism based on them.

What some people got wrong was the idea that all the data should simply be stored on Hadoop and that this alone already made a data lake. The data lake was treated as a replacement for a data warehouse, even though it had never been meant to be one. There were data stores resembling data warehouses built on Hadoop, with Hive or Impala as the data access interface, called a “data lake”. They were used mainly to answer the same pre-defined business questions a data warehouse would normally answer. A proper data model was often missing, as was data governance. Some people must have assumed that if the data do not need a pre-defined schema, they also do not require data governance. The data were stored “raw”, as they came from the source, without any pre-processing or proper integration. Obviously, the systems built according to these misconceptions most often did not deliver on the expectations. These so-called “data lakes” were turning into what critics named data swamps.

Where is the water now?

Today there is still no single definition of what the data lake is. And if we look at what the popular definitions or descriptions say, we can see how different the data lake now is from the original concept:

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics (…).

Amazon

A data lake is a system or repository of data stored in its natural/raw format (…). A data lake is usually a single store of data including raw copies of source system data, (…) and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data (…), semi-structured data (…), unstructured data (…).

wikipedia

A data lake is a concept consisting of a collection of storage instances of various data assets. These assets are stored in a near-exact, or even exact, copy of the source format and are in addition to the originating data stores.

Gartner

These descriptions focus on multiple sources, many purposes or applications, the raw format of the stored data and the possibility to store semi-structured or unstructured data. As we already know, the idea of storing data from many sources was not there initially, and the possibility of storing unstructured data was not the main concern either. What remained of the original concept is that the data are stored in a raw format, as they come from their source, without any preprocessing or aggregation. There is no structure or schema defined beforehand. The data are supposed to serve various tasks, apparently including (simple) analytics and reporting. This is a change from the original goal of answering questions not yet known to answering all questions, including the pre-defined ones.

The raw data storage format, lack of predetermined schema and consequently lack of data pre-processing are the key differences between the data lake and the data warehouse.

The logical consequence of the above is that a data lake is not suitable for standard business analytics and reporting. Reports created from data which are not properly integrated, cleansed and curated will very likely contain discrepancies and errors. All the data pre-processing and integration would have to be done while building a particular report. It is a major effort, which needs to be repeated for every analysis. Obviously, there is a risk it would be done slightly differently each time, thus producing different results.

And what if we do all these pre-processing, data cleansing and integration steps only once? And then store the clean, conformed, structured and perhaps aggregated data in a single place, readily available for trustworthy analytics and reporting? An excellent idea! This is (roughly) how a data warehouse is built…


What is a data warehouse?

The concept of a data warehouse is quite old; it originates from the late 1980s[1]. Yet, in our practice we find that the term “data warehouse” is often misunderstood. The presence of a similar term, the data lake, does not make things easier, and the two concepts are frequently confused or mixed.

The definition

There are two canonical definitions of a data warehouse:

“A data warehouse is a copy of transaction data specifically structured for query and analysis.”

Ralph Kimball

“A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.”

Bill Inmon

Both of these definitions are very concise and rather high-level. They say what properties a data warehouse has, or what conditions a data store should fulfil to be a data warehouse. In fact they do not directly describe what a data warehouse looks like, nor how to build one. But if we read them carefully we will know precisely enough what is and what is not a data warehouse.

What is not a data warehouse

If we read the above definitions again we can see they both define a data warehouse as a kind of data collection (“copy of transaction data”, “collection of data”). So, it is certainly not a process of collecting, transforming or analysing the data. It is not a tool for data integration. Nor is it a reporting tool. It is a data store, and this is the key property.

But not every data store is a data warehouse. The data must serve a specific purpose. Kimball says they are for “query and analysis”; Inmon describes it as “support of management’s decision making process”. Obviously, a collection of data not suitable for these purposes cannot be called a data warehouse. A folder with scanned invoices from the last 5 years, although it certainly contains valuable information, is completely unsuitable for analysis. An S3 bucket with a collection of logs or JSON-structured website event details is hardly usable for such a purpose either. None of these data stores can be considered a data warehouse.

Even a transactional database, holding all the transactional information in a neatly designed 3rd Normal Form, does not fulfil our conditions. It is made for operating some system or application and storing (business) transaction data. It is specifically structured for recording transactions, not for query and analysis. Although technically one can run queries against a transactional database, it is not designed for running analytical queries. And it contains the transaction data themselves, not a copy of transaction data as Kimball defines. Inmon’s conditions of non-volatility and integration of data are not fulfilled either.

Getting to the details

Subject-oriented

An important feature of a data warehouse is that it is “subject-oriented”. The data should be organised around the subjects they describe or refer to, rather than the systems they originate from. Kimball et al. in their classic The Data Warehouse Toolkit wrote that the key point is to identify and describe the business processes. These processes, not the organisational structure of the company or the architecture of the computer systems used, dictate how the data should be structured.

For example, a typical sales process is composed of several steps: placing an order, payment, shipment, etc. Each of these steps may be handled by a separate organisational unit or even by an external provider. Each of them may use different databases or information systems to do the job. The role of the data warehouse is to describe the complete process, not its separate components. The organisation of the data revolves around the process, not around departments or the origins of the data.

Integrated

The fact that the data come from various sources brings us to the next feature: integrated. It does not only mean a multitude of sources. More importantly, it means the data coming from different systems must be aligned and unified. This includes introducing a single measurement system, uniform data formats and a consistent naming convention. Conflicts between data originating from different sources should be resolved.

Non-volatile

The “non-volatile” attribute entails permanent storage of the data. So, no system which collects, transforms or generates the data “on the fly” can be considered a data warehouse, according to the definition. Moreover, non-volatility means the data generally should only be added, not removed. That is logical – the data depict some business process at a given point in time. Even if the state of the process changes in the future, the data collected so far carry the facts about the past states. If we want to use them to analyse the history of such a process or its changes in time, we should not delete or alter such information.

The restriction on not altering information is not absolute and, contrary to popular belief, is not the same as never altering the underlying data. It does not entail read-only access to the data, apart from loading them. It also does not mean the old data can never be overwritten by new data. It certainly does not mean the execution of an ALTER statement is prohibited. However, such operations should not destroy historical information, or must be well justified when they do. Classic examples of perfectly legitimate overwriting or altering of the data are Type 1 and Type 2 slowly changing dimensions.
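As a rough sketch (with a hypothetical dim_customer table), a Type 1 change simply overwrites the attribute, while a Type 2 change closes the current row and inserts a new version, so the historical information is preserved rather than destroyed:

-- Type 1: overwrite in place, no history kept
UPDATE dim_customer
SET city = 'Berlin'
WHERE customer_id = 42;

-- Type 2: close the current version and add a new one
UPDATE dim_customer
SET valid_to = CURRENT_DATE, is_current = FALSE
WHERE customer_id = 42 AND is_current = TRUE;

INSERT INTO dim_customer (customer_id, city, valid_from, valid_to, is_current)
VALUES (42, 'Berlin', CURRENT_DATE, NULL, TRUE);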

Time-variant

A snapshot of the transaction data, properly integrated and structured, fulfils all the conditions discussed above. At first sight it matches Kimball’s definition completely. But what would be the real usefulness of such a snapshot for analysis or for a decision making process? The analysis would be limited to the specific time the snapshot was taken and (maybe) the history. As time passes, the utility of such a snapshot would quickly degrade. It is obvious that for making decisions everyone would like to have the most up-to-date information. For the data warehouse to serve its purpose, its content must be regularly and frequently updated.

Certainly, if one attempts to answer the same question based on two different sets of information, the result may be two very different answers. If the data in the data warehouse change in time, then the results of the queries will generally depend on the time they are executed. Technically, the data warehouse is a time-variant system. In practice it means it is not a mere store of past data; it is live, it changes.

The time-variant attribute from Inmon’s definition is commonly misinterpreted as “referring to historical data” or “focusing on change in time”[2]. Of course, the data warehouse stores historical data, and stores them over a much longer horizon than the transactional systems do. The analysis of changes in time plays an important role in decision making processes and is often how the collected data are used. But time-variant really means that the output changes in time, because the data in the data warehouse are not static – they are constantly being added and updated.

Is data warehouse a database?

Neither definition explicitly mentions a database. Kimball only says the data must be “specifically structured”. Until recently it was fairly obvious that if the data are stored for the purpose of analysis or decision support, they must be stored in a database. In fact both Kimball and Inmon described the implementation of a data warehouse in a relational database.

However, their original works on this topic are from the 1990s, with some updates in the 2000s. These were pre-Hadoop times and there were practically no other storage options suitable “for query and analysis”. Today it would be possible to complete the task even without a database, using alternative storage options and a query engine. Still, implementations of a data warehouse on a relational database (like Oracle or PostgreSQL) or a Massively Parallel Processing (MPP) database (like Amazon Redshift, Netezza, Vertica or Greenplum) are by far the most common.

Today, with the development of Hadoop, the cloud and related tools, there are more options. There have been many data warehouse implementations with data stored as files on HDFS. Hive or Impala can then be used as the query engine to provide an SQL-like experience. Similar tools exist in the public cloud – Amazon Redshift Spectrum or Athena being just two examples.

Irrespective of the storage system and query engine used, the principles do not change: the data must be specifically structured for query and analysis. Dumping no matter what, no matter how, onto HDFS or S3 will result in trouble, not a data warehouse.

Not all databases are equal

Although it should be obvious, it is worth repeating that not every database is a data warehouse or can be treated as such, even if it contains a lot of historical data about various topics. A distinction must be made between On-Line Transaction Processing (OLTP) and On-Line Analytical Processing (OLAP) systems. The former are made to quickly execute a predefined set of operations, each of them acting on a single record or at most a few of them. The typical operations are writing single records or updating existing ones. OLTP systems may handle a very large number of such operations, but each individual operation is rather simple.

In contrast, OLAP systems, the category a data warehouse falls into, are made for the analysis of large sets of historical data. The queries are relatively infrequent, but they often operate on millions of records. They are almost exclusively read queries, and hence the data warehouse should be optimised for read access. The data load processes typically write whole batches of data rather than single records.
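The contrast is easy to see in the typical queries each kind of system handles (table names are purely illustrative):

-- OLTP: touch a single record, thousands of times per second
UPDATE accounts
SET balance = balance - 100
WHERE account_id = 12345;

-- OLAP: scan and aggregate millions of historical records in one read query
SELECT
  customer_segment,
  DATE_TRUNC('month', order_date) AS month,
  SUM(order_amount) AS revenue
FROM fact_orders
GROUP BY customer_segment, DATE_TRUNC('month', order_date);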

The list of differences is not complete, but it should already be evident that these two kinds of databases must be designed in completely different ways. No wonder there exist Relational Database Management Systems (RDBMS) targeted at OLTP or “general purpose” applications and those made specifically for data warehousing. Mixing the two purposes, or running analytical queries directly on the OLTP database (even on a “read-only” replica), cannot lead to good results, especially at scale.

But how to build it?

So far we have discussed only what a data warehouse is and what it is not. Apart from a few hints, we did not give a clear recipe for how such a system should be built. Even a very brief guide would have to be many pages long, and it would only cover the basics and the most common concepts. Moreover, there are three distinct methodologies for tackling the task, and there are also specific system- and tool-dependent variations of them.

Those interested in the details should refer to the classic materials:

Of course, you can also contact us and we will be happy to advise and assist you.

[1] The term “data warehouse” certainly appeared in “An architecture for a business and information system” paper in 1988, but very likely had been in use long before.

[2] Even Oracle in their, otherwise very good, Data Warehouse Guide makes this mistake.

Dimensional modelling in a data warehouse. Engineering drawing with dimensions - as an illustration that dimensions and measurements are crucial in every engineering work.

Notes on dimensional modelling

In the other post, on Coronavirus reporting, we briefly touched on the topic of dimensions. In particular we discussed the temporal dimension. Since dimensional modelling is a basic concept in data warehouse design, it deserves a separate post.

The temporal dimension is used in practically every business and science domain. Time / date is commonly considered an example of a conformed dimension. In the end the 7th of November will always be the 7th of November and not the 25th of October, irrespective of the context [1]. In reality, though, we typically deal with multiple different time/date dimensions – the date of order is generally different from the date of delivery, and different from the date of payment. An attempt to combine these three in one drill-across report using a single ‘date’ dimension will inevitably produce nonsense. These are three different, conformed dimensions. In practice it is incredibly important to clearly distinguish between different kinds of dates and timestamps.
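In a dimensional model this is usually handled by giving the fact table a separate foreign key for each role the date plays, all pointing to the same date dimension. A sketch with hypothetical names:

-- one dim_date, three distinct roles in the fact table
CREATE TABLE fact_sales (
  order_id          BIGINT,
  order_date_key    INTEGER REFERENCES dim_date (date_key),
  delivery_date_key INTEGER REFERENCES dim_date (date_key),
  payment_date_key  INTEGER REFERENCES dim_date (date_key),
  amount            NUMERIC(12, 2)
);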

Know your process

In some cases this distinction is relatively easy, as the difference comes from the business process. In the example above it is easily understandable for everyone that the delivery cannot occur before the order, and that there may be a remarkable delay between these two events. It is obvious the delivery date and the order date are two different things. This is an inherent feature of the business process we are describing with analytical tools.

In other cases – and they are often more difficult to handle – the difference comes from how we measure certain quantities or report on events. In the previously described Coronavirus example we were really interested in the date of infection. But this was not an event we could monitor or measure directly. The event we could easily monitor was the submission of the report of the identified case. There is a significant delay between the two events. Moreover, this delay is not constant and can vary from case to case. It depends heavily on the details of the whole diagnostics and reporting procedure. These details and their influence on the results should be evident to an insider, but will very likely be unknown to the final user of the analysis prepared on the basis of the collected data.

How to buy on-line without a single website visit

Let’s consider the process of making a purchase in an on-line shop. Typically, a user visits the website, adds some products to the basket, initiates the check-out, finalises it and pays. Of course, there may be many different scenarios, but for the sake of simplicity let’s restrict our consideration to these five steps. In the ideal world we can register and timestamp the events corresponding to each step. Very likely each of the steps will be handled by a different (micro-)service or even by an external provider. The data about the website visits would be collected and stored differently than the data about payments. And, of course, one might report on the number of website visits per day and, independently, on the number of transactions per day.

It is not difficult to figure out that it takes some time from the website visit to the payment. Hence, the timestamp of the visit event will never be the same as the timestamp of the payment event. It does not take much to realise that a user can browse the site for many hours, then add some products to the basket, then leave this basket for another couple of hours or even days, and only then continue with the checkout.

Obviously, if we try to combine our facts about visits and facts about transactions in one report using “date” – supposedly a conformed dimension – we will obtain a mess. This mess will probably not be immediately visible at a very high level. But on specific dimension splits we might see the theoretically impossible situation of transactions without a visit on some dates.

Clearly, there is no single “date” dimension. There are several: visit date, transaction date, etc. We may report the numbers by visit date or by transaction date, but we cannot use them interchangeably and should not mix them together, as sketched below. As a logical consequence, we should clearly distinguish the different types of dates and timestamps by using clear and consistent naming in the data warehouse. But naming is a topic in itself, and one day we may cover it separately.
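In practice that means reporting each fact against its own date rather than forcing both into one generic “date” column. A sketch with illustrative table names:

-- visits counted by visit date
SELECT CAST(visit_ts AS DATE) AS visit_date, COUNT(*) AS visits
FROM fact_visits
GROUP BY CAST(visit_ts AS DATE);

-- transactions counted by transaction date: a separate metric, a separate dimension
SELECT CAST(payment_ts AS DATE) AS transaction_date, COUNT(*) AS transactions
FROM fact_transactions
GROUP BY CAST(payment_ts AS DATE);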

What about other dimensions?

While time and date are important, they are not the only dimensions typically used. Another common example of a dimension is “customer”. Every business has customers. But “a customer” may have a different meaning depending on who we speak to. In a CRM system one would often store both actual customers and prospects. A sales representative or an account manager would use the word “customer” for both. For the finance department, “a customer” will very likely be someone they can issue an invoice to (or have issued one to in the past). For legal, “a customer” might be someone we have a valid contract with – so no prospects, and no former customers. Some other department might even define it as “someone who has bought from us in the last 3 months”.

However, in this case we can define a superset of all “customers”. This superset will contain all the prospects, actual customers, former customers, etc. If we manage to collect all the relevant attributes for such a customer dimension and we have a single primary key to identify an entry, we have a conformed dimension. Despite the fact that we use “customer” in many different contexts, we have a single conformed dimension, which we can use to create coherent drill-across reports.

The problem with “customer” is not that it can mean different things; the problem is that the data about customers are present in many different systems. In general, different departments, and consequently the systems they use, may need various subsets of the data. The data in a particular system may be incomplete, outdated or simply false. Some of the data in system A may contradict the data in system B. Thus, apart from having a single data warehouse definition of what a customer is, we also need a single source of truth about its attributes. But that is a separate topic: Master Data Management.

[1] Unless one makes a conversion from Gregorian to Julian calendar.

RKI Coronavirus daily new cases report change over week 2020-04-21 – 2020-04-28. Visible change of past values (green and red).

On Coronavirus Analytics and Reporting

The evolution of the COVID-19 coronavirus epidemic has triggered an enormous interest in crunching the available data. There are plenty of data on this topic readily available, which makes the task appear particularly easy.

Both official institutions and individuals have created a number of visualisations, of varying quality and varying practical usefulness. I must admit that after having a look at a couple of them I was quite disappointed. Most of them, while often visually appealing, have significant shortcomings. We will not discuss the details here; those interested in the topic can find some critical thoughts here. The most important take is that most of these visualisations do not really facilitate any decision making process. Even if they present trustworthy data, they often do it the wrong way. Typically, they answer “what has already happened?” or “what does it look like now?” – which is not bad as such. At the same time they make it difficult to find out “how has the situation been changing?”, not to mention the most important aspect: “what will / can happen in the future?“.

Of course, many attempts have been made to answer the latter question. The availability of the data prompted many amateurs, in both epidemiology and data handling, to build and publish their “models”. This activity was particularly visible in the early stages of the epidemic. The “models” were very often not much more than an exponential regression based on a few available data points. Very simple, nothing could go wrong, right? Apparently, many people did seemingly the same thing, on ostensibly the same data, and they… got totally different results. The only common thing was that the curve was rising quickly. But this was visible even without a “model”. We will mercifully not point to any examples. In fact, the most outrageous ones do not seem to be easily available on the Internet any more.

We will not try to show yet another visualisation here – there are enough of them. Nor will we attempt to build another predictive model of the course of the epidemic. While we are experts in data, we definitely prefer to leave disease spread modelling to professionals in epidemiology. Instead, we start with something very simple: the basic facts and metrics. Why? Because they are the cornerstone of every business analytics effort, not only the coronavirus ones.

Facts, dimensions and metrics

In the case of the coronavirus disease the atomic event is an infection. In other words, infections are our facts. Now, what are the most basic metrics on these facts? Of course, the number of infections, and we will focus on that one. Each infection occurs at some point in time, at some geographical location and to a person of certain characteristics (age, gender, profession, …). These are the potential dimensions. The important one here, and the most common in nearly every business analysis, is time. That is why we take that one as an example.

Immediately we can conceive the first metric: the total number of infections, or more precisely the cumulative number of infections to date. This is the metric most frequently shown in tables and visualisations, with the ‘date’ typically being ‘today’. But if we have a closer look at this metric we notice that it is similar to the total number of website visits, the total number of customers or the total amount of sales (to date). These are all vanity metrics. They will always grow, by definition. At the same time they carry very little useful information on how we are really doing.

Thus, we are not that much interested in the cumulative number of infections to date. We would rather look at the number of (new) infections per day. The business equivalents would be, for example, the number of website visits per day or the daily amount of sales. From this metric we can easily see the trend (increasing, stable, decreasing) or whether Monday is always the worst day of the week. This one is useful.
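If the source publishes only the cumulative total, the daily metric is simply the day-over-day difference. A sketch assuming a hypothetical table of cumulative counts per reporting date:

SELECT
  report_date,
  total_cases - LAG(total_cases, 1, 0) OVER (ORDER BY report_date) AS new_cases
FROM cumulative_cases
ORDER BY report_date;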

So, let’s have a look at what this looks like on a chart. As said previously, the data are easily available, and so are the charts. One of the most popular services showing curated data and some charts is Worldometer. Since we are located in Berlin, let’s have a look at the data concerning Germany. The official source of German data is the Robert Koch Institute (RKI). There is also a lot of other information and scientific studies on their website, so if you understand German it is well worth reading.

The Worldometer’s Germany Daily New Cases chart is updated daily and no history is provided. At the time of writing it looked like this:

Data visualisation: Coronavirus daily new cases in Germany as of 2020-04-28, based on the original chart by Worldometer. X-axis adjusted to show data after 2020-03-01.
Coronavirus daily new cases in Germany as of 2020-04-28, by Worldometer. X-axis adjusted to show data after 2020-03-01.

The respective RKI charts are given in the Daily Situation Reports (updated daily, an archive is available). The one from the time of writing looked like this:

Report of Coronavirus cases in Germany per day as of 2020-04-28 reported by Robert Koch Institute.
Coronavirus daily new cases in Germany as of 2020-04-28, by RKI.

Not all dates are equal

It does not take much time to realise they are substantially different. It is not just a matter of a different scale on the Y axis and a different graphical representation. They look like they are reporting on completely different data. But this is not the case – Worldometer takes the official data from RKI. Beyond any doubt, the source data are the same.

In both cases we use seemingly the same metric: the number of (new) infections aggregated and charted against seemingly the same dimension: date. If we have a closer look we realise that for Worldometer “date” means “date of reporting”. At the same time RKI uses “date of onset of symptoms” (if known) or “date of reporting” (if date of onset of symptoms is not known). In fact, we have here two distinctly different dimensions, disguised under the same name: “date”. This simple observation is of crucial importance and we will discuss it in a separate post.

To be exact the values on the RKI chart are plotted against two different dimensions, which are marked on the same X-axis. This is a very rare thing to do and normally we would discourage such attempts. However, in this case such a presentation serves a purpose.

Understand the process

To understand why this is important we first need to understand the measurement and reporting process. In the ideal world we would like to report on the date of infection. Obviously, in reality we do not know this date. In science, if we cannot measure or calculate a certain value directly, we can try to estimate it. So, let’s think about what would be the best estimate of it.

The sequence of events in the whole infection and diagnosis process (largely simplified for the purpose of this post) may be as follows: one gets infected, after a few days develops symptoms, the samples are taken, then they are delivered to the lab, processed in the lab, the results are obtained, and the results are reported through the official channels. In the general case the symptoms may develop at any stage of the sequence, theoretically even after the case is officially reported, or not at all. Those interested in the details may read more on the RKI website.

What is important for our further consideration is that there are multiple steps and consequently multiple potential timestamps to use. The delays between the steps are expressed in hours or days, are variable, and are significant from the perspective of the dynamics of the process we are trying to depict. In other words, a case reported today may have been infected 3 days ago or 10 days ago. If we look again at the simplified sequence of events above, it should be clear these variable delays come mostly from the procedure of diagnosing and reporting. Once we realise that fact, we can try to explain the sudden spikes and the weekly variations on the Worldometer chart.

As we can see, the date of reporting is the very last date in the whole process. It is also very different from the date of infection we are looking for. The closest to it is the date of the onset of symptoms. Logically, RKI takes that one as the best estimate of the former. As a result we get a smoother chart, without the spikes or drops caused by irregularities in the testing and reporting process. It depicts the process we are interested in much better. The problem is that in many cases the date of onset of symptoms is not known or cannot be given. These cases cannot be ignored or omitted, hence they are charted by the only certain and available date – the date of reporting. As mentioned above, we are sceptical about the idea of mixing two distinct dimensions in the same report and we would normally discourage such practice in any business context. At the same time we clearly see the advantage of using the best estimate whenever possible.

Changing history

The way of charting chosen by RKI has one more important feature: it can change retrospectively. One of the cases reported today might have had the onset of symptoms 8 days ago; in another, the symptoms might (in theory, at least) appear 2 days from now. In both cases we would need to update the whole chart, not just the most recent date.

RKI Coronavirus daily new cases report change over week 2020-04-21 - 2020-04-28. Visible retrospective changes.
RKI Coronavirus daily new cases report change over week 2020-04-21 – 2020-04-28.

As an illustration, here is how the RKI report changed over one week, from 2020-04-21 to 2020-04-28. Retrospective changes in the recent days are remarkable, and adjustments of the numbers going back as far as a few weeks can also be noticed.

We have seen it already

Such things are not unusual; in many applications events are reported with a significant delay and not necessarily in the order they occur. Many modern data processing systems have built-in features that allow such late arrivals to be handled easily. However, the consequence is that a report for the last month produced today will be substantially different from the same report produced a week from now. Many business stakeholders find it hard to deal with such volatility in their reporting. This often leads to picking simple and stable “Worldometer-like” reports over more adequate, but also more elaborate and volatile, “RKI-like” ones. Even if the latter provide a better base for data-driven decision making.

As shown above, even in a relatively simple data analysis there are multiple decisions to be taken. Depending on how they are taken and how the whole data collection and analysis process is conducted, reports of (seemingly) the same metrics, based on exactly the same data, may differ significantly. A clear definition of the relevant metrics and dimensions is the starting point for any business reporting. If done right, it can greatly facilitate the further decision making process. If neglected or done carelessly, it will lead to discrepancies, misunderstandings and drawing false conclusions.