Rough waters of the Data Lake

2021-01-25

We wrote in another post that the term “data warehouse” is often misunderstood. Same stands for “data lake” – a newer denotation which also refer to data storage. The two are frequently confused. Another common misconception is to consider a data lake as a replacement or a modern version of a data warehouse. To add to the confusion there appeared very recently “a data lakehouse” – actively promoted as the mixture of a data warehouse and a data lake, supposedly taking the best of each of them.

When discussing about data lake we should start with why this concept appeared at all. Let’s get back to early 2000′.

It starts from the source

The data warehouse had been a well-established concept back then. Yet, many of the data warehouse projects were failures. How many – no one really knows, some sources report that more than 50%. The exact number is neither important here nor possible to obtain. From experience we know that at least some of the large-scale projects declared as “success” were far from delivering to the initial expectations.

From the other hand there appeared “big data”. The “big data” was another ill-defined term, used to describe anything which did not fit into RDBMS available back then. It might have been because of the sheer volume of the data or the difficulty to squeeze them into the relational model of a database. In these circumstances Hadoop was created. It came with Hadoop Distributed File System (HDFS) as storage and Map-Reduce as programming model. The data were stored as flat files, so there was no need to think of keys, attributes, columns and relations. Hadoop quickly become popular for processing data at a scale, without investing in expensive hi-end hardware and equally expensive database licences.

This was clearly an alternative way to collect data and process them for query and analysis. Few years later in the community of Hadoop users a new concept, called “a data lake”, was created. Similarly to “big data” it lacked (and still lacks) a precise definition. The classic description comes from James Dixon’s blog article published in 2010 and is figurative rather than concrete:

If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.

James Dixon

The comparison between a small volume of a bottle of water and a vast capacity of a lake, with the seemingly endless possibilities of using it, easily creates an impression of superiority of the data lake. If you got such an impression we strongly encourage you to read the original article (it’s short and concise). One of the important points is that Dixon refers to the data originating from a single system – hence he uses a “datamart” term, not a “data warehouse”.

Dive in a lake, drink from a bottle

So, in its origins a data lake was not meant to be a simple replacement for a data warehouse, which typically integrates data from many sources. Dealing with unstructured data was not the main driving force either – nearly all of the data handled in practice were structured or semi-structured. It was conceived to process the data of a volume not fitting (back then) into RDBMS. More importantly, it was made to allow answering the questions not yet known at the moment of designing the system.

The goal of answering the questions not yet known has far-reaching consequences. One cannot simply make a selection: these data are of key business importance, the other are of no value beyond transient operational purposes. These we keep, the other we throw away. If we do not know the question then we cannot know which data would be useful to answer it. So, we store everything, just in case. Even if some data seem useless today, they may serve someone to answer a question which may appear tomorrow. Obviously, in this situation any data modelling effort at this stage would be futile. The only practical thing to do now is to save the data as they are, without pre-processing, transformation or modelling. It will be the responsibility of someone who would use the data for some purpose, some time in the future.

It is the second substantial difference between the data warehouse and the data lake. The data warehouse is meant to answer (mostly) predetermined questions about well known business processes. The main use of a data warehouse is descriptive analytics bounded to a known domain. Consequently, the data are modelled according to the known business processes they describe. The data are integrated. Both integration and modelling entail pre-processing and data transformation. Indeed, it may resemble bottled water – extracted from the source, purified and delivered ready to drink, in a handy packaging. The data lake does not possess any of these features. It was meant to allow exploratory analysis, limited only by imagination and availability of the data. It stores raw, unprocessed data. Any (pre-)processing happens only as a part of the analytical process.

Eutrophication of the lake

The data lake concept got quite some popularity. Unfortunately, the original idea was distorted shortly after it appeared. Many people, disappointed with large, expensive and often unsuccessful data warehouse projects perceived data lake as a replacement or an alternative to the data warehouse. “Big data” was trendy, “Hadoop” was trendy, so the “data lake” was trendy, too. The old-fashioned data warehouse was passé, there was a new, hip data warehouse 2.0, called “data lake”. The data lake concept was very quickly simplified to statements like: “Hadoop replaces the role of OLAP (online analytical processing) in preparing data to answer specific questions[1]. It boiled down to a strategy: “Let’s throw all our data to Hadoop, call it a data lake and we will live happily ever after“. Well, not quite…

Großes Moor bei Becklingen

With all this hype from one side and common misunderstandings of the whole concept from the other it is no wonder that the disappointments and criticism appeared. A large part of the criticism were in fact based on false assumptions or getting the idea wrong. The original author tried to address this criticism and in particular the misconceptions in his post “Data Lakes Revisited” already in 2014. He wrote (again we encourage to read the full post):

“A single data lake houses data from one source. You can have multiple lakes, but that does not equal a data mart or data warehouse.

A Data Lake is not a data warehouse housed in Hadoop. If you store data from many systems and join across them, you have a Water Garden, not a Data Lake.”

James Dixon

But his voice was either not heard in the whole fuzz about big data or it was already too late. The meaning of the term “data lake” had already drifted away from the original concept and there was no easy way back. The misunderstandings of the concept continued to be there, so was the criticism based on them.

What some people got wrong from the idea was that all the data should be stored on Hadoop and this was what already made a data lake. The data lake was treated as a replacement for a data warehouse, even though it had never meant to be one. There were data stores resembling data warehouses built on Hadoop, with Hive or Impala as a data access interface, called a “data lake”. They were used mainly to answer the same pre-defined business questions, as the data warehouse would normally do. The proper data model was often missing, as was the data governance. Some people must have assumed that if the data do not need a pre-defined schema they also do not require data governance. The data was stored “raw”, as they came from the source, without any pre-processing or proper integration. Obviously, the systems built according to these misconceptions most often did not deliver to the expectations. These so-called “data lakes” were turning into what critics named data swamps.

Where is the water now?

Today there is still no single definition of what the data lake is. And if we look at what the popular definitions or descriptions say we can see how much the data lake is different now from the original concept:

“A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics (…).”

Amazon

“A data lake is a system or repository of data stored in its natural/raw format (…). A data lake is usually a single store of data including raw copies of source system data, (…) and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data (…), semi-structured data (…), unstructured data (…).”

wikipedia

“A data lake is a concept consisting of a collection of storage instances of various data assets. These assets are stored in a near-exact, or even exact, copy of the source format and are in addition to the originating data stores.”

Gartner

These descriptions focus on multiple sources, many purposes or application, raw format of stored data and possibility to store semi-structured or unstructured data. As we already know the idea of storing data from many sources was not there initially and possibility to store unstructured data was not the main concern either. What remained of the original concept is that the data are stored in raw format, as they come from their source, without any preprocessing or aggregations. There is no structure or schema defined beforehand. The data are supposed to serve for various tasks, apparently including (simple) analytics and reporting. This is a change from the original goal of answering questions not yet known to answering all questions, including the pre-defined ones.

The raw data storage format, lack of predetermined schema and consequently lack of data pre-processing are the key differences between the data lake and the data warehouse.

And what if we do all these pre-processing, data cleansing and integration steps only once? And then store the clean, conformed, structured and perhaps aggregated data in a single place, readily available for trustworthy analytics and reporting? An excellent idea! This is (roughly) how a data warehouse is built…