In this article, we’ll be tackling the key differences between data lakes and data warehouses, two of the main concepts in the data industry, the field that is often affectionately called “Big Data.” In the land known as Big Data, there are lakes, and there are warehouses. Not literally, of course – data lakes and data warehouses are typically virtual servers built using a cloud-based server architecture, and they serve similar purposes for data engineering and analytics.
A “data lake” is a location where various data “streams” (such as from real-time event data) can be collected and observed by data engineers and data analysts, in comparison to “siloed” data stores that may only be accessible to certain users or with certain software tools. A “data warehouse” is similar but typically implies that the data has been standardized to fit a schema.
Generally, data warehouses only handle structured data, but data lakes can handle structured, semi-structured, and unstructured data. The advantage of the data warehouse is that structured data lends itself more easily to data analytics and business intelligence (BI) reporting. And, in addition to processing streaming data, both data lakes and data warehouses can handle “batch” data delivered in one-off chunks (instead of a continuous stream of data). Let’s compare them.
Both data warehouses and data lakes store massive amounts of data, often measured in TB (terabytes) or PB (petabytes). Usually, you pay for cloud data platforms in terms of storage space (databases, in Snowflake’s data warehouse) separate from compute time, meaning the processing power you use (in Snowflake, virtual warehouses). Data warehouses and data lakes typically integrate with a cloud-based file storage solution, or file store, such as Amazon S3 (AWS Cloud Object Storage), though they may use some internal storage solution as well. Both data warehouses and data lakes allow data to be processed in-place so that it can be queried, and both solutions allow the data to be organized and tagged in some way for future reference.
As we’ll explore here, there are many important differences between data warehouses and data lakes. The main difference between a data lake and a data warehouse is that a data warehouse implies some type of reformatting (or “transformation”) in order to make the data more structured. The data in a data warehouse may also be pruned (or “reduced”) by removing irrelevant information from the original dataset in order to reduce storage costs and improve query performance. Reformatting the data in this way is typically irreversible, meaning that data is lost, in comparison to a data lake, where data is just stored and retrieved in its original format.
 Data Lake stores raw data, Data Warehouse stores transformed data
 Data Lake keeps all data
 Data Lake easily adapts to changes
“Unlike a data lake that stores unprocessed data, data warehouse stores already filtered data processed for a specified purpose.”
Mr. Kahuha also writes that, while data warehouses may be cheaper than data lakes, data warehouses usually can’t replace data lakes because they serve different purposes for organizations. For example, a company may keep all the data that their SaaS app creates inside a data lake for compliance purposes, but they might also need a data warehouse for generating business intelligence (BI) reports. The data warehouse might have the advantage that, by transforming the data and eliminating unnecessary parts of the dataset, analytical query performance improves from hours to minutes.
Let’s take a good look at the differences between these two key pieces of data technology, starting with the question of why we need data warehouses and data lakes in the first place.
Before we dive deep into the differences between data warehouses and data lakes, let’s consider 3 main reasons for developing a company-wide data architecture in the first place:
Modern businesses have vastly more data than ever before, and that data is rich for actionable insights. Now let’s move on to discussing where a data lake fits into the modern data stack.
While data warehouse is probably the more common term, we will start by defining a data lake. Going back to what the purpose of data architecture is – how companies combine all of these sources of information to make decisions – then a data lake is like a central landing zone for all of the company’s data. A data lake serves as the depot for many different sources of data. Typically, the data lake is unstructured, meaning it does not map to a neat database schema.
The data architecture in a company is constantly dumping data into the data lake, usually without any processing. If we’re using an unstructured data lake as a “catch-all” for all of the company’s data then, then the data will be in all different formats: JSON files, images, videos, text blobs, and CSV files. The data lake is just everything dumped into one single location.
The location of the data stored inside the data lake can be one of many choices, typically with a major cloud provider. The data lake’s file store could be an S3 Amazon bucket, it could be hosted in Microsoft Azure storage, or a company could use a modern cloud platform like Snowflake. The goal of the data lake is to just load all of the data into some type of database server, and organizations have a lot of flexibility to pick wherever they want to host their data lake and its file store. And this flexibility has become a requirement of modern data architecture in today’s businesses because they’re constantly generating more and more data. Companies don’t want to wait an hour or a day for data to arrive, as would be typical in an Extract-Transform-Load (ETL) data pipeline. The idea instead is to use “extract & load” to the data lake of either batch (large chunks of data) or streaming (real-time event) data so as to constantly be loading data into this data lake. Then, on some cadence (such as once an hour), the company pulls from the data lake to perform transformations or other types of data analytics.
So now let’s talk about the data warehouse, which can be thought of as coming after a data lake if we think about the process as linear. First, the data is generated, then it is loaded in raw (unstructured) form to a data lake, and then it is moved into a data warehouse for business intelligence (BI) and other analytics reporting.
Since the data lake as we’ve described it here is an unstructured repository of data in all of its. different forms, it’s not easy to relate to in any sense, meaning we can’t make decisions off of it.
The goal of the data warehouse is to take all of an organization's messy data from the data lake and turn it into a single source of clean, formatted, and organized data with some sort of structure. A data warehouse is often a separate database organized into schemas and tables.
At the level of the data warehouse, companies can develop concepts like facts and dimensions. Facts are quantitative transactional data, such as transaction numbers, that are ready for quantitative analysis. For example, the amount sold of a given product is called a “fact measure,” or, more commonly, a key performance indicator (KPI). Data warehouses can be engineered to store fact tables, which make specific facts (KPIs) readily accessible. In comparison, dimensions are attributes of facts, such as the date of a sale or a customer’s pieces of demographic information. An example of a website dimension would be the page’s name and URL. Dimensions may relate to many different facts and may change over time.
The data warehouse contains structured data in a way that the organization can use. There are numerous approaches to transforming data from its raw form to the data warehouse, including:
While these approaches all have their pros and cons, the objective of any of them is to create a data warehouse that serves as the single source of data in an organized form of the original raw data. Companies will typically update the data warehouse on a certain cadence, such as hourly or daily, by pulling from the data lake, running transformations, adding business logic, and taking other steps that are necessary to make the data easier to interpret.
The end goal of the data warehouse is so that the company has something clean and easy to work with. Not only is the data warehouse more organized and easier to interpret than the catch-all data lake where anything goes, the data warehouse’s performance might be better as well. The performance improvements of a data warehouse (such as Snowflake) come from an internal architecture that is optimized for analytical queries using keys and relationships.
Regardless of their similarities, data lakes and date warehouses are quite different from each other. Data lakes, due to their support for unstructured content, can be complex to navigate and may require data engineers to set up data pipelines, while some data warehouses are more user-friendly and suitable for use by less technical users in an enterprise.
Another key difference is that, since data lakes usually store raw data, their storage capacity must be higher than that of data warehouses. Meanwhile, since they only store processed data, data warehouses can save a lot of storage space, which translates into money savings. In addition, since it’s already processed, the data in a data warehouse is frequently much more accessible and understandable by a less technical audience than the data in a data lake.
At a high level, the purpose of a data warehouse is to take unstructured data and create a single source of organized truth for businesses to use however they want, typically to make decisions from, but also to provide up-to-date information through software tools like product dashboards. Unlike the data lake, the data warehouse will present the data with a consistent look and feel, and things probably make more sense. That’s an advantage of a data warehouse.
“Where a data warehouse is a well-structured, schema-first repository with known data access patterns, a data lake is a large storage area that holds data as-received with minimal transformations. Additionally, data lakes ingest and store data at scale in loosely defined business domain areas. They have become synonymous with big data solutions because they excel at the ‘3 Vs' of big data: volume, variety, and velocity.”
–Tyler Estes on the Credera blog
As we’ve seen, in terms of data architecture built to support business decision-making, data lakes can be thought of as the first step (capturing all data in any format), and data warehouses can be thought of as the second step (organizing that data in a central location). These boundaries are not always clearly defined, and some cloud-based solutions, such as Snowflake, can serve as both the data lake and data warehouse at the same time. Nevertheless, data lakes and data warehouses are two critically important components of the modern data stack.
Did you know dashboards and other data apps can be built from data already in Snowflake, without any further data engineering? That’s why we built Propel – because we believe organizations using data warehouses should be able to ship data apps in minutes, not months.
If you enjoyed this article on data engineering, please follow us @propeldatacloud on Twitter, sign up for our free email newsletter, or join our customer waiting list for the latest news and educational content from Propel Data.
As businesses strive to become data-driven, the separation of storage and compute has become a critical factor in data platforms.
In the “Big Data” industry, there are big differences among the work responsibilities of data scientists, data engineers, and data analysts.