Data Engineering

What is the separation of storage and compute in data platforms and why does it matter?

As businesses strive to become data-driven, the separation of storage and compute has become a critical factor in data platforms. By separating these two components, businesses can take further advantage of the advances in cloud computing for their data analytics.

The separation of storage and compute is a new paradigm in data platforms. When storage and compute are separated, each can be consumed, scaled, and priced independently. This allows businesses to pay for what they use and nothing more. As customers, we want to avoid waste in all forms, including unused capacity. Let's take a closer look at how this separation works and the benefits it brings.

Before this separation was possible, many companies were running systems like Hadoop with HDFS, where the dominant paradigm was that compute and storage were tightly coupled 1:1. For example, if your queries got more complicated, you had to resize the cluster; if the amount of data you had increased, you also had to resize the cluster. And the cluster always had to be sized for both peak queries and peak storage. The result was a very inelastic data architecture with a lot of waste, but this was state-of-the-art in 2006 when Hadoop was first released, and it is still used today.

In this post, we will cover how the separation of storage and compute works, how you can take advantage of it, the key technologies that make it possible, and some examples of data platforms that implement it.

How does the separation of storage and compute work?

When you buy a computer or rent an AWS instance, it comes with one or more CPUs and one or more hard drives: the compute and the storage. Typically, more powerful CPUs are paired with larger hard drives. So if you only need the more powerful compute, the extra storage goes unused, and vice versa.

With the advances in serverless computing and cloud storage, data platforms can now combine whatever storage and compute they need on demand rather than relying on predefined instance configurations. The "infinite storage" offered by services like AWS S3 means that data platforms don't even need to provision more storage as customers use more space. Similarly, elastic computing enables data platforms to offer different levels of compute capacity on demand, so customers can choose the level that meets their query processing requirements.

If a customer on a data platform needs a lot more storage but only a small amount of compute, the platform can allocate more storage to that specific customer and a smaller compute allocation, because the two are not coupled 1:1.
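To make the difference concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it (the node shape, the per-node price, and the per-vCPU and per-TB prices) is a made-up assumption for illustration, not a real price from any provider. It compares a storage-heavy customer on bundled nodes with the same customer when compute and storage are billed separately.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    """A hypothetical fixed instance shape where compute and storage come bundled."""
    vcpus: int
    storage_tb: float
    monthly_price: float


def coupled_cost(node: NodeType, needed_vcpus: int, needed_tb: float) -> float:
    """Monthly cost when every node carries both compute and storage.

    The cluster must be big enough for whichever requirement is larger,
    so the other resource is over-provisioned.
    """
    nodes_for_compute = math.ceil(needed_vcpus / node.vcpus)
    nodes_for_storage = math.ceil(needed_tb / node.storage_tb)
    return max(nodes_for_compute, nodes_for_storage) * node.monthly_price


def decoupled_cost(needed_vcpus: int, needed_tb: float,
                   price_per_vcpu: float, price_per_tb: float) -> float:
    """Monthly cost when compute and storage are sized and billed independently."""
    return needed_vcpus * price_per_vcpu + needed_tb * price_per_tb


# Illustrative, made-up numbers: a storage-heavy customer with light compute needs.
node = NodeType(vcpus=16, storage_tb=4.0, monthly_price=800.0)
coupled = coupled_cost(node, needed_vcpus=16, needed_tb=200)
decoupled = decoupled_cost(needed_vcpus=16, needed_tb=200,
                           price_per_vcpu=30.0, price_per_tb=25.0)

print(f"coupled:   ${coupled:,.0f} per month")
print(f"decoupled: ${decoupled:,.0f} per month")
```

In the coupled case the 200 TB of data forces 50 nodes, and with them 800 vCPUs, even though the customer only needs 16. The exact savings depend entirely on the assumed prices, but the shape of the waste is the point.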

What are the benefits of separating storage and compute?

There are many benefits of separating storage and compute in data platforms. Here are some of the main ones:

  • <span class="heavy">You only pay for what you use:</span> With the separation of storage and compute, you only pay for the resources you actually use. If you have a lot of data but don't need to do much processing on it, you won't be paying for expensive compute resources that you don't need.
  • <span class="heavy">Flexible Scaling:</span> You can scale your storage and compute independently to match your needs, potentially unlocking new use cases. If you need to do more processing, you can simply add more compute resources.
  • <span class="heavy">Better Utilization of Resources:</span> By being able to scale storage and compute independently, data platforms can make better use of their resources. For example, if you have a lot of data but don't need to do much processing on it, you can use less expensive storage options and save money.

How can you take advantage of the separation of storage and compute?

There are a few steps businesses can follow to take advantage of the separation of storage and compute in data platforms:

  1. <span class="heavy">Identify your workloads:</span> Identify the discrete data workloads you have. For example, internal reporting could be one workload and customer-facing analytics for your product could be another.
  2. <span class="heavy">Allocate resources to each workload:</span> Once you have identified the workloads you need, the next step is to allocate resources. Following our example, internal employees might be OK waiting 30 seconds for reports to load, while customers have no tolerance for slow dashboards. In this case, you could allocate more resources to your customer-facing analytics workload.
  3. <span class="heavy">Optimize your resources:</span> Finally, make sure to optimize your resources to get the most out of them. This is going to depend on the platform, especially if these resources need to be provisioned versus offered on demand.
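The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tier names, the speed-up factors, and the baseline latencies are hypothetical, and the assumption that latency shrinks in proportion to compute size is a simplification that real platforms will not follow exactly.

```python
from dataclasses import dataclass

# Hypothetical compute tiers: (name, speed-up relative to the smallest tier).
TIERS = [("small", 1), ("medium", 2), ("large", 4), ("xlarge", 8)]


@dataclass(frozen=True)
class Workload:
    name: str
    baseline_seconds: float  # assumed query latency measured on the smallest tier
    target_seconds: float    # latency the users of this workload will tolerate


def pick_tier(workload: Workload) -> str:
    """Return the smallest tier whose estimated latency meets the target.

    Assumes latency scales inversely with the tier's speed-up factor.
    """
    for tier_name, speedup in TIERS:
        if workload.baseline_seconds / speedup <= workload.target_seconds:
            return tier_name
    # No tier meets the target: fall back to the largest one.
    return TIERS[-1][0]


# Step 1: identify the workloads. Step 2: allocate compute to each one.
workloads = [
    Workload("internal_reporting", baseline_seconds=24.0, target_seconds=30.0),
    Workload("customer_facing_analytics", baseline_seconds=6.0, target_seconds=1.0),
]
allocation = {w.name: pick_tier(w) for w in workloads}

for name, tier in allocation.items():
    print(f"{name}: {tier}")
```

Because both workloads read the same stored data, giving the customer-facing workload a larger tier does not require copying or resizing storage. Only the compute assigned to it changes.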

What are some examples of data platforms that separate storage and compute?

There are a few data platforms that have adopted this new paradigm:

  • Snowflake - In Snowflake you pay for storage and compute independently. Compute is provisioned with "Warehouses" that have different computing power.
  • Dremio - Dremio is a data lakehouse based on the open-source Apache Iceberg table format. It offers different compute instances to process data that lives in your S3 bucket. You pay for S3 storage independently.
  • Propel - Propel is an analytics API platform for building customer-facing analytics. In Propel you can customize the compute assigned to your different web or mobile apps. Storage and compute are priced independently.

What are the key technologies that make the separation of compute and storage possible?

Every data platform that takes advantage of the separation of storage and compute needs a storage layer (an object store or network attached storage), table and file formats, and a query engine. Some, like Snowflake, use proprietary technologies for each. The examples below are open source technologies for each of these categories (except the storage services).

Object storage and network attached storage: AWS S3, GCP Cloud Storage, or Azure Blob Storage

Object Storage: Services like AWS S3 can be used to store files on demand with "infinite storage". There is no need to provision or manage capacity, and it is completely decoupled from the instance or serverless function that is doing the processing.

Network Attached Storage (NAS): NAS is a storage architecture where data is stored on a dedicated server that is attached to a network. This allows users to access the data over the network using protocols like SMB (Server Message Block) or NFS (Network File System). NAS lets you swap, and sometimes resize, the storage capacity without changing the processing.

Open table and file formats: Apache Iceberg and Apache Parquet

Apache Iceberg is an open table format designed to store vast amounts of data for analytics. Iceberg tables can be used with different data processing engines such as Spark, Trino, Flink, or Hive. It works with Parquet files, giving them SQL table-like semantics through its metadata.

Apache Parquet is a file format that is used to store data in a columnar format. This allows for faster reads and reduced disk space. Not only is the Parquet format open source, but it also has an entire ecosystem of tools to help you create, read and transform data.
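To see why a columnar layout speeds up analytical reads, consider this toy Python sketch. It is not how Parquet encodes data on disk (Parquet adds per-column encoding and compression on top); the sample records and the "values touched" counter are illustrative assumptions that stand in for bytes read.

```python
# Three illustrative records, as an application would typically produce them.
rows = [
    {"user_id": 1, "country": "US", "revenue": 10.0},
    {"user_id": 2, "country": "DE", "revenue": 7.5},
    {"user_id": 3, "country": "US", "revenue": 2.5},
]

# Row layout keeps all values of one record together.
# Column layout keeps all values of one column together.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_revenue_row_layout(rows):
    """Sum revenue from row-oriented data, counting every value that is read."""
    total = 0.0
    values_touched = 0
    for row in rows:
        values_touched += len(row)  # the whole record is read to reach one field
        total += row["revenue"]
    return total, values_touched


def total_revenue_column_layout(columns):
    """Sum revenue from column-oriented data, reading only the one column needed."""
    revenue = columns["revenue"]
    return sum(revenue), len(revenue)


row_total, row_touched = total_revenue_row_layout(rows)
col_total, col_touched = total_revenue_column_layout(columns)

print(f"row layout:    total={row_total}, values touched={row_touched}")
print(f"column layout: total={col_total}, values touched={col_touched}")
```

Both layouts produce the same answer, but the column layout touches a third of the values here. The gap widens as tables get wider, which is common for analytics data.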

Query engines: DuckDB, Trino and Spark

The query engine actually executes queries against your tables, so it needs to support your underlying table format, file format, and object storage.

DuckDB is an in-process SQL OLAP database with a highly optimized query engine. It supports reading from object stores like S3, and it can perform predicate and projection pushdown when scanning Parquet files. These optimizations mean queries that might have previously required a cluster to complete can be executed on a single host, due to better utilization of compute.
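One common form of predicate pushdown relies on the min/max statistics that Parquet stores for each row group. The Python sketch below is a toy model of that idea, not DuckDB's implementation: the `RowGroup` class, the `scan_sum` function, and the sample data are all hypothetical names and values made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A hypothetical chunk of a file, with min/max statistics for the `ts` column."""
    min_ts: int
    max_ts: int
    rows: list  # list of (ts, amount) tuples


def scan_sum(row_groups, ts_from, ts_to):
    """Sum `amount` where ts_from <= ts <= ts_to, skipping row groups via statistics."""
    total = 0.0
    groups_read = 0
    for group in row_groups:
        # The statistics prove that no row in this group can match the filter,
        # so the group is skipped without reading any of its rows.
        if group.max_ts < ts_from or group.min_ts > ts_to:
            continue
        groups_read += 1
        total += sum(amount for ts, amount in group.rows if ts_from <= ts <= ts_to)
    return total, groups_read


row_groups = [
    RowGroup(1, 100, [(1, 5.0), (50, 1.0), (100, 2.0)]),
    RowGroup(101, 200, [(101, 3.0), (150, 4.0), (200, 1.5)]),
    RowGroup(201, 300, [(201, 9.0), (300, 0.5)]),
]

total, groups_read = scan_sum(row_groups, ts_from=120, ts_to=220)
print(f"total={total}, row groups read={groups_read} of {len(row_groups)}")
```

When the data lives in an object store, every skipped row group is a network read that never happens, which is part of why a single host can handle queries that look much larger on paper.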

However, once your data reaches a certain size or you reach the limits of vertical scaling, it may be necessary to distribute your queries across a cluster, that is, to scale horizontally. This is where distributed query engines like Trino and Spark come in. Distributed query engines use a coordinator to plan the query and multiple worker nodes to execute it in parallel.
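The coordinator and worker pattern can be sketched in miniature with Python's standard library. This is a simplified illustration, not how Trino or Spark are implemented: threads stand in for worker nodes, an in-memory list stands in for table data, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_partial_avg(partition):
    """Worker: compute a partial (sum, count) over its own partition only."""
    return sum(partition), len(partition)


def coordinator_avg(values, num_workers=4):
    """Coordinator: plan the partitions, fan out to workers, merge partial results."""
    if not values:
        raise ValueError("cannot average an empty dataset")

    # Plan: split the data into up to `num_workers` partitions.
    partitions = [values[i::num_workers] for i in range(num_workers)]
    partitions = [p for p in partitions if p]

    # Execute: each worker processes one partition in parallel.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker_partial_avg, partitions))

    # Merge: combine the partial sums and counts into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


result = coordinator_avg(list(range(1, 101)))
print(result)
```

Note that the workers return a sum and a count rather than an average. Averages of partitions cannot be combined correctly when partitions differ in size, so the coordinator merges the raw partials instead.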

Final thoughts on the separation of storage and compute

The separation of storage and compute is a new paradigm in data platforms that offers many benefits, including flexible scaling, better utilization of resources, and only paying for what you use. Not all data platforms take advantage of it yet, but key open-source technologies both make it possible and accelerate its adoption. Snowflake, Dremio, and Propel are examples of data platforms built this way. To take advantage of the separation of storage and compute in your own application, make sure you identify your workloads and allocate resources accordingly.
