How to select a table engine and sorting key
In this guide, you’ll learn how to select the optimal table engine and sorting key for your table in Propel’s Serverless ClickHouse.
Background
This section provides some background on ClickHouse table engines and the role the sorting key plays while organizing data in tables.
👉 How to choose the right table engine and a good sorting key.
Understanding table engines
Table engines determine how ClickHouse tables store, process, read, and update their data.
Table engines are organized in families. The main family (and the one that Propel supports) is the MergeTree family. Other, less common table engines are beyond the scope of this guide.
Overview of the MergeTree table engines
The MergeTree is ClickHouse's primary table engine family. MergeTree tables optimize for fast inserts and reads. They achieve this by storing the inserted data in parts and merging those parts in the background. The various MergeTree tables differ in their merging approach, with each having unique merging rules.
The table types include:
- MergeTree: The most common engine in the MergeTree family. It is designed for append-only, immutable data, as merging consolidates parts but never deletes records.
- ReplacingMergeTree: A variation of the MergeTree engine. When merging data parts, it replaces old rows with new ones that have the same sorting key.
- SummingMergeTree: This engine is designed for use cases where you need to calculate the total for numeric columns. When merging data parts, it sums up columns for rows with the same sorting key.
- AggregatingMergeTree: This engine is ideal for incremental data aggregation. When merging data parts, it aggregates columns for rows with the same sorting key.
- CollapsingMergeTree: This engine is designed for scenarios where you need to account for a record's state changes. When merging data parts, it deletes (”collapses”) pairs of records with the same sorting key and opposite sign column.
- VersionedCollapsingMergeTree: This engine is an extension of the CollapsingMergeTree. It uses a different collapsing algorithm that allows inserting the data in any order.
Understanding the sorting key
The sorting key in ClickHouse is a set of one or more columns that ClickHouse uses to organize the rows within a table. The sorting key determines the order of the rows in the table, influences how parts are merged, and significantly impacts the query performance.
If the rows are sorted well, ClickHouse can efficiently skip over unneeded rows and thus optimize query performance.
How to choose the right table engine and a good sorting key
In this section, we’ll guide you through how to choose the right table engine and sorting key for your table.
What type of data do you have?
The first step is to figure out what type of data you have.
Three common types of data stored: append-only data, mutable records, and incremental aggregations.
- Append-only data: This type of data refers to data that is only added to and never modified or deleted. Once a record is inserted, it remains unchanged. Examples include:
- Events
- Log data
- Transaction records
- Time-series data
- Any immutable data
- Mutable records: These are records that can be updated or changed after they've been inserted. Examples include:
- Dimension data
- Entity data like customers, employees, users, products, etc.
- Pre-aggregated data where rows are periodically updated
- Incremental aggregations: This type of data refers to aggregations that are calculated incrementally as data is written. Examples include:
- Real-time game leaderboard
- Real-time e-commerce sales KPI
Append-only data
The recommended table engine for append-only data is the MergeTree, which is the most common choice in ClickHouse. It provides fast inserts and great query performance.
Append-only data is best stored in the MergeTree table engine
Next, we’ll help you choose a good sorting key for your table. This will depend on the schema.
A good sorting key usually has 3 to 5 columns. The following questions will help you choose the right sorting key for your query access patterns.
- Do your records have a primary timestamp that you will filter on, like a
created_at
date?- If yes, include it in the sorting key.
- Are there other columns you will filter on most queries? For example,
customer_id
orworkspace_id
columns for multi-tenant apps. You don’t need to include all columns you are going to filter on, only the filters that you are always (or most often) going to use.- If yes, place them to the left of the timestamp.
- Put columns with low cardinality first (most important for filtering) and then columns with high cardinality (and less important for filtering).
- If you have something like hierarchy or tree-like relations between the columns, put the columns from “root” to “leaves”. For example
continent
,country
, thencity name
.
Example: Sorting key for an events table in a multi-tenant app
The sorting key for an events table will depend on the specific schema and query access patterns.
For example, imagine our application supports multiple customers. Each customer can have multiple workspaces, with events belonging to individual workspaces. Then, the schema for our events table could look like this:
created_at
(timestamp)customer_id
(string)workspace_id
(string)event_type
(string)payload
(JSON)
Let’s assume that your query filters will of the form:
customer_id = 'customer 001'
AND workspace_id = 'workspace B'
AND created_at >= NOW() - INTERVAL '1 month'
Then your sorting key should be: customer_id
, workspace_id
, then created_at
.
- Console
- API
- Terraform
mutation {
createDataPoolV2(
input: {
uniqueName: "events"
description: "Data Pool for events"
timestamp: { columnName: "created_at" }
columns: [
{ columnName: "created_at", type: TIMESTAMP, isNullable: true }
{ columnName: "customer_id", type: TIMESTAMP, isNullable: true }
{ columnName: "workspace_id", type: STRING, isNullable: true }
{ columnName: "event_type", type: STRING, isNullable: true }
{ columnName: "payload", type: JSON, isNullable: true }
]
tableSettings: {
engine: {
mergeTree: {
type: MERGE_TREE
}
}
orderBy: ["customer_id", "workspace_id", "created_at"]
}
}
) {
dataPool {
id
uniqueName
description
tableSettings {
orderBy
}
}
}
}
The customer_id
is the lowest granularity so it should go first. The workspace_id
belongs to a customer, so they have a tree-like relationship, so it should go next. Once the data is reduced to a given workspace_id
of a customer_id
, we’d want it ordered by time, adding the created_at
last.
This way, your data will be sorted like the table below:
An illustration of the events table, with arrows indicating the sort order of its rows, first by the customer_id column, then by the workspace_id column, and finally by the created_at column.
Important things to note to take advantage of this sorting key:
- You can provide just the
customer_id
or both thecustomer_id
andworkspace_id
in the filters or GROUP BYs and still benefit from the sorting key as it reduces rows read. - If you just provide the
created_at
, you will NOT take advantage of the sorting key, as Propel will have to scan the full table. - With Materialized Views, you can create different versions of the same table with different sorting keys to support different query patterns.
Mutable records
The recommended table engine for mutable records is the ReplacingMergeTree. It provides fast inserts, handles updates, and provides great query performance for medium-sized tables (up to 100 million records). For tables over 100 million records, see the “Handling large updating tables” section below.
Mutable records are best stored in the ReplacingMergeTree table engine
We’ll help you choose a good sorting key and settings for your table. This will depend on the schema, how uniqueness is defined, and the scale of your data.
What column (it has to be immutable) determines the uniqueness of a record in your table?
If it is something like
user_id
, include it in the sorting key.If it is a combined uniqueness key, like an
order_id
and aproduct_id
, include them all in the sorting key.- Console
- API
- Terraform
mutation {
createDataPoolV2(
input: {
uniqueName: "users"
description: "Data Pool for users"
columns: [
{ columnName: "user_id", type: STRING, isNullable: false }
{ columnName: "created_at", type: TIMESTAMP, isNullable: true }
{ columnName: "first_name", type: STRING, isNullable: true }
{ columnName: "last_name", type: STRING, isNullable: true }
{ columnName: "email", type: STRING, isNullable: true }
]
tableSettings: {
engine: {
replacingMergeTree: {
type: REPLACING_MERGE_TREE
}
}
orderBy: ["user_id"]
}
}
) {
dataPool {
id
uniqueName
description
tableSettings {
orderBy
}
}
}
}
Do your records contain a
version
orupdated_at
column that increases with each update to the record?Include it in the ReplacingMergeTree version column.
- Console
- API
- Terraform
mutation {
createDataPoolV2(
input: {
uniqueName: "users"
description: "Data Pool for users"
columns: [
{ columnName: "user_id", type: STRING, isNullable: false }
{ columnName: "created_at", type: TIMESTAMP, isNullable: true }
{ columnName: "first_name", type: STRING, isNullable: true }
{ columnName: "last_name", type: STRING, isNullable: true }
{ columnName: "email", type: STRING, isNullable: true }
{ columnName: "updated_at", type: TIMESTAMP, isNullable: false }
]
tableSettings: {
engine: {
replacingMergeTree: {
type: REPLACING_MERGE_TREE
ver: "updated_at"
}
}
orderBy: ["user_id"]
}
}
) {
dataPool {
id
uniqueName
description
tableSettings {
orderBy
}
}
}
}
Propel automatically filters out older versions of rows when querying and periodically deletes them in the background during merges).
If your table has less than ~100 million records, this is all you need. You’ll get fast queries and updates based on the record identifier.
Since the sorting key determines the record uniqueness in ReplacingMergeTree tables, keeping things simple is advantageous.
Example: Sorting key for a “Users” entity table
The sorting key for an entity table will depend on how uniqueness is defined.
Consider the schema below:
user_id
(string)created_at
(string)first_name
(string)last_name
(string)email
(string)
In this case, the user_id
determines uniqueness. Therefore, your sorting key should be just user_id
.
- Console
- API
- Terraform
mutation {
createDataPoolV2(
input: {
uniqueName: "users"
description: "Data Pool for users"
columns: [
{ columnName: "user_id", type: STRING, isNullable: false }
{ columnName: "created_at", type: TIMESTAMP, isNullable: true }
{ columnName: "first_name", type: STRING, isNullable: true }
{ columnName: "last_name", type: STRING, isNullable: true }
{ columnName: "email", type: STRING, isNullable: true }
]
tableSettings: {
engine: {
replacingMergeTree: {
type: REPLACING_MERGE_TREE
}
}
orderBy: ["user_id"]
}
}
) {
dataPool {
id
uniqueName
description
tableSettings {
orderBy
}
}
}
}
Putting the user_id
in the sorting key will enforce uniqueness during merges and allow for efficient user lookups by ID.
This way, your data will be sorted like the table below:
An illustration of the “users” entity table, with arrows indicating the sort order of its rows by the user_id column.
Handling large updating tables
If you have updatable data that is over 100 million records with mutable data that needs to be queried fast, the following questions will help you choose a sorting key and understand the tradeoffs.
Do you need to filter or aggregate your data by time? Like, “Give me all the users created last month”? Does your data have a
created_at
timestamp column?If so, you can add the
created_at
to the sorting key to the left of theuser_id
from the example above. Take into account that:- The
created_at
column cannot be mutable. - The
created_at
column will now be part of the uniqueness definition.
The resulting sorting key would be:
created_at
,user_id
.- Console
- API
- Terraform
mutation {
createDataPoolV2(
input: {
uniqueName: "users"
description: "Data Pool for users"
timestamp: { columnName: "created_at" }
columns: [
{ columnName: "user_id", type: STRING, isNullable: false }
{ columnName: "created_at", type: TIMESTAMP, isNullable: false }
{ columnName: "first_name", type: STRING, isNullable: true }
{ columnName: "last_name", type: STRING, isNullable: true }
{ columnName: "email", type: STRING, isNullable: true }
]
tableSettings: {
engine: {
replacingMergeTree: {
type: REPLACING_MERGE_TREE
}
}
orderBy: ["created_at", "user_id"]
}
}
) {
dataPool {
id
uniqueName
description
tableSettings {
orderBy
}
}
}
}- The
Do you need to filter or aggregate your data by any other dimension? For example, “Give me all the users that signed up on mobile with an iOS device,” determined by a
signup_channel
andsignup_os
columns.If so, you can add the
signup_channel
andsignup_os
columns to the sorting key to the left of thecreated_at
and/oruser_id
from the examples above. Again, take into account that:- The
signup_channel
andsignup_os
columns cannot be mutable. - The
signup_channel
andsignup_os
columns will now be part of the uniqueness definition.
The resulting sorting key would be:
signup_channel
,signup_os
,created_at
,user_id
.- Console
- API
- Terraform
mutation {
createDataPoolV2(
input: {
uniqueName: "users"
description: "Data Pool for users"
timestamp: { columnName: "created_at" }
columns: [
{ columnName: "user_id", type: STRING, isNullable: false }
{ columnName: "created_at", type: TIMESTAMP, isNullable: false }
{ columnName: "first_name", type: STRING, isNullable: true }
{ columnName: "last_name", type: STRING, isNullable: true }
{ columnName: "email", type: STRING, isNullable: true }
{ columnName: "signup_channel", type: STRING, isNullable: false }
{ columnName: "signup_os", type: STRING, isNullable: false }
]
tableSettings: {
engine: {
replacingMergeTree: {
type: REPLACING_MERGE_TREE
}
}
orderBy: ["signup_channel", "signup_os", "created_at", "user_id"]
}
}
) {
dataPool {
id
uniqueName
description
tableSettings {
orderBy
}
}
}
}- The
Incremental aggregations
The recommended table engine for incremental aggregations are the SummingMergeTree and AggregatingMergeTree, depending on the type of aggregations you need to do.
For simple SUM
and COUNT
aggregations, the SummingMergeTree will do. For other types of aggregations, like AVG
, MAX
, MIN
, PERCENTILES
, UNIQUES
, and others, we recommend you use the AggregatingMergeTree.
Incremental aggregations are best stored in the SummingMergeTree and AggregatingMergeTree table engines
We’ll help you choose a good sorting key and settings for your table. This will depend on the schema and the columns you need to group by.
- Which columns do you want to group by?
- Add them to the sorting key. Both the SummingMergeTree and the AggregatingMergeTree replace rows during merging with the same sorting key with a single row with the aggregate values.
- Are you only doing
SUM
orCOUNT
aggregations? i.e. using the SummingMergeTree?- Propel will automatically summarize all numeric columns not part of the sorting key. If you want to specify which columns to summarize, you can enter them in the “columns” parameter.
Example: Sorting key for a “Visits” aggregated table
The sorting key for an incremental aggregation table will include the dimensions you need to group by.
Consider the schema below:
domain
(string)path
(string)visits
(Int)
In this case, we want to group visits by domain
and path
.
Putting the domain
and path
in the sorting key will “roll up” the visits by two columns.
- Console
- API
- Terraform
mutation {
createDataPoolV2(
input: {
uniqueName: "visits"
description: "Data Pool for visits"
columns: [
{ columnName: "domain", type: STRING, isNullable: false }
{ columnName: "path", type: STRING, isNullable: false }
{ columnName: "visits", type: INT64, isNullable: true }
]
tableSettings: {
engine: {
summingMergeTree: {
type: SUMMING_MERGE_TREE
}
}
orderBy: ["domain", "path"]
}
}
) {
dataPool {
id
uniqueName
description
tableSettings {
orderBy
}
}
}
}
This way, your data will be sorted like the table below:
An illustration of the “visits” aggregated table, with rows aggregated by the domain and path columns.
Frequently asked questions
What is the difference between the sorting key and the primary key?
In ClickHouse, the sorting key and primary key are closely related, but serve different purposes:
- The sorting key is used to sort the rows stored within each table part. The sorting key also influences the merge behavior of the MergeTree family of table engines, as described above.
- The primary key is used for indexing. ClickHouse builds a sparse primary index that allows it to quickly skip over irrelevant rows during a query, based on the primary key.
If the primary key is unspecified, it defaults to the sorting key, and so the primary key and sorting key are often identical. In most cases, you don’t need separate sorting and primary keys.
It is important to note that in ClickHouse, the primary key doesn't enforce record uniqueness as it does in other databases.
What happens if I don’t define a sorting key for my table?
If you don't define a sorting key for your table in ClickHouse, the data in your table will not be organized in any specific order. This lack of order can lead to slower query performance, as ClickHouse will not be able to skip over irrelevant data during a query. It is always recommended to define a sorting key that aligns with your most common query patterns to optimize your table for fast data retrieval.
How can I change the sorting key of an existing table?
You can’t change the sorting key of an existing ClickHouse table. Instead, you will need to create a Materialized View that takes the records from the source table and writes them to a new table that has a different sorting key.
Do merges happen on a given schedule? Is it configurable?
Merges in ClickHouse do not occur on a predictable schedule and are not directly configurable. Instead, they are determined by a complex algorithm within ClickHouse. This algorithm considers various factors, such as the size of the data, the amount of available disk space, and the system load. Consequently, the timing and frequency of merges can vary greatly and are not directly controllable by the user.
Can a table have multiple sorting keys to support different access patterns?
No, a table cannot have multiple sorting keys in ClickHouse. However, you can create Materialized Views that are updated in real-time, each with a different sorting key to optimize different access patterns.
What happens if I use mutable columns in my sorting key?
If you use mutable columns in your sorting key in ClickHouse, it can lead to unexpected and incorrect query results. This is because the sorting key in ClickHouse is used to organize data within each table part at the time of data insertion. If a column value changes after it has been inserted, the physical order of the data in the table does not change. Therefore, the data organization based on the old value of the mutable column stays intact. This discrepancy between the actual value of the column and the physical data organization can cause ClickHouse to skip over relevant data during a query, leading to incorrect results.