How to select the best table engine and sorting key in ClickHouse
Identify primary timestamp
created_at
date. If yes, include it in the sorting key.Identify frequently filtered columns
customer_id
or workspace_id
columns for multi-tenant apps. You don’t need to include all columns you are going to filter on, only the filters that you are always (or most often) going to use. If you have such columns, place them to the left of the timestamp in the sorting key.Order columns by cardinality
Consider hierarchical relationships
continent
, country
, then city name
.created_at
(timestamp)customer_id
(string)workspace_id
(string)event_type
(string)payload
(JSON)customer_id
, workspace_id
, then created_at
.
customer_id
is the lowest granularity so it should go first. The workspace_id
belongs to a customer, so they have a tree-like relationship, so it should go next. Once the data is reduced to a given workspace_id
of a customer_id
, we’d want it ordered by time, adding the created_at
last.
This way, your data will be sorted like the table below:
customer_id
or both the customer_id
and workspace_id
in the filters or GROUP BYs and still benefit from the sorting key as it reduces rows read.created_at
, you will NOT take advantage of the sorting key, as Propel will have to scan the full table.Determine uniqueness column
user_id
, include it in the sorting key.order_id
and a product_id
, include them all in the sorting key.Check for version or updated_at column
version
or updated_at
column that increases with each update to the record?If yes, include it in the ReplacingMergeTree version column.ver
, is an optional column that allows you to control which
version of a row Propel considers ‘newest’. If unspecified, Propel considers
the most recently inserted row newest.Propel automatically filters out older versions of rows when querying and periodically
deletes them in the background during merges.Consider table size
user_id
(string)created_at
(string)first_name
(string)last_name
(string)email
(string)user_id
determines uniqueness. Therefore, your sorting key should be just user_id
.
user_id
in the sorting key will enforce uniqueness during merges and allow for efficient user lookups by ID.
This way, your data will be sorted like the table below:
created_at
timestamp column?
created_at
to the sorting key to the left of the user_id
from the example above. Take into account that:
created_at
column cannot be mutable.created_at
column will now be part of the uniqueness definition.created_at
, user_id
.
signup_channel
and signup_os
columns.
signup_channel
and signup_os
columns to the sorting key to the left of the created_at
and/or user_id
from the examples above. Again, take into account that:
signup_channel
and signup_os
columns cannot be mutable.signup_channel
and signup_os
columns will now be part of the uniqueness definition.signup_channel
, signup_os
, created_at
, user_id
.
SUM
and COUNT
aggregations, the SummingMergeTree will do. For other types of aggregations, like AVG
, MAX
, MIN
, PERCENTILES
, UNIQUES
, and others, we recommend you use the AggregatingMergeTree.
Identify grouping columns
Choose aggregation type
SUM
or COUNT
aggregations, which would indicate using the SummingMergeTree.Configure summarization (for SummingMergeTree)
domain
(string)path
(string)visits
(Int)domain
and path
.
Putting the domain
and path
in the sorting key will “roll up” the visits by two columns.
What is the difference between the sorting key and the primary key?
What happens if I don't define a sorting key for my table?
How can I change the sorting key of an existing table?
Do merges happen on a given schedule? Is it configurable?
Can a table have multiple sorting keys to support different access patterns?
What happens if I use mutable columns in my sorting key?