How to store JSON in ClickHouse the right way

There are multiple approaches to storing JSON in ClickHouse, depending on just how structured or unstructured your JSON is.

Photo: Propel

There are multiple options for storing JSON in ClickHouse. Which option you choose will depend on the type of JSON you need to store. If your JSON is structured and follows a known, fixed schema, you can use normal ClickHouse column types for storing your JSON; however, if your JSON is not structured and its schema can change, then you’ll want to review the unstructured approaches below.

Structured JSON

Storing structured JSON in ClickHouse, where the JSON follows a known, fixed schema, is straightforward: for every top-level key of your JSON schema, define a corresponding ClickHouse column with an appropriate type. For example, scalar JSON types (strings, booleans, and numbers) can be represented using ClickHouse String, Bool, and Float64 types. Complex JSON types (arrays and objects) can be represented in a few different ways. For example,

  • JSON arrays of arbitrary length can be represented using ClickHouse’s Array type.
  • JSON arrays of a fixed-length can be represented using ClickHouse’s Tuple type.
  • JSON objects with arbitrary keys can be represented using ClickHouse’s Map type.
  • JSON objects with a fixed schema can be represented using ClickHouse’s Tuple type.

The more facts about your JSON schema that you can encode in your ClickHouse table schema (for example, using a Tuple type with known keys versus a Map type with arbitrary keys), the more efficiently ClickHouse will be able to store your JSON and service queries. Additionally, by storing your JSON using native ClickHouse column types, you can use all of ClickHouse’s functionality, including data skipping indexes and querying with functions. Some of this functionality either does not work with other approaches or is more difficult to use.

Example

Create a table with a column for each key of your structured JSON, and then insert some data using the JSONEachRow format:

CREATE TABLE events (
  timestamp DateTime64 DEFAULT current_timestamp(),
  payload Tuple (message String, number Float64)
) ENGINE = MergeTree() ORDER BY timestamp;

INSERT INTO events FORMAT JSONEachRow
  {"payload":{"message":"hello","number":1}};

INSERT INTO events FORMAT JSONEachRow
  {"payload":{"message":"world","number":2}};

Now that the JSON data is inserted, you can query it. ClickHouse supports querying the table columns using dot notation. The syntax is very similar to the SQL/JSON simplified accessor (T860-T864) syntax standardized in SQL 2023:

SELECT payload.message, payload.number FROM events;
-- hello	1
-- world	2

Drawbacks

The structured approach works when your JSON follows a known, fixed schema; however, you have to manually manage the JSON key to ClickHouse column mapping. For example, as new JSON keys are introduced that you want to query on, you will need to issue ALTER TABLE statements to add or modify columns. Until you do so, new JSON keys will be ignored.

Additionally, the structured approach does not directly support storing JSON keys with heterogenous values. For example, consider the following two JSON objects:

{"payload":42}
{"payload":"forty-two"}

The “payload” key is either a number or a string, but normal ClickHouse columns cannot represent a union like this. In order to represent this, you must either (1) cast the “payload” key to a ClickHouse String, or (2) use one of the unstructured approaches described below.

Unstructured JSON

There are varying degrees of “unstructuredness” when it comes to unstructured JSON, and certain approaches to storing unstructured JSON in ClickHouse will perform better or worse depending on just how “unstructured” the JSON is. At one end of the spectrum, unstructured JSON may look structured in the sense that it largely follows a predictable schema, with slight variation over time. At the other end of the spectrum, unstructured JSON may contain completely arbitrary keys and values (perhaps hundreds if not thousands of keys). Finally, unstructured JSON may also contain heterogeneous values for each of its keys.

If any of this describes the type of JSON you want to store in ClickHouse and you cannot use a structured approach, then you should consider an unstructured approach.

Approach 1: JSON as string

It is possible to store serialized JSON in a ClickHouse String column and to query it using JSONExtract and related functions. This is a naive approach that, despite its drawbacks, supports storing arbitrary JSON objects, no matter how many keys or what values they may contain. For small tables where query performance is not a concern or for tables where the JSON data itself is rarely filtered on, this can be a suitable approach.

Example

Create a table with a String column to store the serialized JSON, and then insert some data using the JSONEachRow format:

CREATE TABLE events (
  timestamp DateTime64 DEFAULT current_timestamp(),
  payload String
) ENGINE = MergeTree() ORDER BY timestamp;

INSERT INTO events FORMAT JSONEachRow
  {"payload":{"message":"hello","number":1}};

INSERT INTO events FORMAT JSONEachRow
  {"payload":{"message":"world","number":2}};

Now that the JSON data is inserted, you can query it. Notice that dot notation no longer works; instead, you must query the JSON using JSONExtract and related functions:

SELECT
  JSONExtract(payload, 'message', 'String'),
  JSONExtract(payload, 'number', 'Float64')
FROM events;
-- hello	1
-- world	2

Drawbacks

The main drawback of this approach is that it defers all the work of parsing the JSON until the moment the JSON is queried. It also loses out on the benefits that a columnar database like ClickHouse offers: namely, skipping over keys that are irrelevant to the query. These facts result in queries executing more slowly using this approach.

Additionally, details about how the JSON is stored (serialized as a string) leak into the query layer, complicating the SQL that clients have to write. Clients cannot use dot notation for querying the JSON; instead, they must remember to call JSONExtract and related functions.

Approach 2: JSON data type

Another approach for storing JSON in ClickHouse is to use ClickHouse’s experimental JSON data type. First introduced in ClickHouse version 22.6, the experimental JSON data type achieves performance on par with the structured approach while still supporting (relatively) unstructured JSON. It can achieve this because the JSON data type automatically creates ”subcolumns” for every JSON key that it stores.

Example

As of ClickHouse version 23.11, the JSON data type remains experimental, so you must enable it by setting <span class="code-exp">allow_experimental_object_type = 1</span>. Once enabled, create a table containing a JSON data type column like this, and then insert some data:

SET allow_experimental_object_type = 1;

CREATE TABLE events (
  timestamp DateTime64 DEFAULT current_timestamp(),
  payload JSON
) ENGINE = MergeTree() ORDER BY timestamp;

INSERT INTO events FORMAT JSONEachRow
  {"payload":{"message":"hello","number":1}};

INSERT INTO events FORMAT JSONEachRow
  {"payload":{"message":"world","number":2}};

Now that the JSON data is inserted, you can query it. ClickHouse supports querying the JSON data type using dot notation. The syntax is very similar to the SQL/JSON simplified accessor (T860-T864) syntax standardized in SQL 2023:

SELECT payload.message, payload.number FROM events;
-- hello	1
-- world	2

Drawbacks

The JSON data type takes full advantage of ClickHouse’s columnar storage model: individual JSON keys are stored in their own subcolumns, meaning that subcolumns for unnecessary keys are skipped over at query-time and do not affect query performance. However, this can also lead to problems when storing arbitrary JSON objects. For example, take a look at the files that ClickHouse stored on disk for the example above:

$ ls data/_local/events/all_1_1_0/payload*.bin
data/_local/events/all_1_1_0/payload%2Emessage.bin
data/_local/events/all_1_1_0/payload%2Enumber.bin

Notice that each JSON key corresponds to a file on disk. You may then be wondering (1) what happens if a JSON key is very long? and (2) what happens if a JSON object contains many distinct keys? Indeed, both of these can lead to issues:

  1. If a JSON key is very long, then ClickHouse may fail to create the JSON key’s file on disk. Fortunately, there is a workaround. By setting <span class="code-exp">replace_long_file_name_to_hash = 1</span>, ClickHouse will hash any JSON key’s file name if it’s longer than <span class="code-exp">max_file_name_length</span>, and this will avoid the issue.
  2. If a JSON object contains many distinct keys, then this will lead to many distinct files on disk. If there are too many files on disk, ClickHouse can become overburdened by merging parts, and stop responding. There is a feature request to address this by limiting the number of subcolumns, but this feature has not yet been implemented.

Until the second issue is resolved, you may not want to use the JSON data type to store arbitrary JSON objects. It will be safer to store only JSON objects with few distinct keys.

Approach 3: JSON as pairwise arrays

As a final approach, it is possible to store JSON in ClickHouse using pairwise arrays. This unique approach was written about in a blog post by Uber. The general idea is this: for each scalar JSON type (string, boolean, and number), create two columns in ClickHouse. The first column is an array of JSON keys, and the second column is an array of JSON values. To look up the value for a particular JSON key, find its index $i$ in the array of JSON keys; then, look up the corresponding value at index $i$ in the array of JSON values.

It is more complicated to both store and query the JSON this way; however, unlike the JSON data type, this approach scales to handle an arbitrary number of JSON keys without running into issues with the number of files on disk. Additionally, performance is better than the “JSON as string” approach, because the JSON is parsed ahead of time.

Example

Create a table containing two columns for each scalar JSON type (string, boolean, and number). The first column will be an array of JSON keys, and the second column will be an array of JSON values:

CREATE TABLE events (
  timestamp DateTime64 DEFAULT current_timestamp(),

  `payload.string.names`  Array(String),
  `payload.string.values` Array(String),

  `payload.number.names`  Array(String),
  `payload.number.values` Array(Float64),

  `payload.bool.names`    Array(String),
  `payload.bool.values`   Array(Boolean)

) ENGINE = MergeTree() ORDER BY timestamp;

Next, insert the data. Note that a normal INSERT statement using the JSONEachRow format no longer works; instead, you will need to pre-process the JSON to extract its keys before inserting:

-- {"payload":{"message":"hello","number":1}}
INSERT INTO events (
  payload.string.names,
  payload.string.values,
  payload.number.names,
  payload.number.values
) VALUES (['message'], ['hello'], ['number'], [1]);

-- {"payload":{"message":"world","number":2}}
INSERT INTO events (
  payload.string.names,
  payload.string.values,
  payload.number.names,
  payload.number.values
) VALUES (['message'], ['world'], ['number'], [2]);

Now that the JSON data is inserted, you can query it. Notice that dot notation no longer works; instead, you must query the JSON using array functions:

SELECT
  payload.string.values[indexOf(payload.string.names, 'message')],
  payload.number.values[indexOf(payload.number.names, 'number')]
FROM events;
-- hello	1
-- world	2

Drawbacks

As mentioned above, it is more complicated to both store and query the JSON this way. When storing the JSON, you can no longer use simple INSERT statements; instead, you need to pre-process the JSON to extract its keys before inserting.

Additionally, details about how the JSON is stored (as pairwise arrays) leak into the query layer, complicating the SQL that clients have to write. Clients cannot use dot notation for querying the JSON; instead, they must use indexOf and array accesses.

Conclusion

Storing structured JSON in ClickHouse is straightforward and gives excellent performance; however, when it comes to unstructured JSON, the available approaches each come with their own benefits and drawbacks. The suitability of any particular approach depends on just how “unstructured” the JSON you need to store is. We’ve put together a simple table you can use in evaluating each approach:

Structured approach JSON as string JSON data type JSON as pairwise arrays
Supports unstructured JSON No Yes Yes (few keys) Yes
Supports many distinct keys No Yes No Yes
Supports dot notation Yes No Yes No
Query performance Great Poor Great OK

Looking forward, we expect that ClickHouse’s experimental JSON data type will stabilize and eventually be able to store JSON objects with arbitrarily many keys. We hope to be able to use it in the future. Until then, the “JSON as string” and “JSON as pairwise arrays” approaches are more suitable for storing JSON with arbitrary keys, although neither approach supports dot notation.

By the way, if you’ve read this far, did you know that Propel is a serverless alternative to ClickHouse that lets product teams ship real-time data apps in days instead of weeks? We offer a GraphQL API and React components for querying and visualizing your data. We’re built on top of ClickHouse with full support for querying JSON. You can post JSON data directly to us, or query JSON data from supported data sources. Try us out today and get $50 in free credits, no credit card required!

Related posts

Start shipping today

Deliver the analytics your customers have been asking for.