Updating data
Propel’s Serverless ClickHouse supports real-time updates and batch updates for updating data. Read on to learn when and how to use each method.
Real-time updates
Real-time updates allow you to update existing rows in a table, which are immediately reflected when querying.
Use cases
- Update rows in a table and query their most recent state
- Update slowly changing dimensions
- Update entity tables (users, accounts, etc)
How it works
Real-time updates in Propel are supported by ClickHouse’s ReplacingMergeTree table engine. This table engine replaces rows that share the same sorting key, keeping only the most recently inserted row (or the row with the largest “ver” column value).
De-duplication happens at query time and asynchronously, through background merges. For those familiar with ClickHouse, Propel automatically includes the "FINAL" modifier on queries to ReplacingMergeTree tables.
However, it's important to note that there are two implications:
- The row count in ReplacingMergeTree tables may be larger than the logical number of rows. This is because there may be duplicate rows in the table that haven’t been merged yet. These rows will be de-duplicated at query time.
- De-duplicating records at query time can have a performance impact on very large tables. This can be compensated for by using a large Propeller.
Create a Data Pool to handle real-time updates
In this section, we will guide you through the process of creating a Data Pool that can handle real-time updates. We will provide step-by-step instructions on how to create it via the Console, API, and Terraform.
- Console
- API
- Terraform
To create a Data Pool that can handle real-time updates in Console, first go to the “Data Pools” section, then click ”Create Data Pool”.
Select the Data Pool type you need to create.
In the “Table Settings” step, select “Mutable Data”.
Answer the questions on how to uniquely identify records and your query patterns. For more information, read the “Mutable records” section of the How to select a table engine and sorting key guide.
Next, you will get the suggested table settings. Note that the table engine is ReplacingMergeTree, and the sorting key determines the uniqueness of the row.
Click “Next” to finish setting up the Data Pool. Once it is created, you can go to the “Details” tab and verify the settings.
The API request below creates a Data Pool for mutable records. It uses the ReplacingMergeTree table engine, the timestamp, and order_item_id columns to sort and determine uniqueness.
mutation {
createDataPoolV2(input: {
uniqueName: "Updatable TacoSoft Order Items"
description: "Data Pool for updatable TacoSoft Order Items"
timestamp: {
columnName: "timestamp"
}
columns: [
{ columnName: "quantity", type: INT32, isNullable: false },
{ columnName: "taco_name", type: STRING, isNullable: false },
{ columnName: "sauce_name", type: STRING, isNullable: false },
{ columnName: "restaurant_id", type: STRING, isNullable: false },
{ columnName: "restaurant_name", type: STRING, isNullable: false },
{ columnName: "taco_total_price", type: FLOAT, isNullable: false },
{ columnName: "order_item_id", type: STRING, isNullable: false },
{ columnName: "tortilla_id", type: STRING, isNullable: false },
{ columnName: "toppings", type: JSON, isNullable: false },
{ columnName: "sauce_id", type: STRING, isNullable: false },
{ columnName: "taco_unit_price", type: FLOAT, isNullable: false },
{ columnName: "order_id", type: STRING, isNullable: false },
{ columnName: "order_item_generated_at", type: TIMESTAMP, isNullable: false },
{ columnName: "taco_id", type: STRING, isNullable: false },
{ columnName: "timestamp", type: TIMESTAMP, isNullable: false },
{ columnName: "tortilla_name", type: STRING, isNullable: false }
],
tableSettings: {
engine: {
replacingMergeTree: {
type: REPLACING_MERGE_TREE
}
}
partitionBy: ["toYYYYMM(timestamp)"]
orderBy: ["timestamp", "order_item_id"]
}
}) {
dataPool {
id
uniqueName
description
tableSettings {
orderBy
partitionBy
}
}
}
}
The following Terraform creates a Data Pool for mutable records. It uses the ReplacingMergeTree table engine, the timestamp, and order_item_id columns to sort and determine uniqueness.
resource "propel_data_pool" "updatable_tacosoft_order_items" {
unique_name = "Updatable TacoSoft Order Items"
description = "Data Pool for updatable TacoSoft Order Items"
timestamp = "timestamp"
table_settings {
engine {
type = "REPLACING_MERGE_TREE"
}
partition_by = ["toYYYYMM(timestamp)"]
order_by = ["timestamp", "order_item_id"]
}
column {
name = "quantity"
type = "INT32"
nullable = false
}
column {
name = "taco_name"
type = "STRING"
nullable = false
}
column {
name = "sauce_name"
type = "STRING"
nullable = false
}
column {
name = "restaurant_id"
type = "STRING"
nullable = false
}
column {
name = "restaurant_name"
type = "STRING"
nullable = false
}
column {
name = "taco_total_price"
type = "FLOAT"
nullable = false
}
column {
name = "order_item_id"
type = "STRING"
nullable = false
}
column {
name = "tortilla_id"
type = "STRING"
nullable = false
}
column {
name = "toppings"
type = "JSON"
nullable = false
}
column {
name = "sauce_id"
type = "STRING"
nullable = false
}
column {
name = "taco_unit_price"
type = "FLOAT"
nullable = false
}
column {
name = "order_id"
type = "STRING"
nullable = false
}
column {
name = "order_item_generated_at"
type = "TIMESTAMP"
nullable = false
}
column {
name = "taco_id"
type = "STRING"
nullable = false
}
column {
name = "timestamp"
type = "TIMESTAMP"
nullable = false
}
column {
name = "tortilla_name"
type = "STRING"
nullable = false
}
}
Create a Materialized View to handle real-time updates
In this section, we will guide you through the process of creating a Materialized View that can handle real-time updates. We will provide step-by-step instructions on how to create it via the Console, API, and Terraform.
- Console
- API
- Terraform
To create a Materialized View that can handle real-time updates in the Console, first go to the “**Materialized View**” section and click ”**Create new Materialized View**.”
Then, enter the query that defines the Materialized View. For this example, we are going to duplicate rows from the “TacoSoft Demo Data” table, so we just need to select all records.
Select “New Data Pool”.
Give your destination Data Pool a name and description.
Select “Mutable records”, then click “Continue”.
Answer the questions on how to uniquely identify records and your query patterns. For more information, read the “Mutable records” section of the How to select a table engine and sorting key guide.
Next, you will get the suggested table settings. Note that the table engine is ReplacingMergeTree, and the sorting key determines the uniqueness of the row.
Click “Continue” to finish setting up the Materialized View.
Once you complete setting up the Materialized View, you will have a Data Pool with the de-duplicated records.
mutation {
createMaterializedView(input: {
uniqueName: "TacoSoft Deduplication Materialized View",
description: "Deduplicates TacoSoft records",
sql: """
SELECT
*
FROM "TacoSoft Demo Data"
""",
destination: {
newDataPool: {
timestamp: {
columnName: "timestamp"
},
uniqueName: "TacoSoft unique order items",
accessControlEnabled: true
tableSettings: {
engine: {
replacingMergeTree: {
type: REPLACING_MERGE_TREE
}
}
orderBy: ["timestamp", "order_item_id"]
}
}
},
backfillOptions: {
backfill: true
}
}) {
materializedView {
id
sql
uniqueName
}
}
}
resource "propel_materialized_view" "tacosoft_deduplication_materialized_view" {
unique_name = "TacoSoft Deduplication Materialized View"
description = "Deduplicates TacoSoft records"
sql = <<-SQL
SELECT
*
FROM "TacoSoft Demo Data"
SQL
new_data_pool {
unique_name = "TacoSoft unique order items"
timestamp = "timestamp"
access_control_enabled = true
table_settings {
engine {
type = "REPLACING_MERGE_TREE"
}
order_by = ["timestamp", "order_item_id"]
}
}
backfill = true
}
Batch updates
Batch updates are useful for one-off or automated jobs where you need to update many rows in a table.
Use cases
- Backfill columns
- Update column(s) for all table rows
- Update column(s) for table rows matching a filter
- Redact columns on specific rows
How it works
Batch updates in Propel are supported by table mutations in ClickHouse. When you perform a batch update, Propel issues an ALTER TABLE … UPDATE
statement to ClickHouse and monitors its progress. The mutation proceeds part-by-part, partition-by-partition, updating rows in the table.
Propel provides a simple way to update data asynchronously using the Console or the createUpdateDataPoolRecordsJob
API. An update data operation on a Data Pool updates the data matching the filters provided.
Create a batch update job
In this section, we guide you through the process of creating a batch update job via the Console and API.
There are two parts to the update job.
First, the list of filters that will be used for updating records. Records matching these filters will be updated.
Second, the columns and values to be updated. The value can be:
- Other column names. For example,
sauce_name
. - A string in single quotes. For example,
'al pastor'
. - A number. For example:
10
. - A JSON property. For example:
order.id
. - An expression. For example:
total_price * quantity or CONCAT(taco_name, '🌮')
.
'al pastor'
.- Console
- API
You can initiate an update job in the Console by navigating to the Data Pool from which you need to update data, clicking on the "Operations" tab, and then clicking “Update data.”
Here, you can specify the filters of the data to update and set the values to update.
Lastly, you can see the progress of the updates job and when it is completed.
Here's an example of how to update data using the API (read the docs for more details):
mutation {
createUpdateDataPoolRecordsJob (
input: {
dataPool: "DPO00000000000000000000000000"
filterSql: "restaurant_name='Farolito' AND taco_name='Veggie'"
set: [
{ "column": "taco_name", "expression": "'Vegetarian'"}
]
}
) {
id
}
}
Notes on updating non-nullable columns:
- Suppose we have a non-nullable column A and a nullable column B. We execute an update setting
A = B + 1
. The operation yields null if the job encounters a record where B is null. This will result in an error when attempting to assign it to A, since A cannot be null. Consequently, the job fails, and the remaining records remain unchanged, while the records processed before encountering the null value are updated. - If the column being updated has a different data type than that of the update expression, the result will be null. This could cause a similar error as the previous example if the column is non-nullable.
Conclusion
Managing updates in Propel's Serverless ClickHouse requires an understanding of the unique way it handles updates. Real-time updates are achieved through the use of the ReplacingMergeTree table engine, but there are implications to consider such as larger record counts and performance impacts on read operations.
Batch updates, on the other hand, are more suited for maintaining data integrity during schema changes and can be accomplished via the Console or the createUpdateDataPoolRecordsJob
API.