If you prefer watching to reading, check out the recording of the live stream where we discussed these optimizations.
For context, Propel is an Analytics Platform as a Service that connects to Snowflake to serve data via low-latency APIs to power customer-facing applications.
At Propel, we build everything using an event-driven architecture. This means that different services emit events whenever there is a state change. For example, when a new Data Pool is created (Data Pools are Propel’s high-speed storage and cache), the application emits a <span class="code-exp">com.propel.dataPool.created</span> event, and every query emits a <span class="code-exp">com.propel.query.created</span> event.
One of the main use cases for Propel’s events is to power the analytics for our own customer-facing Console.
In this post, we will discuss our architecture and how we set up Snowflake and Propel for context. We will then dive into the hypotheses and experiments we ran to optimize our Snowflake costs. We will share what worked and what didn't, as well as provide code examples of how we implemented the optimizations that resulted in a 20x reduction in our Snowflake costs.
Our event-driven system is built on AWS EventBridge, which serves as the central event bus. Different services consume these events to execute various types of business logic (which is beyond the scope of this post). The focus of this post is on how our events enable Propel's customer-facing analytics.
To use these events to power our analytics, we have an EventBridge rule that sends the events to a Kinesis Firehose that then lands those events in files in an Amazon S3 bucket.
This is an example payload for one of our events:
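The original payload isn't reproduced here, but an EventBridge envelope carrying a <span class="code-exp">com.propel.dataPool.created</span> event might look like this (all field names under <span class="code-exp">detail</span> and all values are illustrative, not Propel's actual schema):

```json
{
  "version": "0",
  "id": "0b1f6d12-93f8-4c5e-9a2d-7e1c2a3b4c5d",
  "detail-type": "com.propel.dataPool.created",
  "source": "com.propel",
  "account": "123456789012",
  "time": "2023-06-01T12:34:56Z",
  "region": "us-east-1",
  "resources": [],
  "detail": {
    "eventId": "evt_example_01",
    "accountId": "acc_example_01",
    "dataPool": {
      "id": "dp_example_01",
      "name": "events"
    }
  }
}
```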
Propel’s Snowflake setup
We organize our Snowflake account into three databases: Raw, where the raw data is stored; Staging, where intermediate tables not intended for internal or application consumption are stored; and Analytics, where our ready-to-use data resides.
Once the events are in S3, we use Snowpipe to ingest those events into Snowflake. Originally, we landed those events into a table with a single column that had the JSON payload.
From the raw events table, we then use dbt to transform those events with incremental models:
- The first model deduplicates and flattens the common fields.
- Then, we have object-specific models (Queries, Accounts, Data Pools, etc.) that unpack fields from the JSON and join with other tables to enrich the data.
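As a sketch, an object-specific model in this second layer might look like the following (the model, column, and ref names are hypothetical, used only to illustrate the unpack-and-join pattern):

```sql
-- Hypothetical object-specific model, e.g. models/analytics/queries.sql:
-- unpacks query fields from the JSON and enriches with account data
SELECT
    e.PROPEL_EVENT_ID,
    e.AWS_EVENT_CREATED_AT                       AS QUERY_CREATED_AT,
    e.RAW_EVENT:detail:query:dataPoolId::VARCHAR AS DATA_POOL_ID,
    a.ACCOUNT_NAME
FROM {{ ref('events_flattened') }} AS e
LEFT JOIN {{ ref('accounts') }} AS a
    ON e.RAW_EVENT:detail:accountId::VARCHAR = a.ACCOUNT_ID
WHERE e.EVENT_TYPE = 'com.propel.query.created'
```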
The problem: runaway Snowflake credit consumption
Snowflake credit consumption was growing like crazy. Granted, Propel was processing more and more data, doubling the volume every couple of weeks, but even accounting for that growth, the trend was not going in a good direction.
We needed to reverse this trend. The chart below shows our Snowflake credit consumption (y-axis) over time.
Hypothesis #1: Snowflake is not querying and ordering JSON efficiently (wrong)
We focused on the parts we knew were consuming the most credits. This was at the beginning of the pipeline.
As mentioned earlier, we were landing the JSON events into a single-column table. This is the Snowpipe SQL we used to create the initial ingestion pipeline.
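The exact DDL isn't shown here, but a single-column Snowpipe setup of this shape would be (table, pipe, and stage names are illustrative):

```sql
-- Landing table: one VARIANT column holding the whole JSON event
CREATE TABLE RAW.EVENTS.RAW_EVENT (
  RAW_EVENT VARIANT
);

-- Snowpipe that auto-ingests new S3 files into the single-column table
CREATE PIPE RAW.EVENTS.RAW_EVENT_PIPE
  AUTO_INGEST = TRUE
AS
  COPY INTO RAW.EVENTS.RAW_EVENT
  FROM @RAW.EVENTS.S3_EVENTS_STAGE
  FILE_FORMAT = (TYPE = 'JSON');
```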
This meant that Snowflake had to query nested data within the JSON. Initially, we thought Snowflake might be inefficient in querying the nested JSON, especially when it came to querying a date within a nested field and then ordering the entire table based on that date.
So, we decided to run our first experiment.
Experiment #1: Transform in ingest
To test our hypothesis, our initial experiment involved unpacking the JSON during data ingestion. We made changes to our Snowpipe SQL to transform the data during ingestion. This allowed us to determine if credits were consumed due to unpacking and ordering of the table or if there was another factor at play.
This is the SQL query we use to transform the JSON event during ingestion:
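We haven't reproduced the exact statement, but a pipe that unpacks the envelope fields at ingest time might look like this (table, stage, and JSON paths are illustrative):

```sql
-- Wider landing table: common envelope fields unpacked at ingest time
CREATE TABLE RAW.EVENTS.EVENTS (
  PROPEL_EVENT_ID      VARCHAR,
  EVENT_TYPE           VARCHAR,
  AWS_EVENT_CREATED_AT TIMESTAMP_NTZ,
  RAW_EVENT            VARIANT
);

CREATE OR REPLACE PIPE RAW.EVENTS.EVENTS_PIPE
  AUTO_INGEST = TRUE
AS
  COPY INTO RAW.EVENTS.EVENTS
  FROM (
    SELECT
      $1:detail:eventId::VARCHAR,  -- unique event ID (illustrative path)
      $1:"detail-type"::VARCHAR,   -- e.g. com.propel.query.created
      $1:time::TIMESTAMP_NTZ,      -- EventBridge envelope timestamp
      $1                           -- keep the full JSON payload as well
    FROM @RAW.EVENTS.S3_EVENTS_STAGE
  )
  FILE_FORMAT = (TYPE = 'JSON');
```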
With this change, as soon as the data landed in Snowflake, it was stored in a table with the <span class="code-exp">AWS_EVENT_CREATED_AT</span> column. We utilized this column to drive our dbt incremental model. This eliminated the need for Snowflake to query and order the data based on a nested column in the JSON.
We thought this would do it and that our consumption would go down.
WRONG: It turns out that Snowflake is incredibly efficient at querying JSON, even when it has to order the entire table based on a nested JSON property. This hypothesis was proved wrong, so we had to move on to our second hypothesis.
Hypothesis #2: Our dbt incremental model is doing something inefficient
If the cost wasn't coming from querying the JSON, it had to be coming from how our dbt incremental model was running.
This is what our dbt model, which performed the initial flattening and unpacking of the JSON, looked like:
This was the model configuration
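Neither the model nor its configuration is reproduced exactly here; a sketch of the shape described, with the deduplication running over the whole table and the incremental filter at the very end, would be (all names illustrative):

```sql
{{ config(
    materialized = 'incremental',
    unique_key = 'PROPEL_EVENT_ID'
) }}

WITH events_flattened AS (

    SELECT
        RAW_EVENT:detail:eventId::VARCHAR AS PROPEL_EVENT_ID,
        RAW_EVENT:"detail-type"::VARCHAR  AS EVENT_TYPE,
        RAW_EVENT:time::TIMESTAMP_NTZ     AS AWS_EVENT_CREATED_AT,
        RAW_EVENT
    FROM {{ source('raw', 'RAW_EVENT') }}

),

events_flattened_deduped AS (

    -- Deduplicates across the ENTIRE raw table on every run
    SELECT
        PROPEL_EVENT_ID,
        ANY_VALUE(EVENT_TYPE)     AS EVENT_TYPE,
        MAX(AWS_EVENT_CREATED_AT) AS AWS_EVENT_CREATED_AT,
        ANY_VALUE(RAW_EVENT)      AS RAW_EVENT
    FROM events_flattened
    GROUP BY PROPEL_EVENT_ID

)

SELECT * FROM events_flattened_deduped

{% if is_incremental() %}
-- Incremental filter applied only at the very end,
-- after the expensive GROUP BY has already run
WHERE AWS_EVENT_CREATED_AT > (SELECT MAX(AWS_EVENT_CREATED_AT) FROM {{ this }})
{% endif %}
```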
We knew that performing the <span class="code-exp">GROUP BY PROPEL_EVENT_ID</span> operation to deduplicate records was costly. Therefore, we began exploring alternatives to eliminate or minimize the need for it.
Experiment #2: Remove the GROUP BY PROPEL_EVENT_ID
We attempted to remove the <span class="code-exp">GROUP BY PROPEL_EVENT_ID</span> entirely, but as expected, we encountered a significant number of duplicates. Since duplicates are not acceptable, this experiment was unsuccessful. However, we observed that the model ran faster and consumed fewer credits, which raised the question of whether there is a more efficient method for deduplication.
Experiment #3: Use the dbt merge incremental strategy without the GROUP BY PROPEL_EVENT_ID
We started digging into the dbt incremental model. dbt offers different incremental strategies:
- Append - inserts the selected records into the destination table, without checking for duplicates.
- Merge - checks whether the unique key already exists in the destination table. If it does, it updates the record, preventing duplicates; if it doesn't, it inserts the record.
- Delete + insert - is similar to merge, but instead of updating records that match the unique key, it deletes them and then inserts the new batch.
- Insert overwrite - deletes the selected partitions from the destination table and inserts the transformed partitions in their place.
We weren’t explicitly telling it to use merge, so we added that.
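Concretely, that meant adding the strategy to the model configuration, along the lines of:

```sql
{{ config(
    materialized = 'incremental',
    unique_key = 'PROPEL_EVENT_ID',
    incremental_strategy = 'merge'
) }}
```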
It turns out that on Snowflake, dbt uses the merge strategy by default. That change didn't have much impact, but it helped us learn how the merge strategy works. The merge incremental strategy replaces records with the same unique ID. This seemed like exactly what we needed, but we were still seeing duplicates 🤔.
Although we did not expect this experiment to reduce consumption, we did expect it to eliminate the duplicates. Upon further investigation into the merge strategy, we discovered that it merges the new data with the existing data, but it does not eliminate duplicates within the batch of new data itself. In event-driven architectures like Propel's, duplicates occur in close succession: they happen when the application retries after a network blip or a similar issue. Because duplicates are close in time, the merge was only deduplicating events across different dbt runs, not within the same run.
This led us to our next experiment.
Experiment #4: Deduplicate only the new data
With the original model we had, we were doing a group by on the <span class="code-exp">RAW_EVENT</span> table. This is a big table. It has hundreds of millions of records and is growing very fast.
Recall that our original model applied the <span class="code-exp">is_incremental()</span> filter at the very end. This meant the <span class="code-exp">events_flattened_deduped</span> common table expression queried the entire events table and performed the <span class="code-exp">GROUP BY PROPEL_EVENT_ID</span>, a very expensive operation. Now that we knew the merge strategy would handle merging the new data with the existing data based on our unique ID, all we needed to do was deduplicate the incoming new data.
To achieve this, we moved the <span class="code-exp">is_incremental()</span> macro inside the common table expression. This ensures that the expensive <span class="code-exp">GROUP BY PROPEL_EVENT_ID</span> operation is performed only on the new data being added, which is significantly smaller, rather than on the entire table.
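A sketch of this restructuring, under the same illustrative names as before (the filter now runs before the deduplication, so the <span class="code-exp">GROUP BY</span> only sees the incoming batch):

```sql
{{ config(
    materialized = 'incremental',
    unique_key = 'PROPEL_EVENT_ID',
    incremental_strategy = 'merge'
) }}

WITH events_flattened AS (

    SELECT
        RAW_EVENT:detail:eventId::VARCHAR AS PROPEL_EVENT_ID,
        RAW_EVENT:"detail-type"::VARCHAR  AS EVENT_TYPE,
        RAW_EVENT:time::TIMESTAMP_NTZ     AS AWS_EVENT_CREATED_AT,
        RAW_EVENT
    FROM {{ source('raw', 'RAW_EVENT') }}

    {% if is_incremental() %}
    -- Filter to new rows FIRST, so the GROUP BY below only
    -- touches the incoming batch, not the whole table
    WHERE AWS_EVENT_CREATED_AT > (SELECT MAX(AWS_EVENT_CREATED_AT) FROM {{ this }})
    {% endif %}

),

events_flattened_deduped AS (

    SELECT
        PROPEL_EVENT_ID,
        ANY_VALUE(EVENT_TYPE)     AS EVENT_TYPE,
        MAX(AWS_EVENT_CREATED_AT) AS AWS_EVENT_CREATED_AT,
        ANY_VALUE(RAW_EVENT)      AS RAW_EVENT
    FROM events_flattened
    GROUP BY PROPEL_EVENT_ID

)

SELECT * FROM events_flattened_deduped
```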
🎉 This worked! 🎉
This experiment was a success. We managed to significantly reduce the credit consumption and still remove all the duplicates reliably.
Lastly, we asked ourselves, can we optimize even further?
Experiment #5: Limit the dbt merge by time
It turns out that you can use a little-known feature in dbt incremental models called incremental predicates. Per the dbt documentation:
<span class="code-exp-block">incremental_predicates</span> is an advanced use of incremental models, where data volume is large enough to justify additional investments in performance. This config accepts a list of any valid SQL expression(s). dbt does not check the syntax of the SQL statements.
As you can read from the official docs, it is not super clear what it does. The syntax is also a bit strange:
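Inside the model config, a predicate limiting the merge to a recent time window looks roughly like this (the column name is ours; <span class="code-exp">DBT_INTERNAL_DEST</span> comes from dbt):

```sql
incremental_predicates = [
    "DBT_INTERNAL_DEST.AWS_EVENT_CREATED_AT > DATEADD('hour', -3, CURRENT_TIMESTAMP())"
]
```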
It uses <span class="code-exp">DBT_INTERNAL_DEST</span>, an internal dbt alias for the target table. The predicate limits the merge to the records created in the last 3 hours (per the example above) instead of scanning the entire destination table.
As mentioned earlier, in event-driven architectures, duplicates occur close in time. We already had a mechanism in place to deduplicate records within the same dbt run, and if duplicated records fell on different runs, the merge would handle them. For Propel, a 3-hour window proved sufficient to handle duplicates. In some cases, however, longer merge windows may be necessary: if you have events that arrive very late, days after the initial occurrence, it is advisable to keep the merge window open for a longer duration.
This is what our model configuration looked like with the incremental predicate.
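A sketch of the full configuration, with illustrative column and key names:

```sql
{{ config(
    materialized = 'incremental',
    unique_key = 'PROPEL_EVENT_ID',
    incremental_strategy = 'merge',
    incremental_predicates = [
        "DBT_INTERNAL_DEST.AWS_EVENT_CREATED_AT > DATEADD('hour', -3, CURRENT_TIMESTAMP())"
    ]
) }}
```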
This experiment worked and further reduced our credit consumption without compromising data quality.
In the chart below, you can see the significant impact of the optimizations implemented. These optimizations have had a profound effect on overall credit consumption, data quality, and freshness. Since our pipelines were now more efficient, we could run them more often and get fresher data.
Check out the recording of the live stream where we discussed these optimizations.
Next, we plan to migrate our entire pipeline to Dynamic Tables, running both pipelines in parallel to benchmark performance and cost.
Join us for the live discussion on LinkedIn
How we migrated our data pipelines from dbt to Snowflake’s Dynamic Tables
October 15 at 9:00 PST
If you are interested in diving deeper into Snowflake and Propel capabilities, here are some blog posts you might find useful:
- How to reduce Snowflake costs: A five-point checklist
- How to build a Snowflake API
- How to build in-product analytics with Snowflake and GraphQL
If you don’t have a Propel account yet, you can try Propel for free and start building customer-facing analytics on top of Snowflake with a low-latency data API.