If you prefer watching to reading, check out the recording of the live stream where we discussed these optimizations.
For context, Propel is an Analytics Platform as a Service that connects to Snowflake to serve data via low-latency APIs to power customer-facing applications.
At Propel, we build everything using an event-driven architecture. This means that different services emit events whenever there is a state change. For example, when a new Data Pool is created (Data Pools are Propel’s high-speed storage and cache), the application emits a <span class="code-exp">com.propel.dataPool.created</span> event, and every query emits a <span class="code-exp">com.propel.query.created</span> event.
One of the main use cases for Propel’s events is to power the analytics for our own customer-facing Console.
In this post, we will discuss our architecture and how we set up Snowflake and Propel for context. We will then dive into the hypotheses and experiments we ran to optimize our Snowflake costs. We will share what worked and what didn't, as well as provide code examples of how we implemented the optimizations that resulted in a 20x reduction in our Snowflake costs.
Our event-driven system is built on AWS EventBridge, which serves as the central event bus. Different services consume these events to execute various types of business logic (which is beyond the scope of this post). The focus of this post is on how our events enable Propel's customer-facing analytics.
To use these events to power our analytics, we have an EventBridge rule that sends the events to a Kinesis Firehose that then lands those events in files in an Amazon S3 bucket.
This is an example payload for one of our events:
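The original payload isn't reproduced here, but an EventBridge envelope carrying a <span class="code-exp">com.propel.dataPool.created</span> event might look like this (all field names under <span class="code-exp">detail</span> and all values are illustrative, not Propel's actual schema):

```json
{
  "version": "0",
  "id": "0b1f6d12-93f8-4c5e-9a2d-7e1c2a3b4c5d",
  "detail-type": "com.propel.dataPool.created",
  "source": "com.propel",
  "account": "123456789012",
  "time": "2023-06-01T12:34:56Z",
  "region": "us-east-1",
  "resources": [],
  "detail": {
    "eventId": "evt_example_01",
    "accountId": "acc_example_01",
    "dataPool": {
      "id": "dp_example_01",
      "name": "events"
    }
  }
}
```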
Propel’s Snowflake setup
We organize our Snowflake account into three databases: Raw, where the raw data is stored; Staging, where intermediate tables not intended for internal or application consumption are stored; and Analytics, where our ready-to-use data resides.
Once the events are in S3, we use Snowpipe to ingest those events into Snowflake. Originally, we landed those events into a table with a single column that had the JSON payload.
From the raw events table, we then use dbt to transform those events with incremental models:
- The first model deduplicates and flattens the common fields.
- Then, we have object-specific models (Queries, Accounts, Data Pools, etc.) that unpack fields from the JSON and join with other tables to enrich the data.
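As a sketch, an object-specific model in this second layer might look like the following (the model, column, and ref names are hypothetical, used only to illustrate the unpack-and-join pattern):

```sql
-- Hypothetical object-specific model, e.g. models/analytics/queries.sql:
-- unpacks query fields from the JSON and enriches with account data
SELECT
    e.PROPEL_EVENT_ID,
    e.AWS_EVENT_CREATED_AT                       AS QUERY_CREATED_AT,
    e.RAW_EVENT:detail:query:dataPoolId::VARCHAR AS DATA_POOL_ID,
    a.ACCOUNT_NAME
FROM {{ ref('events_flattened') }} AS e
LEFT JOIN {{ ref('accounts') }} AS a
    ON e.RAW_EVENT:detail:accountId::VARCHAR = a.ACCOUNT_ID
WHERE e.EVENT_TYPE = 'com.propel.query.created'
```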
The problem: runaway Snowflake credit consumption
Snowflake credit consumption was growing like crazy. Granted, Propel was processing more and more data, doubling the volume every couple of weeks, but even accounting for that growth, the trend was not going in a good direction.
We needed to reverse this trend. The chart below shows our Snowflake credit consumption (y-axis) over time.
Hypothesis #1: Snowflake is not querying and ordering JSON efficiently (wrong)
We focused on the parts we knew were consuming the most credits. This was at the beginning of the pipeline.
As mentioned earlier, we were landing the JSON events into a single-column table. This is the Snowpipe SQL we used to create the initial ingestion pipeline.
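The exact DDL isn't shown here, but a single-column Snowpipe setup of this shape would be (table, pipe, and stage names are illustrative):

```sql
-- Landing table: one VARIANT column holding the whole JSON event
CREATE TABLE RAW.EVENTS.RAW_EVENT (
  RAW_EVENT VARIANT
);

-- Snowpipe that auto-ingests new S3 files into the single-column table
CREATE PIPE RAW.EVENTS.RAW_EVENT_PIPE
  AUTO_INGEST = TRUE
AS
  COPY INTO RAW.EVENTS.RAW_EVENT
  FROM @RAW.EVENTS.S3_EVENTS_STAGE
  FILE_FORMAT = (TYPE = 'JSON');
```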
This meant that Snowflake had to query nested data within the JSON. Initially, we thought Snowflake might be inefficient in querying the nested JSON, especially when it came to querying a date within a nested field and then ordering the entire table based on that date.
So, we decided to run our first experiment.
Experiment #1: Transform in ingest
To test our hypothesis, our initial experiment involved unpacking the JSON during data ingestion. We made changes to our Snowpipe SQL to transform the data during ingestion. This allowed us to determine if credits were consumed due to unpacking and ordering of the table or if there was another factor at play.
This is the SQL query we use to transform the JSON event during ingestion:
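We haven't reproduced the exact statement, but a pipe that unpacks the envelope fields at ingest time might look like this (table, stage, and JSON paths are illustrative):

```sql
-- Wider landing table: common envelope fields unpacked at ingest time
CREATE TABLE RAW.EVENTS.EVENTS (
  PROPEL_EVENT_ID      VARCHAR,
  EVENT_TYPE           VARCHAR,
  AWS_EVENT_CREATED_AT TIMESTAMP_NTZ,
  RAW_EVENT            VARIANT
);

CREATE OR REPLACE PIPE RAW.EVENTS.EVENTS_PIPE
  AUTO_INGEST = TRUE
AS
  COPY INTO RAW.EVENTS.EVENTS
  FROM (
    SELECT
      $1:detail:eventId::VARCHAR,  -- unique event ID (illustrative path)
      $1:"detail-type"::VARCHAR,   -- e.g. com.propel.query.created
      $1:time::TIMESTAMP_NTZ,      -- EventBridge envelope timestamp
      $1                           -- keep the full JSON payload as well
    FROM @RAW.EVENTS.S3_EVENTS_STAGE
  )
  FILE_FORMAT = (TYPE = 'JSON');
```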
With this change, as soon as the data landed in Snowflake, it was stored in a table with the <span class="code-exp">AWS_EVENT_CREATED_AT</span> column. We utilized this column to drive our dbt incremental model. This eliminated the need for Snowflake to query and order the data based on a nested column in the JSON.
We thought this would do it and that our consumption would go down.
WRONG: It turns out that Snowflake is incredibly efficient at querying JSON, even when it has to order the entire table based on a nested JSON property. This hypothesis was proved wrong, so we had to move on to our second hypothesis.
Hypothesis #2: Our dbt incremental model is doing something inefficient
If the cost wasn't coming from querying the JSON, it had to be coming from how our dbt incremental model was running.
This is what our dbt model, which performed the initial flattening and unpacking of the JSON, looked like:
This was the model configuration
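Neither the model nor its configuration is reproduced exactly here; a sketch of the shape described, with the deduplication running over the whole table and the incremental filter at the very end, would be (all names illustrative):

```sql
{{ config(
    materialized = 'incremental',
    unique_key = 'PROPEL_EVENT_ID'
) }}

WITH events_flattened AS (

    SELECT
        RAW_EVENT:detail:eventId::VARCHAR AS PROPEL_EVENT_ID,
        RAW_EVENT:"detail-type"::VARCHAR  AS EVENT_TYPE,
        RAW_EVENT:time::TIMESTAMP_NTZ     AS AWS_EVENT_CREATED_AT,
        RAW_EVENT
    FROM {{ source('raw', 'RAW_EVENT') }}

),

events_flattened_deduped AS (

    -- Deduplicates across the ENTIRE raw table on every run
    SELECT
        PROPEL_EVENT_ID,
        ANY_VALUE(EVENT_TYPE)     AS EVENT_TYPE,
        MAX(AWS_EVENT_CREATED_AT) AS AWS_EVENT_CREATED_AT,
        ANY_VALUE(RAW_EVENT)      AS RAW_EVENT
    FROM events_flattened
    GROUP BY PROPEL_EVENT_ID

)

SELECT * FROM events_flattened_deduped

{% if is_incremental() %}
-- Incremental filter applied only at the very end,
-- after the expensive GROUP BY has already run
WHERE AWS_EVENT_CREATED_AT > (SELECT MAX(AWS_EVENT_CREATED_AT) FROM {{ this }})
{% endif %}
```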
We knew that performing the <span class="code-exp">GROUP BY PROPEL_EVENT_ID</span> operation to deduplicate records was costly. Therefore, we began exploring alternatives to eliminate or minimize the need for it.
Experiment #2: Remove the GROUP BY PROPEL_EVENT_ID
We attempted to remove the <span class="code-exp">GROUP BY PROPEL_EVENT_ID</span> entirely, but as expected, we encountered a significant number of duplicates. Since duplicates are not acceptable, this experiment was unsuccessful. However, we observed that the model ran faster and consumed fewer credits, which raised the question of whether there is a more efficient method for deduplication.
Experiment #3: Use the dbt merge incremental strategy without the GROUP BY PROPEL_EVENT_ID
We started digging into the dbt incremental model. dbt offers different incremental strategies:
- Append - inserts the selected records into the destination table, without checking for duplicates.
- Merge - checks whether the unique key already exists in the destination table. If it does, it updates the record, preventing duplicates; if it doesn't, it inserts the record.
- Delete + insert - is similar to merge, but instead of updating records that match the unique key, it deletes them and then inserts the new batch.
- Insert overwrite - deletes the selected partitions from the destination table and inserts the transformed partitions in their place.
We weren’t explicitly telling it to use merge, so we added that.
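Concretely, that meant adding the strategy to the model configuration, along the lines of:

```sql
{{ config(
    materialized = 'incremental',
    unique_key = 'PROPEL_EVENT_ID',
    incremental_strategy = 'merge'
) }}
```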
It turns out that on Snowflake, dbt uses the merge strategy by default. That change didn't have much impact, but it helped us learn how the merge strategy works. The merge incremental strategy replaces records with the same unique ID. This seemed like exactly what we needed, but we were still seeing duplicates 🤔.
Although we did not expect this experiment to reduce consumption, we did expect it to eliminate the duplicates. Upon further investigation into the merge strategy, we discovered that it merges the new data with the existing data, but it does not eliminate duplicates within the batch of new data itself. In event-driven architectures like Propel's, duplicates occur in close succession: they happen when the application retries after a network blip or a similar issue. Because duplicates are close in time, the merge was only deduplicating events across different dbt runs, not within the same run.
This led us to our next experiment.
Experiment #4: Deduplicate only the new data
With the original model we had, we were doing a group by on the <span class="code-exp">RAW_EVENT</span> table. This is a big table. It has hundreds of millions of records and is growing very fast.
Recall that our original model applied the <span class="code-exp">is_incremental()</span> filter at the very end. This meant the <span class="code-exp">events_flattened_deduped</span> common table expression queried the entire events table and performed the <span class="code-exp">GROUP BY PROPEL_EVENT_ID</span>, a very expensive operation. Now that we knew the merge strategy would handle merging the new data with the existing data based on our unique ID, all we needed to do was deduplicate the incoming new data.
To achieve this, we moved the <span class="code-exp">is_incremental()</span> macro inside the common table expression. This ensures that the expensive <span class="code-exp">GROUP BY PROPEL_EVENT_ID</span> operation is performed only on the new data being added, which is significantly smaller, rather than on the entire table.
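A sketch of this restructuring, under the same illustrative names as before (the filter now runs before the deduplication, so the <span class="code-exp">GROUP BY</span> only sees the incoming batch):

```sql
{{ config(
    materialized = 'incremental',
    unique_key = 'PROPEL_EVENT_ID',
    incremental_strategy = 'merge'
) }}

WITH events_flattened AS (

    SELECT
        RAW_EVENT:detail:eventId::VARCHAR AS PROPEL_EVENT_ID,
        RAW_EVENT:"detail-type"::VARCHAR  AS EVENT_TYPE,
        RAW_EVENT:time::TIMESTAMP_NTZ     AS AWS_EVENT_CREATED_AT,
        RAW_EVENT
    FROM {{ source('raw', 'RAW_EVENT') }}

    {% if is_incremental() %}
    -- Filter to new rows FIRST, so the GROUP BY below only
    -- touches the incoming batch, not the whole table
    WHERE AWS_EVENT_CREATED_AT > (SELECT MAX(AWS_EVENT_CREATED_AT) FROM {{ this }})
    {% endif %}

),

events_flattened_deduped AS (

    SELECT
        PROPEL_EVENT_ID,
        ANY_VALUE(EVENT_TYPE)     AS EVENT_TYPE,
        MAX(AWS_EVENT_CREATED_AT) AS AWS_EVENT_CREATED_AT,
        ANY_VALUE(RAW_EVENT)      AS RAW_EVENT
    FROM events_flattened
    GROUP BY PROPEL_EVENT_ID

)

SELECT * FROM events_flattened_deduped
```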
🎉 This worked! 🎉
This experiment was a success. We managed to significantly reduce the credit consumption and still remove all the duplicates reliably.
Lastly, we asked ourselves, can we optimize even further?
Experiment #5: Limit the dbt merge by time
It turns out that you can use a little-known feature in dbt incremental models called incremental predicates. Per the dbt documentation:
<span class="code-exp-block">incremental_predicates</span> is an advanced use of incremental models, where data volume is large enough to justify additional investments in performance. This config accepts a list of any valid SQL expression(s). dbt does not check the syntax of the SQL statements.
As you can read from the official docs, it is not super clear what it does. The syntax is also a bit strange:
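Inside the model config, a predicate limiting the merge to a recent time window looks roughly like this (the column name is ours; <span class="code-exp">DBT_INTERNAL_DEST</span> comes from dbt):

```sql
incremental_predicates = [
    "DBT_INTERNAL_DEST.AWS_EVENT_CREATED_AT > DATEADD('hour', -3, CURRENT_TIMESTAMP())"
]
```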
It uses <span class="code-exp">DBT_INTERNAL_DEST</span>, an internal dbt alias for the target table. The predicate limits the merge to the records created in the last 3 hours (per the example above) instead of scanning the entire destination table.
As mentioned earlier, in event-driven architectures, duplicates occur close in time. We already had a mechanism in place to deduplicate records within the same dbt run, and if duplicated records fell on different runs, the merge would handle them. For Propel, a 3-hour window proved sufficient to handle duplicates. In some cases, however, longer merge windows may be necessary: if you have events that arrive very late, days after the initial occurrence, it is advisable to keep the merge window open for a longer duration.
This is what our model configuration looked like with the incremental predicate.
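A sketch of the full configuration, with illustrative column and key names:

```sql
{{ config(
    materialized = 'incremental',
    unique_key = 'PROPEL_EVENT_ID',
    incremental_strategy = 'merge',
    incremental_predicates = [
        "DBT_INTERNAL_DEST.AWS_EVENT_CREATED_AT > DATEADD('hour', -3, CURRENT_TIMESTAMP())"
    ]
) }}
```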
This experiment worked and further reduced our credit consumption without compromising data quality.
In the chart below, you can see the significant impact of the optimizations implemented. These optimizations have had a profound effect on overall credit consumption, data quality, and freshness. Since our pipelines were now more efficient, we could run them more often and get fresher data.
Check out the recording of the live stream where we discussed these optimizations.
Next, we plan to migrate our entire pipeline to Dynamic Tables, running both pipelines in parallel to benchmark performance and cost.
Join us for the live discussion on LinkedIn
How we migrated our data pipelines from dbt to Snowflake’s Dynamic Tables
October 15 at 9:00 PST
If you are interested in diving deeper into Snowflake and Propel capabilities, here are some blog posts you might find useful:
- How to reduce Snowflake costs: A five-point checklist
- How to build a Snowflake API
- How to build in-product analytics with Snowflake and GraphQL
If you don’t have a Propel account yet, you can try Propel for free and start building customer-facing analytics on top of Snowflake with a low-latency data API.