An event-driven architecture allows distributed, microservice-based applications to scale seamlessly. It also enables independent development teams working on such applications to collaborate and contribute more cohesively. However, this type of architecture requires additional processing to ensure that the data is correct.
A common problem with large-scale, event-driven applications is duplicate events. For example, an IoT application used in logistics and transport operations generates events based on the status of a truck carrying goods from the warehouse to its destination. Because of network outages and subsequent retries, these events may be duplicated many times for the same trip. Since the application processes each event independently, downstream applications struggle with these duplicates: they need extra logic to recognize them and process only the correct ones. Deduplication can address this challenge.
Deduplication processes are designed to eliminate duplicate events, ensuring that no event is processed twice. In the above IoT example, deduplicating the event data in its destination frees up the application to handle other important tasks and speeds up processing. It also helps ensure the correct reporting data and reduces storage space usage.
One way to implement deduplication is with dbt, a transformation tool for developing data sets from data warehouses for operational and reporting use cases. In the IoT example, that could mean aggregating event data and calculating performance metrics of individual trucks. dbt can build such aggregation logic from your data warehouse. It provides an integrated development environment (IDE), testing support, continuous integration/continuous deployment (CI/CD) practices, and version control, and it supports PostgreSQL, Redshift, BigQuery, Snowflake, and Apache Spark as data sources.
This tutorial will demonstrate how to deduplicate events in Snowflake using dbt.
About Event Duplication and Deduplication
Event-driven applications work with a set of producers and consumers. Producers trigger events, and consumers respond to those events asynchronously. This is an alternative to a monolithic application architecture. However, such an architecture requires a reliable message broker to facilitate communication between producers and consumers.
Generally, message brokers like Apache Kafka or Amazon EventBridge form the foundation of an event-driven architecture. These systems are designed with scalability and throughput in mind and prioritize minimizing the loss of events. Although most of them provide messaging guarantees such as at-least-once or at-most-once delivery, guaranteeing exactly-once delivery is generally expensive at scale—which results in event duplication.
Another reason for duplicate events may be the event source itself. In many workloads, especially in IoT use cases where a device can continuously emit events, duplicate events can frequently appear because of retries following network outages.
That said, having duplicate events is better than losing data, and this is why most systems prefer at-least-once delivery: event duplication at the source is a mechanism to prevent data loss.
Another reason for allowing event duplication is to eliminate the need for acknowledgments, which generate more network traffic. For example, in the IoT use case, if there’s a network outage while sending an event, the system can either keep retrying or resend the event once the backend reports non-receipt. It’s more expensive to implement acknowledgment mechanisms than to handle duplicate events.
Deduplication ensures you have the data you need without processing events multiple times. For example, consider a microservice that handles shipping functionality in an e-commerce application. Such a microservice shouldn’t invoke its functionality more than once for the same purchase event under any conditions.
Deduplication also helps get correct values in aggregated reports and reconciliations in various application use cases. Many applications, particularly those responsible for IT observability, use alerts to notify operators. Deduplicating multiple events from the same source with the same parameters helps avoid triggering multiple alarms.
Although deduplication may seem straightforward, it can be challenging to devise a way to differentiate between duplicate and unique events. To do this, you need to find the combination of attributes of a data object (such as a table) that can uniquely identify each row. The complexity of finding such a combination will vary between use cases.
Implementing Deduplication in Snowflake with dbt
Snowflake is a cloud-based data platform that enterprises commonly use as a data warehouse and as a querying engine for data lakes. dbt comes with native support for Snowflake.
The high-level architecture of this implementation will look like the image below:
To understand how dbt can deduplicate event data in Snowflake, consider another IoT use case. Say a device in one of the server rooms of a data center sends event information about the room temperature to a control system. The data has the following attributes:
- <span class="code-exp">device_id</span>: The unique identifier of the device
- <span class="code-exp">mac_id</span>: The network identifier for the device
- <span class="code-exp">temperature</span>: The numerical value representing the temperature of the room
- <span class="code-exp">timestamp</span>: The date and time when the device sent the data
You’re going to create a table named <span class="code-exp">SENSOR_DATA</span> in a Snowflake database using these fields. In your Snowflake schema, execute the below SQL command:
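A minimal table definition might look like the following; the column types are assumptions based on the attributes listed above:

```sql
-- Sketch of the SENSOR_DATA table; column types are assumptions
CREATE TABLE SENSOR_DATA (
    DEVICE_ID   VARCHAR(50),       -- unique identifier of the device
    MAC_ID      VARCHAR(50),       -- network identifier for the device
    TEMPERATURE NUMBER(5, 2),      -- room temperature reading
    TIMESTAMP   TIMESTAMP_NTZ      -- when the device sent the data
);
```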
In this case, the first two rows contain duplicate data:
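For illustration, sample rows like these (values invented) show the duplication—the first two rows report the same reading from the same device at the same time:

```sql
-- Hypothetical sample data; the first two rows are duplicates
INSERT INTO SENSOR_DATA (DEVICE_ID, MAC_ID, TEMPERATURE, TIMESTAMP) VALUES
    ('dev-001', '00:1B:44:11:3A:B7', 22.5, '2023-01-15 10:00:00'),
    ('dev-001', '00:1B:44:11:3A:B7', 22.5, '2023-01-15 10:00:00'),
    ('dev-002', '00:1B:44:11:3A:B8', 21.9, '2023-01-15 10:00:00');
```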
The deduplication process always starts with a strategy to identify duplicate rows. In this case, you can use the <span class="code-exp">device_id</span> and <span class="code-exp">timestamp</span> columns as the attributes that uniquely identify temperature event data. Note that the Snowflake table will receive this data from many such devices and that the data volume will be large. Generally, the deduplication process runs at a frequency that matches the required output frequency of the application.
You’re going to build a dbt script to deduplicate data. To do this, you must implement a combination key using the previously identified attributes. dbt provides a utility function called <span class="code-exp">surrogate_key</span> to do this. The function takes a list of columns as its arguments. The snippet below shows how you can generate the new surrogate ID:
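A sketch of such a model follows. It assumes the <span class="code-exp">dbt_utils</span> package is installed via <span class="code-exp">packages.yml</span> and that the table is registered as a dbt source named <span class="code-exp">sensor_data</span>; the source name is illustrative. (Note that in <span class="code-exp">dbt_utils</span> 1.0 and later, the macro was renamed <span class="code-exp">generate_surrogate_key</span>.)

```sql
-- temp_data_with_surrogate.sql
-- Assumes dbt_utils is installed and a source named 'iot.sensor_data' exists
SELECT
    {{ dbt_utils.surrogate_key(['device_id', 'timestamp']) }} AS sur_id,
    device_id,
    mac_id,
    temperature,
    timestamp
FROM {{ source('iot', 'sensor_data') }}
```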
Save this dbt model as <span class="code-exp">temp_data_with_surrogate.sql</span>.
Next, use the <span class="code-exp">sur_id</span> surrogate key as the filter to keep only one row per key. <span class="code-exp">DISTINCT</span> alone isn’t enough here, because it only removes rows that are identical across every column; instead, you can combine Snowflake’s <span class="code-exp">ROW_NUMBER</span> window function with the <span class="code-exp">QUALIFY</span> clause. <span class="code-exp">ROW_NUMBER</span> assigns a rank to each row within a partition, and <span class="code-exp">QUALIFY</span> filters on that rank. The code snippet below shows the process:
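A sketch of this model, assuming it builds on the <span class="code-exp">temp_data_with_surrogate</span> model from the previous step, could look like:

```sql
-- temp_data_deduped.sql
-- Keep only the first row for each surrogate key
SELECT *
FROM {{ ref('temp_data_with_surrogate') }}
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY sur_id
    ORDER BY timestamp
) = 1
```

Here, <span class="code-exp">ROW_NUMBER</span> numbers the rows within each <span class="code-exp">sur_id</span> partition, and <span class="code-exp">QUALIFY</span> retains only the row ranked first.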
Save this dbt model as <span class="code-exp">temp_data_deduped.sql</span>. The table will look like this:
Typically, you need to set up these runs incrementally so that repeated runs don’t result in duplicate entries. To set up an incremental run, add the code below at the beginning of your <span class="code-exp">temp_data_deduped</span> model:
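A minimal sketch of that configuration, assuming <span class="code-exp">sur_id</span> is used as the unique key, looks like this:

```sql
-- Configure the model as incremental so repeated runs merge on sur_id
-- instead of re-inserting rows
{{
    config(
        materialized='incremental',
        unique_key='sur_id'
    )
}}
```

Optionally, you can also wrap a filter in dbt’s <span class="code-exp">is_incremental()</span> macro so that each run only processes rows newer than those already in the target table.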
You can now run the models using dbt’s IDE. If everything goes well, you can find the directed acyclic graph (DAG) of your models displayed as shown below. Here, dbt has identified two models and their sequential execution in the DAG:
dbt is a great tool for running transformations on the data in your data warehouse. Besides supporting cloud data warehouses like Snowflake, it can capture lineage, support testing, and orchestrate jobs. Deduplication is a critical requirement in event-driven architecture, since providing exactly-once message guarantees is expensive for large data volumes. As noted earlier, running deduplication at a destination data warehouse like Snowflake is often more efficient. dbt comes with a utility called <span class="code-exp">surrogate_key</span> as part of its <span class="code-exp">dbt-utils</span> package to implement this quickly.
If you are building data-driven experiences like dashboards, reports, in-product metrics or any other form of customer-facing analytics, consider Propel Data. The platform provides a set of analytics APIs that integrate with your existing ELT pipeline and warehouses to extract the relevant data. Propel Data also supports popular charting libraries like ECharts, Highcharts, and D3. Because it’s a managed service, you don’t have to worry about scaling as your data volume grows.
You can join the waitlist to learn more.