Data Engineering

How to Build an Incremental Model for Events Using dbt and Snowflake

An incremental model updates your data storage by adding only new or changed records instead of overwriting or reprocessing the entire data set. dbt has popularized this approach for data transformations as an easier and more performant way to update your database, since each run only processes records added since the last time dbt ran.


In this tutorial, you'll learn how to use the incremental model in dbt to manage data streams in your Snowflake warehouse. This should help you optimize your data engineering structure and improve data accessibility.

What Are Incremental Models?

Incremental models play an important part in ensuring data readiness for your downstream processes. They drastically cut down on the volume of data that needs to be transformed on each run, which in turn enhances warehousing efficiency and lowers compute costs.

An incremental model is especially useful when implementing extract, transform, load (ETL) processes within your organization and when syncing event-style data across systems such as your enterprise resource planning (ERP) and customer relationship management (CRM) platforms, since it effectively limits the amount of data actively being changed and processed.

Incremental models are also the better choice in cases where compute costs need to be optimized, for example when processing large source data with millions of rows, or with necessary transformations that are expensive or time-consuming to run.

Incremental Models in dbt


In dbt, an incremental model of your data is built as a saved materialization of all rows of the source data. On dbt's first run, all available data is processed and stored as its own table. After that, each successive dbt run will only process the rows that you specify, inserting them or updating existing rows in the saved materialization of your source data.

On an incremental run, you typically filter for the rows in your source data that were added or updated since dbt last ran. As a result, your model is built up incrementally with each dbt execution.

While setting up an incremental model adds some complexity at the start, it leads to significant improvements in performance.
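As a sketch of the pattern (the `raw.events` source and its column names here are hypothetical, not part of this tutorial's data):

```sql
-- models/my_incremental_model.sql (hypothetical example)
{{ config(materialized='incremental', unique_key='event_id') }}

select event_id, event_ts, payload
from raw.events

{% if is_incremental() %}
  -- on incremental runs, only process rows newer than what the model already holds;
  -- {{ this }} refers to the model's own existing table
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```

On the first run, the `is_incremental()` block is skipped and the whole source is materialized; on every later run, only the filtered rows are processed.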

Building an Incremental Model Using dbt and Snowflake

dbt generally streamlines your data pipeline process, giving you more flexibility and boosting the transform aspects of your extract, load, transform (ELT) or ETL tasks. With dbt, you can optimize data readiness. You can aggregate, normalize, or sort your data however you wish for your downstream processes (analytics, AI, etc.) without continuously updating your pipeline and resending data.

Architecture diagram


Prerequisites

You'll need the following to build your incremental model:

• dbt, which can be installed on Mac, Windows, and Linux using package managers such as Homebrew and pip, or as a Docker image. This tutorial uses Windows.

• A Snowflake account with sample streaming data. You can sign up for a thirty-day trial.

Note: If using Windows, Python 3.7+ and git are prerequisites for dbt/Snowflake installation.

Setting Up Snowflake

On your Snowflake account, go to the **Marketplace** and search for the "Global Weather & Climate Data for BI" database. It contains weather and climate-specific data across countries and updates on an hourly basis.

Click **Get** and add the database to your Snowflake account:

Weather data

You can check that the database has been added under **Data > Databases**:

Data view

Click the **+ Database** button and create another database, which will be a dbt access point where the software can create and store tables and views for you:

Create database

Lastly, create a warehouse on your Snowflake account under **Admin > Warehouses**. Warehouses are clusters of compute resources that are needed to perform various operations in Snowflake.

For this tutorial, create an **X-Small** warehouse:

Create Warehouse

Setting Up dbt

dbt uses technology-specific adapters to connect and run commands against your database platform. For your Snowflake connection, you'll install the `dbt-snowflake` adapter package and configure dbt with Snowflake-specific information.

Install dbt locally with the following command on your CLI:

pip install dbt-snowflake

Next, input the following command to create your dbt project and connect to your Snowflake account:

dbt init

You'll be asked to input the name of the project and the database platform you want to use (Snowflake), as well as some authentication details such as your account name, username, role, and password. You'll also input details such as the database and schema that dbt is using (the second database you created and the PUBLIC schema). The process should look something like this:

dbt project

With this, your dbt project should be created. A folder named after the project will be generated with the information you inputted, and it'll contain some default files. The connection credentials themselves are stored in a profiles.yml file in the .dbt folder of your home directory.

In your CLI, change directory to the created folder, then run the `dbt debug` command to check the status of your dbt project and the Snowflake connection:

cd {dbt_project_name}
dbt debug
Debug connection

Creating an Incremental Model in dbt

Now that your Snowflake database has been set up and connected with dbt, you can use dbt to create models and transformations of your Snowflake data.

The weather data you added to your account has three available tables: `CLIMATOLOGY_DAY`, `FORECAST_DAY`, and `HISTORY_DAY`. You can find them as views under the `STANDARD_TILE` schema. You'll use the `FORECAST_DAY` table, which contains records of locations and their weather forecasts and measurements.
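Before modeling, you can preview the source in a Snowflake worksheet. The query below is a quick sanity check; replace `{Weather_Database}` with the name the database has in your account, and note the column list is drawn from the fields used later in this tutorial:

```sql
select postal_code, time_init_utc, doy_std,
       min_temperature_air_2m_f, max_temperature_air_2m_f
from {Weather_Database}.STANDARD_TILE.FORECAST_DAY
limit 10;
```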

To create your incremental model, create a `.sql` file within the `models` folder of your dbt project:

Folder location

Copy the following code into your SQL file, replacing `{Weather_Database}` with the name of your imported weather data, and save:

{{
    config(
        materialized='incremental',
        unique_key=['postal_code', 'time_init_utc', 'doy_std']
    )
}}

select
    *,
    max_temperature_air_2m_f - min_temperature_air_2m_f as range_temperature_2m_f
from {Weather_Database}.STANDARD_TILE.FORECAST_DAY

{% if is_incremental() %}

-- this filter will only be applied on an incremental run
where time_init_utc > (select max(time_init_utc) from {{ this }})

{% endif %}

The `config()` block tells dbt the kind of model you are building and constructs a unique ID for records, limiting duplicates in the incremental model. Your `unique_key` is a combination of the variables representing location, date of forecast, and the time it is valid for. This ensures that only new forecasts for each location are appended.
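On Snowflake, dbt's default incremental strategy is `merge`, which upserts rows matching the `unique_key`. You can spell the strategy out, or switch to an alternative, in the same config block; here is a sketch of the equivalent explicit configuration:

```sql
{{
    config(
        materialized='incremental',
        -- 'merge' is the dbt-snowflake default; 'delete+insert' and 'append' are alternatives
        incremental_strategy='merge',
        unique_key=['postal_code', 'time_init_utc', 'doy_std']
    )
}}
```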

The `is_incremental()` function gates the filter logic for your incremental model. Here, you specify that on dbt's subsequent executions, only records with a `time_init_utc` later than the maximum already in the table should be added.
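If records can arrive late or be revised after the watermark has moved on, a common variant widens the filter with a lookback window; the `unique_key` then deduplicates any rows that get re-selected. A sketch using Snowflake's `dateadd` (the three-day window is an arbitrary choice):

```sql
{% if is_incremental() %}
  -- reprocess a trailing three-day window to pick up late-arriving forecasts
  where time_init_utc > (select dateadd(day, -3, max(time_init_utc)) from {{ this }})
{% endif %}
```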

For the main code body, the `select` statement, you get the raw weather data as well as a transformation of the air temperature data, creating a range variable from the minimum and maximum values per forecast. The incremental model will have all of the original data plus the range of air temperature for each record.

Once saved, this incremental model can be run with the following command:

dbt run

This command executes all available operations in your dbt project, including the default dbt-generated examples. (To build only your new model, you can pass its name with `dbt run --select <model_name>`.)

dbt execution

Since this is the first dbt execution, the incremental model will collate all available data:

Database model

As the source data updates, you can rerun the command to get an updated table with new records added in. (If you ever need to rebuild the model from scratch, ignoring the incremental filter, add the `--full-refresh` flag to `dbt run`.)

Updated model

When using dbt for your production use cases, you'll want to schedule your dbt operations and executions rather than manually running `dbt run`. dbt offers a cloud service called dbt Cloud where you can create and schedule jobs with alerts and logs on the operations.

You can also utilize external scheduler software like Apache Airflow or GitLab CI/CD, or code cron jobs yourself.

Conclusion

Incremental models offer numerous benefits to your data-intensive projects. Using them helps you streamline your data readiness process while reducing compute costs and time. This helps you ensure more effective and efficient downstream operations, such as analytics, dashboarding, reporting, and predictive modeling.

dbt is a crucial tool in helping you work with this model. As you saw, it gives you the ability to execute more complex business logic earlier in your data structure. This will boost your data transformation processes and provide greater flexibility.
