Unit testing in dbt
A macro-ful way to test incremental models and snapshots in dbt, and how to go about it
In my previous article How to set up a unit testing framework for your dbt projects, I addressed unit testing for relatively simple models inside of dbt. dbt has the concept of materialisations, where you can choose how your model physicalizes the data once it is run. According to dbt,
When using the table materialization, your model is rebuilt as a table on each run, via a create table as statement.
This is perhaps more familiar as a truncate-and-load strategy. However, constant rebuilds can be unwieldy and slow when dealing with large amounts of event-based data.
Incremental models are tables in dbt that are continuously transformed (as opposed to overwritten) on every run. This is useful to create models that insert/update records into a table since the last time the model was run. There are multiple incremental strategies and configurations that can be used with different warehousing tools, along with custom changes you can make.
Sometimes, you want to capture how your sources change over time. Snapshots are dbt’s implementation of type-2 slowly changing dimensions. They are select statements that reside in your snapshots folder.
You might choose to use an incremental model or a snapshot in your data warehouse, but how do you test it? In this second-parter, we’ll look at how we might try testing these models. As before, we’ll use some examples to walk through the steps. Set up your BigQuery project as outlined in the previous article, and let’s go!
Starting point
If you recollect, we created a simple model that modified the famous Iris dataset slightly and then tested it. Let’s make it incremental.
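The model itself is embedded as a gist in the original post. As a rough sketch of what an incremental version of the Iris model might look like — the model name, column names, loaded_at watermark column, and drop_audit_columns post-hook macro are all my assumptions, not the author's exact code:

```sql
-- models/iris_incremental.sql (hypothetical name)
{{
    config(
        materialized='incremental',
        unique_key='iris_id',
        post_hook="{{ drop_audit_columns(this) }}"
    )
}}

select
    iris_id,
    sepal_length,
    sepal_width,
    petal_length,
    petal_width,
    upper(species) as species_name,
    -- audit columns: when this version of the record became active
    current_timestamp() as record_start_date,
    cast(null as timestamp) as record_end_date,
    loaded_at
from {{ source_for_test('test_sources', 'iris') }}

{% if is_incremental() %}
-- on incremental runs, only pick up rows loaded since the last run
where loaded_at > (select max(loaded_at) from {{ this }})
{% endif %}
```

The unique_key in the config block is what makes dbt generate a merge on incremental runs rather than a plain insert.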
Simple as that. I've added in some audit columns (start dates and end dates for each record), but of course, you may require a variety of other audit fields. This model has the unique key defined in the config, meaning that the incremental strategy is merge. More on that here.
Let’s also create a snapshot for the Iris dataset like so:
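The snapshot is also embedded as a gist in the original; here is a sketch under the same caveats (the snapshot name, columns, variable name, and post-hook macro are my assumptions):

```sql
-- snapshots/iris_snapshot.sql (hypothetical name)
{% snapshot iris_snapshot %}

{{
    config(
        target_schema=var('snapshot_schema'),
        unique_key='iris_id',
        strategy='check',
        check_cols=['species_name'],
        post_hook="{{ drop_audit_columns('iris_snapshot') }}"
    )
}}

-- any transformational logic inside the snapshot is what the unit test covers
select
    iris_id,
    upper(species) as species_name
from {{ source_for_test('test_sources', 'iris', snapshot_source='iris_snapshot') }}

{% endsnapshot %}
```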
Notice that for the schema of the snapshot, I've created a variable. This is so that if the target is a unit test, the snapshot is created in the testing dataset as opposed to whatever dataset you might normally want your snapshots to sit in. Here's what that variable looks like in dbt_project.yml:
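A sketch of that variable — the dataset names are placeholders; target is available when dbt renders dbt_project.yml, so the schema can switch on the target name:

```yaml
# dbt_project.yml (sketch; dataset names are assumptions)
vars:
  snapshot_schema: "{{ 'unit_testing' if 'unit-test' in target.name else 'snapshots' }}"
```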
I’m using the check strategy here, but you might use timestamp. Either way should work with our approach.
Post-hooks
You’ll notice that in the above SQL, I’ve defined a post-hook in the config block. This runs a macro after the execution of the model or snapshot.
I've created a macro that sends an alter table statement to BigQuery. This statement drops all the audit columns in our tables, enabling us to effectively test our models. Don't worry about this affecting tables in dev/prod; we've wrapped this query up in an if-statement. It checks to see if the string 'unit-test' is present in the target.name, and only then runs that query.
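A sketch of what such a macro could look like — the macro and variable names are mine, and the select 1 branch is one way to keep the post-hook from issuing an empty query outside of unit tests:

```sql
-- macros/drop_audit_columns.sql (hypothetical name)
{% macro drop_audit_columns(relation) %}
    {% if 'unit-test' in target.name %}
        {# drop every audit column so the equality test compares data columns only #}
        {% for column in var('audit_columns') %}
            alter table {{ relation }} drop column if exists {{ column }};
        {% endfor %}
    {% else %}
        select 1
    {% endif %}
{% endmacro %}
```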
I’ve created a list of audit columns that I might need to drop from my tables and referenced it in the above macro. That list is as follows:
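Something along these lines — record_start_date and record_end_date are my hypothetical audit fields, while the dbt_* names are the meta-columns dbt itself adds to snapshot tables:

```yaml
# dbt_project.yml (sketch)
vars:
  audit_columns:
    - record_start_date
    - record_end_date
    - dbt_scd_id
    - dbt_updated_at
    - dbt_valid_from
    - dbt_valid_to
```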
You can add or delete to/from this as required. I've included the IF EXISTS clause in my macro, so if there are extraneous columns in the list that do not exist in your table, this operation will not error out.
Why have I used {{ this }} in the post-hook for my incremental model, but sent the actual model name in the snapshot's post-hook? Because {{ this }} does not work for snapshots. I'm sure dbt has a good reason for this, but for now, we'll use this workaround.
Seed more seed files
There’s more nuance to how we create our seed files this time. Remember that an incremental model on the first pass transforms all the rows from the source table. On subsequent runs, only specified records are transformed. This means that the incremental model behaves differently on the first and second passes. We’ll create two input files (one for the first pass and one for the second) and similarly, two output files.
Input files:
Output files:
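To make the shape concrete, here's a hypothetical first-pass input seed — file name, columns, and rows are illustrative only:

```csv
iris_id,sepal_length,sepal_width,petal_length,petal_width,species,loaded_at
1,5.1,3.5,1.4,0.2,setosa,2021-01-01 00:00:00
4,4.6,3.1,1.5,0.2,setosa,2021-01-01 00:00:00
```

The second-pass input would repeat Id 4 with changed values and add a new Id 5, both with a later loaded_at; the two output files then hold the rows you expect the model to contain after each corresponding run.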
These output files represent the behaviour that we expect from our incremental model. On the first pass, all rows are transformed. With the second, the record with Id ‘4’ should be overwritten, while history is maintained and the new record inserted.
However, with snapshots, we expect dbt to handle the inserts and updates according to the behaviour defined in the documentation. All we need to test is any transformational logic we may have inside of our snapshots.
Here’s our input:
Output:
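As a sketch, the snapshot's input seed mirrors the raw source, while the output seed contains only the transformed columns we expect to see — column names here are hypothetical:

```csv
iris_id,species_name
1,SETOSA
4,SETOSA
```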
Place these files under your 'data' folder. Remember to add the column types inside of your dbt_project.yml, as shown in the previous article.
More targets for tests
Let's define some new targets inside of our profiles.yml. We create them in order to allow dbt to differentiate between the first and the second test pass. We'll create a unit-test-1 environment for the initial pass and unit-test-2 for the second pass. Create more if required, depending on your models.
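A sketch of those targets for BigQuery — the profile, project, and dataset names are placeholders:

```yaml
# profiles.yml (sketch)
dbt_unit_testing:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-gcp-project
      dataset: analytics
      threads: 4
    unit-test-1:
      type: bigquery
      method: oauth
      project: my-gcp-project
      dataset: unit_testing
      threads: 4
    unit-test-2:
      type: bigquery
      method: oauth
      project: my-gcp-project
      dataset: unit_testing
      threads: 4
```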
Macros, macros and more macros
With just a little more tinkering, you're able to tell dbt which source to pick from depending on the target. I've remixed the source_for_test and ref_for_test macros to look for the string 'unit-test' inside of our target.name. Take a look at them:
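A sketch of ref_for_test — the seed-naming convention is my assumption; source_for_test would have the same shape, taking normal_source_relation and snapshot_source and falling back to source() instead of ref():

```sql
-- macros/ref_for_test.sql (sketch)
{% macro ref_for_test(normal_ref_relation, snapshot_ref=none) %}
    {% if 'unit-test' in target.name %}
        {# unit-test-1 picks the _input_1 seed, unit-test-2 picks _input_2, etc. #}
        {% set pass_number = target.name | replace('unit-test-', '') %}
        {% set base = snapshot_ref if snapshot_ref is not none else normal_ref_relation %}
        {{ return(ref(base ~ '_input_' ~ pass_number)) }}
    {% else %}
        {{ return(ref(normal_ref_relation)) }}
    {% endif %}
{% endmacro %}
```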
You'll notice that there's a new argument for both macros, snapshot_source and snapshot_ref. I've added these in so that you can pass through the name of the snapshot being tested to pick the right source. Take a look at the snapshot again and pay attention to the arguments in source_for_test.
If no target is picked, the actual source or reference model is instead supplied through normal_ref_relation and normal_source_relation.
These macros can be further modified to reflect any more possible sources you may have for your models.
Defining the tests
Inside of our schema.yml, we’ll define some tests, wrapping them up in some Jinja magic to let dbt know what output file to compare our results to. I’ve used a replace() filter here that picks the right suffix based on the target.
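A sketch of such a test entry, assuming the dbt_utils.equality test and hypothetical model and output seed names:

```yaml
# models/schema.yml (sketch; names are assumptions)
version: 2

models:
  - name: iris_incremental
    tests:
      - dbt_utils.equality:
          compare_model: "{{ ref('iris_output_' ~ target.name | replace('unit-test-', '')) }}"
```

With the unit-test-1 target this compares against iris_output_1, and with unit-test-2 against iris_output_2.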
These can get as complex as you like (like your unit testing macros) to cover all the possible test cases you might have for your warehouse.
You may have other conditions within your models that change the way they perform. These lead to different outputs and hence need different testing output files.
I’d like to take a moment here to acknowledge that this model behaviour is extremely simple, and I’ve seen some truly convoluted custom incremental strategies. I’m still reasonably confident that this approach can help tackle the unit testing problem, however many targets you may have to define.
Note: We’re still using dbt-utils, but there’s another package that’s quite useful called dbt-expectations that is inspired by the Great Expectations testing package for Python. It contains a variety of interesting tests you can run on your models, from simple checks to see if a column exists, to aggregate functions. Give it a whirl!
Execution
Let’s get on with executing our tests.
- Run dbt deps to install the packages specified inside of your packages.yml
- Run dbt seed to load the test files into the testing dataset
- Run the model or snapshot to be tested against the unit-test-1 target
Model - dbt run --target unit-test-1 --model <model_name>
Snapshot - dbt snapshot --target unit-test-1 --select <snapshot_name>
- Run the test against the unit-test-1 target
Model - dbt test --target unit-test-1 --model <model_name>
Snapshot - dbt test --target unit-test-1 --select <snapshot_name>
- Run the model to be tested against the unit-test-2 target
dbt run --target unit-test-2 --model <model_name>
- Run the test against the unit-test-2 target
dbt test --target unit-test-2 --model <model_name>
Conclusion
The idea of testing incremental models and snapshots can be confusing, but breaking their behaviours down into separate steps can help us test each individual bit. In the next part, we’ll take a look at how we can automate these tests, and maybe store the results in a meaningful manner. Also, thanks to Marat Levit for super helpful code (and formatting) tips!