unit testing in dbt

A macro-ful way to test incremental models and snapshots in dbt

and how to go about it

Betsy Varghese
Published in Cognizant Servian · 6 min read · Jul 15, 2021

In my previous article How to set up a unit testing framework for your dbt projects, I addressed unit testing for relatively simple models inside of dbt. dbt has the concept of materialisations, where you can choose how your model physicalizes the data once it is run. According to dbt,

When using the table materialization, your model is rebuilt as a table on each run, via a create table as statement.

This is perhaps more familiar as a truncate-and-load strategy. However, constant rebuilds can be unwieldy and slow when dealing with large amounts of event-based data.

Incremental models are tables in dbt that are continuously transformed (as opposed to overwritten) on every run. This lets a model insert or update only the records that have arrived or changed since the last time it was run. There are multiple incremental strategies and configurations available across warehousing tools, along with custom changes you can make.

Sometimes, you want to capture how your sources change over time. Snapshots are dbt’s implementation of type-2 slowly changing dimensions. They are select statements that reside in your snapshots folder.

You might choose to use an incremental model or a snapshot in your data warehouse, but how do you test it? In this second-parter, we’ll look at how we might try testing these models. As before, we’ll use some examples to walk through the steps. Set up your BigQuery project as outlined in the previous article, and let’s go!

Starting point

If you recollect, we created a simple model that modified the famous Iris dataset slightly and then tested it. Let’s make it incremental.
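As a sketch, the incremental version might look something like this. The transformation, audit columns, and watermark filter are all illustrative, and drop_audit_columns is a placeholder name for the post-hook macro we'll cover shortly:

```sql
-- models/iris_incremental.sql (sketch; column, source, and macro names are assumptions)
{{
    config(
        materialized='incremental',
        unique_key='Id',
        post_hook="{{ drop_audit_columns(this) }}"
    )
}}

select
    Id,
    SepalLengthCm,
    SepalWidthCm,
    PetalLengthCm,
    PetalWidthCm,
    upper(Species) as Species,
    -- audit columns; these get dropped by the post-hook when unit testing
    current_timestamp() as start_date,
    timestamp('9999-12-31') as end_date
from {{ source_for_test(source('iris', 'iris'), 'iris') }}

{% if is_incremental() %}
  -- only pick up rows loaded since the last run (illustrative watermark column)
  where loaded_at > (select max(loaded_at) from {{ this }})
{% endif %}
```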

Incremental model

Simple as that. I’ve added in some audit columns (start dates and end dates for each record), but of course, you may require a variety of other audit fields. This model has the unique key defined in the config, meaning that the incremental strategy is merge. More on that here.

Let’s also create a snapshot for the Iris dataset like so:
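A rough sketch of what that snapshot could look like (names and arguments here are illustrative; note the post-hook receives the snapshot's name as a string rather than {{ this }}, for reasons explained below):

```sql
-- snapshots/iris_snapshot.sql (sketch; names are assumptions)
{% snapshot iris_snapshot %}

{{
    config(
        target_schema=var('snapshot_schema'),
        unique_key='Id',
        strategy='check',
        check_cols='all',
        post_hook="{{ drop_audit_columns('iris_snapshot') }}"
    )
}}

select * from {{ source_for_test(source('iris', 'iris'), 'iris', snapshot_source='iris_snapshot') }}

{% endsnapshot %}
```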

Snapshot

Notice that for the schema of the snapshot, I’ve created a variable. This is so that if the target is a unit test, the snapshot is created in the testing dataset as opposed to whatever dataset you might normally want your snapshots to sit in. Here’s what that variable looks like in dbt_project.yml:
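Something along these lines, assuming snapshot_schema as the variable name. Whether Jinja (and the target context) is rendered inside vars can depend on your dbt version, so verify this against your own setup:

```yaml
# dbt_project.yml (sketch; variable and dataset names are assumptions)
vars:
  snapshot_schema: "{{ 'unit_testing' if 'unit-test' in target.name else 'snapshots' }}"
```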

I’m using the check strategy here, but you might use timestamp. Either way should work with our approach.

Post-hooks

You’ll notice that in the above SQL, I’ve defined a post-hook in the config block. This runs a macro after the execution of the model or snapshot.

I’ve created a macro that sends an alter table statement to BigQuery. This statement drops all the audit columns in our tables, enabling us to effectively test our models. Don’t worry about this affecting tables in dev/prod; we’ve wrapped this query up in an if-statement. It checks to see if the string ‘unit-test’ is present in the target.name, and only then runs that query.
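A minimal sketch of such a macro, assuming the audit columns live in a project variable called audit_columns:

```sql
-- macros/drop_audit_columns.sql (sketch; macro and variable names are assumptions)
{% macro drop_audit_columns(relation) %}
    {% if 'unit-test' in target.name %}
        {% for column in var('audit_columns') %}
            {# IF EXISTS keeps this from erroring when a column is absent #}
            {% do run_query('alter table ' ~ relation ~ ' drop column if exists ' ~ column) %}
        {% endfor %}
    {% endif %}
{% endmacro %}
```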

Dropping columns

I’ve created a list of audit columns that I might need to drop from my tables and referenced it in the above macro. That list is as follows:
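For instance, as a project variable in dbt_project.yml. The dbt_* entries are the meta-columns dbt adds to snapshots; the rest are assumptions matching the audit fields in the model:

```yaml
# dbt_project.yml (sketch; extend with your own audit fields)
vars:
  audit_columns:
    - start_date
    - end_date
    - dbt_scd_id
    - dbt_updated_at
    - dbt_valid_from
    - dbt_valid_to
```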

A list of un-testable fields

You can add to or remove from this list as required. I’ve included the IF EXISTS clause in my macro, so if the list contains columns that do not exist in your table, the operation will not error out.

Why have I used {{ this }} in the post-hook for my incremental model, but sent the actual model name in the snapshot’s post-hook? Because {{ this }} does not work for snapshots. I’m sure dbt has a good reason for this, but for now, we’ll use this workaround.

Seed more seed files

There’s more nuance to how we create our seed files this time. Remember that an incremental model on the first pass transforms all the rows from the source table. On subsequent runs, only specified records are transformed. This means that the incremental model behaves differently on the first and second passes. We’ll create two input files (one for the first pass and one for the second) and similarly, two output files.

Input files:

Input for the first pass
Input for the second pass

Output files:

The expected output of the first pass
The expected output of the second pass

These output files represent the behaviour that we expect from our incremental model. On the first pass, all rows are transformed. With the second, the record with Id ‘4’ should be overwritten, while history is maintained and the new record inserted.
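For illustration only (your real seeds will depend on your model), the second-pass input might contain just a changed row with Id 4 and a brand-new row:

```
Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
4,4.6,3.1,1.5,0.2,iris-setosa
6,5.4,3.9,1.7,0.4,iris-setosa
```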

However, with snapshots, we expect dbt to handle the inserts and updates according to the behaviour defined in the documentation. All we need to test is any transformational logic we may have inside of our snapshots.

Here’s our input:

Input for the first pass

Output:

The expected output of the first pass

Place these files under your ‘data’ folder. Remember to add the column types inside of your dbt_project.yml as shown in the previous article.

More targets for tests

Let’s define some new targets inside of our profiles.yml. We create them so that dbt can differentiate between the first and the second test pass: unit-test-1 for the first pass and unit-test-2 for the second. Create more if required, depending on your models.
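In profiles.yml, the new targets can simply clone your existing testing output; everything below (project, dataset, keyfile) is a placeholder:

```yaml
# profiles.yml (sketch; all values are placeholders)
my_dbt_project:
  target: dev
  outputs:
    unit-test-1:
      type: bigquery
      method: service-account
      project: my-gcp-project
      dataset: unit_testing
      keyfile: /path/to/keyfile.json
      threads: 1
    unit-test-2:
      type: bigquery
      method: service-account
      project: my-gcp-project
      dataset: unit_testing
      keyfile: /path/to/keyfile.json
      threads: 1
```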

Macros, macros and more macros

With just a little more tinkering, you’re able to tell dbt what source to pick from depending on the target. I’ve remixed the source_for_test and ref_for_test macros to look for the string ‘unit-test’ inside of our target.name. Take a look at them:
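The gist of the source macro might be sketched like this (argument names follow the description below; the seed-naming convention is an assumption, and ref_for_test works analogously with snapshot_ref):

```sql
-- macros/source_for_test.sql (sketch; seed naming convention is an assumption)
{% macro source_for_test(normal_source_relation, test_source_name, snapshot_source=none) %}
    {% if 'unit-test' in target.name %}
        {% if snapshot_source %}
            {# snapshots read the same seed on every pass #}
            {{ return(ref(snapshot_source ~ '_input')) }}
        {% else %}
            {# e.g. iris_input_1 for unit-test-1, iris_input_2 for unit-test-2 #}
            {{ return(ref(test_source_name ~ '_input_' ~ target.name | replace('unit-test-', ''))) }}
        {% endif %}
    {% else %}
        {# outside unit tests, use the real source #}
        {{ return(normal_source_relation) }}
    {% endif %}
{% endmacro %}
```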

Source macro
Ref macro

You’ll notice that there’s a new argument for both macros, snapshot_source and snapshot_ref. I’ve added these in so that you can pass through the name of the snapshot that is being tested to pick the right source. Take a look at the snapshot again and pay attention to the arguments in source_for_test.

If no target is picked, the actual source or reference model is instead supplied through normal_ref_relation and normal_source_relation.

These macros can be further modified to reflect any more possible sources you may have for your models.

Defining the tests

Inside of our schema.yml, we’ll define some tests, wrapping them up in some Jinja magic to let dbt know what output file to compare our results to. I’ve used a replace() filter here that picks the right suffix based on the target.
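As a sketch, assuming output seeds named iris_output_1 and iris_output_2 and the dbt_utils equality test:

```yaml
# models/schema.yml (sketch; model and seed names are assumptions)
models:
  - name: iris_incremental
    tests:
      - dbt_utils.equality:
          # resolves to iris_output_1 or iris_output_2 depending on the target
          compare_model: "{{ ref('iris_output_' ~ target.name | replace('unit-test-', '')) }}"
```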

These can get as complex as you like (like your unit testing macros) to cover all the possible test cases you might have for your warehouse.

Defining tests

You may have other conditions within your models that change the way they perform. These lead to different outputs and hence need different testing output files.

I’d like to take a moment here to acknowledge that this model behaviour is extremely simple, and I’ve seen some truly convoluted custom incremental strategies. I’m still reasonably confident that this approach can help tackle the unit testing problem, however many targets you may have to define.

Note: We’re still using dbt-utils, but there’s another package that’s quite useful called dbt-expectations that is inspired by the Great Expectations testing package for Python. It contains a variety of interesting tests you can run on your models, from simple checks to see if a column exists, to aggregate functions. Give it a whirl!

Execution

Let’s get on with executing our tests.

  1. Run dbt deps to install packages specified inside of your packages.yml
    dbt deps
  2. Run dbt seed to load the test files into the testing dataset
    dbt seed
  3. Run the model to be tested against the unit-test-1 target
    Model - dbt run --target unit-test-1 --model <model_name>
    Snapshot - dbt snapshot --target unit-test-1 --select <snapshot_name>
  4. Run the test against the unit-test-1 target
    Model - dbt test --target unit-test-1 --model <model_name>
    Snapshot - dbt test --target unit-test-1 --select <snapshot_name>
  5. Run the model to be tested against the unit-test-2 target
    dbt run --target unit-test-2 --model <model_name>
  6. Run the test against the unit-test-2 target
    dbt test --target unit-test-2 --model <model_name>

Conclusion

The idea of testing incremental models and snapshots can be confusing, but breaking their behaviours down into separate steps can help us test each individual bit. In the next part, we’ll take a look at how we can automate these tests, and maybe store the results in a meaningful manner. Also, thanks to Marat Levit for super helpful code (and formatting) tips!
