BigQuery dbt: modern problems require modern solutions

Do you have multiple interdependent views and tables in BigQuery? Would you like to easily build and visualise a query dependency map? Have you ever wondered if you can re-use SQL between transformations? Do you want an easy way to maintain separate production and development environments?


If the answer is yes to any of these questions then read on...

Some great blogs talk about how to get started with dbt; in this post, however, I will specifically cover the following:

  1. My opinion of dbt

Setting up some context

As a consultant, I have worked with BigQuery a lot across large enterprises, and handling and maintaining your data warehouse can become a daunting task as the number of datasets and tables grows astronomically.

For example, if table Z requires tables X and Y, it’s easy enough to deploy these tables in the correct order. But when your warehouse looks something like this:

Source: https://bit.ly/2X7tO32

The task of deploying all your views and tables is not so simple! The deployment tasks become increasingly challenging when you have to manage multiple environments and decide on your materialisation strategy across multiple datasets in BigQuery.

A typical approach would be to write custom code that explicitly defines a deployment order for your BigQuery objects, and that also lets you choose which objects to persist as a table or a view. However, this approach brings major overhead and maintenance work: not only would you have to maintain the code for the data warehouse itself, you would also have to write and maintain the code that handles all objects across each environment.

Moreover, I have always wondered if I could just easily visualise the interdependent views and tables in BigQuery without having to write custom code for it.

Programming languages such as Python and Java give us the ability to:

  • Use control structures (e.g. if statements and for loops)

Unfortunately, that’s not the case with SQL. While SQL is a powerful and easy-to-use language, it becomes far more useful when we pair it with a fully-featured templating language.

dbt to the rescue


dbt (data build tool) is a command-line tool that enables analytics engineers to transform data in their warehouses simply by writing select statements.

dbt enables analysts to work more like software engineers, bringing software engineering principles to the ELT world. For example, select statements can be versioned and re-used, and schema and data can be tested, much like unit tests in software development.
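For instance, schema tests are declared in a YAML file that lives next to your models and run with `dbt test`. Here is a minimal sketch (the model and column names are illustrative):

```yaml
# schema.yml — declarative tests, run with `dbt test`
version: 2

models:
  - name: my_model          # illustrative model name
    columns:
      - name: _key
        tests:
          - unique          # fails if _key contains duplicates
          - not_null        # fails if _key contains NULLs
```

These built-in tests compile to queries against your warehouse, so a failing assertion surfaces as a failing build step rather than a silent data problem.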

dbt does the T in ELT (Extract, Load, Transform) processes. It doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse. It allows us to write transformations as queries and orchestrate them in a more efficient way.

dbt is a compiler and a runner. dbt’s only function is to take code, compile it to SQL, and then run it against your database. dbt supports the most common databases, such as Postgres, BigQuery, Snowflake and Redshift. You can view the full list here.

I think I have your attention now


dbt handles the materialisation

A dbt project is a directory of `.sql` and `.yml` files, which dbt uses to transform your data. At a minimum, a dbt project must contain:

  • A project file: `dbt_project.yml` file tells dbt that a particular directory is a dbt project, and also contains configurations for your project.

Materialisations are strategies for persisting dbt models in a warehouse. As a data engineer, all you need to do is to concentrate on writing your query and dbt will handle the materialisation for you!

Materialisation can be defined either at the project file level:

# materialization defined at the project-file level
# applies to all files under models/example/
# snippet from the dbt_project.yml file
models:
  example:
    materialized: view

Alternatively, they can be defined at the model level:

-- materialization defined at the model level
{{ config(materialized='table') }}

SELECT
  *
FROM
  `project.dataset.tableA` A
LEFT JOIN
  `project.dataset.tableB` B
ON
  A._key = B._key

Materialisations can be chosen from table, view, incremental and ephemeral. You can read more about them here.

dbt automatically builds the dependency graph

The most important function in dbt is ref(); it’s impossible to build even moderately complex models without it. ref() is how you reference one model within another. dbt uses references between models to automatically build the dependency graph. This will enable dbt to deploy models in the correct order.

Previously, I had to write custom code to work out the correct order in which to deploy views and tables in BigQuery; the ref() feature makes this really simple without a single line of orchestration code!

Let’s take an example where query C requires tables (A,B). Normally, you would write it like this:

SELECT
  *
FROM
  `project.dataset.tableA` A
LEFT JOIN
  `project.dataset.tableB` B
ON
  A._key = B._key

But using dbt refs together with the materialisation config it’ll look something like this:

-- materialization defined at the model level
{{ config(materialized="table") }}

SELECT
  *
FROM
  {{ ref("tableA") }} A
LEFT JOIN
  {{ ref("tableB") }} B
ON
  A._key = B._key

Indeed, you can see that the most important function in dbt is ref()!
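Under the hood, deriving a deployment order from ref() calls amounts to a topological sort of the model graph. A plain-Python sketch of the idea, using a hypothetical three-model graph:

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Hypothetical model graph: each model maps to the models it ref()s,
# mirroring how dbt derives deployment order from ref() calls.
refs = {
    "tableC": {"tableA", "tableB"},
    "tableB": {"tableA"},
    "tableA": set(),
}

# A valid deployment order is any topological ordering of this graph:
# every model appears after all of its dependencies.
order = list(TopologicalSorter(refs).static_order())
print(order)  # tableA comes before tableB, which comes before tableC
```

This is a sketch of the concept, not dbt’s actual implementation; dbt builds the same kind of graph for you automatically from the refs in your models.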

We are not done here: dbt docs also allows us to visualise the dependency graph with just a few commands (`dbt docs generate` followed by `dbt docs serve`).

Pretty neat, eh!?

dbt allows you to re-use SQL between models

What!? Macros in dbt are snippets of SQL that can be invoked like functions from models, which makes it possible to re-use SQL between models in keeping with the engineering principle of DRY (Don’t Repeat Yourself). A macro file can contain one or more macros, and to use them you simply add a macro-paths config entry to your dbt_project.yml file. That’s it!

Let’s take a look at this in action; we have all seen code similar to this:

select
  field_1,
  field_2,
  field_3,
  field_4,
  field_5,
  count(*)
from
  my_table
group by
  1,
  2,
  3,
  4,
  5

While this might not look so bad, what if another query had to group by n columns where n > 50? It would be really tedious to write 1, 2, 3 … 50, and to top it off, this might need to be done across multiple SQL queries.

Let’s take a look at the previous example using macros. First, we define our macro in a macro file:

{% macro group_by(n) %}
GROUP BY
{% for i in range(1, n + 1) %}
{{ i }}
{% if not loop.last %} , {% endif %}
{% endfor %}
{% endmacro %}

Above, we have defined a macro called group_by which takes a single argument, n. Integrating our macro into the previous example would look like this:

select
  field_1,
  field_2,
  field_3,
  field_4,
  field_5,
  count(*)
from
  my_table
{{ group_by(5) }}

This group_by macro can now be used across multiple queries and for n number of fields, pretty cool ;)
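The macro’s expansion is easy to reason about; here is a plain-Python sketch of what the Jinja loop produces (this mirrors the macro, it is not how dbt renders it):

```python
def group_by(n):
    # Plain-Python mirror of the Jinja group_by macro:
    # emit "GROUP BY 1, 2, ..., n" with comma separators
    # between items but not after the last one.
    return "GROUP BY " + ", ".join(str(i) for i in range(1, n + 1))

print(group_by(5))  # GROUP BY 1, 2, 3, 4, 5
```

Whether n is 5 or 50, the generated clause stays a one-liner in your model.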

dbt can combine SQL with a fully-featured templating language

dbt allows you to combine SQL with Jinja, a templating language. Jinja turns your dbt project into a programming environment for SQL, giving you the ability to do things that aren’t normally possible in SQL, such as using control structures, environment variables, and so on.

For example, let’s say you have a BigQuery table with two fields and an array of structs called `event`, where each struct has two fields, `event_name` and `event_value`. One way of extracting values from this array could be:

SELECT
  field_1,
  field_2,
  (SELECT event_value FROM UNNEST(event) WHERE event_name = 'X1') AS column_X1,
  (SELECT event_value FROM UNNEST(event) WHERE event_name = 'X2') AS column_X2,
  (SELECT event_value FROM UNNEST(event) WHERE event_name = 'X3') AS column_X3,
  (SELECT event_value FROM UNNEST(event) WHERE event_name = 'X4') AS column_X4,
  (SELECT event_value FROM UNNEST(event) WHERE event_name = 'X5') AS column_X5,
  (SELECT event_value FROM UNNEST(event) WHERE event_name = 'X6') AS column_X6,
  (SELECT event_value FROM UNNEST(event) WHERE event_name = 'X7') AS column_X7,
  (SELECT event_value FROM UNNEST(event) WHERE event_name = 'X8') AS column_X8
FROM
  `project.dataset.table`

Looks kinda ugly, right? Let’s write the same query this time using Jinja:

{% set event_names = ["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"] %}
{% set my_struct = "event" %}
SELECT
  field_1,
  field_2,
  {% for name in event_names %}
  (SELECT event_value FROM UNNEST({{ my_struct }}) WHERE event_name = '{{ name }}') AS column_{{ name }}{% if not loop.last %},{% endif %}
  {% endfor %}
FROM
  `project.dataset.table`

How cool is that? If you want to get adventurous you could create a macro and use it in your query as we saw previously :O

{% set event_names = ["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"] %}
{% set my_struct = "event" %}
{% set my_key = "event_name" %}
{% set my_value = "event_value" %}
SELECT
  field_1,
  field_2,
  {{ extract_struct_value(my_struct, my_key, my_value, event_names) }}
FROM
  `project.dataset.table`
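The extract_struct_value macro itself isn’t defined in this post; one possible definition, following the same pattern as the group_by macro (the macro name and its arguments are assumptions for illustration):

```sql
-- hypothetical definition, placed in a macro file
{% macro extract_struct_value(struct, key, value, names) %}
  {% for name in names %}
  (SELECT {{ value }} FROM UNNEST({{ struct }}) WHERE {{ key }} = '{{ name }}') AS column_{{ name }}{% if not loop.last %},{% endif %}
  {% endfor %}
{% endmacro %}
```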

dbt + Cloud Build


I know I have got you really hyped up! You are probably wondering how dbt can help you maintain separate production and development environments. I can run dbt locally, but how do I automate all my beautiful dbt code? :(

Let me solve that problem for you too :)

At its core, dbt profiles YAML files can be used to maintain separate production and development environments, which means you can run the same dbt model in different environments just by passing a command-line argument!

Here is a snippet of a dbt profile YAML file that would run in a single environment:

# example dbt profiles.yml file
my-beautiful-dbt-profile:
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-dev-project
      dataset: dev
      threads: 1
      timeout_seconds: 300
      location: my-location
      priority: interactive
      retries: 1
  target: dev

Like any great solution, this would be incomplete without a CI/CD process to automate our test and deployment process. But before that, let’s extend our dbt profile YAML file to add more environments:

# example dbt profiles.yml file with multiple environments
my-beautiful-dbt-profile:
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-dev-project
      dataset: dev
      threads: 1
      timeout_seconds: 300
      location: my-location
      priority: interactive
      retries: 1
    prod:
      type: bigquery
      method: oauth
      project: my-prod-project
      dataset: prod
      threads: 1
      timeout_seconds: 300
      location: my-location
      priority: interactive
      retries: 1
  target: dev
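With both targets defined, switching environments is just a flag (this assumes the profile above sits where dbt looks for it, in ~/.dbt/profiles.yml):

```shell
dbt run --target dev    # builds the models in my-dev-project
dbt run --target prod   # builds the same models in my-prod-project
```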

We will be using Cloud Build as our CI/CD service. Cloud Build is Google’s fully managed CI/CD service that executes your builds on its infrastructure. Cloud Build can import source code from a variety of repositories or cloud storage spaces, execute a build to your specifications, and produce artifacts such as Docker containers or Java archives.

Here is an example of a Cloud Build YAML file:

steps:
  # copy the dbt profile into place
  - name: 'ubuntu'
    id: 'Adding dbt profile for BigQuery'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        echo Copying ./profiles.yml to ~/.dbt...
        mkdir -p ~/.dbt
        cp ./profiles.yml ~/.dbt/profiles.yml
  # set up python environment and install dbt
  - name: 'python:3.7'
    id: 'Setup Python Environment & Install dbt'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        pip install -r requirements.txt --user
  # DEV - DBT DEBUG
  - name: 'python:3.7'
    id: 'DEV - DBT DEBUG'
    args: ['dbt', 'debug', '--target', 'dev']
    env:
      - 'PATH=${_PYTHON_PATH}'
  # DEV - DBT RUN
  - name: 'python:3.7'
    id: 'DEV - DBT RUN'
    args: ['dbt', 'run', '--target', 'dev']
    env:
      - 'PATH=${_PYTHON_PATH}'

Now by changing the target argument, you can run your beautiful dbt code in any environment. Easy beans :)

Let’s wrap up


At Servian, we have tested dbt with our clients and it has proved to be a great tool for projects that require data transformation. dbt is open source and easy to place under version control, which means large-scale data engineering teams can adopt it and scale very quickly.

There are many other features dbt offers that I haven’t covered in this blog post. To understand dbt’s full potential, I highly recommend you play around with it!
