Optimizing dbt and Google BigQuery

#dbt #BigQuery

Setting up a data warehouse is the first step towards fully utilizing big data analysis, but it is one of many that need to be taken before you can generate value from the data you gather. An important step in that chain of the process is data modeling and transformation. This is where data is extracted, transformed, and loaded (ETL) or extracted, loaded and transformed (ELT). For this article, we’ll use the term ETL synonymously.

ETL as a process is not always as straightforward as it seems. On top of that, there are real challenges to overcome in setting up streamlined data modeling processes, including dealing with relations between data points and simplifying the queries to make the whole set of processes scalable. Even with Google offering a lot of features through its BigQuery, some degree of optimization is still required.

This is where a data build tool or dbt becomes a crucial tool. dbt is a data transformation framework that empowers data engineers through simple SQL commands. Dbt makes it possible for data engineers to develop a workflow, write data transformation rules, and deploy the entire data modeling process.

Raw to Standardized (Transformed)

In simple terms, dbt does most of the hard work of transforming raw data into transformed data. When used in conjunction with BigQuery and Cherre’s standardized framework, dbt allows us to clean data in a (mostly) automated process quickly and succinctly. By combining the two powerful tools with our framework we can create clean, workable datasets for our engineers to customize on behalf of our clients.

Cherre transforms its data from multiple sources through our defined best practice normalization process of the following ETL transformation steps. These required steps play a key role in our company’s data integration strategy as it allows us to gather data from a comprehensive variety of sources and consolidate it into our single, centralized data warehouse location. These steps also make it possible for the many different types of data we process to all work together.

#ETL transformation steps

At Cherre, we load our source data (as close as possible to its original state) into BigQuery to automate much of the above process for us with dbt. Minimal changes occur at the next stage when data moves from source to raw before data is renamed—which turns every label into lower snake case as per our organization-wide naming convention. Relabeling is one of the few manual steps in the process where we edit column x to equal y if necessary. We only make small changes at this point and we don’t start adding types until near the end of the process. This is the lightest clean possible to get it to ‘standardized’ before we go beyond that to connect-on addressing and more intense, custom steps for clients. This light transformation produces a clean, well-structured dataset every time for us to work with easier. 

Dbt handles data consolidation and cleaning, two of the most resource-intensive tasks in complex queries, with ease. Plus, on top of consolidation and cleaning, dbt is also capable of running regression tests at any point in the process.

That last feature is important for a simple reason: you can maintain data quality and integrity by performing tests at the right points. The resulting dataset will already be optimized for BigQuery and even the most complex SQL commands.

Dbt creates a table of modeled data with every pass it runs, so the subsequent queries can be executed faster and without consuming a lot of resources. Even better, the whole process can be defined with a simple .sql file and common SQL commands.

For example, you can create a new table of modeled data by stating {{ config(materialized=’table’) }}, and then defining the queries to run in order to fill that table in a file named modeledData.sql. Subsequent SQL queries can then use modeledData as a source.

Other commands such as ORDER BY and LIMIT also work, and since BigQuery doesn’t need to go through the entire data warehouse to complete the query, each query can be made significantly faster. This is where the power of transformed data lies.

There have been a lot of tests that confirm the performance boost generated by the integration of dbt. A conventional SQL query on a 20GB dataset could take up to 30 seconds to complete, but running the same query through a processed, transformed table cuts that time down to less than one second.

Configuration Options for BigQuery

Setting up dbt and BigQuery is actually very easy to do. Once dbt is installed on your cloud cluster, you can run dbt init to create a project. You get a customizable dbt_project.yml file if you want to fine-tune the project parameters even further.

The next step is integrating dbt tables as datasets in BigQuery, but this too is something that you can do from the BigQuery console. Define a profile that uses a service-account as its method, configure your project name, and you are all set.

You can start choosing datasets as your sources. Further queries are still SQL queries, but you can now use transformed data as your source. At this point, you can return to dbt and configure how BigQuery commands are handled.

For starters, you can set the priority of BigQuery jobs accordingly. Setting the priority correctly allows for better server resource allocation, especially under heavy load. You can go with batch or interactive, with the latter being the on-demand option.

You can also configure the timeout period to adjust for your server resources or specific requirements. By default, the timeout is set to 300ms. You can extend this if you want to process larger datasets—or use more of them—and vice versa.

Of course, you can tell dbt to retry the commands it receives for several times when receiving server errors in return. Ideally, you want to allow dbt to retry the command for three to five times before returning an error. You can even configure the delay between retries.

Let’s not forget that BigQuery also has permission management and dataset locations, both of which are parameters you can configure to further boost performance. This opens up the possibility of setting up distributed dbt instances.

Optimizing Your Queries

Both dbt and BigQuery work really well with each other. The two tools are designed to simplify complex queries and give data engineers the opportunity to develop and deploy advanced queries without assistance. All commands are SQL commands that we already know and love.

The real challenge is improving the queries and optimizing the way dbt works with BigQuery, and that process starts with understanding how to best structure an unstructured dataset. The initial data modeling and transformation process is still crucial.

You can begin with a complete review of input data and data sources based on their sizes. Larger data sources need to be pre-processed to maintain performance. I mean, BigQuery is fast and capable, but that doesn’t mean you should not structure your data correctly.

Computation optimization is the next thing to tackle. Each query requires some amount of server resources, particularly computing power. You can configure your .sql file to be leaner when it comes to CPU use. Another way to get around this issue is through scheduling.

Dbt can run independently from BigQuery too, so you can have the transformed dataset updated at your own pace. Depending on the specific needs of your application or organization, you can easily find the right spot between performance and resource usage.

Lastly, there are some SQL best practices that still need to be maintained when using dbt in conjunction with BigQuery. As mentioned before, the available tools should not make you less diligent about how you design your queries.

Self-joins are best avoided or executed by dbt to avoid hogging BigQuery queries. Avoiding self-joins is better since you don’t risk multiplying the number of output rows. Unbalanced joins must also be avoided, but this is an easier problem to deal with.

Prefiltering rows becomes something you can now do before your datasets reach BigQuery. You can still be very specific with the sources you access, meaning you can use yourTable.yourModel to define a specific source for the query.

GROUP BY is another handy command to use if you want to optimize your queries. You’ll be surprised by how much more flexible your implementation of dbt can be when you actively use data pre-aggregation to avoid cross joins and Cartesian products.

Beyond Subqueries

One thing to note about using dbt to optimize BigQuery jobs is that you should not limit yourself to moving subqueries to dbt. The tool is so much more capable than that, particularly with the latest version. You can, for instance, use variables in the command line.

Hooks are just as handy. Hooks are perfect for triggering specific actions at a certain point in the process. On-run-start lets you check if the required table exists, and then tell dbt to automatically create a table if it doesn’t. On-run-end, on the other hand, is perfect for granting schemas.

Seeds are another powerful component you can use in dbt. Seeds are actually referenced the same way you reference models, but they are no more than a list contained in a CSV file. Seeds are handy for performing certain tasks.

If you need to exclude certain IP addresses or email addresses in your query, for instance, you can create a list of excluded addresses and add the CSV file to your dbt project. You can even reverse the process and use seeds to load exported raw data.

Lastly, we have the native support for Jinja, the templating language supported by dbt. It means you can use expressions, statements, and comments, generate macros, and use references on the fly, all without having to manually rewrite your SQL queries every time changes are made.

There are many ways to optimize data queries and improve the value of your data warehouse but using dbt and BigQuery is certainly among the simpler methods to try. The two work really well with each other and can be optimized for maximum performance.

Cherre is the leader in real estate data and insight. We help companies connect and unify their disparate real estate data so they can make faster, smarter decisions. Contact us today to learn more.

Stefan Thorpe

▻ VP of Engineering @ Cherre ▻ Cloud Solutions Architect ▻ DevOps Evangelist 

Stefan is an IT professional with 20+ years management and hands-on experience providing technical and DevOps solutions to support strategic business objectives.