Speed up dbt workflow with Task


Do you remember the last time you opened your laptop and thought:

"I can't wait to spend half my day writing configs in YAML."?

Yeah, neither do I.

But if you use dbt heavily, I'm afraid you and YAML are in for the long run. Even with the help of dbt-codegen, there's still a lot of manual work involved:

  • Write odd-looking CLI commands like this one:
dbt run-operation generate_source --args '{"schema_name": "jaffle_shop", "database_name": "raw", "table_names": ["table_1", "table_2"]}'
  • Send the output to some file

  • Don't screw up the naming conventions

  • Repeat for each new SQL model

  • ...

If you're anything like me, this gets old fast.

So after writing about 30 lines of dbt YAML configurations, I said, "Enough is enough!" and wrote 400 more lines to automate the process.

How Task made everything better

I wanted something to automate repetitive work that I could set up quickly and without hassle. After some searching, I landed on Task.

Task is a task runner / build tool that aims to be simpler and easier to use than, for example, GNU Make.

In other words, write a sequence of commands in YAML, give them a name, and run them like this:

task formerly-tedious-action

That's exactly what I needed.
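For illustration, here's what a minimal Taskfile looks like (the task name and commands are made up for the example):

```yaml
# Taskfile.yaml — a minimal sketch; task name and commands are illustrative
version: '3'

tasks:
  formerly-tedious-action:
    desc: The commands you used to run by hand
    cmds:
      - echo "generating configs..."
      - echo "done"
```

Running task --list then prints every task alongside its desc, which doubles as lightweight documentation.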

On top of the very fundamental aspect of "it runs stuff", Task comes packed with many other valuable features:

  • Variables with overridable defaults

  • Includes, to compose Taskfiles and scope them to sub-directories

  • Preconditions and status checks, to gate or skip tasks

  • Forwarding extra CLI arguments to a task after --

Plus dozens of other goodies. Honestly, just read the docs.

I don't care about your fancy pants introduction. Show me the code!

📥 You can add this Taskfile.yaml to your dbt project, install Task, and try it there. The only prerequisite is that you're on Linux or macOS: I haven't fully adapted the tasks for PowerShell yet.

Alright, let's see it in action. You can follow along by forking this repository and opening it in Codespaces. It's already configured with everything you need.

The first time will take a while, as it builds the dev container. When it's done, you'll be presented with the browser version of VS Code and an almost empty dbt project.

In the assets/ folder, you can find a DuckDB database pre-filled with some data. You can browse it by running

harlequin assets/demo.duckdb

and it will open a SQL IDE in your terminal that looks like this:

As you can see, there are a couple of different databases (mixpanel and shopify). These represent your raw sources, with some tables each.

Let's get started with actual dbt development. Close the SQL IDE and run task in your terminal. This runs the default task, which is configured to show this output:
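A common way to get a help screen like that is to make the default task list everything else. A sketch (the repository's actual default task may differ):

```yaml
tasks:
  default:
    # Listing tasks is a handy default; `silent` hides the command echo
    silent: true
    cmds:
      - task --list
```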

Usually, when starting a new dbt project, the first step is to define your sources. Using dbt-codegen directly, you'd write something like this:

dbt --quiet run-operation generate_source --args '{"schema_name": "mixpanel", "generate_columns": true}' > models/bronze/mixpanel/_mixpanel__sources.yml

With the tasks provided, you can instead just write

task dbt:gsy SCHEMA=mixpanel

that will output

You can see that the task automatically determines where to place the file and enforces a naming convention.

The convention is entirely arbitrary. If you don't want your lowest layer to be called bronze, just edit the BASE_LAYER variable in dbt/Taskfile.yaml.
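Under the hood, that naming logic boils down to Task variables. Here's a simplified sketch of what dbt/Taskfile.yaml might contain — the gsy body and the output path are assumptions for illustration, not the repo's exact code:

```yaml
version: '3'

vars:
  BASE_LAYER: bronze  # change this to rename your lowest layer

tasks:
  gsy:
    desc: Generate source YAML for a schema
    cmds:
      # Templated variables decide both the command args and where the file lands
      - >
        dbt --quiet run-operation generate_source
        --args '{"schema_name": "{{.SCHEMA}}", "generate_columns": true}'
        > models/{{.BASE_LAYER}}/{{.SCHEMA}}/_{{.SCHEMA}}__sources.yml
```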

These commands still run dbt-codegen in the background, but Task's features make it possible to build automations with relatively little effort and (hopefully) minimal bash scripting.

Let's see a couple more examples.

task dbt:gms SOURCE=mixpanel -- events users

will create the base models

  • bronze_mixpanel__events.sql and

  • bronze_mixpanel__users.sql

Now, you need to create their respective YAML files. Let's also say that you already created a silver model, for instance user_events.sql. Your models/ folder should look something like this

Just run the following

task dbt:gmy -- __events users user_events

(Only the model suffixes are needed, but they must be unambiguous: events alone would also match user_events, so we prepend __ to disambiguate.)

And here's the output

(Note: this last command runs the models first, as a precondition, if they haven't been run before. This ensures that codegen picks up the columns correctly.)
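In Task terms, that behavior amounts to running the model before generating its YAML. A hypothetical sketch — the real task is more involved and only re-runs models when needed:

```yaml
tasks:
  gmy:
    desc: Generate model YAML (runs the model first so codegen can see its columns)
    cmds:
      # Materialize the model so generate_model_yaml can introspect it
      - dbt run --select {{.MODEL}}
      - >
        dbt --quiet run-operation generate_model_yaml
        --args '{"model_name": "{{.MODEL}}"}'
        > models/{{.MODEL}}.yml
```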

Conclusion

These were just examples of what can be achieved with Task, and perhaps an over-the-top one. Still, the project exercises many of Task's features and should provide a decent starting point for anyone wanting to try it.

In the context of dbt workflows, much more could be done: for instance, a task (or group of tasks) that ensures all column and table metadata is properly managed and centralized (taking advantage of Jinja docs blocks and the yq utility).

Extra notes

You've surely noticed that dbt-related tasks are invoked like this: task dbt:.... We invoke them from the project root, but the Taskfile lives in the dbt/ sub-directory. Task lets you include Taskfiles in others and specify the directory they run in, like so:

includes:
  dbt:
    taskfile: ./dbt
    dir: ./dbt

This way, you can easily scope and invoke your tasks from a single centralized point. It's also possible to have a global Taskfile that can be used from anywhere!
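For example, a Taskfile placed at $HOME/Taskfile.yml becomes available from any directory via task -g (the task below is made up for illustration):

```yaml
# ~/Taskfile.yml — run from anywhere with `task -g hello`
version: '3'

tasks:
  hello:
    cmds:
      - echo "available everywhere"
```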