Clear the clutter! Find + drop undead dbt-created tables

Aly K · Published in Red Pill Analytics · Jan 2, 2024


Photo of a cluttered garage by todd kent on Unsplash

Clutter: what is it? Where did it come from?

By “db clutter” I mean unnecessary, unused objects living in your database.

There’s gonna be some word clutter here. In this post I’m going to talk about:

  • how unused-object clutter gets into your db and why it’s bad
  • a community macro solution
  • some modifications I made to the macro

If you want to skip all that for some code to drop the clutter, jump ahead to The Final Macro.

A negative I’ve heard about dbt is that it’s “too easy” to create objects in the database. When you rename a model, delete it, or move it to a new schema, dbt does not automatically drop the old database object → clutter.

Sure, you can drop your personal development schemas and recreate them clean, but that can be hard to do in a production environment.

At my last client, if you renamed or deleted a model you had to write an additional SQL DDL script to drop the old object, get that through the PR process, and then the person promoting your code would need to remember to run your special DDL script in each environment. Any one of these steps was easily forgotten, resulting in one more unused object cluttering the database.

This client had some additional clutter because they create backup tables when they do manual data fixes, but have no standard on where to create them or how long they should live. This left a pile of tables throughout the database named <tablename>_<randomlookingdate> from manual data manipulations that occurred months and years ago.

Storage is cheap, but it’s still a waste of resources to keep these things around, and it makes your project more confusing:

“Is the orders model creating the table DBT_CORE_JAFFLE.orders or DBT_MARKET_JAFFLE.orders?”

“dbt has a model, customer_actv; where did the tables cust_active and customer_activate come from?”

A more confusing project leads to more time spent troubleshooting issues and onboarding new employees.

Looking for an answer

Searching the internet for advice, I came across this Community Discussion. The macro listed as “Option 3??!!” looked especially appealing.

We ♥️ Snowflake here, so that’s what I’ll be working with in this post. (Edit: OK, I did some work on BigQuery as well.)

Macro overview

M-credera’s suggested Snowflake code basically queries the INFORMATION_SCHEMA for a list of tables and views and compares that to what’s in dbt via the graph context variable. The result is a list of objects found in the database that aren’t part of dbt. It then loops through the results and drops each object.

Macro explanation step by step:

A. Validation:
Ensures the inputted schema argument contains an iterable object.

B. Call statement get_outdated_tables:
1. The call statement selects 3 columns from a unioned subquery named ‘c’ which is left joined to a bunch of graph node jinja called ‘desired’ where the desired ref_name is null.
2. The ‘c’ subquery selects all the tables from the INFORMATION_SCHEMA for the inputted schemas, unioned with a select of all the views from the INFORMATION_SCHEMA for the inputted schemas. (Sidebar: Snowflake’s information_schema.tables has both tables and views, so you could modify this to remove the union; see the simplified sketch after this walkthrough.)
3. The ‘desired’ jinja is using the dbt graph nodes to collect all the models and seeds and build a values list of (<schema>,<dbt model or seed name>) resulting in code that looks like:

(values
  (UPPER('DBT_CORE_JAFFLE'), UPPER('stg_orders')),
  (UPPER('DBT_CORE_JAFFLE'), UPPER('stg_customers')),
  (UPPER('DBT_CORE_STRIPE'), UPPER('stg_payments'))
) as desired

4. The where desired.ref_name is null finds the rows where the ‘c’ (schema, ref_name<table>) values selected from the database don’t match a ‘desired’ (schema, ref_name<model>) pair selected from dbt’s graph context variable.

C. get_outdated_tables results:
The call statement results are accessible via the load_result function. Loop over each value (to_delete) in the results:
1. Create a fully qualified name variable using dbt’s target db variable and values from to_delete.
set fqn = target.database + '.' + to_delete[0] + '.' + to_delete[1]

2. Raise an exception if something is in the Bronze schema. Nothing in Bronze should be dropped.
3. Create a new call statement to perform DDL statements to drop each object found in the load_result.
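To make the shape of that query concrete, here’s a simplified, hypothetical sketch of the anti-join for a single schema, using the jaffle example names from above and folding tables and views into one pass over information_schema.tables (per the sidebar in step B.2). The real macro builds the values list from the graph variable instead of hard-coding it:

select c.schema_name, c.ref_name, c.ref_type
from (
    select table_schema as schema_name,
           table_name as ref_name,
           iff(table_type = 'VIEW', 'view', 'table') as ref_type
    from information_schema.tables   -- Snowflake: holds both tables and views
    where table_schema = UPPER('DBT_CORE_JAFFLE')
) as c
left join (values   -- what dbt thinks should exist; normally built from the graph
    (UPPER('DBT_CORE_JAFFLE'), UPPER('stg_orders')),
    (UPPER('DBT_CORE_JAFFLE'), UPPER('stg_customers'))
) as desired (schema_name, ref_name)
    on desired.schema_name = c.schema_name
    and desired.ref_name = c.ref_name
where desired.ref_name is null;   -- anything unmatched is a drop candidate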

Additional desires

I wanted a bit more functionality:

  1. A dry_run argument. Another user, deegeebee, suggested including an input which would allow the macro to act as a report, outputting everything without executing the drop statements.
  2. Remove the bronze exception. We don’t use a medallion architecture and we have no specific schema that should be safe from this macro. However, the client DOES have tables that shouldn’t be dropped scattered through different schemas. I’ll need to come up with a way to keep these safe.
  3. An optional “stale_days” argument. This would allow you to run it and just drop everything that hasn’t been touched (altered) in X days.
  4. A bit more information in the output about each dropped or to-be-dropped object. Looking at a list of table names isn’t all that helpful, so I want to include the last altered date and row count.
  5. A cloud job to run the macro on a schedule.

Modifications for 1 & 2: dry_run arg & remove Bronze exception

Note: Your line numbers won’t exactly align with mine because I added comments and had a minor fight with the dbt graph object resulting in small changes to the suggested code before starting down the _additional functionality_ rabbit hole. The complete macro is posted at the end.

(Screenshot from VS Code highlighting the specific line changes, since Medium doesn’t have a great way to show this.)
  • Green Box: I added a new argument and defaulted it to true since the safest way to run this is where nothing will be dropped.
  • Yellow strikethrough: removed exception for Bronze objects.
  • Pink Box: reworked the loop so that if dry_run is false, the macro calls the statement to drop the objects; if dry_run is true, it just writes the drop statement to the log, plus an additional 🥺 log statement.
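Since the screenshot doesn’t translate well to text, here’s a minimal sketch of the reworked loop (the full version, with the extra logging columns, is in the final macro at the end of the post):

{%- for to_delete in load_result('get_outdated_tables')['data'] %}
    {% set fqn = target.database + '.' + to_delete[0] + '.' + to_delete[1] %}
    {% if dry_run == false %}
        {# really drop it #}
        {% call statement() -%}
            drop {{ to_delete[2] }} if exists {{ fqn }};
        {%- endcall %}
    {% else %}
        {# dry run: only log what would be dropped #}
        {% do log('drop ' ~ to_delete[2] ~ ' if exists ' ~ fqn ~ ';', info=true) %}
    {% endif %}
{%- endfor %}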

Additional work for 2: do-not-drop list

So the client has mixed some manually created tables into their dbt-managed schemas. If you were building a project from scratch I would suggest not doing this, but they’re already there and have been for years, so I need to keep them safe from this macro.

In order to be able to use a dbt ref() in the macro, I created a seed to list the ⛔ do not drop ⛔ objects.

The seed is a simple csv of 2 columns. It can list as many tables as necessary.

table_schema,table_name
EXAMPLE_SCHEMA,DONOTDROPEXAMPLE_TABLE_NAME
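One note: because the macro references the seed with ref(), the seed table has to exist in the target database before the macro runs, so build it first:

dbt seed --select donot_drop_list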

Then in the 2 select from INFORMATION_SCHEMA where clauses I added a check that the table is not in the seed data:
and (table_schema, table_name) not in (select table_schema, table_name from {{ref('donot_drop_list')}})

Modification 3: stale_days arg

Added the new stale_days argument and rather arbitrarily chose 60 as the default value. There are 2 queries of the information_schema that need an additional where clause to limit the results to things last altered before the current date minus the inputted stale_days.

(Another VS Code screenshot; don’t worry, the full code is later in the post.)
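In text form, the filter added to both information_schema queries is (you’ll see the same line in the final macro):

where last_altered < DATEADD('days', -{{ stale_days }}, CURRENT_TIMESTAMP)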

Modification 4: More info in the output

I wanted a last_altered date and row count (for tables) to be part of the output, so I added the new columns to the necessary select statements.

(Last VS Code screenshot of the code.)

The main select also includes a date_trunc to the minute because I don’t need infinitesimal units of time displayed.
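So the outer select ends up looking like this (the same lines appear in the final macro below):

select c.schema_name,
       c.ref_name,
       c.ref_type,
       date_trunc('minute', c.last_altered),
       c.row_count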

I then added that information to the log statement as part of the for loop at the end:
-- last_altered:'~ to_delete[3].strftime("%d-%b-%Y %H:%M") ~'; row_count:'~ to_delete[4]

The Final Macro

{% macro drop_orphaned_tables(schema, stale_days = 60, dry_run = true) %}

  {% if execute %}
    {% if (schema is not string and schema is not iterable) or schema is mapping or schema|length <= 0 %}
      {% do exceptions.raise_compiler_error('"schema" must be a string or a list') %}
    {% endif %}
    {% if schema is string %}
      {% set schema = [schema] %}
    {% endif %}

    {% call statement('get_outdated_tables', fetch_result=True) %}
      select c.schema_name,
             c.ref_name,
             c.ref_type,
             date_trunc('minute', c.last_altered),
             c.row_count
      from (
            select table_schema as schema_name,
                   table_name as ref_name,
                   'table' as ref_type,
                   last_altered,
                   to_char(row_count) as row_count
            from information_schema.tables
            where last_altered < DATEADD('days', -{{stale_days}}, CURRENT_TIMESTAMP)
              and table_type != 'VIEW'
              and (table_schema, table_name) not in (select table_schema, table_name from {{ref('donot_drop_list')}})
              and table_schema in (
                {%- for s in schema -%}
                  UPPER('{{ s }}'){% if not loop.last %}, {% endif %}
                {%- endfor -%}
              )
            union all
            select table_schema as schema_name,
                   table_name as ref_name,
                   'view' as ref_type,
                   last_altered,
                   'NA' as row_count
            from information_schema.views
            where last_altered < DATEADD('days', -{{stale_days}}, CURRENT_TIMESTAMP)
              and (table_schema, table_name) not in (select table_schema, table_name from {{ref('donot_drop_list')}})
              and table_schema in (
                {%- for s in schema -%}
                  UPPER('{{ s }}'){% if not loop.last %},{% endif %}
                {%- endfor -%}
              )
           ) as c
      left join (values
        {%- for node in graph['nodes'].values() | selectattr("resource_type", "equalto", "model") | list
              + graph['nodes'].values() | selectattr("resource_type", "equalto", "seed") | list
              + graph['nodes'].values() | selectattr("resource_type", "equalto", "snapshot") | list %}
          {%- if (node.schema in schema) and node.alias and node.alias != None %}
            (UPPER('{{ node.schema }}'), UPPER('{{node.alias}}')),
          {%- endif -%}
          {%- if loop.last %} {# the last entry gets no trailing comma; if the last node's alias is None or it lives in a different schema, this extra row does no harm, and it avoids a compile error from a dangling "," #}
            (UPPER('{{ node.schema }}'), UPPER('{{node.alias}}'))
          {%- endif %}
        {%- endfor %}
      ) as desired (schema_name, ref_name) on desired.schema_name = c.schema_name
                                          and desired.ref_name = c.ref_name
      where desired.ref_name is null
    {% endcall %}

    {%- for to_delete in load_result('get_outdated_tables')['data'] %}
      {% set fqn = target.database + '.' + to_delete[0] + '.' + to_delete[1] %}
      {% if dry_run == false %}
        {% call statement() -%}
          {% do log('dropping ' ~ to_delete[2] ~ ': ' ~ fqn, info=true) %}
          drop {{ to_delete[2] }} if exists {{ fqn }};
        {%- endcall %}
      {% elif dry_run == true %}
        {% do log('drop '~to_delete[2] ~ ' if exists ' ~ fqn ~'; -- last_altered:'~ to_delete[3].strftime("%d-%b-%Y %H:%M") ~'; row_count:'~ to_delete[4], info=true) %}
      {% endif %}
      {%- if loop.last and dry_run == true %}
        {% do log('Please drop the above objects or move them to the proper backup schema.\n', info=true) %}
      {% endif %}
    {%- endfor %}

    {%- set response = load_result('get_outdated_tables')['response'] %}
    {% do log('Query Status + # of results found: '~ response, info=true) %}

  {% endif %}
{% endmacro %}

Other code changes:

  • Added ‘snapshot’ resource_type to the graph node loop within the get_outdated_tables call statement query.
  • Added check for ‘None’ in the graph context variable loop — don’t need to add extra entries to the list when there’s no model name to check against.
  • A log entry to display the total count of objects found at the very end. If there’s tons of results, it can be handy to just see a total count at the bottom.
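If you want to kick the tires locally before setting up any jobs, a one-schema dry run with the defaults (stale_days = 60, dry_run = true) is just this; the schema name here is the jaffle-style example from earlier:

dbt run-operation drop_orphaned_tables --args '{schema: DBT_CORE_JAFFLE}'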

Cloud Job(s)

You can create a simple cloud job to run this as a report (dry_run = true) for objects older than the 60-day default with this command:

dbt run-operation drop_orphaned_tables --args '{schema: [schema1,schema2,…]}'

Here’s a more complex business scenario: say you have a project with different owners for different areas; you can set up a cloud job for each.

Requirements:
The owner of the CORE schemas has implemented strict standards, and all their developers know the CORE schemas are maintained by dbt with no other tables allowed. They want everything not associated with a model, seed or snapshot dropped once a month.

The owner of the MARKET schemas is trying to implement development + support standards, but it’s still a WIP. They’d like to see a report (dry_run = true) every Friday.

Here’s what those 2 jobs would look like:

dbt cloud job: CORE_cleanup

  1. dbt command:
    dbt run-operation drop_orphaned_tables --args '{schema: [DBT_CORE_JAFFLE,DBT_CORE_STRIPE,DBT_CORE_XFORM, DBT_CORE_REPORTA, DBT_CORE_REPORTB], stale_days: 0, dry_run: false}'
  2. Run on schedule: True
  3. Timing: Cron schedule: 0 0 1 * *

dbt cloud job: MARKET_cleanup_report

  1. dbt command:
    dbt run-operation drop_orphaned_tables --args '{schema: [DBT_MARKET_JAFFLE,DBT_MARKET_XFRMA, DBT_MARKET_XFRMB,DBT_MARKET_REPORTING], stale_days: 30, dry_run: true}'
  2. Run on schedule: True
  3. Timing: Hours of the day. Run at: 0
    Check only the Friday checkbox.
    OR
    Timing: Cron schedule: 0 0 * * Fri

You can set up notifications within dbt to ensure the proper people/channels are notified for each event.

Sample, hypothetical output for CORE team:

23:02:54 Registered adapter: snowflake=1.5.6
23:02:54 Found 165 models, 229 tests, 10 snapshots, 0 analyses, 948 macros, 0 operations, 2 seed files, 5 sources, 3 exposures, 0 metrics, 0 groups
23:02:58 Query Status + # of results found: SUCCESS 0

Standards followed, no objects found to be dropped.
The CORE owner has no further responsibility; anything that slips by will be dropped on the 1st of every month.

Sample, hypothetical output for the MARKET team:

23:02:54 Registered adapter: snowflake=1.5.6
23:02:54 Found 165 models, 229 tests, 10 snapshots, 0 analyses, 948 macros, 0 operations, 2 seed files, 5 sources, 3 exposures, 0 metrics, 0 groups
23:30:57 drop table if exists PROJECT_DB.DBT_MARKET_XFRMA.MOCK_ORDERS; -- last_altered:30-Oct-2023 11:11; row_count:4
23:30:57 drop table if exists PROJECT_DB.DBT_MARKET_JAFFLE.EXAMPLE_DIM_CUST; -- last_altered:30-Oct-2023 10:47; row_count:100
23:30:57 drop table if exists PROJECT_DB.DBT_MARKET_JAFFLE.EMPLOYEES; -- last_altered:30-Oct-2023 07:03; row_count:6
23:30:57 drop table if exists PROJECT_DB.DBT_MARKET_REPORTING.CUST_ACTIVE; -- last_altered:30-Oct-2022 10:47; row_count:59399
23:30:57 drop table if exists PROJECT_DB.DBT_MARKET_REPORTING.CUST_ACTIVATE; -- last_altered:19-Dec-2021 14:09; row_count:58789
23:30:57 drop table if exists PROJECT_DB.DBT_MARKET_REPORTING.STATS_PYTHON; -- last_altered:06-Nov-2023 14:00; row_count:124954
23:30:57 drop view if exists PROJECT_DB.DBT_MARKET_JAFFLE.SNOWFLAKE_WAREHOUSE_METERING; -- last_altered:19-Sep-2023 13:21; row_count:NA
23:30:57 drop table if exists PROJECT_DB.DBT_MARKET_JAFFLE.JFLC_EVENTS; -- last_altered:06-Nov-2019 16:12; row_count:0
23:30:57 drop view if exists PROJECT_DB.DBT_MARKET_XFRMA.CONTRACT_RATES; -- last_altered:30-May-2022 07:22; row_count:NA
23:30:57 drop table if exists PROJECT_DB.DBT_MARKET_JAFFLE.SUBS_MAP; -- last_altered:27-Oct-2023 16:35; row_count:7
23:30:57 drop table if exists PROJECT_DB.DBT_MARKET_JAFFLE.SNOWFLAKE_QUERY_HISTORY; -- last_altered:16-Apr-2023 11:00; row_count:901120
23:30:57 Please drop the above objects or move them to the proper backup schema.
23:30:57 Query Status + # of results found: SUCCESS 11

The MARKET owner has some additional work:

  1. Drop what can be dropped. They can 🍒 pick drop statements from the report and manually run them for individual tables, OR they can create an ad hoc run-operation command, updating the schema list and stale_days arguments to bring back only objects determined droppable, and then run it with dry_run = false (see the example command after this list).
  2. Work on enforcing a standard where non-dbt-managed objects are only created in “non-dbt” schemas. They can update the donot_drop_list seed for objects that should never be dropped but for whatever reason can’t be moved.
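For the ad hoc option in item 1, the command might look something like this (the schema and stale_days value are made up for illustration):

dbt run-operation drop_orphaned_tables --args '{schema: [DBT_MARKET_REPORTING], stale_days: 365, dry_run: false}'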

Closing thoughts

1. DB clutter is bad and easy to come by, but it can be conquered

Could it be conquered in other ways? Of course! If your project is small, you can probably 👀 the database object list, know what’s clutter, and delete it. If your project strictly follows standards, maybe you have nothing left behind (must be nice! ;) ).

But if you have leftover objects living un-dead in your database, this macro gives you the flexibility to handle them in many different ways.

We’ve built 2 examples in this post:

  • Scheduled (cloud job), automatic, immediate drop (dry_run: False)
  • Scheduled, automatic, aged-out report (stale_days: 30, dry_run: True)

With arguments for schemas, stale_days and dry_run, you can customize the macro to fit your project needs.

2. The dbt graph context variable can be challenging to work with, but very useful

This was the most challenging part of modifying this macro. There was a definite lack of documentation in the dbt docs:

(Screenshot from the dbt docs site saying there will be more information available in the future.)

I can’t wait for the future. 😂

3. The dbt community is a great place to find helpful solutions

Are you part of the Slack convo? Do you follow the dbt Discourse? The dbt community is huge and helpful!

Is your organization adapting to remain competitive? In the Architect Room, we design a roadmap for the future of your digital organization, while still capitalizing on current technology investments. Our knowledge spans a variety of offerings across all of the major public cloud providers. Visit Red Pill Analytics to learn more.
