Configuring the DuckDB Spatial Extension in dbt Projects

This page shows how to make the DuckDB spatial extension load reliably on every dbt run: declare it in profiles.yml, load it explicitly with on-run-start hooks, and cap memory so spatial joins survive incremental builds without running out of memory.

When to use this approach

Reach for in-process DuckDB spatial configuration when these conditions hold; otherwise an alternative engine is the better fit:

You want a zero-server spatial engine for CI or local builds. DuckDB runs in-process with no daemon, which makes it a good lightweight validator before you promote models to a heavier engine. If your serving tier needs a persistent multi-user database instead, follow setting up PostGIS with dbt.
Your geometry volumes fit a single host’s memory budget. DuckDB excels at GeoParquet-backed analytics on one machine. For warehouse-scale, multi-tenant geometry feeds, weigh the trade-offs in choosing the right spatial adapter.
You need spatial transforms wired into the transformation graph, not a separate ETL job. Materializing spatial models as DuckDB tables lets dbt own lineage end to end — see how that ripples through the DAG in spatial model dependency graphs.

Prerequisites

Confirm the following before configuring anything:

dbt Core 1.7+ with the dbt-duckdb adapter 1.7.0 or newer (pip install "dbt-duckdb>=1.7"). The adapter’s native extension management is what makes the extensions: key below work.
DuckDB 0.10+ (installed as a dependency of the adapter). The spatial extension is published in DuckDB's core extension repository at extensions.duckdb.org.
Outbound network access on the first build to fetch the .duckdb_extension binary, or a pre-seeded extension directory for air-gapped / CI runners.
Environment variables exported for the database path so nothing is hard-coded: DBT_DUCKDB_PATH for dev and DBT_DUCKDB_PATH_PROD for production targets.
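
A preflight check can surface a missing path variable before dbt starts. The following is a minimal sketch, not part of dbt: the function name missing_env_vars is hypothetical, and the only names taken from this page are the two environment variables.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    # Accept an explicit mapping so the check can be exercised without
    # touching the real process environment.
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]
```

Calling missing_env_vars(["DBT_DUCKDB_PATH", "DBT_DUCKDB_PATH_PROD"]) in a CI entry script, and exiting when the returned list is non-empty, fails the job early with a readable message instead of a connection error.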

Step-by-step instructions

1. Declare the extension in profiles.yml

Tell the adapter to register spatial when it opens the connection, and pin explicit resource limits so builds behave the same on every machine.

# profiles.yml
your_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: "{{ env_var('DBT_DUCKDB_PATH', 'data/warehouse.duckdb') }}"
      threads: 4
      extensions:
        - spatial
      settings:
        memory_limit: "4GB"
        threads: 4
        enable_progress_bar: false

Verify the profile parses before running any models:

dbt debug --target dev
# Expect: "Connection test: [OK connection ok]"

2. Force the extension active with on-run-start hooks

The extensions: list asks the adapter to install and load spatial when it opens the connection, but an ST_ function call still fails in any session where the extension was never loaded. Add idempotent hooks in dbt_project.yml so every invocation loads it explicitly, whatever the profile declares.

# dbt_project.yml
on-run-start:
  - "INSTALL spatial;"
  - "LOAD spatial;"
  - "PRAGMA enable_optimizer;"

INSTALL spatial is a no-op once the binary is cached; LOAD spatial activates it for the current session. In a session where the extension is not loaded, a model that calls an ST_ function can fail with an unknown-function error when its SQL runs.

3. Confirm the extension is loaded

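Run a one-line probe before trusting a full build. The loaded flag is per session, so the probe issues its own LOAD and then checks that the extension is both installed and loadable from this environment. Use a duckdb CLI of the same version as the adapter's DuckDB library, because installed extensions are stored per DuckDB version.

duckdb "${DBT_DUCKDB_PATH:-data/warehouse.duckdb}" -c "LOAD spatial; SELECT extension_name, loaded, installed
  FROM duckdb_extensions() WHERE extension_name = 'spatial';"
# Expect one row: spatial | true | true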
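
The same check can run from Python, for example in a CI smoke test. This is a minimal sketch under stated assumptions: ensure_spatial is a hypothetical helper, and conn is any object with execute() and fetchone() in the style of the duckdb Python package's connection.

```python
PROBE_SQL = (
    "SELECT installed, loaded FROM duckdb_extensions() "
    "WHERE extension_name = 'spatial'"
)


def ensure_spatial(conn):
    """Install and load spatial on `conn`, then return (installed, loaded)."""
    # INSTALL is a no-op when the binary is already cached.
    conn.execute("INSTALL spatial")
    # LOAD applies only to the session behind `conn`.
    conn.execute("LOAD spatial")
    row = conn.execute(PROBE_SQL).fetchone()
    if row is None:
        # The extension is not listed at all in this build of DuckDB.
        return (False, False)
    return (bool(row[0]), bool(row[1]))
```

With the duckdb package installed, ensure_spatial(duckdb.connect(path)) should return (True, True) on a correctly provisioned runner; anything else is a reason to stop before dbt run.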

4. Stage geometries as GeoParquet, not raw JSON

Performance bottlenecks in DuckDB spatial models usually come from serialization, not transform logic. DuckDB reads and writes GeoParquet natively, preserving spatial metadata and pushing spatial filters down into the scan. Stage external data as Well-Known Binary (WKB) or GeoParquet, then apply spatial functions in the model. The staging model below parses GeoJSON once and emits WKB for everything downstream:

{{ config(materialized='table') }}

WITH parsed AS (
    SELECT
        parcel_id,
        ST_GeomFromGeoJSON(geom_json) AS geometry
    FROM {{ source('raw', 'land_parcels') }}
)
SELECT
    parcel_id,
    ST_AsWKB(geometry)  AS geometry_wkb,
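    -- Area is in the units of the input CRS (square degrees for raw lon/lat).
    -- Reproject with ST_Transform() first if you need square metres.
    ST_Area(geometry)   AS area_crs_units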
FROM parsed
WHERE ST_IsValid(geometry)

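Predicate pushdown on GeoParquet can cut scan I/O substantially (on the order of 60–80% versus row-based text formats, depending on the workload) when you filter by bounding box before aggregating. Confirm the model built and that no null geometries slipped through. Because the WHERE clause drops invalid rows, also compare row_count against the source to see how many rows were filtered out:

SELECT count(*) AS row_count, count(*) FILTER (WHERE geometry_wkb IS NULL) AS null_geom
FROM {{ ref('stg_land_parcels') }};
-- Expect null_geom = 0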

5. Cap memory and threads on production targets

Spatial joins build an in-memory R-tree, so large intersections or nearest-neighbour passes are the usual OOM culprits. For datasets above ~10 million geometries, lower threads to reduce contention and give the index room, and bound spill-to-disk so it does not saturate the host filesystem.

# profiles.yml — production output
prod:
  type: duckdb
  path: "{{ env_var('DBT_DUCKDB_PATH_PROD') }}"
  threads: 2
  settings:
    memory_limit: "8GB"
    max_temp_directory_size: "2GB"

Isolate heavy spatial joins into their own materialized: 'table' models rather than ephemeral or view models, so DuckDB persists intermediate results across DAG steps instead of recomputing them in every downstream query.

Configuration reference

Parameter	Where	Accepted values	Default	Spatial-specific note
`extensions`	`profiles.yml` output	list incl. `spatial`	none	Installs and loads at connection init; keep the `LOAD` hook as an explicit guard
`memory_limit`	`settings`	size string, e.g. `8GB`	80% of RAM	R-tree construction for spatial joins is memory-bound; raise before lowering threads
`threads`	output / `settings`	integer	cores	Drop to `2` for large `ST_Intersects` joins to avoid thread contention
`max_temp_directory_size`	`settings`	size string	unlimited	Bounds spill-to-disk so large joins do not fill the host volume
`enable_optimizer`	`on-run-start` PRAGMA	`PRAGMA enable_optimizer` / `PRAGMA disable_optimizer`	enabled	Keep on so spatial predicates are pushed into the scan
`path`	output	file path via `env_var()`	n/a	Use a separate file per concurrent run to avoid write locks

Gotchas & edge cases

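LOAD is per session. Extensions are installed once per environment but loaded per session, so any session that never runs LOAD spatial (for example one opened outside the adapter) fails only when something calls an ST_ function, which can be deep into a build. Keep the LOAD spatial hook so every dbt invocation loads the extension explicitly.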
Air-gapped runners have no registry. INSTALL spatial needs outbound access to extensions.duckdb.org on the first run. Pre-bundle the .duckdb_extension file into the Docker image or CI cache; otherwise INSTALL fails because the binary cannot be downloaded.
DuckDB geometry is planar by default. It assumes Cartesian coordinates unless you reproject, so distance and area on raw lon/lat are distorted. Standardize the coordinate reference system with ST_Transform() at the staging layer before any measurement.
Concurrent writers cannot share a single file. When two dbt invocations materialize into the same .duckdb file, the second process cannot acquire the write lock and fails. Use materialized: 'table' for spatial staging layers and a separate database file per run.
ST_Union on mixed input is unreliable. Corrupted WKB can raise an invalid-geometry error, and mixed-CRS inputs give wrong results even when no error is raised. Gate aggregations with ST_IsValid() and normalize CRS first.
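
The separate-file-per-run advice above can be sketched as a small path helper. This is illustrative only: per_run_db_path and the run identifier format are hypothetical, and the base path is the fallback used in the dev profile.

```python
from pathlib import PurePosixPath


def per_run_db_path(base_path, run_id):
    """Derive a run-specific database path next to `base_path`."""
    p = PurePosixPath(base_path)
    # Insert the run id between the file stem and its suffix so each
    # concurrent run writes to its own .duckdb file.
    return str(p.with_name(f"{p.stem}_{run_id}{p.suffix}"))
```

A CI wrapper could export the result as DBT_DUCKDB_PATH before invoking dbt, so the env_var() lookup in profiles.yml resolves to a file that no other run is writing to.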

Frequently asked questions

Why do my spatial functions fail even though extensions: [spatial] is set?

The extension is most likely not loaded in the session that ran the query. INSTALL only places the binary on disk; LOAD makes its functions callable in the current connection. Add LOAD spatial; to your on-run-start hooks in dbt_project.yml so every invocation loads it explicitly.

How do I configure the spatial extension for an air-gapped CI runner?

Pre-fetch the .duckdb_extension binary and seed it into DuckDB’s extension directory inside the image, then keep INSTALL spatial; (a no-op against the cached copy) and LOAD spatial; in your hooks. The duckdb_extensions() probe in step 3 confirms installed = true without any network call.
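
Pre-seeding comes down to knowing where the binary is downloaded from and where DuckDB looks for it. The sketch below builds both locations under stated assumptions: the URL pattern and the ~/.duckdb/extensions/<version>/<platform>/ layout follow DuckDB's documented defaults at the time of writing and should be verified for your release, the function name is hypothetical, and the version and platform arguments are placeholders to replace with the output of PRAGMA version and PRAGMA platform on the target runner.

```python
from pathlib import PurePosixPath

EXTENSION_FILE = "spatial.duckdb_extension"


def spatial_extension_locations(duckdb_version, platform, home="~"):
    """Return (download_url, cache_path) for the spatial extension binary.

    Assumes the default repository and default extension directory;
    both can be overridden in DuckDB, in which case these paths differ.
    """
    tag = f"v{duckdb_version}"
    # The repository serves a gzip-compressed file; decompress it before
    # placing it at cache_path.
    url = f"http://extensions.duckdb.org/{tag}/{platform}/{EXTENSION_FILE}.gz"
    cache = PurePosixPath(home) / ".duckdb" / "extensions" / tag / platform / EXTENSION_FILE
    return url, str(cache)
```

Fetching the URL on a connected build machine and copying the decompressed file to the cache path inside the image lets INSTALL spatial; resolve locally on the air-gapped runner.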

Why does a spatial join crash with "Out of memory"?

DuckDB builds the spatial R-tree in memory. Raise memory_limit, lower threads to 2 to cut contention, set max_temp_directory_size so the join can spill, and stage inputs as GeoParquet so predicate pushdown shrinks the working set before the join.

Should I keep geometries as WKB or GeoJSON through the pipeline?

WKB or GeoParquet throughout. Re-parsing GeoJSON with ST_GeomFromGeoJSON in every model inflates payloads and defeats the scan-phase filter pushdown that makes DuckDB fast. Parse JSON once at staging, then carry WKB downstream.

Why does ST_Distance return surprising values on lon/lat data?

DuckDB treats geometry as planar unless reprojected, so it measures in degrees, not metres. Apply ST_Transform() to a metric projection at the staging layer, or measure on already-projected coordinates, before any distance or area calculation.
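
A small worked example shows how far planar degrees can drift from ground distance. It is plain Python rather than DuckDB, and it uses the haversine formula on a spherical Earth purely as an illustrative stand-in for a proper reprojection; the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius, spherical approximation


def planar_degrees(lon1, lat1, lon2, lat2):
    """Euclidean distance on raw lon/lat, i.e. what a planar engine computes."""
    return math.hypot(lon2 - lon1, lat2 - lat1)


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
```

One degree of longitude gives a planar distance of 1.0 both at the equator and at 60° north, while the great-circle distance is roughly 111 km in the first case and roughly 56 km in the second. That gap is why measurements belong after reprojection to a metric CRS.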

Related guides

DuckDB Spatial Extension Integration: the parent guide to lifecycle management and advanced spatial indexing in DuckDB.
Choosing the Right Spatial Adapter: when DuckDB beats PostGIS or warehouse-native GIS for your workload.
Setting Up PostGIS with dbt: the persistent-server counterpart for production serving tiers.
Spatial Model Dependency Graphs: isolating heavy spatial joins into materialized tables across the DAG.
