Core Fundamentals & Architecture for dbt Geospatial

Integrating spatial data into a modern analytics stack is a shift away from monolithic, desktop-bound GIS toward declarative, version-controlled, SQL-native transformation. The hard part is not running a single ST_Intersects query — it is structuring an entire pipeline so that coordinate reference systems stay consistent, invalid geometries are quarantined before they corrupt joins, and heavy spatial computation lands in the right layer at the right cost. This guide sets out the architecture that makes that possible with dbt: the layer boundaries, the type and index decisions, and the testing discipline that turns fragmented GIS scripts into reproducible spatial data products.

The premise is straightforward. Raw spatial payloads — GeoJSON, shapefiles, WKT, or binary geometry columns — land in a staging layer where they are validated and normalized. Intermediate models execute the expensive work: spatial joins, unions, buffers, and aggregations. Marts expose pre-indexed, query-optimized geometries to BI dashboards, routing engines, and machine-learning feature stores. By anchoring this flow in dbt, teams inherit DAG-based dependency resolution, incremental execution, and CI/CD without re-implementing spatial orchestration from scratch. The first architectural decision is the execution engine: teams should start with choosing the right spatial adapter so that the project’s function dialect, index types, and SRID handling match the target warehouse.

This page is the entry point for the rest of the section. It frames the decisions that every downstream guide assumes — from setting up PostGIS with dbt for production warehouses, to DuckDB spatial extension integration for local development and CI, to structuring spatial model dependency graphs so the DAG resolves cleanly under parallel execution.

Reference Architecture: Raw to Consumers

A robust dbt geospatial architecture enforces strict boundaries between ingestion, transformation, and consumption. These boundaries prevent geometry bloat, hold CRS consistent, and keep expensive operations away from the layers that serve interactive queries.

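The flow runs from raw sources through staging and intermediate models to marts and, finally, consumers. Each hop between models is a dbt ref() edge (the first hop, from raw tables into staging, is a source() reference), so the architecture is also the lineage graph. Raw layers preserve source fidelity without mutation; staging normalizes types and constructs valid geometries; intermediate models run the heavy predicates; marts expose curated, indexed geometries. The sections below define the responsibilities, type rules, and failure modes at each boundary.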

Core Concepts & Boundaries

Three decisions determine whether a spatial pipeline is correct or merely plausible: how coordinate reference systems are governed, whether columns are typed as geometry or geography, and which spatial index backs each query pattern.

Coordinate Reference System Governance

A coordinate reference system (CRS), identified by its SRID, defines how stored coordinates map to positions on the Earth. The single most common cause of silently wrong spatial results is mixing coordinate systems in one calculation: a distance computed between coordinates that are really EPSG:4326 and coordinates that are really EPSG:3857 is numerically meaningless. PostGIS rejects operations on geometries whose declared SRIDs differ, so the silent failures tend to come from geometries with a missing or mislabeled SRID, and from engines that do not track an SRID on the value at all. Governance means picking a canonical storage projection (EPSG:4326 / WGS 84 for global storage, or a local projected system such as a UTM zone or state plane for accurate distance, area, and buffer work), enforcing it at the staging boundary, and transforming with ST_Transform only at well-defined points in the DAG. A project-wide policy belongs in spatial reference system management, and the per-pipeline mechanics of reprojection are covered in automating CRS conversions in dbt pipelines. Adhering to the OGC Simple Features specification keeps these representations interoperable across engines.
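
As a minimal sketch of that governance in practice, an audit query can list the SRIDs actually present in a raw table before staging commits to a transform. It assumes a PostGIS-style ST_SRID function and reuses the `gis.parcels_raw` source and `geom` column from the staging model shown later in this guide; the file name is hypothetical.

```sql
-- analyses/audit_raw_srids.sql (hypothetical file name)
-- Lists every SRID present in the raw table. More than one row, or an SRID
-- of 0, means staging must set or transform the SRID before any distance
-- calculation or spatial join runs.
select
    ST_SRID(geom) as srid,
    count(*)      as row_count
from {{ source('gis', 'parcels_raw') }}
group by ST_SRID(geom)
order by row_count desc
```

A single row carrying the expected SRID is the healthy result; anything else is a signal to fix the source or add an explicit ST_SetSRID step in staging.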

Geometry vs Geography Types

The GEOMETRY type treats coordinates as points on a flat Cartesian plane; the GEOGRAPHY type treats them as positions on a spheroid. The choice is not cosmetic. GEOMETRY is fast and supports the widest function coverage, but distances and areas are only correct if the data is in an appropriate projected CRS. GEOGRAPHY computes great-circle distances correctly on raw lat/long, at the cost of a smaller function set and higher compute. A practical default: store and serve as GEOMETRY in a projected SRID for analytics that need accurate metric operations, and reach for GEOGRAPHY when you need globe-spanning distance correctness without managing projections. Whichever you pick, declare it as an explicit column type in a dbt schema contract so incremental runs cannot silently coerce one into the other.
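
The difference is easiest to see by measuring the same pair of points both ways. The query below is a sketch in PostGIS dialect; the two coordinates are illustrative values chosen for this example, not data from the pipeline.

```sql
-- PostGIS dialect. Same two points, two type systems.
with pts as (
    select
        ST_SetSRID(ST_MakePoint(-0.1276, 51.5072), 4326) as a,  -- lon, lat
        ST_SetSRID(ST_MakePoint( 2.3522, 48.8566), 4326) as b
)
select
    -- GEOMETRY: Cartesian distance in the units of the CRS, here degrees,
    -- which is not a usable real-world distance.
    ST_Distance(a, b)                       as planar_degrees,
    -- GEOGRAPHY: distance measured over the Earth's surface, in metres.
    ST_Distance(a::geography, b::geography) as geodesic_metres
from pts
```

The first column is small and unitless; the second is in metres. Treating the first as if it were the second is exactly the mistake the schema contract is meant to prevent.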

Spatial Index Selection

A spatial join without a supporting index degrades to comparing every row on one side against every row on the other, a quadratic scan, which is the difference between a sub-second query and an exhausted warehouse. The index type must match the query pattern and engine:

GiST — the general-purpose bounding-box R-tree index in PostGIS. The right default for ST_Intersects, ST_Contains, and ST_DWithin. In PostGIS it is created after materialization, so dbt models declare it via a post_hook.
SP-GiST — a space-partitioned index that can outperform GiST for non-overlapping, point-heavy data (for example, dense GPS pings). Choosing between the two is a recurring decision; the GiST vs SP-GiST trade-offs belong with the broader topic of spatial index hints.
HNSW — an approximate-nearest-neighbor index used for high-dimensional vector and embedding workloads; relevant when spatial features feed similarity search rather than exact topology.

On managed warehouses such as BigQuery and Snowflake, spatial indexing is transparent — there is no index to declare — but those engines enforce strict SRID normalization at ingestion, which pushes more responsibility onto the staging layer.
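
For the engines where the index is explicit, the two PostGIS options look like this in plain DDL. This is a sketch only: the schema, table, and index names are hypothetical, and whether SP-GiST actually wins for a given point dataset is something to benchmark rather than assume.

```sql
-- PostGIS dialect; schema, table, and index names are hypothetical.

-- GiST: the general-purpose default for polygons and mixed geometry.
create index if not exists idx_stg_parcels_geom_gist
    on analytics.stg_parcels using gist (geometry);

-- SP-GiST: worth benchmarking for dense, non-overlapping point data.
create index if not exists idx_stg_gps_pings_geom_spgist
    on analytics.stg_gps_pings using spgist (geometry);

-- Refresh planner statistics so the new indexes are considered.
analyze analytics.stg_parcels;
analyze analytics.stg_gps_pings;
```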

Adapter & Engine Comparison

The execution engine dictates the function dialect, the type system, the indexing model, and how suitable the engine is for fast CI. The table below summarizes the trade-offs behind the decision covered in choosing the right spatial adapter.

Capability	PostGIS	DuckDB Spatial	BigQuery GIS
Type system	`GEOMETRY` + `GEOGRAPHY`	`GEOMETRY` (planar)	`GEOGRAPHY` (spheroid only)
Function coverage	Broadest (`ST_*`, topology, raster)	Wide, GEOS-backed	Curated subset, no planar `ST_Buffer` semantics
Indexing	Explicit GiST / SP-GiST via `post_hook`	Optional R-tree index (`USING RTREE`); otherwise scans	Transparent, managed
SRID handling	Per-column, enforced by you	Often not stored on the value; source and target CRS passed to `ST_Transform`	Fixed to EPSG:4326
CI suitability	Needs a running service	Excellent — in-process, zero service	Needs cloud credentials + cost
Best fit	Production warehouse, rich topology	Local dev, CI validation, prototyping	Petabyte-scale serverless analytics

A common and effective pattern is to develop and validate against DuckDB spatial extension integration — fast, free, no service to manage — then promote the same models to PostGIS or BigQuery for production. That only works if engine-specific differences are isolated behind macros, which the macro and UDF abstraction section addresses.

Pipeline Layer Responsibilities

Staging: Validation and Standardization

Raw spatial data rarely arrives analytics-ready. The staging layer owns three responsibilities before anything moves downstream:

Geometry validation — confirm WKT/WKB payloads are well-formed and topologically valid with ST_IsValid, repairing with ST_MakeValid or quarantining failures. Invalid geometries silently break every downstream join.
CRS normalization — reproject everything to the canonical SRID with ST_Transform and stamp the result, so no later model has to guess a coordinate’s frame.
Index preparation — flag the geometry columns that will be materialized so the index decision is explicit rather than incidental.

A minimal staging model makes the contract concrete:

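-- models/staging/stg_parcels.sql
-- PostGIS dialect. Assumes the raw geom column already carries its true SRID;
-- ST_Transform raises an error on geometries whose SRID is 0 (unknown).
with source as (
    select * from {{ source('gis', 'parcels_raw') }}
),

validated as (
    select
        parcel_id,
        -- Keep valid geometries untouched and repair the rest. ST_MakeValid can
        -- change the geometry type (for example, polygon to multipolygon).
        case
            when ST_IsValid(geom) then geom
            else ST_MakeValid(geom)
        end as geom_clean
    from source
)

select
    parcel_id,
    ST_Transform(geom_clean, 4326) as geometry  -- canonical storage CRS
from validated
where geom_clean is not null  -- drop rows with no usable geometry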

Intermediate: Heavy Spatial Computation

Spatial joins and constructors are expensive and must never run at the mart layer where they would re-execute on every interactive query. The intermediate layer isolates ST_Union, ST_Buffer, ST_Difference, and the spatial-join predicates ST_Intersects, ST_Contains, and ST_DWithin. This is also where fan-out is contained: a point-in-polygon join can multiply row counts, so bounding-box pre-filters and well-ordered dependencies matter. Structuring these relationships is the subject of spatial model dependency graphs, and untangling the cycles that block compilation is covered in resolving circular dependencies in spatial models. For nearest-neighbor and proximity work specifically, the patterns in optimizing proximity joins keep these models from degrading as data grows.
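
A sketch of what such an intermediate model might look like, here the `int_service_area_unions` model that the mart example in this guide selects from. The upstream model `stg_service_zones` and its columns are hypothetical, and the SQL is PostGIS dialect: ST_Union is used as an aggregate, which other engines may expose under a different name (DuckDB Spatial, for instance, uses ST_Union_Agg).

```sql
-- models/intermediate/int_service_area_unions.sql
-- PostGIS dialect. stg_service_zones is a hypothetical staging model with
-- one row per zone and a region_id to dissolve on.
{{ config(materialized = 'table') }}

select
    z.region_id,
    -- Dissolve all zones of a region into one geometry. This is the
    -- expensive step, paid once per run instead of once per dashboard query.
    ST_Union(z.geometry) as geometry
from {{ ref('stg_service_zones') }} as z
where z.geometry is not null
group by z.region_id
```

Materializing as a table means the mart reads a finished geometry per region rather than re-running the union.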

Marts: Materialization and Query Optimization

The mart layer serves pre-aggregated, indexed geometries tuned to how they are consumed:

view — best for dynamic filters or when the BI tool renders geometry client-side.
table / incremental — required when heavy joins are pre-computed, history is snapshotted, or interactive latency matters.
ephemeral — for light, reusable logic that dbt inlines as a CTE into each downstream model and never materializes on its own.

Marts should not store raw high-precision coordinates when simplified or tiled geometry suffices for the consumer. Applying ST_Simplify and declaring the spatial index at materialization time is what keeps dashboard scans fast:

-- models/marts/mart_service_areas.sql
{{ config(
    materialized = 'table',
    post_hook = "CREATE INDEX IF NOT EXISTS idx_{{ this.name }}_geom ON {{ this }} USING GIST (geometry)"
) }}

select
    region_id,
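    -- Drop precision the map will never show. The tolerance is in CRS units
    -- (degrees for EPSG:4326). ST_Simplify can return invalid geometry;
    -- ST_SimplifyPreserveTopology is the safer choice when validity is tested.
    ST_Simplify(geometry, 0.0001) as geometry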
from {{ ref('int_service_area_unions') }}
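
An alternative worth knowing on dbt-postgres is the adapter's `indexes` model config, which creates the index as part of the materialization and generates the index name itself. The sketch below assumes that adapter. One caveat to verify in your own project with the hook form: index names are unique per schema in Postgres and dbt swaps tables by renaming, so depending on when the hook fires, the previous table may still own the fixed name and IF NOT EXISTS could skip creation on the new table.

```sql
-- models/marts/mart_service_areas.sql (alternative config, assumes dbt-postgres)
{{ config(
    materialized = 'table',
    indexes = [
        {'columns': ['geometry'], 'type': 'gist'}
    ]
) }}

select
    region_id,
    ST_Simplify(geometry, 0.0001) as geometry
from {{ ref('int_service_area_unions') }}
```

Checking `pg_indexes` after a second consecutive run is a quick way to confirm that whichever form you choose really leaves an index on the live table.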

Macro & UDF Abstraction

Spatial logic that repeats — validity sweeps, reprojection, bounding-box filters — belongs in dbt macros rather than copy-pasted SQL. Macros also absorb the dialect differences between engines, which is what makes the develop-on-DuckDB, deploy-to-PostGIS pattern viable. A reprojection wrapper is a typical first abstraction:

-- macros/spatial/to_canonical_srid.sql
{% macro to_canonical_srid(geom_col, target_srid=4326) %}
    ST_Transform({{ geom_col }}, {{ target_srid }})
{% endmacro %}

Wrap logic in a macro when it appears in more than one model, when it differs by adapter, or when it encodes a governance rule (the canonical SRID, the simplification tolerance) that should be changed in exactly one place. The full vocabulary of parameterization and cross-engine portability lives in advanced spatial macros and UDF patterns, with worked examples in building custom spatial macros and reusable predicates such as writing reusable ST_DWithin macros in dbt. Reprojection that must run across many models at once is handled in batch transforming coordinate systems with dbt.
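
To make the adapter-difference case concrete, here is a sketch of a reprojection macro that uses dbt's `adapter.dispatch`. The macro name `reproject_geom` and its parameters are hypothetical. It assumes PostGIS reads the source SRID from the geometry itself, while DuckDB Spatial's ST_Transform takes the source and target CRS as strings plus an optional flag forcing x/y (longitude/latitude) axis order; confirm both against the versions you run.

```sql
-- macros/spatial/reproject_geom.sql (hypothetical macro)
{% macro reproject_geom(geom_col, source_srid, target_srid=4326) %}
    {{ return(adapter.dispatch('reproject_geom')(geom_col, source_srid, target_srid)) }}
{% endmacro %}

{# Default (PostGIS-style): the source SRID is read from the geometry itself,
   so source_srid is accepted for a uniform signature but not used. #}
{% macro default__reproject_geom(geom_col, source_srid, target_srid) %}
    ST_Transform({{ geom_col }}, {{ target_srid }})
{% endmacro %}

{# DuckDB Spatial: both CRSs are passed explicitly; the final argument asks
   for x/y (lon/lat) axis order regardless of the CRS definition. #}
{% macro duckdb__reproject_geom(geom_col, source_srid, target_srid) %}
    ST_Transform({{ geom_col }}, 'EPSG:{{ source_srid }}', 'EPSG:{{ target_srid }}', true)
{% endmacro %}
```

A model would then call `{{ reproject_geom('geom_clean', 27700) }}` and compile to the right dialect on either target; the 27700 here is only an example source SRID.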

Testing & Data Quality

Spatial data fails in ways that not_null and unique never catch. Production projects add geometry-aware generic tests:

Validity tests — flag any row where ST_IsValid(geometry) is false, catching self-intersections, unclosed rings, and collapsed polygons.
SRID assertions — assert that every geometry in a mart shares one SRID, the guardrail against silent distance errors.
Null-geometry guards — fail when a geometry column contains unexpected nulls after a join that should have matched.
Cardinality checks — verify expected row counts after ST_Intersects or ST_DWithin to catch fan-out early.

A reusable validity test as a dbt generic test:

-- tests/generic/assert_valid_geometry.sql
{% test assert_valid_geometry(model, column_name) %}
    select {{ column_name }}
    from {{ model }}
    where not ST_IsValid({{ column_name }})
{% endtest %}

Wired into a model’s schema, the contract and tests sit together:

# models/marts/_marts.yml
version: 2
models:
  - name: mart_service_areas
    config:
      contract:
        enforced: true
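    columns:
      # An enforced contract must declare every column the model returns.
      - name: region_id
        data_type: integer  # assumption: replace with the real type of region_id
      - name: geometry
        data_type: geometry
        tests:
          - assert_valid_geometry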

Performance & Scale Considerations

As datasets grow from thousands to billions of rows, a few patterns keep cost and latency bounded:

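Incremental spatial processing — partition large tables by time or a spatial grid (H3, S2, or QuadKey) and recompute only changed cells using is_incremental() filters combined with bounding-box filters: the && operator, applied to a box computed with ST_Envelope where needed.
Bounding-box pre-filtering — gate every expensive predicate behind a cheap && bounding-box check so the planner can use the GiST index before evaluating exact geometry. In PostGIS, index-aware predicates such as ST_Intersects and ST_DWithin already include this check, so the explicit && matters most around functions that are not index-aware, such as ST_Distance.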
Geometry simplification and tiling — store full precision in staging but expose ST_Simplify-reduced or pre-tiled geometry in marts so the consumer never scans coordinates it cannot render.
Compute routing — send lightweight aggregations to serverless engines and reserve dedicated clusters for heavy topology operations.
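
The first two patterns combine naturally in one incremental model. The sketch below is PostGIS dialect, and every model and column name in it (`stg_gps_pings`, `stg_zones`, `ping_id`, `zone_id`, `recorded_at`) is hypothetical.

```sql
-- models/intermediate/int_pings_in_zones.sql (hypothetical model)
{{ config(
    materialized = 'incremental',
    unique_key = ['ping_id', 'zone_id']
) }}

select
    p.ping_id,
    z.zone_id,
    p.recorded_at,
    p.geometry
from {{ ref('stg_gps_pings') }} as p
join {{ ref('stg_zones') }} as z
    -- Cheap bounding-box overlap first, exact test second.
    on p.geometry && z.geometry
   and ST_Intersects(p.geometry, z.geometry)
{% if is_incremental() %}
-- On incremental runs, only join pings newer than what is already stored.
where p.recorded_at > (
    select coalesce(max(recorded_at), cast('1900-01-01' as timestamp))
    from {{ this }}
)
{% endif %}
```

The composite `unique_key` reflects that one ping can fall in more than one overlapping zone; a time-based filter like this only catches new pings, so changed zone boundaries still call for a full refresh.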

These trade-offs compound at volume; the dedicated treatment in handling large geospatial datasets covers partition strategy and incremental state tracking in depth.

CI/CD Integration

The most reliable way to keep spatial regressions out of production is to run the validity, SRID, and cardinality tests on every pull request, and to do it against a fast, free engine before promoting to the production warehouse. DuckDB Spatial is well suited to this: it runs in-process with no service to stand up, so a GitHub Actions job can build and test the full DAG against small fixtures without provisioning a database.

# .github/workflows/dbt-spatial-ci.yml
name: dbt spatial CI
on: [pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    env:
      DBT_PROFILES_DIR: ./ci
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install dbt-duckdb
      - run: dbt build --target ci   # builds + runs every geometry test

The CI profile points dbt at DuckDB and reads any secrets through the env_var() pattern so nothing is hard-coded:

# ci/profiles.yml
dbt_geospatial:
  target: ci
  outputs:
    ci:
      type: duckdb
      path: "{{ env_var('DBT_DUCKDB_PATH', 'ci.duckdb') }}"
      extensions: [spatial]

Seed a handful of edge-case geometries — an unclosed ring, a self-intersection, a wrong-SRID point — as fixtures so the test suite proves it actually catches the failures it claims to. Only after the DuckDB gate is green should the same models promote to PostGIS, whose step-by-step setup is in how to install the dbt PostGIS adapter.
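
One way to sketch that self-check is a dbt singular test that builds known-good and known-bad geometries inline and fails if the validity function disagrees with the expectation. The file name and labels are hypothetical; ST_GeomFromText and ST_IsValid are assumed to be available on the CI engine, as they are in PostGIS and DuckDB Spatial.

```sql
-- tests/assert_validity_check_catches_bowtie.sql (hypothetical singular test)
-- A singular test fails when it returns rows, so this returns a row only if
-- ST_IsValid gives the wrong answer for one of the fixtures.
with fixtures as (
    select
        'valid_square' as label,
        true           as expected_valid,
        ST_GeomFromText('POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))') as geometry
    union all
    select
        'bowtie_self_intersection',
        false,
        -- Edges (0 0)-(1 1) and (1 0)-(0 1) cross, so the ring self-intersects.
        ST_GeomFromText('POLYGON((0 0, 1 1, 1 0, 0 1, 0 0))')
)

select label
from fixtures
where ST_IsValid(geometry) <> expected_valid
```

Some parsers reject an unclosed ring outright instead of returning an invalid geometry, so that fixture may be easier to exercise at the ingestion step than inside a test like this one.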

Common Failure Modes & Remediation

Symptom	Root cause	Remediation
Distances or areas wildly wrong	Mixed SRIDs in one calculation	Enforce canonical CRS at staging with `ST_Transform`; add an SRID-consistency test
Spatial join returns zero or far too many rows	Geographic vs projected mismatch, or missing bounding-box pre-filter	Align CRS before the join; gate the predicate behind `&&` and a GiST index
`ST_Intersects` errors on “invalid geometry”	Unrepaired self-intersections / unclosed rings from source	Run `ST_MakeValid` in staging and `assert_valid_geometry` as a test
Query slow despite an index	Planner not using the GiST index, or index built before data load	Create the index in a `post_hook` after materialization; pre-filter with a bounding box
Incremental run drops or duplicates geometry	Partition key misaligned with the spatial predicate	Re-key partitions to the spatial grid; review the DAG in spatial model dependency graphs
Works on DuckDB, fails on PostGIS	Adapter dialect / type-coercion difference	Isolate the difference behind an adapter-aware macro; enforce a schema contract

Schema-level breaking changes — switching a column from GEOGRAPHY to GEOMETRY, or changing an index strategy — deserve their own discipline; versioning spatial schemas in dbt covers backward-compatible migration and rollback.

Choosing the Right Spatial Adapter — match engine, function dialect, and index model to your warehouse.
Setting Up PostGIS with dbt — enable spatial functions, search paths, and GiST indexing in production.
DuckDB Spatial Extension Integration — in-process spatial compute for local development and CI.
Spatial Model Dependency Graphs — structure the DAG for parallel execution and minimal data shuffling.
Spatial Data Architecture & Governance — CRS policy, schema versioning, security scoping, and large-dataset handling.
