Advanced Spatial Macros & UDF Patterns: Architecting the Modern Geospatial Stack

The convergence of analytics engineering and spatial computation has fundamentally shifted how organizations operationalize location intelligence. As teams adopt dbt for geospatial workloads, the reliance on fragmented GIS scripts and monolithic ETL jobs gives way to modular, version-controlled, and testable transformation layers. At the center of this architectural evolution sit advanced spatial macros and user-defined function (UDF) patterns, which bridge the gap between declarative SQL modeling and the computational intensity of geometry operations.

This guide establishes the execution boundaries, adapter lifecycles, and performance strategies required to productionize spatial transformations across PostGIS, DuckDB, BigQuery, and Snowflake. It targets analytics engineers, platform architects, GIS backend developers, and spatial data scientists who need deterministic, scalable, and CI/CD-ready workflows. Each section maps to a deeper companion guide: macro construction is covered in Building Custom Spatial Macros, deterministic projection chains in Geometry Transformation Pipelines, planner control in Index Hints for Spatial Queries, and join scaling in Optimizing Proximity Joins.

The throughline is simple: treat spatial logic as a first-class engineering artifact. A macro is not a convenience wrapper; it is the contract that guarantees a geometry entering a spatial model dependency graph carries a known coordinate reference system, a validated topology, and a predictable execution cost. When that contract holds at every layer, location intelligence stops being an experimental capability and becomes a reliable component of the analytics stack.

Core Architecture: Compile-Time Abstraction vs. Runtime Execution

Spatial transformations operate across two distinct execution phases: compile-time macro expansion and runtime function evaluation. Maintaining a strict separation between these phases is critical for preserving spatial integrity and optimizing query performance.

dbt macros act as compile-time templates. They generate warehouse-specific SQL before the query ever reaches the execution engine. When a macro encapsulates a spatial operation, it can dynamically inject adapter-specific syntax, enforce SRID alignment, or construct bounding-box predicates from configuration variables. By contrast, database-native UDFs and built-in spatial routines execute at runtime, leveraging the warehouse’s vectorized engine and memory-optimized geometry libraries for heavy computation.

The architectural rule is therefore: macros handle structural abstraction, parameterization, and cross-platform compatibility, while computationally intensive operations — polygon clipping, spatial aggregation, coordinate transformation — are delegated to native spatial functions or compiled UDFs. The custom spatial macro patterns guide details how to abstract vendor-specific functions into reusable, testable components. This separation keeps spatial logic decoupled from downstream business metrics while guaranteeing deterministic execution across development, staging, and production.

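-- Compile-time concern: a macro normalizes CRS and validity in one place
{% macro to_canonical_geom(geom_col, target_srid=4326) %}
  {# Assumption: input tagged SRID 0 is known to already be in target_srid.
     If the true source CRS differs, assert it with ST_SetSRID upstream first. #}
  ST_MakeValid(
    ST_Transform(
      ST_SetSRID({{ geom_col }}, COALESCE(NULLIF(ST_SRID({{ geom_col }}), 0), {{ target_srid }})),
      {{ target_srid }}
    )
  )
{% endmacro %}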

-- Runtime concern: the warehouse executes the expanded, indexable predicate
SELECT a.id, b.id
FROM {{ ref('stg_facilities') }} a
JOIN {{ ref('stg_parcels') }} b
  ON ST_DWithin(a.geom::geography, b.geom::geography, 500)   -- 500 meters; evaluated by the spatial engine

Core Concepts & Boundaries

Before writing a single macro, the team must agree on four boundaries that every spatial model inherits.

CRS governance. A coordinate reference system (CRS) defines how coordinates map to positions on Earth. Mixing projections silently corrupts distance and area math. Canonicalize storage on EPSG:4326 for global geography or EPSG:3857 for web rendering, and enforce that contract in the staging layer with automated CRS conversions. Never let a geometry with an undeclared or non-canonical CRS cross a layer boundary.

Geometry vs. geography types. GEOMETRY performs fast planar math on a Cartesian plane; GEOGRAPHY performs slower but geodetically correct math on a sphere or spheroid, with distances returned in meters. Use GEOMETRY in a projected CRS for local-scale analysis where speed matters, and GEOGRAPHY (or cast with ::geography) when distances span large extents or cross UTM zones. The choice is a per-model decision documented in the model’s YAML, not an accident of the source schema.
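The difference is easiest to see by measuring the same pair of points both ways. The following is a minimal PostGIS sketch; the two coordinates (roughly central London and central Paris) are illustrative values, not data from this guide.

```sql
-- PostGIS sketch: the same two WGS84 points measured as GEOMETRY and as GEOGRAPHY.
-- The literal coordinates are illustrative assumptions.
WITH pts AS (
  SELECT
    ST_SetSRID(ST_MakePoint(-0.1276, 51.5072), 4326) AS a,  -- lon, lat
    ST_SetSRID(ST_MakePoint(2.3522, 48.8566), 4326)  AS b
)
SELECT
  ST_Distance(a, b)                       AS planar_degrees,   -- Cartesian, in CRS units (degrees here)
  ST_Distance(a::geography, b::geography) AS geodesic_meters   -- geodesic, in meters
FROM pts;
```

The first column is a count of degrees and has no physical meaning as a distance; the second is in meters. Any threshold written in meters only makes sense against the second form or against a metric projected CRS.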

SRID enforcement. The Spatial Reference Identifier (SRID) is the integer that tags a geometry with its CRS. A geometry with SRID = 0 is unknown, not a default — treat it as a hard failure. Apply ST_SetSRID only to assert a known-but-untagged value, and ST_Transform to actually reproject. Conflating the two is the single most common source of misaligned geometries.
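As a rough illustration of the two operations, the sketch below assumes a hypothetical table raw_sites whose geom column holds British National Grid coordinates (EPSG:27700) that were loaded without an SRID tag. Both the table and the choice of source CRS are assumptions for the example.

```sql
-- PostGIS sketch. raw_sites, site_id and the 27700 source CRS are hypothetical.
SELECT
  site_id,
  ST_SetSRID(geom, 27700)                     AS geom_tagged,  -- relabel only: coordinates unchanged
  ST_Transform(ST_SetSRID(geom, 27700), 4326) AS geom_wgs84    -- reproject: coordinates recomputed
FROM raw_sites
WHERE ST_SRID(geom) = 0;   -- only touch rows whose tag is actually missing
```

Calling ST_SetSRID(geom, 4326) on these rows instead would leave easting/northing values labeled as longitude/latitude, which is exactly the misalignment described above.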

Spatial index selection. The right index determines whether a join uses a bounding-box scan or a full table scan:

GiST: the general-purpose default for GEOMETRY/GEOGRAPHY; balanced build and query cost, ideal for mixed polygon/point workloads.
SP-GiST: space-partitioned, faster for uniformly distributed point clouds and ST_DWithin proximity scans.
BRIN: block-range, near-zero storage, suited to naturally clustered data (e.g. time-ordered GPS pings) where coarse block-level pruning is sufficient; the index is lossy, but query results stay exact because candidate rows are rechecked.
HNSW: for approximate nearest-neighbor over high-dimensional embeddings (e.g. learned location vectors), not classic geometry, but increasingly relevant to spatial ML marts.

Index strategy is a materialization concern; see spatial index hints in dbt materializations for declaring these through post-hook.
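For orientation, the index types above map to DDL along these lines in PostgreSQL. This is a sketch: every table, column, and index name is hypothetical, and the HNSW statement assumes the pgvector extension rather than PostGIS.

```sql
-- PostGIS DDL sketch; table and index names are hypothetical.
CREATE INDEX parcels_geom_gist ON parcels   USING GIST   (geom);
CREATE INDEX pings_geom_spgist ON gps_pings USING SPGIST (geom);
CREATE INDEX pings_geom_brin   ON gps_pings USING BRIN   (geom);

-- HNSW is provided by the pgvector extension; embedding is assumed to be a vector column.
CREATE INDEX places_embedding_hnsw
  ON place_embeddings USING hnsw (embedding vector_l2_ops);
```

In practice a table would carry only one of the geometry indexes; they are listed together here purely to show the syntax side by side.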

Adapter & Engine Comparison

Modern data platforms rarely operate within a single spatial engine. The table below summarizes the trade-offs that drive macro design and the spatial adapter selection decision.

Capability	PostGIS	DuckDB Spatial	BigQuery GIS	Snowflake
Primary type	`GEOMETRY` + `GEOGRAPHY`	`GEOMETRY`	`GEOGRAPHY` only	`GEOGRAPHY` + `GEOMETRY`
Function coverage	Most complete (`ST_*` superset)	Broad, GEOS-backed	Spherical subset	Growing subset
Index model	GiST / SP-GiST / BRIN (explicit)	R-tree (explicit, `USING RTREE`)	Managed, transparent	Managed, transparent
User-defined functions	SQL + PL/pgSQL + C	SQL macros + extensions	SQL + JS UDF	SQL + JS + Python UDF
Reprojection	`ST_Transform` (full PROJ)	`ST_Transform` (PROJ)	Implicit WGS84 only	Limited
CI suitability	Heavy (container)	Excellent (embedded, fast)	Cloud-only	Cloud-only
Best fit	Authoritative production store	Local dev + CI validation	Petabyte spherical analytics	Enterprise governed warehouse

The practical pattern most teams converge on is a DuckDB spatial extension integration for fast local iteration and CI, then promotion to a PostGIS or BigQuery production target. Because function signatures diverge — BigQuery’s ST_DWITHIN is spherical-only, PostGIS requires a ::geography cast for metric distances — portable macros must route on the adapter rather than hardcoding one dialect.

Pipeline Layer Responsibilities

Spatial DAGs map cleanly onto dbt’s layered convention, but each layer carries spatial-specific obligations.

Staging (views): validate, never compute. The staging layer normalizes types and asserts the geometry contract — nothing more. Its checklist:

Cast or construct geometries from source WKT/WKB with ST_GeomFromText / ST_GeomFromWKB.
Assert and, where needed, repair the SRID with the to_canonical_geom macro above.
Reject or flag empty and null geometries; never let them flow downstream.
Keep models as view materializations so validation cost is paid at query time, not storage.

-- models/staging/stg_parcels.sql
{{ config(materialized='view') }}

SELECT
  parcel_id,
  {{ to_canonical_geom('raw_geom', 4326) }} AS geom
FROM {{ source('cadastre', 'parcels') }}
WHERE raw_geom IS NOT NULL
  AND NOT ST_IsEmpty(raw_geom)

Intermediate (incremental tables): the heavy spatial work. Spatial joins, point-in-polygon assignment, and topology cleaning live here. Materialize as incremental so only changed partitions recompute, and isolate expensive routines into their own models — a large ST_Intersection clip should never share a model with a lightweight attribute join.

Marts (materialized tables + indexes): query-ready geometry. Marts expose curated geometries with a committed index so consumers get bounding-box performance. Build the spatial index in a post-hook and document the intended access pattern (proximity, containment, tiling) so the index type matches the workload.

-- models/marts/mart_service_areas.sql
{{ config(
    materialized='table',
    post_hook="CREATE INDEX IF NOT EXISTS {{ this.name }}_gix ON {{ this }} USING GIST (geom)"
) }}
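One caveat with a hardcoded GiST hook is that it only parses on PostgreSQL, so the same model would fail on a DuckDB CI target. A possible workaround, sketched here with a hypothetical helper macro named create_spatial_index, is to render the DDL only for the adapter that understands it.

```sql
-- macros/create_spatial_index.sql (hypothetical helper, not shipped with dbt)
{% macro create_spatial_index(relation, column='geom') %}
  {% if target.type == 'postgres' %}
    CREATE INDEX IF NOT EXISTS {{ relation.identifier }}_{{ column }}_gix
      ON {{ relation }} USING GIST ({{ column }})
  {% endif %}
  {# Other adapters render nothing here; this relies on dbt skipping a hook
     that renders to an empty string. #}
{% endmacro %}
```

The model config would then read post_hook="{{ create_spatial_index(this) }}", keeping the index declaration next to the model while leaving the dialect decision in one macro.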

Macro & UDF Abstraction

The decision of when to reach for a macro versus a native UDF is the highest-leverage call in the stack.

Wrap logic in a macro when the goal is templating: applying the same projection rule everywhere, generating an index DDL, or routing dialect differences. Reach for a native UDF when the logic is genuinely procedural — iterative network traversal, custom topology repair, or a domain algorithm the warehouse cannot express in pure SQL — and you want the engine’s optimizer and parallelism to own its execution.

Cross-engine portability is delivered through adapter.dispatch, which selects an implementation by adapter at compile time. This is the mechanism that lets one model run on DuckDB in CI and PostGIS in production unchanged.

{% macro proximity_predicate(geom_a, geom_b, meters) %}
  {{ return(adapter.dispatch('proximity_predicate', 'dbt_geo')(geom_a, geom_b, meters)) }}
{% endmacro %}

{% macro default__proximity_predicate(geom_a, geom_b, meters) %}
  ST_DWithin({{ geom_a }}::geography, {{ geom_b }}::geography, {{ meters }})
{% endmacro %}

{% macro bigquery__proximity_predicate(geom_a, geom_b, meters) %}
  ST_DWITHIN({{ geom_a }}, {{ geom_b }}, {{ meters }})
{% endmacro %}

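{% macro duckdb__proximity_predicate(geom_a, geom_b, meters) %}
  {# DuckDB's spheroid variant is defined for POINT_2D inputs in WGS84; check the axis order your version expects #}
  ST_DWithin_Spheroid({{ geom_a }}, {{ geom_b }}, {{ meters }})
{% endmacro %}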

Parameterize tolerances, target SRIDs, and distance units rather than hardcoding them; a tolerance or target_srid argument turns a brittle copy-paste into a single documented contract. The deep dive on reusable ST_DWithin macros in dbt walks through the full dispatch pattern end to end.
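A small sketch of that parameterization, using a hypothetical simplify_geom macro and an assumed project variable named simplify_tolerance:

```sql
-- macros/simplify_geom.sql (hypothetical macro; the var name is an assumption)
{% macro simplify_geom(geom_col, tolerance=none) %}
  {# Fall back to a project-level var so the default lives in one documented place #}
  {% set tol = tolerance if tolerance is not none else var('simplify_tolerance', 1.0) %}
  ST_SimplifyPreserveTopology({{ geom_col }}, {{ tol }})
{% endmacro %}
```

A model can call {{ simplify_geom('geom') }} to inherit the project default or {{ simplify_geom('geom', tolerance=5) }} to override it. The tolerance is expressed in the units of the geometry's CRS, so the value only has a stable meaning once the CRS contract from staging holds.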

Testing & Data Quality

Spatial data introduces failure modes traditional analytics tests ignore: null geometries, invalid topologies, SRID mismatches, and self-intersecting polygons can silently corrupt every downstream metric. Productionizing spatial transformations requires generic tests that operate at the geometry level.

A reusable generic test catches invalidity across every model that references it:

-- macros/test_is_valid_geometry.sql
{% test is_valid_geometry(model, column_name) %}
SELECT *
FROM {{ model }}
WHERE NOT ST_IsValid({{ column_name }})
   OR {{ column_name }} IS NULL
   OR ST_IsEmpty({{ column_name }})
{% endtest %}

# models/staging/_staging.yml
version: 2
models:
  - name: stg_parcels
    tests:
      # Model-level: at column level dbt_utils prepends the column name to the expression
      - dbt_utils.expression_is_true:
          expression: "ST_SRID(geom) = 4326"
    columns:
      - name: geom
        tests:
          - is_valid_geometry
Layer three categories of assertion:

Validity & null guards — ST_IsValid, ST_IsEmpty, and non-null constraints on every geometry column.
SRID assertions — verify every joined table shares the canonical SRID before the join executes; a mismatch here is the root cause of most “empty result” bugs.
Topology rules — UDF-backed assertions such as no overlapping administrative boundaries or valid network connectivity, run as singular tests in CI.
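As an example of the third category, a singular test for overlapping boundaries might look like the sketch below. It assumes PostGIS and a hypothetical staging model stg_admin_boundaries keyed by boundary_id; a dbt singular test fails when the query returns any rows.

```sql
-- tests/assert_no_overlapping_boundaries.sql
-- Singular test sketch (PostGIS): returns one row per pair of boundaries whose
-- interiors share area. stg_admin_boundaries and boundary_id are hypothetical names.
SELECT
  a.boundary_id AS left_id,
  b.boundary_id AS right_id
FROM {{ ref('stg_admin_boundaries') }} a
JOIN {{ ref('stg_admin_boundaries') }} b
  ON a.boundary_id < b.boundary_id            -- each unordered pair once, no self-pairs
 AND a.geom && b.geom                         -- cheap bounding-box prefilter
 AND ST_Relate(a.geom, b.geom, '2********')   -- interiors intersect in a 2D area
```

The DE-9IM pattern is used instead of ST_Overlaps so that one boundary fully containing another is also reported, while neighbors that merely share an edge are not.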

Aligning with the OGC Simple Feature Access specification and the PostGIS reference manual gives authoritative baselines for function behavior and precision guarantees.

Performance & Scale Considerations

Geospatial operations are inherently resource-intensive; without deliberate optimization, spatial joins trigger full table scans that exhaust warehouse credits.

Bounding-box pre-filtering. The most effective acceleration is filtering with the bounding-box operator (&& in PostGIS, envelope checks elsewhere) before applying exact-geometry predicates. Because the bounding-box comparison is the part a spatial index can answer, it shrinks the candidate set cheaply and leaves the expensive exact test to run on the survivors only. Partitioning, grid bucketing, and adaptive thresholds are covered in optimizing proximity joins, with the nearest-neighbor case detailed in speeding up nearest-neighbor joins in PostGIS.
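Written out explicitly, the two-stage filter could look like the following PostGIS sketch. It assumes hypothetical intermediate models whose geom columns have already been reprojected to a shared metric CRS, and the 500 m radius is an illustrative value.

```sql
-- PostGIS sketch: candidate pruning with && before the exact test.
-- int_facilities_metric / int_parcels_metric are hypothetical models in a
-- projected CRS measured in meters.
SELECT
  a.id AS facility_id,
  b.id AS parcel_id
FROM {{ ref('int_facilities_metric') }} a
JOIN {{ ref('int_parcels_metric') }} b
  ON a.geom && ST_Expand(b.geom, 500)        -- index-assisted bounding-box overlap
 AND ST_Distance(a.geom, b.geom) <= 500      -- exact check on the survivors only
```

PostGIS's ST_DWithin and ST_Intersects already apply an equivalent bounding-box check internally, so spelling it out matters most when the exact predicate is a function the planner cannot pair with the index on its own.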

Query planner control. Modern warehouses auto-optimize, but spatial workloads often need explicit guidance. Materialized spatial indexes, plus planner settings applied through dbt configs (for example a pre-hook that adjusts PostgreSQL’s enable_seqscan), steer the optimizer toward spatial access paths and away from sequential scans, without abandoning dbt’s declarative paradigm.

Parallel spatial joins. Partition large join keys (e.g. by H3 cell or admin region) so the engine can fan the join across workers; a single monolithic ST_Intersects over an unpartitioned table cannot parallelize effectively.

Incremental materialization trade-offs. Incremental models avoid recomputing unchanged geometry but add complexity: the unique_key must be stable, and late-arriving geometry edits require an is_incremental() predicate that re-scans the affected spatial extent. For very large feeds, combine incrementality with the partitioning strategies in handling large geospatial datasets. Isolate any routine that can exceed model timeouts — large-scale clipping, massive point-in-polygon evaluation — into dedicated intermediate models so the scheduler allocates compute independently.
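A minimal incremental join model along these lines might look like the sketch below. The stg_zones model, the zone_id key, and the updated_at column are hypothetical; the sketch filters on an edit timestamp, and a spatial-extent variant would add a bounding-box condition to the same is_incremental() branch.

```sql
-- models/intermediate/int_parcel_zones.sql
-- Sketch only: stg_zones, zone_id and updated_at are hypothetical names.
{{ config(
    materialized='incremental',
    unique_key=['parcel_id', 'zone_id']
) }}

SELECT
  p.parcel_id,
  z.zone_id,
  p.updated_at
FROM {{ ref('stg_parcels') }} p
JOIN {{ ref('stg_zones') }} z
  ON ST_Intersects(p.geom, z.geom)
{% if is_incremental() %}
-- Recompute only parcels edited since the last successful run
WHERE p.updated_at > (SELECT COALESCE(MAX(updated_at), '1900-01-01') FROM {{ this }})
{% endif %}
```

One limitation of this shape: if an edited parcel stops intersecting a zone, the old (parcel_id, zone_id) row is not removed by the merge, so a periodic full refresh or a delete step is still needed.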

CI/CD Integration

Spatial pipelines earn trust by validating geometry on every pull request, not after deployment. The portability built through adapter.dispatch pays off here: run the full test suite on an embedded DuckDB engine in CI, then promote validated models to PostGIS.

# .github/workflows/spatial-ci.yml
name: spatial-ci
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install dbt-duckdb   # the spatial extension is loaded via the profile's extensions list
      - run: dbt deps
      - run: dbt build --target ci   # seeds + run + test on DuckDB
        env:
          DBT_SPATIAL_SRID: "4326"

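The matching profile reads every secret and tunable through dbt’s env_var() function so no credential is hardcoded; the canonical SRID is supplied the same way, through the DBT_SPATIAL_SRID variable set in the workflow above.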

# profiles.yml
dbt_geo:
  target: ci
  outputs:
    ci:
      type: duckdb
      path: "{{ env_var('DBT_DUCKDB_PATH', 'ci.duckdb') }}"
      extensions: ["spatial"]
    prod:
      type: postgres
      host: "{{ env_var('DBT_PG_HOST') }}"
      port: "{{ env_var('DBT_PG_PORT', '5432') | as_number }}"
      user: "{{ env_var('DBT_PG_USER') }}"
      password: "{{ env_var('DBT_PG_PASSWORD') }}"
      dbname: "{{ env_var('DBT_PG_DATABASE') }}"
      schema: "{{ env_var('DBT_PG_SCHEMA', 'analytics') }}"

Seed deterministic geometry fixtures — a handful of WKT polygons covering interior, boundary, and degenerate cases — so tests assert against known-correct topology rather than live production data. Validate the production target itself with SELECT PostGIS_Version(); in an on-run-start hook before any model executes, failing fast if the extension is missing. Reprojection logic that must survive environment promotion is covered in batch-transforming coordinate systems with dbt.
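The fail-fast extension check can be wrapped in a macro so that it only renders on the PostgreSQL target. The macro name assert_postgis is hypothetical, and the sketch assumes dbt skips a run-start hook that renders to an empty string on other adapters.

```sql
-- macros/assert_postgis.sql (hypothetical helper)
{% macro assert_postgis() %}
  {% if target.type == 'postgres' %}
    SELECT PostGIS_Version()  -- raises an error if the extension is not installed
  {% endif %}
{% endmacro %}
```

It would be registered in dbt_project.yml as on-run-start: "{{ assert_postgis() }}", so a missing extension stops the run before any model compiles into a failing query.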

Common Failure Modes & Remediation

Symptom	Root cause	Remediation
Spatial join returns zero rows	Mismatched SRIDs across the two tables	Assert one canonical SRID at staging; add a `dbt_utils.expression_is_true` SRID test
`ST_Transform` fails with an unknown (0) SRID error	Geometry tagged `SRID = 0` (unknown)	`ST_SetSRID` to the true source SRID before transforming
Query suddenly does a full scan	Missing or bloated spatial index	Rebuild the GiST index in a `post-hook`; verify with `EXPLAIN ANALYZE`
`ST_Intersects` errors on “non-noded intersection”	Invalid / self-intersecting input geometry	Run `ST_MakeValid` in staging; gate with the `is_valid_geometry` test
CI passes, production fails	Adapter version / function-signature drift	Pin extension versions; route dialect differences through `adapter.dispatch`
Distances are wrong by orders of magnitude	Planar `GEOMETRY` math used for metric distance	Cast to `::geography` or reproject to a metric CRS before measuring

Each of these traces back to a contract that was not enforced at a layer boundary. The discipline that prevents them — canonical CRS, SRID assertions, validity tests, and committed indexes — is the same discipline that makes the rest of this stack portable.

Building Custom Spatial Macros — abstracting vendor-specific spatial functions into reusable, dispatched macros.
Geometry Transformation Pipelines — deterministic projection and topology chains across layers.
Index Hints for Spatial Queries — steering the planner toward spatial access paths.
Optimizing Proximity Joins — partitioning and bounding-box strategies for fast spatial joins.
Choosing the Right Spatial Adapter — PostGIS vs. DuckDB vs. BigQuery trade-offs.
