Optimizing Proximity Joins

Proximity joins are the most computationally punishing operation most spatial pipelines ever run. Matching customer locations to the nearest service radius, linking IoT sensor readings to administrative grid cells, or enriching transactional logs with geographic context all reduce to the same shape: evaluate distances between two coordinate sets and keep the pairs that fall within a threshold. Written naively, that shape is a Cartesian product — every row in dataset A measured against every row in dataset B — and it collapses under its own weight the moment either side grows past a few hundred thousand geometries. This page is part of the Advanced Spatial Macros & UDF Patterns collection, and it covers how to make proximity logic index-driven, deterministic, and cheap enough to run on every dbt invocation.

The core problem is search-space pruning. A correct proximity join produces the same output as the brute-force version, but it reaches that output by letting a spatial index discard the overwhelming majority of candidate pairs before any exact distance is computed. The patterns below show how to wire that pruning into dbt models so the optimization survives refactors, incremental runs, and a switch of warehouse adapter — instead of living in an ad-hoc query that one engineer tuned once and no one dares touch.

Prerequisites checklist

Before building any of the models below, confirm the environment can support index-assisted spatial search:

dbt-core ≥ 1.7 with a spatial-capable adapter — dbt-postgres ≥ 1.7 against PostGIS ≥ 3.3, or dbt-duckdb ≥ 1.7 with the DuckDB spatial extension for CI runs. Adapter trade-offs are compared in choosing the right spatial adapter.
PostGIS extension enabled in the target schema (CREATE EXTENSION postgis;) — verified during PostGIS adapter configuration.
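A single canonical projected SRID agreed for the project, for example EPSG:26910 (UTM Zone 10N). Proximity thresholds in metres only make sense once both sides share a metric CRS. EPSG:3857 suits web-map display, but its nominal metres stretch by roughly 1/cos(latitude), so it is a poor basis for metre-accurate radii away from the equator.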
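Permission to build indexes in the target schema. PostgreSQL has no separate CREATE INDEX grant; the dbt role must own the table, which holds for any table dbt itself creates. Index-driven joins are worthless if the model cannot build its own GiST index in a post_hook.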
Environment variables for connection and SRID. Connection settings use dbt’s env_var() pattern in profiles.yml; the SRID and radius are plain project vars that each environment can override with --vars, so the same models run locally, in CI, and in production without edits.

# dbt_project.yml: project-wide defaults consumed by the proximity models
# dbt reads values under vars literally (Jinja is not rendered here), so keep
# plain defaults and override them per environment, for example:
#   dbt run --vars "{canonical_srid: $DBT_CANONICAL_SRID}"
vars:
  canonical_srid: 26910
  proximity_radius_m: 500

models:
  my_project:
    proximity:
      +materialized: table
      +post-hook: "{{ build_spatial_index(this, 'geom') }}"

Architecture context

A proximity join is an intermediate-layer concern: it consumes validated, reprojected geometries from staging and emits enriched fact rows for marts. It must never see raw, mixed-CRS input — that is the job of the upstream geometry transformation pipeline. Placing the join correctly in the spatial model dependency graph is what guarantees both inputs already carry a known SRID and a valid topology before any distance is measured.

DAG layer	Model	Responsibility	Materialization
staging	`stg_customer_locations`	Reproject to canonical SRID, drop NULL/invalid geometry	view
staging	`stg_service_zones`	Reproject, `ST_MakeValid`, build GiST index	table
intermediate	`int_customer_nearest_zone`	Index-driven proximity join (this page)	incremental table
mart	`fct_customer_coverage`	Aggregate distances into coverage metrics	table
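
As a rough illustration of the staging contract in the table above, a staging model might look like the sketch below. The source name raw.customer_locations and its columns (id, geom, updated_at) are assumptions made for this example, not something the project defines.

```sql
-- models/staging/stg_customer_locations.sql (illustrative sketch)
-- Assumes a hypothetical source raw.customer_locations with columns
-- id, geom (carrying a known SRID) and updated_at.
{{ config(materialized='view') }}

SELECT
  id,
  ST_Transform(geom, {{ var('canonical_srid') }}) AS geom,  -- reproject to the canonical SRID
  updated_at
FROM {{ source('raw', 'customer_locations') }}
WHERE geom IS NOT NULL
  AND ST_SRID(geom) <> 0      -- an unknown SRID cannot be reprojected
  AND ST_IsValid(geom)        -- drop invalid geometry before it reaches the join
```

Filtering here, rather than in the join, keeps the intermediate model free of defensive predicates that could get in the way of index use.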

Configuration walkthrough

Two pieces of configuration make every downstream proximity model cheaper. The first is an on-run-start hook that fails fast when the spatial extension is missing — a proximity join that silently falls back to sequential scans is worse than one that refuses to run. The second is a reusable index-builder macro invoked from each model’s post_hook.

# dbt_project.yml
on-run-start:
  - "{{ assert_postgis_available() }}"

-- macros/assert_postgis_available.sql
{% macro assert_postgis_available() %}
  {% if execute and target.type == 'postgres' %}
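    {# Query the catalog instead of calling PostGIS_Version(): on a database without
       the extension that call raises a database error rather than returning zero rows. #}
    {% set result = run_query("SELECT 1 FROM pg_extension WHERE extname = 'postgis'") %}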
    {% if result.rows | length == 0 %}
      {{ exceptions.raise_compiler_error("PostGIS not available on " ~ target.name) }}
    {% endif %}
  {% endif %}
{% endmacro %}

-- macros/build_spatial_index.sql
{% macro build_spatial_index(relation, geom_col='geom') %}
  CREATE INDEX IF NOT EXISTS
    {{ (relation.identifier ~ '_' ~ geom_col ~ '_gist') | trim }}
  ON {{ relation }}
  USING GIST ({{ geom_col }});
{% endmacro %}

PostGIS implements its R-tree spatial index on top of GiST, and that index accelerates both the bounding-box overlap (&&) and nearest-neighbour (<->) operators. Without it, the planner has no choice but a sequential scan, and every optimization below degrades to brute force.

Core implementation

CRS alignment and distance fidelity

Distance calculations are only as accurate as the coordinate system underneath them. A geometry stored in EPSG:4326 (WGS84) holds coordinates in decimal degrees, so ST_Distance on two such geometries returns a degree value — meaningless for a metre-scale threshold and wildly distorted away from the equator. Two options resolve this: reproject both sides to a metric projected CRS with ST_Transform, or cast to the geography type so PostGIS computes true spheroidal distances. The geography path is more accurate over long distances but costs more CPU and bypasses some planar index optimizations, so for radius joins inside a single UTM zone, reproject to a projected SRID and stay in the geometry domain.
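
The difference between the three approaches is easy to see on a single pair of points. The query below is a minimal sketch; the coordinates are illustrative values in Vancouver (inside UTM Zone 10N), chosen at the same latitude so the degree distance is simply the longitude difference.

```sql
-- Two illustrative points about 1 km apart along the same parallel.
WITH pts AS (
  SELECT
    ST_SetSRID(ST_MakePoint(-123.1207, 49.2827), 4326) AS a,
    ST_SetSRID(ST_MakePoint(-123.1070, 49.2827), 4326) AS b
)
SELECT
  ST_Distance(a, b)                                            AS degrees_4326,     -- about 0.0137, in degrees
  ST_Distance(ST_Transform(a, 26910), ST_Transform(b, 26910))  AS metres_utm10n,    -- planar metres
  ST_Distance(a::geography, b::geography)                      AS metres_spheroid   -- spheroidal metres
FROM pts;
```

The two metric columns should agree closely at this scale, while the first column is the "tiny decimal" that signals an unprojected input.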

Guard the contract with a validation macro that surfaces mismatched SRIDs instead of letting them produce silently wrong distances. The query it emits checks every row of both inputs and returns one row per offending model, so it can back a singular dbt test that fails whenever anything comes back:

-- macros/validate_proximity_srids.sql
-- Returns one row per model that holds geometries outside the expected SRID.
-- Zero rows means both inputs honour the contract.
{% macro validate_proximity_srids(model_a, model_b, target_srid=none) %}
  {% set srid = target_srid or var('canonical_srid') %}
  {% for model_name in [model_a, model_b] %}
  SELECT
    '{{ model_name }}' AS model_name,
    COUNT(*)           AS mismatched_rows
  FROM {{ ref(model_name) }}
  WHERE ST_SRID(geom) <> {{ srid }}
  HAVING COUNT(*) > 0
  {% if not loop.last %}UNION ALL{% endif %}
  {% endfor %}
{% endmacro %}

Index-driven nearest-neighbour joins

The most performant nearest-neighbour pattern in PostGIS pairs the <-> KNN distance operator with ORDER BY and LIMIT inside a CROSS JOIN LATERAL. This is what forces an index-assisted KNN search: for each driving row the planner walks the GiST index to find the closest candidate, instead of computing exact distances for the entire opposite table.

-- models/proximity/int_customer_nearest_zone.sql
SELECT
  a.id          AS point_id,
  b.id          AS nearest_zone_id,
  a.geom,       -- kept so the project-level post-hook has a geom column to index
  ST_Distance(a.geom, b.geom) AS exact_distance_m
FROM {{ ref('stg_customer_locations') }} a
CROSS JOIN LATERAL (
  SELECT id, geom
  FROM {{ ref('stg_service_zones') }}
  ORDER BY geom <-> a.geom
  LIMIT 1
) b

The ORDER BY geom <-> a.geom is the load-bearing clause: PostGIS only uses the index for KNN when the ordering expression references an indexed column against a constant-per-row geometry, which the lateral subquery supplies. Drop the LIMIT, or wrap the operator in a function, and the planner reverts to a sort over a full scan. Deeper execution-plan analysis of this exact pattern lives in speeding up nearest-neighbor joins in PostGIS.

Radius joins with ST_DWithin and bounding-box pre-filtering

When the question is “every zone within N metres” rather than “the single nearest zone”, ST_DWithin is the correct predicate. It is index-aware: PostGIS expands each geometry’s bounding box by the radius, probes the GiST index with a && overlap test, and only computes exact distances for the survivors. A hand-rolled ST_Distance(...) < N filter selects essentially the same pairs (it differs only in excluding pairs exactly N apart), but the planner cannot push it into the index, so every candidate pair pays for an exact distance calculation.

-- models/proximity/int_customer_zones_in_range.sql
SELECT
  a.id            AS point_id,
  b.id            AS zone_id,
  a.geom,         -- kept so the project-level post-hook has a geom column to index
  ST_Distance(a.geom, b.geom) AS distance_m
FROM {{ ref('stg_customer_locations') }} a
JOIN {{ ref('stg_service_zones') }} b
  ON ST_DWithin(a.geom, b.geom, {{ var('proximity_radius_m') }})

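For joins where both sides are large, keep in mind that ST_DWithin already applies the && bounding-box filter, so repeating it in front of ST_DWithin gains nothing. An explicit pre-filter pays off in front of a precise predicate that is not index-aware, where it trims the candidate set before the planner reaches the costlier exact math. This is the same pruning the dispatch layer in spatial macro UDF patterns can inject automatically.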
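
Written out by hand, a bounding-box pre-filter looks roughly like the sketch below. It is only an illustration of the mechanism, using ST_Expand to grow the driving geometry's box by the radius; it approximates what ST_DWithin does for you and is not a replacement for it.

```sql
-- Illustrative only: an explicit bounding-box probe followed by an exact check.
SELECT
  a.id AS point_id,
  b.id AS zone_id
FROM {{ ref('stg_customer_locations') }} a
JOIN {{ ref('stg_service_zones') }} b
  -- index-assisted probe: zone boxes that overlap the customer's box grown by the radius
  ON b.geom && ST_Expand(a.geom, {{ var('proximity_radius_m') }})
  -- exact check, evaluated only for the pairs that survived the probe
 AND ST_Distance(a.geom, b.geom) <= {{ var('proximity_radius_m') }}
```

The same shape applies when the exact check is some other expensive, non-index-aware expression: the && line stays, and only the second condition changes.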

Validation and testing

Spatial joins fail in ways traditional relational tests rarely catch: a silent SRID mismatch, a NULL geometry that quietly drops out of the join, or a radius threshold that returns an empty result set without raising an error. Verify setup before trusting output.

-- ad-hoc (compile with dbt, or keep as an analysis): confirm an input carries the canonical SRID and is valid;
-- repeat for stg_customer_locations
SELECT 'service_zones' AS model,
       COUNT(*) FILTER (WHERE NOT ST_IsValid(geom)) AS invalid,
       COUNT(*) FILTER (WHERE ST_SRID(geom) <> {{ var('canonical_srid') }}) AS wrong_srid
FROM {{ ref('stg_service_zones') }};

-- confirm the planner actually uses the GiST index for the KNN join
EXPLAIN ANALYZE
SELECT a.id, b.id
FROM stg_customer_locations a
CROSS JOIN LATERAL (
  SELECT id FROM stg_service_zones
  ORDER BY geom <-> a.geom LIMIT 1
) b;
-- expect "Index Scan using ..._geom_gist" in the inner loop, not "Seq Scan"

Encode these as dbt tests so regressions fail the build rather than the dashboard:

# models/proximity/_proximity.yml
version: 2
models:
  - name: int_customer_nearest_zone
    columns:
      - name: nearest_zone_id
        tests:
          - not_null
      - name: exact_distance_m
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 50000      # physical sanity bound, in metres
    tests:
      - dbt_utils.equal_rowcount:
          compare_model: ref('stg_customer_locations')
          # one nearest zone per customer: the join should neither drop nor duplicate input rows

The OGC Simple Features rules that ST_IsValid enforces are the foundation these tests build on; a self-intersecting polygon can corrupt an exact distance or intersection result as surely as a wrong SRID can skew a distance.
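
One way to turn the validity check into a build-failing test is a singular dbt test, which fails whenever its query returns rows. The file name below is hypothetical, and the sketch assumes stg_service_zones exposes id and geom as in the models above.

```sql
-- tests/assert_service_zones_valid.sql (hypothetical file name)
-- Any row returned here fails the test and names the reason.
SELECT
  id,
  ST_SRID(geom)          AS srid,
  ST_IsValidReason(geom) AS reason
FROM {{ ref('stg_service_zones') }}
WHERE geom IS NULL
   OR NOT ST_IsValid(geom)
   OR ST_SRID(geom) <> {{ var('canonical_srid') }}
```

ST_IsValidReason makes the failure output actionable, since it reports what is wrong with each offending geometry.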

Advanced patterns

Incremental proximity joins. Recomputing every nearest-neighbour pair on each run is wasteful when only a fraction of records change. An incremental model restricts the expensive join to newly ingested or updated driving rows, merging the results into the existing table:

-- models/proximity/int_customer_nearest_zone.sql
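{{ config(
    materialized='incremental',
    unique_key='point_id',
    incremental_strategy='merge',
    post_hook="{{ build_spatial_index(this, 'geom') }}"
) }}
-- 'merge' relies on the MERGE statement (PostgreSQL 15+); use 'delete+insert' on older servers.
-- Rows are recomputed only when the customer changes; a change to stg_service_zones needs a full refresh.

WITH new_records AS (
  SELECT id, geom, updated_at
  FROM {{ ref('stg_customer_locations') }}
  {% if is_incremental() %}
    -- COALESCE keeps the filter open when the target table is still empty
    WHERE updated_at > (SELECT COALESCE(MAX(updated_at), '-infinity') FROM {{ this }})
  {% endif %}
)
SELECT
  nr.id         AS point_id,
  sz.id         AS nearest_zone_id,
  nr.geom,
  ST_Distance(nr.geom, sz.geom) AS exact_distance_m,  -- same column name the schema tests expect
  nr.updated_at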
FROM new_records nr
CROSS JOIN LATERAL (
  SELECT id, geom
  FROM {{ ref('stg_service_zones') }}
  ORDER BY geom <-> nr.geom
  LIMIT 1
) sz

Macro parameterization. Wrap the lateral-join shape in a single macro so source models, geometry columns, and the LIMIT (K) become arguments. One interface then drives every nearest-neighbour model in the project, and an index hint or bounding-box pre-filter added inside the macro propagates everywhere at once. Construction details for this style of abstraction are in building custom spatial macros.
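
A minimal sketch of such a macro is shown below. The macro name, its argument names, and the output column names are all hypothetical; the body is just the lateral KNN shape from earlier with the moving parts lifted into arguments.

```sql
-- macros/nearest_neighbours.sql (hypothetical name and signature)
-- driving / candidates: relations, passed as ref(...) from the calling model
-- k: how many neighbours to keep per driving row
{% macro nearest_neighbours(driving, candidates, k=1, id_col='id', geom_col='geom') %}
SELECT
  d.{{ id_col }}   AS driving_id,
  c.{{ id_col }}   AS candidate_id,
  d.{{ geom_col }} AS geom,
  ST_Distance(d.{{ geom_col }}, c.{{ geom_col }}) AS distance
FROM {{ driving }} d
CROSS JOIN LATERAL (
  SELECT {{ id_col }}, {{ geom_col }}
  FROM {{ candidates }}
  -- keep the operator bare so the planner can use the GiST index for KNN
  ORDER BY {{ geom_col }} <-> d.{{ geom_col }}
  LIMIT {{ k }}
) c
{% endmacro %}
```

A model would then call it as {{ nearest_neighbours(ref('stg_customer_locations'), ref('stg_service_zones'), k=3) }}. Passing ref(...) from the model file, rather than a model name string, keeps the dependency visible to dbt in the model itself.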

Multi-engine compatibility. The DuckDB spatial extension supports ST_DWithin and an R-tree index, but its KNN ergonomics differ from PostGIS: there is no <-> lateral idiom. Radius joins via ST_DWithin are therefore the portable choice for models that must pass CI on DuckDB (see DuckDB spatial extension integration) before promotion to PostGIS. Route the engine-specific difference through a dispatched macro rather than branching SQL by hand.
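
The dispatch seam could look something like the following sketch. The macro name nearest_within and the namespace my_project are assumptions for illustration; both implementations return the closest candidate inside the radius, with PostgreSQL using the lateral KNN form and every other adapter falling back to a portable window-function form.

```sql
-- macros/nearest_within.sql (hypothetical macro; 'my_project' is the assumed project name)
{% macro nearest_within(driving, candidates, radius) %}
  {{ return(adapter.dispatch('nearest_within', 'my_project')(driving, candidates, radius)) }}
{% endmacro %}

{# Portable fallback (used by DuckDB): radius join, then keep the closest match per driving row. #}
{% macro default__nearest_within(driving, candidates, radius) %}
SELECT point_id, zone_id, distance_m
FROM (
  SELECT
    d.id AS point_id,
    c.id AS zone_id,
    ST_Distance(d.geom, c.geom) AS distance_m,
    ROW_NUMBER() OVER (PARTITION BY d.id ORDER BY ST_Distance(d.geom, c.geom)) AS rn
  FROM {{ driving }} d
  JOIN {{ candidates }} c
    ON ST_DWithin(d.geom, c.geom, {{ radius }})
) ranked
WHERE rn = 1
{% endmacro %}

{# PostGIS: index-assisted KNN inside a lateral subquery, limited to the same radius. #}
{% macro postgres__nearest_within(driving, candidates, radius) %}
SELECT
  d.id AS point_id,
  c.id AS zone_id,
  ST_Distance(d.geom, c.geom) AS distance_m
FROM {{ driving }} d
CROSS JOIN LATERAL (
  SELECT id, geom
  FROM {{ candidates }}
  WHERE ST_DWithin(geom, d.geom, {{ radius }})
  ORDER BY geom <-> d.geom
  LIMIT 1
) c
{% endmacro %}
```

Both branches drop driving rows that have no candidate inside the radius, which keeps row counts comparable between the CI and production engines; ties at equal distance may still resolve differently.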

Troubleshooting

Symptom	Root cause	Fix
Join runs for minutes, `EXPLAIN` shows `Seq Scan`	No GiST index, or `<->` wrapped in a function so the planner can’t use KNN	Build the index in a `post_hook`; keep `ORDER BY geom <-> a.geom LIMIT k` literal
Distances are tiny decimals (e.g. `0.004`)	Geometries still in `EPSG:4326`; `ST_Distance` returned degrees	Reproject both sides to the canonical metric SRID with `ST_Transform`, or cast to `geography`
`ST_DWithin` returns far fewer rows than expected	Radius given in degrees against a projected CRS, or vice-versa	Match the radius units to the SRID’s units; in metric SRIDs the radius is metres
Empty result from a nearest-neighbour join	`NULL` or invalid geometries on one side excluded silently	Add `not_null` + `ST_IsValid` staging tests; run `ST_MakeValid` upstream
Query memory spills / OOM on large radius joins	Cartesian-scale candidate set before the exact predicate	Use `ST_DWithin` (it applies the `&&` bounding-box filter itself) rather than `ST_Distance(...) < n`; lower the radius or partition the driving set

FAQ

Why does my KNN join ignore the GiST index even though it exists?

The <-> operator only triggers an index-assisted KNN scan when it sits directly in an ORDER BY that references the indexed column against a per-row constant geometry, with a LIMIT. Wrapping it in a function, adding arithmetic, or dropping the LIMIT makes the planner fall back to sorting a full scan. Keep the lateral subquery exactly ORDER BY geom <-> a.geom LIMIT k.

Should I use ST_DWithin or ST_Distance(...) < n for a radius join?

Always ST_DWithin. It expands each bounding box by the radius and probes the GiST index before computing exact distances, so only nearby candidates reach the exact check. ST_Distance(...) < n returns essentially the same rows but cannot use the index, so it measures every pair. ST_Distance should appear only to report the measured distance on rows the predicate already kept.

Should I use the geometry or geography type for proximity work?

Use projected geometry in a metric SRID when your data fits inside one UTM zone or a local projection — it is faster and keeps full planar index support. Reach for geography only when matches span large distances or cross zones, where spheroidal accuracy matters more than the extra CPU and the partial loss of planar index optimizations.

How do I make an incremental proximity join only recompute changed rows?

Materialize the model incremental with a merge strategy and unique_key, then gate the driving CTE with {% if is_incremental() %} on an updated_at watermark. Only new or changed points run through the lateral KNN join; existing nearest-zone assignments are merged in place. Rebuild the GiST index in a post_hook after a full refresh.

Why does the same join return different results in DuckDB and PostGIS?

The two engines differ in default distance units and KNN ergonomics. DuckDB has no <-> lateral idiom, so a PostGIS KNN model has no direct equivalent — use the portable ST_DWithin radius form for CI validation, and route any engine-specific syntax through a dispatched macro so one interface emits correct SQL per adapter.

Speeding up nearest-neighbor joins in PostGIS — execution-plan analysis of the <-> lateral pattern.
Building Custom Spatial Macros — parameterize the lateral-join shape into one reusable interface.
Index Hints for Spatial Queries — steer the planner toward GiST access paths.
Geometry Transformation Pipelines — the upstream CRS and topology contract every join depends on.
Choosing the Right Spatial Adapter — PostGIS vs. DuckDB trade-offs behind these patterns.
