Speeding up nearest-neighbor joins in PostGIS

This page shows you how to rewrite a nearest-neighbor spatial join so PostGIS resolves each match with an index-assisted KNN scan — the <-> operator inside a CROSS JOIN LATERAL — instead of an ST_Distance sweep over every candidate row, and how to embed that pattern in an incremental dbt model.

When to use this approach

Reach for index-driven KNN — rather than a brute-force distance scan or a fixed-radius filter — when any of these hold:

You need the single closest row (or the N closest), not everything inside a radius. If you instead want radius membership (“every facility within 500 m”), use an ST_DWithin predicate — see writing reusable ST_DWithin macros in dbt. KNN answers “which one is nearest”; ST_DWithin answers “which are within”.
Your current join already uses a distance function and has slowed to a crawl. A correlated subquery or CROSS JOIN that wraps ST_Distance cannot use a GiST index and degrades into an O(n·m) sequential scan as either side grows. This is the core search-space problem laid out in optimizing proximity joins.
The join runs on every dbt invocation. KNN logic is cheap enough to schedule continuously only when an index does the pruning; an unindexed scan is fine for a one-off backfill but unsustainable on an incremental cadence.
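
For orientation, the brute-force shape described above often looks something like the sketch below. It is an illustration rather than a query from any particular project; the table and column names are borrowed from the examples later on this page, and distance_m is a hypothetical alias.

```sql
-- Anti-pattern sketch: every customer is compared against every facility.
-- ST_Distance is not index-assisted, so the planner evaluates the full
-- Cartesian product and sorts it before DISTINCT ON keeps one row per customer.
SELECT DISTINCT ON (c.customer_id)
  c.customer_id,
  f.facility_id,
  ST_Distance(c.geom, f.geom) AS distance_m
FROM stg_customer_points c
CROSS JOIN stg_facility_polygons f
ORDER BY c.customer_id, ST_Distance(c.geom, f.geom);
```

If a query in your project has this shape, it is a candidate for the LATERAL rewrite in step 2.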

Prerequisites

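PostGIS ≥ 3.1 on PostgreSQL ≥ 13. Earlier PostGIS revisions also support <->, and true-distance ordering (bounding-box index order plus an exact-distance recheck) dates back to PostGIS 2.2 on PostgreSQL 9.5, but the pattern used here is most reliable on 3.1+.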
dbt-core ≥ 1.7 with dbt-postgres ≥ 1.7.
A single projected, metric SRID shared by both tables (for example EPSG:26910 UTM Zone 10N, or EPSG:3857 for web-map work, bearing in mind that Web Mercator inflates distances away from the equator). KNN distance is returned in the units of the underlying CRS, so both sides must be pre-projected: normalize at the staging layer with ST_Transform, never at execution time.
CREATE INDEX grant on the target schema so the model can build and maintain its own GiST index.
Connection and SRID via env_var() so the same model runs locally, in CI, and in production unchanged:

# dbt_project.yml
vars:
  canonical_srid: "{{ env_var('DBT_CANONICAL_SRID', '26910') }}"
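
As a rough sketch of how that variable might be consumed, a staging model can reproject once and expose a single geom column. The file path, the source name raw.customer_points, and the assumption that the raw geometry already carries a valid SRID are all hypothetical; the model is materialized as a table here because a view cannot carry its own GiST index.

```sql
-- models/staging/stg_customer_points.sql (hypothetical path)
{{ config(materialized='table') }}

SELECT
  customer_id,
  updated_at,
  -- Reproject once at staging so downstream joins never call ST_Transform.
  ST_Transform(geom, {{ var('canonical_srid') | int }}) AS geom
FROM {{ source('raw', 'customer_points') }}  -- hypothetical source
```

The int filter guards against the variable arriving as a string, since ST_Transform expects an integer SRID.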

Step-by-step instructions

1. Audit and rebuild GiST indexes

Before touching query logic, confirm that GiST indexes exist on both geometry columns and that planner statistics are current. A missing, bloated, or untransformed index forces a sequential scan no matter how the query is written.

-- Verify existing spatial indexes
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename IN ('stg_customer_points', 'stg_facility_polygons')
  AND indexdef ILIKE '%gist%';

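Expected: one USING gist (geom) row per table. If an index is missing or built on an untransformed column, build it without locking writes, then refresh statistics. CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so issue it from a plain session rather than a transactional dbt hook: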

CREATE INDEX CONCURRENTLY idx_facility_geom_gist
  ON stg_facility_polygons USING GIST (geom);

ANALYZE stg_customer_points;
ANALYZE stg_facility_polygons;

For tables under heavy DML, schedule a periodic VACUUM ANALYZE so index bloat does not degrade KNN traversal. See the official PostgreSQL GiST documentation for maintenance guidance.
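
When an existing index is suspected of bloat, one possible check-and-rebuild sequence is sketched below. It assumes the index name used above and PostgreSQL 12 or later, where REINDEX supports CONCURRENTLY; the size check only gives a before/after figure and is not a bloat measurement in itself.

```sql
-- Record the index size before rebuilding.
SELECT pg_size_pretty(pg_relation_size('idx_facility_geom_gist')) AS index_size;

-- Rebuild without blocking writes (must run outside a transaction block).
REINDEX INDEX CONCURRENTLY idx_facility_geom_gist;

-- Refresh planner statistics afterwards.
ANALYZE stg_facility_polygons;
```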

2. Replace the distance scan with a KNN LATERAL join

Swap any CROSS JOIN filter or correlated subquery for a LATERAL join that orders by the <-> KNN operator. This forces one index scan per driving row that stops the moment it finds the closest geometry, instead of materializing the full Cartesian product.

SELECT
  c.customer_id,
  f.facility_id,
  ST_Distance(c.geom, f.geom) AS exact_distance_meters
FROM stg_customer_points c
CROSS JOIN LATERAL (
  SELECT facility_id, geom
  FROM stg_facility_polygons
  ORDER BY c.geom <-> geom
  LIMIT 1
) f;

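On PostgreSQL ≥ 9.5 with PostGIS ≥ 2.2, the <-> operator returns the true 2D distance between the two geometries in CRS units: the GiST index orders candidates by bounding box and the scan rechecks each one against the exact distance, so LIMIT 1 yields the real nearest neighbor. The outer ST_Distance then reports the precise planar distance for the single matched row only. Verify the plan uses the index before going further: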

EXPLAIN (ANALYZE, BUFFERS)
SELECT c.customer_id, f.facility_id
FROM stg_customer_points c
CROSS JOIN LATERAL (
  SELECT facility_id, geom FROM stg_facility_polygons
  ORDER BY c.geom <-> geom LIMIT 1
) f;

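Expected: the inner node reads Index Scan using idx_facility_geom_gist with an Order By line on the <-> expression, not a Seq Scan feeding a Sort. Do not expect an Index Only Scan: the PostGIS GiST index stores bounding boxes only, so the geometry itself is always fetched from the table.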
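
The same pattern extends to the N closest rows by raising the LIMIT. The sketch below returns the three nearest facilities per customer; distance_m and nearest_rank are hypothetical column names, and the value of <-> is in CRS units.

```sql
SELECT
  c.customer_id,
  f.facility_id,
  f.distance_m,
  -- Rank the matches per customer, closest first.
  ROW_NUMBER() OVER (
    PARTITION BY c.customer_id
    ORDER BY f.distance_m
  ) AS nearest_rank
FROM stg_customer_points c
CROSS JOIN LATERAL (
  SELECT facility_id, c.geom <-> geom AS distance_m
  FROM stg_facility_polygons
  ORDER BY c.geom <-> geom   -- keep the operator as a bare ORDER BY term
  LIMIT 3
) f;
```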

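3. Add an optional distance cutoff

To stop the scan from walking the whole tree for points that have no nearby match, bound the candidate set with ST_DWithin. A WHERE clause on the <-> expression itself is not index-assisted: PostgreSQL applies it as a filter to rows coming out of the ordered scan, so a customer with no match inside the cutoff would still read every index entry. ST_DWithin is evaluated as an index condition on the bounding boxes, so distant candidates are pruned before the ordering step.

CROSS JOIN LATERAL (
  SELECT facility_id, geom
  FROM stg_facility_polygons
  WHERE ST_DWithin(c.geom, geom, 50000)   -- discard candidates > 50 km away
  ORDER BY c.geom <-> geom
  LIMIT 1
) f

Expected: rows with no facility inside the cutoff drop out (the LATERAL yields no row), so customers in empty regions no longer trigger full-tree traversal.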
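
If dropping unmatched customers is not what you want, a LEFT JOIN LATERAL keeps them with NULL columns instead. The following is a sketch under the same table names as above; the 50 km cutoff is expressed with ST_DWithin, which is index-assisted, and the distance alias is the one used in step 2.

```sql
SELECT
  c.customer_id,
  f.facility_id,                                      -- NULL when nothing is within the cutoff
  ST_Distance(c.geom, f.geom) AS exact_distance_meters
FROM stg_customer_points c
LEFT JOIN LATERAL (
  SELECT facility_id, geom
  FROM stg_facility_polygons
  WHERE ST_DWithin(c.geom, geom, 50000)
  ORDER BY c.geom <-> geom
  LIMIT 1
) f ON true;
```

Keeping the unmatched rows makes it easier to audit how many customers fall outside the cutoff.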

4. Embed the KNN join in an incremental dbt model

A full-refresh table model re-runs the entire KNN scan on every build, which is unsustainable past tens of millions of rows. Materialize incrementally so only new or changed driving rows are matched.

{{ config(
    materialized='incremental',
    unique_key='customer_id',
    on_schema_change='sync_all_columns',
    pre_hook=["SET LOCAL work_mem = '256MB';"],
    post_hook="CREATE INDEX IF NOT EXISTS {{ this.name }}_geom_gist ON {{ this }} USING GIST (geom)"
) }}

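WITH new_customers AS (
  -- updated_at is carried through so the incremental filter can read it back
  -- from the model's own table on the next run.
  SELECT customer_id, geom, updated_at
  FROM {{ ref('stg_customer_points') }}
  {% if is_incremental() %}
    WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
  {% endif %}
)

SELECT
  nc.customer_id,
  f.facility_id,
  ST_Distance(nc.geom, f.geom) AS exact_distance_meters,
  nc.geom,         -- required by the GiST index built in the post_hook
  nc.updated_at    -- high-water mark for the next incremental run
FROM new_customers nc
CROSS JOIN LATERAL (
  SELECT facility_id, geom
  FROM {{ ref('stg_facility_polygons') }}
  ORDER BY nc.geom <-> geom
  LIMIT 1
) f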

Expected: dbt run --select m_customer_nearest_facility reports a small rows affected count after the first build instead of a full re-scan, dropping incremental run times from hours to minutes. The SET LOCAL in the pre_hook raises sort memory for the transaction only, so concurrent sessions are unaffected. Partitioning and incremental state strategies that complement this are covered in optimizing proximity joins.
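
As an alternative to a hand-written post_hook, dbt-postgres accepts an indexes config. The sketch below is one way it might look and assumes the model's final SELECT exposes both geom and customer_id; adjust the column lists to whatever the model actually outputs.

```sql
{{ config(
    materialized='incremental',
    unique_key='customer_id',
    indexes=[
      {'columns': ['geom'], 'type': 'gist'},          -- spatial index on the output
      {'columns': ['customer_id'], 'unique': True}    -- supports the unique_key lookup
    ]
) }}
```

Use either this config or the post_hook, not both, to avoid building two equivalent indexes.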

5. Tune planner costs for SSD-backed indexes

Even with a correct query, default planner costs can still favour a sequential scan. Align them with how spatial indexes actually behave on SSDs, scoped to the session during profiling.

SET random_page_cost = 1.1;             -- favour index scans over seq scans
SET effective_cache_size = '12GB';      -- ~75% of system RAM
SET enable_seqscan = off;               -- profiling only: force index usage

Expected: re-running the EXPLAIN (ANALYZE, BUFFERS) from step 2 now shows a lower estimated cost on the Index Scan path. Reset enable_seqscan to on before production — it is a diagnostic lever to confirm a usable plan exists, not a permanent setting.
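
One way to make sure the profiling settings cannot leak is to scope them to a transaction with SET LOCAL, as sketched below with the query from step 2. Note that EXPLAIN ANALYZE really executes the SELECT; the ROLLBACK only discards the settings.

```sql
BEGIN;
SET LOCAL random_page_cost = 1.1;
SET LOCAL enable_seqscan = off;   -- profiling only

EXPLAIN (ANALYZE, BUFFERS)
SELECT c.customer_id, f.facility_id
FROM stg_customer_points c
CROSS JOIN LATERAL (
  SELECT facility_id, geom FROM stg_facility_polygons
  ORDER BY c.geom <-> geom LIMIT 1
) f;

ROLLBACK;                         -- SET LOCAL values revert here

-- If a plain session-level SET was used instead, undo it explicitly:
RESET enable_seqscan;
```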

Configuration reference

Parameter	Accepted values	Default	Spatial-specific notes
`work_mem`	size string, e.g. `256MB`	`4MB`	Set via `SET LOCAL` in a `pre_hook`; sorts of bounding-box intermediates spill to disk when too low.
`random_page_cost`	float	`4.0`	Lower to `1.1` on SSD so the planner prefers GiST index scans for KNN.
`effective_cache_size`	size string	`4GB`	Set to ~75% of system RAM; informs the planner about OS cache, not an allocation.
`enable_seqscan`	`on` / `off`	`on`	`off` only while profiling to prove a KNN index plan exists; reset before production.
`ST_DWithin` cutoff	distance in CRS units	none	Optional `WHERE ST_DWithin(c.geom, geom, N)` to skip full-tree traversal for far-flung points; a `WHERE` on `<->` itself is not index-assisted.
`materialized`	`table` / `incremental`	`view`	Use `incremental` with `unique_key` so KNN runs only on changed driving rows.

Gotchas & edge cases

An in-query ST_Transform kills the index. PostGIS does not reproject implicitly: if the two tables hold different SRIDs, the distance functions fail with an "Operation on mixed SRID geometries" error. The tempting fix, wrapping the column in ST_Transform inside the join, makes the plain GiST index on geom unusable and reprojects every row on every run. Pre-project both sides to one metric SRID at staging; verify with SELECT DISTINCT ST_SRID(geom) on each table.
<-> on a geographic (degree) CRS measures degrees, not metres. A 50000 cutoff on EPSG:4326 means 50 000 degrees, which is effectively no filter, and planar distance in degrees also skews the ranking away from the equator because a degree of longitude shrinks with latitude. Index and order on a projected metric CRS so that both the ordering and the distance literals are meaningful.
The answer must be exact, so know how <-> orders. On the versions listed in the prerequisites, the index scan orders by bounding box and rechecks against true distance, so LIMIT 1 is exact for points and polygons alike. On stacks older than PostGIS 2.2 or PostgreSQL 9.5, <-> orders by bounding-box centroid only, which can differ from true distance for large polygons; there, fetch the top few candidates (LIMIT 5) and pick the minimum exact ST_Distance.
Wrapping <-> in a function or CASE disables the index. Keep the operator a bare ORDER BY term over a plain, correctly-typed geom column. Steering the planner when it still refuses is covered in index hints for spatial queries.
Stale statistics revert the plan to Seq Scan. After a bulk load, run ANALYZE before the first KNN build, or the planner mis-estimates row counts and skips the index.
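
The SRID check mentioned in the first gotcha can be run across both tables in one pass. A minimal sketch, assuming the staging table names used throughout this page:

```sql
-- Each table should return exactly one row, and both rows should show the same SRID.
SELECT 'stg_customer_points' AS table_name,
       ST_SRID(geom)         AS srid,
       count(*)              AS row_count
FROM stg_customer_points
GROUP BY ST_SRID(geom)
UNION ALL
SELECT 'stg_facility_polygons',
       ST_SRID(geom),
       count(*)
FROM stg_facility_polygons
GROUP BY ST_SRID(geom);
```

A row with srid = 0 indicates geometries loaded without any SRID, which also need fixing at staging.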

FAQ

Why is my KNN join still doing a sequential scan?

The planner cannot use a GiST index when the <-> expression is wrapped in a function, CASE, or COALESCE, when the indexed column is cast or reprojected at execution time (for example an in-query ST_Transform added to paper over a CRS mismatch), or when statistics are stale. Keep ORDER BY c.geom <-> geom as a bare term over a pre-projected, correctly-typed column, run ANALYZE, and confirm an Index Scan with EXPLAIN (ANALYZE, BUFFERS).

Should I use the <-> KNN operator or ST_DWithin?

Use <-> in a CROSS JOIN LATERAL when you want the single nearest (or N nearest) rows regardless of absolute distance. Use ST_DWithin for fixed-radius membership such as “everything within 500 m”. They answer different questions; the radius pattern is in writing reusable ST_DWithin macros in dbt.

Why does <-> return distances that look wrong?

<-> reports distance in the units of the column’s CRS. On a geographic CRS (EPSG:4326) that is degrees, not metres, so thresholds and outputs look nonsensical. Reproject both tables to a projected metric SRID before indexing so the operator measures metres.

Can the bounding-box ordering give the wrong nearest row?

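On the versions listed in the prerequisites, no: the index scan orders by bounding box but rechecks against exact distance, so LIMIT 1 is exact for points and polygons. On stacks older than PostGIS 2.2 or PostgreSQL 9.5, <-> orders by bounding-box centroid, which for large polygons can rank a slightly-further geometry first. There, fetch a small candidate set (LIMIT 5) ordered by <->, then pick the row with the minimum exact ST_Distance.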
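
The candidate-then-recheck pattern can be sketched as a nested LATERAL: the inner query takes five candidates in index order, and the outer one keeps the candidate with the smallest exact distance. The alias names cand and exact_distance_m are hypothetical, and the candidate count of five is a starting point rather than a tuned value.

```sql
SELECT
  c.customer_id,
  f.facility_id,
  f.exact_distance_m
FROM stg_customer_points c
CROSS JOIN LATERAL (
  SELECT cand.facility_id,
         ST_Distance(c.geom, cand.geom) AS exact_distance_m
  FROM (
    -- Index-ordered candidate set.
    SELECT facility_id, geom
    FROM stg_facility_polygons
    ORDER BY c.geom <-> geom
    LIMIT 5
  ) cand
  -- Re-rank the handful of candidates by exact distance.
  ORDER BY exact_distance_m
  LIMIT 1
) f;
```

The exact-distance sort touches only five rows per customer, so its cost is small next to the index scan.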

How do I keep the incremental model fast as the target table grows?

Maintain the GiST index on the model output in a post_hook, filter driving rows with is_incremental() so only new or updated points are matched, and add an ST_DWithin distance cutoff to avoid full-tree traversal for points with no nearby match. Partitioning strategies are detailed in optimizing proximity joins.

Writing reusable ST_DWithin macros in dbt — the fixed-radius alternative for membership rather than nearest-row queries.
Using spatial index hints in dbt materializations — steering the planner when it still bypasses the GiST index.
Automating CRS conversions in dbt pipelines — pre-projecting both sides to one metric SRID so <-> stays index-usable.
