Handling Large Geospatial Datasets

Scaling spatial analytics past tens of millions of geometries exposes a problem that tabular pipelines never hit: geometry is heavy, topology is expensive to evaluate, and a single mis-projected feed can silently corrupt every downstream join. Row-oriented storage, ingestion-time partitioning, and naive ST_Intersects across unpartitioned tables all collapse under the computational weight of production location intelligence. This guide covers the four controls that keep a warehouse-scale spatial DAG fast and trustworthy — spatially aware partitioning, deterministic coordinate reference system enforcement, compute-optimized join strategies, and CI-gated validity testing — as part of the broader Spatial Data Architecture & Governance discipline.

The goal is to treat geometry as a first-class, partitioned, strictly governed data type rather than an opaque blob bolted onto a fact table. Doing so preserves the reproducibility, automated testing, and lineage guarantees analytics engineers already expect from the modern stack while respecting the mathematical rigor of spatial topology.

Prerequisites

Before applying the patterns below, confirm your environment meets these baselines:

dbt Core ≥ 1.7 (for current incremental and on_schema_change semantics).
Spatial engine: PostGIS ≥ 3.3 on PostgreSQL ≥ 14, the DuckDB spatial extension ≥ 0.10, or a cloud warehouse with native GIS (BigQuery GIS / Snowflake GEOGRAPHY). See choosing the right spatial adapter if you are still deciding.
Grid library: H3 or S2 bindings available either as a database extension (h3-pg) or precomputed at ingestion.
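Warehouse grants: CREATE on the target schema, plus the ability to create extensions and indexes (on PostgreSQL, building a GiST index requires ownership of the table, and CREATE EXTENSION postgis typically needs a superuser or a pre-provisioned database).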
Environment variables: a DBT_SPATIAL_CANONICAL_SRID value wired through env_var() so the canonical projection is configurable per environment.

Architecture context

Within a layered spatial DAG, large-dataset handling is not a single model; it is a set of guarantees applied at every layer: geometry is validated and projected in staging, partitioned and grid-keyed in intermediate models, then index-served from marts. The table below shows where each control sits relative to the spatial model dependency graph.

DAG layer	Scaling control	Key operation
Raw	Columnar spatial storage	GeoParquet + WKB, predicate pushdown
Staging	Canonical CRS + validity	`ST_IsValid`, `ST_MakeValid`, `ST_Transform`
Intermediate	Grid partitioning	H3/S2 cell assignment, `cluster_by`
Mart	Index-served joins	GiST / native clustering, two-phase predicates

Configuration walkthrough

Pin the canonical projection and grid resolution as project variables so every model reads them consistently rather than hard-coding SRIDs. In dbt_project.yml:

# dbt_project.yml
vars:
  canonical_srid: "{{ env_var('DBT_SPATIAL_CANONICAL_SRID', '4326') }}"
  h3_resolution: 8        # avg hexagon area ~0.74 km^2 (edge ~0.46 km); tune per dataset density

models:
  my_project:
    intermediate:
      +materialized: incremental
      +tags: ['spatial']

Install and load the spatial extensions once per run, rather than per model, with an on-run-start hook. If an extension is missing or the role cannot create it, the hook errors and CI fails loudly on an unprovisioned database:

# dbt_project.yml (continued)
on-run-start:
  - "{{ ensure_spatial_ready() }}"

-- macros/ensure_spatial_ready.sql
{% macro ensure_spatial_ready() %}
  {% if target.type == 'postgres' %}
    CREATE EXTENSION IF NOT EXISTS postgis;
    CREATE EXTENSION IF NOT EXISTS h3;
  {% elif target.type == 'duckdb' %}
    INSTALL spatial; LOAD spatial;
    INSTALL h3 FROM community; LOAD h3;
  {% endif %}
{% endmacro %}
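The hook above installs the extensions but does not check their version. A minimal sketch of a version guard follows, assuming a PostGIS target; the macro name `assert_postgis_min_version` and its arguments are hypothetical, while `run_query`, `exceptions.raise_compiler_error`, and `PostGIS_Lib_Version()` are existing dbt and PostGIS calls.

```sql
-- macros/assert_postgis_min_version.sql (hypothetical helper, not shipped with dbt or PostGIS)
{% macro assert_postgis_min_version(min_major=3, min_minor=3) %}
  {% if execute and target.type == 'postgres' %}
    {# PostGIS_Lib_Version() returns a string such as '3.4.2' #}
    {% set result = run_query("SELECT PostGIS_Lib_Version()") %}
    {% set parts = result.columns[0].values()[0].split('.') %}
    {% set major = parts[0] | int %}
    {% set minor = parts[1] | int %}
    {% if major < min_major or (major == min_major and minor < min_minor) %}
      {# Abort the invocation with a readable message #}
      {{ exceptions.raise_compiler_error(
           "PostGIS " ~ major ~ "." ~ minor ~ " found; need >= "
           ~ min_major ~ "." ~ min_minor) }}
    {% endif %}
    {{ log("PostGIS version check passed: " ~ major ~ "." ~ minor, info=True) }}
  {% endif %}
{% endmacro %}
```

One way to use it is as a CI step before the build, for example `dbt run-operation assert_postgis_min_version`, so an outdated engine stops the pipeline before any model runs.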

The connection profile itself stays conventional; the only spatial-specific requirement is that the role can create extensions and indexes:

# profiles.yml
my_project:
  target: dev
  outputs:
    dev:
      type: postgres
      host: "{{ env_var('DBT_HOST') }}"
      port: 5432            # dbt-postgres expects an explicit port
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      dbname: "{{ env_var('DBT_DBNAME') }}"
      schema: analytics
      threads: 8

Core implementation

Columnar storage and spatial partitioning

Verbose text encodings like GeoJSON impose severe serialization overhead and block predicate pushdown. The baseline for high-throughput workloads is GeoParquet, which standardizes Well-Known Binary (WKB) inside a columnar layer so the engine can filter and vectorize before touching geometry.

Partitioning by ingestion timestamp alone creates hotspots and forces full-table scans during spatial queries. Instead, partition by a deterministic spatial grid. Hierarchical systems like Uber’s H3 or Google’s S2 emit near-uniform cells that distribute geometries evenly across distributed compute, and aligning models to those grids lets the optimizer prune irrelevant partitions before any geometry is evaluated:

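-- models/intermediate/int_events_gridded.sql
{# partition_by / cluster_by are adapter-specific and the shape below is illustrative:
   dbt-postgres ignores both, and BigQuery cannot partition on a string column,
   so on BigQuery cluster on h3_index instead. #}
{{ config(
    materialized='incremental',
    unique_key='event_id',
    partition_by={"field": "h3_index", "data_type": "string"},
    cluster_by=["h3_index", "event_timestamp"]
) }}

SELECT
    event_id,
    event_timestamp,
    -- (lat, lng, resolution) is the DuckDB h3 extension signature;
    -- h3-pg names the equivalent h3_lat_lng_to_cell(point, resolution)
    h3_latlng_to_cell(ST_Y(geom), ST_X(geom), {{ var('h3_resolution') }}) AS h3_index,
    geom
FROM {{ ref('stg_events') }}
{% if is_incremental() %}
  -- Only process events newer than the latest one already loaded
  WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}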

By co-locating geometries that share spatial proximity, downstream transformations scan only the necessary cells. In favorable workloads this cuts warehouse I/O by 60–80%, though the actual saving depends on how selective the queries are, and it removes the shuffle bottlenecks that plague naive ST_Intersects on unpartitioned tables.
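The mart in a later section joins events against an int_zones_gridded model that the guide does not show. One possible shape is sketched here, assuming PostGIS with h3-pg 4.x and its h3_postgis extension, a hypothetical `stg_zones` model (`zone_id`, polygon `geom` in EPSG:4326), and the `h3_resolution` variable defined earlier. The cell type on both sides of the join must match, so cast one side if the events model stores its key as text.

```sql
-- models/intermediate/int_zones_gridded.sql
-- Sketch only: stg_zones is a hypothetical staging model; function names follow h3-pg 4.x.
{{ config(materialized='table') }}

WITH seed_cells AS (
    -- Cells whose centroid falls inside the zone polygon
    SELECT z.zone_id, c.cell
    FROM {{ ref('stg_zones') }} z
    CROSS JOIN LATERAL h3_polygon_to_cells(z.geom, {{ var('h3_resolution') }}) AS c(cell)

    UNION

    -- At least one cell for zones smaller than a single hexagon
    SELECT z.zone_id,
           h3_lat_lng_to_cell(ST_PointOnSurface(z.geom), {{ var('h3_resolution') }}) AS cell
    FROM {{ ref('stg_zones') }} z
),
padded AS (
    -- Pad by one ring so cells that only overlap the zone boundary are kept
    SELECT DISTINCT s.zone_id, ring.cell AS h3_index
    FROM seed_cells s
    CROSS JOIN LATERAL h3_grid_disk(s.cell, 1) AS ring(cell)
)
SELECT p.zone_id, p.h3_index, z.geom
FROM padded p
JOIN {{ ref('stg_zones') }} z USING (zone_id)
```

The one-ring padding is a design choice rather than a guarantee: centroid-based fill skips cells that merely touch the boundary, and long thin slivers can still slip between seed cells, so coverage is worth spot-checking against the exact predicate on a sample.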

Canonical CRS enforcement

Coordinate reference system drift is the most common failure mode in spatial pipelines: when datasets with mismatched EPSG codes intersect, joins produce geometrically invalid results or silently misalign features. Robust Spatial Reference System Management mandates explicit validation at the ingestion boundary and deterministic transformation before any analytical operation. Wrap that logic in a macro so every staging model normalizes identically:

-- macros/enforce_canonical_crs.sql
{% macro enforce_canonical_crs(geometry_col, target_srid=none) %}
  {% set srid = target_srid or var('canonical_srid') %}
  CASE
    WHEN {{ geometry_col }} IS NULL THEN NULL
    WHEN NOT ST_IsValid({{ geometry_col }}) THEN ST_Transform(ST_MakeValid({{ geometry_col }}), {{ srid }})
    WHEN ST_SRID({{ geometry_col }}) = {{ srid }} THEN {{ geometry_col }}
    ELSE ST_Transform({{ geometry_col }}, {{ srid }})
  END
{% endmacro %}

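Applying this macro during staging guarantees that every downstream model operates on a unified coordinate space, and it prevents silent corruption when external vendors deliver shapefiles or GeoJSON in mismatched projections. An undeclared projection (SRID 0) still has to be declared with ST_SetSRID first, because the macro cannot guess the source CRS. For global analytics EPSG:4326 (WGS 84) is the usual canonical store; marts that need planar distance or area math reproject at the mart boundary to a local UTM zone or an equal-area CRS, while EPSG:3857 is best reserved for web-map display because Web Mercator distorts distance and area away from the equator.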
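As an illustration of the macro in use, the staging model below is a sketch under stated assumptions: the source name `vendor_feed.events_raw`, its columns, and the vendor's EPSG:27700 projection are all hypothetical. It declares the projection on rows that arrive with SRID 0, then hands every row to enforce_canonical_crs.

```sql
-- models/staging/stg_events.sql
-- Sketch: the source name, column names, and EPSG:27700 are assumptions for illustration.
WITH declared AS (
    SELECT
        event_id,
        event_timestamp,
        CASE
            -- SRID 0 means "undeclared": stamp the projection the vendor documented.
            -- ST_SetSRID only relabels the geometry; it does not move coordinates.
            WHEN ST_SRID(geom) = 0 THEN ST_SetSRID(geom, 27700)
            ELSE geom
        END AS geom
    FROM {{ source('vendor_feed', 'events_raw') }}
)
SELECT
    event_id,
    event_timestamp,
    -- Repairs invalid geometries and reprojects to the canonical SRID
    {{ enforce_canonical_crs('geom') }} AS geom
FROM declared
```

Stamping the SRID in its own CTE keeps the assumption about the vendor's projection visible in one place, separate from the shared normalization logic.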

Compute-optimized spatial joins

Large spatial joins are expensive because they evaluate topological relationships across a Cartesian product. To stay fast at scale, pre-filter on grid identifiers before applying exact predicates — a two-phase pattern that lets the planner use a hash join instead of a nested-loop geometry scan:

-- models/marts/mart_events_in_zone.sql
{{ config(materialized='incremental', unique_key='event_id') }}

WITH candidates AS (
    -- Phase 1: cheap equality join on the shared grid key
    SELECT e.event_id, e.geom AS event_geom, z.zone_id, z.geom AS zone_geom
    FROM {{ ref('int_events_gridded') }} e
    JOIN {{ ref('int_zones_gridded') }} z
      ON e.h3_index = z.h3_index
)
-- Phase 2: exact predicate on the reduced candidate set only
SELECT event_id, zone_id
FROM candidates
WHERE ST_Intersects(event_geom, zone_geom)

Phase one matches records on precomputed cells (h3_index); phase two applies ST_Intersects or ST_DWithin only to surviving candidates. For deeper tuning of the exact-predicate stage, see optimizing proximity joins and index hints for spatial queries. Cloud warehouses such as BigQuery and Snowflake also expose native spatial clustering; aligning cluster_by with those features keeps spatial locality during execution.
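On PostGIS there is no native clustering to lean on, so the exact-predicate stage depends on indexes instead. The sketch below assumes the dbt-postgres adapter, whose `indexes` model config builds indexes after the table is created; the model name `mart_zones_indexed` is hypothetical.

```sql
-- models/marts/mart_zones_indexed.sql (hypothetical model name)
{{ config(
    materialized='table',
    indexes=[
        {'columns': ['geom'], 'type': 'gist'},
        {'columns': ['h3_index'], 'type': 'btree'}
    ],
    post_hook="ANALYZE {{ this }}"
) }}

-- GiST serves ST_Intersects / ST_DWithin; the btree serves the phase-one grid join
SELECT zone_id, h3_index, geom
FROM {{ ref('int_zones_gridded') }}
```

The ANALYZE post-hook refreshes planner statistics after each rebuild, which helps the planner keep choosing the new indexes.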

Validation & testing

Reproducibility at scale depends on validity guards that run in CI, not in production. First confirm the engine is provisioned and geometries are valid with ad-hoc sweeps:

-- Confirm the spatial engine is available (PostGIS)
SELECT PostGIS_Version();

-- Sweep for invalid geometries and SRID drift before promotion
SELECT
    COUNT(*) FILTER (WHERE NOT ST_IsValid(geom))                              AS invalid_geoms,
    COUNT(*) FILTER (WHERE ST_SRID(geom) <> {{ var('canonical_srid') }})      AS wrong_srid,
    COUNT(*) FILTER (WHERE geom IS NULL OR ST_IsEmpty(geom))                  AS empty_geoms
FROM {{ ref('int_events_gridded') }};

Then encode those checks as dbt tests so they gate every run. Essential tests for large spatial pipelines are geometry validity (ST_IsValid), extent assertions (coordinates within expected bounds), null/empty guards, and canonical-SRID consistency:

# models/intermediate/_intermediate.yml
version: 2
models:
  - name: int_events_gridded
    tests:
      # Declared at model level: at column level dbt_utils prepends the
      # column name to the expression, which would produce invalid SQL here.
      - dbt_utils.expression_is_true:
          expression: "ST_IsValid(geom)"
      - dbt_utils.expression_is_true:
          expression: "ST_SRID(geom) = {{ var('canonical_srid') }}"
    columns:
      - name: geom
        tests:
          - not_null
      - name: h3_index
        tests:
          - not_null

Embedding these directly in the DAG means a single invalid polygon or mismatched SRID fails CI rather than cascading into corrupted aggregations downstream.
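The extent assertion mentioned above has no generic test in that YAML, so it can be written as a singular dbt test. The sketch below assumes point geometries stored in EPSG:4326; the file name is hypothetical and the bounds are the global limits, which most projects would tighten to their region of interest.

```sql
-- tests/assert_events_within_extent.sql (hypothetical file name)
-- Singular dbt test: any row returned counts as a failure.
-- Bounds assume point geometries in EPSG:4326 (degrees).
SELECT
    event_id,
    ST_X(geom) AS lon,
    ST_Y(geom) AS lat
FROM {{ ref('int_events_gridded') }}
WHERE geom IS NOT NULL
  AND (
        ST_X(geom) NOT BETWEEN -180 AND 180
     OR ST_Y(geom) NOT BETWEEN -90 AND 90
  )
```

Out-of-range coordinates are a common sign of swapped axes or a projected feed mislabeled as EPSG:4326, so this test often catches CRS drift that an SRID check alone misses.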

Advanced patterns

Incremental partition recompute. Pair the grid key with an incremental predicate so only touched cells rebuild. Combining partition_by on h3_index with is_incremental() filtering lets a daily run recompute a handful of cells instead of a full materialization.
Macro-parameterized resolution. Expose h3_resolution as a var and accept it as a macro argument so dense urban marts and sparse continental marts share one transformation with different cell sizes. Reusable spatial logic belongs in custom spatial macros.
Multi-engine portability. Keep grid keying engine-agnostic by isolating ST_/h3_ calls behind macros that branch on target.type, so a lightweight DuckDB run can validate the DAG in CI before promotion to PostGIS.
Governance-aware aggregation. Aggregate point data up to a grid cell or polygon before exposing it to BI, which both reduces scan volume and lowers re-identification risk under your data security and scoping rules. Treat any change to geometry column types, CRS, or partition keys as a versioned event under versioning spatial schemas in dbt.
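The governance-aware aggregation pattern can be sketched as a mart that rolls points up to their grid cell and suppresses sparse cells. The model name and the `min_cell_count` variable are assumptions introduced for this example; the threshold of 10 is a placeholder, not a recommendation.

```sql
-- models/marts/mart_event_density.sql (hypothetical model name)
-- Sketch: min_cell_count is an assumed project variable, defaulting to 10 here.
{{ config(materialized='table') }}

SELECT
    h3_index,
    DATE_TRUNC('day', event_timestamp) AS event_day,
    COUNT(*) AS event_count
FROM {{ ref('int_events_gridded') }}
GROUP BY 1, 2
-- Drop sparse cells so a single person or device cannot be singled out
HAVING COUNT(*) >= {{ var('min_cell_count', 10) }}
```

BI tools then read a few rows per cell per day instead of raw points, which is where both the scan reduction and the lower re-identification risk come from.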

Troubleshooting

Symptom	Root cause	Fix
Spatial join scans full tables, no partition pruning	Predicate filters on geometry before the grid key	Add the `h3_index` equality join as phase one so the planner prunes partitions first
`ST_Transform` errors or returns NULL for some rows	Source geometry has SRID 0 / undeclared projection	`ST_SetSRID` to the known source EPSG before transforming; reject SRID 0 in staging
Aggregations fail with “GEOSIntersection: TopologyException”	Self-intersecting or unclosed input rings	Route invalid geometries through `ST_MakeValid` in the canonical-CRS macro
Incremental run reprocesses every partition	`cluster_by`/`partition_by` not aligned with the incremental key	Cluster on the same `h3_index` used in the `is_incremental()` predicate
GiST index ignored, queries slow after large load	Stale planner statistics / index bloat	Run `VACUUM ANALYZE` (or `REINDEX`) after bulk loads; confirm geometries share one SRID so the index is usable

FAQ

Why partition by H3 instead of by region or country?

Administrative boundaries produce wildly uneven partitions: a single dense metro region can hold more features than an entire rural country. Near-uniform grid cells spread geometries more evenly, so distributed compute and partition pruning both stay balanced.

Should I store geometry in EPSG:4326 or a projected CRS?

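Store canonically in EPSG:4326 for interoperability and reproject to a planar CRS (a local UTM zone or an equal-area projection) only at the mart boundary where you need accurate distance or area math; EPSG:3857 suits tile rendering but distorts both measures away from the equator. Keep both sides of a join in the same SRID: PostGIS rejects predicates on mixed-SRID geometries, and wrapping an indexed column in ST_Transform inside the join condition bypasses its index.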

Does grid pre-filtering change my join results?

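Not as long as the zone's cell set covers the whole zone and you keep the exact predicate. Phase two (ST_Intersects / ST_DWithin) still runs on the candidate set; the grid join only shrinks the candidate pool, it never decides the final match. Matches are lost only when the grid join is too tight, for example when boundary cells are missing from the zone cover or when an ST_DWithin search does not include neighboring cells.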

How do I keep this DAG fast to test in CI?

Run the same models against the DuckDB spatial extension over a seeded geometry fixture. It validates SQL, grid keying, and validity tests in seconds before promotion to a heavier PostGIS environment.

Spatial Reference System Management — canonical CRS contracts and idempotent normalization.
Data Security & Scoping Rules — row-level security and coordinate generalization for sensitive geometries.
Versioning Spatial Schemas in dbt — auditable evolution of geometry types and partition strategies.
Optimizing Proximity Joins — tuning the exact-predicate stage of a two-phase join.
