Batch transforming coordinate systems with dbt

This page shows you how to reproject hundreds of millions of geometries from one coordinate reference system to another inside dbt — deterministically, in bounded chunks, and without exhausting warehouse memory or silently corrupting topology.

When spatial datasets scale into the hundreds of millions of rows, coordinate reference system (CRS) transformations become the critical bottleneck in an analytics pipeline. dbt excels at declarative modeling, but native spatial operations need careful orchestration to avoid memory exhaustion, index fragmentation, and floating-point precision loss. The engineering challenge is not merely invoking ST_Transform; it is executing it idempotently across distributed compute nodes while preserving topology and minimizing serialization overhead. This guide sits inside the broader geometry transformation pipeline workflow and supplies the chunked, recoverable reprojection step that workflow depends on.

When to use this approach

Reach for chunked, incremental batch reprojection when:

Your geometry table is too large for a single-pass reproject. Below a few million rows, the full-table normalize_geom pattern in the parent geometry transformation pipeline is simpler and fine. Above tens of millions, a single ST_Transform pass spills to disk and times out.
New geometries arrive continuously and you only want to reproject the delta. If you instead need policy and registry governance around which SRID is canonical, start with automating CRS conversions in dbt pipelines and layer these batch mechanics on top.
You are memory-constrained, not compute-constrained. For storage-side tactics (column pruning, geometry simplification, spatial partitioning at rest), pair this with handling large geospatial datasets.

Prerequisites

dbt Core ≥ 1.7 (stable adapter.dispatch, --state for targeted reruns; --empty CI runs need ≥ 1.8).
One spatial adapter: dbt-postgres ≥ 1.7 against PostGIS ≥ 3.3, dbt-snowflake ≥ 1.7, or dbt-bigquery ≥ 1.7 (spherical GEOGRAPHY only — no planar SRIDs).
Grants: CREATE on the target schema and permission to build a GiST (PostGIS) or search-optimized index on the output table.
Environment variables through dbt’s env_var() (never hardcoded): DBT_TARGET_SRID to pin the canonical projection, plus the usual connection secrets.
A staging model that already tags each geometry with its true source SRID via ST_SetSRID — an untagged geometry cannot be reprojected.

Step-by-step instructions

Step 1: Centralize reprojection in a warehouse-agnostic macro

The foundation is a reusable macro that abstracts engine-specific function signatures while enforcing strict SRID validation. Warehouses diverge sharply here: PostGIS requires an explicit SRID before transformation, Snowflake infers it from metadata, and BigQuery fixes GEOGRAPHY to EPSG:4326 and rejects planar reprojection outright. Centralizing the logic, as covered in building custom spatial macros, isolates every edge case in one file.

-- macros/transform_crs.sql
{% macro transform_crs(geom_col, target_srid, source_srid=4326) %}
  {%- if target.type == 'postgres' -%}
    ST_Transform(ST_SetSRID({{ geom_col }}::geometry, {{ source_srid }}), {{ target_srid }})
  {%- elif target.type == 'snowflake' -%}
    ST_TRANSFORM({{ geom_col }}, {{ target_srid }})
  {%- elif target.type == 'bigquery' -%}
    {# BigQuery GEOGRAPHY is fixed to EPSG:4326; reproject upstream or skip. #}
    {%- if target_srid | int == 4326 -%}
      ST_GEOGFROMWKB({{ geom_col }})
    {%- else -%}
      {{ exceptions.raise_compiler_error("BigQuery GEOGRAPHY only supports EPSG:4326; pre-project upstream") }}
    {%- endif -%}
  {%- else -%}
    {{ exceptions.raise_compiler_error("Unsupported adapter for spatial CRS transformation") }}
  {%- endif -%}
{% endmacro %}

Verify the macro compiles for your adapter before wiring it into a model:

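# Macros are not selectable nodes, so compile an inline query that calls the macro
dbt compile --inline "select {{ transform_crs('raw_geom_wkb', 3857, 4326) }}"
# The compiled SQL is printed to stdout; it should name your warehouse's ST_Transform variant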
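
The macro above branches on target.type. The same logic can also be laid out with adapter.dispatch, which keeps each warehouse's SQL in its own macro. The following is a minimal sketch of that alternative layout, not the version this guide uses elsewhere; the per-adapter macro bodies are copied from the branches above, and the `default__` fallback is an assumption about how you would want unsupported adapters handled.

```sql
-- macros/transform_crs.sql  (alternative layout; replaces the if/elif version)
{% macro transform_crs(geom_col, target_srid, source_srid=4326) %}
  {{ return(adapter.dispatch('transform_crs')(geom_col, target_srid, source_srid)) }}
{% endmacro %}

-- PostGIS: tag the source SRID explicitly, then reproject
{% macro postgres__transform_crs(geom_col, target_srid, source_srid) %}
  ST_Transform(ST_SetSRID({{ geom_col }}::geometry, {{ source_srid }}), {{ target_srid }})
{% endmacro %}

-- Snowflake: the source SRID comes from the geometry's own metadata
{% macro snowflake__transform_crs(geom_col, target_srid, source_srid) %}
  ST_TRANSFORM({{ geom_col }}, {{ target_srid }})
{% endmacro %}

-- BigQuery: GEOGRAPHY is fixed to EPSG:4326, so anything else is a compile error
{% macro bigquery__transform_crs(geom_col, target_srid, source_srid) %}
  {%- if target_srid | int == 4326 -%}
    ST_GEOGFROMWKB({{ geom_col }})
  {%- else -%}
    {{ exceptions.raise_compiler_error("BigQuery GEOGRAPHY only supports EPSG:4326; pre-project upstream") }}
  {%- endif -%}
{% endmacro %}

-- Any other adapter falls through to this implementation
{% macro default__transform_crs(geom_col, target_srid, source_srid) %}
  {{ exceptions.raise_compiler_error("Unsupported adapter for spatial CRS transformation") }}
{% endmacro %}
```

Call sites do not change: `{{ transform_crs('raw_geom_wkb', 3857, 4326) }}` resolves to the implementation whose prefix matches the active adapter.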

Step 2: Partition the work and process WKB natively

Batch transforms are memory-intensive because spatial functions decompress binary geometry into in-memory coordinate arrays. Processing millions of rows without chunking triggers OOM errors or aggressive spill-to-disk that degrades throughput 3–5x. Keep geometries in WKB and never round-trip through ST_AsText, which inflates payload size 40–60%. Materialize incrementally and partition on a temporal or spatial bucket so each run touches a bounded slice.

-- models/marts/fct_transformed_geometries.sql
{{ config(
    materialized='incremental',
    unique_key='record_id',
    partition_by=['date_partition'],
    cluster_by=['region_code'],
    incremental_strategy='merge',
    on_schema_change='sync_all_columns'
) }}

SELECT
    record_id,
    date_partition,
    region_code,
    {{ transform_crs('raw_geom_wkb', 3857, 4326) }} AS geom_transformed,
    metadata_payload
FROM {{ ref('stg_spatial_events') }}
{% if is_incremental() %}
    WHERE date_partition > (SELECT MAX(date_partition) FROM {{ this }})
{% endif %}

Align partition boundaries with your warehouse’s optimal scan size (typically 1–4 GB per slice) so parallel workers stay busy without overrunning memory. Confirm the model runs on a single partition first:

dbt run --select fct_transformed_geometries --vars '{"date_partition": "2026-06-01"}'
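
The command above narrows the run only if the model actually reads the variable, and the model as shown does not reference var('date_partition'). One way to honor it is sketched below, assuming date_partition is a DATE column and that a missing variable should fall back to the normal incremental filter. The `slice` name is illustrative.

```sql
-- models/marts/fct_transformed_geometries.sql
-- Sketch: the Step 2 model with an optional single-slice override.
{{ config(
    materialized='incremental',
    unique_key='record_id',
    incremental_strategy='merge'
) }}

{# None unless --vars '{"date_partition": "..."}' was passed on the command line #}
{% set slice = var('date_partition', none) %}

SELECT
    record_id,
    date_partition,
    region_code,
    {{ transform_crs('raw_geom_wkb', 3857, 4326) }} AS geom_transformed,
    metadata_payload
FROM {{ ref('stg_spatial_events') }}
{% if slice is not none %}
    -- Targeted run: process exactly the slice named on the command line
    WHERE date_partition = CAST('{{ slice }}' AS DATE)
{% elif is_incremental() %}
    -- Normal incremental run: only slices newer than what is already loaded
    WHERE date_partition > (SELECT MAX(date_partition) FROM {{ this }})
{% endif %}
```

Because the merge is keyed on record_id, rerunning a slice that is already loaded updates those rows in place rather than duplicating them.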

Step 3: Add deterministic incremental state management

Full-table reprojection is rarely viable, and incremental materialization needs deterministic merge logic to survive late-arriving data and upstream schema drift. The unique_key makes updates idempotent; spatial pipelines need three further guards:

SELECT
    record_id,
    -- detect geometry mutations cheaply, without parsing coordinate arrays
    md5(raw_geom_wkb)                          AS geom_hash,
    source_srid,                               -- preserve original SRID for audit / reverse transforms
    {{ transform_crs('raw_geom_wkb', var('target_srid')) }} AS geom_transformed
FROM {{ ref('stg_spatial_events') }}
WHERE deleted_at IS NULL                        -- never resurrect soft-deleted geometries

Hash-based change detection: a lightweight md5 of the raw WKB flags mutations without decoding geometry.
Soft-delete handling: keep only rows where deleted_at IS NULL inside the incremental window so deleted features cannot corrupt downstream joins.
Metadata retention: keep source_srid in a companion column to enable audit trails and reverse transforms.

These controls cut warehouse compute 60–80% on mature pipelines by recomputing only net-new or modified rows.
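
The query above computes geom_hash but does not yet use it to skip unchanged rows. A minimal sketch of that comparison follows, assuming PostgreSQL's md5() over the WKB bytes and a target table that already stores geom_hash from a previous run. The model name and config are illustrative.

```sql
-- models/marts/fct_transformed_geometries_hashed.sql  (hypothetical model name)
{{ config(
    materialized='incremental',
    unique_key='record_id',
    incremental_strategy='merge'
) }}

WITH src AS (
    SELECT
        record_id,
        source_srid,
        raw_geom_wkb,
        md5(raw_geom_wkb) AS geom_hash          -- fingerprint of the raw bytes
    FROM {{ ref('stg_spatial_events') }}
    WHERE deleted_at IS NULL                     -- skip soft-deleted rows
)

SELECT
    src.record_id,
    src.geom_hash,
    src.source_srid,
    {{ transform_crs('src.raw_geom_wkb', var('target_srid')) }} AS geom_transformed
FROM src
{% if is_incremental() %}
-- Reproject only rows that are new or whose bytes changed since the last run
LEFT JOIN {{ this }} AS tgt
    ON tgt.record_id = src.record_id
WHERE tgt.record_id IS NULL
   OR tgt.geom_hash <> src.geom_hash
{% endif %}
```

The join reads only the key and hash columns of the target, so the expensive ST_Transform call is limited to rows that survive the filter.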

Step 4: Preserve topology and control precision

Reprojection introduces floating-point drift, especially between geographic (lat/lon) and projected (meter) systems. At scale that drift produces self-intersections, sliver polygons, and broken rings that fail downstream joins. Manage it explicitly:

SELECT
    record_id,
    -- snap to a consistent tolerance: ~1mm urban, ~10m continental
    ST_SnapToGrid(
        {{ transform_crs('raw_geom_wkb', var('target_srid')) }},
        0.001
    ) AS geom_transformed
FROM {{ ref('stg_spatial_events') }}
WHERE ST_IsValid(raw_geom_wkb)   -- route invalid input to quarantine, do not fail the batch

Grid snapping with ST_SnapToGrid (or the warehouse equivalent) aligns vertices to a fixed tolerance.
Ring validation with ST_IsValid immediately after transform; send failures to a quarantine table rather than aborting.
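Precision casting of any extracted ordinates to a fixed-scale decimal keeps warehouse-specific rounding out of analytical models. DECIMAL(10,7) fits degree ordinates; meter-based CRSs such as EPSG:3857 need more integer digits (for example DECIMAL(12,3)).

This follows the OGC Simple Features standard for topology validity so geometries stay interoperable across GIS clients and spatial indexes.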
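
The quarantine routing mentioned above can live in a companion model. The sketch below is PostGIS-specific and assumes the same staging columns as the earlier examples. The model name is hypothetical, and ST_IsValidReason is included only to make triage easier.

```sql
-- models/marts/quarantine_invalid_geometries.sql  (hypothetical model name)
{{ config(materialized='table') }}

WITH reprojected AS (
    SELECT
        record_id,
        raw_geom_wkb,
        -- same transform and snap tolerance as the main model
        ST_SnapToGrid(
            {{ transform_crs('raw_geom_wkb', var('target_srid')) }},
            0.001
        ) AS geom_transformed
    FROM {{ ref('stg_spatial_events') }}
    WHERE raw_geom_wkb IS NOT NULL
)

SELECT
    record_id,
    raw_geom_wkb,                                   -- keep the original bytes for later repair
    ST_IsValidReason(geom_transformed) AS invalid_reason
FROM reprojected
WHERE NOT ST_IsValid(geom_transformed)              -- only rows that failed validation
```

This version repeats the transform for every row, which is acceptable for a sketch. On large tables you would likely restrict it to the same incremental window as the main model.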

Step 5: Monitor extents and recover surgically

Production pipelines need automated validation to catch silent failures before they reach reporting or feature stores. Encode spatial integrity as dbt tests so they fail the run, not a dashboard:

# models/marts/_marts.yml
version: 2
models:
  - name: fct_transformed_geometries
    columns:
      - name: geom_transformed
        tests:
          - not_null
      - name: source_srid
        tests:
          - accepted_values:
              values: [4326, 3857, 26918]
    tests:
      - dbt_expectations.expect_row_values_to_have_data_for_every_n_datepart:
          date_part: day
          date_col: date_partition

Track spatial extent with ST_XMin, ST_YMax, and bounding-box aggregations; sudden shifts mean projection misalignment or a corrupted input batch. When a partition fails, reprocess only that partition rather than rerunning the whole DAG:

# Rebuild only the failed slice and its downstream models
dbt build --select fct_transformed_geometries+ \
  --vars '{"date_partition": "2026-06-12", "region_code": "us-west"}'
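
The extent tracking described above can be written as a dbt singular test, which fails when the query returns any rows. This sketch assumes the EPSG:3857 target used in Step 2. Web Mercator coordinates stay within roughly ±20.04 million meters, so the bound below is deliberately loose, and the file name is hypothetical.

```sql
-- tests/assert_extent_within_web_mercator.sql  (hypothetical singular test)
WITH extent AS (
    -- one bounding box per partition
    SELECT
        date_partition,
        MIN(ST_XMin(geom_transformed)) AS min_x,
        MAX(ST_XMax(geom_transformed)) AS max_x,
        MIN(ST_YMin(geom_transformed)) AS min_y,
        MAX(ST_YMax(geom_transformed)) AS max_y
    FROM {{ ref('fct_transformed_geometries') }}
    GROUP BY date_partition
)

-- Any partition whose box leaves the plausible Web Mercator range is a failure
SELECT *
FROM extent
WHERE min_x < -20100000
   OR max_x >  20100000
   OR min_y < -20100000
   OR max_y >  20100000
```

A partition that lands here most often means the input was tagged with the wrong source SRID, since degrees reinterpreted as meters (or the reverse) shift the extent by orders of magnitude.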

Configuration reference

Parameter	Accepted values	Default	Spatial notes
`transform_crs(target_srid)`	any EPSG code the engine supports	—	BigQuery accepts `4326` only; raises a compiler error otherwise
`transform_crs(source_srid)`	EPSG of the tagged input	`4326`	Must match the true source SRID; the PostGIS branch re-tags with `ST_SetSRID`, so a mismatch yields wrong coordinates rather than an error
`materialized`	`incremental`, `table`	`incremental`	Use `table` for one-off full reprojections under a few million rows
`incremental_strategy`	`merge`, `delete+insert`	`merge`	`merge` is idempotent on `unique_key`; required for late-arriving data
`partition_by`	temporal or spatial bucket column	—	Size each partition to a 1–4 GB scan for parallel execution
`cluster_by`	`region_code` or similar	—	Co-locates nearby geometries to cut index scan cost
`ST_SnapToGrid` tolerance	grid size in CRS units	`0.001`	In a meter-based CRS, `0.001` is ~1mm (urban datasets); use `10` for ~10m (continental)

Gotchas and edge cases

Unknown (0) SRID at input. ST_Transform cannot reproject an untagged geometry. Always ST_SetSRID at staging; a source_srid mismatch produces correct-looking but wrong coordinates rather than an error.
Geometry vs geography coercion. PostGIS geometry is planar and unit-dependent; casting to geography mid-batch silently changes distance semantics. Keep the type explicit and consistent across the model.
BigQuery has no planar SRID. The macro raises a compiler error for any target_srid other than 4326 — reproject upstream in PostGIS or DuckDB before promoting to BigQuery.
merge with no unique_key has nothing to match on, so dbt inserts every selected row and a rerun duplicates data. Always declare unique_key alongside partition_by.
Stale planner statistics after a bulk load make the warehouse ignore the spatial index. Run ANALYZE (PostGIS) or the equivalent in a post-hook so the next run is index-eligible.
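
For the last point, a sketch of how the index and the statistics refresh might be declared on the model itself, assuming dbt-postgres (its `indexes` config and a plain `ANALYZE` post-hook). Other adapters need their own equivalents.

```sql
-- models/marts/fct_transformed_geometries.sql  (PostGIS-only config sketch)
{{ config(
    materialized='incremental',
    unique_key='record_id',
    incremental_strategy='merge',
    indexes=[{'columns': ['geom_transformed'], 'type': 'gist'}],
    post_hook="ANALYZE {{ this }}"
) }}

SELECT
    record_id,
    date_partition,
    region_code,
    {{ transform_crs('raw_geom_wkb', 3857, 4326) }} AS geom_transformed,
    metadata_payload
FROM {{ ref('stg_spatial_events') }}
{% if is_incremental() %}
    WHERE date_partition > (SELECT MAX(date_partition) FROM {{ this }})
{% endif %}
```

The post-hook runs after every build of the model, so planner statistics are refreshed before the next query touches the table.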

Frequently asked questions

Why does ST_Transform return NULL after this step?

Almost always a NULL or invalid input geometry, or an input tagged with SRID 0 (PostGIS rejects SRID 0 with an error rather than returning NULL). Filter the input with a WHERE ST_IsValid(...) guard, route failures to a quarantine table, and confirm the staging model applied ST_SetSRID with the real source projection.
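
To see which of those causes applies, a quick count over the staging model helps. This is a PostGIS-flavored sketch, using the FILTER clause and a bytea-to-geometry cast, that you could keep as a dbt analysis. The file name is hypothetical.

```sql
-- analyses/diagnose_null_transforms.sql  (hypothetical analysis file)
SELECT
    COUNT(*)                                                          AS total_rows,
    COUNT(*) FILTER (WHERE raw_geom_wkb IS NULL)                      AS null_geoms,
    COUNT(*) FILTER (WHERE NOT ST_IsValid(raw_geom_wkb::geometry))    AS invalid_geoms,
    -- plain WKB carries no SRID, so it decodes with SRID 0 until tagged
    COUNT(*) FILTER (WHERE ST_SRID(raw_geom_wkb::geometry) = 0)       AS untagged_geoms
FROM {{ ref('stg_spatial_events') }}
```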

How big should each partition be?

Target a 1–4 GB scan per slice. Smaller and you pay per-run overhead and underuse parallel workers; larger and individual workers spill to disk. Tune partition_by granularity (daily vs hourly, or a coarser spatial grid) to land in that window.
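
A rough way to check where the current buckets fall is to sum the raw WKB bytes per partition. The sketch below uses PostgreSQL's octet_length and treats geometry bytes as a proxy for scan size. That is an assumption, since the other columns and storage compression also affect what the warehouse actually reads.

```sql
-- analyses/partition_sizing.sql  (hypothetical analysis file)
SELECT
    date_partition,
    COUNT(*)                                                  AS n_rows,
    ROUND(SUM(octet_length(raw_geom_wkb)) / 1073741824.0, 2)  AS wkb_gb,
    CASE
        WHEN SUM(octet_length(raw_geom_wkb)) < 1073741824 THEN 'below 1 GB: consider coarser buckets'
        WHEN SUM(octet_length(raw_geom_wkb)) > 4294967296 THEN 'above 4 GB: consider finer buckets'
        ELSE 'within 1-4 GB'
    END                                                       AS sizing_hint
FROM {{ ref('stg_spatial_events') }}
GROUP BY date_partition
ORDER BY date_partition
```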

Should I keep geometries as WKB or text through the batch?

WKB throughout. Round-tripping through ST_AsText/ST_GeomFromText inflates payload 40–60% and forces extra parsing, which is exactly the serialization overhead that pushes large batches into spill-to-disk.
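
The inflation depends on vertex counts and coordinate precision, so it is worth measuring on your own data. This is a PostGIS sketch over a sample of rows; the sample size and CTE name are arbitrary.

```sql
-- analyses/wkb_vs_wkt_size.sql  (hypothetical analysis file)
WITH geom_sample AS (
    SELECT raw_geom_wkb::geometry AS geom
    FROM {{ ref('stg_spatial_events') }}
    WHERE raw_geom_wkb IS NOT NULL
    LIMIT 100000
)

SELECT
    AVG(octet_length(ST_AsBinary(geom))) AS avg_wkb_bytes,
    AVG(octet_length(ST_AsText(geom)))   AS avg_wkt_bytes,
    -- values above 1.0 mean the text form is larger than the binary form
    ROUND(
        SUM(octet_length(ST_AsText(geom)))::numeric
        / NULLIF(SUM(octet_length(ST_AsBinary(geom))), 0),
        2
    ) AS wkt_to_wkb_ratio
FROM geom_sample
```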

How do I reprocess one bad batch without rerunning everything?

Use dbt’s --select with the model-plus-downstream operator and pass the failing partition as a var: dbt build --select fct_transformed_geometries+ --vars '{"date_partition": "…"}'. Avoid full reruns. Target the specific date_partition or region_code bucket that failed.

Geometry transformation pipelines — the staged workflow this batch reprojection step plugs into.
Automating CRS conversions in dbt pipelines — the governance and registry layer that decides which SRID is canonical.
Handling large geospatial datasets — storage-side tactics that complement chunked reprojection.
Building custom spatial macros — the reusable, dispatch-aware macro patterns behind transform_crs.
