Building the Transform Framework

March 24, 2026 — Phase 1 Build

The Question

We spent weeks validating architectural patterns through experiments — each one proving a specific piece works in isolation. Now it's time to build the real thing. Can we take the Step/DAG model, the schedule and realtime builtins, and the pipeline runners from the experiments and turn them into a production framework that actually processes Sound Transit's feeds?

What We Tried

  • Ported the Step base class, @step decorator, folder scanner, and DAG resolver from the pipeline-dag experiment into a single continuous_gtfs Python package
  • Added execution hooks, per-step log capture, and fail-fast vs continue error modes (patterns from the gtfs-ci-demo prototype we analyzed)
  • Built schedule builtins (RemoveRows, UpdateFields, ClearField, UpdateFeedInfo) and realtime builtins (FilterStopsByID, CombineFeeds, RenameVehicles, TransformTripId, UpdateFeedHeader, PassThrough)
  • Created pipeline runners for both schedule (ingest → validate → transform → validate → package) and realtime (decode → transform → encode PB → encode JSON)
  • Wrote Sound Transit's actual transform rules as a pipeline folder, consolidated from their old 4-config model into a single canonical pipeline

What We Found

  1. Sound Transit's per-environment configs were an artifact of their old system, not a real requirement. They maintained 4 separate transform files (pims_base, pims_production, dev_base, pims_post_fwle) because their system selected rulesets at runtime. When we consolidated them into one pipeline, the conflicts disappeared — the "base removes Link routes, production retains them" pattern was just a workaround for not having separate pipeline versions.

  2. The protobuf HasField() check is the right approach for RT transforms — not type-based dispatch. We initially considered keying the context by feed type (vehicle_positions, trip_updates), but a single FeedMessage can contain any mix of entity types. MTA combines VP and TU in one feed. The transforms inspect each entity's oneof variant and handle it correctly regardless of what the feed was labeled.

  3. 66 tests pass in under 4 seconds, including tests against real Sound Transit data. The schedule pipeline processes the 1.5MB, 178K-row feed in 633ms (22ms for transforms, 472ms for zip packaging). The realtime pipeline processes VehiclePositions in 5.5ms. Both are well within the spec targets.

  4. The prototype had patterns worth porting. The gtfs-ci-demo's execution hooks, per-step log capture, and error categorization were the right ideas but over-engineered. We took the core patterns (5 hook events, Python logger interception per step, fail-fast vs continue modes) and left behind the recovery strategies, suggested actions, and snapshot/rollback machinery.
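A rough sketch of the two ported patterns together: hook events fired around each step, and per-step capture of Python logging output. The prototype defines five hook events; only three are shown here, and all names and signatures are illustrative rather than the framework's actual API.

```python
# Illustrative sketch (not the real continuous_gtfs API): a step runner
# that fires hook events and captures any logging output the step emits,
# honoring fail-fast vs continue error modes.
import logging
from typing import Callable

HOOKS: dict[str, list[Callable]] = {"step_start": [], "step_end": [], "step_error": []}

def fire(event: str, **info) -> None:
    for fn in HOOKS[event]:
        fn(**info)

def run_step(name: str, fn: Callable, *, fail_fast: bool = True) -> list[str]:
    """Run one step, capturing its log records; returns the captured lines."""
    captured: list[str] = []
    handler = logging.Handler()
    handler.emit = lambda record: captured.append(record.getMessage())
    root = logging.getLogger()
    root.addHandler(handler)
    old_level = root.level
    root.setLevel(logging.INFO)
    fire("step_start", step=name)
    try:
        fn()
        fire("step_end", step=name)
    except Exception as exc:
        fire("step_error", step=name, error=exc)
        if fail_fast:
            raise
    finally:
        root.removeHandler(handler)
        root.setLevel(old_level)
    return captured
```

Interception at the logger level means steps just use ordinary `logging` calls and still get their output attributed to the right step in the run report.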

  5. Cross-file ID mappings close the cascade removal gap. When a route is removed, trips and stop_times need cascading deletes. Rather than making every step aware of every other step, we added ctx.id_mappings — one step writes the mapping, a dependent step reads it. Loose coupling through the context.
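A minimal sketch of that loose coupling, with illustrative names throughout: the removal step records which route_ids it dropped in the shared context, and a later step cascades the delete to trips without knowing why the routes disappeared.

```python
# Sketch of loose coupling through ctx.id_mappings. Step and key names
# are made up for the example; the point is that the only contract
# between the two steps is the mapping written into the context.
from dataclasses import dataclass, field

@dataclass
class Context:
    id_mappings: dict[str, set[str]] = field(default_factory=dict)

def remove_routes(routes: list[dict], ctx: Context, drop: set[str]) -> list[dict]:
    # Writer side: record what was removed for downstream steps.
    ctx.id_mappings["removed_route_ids"] = drop
    return [r for r in routes if r["route_id"] not in drop]

def cascade_trips(trips: list[dict], ctx: Context) -> list[dict]:
    # Reader side: cascade the delete using only the shared mapping.
    removed = ctx.id_mappings.get("removed_route_ids", set())
    return [t for t in trips if t["route_id"] not in removed]
```

The same pattern extends another level down: a trip-removal step can publish removed trip_ids for a stop_times step to consume.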

What It Looks Like

The CLI running against Sound Transit's real schedule feed:

$ continuous-gtfs dag tests/fixtures/pipelines/st_schedule/
DAG: 16 steps
  1. clear_1line_short_name [builtin] ['trips.txt']
  2. clear_2line_short_name [builtin] ['trips.txt']
  3. remove_inactive_calendars [builtin] ['calendar.txt']
  4. remove_llr_calendar [builtin] ['calendar.txt'] (after: remove_inactive_calendars)
  ...

$ continuous-gtfs schedule data/sound-transit/schedule.zip tests/fixtures/pipelines/st_schedule/ -o output.zip
Schedule pipeline: 632.6ms
  Input:  22 files, 178485 rows
  Output: 22 files, 178461 rows
  ingest: 105.3ms
  validate_input: 20.8ms
  transform: 22.1ms
  validate_output: 12.4ms
  package: 472.0ms

$ continuous-gtfs realtime data/sound-transit/rt/vehicle_positions.pb tests/fixtures/pipelines/st_realtime/
RT pipeline: 5.5ms
  Input:  16 entities (4705 bytes)
  Output: 16 entities
    PB:   4705 bytes
    JSON: 23620 bytes

The framework is now a framework, not a prototype. We're building user documentation alongside the code — the Guide covers everything from getting started to writing custom transforms.

The Decision

The transform framework is production-ready and processing real Sound Transit data correctly. We consolidated Sound Transit's 4-environment config model into a single canonical pipeline — environment variance will be handled at the pipeline version level (different container images) if ever needed.

What This Means

  • Agency engineers can write transforms now — the builtins cover 90% of Sound Transit's rules, and @step handles the rest
  • The CLI enables local development and testing against real GTFS data
  • The pipeline worker is ready to be wired to the orchestrator via gRPC (Phase 2)
  • User documentation is being written alongside the code to catch ergonomic issues early

Open Questions

  • Should we add dry-run and selective execution modes now, or wait until there's a real need?
  • How should the 3 schedule-dependent RT transforms (PreserveCancelledTrips, InsertMissingCancellations, ConvertScheduledToNew) access the schedule feed?
  • The zip packaging stage dominates schedule pipeline time (472ms of 633ms) — is there value in publishing individual files to GCS instead of a zip?