# What Does the Architecture Look Like?

*March 21, 2026 — Synthesis of all experiments*
## The Question
We're replacing Sound Transit's OneBusAway-based GTFS publishing with a modern pipeline. What does the architecture actually look like, and how do we know it'll work? We ran 5 focused experiments to validate each high-risk component with real code and real data before committing to an architecture.
## What We Validated
| Component | Experiment | Key Finding |
|---|---|---|
| Feed serving | rt-publish-latency | LB + 1s CDN cache: 112ms, $26/mo |
| Transform framework | pipeline-dag | Unified Step model, folder-scanned DAG |
| Config management | config-management | Kernel (infra) vs user-land (agency repo) |
| Schedule comparison | gtfs-digester-core | BLAKE3 fingerprint + primary-key diff |
| RT comparison | rt-feed-comparison | 4-level semantic equivalence |
## What It Looks Like
**How transforms are managed:**
Sound Transit staff own a git repo with pipeline folders. Each folder contains Python files with Step instances — either builtins they instantiate or custom functions they write. The framework scans the folder, builds a DAG, and the Control Console renders it.
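A minimal sketch of how that folder-scanned DAG could be wired. The `Step` fields and the example step names here are assumptions for illustration, not the framework's actual API: each step declares the names it consumes and produces, and edges fall out of matching producers to consumers.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of the unified Step model: each Step declares the
# names it consumes and produces; the framework derives DAG edges from them.
@dataclass
class Step:
    name: str
    inputs: list[str]
    outputs: list[str]
    fn: Callable[[dict], dict]

def build_dag(steps: list[Step]) -> dict[str, list[str]]:
    # Edge from producer step to consumer step wherever an output feeds an input.
    producers = {out: s.name for s in steps for out in s.outputs}
    edges: dict[str, list[str]] = {s.name: [] for s in steps}
    for s in steps:
        for inp in s.inputs:
            if inp in producers:
                edges[producers[inp]].append(s.name)
    return edges

# Example pipeline: two builtins and one custom step (names are invented).
steps = [
    Step("load_gtfs", [], ["stops"], lambda ctx: ctx),
    Step("rename_stops", ["stops"], ["stops_renamed"], lambda ctx: ctx),
    Step("validate", ["stops_renamed"], ["report"], lambda ctx: ctx),
]
print(build_dag(steps))
# {'load_gtfs': ['rename_stops'], 'rename_stops': ['validate'], 'validate': []}
```

The same adjacency map the Control Console would render; a real framework would also topologically sort it and reject cycles before executing.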
**How we know nothing broke:**
Every pipeline run produces a comparison report:
```
Schedule: "Feed fingerprint changed (v1:b629af → v1:c83d21).
  stops.txt: 3 added, 5 renamed. All changes match
  configured transforms. No unexpected modifications."

RT: "Level 2 equivalent (semantically identical).
  42/42 entities matched after transforms.
  Header timestamp updated. No data loss."
```
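The schedule side of that report rests on two mechanisms: a canonical fingerprint (so row and column order don't matter) and a primary-key diff. A sketch under stated assumptions: the real pipeline uses BLAKE3 over Polars frames, while here `hashlib.blake2b` and plain dicts stand in, and `pk_diff` is an invented helper name.

```python
import hashlib

def fingerprint(rows: list[dict]) -> str:
    # Canonicalize before hashing: sort rows by primary key and fields by
    # name, so reordering rows or columns leaves the fingerprint unchanged.
    canon = "\n".join(
        ",".join(f"{k}={r[k]}" for k in sorted(r))
        for r in sorted(rows, key=lambda r: r["stop_id"])
    )
    # blake2b as a stand-in for the BLAKE3 hash the real pipeline uses.
    return "v1:" + hashlib.blake2b(canon.encode(), digest_size=8).hexdigest()

def pk_diff(old: list[dict], new: list[dict], pk: str = "stop_id"):
    old_by, new_by = {r[pk]: r for r in old}, {r[pk]: r for r in new}
    added = sorted(new_by.keys() - old_by.keys())
    removed = sorted(old_by.keys() - new_by.keys())
    changed = sorted(k for k in old_by.keys() & new_by.keys() if old_by[k] != new_by[k])
    return added, removed, changed

old = [{"stop_id": "1", "stop_name": "4th & Pike"}, {"stop_id": "2", "stop_name": "Westlake"}]
new = [{"stop_id": "1", "stop_name": "4th Ave & Pike St"}, {"stop_id": "3", "stop_name": "Pioneer Sq"}]
print(fingerprint(old) != fingerprint(new))  # True: content actually changed
print(pk_diff(old, new))  # (['3'], ['2'], ['1'])
```

Matching the diff output against the configured transforms is what lets the report say "all changes match configured transforms" rather than just "something changed."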
## The Decisions
These are now backed by tested code and real Sound Transit data — not just proposals:
- Serve feeds as static files via GCS + Global HTTP LB + CDN with 1-second cache
- Transform framework uses the unified Step model — builtins and custom code discovered from a folder, rendered as a DAG
- Config boundary: infrastructure (env vars, OpenTofu) vs agency code (git repo with framework dependency)
- Change detection via canonical fingerprinting (schedule) and 4-level semantic equivalence (RT)
- Deployment as two Cloud Run services — pipeline (gRPC) and control console (public HTTP), wired via IAM + authenticated gRPC (journal post)
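The 4-level equivalence decision for RT can be sketched as a classifier. The level definitions below are assumptions inferred from the sample report (which only shows "Level 2 equivalent (semantically identical)"), and plain dicts stand in for protobuf messages:

```python
# Hypothetical level ordering: 1 = fully identical, 2 = identical after
# ignoring volatile header fields, 3 = same entity set but payloads drifted,
# 4 = entity sets differ (potential data loss).
def equivalence_level(a: dict, b: dict) -> int:
    if a == b:
        return 1  # identical, header timestamp and all
    if a["entities"] == b["entities"]:
        return 2  # only volatile fields (e.g. header timestamp) differ
    if a["entities"].keys() == b["entities"].keys():
        return 3  # same trips/vehicles present, but values changed
    return 4      # entities added or dropped

feed_a = {"header_timestamp": 100, "entities": {"trip_1": {"delay": 30}}}
feed_b = {"header_timestamp": 120, "entities": {"trip_1": {"delay": 30}}}
print(equivalence_level(feed_a, feed_b))  # 2: semantically identical
```

This mirrors the sample report above: 42/42 entities matched, only the header timestamp moved, so the run classifies as Level 2.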
## Updated architecture
The deployment model evolved from two services to three (web + orchestrator + pipeline worker pools) with an asset registry for data ingestion. See 009: How do we deploy and orchestrate the whole system? and the architecture overview for the current design.
## Open Questions
- How does the RT pipeline interface differ from schedule? Both use the Step model, but schedule operates on Polars DataFrames while RT operates on protobuf messages. The context shape and builtins are different. Should we unify the interface or embrace the difference?
- What does the Control Console need to show for RT vs schedule? Schedule runs are infrequent and produce a full diff report. RT runs every 20 seconds and produce a comparison report. The UI patterns are fundamentally different — batch review vs live monitoring.
- How do we handle the transition period? Sound Transit needs to run both OBA and our system in parallel. How do consumers discover the new feed URLs? Do we need a migration path for existing integrations?
## What's Next
- Deploy the two-service model to Cloud Run and verify the framework works containerized. Done — see journal post.
- Wire up the full schedule and RT pipelines end-to-end
- Build out the Control Console UI with real DAG rendering and comparison reports