# What Does the Architecture Look Like?

*March 21, 2026 — Synthesis of all experiments*
## The Question
We're replacing Sound Transit's OneBusAway-based GTFS publishing with a modern pipeline. What does the architecture actually look like, and how do we know it'll work? We ran 5 focused experiments to validate each high-risk component with real code and real data before committing to an architecture.
## What We Validated
| Component | Experiment | Key Finding |
|---|---|---|
| Feed serving | rt-publish-latency | LB + 1s CDN cache: 112ms, $26/mo |
| Transform framework | pipeline-dag | Unified Step model, folder-scanned DAG |
| Config management | config-management | Kernel (infra) vs user-land (agency repo) |
| Schedule comparison | gtfs-digester-core | BLAKE3 fingerprint + primary-key diff |
| RT comparison | rt-feed-comparison | 4-level semantic equivalence |
## What It Looks Like
**How transforms are managed:**
Sound Transit staff own a git repo with pipeline folders. Each folder contains Python files with Step instances — either builtins they instantiate or custom functions they write. The framework scans the folder, builds a DAG, and the Control Console renders it.
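A minimal sketch of how that folder-scanned DAG could be wired. The `Step` fields and the example step names here are assumptions for illustration, not the framework's actual API: each step declares the names it consumes and produces, and edges fall out of matching producers to consumers.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of the unified Step model: each Step declares the
# names it consumes and produces; the framework derives DAG edges from them.
@dataclass
class Step:
    name: str
    inputs: list[str]
    outputs: list[str]
    fn: Callable[[dict], dict]

def build_dag(steps: list[Step]) -> dict[str, list[str]]:
    # Edge from producer step to consumer step wherever an output feeds an input.
    producers = {out: s.name for s in steps for out in s.outputs}
    edges: dict[str, list[str]] = {s.name: [] for s in steps}
    for s in steps:
        for inp in s.inputs:
            if inp in producers:
                edges[producers[inp]].append(s.name)
    return edges

# Example pipeline: two builtins and one custom step (names are invented).
steps = [
    Step("load_gtfs", [], ["stops"], lambda ctx: ctx),
    Step("rename_stops", ["stops"], ["stops_renamed"], lambda ctx: ctx),
    Step("validate", ["stops_renamed"], ["report"], lambda ctx: ctx),
]
print(build_dag(steps))
# {'load_gtfs': ['rename_stops'], 'rename_stops': ['validate'], 'validate': []}
```

The same adjacency map the Control Console would render; a real framework would also topologically sort it and reject cycles before executing.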
**How we know nothing broke:**
Every pipeline run produces a comparison report:
```
Schedule: "Feed fingerprint changed (v1:b629af → v1:c83d21).
  stops.txt: 3 added, 5 renamed. All changes match
  configured transforms. No unexpected modifications."

RT: "Level 2 equivalent (semantically identical).
  42/42 entities matched after transforms.
  Header timestamp updated. No data loss."
```
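The schedule side of that report rests on two mechanisms: a canonical fingerprint (so row and column order don't matter) and a primary-key diff. A sketch under stated assumptions: the real pipeline uses BLAKE3 over Polars frames, while here `hashlib.blake2b` and plain dicts stand in, and `pk_diff` is an invented helper name.

```python
import hashlib

def fingerprint(rows: list[dict]) -> str:
    # Canonicalize before hashing: sort rows by primary key and fields by
    # name, so reordering rows or columns leaves the fingerprint unchanged.
    canon = "\n".join(
        ",".join(f"{k}={r[k]}" for k in sorted(r))
        for r in sorted(rows, key=lambda r: r["stop_id"])
    )
    # blake2b as a stand-in for the BLAKE3 hash the real pipeline uses.
    return "v1:" + hashlib.blake2b(canon.encode(), digest_size=8).hexdigest()

def pk_diff(old: list[dict], new: list[dict], pk: str = "stop_id"):
    old_by, new_by = {r[pk]: r for r in old}, {r[pk]: r for r in new}
    added = sorted(new_by.keys() - old_by.keys())
    removed = sorted(old_by.keys() - new_by.keys())
    changed = sorted(k for k in old_by.keys() & new_by.keys() if old_by[k] != new_by[k])
    return added, removed, changed

old = [{"stop_id": "1", "stop_name": "4th & Pike"}, {"stop_id": "2", "stop_name": "Westlake"}]
new = [{"stop_id": "1", "stop_name": "4th Ave & Pike St"}, {"stop_id": "3", "stop_name": "Pioneer Sq"}]
print(fingerprint(old) != fingerprint(new))  # True: content actually changed
print(pk_diff(old, new))  # (['3'], ['2'], ['1'])
```

Matching the diff output against the configured transforms is what lets the report say "all changes match configured transforms" rather than just "something changed."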
## The Decisions
These are now backed by tested code and real Sound Transit data — not just proposals:
- Serve feeds as static files via GCS + Global HTTP LB + CDN with 1-second cache
- Transform framework uses the unified Step model — builtins and custom code discovered from a folder, rendered as a DAG
- Config boundary: infrastructure (env vars, OpenTofu) vs agency code (git repo with framework dependency)
- Change detection via canonical fingerprinting (schedule) and 4-level semantic equivalence (RT)
- Deployment as two Cloud Run services — pipeline (gRPC) and control console (public HTTP), wired via IAM + authenticated gRPC (journal post)
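The 4-level equivalence decision for RT can be sketched as a classifier. The level definitions below are assumptions inferred from the sample report (which only shows "Level 2 equivalent (semantically identical)"), and plain dicts stand in for protobuf messages:

```python
# Hypothetical level ordering: 1 = fully identical, 2 = identical after
# ignoring volatile header fields, 3 = same entity set but payloads drifted,
# 4 = entity sets differ (potential data loss).
def equivalence_level(a: dict, b: dict) -> int:
    if a == b:
        return 1  # identical, header timestamp and all
    if a["entities"] == b["entities"]:
        return 2  # only volatile fields (e.g. header timestamp) differ
    if a["entities"].keys() == b["entities"].keys():
        return 3  # same trips/vehicles present, but values changed
    return 4      # entities added or dropped

feed_a = {"header_timestamp": 100, "entities": {"trip_1": {"delay": 30}}}
feed_b = {"header_timestamp": 120, "entities": {"trip_1": {"delay": 30}}}
print(equivalence_level(feed_a, feed_b))  # 2: semantically identical
```

This mirrors the sample report above: 42/42 entities matched, only the header timestamp moved, so the run classifies as Level 2.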
## Updated architecture
The deployment model evolved from two services to three (web + orchestrator + pipeline worker pools) with an asset registry for data ingestion. See 009: How do we deploy and orchestrate the whole system? and the architecture overview for the current design.
## Open Questions
- How does the RT pipeline interface differ from schedule? Both use the Step model, but schedule operates on Polars DataFrames while RT operates on protobuf messages. The context shape and builtins are different. Should we unify the interface or embrace the difference?
- What does the Control Console need to show for RT vs schedule? Schedule runs are infrequent and produce a full diff report. RT runs every 20 seconds and produce a comparison report. The UI patterns are fundamentally different — batch review vs live monitoring.
- How do we handle the transition period? Sound Transit needs to run both OBA and our system in parallel. How do consumers discover the new feed URLs? Do we need a migration path for existing integrations?
## What's Next
- Deploy the two-service model to Cloud Run and verify the framework works containerized. Done — see journal post.
- Wire up the full schedule and RT pipelines end-to-end
- Build out the Control Console UI with real DAG rendering and comparison reports