How do we deploy and orchestrate the whole system?
March 23, 2026 — Experiment: host-orchestration
Update: The orchestrator service was built in 011: Wiring Up the Orchestrator, which ported patterns from this experiment into production code with proto-first contracts, proper error handling, and Docker Compose local dev.
The Question
We've validated every piece of the pipeline individually — transforms, feed comparison, CDN publishing, data ingestion, and a two-service deployment model. But how do we wire it all together into a deployed system where feeds flow in, get transformed by the right pipeline version, and come out the other end? And how do we run multiple pipeline versions simultaneously for different environments?
What We Tried
- A three-service architecture: a public REST API (web), an internal gRPC orchestrator, and pipeline worker pools
- Workers as gRPC clients that connect outbound to the orchestrator, rather than the orchestrator calling workers
- Two pipeline images with different baked-in transforms deployed simultaneously as separate worker pools
- PostgreSQL-backed asset registry running inside the orchestrator with in-memory caching
- VPC networking with internal-only ingress on the orchestrator
- IAM authentication between services using metadata server ID tokens
What We Found
- Worker pools can't accept inbound connections — they have to call home. This flipped our mental model: workers connect outbound to the orchestrator's gRPC server and self-register with their version and capabilities. No service discovery needed — the orchestrator just tracks who's connected.
- Google's heavy auth libraries don't work on lightweight runtimes. The `google-auth-library` npm package consumed over 1 GB of memory and crashed. Fetching ID tokens directly from the metadata server at `http://metadata.google.internal` is 20 lines of code, zero dependencies, and works in 512 MB containers.
- Internal-only ingress actually works through VPC connectors. We feared this wouldn't work (it didn't in the earlier deployment-pattern experiment without VPC). With a VPC connector, the web service reaches the orchestrator through the VPC, and the orchestrator is genuinely not accessible from the public internet.
- You need `PRIVATE_RANGES_ONLY` egress if your service talks to both VPC resources and public APIs. Routing `ALL_TRAFFIC` through the VPC connector blocked Secret Manager and GCS API calls. Switching to `PRIVATE_RANGES_ONLY` (the `--vpc-egress` setting on Cloud Run) routes private IPs (Cloud SQL) through the VPC and everything else through the public internet.
- Different pipeline code automatically produces different version hashes. The worker computes a SHA-256 of its pipeline folder on startup. Production (passthrough) got `e14e2d71d447`; staging (rename vehicles + filter stops) got `991b0c102172`. The orchestrator sees two distinct versions connected.
- The full chain works: fetch → version → debounce → fan-out → dispatch → transform → register. One fetch of VehiclePositions from PIMS creates one version, which triggers debounce for both production and staging pipelines, dispatching to different workers with different transforms.
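The call-home registration in the first finding amounts to simple bookkeeping on the orchestrator side. Here's a minimal sketch of that registry — the gRPC transport is omitted and the `WorkerInfo` shape and method names are our illustrations, not the production proto contract:

```typescript
// Sketch of the orchestrator's worker registry (transport omitted).
type WorkerInfo = {
  workerId: string;
  version: string;        // short SHA-256 prefix of the worker's pipeline folder
  capabilities: string[]; // e.g. which pipelines this pool can run
};

class WorkerRegistry {
  private workers = new Map<string, WorkerInfo>();

  // Called when a worker's outbound stream opens and it self-registers.
  register(info: WorkerInfo): void {
    this.workers.set(info.workerId, info);
  }

  // Called when the stream closes — the connection is the liveness signal.
  unregister(workerId: string): void {
    this.workers.delete(workerId);
  }

  // Dispatch picks any connected worker advertising the requested version.
  findByVersion(version: string): WorkerInfo | undefined {
    for (const w of this.workers.values()) {
      if (w.version === version) return w;
    }
    return undefined;
  }

  get connectedCount(): number {
    return this.workers.size;
  }
}
```

Because the stream itself signals liveness, a dropped worker simply vanishes from dispatch — which is why no separate service-discovery or health-check layer is needed.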
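The zero-dependency token fetch from the second finding is small enough to show in full. The URL path and the required `Metadata-Flavor: Google` header are the real metadata-server API; the function names are ours, and the fetch only succeeds when running inside GCP (the metadata host doesn't resolve locally):

```typescript
// Minimal ID-token fetch from the Cloud Run / GCE metadata server — roughly
// the "20 lines, zero dependencies" replacement for google-auth-library.
const METADATA_HOST = "http://metadata.google.internal";

function identityTokenUrl(audience: string): string {
  const path = "/computeMetadata/v1/instance/service-accounts/default/identity";
  return `${METADATA_HOST}${path}?audience=${encodeURIComponent(audience)}`;
}

async function fetchIdToken(audience: string): Promise<string> {
  const res = await fetch(identityTokenUrl(audience), {
    headers: { "Metadata-Flavor": "Google" }, // required, or the server rejects the request
  });
  if (!res.ok) throw new Error(`metadata server returned ${res.status}`);
  return res.text(); // the response body is the raw JWT
}
```

The returned JWT goes on calls to the orchestrator as `Authorization: Bearer <token>`; Cloud Run's IAM layer validates it against the caller's invoker permission before the request ever reaches the service.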
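The startup version hash from the fifth finding can be sketched as a deterministic walk of the pipeline folder. Only the SHA-256-of-the-folder idea comes from the experiment; the sorted-path walk, the hashing of paths alongside contents, and the 12-character prefix are assumptions about the real worker's rules:

```typescript
import { createHash } from "node:crypto";
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Sketch of the worker's startup version hash: SHA-256 over the pipeline
// folder, with files sorted by relative path so the digest is deterministic.
function pipelineVersion(dir: string): string {
  const walk = (d: string, prefix = ""): string[] =>
    readdirSync(d, { withFileTypes: true })
      .flatMap((e) =>
        e.isDirectory()
          ? walk(join(d, e.name), `${prefix}${e.name}/`)
          : [`${prefix}${e.name}`],
      )
      .sort();

  const hash = createHash("sha256");
  for (const rel of walk(dir)) {
    hash.update(rel);                          // the path contributes to the hash…
    hash.update(readFileSync(join(dir, rel))); // …and so does the file's content
  }
  return hash.digest("hex").slice(0, 12); // short id, like e14e2d71d447 above
}
```

Because the digest covers file contents, any change to a transform yields a new version id — which is how the production and staging pools show up to the orchestrator as two distinct versions without anyone assigning version numbers by hand.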
What It Looks Like
The system is live, with real PIMS QA data flowing through. Try it:
WEB=https://host-orchestration-web-jqxmyinlqq-uw.a.run.app
# System overview
curl -s $WEB/api/v1/status | jq
# → {"connectedWorkers":2,"registeredAssets":10,"cachedAssetVersions":10}
# Connected pipeline workers (two versions, different transforms baked in)
curl -s $WEB/api/v1/workers | jq
# → "e14e2d71d447" (production — passthrough)
# → "991b0c102172" (staging — rename vehicles, filter stops)
# All assets: 6 PIMS inputs (prod + QA) + 4 pipeline outputs
curl -s $WEB/api/v1/assets | jq '.[] | {key, versionStrategy, cached, latestSizeBytes}'
# → pims/prod/schedule content_hash 1,568,820 bytes cached
# → pims/prod/vehicle_positions timestamp 975 bytes cached
# → pims/qa/schedule content_hash 1,536,738 bytes cached
# → output/st-realtime-production/feed.pb 984 bytes
# → ...
# Latest RT run per pipeline (completing every ~30s)
curl -s "$WEB/api/v1/runs?limit=50" | jq '[.[] | select(.pipelineId | contains("realtime"))] | group_by(.pipelineId) | .[] | {pipeline: .[0].pipelineId, status: .[0].status, startedAt: .[0].startedAt}'
# → {"pipeline":"st-realtime-production","status":"completed","startedAt":"2026-03-23T08:44:13.982Z"}
# → {"pipeline":"st-realtime-staging","status":"completed","startedAt":"2026-03-23T08:44:14.068Z"}
# Latest schedule run per pipeline
curl -s "$WEB/api/v1/runs?limit=50" | jq '[.[] | select(.pipelineId | contains("schedule"))] | group_by(.pipelineId) | .[] | {pipeline: .[0].pipelineId, status: .[0].status}'
# → {"pipeline":"st-schedule-production","status":"failed"} (workers weren't connected at trigger time)
Six PIMS assets (3 QA + 3 production, with separate API tokens from Secret Manager), four pipeline output assets, two pipeline versions running simultaneously. RT runs complete continuously; schedule runs trigger on content change (deduped by hash).
The Decision
We're deploying as three Cloud Run services — a public web API, an internal gRPC orchestrator, and pipeline worker pools — with workers connecting outbound to the orchestrator and registering their version and capabilities.
What This Means
- All 10 experiments are now closed — every layer of the MVP stack is validated on real infrastructure with real data
- The 10 system specs are ready for team review and implementation planning
- Sound Transit's PIMS QA feeds are flowing through the deployed system right now
- Adding a new pipeline version = building a new container image + deploying a new worker pool
- Adding a new environment = configuring asset bindings in the orchestrator
- The orchestrator is the single coordination point — workers are stateless and replaceable
Open Questions
- Should the orchestrator use a proper job scheduler (like BullMQ) instead of setTimeout for periodic fetches?
- How do we handle orchestrator restarts gracefully — dirty detection needs to re-trigger any missed runs
- What's the right monitoring story — should the orchestrator emit metrics to Cloud Monitoring, or is the REST API sufficient?
- Cloud SQL PG17 major version upgrade took 35 minutes on a micro instance — create fresh at the target version, never upgrade in place
Links
- Experiment spec | Results
- System specs — 10 specs ready for review
- Related: #31 (Component Validation: Deployment Process)
- Previous: 006: How do we deploy two services that talk to each other?