How do we deploy and orchestrate the whole system?

March 23, 2026 — Experiment: host-orchestration

Updated: The orchestrator service was built in 011: Wiring Up the Orchestrator, porting patterns from this experiment into production code with proto-first contracts, proper error handling, and a Docker Compose local dev setup.

The Question

We've validated every piece of the pipeline individually — transforms, feed comparison, CDN publishing, data ingestion, and a two-service deployment model. But how do we wire it all together into a deployed system where feeds flow in, get transformed by the right pipeline version, and come out the other end? And how do we run multiple pipeline versions simultaneously for different environments?

What We Tried

  • A three-service architecture: a public REST API (web), an internal gRPC orchestrator, and pipeline worker pools
  • Workers as gRPC clients that connect outbound to the orchestrator, rather than the orchestrator calling workers
  • Two pipeline images with different baked-in transforms deployed simultaneously as separate worker pools
  • PostgreSQL-backed asset registry running inside the orchestrator with in-memory caching
  • VPC networking with internal-only ingress on the orchestrator
  • IAM authentication between services using metadata server ID tokens

What We Found

  1. Worker pools can't accept inbound connections — they have to call home. This flipped our mental model. Workers connect outbound to the orchestrator's gRPC server and self-register with their version and capabilities. No service discovery needed — the orchestrator just tracks who's connected.

  2. Google's heavy auth libraries don't work on lightweight runtimes. The google-auth-library npm package consumed over 1GB of memory and crashed. Fetching ID tokens directly from the metadata server at http://metadata.google.internal is 20 lines of code, zero dependencies, and works in 512MB containers.
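The direct metadata-server fetch looks roughly like this. The endpoint path and the `Metadata-Flavor: Google` header are the documented GCE/Cloud Run metadata interface; the helper names are ours, and the call only succeeds when running inside GCP.

```typescript
// Sketch of fetching an ID token straight from the metadata server —
// no auth libraries. Endpoint and header are GCP's documented metadata
// interface; function names here are illustrative.

const METADATA_HOST = "http://metadata.google.internal";

// The audience is the URL of the Cloud Run service you want to call.
export function identityTokenUrl(audience: string): string {
  return (
    `${METADATA_HOST}/computeMetadata/v1/instance/` +
    `service-accounts/default/identity?audience=${encodeURIComponent(audience)}`
  );
}

export async function fetchIdToken(audience: string): Promise<string> {
  // The metadata server rejects requests without this header.
  const res = await fetch(identityTokenUrl(audience), {
    headers: { "Metadata-Flavor": "Google" },
  });
  if (!res.ok) throw new Error(`metadata server: ${res.status}`);
  return res.text(); // the response body is the raw JWT
}
```

The returned token then goes on the outbound call as an `Authorization: Bearer <token>` header (or gRPC metadata entry), which is what Cloud Run's IAM check validates.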

  3. Internal-only ingress actually works through VPC connectors. We feared this wouldn't work (it didn't in the earlier deployment-pattern experiment without VPC). With a VPC connector, the web service reaches the orchestrator through the VPC and the orchestrator is genuinely not accessible from the public internet.

  4. You need PRIVATE_RANGES_ONLY egress if your service talks to both VPC resources and public APIs. ALL_TRAFFIC through the VPC connector blocked Secret Manager and GCS API calls. Switching to PRIVATE_RANGES_ONLY routes private IPs (Cloud SQL) through the VPC and everything else through the public internet.

  5. Different pipeline code automatically produces different version hashes. The worker computes a SHA-256 of its pipeline folder on startup. Production (passthrough) got e14e2d71d447, staging (rename vehicles + filter stops) got 991b0c102172. The orchestrator sees two distinct versions connected.
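The version hash can be sketched as a deterministic walk over the pipeline folder. This is an assumption-laden sketch — the real worker's exact walk order and encoding may differ — but it shows why different transform code necessarily yields a different version.

```typescript
// Sketch of the startup version hash: one SHA-256 over every file in
// the pipeline folder, walked in sorted order so the result is
// deterministic. (The production worker's exact encoding may differ.)
import { createHash } from "node:crypto";
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

export function pipelineVersionHash(dir: string): string {
  const hash = createHash("sha256");
  const walk = (d: string): void => {
    const entries = readdirSync(d, { withFileTypes: true }).sort((a, b) =>
      a.name.localeCompare(b.name),
    );
    for (const entry of entries) {
      const full = join(d, entry.name);
      if (entry.isDirectory()) {
        walk(full);
      } else {
        hash.update(full); // a renamed file changes the version
        hash.update(readFileSync(full)); // an edited file changes the version
      }
    }
  };
  walk(dir);
  // Short 12-hex prefix, like e14e2d71d447 / 991b0c102172 above.
  return hash.digest("hex").slice(0, 12);
}
```

Because the hash is computed at startup, a worker cannot lie about its version by accident: whatever code got baked into the image is what it registers with.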

  6. The full chain works: fetch → version → debounce → fan-out → dispatch → transform → register. One fetch of VehiclePositions from PIMS creates one version, which triggers debounce for both production and staging pipelines, dispatching to different workers with different transforms.
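The debounce and fan-out steps in that chain can be sketched as a small state machine. Names and the quiet-window mechanics here are illustrative assumptions, not the production implementation: the point is that one asset version dirties every bound pipeline, and a burst of fetches collapses into one run per pipeline.

```typescript
// Illustrative sketch of debounce + fan-out. A new asset version marks
// every bound pipeline dirty; a pipeline is dispatched only after its
// quiet window elapses, so repeated triggers collapse into one run.

interface Binding {
  assetKey: string;   // e.g. a PIMS input asset
  pipelineId: string; // e.g. "st-realtime-staging"
}

class Debouncer {
  private dirtySince = new Map<string, number>();

  constructor(
    private bindings: Binding[],
    private quietMs: number,
  ) {}

  // Fan-out: one new asset version can dirty several pipelines.
  onAssetVersion(assetKey: string, now: number): void {
    for (const b of this.bindings) {
      if (b.assetKey === assetKey) this.dirtySince.set(b.pipelineId, now);
    }
  }

  // Returns pipelines whose quiet window has elapsed, and clears them
  // so each burst produces exactly one dispatch per pipeline.
  drainDue(now: number): string[] {
    const due: string[] = [];
    for (const [pipelineId, since] of this.dirtySince) {
      if (now - since >= this.quietMs) due.push(pipelineId);
    }
    for (const id of due) this.dirtySince.delete(id);
    return due;
  }
}
```

A trailing-edge debounce like this (each trigger resets the window) favors batching; a leading-edge variant would favor latency. Which one the orchestrator should use per asset type is a tuning question.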

What It Looks Like

The deployed system with real PIMS QA data flowing through:

flowchart LR
  PIMS[PIMS QA Feeds] -->|fetch + auth| O[Orchestrator]
  O -->|gRPC| W1[Worker v1\ne14e2d71d447\nproduction]
  O -->|gRPC| W2[Worker v2\n991b0c102172\nstaging]
  W1 -->|stream PB| O
  W2 -->|stream PB| O
  O -->|register| DB[(PostgreSQL)]
  O -->|publish| GCS[GCS Bucket]
  Web[Web API] -->|gRPC/VPC| O

The system is live. Try it:

WEB=https://host-orchestration-web-jqxmyinlqq-uw.a.run.app
# System overview
curl -s $WEB/api/v1/status | jq
# → {"connectedWorkers":2,"registeredAssets":10,"cachedAssetVersions":10}

# Connected pipeline workers (two versions, different transforms baked in)
curl -s $WEB/api/v1/workers | jq
# → "e14e2d71d447"   (production — passthrough)
# → "991b0c102172"   (staging — rename vehicles, filter stops)

# All assets: 6 PIMS inputs (prod + QA) + 4 pipeline outputs
curl -s $WEB/api/v1/assets | jq '.[] | {key, versionStrategy, cached, latestSizeBytes}'
# → pims/prod/schedule                      content_hash   1,568,820 bytes   cached
# → pims/prod/vehicle_positions             timestamp            975 bytes   cached
# → pims/qa/schedule                        content_hash   1,536,738 bytes   cached
# → output/st-realtime-production/feed.pb                        984 bytes
# → ...

# Latest RT run per pipeline (completing every ~30s)
curl -s "$WEB/api/v1/runs?limit=50" | jq '[.[] | select(.pipelineId | contains("realtime"))] | group_by(.pipelineId) | .[] | {pipeline: .[0].pipelineId, status: .[0].status, startedAt: .[0].startedAt}'
# → {"pipeline":"st-realtime-production","status":"completed","startedAt":"2026-03-23T08:44:13.982Z"}
# → {"pipeline":"st-realtime-staging","status":"completed","startedAt":"2026-03-23T08:44:14.068Z"}

# Latest schedule run per pipeline
curl -s "$WEB/api/v1/runs?limit=50" | jq '[.[] | select(.pipelineId | contains("schedule"))] | group_by(.pipelineId) | .[] | {pipeline: .[0].pipelineId, status: .[0].status}'
# → {"pipeline":"st-schedule-production","status":"failed"}  (workers weren't connected at trigger time)

Six PIMS assets (3 QA + 3 production, with separate API tokens from Secret Manager), four pipeline output assets, two pipeline versions running simultaneously. RT runs complete continuously; schedule runs trigger on content change (deduped by hash).

The Decision

We're deploying as three Cloud Run services — a public web API, an internal gRPC orchestrator, and pipeline worker pools — with workers connecting outbound to the orchestrator and registering their version and capabilities.

What This Means

  • All 10 experiments are now closed — every layer of the MVP stack is validated on real infrastructure with real data
  • The 10 system specs are ready for team review and implementation planning
  • Sound Transit's PIMS QA feeds are flowing through the deployed system right now
  • Adding a new pipeline version = building a new container image + deploying a new worker pool
  • Adding a new environment = configuring asset bindings in the orchestrator
  • The orchestrator is the single coordination point — workers are stateless and replaceable

Open Questions

  • Should the orchestrator use a proper job scheduler (like BullMQ) instead of setTimeout for periodic fetches?
  • How do we handle orchestrator restarts gracefully? Dirty detection needs to re-trigger any runs missed while the service was down
  • What's the right monitoring story — should the orchestrator emit metrics to Cloud Monitoring, or is the REST API sufficient?
  • Cloud SQL major-version upgrades are slow — the PG17 upgrade took 35 minutes on a micro instance. Do we standardize on creating a fresh instance at the target version instead of upgrading in place?