What Does the Production Infrastructure Look Like?

March 25, 2026 — Phase 3 Build

The Question

We have a working transform framework and orchestrator that run locally against Docker Compose Postgres. But these services need to run in Google Cloud — with private networking between them, a managed database, a CDN for public feed serving, and CI/CD that builds and pushes container images automatically. How do we define all of that infrastructure, and how much of it can we validate before actually deploying?

What We Tried

  • Ported the three-service architecture from the host-orchestration experiment into production OpenTofu configs, targeting the sound-transit-gtfs-pipeline GCP project
  • Defined Dockerfiles for all three services (pipeline worker, orchestrator, web) with multi-stage builds
  • Created a GitHub Actions workflow that detects which services changed and only builds affected images
  • Set up Cloud CDN with USE_ORIGIN_HEADERS caching (the configuration validated in the rt-publish-latency experiment)
  • Deployed in stages: APIs + service accounts first, then networking, then Cloud SQL, then Cloud Run + CDN
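The change-detection step in the workflow can be sketched with the dorny/paths-filter action. This is a minimal illustration, not the actual workflow — the job names, directory paths, and filter keys are assumptions:

```yaml
jobs:
  changes:
    runs-on: ubuntu-latest
    outputs:
      pipeline: ${{ steps.filter.outputs.pipeline }}
      orchestrator: ${{ steps.filter.outputs.orchestrator }}
      web: ${{ steps.filter.outputs.web }}
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            pipeline:
              - 'pipeline/**'
            orchestrator:
              - 'orchestrator/**'
            web:
              - 'web/**'

  build-web:
    needs: changes
    if: needs.changes.outputs.web == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # docker build + push for the web image goes here
```

Each build job gates on its filter output, so a commit touching only one service builds only that image.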

What We Found

  1. 44 resources planned cleanly, but deploying surfaced 5 issues the plan couldn't catch:
     • PostgreSQL 18 defaults to the ENTERPRISE_PLUS edition, which doesn't support micro instances.
     • The VPC connector API rejects the network ID format when ip_cidr_range is used.
     • IAM database usernames have a 63-character limit.
     • The orchestrator crashes if the schema isn't applied before it starts.
     • gRPC health probes require implementing the gRPC health checking protocol.
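The edition pitfall can be avoided by pinning `edition` explicitly in the OpenTofu config. A minimal sketch — the resource name, region, and network reference are illustrative, not our actual module layout:

```hcl
resource "google_sql_database_instance" "db" {
  name             = "continuous-gtfs-db"
  database_version = "POSTGRES_18"
  region           = "us-west1"

  settings {
    # Without this, newer Postgres versions default to ENTERPRISE_PLUS,
    # which rejects micro/small machine tiers.
    edition = "ENTERPRISE"
    tier    = "db-f1-micro"

    ip_configuration {
      ipv4_enabled    = false # private IP only
      private_network = google_compute_network.vpc.id
    }
  }
}
```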

  2. Staged deployment saved us from a long rollback. By applying APIs and service accounts first, then networking, then Cloud SQL (the slow one at ~5 minutes), then Cloud Run, we could fix each issue without re-creating expensive resources. The Cloud SQL instance survived all the fixes around it.
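The staging order can be expressed as a sequence of targeted applies. The module addresses below are illustrative — the real config may group resources differently:

```shell
# Stage 1: project APIs and service accounts (fast, cheap to re-run)
tofu apply -target=module.apis -target=module.service_accounts

# Stage 2: VPC, subnets, VPC connector
tofu apply -target=module.networking

# Stage 3: Cloud SQL -- the slow one (~5 min); keep it out of the churn
tofu apply -target=module.cloud_sql

# Stage 4: everything remaining -- Cloud Run, load balancer, CDN
tofu apply
```

Because later fixes only touched stage-4 resources, the stage-3 instance never had to be destroyed and re-created.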

  3. Private-only Cloud SQL needs a maintenance window pattern. With no public IP, you can't connect from your laptop to run migrations. The workflow: temporarily enable public IP, authorize your current IP, run psql, then immediately remove public access. Each patch takes about 30 seconds. This is now captured in our sysadmin skill.
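The window can be scripted with `gcloud sql instances patch`. The instance name matches the diagram below, but the database name and user are placeholders:

```shell
# 1. Temporarily enable a public IP, authorized for this machine only
MY_IP=$(curl -s https://ifconfig.me)
gcloud sql instances patch continuous-gtfs-db \
  --assign-ip --authorized-networks="${MY_IP}/32"

# 2. Run the migration against the (now-public) IP
PUBLIC_IP=$(gcloud sql instances describe continuous-gtfs-db \
  --format='value(ipAddresses[0].ipAddress)')
psql "host=${PUBLIC_IP} user=postgres dbname=gtfs sslmode=require" -f schema.sql

# 3. Immediately close the door again
gcloud sql instances patch continuous-gtfs-db \
  --no-assign-ip --clear-authorized-networks
```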

  4. IAM database auth requires cloudsql.iam_authentication flag. Code review caught this before deployment, but we also added a password-based user via Secret Manager as a pragmatic fallback — IAM auth integration into the postgres client library is a separate task.
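Both halves of this decision are small OpenTofu fragments. A sketch under assumed names — the service account and secret references are illustrative:

```hcl
# Inside the instance's settings block: enable IAM database auth
database_flags {
  name  = "cloudsql.iam_authentication"
  value = "on"
}

# IAM principal as a database user; the ".gserviceaccount.com" suffix is
# dropped, which helps stay under the 63-character username limit
resource "google_sql_user" "orchestrator_iam" {
  name     = "orch-sa@sound-transit-gtfs-pipeline.iam"
  instance = google_sql_database_instance.db.name
  type     = "CLOUD_IAM_SERVICE_ACCOUNT"
}

# Password-based fallback, password sourced from Secret Manager
resource "google_sql_user" "orchestrator_pw" {
  name     = "orchestrator"
  instance = google_sql_database_instance.db.name
  password = data.google_secret_manager_secret_version.db_password.secret_data
}
```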

  5. Cloud Run caches images by tag. Pushing a new image at :latest doesn't trigger a re-pull. You have to force a new revision by updating an env var (e.g., DEPLOY_TS). The experiments warned us about this and it played out exactly as documented.
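The forced-revision workaround is a one-liner: bumping any env var makes Cloud Run mint a new revision, which re-resolves the tag. Service name and region here are assumptions:

```shell
gcloud run services update continuous-gtfs-orchestrator \
  --region=us-west1 \
  --update-env-vars="DEPLOY_TS=$(date -u +%Y%m%dT%H%M%SZ)"
```

Pinning images by digest instead of tag would avoid the problem entirely, at the cost of plumbing the digest through CI.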

What It Looks Like

All three services running in production:

$ curl -s https://continuous-gtfs-web-27fufqiknq-uw.a.run.app/health | jq .
{
  "status": "ok",
  "service": "continuous-gtfs-web"
}

$ gcloud logging read '...service_name="continuous-gtfs-orchestrator"' --limit=3
Orchestrator gRPC listening on :8080
Dirty detection: all pipelines clean
STARTUP TCP probe succeeded
```mermaid
flowchart TB
    Internet -->|HTTPS| LB[Cloud Load Balancer + CDN\n34.36.64.110]
    Internet -->|HTTPS| Web[continuous-gtfs-web\nCloud Run, 0-2 instances]
    LB -->|Cache-Control: max-age=1| GCS[continuous-gtfs-feeds\nGCS Bucket]
    Web -->|gRPC via VPC| Orch[continuous-gtfs-orchestrator\nCloud Run, 1-2 instances]
    Orch -->|Private IP| DB[(continuous-gtfs-db\nPostgreSQL 18)]
    Orch -->|gRPC| Worker[continuous-gtfs-pipeline\nWorker Pool, 1 instance]
    Orch --> GCS
    Orch -->|Secret Manager| SM[pims-prod, pims-qa, db-password]
    Worker -->|VPC| Orch
```

Note

The pipeline image tag is content-addressed — same pipeline code from different branches produces the same tag. The orchestrator and web use commit SHAs since they change less frequently and aren't versioned per-environment.
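One way to derive such a tag — assuming the pipeline code lives in a single directory — is git's own tree hash: identical directory contents produce an identical object ID on any branch. The directory name and registry path below are illustrative:

```shell
# Tree object hash of the pipeline directory at HEAD --
# stable across branches whenever the directory contents match
TAG=$(git rev-parse HEAD:pipeline)
docker build \
  -t "us-west1-docker.pkg.dev/sound-transit-gtfs-pipeline/images/continuous-gtfs-pipeline:${TAG}" \
  pipeline/
```

CI can then skip the build entirely when an image with that tag already exists in the registry.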

The Decision

The production infrastructure is deployed and all three services are running: 44 resources applied in stages, with the orchestrator connected to Cloud SQL and reporting clean dirty detection on startup.

What This Means

  • The system is live — web service returns health OK, orchestrator is listening for workers, worker pool is ready
  • Container builds are automated on push to develop via GitHub Actions
  • Database schema is applied and accessible via private IP from the orchestrator
  • CDN is configured and ready to serve feeds once the pipeline publishes them
  • Operational knowledge is captured in the /sysadmin skill for future maintenance
  • Local development continues to use Docker Compose Postgres — no GCP dependency for day-to-day work

Open Questions

  • DNS for the CDN endpoint — do we have a domain name ready, or use the raw IP initially?
  • HTTPS for the CDN — needs DNS pointing at the LB IP before a managed certificate can be provisioned
  • Should schema migrations be baked into the orchestrator startup (auto-apply on boot) or remain a manual maintenance window step?