What Does the Production Infrastructure Look Like?

March 25, 2026 — Phase 3 Build

The Question

We have a working transform framework and orchestrator that run locally against Docker Compose Postgres. But these services need to run in Google Cloud — with private networking between them, a managed database, a CDN for public feed serving, and CI/CD that builds and pushes container images automatically. How do we define all of that infrastructure, and how much of it can we validate before actually deploying?

What We Tried

  • Ported the three-service architecture from the host-orchestration experiment into production OpenTofu configs, targeting the sound-transit-gtfs-pipeline GCP project
  • Defined Dockerfiles for all three services (pipeline worker, orchestrator, web) with multi-stage builds
  • Created a GitHub Actions workflow that detects which services changed and only builds affected images
  • Set up Cloud CDN with USE_ORIGIN_HEADERS caching (the configuration validated in the rt-publish-latency experiment)
  • Deployed in stages: APIs + service accounts first, then networking, then Cloud SQL, then Cloud Run + CDN
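The change-detection step in the workflow can be sketched with the dorny/paths-filter action. This is a minimal illustration, not the actual workflow — the job names, directory paths, and filter keys are assumptions:

```yaml
jobs:
  changes:
    runs-on: ubuntu-latest
    outputs:
      pipeline: ${{ steps.filter.outputs.pipeline }}
      orchestrator: ${{ steps.filter.outputs.orchestrator }}
      web: ${{ steps.filter.outputs.web }}
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            pipeline:
              - 'pipeline/**'
            orchestrator:
              - 'orchestrator/**'
            web:
              - 'web/**'

  build-web:
    needs: changes
    if: needs.changes.outputs.web == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # docker build + push for the web image goes here
```

Each build job gates on its filter output, so a commit touching only one service builds only that image.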

What We Found

  1. 44 resources planned cleanly, but deploying surfaced 5 issues the plan couldn't catch:
     • PostgreSQL 18 defaults to the ENTERPRISE_PLUS edition, which doesn't support micro instances.
     • The VPC connector API rejects the network ID format when ip_cidr_range is used.
     • IAM database usernames have a 63-character limit.
     • The orchestrator crashes if the schema isn't applied before it starts.
     • gRPC health probes require implementing the gRPC health checking protocol.
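The edition pitfall can be avoided by pinning `edition` explicitly in the OpenTofu config. A minimal sketch — the resource name, region, and network reference are illustrative, not our actual module layout:

```hcl
resource "google_sql_database_instance" "db" {
  name             = "continuous-gtfs-db"
  database_version = "POSTGRES_18"
  region           = "us-west1"

  settings {
    # Without this, newer Postgres versions default to ENTERPRISE_PLUS,
    # which rejects micro/small machine tiers.
    edition = "ENTERPRISE"
    tier    = "db-f1-micro"

    ip_configuration {
      ipv4_enabled    = false # private IP only
      private_network = google_compute_network.vpc.id
    }
  }
}
```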

  2. Staged deployment saved us from a long rollback. By applying APIs and service accounts first, then networking, then Cloud SQL (the slow one at ~5 minutes), then Cloud Run, we could fix each issue without re-creating expensive resources. The Cloud SQL instance survived all the fixes around it.
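The staging order can be expressed as a sequence of targeted applies. The module addresses below are illustrative — the real config may group resources differently:

```shell
# Stage 1: project APIs and service accounts (fast, cheap to re-run)
tofu apply -target=module.apis -target=module.service_accounts

# Stage 2: VPC, subnets, VPC connector
tofu apply -target=module.networking

# Stage 3: Cloud SQL -- the slow one (~5 min); keep it out of the churn
tofu apply -target=module.cloud_sql

# Stage 4: everything remaining -- Cloud Run, load balancer, CDN
tofu apply
```

Because later fixes only touched stage-4 resources, the stage-3 instance never had to be destroyed and re-created.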

  3. Private-only Cloud SQL needs a maintenance window pattern. With no public IP, you can't connect from your laptop to run migrations. The workflow: temporarily enable public IP, authorize your current IP, run psql, then immediately remove public access. Each patch takes about 30 seconds. This is now captured in our sysadmin skill.
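The window can be scripted with `gcloud sql instances patch`. The instance name matches the diagram below, but the database name and user are placeholders:

```shell
# 1. Temporarily enable a public IP, authorized for this machine only
MY_IP=$(curl -s https://ifconfig.me)
gcloud sql instances patch continuous-gtfs-db \
  --assign-ip --authorized-networks="${MY_IP}/32"

# 2. Run the migration against the (now-public) IP
PUBLIC_IP=$(gcloud sql instances describe continuous-gtfs-db \
  --format='value(ipAddresses[0].ipAddress)')
psql "host=${PUBLIC_IP} user=postgres dbname=gtfs sslmode=require" -f schema.sql

# 3. Immediately close the door again
gcloud sql instances patch continuous-gtfs-db \
  --no-assign-ip --clear-authorized-networks
```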

  4. IAM database auth requires cloudsql.iam_authentication flag. Code review caught this before deployment, but we also added a password-based user via Secret Manager as a pragmatic fallback — IAM auth integration into the postgres client library is a separate task.
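Both halves of this decision are small OpenTofu fragments. A sketch under assumed names — the service account and secret references are illustrative:

```hcl
# Inside the instance's settings block: enable IAM database auth
database_flags {
  name  = "cloudsql.iam_authentication"
  value = "on"
}

# IAM principal as a database user; the ".gserviceaccount.com" suffix is
# dropped, which helps stay under the 63-character username limit
resource "google_sql_user" "orchestrator_iam" {
  name     = "orch-sa@sound-transit-gtfs-pipeline.iam"
  instance = google_sql_database_instance.db.name
  type     = "CLOUD_IAM_SERVICE_ACCOUNT"
}

# Password-based fallback, password sourced from Secret Manager
resource "google_sql_user" "orchestrator_pw" {
  name     = "orchestrator"
  instance = google_sql_database_instance.db.name
  password = data.google_secret_manager_secret_version.db_password.secret_data
}
```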

  5. Cloud Run caches images by tag. Pushing a new image at :latest doesn't trigger a re-pull. You have to force a new revision by updating an env var (e.g., DEPLOY_TS). The experiments warned us about this and it played out exactly as documented.
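The forced-revision workaround is a one-liner: bumping any env var makes Cloud Run mint a new revision, which re-resolves the tag. Service name and region here are assumptions:

```shell
gcloud run services update continuous-gtfs-orchestrator \
  --region=us-west1 \
  --update-env-vars="DEPLOY_TS=$(date -u +%Y%m%dT%H%M%SZ)"
```

Pinning images by digest instead of tag would avoid the problem entirely, at the cost of plumbing the digest through CI.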

What It Looks Like

All three services running in production:

$ curl -s https://continuous-gtfs-web-27fufqiknq-uw.a.run.app/health | jq .
{
  "status": "ok",
  "service": "continuous-gtfs-web"
}

$ gcloud logging read '...service_name="continuous-gtfs-orchestrator"' --limit=3
Orchestrator gRPC listening on :8080
Dirty detection: all pipelines clean
STARTUP TCP probe succeeded
```mermaid
flowchart TB
    Internet -->|HTTPS| LB[Cloud Load Balancer + CDN\n34.36.64.110]
    Internet -->|HTTPS| Web[continuous-gtfs-web\nCloud Run, 0-2 instances]
    LB -->|Cache-Control: max-age=1| GCS[continuous-gtfs-feeds\nGCS Bucket]
    Web -->|gRPC via VPC| Orch[continuous-gtfs-orchestrator\nCloud Run, 1-2 instances]
    Orch -->|Private IP| DB[(continuous-gtfs-db\nPostgreSQL 18)]
    Orch -->|gRPC| Worker[continuous-gtfs-pipeline\nWorker Pool, 1 instance]
    Orch --> GCS
    Orch -->|Secret Manager| SM[pims-prod, pims-qa, db-password]
    Worker -->|VPC| Orch
```

Note

The pipeline image tag is content-addressed — same pipeline code from different branches produces the same tag. The orchestrator and web use commit SHAs since they change less frequently and aren't versioned per-environment.
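One way to derive such a tag — assuming the pipeline code lives in a single directory — is git's own tree hash: identical directory contents produce an identical object ID on any branch. The directory name and registry path below are illustrative:

```shell
# Tree object hash of the pipeline directory at HEAD --
# stable across branches whenever the directory contents match
TAG=$(git rev-parse HEAD:pipeline)
docker build \
  -t "us-west1-docker.pkg.dev/sound-transit-gtfs-pipeline/images/continuous-gtfs-pipeline:${TAG}" \
  pipeline/
```

CI can then skip the build entirely when an image with that tag already exists in the registry.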

The Decision

The production infrastructure is deployed and all three services are running: 44 resources applied in stages, with the orchestrator connected to Cloud SQL and reporting clean dirty detection on startup.

What This Means

  • The system is live — web service returns health OK, orchestrator is listening for workers, worker pool is ready
  • Container builds are automated on push to develop via GitHub Actions
  • Database schema is applied and accessible via private IP from the orchestrator
  • CDN is configured and ready to serve feeds once the pipeline publishes them
  • Operational knowledge is captured in the /sysadmin skill for future maintenance
  • Local development continues to use Docker Compose Postgres — no GCP dependency for day-to-day work

Open Questions

  • DNS for the CDN endpoint — do we have a domain name ready, or use the raw IP initially?
  • HTTPS for the CDN — needs DNS pointing at the LB IP before a managed certificate can be provisioned
  • Should schema migrations be baked into the orchestrator startup (auto-apply on boot) or remain a manual maintenance window step?