
Can we run the full pipeline end to end?

March 22, 2026 — Experiments: realtime-e2e + schedule-e2e

Updated

The experiments validated here are now running in production. Real Sound Transit PIMS feeds are being fetched, transformed, and published to the CDN every 20 seconds. See 013: Real Feeds Are Flowing.

The Question

We've validated individual pieces — the transform framework, feed comparison tools, CDN publishing, deployment model, and data ingestion registry. But can we wire them into complete pipelines that process Sound Transit's real feeds from input to output? And can we do it fast enough — under 1 second for realtime, under 60 seconds for schedule?

What We Tried

  • A complete realtime pipeline: decode protobuf → apply transforms → encode to both protobuf and JSON → validate roundtrip
  • A complete schedule pipeline: ingest zip → validate structure → apply transforms → validate output → repackage zip
  • Six realtime transform types based on Sound Transit's milestone details document, including stop filtering, multi-source feed merging, vehicle renaming, trip ID regex substitution, and header updates
  • Two schedule transform configs from Sound Transit's official transform document: a base config that removes inactive services and routes, and a production config that retains Link routes, renames stations, and updates route metadata
  • Both pipelines tested against real PIMS feeds from Sound Transit's QA and production endpoints
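In miniature, the realtime path above can be sketched as a small composable pipeline. This is an illustrative sketch, not the real code: JSON stands in for protobuf so the example stays self-contained, and all function names are hypothetical.

```python
import json

def decode(raw: bytes) -> dict:
    # Stand-in for protobuf decoding of a GTFS-RT FeedMessage.
    return json.loads(raw)

def encode_protobuf(feed: dict) -> bytes:
    # Stand-in encoder: the real pipeline emits protobuf bytes.
    return json.dumps(feed, sort_keys=True).encode()

def encode_json(feed: dict) -> bytes:
    return json.dumps(feed).encode()

def run_realtime_pipeline(raw: bytes, transforms) -> tuple[bytes, bytes]:
    """Decode, apply each transform in order, encode both formats, validate roundtrip."""
    feed = decode(raw)
    for transform in transforms:
        feed = transform(feed)  # each transform maps feed -> feed
    pb = encode_protobuf(feed)
    assert decode(pb) == feed  # roundtrip validation
    return pb, encode_json(feed)
```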

What We Found

  1. Realtime transforms complete in under 1 millisecond. The entire decode-through-encode pipeline takes 10-23ms, but the actual transforms (renaming vehicles, filtering stops, merging feeds) are sub-millisecond. JSON encoding is the only stage above 1ms at 4-8ms — and it can run asynchronously after protobuf is published.

  2. Schedule transforms process 178,000 rows in 11-17 milliseconds. Sound Transit's production config touches 7,195 rows (clearing trip_short_name on every Link trip) and still completes in 17ms. The full pipeline finishes in under a second, roughly 60x inside the 60-second budget; zip packaging dominates at 636-755ms.

  3. Sound Transit's transform document defines four environment-specific configs. Each is a layered set of remove/retain/update rules. The "retain" operation — explicitly keeping rows that other rules would remove — is essential for their production config where Link routes are removed by the base config but retained for PIMS output.

  4. Six of nine requested RT transform types work against real data. Three transforms (preserving cancelled trips, inserting missing cancellations, converting scheduled trips to NEW) require cross-referencing the schedule feed.

    Schedule-dependent transforms are MVP targets

    The output-assets-in-registry design (from 007) enables the RT pipeline to consume the transformed schedule as an input slot. All nine transform types are MVP targets — see the realtime pipeline spec.

  5. JSON output is 5x larger than protobuf. A 4.7KB VehiclePositions protobuf becomes 23.6KB as JSON. This matters for CDN costs if consumers use JSON, but protobuf is the primary format.

  6. No ServiceAlerts endpoint exists from PIMS. The OBA aggregated feed returns 401. ServiceAlerts testing is deferred — we'll need to clarify this data source with Sound Transit.

What It Looks Like

Sound Transit's schedule transform config expressed as data-driven rules:

```python
# Production: retain Link routes that base config removes, update metadata
config = TransformConfig(
    retains=[RetainRule("routes.txt", [MatchCondition("route_id", "100479")])],
    removes=[RemoveRule("calendar.txt", [MatchCondition("service_id", regex=r"^LLR.*")])],
    updates=[UpdateRule("stops.txt", [MatchCondition("stop_id", "C05")], {
        "stop_name": "Symphony",  # University Street → Symphony
    })],
)
```
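One way to read the rules' assumed precedence (removes drop matching rows, retains win over removes, updates rewrite fields on surviving rows) is this sketch; the rule classes here are minimal stand-ins, not the real engine:

```python
import re
from collections import namedtuple

RemoveRule = namedtuple("RemoveRule", "filename conditions")
RetainRule = namedtuple("RetainRule", "filename conditions")
UpdateRule = namedtuple("UpdateRule", "filename conditions values")
TransformConfig = namedtuple("TransformConfig", "retains removes updates")

class MatchCondition:
    """Match a row field against a literal value or a regex (assumed semantics)."""
    def __init__(self, field, value=None, regex=None):
        self.field, self.value, self.regex = field, value, regex
    def matches(self, row):
        v = row.get(self.field, "")
        return bool(re.match(self.regex, v)) if self.regex else v == self.value

def apply_config(filename, rows, config):
    def matching(rules, row):
        return any(r.filename == filename and all(c.matches(row) for c in r.conditions)
                   for r in rules)
    out = []
    for row in rows:
        # A retain explicitly overrides any remove that would drop the row.
        if matching(config.removes, row) and not matching(config.retains, row):
            continue
        for rule in config.updates:
            if rule.filename == filename and all(c.matches(row) for c in rule.conditions):
                row = {**row, **rule.values}
        out.append(row)
    return out
```

In this reading, the production config's retain on route 100479 is what keeps the Link route that a base-layer remove would otherwise drop.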

Realtime transform from Sound Transit's milestone doc:

```python
# Remove non-revenue stations from feeds before expansion opening
FilterStopsByID(blocked_stop_ids={"E01", "E07", "S03", "S05"})

# Merge feeds from multiple backend systems
CombineFeeds(additional_feeds=[vendor_b_feed])
```
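A minimal sketch of what these two transforms could do, operating on dict-shaped feeds rather than real protobuf messages (class names come from the doc, the bodies are assumed):

```python
class FilterStopsByID:
    """Drop blocked stops from every trip update (sketch over dict-shaped feeds)."""
    def __init__(self, blocked_stop_ids):
        self.blocked = set(blocked_stop_ids)
    def __call__(self, feed):
        for entity in feed.get("entity", []):
            trip_update = entity.get("trip_update")
            if trip_update:
                trip_update["stop_time_update"] = [
                    stu for stu in trip_update.get("stop_time_update", [])
                    if stu.get("stop_id") not in self.blocked
                ]
        return feed

class CombineFeeds:
    """Append entities from additional backend feeds onto the primary feed."""
    def __init__(self, additional_feeds):
        self.additional_feeds = additional_feeds
    def __call__(self, feed):
        for extra in self.additional_feeds:
            feed["entity"].extend(extra.get("entity", []))
        return feed
```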

Pipeline timing breakdown (real PIMS data):

```text
Schedule (178K rows, 1.5MB zip):
  ingest:    154ms  ████
  validate:   13ms  ▌
  transform:  17ms  ▌
  validate:   12ms  ▌
  package:   636ms  ████████████████

Realtime (16 entities, 6.5KB):
  decode:    0.04ms
  transform: 0.02ms
  encode_pb: 0.03ms
  encode_json: 7ms  ██
  validate:   12ms  ███
```

The Decision

Both pipelines are validated end-to-end against real Sound Transit data with massive performance headroom — the transform engine is ready for production integration.

What This Means

  • We can now write a functionally complete MVP spec — every pipeline stage has been validated with real data
  • Sound Transit's official transform configs (remove/retain/update rules) map directly to our data-driven engine
  • The schedule pipeline's data-driven approach means new transform configs are just config changes, not code changes
  • Three RT transform types that cross-reference the schedule feed are the main integration work remaining
  • host-orchestration is the last experiment needed to validate the deployment model

Open Questions

  • How do we handle cascade removals in schedule transforms? Removing a route should also remove its trips and stop_times.
  • Where does the transforms config live — in the agency git repo (user-land) or managed by the host (kernel)?
  • Sound Transit needs to provide a ServiceAlerts data source — PIMS doesn't expose one, and OBA requires auth we don't have.
  • Should the schedule pipeline publish extracted files to GCS individually (faster) or always as a zip (simpler)?
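On the cascade question, a minimal sketch of the intended behavior, assuming GTFS files are loaded as lists of row dicts (the function name is hypothetical):

```python
def cascade_remove_route(files, route_id):
    """Drop a route plus the trips and stop_times that reference it.

    `files` maps GTFS filename -> list of row dicts.
    """
    files["routes.txt"] = [r for r in files["routes.txt"] if r["route_id"] != route_id]
    # Trips reference routes via route_id; stop_times reference trips via trip_id.
    dead_trips = {t["trip_id"] for t in files["trips.txt"] if t["route_id"] == route_id}
    files["trips.txt"] = [t for t in files["trips.txt"] if t["trip_id"] not in dead_trips]
    files["stop_times.txt"] = [s for s in files["stop_times.txt"] if s["trip_id"] not in dead_trips]
    return files
```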