
Detecting What Changed in a Schedule Feed

March 20, 2026 — Experiment: gtfs-digester-core

The Question

Sound Transit regenerates their GTFS schedule feed regularly. Sometimes the content changes (new routes, updated stop names), sometimes it's identical but re-exported with different formatting. We need to answer two questions reliably: "did anything actually change?" and "if so, what exactly?"

This matters for the pipeline (don't re-process identical feeds) and for milestone validation (prove our output matches theirs).

What We Tried

We built a canonicalization and fingerprinting library that:

  • Normalizes GTFS files into a canonical form (consistent column order, sorted rows, trimmed whitespace, zero-padded times)
  • Computes a BLAKE3 hash fingerprint of the canonical form
  • Compares two feeds at the file level and row level, keyed on primary keys from the GTFS spec
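As a sketch of the canonicalize-then-hash idea (function names are illustrative, not the library's API, and hashlib's SHA-256 stands in here for BLAKE3, which in practice comes from a dedicated package):

```python
import csv
import hashlib
import io

def canonicalize(rows, columns):
    """Normalize one GTFS table: sorted column order, trimmed cells,
    zero-padded times, rows sorted by their full canonical content."""
    cols = sorted(columns)                      # consistent column order
    canon = []
    for row in rows:
        cells = []
        for c in cols:
            v = (row.get(c) or "").strip()      # trim whitespace
            if c.endswith("_time") and v:       # zero-pad e.g. 7:05:00 -> 07:05:00
                v = v.zfill(8)
            cells.append(v)
        canon.append(cells)
    canon.sort()                                # deterministic row order
    return cols, canon

def fingerprint(rows, columns):
    """Hash the canonical CSV form. SHA-256 stands in for BLAKE3 here."""
    cols, canon = canonicalize(rows, columns)
    buf = io.StringIO()
    w = csv.writer(buf, lineterminator="\n")
    w.writerow(cols)
    w.writerows(canon)
    return "v1:" + hashlib.sha256(buf.getvalue().encode()).hexdigest()
```

Because the hash is taken over the canonical form rather than the raw bytes, two exports that differ only in formatting collapse to the same fingerprint.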

What We Found

  1. Sound Transit's feed has 23 files — including Fares V2 tables and custom extensions we hadn't expected (modifications.txt, direction_names_exceptions.txt). Our library handles unknown files gracefully with warnings.

  2. Their feed has non-standard columns. trips.txt has boarding_type, drt_advance_book_min, peak_offpeak. stops.txt has tts_stop_name. These are dropped from the canonical form with a warning, preserving the standard data.

  3. Canonicalization roundtrips perfectly — load the feed, normalize it, write it out, load it again. Same fingerprint. Same row counts. Same values. Zero data loss across all 135,937 stop_times rows.

  4. Fingerprinting is deterministic — we loaded the same feed 10 times and got the identical fingerprint every time. Reordering columns, shuffling rows, changing whitespace — all produce the same fingerprint.
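The invariances in (3) and (4) can be demonstrated with a toy version of the scheme (illustrative only; the real implementation lives in gtfs-digester-core):

```python
import hashlib

def toy_fingerprint(header, rows):
    """Hash a table after sorting columns, trimming cells, and sorting rows."""
    order = sorted(range(len(header)), key=lambda i: header[i])
    canon = sorted(tuple(row[i].strip() for i in order) for row in rows)
    blob = "|".join(sorted(header)) + "\n" + "\n".join(",".join(r) for r in canon)
    return hashlib.sha256(blob.encode()).hexdigest()

# The same stops table, exported three different ways:
a = toy_fingerprint(["stop_id", "stop_name"],
                    [["S1", "Westlake"], ["S2", "Angle Lake"]])
b = toy_fingerprint(["stop_name", "stop_id"],                       # columns and rows reordered
                    [["Angle Lake", "S2"], ["Westlake", "S1"]])
c = toy_fingerprint(["stop_id", "stop_name"],                       # stray whitespace
                    [["S2 ", " Angle Lake"], [" S1", "Westlake "]])
assert a == b == c   # all three presentations yield one fingerprint
```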

What It Looks Like

Checking if a feed changed — one line, instant:

$ bin/fingerprint data/sound-transit/schedule.zip

v1:b629af85a609c496deb47b06faa0387eabd31...
  agency.txt: 1 rows, 8 cols
  calendar.txt: 16 rows, 10 cols
  calendar_dates.txt: 34 rows, 3 cols
  feed_info.txt: 1 rows, 9 cols
  routes.txt: 7 rows, 7 cols
  shapes.txt: 33,041 rows, 5 cols
  stop_times.txt: 135,937 rows, 7 cols
  stops.txt: 349 rows, 13 cols
  trips.txt: 8,539 rows, 10 cols

Same fingerprint = nothing changed, skip processing. Different fingerprint? Here's what a diff looks like:

Archive diff: 2 files changed

stops.txt: +3 added, -0 removed, ~5 modified
  [+] stop_id=NEW_01  stop_name="Lynnwood City Center"
  [+] stop_id=NEW_02  stop_name="Mariner"
  [+] stop_id=NEW_03  stop_name="Mountlake Terrace"
  [~] stop_id=S1234   stop_name: "Intl Dist/Chinatown" → "International District"
  [~] stop_id=S1235   stop_name: "Seatac/Airport" → "SeaTac/Airport"
  ...

trips.txt: +0 added, -2 removed, ~0 modified
  [-] trip_id=T9901  (route_id=TEST_ROUTE)
  [-] trip_id=T9902  (route_id=TEST_ROUTE)

agency.txt, calendar.txt, calendar_dates.txt, feed_info.txt,
routes.txt, shapes.txt, stop_times.txt: unchanged (fingerprint match)

Only the changed files get parsed. Unchanged files are verified by hash comparison alone.
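A row-level diff keyed on the table's GTFS primary key can be sketched like this (hypothetical shape, not the library's actual API; the report format above is what it renders to):

```python
def diff_rows(old, new, key):
    """Compare two tables (lists of dicts) keyed on a primary key column.
    Returns (added, removed, modified); modified maps key -> changed fields."""
    old_by = {r[key]: r for r in old}
    new_by = {r[key]: r for r in new}
    added = sorted(new_by.keys() - old_by.keys())
    removed = sorted(old_by.keys() - new_by.keys())
    modified = {}
    for k in old_by.keys() & new_by.keys():
        changes = {f: (old_by[k][f], new_by[k][f])
                   for f in old_by[k]
                   if old_by[k][f] != new_by[k].get(f)}
        if changes:
            modified[k] = changes
    return added, removed, modified

old = [{"stop_id": "S1234", "stop_name": "Intl Dist/Chinatown"}]
new = [{"stop_id": "S1234", "stop_name": "International District"},
       {"stop_id": "NEW_01", "stop_name": "Lynnwood City Center"}]
added, removed, modified = diff_rows(old, new, key="stop_id")
# added == ["NEW_01"]; modified records the stop_name rename for S1234
```

Since unchanged files are short-circuited by fingerprint comparison first, this row-level pass only ever runs on the handful of files whose hashes differ.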

The Decision

Use gtfs-digester for all schedule feed comparison and change detection. Fingerprinting for dedup, schema-aware diffing for validation, canonical output for publishing.

What This Means

  • The pipeline detects real changes vs. re-exports automatically — no unnecessary processing
  • Milestone validation has a clear, reviewable report: "here's exactly what our pipeline changed and why"
  • Sound Transit can see a human-readable diff of any two versions of their feed
  • The canonical form guarantees consumers get consistently structured data

Open Questions

  • What do we do about non-standard files and fields? Sound Transit's feed has agency-specific files (direction_names_exceptions.txt) and non-standard columns (boarding_type, tts_stop_name). The digester will support registering extensions, but should the pipeline's published output include these? Or should we constrain the public feed to only official GTFS spec + recognized extensions, and strip everything else? Keeping them means consumers get extra data they may not understand. Stripping them means losing agency-specific information that some consumers may rely on.