Detecting What Changed in a Schedule Feed
March 20, 2026 — Experiment: gtfs-digester-core
The Question
Sound Transit regenerates their GTFS schedule feed regularly. Sometimes the content changes (new routes, updated stop names), sometimes it's identical but re-exported with different formatting. We need to answer two questions reliably: "did anything actually change?" and "if so, what exactly?"
This matters for the pipeline (don't re-process identical feeds) and for milestone validation (prove our output matches theirs).
What We Tried
We built a canonicalization and fingerprinting library that:
- Normalizes GTFS files into a canonical form (consistent column order, sorted rows, trimmed whitespace, zero-padded times)
- Computes a BLAKE3 hash fingerprint of the canonical form
- Compares two feeds at the file level and row level, keyed on primary keys from the GTFS spec
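To make the normalize-then-hash idea concrete, here is a minimal sketch in Python. The function names are illustrative, not the digester's actual API, and hashlib.blake2b stands in for BLAKE3 (which requires the third-party blake3 package):

```python
import csv
import hashlib
import io

def canonicalize(rows, time_cols=()):
    """Normalize one GTFS file (a list of row dicts): trimmed cells,
    zero-padded times, sorted column order, sorted rows."""
    cleaned = []
    for row in rows:
        row = {k.strip(): v.strip() for k, v in row.items()}
        for col in time_cols:
            if row.get(col):
                row[col] = row[col].zfill(8)  # "7:05:00" -> "07:05:00"
        cleaned.append(row)
    cols = sorted(cleaned[0].keys())
    cleaned.sort(key=lambda r: [r[c] for c in cols])
    return cols, cleaned

def fingerprint(cols, rows):
    """Hash the canonical CSV bytes (blake2b standing in for BLAKE3)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=cols, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return "v1:" + hashlib.blake2b(buf.getvalue().encode()).hexdigest()
```

Because rows and columns are sorted before hashing, a re-exported feed with shuffled rows, reordered columns, or extra whitespace produces the same fingerprint as the original.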
What We Found
- Sound Transit's feed has 23 files, including Fares V2 tables and custom extensions we hadn't expected (modifications.txt, direction_names_exceptions.txt). Our library handles unknown files gracefully with warnings.
- Their feed has non-standard columns: trips.txt has boarding_type, drt_advance_book_min, and peak_offpeak; stops.txt has tts_stop_name. These are dropped from the canonical form (with warnings logged), preserving the standard data.
- Canonicalization roundtrips perfectly: load the feed, normalize it, write it out, load it again. Same fingerprint. Same row counts. Same values. Zero data loss across all 135,937 stop_times rows.
- Fingerprinting is deterministic: we loaded the same feed 10 times and got the identical fingerprint every time. Reordering columns, shuffling rows, changing whitespace — all produce the same fingerprint.
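The roundtrip and determinism findings reduce to one property: canonicalization is idempotent, so writing the canonical form out and re-loading it yields the same fingerprint. A self-contained sketch of that check (hypothetical helper names, blake2b standing in for BLAKE3):

```python
import csv
import hashlib
import io

def canonical_text(rows):
    """Render a list of row dicts as canonical CSV: trimmed values,
    sorted columns, sorted rows."""
    cols = sorted(rows[0])
    ordered = sorted(({k: v.strip() for k, v in r.items()} for r in rows),
                     key=lambda r: [r[c] for c in cols])
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=cols, lineterminator="\n")
    writer.writeheader()
    writer.writerows(ordered)
    return buf.getvalue()

def fingerprint(text):
    return "v1:" + hashlib.blake2b(text.encode()).hexdigest()

def roundtrips(rows):
    """canonicalize -> write -> reload -> canonicalize: same fingerprint?"""
    first = canonical_text(rows)
    reloaded = list(csv.DictReader(io.StringIO(first)))
    second = canonical_text(reloaded)
    return fingerprint(first) == fingerprint(second)
```

The real library runs this check against the full feed; the sketch shows why it holds: once a file is in canonical form, normalizing it again is a no-op.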
What It Looks Like
Checking if a feed changed — one line, instant:
$ bin/fingerprint data/sound-transit/schedule.zip
v1:b629af85a609c496deb47b06faa0387eabd31...
agency.txt: 1 rows, 8 cols
calendar.txt: 16 rows, 10 cols
calendar_dates.txt: 34 rows, 3 cols
feed_info.txt: 1 rows, 9 cols
routes.txt: 7 rows, 7 cols
shapes.txt: 33,041 rows, 5 cols
stop_times.txt: 135,937 rows, 7 cols
stops.txt: 349 rows, 13 cols
trips.txt: 8,539 rows, 10 cols
Same fingerprint = nothing changed, skip processing. Different fingerprint? Here's what a diff looks like:
Archive diff: 2 files changed
stops.txt: +3 added, -0 removed, ~5 modified
[+] stop_id=NEW_01 stop_name="Lynnwood City Center"
[+] stop_id=NEW_02 stop_name="Mariner"
[+] stop_id=NEW_03 stop_name="Mountlake Terrace"
[~] stop_id=S1234 stop_name: "Intl Dist/Chinatown" → "International District"
[~] stop_id=S1235 stop_name: "Seatac/Airport" → "SeaTac/Airport"
...
trips.txt: +0 added, -2 removed, ~0 modified
[-] trip_id=T9901 (route_id=TEST_ROUTE)
[-] trip_id=T9902 (route_id=TEST_ROUTE)
agency.txt, calendar.txt, calendar_dates.txt, feed_info.txt,
routes.txt, shapes.txt, stop_times.txt: unchanged (fingerprint match)
Only the changed files get parsed. Unchanged files are verified by hash comparison alone.
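The row-level diff above can be sketched as a keyed comparison: index each file's rows by its GTFS primary key (stop_id for stops.txt, trip_id for trips.txt), then classify rows as added, removed, or modified. The function below is illustrative, not the digester's actual API:

```python
def diff_rows(old, new, key):
    """Compare two lists of row dicts keyed on `key` (e.g. "stop_id").
    Returns (added, removed, modified) where `modified` maps a key to
    {column: (old_value, new_value)} for each changed column."""
    old_by = {r[key]: r for r in old}
    new_by = {r[key]: r for r in new}
    added = [new_by[k] for k in new_by.keys() - old_by.keys()]
    removed = [old_by[k] for k in old_by.keys() - new_by.keys()]
    modified = {}
    for k in old_by.keys() & new_by.keys():
        changes = {c: (old_by[k][c], new_by[k].get(c))
                   for c in old_by[k]
                   if old_by[k][c] != new_by[k].get(c)}
        if changes:
            modified[k] = changes
    return added, removed, modified
```

Rendering `added` as `[+]` lines, `removed` as `[-]`, and `modified` as `[~] column: old -> new` gives exactly the report shown above, and it only runs for files whose fingerprints disagree.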
The Decision
Use gtfs-digester for all schedule feed comparison and change detection. Fingerprinting for dedup, schema-aware diffing for validation, canonical output for publishing.
What This Means
- The pipeline detects real changes vs. re-exports automatically — no unnecessary processing
- Milestone validation has a clear, reviewable report: "here's exactly what our pipeline changed and why"
- Sound Transit can see a human-readable diff of any two versions of their feed
- The canonical form guarantees consumers get consistently structured data
Open Questions
- What do we do about non-standard files and fields? Sound Transit's feed has agency-specific files (direction_names_exceptions.txt) and non-standard columns (boarding_type, tts_stop_name). The digester will support registering extensions, but should the pipeline's published output include these? Or should we constrain the public feed to only the official GTFS spec plus recognized extensions, and strip everything else? Keeping them means consumers get extra data they may not understand. Stripping them means losing agency-specific information that some consumers may rely on.