Detecting What Changed in a Schedule Feed
March 20, 2026 — Experiment: gtfs-digester-core
The Question
Sound Transit regenerates their GTFS schedule feed regularly. Sometimes the content changes (new routes, updated stop names), sometimes it's identical but re-exported with different formatting. We need to answer two questions reliably: "did anything actually change?" and "if so, what exactly?"
This matters for the pipeline (don't re-process identical feeds) and for milestone validation (prove our output matches theirs).
What We Tried
We built a canonicalization and fingerprinting library that:
- Normalizes GTFS files into a canonical form (consistent column order, sorted rows, trimmed whitespace, zero-padded times)
- Computes a BLAKE3 hash fingerprint of the canonical form
- Compares two feeds at the file level and row level, keyed on primary keys from the GTFS spec
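To make the normalize-then-hash idea concrete, here is a minimal sketch in Python. The function names are illustrative, not the digester's actual API, and hashlib.blake2b stands in for BLAKE3 (which requires the third-party blake3 package):

```python
import csv
import hashlib
import io

def canonicalize(rows, time_cols=()):
    """Normalize one GTFS file (a list of row dicts): trimmed cells,
    zero-padded times, sorted column order, sorted rows."""
    cleaned = []
    for row in rows:
        row = {k.strip(): v.strip() for k, v in row.items()}
        for col in time_cols:
            if row.get(col):
                row[col] = row[col].zfill(8)  # "7:05:00" -> "07:05:00"
        cleaned.append(row)
    cols = sorted(cleaned[0].keys())
    cleaned.sort(key=lambda r: [r[c] for c in cols])
    return cols, cleaned

def fingerprint(cols, rows):
    """Hash the canonical CSV bytes (blake2b standing in for BLAKE3)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=cols, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return "v1:" + hashlib.blake2b(buf.getvalue().encode()).hexdigest()
```

Because rows and columns are sorted before hashing, a re-exported feed with shuffled rows, reordered columns, or extra whitespace produces the same fingerprint as the original.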
What We Found
- Sound Transit's feed has 23 files, including Fares V2 tables and custom extensions we hadn't expected (modifications.txt, direction_names_exceptions.txt). Our library handles unknown files gracefully with warnings.
- Their feed has non-standard columns: trips.txt has boarding_type, drt_advance_book_min, and peak_offpeak; stops.txt has tts_stop_name. These are dropped from the canonical form (with warnings logged), preserving the standard data.
- Canonicalization roundtrips perfectly: load the feed, normalize it, write it out, load it again. Same fingerprint. Same row counts. Same values. Zero data loss across all 135,937 stop_times rows.
- Fingerprinting is deterministic: we loaded the same feed 10 times and got the identical fingerprint every time. Reordering columns, shuffling rows, changing whitespace — all produce the same fingerprint.
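The roundtrip and determinism findings reduce to one property: canonicalization is idempotent, so writing the canonical form out and re-loading it yields the same fingerprint. A self-contained sketch of that check (hypothetical helper names, blake2b standing in for BLAKE3):

```python
import csv
import hashlib
import io

def canonical_text(rows):
    """Render a list of row dicts as canonical CSV: trimmed values,
    sorted columns, sorted rows."""
    cols = sorted(rows[0])
    ordered = sorted(({k: v.strip() for k, v in r.items()} for r in rows),
                     key=lambda r: [r[c] for c in cols])
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=cols, lineterminator="\n")
    writer.writeheader()
    writer.writerows(ordered)
    return buf.getvalue()

def fingerprint(text):
    return "v1:" + hashlib.blake2b(text.encode()).hexdigest()

def roundtrips(rows):
    """canonicalize -> write -> reload -> canonicalize: same fingerprint?"""
    first = canonical_text(rows)
    reloaded = list(csv.DictReader(io.StringIO(first)))
    second = canonical_text(reloaded)
    return fingerprint(first) == fingerprint(second)
```

The real library runs this check against the full feed; the sketch shows why it holds: once a file is in canonical form, normalizing it again is a no-op.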
What It Looks Like
Checking if a feed changed — one line, instant:
$ bin/fingerprint data/sound-transit/schedule.zip
v1:b629af85a609c496deb47b06faa0387eabd31...
agency.txt: 1 rows, 8 cols
calendar.txt: 16 rows, 10 cols
calendar_dates.txt: 34 rows, 3 cols
feed_info.txt: 1 rows, 9 cols
routes.txt: 7 rows, 7 cols
shapes.txt: 33,041 rows, 5 cols
stop_times.txt: 135,937 rows, 7 cols
stops.txt: 349 rows, 13 cols
trips.txt: 8,539 rows, 10 cols
Same fingerprint = nothing changed, skip processing. Different fingerprint? Here's what a diff looks like:
Archive diff: 2 files changed
stops.txt: +3 added, -0 removed, ~5 modified
[+] stop_id=NEW_01 stop_name="Lynnwood City Center"
[+] stop_id=NEW_02 stop_name="Mariner"
[+] stop_id=NEW_03 stop_name="Mountlake Terrace"
[~] stop_id=S1234 stop_name: "Intl Dist/Chinatown" → "International District"
[~] stop_id=S1235 stop_name: "Seatac/Airport" → "SeaTac/Airport"
...
trips.txt: +0 added, -2 removed, ~0 modified
[-] trip_id=T9901 (route_id=TEST_ROUTE)
[-] trip_id=T9902 (route_id=TEST_ROUTE)
agency.txt, calendar.txt, calendar_dates.txt, feed_info.txt,
routes.txt, shapes.txt, stop_times.txt: unchanged (fingerprint match)
Only the changed files get parsed. Unchanged files are verified by hash comparison alone.
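The row-level diff above can be sketched as a keyed comparison: index each file's rows by its GTFS primary key (stop_id for stops.txt, trip_id for trips.txt), then classify rows as added, removed, or modified. The function below is illustrative, not the digester's actual API:

```python
def diff_rows(old, new, key):
    """Compare two lists of row dicts keyed on `key` (e.g. "stop_id").
    Returns (added, removed, modified) where `modified` maps a key to
    {column: (old_value, new_value)} for each changed column."""
    old_by = {r[key]: r for r in old}
    new_by = {r[key]: r for r in new}
    added = [new_by[k] for k in new_by.keys() - old_by.keys()]
    removed = [old_by[k] for k in old_by.keys() - new_by.keys()]
    modified = {}
    for k in old_by.keys() & new_by.keys():
        changes = {c: (old_by[k][c], new_by[k].get(c))
                   for c in old_by[k]
                   if old_by[k][c] != new_by[k].get(c)}
        if changes:
            modified[k] = changes
    return added, removed, modified
```

Rendering `added` as `[+]` lines, `removed` as `[-]`, and `modified` as `[~] column: old -> new` gives exactly the report shown above, and it only runs for files whose fingerprints disagree.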
The Decision
Use gtfs-digester for all schedule feed comparison and change detection. Fingerprinting for dedup, schema-aware diffing for validation, canonical output for publishing.
What This Means
- The pipeline detects real changes vs. re-exports automatically — no unnecessary processing
- Milestone validation has a clear, reviewable report: "here's exactly what our pipeline changed and why"
- Sound Transit can see a human-readable diff of any two versions of their feed
- The canonical form guarantees consumers get consistently structured data
Open Questions
- What do we do about non-standard files and fields? Sound Transit's feed has agency-specific files (direction_names_exceptions.txt) and non-standard columns (boarding_type, tts_stop_name). The digester will support registering extensions, but should the pipeline's published output include these? Or should we constrain the public feed to only the official GTFS spec plus recognized extensions, and strip everything else? Keeping them means consumers get extra data they may not understand. Stripping them means losing agency-specific information that some consumers may rely on.