
# gtfs-utils

**Utilities to process GTFS data sets.**


- ✅ supports `frequencies.txt`
- ✅ works in the browser
- ✅ fully asynchronous/streaming

## Design goals

### streaming/iterative on sorted data

As public transportation systems will hopefully become more integrated over time, GTFS datasets will often be multiple GBs large. GTFS processing should still work in memory-constrained environments such as Raspberry Pis or FaaS platforms.

Whenever possible, `gtfs-utils` tools read only as much data into memory as they need. For this, the individual files in a GTFS dataset need to be [sorted in a way](#sorted-gtfs-files) that allows iterative processing.

Read more in the [*performance* section](#performance).

### data-source-agnostic

`gtfs-utils` does not make assumptions about where you read the GTFS data from. Although it has a built-in tool to read CSV from files on disk, anything is possible: `.zip` archives, HTTP requests, in-memory buffers, dat, IPFS, etc.

There are too many half-done, slightly opinionated GTFS processing tools out there, so `gtfs-utils` tries to be as universal as possible.

### correct

Aside from new features of the ever-expanding GTFS spec that change the expected behavior of old ones (and bugs of course), `gtfs-utils` tries to follow the spec closely.

For example, when computing the absolute timestamp/instant of an arrival at a stop, it always takes `stop_timezone` or the user-supplied timezone into account, because `stop_times.txt` uses "wall clock time".
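
To illustrate why the timezone matters, here is a minimal sketch of turning a `stop_times.txt` time into an absolute timestamp. It follows the GTFS convention of measuring times from "noon minus 12h" of the service day. The helper names are hypothetical, the code relies only on the built-in `Intl` API, and it is not meant to mirror how `gtfs-utils` is implemented.

```javascript
// Parses a GTFS time like '25:10:00' (hours may exceed 24) into seconds.
const parseGtfsTime = (time) => {
	const [h, m, s] = time.split(':').map(part => parseInt(part, 10))
	return h * 3600 + m * 60 + s
}

// UTC offset (in seconds) of the timezone `zone` at the instant `utcMillis`.
const zoneOffset = (utcMillis, zone) => {
	const parts = new Intl.DateTimeFormat('en-US', {
		timeZone: zone, hourCycle: 'h23',
		year: 'numeric', month: 'numeric', day: 'numeric',
		hour: 'numeric', minute: 'numeric', second: 'numeric',
	}).formatToParts(new Date(utcMillis))
	const p = {}
	for (const {type, value} of parts) {
		if (type !== 'literal') p[type] = parseInt(value, 10)
	}
	// Re-interpret the local wall clock time as if it were UTC.
	const asIfUtc = Date.UTC(p.year, p.month - 1, p.day, p.hour, p.minute, p.second)
	return (asIfUtc - utcMillis) / 1000
}

// Absolute UNIX timestamp (in seconds) of a GTFS time on a service day.
// `date` is an ISO date like '2019-05-15', `zone` an IANA timezone name.
const absoluteTime = (date, time, zone) => {
	const [y, mo, d] = date.split('-').map(part => parseInt(part, 10))
	const noonAsIfUtc = Date.UTC(y, mo - 1, d, 12, 0, 0)
	// Local noon as a UTC instant; assumes no DST switch happens around noon.
	const noon = noonAsIfUtc / 1000 - zoneOffset(noonAsIfUtc, zone)
	return noon - 12 * 3600 + parseGtfsTime(time)
}
```

With these helpers, `absoluteTime('2019-05-15', '15:23:00', 'Europe/Berlin')` evaluates to `1557926580`. On a day with a daylight saving switch, the same wall clock time maps to a different offset from local midnight, which is why the timezone cannot be ignored.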

## Installing

npm install gtfs-utils

## Usage

[API documentation](docs/

### sorted GTFS files

**`gtfs-utils` assumes that the files in your GTFS dataset are sorted in a particular way.** This allows it to compute some data aggregations more memory-efficiently, which means that you can use it to process [very large](#performance) datasets. For example, if `trips.txt` and `stop_times.txt` are both sorted by `trip_id`, `computeStopovers()` can read each file incrementally, holding only the rows for *one* `trip_id` in memory at a time.
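
The idea of incremental reading can be sketched as follows. This is a simplified illustration, not the library's actual implementation: `groupSortedBy` and the sample rows are hypothetical, and rows are assumed to be plain objects arriving sorted by `trip_id`.

```javascript
// Yields arrays of consecutive rows that share the same value for `key`.
// Because the input is sorted by `key`, only one group is held in memory.
const groupSortedBy = async function* (rows, key) {
	let group = []
	for await (const row of rows) {
		if (group.length > 0 && group[0][key] !== row[key]) {
			yield group
			group = []
		}
		group.push(row)
	}
	if (group.length > 0) yield group
}

// Hypothetical stand-in for a `stop_times.txt` sorted by `trip_id`.
const sampleStopTimes = async function* () {
	yield {trip_id: 'a', stop_sequence: '1', stop_id: 'airport'}
	yield {trip_id: 'a', stop_sequence: '2', stop_id: 'museum'}
	yield {trip_id: 'b', stop_sequence: '1', stop_id: 'center'}
}

// Prints one line per trip; call `printTrips()` to try it.
const printTrips = async () => {
	for await (const group of groupSortedBy(sampleStopTimes(), 'trip_id')) {
		console.log(group[0].trip_id, group.map(row => row.stop_id))
	}
}
```

If the input were not sorted, rows of the same trip could show up in several separate groups, which is why the sort order matters for this approach.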

Miller and `sponge` work very well for this:

mlr --csv sort -f agency_id agency.txt | sponge agency.txt
mlr --csv sort -f parent_station -nr location_type stops.txt | sponge stops.txt
mlr --csv sort -f route_id routes.txt | sponge routes.txt
mlr --csv sort -f trip_id trips.txt | sponge trips.txt
mlr --csv sort -f trip_id -n stop_sequence stop_times.txt | sponge stop_times.txt
mlr --csv sort -f service_id calendar.txt | sponge calendar.txt
mlr --csv sort -f service_id,date calendar_dates.txt | sponge calendar_dates.txt
mlr --csv sort -f trip_id,start_time frequencies.txt | sponge frequencies.txt

There's also a script included in the npm package, which executes the commands above.

*Note:* For read-only sources (like HTTP requests), sorting the files in place is not an option. You can solve this by spawning `mlr` and piping the data through it.
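
A minimal sketch of this approach, using Node's built-in `child_process` module. The `pipeThrough` helper is hypothetical, and the usage comment assumes that `mlr` is installed and that `csvStream` is a readable stream of CSV data (e.g. an HTTP response body).

```javascript
const {spawn} = require('child_process')

// Pipes `input` (a readable stream) through a child process and
// returns the child's stdout as a readable stream.
const pipeThrough = (input, command, args) => {
	const child = spawn(command, args, {stdio: ['pipe', 'pipe', 'inherit']})
	// Surface spawn failures (e.g. command not installed) on the returned stream.
	child.once('error', (err) => child.stdout.destroy(err))
	// Surface read errors of the source, and stop the child in that case.
	input.once('error', (err) => {
		child.kill()
		child.stdout.destroy(err)
	})
	// Ignore write errors on stdin (e.g. EPIPE if the child exits early).
	child.stdin.on('error', () => {})
	input.pipe(child.stdin)
	return child.stdout
}

// Usage (not executed here):
// const sorted = pipeThrough(csvStream, 'mlr', ['--csv', 'sort', '-f', 'trip_id'])
```

The returned stream contains sorted CSV text, which still needs to be parsed into rows before it is handed to a `gtfs-utils` tool.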

*Note:* With a bit of extra code, you can also use `gtfs-utils` [with a `.zip` archive](docs/ or [with a *remote* feed](docs/

### basic example

Given our sample GTFS dataset, we'll answer the following question: **On a specific day, which vehicles of which lines stop at a specific station?**

We define a function `readFile` that reads our GTFS data into a readable stream/async iterable. In this case we'll read CSV files from disk using the built-in `readCsv` helper:

const readCsv = require('gtfs-utils/read-csv')

// Reads a GTFS file, given its name without the `.txt` extension.
const readFile = (file) => {
	return readCsv(require.resolve('sample-gtfs-feed/gtfs/' + file + '.txt'))
}

`computeStopovers()` will read `calendar.txt`, `calendar_dates.txt`, `trips.txt`, `stop_times.txt` & `frequencies.txt` and return all *stopovers* of all trips across the full time frame of the dataset.

It returns an async generator (which is thus async-iterable), so we can use `for await`.

In the following example, we're going to print all stopovers at `airport` on the 15th of May 2019:

const {DateTime} = require('luxon')
const computeStopovers = require('gtfs-utils/compute-stopovers')

const day = '2019-05-15'
// `t` is a UNIX timestamp in seconds; compare its local ISO date with `day`.
const isOnDay = (t) => {
	const iso = DateTime.fromMillis(t * 1000, {zone: 'Europe/Berlin'}).toISO()
	return iso.slice(0, day.length) === day
}

const stopovers = await computeStopovers(readFile, 'Europe/Berlin')
for await (const stopover of stopovers) {
	if (stopover.stop_id !== 'airport') continue
	if (!isOnDay(stopover.arrival)) continue
	console.log(stopover)
}

{
	stop_id: 'airport',
	trip_id: 'a-downtown-all-day',
	service_id: 'all-day',
	route_id: 'A',
	start_of_trip: 1557871200,
	arrival: 1557926580,
	departure: 1557926640,
}
{
	stop_id: 'airport',
	trip_id: 'a-outbound-all-day',
	service_id: 'all-day',
	route_id: 'A',
	start_of_trip: 1557871200,
	arrival: 1557933900,
	departure: 1557933960,
}
// …
{
	stop_id: 'airport',
	trip_id: 'c-downtown-all-day',
	service_id: 'all-day',
	route_id: 'C',
	start_of_trip: 1557871200,
	arrival: 1557926820,
	departure: 1557926880,
}

For more examples, check the [API documentation](docs/

## Performance

By default, `gtfs-utils` verifies that the input files are sorted correctly. You can disable this to improve performance slightly by running with the `CHECK_GTFS_SORTING=false` environment variable.

`gtfs-utils` should be fast enough for small to medium-sized GTFS datasets. It won't be as fast as other GTFS tools because it

- uses async iteration extensively for memory efficiency and ease of use, which currently has significant performance penalties in V8.
- is written in JavaScript, so it cannot optimise the memory layout of its data structures.
- parses all columns of a file it needs information from, into a JavaScript object.

On my M1 MacBook Air, with the 180 MB `2022-02-03` *HVV* GTFS dataset (17k `stops.txt` rows, 91k `trips.txt` rows, 2m `stop_times.txt` rows, ~500m stopovers), `computeStopovers` computes 18k stopovers per second, and finishes in several hours.

*Note:* If you want a faster way to query and transform GTFS datasets, I suggest using `gtfs-via-postgres` to leverage PostgreSQL's query optimizer. Once you have imported the data, it is usually orders of magnitude faster.

## Related

- gtfstidy – Go command line tool for validating and tidying GTFS feeds.
- gtfs-stream – Streaming GTFS and GTFS-RT parser for Node.js.
- mapzen-gtfs – Python library for reading and writing GTFS feeds.
- gtfspy – Public transport network analysis using Python.
- extract-gtfs-shapes – Command-line tool to extract shapes from a GTFS dataset.
- extract-gtfs-pathways – Command-line tool to extract pathways from a GTFS dataset.
- Awesome GTFS: Frameworks and Libraries – A collection of libraries for working with GTFS.

## Contributing

If you have a question or have difficulties using `gtfs-utils`, please double-check your code and setup first. If you think you have found a bug or want to propose a feature, refer to the issues page.