Building your own MTA train arrivals dataset: a how-to

Note: this Gist has been superseded by the tutorial section of the gtfs_tripify documentation.

Interested in New York City transit? Want to understand why your particular train commute is as good or as bad as it is? This gist will show you how to roll your own daily MTA train arrival dataset using Python. The result can then be used to explore questions about train service that schedule data alone couldn’t answer.

Building a daily roll-up

To begin, visit the MTA GTFS-RT Archive at http://web.mta.info/developers/data/archives.html.

This page contains monthly rollups of realtime train location data in what is known as the “GTFS-RT” format. This is the data that powers both the train tracker apps on your phone and the arrival clocks on the station platforms, and the MTA helpfully provides a historical archive of this data online.

The archive covers all train lines in the system. Pick a month that you are interested in, and click on the link to download it to your computer. Be prepared to wait a while; the files are roughly 30 GB in size.
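
If you’d rather script the download, a minimal sketch follows. The URL below is a placeholder, not a real download location: copy the actual link for your month from the archive page.

import urllib.request

# Placeholder URL: substitute the real link from the archive page.
ARCHIVE_URL = 'https://example.com/mta-gtfs-rt-archive/201906.zip'
urllib.request.urlretrieve(ARCHIVE_URL, '201906.zip')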

Once the download is finished, you will have a file named something like 201908.zip on your computer.

Double-click on this file to extract its contents, and you will find that inside this zip file is another layer of zip files, one per day of the month.

Pick a day that you are interested in and double-click on it to extract it as well. This will result in a folder containing many, many tiny files.
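
If you would rather do this step in Python, here is a minimal sketch (the folder and file names assume the June 2019 archive):

from zipfile import ZipFile

# Pull one day's zip out of the monthly archive, then unpack its messages.
with ZipFile('201906.zip') as month:
    month.extract('20190601.zip')
with ZipFile('20190601.zip') as day:
    day.extractall('20190601')  # the tiny message files land here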

Each of these sub-sub-files is a single GTFS-RT message. Each message is a snapshot of the state of a slice of the MTA system. It has two important properties: the train lines it covers, and the time at which the snapshot was taken. Both are encoded in the file’s name.

For example, consider the file gtfs_7_20190601_042000.gtfs. This file contains a snapshot of the state of all 7 trains in the MTA system as of 4:20 AM, June 1st, 2019.

Trains which run similar service routes may get “packaged up” into the same message. For example, the file gtfs_ace_20190601_075709.gtfs contains a snapshot of the state of all A, C, and E trains in the MTA system.
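
Since both the feed group and the timestamp are encoded in the filename, they are easy to pull out with a small helper of our own (parse_message_filename below is not part of gtfs_tripify, just a convenience):

from datetime import datetime

def parse_message_filename(filename):
    # Assumes a filename with an explicit feed group, e.g.
    # 'gtfs_ace_20190601_075709.gtfs' -> ('ace', datetime(2019, 6, 1, 7, 57, 9))
    _, feed_group, date, time = filename.rsplit('.', 1)[0].split('_')
    return feed_group, datetime.strptime(date + time, '%Y%m%d%H%M%S')

print(parse_message_filename('gtfs_ace_20190601_075709.gtfs'))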

Some trains are packaged with other train lines but, seemingly for historical reasons, are excluded from the name of the file.

At this time, certain trains are excluded from the dataset entirely, for unknown reasons.

Now that we understand how to get the trains we want, let’s talk about timestamps. The MTA system updates several times per minute; the exact interval and the reliability of the update sequence vary. Each of these updates is timestamped in US Eastern Time.

So, for example, the gtfs_7_20190601_042000.gtfs message we talked about earlier represents a snapshot dating from 4:20 AM sharp on June 1st 2019. The message that immediately follows, gtfs_7_20190601_042015.gtfs, is a snapshot of the system as of 4:20:15 AM on June 1st 2019, i.e. 15 seconds later; and so on.

Choose a train line or set of train lines, and copy the subset of the files covering the time period you are interested in. For the purposes of this demo, I will grab data on every 7 train that ran on June 1st 2019. Paste these files into another folder somewhere on your computer.
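
If you’d rather script this step, here is a sketch of the copy (the folder names are just examples):

import os
import shutil

SRC = '20190601'              # folder containing the extracted messages
DST = 'seven_train_20190601'  # destination folder for our subset
os.makedirs(DST, exist_ok=True)

for filename in os.listdir(SRC):
    if filename.startswith('gtfs_7_20190601'):  # only the 7 train's messages
        shutil.copy(os.path.join(SRC, filename), DST)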

This data is snapshot data in an encoded binary format known as a Protocol buffer. We now need to convert it into tabular data that we can actually analyze. This is a surprisingly tricky process. Luckily, a Python library exists that can do this work for us: gtfs_tripify.
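
You don’t need to decode the binary yourself, but if you’re curious what is inside one of these files, you can peek at it with the official gtfs-realtime-bindings package (pip install gtfs-realtime-bindings):

from google.transit import gtfs_realtime_pb2

feed = gtfs_realtime_pb2.FeedMessage()
with open('gtfs_7_20190601_042000.gtfs', 'rb') as f:
    feed.ParseFromString(f.read())

print(feed.header.timestamp)  # Unix timestamp of this snapshot
print(len(feed.entity))       # number of entities (trip updates etc.) inside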

To begin, install gtfs_tripify using pip from the command line:

pip install gtfs_tripify

Navigate to the folder where you dumped the files you are interested in, and run the following snippet of Python:

import gtfs_tripify as gt
import os

messages = []
# Read every message file in the folder, in timestamp (filename) order.
for filename in sorted(os.listdir('.')):
    if '.py' not in filename:  # skip this script itself
        with open(filename, 'rb') as f:
            messages.append(f.read())

# Parse the message stream into a logbook of train stops, then save it.
logbook, _, _ = gt.logify(messages)
gt.ops.to_csv(logbook, 'logbook.csv')

An easy way to run this is to copy-paste this code into a run.py file and then run python run.py from your command line.

While running, this script will probably print warnings to your terminal about non-fatal errors in the data stream; these are safe to ignore.

This script may take a few tens of minutes to finish running. Once it is done, you will be left with a logbook.csv file on your computer containing train arrival and departure data:

trip_id,route_id,action,minimum_time,maximum_time,stop_id,latest_information_time,unique_trip_id
131750_7..N,7,STOPPED_OR_SKIPPED,1559440299.0,1559440695.0,726N,1559440315,3ac1c948-af61-11e9-909a-8c8590adc94b
131750_7..N,7,STOPPED_OR_SKIPPED,1559440846.0,1559440860.0,725N,1559440860,3ac1c948-af61-11e9-909a-8c8590adc94b
131750_7..N,7,STOPPED_OR_SKIPPED,1559440936.0,1559440950.0,724N,1559440950,3ac1c948-af61-11e9-909a-8c8590adc94b
131750_7..N,7,STOPPED_OR_SKIPPED,1559441016.0,1559441030.0,723N,1559441030,3ac1c948-af61-11e9-909a-8c8590adc94b
131750_7..N,7,STOPPED_OR_SKIPPED,1559441211.0,1559441226.0,721N,1559441226,3ac1c948-af61-11e9-909a-8c8590adc94b
131750_7..N,7,STOPPED_OR_SKIPPED,1559441291.0,1559441306.0,720N,1559441306,3ac1c948-af61-11e9-909a-8c8590adc94b
131750_7..N,7,STOPPED_OR_SKIPPED,1559441411.0,1559441426.0,719N,1559441426,3ac1c948-af61-11e9-909a-8c8590adc94b
131750_7..N,7,STOPPED_OR_SKIPPED,1559441561.0,1559441591.0,718N,1559441591,3ac1c948-af61-11e9-909a-8c8590adc94b
131750_7..N,7,STOPPED_OR_SKIPPED,1559441942.0,1559441956.0,712N,1559441956,3ac1c948-af61-11e9-909a-8c8590adc94b

This dataset has the following schema, in brief:

trip_id: the id assigned to the trip by the MTA (not guaranteed to be unique).
route_id: the route (train line) the trip ran on.
action: what the train did at the stop: STOPPED_AT, STOPPED_OR_SKIPPED, or EN_ROUTE_TO.
minimum_time, maximum_time: Unix timestamps bounding when the action occurred.
stop_id: the GTFS id of the stop, suffixed with the direction of travel (N or S).
latest_information_time: the Unix timestamp of the most recent feed message containing information about this stop.
unique_trip_id: a truly unique trip id assigned by gtfs_tripify.

A snapshot of this data has been attached to this gist for the purposes of demonstration.

At this point you can jump into your favorite data analysis environment and start exploring!
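
For example, here is one way to take a first look with pandas (the derived column below is just an illustration):

import pandas as pd

logbook = pd.read_csv('logbook.csv')

# The time columns are Unix timestamps; convert one to New York local time.
logbook['arrived_by'] = (
    pd.to_datetime(logbook['maximum_time'], unit='s', utc=True)
      .dt.tz_convert('America/New_York')
)

# How many stops did each trip make?
print(logbook.groupby('unique_trip_id').size().describe())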

Building a larger dataset

This is a pretty simple example. Naturally, you may be wondering: can I get more data? The answer is yes!

The key limitation is memory. gtfs_tripify does all of its processing in-memory, so it can only consume as many messages as will fit in your computer’s RAM at once. On my machine for example, I can only process data one day at a time.

To build a dataset that’s larger than what you can fit in memory, construct two logbooks for two contiguous “time slices” of the GTFS-RT stream, then combine them using gt.ops.merge_logbooks.

For example, the following script will build and save to disk an arrival dataset for all 7 trains on both June 1st and June 2nd 2019:

import gtfs_tripify as gt
from zipfile import ZipFile
import os

# Path to the monthly archive downloaded from the MTA.
ARCHIVE_PATH = os.path.expanduser('~/Downloads/201906.zip')

# The monthly archive contains one zip per day; extract the two days we
# want, then unpack the message files inside each of them.
with ZipFile(ARCHIVE_PATH) as month:
    month.extract('20190601.zip')
    month.extract('20190602.zip')
for day in ['20190601.zip', '20190602.zip']:
    with ZipFile(day) as z:
        z.extractall()

# Read every 7 train message, in timestamp (filename) order.
messages = []
for filename in sorted(os.listdir('.')):
    if filename.endswith('.gtfs') and 'gtfs_7_' in filename:
        with open(filename, 'rb') as f:
            messages.append(f.read())

# Build a logbook for each half of the stream, then merge the two.
first_logbook, first_timestamps, _ = gt.logify(messages[:len(messages) // 2])
second_logbook, second_timestamps, _ = gt.logify(messages[len(messages) // 2:])
logbook = gt.ops.merge_logbooks(
    [(first_logbook, first_timestamps), (second_logbook, second_timestamps)]
)
gt.ops.to_csv(logbook, 'logbook.csv')

Conclusion

That concludes this basic introduction to parsing MTA GTFS-RT data with gtfs_tripify. To find out more about how it works, I recommend reading this explanatory blog post on parsing subway data, then heading over to the GitHub repo to learn more. To see the full potential of this data stream in action, check out the NYC Subway Variability Calculator built by the New York Times.