GTFS Feed Aggregator and Query Engine

Transit agencies publish their data on their services in a standard format, the General Transit Feed Specification. It’s a .zip file with .txt files that describe the stops, routes, schedules, agency metadata, and so on. The .txt files are comma-separated data tables with standardized field names, and they are structured like a relational database with primary ID keys and foreign keys. Many applications, like Google Maps Transit Directions, draw on these GTFS feeds to present transit services to the user.

The following python files enable an analyst to aggregate multiple GTFS feeds and then run SQL queries on the aggregate. Larger government agencies (e.g. states or the federal government) could use a tool like this to do data-driven analysis and planning. This project uses python pandas, numpy, sqlite3, and subprocess32. The open source module gtfsdb does a lot of the heavy lifting for me. I also used the open source module haversine to calculate distances from (lat, long) coordinates. I show how it works on all the GTFS files that are available in the state of California, and make some maps in QGIS.

Input

These python files run on a bunch of GTFS feeds in a local directory. I had 66 GTFS feeds (135 MB) for California agencies. Fun facts: All but two of the 31 large agencies (defined as agencies with over 100 vehicles) in California have GTFS feeds. Of the 68 small agencies in the state, those with fewer than 100 vehicles, about half (37) have GTFS feeds. A disproportionate number of these 37 are in the greater San Francisco area.

Inventory Tables

GTFS has six required tables (agency, stops, routes, trips, stop_times, and calendar) and seven optional tables. First, I inventoried the tables that are found in our GTFS feeds with inventory_tables.py. The results are as follows:

TableRequired or OptionalDescriptionHow many CA agencies have this table (N=66)?
agency.txtRequiredOne or more transit agencies that provide the data in this feed.66
stops.txtRequiredIndividual locations where vehicles pick up or drop off passengers.66
routes.txtRequiredTransit routes. A route is a group of trips that are displayed to riders as a single service.66
trips.txtRequiredTrips for each route. A trip is a sequence of two or more stops that occurs at specific time.66
stop_times.txtRequiredTimes that a vehicle arrives at and departs from individual stops for each trip.66
calendar.txtRequiredDates for service IDs using a weekly schedule. Specify when service starts and ends, as well as days of the week where service is available.65
calendar_dates.txtOptionalExceptions for the service IDs defined in the calendar.txt file. If calendar_dates.txt includes ALL dates of service, this file may be specified instead of calendar.txt.65
fare_attributes.txtOptionalFare information for a transit organization's routes.43
fare_rules.txtOptionalRules for applying fare information for a transit organization's routes.41
shapes.txtOptionalRules for drawing lines on a map to represent a transit organization's routes.61
frequencies.txtOptionalHeadway (time between trips) for routes with variable frequency of service.27
transfers.txtOptionalRules for making connections at transfer points between routes.29
feed_info.txtOptionalAdditional information about the feed itself, including publisher, version, and expiration information.33

This inventory gives us a sense of the quality of the feeds. In addition to the files that are supposed to be in a GTFS feed, it also catches and reports back the existence of miscellaneous files that some agencies tend to drop into their feeds. (I didn’t report them above, but for example, it found ‘SFMTA_Transit_Data_License_Agreement.txt’, ‘patterns.txt’, ‘places.txt’, and ‘timepoints.txt’). It tells us how many agencies we’re really looking at when we run queries on any of the optional tables.

Convert to sqlite

Now let’s convert each of the .zip files to a sqlite db with gtfsdb_loop.py. It calls the open source module gtfsdb to create the sqlite db for each feed.

Run SQL Queries

Now we can run SQL queries on the feeds. I set it up so that in any_sql_query_to_df.py I could replace the values of MY_SQL_STATEMENT and COLUMN_NAMES (and, if I wanted the results in a CSV, CSV_OUTPUT). The rest of the code uses pandas to read the sql results from each feed and concatenate them into a single result. This replicates the functionality of having one big database for all the feeds.

So that I could refer to them or rerun them, I saved all my SQL queries in sql_statements_repository.py. Check out these summary stats I could then calculate:

Stops

There are 84,613 unique transit stops in California.

Routes

There are 2,610 unique routes in California, where a route is a set of trips that is presented to the customer as a single service, e.g. the 720 Line. A subset of these routes is tagged with a modal label: 2,209 are bus routes, 21 are light rail, 15 are intercity rail, 11 are ferry routes, 6 are subway, and 3 are cable car.

Trips per day

Considering Monday service as representative of weekday service, there are 235,139 unique transit trips per day running on Mondays in the State of California. The mean number of trips per day per agency is 3,265. The median is 709 trips per day. The maximum, found in the GTFS feed for LA Metro’s bus service, is 32,441 trips per day.

Transfers

Twenty-nine agencies in the state use the ‘transfer’ table to define rules for making connections between routes. A total of 288 transfers are described across these 29 agencies. Of these 162 are transfers that require a 3-minute minimum time to transfer between routes.

Fares

Forty-three agencies use the ‘fare_attributes’ table to describe the fare price, whether the fare is paid on board or pre-paid, and the number of transfers permitted on a given fare. A total of 403 fare classes are described. Of these, 256 are fare classes on Metrolink and BART, which use route and distance-based fares. Excluding BART and Metrolink, the median price for the remaining fare classes is $2.50.

Run simple distance tolerance query

Since everything’s in python, I can do more sophisticated searches that are beyond what SQL alone can do. For example, in apply_geog_tolerance2.py, I identify where two agencies have closely co-located stops. I am looking for stops that are within ~50 feet of one another and have different agency_ids. The open source module haversine calculates crow-flies distance from lat,long. Note: this program would have very long run times if I didn’t use a search rectangle to narrow down the candidates first.

In the maps below, I display these so-called Interagency Stops. I also look for Interagency Stops where there are fewer than 20 trips per day (see apply_geog_tolerance_for_trip_counts.py). At such stops, agencies can coordinate to make sure there are not long transfer wait times at these stops.

I could extend this work by using a spatial database like spatialite or postGIS to do more sophisticated queries.

Map the Results

Finally, I use QGIS to map some of the results of the queries.

This map shows all the transit stops in California (for the agencies that have public GTFS feeds).

Here are the stops in the Bay Area displayed by the number of trips per Monday that serve that stop. Stops with over 1,000 trips per day are in the top 1% in the state.

Here’s a case study of the interagency stops with fewer than 20 trips per day. There are three agencies that serve the Modesto area. There are quite a few stops where passengers can transfer between them, but this analysis identifies that there are only four where there are fewer than 20 trips per day. With that low a trip number, there’s a good chance there could be long wait times at these stops, as well as the potential to reduce wait times using timed transfers.

Here are the stops in the Los Angeles area displayed by the number of trips per Monday that serve that stop.