GTFS Kit 10.1.1 Documentation

Introduction

GTFS Kit is a Python library for analyzing General Transit Feed Specification (GTFS) data in memory without a database. It uses Pandas and GeoPandas to do the heavy lifting.

Authors

  • Alex Raichev, 2019-09

Installation

Install it from PyPI, for example with uv via uv add gtfs_kit.

Examples

See the Jupyter notebook notebooks/examples.ipynb.

Conventions

  • In conformance with GTFS, dates are encoded as YYYYMMDD date strings, and times are encoded as HH:MM:SS time strings with the possibility that HH > 23. Watch out for that possibility, because it has counterintuitive consequences; see e.g. trips.is_active_trip(), which is used in routes.compute_route_stats(), stops.compute_stop_stats(), and miscellany.compute_feed_stats().

  • ‘DataFrame’ and ‘Series’ refer to Pandas DataFrame and Series objects, respectively.
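To illustrate the time convention, here is a minimal sketch in plain Python. It is not part of GTFS Kit, and the helper name service_time_to_datetime is hypothetical. It assumes that times are counted from midnight at the start of the service day and ignores daylight-saving subtleties. Under that assumption, a time string with an hours value of 24 or more lands on the next calendar day.

```python
from datetime import datetime, timedelta


def service_time_to_datetime(datestr: str, timestr: str) -> datetime:
    """Combine a YYYYMMDD service date with an HH:MM:SS time whose HH may exceed 23."""
    day = datetime.strptime(datestr, "%Y%m%d")
    hours, minutes, seconds = (int(part) for part in timestr.split(":"))
    # timedelta accepts hours >= 24 and rolls over into the next calendar day
    return day + timedelta(hours=hours, minutes=minutes, seconds=seconds)


print(service_time_to_datetime("20240105", "25:30:00"))  # 2024-01-06 01:30:00
```

A trip that departs at '25:30:00' on service date 20240105 therefore runs at 01:30 on 6 January, which is why filtering by date and time needs care.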

Module constants

Constants useful across modules.

gtfs_kit.constants.COLORS_SET2 = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3']

Colorbrewer 8-class Set2 colors

gtfs_kit.constants.DIST_UNITS = ['ft', 'mi', 'm', 'km']

Valid distance units

gtfs_kit.constants.DTYPE = {'agency_email': 'string', 'agency_fare_url': 'string', 'agency_id': 'string', 'agency_lang': 'string', 'agency_name': 'string', 'agency_phone': 'string', 'agency_timezone': 'string', 'agency_url': 'string', 'arrival_time': 'string', 'attribution_email': 'string', 'attribution_id': 'string', 'attribution_phone': 'string', 'attribution_url': 'string', 'bikes_allowed': 'Int8', 'block_id': 'string', 'contains_id': 'string', 'currency_type': 'string', 'date': 'string', 'departure_time': 'string', 'destination_id': 'string', 'direction_id': 'Int8', 'drop_off_type': 'Int8', 'end_date': 'string', 'end_time': 'string', 'exact_times': 'Int8', 'exception_type': 'Int8', 'fare_id': 'string', 'feed_end_date': 'string', 'feed_lang': 'string', 'feed_publisher_name': 'string', 'feed_publisher_url': 'string', 'feed_start_date': 'string', 'feed_version': 'string', 'friday': 'Int8', 'from_stop_id': 'string', 'headway_secs': 'Int16', 'is_authority': 'Int8', 'is_operator': 'Int8', 'is_producer': 'Int8', 'location_type': 'Int8', 'min_transfer_time': 'Int16', 'monday': 'Int8', 'organization_name': 'string', 'origin_id': 'string', 'parent_station': 'string', 'payment_method': 'Int8', 'pickup_type': 'Int8', 'price': 'float', 'route_color': 'string', 'route_desc': 'string', 'route_id': 'string', 'route_long_name': 'string', 'route_short_name': 'string', 'route_text_color': 'string', 'route_type': 'Int8', 'route_url': 'string', 'saturday': 'Int8', 'service_id': 'string', 'shape_dist_traveled': 'float', 'shape_id': 'string', 'shape_pt_lat': 'float', 'shape_pt_lon': 'float', 'shape_pt_sequence': 'Int32', 'start_date': 'string', 'start_time': 'string', 'stop_code': 'string', 'stop_desc': 'string', 'stop_headsign': 'string', 'stop_id': 'string', 'stop_lat': 'float', 'stop_lon': 'float', 'stop_name': 'string', 'stop_sequence': 'Int32', 'stop_timezone': 'string', 'stop_url': 'string', 'sunday': 'Int8', 'thursday': 'Int8', 'timepoint': 'Int8', 'to_stop_id': 'string', 
'transfer_duration': 'Int16', 'transfer_type': 'Int8', 'transfers': 'Int8', 'trip_headsign': 'string', 'trip_id': 'string', 'trip_short_name': 'string', 'tuesday': 'Int8', 'wednesday': 'Int8', 'wheelchair_accessible': 'Int8', 'wheelchair_boarding': 'Int8', 'zone_id': 'string'}

Data types for Pandas CSV reads

gtfs_kit.constants.FEED_ATTRS = ['agency', 'attributions', 'calendar', 'calendar_dates', 'fare_attributes', 'fare_rules', 'feed_info', 'frequencies', 'routes', 'shapes', 'stops', 'stop_times', 'trips', 'transfers', 'dist_units']

Primary feed attributes

gtfs_kit.constants.WGS84 = 'EPSG:4326'

WGS84 coordinate reference system for Geopandas

Module helpers

Functions useful across modules.

gtfs_kit.helpers.almost_equal(f: DataFrame, g: DataFrame) bool

Return True if and only if the given DataFrames are equal after sorting their column names, sorting their values, and resetting their indices.

gtfs_kit.helpers.combine_time_series(time_series_dict: dict[str, DataFrame], kind: str, *, split_directions: bool = False) DataFrame

Combine the time series DataFrames in the given dictionary into one time series DataFrame with hierarchical columns.

Parameters:
  • time_series_dict (dictionary) – Has the form string -> time series

  • kind (string) – 'route' or 'stop'

  • split_directions (boolean) – If True, then assume the original time series contains data separated by trip direction; otherwise, assume not. The separation is indicated by a suffix '-0' (direction 0) or '-1' (direction 1) in the route ID or stop ID column values.

Returns:

Columns are hierarchical (multi-index). The top level columns are the keys of the dictionary and the second level columns are 'route_id', if kind == 'route', or 'stop_id', if kind == 'stop'. If split_directions, then the third level column is 'direction_id'; otherwise, there is no 'direction_id' column.

Return type:

DataFrame

gtfs_kit.helpers.datestr_to_date(x: date | str, format_str: str = '%Y%m%d', *, inverse: bool = False) str | date

Given a string x representing a date in the given format, convert it to a datetime.date object and return the result. If inverse, then assume that x is a date object and return its corresponding string in the given format.
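The round trip described above can be sketched with the standard library alone. This is an illustrative reimplementation under the assumption that the conversion is a thin wrapper around strptime and strftime; the name datestr_to_date_sketch is hypothetical and the library's own code may differ.

```python
from datetime import date, datetime


def datestr_to_date_sketch(x, format_str="%Y%m%d", *, inverse=False):
    """Convert a date string to a date object, or the reverse if inverse is True."""
    if inverse:
        # x is assumed to be a datetime.date object here
        return x.strftime(format_str)
    return datetime.strptime(x, format_str).date()


d = datestr_to_date_sketch("20190930")
print(d, datestr_to_date_sketch(d, inverse=True))
```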

gtfs_kit.helpers.downsample(time_series: DataFrame, freq: str) DataFrame

Downsample the given route, stop, or feed time series (outputs of routes.compute_route_time_series(), stops.compute_stop_time_series(), or miscellany.compute_feed_time_series(), respectively) to the given Pandas frequency string (e.g. ‘15Min’). Return the given time series unchanged if the given frequency is shorter than the original frequency.

gtfs_kit.helpers.drop_feature_ids(collection: dict) dict

Given a GeoJSON FeatureCollection, remove the 'id' attribute of each Feature, if it exists.

gtfs_kit.helpers.get_active_trips_df(trip_times: DataFrame) Series

Count the number of trips in trip_times that are active at any given time.

Assume trip_times contains the columns

  • start_time: start time of the trip in seconds past midnight

  • end_time: end time of the trip in seconds past midnight

Return a Series whose index is times from midnight when trips start and end and whose values are the number of active trips for that time.
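One way to picture this count is as a sweep over start and end events. The sketch below uses plain Python and returns a dictionary instead of a Series; the function name count_active_trips and the input format (a list of pairs) are assumptions made for illustration, not the library's interface.

```python
from collections import defaultdict
from itertools import accumulate


def count_active_trips(trip_times):
    """trip_times: iterable of (start_time, end_time) pairs in seconds past midnight."""
    deltas = defaultdict(int)
    for start, end in trip_times:
        deltas[start] += 1  # a trip departs
        deltas[end] -= 1    # a trip arrives
    times = sorted(deltas)
    # The running total of the deltas is the number of active trips from each time on
    counts = accumulate(deltas[t] for t in times)
    return dict(zip(times, counts))


trips = [(0, 600), (300, 900), (600, 1200)]
print(count_active_trips(trips))  # {0: 1, 300: 2, 600: 2, 900: 1, 1200: 0}
```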

gtfs_kit.helpers.get_convert_dist(dist_units_in: str, dist_units_out: str) Callable[[float], float]

Return a function of the form

distance in the units dist_units_in -> distance in the units dist_units_out

Only supports distance units in constants.DIST_UNITS.
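The idea of returning a conversion function can be sketched as follows. This is an illustration only: the name get_convert_dist_sketch, the table of metres per unit, and the choice to raise a ValueError for unsupported units are assumptions, not a description of the library's code.

```python
DIST_UNITS = ["ft", "mi", "m", "km"]

# Length of one unit in metres (international foot and mile)
METRES_PER_UNIT = {"ft": 0.3048, "mi": 1609.344, "m": 1.0, "km": 1000.0}


def get_convert_dist_sketch(dist_units_in, dist_units_out):
    """Return a function mapping a distance in dist_units_in to dist_units_out."""
    for units in (dist_units_in, dist_units_out):
        if units not in METRES_PER_UNIT:
            # Assumed behaviour for this sketch
            raise ValueError(f"Distance units must lie in {DIST_UNITS}")
    factor = METRES_PER_UNIT[dist_units_in] / METRES_PER_UNIT[dist_units_out]
    return lambda x: x * factor


km_to_m = get_convert_dist_sketch("km", "m")
print(km_to_m(2.5))  # 2500.0
```

Computing the factor once and closing over it keeps the returned function cheap to apply to a whole column of distances.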

gtfs_kit.helpers.get_max_runs(x) array

Given a list of numbers, return a NumPy array of pairs (start index, end index + 1) of the runs of max value.

Example:

>>> get_max_runs([7, 1, 2, 7, 7, 1, 2])
array([[0, 1],
       [3, 5]])

Assume x is not empty. Recipe comes from Stack Overflow.
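For readers who want to see the logic without NumPy, here is a plain Python sketch that returns the same pairs as nested lists. The function name max_runs is hypothetical and this is not the Stack Overflow recipe the library uses.

```python
def max_runs(x):
    """Return [start, end) index pairs for each run of the maximum value in x."""
    peak = max(x)
    runs = []
    start = None
    for i, value in enumerate(x):
        if value == peak and start is None:
            start = i                # a run of the max begins
        elif value != peak and start is not None:
            runs.append([start, i])  # the run ended at index i - 1
            start = None
    if start is not None:            # the last run reaches the end of x
        runs.append([start, len(x)])
    return runs


print(max_runs([7, 1, 2, 7, 7, 1, 2]))  # [[0, 1], [3, 5]]
```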

gtfs_kit.helpers.get_peak_indices(times: list, counts: list) array

Given an increasing list of times as seconds past midnight and a list of trip counts at those respective times, return a pair of indices i, j such that times[i] to times[j] is the first longest time period such that for all i <= x < j, counts[x] is the max of counts. Assume times and counts have the same nonzero length.

Examples:

>>> times = [0, 10, 20, 30, 31, 32, 40]
>>> counts = [7, 1, 2, 7, 7, 1, 2]
>>> get_peak_indices(times, counts)
array([0, 1])

>>> counts = [0, 0, 0]
>>> times = [18000, 21600, 28800]
>>> get_peak_indices(times, counts)
array([0, 3])

gtfs_kit.helpers.get_segment_length(linestring: LineString, p: Point, q: Point | None = None) float

Given a Shapely linestring and two Shapely points, project the points onto the linestring, and return the distance along the linestring between the two points. If q is None, then return the distance from the start of the linestring to the projection of p. The distance is measured in the native coordinates of the linestring.

gtfs_kit.helpers.is_metric(dist_units: str) bool

Return True if the given distance units equals ‘m’ or ‘km’; otherwise return False.

gtfs_kit.helpers.is_not_null(df: DataFrame, col_name: str) bool

Return True if the given DataFrame has a column of the given name (string), and there exists at least one non-NaN value in that column; return False otherwise.

gtfs_kit.helpers.longest_subsequence(seq, mode='strictly', order='increasing', key=None, *, index=False)

Return the longest increasing subsequence of seq.

Parameters:
  • seq (sequence object) – Can be any sequence, like str, list, numpy.array.

  • mode ({'strict', 'strictly', 'weak', 'weakly'}, optional) – If set to ‘strict’, the subsequence will contain unique elements. With ‘weak’, an element can be repeated many times. Modes ending in -ly serve as a convenience to use with the order parameter, because longest_subsequence(seq, ‘weakly’, ‘increasing’) reads better. The default is ‘strictly’.

  • order ({'increasing', 'decreasing'}, optional) – By default return the longest increasing subsequence, but it is possible to return the longest decreasing sequence as well.

  • key (function, optional) – Specifies a function of one argument that is used to extract a comparison key from each list element (e.g., str.lower, lambda x: x[0]). The default value is None (compare the elements directly).

  • index (bool, optional) – If set to True, return the indices of the subsequence, otherwise return the elements. Default is False.

Returns:

  • elements (list, optional) – A list of elements of the longest subsequence. Returned by default and when index is set to False.

  • indices (list, optional) – A list of indices pointing to elements in the longest subsequence. Returned when index is set to True.

  • Taken from this Stack Overflow answer.
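As a rough illustration of the default case only (mode 'strictly', order 'increasing', no key, elements returned), the following sketch finds one longest strictly increasing subsequence with binary search and back-pointers. It is a generic textbook approach written for this example, not the Stack Overflow answer itself, and the function name is hypothetical.

```python
from bisect import bisect_left


def longest_increasing_subsequence(seq):
    """Return one longest strictly increasing subsequence of seq as a list."""
    tail_values = []   # tail_values[k]: smallest tail of an increasing run of length k + 1
    tail_indices = []  # index in seq of that tail
    previous = [None] * len(seq)  # back-pointers for reconstruction
    for i, value in enumerate(seq):
        k = bisect_left(tail_values, value)
        if k == len(tail_values):
            tail_values.append(value)
            tail_indices.append(i)
        else:
            tail_values[k] = value
            tail_indices[k] = i
        previous[i] = tail_indices[k - 1] if k > 0 else None
    # Walk the back-pointers from the tail of the longest run
    result = []
    i = tail_indices[-1] if tail_indices else None
    while i is not None:
        result.append(seq[i])
        i = previous[i]
    return result[::-1]


print(longest_increasing_subsequence([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 4, 5, 6]
```

When several subsequences share the maximum length, this sketch returns just one of them.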

gtfs_kit.helpers.make_html(d: dict) str

Convert the given dictionary into an HTML table (string) with two columns: keys of dictionary, values of dictionary.

gtfs_kit.helpers.make_ids(n: int, prefix: str = 'id_')

Return a length n list of unique sequentially labelled strings for use as IDs.

Example:

>>> make_ids(11, prefix="s")
['s00', 's01', 's02', 's03', 's04', 's05', 's06', 's07', 's08', 's09', 's10']

gtfs_kit.helpers.restack_time_series(unstacked_time_series: DataFrame) DataFrame

Given an unstacked stop, route, or feed time series in the form output by the function unstack_time_series(), restack it into its original time series form.

gtfs_kit.helpers.timestr_mod24(timestr: str) int

Given a GTFS HH:MM:SS time string, return a time string in the same format but with the hours taken modulo 24.

gtfs_kit.helpers.timestr_to_seconds(x: date | str, *, inverse: bool = False, mod24: bool = False) int

Given an HH:MM:SS time string x, return the number of seconds past midnight that it represents. In keeping with GTFS standards, the hours entry may be greater than 23. If mod24, then return the number of seconds modulo 24*3600. If inverse, then do the inverse operation. In this case, if mod24 also, then first take the number of seconds modulo 24*3600.
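The interplay of the inverse and mod24 flags is easier to see in code. The sketch below follows the description above in plain Python; the name timestr_to_seconds_sketch is hypothetical, and details such as zero-padding the output are assumptions.

```python
def timestr_to_seconds_sketch(x, *, inverse=False, mod24=False):
    """Convert HH:MM:SS to seconds past midnight, or the reverse if inverse is True."""
    if inverse:
        seconds = int(x)
        if mod24:
            seconds %= 24 * 3600  # wrap first, then format
        hours, remainder = divmod(seconds, 3600)
        minutes, secs = divmod(remainder, 60)
        return f"{hours:02d}:{minutes:02d}:{secs:02d}"
    hours, minutes, secs = (int(part) for part in x.split(":"))
    seconds = hours * 3600 + minutes * 60 + secs
    if mod24:
        seconds %= 24 * 3600
    return seconds


print(timestr_to_seconds_sketch("25:10:05"))              # 90605
print(timestr_to_seconds_sketch("25:10:05", mod24=True))  # 4205
print(timestr_to_seconds_sketch(90605, inverse=True))     # 25:10:05
```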

gtfs_kit.helpers.unstack_time_series(time_series: DataFrame) DataFrame

Given a route, stop, or feed time series of the form output by the functions compute_route_time_series(), compute_stop_time_series(), or compute_feed_time_series(), respectively, unstack it to return a DataFrame with the columns:

  • "datetime"

  • the columns time_series.columns.names

  • "value": value at the datetime and other columns

gtfs_kit.helpers.weekday_to_str(weekday: int | str, *, inverse: bool = False) int | str

Given a weekday number (integer in the range 0, 1, …, 6), return its corresponding weekday name as a lowercase string. Here 0 -> ‘monday’, 1 -> ‘tuesday’, and so on. If inverse, then perform the inverse operation.

Module validators

Module cleaners

Functions about cleaning feeds.

gtfs_kit.cleaners.aggregate_routes(feed: Feed, by: str = 'route_short_name', route_id_prefix: str = 'route_') Feed

Aggregate routes by route short name, say, and assign new route IDs using the given prefix.

More specifically, create new route IDs with the function build_aggregate_routes_dict() and the parameters by and route_id_prefix and update the old route IDs to the new ones in all the relevant Feed tables. Return the resulting Feed.

gtfs_kit.cleaners.aggregate_stops(feed: Feed, by: str = 'stop_code', stop_id_prefix: str = 'stop_') Feed

Aggregate stops by stop code, say, and assign new stop IDs using the given prefix.

More specifically, create new stop IDs with the function build_aggregate_stops_dict() and the parameters by and stop_id_prefix and update the old stop IDs to the new ones in all the relevant Feed tables. Return the resulting Feed.

gtfs_kit.cleaners.build_aggregate_routes_dict(routes: DataFrame, by: str = 'route_short_name', route_id_prefix: str = 'route_') dict[str, str]

Given a DataFrame of routes, group the routes by route short name, say, and assign new route IDs using the given prefix. Return a dictionary of the form <old route ID> -> <new route ID>. Helper function for aggregate_routes().

More specifically, group routes by the by column, and for each group make one new route ID for all the old route IDs in that group based on the given route_id_prefix string and a running count, e.g. 'route_013'.
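The grouping step can be sketched without Pandas, using a list of dictionaries in place of the routes DataFrame. Everything here is illustrative: the function name, the row format, counting from 0, and the zero-padding width are assumptions, not the library's exact behaviour.

```python
def build_aggregate_dict_sketch(rows, by="route_short_name", id_col="route_id", prefix="route_"):
    """Map each old ID to a new ID shared by all rows with the same value of rows[i][by]."""
    groups = {}
    for row in rows:
        groups.setdefault(row[by], []).append(row[id_col])
    width = len(str(len(groups)))  # zero-pad so the new IDs sort nicely (assumed)
    new_id_by_old_id = {}
    for count, old_ids in enumerate(groups.values()):
        new_id = f"{prefix}{count:0{width}d}"
        for old_id in old_ids:
            new_id_by_old_id[old_id] = new_id
    return new_id_by_old_id


routes = [
    {"route_id": "A1", "route_short_name": "10"},
    {"route_id": "B7", "route_short_name": "20"},
    {"route_id": "A2", "route_short_name": "10"},
]
print(build_aggregate_dict_sketch(routes))
```

Here routes A1 and A2 share the short name '10', so they receive the same new ID.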

gtfs_kit.cleaners.build_aggregate_stops_dict(stops: DataFrame, by: str = 'stop_code', stop_id_prefix: str = 'stop_') dict[str, str]

Given a DataFrame of stops, group the stops by stop code, say, and assign new stop IDs using the given prefix. Return a dictionary of the form <old stop ID> -> <new stop ID>. Helper function for aggregate_stops().

More specifically, group stops by the by column, and for each group make one new stop ID for all the old stop IDs in that group based on the given stop_id_prefix string and a running count, e.g. 'stop_013'.

gtfs_kit.cleaners.clean(feed: Feed) Feed

Apply the following functions to the given Feed in order and return the resulting Feed.

  1. clean_ids()

  2. clean_times()

  3. clean_route_short_names()

  4. drop_zombies()

gtfs_kit.cleaners.clean_column_names(df: DataFrame) DataFrame

Strip the whitespace from all column names in the given DataFrame and return the result.

gtfs_kit.cleaners.clean_ids(feed: Feed) Feed

In the given Feed, strip whitespace from all string IDs and then replace every remaining whitespace chunk with an underscore. Return the resulting Feed.
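Applied to a single ID string, the rule above might look like the following sketch. The helper name clean_id is hypothetical, and the library applies the rule to whole ID columns rather than to one string at a time.

```python
import re


def clean_id(value: str) -> str:
    """Strip outer whitespace, then replace each inner whitespace chunk with '_'."""
    return re.sub(r"\s+", "_", value.strip())


print(clean_id("  stop 12\tA "))  # stop_12_A
```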

gtfs_kit.cleaners.clean_route_short_names(feed: Feed) Feed

In feed.routes, assign ‘n/a’ to missing route short names and strip whitespace from route short names. Then disambiguate each route short name that is duplicated by appending ‘-’ and its route ID. Return the resulting Feed.

gtfs_kit.cleaners.clean_times(feed: Feed) Feed

In the given Feed, convert H:MM:SS time strings to HH:MM:SS time strings to make sorting by time work as expected. Return the resulting Feed.
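The reason for padding is that time strings are compared character by character, so '9:05:00' sorts after '10:00:00'. A minimal sketch of the fix on plain strings follows; the helper name pad_timestr is hypothetical.

```python
def pad_timestr(timestr: str) -> str:
    """Left-pad an H:MM:SS time string with a zero so that it reads HH:MM:SS."""
    return timestr.zfill(8)


times = ["9:05:00", "10:00:00", "25:15:00"]
print(sorted(times))                          # ['10:00:00', '25:15:00', '9:05:00']
print(sorted(pad_timestr(t) for t in times))  # ['09:05:00', '10:00:00', '25:15:00']
```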

gtfs_kit.cleaners.drop_invalid_columns(feed: Feed) Feed

Drop all DataFrame columns of the given Feed that are not listed in the GTFS. Return the resulting Feed.

gtfs_kit.cleaners.drop_zombies(feed: Feed) Feed

In the given Feed, do the following in order and return the resulting Feed.

  1. Drop stops of location type 0 or NaN with no stop times.

  2. Remove undefined parent stations from the parent_station column.

  3. Drop trips with no stop times.

  4. Drop shapes with no trips.

  5. Drop routes with no trips.

  6. Drop services with no trips.

gtfs_kit.cleaners.extend_id(feed: Feed, id_col: str, extension: str, *, prefix=True) Feed

Add a prefix (if prefix) or a suffix (otherwise) to all values of column id_col across all tables of this Feed. This can be helpful when preparing to merge multiple GTFS feeds with colliding route IDs, say.

Raises a ValueError if id_col values can’t have strings added to them, e.g. if id_col is ‘direction_id’.

Module calendar

Functions about calendar and calendar_dates.

gtfs_kit.calendar.get_dates(feed: Feed, *, as_date_obj: bool = False) list[str]

Return a list of YYYYMMDD date strings for which the given Feed is valid, which could be the empty list if the Feed has no calendar information.

If as_date_obj, then return datetime.date objects instead.

gtfs_kit.calendar.get_first_week(feed: Feed, *, as_date_obj: bool = False) list[str]

Return a list of YYYYMMDD date strings for the first Monday–Sunday week (or initial segment thereof) for which the given Feed is valid. If the feed has no Mondays, then return the empty list.

If as_date_obj, then return date objects, otherwise return date strings.

gtfs_kit.calendar.get_week(feed: Feed, k: int, *, as_date_obj: bool = False) list[str]

Given a Feed and a positive integer k, return a list of YYYYMMDD date strings corresponding to the kth Monday–Sunday week (or initial segment thereof) for which the Feed is valid. For example, k=1 returns the first Monday–Sunday week (or initial segment thereof). If the Feed does not have k Mondays, then return the empty list.
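The week selection can be sketched on a plain list of dates. This version assumes the dates are sorted, consecutive calendar days, which a real feed need not satisfy; the name get_week_sketch is hypothetical.

```python
from datetime import date, datetime, timedelta


def get_week_sketch(dates, k):
    """dates: sorted list of consecutive YYYYMMDD strings; k: positive integer."""
    # Positions of the Mondays among the dates (weekday() == 0 means Monday)
    monday_indices = [
        i for i, d in enumerate(dates)
        if datetime.strptime(d, "%Y%m%d").weekday() == 0
    ]
    if k > len(monday_indices):
        return []
    start = monday_indices[k - 1]
    # Slicing yields a shorter initial segment if the dates run out before Sunday
    return dates[start:start + 7]


# Wednesday 3 January 2024 through Tuesday 16 January 2024
dates = [(date(2024, 1, 3) + timedelta(days=i)).strftime("%Y%m%d") for i in range(14)]
print(get_week_sketch(dates, 2))  # ['20240115', '20240116']
```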

If as_date_obj, then return datetime.date objects instead.

gtfs_kit.calendar.subset_dates(feed: Feed, dates: list[str]) list[str]

Given a Feed and a list of YYYYMMDD date strings, return the sublist of dates that lie in the Feed’s dates (the output of feed.get_dates()).

Module routes

Functions about routes.

gtfs_kit.routes.build_route_timetable(feed: Feed, route_id: str, dates: list[str]) pd.DataFrame

Return a timetable for the given route and dates (YYYYMMDD date strings).

Return a DataFrame whose columns are all those in feed.trips plus those in feed.stop_times plus 'date'. The trip IDs are restricted to the given route ID. The result is sorted first by date and then by grouping by trip ID and sorting the groups by their first departure time.

Skip dates outside of the Feed’s dates.

If there is no route activity on the given dates, then return an empty DataFrame.

gtfs_kit.routes.build_zero_route_time_series(feed: Feed, date_label: str = '20010101', freq: str = '5Min', *, split_directions: bool = False) pd.DataFrame

Return a route time series with the same index and hierarchical columns as output by compute_route_time_series_0(), but fill it full of zero values.

gtfs_kit.routes.compute_route_stats(feed: Feed, trip_stats_subset: pd.DataFrame, dates: list[str], headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) pd.DataFrame

Compute route stats for all the trips that lie in the given subset of trip stats (of the form output by the function trips.compute_trip_stats()) and that start on the given dates (YYYYMMDD date strings).

If split_directions, then separate the stats by trip direction (0 or 1). Use the headway start and end times to specify the time period for computing headway stats.

Return a DataFrame with the columns

  • 'date'

  • the columns listed in compute_route_stats_0()

Exclude dates with no active trips, which could yield the empty DataFrame.

Notes

  • The route stats for date d contain stats for trips that start on date d only and ignore trips that start on date d-1 and end on date d

  • Raise a ValueError if split_directions and no non-NaN direction ID values present

gtfs_kit.routes.compute_route_stats_0(trip_stats_subset: DataFrame, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) DataFrame

Compute stats for the given subset of trip stats (of the form output by the function trips.compute_trip_stats()).

Ignore trips with zero duration, because they are defunct.

If split_directions, then separate the stats by trip direction (0 or 1). Use the headway start and end times to specify the time period for computing headway stats.

Return a DataFrame with the columns

  • 'route_id'

  • 'route_short_name'

  • 'route_type'

  • 'direction_id'

  • 'num_trips': number of trips on the route in the subset

  • 'num_trip_starts': number of trips on the route with nonnull start times

  • 'num_trip_ends': number of trips on the route with nonnull end times that end before 23:59:59

  • 'num_stop_patterns': number of stop patterns across trips

  • 'is_loop': 1 if at least one of the trips on the route has its is_loop field equal to 1; 0 otherwise

  • 'is_bidirectional': 1 if the route has trips in both directions; 0 otherwise

  • 'start_time': start time of the earliest trip on the route

  • 'end_time': end time of latest trip on the route

  • 'max_headway': maximum of the durations (in minutes) between trip starts on the route between headway_start_time and headway_end_time on the given dates

  • 'min_headway': minimum of the durations (in minutes) mentioned above

  • 'mean_headway': mean of the durations (in minutes) mentioned above

  • 'peak_num_trips': maximum number of simultaneous trips in service (for the given direction, or for both directions when split_directions==False)

  • 'peak_start_time': start time of first longest period during which the peak number of trips occurs

  • 'peak_end_time': end time of first longest period during which the peak number of trips occurs

  • 'service_duration': total of the duration of each trip on the route in the given subset of trips; measured in hours

  • 'service_distance': total of the distance traveled by each trip on the route in the given subset of trips; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None

  • 'service_speed': service_distance/service_duration

  • 'mean_trip_distance': service_distance/num_trips

  • 'mean_trip_duration': service_duration/num_trips

If not split_directions, then remove the direction_id column and compute each route’s stats, except for headways, using its trips running in both directions. In this case, (1) compute max headway by taking the max of the max headways in both directions; (2) compute mean headway by taking the weighted mean of the mean headways in both directions.

If trip_stats_subset is empty, return an empty DataFrame.

Raise a ValueError if split_directions and no non-NaN direction ID values present.
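To make the headway columns of compute_route_stats_0() concrete, here is a small sketch for one route in one direction. It assumes zero-padded HH:MM:SS start times and treats both ends of the headway window as inclusive; the function name and these edge-case choices are assumptions, not the library's exact rules.

```python
def headway_stats(start_times, window_start="07:00:00", window_end="19:00:00"):
    """Return max, min, and mean headway in minutes, or None if fewer than two starts."""

    def to_minutes(timestr):
        hours, minutes, seconds = (int(part) for part in timestr.split(":"))
        return hours * 60 + minutes + seconds / 60

    lo, hi = to_minutes(window_start), to_minutes(window_end)
    starts = sorted(m for m in map(to_minutes, start_times) if lo <= m <= hi)
    # Headways are the gaps between consecutive trip starts inside the window
    headways = [b - a for a, b in zip(starts, starts[1:])]
    if not headways:
        return None
    return {
        "max_headway": max(headways),
        "min_headway": min(headways),
        "mean_headway": sum(headways) / len(headways),
    }


starts = ["06:50:00", "07:00:00", "07:10:00", "07:30:00", "08:00:00", "19:30:00"]
print(headway_stats(starts))
```

Only the four starts between 07:00 and 19:00 count here, giving headways of 10, 20, and 30 minutes.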

gtfs_kit.routes.compute_route_time_series(feed: Feed, trip_stats_subset: pd.DataFrame, dates: list[str], freq: str = '5Min', *, split_directions: bool = False) pd.DataFrame

Compute route stats in time series form for the trips that lie in the trip stats subset (of the form output by the function trips.compute_trip_stats()) and that start on the given dates (YYYYMMDD date strings).

If split_directions, then separate each route’s stats by trip direction. Specify the time series frequency with a Pandas frequency string, e.g. '5Min'; max frequency is one minute (‘Min’).

Return a DataFrame of the same format output by the function compute_route_time_series_0() but with multiple dates.

Exclude dates that lie outside of the Feed’s date range. If all dates lie outside the Feed’s date range, then return an empty DataFrame.

gtfs_kit.routes.compute_route_time_series_0(trip_stats_subset: DataFrame, date_label: str = '20010101', freq: str = '5Min', *, split_directions: bool = False) DataFrame

Compute stats in a 24-hour time series form for the given subset of trips (of the form output by the function trips.compute_trip_stats()).

If split_directions, then separate each route’s stats by trip direction. Set the time series frequency according to the given frequency string; max frequency is one minute (‘Min’). Use the given YYYYMMDD date label as the date in the time series index.

Return a DataFrame time series version of the following route stats for each route.

  • num_trips: number of trips in service on the route at any time within the time bin

  • num_trip_starts: number of trips that start within the time bin

  • num_trip_ends: number of trips that end within the time bin, ignoring trips that end past midnight

  • service_distance: sum of the service distance accrued during the time bin across all trips on the route; measured in kilometers if feed.dist_units is metric; otherwise measured in miles

  • service_duration: sum of the service duration accrued during the time bin across all trips on the route; measured in hours

  • service_speed: service_distance/service_duration for the route

The columns are hierarchical (multi-indexed) with

  • top level: name is 'indicator'; values are 'num_trip_starts', 'num_trip_ends', 'num_trips', 'service_distance', 'service_duration', and 'service_speed'

  • middle level: name is 'route_id'; values are the active routes

  • bottom level: name is 'direction_id'; values are 0s and 1s

If not split_directions, then don’t include the bottom level.

The time series has a timestamp index for a 24-hour period sampled at the given frequency. The maximum allowable frequency is 1 minute. If trip_stats_subset is empty, then return an empty DataFrame with the columns 'num_trip_starts', 'num_trip_ends', 'num_trips', 'service_distance', 'service_duration', and 'service_speed'.

Notes

  • The time series is computed at a one-minute frequency, then resampled at the end to the given frequency

  • Trips that lack start or end times are ignored, so the aggregate num_trips across the day could be less than the num_trips column of compute_route_stats_0()

  • All trip departure times are taken modulo 24 hours. So routes with trips that end past 23:59:59 will have all their stats wrap around to the early morning of the time series, except for their num_trip_ends indicator. Trip endings past 23:59:59 are not binned so that resampling the num_trips indicator works efficiently.

  • Note that the total number of trips for two consecutive time bins t1 < t2 is the sum of the number of trips in bin t2 plus the number of trip endings in bin t1. Thus we can downsample the num_trips indicator by keeping track of only one extra count, num_trip_ends, and can avoid recording individual trip IDs.

  • All other indicators are downsampled by summing.

  • Raise a ValueError if split_directions and no non-NaN direction ID values present
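The note on downsampling num_trips can be checked on a toy example. The sketch below is not the library's code and fixes its own conventions as assumptions: a trip given as (start, end) occupies minutes start through end - 1, bins are half-open minute ranges, and a trip counts as ending in the bin that contains its last occupied minute.

```python
def bin_counts(trips, bins):
    """Return (num_trips, num_trip_ends) lists with one entry per (lo, hi) bin."""
    # Trips in service at any time within each bin
    num_trips = [sum(1 for s, e in trips if s < hi and e > lo) for lo, hi in bins]
    # Trips whose last occupied minute falls within each bin
    num_trip_ends = [sum(1 for s, e in trips if lo <= e - 1 < hi) for lo, hi in bins]
    return num_trips, num_trip_ends


trips = [(0, 4), (2, 7), (5, 9), (8, 12)]
num_trips, num_trip_ends = bin_counts(trips, [(0, 5), (5, 10)])

# Merge the two bins: trips in the later bin plus trips that ended in the earlier bin
merged = num_trips[1] + num_trip_ends[0]
direct, _ = bin_counts(trips, [(0, 10)])
print(num_trips, merged, direct[0])  # [2, 3] 4 4
```

Simply adding the two num_trips values would give 5, because the trip (2, 7) is in service in both bins and would be counted twice.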

gtfs_kit.routes.get_routes(feed: Feed, date: str | None = None, time: str | None = None, *, as_gdf: bool = False, use_utm: bool = False, split_directions: bool = False) pd.DataFrame

Return feed.routes or a subset thereof. If a YYYYMMDD date string is given, then restrict routes to only those active on the date. If a HH:MM:SS time string is given, possibly with HH > 23, then restrict routes to only those active during the time.

If as_gdf and feed.shapes is not None, then return a GeoDataFrame with all the columns of feed.routes plus a geometry column of (Multi)LineStrings, each of which represents the corresponding route’s union of trip shapes. The GeoDataFrame will have a local UTM CRS if use_utm; otherwise it will have CRS WGS84. If split_directions and as_gdf, then add the column direction_id and split each route into the union of its direction 0 shapes and the union of its direction 1 shapes. If as_gdf and feed.shapes is None, then raise a ValueError.

gtfs_kit.routes.map_routes(feed: Feed, route_ids: Iterable[str] | None = None, route_short_names: Iterable[str] | None = None, color_palette: Iterable[str] = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3'], *, show_stops: bool = False)

Return a Folium map showing the given routes and (optionally) their stops. At least one of route_ids and route_short_names must be given. If both are given, then combine the two into a single set of routes. If any of the given route IDs are not found in the feed, then raise a ValueError.

gtfs_kit.routes.routes_to_geojson(feed: Feed, route_ids: Iterable[str | None] = None, *, split_directions: bool = False, include_stops: bool = False) dict

Return a GeoJSON FeatureCollection of MultiLineString features representing this Feed’s routes. The coordinates reference system is the default one for GeoJSON, namely WGS84.

If include_stops, then include the route stops as Point features. If an iterable of route IDs is given, then subset to those routes. If the subset is empty, then return a FeatureCollection with an empty list of features. If the Feed has no shapes, then raise a ValueError. If any of the given route IDs are not found in the feed, then raise a ValueError.

Module shapes

Functions about shapes.

gtfs_kit.shapes.append_dist_to_shapes(feed: Feed) Feed

Calculate and append the optional shape_dist_traveled field in feed.shapes in terms of the distance units feed.dist_units. Return the resulting Feed.

As a benchmark, using this function on this Portland feed produces a shape_dist_traveled column that differs by at most 0.016 km in absolute value from the original values.

gtfs_kit.shapes.build_geometry_by_shape(feed: Feed, shape_ids: Iterable[str] | None = None, *, use_utm: bool = False) dict

Return a dictionary of the form <shape ID> -> <Shapely LineString representing shape>. If the Feed has no shapes, then return the empty dictionary. If use_utm, then use local UTM coordinates; otherwise, use WGS84 coordinates.

gtfs_kit.shapes.geometrize_shapes(shapes: DataFrame, *, use_utm: bool = False) GeoDataFrame

Given a GTFS shapes DataFrame, convert it to a GeoDataFrame of LineStrings and return the result, which will no longer have the columns 'shape_pt_sequence', 'shape_pt_lon', 'shape_pt_lat', and 'shape_dist_traveled'.

If use_utm, then use local UTM coordinates for the geometries.

gtfs_kit.shapes.get_shapes(feed: Feed, *, as_gdf: bool = False, use_utm: bool = False) gpd.DataFrame | None

Get the shapes DataFrame for the given feed, which could be None. If as_gdf, then return it as a GeoDataFrame with a ‘geometry’ column of linestrings and no ‘shape_pt_sequence’, ‘shape_pt_lon’, ‘shape_pt_lat’, ‘shape_dist_traveled’ columns. The GeoDataFrame will have a UTM CRS if use_utm; otherwise it will have a WGS84 CRS.

gtfs_kit.shapes.get_shapes_intersecting_geometry(feed: Feed, geometry: sg.base.BaseGeometry, shapes_g: gpd.GeoDataFrame | None = None, *, as_gdf: bool = False) pd.DataFrame | None

If the Feed has no shapes, then return None. Otherwise, return the subset of feed.shapes that contains all shapes that intersect the given Shapely WGS84 geometry, e.g. a Polygon or LineString.

If as_gdf, then return the shapes as a GeoDataFrame. Specifying shapes_g will skip the first step of the algorithm, namely, geometrizing feed.shapes.

gtfs_kit.shapes.shapes_to_geojson(feed: Feed, shape_ids: Iterable[str] | None = None) dict

Return a GeoJSON FeatureCollection of LineString features representing feed.shapes. If the Feed has no shapes, then the features will be an empty list. The coordinates reference system is the default one for GeoJSON, namely WGS84.

If an iterable of shape IDs is given, then subset to those shapes. If the subset is empty, then return a FeatureCollection with an empty list of features.

gtfs_kit.shapes.ungeometrize_shapes(shapes_g: GeoDataFrame) DataFrame

The inverse of geometrize_shapes().

If shapes_g is in UTM coordinates (has a UTM CRS property), then convert those UTM coordinates back to WGS84 coordinates, which is the standard for a GTFS shapes table.

Module stop_times

Functions about stop times.

gtfs_kit.stop_times.append_dist_to_stop_times(feed: Feed) Feed

Calculate and append the optional shape_dist_traveled column in feed.stop_times in terms of the distance units feed.dist_units. Trips without shapes will have NaN distances. Return the resulting Feed. Uses feed.shapes, so if that is missing, then return the original feed.

This does not always give accurate results. The algorithm works as follows. Compute the shape_dist_traveled field by using Shapely to measure the distance of a stop along its trip LineString. If for a given trip this process produces a non-monotonically increasing, hence incorrect, list of (cumulative) distances, then fall back to estimating the distances as follows.

Set the first distance to 0, the last to the length of the trip shape, and leave the remaining ones computed above. Choose the longest increasing subsequence of that new set of distances and use them and their corresponding departure times to linearly interpolate the rest of the distances.
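The fallback estimate can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the library's code: the function names are hypothetical, times are given as seconds, the subsequence is strictly increasing, and stops whose times fall outside the retained points are clamped to the nearest retained distance.

```python
def longest_increasing_subsequence(values):
    """Return the indices of one longest strictly increasing subsequence."""
    n = len(values)
    if n == 0:
        return []
    best_len = [1] * n
    prev = [-1] * n
    for i in range(n):
        for j in range(i):
            if values[j] < values[i] and best_len[j] + 1 > best_len[i]:
                best_len[i] = best_len[j] + 1
                prev[i] = j
    i = max(range(n), key=best_len.__getitem__)
    indices = []
    while i != -1:
        indices.append(i)
        i = prev[i]
    return indices[::-1]


def interpolate(t, known):
    """Linearly interpolate a distance at time t from (time, distance) pairs."""
    if t <= known[0][0]:
        return known[0][1]
    if t >= known[-1][0]:
        return known[-1][1]
    for (t0, d0), (t1, d1) in zip(known, known[1:]):
        if t0 <= t <= t1:
            if t1 == t0:
                return d0
            return d0 + (d1 - d0) * (t - t0) / (t1 - t0)


def repair_distances(dists, times, shape_length):
    """Replace out-of-order cumulative distances with interpolated ones."""
    d = [float(x) for x in dists]
    # Pin the endpoints: first stop at 0, last stop at the shape length
    d[0], d[-1] = 0.0, float(shape_length)
    keep = longest_increasing_subsequence(d)
    kept = set(keep)
    known = [(times[i], d[i]) for i in keep]
    return [d[i] if i in kept else interpolate(t, known)
            for i, t in enumerate(times)]


# The third distance (5.0) breaks monotonicity and gets re-estimated
repaired = repair_distances(
    dists=[0.0, 1.0, 5.0, 2.0, 4.0],
    times=[0, 60, 120, 180, 240],
    shape_length=4.0,
)
```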

gtfs_kit.stop_times.get_start_and_end_times(feed: Feed, date: str | None = None) list[str]

Return the first departure time and last arrival time (HH:MM:SS time strings) listed in feed.stop_times, respectively. Restrict to the given date (YYYYMMDD string) if specified.

gtfs_kit.stop_times.get_stop_times(feed: Feed, date: str | None = None) pd.DataFrame

Return feed.stop_times. If a date (YYYYMMDD date string) is given, then subset the result to only those stop times with trips active on the date.

gtfs_kit.stop_times.stop_times_to_geojson(feed: Feed, trip_ids: Iterable[str | None] = None) dict

Return a GeoJSON FeatureCollection of Point features representing all the trip-stop pairs in feed.stop_times. The coordinates reference system is the default one for GeoJSON, namely WGS84.

For every trip, drop duplicate stop IDs within that trip. In particular, a looping trip will lack its final stop.

If an iterable of trip IDs is given, then subset to those trips. If some of the given trip IDs are not found in the feed, then raise a ValueError.
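The effect of dropping duplicate stop IDs on a looping trip can be seen in a small sketch. The helper name is hypothetical, and the example assumes that the first occurrence of each stop ID is the one kept, which is what the remark about the missing final stop suggests.

```python
def drop_duplicate_stops(stop_ids):
    """Keep the first occurrence of each stop ID, preserving order."""
    return list(dict.fromkeys(stop_ids))


# A loop that starts and ends at stop "A"
loop_trip = ["A", "B", "C", "A"]
deduped = drop_duplicate_stops(loop_trip)
```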

Module stops

Functions about stops.

gtfs_kit.stops.STOP_STYLE = {'color': '#fc8d62', 'fill': 'true', 'fillOpacity': 0.75, 'radius': 8, 'weight': 1}

Leaflet circleMarker parameters for mapping stops

gtfs_kit.stops.build_geometry_by_stop(feed: Feed, stop_ids: Iterable[str] | None = None, *, use_utm: bool = False) dict

Return a dictionary of the form <stop ID> -> <Shapely Point representing stop>.

gtfs_kit.stops.build_stop_timetable(feed: Feed, stop_id: str, dates: list[str]) pd.DataFrame

Return a DataFrame containing the timetable for the given stop ID and dates (YYYYMMDD date strings).

Return a DataFrame whose columns are all those in feed.trips plus those in feed.stop_times plus 'date', and the stop IDs are restricted to the given stop ID. The result is sorted by date then departure time.

gtfs_kit.stops.build_zero_stop_time_series(feed: Feed, date_label: str = '20010101', freq: str = '5Min', *, split_directions: bool = False) pd.DataFrame

Return a stop time series with the same index and hierarchical columns as output by the function compute_stop_time_series_0(), but fill it full of zero values.

gtfs_kit.stops.compute_stop_activity(feed: Feed, dates: list[str]) pd.DataFrame

Mark stops as active or inactive on the given dates (YYYYMMDD date strings). A stop is active on a given date if some trip that starts on the date visits the stop (possibly after midnight).

Return a DataFrame with the columns

  • stop_id

  • dates[0]: 1 if the stop has at least one trip visiting it on dates[0]; 0 otherwise

  • dates[1]: 1 if the stop has at least one trip visiting it on dates[1]; 0 otherwise

  • etc.

  • dates[-1]: 1 if the stop has at least one trip visiting it on dates[-1]; 0 otherwise

If all dates lie outside the Feed period, then return an empty DataFrame.

gtfs_kit.stops.compute_stop_stats(feed: Feed, dates: list[str], stop_ids: list[str | None] = None, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) pd.DataFrame

Compute stats for all stops for the given dates (YYYYMMDD date strings). Optionally, restrict to the stop IDs given.

If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops. Use the headway start and end times to specify the time period for computing headway stats.

Return a DataFrame with the columns

  • 'date'

  • 'stop_id'

  • 'direction_id': present if and only if split_directions

  • 'num_routes': number of routes visiting the stop (in the given direction) on the date

  • 'num_trips': number of trips visiting the stop (in the given direction) on the date

  • 'max_headway': maximum of the durations (in minutes) between trip departures at the stop between headway_start_time and headway_end_time on the date

  • 'min_headway': minimum of the durations (in minutes) mentioned above

  • 'mean_headway': mean of the durations (in minutes) mentioned above

  • 'start_time': earliest departure time of a trip from this stop on the date

  • 'end_time': latest departure time of a trip from this stop on the date

Exclude dates with no active stops, which could yield the empty DataFrame.
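The headway columns can be illustrated with a small standard-library sketch. The function names are hypothetical, and the treatment of the window endpoints (inclusive here) is an assumption; the library may handle edge cases differently.

```python
def time_to_seconds(time_str):
    """Convert an HH:MM:SS string (HH may exceed 23) to seconds."""
    h, m, s = (int(x) for x in time_str.split(":"))
    return 3600 * h + 60 * m + s


def headway_stats(departure_times, start="07:00:00", end="19:00:00"):
    """Return max, min, and mean headway in minutes within the window."""
    lo, hi = time_to_seconds(start), time_to_seconds(end)
    secs = sorted(
        t for t in map(time_to_seconds, departure_times) if lo <= t <= hi
    )
    # Durations between consecutive departures, in minutes
    gaps = [(b - a) / 60 for a, b in zip(secs, secs[1:])]
    if not gaps:
        return None
    return {
        "max_headway": max(gaps),
        "min_headway": min(gaps),
        "mean_headway": sum(gaps) / len(gaps),
    }


stats = headway_stats(
    ["06:50:00", "07:00:00", "07:10:00", "07:40:00", "19:30:00"]
)
```

Only the three departures inside the 07:00:00 to 19:00:00 window contribute, giving gaps of 10 and 30 minutes.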

gtfs_kit.stops.compute_stop_stats_0(stop_times_subset: DataFrame, trip_subset: DataFrame, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) DataFrame

Given a subset of a stop times DataFrame and a subset of a trips DataFrame, return a DataFrame that provides summary stats about the stops in the inner join of the two DataFrames.

If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops. Use the headway start and end times to specify the time period for computing headway stats.

Return a DataFrame with the columns

  • stop_id

  • direction_id: present if and only if split_directions

  • num_routes: number of routes visiting the stop (in the given direction)

  • num_trips: number of trips visiting the stop (in the given direction)

  • max_headway: maximum of the durations (in minutes) between trip departures at the stop between headway_start_time and headway_end_time

  • min_headway: minimum of the durations (in minutes) mentioned above

  • mean_headway: mean of the durations (in minutes) mentioned above

  • start_time: earliest departure time of a trip from this stop

  • end_time: latest departure time of a trip from this stop

Notes

  • If trip_subset is empty, then return an empty DataFrame.

  • Raise a ValueError if split_directions and no non-NaN direction ID values present.

gtfs_kit.stops.compute_stop_time_series(feed: Feed, dates: list[str], stop_ids: list[str | None] = None, freq: str = '5Min', *, split_directions: bool = False) pd.DataFrame

Compute time series for the stops on the given dates (YYYYMMDD date strings) at the given frequency (Pandas frequency string, e.g. '5Min'; max frequency is one minute) and return the result as a DataFrame of the same form as output by the function compute_stop_time_series_0(). Optionally restrict to stops in the given list of stop IDs.

If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops.

Return a time series DataFrame with a timestamp index across the given dates sampled at the given frequency.

The columns are the same as in the output of the function compute_stop_time_series_0().

Exclude dates that lie outside of the Feed’s date range. If all dates lie outside the Feed’s date range, then return an empty DataFrame.

Notes

  • See the notes for the function compute_stop_time_series_0()

  • Raise a ValueError if split_directions and no non-NaN direction ID values present

gtfs_kit.stops.compute_stop_time_series_0(stop_times_subset: DataFrame, trip_subset: DataFrame, freq: str = '5Min', date_label: str = '20010101', *, split_directions: bool = False) DataFrame

Given a subset of a stop times DataFrame and a subset of a trips DataFrame, return a DataFrame that provides a summary time series about the stops in the inner join of the two DataFrames. If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops. Use the given Pandas frequency string to specify the frequency of the time series, e.g. '5Min'; max frequency is one minute (‘Min’). Use the given date label (YYYYMMDD date string) as the date in the time series index.

Return a time series DataFrame with a timestamp index for a 24-hour period sampled at the given frequency. The only indicator variable for each stop is

  • num_trips: the number of trips that visit the stop and have a nonnull departure time from the stop

The maximum allowable frequency is 1 minute.

The columns are hierarchical (multi-indexed) with

  • top level: name = ‘indicator’, values = [‘num_trips’]

  • middle level: name = ‘stop_id’, values = the active stop IDs

  • bottom level: name = ‘direction_id’, values = 0s and 1s

If not split_directions, then don’t include the bottom level.

Notes

  • The time series is computed at a one-minute frequency, then resampled at the end to the given frequency

  • Stop times with null departure times are ignored, so the aggregate of num_trips across the day could be less than the num_trips column in compute_stop_stats_0()

  • All trip departure times are taken modulo 24 hours, so routes with trips that end past 23:59:59 will have all their stats wrap around to the early morning of the time series.

  • ‘num_trips’ should be resampled with how=np.sum

  • If trip_subset is empty, then return an empty DataFrame

  • Raise a ValueError if split_directions and no non-NaN direction ID values present
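The notes above on one-minute binning, resampling by sum, ignored null departure times, and the modulo-24-hour wraparound can be sketched for a single stop as follows. The function name and the list-based output are assumptions for illustration; the library returns a Pandas time series instead.

```python
def num_trips_series(departure_times, freq_minutes=5):
    """Count departures per bin over a 24-hour day for one stop."""
    per_minute = [0] * 1440
    for t in departure_times:
        if t is None:  # stop times with null departure times are ignored
            continue
        h, m, s = (int(x) for x in t.split(":"))
        # Take the time modulo 24 hours, so 24:03:00 lands at 00:03
        minute = ((3600 * h + 60 * m + s) // 60) % 1440
        per_minute[minute] += 1
    # Resample the one-minute counts to the requested frequency by summing
    return [
        sum(per_minute[i:i + freq_minutes])
        for i in range(0, 1440, freq_minutes)
    ]


series = num_trips_series(
    ["23:58:00", "24:03:00", None, "00:04:30", "07:01:00"]
)
```

Here the 24:03:00 departure wraps into the first five-minute bin alongside the 00:04:30 departure.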

gtfs_kit.stops.geometrize_stops(stops: DataFrame, *, use_utm: bool = False) GeoDataFrame

Given a stops DataFrame, convert it to a GeoPandas GeoDataFrame of Points and return the result, which will no longer have the columns 'stop_lon' and 'stop_lat'.

gtfs_kit.stops.get_stops(feed: Feed, date: str | None = None, trip_ids: Iterable[str] | None = None, route_ids: Iterable[str] | None = None, *, in_stations: bool = False, as_gdf: bool = False, use_utm: bool = False) pd.DataFrame

Return feed.stops. If a YYYYMMDD date string is given, then subset to stops active (visited by trips) on that date. If trip IDs are given, then subset further to stops visited by those trips. If route IDs are given, then ignore the trip IDs and subset further to stops visited by those routes. If in_stations, then subset further to stops in stations if station data is available. If as_gdf, then return the result as a GeoDataFrame with a ‘geometry’ column of points instead of ‘stop_lat’ and ‘stop_lon’ columns. The GeoDataFrame will have a UTM CRS if use_utm and a WGS84 CRS otherwise.

gtfs_kit.stops.get_stops_in_area(feed: Feed, area: gpd.GeoDataFrame) pd.DataFrame

Return the subset of feed.stops that contains all stops that lie within the given GeoDataFrame of polygons.

gtfs_kit.stops.map_stops(feed: Feed, stop_ids: Iterable[str], stop_style: dict = {'color': '#fc8d62', 'fill': 'true', 'fillOpacity': 0.75, 'radius': 8, 'weight': 1})

Return a Folium map showing the given stops of this Feed. If some of the given stop IDs are not found in the feed, then raise a ValueError.

gtfs_kit.stops.stops_to_geojson(feed: Feed, stop_ids: Iterable[str | None] = None) dict

Return a GeoJSON FeatureCollection of Point features representing all the stops in feed.stops. The coordinates reference system is the default one for GeoJSON, namely WGS84.

If an iterable of stop IDs is given, then subset to those stops. If some of the given stop IDs are not found in the feed, then raise a ValueError.

gtfs_kit.stops.ungeometrize_stops(stops_g: GeoDataFrame) DataFrame

The inverse of geometrize_stops().

If stops_g is in UTM coordinates (has a UTM CRS property), then convert those UTM coordinates back to WGS84 coordinates, which is the standard for a GTFS stops table.

Module trips

Functions about trips.

gtfs_kit.trips.compute_busiest_date(feed: Feed, dates: list[str]) str

Given a list of dates (YYYYMMDD date strings), return the first date that has the maximum number of active trips.
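The tie-breaking rule ("the first date that has the maximum") can be shown with a short sketch. The helper name and the precomputed counts dictionary are assumptions for illustration; in the library the counts come from the feed itself.

```python
def busiest_date(num_active_trips_by_date, dates):
    """Return the first date in dates with the most active trips."""
    # Python's max() returns the first maximal item it encounters
    return max(dates, key=lambda d: num_active_trips_by_date.get(d, 0))


counts = {"20240101": 10, "20240102": 25, "20240103": 25}
best = busiest_date(counts, ["20240101", "20240102", "20240103"])
```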

gtfs_kit.trips.compute_trip_activity(feed: Feed, dates: list[str]) pd.DataFrame

Mark trips as active or inactive on the given dates (YYYYMMDD date strings). Return a table with the columns

  • 'trip_id'

  • dates[0]: 1 if the trip is active on dates[0]; 0 otherwise

  • dates[1]: 1 if the trip is active on dates[1]; 0 otherwise

  • etc.

  • dates[-1]: 1 if the trip is active on dates[-1]; 0 otherwise

If dates is None or the empty list, then return an empty DataFrame.

gtfs_kit.trips.compute_trip_stats(feed: Feed, route_ids: list[str | None] = None, *, compute_dist_from_shapes: bool = False) pd.DataFrame

Return a DataFrame with the following columns:

  • 'trip_id'

  • 'route_id'

  • 'route_short_name'

  • 'route_type'

  • 'direction_id': NaN if missing from feed

  • 'shape_id': NaN if missing from feed

  • 'stop_pattern_name': output from name_stop_patterns()

  • 'num_stops': number of stops on trip

  • 'start_time': first departure time of the trip

  • 'end_time': last departure time of the trip

  • 'start_stop_id': stop ID of the first stop of the trip

  • 'end_stop_id': stop ID of the last stop of the trip

  • 'is_loop': 1 if the start and end stop are less than 400m apart and 0 otherwise

  • 'distance': distance of the trip; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None

  • 'duration': duration of the trip in hours

  • 'speed': distance/duration

If feed.stop_times has a shape_dist_traveled column with at least one non-NaN value and compute_dist_from_shapes == False, then use that column to compute the distance column. Else if feed.shapes is not None, then compute the distance column using the shapes and Shapely. Otherwise, set the distances to NaN.

If route IDs are given, then restrict to trips on those routes.

Notes

  • Assume the following feed attributes are not None:

    • feed.trips

    • feed.routes

    • feed.stop_times

    • feed.shapes (optionally)

  • Calculating trip distances with compute_dist_from_shapes=True seems pretty accurate. For example, calculating trip distances on this Portland feed using compute_dist_from_shapes=False and compute_dist_from_shapes=True, yields a difference of at most 0.83km from the original values.

gtfs_kit.trips.get_active_services(feed: Feed, date: str) list[str]

Given a Feed and a date string in YYYYMMDD format, return the list of service IDs that are active on the date.

gtfs_kit.trips.get_trips(feed: Feed, date: str | None = None, time: str | None = None, *, as_gdf: bool = False, use_utm: bool = False) pd.DataFrame

Return feed.trips. If date (YYYYMMDD date string) is given then subset the result to trips that start on that date. If a time (HH:MM:SS string, possibly with HH > 23) is given in addition to a date, then further subset the result to trips in service at that time.

If as_gdf and feed.shapes is not None, then return the trips as a GeoDataFrame of LineStrings representing trip shapes. Use the local UTM CRS if use_utm; otherwise use the WGS84 CRS. If as_gdf and feed.shapes is None, then raise a ValueError.

gtfs_kit.trips.locate_trips(feed: Feed, date: str, times: list[str]) pd.DataFrame

Return the positions of all trips active on the given date (YYYYMMDD date string) and times (HH:MM:SS time strings, possibly with HH > 23).

Return a DataFrame with the columns

  • 'trip_id'

  • 'route_id'

  • 'direction_id': all NaNs if feed.trips.direction_id is missing

  • 'time'

  • 'rel_dist': number between 0 (start) and 1 (end) indicating the relative distance of the trip along its path

  • 'lon': longitude of trip at given time

  • 'lat': latitude of trip at given time

Assume feed.stop_times has an accurate shape_dist_traveled column.
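A rough sketch of how a relative distance might be derived from an accurate shape_dist_traveled column follows. The function names are hypothetical, and linear interpolation between consecutive stop departures is an assumption of this example, not a statement about the library's internals. The longitude and latitude would then be found by locating that distance along the trip's shape.

```python
def time_to_seconds(time_str):
    """Convert an HH:MM:SS string (HH may exceed 23) to seconds."""
    h, m, s = (int(x) for x in time_str.split(":"))
    return 3600 * h + 60 * m + s


def relative_distance(stop_times, time_str):
    """Estimate how far along its path a trip is at the given time.

    stop_times is a list of (departure_time, shape_dist_traveled) pairs in
    stop order. Return a number in [0, 1], or None if the trip is not in
    service at that time.
    """
    t = time_to_seconds(time_str)
    pts = [(time_to_seconds(dep), dist) for dep, dist in stop_times]
    if not pts[0][0] <= t <= pts[-1][0]:
        return None
    total = pts[-1][1] - pts[0][1]  # assumed positive
    for (t0, d0), (t1, d1) in zip(pts, pts[1:]):
        if t0 <= t <= t1:
            d = d0 if t1 == t0 else d0 + (d1 - d0) * (t - t0) / (t1 - t0)
            return (d - pts[0][1]) / total


# A trip that runs past midnight, hence times with HH > 23
trip = [("23:50:00", 0.0), ("24:00:00", 2.0), ("24:20:00", 6.0)]
rel = relative_distance(trip, "24:10:00")
not_running = relative_distance(trip, "25:00:00")
```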

gtfs_kit.trips.map_trips(feed: Feed, trip_ids: Iterable[str], color_palette: list[str] = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3'], *, show_stops: bool = False, show_direction: bool = False)

Return a Folium map showing the given trips and (optionally) their stops. If any of the given trip IDs are not found in the feed, then raise a ValueError. If show_direction, then use the Folium plugin PolyLineTextPath to draw arrows on each trip polyline indicating its direction of travel; this fails to work in some browsers, such as Brave 0.68.132.

gtfs_kit.trips.name_stop_patterns(feed: Feed) pd.DataFrame

For each (route ID, direction ID) pair, find the distinct stop patterns of its trips, and assign them each an integer pattern rank based on the stop pattern’s frequency rank, where 1 is the most frequent stop pattern, 2 is the second most frequent, etc. Return the DataFrame feed.trips with the additional column stop_pattern_name, which equals the trip’s ‘direction_id’ concatenated with a dash and its stop pattern rank.

If feed.trips has no ‘direction_id’ column, then temporarily create one equal to all zeros, proceed as above, then delete the column.
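The ranking scheme can be sketched with the standard library as follows. The function name `rank_stop_patterns`, the dictionary-based trip records, and the tie-breaking rule (patterns with equal counts keep first-seen order) are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def rank_stop_patterns(trips):
    """Map each trip ID to '<direction_id>-<stop pattern rank>'.

    Each trip is a dict with keys trip_id, route_id, direction_id, and
    stops (the sequence of stop IDs the trip visits).
    """
    counts = defaultdict(Counter)
    for trip in trips:
        key = (trip["route_id"], trip["direction_id"])
        counts[key][tuple(trip["stops"])] += 1

    names = {}
    for trip in trips:
        key = (trip["route_id"], trip["direction_id"])
        # Most frequent pattern first, so its rank is 1
        ranked = [pattern for pattern, _ in counts[key].most_common()]
        rank = ranked.index(tuple(trip["stops"])) + 1
        names[trip["trip_id"]] = f"{trip['direction_id']}-{rank}"
    return names


trips = [
    {"trip_id": "t1", "route_id": "r1", "direction_id": 0, "stops": ["A", "B", "C"]},
    {"trip_id": "t2", "route_id": "r1", "direction_id": 0, "stops": ["A", "B", "C"]},
    {"trip_id": "t3", "route_id": "r1", "direction_id": 0, "stops": ["A", "C"]},
    {"trip_id": "t4", "route_id": "r1", "direction_id": 1, "stops": ["C", "B", "A"]},
]
pattern_names = rank_stop_patterns(trips)
```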

gtfs_kit.trips.trips_to_geojson(feed: Feed, trip_ids: Iterable[str] | None = None, *, include_stops: bool = False) dict

Return a GeoJSON FeatureCollection of LineString features representing all the Feed’s trips. The coordinates reference system is the default one for GeoJSON, namely WGS84.

If include_stops, then include the trip stops as Point features. If an iterable of trip IDs is given, then subset to those trips. If any of the given trip IDs are not found in the feed, then raise a ValueError. If the Feed has no shapes, then raise a ValueError.

Module miscellany

Functions about miscellany.

gtfs_kit.miscellany.assess_quality(feed: Feed) pd.DataFrame

Return a DataFrame of various feed indicators and values, e.g. number of trips missing shapes.

The resulting DataFrame has the columns

  • 'indicator': string; name of an indicator, e.g. ‘num_routes’

  • 'value': value of the indicator, e.g. 27

This function is odd but useful for seeing roughly how broken a feed is. This function is not a GTFS validator.

gtfs_kit.miscellany.compute_bounds(feed: Feed, stop_ids: list[str] | None = None) np.array

Return the bounding box (Numpy array [min longitude, min latitude, max longitude, max latitude]) of the given Feed’s stops or of the subset of stops specified by the given stop IDs.

gtfs_kit.miscellany.compute_centroid(feed: Feed, stop_ids: list[str] | None = None) sg.Point

Return the centroid (Shapely Point) of the convex hull of the given Feed’s stops or of the subset of stops specified by the given stop IDs.

gtfs_kit.miscellany.compute_convex_hull(feed: Feed, stop_ids: list[str] | None = None) sg.Polygon

Return the convex hull (Shapely Polygon) of the given Feed’s stops or of the subset of stops specified by the given stop IDs.

gtfs_kit.miscellany.compute_feed_stats(feed: Feed, trip_stats: pd.DataFrame, dates: list[str], *, split_route_types=False) pd.DataFrame

Compute some stats for the given Feed, trip stats (in the format output by the function trips.compute_trip_stats()) and dates (YYYYMMDD date strings).

Return a DataFrame with the columns

  • 'date'

  • 'route_type' (optional): present if and only if split_route_types

  • 'num_stops': number of stops active on the date

  • 'num_routes': number of routes active on the date

  • 'num_trips': number of trips that start on the date

  • 'num_trip_starts': number of trips with nonnull start times on the date

  • 'num_trip_ends': number of trips with nonnull start times and nonnull end times on the date, ignoring trips that end after 23:59:59 on the date

  • 'peak_num_trips': maximum number of simultaneous trips in service on the date

  • 'peak_start_time': start time of first longest period during which the peak number of trips occurs on the date

  • 'peak_end_time': end time of first longest period during which the peak number of trips occurs on the date

  • 'service_distance': sum of the service distances for the active routes on the date; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None

  • 'service_duration': sum of the service durations for the active routes on the date; measured in hours

  • 'service_speed': service_distance/service_duration on the date

Exclude dates with no active stops, which could yield the empty DataFrame.

The route and trip stats for date d contain stats for trips that start on date d only and ignore trips that start on date d-1 and end on date d.
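The peak columns ('peak_num_trips', 'peak_start_time', 'peak_end_time') can be illustrated with a sweep over trip start and end times. This is a sketch under assumptions: the function name is hypothetical, times are plain numbers (for instance minutes after midnight), and trips are treated as half-open intervals, which may not match the library's handling of ties.

```python
def peak_trips(intervals):
    """Return (peak count, start, end) of the first longest peak period.

    intervals is a list of (start, end) trip times as numbers.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    # Sorting puts an end (-1) before a start (+1) at the same time
    events.sort()

    count = 0
    segments = []  # (count, segment start, segment end)
    for (t, delta), (t_next, _) in zip(events, events[1:]):
        count += delta
        if t_next > t:
            segments.append((count, t, t_next))
    if not segments:
        return None

    peak = max(c for c, _, _ in segments)
    periods = []
    for c, a, b in segments:
        if c != peak:
            continue
        if periods and periods[-1][1] == a:
            periods[-1][1] = b  # merge adjacent peak segments
        else:
            periods.append([a, b])
    # max() returns the first of the longest periods
    start, end = max(periods, key=lambda p: p[1] - p[0])
    return peak, start, end


intervals = [(0, 100), (50, 150), (60, 80), (120, 200), (130, 140)]
peak = peak_trips(intervals)
```

Three trips overlap during both 60 to 80 and 130 to 140; the first period is longer, so it is the one reported.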

gtfs_kit.miscellany.compute_feed_stats_0(feed: Feed, trip_stats_subset: pd.DataFrame, *, split_route_types=False) pd.DataFrame

Helper function for compute_feed_stats().

gtfs_kit.miscellany.compute_feed_time_series(feed: Feed, trip_stats: pd.DataFrame, dates: list[str], freq: str = '5Min', *, split_route_types: bool = False) pd.DataFrame

Compute some feed stats in time series form for the given dates (YYYYMMDD date strings) and trip stats (of the form output by the function trips.compute_trip_stats()). Use the given Pandas frequency string freq to specify the frequency of the resulting time series, e.g. ‘5Min’; highest frequency allowable is one minute (‘1Min’). If split_route_types, then split stats by route type; otherwise don’t.

Return a time series DataFrame with a datetime index across the given dates sampled at the given frequency. The columns are

  • 'num_trips': number of trips in service during the time period

  • 'num_trip_starts': number of trips starting during the time period

  • 'num_trip_ends': number of trips ending during the time period, ignoring the trips that end past midnight

  • 'service_distance': distance traveled during the time period by all trips active during the time period; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None

  • 'service_duration': duration traveled during the time period by all trips active during the time period; measured in hours

  • 'service_speed': service_distance/service_duration

Exclude dates that lie outside of the Feed’s date range. If all the dates given lie outside of the Feed’s date range, then return an empty DataFrame with the specified columns.

If split_route_types, then multi-index the columns with

  • top level: name is 'indicator'; values are 'num_trip_starts', 'num_trip_ends', 'num_trips', 'service_distance', 'service_duration', and 'service_speed'

  • bottom level: name is 'route_type'; values are route type values

If all dates lie outside the Feed’s date range, then return an empty DataFrame.

gtfs_kit.miscellany.compute_screen_line_counts(feed: Feed, screen_lines: gpd.GeoDataFrame, dates: list[str]) pd.DataFrame

Find all the Feed trips active on the given YYYYMMDD dates whose shapes intersect the given GeoDataFrame of screen lines, that is, of straight WGS84 LineStrings. Compute the intersection times and directions for each trip.

Return a DataFrame with the columns

  • 'date'

  • 'trip_id'

  • 'route_id'

  • 'route_short_name'

  • 'shape_id': shape ID of the trip

  • 'screen_line_id': ID of the screen line as specified in screen_lines or as assigned after the fact.

  • 'crossing_distance': distance (in the feed’s distance units) along the trip shape of the screen line intersection

  • 'crossing_time': time that the trip’s vehicle crosses the screen line; one trip could cross multiple times

  • 'crossing_direction': 1 or -1; 1 indicates trip travel from the left side to the right side of the screen line; -1 indicates trip travel in the opposite direction

Notes:

  • Assume the Feed’s stop times DataFrame has an accurate shape_dist_traveled column.

  • Assume that trips travel in the same direction as their shapes, an assumption that is part of the GTFS.

  • Assume that the screen line is straight and simple.

  • Probably does not give correct results for trips with self-intersecting shapes.

  • The algorithm works as follows

    1. Find the trip shapes that intersect the screen lines.

    2. For each such shape and screen line, compute the intersection points, the distance of the point along the shape, and the orientation of the screen line relative to the shape.

    3. For each given date, restrict to trips active on the date and interpolate a stop time for the intersection point using the shape_dist_traveled column.

    4. Use that interpolated time as the crossing time of the trip vehicle.
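Steps 2 to 4 can be sketched in plain Python for a single shape segment and screen line. The function names are hypothetical, the coordinates are assumed planar with x pointing east and y pointing north, and "left" is taken relative to the screen line's direction from its first point to its last; the library's conventions may differ.

```python
def crossing_direction(screen_line, segment):
    """Return 1 for left-to-right travel across the line, else -1.

    screen_line and segment are pairs of (x, y) points; segment is the
    piece of the trip shape that crosses the screen line.
    """
    (px, py), (qx, qy) = screen_line
    (ax, ay), (bx, by) = segment
    # z-component of the cross product: line direction x travel direction
    cross = (qx - px) * (by - ay) - (qy - py) * (bx - ax)
    return 1 if cross < 0 else -1


def crossing_time(stop_times, crossing_distance):
    """Interpolate the time at which the trip reaches crossing_distance.

    stop_times is a list of (seconds, shape_dist_traveled) in stop order.
    """
    for (t0, d0), (t1, d1) in zip(stop_times, stop_times[1:]):
        if d0 <= crossing_distance <= d1 and d1 > d0:
            return t0 + (t1 - t0) * (crossing_distance - d0) / (d1 - d0)
    return None


# Screen line pointing north; its left side is to the west
line = ((0.0, 0.0), (0.0, 1.0))
eastbound = crossing_direction(line, ((-1.0, 0.5), (1.0, 0.5)))
westbound = crossing_direction(line, ((1.0, 0.5), (-1.0, 0.5)))

# Stops at 08:00:00, 08:10:00, 08:20:00 with cumulative distances
when = crossing_time([(28800, 0.0), (29400, 2.0), (30000, 5.0)], 3.5)
```

In this example the interpolated crossing time is 29700 seconds after midnight, that is, 08:15:00.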

gtfs_kit.miscellany.convert_dist(feed: Feed, new_dist_units: str) Feed

Convert the distances recorded in the shape_dist_traveled columns of the given Feed to the given distance units. New distance units must lie in constants.DIST_UNITS. Return the resulting Feed.
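As an illustration of the unit conversion involved, the sketch below converts a single value between the four units listed in constants.DIST_UNITS by going through meters. The names `METERS_PER_UNIT` and `convert_value` are hypothetical and not part of GTFS Kit.

```python
# Exact definitions of each unit in meters
METERS_PER_UNIT = {"ft": 0.3048, "mi": 1609.344, "m": 1.0, "km": 1000.0}


def convert_value(value, old_units, new_units):
    """Convert a distance from old_units to new_units."""
    for units in (old_units, new_units):
        if units not in METERS_PER_UNIT:
            raise ValueError(f"Distance units must lie in {list(METERS_PER_UNIT)}")
    return value * METERS_PER_UNIT[old_units] / METERS_PER_UNIT[new_units]


one_mile_in_km = convert_value(1, "mi", "km")
```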

gtfs_kit.miscellany.create_shapes(feed: Feed, *, all_trips: bool = False) Feed

Given a feed, create a shape for every trip that is missing a shape ID. Do this by connecting the stops on the trip with straight lines. Return the resulting feed which has updated shapes and trips tables.

If all_trips, then create new shapes for all trips by connecting stops, and remove the old shapes.
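Connecting the stops of a trip with straight lines amounts to emitting one shape point per stop, in stop order. A minimal sketch, assuming a hypothetical helper name, a caller-chosen shape ID, and simple tuple and dictionary inputs:

```python
def build_shape_rows(shape_id, trip_stop_times, stop_coords):
    """Build shapes-table rows for a trip from its stops.

    trip_stop_times is a list of (stop_sequence, stop_id) pairs and
    stop_coords maps stop_id to (lon, lat).
    """
    rows = []
    for seq, (_, stop_id) in enumerate(sorted(trip_stop_times), start=1):
        lon, lat = stop_coords[stop_id]
        rows.append({
            "shape_id": shape_id,
            "shape_pt_sequence": seq,
            "shape_pt_lon": lon,
            "shape_pt_lat": lat,
        })
    return rows


stop_coords = {"A": (174.76, -36.84), "B": (174.77, -36.85)}
shape_rows = build_shape_rows("shape_t1", [(2, "B"), (1, "A")], stop_coords)
```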

gtfs_kit.miscellany.describe(feed: Feed, sample_date: str | None = None) pd.DataFrame

Return a DataFrame of various feed indicators and values, e.g. number of routes. Specialize some of those indicators to the given YYYYMMDD sample date string, e.g. number of routes active on the date.

The resulting DataFrame has the columns

  • 'indicator': string; name of an indicator, e.g. ‘num_routes’

  • 'value': value of the indicator, e.g. 27

gtfs_kit.miscellany.list_fields(feed: Feed, table: str | None = None) pd.DataFrame

Return a DataFrame describing all the fields of the GTFS tables in the given feed or in the given table if specified.

The resulting DataFrame has the following columns.

  • 'table': name of the GTFS table, e.g. 'stops'

  • 'column': name of a column in the table, e.g. 'stop_id'

  • 'num_values': number of values in the column

  • 'num_nonnull_values': number of nonnull values in the column

  • 'num_unique_values': number of unique values in the column, excluding null values

  • 'min_value': minimum value in the column

  • 'max_value': maximum value in the column

If the table is not in the feed, then return an empty DataFrame. If the table is not valid, then raise a ValueError.

gtfs_kit.miscellany.restrict_to_agencies(feed: Feed, agency_ids: list[str]) Feed

Build a new feed by restricting this one via restrict_to_routes() and the routes with the given agency IDs. Return the resulting feed.

gtfs_kit.miscellany.restrict_to_area(feed: Feed, area: gpd.GeoDataFrame) Feed

Build a new feed by restricting this one via restrict_to_trips() and the trips that have at least one stop intersecting the given GeoDataFrame of polygons. Return the resulting feed.

gtfs_kit.miscellany.restrict_to_dates(feed: Feed, dates: list[str]) Feed

Build a new feed by restricting this one via restrict_to_trips() and the trips active on at least one of the given dates (YYYYMMDD strings). Return the resulting feed.

gtfs_kit.miscellany.restrict_to_routes(feed: Feed, route_ids: list[str]) Feed

Build a new feed by restricting this one via restrict_to_trips() and the trips with the given route IDs. Return the resulting feed.

gtfs_kit.miscellany.restrict_to_trips(feed: Feed, trip_ids: list[str]) Feed

Build a new feed by restricting this one to only the stops, trips, shapes, etc. used by the trips of the given IDs. Return the resulting feed.

If no valid trip IDs are given, which includes the case of the empty list, then the resulting feed will have all empty non-agency tables.

This function is probably more useful internally than externally.
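The idea of restricting a feed to the records used by a set of trips can be sketched on a handful of tables held as lists of dictionaries. The function name and the table layout are assumptions for this example; the library works on DataFrames and covers more tables (shapes, calendar, and so on).

```python
def restrict_tables_to_trips(tables, trip_ids):
    """Keep only the rows used by the trips with the given IDs."""
    wanted = set(trip_ids)
    trips = [r for r in tables["trips"] if r["trip_id"] in wanted]
    stop_times = [r for r in tables["stop_times"] if r["trip_id"] in wanted]
    route_ids = {r["route_id"] for r in trips}
    stop_ids = {r["stop_id"] for r in stop_times}
    return {
        "agency": list(tables["agency"]),  # left as is in this sketch
        "trips": trips,
        "stop_times": stop_times,
        "routes": [r for r in tables["routes"] if r["route_id"] in route_ids],
        "stops": [r for r in tables["stops"] if r["stop_id"] in stop_ids],
    }


tables = {
    "agency": [{"agency_id": "a1"}],
    "routes": [{"route_id": "r1"}, {"route_id": "r2"}],
    "trips": [{"trip_id": "t1", "route_id": "r1"},
              {"trip_id": "t2", "route_id": "r2"}],
    "stop_times": [{"trip_id": "t1", "stop_id": "A"},
                   {"trip_id": "t2", "stop_id": "B"}],
    "stops": [{"stop_id": "A"}, {"stop_id": "B"}],
}
small = restrict_tables_to_trips(tables, ["t1"])
none_left = restrict_tables_to_trips(tables, [])
```

With an empty list of trip IDs, every table except agency comes back empty, in line with the remark above.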

Module feed

This module defines a Feed class to represent GTFS feeds. There is an instance attribute for every GTFS table (routes, stops, etc.), which stores the table as a Pandas DataFrame, or as None in case that table is missing.

The Feed class also has heaps of methods: a method to compute route stats, a method to compute screen line counts, validation methods, etc. To ease reading, almost all of these methods are defined in other modules and grouped by theme (routes.py, stops.py, etc.). These methods, or rather functions that operate on feeds, are then imported within the Feed class. This separation of methods unfortunately slightly messes up the Feed class documentation generated by Sphinx, introducing an extra leading feed parameter in the method signatures. Ignore that extra parameter; it refers to the Feed instance, usually called self and usually hidden automatically by Sphinx.

class gtfs_kit.feed.Feed(dist_units: str, agency: DataFrame | None = None, stops: DataFrame | None = None, routes: DataFrame | None = None, trips: DataFrame | None = None, stop_times: DataFrame | None = None, calendar: DataFrame | None = None, calendar_dates: DataFrame | None = None, fare_attributes: DataFrame | None = None, fare_rules: DataFrame | None = None, shapes: DataFrame | None = None, frequencies: DataFrame | None = None, transfers: DataFrame | None = None, feed_info: DataFrame | None = None, attributions: DataFrame | None = None)

Bases: object

An instance of this class represents a not-necessarily-valid GTFS feed, where GTFS tables are stored as DataFrames. Beware that the stop times DataFrame can be big (several gigabytes), so make sure you have enough memory to handle it.

Primary instance attributes:

  • dist_units: a string in constants.DIST_UNITS; specifies the distance units of the shape_dist_traveled column values, if present; also affects whether to display trip and route stats in metric or imperial units

  • agency

  • stops

  • routes

  • trips

  • stop_times

  • calendar

  • calendar_dates

  • fare_attributes

  • fare_rules

  • shapes

  • frequencies

  • transfers

  • feed_info

  • attributions

There are also a few secondary instance attributes that are derived from the primary attributes and are automatically updated when the primary attributes change. However, for this update to work, you must update the primary attributes like this (good):

feed.trips['route_short_name'] = 'bingo'
feed.trips = feed.trips

and not like this (bad):

feed.trips['route_short_name'] = 'bingo'

The first way ensures that the altered trips DataFrame is saved as the new trips attribute, but the second way does not.
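Why reassignment matters can be seen in a toy class that refreshes a derived attribute inside a property setter. This is only a sketch of the general mechanism, with a hypothetical `MiniFeed` class and a made-up derived attribute; it does not show GTFS Kit's actual secondary attributes or implementation.

```python
class MiniFeed:
    """Toy feed whose derived attribute updates only on assignment."""

    def __init__(self, trips):
        self.trips = trips  # goes through the setter below

    @property
    def trips(self):
        return self._trips

    @trips.setter
    def trips(self, value):
        self._trips = value
        # Recompute the derived attribute whenever trips is assigned
        self.route_short_names = {row["route_short_name"] for row in value}


feed = MiniFeed([{"trip_id": "t1", "route_short_name": "A"}])

# In-place mutation alone does not call the setter
feed.trips[0]["route_short_name"] = "bingo"
stale = set(feed.route_short_names)

# Reassigning triggers the setter and refreshes the derived attribute
feed.trips = feed.trips
fresh = set(feed.route_short_names)
```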

aggregate_routes(by: str = 'route_short_name', route_id_prefix: str = 'route_') Feed

Aggregate routes by route short name, say, and assign new route IDs using the given prefix.

More specifically, create new route IDs with the function build_aggregate_routes_dict() and the parameters by and route_id_prefix and update the old route IDs to the new ones in all the relevant Feed tables. Return the resulting Feed.

aggregate_stops(by: str = 'stop_code', stop_id_prefix: str = 'stop_') Feed

Aggregate stops by stop code, say, and assign new stop IDs using the given prefix.

More specifically, create new stop IDs with the function build_aggregate_stops_dict() and the parameters by and stop_id_prefix and update the old stop IDs to the new ones in all the relevant Feed tables. Return the resulting Feed.

append_dist_to_shapes() Feed

Calculate and append the optional shape_dist_traveled field in feed.shapes in terms of the distance units feed.dist_units. Return the resulting Feed.

As a benchmark, using this function on this Portland feed produces a shape_dist_traveled column that differs by at most 0.016 km in absolute value from the original values.

append_dist_to_stop_times() Feed

Calculate and append the optional shape_dist_traveled column in feed.stop_times in terms of the distance units feed.dist_units. Trips without shapes will have NaN distances. Return the resulting Feed. Uses feed.shapes, so if that is missing, then return the original feed.

This does not always give accurate results. The algorithm works as follows. Compute the shape_dist_traveled field by using Shapely to measure the distance of a stop along its trip LineString. If for a given trip this process produces a non-monotonically increasing, hence incorrect, list of (cumulative) distances, then fall back to estimating the distances as follows.

Set the first distance to 0, the last to the length of the trip shape, and leave the remaining ones computed above. Choose the longest increasing subsequence of that new set of distances and use them and their corresponding departure times to linearly interpolate the rest of the distances.

assess_quality() pd.DataFrame

Return a DataFrame of various feed indicators and values, e.g. number of trips missing shapes.

The resulting DataFrame has the columns

  • 'indicator': string; name of an indicator, e.g. ‘num_routes’

  • 'value': value of the indicator, e.g. 27

This function is odd but useful for seeing roughly how broken a feed is. It is not a GTFS validator.

build_geometry_by_shape(shape_ids: Iterable[str] | None = None, *, use_utm: bool = False) dict

Return a dictionary of the form <shape ID> -> <Shapely LineString representing shape>. If the Feed has no shapes, then return the empty dictionary. If use_utm, then use local UTM coordinates; otherwise, use WGS84 coordinates.

build_geometry_by_stop(stop_ids: Iterable[str] | None = None, *, use_utm: bool = False) dict

Return a dictionary of the form <stop ID> -> <Shapely Point representing stop>.

build_route_timetable(route_id: str, dates: list[str]) pd.DataFrame

Return a timetable for the given route and dates (YYYYMMDD date strings).

Return a DataFrame whose columns are all those in feed.trips plus those in feed.stop_times plus 'date'. The trip IDs are restricted to the given route ID. The result is sorted first by date and then by grouping by trip ID and sorting the groups by their first departure time.

Skip dates outside of the Feed’s dates.

If there is no route activity on the given dates, then return an empty DataFrame.

build_stop_timetable(stop_id: str, dates: list[str]) pd.DataFrame

Return a DataFrame containing the timetable for the given stop ID and dates (YYYYMMDD date strings).

Return a DataFrame whose columns are all those in feed.trips plus those in feed.stop_times plus 'date', and the stop IDs are restricted to the given stop ID. The result is sorted by date then departure time.

build_zero_route_time_series(date_label: str = '20010101', freq: str = '5Min', *, split_directions: bool = False) pd.DataFrame

Return a route time series with the same index and hierarchical columns as output by compute_route_time_series_0(), but fill it full of zero values.

build_zero_stop_time_series(date_label: str = '20010101', freq: str = '5Min', *, split_directions: bool = False) pd.DataFrame

Return a stop time series with the same index and hierarchical columns as output by the function compute_stop_time_series_0(), but fill it full of zero values.

clean() Feed

Apply the following functions to the given Feed in order and return the resulting Feed.

  1. clean_ids()

  2. clean_times()

  3. clean_route_short_names()

  4. drop_zombies()

clean_ids() Feed

In the given Feed, strip whitespace from all string IDs and then replace every remaining whitespace chunk with an underscore. Return the resulting Feed.
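
The rule for a single ID can be sketched with a regular expression. The helper name clean_id is hypothetical, and this shows the transformation on one string only, not how GTFS Kit applies it across tables.

```python
import re


def clean_id(value):
    """Strip outer whitespace, then replace each inner whitespace run with '_'."""
    return re.sub(r"\s+", "_", value.strip())
```

For example, clean_id("  route 10  A ") collapses the double space into a single underscore.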

clean_route_short_names() Feed

In feed.routes, assign ‘n/a’ to missing route short names and strip whitespace from route short names. Then disambiguate each route short name that is duplicated by appending ‘-’ and its route ID. Return the resulting Feed.
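
A minimal sketch of this rule on plain Python data follows. The function name clean_short_names and the (route ID, short name) input format are assumptions made for illustration; the library itself works on feed.routes.

```python
from collections import Counter


def clean_short_names(routes):
    """routes: list of (route_id, short_name or None) pairs.
    Return a dict mapping each route ID to its cleaned short name."""
    cleaned = {
        rid: (name.strip() if name is not None else "n/a")
        for rid, name in routes
    }
    counts = Counter(cleaned.values())
    # Disambiguate duplicated names by appending '-' and the route ID
    return {
        rid: f"{name}-{rid}" if counts[name] > 1 else name
        for rid, name in cleaned.items()
    }
```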

clean_times() Feed

In the given Feed, convert H:MM:SS time strings to HH:MM:SS time strings to make sorting by time work as expected. Return the resulting Feed.
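
To see why the padding matters, note that plain string sorting places '10:00:00' before '9:05:00'. A small sketch, with the hypothetical helper pad_time standing in for the per-value conversion:

```python
def pad_time(t):
    """Left-pad an H:MM:SS time string to HH:MM:SS; leave other strings unchanged."""
    return t.zfill(8) if len(t) == 7 else t


times = ["10:00:00", "9:05:00", "25:10:00"]
unpadded_order = sorted(times)                       # lexicographic, not chronological
padded_order = sorted(pad_time(t) for t in times)    # chronological
```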

compute_bounds(stop_ids: list[str] | None = None) np.array

Return the bounding box (Numpy array [min longitude, min latitude, max longitude, max latitude]) of the given Feed’s stops or of the subset of stops specified by the given stop IDs.

compute_busiest_date(dates: list[str]) str

Given a list of dates (YYYYMMDD date strings), return the first date that has the maximum number of active trips.

compute_centroid(stop_ids: list[str] | None = None) sg.Point

Return the centroid (Shapely Point) of the convex hull of the given Feed’s stops or of the subset of stops specified by the given stop IDs.

compute_convex_hull(stop_ids: list[str] | None = None) sg.Polygon

Return the convex hull (Shapely Polygon) of the given Feed’s stops or of the subset of stops specified by the given stop IDs.

compute_feed_stats(trip_stats: pd.DataFrame, dates: list[str], *, split_route_types=False) pd.DataFrame

Compute some stats for the given Feed, trip stats (in the format output by the function trips.compute_trip_stats()) and dates (YYYYMMDD date strings).

Return a DataFrame with the columns

  • 'date'

  • 'route_type' (optional): present if and only if split_route_types

  • 'num_stops': number of stops active on the date

  • 'num_routes': number of routes active on the date

  • 'num_trips': number of trips that start on the date

  • 'num_trip_starts': number of trips with nonnull start times on the date

  • 'num_trip_ends': number of trips with nonnull start times and nonnull end times on the date, ignoring trips that end after 23:59:59 on the date

  • 'peak_num_trips': maximum number of simultaneous trips in service on the date

  • 'peak_start_time': start time of first longest period during which the peak number of trips occurs on the date

  • 'peak_end_time': end time of first longest period during which the peak number of trips occurs on the date

  • 'service_distance': sum of the service distances for the active routes on the date; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None

  • 'service_duration': sum of the service durations for the active routes on the date; measured in hours

  • 'service_speed': service_distance/service_duration on the date

Exclude dates with no active stops, which could yield the empty DataFrame.

The route and trip stats for date d contain stats for trips that start on date d only and ignore trips that start on date d-1 and end on date d.
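
As an illustration of the 'peak_num_trips' indicator, the sketch below counts the maximum number of simultaneous trips with a sweep over start and end events. It is not the library's method: the function names are hypothetical, and treating a trip as in service on the half-open interval from its start time to its end time is an assumption. Converting times to seconds also handles HH > 24.

```python
def to_seconds(t):
    """Convert an HH:MM:SS time string (HH may exceed 24) to seconds."""
    h, m, s = (int(x) for x in t.split(":"))
    return 3600 * h + 60 * m + s


def peak_num_trips(trips):
    """trips: list of (start_time, end_time) HH:MM:SS strings.
    Return the maximum number of trips in service at the same time."""
    events = []
    for start, end in trips:
        events.append((to_seconds(start), 1))
        events.append((to_seconds(end), -1))
    # Tuples sort ends (-1) before starts (+1) at equal times,
    # so back-to-back trips are not counted as overlapping
    events.sort()
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak
```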

compute_feed_time_series(trip_stats: pd.DataFrame, dates: list[str], freq: str = '5Min', *, split_route_types: bool = False) pd.DataFrame

Compute some feed stats in time series form for the given dates (YYYYMMDD date strings) and trip stats (of the form output by the function trips.compute_trip_stats()). Use the given Pandas frequency string freq to specify the frequency of the resulting time series, e.g. ‘5Min’; highest frequency allowable is one minute (‘1Min’). If split_route_types, then split stats by route type; otherwise don’t.

Return a time series DataFrame with a datetime index across the given dates sampled at the given frequency. The columns are

  • 'num_trips': number of trips in service during the time period

  • 'num_trip_starts': number of trips starting during the time period

  • 'num_trip_ends': number of trips ending during the time period, ignoring the trips that end past midnight

  • 'service_distance': distance traveled during the time period by all trips active during the time period; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None

  • 'service_duration': duration traveled during the time period by all trips active during the time period; measured in hours

  • 'service_speed': service_distance/service_duration

Exclude dates that lie outside of the Feed’s date range. If all the dates given lie outside of the Feed’s date range, then return an empty DataFrame with the specified columns.

If split_route_types, then multi-index the columns with

  • top level: name is 'indicator'; values are 'num_trip_starts', 'num_trip_ends', 'num_trips', 'service_distance', 'service_duration', and 'service_speed'

  • bottom level: name is 'route_type'; values are route type values

If all dates lie outside the Feed’s date range, then return an empty DataFrame.

compute_route_stats(trip_stats_subset: pd.DataFrame, dates: list[str], headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) pd.DataFrame

Compute route stats for all the trips that lie in the given subset of trip stats (of the form output by the function trips.compute_trip_stats()) and that start on the given dates (YYYYMMDD date strings).

If split_directions, then separate the stats by trip direction (0 or 1). Use the headway start and end times to specify the time period for computing headway stats.

Return a DataFrame with the columns

  • 'date'

  • the columns listed in compute_route_stats_0()

Exclude dates with no active trips, which could yield the empty DataFrame.

Notes

  • The route stats for date d contain stats for trips that start on date d only and ignore trips that start on date d-1 and end on date d

  • Raise a ValueError if split_directions and no non-NaN direction ID values present

compute_route_time_series(trip_stats_subset: pd.DataFrame, dates: list[str], freq: str = '5Min', *, split_directions: bool = False) pd.DataFrame

Compute route stats in time series form for the trips that lie in the trip stats subset (of the form output by the function trips.compute_trip_stats()) and that start on the given dates (YYYYMMDD date strings).

If split_directions, then separate each route’s stats by trip direction. Specify the time series frequency with a Pandas frequency string, e.g. '5Min'; max frequency is one minute (‘Min’).

Return a DataFrame of the same format output by the function compute_route_time_series_0() but with multiple dates.

Exclude dates that lie outside of the Feed’s date range. If all dates lie outside the Feed’s date range, then return an empty DataFrame.

Notes

  • See the notes for compute_route_time_series_0()

  • Raise a ValueError if split_directions and no non-NaN direction ID values present

compute_screen_line_counts(screen_lines: gpd.GeoDataFrame, dates: list[str]) pd.DataFrame

Find all the Feed trips active on the given YYYYMMDD dates whose shapes intersect the given GeoDataFrame of screen lines, that is, of straight WGS84 LineStrings. Compute the intersection times and directions for each trip.

Return a DataFrame with the columns

  • 'date'

  • 'trip_id'

  • 'route_id'

  • 'route_short_name'

  • 'shape_id': shape ID of the trip

  • 'screen_line_id': ID of the screen line as specified in screen_lines or as assigned after the fact.

  • 'crossing_distance': distance (in the feed’s distance units) along the trip shape of the screen line intersection

  • 'crossing_time': time that the trip’s vehicle crosses the screen line; one trip could cross multiple times

  • 'crossing_direction': 1 or -1; 1 indicates trip travel from the left side to the right side of the screen line; -1 indicates trip travel in the opposite direction

Notes:

  • Assume the Feed’s stop times DataFrame has an accurate shape_dist_traveled column.

  • Assume that trips travel in the same direction as their shapes, an assumption that is part of the GTFS.

  • Assume that the screen line is straight and simple.

  • Probably does not give correct results for trips with self-intersecting shapes.

  • The algorithm works as follows

    1. Find the trip shapes that intersect the screen lines.

    2. For each such shape and screen line, compute the intersection points, the distance of the point along the shape, and the orientation of the screen line relative to the shape.

    3. For each given date, restrict to trips active on the date and interpolate a stop time for the intersection point using the shape_dist_traveled column.

    4. Use that interpolated time as the crossing time of the trip vehicle.
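
Steps 2 to 4 can be sketched in plain Python under some assumptions: times are seconds after midnight, coordinates are planar (x, y) pairs, the screen line is directed from its first to its second point, and the two trip points given straddle the line. The function names are hypothetical, and the library itself works with Shapely geometries.

```python
def interpolate_crossing_time(stop_dists, stop_secs, crossing_dist):
    """Linearly interpolate the time (seconds after midnight) at which a trip
    reaches crossing_dist, given cumulative stop distances and stop times."""
    for i in range(len(stop_dists) - 1):
        d0, d1 = stop_dists[i], stop_dists[i + 1]
        t0, t1 = stop_secs[i], stop_secs[i + 1]
        if d0 <= crossing_dist <= d1:
            if d1 == d0:
                return float(t0)
            return t0 + (crossing_dist - d0) / (d1 - d0) * (t1 - t0)
    return None  # crossing point lies outside the range of stop distances


def crossing_direction(line_start, line_end, before, after):
    """Return 1 if travel goes from the left side to the right side of the
    directed screen line, and -1 otherwise."""
    ax, ay = line_start
    bx, by = line_end

    def side(p):
        # Positive when p lies to the left of the directed line A -> B
        return (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)

    return 1 if side(before) > 0 > side(after) else -1
```

For a northbound screen line from (0, 0) to (0, 1), a trip moving from (-1, 0.5) to (1, 0.5) travels from the left side to the right side, so crossing_direction returns 1.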

compute_stop_activity(dates: list[str]) pd.DataFrame

Mark stops as active or inactive on the given dates (YYYYMMDD date strings). A stop is active on a given date if some trip that starts on the date visits the stop (possibly after midnight).

Return a DataFrame with the columns

  • 'stop_id'

  • dates[0]: 1 if the stop has at least one trip visiting it on dates[0]; 0 otherwise

  • dates[1]: 1 if the stop has at least one trip visiting it on dates[1]; 0 otherwise

  • etc.

  • dates[-1]: 1 if the stop has at least one trip visiting it on dates[-1]; 0 otherwise

If all dates lie outside the Feed period, then return an empty DataFrame.

compute_stop_stats(dates: list[str], stop_ids: list[str | None] = None, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) pd.DataFrame

Compute stats for all stops for the given dates (YYYYMMDD date strings). Optionally, restrict to the stop IDs given.

If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops. Use the headway start and end times to specify the time period for computing headway stats.

Return a DataFrame with the columns

  • 'date'

  • 'stop_id'

  • 'direction_id': present if and only if split_directions

  • 'num_routes': number of routes visiting the stop (in the given direction) on the date

  • 'num_trips': number of trips visiting the stop (in the given direction) on the date

  • 'max_headway': maximum of the durations (in minutes) between trip departures at the stop between headway_start_time and headway_end_time on the date

  • 'min_headway': minimum of the durations (in minutes) mentioned above

  • 'mean_headway': mean of the durations (in minutes) mentioned above

  • 'start_time': earliest departure time of a trip from this stop on the date

  • 'end_time': latest departure time of a trip from this stop on the date

Exclude dates with no active stops, which could yield the empty DataFrame.
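
The three headway columns can be illustrated for a single stop and date as follows. This is a sketch, not the library's code: the function names are hypothetical, and treating the headway window as inclusive at both ends is an assumption.

```python
def to_seconds(t):
    """Convert an HH:MM:SS time string (HH may exceed 24) to seconds."""
    h, m, s = (int(x) for x in t.split(":"))
    return 3600 * h + 60 * m + s


def headway_stats(departures, start="07:00:00", end="19:00:00"):
    """Return (max, min, mean) headway in minutes between consecutive
    departures within the window, or None if fewer than two departures."""
    lo, hi = to_seconds(start), to_seconds(end)
    secs = sorted(s for s in map(to_seconds, departures) if lo <= s <= hi)
    gaps = [(b - a) / 60 for a, b in zip(secs, secs[1:])]
    if not gaps:
        return None
    return max(gaps), min(gaps), sum(gaps) / len(gaps)
```

With departures at 06:50, 07:00, 07:10, 07:30 and 19:30, only the three departures inside the default window count, giving headways of 10 and 20 minutes.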

compute_stop_time_series(dates: list[str], stop_ids: list[str | None] = None, freq: str = '5Min', *, split_directions: bool = False) pd.DataFrame

Compute time series for the stops on the given dates (YYYYMMDD date strings) at the given frequency (Pandas frequency string, e.g. '5Min'; max frequency is one minute) and return the result as a DataFrame of the same form as output by the function stop_times.compute_stop_time_series_0(). Optionally restrict to stops in the given list of stop IDs.

If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops.

Return a time series DataFrame with a timestamp index across the given dates sampled at the given frequency.

The columns are the same as in the output of the function compute_stop_time_series_0().

Exclude dates that lie outside of the Feed’s date range. If all dates lie outside the Feed’s date range, then return an empty DataFrame.

Notes

  • See the notes for the function compute_stop_time_series_0()

  • Raise a ValueError if split_directions and no non-NaN direction ID values present

compute_trip_activity(dates: list[str]) pd.DataFrame

Mark trips as active or inactive on the given dates (YYYYMMDD date strings). Return a table with the columns

  • 'trip_id'

  • dates[0]: 1 if the trip is active on dates[0]; 0 otherwise

  • dates[1]: 1 if the trip is active on dates[1]; 0 otherwise

  • etc.

  • dates[-1]: 1 if the trip is active on dates[-1]; 0 otherwise

If dates is None or the empty list, then return an empty DataFrame.

compute_trip_stats(route_ids: list[str | None] = None, *, compute_dist_from_shapes: bool = False) pd.DataFrame

Return a DataFrame with the following columns:

  • 'trip_id'

  • 'route_id'

  • 'route_short_name'

  • 'route_type'

  • 'direction_id': NaN if missing from feed

  • 'shape_id': NaN if missing from feed

  • 'stop_pattern_name': output from name_stop_patterns()

  • 'num_stops': number of stops on trip

  • 'start_time': first departure time of the trip

  • 'end_time': last departure time of the trip

  • 'start_stop_id': stop ID of the first stop of the trip

  • 'end_stop_id': stop ID of the last stop of the trip

  • 'is_loop': 1 if the start and end stop are less than 400m apart and 0 otherwise

  • 'distance': distance of the trip; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None

  • 'duration': duration of the trip in hours

  • 'speed': distance/duration

If feed.stop_times has a shape_dist_traveled column with at least one non-NaN value and compute_dist_from_shapes == False, then use that column to compute the distance column. Else if feed.shapes is not None, then compute the distance column using the shapes and Shapely. Otherwise, set the distances to NaN.

If route IDs are given, then restrict to trips on those routes.

Notes

  • Assume the following feed attributes are not None:

    • feed.trips

    • feed.routes

    • feed.stop_times

    • feed.shapes (optionally)

  • Calculating trip distances with compute_dist_from_shapes=True seems pretty accurate. For example, calculating trip distances on this Portland feed using compute_dist_from_shapes=False and compute_dist_from_shapes=True yields a difference of at most 0.83 km from the original values.

convert_dist(new_dist_units: str) Feed

Convert the distances recorded in the shape_dist_traveled columns of the given Feed to the given distance units. New distance units must lie in constants.DIST_UNITS. Return the resulting Feed.
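
The unit conversion itself amounts to a scale factor between the four units in constants.DIST_UNITS. A minimal sketch for one value, where the table METERS_PER_UNIT and the function convert_value are hypothetical names:

```python
# Length of one unit in meters (international foot and mile)
METERS_PER_UNIT = {"ft": 0.3048, "mi": 1609.344, "m": 1.0, "km": 1000.0}


def convert_value(value, old_units, new_units):
    """Convert a distance from old_units to new_units."""
    for units in (old_units, new_units):
        if units not in METERS_PER_UNIT:
            raise ValueError(f"Invalid distance units: {units}")
    return value * METERS_PER_UNIT[old_units] / METERS_PER_UNIT[new_units]
```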

copy() Feed

Return a copy of this feed, that is, a feed with all the same attributes.

create_shapes(*, all_trips: bool = False) Feed

Given a feed, create a shape for every trip that is missing a shape ID. Do this by connecting the stops on the trip with straight lines. Return the resulting feed which has updated shapes and trips tables.

If all_trips, then create new shapes for all trips by connecting stops, and remove the old shapes.

describe(sample_date: str | None = None) pd.DataFrame

Return a DataFrame of various feed indicators and values, e.g. number of routes. Specialize some of those indicators to the given YYYYMMDD sample date string, e.g. number of routes active on the date.

The resulting DataFrame has the columns

  • 'indicator': string; name of an indicator, e.g. ‘num_routes’

  • 'value': value of the indicator, e.g. 27

property dist_units

The distance units of the Feed.

drop_invalid_columns() Feed

Drop all DataFrame columns of the given Feed that are not listed in the GTFS. Return the resulting Feed.

drop_zombies() Feed

In the given Feed, do the following in order and return the resulting Feed.

  1. Drop stops of location type 0 or NaN with no stop times.

  2. Remove undefined parent stations from the parent_station column.

  3. Drop trips with no stop times.

  4. Drop shapes with no trips.

  5. Drop routes with no trips.

  6. Drop services with no trips.

extend_id(id_col: str, extension: str, *, prefix=True) Feed

Add a prefix (if prefix) or a suffix (otherwise) to all values of column id_col across all tables of this Feed. This can be helpful when preparing to merge multiple GTFS feeds with colliding route IDs, say.

Raises a ValueError if id_col values can’t have strings added to them, e.g. if id_col is ‘direction_id’.
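
The idea can be sketched on plain dictionaries standing in for the Feed tables. The function below is a hypothetical stand-in, not the Feed method, and representing each table as a list of row dictionaries is an assumption made to keep the example self-contained.

```python
def extend_id_in_tables(tables, id_col, extension, prefix=True):
    """tables: dict mapping a table name to a list of row dicts.
    Return new tables with the extension added to every id_col value."""
    out = {}
    for name, rows in tables.items():
        new_rows = []
        for row in rows:
            row = dict(row)  # copy so the input tables stay untouched
            value = row.get(id_col)
            if value is not None:
                if not isinstance(value, str):
                    raise ValueError(f"Cannot extend non-string column {id_col}")
                row[id_col] = extension + value if prefix else value + extension
            new_rows.append(row)
        out[name] = new_rows
    return out
```

Because the same column is rewritten in every table that contains it, references such as trips.route_id stay consistent with routes.route_id after the change.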

geometrize_shapes(*, use_utm: bool = False) GeoDataFrame

Given a GTFS shapes DataFrame, convert it to a GeoDataFrame of LineStrings and return the result, which will no longer have the columns 'shape_pt_sequence', 'shape_pt_lon', 'shape_pt_lat', and 'shape_dist_traveled'.

If use_utm, then use local UTM coordinates for the geometries.

geometrize_stops(*, use_utm: bool = False) GeoDataFrame

Given a stops DataFrame, convert it to a GeoPandas GeoDataFrame of Points and return the result, which will no longer have the columns 'stop_lon' and 'stop_lat'.

get_active_services(date: str) list[str]

Given a Feed and a date string in YYYYMMDD format, return the list of service IDs that are active on the date.

get_dates(*, as_date_obj: bool = False) list[str]

Return a list of YYYYMMDD date strings for which the given Feed is valid, which could be the empty list if the Feed has no calendar information.

If as_date_obj, then return datetime.date objects instead.

get_first_week(*, as_date_obj: bool = False) list[str]

Return a list of YYYYMMDD date strings for the first Monday–Sunday week (or initial segment thereof) for which the given Feed is valid. If the feed has no Mondays, then return the empty list.

If as_date_obj, then return date objects, otherwise return date strings.

get_routes(date: str | None = None, time: str | None = None, *, as_gdf: bool = False, use_utm: bool = False, split_directions: bool = False) pd.DataFrame

Return feed.routes or a subset thereof. If a YYYYMMDD date string is given, then restrict routes to only those active on the date. If a HH:MM:SS time string is given, possibly with HH > 23, then restrict routes to only those active during the time.

If as_gdf and feed.shapes is not None, then return a GeoDataFrame with all the columns of feed.routes plus a geometry column of (Multi)LineStrings, each of which represents the corresponding route’s union of trip shapes. The GeoDataFrame will have a local UTM CRS if use_utm; otherwise it will have CRS WGS84. If split_directions and as_gdf, then add the column direction_id and split each route into the union of its direction 0 shapes and the union of its direction 1 shapes. If as_gdf and feed.shapes is None, then raise a ValueError.

get_shapes(*, as_gdf: bool = False, use_utm: bool = False) gpd.DataFrame | None

Get the shapes DataFrame for the given feed, which could be None. If as_gdf, then return it as a GeoDataFrame with a ‘geometry’ column of linestrings and no ‘shape_pt_sequence’, ‘shape_pt_lon’, ‘shape_pt_lat’, ‘shape_dist_traveled’ columns. The GeoDataFrame will have a UTM CRS if use_utm; otherwise it will have a WGS84 CRS.

get_shapes_intersecting_geometry(geometry: sg.base.BaseGeometry, shapes_g: gpd.GeoDataFrame | None = None, *, as_gdf: bool = False) pd.DataFrame | None

If the Feed has no shapes, then return None. Otherwise, return the subset of feed.shapes that contains all shapes that intersect the given Shapely WGS84 geometry, e.g. a Polygon or LineString.

If as_gdf, then return the shapes as a GeoDataFrame. Specifying shapes_g will skip the first step of the algorithm, namely, geometrizing feed.shapes.

get_start_and_end_times(date: str | None = None) list[str]

Return the first departure time and last arrival time (HH:MM:SS time strings) listed in feed.stop_times, respectively. Restrict to the given date (YYYYMMDD string) if specified.

get_stop_times(date: str | None = None) pd.DataFrame

Return feed.stop_times. If a date (YYYYMMDD date string) is given, then subset the result to only those stop times with trips active on the date.

get_stops(date: str | None = None, trip_ids: Iterable[str] | None = None, route_ids: Iterable[str] | None = None, *, in_stations: bool = False, as_gdf: bool = False, use_utm: bool = False) pd.DataFrame

Return feed.stops. If a YYYYMMDD date string is given, then subset to stops active (visited by trips) on that date. If trip IDs are given, then subset further to stops visited by those trips. If route IDs are given, then ignore the trip IDs and subset further to stops visited by those routes. If in_stations, then subset further to stops in stations if station data is available. If as_gdf, then return the result as a GeoDataFrame with a ‘geometry’ column of points instead of ‘stop_lat’ and ‘stop_lon’ columns. The GeoDataFrame will have a UTM CRS if use_utm and a WGS84 CRS otherwise.

get_stops_in_area(area: gpd.GeoDataFrame) pd.DataFrame

Return the subset of feed.stops that contains all stops that lie within the given GeoDataFrame of polygons.

get_trips(date: str | None = None, time: str | None = None, *, as_gdf: bool = False, use_utm: bool = False) pd.DataFrame

Return feed.trips. If date (YYYYMMDD date string) is given then subset the result to trips that start on that date. If a time (HH:MM:SS string, possibly with HH > 23) is given in addition to a date, then further subset the result to trips in service at that time.

If as_gdf and feed.shapes is not None, then return the trips as a GeoDataFrame of LineStrings representing trip shapes. Use a local UTM CRS if use_utm; otherwise use the WGS84 CRS. If as_gdf and feed.shapes is None, then raise a ValueError.

get_week(k: int, *, as_date_obj: bool = False) list[str]

Given a Feed and a positive integer k, return a list of YYYYMMDD date strings corresponding to the kth Monday–Sunday week (or initial segment thereof) for which the Feed is valid. For example, k=1 returns the first Monday–Sunday week (or initial segment thereof). If the Feed does not have k Mondays, then return the empty list.

If as_date_obj, then return datetime.date objects instead.
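
The week selection can be sketched on a plain list of date strings. The function below is a hypothetical stand-in for the Feed method and assumes the input is a sorted list of consecutive YYYYMMDD dates, as one would expect from get_dates().

```python
from datetime import date, datetime, timedelta


def kth_week(dates, k):
    """Return the kth Monday-to-Sunday week (or initial segment thereof)
    within the sorted, consecutive YYYYMMDD date strings given."""
    mondays = [
        i for i, d in enumerate(dates)
        if datetime.strptime(d, "%Y%m%d").weekday() == 0  # Monday is 0
    ]
    if k < 1 or k > len(mondays):
        return []
    start = mondays[k - 1]
    return dates[start:start + 7]


# A hypothetical feed valid from Wednesday 2024-01-03 to Wednesday 2024-01-17
feed_dates = [
    (date(2024, 1, 3) + timedelta(days=i)).strftime("%Y%m%d") for i in range(15)
]
```

In this example the first week starts on Monday 20240108 and has seven dates, the second week is the three-date initial segment starting on 20240115, and there is no third week.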

list_fields(table: str | None = None) pd.DataFrame

Return a DataFrame describing all the fields of the GTFS tables in the given feed or in the given table if specified.

The resulting DataFrame has the following columns.

  • 'table': name of the GTFS table, e.g. 'stops'

  • 'column': name of a column in the table, e.g. 'stop_id'

  • 'num_values': number of values in the column

  • 'num_nonnull_values': number of nonnull values in the column

  • 'num_unique_values': number of unique values in the column, excluding null values

  • 'min_value': minimum value in the column

  • 'max_value': maximum value in the column

If the table is not in the feed, then return an empty DataFrame. If the table is not valid, then raise a ValueError.

locate_trips(date: str, times: list[str]) pd.DataFrame

Return the positions of all trips active on the given date (YYYYMMDD date string) and times (HH:MM:SS time strings, possibly with HH > 23).

Return a DataFrame with the columns

  • 'trip_id'

  • 'route_id'

  • 'direction_id': all NaNs if feed.trips.direction_id is missing

  • 'time'

  • 'rel_dist': number between 0 (start) and 1 (end) indicating the relative distance of the trip along its path

  • 'lon': longitude of trip at given time

  • 'lat': latitude of trip at given time

Assume feed.stop_times has an accurate shape_dist_traveled column.
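
The 'rel_dist' value can be illustrated by interpolating shape_dist_traveled between the stop times that bracket the query time. This sketch is not the library's code: the function name is hypothetical, times are seconds after midnight, and the first stop is assumed to sit at distance 0.

```python
def relative_distance(stop_secs, stop_dists, t):
    """Return the relative distance (0 at the start, 1 at the end) of a trip
    at time t, or None if the trip is not in service at t."""
    if not stop_secs[0] <= t <= stop_secs[-1]:
        return None
    for i in range(len(stop_secs) - 1):
        t0, t1 = stop_secs[i], stop_secs[i + 1]
        if t0 <= t <= t1:
            d0, d1 = stop_dists[i], stop_dists[i + 1]
            d = d0 if t1 == t0 else d0 + (t - t0) / (t1 - t0) * (d1 - d0)
            return d / stop_dists[-1]
    return None
```

The longitude and latitude would then follow by locating the point at that fraction along the trip's shape.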

map_routes(route_ids: Iterable[str] | None = None, route_short_names: Iterable[str] | None = None, color_palette: Iterable[str] = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3'], *, show_stops: bool = False)

Return a Folium map showing the given routes and (optionally) their stops. At least one of route_ids and route_short_names must be given. If both are given, then combine the two into a single set of routes. If any of the given route IDs are not found in the feed, then raise a ValueError.

map_stops(stop_ids: Iterable[str], stop_style: dict = {'color': '#fc8d62', 'fill': 'true', 'fillOpacity': 0.75, 'radius': 8, 'weight': 1})

Return a Folium map showing the given stops of this Feed. If some of the given stop IDs are not found in the feed, then raise a ValueError.

map_trips(trip_ids: Iterable[str], color_palette: list[str] = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3'], *, show_stops: bool = False, show_direction: bool = False)

Return a Folium map showing the given trips and (optionally) their stops. If any of the given trip IDs are not found in the feed, then raise a ValueError. If show_direction, then use the Folium plugin PolyLineTextPath to draw arrows on each trip polyline indicating its direction of travel; this fails to work in some browsers, such as Brave 0.68.132.

name_stop_patterns() pd.DataFrame

For each (route ID, direction ID) pair, find the distinct stop patterns of its trips, and assign them each an integer pattern rank based on the stop pattern’s frequency rank, where 1 is the most frequent stop pattern, 2 is the second most frequent, etc. Return the DataFrame feed.trips with the additional column stop_pattern_name, which equals the trip’s ‘direction_id’ concatenated with a dash and its stop pattern rank.

If feed.trips has no ‘direction_id’ column, then temporarily create one equal to all zeros, proceed as above, then delete the column.
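
The naming scheme can be sketched on plain Python data. The function below is a hypothetical stand-in that takes trips as dictionaries with an ordered tuple of stop IDs; breaking frequency ties by first occurrence is an assumption, since the description above does not say how ties are ranked.

```python
from collections import Counter, defaultdict


def name_patterns(trips):
    """trips: list of dicts with keys 'trip_id', 'route_id', 'direction_id',
    and 'stops' (tuple of stop IDs in visiting order).
    Return a dict mapping each trip ID to its stop pattern name."""
    counts = defaultdict(Counter)
    for t in trips:
        counts[(t["route_id"], t["direction_id"])][t["stops"]] += 1
    names = {}
    for t in trips:
        key = (t["route_id"], t["direction_id"])
        # most_common() orders patterns from most to least frequent
        ranked = [pattern for pattern, _ in counts[key].most_common()]
        rank = ranked.index(t["stops"]) + 1
        names[t["trip_id"]] = f"{t['direction_id']}-{rank}"
    return names
```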

restrict_to_agencies(agency_ids: list[str]) Feed

Build a new feed by restricting this one via restrict_to_routes() and the routes with the given agency IDs. Return the resulting feed.

restrict_to_area(area: gpd.GeoDataFrame) Feed

Build a new feed by restricting this one via restrict_to_trips() and the trips that have at least one stop intersecting the given GeoDataFrame of polygons. Return the resulting feed.

restrict_to_dates(dates: list[str]) Feed

Build a new feed by restricting this one via restrict_to_trips() and the trips active on at least one of the given dates (YYYYMMDD strings). Return the resulting feed.

restrict_to_routes(route_ids: list[str]) Feed

Build a new feed by restricting this one via restrict_to_trips() and the trips with the given route IDs. Return the resulting feed.

restrict_to_trips(trip_ids: list[str]) Feed

Build a new feed by restricting this one to only the stops, trips, shapes, etc. used by the trips of the given IDs. Return the resulting feed.

If no valid trip IDs are given, which includes the case of the empty list, then the resulting feed will have all empty non-agency tables.

This function is probably more useful internally than externally.

routes_to_geojson(route_ids: Iterable[str | None] = None, *, split_directions: bool = False, include_stops: bool = False) dict

Return a GeoJSON FeatureCollection of MultiLineString features representing this Feed’s routes. The coordinates reference system is the default one for GeoJSON, namely WGS84.

If include_stops, then include the route stops as Point features. If an iterable of route IDs is given, then subset to those routes. If the subset is empty, then return a FeatureCollection with an empty list of features. If the Feed has no shapes, then raise a ValueError. If any of the given route IDs are not found in the feed, then raise a ValueError.

shapes_to_geojson(shape_ids: Iterable[str] | None = None) dict

Return a GeoJSON FeatureCollection of LineString features representing feed.shapes. If the Feed has no shapes, then the features will be an empty list. The coordinates reference system is the default one for GeoJSON, namely WGS84.

If an iterable of shape IDs is given, then subset to those shapes. If the subset is empty, then return a FeatureCollection with an empty list of features.

stop_times_to_geojson(trip_ids: Iterable[str | None] = None) dict

Return a GeoJSON FeatureCollection of Point features representing all the trip-stop pairs in feed.stop_times. The coordinates reference system is the default one for GeoJSON, namely WGS84.

For every trip, drop duplicate stop IDs within that trip. In particular, a looping trip will lack its final stop.

If an iterable of trip IDs is given, then subset to those trips. If some of the given trip IDs are not found in the feed, then raise a ValueError.

stops_to_geojson(stop_ids: Iterable[str | None] = None) dict

Return a GeoJSON FeatureCollection of Point features representing all the stops in feed.stops. The coordinates reference system is the default one for GeoJSON, namely WGS84.

If an iterable of stop IDs is given, then subset to those stops. If some of the given stop IDs are not found in the feed, then raise a ValueError.

subset_dates(dates: list[str]) list[str]

Given a Feed and a list of YYYYMMDD date strings, return the sublist of dates that lie in the Feed’s dates (the output of feed.get_dates()).

to_file(path: Path, ndigits: int | None = None) None

Write this Feed to the given path. If the path ends in ‘.zip’, then write the feed as a zip archive. Otherwise assume the path is a directory, and write the feed as a collection of CSV files to that directory, creating the directory if it does not exist. Round all decimals to ndigits decimal places, if given. All distances will be in the distance units feed.dist_units. By the way, 6 decimal degrees of latitude and longitude is enough to locate an individual cat.

trips_to_geojson(trip_ids: Iterable[str] | None = None, *, include_stops: bool = False) dict

Return a GeoJSON FeatureCollection of LineString features representing all the Feed’s trips. The coordinates reference system is the default one for GeoJSON, namely WGS84.

If include_stops, then include the trip stops as Point features. If an iterable of trip IDs is given, then subset to those trips. If any of the given trip IDs are not found in the feed, then raise a ValueError. If the Feed has no shapes, then raise a ValueError.

ungeometrize_stops() DataFrame

The inverse of geometrize_stops().

If stops_g is in UTM coordinates (has a UTM CRS property), then convert those UTM coordinates back to WGS84 coordinates, which is the standard for a GTFS stops table.

gtfs_kit.feed.list_feed(path: Path) DataFrame

Given a path (string or Path object) to a GTFS zip file or directory, record the file names and file sizes of the contents, and return the result in a DataFrame with the columns:

  • 'file_name'

  • 'file_size'

gtfs_kit.feed.read_feed(path_or_url: Path | str, dist_units: str) Feed

Create a Feed instance from the given path or URL and given distance units. If the path exists, then call _read_feed_from_path(). Else if the URL has OK status according to Requests, then call _read_feed_from_url(). Else raise a ValueError.

Notes:

  • Ignore non-GTFS files in the feed

  • Automatically strip whitespace from the column names in GTFS files

Indices and tables