GTFS Kit 11.0.0 Documentation¶

Introduction¶

GTFS Kit is a Python library for analyzing General Transit Feed Specification (GTFS) data in memory without a database. It uses Pandas and GeoPandas to do the heavy lifting.

Authors¶

Alex Raichev, 2019-09

Installation¶

Install it from PyPI with UV, say, via uv add gtfs_kit.

Examples¶

In the Jupyter notebook notebooks/examples.ipynb.

Conventions¶

In conformance with GTFS, dates are encoded as YYYYMMDD date strings, and times are encoded as HH:MM:SS time strings with the possibility that HH > 24. Watch out for that possibility, because it has counterintuitive consequences; see e.g. trips.is_active_trip(), which is used in routes.compute_route_stats(), stops.compute_stop_stats(), and miscellany.compute_network_stats().
‘DataFrame’ and ‘Series’ refer to Pandas DataFrame and Series objects, respectively

Module constants¶

Constants useful across modules.

gtfs_kit.constants.COLORS_SET2 = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3']¶: Colorbrewer 8-class Set2 colors

gtfs_kit.constants.DIST_UNITS = ['ft', 'mi', 'm', 'km']¶: Valid distance units

gtfs_kit.constants.DTYPES = {'agency': {'agency_email': 'string', 'agency_fare_url': 'string', 'agency_id': 'string', 'agency_lang': 'string', 'agency_name': 'string', 'agency_phone': 'string', 'agency_timezone': 'string', 'agency_url': 'string'}, 'attributions': {'agency_id': 'string', 'attribution_email': 'string', 'attribution_id': 'string', 'attribution_phone': 'string', 'attribution_url': 'string', 'is_authority': 'Int32', 'is_operator': 'Int32', 'is_producer': 'Int32', 'organization_name': 'string', 'route_id': 'string', 'trip_id': 'string'}, 'calendar': {'end_date': 'string', 'friday': 'Int32', 'monday': 'Int32', 'saturday': 'Int32', 'service_id': 'string', 'start_date': 'string', 'sunday': 'Int32', 'thursday': 'Int32', 'tuesday': 'Int32', 'wednesday': 'Int32'}, 'calendar_dates': {'date': 'string', 'exception_type': 'Int32', 'service_id': 'string'}, 'fare_attributes': {'currency_type': 'string', 'fare_id': 'string', 'payment_method': 'Int32', 'price': 'float', 'transfer_duration': 'Int16', 'transfers': 'Int32'}, 'fare_rules': {'contains_id': 'string', 'destination_id': 'string', 'fare_id': 'string', 'origin_id': 'string', 'route_id': 'string'}, 'feed_info': {'feed_end_date': 'string', 'feed_lang': 'string', 'feed_publisher_name': 'string', 'feed_publisher_url': 'string', 'feed_start_date': 'string', 'feed_version': 'string'}, 'frequencies': {'end_time': 'string', 'exact_times': 'Int32', 'headway_secs': 'Int16', 'start_time': 'string', 'trip_id': 'string'}, 'routes': {'agency_id': 'string', 'route_color': 'string', 'route_desc': 'string', 'route_id': 'string', 'route_long_name': 'string', 'route_short_name': 'string', 'route_text_color': 'string', 'route_type': 'Int32', 'route_url': 'string'}, 'shapes': {'shape_dist_traveled': 'float', 'shape_id': 'string', 'shape_pt_lat': 'float', 'shape_pt_lon': 'float', 'shape_pt_sequence': 'Int32'}, 'stop_times': {'arrival_time': 'string', 'departure_time': 'string', 'drop_off_type': 'Int32', 'pickup_type': 'Int32', 'shape_dist_traveled': 'float', 'stop_headsign': 'string', 'stop_id': 'string', 'stop_sequence': 'Int32', 'timepoint': 'Int32', 'trip_id': 'string'}, 'stops': {'location_type': 'Int32', 'parent_station': 'string', 'stop_code': 'string', 'stop_desc': 'string', 'stop_id': 'string', 'stop_lat': 'float', 'stop_lon': 'float', 'stop_name': 'string', 'stop_timezone': 'string', 'stop_url': 'string', 'wheelchair_boarding': 'Int32', 'zone_id': 'string'}, 'transfers': {'from_stop_id': 'string', 'min_transfer_time': 'Int16', 'to_stop_id': 'string', 'transfer_type': 'Int32'}, 'trips': {'bikes_allowed': 'Int32', 'block_id': 'string', 'direction_id': 'Int32', 'route_id': 'string', 'service_id': 'string', 'shape_id': 'string', 'trip_headsign': 'string', 'trip_id': 'string', 'trip_short_name': 'string', 'wheelchair_accessible': 'Int32'}}¶: GTFS data types

gtfs_kit.constants.FEED_ATTRS = ['agency', 'attributions', 'calendar', 'calendar_dates', 'fare_attributes', 'fare_rules', 'feed_info', 'frequencies', 'routes', 'shapes', 'stops', 'stop_times', 'trips', 'transfers', 'dist_units']¶: Feed attributes

gtfs_kit.constants.WGS84 = 'EPSG:4326'¶: WGS84 coordinate reference system for Geopandas

Module helpers¶

Functions useful across modules.

gtfs_kit.helpers.almost_equal(f: DataFrame, g: DataFrame) → bool¶: Return True if and only if the given DataFrames are equal after sorting their columns names, sorting their values, and reseting their indices.

gtfs_kit.helpers.combine_time_series(series_by_indicator: dict[str, DataFrame], *, kind: Literal['route', 'stop'], split_directions: bool = False) → DataFrame¶

Combine a dict of wide time series (one DataFrame per indicator, columns are entities) into a single long-form time series with columns

'datetime'
'route_id' or 'stop_id': depending on kind
'direction_id': present if and only if split_directions
one column per indicator provided in series_by_indicator
'service_speed': if both service_distance and service_duration present

If split_directions, then assume the original time series contains data separated by trip direction; otherwise, assume not. The separation is indicated by a suffix '-0' (direction 0) or '-1' (direction 1) in the route ID or stop ID column values.

gtfs_kit.helpers.date_to_datestr(x: date | None, format_str: str = '%Y%m%d') → str | None¶: Convert a datetime.date to a formatted string. Return None if x is None.

gtfs_kit.helpers.datestr_to_date(x: str | None, format_str: str = '%Y%m%d') → date | None¶: Convert a date string to a datetime.date. Return None if x is None.

gtfs_kit.helpers.downsample(time_series: DataFrame, freq: str) → DataFrame¶

Downsample the given stop, route, or network time series, (outputs of stops.compute_stop_time_series(), routes.compute_route_time_series(), or miscellany.compute_network_time_series(), respectively) to the given frequency (Pandas frequency string, e.g. ‘15Min’).

Return the given time series unchanged if the given frequency is finer than the original frequency or the given frequncy is courser than a day.

gtfs_kit.helpers.drop_feature_ids(collection: dict) → dict¶: Given a GeoJSON FeatureCollection, remove the 'id' attribute of each Feature, if it exists.

gtfs_kit.helpers.get_active_trips_df(trip_times: DataFrame) → Series¶

Count the number of trips in trip_times that are active at any given time.

Assume trip_times contains the columns

start_time: start time of the trip in seconds past midnight
end_time: end time of the trip in seconds past midnight

Return a Series whose index is times from midnight when trips start and end and whose values are the number of active trips for that time.

gtfs_kit.helpers.get_convert_dist(dist_units_in: str, dist_units_out: str) → Callable[[float], float]¶

Return a function of the form

distance in the units dist_units_in -> distance in the units dist_units_out

Only supports distance units in constants.DIST_UNITS.

gtfs_kit.helpers.get_max_runs(x) → array¶

Given a list of numbers, return a NumPy array of pairs (start index, end index + 1) of the runs of max value.

Example:

>>> get_max_runs([7, 1, 2, 7, 7, 1, 2])
array([[0, 1],
       [3, 5]])

Assume x is not empty. Recipe comes from Stack Overflow.

gtfs_kit.helpers.get_peak_indices(times: list, counts: list) → array¶

Given an increasing list of times as seconds past midnight and a list of trip counts at those respective times, return a pair of indices i, j such that times[i] to times[j] is the first longest time period such that for all i <= x < j, counts[x] is the max of counts. Assume times and counts have the same nonzero length.

Examples:

>>> times = [0, 10, 20, 30, 31, 32, 40]
>>> counts = [7, 1, 2, 7, 7, 1, 2]
>>> get_peak_indices(times, counts)
array([0, 1])

>>> counts = [0, 0, 0]
>>> times = [18000, 21600, 28800]
>>> get_peak_indices(times, counts)
array([0, 3])

gtfs_kit.helpers.get_segment_length(linestring: LineString, p: Point, q: Point | None = None) → float¶: Given a Shapely linestring and two Shapely points, project the points onto the linestring, and return the distance along the linestring between the two points. If q is None, then return the distance from the start of the linestring to the projection of p. The distance is measured in the native coordinates of the linestring.

gtfs_kit.helpers.is_metric(dist_units: str) → bool¶: Return True if the given distance units equals ‘m’ or ‘km’; otherwise return False.

gtfs_kit.helpers.is_not_null(df: DataFrame, col_name: str) → bool¶: Return True if the given DataFrame has a column of the given name (string), and there exists at least one non-NaN value in that column; return False otherwise.

gtfs_kit.helpers.longest_subsequence(seq, mode='strictly', order='increasing', key=None, *, index=False)¶

Return the longest increasing subsequence of seq.

Parameters:

seq (sequence object) – Can be any sequence, like str, list, numpy.array.
mode ({'strict', 'strictly', 'weak', 'weakly'}, optional) – If set to ‘strict’, the subsequence will contain unique elements. Using ‘weak’ an element can be repeated many times. Modes ending in -ly serve as a convenience to use with order parameter, because longest_sequence(seq, ‘weakly’, ‘increasing’) reads better. The default is ‘strict’.
order ({'increasing', 'decreasing'}, optional) – By default return the longest increasing subsequence, but it is possible to return the longest decreasing sequence as well.
key (function, optional) – Specifies a function of one argument that is used to extract a comparison key from each list element (e.g., str.lower, lambda x: x[0]). The default value is None (compare the elements directly).
index (bool, optional) – If set to True, return the indices of the subsequence, otherwise return the elements. Default is False.

Returns:

elements (list, optional) – A list of elements of the longest subsequence. Returned by default and when index is set to False.
indices (list, optional) – A list of indices pointing to elements in the longest subsequence. Returned when index is set to True.
Taken from this Stack Overflow answer.

gtfs_kit.helpers.make_html(d: dict) → str¶: Convert the given dictionary into an HTML table (string) with two columns: keys of dictionary, values of dictionary.

gtfs_kit.helpers.make_ids(n: int, prefix: str = 'id_')¶

Return a length n list of unique sequentially labelled strings for use as IDs.

Example:

>>> make_ids(11, prefix="s")
['s00', s01', 's02', 's03', 's04', 's05', 's06', 's07', 's08', 's09', 's10']

gtfs_kit.helpers.replace_date(f: DataFrame, date: str) → DataFrame¶: Given a table with a datetime object column called ‘datetime’ and given a YYYYMMDD date string, replace the datetime dates with the given date and return the resulting table.

gtfs_kit.helpers.seconds_to_timestr(x: int, *, mod24: bool = False) → str | np.nan¶: The inverse of timestr_to_seconds(). If mod24, then first take the number of seconds modulo 24*3600. Return NAN in case of bad inputs.

gtfs_kit.helpers.timestr_mod24(timestr: str) → int | np.nan¶: Given a GTFS HH:MM:SS time string, return a timestring in the same format but with the hours taken modulo 24. Return NAN in case of bad inputes

gtfs_kit.helpers.timestr_to_seconds(x: str, *, mod24: bool = False) → int | np.nan¶: Given an HH:MM:SS time string x, return the number of seconds past midnight that it represents. In keeping with GTFS standards, the hours entry may be greater than 23. If mod24, then return the number of seconds modulo 24*3600. Return np.nan in case of bad inputs.

Module cleaners¶

Functions about cleaning feeds.

gtfs_kit.cleaners.aggregate_routes(feed: Feed, by: str = 'route_short_name', route_id_prefix: str = 'route_') → Feed¶

Aggregate routes by route short name, say, and assign new route IDs using the given prefix.

More specifically, create new route IDs with the function build_aggregate_routes_dict() and the parameters by and route_id_prefix and update the old route IDs to the new ones in all the relevant Feed tables. Return the resulting Feed.

gtfs_kit.cleaners.aggregate_stops(feed: Feed, by: str = 'stop_code', stop_id_prefix: str = 'stop_') → Feed¶

Aggregate stops by stop code, say, and assign new stop IDs using the given prefix.

More specifically, create new stop IDs with the function build_aggregate_stops_dict() and the parameters by and stop_id_prefix and update the old stop IDs to the new ones in all the relevant Feed tables. Return the resulting Feed.

gtfs_kit.cleaners.build_aggregate_routes_dict(routes: DataFrame, by: str = 'route_short_name', route_id_prefix: str = 'route_') → dict[str, str]¶

Given a DataFrame of routes, group the routes by route short name, say, and assign new route IDs using the given prefix. Return a dictionary of the form <old route ID> -> <new route ID>. Helper function for aggregate_routes().

More specifically, group routes by the by column, and for each group make one new route ID for all the old route IDs in that group based on the given route_id_prefix string and a running count, e.g. 'route_013'.

gtfs_kit.cleaners.build_aggregate_stops_dict(stops: DataFrame, by: str = 'stop_code', stop_id_prefix: str = 'stop_') → dict[str, str]¶

Given a DataFrame of stops, group the stops by stop code, say, and assign new stop IDs using the given prefix. Return a dictionary of the form <old stop ID> -> <new stop ID>. Helper function for aggregate_stops().

More specifically, group stops by the by column, and for each group make one new stop ID for all the old stops IDs in that group based on the given stop_id_prefix string and a running count, e.g. 'stop_013'.

gtfs_kit.cleaners.clean(feed: Feed) → Feed¶

Apply the following functions to the given Feed in order and return the resulting Feed.

clean_ids()
clean_times()
clean_route_short_names()
drop_zombies()

gtfs_kit.cleaners.clean_column_names(df: DataFrame) → DataFrame¶: Strip the whitespace from all column names in the given DataFrame and return the result.

gtfs_kit.cleaners.clean_ids(feed: Feed) → Feed¶: In the given Feed, strip whitespace from all string IDs and then replace every remaining whitespace chunk with an underscore. Return the resulting Feed.

gtfs_kit.cleaners.clean_route_short_names(feed: Feed) → Feed¶: In feed.routes, assign ‘n/a’ to missing route short names and strip whitespace from route short names. Then disambiguate each route short name that is duplicated by appending ‘-’ and its route ID. Return the resulting Feed.

gtfs_kit.cleaners.clean_times(feed: Feed) → Feed¶: In the given Feed, convert H:MM:SS time strings to HH:MM:SS time strings to make sorting by time work as expected. Return the resulting Feed.

gtfs_kit.cleaners.drop_invalid_columns(feed: Feed) → Feed¶: Drop all DataFrame columns of the given Feed that are not listed in the GTFS. Return the resulting Feed.

gtfs_kit.cleaners.drop_zombies(feed: Feed) → Feed¶

In the given Feed, do the following in order and return the resulting Feed.

Drop agencies with no routes.
Drop stops of location type 0 or NaN with no stop times.
Remove undefined parent stations from the parent_station column.
Drop trips with no stop times.
Drop shapes with no trips.
Drop routes with no trips.
Drop services with no trips.

gtfs_kit.cleaners.extend_id(feed: Feed, id_col: str, extension: str, *, prefix=True) → Feed¶

Add a prefix (if prefix) or a suffix (otherwise) to all values of column id_col across all tables of this Feed. This can be helpful when preparing to merge multiple GTFS feeds with colliding route IDs, say.

Raises a ValueError if id_col values are strings, e.g. if id_col is ‘direction_id’.

Module calendar¶

Functions about calendar and calendar_dates.

gtfs_kit.calendar.get_dates(feed: Feed, *, as_date_obj: bool = False) → list[str]¶

Return a list of YYYYMMDD date strings for which the given Feed is valid, which could be the empty list if the Feed has no calendar information.

If as_date_obj, then return datetime.date objects instead.

gtfs_kit.calendar.get_first_week(feed: Feed, *, as_date_obj: bool = False) → list[str]¶

Return a list of YYYYMMDD date strings for the first Monday–Sunday week (or initial segment thereof) for which the given Feed is valid. If the feed has no Mondays, then return the empty list.

If as_date_obj, then return date objects, otherwise return date strings.

gtfs_kit.calendar.get_week(feed: Feed, k: int, *, as_date_obj: bool = False) → list[str]¶

Given a Feed and a positive integer k, return a list of YYYYMMDD date strings corresponding to the kth Monday–Sunday week (or initial segment thereof) for which the Feed is valid. For example, k=1 returns the first Monday–Sunday week (or initial segment thereof). If the Feed does not have k Mondays, then return the empty list.

If as_date_obj, then return datetime.date objects instead.

gtfs_kit.calendar.subset_dates(feed: Feed, dates: list[str]) → list[str]¶: Given a Feed and a list of YYYYMMDD date strings, return the sorted sublist of dates that lie in the Feed’s dates (the output feed.get_dates()). Could be an empty list.

Module routes¶

Functions about routes.

gtfs_kit.routes.build_route_timetable(feed: Feed, route_id: str, dates: list[str]) → pd.DataFrame¶

Return a timetable for the given route and dates (YYYYMMDD date strings).

Return a DataFrame with whose columns are all those in feed.trips plus those in feed.stop_times plus 'date'. The trip IDs are restricted to the given route ID. The result is sorted first by date and then by grouping by trip ID and sorting the groups by their first departure time.

Skip dates outside of the Feed’s dates.

If there is no route activity on the given dates, then return an empty DataFrame.

gtfs_kit.routes.compute_route_stats(feed: Feed, dates: list[str], trip_stats: pd.DataFrame | None = None, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) → pd.DataFrame¶

Compute route stats for all the trips that lie in the given subset of trip stats, which defaults to feed.compute_trip_stats(), and that start on the given dates (YYYYMMDD date strings).

If split_directions, then separate the stats by trip direction (0 or 1). Use the headway start and end times to specify the time period for computing headway stats.

Return a DataFrame with the columns

'date'
'route_id'
'route_short_name'
'route_type'
'direction_id': present if only if split_directions
'num_trips': number of trips on the route in the subset
'num_trip_starts': number of trips on the route with nonnull start times
'num_trip_ends': number of trips on the route with nonnull end times that end before 23:59:59
'num_stop_patterns': number of stop pattern across trips
'is_loop': 1 if at least one of the trips on the route has its is_loop field equal to 1; 0 otherwise
'is_bidirectional': 1 if the route has trips in both directions; 0 otherwise; present if only if not split_directions
'start_time': start time of the earliest trip on the route
'end_time': end time of latest trip on the route
'max_headway': maximum of the durations (in minutes) between trip starts on the route between headway_start_time and headway_end_time on the given dates
'min_headway': minimum of the durations (in minutes) mentioned above
'mean_headway': mean of the durations (in minutes) mentioned above
'peak_num_trips': maximum number of simultaneous trips in service (for the given direction, or for both directions when split_directions==False)
'peak_start_time': start time of first longest period during which the peak number of trips occurs
'peak_end_time': end time of first longest period during which the peak number of trips occurs
'service_duration': total of the duration of each trip on the route in the given subset of trips; measured in hours
'service_distance': total of the distance traveled by each trip on the route in the given subset of trips; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_speed': service_distance/service_duration when defined; 0 otherwise
'mean_trip_distance': service_distance/num_trips
'mean_trip_duration': service_duration/num_trips

Exclude dates with no active trips, which could yield the empty DataFrame.

If not split_directions, then compute each route’s stats, except for headways, using its trips running in both directions. For headways, (1) compute max headway by taking the max of the max headways in both directions; (2) compute mean headway by taking the weighted mean of the mean headways in both directions.

Notes

If you’ve already computed trip stats in your workflow, then you should pass that table into this function to speed things up significantly.
The route stats for date d contain stats for trips that start on date d only and ignore trips that start on date d-1 and end on date d.
Raise a ValueError if split_directions and no non-null direction ID values present.

gtfs_kit.routes.compute_route_stats_0(trip_stats: DataFrame, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) → DataFrame¶

Compute stats for the given subset of trips stats of the form output by the function trips.compute_trip_stats(). Ignore trips with zero duration.

If split_directions, then separate the stats by trip direction (0 or 1). Use the headway start and end times to specify the time period for computing headway stats.

Return a DataFrame with the columns

'route_id'
'route_short_name'
'route_type'
'direction_id': present if only if split_directions
'num_trips': number of trips on the route in the subset
'num_trip_starts': number of trips on the route with nonnull start times
'num_trip_ends': number of trips on the route with nonnull end times that end before 23:59:59
'num_stop_patterns': number of stop pattern across trips
'is_loop': 1 if at least one of the trips on the route has its is_loop field equal to 1; 0 otherwise
'is_bidirectional': 1 if the route has trips in both directions; 0 otherwise; present if only if not split_directions
'start_time': start time of the earliest trip on the route
'end_time': end time of latest trip on the route
'max_headway': maximum of the durations (in minutes) between trip starts on the route between headway_start_time and headway_end_time on the given dates
'min_headway': minimum of the durations (in minutes) mentioned above
'mean_headway': mean of the durations (in minutes) mentioned above
'peak_num_trips': maximum number of simultaneous trips in service (for the given direction, or for both directions when split_directions==False)
'peak_start_time': start time of first longest period during which the peak number of trips occurs
'peak_end_time': end time of first longest period during which the peak number of trips occurs
'service_duration': total of the duration of each trip on the route in the given subset of trips; measured in hours
'service_distance': total of the distance traveled by each trip on the route in the given subset of trips; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_speed': service_distance/service_duration when defined; 0 otherwise
'mean_trip_distance': service_distance/num_trips
'mean_trip_duration': service_duration/num_trips

If trip_stats is empty, return an empty DataFrame.

If not split_directions, then compute each route’s stats, except for headways, using its trips running in both directions. For headways, (1) compute max headway by taking the max of the max headways in both directions; (2) compute mean headway by taking the weighted mean of the mean headways in both directions.

Raise a ValueError if split_directions and no non-null direction ID values present.

gtfs_kit.routes.compute_route_time_series(feed: Feed, dates: list[str], trip_stats: pd.DataFrame | None = None, freq: str = 'h', *, split_directions: bool = False) → pd.DataFrame¶

Compute route stats in time series form for the trips that lie in the trip stats subset, which defaults to the output of trips.compute_trip_stats(), and that start on the given dates (YYYYMMDD date strings).

If split_directions, then separate each routes’s stats by trip direction. Specify the time series frequency with a Pandas frequency string, e.g. '5Min'.

Return a time series DataFrame with the following columns.

datetime: datetime object
route_id
direction_id: direction of route; presest if and only if split_directions
num_trips: number of trips in service on the route at any time within the time bin
num_trip_starts: number of trips that start within the time bin
num_trip_ends: number of trips that end within the time bin, ignoring trips that end past midnight
service_distance: sum of the service distance accrued during the time bin across all trips on the route; measured in kilometers if feed.dist_units is metric; otherwise measured in miles;
service_duration: sum of the service duration accrued during the time bin across all trips on the route; measured in hours
service_speed: service_distance/service_duration for the route

Exclude dates that lie outside of the Feed’s date range. If all dates lie outside the Feed’s date range, then return an empty DataFrame.

Notes

If you’ve already computed trip stats in your workflow, then you should pass that table into this function to speed things up significantly.
See the notes for compute_route_time_series_0()
Raise a ValueError if split_directions and no non-null direction ID values present

gtfs_kit.routes.compute_route_time_series_0(trip_stats: DataFrame, date_label: str = '20010101', freq: str = 'h', *, split_directions: bool = False) → DataFrame¶

Compute stats in a 24-hour time series form at the given Pandas frequency for the given subset of trip stats of the form output by the function trips.compute_trip_stats().

If split_directions, then separate each routes’s stats by trip direction. Use the given YYYYMMDD date label as the date in the time series index.

Return a long-format DataFrame with the columns

datetime: datetime object
route_id
direction_id: direction of route; presest if and only if split_directions
num_trips: number of trips in service on the route at any time within the time bin
num_trip_starts: number of trips that start within the time bin
num_trip_ends: number of trips that end within the time bin, ignoring trips that end past midnight
service_distance: sum of the service distance accrued during the time bin across all trips on the route; measured in kilometers if feed.dist_units is metric; otherwise measured in miles;
service_duration: sum of the service duration accrued during the time bin across all trips on the route; measured in hours
service_speed: service_distance/service_duration for the route

Notes

Trips that lack start or end times are ignored, so the the aggregate num_trips across the day could be less than the num_trips column of compute_route_stats_0()
All trip departure times are taken modulo 24 hours. So routes with trips that end past 23:59:59 will have all their stats wrap around to the early morning of the time series, except for their num_trip_ends indicator. Trip endings past 23:59:59 are not binned so that resampling the num_trips indicator works efficiently.
Note that the total number of trips for two consecutive time bins t1 < t2 is the sum of the number of trips in bin t2 plus the number of trip endings in bin t1. Thus we can downsample the num_trips indicator by keeping track of only one extra count, num_trip_ends, and can avoid recording individual trip IDs.
All other indicators are downsampled by summing.
Raise a ValueError if split_directions and no non-null direction ID values present

gtfs_kit.routes.get_routes(feed: Feed, date: str | None = None, time: str | None = None, *, as_gdf: bool = False, use_utm: bool = False, split_directions: bool = False) → pd.DataFrame¶

Return feed.routes or a subset thereof. If a YYYYMMDD date string is given, then restrict routes to only those active on the date. If a HH:MM:SS time string is given, possibly with HH > 23, then restrict routes to only those active during the time.

Given a Feed, return a GeoDataFrame with all the columns of feed.routes plus a geometry column of (Multi)LineStrings, each of which represents the corresponding routes’s shape.

If as_gdf and feed.shapes is not None, then return a GeoDataFrame with all the columns of feed.routes plus a geometry column of (Multi)LineStrings, each of which represents the corresponding routes’s union of trip shapes. The GeoDataFrame will have a local UTM CRS if use_utm; otherwise it will have CRS WGS84. If split_directions and as_gdf, then add the column direction_id and split each route into the union of its direction 0 shapes and the union of its direction 1 shapes. If as_gdf and feed.shapes is None, then raise a ValueError.

gtfs_kit.routes.map_routes(feed: Feed, route_ids: Iterable[str] | None = None, route_short_names: Iterable[str] | None = None, color_palette: Iterable[str] = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3'], *, show_stops: bool = False)¶: Return a Folium map showing the given routes and (optionally) their stops. At least one of route_ids and route_short_names must be given. If both are given, then combine the two into a single set of routes. If any of the given route IDs are not found in the feed, then raise a ValueError.

gtfs_kit.routes.routes_to_geojson(feed: Feed, route_ids: Iterable[str | None] = None, *, split_directions: bool = False, include_stops: bool = False) → dict¶

Return a GeoJSON FeatureCollection of MultiLineString features representing this Feed’s routes. The coordinates reference system is the default one for GeoJSON, namely WGS84.

If include_stops, then include the route stops as Point features . If an iterable of route IDs is given, then subset to those routes. If the subset is empty, then return a FeatureCollection with an empty list of features. If the Feed has no shapes, then raise a ValueError. If any of the given route IDs are not found in the feed, then raise a ValueError.

Module shapes¶

Functions about shapes.

gtfs_kit.shapes.append_dist_to_shapes(feed: Feed) → Feed¶

Calculate and append the optional shape_dist_traveled field in feed.shapes in terms of the distance units feed.dist_units. Return the resulting Feed.

As a benchmark, using this function on this Portland feed produces a shape_dist_traveled column that differs by at most 0.016 km in absolute value from of the original values.

gtfs_kit.shapes.build_geometry_by_shape(feed: Feed, shape_ids: Iterable[str] | None = None, *, use_utm: bool = False) → dict¶: Return a dictionary of the form <shape ID> -> <Shapely LineString representing shape>. If the Feed has no shapes, then return the empty dictionary. If use_utm, then use local UTM coordinates; otherwise, use WGS84 coordinates.

gtfs_kit.shapes.geometrize_shapes(shapes: DataFrame, *, use_utm: bool = False) → GeoDataFrame¶

Given a GTFS shapes DataFrame, convert it to a GeoDataFrame of LineStrings and return the result, which will no longer have the columns 'shape_pt_sequence', 'shape_pt_lon', 'shape_pt_lat', and 'shape_dist_traveled'.

If use_utm, then use local UTM coordinates for the geometries.

gtfs_kit.shapes.get_shapes(feed: Feed, *, as_gdf: bool = False, use_utm: bool = False) → gpd.DataFrame | None¶: Get the shapes DataFrame for the given feed, which could be None. If as_gdf, then return it as GeoDataFrame with a ‘geometry’ column of linestrings and no ‘shape_pt_sequence’, ‘shape_pt_lon’, ‘shape_pt_lat’, ‘shape_dist_traveled’ columns. The GeoDataFrame will have a UTM CRS if use_utm; otherwise it will have a WGS84 CRS.

gtfs_kit.shapes.get_shapes_intersecting_geometry(feed: Feed, geometry: sg.base.BaseGeometry, shapes_g: gpd.GeoDataFrame | None = None, *, as_gdf: bool = False) → pd.DataFrame | None¶

If the Feed has no shapes, then return None. Otherwise, return the subset of feed.shapes that contains all shapes that intersect the given Shapely WGS84 geometry, e.g. a Polygon or LineString.

If as_gdf, then return the shapes as a GeoDataFrame. Specifying shapes_g will skip the first step of the algorithm, namely, geometrizing feed.shapes.

gtfs_kit.shapes.shapes_to_geojson(feed: Feed, shape_ids: Iterable[str] | None = None) → dict¶

Return a GeoJSON FeatureCollection of LineString features representing feed.shapes. If the Feed has no shapes, then the features will be an empty list. The coordinates reference system is the default one for GeoJSON, namely WGS84.

If an iterable of shape IDs is given, then subset to those shapes. If the subset is empty, then return a FeatureCollection with an empty list of features.

gtfs_kit.shapes.split_simple(shapes_g: GeoDataFrame, segmentize_m: float = 5) → GeoDataFrame¶

Given GTFS shapes as a GeoDataFrame of the form output by geometrize_shapes() and possibly in a non-WGS84 CRS, split each non-simple LineString into maximal simple (non-self-intersecting) sub-LineStrings, and leave the simple LineStrings as is. Before splitting, segmentize (with Shapely’s segmentize method) each non-simple LineString L by segmentize_m meters, which also sets the maximum gap size between L’s simple sub-LineStrings.

Return a GeoDataFrame in the CRS of shapes_g with the columns

'shape_id': a unique identifier of the original LineString L
'subshape_id': a unique identifier of a simple sub-LineString S of L
'subshape_sequence': integer; indicates the order of S when joining up all simple sub-LineStrings to form L
'subshape_length_m': the length of S in meters
'cum_length_m': the length S plus the lengths of sub-LineStrings of L that come before S; in meters
'geometry': LineString geometry corresponding to S

Within each ‘shape_id’ group, the subshapes will be sorted increasingly by ‘subshape_sequence’.

Note that by construction, for each given LineString L with k simple subLineStrings S_i, we have the inequalities

length(L) - k * segmentize_m <= sum over i of length(S_i) <= length(L),

where the lengths are expressed in meters.

gtfs_kit.shapes.ungeometrize_shapes(shapes_g: GeoDataFrame) → DataFrame¶

The inverse of geometrize_shapes().

If shapes_g is in UTM coordinates (has a UTM CRS property), then convert those UTM coordinates back to WGS84 coordinates, which is the standard for a GTFS shapes table.

Module stop_times¶

Functions about stop times.

gtfs_kit.stop_times.append_dist_to_stop_times(feed: Feed) → Feed¶

Calculate and append the optional shape_dist_traveled column in feed.stop_times in terms of the distance units feed.dist_units. Trips without shapes will have NaN distances. Return the resulting Feed. Uses feed.shapes, so if that is missing, then return the original feed.

This does not always give accurate results. The algorithm works as follows. Compute the shape_dist_traveled field by using Shapely to measure the distance of a stop along its trip LineString. If for a given trip this process produces a non-monotonically increasing, hence incorrect, list of (cumulative) distances, then fall back to estimating the distances as follows.

Set the first distance to 0, the last to the length of the trip shape, and leave the remaining ones computed above. Choose the longest increasing subsequence of that new set of distances and use them and their corresponding departure times to linearly interpolate the rest of the distances.

gtfs_kit.stop_times.get_start_and_end_times(feed: Feed, date: str | None = None) → list[str]¶: Return the first departure time and last arrival time (HH:MM:SS time strings) listed in feed.stop_times, respectively. Restrict to the given date (YYYYMMDD string) if specified.

gtfs_kit.stop_times.get_stop_times(feed: Feed, date: str | None = None) → pd.DataFrame¶: Return feed.stop_times. If a date (YYYYMMDD date string) is given, then subset the result to only those stop times with trips active on the date.

gtfs_kit.stop_times.stop_times_to_geojson(feed: Feed, trip_ids: Iterable[str | None] = None) → dict¶

Return a GeoJSON FeatureCollection of Point features representing all the trip-stop pairs in feed.stop_times. The coordinates reference system is the default one for GeoJSON, namely WGS84.

For every trip, drop duplicate stop IDs within that trip. In particular, a looping trip will lack its final stop.

If an iterable of trip IDs is given, then subset to those trips. If some of the given trip IDs are not found in the feed, then raise a ValueError.

Module stops¶

Functions about stops.

gtfs_kit.stops.STOP_STYLE = {'color': '#fc8d62', 'fill': 'true', 'fillOpacity': 0.75, 'radius': 8, 'weight': 1}¶: Leaflet circleMarker parameters for mapping stops

gtfs_kit.stops.build_geometry_by_stop(feed: Feed, stop_ids: Iterable[str] | None = None, *, use_utm: bool = False) → dict¶: Return a dictionary of the form <stop ID> -> <Shapely Point representing stop>.

gtfs_kit.stops.build_stop_timetable(feed: Feed, stop_id: str, dates: list[str]) → pd.DataFrame¶

Return a DataFrame containing the timetable for the given stop ID and dates (YYYYMMDD date strings)

Return a DataFrame whose columns are all those in feed.trips plus those in feed.stop_times plus 'date', and the stop IDs are restricted to the given stop ID. The result is sorted by date then departure time.

gtfs_kit.stops.compute_stop_activity(feed: Feed, dates: list[str]) → pd.DataFrame¶

Mark stops as active or inactive on the given dates (YYYYMMDD date strings). A stop is active on a given date if some trips that starts on the date visits the stop (possibly after midnight).

Return a DataFrame with the columns

stop_id
dates[0]: 1 if the stop has at least one trip visiting it on dates[0]; 0 otherwise
dates[1]: 1 if the stop has at least one trip visiting it on dates[1]; 0 otherwise
etc.
dates[-1]: 1 if the stop has at least one trip visiting it on dates[-1]; 0 otherwise

If all dates lie outside the Feed period, then return an empty DataFrame.

gtfs_kit.stops.compute_stop_stats(feed: Feed, dates: list[str], stop_ids: list[str | None] = None, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) → pd.DataFrame¶

Compute stats for all stops for the given dates (YYYYMMDD date strings). Optionally, restrict to the stop IDs given.

If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops. Use the headway start and end times to specify the time period for computing headway stats.

Return a DataFrame with the columns

'date'
'stop_id'
'direction_id': present if and only if split_directions
'num_routes': number of routes visiting the stop (in the given direction) on the date
'num_trips': number of trips visiting stop (in the givin direction) on the date
'max_headway': maximum of the durations (in minutes) between trip departures at the stop between headway_start_time and headway_end_time on the date
'min_headway': minimum of the durations (in minutes) mentioned above
'mean_headway': mean of the durations (in minutes) mentioned above
'start_time': earliest departure time of a trip from this stop on the date
'end_time': latest departure time of a trip from this stop on the date

Exclude dates with no active stops, which could yield the empty DataFrame.

gtfs_kit.stops.compute_stop_stats_0(stop_times_subset: DataFrame, trips_subset: DataFrame, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) → DataFrame¶

Given a subset of a trips DataFrame and a subset of a stop times DataFrame, return a DataFrame that provides summary stats about the stops in the inner join of the two DataFrames.

If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops. Use the headway start and end times to specify the time period for computing headway stats.

Return a DataFrame with the columns

stop_id
direction_id: present if and only if split_directions
num_routes: number of routes visiting stop (in the given direction)
num_trips: number of trips visiting stop (in the givin direction)
max_headway: maximum of the durations (in minutes) between trip departures at the stop between headway_start_time and headway_end_time
min_headway: minimum of the durations (in minutes) mentioned above
mean_headway: mean of the durations (in minutes) mentioned above
start_time: earliest departure time of a trip from this stop
end_time: latest departure time of a trip from this stop

Notes

If trips_subset is empty, then return an empty DataFrame.
Raise a ValueError if split_directions and no non-NaN direction ID values present.

gtfs_kit.stops.compute_stop_time_series(feed: Feed, dates: list[str], stop_ids: list[str | None] = None, freq: str = 'h', *, split_directions: bool = False) → pd.DataFrame¶

Compute time series for the given stops (defaults to all stops in Feed) on the given dates (YYYYMMDD date strings) at the given frequency (Pandas frequency string, e.g. '5Min'). Return a long-format DataFrame with columns

datetime: datetime object for the given date and frequency chunks
stop_id
direction_id: direction of route; presest if and only if split_directions
num_trips: the number of trips that visit the stop in the time bin and have a nonnull departure time from the stop

Exclude dates that lie outside of the Feed’s date range. If all dates lie outside the Feed’s date range, then return an empty DataFrame

If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops.

Notes

The time series is computed at a one-minute frequency, then resampled at the end to the given frequency
Stop times with null departure times are ignored, so the aggregate of num_trips across the day could be less than the num_trips column in compute_stop_stats_0()
All trip departure times are taken modulo 24 hours, so routes with trips that end past 23:59:59 will have all their stats wrap around to the early morning of the time series.
‘num_trips’ should be resampled with how=np.sum
If trips_subset is empty, then return an empty DataFrame
Raise a ValueError if split_directions and no non-null direction ID values present

gtfs_kit.stops.compute_stop_time_series_0(stop_times_subset: DataFrame, trips_subset: DataFrame, freq: str = 'h', date_label: str = '20010101', *, split_directions: bool = False) → DataFrame¶

Compute stop stats in a 24-hour time series form at the given Pandas frequency for stops in the inner join of the given subset of stop times and trips.

If split_directions, then separate each stop’s stats by trip direction. Use the given YYYYMMDD date label as the date in the time series index.

Return a long-format DataFrame with columns

datetime: datetime object for the given date and frequency chunks
stop_id
direction_id: direction of route; presest if and only if split_directions
num_trips: the number of trips that visit the stop in the time bin and have a nonnull departure time from the stop

Notes

The time series is computed at a one-minute frequency, then resampled at the end to the given frequency
Stop times with null departure times are ignored, so the aggregate of num_trips across the day could be less than the num_trips column in compute_stop_stats_0()
All trip departure times are taken modulo 24 hours, so routes with trips that end past 23:59:59 will have all their stats wrap around to the early morning of the time series.
‘num_trips’ should be resampled with how=np.sum
If trips_subset is empty, then return an empty DataFrame
Raise a ValueError if split_directions and no non-null direction ID values present

gtfs_kit.stops.geometrize_stops(stops: DataFrame, *, use_utm: bool = False) → GeoDataFrame¶: Given a stops DataFrame, convert it to a GeoPandas GeoDataFrame of Points and return the result, which will no longer have the columns 'stop_lon' and 'stop_lat'.

gtfs_kit.stops.get_stops(feed: Feed, date: str | None = None, trip_ids: Iterable[str] | None = None, route_ids: Iterable[str] | None = None, *, in_stations: bool = False, as_gdf: bool = False, use_utm: bool = False) → pd.DataFrame¶: Return feed.stops. If a YYYYMMDD date string is given, then subset to stops active (visited by trips) on that date. If trip IDs are given, then subset further to stops visited by those trips. If route IDs are given, then ignore the trip IDs and subset further to stops visited by those routes. If in_stations, then subset further stops in stations if station data is available. If as_gdf, then return the result as a GeoDataFrame with a ‘geometry’ column of points instead of ‘stop_lat’ and ‘stop_lon’ columns. The GeoDataFrame will have a UTM CRS if use_utm and a WGS84 CRS otherwise.

gtfs_kit.stops.get_stops_in_area(feed: Feed, area: gpd.GeoDataFrame) → pd.DataFrame¶: Return the subset of feed.stops that contains all stops that lie within the given GeoDataFrame of polygons.

gtfs_kit.stops.map_stops(feed: Feed, stop_ids: Iterable[str], stop_style: dict = {'color': '#fc8d62', 'fill': 'true', 'fillOpacity': 0.75, 'radius': 8, 'weight': 1})¶: Return a Folium map showing the given stops of this Feed. If some of the given stop IDs are not found in the feed, then raise a ValueError.

gtfs_kit.stops.stops_to_geojson(feed: Feed, stop_ids: Iterable[str | None] = None) → dict¶

Return a GeoJSON FeatureCollection of Point features representing all the stops in feed.stops. The coordinates reference system is the default one for GeoJSON, namely WGS84.

If an iterable of stop IDs is given, then subset to those stops. If some of the given stop IDs are not found in the feed, then raise a ValueError.

gtfs_kit.stops.ungeometrize_stops(stops_g: GeoDataFrame) → DataFrame¶

The inverse of geometrize_stops().

If stops_g is in UTM coordinates (has a UTM CRS property), then convert those UTM coordinates back to WGS84 coordinates, which is the standard for a GTFS shapes table.

Module trips¶

Functions about trips.

gtfs_kit.trips.compute_busiest_date(feed: Feed, dates: list[str]) → str¶: Given a list of dates (YYYYMMDD date strings), return the first date that has the maximum number of active trips.

gtfs_kit.trips.compute_trip_activity(feed: Feed, dates: list[str]) → pd.DataFrame¶

Mark trips as active or inactive on the given dates (YYYYMMDD date strings). Return a table with the columns

'trip_id'
dates[0]: 1 if the trip is active on dates[0]; 0 otherwise
dates[1]: 1 if the trip is active on dates[1]; 0 otherwise
etc.
dates[-1]: 1 if the trip is active on dates[-1]; 0 otherwise

If dates is None or the empty list, then return an empty DataFrame.

gtfs_kit.trips.compute_trip_stats(feed: Feed, route_ids: list[str | None] = None, *, compute_dist_from_shapes: bool = False) → pd.DataFrame¶

Return a DataFrame with the following columns:

'trip_id'
'route_id'
'route_short_name'
'route_type'
'direction_id': NaN if missing from feed
'shape_id': NaN if missing from feed
'stop_pattern_name': output from name_stop_patterns()
'num_stops': number of stops on trip
'start_time': first departure time of the trip
'end_time': last departure time of the trip
'start_stop_id': stop ID of the first stop of the trip
'end_stop_id': stop ID of the last stop of the trip
'is_loop': 1 if the start and end stop are less than 400m apart and 0 otherwise
'distance': distance of the trip; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'duration': duration of the trip in hours
'speed': distance/duration

If feed.stop_times has a shape_dist_traveled column with at least one non-NaN value and compute_dist_from_shapes == False, then use that column to compute the distance column. Else if feed.shapes is not None, then compute the distance column using the shapes and Shapely. Otherwise, set the distances to NaN.

If route IDs are given, then restrict to trips on those routes.

Notes

Assume the following feed attributes are not None:
- feed.trips
- feed.routes
- feed.stop_times
- feed.shapes (optionally)
Calculating trip distances with compute_dist_from_shapes=True seems pretty accurate. For example, calculating trip distances on this Portland feed using compute_dist_from_shapes=False and compute_dist_from_shapes=True, yields a difference of at most 0.83km from the original values.

gtfs_kit.trips.get_active_services(feed: Feed, date: str) → list[str]¶: Given a Feed and a date string in YYYYMMDD format, return the list of service IDs that are active on the date.

gtfs_kit.trips.get_trips(feed: Feed, date: str | None = None, time: str | None = None, *, as_gdf: bool = False, use_utm: bool = False) → pd.DataFrame¶

Return feed.trips. If date (YYYYMMDD date string) is given then subset the result to trips that start on that date. If a time (HH:MM:SS string, possibly with HH > 23) is given in addition to a date, then further subset the result to trips in service at that time.

If as_gdf and feed.shapes is not None, then return the trips as a GeoDataFrame of LineStrings representating trip shapes. Use local UTM CRS if use_utm; otherwise it the WGS84 CRS. If as_gdf and feed.shapes is None, then raise a ValueError.

gtfs_kit.trips.locate_trips(feed: Feed, date: str, times: list[str]) → pd.DataFrame¶

Return the positions of all trips active on the given date (YYYYMMDD date string) and times (HH:MM:SS time strings, possibly with HH > 23).

Return a DataFrame with the columns

'trip_id'
'route_id'
'direction_id': all NaNs if feed.trips.direction_id is missing
'time'
'rel_dist': number between 0 (start) and 1 (end) indicating the relative distance of the trip along its path
'lon': longitude of trip at given time
'lat': latitude of trip at given time

Assume feed.stop_times has an accurate shape_dist_traveled column.

gtfs_kit.trips.map_trips(feed: Feed, trip_ids: Iterable[str], color_palette: list[str] = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3'], *, show_stops: bool = False, show_direction: bool = False)¶: Return a Folium map showing the given trips and (optionally) their stops. If any of the given trip IDs are not found in the feed, then raise a ValueError. If include_direction, then use the Folium plugin PolyLineTextPath to draw arrows on each trip polyline indicating its direction of travel; this fails to work in some browsers, such as Brave 0.68.132.

gtfs_kit.trips.name_stop_patterns(feed: Feed) → pd.DataFrame¶

For each (route ID, direction ID) pair, find the distinct stop patterns of its trips, and assign them each an integer pattern rank based on the stop pattern’s frequency rank, where 1 is the most frequent stop pattern, 2 is the second most frequent, etc. Return the DataFrame feed.trips with the additional column stop_pattern_name, which equals the trip’s ‘direction_id’ concatenated with a dash and its stop pattern rank.

If feed.trips has no ‘direction_id’ column, then temporarily create one equal to all zeros, proceed as above, then delete the column.

gtfs_kit.trips.trips_to_geojson(feed: Feed, trip_ids: Iterable[str] | None = None, *, include_stops: bool = False) → dict¶

Return a GeoJSON FeatureCollection of LineString features representing all the Feed’s trips. The coordinates reference system is the default one for GeoJSON, namely WGS84.

If include_stops, then include the trip stops as Point features. If an iterable of trip IDs is given, then subset to those trips. If any of the given trip IDs are not found in the feed, then raise a ValueError. If the Feed has no shapes, then raise a ValueError.

Module miscellany¶

Functions about miscellany.

gtfs_kit.miscellany.assess_quality(feed: Feed) → pd.DataFrame¶

Return a DataFrame of various feed indicators and values, e.g. number of trips missing shapes.

The resulting DataFrame has the columns

'indicator': string; name of an indicator, e.g. ‘num_routes’
'value': value of the indicator, e.g. 27

This function is odd but useful for seeing roughly how broken a feed is This function is not a GTFS validator.

gtfs_kit.miscellany.compute_bounds(feed: Feed, stop_ids: list[str] | None = None) → np.array¶: Return the bounding box (Numpy array [min longitude, min latitude, max longitude, max latitude]) of the given Feed’s stops or of the subset of stops specified by the given stop IDs.

gtfs_kit.miscellany.compute_centroid(feed: Feed, stop_ids: list[str] | None = None) → sg.Point¶: Return the centroid (Shapely Point) of the convex hull the given Feed’s stops or of the subset of stops specified by the given stop IDs.

gtfs_kit.miscellany.compute_convex_hull(feed: Feed, stop_ids: list[str] | None = None) → sg.Polygon¶: Return a convex hull (Shapely Polygon) representing the convex hull of the given Feed’s stops or of the subset of stops specified by the given stop IDs.

gtfs_kit.miscellany.compute_network_stats(feed: Feed, dates: list[str], trip_stats: pd.DataFrame | None = None, *, split_route_types=False) → pd.DataFrame¶

Compute some network stats for the given subset of trip stats, which defaults to feed.compute_trip_stats(), and for the given dates (YYYYMMDD date stings).

Return a table with the columns

'date'
'route_type' (optional): presest if and only if split_route_types
'num_stops': number of stops active on the date
'num_routes': number of routes active on the date
'num_trips': number of trips that start on the date
'num_trip_starts': number of trips with nonnull start times on the date
'num_trip_ends': number of trips with nonnull start times and nonnull end times on the date, ignoring trips that end after 23:59:59 on the date
'peak_num_trips': maximum number of simultaneous trips in service on the date
'peak_start_time': start time of first longest period during which the peak number of trips occurs on the date
'peak_end_time': end time of first longest period during which the peak number of trips occurs on the date
'service_distance': sum of the service distances for the active routes on the date; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_duration': sum of the service durations for the active routes on the date; measured in hours
'service_speed': service_distance/service_duration on the date

Exclude dates with no active stops, which could yield the empty DataFrame.

The route and trip stats for date d contain stats for trips that start on date d only and ignore trips that start on date d-1 and end on date d.

Notes

If you’ve already computed trip stats in your workflow, then you should pass that table into this function to speed things up significantly.

gtfs_kit.miscellany.compute_network_stats_0(stop_times_subset: DataFrame, trip_stats_subset: DataFrame, *, split_route_types=False) → DataFrame¶

Compute some network stats for the trips common to the given subset of stop times and given subses of trip stats of the form output by the function trips.compute_trip_stats()

Return a DataFrame with the columns

'route_type' (optional): presest if and only if split_route_types
'num_stops': number of stops active on the date
'num_routes': number of routes active on the date
'num_trips': number of trips that start on the date
'num_trip_starts': number of trips with nonnull start times on the date
'num_trip_ends': number of trips with nonnull start times and nonnull end times on the date, ignoring trips that end after 23:59:59 on the date
'peak_num_trips': maximum number of simultaneous trips in service on the date
'peak_start_time': start time of first longest period during which the peak number of trips occurs on the date
'peak_end_time': end time of first longest period during which the peak number of trips occurs on the date
'service_distance': sum of the service distances for the active routes on the date; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_duration': sum of the service durations for the active routes on the date; measured in hours
'service_speed': service_distance/service_duration on the date

Exclude dates with no active stops, which could yield the empty DataFrame.

Helper function for compute_network_stats().

gtfs_kit.miscellany.compute_network_time_series(feed: Feed, dates: list[str], trip_stats: pd.DataFrame | None = None, freq: str = 'h', *, split_route_types: bool = False) → pd.DataFrame¶

Compute some network stats in time series form for the given dates (YYYYMMDD date strings) and trip stats, which defaults to feed.compute_trip_stats(). Use the given Pandas frequency string freq to specify the frequency of the resulting time series, e.g. ‘5Min’. If split_route_types, then split stats by route type; otherwise don’t.

Return a long-form time series table with the columns

'datetime': datetime object
'route_type': integer; present if and only if split_route_types
'num_trips': number of trips in service during during the time period
'num_trip_starts': number of trips with starting during the time period
'num_trip_ends': number of trips ending during the time period, ignoring the trips the end past midnight
'service_distance': distance traveled during the time period by all trips active during the time period; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_duration': duration traveled during the time period by all trips active during the time period; measured in hours
'service_speed': service_distance/service_duration when defined; 0 otherwise

Exclude dates that lie outside of the Feed’s date range. If all the dates given lie outside of the Feed’s date range, then return an empty DataFrame with the specified columns.

Notes

If you’ve already computed trip stats in your workflow, then you should pass that table into this function to speed things up significantly.

gtfs_kit.miscellany.compute_screen_line_counts(feed: Feed, screen_lines: gpd.GeoDataFrame, dates: list[str], segmentize_m: float = 5, *, include_testing_cols: bool = False) → pd.DataFrame¶

Find all the Feed trips active on the given YYYYMMDD dates that intersect the given segment-associated screen lines of the form output by build_screen_lines(). Behind the scenes, use simple sub-LineStrings of the feed (with points separated by at segmentize_m meters) to compute screen line intersections. Using them instead of the Feed shapes avoids miscounting intersections in the case of non-simple (self-intersecting) shapes.

For each trip crossing a screen line, compute the crossing time, crossing direction, etc. and return a DataFrame of results with the columns

'date': the YYYYMMDD date string given
'screen_line_id': ID of a screen line
'trip_id': ID of a trip that crosses the screen line
'shape_id': ID of the trip’s shape
'direction_id': GTFS direction of trip
'route_id'
'route_short_name'
'route_type'
'shape_id'
'crossing_direction': 1 or -1; 1 indicates trip travel from the left side to the right side of the screen line; -1 indicates trip travel in the opposite direction
'crossing_time': time, according to the GTFS schedule, that the trip crosses the screen line
'crossing_dist_m': distance along the trip shape (not subshape) of the crossing; in meters

If include_testing_columns, then include the following extra columns for testing purposes.

'subshape_id': ID of the simple sub-LineString S of the trip’s shape that crosses the screen line
'subshape_length_m': length of S in meters
'from_departure_time': departure time of the trip from the last stop before the screen line
'to_departure_time': departure time of the trip at from the first stop after the screen line
'subshape_dist_frac': proportion of S’s length at which the screen line intersects S

Notes:

Assume the Feed’s stop times DataFrame has an accurate shape_dist_traveled column.
Assume that trips travel in the same direction as their shapes, an assumption that is part of the GTFS.
Assume that the screen line is straight and simple.
The algorithm works as follows
1. Find the Feed’s simple subshapes (computed via shapes.split_simple()) that intersect the screen lines.
2. For each such subshape and screen line, compute the intersection points, the distance of each point along the subshape, aka the crossing distance, and the orientation of the screen line relative to the subshape.
3. Restrict to trips active on the given dates and for each trip associated to an intersecting subshape above, interpolate a trip stop time for the intersection point using the crossing distance, subshape length, cumulative subshape length, and trip stop times.

gtfs_kit.miscellany.convert_dist(feed: Feed, new_dist_units: str) → Feed¶: Convert the distances recorded in the shape_dist_traveled columns of the given Feed to the given distance units. New distance units must lie in constants.DIST_UNITS. Return the resulting Feed.

gtfs_kit.miscellany.create_shapes(feed: Feed, *, all_trips: bool = False) → Feed¶

Given a feed, create a shape for every trip that is missing a shape ID. Do this by connecting the stops on the trip with straight lines. Return the resulting feed which has updated shapes and trips tables.

If all_trips, then create new shapes for all trips by connecting stops, and remove the old shapes.

gtfs_kit.miscellany.describe(feed: Feed, sample_date: str | None = None) → pd.DataFrame¶

Return a DataFrame of various feed indicators and values, e.g. number of routes. Specialize some those indicators to the given YYYYMMDD sample date string, e.g. number of routes active on the date.

The resulting DataFrame has the columns

'indicator': string; name of an indicator, e.g. ‘num_routes’
'value': value of the indicator, e.g. 27

gtfs_kit.miscellany.list_fields(feed: Feed, table: str | None = None) → pd.DataFrame¶

Return a DataFrame describing all the fields of the GTFS tables in the given feed or in the given table if specified.

The resulting DataFrame has the following columns.

'table': name of the GTFS table, e.g. 'stops'
'column': name of a column in the table, e.g. 'stop_id'
'num_values': number of values in the column
'num_nonnull_values': number of nonnull values in the column
'num_unique_values': number of unique values in the column, excluding null values
'min_value': minimum value in the column
'max_value': maximum value in the column

If the table is not in the feed, then return an empty DataFrame If the table is not valid, raise a ValueError

gtfs_kit.miscellany.restrict_to_agencies(feed: Feed, agency_ids: list[str]) → Feed¶: Build a new feed by restricting this one via restrict_to_routes() and the routes with the given agency IDs. Return the resulting feed.

gtfs_kit.miscellany.restrict_to_area(feed: Feed, area: gpd.GeoDataFrame) → Feed¶: Build a new feed by restricting this one via restrict_to_trips() and the trips that have at least one stop intersecting the given GeoDataFrame of polygons. Return the resulting feed.

gtfs_kit.miscellany.restrict_to_dates(feed: Feed, dates: list[str]) → Feed¶: Build a new feed by restricting this one via restrict_to_trips() and the trips active on at least one of the given dates (YYYYMMDD strings). Return the resulting feed.

gtfs_kit.miscellany.restrict_to_routes(feed: Feed, route_ids: list[str]) → Feed¶: Build a new feed by restricting this one via restrict_to_trips() and the trips with the given route IDs. Return the resulting feed.

gtfs_kit.miscellany.restrict_to_trips(feed: Feed, trip_ids: list[str]) → Feed¶

Build a new feed by restricting this one to only the stops, trips, shapes, etc. used by the trips of the given IDs. Return the resulting feed.

If no valid trip IDs are given, which includes the case of the empty list, then the resulting feed will have all empty non-agency tables.

This function is probably more useful internally than externally.

Module feed¶

This module defines a Feed class to represent GTFS feeds. There is an instance attribute for every GTFS table (routes, stops, etc.), which stores the table as a Pandas DataFrame, or as None in case that table is missing.

The Feed class also has heaps of methods: a method to compute route stats, a method to compute screen line counts, validations methods, etc. To ease reading, almost all of these methods are defined in other modules and grouped by theme (routes.py, stops.py, etc.). These methods, or rather functions that operate on feeds, are then imported within the Feed class. This separation of methods unfortunately messes up slightly the Feed class documentation generated by Sphinx, introducing an extra leading feed parameter in the method signatures. Ignore that extra parameter; it refers to the Feed instance, usually called self and usually hidden automatically by Sphinx.

Bases: object

An instance of this class represents a not-necessarily-valid GTFS feed, where GTFS tables are stored as DataFrames. Beware that the stop times DataFrame can be big (several gigabytes), so make sure you have enough memory to handle it.

Primary instance attributes:

dist_units: a string in constants.DIST_UNITS; specifies the distance units of the shape_dist_traveled column values, if present; also effects whether to display trip and route stats in metric or imperial units
agency
stops
routes
trips
stop_times
calendar
calendar_dates
fare_attributes
fare_rules
shapes
frequencies
transfers
feed_info
attributions

There are also a few secondary instance attributes that are derived from the primary attributes and are automatically updated when the primary attributes change. However, for this update to work, you must update the primary attributes like this (good):

feed.trips['route_short_name'] = 'bingo'
feed.trips = feed.trips

and not like this (bad):

feed.trips['route_short_name'] = 'bingo'

The first way ensures that the altered trips DataFrame is saved as the new trips attribute, but the second way does not.

aggregate_routes(by: str = 'route_short_name', route_id_prefix: str = 'route_') → Feed¶

Aggregate routes by route short name, say, and assign new route IDs using the given prefix.

More specifically, create new route IDs with the function build_aggregate_routes_dict() and the parameters by and route_id_prefix and update the old route IDs to the new ones in all the relevant Feed tables. Return the resulting Feed.

aggregate_stops(by: str = 'stop_code', stop_id_prefix: str = 'stop_') → Feed¶

Aggregate stops by stop code, say, and assign new stop IDs using the given prefix.

More specifically, create new stop IDs with the function build_aggregate_stops_dict() and the parameters by and stop_id_prefix and update the old stop IDs to the new ones in all the relevant Feed tables. Return the resulting Feed.

append_dist_to_shapes() → Feed¶

Calculate and append the optional shape_dist_traveled field in feed.shapes in terms of the distance units feed.dist_units. Return the resulting Feed.

As a benchmark, using this function on this Portland feed produces a shape_dist_traveled column that differs by at most 0.016 km in absolute value from of the original values.

append_dist_to_stop_times() → Feed¶

Calculate and append the optional shape_dist_traveled column in feed.stop_times in terms of the distance units feed.dist_units. Trips without shapes will have NaN distances. Return the resulting Feed. Uses feed.shapes, so if that is missing, then return the original feed.

This does not always give accurate results. The algorithm works as follows. Compute the shape_dist_traveled field by using Shapely to measure the distance of a stop along its trip LineString. If for a given trip this process produces a non-monotonically increasing, hence incorrect, list of (cumulative) distances, then fall back to estimating the distances as follows.

Set the first distance to 0, the last to the length of the trip shape, and leave the remaining ones computed above. Choose the longest increasing subsequence of that new set of distances and use them and their corresponding departure times to linearly interpolate the rest of the distances.

assess_quality() → pd.DataFrame¶

Return a DataFrame of various feed indicators and values, e.g. number of trips missing shapes.

The resulting DataFrame has the columns

'indicator': string; name of an indicator, e.g. ‘num_routes’
'value': value of the indicator, e.g. 27

This function is odd but useful for seeing roughly how broken a feed is This function is not a GTFS validator.

build_geometry_by_shape(shape_ids: Iterable[str] | None = None, *, use_utm: bool = False) → dict¶: Return a dictionary of the form <shape ID> -> <Shapely LineString representing shape>. If the Feed has no shapes, then return the empty dictionary. If use_utm, then use local UTM coordinates; otherwise, use WGS84 coordinates.

build_geometry_by_stop(stop_ids: Iterable[str] | None = None, *, use_utm: bool = False) → dict¶: Return a dictionary of the form <stop ID> -> <Shapely Point representing stop>.

build_route_timetable(route_id: str, dates: list[str]) → pd.DataFrame¶

Return a timetable for the given route and dates (YYYYMMDD date strings).

Return a DataFrame with whose columns are all those in feed.trips plus those in feed.stop_times plus 'date'. The trip IDs are restricted to the given route ID. The result is sorted first by date and then by grouping by trip ID and sorting the groups by their first departure time.

Skip dates outside of the Feed’s dates.

If there is no route activity on the given dates, then return an empty DataFrame.

build_stop_timetable(stop_id: str, dates: list[str]) → pd.DataFrame¶

Return a DataFrame containing the timetable for the given stop ID and dates (YYYYMMDD date strings)

Return a DataFrame whose columns are all those in feed.trips plus those in feed.stop_times plus 'date', and the stop IDs are restricted to the given stop ID. The result is sorted by date then departure time.

clean() → Feed¶

Apply the following functions to the given Feed in order and return the resulting Feed.

clean_ids()
clean_times()
clean_route_short_names()
drop_zombies()

clean_ids() → Feed¶: In the given Feed, strip whitespace from all string IDs and then replace every remaining whitespace chunk with an underscore. Return the resulting Feed.

clean_route_short_names() → Feed¶: In feed.routes, assign ‘n/a’ to missing route short names and strip whitespace from route short names. Then disambiguate each route short name that is duplicated by appending ‘-’ and its route ID. Return the resulting Feed.

clean_times() → Feed¶: In the given Feed, convert H:MM:SS time strings to HH:MM:SS time strings to make sorting by time work as expected. Return the resulting Feed.

compute_bounds(stop_ids: list[str] | None = None) → np.array¶: Return the bounding box (Numpy array [min longitude, min latitude, max longitude, max latitude]) of the given Feed’s stops or of the subset of stops specified by the given stop IDs.

compute_busiest_date(dates: list[str]) → str¶: Given a list of dates (YYYYMMDD date strings), return the first date that has the maximum number of active trips.

compute_centroid(stop_ids: list[str] | None = None) → sg.Point¶: Return the centroid (Shapely Point) of the convex hull the given Feed’s stops or of the subset of stops specified by the given stop IDs.

compute_convex_hull(stop_ids: list[str] | None = None) → sg.Polygon¶: Return a convex hull (Shapely Polygon) representing the convex hull of the given Feed’s stops or of the subset of stops specified by the given stop IDs.

compute_network_stats(dates: list[str], trip_stats: pd.DataFrame | None = None, *, split_route_types=False) → pd.DataFrame¶

Compute some network stats for the given subset of trip stats, which defaults to feed.compute_trip_stats(), and for the given dates (YYYYMMDD date stings).

Return a table with the columns

'date'
'route_type' (optional): presest if and only if split_route_types
'num_stops': number of stops active on the date
'num_routes': number of routes active on the date
'num_trips': number of trips that start on the date
'num_trip_starts': number of trips with nonnull start times on the date
'num_trip_ends': number of trips with nonnull start times and nonnull end times on the date, ignoring trips that end after 23:59:59 on the date
'peak_num_trips': maximum number of simultaneous trips in service on the date
'peak_start_time': start time of first longest period during which the peak number of trips occurs on the date
'peak_end_time': end time of first longest period during which the peak number of trips occurs on the date
'service_distance': sum of the service distances for the active routes on the date; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_duration': sum of the service durations for the active routes on the date; measured in hours
'service_speed': service_distance/service_duration on the date

Exclude dates with no active stops, which could yield the empty DataFrame.

The route and trip stats for date d contain stats for trips that start on date d only and ignore trips that start on date d-1 and end on date d.

Notes

If you’ve already computed trip stats in your workflow, then you should pass that table into this function to speed things up significantly.

compute_network_time_series(dates: list[str], trip_stats: pd.DataFrame | None = None, freq: str = 'h', *, split_route_types: bool = False) → pd.DataFrame¶

Compute some network stats in time series form for the given dates (YYYYMMDD date strings) and trip stats, which defaults to feed.compute_trip_stats(). Use the given Pandas frequency string freq to specify the frequency of the resulting time series, e.g. ‘5Min’. If split_route_types, then split stats by route type; otherwise don’t.

Return a long-form time series table with the columns

'datetime': datetime object
'route_type': integer; present if and only if split_route_types
'num_trips': number of trips in service during during the time period
'num_trip_starts': number of trips with starting during the time period
'num_trip_ends': number of trips ending during the time period, ignoring the trips the end past midnight
'service_distance': distance traveled during the time period by all trips active during the time period; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_duration': duration traveled during the time period by all trips active during the time period; measured in hours
'service_speed': service_distance/service_duration when defined; 0 otherwise

Exclude dates that lie outside of the Feed’s date range. If all the dates given lie outside of the Feed’s date range, then return an empty DataFrame with the specified columns.

Notes

If you’ve already computed trip stats in your workflow, then you should pass that table into this function to speed things up significantly.

compute_route_stats(dates: list[str], trip_stats: pd.DataFrame | None = None, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) → pd.DataFrame¶

Compute route stats for all the trips that lie in the given subset of trip stats, which defaults to feed.compute_trip_stats(), and that start on the given dates (YYYYMMDD date strings).

If split_directions, then separate the stats by trip direction (0 or 1). Use the headway start and end times to specify the time period for computing headway stats.

Return a DataFrame with the columns

'date'
'route_id'
'route_short_name'
'route_type'
'direction_id': present if only if split_directions
'num_trips': number of trips on the route in the subset
'num_trip_starts': number of trips on the route with nonnull start times
'num_trip_ends': number of trips on the route with nonnull end times that end before 23:59:59
'num_stop_patterns': number of stop pattern across trips
'is_loop': 1 if at least one of the trips on the route has its is_loop field equal to 1; 0 otherwise
'is_bidirectional': 1 if the route has trips in both directions; 0 otherwise; present if only if not split_directions
'start_time': start time of the earliest trip on the route
'end_time': end time of latest trip on the route
'max_headway': maximum of the durations (in minutes) between trip starts on the route between headway_start_time and headway_end_time on the given dates
'min_headway': minimum of the durations (in minutes) mentioned above
'mean_headway': mean of the durations (in minutes) mentioned above
'peak_num_trips': maximum number of simultaneous trips in service (for the given direction, or for both directions when split_directions==False)
'peak_start_time': start time of first longest period during which the peak number of trips occurs
'peak_end_time': end time of first longest period during which the peak number of trips occurs
'service_duration': total of the duration of each trip on the route in the given subset of trips; measured in hours
'service_distance': total of the distance traveled by each trip on the route in the given subset of trips; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_speed': service_distance/service_duration when defined; 0 otherwise
'mean_trip_distance': service_distance/num_trips
'mean_trip_duration': service_duration/num_trips

Exclude dates with no active trips, which could yield the empty DataFrame.

If not split_directions, then compute each route’s stats, except for headways, using its trips running in both directions. For headways, (1) compute max headway by taking the max of the max headways in both directions; (2) compute mean headway by taking the weighted mean of the mean headways in both directions.

Notes

If you’ve already computed trip stats in your workflow, then you should pass that table into this function to speed things up significantly.
The route stats for date d contain stats for trips that start on date d only and ignore trips that start on date d-1 and end on date d.
Raise a ValueError if split_directions and no non-null direction ID values present.

compute_route_time_series(dates: list[str], trip_stats: pd.DataFrame | None = None, freq: str = 'h', *, split_directions: bool = False) → pd.DataFrame¶

Compute route stats in time series form for the trips that lie in the trip stats subset, which defaults to the output of trips.compute_trip_stats(), and that start on the given dates (YYYYMMDD date strings).

If split_directions, then separate each routes’s stats by trip direction. Specify the time series frequency with a Pandas frequency string, e.g. '5Min'.

Return a time series DataFrame with the following columns.

datetime: datetime object
route_id
direction_id: direction of route; presest if and only if split_directions
num_trips: number of trips in service on the route at any time within the time bin
num_trip_starts: number of trips that start within the time bin
num_trip_ends: number of trips that end within the time bin, ignoring trips that end past midnight
service_distance: sum of the service distance accrued during the time bin across all trips on the route; measured in kilometers if feed.dist_units is metric; otherwise measured in miles;
service_duration: sum of the service duration accrued during the time bin across all trips on the route; measured in hours
service_speed: service_distance/service_duration for the route

Exclude dates that lie outside of the Feed’s date range. If all dates lie outside the Feed’s date range, then return an empty DataFrame.

Notes

If you’ve already computed trip stats in your workflow, then you should pass that table into this function to speed things up significantly.
See the notes for compute_route_time_series_0()
Raise a ValueError if split_directions and no non-null direction ID values present

compute_screen_line_counts(screen_lines: gpd.GeoDataFrame, dates: list[str], segmentize_m: float = 5, *, include_testing_cols: bool = False) → pd.DataFrame¶

Find all the Feed trips active on the given YYYYMMDD dates that intersect the given segment-associated screen lines of the form output by build_screen_lines(). Behind the scenes, use simple sub-LineStrings of the feed (with points separated by at segmentize_m meters) to compute screen line intersections. Using them instead of the Feed shapes avoids miscounting intersections in the case of non-simple (self-intersecting) shapes.

For each trip crossing a screen line, compute the crossing time, crossing direction, etc. and return a DataFrame of results with the columns

'date': the YYYYMMDD date string given
'screen_line_id': ID of a screen line
'trip_id': ID of a trip that crosses the screen line
'shape_id': ID of the trip’s shape
'direction_id': GTFS direction of trip
'route_id'
'route_short_name'
'route_type'
'shape_id'
'crossing_direction': 1 or -1; 1 indicates trip travel from the left side to the right side of the screen line; -1 indicates trip travel in the opposite direction
'crossing_time': time, according to the GTFS schedule, that the trip crosses the screen line
'crossing_dist_m': distance along the trip shape (not subshape) of the crossing; in meters

If include_testing_columns, then include the following extra columns for testing purposes.

'subshape_id': ID of the simple sub-LineString S of the trip’s shape that crosses the screen line
'subshape_length_m': length of S in meters
'from_departure_time': departure time of the trip from the last stop before the screen line
'to_departure_time': departure time of the trip at from the first stop after the screen line
'subshape_dist_frac': proportion of S’s length at which the screen line intersects S

Notes:

Assume the Feed’s stop times DataFrame has an accurate shape_dist_traveled column.
Assume that trips travel in the same direction as their shapes, an assumption that is part of the GTFS.
Assume that the screen line is straight and simple.
The algorithm works as follows
1. Find the Feed’s simple subshapes (computed via shapes.split_simple()) that intersect the screen lines.
2. For each such subshape and screen line, compute the intersection points, the distance of each point along the subshape, aka the crossing distance, and the orientation of the screen line relative to the subshape.
3. Restrict to trips active on the given dates and for each trip associated to an intersecting subshape above, interpolate a trip stop time for the intersection point using the crossing distance, subshape length, cumulative subshape length, and trip stop times.

compute_stop_activity(dates: list[str]) → pd.DataFrame¶

Mark stops as active or inactive on the given dates (YYYYMMDD date strings). A stop is active on a given date if some trips that starts on the date visits the stop (possibly after midnight).

Return a DataFrame with the columns

stop_id
dates[0]: 1 if the stop has at least one trip visiting it on dates[0]; 0 otherwise
dates[1]: 1 if the stop has at least one trip visiting it on dates[1]; 0 otherwise
etc.
dates[-1]: 1 if the stop has at least one trip visiting it on dates[-1]; 0 otherwise

If all dates lie outside the Feed period, then return an empty DataFrame.

compute_stop_stats(dates: list[str], stop_ids: list[str | None] = None, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) → pd.DataFrame¶

Compute stats for all stops for the given dates (YYYYMMDD date strings). Optionally, restrict to the stop IDs given.

If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops. Use the headway start and end times to specify the time period for computing headway stats.

Return a DataFrame with the columns

'date'
'stop_id'
'direction_id': present if and only if split_directions
'num_routes': number of routes visiting the stop (in the given direction) on the date
'num_trips': number of trips visiting stop (in the givin direction) on the date
'max_headway': maximum of the durations (in minutes) between trip departures at the stop between headway_start_time and headway_end_time on the date
'min_headway': minimum of the durations (in minutes) mentioned above
'mean_headway': mean of the durations (in minutes) mentioned above
'start_time': earliest departure time of a trip from this stop on the date
'end_time': latest departure time of a trip from this stop on the date

Exclude dates with no active stops, which could yield the empty DataFrame.

compute_stop_time_series(dates: list[str], stop_ids: list[str | None] = None, freq: str = 'h', *, split_directions: bool = False) → pd.DataFrame¶

Compute time series for the given stops (defaults to all stops in Feed) on the given dates (YYYYMMDD date strings) at the given frequency (Pandas frequency string, e.g. '5Min'). Return a long-format DataFrame with columns

datetime: datetime object for the given date and frequency chunks
stop_id
direction_id: direction of route; presest if and only if split_directions
num_trips: the number of trips that visit the stop in the time bin and have a nonnull departure time from the stop

Exclude dates that lie outside of the Feed’s date range. If all dates lie outside the Feed’s date range, then return an empty DataFrame

If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops.

Notes

The time series is computed at a one-minute frequency, then resampled at the end to the given frequency
Stop times with null departure times are ignored, so the aggregate of num_trips across the day could be less than the num_trips column in compute_stop_stats_0()
All trip departure times are taken modulo 24 hours, so routes with trips that end past 23:59:59 will have all their stats wrap around to the early morning of the time series.
‘num_trips’ should be resampled with how=np.sum
If trips_subset is empty, then return an empty DataFrame
Raise a ValueError if split_directions and no non-null direction ID values present

compute_trip_activity(dates: list[str]) → pd.DataFrame¶

Mark trips as active or inactive on the given dates (YYYYMMDD date strings). Return a table with the columns

'trip_id'
dates[0]: 1 if the trip is active on dates[0]; 0 otherwise
dates[1]: 1 if the trip is active on dates[1]; 0 otherwise
etc.
dates[-1]: 1 if the trip is active on dates[-1]; 0 otherwise

If dates is None or the empty list, then return an empty DataFrame.

compute_trip_stats(route_ids: list[str | None] = None, *, compute_dist_from_shapes: bool = False) → pd.DataFrame¶

Return a DataFrame with the following columns:

'trip_id'
'route_id'
'route_short_name'
'route_type'
'direction_id': NaN if missing from feed
'shape_id': NaN if missing from feed
'stop_pattern_name': output from name_stop_patterns()
'num_stops': number of stops on trip
'start_time': first departure time of the trip
'end_time': last departure time of the trip
'start_stop_id': stop ID of the first stop of the trip
'end_stop_id': stop ID of the last stop of the trip
'is_loop': 1 if the start and end stop are less than 400m apart and 0 otherwise
'distance': distance of the trip; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'duration': duration of the trip in hours
'speed': distance/duration

If feed.stop_times has a shape_dist_traveled column with at least one non-NaN value and compute_dist_from_shapes == False, then use that column to compute the distance column. Else if feed.shapes is not None, then compute the distance column using the shapes and Shapely. Otherwise, set the distances to NaN.

If route IDs are given, then restrict to trips on those routes.

Notes

Assume the following feed attributes are not None:
- feed.trips
- feed.routes
- feed.stop_times
- feed.shapes (optionally)
Calculating trip distances with compute_dist_from_shapes=True seems pretty accurate. For example, calculating trip distances on this Portland feed using compute_dist_from_shapes=False and compute_dist_from_shapes=True, yields a difference of at most 0.83km from the original values.

convert_dist(new_dist_units: str) → Feed¶: Convert the distances recorded in the shape_dist_traveled columns of the given Feed to the given distance units. New distance units must lie in constants.DIST_UNITS. Return the resulting Feed.

copy() → Feed¶: Return a copy of this feed, that is, a feed with all the same attributes.

create_shapes(*, all_trips: bool = False) → Feed¶

Given a feed, create a shape for every trip that is missing a shape ID. Do this by connecting the stops on the trip with straight lines. Return the resulting feed which has updated shapes and trips tables.

If all_trips, then create new shapes for all trips by connecting stops, and remove the old shapes.

describe(sample_date: str | None = None) → pd.DataFrame¶

Return a DataFrame of various feed indicators and values, e.g. number of routes. Specialize some those indicators to the given YYYYMMDD sample date string, e.g. number of routes active on the date.

The resulting DataFrame has the columns

'indicator': string; name of an indicator, e.g. ‘num_routes’
'value': value of the indicator, e.g. 27

property dist_units¶: The distance units of the Feed.

drop_invalid_columns() → Feed¶: Drop all DataFrame columns of the given Feed that are not listed in the GTFS. Return the resulting Feed.

drop_zombies() → Feed¶

In the given Feed, do the following in order and return the resulting Feed.

Drop agencies with no routes.
Drop stops of location type 0 or NaN with no stop times.
Remove undefined parent stations from the parent_station column.
Drop trips with no stop times.
Drop shapes with no trips.
Drop routes with no trips.
Drop services with no trips.

extend_id(id_col: str, extension: str, *, prefix=True) → Feed¶

Add a prefix (if prefix) or a suffix (otherwise) to all values of column id_col across all tables of this Feed. This can be helpful when preparing to merge multiple GTFS feeds with colliding route IDs, say.

Raises a ValueError if id_col values are strings, e.g. if id_col is ‘direction_id’.

geometrize_shapes(*, use_utm: bool = False) → GeoDataFrame¶

Given a GTFS shapes DataFrame, convert it to a GeoDataFrame of LineStrings and return the result, which will no longer have the columns 'shape_pt_sequence', 'shape_pt_lon', 'shape_pt_lat', and 'shape_dist_traveled'.

If use_utm, then use local UTM coordinates for the geometries.

geometrize_stops(*, use_utm: bool = False) → GeoDataFrame¶: Given a stops DataFrame, convert it to a GeoPandas GeoDataFrame of Points and return the result, which will no longer have the columns 'stop_lon' and 'stop_lat'.

get_active_services(date: str) → list[str]¶: Given a Feed and a date string in YYYYMMDD format, return the list of service IDs that are active on the date.

get_dates(*, as_date_obj: bool = False) → list[str]¶

Return a list of YYYYMMDD date strings for which the given Feed is valid, which could be the empty list if the Feed has no calendar information.

If as_date_obj, then return datetime.date objects instead.

get_first_week(*, as_date_obj: bool = False) → list[str]¶

Return a list of YYYYMMDD date strings for the first Monday–Sunday week (or initial segment thereof) for which the given Feed is valid. If the feed has no Mondays, then return the empty list.

If as_date_obj, then return date objects, otherwise return date strings.

get_routes(date: str | None = None, time: str | None = None, *, as_gdf: bool = False, use_utm: bool = False, split_directions: bool = False) → pd.DataFrame¶

Return feed.routes or a subset thereof. If a YYYYMMDD date string is given, then restrict routes to only those active on the date. If a HH:MM:SS time string is given, possibly with HH > 23, then restrict routes to only those active during the time.

Given a Feed, return a GeoDataFrame with all the columns of feed.routes plus a geometry column of (Multi)LineStrings, each of which represents the corresponding routes’s shape.

If as_gdf and feed.shapes is not None, then return a GeoDataFrame with all the columns of feed.routes plus a geometry column of (Multi)LineStrings, each of which represents the corresponding routes’s union of trip shapes. The GeoDataFrame will have a local UTM CRS if use_utm; otherwise it will have CRS WGS84. If split_directions and as_gdf, then add the column direction_id and split each route into the union of its direction 0 shapes and the union of its direction 1 shapes. If as_gdf and feed.shapes is None, then raise a ValueError.

get_shapes(*, as_gdf: bool = False, use_utm: bool = False) → gpd.DataFrame | None¶: Get the shapes DataFrame for the given feed, which could be None. If as_gdf, then return it as GeoDataFrame with a ‘geometry’ column of linestrings and no ‘shape_pt_sequence’, ‘shape_pt_lon’, ‘shape_pt_lat’, ‘shape_dist_traveled’ columns. The GeoDataFrame will have a UTM CRS if use_utm; otherwise it will have a WGS84 CRS.

get_shapes_intersecting_geometry(geometry: sg.base.BaseGeometry, shapes_g: gpd.GeoDataFrame | None = None, *, as_gdf: bool = False) → pd.DataFrame | None¶

If the Feed has no shapes, then return None. Otherwise, return the subset of feed.shapes that contains all shapes that intersect the given Shapely WGS84 geometry, e.g. a Polygon or LineString.

If as_gdf, then return the shapes as a GeoDataFrame. Specifying shapes_g will skip the first step of the algorithm, namely, geometrizing feed.shapes.

get_start_and_end_times(date: str | None = None) → list[str]¶: Return the first departure time and last arrival time (HH:MM:SS time strings) listed in feed.stop_times, respectively. Restrict to the given date (YYYYMMDD string) if specified.

get_stop_times(date: str | None = None) → pd.DataFrame¶: Return feed.stop_times. If a date (YYYYMMDD date string) is given, then subset the result to only those stop times with trips active on the date.

get_stops(date: str | None = None, trip_ids: Iterable[str] | None = None, route_ids: Iterable[str] | None = None, *, in_stations: bool = False, as_gdf: bool = False, use_utm: bool = False) → pd.DataFrame¶: Return feed.stops. If a YYYYMMDD date string is given, then subset to stops active (visited by trips) on that date. If trip IDs are given, then subset further to stops visited by those trips. If route IDs are given, then ignore the trip IDs and subset further to stops visited by those routes. If in_stations, then subset further stops in stations if station data is available. If as_gdf, then return the result as a GeoDataFrame with a ‘geometry’ column of points instead of ‘stop_lat’ and ‘stop_lon’ columns. The GeoDataFrame will have a UTM CRS if use_utm and a WGS84 CRS otherwise.

get_stops_in_area(area: gpd.GeoDataFrame) → pd.DataFrame¶: Return the subset of feed.stops that contains all stops that lie within the given GeoDataFrame of polygons.

get_trips(date: str | None = None, time: str | None = None, *, as_gdf: bool = False, use_utm: bool = False) → pd.DataFrame¶

Return feed.trips. If date (YYYYMMDD date string) is given then subset the result to trips that start on that date. If a time (HH:MM:SS string, possibly with HH > 23) is given in addition to a date, then further subset the result to trips in service at that time.

If as_gdf and feed.shapes is not None, then return the trips as a GeoDataFrame of LineStrings representating trip shapes. Use local UTM CRS if use_utm; otherwise it the WGS84 CRS. If as_gdf and feed.shapes is None, then raise a ValueError.

get_week(k: int, *, as_date_obj: bool = False) → list[str]¶

Given a Feed and a positive integer k, return a list of YYYYMMDD date strings corresponding to the kth Monday–Sunday week (or initial segment thereof) for which the Feed is valid. For example, k=1 returns the first Monday–Sunday week (or initial segment thereof). If the Feed does not have k Mondays, then return the empty list.

If as_date_obj, then return datetime.date objects instead.

list_fields(table: str | None = None) → pd.DataFrame¶

Return a DataFrame describing all the fields of the GTFS tables in the given feed or in the given table if specified.

The resulting DataFrame has the following columns.

'table': name of the GTFS table, e.g. 'stops'
'column': name of a column in the table, e.g. 'stop_id'
'num_values': number of values in the column
'num_nonnull_values': number of nonnull values in the column
'num_unique_values': number of unique values in the column, excluding null values
'min_value': minimum value in the column
'max_value': maximum value in the column

If the table is not in the feed, then return an empty DataFrame If the table is not valid, raise a ValueError

locate_trips(date: str, times: list[str]) → pd.DataFrame¶

Return the positions of all trips active on the given date (YYYYMMDD date string) and times (HH:MM:SS time strings, possibly with HH > 23).

Return a DataFrame with the columns

'trip_id'
'route_id'
'direction_id': all NaNs if feed.trips.direction_id is missing
'time'
'rel_dist': number between 0 (start) and 1 (end) indicating the relative distance of the trip along its path
'lon': longitude of trip at given time
'lat': latitude of trip at given time

Assume feed.stop_times has an accurate shape_dist_traveled column.

map_routes(route_ids: Iterable[str] | None = None, route_short_names: Iterable[str] | None = None, color_palette: Iterable[str] = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3'], *, show_stops: bool = False)¶: Return a Folium map showing the given routes and (optionally) their stops. At least one of route_ids and route_short_names must be given. If both are given, then combine the two into a single set of routes. If any of the given route IDs are not found in the feed, then raise a ValueError.

map_stops(stop_ids: Iterable[str], stop_style: dict = {'color': '#fc8d62', 'fill': 'true', 'fillOpacity': 0.75, 'radius': 8, 'weight': 1})¶: Return a Folium map showing the given stops of this Feed. If some of the given stop IDs are not found in the feed, then raise a ValueError.

map_trips(trip_ids: Iterable[str], color_palette: list[str] = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3'], *, show_stops: bool = False, show_direction: bool = False)¶: Return a Folium map showing the given trips and (optionally) their stops. If any of the given trip IDs are not found in the feed, then raise a ValueError. If include_direction, then use the Folium plugin PolyLineTextPath to draw arrows on each trip polyline indicating its direction of travel; this fails to work in some browsers, such as Brave 0.68.132.

name_stop_patterns() → pd.DataFrame¶

For each (route ID, direction ID) pair, find the distinct stop patterns of its trips, and assign them each an integer pattern rank based on the stop pattern’s frequency rank, where 1 is the most frequent stop pattern, 2 is the second most frequent, etc. Return the DataFrame feed.trips with the additional column stop_pattern_name, which equals the trip’s ‘direction_id’ concatenated with a dash and its stop pattern rank.

If feed.trips has no ‘direction_id’ column, then temporarily create one equal to all zeros, proceed as above, then delete the column.

restrict_to_agencies(agency_ids: list[str]) → Feed¶: Build a new feed by restricting this one via restrict_to_routes() and the routes with the given agency IDs. Return the resulting feed.

restrict_to_area(area: gpd.GeoDataFrame) → Feed¶: Build a new feed by restricting this one via restrict_to_trips() and the trips that have at least one stop intersecting the given GeoDataFrame of polygons. Return the resulting feed.

restrict_to_dates(dates: list[str]) → Feed¶: Build a new feed by restricting this one via restrict_to_trips() and the trips active on at least one of the given dates (YYYYMMDD strings). Return the resulting feed.

restrict_to_routes(route_ids: list[str]) → Feed¶: Build a new feed by restricting this one via restrict_to_trips() and the trips with the given route IDs. Return the resulting feed.

restrict_to_trips(trip_ids: list[str]) → Feed¶

Build a new feed by restricting this one to only the stops, trips, shapes, etc. used by the trips of the given IDs. Return the resulting feed.

If no valid trip IDs are given, which includes the case of the empty list, then the resulting feed will have all empty non-agency tables.

This function is probably more useful internally than externally.

routes_to_geojson(route_ids: Iterable[str | None] = None, *, split_directions: bool = False, include_stops: bool = False) → dict¶

Return a GeoJSON FeatureCollection of MultiLineString features representing this Feed’s routes. The coordinates reference system is the default one for GeoJSON, namely WGS84.

If include_stops, then include the route stops as Point features . If an iterable of route IDs is given, then subset to those routes. If the subset is empty, then return a FeatureCollection with an empty list of features. If the Feed has no shapes, then raise a ValueError. If any of the given route IDs are not found in the feed, then raise a ValueError.

shapes_to_geojson(shape_ids: Iterable[str] | None = None) → dict¶

Return a GeoJSON FeatureCollection of LineString features representing feed.shapes. If the Feed has no shapes, then the features will be an empty list. The coordinates reference system is the default one for GeoJSON, namely WGS84.

If an iterable of shape IDs is given, then subset to those shapes. If the subset is empty, then return a FeatureCollection with an empty list of features.

split_simple(segmentize_m: float = 5) → GeoDataFrame¶

Given GTFS shapes as a GeoDataFrame of the form output by geometrize_shapes() and possibly in a non-WGS84 CRS, split each non-simple LineString into maximal simple (non-self-intersecting) sub-LineStrings, and leave the simple LineStrings as is. Before splitting, segmentize (with Shapely’s segmentize method) each non-simple LineString L by segmentize_m meters, which also sets the maximum gap size between L’s simple sub-LineStrings.

Return a GeoDataFrame in the CRS of shapes_g with the columns

'shape_id': a unique identifier of the original LineString L
'subshape_id': a unique identifier of a simple sub-LineString S of L
'subshape_sequence': integer; indicates the order of S when joining up all simple sub-LineStrings to form L
'subshape_length_m': the length of S in meters
'cum_length_m': the length S plus the lengths of sub-LineStrings of L that come before S; in meters
'geometry': LineString geometry corresponding to S

Within each ‘shape_id’ group, the subshapes will be sorted increasingly by ‘subshape_sequence’.

Note that by construction, for each given LineString L with k simple subLineStrings S_i, we have the inequalities

length(L) - k * segmentize_m <= sum over i of length(S_i) <= length(L),

where the lengths are expressed in meters.

stop_times_to_geojson(trip_ids: Iterable[str | None] = None) → dict¶

Return a GeoJSON FeatureCollection of Point features representing all the trip-stop pairs in feed.stop_times. The coordinates reference system is the default one for GeoJSON, namely WGS84.

For every trip, drop duplicate stop IDs within that trip. In particular, a looping trip will lack its final stop.

If an iterable of trip IDs is given, then subset to those trips. If some of the given trip IDs are not found in the feed, then raise a ValueError.

stops_to_geojson(stop_ids: Iterable[str | None] = None) → dict¶

Return a GeoJSON FeatureCollection of Point features representing all the stops in feed.stops. The coordinates reference system is the default one for GeoJSON, namely WGS84.

If an iterable of stop IDs is given, then subset to those stops. If some of the given stop IDs are not found in the feed, then raise a ValueError.

subset_dates(dates: list[str]) → list[str]¶: Given a Feed and a list of YYYYMMDD date strings, return the sorted sublist of dates that lie in the Feed’s dates (the output feed.get_dates()). Could be an empty list.

to_file(path: Path, ndigits: int | None = None) → None¶: Write this Feed to the given path. If the path ends in ‘.zip’, then write the feed as a zip archive. Otherwise assume the path is a directory, and write the feed as a collection of CSV files to that directory, creating the directory if it does not exist. Round all decimals to ndigits decimal places, if given. All distances will be the distance units feed.dist_units. By the way, 6 decimal degrees of latitude and longitude is enough to locate an individual cat.

trips_to_geojson(trip_ids: Iterable[str] | None = None, *, include_stops: bool = False) → dict¶

Return a GeoJSON FeatureCollection of LineString features representing all the Feed’s trips. The coordinates reference system is the default one for GeoJSON, namely WGS84.

If include_stops, then include the trip stops as Point features. If an iterable of trip IDs is given, then subset to those trips. If any of the given trip IDs are not found in the feed, then raise a ValueError. If the Feed has no shapes, then raise a ValueError.

ungeometrize_stops() → DataFrame¶

The inverse of geometrize_stops().

If stops_g is in UTM coordinates (has a UTM CRS property), then convert those UTM coordinates back to WGS84 coordinates, which is the standard for a GTFS shapes table.

gtfs_kit.feed.list_feed(path: Path) → DataFrame¶

Given a path (string or Path object) to a GTFS zip file or directory, record the file names and file sizes of the contents, and return the result in a DataFrame with the columns:

'file_name'
'file_size'

gtfs_kit.feed.read_feed(path_or_url: Path | str, dist_units: str) → Feed¶

Create a Feed instance from the given path or URL and given distance units. If the path exists, then call _read_feed_from_path(). Else if the URL has OK status according to Requests, then call _read_feed_from_url(). Else raise a ValueError.

Notes:

Ignore non-GTFS files in the feed
Automatically strip whitespace from the column names in GTFS files

GTFS Kit 11.0.0 Documentation¶

Introduction¶

Authors¶

Installation¶

Examples¶

Conventions¶

Module constants¶

Module helpers¶

Module cleaners¶

Module calendar¶

Module routes¶

Module shapes¶

Module stop_times¶

Module stops¶

Module trips¶

Module miscellany¶

Module feed¶

Indices and tables¶

Table of Contents