GTFS Kit 10.1.1 Documentation¶
Introduction¶
GTFS Kit is a Python library for analyzing General Transit Feed Specification (GTFS) data in memory without a database. It uses Pandas and GeoPandas to do the heavy lifting.
Installation¶
Install it from PyPI with UV, say, via uv add gtfs_kit.
Examples¶
See the Jupyter notebook notebooks/examples.ipynb.
Conventions¶
In conformance with GTFS, dates are encoded as YYYYMMDD date strings, and times are encoded as HH:MM:SS time strings with the possibility that HH > 23. Watch out for that possibility, because it has counterintuitive consequences; see e.g. trips.is_active_trip(), which is used in routes.compute_route_stats(), stops.compute_stop_stats(), and miscellany.compute_feed_stats().
'DataFrame' and 'Series' refer to Pandas DataFrame and Series objects, respectively.
Module constants¶
Constants useful across modules.
- gtfs_kit.constants.COLORS_SET2 = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3']¶
Colorbrewer 8-class Set2 colors
- gtfs_kit.constants.DIST_UNITS = ['ft', 'mi', 'm', 'km']¶
Valid distance units
- gtfs_kit.constants.DTYPE = {'agency_email': 'string', 'agency_fare_url': 'string', 'agency_id': 'string', 'agency_lang': 'string', 'agency_name': 'string', 'agency_phone': 'string', 'agency_timezone': 'string', 'agency_url': 'string', 'arrival_time': 'string', 'attribution_email': 'string', 'attribution_id': 'string', 'attribution_phone': 'string', 'attribution_url': 'string', 'bikes_allowed': 'Int8', 'block_id': 'string', 'contains_id': 'string', 'currency_type': 'string', 'date': 'string', 'departure_time': 'string', 'destination_id': 'string', 'direction_id': 'Int8', 'drop_off_type': 'Int8', 'end_date': 'string', 'end_time': 'string', 'exact_times': 'Int8', 'exception_type': 'Int8', 'fare_id': 'string', 'feed_end_date': 'string', 'feed_lang': 'string', 'feed_publisher_name': 'string', 'feed_publisher_url': 'string', 'feed_start_date': 'string', 'feed_version': 'string', 'friday': 'Int8', 'from_stop_id': 'string', 'headway_secs': 'Int16', 'is_authority': 'Int8', 'is_operator': 'Int8', 'is_producer': 'Int8', 'location_type': 'Int8', 'min_transfer_time': 'Int16', 'monday': 'Int8', 'organization_name': 'string', 'origin_id': 'string', 'parent_station': 'string', 'payment_method': 'Int8', 'pickup_type': 'Int8', 'price': 'float', 'route_color': 'string', 'route_desc': 'string', 'route_id': 'string', 'route_long_name': 'string', 'route_short_name': 'string', 'route_text_color': 'string', 'route_type': 'Int8', 'route_url': 'string', 'saturday': 'Int8', 'service_id': 'string', 'shape_dist_traveled': 'float', 'shape_id': 'string', 'shape_pt_lat': 'float', 'shape_pt_lon': 'float', 'shape_pt_sequence': 'Int32', 'start_date': 'string', 'start_time': 'string', 'stop_code': 'string', 'stop_desc': 'string', 'stop_headsign': 'string', 'stop_id': 'string', 'stop_lat': 'float', 'stop_lon': 'float', 'stop_name': 'string', 'stop_sequence': 'Int32', 'stop_timezone': 'string', 'stop_url': 'string', 'sunday': 'Int8', 'thursday': 'Int8', 'timepoint': 'Int8', 'to_stop_id': 'string', 
'transfer_duration': 'Int16', 'transfer_type': 'Int8', 'transfers': 'Int8', 'trip_headsign': 'string', 'trip_id': 'string', 'trip_short_name': 'string', 'tuesday': 'Int8', 'wednesday': 'Int8', 'wheelchair_accessible': 'Int8', 'wheelchair_boarding': 'Int8', 'zone_id': 'string'}¶
Data types for Pandas CSV reads
- gtfs_kit.constants.FEED_ATTRS = ['agency', 'attributions', 'calendar', 'calendar_dates', 'fare_attributes', 'fare_rules', 'feed_info', 'frequencies', 'routes', 'shapes', 'stops', 'stop_times', 'trips', 'transfers', 'dist_units']¶
Primary feed attributes
- gtfs_kit.constants.WGS84 = 'EPSG:4326'¶
WGS84 coordinate reference system for Geopandas
Module helpers¶
Functions useful across modules.
- gtfs_kit.helpers.almost_equal(f: DataFrame, g: DataFrame) bool ¶
Return True if and only if the given DataFrames are equal after sorting their column names, sorting their values, and resetting their indices.
- gtfs_kit.helpers.combine_time_series(time_series_dict: dict[str, DataFrame], kind: str, *, split_directions: bool = False) DataFrame ¶
Combine the time series DataFrames in the given dictionary into one time series DataFrame with hierarchical columns.
- Parameters:
time_series_dict (dictionary) – Has the form string -> time series
kind (string) – 'route' or 'stop'
split_directions (boolean) – If True, then assume the original time series contains data separated by trip direction; otherwise, assume not. The separation is indicated by a suffix '-0' (direction 0) or '-1' (direction 1) in the route ID or stop ID column values.
- Returns:
Columns are hierarchical (multi-index). The top level columns are the keys of the dictionary and the second level columns are 'route_id' and 'direction_id', if kind == 'route', or 'stop_id' and 'direction_id', if kind == 'stop'. If split_directions, then the third column is 'direction_id'; otherwise, there is no 'direction_id' column.
- Return type:
DataFrame
- gtfs_kit.helpers.datestr_to_date(x: date | str, format_str: str = '%Y%m%d', *, inverse: bool = False) str | date ¶
Given a string x representing a date in the given format, convert it to a datetime.date object and return the result. If inverse, then assume that x is a date object and return its corresponding string in the given format.
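The round trip described above can be sketched with the standard library alone. The function name datestr_to_date_sketch is hypothetical and this is not the library's implementation, only an illustration of the documented behaviour.

```python
from datetime import date, datetime


def datestr_to_date_sketch(x, format_str="%Y%m%d", *, inverse=False):
    """Hypothetical stand-in that mirrors the documented behaviour."""
    if inverse:
        # x is assumed to be a datetime.date; format it as a string.
        return x.strftime(format_str)
    # x is assumed to be a string in the given format; parse it to a date.
    return datetime.strptime(x, format_str).date()


d = datestr_to_date_sketch("20240229")
s = datestr_to_date_sketch(d, inverse=True)
```

Here d is a datetime.date for 29 February 2024 and s is the original YYYYMMDD string again.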
- gtfs_kit.helpers.downsample(time_series: DataFrame, freq: str) DataFrame ¶
Downsample the given route, stop, or feed time series (outputs of routes.compute_route_time_series(), stops.compute_stop_time_series(), or miscellany.compute_feed_time_series(), respectively) to the given Pandas frequency string (e.g. '15Min'). Return the given time series unchanged if the given frequency is shorter than the original frequency.
- gtfs_kit.helpers.drop_feature_ids(collection: dict) dict ¶
Given a GeoJSON FeatureCollection, remove the 'id' attribute of each Feature, if it exists.
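As a rough illustration, the operation can be written over plain dictionaries. The name drop_feature_ids_sketch is hypothetical, and returning a copy rather than modifying the input in place is an assumption made here, not something the text above specifies.

```python
def drop_feature_ids_sketch(collection):
    """Hypothetical sketch: return a copy of a GeoJSON FeatureCollection
    whose Features have no top-level 'id' key."""
    features = [
        {key: value for key, value in feature.items() if key != "id"}
        for feature in collection.get("features", [])
    ]
    return {**collection, "features": features}


collection = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": 7, "properties": {"name": "a"}, "geometry": None},
        {"type": "Feature", "properties": {"name": "b"}, "geometry": None},
    ],
}
cleaned = drop_feature_ids_sketch(collection)
```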
- gtfs_kit.helpers.get_active_trips_df(trip_times: DataFrame) Series ¶
Count the number of trips in trip_times that are active at any given time.
Assume trip_times contains the columns
start_time: start time of the trip in seconds past midnight
end_time: end time of the trip in seconds past midnight
Return a Series whose index is times from midnight when trips start and end and whose values are the number of active trips for that time.
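One common way to produce such counts is an event sweep: add one at each start time, subtract one at each end time, and take a running total over the sorted times. The sketch below uses plain (start, end) pairs and a dictionary instead of a DataFrame and a Series. The name count_active_trips and the convention that a trip is no longer active at its own end time are assumptions.

```python
from collections import Counter
from itertools import accumulate


def count_active_trips(trip_times):
    """Hypothetical sketch. trip_times is an iterable of (start_time, end_time)
    pairs in seconds past midnight. Return a dict mapping each start or end
    time to the number of trips active at that time."""
    delta = Counter()
    for start, end in trip_times:
        delta[start] += 1  # a trip becomes active
        delta[end] -= 1    # a trip stops being active
    times = sorted(delta)
    counts = accumulate(delta[t] for t in times)
    return dict(zip(times, counts))


active = count_active_trips([(0, 600), (300, 900), (600, 1200)])
# active == {0: 1, 300: 2, 600: 2, 900: 1, 1200: 0}
```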
- gtfs_kit.helpers.get_convert_dist(dist_units_in: str, dist_units_out: str) Callable[[float], float] ¶
Return a function of the form
distance in the units dist_units_in -> distance in the units dist_units_out
Only supports distance units in constants.DIST_UNITS.
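A converter of this shape can be sketched by routing every unit through metres. The table METERS_PER_UNIT and the name get_convert_dist_sketch are hypothetical, and raising a ValueError for unsupported units is an assumption. The factors themselves are the standard definitions of the foot and the mile.

```python
# Metres per unit for the four units listed in constants.DIST_UNITS.
METERS_PER_UNIT = {"ft": 0.3048, "mi": 1609.344, "m": 1.0, "km": 1000.0}


def get_convert_dist_sketch(dist_units_in, dist_units_out):
    """Hypothetical sketch: return a function converting a distance from
    dist_units_in to dist_units_out."""
    for units in (dist_units_in, dist_units_out):
        if units not in METERS_PER_UNIT:
            raise ValueError(f"Unsupported distance units: {units!r}")
    factor = METERS_PER_UNIT[dist_units_in] / METERS_PER_UNIT[dist_units_out]
    return lambda distance: distance * factor


km_to_mi = get_convert_dist_sketch("km", "mi")
one_mile = km_to_mi(1.609344)  # approximately 1.0
```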
- gtfs_kit.helpers.get_max_runs(x) array ¶
Given a list of numbers, return a NumPy array of pairs (start index, end index + 1) of the runs of max value.
Example:
>>> get_max_runs([7, 1, 2, 7, 7, 1, 2])
array([[0, 1], [3, 5]])
Assume x is not empty. Recipe comes from Stack Overflow.
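For readers who want to see the idea without NumPy, the following pure-Python sketch scans the list once and records each run of the maximum value as a half-open pair. The name max_runs is hypothetical, and it returns a list of lists rather than a NumPy array.

```python
def max_runs(x):
    """Hypothetical sketch: return [start, end + 1] index pairs for each
    run of the maximum value in the nonempty list x."""
    peak = max(x)
    runs = []
    start = None
    for i, value in enumerate(x):
        if value == peak and start is None:
            start = i                 # a new run of the max begins
        elif value != peak and start is not None:
            runs.append([start, i])   # the current run ends just before i
            start = None
    if start is not None:
        runs.append([start, len(x)])  # a run that reaches the end of x
    return runs


runs = max_runs([7, 1, 2, 7, 7, 1, 2])
# runs == [[0, 1], [3, 5]], matching the documented example
```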
- gtfs_kit.helpers.get_peak_indices(times: list, counts: list) array ¶
Given an increasing list of times as seconds past midnight and a list of trip counts at those respective times, return a pair of indices i, j such that times[i] to times[j] is the first longest time period such that for all i <= x < j, counts[x] is the max of counts. Assume times and counts have the same nonzero length.
Examples:
>>> times = [0, 10, 20, 30, 31, 32, 40]
>>> counts = [7, 1, 2, 7, 7, 1, 2]
>>> get_peak_indices(times, counts)
array([0, 1])
>>> counts = [0, 0, 0]
>>> times = [18000, 21600, 28800]
>>> get_peak_indices(times, counts)
array([0, 3])
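The selection rule can be sketched in plain Python: find the runs of the maximum count, measure how long each run lasts, and keep the first run of greatest duration. The name peak_indices is hypothetical. Clamping the end index to the last time when a run reaches the end of the list is an assumption made here so that the second documented example is reproduced.

```python
from itertools import groupby


def peak_indices(times, counts):
    """Hypothetical sketch: return [i, j] for the first longest period
    during which counts equals its maximum."""
    peak = max(counts)
    runs, i = [], 0
    for value, group in groupby(counts):
        n = len(list(group))
        if value == peak:
            runs.append((i, i + n))  # half-open run of the max value
        i += n
    last = len(times) - 1

    def duration(run):
        start, end = run
        # Assumption: a run reaching the end of the list is measured
        # up to the last available time.
        return times[min(end, last)] - times[start]

    # max() returns the first maximal item, so ties go to the earliest run.
    return list(max(runs, key=duration))


first = peak_indices([0, 10, 20, 30, 31, 32, 40], [7, 1, 2, 7, 7, 1, 2])
second = peak_indices([18000, 21600, 28800], [0, 0, 0])
# first == [0, 1] and second == [0, 3], as in the examples above
```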
- gtfs_kit.helpers.get_segment_length(linestring: LineString, p: Point, q: Point | None = None) float ¶
Given a Shapely linestring and two Shapely points, project the points onto the linestring, and return the distance along the linestring between the two points. If q is None, then return the distance from the start of the linestring to the projection of p. The distance is measured in the native coordinates of the linestring.
- gtfs_kit.helpers.is_metric(dist_units: str) bool ¶
Return True if the given distance units equals ‘m’ or ‘km’; otherwise return False.
- gtfs_kit.helpers.is_not_null(df: DataFrame, col_name: str) bool ¶
Return True if the given DataFrame has a column of the given name (string), and there exists at least one non-NaN value in that column; return False otherwise.
- gtfs_kit.helpers.longest_subsequence(seq, mode='strictly', order='increasing', key=None, *, index=False)¶
Return the longest increasing subsequence of seq.
- Parameters:
seq (sequence object) – Can be any sequence, like str, list, numpy.array.
mode ({'strict', 'strictly', 'weak', 'weakly'}, optional) – If set to ‘strict’, the subsequence will contain unique elements. Using ‘weak’ an element can be repeated many times. Modes ending in -ly serve as a convenience to use with the order parameter, because longest_subsequence(seq, ‘weakly’, ‘increasing’) reads better. The default is ‘strictly’.
order ({'increasing', 'decreasing'}, optional) – By default return the longest increasing subsequence, but it is possible to return the longest decreasing sequence as well.
key (function, optional) – Specifies a function of one argument that is used to extract a comparison key from each list element (e.g., str.lower, lambda x: x[0]). The default value is None (compare the elements directly).
index (bool, optional) – If set to True, return the indices of the subsequence, otherwise return the elements. Default is False.
- Returns:
elements (list, optional) – A list of elements of the longest subsequence. Returned by default and when index is set to False.
indices (list, optional) – A list of indices pointing to elements in the longest subsequence. Returned when index is set to True.
Taken from this Stack Overflow answer.
- gtfs_kit.helpers.make_html(d: dict) str ¶
Convert the given dictionary into an HTML table (string) with two columns: keys of dictionary, values of dictionary.
- gtfs_kit.helpers.make_ids(n: int, prefix: str = 'id_')¶
Return a length n list of unique sequentially labelled strings for use as IDs.
Example:
>>> make_ids(11, prefix="s")
['s00', 's01', 's02', 's03', 's04', 's05', 's06', 's07', 's08', 's09', 's10']
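A zero-padded labelling like the one in the example can be sketched as follows. The name make_ids_sketch is hypothetical, and choosing the padding width as the number of digits in n is an assumption that happens to reproduce the documented output for n = 11.

```python
def make_ids_sketch(n, prefix="id_"):
    """Hypothetical sketch: return n unique, sequential, zero-padded IDs."""
    width = len(str(n))  # assumed padding rule
    return [f"{prefix}{i:0{width}d}" for i in range(n)]


ids = make_ids_sketch(11, prefix="s")
```

Zero padding keeps the IDs in numeric order when they are sorted as strings.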
- gtfs_kit.helpers.restack_time_series(unstacked_time_series: DataFrame) DataFrame ¶
Given an unstacked stop, route, or feed time series in the form output by the function unstack_time_series(), restack it into its original time series form.
- gtfs_kit.helpers.timestr_mod24(timestr: str) int ¶
Given a GTFS HH:MM:SS time string, return a timestring in the same format but with the hours taken modulo 24.
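A minimal sketch of this reduction, using only string handling, might look like the following. The name timestr_mod24_sketch is hypothetical and the input is assumed to be a well-formed HH:MM:SS string.

```python
def timestr_mod24_sketch(timestr):
    """Hypothetical sketch: reduce the hours of an HH:MM:SS string modulo 24."""
    hours, minutes, seconds = timestr.split(":")
    return f"{int(hours) % 24:02d}:{minutes}:{seconds}"


wrapped = timestr_mod24_sketch("25:30:00")    # "01:30:00"
unchanged = timestr_mod24_sketch("07:05:09")  # "07:05:09"
```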
- gtfs_kit.helpers.timestr_to_seconds(x: date | str, *, inverse: bool = False, mod24: bool = False) int ¶
Given an HH:MM:SS time string x, return the number of seconds past midnight that it represents. In keeping with GTFS standards, the hours entry may be greater than 23. If mod24, then return the number of seconds modulo 24*3600. If inverse, then do the inverse operation. In this case, if mod24 also, then first take the number of seconds modulo 24*3600.
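The four combinations of inverse and mod24 can be illustrated with a short standard-library sketch. The name timestr_to_seconds_sketch is hypothetical. It assumes well-formed input and does not attempt the error handling a real implementation may have.

```python
def timestr_to_seconds_sketch(x, *, inverse=False, mod24=False):
    """Hypothetical sketch of the documented conversions."""
    day = 24 * 3600
    if not inverse:
        hours, minutes, seconds = (int(part) for part in x.split(":"))
        total = hours * 3600 + minutes * 60 + seconds
        return total % day if mod24 else total
    # Inverse: x is a number of seconds past midnight.
    total = int(x) % day if mod24 else int(x)
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


a = timestr_to_seconds_sketch("25:30:00")                        # 91800
b = timestr_to_seconds_sketch("25:30:00", mod24=True)            # 5400
c = timestr_to_seconds_sketch(91800, inverse=True)               # "25:30:00"
d = timestr_to_seconds_sketch(91800, inverse=True, mod24=True)   # "01:30:00"
```

Working in seconds avoids the pitfalls of comparing time strings whose hours exceed 23.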
- gtfs_kit.helpers.unstack_time_series(time_series: DataFrame) DataFrame ¶
Given a route, stop, or feed time series of the form output by the functions compute_route_time_series(), compute_stop_time_series(), or compute_feed_time_series(), respectively, unstack it to return a DataFrame with the columns:
"datetime"
the columns time_series.columns.names
"value": value at the datetime and other columns
- gtfs_kit.helpers.weekday_to_str(weekday: int | str, *, inverse: bool = False) int | str ¶
Given a weekday number (integer in the range 0, 1, …, 6), return its corresponding weekday name as a lowercase string. Here 0 -> ‘monday’, 1 -> ‘tuesday’, and so on. If inverse, then perform the inverse operation.
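The mapping follows the same Monday = 0 convention as datetime.date.weekday() in the standard library, which the following sketch makes explicit. The names WEEKDAYS and weekday_to_str_sketch are hypothetical.

```python
from datetime import date

WEEKDAYS = [
    "monday", "tuesday", "wednesday", "thursday",
    "friday", "saturday", "sunday",
]


def weekday_to_str_sketch(weekday, *, inverse=False):
    """Hypothetical sketch: map 0..6 to a lowercase weekday name, or back."""
    if inverse:
        return WEEKDAYS.index(weekday)
    return WEEKDAYS[weekday]


# 1 January 2024 was a Monday, so date.weekday() gives 0.
name = weekday_to_str_sketch(date(2024, 1, 1).weekday())
number = weekday_to_str_sketch("sunday", inverse=True)
```

The lowercase names match the weekday columns of the GTFS calendar table, such as 'monday' and 'friday'.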
Module validators¶
Module cleaners¶
Functions about cleaning feeds.
- gtfs_kit.cleaners.aggregate_routes(feed: Feed, by: str = 'route_short_name', route_id_prefix: str = 'route_') Feed ¶
Aggregate routes by route short name, say, and assign new route IDs using the given prefix.
More specifically, create new route IDs with the function build_aggregate_routes_dict() and the parameters by and route_id_prefix and update the old route IDs to the new ones in all the relevant Feed tables. Return the resulting Feed.
- gtfs_kit.cleaners.aggregate_stops(feed: Feed, by: str = 'stop_code', stop_id_prefix: str = 'stop_') Feed ¶
Aggregate stops by stop code, say, and assign new stop IDs using the given prefix.
More specifically, create new stop IDs with the function build_aggregate_stops_dict() and the parameters by and stop_id_prefix and update the old stop IDs to the new ones in all the relevant Feed tables. Return the resulting Feed.
- gtfs_kit.cleaners.build_aggregate_routes_dict(routes: DataFrame, by: str = 'route_short_name', route_id_prefix: str = 'route_') dict[str, str] ¶
Given a DataFrame of routes, group the routes by route short name, say, and assign new route IDs using the given prefix. Return a dictionary of the form <old route ID> -> <new route ID>. Helper function for aggregate_routes().
More specifically, group routes by the by column, and for each group make one new route ID for all the old route IDs in that group based on the given route_id_prefix string and a running count, e.g. 'route_013'.
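The grouping step can be sketched over a list of dictionaries instead of a DataFrame. The name build_aggregate_dict_sketch is hypothetical. Numbering the groups from 0 in sorted key order and padding to the number of digits in the group count are assumptions, since the text above only says the new IDs use the prefix and a running count.

```python
def build_aggregate_dict_sketch(
    routes, by="route_short_name", route_id_prefix="route_"
):
    """Hypothetical sketch: map each old route ID to one new ID per group."""
    groups = {}
    for row in routes:
        groups.setdefault(row[by], []).append(row["route_id"])
    width = len(str(len(groups)))  # assumed zero-padding rule
    mapping = {}
    for count, key in enumerate(sorted(groups)):  # assumed numbering order
        new_id = f"{route_id_prefix}{count:0{width}d}"
        for old_id in groups[key]:
            mapping[old_id] = new_id
    return mapping


routes = [
    {"route_id": "r1", "route_short_name": "10"},
    {"route_id": "r2", "route_short_name": "10"},
    {"route_id": "r3", "route_short_name": "20"},
]
mapping = build_aggregate_dict_sketch(routes)
# mapping == {"r1": "route_0", "r2": "route_0", "r3": "route_1"}
```

Routes r1 and r2 share a short name, so they collapse to a single new ID.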
- gtfs_kit.cleaners.build_aggregate_stops_dict(stops: DataFrame, by: str = 'stop_code', stop_id_prefix: str = 'stop_') dict[str, str] ¶
Given a DataFrame of stops, group the stops by stop code, say, and assign new stop IDs using the given prefix. Return a dictionary of the form <old stop ID> -> <new stop ID>. Helper function for aggregate_stops().
More specifically, group stops by the by column, and for each group make one new stop ID for all the old stop IDs in that group based on the given stop_id_prefix string and a running count, e.g. 'stop_013'.
- gtfs_kit.cleaners.clean(feed: Feed) Feed ¶
Apply the following functions to the given Feed in order and return the resulting Feed.
- gtfs_kit.cleaners.clean_column_names(df: DataFrame) DataFrame ¶
Strip the whitespace from all column names in the given DataFrame and return the result.
- gtfs_kit.cleaners.clean_ids(feed: Feed) Feed ¶
In the given Feed, strip whitespace from all string IDs and then replace every remaining whitespace chunk with an underscore. Return the resulting Feed.
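For a single ID value, the two steps (strip, then replace each remaining whitespace chunk) can be sketched with a regular expression. The helper name clean_id is hypothetical. Applying it across every ID column of every table is left out.

```python
import re


def clean_id(value):
    """Hypothetical sketch: strip outer whitespace, then replace each
    remaining run of whitespace with a single underscore."""
    return re.sub(r"\s+", "_", value.strip())


cleaned = clean_id("  stop  12\tA ")
# cleaned == "stop_12_A"
```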
- gtfs_kit.cleaners.clean_route_short_names(feed: Feed) Feed ¶
In feed.routes, assign ‘n/a’ to missing route short names and strip whitespace from route short names. Then disambiguate each route short name that is duplicated by appending ‘-’ and its route ID. Return the resulting Feed.
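The three steps can be sketched over a list of dictionaries standing in for feed.routes. The name clean_short_names_sketch is hypothetical, and representing a missing short name as None is an assumption of this sketch.

```python
from collections import Counter


def clean_short_names_sketch(routes):
    """Hypothetical sketch. routes is a list of dicts with keys
    'route_id' and 'route_short_name' (None when missing)."""
    # Steps 1 and 2: fill missing names with 'n/a' and strip whitespace.
    names = [(row["route_short_name"] or "n/a").strip() for row in routes]
    # Step 3: append '-' and the route ID to every duplicated name.
    counts = Counter(names)
    return [
        {
            **row,
            "route_short_name": (
                f"{name}-{row['route_id']}" if counts[name] > 1 else name
            ),
        }
        for row, name in zip(routes, names)
    ]


result = clean_short_names_sketch([
    {"route_id": "r1", "route_short_name": " 10 "},
    {"route_id": "r2", "route_short_name": "10"},
    {"route_id": "r3", "route_short_name": None},
])
short_names = [row["route_short_name"] for row in result]
# short_names == ["10-r1", "10-r2", "n/a"]
```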
- gtfs_kit.cleaners.clean_times(feed: Feed) Feed ¶
In the given Feed, convert H:MM:SS time strings to HH:MM:SS time strings to make sorting by time work as expected. Return the resulting Feed.
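The reason this matters for sorting is that time strings are compared character by character, so '7:05:00' sorts after '10:00:00'. A minimal sketch of the padding step, with the hypothetical helper name clean_time, is shown below.

```python
def clean_time(timestr):
    """Hypothetical sketch: left-pad an H:MM:SS string to HH:MM:SS."""
    return timestr.strip().zfill(8)


raw = ["10:00:00", "7:05:00", "9:30:00"]
naive_order = sorted(raw)                        # string order, not time order
clean_order = sorted(clean_time(t) for t in raw)
# naive_order == ["10:00:00", "7:05:00", "9:30:00"]
# clean_order == ["07:05:00", "09:30:00", "10:00:00"]
```

Strings that already have two-digit hours, including ones past 23 such as '25:10:00', pass through unchanged.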
- gtfs_kit.cleaners.drop_invalid_columns(feed: Feed) Feed ¶
Drop all DataFrame columns of the given Feed that are not listed in the GTFS. Return the resulting Feed.
- gtfs_kit.cleaners.drop_zombies(feed: Feed) Feed ¶
In the given Feed, do the following in order and return the resulting Feed.
Drop stops of location type 0 or NaN with no stop times.
Remove undefined parent stations from the parent_station column.
Drop trips with no stop times.
Drop shapes with no trips.
Drop routes with no trips.
Drop services with no trips.
- gtfs_kit.cleaners.extend_id(feed: Feed, id_col: str, extension: str, *, prefix=True) Feed ¶
Add a prefix (if prefix) or a suffix (otherwise) to all values of column id_col across all tables of this Feed. This can be helpful when preparing to merge multiple GTFS feeds with colliding route IDs, say.
Raises a ValueError if id_col values can’t have strings added to them, e.g. if id_col is ‘direction_id’.
Module calendar¶
Functions about calendar and calendar_dates.
- gtfs_kit.calendar.get_dates(feed: Feed, *, as_date_obj: bool = False) list[str] ¶
Return a list of YYYYMMDD date strings for which the given Feed is valid, which could be the empty list if the Feed has no calendar information.
If as_date_obj, then return datetime.date objects instead.
- gtfs_kit.calendar.get_first_week(feed: Feed, *, as_date_obj: bool = False) list[str] ¶
Return a list of YYYYMMDD date strings for the first Monday–Sunday week (or initial segment thereof) for which the given Feed is valid. If the feed has no Mondays, then return the empty list.
If as_date_obj, then return date objects, otherwise return date strings.
- gtfs_kit.calendar.get_week(feed: Feed, k: int, *, as_date_obj: bool = False) list[str] ¶
Given a Feed and a positive integer k, return a list of YYYYMMDD date strings corresponding to the kth Monday–Sunday week (or initial segment thereof) for which the Feed is valid. For example, k=1 returns the first Monday–Sunday week (or initial segment thereof). If the Feed does not have k Mondays, then return the empty list.
If as_date_obj, then return datetime.date objects instead.
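The week selection can be sketched from a plain list of valid dates, without a Feed. The name get_week_sketch is hypothetical. Reading "initial segment" as "stop at the first day of the week that is not a valid date" is an interpretation made for this sketch.

```python
from datetime import date, datetime, timedelta


def get_week_sketch(dates, k):
    """Hypothetical sketch. dates is a list of YYYYMMDD strings on which
    a feed is valid. Return the kth Monday-to-Sunday week, or an initial
    segment of it, as YYYYMMDD strings."""
    parsed = sorted(datetime.strptime(d, "%Y%m%d").date() for d in dates)
    mondays = [d for d in parsed if d.weekday() == 0]
    if k < 1 or len(mondays) < k:
        return []
    valid = set(parsed)
    week = []
    for offset in range(7):
        day = mondays[k - 1] + timedelta(days=offset)
        if day not in valid:
            break  # keep only the initial segment of the week
        week.append(day.strftime("%Y%m%d"))
    return week


# Valid from Wednesday 3 January 2024 to Tuesday 16 January 2024.
start = date(2024, 1, 3)
dates = [(start + timedelta(days=i)).strftime("%Y%m%d") for i in range(14)]
week1 = get_week_sketch(dates, 1)  # Monday 8 January to Sunday 14 January
week2 = get_week_sketch(dates, 2)  # Monday 15 and Tuesday 16 January only
week3 = get_week_sketch(dates, 3)  # no third Monday, so the empty list
```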
Module routes¶
Functions about routes.
- gtfs_kit.routes.build_route_timetable(feed: Feed, route_id: str, dates: list[str]) pd.DataFrame ¶
Return a timetable for the given route and dates (YYYYMMDD date strings).
Return a DataFrame whose columns are all those in feed.trips plus those in feed.stop_times plus 'date'. The trip IDs are restricted to the given route ID. The result is sorted first by date and then by grouping by trip ID and sorting the groups by their first departure time.
Skip dates outside of the Feed’s dates.
If there is no route activity on the given dates, then return an empty DataFrame.
- gtfs_kit.routes.build_zero_route_time_series(feed: Feed, date_label: str = '20010101', freq: str = '5Min', *, split_directions: bool = False) pd.DataFrame ¶
Return a route time series with the same index and hierarchical columns as output by compute_route_time_series_0(), but fill it full of zero values.
- gtfs_kit.routes.compute_route_stats(feed: Feed, trip_stats_subset: pd.DataFrame, dates: list[str], headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) pd.DataFrame ¶
Compute route stats for all the trips that lie in the given subset of trip stats (of the form output by the function trips.compute_trip_stats()) and that start on the given dates (YYYYMMDD date strings).
If split_directions, then separate the stats by trip direction (0 or 1). Use the headway start and end times to specify the time period for computing headway stats.
Return a DataFrame with the columns
'date'
the columns listed in compute_route_stats_0()
Exclude dates with no active trips, which could yield the empty DataFrame.
Notes
The route stats for date d contain stats for trips that start on date d only and ignore trips that start on date d-1 and end on date d.
Raise a ValueError if split_directions and no non-NaN direction ID values are present.
- gtfs_kit.routes.compute_route_stats_0(trip_stats_subset: DataFrame, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) DataFrame ¶
Compute stats for the given subset of trip stats (of the form output by the function trips.compute_trip_stats()).
Ignore trips with zero duration, because they are defunct.
If split_directions, then separate the stats by trip direction (0 or 1). Use the headway start and end times to specify the time period for computing headway stats.
Return a DataFrame with the columns
'route_id'
'route_short_name'
'route_type'
'direction_id'
'num_trips': number of trips on the route in the subset
'num_trip_starts': number of trips on the route with nonnull start times
'num_trip_ends': number of trips on the route with nonnull end times that end before 23:59:59
'num_stop_patterns': number of stop patterns across trips
'is_loop': 1 if at least one of the trips on the route has its is_loop field equal to 1; 0 otherwise
'is_bidirectional': 1 if the route has trips in both directions; 0 otherwise
'start_time': start time of the earliest trip on the route
'end_time': end time of latest trip on the route
'max_headway': maximum of the durations (in minutes) between trip starts on the route between headway_start_time and headway_end_time on the given dates
'min_headway': minimum of the durations (in minutes) mentioned above
'mean_headway': mean of the durations (in minutes) mentioned above
'peak_num_trips': maximum number of simultaneous trips in service (for the given direction, or for both directions when split_directions==False)
'peak_start_time': start time of first longest period during which the peak number of trips occurs
'peak_end_time': end time of first longest period during which the peak number of trips occurs
'service_duration': total of the duration of each trip on the route in the given subset of trips; measured in hours
'service_distance': total of the distance traveled by each trip on the route in the given subset of trips; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_speed': service_distance/service_duration
'mean_trip_distance': service_distance/num_trips
'mean_trip_duration': service_duration/num_trips
If not split_directions, then remove the direction_id column and compute each route’s stats, except for headways, using its trips running in both directions. In this case, (1) compute max headway by taking the max of the max headways in both directions; (2) compute mean headway by taking the weighted mean of the mean headways in both directions.
If trip_stats_subset is empty, return an empty DataFrame.
Raise a ValueError if split_directions and no non-NaN direction ID values are present.
- gtfs_kit.routes.compute_route_time_series(feed: Feed, trip_stats_subset: pd.DataFrame, dates: list[str], freq: str = '5Min', *, split_directions: bool = False) pd.DataFrame ¶
Compute route stats in time series form for the trips that lie in the trip stats subset (of the form output by the function trips.compute_trip_stats()) and that start on the given dates (YYYYMMDD date strings).
If split_directions, then separate each route’s stats by trip direction. Specify the time series frequency with a Pandas frequency string, e.g. '5Min'; max frequency is one minute ('Min').
Return a DataFrame of the same format output by the function compute_route_time_series_0() but with multiple dates.
Exclude dates that lie outside of the Feed’s date range. If all dates lie outside the Feed’s date range, then return an empty DataFrame.
Notes
See the notes for compute_route_time_series_0().
Raise a ValueError if split_directions and no non-NaN direction ID values are present.
- gtfs_kit.routes.compute_route_time_series_0(trip_stats_subset: DataFrame, date_label: str = '20010101', freq: str = '5Min', *, split_directions: bool = False) DataFrame ¶
Compute stats in a 24-hour time series form for the given subset of trips (of the form output by the function trips.compute_trip_stats()).
If split_directions, then separate each route’s stats by trip direction. Set the time series frequency according to the given frequency string; max frequency is one minute ('Min'). Use the given YYYYMMDD date label as the date in the time series index.
Return a DataFrame time series version of the following route stats for each route.
num_trips: number of trips in service on the route at any time within the time bin
num_trip_starts: number of trips that start within the time bin
num_trip_ends: number of trips that end within the time bin, ignoring trips that end past midnight
service_distance: sum of the service distance accrued during the time bin across all trips on the route; measured in kilometers if feed.dist_units is metric; otherwise measured in miles
service_duration: sum of the service duration accrued during the time bin across all trips on the route; measured in hours
service_speed: service_distance/service_duration for the route
The columns are hierarchical (multi-indexed) with
top level: name is 'indicator'; values are 'num_trip_starts', 'num_trip_ends', 'num_trips', 'service_distance', 'service_duration', and 'service_speed'
middle level: name is 'route_id'; values are the active routes
bottom level: name is 'direction_id'; values are 0s and 1s
If not split_directions, then don’t include the bottom level.
The time series has a timestamp index for a 24-hour period sampled at the given frequency. The maximum allowable frequency is 1 minute. If trip_stats_subset is empty, then return an empty DataFrame with the columns 'num_trip_starts', 'num_trip_ends', 'num_trips', 'service_distance', 'service_duration', and 'service_speed'.
Notes
The time series is computed at a one-minute frequency, then resampled at the end to the given frequency.
Trips that lack start or end times are ignored, so the aggregate num_trips across the day could be less than the num_trips column of compute_route_stats_0().
All trip departure times are taken modulo 24 hours. So routes with trips that end past 23:59:59 will have all their stats wrap around to the early morning of the time series, except for their num_trip_ends indicator. Trip endings past 23:59:59 are not binned so that resampling the num_trips indicator works efficiently.
Note that the total number of trips for two consecutive time bins t1 < t2 is the sum of the number of trips in bin t2 plus the number of trip endings in bin t1. Thus we can downsample the num_trips indicator by keeping track of only one extra count, num_trip_ends, and can avoid recording individual trip IDs.
All other indicators are downsampled by summing.
Raise a ValueError if split_directions and no non-NaN direction ID values are present.
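The downsampling identity in the notes can be checked on a toy example. In the sketch below, trips are plain (start, end) pairs in minutes, a bin is a half-open interval [a, b), and a trip counts as in service in a bin if it starts before b and ends at or after a. These bin conventions and the helper names are assumptions chosen so that the identity holds exactly. The library's own binning may differ in boundary details.

```python
def num_trips(trips, a, b):
    """Trips in service at any time within the bin [a, b)."""
    return sum(1 for start, end in trips if start < b and end >= a)


def num_trip_ends(trips, a, b):
    """Trips whose end time falls within the bin [a, b)."""
    return sum(1 for start, end in trips if a <= end < b)


trips = [(0, 4), (2, 7), (6, 9), (12, 15)]

# Two consecutive bins t1 = [0, 5) and t2 = [5, 10), merged into [0, 10).
merged = num_trips(trips, 0, 10)
from_parts = num_trips(trips, 5, 10) + num_trip_ends(trips, 0, 5)
# merged == from_parts == 3
```

A trip in service during the merged bin is either in service during t2 or ended during t1, and these two cases are disjoint, so no trip IDs need to be tracked.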
- gtfs_kit.routes.get_routes(feed: Feed, date: str | None = None, time: str | None = None, *, as_gdf: bool = False, use_utm: bool = False, split_directions: bool = False) pd.DataFrame ¶
Return feed.routes or a subset thereof. If a YYYYMMDD date string is given, then restrict routes to only those active on the date. If a HH:MM:SS time string is given, possibly with HH > 23, then restrict routes to only those active during the time.
If as_gdf and feed.shapes is not None, then return a GeoDataFrame with all the columns of feed.routes plus a geometry column of (Multi)LineStrings, each of which represents the corresponding route’s union of trip shapes. The GeoDataFrame will have a local UTM CRS if use_utm; otherwise it will have CRS WGS84. If split_directions and as_gdf, then add the column direction_id and split each route into the union of its direction 0 shapes and the union of its direction 1 shapes. If as_gdf and feed.shapes is None, then raise a ValueError.
- gtfs_kit.routes.map_routes(feed: Feed, route_ids: Iterable[str] | None = None, route_short_names: Iterable[str] | None = None, color_palette: Iterable[str] = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3'], *, show_stops: bool = False)¶
Return a Folium map showing the given routes and (optionally) their stops. At least one of route_ids and route_short_names must be given. If both are given, then combine the two into a single set of routes. If any of the given route IDs are not found in the feed, then raise a ValueError.
- gtfs_kit.routes.routes_to_geojson(feed: Feed, route_ids: Iterable[str | None] = None, *, split_directions: bool = False, include_stops: bool = False) dict ¶
Return a GeoJSON FeatureCollection of MultiLineString features representing this Feed’s routes. The coordinates reference system is the default one for GeoJSON, namely WGS84.
If include_stops, then include the route stops as Point features. If an iterable of route IDs is given, then subset to those routes. If the subset is empty, then return a FeatureCollection with an empty list of features. If the Feed has no shapes, then raise a ValueError. If any of the given route IDs are not found in the feed, then raise a ValueError.
Module shapes¶
Functions about shapes.
- gtfs_kit.shapes.append_dist_to_shapes(feed: Feed) Feed ¶
Calculate and append the optional shape_dist_traveled field in feed.shapes in terms of the distance units feed.dist_units. Return the resulting Feed.
As a benchmark, using this function on this Portland feed produces a shape_dist_traveled column that differs by at most 0.016 km in absolute value from the original values.
- gtfs_kit.shapes.build_geometry_by_shape(feed: Feed, shape_ids: Iterable[str] | None = None, *, use_utm: bool = False) dict ¶
Return a dictionary of the form <shape ID> -> <Shapely LineString representing shape>. If the Feed has no shapes, then return the empty dictionary. If use_utm, then use local UTM coordinates; otherwise, use WGS84 coordinates.
- gtfs_kit.shapes.geometrize_shapes(shapes: DataFrame, *, use_utm: bool = False) GeoDataFrame ¶
Given a GTFS shapes DataFrame, convert it to a GeoDataFrame of LineStrings and return the result, which will no longer have the columns 'shape_pt_sequence', 'shape_pt_lon', 'shape_pt_lat', and 'shape_dist_traveled'.
If use_utm, then use local UTM coordinates for the geometries.
- gtfs_kit.shapes.get_shapes(feed: Feed, *, as_gdf: bool = False, use_utm: bool = False) gpd.DataFrame | None ¶
Get the shapes DataFrame for the given feed, which could be None. If as_gdf, then return it as a GeoDataFrame with a ‘geometry’ column of linestrings and no ‘shape_pt_sequence’, ‘shape_pt_lon’, ‘shape_pt_lat’, ‘shape_dist_traveled’ columns. The GeoDataFrame will have a UTM CRS if use_utm; otherwise it will have a WGS84 CRS.
- gtfs_kit.shapes.get_shapes_intersecting_geometry(feed: Feed, geometry: sg.base.BaseGeometry, shapes_g: gpd.GeoDataFrame | None = None, *, as_gdf: bool = False) pd.DataFrame | None ¶
If the Feed has no shapes, then return None. Otherwise, return the subset of feed.shapes that contains all shapes that intersect the given Shapely WGS84 geometry, e.g. a Polygon or LineString.
If as_gdf, then return the shapes as a GeoDataFrame. Specifying shapes_g will skip the first step of the algorithm, namely, geometrizing feed.shapes.
- gtfs_kit.shapes.shapes_to_geojson(feed: Feed, shape_ids: Iterable[str] | None = None) dict ¶
Return a GeoJSON FeatureCollection of LineString features representing feed.shapes. If the Feed has no shapes, then the features will be an empty list. The coordinates reference system is the default one for GeoJSON, namely WGS84.
If an iterable of shape IDs is given, then subset to those shapes. If the subset is empty, then return a FeatureCollection with an empty list of features.
- gtfs_kit.shapes.ungeometrize_shapes(shapes_g: GeoDataFrame) DataFrame ¶
The inverse of geometrize_shapes().
If shapes_g is in UTM coordinates (has a UTM CRS property), then convert those UTM coordinates back to WGS84 coordinates, which is the standard for a GTFS shapes table.
Module stop_times¶
Functions about stop times.
- gtfs_kit.stop_times.append_dist_to_stop_times(feed: Feed) Feed ¶
Calculate and append the optional shape_dist_traveled column in feed.stop_times in terms of the distance units feed.dist_units. Trips without shapes will have NaN distances. Return the resulting Feed. Uses feed.shapes, so if that is missing, then return the original feed.
This does not always give accurate results. The algorithm works as follows. Compute the shape_dist_traveled field by using Shapely to measure the distance of a stop along its trip LineString. If for a given trip this process produces a non-monotonically increasing, hence incorrect, list of (cumulative) distances, then fall back to estimating the distances as follows.
Set the first distance to 0, the last to the length of the trip shape, and leave the remaining ones computed above. Choose the longest increasing subsequence of that new set of distances and use them and their corresponding departure times to linearly interpolate the rest of the distances.
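The fallback step can be sketched for a single trip using plain lists. This is an illustration of the described idea, not the library's code. The names longest_increasing_indices and repair_distances are hypothetical. The sketch uses a simple quadratic longest-increasing-subsequence search, assumes strictly increasing departure times in seconds, and, as a further assumption, copies the nearest kept distance when a dropped stop has no kept neighbour on one side.

```python
def longest_increasing_indices(values):
    """Indices of one longest strictly increasing subsequence (O(n^2))."""
    n = len(values)
    best = [1] * n    # best[i]: length of the longest chain ending at i
    prev = [-1] * n   # prev[i]: previous index in that chain
    for i in range(n):
        for j in range(i):
            if values[j] < values[i] and best[j] + 1 > best[i]:
                best[i], prev[i] = best[j] + 1, j
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(i)
        i = prev[i]
    return chain[::-1]


def repair_distances(dists, times, shape_length):
    """Hypothetical sketch of the fallback for one trip."""
    d = [float(x) for x in dists]
    d[0], d[-1] = 0.0, float(shape_length)  # pin the first and last stops
    keep = longest_increasing_indices(d)
    kept = set(keep)
    out = list(d)
    for i in range(len(d)):
        if i in kept:
            continue
        left = max((k for k in keep if k < i), default=None)
        right = min((k for k in keep if k > i), default=None)
        if left is None:
            out[i] = d[right]
        elif right is None:
            out[i] = d[left]
        else:
            # Linear interpolation in departure time between kept neighbours.
            frac = (times[i] - times[left]) / (times[right] - times[left])
            out[i] = d[left] + frac * (d[right] - d[left])
    return out


# The third stop's measured distance (300) breaks monotonicity.
fixed = repair_distances([0, 500, 300, 800, 990], [0, 60, 120, 180, 240], 1000)
# fixed == [0.0, 500.0, 650.0, 800.0, 1000.0]
```

The out-of-order value is replaced by the value halfway between its kept neighbours, because its departure time lies halfway between theirs.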
- gtfs_kit.stop_times.get_start_and_end_times(feed: Feed, date: str | None = None) list[str] ¶
Return the first departure time and last arrival time (HH:MM:SS time strings) listed in
feed.stop_times
, respectively. Restrict to the given date (YYYYMMDD string) if specified.
- gtfs_kit.stop_times.get_stop_times(feed: Feed, date: str | None = None) pd.DataFrame ¶
Return
feed.stop_times
. If a date (YYYYMMDD date string) is given, then subset the result to only those stop times with trips active on the date.
- gtfs_kit.stop_times.stop_times_to_geojson(feed: Feed, trip_ids: Iterable[str | None] = None) dict ¶
Return a GeoJSON FeatureCollection of Point features representing all the trip-stop pairs in
feed.stop_times
. The coordinate reference system is the default one for GeoJSON, namely WGS84.
For every trip, drop duplicate stop IDs within that trip. In particular, a looping trip will lack its final stop.
If an iterable of trip IDs is given, then subset to those trips. If some of the given trip IDs are not found in the feed, then raise a ValueError.
Module stops¶
Functions about stops.
- gtfs_kit.stops.STOP_STYLE = {'color': '#fc8d62', 'fill': 'true', 'fillOpacity': 0.75, 'radius': 8, 'weight': 1}¶
Leaflet circleMarker parameters for mapping stops
- gtfs_kit.stops.build_geometry_by_stop(feed: Feed, stop_ids: Iterable[str] | None = None, *, use_utm: bool = False) dict ¶
Return a dictionary of the form <stop ID> -> <Shapely Point representing stop>.
- gtfs_kit.stops.build_stop_timetable(feed: Feed, stop_id: str, dates: list[str]) pd.DataFrame ¶
Return a DataFrame containing the timetable for the given stop ID and dates (YYYYMMDD date strings).
The DataFrame's columns are all those in
feed.trips
plus those in feed.stop_times
plus 'date'
, and the stop IDs are restricted to the given stop ID. The result is sorted by date then departure time.
- gtfs_kit.stops.build_zero_stop_time_series(feed: Feed, date_label: str = '20010101', freq: str = '5Min', *, split_directions: bool = False) pd.DataFrame ¶
Return a stop time series with the same index and hierarchical columns as output by the function
compute_stop_time_series_0()
, but fill it full of zero values.
- gtfs_kit.stops.compute_stop_activity(feed: Feed, dates: list[str]) pd.DataFrame ¶
Mark stops as active or inactive on the given dates (YYYYMMDD date strings). A stop is active on a given date if some trip that starts on the date visits the stop (possibly after midnight).
Return a DataFrame with the columns
stop_id
dates[0]: 1 if the stop has at least one trip visiting it on dates[0]; 0 otherwise
dates[1]: 1 if the stop has at least one trip visiting it on dates[1]; 0 otherwise
etc.
dates[-1]: 1 if the stop has at least one trip visiting it on dates[-1]; 0 otherwise
If all dates lie outside the Feed period, then return an empty DataFrame.
- gtfs_kit.stops.compute_stop_stats(feed: Feed, dates: list[str], stop_ids: list[str | None] = None, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) pd.DataFrame ¶
Compute stats for all stops for the given dates (YYYYMMDD date strings). Optionally, restrict to the stop IDs given.
If
split_directions
, then separate the stop stats by direction (0 or 1) of the trips visiting the stops. Use the headway start and end times to specify the time period for computing headway stats.
Return a DataFrame with the columns
'date'
'stop_id'
'direction_id': present if and only if split_directions
'num_routes': number of routes visiting the stop (in the given direction) on the date
'num_trips': number of trips visiting the stop (in the given direction) on the date
'max_headway': maximum of the durations (in minutes) between trip departures at the stop between headway_start_time and headway_end_time on the date
'min_headway': minimum of the durations (in minutes) mentioned above
'mean_headway': mean of the durations (in minutes) mentioned above
'start_time': earliest departure time of a trip from this stop on the date
'end_time': latest departure time of a trip from this stop on the date
Exclude dates with no active stops, which could yield the empty DataFrame.
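As an illustration of the headway columns, here is a minimal sketch that computes the maximum, minimum, and mean gap in minutes between consecutive departures at one stop inside a time window. The function names are hypothetical, and whether the library treats the window endpoints as inclusive is an assumption made here.

```python
def time_to_seconds(timestr):
    """Convert an HH:MM:SS string (HH may exceed 23) to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in timestr.split(":"))
    return 3600 * hours + 60 * minutes + seconds


def headway_stats(departure_times, start="07:00:00", end="19:00:00"):
    """Return headway stats in minutes, or None if fewer than two departures qualify."""
    lo, hi = time_to_seconds(start), time_to_seconds(end)
    # Assumption: both window endpoints are inclusive
    secs = sorted(
        t for t in (time_to_seconds(x) for x in departure_times) if lo <= t <= hi
    )
    gaps = [(b - a) / 60 for a, b in zip(secs, secs[1:])]
    if not gaps:
        return None
    return {
        "max_headway": max(gaps),
        "min_headway": min(gaps),
        "mean_headway": sum(gaps) / len(gaps),
    }
```

Departures outside the window are dropped before the gaps are measured, so an early-morning or late-night trip does not inflate the headway figures.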
- gtfs_kit.stops.compute_stop_stats_0(stop_times_subset: DataFrame, trip_subset: DataFrame, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) DataFrame ¶
Given a subset of a stop times DataFrame and a subset of a trips DataFrame, return a DataFrame that provides summary stats about the stops in the inner join of the two DataFrames.
If
split_directions
, then separate the stop stats by direction (0 or 1) of the trips visiting the stops. Use the headway start and end times to specify the time period for computing headway stats.
Return a DataFrame with the columns
stop_id
direction_id: present if and only if split_directions
num_routes: number of routes visiting the stop (in the given direction)
num_trips: number of trips visiting the stop (in the given direction)
max_headway: maximum of the durations (in minutes) between trip departures at the stop between headway_start_time and headway_end_time
min_headway: minimum of the durations (in minutes) mentioned above
mean_headway: mean of the durations (in minutes) mentioned above
start_time: earliest departure time of a trip from this stop
end_time: latest departure time of a trip from this stop
Notes
If
trip_subset
is empty, then return an empty DataFrame.
Raise a ValueError if
split_directions
and no non-NaN direction ID values are present.
- gtfs_kit.stops.compute_stop_time_series(feed: Feed, dates: list[str], stop_ids: list[str | None] = None, freq: str = '5Min', *, split_directions: bool = False) pd.DataFrame ¶
Compute time series for the stops on the given dates (YYYYMMDD date strings) at the given frequency (Pandas frequency string, e.g.
'5Min'
; max frequency is one minute) and return the result as a DataFrame of the same form as output by the function stops.compute_stop_time_series_0()
. Optionally restrict to stops in the given list of stop IDs.
If
split_directions
, then separate the stop stats by direction (0 or 1) of the trips visiting the stops.
Return a time series DataFrame with a timestamp index across the given dates sampled at the given frequency.
The columns are the same as in the output of the function
compute_stop_time_series_0()
.
Exclude dates that lie outside of the Feed’s date range. If all dates lie outside the Feed’s date range, then return an empty DataFrame.
Notes
See the notes for the function
compute_stop_time_series_0()
Raise a ValueError if
split_directions
and no non-NaN direction ID values are present.
- gtfs_kit.stops.compute_stop_time_series_0(stop_times_subset: DataFrame, trip_subset: DataFrame, freq: str = '5Min', date_label: str = '20010101', *, split_directions: bool = False) DataFrame ¶
Given a subset of a stop times DataFrame and a subset of a trips DataFrame, return a DataFrame that provides a summary time series about the stops in the inner join of the two DataFrames. If
split_directions
, then separate the stop stats by direction (0 or 1) of the trips visiting the stops. Use the given Pandas frequency string to specify the frequency of the time series, e.g. '5Min'
; max frequency is one minute (‘Min’). Use the given date label (YYYYMMDD date string) as the date in the time series index.
Return a time series DataFrame with a timestamp index for a 24-hour period sampled at the given frequency. The only indicator variable for each stop is
num_trips
: the number of trips that visit the stop and have a nonnull departure time from the stop
The maximum allowable frequency is 1 minute.
The columns are hierarchical (multi-indexed) with
top level: name = ‘indicator’, values = [‘num_trips’]
middle level: name = ‘stop_id’, values = the active stop IDs
bottom level: name = ‘direction_id’, values = 0s and 1s
If not
split_directions
, then don’t include the bottom level.
Notes
The time series is computed at a one-minute frequency, then resampled at the end to the given frequency.
Stop times with null departure times are ignored, so the aggregate of
num_trips
across the day could be less than the num_trips
column in compute_stop_stats_0()
.
All trip departure times are taken modulo 24 hours, so routes with trips that end past 23:59:59 will have all their stats wrap around to the early morning of the time series.
‘num_trips’ should be resampled with
how=np.sum
.
If
trip_subset
is empty, then return an empty DataFrame.
Raise a ValueError if
split_directions
and no non-NaN direction ID values are present.
- gtfs_kit.stops.geometrize_stops(stops: DataFrame, *, use_utm: bool = False) GeoDataFrame ¶
Given a stops DataFrame, convert it to a GeoPandas GeoDataFrame of Points and return the result, which will no longer have the columns
'stop_lon'
and 'stop_lat'
.
- gtfs_kit.stops.get_stops(feed: Feed, date: str | None = None, trip_ids: Iterable[str] | None = None, route_ids: Iterable[str] | None = None, *, in_stations: bool = False, as_gdf: bool = False, use_utm: bool = False) pd.DataFrame ¶
Return
feed.stops
. If a YYYYMMDD date string is given, then subset to stops active (visited by trips) on that date. If trip IDs are given, then subset further to stops visited by those trips. If route IDs are given, then ignore the trip IDs and subset further to stops visited by those routes. If in_stations
, then subset further to stops in stations if station data is available. If as_gdf
, then return the result as a GeoDataFrame with a ‘geometry’ column of points instead of ‘stop_lat’ and ‘stop_lon’ columns. The GeoDataFrame will have a UTM CRS if use_utm
and a WGS84 CRS otherwise.
- gtfs_kit.stops.get_stops_in_area(feed: Feed, area: gpd.GeoDataFrame) pd.DataFrame ¶
Return the subset of
feed.stops
that contains all stops that lie within the given GeoDataFrame of polygons.
- gtfs_kit.stops.map_stops(feed: Feed, stop_ids: Iterable[str], stop_style: dict = {'color': '#fc8d62', 'fill': 'true', 'fillOpacity': 0.75, 'radius': 8, 'weight': 1})¶
Return a Folium map showing the given stops of this Feed. If some of the given stop IDs are not found in the feed, then raise a ValueError.
- gtfs_kit.stops.stops_to_geojson(feed: Feed, stop_ids: Iterable[str | None] = None) dict ¶
Return a GeoJSON FeatureCollection of Point features representing all the stops in
feed.stops
. The coordinate reference system is the default one for GeoJSON, namely WGS84.
If an iterable of stop IDs is given, then subset to those stops. If some of the given stop IDs are not found in the feed, then raise a ValueError.
- gtfs_kit.stops.ungeometrize_stops(stops_g: GeoDataFrame) DataFrame ¶
The inverse of
geometrize_stops()
. If
stops_g
is in UTM coordinates (has a UTM CRS property), then convert those UTM coordinates back to WGS84 coordinates, which is the standard for a GTFS stops table.
Module trips¶
Functions about trips.
- gtfs_kit.trips.compute_busiest_date(feed: Feed, dates: list[str]) str ¶
Given a list of dates (YYYYMMDD date strings), return the first date that has the maximum number of active trips.
- gtfs_kit.trips.compute_trip_activity(feed: Feed, dates: list[str]) pd.DataFrame ¶
Mark trips as active or inactive on the given dates (YYYYMMDD date strings). Return a table with the columns
'trip_id'
dates[0]: 1 if the trip is active on dates[0]; 0 otherwise
dates[1]: 1 if the trip is active on dates[1]; 0 otherwise
etc.
dates[-1]: 1 if the trip is active on dates[-1]; 0 otherwise
If
dates
is None
or the empty list, then return an empty DataFrame.
- gtfs_kit.trips.compute_trip_stats(feed: Feed, route_ids: list[str | None] = None, *, compute_dist_from_shapes: bool = False) pd.DataFrame ¶
Return a DataFrame with the following columns:
'trip_id'
'route_id'
'route_short_name'
'route_type'
'direction_id': NaN if missing from feed
'shape_id': NaN if missing from feed
'stop_pattern_name': output from name_stop_patterns()
'num_stops': number of stops on trip
'start_time': first departure time of the trip
'end_time': last departure time of the trip
'start_stop_id': stop ID of the first stop of the trip
'end_stop_id': stop ID of the last stop of the trip
'is_loop': 1 if the start and end stop are less than 400m apart and 0 otherwise
'distance': distance of the trip; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'duration': duration of the trip in hours
'speed': distance/duration
If
feed.stop_times
has a shape_dist_traveled
column with at least one non-NaN value and compute_dist_from_shapes == False
, then use that column to compute the distance column. Else if feed.shapes is not None
, then compute the distance column using the shapes and Shapely. Otherwise, set the distances to NaN.
If route IDs are given, then restrict to trips on those routes.
Notes
Assume the following feed attributes are not
None
:
feed.trips
feed.routes
feed.stop_times
feed.shapes
(optionally)
Calculating trip distances with
compute_dist_from_shapes=True
seems pretty accurate. For example, calculating trip distances on this Portland feed using compute_dist_from_shapes=False
and compute_dist_from_shapes=True
, yields a difference of at most 0.83 km from the original values.
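To make the 'duration' and 'speed' columns concrete, here is a small sketch that handles GTFS time strings with HH greater than 23. The function names are hypothetical, and returning NaN for a zero-length trip is an assumption made here rather than documented library behaviour.

```python
def time_to_seconds(timestr):
    """Convert an HH:MM:SS string (HH may exceed 23) to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in timestr.split(":"))
    return 3600 * hours + 60 * minutes + seconds


def duration_and_speed(start_time, end_time, distance):
    """Return (duration in hours, distance per hour) for one trip."""
    duration = (time_to_seconds(end_time) - time_to_seconds(start_time)) / 3600
    # Assumption: a trip with no elapsed time has an undefined speed
    speed = distance / duration if duration > 0 else float("nan")
    return duration, speed
```

A trip that departs at 23:30:00 and finishes at 24:45:00 therefore lasts 1.25 hours, which a naive clock-time subtraction would get wrong.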
- gtfs_kit.trips.get_active_services(feed: Feed, date: str) list[str] ¶
Given a Feed and a date string in YYYYMMDD format, return the list of service IDs that are active on the date.
- gtfs_kit.trips.get_trips(feed: Feed, date: str | None = None, time: str | None = None, *, as_gdf: bool = False, use_utm: bool = False) pd.DataFrame ¶
Return
feed.trips
. If date (YYYYMMDD date string) is given then subset the result to trips that start on that date. If a time (HH:MM:SS string, possibly with HH > 23) is given in addition to a date, then further subset the result to trips in service at that time.
If
as_gdf
and feed.shapes
is not None, then return the trips as a GeoDataFrame of LineStrings representing trip shapes. Use local UTM CRS if use_utm
; otherwise use the WGS84 CRS. If as_gdf
and feed.shapes
is None
, then raise a ValueError.
- gtfs_kit.trips.locate_trips(feed: Feed, date: str, times: list[str]) pd.DataFrame ¶
Return the positions of all trips active on the given date (YYYYMMDD date string) and times (HH:MM:SS time strings, possibly with HH > 23).
Return a DataFrame with the columns
'trip_id'
'route_id'
'direction_id': all NaNs if feed.trips.direction_id is missing
'time'
'rel_dist': number between 0 (start) and 1 (end) indicating the relative distance of the trip along its path
'lon': longitude of trip at given time
'lat': latitude of trip at given time
Assume
feed.stop_times
has an accurate shape_dist_traveled
column.
- gtfs_kit.trips.map_trips(feed: Feed, trip_ids: Iterable[str], color_palette: list[str] = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3'], *, show_stops: bool = False, show_direction: bool = False)¶
Return a Folium map showing the given trips and (optionally) their stops. If any of the given trip IDs are not found in the feed, then raise a ValueError. If
show_direction
, then use the Folium plugin PolyLineTextPath to draw arrows on each trip polyline indicating its direction of travel; this fails to work in some browsers, such as Brave 0.68.132.
- gtfs_kit.trips.name_stop_patterns(feed: Feed) pd.DataFrame ¶
For each (route ID, direction ID) pair, find the distinct stop patterns of its trips, and assign them each an integer pattern rank based on the stop pattern’s frequency rank, where 1 is the most frequent stop pattern, 2 is the second most frequent, etc. Return the DataFrame
feed.trips
with the additional column stop_pattern_name
, which equals the trip’s ‘direction_id’ concatenated with a dash and its stop pattern rank.
If
feed.trips
has no ‘direction_id’ column, then temporarily create one equal to all zeros, proceed as above, then delete the column.
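The ranking described above can be sketched with the standard library. This is only an illustration: the function name `sketch_stop_pattern_names` and the list-of-dicts input are hypothetical, and breaking frequency ties by first appearance is an assumption made here.

```python
from collections import Counter, defaultdict


def sketch_stop_pattern_names(trips):
    """
    Map each trip ID to '<direction_id>-<rank>', where rank 1 is the most
    frequent stop pattern within the trip's (route ID, direction ID) pair.
    Each trip is a dict with keys 'trip_id', 'route_id', 'direction_id',
    and 'stop_ids' (the ordered stop IDs of the trip).
    """
    counts = defaultdict(Counter)
    for trip in trips:
        key = (trip["route_id"], trip["direction_id"])
        counts[key][tuple(trip["stop_ids"])] += 1

    ranks = {}
    for key, counter in counts.items():
        # most_common() sorts by frequency; ties keep first-seen order
        for rank, (pattern, _) in enumerate(counter.most_common(), start=1):
            ranks[key, pattern] = rank

    return {
        trip["trip_id"]: "{}-{}".format(
            trip["direction_id"],
            ranks[(trip["route_id"], trip["direction_id"]), tuple(trip["stop_ids"])],
        )
        for trip in trips
    }
```

Two trips on the same route and direction that serve the same ordered stops share a name, while a short-turn variant gets the next rank.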
- gtfs_kit.trips.trips_to_geojson(feed: Feed, trip_ids: Iterable[str] | None = None, *, include_stops: bool = False) dict ¶
Return a GeoJSON FeatureCollection of LineString features representing all the Feed’s trips. The coordinate reference system is the default one for GeoJSON, namely WGS84.
If
include_stops
, then include the trip stops as Point features. If an iterable of trip IDs is given, then subset to those trips. If any of the given trip IDs are not found in the feed, then raise a ValueError. If the Feed has no shapes, then raise a ValueError.
Module miscellany¶
Functions about miscellany.
- gtfs_kit.miscellany.assess_quality(feed: Feed) pd.DataFrame ¶
Return a DataFrame of various feed indicators and values, e.g. number of trips missing shapes.
The resulting DataFrame has the columns
'indicator': string; name of an indicator, e.g. ‘num_routes’
'value': value of the indicator, e.g. 27
This function is odd but useful for seeing roughly how broken a feed is. This function is not a GTFS validator.
- gtfs_kit.miscellany.compute_bounds(feed: Feed, stop_ids: list[str] | None = None) np.array ¶
Return the bounding box (Numpy array [min longitude, min latitude, max longitude, max latitude]) of the given Feed’s stops or of the subset of stops specified by the given stop IDs.
- gtfs_kit.miscellany.compute_centroid(feed: Feed, stop_ids: list[str] | None = None) sg.Point ¶
Return the centroid (Shapely Point) of the convex hull of the given Feed’s stops or of the subset of stops specified by the given stop IDs.
- gtfs_kit.miscellany.compute_convex_hull(feed: Feed, stop_ids: list[str] | None = None) sg.Polygon ¶
Return the convex hull (Shapely Polygon) of the given Feed’s stops or of the subset of stops specified by the given stop IDs.
- gtfs_kit.miscellany.compute_feed_stats(feed: Feed, trip_stats: pd.DataFrame, dates: list[str], *, split_route_types=False) pd.DataFrame ¶
Compute some stats for the given Feed, trip stats (in the format output by the function
trips.compute_trip_stats()
) and dates (YYYYMMDD date strings).
Return a DataFrame with the columns
'date'
'route_type' (optional): present if and only if split_route_types
'num_stops': number of stops active on the date
'num_routes': number of routes active on the date
'num_trips': number of trips that start on the date
'num_trip_starts': number of trips with nonnull start times on the date
'num_trip_ends': number of trips with nonnull start times and nonnull end times on the date, ignoring trips that end after 23:59:59 on the date
'peak_num_trips': maximum number of simultaneous trips in service on the date
'peak_start_time': start time of first longest period during which the peak number of trips occurs on the date
'peak_end_time': end time of first longest period during which the peak number of trips occurs on the date
'service_distance': sum of the service distances for the active routes on the date; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_duration': sum of the service durations for the active routes on the date; measured in hours
'service_speed': service_distance/service_duration on the date
Exclude dates with no active stops, which could yield the empty DataFrame.
The route and trip stats for date d contain stats for trips that start on date d only and ignore trips that start on date d-1 and end on date d.
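The peak columns can be illustrated with a sweep over trip start and end times. This is a sketch under stated assumptions, not the library's code: the function name `peak_period` is hypothetical, times are seconds since midnight, and each trip is treated as the half-open interval [start, end).

```python
def peak_period(intervals):
    """
    Given (start, end) pairs in seconds, return (peak count, start, end) of
    the first longest period during which the peak number of trips is in service.
    """
    deltas = {}
    for start, end in intervals:
        deltas[start] = deltas.get(start, 0) + 1
        deltas[end] = deltas.get(end, 0) - 1

    times = sorted(deltas)
    active = 0
    segments = []  # (segment start, segment end, trips in service)
    for t, t_next in zip(times, times[1:]):
        active += deltas[t]
        if segments and segments[-1][2] == active:
            # Same count as the previous adjacent segment: merge them
            segments[-1] = (segments[-1][0], t_next, active)
        else:
            segments.append((t, t_next, active))

    if not segments:
        return 0, None, None
    peak = max(count for _, _, count in segments)
    # max() returns the first maximal element, hence the first longest period
    best = max(
        (seg for seg in segments if seg[2] == peak),
        key=lambda seg: seg[1] - seg[0],
    )
    return peak, best[0], best[1]
```

When two periods share the peak count and the same length, the earlier one is reported, matching the "first longest period" wording above.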
- gtfs_kit.miscellany.compute_feed_stats_0(feed: Feed, trip_stats_subset: pd.DataFrame, *, split_route_types=False) pd.DataFrame ¶
Helper function for
compute_feed_stats()
.
- gtfs_kit.miscellany.compute_feed_time_series(feed: Feed, trip_stats: pd.DataFrame, dates: list[str], freq: str = '5Min', *, split_route_types: bool = False) pd.DataFrame ¶
Compute some feed stats in time series form for the given dates (YYYYMMDD date strings) and trip stats (of the form output by the function
trips.compute_trip_stats()
). Use the given Pandas frequency string freq
to specify the frequency of the resulting time series, e.g. ‘5Min’; highest frequency allowable is one minute (‘1Min’). If split_route_types
, then split stats by route type; otherwise don’t.
Return a time series DataFrame with a datetime index across the given dates sampled at the given frequency. The columns are
'num_trips': number of trips in service during the time period
'num_trip_starts': number of trips starting during the time period
'num_trip_ends': number of trips ending during the time period, ignoring the trips that end past midnight
'service_distance': distance traveled during the time period by all trips active during the time period; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_duration': duration traveled during the time period by all trips active during the time period; measured in hours
'service_speed': service_distance/service_duration
Exclude dates that lie outside of the Feed’s date range. If all the dates given lie outside of the Feed’s date range, then return an empty DataFrame with the specified columns.
If
split_route_types
, then multi-index the columns with
top level: name is 'indicator'; values are 'num_trip_starts', 'num_trip_ends', 'num_trips', 'service_distance', 'service_duration', and 'service_speed'
bottom level: name is 'route_type'; values are route type values
If all dates lie outside the Feed’s date range, then return an empty DataFrame.
- gtfs_kit.miscellany.compute_screen_line_counts(feed: Feed, screen_lines: gpd.GeoDataFrame, dates: list[str]) pd.DataFrame ¶
Find all the Feed trips active on the given YYYYMMDD dates whose shapes intersect the given GeoDataFrame of screen lines, that is, of straight WGS84 LineStrings. Compute the intersection times and directions for each trip.
Return a DataFrame with the columns
'date'
'trip_id'
'route_id'
'route_short_name'
'shape_id': shape ID of the trip
'screen_line_id': ID of the screen line as specified in screen_lines or as assigned after the fact
'crossing_distance': distance (in the feed’s distance units) along the trip shape of the screen line intersection
'crossing_time': time that the trip’s vehicle crosses the screen line; one trip could cross multiple times
'crossing_direction': 1 or -1; 1 indicates trip travel from the left side to the right side of the screen line; -1 indicates trip travel in the opposite direction
Notes:
Assume the Feed’s stop times DataFrame has an accurate
shape_dist_traveled
column.
Assume that trips travel in the same direction as their shapes, an assumption that is part of the GTFS.
Assume that the screen line is straight and simple.
Probably does not give correct results for trips with self-intersecting shapes.
The algorithm works as follows.
Find the trip shapes that intersect the screen lines.
For each such shape and screen line, compute the intersection points, the distance of the point along the shape, and the orientation of the screen line relative to the shape.
For each given date, restrict to trips active on the date and interpolate a stop time for the intersection point using the
shape_dist_traveled
column.
Use that interpolated time as the crossing time of the trip vehicle.
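The time interpolation in the last two steps can be sketched as follows. The function name `interpolate_crossing_time` is hypothetical, times are assumed to be seconds since midnight, and the geometric steps (finding intersections and orientation) are left out.

```python
def interpolate_crossing_time(stop_dists, stop_times, crossing_dist):
    """
    Linearly interpolate the time at which a trip reaches `crossing_dist`,
    given the trip's increasing shape_dist_traveled values and the matching
    stop times in seconds. Return None if the distance is off the trip.
    """
    points = list(zip(stop_dists, stop_times))
    for (d0, t0), (d1, t1) in zip(points, points[1:]):
        if d0 <= crossing_dist <= d1:
            if d1 == d0:
                return float(t0)
            return t0 + (t1 - t0) * (crossing_dist - d0) / (d1 - d0)
    return None
```

For a trip with stops at distances 0, 2 and 5 and times 08:00:00, 08:05:00 and 08:15:00, a crossing at distance 3.5 is halfway between the last two stops, so it is placed at 08:10:00.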
- gtfs_kit.miscellany.convert_dist(feed: Feed, new_dist_units: str) Feed ¶
Convert the distances recorded in the
shape_dist_traveled
columns of the given Feed to the given distance units. New distance units must lie in constants.DIST_UNITS
. Return the resulting Feed.
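The unit conversion itself amounts to a scale factor. Here is a minimal sketch for a single value, using the four units listed in constants.DIST_UNITS; the function name `convert_distance` and the lookup table are hypothetical and not part of the library.

```python
# Exact length of each supported unit in meters
METERS_PER_UNIT = {"ft": 0.3048, "mi": 1609.344, "m": 1.0, "km": 1000.0}


def convert_distance(value, old_units, new_units):
    """Convert a distance between the units 'ft', 'mi', 'm', and 'km'."""
    for units in (old_units, new_units):
        if units not in METERS_PER_UNIT:
            raise ValueError(f"Distance units must lie in {sorted(METERS_PER_UNIT)}")
    return value * METERS_PER_UNIT[old_units] / METERS_PER_UNIT[new_units]
```

Applying such a factor to every shape_dist_traveled value, in both the shapes and stop times tables, gives distances in the new units.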
- gtfs_kit.miscellany.create_shapes(feed: Feed, *, all_trips: bool = False) Feed ¶
Given a feed, create a shape for every trip that is missing a shape ID. Do this by connecting the stops on the trip with straight lines. Return the resulting feed which has updated shapes and trips tables.
If
all_trips
, then create new shapes for all trips by connecting stops, and remove the old shapes.
- gtfs_kit.miscellany.describe(feed: Feed, sample_date: str | None = None) pd.DataFrame ¶
Return a DataFrame of various feed indicators and values, e.g. number of routes. Specialize some of those indicators to the given YYYYMMDD sample date string, e.g. number of routes active on the date.
The resulting DataFrame has the columns
'indicator': string; name of an indicator, e.g. ‘num_routes’
'value': value of the indicator, e.g. 27
- gtfs_kit.miscellany.list_fields(feed: Feed, table: str | None = None) pd.DataFrame ¶
Return a DataFrame describing all the fields of the GTFS tables in the given feed or in the given table if specified.
The resulting DataFrame has the following columns.
'table': name of the GTFS table, e.g. 'stops'
'column': name of a column in the table, e.g. 'stop_id'
'num_values': number of values in the column
'num_nonnull_values': number of nonnull values in the column
'num_unique_values': number of unique values in the column, excluding null values
'min_value': minimum value in the column
'max_value': maximum value in the column
If the table is not in the feed, then return an empty DataFrame. If the table is not valid, raise a ValueError.
- gtfs_kit.miscellany.restrict_to_agencies(feed: Feed, agency_ids: list[str]) Feed ¶
Build a new feed by restricting this one via
restrict_to_routes()
and the routes with the given agency IDs. Return the resulting feed.
- gtfs_kit.miscellany.restrict_to_area(feed: Feed, area: gpd.GeoDataFrame) Feed ¶
Build a new feed by restricting this one via
restrict_to_trips()
and the trips that have at least one stop intersecting the given GeoDataFrame of polygons. Return the resulting feed.
- gtfs_kit.miscellany.restrict_to_dates(feed: Feed, dates: list[str]) Feed ¶
Build a new feed by restricting this one via
restrict_to_trips()
and the trips active on at least one of the given dates (YYYYMMDD strings). Return the resulting feed.
- gtfs_kit.miscellany.restrict_to_routes(feed: Feed, route_ids: list[str]) Feed ¶
Build a new feed by restricting this one via
restrict_to_trips()
and the trips with the given route IDs. Return the resulting feed.
- gtfs_kit.miscellany.restrict_to_trips(feed: Feed, trip_ids: list[str]) Feed ¶
Build a new feed by restricting this one to only the stops, trips, shapes, etc. used by the trips of the given IDs. Return the resulting feed.
If no valid trip IDs are given, which includes the case of the empty list, then the resulting feed will have all empty non-agency tables.
This function is probably more useful internally than externally.
Module feed¶
This module defines a Feed class to represent GTFS feeds.
There is an instance attribute for every GTFS table (routes, stops, etc.),
which stores the table as a Pandas DataFrame,
or as None
in case that table is missing.
The Feed class also has heaps of methods: a method to compute route stats,
a method to compute screen line counts, validation methods, etc.
To ease reading, almost all of these methods are defined in other modules and
grouped by theme (routes.py
, stops.py
, etc.).
These methods, or rather functions that operate on feeds, are
then imported within the Feed class.
This separation of methods unfortunately messes up slightly the Feed
class
documentation generated by Sphinx, introducing an extra leading feed
parameter in the method signatures.
Ignore that extra parameter; it refers to the Feed instance,
usually called self
and usually hidden automatically by Sphinx.
- class gtfs_kit.feed.Feed(dist_units: str, agency: DataFrame | None = None, stops: DataFrame | None = None, routes: DataFrame | None = None, trips: DataFrame | None = None, stop_times: DataFrame | None = None, calendar: DataFrame | None = None, calendar_dates: DataFrame | None = None, fare_attributes: DataFrame | None = None, fare_rules: DataFrame | None = None, shapes: DataFrame | None = None, frequencies: DataFrame | None = None, transfers: DataFrame | None = None, feed_info: DataFrame | None = None, attributions: DataFrame | None = None)¶
Bases:
object
An instance of this class represents a not-necessarily-valid GTFS feed, where GTFS tables are stored as DataFrames. Beware that the stop times DataFrame can be big (several gigabytes), so make sure you have enough memory to handle it.
Primary instance attributes:
dist_units: a string in constants.DIST_UNITS; specifies the distance units of the shape_dist_traveled column values, if present; also affects whether to display trip and route stats in metric or imperial units
agency
stops
routes
trips
stop_times
calendar
calendar_dates
fare_attributes
fare_rules
shapes
frequencies
transfers
feed_info
attributions
There are also a few secondary instance attributes that are derived from the primary attributes and are automatically updated when the primary attributes change. However, for this update to work, you must update the primary attributes like this (good):
feed.trips['route_short_name'] = 'bingo'
feed.trips = feed.trips
and not like this (bad):
feed.trips['route_short_name'] = 'bingo'
The first way ensures that the altered trips DataFrame is saved as the new
trips
attribute, but the second way does not.
- aggregate_routes(by: str = 'route_short_name', route_id_prefix: str = 'route_') Feed ¶
Aggregate routes by route short name, say, and assign new route IDs using the given prefix.
More specifically, create new route IDs with the function
build_aggregate_routes_dict()
and the parameters by
and route_id_prefix
and update the old route IDs to the new ones in all the relevant Feed tables. Return the resulting Feed.
- aggregate_stops(by: str = 'stop_code', stop_id_prefix: str = 'stop_') Feed ¶
Aggregate stops by stop code, say, and assign new stop IDs using the given prefix.
More specifically, create new stop IDs with the function
build_aggregate_stops_dict()
and the parameters by
and stop_id_prefix
and update the old stop IDs to the new ones in all the relevant Feed tables. Return the resulting Feed.
- append_dist_to_shapes() Feed ¶
Calculate and append the optional
shape_dist_traveled
field in feed.shapes
in terms of the distance units feed.dist_units
. Return the resulting Feed.
As a benchmark, using this function on this Portland feed produces a
shape_dist_traveled
column that differs by at most 0.016 km in absolute value from the original values.
- append_dist_to_stop_times() Feed ¶
Calculate and append the optional
shape_dist_traveled
column in feed.stop_times
in terms of the distance units feed.dist_units
. Trips without shapes will have NaN distances. Return the resulting Feed. Uses feed.shapes
, so if that is missing, then return the original feed.
This does not always give accurate results. The algorithm works as follows. Compute the
shape_dist_traveled
field by using Shapely to measure the distance of a stop along its trip LineString. If for a given trip this process produces a non-monotonically increasing, hence incorrect, list of (cumulative) distances, then fall back to estimating the distances as follows.
Set the first distance to 0, the last to the length of the trip shape, and leave the remaining ones computed above. Choose the longest increasing subsequence of that new set of distances and use them and their corresponding departure times to linearly interpolate the rest of the distances.
- assess_quality() pd.DataFrame ¶
Return a DataFrame of various feed indicators and values, e.g. number of trips missing shapes.
The resulting DataFrame has the columns
'indicator': string; name of an indicator, e.g. ‘num_routes’
'value': value of the indicator, e.g. 27
This function is odd but useful for seeing roughly how broken a feed is. This function is not a GTFS validator.
- build_geometry_by_shape(shape_ids: Iterable[str] | None = None, *, use_utm: bool = False) dict ¶
Return a dictionary of the form <shape ID> -> <Shapely LineString representing shape>. If the Feed has no shapes, then return the empty dictionary. If
use_utm
, then use local UTM coordinates; otherwise, use WGS84 coordinates.
- build_geometry_by_stop(stop_ids: Iterable[str] | None = None, *, use_utm: bool = False) dict ¶
Return a dictionary of the form <stop ID> -> <Shapely Point representing stop>.
- build_route_timetable(route_id: str, dates: list[str]) pd.DataFrame ¶
Return a timetable for the given route and dates (YYYYMMDD date strings).
Return a DataFrame whose columns are all those in
feed.trips
plus those in feed.stop_times
plus 'date'
. The trip IDs are restricted to the given route ID. The result is sorted first by date and then by grouping by trip ID and sorting the groups by their first departure time.
Skip dates outside of the Feed’s dates.
If there is no route activity on the given dates, then return an empty DataFrame.
- build_stop_timetable(stop_id: str, dates: list[str]) pd.DataFrame ¶
Return a DataFrame containing the timetable for the given stop ID and dates (YYYYMMDD date strings).
The columns of the DataFrame are all those in feed.trips plus those in feed.stop_times plus 'date', and the stop IDs are restricted to the given stop ID. The result is sorted by date then departure time.
- build_zero_route_time_series(date_label: str = '20010101', freq: str = '5Min', *, split_directions: bool = False) pd.DataFrame ¶
Return a route time series with the same index and hierarchical columns as output by compute_route_time_series_0(), but filled with zero values.
- build_zero_stop_time_series(date_label: str = '20010101', freq: str = '5Min', *, split_directions: bool = False) pd.DataFrame ¶
Return a stop time series with the same index and hierarchical columns as output by the function compute_stop_time_series_0(), but filled with zero values.
- clean() Feed ¶
Apply the following functions to the given Feed in order and return the resulting Feed.
- clean_ids() Feed ¶
In the given Feed, strip whitespace from all string IDs and then replace every remaining whitespace chunk with an underscore. Return the resulting Feed.
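A minimal sketch of that ID normalization for a single string, using only the standard library. The function name `clean_id` is hypothetical and is not part of GTFS Kit.

```python
import re


def clean_id(value: str) -> str:
    """Strip outer whitespace, then collapse each inner whitespace run to '_'."""
    return re.sub(r"\s+", "_", value.strip())


print(clean_id("  route 1\t A "))  # route_1_A
```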
- clean_route_short_names() Feed ¶
In feed.routes, assign ‘n/a’ to missing route short names and strip whitespace from route short names. Then disambiguate each route short name that is duplicated by appending ‘-’ and its route ID. Return the resulting Feed.
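The disambiguation rule can be illustrated on plain tuples. This is a sketch under the assumption that routes are given as (route ID, short name or None) pairs; the function below is hypothetical and works on lists, not on a Feed.

```python
from collections import Counter


def clean_route_short_names(routes):
    """routes: list of (route_id, route_short_name or None) pairs."""
    # Fill missing names and strip whitespace first.
    cleaned = [
        (route_id, (name if name is not None else "n/a").strip())
        for route_id, name in routes
    ]
    # Then append '-<route ID>' to every name that occurs more than once.
    counts = Counter(name for _, name in cleaned)
    return [
        (route_id, f"{name}-{route_id}" if counts[name] > 1 else name)
        for route_id, name in cleaned
    ]


routes = [("r1", " 10 "), ("r2", "10"), ("r3", None), ("r4", "20")]
print(clean_route_short_names(routes))
```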
- clean_times() Feed ¶
In the given Feed, convert H:MM:SS time strings to HH:MM:SS time strings to make sorting by time work as expected. Return the resulting Feed.
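To see why the padding matters, here is a small standard-library sketch. The helper name `pad_time` is an assumption for illustration. Unpadded time strings sort lexicographically in the wrong order, while padded ones sort chronologically, including times with HH > 23.

```python
def pad_time(timestr: str) -> str:
    """Convert an H:MM:SS or HH:MM:SS time string to HH:MM:SS."""
    hours, minutes, seconds = timestr.strip().split(":")
    return f"{int(hours):02d}:{minutes}:{seconds}"


raw = ["10:00:00", "9:30:00", "25:10:00"]
print(sorted(raw))                       # '9:30:00' wrongly sorts last
print(sorted(pad_time(t) for t in raw))  # chronological order
```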
- compute_bounds(stop_ids: list[str] | None = None) np.array ¶
Return the bounding box (Numpy array [min longitude, min latitude, max longitude, max latitude]) of the given Feed’s stops or of the subset of stops specified by the given stop IDs.
- compute_busiest_date(dates: list[str]) str ¶
Given a list of dates (YYYYMMDD date strings), return the first date that has the maximum number of active trips.
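The "first date with the maximum" tie-break can be sketched as follows, assuming the number of active trips per date is already known. The function and the `trip_counts` mapping are hypothetical; Python's built-in `max` returns the first maximal item it encounters, which gives the tie-break directly.

```python
def busiest_date(trip_counts, dates):
    """Return the first date in `dates` with the most active trips.

    trip_counts: dict mapping YYYYMMDD date string -> number of active trips.
    Assumes `dates` is nonempty.
    """
    return max(dates, key=lambda d: trip_counts.get(d, 0))


counts = {"20240101": 5, "20240102": 9, "20240103": 9}
print(busiest_date(counts, ["20240101", "20240102", "20240103"]))  # 20240102
```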
- compute_centroid(stop_ids: list[str] | None = None) sg.Point ¶
Return the centroid (Shapely Point) of the convex hull of the given Feed’s stops or of the subset of stops specified by the given stop IDs.
- compute_convex_hull(stop_ids: list[str] | None = None) sg.Polygon ¶
Return the convex hull (Shapely Polygon) of the given Feed’s stops or of the subset of stops specified by the given stop IDs.
- compute_feed_stats(trip_stats: pd.DataFrame, dates: list[str], *, split_route_types=False) pd.DataFrame ¶
Compute some stats for the given Feed, trip stats (in the format output by the function trips.compute_trip_stats()) and dates (YYYYMMDD date strings).
Return a DataFrame with the columns
'date'
'route_type' (optional): present if and only if split_route_types
'num_stops': number of stops active on the date
'num_routes': number of routes active on the date
'num_trips': number of trips that start on the date
'num_trip_starts': number of trips with nonnull start times on the date
'num_trip_ends': number of trips with nonnull start times and nonnull end times on the date, ignoring trips that end after 23:59:59 on the date
'peak_num_trips': maximum number of simultaneous trips in service on the date
'peak_start_time': start time of first longest period during which the peak number of trips occurs on the date
'peak_end_time': end time of first longest period during which the peak number of trips occurs on the date
'service_distance': sum of the service distances for the active routes on the date; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_duration': sum of the service durations for the active routes on the date; measured in hours
'service_speed': service_distance/service_duration on the date
Exclude dates with no active stops, which could yield the empty DataFrame.
The route and trip stats for date d contain stats for trips that start on date d only and ignore trips that start on date d-1 and end on date d.
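As an illustration of the peak indicators, the following plain-Python sketch finds the peak number of simultaneous trips and the first longest period at that peak. It is not GTFS Kit's code: the function names are hypothetical, and it assumes each trip is in service on the half-open interval from its start time to its end time. Times with HH > 23 need no special handling once converted to seconds.

```python
def to_seconds(timestr):
    """Convert an HH:MM:SS string (HH may exceed 23) to seconds past midnight."""
    hours, minutes, seconds = (int(x) for x in timestr.split(":"))
    return 3600 * hours + 60 * minutes + seconds


def peak_period(intervals):
    """intervals: list of (start_time, end_time) HH:MM:SS strings.

    Return (peak_num_trips, peak_start, peak_end) with times in seconds,
    or (0, None, None) if there are no intervals of positive length.
    """
    deltas = {}
    for start, end in intervals:
        s, e = to_seconds(start), to_seconds(end)
        deltas[s] = deltas.get(s, 0) + 1
        deltas[e] = deltas.get(e, 0) - 1
    # Drop instants with no net change so adjacent equal-count runs merge.
    times = sorted(t for t, d in deltas.items() if d != 0)
    segments, count = [], 0
    for t, t_next in zip(times, times[1:]):
        count += deltas[t]
        segments.append((t, t_next, count))
    if not segments:
        return 0, None, None
    peak = max(c for _, _, c in segments)
    # max() keeps the first of equally long segments.
    start, end, _ = max(
        (seg for seg in segments if seg[2] == peak),
        key=lambda seg: seg[1] - seg[0],
    )
    return peak, start, end


trips = [
    ("06:00:00", "07:00:00"),
    ("06:30:00", "06:45:00"),
    ("23:00:00", "25:30:00"),  # runs past midnight
    ("24:10:00", "24:40:00"),
]
print(peak_period(trips))  # (2, 87000, 88800), i.e. 24:10:00 to 24:40:00
```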
- compute_feed_time_series(trip_stats: pd.DataFrame, dates: list[str], freq: str = '5Min', *, split_route_types: bool = False) pd.DataFrame ¶
Compute some feed stats in time series form for the given dates (YYYYMMDD date strings) and trip stats (of the form output by the function trips.compute_trip_stats()). Use the given Pandas frequency string freq to specify the frequency of the resulting time series, e.g. ‘5Min’; the highest frequency allowable is one minute (‘1Min’). If split_route_types, then split stats by route type; otherwise don’t.
Return a time series DataFrame with a datetime index across the given dates sampled at the given frequency. The columns are
'num_trips': number of trips in service during the time period
'num_trip_starts': number of trips starting during the time period
'num_trip_ends': number of trips ending during the time period, ignoring the trips that end past midnight
'service_distance': distance traveled during the time period by all trips active during the time period; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_duration': duration traveled during the time period by all trips active during the time period; measured in hours
'service_speed': service_distance/service_duration
Exclude dates that lie outside of the Feed’s date range. If all the dates given lie outside of the Feed’s date range, then return an empty DataFrame with the specified columns.
If split_route_types, then multi-index the columns with
top level: name is 'indicator'; values are 'num_trip_starts', 'num_trip_ends', 'num_trips', 'service_distance', 'service_duration', and 'service_speed'
bottom level: name is 'route_type'; values are route type values
- compute_route_stats(trip_stats_subset: pd.DataFrame, dates: list[str], headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) pd.DataFrame ¶
Compute route stats for all the trips that lie in the given subset of trip stats (of the form output by the function trips.compute_trip_stats()) and that start on the given dates (YYYYMMDD date strings).
If split_directions, then separate the stats by trip direction (0 or 1). Use the headway start and end times to specify the time period for computing headway stats.
Return a DataFrame with the columns
'date'
the columns listed in compute_route_stats_0()
Exclude dates with no active trips, which could yield the empty DataFrame.
Notes
The route stats for date d contain stats for trips that start on date d only and ignore trips that start on date d-1 and end on date d.
Raise a ValueError if split_directions and no non-NaN direction ID values are present.
- compute_route_time_series(trip_stats_subset: pd.DataFrame, dates: list[str], freq: str = '5Min', *, split_directions: bool = False) pd.DataFrame ¶
Compute route stats in time series form for the trips that lie in the trip stats subset (of the form output by the function trips.compute_trip_stats()) and that start on the given dates (YYYYMMDD date strings).
If split_directions, then separate each route’s stats by trip direction. Specify the time series frequency with a Pandas frequency string, e.g. '5Min'; max frequency is one minute (‘Min’).
Return a DataFrame of the same format output by the function compute_route_time_series_0() but with multiple dates.
Exclude dates that lie outside of the Feed’s date range. If all dates lie outside the Feed’s date range, then return an empty DataFrame.
Notes
See the notes for compute_route_time_series_0().
Raise a ValueError if split_directions and no non-NaN direction ID values are present.
- compute_screen_line_counts(screen_lines: gpd.GeoDataFrame, dates: list[str]) pd.DataFrame ¶
Find all the Feed trips active on the given YYYYMMDD dates whose shapes intersect the given GeoDataFrame of screen lines, that is, of straight WGS84 LineStrings. Compute the intersection times and directions for each trip.
Return a DataFrame with the columns
'date'
'trip_id'
'route_id'
'route_short_name'
'shape_id': shape ID of the trip
'screen_line_id': ID of the screen line as specified in screen_lines or as assigned after the fact.
'crossing_distance': distance (in the feed’s distance units) along the trip shape of the screen line intersection
'crossing_time': time that the trip’s vehicle crosses the screen line; one trip could cross multiple times
'crossing_direction': 1 or -1; 1 indicates trip travel from the left side to the right side of the screen line; -1 indicates trip travel in the opposite direction
Notes:
Assume the Feed’s stop times DataFrame has an accurate shape_dist_traveled column.
Assume that trips travel in the same direction as their shapes, an assumption that is part of the GTFS.
Assume that the screen line is straight and simple.
Probably does not give correct results for trips with self-intersecting shapes.
The algorithm works as follows.
Find the trip shapes that intersect the screen lines.
For each such shape and screen line, compute the intersection points, the distance of the point along the shape, and the orientation of the screen line relative to the shape.
For each given date, restrict to trips active on the date and interpolate a stop time for the intersection point using the shape_dist_traveled column.
Use that interpolated time as the crossing time of the trip vehicle.
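The last two steps can be sketched in plain Python. Both helpers are hypothetical. The first interpolates a crossing time from cumulative stop distances, which are assumed to be strictly increasing. The second derives a crossing direction from a 2D cross product, under the assumption that "left" means the left-hand side of the screen line when looking from its first point to its last; the library's actual sign convention may differ.

```python
from bisect import bisect_right


def interpolate_crossing_time(stop_dists, stop_times, crossing_dist):
    """Linearly interpolate the time (seconds) at which a trip reaches
    `crossing_dist`, given cumulative stop distances and stop times."""
    i = bisect_right(stop_dists, crossing_dist)
    i = min(max(i, 1), len(stop_dists) - 1)  # clamp to a valid segment
    d0, d1 = stop_dists[i - 1], stop_dists[i]
    t0, t1 = stop_times[i - 1], stop_times[i]
    return t0 + (crossing_dist - d0) / (d1 - d0) * (t1 - t0)


def crossing_direction(line_start, line_end, travel_vector):
    """Return 1 for left-to-right travel across the line, -1 otherwise."""
    (ax, ay), (bx, by) = line_start, line_end
    vx, vy = travel_vector
    cross = (bx - ax) * vy - (by - ay) * vx
    return 1 if cross < 0 else -1


# Stops at 0, 2, and 5 km reached at 0, 240, and 600 seconds.
print(interpolate_crossing_time([0.0, 2.0, 5.0], [0, 240, 600], 3.5))  # 420.0
# Screen line pointing north; an eastbound trip goes from left to right.
print(crossing_direction((0, 0), (0, 1), (1, 0)))  # 1
```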
- compute_stop_activity(dates: list[str]) pd.DataFrame ¶
Mark stops as active or inactive on the given dates (YYYYMMDD date strings). A stop is active on a given date if some trip that starts on the date visits the stop (possibly after midnight).
Return a DataFrame with the columns
'stop_id'
dates[0]: 1 if the stop has at least one trip visiting it on dates[0]; 0 otherwise
dates[1]: 1 if the stop has at least one trip visiting it on dates[1]; 0 otherwise
etc.
dates[-1]: 1 if the stop has at least one trip visiting it on dates[-1]; 0 otherwise
If all dates lie outside the Feed period, then return an empty DataFrame.
- compute_stop_stats(dates: list[str], stop_ids: list[str | None] = None, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) pd.DataFrame ¶
Compute stats for all stops for the given dates (YYYYMMDD date strings). Optionally, restrict to the stop IDs given.
If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops. Use the headway start and end times to specify the time period for computing headway stats.
Return a DataFrame with the columns
'date'
'stop_id'
'direction_id': present if and only if split_directions
'num_routes': number of routes visiting the stop (in the given direction) on the date
'num_trips': number of trips visiting stop (in the given direction) on the date
'max_headway': maximum of the durations (in minutes) between trip departures at the stop between headway_start_time and headway_end_time on the date
'min_headway': minimum of the durations (in minutes) mentioned above
'mean_headway': mean of the durations (in minutes) mentioned above
'start_time': earliest departure time of a trip from this stop on the date
'end_time': latest departure time of a trip from this stop on the date
Exclude dates with no active stops, which could yield the empty DataFrame.
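The headway columns can be illustrated with a short standard-library sketch. The function `headway_stats` is hypothetical, and it assumes the headway window includes both end points, which the text above does not specify.

```python
def to_seconds(timestr):
    """Convert an HH:MM:SS string (HH may exceed 23) to seconds past midnight."""
    hours, minutes, seconds = (int(x) for x in timestr.split(":"))
    return 3600 * hours + 60 * minutes + seconds


def headway_stats(departures, start="07:00:00", end="19:00:00"):
    """Return max, min, and mean headway in minutes for the departures
    inside the window, or None if fewer than two departures fall in it."""
    lo, hi = to_seconds(start), to_seconds(end)
    secs = sorted(t for t in map(to_seconds, departures) if lo <= t <= hi)
    gaps = [(b - a) / 60 for a, b in zip(secs, secs[1:])]
    if not gaps:
        return None
    return {
        "max_headway": max(gaps),
        "min_headway": min(gaps),
        "mean_headway": sum(gaps) / len(gaps),
    }


departures = ["06:50:00", "07:00:00", "07:10:00", "07:40:00", "19:30:00"]
print(headway_stats(departures))
```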
- compute_stop_time_series(dates: list[str], stop_ids: list[str | None] = None, freq: str = '5Min', *, split_directions: bool = False) pd.DataFrame ¶
Compute time series for the stops on the given dates (YYYYMMDD date strings) at the given frequency (Pandas frequency string, e.g. '5Min'; max frequency is one minute) and return the result as a DataFrame of the same form as output by the function stop_times.compute_stop_time_series_0(). Optionally restrict to stops in the given list of stop IDs.
If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops.
Return a time series DataFrame with a timestamp index across the given dates sampled at the given frequency.
The columns are the same as in the output of the function compute_stop_time_series_0().
Exclude dates that lie outside of the Feed’s date range. If all dates lie outside the Feed’s date range, then return an empty DataFrame.
Notes
See the notes for the function compute_stop_time_series_0().
Raise a ValueError if split_directions and no non-NaN direction ID values are present.
- compute_trip_activity(dates: list[str]) pd.DataFrame ¶
Mark trips as active or inactive on the given dates (YYYYMMDD date strings). Return a table with the columns
'trip_id'
dates[0]: 1 if the trip is active on dates[0]; 0 otherwise
dates[1]: 1 if the trip is active on dates[1]; 0 otherwise
etc.
dates[-1]: 1 if the trip is active on dates[-1]; 0 otherwise
If dates is None or the empty list, then return an empty DataFrame.
- compute_trip_stats(route_ids: list[str | None] = None, *, compute_dist_from_shapes: bool = False) pd.DataFrame ¶
Return a DataFrame with the following columns:
'trip_id'
'route_id'
'route_short_name'
'route_type'
'direction_id': NaN if missing from feed
'shape_id': NaN if missing from feed
'stop_pattern_name': output from name_stop_patterns()
'num_stops': number of stops on trip
'start_time': first departure time of the trip
'end_time': last departure time of the trip
'start_stop_id': stop ID of the first stop of the trip
'end_stop_id': stop ID of the last stop of the trip
'is_loop': 1 if the start and end stop are less than 400m apart and 0 otherwise
'distance': distance of the trip; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'duration': duration of the trip in hours
'speed': distance/duration
If feed.stop_times has a shape_dist_traveled column with at least one non-NaN value and compute_dist_from_shapes == False, then use that column to compute the distance column. Else if feed.shapes is not None, then compute the distance column using the shapes and Shapely. Otherwise, set the distances to NaN.
If route IDs are given, then restrict to trips on those routes.
Notes
Assume the following feed attributes are not None:
feed.trips
feed.routes
feed.stop_times
feed.shapes (optionally)
Calculating trip distances with compute_dist_from_shapes=True seems pretty accurate. For example, calculating trip distances on this Portland feed using compute_dist_from_shapes=False and compute_dist_from_shapes=True yields a difference of at most 0.83 km from the original values.
- convert_dist(new_dist_units: str) Feed ¶
Convert the distances recorded in the shape_dist_traveled columns of the given Feed to the given distance units. New distance units must lie in constants.DIST_UNITS. Return the resulting Feed.
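A unit conversion of this kind can be sketched by routing every value through meters. The table and function below are illustrative assumptions, not GTFS Kit internals; the unit names match constants.DIST_UNITS and the factors are the standard international definitions.

```python
# Meters per unit, for the units listed in constants.DIST_UNITS.
METERS_PER_UNIT = {"ft": 0.3048, "mi": 1609.344, "m": 1.0, "km": 1000.0}


def convert_dist(value, old_units, new_units):
    """Convert `value` from `old_units` to `new_units` via meters."""
    for units in (old_units, new_units):
        if units not in METERS_PER_UNIT:
            raise ValueError(f"Distance units must lie in {list(METERS_PER_UNIT)}")
    return value * METERS_PER_UNIT[old_units] / METERS_PER_UNIT[new_units]


print(convert_dist(1.0, "km", "m"))  # 1000.0
print(round(convert_dist(5280.0, "ft", "mi"), 9))
```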
- create_shapes(*, all_trips: bool = False) Feed ¶
Given a feed, create a shape for every trip that is missing a shape ID. Do this by connecting the stops on the trip with straight lines. Return the resulting feed, which has updated shapes and trips tables.
If all_trips, then create new shapes for all trips by connecting stops, and remove the old shapes.
- describe(sample_date: str | None = None) pd.DataFrame ¶
Return a DataFrame of various feed indicators and values, e.g. number of routes. Specialize some of those indicators to the given YYYYMMDD sample date string, e.g. number of routes active on the date.
The resulting DataFrame has the columns
'indicator': string; name of an indicator, e.g. ‘num_routes’
'value': value of the indicator, e.g. 27
- property dist_units¶
The distance units of the Feed.
- drop_invalid_columns() Feed ¶
Drop all DataFrame columns of the given Feed that are not listed in the GTFS. Return the resulting Feed.
- drop_zombies() Feed ¶
In the given Feed, do the following in order and return the resulting Feed.
Drop stops of location type 0 or NaN with no stop times.
Remove undefined parent stations from the parent_station column.
Drop trips with no stop times.
Drop shapes with no trips.
Drop routes with no trips.
Drop services with no trips.
- extend_id(id_col: str, extension: str, *, prefix=True) Feed ¶
Add a prefix (if prefix) or a suffix (otherwise) to all values of column id_col across all tables of this Feed. This can be helpful when preparing to merge multiple GTFS feeds with colliding route IDs, say.
Raises a ValueError if id_col values can’t have strings added to them, e.g. if id_col is ‘direction_id’.
- geometrize_shapes(*, use_utm: bool = False) GeoDataFrame ¶
Given a GTFS shapes DataFrame, convert it to a GeoDataFrame of LineStrings and return the result, which will no longer have the columns 'shape_pt_sequence', 'shape_pt_lon', 'shape_pt_lat', and 'shape_dist_traveled'.
If use_utm, then use local UTM coordinates for the geometries.
- geometrize_stops(*, use_utm: bool = False) GeoDataFrame ¶
Given a stops DataFrame, convert it to a GeoPandas GeoDataFrame of Points and return the result, which will no longer have the columns 'stop_lon' and 'stop_lat'.
- get_active_services(date: str) list[str] ¶
Given a Feed and a date string in YYYYMMDD format, return the list of service IDs that are active on the date.
- get_dates(*, as_date_obj: bool = False) list[str] ¶
Return a list of YYYYMMDD date strings for which the given Feed is valid, which could be the empty list if the Feed has no calendar information.
If as_date_obj, then return datetime.date objects instead.
- get_first_week(*, as_date_obj: bool = False) list[str] ¶
Return a list of YYYYMMDD date strings for the first Monday–Sunday week (or initial segment thereof) for which the given Feed is valid. If the feed has no Mondays, then return the empty list.
If as_date_obj, then return date objects; otherwise return date strings.
- get_routes(date: str | None = None, time: str | None = None, *, as_gdf: bool = False, use_utm: bool = False, split_directions: bool = False) pd.DataFrame ¶
Return feed.routes or a subset thereof. If a YYYYMMDD date string is given, then restrict routes to only those active on the date. If a HH:MM:SS time string is given, possibly with HH > 23, then restrict routes to only those active during the time.
If as_gdf and feed.shapes is not None, then return a GeoDataFrame with all the columns of feed.routes plus a geometry column of (Multi)LineStrings, each of which represents the corresponding route’s union of trip shapes. The GeoDataFrame will have a local UTM CRS if use_utm; otherwise it will have CRS WGS84. If split_directions and as_gdf, then add the column direction_id and split each route into the union of its direction 0 shapes and the union of its direction 1 shapes. If as_gdf and feed.shapes is None, then raise a ValueError.
- get_shapes(*, as_gdf: bool = False, use_utm: bool = False) gpd.DataFrame | None ¶
Get the shapes DataFrame for the given feed, which could be None. If as_gdf, then return it as a GeoDataFrame with a ‘geometry’ column of linestrings and no ‘shape_pt_sequence’, ‘shape_pt_lon’, ‘shape_pt_lat’, ‘shape_dist_traveled’ columns. The GeoDataFrame will have a UTM CRS if use_utm; otherwise it will have a WGS84 CRS.
- get_shapes_intersecting_geometry(geometry: sg.base.BaseGeometry, shapes_g: gpd.GeoDataFrame | None = None, *, as_gdf: bool = False) pd.DataFrame | None ¶
If the Feed has no shapes, then return None. Otherwise, return the subset of feed.shapes that contains all shapes that intersect the given Shapely WGS84 geometry, e.g. a Polygon or LineString.
If as_gdf, then return the shapes as a GeoDataFrame. Specifying shapes_g will skip the first step of the algorithm, namely, geometrizing feed.shapes.
- get_start_and_end_times(date: str | None = None) list[str] ¶
Return the first departure time and last arrival time (HH:MM:SS time strings) listed in feed.stop_times, respectively. Restrict to the given date (YYYYMMDD string) if specified.
- get_stop_times(date: str | None = None) pd.DataFrame ¶
Return feed.stop_times. If a date (YYYYMMDD date string) is given, then subset the result to only those stop times with trips active on the date.
- get_stops(date: str | None = None, trip_ids: Iterable[str] | None = None, route_ids: Iterable[str] | None = None, *, in_stations: bool = False, as_gdf: bool = False, use_utm: bool = False) pd.DataFrame ¶
Return feed.stops. If a YYYYMMDD date string is given, then subset to stops active (visited by trips) on that date. If trip IDs are given, then subset further to stops visited by those trips. If route IDs are given, then ignore the trip IDs and subset further to stops visited by those routes. If in_stations, then subset further to stops in stations if station data is available. If as_gdf, then return the result as a GeoDataFrame with a ‘geometry’ column of points instead of ‘stop_lat’ and ‘stop_lon’ columns. The GeoDataFrame will have a UTM CRS if use_utm and a WGS84 CRS otherwise.
- get_stops_in_area(area: gpd.GeoDataFrame) pd.DataFrame ¶
Return the subset of feed.stops that contains all stops that lie within the given GeoDataFrame of polygons.
- get_trips(date: str | None = None, time: str | None = None, *, as_gdf: bool = False, use_utm: bool = False) pd.DataFrame ¶
Return feed.trips. If a date (YYYYMMDD date string) is given, then subset the result to trips that start on that date. If a time (HH:MM:SS string, possibly with HH > 23) is given in addition to a date, then further subset the result to trips in service at that time.
If as_gdf and feed.shapes is not None, then return the trips as a GeoDataFrame of LineStrings representing trip shapes. Use a local UTM CRS if use_utm; otherwise use the WGS84 CRS. If as_gdf and feed.shapes is None, then raise a ValueError.
- get_week(k: int, *, as_date_obj: bool = False) list[str] ¶
Given a Feed and a positive integer k, return a list of YYYYMMDD date strings corresponding to the kth Monday–Sunday week (or initial segment thereof) for which the Feed is valid. For example, k=1 returns the first Monday–Sunday week (or initial segment thereof). If the Feed does not have k Mondays, then return the empty list.
If as_date_obj, then return datetime.date objects instead.
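The week selection can be sketched with the standard library, assuming the Feed's valid dates are given as a sorted list of consecutive YYYYMMDD strings. The standalone function below is a hypothetical illustration, not the method itself.

```python
from datetime import date, datetime, timedelta


def get_week(dates, k):
    """Return the kth Monday-Sunday week in `dates`, or its initial
    segment if the dates end early, or [] if there are fewer than k Mondays."""
    monday_positions = [
        i for i, d in enumerate(dates)
        if datetime.strptime(d, "%Y%m%d").weekday() == 0  # Monday is 0
    ]
    if k < 1 or k > len(monday_positions):
        return []
    start = monday_positions[k - 1]
    return dates[start:start + 7]


# Fourteen consecutive dates starting Wednesday 2024-01-03.
dates = [(date(2024, 1, 3) + timedelta(days=i)).strftime("%Y%m%d") for i in range(14)]
print(get_week(dates, 1))  # full week starting Monday 20240108
print(get_week(dates, 2))  # initial segment: ['20240115', '20240116']
print(get_week(dates, 3))  # []
```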
- list_fields(table: str | None = None) pd.DataFrame ¶
Return a DataFrame describing all the fields of the GTFS tables in the given feed or in the given table if specified.
The resulting DataFrame has the following columns.
'table': name of the GTFS table, e.g. 'stops'
'column': name of a column in the table, e.g. 'stop_id'
'num_values': number of values in the column
'num_nonnull_values': number of nonnull values in the column
'num_unique_values': number of unique values in the column, excluding null values
'min_value': minimum value in the column
'max_value': maximum value in the column
If the table is not in the feed, then return an empty DataFrame. If the table is not valid, raise a ValueError.
- locate_trips(date: str, times: list[str]) pd.DataFrame ¶
Return the positions of all trips active on the given date (YYYYMMDD date string) and times (HH:MM:SS time strings, possibly with HH > 23).
Return a DataFrame with the columns
'trip_id'
'route_id'
'direction_id': all NaNs if feed.trips.direction_id is missing
'time'
'rel_dist': number between 0 (start) and 1 (end) indicating the relative distance of the trip along its path
'lon': longitude of trip at given time
'lat': latitude of trip at given time
Assume feed.stop_times has an accurate shape_dist_traveled column.
- map_routes(route_ids: Iterable[str] | None = None, route_short_names: Iterable[str] | None = None, color_palette: Iterable[str] = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3'], *, show_stops: bool = False)¶
Return a Folium map showing the given routes and (optionally) their stops. At least one of route_ids and route_short_names must be given. If both are given, then combine the two into a single set of routes. If any of the given route IDs are not found in the feed, then raise a ValueError.
- map_stops(stop_ids: Iterable[str], stop_style: dict = {'color': '#fc8d62', 'fill': 'true', 'fillOpacity': 0.75, 'radius': 8, 'weight': 1})¶
Return a Folium map showing the given stops of this Feed. If some of the given stop IDs are not found in the feed, then raise a ValueError.
- map_trips(trip_ids: Iterable[str], color_palette: list[str] = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3'], *, show_stops: bool = False, show_direction: bool = False)¶
Return a Folium map showing the given trips and (optionally) their stops. If any of the given trip IDs are not found in the feed, then raise a ValueError. If show_direction, then use the Folium plugin PolyLineTextPath to draw arrows on each trip polyline indicating its direction of travel; this fails to work in some browsers, such as Brave 0.68.132.
- name_stop_patterns() pd.DataFrame ¶
For each (route ID, direction ID) pair, find the distinct stop patterns of its trips, and assign them each an integer pattern rank based on the stop pattern’s frequency rank, where 1 is the most frequent stop pattern, 2 is the second most frequent, etc. Return the DataFrame feed.trips with the additional column stop_pattern_name, which equals the trip’s ‘direction_id’ concatenated with a dash and its stop pattern rank.
If feed.trips has no ‘direction_id’ column, then temporarily create one equal to all zeros, proceed as above, then delete the column.
- restrict_to_agencies(agency_ids: list[str]) Feed ¶
Build a new feed by restricting this one via restrict_to_routes() and the routes with the given agency IDs. Return the resulting feed.
- restrict_to_area(area: gpd.GeoDataFrame) Feed ¶
Build a new feed by restricting this one via restrict_to_trips() and the trips that have at least one stop intersecting the given GeoDataFrame of polygons. Return the resulting feed.
- restrict_to_dates(dates: list[str]) Feed ¶
Build a new feed by restricting this one via restrict_to_trips() and the trips active on at least one of the given dates (YYYYMMDD strings). Return the resulting feed.
- restrict_to_routes(route_ids: list[str]) Feed ¶
Build a new feed by restricting this one via restrict_to_trips() and the trips with the given route IDs. Return the resulting feed.
- restrict_to_trips(trip_ids: list[str]) Feed ¶
Build a new feed by restricting this one to only the stops, trips, shapes, etc. used by the trips of the given IDs. Return the resulting feed.
If no valid trip IDs are given, which includes the case of the empty list, then the resulting feed will have all empty non-agency tables.
This function is probably more useful internally than externally.
- routes_to_geojson(route_ids: Iterable[str | None] = None, *, split_directions: bool = False, include_stops: bool = False) dict ¶
Return a GeoJSON FeatureCollection of MultiLineString features representing this Feed’s routes. The coordinates reference system is the default one for GeoJSON, namely WGS84.
If include_stops, then include the route stops as Point features. If an iterable of route IDs is given, then subset to those routes. If the subset is empty, then return a FeatureCollection with an empty list of features. If the Feed has no shapes, then raise a ValueError. If any of the given route IDs are not found in the feed, then raise a ValueError.
- shapes_to_geojson(shape_ids: Iterable[str] | None = None) dict ¶
Return a GeoJSON FeatureCollection of LineString features representing feed.shapes. If the Feed has no shapes, then the features will be an empty list. The coordinates reference system is the default one for GeoJSON, namely WGS84.
If an iterable of shape IDs is given, then subset to those shapes. If the subset is empty, then return a FeatureCollection with an empty list of features.
- stop_times_to_geojson(trip_ids: Iterable[str | None] = None) dict ¶
Return a GeoJSON FeatureCollection of Point features representing all the trip-stop pairs in feed.stop_times. The coordinates reference system is the default one for GeoJSON, namely WGS84.
For every trip, drop duplicate stop IDs within that trip. In particular, a looping trip will lack its final stop.
If an iterable of trip IDs is given, then subset to those trips. If some of the given trip IDs are not found in the feed, then raise a ValueError.
- stops_to_geojson(stop_ids: Iterable[str | None] = None) dict ¶
Return a GeoJSON FeatureCollection of Point features representing all the stops in feed.stops. The coordinates reference system is the default one for GeoJSON, namely WGS84.
If an iterable of stop IDs is given, then subset to those stops. If some of the given stop IDs are not found in the feed, then raise a ValueError.
- subset_dates(dates: list[str]) list[str] ¶
Given a Feed and a list of YYYYMMDD date strings, return the sublist of dates that lie in the Feed’s dates (the output of feed.get_dates()).
- to_file(path: Path, ndigits: int | None = None) None ¶
Write this Feed to the given path. If the path ends in ‘.zip’, then write the feed as a zip archive. Otherwise assume the path is a directory, and write the feed as a collection of CSV files to that directory, creating the directory if it does not exist. Round all decimals to ndigits decimal places, if given. All distances will be in the distance units feed.dist_units. By the way, 6 decimal degrees of latitude and longitude is enough to locate an individual cat.
- trips_to_geojson(trip_ids: Iterable[str] | None = None, *, include_stops: bool = False) dict ¶
Return a GeoJSON FeatureCollection of LineString features representing all the Feed’s trips. The coordinates reference system is the default one for GeoJSON, namely WGS84.
If include_stops, then include the trip stops as Point features. If an iterable of trip IDs is given, then subset to those trips. If any of the given trip IDs are not found in the feed, then raise a ValueError. If the Feed has no shapes, then raise a ValueError.
- ungeometrize_stops() DataFrame ¶
The inverse of geometrize_stops().
If stops_g is in UTM coordinates (has a UTM CRS property), then convert those UTM coordinates back to WGS84 coordinates, which is the standard for a GTFS stops table.
- gtfs_kit.feed.list_feed(path: Path) DataFrame ¶
Given a path (string or Path object) to a GTFS zip file or directory, record the file names and file sizes of the contents, and return the result in a DataFrame with the columns:
'file_name'
'file_size'
- gtfs_kit.feed.read_feed(path_or_url: Path | str, dist_units: str) Feed ¶
Create a Feed instance from the given path or URL and given distance units. If the path exists, then call _read_feed_from_path(). Else if the URL has OK status according to Requests, then call _read_feed_from_url(). Else raise a ValueError.
Notes:
Ignore non-GTFS files in the feed
Automatically strip whitespace from the column names in GTFS files