GTFS Kit Polars 1.0.0 Documentation
Introduction
GTFS Kit Polars is a Python 3.12+ library for analyzing General Transit Feed Specification (GTFS) data. It uses Polars and Polars ST LazyFrames to do the heavy lifting.
The functions/methods of GTFS Kit Polars assume a valid GTFS feed but offer no inbuilt validation, because GTFS validation is complex and already solved by dedicated libraries. So unless you know what you’re doing, use the Canonical GTFS Validator before you analyze a feed with GTFS Kit Polars.
GTFS Kit Polars is an experimental port of the GTFS Kit library from Pandas to Polars. It can process large feeds much faster than the Pandas version, and if it proves useful enough, then I’ll incorporate it into GTFS Kit as a new release.
The one thing I don’t like about this Polars version is its dependence on Polars ST, a promising new geospatial library but one that is not yet as user-friendly as GeoPandas.
Installation
Install it from PyPI with UV, say, via uv add gtfs_kit_polars.
Examples
See the Jupyter notebook of examples in the project’s GitHub repository.
Conventions
In conformance with GTFS, dates are encoded as YYYYMMDD date strings, and times are encoded as HH:MM:SS time strings with the possibility that HH > 23. Watch out for that possibility, because it has counterintuitive consequences; see e.g. trips.is_active_trip(), which is used in routes.compute_route_stats(), stops.compute_stop_stats(), and miscellany.compute_network_stats().
‘DataFrame’ and ‘Series’ refer to Pandas DataFrame and Series objects, respectively.
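To make the time convention concrete, here is a minimal sketch of how an HH:MM:SS time with a large hour value maps onto a calendar date. The helper name service_datetime is hypothetical and not part of GTFS Kit Polars; it treats the time as an offset from midnight of the service date.

```python
from datetime import datetime, timedelta


def service_datetime(date_str: str, time_str: str) -> datetime:
    """Hypothetical helper: combine a YYYYMMDD service date with an
    HH:MM:SS time string whose hours may exceed 23."""
    hours, minutes, seconds = (int(part) for part in time_str.split(":"))
    midnight = datetime.strptime(date_str, "%Y%m%d")
    # Adding a timedelta lets hours >= 24 roll over into the next day.
    return midnight + timedelta(hours=hours, minutes=minutes, seconds=seconds)


# A trip on service date 20240101 that departs at 25:10:00 actually
# departs at 01:10:00 on the following calendar day.
departure = service_datetime("20240101", "25:10:00")
```

So a trip can belong to one service date while running on the next calendar date, which is why activity checks by date and time need care.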
Module constants
Constants useful across modules.
- gtfs_kit_polars.constants.COLORS_SET2 = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3']
Colorbrewer 8-class Set2 colors
- gtfs_kit_polars.constants.DIST_UNITS = ['ft', 'mi', 'm', 'km']
Valid distance units
- gtfs_kit_polars.constants.DTYPES = {'agency': {'agency_email': String, 'agency_fare_url': String, 'agency_id': String, 'agency_lang': String, 'agency_name': String, 'agency_phone': String, 'agency_timezone': String, 'agency_url': String}, 'attributions': {'agency_id': String, 'attribution_email': String, 'attribution_id': String, 'attribution_phone': String, 'attribution_url': String, 'is_authority': Int8, 'is_operator': Int8, 'is_producer': Int8, 'organization_name': String, 'route_id': String, 'trip_id': String}, 'calendar': {'end_date': String, 'friday': Int8, 'monday': Int8, 'saturday': Int8, 'service_id': String, 'start_date': String, 'sunday': Int8, 'thursday': Int8, 'tuesday': Int8, 'wednesday': Int8}, 'calendar_dates': {'date': String, 'exception_type': Int8, 'service_id': String}, 'fare_attributes': {'currency_type': String, 'fare_id': String, 'payment_method': Int8, 'price': Float64, 'transfer_duration': Int16, 'transfers': Int8}, 'fare_rules': {'contains_id': String, 'destination_id': String, 'fare_id': String, 'origin_id': String, 'route_id': String}, 'feed_info': {'feed_end_date': String, 'feed_lang': String, 'feed_publisher_name': String, 'feed_publisher_url': String, 'feed_start_date': String, 'feed_version': String}, 'frequencies': {'end_time': String, 'exact_times': Int8, 'headway_secs': Int16, 'start_time': String, 'trip_id': String}, 'routes': {'agency_id': String, 'route_color': String, 'route_desc': String, 'route_id': String, 'route_long_name': String, 'route_short_name': String, 'route_text_color': String, 'route_type': Int8, 'route_url': String}, 'shapes': {'shape_dist_traveled': Float64, 'shape_id': String, 'shape_pt_lat': Float64, 'shape_pt_lon': Float64, 'shape_pt_sequence': Int32}, 'stop_times': {'arrival_time': String, 'departure_time': String, 'drop_off_type': Int8, 'pickup_type': Int8, 'shape_dist_traveled': Float64, 'stop_headsign': String, 'stop_id': String, 'stop_sequence': Int32, 'timepoint': Int8, 'trip_id': String}, 'stops': 
{'location_type': Int8, 'parent_station': String, 'stop_code': String, 'stop_desc': String, 'stop_id': String, 'stop_lat': Float64, 'stop_lon': Float64, 'stop_name': String, 'stop_timezone': String, 'stop_url': String, 'wheelchair_boarding': Int8, 'zone_id': String}, 'transfers': {'from_stop_id': String, 'min_transfer_time': Int16, 'to_stop_id': String, 'transfer_type': Int8}, 'trips': {'bikes_allowed': Int8, 'block_id': String, 'direction_id': Int8, 'route_id': String, 'service_id': String, 'shape_id': String, 'trip_headsign': String, 'trip_id': String, 'trip_short_name': String, 'wheelchair_accessible': Int8}}
GTFS data types (Polars dtypes)
- gtfs_kit_polars.constants.FEED_ATTRS = ['dist_units', 'unzip_dir', 'agency', 'attributions', 'calendar', 'calendar_dates', 'fare_attributes', 'fare_rules', 'feed_info', 'frequencies', 'routes', 'shapes', 'stops', 'stop_times', 'trips', 'transfers']
Feed attributes
- gtfs_kit_polars.constants.WGS84 = 4326
WGS84 coordinate reference system (used by spatial ops)
Module helpers
Functions useful across modules.
- gtfs_kit_polars.helpers.are_equal(f: DataFrame | LazyFrame, g: DataFrame | LazyFrame) bool
Return True if and only if the tables are equal after sorting column names and sorting rows by all columns. Nulls are treated as equal.
- gtfs_kit_polars.helpers.combine_time_series(series_by_indicator: dict[str, DataFrame | LazyFrame], *, kind: Literal['route', 'stop'], split_directions: bool = False) LazyFrame
Combine a dict of wide time series (one table per indicator, columns are entities) into a single long-form time series with columns
'datetime'
'route_id' or 'stop_id': depending on kind
'direction_id': present if and only if split_directions
one column per indicator provided in series_by_indicator
'service_speed': if both service_distance and service_duration present
If split_directions, then assume the original time series contains data separated by trip direction; otherwise, assume not. The separation is indicated by a suffix '-0' (direction 0) or '-1' (direction 1) in the route ID or stop ID column values.
- gtfs_kit_polars.helpers.date_to_datestr(x: date | None, format_str: str = '%Y%m%d') str | None
Convert a datetime.date to a formatted string. Return None if x is None.
- gtfs_kit_polars.helpers.datestr_to_date(x: str | None, format_str: str = '%Y%m%d') date | None
Convert a date string to a datetime.date. Return None if x is None.
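As an illustration of the two date helpers above, the following sketch shows one way such conversions could be written with the standard library. The _sketch function names are hypothetical stand-ins, not the library's implementations.

```python
from __future__ import annotations

from datetime import date, datetime


def date_to_datestr_sketch(x: date | None, format_str: str = "%Y%m%d") -> str | None:
    """Format a date as a string; pass None through unchanged."""
    return None if x is None else x.strftime(format_str)


def datestr_to_date_sketch(x: str | None, format_str: str = "%Y%m%d") -> date | None:
    """Parse a date string into a date; pass None through unchanged."""
    return None if x is None else datetime.strptime(x, format_str).date()


leap_day = datestr_to_date_sketch("20240229")
round_trip = date_to_datestr_sketch(leap_day)
```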
- gtfs_kit_polars.helpers.downsample(time_series: DataFrame | LazyFrame, num_minutes: int) LazyFrame
Downsample the given route, stop, or network time series (outputs of routes.compute_route_time_series(), stops.compute_stop_time_series(), or miscellany.compute_network_time_series(), respectively) to time bins of size num_minutes minutes.
Return the given time series unchanged if it’s empty or has only one time bin per date. Raise a ValueError if num_minutes does not evenly divide 1440 (the number of minutes in a day) or if it’s not a multiple of the bin size of the given time series.
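The two conditions on num_minutes can be sketched as follows, together with the simplest kind of downsampling, namely summing consecutive bins. Both function names are hypothetical, and the sketch works on a plain list of per-bin values rather than on a Polars table.

```python
MINUTES_PER_DAY = 1440


def check_num_minutes(num_minutes: int, bin_size: int) -> None:
    """Raise ValueError unless num_minutes divides a day evenly and is a
    multiple of the current bin size (both given in minutes)."""
    if num_minutes <= 0 or MINUTES_PER_DAY % num_minutes != 0:
        raise ValueError("num_minutes must evenly divide 1440")
    if num_minutes % bin_size != 0:
        raise ValueError("num_minutes must be a multiple of the bin size")


def downsample_sums(values: list[float], bin_size: int, num_minutes: int) -> list[float]:
    """Sum consecutive groups of bins to form coarser bins."""
    check_num_minutes(num_minutes, bin_size)
    k = num_minutes // bin_size  # number of old bins per new bin
    return [sum(values[i : i + k]) for i in range(0, len(values), k)]


hourly = [1] * 24  # one value per 60-minute bin
two_hourly = downsample_sums(hourly, bin_size=60, num_minutes=120)
```

Note that 90 divides 1440 but is not a multiple of 60, so an hourly series cannot be downsampled to 90-minute bins under these rules.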
- gtfs_kit_polars.helpers.get_bin_size(time_series: LazyFrame) float
Return the number of minutes per bin of the given time series with datetime column ‘datetime’. Assume the time series is regularly sampled and therefore has a single bin size. Return None if there’s only one unique datetime present.
- gtfs_kit_polars.helpers.get_convert_dist(dist_units_in: str, dist_units_out: str) Callable
Return a Polars expression builder for distance conversion: expr_or_col -> expr * factor.
Only supports units in constants.DIST_UNITS. Usage:
.with_columns(distance_km = get_convert_dist("m", "km")("distance_m"))
.with_columns(distance_mi = get_convert_dist("km", "mi")(pl.col("dist")))
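The factor in such a conversion is the ratio of the two units expressed in a common base. Here is a minimal pure-Python sketch, using the exact definitions 1 ft = 0.3048 m and 1 mi = 1609.344 m. The names METERS_PER_UNIT and conversion_factor are hypothetical and only illustrate the arithmetic.

```python
# Length of each supported unit in meters (exact by definition).
METERS_PER_UNIT = {"ft": 0.3048, "mi": 1609.344, "m": 1.0, "km": 1000.0}


def conversion_factor(dist_units_in: str, dist_units_out: str) -> float:
    """Return the factor f such that distance_out = f * distance_in."""
    for units in (dist_units_in, dist_units_out):
        if units not in METERS_PER_UNIT:
            raise ValueError(f"Unsupported distance units: {units!r}")
    return METERS_PER_UNIT[dist_units_in] / METERS_PER_UNIT[dist_units_out]


miles_to_km = conversion_factor("mi", "km")
```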
- gtfs_kit_polars.helpers.get_srid(g: DataFrame | LazyFrame) int
Table version of the Polars ST function srid.
- gtfs_kit_polars.helpers.get_utm_srid(g: GeoDataFrame | GeoLazyFrame) int
Return the UTM SRID for the given geotable.
- gtfs_kit_polars.helpers.get_utm_srid_0(lon, lat)
Given the WGS84 longitude and latitude of a point, return its UTM SRID.
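For reference, the usual rule for picking a UTM zone from a WGS84 point can be sketched as below. This is the standard 6-degree zone formula with EPSG codes 326xx (north) and 327xx (south). It ignores the special zones around Norway and Svalbard, and it is not necessarily how get_utm_srid_0() is implemented.

```python
import math


def utm_srid_sketch(lon: float, lat: float) -> int:
    """Return the EPSG code of the UTM zone containing the WGS84 point."""
    # Zones are 6 degrees wide and numbered 1..60 starting at 180 degrees west.
    zone = min(int(math.floor((lon + 180) / 6)) + 1, 60)
    base = 32600 if lat >= 0 else 32700  # northern or southern hemisphere
    return base + zone


auckland = utm_srid_sketch(174.76, -36.85)
portland = utm_srid_sketch(-122.68, 45.52)
```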
- gtfs_kit_polars.helpers.height(f: pl.DataFrame | pl.LazyDataFrame) int
- gtfs_kit_polars.helpers.is_empty(f: pl.DataFrame | pl.LazyDataFrame) bool
- gtfs_kit_polars.helpers.is_metric(dist_units: str) bool
Return True if the given distance units equal ‘m’ or ‘km’; otherwise return False.
- gtfs_kit_polars.helpers.is_not_null(f: DataFrame | LazyFrame, col_name: str) bool
Return True if the given table has a column of the given name (string), and there exists at least one non-NaN value in that column; return False otherwise.
- gtfs_kit_polars.helpers.longest_subsequence(seq, mode='strictly', order='increasing', key=None, *, index=False) list
Return the longest increasing subsequence of seq.
- Parameters:
seq (sequence object) – Can be any sequence, like str, list, numpy.array.
mode ({'strict', 'strictly', 'weak', 'weakly'}, optional) – If set to ‘strict’, the subsequence will contain unique elements. Using ‘weak’ an element can be repeated many times. Modes ending in -ly serve as a convenience to use with order parameter, because longest_sequence(seq, ‘weakly’, ‘increasing’) reads better. The default is ‘strict’.
order ({'increasing', 'decreasing'}, optional) – By default return the longest increasing subsequence, but it is possible to return the longest decreasing sequence as well.
key (function, optional) – Specifies a function of one argument that is used to extract a comparison key from each list element (e.g., str.lower, lambda x: x[0]). The default value is None (compare the elements directly).
index (bool, optional) – If set to True, return the indices of the subsequence, otherwise return the elements. Default is False.
- Returns:
elements (list, optional) – A list of elements of the longest subsequence. Returned by default and when index is set to False.
indices (list, optional) – A list of indices pointing to elements in the longest subsequence. Returned when index is set to True.
Taken from this Stack Overflow answer.
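For readers who want to see the idea, here is a compact sketch of a longest increasing subsequence search based on patience sorting with binary search. It is not the Stack Overflow code that the library uses, and it supports only increasing order with no key function. The function name and the strict flag are choices made for this illustration.

```python
from bisect import bisect_left, bisect_right


def longest_increasing_subsequence(seq, *, strict=True, index=False):
    """Return one longest increasing subsequence of seq, or its indices."""
    tails = []        # tails[k]: index of the smallest tail of a run of length k + 1
    tail_values = []  # seq values at those indices, kept sorted for bisect
    prev = [None] * len(seq)  # back-pointers for reconstructing the answer
    find = bisect_left if strict else bisect_right
    for i, x in enumerate(seq):
        k = find(tail_values, x)
        prev[i] = tails[k - 1] if k > 0 else None
        if k == len(tails):
            tails.append(i)
            tail_values.append(x)
        else:
            tails[k] = i
            tail_values[k] = x
    # Walk the back-pointers from the tail of the longest run.
    indices = []
    i = tails[-1] if tails else None
    while i is not None:
        indices.append(i)
        i = prev[i]
    indices.reverse()
    return indices if index else [seq[i] for i in indices]


values = [3, 1, 4, 1, 5, 9, 2, 6]
lis = longest_increasing_subsequence(values)
lis_idx = longest_increasing_subsequence(values, index=True)
```

With strict=False, repeated elements may appear in the result, which corresponds to the 'weak' mode described above.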
- gtfs_kit_polars.helpers.make_html(d: dict) str
Convert the given dictionary into an HTML table (string) with two columns: keys of dictionary, values of dictionary.
- gtfs_kit_polars.helpers.make_ids(n: int, prefix: str = 'id_') list[str]
Return a length n list of unique sequentially labelled strings for use as IDs.
Example:
>>> make_ids(11, prefix="s")
['s00', 's01', 's02', 's03', 's04', 's05', 's06', 's07', 's08', 's09', 's10']
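The zero padding in the example output can be reproduced with a short sketch like the one below. The name make_ids_sketch is hypothetical, and the padding rule (the width of the largest index) is an assumption inferred from the example, not taken from the library source.

```python
def make_ids_sketch(n: int, prefix: str = "id_") -> list[str]:
    """Return n unique IDs, zero-padded so that they sort lexicographically."""
    width = len(str(max(n - 1, 0)))  # digits needed for the largest index
    return [f"{prefix}{i:0{width}d}" for i in range(n)]


ids = make_ids_sketch(11, prefix="s")
```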
- gtfs_kit_polars.helpers.make_lazy(f: DataFrame | LazyFrame) LazyFrame
- gtfs_kit_polars.helpers.replace_date(f: DataFrame | LazyFrame, date: str) DataFrame | LazyFrame
Given a table with a datetime object column called ‘datetime’ and given a YYYYMMDD date string, replace the datetime dates with the given date and return the resulting table.
- gtfs_kit_polars.helpers.seconds_to_timestr(col: str, *, mod24: bool = False) Expr
- gtfs_kit_polars.helpers.seconds_to_timestr_0(x: int, *, mod24: bool = False) str | None
The inverse of timestr_to_seconds(). If mod24, then first take the number of seconds modulo 24*3600. Return None in case of bad inputs.
- gtfs_kit_polars.helpers.timestr_to_min(col: str) Expr
- gtfs_kit_polars.helpers.timestr_to_seconds(col: str, *, mod24: bool = False) Expr
- gtfs_kit_polars.helpers.timestr_to_seconds_0(x: str, *, mod24: bool = False) int | None
Given an HH:MM:SS time string x, return the number of seconds past midnight that it represents. In keeping with GTFS standards, the hours entry may be greater than 23. If mod24, then return the number of seconds modulo 24*3600. Return np.nan in case of bad inputs.
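The scalar conversions described above can be sketched in pure Python as follows. The _sketch names are hypothetical, and returning None for bad inputs is a choice made here to match the documented return types. The library's own handling of malformed strings may differ.

```python
from __future__ import annotations

import re

# One or more hour digits, then two-digit minutes and seconds below 60.
_TIME_RE = re.compile(r"^(\d+):([0-5]\d):([0-5]\d)$")


def timestr_to_seconds_sketch(x: str | None, *, mod24: bool = False) -> int | None:
    """Convert HH:MM:SS (hours may exceed 23) to seconds past midnight."""
    if not isinstance(x, str):
        return None
    match = _TIME_RE.match(x.strip())
    if match is None:
        return None
    hours, minutes, seconds = (int(g) for g in match.groups())
    total = 3600 * hours + 60 * minutes + seconds
    return total % (24 * 3600) if mod24 else total


def seconds_to_timestr_sketch(x: int | None, *, mod24: bool = False) -> str | None:
    """Convert seconds past midnight back to an HH:MM:SS string."""
    if not isinstance(x, int) or isinstance(x, bool) or x < 0:
        return None
    if mod24:
        x %= 24 * 3600
    hours, remainder = divmod(x, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


late_night = timestr_to_seconds_sketch("25:10:00")
wrapped = seconds_to_timestr_sketch(late_night, mod24=True)
```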
- gtfs_kit_polars.helpers.to_srid(g: DataFrame | LazyFrame, srid: int) DataFrame | LazyFrame
Table version of the Polars ST function to_srid.
Module cleaners
Functions about cleaning feeds.
- gtfs_kit_polars.cleaners.aggregate_routes(feed: Feed, by: str = 'route_short_name', route_id_prefix: str = 'route_') Feed
Aggregate routes by route short name, say, and assign new route IDs using the given prefix.
More specifically, create new route IDs with the function build_aggregate_routes_table() and the parameters by and route_id_prefix and update the old route IDs to the new ones in all the relevant Feed tables. Return the resulting Feed.
- gtfs_kit_polars.cleaners.aggregate_stops(feed: Feed, by: str = 'stop_code', stop_id_prefix: str = 'stop_') Feed
Aggregate stops by the column by and assign new stop IDs using the given prefix. Update IDs in stops, stop_times, and transfers. Return the resulting Feed.
- gtfs_kit_polars.cleaners.build_aggregate_routes_table(routes: DataFrame | LazyFrame, by: str = 'route_short_name', route_id_prefix: str = 'route_') LazyFrame
Group routes by the by column and assign one new route ID per group using the given prefix. Return a table with columns
route_id
new_route_id
- gtfs_kit_polars.cleaners.build_aggregate_stops_table(stops: DataFrame | LazyFrame, by: str = 'stop_code', stop_id_prefix: str = 'stop_') LazyFrame
Group stops by the by column and assign one new stop ID per group using the given prefix. Return a table with columns
stop_id
new_stop_id
- gtfs_kit_polars.cleaners.clean(feed: Feed) Feed
Apply the following functions to the given Feed in order and return the resulting Feed.
- gtfs_kit_polars.cleaners.clean_column_names(f: DataFrame | LazyFrame) DataFrame | LazyFrame
Strip the whitespace from all column names in the given table and return the result.
- gtfs_kit_polars.cleaners.clean_ids(feed: Feed) Feed
In the given Feed, strip whitespace from all string IDs and then replace every remaining whitespace chunk with an underscore. Return the resulting Feed.
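On a single ID string, the cleaning rule above amounts to the following sketch. The name clean_id is hypothetical. The library applies the rule to whole ID columns rather than to individual strings.

```python
import re


def clean_id(value: str) -> str:
    """Strip outer whitespace, then replace each inner whitespace run with '_'."""
    return re.sub(r"\s+", "_", value.strip())


cleaned = clean_id("  route  1\tA ")
```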
- gtfs_kit_polars.cleaners.clean_route_short_names(feed: Feed) Feed
In feed.routes, assign ‘n/a’ to missing route short names and strip whitespace from route short names. Then disambiguate each route short name that is duplicated by appending ‘-’ and its route ID. Return the resulting Feed.
- gtfs_kit_polars.cleaners.clean_times(feed: Feed) Feed
In the given Feed, convert H:MM:SS time strings to HH:MM:SS time strings to make sorting by time work as expected. Return the resulting Feed.
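The reason for the padding is that time strings are compared lexicographically, so '9:05:00' sorts after '10:00:00'. Here is a minimal sketch of the fix on plain strings. The name pad_timestr is hypothetical.

```python
from __future__ import annotations


def pad_timestr(t: str | None) -> str | None:
    """Left-pad an H:MM:SS string to HH:MM:SS; leave other values alone."""
    if t is None:
        return None
    t = t.strip()
    return t.zfill(8) if len(t) == 7 else t


raw = ["10:00:00", "9:05:00", "25:30:00"]
unpadded_order = sorted(raw)                     # '9:05:00' lands last
padded_order = sorted(pad_timestr(t) for t in raw)
```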
- gtfs_kit_polars.cleaners.drop_invalid_columns(feed: Feed) Feed
Drop all table columns of the given Feed that are not listed in the GTFS. Return the resulting Feed.
- gtfs_kit_polars.cleaners.drop_zombies(feed: Feed) Feed
In the given Feed, do the following in order and return the resulting Feed.
Drop agencies with no routes.
Drop stops of location type 0 or None with no stop times.
Remove undefined parent stations from the parent_station column.
Drop trips with no stop times.
Drop shapes with no trips.
Drop routes with no trips.
Drop services with no trips.
- gtfs_kit_polars.cleaners.extend_id(feed: Feed, id_col: str, extension: str, *, prefix=True) Feed
Add a prefix (if prefix) or a suffix (otherwise) to all values of column id_col across all tables of this Feed. This can be helpful when preparing to merge multiple GTFS feeds with colliding route IDs, say.
Raises a ValueError if id_col values are not strings, e.g. if id_col is ‘direction_id’.
Module calendar
Functions about calendar and calendar_dates.
- gtfs_kit_polars.calendar.get_dates(feed: Feed, *, as_date_obj: bool = False) list[str] | list[dt.date]
Return the inclusive date range covered by feed.calendar and feed.calendar_dates as consecutive days. If neither table yields dates, return the empty list.
If as_date_obj, then return datetime.date objects instead.
Note that this is a range and not the set of actual service days.
- gtfs_kit_polars.calendar.get_first_week(feed: Feed, *, as_date_obj: bool = False) list[str] | list[dt.date]
Return a list of YYYYMMDD date strings for the first Monday–Sunday week (or initial segment thereof) for which the given Feed is valid. If the feed has no Mondays, then return the empty list.
If as_date_obj, then return datetime.date objects instead.
- gtfs_kit_polars.calendar.get_week(feed: Feed, k: int, *, as_date_obj: bool = False) list[str] | list[dt.date]
Given a Feed and a positive integer k, return a list of YYYYMMDD date strings corresponding to the kth Monday–Sunday week (or initial segment thereof) for which the Feed is valid. For example, k=1 returns the first Monday–Sunday week (or initial segment thereof). If the Feed does not have k Mondays, then return the empty list.
If as_date_obj, then return datetime.date objects instead.
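The week selection rule can be sketched on a plain list of consecutive dates, as below. The function kth_week is hypothetical and takes the Feed's date range as input instead of a Feed object.

```python
from datetime import date, timedelta


def kth_week(dates: list[date], k: int) -> list[date]:
    """Given sorted consecutive dates, return the kth Monday-to-Sunday week,
    truncated at the last available date; return [] if there is no kth Monday."""
    mondays = [d for d in dates if d.weekday() == 0]
    if k < 1 or len(mondays) < k:
        return []
    start = mondays[k - 1]
    end = min(start + timedelta(days=6), dates[-1])
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


# A range from Wednesday 2024-01-03 to Wednesday 2024-01-17.
feed_dates = [date(2024, 1, 3) + timedelta(days=i) for i in range(15)]
first_week = kth_week(feed_dates, 1)
second_week = kth_week(feed_dates, 2)
```

Here the first week runs from Monday 8 January to Sunday 14 January, and the second week is only the initial segment from 15 to 17 January.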
Module routes
Functions about routes.
- gtfs_kit_polars.routes.build_route_timetable(feed: Feed, route_id: str, dates: list[str]) pl.LazyFrame
Return a timetable for the given route and dates (YYYYMMDD date strings).
Return a table whose columns are all those in feed.trips plus those in feed.stop_times plus 'date'. The trip IDs are restricted to the given route ID. The result is sorted first by date and then by grouping by trip ID and sorting the groups by their first departure time.
Skip dates outside of the Feed’s dates.
If there is no route activity on the given dates, then return an empty table.
- gtfs_kit_polars.routes.compute_route_stats(feed: Feed, dates: list[str], trip_stats: pl.DataFrame | pl.LazyFrame | None = None, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) pl.LazyFrame
Compute route stats for all the trips that lie in the given subset of trip stats, which defaults to feed.compute_trip_stats(), and that start on the given dates (YYYYMMDD date strings).
If split_directions, then separate the stats by trip direction (0 or 1). Use the headway start and end times to specify the time period for computing headway stats.
Return a table with the columns
'date'
'route_id'
'route_short_name'
'route_type'
'direction_id': present if and only if split_directions
'num_trips': number of trips on the route in the subset
'num_trip_starts': number of trips on the route with nonnull start times
'num_trip_ends': number of trips on the route with nonnull end times that end before 23:59:59
'num_stop_patterns': number of stop patterns across trips
'is_loop': 1 if at least one of the trips on the route has its is_loop field equal to 1; 0 otherwise
'is_bidirectional': 1 if the route has trips in both directions; 0 otherwise; present if and only if not split_directions
'start_time': start time of the earliest trip on the route
'end_time': end time of latest trip on the route
'max_headway': maximum of the durations (in minutes) between trip starts on the route between headway_start_time and headway_end_time on the given dates
'min_headway': minimum of the durations (in minutes) mentioned above
'mean_headway': mean of the durations (in minutes) mentioned above
'peak_num_trips': maximum number of simultaneous trips in service (for the given direction, or for both directions when split_directions==False)
'peak_start_time': start time of first longest period during which the peak number of trips occurs
'peak_end_time': end time of first longest period during which the peak number of trips occurs
'service_duration': total of the duration of each trip on the route in the given subset of trips; measured in hours
'service_distance': total of the distance traveled by each trip on the route in the given subset of trips; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_speed': service_distance/service_duration when defined; 0 otherwise
'mean_trip_distance': service_distance/num_trips
'mean_trip_duration': service_duration/num_trips
Exclude dates with no active trips, which could yield an empty table.
If not split_directions, then compute each route’s stats, except for headways, using its trips running in both directions. For headways, (1) compute max headway by taking the max of the max headways in both directions; (2) compute mean headway by taking the weighted mean of the mean headways in both directions.
Notes
If you’ve already computed trip stats in your workflow, then you should pass that table into this function to speed things up significantly.
The route stats for date d contain stats for trips that start on date d only and ignore trips that start on date d-1 and end on date d.
Raise a ValueError if split_directions and no non-null direction ID values are present.
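To illustrate the headway columns, here is a sketch that computes the max, min, and mean gaps between consecutive trip starts within a time window. The function name headway_stats, the inclusive window bounds, and the use of seconds past midnight as input are assumptions made for this example. The library's exact treatment of ties and window edges may differ.

```python
from __future__ import annotations


def headway_stats(
    start_times: list[int], window_start: int, window_end: int
) -> dict[str, float] | None:
    """Compute headways in minutes from trip start times in seconds past
    midnight, restricted to the window [window_start, window_end]."""
    in_window = sorted(t for t in start_times if window_start <= t <= window_end)
    # Headways are the gaps between consecutive trip starts.
    gaps = [(b - a) / 60 for a, b in zip(in_window, in_window[1:])]
    if not gaps:
        return None  # fewer than two trip starts in the window
    return {
        "max_headway": max(gaps),
        "min_headway": min(gaps),
        "mean_headway": sum(gaps) / len(gaps),
    }


# Trip starts at 06:00, 07:00, 07:10, 07:30, and 08:30; window 07:00 to 19:00.
starts = [21600, 25200, 25800, 27000, 30600]
stats = headway_stats(starts, window_start=25200, window_end=68400)
```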
- gtfs_kit_polars.routes.compute_route_stats_0(trip_stats: DataFrame | LazyFrame, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) LazyFrame
Compute stats for the given subset of trip stats (of the form output by the function trips.compute_trip_stats()).
Ignore trips with zero duration, because they are defunct.
If split_directions, then separate the stats by trip direction (0 or 1). Use the headway start and end times to specify the time period for computing headway stats.
Return a table with the columns
'route_id'
'route_short_name'
'route_type'
'direction_id': present if and only if split_directions
'num_trips': number of trips on the route in the subset
'num_trip_starts': number of trips on the route with nonnull start times
'num_trip_ends': number of trips on the route with nonnull end times that end before 23:59:59
'num_stop_patterns': number of stop patterns across trips
'is_loop': True if at least one of the trips on the route has its is_loop field equal to True; False otherwise
'is_bidirectional': True if the route has trips in both directions; False otherwise; present if and only if not split_directions
'start_time': start time of the earliest trip on the route
'end_time': end time of latest trip on the route
'max_headway': maximum of the durations (in minutes) between trip starts on the route between headway_start_time and headway_end_time on the given dates
'min_headway': minimum of the durations (in minutes) mentioned above
'mean_headway': mean of the durations (in minutes) mentioned above
'peak_num_trips': maximum number of simultaneous trips in service (for the given direction, or for both directions when split_directions==False)
'peak_start_time': start time of first longest period during which the peak number of trips occurs
'peak_end_time': end time of first longest period during which the peak number of trips occurs
'service_duration': total of the duration of each trip on the route in the given subset of trips; measured in hours
'service_distance': total of the distance traveled by each trip on the route in the given subset of trips; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_speed': service_distance/service_duration
'mean_trip_distance': service_distance/num_trips
'mean_trip_duration': service_duration/num_trips
If trip_stats is empty, return an empty table.
Raise a ValueError if split_directions and no non-NaN direction ID values are present.
- gtfs_kit_polars.routes.compute_route_time_series(feed: Feed, dates: list[str], trip_stats: pl.DataFrame | pl.LazyFrame | None = None, num_minutes: int = 60, *, split_directions: bool = False) pl.LazyFrame
Compute route stats in time series form at the given num_minutes frequency for the trips that lie in the trip stats subset, which defaults to the output of trips.compute_trip_stats(), and that start on the given dates (YYYYMMDD date strings).
If split_directions, then separate each route’s stats by trip direction.
Return a time series table with the following columns.
datetime: datetime object
route_id
direction_id: direction of route; present if and only if split_directions
num_trips: number of trips in service on the route at any time within the time bin
num_trip_starts: number of trips that start within the time bin
num_trip_ends: number of trips that end within the time bin, ignoring trips that end past midnight
service_distance: sum of the service distance accrued during the time bin across all trips on the route; measured in kilometers if feed.dist_units is metric; otherwise measured in miles
service_duration: sum of the service duration accrued during the time bin across all trips on the route; measured in hours
service_speed: service_distance/service_duration for the route
Exclude dates that lie outside of the Feed’s date range. If all dates lie outside the Feed’s date range, then return an empty table.
Notes
If you’ve already computed trip stats in your workflow, then you should pass that table into this function to speed things up significantly.
If a route does not run on a given date, then it won’t appear in the time series for that date.
See the notes for compute_route_time_series_0().
Raise a ValueError if split_directions and no non-null direction ID values are present.
- gtfs_kit_polars.routes.compute_route_time_series_0(trip_stats: DataFrame | LazyFrame, date_label: str = '20010101', num_minutes: int = 60, *, split_directions: bool = False) LazyFrame
Compute stats in a 24-hour time series form at the num_minutes frequency for the given subset of trip stats of the form output by the function trips.compute_trip_stats().
If split_directions, then separate each route’s stats by trip direction. Use the given YYYYMMDD date label as the date in the time series index.
Return a long-format table with the columns
datetime: datetime object
route_id
direction_id: direction of route; present if and only if split_directions
num_trips: number of trips in service on the route at any time within the time bin
num_trip_starts: number of trips that start within the time bin
num_trip_ends: number of trips that end within the time bin, ignoring trips that end past midnight
service_distance: sum of the service distance accrued during the time bin across all trips on the route; measured in kilometers if feed.dist_units is metric; otherwise measured in miles
service_duration: sum of the service duration accrued during the time bin across all trips on the route; measured in hours
service_speed: service_distance/service_duration for the route
Notes
Trips that lack start or end times are ignored, so the aggregate num_trips across the day could be less than the num_trips column of compute_route_stats_0().
All trip departure times are taken modulo 24 hours. So routes with trips that end past 23:59:59 will have all their stats wrap around to the early morning of the time series, except for their num_trip_ends indicator. Trip endings past 23:59:59 are not binned so that resampling the num_trips indicator works efficiently.
Note that the total number of trips for two consecutive time bins t1 < t2 is the sum of the number of trips in bin t2 plus the number of trip endings in bin t1. Thus we can downsample the num_trips indicator by keeping track of only one extra count, num_trip_ends, and can avoid recording individual trip IDs.
All other indicators are downsampled by summing.
Raise a ValueError if split_directions and no non-null direction ID values are present.
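The counting identity in the notes can be checked with a small sketch. It assumes trips are (start, end) pairs in minutes with start < end, that a trip is in service in a bin [a, b) when start < b and end > a, and that a trip ends in that bin when a < end <= b. These boundary conventions and the function names are assumptions chosen so that the identity holds exactly. The library's own binning conventions may differ, and trips ending past midnight are not modeled here.

```python
def bin_counts(trips: list[tuple[int, int]], bin_edges: list[int]) -> list[dict[str, int]]:
    """Count trips in service and trip endings for each bin [a, b)."""
    rows = []
    for a, b in zip(bin_edges, bin_edges[1:]):
        rows.append(
            {
                "num_trips": sum(1 for s, e in trips if s < b and e > a),
                "num_trip_ends": sum(1 for s, e in trips if a < e <= b),
            }
        )
    return rows


def merge_pair(first: dict[str, int], second: dict[str, int]) -> dict[str, int]:
    """Merge two consecutive bins using only the two stored counts."""
    return {
        # Trips in the merged bin: those in service in the second bin
        # plus those that already ended during the first bin.
        "num_trips": second["num_trips"] + first["num_trip_ends"],
        "num_trip_ends": first["num_trip_ends"] + second["num_trip_ends"],
    }


trips = [(0, 30), (20, 70), (65, 110), (10, 120), (60, 120)]
fine = bin_counts(trips, [0, 60, 120])   # two 60-minute bins
coarse = bin_counts(trips, [0, 120])     # one 120-minute bin
merged = merge_pair(fine[0], fine[1])
```

The merged counts agree with those computed directly on the coarse bin, without ever consulting individual trip IDs.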
- gtfs_kit_polars.routes.get_routes(feed: Feed, date: str | None = None, time: str | None = None, *, as_geo: bool = False, use_utm: bool = False, split_directions: bool = False) pl.LazyFrame | st.GeoLazyFrame
Return feed.routes or a subset thereof. If a YYYYMMDD date string is given, then restrict routes to only those active on the date. If a HH:MM:SS time string is given, possibly with HH > 23, then restrict routes to only those active during the time. If as_geo, return a geotable with all the columns of feed.routes plus a geometry column of (Multi)LineStrings, each of which represents the corresponding route’s shape.
If as_geo and feed.shapes is not None, then return the routes as a geotable with a ‘geometry’ column of (Multi)LineStrings. The geotable will have a local UTM SRID if use_utm; otherwise it will have the WGS84 SRID. If as_geo and split_directions, then add the column direction_id and split each route into the union of its direction 0 shapes and the union of its direction 1 shapes. If as_geo and feed.shapes is None, then raise a ValueError.
- gtfs_kit_polars.routes.map_routes(feed: Feed, route_ids: Iterable[str] | None = None, route_short_names: Iterable[str] | None = None, color_palette: Iterable[str] = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3'], *, show_stops: bool = False)
Return a Folium map showing the given routes and (optionally) their stops. At least one of route_ids and route_short_names must be given. If both are given, then combine the two into a single set of routes. If any of the given route IDs are not found in the feed, then raise a ValueError.
- gtfs_kit_polars.routes.routes_to_geojson(feed: Feed, route_ids: Iterable[str] | None = None, route_short_names: Iterable[str] | None = None, *, split_directions: bool = False, include_stops: bool = False) dict
Return a GeoJSON FeatureCollection (in WGS84 coordinates) of MultiLineString features representing this Feed’s routes.
If an iterable of route IDs or route short names is given, then subset to the union of those routes, which could yield an empty FeatureCollection in case of all invalid route IDs and route short names. If include_stops, then include the route stops as Point features. If the Feed has no shapes, then raise a ValueError.
Module shapes
Functions about shapes.
- gtfs_kit_polars.shapes.append_dist_to_shapes(feed: Feed) Feed
Calculate and append the optional shape_dist_traveled field in feed.shapes in terms of the distance units feed.dist_units. Return the resulting Feed.
As a benchmark, using this function on this Portland feed produces a shape_dist_traveled column that differs by at most 0.016 km in absolute value from the original values.
- gtfs_kit_polars.shapes.build_geometry_by_shape(feed: Feed, shape_ids: Iterable[str] | None = None, *, use_utm: bool = False) dict
Return a dictionary of the form <shape ID> -> <Shapely LineString representing shape>. If the Feed has no shapes, then return the empty dictionary. If use_utm, then use local UTM coordinates; otherwise, use WGS84 coordinates.
- gtfs_kit_polars.shapes.geometrize_shapes(shapes: DataFrame | LazyFrame, *, use_utm: bool = False) GeoLazyFrame
Given a GTFS shapes table, convert it to a geotable of LineStrings and return the result, which will no longer have the columns 'shape_pt_sequence', 'shape_pt_lon', 'shape_pt_lat', and 'shape_dist_traveled'.
If use_utm, then use local UTM coordinates for the geometries.
- gtfs_kit_polars.shapes.get_shapes(feed: Feed, *, as_geo: bool = False, use_utm: bool = False) pl.LazyFrame | None
Get the shapes table for the given feed, which could be None. If as_geo, then return it as a geotable with a ‘geometry’ column of LineStrings and no ‘shape_pt_sequence’, ‘shape_pt_lon’, ‘shape_pt_lat’, ‘shape_dist_traveled’ columns. The geotable will have a UTM SRID if use_utm; otherwise it will have a WGS84 SRID.
- gtfs_kit_polars.shapes.get_shapes_intersecting_geometry(feed: Feed, geometry: sg.base.BaseGeometry, shapes_g: st.GeoDataFrame | st.GeoLazyFrame = None, *, as_geo: bool = False) st.GeoLazyFrame | None
If the Feed has no shapes, then return None. Otherwise, return the subset of feed.shapes that contains all shapes that intersect the given Shapely WGS84 geometry, e.g. a Polygon or LineString.
If as_geo, then return the shapes as a geotable. Specifying shapes_g will skip the first step of the algorithm, namely, geometrizing feed.shapes.
- gtfs_kit_polars.shapes.shapes_to_geojson(feed: Feed, shape_ids: Iterable[str] | None = None) dict
Return a GeoJSON FeatureCollection of LineString features representing feed.shapes. If the Feed has no shapes, then the features will be an empty list. The coordinate reference system is the default one for GeoJSON, namely WGS84.
If an iterable of shape IDs is given, then subset to those shapes. If the subset is empty, then return a FeatureCollection with an empty list of features.
- gtfs_kit_polars.shapes.split_simple(shapes_g: GeoLazyFrame | GeoDataFrame) GeoLazyFrame
Given a geotable of GTFS shapes of the form output by geometrize_shapes() with possibly non-WGS84 coordinates, split each non-simple LineString into large simple (non-self-intersecting) sub-LineStrings, and leave the simple LineStrings as is.
Return a geotable in the coordinates of shapes_g with the columns
'shape_id': GTFS shape ID for a LineString L
'subshape_id': a unique identifier of a simple sub-LineString S of L
'subshape_sequence': integer; indicates the order of S when joining up all simple sub-LineStrings to form L
'subshape_length_m': the length of S in meters
'cum_length_m': the length of S plus the lengths of sub-LineStrings of L that come before S; in meters
'geometry': LineString geometry corresponding to S
Within each ‘shape_id’ group, the subshapes will be sorted increasingly by ‘subshape_sequence’.
Notes
Simplicity checks and splitting are done in local UTM coordinates. Converting back to original coordinates can introduce rounding errors and non-simplicities. So test this function with a shapes_g in local UTM coordinates.
By construction, for each given LineString L with simple sub-LineStrings S_i, we have the inequality
sum over i of length(S_i) <= length(L),
where the lengths are expressed in meters.
- gtfs_kit_polars.shapes.split_simple_0(ls: LineString) list[LineString]
Split the given LineString into simple sub-LineStrings by greedily building the segments from the curve points and binary search, checking for simplicity at every step.
- gtfs_kit_polars.shapes.ungeometrize_shapes(shapes_g: DataFrame | LazyFrame) LazyFrame
The inverse of geometrize_shapes().
If shapes_g is in UTM coordinates (has a UTM SRID), convert those coordinates back to WGS84 (EPSG:4326), which is the standard for a GTFS shapes table.
Module stop_times
Functions about stop times.
- gtfs_kit_polars.stop_times.append_dist_to_stop_times(feed: Feed) Feed
Calculate and append the optional shape_dist_traveled column in feed.stop_times in terms of the distance units feed.dist_units. Trips without shapes will have NaN distances. Return the resulting Feed. Uses feed.shapes, so if that is missing, then return the original feed.
This does not always give accurate results. The algorithm works as follows. Compute the shape_dist_traveled field by using Shapely to measure the distance of a stop along its trip LineString. If for a given trip this process produces a non-monotonically increasing, hence incorrect, list of (cumulative) distances, then fall back to estimating the distances as follows.
Set the first distance to 0, the last to the length of the trip shape, and leave the remaining ones computed above. Choose the longest increasing subsequence of that new set of distances and use them and their corresponding departure times to linearly interpolate the rest of the distances.
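The fallback step above can be sketched in plain Python. This is an illustration only, not the library's code: the names longest_increasing_subsequence and repair_distances are hypothetical, times are assumed to be departure times already converted to seconds, and the clamping of dropped end points is my own assumption.

```python
from bisect import bisect_left


def longest_increasing_subsequence(values):
    """Return the indices of one longest strictly increasing subsequence."""
    tail_vals = []  # tail_vals[k]: smallest tail value of an increasing run of length k + 1
    tail_idx = []   # index in `values` of each of those tails
    prev = [None] * len(values)  # back-pointers used to rebuild the run
    for i, v in enumerate(values):
        k = bisect_left(tail_vals, v)
        if k == len(tail_vals):
            tail_vals.append(v)
            tail_idx.append(i)
        else:
            tail_vals[k] = v
            tail_idx[k] = i
        prev[i] = tail_idx[k - 1] if k > 0 else None
    keep = []
    i = tail_idx[-1] if tail_idx else None
    while i is not None:
        keep.append(i)
        i = prev[i]
    return keep[::-1]


def repair_distances(dists, times, shape_length):
    """Hypothetical fallback for one trip with at least two stops.

    dists: raw cumulative distances; times: departure times in seconds.
    """
    d = list(dists)
    d[0], d[-1] = 0.0, shape_length  # pin the end points
    keep = longest_increasing_subsequence(d)
    kept = set(keep)
    out = list(d)
    for i in range(len(d)):
        if i in kept:
            continue
        left = max((j for j in keep if j < i), default=None)
        right = min((j for j in keep if j > i), default=None)
        if left is None:        # assumption: clamp when no kept point precedes
            out[i] = d[right]
        elif right is None:     # assumption: clamp when no kept point follows
            out[i] = d[left]
        else:                   # interpolate linearly in departure time
            span = times[right] - times[left]
            frac = (times[i] - times[left]) / span if span else 0.0
            out[i] = d[left] + frac * (d[right] - d[left])
    return out
```

For example, repair_distances([0.3, 5.0, 2.0, 3.0, 3.9], [0, 60, 120, 180, 240], 4.0) discards the out-of-order value 5.0 and re-estimates it from the neighbouring stops.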
- gtfs_kit_polars.stop_times.get_start_and_end_times(feed: Feed, date: str | None = None) tuple[str]
Return the first departure time and last arrival time (HH:MM:SS time strings) listed in feed.stop_times, respectively. Restrict to the given date (YYYYMMDD string) if specified.
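Because GTFS times can have HH > 24 and need not be zero-padded, comparing them as plain strings can mislead. A minimal sketch of the idea, assuming rows of (arrival_time, departure_time) pairs; the helper names timestr_to_seconds and start_and_end_times are hypothetical and not part of the library:

```python
def timestr_to_seconds(t):
    """Convert an H:MM:SS or HH:MM:SS string, possibly with HH > 24, to seconds."""
    h, m, s = (int(x) for x in t.split(":"))
    return 3600 * h + 60 * m + s


def start_and_end_times(rows):
    """rows: iterable of (arrival_time, departure_time) strings or None.

    Return (first departure, last arrival), comparing numerically.
    """
    rows = list(rows)
    departures = [d for _, d in rows if d is not None]
    arrivals = [a for a, _ in rows if a is not None]
    return (
        min(departures, key=timestr_to_seconds),
        max(arrivals, key=timestr_to_seconds),
    )
```

With the rows ("05:30:00", "05:31:00"), ("9:00:00", "9:01:00") and ("25:10:00", "25:12:00"), a string comparison would pick "9:00:00" as the last arrival, whereas the numeric comparison picks "25:10:00".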
- gtfs_kit_polars.stop_times.get_stop_times(feed: Feed, date: str | None = None) pl.LazyFrame
Return feed.stop_times. If a date (YYYYMMDD date string) is given, then subset the result to only those stop times with trips active on the date.
- gtfs_kit_polars.stop_times.stop_times_to_geojson(feed: Feed, trip_ids: Iterable[str | None] = None) dict
Return a GeoJSON FeatureCollection of Point features representing all the trip-stop pairs in feed.stop_times. The coordinate reference system is the default one for GeoJSON, namely WGS84.
For every trip, drop duplicate stop IDs within that trip. In particular, a looping trip will lack its final stop.
If an iterable of trip IDs is given, then subset to those trips, silently dropping invalid trip IDs.
Module stops
Functions about stops.
- gtfs_kit_polars.stops.STOP_STYLE = {'color': '#fc8d62', 'fill': 'true', 'fillOpacity': 0.75, 'radius': 8, 'weight': 1}
Leaflet circleMarker parameters for mapping stops
- gtfs_kit_polars.stops.build_geometry_by_stop(feed: Feed, stop_ids: Iterable[str] | None = None, *, use_utm: bool = False) dict
Return a dictionary of the form <stop ID> -> <Shapely Point representing stop>.
- gtfs_kit_polars.stops.build_stop_timetable(feed: Feed, stop_id: str, dates: list[str]) pl.LazyFrame
Return a timetable for the given stop ID and dates (YYYYMMDD date strings).
More specifically, return a table whose columns are all those in feed.trips plus those in feed.stop_times plus 'date', and whose stop IDs are restricted to the given stop ID. The result is sorted by date then departure time.
- gtfs_kit_polars.stops.compute_stop_activity(feed: Feed, dates: list[str]) pl.LazyFrame
Mark stops as active or inactive on the given dates (YYYYMMDD date strings). A stop is active on a given date if some trip that starts on the date visits the stop (possibly after midnight).
Return a table with the columns
stop_id
dates[0]: 1 if the stop has at least one trip visiting it on dates[0]; 0 otherwise
dates[1]: 1 if the stop has at least one trip visiting it on dates[1]; 0 otherwise
etc.
dates[-1]: 1 if the stop has at least one trip visiting it on dates[-1]; 0 otherwise
If all dates lie outside the Feed period, then return an empty table.
- gtfs_kit_polars.stops.compute_stop_stats(feed: Feed, dates: list[str], stop_ids: list[str | None] = None, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) pl.LazyFrame
Compute stats for all stops for the given dates (YYYYMMDD date strings). Optionally, restrict to the stop IDs given.
If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops. Use the headway start and end times to specify the time period for computing headway stats.
Return a table with the columns
'date'
'stop_id'
'direction_id': present if and only if split_directions
'num_routes': number of routes visiting the stop (in the given direction) on the date
'num_trips': number of trips visiting the stop (in the given direction) on the date
'max_headway': maximum of the durations (in minutes) between trip departures at the stop between headway_start_time and headway_end_time on the date
'min_headway': minimum of the durations (in minutes) mentioned above
'mean_headway': mean of the durations (in minutes) mentioned above
'start_time': earliest departure time of a trip from this stop on the date
'end_time': latest departure time of a trip from this stop on the date
Exclude dates with no active stops, which could yield the empty table.
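To make the headway columns concrete, here is a minimal sketch of computing them for one stop from a list of departure time strings. The function name headway_stats is hypothetical, and whether the library includes the window end points or deduplicates equal departure times is an assumption here.

```python
def timestr_to_seconds(t):
    """Convert an HH:MM:SS string, possibly with HH > 24, to seconds."""
    h, m, s = (int(x) for x in t.split(":"))
    return 3600 * h + 60 * m + s


def headway_stats(departures, headway_start_time="07:00:00", headway_end_time="19:00:00"):
    """Return max, min and mean headway in minutes, or None if undefined."""
    lo = timestr_to_seconds(headway_start_time)
    hi = timestr_to_seconds(headway_end_time)
    # Assumption: both window end points are inclusive.
    window = sorted(
        s for s in (timestr_to_seconds(t) for t in departures) if lo <= s <= hi
    )
    gaps = [(b - a) / 60 for a, b in zip(window, window[1:])]
    if not gaps:
        return None  # fewer than two departures in the window
    return {
        "max_headway": max(gaps),
        "min_headway": min(gaps),
        "mean_headway": sum(gaps) / len(gaps),
    }
```

Departures outside the headway window, such as an early 06:50:00 run, do not contribute to the gaps.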
- gtfs_kit_polars.stops.compute_stop_stats_0(stop_times_subset: DataFrame | LazyFrame, trip_subset: DataFrame | LazyFrame, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) LazyFrame
Given a subset of a stop times Table and a subset of a trips Table, return a Table that provides summary stats about the stops in the inner join of the two Tables.
If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops. Use the headway start and end times to specify the time period for computing headway stats.
Return a Table with the columns
stop_id
direction_id: present if and only if split_directions
num_routes: number of routes visiting the stop (in the given direction)
num_trips: number of trips visiting the stop (in the given direction)
max_headway: maximum of the durations (in minutes) between trip departures at the stop between headway_start_time and headway_end_time
min_headway: minimum of the durations (in minutes) mentioned above
mean_headway: mean of the durations (in minutes) mentioned above
start_time: earliest departure time of a trip from this stop
end_time: latest departure time of a trip from this stop
Notes
If trip_subset is empty, then return an empty Table.
Raise a ValueError if split_directions and no non-null direction ID values are present.
- gtfs_kit_polars.stops.compute_stop_time_series(feed: Feed, dates: list[str], stop_ids: list[str | None] = None, num_minutes: int = 60, *, split_directions: bool = False) pl.LazyFrame
Compute time series for the given stops (defaults to all stops in Feed) on the given dates (YYYYMMDD date strings) at the given num_minutes frequency. Return a long-format table with the columns
datetime: datetime object for the given date and frequency chunks
stop_id
direction_id: direction of route; present if and only if split_directions
num_trips: the number of trips that visit the stop in the time bin and have a nonnull departure time from the stop
Exclude dates that lie outside of the Feed’s date range. If all dates lie outside the Feed’s date range, then return an empty table.
If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops.
Notes
Stop times with null departure times are ignored, so the aggregate of num_trips across the day could be less than the num_trips column in compute_stop_stats_0().
All trip departure times are taken modulo 24 hours, so routes with trips that end past 23:59:59 will have all their stats wrap around to the early morning of the time series.
‘num_trips’ should be resampled by summing.
Raise a ValueError if split_directions and no non-null direction ID values are present.
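The modulo-24-hours behaviour in the notes above is easy to misread, so here is a minimal sketch of the binning for a single stop. The function name bin_departures and the use of integer bin indices instead of datetimes are simplifications of mine, not the library's output format.

```python
from collections import Counter


def timestr_to_seconds(t):
    """Convert an HH:MM:SS string, possibly with HH > 24, to seconds."""
    h, m, s = (int(x) for x in t.split(":"))
    return 3600 * h + 60 * m + s


def bin_departures(departures, num_minutes=60):
    """Count departures per time bin, wrapping times past midnight.

    Return a dict mapping bin index (0 = first bin of the day) to num_trips.
    """
    bins = Counter()
    for t in departures:
        if t is None:
            continue  # null departure times are ignored
        secs = timestr_to_seconds(t) % (24 * 3600)  # wrap modulo 24 hours
        bins[secs // (60 * num_minutes)] += 1
    return dict(bins)
```

Here a 24:15:00 departure lands in the same bin as a 00:05:00 departure, which is the wrap-around the notes warn about.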
- gtfs_kit_polars.stops.compute_stop_time_series_0(stop_times_subset: DataFrame | LazyFrame, trips_subset: DataFrame | LazyFrame, num_minutes: int = 60, date_label: str = '20010101', *, split_directions: bool = False) LazyFrame
Compute stop stats in a 24-hour time series form at the given num_minutes frequency for stops in the inner join of the given subset of stop times and trips.
If split_directions, then separate each stop’s stats by trip direction. Use the given YYYYMMDD date label as the date in the time series.
Return a long-format table with columns
datetime: datetime object for the given date and frequency chunks
stop_id
direction_id: direction of route; present if and only if split_directions
num_trips: the number of trips that visit the stop in the time bin and have a nonnull departure time from the stop
Notes
Stop times with null departure times are ignored, so the aggregate of num_trips across the day could be less than the num_trips column in compute_stop_stats_0().
All trip departure times are taken modulo 24 hours, so routes with trips that end past 23:59:59 will have all their stats wrap around to the early morning of the time series.
‘num_trips’ should be resampled by summing.
If trips_subset is empty, then return an empty table.
Raise a ValueError if split_directions and no non-null direction ID values are present.
- gtfs_kit_polars.stops.geometrize_stops(stops: DataFrame | LazyFrame, *, use_utm: bool = False) GeoDataFrame | GeoLazyFrame
Given a GTFS stops Table, convert it to a geotable with a “geometry” column of Points and a “srid” column with the (constant) srid of the geographic projection, e.g. ‘EPSG:4326’ for the WGS84 srid. Return the resulting geotable, which will no longer have the columns 'stop_lon' and 'stop_lat'.
If use_utm, then use local UTM coordinates for the geometries.
- gtfs_kit_polars.stops.get_stops(feed: Feed, date: str | None = None, trip_ids: Iterable[str] | None = None, route_ids: Iterable[str] | None = None, *, in_stations: bool = False, as_geo: bool = False, use_utm: bool = False) pl.LazyFrame
Return feed.stops. If a YYYYMMDD date string is given, then subset to stops active (visited by trips) on that date. If trip IDs are given, then subset further to stops visited by those trips. If route IDs are given, then ignore the trip IDs and subset further to stops visited by those routes. If in_stations, then subset further to stops in stations if station data is available. If as_geo, then return the result as a geotable with a ‘geometry’ column of points instead of ‘stop_lat’ and ‘stop_lon’ columns. The geotable will have a UTM SRID if use_utm and a WGS84 SRID otherwise.
- gtfs_kit_polars.stops.get_stops_in_area(feed: Feed, area: st.GeoLazyFrame | st.GeoDataFrame) st.GeoLazyFrame
Return the subset of feed.stops that contains all stops that intersect the given geotable of polygons.
- gtfs_kit_polars.stops.map_stops(feed: Feed, stop_ids: Iterable[str], stop_style: dict = {'color': '#fc8d62', 'fill': 'true', 'fillOpacity': 0.75, 'radius': 8, 'weight': 1})
Return a Folium map showing the given stops of this Feed. If some of the given stop IDs are not found in the feed, then raise a ValueError.
- gtfs_kit_polars.stops.stops_to_geojson(feed: Feed, stop_ids: Iterable[str | None] = None) dict
Return a GeoJSON FeatureCollection of Point features representing all the stops in feed.stops. The coordinate reference system is the default one for GeoJSON, namely WGS84.
If an iterable of stop IDs is given, then subset to those stops.
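For readers unfamiliar with the GeoJSON layout, here is a minimal sketch of what such a FeatureCollection looks like, built from plain dictionaries. The function name stops_to_feature_collection is hypothetical, and copying every non-coordinate column into the feature properties is an assumption about the output, not something this documentation states.

```python
import json


def stops_to_feature_collection(stops, stop_ids=None):
    """stops: list of dicts with at least stop_id, stop_lon and stop_lat."""
    wanted = None if stop_ids is None else set(stop_ids)
    features = []
    for stop in stops:
        if wanted is not None and stop["stop_id"] not in wanted:
            continue
        features.append(
            {
                "type": "Feature",
                # Assumption: the remaining stop columns become properties.
                "properties": {
                    k: v for k, v in stop.items() if k not in ("stop_lon", "stop_lat")
                },
                # GeoJSON coordinates are ordered longitude, latitude (WGS84).
                "geometry": {
                    "type": "Point",
                    "coordinates": [stop["stop_lon"], stop["stop_lat"]],
                },
            }
        )
    return {"type": "FeatureCollection", "features": features}


stops = [
    {"stop_id": "s1", "stop_name": "First St", "stop_lon": 174.76, "stop_lat": -36.85},
    {"stop_id": "s2", "stop_name": "Second St", "stop_lon": 174.77, "stop_lat": -36.86},
]
collection = stops_to_feature_collection(stops, ["s1"])
geojson_text = json.dumps(collection)  # serializable with the standard library
```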
- gtfs_kit_polars.stops.ungeometrize_stops(stops_g: GeoDataFrame | GeoLazyFrame) DataFrame | LazyFrame
The inverse of geometrize_stops().
If stops_g is in UTM coordinates, then convert those UTM coordinates back to WGS84 coordinates, which is the standard for a GTFS stops table.
Module trips
Functions about trips.
- gtfs_kit_polars.trips.compute_busiest_date(feed: Feed, dates: list[str]) str
Given a list of dates (YYYYMMDD date strings), return the first date that has the maximum number of active trips.
- gtfs_kit_polars.trips.compute_trip_activity(feed: Feed, dates: list[str]) pl.LazyFrame
Mark trips as active or inactive on the given dates (YYYYMMDD date strings). Return a table with the columns
'trip_id'
dates[0]: 1 if the trip is active on dates[0]; 0 otherwise
dates[1]: 1 if the trip is active on dates[1]; 0 otherwise
etc.
dates[-1]: 1 if the trip is active on dates[-1]; 0 otherwise
If dates is None or the empty list, then return an empty table.
- gtfs_kit_polars.trips.compute_trip_stats(feed: Feed, route_ids: list[str | None] = None, *, compute_dist_from_shapes: bool = False) pl.LazyFrame
Return a table with the following columns:
'trip_id'
'route_id'
'route_short_name'
'route_type'
'direction_id': null if missing from feed
'shape_id': null if missing from feed
'stop_pattern_name': output from name_stop_patterns()
'num_stops': number of stops on trip
'start_time': first departure time of the trip
'end_time': last departure time of the trip
'start_stop_id': stop ID of the first stop of the trip
'end_stop_id': stop ID of the last stop of the trip
'is_loop': True if the start and end stop are less than 400m apart and False otherwise
'distance': distance of the trip; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all null entries if feed.shapes is None
'duration': duration of the trip in hours
'speed': distance/duration
If feed.stop_times has a shape_dist_traveled column with at least one non-null value and compute_dist_from_shapes == False, then use that column to compute the distance column. Else if feed.shapes is not None, then compute the distance column using the shapes and Shapely. Otherwise, set the distances to null.
If route IDs are given, then restrict to trips on those routes.
Notes
Assume the following feed attributes are not None:
feed.trips
feed.routes
feed.stop_times
feed.shapes (optionally)
Calculating trip distances with compute_dist_from_shapes=True seems pretty accurate. For example, calculating trip distances on this Portland feed using compute_dist_from_shapes=False and compute_dist_from_shapes=True yields a difference of at most 0.83 km from the original values.
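The derived columns 'duration', 'speed' and 'is_loop' can be sketched for a single trip as follows. This is only an illustration: derive_trip_fields is a hypothetical name, and measuring the start-to-end gap with the haversine formula is my substitution, since the documentation does not say how the library measures the 400 m threshold.

```python
import math


def timestr_to_seconds(t):
    """Convert an HH:MM:SS string, possibly with HH > 24, to seconds."""
    h, m, s = (int(x) for x in t.split(":"))
    return 3600 * h + 60 * m + s


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def derive_trip_fields(start_time, end_time, start_lonlat, end_lonlat, distance):
    """Return duration (hours), speed (distance units per hour) and is_loop."""
    duration = (timestr_to_seconds(end_time) - timestr_to_seconds(start_time)) / 3600
    speed = distance / duration if distance is not None and duration > 0 else None
    is_loop = haversine_m(*start_lonlat, *end_lonlat) < 400
    return {"duration": duration, "speed": speed, "is_loop": is_loop}
```

Note that a trip running from 23:30:00 to 24:30:00 has a duration of one hour, which only works out because the end time is allowed to exceed 24:00:00.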
- gtfs_kit_polars.trips.get_active_services(feed: Feed, date: str) list[str]
Given a Feed and a date string in YYYYMMDD format, return the service IDs that are active on the date.
- gtfs_kit_polars.trips.get_trips(feed: Feed, date: str | None = None, time: str | None = None, *, as_geo: bool = False, use_utm: bool = False) pl.LazyFrame | st.GeoLazyFrame
Return feed.trips. If a date (YYYYMMDD date string) is given, then subset the result to trips that start on that date. If a time (HH:MM:SS string, possibly with HH > 23) is given in addition to a date, then further subset the result to trips in service at that time.
If as_geo and feed.shapes is not None, then return the trips as a geotable of LineStrings representing trip shapes. Use the local UTM CRS if use_utm; otherwise use the WGS84 CRS. If as_geo and feed.shapes is None, then raise a ValueError.
- gtfs_kit_polars.trips.locate_trips(feed, date: str, times: list[str]) LazyFrame
Return the positions of all trips active on the given date (YYYYMMDD date string) and times (HH:MM:SS time strings, possibly with HH > 23).
Return a table with the columns
'trip_id'
'shape_id'
'route_id'
'direction_id': null if feed.trips.direction_id is missing
'time'
'rel_dist': number between 0 (start) and 1 (end) indicating the relative distance of the trip along its path
'lon': longitude of trip at given time
'lat': latitude of trip at given time
Assume feed.stop_times has an accurate shape_dist_traveled column.
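One plausible way to obtain 'rel_dist' is to interpolate shape_dist_traveled linearly in time between consecutive stops. The sketch below does that for a single trip; the function name relative_distance is hypothetical, and normalizing by the distance between the first and last stops (instead of the full shape length) is an assumption.

```python
def relative_distance(stop_times, t):
    """stop_times: list of (departure_seconds, shape_dist_traveled), sorted by time.

    Return the relative distance in [0, 1] of the trip at time t (seconds),
    or None if the trip is not in service at t.
    """
    if len(stop_times) < 2 or not stop_times[0][0] <= t <= stop_times[-1][0]:
        return None
    first_dist = stop_times[0][1]
    total = stop_times[-1][1] - first_dist
    for (t0, d0), (t1, d1) in zip(stop_times, stop_times[1:]):
        if t0 <= t <= t1:
            # Linear interpolation between the two surrounding stops
            frac = (t - t0) / (t1 - t0) if t1 > t0 else 0.0
            dist = d0 + frac * (d1 - d0)
            return (dist - first_dist) / total if total else 0.0
    return None
```

The longitude and latitude columns would then come from walking that fraction along the trip's shape, a step that needs a geometry library and is omitted here.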
- gtfs_kit_polars.trips.map_trips(feed: Feed, trip_ids: Iterable[str], color_palette: list[str] = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3'], *, show_stops: bool = False, show_direction: bool = True)
Return a Folium map showing the given trips. Silently drop invalid trip IDs given. If show_stops, then plot the trip stops too. If show_direction, then use the Folium plugin PolyLineTextPath to draw arrows on each trip polyline indicating its direction of travel; this fails to work in some browsers, such as Brave 0.68.132.
- gtfs_kit_polars.trips.name_stop_patterns(feed: Feed) pl.LazyFrame
For each (route ID, direction ID) pair, find the distinct stop patterns of its trips, and assign them each an integer pattern rank based on the stop pattern’s frequency rank, where 1 is the most frequent stop pattern, 2 is the second most frequent, etc. Return the table feed.trips with the additional column stop_pattern_name, which equals the trip’s ‘direction_id’ concatenated with a dash and its stop pattern rank.
If feed.trips has no ‘direction_id’ column, then temporarily create one equal to all zeros, proceed as above, then delete the column.
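The ranking scheme can be sketched with the standard library. This is an illustration under assumptions: trips are given as dictionaries with a hypothetical 'stop_ids' list, the function name rank_stop_patterns is made up, and ties between equally frequent patterns are broken by first appearance, which the documentation does not specify.

```python
from collections import Counter, defaultdict


def rank_stop_patterns(trips):
    """trips: list of dicts with trip_id, route_id, direction_id and stop_ids.

    Return a dict mapping trip ID to its stop pattern name.
    """
    # Count how often each stop pattern occurs per (route, direction) pair.
    counts = defaultdict(Counter)
    for trip in trips:
        key = (trip["route_id"], trip["direction_id"])
        counts[key][tuple(trip["stop_ids"])] += 1
    # Rank 1 is the most frequent pattern within its (route, direction) group.
    ranks = {
        key: {
            pattern: rank
            for rank, (pattern, _) in enumerate(counter.most_common(), start=1)
        }
        for key, counter in counts.items()
    }
    names = {}
    for trip in trips:
        key = (trip["route_id"], trip["direction_id"])
        rank = ranks[key][tuple(trip["stop_ids"])]
        names[trip["trip_id"]] = f"{trip['direction_id']}-{rank}"
    return names
```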
- gtfs_kit_polars.trips.trips_to_geojson(feed: Feed, trip_ids: Iterable[str] | None = None, *, include_stops: bool = False) dict
Return a GeoJSON FeatureCollection (in WGS84 coordinates) of LineString features representing all the Feed’s trips.
If include_stops, then include the trip stops as Point features. If an iterable of trip IDs is given, then subset to those trips, which could yield an empty FeatureCollection if all the trip IDs are invalid.
Module miscellany
Functions about miscellany.
- gtfs_kit_polars.miscellany.assess_quality(feed: Feed) pl.LazyFrame
Return a table of various feed indicators and values, e.g. number of trips missing shapes.
The resulting table has the columns
'indicator': string; name of an indicator, e.g. ‘num_routes’
'value': value of the indicator, e.g. 27
This function is odd but useful for seeing roughly how broken a feed is. This function is not a GTFS validator.
- gtfs_kit_polars.miscellany.compute_bounds(feed: Feed, stop_ids: list[str] | None = None) list
Return the bounding box [min longitude, min latitude, max longitude, max latitude] of the given Feed’s stops or of the subset of stops specified by the given stop IDs.
- gtfs_kit_polars.miscellany.compute_centroid(feed: Feed, stop_ids: list[str] | None = None) sg.Point
Return the centroid of the convex hull of the given Feed’s stops or subset thereof specified by the given stop IDs.
- gtfs_kit_polars.miscellany.compute_convex_hull(feed: Feed, stop_ids: list[str] | None = None) sg.Polygon
Return the convex hull in WGS84 coordinates of the given Feed’s stops or subset thereof specified by the given stop IDs.
- gtfs_kit_polars.miscellany.compute_network_stats(feed: Feed, dates: list[str], trip_stats: pl.LazyFrame | pl.DataFrame | None = None, *, split_route_types=False) pl.LazyFrame
Compute some network stats for the given subset of trip stats, which defaults to feed.compute_trip_stats(), and for the given dates (YYYYMMDD date strings).
Return a table with the columns
'date'
'route_type' (optional): present if and only if split_route_types
'num_stops': number of stops active on the date
'num_routes': number of routes active on the date
'num_trips': number of trips that start on the date
'num_trip_starts': number of trips with nonnull start times on the date
'num_trip_ends': number of trips with nonnull start times and nonnull end times on the date, ignoring trips that end after 23:59:59 on the date
'peak_num_trips': maximum number of simultaneous trips in service on the date
'peak_start_time': start time of first longest period during which the peak number of trips occurs on the date
'peak_end_time': end time of first longest period during which the peak number of trips occurs on the date
'service_distance': sum of the service distances for the active routes on the date; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_duration': sum of the service durations for the active routes on the date; measured in hours
'service_speed': service_distance/service_duration on the date
Exclude dates with no active stops, which could yield the empty table.
The route and trip stats for date d contain stats for trips that start on date d only and ignore trips that start on date d-1 and end on date d.
Notes
If you’ve already computed trip stats in your workflow, then passing it into this function will speed it up.
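The peak columns are the least obvious ones, so here is a minimal sweep-line sketch of how 'peak_num_trips', 'peak_start_time' and 'peak_end_time' could be computed from trip start and end times in seconds. The function name peak_trips is hypothetical, and treating a trip that ends at the same instant another starts as non-overlapping is an assumption.

```python
def peak_trips(intervals):
    """intervals: non-empty list of (start_seconds, end_seconds) with start < end.

    Return (peak_num_trips, peak_start, peak_end), where the period is the
    first longest one during which the peak number of trips is in service.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    events.sort()  # at equal times, ends (-1) sort before starts (+1)

    # Number of trips in service after all events at each distinct time
    counts = []
    running = 0
    for time, delta in events:
        running += delta
        if counts and counts[-1][0] == time:
            counts[-1] = (time, running)
        else:
            counts.append((time, running))

    # Merge consecutive times with an unchanged count into one period
    merged = []
    for time, count in counts:
        if not merged or merged[-1][1] != count:
            merged.append((time, count))

    peak = max(count for _, count in merged)
    best = None
    for (t0, count), (t1, _) in zip(merged, merged[1:]):
        if count == peak and (best is None or t1 - t0 > best[1] - best[0]):
            best = (t0, t1)  # strict '>' keeps the first of equally long periods
    return peak, best[0], best[1]
```

For the intervals (0, 100), (50, 150), (60, 70), (120, 200), (130, 300), three trips overlap twice, during 60 to 70 and during 130 to 150, and the longer second period is reported.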
- gtfs_kit_polars.miscellany.compute_network_stats_0(stop_times: DataFrame | LazyFrame, trip_stats: DataFrame | LazyFrame, *, split_route_types=False) LazyFrame
Compute some network stats for the trips common to the given subset of stop times and given subset of trip stats of the form output by the function trips.compute_trip_stats().
Return a table with the columns
'route_type' (optional): present if and only if split_route_types
'num_stops': number of stops active on the date
'num_routes': number of routes active on the date
'num_trips': number of trips that start on the date
'num_trip_starts': number of trips with nonnull start times on the date
'num_trip_ends': number of trips with nonnull start times and nonnull end times on the date, ignoring trips that end after 23:59:59 on the date
'peak_num_trips': maximum number of simultaneous trips in service on the date
'peak_start_time': start time of first longest period during which the peak number of trips occurs on the date
'peak_end_time': end time of first longest period during which the peak number of trips occurs on the date
'service_distance': sum of the service distances for the active routes on the date; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_duration': sum of the service durations for the active routes on the date; measured in hours
'service_speed': service_distance/service_duration on the date
Exclude dates with no active stops, which could yield the empty table.
Helper function for compute_network_stats().
- gtfs_kit_polars.miscellany.compute_network_time_series(feed: Feed, dates: list[str], trip_stats: pl.LazyFrame | pl.DataFrame | None = None, num_minutes: int = 60, *, split_route_types: bool = False) pl.LazyFrame
Compute some network stats in time series form for the given dates (YYYYMMDD date strings) and trip stats, which defaults to feed.compute_trip_stats(). Use num_minutes to specify the bin width, in minutes, of the resulting time series, e.g. 5 for five-minute bins. If split_route_types, then split stats by route type; otherwise don’t.
Return a long-form time series table with the columns
'datetime': datetime object
'route_type': integer; present if and only if split_route_types
'num_trips': number of trips in service during the time period
'num_trip_starts': number of trips starting during the time period
'num_trip_ends': number of trips ending during the time period, ignoring the trips that end past midnight
'service_distance': distance traveled during the time period by all trips active during the time period; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_duration': duration traveled during the time period by all trips active during the time period; measured in hours
'service_speed': service_distance/service_duration when defined; 0 otherwise
Exclude dates that lie outside of the Feed’s date range. If all the dates given lie outside of the Feed’s date range, then return an empty table with the specified columns.
Notes
If you’ve already computed trip stats in your workflow, then passing it into this function will speed it up.
- gtfs_kit_polars.miscellany.compute_screen_line_counts(feed: Feed, screen_lines: st.GeoLazyFrame | st.GeoDataFrame, dates: list[str], *, include_diagnostics: bool = False) pl.LazyFrame
Find all the Feed trips active on the given YYYYMMDD dates that intersect the given screen lines (LineStrings) with optional ID column screen_line_id. Behind the scenes, use simple sub-LineStrings of the feed to compute screen line intersections. Using them instead of the Feed shapes avoids miscounting intersections in the case of non-simple (self-intersecting) shapes.
For each trip crossing a screen line, compute the crossing time, crossing direction, etc. and return a table of results with the columns
'date': the YYYYMMDD date string given
'screen_line_id': ID of a screen line
'trip_id': ID of a trip that crosses the screen line
'shape_id': ID of the trip’s shape
'direction_id': GTFS direction of trip
'route_id'
'route_short_name'
'route_type'
'crossing_direction': 1 or -1; 1 indicates trip travel from the left side to the right side of the screen line; -1 indicates trip travel in the opposite direction
'crossing_time': time, according to the GTFS schedule, that the trip crosses the screen line
'crossing_dist_m': distance along the trip shape (not subshape) of the crossing; in meters
If include_diagnostics, then include the following extra columns for diagnostic purposes.
'subshape_id': ID of the simple sub-LineString S of the trip’s shape that crosses the screen line
'subshape_length_m': length of S in meters
'from_departure_time': departure time of the trip from the last stop before the screen line
'to_departure_time': departure time of the trip from the first stop after the screen line
'subshape_dist_frac': proportion of S’s length at which the screen line intersects S
Notes:
Assume the Feed’s stop times table has an accurate shape_dist_traveled column.
Assume that trips travel in the same direction as their shapes, an assumption that is part of the GTFS.
Assume that the screen line is straight and simple.
The algorithm works as follows.
Find the Feed’s simple subshapes (computed via shapes.split_simple()) that intersect the screen lines.
For each such subshape and screen line, compute the intersection points, the distance of each point along the subshape, aka the crossing distance, and the orientation of the screen line relative to the subshape.
Restrict to trips active on the given dates and for each trip associated to an intersecting subshape above, interpolate a trip stop time for the intersection point using the crossing distance, subshape length, cumulative subshape length, and trip stop times.
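Two pieces of the algorithm above lend themselves to a small sketch: the crossing direction and the interpolated crossing time. Both functions below are hypothetical illustrations in planar coordinates, and the sign convention (left and right taken relative to the direction the screen line is drawn in) is an assumption about what the library means.

```python
def crossing_direction(a, b, p, q):
    """a -> b is the screen line and p -> q a shape segment that crosses it.

    Return 1 if travel goes from the left side of a -> b to its right side,
    -1 for the opposite direction, and 0 if the two are parallel.
    """
    # z-component of the cross product (b - a) x (q - p)
    cross = (b[0] - a[0]) * (q[1] - p[1]) - (b[1] - a[1]) * (q[0] - p[0])
    if cross < 0:
        return 1
    if cross > 0:
        return -1
    return 0


def crossing_time(crossing_dist, from_dist, from_time, to_dist, to_time):
    """Linearly interpolate the crossing time between two stops.

    Distances are along the trip shape in meters and times are in seconds;
    from_dist < to_dist is assumed.
    """
    frac = (crossing_dist - from_dist) / (to_dist - from_dist)
    return from_time + frac * (to_time - from_time)
```

For a screen line drawn northward from (0, 0) to (0, 1), an eastbound trip starts on its left (west) side and ends on its right side, so the direction is 1.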
- gtfs_kit_polars.miscellany.convert_dist(feed: Feed, new_dist_units: str) Feed
Convert the distances recorded in the shape_dist_traveled columns of the given Feed to the given distance units. New distance units must lie in constants.DIST_UNITS. Return the resulting Feed.
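The unit conversion itself amounts to a lookup of conversion factors. A minimal sketch for a single value follows; the function name convert_value and the table METERS_PER_UNIT are hypothetical, although the factors are the standard definitions of the foot and the mile.

```python
# Standard definitions: 1 ft = 0.3048 m and 1 mi = 1609.344 m.
METERS_PER_UNIT = {"ft": 0.3048, "mi": 1609.344, "m": 1.0, "km": 1000.0}


def convert_value(value, old_units, new_units):
    """Convert one distance value between the units 'ft', 'mi', 'm', 'km'."""
    for units in (old_units, new_units):
        if units not in METERS_PER_UNIT:
            raise ValueError(f"Distance units must lie in {list(METERS_PER_UNIT)}")
    return value * METERS_PER_UNIT[old_units] / METERS_PER_UNIT[new_units]
```

Converting a whole Feed would apply such a factor to every shape_dist_traveled column and update the feed's dist_units accordingly.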
- gtfs_kit_polars.miscellany.create_shapes(feed: Feed, *, all_trips: bool = False) Feed
Given a feed, create a shape for every trip that is missing a shape ID. Do this by connecting the stops on the trip with straight lines. Return the resulting feed which has updated shapes and trips tables.
If all_trips, then create new shapes for all trips by connecting stops, and remove the old shapes.
- gtfs_kit_polars.miscellany.describe(feed: Feed, sample_date: str | None = None) pl.LazyFrame
Return a table of various feed indicators and values, e.g. number of routes. Specialize some of those indicators to the given YYYYMMDD sample date string, e.g. number of routes active on the date.
The resulting table has the columns
'indicator': string; name of an indicator, e.g. ‘num_routes’
'value': value of the indicator, e.g. 27
- gtfs_kit_polars.miscellany.list_fields(feed: Feed, table_name: str | None = None) pl.LazyFrame
Return a table summarizing all GTFS tables in the given feed or in the given table name if specified.
The resulting table has the following columns.
'table': name of the GTFS table, e.g. 'stops'
'column': name of a column in the table, e.g. 'stop_id'
'num_values': number of values in the column
'num_nonnull_values': number of nonnull values in the column
'num_unique_values': number of unique values in the column, excluding null values
'min_value': minimum value in the column
'max_value': maximum value in the column
If the table is not in the feed, then return an empty table.
If the table is not valid, raise a ValueError.
- gtfs_kit_polars.miscellany.restrict_to_agencies(feed: Feed, agency_ids: list[str]) Feed
Build a new feed by restricting this one via restrict_to_routes() and the routes with the given agency IDs. Return the resulting feed.
- gtfs_kit_polars.miscellany.restrict_to_area(feed: Feed, area: st.GeoDataFrame | st.GeoLazyFrame) Feed
Build a new feed by restricting this one via restrict_to_trips() and the trips that have at least one stop intersecting the given geotable of polygons, which can be in any coordinate reference system. Return the resulting feed.
- gtfs_kit_polars.miscellany.restrict_to_dates(feed: Feed, dates: list[str]) Feed
Build a new feed by restricting this one via restrict_to_trips() and the trips active on at least one of the given dates (YYYYMMDD strings). Return the resulting feed.
- gtfs_kit_polars.miscellany.restrict_to_routes(feed: Feed, route_ids: list[str]) Feed
Build a new feed by restricting this one via restrict_to_trips() and the trips with the given route IDs. Return the resulting feed.
- gtfs_kit_polars.miscellany.restrict_to_trips(feed: Feed, trip_ids: list[str]) Feed
Build a new feed by restricting this one to only the stops, trips, shapes, etc. used by the trips of the given IDs. Return the resulting feed.
If no valid trip IDs are given, which includes the case of the empty list, then the resulting feed will have all empty non-agency tables.
This function is probably more useful internally than externally.
Module feed
This module defines a Feed class to represent GTFS feeds.
The Feed class also has heaps of methods: a method to compute route stats,
a method to compute screen line counts, validation methods, etc.
To ease testing and reading, almost all of these methods are defined in other modules
and grouped by theme (routes.py, stops.py, etc.).
These methods, or rather functions that operate on feeds, are
then imported within the Feed class.
This separation of methods unfortunately messes up slightly the Feed class
documentation generated by Sphinx, introducing an extra leading feed
parameter in the method signatures.
Ignore that extra parameter; it refers to the Feed instance,
usually called self and usually hidden automatically by Sphinx.
- class gtfs_kit_polars.feed.Feed(dist_units: str, agency: DataFrame | LazyFrame | None = None, stops: DataFrame | LazyFrame | None = None, routes: DataFrame | LazyFrame | None = None, trips: DataFrame | LazyFrame | None = None, stop_times: DataFrame | LazyFrame | None = None, calendar: DataFrame | LazyFrame | None = None, calendar_dates: DataFrame | LazyFrame | None = None, fare_attributes: DataFrame | LazyFrame | None = None, fare_rules: DataFrame | LazyFrame | None = None, shapes: DataFrame | LazyFrame | None = None, frequencies: DataFrame | LazyFrame | None = None, transfers: DataFrame | LazyFrame | None = None, feed_info: DataFrame | LazyFrame | None = None, attributions: DataFrame | LazyFrame | None = None, unzip_dir: TemporaryDirectory | None = None)
Bases: object
An instance of this class represents a GTFS feed, where GTFS tables are stored as Polars LazyFrames and are coerced to such upon initialization and attribute updates. The methods assume the instance represents a valid GTFS feed but offer no validation, because that’s complex and already done by dedicated libraries. So unless you know what you’re doing, use the Canonical GTFS Validator before seriously analyzing a feed with this class.
GTFS table instance attributes:
agency
stops
routes
trips
stop_times
calendar
calendar_dates
fare_attributes
fare_rules
shapes
frequencies
transfers
feed_info
attributions
Metadata attributes:
dist_units: a string in constants.DIST_UNITS; specifies the distance units of the shape_dist_traveled column values, if present; also affects whether to display trip and route stats in metric or imperial units
unzip_dir: temporary file directory for unzipping feeds read from ZIP file
- aggregate_routes(by: str = 'route_short_name', route_id_prefix: str = 'route_') Feed
Aggregate routes by route short name, say, and assign new route IDs using the given prefix.
More specifically, create new route IDs with the function build_aggregate_routes_table() and the parameters by and route_id_prefix, and update the old route IDs to the new ones in all the relevant Feed tables. Return the resulting Feed.
- aggregate_stops(by: str = 'stop_code', stop_id_prefix: str = 'stop_') Feed
Aggregate stops by the column by and assign new stop IDs using the given prefix. Update IDs in stops, stop_times, and transfers. Return the resulting Feed.
- append_dist_to_shapes() Feed
Calculate and append the optional shape_dist_traveled field in feed.shapes in terms of the distance units feed.dist_units. Return the resulting Feed.
As a benchmark, using this function on this Portland feed produces a shape_dist_traveled column that differs by at most 0.016 km in absolute value from the original values.
- append_dist_to_stop_times() Feed
Calculate and append the optional shape_dist_traveled column in feed.stop_times in terms of the distance units feed.dist_units. Trips without shapes will have NaN distances. Return the resulting Feed. Uses feed.shapes, so if that is missing, then return the original feed.
This does not always give accurate results. The algorithm works as follows. Compute the shape_dist_traveled field by using Shapely to measure the distance of a stop along its trip LineString. If for a given trip this process produces a non-monotonically increasing, hence incorrect, list of (cumulative) distances, then fall back to estimating the distances as follows.
Set the first distance to 0, the last to the length of the trip shape, and leave the remaining ones computed above. Choose the longest increasing subsequence of that new set of distances and use them and their corresponding departure times to linearly interpolate the rest of the distances.
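The fallback step described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the helper names lis_indices and repair_distances are hypothetical, departure times are given as seconds, and interior distances are only kept if they lie strictly between 0 and the shape length, so that the two pinned endpoints always belong to the kept subsequence.

```python
from bisect import bisect_left


def lis_indices(values):
    """Return the indices of one longest strictly increasing subsequence."""
    tail_idx, tail_val = [], []  # best tail (index, value) per subsequence length
    prev = [None] * len(values)  # back-pointers used to rebuild the subsequence
    for i, v in enumerate(values):
        k = bisect_left(tail_val, v)
        if k == len(tail_val):
            tail_idx.append(i)
            tail_val.append(v)
        else:
            tail_idx[k], tail_val[k] = i, v
        prev[i] = tail_idx[k - 1] if k > 0 else None
    out = []
    i = tail_idx[-1] if tail_idx else None
    while i is not None:
        out.append(i)
        i = prev[i]
    return out[::-1]


def repair_distances(dists, times, shape_length):
    """Hypothetical helper: `dists` are raw cumulative distances and `times`
    are departure times in seconds, both ordered by stop sequence."""
    n = len(dists)
    d = [float(x) for x in dists]
    # Pin the first distance to 0 and the last to the shape length.
    d[0], d[-1] = 0.0, float(shape_length)
    # Assumption: only interior values strictly inside (0, shape_length) are
    # candidates, which guarantees both endpoints are kept.
    inner = [i for i in range(1, n - 1) if 0.0 < d[i] < shape_length]
    keep = [0] + [inner[j] for j in lis_indices([d[i] for i in inner])] + [n - 1]
    # Linearly interpolate the dropped distances by departure time.
    for a, b in zip(keep, keep[1:]):
        span = times[b] - times[a]
        for i in range(a + 1, b):
            frac = (times[i] - times[a]) / span if span else 0.0
            d[i] = d[a] + frac * (d[b] - d[a])
    return d
```

In this sketch, a stop whose measured distance breaks the increasing order is discarded and re-estimated from its neighbours' distances and departure times.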
- assess_quality() pl.LazyFrame
Return a table of various feed indicators and values, e.g. number of trips missing shapes.
The resulting table has the columns
'indicator': string; name of an indicator, e.g. ‘num_routes’
'value': value of the indicator, e.g. 27
This function is odd but useful for seeing roughly how broken a feed is. This function is not a GTFS validator.
- build_geometry_by_shape(shape_ids: Iterable[str] | None = None, *, use_utm: bool = False) dict
Return a dictionary of the form <shape ID> -> <Shapely LineString representing shape>. If the Feed has no shapes, then return the empty dictionary. If use_utm, then use local UTM coordinates; otherwise, use WGS84 coordinates.
- build_geometry_by_stop(stop_ids: Iterable[str] | None = None, *, use_utm: bool = False) dict
Return a dictionary of the form <stop ID> -> <Shapely Point representing stop>.
- build_route_timetable(route_id: str, dates: list[str]) pl.LazyFrame
Return a timetable for the given route and dates (YYYYMMDD date strings).
Return a table whose columns are all those in feed.trips plus those in feed.stop_times plus 'date'. The trip IDs are restricted to the given route ID. The result is sorted first by date and then by grouping by trip ID and sorting the groups by their first departure time.
Skip dates outside of the Feed’s dates.
If there is no route activity on the given dates, then return an empty table.
- build_stop_timetable(stop_id: str, dates: list[str]) pl.LazyFrame
Return a timetable for the given stop ID and dates (YYYYMMDD date strings).
Return a table whose columns are all those in feed.trips plus those in feed.stop_times plus 'date', and the stop IDs are restricted to the given stop ID. The result is sorted by date then departure time.
- clean() Feed
Apply the following functions to the given Feed in order and return the resulting Feed.
- clean_ids() Feed
In the given Feed, strip whitespace from all string IDs and then replace every remaining whitespace chunk with an underscore. Return the resulting Feed.
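A minimal sketch of the ID cleaning rule described above, applied to a single string. The helper name clean_id is hypothetical; the library applies the rule across the ID columns of the Feed tables.

```python
import re


def clean_id(value):
    """Strip outer whitespace, then turn each inner whitespace run into '_'."""
    return re.sub(r"\s+", "_", value.strip())
```

For example, under this sketch the ID "  route  12 A " becomes "route_12_A".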
- clean_route_short_names() Feed
In feed.routes, assign ‘n/a’ to missing route short names and strip whitespace from route short names. Then disambiguate each route short name that is duplicated by appending ‘-’ and its route ID. Return the resulting Feed.
- clean_times() Feed
In the given Feed, convert H:MM:SS time strings to HH:MM:SS time strings to make sorting by time work as expected. Return the resulting Feed.
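To see why the padding matters, here is a small sketch with a hypothetical helper pad_hour. Time strings sort lexicographically, so an unpadded '9:30:00' sorts after '10:00:00'. The sketch assumes well-formed H:MM:SS or HH:MM:SS input.

```python
def pad_hour(time_str):
    """Zero-pad the hour of an H:MM:SS string; HH:MM:SS strings pass through."""
    h, m, s = time_str.strip().split(":")
    return f"{int(h):02d}:{m}:{s}"


raw = ["10:00:00", "9:30:00", "25:10:00"]
padded = sorted(pad_hour(t) for t in raw)
```

Hours above 23, which GTFS allows, are left unchanged by this sketch.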
- close_unzip_dir() None
Close this Feed’s temporary unzip directory, if it has one, which was created by reading the feed from a ZIP file. Frees memory.
- compute_bounds(stop_ids: list[str] | None = None) list
Return the bounding box [min longitude, min latitude, max longitude, max latitude] of the given Feed’s stops or of the subset of stops specified by the given stop IDs.
- compute_busiest_date(dates: list[str]) str
Given a list of dates (YYYYMMDD date strings), return the first date that has the maximum number of active trips.
- compute_centroid(stop_ids: list[str] | None = None) sg.Point
Return the centroid of the convex hull of the given Feed’s stops or subset thereof specified by the given stop IDs.
- compute_convex_hull(stop_ids: list[str] | None = None) sg.Polygon
Return the convex hull in WGS84 coordinates of the given Feed’s stops or subset thereof specified by the given stop IDs.
- compute_network_stats(dates: list[str], trip_stats: pl.LazyFrame | pl.DataFrame | None = None, *, split_route_types=False) pl.LazyFrame
Compute some network stats for the given subset of trip stats, which defaults to feed.compute_trip_stats(), and for the given dates (YYYYMMDD date strings).
Return a table with the columns
'date'
'route_type' (optional): present if and only if split_route_types
'num_stops': number of stops active on the date
'num_routes': number of routes active on the date
'num_trips': number of trips that start on the date
'num_trip_starts': number of trips with nonnull start times on the date
'num_trip_ends': number of trips with nonnull start times and nonnull end times on the date, ignoring trips that end after 23:59:59 on the date
'peak_num_trips': maximum number of simultaneous trips in service on the date
'peak_start_time': start time of first longest period during which the peak number of trips occurs on the date
'peak_end_time': end time of first longest period during which the peak number of trips occurs on the date
'service_distance': sum of the service distances for the active routes on the date; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_duration': sum of the service durations for the active routes on the date; measured in hours
'service_speed': service_distance/service_duration on the date
Exclude dates with no active stops, which could yield the empty table.
The route and trip stats for date d contain stats for trips that start on date d only and ignore trips that start on date d-1 and end on date d.
Notes
If you’ve already computed trip stats in your workflow, then passing it into this function will speed it up.
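The 'peak_num_trips', 'peak_start_time', and 'peak_end_time' columns can be illustrated with a small sweep over trip start and end events. This is a sketch under assumptions, not the library's code: peak_period is a hypothetical helper, times are seconds after midnight, the input is non-empty, and a trip ending at time t is not counted as simultaneous with a trip starting at t.

```python
def peak_period(intervals):
    """intervals: non-empty list of (start, end) trip times in seconds.

    Return (peak, (peak_start, peak_end)) for the first longest period
    during which the number of simultaneous trips equals its maximum.
    """
    # Sorting tuples puts (t, -1) before (t, 1), so ends precede starts at ties.
    events = sorted([(s, 1) for s, _ in intervals] + [(e, -1) for _, e in intervals])
    count, steps = 0, []
    for t, delta in events:
        count += delta
        steps.append((t, count))  # count in effect from time t onward
    peak = max(c for _, c in steps)
    best = None
    for (t0, c0), (t1, _) in zip(steps, steps[1:]):
        # Strict '>' keeps the first period among equally long ones.
        if c0 == peak and (best is None or t1 - t0 > best[1] - best[0]):
            best = (t0, t1)
    return peak, best
```

With the trips [(0, 100), (50, 150), (60, 80), (200, 300), (210, 320), (220, 400)], three trips overlap during 60 to 80 and again during 220 to 300, so the sketch reports the longer second period.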
- compute_network_time_series(dates: list[str], trip_stats: pl.LazyFrame | pl.DataFrame | None = None, num_minutes: int = 60, *, split_route_types: bool = False) pl.LazyFrame
Compute some network stats in time series form for the given dates (YYYYMMDD date strings) and trip stats, which defaults to feed.compute_trip_stats(). Use the given num_minutes to specify the frequency of the resulting time series, e.g. 5 for five-minute bins. If split_route_types, then split stats by route type; otherwise don’t.
Return a long-form time series table with the columns
'datetime': datetime object
'route_type': integer; present if and only if split_route_types
'num_trips': number of trips in service during the time period
'num_trip_starts': number of trips starting during the time period
'num_trip_ends': number of trips ending during the time period, ignoring the trips that end past midnight
'service_distance': distance traveled during the time period by all trips active during the time period; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_duration': duration traveled during the time period by all trips active during the time period; measured in hours
'service_speed': service_distance/service_duration when defined; 0 otherwise
Exclude dates that lie outside of the Feed’s date range. If all the dates given lie outside of the Feed’s date range, then return an empty table with the specified columns.
Notes
If you’ve already computed trip stats in your workflow, then passing it into this function will speed it up.
- compute_route_stats(dates: list[str], trip_stats: pl.DataFrame | pl.LazyFrame | None = None, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) pl.LazyFrame
Compute route stats for all the trips that lie in the given subset of trip stats, which defaults to feed.compute_trip_stats(), and that start on the given dates (YYYYMMDD date strings).
If split_directions, then separate the stats by trip direction (0 or 1). Use the headway start and end times to specify the time period for computing headway stats.
Return a table with the columns
'date'
'route_id'
'route_short_name'
'route_type'
'direction_id': present if and only if split_directions
'num_trips': number of trips on the route in the subset
'num_trip_starts': number of trips on the route with nonnull start times
'num_trip_ends': number of trips on the route with nonnull end times that end before 23:59:59
'num_stop_patterns': number of stop patterns across trips
'is_loop': 1 if at least one of the trips on the route has its is_loop field equal to 1; 0 otherwise
'is_bidirectional': 1 if the route has trips in both directions; 0 otherwise; present if and only if not split_directions
'start_time': start time of the earliest trip on the route
'end_time': end time of latest trip on the route
'max_headway': maximum of the durations (in minutes) between trip starts on the route between headway_start_time and headway_end_time on the given dates
'min_headway': minimum of the durations (in minutes) mentioned above
'mean_headway': mean of the durations (in minutes) mentioned above
'peak_num_trips': maximum number of simultaneous trips in service (for the given direction, or for both directions when split_directions==False)
'peak_start_time': start time of first longest period during which the peak number of trips occurs
'peak_end_time': end time of first longest period during which the peak number of trips occurs
'service_duration': total of the duration of each trip on the route in the given subset of trips; measured in hours
'service_distance': total of the distance traveled by each trip on the route in the given subset of trips; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_speed': service_distance/service_duration when defined; 0 otherwise
'mean_trip_distance': service_distance/num_trips
'mean_trip_duration': service_duration/num_trips
Exclude dates with no active trips, which could yield an empty table.
If not split_directions, then compute each route’s stats, except for headways, using its trips running in both directions. For headways, (1) compute max headway by taking the max of the max headways in both directions; (2) compute mean headway by taking the weighted mean of the mean headways in both directions.
Notes
If you’ve already computed trip stats in your workflow, then you should pass that table into this function to speed things up significantly.
The route stats for date d contain stats for trips that start on date d only and ignore trips that start on date d-1 and end on date d.
Raise a ValueError if split_directions and no non-null direction ID values are present.
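The headway columns can be illustrated with a short sketch in plain Python. The helper names to_seconds and headway_stats are hypothetical, and treating the headway window as inclusive at both ends is an assumption made for this example, not something the documentation states.

```python
def to_seconds(time_str):
    """Convert an HH:MM:SS string (HH may exceed 23) to seconds."""
    h, m, s = map(int, time_str.split(":"))
    return 3600 * h + 60 * m + s


def headway_stats(start_times, headway_start_time="07:00:00", headway_end_time="19:00:00"):
    """Return max, min, and mean gaps in minutes between consecutive trip
    starts inside the window, or None if fewer than two starts fall in it."""
    lo, hi = to_seconds(headway_start_time), to_seconds(headway_end_time)
    starts = sorted(s for s in map(to_seconds, start_times) if lo <= s <= hi)
    gaps = [(b - a) / 60 for a, b in zip(starts, starts[1:])]
    if not gaps:
        return None
    return {
        "max_headway": max(gaps),
        "min_headway": min(gaps),
        "mean_headway": sum(gaps) / len(gaps),
    }
```

For trip starts at 06:50, 07:00, 07:10, 07:40, and 19:30, only the middle three fall in the default window, giving gaps of 10 and 30 minutes.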
- compute_route_time_series(dates: list[str], trip_stats: pl.DataFrame | pl.LazyFrame | None = None, num_minutes: int = 60, *, split_directions: bool = False) pl.LazyFrame
Compute route stats in time series form at the given num_minutes frequency for the trips that lie in the trip stats subset, which defaults to the output of trips.compute_trip_stats(), and that start on the given dates (YYYYMMDD date strings).
If split_directions, then separate each route’s stats by trip direction.
Return a time series table with the following columns.
datetime: datetime object
route_id
direction_id: direction of route; present if and only if split_directions
num_trips: number of trips in service on the route at any time within the time bin
num_trip_starts: number of trips that start within the time bin
num_trip_ends: number of trips that end within the time bin, ignoring trips that end past midnight
service_distance: sum of the service distance accrued during the time bin across all trips on the route; measured in kilometers if feed.dist_units is metric; otherwise measured in miles
service_duration: sum of the service duration accrued during the time bin across all trips on the route; measured in hours
service_speed: service_distance/service_duration for the route
Exclude dates that lie outside of the Feed’s date range. If all dates lie outside the Feed’s date range, then return an empty table.
Notes
If you’ve already computed trip stats in your workflow, then you should pass that table into this function to speed things up significantly.
If a route does not run on a given date, then it won’t appear in the time series for that date.
See the notes for compute_route_time_series_0().
Raise a ValueError if split_directions and no non-null direction ID values are present.
- compute_screen_line_counts(screen_lines: st.GeoLazyFrame | st.GeoDataFrame, dates: list[str], *, include_diagnostics: bool = False) pl.LazyFrame
Find all the Feed trips active on the given YYYYMMDD dates that intersect the given screen lines (LineStrings) with optional ID column screen_line_id. Behind the scenes, use simple sub-LineStrings of the feed to compute screen line intersections. Using them instead of the Feed shapes avoids miscounting intersections in the case of non-simple (self-intersecting) shapes.
For each trip crossing a screen line, compute the crossing time, crossing direction, etc. and return a table of results with the columns
'date': the YYYYMMDD date string given
'screen_line_id': ID of a screen line
'trip_id': ID of a trip that crosses the screen line
'shape_id': ID of the trip’s shape
'direction_id': GTFS direction of trip
'route_id'
'route_short_name'
'route_type'
'crossing_direction': 1 or -1; 1 indicates trip travel from the left side to the right side of the screen line; -1 indicates trip travel in the opposite direction
'crossing_time': time, according to the GTFS schedule, that the trip crosses the screen line
'crossing_dist_m': distance along the trip shape (not subshape) of the crossing; in meters
If include_diagnostics, then include the following extra columns for diagnostic purposes.
'subshape_id': ID of the simple sub-LineString S of the trip’s shape that crosses the screen line
'subshape_length_m': length of S in meters
'from_departure_time': departure time of the trip from the last stop before the screen line
'to_departure_time': departure time of the trip from the first stop after the screen line
'subshape_dist_frac': proportion of S’s length at which the screen line intersects S
Notes:
Assume the Feed’s stop times table has an accurate shape_dist_traveled column.
Assume that trips travel in the same direction as their shapes, an assumption that is part of the GTFS.
Assume that the screen line is straight and simple.
The algorithm works as follows.
Find the Feed’s simple subshapes (computed via shapes.split_simple()) that intersect the screen lines.
For each such subshape and screen line, compute the intersection points, the distance of each point along the subshape, aka the crossing distance, and the orientation of the screen line relative to the subshape.
Restrict to trips active on the given dates and for each trip associated to an intersecting subshape above, interpolate a trip stop time for the intersection point using the crossing distance, subshape length, cumulative subshape length, and trip stop times.
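Two pieces of the algorithm above, the crossing direction and the interpolated crossing time, can be sketched with elementary geometry. This is an illustration under assumptions rather than the library's implementation: both helper names are hypothetical, coordinates are planar (x, y) pairs, times are seconds, and a trip segment parallel to the screen line is arbitrarily assigned -1.

```python
def crossing_direction(screen_line, segment):
    """Return 1 if travel along `segment` goes from the left side to the right
    side of `screen_line`, else -1. Both arguments are ((x0, y0), (x1, y1))."""
    (sx0, sy0), (sx1, sy1) = screen_line
    (tx0, ty0), (tx1, ty1) = segment
    # z-component of the cross product (screen direction) x (travel direction).
    # It is negative when travel points toward the right of the screen line.
    cross = (sx1 - sx0) * (ty1 - ty0) - (sy1 - sy0) * (tx1 - tx0)
    return 1 if cross < 0 else -1


def interpolate_crossing_time(from_time, to_time, from_dist, to_dist, crossing_dist):
    """Linearly interpolate the crossing time between the departure times of
    the stops before and after the crossing, using distances along the shape."""
    frac = (crossing_dist - from_dist) / (to_dist - from_dist)
    return from_time + frac * (to_time - from_time)
```

For a screen line pointing north, the left side is the west, so an eastbound trip crosses with direction 1 and a westbound trip with direction -1.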
- compute_stop_activity(dates: list[str]) pl.LazyFrame
Mark stops as active or inactive on the given dates (YYYYMMDD date strings). A stop is active on a given date if some trip that starts on the date visits the stop (possibly after midnight).
Return a table with the columns
stop_id
dates[0]: 1 if the stop has at least one trip visiting it on dates[0]; 0 otherwise
dates[1]: 1 if the stop has at least one trip visiting it on dates[1]; 0 otherwise
etc.
dates[-1]: 1 if the stop has at least one trip visiting it on dates[-1]; 0 otherwise
If all dates lie outside the Feed period, then return an empty table.
- compute_stop_stats(dates: list[str], stop_ids: list[str | None] = None, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) pl.LazyFrame
Compute stats for all stops for the given dates (YYYYMMDD date strings). Optionally, restrict to the stop IDs given.
If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops. Use the headway start and end times to specify the time period for computing headway stats.
Return a table with the columns
'date'
'stop_id'
'direction_id': present if and only if split_directions
'num_routes': number of routes visiting the stop (in the given direction) on the date
'num_trips': number of trips visiting stop (in the given direction) on the date
'max_headway': maximum of the durations (in minutes) between trip departures at the stop between headway_start_time and headway_end_time on the date
'min_headway': minimum of the durations (in minutes) mentioned above
'mean_headway': mean of the durations (in minutes) mentioned above
'start_time': earliest departure time of a trip from this stop on the date
'end_time': latest departure time of a trip from this stop on the date
Exclude dates with no active stops, which could yield the empty table.
- compute_stop_time_series(dates: list[str], stop_ids: list[str | None] = None, num_minutes: int = 60, *, split_directions: bool = False) pl.LazyFrame
Compute time series for the given stops (defaults to all stops in Feed) on the given dates (YYYYMMDD date strings) at the given num_minutes frequency. Return a long-format table with the columns
datetime: datetime object for the given date and frequency chunks
stop_id
direction_id: direction of route; present if and only if split_directions
num_trips: the number of trips that visit the stop in the time bin and have a nonnull departure time from the stop
Exclude dates that lie outside of the Feed’s date range. If all dates lie outside the Feed’s date range, then return an empty table.
If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops.
Notes
Stop times with null departure times are ignored, so the aggregate of num_trips across the day could be less than the num_trips column in compute_stop_stats_0().
All trip departure times are taken modulo 24 hours, so routes with trips that end past 23:59:59 will have all their stats wrap around to the early morning of the time series.
‘num_trips’ should be resampled by summing.
Raise a ValueError if split_directions and no non-null direction ID values are present.
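The wrap-around behaviour noted above can be seen in a small sketch that bins departure times after reducing them modulo 24 hours. The helper bin_departures is hypothetical and works on plain strings for a single stop, assuming num_minutes divides a day evenly; it is not the library's implementation.

```python
from collections import Counter


def bin_departures(departure_times, num_minutes=60):
    """Count departures per time bin, taking times modulo 24 hours.

    Null (None) departure times are ignored, as in the notes above.
    Bins are labelled by their HH:MM:SS start time.
    """
    counts = Counter()
    for t in departure_times:
        if t is None:
            continue
        h, m, s = map(int, t.split(":"))
        secs = (3600 * h + 60 * m + s) % 86400  # wrap times past midnight
        bin_start = secs // (60 * num_minutes) * num_minutes  # minutes after midnight
        counts[f"{bin_start // 60:02d}:{bin_start % 60:02d}:00"] += 1
    return dict(sorted(counts.items()))
```

Here a departure at 25:05:00 lands in the 01:00:00 bin, which is the early-morning wrap-around the note warns about.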
- compute_trip_activity(dates: list[str]) pl.LazyFrame
Mark trips as active or inactive on the given dates (YYYYMMDD date strings). Return a table with the columns
'trip_id'
dates[0]: 1 if the trip is active on dates[0]; 0 otherwise
dates[1]: 1 if the trip is active on dates[1]; 0 otherwise
etc.
dates[-1]: 1 if the trip is active on dates[-1]; 0 otherwise
If dates is None or the empty list, then return an empty table.
- compute_trip_stats(route_ids: list[str | None] = None, *, compute_dist_from_shapes: bool = False) pl.LazyFrame
Return a table with the following columns:
'trip_id'
'route_id'
'route_short_name'
'route_type'
'direction_id': null if missing from feed
'shape_id': null if missing from feed
'stop_pattern_name': output from name_stop_patterns()
'num_stops': number of stops on trip
'start_time': first departure time of the trip
'end_time': last departure time of the trip
'start_stop_id': stop ID of the first stop of the trip
'end_stop_id': stop ID of the last stop of the trip
'is_loop': True if the start and end stop are less than 400m apart and False otherwise
'distance': distance of the trip; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all null entries if feed.shapes is None
'duration': duration of the trip in hours
'speed': distance/duration
If feed.stop_times has a shape_dist_traveled column with at least one non-null value and compute_dist_from_shapes == False, then use that column to compute the distance column. Else if feed.shapes is not None, then compute the distance column using the shapes and Shapely. Otherwise, set the distances to null.
If route IDs are given, then restrict to trips on those routes.
Notes
Assume the following feed attributes are not None:
feed.trips
feed.routes
feed.stop_times
feed.shapes (optionally)
Calculating trip distances with compute_dist_from_shapes=True seems pretty accurate. For example, calculating trip distances on this Portland feed using compute_dist_from_shapes=False and compute_dist_from_shapes=True yields a difference of at most 0.83 km from the original values.
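As an illustration of the 'is_loop' rule in the column list above, the sketch below tests whether a trip's first and last stops are less than 400 m apart. The documentation does not say how the library measures that distance; the haversine formula and the helper names here are assumptions made for this example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, adequate for a sketch


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two WGS84 points."""
    dlon, dlat = radians(lon2 - lon1), radians(lat2 - lat1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def is_loop(start_stop, end_stop, threshold_m=400.0):
    """start_stop and end_stop are (lon, lat) pairs of a trip's end stops."""
    return haversine_m(*start_stop, *end_stop) < threshold_m
```

Two stops about 90 m apart count as a loop in this sketch, while stops about 900 m apart do not.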
- convert_dist(new_dist_units: str) Feed
Convert the distances recorded in the shape_dist_traveled columns of the given Feed to the given distance units. New distance units must lie in constants.DIST_UNITS. Return the resulting Feed.
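The unit conversion itself amounts to a scale factor between the four units in constants.DIST_UNITS. A minimal sketch, with a hypothetical helper convert and a hypothetical lookup table TO_METERS built from the standard definitions of the foot and mile:

```python
# Meters per unit, for the units listed in constants.DIST_UNITS.
TO_METERS = {"ft": 0.3048, "mi": 1609.344, "m": 1.0, "km": 1000.0}


def convert(value, from_units, to_units):
    """Convert a distance between any two of 'ft', 'mi', 'm', 'km'."""
    if from_units not in TO_METERS or to_units not in TO_METERS:
        raise ValueError("units must be one of 'ft', 'mi', 'm', 'km'")
    return value * TO_METERS[from_units] / TO_METERS[to_units]
```

Applying such a factor to every shape_dist_traveled value is the idea behind the conversion; how the library vectorizes it is not shown here.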
- create_shapes(*, all_trips: bool = False) Feed
Given a feed, create a shape for every trip that is missing a shape ID. Do this by connecting the stops on the trip with straight lines. Return the resulting feed which has updated shapes and trips tables.
If all_trips, then create new shapes for all trips by connecting stops, and remove the old shapes.
- describe(sample_date: str | None = None) pl.LazyFrame
Return a table of various feed indicators and values, e.g. number of routes. Specialize some of those indicators to the given YYYYMMDD sample date string, e.g. number of routes active on the date.
The resulting table has the columns
'indicator': string; name of an indicator, e.g. ‘num_routes’
'value': value of the indicator, e.g. 27
- property dist_units: str
The distance units of the Feed.
- drop_invalid_columns() Feed
Drop all table columns of the given Feed that are not listed in the GTFS. Return the resulting Feed.
- drop_zombies() Feed
In the given Feed, do the following in order and return the resulting Feed.
Drop agencies with no routes.
Drop stops of location type 0 or None with no stop times.
Remove undefined parent stations from the parent_station column.
Drop trips with no stop times.
Drop shapes with no trips.
Drop routes with no trips.
Drop services with no trips.
- extend_id(id_col: str, extension: str, *, prefix=True) Feed
Add a prefix (if prefix) or a suffix (otherwise) to all values of column id_col across all tables of this Feed. This can be helpful when preparing to merge multiple GTFS feeds with colliding route IDs, say.
Raise a ValueError if id_col values are not strings, e.g. if id_col is ‘direction_id’.
- geometrize_shapes(*, use_utm: bool = False) GeoLazyFrame
Given a GTFS shapes table, convert it to a geotable of LineStrings and return the result, which will no longer have the columns 'shape_pt_sequence', 'shape_pt_lon', 'shape_pt_lat', and 'shape_dist_traveled'.
If use_utm, then use local UTM coordinates for the geometries.
- geometrize_stops(*, use_utm: bool = False) GeoDataFrame | GeoLazyFrame
Given a GTFS stops table, convert it to a geotable with a “geometry” column of Points and a “srid” column with the (constant) SRID of the geographic projection, e.g. ‘EPSG:4326’ for the WGS84 SRID. Return the resulting geotable, which will no longer have the columns 'stop_lon' and 'stop_lat'.
If use_utm, then use local UTM coordinates for the geometries.
- get_active_services(date: str) list[str]
Given a Feed and a date string in YYYYMMDD format, return the service IDs that are active on the date.
- get_dates(*, as_date_obj: bool = False) list[str] | list[dt.date]
Return the inclusive date range covered by feed.calendar and feed.calendar_dates as consecutive days. If neither table yields dates, return the empty list.
If as_date_obj, then return datetime.date objects instead.
Note that this is a range and not the set of actual service days.
- get_first_week(*, as_date_obj: bool = False) list[str] | list[dt.date]
Return a list of YYYYMMDD date strings for the first Monday–Sunday week (or initial segment thereof) for which the given Feed is valid. If the feed has no Mondays, then return the empty list.
If as_date_obj, then return datetime.date objects instead.
- get_routes(date: str | None = None, time: str | None = None, *, as_geo: bool = False, use_utm: bool = False, split_directions: bool = False) pl.LazyFrame | st.GeoLazyFrame
Return feed.routes or a subset thereof. If a YYYYMMDD date string is given, then restrict routes to only those active on the date. If an HH:MM:SS time string is given, possibly with HH > 23, then restrict routes to only those active during the time. If as_geo, return a geotable with all the columns of feed.routes plus a geometry column of (Multi)LineStrings, each of which represents the corresponding route’s shape.
If as_geo and feed.shapes is not None, then return the routes as a geotable with a ‘geometry’ column of (Multi)LineStrings. The geotable will have a local UTM SRID if use_utm; otherwise it will have the WGS84 SRID. If as_geo and split_directions, then add the column direction_id and split each route into the union of its direction 0 shapes and the union of its direction 1 shapes. If as_geo and feed.shapes is None, then raise a ValueError.
- get_shapes(*, as_geo: bool = False, use_utm: bool = False) pl.LazyFrame | None
Get the shapes table for the given feed, which could be None. If as_geo, then return it as a geotable with a ‘geometry’ column of LineStrings and no ‘shape_pt_sequence’, ‘shape_pt_lon’, ‘shape_pt_lat’, ‘shape_dist_traveled’ columns. The geotable will have a UTM SRID if use_utm; otherwise it will have a WGS84 SRID.
- get_shapes_intersecting_geometry(geometry: sg.base.BaseGeometry, shapes_g: st.GeoDataFrame | st.GeoLazyFrame = None, *, as_geo: bool = False) st.GeoLazyFrame | None
If the Feed has no shapes, then return None. Otherwise, return the subset of feed.shapes that contains all shapes that intersect the given Shapely WGS84 geometry, e.g. a Polygon or LineString.
If as_geo, then return the shapes as a geotable. Specifying shapes_g will skip the first step of the algorithm, namely, geometrizing feed.shapes.
- get_start_and_end_times(date: str | None = None) tuple[str]
Return the first departure time and last arrival time (HH:MM:SS time strings) listed in feed.stop_times, respectively. Restrict to the given date (YYYYMMDD string) if specified.
- get_stop_times(date: str | None = None) pl.LazyFrame
Return feed.stop_times. If a date (YYYYMMDD date string) is given, then subset the result to only those stop times with trips active on the date.
- get_stops(date: str | None = None, trip_ids: Iterable[str] | None = None, route_ids: Iterable[str] | None = None, *, in_stations: bool = False, as_geo: bool = False, use_utm: bool = False) pl.LazyFrame
Return feed.stops. If a YYYYMMDD date string is given, then subset to stops active (visited by trips) on that date. If trip IDs are given, then subset further to stops visited by those trips. If route IDs are given, then ignore the trip IDs and subset further to stops visited by those routes. If in_stations, then subset further to stops in stations if station data is available. If as_geo, then return the result as a geotable with a ‘geometry’ column of points instead of ‘stop_lat’ and ‘stop_lon’ columns. The geotable will have a UTM SRID if use_utm and a WGS84 SRID otherwise.
- get_stops_in_area(area: st.GeoLazyFrame | st.GeoDataFrame) st.GeoLazyFrame
Return the subset of feed.stops that contains all stops that intersect the given geotable of polygons.
- get_trips(date: str | None = None, time: str | None = None, *, as_geo: bool = False, use_utm: bool = False) pl.LazyFrame | st.GeoLazyFrame
Return feed.trips. If a date (YYYYMMDD date string) is given, then subset the result to trips that start on that date. If a time (HH:MM:SS string, possibly with HH > 23) is given in addition to a date, then further subset the result to trips in service at that time.
If as_geo and feed.shapes is not None, then return the trips as a geotable of LineStrings representing trip shapes. Use the local UTM CRS if use_utm; otherwise use the WGS84 CRS. If as_geo and feed.shapes is None, then raise a ValueError.
- get_week(k: int, *, as_date_obj: bool = False) list[str] | list[dt.date]
Given a Feed and a positive integer k, return a list of YYYYMMDD date strings corresponding to the kth Monday–Sunday week (or initial segment thereof) for which the Feed is valid. For example, k=1 returns the first Monday–Sunday week (or initial segment thereof). If the Feed does not have k Mondays, then return the empty list.
If as_date_obj, then return datetime.date objects instead.
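The week selection rule can be sketched with the standard library alone. Both helpers below are hypothetical: inclusive_dates builds a consecutive date range like the one get_dates() is documented to return, and kth_week picks the kth Monday–Sunday week from such a list.

```python
import datetime as dt


def inclusive_dates(start, end):
    """Return consecutive YYYYMMDD strings from start to end, inclusive."""
    a = dt.datetime.strptime(start, "%Y%m%d").date()
    b = dt.datetime.strptime(end, "%Y%m%d").date()
    return [(a + dt.timedelta(days=i)).strftime("%Y%m%d") for i in range((b - a).days + 1)]


def kth_week(dates, k):
    """Return the kth Monday-Sunday week of `dates`, or its initial segment
    if the dates run out; return [] if there are fewer than k Mondays."""
    mondays = [
        i for i, d in enumerate(dates)
        if dt.datetime.strptime(d, "%Y%m%d").weekday() == 0  # Monday is 0
    ]
    if k < 1 or len(mondays) < k:
        return []
    start = mondays[k - 1]
    return dates[start:start + 7]
```

For a feed spanning Wednesday 2024-01-03 through Tuesday 2024-01-16, the first week starts on Monday 2024-01-08 and the second week is only the two-day segment that starts on 2024-01-15.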
- list_fields(table_name: str | None = None) pl.LazyFrame
Return a table summarizing all GTFS tables in the given feed, or only the table with the given name if specified.
The resulting table has the following columns.
'table': name of the GTFS table, e.g. 'stops'
'column': name of a column in the table, e.g. 'stop_id'
'num_values': number of values in the column
'num_nonnull_values': number of nonnull values in the column
'num_unique_values': number of unique values in the column, excluding null values
'min_value': minimum value in the column
'max_value': maximum value in the column
If the table is not in the feed, then return an empty table. If the table is not valid, raise a ValueError.
- locate_trips(date: str, times: list[str]) LazyFrame
Return the positions of all trips active on the given date (YYYYMMDD date string) and times (HH:MM:SS time strings, possibly with HH > 23).
Return a table with the columns
'trip_id'
'shape_id'
'route_id'
'direction_id': null if feed.trips.direction_id is missing
'time'
'rel_dist': number between 0 (start) and 1 (end) indicating the relative distance of the trip along its path
'lon': longitude of trip at given time
'lat': latitude of trip at given time
Assume feed.stop_times has an accurate shape_dist_traveled column.
- map_routes(route_ids: Iterable[str] | None = None, route_short_names: Iterable[str] | None = None, color_palette: Iterable[str] = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3'], *, show_stops: bool = False)
Return a Folium map showing the given routes and (optionally) their stops. At least one of route_ids and route_short_names must be given. If both are given, then combine the two into a single set of routes. If any of the given route IDs are not found in the feed, then raise a ValueError.
- map_stops(stop_ids: Iterable[str], stop_style: dict = {'color': '#fc8d62', 'fill': 'true', 'fillOpacity': 0.75, 'radius': 8, 'weight': 1})
Return a Folium map showing the given stops of this Feed. If some of the given stop IDs are not found in the feed, then raise a ValueError.
- map_trips(trip_ids: Iterable[str], color_palette: list[str] = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3'], *, show_stops: bool = False, show_direction: bool = True)
Return a Folium map showing the given trips. Silently drop invalid trip IDs given. If show_stops, then plot the trip stops too. If show_direction, then use the Folium plugin PolyLineTextPath to draw arrows on each trip polyline indicating its direction of travel; this fails to work in some browsers, such as Brave 0.68.132.
- name_stop_patterns() pl.LazyFrame
For each (route ID, direction ID) pair, find the distinct stop patterns of its trips, and assign them each an integer pattern rank based on the stop pattern’s frequency rank, where 1 is the most frequent stop pattern, 2 is the second most frequent, etc. Return the table feed.trips with the additional column stop_pattern_name, which equals the trip’s ‘direction_id’ concatenated with a dash and its stop pattern rank.
If feed.trips has no ‘direction_id’ column, then temporarily create one equal to all zeros, proceed as above, then delete the column.
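The ranking described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: trips are given as dicts, each trip's stop pattern is a tuple of stop IDs, the helper name is hypothetical, and ties in frequency are broken by first-seen order.

```python
from collections import Counter, defaultdict


def name_stop_patterns_sketch(trips, stop_patterns):
    """
    trips: list of dicts with 'trip_id', 'route_id', and optionally 'direction_id'.
    stop_patterns: dict mapping trip ID -> tuple of stop IDs in stop_sequence order.
    Return copies of the trip dicts with a 'stop_pattern_name' key added.
    """
    # Count how often each stop pattern occurs per (route ID, direction ID) pair.
    # A missing direction ID is treated as 0, mirroring the temporary column.
    counts = defaultdict(Counter)
    for t in trips:
        key = (t["route_id"], t.get("direction_id", 0))
        counts[key][stop_patterns[t["trip_id"]]] += 1

    # Rank patterns by decreasing frequency; rank 1 is the most frequent.
    ranks = {}
    for key, counter in counts.items():
        for rank, (pattern, _) in enumerate(counter.most_common(), start=1):
            ranks[key, pattern] = rank

    result = []
    for t in trips:
        direction = t.get("direction_id", 0)
        rank = ranks[(t["route_id"], direction), stop_patterns[t["trip_id"]]]
        result.append({**t, "stop_pattern_name": f"{direction}-{rank}"})
    return result
```

With three trips on one route and direction, two of which share a stop pattern, the shared pattern gets the name 0-1 and the other gets 0-2.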
- restrict_to_agencies(agency_ids: list[str]) Feed
Build a new feed by restricting this one via restrict_to_routes() and the routes with the given agency IDs. Return the resulting feed.
- restrict_to_area(area: st.GeoDataFrame | st.GeoLazyFrame) Feed
Build a new feed by restricting this one via restrict_to_trips() and the trips that have at least one stop intersecting the given geotable of polygons, which can be in any coordinate reference system. Return the resulting feed.
- restrict_to_dates(dates: list[str]) Feed
Build a new feed by restricting this one via restrict_to_trips() and the trips active on at least one of the given dates (YYYYMMDD strings). Return the resulting feed.
- restrict_to_routes(route_ids: list[str]) Feed
Build a new feed by restricting this one via restrict_to_trips() and the trips with the given route IDs. Return the resulting feed.
- restrict_to_trips(trip_ids: list[str]) Feed
Build a new feed by restricting this one to only the stops, trips, shapes, etc. used by the trips of the given IDs. Return the resulting feed.
If no valid trip IDs are given, which includes the case of the empty list, then the resulting feed will have all empty non-agency tables.
This function is probably more useful internally than externally.
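As a rough illustration of what restricting by trips involves, the sketch below filters a toy feed held as a dict of row lists. The data layout and function name are hypothetical, only a few tables are handled, and the agency table is passed through untouched as a simplifying assumption.

```python
def restrict_to_trips_sketch(feed, trip_ids):
    """
    feed: dict mapping table name -> list of row dicts (a toy stand-in for a Feed).
    Return a new dict restricted to the rows used by the given trips.
    """
    wanted = set(trip_ids)
    trips = [r for r in feed.get("trips", []) if r["trip_id"] in wanted]
    kept_trip_ids = {r["trip_id"] for r in trips}
    stop_times = [
        r for r in feed.get("stop_times", []) if r["trip_id"] in kept_trip_ids
    ]

    # IDs referenced by the surviving trips and stop times
    route_ids = {r["route_id"] for r in trips}
    shape_ids = {r.get("shape_id") for r in trips}
    stop_ids = {r["stop_id"] for r in stop_times}

    return {
        "agency": list(feed.get("agency", [])),  # kept as is (assumption)
        "trips": trips,
        "stop_times": stop_times,
        "routes": [r for r in feed.get("routes", []) if r["route_id"] in route_ids],
        "shapes": [r for r in feed.get("shapes", []) if r["shape_id"] in shape_ids],
        "stops": [r for r in feed.get("stops", []) if r["stop_id"] in stop_ids],
    }
```

Passing an empty list, or only unknown trip IDs, leaves every non-agency table empty, matching the behavior described above.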
- routes_to_geojson(route_ids: Iterable[str] | None = None, route_short_names: Iterable[str] | None = None, *, split_directions: bool = False, include_stops: bool = False) dict
Return a GeoJSON FeatureCollection (in WGS84 coordinates) of MultiLineString features representing this Feed’s routes.
If an iterable of route IDs or route short names is given, then subset to the union of those routes, which could yield an empty FeatureCollection in case of all invalid route IDs and route short names. If
include_stops, then include the route stops as Point features. If the Feed has no shapes, then raise a ValueError.
- shapes_to_geojson(shape_ids: Iterable[str] | None = None) dict
Return a GeoJSON FeatureCollection of LineString features representing feed.shapes. If the Feed has no shapes, then the features will be an empty list. The coordinate reference system is the default one for GeoJSON, namely WGS84.
If an iterable of shape IDs is given, then subset to those shapes. If the subset is empty, then return a FeatureCollection with an empty list of features.
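For readers unfamiliar with the GeoJSON layout, here is a minimal sketch that builds such a FeatureCollection from raw shapes.txt rows using only the standard library. The function name, the input layout (a list of dicts), and the choice of feature properties are assumptions for illustration.

```python
from collections import defaultdict


def shapes_to_geojson_sketch(shape_rows, shape_ids=None):
    """
    shape_rows: list of dicts with 'shape_id', 'shape_pt_sequence',
    'shape_pt_lon', and 'shape_pt_lat'.
    Return a GeoJSON FeatureCollection dict of LineString features.
    """
    wanted = None if shape_ids is None else set(shape_ids)

    # Group the shape points by shape ID, subsetting if requested
    points = defaultdict(list)
    for r in shape_rows:
        if wanted is None or r["shape_id"] in wanted:
            points[r["shape_id"]].append(r)

    features = []
    for shape_id, rows in points.items():
        rows.sort(key=lambda r: r["shape_pt_sequence"])
        features.append(
            {
                "type": "Feature",
                "properties": {"shape_id": shape_id},
                "geometry": {
                    "type": "LineString",
                    # GeoJSON positions are [longitude, latitude]
                    "coordinates": [
                        [r["shape_pt_lon"], r["shape_pt_lat"]] for r in rows
                    ],
                },
            }
        )
    return {"type": "FeatureCollection", "features": features}
```

An empty input or an empty subset yields a FeatureCollection whose features list is empty.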
- split_simple() GeoLazyFrame
Given a geotable of GTFS shapes of the form output by geometrize_shapes() with possibly non-WGS84 coordinates, split each non-simple LineString into large simple (non-self-intersecting) sub-LineStrings, and leave the simple LineStrings as is.
Return a geotable in the coordinates of shapes_g with the columns
'shape_id': GTFS shape ID for a LineString L
'subshape_id': a unique identifier of a simple sub-LineString S of L
'subshape_sequence': integer; indicates the order of S when joining up all simple sub-LineStrings to form L
'subshape_length_m': the length of S in meters
'cum_length_m': the length of S plus the lengths of the sub-LineStrings of L that come before S; in meters
'geometry': LineString geometry corresponding to S
Within each ‘shape_id’ group, the subshapes will be sorted increasingly by ‘subshape_sequence’.
Notes
Simplicity checks and splitting are done in local UTM coordinates. Converting back to original coordinates can introduce rounding errors and non-simplicities. So test this function with a shapes_g in local UTM coordinates.
By construction, for each given LineString L with simple sub-LineStrings S_i, we have the inequality
sum over i of length(S_i) <= length(L),
where the lengths are expressed in meters.
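The length columns can be illustrated with a short sketch. It assumes the splitting has already been done and that coordinates are planar and in meters (for example UTM); the function name and the choice to start 'subshape_sequence' at 0 are hypothetical.

```python
import math


def subshape_lengths(subshapes):
    """
    subshapes: the simple sub-LineStrings of one shape, in order,
    each a list of (x, y) coordinates in meters.
    Return one dict per sub-LineString with its length and cumulative length.
    """
    rows = []
    cum = 0.0
    for seq, coords in enumerate(subshapes):
        # Length of a LineString is the sum of its segment lengths
        length = sum(math.dist(p, q) for p, q in zip(coords, coords[1:]))
        cum += length
        rows.append(
            {
                "subshape_sequence": seq,  # starting at 0 is an assumption
                "subshape_length_m": length,
                "cum_length_m": cum,
            }
        )
    return rows
```

The last row's 'cum_length_m' is the left-hand side of the inequality above, so it should never exceed the length of the original LineString.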
- stop_times_to_geojson(trip_ids: Iterable[str | None] = None) dict
Return a GeoJSON FeatureCollection of Point features representing all the trip-stop pairs in feed.stop_times. The coordinate reference system is the default one for GeoJSON, namely WGS84.
For every trip, drop duplicate stop IDs within that trip. In particular, a looping trip will lack its final stop.
If an iterable of trip IDs is given, then subset to those trips, silently dropping invalid trip IDs.
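The effect of dropping duplicate stop IDs on a looping trip can be seen in the following sketch. It assumes that the first occurrence of each stop within a trip is the one kept, which is consistent with the final stop of a loop going missing; the function name and input layout are hypothetical.

```python
def drop_duplicate_stops(stop_times):
    """
    stop_times: list of dicts with 'trip_id', 'stop_id', and 'stop_sequence'.
    Keep only the first occurrence of each stop ID within each trip.
    """
    seen = set()
    kept = []
    ordered = sorted(stop_times, key=lambda r: (r["trip_id"], r["stop_sequence"]))
    for r in ordered:
        key = (r["trip_id"], r["stop_id"])
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept
```

A trip visiting stops A, B, C, A yields only the first three trip-stop pairs, so the returned Point features do not include the closing visit to A.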
- stops_to_geojson(stop_ids: Iterable[str | None] = None) dict
Return a GeoJSON FeatureCollection of Point features representing all the stops in feed.stops. The coordinate reference system is the default one for GeoJSON, namely WGS84.
If an iterable of stop IDs is given, then subset to those stops.
- subset_dates(dates: list[str]) list[str]
Given a Feed and a list of YYYYMMDD date strings, return the sorted sublist of dates that lie in the Feed’s dates (the output of feed.get_dates()). Could be an empty list.
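A minimal sketch of the same idea, assuming the feed's dates are already available as a list of strings; the function name is hypothetical. Because YYYYMMDD strings are fixed-width, sorting them as strings also sorts them chronologically.

```python
def subset_dates_sketch(feed_dates, dates):
    """
    feed_dates: the feed's dates as YYYYMMDD strings.
    dates: candidate YYYYMMDD strings.
    Return the sorted candidates that are among the feed's dates.
    """
    valid = set(feed_dates)
    # String order equals date order for fixed-width YYYYMMDD strings
    return sorted(d for d in dates if d in valid)
```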
- to_file(path: Path, ndigits: int | None = None) None
Write this Feed to the given path. If the path ends in ‘.zip’, then write the feed as a zip archive. Otherwise assume the path is a directory, and write the feed as a collection of CSV files to that directory, creating the directory if it does not exist. Round all decimals to ndigits decimal places, if given. All distances will be in the distance units feed.dist_units. By the way, 6 decimal degrees of latitude and longitude is enough to locate an individual cat.
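The zip-versus-directory decision and the rounding step can be sketched with the standard library as follows. This is a toy version under assumptions: tables are lists of row dicts, files are named after their table with a .txt extension as in GTFS, empty tables are skipped, and the function name is hypothetical.

```python
import csv
import io
import zipfile
from pathlib import Path


def write_tables(tables, path, ndigits=None):
    """
    tables: dict mapping table name -> list of row dicts (a toy stand-in for a Feed).
    Write each non-empty table as <name>.txt into a zip archive or a directory.
    """
    path = Path(path)

    def to_csv(rows):
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
        writer.writeheader()
        for row in rows:
            # Round floats only, and only if ndigits was given
            writer.writerow(
                {
                    k: round(v, ndigits)
                    if ndigits is not None and isinstance(v, float)
                    else v
                    for k, v in row.items()
                }
            )
        return buf.getvalue()

    if path.suffix == ".zip":
        with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as archive:
            for name, rows in tables.items():
                if rows:
                    archive.writestr(f"{name}.txt", to_csv(rows))
    else:
        # Treat the path as a directory and create it if needed
        path.mkdir(parents=True, exist_ok=True)
        for name, rows in tables.items():
            if rows:
                with open(path / f"{name}.txt", "w", newline="") as f:
                    f.write(to_csv(rows))
```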
- trips_to_geojson(trip_ids: Iterable[str] | None = None, *, include_stops: bool = False) dict
Return a GeoJSON FeatureCollection (in WGS84 coordinates) of LineString features representing all the Feed’s trips.
If include_stops, then include the trip stops as Point features. If an iterable of trip IDs is given, then subset to those trips, which could yield an empty FeatureCollection in case of all invalid trip IDs.
- ungeometrize_stops() DataFrame | LazyFrame
The inverse of geometrize_stops().
If stops_g is in UTM coordinates, then convert those UTM coordinates back to WGS84 coordinates, which is the standard for a GTFS stops table.
- gtfs_kit_polars.feed.list_feed(path: Path) DataFrame
Given a path (string or Path object) to a GTFS zip file or directory, record the file names and file sizes of the contents, and return the result in a table with the columns:
'file_name'
'file_size'
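A standard-library sketch of the same listing is shown below. It returns a list of dicts rather than a DataFrame, reports uncompressed sizes for zip members, and raises a ValueError for a missing path; all three choices, and the function name, are assumptions made for illustration.

```python
import zipfile
from pathlib import Path


def list_feed_sketch(path):
    """
    path: string or Path to a GTFS zip file or directory.
    Return a list of dicts with keys 'file_name' and 'file_size' (in bytes).
    """
    path = Path(path)
    if not path.exists():
        raise ValueError(f"Path {path} does not exist")

    if path.is_file():
        # Zip archive: report the uncompressed size of each member
        with zipfile.ZipFile(path) as archive:
            return [
                {"file_name": info.filename, "file_size": info.file_size}
                for info in archive.infolist()
            ]

    # Directory: report the on-disk size of each regular file
    return [
        {"file_name": p.name, "file_size": p.stat().st_size}
        for p in sorted(path.iterdir())
        if p.is_file()
    ]
```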
- gtfs_kit_polars.feed.read_feed(path_or_url: Path | str, dist_units: str) Feed
Create a Feed instance from the given path or URL and given distance units. If the path exists, then call _read_feed_from_path(). Else if the URL has OK status according to Requests, then call _read_feed_from_url(). Else raise a ValueError.
Notes:
Ignore non-GTFS files in the feed
Automatically strip whitespace from the column names in GTFS files
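The two notes above can be illustrated with a small standard-library reader. It is a sketch, not how the library reads feeds: the set of recognized table names is a hypothetical subset, the function name is made up, and rows are returned as dicts of strings.

```python
import csv
from pathlib import Path

# Hypothetical subset of GTFS table names, for illustration only
GTFS_TABLES = {
    "agency", "stops", "routes", "trips", "stop_times",
    "calendar", "calendar_dates", "shapes",
}


def read_tables(directory):
    """
    Read the GTFS .txt files in the given directory into a dict
    mapping table name -> list of row dicts, skipping everything else.
    """
    tables = {}
    for p in sorted(Path(directory).iterdir()):
        if p.suffix != ".txt" or p.stem not in GTFS_TABLES:
            continue  # ignore non-GTFS files
        # utf-8-sig drops a leading byte order mark if present
        with p.open(newline="", encoding="utf-8-sig") as f:
            reader = csv.reader(f)
            # Strip whitespace from the column names
            header = [name.strip() for name in next(reader, [])]
            tables[p.stem] = [dict(zip(header, row)) for row in reader]
    return tables
```

A header such as " stop_id , stop_name" is read as the columns stop_id and stop_name, and a stray README in the same directory is skipped.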