GTFS Kit Polars 1.0.0 Documentation

Introduction

GTFS Kit Polars is a Python 3.12+ library for analyzing General Transit Feed Specification (GTFS) data. It uses Polars and Polars ST LazyFrames to do the heavy lifting.

The functions/methods of GTFS Kit Polars assume a valid GTFS feed but offer no inbuilt validation, because GTFS validation is complex and already solved by dedicated libraries. So unless you know what you’re doing, use the Canonical GTFS Validator before you analyze a feed with GTFS Kit Polars.

GTFS Kit Polars is an experimental port of the GTFS Kit library from Pandas to Polars. It can process large feeds much faster than the Pandas version, and if it proves useful enough, then I’ll incorporate it into GTFS Kit as a new release.

The one thing I don’t like about this Polars version is its dependence on Polars ST, a promising new geospatial library but one that is not yet as user-friendly as GeoPandas.

Authors

Alex Raichev, 2025-11

Installation

Install it from PyPI with UV, say, via uv add gtfs_kit_polars.

Examples

See the Jupyter notebook of examples in the project’s GitHub repository.

Conventions

In conformance with GTFS, dates are encoded as YYYYMMDD date strings, and times are encoded as HH:MM:SS time strings with the possibility that HH > 23. Watch out for that possibility, because it has counterintuitive consequences; see e.g. trips.is_active_trip(), which is used in routes.compute_route_stats(), stops.compute_stop_stats(), and miscellany.compute_network_stats().
‘DataFrame’ and ‘LazyFrame’ refer to Polars DataFrame and LazyFrame objects, respectively.

Module constants

Constants useful across modules.

gtfs_kit_polars.constants.COLORS_SET2 = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3']: Colorbrewer 8-class Set2 colors

gtfs_kit_polars.constants.DIST_UNITS = ['ft', 'mi', 'm', 'km']: Valid distance units

gtfs_kit_polars.constants.DTYPES = {'agency': {'agency_email': String, 'agency_fare_url': String, 'agency_id': String, 'agency_lang': String, 'agency_name': String, 'agency_phone': String, 'agency_timezone': String, 'agency_url': String}, 'attributions': {'agency_id': String, 'attribution_email': String, 'attribution_id': String, 'attribution_phone': String, 'attribution_url': String, 'is_authority': Int8, 'is_operator': Int8, 'is_producer': Int8, 'organization_name': String, 'route_id': String, 'trip_id': String}, 'calendar': {'end_date': String, 'friday': Int8, 'monday': Int8, 'saturday': Int8, 'service_id': String, 'start_date': String, 'sunday': Int8, 'thursday': Int8, 'tuesday': Int8, 'wednesday': Int8}, 'calendar_dates': {'date': String, 'exception_type': Int8, 'service_id': String}, 'fare_attributes': {'currency_type': String, 'fare_id': String, 'payment_method': Int8, 'price': Float64, 'transfer_duration': Int16, 'transfers': Int8}, 'fare_rules': {'contains_id': String, 'destination_id': String, 'fare_id': String, 'origin_id': String, 'route_id': String}, 'feed_info': {'feed_end_date': String, 'feed_lang': String, 'feed_publisher_name': String, 'feed_publisher_url': String, 'feed_start_date': String, 'feed_version': String}, 'frequencies': {'end_time': String, 'exact_times': Int8, 'headway_secs': Int16, 'start_time': String, 'trip_id': String}, 'routes': {'agency_id': String, 'route_color': String, 'route_desc': String, 'route_id': String, 'route_long_name': String, 'route_short_name': String, 'route_text_color': String, 'route_type': Int8, 'route_url': String}, 'shapes': {'shape_dist_traveled': Float64, 'shape_id': String, 'shape_pt_lat': Float64, 'shape_pt_lon': Float64, 'shape_pt_sequence': Int32}, 'stop_times': {'arrival_time': String, 'departure_time': String, 'drop_off_type': Int8, 'pickup_type': Int8, 'shape_dist_traveled': Float64, 'stop_headsign': String, 'stop_id': String, 'stop_sequence': Int32, 'timepoint': Int8, 'trip_id': String}, 'stops': 
{'location_type': Int8, 'parent_station': String, 'stop_code': String, 'stop_desc': String, 'stop_id': String, 'stop_lat': Float64, 'stop_lon': Float64, 'stop_name': String, 'stop_timezone': String, 'stop_url': String, 'wheelchair_boarding': Int8, 'zone_id': String}, 'transfers': {'from_stop_id': String, 'min_transfer_time': Int16, 'to_stop_id': String, 'transfer_type': Int8}, 'trips': {'bikes_allowed': Int8, 'block_id': String, 'direction_id': Int8, 'route_id': String, 'service_id': String, 'shape_id': String, 'trip_headsign': String, 'trip_id': String, 'trip_short_name': String, 'wheelchair_accessible': Int8}}: GTFS data types (Polars dtypes)

gtfs_kit_polars.constants.FEED_ATTRS = ['dist_units', 'unzip_dir', 'agency', 'attributions', 'calendar', 'calendar_dates', 'fare_attributes', 'fare_rules', 'feed_info', 'frequencies', 'routes', 'shapes', 'stops', 'stop_times', 'trips', 'transfers']: Feed attributes

gtfs_kit_polars.constants.WGS84 = 4326: WGS84 coordinate reference system (used by spatial ops)

Module helpers

Functions useful across modules.

gtfs_kit_polars.helpers.are_equal(f: DataFrame | LazyFrame, g: DataFrame | LazyFrame) → bool: Return True if and only if the tables are equal after sorting column names and sorting rows by all columns. Nulls are treated as equal.

gtfs_kit_polars.helpers.combine_time_series(series_by_indicator: dict[str, DataFrame | LazyFrame], *, kind: Literal['route', 'stop'], split_directions: bool = False) → LazyFrame

Combine a dict of wide time series (one table per indicator, columns are entities) into a single long-form time series with columns

'datetime'
'route_id' or 'stop_id': depending on kind
'direction_id': present if and only if split_directions
one column per indicator provided in series_by_indicator
'service_speed': if both service_distance and service_duration present

If split_directions, then assume the original time series contains data separated by trip direction; otherwise, assume not. The separation is indicated by a suffix '-0' (direction 0) or '-1' (direction 1) in the route ID or stop ID column values.

gtfs_kit_polars.helpers.date_to_datestr(x: date | None, format_str: str = '%Y%m%d') → str | None: Convert a datetime.date to a formatted string. Return None if x is None.

gtfs_kit_polars.helpers.datestr_to_date(x: str | None, format_str: str = '%Y%m%d') → date | None: Convert a date string to a datetime.date. Return None if x is None.
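As a rough illustration of the YYYYMMDD round trip that these two helpers describe, here is a standard-library sketch. The function names ending in _sketch are hypothetical, and the library's own implementation may differ.

```python
import datetime as dt
from typing import Optional


def date_to_datestr_sketch(x: Optional[dt.date], format_str: str = "%Y%m%d") -> Optional[str]:
    """Format a date as a string; pass None through unchanged."""
    return None if x is None else x.strftime(format_str)


def datestr_to_date_sketch(x: Optional[str], format_str: str = "%Y%m%d") -> Optional[dt.date]:
    """Parse a date string into a datetime.date; pass None through unchanged."""
    return None if x is None else dt.datetime.strptime(x, format_str).date()


d = datestr_to_date_sketch("20251103")
s = date_to_datestr_sketch(d)
```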

gtfs_kit_polars.helpers.downsample(time_series: DataFrame | LazyFrame, num_minutes: int) → LazyFrame

Downsample the given route, stop, or network time series (outputs of routes.compute_route_time_series(), stops.compute_stop_time_series(), or miscellany.compute_network_time_series(), respectively) to time bins of size num_minutes minutes.

Return the given time series unchanged if it’s empty or has only one time bin per date. Raise a ValueError if num_minutes does not evenly divide 1440 (the number of minutes in a day) or if it’s not a multiple of the bin size of the given time series.
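The two conditions on num_minutes can be pictured with the following sketch. The helper names are hypothetical and only mirror the rule stated above; they are not the library's code.

```python
MINUTES_PER_DAY = 1440


def check_downsample_args(bin_size: int, num_minutes: int) -> None:
    """Raise ValueError unless num_minutes is an admissible coarser bin size."""
    if MINUTES_PER_DAY % num_minutes != 0:
        raise ValueError(f"{num_minutes} does not evenly divide {MINUTES_PER_DAY}")
    if num_minutes % bin_size != 0:
        raise ValueError(f"{num_minutes} is not a multiple of the bin size {bin_size}")


def is_valid(bin_size: int, num_minutes: int) -> bool:
    """Convenience wrapper that reports validity as a boolean."""
    try:
        check_downsample_args(bin_size, num_minutes)
    except ValueError:
        return False
    return True
```

For instance, a 15-minute series can be downsampled to 60-minute bins, but not to 50-minute bins (50 does not divide 1440), and a 20-minute series cannot be downsampled to 30-minute bins (30 is not a multiple of 20).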

gtfs_kit_polars.helpers.get_bin_size(time_series: LazyFrame) → float: Return the number of minutes per bin of the given time series with datetime column ‘datetime’. Assume the time series is regularly sampled and therefore has a single bin size. Return None if there’s only one unique datetime present.

gtfs_kit_polars.helpers.get_convert_dist(dist_units_in: str, dist_units_out: str) → Callable

Return a Polars expression builder for distance conversion:

expr_or_col -> expr * factor

Only supports units in constants.DIST_UNITS. Usage:

.with_columns(distance_km=get_convert_dist("m", "km")("distance_m"))
.with_columns(distance_mi=get_convert_dist("km", "mi")(pl.col("dist")))
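The factor in expr * factor can be derived by routing every unit through meters. The sketch below is plain Python rather than a Polars expression builder, and the names METERS_PER_UNIT and conversion_factor are hypothetical.

```python
# Meters per unit; the keys match constants.DIST_UNITS.
METERS_PER_UNIT = {"ft": 0.3048, "mi": 1609.344, "m": 1.0, "km": 1000.0}


def conversion_factor(dist_units_in: str, dist_units_out: str) -> float:
    """Return the multiplier that converts dist_units_in to dist_units_out."""
    for units in (dist_units_in, dist_units_out):
        if units not in METERS_PER_UNIT:
            raise ValueError(f"Unsupported distance units: {units}")
    return METERS_PER_UNIT[dist_units_in] / METERS_PER_UNIT[dist_units_out]
```

Multiplying a column of distances by conversion_factor("km", "mi") would then convert kilometers to miles.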

gtfs_kit_polars.helpers.get_srid(g: DataFrame | LazyFrame) → int: Table version of the Polars ST function srid.

gtfs_kit_polars.helpers.get_utm_srid(g: GeoDataFrame | GeoLazyFrame) → int: Return the UTM SRID for the given geotable.

gtfs_kit_polars.helpers.get_utm_srid_0(lon, lat): Given the WGS84 longitude and latitude of a point, return its UTM SRID.
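A common way to derive a UTM SRID from a WGS84 point is to compute the 6-degree zone number from the longitude and add it to 32600 (northern hemisphere) or 32700 (southern hemisphere). The sketch below follows that convention and ignores the special zones around Norway and Svalbard; whether get_utm_srid_0() handles those cases is not stated here, and the function name is hypothetical.

```python
import math


def utm_srid_sketch(lon: float, lat: float) -> int:
    """Return the EPSG code of the standard UTM zone containing (lon, lat)."""
    # Zones are 6 degrees wide and numbered 1..60 starting at longitude -180.
    zone = int(math.floor((lon + 180) / 6)) % 60 + 1
    # EPSG 326xx covers the northern hemisphere, 327xx the southern.
    base = 32600 if lat >= 0 else 32700
    return base + zone
```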

gtfs_kit_polars.helpers.height(f: pl.DataFrame | pl.LazyDataFrame) → int

gtfs_kit_polars.helpers.is_empty(f: pl.DataFrame | pl.LazyDataFrame) → bool

gtfs_kit_polars.helpers.is_metric(dist_units: str) → bool: Return True if the given distance units equals ‘m’ or ‘km’; otherwise return False.

gtfs_kit_polars.helpers.is_not_null(f: DataFrame | LazyFrame, col_name: str) → bool: Return True if the given table has a column of the given name (string), and there exists at least one non-NaN value in that column; return False otherwise.

gtfs_kit_polars.helpers.longest_subsequence(seq, mode='strictly', order='increasing', key=None, *, index=False) → list

Return the longest increasing subsequence of seq.

Parameters:

seq (sequence object) – Can be any sequence, like str, list, numpy.array.
mode ({'strict', 'strictly', 'weak', 'weakly'}, optional) – If set to ‘strict’, the subsequence will contain unique elements. Using ‘weak’, an element can be repeated many times. Modes ending in -ly serve as a convenience to use with the order parameter, because longest_subsequence(seq, ‘weakly’, ‘increasing’) reads better. The default is ‘strictly’.
order ({'increasing', 'decreasing'}, optional) – By default return the longest increasing subsequence, but it is possible to return the longest decreasing sequence as well.
key (function, optional) – Specifies a function of one argument that is used to extract a comparison key from each list element (e.g., str.lower, lambda x: x[0]). The default value is None (compare the elements directly).
index (bool, optional) – If set to True, return the indices of the subsequence, otherwise return the elements. Default is False.

Returns:

elements (list, optional) – A list of elements of the longest subsequence. Returned by default and when index is set to False.
indices (list, optional) – A list of indices pointing to elements in the longest subsequence. Returned when index is set to True.
Taken from this Stack Overflow answer.
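For intuition, here is a minimal sketch of the strict, increasing case using the classic O(n log n) bisect technique with predecessor links. It is not the library's code, it omits the mode, order, key, and index options, and the function name is hypothetical.

```python
from bisect import bisect_left


def longest_increasing_subsequence(seq):
    """Return one longest strictly increasing subsequence of seq as a list."""
    tails = []      # tails[k]: index in seq of the smallest tail of a run of length k + 1
    tail_vals = []  # values at those indices, kept sorted so bisect applies
    prev = [None] * len(seq)  # predecessor links for reconstruction
    for i, x in enumerate(seq):
        k = bisect_left(tail_vals, x)
        if k == len(tails):
            tails.append(i)
            tail_vals.append(x)
        else:
            tails[k] = i
            tail_vals[k] = x
        prev[i] = tails[k - 1] if k > 0 else None
    # Walk the predecessor links back from the tail of the longest run.
    result = []
    i = tails[-1] if tails else None
    while i is not None:
        result.append(seq[i])
        i = prev[i]
    return result[::-1]


lis = longest_increasing_subsequence([3, 1, 4, 1, 5, 9, 2, 6])
```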

gtfs_kit_polars.helpers.make_html(d: dict) → str: Convert the given dictionary into an HTML table (string) with two columns: keys of dictionary, values of dictionary.

gtfs_kit_polars.helpers.make_ids(n: int, prefix: str = 'id_') → list[str]

Return a length n list of unique sequentially labelled strings for use as IDs.

Example:

>>> make_ids(11, prefix="s")
['s00', 's01', 's02', 's03', 's04', 's05', 's06', 's07', 's08', 's09', 's10']

gtfs_kit_polars.helpers.make_lazy(f: DataFrame | LazyFrame) → LazyFrame

gtfs_kit_polars.helpers.replace_date(f: DataFrame | LazyFrame, date: str) → DataFrame | LazyFrame: Given a table with a datetime object column called ‘datetime’ and given a YYYYMMDD date string, replace the datetime dates with the given date and return the resulting table.

gtfs_kit_polars.helpers.seconds_to_timestr(col: str, *, mod24: bool = False) → Expr

gtfs_kit_polars.helpers.seconds_to_timestr_0(x: int, *, mod24: bool = False) → str | None: The inverse of timestr_to_seconds(). If mod24, then first take the number of seconds modulo 24*3600. Return None in case of bad inputs.

gtfs_kit_polars.helpers.timestr_to_min(col: str) → Expr

gtfs_kit_polars.helpers.timestr_to_seconds(col: str, *, mod24: bool = False) → Expr

gtfs_kit_polars.helpers.timestr_to_seconds_0(x: str, *, mod24: bool = False) → int | None: Given an HH:MM:SS time string x, return the number of seconds past midnight that it represents. In keeping with GTFS standards, the hours entry may be greater than 23. If mod24, then return the number of seconds modulo 24*3600. Return None in case of bad inputs.
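The scalar conversions can be sketched in plain Python as follows. The function names are hypothetical, and returning None for bad inputs is an assumption based on the int | None and str | None return types in the signatures.

```python
from typing import Optional


def timestr_to_seconds_sketch(x: str, *, mod24: bool = False) -> Optional[int]:
    """Convert an HH:MM:SS string (HH may exceed 23) to seconds past midnight."""
    try:
        hours, mins, secs = (int(part) for part in x.strip().split(":"))
    except (AttributeError, ValueError):
        return None  # not a string, wrong number of parts, or non-integer parts
    result = hours * 3600 + mins * 60 + secs
    return result % (24 * 3600) if mod24 else result


def seconds_to_timestr_sketch(x: int, *, mod24: bool = False) -> Optional[str]:
    """Convert seconds past midnight back to an HH:MM:SS string."""
    try:
        seconds = int(x)
    except (TypeError, ValueError):
        return None
    if mod24:
        seconds %= 24 * 3600
    hours, remainder = divmod(seconds, 3600)
    mins, secs = divmod(remainder, 60)
    return f"{hours:02d}:{mins:02d}:{secs:02d}"
```

A trip departing at 25:30:00 is 91800 seconds past midnight of its service day, or 5400 seconds (01:30:00) after wrapping with mod24.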

gtfs_kit_polars.helpers.to_srid(g: DataFrame | LazyFrame, srid: int) → DataFrame | LazyFrame: Table version of the Polars ST function to_srid.

Module cleaners

Functions about cleaning feeds.

gtfs_kit_polars.cleaners.aggregate_routes(feed: Feed, by: str = 'route_short_name', route_id_prefix: str = 'route_') → Feed

Aggregate routes by route short name, say, and assign new route IDs using the given prefix.

More specifically, create new route IDs with the function build_aggregate_routes_table() and the parameters by and route_id_prefix and update the old route IDs to the new ones in all the relevant Feed tables. Return the resulting Feed.

gtfs_kit_polars.cleaners.aggregate_stops(feed: Feed, by: str = 'stop_code', stop_id_prefix: str = 'stop_') → Feed: Aggregate stops by the column by and assign new stop IDs using the given prefix. Update IDs in stops, stop_times, and transfers. Return the resulting Feed.

gtfs_kit_polars.cleaners.build_aggregate_routes_table(routes: DataFrame | LazyFrame, by: str = 'route_short_name', route_id_prefix: str = 'route_') → LazyFrame

Group routes by the by column and assign one new route ID per group using the given prefix. Return a table with columns

route_id
new_route_id

gtfs_kit_polars.cleaners.build_aggregate_stops_table(stops: DataFrame | LazyFrame, by: str = 'stop_code', stop_id_prefix: str = 'stop_') → LazyFrame

Group stops by the by column and assign one new stop ID per group using the given prefix. Return a table with columns

stop_id
new_stop_id

gtfs_kit_polars.cleaners.clean(feed: Feed) → Feed

Apply the following functions to the given Feed in order and return the resulting Feed.

clean_ids()
clean_times()
clean_route_short_names()
drop_zombies()

gtfs_kit_polars.cleaners.clean_column_names(f: DataFrame | LazyFrame) → DataFrame | LazyFrame: Strip the whitespace from all column names in the given table and return the result.

gtfs_kit_polars.cleaners.clean_ids(feed: Feed) → Feed: In the given Feed, strip whitespace from all string IDs and then replace every remaining whitespace chunk with an underscore. Return the resulting Feed.
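On a single ID string, the described cleaning amounts to a strip followed by a regular-expression substitution. This is a scalar sketch with a hypothetical function name; the library applies the idea to every ID column of the Feed.

```python
import re


def clean_id_sketch(x: str) -> str:
    """Strip outer whitespace, then replace each inner whitespace run with '_'."""
    return re.sub(r"\s+", "_", x.strip())


cleaned = clean_id_sketch("  route  10 A ")
```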

gtfs_kit_polars.cleaners.clean_route_short_names(feed: Feed) → Feed: In feed.routes, assign ‘n/a’ to missing route short names and strip whitespace from route short names. Then disambiguate each route short name that is duplicated by appending ‘-’ and its route ID. Return the resulting Feed.

gtfs_kit_polars.cleaners.clean_times(feed: Feed) → Feed: In the given Feed, convert H:MM:SS time strings to HH:MM:SS time strings to make sorting by time work as expected. Return the resulting Feed.
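The reason for zero-padding is that time strings are compared lexicographically, so '9:30:00' sorts after '10:00:00'. A scalar sketch of the fix, with a hypothetical function name:

```python
def pad_timestr_sketch(x: str) -> str:
    """Left-pad an H:MM:SS string with a zero so that it reads HH:MM:SS."""
    return x.strip().zfill(8)


times = ["10:00:00", "9:30:00", "25:10:00"]
unpadded_order = sorted(times)
padded_order = sorted(pad_timestr_sketch(t) for t in times)
```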

gtfs_kit_polars.cleaners.drop_invalid_columns(feed: Feed) → Feed: Drop all table columns of the given Feed that are not listed in the GTFS. Return the resulting Feed.

gtfs_kit_polars.cleaners.drop_zombies(feed: Feed) → Feed

In the given Feed, do the following in order and return the resulting Feed.

Drop agencies with no routes.
Drop stops of location type 0 or None with no stop times.
Remove undefined parent stations from the parent_station column.
Drop trips with no stop times.
Drop shapes with no trips.
Drop routes with no trips.
Drop services with no trips.

gtfs_kit_polars.cleaners.extend_id(feed: Feed, id_col: str, extension: str, *, prefix=True) → Feed

Add a prefix (if prefix) or a suffix (otherwise) to all values of column id_col across all tables of this Feed. This can be helpful when preparing to merge multiple GTFS feeds with colliding route IDs, say.

Raises a ValueError if id_col values are not strings, e.g. if id_col is ‘direction_id’.

Module calendar

Functions about calendar and calendar_dates.

gtfs_kit_polars.calendar.get_dates(feed: Feed, *, as_date_obj: bool = False) → list[str] | list[dt.date]

Return the inclusive date range covered by feed.calendar and feed.calendar_dates as consecutive days. If neither table yields dates, return the empty list.

If as_date_obj, then return datetime.date objects instead.

Note that this is a range and not the set of actual service days.
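To illustrate what "inclusive date range as consecutive days" means, the sketch below expands the earliest and latest of some YYYYMMDD strings into every day in between. The function name is hypothetical, and collecting the candidate dates from feed.calendar and feed.calendar_dates is left out.

```python
import datetime as dt


def inclusive_date_range(datestrs):
    """Return every YYYYMMDD day from min(datestrs) to max(datestrs), inclusive."""
    if not datestrs:
        return []
    # YYYYMMDD strings sort chronologically, so min and max work on the strings.
    start = dt.datetime.strptime(min(datestrs), "%Y%m%d").date()
    end = dt.datetime.strptime(max(datestrs), "%Y%m%d").date()
    num_days = (end - start).days + 1
    return [(start + dt.timedelta(days=i)).strftime("%Y%m%d") for i in range(num_days)]


dates = inclusive_date_range(["20260102", "20251230"])
```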

gtfs_kit_polars.calendar.get_first_week(feed: Feed, *, as_date_obj: bool = False) → list[str] | list[dt.date]

Return a list of YYYYMMDD date strings for the first Monday–Sunday week (or initial segment thereof) for which the given Feed is valid. If the feed has no Mondays, then return the empty list.

If as_date_obj, then return datetime.date objects instead.

gtfs_kit_polars.calendar.get_week(feed: Feed, k: int, *, as_date_obj: bool = False) → list[str] | list[dt.date]

Given a Feed and a positive integer k, return a list of YYYYMMDD date strings corresponding to the kth Monday–Sunday week (or initial segment thereof) for which the Feed is valid. For example, k=1 returns the first Monday–Sunday week (or initial segment thereof). If the Feed does not have k Mondays, then return the empty list.

If as_date_obj, then return datetime.date objects instead.
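The "kth Monday–Sunday week (or initial segment thereof)" rule can be sketched on a plain list of consecutive dates. The function name is hypothetical, and the input stands in for the output of get_dates() with as_date_obj=True.

```python
import datetime as dt


def week_sketch(dates, k):
    """Return the dates in the kth Monday-to-Sunday week found within dates."""
    mondays = [d for d in dates if d.weekday() == 0]  # Monday is weekday 0
    if k < 1 or k > len(mondays):
        return []
    monday = mondays[k - 1]
    # Keep only the days of that week that the date range actually covers.
    return [d for d in dates if monday <= d < monday + dt.timedelta(days=7)]


# A range from Saturday 2025-11-01 to Wednesday 2025-11-12.
feed_dates = [dt.date(2025, 11, 1) + dt.timedelta(days=i) for i in range(12)]
week_1 = week_sketch(feed_dates, 1)
week_2 = week_sketch(feed_dates, 2)
week_3 = week_sketch(feed_dates, 3)
```

Here the first week is complete, the second week is only an initial segment because the range ends on a Wednesday, and there is no third Monday.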

gtfs_kit_polars.calendar.subset_dates(feed: Feed, dates: list[str]) → list[str]: Given a Feed and a list of YYYYMMDD date strings, return the sorted sublist of dates that lie in the Feed’s dates (the output of feed.get_dates()). Could be an empty list.

Module routes

Functions about routes.

gtfs_kit_polars.routes.build_route_timetable(feed: Feed, route_id: str, dates: list[str]) → pl.LazyFrame

Return a timetable for the given route and dates (YYYYMMDD date strings).

Return a table whose columns are all those in feed.trips plus those in feed.stop_times plus 'date'. The trip IDs are restricted to the given route ID. The result is sorted first by date and then by grouping by trip ID and sorting the groups by their first departure time.

Skip dates outside of the Feed’s dates.

If there is no route activity on the given dates, then return an empty table.

gtfs_kit_polars.routes.compute_route_stats(feed: Feed, dates: list[str], trip_stats: pl.DataFrame | pl.LazyFrame | None = None, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) → pl.LazyFrame

Compute route stats for all the trips that lie in the given subset of trip stats, which defaults to feed.compute_trip_stats(), and that start on the given dates (YYYYMMDD date strings).

If split_directions, then separate the stats by trip direction (0 or 1). Use the headway start and end times to specify the time period for computing headway stats.

Return a table with the columns

'date'
'route_id'
'route_short_name'
'route_type'
'direction_id': present if and only if split_directions
'num_trips': number of trips on the route in the subset
'num_trip_starts': number of trips on the route with nonnull start times
'num_trip_ends': number of trips on the route with nonnull end times that end before 23:59:59
'num_stop_patterns': number of stop patterns across trips
'is_loop': 1 if at least one of the trips on the route has its is_loop field equal to 1; 0 otherwise
'is_bidirectional': 1 if the route has trips in both directions; 0 otherwise; present if and only if not split_directions
'start_time': start time of the earliest trip on the route
'end_time': end time of latest trip on the route
'max_headway': maximum of the durations (in minutes) between trip starts on the route between headway_start_time and headway_end_time on the given dates
'min_headway': minimum of the durations (in minutes) mentioned above
'mean_headway': mean of the durations (in minutes) mentioned above
'peak_num_trips': maximum number of simultaneous trips in service (for the given direction, or for both directions when split_directions==False)
'peak_start_time': start time of first longest period during which the peak number of trips occurs
'peak_end_time': end time of first longest period during which the peak number of trips occurs
'service_duration': total of the duration of each trip on the route in the given subset of trips; measured in hours
'service_distance': total of the distance traveled by each trip on the route in the given subset of trips; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_speed': service_distance/service_duration when defined; 0 otherwise
'mean_trip_distance': service_distance/num_trips
'mean_trip_duration': service_duration/num_trips

Exclude dates with no active trips, which could yield an empty table.

If not split_directions, then compute each route’s stats, except for headways, using its trips running in both directions. For headways, (1) compute max headway by taking the max of the max headways in both directions; (2) compute mean headway by taking the weighted mean of the mean headways in both directions.

Notes

If you’ve already computed trip stats in your workflow, then you should pass that table into this function to speed things up significantly.
The route stats for date d contain stats for trips that start on date d only and ignore trips that start on date d-1 and end on date d.
Raise a ValueError if split_directions and no non-null direction ID values present.
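The headway columns described above can be pictured as gaps between consecutive trip start times inside the headway window. The sketch below works on start times in seconds past midnight; treating the window as inclusive at both ends and returning None when fewer than two trips start in it are assumptions, and the function name is hypothetical.

```python
def headway_stats_sketch(start_times, window_start, window_end):
    """Return max, min, and mean headway in minutes for starts inside the window."""
    in_window = sorted(t for t in start_times if window_start <= t <= window_end)
    # A headway is the gap between one trip start and the next, in minutes.
    gaps = [(b - a) / 60 for a, b in zip(in_window, in_window[1:])]
    if not gaps:
        return {"max_headway": None, "min_headway": None, "mean_headway": None}
    return {
        "max_headway": max(gaps),
        "min_headway": min(gaps),
        "mean_headway": sum(gaps) / len(gaps),
    }


# Starts at 06:00, 07:00, 07:10, 07:30, 08:30; window 07:00 to 19:00.
starts = [21600, 25200, 25800, 27000, 30600]
stats = headway_stats_sketch(starts, 7 * 3600, 19 * 3600)
```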

gtfs_kit_polars.routes.compute_route_stats_0(trip_stats: DataFrame | LazyFrame, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) → LazyFrame

Compute stats for the given subset of trip stats (of the form output by the function trips.compute_trip_stats()).

Ignore trips with zero duration, because they are defunct.

If split_directions, then separate the stats by trip direction (0 or 1). Use the headway start and end times to specify the time period for computing headway stats.

Return a table with the columns

'route_id'
'route_short_name'
'route_type'
'direction_id': present if and only if split_directions
'num_trips': number of trips on the route in the subset
'num_trip_starts': number of trips on the route with nonnull start times
'num_trip_ends': number of trips on the route with nonnull end times that end before 23:59:59
'num_stop_patterns': number of stop patterns across trips
'is_loop': True if at least one of the trips on the route has its is_loop field equal to True; False otherwise
'is_bidirectional': True if the route has trips in both directions; False otherwise; present if and only if not split_directions
'start_time': start time of the earliest trip on the route
'end_time': end time of latest trip on the route
'max_headway': maximum of the durations (in minutes) between trip starts on the route between headway_start_time and headway_end_time on the given dates
'min_headway': minimum of the durations (in minutes) mentioned above
'mean_headway': mean of the durations (in minutes) mentioned above
'peak_num_trips': maximum number of simultaneous trips in service (for the given direction, or for both directions when split_directions==False)
'peak_start_time': start time of first longest period during which the peak number of trips occurs
'peak_end_time': end time of first longest period during which the peak number of trips occurs
'service_duration': total of the duration of each trip on the route in the given subset of trips; measured in hours
'service_distance': total of the distance traveled by each trip on the route in the given subset of trips; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_speed': service_distance/service_duration
'mean_trip_distance': service_distance/num_trips
'mean_trip_duration': service_duration/num_trips

If trip_stats is empty, return an empty table.

Raise a ValueError if split_directions and no non-NaN direction ID values present
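One way to compute a peak number of simultaneous trips, together with the first longest period at that peak, is a sweep over sorted start and end events. The sketch below is an illustration only: the function name is hypothetical, and the tie-breaking rule (a trip ending at time t is not counted as simultaneous with a trip starting at t) is an assumption.

```python
def peak_trips_sketch(intervals):
    """Given (start, end) pairs with start < end, return (peak, peak_start, peak_end)."""
    if not intervals:
        return (0, None, None)
    # End events (-1) sort before start events (+1) at equal times.
    events = sorted([(s, 1) for s, _ in intervals] + [(e, -1) for _, e in intervals])
    # First pass: record the number of trips in service after each event.
    count = 0
    steps = []
    for t, delta in events:
        count += delta
        steps.append((t, count))
    peak = max(c for _, c in steps)
    # Second pass: among the periods at peak, keep the first longest one.
    best = None
    for (t, c), (t_next, _) in zip(steps, steps[1:]):
        if c == peak and (best is None or t_next - t > best[1] - best[0]):
            best = (t, t_next)
    return (peak, best[0], best[1])


result = peak_trips_sketch([(0, 100), (50, 150), (120, 200), (130, 140)])
tie = peak_trips_sketch([(0, 10), (5, 20), (30, 40), (35, 50)])
```

In the second call, two periods of equal length reach the peak of 2, and the earlier one is reported.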

gtfs_kit_polars.routes.compute_route_time_series(feed: Feed, dates: list[str], trip_stats: pl.DataFrame | pl.LazyFrame | None = None, num_minutes: int = 60, *, split_directions: bool = False) → pl.LazyFrame

Compute route stats in time series form at the given num_minutes frequency for the trips that lie in the trip stats subset, which defaults to the output of trips.compute_trip_stats(), and that start on the given dates (YYYYMMDD date strings).

If split_directions, then separate each route’s stats by trip direction.

Return a time series table with the following columns.

datetime: datetime object
route_id
direction_id: direction of route; present if and only if split_directions
num_trips: number of trips in service on the route at any time within the time bin
num_trip_starts: number of trips that start within the time bin
num_trip_ends: number of trips that end within the time bin, ignoring trips that end past midnight
service_distance: sum of the service distance accrued during the time bin across all trips on the route; measured in kilometers if feed.dist_units is metric; otherwise measured in miles
service_duration: sum of the service duration accrued during the time bin across all trips on the route; measured in hours
service_speed: service_distance/service_duration for the route

Exclude dates that lie outside of the Feed’s date range. If all dates lie outside the Feed’s date range, then return an empty table.

Notes

If you’ve already computed trip stats in your workflow, then you should pass that table into this function to speed things up significantly.
If a route does not run on a given date, then it won’t appear in the time series for that date
See the notes for compute_route_time_series_0()
Raise a ValueError if split_directions and no non-null direction ID values present

gtfs_kit_polars.routes.compute_route_time_series_0(trip_stats: DataFrame | LazyFrame, date_label: str = '20010101', num_minutes: int = 60, *, split_directions: bool = False) → LazyFrame

Compute stats in a 24-hour time series form at the num_minutes frequency for the given subset of trip stats of the form output by the function trips.compute_trip_stats().

If split_directions, then separate each route’s stats by trip direction. Use the given YYYYMMDD date label as the date in the time series index.

Return a long-format table with the columns

datetime: datetime object
route_id
direction_id: direction of route; present if and only if split_directions
num_trips: number of trips in service on the route at any time within the time bin
num_trip_starts: number of trips that start within the time bin
num_trip_ends: number of trips that end within the time bin, ignoring trips that end past midnight
service_distance: sum of the service distance accrued during the time bin across all trips on the route; measured in kilometers if feed.dist_units is metric; otherwise measured in miles
service_duration: sum of the service duration accrued during the time bin across all trips on the route; measured in hours
service_speed: service_distance/service_duration for the route

Notes

Trips that lack start or end times are ignored, so the aggregate num_trips across the day could be less than the num_trips column of compute_route_stats_0()
All trip departure times are taken modulo 24 hours. So routes with trips that end past 23:59:59 will have all their stats wrap around to the early morning of the time series, except for their num_trip_ends indicator. Trip endings past 23:59:59 are not binned so that resampling the num_trips indicator works efficiently.
Note that the total number of trips for two consecutive time bins t1 < t2 is the sum of the number of trips in bin t2 plus the number of trip endings in bin t1. Thus we can downsample the num_trips indicator by keeping track of only one extra count, num_trip_ends, and can avoid recording individual trip IDs.
All other indicators are downsampled by summing.
Raise a ValueError if split_directions and no non-null direction ID values present
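The identity behind downsampling num_trips can be checked on a toy example. The sketch assumes positive-duration trips given as (start, end) minutes, a trip counted as in service in bin (a, b] when start < b and end > a, and a trip counted as ending in the bin when a < end <= b. These boundary conventions and the function names are assumptions for illustration, not the library's definitions.

```python
def in_service(trip, time_bin):
    """True if the trip runs at some time within the bin."""
    (start, end), (a, b) = trip, time_bin
    return start < b and end > a


def ends_in(trip, time_bin):
    """True if the trip ends within the bin."""
    (_, end), (a, b) = trip, time_bin
    return a < end <= b


trips = [(0, 30), (10, 70), (50, 60), (65, 110), (20, 60)]
t1, t2, merged = (0, 60), (60, 120), (0, 120)

num_trips_t1 = sum(in_service(t, t1) for t in trips)
num_trips_t2 = sum(in_service(t, t2) for t in trips)
num_trip_ends_t1 = sum(ends_in(t, t1) for t in trips)
num_trips_merged = sum(in_service(t, merged) for t in trips)
```

The trips in service during the merged bin are those in service during t2 plus those that already ended during t1, so num_trips_merged equals num_trips_t2 + num_trip_ends_t1 without any trip IDs being tracked.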

gtfs_kit_polars.routes.get_routes(feed: Feed, date: str | None = None, time: str | None = None, *, as_geo: bool = False, use_utm: bool = False, split_directions: bool = False) → pl.LazyFrame | st.GeoLazyFrame

Return feed.routes or a subset thereof. If a YYYYMMDD date string is given, then restrict routes to only those active on the date. If a HH:MM:SS time string is given, possibly with HH > 23, then restrict routes to only those active during the time. If as_geo, return a geotable with all the columns of feed.routes plus a geometry column of (Multi)LineStrings, each of which represents the corresponding route’s shape.

If as_geo and feed.shapes is not None, then return the routes as a geotable with a ‘geometry’ column of (Multi)LineStrings. The geotable will have a local UTM SRID if use_utm; otherwise it will have the WGS84 SRID. If as_geo and split_directions, then add the column direction_id and split each route into the union of its direction 0 shapes and the union of its direction 1 shapes. If as_geo and feed.shapes is None, then raise a ValueError.

gtfs_kit_polars.routes.map_routes(feed: Feed, route_ids: Iterable[str] | None = None, route_short_names: Iterable[str] | None = None, color_palette: Iterable[str] = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3'], *, show_stops: bool = False): Return a Folium map showing the given routes and (optionally) their stops. At least one of route_ids and route_short_names must be given. If both are given, then combine the two into a single set of routes. If any of the given route IDs are not found in the feed, then raise a ValueError.

gtfs_kit_polars.routes.routes_to_geojson(feed: Feed, route_ids: Iterable[str] | None = None, route_short_names: Iterable[str] | None = None, *, split_directions: bool = False, include_stops: bool = False) → dict

Return a GeoJSON FeatureCollection (in WGS84 coordinates) of MultiLineString features representing this Feed’s routes.

If an iterable of route IDs or route short names is given, then subset to the union of those routes, which could yield an empty FeatureCollection in case of all invalid route IDs and route short names. If include_stops, then include the route stops as Point features. If the Feed has no shapes, then raise a ValueError.

Module shapes

Functions about shapes.

gtfs_kit_polars.shapes.append_dist_to_shapes(feed: Feed) → Feed

Calculate and append the optional shape_dist_traveled field in feed.shapes in terms of the distance units feed.dist_units. Return the resulting Feed.

As a benchmark, using this function on this Portland feed produces a shape_dist_traveled column that differs by at most 0.016 km in absolute value from the original values.

gtfs_kit_polars.shapes.build_geometry_by_shape(feed: Feed, shape_ids: Iterable[str] | None = None, *, use_utm: bool = False) → dict: Return a dictionary of the form <shape ID> -> <Shapely LineString representing shape>. If the Feed has no shapes, then return the empty dictionary. If use_utm, then use local UTM coordinates; otherwise, use WGS84 coordinates.

gtfs_kit_polars.shapes.geometrize_shapes(shapes: DataFrame | LazyFrame, *, use_utm: bool = False) → GeoLazyFrame

Given a GTFS shapes table, convert it to a geotable of LineStrings and return the result, which will no longer have the columns 'shape_pt_sequence', 'shape_pt_lon', 'shape_pt_lat', and 'shape_dist_traveled'.

If use_utm, then use local UTM coordinates for the geometries.
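Conceptually, geometrizing a shapes table means grouping the points by shape ID, ordering them by shape_pt_sequence, and chaining the (lon, lat) pairs into one line per shape. The sketch below does this on plain dictionaries and returns coordinate lists rather than LineString geometries; the function name is hypothetical and no reprojection to UTM is attempted.

```python
from collections import defaultdict


def build_linestring_coords(rows):
    """Map each shape_id to its (lon, lat) points ordered by shape_pt_sequence."""
    points = defaultdict(list)
    for r in rows:
        points[r["shape_id"]].append(
            (r["shape_pt_sequence"], r["shape_pt_lon"], r["shape_pt_lat"])
        )
    # Sorting the tuples orders each shape's points by sequence number first.
    return {
        shape_id: [(lon, lat) for _, lon, lat in sorted(pts)]
        for shape_id, pts in points.items()
    }


rows = [
    {"shape_id": "a", "shape_pt_sequence": 2, "shape_pt_lon": 174.77, "shape_pt_lat": -36.85},
    {"shape_id": "b", "shape_pt_sequence": 1, "shape_pt_lon": 174.70, "shape_pt_lat": -36.90},
    {"shape_id": "a", "shape_pt_sequence": 1, "shape_pt_lon": 174.76, "shape_pt_lat": -36.84},
    {"shape_id": "b", "shape_pt_sequence": 2, "shape_pt_lon": 174.71, "shape_pt_lat": -36.91},
]
coords = build_linestring_coords(rows)
```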

gtfs_kit_polars.shapes.get_shapes(feed: Feed, *, as_geo: bool = False, use_utm: bool = False) → pl.LazyFrame | None: Get the shapes table for the given feed, which could be None. If as_geo, then return it as a geotable with a ‘geometry’ column of LineStrings and no ‘shape_pt_sequence’, ‘shape_pt_lon’, ‘shape_pt_lat’, ‘shape_dist_traveled’ columns. The geotable will have a UTM SRID if use_utm; otherwise it will have a WGS84 SRID.

gtfs_kit_polars.shapes.get_shapes_intersecting_geometry(feed: Feed, geometry: sg.base.BaseGeometry, shapes_g: st.GeoDataFrame | st.GeoLazyFrame = None, *, as_geo: bool = False) → st.GeoLazyFrame | None

If the Feed has no shapes, then return None. Otherwise, return the subset of feed.shapes that contains all shapes that intersect the given Shapely WGS84 geometry, e.g. a Polygon or LineString.

If as_geo, then return the shapes as a geotable. Specifying shapes_g will skip the first step of the algorithm, namely, geometrizing feed.shapes.

gtfs_kit_polars.shapes.shapes_to_geojson(feed: Feed, shape_ids: Iterable[str] | None = None) → dict

Return a GeoJSON FeatureCollection of LineString features representing feed.shapes. If the Feed has no shapes, then the features will be an empty list. The coordinate reference system is the default one for GeoJSON, namely WGS84.

If an iterable of shape IDs is given, then subset to those shapes. If the subset is empty, then return a FeatureCollection with an empty list of features.

gtfs_kit_polars.shapes.split_simple(shapes_g: GeoLazyFrame | GeoDataFrame) → GeoLazyFrame

Given a geotable of GTFS shapes of the form output by geometrize_shapes() with possibly non-WGS84 coordinates, split each non-simple LineString into large simple (non-self-intersecting) sub-LineStrings, and leave the simple LineStrings as is.

Return a geotable in the coordinates of shapes_g with the columns

'shape_id': GTFS shape ID for a LineString L
'subshape_id': a unique identifier of a simple sub-LineString S of L
'subshape_sequence': integer; indicates the order of S when joining up all simple sub-LineStrings to form L
'subshape_length_m': the length of S in meters
'cum_length_m': the length of S plus the lengths of the sub-LineStrings of L that come before S; in meters
'geometry': LineString geometry corresponding to S

Within each ‘shape_id’ group, the subshapes will be sorted increasingly by ‘subshape_sequence’.

Notes

Simplicity checks and splitting are done in local UTM coordinates. Converting back to original coordinates can introduce rounding errors and non-simplicities. So test this function with a shapes_g in local UTM coordinates.
By construction, for each given LineString L with simple sub-LineStrings S_i, we have the inequality

sum over i of length(S_i) <= length(L),

where the lengths are expressed in meters.

gtfs_kit_polars.shapes.split_simple_0(ls: LineString) → list[LineString]: Split the given LineString into simple sub-LineStrings by greedily building the segments from the curve points and binary search, checking for simplicity at every step.
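The greedy-plus-binary-search idea can be sketched in plain Python as follows. This is not the library's implementation: the names split_simple_sketch and is_simple are hypothetical, the simplicity test only detects proper crossings of non-adjacent segments (it ignores touching and overlapping segments, which Shapely would catch), and the choice to share an endpoint between consecutive pieces is an assumption. Binary search is valid here because a prefix of a simple polyline is itself simple, so the "still simple" predicate flips only once as the piece grows.

```python
def _orient(a, b, c):
    """Twice the signed area of triangle abc; positive if c lies left of the ray a -> b."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _cross(p1, p2, q1, q2):
    """True if segments p1p2 and q1q2 cross at a point interior to both."""
    return (
        _orient(q1, q2, p1) * _orient(q1, q2, p2) < 0
        and _orient(p1, p2, q1) * _orient(p1, p2, q2) < 0
    )


def is_simple(points):
    """Crude stand-in for a real simplicity test: no two non-adjacent segments properly cross."""
    segs = list(zip(points, points[1:]))
    return not any(
        _cross(*segs[i], *segs[j])
        for i in range(len(segs))
        for j in range(i + 2, len(segs))
    )


def split_simple_sketch(points):
    """Greedily cut a polyline (list of (x, y) points) into maximal simple pieces."""
    pieces = []
    start, n = 0, len(points)
    while start < n - 1:
        # A single segment is always simple, so lo is always a valid end index
        lo, hi = start + 1, n - 1
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if is_simple(points[start : mid + 1]):
                lo = mid
            else:
                hi = mid - 1
        pieces.append(points[start : lo + 1])
        start = lo  # assumption: the next piece starts where this one ends
    return pieces


# The last segment crosses the first one, so the polyline is cut in two
path = [(0, 0), (2, 0), (2, 2), (1, -1)]
print(split_simple_sketch(path))
# → [[(0, 0), (2, 0), (2, 2)], [(2, 2), (1, -1)]]
```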

gtfs_kit_polars.shapes.ungeometrize_shapes(shapes_g: DataFrame | LazyFrame) → LazyFrame

The inverse of geometrize_shapes().

If shapes_g is in UTM coordinates (has a UTM SRID), convert those coordinates back to WGS84 (EPSG:4326), which is the standard for a GTFS shapes table.

Module stop_times

Functions about stop times.

gtfs_kit_polars.stop_times.append_dist_to_stop_times(feed: Feed) → Feed

Calculate and append the optional shape_dist_traveled column in feed.stop_times in terms of the distance units feed.dist_units. Trips without shapes will have NaN distances. Return the resulting Feed. Uses feed.shapes, so if that is missing, then return the original feed.

This does not always give accurate results. The algorithm works as follows. Compute the shape_dist_traveled field by using Shapely to measure the distance of a stop along its trip LineString. If for a given trip this process produces a non-monotonically increasing, hence incorrect, list of (cumulative) distances, then fall back to estimating the distances as follows.

Set the first distance to 0, the last to the length of the trip shape, and leave the remaining ones computed above. Choose the longest increasing subsequence of that new set of distances and use them and their corresponding departure times to linearly interpolate the rest of the distances.
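The fallback step can be sketched in plain Python as follows. This is an illustration under assumptions, not the library's code: the function names are hypothetical, departure times are given in seconds, "increasing" is taken to mean strictly increasing, and when several longest increasing subsequences exist the sketch keeps whichever one the patience-sorting routine finds.

```python
from bisect import bisect_left


def longest_increasing_subsequence(values):
    """Return the indices of one longest strictly increasing subsequence."""
    tail_idx, tail_val = [], []  # best (smallest) tail for each subsequence length
    prev = [None] * len(values)
    for i, v in enumerate(values):
        k = bisect_left(tail_val, v)
        if k == len(tail_idx):
            tail_idx.append(i)
            tail_val.append(v)
        else:
            tail_idx[k] = i
            tail_val[k] = v
        prev[i] = tail_idx[k - 1] if k > 0 else None
    # Walk back from the tail of the longest subsequence
    out, i = [], tail_idx[-1] if tail_idx else None
    while i is not None:
        out.append(i)
        i = prev[i]
    return out[::-1]


def repair_distances(dists, times, shape_length):
    """Re-estimate non-monotonic distances by interpolating over departure times."""
    d = list(dists)
    d[0], d[-1] = 0.0, shape_length
    keep = longest_increasing_subsequence(d)
    result = list(d)
    for a, b in zip(keep, keep[1:]):
        for i in range(a + 1, b):
            span = times[b] - times[a]
            frac = (times[i] - times[a]) / span if span else 0.0
            result[i] = d[a] + frac * (d[b] - d[a])
    # Assumption: stops outside the kept range copy the nearest kept value
    for i in range(keep[0]):
        result[i] = d[keep[0]]
    for i in range(keep[-1] + 1, len(d)):
        result[i] = d[keep[-1]]
    return result


# The third stop was projected onto the wrong part of the shape (9.0)
print(repair_distances([0.0, 2.0, 9.0, 4.0, 9.8], [0, 60, 120, 180, 240], 10.0))
# → [0.0, 2.0, 3.0, 4.0, 10.0]
```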

gtfs_kit_polars.stop_times.get_start_and_end_times(feed: Feed, date: str | None = None) → tuple[str]: Return the first departure time and last arrival time (HH:MM:SS time strings) listed in feed.stop_times, respectively. Restrict to the given date (YYYYMMDD string) if specified.

gtfs_kit_polars.stop_times.get_stop_times(feed: Feed, date: str | None = None) → pl.LazyFrame: Return feed.stop_times. If a date (YYYYMMDD date string) is given, then subset the result to only those stop times with trips active on the date.

gtfs_kit_polars.stop_times.stop_times_to_geojson(feed: Feed, trip_ids: Iterable[str | None] = None) → dict

Return a GeoJSON FeatureCollection of Point features representing all the trip-stop pairs in feed.stop_times. The coordinate reference system is the default one for GeoJSON, namely WGS84.

For every trip, drop duplicate stop IDs within that trip. In particular, a looping trip will lack its final stop.

If an iterable of trip IDs is given, then subset to those trips, silently dropping invalid trip IDs.

Module stops

Functions about stops.

gtfs_kit_polars.stops.STOP_STYLE = {'color': '#fc8d62', 'fill': 'true', 'fillOpacity': 0.75, 'radius': 8, 'weight': 1}: Leaflet circleMarker parameters for mapping stops

gtfs_kit_polars.stops.build_geometry_by_stop(feed: Feed, stop_ids: Iterable[str] | None = None, *, use_utm: bool = False) → dict: Return a dictionary of the form <stop ID> -> <Shapely Point representing stop>.

gtfs_kit_polars.stops.build_stop_timetable(feed: Feed, stop_id: str, dates: list[str]) → pl.LazyFrame

Return a timetable for the given stop ID and dates (YYYYMMDD date strings).

Return a table whose columns are all those in feed.trips plus those in feed.stop_times plus 'date', and the stop IDs are restricted to the given stop ID. The result is sorted by date then departure time.

gtfs_kit_polars.stops.compute_stop_activity(feed: Feed, dates: list[str]) → pl.LazyFrame

Mark stops as active or inactive on the given dates (YYYYMMDD date strings). A stop is active on a given date if some trip that starts on the date visits the stop (possibly after midnight).

Return a table with the columns

stop_id
dates[0]: 1 if the stop has at least one trip visiting it on dates[0]; 0 otherwise
dates[1]: 1 if the stop has at least one trip visiting it on dates[1]; 0 otherwise
etc.
dates[-1]: 1 if the stop has at least one trip visiting it on dates[-1]; 0 otherwise

If all dates lie outside the Feed period, then return an empty table.

gtfs_kit_polars.stops.compute_stop_stats(feed: Feed, dates: list[str], stop_ids: list[str | None] = None, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) → pl.LazyFrame

Compute stats for all stops for the given dates (YYYYMMDD date strings). Optionally, restrict to the stop IDs given.

If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops. Use the headway start and end times to specify the time period for computing headway stats.

Return a table with the columns

'date'
'stop_id'
'direction_id': present if and only if split_directions
'num_routes': number of routes visiting the stop (in the given direction) on the date
'num_trips': number of trips visiting the stop (in the given direction) on the date
'max_headway': maximum of the durations (in minutes) between trip departures at the stop between headway_start_time and headway_end_time on the date
'min_headway': minimum of the durations (in minutes) mentioned above
'mean_headway': mean of the durations (in minutes) mentioned above
'start_time': earliest departure time of a trip from this stop on the date
'end_time': latest departure time of a trip from this stop on the date

Exclude dates with no active stops, which could yield the empty table.
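To make the headway columns concrete, here is a minimal plain-Python sketch of how the three headway stats could be derived from one stop's departure times. The function name is hypothetical, and treating the window bounds as inclusive is an assumption; the library may handle the boundaries and duplicate departures differently.

```python
def to_seconds(t):
    """Convert an HH:MM:SS string (HH may exceed 23) to seconds past midnight."""
    h, m, s = (int(x) for x in t.split(":"))
    return 3600 * h + 60 * m + s


def headway_stats(departure_times, start="07:00:00", end="19:00:00"):
    """Return max, min, and mean headway in minutes within the window, or None."""
    lo, hi = to_seconds(start), to_seconds(end)
    secs = sorted(s for s in map(to_seconds, departure_times) if lo <= s <= hi)
    gaps = [(b - a) / 60 for a, b in zip(secs, secs[1:])]
    if not gaps:
        return None  # fewer than two departures in the window
    return {
        "max_headway": max(gaps),
        "min_headway": min(gaps),
        "mean_headway": sum(gaps) / len(gaps),
    }


times = ["06:50:00", "07:00:00", "07:10:00", "07:40:00", "19:30:00"]
print(headway_stats(times))
# → {'max_headway': 30.0, 'min_headway': 10.0, 'mean_headway': 20.0}
```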

gtfs_kit_polars.stops.compute_stop_stats_0(stop_times_subset: DataFrame | LazyFrame, trip_subset: DataFrame | LazyFrame, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) → LazyFrame

Given a subset of a stop times Table and a subset of a trips Table, return a Table that provides summary stats about the stops in the inner join of the two Tables.

If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops. Use the headway start and end times to specify the time period for computing headway stats.

Return a Table with the columns

stop_id
direction_id: present if and only if split_directions
num_routes: number of routes visiting the stop (in the given direction)
num_trips: number of trips visiting the stop (in the given direction)
max_headway: maximum of the durations (in minutes) between trip departures at the stop between headway_start_time and headway_end_time
min_headway: minimum of the durations (in minutes) mentioned above
mean_headway: mean of the durations (in minutes) mentioned above
start_time: earliest departure time of a trip from this stop
end_time: latest departure time of a trip from this stop

Notes

If trip_subset is empty, then return an empty Table.
Raise a ValueError if split_directions and no non-null direction ID values present.

gtfs_kit_polars.stops.compute_stop_time_series(feed: Feed, dates: list[str], stop_ids: list[str | None] = None, num_minutes: int = 60, *, split_directions: bool = False) → pl.LazyFrame

Compute time series for the given stops (defaults to all stops in Feed) on the given dates (YYYYMMDD date strings) at the given num_minutes frequency. Return a long-format table with the columns

datetime: datetime object for the given date and frequency chunks
stop_id
direction_id: direction of route; present if and only if split_directions
num_trips: the number of trips that visit the stop in the time bin and have a nonnull departure time from the stop

Exclude dates that lie outside of the Feed’s date range. If all dates lie outside the Feed’s date range, then return an empty table.

If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops.

Notes

Stop times with null departure times are ignored, so the aggregate of num_trips across the day could be less than the num_trips column in compute_stop_stats_0()
All trip departure times are taken modulo 24 hours, so routes with trips that end past 23:59:59 will have all their stats wrap around to the early morning of the time series.
‘num_trips’ should be resampled by summing
Raise a ValueError if split_directions and no non-null direction ID values present
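The wrap-around note is easy to miss, so here is a small plain-Python sketch of the binning it describes. It is an illustration only: the function name and the dict output keyed by bin start time are assumptions, not the library's table format.

```python
from collections import Counter


def to_seconds(t):
    """Convert an HH:MM:SS string (HH may exceed 23) to seconds past midnight."""
    h, m, s = (int(x) for x in t.split(":"))
    return 3600 * h + 60 * m + s


def bin_departures(departure_times, num_minutes=60):
    """Count departures per time bin, taking times modulo 24 hours."""
    bin_secs = 60 * num_minutes
    counts = Counter()
    for t in departure_times:
        if t is None:
            continue  # null departure times are ignored
        start = (to_seconds(t) % 86400) // bin_secs * bin_secs
        h, rem = divmod(start, 3600)
        counts[f"{h:02d}:{rem // 60:02d}:00"] += 1
    return dict(counts)


# 24:10:00 and 25:59:59 wrap around to the early morning bins
print(bin_departures(["23:30:00", "24:10:00", "25:59:59", None, "00:05:00"]))
# → {'23:00:00': 1, '00:00:00': 2, '01:00:00': 1}
```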

gtfs_kit_polars.stops.compute_stop_time_series_0(stop_times_subset: DataFrame | LazyFrame, trips_subset: DataFrame | LazyFrame, num_minutes: int = 60, date_label: str = '20010101', *, split_directions: bool = False) → LazyFrame

Compute stop stats in a 24-hour time series form at the given num_minutes frequency for stops in the inner join of the given subset of stop times and trips.

If split_directions, then separate each stop’s stats by trip direction. Use the given YYYYMMDD date label as the date in the time series.

Return a long-format table with columns

datetime: datetime object for the given date and frequency chunks
stop_id
direction_id: direction of route; present if and only if split_directions
num_trips: the number of trips that visit the stop in the time bin and have a nonnull departure time from the stop

Notes

Stop times with null departure times are ignored, so the aggregate of num_trips across the day could be less than the num_trips column in compute_stop_stats_0()
All trip departure times are taken modulo 24 hours, so routes with trips that end past 23:59:59 will have all their stats wrap around to the early morning of the time series.
‘num_trips’ should be resampled by summing
If trips_subset is empty, then return an empty table
Raise a ValueError if split_directions and no non-null direction ID values present

gtfs_kit_polars.stops.geometrize_stops(stops: DataFrame | LazyFrame, *, use_utm: bool = False) → GeoDataFrame | GeoLazyFrame

Given a GTFS stops Table, convert it to a geotable with a “geometry” column of Points and a “srid” column with the (constant) srid of the geographic projection, e.g. ‘EPSG:4326’ for the WGS84 srid. Return the resulting geotable, which will no longer have the columns 'stop_lon' and 'stop_lat'.

If use_utm, then use local UTM coordinates for the geometries.

gtfs_kit_polars.stops.get_stops(feed: Feed, date: str | None = None, trip_ids: Iterable[str] | None = None, route_ids: Iterable[str] | None = None, *, in_stations: bool = False, as_geo: bool = False, use_utm: bool = False) → pl.LazyFrame: Return feed.stops. If a YYYYMMDD date string is given, then subset to stops active (visited by trips) on that date. If trip IDs are given, then subset further to stops visited by those trips. If route IDs are given, then ignore the trip IDs and subset further to stops visited by those routes. If in_stations, then subset further to stops in stations if station data is available. If as_geo, then return the result as a geotable with a ‘geometry’ column of points instead of ‘stop_lat’ and ‘stop_lon’ columns. The geotable will have a UTM SRID if use_utm and a WGS84 SRID otherwise.

gtfs_kit_polars.stops.get_stops_in_area(feed: Feed, area: st.GeoLazyFrame | st.GeoDataFrame) → st.GeoLazyFrame: Return the subset of feed.stops that contains all stops that intersect the given geotable of polygons.

gtfs_kit_polars.stops.map_stops(feed: Feed, stop_ids: Iterable[str], stop_style: dict = {'color': '#fc8d62', 'fill': 'true', 'fillOpacity': 0.75, 'radius': 8, 'weight': 1}): Return a Folium map showing the given stops of this Feed. If some of the given stop IDs are not found in the feed, then raise a ValueError.

gtfs_kit_polars.stops.stops_to_geojson(feed: Feed, stop_ids: Iterable[str | None] = None) → dict

Return a GeoJSON FeatureCollection of Point features representing all the stops in feed.stops. The coordinate reference system is the default one for GeoJSON, namely WGS84.

If an iterable of stop IDs is given, then subset to those stops.

gtfs_kit_polars.stops.ungeometrize_stops(stops_g: GeoDataFrame | GeoLazyFrame) → DataFrame | LazyFrame

The inverse of geometrize_stops().

If stops_g is in UTM coordinates, then convert those UTM coordinates back to WGS84 coordinates, which is the standard for a GTFS stops table.

Module trips

Functions about trips.

gtfs_kit_polars.trips.compute_busiest_date(feed: Feed, dates: list[str]) → str: Given a list of dates (YYYYMMDD date strings), return the first date that has the maximum number of active trips.

gtfs_kit_polars.trips.compute_trip_activity(feed: Feed, dates: list[str]) → pl.LazyFrame

Mark trips as active or inactive on the given dates (YYYYMMDD date strings). Return a table with the columns

'trip_id'
dates[0]: 1 if the trip is active on dates[0]; 0 otherwise
dates[1]: 1 if the trip is active on dates[1]; 0 otherwise
etc.
dates[-1]: 1 if the trip is active on dates[-1]; 0 otherwise

If dates is None or the empty list, then return an empty table.

gtfs_kit_polars.trips.compute_trip_stats(feed: Feed, route_ids: list[str | None] = None, *, compute_dist_from_shapes: bool = False) → pl.LazyFrame

Return a table with the following columns:

'trip_id'
'route_id'
'route_short_name'
'route_type'
'direction_id': null if missing from feed
'shape_id': null if missing from feed
'stop_pattern_name': output from name_stop_patterns()
'num_stops': number of stops on trip
'start_time': first departure time of the trip
'end_time': last departure time of the trip
'start_stop_id': stop ID of the first stop of the trip
'end_stop_id': stop ID of the last stop of the trip
'is_loop': True if the start and end stop are less than 400m apart and False otherwise
'distance': distance of the trip; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all null entries if feed.shapes is None
'duration': duration of the trip in hours
'speed': distance/duration

If feed.stop_times has a shape_dist_traveled column with at least one non-null value and compute_dist_from_shapes == False, then use that column to compute the distance column. Else if feed.shapes is not None, then compute the distance column using the shapes and Shapely. Otherwise, set the distances to null.

If route IDs are given, then restrict to trips on those routes.

Notes

Assume the following feed attributes are not None:
- feed.trips
- feed.routes
- feed.stop_times
- feed.shapes (optionally)
Calculating trip distances with compute_dist_from_shapes=True seems pretty accurate. For example, on this Portland feed, the trip distances calculated with compute_dist_from_shapes=True differ by at most 0.83 km from the original values calculated with compute_dist_from_shapes=False.

gtfs_kit_polars.trips.get_active_services(feed: Feed, date: str) → list[str]: Given a Feed and a date string in YYYYMMDD format, return the service IDs that are active on the date.
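For readers unfamiliar with how GTFS decides which services run on a date, the rule from the GTFS specification can be sketched in plain Python as follows: calendar supplies the weekly pattern within a date range, and calendar_dates adds (exception_type 1) or removes (exception_type 2) services on specific dates. The function name and the list-of-dicts inputs are assumptions for illustration; this is not the library's implementation.

```python
from datetime import datetime

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def active_services(calendar, calendar_dates, date):
    """Return the sorted service IDs active on the given YYYYMMDD date string."""
    weekday = WEEKDAYS[datetime.strptime(date, "%Y%m%d").weekday()]
    # YYYYMMDD strings compare correctly as plain strings
    active = {
        row["service_id"]
        for row in calendar
        if row["start_date"] <= date <= row["end_date"] and row[weekday] == 1
    }
    for row in calendar_dates:
        if row["date"] != date:
            continue
        if row["exception_type"] == 1:
            active.add(row["service_id"])
        elif row["exception_type"] == 2:
            active.discard(row["service_id"])
    return sorted(active)


weekdays_only = dict.fromkeys(WEEKDAYS[:5], 1) | dict.fromkeys(WEEKDAYS[5:], 0)
calendar = [
    {"service_id": "wk", "start_date": "20250101", "end_date": "20251231", **weekdays_only},
]
calendar_dates = [
    {"service_id": "wk", "date": "20250106", "exception_type": 2},
    {"service_id": "hol", "date": "20250106", "exception_type": 1},
]
print(active_services(calendar, calendar_dates, "20250106"))  # a Monday → ['hol']
print(active_services(calendar, calendar_dates, "20250107"))  # a Tuesday → ['wk']
```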

gtfs_kit_polars.trips.get_trips(feed: Feed, date: str | None = None, time: str | None = None, *, as_geo: bool = False, use_utm: bool = False) → pl.LazyFrame | st.GeoLazyFrame

Return feed.trips. If date (YYYYMMDD date string) is given then subset the result to trips that start on that date. If a time (HH:MM:SS string, possibly with HH > 23) is given in addition to a date, then further subset the result to trips in service at that time.

If as_geo and feed.shapes is not None, then return the trips as a geotable of LineStrings representing trip shapes. Use the local UTM CRS if use_utm; otherwise use the WGS84 CRS. If as_geo and feed.shapes is None, then raise a ValueError.

gtfs_kit_polars.trips.locate_trips(feed, date: str, times: list[str]) → LazyFrame

Return the positions of all trips active on the given date (YYYYMMDD date string) and times (HH:MM:SS time strings, possibly with HH > 23).

Return a table with the columns

'trip_id'
'shape_id'
'route_id'
'direction_id': null if feed.trips.direction_id is missing
'time'
'rel_dist': number between 0 (start) and 1 (end) indicating the relative distance of the trip along its path
'lon': longitude of trip at given time
'lat': latitude of trip at given time

Assume feed.stop_times has an accurate shape_dist_traveled column.
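The 'rel_dist' column depends on that shape_dist_traveled assumption. One plausible way to compute it, sketched in plain Python, is to interpolate the distance linearly between the two stops that bracket the query time and divide by the trip's final distance. The function name and the choice of linear interpolation over departure times are assumptions, not confirmed details of the library.

```python
from bisect import bisect_right


def to_seconds(t):
    """Convert an HH:MM:SS string (HH may exceed 23) to seconds past midnight."""
    h, m, s = (int(x) for x in t.split(":"))
    return 3600 * h + 60 * m + s


def relative_distance(stop_times, time):
    """
    stop_times: list of (departure time, shape_dist_traveled) pairs in stop order.
    Return the trip's relative distance in [0, 1] at the given time,
    or None if the trip is not in service then.
    """
    secs = [to_seconds(t) for t, _ in stop_times]
    dists = [d for _, d in stop_times]
    t = to_seconds(time)
    if not secs[0] <= t <= secs[-1]:
        return None
    i = bisect_right(secs, t) - 1  # last stop departed at or before t
    if i == len(secs) - 1:
        d = dists[-1]
    else:
        frac = (t - secs[i]) / (secs[i + 1] - secs[i])
        d = dists[i] + frac * (dists[i + 1] - dists[i])
    return d / dists[-1]  # assumes a positive total distance


trip = [("23:50:00", 0.0), ("24:00:00", 2.0), ("24:20:00", 8.0)]
print(relative_distance(trip, "24:10:00"))  # → 0.625
```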

gtfs_kit_polars.trips.map_trips(feed: Feed, trip_ids: Iterable[str], color_palette: list[str] = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3'], *, show_stops: bool = False, show_direction: bool = True): Return a Folium map showing the given trips. Silently drop invalid trip IDs given. If show_stops, then plot the trip stops too. If show_direction, then use the Folium plugin PolyLineTextPath to draw arrows on each trip polyline indicating its direction of travel; this fails to work in some browsers, such as Brave 0.68.132.

gtfs_kit_polars.trips.name_stop_patterns(feed: Feed) → pl.LazyFrame

For each (route ID, direction ID) pair, find the distinct stop patterns of its trips, and assign them each an integer pattern rank based on the stop pattern’s frequency rank, where 1 is the most frequent stop pattern, 2 is the second most frequent, etc. Return the table feed.trips with the additional column stop_pattern_name, which equals the trip’s ‘direction_id’ concatenated with a dash and its stop pattern rank.

If feed.trips has no ‘direction_id’ column, then temporarily create one equal to all zeros, proceed as above, then delete the column.
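The naming scheme can be sketched in plain Python as follows. The function name and input shapes are hypothetical, and breaking frequency ties by first appearance is an assumption; the library may order tied patterns differently.

```python
from collections import Counter, defaultdict


def name_stop_patterns_sketch(trips, stops_by_trip):
    """
    trips: list of dicts with keys trip_id, route_id, direction_id.
    stops_by_trip: dict mapping trip ID to its ordered list of stop IDs.
    Return a dict mapping trip ID to a name '<direction_id>-<pattern rank>'.
    """
    # Count how often each stop pattern occurs per (route, direction)
    counts = defaultdict(Counter)
    for trip in trips:
        key = (trip["route_id"], trip["direction_id"])
        counts[key][tuple(stops_by_trip[trip["trip_id"]])] += 1

    # Rank patterns by frequency; rank 1 is the most frequent
    ranks = {}
    for key, counter in counts.items():
        for rank, (pattern, _) in enumerate(counter.most_common(), start=1):
            ranks[key, pattern] = rank

    names = {}
    for trip in trips:
        key = (trip["route_id"], trip["direction_id"])
        pattern = tuple(stops_by_trip[trip["trip_id"]])
        names[trip["trip_id"]] = f"{trip['direction_id']}-{ranks[key, pattern]}"
    return names


trips = [
    {"trip_id": "t1", "route_id": "r1", "direction_id": 0},
    {"trip_id": "t2", "route_id": "r1", "direction_id": 0},
    {"trip_id": "t3", "route_id": "r1", "direction_id": 0},
    {"trip_id": "t4", "route_id": "r1", "direction_id": 1},
]
stops_by_trip = {
    "t1": ["a", "b", "c"],
    "t2": ["a", "b", "c"],
    "t3": ["a", "c"],
    "t4": ["c", "b", "a"],
}
print(name_stop_patterns_sketch(trips, stops_by_trip))
# → {'t1': '0-1', 't2': '0-1', 't3': '0-2', 't4': '1-1'}
```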

gtfs_kit_polars.trips.trips_to_geojson(feed: Feed, trip_ids: Iterable[str] | None = None, *, include_stops: bool = False) → dict

Return a GeoJSON FeatureCollection (in WGS84 coordinates) of LineString features representing all the Feed’s trips.

If include_stops, then include the trip stops as Point features. If an iterable of trip IDs is given, then subset to those trips, which could yield an empty FeatureCollection in case all the trip IDs are invalid.

Module miscellany

Functions about miscellany.

gtfs_kit_polars.miscellany.assess_quality(feed: Feed) → pl.LazyFrame

Return a table of various feed indicators and values, e.g. number of trips missing shapes.

The resulting table has the columns

'indicator': string; name of an indicator, e.g. ‘num_routes’
'value': value of the indicator, e.g. 27

This function is odd but useful for seeing roughly how broken a feed is. This function is not a GTFS validator.

gtfs_kit_polars.miscellany.compute_bounds(feed: Feed, stop_ids: list[str] | None = None) → list: Return the bounding box [min longitude, min latitude, max longitude, max latitude] of the given Feed’s stops or of the subset of stops specified by the given stop IDs.
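The element order of the bounding box, longitude before latitude, is the same as in a GeoJSON bbox. A tiny plain-Python sketch over hypothetical (stop_lon, stop_lat) pairs, not the library's code:

```python
def compute_bounds_sketch(stops):
    """stops: iterable of (stop_lon, stop_lat) pairs in WGS84."""
    lons, lats = zip(*stops)
    return [min(lons), min(lats), max(lons), max(lats)]


stops = [(174.76, -36.85), (174.78, -36.90), (174.70, -36.88)]
print(compute_bounds_sketch(stops))
# → [174.7, -36.9, 174.78, -36.85]
```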

gtfs_kit_polars.miscellany.compute_centroid(feed: Feed, stop_ids: list[str] | None = None) → sg.Point: Return the centroid of the convex hull of the given Feed’s stops or subset thereof specified by the given stop IDs.

gtfs_kit_polars.miscellany.compute_convex_hull(feed: Feed, stop_ids: list[str] | None = None) → sg.Polygon: Return the convex hull in WGS84 coordinates of the given Feed’s stops or subset thereof specified by the given stop IDs.

gtfs_kit_polars.miscellany.compute_network_stats(feed: Feed, dates: list[str], trip_stats: pl.LazyFrame | pl.DataFrame | None = None, *, split_route_types=False) → pl.LazyFrame

Compute some network stats for the given subset of trip stats, which defaults to feed.compute_trip_stats(), and for the given dates (YYYYMMDD date strings).

Return a table with the columns

'date'
'route_type' (optional): present if and only if split_route_types
'num_stops': number of stops active on the date
'num_routes': number of routes active on the date
'num_trips': number of trips that start on the date
'num_trip_starts': number of trips with nonnull start times on the date
'num_trip_ends': number of trips with nonnull start times and nonnull end times on the date, ignoring trips that end after 23:59:59 on the date
'peak_num_trips': maximum number of simultaneous trips in service on the date
'peak_start_time': start time of first longest period during which the peak number of trips occurs on the date
'peak_end_time': end time of first longest period during which the peak number of trips occurs on the date
'service_distance': sum of the service distances for the active routes on the date; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_duration': sum of the service durations for the active routes on the date; measured in hours
'service_speed': service_distance/service_duration on the date

Exclude dates with no active stops, which could yield the empty table.

The route and trip stats for date d contain stats for trips that start on date d only and ignore trips that start on date d-1 and end on date d.

Notes

If you’ve already computed trip stats in your workflow, then passing it into this function will speed it up.
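The columns 'peak_num_trips', 'peak_start_time', and 'peak_end_time' above can be understood through a sweep over trip start and end events. The following plain-Python sketch is one way to compute them, not the library's implementation: the function names are hypothetical, and treating a trip as in service on the half-open interval from its start time to its end time is an assumption.

```python
def to_seconds(t):
    """Convert an HH:MM:SS string (HH may exceed 23) to seconds past midnight."""
    h, m, s = (int(x) for x in t.split(":"))
    return 3600 * h + 60 * m + s


def to_timestr(secs):
    h, rem = divmod(secs, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


def peak_period(trips):
    """
    trips: list of (start_time, end_time) HH:MM:SS pairs.
    Return (peak count, start, end) of the first longest period at the peak.
    """
    deltas = {}
    for start, end in trips:
        s, e = to_seconds(start), to_seconds(end)
        deltas[s] = deltas.get(s, 0) + 1
        deltas[e] = deltas.get(e, 0) - 1
    times = sorted(deltas)
    counts, running = [], 0
    for t in times:
        running += deltas[t]
        counts.append(running)  # trips in service from t until the next event
    peak = max(counts, default=0)
    if peak == 0:
        return 0, None, None
    best, i = None, 0
    while i < len(times):
        if counts[i] != peak:
            i += 1
            continue
        j = i
        while counts[j] == peak:  # ends because the final count is always 0
            j += 1
        # Strict inequality keeps the first of several equally long periods
        if best is None or times[j] - times[i] > best[1] - best[0]:
            best = (times[i], times[j])
        i = j
    return peak, to_timestr(best[0]), to_timestr(best[1])


trips = [
    ("08:00:00", "09:00:00"),
    ("08:30:00", "09:30:00"),
    ("08:45:00", "08:50:00"),
    ("23:30:00", "24:30:00"),
]
print(peak_period(trips))  # → (3, '08:45:00', '08:50:00')
```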

gtfs_kit_polars.miscellany.compute_network_stats_0(stop_times: DataFrame | LazyFrame, trip_stats: DataFrame | LazyFrame, *, split_route_types=False) → LazyFrame

Compute some network stats for the trips common to the given subset of stop times and given subset of trip stats of the form output by the function trips.compute_trip_stats().

Return a table with the columns

'route_type' (optional): present if and only if split_route_types
'num_stops': number of stops active on the date
'num_routes': number of routes active on the date
'num_trips': number of trips that start on the date
'num_trip_starts': number of trips with nonnull start times on the date
'num_trip_ends': number of trips with nonnull start times and nonnull end times on the date, ignoring trips that end after 23:59:59 on the date
'peak_num_trips': maximum number of simultaneous trips in service on the date
'peak_start_time': start time of first longest period during which the peak number of trips occurs on the date
'peak_end_time': end time of first longest period during which the peak number of trips occurs on the date
'service_distance': sum of the service distances for the active routes on the date; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_duration': sum of the service durations for the active routes on the date; measured in hours
'service_speed': service_distance/service_duration on the date

Exclude dates with no active stops, which could yield the empty table.

Helper function for compute_network_stats().

gtfs_kit_polars.miscellany.compute_network_time_series(feed: Feed, dates: list[str], trip_stats: pl.LazyFrame | pl.DataFrame | None = None, num_minutes: int = 60, *, split_route_types: bool = False) → pl.LazyFrame

Compute some network stats in time series form for the given dates (YYYYMMDD date strings) and trip stats, which defaults to feed.compute_trip_stats(). Use num_minutes to specify the frequency of the resulting time series in minutes, e.g. 5. If split_route_types, then split stats by route type; otherwise don’t.

Return a long-form time series table with the columns

'datetime': datetime object
'route_type': integer; present if and only if split_route_types
'num_trips': number of trips in service during the time period
'num_trip_starts': number of trips starting during the time period
'num_trip_ends': number of trips ending during the time period, ignoring the trips that end past midnight
'service_distance': distance traveled during the time period by all trips active during the time period; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_duration': duration traveled during the time period by all trips active during the time period; measured in hours
'service_speed': service_distance/service_duration when defined; 0 otherwise

Exclude dates that lie outside of the Feed’s date range. If all the dates given lie outside of the Feed’s date range, then return an empty table with the specified columns.

Notes

If you’ve already computed trip stats in your workflow, then passing it into this function will speed it up.

gtfs_kit_polars.miscellany.compute_screen_line_counts(feed: Feed, screen_lines: st.GeoLazyFrame | st.GeoDataFrame, dates: list[str], *, include_diagnostics: bool = False) → pl.LazyFrame

Find all the Feed trips active on the given YYYYMMDD dates that intersect the given screen lines (LineStrings) with optional ID column screen_line_id. Behind the scenes, use simple sub-LineStrings of the feed to compute screen line intersections. Using them instead of the Feed shapes avoids miscounting intersections in the case of non-simple (self-intersecting) shapes.

For each trip crossing a screen line, compute the crossing time, crossing direction, etc. and return a table of results with the columns

'date': the YYYYMMDD date string given
'screen_line_id': ID of a screen line
'trip_id': ID of a trip that crosses the screen line
'shape_id': ID of the trip’s shape
'direction_id': GTFS direction of trip
'route_id'
'route_short_name'
'route_type'
'crossing_direction': 1 or -1; 1 indicates trip travel from the left side to the right side of the screen line; -1 indicates trip travel in the opposite direction
'crossing_time': time, according to the GTFS schedule, that the trip crosses the screen line
'crossing_dist_m': distance along the trip shape (not subshape) of the crossing; in meters

If include_diagnostics, then include the following extra columns for diagnostic purposes.

'subshape_id': ID of the simple sub-LineString S of the trip’s shape that crosses the screen line
'subshape_length_m': length of S in meters
'from_departure_time': departure time of the trip from the last stop before the screen line
'to_departure_time': departure time of the trip from the first stop after the screen line
'subshape_dist_frac': proportion of S’s length at which the screen line intersects S

Notes:

Assume the Feed’s stop times table has an accurate shape_dist_traveled column.
Assume that trips travel in the same direction as their shapes, an assumption that is part of the GTFS.
Assume that the screen line is straight and simple.
The algorithm works as follows:
1. Find the Feed’s simple subshapes (computed via shapes.split_simple()) that intersect the screen lines.
2. For each such subshape and screen line, compute the intersection points, the distance of each point along the subshape, aka the crossing distance, and the orientation of the screen line relative to the subshape.
3. Restrict to trips active on the given dates and for each trip associated to an intersecting subshape above, interpolate a trip stop time for the intersection point using the crossing distance, subshape length, cumulative subshape length, and trip stop times.
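Steps 2 and 3 can be sketched in plain Python as follows. All function names are hypothetical and several details are assumptions: "left" and "right" are taken relative to the screen line's own direction from its first point to its second, the crossing distance along the full shape is derived from the 'cum_length_m' and 'subshape_length_m' definitions given for split_simple(), stop distances are assumed to be in meters, and times are interpolated linearly between the bracketing stops.

```python
def crossing_direction(screen_line, segment):
    """
    screen_line, segment: pairs of (x, y) points in a planar CRS.
    Return 1 if the segment travels from the left side of the screen line
    to its right side, -1 for the opposite, and 0 if it does not cleanly cross.
    """
    (ax, ay), (bx, by) = screen_line

    def side(p):
        # Positive if p lies to the left of the ray a -> b
        return (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)

    s_from, s_to = side(segment[0]), side(segment[1])
    if s_from > 0 > s_to:
        return 1
    if s_from < 0 < s_to:
        return -1
    return 0


def crossing_dist_along_shape(cum_length_m, subshape_length_m, subshape_dist_frac):
    """Length of the shape before subshape S plus the part of S before the crossing."""
    return (cum_length_m - subshape_length_m) + subshape_dist_frac * subshape_length_m


def crossing_time(stop_times, crossing_dist):
    """
    stop_times: list of (departure seconds, distance along shape) in stop order.
    Return the interpolated crossing time in seconds, or None if not bracketed.
    """
    for (t0, d0), (t1, d1) in zip(stop_times, stop_times[1:]):
        if d0 <= crossing_dist <= d1 and d1 > d0:
            frac = (crossing_dist - d0) / (d1 - d0)
            return t0 + frac * (t1 - t0)
    return None


# A north-pointing screen line crossed by an eastbound trip segment
print(crossing_direction(((0, -1), (0, 1)), ((-1, 0), (1, 0))))  # → 1

dist = crossing_dist_along_shape(1500.0, 1000.0, 0.5)  # → 1000.0
print(crossing_time([(28800, 0.0), (29400, 800.0), (30000, 1600.0)], dist))  # → 29550.0
```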

gtfs_kit_polars.miscellany.convert_dist(feed: Feed, new_dist_units: str) → Feed: Convert the distances recorded in the shape_dist_traveled columns of the given Feed to the given distance units. New distance units must lie in constants.DIST_UNITS. Return the resulting Feed.
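The conversion between the four units in constants.DIST_UNITS amounts to a ratio of factors. A minimal plain-Python sketch using the standard definitions of the foot and mile in meters (the table and function names are hypothetical, not the library's):

```python
# Meters per unit, using the international foot and mile
METERS_PER_UNIT = {"ft": 0.3048, "mi": 1609.344, "m": 1.0, "km": 1000.0}


def convert_dist_value(value, from_units, to_units):
    """Convert a single distance value between units in METERS_PER_UNIT."""
    for units in (from_units, to_units):
        if units not in METERS_PER_UNIT:
            raise ValueError(f"Distance units must lie in {list(METERS_PER_UNIT)}")
    return value * METERS_PER_UNIT[from_units] / METERS_PER_UNIT[to_units]


print(convert_dist_value(1500, "m", "km"))  # → 1.5
```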

gtfs_kit_polars.miscellany.create_shapes(feed: Feed, *, all_trips: bool = False) → Feed

Given a feed, create a shape for every trip that is missing a shape ID. Do this by connecting the stops on the trip with straight lines. Return the resulting feed which has updated shapes and trips tables.

If all_trips, then create new shapes for all trips by connecting stops, and remove the old shapes.

gtfs_kit_polars.miscellany.describe(feed: Feed, sample_date: str | None = None) → pl.LazyFrame

Return a table of various feed indicators and values, e.g. number of routes. Specialize some of those indicators to the given YYYYMMDD sample date string, e.g. number of routes active on the date.

The resulting table has the columns

'indicator': string; name of an indicator, e.g. ‘num_routes’
'value': value of the indicator, e.g. 27

gtfs_kit_polars.miscellany.list_fields(feed: Feed, table_name: str | None = None) → pl.LazyFrame

Return a table summarizing all GTFS tables in the given feed or in the given table name if specified.

The resulting table has the following columns.

'table': name of the GTFS table, e.g. 'stops'
'column': name of a column in the table, e.g. 'stop_id'
'num_values': number of values in the column
'num_nonnull_values': number of nonnull values in the column
'num_unique_values': number of unique values in the column, excluding null values
'min_value': minimum value in the column
'max_value': maximum value in the column

If the table is not in the feed, then return an empty table. If the table is not valid, then raise a ValueError.

gtfs_kit_polars.miscellany.restrict_to_agencies(feed: Feed, agency_ids: list[str]) → Feed: Build a new feed by restricting this one via restrict_to_routes() and the routes with the given agency IDs. Return the resulting feed.

gtfs_kit_polars.miscellany.restrict_to_area(feed: Feed, area: st.GeoDataFrame | st.GeoLazyFrame) → Feed: Build a new feed by restricting this one via restrict_to_trips() and the trips that have at least one stop intersecting the given geotable of polygons, which can be in any coordinate reference system. Return the resulting feed.

gtfs_kit_polars.miscellany.restrict_to_dates(feed: Feed, dates: list[str]) → Feed: Build a new feed by restricting this one via restrict_to_trips() and the trips active on at least one of the given dates (YYYYMMDD strings). Return the resulting feed.

gtfs_kit_polars.miscellany.restrict_to_routes(feed: Feed, route_ids: list[str]) → Feed: Build a new feed by restricting this one via restrict_to_trips() and the trips with the given route IDs. Return the resulting feed.

gtfs_kit_polars.miscellany.restrict_to_trips(feed: Feed, trip_ids: list[str]) → Feed

Build a new feed by restricting this one to only the stops, trips, shapes, etc. used by the trips of the given IDs. Return the resulting feed.

If no valid trip IDs are given, which includes the case of the empty list, then the resulting feed will have all empty non-agency tables.

This function is probably more useful internally than externally.

Module feed

This module defines a Feed class to represent GTFS feeds.

The Feed class also has heaps of methods: a method to compute route stats, a method to compute screen line counts, validation methods, etc. To ease testing and reading, almost all of these methods are defined in other modules and grouped by theme (routes.py, stops.py, etc.). These methods, or rather functions that operate on feeds, are then imported within the Feed class. This separation of methods unfortunately slightly messes up the Feed class documentation generated by Sphinx, introducing an extra leading feed parameter in the method signatures. Ignore that extra parameter; it refers to the Feed instance, usually called self and usually hidden automatically by Sphinx.

Bases: object

An instance of this class represents a GTFS feed, where GTFS tables are stored as Polars LazyFrames and are coerced to such upon initialization and attribute updates. The methods assume the instance represents a valid GTFS feed but offer no validation, because that’s complex and already done by dedicated libraries. So unless you know what you’re doing, use the Canonical GTFS Validator before seriously analyzing a feed with this class.

GTFS table instance attributes:

agency
stops
routes
trips
stop_times
calendar
calendar_dates
fare_attributes
fare_rules
shapes
frequencies
transfers
feed_info
attributions

Metadata attributes:

dist_units: a string in constants.DIST_UNITS; specifies the distance units of the shape_dist_traveled column values, if present; also affects whether to display trip and route stats in metric or imperial units
unzip_dir: temporary file directory for unzipping feeds read from ZIP file

aggregate_routes(by: str = 'route_short_name', route_id_prefix: str = 'route_') → Feed

Aggregate routes by route short name, say, and assign new route IDs using the given prefix.

More specifically, create new route IDs with the function build_aggregate_routes_table() and the parameters by and route_id_prefix and update the old route IDs to the new ones in all the relevant Feed tables. Return the resulting Feed.

aggregate_stops(by: str = 'stop_code', stop_id_prefix: str = 'stop_') → Feed: Aggregate stops by the column by and assign new stop IDs using the given prefix. Update IDs in stops, stop_times, and transfers. Return the resulting Feed.

append_dist_to_shapes() → Feed

Calculate and append the optional shape_dist_traveled field in feed.shapes in terms of the distance units feed.dist_units. Return the resulting Feed.

As a benchmark, using this function on this Portland feed produces a shape_dist_traveled column that differs by at most 0.016 km in absolute value from the original values.

append_dist_to_stop_times() → Feed

Calculate and append the optional shape_dist_traveled column in feed.stop_times in terms of the distance units feed.dist_units. Trips without shapes will have NaN distances. Return the resulting Feed. Uses feed.shapes, so if that is missing, then return the original feed.

This does not always give accurate results. The algorithm works as follows. Compute the shape_dist_traveled field by using Shapely to measure the distance of a stop along its trip LineString. If for a given trip this process produces a non-monotonically increasing, hence incorrect, list of (cumulative) distances, then fall back to estimating the distances as follows.

Set the first distance to 0, the last to the length of the trip shape, and leave the remaining ones computed above. Choose the longest increasing subsequence of that new set of distances and use them and their corresponding departure times to linearly interpolate the rest of the distances.
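The fallback described above can be sketched in plain Python. This is an illustration only and not the library's implementation: the helper names longest_increasing_subsequence and repair_distances are hypothetical, times are assumed to be strictly increasing seconds past midnight, and the sketch pins both endpoints and only considers interior distances strictly between 0 and the shape length, which is a simplification.

```python
from bisect import bisect_left


def longest_increasing_subsequence(values: list[float]) -> list[int]:
    """Return the indices of one longest strictly increasing subsequence."""
    tails: list[float] = []    # tails[k]: smallest tail of an increasing run of length k + 1
    tail_idx: list[int] = []   # index in `values` of each tail
    prev = [-1] * len(values)  # back-pointers used to rebuild the subsequence
    for i, v in enumerate(values):
        k = bisect_left(tails, v)
        if k == len(tails):
            tails.append(v)
            tail_idx.append(i)
        else:
            tails[k] = v
            tail_idx[k] = i
        prev[i] = tail_idx[k - 1] if k > 0 else -1
    result = []
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:
        result.append(i)
        i = prev[i]
    return result[::-1]


def repair_distances(
    dists: list[float], times: list[float], shape_length: float
) -> list[float]:
    """Hypothetical helper: repair a non-monotonic list of cumulative distances."""
    d = list(dists)
    d[0], d[-1] = 0.0, shape_length
    # Interior candidates; the two endpoints are always kept (simplification)
    interior = [i for i in range(1, len(d) - 1) if 0.0 < d[i] < shape_length]
    lis = longest_increasing_subsequence([d[i] for i in interior])
    keep = [0] + [interior[j] for j in lis] + [len(d) - 1]
    out = d[:]
    # Linearly interpolate the discarded distances over time between kept points
    for a, b in zip(keep, keep[1:]):
        for i in range(a + 1, b):
            frac = (times[i] - times[a]) / (times[b] - times[a])
            out[i] = d[a] + frac * (d[b] - d[a])
    return out
```

For example, the distances [0, 5, 3, 4, 9] on a shape of length 10 keep the points 0, 3, 4, 10 and re-estimate the second distance from its departure time.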

assess_quality() → pl.LazyFrame

Return a table of various feed indicators and values, e.g. number of trips missing shapes.

The resulting table has the columns

'indicator': string; name of an indicator, e.g. ‘num_routes’
'value': value of the indicator, e.g. 27

This function is odd but useful for seeing roughly how broken a feed is. This function is not a GTFS validator.

build_geometry_by_shape(shape_ids: Iterable[str] | None = None, *, use_utm: bool = False) → dict: Return a dictionary of the form <shape ID> -> <Shapely LineString representing shape>. If the Feed has no shapes, then return the empty dictionary. If use_utm, then use local UTM coordinates; otherwise, use WGS84 coordinates.

build_geometry_by_stop(stop_ids: Iterable[str] | None = None, *, use_utm: bool = False) → dict: Return a dictionary of the form <stop ID> -> <Shapely Point representing stop>.

build_route_timetable(route_id: str, dates: list[str]) → pl.LazyFrame

Return a timetable for the given route and dates (YYYYMMDD date strings).

Return a table whose columns are all those in feed.trips plus those in feed.stop_times plus 'date'. The trip IDs are restricted to the given route ID. The result is sorted first by date and then by grouping by trip ID and sorting the groups by their first departure time.

Skip dates outside of the Feed’s dates.

If there is no route activity on the given dates, then return an empty table.

build_stop_timetable(stop_id: str, dates: list[str]) → pl.LazyFrame

Return a timetable for the given stop ID and dates (YYYYMMDD date strings).

Return a table whose columns are all those in feed.trips plus those in feed.stop_times plus 'date', and the stop IDs are restricted to the given stop ID. The result is sorted by date then departure time.

clean() → Feed

Apply the following functions to the given Feed in order and return the resulting Feed.

clean_ids()
clean_times()
clean_route_short_names()
drop_zombies()

clean_ids() → Feed: In the given Feed, strip whitespace from all string IDs and then replace every remaining whitespace chunk with an underscore. Return the resulting Feed.
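As a rough illustration of the ID cleaning rule above, here is a minimal sketch on a single string. The function name clean_id is hypothetical and the library applies the rule to table columns rather than to individual strings.

```python
import re


def clean_id(value: str) -> str:
    """Strip outer whitespace, then turn each remaining whitespace run into '_'."""
    return re.sub(r"\s+", "_", value.strip())
```

So the ID "  route 1\t a " would become "route_1_a" under this rule.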

clean_route_short_names() → Feed: In feed.routes, assign ‘n/a’ to missing route short names and strip whitespace from route short names. Then disambiguate each route short name that is duplicated by appending ‘-’ and its route ID. Return the resulting Feed.
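The renaming rule above can be sketched on plain records as follows. This is an assumption-laden illustration, not the library's code: the function name clean_short_names is hypothetical, routes are modeled as dictionaries, and only None counts as a missing short name.

```python
from collections import Counter


def clean_short_names(routes: list[dict]) -> list[dict]:
    """`routes` holds records with 'route_id' and 'route_short_name' keys."""
    cleaned = []
    for r in routes:
        name = r.get("route_short_name")
        # Missing names become 'n/a'; others lose surrounding whitespace
        name = "n/a" if name is None else name.strip()
        cleaned.append({**r, "route_short_name": name})
    counts = Counter(r["route_short_name"] for r in cleaned)
    # Disambiguate duplicated names by appending '-' and the route ID
    return [
        {**r, "route_short_name": f"{r['route_short_name']}-{r['route_id']}"}
        if counts[r["route_short_name"]] > 1
        else r
        for r in cleaned
    ]
```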

clean_times() → Feed: In the given Feed, convert H:MM:SS time strings to HH:MM:SS time strings to make sorting by time work as expected. Return the resulting Feed.
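To see why the padding matters, note that plain string sorting puts "10:00:00" before "7:00:00". A minimal sketch of the padding on one non-null time string follows; the function name pad_time is hypothetical.

```python
def pad_time(time_str: str) -> str:
    """Left-pad the hour of an H:MM:SS string so that it reads HH:MM:SS."""
    hours, rest = time_str.strip().split(":", 1)
    return f"{hours.zfill(2)}:{rest}"
```

Hours of 24 or more, such as "25:10:00", already have two digits and are left unchanged.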

close_unzip_dir() → None: Close this Feed’s temporary unzip directory, if it has one, which was created by reading the feed from a ZIP file. Frees memory.

compute_bounds(stop_ids: list[str] | None = None) → list: Return the bounding box [min longitude, min latitude, max longitude, max latitude] of the given Feed’s stops or of the subset of stops specified by the given stop IDs.

compute_busiest_date(dates: list[str]) → str: Given a list of dates (YYYYMMDD date strings), return the first date that has the maximum number of active trips.

compute_centroid(stop_ids: list[str] | None = None) → sg.Point: Return the centroid of the convex hull of the given Feed’s stops or subset thereof specified by the given stop IDs.

compute_convex_hull(stop_ids: list[str] | None = None) → sg.Polygon: Return the convex hull in WGS84 coordinates of the given Feed’s stops or subset thereof specified by the given stop IDs.

compute_network_stats(dates: list[str], trip_stats: pl.LazyFrame | pl.DataFrame | None = None, *, split_route_types=False) → pl.LazyFrame

Compute some network stats for the given subset of trip stats, which defaults to feed.compute_trip_stats(), and for the given dates (YYYYMMDD date strings).

Return a table with the columns

'date'
'route_type' (optional): present if and only if split_route_types
'num_stops': number of stops active on the date
'num_routes': number of routes active on the date
'num_trips': number of trips that start on the date
'num_trip_starts': number of trips with nonnull start times on the date
'num_trip_ends': number of trips with nonnull start times and nonnull end times on the date, ignoring trips that end after 23:59:59 on the date
'peak_num_trips': maximum number of simultaneous trips in service on the date
'peak_start_time': start time of first longest period during which the peak number of trips occurs on the date
'peak_end_time': end time of first longest period during which the peak number of trips occurs on the date
'service_distance': sum of the service distances for the active routes on the date; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_duration': sum of the service durations for the active routes on the date; measured in hours
'service_speed': service_distance/service_duration on the date

Exclude dates with no active stops, which could yield the empty table.

The route and trip stats for date d contain stats for trips that start on date d only and ignore trips that start on date d-1 and end on date d.

Notes

If you’ve already computed trip stats in your workflow, then passing it into this function will speed it up.
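The peak_num_trips, peak_start_time, and peak_end_time columns described above can be understood as the result of a sweep over trip start and end events. The following sketch is one way to compute them and is not taken from the library: the names to_seconds and peak_trips are hypothetical, trips are assumed to be nonempty with positive durations, and a trip ending at time t is assumed not to overlap a trip starting at t.

```python
def to_seconds(t: str) -> int:
    """Convert an HH:MM:SS string, possibly with HH > 24, to seconds."""
    h, m, s = map(int, t.split(":"))
    return 3600 * h + 60 * m + s


def peak_trips(trips: list[tuple[str, str]]) -> tuple[int, int, int]:
    """Return (peak count, peak start, peak end), with times in seconds."""
    events = []
    for start, end in trips:
        events.append((to_seconds(start), 1))
        events.append((to_seconds(end), -1))
    events.sort()
    # Record the number of trips in service after all events at each time
    count, steps = 0, []
    for t, delta in events:
        count += delta
        if steps and steps[-1][0] == t:
            steps[-1] = (t, count)
        else:
            steps.append((t, count))
    # Merge consecutive steps that leave the count unchanged
    runs = [steps[0]]
    for t, c in steps[1:]:
        if c != runs[-1][1]:
            runs.append((t, c))
    peak = max(c for _, c in runs)
    # Pick the first longest period at the peak count
    best = None
    for (t0, c), (t1, _) in zip(runs, runs[1:]):
        if c == peak and (best is None or t1 - t0 > best[1] - best[0]):
            best = (t0, t1)
    return peak, best[0], best[1]
```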

compute_network_time_series(dates: list[str], trip_stats: pl.LazyFrame | pl.DataFrame | None = None, num_minutes: int = 60, *, split_route_types: bool = False) → pl.LazyFrame

Compute some network stats in time series form for the given dates (YYYYMMDD date strings) and trip stats, which defaults to feed.compute_trip_stats(). Use num_minutes to specify the length in minutes of each time bin in the resulting time series, e.g. 5. If split_route_types, then split stats by route type; otherwise don’t.

Return a long-form time series table with the columns

'datetime': datetime object
'route_type': integer; present if and only if split_route_types
'num_trips': number of trips in service during the time period
'num_trip_starts': number of trips starting during the time period
'num_trip_ends': number of trips ending during the time period, ignoring the trips that end past midnight
'service_distance': distance traveled during the time period by all trips active during the time period; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_duration': duration traveled during the time period by all trips active during the time period; measured in hours
'service_speed': service_distance/service_duration when defined; 0 otherwise

Exclude dates that lie outside of the Feed’s date range. If all the dates given lie outside of the Feed’s date range, then return an empty table with the specified columns.

Notes

If you’ve already computed trip stats in your workflow, then passing it into this function will speed it up.

compute_route_stats(dates: list[str], trip_stats: pl.DataFrame | pl.LazyFrame | None = None, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) → pl.LazyFrame

Compute route stats for all the trips that lie in the given subset of trip stats, which defaults to feed.compute_trip_stats(), and that start on the given dates (YYYYMMDD date strings).

If split_directions, then separate the stats by trip direction (0 or 1). Use the headway start and end times to specify the time period for computing headway stats.

Return a table with the columns

'date'
'route_id'
'route_short_name'
'route_type'
'direction_id': present if and only if split_directions
'num_trips': number of trips on the route in the subset
'num_trip_starts': number of trips on the route with nonnull start times
'num_trip_ends': number of trips on the route with nonnull end times that end before 23:59:59
'num_stop_patterns': number of stop patterns across trips
'is_loop': 1 if at least one of the trips on the route has its is_loop field equal to 1; 0 otherwise
'is_bidirectional': 1 if the route has trips in both directions; 0 otherwise; present if and only if not split_directions
'start_time': start time of the earliest trip on the route
'end_time': end time of latest trip on the route
'max_headway': maximum of the durations (in minutes) between trip starts on the route between headway_start_time and headway_end_time on the given dates
'min_headway': minimum of the durations (in minutes) mentioned above
'mean_headway': mean of the durations (in minutes) mentioned above
'peak_num_trips': maximum number of simultaneous trips in service (for the given direction, or for both directions when split_directions==False)
'peak_start_time': start time of first longest period during which the peak number of trips occurs
'peak_end_time': end time of first longest period during which the peak number of trips occurs
'service_duration': total of the duration of each trip on the route in the given subset of trips; measured in hours
'service_distance': total of the distance traveled by each trip on the route in the given subset of trips; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all np.nan entries if feed.shapes is None
'service_speed': service_distance/service_duration when defined; 0 otherwise
'mean_trip_distance': service_distance/num_trips
'mean_trip_duration': service_duration/num_trips

Exclude dates with no active trips, which could yield an empty table.

If not split_directions, then compute each route’s stats, except for headways, using its trips running in both directions. For headways, (1) compute max headway by taking the max of the max headways in both directions; (2) compute mean headway by taking the weighted mean of the mean headways in both directions.

Notes

If you’ve already computed trip stats in your workflow, then you should pass that table into this function to speed things up significantly.
The route stats for date d contain stats for trips that start on date d only and ignore trips that start on date d-1 and end on date d.
Raise a ValueError if split_directions and no non-null direction ID values present.

compute_route_time_series(dates: list[str], trip_stats: pl.DataFrame | pl.LazyFrame | None = None, num_minutes: int = 60, *, split_directions: bool = False) → pl.LazyFrame

Compute route stats in time series form at the given num_minutes frequency for the trips that lie in the trip stats subset, which defaults to the output of trips.compute_trip_stats(), and that start on the given dates (YYYYMMDD date strings).

If split_directions, then separate each route’s stats by trip direction.

Return a time series table with the following columns.

datetime: datetime object
route_id
direction_id: direction of route; present if and only if split_directions
num_trips: number of trips in service on the route at any time within the time bin
num_trip_starts: number of trips that start within the time bin
num_trip_ends: number of trips that end within the time bin, ignoring trips that end past midnight
service_distance: sum of the service distance accrued during the time bin across all trips on the route; measured in kilometers if feed.dist_units is metric; otherwise measured in miles
service_duration: sum of the service duration accrued during the time bin across all trips on the route; measured in hours
service_speed: service_distance/service_duration for the route

Exclude dates that lie outside of the Feed’s date range. If all dates lie outside the Feed’s date range, then return an empty table.

Notes

If you’ve already computed trip stats in your workflow, then you should pass that table into this function to speed things up significantly.
If a route does not run on a given date, then it won’t appear in the time series for that date
See the notes for compute_route_time_series_0()
Raise a ValueError if split_directions and no non-null direction ID values present

compute_screen_line_counts(screen_lines: st.GeoLazyFrame | st.GeoDataFrame, dates: list[str], *, include_diagnostics: bool = False) → pl.LazyFrame

Find all the Feed trips active on the given YYYYMMDD dates that intersect the given screen lines (LineStrings) with optional ID column screen_line_id. Behind the scenes, use simple sub-LineStrings of the feed to compute screen line intersections. Using them instead of the Feed shapes avoids miscounting intersections in the case of non-simple (self-intersecting) shapes.

For each trip crossing a screen line, compute the crossing time, crossing direction, etc. and return a table of results with the columns

'date': the YYYYMMDD date string given
'screen_line_id': ID of a screen line
'trip_id': ID of a trip that crosses the screen line
'shape_id': ID of the trip’s shape
'direction_id': GTFS direction of trip
'route_id'
'route_short_name'
'route_type'
'crossing_direction': 1 or -1; 1 indicates trip travel from the left side to the right side of the screen line; -1 indicates trip travel in the opposite direction
'crossing_time': time, according to the GTFS schedule, that the trip crosses the screen line
'crossing_dist_m': distance along the trip shape (not subshape) of the crossing; in meters

If include_diagnostics, then include the following extra columns for diagnostic purposes.

'subshape_id': ID of the simple sub-LineString S of the trip’s shape that crosses the screen line
'subshape_length_m': length of S in meters
'from_departure_time': departure time of the trip from the last stop before the screen line
'to_departure_time': departure time of the trip from the first stop after the screen line
'subshape_dist_frac': proportion of S’s length at which the screen line intersects S

Notes:

Assume the Feed’s stop times table has an accurate shape_dist_traveled column.
Assume that trips travel in the same direction as their shapes, an assumption that is part of the GTFS.
Assume that the screen line is straight and simple.
The algorithm works as follows:
1. Find the Feed’s simple subshapes (computed via shapes.split_simple()) that intersect the screen lines.
2. For each such subshape and screen line, compute the intersection points, the distance of each point along the subshape, aka the crossing distance, and the orientation of the screen line relative to the subshape.
3. Restrict to trips active on the given dates and for each trip associated to an intersecting subshape above, interpolate a trip stop time for the intersection point using the crossing distance, subshape length, cumulative subshape length, and trip stop times.
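Two pieces of the steps above, the crossing direction and the interpolated crossing time, can be sketched with plain coordinates. This is a simplified illustration under stated assumptions, not the library's code: the function names crossing_direction and crossing_time are hypothetical, coordinates are planar (e.g. local UTM), the trip segment is known to cross the screen line, and times are seconds past midnight.

```python
Point = tuple[float, float]


def crossing_direction(
    screen_line: tuple[Point, Point], segment: tuple[Point, Point]
) -> int:
    """Return 1 if the segment travels from the left side of the directed
    screen line to its right side, and -1 otherwise."""
    (ax, ay), (bx, by) = screen_line
    (px, py), (qx, qy) = segment
    # z-component of the cross product (screen line vector) x (travel vector);
    # negative means the travel vector points to the right of the screen line
    cross = (bx - ax) * (qy - py) - (by - ay) * (qx - px)
    return 1 if cross < 0 else -1


def crossing_time(
    crossing_dist: float,
    from_dist: float,
    to_dist: float,
    from_time: float,
    to_time: float,
) -> float:
    """Linearly interpolate the crossing time between the departure times at
    the last stop before and the first stop after the screen line."""
    frac = (crossing_dist - from_dist) / (to_dist - from_dist)
    return from_time + frac * (to_time - from_time)
```

For a screen line pointing north, a trip traveling east crosses from left to right, so its crossing direction is 1.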

compute_stop_activity(dates: list[str]) → pl.LazyFrame

Mark stops as active or inactive on the given dates (YYYYMMDD date strings). A stop is active on a given date if some trip that starts on the date visits the stop (possibly after midnight).

Return a table with the columns

stop_id
dates[0]: 1 if the stop has at least one trip visiting it on dates[0]; 0 otherwise
dates[1]: 1 if the stop has at least one trip visiting it on dates[1]; 0 otherwise
etc.
dates[-1]: 1 if the stop has at least one trip visiting it on dates[-1]; 0 otherwise

If all dates lie outside the Feed period, then return an empty table.

compute_stop_stats(dates: list[str], stop_ids: list[str | None] = None, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) → pl.LazyFrame

Compute stats for all stops for the given dates (YYYYMMDD date strings). Optionally, restrict to the stop IDs given.

If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops. Use the headway start and end times to specify the time period for computing headway stats.

Return a table with the columns

'date'
'stop_id'
'direction_id': present if and only if split_directions
'num_routes': number of routes visiting the stop (in the given direction) on the date
'num_trips': number of trips visiting the stop (in the given direction) on the date
'max_headway': maximum of the durations (in minutes) between trip departures at the stop between headway_start_time and headway_end_time on the date
'min_headway': minimum of the durations (in minutes) mentioned above
'mean_headway': mean of the durations (in minutes) mentioned above
'start_time': earliest departure time of a trip from this stop on the date
'end_time': latest departure time of a trip from this stop on the date

Exclude dates with no active stops, which could yield the empty table.

compute_stop_time_series(dates: list[str], stop_ids: list[str | None] = None, num_minutes: int = 60, *, split_directions: bool = False) → pl.LazyFrame

Compute time series for the given stops (defaults to all stops in Feed) on the given dates (YYYYMMDD date strings) at the given num_minutes frequency. Return a long-format table with the columns

datetime: datetime object for the given date and frequency chunks
stop_id
direction_id: direction of route; present if and only if split_directions
num_trips: the number of trips that visit the stop in the time bin and have a nonnull departure time from the stop

Exclude dates that lie outside of the Feed’s date range. If all dates lie outside the Feed’s date range, then return an empty table.

If split_directions, then separate the stop stats by direction (0 or 1) of the trips visiting the stops.

Notes

Stop times with null departure times are ignored, so the aggregate of num_trips across the day could be less than the num_trips column in compute_stop_stats_0()
All trip departure times are taken modulo 24 hours, so routes with trips that end past 23:59:59 will have all their stats wrap around to the early morning of the time series.
‘num_trips’ should be resampled by summing
Raise a ValueError if split_directions and no non-null direction ID values present

compute_trip_activity(dates: list[str]) → pl.LazyFrame

Mark trips as active or inactive on the given dates (YYYYMMDD date strings). Return a table with the columns

'trip_id'
dates[0]: 1 if the trip is active on dates[0]; 0 otherwise
dates[1]: 1 if the trip is active on dates[1]; 0 otherwise
etc.
dates[-1]: 1 if the trip is active on dates[-1]; 0 otherwise

If dates is None or the empty list, then return an empty table.

compute_trip_stats(route_ids: list[str | None] = None, *, compute_dist_from_shapes: bool = False) → pl.LazyFrame

Return a table with the following columns:

'trip_id'
'route_id'
'route_short_name'
'route_type'
'direction_id': null if missing from feed
'shape_id': null if missing from feed
'stop_pattern_name': output from name_stop_patterns()
'num_stops': number of stops on trip
'start_time': first departure time of the trip
'end_time': last departure time of the trip
'start_stop_id': stop ID of the first stop of the trip
'end_stop_id': stop ID of the last stop of the trip
'is_loop': True if the start and end stop are less than 400m apart and False otherwise
'distance': distance of the trip; measured in kilometers if feed.dist_units is metric; otherwise measured in miles; contains all null entries if feed.shapes is None
'duration': duration of the trip in hours
'speed': distance/duration

If feed.stop_times has a shape_dist_traveled column with at least one non-null value and compute_dist_from_shapes == False, then use that column to compute the distance column. Else if feed.shapes is not None, then compute the distance column using the shapes and Shapely. Otherwise, set the distances to null.

If route IDs are given, then restrict to trips on those routes.

Notes

Assume the following feed attributes are not None:
- feed.trips
- feed.routes
- feed.stop_times
- feed.shapes (optionally)
Calculating trip distances with compute_dist_from_shapes=True seems pretty accurate. For example, calculating trip distances on this Portland feed using compute_dist_from_shapes=False and compute_dist_from_shapes=True, yields a difference of at most 0.83km from the original values.

convert_dist(new_dist_units: str) → Feed: Convert the distances recorded in the shape_dist_traveled columns of the given Feed to the given distance units. New distance units must lie in constants.DIST_UNITS. Return the resulting Feed.
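The unit conversion itself can be sketched by routing every value through meters. The table name METERS_PER_UNIT and the function name convert_value are hypothetical; the factors are the standard international definitions of the foot and the mile.

```python
# Meters per unit for each entry of constants.DIST_UNITS
METERS_PER_UNIT = {"ft": 0.3048, "mi": 1609.344, "m": 1.0, "km": 1000.0}


def convert_value(value: float, old_units: str, new_units: str) -> float:
    """Convert one distance value between two supported units."""
    for units in (old_units, new_units):
        if units not in METERS_PER_UNIT:
            raise ValueError(f"Unknown distance units: {units}")
    return value * METERS_PER_UNIT[old_units] / METERS_PER_UNIT[new_units]
```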

copy() → Feed: Return a copy of this feed, that is, a feed with all the same attributes.

create_shapes(*, all_trips: bool = False) → Feed

Given a feed, create a shape for every trip that is missing a shape ID. Do this by connecting the stops on the trip with straight lines. Return the resulting feed which has updated shapes and trips tables.

If all_trips, then create new shapes for all trips by connecting stops, and remove the old shapes.

describe(sample_date: str | None = None) → pl.LazyFrame

Return a table of various feed indicators and values, e.g. number of routes. Specialize some of those indicators to the given YYYYMMDD sample date string, e.g. number of routes active on the date.

The resulting table has the columns

'indicator': string; name of an indicator, e.g. ‘num_routes’
'value': value of the indicator, e.g. 27

property dist_units: str: The distance units of the Feed.

drop_invalid_columns() → Feed: Drop all table columns of the given Feed that are not listed in the GTFS. Return the resulting Feed.

drop_zombies() → Feed

In the given Feed, do the following in order and return the resulting Feed.

Drop agencies with no routes.
Drop stops of location type 0 or None with no stop times.
Remove undefined parent stations from the parent_station column.
Drop trips with no stop times.
Drop shapes with no trips.
Drop routes with no trips.
Drop services with no trips.

extend_id(id_col: str, extension: str, *, prefix=True) → Feed

Add a prefix (if prefix) or a suffix (otherwise) to all values of column id_col across all tables of this Feed. This can be helpful when preparing to merge multiple GTFS feeds with colliding route IDs, say.

Raise a ValueError if the id_col values are not strings, e.g. if id_col is ‘direction_id’.

geometrize_shapes(*, use_utm: bool = False) → GeoLazyFrame

Given a GTFS shapes table, convert it to a geotable of LineStrings and return the result, which will no longer have the columns 'shape_pt_sequence', 'shape_pt_lon', 'shape_pt_lat', and 'shape_dist_traveled'.

If use_utm, then use local UTM coordinates for the geometries.

geometrize_stops(*, use_utm: bool = False) → GeoDataFrame | GeoLazyFrame

Given a GTFS stops table, convert it to a geotable with a “geometry” column of Points and a “srid” column with the (constant) srid of the geographic projection, e.g. ‘EPSG:4326’ for the WGS84 srid. Return the resulting geotable, which will no longer have the columns 'stop_lon' and 'stop_lat'.

If use_utm, then use local UTM coordinates for the geometries.

get_active_services(date: str) → list[str]: Given a Feed and a date string in YYYYMMDD format, return the service IDs that are active on the date.

get_dates(*, as_date_obj: bool = False) → list[str] | list[dt.date]

Return the inclusive date range covered by feed.calendar and feed.calendar_dates as consecutive days. If neither table yields dates, return the empty list.

If as_date_obj, then return datetime.date objects instead.

Note that this is a range and not the set of actual service days.
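Expanding a start and end date into consecutive YYYYMMDD strings can be sketched with the standard library. The function name date_range is hypothetical, and the sketch takes the two endpoints directly instead of reading them from feed.calendar and feed.calendar_dates.

```python
import datetime as dt


def date_range(start: str, end: str) -> list[str]:
    """Return all YYYYMMDD date strings from start to end, inclusive."""
    first = dt.datetime.strptime(start, "%Y%m%d").date()
    last = dt.datetime.strptime(end, "%Y%m%d").date()
    num_days = (last - first).days + 1
    return [
        (first + dt.timedelta(days=i)).strftime("%Y%m%d") for i in range(num_days)
    ]
```

Every day in the range appears in the result, whether or not any service runs on it.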

get_first_week(*, as_date_obj: bool = False) → list[str] | list[dt.date]

Return a list of YYYYMMDD date strings for the first Monday–Sunday week (or initial segment thereof) for which the given Feed is valid. If the feed has no Mondays, then return the empty list.

If as_date_obj, then return datetime.date objects instead.

get_routes(date: str | None = None, time: str | None = None, *, as_geo: bool = False, use_utm: bool = False, split_directions: bool = False) → pl.LazyFrame | st.GeoLazyFrame

Return feed.routes or a subset thereof. If a YYYYMMDD date string is given, then restrict routes to only those active on the date. If a HH:MM:SS time string is given, possibly with HH > 23, then restrict routes to only those active during the time. If as_geo, return a geotable with all the columns of feed.routes plus a geometry column of (Multi)LineStrings, each of which represents the corresponding route’s shape.

If as_geo and feed.shapes is not None, then return the routes as a geotable with a ‘geometry’ column of (Multi)LineStrings. The geotable will have a local UTM SRID if use_utm; otherwise it will have the WGS84 SRID. If as_geo and split_directions, then add the column direction_id and split each route into the union of its direction 0 shapes and the union of its direction 1 shapes. If as_geo and feed.shapes is None, then raise a ValueError.

get_shapes(*, as_geo: bool = False, use_utm: bool = False) → pl.LazyFrame | None: Get the shapes table for the given feed, which could be None. If as_geo, then return it as a geotable with a ‘geometry’ column of LineStrings and no ‘shape_pt_sequence’, ‘shape_pt_lon’, ‘shape_pt_lat’, ‘shape_dist_traveled’ columns. The geotable will have a UTM SRID if use_utm; otherwise it will have a WGS84 SRID.

get_shapes_intersecting_geometry(geometry: sg.base.BaseGeometry, shapes_g: st.GeoDataFrame | st.GeoLazyFrame = None, *, as_geo: bool = False) → st.GeoLazyFrame | None

If the Feed has no shapes, then return None. Otherwise, return the subset of feed.shapes that contains all shapes that intersect the given Shapely WGS84 geometry, e.g. a Polygon or LineString.

If as_geo, then return the shapes as a geotable. Specifying shapes_g will skip the first step of the algorithm, namely, geometrizing feed.shapes.

get_start_and_end_times(date: str | None = None) → tuple[str]: Return the first departure time and last arrival time (HH:MM:SS time strings) listed in feed.stop_times, respectively. Restrict to the given date (YYYYMMDD string) if specified.

get_stop_times(date: str | None = None) → pl.LazyFrame: Return feed.stop_times. If a date (YYYYMMDD date string) is given, then subset the result to only those stop times with trips active on the date.

get_stops(date: str | None = None, trip_ids: Iterable[str] | None = None, route_ids: Iterable[str] | None = None, *, in_stations: bool = False, as_geo: bool = False, use_utm: bool = False) → pl.LazyFrame: Return feed.stops. If a YYYYMMDD date string is given, then subset to stops active (visited by trips) on that date. If trip IDs are given, then subset further to stops visited by those trips. If route IDs are given, then ignore the trip IDs and subset further to stops visited by those routes. If in_stations, then subset further to stops in stations if station data is available. If as_geo, then return the result as a geotable with a ‘geometry’ column of points instead of ‘stop_lat’ and ‘stop_lon’ columns. The geotable will have a UTM SRID if use_utm and a WGS84 SRID otherwise.

get_stops_in_area(area: st.GeoLazyFrame | st.GeoDataFrame) → st.GeoLazyFrame: Return the subset of feed.stops that contains all stops that intersect the given geotable of polygons.

get_trips(date: str | None = None, time: str | None = None, *, as_geo: bool = False, use_utm: bool = False) → pl.LazyFrame | st.GeoLazyFrame

Return feed.trips. If date (YYYYMMDD date string) is given then subset the result to trips that start on that date. If a time (HH:MM:SS string, possibly with HH > 23) is given in addition to a date, then further subset the result to trips in service at that time.

If as_geo and feed.shapes is not None, then return the trips as a geotable of LineStrings representing trip shapes. Use the local UTM CRS if use_utm; otherwise use the WGS84 CRS. If as_geo and feed.shapes is None, then raise a ValueError.

get_week(k: int, *, as_date_obj: bool = False) → list[str] | list[dt.date]

Given a Feed and a positive integer k, return a list of YYYYMMDD date strings corresponding to the kth Monday–Sunday week (or initial segment thereof) for which the Feed is valid. For example, k=1 returns the first Monday–Sunday week (or initial segment thereof). If the Feed does not have k Mondays, then return the empty list.

If as_date_obj, then return datetime.date objects instead.
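The week selection described above can be sketched on a plain list of dates. This is an illustration, not the library's code: the function name kth_week is hypothetical, and the input is assumed to be a sorted list of consecutive YYYYMMDD strings such as the one returned by get_dates().

```python
import datetime as dt


def kth_week(dates: list[str], k: int) -> list[str]:
    """Return the kth Monday-to-Sunday week of `dates`, or an initial
    segment of it if the dates run out before Sunday."""
    parsed = [dt.datetime.strptime(d, "%Y%m%d").date() for d in dates]
    # Positions of all Mondays (weekday() == 0) in the date list
    mondays = [i for i, d in enumerate(parsed) if d.weekday() == 0]
    if k < 1 or len(mondays) < k:
        return []
    i = mondays[k - 1]
    return dates[i : i + 7]
```

For dates running from 20250101 (a Wednesday) to 20250114, the first week starts on 20250106 and the second week is the two-day segment 20250113 to 20250114.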

list_fields(table_name: str | None = None) → pl.LazyFrame

Return a table summarizing all GTFS tables in the given feed, or only the table with the given name if specified.

The resulting table has the following columns.

'table': name of the GTFS table, e.g. 'stops'
'column': name of a column in the table, e.g. 'stop_id'
'num_values': number of values in the column
'num_nonnull_values': number of nonnull values in the column
'num_unique_values': number of unique values in the column, excluding null values
'min_value': minimum value in the column
'max_value': maximum value in the column

If the table is not in the feed, then return an empty table. If the table is not valid, raise a ValueError.

locate_trips(date: str, times: list[str]) → LazyFrame

Return the positions of all trips active on the given date (YYYYMMDD date string) and times (HH:MM:SS time strings, possibly with HH > 23).

Return a table with the columns

'trip_id'
'shape_id'
'route_id'
'direction_id': null if feed.trips.direction_id is missing
'time'
'rel_dist': number between 0 (start) and 1 (end) indicating the relative distance of the trip along its path
'lon': longitude of trip at given time
'lat': latitude of trip at given time

Assume feed.stop_times has an accurate shape_dist_traveled column.

map_routes(route_ids: Iterable[str] | None = None, route_short_names: Iterable[str] | None = None, color_palette: Iterable[str] = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3'], *, show_stops: bool = False): Return a Folium map showing the given routes and (optionally) their stops. At least one of route_ids and route_short_names must be given. If both are given, then combine the two into a single set of routes. If any of the given route IDs are not found in the feed, then raise a ValueError.

map_stops(stop_ids: Iterable[str], stop_style: dict = {'color': '#fc8d62', 'fill': 'true', 'fillOpacity': 0.75, 'radius': 8, 'weight': 1}): Return a Folium map showing the given stops of this Feed. If some of the given stop IDs are not found in the feed, then raise a ValueError.

map_trips(trip_ids: Iterable[str], color_palette: list[str] = ['#66c2a5', '#fc8d62', '#8da0cb', '#e78ac3', '#a6d854', '#ffd92f', '#e5c494', '#b3b3b3'], *, show_stops: bool = False, show_direction: bool = True): Return a Folium map showing the given trips. Silently drop any invalid trip IDs. If show_stops, then plot the trip stops too. If show_direction, then use the Folium plugin PolyLineTextPath to draw arrows on each trip polyline indicating its direction of travel; this fails to work in some browsers, such as Brave 0.68.132.

name_stop_patterns() → pl.LazyFrame

For each (route ID, direction ID) pair, find the distinct stop patterns of its trips, and assign them each an integer pattern rank based on the stop pattern’s frequency rank, where 1 is the most frequent stop pattern, 2 is the second most frequent, etc. Return the table feed.trips with the additional column stop_pattern_name, which equals the trip’s ‘direction_id’ concatenated with a dash and its stop pattern rank.

If feed.trips has no ‘direction_id’ column, then temporarily create one equal to all zeros, proceed as above, then delete the column.
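
The ranking scheme can be sketched with standard-library counters. This is an illustration on plain dictionaries rather than the library's LazyFrame implementation; the function name name_patterns and the input layout are hypothetical, and ties are assumed to be broken by first appearance.

```python
from collections import Counter, defaultdict


def name_patterns(
    trips: list[dict], stop_seqs: dict[str, tuple[str, ...]]
) -> dict[str, str]:
    """
    Map each trip ID to '<direction_id>-<rank>', where rank 1 is the most
    frequent stop pattern within the trip's (route_id, direction_id) group.
    """
    # Count how often each stop pattern occurs per (route, direction) pair
    groups = defaultdict(Counter)
    for trip in trips:
        key = (trip["route_id"], trip.get("direction_id", 0))
        groups[key][stop_seqs[trip["trip_id"]]] += 1

    names = {}
    for trip in trips:
        direction = trip.get("direction_id", 0)  # default to 0 if missing
        counter = groups[(trip["route_id"], direction)]
        ranked = [pattern for pattern, _ in counter.most_common()]
        rank = ranked.index(stop_seqs[trip["trip_id"]]) + 1
        names[trip["trip_id"]] = f"{direction}-{rank}"
    return names


trips = [
    {"trip_id": "t1", "route_id": "R1", "direction_id": 0},
    {"trip_id": "t2", "route_id": "R1", "direction_id": 0},
    {"trip_id": "t3", "route_id": "R1", "direction_id": 0},
    {"trip_id": "t4", "route_id": "R1", "direction_id": 1},
]
stop_seqs = {
    "t1": ("A", "B", "C"),
    "t2": ("A", "B", "C"),
    "t3": ("A", "C"),
    "t4": ("C", "B", "A"),
}
names = name_patterns(trips, stop_seqs)
print(names)
```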

restrict_to_agencies(agency_ids: list[str]) → Feed: Build a new feed by restricting this one via restrict_to_routes() and the routes with the given agency IDs. Return the resulting feed.

restrict_to_area(area: st.GeoDataFrame | st.GeoLazyFrame) → Feed: Build a new feed by restricting this one via restrict_to_trips() and the trips that have at least one stop intersecting the given geotable of polygons, which can be in any coordinate reference system. Return the resulting feed.

restrict_to_dates(dates: list[str]) → Feed: Build a new feed by restricting this one via restrict_to_trips() and the trips active on at least one of the given dates (YYYYMMDD strings). Return the resulting feed.

restrict_to_routes(route_ids: list[str]) → Feed: Build a new feed by restricting this one via restrict_to_trips() and the trips with the given route IDs. Return the resulting feed.

restrict_to_trips(trip_ids: list[str]) → Feed

Build a new feed by restricting this one to only the stops, trips, shapes, etc. used by the trips of the given IDs. Return the resulting feed.

If no valid trip IDs are given, which includes the case of the empty list, then the resulting feed will have all empty non-agency tables.

This function is probably more useful internally than externally.

routes_to_geojson(route_ids: Iterable[str] | None = None, route_short_names: Iterable[str] | None = None, *, split_directions: bool = False, include_stops: bool = False) → dict

Return a GeoJSON FeatureCollection (in WGS84 coordinates) of MultiLineString features representing this Feed’s routes.

If an iterable of route IDs or route short names is given, then subset to the union of those routes, which could yield an empty FeatureCollection in case of all invalid route IDs and route short names. If include_stops, then include the route stops as Point features. If the Feed has no shapes, then raise a ValueError.

shapes_to_geojson(shape_ids: Iterable[str] | None = None) → dict

Return a GeoJSON FeatureCollection of LineString features representing feed.shapes. If the Feed has no shapes, then the features will be an empty list. The coordinate reference system is the default one for GeoJSON, namely WGS84.

If an iterable of shape IDs is given, then subset to those shapes. If the subset is empty, then return a FeatureCollection with an empty list of features.
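
For readers unfamiliar with the output format, here is a sketch of a FeatureCollection of LineStrings built from shape points. The input layout, the 'shape_id' property, and the function name shapes_to_feature_collection are assumptions for illustration; the properties the library actually attaches are not specified here.

```python
import json


def shapes_to_feature_collection(
    shape_points: dict[str, list[tuple[float, float]]],
) -> dict:
    """Build a FeatureCollection of LineStrings from {shape_id: [(lon, lat), ...]}."""
    features = [
        {
            "type": "Feature",
            "properties": {"shape_id": shape_id},
            "geometry": {
                "type": "LineString",
                # GeoJSON positions are [longitude, latitude] in WGS84
                "coordinates": [list(p) for p in points],
            },
        }
        for shape_id, points in shape_points.items()
    ]
    return {"type": "FeatureCollection", "features": features}


fc = shapes_to_feature_collection({"sh1": [(174.76, -36.85), (174.77, -36.86)]})
print(json.dumps(fc, indent=2))

# No shapes gives an empty list of features
empty = shapes_to_feature_collection({})
```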

split_simple() → GeoLazyFrame

Given a geotable of GTFS shapes of the form output by geometrize_shapes() with possibly non-WGS84 coordinates, split each non-simple LineString into large simple (non-self-intersecting) sub-LineStrings, and leave the simple LineStrings as is.

Return a geotable in the coordinates of shapes_g with the columns

'shape_id': GTFS shape ID for a LineString L
'subshape_id': a unique identifier of a simple sub-LineString S of L
'subshape_sequence': integer; indicates the order of S when joining up all simple sub-LineStrings to form L
'subshape_length_m': the length of S in meters
'cum_length_m': the length of S plus the lengths of sub-LineStrings of L that come before S; in meters
'geometry': LineString geometry corresponding to S

Within each ‘shape_id’ group, the subshapes will be sorted increasingly by ‘subshape_sequence’.

Notes

Simplicity checks and splitting are done in local UTM coordinates. Converting back to original coordinates can introduce rounding errors and non-simplicities. So test this function with a shapes_g in local UTM coordinates.
By construction, for each given LineString L with simple sub-LineStrings S_i, we have the inequality

sum over i of length(S_i) <= length(L),

where the lengths are expressed in meters.
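
As a small worked example of how 'cum_length_m' relates to 'subshape_length_m', the cumulative lengths are running sums of the subshape lengths in 'subshape_sequence' order. The lengths below are made up for illustration.

```python
from itertools import accumulate

# Hypothetical subshape lengths in meters, ordered by subshape_sequence
subshape_lengths_m = [120.0, 80.5, 40.0]

# Running sums: each entry is the length of S plus all earlier subshapes
cum_lengths_m = list(accumulate(subshape_lengths_m))
print(cum_lengths_m)  # [120.0, 200.5, 240.5]
```

By the inequality above, the last cumulative length is at most the length of L.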

stop_times_to_geojson(trip_ids: Iterable[str | None] = None) → dict

Return a GeoJSON FeatureCollection of Point features representing all the trip-stop pairs in feed.stop_times. The coordinate reference system is the default one for GeoJSON, namely WGS84.

For every trip, drop duplicate stop IDs within that trip. In particular, a looping trip will lack its final stop.

If an iterable of trip IDs is given, then subset to those trips, silently dropping invalid trip IDs.
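
The effect of dropping duplicate stop IDs on a looping trip can be seen in this sketch, which assumes the first occurrence of each stop is the one kept. The function name drop_duplicate_stops is hypothetical.

```python
def drop_duplicate_stops(stop_ids: list[str]) -> list[str]:
    """Keep the first occurrence of each stop ID, preserving order."""
    return list(dict.fromkeys(stop_ids))


# A looping trip that returns to its first stop loses the final visit
print(drop_duplicate_stops(["A", "B", "C", "A"]))  # ['A', 'B', 'C']
```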

stops_to_geojson(stop_ids: Iterable[str | None] = None) → dict

Return a GeoJSON FeatureCollection of Point features representing all the stops in feed.stops. The coordinate reference system is the default one for GeoJSON, namely WGS84.

If an iterable of stop IDs is given, then subset to those stops.

subset_dates(dates: list[str]) → list[str]: Given a Feed and a list of YYYYMMDD date strings, return the sorted sublist of dates that lie in the Feed’s dates (the output of feed.get_dates()). Could be an empty list.
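
A sketch of the idea, assuming the feed's dates are given as a list of YYYYMMDD strings. Such strings sort chronologically when sorted as text, so no date parsing is needed. The function name subset_dates_sketch is hypothetical.

```python
def subset_dates_sketch(feed_dates: list[str], dates: list[str]) -> list[str]:
    """Return the sorted dates that also appear among the feed's dates."""
    return sorted(set(dates) & set(feed_dates))


feed_dates = ["20251103", "20251104", "20251105"]
print(subset_dates_sketch(feed_dates, ["20251105", "20251201", "20251103"]))
# ['20251103', '20251105']
print(subset_dates_sketch(feed_dates, ["20260101"]))  # []
```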

to_file(path: Path, ndigits: int | None = None) → None: Write this Feed to the given path. If the path ends in ‘.zip’, then write the feed as a zip archive. Otherwise assume the path is a directory, and write the feed as a collection of CSV files to that directory, creating the directory if it does not exist. Round all decimals to ndigits decimal places, if given. All distances will be in the distance units feed.dist_units. By the way, 6 decimal degrees of latitude and longitude is enough to locate an individual cat.
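
The remark about 6 decimal places can be checked with a rough calculation: one millionth of a degree of latitude corresponds to roughly 11 centimeters on the ground. The sketch below assumes a spherical Earth with the mean radius shown.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters, an approximation

# Length of one degree of latitude along a meridian
meters_per_degree = 2 * math.pi * EARTH_RADIUS_M / 360

# Resolution of the sixth decimal place of a latitude value
print(round(meters_per_degree * 1e-6, 3))  # 0.111
```

For longitude the figure shrinks with the cosine of the latitude, so 6 decimal places are at least as precise there.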

trips_to_geojson(trip_ids: Iterable[str] | None = None, *, include_stops: bool = False) → dict

Return a GeoJSON FeatureCollection (in WGS84 coordinates) of LineString features representing all the Feed’s trips.

If include_stops, then include the trip stops as Point features. If an iterable of trip IDs is given, then subset to those trips, which could yield an empty FeatureCollection in case of all invalid trip IDs.

ungeometrize_stops() → DataFrame | LazyFrame

The inverse of geometrize_stops().

If stops_g is in UTM coordinates, then convert those UTM coordinates back to WGS84 coordinates, which is the standard for a GTFS stops table.

gtfs_kit_polars.feed.list_feed(path: Path) → DataFrame

Given a path (string or Path object) to a GTFS zip file or directory, record the file names and file sizes of the contents, and return the result in a table with the columns:

'file_name'
'file_size'

gtfs_kit_polars.feed.read_feed(path_or_url: Path | str, dist_units: str) → Feed

Create a Feed instance from the given path or URL and given distance units. If the path exists, then call _read_feed_from_path(). Else if the URL has OK status according to Requests, then call _read_feed_from_url(). Else raise a ValueError.

Notes:

Ignore non-GTFS files in the feed
Automatically strip whitespace from the column names in GTFS files
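
The column-name cleanup can be illustrated with the standard csv module. This is only a sketch of the idea on an in-memory CSV string, not how the library reads files.

```python
import csv
import io

# A GTFS-like CSV whose header contains stray whitespace
raw = " stop_id ,stop_name\nS1,Main St\n"

reader = csv.DictReader(io.StringIO(raw))
# Strip whitespace from the header names before reading the rows
reader.fieldnames = [name.strip() for name in reader.fieldnames]
rows = list(reader)
print(rows[0]["stop_id"])  # S1
```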
