This repository is part of the multi-repository project dystonse. See the main repository for more information.
This is a Rust crate that works with static GTFS schedules (as zip files or directories), GTFS-realtime data (as .pb or .zip files) and a MySQL database (setup instructions can be found in dystonse-docker) to read, import or analyse the data, and/or to serve a travel information website (see below for details).
In import mode, it matches the realtime data to the schedule data and writes everything into the mysql database. It also makes predictions for trips in the near future, based on schedule and realtime data, and writes them into the database as well.
In analyse mode, it can compute delay probability curves both for specific and general data sets, and save them as small machine-readable files or human-readable images in different formats. It can also count the data entries per time and output some simple statistics.
In predict mode, it can look up and return the delay probability curve that is most useful for predicting the delay of a specified trip, stop, time, and (optional) delay at a specified earlier stop.
In monitor mode, it generates a passenger information website ("erweiterter Abfahrtsmonitor") with interactive and visual information about delay probability distributions for every stop of every trip in the near future (the exact length of this time span depends on the available predictions as defined in the importer module; the default is 8.5 days).
Basic syntax is `dystonse-gtfs-data [global options] <command> <subcommand> [args]`, or, if you run it via cargo, `cargo run [--release] -- [global options] <command> <subcommand> [args]`.
There are a lot of database parameters that are defined globally. These `DB_…` parameters can either be set as environment variables (using the upper-case names like `DB_PASSWORD`) or as command line parameters (using lower-case variants without the `DB_` prefix, e.g. `--password`). Default values are provided for `DB_USER`, `DB_HOST`, `DB_PORT` and `DB_DATABASE`. In contrast, `DB_PASSWORD` and `GTFS_DATA_SOURCE_ID` always have to be specified, where `GTFS_DATA_SOURCE_ID` is a string identifier that is written as-is into the database for each entry. In the syntax examples below, we use a mix of environment variables and command line parameters.
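For example, the following sketch combines the `DB_PASSWORD` environment variable with the `--host` and `--source` command line parameters, leaving all other values at their defaults (`<command>` and `<subcommand>` stand for any of the commands described below):

```
DB_PASSWORD=<password> dystonse-gtfs-data --host <db_hostname> --source <source> <command> <subcommand> [args]
```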
The most important args are `dir` and `schedule`. `dir` is mandatory and names a directory where data should be read from/written to. `schedule` is optional and points to a schedule file to use for the analyses/predictions. If no schedule file is given, the newest available schedule is used.
You can also use `dystonse-gtfs-data [command [subcommand]] --help` to get information about the command syntax.
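For example, to show the help for the import command (when running via cargo, prepend `cargo run --release --` as shown above):

```
dystonse-gtfs-data import --help
```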
This tool can write incoming realtime data into the `records` table and/or use it to update its own predictions, which are written into the `predictions` table. The outcomes are quite different, but the way the incoming data is processed is similar. This is why both actions are part of the `import` subcommand and can be performed in one go. You select them with the `--record` and/or `--predict` flag.
DB_PASSWORD=<password> dystonse-gtfs-data [-v] --source <source> import --record manual <gtfs file path> <gtfs-rt file path(s)>
Without `-v`, the only output on stdout is a list of the gtfs-realtime filenames that have been parsed successfully.
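Since recording and predicting can be performed in one go, both flags can presumably be combined in a single call, roughly like the following sketch (the exact flag placement is an assumption, check `dystonse-gtfs-data import --help`):

```
# exact flag placement may differ - see --help
DB_PASSWORD=<password> dystonse-gtfs-data --source <source> import --record --predict manual <gtfs file path> <gtfs-rt file path(s)>
```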
Instead of `manual` mode, you can use `automatic` or `batch` mode:
DB_PASSWORD=<password> dystonse-gtfs-data [-v] --source <source> import --record automatic <dir>
In automatic mode:
1. The importer will search for all schedules in `<dir>/schedule` and all realtime files in `<dir>/rt`, and compute for each schedule which rt-files belong to that schedule. In this context, each realtime file belongs to the newest schedule that is older than the realtime data, as indicated by the date within the filenames.
2. Beginning with the oldest schedule, the importer will import each realtime file and move it to `<dir>/imported` on success, or to `<dir>/failed` if the import failed for reasons within the realtime file (if the filename is not suitable to extract a date, or if the file could not be parsed).
3. When all known files are processed, the importer will look for new files that appeared during its operation. If new files are found, it repeats from step 1.
4. If no new files were found during step 3, the importer will wait for a minute and then continue with step 3.
In `batch` mode, it works exactly as in `automatic` mode, but the importer exits after step 2.
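For example, a batch run that imports the current backlog once and then exits could look like this (same arguments as in the automatic example above):

```
DB_PASSWORD=<password> dystonse-gtfs-data [-v] --source <source> import --record batch <dir>
```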
Additional required arguments depend on the subcommand you want to use:
For a given source id, this will count the number of valid realtime entries for each time interval. An entry is considered valid if its `delay_arrival` is between -10 hours and +10 hours. The whole time span for which there is realtime data will be split into parts of a length corresponding to the `interval` parameter, which has a default value of `1h` (one hour).
Simple statistics are output to stdout as CSV like this (space padding added for clarity; the spaces won't be present in the real output):
time_min; time_max; stop_time update count; average delay; rt file count; rt file size
2020-03-16 00:41:02; 2020-03-16 04:41:02; 72; 11.6111; 12; 18279
[...]
Graph mode is only available if you compile with `--features visual-schedule`. This will compute visual schedules for the given `route-ids` (or `all`) and save them as png images in a directory structure sorted by agency and route. See this post on our blog (in German) for more information about visual schedules ("Bildfahrpläne").
This will compute specific delay probability curves for a given set of `route-ids` (or for all route-ids available in the schedule, if `all` is used instead). As long as there are enough data points in the database, it creates the following things for each route variant and each time slot:
- curves of the general distribution of delays at each stop (one curve for arrival delays and one for departure delays)
- curve sets of the distribution of arrival delays at each stop, depending on the departure delay at another (earlier) stop (one curve set for each pair of two stops)
This will compute aggregated delay probability curves divided by the following general categories:
- route type: tram/subway/rail/bus/ferry
- route section: beginning/middle/end, see here for the specification.
- time slot: 11 separate time categories defined by weekdays and hours, see here for the specification.
This will compute delay probability curves, using the collected data in the database. The curves (both specific and default) are saved into a file named "all_curves.exp" in the specified data directory. When the argument `route-ids` is given, the specific curves are only computed for the given route-ids. When the argument `all` is given, all available route-ids from the schedule are used.
This will compute specific delay probability curve sets for the given `route-ids` and output them as diagrams in svg format with human-readable titles (in German) and labels/captions. One file is created for each pair of stops in each route variant and each time slot, sorted into a directory structure.
Additional required arguments depend on the subcommand you want to use. Currently, only the `single` subcommand is implemented.
This will look up a single curve or curve set, depending on the values of the arguments, and print the output to the command line (we are currently working on a more useful interface for this output). The following arguments are needed (a hypothetical invocation is sketched after this list):
- `route-id`, `trip-id` and (optional) `stop-id` (according to the schedule) of where you want to get a prediction for. If `stop-id` is omitted, a prediction for each stop of the route is generated.
- `event-type`: arrival or departure
- `date-time`: date and time of when you want to be at the specified stop
- (optional) `start-stop-id` of a previous stop where the vehicle has already been
- (optional) `initial-delay` at the previous stop. If `start-stop-id` is given, but `initial-delay` is not given, the result will be a curve set instead of a single curve
- (optional) `use-realtime`: if given instead of `start-stop-id` and `initial-delay`, the predictor module will try to look up a useful `start-stop-id` and `initial-delay` from the database (if there is current realtime data for this trip). Obviously, this only works in a very narrow time window, in which the vehicle has already started its trip but has not yet arrived at `stop-id`.
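The exact command line syntax is not documented here; the following is only a hypothetical sketch that assumes each argument from the list above is passed as a long option of the same name (use `dystonse-gtfs-data predict single --help` for the real syntax):

```
# hypothetical syntax - the real option names and formats may differ
DB_PASSWORD=<password> dystonse-gtfs-data --source <source> --dir <dir> predict single \
  --route-id <route-id> --trip-id <trip-id> --stop-id <stop-id> \
  --event-type departure --date-time <date-time> --use-realtime
```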
(not yet implemented.)
The "monitor" website has some large dependencies that are not needed for any of the other modules, therefore it is configured as an optional feature. If you want to use the monitor
command, --features "monitor"
needs to be specified at compile time.
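For example (a plain cargo build; add further optional features as needed):

```
cargo build --release --features monitor
```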
When starting the `monitor` directly via the command line, the (human-readable) long name of the data source / transport provider needs to be specified as an argument. When used in the context of the dystonse-docker setup, this argument can be read from the .env files instead.
You can then run the monitor functionality with the `monitor` subcommand, e.g.:
dystonse-gtfs-data --host <db_hostname> --password <db_password> --source <source> --dir <dir> monitor --source-long-name <source_long_name>
The website will then be available on localhost:3000.
A manual for using the website is included in the website itself; it is currently only available in German.
This started out as a simple test repository for compiling Rust applications in docker. It used to contain a hello-world-application written in Rust, and some docker fluff:
- compile inside a docker container
- copy binary into another container
We used it to test Rust development in general, and to check if this works with cross-compiling.
NOTE: The following parts are probably outdated. We will update them when we have fixed the Docker config for the current crate, so that it can be compiled into a usable Docker image again.
Use `docker buildx build --platform linux/amd64,linux/arm/v7 -t dystonse/rust-test:latest --push .` to build and push the containers for both `linux/amd64` and `linux/arm/v7` architectures.
You might have to enable experimental features first, e.g. using `export DOCKER_CLI_EXPERIMENTAL=enabled`.
Also, you might have to create and activate a builder, as documented here for Docker Desktop (Mac and Windows) or here for Linux hosts.
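One possible way to create and activate such a builder looks like this (the builder name `mybuilder` is just an example):

```
# enable the buildx CLI plugin (if not enabled already)
export DOCKER_CLI_EXPERIMENTAL=enabled
# create a new builder instance and switch to it
docker buildx create --name mybuilder --use
# start the builder and show the supported platforms
docker buildx inspect --bootstrap
```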
We hit a problem when cross-compiling a Rust application with dependencies on Docker Desktop for Mac. While building the arm/v7 container, `cargo build` can't read some git-specific directory, as explained in this issue.
It boils down to a broken emulation of system calls when emulating a 32-bit system on a 64-bit host using qemu. The actual bug - if you call it a bug - is not in `qemu` but in `libc`.
A good workaround should be to use a host kernel which has been compiled with the `CONFIG_X86_X32` configuration flag. Docker Desktop for Mac uses a virtualized Linux host running on HyperKit. The Linux image is built with LinuxKit; however, we could not verify whether the image shipped with Docker Desktop has the `CONFIG_X86_X32` configuration flag (probably not).
But the same error occurs when cross-compiling on another host which runs Debian Linux natively. According to its `/boot/config-4.19.0-8-amd64` file, the `CONFIG_X86_X32` configuration is enabled there, so it should have worked.