“Where, where did you go,” or searching for missed stops on public transport routes in OpenStreetMap
OpenStreetMap (OSM) is a global project formed around a geographic information database, filled by everyone – both enthusiasts and interested companies. Anyone can contribute, but openness also has a downside, which often leads to incorrect edits being added to the database. Therefore, there are many validators written in the OSM ecosystem that allow you to maintain data quality at an acceptable level.
Since 2016 there is open source metro preprocessorwhich validates (generates error reports) urban rapid transit routes in OSM for completeness and logical/topological errors and converts them into formats suitable for routing and rendering services, including GTFS. In addition to OSM data, it accepts as input list of public transport networks (OT), containing control information about the number of lines, stations, and other things in a certain transport network. The preprocessor has successfully proven itself in preparing OT data for applications such as Maps.me and Organic Maps.
In this article, I would like to share an approach to detecting one of the types of errors that occur quite often in OSM data and the automatic catching of which poses some challenge – this is the random loss of a station from the route. All validator source codes and described algorithm are in the public domain. But first, let's define the concepts used to represent PA data in OpenStreetMap.
Basic Concepts
On first reading, you can get by with the philistine idea of stations (stop)although in OSM it can correspond to at least four different types of objects:
Comment. References to OSM objects may become outdated over time. An alternative way to get them is to run something like this in Overpass Turbopositioning the map in the area of a certain station:
(
nwr[railway=station]({{bbox}});
rel[public_transport=stop_area]({{bbox}});
);
(._;>>;);
out;
These concepts are illustrated schematically in the image (created for OSM Wiki licensed under CC SA-4.0):
If you want to delve into the program code of the algorithm on GitHub, the role of the station in it is played by the StopArea class, based on the OSM relation public_transport=stop_area, which combines the railway station, platforms and station entrances at a certain OT stop location.
Comment. We will talk about high-speed rail transport, although the algorithms and conclusions can be extended to ground transport: buses, trams, trolleybuses.
Transfer or transfer hub – two or more stations between which you can travel on foot in a relatively short time. In OSM they are denoted by a special object – the relation public_transport=stop_area_group. Example – change at Green Park station between the Piccadilly Line, Jubilee Line and Victoria Line. Sometimes what looks like one station to a passenger may appear in OSM as several stations connected by a transfer.
Route is a sequence of stops, usually with its own identifier, which OT vehicles periodically follow. Example of a real route in OSM: Piccadilly line: Uxbridge → Cockfosters.
Consider the hypothetical route “R1: Reading → Maryland”. The definition suggests that “R2: Maryland → Reading” is a different route, as is the shortened version “R4: Reading → Slough” or the express route “R3: Maryland – Slough – Reading”, which makes only one intermediate stop.
Master route — these are all options for routes within the general direction, including routes “there” and “back”, shortened, accelerated by skipping some stations, as well as options with other differences. In our example, all of the listed routes, which we will denote as R1-R5, can be combined into one master route. Example of a real master route: Piccadilly Line.
Comment. Here and below, the stations and routes in the pictures are for illustration purposes only and may not correspond to reality.
More information about the problem being solved
When the validator processes OSM data, it compares the number of stations found on the route with the number previously entered into the control value file. And if it is still possible to store and maintain the number of stations in all high-speed urban transport networks in the world on master routes, then for all route options it is no longer an option. Therefore, if a station in one of the directions is accidentally removed from the OSM data, the number of stations in the master route will not change and the validator will not notice, for example, an error where the Langley station is missed in the opposite direction:
Algorithm
Search for twin routes
Let's look at the set of routes R1-R5 again:
Let's call two routes in a master route potential twins if the start of the first route is the end of the other, and vice versa. These for route R1 are R2 and R3. Among the potential twins, we will choose the closest one based on the number of common stations. Thus, in our case we will find two pairs of twin routes: (R1, R2) and (R4, R5). For each pair, we will run an algorithm to find differences, because it is very likely that the differences in the twins are a consequence of errors in the data.
Finding differences in routes
Determining whether one route contains all the stations from another route is not a difficult task. However, there are many nuances here: a station can be skipped both in the first route and in the second, or several stations can be skipped at once, whether they are consecutive or not. And a station on one route may not correspond to the same station on another, that is, it may be mistakenly replaced by a double or a completely different station. All this is very reminiscent of… Levenshtein's editorial distance!
Let's look at routes T1 and T2 again:
Let's imagine the route as a word, where the letters are stations (in general, transfer hubs). Based on the first letters of the stations, route T1 corresponds to the word RTBSLHM, and the return route T2 corresponds to the word MHSBTR. Let's compare the word for T1 with the inverted word for T2:
RTBSLHM
RTBSHM
For ideal twins, the words should match, but in T2 we have a “typo” – the letter L is missing. Thus, finding a mismatch between stations on a pair of routes comes down to finding the editorial distance between a pair of words.
Comment. The first letters are used here for illustration purposes only. The real “letters” are the objects of the stations/transfer hubs!
The Wagner-Fisher algorithm allows using dynamic programming to find the shortest sequence of steps to transform one word into another by adding, deleting or replacing one letter in one step. The result will be an editorial prescription – in what places additions/deletions/replacements are necessary.
Thus, the Wagner-Fisher algorithm adapted for OT routes will show all the differences in stations for a pair of twin routes. Let’s illustrate this with another example with a larger number of “typos” on the routes, where, in addition to missing stations in the original data, there were two different stations where there should be one common station.
The adapted Wagner-Fisher algorithm, when run on a pair of routes (S1, S2), will issue the following warnings:
There is no Taplow station on the S1 route;
Route S2 does not include Langley station;
Different Slough stations on routes S1 and S2.
False positives
In reality, each case of a detected error may be a false positive and will require a separate investigation, however, in the form described, the algorithm will produce too many false positives in cases where the forward and reverse routes actually diverge at some point along the path, for example:
The algorithm will say that Woodstock and Esplanade, Salt River and Mutual stations do not correspond to each other, and Ndabeni is missed in the U1 route. To combat this, it is worth noting that on twin routes the rails usually run quite close and within the station the landing points are not more than a couple of tens of meters from each other. But even taking into account large stations with many tracks and platforms, this figure should not exceed 100 m. You can safely overestimate this figure, because it is better to get a false positive than to miss an error. Moreover, if the railway routes actually diverge, then usually by much more than 100 m.
Thus, to detect route divergence, we will form the following criterion: if the distance from a station, determined by the algorithm as missed or inappropriate, to the rails of the return route exceeds a certain threshold (~100 m), then we consider that the rails have diverged, and we will not form error messages.
Practical Application
The mentioned metro validator in OSM has been working for many years and contains versatile checks, and the errors it produces are regularly corrected by several people. However, this is not the only OT validator. However, when the above algorithm was run on 300+ metro and other high-speed rail networks, it produced 144 missing/mismatched stations, only 17 of which were false positives.
This is how this beautiful example of the pervasive nature of the theory of algorithms worked, where text processing methods found application in the field of public transport – the domain of graph theory.
It is recommended for implementation in such widely used validators as: