Reversing the Moscow Metro

Urban planners and urbanists need to work with data to conduct quantitative research. However, officials in the Russian Federation are in no hurry to share city statistics openly, posting only the bare minimum of information in open access. Abroad, the situation is a little better, but still there are cases when there is no dataset. (a brief overview of the need and availability of open data can be read here.)

In this case, you have to collect data yourself. At the same time, we are not always talking about work “in the field”, most often all the information is already on the Internet, just not everyone is ready to share it. In this article, I will try to get a matrix of the times of the Moscow metro, along the way we will reverse engineer the Yandex metro application, and also make very cool visualizations of the information received.

Formulation of the problem

It is necessary to obtain data on travel time between any two metro stations for any year (as the Metro develops, new stations are built, transfers). Travel time should take into account the possibility of transfers (and their time).

Possible Solutions

There are ~300 stations in the subway, which doesn’t seem like much, but it’s C(300, 2) = 300! / (2! (300 – 2)!) = (300 * 299) / 2 = 44850 ways to choose 2 of them. Any manual calculation of such a number of routes is doomed to failure. Using the maps API is also fraught with some problems – firstly, it takes into account the time to descend from the station, secondly, I will not have access to archived data, and for some pairs of stations, the route by ground transport will be much faster than using the subway. Then the way to the subway will not even be offered. I also really wanted to reverse engineer some kind of mobile application, so let’s go to the method I chose.

The Yandex metro application allows you to build a route along the subway, and also works completely offline – which means the data is stored on the user’s phone. Running downloaded from a computer and transferred via cable to a phone that does not have an Internet connection, to my own joy, I found that the application was already able to build routes and it did not need primary synchronization with the server to load the schemes. This means that a huge base old versions of this program on 4pdf will allow us to get archival data on the metro time matrix of almost any city in the world. It remains only to understand how this application is arranged inside.

APK file

I will analyze version 2.05 for 2015. From version to version, the database device, which stores all the necessary information, practically did not change, until 2018. Then a transition was made from sqlite to a NoSQL database, or, more simply, the data began to be stored in JSONe (and the structure remained +- the same).

Opening the downloaded apk file on a computer using a zip archiver we will see:

  1. Meta-information folder (META-INF): Contains files with signatures, certificates, and application manifests.

  2. Resources folder (res): Contains all resources such as images, layouts, and strings used in the application.

  3. Code (classes.dex): contains the compiled Java code that performs the functions of the application.

  4. Miscellaneous (lib): contains libraries that are used by the application, for example, to work with images or sound.

  5. Manifest (AndroidManifest.xml): Contains information about the application, such as package name, permissions, and application components.

Note that the files in the APK folder may be encrypted or compressed, so you won’t be able to just open them and read the contents. If you want to examine the code and resources of an application, you need to use special tools such as a decompiler.

However, in our case, the database is an unencrypted sqlite file at res/raw/yandex2.db . In this case, the location of this file was obvious, but in more complex situations, you can use the search utilities by extension and file type.

Database

sqlite files are easiest to learn using sqlitebrowser. This freeware multi-platform software with simple graphical interfaces was able to open our database as well. The structure of this database also does not raise questions. The stations table contains information about stations, including their name, location, and so on. Transfer information is contained in the Transfer table. Everything turned out to be much easier than it could have been.

New versions

The data is stored in the assets/metrokit/ folder. Now the JSON files are read by Python by translating into Networkx Graphespecially since @Sakhar already wrote a handy script when he was building a 3D subway map.

names = json.loads(codecs.open( "l10n.json", "r", "utf_8_sig" ).read())
graph = json.loads(codecs.open( "data.json", "r", "utf_8_sig" ).read())nodeStdict={}

for stop in graph['stops']['items']:
    nodeStdict[stop['nodeId']]=stop['stationId']

G=nx.Graph()
for node in graph['nodes']['items']:
    G.add_node(node['id'])
for link in graph['links']['items']:
    G.add_edge(link['fromNodeId'], link['toNodeId'], length=link['attributes']['time'])
nodestoremove=[]
for node in G.nodes():
    if len(G.edges(node))<2:
        nodestoremove.append(node)
for node in nodestoremove:
    G.remove_node(node)
labels={}
for node in G.nodes():
    try:
        labels[node]=names['keysets']['generated'][nodeStdict[node]+'-name']['ru']
    except: 
				labels[node]='error'

Let’s see what we got:

nx.draw(G)

The diagram is, of course, clearer than from Artemy Lebedev, but by the end of the article we will make it even better.

Now let’s learn how to build a route between two stations. Dijkstra’s algorithm can be used for this. Thousands of articles have already been written about how it works (for example, tyk), so we won’t dwell on it, especially since it has already been implemented in our library.

def calc_time(st1, st2):
  return nx.dijkstra_path_length(G, source=st1, target=st2, weight="length")/60
print(labels) # ... 'nd89811596': 'Римская', 'nd79168908': 'Крестьянская Застава', ...
calc_time("nd89811596", "nd79168908") # 4

As we can see, we received the answer 4. That is how many minutes it takes to travel between these stations. (A request via the site gives 5 minutes on the road, but it may use some kind of information about the number of rolling stock in real time.

Let’s now calculate the desired matrix. The obvious solution would be to consider all possible pairs of stations and try to build a route:

However, this method is terribly inefficient. If A and B are neighboring stations, and C is far away, then building the optimal route A – C, we already know (or almost know) the travel time along B – C, due to the principles of Dijkstra’s algorithm. However, we “forget” about this information and build the route again. Suitable for our task Floyd-Warshall algorithm. (By the way, this is the reason why, even with the use of high-level libraries, it would be good to at least understand how the algorithms work.) NetworkX can also do this out of the box:

res: dict = nx.floyd_warshall(G)
print(res)

Let’s try something else. For example, find the number of stations in a 20-minute neighborhood from a given one. Let’s decide head-on

res = {}
counter = {}
for st1_id, st1_name in labels.items():
  res[f"{st1_id}_{st1_name}"] = []
  counter[f"{st1_id}_{st1_name}"] = 0
  for st2_id, st2_name in labels.items():
    if st1_id != st2_id and calc_time(st1_id, st2_id) < 20:
      res[f"{st1_id}_{st1_name}"].append(f"{st2_id}_{st2_name}")
      counter[f"{st1_id}_{st1_name}"] += 1
      print(st1_name, st2_name)

Now you can try to optimize it. Here the solution is a bit simpler – bypass in width. (you can use nx.bfs_edges() or implement it yourself):

def get_stations_within_distance(G, source_node, distance):
    visited = set()
    queue = [(source_node, 0)]
    while queue:
        node, dist = queue.pop(0)
        visited.add(node)
        if dist < distance:
            neighbors = nx.neighbors(G, node)
            for neighbor in neighbors:
                if neighbor not in visited:
                    queue.append((neighbor, dist + G[node][neighbor]['length']))
    return visited

Now let’s try to visualize the received data. A little computer vision magic and a couple of PIL commands allow you to build this beauty (source):

https://i.postimg.cc/szcs3D75/mcd-scheme-crop.png?dl=1

Here the size of the circle is proportional to the number of other stations in the 20-minute interval. This picture clearly shows the crazy difference between the city center (where you can get to 100+ stations in 20 minutes) and sleeping bags (where there is only one way – to the center. Of course, there will always be some stations with more, and some with less successful transfers, it’s bad when 100% of the hubs are located in the city center

London Underground, for reference

You can also use the data we received to make a more convenient metro map. Using it, it will be possible to estimate travel time simply by glancing at the map, without using mobile applications or interactive screens. What is the idea?

  • On each non-circular ring we write the time in minutes it takes to travel to the roundabout on that line.

  • At each ring station, we write the travel time to some fixed metro station (for example, to Komsomolskaya).

Now, to calculate the route time is quite simple:

  1. Subtract (add if we are passing through the ring) the two indicated times, for simple non-stop routes

  2. Add two numbers at the stations of departure and arrival, as well as add the difference of two transfers on the ring, if we are traveling along the path radius – ring – radius.

In custody

I hope that this article was useful to someone. The use of such non-obvious data sources as an offline mobile application can greatly help researchers from countries that do not actively develop open data. You can find the entire source code in Google Collab, it’s a pretty big mess there, but it will be easier to figure it out if you write everything from scratch. For it to work, you need to download JSON files with the city’s subway.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *