Scraping the web with R



Frame from the cartoon “One pea, two peas”, 1981, Soyuzmultfilm

Raw data collection comes up in many analytics tasks, and the web often acts as the source. The chances of landing on a fully prepared, neatly combed source are close to zero. You always have to do something to get the data out and put it in order. The encouraging part is that if the necessary information is visible in the browser, then one way or another it can be scraped out of there. Worst case, take photos of the screen.

Below are three true stories united by one goal: to get information out of an open source. All the code is written “on a napkin” and is purely illustrative and entertaining.

This is a continuation of a series of previous publications.

We need to pull together the information on paid-out subsidies. Here is a simple and lightweight site. A cursory study shows that the developers did their best but forgot one important thing: the “export to Excel” button. Let us charitably assume they forgot or ran out of time. Looking further: it is JS with server-side logic, and the HTML arrives at the client with the table fragment already laid out. 1366 pages.

What were we taught on the course? Load the page by URL, locate the element by tag, parse the table? That will not work here… We need to emulate events, we need a robot.

Let’s fast-forward and jump straight to the answer.

Preparing the environment

  • Download Selenium 3.x (version 4 does not start yet). Take selenium-server-standalone-xxx from the release archive at selenium-release.storage
  • Install RSelenium from CRAN
  • Download the WebDriver builds for the installed browser versions and put them on the PATH (it’s easier and better to keep them next to the server, since the drivers depend on the browser versions)
  • Start Selenium Server from the command line: java -jar selenium-server-standalone-3.141.59.jar (the session is then opened from R, as sketched below)
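
The scraping loop below refers to this session as remDrv. A minimal connection sketch, assuming the server from the last step is listening on its default port 4444 and Chrome is installed:

library(RSelenium)

# connect to the locally running Selenium Server
# (port 4444 and Chrome are assumptions; adjust to your setup)
remDrv <- remoteDriver(remoteServerAddr = "localhost",
                       port = 4444L,
                       browserName = "chrome")
remDrv$open()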

Looking for a point of attack

Open the developer tools in Chrome.

We make trial requests, look at the responses, and highlight what we need.

library(rvest)

page_url <- "https://subsidies.qoldau.kz/ru/subsidies/recipients?Year=2020"

# the table fragment that is present in the page's HTML
rvest::read_html(page_url) %>%
  html_nodes(xpath = "//*[@class='sw-result-table-container']") %>%
  html_table()

# the pagination links
rvest::read_html(page_url) %>%
  html_nodes(xpath = "//*[@class='page-link' and @aria-label]")

Unleashing the “mechanical hound”

the code

library(tidyverse)
library(RSelenium)
library(rvest)
library(iterators)
library(foreach)

# remDrv is the remoteDriver session opened in "Preparing the environment"
# open the start page
remDrv$navigate("https://subsidies.qoldau.kz/ru/subsidies/recipients?Year=2020")

lst <- foreach(it = iter(1:1366)) %do% {
  # locate the element containing the table
  tab_elem <- remDrv$findElement(using = "xpath", value = "//*[@class='sw-table-content-wrapper']")
  df <- read_html(tab_elem$getElementAttribute('innerHTML')[[1]]) %>% 
    html_table() %>%
    # take the table body
    .[[1]]

  # locate the "next page" element;
  # selecting by the Russian aria-label value works here, so we use it
  next_elem <- remDrv$findElements(using = "xpath", value = "//*[@class='page-link' and @aria-label='Следующая страница']")[[1]]

  remDrv$mouseMoveToLocation(webElement = next_elem)
  next_elem$click()

  df
}

We get a data.frame; the rest is a matter of technique.
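
A minimal sketch of gluing the collected per-page tables together and saving the result; the file name is arbitrary:

# lst is the list of per-page tables returned by the foreach loop above;
# coerce everything to character first in case column types drift between pages
subsidies_df <- lst %>%
  purrr::map(~ mutate(.x, across(everything(), as.character))) %>%
  bind_rows()

readr::write_csv(subsidies_df, "subsidies_2020.csv")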

Popping the champagne

We need to dig into some sociological questions. The German federal election, 2017. After much torment we find an excellent site: a beautiful JS interactive, all the information is there, fireworks and firecrackers. Super! Now let’s quickly wrap things up.

And then a fly lands in the ointment. The interactive is a Leaflet map with details for more than 5,000 objects. The hand holding the robot’s leash quietly slips back into the pocket. You want to step outside and watch the schoolchildren coming home from their lessons. Drink a cup of coffee. And never see this site again, the one that a minute ago looked like a wonderful find.

Attention, the doors are closing. Did we make it onto the train or not? In which reality do we go on? The one where the site was thrown away and we went looking elsewhere? Or the one where we got everything we wanted?

Let’s leave the first branch to the novelists. Maybe everything ended very well there, and that failure led, much later, to a huge success. We take the second branch.

Searching from the bottom

Open the developer tools. Let’s look at the traffic. We click on a city and, yes, a JSON arrives. Containing what? Yes, the election results. Here it is! The catch is in addressing these results: how do we figure out which ID is which?

Searching from the top

We make a second pass, this time from the top. Aha, a tile map… and some JSONs arrive with it. What is inside? Yes, the list of all the points for which detailed information exists… Well then, let’s cross-check the numbers. That’s it, the link is found: these are exactly the IDs for which the detail JSON is requested.

Carrying out the operation

We take out the hammer and the soldering iron: 15 minutes of coding, 1 minute of runtime. The result is on the table.

Collecting election results

library(tidyverse)
library(glue)

# by inspecting https://interaktiv.morgenpost.de/gemeindekarte-bundestagswahl-2017/
# we see the list of tiles being loaded and build it by hand
tiles_df <- tidyr::expand_grid(i = 32:34, j = 20:22) %>%
  mutate(url = glue("https://interaktiv.morgenpost.de/gewinner_btw2017/grid/6-{i}-{j}.json?v=3.0.0"))

# Step 1. collect the list of all towns
loadTile <- function(url){
  resp <- httr::GET(url)
  bind_rows(httr::content(resp)[["data"]])
}

job_df <- tiles_df$url %>%
  purrr::map_dfr(loadTile)

# Step 2. collect the data for each town
loadTown <- function(id){
  resp <-  glue("https://interaktiv.morgenpost.de/",
                "gemeindekarte-bundestagswahl-2017/data/",
                "jsons/{id}.json") %>%
    httr::GET()
  bind_rows(httr::content(resp)) %>%
    mutate(AGS = id)
}

# for the demo we take only the first 10 towns; drop the slice() for a full run
town_df <- job_df %>%
  slice(1:10) %>%
  pull(AGS) %>%
  purrr::map_dfr(loadTown)

An unexpected third scenario: read the site’s content, in German, all the way to the end. At the bottom there are links to the data sources used to build the site. Follow the links and pick up the spreadsheets. And find out a little later that those spreadsheets do not account for all the updates to the administrative divisions.

The National Electronic Library, the “book monuments” collection. The book “Explanation on the Apocalypse”, 1625. Naturally, you can only look at a digital copy; you will not find the original in bookstores, not even second-hand ones. A unique opportunity!

It is needed for work. The only problem is that relying solely on reading from the screen becomes very painful after a while. There are no bookmarks. It is impossible to print anything properly: the entire page gets squeezed into a vertical strip. Photograph the screen, print the shots and tape them together? For every page you need? After a series of experiments it becomes clear that copying the text out by hand would be just as productive. The trouble is that even a large 34″ screen rotated vertically does not really help: it is almost impossible to take in the whole page at close range.

Saving the image does not work either: only a low-resolution preview gets saved, and resolution is critical for reading the text. A cursory look makes it clear that this is a tile set and the page as such does not really exist; there are just assorted fragments that the browser assembles together.

Are we stuck?

We open the developer tools in Chrome and start studying the network exchange. Looking at a few pages gives a rough understanding of the internal mechanics and of how the tiles are assembled into a page: different zoom level, different grid.
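
Before stitching anything, it helps to probe how far the tile grid extends at a given zoom level. A small sketch, assuming the {base_url}/{zoom}/{x}_{y}.jpeg URL pattern used in the script below; a missing tile simply returns a non-200 status (if the server refuses HEAD requests, swap in GET):

library(tidyverse)
library(glue)
library(httr)

base_url <- "https://kp.rusneb.ru/tiles/5fd08afffc8ed229eaf02309_files"

# check which tiles exist at zoom level 11 without downloading them
probe_df <- expand_grid(y = 0:15, x = 0:15) %>%
  mutate(url = glue("{base_url}/11/{x}_{y}.jpeg"),
         status = map_int(url, ~ status_code(HEAD(.x))),
         exists = status == 200L)

# the grid extent is the largest x and y that still return a tile
probe_df %>%
  filter(exists) %>%
  summarise(max_x = max(x), max_y = max(y))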

Time to pick up the tools. The main trick is stitching the tiles into a single picture with ImageMagick. A minute of runtime and we have the pages we need, at maximum resolution, each as a single graphics file.

scrap_one_page.R

library(tidyverse)
library(magrittr)
library(httr)
library(rvest)
library(stringi)
library(glue)
library(jsonlite)
library(furrr)
library(magick)

n_cores <- parallel::detectCores() - 1

# directories of different types contain 256x256 tiles at different zoom levels
# 10 -> 11 -> 12 (max)
# base url of the page
base_url <- "https://kp.rusneb.ru/tiles/5fd08afffc8ed229eaf02309_files"

# 1. generate a tentative, oversized grid; missing tiles are dropped later
grid_df <- expand_grid(y = 0:12, x = 0:12) %>%
  mutate(img_name = glue("{x}_{y}.jpeg"),
         url = glue("{base_url}/11/{img_name}"),
         fname = here::here("page", img_name)
  )

# 2. download the tiles; switch plan() to multisession for a parallel run
# plan(multisession, workers = n_cores)
plan(sequential)
processTile <- function(url){
  purrr::possibly(image_read, otherwise = NULL)(url)
}

img_lst <- grid_df %$%
  future_map(url, processTile)
plan(sequential)

tiles_df <- grid_df %>%
  mutate(img = !!img_lst) %>%
  drop_na(img)

# 3. stitch the tiles together (montage)
image_obj <- purrr::lift_dl(c)(tiles_df$img)

tile_str <- tiles_df %$%
  glue("{n_distinct(x)}x{n_distinct(y)}")

res <- image_montage(image_obj, geometry = "256x256+0+0", tile = tile_str, 
                     bg = 'black', gravity = "North")

# 4. save the assembled page
image_write(res, path = here::here("page", "page.jpg"), format = "jpg")
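
A possible follow-up: once several pages have been assembled this way, magick can bundle them into one multi-page PDF for printing. A sketch with illustrative file names (writing PDF requires an ImageMagick build with PDF support):

library(magick)

# read the stitched pages (the second path is hypothetical) and write them as one PDF
pages <- image_read(c(here::here("page", "page.jpg"),
                      here::here("page2", "page.jpg")))
image_write(pages, path = here::here("book_excerpt.pdf"), format = "pdf")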

The previous post is “Refactoring Shiny Applications”.
