Where to open a store? A Journey to Geospatial Analysis and Back

  • Evaluate existing locations in terms of population density, competition, and market volume; find new locations for opening or relocating a business;

  • Use features of customer proximity to the business/competitors in customer analytics to predict churn and response to mailings/offline advertising.

Today I’ll tell you more about location assessment. I did all the work in R.

The problem is classic: we are looking for a place to open a new store (or we want to close or relocate an old one). We don’t want to pay for third-party services (or almost don’t). There is an analyst on staff, so he can handle it.

The whole process can be broken down into several simple steps:

  1. Collection of population density data;

  2. Collection of data on the business/competitors;

  3. Calculation of distances between the business, competitors, and the population;

  4. Obtaining metrics to assess the quality of a location, comparing them, and selecting the most interesting ones.

Of course, it can all be done quickly and simply, but there are many nuances here, so for me it turned out long and painful. Let's go through the stages in more detail.

Collection of population density data

At this step, we collect houses with coordinates and household counts for each city. Initially I worked with a wide geography, i.e. I couldn't simply analyze one city and call it a day, so I decided to analyze whole regions and republics at once. Across that wide geography, settlements of 5–10 thousand people dominated by the private sector were also of interest, and the private sector is classically represented in GIS worse than apartment buildings. In short, a unified approach to finding detailed information was needed.

When starting on the problem, I was lucky to come across a “Beer vs Coffee” report pointing to an excellent source with household counts for specific apartment buildings. Naturally, that register does not contain all houses, and the private sector is absent entirely, so I merged it with the FIAS register, assigning one household to every house from FIAS that had no household count. Yes, a school in the neighborhood will also receive one household, but the error is small, and in return we get information on the private sector. We have a large country, and there are cities where the private sector is everything. I excluded vegetable gardens, dachas, and garages, since they were of no value to me, as well as settlements with up to 300 households, although in general you can take those into account too. So we have a set of addresses to geocode.
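The merge itself is a couple of lines. A sketch, assuming hypothetical tables fias (the full address register) and households_register (addresses with household counts), joined on a normalized address key:

library(data.table)

# fias: full address register; households_register: addresses with counts
merged <- merge(fias, households_register[, .(address, quarters_count)],
                by = "address", all.x = TRUE)

# Houses absent from the household register (mostly the private sector)
# get one household each; a school gets one too, but the error is small
merged[is.na(quarters_count), quarters_count := 1]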

Next, for each address we need latitude and longitude, and here the first serious problem awaits. If you use Open Street Map, geocoding in Moscow or St. Petersburg is more or less acceptable, but the further from the center, the sadder it gets: OSM covers only about 30-35% of households out of the box, which, you will agree, is not enough. For example, geocoding the Sverdlovsk region means obtaining data for 568 thousand addresses, or 2 million households. In 2021 the problem was nicely solved via the API here, which allowed pumping out 250 thousand addresses per day, but in March 2022 the company left Russia, and I had to look for an alternative. Of the current geocoders, this one performs best, but it allows 10 thousand free requests per day, with paid plans starting from 50 thousand. If you process only addresses with 2 or more households, the quality of geocoding rises to 85-90% for large cities; collecting the private sector adds another 5-7%. The final touch is manual geocoding of addresses with 50 or more households: a little labor-intensive, about 40 minutes per region, but the geocoding quality becomes more than acceptable, with the exception of certain cities where, apparently, nothing will help.

Quality of geocoding of the Sverdlovsk region (top of the table shown).
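For reference, a minimal batch-geocoding sketch through OSM/Nominatim using the tidygeocoder package (my own addition for illustration; the paid services mentioned above are not shown, and Nominatim is the free but weak baseline discussed):

library(tidygeocoder)
library(data.table)

# A toy address table; in reality this is the merged FIAS register
addr_dt <- data.table(
  address = "Свердловская область, г. Каменск-Уральский, ул. Ленина, 1")

# method = "osm" is Nominatim: free, throttled to about 1 request per second,
# and noticeably weaker the further you get from the largest cities
geocoded <- geocode(addr_dt, address = address, method = "osm")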

Collection of data on business/competitors

A simpler stage from the point of view of preparing registries or collecting data. By analogy with the houses, we get the coordinates of our stores and of competitors' stores.

There is a rather conceptual problem here: who counts as a competitor and who doesn't? Do we need to take the entire market into account, or is it enough to take the most similar formats? I consider 2 layers separately: direct competitors, and the entire market in the format. More complex but (seemingly) more correct solutions build business proximity within Reilly or Huff models, which can take into account brand strength, assortment, floor area, attractiveness, and distance to competitors. But so far I have not had occasion to put them into practice (if readers have experience, please share).
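For reference, the core of the Huff model fits in a few lines. A sketch with made-up attractiveness values, purely as an illustration rather than anything used in this project:

# Huff model: P_ij = (A_j / d_ij^beta) / sum_k(A_k / d_ik^beta),
# where A is attractiveness (brand, area, assortment), d is distance,
# and beta is the distance-decay exponent
huff_probability <- function(attractiveness, distance, beta = 2) {
  utility <- attractiveness / distance^beta
  utility / sum(utility)
}

# A household 500 m from a strong brand and 300 m from a weak one
huff_probability(attractiveness = c(10, 4), distance = c(500, 300))
# ~0.47 vs ~0.53: proximity almost offsets the brand advantage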

Calculation of distances between business, competitors and population

Euclidean distances are quick and easy to compute. However, the country is full of seas, fields, and rivers, so even in small cities we run into water barriers, railway tracks, industrial zones, cemeteries, etc. We have to find a way to calculate distances along roads. This is perhaps the most difficult part of the study, because a new entity appears: the road graph.

In R, I quickly found the osmdata library for obtaining road coordinates, as well as sf for Simple Features objects, the form in which geography is collected in the first place. The following sources were a great help: here, here, and here.

Before I explain how it is all organized, let's take a research problem as an example. Imagine that we are a small chain of pizzerias in Yekaterinburg. We plan a pilot opening of a pizzeria outside the city in order to build a strategy for moving into small towns. We chose the 2 closest cities: Kamensk-Uralsky and Novouralsk. Even if the owner of the company likes both, he is not yet ready to move into other cities (businessmen have a right to their whims).

First we need to get a road graph for further processing. But before that, there is again a conceptual question: what geography do we want to analyze, the city within its territorial border, or nearby settlements as well? Here I try to rely on the notion of “trading zones”. A trading zone is the radius around a store or business in which the bulk of its customers is concentrated, and there may be several such zones. Very simplified: I use the primary and secondary trading areas as the distances within which 50% and 80% of all orders, respectively, are generated. For a more scientific and consistent approach to defining trading zones, see here (an excellent article on geospatial analysis overall).
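Under these definitions, the zone radii are simply distance quantiles. A sketch with made-up order distances:

# Road distances (in meters) from the store to a sample of orders
order_dist <- c(120, 250, 400, 480, 550, 700, 900, 1300, 1800, 2600)

# Primary zone: radius generating 50% of orders; secondary zone: 80%
primary_zone <- quantile(order_dist, 0.5)
secondary_zone <- quantile(order_dist, 0.8)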

It turns out that if we use the concept of trading zones, we need to analyze not only the city itself but also its surroundings. Therefore, to collect the roads in the geography we need, we take the city's geography with a margin. Suppose a separate study found that a pizzeria receives 80% of its turnover from customers living within a radius of 2 km; then we add 2 km to the city boundaries. I define the city boundary as a square (through the outermost addresses on each side), passing 4 coordinates to the bbox list with 2 km added on each side. We get a square frame in which we will look for roads; the bbox is passed to the OSM request, and we get the roads. One caveat: geocoding occasionally places an address incorrectly, so instead of the most extreme values I take the 1st and 99th percentiles of latitude and longitude. This way the frame will not get inflated in either direction, while the 2 km added beyond the percentile boundaries compensates for the 1% of geography thrown out on each side.
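The trimming itself is a couple of quantile calls. A sketch over the houses table loaded in the code below (the sample files already have it applied):

# Trim the 1% of geocoding outliers on each side before building the bbox
houses[, `:=`(lat_home = as.numeric(lat_home),
              lon_home = as.numeric(lon_home))]
lat_lims <- quantile(houses$lat_home, c(0.01, 0.99))
lon_lims <- quantile(houses$lon_home, c(0.01, 0.99))
houses <- houses[lat_home %between% lat_lims &
                   lon_home %between% lon_lims]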

To make things clearer, I collected the already geocoded addresses into parquet files, separately for the 2 cities within their bboxes, having already trimmed the extreme address values on each side. The source data can be obtained here.

library(data.table)
library(magrittr)
library(dplyr)
library(tidyr)
library(sf)
library(osmdata)
library(tidygraph)
library(sfnetworks)
library(ggplot2)

# Load coordinates with household counts
houses <- arrow::read_parquet("./Каменск-Уральский_homes.parquet") %>%
  data.table()

# define the coordinate frame (bbox) for loading the road graph
bbox_city <- st_bbox(
  c(xmin = round(min(as.numeric(houses$lon_home)) - 2000/111000,3),
    xmax = round(max(as.numeric(houses$lon_home)) + 2000/111000,3),
    ymin = round(min(as.numeric(houses$lat_home)) - 2000/111000,3),
    ymax = round(max(as.numeric(houses$lat_home)) + 2000/111000,3)
  ),
  crs = st_crs(4326)
)

# Road types to parse from OSM
# https://wiki.openstreetmap.org/wiki/Key:highway
road_type_for_parsing = c(
  "primary",
  "motorway",
  "unclassified",
  "trunk",
  "tertiary",
  "secondary",
  "residential",
  "living_street",
  "motorway_link",
  "trunk_link",
  "primary_link",
  "secondary_link",
  "tertiary_link",
  "footway"
)

# Roads inside the bbox (as an sf object)
road_osm <- opq(bbox_city) %>%
  add_osm_feature(key = 'highway',
                  value = road_type_for_parsing) %>% 
  osmdata_sf() %>%
  .$osm_lines %>% # keep only the line segments
  st_as_sf(crs = 4326) %>%
  select(osm_id, highway)

In essence, a road network for us is just a set of roads made up of segments: for each road we have a line whose geometry variable holds the ordered set of coordinates it consists of. As a result, we get the following:

ggplot(road_osm) +
  geom_sf() +
  ggtitle("Граф дорог г.Каменск-Уральский")

You can see that a lot of unnecessary stuff gets loaded. For example, some roads extend beyond the designated frame, and some lie outside the main connected network or lead out of the frame. Such a road network is not very convenient to analyze.

Next, I enthusiastically went through many ways of putting the network in order, borrowing the techniques I needed from various sources.

I would also like to draw your attention to the sfnetworks library. It was literally created for these tasks: take a road graph, comb it into shape, enrich it with data on houses and shops, and calculate distances between different objects on the fly. Before it, this meant a pile of self-written functions and fiddling with graph matching; here it is a comprehensive solution in the tidy data style.

# drop extra roads (outside the largest connected component)
roads_groups <- sf::st_touches(road_osm) %>%
  igraph::graph.adjlist(.) %>%
  igraph::components(.)

roads_table_order <- table(roads_groups$membership) %>%
  .[order(., decreasing = TRUE)]

road_osm_not_comp <- road_osm[
  roads_groups$membership == names(roads_table_order[1]), ]

rm(roads_table_order)


# Cleaning the graph, following:
# https://r-spatial.org/r/2019/09/26/spatial-networks.html

# Road-graph cleaning function
sf_to_tidygraph = function(x, directed = TRUE) {
  # https://r-spatial.org/r/2019/09/26/spatial-networks.html
  # split the graph into segments and extract all the nodes
  # Give edges a unique index
  edges <- x %>%
    mutate(edgeID = c(1:n()))
  
  # Extract beginning and end node for each edge
  nodes <- edges %>%
    st_coordinates() %>%
    as_tibble() %>%
    rename(edgeID = L1) %>%
    group_by(edgeID) %>%
    slice(c(1, n())) %>%
    ungroup() %>%
    mutate(start_end = rep(c('start', 'end'), times = n()/2)) %>%
    mutate(xy = paste(.$X, .$Y)) %>%
    mutate(xy = factor(xy, levels = unique(xy))) %>%
    group_by(xy) %>%
    mutate(nodeID = cur_group_id()) %>%
    ungroup() %>%
    select(-xy)
  
  source_nodes <- nodes %>%
    filter(start_end == 'start') %>%
    pull(nodeID)
  
  target_nodes <- nodes %>%
    filter(start_end == 'end') %>%
    pull(nodeID)
  
  # Specify for each edge in which node it starts and in which node it ends
  edges <- edges %>%
    mutate(from = source_nodes, to = target_nodes)
  
  # Remove duplicate nodes and convert to sf object with correct CRS
  nodes <- nodes %>%
    distinct(nodeID, .keep_all = TRUE) %>%
    select(-c(edgeID, start_end)) %>%
    st_as_sf(coords = c('X', 'Y')) %>%
    st_set_crs(st_crs(edges))
  
  # Convert to tbl_graph object
  tbl_graph(nodes = nodes, edges = as_tibble(edges), directed = directed)
  
}

graph <- sf_to_tidygraph(road_osm_not_comp, directed = FALSE)

# Transform the graph via sfnetworks
# https://cran.r-project.org/web/packages/sfnetworks/vignettes/sfn03_join_filter.html
graph <- sfnetworks::as_sfnetwork(graph) %>%
  st_transform(crs = 4326)

graph = convert(graph, to_spatial_subdivision)

# full cleanup: drop multiple edges and isolated nodes
graph <- graph %>%
  activate("edges") %>%
  filter(!edge_is_multiple()) %>%
  activate("nodes") %>%
  filter(!node_is_isolated())

# cut out everything beyond the bbox frame
graph <-  st_filter(graph , bbox_city  %>% 
                      st_as_sfc() %>%  st_cast("POLYGON")  %>%
                      st_sfc(crs = 4326))

# isolated routes may appear; fix that
graph <-   graph %>%
  activate("nodes") %>%
  filter(group_components() == 1)

At the output we get a cleaner, connected road network.

ggplot() +
  geom_sf(data = st_as_sf(graph, "edges")) +
  ggtitle("Граф дорог г.Каменск-Уральский (очищенный)")

The next step is to find, from each house, a tangent to the road graph and add it to the graph: the tangent becomes part of the graph, and we record the link graph – tangent – house. You can read about the general concept here.

The process of finding tangents and merging them into the graph is expensive. For simplicity I will omit the optimization stage here; I will only say that you can search for tangents locally rather than for all houses at once. Done head-on (as in the code below), merging houses into the network computes 20 to 30 times slower.

# House coordinates
homes_df <- houses %>%
  .[order(lat_home,lon_home)]

# For the houses, find network segments for blending

homes_sf <- homes_df %>%
  unique(.,by = c('lat_home','lon_home')) %>%
  st_as_sf(., coords = c("lon_home","lat_home"),crs = 4326)


blended <- graph %>%
  activate("nodes") %>%
  st_network_blend(., homes_sf)

# get house ids and coordinates
blended_sf <- blended %>%
  activate("nodes") %>%
  st_as_sf()


homes_graph_point <- data.table(
  home_point_index = blended_sf$home_point_index,
  st_coordinates(blended_sf)) %>%
  .[!is.na(home_point_index)]


# Merge with the main road graph
blended_graph <-  st_network_join(graph,
                                  blended %>% 
                                    activate('nodes') %>%
                                    select(c(geometry))
                                 )

We also save the house-to-tangent distances in a separate data.table so that the distance calculations take them into account (given how close houses sit to the graph, I didn't bother and computed these as Euclidean). I cut off houses more than 1 km from the road network: they belong rather to neighboring areas or to structures that do not affect the overall analysis. I save the house-to-graph distances so as not to recalculate them every time.

# ----- (Distance calculation code) -----
# quick cutoff of houses far from the road graph
# Euclidean distance from each house to the nearest point of the road graph
cjdt <- function(a,b){
  cj = CJ(1:nrow(a),1:nrow(b))
  cbind(a[cj[[1]],],b[cj[[2]],])
}

# Pairwise distances from houses to points on the graph
blended_point <- blended_graph %>% 
  activate(nodes) %>%
  st_coordinates() %>%
  data.table()

# treat 1 degree as 111 km (multiplier 111000)
dist_matrix_home_to_road <- cjdt(blended_point[,.(lat_road = Y, lon_road = X)],
                    homes_df[,.(lat_home,lon_home,home_point_index)]) %>%
  .[, dist := sqrt( (lat_home - lat_road)^2 + (lon_home - lon_road)^2) * 111000] %>%
  .[dist<1000] %>% # cut off houses far from the graph
  .[order(dist)] %>% 
  unique(., by = c('home_point_index'))
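Since this table does not change between runs, I simply cache it on disk. A sketch (the file name here is made up):

# Cache house-to-graph distances so they are not recalculated every run
arrow::write_parquet(dist_matrix_home_to_road,
                     "./Каменск-Уральский_home_to_road.parquet")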

Similarly, we will later calculate the distances from roads to pizzerias. I.e. the distance between the business and a household = the distance from the business to the road graph + the distance along the road graph + the distance from the graph to the house.

And another little heuristic: we are looking for where to open, but how do we search? The people who handle rental relations in our pizzeria chain have their own approach: “Here are 3 addresses where we found premises. Rate them!” Let's instead try to evaluate the whole picture, so that point estimates don't scare us: we put a dense grid of points on the map across the entire bbox (for example, every 200 meters). Then we can use the scattered points as candidate locations when calculating an opening scenario. We perform the same operation as with the houses, only for a limited number of points at a time (for example, 5000), so as not to blow out the RAM.

# Compute the grid point coordinates (by analogy with the houses)
potencial_points <- expand.grid(
  lon_point = seq(from = bbox_city[1], to = bbox_city[3], by = 200/111000),
  lat_point = seq(from = bbox_city[2], to = bbox_city[4], by = 200/111000)) %>%
  data.table() %>%
  .[,point_ind := seq(.N)]%>%
  .[order(lat_point,lon_point)]%>%
  .[,iteration := seq(.N) %/% 5000]
# the iterations are needed later so as not to overload RAM in large cities


# For the potential points, iteration by iteration, find network segments for blending
graph_point <- data.table() # stores the link: potential point index - point on the graph

for(iter in c(0:max(potencial_points$iteration))){
  potencial_points_sf <- potencial_points[iteration == iter] %>%
    unique(.,by = c('lat_point','lon_point')) %>%
    st_as_sf(., coords = c("lon_point","lat_point"),crs = 4326)
  
  min_lat <- min(potencial_points[iteration == iter]$lat_point) - 2000/111000
  max_lat <- max(potencial_points[iteration == iter]$lat_point) + 2000/111000
  min_lon <- min(potencial_points[iteration == iter]$lon_point) - 2000/111000
  max_lon <- max(potencial_points[iteration == iter]$lon_point) + 2000/111000
  
  blended <- graph %>%
    activate("nodes") %>%
    filter(node_Y() >= min_lat & node_Y() <= max_lat &
           node_X() >= min_lon & node_X() <= max_lon) %>%
    st_network_blend(., potencial_points_sf)
  
  # get the potential point ids and coordinates
  blended_sf <- blended %>%
    activate("nodes") %>%
    st_as_sf()
  
  
  blended_df <- data.table(
    point_ind = blended_sf$point_ind,
    st_coordinates(blended_sf)
  ) %>%
    .[!is.na(point_ind)]
  
  graph_point <- rbindlist(list(
    graph_point,blended_df
  ))
  
  # Merge with the main graph
  blended_graph <-  st_network_join(blended_graph,
                                    blended %>% activate('nodes') %>%
                                      select(c(geometry))
  )
}

# quick cutoff of potential points far from the road graph
# Pairwise distances from the potential points to points on the graph
potencial_points[,iteration := seq(.N) %/% 100]

blended_point <- blended_graph %>% 
  activate(nodes) %>%
  st_coordinates() %>%
  data.table()

dist_matrix_point_to_road <- data.table()

for(i in c(0:max(potencial_points$iteration))){
  time_matrix <- cjdt(blended_point[,.(lat_road = Y, lon_road = X)],
                      potencial_points[iteration == i ,.(lat_point,lon_point,point_ind)]) %>%
    .[, dist := sqrt( (lat_point - lat_road)^2 + (lon_point - lon_road)^2)*111000 ]%>%
    .[dist<1000] %>% 
    .[order(dist)] %>% 
    unique(., by = c('point_ind'))
  
  dist_matrix_point_to_road <-  rbindlist(
    list(
      dist_matrix_point_to_road,
      time_matrix
    )
  )
}

# Apply a morpher so that interior (non-endpoint) points become real nodes. Details:
# https://cran.r-project.org/web/packages/sfnetworks/vignettes/sfn05_morphers.html
blended_graph = convert(blended_graph, to_spatial_subdivision)

So, at this stage we have combined roads, houses, and candidate location points into one graph. These are standard artifacts that we don't have to recalculate for every business.

Now for each potential point we will find a list of houses that lie close to it (within the trading zone). An example of highly unoptimized code:

# Trading zone (maximum allowed distance to the business)
max_distance_to_point <- 2000

# Add household counts to the house-to-road-graph distance table
dist_matrix_home_to_road[homes_df[,.(quarters_count = sum(quarters_count)),
                                 by = c('home_point_index')],
             `:=`(quarters_count = i.quarters_count),
             on = c("home_point_index")]

# Globally compute, for every graph node, the distance
# to every node where a potential point's tangent joins the graph
list_coord_road <- st_as_sf(x =  dist_matrix_point_to_road,                         
                              coords = c("lon_road", "lat_road"),crs = 4326)


# (entry nodes into the graph for the potential opening points)
dist_table <-  st_network_cost(blended_graph,
                               from = list_coord_road,
                               direction = "all") %>%
  t() %>%
  data.table() %>%
  .[,point_ind := seq(.N)] %>%
  melt(.,id.vars = c('point_ind') ) %>%
  .[,distance_road := as.numeric(value)] %>%
  .[distance_road <= max_distance_to_point] %>%
  .[,variable := as.integer(gsub('V','',variable))] 

# Attach the tangent-point coordinates to the road distances
road_coord <- blended_graph %>%
  activate('nodes') %>%
  st_geometry() %>%
  st_coordinates() %>%
  data.table() %>%
  .[,road_point_ind := seq(.N)]

dist_table[road_coord,
           `:=`(road_point_lat = i.Y,
                road_point_lon = i.X),
           on = .(point_ind = road_point_ind)]

dist_table[,c('value','point_ind') := NULL]

# for each house (by coordinates), assemble its set of potential points
homes_point_dist <- merge.data.table(x = dist_matrix_home_to_road,
                                     y = dist_table,
                                     by.x = c('lat_road','lon_road'),
                                     by.y = c('road_point_lat','road_point_lon'),
                                     all.x = T,
                                     all.y = F,
                                     allow.cartesian = T) %>%
  .[!is.na(distance_road)] %>%
  .[(distance_road + dist) <= max_distance_to_point] %>%
  .[,.(home_road_point_lat = lat_road,
       home_road_point_lon = lon_road,
       home_lat = lat_home,
       home_lon = lon_home,
       home_point_index,
       distance_home_to_road = round(dist,0),
       quarters_count,
       mag_ind = variable,
       distance_road = round(distance_road,0)
  )]

# Add the potential point's details
# (for convenient downstream presentation of the data)
homes_point_dist[
  dist_matrix_point_to_road[,
                .(business_type="new_point",
                  mag_name="",
                  mag_type="new_point",
                  mag_ind = seq(.N),
                  mag_road_point_lat = lat_road,
                  mag_road_point_lon = lon_road,
                  mag_lon = lon_point,
                  mag_lat = lat_point,
                  mag_dist_to_road = round(dist,0),
                  mag_address="")],
  `:=`(business_type = i.business_type,
       mag_name = i.mag_name,
       mag_type = i.mag_type,
       mag_lon = i.mag_lon,
       mag_lat = i.mag_lat,
       mag_road_point_lat = i.mag_road_point_lat,
       mag_road_point_lon = i.mag_road_point_lon,
       mag_dist_to_road = i.mag_dist_to_road,
       mag_address = i.mag_address
  ),
  on = .(mag_ind = mag_ind)
]

# keep everything within the trading zone, summing all distance components:
# house to graph, along the graph, graph to the new store
homes_point_dist = homes_point_dist[(distance_road + 
                                    distance_home_to_road +
                                    mag_dist_to_road) <= max_distance_to_point]

At the start we have 6 thousand houses and 5 thousand points to assess. After filtering, 1.4 million potential point – house coordinate pairs remain; everything else lies outside the trading zones.

Let us estimate the household density in the trading area of each point. Here we don't have to display every potential opening point; we can cut them off at some threshold, for example from 500 households within the radius.

library(tmap)

# keep only potential locations with 500+ households in the trading zone
point_quarters_density <-  homes_point_dist %>%
  .[,.(quarters = sum(quarters_count)),by = c("mag_lon","mag_lat")] %>%
  .[quarters > 500] %>%
  st_as_sf(., coords = c("mag_lon","mag_lat"),crs = 4326)

tmap_mode('view')
tm_shape(point_quarters_density) +
  tm_dots(col = "quarters",size = 0.06) +
  tmap_options(basemaps="OpenStreetMap")

Two main areas are clearly visible where high population density is expected within a 2 km radius. So far the picture is not particularly useful: we just see two large residential areas, and yes, some radii will contain more than 30 thousand inhabitants. But competitors are not evenly distributed. It's time to add them.

In total, I found 4 pizzerias, which is nothing by big-city standards. Let's add them to the main structure. I will leave aside the point that we could analyze competitors across the entire bbox frame; that is an extremely useful practice, but not for this mini example.

# Load the existing businesses
magazine_dt <- fread('./Каменск-Уральский_пиццерии.csv',
                     integer64 = 'character') %>%
  .[,.(name_business = `Наименование`,
       lon = `Lon`,
       lat = `Lat`)]

Now we need to add the store coordinates to the graph. In practice I became convinced that at this stage there is no point torturing the constructed graph any further: we simply find the nearest point on the graph by Euclidean distance (there will be no difference, since most of the houses are geocoded and the tangents are already part of the graph).

# Index the stores
magazine_dt[,magazine_ind := seq(.N)]

# find the nearest graph node and use it as the entry point
# (Euclidean distance)
dist_matrix <- cjdt(magazine_dt[,.(magazine_ind,lat_b = lat, lon_b = lon)],
                    blended_graph %>%
                      activate(nodes) %>%
                      st_coordinates() %>%
                      data.table() %>%
                      .[,.(lat_road = Y,lon_road = X)]) %>%
  .[, dist := sqrt( (lat_road - lat_b)^2 + (lon_road - lon_b)^2)*111000]%>%
  .[order(dist)] %>% 
  unique(., by = c('magazine_ind'))

# Add the distances to the graph
magazine_dt[dist_matrix,
            `:=`(lat_road = i.lat_road,
                 lon_road = i.lon_road,
                 dist_to_road = i.dist),
            on = .(lat = lat_b, lon = lon_b)]
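As an aside, the same snap can be done without the pairwise cross join: sf has st_nearest_feature(), which returns, for each store, the index of the nearest graph node. A sketch equivalent to the code above:

# Alternative snap: nearest graph node for each store via sf
stores_sf <- st_as_sf(magazine_dt, coords = c("lon", "lat"), crs = 4326)
nodes_sf <- blended_graph %>% activate("nodes") %>% st_as_sf()

nearest_idx <- st_nearest_feature(stores_sf, nodes_sf)
dist_to_node <- st_distance(stores_sf, nodes_sf[nearest_idx, ],
                            by_element = TRUE)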

And by analogy with the grid of points, for each store we find the set of houses within its trading zone.

# For each business, compute the road distance to the house points
stores_list_coord <- st_as_sf(x = magazine_dt,                         
                              coords = c("lon_road", "lat_road"),crs = 4326)

# store-to-house distance computation
dist_table <-  st_network_cost(blended_graph, 
                               from = stores_list_coord,direction = "all") %>%
  t() %>%
  data.table() %>%
  .[,point_ind := seq(.N)] %>%
  melt(.,id.vars = c('point_ind') ) %>%
  .[,distance_road := as.numeric(value)] %>%
  .[distance_road <= max_distance_to_point] %>%
  .[,variable := as.integer(gsub('V','',variable))]

# Attach the entry-point coordinates
road_coord <- blended_graph %>%
  activate('nodes') %>%
  st_geometry() %>%
  st_coordinates() %>%
  data.table() %>%
  .[,road_point_ind := seq(.N)]

dist_table[road_coord,
           `:=`(road_point_lat = i.Y,
                road_point_lon = i.X),
           on = .(point_ind = road_point_ind)]

dist_table[,c('value','point_ind') := NULL]

# for each house (by coordinates), find all stores within the allowed trading zone
homes_mag_dist <- merge.data.table(x = dist_matrix_home_to_road,
                                   y = dist_table,
                                   by.x = c('lat_road','lon_road'),
                                   by.y = c('road_point_lat','road_point_lon'),
                                   all.x = T,
                                   all.y = F,
                                   allow.cartesian = T) %>%
  .[!is.na(distance_road)] %>%
  .[(distance_road + dist) <= max_distance_to_point] %>%
  .[,.(home_road_point_lat = lat_road,
       home_road_point_lon = lon_road,
       home_lat = lat_home,
       home_lon = lon_home,
       distance_home_to_road = dist,
       quarters_count,
       mag_ind = variable,
       distance_road
  )]

# Add the business information
homes_mag_dist[
  magazine_dt[,.(
    mag_name = name_business,
    mag_ind = seq(.N),
    mag_road_point_lat = lat_road,
    mag_road_point_lon = lon_road,
    mag_lon = lon,
    mag_lat = lat,
    mag_dist_to_road = dist_to_road)],
  `:=`(
    mag_name = i.mag_name,
    mag_lon = i.mag_lon,
    mag_lat = i.mag_lat,
    mag_road_point_lat = i.mag_road_point_lat,
    mag_road_point_lon = i.mag_road_point_lon,
    mag_dist_to_road = i.mag_dist_to_road
  ),
  on = .(mag_ind = mag_ind)
]

# keep everything within the trading zone, summing all distance components:
# house to graph, along the graph, graph to the store
homes_mag_dist <- homes_mag_dist[(distance_road + 
                                  distance_home_to_road +
                                  mag_dist_to_road) <= max_distance_to_point]

Obtaining metrics to assess the quality of a point, comparing them, and selecting the most interesting locations

Now, for each household, we have the set of pizzerias within its trading zone. We will measure competition as follows: for each household, count the pizzerias in its trading area (i.e. the client will choose from what is close to him); then, for each pizzeria, compute the household-weighted average number of pizzerias in the trading zones it reaches. To make this clearer, here is a simple example of the calculation logic. Let there be 2 pizzerias [piz1, piz2] and 3 households, within whose radii there are [[piz1], [piz1, piz2], [piz1, piz2]]. Then the households' competitor counts are [1, 2, 2]. The competition indicator for the pizzerias is [(1+2+2)/3, (2+2)/2] = [1.66, 2], i.e. pizzeria 2 operates in a more competitive environment. And to understand how many households fall on average to each business, we simply divide coverage by the weighted average competition: [1.8, 1].
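The same toy example in data.table, purely to double-check the logic:

library(data.table)

# 3 households and the pizzerias inside each household's trading zone
toy <- data.table(home = c(1, 2, 2, 3, 3),
                  pizzeria = c("piz1", "piz1", "piz2", "piz1", "piz2"))

# number of competitors per household
toy[, n_business := .N, by = home]

# coverage, weighted average competition, and adjusted coverage per pizzeria
toy[, .(coverage = .N, avg_competition = mean(n_business)), by = pizzeria][
  , adj_coverage := round(coverage / avg_competition, 2)][]
# piz1: coverage 3, competition 1.67, adjusted 1.8
# piz2: coverage 2, competition 2.00, adjusted 1.0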

# add the number of stores for each household
homes_mag_dist[,n_business := .N, 
               by = c('home_lat','home_lon')]

# create a unique name to keep each store distinct
homes_mag_dist[,mag_uniq := paste0(mag_name, ' ', mag_lon, ' ',mag_lat)]

# Aggregate by store
business_aggr <- homes_mag_dist[,.(quarters_count = sum(quarters_count),
                                   avg_magazine_in_trading_area = mean(n_business)),
                                by = c('mag_name','mag_lat','mag_lon')] %>%
  .[, avg_quarters_on_magazine := round(quarters_count 
                                        / avg_magazine_in_trading_area,2)]%>%
  .[order(-quarters_count)] %>%
  .[,.(`Пиццерия` = mag_name,
       `Широта` = mag_lat,
       `Долгота` = mag_lon,
       `Кол-во домох-в в торг.зоне` = quarters_count,
       `Средневзвешенное по домох-вам число конкурентов в торг. зоне` = round(avg_magazine_in_trading_area,2),
       `Охват с учетом конкуренции` = avg_quarters_on_magazine
       )]

Let's look at the level of competition in our data:

It's interesting that Dodo isn't afraid of competition :) It is also clear that one of the pizzerias single-handedly holds its own territory, while the rest seriously divide the market in another part of the city.

mag_quarters_density <-  homes_mag_dist %>%
  .[,.(quarters = sum(quarters_count)),by = c("mag_lon","mag_lat")] %>%
  .[quarters > 500] %>%
  st_as_sf(., coords = c("mag_lon","mag_lat"),crs = 4326)

buffers <- st_buffer(mag_quarters_density, 2000)

tmap_mode('view')
tm_shape(buffers)+
  tm_polygons(alpha = 0.5,
              col = "green")+
  tm_shape(mag_quarters_density) +
  tm_dots(col = "quarters",size = 0.06) +
  tmap_options(basemaps="OpenStreetMap")

Kamensk-Uralsky: pizzerias with designated direct coverage

Someone will say that everything is obvious here: open the map and look. OK, in this case, yes. But what if there are 20 pizzerias in the city, the development is uneven (as it is here), and there are interesting niches? And we can, in general, compare niches across different cities by moving to market volume (more on this a little later), benefiting from seeing the picture at a larger scale.

Let's combine the current competition with the potential opening points. Each potential point is a separate opening scenario for us, and each scenario changes the overall picture of competition: somewhere we open a new pizzeria and partially take the audience away from the existing ones. If we already had a pizzeria open in the city, we would also need to account for cannibalization of our own business. And from the analysis of current competitors we understand whom we will be competing with.

# Set a threshold to drop uninteresting scenarios (in households)
min_people_count <- 1000
# Add the opening scenarios
new_points <- homes_point_dist  %>% 
  .[,mag_uniq := 'new_point'] %>% 
  .[,mag_name := 'new_point']

new_points[,quarters_count_point := sum(quarters_count), by = c('mag_ind')]
new_points <- new_points[quarters_count_point > min_people_count]
new_points[,quarters_count_point:= NULL]

new_points_list <- unique(new_points$mag_ind)

# Compute the scenarios in parallel
library(foreach)
library(doParallel)
registerDoParallel(cl <- makeCluster(3))
new_point_calculate <- foreach(
  i = new_points_list,
  .packages = c("data.table", "magrittr"),
  .combine=function(x,y)rbindlist(list(x,y)),
  .inorder = F,
  .export = c('new_points','homes_mag_dist')
) %dopar% {
  
  rbindlist(list(
    new_points[mag_ind == i] ,
    homes_mag_dist
  ),use.names = T, fill = T) %>%
    .[,n_business := .N, 
      by = c('home_lat','home_lon')] %>%
    .[,.(quarters_count = sum(quarters_count),
         avg_magazine_in_trading_area = round(mean(n_business),3)),
      by = c('mag_name','mag_uniq','mag_lat','mag_lon')] %>%
    .[, avg_quarters_on_magazine := round(quarters_count 
                                          / avg_magazine_in_trading_area,
                                          2)] %>%
    .[order(-quarters_count)] %>%
    .[, point_calculate_lon := new_points[mag_ind == i]$mag_lon[1]]  %>%
    .[, point_calculate_lat := new_points[mag_ind == i]$mag_lat[1]]
  
}
stopCluster(cl)

# Look only at scenarios above a certain threshold
new_points_full <- new_point_calculate[mag_name == 'new_point' &
                                       avg_quarters_on_magazine >= 10000]
new_points_full_sf <- st_as_sf(new_points_full, 
                               coords = c("mag_lon","mag_lat"),crs = 4326)  

Let's display on the map:

tmap_mode('view')
tm_shape(new_points_full_sf) +
  tm_dots(col = "avg_quarters_on_magazine",size = 0.02) +
  tmap_options(basemaps="OpenStreetMap")

Kamensk-Uralsky: attractive zones for location

Overall, there aren't that many scenarios to consider. And even in the part of the city with high competition there are interesting placement scenarios.

We can also estimate the average market size per pizzeria by dividing its competition-adjusted coverage by the total number of households in the city, and then convert the market volume into money to assess the payback of the business at a given point (naturally adding n more variables to the model). In practice, I have found good relationships between competition-adjusted coverage estimates and business turnover. An excellent feature, in a word.
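In code this is a couple of extra columns. A sketch where the average check and order frequency are placeholder numbers:

# Market-size feature: competition-adjusted coverage as a share of the city,
# then converted into money (placeholder economics)
total_households <- homes_df[, sum(quarters_count)]
avg_check <- 700          # placeholder: average order value, rubles
orders_per_month <- 1.5   # placeholder: orders per household per month

business_aggr[, `:=`(
  market_share = round(`Охват с учетом конкуренции` / total_households, 3),
  market_rub_month = `Охват с учетом конкуренции` * avg_check * orders_per_month
)]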

So let's return to the original fictional example. We have 2 cities where we could open. I will not duplicate the code for Novouralsk; I will only attach the initial data on households and coordinates. In Novouralsk the situation with pizzerias is worse: I found only 3 relevant ones. After the analysis, the situation is as follows. Among the top 100 scenarios for Kamensk-Uralsky, the best covers 14695 households adjusted for competitors and the worst 12002; for Novouralsk, 8924 and 6520, respectively. On the one hand, it could have ended there. But there is also a difference in income and in the share of the working population. I dug through gorodrabot and found that as of March 24, 2024, the average salary in Novouralsk was 68,058 rubles versus 59,823 rubles in Kamensk-Uralsky. Weighting by salary does not change the order of the scenarios, although in my practice scenarios from the provinces have often beaten scenarios from regional and republican capitals.
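A quick sanity check of that claim with the numbers above:

# Best scenarios weighted by average salary as a crude income proxy
kamensk <- 14695 * 59823     # ~879 million
novouralsk <- 8924 * 68058   # ~607 million
kamensk > novouralsk         # TRUE: the ranking holds even after weighting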

So we recommend Kamensk-Uralsky and specific trading zones, plus we can highlight whom to compete with. I would be wary of standing next to a federal chain with a strong brand, delivery, and polished business processes; this, too, can be reflected in the model by giving such competitors more weight.

Finally

Of course, everything I showed is just the tip of the iceberg. Frankly, I don't have many positive cases where spatial analysis let me influence a decision on store location; more often things boiled down to rejecting potential rental premises. Spatial analysis has its difficulties:

  • geo requires a lot of resources, so it's easier for large businesses to buy ready-made solutions, which end up cheaper; for medium-sized businesses it's harder, so geo is often replaced by a simpler methodology or implemented in-house;

  • often the customer's subjective assessment gets in the way of a correct assessment of niches. In my practice there was a case when a recommended location met a sharp refusal precisely because of a subjective assessment of the urban environment; a federal chain later took that spot. But this aspect relates to analytics in general;

  • spatial analytics requires confirmation and enrichment, i.e. it is only a source of useful features for building more accurate models, and any model output must be further verified. For example, before the final decision on an opening, it is better to count foot traffic manually using a fixed methodology (pay students to count passers-by on the street).

I have noticed that many federal chains engage in geospatial analytics and clearly understand what market share they will take in a particular territory. To me the conclusion is obvious: this is worth doing. It is an excellent area of analytics for everything related to offline: retail, catering, delivery. It is an expensive but useful building block in a decision-making strategy, with a bonus of many spinoffs: applications to CRM, understanding competition, and analysis of a number of offline channels for attracting traffic.
