Searching for a site's open API, or how to speed up parsing by 10x


The purpose of this article is to describe a step-by-step procedure for finding a site's open API.

The target audience is programmers interested in parsing and in analyzing website vulnerabilities.

In this article, we will walk through finding the API of edadeal.ru, get acquainted with Google's protobuf protocol, and compare the speed of different parsing approaches.

1. Introduction

Parsing (in the context of the article) is an automated process of extracting data from the Internet.

There are two approaches to extracting data from a website's pages:

  1. Extract data from the HTML code of the page (see the sketch after this list)

    Pros – the method is simple and always works, since the page code is always available to the user

    Cons – it can take a long time (several seconds) if part of the data is generated by JavaScript (for example, the data appears only after scrolling the page or pressing a button)

  2. Use the site's API

    Pros – faster than the first method and does not depend on changes in the structure of the HTML page

    Cons – not all sites have an open API

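In its simplest form, approach 1 looks roughly like the sketch below. This is a generic illustration, not part of the original article: the URL and the CSS selector are placeholders, and for pages rendered by JavaScript a browser automation tool such as Selenium would be needed instead (see section 5).

import requests
from bs4 import BeautifulSoup

# Generic illustration of approach 1: download the page's HTML and pull
# data out of it. The URL and the selector below are placeholders.
html = requests.get("https://example.com/catalog").text
soup = BeautifulSoup(html, "html.parser")
for item in soup.select(".product-name"):
    print(item.get_text(strip=True))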

2. Problem statement

The task is to extract product data from the Edadeal website (product name, price, discount amount, store, city, etc.).

3. Solution

1. We open the page we want to parse in the browser.
2. We go through all the requests the site makes; to do this, we use the browser's DevTools (the Network tab).


3. Analyzing the requests
From the request names, we figure out that the one we need is:

https://squark.edadeal.ru/web/search/offers?count=30&locality=moskva&page=1&retailer=5ka

In response to the request, we receive a file (let's call it binary_file.bin). How do we find out the format of this file? The response header content-type: application/x-protobuf from step 3 tells us the format.
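
For reference, the same request can be reproduced in Python to save the raw response and confirm the content type (a small sketch, not part of the original walkthrough):

import requests

url = ("https://squark.edadeal.ru/web/search/offers"
       "?count=30&locality=moskva&page=1&retailer=5ka")
response = requests.get(url, allow_redirects=True)

print(response.headers.get("Content-Type"))  # expected: application/x-protobuf

# save the raw body so it can be inspected with protoc --decode_raw
with open("binary_file.bin", "wb") as f:
    f.write(response.content)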

4. Determine the data structure (the .proto file)
Using the protoc utility (http://google.github.io/proto-lens/installing-protoc.html), convert the encoded file into a human-readable format:

protoc --decode_raw < binary_file.bin

We get a list of messages with numbered fields:

1 {
  1: "e\341_\260\007\177W\202\222O\326\316\233\326\000A"
  2: "\320\242\321\203\320\260\320\273\320\265\321\202\320\275\320\260\321\217 \320\261\321\203\320\274\320\260\320\263\320\260 Familia Plus, 2 \321\201\320\273\320\276\321\217, 12 \321\200\321\203\320\273\320\276\320\275\320\276\320\262, 1 \321\203\320\277."
  3: "https://leonardo.edadeal.io/dyn/cr/catalyst/offers/u4nf6zbkjc3m5lss46ucvxjafm.jpg"
  4: 0x43ad7eb8
  5: 0x4347e666
  7: ";5\332^c\021\021\346\204\237RT\000\020\266\010"
  8: 0x41400000
  9: "\321\210\321\202"
  10: 0x422c0000
  11: "%"
  13: 43
  15: "2022-07-26T00:00:00Z"
  16: "2022-08-01T00:00:00Z"
  19: "A1\005L\332nPg\230\342q\375\031\335\014\336"
  20 {
    1: 0x3f800000
    2: 0x418547ae
    3: "\321\210\321\202"
    4: 1
  }
  21: "\224\331\203\202B\303\021\346\224\031RT\000\020\266\010"
  22: "K3\020\2537{O\271\273\374K\351\376\224\310*"
  22: "\300\336d(\224kL\025\224\300\355\256\247\327R\035"
  22: "\303O:\202\330\262A\326\246\023\307D\314F\303G"
  22: "\210\"\022?\250|L.\272\375\345{\335c,\026"
  22: "=3yP\026\004N\334\267\377\320\036F\326\331\\"
  22: "E\211\000\246e6EI\223\000)\242\3348\216M"
  22: "V#\263\022\367\324H\350\232r\013\010_KX\273"
  23: "\320\232\320\276\320\273\320\270\321\207\320\265\321\201\321\202\320\262\320\276"
  24: 1
}

5. Forming the .proto file
Using the field numbers and their contents from the previous step, we have to guess what each field means (for example, field 3 is a link to the product image).
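
The string fields in the dump are raw UTF-8 bytes shown as octal escapes, and the hexadecimal values are the bit patterns of 32-bit floats. A quick way to check what a particular field holds (a helper sketch, not part of the original article):

import struct

# field 9 from the dump above: octal-escaped UTF-8 bytes
print(b"\321\210\321\202".decode("utf-8"))  # -> "шт" (Russian for "pcs")

# field 4 from the dump above: a 32-bit float stored as 0x43ad7eb8
print(struct.unpack(">f", bytes.fromhex("43ad7eb8"))[0])  # ~346.99, looks like a price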

Through trial and error, we get the following structure:

syntax = "proto2";

message Offers {
  repeated Offer offer = 1;
}

message Offer {
  optional string name = 2;
  optional string image_url = 3;
  optional float price_before = 4;
  optional float price_after = 5;
  optional float amount = 8;
  optional float discount = 10;
  optional string start_date = 15;
  optional string end_date = 16;
}

4. Moving on to writing the code

Generate a Python module with the structure definitions from the .proto file:

protoc --proto_path=proto_files --python_out=proto_structs offers.proto

proto_files — the directory containing the .proto files

proto_structs — the directory where the results (the _pb2.py files) are saved

The code works like this:

  1. Makes a request to the site’s API
  2. Converts the website's response to JSON
  3. Outputs the result

import json
import requests

from google.protobuf.json_format import MessageToJson
from proto_structs import offers_pb2

def parse_page(city="moskva", shop="5ka", page_num=1):
    """
    :param city: location of the shop
    :param shop: shop name
    :param page_num: parsed page number
    :return: None
    """
    url = f"https://squark.edadeal.ru/web/search/offers?count=30&locality={city}&page={page_num}&retailer={shop}"
    data = requests.get(url, allow_redirects=True)  # data.content is a protobuf message

    offers = offers_pb2.Offers()  # protobuf structure
    offers.ParseFromString(data.content)  # parse binary data
    products: str = MessageToJson(offers)  # convert protobuf message to json
    products = json.loads(products)
    print(json.dumps(products, indent=4, ensure_ascii=False,))

if __name__ == "__main__":
    parse_page()

The result of the program is a list of products with their descriptions:

{
    "offer": [
        {
            "name": "Наггетсы, куриные с ветчиной, Мираторг, 300 г",
            "imageUrl": "https://leonardo.edadeal.io/dyn/cr/catalyst/offers/necnmkv43splbm3hr5636snpry.jpg",
            "priceBefore": 218.99000549316406,
            "priceAfter": 109.48999786376953,
            "amount": 300.0,
            "discount": 51.0,
            "startDate": "2022-08-02T00:00:00Z",
            "endDate": "2022-08-08T00:00:00Z"
        },
        ...
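
Converting the message to JSON is convenient for printing, but the parsed protobuf object can also be used directly through the attributes generated from offers.proto. A small sketch (not part of the original article):

import requests

from proto_structs import offers_pb2

url = "https://squark.edadeal.ru/web/search/offers?count=30&locality=moskva&page=1&retailer=5ka"
response = requests.get(url, allow_redirects=True)

offers = offers_pb2.Offers()
offers.ParseFromString(response.content)

# the repeated "offer" field behaves like a list of Offer messages
for offer in offers.offer:
    print(f"{offer.name}: {offer.price_before} -> {offer.price_after}")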

5. Comparing the results

The execution time of the code from the previous section is 0.3–0.4 seconds.
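
The article does not show how this was measured; one way to reproduce the measurement is sketched below (the module name edadeal_parser is our assumption for the file holding parse_page):

import time

# assumes the code from section 4 is saved as edadeal_parser.py
from edadeal_parser import parse_page

start = time.perf_counter()
parse_page()
print(f"API request and parsing took {time.perf_counter() - start:.2f} s")
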
An alternative parsing option is to download the entire HTML code of the page and extract the necessary information from it:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://edadeal.ru/moskva/retailers/5ka")
# extract data from the HTML code
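# (A sketch, not from the original article: one way to finish the extraction
# step. The CSS selector ".offer-card" is hypothetical and must be adjusted
# to the real page markup found via DevTools.)
from selenium.webdriver.common.by import By

for card in driver.find_elements(By.CSS_SELECTOR, ".offer-card"):
    print(card.text)

driver.quit()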

The full page load takes 5–6 seconds.

6. Conclusions

If possible, it is better to use the site's API to retrieve data.
Using the site's API means you do not depend on changes in the page's HTML code.
