How I sped up parsing in Python

The task was the following: implement a parser for the goods of the great international Chinese marketplace with the red logo. Somehow, no options other than Selenium and XPath came to mind at the time, so I wrote a simple implementation on top of the lxml library and Selenium, the engine for testing web applications.

I wanted to do everything properly and beautifully, so I decided to deploy Selenium Grid using this docker-compose file, having first thrown out all browsers from it except Chrome and duplicated the instances so that everything would work even better. While testing my solution, it turned out that in some cases the marketplace asks you to swipe the gray arrow to the right several times when it suspects that your script is not an entirely trustworthy copy of a regular buyer. Another half an hour of tinkering and the script worked in 100 cases out of 100. But here's the problem: one such parsing run took 10-17 seconds, which made me a little sad.
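The setup itself was trivial. A minimal sketch of it, assuming the Grid hub from the docker-compose file listens on the default port 4444; the product URL and the XPath selector here are placeholders:

from lxml import html
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Assumption: the Grid hub from docker-compose is reachable on the default port
driver = webdriver.Remote(
    command_executor='http://localhost:4444/wd/hub',
    options=Options(),
)
try:
    # Placeholder product URL
    driver.get('https://example.com/item/1005001234567890.html')
    tree = html.fromstring(driver.page_source)
    # Hypothetical XPath: the real selector depends on the page layout
    title = tree.xpath('//h1/text()')
finally:
    driver.quit()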

The new solution was not long in coming: I discovered that the marketplace uses SSR and embeds all the JSON at the end of its HTML.
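In practice that means a single plain GET is enough. A minimal sketch; the variable name around the embedded JSON is my assumption, and the real marker has to be looked up in the page source:

import json
import re

import requests

product_url = 'https://example.com/item/1005001234567890.html'  # placeholder

resp = requests.get(product_url, headers={'User-Agent': 'Mozilla/5.0'})
# Hypothetical marker: find the SSR JSON blob near the end of the HTML
match = re.search(r'window\.__INITIAL_STATE__\s*=\s*(\{.*?\});', resp.text, re.S)
data = json.loads(match.group(1)) if match else {}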

“Here it is, the solution,” I thought, and ran off to throw out Selenium and rewrite everything with requests, keeping in mind that I sometimes received a captcha window instead of the coveted page. To save money, I even tried to write a parser that would collect proxies from open sources. I wrote a special class for this, but that solution did not last long: it turned out those proxies were not usable at all. I also wrote an abstraction for working with such deep trees:

import re


class DictSearch:
    def __init__(self, data: dict):
        self.input_item = data

    # Search by key or by value, optionally with partial (substring) matching.
    # Returns the dotted path to the match, or a list of paths if there are several.
    def search(
               self,
               search_value: str,
               search_by_key: bool = False,
               partial: bool = False
    ):
        results = []

        def process(data, path: str):
            if not isinstance(data, dict):
                return
            for key, value in data.items():
                prefix = '.' if path != '' else ''
                if search_by_key and key == search_value:
                    results.append(f'{path}{prefix}{key}')
                if isinstance(value, dict):
                    process(value, f'{path}{prefix}{key}')
                elif isinstance(value, list):
                    for index, item in enumerate(value):
                        process(item, f'{path}{prefix}{key}[{index}]')
                elif partial and isinstance(value, str):
                    if search_value in value:
                        results.append(f'{path}{prefix}{key}')
                elif search_value == value:
                    results.append(f'{path}{prefix}{key}')

        process(self.input_item, '')

        return results[0] if len(results) == 1 else results

    # Resolve a path like value.items[0].name to the value it points to
    def get_value(self, path: str):
        path_iter = path.split('.')

        def process(current, item):
            indices = re.findall(r'\[(\d+)\]', current)
            key = current.split('[')[0]
            if indices:
                return item.get(key)[int(indices[0])]
            return item.get(current)

        res = self.input_item
        for part in path_iter:
            res = process(part, res)
        return res

    # Trim `count` segments from the end of a path
    @staticmethod
    def cut_path(path: str, count: int):
        return '.'.join(path.split('.')[:-count])

I used this class like this:
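A minimal sketch of that usage, with an invented response structure:

data = {'product': {'title': 'Cable', 'skus': [{'price': 2.5}]}}
ds = DictSearch(data)

ds.search('Cable')                        # 'product.title'
ds.search('price', search_by_key=True)    # 'product.skus[0].price'
ds.get_value('product.skus[0].price')     # 2.5
ds.cut_path('product.skus[0].price', 1)   # 'product.skus[0]'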

Advantages of the solution:

  • resource consumption dropped roughly tenfold thanks to throwing out the two Chrome containers and the Selenium Hub container

  • a single request now completed about 5 times faster on average

Minuses:

  • captchas still appeared instead of the coveted page from time to time, so a pool of working proxies was still needed

This solution worked for no more than a week, because a new idea occurred to me. It has already been mentioned in the preface, so I will not repeat it.

First of all, I came across a piece of software called Charles Proxy. In short, the essence of the program is to intercept all HTTP traffic of devices that have specified your computer's IP address as a proxy server (this is done in the network settings). First, I had to install a certificate to be able to listen to HTTPS traffic. With this particular step I had serious problems, which I eventually solved by installing a special Magisk module on my Android device. And even when everything started working (the browser opened all HTTPS resources), half of the applications still refused to "see the Internet". While looking for a solution, I came across a program called HTTP Toolkit. Its principle of operation is similar to the previous one, only its documentation is better. With its help, I managed to see the necessary URLs and their contents. However, root on the device was needed in this case too, in order to install the certificate. There are instructions for Android devices.

My request looked like this:


The URL I managed to get:

And also the headers:

headers = {
    'User-Agent': 'ali-android-13-567-8.20.341.823566',
    'x-aer-client-type': 'android',
    'x-aer-lang': 'en_RU',
    'x-aer-currency': 'RUB',
    'x-aer-ship-to-country': 'RU',
    'x-appkey': 'XXXXXXXX',
    'accept': 'application/json',
    'x-aer-device-id': 'X0XXxX+Xxx0XXX0XxxXXxx0X'
}

Since it was supposed to be a POST request, I sent the following content in the body:

# Here we extract the needed digits from the product detail page URL

body = {
    'productId': re.findall(r'\d+\.html', url)[0].split('.')[0]
}

To this request the marketplace obediently returned proper JSON, from which it was possible to get absolutely everything about the selected product.
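Put together, the whole request looks roughly like this sketch; the endpoint is a placeholder since the real URL stays secret, `headers` is the dict shown above, and the JSON body format is my assumption:

import asyncio
import re

import aiohttp

API_URL = 'https://example.com/api/product/detail'  # placeholder for the real endpoint

async def fetch_product(url: str) -> dict:
    # Extract the product ID from the detail page URL, as above
    body = {'productId': re.findall(r'\d+\.html', url)[0].split('.')[0]}
    # `headers` is the dict shown earlier in the article
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.post(API_URL, json=body) as resp:
            return await resp.json()

product = asyncio.run(fetch_product('https://example.com/item/1005001234567890.html'))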

Thanks to this approach, I got an even greater performance boost, and there were now far fewer captchas, so the quality of the proxies no longer mattered as much and their number could be reduced. The lxml library I threw out entirely, saving a couple of megabytes on the size of the dependencies 🙂

To load pictures, I collected the download coroutines into a list and threw them into asyncio.gather(...), saving another couple of hundred milliseconds on loading data. As seen above, I use the aiohttp library instead of requests because it is faster and asynchronous.
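A sketch of that trick, with a hypothetical helper name:

import asyncio

import aiohttp

# Hypothetical helper: download all product images concurrently
async def download_images(urls: list[str]) -> list[bytes]:
    async with aiohttp.ClientSession() as session:
        async def fetch(url: str) -> bytes:
            async with session.get(url) as resp:
                return await resp.read()
        # gather runs all downloads concurrently instead of one after another
        return await asyncio.gather(*(fetch(u) for u in urls))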

I wrote this article to show that you need to look around when solving a problem, to see things through, and to push yourself toward new solutions whenever something inside tells you that the code is not working perfectly. It is very interesting!
