How I sped up parsing in Python
The task was to implement a parser for product pages on the big international Chinese marketplace with the red logo. At the time, no options other than Selenium and XPath came to mind, so I wrote a simple implementation using the lxml library and Selenium, a browser automation engine for testing web applications.
I wanted to do everything properly, so I deployed Selenium Grid using this docker-compose file, having first removed every browser from it except Chrome and duplicated the Chrome instances so that things would run even better. While testing my solution, I found that the marketplace sometimes asks you to drag a gray arrow to the right several times when it suspects that your script is not quite a genuine buyer. Another half hour of polishing and the script succeeded in 100 cases out of 100. But here is the problem: one such parse took 10-17 seconds, which made me a little sad.
A new solution was not long in coming: I discovered that the marketplace uses SSR (server-side rendering) and dumps all of its JSON at the end of the HTML.
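As a rough illustration of that idea, here is a minimal sketch of pulling an embedded JSON object out of a server-rendered page with only the standard library. The marker name `window.__HYPOTHETICAL_STATE__` and the sample HTML are assumptions made up for the example; the real page uses its own variable name and structure, which you would look up in the page source.

```python
import json


def extract_embedded_json(html: str, marker: str) -> dict:
    """Decode the JSON object that follows the last occurrence of `marker`."""
    start = html.rfind(marker)
    if start == -1:
        raise ValueError(f"marker {marker!r} not found in page")
    brace = html.find("{", start + len(marker))
    if brace == -1:
        raise ValueError("no JSON object found after marker")
    # raw_decode parses exactly one JSON value starting at `brace`
    # and ignores whatever markup follows it.
    data, _end = json.JSONDecoder().raw_decode(html, brace)
    return data


# Hypothetical page: the marker name and payload are invented for this sketch.
sample_html = (
    "<html><body><h1>Item</h1>"
    '<script>window.__HYPOTHETICAL_STATE__ = '
    '{"product": {"id": 123, "title": "Lamp"}};</script>'
    "</body></html>"
)

state = extract_embedded_json(sample_html, "window.__HYPOTHETICAL_STATE__")
print(state["product"]["title"])
```

Using `raw_decode` instead of a regular expression avoids having to guess where the object ends, which matters when the JSON contains nested braces or script text follows it.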
“Here it is, the solution,” I thought. I rushed to throw out Selenium and rewrite everything with requests, keeping in mind that I sometimes received a captcha window instead of the coveted page. To save money, I even tried writing a parser that would collect proxies from open sources. I wrote a special class for this, but the solution did not last long: those proxies turned out to be completely unusable. I also wrote an abstraction for working with such deeply nested trees:
import re


class DictSearch:
    """Search nested dicts/lists and address values by paths like value.items[0].name."""

    def __init__(self, data: dict):
        self.input_item = data

    # Search by key (search_by_key=True) or by value; partial=True also
    # accepts substring matches on string values.
    def search(
        self,
        search_value: str,
        search_by_key: bool = False,
        partial: bool = False
    ):
        results = []

        def process(data, path: str):
            if not isinstance(data, dict):
                return None
            for key, value in data.items():
                # No leading dot at the top level, so get_value() can read the path back.
                prefix = '.' if path != '' else ''
                current = f'{path}{prefix}{key}'
                if search_by_key and key == search_value:
                    results.append(current)
                if isinstance(value, dict):
                    process(value, current)
                elif isinstance(value, list):
                    for index, i in enumerate(value):
                        process(i, f'{current}[{index}]')
                elif not search_by_key:
                    # Record each path once, even if it matches both exactly and partially.
                    if search_value == value or (
                        partial and isinstance(value, str) and search_value in value
                    ):
                        results.append(current)

        process(self.input_item, '')
        # A single match is returned as a string, otherwise a list of paths.
        return results[0] if len(results) == 1 else results

    # Get a value by a path such as value.items[0].name
    def get_value(self, path: str):
        def process(current, item):
            # Split "items[0]" into key "items" and index 0; plain keys have no index.
            indexed = re.fullmatch(r'(.+)\[(\d+)\]', current)
            if indexed:
                return item.get(indexed.group(1))[int(indexed.group(2))]
            return item.get(current)

        res = self.input_item
        for i in path.split('.'):
            res = process(i, res)
        return res

    # Drop the last `count` segments from the end of a path
    @staticmethod
    def cut_path(path: str, count: int):
        parts = path.split('.')
        # Slice by length so that count=0 keeps the whole path.
        return '.'.join(parts[:len(parts) - count])
I used this class like this:
Advantages of the solution:
resource usage dropped roughly tenfold after removing the two Chrome containers and the Selenium/Hub container
the execution time of one such request dropped about 5 times on average
Minuses:
This solution lasted no more than a week, because a new idea occurred to me. It was already mentioned in the preface, so I will not repeat it.
First of all, I came across a program called Charles Proxy. In short, it intercepts all HTTP traffic from devices that have set your computer's IP address as their proxy server (this is done in the network settings). To listen to HTTPS traffic, a certificate had to be installed first. This step caused me serious problems, which I happily solved by installing a special Magisk module on my Android device. Even once everything was running (all HTTPS resources opened in the browser), half of the applications still refused to “see the Internet”. While looking for a solution, I came across a program called HTTP Toolkit. It works on a similar principle, but its documentation is better. With its help, I managed to see the URLs I needed and their contents. Root on the device was still required in this case too, in order to install the certificate. Instructions for Android devices.
My request looked like this:
The URL I managed to get:
And also the headers:
headers = {
    'User-Agent': 'ali-android-13-567-8.20.341.823566',
    'x-aer-client-type': 'android',
    'x-aer-lang': 'en_RU',
    'x-aer-currency': 'RUB',
    'x-aer-ship-to-country': 'RU',
    'x-appkey': 'XXXXXXXX',
    'accept': 'application/json',
    'x-aer-device-id': 'X0XXxX+Xxx0XXX0XxxXXxx0X'
}
Since it had to be a POST request, I sent the following content in the body:
# Extract the numeric product ID from the product detail page URL
body = {
    'productId': re.findall(r'(\d+)\.html', url)[0]
}
In response to this request, the marketplace obediently returned valid JSON from which absolutely everything about the selected product could be extracted.
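The ID extraction step can be made a little more defensive. The sketch below is one possible way to do it, not the code from the original project: the function names `product_id_from_url` and `build_body` and the example.com URLs are hypothetical. It parses the URL first so that query strings do not interfere, and reports a clear error when no ID is present.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Matches a trailing ".../<digits>.html" in the URL path.
_PRODUCT_ID_RE = re.compile(r"/(\d+)\.html$")


def product_id_from_url(url: str) -> Optional[str]:
    """Return the numeric product ID from a detail page URL, or None."""
    path = urlparse(url).path  # drops the query string and fragment
    match = _PRODUCT_ID_RE.search(path)
    return match.group(1) if match else None


def build_body(url: str) -> dict:
    """Build the POST body; raises ValueError if the URL has no product ID."""
    product_id = product_id_from_url(url)
    if product_id is None:
        raise ValueError(f"no product ID found in {url!r}")
    return {"productId": product_id}


# Hypothetical URL used only for illustration.
print(build_body("https://example.com/item/1005001234567890.html?spm=a.b"))
```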
Thanks to this approach, I got an even bigger performance gain. There were now far fewer captchas, so proxy quality no longer mattered as much and the number of proxies could be reduced. I also dropped the lxml library, saving a couple of megabytes of dependencies 🙂
To load the pictures, I collected the download coroutines into a list and passed them to asyncio.gather(*[...]), saving another couple of hundred milliseconds on loading data. For the requests themselves I use the aiohttp library instead of requests, because it is asynchronous and therefore faster when many requests run concurrently.
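A minimal sketch of the concurrent download pattern is shown below. To keep it runnable without third-party packages, `fetch_image` is a stand-in that only simulates network latency with `asyncio.sleep`; in a real parser it would be replaced by an HTTP call through an aiohttp session. The function names and URLs are assumptions for the example.

```python
import asyncio
from typing import List


async def fetch_image(url: str) -> bytes:
    """Stand-in for a real download: waits briefly, then returns fake bytes."""
    await asyncio.sleep(0.01)  # simulated network latency
    return f"bytes of {url}".encode()


async def fetch_all(urls: List[str]) -> List[bytes]:
    # gather() takes awaitables as separate positional arguments,
    # hence the unpacking; results come back in the order of `urls`.
    return await asyncio.gather(*(fetch_image(u) for u in urls))


# Hypothetical image URLs used only for illustration.
urls = [f"https://example.com/img/{n}.jpg" for n in range(5)]
images = asyncio.run(fetch_all(urls))
print(len(images))
```

Because all five simulated downloads wait at the same time, the total run time is close to one delay rather than five, which is where the saved milliseconds come from.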
I wrote this article to show that when solving a problem you should look around, push through to the end, and keep challenging yourself to find new solutions when something inside tells you the code is not working as well as it could. It is very interesting!