Python Marketplace Parser Bot

Hello everyone! In this article I will show one way to parse a site in Python, using the Wildberries marketplace as an example.

The essence of the approach is that we will not parse the HTML of the requested page. Instead, we will use the site's own API, which the service calls to fetch and display all products of a given category.

The following libraries will be used in the project:

  1. requests – for sending HTTP requests to the API.

  2. aiogram 3.10.0 – one of the most popular libraries for developing Telegram bots.

How the project works

The application will send a GET request to the API and receive product card data in response. Next, we will pick out the fields we need, such as brand name and product name, and then “wrap” the application in a Telegram bot. And of course, we will not forget about deployment.

Wildberries, like other large services, can block the IP of the parser bot, so I suggest using proxies.
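For reference, requests takes proxies as a dictionary that maps the URL scheme to a proxy URL. Below is a minimal sketch, assuming one proxy serves both schemes; the helper name build_proxies and the example address are hypothetical and not part of the project code.

```python
def build_proxies(proxy_url):
    """Return a requests-style proxies mapping, or None when no proxy is set."""
    if not proxy_url:
        return None
    # requests picks the proxy by the scheme of the target URL.
    return {"http": proxy_url, "https": proxy_url}


# Hypothetical proxy address, for illustration only.
example = build_proxies("http://user:password@203.0.113.10:8080")
print(example)
```

The resulting dictionary can be passed as the proxies argument of requests.get(); passing None means no proxy is set explicitly.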

API Analysis

As mentioned above, Wildberries uses an API to load the information we are interested in onto the page. Let's go to the site and open any category, for example Electronics -> Headsets and headphones.

Now open the browser developer tools (F12 in most browsers), switch to the Network tab, which shows the requests sent by the site, and select the Fetch/XHR filter. Reload the page.

Here we are interested in the request whose name starts with catalog?. Click on it and, in the Preview tab, expand data -> products to see the product data inside.

Now we know what the request looks like and can move on to writing code; very soon we will need more data from DevTools.

Writing the main logic of the project

First, we import the necessary library, requests, declare a variable for the proxy and set up the structure of main.py.

import requests

# requests expects a dict that maps the URL scheme to a proxy URL
proxies = {
    'http': 'PUT-YOUR-PROXY-URL-HERE',
    'https': 'PUT-YOUR-PROXY-URL-HERE',
}

def get_category():
    pass

def format_items(response):
    pass

def main():
    pass

if __name__ == '__main__':
    main()

As you can see, three main functions will be used:

  • get_category() – here we specify the url and the headers, which we get by copying the request from DevTools as cURL (bash), and return the response of the GET request as parsed JSON:

def get_category():
    url="https://catalog.wb.ru/catalog/electronic14/v2/catalog?ab_testing=false&appType=1&cat=9468&curr=rub&dest=-1185367&sort=popular&spp=30"
    
    headers = {
        'Accept': '*/*',
        'Accept-Language': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7',
        'Connection': 'keep-alive',
        'DNT': '1',
        'Origin': 'https://www.wildberries.ru', 
        'Referer': 'https://www.wildberries.ru/catalog/elektronika/igry-i-razvlecheniya/aksessuary/garnitury',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'cross-site',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36',
        'sec-ch-ua': '"Not)A;Brand";v="99", "Google Chrome";v="127", "Chromium";v="127"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
    }
    
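    # timeout keeps the script from hanging if the API or the proxy does not respond
    response = requests.get(url=url, headers=headers, proxies=proxies, timeout=10)
    # fail early on 4xx/5xx instead of trying to parse an error page as JSON
    response.raise_for_status()

    return response.json()

  • format_items() – accepts the parsed response. Here we loop over the product data, having first checked that there are any products at all, and collect everything into the products list: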

def format_items(response):
    products = []
    
    # 'data' may be missing, so fall back to {} before asking for 'products'
    products_raw = response.get('data', {}).get('products', None)

    # None and an empty list are both falsy, so one check covers both cases
    if products_raw:
        for product in products_raw:
            # debug output: show each product name as it is processed
            print(product.get('name', None))
            products.append({
                'brand': product.get('brand', None),
                'name': product.get('name', None),
                'id': product.get('id', None),
                'reviewRating': product.get('reviewRating', None),
                'feedbacks': product.get('feedbacks', None),
            })

    return products

We get products_raw (all products) by calling the get method on response. We need to reach the products that we saw in the browser developer tools, so we first get the contents of data and then products itself.

What is the get method and how does it work? It is called on a dictionary, in our case product. The first parameter is the name of the key whose value we want. The second parameter is what will be returned if the key is not found. If the key is found, the value stored under it is returned. Using append, we add a dictionary with the selected fields to the products list declared at the beginning of the function.
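To make the behaviour of get concrete, here is a small sketch on a hand-written dictionary. The payload only mimics the data -> products shape described above; its values are made up for illustration.

```python
# Sample payload shaped like the part of the response used in the article.
response = {"data": {"products": [{"id": 1, "name": "Headset", "brand": "Acme"}]}}

# Chained get(): the {} default keeps the second call safe if 'data' is missing.
products = response.get("data", {}).get("products", None)
first = products[0]

name = first.get("name")                 # key exists: its value is returned
missing = first.get("feedbacks")         # key is absent: default is None
with_default = first.get("feedbacks", 0) # key is absent: explicit default
no_data = {}.get("data", {}).get("products", None)  # no 'data' key at all

print(name, missing, with_default, no_data)
```

The last line shows why the first get in the chain uses {} as its default: the second get can still be called on it and simply returns None.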

  • main() – ties the two functions together and prints the result:

def main():
    response = get_category()
    products = format_items(response)
    
    print(products)

Now you can run the script. The output contains the selected data for each product!
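The query string of the url in get_category() carries the request parameters (cat, sort, dest and so on). The sketch below, using only the standard library, shows one way to inspect and override them. The helper name with_params and the replacement value for cat are hypothetical; which values the API actually accepts is not covered here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

URL = (
    "https://catalog.wb.ru/catalog/electronic14/v2/catalog"
    "?ab_testing=false&appType=1&cat=9468&curr=rub&dest=-1185367&sort=popular&spp=30"
)


def with_params(url, **overrides):
    """Return the url with some query parameters replaced or added."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    params.update(overrides)
    return urlunsplit(parts._replace(query=urlencode(params)))


current = dict(parse_qsl(urlsplit(URL).query))
print(current["sort"], current["cat"])

# Hypothetical category id, for illustration only.
print(with_params(URL, cat="1234"))
```

As an alternative, requests.get() also accepts a params dictionary and builds the query string itself.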

Adapting the code for the bot

Once the parser logic is ready, we can write the bot. It will be simple: the bot will not let the user choose a category, but will just send 10 cards from the category we selected.

Let's add the necessary imports:

import os
import asyncio
import time

import logging

from aiogram import Bot, Dispatcher, types
from aiogram.filters import CommandStart
from aiogram.enums.parse_mode import ParseMode
from aiogram.types.inline_keyboard_button import InlineKeyboardButton
from aiogram.utils.keyboard import InlineKeyboardBuilder

Let's enable logging, then create the Bot and Dispatcher instances and the proxy variable:

logging.basicConfig(level=logging.INFO)

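# Assumption: PROXIES holds one proxy URL; requests needs a scheme-to-URL dict
proxy_url = os.getenv('PROXIES')
proxies = {'http': proxy_url, 'https': proxy_url} if proxy_url else None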

bot = Bot(os.getenv("TOKEN"))
dp = Dispatcher()

Since we will be uploading the bot to the cloud, I recommend using environment variables.
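A minimal sketch of reading the settings from the environment with a clear error when the token is missing. The variable names TOKEN and PROXIES come from the code above; the helper load_settings and the assumption that PROXIES holds a single proxy URL are illustrative.

```python
import os


def load_settings(env=os.environ):
    """Read the bot token and the optional proxy from a mapping of variables."""
    token = env.get("TOKEN")
    if not token:
        raise RuntimeError("TOKEN environment variable is not set")
    proxy_url = env.get("PROXIES")
    # Assumption: one proxy URL is used for both http and https traffic.
    proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else None
    return token, proxies


# Made-up values passed as a plain dict instead of the real environment.
token, proxies = load_settings({"TOKEN": "123:abc", "PROXIES": ""})
print(token, proxies)
```

Failing at startup with an explicit message is usually easier to diagnose in cloud logs than an error raised later inside the bot library.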

Now let's change main() a little: make it asynchronous and, instead of calling the main function as usual, use asyncio.run() to run it:

async def main():
    await bot.delete_webhook(drop_pending_updates=True)
    await dp.start_polling(bot)
    
if __name__ == '__main__':
    asyncio.run(main())
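The difference between calling an async function and running it can be seen in a small standalone sketch. It uses only the standard library; fetch_number is a stand-in for any awaitable operation and is not part of the bot.

```python
import asyncio


async def fetch_number():
    # Stand-in for an awaitable operation such as a network call.
    await asyncio.sleep(0.01)
    return 42


async def main():
    return await fetch_number()


# Calling main() only creates a coroutine object; nothing runs yet.
coro = main()
is_coroutine = asyncio.iscoroutine(coro)
coro.close()  # discard it to avoid a "never awaited" warning

# asyncio.run() starts an event loop and runs main() to completion.
result = asyncio.run(main())
print(is_coroutine, result)
```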

And most importantly, the /start command handler. Note that the filter is passed as an instance, CommandStart(), not as the class itself:

@dp.message(CommandStart())
async def start(message: types.Message):
    response = get_category()
    products = format_items(response)
    
    # send at most 10 cards
    for product in products[:10]:
        text = (
            f"<b>Category</b>: Headsets and headphones\n\n"
            f"<b>Name</b>: {product['name']}\n"
            f"<b>Brand</b>: {product['brand']}\n\n"
            f"<b>Total reviews</b>: {product['feedbacks']}\n"
            f"<b>Average rating</b>: {product['reviewRating']}"
        )

        builder = InlineKeyboardBuilder()
        builder.add(InlineKeyboardButton(text="Open", url=f"https://www.wildberries.ru/catalog/{product['id']}/detail.aspx"))

        await message.answer(text, parse_mode=ParseMode.HTML, reply_markup=builder.as_markup())

        # non-blocking pause, so the event loop can keep handling other updates
        await asyncio.sleep(0.3)

Here we simply put all the data obtained during parsing into a homemade card in the form of a bot message and add a button that opens the product page.
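Because the card is sent with HTML parse mode, characters such as <, > and & in a product name may cause Telegram to fail to parse the message. A possible way to guard against this is sketched below, assuming the card text is built in a separate helper; build_card_text and the "n/a" placeholder are hypothetical, and the labels are written in English.

```python
import html


def build_card_text(product, category="Headsets and headphones"):
    """Build the HTML text of one product card from a format_items() entry."""

    def field(key):
        value = product.get(key)
        # Escape <, > and & so they are not treated as markup.
        return html.escape(str(value)) if value is not None else "n/a"

    return (
        f"<b>Category</b>: {html.escape(category)}\n\n"
        f"<b>Name</b>: {field('name')}\n"
        f"<b>Brand</b>: {field('brand')}\n\n"
        f"<b>Total reviews</b>: {field('feedbacks')}\n"
        f"<b>Average rating</b>: {field('reviewRating')}"
    )


# Made-up product, for illustration only.
sample = {"name": "Headset <Pro> & Mic", "brand": None, "feedbacks": 12, "reviewRating": 4.8}
print(build_card_text(sample))
```

Keeping the text construction in a plain function also makes it possible to check the card layout without running the bot.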

The bot sends up to 10 of the most popular products (the sort order can be set in the url parameters in the first function) with pauses of 0.3 seconds, so that the Telegram API does not complain about messages being sent too often.
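One thing to keep in mind inside async handlers: time.sleep() and requests.get() are blocking calls, so while they run the bot cannot process other updates. The sketch below shows a non-blocking variant of the same "fetch, then send with pauses" pattern using only the standard library. It assumes Python 3.9+ for asyncio.to_thread; blocking_fetch, send_cards and fake_send are hypothetical stand-ins for get_category(), the handler body and message.answer().

```python
import asyncio
import time


def blocking_fetch():
    # Stand-in for a blocking call such as requests.get().
    time.sleep(0.05)
    return ["item-1", "item-2", "item-3"]


async def send_cards(send, limit=10, pause=0.3):
    # Run the blocking call in a worker thread so the event loop stays free.
    items = await asyncio.to_thread(blocking_fetch)
    sent = 0
    for item in items[:limit]:
        await send(item)
        sent += 1
        # asyncio.sleep() pauses this coroutine without blocking the loop.
        await asyncio.sleep(pause)
    return sent


async def demo():
    received = []

    async def fake_send(text):
        received.append(text)

    count = await send_cards(fake_send, limit=2, pause=0.01)
    return count, received


count, received = asyncio.run(demo())
print(count, received)
```

Slicing with items[:limit] caps the number of messages without a manual counter.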

While the bot is running locally, you can put the token and proxy directly into the code, but in the long run it is better to use environment variables.

We launch it, send the /start command and see that everything works!

The whole code looks like this:

import requests
import os
import asyncio
import time

import logging

from aiogram import Bot, Dispatcher, types
from aiogram.filters import CommandStart
from aiogram.enums.parse_mode import ParseMode
from aiogram.types.inline_keyboard_button import InlineKeyboardButton
from aiogram.utils.keyboard import InlineKeyboardBuilder

logging.basicConfig(level=logging.INFO)

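# Assumption: PROXIES holds one proxy URL; requests needs a scheme-to-URL dict
proxy_url = os.getenv('PROXIES')
proxies = {'http': proxy_url, 'https': proxy_url} if proxy_url else None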

bot = Bot(os.getenv("TOKEN"))
dp = Dispatcher()

def get_category():
    url="https://catalog.wb.ru/catalog/electronic14/v2/catalog?ab_testing=false&appType=1&cat=9468&curr=rub&dest=-1185367&sort=popular&spp=30"
    
    headers = {
        'Accept': '*/*',
        'Accept-Language': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7',
        'Connection': 'keep-alive',
        'DNT': '1',
        'Origin': 'https://www.wildberries.ru', 
        'Referer': 'https://www.wildberries.ru/catalog/elektronika/igry-i-razvlecheniya/aksessuary/garnitury',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'cross-site',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36',
        'sec-ch-ua': '"Not)A;Brand";v="99", "Google Chrome";v="127", "Chromium";v="127"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
    }
    
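    # timeout keeps the bot from hanging if the API or the proxy does not respond
    response = requests.get(url=url, headers=headers, proxies=proxies, timeout=10)
    # fail early on 4xx/5xx instead of trying to parse an error page as JSON
    response.raise_for_status()

    return response.json()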

def format_items(response):
    products = []
    
    products_raw = response.get('data', {}).get('products', None)
    
    if products_raw:
        for product in products_raw:
            products.append({
                'brand': product.get('brand', None),
                'name': product.get('name', None),
                'id': product.get('id', None),
                'reviewRating': product.get('reviewRating', None),
                'feedbacks': product.get('feedbacks', None),
            })
            
    return products
            
@dp.message(CommandStart())
async def start(message: types.Message):
    response = get_category()
    products = format_items(response)
    
    # send at most 10 cards
    for product in products[:10]:
        text = (
            f"<b>Category</b>: Headsets and headphones\n\n"
            f"<b>Name</b>: {product['name']}\n"
            f"<b>Brand</b>: {product['brand']}\n\n"
            f"<b>Total reviews</b>: {product['feedbacks']}\n"
            f"<b>Average rating</b>: {product['reviewRating']}"
        )

        builder = InlineKeyboardBuilder()
        builder.add(InlineKeyboardButton(text="Open", url=f"https://www.wildberries.ru/catalog/{product['id']}/detail.aspx"))

        await message.answer(text, parse_mode=ParseMode.HTML, reply_markup=builder.as_markup())

        # non-blocking pause, so the event loop can keep handling other updates
        await asyncio.sleep(0.3)
            
async def main():
    await bot.delete_webhook(drop_pending_updates=True)
    await dp.start_polling(bot)
    
if __name__ == '__main__':
    asyncio.run(main())

Deploying a bot to Amvera

We will deploy our bot on the Amvera service.

Amvera is one of the best solutions for quick bot deployment thanks to simple file upload (via git or the web interface) and easy setup (you don't need to set up a virtual machine, install dependencies by hand, etc.). The only things you need to do are set the environment variables, fill in the configuration parameters and create a dependency file (requirements.txt). Although it sounds scary, in reality it takes a couple of clicks.

In Amvera you can quickly update your code via git (console utility or IDE interface) in just 3 commands.

Another useful feature of Amvera is hourly pricing: you do not pay for time when the project is not running. If your application runs for an hour, you are charged for an hour; if it runs for 20 minutes, you are charged for 20 minutes, and so on. The price shown in the tariff assumes the application runs for the whole month without interruption.

First, let's register. After you confirm your phone number and email, 111 rubles will be credited to your balance.

Open the projects page and click the “Create” button. In the window that opens, enter the project name in Latin or Cyrillic, select the tariff you like, and leave the service type as “Application”.

Click “Next”. Now the file upload window is available. You can upload the files right away via the interface, or later via Git. I recommend using Git if you plan to update the project in the future.

The choice doesn't matter – you'll be able to use both the interface and Git anyway.

Click “Next” again and we are greeted by the configuration window. These are the instructions for running the project. Everything is simple: select the Python environment and the pip tool, specify the Python version and the name of the script to run, and that's it. We don't need to configure anything else.

We complete the project setup; all that remains is to create the dependency file and upload the files.

The dependency file is simple to create: list each library with its version in the format

library1==version1
library2==version2

In our case, requirements.txt looks like this:

aiogram==3.10.0
requests==2.32.3

Final setup and delivery of code to the repository (Git)

Before uploading files, you need to set up the environment variables. Go to the project page and open the “Variables” tab, where you create secrets with the names used in the code (TOKEN and PROXIES) and the desired values.

We are ready to upload the code! If you want, you can quickly upload it through the site interface in the “Repository” tab, but I will use git.

Here is the sequence of commands for the first commit and for sending (pushing) the files:

  1. git init – initializes a local git repository

  2. git remote add amvera https://git.amvera.ru/your_username/project_name – connects the Amvera repository as a remote. The exact command can be found in the “Repository” tab

  3. git add . – stages all files in the current directory

  4. git commit -m "Comment" – creates the first commit

  5. git push amvera master – pushes to the repository.

Sometimes problems may arise. Possible Git-related errors and their solutions are collected here.

When you push, the build starts automatically. If you uploaded via the interface, go to the “Configuration” tab and click the “Build” button. It is important to rebuild the project whenever you update the code or configuration.

Summary

Once the application is built, all you have to do is wait for the bot to launch and check that it works.

Today we learned how to parse a page or, more precisely, how to use the marketplace API to get product card data, wrapped this functionality in a bot and learned how to deploy a minimal application in Amvera.
