Writing a monitoring parser for the Hyundai Showroom that posts to a Telegram channel

On the site https://showroom.hyundai.ru/ you can order a car directly from the Hyundai factory without overpaying, but the cars sell out very quickly. New cars appear infrequently, and most of the time the site shows a message saying that no cars are available.

To have a chance to book a car in time, we will write a monitoring parser for the Hyundai Showroom that posts to a Telegram channel and notifies us when cars appear in the showroom.

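We will use JavaScript, the Node.js runtime, and libraries for browser automation (puppeteer), logging (winston), cron scheduling (used below as cron), and the Telegram Bot API (used below as TelegramBot).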

Let’s create constants that hold the host of the Hyundai showroom website, the credentials for the Telegram channel, and the environment flag:

const hyundaiHost = 'https://showroom.hyundai.ru/';
const tgToken = 'SOME_TELEGRAM_TOKEN';
const tgChannelId = 'SOME_TELEGRAM_CHANNEL_ID';
const isProduction = process.env.NODE_ENV === 'production';
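
Hard-coded placeholders are fine for a demo, but the token and channel ID are secrets, so one option is to read them from environment variables. The following is a minimal sketch rather than part of the original script; the variable names TG_TOKEN and TG_CHANNEL_ID and the helper loadConfig are assumptions made for illustration:

```javascript
// Sketch: build the settings from an environment-like object.
// TG_TOKEN and TG_CHANNEL_ID are hypothetical variable names.
function loadConfig(env) {
    const isProduction = env.NODE_ENV === 'production';
    const tgToken = env.TG_TOKEN;
    const tgChannelId = env.TG_CHANNEL_ID;

    // Fail early in production if a secret is missing.
    if (isProduction && (!tgToken || !tgChannelId)) {
        throw new Error('TG_TOKEN and TG_CHANNEL_ID must be set in production');
    }

    return {
        hyundaiHost: 'https://showroom.hyundai.ru/',
        tgToken,
        tgChannelId,
        isProduction,
    };
}

// Usage: const config = loadConfig(process.env);
```

Passing the environment in as an argument keeps the function easy to check without touching the real process environment.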

Let’s create instances of the Telegram bot and the logger.

The logger saves to the file system what the parser received when the page was loaded. This helps with debugging and, for example, makes it possible to compare this parser’s results with those of other parsers:

const bot = new TelegramBot(tgToken);
const logger = winston.createLogger({
    transports: [
        new winston.transports.File({
            filename: './log.txt',
        }),
    ],
});

The function start runs exec once immediately and then schedules it with cron to run every minute. The function exec contains the main business logic of the script:

async function start() {
    exec();

    cron.schedule('* * * * *', () => {
        exec();
    });
}
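
Because exec launches a browser, a run may occasionally take longer than the one-minute interval, and two runs could then overlap. One possible guard is sketched below; skipIfRunning is a hypothetical helper that is not part of the original script:

```javascript
// Sketch: wrap an async task so that a new call is skipped
// while the previous call is still running.
function skipIfRunning(task) {
    let running = false;

    return async function guarded(...args) {
        if (running) {
            return false; // the previous run has not finished yet
        }
        running = true;
        try {
            await task(...args);
            return true;
        } finally {
            running = false; // reset even if the task throws
        }
    };
}

// With the schedule above it could be used like this:
// const guardedExec = skipIfRunning(exec);
// cron.schedule('* * * * *', guardedExec);
```

The wrapper resolves to false when a run was skipped and to true when the task completed, which makes skipped runs easy to log if needed.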

Let’s describe the function exec.

Let’s create a browser instance in headless mode to prevent the operating system from launching the browser’s graphical interface. We also pass additional arguments that are meant to speed up the browser:

const browser = await puppeteer.launch({
    headless: true,
    args: [
        '--disable-gpu',
        '--disable-dev-shm-usage',
        '--disable-setuid-sandbox',
        '--no-first-run',
        '--no-sandbox',
        '--no-zygote',
    ],
});

Let’s create a new page and call the function setBlockingOnRequests, which blocks some of the network requests made by the showroom page. This prevents the browser from loading resources the parser does not need, for example images or third-party scripts such as Google Analytics and ad systems:

const page = await browser.newPage();

await setBlockingOnRequests(page);

Let’s write the first try-catch block, in which we load the page. If the page fails to load, we create an error report using the function createErrorReport. We pass it the following arguments:

  • browser page instance;

  • identifier no-page;

  • the message “Error visiting the page”;

  • system error.

After that, close the browser page and exit the function exec:

try {
    await page.goto(hyundaiHost, {waitUntil: 'networkidle2'});
} catch (error) {
    // The message is Russian for "Error visiting the page"
    await createErrorReport(page, 'no-page', 'Ошибка посещения страницы', error);

    await page.close();
    await browser.close();
    return;
}

If the page has loaded successfully, we write a second try-catch block, in which we try to find the CSS selector '#cars-all .car-columns' in the DOM. This tells us whether the page displays a list of cars:

await page.waitForSelector('#cars-all .car-columns', {timeout: 1000});

Let’s also count the cars by counting the DOM elements that match the CSS selector of a car card:

const carsCount = (await page.$$('.car-item__wrap')).length;

We will build a timestamp and a message, which we will then send to the Telegram channel. We will use the function pluralize, which selects the correct Russian word form depending on the number:

const timestamp = new Date().toTimeString();
// Russian for "<N> car(s) available at <time>"; pluralize picks the word forms
const message = `${pluralize(carsCount, 'Доступна', 'Доступно', 'Доступно')} ${carsCount} ${pluralize(carsCount, 'машина', 'машины', 'машин')} в ${timestamp}`;

If the application is running in a production environment, we send the message to the Telegram channel:

if (isProduction) {
    bot.sendMessage(tgChannelId, message);
}

If the CSS selector for the list of cars is not found in the DOM, we create an error report in the catch block and then close the page and the browser:

await createErrorReport(page, 'no-cars', 'Ошибка поиска машин', error);
await page.close();
await browser.close();
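
The fragments above can be assembled roughly as follows. This is a sketch, not the author’s exact function: the collaborators are passed in through a hypothetical deps object ({launch, report, notify}) so that the control flow can be read and exercised without a real browser, and the cleanup is moved into a finally block, which is a design choice of this sketch. The selectors and report identifiers are the ones used above.

```javascript
// Sketch: control flow of exec with injected collaborators.
// `deps` is a hypothetical object: { launch, report, notify }.
async function execSketch(deps, url) {
    const browser = await deps.launch();
    const page = await browser.newPage();

    try {
        try {
            await page.goto(url, {waitUntil: 'networkidle2'});
        } catch (error) {
            await deps.report(page, 'no-page', error);
            return null;
        }

        try {
            await page.waitForSelector('#cars-all .car-columns', {timeout: 1000});
        } catch (error) {
            await deps.report(page, 'no-cars', error);
            return null;
        }

        const carsCount = (await page.$$('.car-item__wrap')).length;
        await deps.notify(carsCount);
        return carsCount;
    } finally {
        // Runs on every path, so the page and browser are always released.
        await page.close();
        await browser.close();
    }
}
```

With try/finally, the page and the browser are closed in one place whether the run succeeds, reports an error, or throws unexpectedly.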

Let’s look at the function createErrorReport. First we build the message and write it to the log file:

const timestamp = new Date().toTimeString();

logger.error(`${message} в ${timestamp}`, techError);

Let’s take a screenshot using puppeteer so that we can check whether the cars were really missing or whether, for example, the layout of the site has changed and the CSS selectors we rely on no longer match.

Let’s set a very low image quality so that each file is as small as possible and a large number of screenshots does not fill up the disk:

const carListContainer = await page.$('#main-content');

if (carListContainer) {
    await carListContainer.screenshot({path: `${type}-${timestamp}.jpeg`, type: 'jpeg', quality: 1});
} else {
    logger.error(`Не могу сделать скриншот отсутствия автомобилей в ${timestamp}`, techError);
}
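
Note that toTimeString() returns text such as "14:05:09 GMT+0300" followed by a time zone name, and colons are not allowed in file names on some systems (for example, Windows). One possible alternative is to build the file name from an ISO timestamp. The helper screenshotFileName below is hypothetical and not part of the original script:

```javascript
// Sketch: build a screenshot file name without colons or dots
// in the timestamp part.
function screenshotFileName(type, date) {
    // toISOString() gives e.g. '2021-12-05T14:05:09.000Z'
    const stamp = date.toISOString().replace(/[:.]/g, '-');
    return `${type}-${stamp}.jpeg`;
}

// screenshotFileName('no-cars', new Date(Date.UTC(2021, 11, 5, 14, 5, 9)))
// returns 'no-cars-2021-12-05T14-05-09-000Z.jpeg'
```

ISO timestamps also sort chronologically by name, which makes a directory of screenshots easier to browse.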

Consider the function setBlockingOnRequests, which enables request interception for the page in puppeteer and sets up a handler for the request event.

Inside the handler, the methods resourceType and url give us the type and URL of the resource being loaded. We block images, media files, fonts, CSS files, web analytics systems and advertising systems, since they carry no useful information for parsing:

async function setBlockingOnRequests(page) {
    await page.setRequestInterception(true);

    page.on('request', (req) => {
        if (req.resourceType() === 'image'
            || req.resourceType() === 'media'
            || req.resourceType() === 'font'
            || req.resourceType() === 'stylesheet'
            || req.url().includes('yandex')
            || req.url().includes('nr-data')
            || req.url().includes('rambler')
            || req.url().includes('criteo')
            || req.url().includes('adhigh')
            || req.url().includes('dadata')
        ) {
            req.abort();
        } else {
            req.continue();
        }
    });
}
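
The blocking condition can also be pulled out into a pure function, which makes the rules easier to extend and to check in isolation. A sketch under that assumption follows; shouldBlock and the two constants are names introduced here for illustration, while the resource types and URL fragments are the ones listed above:

```javascript
// Sketch: the same blocking rules as a standalone predicate.
const BLOCKED_TYPES = new Set(['image', 'media', 'font', 'stylesheet']);
const BLOCKED_URL_PARTS = ['yandex', 'nr-data', 'rambler', 'criteo', 'adhigh', 'dadata'];

function shouldBlock(resourceType, url) {
    return BLOCKED_TYPES.has(resourceType)
        || BLOCKED_URL_PARTS.some((part) => url.includes(part));
}

// Inside the 'request' handler it could be used like this:
// if (shouldBlock(req.resourceType(), req.url())) {
//     req.abort();
// } else {
//     req.continue();
// }
```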

The function pluralize selects the word form for a number using the Russian plural rules from Intl.PluralRules:

function pluralize(n, one, few, many) {
    const selectedRule = new Intl.PluralRules('ru-RU').select(n);

    switch (selectedRule) {
        case 'one': {
            return one;
        }
        case 'few': {
            return few;
        }
        default: {
            return many;
        }
    }
}
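
To see which plural category Intl.PluralRules assigns to a number, here is a small self-contained sketch. It assumes a Node.js build with full ICU data, which has been the default in official builds for a long time; the sample numbers are chosen for illustration:

```javascript
// Sketch: plural categories for Russian and the matching word forms.
const rules = new Intl.PluralRules('ru-RU');
const forms = {one: 'машина', few: 'машины', many: 'машин'};

const samples = [1, 2, 5, 11, 21, 22];
const categories = samples.map((n) => rules.select(n));
// categories: ['one', 'few', 'many', 'many', 'one', 'few']

const phrases = samples.map((n) => `${n} ${forms[rules.select(n)]}`);
// phrases: ['1 машина', '2 машины', '5 машин', '11 машин', '21 машина', '22 машины']
```

Numbers such as 11 and 21 show why a simple "n === 1" check is not enough for Russian.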

The main advantage of this parsing method is its simple implementation. The drawback is limited reliability, which results from the unstable operation of the showroom site. This could be addressed by working directly with the REST API that the showroom site itself uses, https://showroom.hyundai.ru/rest/car, but there we run into a new obstacle: data encryption.
