19.6 million rubles in 2022 from website parsing: 25 tips for those who want to get into this business

My name is Maxim Kulgin, and my company, xmldatafeed, has been parsing websites in Russia for about four years. Now that 2022 is over, I want to share some advice with anyone considering this kind of business. It is a very interesting business, but it is full of nuances, which I discuss in this article.

I previously described our experience of building a web scraping business in two articles (part 1 and part 2). Now I want to sum up 2022 and give some advice to teams that want to compete with us (I see nothing wrong with that, by the way: the market is quite large, and you will most likely find your own path, different from ours). Of course, I am not lecturing anyone; I am only describing our experience. You can agree with it or not, but this is how it works for us. I am always glad to get comments that make me think and look at what we do from a different angle.

Extract from our online banking statement, 2022.

In 2022 we grew a little compared to 2021 and, as I describe below, the events of February had a significant impact (it is noticeable in March 2022 on the chart above). Note that expenses and income carry over between periods (December 2021 into January 2022, and so on), so do not read anything into expenses exceeding income. Everything we earn we spend on the team, on ourselves, and on innovation (yes, we are slowly digging around, trying to find new niches in this business).

So…

1. This is a project business. I dream of a product business, where costs do not grow linearly with the customer base. In parsing you get the opposite. We now have 6 full-time programmers, and I know that if 2-3 large clients arrive, we will have to hire more people (a desk, a PC, training, and so on).

2. It is hard for us to turn parsing into a product. We made a couple of attempts, started, and ... gave up. I am not saying that a product approach cannot work here, but we apparently lack the knowledge to pull it off. We decided not to try again 🙂

3. Nobody really needs analytics. The comments on the previous articles were full of advice to build analytics and sell at a higher price. Inspired, we rushed in and ... it did not work. Clients do not ask for it and do not need it; they do it themselves in their own systems (1C, Excel, PowerBI, Google BigQuery, etc.). We gave up and no longer even try. We focus on stable parsing and data delivery. The format, by the way, does not matter: csv, json, xml, excel; different clients ask for different ones.

4. Product matching does not work. You won’t believe it, but we constantly hear from micro-teams of very smart people (I write this without the slightest irony) who offer to match products using “new algorithms based on (convolutional, bubble, etc.) neural networks” and to earn money together. We give them two data sets from pharmacy chains as a test and ask them to link the products to each other. The result? Slightly better than with this free Excel add-in. I am not saying it is impossible, but the fact is that in all these years we have not managed to match products better than people do by hand.
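To make the matching problem concrete, here is a minimal sketch of the kind of naive baseline such a test is measured against: plain string similarity from Python’s standard library. This is not our method and not what the Excel add-in does; the function names, the normalization rules, and the 0.6 threshold are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, turn commas into spaces, collapse repeated whitespace."""
    return " ".join(name.lower().replace(",", " ").split())


def best_match(name, candidates, min_score=0.6):
    """Return (candidate, score) for the most similar candidate.

    If no candidate reaches min_score (an assumed threshold that would
    need tuning on real data), return (None, best score seen).
    """
    best, best_score = None, 0.0
    for candidate in candidates:
        score = SequenceMatcher(
            None, normalize(name), normalize(candidate)
        ).ratio()
        if score > best_score:
            best, best_score = candidate, score
    if best_score < min_score:
        return None, best_score
    return best, best_score


# Hypothetical product names from two pharmacy chains.
chain_b = ["paracetamol  500 MG", "Ibuprofen 200 mg"]
match, score = best_match("Paracetamol 500 mg", chain_b)
```

A baseline like this handles differences in case and spacing, but it has no notion of dosage, pack size, or brand synonyms, which is where manual matching still wins.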

5. The events of February 2022 hit us. We felt a sharp drop in revenue and feared that the business would “dry up”. Many clients left, but what saved us was that 50% of them later returned and new ones arrived at the end of spring. What saves this business is that data will always be needed.

6. It is impossible to parse every site. There are sites that we cannot parse in the required volumes, so we turn those clients down. Some will say this is a lack of competence; I disagree. The point is that when the task queue for current paying customers is full a week ahead, the team works on those urgent tasks, not on research.

Want to test your parsing skills? Leroy Merlin, Moscow and St. Petersburg regions, daily data for all products. Can you do it? Then we will work with you 🙂

7. Cold sales do not work for us. We have not managed to set up “cold” sales for the parsing service. We made several attempts with different specialists, and it did not work out. All clients come from the website.

8. Support can only spot-check. When a client asks you to parse, say, 450 sites (we have such a client), the support team physically cannot check the contents of every CSV/XLS file every day. All you can do systematically is compare the amount of data between “yesterday” and “today” and, when the amount changes sharply, look “inside”.
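The “yesterday vs. today” check can be sketched roughly as follows. This is a minimal illustration, not our production code: the function name, the dictionary layout, and the 30% threshold are assumptions.

```python
def flag_volume_anomalies(yesterday, today, threshold=0.3):
    """Return {site: (old_rows, new_rows)} for sites worth a manual look.

    yesterday, today: dicts mapping a site name to its row count.
    threshold: relative change that triggers a check (0.3 = 30%);
    an assumed value that should be tuned per site.
    Sites that appear only in `today` (newly connected) are ignored.
    """
    flagged = {}
    for site, old_rows in yesterday.items():
        new_rows = today.get(site, 0)  # missing today counts as 0 rows
        baseline = max(old_rows, 1)    # avoid division by zero
        change = abs(new_rows - old_rows) / baseline
        if change > threshold:
            flagged[site] = (old_rows, new_rows)
    return flagged


# Hypothetical row counts for three sites.
yesterday = {"site-a": 1000, "site-b": 500, "site-c": 200}
today = {"site-a": 990, "site-b": 120}
print(flag_volume_anomalies(yesterday, today))
# {'site-b': (500, 120), 'site-c': (200, 0)}
```

Only the flagged sites go to a person for review, so the daily manual workload scales with the number of failures rather than with the number of sites.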

9. You will need bare metal servers: just look for the cheapest ones, that is all. Preferably in the region where the sites you parse are hosted (ours are in a data center in Moscow).

10. You will need a hosting provider with unlimited traffic. There is nothing to add. No “clouds” billed per gigabyte.

11. Never agree to scrape images. Deliver only links to the pictures on the source sites. The issues are copyright and, most importantly, data volume. There will be many sites, and you simply will not cope with the volume.

12. You will need a hosting provider that handles abuse complaints reasonably. About once a quarter the provider will receive an abuse complaint from a site you parse. It is not much fun if the provider simply switches off your servers, is it? So agree on this up front.

13. Do not parse personal data. You will be asked regularly; do not agree. Why? It will not be a business but a short-lived hustle. There is a law, and it is strict. You must stay 99% within the law. Why exactly 99%? I leave 1% for the nuances you will discuss privately with customers (it happens in different ways, believe me).

14. You will constantly be asked to build databases for spam. Every day we receive 5-6 requests to compile company databases, and every single one wants the database to include personal contacts of the decision maker (CEO, marketing director, etc.). There is no solution here; see the previous point about personal data.

15. Parsing is not rocket science. There are plenty of ready-made libraries, especially for Python, and I am sure anyone can provide parsing services in the B2B segment. What is your competitive advantage? Only reputation and team (I know I sound like Captain Obvious, but that is how it is).

16. The programming language does not matter. There is nothing to add. The client does not care at all what you program in.

17. Do not agree to requests to build a “parsing program”. Parsing is a service. We are regularly asked to “write a script that parses on my PC”. We refuse. Why? I think it is clear by now: you will be buried in support, because after any change in the site markup the “script” stops working.

18. Mobile proxies are your “everything”. There is nothing to add. I advise having a couple of suppliers.
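One way to use a couple of suppliers side by side is a simple round-robin pool that skips proxies marked as failing. The sketch below only illustrates the idea; the class, its methods, and the proxy URLs are hypothetical and not tied to any particular supplier’s API.

```python
from itertools import cycle


class ProxyPool:
    """Round-robin over proxies from several suppliers, skipping bad ones."""

    def __init__(self, proxies):
        self._all = list(proxies)
        self._bad = set()
        self._cycle = cycle(self._all)

    def mark_bad(self, proxy):
        """Exclude a proxy, e.g. after repeated timeouts or blocks."""
        self._bad.add(proxy)

    def next_proxy(self):
        """Return the next healthy proxy, or raise if none are left."""
        for _ in range(len(self._all)):
            proxy = next(self._cycle)
            if proxy not in self._bad:
                return proxy
        raise RuntimeError("no healthy proxies left")


# Hypothetical endpoints, one per supplier, listed alternately so that
# consecutive requests go through different suppliers.
pool = ProxyPool([
    "http://supplier-a.example:8000",
    "http://supplier-b.example:8000",
])
first = pool.next_proxy()
second = pool.next_proxy()
```

If one supplier goes down, marking its proxies as bad keeps the parsers running on the other until the first recovers.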

19. People prefer to send scraping requests from personal email addresses. I have no explanation for this; just take it as a given. Even at large, well-known companies, employees often send parsing requests from their personal addresses.

20. There is a lot of support, believe me. No, not just a lot: heaps of it! Half of the team fixes failures (a site’s layout changed and parsing stopped), and the other half connects new sites.

21. Cloudflare can be bypassed. There is nothing to add here. The speed drops, that is a fact, but the data gets collected. Qrator (the guys are definitely great, they protect against DDoS) can be bypassed as well.

22. Captcha solving is your “everything”. There are plenty of services; choose any you like. Parsing slows down noticeably, which means you will have to turn some clients down, because people want everything parsed at once 🙂, and it does not work that way.

24. Western markets? It did not work out. After the first articles came out (links above), I received many proposals to develop parsing in Western markets (more precisely, global ones: Uruguay, Chile, Europe, etc.). Nothing came of it, and I cannot even clearly explain why; it is just a fact. I believe that global expansion requires a product, and parsing is a service.

25. There will be customers with 500 sites, and the price per site drops. We have a client for whom we need to scrape ~450 sites per month. We can connect at most 80 sites per month (not forgetting support for those already connected). At that volume the price per site falls to 2000 rubles per month, while the amount of work stays just as large.
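The numbers in this point can be put together in a few lines. This is rough arithmetic under two assumptions that are mine, not facts about the contract: the whole capacity of 80 new sites per month goes to this one client, and every connected site is billed at the reduced price.

```python
import math


def months_to_onboard(total_sites, max_new_sites_per_month):
    """Months needed to connect all sites at the maximum onboarding rate."""
    return math.ceil(total_sites / max_new_sites_per_month)


def monthly_revenue_rub(total_sites, price_per_site_rub):
    """Monthly revenue once every site is connected and billed."""
    return total_sites * price_per_site_rub


print(months_to_onboard(450, 80))      # 6
print(monthly_revenue_rub(450, 2000))  # 900000
```

So a client like this ties up the full onboarding capacity for about half a year before reaching the full monthly bill.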

That’s all for now. I hope it was useful and interesting. You can find more in my personal Telegram channel “Russian IT business”, where I write about the whole “underside” of what we run into at work, without embellishment. If I missed something, ask in the comments and I will definitely answer.

P.S. I forgot to add point 26, selling scraping results to multiple clients. I hasten to disappoint you: in 90% of requests the parsing is unique and cannot be resold. Much as one would like…
