How to collect a database of organizations in an hour
Hello everyone, my name is Alexander Kalyrgin, and I currently work in data acquisition and analysis. I want to show how you can easily get a database of the organizations registered in the region you need.
In my work, I used data from open sources, namely:
Many thanks to the guys from ITSOFT, they are doing great work: data like this should be open. The Federal Tax Service of the Russian Federation, by contrast, charges 300,000 rubles per year for these archives.
Let’s get data on the organizations of the Sverdlovsk region operating in the construction industry.
So, let’s begin.
1) Get the input data
From ITSOFT's Unified State Register of Legal Entities (EGRUL) website, we download the Federal Tax Service data:
Archive of income and expenses for 2011-2020;
Archive of OKVED codes;
Archives of intermediate data (for the convenience of updating data);
Archives of organizations and updates to them.
From the archive of these organizations we obtain the following table:
We join it with the archive of income and expenses, keeping only the values for 2020.
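A minimal sketch of this join in plain Python, assuming both archives have already been loaded into lists of dictionaries keyed by OGRN. The column names (`ogrn`, `year`, `income`) and the sample rows are hypothetical; the real archives may be laid out differently.

```python
# Hypothetical sample rows; the real archive columns may be named differently.
organizations = [
    {"ogrn": "1026600000001", "name": "Stroy-Ural LLC"},
    {"ogrn": "1026600000002", "name": "Ural-Trade LLC"},
    {"ogrn": "1026600000003", "name": "No-Report LLC"},
]
finances = [
    {"ogrn": "1026600000001", "year": 2019, "income": 500_000},
    {"ogrn": "1026600000001", "year": 2020, "income": 1_200_000},
    {"ogrn": "1026600000002", "year": 2020, "income": 300_000},
]


def attach_income(orgs, finance_rows, year):
    """Left-join organizations with their income for one reporting year."""
    income_by_ogrn = {
        row["ogrn"]: row["income"] for row in finance_rows if row["year"] == year
    }
    # Organizations without a record for that year get income=None.
    return [{**org, "income": income_by_ogrn.get(org["ogrn"])} for org in orgs]


joined = attach_income(organizations, finances, 2020)
for row in joined:
    print(row["name"], row["income"])
```

Keeping organizations that have no 2020 record (with `None` as income) makes it easy to see how many rows the later income filter drops.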
In the archive of OKVED codes, we look for identifiers that correspond to the construction industry (3327 – 3286).
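One way to find these identifiers programmatically is sketched below. It assumes the archive maps an internal `id` to an OKVED 2 `code`, where construction is section F (classes 41, 42 and 43). The field names, the sample rows and their identifiers are hypothetical and do not correspond to the identifiers quoted above.

```python
# OKVED 2, section F "Construction" covers classes 41, 42 and 43.
CONSTRUCTION_CLASSES = {"41", "42", "43"}

# Hypothetical rows; the ids here are illustrative only.
okved_rows = [
    {"id": 1, "code": "41.20", "name": "Construction of buildings"},
    {"id": 2, "code": "43.21", "name": "Electrical installation"},
    {"id": 3, "code": "46.73", "name": "Wholesale of building materials"},
]


def construction_ids(rows):
    """Return the ids whose OKVED class (the part before the first dot) is 41-43."""
    return {
        row["id"]
        for row in rows
        if row["code"].split(".")[0] in CONSTRUCTION_CLASSES
    }


ids = construction_ids(okved_rows)
print(sorted(ids))
```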
We filter the data by the following parameters:
Compliance with OKVED;
Region – Sverdlovsk region (66);
Activity end date – must be unset (0000-00-00), i.e. the organization is still active;
Income – above 600,000 rubles for 2020.
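The four filters above can be sketched as a single predicate. The field names (`okved_id`, `region`, `end_date`, `income_2020`) and the set of construction identifiers are assumptions made for illustration; only the region code, the empty date value and the income threshold come from the list above.

```python
CONSTRUCTION_IDS = {1, 2}        # hypothetical OKVED identifiers
REGION_SVERDLOVSK = "66"         # region code from the article
NO_END_DATE = "0000-00-00"       # organization is still active
MIN_INCOME_2020 = 600_000        # rubles


def matches(row):
    """True if the organization passes all four filters."""
    income = row["income_2020"]
    return (
        row["okved_id"] in CONSTRUCTION_IDS
        and row["region"] == REGION_SVERDLOVSK
        and row["end_date"] == NO_END_DATE
        and income is not None
        and income > MIN_INCOME_2020
    )


# Hypothetical sample rows, one failing each filter in turn.
rows = [
    {"name": "A", "okved_id": 1, "region": "66", "end_date": "0000-00-00", "income_2020": 900_000},
    {"name": "B", "okved_id": 3, "region": "66", "end_date": "0000-00-00", "income_2020": 900_000},
    {"name": "C", "okved_id": 1, "region": "77", "end_date": "0000-00-00", "income_2020": 900_000},
    {"name": "D", "okved_id": 2, "region": "66", "end_date": "2019-05-01", "income_2020": 900_000},
    {"name": "E", "okved_id": 2, "region": "66", "end_date": "0000-00-00", "income_2020": 600_000},
]

selected = [row for row in rows if matches(row)]
print([row["name"] for row in selected])
```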
After these steps, we delete duplicate records and unnecessary columns in the table.
That already leaves 2758 organizations!
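The clean-up step (removing duplicates and unnecessary columns) might look like this. It is a sketch that assumes the OGRN uniquely identifies an organization; the column names are hypothetical.

```python
def dedupe_and_trim(rows, key, keep_columns):
    """Keep the first row for each key value and only the listed columns."""
    seen = set()
    result = []
    for row in rows:
        if row[key] in seen:
            continue  # duplicate record, skip it
        seen.add(row[key])
        result.append({column: row[column] for column in keep_columns})
    return result


# Hypothetical sample rows with one duplicate and one extra column.
rows = [
    {"ogrn": "1026600000001", "name": "Stroy-Ural LLC", "internal_id": 10},
    {"ogrn": "1026600000001", "name": "Stroy-Ural LLC", "internal_id": 11},
    {"ogrn": "1026600000004", "name": "Ural-Beton LLC", "internal_id": 12},
]

clean = dedupe_and_trim(rows, key="ogrn", keep_columns=["ogrn", "name"])
print(len(clean), "organizations")
```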
2) Okay, now enrich the data
We parse email addresses, websites and phone numbers from the Checko website. We do this by substituting the OGRN or TIN into the base search link: “/search?query=”. I advise setting the delay between requests to 0.5 seconds so that the responses come back correct.
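A rough sketch of the request loop, using only the standard library. The search path and the 0.5 second delay come from the text above; the host is left as a parameter because it is not spelled out here, and `https://example.invalid` is a placeholder. The `fetch` callable and the email regular expression are assumptions for illustration, since the real page markup is not described.

```python
import re
import time
from urllib.parse import quote

SEARCH_PATH = "/search?query="   # base search link from the article
DELAY_SECONDS = 0.5              # recommended pause between requests

# Simplified pattern; real pages may need a proper HTML parser.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def build_search_url(base_url, identifier):
    """Substitute an OGRN or TIN into the search link."""
    return f"{base_url.rstrip('/')}{SEARCH_PATH}{quote(str(identifier))}"


def fetch_all(identifiers, base_url, fetch, delay=DELAY_SECONDS):
    """Fetch one page per identifier, pausing between consecutive requests."""
    pages = {}
    for index, identifier in enumerate(identifiers):
        if index > 0:
            time.sleep(delay)
        pages[identifier] = fetch(build_search_url(base_url, identifier))
    return pages


def extract_emails(html):
    """Return all email-like strings found in the page text."""
    return EMAIL_RE.findall(html)


print(build_search_url("https://example.invalid", "1026600000001"))
```

For a real run, `fetch` could be something like `lambda url: urllib.request.urlopen(url, timeout=10).read().decode("utf-8")`. Passing it in as an argument keeps the loop testable without network access.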
We merge the collected data with the main table. Voila! You now have an up-to-date database of construction organizations in the Sverdlovsk region!
It took me about 1 hour to build this database, including the parser's run time. In total, 1554 organizations had contact information.
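The merge with the main table, and the count of organizations that ended up with contact information, can be sketched as follows. This assumes the parsed contacts are stored in a dictionary keyed by OGRN; the field names and sample values are hypothetical.

```python
CONTACT_FIELDS = ("email", "website", "phone")


def merge_contacts(orgs, contacts_by_ogrn):
    """Attach parsed contact fields to each organization (None when missing)."""
    merged = []
    for org in orgs:
        contacts = contacts_by_ogrn.get(org["ogrn"], {})
        merged.append({**org, **{f: contacts.get(f) for f in CONTACT_FIELDS}})
    return merged


def has_contacts(row):
    """True if at least one contact field is filled in."""
    return any(row.get(field) for field in CONTACT_FIELDS)


# Hypothetical sample data.
orgs = [
    {"ogrn": "1026600000001", "name": "Stroy-Ural LLC"},
    {"ogrn": "1026600000004", "name": "Ural-Beton LLC"},
]
contacts_by_ogrn = {
    "1026600000001": {"email": "info@stroy.example", "phone": "+7 000 000-00-00"},
}

database = merge_contacts(orgs, contacts_by_ogrn)
with_contacts = [row for row in database if has_contacts(row)]
print(f"{len(with_contacts)} of {len(database)} organizations have contact information")
```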
I hope the article was interesting.