Making business transparent or another example of captcha recognition

It’s no secret that captchas are a popular way to reduce the load on a site and keep robots from downloading information. Now that captchas appear on almost every site, let’s look at how to bypass one on the “Transparent Business” service.

What is “Transparent Business”?

The service contains comprehensive information about the financial and legal parameters of legal entities (more here). This information is useful both to organizations that want to check their counterparties and to data scientists who, for example, collect statistics and build infographics for a particular region or the country as a whole.

Before we start, let’s assume that we already have a list of TINs (taxpayer identification numbers) to query the Service with. If not, they can be extracted from the open data of the Federal Tax Service or here.

On your marks

Let’s start with a POST request to https://pb.nalog.ru/search-proc.json. The request body looks like this:

payload = {'page': '1', 'pageSize': '10', 'pbCaptchaToken': '', 'token': '', 'mode': 'search-ul', 'queryAll': '',
           'queryUl': 'TIN TO QUERY', 'okvedUl': '', 'statusUl': '', 'regionUl': '', 'isMspUl': '', 'queryIp': '', 'okvedIp': '', 'statusIp': '',
           'regionIp': '', 'isMspIp': '', 'mspIp1': '1', 'mspIp2': '2', 'mspIp3': '3', 'queryUpr': '', 'uprType1': '1', 'uprType0': '1',
           'queryRdl': '', 'dateRdl': '', 'queryAddr': '', 'regionAddr': '', 'queryOgr': '', 'ogrFl': '1', 'ogrUl': '1', 'npTypeDoc': '1',
           'ogrnUlDoc': '', 'ogrnIpDoc': '', 'nameUlDoc': '', 'nameIpDoc': '', 'formUlDoc': '', 'formIpDoc': '', 'ifnsDoc': '',
           'dateFromDoc': '', 'dateToDoc': ''}

However, if we poll the Service too often, we get a lovely picture (the captcha), which is available at a link like this:

https://pb.nalog.ru/static/captcha.bin?r=1664389287469&a=B19F70E11E1ED39188D369F4F698A07A3EF963834C354814A1500080C8EA265EE3109E11270E79BDD6E154DCB897E1B5&version=2

The key here is the parameter version=2. Let’s replace it with version=3 and see what we get:

Already better, right?
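The version swap can be done by editing the query string rather than the raw text of the link. Below is a minimal sketch using only the standard library; the helper name with_version is made up for illustration, and the short link in the usage example is a shortened stand-in for a real captcha link.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_version(captcha_url, version):
    """Return captcha_url with its 'version' query parameter replaced."""
    parts = urlsplit(captcha_url)
    # Keep every parameter as is, except 'version'.
    query = [(key, version if key == "version" else value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)]
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_version("https://pb.nalog.ru/static/captcha.bin?r=1&a=ABC&version=2", "3"))
```

The other parameters (r and a) are passed through unchanged, so the link still points to the same captcha.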

Next come the preliminary transformations: we remove the background, clean out the noise, and reduce the image to monochrome. The result looks like this:

Method Code
    def clean_image(self):
        # self.img_a holds the image as rows of mutable three-channel pixels
        # (presumably a NumPy array of shape height x width x 3)
        for row in self.img_a:
            for px in row:
                # Remove the light background
                if px[0] > 100 and px[1] > 100 and px[2] > 100:
                    px[0], px[1], px[2] = 255, 255, 255
                # Clean out the stripes
                if 27 <= px[0] <= 97 and 52 <= px[1] <= 104 and 48 <= px[2] <= 117:
                    px[0], px[1], px[2] = 255, 255, 255
                # Make everything that is left a single color (black)
                if not (px[0] == 255 and px[1] == 255 and px[2] == 255):
                    px[0], px[1], px[2] = 0, 0, 0

Now we need to cut the picture into digits. The idea is this: we scan the image from left to right, and the center of the first vertical run of 5 consecutive black pixels becomes the entry point. From there we recursively find all adjacent black pixels. The bounds of the resulting rectangle are the bounds of our digit. Repeating the procedure six times (once per digit), we get six mini-pictures (arrays), one for each digit.

Sometimes (rather rarely) two digits end up in one array. We detect such cases by the inflated width of the rectangle (more than 35 pixels), split the rectangle in half, and hope that we get lucky.
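The cutting procedure can be sketched roughly as follows. This is not the code from the repository, just an illustration under several assumptions: the cleaned image is a list of rows with 0 for black and 255 for white, "adjacent" means the four direct neighbours, and the names find_seed, flood_box and digit_boxes are invented here. The search for adjacent pixels uses an explicit stack instead of recursion so that large blobs do not hit Python's recursion limit.

```python
def find_seed(img, min_run=5):
    """Scan columns left to right and return (row, col) at the centre of the
    first vertical run of min_run black pixels, or None if there is none."""
    height, width = len(img), len(img[0])
    for x in range(width):
        run = 0
        for y in range(height):
            run = run + 1 if img[y][x] == 0 else 0
            if run == min_run:
                return y - min_run // 2, x
    return None


def flood_box(img, seed):
    """Erase the black blob connected to seed (4-neighbour connectivity) and
    return its inclusive bounding box as (left, top, right, bottom)."""
    height, width = len(img), len(img[0])
    top = bottom = seed[0]
    left = right = seed[1]
    stack = [seed]
    while stack:
        y, x = stack.pop()
        if not (0 <= y < height and 0 <= x < width) or img[y][x] != 0:
            continue
        img[y][x] = 255  # paint white so the pixel is not visited again
        left, right = min(left, x), max(right, x)
        top, bottom = min(top, y), max(bottom, y)
        stack.extend([(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)])
    return left, top, right, bottom


def digit_boxes(img, count=6, max_width=35):
    """Collect bounding boxes of up to `count` digits, splitting boxes wider
    than max_width in half on the guess that two digits stuck together."""
    work = [row[:] for row in img]  # work on a copy, keep the input intact
    boxes = []
    while len(boxes) < count:
        seed = find_seed(work)
        if seed is None:
            break
        left, top, right, bottom = flood_box(work, seed)
        if right - left + 1 > max_width:
            mid = (left + right) // 2
            boxes += [(left, top, mid, bottom), (mid + 1, top, right, bottom)]
        else:
            boxes.append((left, top, right, bottom))
    return boxes
```

Each found blob is erased from the working copy, so the next scan lands on the next digit. Noise specks that never form a vertical run of 5 pixels are simply never picked as an entry point.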

Practice shows that the average size of a digit picture is 24 by 44 pixels, so we use the pillow library to resize every picture to these dimensions.
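With pillow the resizing step might look like the sketch below. It assumes a digit picture stored as a list of rows of 0/255 values; the function name normalize_digit and the choice of nearest-neighbour resampling (which keeps the picture strictly black and white) are illustrative choices, not taken from the repository.

```python
from PIL import Image


def normalize_digit(pixels, size=(24, 44)):
    """Turn a list of rows of 0/255 values into a grayscale Pillow image
    resized to `size`, given as (width, height)."""
    height, width = len(pixels), len(pixels[0])
    img = Image.new("L", (width, height))
    img.putdata([value for row in pixels for value in row])  # row-major order
    return img.resize(size, Image.NEAREST)


digit = normalize_digit([[0] * 10 for _ in range(20)])
print(digit.size)
```

Note that Pillow takes the size as (width, height), while an array built from the image has shape (height, width), that is, 44 rows by 24 columns.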

We train the neural network

We have learned how to get a captcha and cut it into sensible digits. But how do we build a dataset for training the neural network? Here the Service (pb.nalog.ru) itself comes to our aid. It’s simple: every time we download a captcha, we get a new image with the same digits. In other words, if we download 10,000 captchas, we will have 10,000 variants of, for example, the digit 8. This twist saves us from labeling the data by hand, and that’s good!

Let’s take a simple model with five hidden layers as the neural network. This is more than enough to get 99% accuracy.

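# Assumption: the Keras bundled with TensorFlow (the import is not shown in the original)
from tensorflow import keras

model = keras.Sequential([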
    keras.layers.Flatten(input_shape=(44, 24)),
    keras.layers.Dense(528, activation='tanh'),
    keras.layers.Dense(264, activation='tanh'),
    keras.layers.Dense(132, activation='tanh'),
    keras.layers.Dense(66, activation='tanh'),
    keras.layers.Dense(33, activation='tanh'),
    keras.layers.Dense(10,  activation='softmax')
])

model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=['accuracy'])
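Since the model is compiled with categorical_crossentropy, each label has to be a one-hot vector of length 10 rather than a plain digit. Here is a small sketch of how labeled samples could be assembled, assuming that several renderings of one captcha share the same known code and that each rendering has already been cut into digit pictures; the names one_hot and labelled_samples are invented for this example.

```python
def one_hot(digit, classes=10):
    """Encode a digit 0..9 as a one-hot list, e.g. 3 -> [0, 0, 0, 1, 0, ...]."""
    vector = [0.0] * classes
    vector[digit] = 1.0
    return vector


def labelled_samples(code, renderings):
    """Pair every digit picture with its one-hot label.

    code       -- the digits shared by all renderings, e.g. "408172"
    renderings -- list of renderings, each a list of digit pictures
                  in left-to-right order
    """
    samples = []
    for digits in renderings:
        if len(digits) != len(code):
            continue  # cutting failed for this rendering, skip it
        for picture, char in zip(digits, code):
            samples.append((picture, one_hot(int(char))))
    return samples
```

The pictures and labels can then be stacked into arrays and passed to model.fit.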

The trained model is also available on GitHub.

Getting data

So, once we can convert the image into a six-digit code, we obtain the captcha token with a POST request to https://pb.nalog.ru/captcha-proc.json:

payload = {'captcha': captcha_code}  # captcha_code: the recognized captcha digits

Then we return to the https://pb.nalog.ru/search-proc.json request (the one where it all began), only now we pass the received captcha token in the pbCaptchaToken field.

The response contains the organization’s token. We pass it in the body of a request to https://pb.nalog.ru/company-proc.json:

payload = {'token': org_token, 'method': 'get-request'}  # org_token: the organization's token

Here the Service will most likely show another captcha. Yes, to get to the data, the captcha has to be solved twice. But the procedure is already familiar! We solve the captcha, pass the captcha token in the pbCaptchaToken field, and in response we get a new token and an id.

We send a request to https://pb.nalog.ru/company-proc.json again, only now the method is get-response:

payload = {'token': new_token, 'id': id, 'method': 'get-response'}  # new_token and id come from the previous response

Finally, if everything went well, we get a JSON with the organization’s data. What to do with it is up to you.
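Put together, the chain of requests might be sketched like this. Everything except the three URLs and the field names is an assumption: post(url, payload) is a hypothetical transport helper that sends the POST and returns the parsed response, solve_captcha() is a hypothetical function that downloads and recognizes a captcha, org_token_from() is a hypothetical extractor (the layout of the search response is not shown in this article), and the captcha endpoint is assumed to return the token itself. The second captcha is solved up front here instead of waiting for the Service to ask for it.

```python
SEARCH_URL = "https://pb.nalog.ru/search-proc.json"
CAPTCHA_URL = "https://pb.nalog.ru/captcha-proc.json"
COMPANY_URL = "https://pb.nalog.ru/company-proc.json"


def captcha_token(post, solve_captcha):
    """Exchange a recognized captcha for a captcha token.
    Assumption: the parsed response is the token itself."""
    return post(CAPTCHA_URL, {"captcha": solve_captcha()})


def fetch_company(tin, base_payload, post, solve_captcha, org_token_from):
    """Walk the chain: search by TIN, get-request, get-response.
    base_payload is the full search payload shown at the start."""
    # Search by TIN, forwarding a fresh captcha token.
    search_payload = dict(base_payload, queryUl=tin,
                          pbCaptchaToken=captcha_token(post, solve_captcha))
    org_token = org_token_from(post(SEARCH_URL, search_payload))

    # get-request with a second solved captcha.
    request = post(COMPANY_URL, {
        "token": org_token,
        "method": "get-request",
        "pbCaptchaToken": captcha_token(post, solve_captcha),
    })

    # get-response with the new token and id returns the organization data.
    return post(COMPANY_URL, {
        "token": request["token"],
        "id": request["id"],
        "method": "get-response",
    })
```

Passing post and solve_captcha in as arguments keeps the sketch independent of any particular HTTP library and makes it easy to try out with stub functions before pointing it at the real Service.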

Link to GitHub

That’s all!
