Collecting data behind a captcha with PyAutoGUI, Keras and TensorFlow

There are many ways to automatically collect and process large amounts of data from web resources. Sometimes, however, the usual approach (code that sends GET requests, parses the returned HTML and converts it to the required format) does not work, and neither do the related methods. In such cases a user-action emulator (a “clicker”, “bot” or “robot”) can come to the rescue.

The task was as follows: obtain information from a specific web resource (let’s omit its name) for a list of values stored in an Excel file of about 2000 rows. Since automated parsing and scraping were not possible here and time was limited, a mouse and keyboard emulator was a suitable solution.

For a simple emulator to work, we need the following libraries:

import time, pyperclip
import pandas as pd
import pyautogui
import os
import re

But there is one hurdle: a captcha has to be entered, which looks like this:

Many machine learning algorithms for working with captchas can be found on the Internet, and we will use one of them. Our choice is described here: https://habr.com/ru/post/464337/

All that remains for us is to train this ready-made algorithm. First, we download captcha images for the training set:

import urllib.request

HEADERS = {'accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
          'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'}

# urlretrieve() has no headers argument, so install an opener that sends HEADERS
opener = urllib.request.build_opener()
opener.addheaders = list(HEADERS.items())
urllib.request.install_opener(opener)

url = r'the link to the resource was here'  # placeholder
for i in range(1, 1001):
    urllib.request.urlretrieve(url, r"path for saving files here/number" + str(i) + '.jpg')

As the code shows, 1000 images were downloaded to train the model. They then need to be processed: cut into individual digits and distributed into directories according to the digit shown. Let’s use the code below to cut them:

import glob
import os
from PIL import Image

SAVE_DIR = r"save path"  # placeholder: directory for the cropped digits

# Crop boxes for the five digits: (left, top, right, bottom)
AREAS = [(2, 0, 29, 45), (29, 0, 51, 45), (51, 0, 71, 45), (71, 0, 91, 45), (91, 0, 110, 45)]

for file in glob.glob(r'*.jpg'):
    img = Image.open(file)
    filename = os.path.splitext(os.path.basename(file))[0]
    for n, area in enumerate(AREAS, start=1):
        # Each digit gets its own suffix (1 to 5) so the crops do not overwrite each other
        img.crop(area).save(os.path.join(SAVE_DIR, filename + str(n) + ".jpg"))

To train the model, we set up the following directory structure:

In dat, we sorted the digits into folders named 0 to 9, so that each folder contains images of the corresponding digit. For example, 0 (zero):

In total, there were 500 images in each of the folders.
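
Before training, it can be worth checking that the classes really are balanced. A minimal sketch, assuming the layout described above (a root folder such as dat with one subfolder per digit holding .jpg crops); the function name count_images_per_class is ours and is not part of the original project:

```python
import os


def count_images_per_class(root):
    """Return {subfolder_name: number_of_jpg_files} for each subfolder of root."""
    counts = {}
    for name in sorted(os.listdir(root)):
        folder = os.path.join(root, name)
        if os.path.isdir(folder):
            counts[name] = sum(
                1 for f in os.listdir(folder) if f.lower().endswith(".jpg")
            )
    return counts


if __name__ == "__main__":
    # "dat" is the dataset folder name used in this article
    if os.path.isdir("dat"):
        for label, n in count_images_per_class("dat").items():
            print(label, n)
```

A folder with far fewer images than the others would be a hint that some crops were sorted into the wrong place.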

train_simple.py is the script used to train the model.

And most importantly, there is the output directory, where the trained model will be saved.

The preparatory work is done; all that remains is to train the model. Open the command line, go to the project directory and enter the command:

python train_simple.py --dataset dat --model output/simple_nn.model --label-bin output/simple_nn_lb.pickle --plot output/simple_nn_plot.png
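
train_simple.py comes from the linked article and is not reproduced here. To illustrate what the flags in the command mean, here is a sketch of how such options could be parsed with argparse. The help texts and the parse_args name are our assumptions, not the script's actual source:

```python
import argparse


def parse_args(argv=None):
    """Parse the four path options used in the training command (illustrative only)."""
    parser = argparse.ArgumentParser(description="Train a simple digit classifier")
    parser.add_argument("--dataset", required=True,
                        help="folder with one subfolder of images per digit")
    parser.add_argument("--model", required=True,
                        help="path where the trained model is saved")
    parser.add_argument("--label-bin", required=True,
                        help="path where the pickled label data is saved")
    parser.add_argument("--plot", required=True,
                        help="path where the training plot is saved")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(args.dataset, args.model, args.label_bin, args.plot)
```

Note that argparse exposes --label-bin as args.label_bin, with the hyphen turned into an underscore.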

The captcha recognition model is trained. We see the following result:

This means that an accuracy of 98.2% was achieved on the training set, 97.8% on the validation set, and 97.8% on the test set. We focus on the last value. Let’s look at the visual interpretation of the model in the simple_nn_plot.png file in the output directory:

Keep in mind that no model gives a 100% result. Overall, recognizing the 5 digits takes about 10 seconds.
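
The per-digit accuracy also shows why repeated attempts are needed. As a rough estimate, assuming the five digits are recognized independently (a simplification that the article does not measure), the chance of reading a whole captcha correctly is the per-digit accuracy raised to the fifth power:

```python
# Rough estimate; independence between digit errors is an assumption.
digit_accuracy = 0.978      # test-set accuracy reported above
digits_per_captcha = 5

captcha_accuracy = digit_accuracy ** digits_per_captcha
expected_attempts = 1 / captcha_accuracy  # mean of a geometric distribution

print(f"whole-captcha accuracy: {captcha_accuracy:.3f}")          # 0.895
print(f"expected attempts per captcha: {expected_attempts:.2f}")  # 1.12
```

Under this assumption roughly one captcha in ten is read incorrectly, so the emulator has to be able to try again.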

So, let’s connect the recognition file to our main code:

import sys
sys.path.append('path to the directory that contains the directory with the trained model')
import captcha
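
The captcha module itself is not shown in this article. Below is a minimal sketch of what a function such as captcha.content() might do, assuming it crops the same five digit areas used for training and classifies each crop. The names recognize_captcha, DIGIT_AREAS and the injected predict_digit callable are hypothetical; predict_digit stands in for the call into the trained Keras model:

```python
from typing import Callable, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)

# Same crop boxes that were used to cut the training images
DIGIT_AREAS: Sequence[Box] = (
    (2, 0, 29, 45), (29, 0, 51, 45), (51, 0, 71, 45),
    (71, 0, 91, 45), (91, 0, 110, 45),
)


def recognize_captcha(img, predict_digit: Callable[[object], int],
                      areas: Sequence[Box] = DIGIT_AREAS) -> str:
    """Crop each digit area from img and join the predicted digits into a string.

    img is any object with a PIL-style crop(box) method; predict_digit is a
    placeholder for the trained model and must return a digit for one crop.
    """
    return "".join(str(predict_digit(img.crop(box))) for box in areas)
```

Passing the classifier in as a callable keeps the cropping logic testable without loading the model.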

The next task is to find out all the necessary coordinates for the mouse emulator to work.

import time
import pyautogui
import keyboard

while True:
    if keyboard.is_pressed('space'):
        mouse_x, mouse_y = pyautogui.position()
        print(mouse_x, mouse_y)
        time.sleep(0.3)  # brief pause so a single key press does not print many lines

The idea is simple: point the mouse at the desired position, press the space bar and get the coordinates we need. Let’s move on to the main emulator code:

First, we write a helper function to simplify the rest of the code.

def click(x, y, n, key):
    pyautogui.click(x, y, button=key) 
    time.sleep(n)

time.sleep(2)
# Change to the project folder that contains the trained model
os.chdir(r'path to the project directory\project')

# Create two DataFrames: the first holds our input file, the second will hold the final information after the code has run.
df1 = pd.read_excel(r'path to the Excel file\значения_для_проверки_.xlsx')
df1 = df1[:]  # takes all rows; narrow the slice to process only part of the file
df2 = pd.DataFrame()

The logic is simple: a value from the column of our Excel file is typed into the input field on the web resource, then the captcha is recognized and entered, the check button is pressed, and the text result of the check is written to the second DataFrame. This builds up an Excel table with the final data taken from the web resource.

for i in range(len(df1['id'])):
    # 1) Input field for the value from the Excel table:
    click(174, 461, 1, 'left')
    pyautogui.typewrite(str(df1['id'][i]))  # types the value from row i of the "id" column
    time.sleep(2)

    # 2) Captcha handling block
    while True:
        click(150, 679, 1, 'left')  # captcha input field

        click(361, 673, 2, 'right')  # right-click on the image
        click(501, 688, 3, 'left')  # download
        click(582, 565, 5, 'left')  # save

        # The image is downloaded; now the captcha has to be recognized.
        # For this we use the module we created and imported earlier.
        numbers = captcha.content()

        click(150, 679, 3, 'left')  # captcha input field
        pyautogui.typewrite(numbers)  # types the recognized captcha

        time.sleep(3)
        click(323, 773, 5, 'left')  # "Check" button
        pyperclip.copy('')  # clear the clipboard

        # 3) Validation block. If the captcha was entered incorrectly, go back to the start of the loop
        pyautogui.click(140, 713, button='left', clicks=3)  # select the validation message
        time.sleep(3)
        pyautogui.hotkey('ctrl', 'c')  # copy
        valid = pyperclip.paste()
        time.sleep(1)

        os.remove("number.jpg")  # the downloaded image is deleted in either case
        # 'ведите' matches the site's "Введите ..." ("Enter ...") retry warning
        if not re.findall(r'ведите\s', valid):
            break

    # 4) The captcha was entered correctly, so save the values to variables.
    pyautogui.click(190, 156, button='left', clicks=3)
    time.sleep(2)
    pyautogui.hotkey('ctrl', 'c')
    time.sleep(2)
    result = pyperclip.paste()
    time.sleep(1)

    pyautogui.click(177, 187, button='left', clicks=3)
    time.sleep(2)
    pyautogui.hotkey('ctrl', 'c')
    time.sleep(2)
    description = pyperclip.paste()
    time.sleep(1)

    # 5) Result table block. Build a row for the new DataFrame.
    new_row = {'id': df1['id'][i], 'number': df1['text'][i], 'text': result, 'description': description}
    print(i, new_row)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df2 = pd.concat([df2, pd.DataFrame([new_row])], ignore_index=True)
    pyautogui.scroll(1000)  # scroll back to the top

The code contains several logical blocks:

• Working with the input field, using data from the original Excel spreadsheet. The DataFrame (df1) built from the original Excel table contains the “id” column, whose values are up to 10 characters long. On each loop iteration, the code takes the value from row i of the “id” column. The loop continues until the values in the column run out.

• Working with the captcha. We create an endless loop whose exit condition is a correctly entered captcha. We download the captcha image, call the function from the module prepared earlier, and then attempt the input. If the captcha is entered correctly, the loop is interrupted and we move on to the next logical block; if not, we return to the beginning of the loop. In both cases the downloaded image is deleted.

• A little about the input validation logic. After an unsuccessful input, a warning appears on the main page and does not disappear; moreover, the warning text can be copied, and that is what we use. To check whether the captcha was entered correctly, we copy the text at specific coordinates and check what it says. If it asks to retry the input, we return to the beginning of the loop.

• Data extraction block, reached after a correct captcha input. It’s simple: we copy the text we need at certain coordinates and save it to variables.

• Block forming the resulting table. A dictionary is created with the data we need; this dictionary is one row of the table.

We have now automated the data collection and connected a machine learning model for captcha recognition, which allowed us to speed up the work compared with doing it by hand.
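
The captcha loop described above runs until the site accepts the input, so a persistent failure (for example, a changed page layout) would hang the script. Here is a sketch of a bounded variant, assuming the rejection warning contains text matched by the pattern from the main loop. The names solve_with_retries, captcha_rejected, attempt and max_attempts are illustrative, and attempt stands in for the whole download, recognize, type and check sequence:

```python
import re
from typing import Callable

# Assumed pattern, based on the main loop: the site's warning begins with
# "Введите ..." ("Enter ...") when the captcha has to be entered again.
REJECTED = re.compile(r"ведите\s")


def captcha_rejected(page_text: str) -> bool:
    """Return True if the copied text contains the retry warning."""
    return REJECTED.search(page_text) is not None


def solve_with_retries(attempt: Callable[[], str], max_attempts: int = 5) -> bool:
    """Call attempt() until the copied text no longer shows the warning.

    attempt performs one full captcha try and returns the text copied from
    the validation area. Returns True on success, False if every try failed.
    """
    for _ in range(max_attempts):
        if not captcha_rejected(attempt()):
            return True
    return False
```

When the function returns False, the caller can skip the current row or stop the run instead of clicking forever.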

Good luck coding 😊