Parsing Amazon the Easy Way, Without Moms, Dads, or Mortgages

I came across a script on the Internet that parses product cards from Amazon, and I happened to need a solution to a similar problem.

I had been racking my brains over how to parse product cards on Amazon. The problem is that Amazon uses different layouts for different search results. If you search for “bags”, the cards are arranged vertically, which is what I need. If you search for “t-shirts”, the cards are arranged horizontally, and with that layout the script throws an error: it opens the page but never scrolls.

In addition, after reading various articles by users struggling to bypass the captcha on Amazon, I extended the script so that it can handle a captcha when one appears (it works through 2captcha). After each page load the script checks for a captcha. If it finds one, it sends a request to the 2captcha server, and once the solution arrives, it inserts the solution into the page and continues working.

However, bypassing the captcha is not the hardest part, since these days it is a fairly routine task. The more pressing question is how to make the script work not only with vertically arranged product cards but also with horizontally arranged ones.

Below I describe in detail what the script contains and demonstrate how it works. If you can suggest what to add or change so that it also handles horizontal cards, I will be grateful.

In the meantime, the script may still be useful to someone, even with its limited functionality.

So, let's break down the script piece by piece!

Preparation

First, the script imports the modules required to complete the task.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv
import os
from time import sleep
import requests

Let's take it apart piece by piece:

from selenium import webdriver

Imports the webdriver module, which lets the script control the browser (Firefox in my case).

from selenium.webdriver.common.by import By

Imports the By class, which the script uses to locate the elements to parse by XPath (it can locate elements by other attributes too, but here XPath is used).

from selenium.webdriver.common.keys import Keys

Imports the Keys class, which is used to simulate key presses. In this script that means scrolling down the page with Keys.PAGE_DOWN.

from selenium.webdriver.common.action_chains import ActionChains

Imports the ActionChains class for building sequences of actions. In our case it presses the PAGE_DOWN key, after which the script pauses so that the elements on the page can load (Amazon loads cards as you scroll).

from selenium.webdriver.support.ui import WebDriverWait

Imports the WebDriverWait class, which waits until the content we are looking for has loaded, for example a product description that we locate by XPath.

from selenium.webdriver.support import expected_conditions as EC

Imports the expected_conditions module (abbreviated EC), which works together with WebDriverWait and tells it which specific condition to wait for. This makes the script more reliable, because it does not start interacting with content that has not loaded yet.

import csv

Imports the csv module for working with CSV files.

import os

Imports the os module for working with the operating system (creating directories, checking whether files exist, and so on).

from time import sleep

Imports the sleep function, which pauses the script for a given time (2 seconds in my case, but you can set it longer) so that elements can load while scrolling.

import requests

Imports the requests library for sending HTTP requests, used here to talk to the 2captcha captcha-solving service.

Setup

Once everything is imported, the script moves on to configuration. First it sets the API key for accessing the 2captcha service:

# API key for 2Captcha
API_KEY = "Your API Key"

The script also defines a user agent (you can change it, of course), which is set for the browser. The browser is then launched with these settings.

user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"

options = webdriver.FirefoxOptions()
# Firefox takes the user agent as a preference rather than a command-line argument
options.set_preference("general.useragent.override", user_agent)

driver = webdriver.Firefox(options=options)

Next comes the function for solving the captcha. This is exactly the part people look for when they ask how to solve a captcha. We won't spend long on this piece of code, since it caused no particular problems.

In short, after each page load the script checks whether the page contains a captcha. If it does, the script solves it by sending it to the 2captcha server; if not, execution simply continues.

def solve_captcha(driver):
    try:
        captcha_element = driver.find_element(By.CLASS_NAME, 'g-recaptcha')
        if captcha_element:
            print("Captcha detected. Solving...")
            site_key = captcha_element.get_attribute('data-sitekey')
            current_url = driver.current_url
            
            # Send captcha request to 2Captcha
            captcha_id = requests.post(
                'http://2captcha.com/in.php', 
                data={
                    'key': API_KEY, 
                    'method': 'userrecaptcha', 
                    'googlekey': site_key, 
                    'pageurl': current_url
                }
            ).text.split('|')[1]

            # Wait for the captcha to be solved
            recaptcha_answer=""
            while True:
                sleep(5)
                response = requests.get(f"http://2captcha.com/res.php?key={API_KEY}&action=get&id={captcha_id}")
                if response.text == 'CAPCHA_NOT_READY':
                    continue
                if 'OK|' in response.text:
                    recaptcha_answer = response.text.split('|')[1]
                    break
            
            # Inject the captcha answer into the page
            driver.execute_script(f'document.getElementById("g-recaptcha-response").innerHTML = "{recaptcha_answer}";')
            driver.find_element(By.ID, 'submit').click()
            sleep(5)
            print("Captcha solved.")
    except Exception as e:
        print("No captcha found or error occurred:", e)
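
One thing worth noting about the polling loop: it only leaves the loop on an 'OK|' response, so any other reply would keep it polling. Here is a minimal sketch of a bounded version, assuming only the two response formats that already appear in the script ('CAPCHA_NOT_READY' and 'OK|...'). The names poll_for_answer, fetch, max_attempts, delay and wait are hypothetical, and fetch stands for any callable that returns the body of the res.php response as text.

```python
from time import sleep


def poll_for_answer(fetch, max_attempts=24, delay=5, wait=sleep):
    """Poll until the service returns 'OK|<token>' or the attempts run out.

    fetch: hypothetical hook, a callable returning the response body as a
           string (for example a lambda wrapping requests.get(...).text).
    wait:  function used to pause between attempts (time.sleep by default).
    """
    for _ in range(max_attempts):
        wait(delay)
        text = fetch().strip()
        if text == "CAPCHA_NOT_READY":
            continue  # the solution is not ready yet, ask again later
        if text.startswith("OK|"):
            return text.split("|", 1)[1]  # everything after the first '|'
        # Any other body is treated as an error instead of polling forever
        raise RuntimeError(f"Unexpected captcha service response: {text}")
    raise TimeoutError("Captcha was not solved within the allowed attempts")


# Simulated replies instead of real HTTP calls
replies = iter(["CAPCHA_NOT_READY", "OK|token-123"])
print(poll_for_answer(lambda: next(replies), wait=lambda seconds: None))
```

In the real script, fetch could wrap the existing requests.get call, and the returned token would be used in place of recaptcha_answer.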

Parsing

Next comes the part of the code that iterates over the result pages, loads each one, and scrolls through it.

try:
    base_url = "https://www.amazon.in/s?k=bags"

    for page_number in range(1, 10): 
        page_url = f"{base_url}&page={page_number}"

        driver.get(page_url)
        driver.implicitly_wait(10)

        solve_captcha(driver)

        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//span[@class="a-size-medium a-color-base a-text-normal"]')))

        for _ in range(5):  
            ActionChains(driver).send_keys(Keys.PAGE_DOWN).perform()
            sleep(2)
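
The f-string works fine for this particular base URL, but it would add a second page parameter if the base URL already had one, and it does not encode search queries that contain spaces. As a rough sketch, the page URL could instead be built with the standard urllib.parse module. The helper name build_page_url is my own and is not part of the original script.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_url(base_url, page_number):
    """Return base_url with its 'page' query parameter set to page_number."""
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))  # existing parameters, e.g. k=bags
    query["page"] = str(page_number)      # add or overwrite the page number
    return urlunsplit(parts._replace(query=urlencode(query)))


# Build the URLs for the first three result pages
page_urls = [build_page_url("https://www.amazon.in/s?k=bags", n) for n in range(1, 4)]
print(page_urls[0])  # https://www.amazon.in/s?k=bags&page=1
```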

The next part collects the product data, and it is the most important section. Here the script examines the loaded page and extracts the fields we specified: product name, number of reviews, price, URL, and rating.

        product_name_elements = driver.find_elements(By.XPATH, '//span[@class="a-size-medium a-color-base a-text-normal"]')
        rating_number_elements = driver.find_elements(By.XPATH, '//span[@class="a-size-base s-underline-text"]')
        star_rating_elements = driver.find_elements(By.XPATH, '//span[@class="a-icon-alt"]')
        price_elements = driver.find_elements(By.XPATH, '//span[@class="a-price-whole"]')
        product_urls = driver.find_elements(By.XPATH, '//a[@class="a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal"]')
        
        product_names = [element.text for element in product_name_elements]
        rating_numbers = [element.text for element in rating_number_elements]
        star_ratings = [element.get_attribute('innerHTML') for element in star_rating_elements]
        prices = [element.text for element in price_elements]
        urls = [element.get_attribute('href') for element in product_urls]
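
Since every XPath here is tied to one exact class string, a layout that uses different classes will simply match nothing. One possible direction, which I have not verified against Amazon's horizontal layout, is to try several candidate XPaths in order and use the first one that matches. The sketch below assumes a hypothetical helper find_with_fallback, and the second XPath is a placeholder, not a real Amazon class; it has to be replaced with whatever the browser's inspector shows.

```python
def find_with_fallback(find_elements, xpaths):
    """Return (xpath, elements) for the first XPath that matches anything.

    find_elements: callable taking an XPath string and returning a list,
                   e.g. lambda xp: driver.find_elements(By.XPATH, xp).
    """
    for xpath in xpaths:
        elements = find_elements(xpath)
        if elements:
            return xpath, elements
    return None, []


# The first XPath is the one used in the script. The second one is a
# HYPOTHETICAL placeholder for the horizontal layout.
TITLE_XPATHS = [
    '//span[@class="a-size-medium a-color-base a-text-normal"]',
    '//span[@class="REPLACE-WITH-GRID-TITLE-CLASS"]',
]

# Stand-in for a page where only the second layout is present
fake_page = {TITLE_XPATHS[1]: ["Card A", "Card B"]}
matched, titles = find_with_fallback(lambda xp: fake_page.get(xp, []), TITLE_XPATHS)
print(matched == TITLE_XPATHS[1], titles)
```

Keep in mind that with implicitly_wait(10) set, each XPath that matches nothing makes find_elements wait up to 10 seconds before the next candidate is tried. The explicit WebDriverWait earlier in the loop would need the same treatment, because it waits for the first XPath only.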

Next, the collected data is written to disk. A separate CSV file is created for each page and saved in the "output files" folder; if the folder does not exist, the script creates it.

        output_directory = "output files"
        if not os.path.exists(output_directory):
            os.makedirs(output_directory)
        
        with open(os.path.join(output_directory, f'product_details_page_{page_number}.csv'), 'w', newline="", encoding='utf-8') as csvfile:
            csv_writer = csv.writer(csvfile)
            csv_writer.writerow(['Product Urls', 'Product Name', 'Product Price', 'Rating', 'Number of Reviews'])
            for url, name, price, star_rating, num_ratings in zip(urls, product_names, prices, star_ratings, rating_numbers):
                csv_writer.writerow([url, name, price, star_rating, num_ratings])
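
A detail to be aware of: zip() stops at the shortest list, so if one card has no price or rating, rows are silently dropped from the end of the file. The following sketch uses made-up sample data to show the effect, along with itertools.zip_longest as one way to keep every row. Only three columns are used for brevity.

```python
import csv
import io
from itertools import zip_longest

# Made-up sample data: three cards, but only two price elements were found
urls = ["u1", "u2", "u3"]
names = ["Bag 1", "Bag 2", "Bag 3"]
prices = ["499", "899"]

# zip() stops at the shortest list, so the third product is dropped
print(len(list(zip(urls, names, prices))))  # 2

# zip_longest() keeps every row and fills the gaps with an empty string
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Product Urls", "Product Name", "Product Price"])
writer.writerows(zip_longest(urls, names, prices, fillvalue=""))
print(buffer.getvalue())
```

This only prevents rows from being lost. It cannot tell which card the missing value belonged to, so values can still end up next to the wrong product. Collecting all fields card by card is the more reliable way to keep them aligned.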

The final stage closes the browser and releases its resources.

finally:
    driver.quit()

Full script

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv
import os
from time import sleep
import requests

# API key for 2Captcha
API_KEY = "Your API Key"

# Set a custom user agent to mimic a real browser
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"

options = webdriver.FirefoxOptions()
# Firefox takes the user agent as a preference rather than a command-line argument
options.set_preference("general.useragent.override", user_agent)

driver = webdriver.Firefox(options=options)

def solve_captcha(driver):
    # Check for the presence of a captcha on the page
    try:
        captcha_element = driver.find_element(By.CLASS_NAME, 'g-recaptcha')
        if captcha_element:
            print("Captcha detected. Solving...")
            site_key = captcha_element.get_attribute('data-sitekey')
            current_url = driver.current_url
            
            # Send captcha request to 2Captcha
            captcha_id = requests.post(
                'http://2captcha.com/in.php', 
                data={
                    'key': API_KEY, 
                    'method': 'userrecaptcha', 
                    'googlekey': site_key, 
                    'pageurl': current_url
                }
            ).text.split('|')[1]

            # Wait for the captcha to be solved
            recaptcha_answer=""
            while True:
                sleep(5)
                response = requests.get(f"http://2captcha.com/res.php?key={API_KEY}&action=get&id={captcha_id}")
                if response.text == 'CAPCHA_NOT_READY':
                    continue
                if 'OK|' in response.text:
                    recaptcha_answer = response.text.split('|')[1]
                    break
            
            # Inject the captcha answer into the page
            driver.execute_script(f'document.getElementById("g-recaptcha-response").innerHTML = "{recaptcha_answer}";')
            driver.find_element(By.ID, 'submit').click()
            sleep(5)
            print("Captcha solved.")
    except Exception as e:
        print("No captcha found or error occurred:", e)

try:
    # Starting page URL
    base_url = "https://www.amazon.in/s?k=bags"

    for page_number in range(1, 2): 
        page_url = f"{base_url}&page={page_number}"

        driver.get(page_url)
        driver.implicitly_wait(10)

        # Attempt to solve captcha if detected
        solve_captcha(driver)

        # Explicit Wait
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//span[@class="a-size-medium a-color-base a-text-normal"]')))

        for _ in range(5):  
            ActionChains(driver).send_keys(Keys.PAGE_DOWN).perform()
            sleep(2)

        product_name_elements = driver.find_elements(By.XPATH, '//span[@class="a-size-medium a-color-base a-text-normal"]')
        rating_number_elements = driver.find_elements(By.XPATH, '//span[@class="a-size-base s-underline-text"]')
        star_rating_elements = driver.find_elements(By.XPATH, '//span[@class="a-icon-alt"]')
        price_elements = driver.find_elements(By.XPATH, '//span[@class="a-price-whole"]')
        product_urls = driver.find_elements(By.XPATH, '//a[@class="a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal"]')
        
        # Extract the text of each product name, rating count, star rating, price, and URL
        product_names = [element.text for element in product_name_elements]
        rating_numbers = [element.text for element in rating_number_elements]
        star_ratings = [element.get_attribute('innerHTML') for element in star_rating_elements]
        prices = [element.text for element in price_elements]
        urls = [element.get_attribute('href') for element in product_urls]
        
        sleep(5)        
        output_directory = "output files"
        if not os.path.exists(output_directory):
            os.makedirs(output_directory)
        
        with open(os.path.join(output_directory, f'product_details_page_{page_number}.csv'), 'w', newline="", encoding='utf-8') as csvfile:
            csv_writer = csv.writer(csvfile)
            csv_writer.writerow(['Product Urls', 'Product Name', 'Product Price', 'Rating', 'Number of Reviews'])
            for url, name, price, star_rating, num_ratings in zip(urls, product_names, prices, star_ratings, rating_numbers):
                csv_writer.writerow([url, name, price, star_rating, num_ratings])

finally:
    driver.quit()

Thus, the script works without errors, but only for vertical product cards. Here is an example of the script's operation.

I'll be glad to discuss it in the comments if you have something to say on the matter.
