Python: an assistant for finding cheap air tickets for those who love to travel

The author of the article we are publishing in translation today says her goal is to describe the development of a Python web scraper, built with Selenium, that searches for airline ticket prices. The search uses flexible dates (±3 days relative to the specified dates). The scraper stores the search results in an Excel file and emails the person who launched it a summary of what it managed to find. The goal of this project is to help travelers find the best deals.

If you feel lost while working through this material, take a look at this article.

What are we looking for?

You are free to use the system described here however you like. For example, I used it to search for weekend tours and for tickets to my hometown. If you are serious about finding the best tickets, you can run the script on a server (a simple server for 130 rubles a month is quite suitable for this) and have it run once or twice a day. The search results will be emailed to you. In addition, I recommend setting everything up so that the script saves the Excel file with the search results to a Dropbox folder, which lets you view such files from anywhere at any time.



I haven't found error fares yet, but I believe it is possible

As already mentioned, the search uses "flexible dates": the script finds offers that fall within three days of the given dates. Although a single run of the script searches for offers on only one route, it is easy to modify it to collect data for several flight routes. It can even be used to hunt for error fares; such finds can be very interesting.

Why do we need another web scraper?

Frankly, when I first started with web scraping, it was not particularly interesting to me. I wanted to do more projects in predictive modeling, financial analysis, and perhaps sentiment analysis of texts. But it turned out that figuring out how to create a program that collects data from websites is very interesting. As I delved into this topic, I realized that web scraping is the "engine" of the Internet.

Perhaps you think this is too bold a statement. But consider how Google started: with a web scraper that Larry Page created using Java and Python. Google's robots have been exploring the Internet ever since, trying to give users the best answers to their questions. Web scraping has countless applications, and even if your interest lies elsewhere in Data Science, you will need some scraping skills to acquire data for analysis.

I found some of the techniques used here in a great book about web scraping that I acquired recently. It contains many simple examples and ideas about practical applications of what you learn. There is also a very interesting chapter on bypassing reCaptcha checks. That was news to me: I didn't know there were special tools and even entire services for solving such problems.

Do you like to travel?!

The simple and fairly innocuous question in this section's heading is often answered positively, accompanied by a couple of stories from the trips of whoever was asked. Most of us would agree that traveling is a great way to immerse yourself in new cultural environments and broaden your horizons. However, if you ask someone whether they like searching for flights, I am sure the answer will be far less enthusiastic. This is where Python comes to the rescue.

The first task on the way to building a flight information search system is choosing a suitable platform to take the information from. This was not an easy choice for me, but in the end I settled on the Kayak service. I tried Momondo, Skyscanner, Expedia, and a few others, but the anti-bot protection on those resources was impenetrable. After several attempts during which, trying to convince the system that I was human, I had to deal with traffic lights, pedestrian crossings, and bicycles, I decided that Kayak suited me best, even though it too starts showing checks when too many pages are loaded in a short time. I managed to have the bot send requests to the site at intervals of 4 to 6 hours, and everything worked fine. Difficulties occasionally arise when working with Kayak, but if the checks start pestering you, you need to either handle them manually and then restart the bot, or wait a few hours until the checks stop. If necessary, you can easily adapt the code for another platform, and if you do, please report it in the comments.

If you are just getting started with web scraping and don't know why some websites fight against it, then before embarking on your first project in this area, do yourself a favor and google "web scraping etiquette". Your experiments may end sooner than you think if you scrape unwisely.

Getting started

Here is a general overview of what happens in our web scraper's code:

  • Import required libraries.
  • Open the Google Chrome tab.
  • Call the function that launches the bot, passing it the cities and dates that will be used when searching for tickets.
  • This function gets the first search results, sorted by the criterion of greatest attractiveness (best), and presses a button to load additional results.
  • Another function collects data from the entire page and returns a data frame.
  • The previous two steps are repeated for the sorts by ticket price (cheap) and by flight duration (fastest).
  • The script emails the user a brief summary of ticket prices (the cheapest ticket and the average price), and the data frame with the information sorted by the three criteria mentioned above is saved as an Excel file.
  • All of the above is repeated in a loop at a specified time interval.

Note that every Selenium project starts with a web driver. I use Chromedriver and work with Google Chrome, but there are other options: PhantomJS and Firefox are also popular. After downloading the driver, place it in the appropriate folder; that completes the preparation for using it. The first lines of our script open a new Chrome tab.

Keep in mind that in my story I am not trying to break new ground in finding lucrative flight deals. There are far more advanced methods of searching for such offers. I just want to offer readers of this material a simple but practical way to solve this problem.

Here is the code we talked about above.

from time import sleep, strftime
from random import randint
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import smtplib
from email.mime.multipart import MIMEMultipart

# Use your path to chromedriver here!
chromedriver_path = 'C:/{YOUR PATH HERE}/chromedriver_win32/chromedriver.exe'

driver = webdriver.Chrome(executable_path=chromedriver_path)  # This command opens a Chrome window
sleep(2)

At the beginning of the code you can see the package import commands used throughout the project. For instance, randint is used so that the bot "falls asleep" for a random number of seconds before starting a new search operation; usually no bot can do without this. If you run the code above, a Chrome window opens, which the bot will use to work with sites.

Let's do a small experiment and open the kayak.com website in a separate window. We choose the city we are flying from and the city we want to fly to, as well as the flight dates. When choosing the dates, make sure to use the ±3-day range. I wrote the code with the response the site gives to such requests in mind. If, for example, you need to search for tickets only for exact dates, you will most likely have to modify the bot's code. As I walk through the code I give explanations, but if you feel confused, let me know.

Now click the search button and look at the link in the address bar. It should resemble the link I use in the example below, where the kayak variable storing the URL is declared and passed to the web driver's get method. After clicking the search button, the results should appear on the page.
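
For example, a minimal sketch of those first lines might look like this; the city codes and dates are just the sample values used later in this article, and the URL format matches the one built in the start_kayak function below:

# A minimal sketch: build the search URL and open it (LIS, SIN and the dates are just example values)
kayak = ('https://www.kayak.com/flights/LIS-SIN/' +
         '2019-08-21-flexible/2019-09-07-flexible?sort=bestflight_a')
driver.get(kayak)
sleep(randint(8, 10))  # give the results page some time to load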


When I used the get command more than two or three times within a few minutes, I was asked to pass a reCaptcha test. You can pass this test manually and continue experimenting until the system decides to arrange a new one. When I tested the script, the first search session always seemed to run without problems, so if you want to experiment with the code, you will only have to check in manually from time to time and let the code run with long intervals between search sessions. And if you think about it, a person is unlikely to need ticket price information gathered at 10-minute intervals anyway.

Working with the page using XPath

So, we opened a window and loaded the site. To get prices and other information, we need to use XPath or CSS selectors. I decided to stick with XPath and did not feel the need for CSS selectors, but it is quite possible to work that way too. Navigating a page with XPath can be tricky, and even if you use the techniques I describe in this article, where the corresponding identifiers were copied from the page code, I realized that this is actually not the best way to reference the required elements. By the way, that book contains an excellent description of the basics of working with pages using XPath and CSS selectors. This is what the corresponding web driver method looks like.
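
Here is a minimal sketch of that call; the XPath string used here is just one of the selectors that appears later in the article:

# A minimal sketch of the web driver method mentioned above
xp_prices = '//a[@class="booking-link"]/span[@class="price option-text"]'
prices = driver.find_elements_by_xpath(xp_prices)  # returns a list of matching page elements
print(len(prices))  # how many price elements were found on the page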


So, let's continue working on the bot. Let's use it to select the cheapest tickets. In the following image, the XPath selector code is highlighted in red. To view the code, right-click the page element you are interested in and select the Inspect command from the menu that appears. This command can be invoked for different page elements, whose code will then be displayed and highlighted in the code window.



View page code

To back up my point about the drawbacks of copying selectors from the code, pay attention to the following.

Here's what happens when you copy the code:

//*[@id="wtKI-price_aTab"]/div[1]/div/div/div[1]/div/span/span

To copy something like this, right-click the code section you are interested in and select Copy > Copy XPath from the menu that appears.

Here is what I used to define the Cheapest button:

cheap_results = '//a[@data-code = "price"]'



Copy > Copy XPath command

Obviously, the second option looks much simpler. It searches for an element a whose data-code attribute equals price. The first option searches for an element whose id equals wtKI-price_aTab, and the XPath path to the element looks like /div[1]/div/div/div[1]/div/span/span. Such an XPath query will do its job, but only once. I can say right away that the id will change the next time the page loads. The character string wtKI changes dynamically each time the page loads, so code that relies on it becomes useless after the next page reload. So take the time to figure out XPath. This knowledge will serve you well.

However, it should be noted that copying XPath selectors can be useful when working with fairly simple sites, and if this suits you, there is nothing wrong with that.
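
For completeness, here is how the attribute-based selector shown above is actually used; the same call appears later in the start_kayak function:

# Switching to the cheapest results using the attribute-based selector
cheap_results = '//a[@data-code = "price"]'
driver.find_element_by_xpath(cheap_results).click()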

Now let's think about what to do if you want to get all the search results in several lines, inside a list. It's very simple. Each result is inside an object with the class resultWrapper. All the results can be loaded in a loop resembling the one shown below.

If you understand the above, you should have no trouble understanding most of the code we will be analyzing. As this code runs, we address what we need (namely, the element in which the result is wrapped) using some path specification mechanism (XPath). This is done to get the element's text and put it into an object from which the data can be read (flight_containers is used first, then flights_list).
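
Here is a minimal sketch of that loop; the exact XPath for the resultWrapper class is an assumption based on the description above:

# A minimal sketch of the loop described above (the resultWrapper XPath is an assumption)
flight_containers = driver.find_elements_by_xpath('//div[@class="resultWrapper"]')
flights_list = [flight.text for flight in flight_containers]  # plain text of each search result
for snippet in flights_list[:3]:
    print(snippet)  # print the first three results to see what we are working with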


The first three results are displayed, and we can clearly see everything we need. However, we have more interesting ways of obtaining the information: we need to extract the data from each element separately.

To work!

The easiest function to write is the one that loads additional results, so let's start with it. I want to maximize the number of flights the program gets information about without triggering the suspicions that lead to a check, so I click the Load more results button once each time the page is displayed. In this code, pay attention to the try block, which I added because sometimes the button does not load properly. If you run into this too, comment out the calls to this function in the code of the start_kayak function, which we will look at below.

# Loading more results to maximize the amount of data collected
def load_more():
    try:
        more_results = '//a[@class = "moreButton"]'
        driver.find_element_by_xpath(more_results).click()
        # Printing these notes while the program runs helps me quickly see what it is doing
        print('sleeping.....')
        sleep(randint(45, 60))
    except:
        pass

Now, after a lengthy analysis of this function (sometimes I get carried away), we are ready to declare the function that will scrape the page.

I have already collected most of what is needed in the following function, called page_scrape. Sometimes the returned data about the legs of the journey comes back combined; to separate it I use a simple approach, for example the variables section_a_list and section_b_list. The function returns a data frame, flights_df, which allows us to separate the results obtained with different sorting methods and combine them later.

def page_scrape():
    """This part takes care of the scraping part"""
    
    xp_sections = '//*[@class="section duration"]'
    sections = driver.find_elements_by_xpath(xp_sections)
    sections_list = [value.text for value in sections]
    section_a_list = sections_list[::2]  # this is how we split the information between the two legs
    section_b_list = sections_list[1::2]
    
    # If you run into reCaptcha, you may need to handle it.
    # You will know that something went wrong if the lists above are empty;
    # this if statement lets you shut the program down or do something else.
    # Here you could pause the script, pass the check manually and continue scraping.
    # I use SystemExit here because I want to restart everything from the very beginning.
    if section_a_list == []:
        raise SystemExit
    
    # I will use the letter a for the outbound flights and b for the return flights
    a_duration = []
    a_section_names = []
    for n in section_a_list:
        # the first two tokens are the duration, the rest are the cities
        a_section_names.append(''.join(n.split()[2:5]))
        a_duration.append(''.join(n.split()[0:2]))
    b_duration = []
    b_section_names = []
    for n in section_b_list:
        # the first two tokens are the duration, the rest are the cities
        b_section_names.append(''.join(n.split()[2:5]))
        b_duration.append(''.join(n.split()[0:2]))

    xp_dates = '//div[@class="section date"]'
    dates = driver.find_elements_by_xpath(xp_dates)
    dates_list = [value.text for value in dates]
    a_date_list = dates_list[::2]
    b_date_list = dates_list[1::2]
    # Getting the day and the day of the week
    a_day = [value.split()[0] for value in a_date_list]
    a_weekday = [value.split()[1] for value in a_date_list]
    b_day = [value.split()[0] for value in b_date_list]
    b_weekday = [value.split()[1] for value in b_date_list]
    
    # Getting the prices
    xp_prices = '//a[@class="booking-link"]/span[@class="price option-text"]'
    prices = driver.find_elements_by_xpath(xp_prices)
    prices_list = [price.text.replace('$', '') for price in prices if price.text != '']
    prices_list = list(map(int, prices_list))

    # stops is a big list where the first leg of the trip sits on even indices and the second on odd ones
    xp_stops = '//div[@class="section stops"]/div[1]'
    stops = driver.find_elements_by_xpath(xp_stops)
    stops_list = [stop.text[0].replace('n', '0') for stop in stops]
    a_stop_list = stops_list[::2]
    b_stop_list = stops_list[1::2]

    xp_stops_cities = '//div[@class="section stops"]/div[2]'
    stops_cities = driver.find_elements_by_xpath(xp_stops_cities)
    stops_cities_list = [stop.text for stop in stops_cities]
    a_stop_name_list = stops_cities_list[::2]
    b_stop_name_list = stops_cities_list[1::2]
    
    # carrier company information, departure and arrival times for both legs
    xp_schedule = '//div[@class="section times"]'
    schedules = driver.find_elements_by_xpath(xp_schedule)
    hours_list = []
    carrier_list = []
    for schedule in schedules:
        hours_list.append(schedule.text.split('\n')[0])
        carrier_list.append(schedule.text.split('\n')[1])
    # splitting the time and carrier information between legs a and b
    a_hours = hours_list[::2]
    a_carrier = carrier_list[::2]
    b_hours = hours_list[1::2]
    b_carrier = carrier_list[1::2]

    cols = (['Out Day', 'Out Time', 'Out Weekday', 'Out Airline', 'Out Cities', 'Out Duration', 'Out Stops', 'Out Stop Cities',
             'Return Day', 'Return Time', 'Return Weekday', 'Return Airline', 'Return Cities', 'Return Duration', 'Return Stops', 'Return Stop Cities',
             'Price'])

    flights_df = pd.DataFrame({'Out Day': a_day,
                               'Out Weekday': a_weekday,
                               'Out Duration': a_duration,
                               'Out Cities': a_section_names,
                               'Return Day': b_day,
                               'Return Weekday': b_weekday,
                               'Return Duration': b_duration,
                               'Return Cities': b_section_names,
                               'Out Stops': a_stop_list,
                               'Out Stop Cities': a_stop_name_list,
                               'Return Stops': b_stop_list,
                               'Return Stop Cities': b_stop_name_list,
                               'Out Time': a_hours,
                               'Out Airline': a_carrier,
                               'Return Time': b_hours,
                               'Return Airline': b_carrier,
                               'Price': prices_list})[cols]
    
    flights_df['timestamp'] = strftime("%Y%m%d-%H%M")  # time of data collection
    return flights_df

I tried to name the variables so that the code is easy to understand. Remember that variables starting with a refer to the first leg of the journey and those starting with b to the second. Let's move on to the next function.

Auxiliary mechanisms

We now have a function that loads additional search results and a function that processes those results. This article could have ended there, since these two functions provide everything needed to scrape pages that you can open yourself. But we have not yet covered some of the auxiliary mechanisms mentioned above, for example the code for sending emails and a few other things. All of this can be found in the start_kayak function, which we will look at now.

This function needs the city and date information. Using it, it builds the link in the kayak variable, which is used to open the page containing the search results sorted by best match to the query. After the first scraping session, we work with the prices in the table at the top of the page: namely, we find the minimum ticket price and the average price. All of this, together with the prediction issued by the site, is sent by email. On the page, the corresponding table should be in the upper left corner. Working with this table, by the way, can cause an error when searching with exact dates, since in that case the table is not displayed on the page.

def start_kayak(city_from, city_to, date_start, date_end):
    """City codes - it's the IATA codes!
    Date format -  YYYY-MM-DD"""
 
    kayak = ('https://www.kayak.com/flights/' + city_from + '-' + city_to +
             '/' + date_start + '-flexible/' + date_end + '-flexible?sort=bestflight_a')
    driver.get(kayak)
    sleep(randint(8,10))
 
    # sometimes a popup appears; a try block can be used to check for it and close it
    try:
        xp_popup_close = '//button[contains(@id,"dialog-close") and contains(@class,"Button-No-Standard-Style close ")]'
        driver.find_elements_by_xpath(xp_popup_close)[5].click()
    except Exception as e:
        pass
    sleep(randint(60,95))
    print('loading more.....')
 
#     load_more()
 
    print('starting first scrape.....')
    df_flights_best = page_scrape()
    df_flights_best['sort'] = 'best'
    sleep(randint(60,80))
 
    # Take the lowest price from the table at the top of the page
    matrix = driver.find_elements_by_xpath('//*[contains(@id,"FlexMatrixCell")]')
    matrix_prices = [price.text.replace('$','') for price in matrix]
    matrix_prices = list(map(int, matrix_prices))
    matrix_min = min(matrix_prices)
    matrix_avg = sum(matrix_prices)/len(matrix_prices)
 
    print('switching to cheapest results.....')
    cheap_results = '//a[@data-code = "price"]'
    driver.find_element_by_xpath(cheap_results).click()
    sleep(randint(60,90))
    print('loading more.....')
 
#     load_more()
 
    print('starting second scrape.....')
    df_flights_cheap = page_scrape()
    df_flights_cheap['sort'] = 'cheap'
    sleep(randint(60,80))
 
    print('switching to quickest results.....')
    quick_results = '//a[@data-code = "duration"]'
    driver.find_element_by_xpath(quick_results).click()  
    sleep(randint(60,90))
    print('loading more.....')
 
#     load_more()
 
    print('starting third scrape.....')
    df_flights_fast = page_scrape()
    df_flights_fast['sort'] = 'fast'
    sleep(randint(60,80))
 
    # Save the new frame to an Excel file whose name reflects the cities and dates
    final_df = df_flights_cheap.append(df_flights_best).append(df_flights_fast)
    final_df.to_excel('search_backups//{}_flights_{}-{}_from_{}_to_{}.xlsx'.format(strftime("%Y%m%d-%H%M"),
                                                                                   city_from, city_to, 
                                                                                   date_start, date_end), index=False)
    print('saved df.....')
 
    # You can keep track of how the site's prediction compares with reality
    xp_loading = '//div[contains(@id,"advice")]'
    loading = driver.find_element_by_xpath(xp_loading).text
    xp_prediction = '//span[@class="info-text"]'
    prediction = driver.find_element_by_xpath(xp_prediction).text
    print(loading + '\n' + prediction)
 
    # sometimes the loading variable ends up containing this string, which later causes problems with sending the email
    # if that happens, replace it with "Not sure"
    weird = '¯\_(ツ)_/¯'
    if loading == weird:
        loading = 'Not sure'
 
    username = 'YOUREMAIL@hotmail.com'
    password = 'YOUR PASSWORD'

    server = smtplib.SMTP('smtp.outlook.com', 587)
    server.ehlo()
    server.starttls()
    server.login(username, password)
    msg = ('Subject: Flight Scraper\n\n'
           'Cheapest Flight: {}\nAverage Price: {}\n\nRecommendation: {}\n\nEnd of message'.format(matrix_min, matrix_avg, (loading + '\n' + prediction)))
    message = MIMEMultipart()
    message['From'] = 'YOUREMAIL@hotmail.com'
    message['to'] = 'YOUROTHEREMAIL@domain.com'
    server.sendmail('YOUREMAIL@hotmail.com', 'YOUROTHEREMAIL@domain.com', msg)
    print('sent email.....')

I tested this script with an Outlook account (hotmail.com). I have not checked that it works correctly with a Gmail account; that mail system is very popular, but there are plenty of possible options. If you use a Hotmail account, then to get everything working it is enough to enter your details into the code.
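
If you want to try a Gmail account instead, I have not tested that, so treat the following as an assumption: the SMTP settings would look roughly like this, and Gmail normally also requires an app password rather than your regular one.

# Rough sketch for Gmail instead of Outlook (untested assumption; an app password is expected)
server = smtplib.SMTP('smtp.gmail.com', 587)
server.ehlo()
server.starttls()
server.login('YOUREMAIL@gmail.com', 'YOUR APP PASSWORD')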

If you want to understand what exactly happens in individual parts of this function's code, you can copy them and experiment with them. Experimenting with code is the only way to properly understand it.

The finished system

Now that everything we talked about is done, we can create a simple loop that calls our functions. The script asks the user for the cities and dates. When testing with constant script restarts, you will hardly want to enter this data manually every time, so for the duration of testing you can comment out the corresponding lines and uncomment the ones below them, in which the data the script needs is hard-coded.

city_from = input('From which city? ')
city_to = input('Where to? ')
date_start = input('Search around which departure date? Please use YYYY-MM-DD format only ')
date_end = input('Return when? Please use YYYY-MM-DD format only ')

# city_from = 'LIS'
# city_to = 'SIN'
# date_start = '2019-08-21'
# date_end = '2019-09-07'

for n in range(0,5):
    start_kayak(city_from, city_to, date_start, date_end)
    print('iteration {} was complete @ {}'.format(n, strftime("%Y%m%d-%H%M")))
 
    # Wait 4 hours
    sleep(60*60*4)
    print('sleep finished.....')

Here is what a test run of the script looks like.



A test run of the script

Summary

If you have made it this far, congratulations! You now have a working web scraper, although I can already see many ways to improve it. For example, it could be integrated with Twilio so that it sends text messages instead of emails. You could use a VPN or something else to get results from several servers at once. There is also the recurring problem of the site checking whether the user is human, but that problem can be solved too. In any case, you now have a base that you can extend if you wish, for example by having the Excel file sent to the user as an email attachment, as in the rough sketch below.
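
Here is a rough sketch of attaching the saved Excel file to the email, reusing the email.mime classes already imported in the script; the file name used here is hypothetical:

from email.mime.base import MIMEBase
from email import encoders

# Rough sketch: attach the saved Excel file to the outgoing message (file name is hypothetical)
part = MIMEBase('application', 'octet-stream')
with open('search_backups/example_flights.xlsx', 'rb') as f:
    part.set_payload(f.read())
encoders.encode_base64(part)
part.add_header('Content-Disposition', 'attachment', filename='example_flights.xlsx')
message.attach(part)  # message is the MIMEMultipart object created in start_kayak
server.sendmail(username, 'YOUROTHEREMAIL@domain.com', message.as_string())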
