Who are employers looking for?

Those who have never been in “looking for a job” mode are very lucky! My story is quite typical: at almost forty, I decided to “get into IT” through popular professional courses. The learning process inspired me, and it seemed a line of employers would be waiting ahead, eager to hire a sought-after specialist. As it turned out, though, no one is in a hurry to hire junior specialists (ageism? No way…).

Banners, commercials, and influencer posts constantly promise high salaries and plenty of vacancies. I kept running into articles about the shortage of IT specialists, backed by graphs and statistics, claiming that employers were looking for suitable candidates but could not find them. But something told me that “the truth is out there” and needed to be found. I asked myself: who are employers really looking for? And this time I didn’t want to rely on third-party information – I wanted to collect it, analyze it, and draw conclusions on my own. Luckily, I had all the necessary skills for this – I had studied for a reason!

In this context, I decided to use Web Scraper and Python to collect job posting data, analyze it, and understand it. Let's walk this path together: from collecting information to analyzing and drawing conclusions about what employers really want.

While searching, I realized that the list of vacancies itself makes a great dataset for analysis. hh.ru permits using its information both to find a job and to study the labor market.

To collect information from the website, I built a parser using the Web Scraper tool. It extracts data on data analyst vacancies from the HeadHunter platform. Below is the sitemap, in JSON format, that drives the collection. It defines selectors for the various vacancy attributes – job title, employer name, salary, required experience, remote-work option, and location – and also captures a link to each vacancy.

{"_id":"data_analyst_hh_2","startUrl":["https://hh.ru/search/vacancy?text=%D0%90%D0%BD%D0%B0%D0%BB%D0%B8%D1%82%D0%B8%D0%BA+%D0%B4%D0%B0%D0%BD%D0%BD%D1%8B%D1%85&salary=&ored_clusters=true&experience=noExperience&area=113&hhtmFrom=vacancy_search_list&hhtmFromLabel=vacancy_search_line&page=[0-33]"],"selectors":[{"id":"info","parentSelectors":["_root"],"type":"SelectorElement","selector":"div.vacancy-info--I4f9shQE53f9Luf5lkMw","multiple":true},{"id":"vacancy","parentSelectors":["info"],"type":"SelectorText","selector":".magritte-text_typography-title-4-semibold___vUqki_3-0-12 span","multiple":false,"regex":""},{"id":"employer","parentSelectors":["info"],"type":"SelectorText","selector":".company-name-badges-container--o692jpSdR2R4SR9oXuZJ span.magritte-text___tkzIl_4-2-2","multiple":false,"regex":""},{"id":"salary","parentSelectors":["info"],"type":"SelectorText","selector":".compensation-labels--uUto71l5gcnhU2I8TZmz span","multiple":false,"regex":""},{"id":"experience","parentSelectors":["info"],"type":"SelectorText","selector":".wide-container-magritte--MZDT2K1sum_GdjUzT50m div.magritte-tag__label___YHV-o_3-0-3","multiple":false,"regex":""},{"id":"remote","parentSelectors":["info"],"type":"SelectorText","selector":".wide-container-magritte--MZDT2K1sum_GdjUzT50m div:nth-of-type(2) div","multiple":false,"regex":""},{"id":"place","parentSelectors":["info"],"type":"SelectorText","selector":".wide-container-magritte--MZDT2K1sum_GdjUzT50m .wide-container-magritte--MZDT2K1sum_GdjUzT50m span","multiple":false,"regex":""},{"id":"link","parentSelectors":["info"],"type":"SelectorLink","selector":"a.magritte-link_enable-visited___Biyib_4-2-2","multiple":false,"linkType":"linkFromHref"}]}

Note the solution for pagination – the page range is specified directly as page=[0-33]; if you reuse the sitemap, you will need to adjust this range manually. Also keep in mind that the site’s markup changes regularly, so the parser may go stale (this one matches the August 2024 markup).

The result is a file listing the vacancies available on August 2, 2024 for the query “Data Analyst” with the “Work experience: no experience” filter enabled:

Excel result


Next, we open a Jupyter notebook and start working the received dataset. Here I will only show selected parts of the code and the conclusions; the full project is here.

After deleting the parser's service columns, trimming the links to vacancies, and converting string values to lowercase, let's get a first look at the state of the data.
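A minimal sketch of that cleanup, assuming the Web Scraper export was saved as data_analyst_hh_2.xlsx (the file name is an assumption; web-scraper-order and web-scraper-start-url are the service columns Web Scraper adds to its exports):

import pandas as pd

# load the Web Scraper export (file name assumed)
df = pd.read_excel('data_analyst_hh_2.xlsx')

# drop the scraper's service columns
df = df.drop(columns=['web-scraper-order', 'web-scraper-start-url'], errors='ignore')

# trim tracking query parameters from the vacancy links (assumed form of "trimming")
df['link-href'] = df['link-href'].str.split('?').str[0]

# lowercase all string columns
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.lower()

The helper below then prints the overview: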

# helper to print an overall summary of a table
def tab_info(df_x, name):
    # summary statistics for the data
    print('Summary statistics for', name)
    display(df_x.describe().round(2))
    # number of missing values per column
    print('Number of missing values in', name)
    display(df_x.isna().sum())
    # share of missing values, rounded
    print('Share of missing values (%) in', name)
    display(round(df_x.isna().sum() * 100 / len(df_x), 2))
    # number of duplicated rows
    print('Duplicated rows in', name)
    display(df_x.duplicated().sum())
    # general information about the table (info() prints directly, so no display() needed)
    print('General information about the table', name)
    df_x.info()
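
Calling it on the cleaned table (the label is ours to choose) produces the summary that follows:

tab_info(df, 'vacancies')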

Number of rows: 1642. Fill rate by column:

  • vacancy: 1642 filled, 100.00% – job title

  • employer: 1640 filled, 99.88% – employer name

  • salary: 1005 filled, 61.21% – salary

  • experience: 1642 filled, 100.00% – required experience

  • remote: 296 filled, 18.03% – remote work option

  • place: 1642 filled, 100.00% – vacancy location

  • link-href: 1642 filled, 100.00% – link to the vacancy

Duplicate rows – 37.

Let's delete the duplicate rows and the rows without an employer name.
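A one-line sketch of this step, assuming the DataFrame is named df:

# drop exact duplicates and rows with a missing employer name
df = df.drop_duplicates().dropna(subset=['employer']).reset_index(drop=True)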

When searching, vacancies often appear that differ only in the city. Let's check the number of such implicit duplicates.

# total number of rows and number of implicit duplicates
total_rows = len(df)
duplicate_rows = df.duplicated(subset=['vacancy', 'employer', 'salary', 'experience', 'remote'], keep=False).sum()

# unique rows vs. duplicates
unique_rows = total_rows - duplicate_rows

# print the results
print(f"Total rows: {total_rows}")
print(f"Duplicated rows: {duplicate_rows}")
print(f"Unique rows: {unique_rows}")

# data for the pie chart
labels = ['Unique vacancies', 'Duplicated vacancies']
sizes = [unique_rows, duplicate_rows]
colors = ['#66c2a5', '#fc8d62']

# draw the pie chart
plt.figure(figsize=(8, 6))
plt.pie(sizes, labels=labels, colors=colors, autopct="%1.1f%%", startangle=90)
plt.title('Unique vs. duplicated vacancies')
plt.axis('equal')  # keep the circle a circle
plt.show()

Every fifth vacancy in the search results turned out to be a duplicate! Let's look at the employers behind these vacancies.
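The next snippet relies on duplicated_df, which isn't shown in the excerpt; a plausible construction is the subset of rows flagged above:

# rows belonging to an implicit-duplicate group (construction assumed, not from the notebook)
duplicated_df = df[df.duplicated(subset=['vacancy', 'employer', 'salary', 'experience', 'remote'], keep=False)]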

# number of unique values in the 'employer' column
unique_employers_count = duplicated_df['employer'].nunique()
print(f"Unique values in the 'employer' column: {unique_employers_count}")

# count values in 'employer', take the top 20, sorted descending
top_employers = duplicated_df['employer'].value_counts().nlargest(20)

# horizontal bar chart
plt.figure(figsize=(10, 6))
bars = plt.barh(top_employers.index[::-1], top_employers.values[::-1], color="skyblue")

# add value labels
for bar in bars:
    plt.text(bar.get_width(), bar.get_y() + bar.get_height()/2, 
             f'{bar.get_width()}', va="center")

# title and axis labels
plt.title('Top 20 employers by number of duplicated vacancies')
plt.xlabel('Number of vacancies')
plt.ylabel('Employer')
plt.grid(axis="x")  # grid along the X axis

# show the chart
plt.tight_layout()
plt.show()

63 employers out of 913 post vacancies that differ only in the location field; Yandex Crowd has the most. Let's delete such vacancies (a sketch of this step follows below). After all these manipulations, the dataset contains 1373 rows:

  • vacancy: 1373 filled (100.00%), 1095 unique (79.75%)

  • employer: 1373 filled (100.00%), 913 unique (66.50%)

  • salary: 830 filled (60.45%), 376 unique (27.39%)

  • experience: 1373 filled (100.00%), 1 unique (0.07%)

  • remote: 167 filled (12.16%), 1 unique (0.07%)

  • place: 1373 filled (100.00%), 129 unique (9.40%)

  • link-href: 1373 filled (100.00%), 1373 unique (100.00%)
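The removal sketch, keeping one row per city-only duplicate group (keep='first' is an assumption about how the original notebook resolves each group):

# keep the first row of each implicit-duplicate group, dropping the city-only copies
df = df.drop_duplicates(subset=['vacancy', 'employer', 'salary', 'experience', 'remote'], keep='first').reset_index(drop=True)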

vacancy – job title

After stripping symbols from the job titles, lemmatizing, and removing stop words, let's see which words appear most often in job titles.
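The preprocessing itself isn't shown in the excerpt; one possible version, assuming pymorphy2 for Russian lemmatization and NLTK's stop-word list (the variable vacancy_without_stopwords feeds the next snippet):

import re
from collections import Counter

import pymorphy2
from nltk.corpus import stopwords  # requires a one-time nltk.download('stopwords')

morph = pymorphy2.MorphAnalyzer()
russian_stopwords = set(stopwords.words('russian'))

# keep only letters, lemmatize each token, drop stop words
tokens = re.findall(r'[а-яёa-z]+', ' '.join(df['vacancy']).lower())
lemmas = (morph.parse(t)[0].normal_form for t in tokens)
vacancy_without_stopwords = ' '.join(w for w in lemmas if w not in russian_stopwords)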

# split the text into words and count frequencies
words = vacancy_without_stopwords.split()
word_counts = Counter(words)

# take the 20 most common words
top_words = word_counts.most_common(20)

# prepare data for the chart
words, counts = zip(*top_words)

# draw the chart
plt.figure(figsize=(10, 5))
bars = plt.bar(words, counts, color="skyblue")
plt.title('Top 20 words')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45)

# add value labels above the bars
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval, int(yval), ha="center", va="bottom")

plt.tight_layout()
plt.show()

# build the word cloud (requires: from wordcloud import WordCloud)
wordcloud = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(word_counts)

# display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')  # hide the axes
plt.title('Word cloud')
plt.show()

Captain Obvious reporting: the most common word is “analyst”, yet out of 1373 rows it appears in only 629. Also common: manager, business, sales, system. This is simply how the search works; for a more precise selection you can search in vacancy titles only, and to require an exact match of two adjacent words, put the query in quotation marks: “Data Analyst”.

employer – employer name

There are 913 unique entries in the “employer” column. Let's see which companies appear most often – that is, which ones are actively hiring workers without experience:

# count values in 'employer', take the top 30, sorted descending
top_employers = df_employer['employer'].value_counts().nlargest(30)

# horizontal bar chart
plt.figure(figsize=(10, 6))
bars = plt.barh(top_employers.index[::-1], top_employers.values[::-1], color="skyblue")

# add value labels
for bar in bars:
    plt.text(bar.get_width(), bar.get_y() + bar.get_height()/2, 
             f'{bar.get_width()}', va="center")

# title and axis labels
plt.title('Top 30 employers by number of vacancies')
plt.xlabel('Count')
plt.ylabel('Employer')
plt.grid(axis="x")  # grid along the X axis

# show the chart
plt.tight_layout()
plt.show()

BCS career start, Sber for experts, and the Magnit retail chain top the list of postings for applicants without experience – employers who worded their vacancies so that they land in the “Data Analyst” search results.

salary – wages

Since this field arrives as a single merged string like “from 84,000 ₽ before taxes”, and some salaries are quoted in $, normalizing the data took serious effort – all the dancing with tambourines is here. The result is a table where min is the stated minimum salary and max is the stated maximum, converted to rubles and to net (“in hand”) amounts, i.e. gross figures reduced by the 13% tax.
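A compressed sketch of what that normalization could look like – the exchange rate, the keyword checks, and the function itself are illustrative assumptions, not the notebook's actual code:

import re
import numpy as np

USD_RATE = 90  # assumed exchange rate, not taken from the original notebook

def parse_salary(raw):
    """Parse a raw hh.ru salary string into net (min, max) in rubles."""
    if not isinstance(raw, str):
        return np.nan, np.nan
    # pull out number groups, stripping regular and narrow no-break spaces
    nums = [int(n.replace(' ', '').replace('\u202f', ''))
            for n in re.findall(r'\d[\d \u202f]*', raw)]
    if not nums:
        return np.nan, np.nan
    rate = USD_RATE if '$' in raw else 1
    net = 0.87 if 'до вычета' in raw else 1  # "before taxes" -> minus 13%
    lo = nums[0] * rate * net if raw.startswith('от') or len(nums) > 1 else np.nan
    hi = nums[-1] * rate * net if raw.startswith('до') or len(nums) > 1 else np.nan
    return lo, hi

# apply to the lowercased salary strings (column names assumed)
df_salary = df.copy()
df_salary[['min', 'max']] = df_salary['salary'].apply(parse_salary).tolist()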

Let's look at the distribution of salary values:

# visualize the frequency distributions of the min and max columns
plt.figure(figsize=(12, 6))

# histogram for the min column
plt.subplot(1, 2, 1)
plt.hist(df_salary['min'], bins=30, color="blue", alpha=0.7)
plt.title('Distribution of min')
plt.xlabel('min values')
plt.ylabel('Frequency')

# histogram for the max column
plt.subplot(1, 2, 2)
plt.hist(df_salary['max'], bins=30, color="green", alpha=0.7)
plt.title('Distribution of max')
plt.xlabel('max values')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

The plot shows outliers, so let's cap the sample at 200,000 and look at the minimums and maximums:

# filter the data
df_salary_min = df_salary[df_salary['min'] < 200000].dropna(subset=['min'])

# compute the mean
mean_min = df_salary_min['min'].mean()

# set up the figure
plt.figure(figsize=(10, 6))

# histogram via seaborn, with a KDE curve
sns.histplot(df_salary_min['min'], bins=30, color="blue", kde=True)

# title and labels
plt.title('Distribution of values in the min column')
plt.xlabel('min values')
plt.ylabel('Frequency')

# vertical line at the mean
plt.axvline(mean_min, color="red", linestyle="--", label="Mean")

# legend
plt.legend()

# annotate the bin counts
counts, bins = np.histogram(df_salary_min['min'], bins=30)
for count, x in zip(counts, bins):
    plt.text(x, count, str(count), ha="center", va="bottom")

# annotate the mean value on the chart
plt.text(mean_min, max(counts)*0.9, f'Mean: {mean_min:.2f}', color="red", ha="center")

plt.tight_layout()
plt.show()

# filter the data
df_salary_max = df_salary[df_salary['max'] < 200000].dropna(subset=['max'])

# compute the mean
mean_value = df_salary_max['max'].mean()

# set up the figure
plt.figure(figsize=(10, 6))

# histogram via seaborn, with a KDE curve
sns.histplot(df_salary_max['max'], bins=30, color="green", kde=True)

# title and labels
plt.title('Distribution of values in the max column')
plt.xlabel('max values')
plt.ylabel('Frequency')

# vertical line at the mean
plt.axvline(mean_value, color="red", linestyle="--", label=f'Mean: {mean_value:.2f}')

# annotate the bin counts
counts, bins = np.histogram(df_salary_max['max'], bins=30)
for count, x in zip(counts, bins):
    plt.text(x, count, str(count), ha="center", va="bottom")

# legend
plt.legend()

plt.tight_layout()
plt.show()

The average salary band runs from 53,000 ₽ (mean of the stated minimums) to 69,000 ₽ (mean of the stated maximums).

experience – experience

Since the search was filtered to “no experience”, there is only one value – no experience.

remote – the ability to work remotely

Let's see if Data Analysts are hired for remote work.

# replace empty values with "in the office"
df_remote['remote'] = df_remote['remote'].fillna('в офисе')  # 'в офисе' = "in the office"

# count the values
count_values = df_remote['remote'].value_counts()

# labels and values
labels = count_values.index
sizes = count_values.values

# pie chart
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct=lambda p: '{:.0f} ({:.1f}%)'.format(p * sum(sizes) / 100, p), startangle=90)
ax1.axis('equal')  # keep the circle round

# title
plt.title('Remote vs. office work')
plt.show()

Only 12% offer “remote work”. Next, let's see where the offices are.

place – location of the vacancy

# count mentions of each city and sort
place_counts = df_place['place'].value_counts().head(20).sort_values()

# visualization: horizontal bar chart
plt.figure(figsize=(12, 8))
bars = plt.barh(place_counts.index, place_counts.values, color="lightgreen")

# add value labels to the bars (absolute and relative)
total_mentions = place_counts.sum()
for bar in bars:
    xval = bar.get_width()
    relative_val = (xval / total_mentions) * 100  # relative share
    plt.text(xval, bar.get_y() + bar.get_height()/2, f'{int(xval)} ({relative_val:.1f}%)', va="center", ha="left")

plt.title('Top 20 cities by number of mentions', fontsize=16)
plt.xlabel('Number of mentions', fontsize=14)
plt.ylabel('City', fontsize=14)
plt.grid(axis="x")
plt.tight_layout()
plt.show()

# visualization: word cloud
wordcloud = WordCloud(width=800, height=400, background_color="white").generate(' '.join(df_place['place']))

plt.figure(figsize=(12, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')  # hide the axes
plt.title('Word cloud of cities', fontsize=16)
plt.show()

More than half are in Moscow, out of 129 distinct locations in the dataset.

Employers by salary

This part turned out less clear-cut, since I used the entire sample (it probably made sense to cap it below 200,000 here as well).
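The snippet below uses grouped_df, whose construction isn't shown in the excerpt; a plausible version aggregates the mean stated min and max per employer:

# assumed construction of grouped_df: mean stated min and max per employer
grouped_df = (df_salary.groupby('employer', as_index=False)
                       .agg(min_average=('min', 'mean'),
                            max_average=('max', 'mean')))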

# sort by max_average and min_average and take the top 30
top_30_max = grouped_df.nlargest(30, 'max_average')
top_30_min = grouped_df.nlargest(30, 'min_average')

# chart for max_average
plt.figure(figsize=(12, 10))
plt.barh(top_30_max['employer'], top_30_max['max_average'], color="skyblue")
plt.xlabel('Mean max')
plt.title('Top 30 employers by mean max')
plt.gca().invert_yaxis()  # invert the Y axis for readability
plt.show()

# chart for min_average
plt.figure(figsize=(12, 10))
plt.barh(top_30_min['employer'], top_30_min['min_average'], color="lightgreen")
plt.xlabel('Mean min')
plt.title('Top 30 employers by mean min')
plt.gca().invert_yaxis()  # invert the Y axis for readability
plt.show()

Conclusions

To sum up:

  • When searching on the site, 20% of vacancies are duplicates differing only in the city.

  • Without additional filters, the results often include vacancies from adjacent fields: systems analysis, sales…

  • Average salaries run from 53,000 to 69,000 ₽ after taxes (net).

  • Only 12% of vacancies are remote, and more than 50% of all vacancies are in Moscow.

Good luck to everyone who is looking for a job! And to me too!
