YandexGPT for recognizing skills in a resume without SMS and data markup

Hi! My name is Gregory, and I head special projects on the AllSee team. It's 2024, the year of AI and large language models: many of us have already tamed these new technologies and are using them at full tilt for everything from writing code and solving work and study problems to fighting loneliness. Let's try using an LLM to solve an interesting problem from the HR field. On today's menu: automatically determining a candidate's skills from the text of their resume. Shall we?

Motivation

As part of one of our many special projects, we were faced with the task of recognizing a candidate's skills from a predetermined list based on a resume.

This task, with some variations, comes up often these days: extract structured data from text, where the possible output values come from a predefined list.

Typically, such problems are solved by direct search over the text, but that approach requires manually maintained dictionaries of lexemes. Neural NER (Named Entity Recognition) methods are also hard to apply here, both because they require labeling a large amount of data and because the answers must be restricted to a predetermined list.

What are we given?

Let's describe the future algorithm in more detail. The input is the text of a candidate's resume and a list of possible skills for the chosen specialization (about 200 unique values). The output is a list of skills the candidate possesses, containing only exact wordings from the list of possible skills.

It is also important to meet the following requirements:

  1. Our algorithm must work in Russian

  2. We do not have a labeled dataset and want to avoid labeling the training set

To evaluate the algorithm, we will limit ourselves to labeling a test sample of just 20 resumes.

Solution

Yandex GPT API

In our solution we will use the YandexGPT API. Note that by using a GPT (generative pre-trained transformer) we remove the need to label a training sample: prompt engineering alone gets us initial results and a working algorithm, while we retain the option of fine-tuning the model later (once a labeled dataset exists).

For easier interaction with the YandexGPT API we use the YandexGPT Python SDK, a Python wrapper for the API with automatic authorization and request handling.

Authorization in YandexGPT API

To use the YandexGPT API, you need to specify the Yandex Cloud folder ID and the URI of the model you want. You can read more about where to get them in the article or in the official YandexGPT API documentation.

Since we use the YandexGPT Python SDK, for authorization we only need to set the following environment variables:

# YandexGPT model type. Currently supported: yandexgpt, yandexgpt-light, summarization
YANDEX_GPT_MODEL_TYPE=yandexgpt

# YandexGPT catalog ID. How to get it: https://yandex.cloud/ru/docs/iam/operations/sa/get-id
YANDEX_GPT_CATALOG_ID=abcde12345

# API key. How to get it: https://yandex.cloud/ru/docs/iam/operations/api-key/create
YANDEX_GPT_API_KEY=AAA111-BBB222-CCC333
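With these variables set, the SDK can configure itself from the environment. As a small illustration (the `missing_yandex_gpt_env` helper below is hypothetical, not part of the SDK), it is worth checking the configuration up front so a typo fails fast rather than surfacing as an opaque API error:

```python
import os
from typing import List, Mapping


def missing_yandex_gpt_env(environ: Mapping[str, str] = os.environ) -> List[str]:
    # Variable names the SDK expects (see the env file above); returns
    # those that are absent or empty, so configuration errors fail fast.
    required = ["YANDEX_GPT_MODEL_TYPE", "YANDEX_GPT_CATALOG_ID", "YANDEX_GPT_API_KEY"]
    return [name for name in required if not environ.get(name)]
```

Calling `missing_yandex_gpt_env()` before constructing the client turns a misconfigured environment into a clear message instead of a failed request later.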

Preprocessing of input data

Let's move on to working with the text itself. We would like to just tell the model: "Here is the text, here is a list of skills. Find the skills from the list in the text." A solid plan, but, as with any task, there are pitfalls. Dealing with pitfalls is actually quite useful: while untangling a problem, we end up studying our tools more deeply and learning to adapt them to our needs.

For brevity, I will describe our pitfalls and the ways we dealt with them right away:

  1. The model gets confused by long texts. This could be solved by extracting only the text segments of interest, but that requires additional data labeling, so instead we let the model select the relevant segments itself by splitting the text into batches.

  2. The model is also confused by a large set of possible skills. We solve this by splitting the skills list into batches.

  3. The input text may contain special characters that carry little information but inflate the number of input tokens. We will strip them from the incoming text.

Code for preprocessing
from typing import List


def split_text_to_batches(text: str, batch_size: int) -> List[str]:
    # Split the resume text into fixed-size character chunks
    return [text[i:i + batch_size] for i in range(0, len(text), batch_size)]


def split_list_to_batches(elements_list: List[str], batch_size: int) -> List[List[str]]:
    # Split the skills list into fixed-size sublists
    return [elements_list[i:i + batch_size] for i in range(0, len(elements_list), batch_size)]


def filter_special_simbols(text: str) -> str:
    # Replace newlines and non-breaking spaces with plain spaces
    return text.replace("\n", " ").replace("\xa0", " ")
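A quick sanity check of the batching helpers (the functions are repeated here so the snippet runs standalone; the sample inputs are illustrative):

```python
from typing import List


def split_text_to_batches(text: str, batch_size: int) -> List[str]:
    # Same helper as above: fixed-size character chunks
    return [text[i:i + batch_size] for i in range(0, len(text), batch_size)]


def split_list_to_batches(elements_list: List[str], batch_size: int) -> List[List[str]]:
    # Same helper as above: fixed-size sublists
    return [elements_list[i:i + batch_size] for i in range(0, len(elements_list), batch_size)]


print(split_text_to_batches("abcdefgh", 3))  # ['abc', 'def', 'gh']
print(split_list_to_batches(["Python", "SQL", "Docker", "Git", "Linux"], 2))
# [['Python', 'SQL'], ['Docker', 'Git'], ['Linux']]
```

Note that character-based splitting can cut a word at a batch boundary; with 300-character batches this happens rarely enough that we accept it here.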

Requests to YandexGPT

Let's prepare a prompt for processing our data. What should we pay attention to?

  1. Specifying the assistant's role

  2. Separation into system and user prompts

  3. Unambiguity of instructions

  4. Response format requirements

Code for requests to YandexGPT
import time
import asyncio
from typing import List, Dict

from yandex_gpt import YandexGPTConfigManagerForAPIKey, YandexGPT

from dotenv import load_dotenv
load_dotenv("./env/.env.api_key")


async def process_message(message, yandex_gpt, sem, start_time):
    try:
        async with sem:
            # Stop issuing requests once the overall time budget (55 s) is spent
            if time.time() - start_time > 55:
                return ""
            # Crude rate limiting: at most one request per second
            await asyncio.sleep(1)

            # Blocking call; acceptable here since the semaphore keeps concurrency at 1
            result = yandex_gpt.get_sync_completion(messages=message, temperature=0.0, max_tokens=1000)
            return result

    except Exception as e:
        print("Error in process_message:", e)
        return ""

      
async def extract_skills_from_pdf(resume_text: str, skills: List[Dict]):
    try:
        # Client configured from the environment variables set earlier
        yandex_gpt = YandexGPT(config_manager=YandexGPTConfigManagerForAPIKey())
        start_time = time.time()

        resume_batch_size = 300
        skills_batch_size = 30

        resume_text_batches = split_text_to_batches(resume_text, resume_batch_size)
        skills_batches = split_list_to_batches([skill["skill"] for skill in skills], skills_batch_size)
        messages = []

        for resume_text_batch in resume_text_batches:
            for skills_batch in skills_batches:
                skills_list_str = "; ".join(f'"{skill}"' for skill in skills_batch)
                # System prompt (in Russian): "You are an experienced HR consultant.
                # Possible candidate skills: <list>. From the resume excerpt, extract the
                # competencies and skills the candidate mentions, if they are present in the
                # list of possible skills. Look for indirect evidence as well as direct
                # mentions. Answer with the exact wordings from the skill set, separated by
                # semicolons. If the excerpt is not about skills, return an empty string."
                messages.append([
                    {'role': 'system', 'text': (
                        'Ты опытный HR-консультант. '
                        f'Набор возможных навыков кандидата: {skills_list_str}. '
                        'Из отрывка резюме кандидата выдели упоминаемые им компетенции и навыки, если они присутствуют в списке возможных навыков кандидата. '
                        'Ищи не только прямые упоминания навыков, но и косвенные признаки их наличия. '
                        'Для ответа используй точные формулировки из набора навыков. '
                        'Ответ дай в формате списка навыков, отделяя каждый навык точкой с запятой (";"): "навык1; навык2; навык3; ...". '
                        'Если данный отрывок не относится к информации о навыках, верни пустую строку ("").'
                    )},
                    # User prompt (in Russian): "Candidate resume excerpt: <text>."
                    {'role': 'user', 'text': (
                        f'Отрывок резюме кандидата: "{filter_special_simbols(resume_text_batch)}".'
                    )}
                ])

        # Process requests one at a time to stay within API rate limits
        sem = asyncio.Semaphore(1)
        tasks = [process_message(message, yandex_gpt, sem, start_time=start_time) for message in messages]
        results = await asyncio.gather(*tasks)

        # Merge the skills found across all batches into a single set
        res = set()
        for batch_skills in [find_elements_in_text([skill["skill"] for skill in skills], text) for text in results]:
            res = res | set(batch_skills)

        # Map each skill back to its identifier
        reverse_skills_mapping = {skill["skill"]: skill["id"] for skill in skills}
        return [(reverse_skills_mapping[skill], skill) for skill in res]

    except Exception as e:
        print("Error in extract_skills_from_pdf:", e)
        return []

Processing the Model Response

Back to our pitfalls:

  1. The model sometimes ignores the prompt and returns the list in bullet format or with different delimiters. We strip any such possible separators.

Beyond that, to extract the list of skills from the model's response, it is enough to split the text into items and look for direct, case-insensitive matches of the skills we need.

Code to retrieve list of skills
from typing import List


def find_elements_in_text(elements_list: List[str], text: str) -> List[str]:
    # Strip stray delimiters the model may emit and normalize case
    lower_text = text.lower().replace("\n", " ").replace("*", " ").replace('"', "")
    words = lower_text.split("; ")

    # Count case-insensitive exact matches of each expected element
    elements_count = {element.lower(): words.count(element.lower()) for element in elements_list}

    # Map matches back to their original spelling
    elements_mapping = {element.lower(): element for element in elements_list}
    filtered_elements = [elements_mapping[element] for element, count in elements_count.items() if count > 0]

    return filtered_elements
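For example, given a model response with mixed case and stray quotes, the parser recovers the exact wordings from the skill list (the function is repeated here so the snippet runs standalone; the sample strings are illustrative):

```python
from typing import List


def find_elements_in_text(elements_list: List[str], text: str) -> List[str]:
    # Same parser as above: strip stray delimiters, split on "; ",
    # and keep case-insensitive exact matches in their original spelling
    lower_text = text.lower().replace("\n", " ").replace("*", " ").replace('"', "")
    words = lower_text.split("; ")
    elements_count = {element.lower(): words.count(element.lower()) for element in elements_list}
    elements_mapping = {element.lower(): element for element in elements_list}
    return [elements_mapping[element] for element, count in elements_count.items() if count > 0]


response = '"Python"; sql; Kubernetes'
print(find_elements_in_text(["Python", "SQL", "Docker"], response))  # ['Python', 'SQL']
```

"Kubernetes" is dropped because it is not in the allowed list, and "sql" is returned in its canonical spelling "SQL" — exactly the behavior we need to keep the output restricted to the predefined skill set.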

Full solution code

Expand
import time
import asyncio
from typing import List, Dict

from yandex_gpt import YandexGPTConfigManagerForAPIKey, YandexGPT

from dotenv import load_dotenv
load_dotenv("./env/.env.api_key")


def split_text_to_batches(text: str, batch_size: int) -> List[str]:
    return [text[i:i + batch_size] for i in range(0, len(text), batch_size)]


def split_list_to_batches(elements_list: List[str], batch_size: int) -> List[List[str]]:
    return [elements_list[i:i + batch_size] for i in range(0, len(elements_list), batch_size)]


def filter_special_simbols(text: str) -> str:
    return text.replace("\n", " ").replace("\xa0", " ")


async def process_message(message, yandex_gpt, sem, start_time):
    try:
        async with sem:
            # Stop issuing requests once the overall time budget (55 s) is spent
            if time.time() - start_time > 55:
                return ""
            # Crude rate limiting: at most one request per second
            await asyncio.sleep(1)

            # Blocking call; acceptable here since the semaphore keeps concurrency at 1
            result = yandex_gpt.get_sync_completion(messages=message, temperature=0.0, max_tokens=1000)
            return result

    except Exception as e:
        print("Error in process_message:", e)
        return ""

      
def find_elements_in_text(elements_list: List[str], text: str) -> List[str]:
    lower_text = text.lower().replace("\n", " ").replace("*", " ").replace('"', "")
    words = lower_text.split("; ")

    elements_count = {element.lower(): words.count(element.lower()) for element in elements_list}

    elements_mapping = {element.lower(): element for element in elements_list}
    filtered_elements = [elements_mapping[element] for element, count in elements_count.items() if count > 0]

    return filtered_elements

      
async def extract_skills_from_pdf(resume_text: str, skills: List[Dict]):
    try:
        # Client configured from the environment variables set earlier
        yandex_gpt = YandexGPT(config_manager=YandexGPTConfigManagerForAPIKey())
        start_time = time.time()

        resume_batch_size = 300
        skills_batch_size = 30

        resume_text_batches = split_text_to_batches(resume_text, resume_batch_size)
        skills_batches = split_list_to_batches([skill["skill"] for skill in skills], skills_batch_size)
        messages = []

        for resume_text_batch in resume_text_batches:
            for skills_batch in skills_batches:
                skills_list_str = "; ".join(f'"{skill}"' for skill in skills_batch)
                # System prompt (in Russian): "You are an experienced HR consultant.
                # Possible candidate skills: <list>. From the resume excerpt, extract the
                # competencies and skills the candidate mentions, if they are present in the
                # list of possible skills. Look for indirect evidence as well as direct
                # mentions. Answer with the exact wordings from the skill set, separated by
                # semicolons. If the excerpt is not about skills, return an empty string."
                messages.append([
                    {'role': 'system', 'text': (
                        'Ты опытный HR-консультант. '
                        f'Набор возможных навыков кандидата: {skills_list_str}. '
                        'Из отрывка резюме кандидата выдели упоминаемые им компетенции и навыки, если они присутствуют в списке возможных навыков кандидата. '
                        'Ищи не только прямые упоминания навыков, но и косвенные признаки их наличия. '
                        'Для ответа используй точные формулировки из набора навыков. '
                        'Ответ дай в формате списка навыков, отделяя каждый навык точкой с запятой (";"): "навык1; навык2; навык3; ...". '
                        'Если данный отрывок не относится к информации о навыках, верни пустую строку ("").'
                    )},
                    # User prompt (in Russian): "Candidate resume excerpt: <text>."
                    {'role': 'user', 'text': (
                        f'Отрывок резюме кандидата: "{filter_special_simbols(resume_text_batch)}".'
                    )}
                ])

        # Process requests one at a time to stay within API rate limits
        sem = asyncio.Semaphore(1)
        tasks = [process_message(message, yandex_gpt, sem, start_time=start_time) for message in messages]
        results = await asyncio.gather(*tasks)

        # Merge the skills found across all batches into a single set
        res = set()
        for batch_skills in [find_elements_in_text([skill["skill"] for skill in skills], text) for text in results]:
            res = res | set(batch_skills)

        # Map each skill back to its identifier
        reverse_skills_mapping = {skill["skill"]: skill["id"] for skill in skills}
        return [(reverse_skills_mapping[skill], skill) for skill in res]

    except Exception as e:
        print("Error in extract_skills_from_pdf:", e)
        return []

Algorithm results

To evaluate the algorithm we use Precision and Recall: the share of correct skills in the model's response and the share of true skills that were detected, respectively. We give priority to Recall, because during a manual review it is easier to "click the cross" to remove a wrong skill than to search the skills database for a missing one.
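On sets of skills these metrics reduce to simple set arithmetic. A minimal sketch (the `precision_recall` helper and the sample data are illustrative, not part of our pipeline):

```python
from typing import Iterable, Tuple


def precision_recall(predicted: Iterable[str], reference: Iterable[str]) -> Tuple[float, float]:
    # Precision: share of predicted skills that are correct.
    # Recall: share of reference skills that were found.
    pred, ref = set(predicted), set(reference)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    return precision, recall


p, r = precision_recall(["Python", "SQL", "Excel"], ["Python", "SQL", "Docker", "Git"])
print(p, r)  # ~0.667 and 0.5: two of three predictions are correct, two of four true skills found
```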

In initial tests we obtained Precision of 69% and Recall of 78%. As a first approximation these are decent results, considering that we did no fine-tuning and only experimented a little with prompts.

Conclusion

We discussed how large language models can be used to extract predefined values from text, using the task of detecting candidate skills in resume text as an example, and covered the pitfalls you have to deal with when solving such problems.

I'll be happy to discuss your questions in the comments. Good luck, and stay in touch ✌️
