How We Made an AI Assistant to Automate Resume Formatting

Introduction

Hello Habr! My name is Alexander Suleikin, Ph.D., Big Data solutions architect and CEO of the IT company DYUK Technologies. Together with our generative AI expert Roman Babenko, I have prepared an article based on a real case: implementing an AI assistant in HR-Tech for the business process of searching for and presenting candidates for outsourcing and outstaffing vacancies. The article will be useful to HR managers, executive directors, and heads of digitalization and business process automation. The system was tested at a small IT company that provides outstaffing of IT developers.

As technology develops and the volume of data that companies work with grows, large language models (LLMs) are becoming an integral part of modern business. These models not only demonstrate impressive results in natural language processing but also open new horizons for automating routine tasks. In HR-Tech in particular, LLMs can significantly simplify resume processing. That makes them especially relevant for companies that want to make their HR departments more efficient, that work with large volumes of text data and candidate resumes, or that take part in outstaffing as either customers or contractors.

LLM for HR-Tech: opportunities and prospects for use

LLMs open up a whole range of opportunities for HR specialists, allowing them to automate routine operations and increase work efficiency. For example, LLMs can:

  • Format resumes: bring resumes to a single standard, which greatly simplifies their analysis and comparison;

  • Match candidates with vacancies: analyze resumes and job requirements, identifying the most suitable candidates;

  • Generate job descriptions: create attractive and informative job descriptions that will be of interest to potential employees.

Automating such tasks with LLMs allows HR professionals to focus on more strategic issues such as talent acquisition, candidate interviews, optimization of internal HR processes and employee development.

The main challenges of developing an AI assistant for HR

Despite promising prospects, the development of an AI assistant faces a number of difficulties:

  • Specifics of HR cases: the HR field requires a special approach that accounts for the variety of resume formats and for differences in how candidates structure their resumes and describe their professional skills, which complicates development;

  • Hallucinations: LLMs may generate non-existent or implausible information, which can lead to errors in candidate selection or to candidates being wrongly judged unsuitable for a particular vacancy;

  • The need to integrate external data and keep it up to date: achieving good results requires integration with various systems and databases, which can be technically, financially and organizationally complex. In addition, the data must be updated routinely and without interruption, since the candidate database constantly goes out of date.

Case Study: Automating Resume Formatting

Our team worked on a project for a small IT company whose goal was to speed up resume processing. The input was resumes in various formats that needed standardization; the output was documents ready for review, with a unified structure and design.

Description of the task

Resume files must be processed automatically according to a given template and requirements for text structuring and formatting. Source files come in the following formats:

  • .rtf – always looks the same,

  • .doc/.docx – may look different,

  • .pdf – always looks different.

Fig. 1. Example of a corporate template for structuring a resume

Additional requirements for the content of the converted resume:

  • The resume must follow the template and list experience from the most recent to the earliest (sometimes we receive resumes in which candidates do the opposite and put their oldest experience first and their latest last);

  • The functions performed should be written as a list, with each item beginning with a verbal noun rather than a verb (for example: “Development of a module…”);

  • The resume should not contain the names of companies and projects (it must be anonymized);

  • The stack should contain only technologies and not, for example, “English”;

  • The education section is adjusted to the candidate: not everyone has a higher education, and in that case the word “Higher” must be removed, leaving simply “Education”. If there are several entries, they are listed from latest to oldest (year, name of university/college, faculty and field of study);

  • If the resume includes completed courses and advanced training, they are listed in the “About Me” section.
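Some of the content requirements above lend themselves to simple deterministic checks on the extracted data. The sketch below is only an illustration: the field names (`end_year`, `level`), the deny-list and the helper functions are hypothetical and are not taken from the actual solution, where these rules were enforced through LLM prompts.

```python
from typing import Dict, List

# Hypothetical deny-list: the article only names "English" as an example
# of a non-technology item that must not appear in the stack.
NON_TECH_TERMS = {"english"}


def clean_stack(stack: List[str]) -> List[str]:
    """Keep only items that are not on the non-technology deny-list."""
    return [item for item in stack if item.strip().lower() not in NON_TECH_TERMS]


def sort_recent_first(entries: List[Dict], year_key: str = "end_year") -> List[Dict]:
    """Order experience or education entries from the latest to the oldest."""
    return sorted(entries, key=lambda entry: entry[year_key], reverse=True)


def education_heading(entries: List[Dict]) -> str:
    """Use "Higher education" only when at least one entry is higher education."""
    has_higher = any(entry.get("level") == "higher" for entry in entries)
    return "Higher education" if has_higher else "Education"
```

Checks of this kind could run after the LLM step as a safety net, so that a resume violating a rule is flagged before it reaches the reviewer.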

Solution architecture

We chose DUC SmartSearch as the main platform for the solution, since it suited our task with minimal modifications. SmartSearch is a digital ecosystem of AI assistants, built as a fork of the open-source Danswer system, which lets you create add-ons on top of various LLMs, both proprietary (GigaChat, YandexGPT, ChatGPT, etc.) and open-source (LLAMA, GEMMA, etc.). A visual node pipeline editor has been added to the system, so that you can create new LLM orchestration pipelines and AI agents for any subject area in low-code mode. The conceptual architecture of the product is as follows:

Fig. 2. Diagram of the conceptual architecture of the implementation of the resume formatting case on the DUC SmartSearch platform

An additional digital assistant “HR Assistant” was created in the system, and a special tool, ResumeTool, was developed and added. This tool allows the digital assistant to work with files: upload a source resume, send it to the processing pipeline, populate a template file with extracted data, and save the formatted resume in object storage. Users with the administrator role can upload their own template for a resume or other documents in the tool settings.

The following figure shows a diagram of use cases for the developed resume formatting solution.

Fig. 3. Structuring a resume according to a given template in DUC SmartSearch. Use case diagram

The algorithm for processing resumes in the DUC SmartSearch system is as follows:

  1. The administrator uploads the required resume template file on the settings page of the ResumeTool tool of the digital assistant “HR Assistant”;

  2. Then the administrator sets up a pipeline for processing initial resumes in the Flowise node editor (the stage of implementation and initial configuration of the system usually takes from 2 to 4 weeks);

  3. To process resume files according to a given template, a regular user goes to the “Chatbot” page, selects the configured digital assistant “HR Assistant” and uploads a resume file for processing;

  4. When a user sends a resume file on the “Chatbot” page, it is sent to the prepared Flowise pipeline and processed, and the structured resume text is returned to the digital assistant. Next, the received text is inserted into the given template and the user is given a link to download the prepared file. The system interface looks like this:

Fig. 4. Interface of the “Chatbot” page in DUC SmartSearch
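Step 4 of the algorithm can be sketched as an HTTP call to the Flowise pipeline. This is a minimal illustration under stated assumptions, not the actual ResumeTool code: it assumes the pipeline is exposed through the standard Flowise prediction endpoint, that the resume text is passed in the `question` field, and that the pipeline returns the structured resume as a JSON string in the `text` field of the response. The function names are hypothetical.

```python
import json
import urllib.request


def build_flowise_request(base_url: str, chatflow_id: str,
                          resume_text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to a Flowise prediction endpoint."""
    url = f"{base_url.rstrip('/')}/api/v1/prediction/{chatflow_id}"
    payload = json.dumps({"question": resume_text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_structured_resume(response_body: str) -> dict:
    """Parse the answer, assuming the "text" field holds the resume as JSON."""
    answer = json.loads(response_body)
    return json.loads(answer["text"])
```

The request would then be sent with `urllib.request.urlopen(request)`, and the parsed dictionary passed on to the step that fills in the template.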

Development of a resume formatting pipeline

The solution for formatting the resume text was based on a data extraction approach similar to the compilation of the Knowledge Map in the case “Intelligent Assistant for User Technical Support”. You can read about it in our previous article “How we made a smart assistant: Use Case for implementing a smart chatbot based on the Knowledge Map approach and LLM GigaChat” (https://habr.com/ru/articles/829022/).

The basic idea is to use an LLM to extract information from unstructured data into JSON. During development, three information extraction pipeline schemes were tested:

Fig. 5. Options for data extraction pipelines

In the first iteration of the pipeline, a single general prompt was used to obtain information for all sections of the resume. The approach produced low-quality results: missing data, ignored instructions, and a high level of hallucinations (made-up content). We believe that language models still struggle to handle multiple text analysis requirements in a single query.

The second iteration of the pipeline populated the entire JSON structure in the first request and then issued follow-up requests to check and edit individual fields in the overall JSON. The results became much better, and the solution was handed over to HR specialists for testing. After a couple of dozen resumes had been processed, shortcomings were identified, some of which were eliminated by tuning the prompts.

The transition to the third version of the pipeline resolved the remaining comments and also made it possible to configure each section separately. At the same time, the number of tokens transferred increased, because the entire text of the original resume was passed to the model to extract each JSON field. For us this was not critical: it did not greatly affect the processing speed, and model inference ran locally on a GPU (without paying for tokens). In this project, the quality and stability of the processing result were prioritized over other non-functional requirements.
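The per-field scheme of the third pipeline can be illustrated with a short sketch. It is not the actual Flowise pipeline: the field names and instructions in `FIELD_PROMPTS` are hypothetical, and `llm` stands for any callable that takes a prompt string and returns the model's text answer.

```python
import json
from typing import Callable, Dict

# Hypothetical per-section instructions; the real prompts are not given in the article.
FIELD_PROMPTS: Dict[str, str] = {
    "position": "Return the candidate's desired position as a JSON string.",
    "stack": "Return a JSON array of technologies only (no spoken languages).",
    "experience": "Return a JSON array of jobs, most recent first, without company names.",
}


def extract_resume(resume_text: str, llm: Callable[[str], str],
                   field_prompts: Dict[str, str] = FIELD_PROMPTS) -> dict:
    """Run one LLM request per field, passing the full resume text each time."""
    result = {}
    for field, instruction in field_prompts.items():
        prompt = f"{instruction}\n\nResume:\n{resume_text}\n\nAnswer with JSON only."
        raw_answer = llm(prompt)
        try:
            result[field] = json.loads(raw_answer)
        except json.JSONDecodeError:
            # Leave the field empty rather than keep an unparseable answer.
            result[field] = None
    return result
```

Because every field has its own prompt, a weak section can be tuned without touching the others, at the cost of sending the full resume text once per field.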

Interesting problems solved during the development process

Some resumes in .pdf format have incorrect encoding: when the text is parsed, extra spaces appear between the letters in words. LLMs process such “leaky” text very poorly or not at all:

Fig. 6. Fragment of problematic text

Experiments with prompts and various models led to the following solution. First, we count the spaces in the text and calculate their share of the total number of characters. If the share is greater than a typical value for normal resume texts, we remove all single spaces. The prepared text is sent to a preliminary pipeline, in which the model restores the missing spaces. The model copes well with this task, unlike the task of removing extra spaces between letters in words. Next, we process the resulting text with the main pipeline.
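The space-ratio heuristic can be sketched as follows. The threshold value is an assumption (the exact number used in the project is not stated), and "single spaces" is interpreted here as spaces that are not adjacent to another space; the function names are hypothetical.

```python
import re

# Assumed threshold: the article only refers to an average value
# for normal resume texts, without giving a number.
SPACE_RATIO_THRESHOLD = 0.3


def space_ratio(text: str) -> float:
    """Share of space characters in the total number of characters."""
    if not text:
        return 0.0
    return text.count(" ") / len(text)


def strip_single_spaces(text: str) -> str:
    """Remove every space that has no other space directly next to it."""
    return re.sub(r"(?<! ) (?! )", "", text)


def prepare_text(text: str, threshold: float = SPACE_RATIO_THRESHOLD) -> str:
    """Strip single spaces only when the text looks "leaky"."""
    if space_ratio(text) > threshold:
        return strip_single_spaces(text)
    return text
```

The output of `prepare_text` would then go to the preliminary pipeline, where the model puts the word boundaries back; normal texts pass through unchanged.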

Evaluation of results

Looking ahead, we note that the final solution runs on the multilingual Gemma 2 27B model with Q4 quantization. With these parameters the model can run on a consumer-grade video card with 24 GB of video memory.

As alternatives, we tested three more open local models: Mistral NeMo 12B Q8, Llama 3.1 8B FP16, Saiga-llama3 8B FP16. Their processing quality turned out to be markedly worse than that of the selected model: invented information, unspecified JSON fields added, looping response generation, etc.

To evaluate the results of processing resumes with the selected model, we used a testing methodology based on processing 10 random resumes, followed by evaluation by three independent experts. Scores were assigned for the quality of each resume section separately, on a 5-point scale:

1 – fictitious information is present;

2 – information was in the text, but was not extracted;

3 – only part of the information was extracted;

4 – all information has been extracted, but formatting requirements have not been met;

5 – all information was extracted in the required form.
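Aggregating such scores could look like the sketch below. The data layout (one list of scores per expert for each section) and the choice to average the experts' per-section means are assumptions made for illustration; the article does not describe how the averages were computed.

```python
from statistics import mean
from typing import Dict, List


def section_averages(scores: Dict[str, List[List[int]]]) -> Dict[str, float]:
    """scores[section] holds one list per expert, with one 1-5 score per resume."""
    averages = {}
    for section, per_expert in scores.items():
        for expert_scores in per_expert:
            if any(not 1 <= score <= 5 for score in expert_scores):
                raise ValueError(f"Score outside the 1-5 scale in section {section!r}")
        # Average each expert's scores first, then average across experts.
        averages[section] = mean(mean(expert_scores) for expert_scores in per_expert)
    return averages


def final_score(averages: Dict[str, float]) -> float:
    """Overall score as the mean of the section averages."""
    return mean(averages.values())
```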

The average scores of the three experts and the final score are presented in the table below.

Table 1. Results of assessing the formatting of resume files according to a given template

As you can see, the model copes worst with filling in the technology stack. The reason is that it does not interpret some technology names as development tools. This problem is solved by adding examples of the technologies that the model skips to the prompt. Then, on reprocessing, these technologies are added to the stack.

Project results

After automation was implemented, the time to format a resume decreased sixfold. The time spent on formatting before and after introducing the AI assistant:

  • Before automation: on average 30 minutes for formatting one resume by an HR employee.

  • After automation: 2 minutes for automatic formatting and 3 minutes for verification, total 5 minutes.

Given that one HR specialist/researcher processes an average of 100 resumes per month, the solution saves a significant amount of time (more than 40 hours of HR specialist work per month).
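The figures above can be checked with simple arithmetic. The constant names are chosen for illustration; the numbers are the ones given in this section.

```python
MINUTES_BEFORE = 30        # manual formatting of one resume
MINUTES_AFTER = 2 + 3      # automatic formatting plus verification
RESUMES_PER_MONTH = 100    # average load of one HR specialist

# How many times faster one resume is formatted.
speedup = MINUTES_BEFORE / MINUTES_AFTER

# Hours saved per specialist per month.
saved_hours = (MINUTES_BEFORE - MINUTES_AFTER) * RESUMES_PER_MONTH / 60

print(f"Speedup: {speedup:.0f}x, saved per month: {saved_hours:.1f} hours")
```

Saving 25 minutes on each of 100 resumes gives 2,500 minutes, or about 41.7 hours per month, which matches the "more than 40 hours" estimate.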

Scaling the solution

Integration of a node editor for setting up processing pipelines and loading a template file allows SmartSearch administrators to configure the digital assistant to automatically process various types of documents according to specified requirements and a template. Document processing via Telegram is available as an option.

The created solution is easily scalable to various resume formats. Because we took a modular approach, integrating new functionality takes only a few days. This allows companies to quickly adapt to changing requirements and expand the capabilities of the HR assistant.

New challenges for HR-Tech

There are several new tasks in our backlog, including improving the algorithms for selecting vacancies based on candidate resumes. We plan to develop functionality that will not only match applicants' skills with vacancies, but also recommend potential positions based on their career goals. The functionality is at the design stage; if you are interested in piloting it at your company, you are welcome to get in touch!

More news in plain language about IT, trends, technologies and our cases is available in the author’s Telegram channel t.me/dataundercontrol.
