Writing a chatbot to work with PDFs

The popularity of language models, ChatGPT in particular, is growing exponentially, but many of us still run into limitations, such as outdated information, that OpenAI has not yet been able to overcome.

But have you ever thought about asking questions directly of the documents stored in your cloud? Save the time you spend searching and manually monitoring sites, and automate your work with PDF documents. If this prospect interests you, you will find this article a valuable resource.

We can avoid the risk of stale data in ChatGPT by running the model with RAG. In this article, we will explain in detail how to create a chatbot that interacts with documents from your repository using LangChain.

Let's get started(:

What is RAG?

We all know that ChatGPT's answers are based on its training data, over which we have no control, so a method that lets us enrich the model with our own data becomes essential.

RAG (Retrieval-Augmented Generation) is an approach to working with large language models such as ChatGPT, Llama or Cohere that aims to improve the accuracy of responses by integrating an external data store into the generation process. The idea is to combine the context of the user's request, relevant knowledge from the RAG store and previous interaction history, which lets the language model give a more thorough and accurate answer.

RAG for LLMs (diagram)
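To make the flow concrete, here is a minimal sketch of the RAG loop. The vector_store.search and llm.generate calls are hypothetical placeholders for whatever store and model client you use; what matters is the order of operations: retrieve, augment the prompt, generate.

# A minimal RAG sketch. vector_store and llm are hypothetical objects
# standing in for a real vector database and LLM client.
def rag_answer(question: str, vector_store, llm) -> str:
    # 1. Retrieve: find the document chunks most relevant to the question
    relevant_chunks = vector_store.search(question, top_k=3)
    # 2. Augment: put the retrieved chunks into the prompt as context
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n".join(relevant_chunks) +
        f"\n\nQuestion: {question}"
    )
    # 3. Generate: let the language model answer with that context in hand
    return llm.generate(prompt)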

What about LangChain?

Simply put, LangChain is a framework that makes working with language models easier and provides the tools we need to build applications powered by LLMs.

But why this framework? It lets us develop intelligent applications that take the context of requests into account. This is especially valuable for providing information relevant to user queries, which is a key aspect of our mission. Also remember that ChatGPT has a number of limitations (such as out-of-date facts, unavailable user data or limited knowledge), and LangChain is aimed at overcoming them.
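For a first taste of the framework, here is a minimal sketch (it assumes the langchain-openai package is installed and the OPENAI_API_KEY environment variable is set):

from langchain_openai import ChatOpenAI

# Create a chat model client and send it a single prompt
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
print(llm.invoke("Explain RAG in one sentence.").content)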

FAISS

Most large language models (LLMs) have limits on the number of words or tokens per query, so when it comes to working with PDFs (even a single document, let alone tens or hundreds), we simply cannot present all the data as one context at once; that would be not only inconvenient but also ineffective.

FAISS (Facebook AI Similarity Search), developed by the AI team at Meta, is a library for fast similarity search and clustering of dense vectors. It also supports graphics processing units (GPUs) for accelerated data processing.

FAISS architecture (diagram)

Within our application, FAISS lets us retrieve and return text that is relevant to the user's query, thus solving the word-limit problem mentioned earlier.
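To show what FAISS itself does, here is a small self-contained sketch with random vectors standing in for document embeddings (in our application, LangChain will manage the index for us):

import faiss
import numpy as np

dim = 128  # dimensionality of the vectors
vectors = np.random.random((1000, dim)).astype("float32")  # 1000 "document" vectors

index = faiss.IndexFlatL2(dim)  # exact nearest-neighbor search with L2 distance
index.add(vectors)              # add the vectors to the index

query = np.random.random((1, dim)).astype("float32")
distances, ids = index.search(query, 5)  # find the 5 most similar vectors
print(ids)  # row indices of the nearest "documents"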

Embeddings

We mentioned vector representations in the previous section, but what is their significance? Vectors are numeric arrays that let us perform various mathematical operations on text, such as finding fragments with a given meaning or context. Large language models, and ChatGPT in particular, are built on these representations.

In general, search is one of the most common uses of vector representations: we convert the data into vector format and then store it in a vector database. To search, we use a similarity-based method built on the KNN algorithm (K-Nearest Neighbors).

To illustrate: with a traditional keyword search, a query like "Find me Intel Mac models" in the Solr search engine would probably not return high-quality results, because the platform does not interpret the meaning of the words; it looks for words with similar spelling and sound. With vector search, on the other hand, we can take into account not only the presence of a word but also its context in the sentence, thanks to the way words are stored as vectors.
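Here is a small sketch of the idea (it assumes the OPENAI_API_KEY environment variable is set; the phrases are made up): semantically related texts produce more similar vectors than unrelated ones.

from langchain_openai import OpenAIEmbeddings
import numpy as np

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

embeddings = OpenAIEmbeddings()
v1 = embeddings.embed_query("Find me Intel Mac models")
v2 = embeddings.embed_query("Apple computers with Intel processors")
v3 = embeddings.embed_query("banana bread recipe")

print(cosine_similarity(v1, v2))  # higher: related meaning
print(cosine_similarity(v1, v3))  # lower: unrelated meaning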

Shall we code it?

  1. Let's install the necessary packages and import them:

# Install the required packages
pip install openai
pip install langchain
pip install langchain-openai
pip install PyPDF2
pip install langchain-community

# Install FAISS for GPU
pip install faiss-gpu
# pip install faiss-cpu  # if you don't have a GPU

# Import for error handling and logging
import logging

# Import the PDF reader from PyPDF2
from PyPDF2 import PdfReader

# LangChain imports: text splitting, vector store, chains, memory and the OpenAI wrappers
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.memory import ConversationBufferMemory
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

  2. Next, we need to work with the incoming PDF file:

# Set the logging level
logging.basicConfig(level=logging.INFO)

# Function to split text into paragraphs
def split_paragraphs(rawText):
    # Initialize a text splitter that breaks the text into chunks
    text_splitter = CharacterTextSplitter(
        separator="\n",              # Separator: the newline character
        chunk_size=1000,             # Maximum size of a text chunk
        chunk_overlap=200,           # Overlap between chunks
        length_function=len,         # Function used to measure text length
        is_separator_regex=False,    # The separator is not a regular expression
    )
    # Split the text into chunks and return the result
    return text_splitter.split_text(rawText)

# Function to load text from PDF files
def load_pdfs(pdfs):
    text_chunks = []  # Empty list to hold the text chunks
    # Iterate over all PDF files
    for pdf in pdfs:
        try:
            with open(pdf, 'rb') as file:
                reader = PdfReader(file)
                for page in reader.pages:
                    raw = page.extract_text()
                    chunks = split_paragraphs(raw)
                    text_chunks += chunks
        except Exception as e:
            logging.error(f"Error loading PDF {pdf}: {e}")
    # Return the list of text chunks
    return text_chunks

# Placeholder for the FAISS vector store
store = None  # Initially None; it will be created later

  3. Next, we need to store the processed text:

# Main program function
def main():
    # List of PDF files to process
    list_of_pdfs = ["test.pdf"]

    # Load the text from the PDF files and split it into chunks
    text_chunks = load_pdfs(list_of_pdfs)

    # Initialize an OpenAIEmbeddings instance to create text embeddings
    embeddings = OpenAIEmbeddings()

    # Build the FAISS index from the text chunks
    store = FAISS.from_texts(text_chunks, embeddings)

    # Write the index to disk
    store.save_local("./vectorstore")

# Call main() if this script is run directly rather than imported as a module.
if __name__ == "__main__":
    main()

  4. Now let's set up the interaction:

# Load the saved FAISS store from disk. OpenAIEmbeddings() is used to convert text into vectors.
store = FAISS.load_local("vectorstore", OpenAIEmbeddings(), allow_dangerous_deserialization=True)

# Create a GPT-3.5 Turbo model instance
llm = ChatOpenAI(model_name="gpt-3.5-turbo-0125", temperature=0)

# Create a RetrievalQA chain, passing the llm model and the store as parameters
chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=store.as_retriever()
)

# Load the conversation history
conversations = [...]  # [...] is your list of conversations
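For clarity, each item is expected to be a dictionary with the keys used in the loop below; the sample content here is made up:

# Hypothetical example of one entry in the conversations list:
# {"human_question": "What is this document about?",
#  "chatbot_answer": "It describes the company's 2023 product roadmap."}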

  5. Let's save the context:

# Create a ConversationBufferMemory instance
memory = ConversationBufferMemory()

# Walk through the conversations and save each one's context in memory
for msg in conversations:
    memory.save_context(
        {"input": msg['human_question']},
        {"output": msg['chatbot_answer']}
    )

# Create a RetrievalQA chain, this time passing the llm model, the store, and the memory as parameters
chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=store.as_retriever(),
    memory=memory
)
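Finally, a quick sanity check (the question text is just an example): RetrievalQA returns a dictionary whose "result" key holds the model's answer.

# Ask a question against the indexed PDF
response = chain.invoke({"query": "What is this document about?"})
print(response["result"])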

Let's sum it up

In this article, we took a detailed look at building a simple chatbot for working with PDF documents using the LangChain framework. Of course, the bot presented here is only the first stage of a fully functional tool, but it is an important starting point for understanding how such applications are built. You can deploy it locally and evaluate the functionality.

I think this is of real interest, especially since, as we've already noted, OpenAI cannot currently address every flaw in its models, which leads to out-of-date or incorrect answers; the dataset the models were trained on may simply be irrelevant. Software like this lets us rely on trusted data sources (or take responsibility for the relevance of the answers ourselves, depending on the chosen source of information). In any case, it can save significant time, and that is a definite plus.

Of course, there are ready-made solutions like the AskYourPDF plugin, but there is a certain satisfaction in creating your own product, tailored to your own needs and preferences.

Thanks for reading! We'd love to hear your opinion(:
