LLM agent for working with Google Spreadsheets

I work at SberDevices, and today I want to talk about one of our research areas: the development of agent systems based on large language models.

In this article, we will give an overview of the key technological aspects of designing and implementing LLM agents: how LLMs work together with external functions, the main components of multi-agent systems, and methods of controlled generation and of increasing robustness. We will also present in detail the architecture and construction of one of our LLM agent prototypes, aimed at performing tasks in Google Spreadsheets.

Our LLM agent was implemented using the GigaChain and GigaGraph SDKs, adapted to work with GigaChat. You can look at the final version of the Google Spreadsheets agent in the repository, or start developing your own agent with the introductory tutorial.

LLM agents

What is an LLM agent? While there is no established and universally accepted definition for agents, one option can be formulated as follows:

An LLM agent is a type of software that uses an LLM to determine which operations to perform, in what order, and with what input parameters.

In other words, in classical software the sequence of function calls and their arguments are completely determined by the human developer, while in an LLM agent a language model is responsible for this. It is important to emphasize that the LLM itself does not execute anything (with rare exceptions): it only says which operations should be performed, with what parameters, and in what order.

Let's consider a simple example: we ask LLM about the current temperature in the capital of Argentina.

User: What is the weather now in the capital of Argentina?
LLM: We need to call the get_capital() function with the parameter country="Argentina"
Executor: Buenos Aires
LLM: Now we need to call the get_weather() function with the parameter city="Buenos Aires"
Executor: +12, rain
LLM: It is currently rainy in the capital of Argentina, +12 degrees

Since the weights of any LLM contain information only from historical training data, the model does not know what the weather is like in the capital of Argentina at the current moment. Moreover, it may not know the capital of Argentina at all or, more likely, it may hallucinate and make a mistake. To fill gaps in the LLM's memory and reduce factual errors, you can allow the model to use external functions.

The list of available functions and their descriptions is passed to the LLM in advance in the system prompt or in a special syntactic form (for example, via dedicated tokens). When a request is received, the LLM determines which functions should be called (and whether they should be called at all), in what order, and with what input arguments. The calls themselves are made by an external module (the executor). In the example above, the LLM outputs the selected functions in free text form, which is convenient for humans to read but difficult to parse and automate. It is therefore more practical to force the LLM to return the list of functions to call in a special text form, for example JSON.

Our modified example would look like this:

User: What is the weather now in the capital of Argentina?
LLM: 
{
    "name": "get_capital",
    "parameters": {
        "country": "Argentina"
    }
}
Executor: Buenos Aires
LLM: 
{
    "name": "get_weather",
    "parameters": {
        "city": "Buenos Aires"
    }
}
Executor: +12, rain
LLM: It is currently rainy in the capital of Argentina, +12 degrees

As you can easily see, the dialogue has become harder to read by eye, but the LLM's responses are much easier to work with in code. This approach to calling operations was proposed by OpenAI and is called tool/function calling (sometimes shortened to FC). It requires rather long additional training of the model via SFT on a special dataset, so not all language models support it. Nevertheless, function calling is currently perhaps the most common way to work with functions in LLMs, although other approaches exist.
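To make this concrete, here is a minimal sketch of the executor side of function calling: the JSON emitted by the model is parsed and dispatched to local Python functions. `get_capital` and `get_weather` are toy stubs invented for this illustration, not real APIs:

```python
import json

# Toy stand-ins for real tools; a production agent would call actual APIs.
def get_capital(country: str) -> str:
    return {"Argentina": "Buenos Aires"}[country]

def get_weather(city: str) -> str:
    return "+12, rain"

TOOLS = {"get_capital": get_capital, "get_weather": get_weather}

def execute_tool_call(raw: str) -> str:
    """Parse a JSON tool call emitted by the LLM and dispatch it."""
    call = json.loads(raw)
    return TOOLS[call["name"]](**call["parameters"])

result = execute_tool_call('{"name": "get_capital", "parameters": {"country": "Argentina"}}')
# -> "Buenos Aires"
```

In a real loop the result string would be appended to the message history and the model queried again, exactly as in the dialogue above.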

About the difference between tool calling and function calling

OpenAI initially called the approach function calling, following what appears to be a generally accepted definition of a function from a procedural programming point of view, and allowing that a function can not only return a value, but also implicitly change the state of an object (side effect). Proponents of functional programming, where a function cannot create side effects, strongly disagree with this. To separate these concepts, OpenAI renamed function calling to tool calling, where tool is a more general abstraction than function (any function is a tool, but not vice versa). However, at the moment there are no tools other than functions in the OpenAI API, so these concepts can be used interchangeably, although the term “function calling” is considered obsolete by the company itself.

LLM agent as RL agent

From the point of view of the logic described above, any LLM agent can be considered a special case of an RL agent, where the state of the environment is the sequence of tokens passed to the LLM. Schematically, such an agent operates as follows:

1) The agent observes the initial state s = s_0, which is the set of initial dialogue messages for the LLM. For the example above, the state s_0 is the system prompt plus a request like "What is the weather now in the capital of Argentina?"

2) The agent selects (samples) one action a \in A from the policy \pi(a | s) given the state s:

a \sim \pi(\cdot | s)

For an LLM agent, A is the set of all available tools (plus the action "do not select any tool"), the state s is the current dialogue, and the policy \pi(\cdot | s) is modeled by the LLM.
3) The agent "performs" the action a and moves to a new state (a new dialogue) s_{new}:

s_{new} \sim p(\cdot | s, a)

The transition function p(\cdot | s, a) is, in fact, modeled by the LLM itself and depends both on the model's weights and on the feedback received after the action a is performed.

4) The agent returns to step 2 and runs until the state becomes terminal (for example, when the "do not select any tool" action is chosen).

The most important observation in this comparison of LLM and RL agents is that the policy over actions \pi(\cdot | s) is present implicitly in the LLM weights, and we usually cannot change it directly. Even if we can change the model weights through additional training, it is quite difficult to express this policy as a function of the LLM weights and optimize it with gradient methods (although it is possible). For this reason, decomposing the input task and writing competent prompts is the main tool for setting up an LLM agent's policy.
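The four steps above can be sketched as a plain Python loop; `policy` here is a hard-coded stand-in for sampling the next action from the LLM:

```python
def policy(state: list[str]) -> str:
    """Stand-in for the LLM: in a real agent the next action is sampled
    from the model conditioned on the whole dialogue (the state)."""
    if not any("get_capital" in m for m in state):
        return "get_capital"
    if not any("get_weather" in m for m in state):
        return "get_weather"
    return "finish"  # the "do not select any tool" action

def run_agent(system_prompt: str, user_query: str) -> list[str]:
    state = [system_prompt, user_query]      # s_0: the initial dialogue
    while True:
        action = policy(state)               # a ~ pi(. | s)
        if action == "finish":               # terminal state reached
            return state
        # s_new ~ p(. | s, a): the dialogue grows with the call and its result
        state = state + [f"call {action}", f"result of {action}"]

trajectory = run_agent("You are a helpful agent.", "Weather in the capital of Argentina?")
```

The trajectory ends once both tools have been called, i.e. once the policy selects the terminal action.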

LLM Agent Components

Above we looked at a very simple example of an LLM agent, where both the task and the functions were quite trivial. In the real world, problems can be significantly more complex, causing the LLM to hallucinate and call the wrong operations. The heuristic for solving this problem is quite simple: break a large problem into several subtasks and solve them sequentially. For this reason, modern LLM agents typically consist of several components. Next we will look at the 4 most important of them (in fact, there are more).

Context retriever

The purpose of this component is to obtain the necessary context about the task (or, more formally, about the world model) for future LLM use. This step essentially collects all the information that the agent can get its hands on. It is difficult to give universal examples of this module's operation, since they depend entirely on the subject area (domain), but we can list the most common patterns:
— RAG with search over collections of text documents;
— Web search by keywords;
— Searching for information in knowledge graphs and databases;
— Human feedback (requesting additional information from the user in a dialogue format);
— API access to the target service.

We already saw a Context retriever at work in the first example, when the LLM called functions to retrieve information about country capitals and city weather:
— To determine the weather, it is obviously easiest to use the API of the open weather forecast service.
— To search for the capital of a country, you can, for example, use a ready-made text document with pairs “country – capital” and arrange a linear search on it (well, or binary, if you sort the countries in advance :)).

However, the last option implies that we have created such a file in advance and placed it somewhere accessible to the agent. If we only want to extract the capital by country name, this approach seems fine, but if we want to be able to extract another 100–200 characteristics (population, year of founding, anthem text, …), then creating and maintaining such files becomes unjustified. We can consider another approach: given the name of the country, we download the source text of its page from Wikipedia/Ruviki and ask the LLM to extract the desired characteristic.

Example about Argentina:

User:
Argentina (Spanish: Argentina [aɾxenˈtina]), officially the Argentine Republic (Spanish: República Argentina [reˈpuβlika aɾxenˈtina]), is the second largest country in South America by territory after Brazil and the third by population after Brazil and Colombia, consisting of 24 administrative units: 23 provinces and the federal capital district of Buenos Aires. It is the largest Spanish-speaking country in the world by area. As of 2023, according to CIA World Factbook estimates, Argentina is the 33rd country in the world by population (46,621,847 people)[8]. The religious composition of Argentina's population as of 2022: Catholics — 48.9%, other Christians — 10%, non-religious — 39.8%, believers of another religion or confession — 1.3%[9].
User: Find the name of the country's capital in the text above
LLM: Buenos Aires

It is important to note, therefore, that when obtaining context, the LLM itself can and should be used to prepare and modify the extracted data.

Planner

The planner's task is formulated quite simply: it needs to create a plan for executing the task, which will then be passed to the LLM that calls the functions. However, a naive query may give a result that is completely different from what would be useful to us.

Let's look at an example (GPT-4o):

User: Write a step-by-step plan for brewing milk oolong tea
LLM: 
**Step 1: Prepare the necessary tools and ingredients**
- **Ingredients**: 5-7 grams of milk oolong (1-2 teaspoons).
- **Tools**: a brewing teapot (clay, porcelain or glass), a cup, a strainer.
- **Water**: filtered water heated to 80-90°C.
**Step 2: Warm the teaware**
- Boil water and pour it into the teapot and cup.
- Let it stand for a few seconds so the teaware warms up, then pour the water out.
**Step 3: Rinse the tea (optional)**
- Put the milk oolong into the teapot.
- Pour in a small amount of hot water and drain it immediately.
- This helps clean and awaken the tea leaves.
**Step 4: Brewing**
- Pour hot water (80-90°C) into the teapot, fully covering the tea leaves.
- Steep the tea for 1-2 minutes for the first infusion.
**Step 5: Pour the tea**
- Strain the tea through the strainer and pour it into cups.
**Step 6: Re-brewing**
- Milk oolong can be brewed up to 5-7 times.
- For each subsequent infusion, increase the steeping time by 30 seconds.
**Step 7: Enjoy the tea**
- Drink the tea right after brewing, enjoying its soft, creamy taste.

From a human point of view, the plan looks complete, understandable and quite logical. However, for the LLM agent that will execute it, the plan is one big hallucination: no one can guarantee that each step written above can actually be performed. For that, each step of the plan must correspond to an operation (function) that the LLM can call, and such an operation may not exist.

To avoid this problem, you need to define an action space: the set of all actions that the agent can perform. The resulting action space is passed to the planner, which may only operate with the available actions. Such a planner is called "grounded".
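Such a grounding check can be sketched as a simple filter over the generated plan; the action space below is a hypothetical simplification of the tea example:

```python
# Hypothetical action space; every plan step must begin with one of these.
ACTION_SPACE = [
    "boil water",
    "take tea",
    "put tea in the cup",
    "pour water into the cup",
    "wait",
    "remove the brewed tea",
]

def validate_plan(plan: list[str]) -> list[str]:
    """Return the steps that do NOT map to any allowed action."""
    return [step for step in plan
            if not any(step.lower().startswith(a) for a in ACTION_SPACE)]

plan = ["boil water to 85-90C", "levitate the teapot", "wait 3 minutes"]
bad_steps = validate_plan(plan)   # -> ["levitate the teapot"]
```

Steps flagged by such a check can be sent back to the planner (or to a critic) for correction.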

Let's look at our modified example:

User: 
Write a step-by-step plan for brewing milk oolong tea 
You may only perform these actions 
- Take a certain amount of the desired tea 
- Remove the brewed tea 
- Pour water into the kettle 
- Put tea into the cup 
- Wait a certain number of minutes 
- Pour water into the cup 
- Boil water to a given temperature
LLM: 
1. Boil water to 85-90°C.
2. Take 1 teaspoon of milk oolong per 150-200 ml of water.
3. Put the tea into the cup.
4. Pour hot water into the cup.
5. Wait 3-5 minutes for the tea to steep.
6. Remove the brewed tea.

Now the LLM makes a plan using only the actions available to it (unless it hallucinates, of course). However, a much harder question then arises: how do we get LLMs to generate the most accurate and optimal plan possible? Let's consider the main approaches.

Chain-of-thought (CoT)
Idea: let the model reason in a chain of thoughts before generating the plan.

Thoughts: 
I need to use the available actions to brew milk oolong correctly. 
I will start by boiling the water and then move on to brewing the tea, following the steps 
to get the perfect taste.

Plan:
1. Boil water to the given temperature of 85-90°C.
2. Pour water into the kettle.
3. Take 1 teaspoon of milk oolong.
4. Put the tea into the cup.
5. Pour water into the cup.
6. Wait 3-5 minutes.
7. Remove the brewed tea.

Self-consistency (SC)
Idea: Generate several plans and then have the LLM select the best one.

Plan 1:
Boil water to the given temperature of 85-90°C.
Pour water into the kettle.
Take 1 teaspoon of milk oolong.
Put the tea into the cup.
Pour water into the cup.
Wait 3-5 minutes.
Remove the brewed tea.

Plan 2:
Boil water to the given temperature of 85-90°C.
Take 1 teaspoon of milk oolong.
Put the tea into the cup.
Pour water into the kettle.
Pour water into the cup.
Wait 3-5 minutes.
Remove the brewed tea.

Best plan: 1
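A minimal sketch of self-consistency; instead of an LLM judge it uses the classic, cheaper variant of the technique, a majority vote over the sampled plans:

```python
import json
from collections import Counter

def self_consistent_plan(sample_plan, prompt: str, n: int = 5) -> list[str]:
    """Sample n plans at non-zero temperature, then pick the most frequent
    one (majority vote). The article's agent instead asks the LLM itself to
    choose the best plan; sample_plan stands in for a real model call."""
    plans = [json.dumps(sample_plan(prompt)) for _ in range(n)]
    best, _ = Counter(plans).most_common(1)[0]
    return json.loads(best)

# Toy sampler that returns one of two candidate plans per call.
calls = iter([["boil", "brew"], ["brew", "boil"], ["boil", "brew"]])
plan = self_consistent_plan(lambda p: next(calls), "make tea", n=3)
# -> ["boil", "brew"]
```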

Tree-of-thoughts (ToT)
Idea: at each step we generate several options for the next step, evaluate or rank them, and traverse the resulting tree with a chosen algorithm (below is a trivial example with greedy search; BFS, DFS, etc. can be used similarly).

Current plan: 
1. Boil water to the given temperature of 85-90°C. 
2. Pour water into the kettle. 
3. Take 1 teaspoon of milk oolong.

Option 1: Put the tea into the cup.
Option 2: Pour water into the cup.
Option 3: Wait a certain number of minutes.
Thoughts: It is more logical to put the tea into the cup first, before pouring water into it, so that the tea can start brewing immediately.

Best option: 1
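The greedy ToT traversal can be sketched as follows; `propose` and `score` stand in for LLM calls that generate and evaluate candidate next steps:

```python
def greedy_tree_of_thoughts(propose, score, max_steps: int = 10) -> list[str]:
    """At each step, generate several candidate next steps, score them,
    and greedily extend the plan with the best one. propose returns []
    when the plan is complete."""
    plan: list[str] = []
    for _ in range(max_steps):
        candidates = propose(plan)
        if not candidates:
            break
        plan.append(max(candidates, key=score))
    return plan

# Toy example: candidates are the remaining steps, scored by a priority table.
ORDER = {"boil water": 3, "put tea in the cup": 2, "pour water": 1}
plan = greedy_tree_of_thoughts(
    propose=lambda plan: [s for s in ORDER if s not in plan],
    score=lambda s: ORDER[s],
)
# -> ["boil water", "put tea in the cup", "pour water"]
```

Replacing the greedy `max` with a frontier of the k best partial plans gives the BFS/DFS variants mentioned above.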

Monte Carlo Tree Search (MCTS)
Separately, it is worth mentioning the approach based on MCTS with UCB scoring, which has received much attention in the literature recently. This algorithm stochastically traverses the tree of action options and updates the value of the Q-function for each (state, action) pair. The approach is also notable because it automatically collects a dataset for subsequent additional training of the LLM agent via DPO (or regular RL algorithms for LLMs). However, MCTS requires either a good proxy estimator of the Q-function (via logits or an external model) or a good reward function.
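Node selection in MCTS is typically driven by the UCB1 score; a minimal sketch (the statistics below are made up for illustration):

```python
import math

def ucb1(q: float, parent_visits: int, visits: int, c: float = 1.41) -> float:
    """UCB1 score for MCTS node selection: the exploitation term q plus an
    exploration bonus that shrinks as the action is visited more often."""
    if visits == 0:
        return float("inf")   # always try unvisited actions first
    return q + c * math.sqrt(math.log(parent_visits) / visits)

def select_action(stats: dict) -> str:
    """stats maps action -> (mean reward q, visit count); pick by UCB1."""
    parent_visits = sum(n for _, n in stats.values()) or 1
    return max(stats, key=lambda a: ucb1(stats[a][0], parent_visits, stats[a][1]))

best = select_action({"plan_a": (0.8, 10), "plan_b": (0.5, 2), "plan_c": (0.0, 0)})
# plan_c is unvisited, so it is explored first
```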

Reflector

The reflector (critic) is a component that generates comments on a ready-made model answer and improves the answer based on them. A critic can be attached to arbitrary parts of an LLM agent, but we will concentrate on critics for planners, whose task is to improve an already prepared plan.

Critics for the planner can be divided into the following types:
— General critic;
— Specialized critic (can only change certain parts/aspects of the plan);
— Corrective critic (fixes explicitly detected errors);
— Action critic (rewrites plan items so that they come from the action space).
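A reflector chain can be sketched as a sequence of plan-to-plan functions; the `action_critic` below is a toy version of the action critic (real critics would be LLM calls with specialized prompts):

```python
def reflect(plan: list[str], critics) -> list[str]:
    """Run the plan through a chain of critics; each critic returns a
    (possibly) corrected version of the plan."""
    for critic in critics:
        plan = critic(plan)
    return plan

# Toy "action critic": drop steps that are outside the action space.
ACTION_SPACE = {"boil water", "put tea in the cup", "pour water", "wait"}
def action_critic(plan):
    return [s for s in plan if s in ACTION_SPACE]

fixed = reflect(["boil water", "levitate the teapot", "wait"], [action_critic])
# -> ["boil water", "wait"]
```

A real action critic would usually rewrite an invalid step into the closest valid one rather than drop it, but the chaining structure is the same.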

Executor

The task of this module is to call the functions that the LLM agent plans to call. As input it takes all the available information (context) collected by the Context retriever and the plan generated by the Planner and, possibly, refined by the Reflector. It then calls the functions available to it in a loop until it considers the entire plan completed.

The key for an executor is the set of functions (or tools) available to it. Tools are usually described in the form specified by OpenAI:

{
	"type": "function",
	"function": {
		"name": "get_delivery_date",
		"description": "Get the delivery date for a customer's order.",
		"parameters": {
			"type": "object",
			"properties": {
				"order_id": {
					"type": "string",
					"description": "The customer's order ID."
				}
			},
			"required": ["order_id"],
			"additionalProperties": false
		}
	}
}

It is important to understand that the planner's action space and the set of functions available to the agent may not coincide. Moreover, the executor may fail to execute the plan even without hallucinations in its reasoning (for example, if it encounters an unavailable or prohibited action). In this sense, the plan is not executed deterministically point by point; it only helps the LLM agent be more predictable, robust and accurate.

Controlled generation

Above we looked at the 4 main components of an LLM agent:
— Context retriever;
— Planner;
— Reflector;
— Executor.

To make it more convenient to work with them in prompts, as a rule, a restriction is set on the form of the model’s response, for example:

User:
Write information about any country in the world 
Your answer must be in the form: 
{ 
	"country": country name, 
	"capital": capital name, 
	"population": population 
}
LLM:
{ 
	"country": "Japan", 
	"capital": "Tokyo", 
	"population": 126.3 million 
}

And although modern models cope very well with restrictions on the response format, they can still occasionally make mistakes (for example, forgetting a comma or not respecting a data type). To solve this problem, controlled generation is used, which physically prohibits the LLM from generating inappropriate tokens. You can read more in the OpenAI documentation.
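When controlled generation is unavailable, a common fallback is to validate the reply and re-ask the model on failure; a minimal sketch with a stubbed model call:

```python
import json

def generate_json(llm, prompt: str, required_keys: set, max_retries: int = 3) -> dict:
    """Validate the model's reply as JSON with the required keys,
    re-asking on failure. llm stands in for a real model call."""
    for _ in range(max_retries):
        reply = llm(prompt)
        try:
            data = json.loads(reply)
        except json.JSONDecodeError:
            continue  # malformed JSON: ask again
        if isinstance(data, dict) and required_keys <= data.keys():
            return data
    raise ValueError("model failed to produce valid JSON")

# First reply is broken (missing brace), second one is valid.
replies = iter(['{"country": "Japan", "capital": "Tokyo"',
                '{"country": "Japan", "capital": "Tokyo", "population": 126300000}'])
result = generate_json(lambda p: next(replies), "Describe a country",
                       {"country", "capital", "population"})
```

In production the retry prompt would usually include the parse error so the model can correct itself, but the validate-and-retry skeleton is the same.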

Google Spreadsheet API

Above we examined the general principles of constructing LLM agents; now we will look more closely at Google Spreadsheets. This tool is similar in functionality to Excel but runs in the cloud as SaaS. To work with it via the API, Google provides a free and quite extensive set of tools.

For API requests to a specific work file, 2 components are required:

  • Document ID (unique, located in the URL);

  • A file with the credentials of the user on whose behalf the file will be processed.

Request Format

Google provides a ready-made Python library for working with simple requests (such as read or write) and for authentication.

Example code for reading cells from a sheet:

from googleapiclient.discovery import build

def read_values(self, range_name):
	service = build("sheets", "v4", credentials=self.creds)
	result = (
		service.spreadsheets()
		.values()
		.get(spreadsheetId=self.spreadsheet_id, range=range_name)
		.execute()
	)
	rows = result.get("values", [])
	return rows

Our agent implements the following spreadsheet operations:
— Writing a value or formula into a cell;
— Reading data from one or more cells;
— Writing tables from markdown format;
— Creation of pivot tables;
— Creation of graphs;
— Autofill cells.

This functionality is sufficient for most typical spreadsheet scenarios and is fully covered by the API capabilities (except, perhaps, error reporting).

Google Spreadsheet Agent

As discussed above, to build an LLM agent we will need to implement several components separately. It is important to note that all modules use controlled generation and CoT for intermediate reasoning to improve the quality and robustness of the LLM agent.

Context retriever

The most important information about the world for our LLM agent is the data already recorded in the Google Spreadsheet. Generally speaking, there is no deterministic algorithm for reading and analyzing information from an Excel-like sheet, since the spatial structure can be very complex, but there are approximate methods that work for relatively simple scenarios. We settled on two main structures: a relational table and a vertical dictionary.

2 types of tables

To obtain context, we use the following idea: we can pass the read data to the LLM to obtain its semantics. A problem arises here: there can be a lot of data (hundreds of thousands of rows), and it will not fit into the model's context. However, this is not necessary: it is enough to take a small sample of the data and send it to the LLM.
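The sampling step can be sketched as rendering the header plus the first few rows as a compact markdown table for the prompt:

```python
def sample_as_markdown(header: list[str], rows: list[list[str]], k: int = 5) -> str:
    """Render the header and the first k rows as a markdown table, small
    enough to fit into the model context for semantic annotation."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    for row in rows[:k]:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)

md = sample_as_markdown(["country", "capital"],
                        [["Argentina", "Buenos Aires"], ["Japan", "Tokyo"]])
```

A random sample instead of the first k rows can give the model a more representative picture of heterogeneous columns.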

The final request looks something like this:

You are a professional data annotator. You are given a markdown table with columns names and sample data. Also you are given a user query.
First, you should find out if this table is in a usual relation database form or in a form of dictionary

If this table is in a dictionary form, then you should return a json dictionary of this table, for example:
{
	"cats" : 5,
	"dogs" : 4,
	"all" : 9,
}

If table is in a relational database form, then describe all data columns in the following json form:
{
	"name" : name of column,
	"description" : brief description of this column, try to provide meaning of this column for the user
	"type" : one of the following: number, date, text, empty
}.

At the end give brief (1-2 sentences) annotation about entire table in the following json form:
"table_description" : description of this table
Make sure that this description useful for the user.

Example:
{
	"columns": [
		{
		"name": "column1",
		"description": "Number of cats",
		"type": "number"
		},
		{
		"name": "column2",
		"description": "Number of dogs",
		"type": "number"
		}
	],
	"table_description": "Table with information about cats and dogs"
}

The structure received from LLM contains the semantics (meaning) of the data and information about their location on the sheet.

Separately, the Context retriever implements a human feedback module: the ability of the LLM to ask the user a question to clarify the meaning or format of the data.

Planner

For the planner, we selected the following available actions (action space):
— Write a markdown table to specified cell;
— Write a formula or a value to one specified cell;
— Create a pivot table from existing data;
— Create a simple chart of one of the following types: line, column, area;
— Repeat a formula to a range of cells with automatic increment of cell address;
— Autofill blank range with a constant value;
— Autofill blank range with a data series with constant difference.

The Google Spreadsheet API allows you to do some other things (such as conditional formatting), but these functions are enough to perform basic Excel-like operations on data.

As part of our experiments, we tried different planner approaches and their combinations. The following options showed the best results:
— Chain-of-thoughts + Self-consistency;
— Tree-of-thoughts (greedy);
— Tree-of-thoughts (DFS).

We also plan to try Monte Carlo Tree Search (+UCB) in future iterations.

Reflector

The agent uses 3 critics for the planner:
— General Reflector: one iteration of a general request to improve the plan;
— Cell Reflector: checks that the cells chosen for writing data are placed correctly in the plan;
— Formula Reflector: checks that all formulas in the plan are available and their arguments are correct.

Executor

Executor takes as input a description of data semantics from Context retriever and an improved plan from Reflector.

All the actions described above from the action space of the scheduler, implemented through the Google Spreadsheet API, are available to it as functions.

Example function (tool):

from typing import Type

from pydantic import BaseModel, Field
from langchain.tools import BaseTool

class write_table_description(BaseModel):
	range_name: str = Field(description="ID of the cell in A1 notation.")
	markdown_table: str = Field(description="Markdown table that will be written starting at the cell.")

class write_table_tool(BaseTool, extra="allow"):
	name = "write_table"
	description = \
	"""
	Call this function with a cell ID and a markdown table to write this table into the specified cell.
	Markdown table must have headers. Pass it into argument as a string.
	Formulas should begin with "=".
	"""
	args_schema: Type[BaseModel] = write_table_description

	def __init__(self, gapi, **data):
		super().__init__(**data)
		self.gapi = gapi

	def _run(self, range_name, markdown_table):
		matrix = markdown_to_matrix(markdown_table)
		self.gapi.write_values(range_name, matrix)
		return f"The data from the table {markdown_table} has been written into the cell {range_name}"

This function writes a markdown table into a Google Spreadsheet cell via the API (the external "gapi" object).
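The `markdown_to_matrix` helper is not shown above; here is a plausible sketch of what such a function might look like (this is our illustration, not the agent's actual implementation): it parses the markdown table into a matrix of cell values for the Sheets API, dropping the `|---|` separator row.

```python
def markdown_to_matrix(markdown_table: str) -> list[list[str]]:
    """Parse a markdown table into a matrix of cell values.
    The header separator row (|---|---|) is skipped."""
    matrix = []
    for line in markdown_table.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if all(set(c) <= set("-: ") for c in cells):   # the |---| row
            continue
        matrix.append(cells)
    return matrix

table = "| a | b |\n|---|---|\n| 1 | 2 |"
matrix = markdown_to_matrix(table)   # -> [["a", "b"], ["1", "2"]]
```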

Final architecture

The final architecture is shown schematically below:

How Google Spreadsheets Agent Works

A notebook with the Google Spreadsheet Agent code is available in the GigaChain repository at the link.

Demo

Key Findings

  1. Modern LLMs do an excellent job of describing the structure and semantics of data (relational tables, pivot tables, etc.).

  2. Out of the box, GPT-4o and Gemini-1.5 understand well the methods of working with Excel-like tools (formulas, graphs, pivot tables, autocomplete).

  3. For complex zero-shot queries, the planner is unstable and produces different versions of the plan even at zero temperature, which makes results difficult to reproduce.

  4. Chain-of-thought and self-consistency improve the reproducibility and determinism of the planner, but do not improve the quality of its work.

  5. Tree-of-thought does not provide any noticeable advantages when building a plan, but it does increase the number of API calls.

  6. On average, specialized critics improve the quality of the planner's work.

  7. Using structured inference (JSON + controlled generation) significantly reduces the number of hallucinations in the LLM agent.

What's next

The quality of an LLM agent consists of many components. Of these, we can highlight those that we will focus on in the near future:

— Implementation of the Monte Carlo Tree Search algorithm for building a plan and for the work of the executor itself.
— Conducting experiments using models of different quality for different components of the agent (for example, a strong model for the planner and a weaker one for the critic).
— Adding the ability for the agent to read data while the executor is running to reduce the number of erroneously written formulas on the sheet.

Thank you for reading 🙂


The article was prepared by: Arsenin Nikita (@Lirol).

Editorial assistance: Valery Ternovsky (@walfry17), Sergey Markov (@oulenspiegel) and Dmitry Golovin (@GolovinDS).
