How LLMs Were (Un)Successfully Transformed into 2D Virtual Employees

The idea of turning language models into virtual employees that can perform tasks on web pages and in applications has fascinated many startups and corporations. Instead of clicking buttons yourself or integrating with legacy systems like SAP and 1C, why not delegate it to AI? Just spin up a virtual employee, sit it at a virtual desktop, throw tasks into Slack, and keep half an eye on it. In practice, however, things turned out to be not so simple.

One of the first to pursue this idea was adept.ai – a startup from the authors of the legendary paper Attention Is All You Need (the creators of the transformer architecture). Back in the pre-ChatGPT era, they showed a somewhat odd demo in which their bot picked real estate listings on a website based on given parameters, raised a huge amount of investment, and then went quiet (example of work in video).

This idea got a new lease on life about eight months ago, when Reworked AI introduced llama-2d. They taught language models not just to "read" text but to genuinely perceive the structure and meaning of two-dimensional documents such as web pages. Before I explain how it works, here are a couple of less successful approaches that their competitors have tried.

Why not vision models?

A year ago, vision models weren't that good – for example, the article about LLaVA (the first open-source vision model of acceptable quality) was released in December 2023 – and image processing in LLMs has really come into its own only now, with the release of Qwen-VL. In addition, vision models are still very expensive to train and run, and their quality is so-so (especially on small UI elements) – I think almost everyone has read the paper Vision language models are blind.

Why not HTML-based approaches?

You can feed the page's HTML to an LLM and ask it to generate the actions to perform (for example, as Python code for Selenium). However, on most modern sites the HTML and JS are very complex and obfuscated, so it is very hard for an LLM to make sense of this mess.
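
To make the HTML-based idea concrete, here is a minimal sketch of what such an agent loop could look like. All names here are hypothetical, and `fake_llm_response` stands in for a real model call – the point is only the shape of the pipeline: truncated HTML goes in, a list of browser actions comes out.

```python
import re

def build_prompt(html: str, task: str, max_chars: int = 4000) -> str:
    """Truncate the page HTML and ask the model for browser actions."""
    return (
        f"Task: {task}\n"
        f"Page HTML (truncated):\n{html[:max_chars]}\n"
        "Reply with one action per line, e.g.:\n"
        "click(css=...)\n"
        "type(css=..., text=...)\n"
    )

ACTION_RE = re.compile(r"(?P<op>click|type)\((?P<args>[^)]*)\)")

def parse_actions(reply: str) -> list[tuple[str, str]]:
    """Turn the model's reply into (op, args) pairs a Selenium driver could execute."""
    return [(m.group("op"), m.group("args")) for m in ACTION_RE.finditer(reply)]

# Hypothetical model reply for demonstration – a real agent would call an LLM here.
fake_llm_response = "click(css=#login)\ntype(css=#user, text=alice)"
print(parse_actions(fake_llm_response))
```

Even in this toy form you can see the weak spot: everything depends on the model correctly interpreting raw, often obfuscated markup.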

In addition, HTML is very voluminous: the main page of the super-minimalistic dating app Pure alone takes up 200k tokens of model context. No model yet can confidently work with that much information.
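
A rough back-of-the-envelope check shows how quickly markup eats context. The ~4 characters per token ratio below is a common heuristic for English-like text, not an exact tokenizer, and the page is a synthetic toy, but the order of magnitude is telling:

```python
# A toy page: 2,000 list items with typical class-heavy markup.
html = "<html><body>" + "".join(
    f'<div class="card card--feed js-item" data-id="{i}"><span>hi</span></div>'
    for i in range(2000)
) + "</body></html>"

# Rough heuristic: ~4 characters per token for English-like text.
approx_tokens = len(html) // 4

print(len(html), approx_tokens)
```

Even this trivial page lands in the tens of thousands of tokens, so a real, obfuscated production page easily blows past typical context windows.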

An example of an HTML-based LLM agent can be found here.

The llama-2d approach

They represent any web page as a canvas with blocks of text (labels) in different parts of the page. For each text block, the usual positional embedding is computed, plus a 2D embedding (along the Ox and Oy axes) of the block itself.
This way the LLM not only perceives each individual element of the page separately, but also gets a good sense of the page's overall structure.
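
A minimal sketch of that idea – my own simplified reconstruction, not llama-2d's actual code: each block gets a standard sinusoidal embedding of its sequence position, plus sinusoidal embeddings of its normalized x and y coordinates, summed into one vector.

```python
import math

def sinusoidal(pos: float, dim: int, scale: float = 10000.0) -> list[float]:
    """Standard sinusoidal embedding of a scalar position."""
    emb = []
    for i in range(dim // 2):
        freq = pos / scale ** (2 * i / dim)
        emb += [math.sin(freq), math.cos(freq)]
    return emb

def block_embedding(seq_pos: int, x: float, y: float, dim: int = 64) -> list[float]:
    """1D sequence position + 2D (Ox, Oy) location of the text block, summed."""
    pos_emb = sinusoidal(seq_pos, dim)
    x_emb = sinusoidal(x, dim)
    y_emb = sinusoidal(y, dim)
    return [p + a + b for p, a, b in zip(pos_emb, x_emb, y_emb)]

vec = block_embedding(seq_pos=3, x=0.5, y=0.0)
print(len(vec))  # 64
```

Summing (rather than concatenating) the three embeddings keeps the hidden size unchanged, which is why this kind of scheme can be bolted onto an existing language model.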

To let the model interact with the page, a special tag is placed on every "clickable" HTML element, just as Vimium does. The model learns to predict the tokens of the next action (press $12, enter some text, etc.).

Does it work?

Overall, the model converges! Given such an exotic training approach, that alone can already be called a resounding success.

At first glance, it even generates something on topic and copes with the simplest scenarios.

But unfortunately, this is a rather narrow niche with no good benchmarks, so it is hard to compare this approach objectively with others, or to tell how many scenarios the model can actually cover.

Wondering what else to read/watch/try?

  1. First of all, a thread from the author of this model on Twitter – 1

  2. Secondly, try running their solution and poking at it on your own tasks – it has a pretty nice API – 2

from llama2d.vision import Llama2dScreen

# Build a screen and place words at normalized (x, y) canvas coordinates:
# (0, 0) is the top-left corner, (1, 1) the bottom-right.
screen = Llama2dScreen()
screen.push_word(word="north", xy=(0.5, 0))
screen.push_word(word="south", xy=(0.5, 1))
screen.push_word(word="east", xy=(1, 0.5))
screen.push_word(word="west", xy=(0, 0.5))
  3. Thirdly, take a look at my Telegram channel – there are some very underrated materials there: a nostalgic post about the development of LLMs over the last two years, as well as one about sycophancy in models from Anthropic – mlphys
