Building reliable systems from unreliable agents

Large language models can be used for a variety of practical purposes. One of the most interesting areas is autonomous AI agents. If you generate a large number of agents for a given request and force them to compete with each other, then theoretically you can obtain the optimal result for this problem. This can be used in information security and other areas of software development.

You can also build agents: software that independently evolves and improves itself based on user feedback.


For example, the developers at Rainforest QA have described a fairly clear process for building LLM-based products that cope with their inherent unreliability. Thanks to this way of using AI agents, these software systems develop reliability and stability on their own.

The process falls into five high-level steps, with the first four being sufficient to solve most problems:

  1. Write simple prompts to solve a problem
  2. Use that experience to build an evaluation system, so that prompt changes measurably improve performance
  3. Deploy your AI system with good observability; keep collecting examples and improving your evals
  4. Add Retrieval-Augmented Generation (RAG), so that responses take additionally retrieved relevant information into account
  5. Fine-tune the model on the data collected in the previous stages

A helpful extra step:

  • Involve auxiliary (additional) agents

This is all a highly iterative process: it is impossible to design and build an AI system in a single attempt, the developers write. You can't predict what will work or how users will use the tool, so feedback is essential: “You will build the first, inadequate version, use it, notice shortcomings, improve it, expand its use, and so on. And if you succeed in creating something really useful, the result will be even more iterations and improvements to the product.” In other words, it is an almost endless cycle of development and improvement.

The good news is that this technique does not require extensive experience in LLM programming and is accessible to almost anyone, although experience with ML is certainly invaluable these days.

The authors provide examples of code that they used to create their own automated testing system.

Simple prompts

We create a simple prompt and run it several times to evaluate the results:
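What such a run might look like, as a minimal sketch using the openai Python client (the model name and prompt here are placeholders, not the authors' original code):

from openai import OpenAI

# Ask the same question several times and print each answer.
# Assumes OPENAI_API_KEY is set; the model name is a placeholder.
client = OpenAI()
prompt = "Write a one-sentence H1 for a data analytics platform."

for i in range(3):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"Run {i + 1}: {response.choices[0].message.content}")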

Run it and you will see that the results are unreliable and differ from one another.

At this stage, you can integrate the AI component into your product so that requests go to the LLM through the API rather than the console. The Python library instructor is convenient for processing the responses.
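A minimal sketch of how instructor can coerce an LLM reply into a validated structure (the response model and model name here are illustrative assumptions, not from the article):

import instructor
from openai import OpenAI
from pydantic import BaseModel

# The desired shape of the answer; instructor validates the reply against it.
class TestResult(BaseModel):
    passed: bool
    reason: str

client = instructor.from_openai(OpenAI())

result = client.chat.completions.create(
    model="gpt-4o-mini",        # placeholder model
    response_model=TestResult,  # parsed and validated by instructor
    messages=[{"role": "user", "content": "Did the login test pass? Explain briefly."}],
)
print(result.passed, result.reason)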

An evaluation system for prompt changes

This is about iterative improvement of prompts based on measurable criteria, where the key words are “iterative” and “measurable”.

At this point, an evaluation loop is created and then sped up as much as possible so that you can iterate efficiently:
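A minimal sketch of such a loop (run_agent is a hypothetical stand-in for the pipeline under test; the cases and the exact-match scoring are illustrative):

# Run the agent over a small test set and report percent correct.
test_cases = [
    {"input": "Click the login button", "expected": "click:login"},
    {"input": "Type 'alice' into the username field", "expected": "type:username:alice"},
]

def run_agent(text: str) -> str:
    ...  # hypothetical: call your prompt / LLM pipeline here

def evaluate() -> float:
    correct = sum(run_agent(case["input"]) == case["expected"] for case in test_cases)
    return correct / len(test_cases)

print(f"Percent correct: {evaluate():.0%}")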

For more information about practical prompt engineering, see here, as well as this article about an evaluation system with fast iterations.

Once the evaluation cycle has been established, we can create a test set of example inputs and corresponding outputs that we want to receive from the agent.

When evaluating different change strategies, it is best to consistently use the same metric, such as percent correct.

Observability and feedback

At this point, we already have some kind of working alpha version of the product that can be deployed in production. This needs to be done as soon as possible to get real data and user feedback.

Without real feedback, development can stall; feedback is an absolutely necessary component for continually improving the system over time.

There are many options here, from well-known monitoring systems to open-source libraries like openllmetry and openinference, which use OpenTelemetry under the hood.
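For example, hooking openllmetry into a Python service takes only a couple of lines (a sketch based on the Traceloop SDK; the app name is arbitrary, and exact setup may differ by version):

from traceloop.sdk import Traceloop

# Auto-instruments popular LLM clients and exports traces via OpenTelemetry.
Traceloop.init(app_name="qa-agent")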

RAG system

When we have exhausted the possibilities for changing prompts and performance has reached a plateau, we can build into the pipeline a system that generates responses taking additionally retrieved relevant information into account (Retrieval-Augmented Generation, RAG). Roughly speaking, this is real-time prompt engineering: we dynamically add relevant content to the prompt before asking the agent to respond.

An example would be answers to questions about very recent events. It's easy to do a search and include a few of the most relevant news articles before asking the LLM for a response.

Another example is an agent that interacts with the application UI through simple instructions in English. Based on your own collected statistics, you can give the agent prompts like “It seems that most human testers performing similar tasks pressed the X button and then entered Y into the Z field.”

To retrieve data at this stage, you can use an external provider, your own solution, or a combination of the two (for example, OpenAI embeddings with the vectors stored in your own Postgres instance).
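A hedged sketch of that combination: OpenAI embeddings with the pgvector extension in Postgres. The connection string, table, and embedding model are illustrative assumptions; it presumes a documents table with an embedding vector column and the pgvector Python package installed.

import numpy as np
import psycopg
from pgvector.psycopg import register_vector
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    # text-embedding-3-small is an assumed model choice
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

conn = psycopg.connect("dbname=rag")  # illustrative connection
register_vector(conn)

# Retrieve the three closest documents and prepend them to the prompt.
query = "What changed in the checkout flow this week?"
rows = conn.execute(
    "SELECT content FROM documents ORDER BY embedding <-> %s LIMIT 3",
    (embed(query),),
).fetchall()
context = "\n".join(row[0] for row in rows)
prompt = f"Context:\n{context}\n\nQuestion: {query}"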

Of particular note is the RAGatouille library, which was created specifically for RAG, although getting it to work is not so easy. In the article mentioned above, the developers used BigQuery to get the data, OpenAI to create the embeddings, and Pinecone to store and search the vectors. They write that this is the easiest way to deploy such a system without building a lot of new infrastructure: Pinecone makes it very easy to store and search embeddings, along with associated metadata, for augmenting prompts.

Model tuning

This is a controversial step that you can (and sometimes should) do without, because careless fine-tuning can actually degrade the quality of the model.

At the same time, there are unresolved issues: OpenAI allows fine-tuning only of older models, and Anthropic, it seems, promises to allow tuning in the near future, with many reservations.
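On the OpenAI side, starting a fine-tuning job on the collected examples looks roughly like this (a sketch; the file name and base model are illustrative):

from openai import OpenAI

client = OpenAI()

# Training data: a JSONL file of {"messages": [...]} chat examples
# collected during the earlier steps.
f = client.files.create(file=open("collected_examples.jsonl", "rb"),
                        purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=f.id,
                                     model="gpt-3.5-turbo")
print(job.id, job.status)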

Examples

You can find concrete examples of this approach. For example, a simple script called overkiLLM obtains a reliable result (output) from a mass of unreliable inputs. No agents are used here; the author was simply interested in trying the approach.

The task he chose was writing an H1 heading for a web page, but a similar approach would work for any short block of text. The script generates a ton of variations and then pits them against each other in head-to-head votes to pick the better of each pair. Everything runs locally and for free on Ollama, which conveniently runs various LLMs on a local server.

The script works according to a simple algorithm:

  1. Input: choose the authors (for example, Stephen King, etc.).
  2. Topic: choose a topic or phrase.
  3. Generation: the script creates variations in the style of each author.
  4. Evaluation: the variations compete against each other, weeding out weak entries.
  5. Ranking: the variations are ranked by their number of wins and losses.

Script code

OLLAMA_MODEL = 'knoopx/hermes-2-pro-mistral:7b-q8_0'
NUM_OF_VARIATIONS = 15
NUM_OF_MATCHES = 1000

import os
import random
import ollama
import uuid
import json

import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', 2000)

authors = [
{
    "name": "Seth Godin",
    "description": "An author, entrepreneur, and marketer, Godin writes about marketing, the spread of ideas, and managing both customers and employees with respect."
},
{
    "name": "Paul Graham",
    "description": ""
},
{
    "name": "James Clear",
    "description": "Author of \"Atomic Habits,\" Clear writes about habits, decision-making, and continuous improvement."
},
{
    "name": "Derek Sivers",
    "description": "An entrepreneur, author, and musician, Sivers writes about creativity, life philosophy, and the lessons he's learned from founding and selling CD Baby."
},
{
    "name": "David Ogilvy",
    "description": "Often referred to as the Father of Advertising, Ogilvy was known for his emphasis on research and consumer insights. His work for brands like Rolls-Royce and Hathaway shirts has become legendary."
},
{
    "name": "Stephen King",
    "description": "A prolific author of horror, suspense, and fantasy novels, King has written over 60 books and is known for his detailed character development and storytelling."
},
]

def parse_numbered_responses(input_str, tone):
    if not input_str.strip():
        raise ValueError("Input string is empty.")
    
    lines = input_str.strip().split('\n')
    parsed_responses = []

    for line in lines:
        try:
            # Attempt to split each line at the first period followed by a space.
            number, text = line.split('. ', 1)
            number = int(number.strip())  # Convert number to integer.
            text = text.strip()  # Trim whitespace from text.
            generated_uuid = uuid.uuid4()
            parsed_responses.append({'number': number, 'text': text, 'tone': tone, 'uuid': generated_uuid})
        except ValueError:
            # Skip lines that do not conform to the expected format.
            continue

    if not parsed_responses:
        raise ValueError("No valid numbered responses found in the input.")

    return parsed_responses

def update_item_by_number(items, uuid, k):
    """Updates the text of an item in a list of dictionaries, identified by its number."""
    for item in items:
        if item['uuid'] == uuid:
            if not item.get(k):
                item[k] = 0
            item[k] = item[k] + 1
            return True
    return False

def get_one_tone(h1, context, tone, n):
    system_message="You are a helpful copywriting assistant. Only reply with a numbered list of variations of the text provided."
    user_prompt = f'''Context: {context}.\nPlease generate {n} variations of the following text written in the voice of {tone}.''' 
    user_prompt += f'''Do not mention {tone} in the text:\n{h1}'''
    response = ollama.chat(model=OLLAMA_MODEL, 
        messages=[
            {
                'role': 'system',
                'content': system_message,
            },
            {
                'role': 'user',
                'content': user_prompt,
            },
        ],
        options = {
            'temperature': 1.5
        }
        )
    parsed_responses = parse_numbered_responses(response['message']['content'], tone)
    return parsed_responses

all_variations = []
n = NUM_OF_VARIATIONS
context = "I'm using this as the H1 for my website. Please write variations that are unique and engaging."
h1 = "Definite combines ETL, a data warehouse and BI in one modern platform."
for author in authors:
    print(f"Generating variations for {author['name']}...")
    tone = author['name']
    parsed_responses = get_one_tone(h1, context, tone, n)
    all_variations.extend(parsed_responses)

df = pd.DataFrame(all_variations)
print('Number of variations: ', len(df))

i = 0
while i < NUM_OF_MATCHES:
    print('i:', i)
    selected_items = random.sample(all_variations, 2)
    system_message="You are a helpful copywriting assistant. Only reply with "AAA" or "BBB". Do not include any other text or explaination."
    user_prompt = f'''Please tell me which copy is more unique and engaging. Please reply in JSON format with the key "answer" and the value of your response. The only valid options for "answer" are "AAA" or "BBB". Do not include any other text or explanation.\n
    AAA: {selected_items[0]['text']}\n\n
    BBB: {selected_items[1]['text']}
    '''
    response = ollama.chat(model=OLLAMA_MODEL, 
        messages=[
            {
                'role': 'system',
                'content': system_message,
            },
            {
                'role': 'user',
                'content': user_prompt,
            },
        ],
        format="json",
        options = {
            'temperature': 0.0
        }
        )
    try:
        j = json.loads(response['message']['content'])
        if j['answer'] == 'AAA':
            update_item_by_number(all_variations, selected_items[0]['uuid'], 'wins')
            update_item_by_number(all_variations, selected_items[1]['uuid'], 'losses')
        elif j['answer'] == 'BBB':
            update_item_by_number(all_variations, selected_items[1]['uuid'], 'wins')
            update_item_by_number(all_variations, selected_items[0]['uuid'], 'losses')
        else:
            print('Invalid response:', j)
    except (json.JSONDecodeError, KeyError):
        # The model returned something other than the expected JSON verdict.
        print('Invalid response:', response)
    i += 1

df = pd.DataFrame(all_variations)
df['wins'] = df['wins'].fillna(0).astype(int)
df['losses'] = df['losses'].fillna(0).astype(int)
df['total'] = df['wins'] + df['losses']
df['win_rate'] = df['wins'] / df['total']
df = df.sort_values('win_rate', ascending=False)

winner = df.iloc[0]
print('Author win rates: ', df.groupby('tone').win_rate.mean())
print('Top 20: ', df.head(20))
print('Winner: ', winner.text)

See the examples with the results of its runs.


Thus, modern machine learning methods can be used in various areas of IT, including information security, to build reliable and secure systems from unreliable components.

Moreover, these methods make it possible to create software that can make decisions independently in complex situations (chatbots, agents, simulations, etc.). Using tools like Burr, such software can run on your own server. Perhaps in the future such agents will be used for information security tasks.
