How we found a vulnerability in SQLite using an LLM
Introduction
In our previous post, Project Naptime: Evaluating Offensive Security Capabilities of Large Language Models, we described a framework for LLM-assisted vulnerability research and demonstrated its potential by improving performance on Meta's CyberSecEval2 benchmarks. Naptime has since evolved into Big Sleep, a joint project between Google Project Zero and Google DeepMind.
Today we are happy to share the first real-world vulnerability discovered by the Big Sleep agent: an exploitable stack buffer underflow in SQLite, a widely used open-source database engine. We found the vulnerability and reported it to the developers in early October, and they fixed it the same day. Fortunately, we found the issue before it appeared in an official release, so SQLite users were not affected.
We believe this is the first public example of an AI agent discovering a previously unknown memory-safety vulnerability in widely used real-world software. Earlier this year, at the DARPA AIxCC event, Team Atlanta discovered a null-pointer dereference in SQLite, which inspired us to use it for our testing to see if we could find a more serious vulnerability.
We believe this work has enormous defensive potential. Finding vulnerabilities in software before it is released means attackers never get a chance to exploit them: the bugs are fixed before attackers even see them. Fuzzing has helped significantly, but we need an approach that lets defenders find the bugs that are difficult (or impossible) to find by fuzzing, and we hope AI can close this gap. We believe this is a promising path toward revolutionizing cybersecurity and giving defenders an asymmetric advantage.
The vulnerability itself is quite interesting, and the fact that SQLite's existing testing infrastructure (both through OSS-Fuzz and the project's own infrastructure) did not find the issue prompted us to investigate further.
Methodology
A key motivation for Naptime, and now for Big Sleep, has been the continued in-the-wild discovery of exploits for variants of previously found and patched vulnerabilities. As this trend continues, it is clear that fuzzing is not catching such variants, and that for attackers, manual variant analysis is a cost-effective approach.
We also believed that this variant-analysis task is a better fit for current LLMs than the more open-ended problem of general vulnerability research. By giving the model a starting point, such as the details of a previously fixed vulnerability, we remove a lot of the ambiguity from vulnerability research and start from a concrete, well-founded theory: "This was the previous bug; there is probably a similar one somewhere else."
Our project is still at the research stage, and we currently evaluate progress using small programs with known vulnerabilities. We recently decided to put our models to the test by conducting our first extensive experiment in variant analysis on SQLite. We collected a set of recent commits to the SQLite repository and manually filtered out trivial and documentation-only changes. We then configured the prompt to provide the agent with both the commit message and the diff of the change, and asked the agent to review the current repository (at HEAD) for related issues that might not have been fixed yet.
The vulnerability
This vulnerability is interesting because a special sentinel value, -1, is used in the iColumn field of the constraint structure:
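For reference, here is the constraint descriptor in question, lightly paraphrased from SQLite's public header sqlite3.h (there it is declared nested inside sqlite3_index_info); the header comment documents that iColumn is -1 for ROWID constraints:

struct sqlite3_index_constraint {
  int iColumn;              /* Column constrained.  -1 for ROWID */
  unsigned char op;         /* Constraint operator */
  unsigned char usable;     /* True if this constraint is usable */
  int iTermOffset;          /* Used internally - xBestIndex should ignore */
};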
This pattern creates a potential edge case that must be handled by all code that uses this field because a valid column index is expected to be non-negative.
The function seriesBestIndex does not handle this edge case correctly, resulting in a write into a stack buffer with a negative index when handling a query with a constraint on the rowid column. The build we provided to the agent had debug assertions enabled, and this condition is checked by the assert on line 706. However, in a release build this assert is absent, and in our testing (this depends on the compiler and optimization level) the subsequent write on line 718 writes below the aIdx buffer, corrupting the least significant 32 bits of the pConstraint pointer, which is dereferenced in the next iteration of the loop, leading to an exploitable condition.
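To make the failure mode concrete, here is a minimal, self-contained C sketch of the flawed pattern. It only mirrors the structure of seriesBestIndex and is not the actual SQLite source: the struct, function names, and the assumption that SERIES_COLUMN_START is 1 are illustrative stand-ins.

#include <assert.h>
#include <stdio.h>

#define SERIES_COLUMN_START 1     /* assumed index of the first parameter column */

struct constraint { int iColumn; };  /* stand-in for sqlite3_index_constraint */

/* Mirrors the indexing pattern in seriesBestIndex. */
static void best_index(const struct constraint *aConstraint, int nConstraint){
  int aIdx[3] = {-1, -1, -1};        /* stack buffer indexed by iCol */
  for(int i = 0; i < nConstraint; i++){
    /* A ROWID constraint has iColumn == -1, so iCol becomes -2 here. */
    int iCol = aConstraint[i].iColumn - SERIES_COLUMN_START;
    assert(iCol >= 0 && iCol <= 2);  /* caught only in debug builds */
    aIdx[iCol] = i;                  /* in release builds this writes below aIdx */
  }
  printf("aIdx = {%d, %d, %d}\n", aIdx[0], aIdx[1], aIdx[2]);
}

int main(void){
  struct constraint rowid_constraint = { -1 };  /* e.g. WHERE ROWID = 1 */
  best_index(&rowid_constraint, 1);
  return 0;
}

Compiled with assertions enabled, this aborts at the assert; built with -DNDEBUG, the negative index silently corrupts whatever sits below aIdx on the stack, which in the real seriesBestIndex is where the pConstraint pointer happens to live.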
However, even given the explanation of the vulnerability, it was not trivial for human researchers (at least for us) to work out how to trigger it – a constraint on the ROWID column is obviously a good starting point, but a more careful reading of the code would definitely be required. The agent already seems to know more about SQLite than we do, so it can take shortcuts!
For example, the model immediately uses the generate_series virtual table when generating test cases. (We have also seen runs where the model first explores the available virtual tables, but it clearly draws on existing knowledge here.)
Highlights of the investigation
In this successful run with Gemini 1.5 Pro, the seed commit was [1976c3f7] – a fairly large and non-obvious change. The bug found by our agent is only loosely related to the change in that commit; this often happens in manual variant analysis as well – understanding one bug in a codebase frequently leads a researcher to other problems.
Below are excerpts from the trajectory, with our commentary in italics; all text in the ASSISTANT blocks was generated directly by the agent.
The agent made a reasonable assessment of the changes in the diff and used it as a starting point for its investigation:
ASSISTANT
The CL improves the handling of OR terms shared across WHERE clauses in virtual table queries by adding the function termFromWhereClause. This function retrieves a specific term from a WhereClause, taking nested clauses into account. Before this change, shared terms were likely handled incorrectly due to incorrect term index mappings, which could potentially lead to inefficient query plans or incorrect results. The most likely vulnerability involves incorrect handling of WHERE clause constraints in the functions allocateIndexInfo and whereLoopAddVirtualOne, where the CL changed how constraint offsets are used with the new helper function. I will first examine the function allocateIndexInfo.
. . .
After many steps (skipped here) of examining the relevant code, the agent formed a more specific hypothesis and tried to test it by running a modified version of the test case from the patch, but this did not work because the test case included in the patch uses an extension that is not available in our configuration:
TOOL(debugger_run)
The program exited without crashing or hitting the breakpoint. Exit code: 1.
The output on STDERR was:
Runtime error near line 2: no such module: tcl
Parse error near line 3: no such table: x1
ASSISTANT
The program terminated because the TCL module is not available. The tests rely on the TCL virtual table module, which must be enabled when compiling SQLite. The program input provided is based on the bestindexC.test file, which uses the TCL module to test best-index selection for virtual tables. To fix this, the program input needs to be changed to use a different virtual table that does not require the TCL module. However, the vulnerability concerns virtual table query planning, so we still need a virtual table to trigger the problem.
I will use the code_browser_source tool to find examples of built-in virtual tables in SQLite. I will then adapt the program input to use one of them, and set the breakpoint on whereLoopAddVirtualOne again.
. . .
The agent adapted well to this failure, figured out the steps it needed to take to fix the test case, and after a few steps had all the information needed to build a test case that reached the whereLoopAddVirtualOne function:
ASSISTANT
The function sqlite3_create_module registers virtual table modules. The shell.c file contains several examples of this, including the modules generate_series, fsdir, completion, and expert. The json table also looks interesting. I will focus on generate_series as a starting point. I will try to construct a query that uses this module and could cause incorrect constraint handling.
. . .
It is worth noting that during the initial discovery and reproduction of the vulnerability, the agent was quite vague about the "incorrect constraint handling" it was trying to trigger; in this respect the full trajectory reads a bit like a human researcher's workflow – we start with one idea, consider several others along the way, and in the end produce a test case that reproduces a slightly different bug. In this case, when whereLoopAddVirtualOne is called on the generate_series table, we end up in seriesBestIndex, where the constraint is indeed handled incorrectly.
However, it is particularly interesting that once the program crashed, the model did a much better job of explaining the cause: while producing a root-cause analysis and a full crash report, the agent clearly crafts a test case for the specific edge case it has identified:
ASSISTANT
The field we care about is iColumn inside the sqlite3_index_constraint structure. This field represents the column number of the constraint. To make the assert fail, we need a constraint on a column with an index greater than 3 or less than 1.
Let's construct an SQL query that places constraints on such columns. The generate_series table has columns named value, start, step, and stop, with the following indexes:
– value: 0
– start: 1
– step: 2
– stop: 3
To trigger the bug, we can use a constraint on ROWID. Constraints on ROWID use iColumn = -1.
Here is an example query:
SELECT * FROM generate_series(1,10,1) WHERE ROWID = 1;
This query should cause a crash in the seriesBestIndex function.
. . .
In the end, we received a clear and accurate description of the problem, which is practically a ready-made bug report!
TOOL(debugger_run)
Program crash: SIGABRT at :0.
STDERR output:
sqlite3: src/shell.c:6816: int seriesBestIndex(sqlite3_vtab *, sqlite3_index_info *): Assertion `iCol>=0 && iCol<=2' failed.
ASSISTANT(report_success)
The program crashes on the input SELECT * FROM generate_series(1,10,1) WHERE ROWID = 1; due to an assertion failure in the function seriesBestIndex.
The assertion assert(iCol>=0 && iCol<=2) fails because, when a constraint applies to ROWID, the iColumn field in the sqlite3_index_info structure can contain values outside the valid range for column indexes, such as -1.
This makes the computed iCol equal to -2, which causes the assertion to fail.
What about fuzzing?
Given how simple the reproduction case appears, an obvious question arises: why was this bug not discovered earlier by traditional fuzzing? The "simple" answer lies in the configuration of the fuzzing harnesses. The harness used by OSS-Fuzz was built without the generate_series extension enabled, and the alternative fuzzingshell.c harness contained an older version of the seriesBestIndex function that was not affected by the bug. Although the SQLite AFL repository contains a configuration for fuzzing the same CLI binary that we gave to the Big Sleep agent, it does not appear to be widely used.
To understand how "shallow" the bug really is, we tried to rediscover it by fuzzing. We followed the fuzzing instructions in the SQLite documentation and used the CLI harness as the target. Before starting the AFL run, we also made sure that the fuzzing corpus contained the generate_series and rowid keywords. However, the issue was not found after 150 CPU-hours of fuzzing.
We then tried to simplify the fuzzer's task, for example by adding the necessary keywords to AFL's SQL dictionary. However, it appears that the bug can only be found quickly if the corpus contains an example very close to the crashing input, because code coverage is not a reliable guide for this particular issue.
Admittedly, AFL is not the best-suited tool for text-based formats like SQL, where most inputs are syntactically invalid and will be rejected by the parser. Nevertheless, it is interesting to compare this result with Michal Zalewski's post on fuzzing SQLite, written in 2015. At that time, AFL was remarkably effective at finding bugs in SQLite; it seems that after years of fuzzing, the tool has reached a natural saturation point. Although our results so far may seem modest compared to the dramatic step change in effectiveness that came with the release of AFL, it is interesting to see that this approach has its own strengths and is likely to be effective at finding a distinct set of vulnerabilities.
Conclusion
For our team, this is a moment of validation and success: finding a vulnerability in a widely used and well-fuzzed open-source project is an exciting result! When given the right tools, current LLMs can perform vulnerability research.
However, we want to emphasize that these are still highly experimental results. The Big Sleep team's position is that, at present, a target-specific fuzzer would likely be at least as effective at finding vulnerabilities.
We hope that in the future this effort will provide a significant advantage to defenders: with the potential not only to find crashing test cases but also to provide high-quality root-cause analysis, triaging and fixing issues could become much cheaper and more effective. We aim to continue sharing our research in this space, keeping the gap between the public and private state of the art as small as possible.
The Big Sleep team will continue to work in this space, advancing Project Zero's mission of making 0-day hard.
Team Big Sleep
This is no longer the work of Project Zero alone; below is everyone who contributed to this project (names listed in alphabetical order):
Miltos Allamanis, Martin Arjovsky, Charles Blundell, Lars Buesing, Mark Brand, Sergei Glazunov, Dominik Maier, Petros Maniatis, Guilherme Marinho, Henryk Michalewski, Koushik Sen, Charles Sutton, Vaibhav Tulsyan, Marco Vanotti, Theophane Weber, Dan Zheng