Part 1. GPU-Based Fuzzing. What kind of animal is this?

Approximate fuzzing scheme
Source: 3. GNATfuzz User's Guide – GNATDAS Manuals [25.0w (20240326)] (adacore.com)

But before we start discussing GPU fuzzing, we need to understand what fuzzing is. Within the framework of this article, we will only consider white-box fuzzing (when we have the source code); trying to cover every fuzzing technique would take a year's worth of articles.

So, fuzzing is a dynamic testing method that feeds a program a large amount of input data and analyzes the program's behavior on that data in order to find possible errors in the software.

The errors that can be found look something like this:

Mapping bugs onto CWE.
Source: What Bugs Can You Find With Fuzzing? (code-intelligence.com)

  • Logical errors – division by zero, race conditions, etc.

  • Insufficient checks – infinite recursion or loops, out-of-bounds array access, etc.

  • Problems when working with resources

    • Segmentation errors

    • Undefined behavior

    • Buffer/heap overflow

    • Over-allocating memory

Fuzzer components

In the fuzzing process, the main tool is obviously the fuzzer. Essentially, this is a utility that takes control over the program under test and monitors the events occurring inside the software process after passing some data to it. This raises two questions:

  1. What data? Where does it come from?

  2. How do we control the program's actions?

To answer them, we need to introduce two entities inside the fuzzer: the mutator and the “instrumentator”.

Mutator

With the first, based on the name, everything is clear. This is a fuzzer interface that either receives a primary set of input data (that is, a directory with initial data is passed to it) or generates it itself (for example, using Google Protocol Buffers). If the fuzzer is given some initial input data, it goes through an initial verification phase, which checks that the data leads to new program behavior but does not cause a crash. A set of data that has passed this test is called the seeds.

Based on these seeds, the mutator algorithm is as follows:

  1. Analysis of primary data. Output: seeds

  2. Transferring seeds to a queue. In fact, the queue for most fuzzers is a separate directory

  3. For each element from the queue:

    1. We perform a series of operations on the data (changing a byte/bit, removing a sequence of bytes from the seed, shuffling bytes or byte sequences, substituting boundary values of data types, etc.)

    2. Running data from a queue element through the program under test.

    3. Receiving feedback on the program's behavior from the “instrumentator”.

    4. If the data leads to new behavior, we add it to the queue and to the seeds.

If the data leads to an error, then the mutator is not particularly interested in this, and in this case the fuzzer will save the data leading to the error in some separate place.
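The mutator loop above can be sketched in Python. This is only a conceptual sketch: `run_target` and `get_coverage` are hypothetical placeholders for the target launcher and the instrumentator's feedback, not any real fuzzer's API.

```python
import random

def mutate(data: bytes) -> bytes:
    """Apply one random mutation: flip a bit, drop a byte, or shuffle bytes."""
    buf = bytearray(data) or bytearray(b"\x00")
    op = random.choice(["bitflip", "drop", "shuffle"])
    if op == "bitflip":
        i = random.randrange(len(buf))
        buf[i] ^= 1 << random.randrange(8)
    elif op == "drop" and len(buf) > 1:
        del buf[random.randrange(len(buf))]
    else:
        random.shuffle(buf)
    return bytes(buf)

def fuzz_loop(seeds, run_target, get_coverage, iterations=1000):
    """Minimal mutator loop: mutate queue items, keep inputs that hit new coverage."""
    queue = list(seeds)     # in real fuzzers the queue is a separate directory
    seen = set()            # coverage edges observed so far
    crashes = []            # error-inducing inputs are set aside separately
    for _ in range(iterations):
        candidate = mutate(random.choice(queue))
        if run_target(candidate):          # True means the target crashed
            crashes.append(candidate)
            continue
        cov = get_coverage()
        if not cov <= seen:                # new behavior -> promote to the queue
            seen |= cov
            queue.append(candidate)
    return queue, crashes
```

The key property matches the algorithm above: crashing inputs never enter the queue, and only inputs that reach new coverage become new seeds.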

Instrumentator

I couldn't find the right word to describe it. I thought about calling it the “Analyzer”, but the word “analysis” has already appeared quite a lot above, and the name does not change the essence anyway.

The instrumentator is a set of functions and modifications to the code of the program under test that make it possible to track which branch program execution followed at runtime, and with what data. All of this is needed so the mutator can change the data more precisely. The definition raises the question of code modification. To answer it, you can look at an example of a function graph after building the source code with the fuzzer's instrumentation tools:

Code modification when assembling for fuzzing

Under the hood, most fuzzers use their own markup to determine the current branch. The markup is a matrix whose cells each correspond to a particular branch of the program. If the program takes a new path, the cell value changes according to the fuzzer's internal rules (figuratively speaking, an XOR of the current cell's value with the previous one). A similar matrix is also stored as a file in the fuzzer's output directory. A number of other inserts used by the fuzzer can be seen in the screenshot below:
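The markup described above resembles AFL-style edge coverage. Here is a figurative Python sketch; the map size and hashing rule are illustrative, not any particular fuzzer's exact scheme:

```python
MAP_SIZE = 1 << 16  # a typical coverage map size; the real value is fuzzer-specific

class CoverageMap:
    """Toy edge-coverage map: each cell counts hits of a (prev, cur) branch pair."""
    def __init__(self):
        self.cells = [0] * MAP_SIZE
        self.prev_loc = 0

    def on_branch(self, cur_loc: int) -> None:
        # XOR the current block id with the (shifted) previous one, so that
        # the edge A->B and the edge B->A land in different cells.
        edge = (cur_loc ^ self.prev_loc) % MAP_SIZE
        self.cells[edge] = (self.cells[edge] + 1) & 0xFF  # saturate like a byte counter
        self.prev_loc = (cur_loc >> 1) % MAP_SIZE

    def new_edges(self, seen: set) -> set:
        """Edges hit in this run that were never observed before."""
        hit = {i for i, c in enumerate(self.cells) if c}
        return hit - seen
```

Running the same branch sequence a second time produces no new edges, which is exactly the signal the mutator uses to decide whether an input is interesting.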

Other variables for fuzzer operation

A few bad words about CUDA memory and threads

Before starting to consider GPU fuzzing, it is also worth looking at the memory structure on the cards. Why? The fact is that on video cards, execution speed depends heavily on the number of memory accesses, much more so than on a CPU. Moreover, understanding how threads are organized will let us analyze how fuzzing can be parallelized on the GPU.

Threads, warps, blocks

As written in most articles and in the official CUDA documentation, the main difference between CPU and GPU cores is that the latter are simply simpler in structure (screenshot below). It is also worth mentioning that when working with CUDA we deal with an architecture where one instruction is executed by several threads simultaneously (SIMT).

Comparison of CPU and GPU cores
Source: CUDA C++ Programming Guide (nvidia.com)

On the physical side, the video card combines these cores (scalar processors) into streaming multiprocessors, which will perform parallel calculations. The threads I mentioned above are combined into a more abstract structure – warps, which are combined into blocks, which are combined into grids.

The principle of executing threads on a video card
Source: CUDA Programming: THREAD AND BLOCK HEURISTICS in CUDA Programming (cuda-programming.blogspot.com)

A multiprocessor receives a queue of blocks whose threads it executes. Here it is worth clarifying why I called warps an abstract structure: warps exist only on the multiprocessor, for simultaneous execution, and it is the multiprocessor itself that groups threads into warps. Before that, all threads are simply stored together in a block. So a multiprocessor executes exactly one block, step by step (one warp at a time).
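To make the thread/warp/block hierarchy concrete, here is a small Python model of how a CUDA thread's global index and warp are derived. It assumes 1-D indexing and the fixed warp size of 32; the function name is illustrative.

```python
WARP_SIZE = 32  # fixed on all current NVIDIA GPUs

def thread_coords(block_idx: int, thread_idx: int, block_dim: int):
    """Mimic CUDA 1-D indexing: global id plus the warp a thread belongs to."""
    global_id = block_idx * block_dim + thread_idx  # blockIdx.x*blockDim.x + threadIdx.x
    warp_in_block = thread_idx // WARP_SIZE         # warps are formed inside a block
    lane = thread_idx % WARP_SIZE                   # position within the warp
    return global_id, warp_in_block, lane
```

For example, thread 40 of block 2, with 128 threads per block, has global id 2*128+40 = 296 and sits at lane 8 of warp 1 within its block.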

Types of memory in CUDA

There are 6 possible types of memory in Nvidia video cards (you can see them in the screenshot above). They differ in the speed of access to each of them, as well as the levels of read/write access.

  1. Register memory. The fastest of all. Used to store thread data during execution. Each thread is allocated a certain number of registers, which can be calculated from the number of threads in a block and the number of blocks in the grid. All registers are 32-bit; CUDA does not provide an API for working with this type of memory.

  2. Local memory (or the thread's memory section in global memory). The slowest type of memory that allows you to store local function data if it does not fit in register memory (for example, arrays). In fact, it is part of global memory and is no different from it.

  3. Global memory. Memory available to all threads. Designed to store large amounts of data. Slowest memory.

  4. Shared memory. Fast memory type. Used to store data used by multiple threads within a single block.

  5. Constant memory. A fast memory type. Used to store immutable data transferred from the host. Each thread can only read from this memory.

  6. Texture memory. It is obviously intended for working with textures, which is why it has a number of addressing features.
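The register budget from point 1 can be estimated with simple arithmetic. The numbers below (65,536 32-bit registers per multiprocessor, 4 resident blocks) are typical for recent architectures but vary by GPU, so treat this as an illustration rather than a hardware guarantee:

```python
def regs_per_thread(regs_per_sm: int, threads_per_block: int, blocks_per_sm: int) -> int:
    """Upper bound on 32-bit registers available to each thread on one SM."""
    resident_threads = threads_per_block * blocks_per_sm
    return regs_per_sm // resident_threads

# e.g. 65536 registers per SM, 256 threads/block, 4 resident blocks
# -> at most 64 registers per thread before spilling to local memory
```

Exceeding this budget forces the compiler to spill values into local memory, which, as noted in point 2, is really a slow slice of global memory.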

Let's start the crossover. Pitfalls

Now that we've covered the theory behind two important aspects of GPU fuzzing, it's worth agreeing up front, “on shore” as they say, that the task of creating a GPU fuzzer is not an easy one. Why? Because many questions arise:

  • All applications are written for standard work on the CPU, which means that you can’t just transfer the program and execute it on the GPU architecture.

  • In what memory should we shove the data so as not to lose speed?

  • What do we need to store on the device and what on the host?

  • Even more problems stem from the very first point: how do we determine branches, and how do we catch program errors if there is no OS on the video card, meaning no one will send us a signal about an abnormal termination 🙁

Most of these problems were somehow solved and described by the gentlemen from Trail of Bits in the article I linked above (duplicating the link so that you, my dears, do not have to scroll back: https://blog.trailofbits.com/2020/10/22/lets-build-a-high-performance-fuzzer-with-gpus/). However, alternative solutions also exist, and you could ponder them forever, so let's dream up our own GPU fuzzer architecture.

First sketches of the architecture

Very bad GPU fuzzer architecture


I agree that this version of the architecture is not very good, but I think it is enough to convey the idea. Let me know in the comments how bad it looks 🙂

Let me explain using the diagram. On the left we have the host zone, which initially contains an application compiled for the GPU. In the fuzzing process there are a number of tasks that are performed once to initiate fuzzing: obtaining seeds, translating the CPU code into PTX code for the GPU, and collecting the results. These are the operations the host performs once. In addition, the host transfers the information needed for the fuzzing loop into the card's memory: the seeds and the executable file. Nothing else is required from the host; all other operations are performed by the GPU.

Based on the material covered earlier, we need to understand how the “Kernel” entity will be parallelized. The most obvious observation: the application code sits in constant memory, which means its instructions and operations are immutable, and an entity such as the “Application Loader” can be executed in parallel by, for example, launching another kernel (in theory, this is possible in recent versions of CUDA). Mutated data from the mutator is passed to the loader's input, the kernel is launched, and its errors are caught thanks to cudaAssert. If this function is triggered, the input can potentially be considered error-inducing and shelved, that is, written to global memory.


The use of global memory here is dictated by two factors: the probability of an error occurring and the shared memory limit (16–48 KB on average).

If the mutated data leads to new behavior, then the coverage matrix needs to be changed. How to do this is not yet clear; most likely we will have to think about instrumenting the PTX code.

But now the question arises: will there be any problems when the mutator works in multi-threaded mode? Let's think: we need the mutator not to produce duplicate mutated data (that would simply be useless). On the other hand, our fuzzer has a limited set of operations, which we can store, for example, as an array of pointers to the functions performing those operations. If each running mutator is assigned a specific operation, no problems should arise.
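The array-of-operations idea can be sketched like this: each parallel “mutator thread” picks its operation by thread index, so no two threads apply the same operation to the same input. The operation names here are illustrative placeholders.

```python
def op_bitflip(data: bytearray) -> bytearray:
    data[0] ^= 0x01          # flip the lowest bit of the first byte
    return data

def op_drop_byte(data: bytearray) -> bytearray:
    return data[1:] if len(data) > 1 else data  # drop the first byte

def op_duplicate(data: bytearray) -> bytearray:
    return data + data[:1]   # append a copy of the first byte

# the fuzzer's fixed "array of pointers to functions"
MUTATION_OPS = [op_bitflip, op_drop_byte, op_duplicate]

def mutate_for_thread(thread_id: int, data: bytes) -> bytes:
    """Each parallel mutator selects its operation by thread index."""
    op = MUTATION_OPS[thread_id % len(MUTATION_OPS)]
    return bytes(op(bytearray(data)))
```

With this scheme the threads are deterministic and disjoint by construction: thread 0 always bit-flips, thread 1 always drops a byte, and so on, so no synchronization is needed to avoid duplicate mutations of the same input.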

Data mutation. Picture taken from the Trail of Bits article

Interim results

Thank you for reading (or scrolling) this far. Within the framework of the article, I tried to structure everything that I myself know and understand about this topic. If you find any errors or inaccuracies, as I said, I welcome everyone in the comments.

What we have:

  • Lots of words, zero practice. Of course, in words and diagrams this may all sound simple, but when it comes to implementation, this architecture has every chance of simply not working

  • Little fast memory. As shown in the diagram, the mutated data plus the coverage matrix will be stored in shared memory. When testing a large application, the coverage matrix grows proportionally, and the shared memory may simply not be enough. Switching to global memory would significantly reduce the fuzzer's speed.

  • It is not entirely clear how to control and catch errors that occur when the program runs on mutated data

  • A similar situation to the previous point: the mechanism for collecting coverage is unknown. Yes, you can come up with instrumentation for PTX code and collect coverage for the GPU code, but how do you then map that coverage back to the CPU code?

Somehow, there are a lot of inaccuracies and unknowns, but an attempt is an attempt. I hope you liked this material. Thanks again everyone!
