Why PVS-Studio uses data flow analysis: due to an interesting error in the Open Asset Import Library

PVS-Studio - Data Flow Analysis

Data flow analysis is an integral part of any modern static code analyzer. However, from the outside, it is not very clear what this is and what is most important – why it is needed. Until now, some people associate static analysis with searching for something in the code according to a certain pattern. Therefore, from time to time we write notes in which we demonstrate how this or that technology used in the PVS-Studio analyzer helps to reveal another interesting error. Today is just such an article in which we will consider a bug in one of the implementations of the Base64 binary data encoding standard.

It all started with checking the latest version of the Qt 6 library. This was a separate classic article, where I described 77 errors found. It so happened that at first I decided to skim through the report, not yet hiding the warnings related to third-party libraries. In other words, I have not disabled warnings related to src 3rdparty in the settings. And it so happened that I immediately came across an interesting example of an error in the library Open Asset Import Library, about which I decided to make this separate small note.

The defect found demonstrates why tools such as PVS-Studio, it is useful to analyze the data flow. Without this, the search for many errors is simply impossible. By the way, if you are interested in learning more about data flow analysis and other aspects of the instrument design, I bring to your attention the article “Technologies used in the PVS-Studio code analyzer to find bugs and potential vulnerabilities“.

Now let’s move on to the error found in the Open Asset Import Library (assimp). File: src 3rdparty assimp src code FBX FBXUtil.cpp.

std::string EncodeBase64(const char* data, size_t length)
{
    // calculate extra bytes needed to get a multiple of 3
    size_t extraBytes = 3 - length % 3;

    // number of base64 bytes
    size_t encodedBytes = 4 * (length + extraBytes) / 3;

    std::string encoded_string(encodedBytes, '=');

    // read blocks of 3 bytes
    for (size_t ib3 = 0; ib3 < length / 3; ib3++)
    {
        const size_t iByte = ib3 * 3;
        const size_t iEncodedByte = ib3 * 4;
        const char* currData = &data[iByte];

        EncodeByteBlock(currData, encoded_string, iEncodedByte);
    }

    // if size of data is not a multiple of 3,
    // also encode the final bytes (and add zeros where needed)
    if (extraBytes > 0)
    {
        char finalBytes[4] = { 0,0,0,0 };
        memcpy(&finalBytes[0], &data[length - length % 3], length % 3);

        const size_t iEncodedByte = encodedBytes - 4;
        EncodeByteBlock(&finalBytes[0], encoded_string, iEncodedByte);

        // add '=' at the end
        for (size_t i = 0; i < 4 * extraBytes / 3; i++)
            encoded_string[encodedBytes - i - 1] = '=';
    }
    return encoded_string;
}

If you want, you can try to find the error yourself first. And so that you don’t accidentally read the answer right away, I’ll give you a list of some other interesting articles and briefly tell you what Base64 is :). List of additional articles on related topics:

  1. 31 february;
  2. Using machine learning in static analysis of program source code;
  3. How to implement a static code analyzer in a legacy project and not demotivate the team

Ok, let’s continue. Before us is the implementation of the algorithm for encoding a string of bytes into encoding Base64… It is a standard for encoding binary data using only 64 characters. The coding alphabet contains text-numeric Latin characters AZ, az and 0-9 (62 characters) and 2 additional characters depending on the implementation system. Every 3 original bytes are encoded with 4 characters.

If there is only one or two bytes left to encode, then the result is only the first two or three characters of the string, and the output string is padded with two or one “=” characters. This prevents extra bits from being added to the recovered data. This moment is just implemented in the function under consideration incorrectly.

If you find a mistake, you are great. If not, that’s okay too. You need to delve into the code to notice that something is going wrong. The analyzer warns about this “something is wrong”: V547 [CWE-571] Expression ‘extraBytes> 0’ is always true. FBXUtil.cpp 224

To understand why the analyzer is worried, let’s see how a variable is initialized. extraBytes:

// calculate extra bytes needed to get a multiple of 3
size_t extraBytes = 3 - length % 3;

The programmer planned to calculate how many additional bytes of input data need to be processed if their total number is not 3. To do this, you just need to divide the number of processed bytes modulo 3. The correct variant of variable initialization:

size_t extraBytes = length % 3;

Then, if, for example, 5 bytes are processed, then we get 5% 3 = 2, and you need to additionally process 2 bytes. If the input received 6 bytes, then nothing needs to be processed separately, since 6% 3 = 0.

Although, perhaps it meant how many bytes are not enough for a multiple of three. Then the correct code should be like this:

size_t extraBytes = (3 - length % 3) % 3;

Now I’m not interested in trying to figure out exactly what the correct option should be. In any case, the programmer wrote some average meaningless version of the code:

size_t extraBytes = 3 - length % 3;

And just when analyzing this code, the analyzer needed a data flow analysis mechanism. Whatever value is in the variable length, after dividing by modulus, a value in the range will be obtained [0..2]… The PVS-Studio analyzer can work with ranges, exact values ​​and sets. That is, we are talking about Value Range Analysis… In this case, it is the range of values ​​that will be used.

Let’s continue the calculations:

size_t extraBytes = 3 - [0..2];

It turns out that the variable extraBytes will never be zero. The analyzer will calculate the following possible range of its values: [1..3]…

Until the moment of checking, the variable is not changed anywhere. Therefore, the analyzer is absolutely right in warning that the result of the check will always be true:

if (extraBytes > 0)

This is a simple but beautiful example, when the analysis of the data flow made it possible to calculate the range of values ​​of a variable, to make sure that it does not change, and, finally, that the condition is always true.

Of course, the incorrectness of the function is not limited to the fact that it will execute a piece of code that should not be executed. There, in general, everything goes at random. Let’s say you want to encode 6 characters. In this case, the output string must be 8 characters long. Let’s take a quick look at how this function will behave.

// calculate extra bytes needed to get a multiple of 3
size_t extraBytes = 3 - length % 3; // 3-6%3 = 3

// number of base64 bytes
size_t encodedBytes = 4 * (length + extraBytes) / 3; // 4*(6+3)/3 = 12

std::string encoded_string(encodedBytes, '=');

It has already turned out that the output line will contain 12 characters, not 8. Further, everything will also work incorrectly – there is even no point in going into details.

This is how easy and simple static analysis found an error in the code. Imagine, after all, someone will suffer and debug to understand why the characters are encoded incorrectly into Base64 encoding. Here we, by the way, come to the question of the quality of the used third-party libraries, which I considered in the publication “Why it is important to conduct static analysis of open source libraries that you add to your project“.

Try to incorporate the regular use of PVS-Studio into your development process in order to find many errors at the earliest stage. You’ll like it :). If you are developing an open source project, then the analyzer can be used is free… Thank you for your attention and the code you don’t need.

If you want to share this article with an English-speaking audience, please use the translation link: Andrey Karpov. Why PVS-Studio Uses Data Flow Analysis: Based on Gripping Error in Open Asset Import Library.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *