I found 213 security vulnerabilities in the codebase using GPT-3


“I’m disappointed by your level of cybersecurity”

Brief summary: GPT-3 found 213 security vulnerabilities in a git repository. In comparison, one of the best commercial tools on the market (developed by a reputable cybersecurity company) found only 99 problems, although it does present its findings in a more structured format. After I manually checked a random sample of 50 of the 213 vulnerabilities reported by GPT-3, only one turned out to be a false positive. Both tools produced many false negatives.


In recent years, the field of artificial intelligence and machine learning has experienced exponential growth, opening up a new space of opportunity. One application that has attracted attention is AI-based code analysis, specifically the use of AI models to identify security vulnerabilities in code. In this experiment, we used OpenAI’s GPT-3 to find security vulnerabilities in a code repository containing 129 files with vulnerabilities.

How it works

The GPT-3 variant I used (text-davinci-003) has a context window of 4,000 tokens, which is roughly 3,000 English words. This means it can process no more than a few hundred lines of code in one request. Unfortunately, the current GPT-3 architecture cannot handle an entire repository at once.
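As a rough illustration (my own, not from the article), OpenAI’s tiktoken library can check whether a file fits in that window; tiktoken maps text-davinci-003 to its p50k_base encoding:

import tiktoken

# text-davinci-003 corresponds to the p50k_base encoding in tiktoken
enc = tiktoken.encoding_for_model("text-davinci-003")

def fits_in_context(source: str, reserve_for_answer: int = 512) -> bool:
    # Keep room inside the 4,000-token window for the prompt wrapper
    # and for GPT-3's answer
    return len(enc.encode(source)) <= 4000 - reserve_for_answer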

To get around the context limit, I scanned each file individually with GPT-3. The downside is that GPT-3 can have difficulty finding security vulnerabilities that only arise when multiple code files interact. In practice, however, the imports and exports were often clear enough for GPT-3 to make a reasonable assumption about what the imported functions do without studying their code. This happened especially often when the source code used popular libraries such as express.js, Flask, the Python standard library, or the C standard library; GPT-3 has probably partially or completely memorized the most popular libraries (or encoded them in some other way). For the code analyzed in this article, GPT-3 knew enough about the imported libraries to accurately identify security vulnerabilities without having to examine their source.
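The scanning script itself isn’t reproduced in the article, but the per-file approach boils down to something like the following sketch, written against the pre-1.0 openai Python package that text-davinci-003 shipped with; the prompt wording is my assumption:

import openai

def scan_file(path: str) -> str:
    # Assumes OPENAI_API_KEY is set in the environment
    with open(path) as f:
        source = f.read()
    # One completion request per file; the exact prompt used in the
    # experiment is not published, so this wording is a guess
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt="Find and list the security vulnerabilities in the following code:\n\n"
               + source,
        max_tokens=512,
        temperature=0,
    )
    return response["choices"][0]["text"].strip()

Setting temperature=0 keeps the answers as reproducible as possible, which makes per-file reports easier to compare.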

To be fair to GPT-3, many commercial vulnerability scanners don’t actually examine the code of imported libraries during static analysis either, so the AI doesn’t work much differently from most tools on the market.

Analyzed code

Each folder in the repository has a name corresponding to a type of security vulnerability and contains files with sample code that includes one or more vulnerabilities. Some files contain trivial code, but many are quite realistic snippets of the kind you might encounter in a production codebase (note: they are still snippets and lack the context of a larger codebase). The README.md file in each folder contains GPT-3’s security analysis of all files in that folder.

Let’s look at a few examples to see how GPT-3 worked!

Example 1 (trivial)

Here is a simple C program vulnerable to a Format String Attack:

#include <stdio.h>
 
int main(int argc, char **argv) {
    printf(argv[1]);
 
    return 0;
}

And here is the GPT-3 analysis (quite correct):

Vulnerabilities Detected:
1. Unvalidated user input: The program does not check the length of the user input, which could lead to a buffer overflow attack.
2. Format string vulnerability: The program does not check the format of the user input, which could lead to a format string attack.


Example 2 (less trivial)

Let’s test the AI on a less trivial program in a higher-level language: this C# program contains a Log Forging vulnerability. Personally, I have seen similar (though less trivial) code in open-source projects and production codebases:

using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Logging;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

namespace WebFox.Controllers
{
    [Route("api/[controller]")]
    [ApiController]
    public class LogInjection : ControllerBase
    {
        private readonly ILogger<LogInjection> _logger;


        public LogInjection(ILogger<LogInjection> logger)
        {
            _logger = logger;
        }

        [HttpGet("{userInfo}")]
        public void injectLog(string userInfo)
        {
            _logger.LogError("error!! " + userInfo);
        }
    }
}

And here is the output of GPT-3:

Vulnerabilities Detected:
1. Log injection vulnerability: The code is vulnerable to log injection attacks as user input is being directly logged without any sanitization.


Example 3 (non-trivial)

The C program below reads and processes an image. It contains many security vulnerabilities, in particular out-of-bounds reads and writes:

#include<stdio.h>
#include<stdlib.h>
#include<string.h>

struct Image
{
    char header[4];
    int width;
    int height;
    char data[10];
};

int ProcessImage(char* filename){

    FILE *fp;
    char ch;
    struct Image img;

    fp = fopen(filename,"r"); 

    if(fp == NULL)
    {
        printf("\nCan't open file or file doesn't exist.");
        exit(0);
    }

    printf("\n\tHeader\twidth\theight\tdata\t\r\n");

    while(fread(&img,sizeof(img),1,fp)>0){
        printf("\n\t%s\t%d\t%d\t%s\r\n",img.header,img.width,img.height,img.data);
    
        int size1 = img.width + img.height;
        char* buff1=(char*)malloc(size1);

        memcpy(buff1,img.data,sizeof(img.data));
        free(buff1);
    
        if (size1/2==0){
            free(buff1);
        }
        else{
            if(size1 == 123456){
                buff1[0]='a';
            }
        }

        int size2 = img.width - img.height+100;
        //printf("Size1:%d",size1);
        char* buff2=(char*)malloc(size2);

        memcpy(buff2,img.data,sizeof(img.data));

        int size3= img.width/img.height;
        //printf("Size2:%d",size3);

        char buff3[10];
        char* buff4 =(char*)malloc(size3);
        memcpy(buff4,img.data,sizeof(img.data));

        char OOBR_stack = buff3[size3+100];
        char OOBR_heap = buff4[100];

        buff3[size3+100]='c';
        buff4[100]='c';

        if(size3>10){
                buff4=0;
        }
        else{
            free(buff4);
        }

        free(buff2);
    }
    fclose(fp);
}

int main(int argc,char **argv)
{
    ProcessImage(argv[1]);
}

GPT-3 output:

Vulnerabilities Detected:
1. Unchecked return value: The return value of the fopen() function is not checked, which could lead to a null pointer dereference.
2. Out-of-bounds read: The memcpy() function is used to copy data from img.data to buff1 and buff2 without checking the size of the destination buffer, which could lead to an out-of-bounds read.
3. Out-of-bounds write: The buff3 and buff4 arrays are written to without checking the size of the source buffer, which could lead to an out-of-bounds write.
4. Uninitialized memory access: The OOBR_stack and OOBR_heap variables are accessed without being initialized, which could lead to undefined behavior.
5. Memory leak: The buff4 array is not freed if the size3 variable is greater than 10, which could lead to a memory leak.


GPT-3 is right again!

Example 4

This last example, in Java, is my favorite because, unfortunately, the first time I read the code I could not understand what was wrong with it. It looked completely innocent and familiar: I had seen similar code in real life, but never thought twice about it.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

class Utils
{
    // Function to serialize an object and write it to a file
    public static void SerializeToFile(Object obj, String filename)
    {
        try
        {
            FileOutputStream file = new FileOutputStream(filename);
            ObjectOutputStream out = new ObjectOutputStream(file);

            // Serialization of the object to file
            System.out.println("Serializing " + obj.toString() + " to " + filename);
            out.writeObject(obj);

            out.close();
            file.close();
        }
        catch(Exception e)
        {
            System.out.println("Exception: " + e.toString());
        }
    }

    // Function to deserialize an object from a file
    public static Object DeserializeFromFile(String filename)
    {
        Object obj = new Object();

        try
        {
            FileInputStream file = new FileInputStream(filename);
            ObjectInputStream in = new ObjectInputStream(file);

            // Deserialization of the object from file
            System.out.println("Deserializing from " + filename);
            obj = in.readObject();

            in.close();
            file.close();
        }
        catch(Exception e)
        {
            System.out.println("Exception: " + e.toString());
        }

        return obj;
    }
}

GPT-3 output:

Vulnerabilities Detected:
1. Insecure Deserialization: Deserializing untrusted data can lead to remote code execution.
2. Insecure File Permissions: The file being written to or read from may have insecure permissions, allowing unauthorized access.


GPT-3 is right again.
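Insecure deserialization is not a Java-only problem; Python’s pickle fails the same way as readObject() here. The usual fix is to keep native deserialization away from untrusted input and use a data-only format instead. A minimal sketch of that substitution (mine, not from the article):

import json

# Unsafe equivalent: pickle.load(f) on untrusted input can execute
# arbitrary code, just like readObject() on an untrusted Java stream.
def deserialize_from_file(filename: str):
    # JSON carries data only, never code, so it is safe to parse
    # untrusted input (validate the resulting structure separately)
    with open(filename) as f:
        return json.load(f)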

An important “shift left” in cybersecurity

Shifting left means applying security measures earlier in the development process, for example testing for security bugs before code is deployed to production.
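As a concrete (hypothetical) illustration, the per-file scanner sketched earlier could run as a git pre-commit hook, so findings reach the developer before the code is even pushed:

import subprocess
import sys

def main() -> int:
    # Ask git for the files staged in the current commit
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    )
    failed = False
    for path in out.stdout.splitlines():
        report = scan_file(path)  # from the earlier sketch
        if "no vulnerabilities detected" not in report.lower():
            print(f"{path}:\n{report}\n")
            failed = True
    # A non-zero exit code makes git abort the commit
    return 1 if failed else 0

if __name__ == "__main__":
    sys.exit(main())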

AI code analysis could be a major shift left in cybersecurity, applying security best practices and educating developers about cybersecurity as they write and commit code. This could nurture a strong cybersecurity culture in an organization and perhaps even prevent the next big hack, but…

But the technology is not yet perfect: it misses many issues that an experienced cybersecurity professional would catch in an audit. Still, for a large general-purpose language model, the AI does surprisingly well, and it is much cheaper than the alternatives.

A release date for GPT-4 hasn’t been announced yet, but I’m sure these large language models will continue to evolve and gain more and more capabilities.

False positives

After manually analyzing about 50 of the 213 vulnerabilities identified by GPT-3, I found a single false positive, on line 13 of this JavaScript program:

var express = require('express');
var cookieParser = require('cookie-parser');
var escape = require('escape-html');
var serialize = require('node-serialize');
var app = express();
app.use(cookieParser())
 
app.get('/', function(req, res) {
 if (req.cookies.profile) {
   var str = new Buffer(req.cookies.profile, 'base64').toString();
   var obj = serialize.unserialize(str);
   if (obj.username) {
     res.send("Hello " + escape(obj.username)); // <--- GPT-3 makes a mistake here
   }
 } else {
     res.cookie('profile', "eyJ1c2VybmFtZSI6ImFqaW4iLCJjb3VudHJ5IjoiaW5kaWEiLCJjaXR5IjoiYmFuZ2Fsb3JlIn0=", {
       maxAge: 900000,
       httpOnly: true
     });
 }
 res.send("Hello World");
});
app.listen(3000);

GPT-3 output:

Vulnerabilities Detected:
- Insecure cookie settings (no secure flag set)
- Unvalidated deserialization of user input (cookie)
- Potential XSS vulnerability due to lack of escaping of user input (username)


GPT-3 is correct about the first two vulnerabilities, but the third is a false positive: obj.username is escaped, although GPT-3 claims it is not.

The results of the experiment showed that GPT-3 was able to detect security vulnerabilities in 85 out of 129 scanned files. Impressive!

The summarize_results.py script generates a full summary of GPT-3’s results:

Vulnerabilities detected in 86 / 129 files.
Detected 213 vulnerabilities in total.

Frequency of introductory sentences used in GPT-3's responses (1 response / file scanned):
{'vulnerabilities detected': 73, 'no vulnerabilities detected.': 43, 'vulnerability detected': 6, 'answer': 2, 'potential vulnerabilities detected': 2, 'analysis': 1, 'security vulnerabilities detected': 1, 'no response given': 1}

Distribution of file types scanned:
129 files of code in total (excluding markdown and flatfiles)
{'.php': 50, '.js': 20, '.cs': 16, '.c': 14, '.java': 9, '.py': 8, '.rb': 5, '.asp': 3, '.ts': 2, '.go': 1, '.html': 1}
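summarize_results.py itself isn’t reproduced in the article; here is a sketch of how such a summary could be computed, assuming the responses were saved one per scanned file:

import os
from collections import Counter

def summarize(responses: dict) -> None:
    # responses maps each scanned file path to GPT-3's raw answer
    def intro(r: str) -> str:
        # First line of the answer, lowercased, trailing colon removed
        return (r.strip().splitlines()[0].rstrip(":").lower()
                if r.strip() else "no response given")

    intro_counts = Counter(intro(r) for r in responses.values())
    extension_counts = Counter(os.path.splitext(p)[1] for p in responses)
    hits = sum(1 for r in responses.values()
               if r.strip() and not intro(r).startswith("no vulnerabilities"))
    print(f"Vulnerabilities detected in {hits} / {len(responses)} files.")
    print(intro_counts)
    print(extension_counts)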


Comparison with commercial tools

To complete the experiment, I compared the results of GPT-3 with those of a commercial vulnerability scanner, Snyk Code, developed by Snyk, a company that, in my opinion, makes excellent security products. Running this repository through Snyk Code found 99 security vulnerabilities, compared to the 213 found by GPT-3.

One limiting factor was that Snyk Code does not support all of the programming languages in the repository, so it was only able to scan 103 files rather than all 129 like GPT-3.

The vulnerable code snippets in this repository are taken from the excellent snoopysecurity/Vulnerable-Code-Snippets collection. I tried to remove any comments embedded in the code that give away which security vulnerabilities a given fragment contains; this also meant removing comments linking to the posts and articles the samples were taken from. All source references from the original repository can be found in attributions.md.
