Not a panacea, but a helper. About static code analyzers

Let's get straight to the point. So, what is static analysis? Let's try to figure it out quickly, in simple terms. Imagine that you are writing text in Word, and it underlines errors with a red wavy line. Static analysis is about the same thing, only for code. A special utility analyzes the source code before the program is launched and looks for potential errors in it, based on certain rules. Usually, IDE plugins are used for this, for example, plugins for Visual Studio, JetBrains IDEs, or VS Code. The plugin runs a static analyzer in the background, the analyzer issues warnings, and the developer can immediately jump to the problematic place in the code and figure everything out, relying on the information in the warnings and the detailed documentation.

This is how it looks to the user. But how does the algorithm understand that there is an error or a potential vulnerability in the code? Does it need millions of bad-code patterns to compare against?

Source code is just a set of symbols. To analyze it, you need to transform it into a more convenient representation, for example, an abstract syntax tree (AST). First, the lexer breaks the code into tokens (the “words” of the language), then the parser combines them into constructions according to grammar rules, and the result is an AST. In addition, of course, you need to understand the language's type system (what an int is, for example) and maintain a symbol table (which functions and variables are declared).
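Here is a minimal sketch of that first step, using Python and its standard tokenize and ast modules purely for illustration (analyzers for other languages build their own frontends, but the idea is the same):

```python
import ast
import io
import tokenize

source = "x = price * (1 + tax_rate)\n"

# 1. Lexing: the source is split into tokens, the "words" of the language.
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tok.type, repr(tok.string))

# 2. Parsing: tokens are combined into an abstract syntax tree (AST).
tree = ast.parse(source)
print(ast.dump(tree, indent=2))
```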

This is the basic level at which you can already create diagnostic rules. Most often, they are pattern-based. Over time it becomes clear that certain constructs often lead to errors (for example, dereferencing a null pointer or comparing an operand with itself). Based on such patterns, you can draw conclusions about potential errors.
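As an illustration, here is what such a pattern-based rule might look like for the “operand compared with itself” case. This is a deliberately naive sketch in Python, not how a production analyzer is written:

```python
import ast

class SelfComparisonRule(ast.NodeVisitor):
    """Toy diagnostic: 'x == x' (or 'x != x') is almost always a typo."""

    def __init__(self):
        self.warnings = []

    def visit_Compare(self, node):
        # Walk the chain of comparisons and flag pairs of identical names.
        operands = [node.left] + node.comparators
        for left, op, right in zip(operands, node.ops, operands[1:]):
            if (isinstance(op, (ast.Eq, ast.NotEq))
                    and isinstance(left, ast.Name)
                    and isinstance(right, ast.Name)
                    and left.id == right.id):
                self.warnings.append(
                    f"line {node.lineno}: '{left.id}' is compared with itself")
        self.generic_visit(node)

code = "if speed == speed:\n    brake()\n"
rule = SelfComparisonRule()
rule.visit(ast.parse(code))
print("\n".join(rule.warnings))
```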

Based on the AST, the type system and the symbol table, diagnostic rules are created. In addition, other technologies are used: data flow analysis, control flow analysis, etc. Static analysis is made up of all these “building blocks”.
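To give a taste of what data flow analysis adds, here is a deliberately simplified sketch (Python again, and it assumes straight-line code only, with no branches or loops) that remembers which variables may hold None and warns when they are dereferenced:

```python
import ast

class MaybeNoneDeref(ast.NodeVisitor):
    """Very simplified data flow idea: remember which names were assigned
    None and warn if they are dereferenced without a reassignment."""

    def __init__(self):
        self.maybe_none = set()
        self.warnings = []

    def visit_Assign(self, node):
        for target in node.targets:
            if isinstance(target, ast.Name):
                if isinstance(node.value, ast.Constant) and node.value.value is None:
                    self.maybe_none.add(target.id)      # x = None
                else:
                    self.maybe_none.discard(target.id)  # x = something else
        self.generic_visit(node)

    def visit_Attribute(self, node):
        if isinstance(node.value, ast.Name) and node.value.id in self.maybe_none:
            self.warnings.append(
                f"line {node.lineno}: '{node.value.id}' may be None here")
        self.generic_visit(node)

code = "conn = None\nconn.close()\n"
checker = MaybeNoneDeref()
checker.visit(ast.parse(code))
print("\n".join(checker.warnings))
```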

How is it different from a compiler? In many respects it does the same thing, acting as a kind of frontend. The thing is that the compiler is focused on code optimization, which takes the lion's share of its time. At the same time, optimization can lead to unexpected side effects that are very difficult to track. Static analysis, in turn, can afford not to save time: it carefully analyzes all branches of the code, pays more attention to functions, collects interesting facts about them, and then issues a warning that somewhere along a call chain you, for example, dereferenced a null pointer or a null reference, depending on the programming language.
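A warning like that usually concerns code along these lines (a hypothetical Python example; in C++ it would be a null pointer instead of None):

```python
def find_user(users, name):
    for user in users:
        if user.name == name:
            return user
    return None  # the "not found" case returns None

def greet(users, name):
    user = find_user(users, name)
    # No check for None here: an analyzer that follows the call chain
    # can see that find_user may return None and warn about this line.
    print("Hello, " + user.name)
```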

By the way, analyzers are often able to determine which files to check. First, you need to get an idea of the project structure. In the Microsoft ecosystem that means MSBuild projects; for C/C++ it is compile_commands.json, CMake, and so on. After that, a list of changed files is taken from the commit, and based on it, it is determined what needs to be analyzed. That is, the plugin starts the analysis after the build and checks only those files that have been changed. Of course, if there are header files among them, more files will be analyzed.
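A rough sketch of that logic, assuming a C/C++ project with a compile_commands.json next to it and ignoring details such as path normalization and header dependencies:

```python
import json
import subprocess

# Files the build system knows about. compile_commands.json is a standard
# format produced, for example, by CMake with CMAKE_EXPORT_COMPILE_COMMANDS=ON.
with open("compile_commands.json") as f:
    compiled_files = {entry["file"] for entry in json.load(f)}

# Files changed in the last commit.
changed = subprocess.run(
    ["git", "diff", "--name-only", "HEAD~1", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

# Analyze only the intersection (a real tool would also add every file
# that includes a changed header).
to_analyze = [path for path in changed if path in compiled_files]
print(to_analyze)
```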

Of course, static analyzers can also produce false positives. It is impossible to analyze absolutely everything in a reasonable time, so various assumptions are made, the depth of analysis is limited, and so on.

What to do in such situations? You can use comments to suppress warnings. Each analyzer has its own syntax, but usually it is enough to add a comment to the line with the code. There is also a mechanism for mass suppression of warnings (a baseline). For example, you run the analyzer on a C/C++ project and get 100 thousand warnings. Obviously, this “technical debt” cannot be fixed immediately. Therefore, you can create a baseline, and the next time you analyze, you will see only new warnings. In other words, you simply reset the counter to zero.
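The suppression syntax differs from tool to tool (clang-tidy understands // NOLINT comments, flake8 understands # noqa, and so on). The baseline mechanism can be sketched roughly like this, assuming each warning is a dict carrying a rule id, a file, and the offending line of code:

```python
import hashlib
import json

def fingerprint(warning):
    # A stable identity for a warning: rule id + file + offending code line.
    # Line numbers alone are too fragile; real tools use similar tricks.
    key = f"{warning['rule']}:{warning['file']}:{warning['code']}"
    return hashlib.sha1(key.encode()).hexdigest()

def create_baseline(warnings, path="baseline.json"):
    # "Zero the counter": remember every warning that exists right now.
    with open(path, "w") as f:
        json.dump(sorted(fingerprint(w) for w in warnings), f)

def filter_new(warnings, path="baseline.json"):
    # On the next run, show only warnings that are not in the baseline.
    with open(path) as f:
        known = set(json.load(f))
    return [w for w in warnings if fingerprint(w) not in known]
```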

Of course, developers of static analyzers are constantly fighting the causes of false positives. They are helped, among other things, by users who report such cases, as well as bugs in the analyzer itself, through feedback. Based on these reports, analyzers are constantly being improved, “learning” to catch more and more errors and becoming more universal.

So, thinking about the usefulness of static analyzers, one idea comes to mind: what if we implemented static analysis in package repositories like npm, Maven Central, and others? Imagine: before publishing, libraries are automatically checked for known vulnerabilities. An outdated log4j with a security hole is detected, and the developer immediately receives a notification: update. This would significantly increase the security of open-source projects.
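A rough sketch of the idea (the version range below is illustrative only; a real service would query an up-to-date advisory database):

```python
# Compare a declared dependency against known vulnerable version ranges.
# Uses the third-party 'packaging' library for version comparison.
from packaging.version import Version

KNOWN_VULNERABLE = {
    # package: (first vulnerable version, first fixed version)
    "log4j-core": (Version("2.0"), Version("2.17.1")),
}

def check_dependency(name, version):
    if name in KNOWN_VULNERABLE:
        first_bad, first_fixed = KNOWN_VULNERABLE[name]
        if first_bad <= Version(version) < first_fixed:
            return f"{name} {version} has a known vulnerability, update to {first_fixed} or later"
    return None

print(check_dependency("log4j-core", "2.14.1"))
```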

GitHub itself could be a great platform for such functionality. Imagine: you find a project there, open a Warnings tab, and see a list of potential errors found by static analysis. One project has a lot of stars but also a ton of warnings, while another has fewer stars but clean code. It immediately becomes clear who cares about quality. Of course, GitHub is unlikely to do this. But it is a good opportunity for startups: for example, a service that analyzes a repository by link and provides a detailed report would be in high demand.

Well, at the end of my article, I can't ignore the popular topic of neural networks. Ready-made code analysis solutions based on them sound like a dream. But for now, all the tasks that static analysis faces are mostly solved by classical methods. Neural networks, of course, can also be used to find errors, but this approach has its drawbacks. First, programming language standards are constantly changing. How do you teach a neural network something that doesn't exist yet? Classical methods are more flexible in this regard: a new C++ standard is released, new rules are added, and everything works. Second, it is difficult to tie a neural network to documentation. Its behavior keeps changing: today it searches for one pattern, tomorrow for another. In addition, a lot of data is needed to train a neural network. Yes, some train neural networks directly on code. But here a problem with semantics arises: the neural network does not understand it. If you find a way to pass semantic information to it, the results could be interesting. This is how dreams of making a neural network work for you are shattered by harsh reality…

Well, I will probably finish the article on this thought and sum it up. Static code analysis is convenient: such a tool saves time and resources, and errors are detected at early stages. It probably isn't even difficult to calculate the economic benefit. On the other hand, you need to understand that it is not a panacea, but an assistant in development. High-quality code is, first of all, the attentiveness and professionalism of the programmer.

Thanks for reading!

Write good code and don't be lazy to share your ideas and bug reports with tool developers. =)
