A recipe for progressive improvement of a programming language

Although Clang is used as a tool for refactoring and static analysis, it has a serious drawback: its abstract syntax tree does not provide information about which CPP macro expansions a given AST node may have been built from. In addition, Clang does not lower macro expansions to the LLVM level, that is, to intermediate representation (IR) code. This makes it extremely difficult to design macro-aware static analyses. The topic is actively researched, and things are looking up: last summer a tool called Macroni was created that simplifies static analysis of exactly this kind.

In Macroni, developers can define the syntax of new C language constructs using macros, and provide semantics for these constructs using MLIR (multi-level intermediate representation). Macroni uses the VAST tool to lower C code to MLIR. In turn, the PASTA tool makes it possible to find out where specific macros in the AST came from, and based on this information macros can also be lowered to MLIR. Developers can then define their own MLIR converters to turn Macroni's output into domain-specific MLIR dialects and analyze a domain in a nuanced manner. This article uses several examples to show how Macroni lets you extend C with safer language constructs and enables security analysis of C.

Strong type definitions

Type definitions (typedef) in C help give lower-level types more meaningful names. However, C compilers do not consider these names when checking types; they only check the underlying base types. The simplest manifestation of this inconvenience is a type confusion bug, where semantically distinct types represent different formats or units, as in the following example:

typedef double fahrenheit;
typedef double celsius;
fahrenheit F;
celsius C;
F = C; // The compiler issues neither an error nor a warning

Example 1: C type checking only considers a typedef's underlying base type.

The above code passes type checking, but there is a semantic difference between fahrenheit and celsius that should not be ignored, since temperatures are measured differently on the Celsius and Fahrenheit scales. With ordinary C typedefs alone, there is no way to enforce this distinction through strong typing.

But with Macroni, you can use macros to define the syntax of strong typedefs and use MLIR to implement specialized type checking for them. The following example shows how macros can define strong typedefs that distinguish temperatures in Fahrenheit from temperatures in Celsius:

#define STRONG_TYPEDEF(name) name
typedef double STRONG_TYPEDEF(fahrenheit);
typedef double STRONG_TYPEDEF(celsius);

Example 2: Using macros to define syntax for strong typedefs in C.

If you wrap the typedef name in the STRONG_TYPEDEF() macro, Macroni can identify the typedefs whose names were produced by an expansion of STRONG_TYPEDEF() and convert them to types from a specialized MLIR dialect (e.g., temp), like this:

%0 = hl.var "F" : !hl.lvalue<!temp.fahrenheit>
%1 = hl.var "C" : !hl.lvalue<!temp.celsius>
%2 = hl.ref %1 : !hl.lvalue<!temp.celsius>
%3 = hl.ref %0 : !hl.lvalue<!temp.fahrenheit>
%4 = hl.implicit_cast %3 LValueToRValue : !hl.lvalue<!temp.fahrenheit> -> !temp.fahrenheit
%5 = hl.assign %4 to %2 : !temp.fahrenheit, !hl.lvalue<!temp.celsius> -> !temp.celsius

Example 3: Using Macroni, typedefs can be lowered to MLIR types that enforce strong typing.

By integrating these macro-derived typedefs into the type system, we can now define our own type-checking rules for them. For example, we could require strict type checking for operations on temperature values; the program above would then fail type checking. We could also add custom type-casting logic for temperature values, so that converting a temperature from one scale to the other would implicitly insert the instructions for that conversion.

The point of using macros to add strong typedef syntax is that macros are both backward compatible and portable. While custom types can be identified using Clang alone by annotating our typedefs with GNU or Clang attribute syntax, there is no guarantee that annotate() will be available on every platform and compiler we need to support. A C preprocessor, on the other hand, can confidently be expected everywhere.

You may already be thinking: C has its own version of strong typedefs, namely struct. We could therefore implement stricter type checking by converting our typedefs into structures (e.g., struct fahrenheit { double value; }). However, this would change both the API and the ABI of the type, breaking existing client code as well as backward compatibility. If we turn typedefs into structs, the compiler may produce completely different assembly code. For example, consider the following function definition:

fahrenheit convert(celsius temp) { return (temp * 9.0 / 5.0) + 32.0; }

Example 4: Defining a function that converts Celsius to Fahrenheit.

If we define strong typedefs using macro-based typedefs, then Clang produces the following LLVM intermediate representation for the call convert(25). The LLVM IR for convert matches the original C: the function takes a single argument of type double and returns a value of type double.

tail call double @convert(double noundef 2.500000e+01)

Example 5: LLVM intermediate representation of the call convert(25) when strong typedefs are implemented using macros.

Compare this code with the intermediate representation Clang would produce if the strong typedefs were defined using structures. Now the call takes not one argument but four. The first argument, a ptr, points to the location where convert will store its return value. Imagine what would happen if a client called this new version of convert using the calling convention of the original.

call void @convert(ptr nonnull sret(%struct.fahrenheit) align 8 %1,
                   i32 undef, i32 inreg 1077477376, i32 inreg 0)

Example 6: LLVM intermediate representation of convert(25) when structs are used for strong typedefs.

Weak typedefs that ought to be strong are ubiquitous in C codebases, including critical infrastructure such as libc and the Linux kernel. If you want to add strong type checking for a standard type such as time_t, it is essential to maintain API and ABI compatibility. If you wrapped time_t in a struct (e.g., struct strict_time_t { time_t t; }) to get strong type checking, you would need to change not only every API that handles time_t values but also the ABI at those points. Clients already using bare time_t values would have to painstakingly change every place where their code uses time_t so that it uses your structure instead, in order to activate strong type checking. If instead you used a macro-based typedef to alias the original time_t (e.g., typedef time_t STRONG_TYPEDEF(time_t)), the API and ABI of time_t would remain intact. Client code that already uses time_t correctly could remain unchanged.

Improving Sparse from the Linux kernel

In 2003, Linus Torvalds developed his own preprocessor, C parser, and compiler called Sparse. Sparse performs type checking that accounts for the specifics of the Linux kernel. To operate, Sparse relies on macros scattered throughout the kernel code, in particular __user. In normal build configurations these macros do nothing, but when the __CHECKER__ macro is defined, they expand to uses of __attribute__((address_space(...))).

Macro definitions like these need to be guarded with __CHECKER__, since most compilers do not let you hook into a macro or implement specialized type checking… at least, that was the case until recently. Macroni lets you hook into macros and perform the same kind of safety checking and analysis as Sparse. But while Sparse is limited to C (since it implements its own C parser and preprocessor), Macroni can work with any code that Clang can parse (e.g., C, C++, and Objective-C).

The first Sparse macro we will hook into is __user. Currently the kernel defines __user as an attribute recognized by Sparse:

# define __user     __attribute__((noderef, address_space(__user)))

Example 7: The Linux kernel's __user macro

Sparse looks for this attribute to find pointers originating in user space, as in the following example. noderef tells Sparse that such pointers must not be dereferenced (e.g., *uaddr = 1), since their provenance cannot be trusted.

u32 __user *uaddr;

Example 8: Using the __user macro to annotate a variable as coming from user space.

Macroni can hook into the macro and the expanded attribute to lower the declaration to MLIR, like this:

%0 = hl.var "uaddr" : !hl.lvalue<!sparse.user<!hl.ptr<!hl.elaborated<!hl.typedef<"u32">>>>>

Example 9: Kernel code after lowering to MLIR using Macroni

The lowered MLIR embeds the annotation into the type system by wrapping declarations originating in user space in the sparse.user type. We can now add our own type-checking logic for user-space variables, just as we earlier created strong typedefs. You can even hook into the Sparse-specific __force macro to disable strong type checking when needed. The kernel already does this in places:

raw_copy_to_user(void __user *to, const void *from, unsigned long len)
{
   return __copy_user((__force void *)to, from, len);
}

Example 10: Using the __force macro to copy a pointer to user space

Macroni also makes it convenient to identify RCU read-side critical sections in the kernel and to ensure that certain RCU (read-copy-update) operations occur only within these sections. Consider, for example, the following call to rcu_dereference():

rcu_read_lock();
rcu_dereference(sbi->s_group_desc)[i] = bh;
rcu_read_unlock();

Example 11: Calling rcu_dereference() in a read-side RCU critical section in the Linux kernel

The above code calls rcu_dereference() in a critical section, i.e., the region of code that begins with the call to rcu_read_lock() and ends with the call to rcu_read_unlock(). rcu_dereference() should only be called in read-side critical sections, but this restriction cannot be enforced by the compiler.

With Macroni, you can use the rcu_read_lock() and rcu_read_unlock() calls to identify critical sections, which form implicit lexical regions of code. You can then check that calls to rcu_dereference() occur only within these sections:

kernel.rcu.critical_section {
 %1 = macroni.parameter "p" : ...
 %2 = kernel.rcu_dereference rcu_dereference(%1) : ...
}

Example 12: Result of lowering an RCU critical section to MLIR (types omitted for brevity)

The above code turns both RCU critical sections and calls to rcu_dereference() into explicit MLIR operations. It is therefore easy to verify that rcu_dereference() appears only in the regions where it belongs.

Unfortunately, RCU critical sections do not always map exactly onto distinct regions of code, and rcu_dereference() is not always called inside such a region. Consider the following examples:

__bpf_kfunc void bpf_rcu_read_lock(void)
{
       rcu_read_lock();
}

Example 13: Kernel code containing a non-lexical RCU critical section

static inline struct in_device *__in_dev_get_rcu(const struct net_device *dev)
{
        return rcu_dereference(dev->ip_ptr);
}

Example 14: Kernel code calling rcu_dereference() outside an RCU critical section

Using the __force macro, calls of this kind to rcu_dereference() can be explicitly permitted, just as __force was used earlier to bypass type checking for user-space pointers.

Rust-like unsafe regions

It is clear that Macroni lets you strengthen type checking and even enable application-specific type-checking rules. But if we mark types as strong, we must adhere to the declared level of strictness when checking. In a large codebase, this strategy may require a large set of changes. To make adaptation to a stricter type system more controlled, we could design for C something like the "unsafe" mechanism found in Rust: within an unsafe region, strict type checking is not applied.

#define unsafe if (0); else

fahrenheit convert(celsius C) {
    fahrenheit F;
    unsafe {
        F = (C * 9.0 / 5.0) + 32.0;
    }
    return F;
}

Example 15: C code showing the macro-based syntax for unsafe regions

This snippet shows the syntax of our safety API: invoke the unsafe macro before entering a potentially unsafe region of code. All code not marked as being in an unsafe region is subject to strict type checking. At the same time, the unsafe macro can be used to mark regions of relatively low-level code that we deliberately intend to leave unchanged. This is progress!

But the unsafe macro provides only the syntax of our safety API, not the logic. To shore up this leaky abstraction, we need to convert the if statement marked by the macro into an operation in a hypothetical safety dialect:

...
"safety.unsafe"() ({
   ...
}) : () -> ()
...

Example 16: Using Macroni, we can lower our safety API to an MLIR dialect and implement the safety-checking logic.

We can now disable strict type checking for the operations nested inside the MLIR representation of the unsafe macro.

Safer signal handling

You may have noticed a pattern in how safe language constructs are created: macros define syntax that marks certain types, values, or regions of code as subject to a set of invariants, and logic at the MLIR level then ensures those invariants hold.

With Macroni you can ensure that signal handlers execute only signal-safe code. Consider, for example, the following signal handler defined in the Linux kernel:

static void sig_handler(int signo) {
       do_detach(if_idx, if_name);
       perf_buffer__free(pb);
       exit(0);
}

Example 17: Signal handler defined in the Linux kernel

sig_handler() directly calls three other functions in its definition, all of which must be safe in a signal-handling context. But nothing in the above code checks that only signal-safe functions are called inside the definition of sig_handler(). C compilers simply provide no way to express semantic checks that apply to lexical regions.

With Macroni, you can add macros that mark some functions as signal handlers and others as signal-safe, and then implement logic at the MLIR level to ensure that signal handlers call only signal-safe functions, like this:

#define SIG_HANDLER(name) name
#define SIG_SAFE(name) name


int SIG_SAFE(do_detach)(int, const char*);
void SIG_SAFE(perf_buffer__free)(struct perf_buffer*);
void SIG_SAFE(exit)(int);


static void SIG_HANDLER(sig_handler)(int signo) { ... }

Example 18: Token-based syntax for marking signal handlers and signal-safe functions

In the above code, the function sig_handler() is marked as a signal handler, and the three functions it calls are marked as signal-safe. Each macro call spans a single token: the name of the function we want to tag. With this approach, Macroni hooks into the expanded token and determines from the function name whether it is a signal handler or signal-safe.

An alternative approach is to define these macros as magic annotations and then hook into those via Macroni:

#define SIG_HANDLER __attribute__((annotate("macroni.signal_handler")))
#define SIG_SAFE __attribute__((annotate("macroni.signal_safe")))


int SIG_SAFE do_detach(int, const char*);
void SIG_SAFE perf_buffer__free(struct perf_buffer*);
void SIG_SAFE exit(int);


static void SIG_HANDLER sig_handler(int signo) { ... }

Example 19: Alternative attribute syntax to mark signal handlers and signal-safe functions

With this approach, a macro invocation looks more like a type specifier, which some will find nicer. The only difference between the token-based and attribute-based syntax is that the latter requires the compiler to support the annotate() attribute. If that is not a problem, or if the attribute can be guarded behind __CHECKER__-like constructs, either syntax works fine. The MLIR-side logic for checking signal safety remains the same regardless of the chosen syntax.

Conclusion

The Macroni tool lowers C code and macros to multi-level intermediate representation (MLIR), so you can build your analyses around a domain-specific intermediate representation instead of relying on Clang's abstract syntax tree alone. This intermediate representation provides full access to the types, control flow, and data flow of VAST's high-level MLIR dialect. Macroni lowers the macros relevant to your domain to MLIR and takes care of all the other macros for you, opening up the full power of macro-aware static analysis: you can define your own analyses, transformations, and optimizations that take macros into account at every stage. As this article has shown, you can even combine macros and MLIR to define new syntax and semantics for C. Macroni is free and open source; see its repository on GitHub.

Acknowledgments

Thanks to Trail of Bits for the opportunity to create the Macroni tool. Thanks to my manager and mentor Peter Goodman for giving me the idea of lowering macros to MLIR and for suggesting how Macroni could potentially be used. Thanks also to Lukas Korenik for reviewing the Macroni code and for his advice on how it could be improved.

