I'm sitting on a tree, doing preprocessing

According to its description:

Tree-sitter is a parser generator tool and an incremental parsing library. It can build a concrete syntax tree for a source file and efficiently update the syntax tree as the source file is edited.

But how does Tree-sitter handle languages that require a preprocessing stage?

Because the preprocessor operates on the text itself, it is very hard to fit into the grammar of the language. So the question is how to support preprocessor directives with the least loss, without actually running the preprocessor.
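To make the point about text content concrete, here is a minimal sketch of what resolving conditional blocks means. It is a toy written in JavaScript, not a real C preprocessor: the function name resolveConditionals is made up for this illustration, and it assumes that every condition is a plain integer literal and that only #if, #else and #endif occur.

```javascript
// Toy model of conditional compilation. Assumption: conditions are plain
// integer literals; macros, #elif, #ifdef and expressions are not handled.
function resolveConditionals(source) {
  const output = [];
  // One boolean per open #if: is the branch currently being read taken?
  const stack = [];
  const isActive = () => stack.every(taken => taken);

  for (const line of source.split('\n')) {
    const directive = line.trim();
    const ifMatch = /^#if\s+(\d+)\b/.exec(directive);
    if (ifMatch) {
      stack.push(Number(ifMatch[1]) !== 0);
    } else if (/^#else\b/.test(directive)) {
      if (stack.length === 0) throw new Error('#else without #if');
      stack[stack.length - 1] = !stack[stack.length - 1];
    } else if (/^#endif\b/.test(directive)) {
      if (stack.length === 0) throw new Error('#endif without #if');
      stack.pop();
    } else if (isActive()) {
      output.push(line); // ordinary text survives only in active branches
    }
  }
  return output.join('\n');
}

const sample = ['int a', '#if 1', '    = 3', '#else', '    = 4', '#endif', ';'];
console.log(resolveConditionals(sample.join('\n')));
// prints three lines: "int a", "    = 3", ";"
```

The fragments "= 3" and "= 4" are not complete constructs of the language on their own, yet the text that remains after resolution is an ordinary declaration. A parser that works on the unprocessed text has to make sense of the fragments themselves.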

tree-sitter-cpp inherits from tree-sitter-c and does not change the rules for preprocessor directives. tree-sitter-c takes the fundamental approach: the parser treats the preprocessor as a full-fledged part of the grammar. But any preprocessor directive that modifies the text (#if, #include) can appear in the middle of a grammar rule and turn it into something completely different. So, to fully support #if within a single grammar, a separate preprocessor directive rule would have to be generated for every possible combination of rules. This can be done thanks to one of Tree-sitter's advantages: grammars are scriptable in JavaScript. The authors of this parser limited themselves to just four cases:

    ...preprocIf('', $ => $._block_item),
    ...preprocIf('_in_field_declaration_list', $ => $._field_declaration_list_item),
    ...preprocIf('_in_enumerator_list', $ => seq($.enumerator, ',')),
    ...preprocIf('_in_enumerator_list_no_comma', $ => $.enumerator, -1),

The preproc_if rule is used for items within blocks and at global scope. The preproc_if_in_enumerator_list and preproc_if_in_enumerator_list_no_comma rules are used in enumerator lists, and preproc_if_in_field_declaration_list, as you may have already guessed, in structures, unions and classes.
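The idea of generating one family of directive rules per context can be sketched roughly as follows. This is a simplified, hypothetical helper and not the actual code from tree-sitter-c: the seq, repeat, optional and field functions are stand-ins for the grammar DSL that the tree-sitter CLI normally provides, the '#if', '#else' and '#endif' literals and the _preproc_expression name are placeholders, and the precedence argument seen in the calls above is left out.

```javascript
// Stand-ins for the Tree-sitter grammar DSL. In a real grammar.js these are
// globals provided by the tree-sitter CLI; here they only build plain objects
// so that the sketch runs on its own under Node.js.
const seq = (...members) => ({ type: 'SEQ', members });
const repeat = content => ({ type: 'REPEAT', content });
const optional = content => ({ type: 'OPTIONAL', content });
const field = (name, content) => ({ type: 'FIELD', name, content });

// Simplified, hypothetical preprocIf-style helper: for a given context it
// returns a pair of rules whose names carry the context suffix.
function preprocIf(suffix, content) {
  return {
    ['preproc_if' + suffix]: $ => seq(
      '#if',
      field('condition', $._preproc_expression),
      repeat(content($)),
      field('alternative', optional($['preproc_else' + suffix])),
      '#endif',
    ),
    ['preproc_else' + suffix]: $ => seq('#else', repeat(content($))),
  };
}

// Spreading the results into one object yields a separate family of
// directive rules for each context.
const rules = {
  ...preprocIf('', $ => $._block_item),
  ...preprocIf('_in_field_declaration_list', $ => $._field_declaration_list_item),
};

console.log(Object.keys(rules));
```

Each call contributes its own preproc_if and preproc_else variant, so every additional context that should accept directives costs another call and another set of rules. That is why covering every possible position in the grammar this way is impractical.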

This set of rules copes well with simple examples:

#if 9            // (preproc_if condition: (number_literal)
int a = 3;       //   (declaration)
#else            //   alternative: (preproc_else
int b = 3;       //     (declaration)))
#endif           //

int main(void) { // (function_definition body: (compound_statement
#if 9            //   (preproc_if condition: (number_literal)
    int a = 3;   //     (declaration)
#else            //     alternative: (preproc_else
    int b = 3;   //       (declaration)))
#endif           //
}                // ))

struct {         // (struct_specifier body: (field_declaration_list
#if 9            //   (preproc_if condition: (number_literal)
    int a;       //     (field_declaration)
#else            //     alternative: (preproc_else
    int b;       //       (field_declaration)))
#endif           //
};               // ))

enum {           // (enum_specifier body: (enumerator_list
#if 9            //   (preproc_if condition: (number_literal)
    a = 2,       //     (enumerator)
#else            //     alternative: (preproc_else
    b = 3,       //       (enumerator)))
#endif           //
};               // ))

But even the last example needs only a small change to stump tree-sitter-c:

enum {           // (enum_specifier body: (enumerator_list
#if 9            //   (preproc_if condition: (number_literal)
    a = 2,       //     (enumerator)
#else            //     alternative: (preproc_else)
    b = 3        //       (ERROR (enumerator)))
#endif           //
};               // ))

This is perfectly valid C, but without the trailing comma the two branches of the preprocessor directive require different grammar rules: an enumerator with a comma and an enumerator without one.

A more complex example:

int a =          // (ERROR)
#if 1            // (preproc_if condition: (number_literal)
    3            //   (ERROR (number_literal))
#else            //   alternative: (preproc_else
    4            //     (expression_statement (number_literal)
#endif           //       (ERROR))))
;                //

And in this case tree-sitter-c cannot even handle #else correctly:

int a            // (declaration)
#if 1            // (preproc_if condition: (number_literal)
    = 3          //   (ERROR (number_literal)
#else            //   )
    = 4          //     (expression_statement (number_literal)
#endif           //       (ERROR)
;                // )))

While the result of an #if substitution can be predicted from the source code, the result of an #include substitution is completely unpredictable for the parser. However, the grammars for C and C++ allow the #include directive only at global scope and within blocks.

#include "a"     // (preproc_include path: (string_literal))
int main(void) { // (function_definition body: (compound_statement
    #include "b" //   (preproc_include path: (string_literal))
}                // ))
int a =          // (declaration (init_declarator
    #include "c" //   (ERROR) value: (string_literal)
;                // ))

tree-sitter-c-sharp does the same, but with a slightly more varied set of contexts:

    ...preprocIf('', $ => $.declaration),
    ...preprocIf('_in_top_level', $ => choice($._top_level_item_no_statement, $.statement)),
    ...preprocIf('_in_expression', $ => $.expression, -2, false),
    ...preprocIf('_in_enum_member_declaration', $ => $.enum_member_declaration, 0, false),

The special rule for preprocessor directives inside expressions is what makes it possible to parse an example like this:

int a =          // (variable_declaration
#if 1            //   (preproc_if condition: (integer_literal)
    3            //     (integer_literal)
#else            //     alternative: (preproc_else
    4            //       (integer_literal))))))
#endif           //
;                //

But it breaks on the enumeration example that works in tree-sitter-c:

enum A {         // (enum_declaration body: (enum_member_declaration_list
#if 9            //   (preproc_if condition: (integer_literal)
    a = 2,       //     (enum_member_declaration) (ERROR)
#else            //     alternative: (preproc_else
    b = 3,       //       (enum_member_declaration) (ERROR)))
#endif           //
};               // ))

enum A {         // (enum_declaration body: (enum_member_declaration_list
#if 9            //   (preproc_if condition: (integer_literal)
    a = 2,       //     (enum_member_declaration) (ERROR)
#else            //     alternative: (preproc_else
    b = 3        //       (enum_member_declaration)))
#endif           //
};               // ))

Note, though, that the error nodes here correspond only to the commas, so the attempt can be counted as a success.

However, more complex constructs, such as operators, are still not covered:

int a            // (ERROR (variable_declaration)
#if 1            //   (preproc_if condition: (integer_literal)
    = 3          //     (ERROR) (integer_literal)
#else            //     alternative: (preproc_else
    = 4          //       (ERROR) (integer_literal))
#endif           //   ))
;                // (empty_statement)

What sets the C# grammar apart is how it treats the remaining preprocessor directives. Tree-sitter grammars have an extras field, which lets you mark special rules that may appear anywhere. Usually whitespace and comments go into this list. The grammar can be greatly simplified by adding directives to it as well:

  extras: $ => [
    /[\s\u00A0\uFEFF\u3000]+/,
    $.comment,
    $.preproc_region,
    $.preproc_endregion,
    $.preproc_line,
    $.preproc_pragma,
    $.preproc_nullable,
    $.preproc_error,
    $.preproc_define,
    $.preproc_undef,
  ],

Thus, these directives are still included in the syntax tree and participate in syntax highlighting, but do not affect the other rules in any way.

int a                                 // (variable_declaration (variable_declarator
#pragma warning disable warning-list  //   (preproc_pragma)
    = 3                               //   (integer_literal)
#pragma warning restore warning-list  //   (preproc_pragma)
;                                     // ))

Apart from a small bug in the preproc_pragma rule, everything was interpreted correctly.
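The effect of such a list can be illustrated with a toy matcher. This is only a sketch of the idea and not how Tree-sitter implements extras: the matchSequence function, the token kinds and the flat "rule" given as a list of expected kinds are all invented for the example.

```javascript
// Toy illustration of the "extras" idea; this is not how Tree-sitter is
// implemented. Tokens are given as [kind, text] pairs.
function matchSequence(tokens, expectedKinds, extras) {
  const matched = []; // tokens consumed by the rule itself
  const skipped = []; // extras seen along the way, still kept for the tree
  let next = 0;       // index of the next expected kind

  for (const [kind, text] of tokens) {
    if (extras.has(kind)) {
      skipped.push(text); // allowed anywhere, the rule is unaffected
    } else if (kind === expectedKinds[next]) {
      matched.push(text);
      next += 1;
    } else {
      return { ok: false, matched, skipped };
    }
  }
  return { ok: next === expectedKinds.length, matched, skipped };
}

const tokens = [
  ['type', 'int'], ['identifier', 'a'],
  ['preproc_pragma', '#pragma warning disable'],
  ['=', '='], ['number', '3'],
  ['preproc_pragma', '#pragma warning restore'],
  [';', ';'],
];
const declaration = ['type', 'identifier', '=', 'number', ';'];

// With the directive listed as an extra, the declaration still matches.
console.log(matchSequence(tokens, declaration, new Set(['preproc_pragma'])).ok); // true
// Without it, the first #pragma interrupts the rule.
console.log(matchSequence(tokens, declaration, new Set()).ok); // false
```

The rule for the declaration never mentions the directive, which is the simplification that extras buys: the directives are recorded, but no other rule has to account for them.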

Before this pull request, #if was also in extras, which made it possible to parse files with fewer errors.

In general, the grammars for C/C++ and C# work quite well, and thanks to Tree-sitter's error tolerance, invalid constructs do not affect the parsing of the text that follows. A parsing error may of course show up as incorrect syntax highlighting or as misbehavior of other editor features built on Tree-sitter, but when a language server is in use, the highlighting can be corrected somewhat by semantic tokens. For example, clangd marks skipped #if branches as comments:

[Screenshot: semantic tokens]

You could even say that Tree-sitter, in a way, punishes excessive use of the preprocessor. Personally, I prefer the approach of putting directive rules in extras. In the next article I will describe how I solved the preprocessing problem with this approach when writing a grammar for FastBuild.