Lexical analysis in 11l

This article talks about the lexical analyzer, which is an integral part of any compiler.

The task of the lexical analyzer is to break the source code of the program into lexemes or tokens.

So, for example, the code

print(1 + 2)

will be split into tokens

print

,

(

,

1

,

+

,

2

and

)

Importance of indentation

Historically, the compiler of most programming languages

discards

all “whitespace” characters such as space, tab, and linefeed.

What does this lead to? To the fact that the compiler sees the source code like this:

if (foo)
   if (bar)
      m1();
else
   m2();

if

,

(

,

foo

,

)

,

if

,

(

,

bar

,

)

,

m1

,

(

,

)

,

;

,

else

,

m2

,

(

,

)

,

;

This code is taken from documentation for the Nemerle programming language. It contains the so-called hanging else bug, which means that when looking at this code, one might think that the else branch refers to the foo condition, while in programming languages ​​whose syntax is based on the C language (including C++, C#, Java and others), the else branch refers to the bar condition.
As a solution to this problem, the developers of the Nemerle language introduced such a rule that the if statement should always have an else branch, and if an else branch is not required, then the when statement should be used instead of if.
In many new programming languages, such as Swift, Rust, and Go, it is mandatory to use curly braces even if the body of an if statement [и других операторов управления] consists of only one line. When using curly brackets, this problem disappears:

if (foo) {
   if (bar) {
      m1();
   }
}
else {
   m2();
}

In some coding standards, for example

by Oracle

,

from Stanford University

or

Carnegie Mellon University

there is a requirement to enclose the body of if and else statements in curly braces to avoid such an error:

int login;
 
if (invalid_login())
    login = 0;
else
    printf("Login is valid\n");
    login = 1;

Here the line

login = 1;

will be executed in any case, regardless of the value returned by the function

invalid_login

.

But in both of these cases, the problem is not at all in the wrong if logic. [допускающем как отсутствие ветви else, так и её наличие], and not even in forgotten curly braces, but in … a discrepancy in the perception of this code by a human programmer and a compiler. A person perceives the boundaries of blocks visually, due to indentation, and the compiler – with the help of [малоприметных для человека] characters, and the compiler [языков C/C++, C#, Java, Nemerle и др.] concatenates all “whitespace” characters, and thus completely ignores indentation. This is what it is the root of the problem is that the compiler and the person see such code differently.

And to solve this problem, the compiler one way or another must take into account the indentation [чтобы приведённые ранее примеры кода либо работали так, как ожидает этого программист, либо выдавали ошибку на этапе компиляции (или как минимум предупреждение)].

For this reason, in the new programming language 11l, it was decided to make the lexical analyzer take into account indentation.

Now let’s look at the implementation process. A brief description of the indentation algorithm is given in documentation for the Python programming language. In short, the algorithm works like this: at the beginning of each line, the indent level is compared with the value at the top of the special stack. If it is greater then it is pushed onto the stack and the lexical analyzer generates an INDENT token. And if it is less, then all values ​​exceeding the indentation level of the current line are removed from the stack, and a DEDENT token is generated for each removed value. Also, a DEDENT is generated at the end of the source file for each value left on the stack.

In addition to the fact that the lexical analyzer of the 11l language takes into account indentation, like the lexical analyzer of the Python language, the new language adds support for curly braces, which allows you to write code on one line, or without indentation:

In addition, you can write code with both indentation and curly braces.

Here’s how it works (in a nutshell)

Stack of indentation levels (

indentation_levels

) in 11l, unlike the lexical analyzer of the Python language, in addition to the indentation level, stores a flag that is set to true when a new code block is formed by the character

{

, rather than increasing the indentation level. In this case, the indentation level is set to

-1

.

Immediately after the symbol

{

there is a new arbitrary {i.e. lower indent level allowed} indent {its level is replaced

-1

}, which is valid up to the matched character

}

.

Here

link

to the corresponding source code.

This solution is the most versatile and will suit both those who prefer to use curly braces and those who support indentation.

All popular

indent styles

:

Automatic string concatenation

In addition to splitting the source code of the program into lexemes or tokens, the task of the lexical analyzer is also to determine the boundaries of statements.

[Хотя в некоторых языках программирования (например в JavaScript и Kotlin) требуется помощь синтаксического анализатора.]

Traditionally, the semicolon character is used to mark the end of statements in C-based programming languages ​​(C++, C#, Java, D). However, most new programming languages ​​(eg Swift, Go, Kotlin, CoffeeScript, Nim, Julia, Crystal, and others) allow you to omit “semicolons” by using a newline character to end a statement. [Не только я заметил эту тенденцию.]

However, in this case, the question arises: in which cases adjacent lines of source code specify one statement, and in which cases – different statements. And different programming languages ​​solve this problem in different ways.

In Python used one simple rule for implicit string concatenation: if a string contains an unclosed parenthesis, square bracket, or curly brace, then it is concatenated with the following strings until all matching parentheses are closed.

In the Go language used rule: if a newline character follows a token that can terminate the assertion, then the assertion terminates. However, such a rule imposes significant restrictions on the coding style.
So, for example, when using the operator if the curly brace that opens the block must be located on the same line, in other words, it is not allowed to wrap it to the next line. Also, the operator else should appear on the same line as the closing curly brace.

The JavaScript language uses a rather complex

rule system

on which the semicolon character is automatically inserted.

a = 1
b = 2

In this example, a “semicolon” will be automatically inserted at the end of the first line, because

a = 1 b = 2is not a valid statement

.

However, in some cases the “semicolon” will not be inserted, which will cause the code to work incorrectly.

For example, the following two lines of JavaScript code will erroneously be merged into one.

a = b + c
[d, e] = [e, d]

And in this example, on the contrary, “semicolon” will be inserted by mistake immediately after

return

.

return 
  "something";

Those. instead of returning a string

"something"

value will be returned

undefined

.

In most programming languages ​​that allow semicolons to be omitted (for example, Python, Go, Kotlin, Ruby, Julia, and Nim), this code will not work:

r = 1
  + 2

Moreover, in the Python, Go and Nim languages, an error message will be displayed, and in the Kotlin, Ruby and Julia languages, the value of the variable

r

will be installed in

1

.

To solve this problem, 11l uses the following 3 simple rules that are easy for a programmer to understand and easy to implement in a lexical analyzer:

  1. If a string ends with a binary operator, then it is concatenated with the next string.
  2. If a line begins with a binary operator, then it is concatenated with the previous line.
    (You need to check that this is not a unary operator!)
    a = b
    ++i // символ плюса в начале этой строки не должен быть принят за бинарный оператор `+`
    
  3. And just like in Python, for expressions in parentheses and square brackets, the newline character is ignored.
// 1.
if условие1 & // эта строка будет объединена со следующей,
   условие2   \\ так как она оканчивается на бинарный оператор `&`
    some_func()

// 2.
some_variable = 2
              + 3 // эта строка будет объединена с предыдущей,
                  \\ так как она начинается на бинарный оператор `+`

// 3.
some_func(    // так как в этой строке не закрыта скобка, все
   argument1, \\ последующие строки будут объединяться до тех пор,
   argument2) \\ пока скобка не будет закрыта

Semicolon

Although it is not required to use a semicolon to end statements in 11l, semicolons can be used to place multiple statements on the same line:

a = 1; b = 2; c = 3

But, for example, in Python it is not obvious what this line of code corresponds to:

if b: print(1); print(2)

Python’s behavior in this case is logical in principle, but not obvious.

And the behavior of 11l is both logical and obvious.

Also, in Python, you can’t write a one-line function like this:

def f(x): if x: print(x)

In 11l, this is possible:

fn f(x) {if x {print(x)}}

There is no limit to perfection

In my

not

In my humble opinion, 11l implements the most advanced lexical analyzer of all existing programming languages ​​at the moment. However, it can still be improved.

For example, you can opt out of the 3rd automatic string concatenation rule to support code like this:

set_timeout(
            1.0,
            fn ()
               alert(‘!’)
            ,
            0
           )

In this case, the 3rd rule should be replaced with the following 2 rules:

  • If the string ends with an opening parenthesis ((), square ([) или запятую (,), то она объединяется со следующей строкой.
  • Если строка начинается на закрывающую круглую скобку ()) или квадратную (]), then it is merged with the previous line.

But since the 11l language parser does not currently support such constructs {in particular, anonymous functions are not supported

[вместо них можно использовать “стрелочные” функции (например (x, y) -> x + y)]

}, implementation at the level of the lexical analyzer has no practical meaning.

Finally, here are links to the source code for the 11l lexer, which is available in three languages: Python, 11l and C++.

Similar Posts

Leave a Reply Cancel reply