Fascinating lexical analysis in Rust

Introduction

Let’s talk about lexical analysis. At first I was going to call this post “Implementing a tokenizer,” but the wind changed, times have changed… and, in order not to drown in a stream of comments like “hell, where is my LLama BPE tokenizer that you promised me,” we will limit ourselves for now to lexical analysis.

This article is aimed at readers just starting to try their hand at lexical analysis in Rust. So please keep the target audience in mind before complaining along the lines of “hmm, I knocked together a table lookup on my knees and it works ten times better than this mess” or “with timings like these, I won’t live to see the program finish.”

But constructive comments and tips on how things could really be done better are always welcome.

A bit long for an introductory disclaimer. I hope that by reading this far you have shuddered at least once. Enough words, let’s get started.

Lexical tokenization is the conversion of text into (semantically or syntactically) meaningful lexical units (tokens), which are assigned to categories by a “lexer” program.

Simply put, the program is given a sequence of characters as input and produces a sequence of tokens as output. Please note: tokens must have meaning in your context, and there is more than one way to skin a cat: you can tokenize a string in different ways. Here, for example, are a few ways to tokenize the string “hello world”:

# A single token
<HELLO_WORLD>

# Two tokens separated by whitespace
<HELLO> <WHITESPACE> <WORLD>

# A whitespace token framed by non-whitespace characters,
# interpreting everything else simply as "non-whitespace"
<NON_WHITESPACE> <SPACE> <NON_WHITESPACE>

In essence, this makes the tokenizer a function of the form t: (S*, T) → T*, which takes a list of characters over an alphabet S together with an alphabet of tokens T, and produces a list of tokens over T. At least, that’s how I think of it.
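For instance, taking the second tokenization of “hello world” above, the mapping would look roughly like this:

t("hello world", {HELLO, WHITESPACE, WORLD}) = [<HELLO>, <WHITESPACE>, <WORLD>]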

Definition of tokens

In this article we will implement a tokenizer that parses a simple Cron expression. We will work with expressions that consist of exactly 5 fields and may contain only the digits 0-9 and the characters *, /, ',' and '-'. Within each field the following rules apply:

Field             | Valid values | Valid special characters
------------------|--------------|-------------------------
Minutes           | 0 – 59       | * ,
Hours             | 0 – 23       | * ,
Day of the month  | 1 – 31       | * ,
Month             | 1 – 12       | * ,
Day of the week   | 0 – 6        | * ,
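For example, the expression */5 1,2,3 * * * (the same string we will feed the tokenizer later) reads across these fields from left to right:

*/5       1,2,3   *              *       *
minutes   hours   day of month   month   day of week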

Now we can start a new Rust project: run cargo new cronlexer and let’s get down to business.

I prefer to define the tokens in code first. Since each token is a distinct alternative, the most convenient way to express them in Rust is as an enum. Here’s an example demonstrating how the tokens can be structured:

#[derive(Debug, PartialEq, Clone)]
pub enum Token {
    Whitespace,
    Number(u8),
    Dash,
    Slash,
    Star,
    Comma,
}	

The names of the enum variants are self-explanatory and correspond directly to the tokens we are interested in. As you may have noticed, though, we decided to represent the digit token as Number(u8). This forces the tokenizer to combine consecutive digits into a number, and only then emit a token. But we could also make each digit a separate token, like this:

#[derive(Debug, PartialEq, Clone)]
pub enum Token {
    Whitespace,
    Zero,
    One,
    Two,
    Three,
    Four,
    Five,
    Six,
    Seven,
    Eight,
    Nine,
    Dash,
    Slash,
    Star,
    Comma,
}
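To see the difference between these two representations, here is how the input "12" would come out under each (informal notation, not actual program output):

# With Number(u8): the digits are combined first
"12"  ->  [Number(12)]

# With one variant per digit
"12"  ->  [One, Two]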

You could even go the other way and provide separate tokens for each field: minutes, hours, days of the month and so on, like this:

#[derive(Debug, PartialEq, Clone)]
pub enum Token {
    Whitespace,
    Minutes(u8),
    Hours(u8),
    DayOfMonth(u8),
    Month(u8),
    DayOfWeek(u8),
    Dash,
    Slash,
    Star,
    Comma,
}

Depending on how this information is structured, parsing it later may be easier or harder. For now, let’s set syntactic parsing aside, settle on a set of tokens that make sense in the context of our example, and assume that any sequence of them (however many there are) is syntactically acceptable. As a result we get:

#[derive(Debug, PartialEq, Clone)]
pub enum Token {
    Whitespace,
    Number(u8),
    Dash,
    Slash,
    Star,
    Comma,
}
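To make this concrete, a fragment such as */5 1,2 would produce the following token sequence with this token set (the test at the end of the article shows the same shape for */12):

"*/5 1,2"  ->  [Star, Slash, Number(5), Whitespace, Number(1), Comma, Number(2)]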

Designing the tokenizer interface

When I design a library interface, I always think about the end user. How will they work with my library? What will the inputs and outputs of its public functions, structs and enums be? By getting a feel for how the library is best used, I can make it more ergonomic and more convenient for the user.

First, I’ll try to use my library like this:

let source = "*/5 1,2,3 * * *";
let tokenizer = Tokenizer::new(source.as_bytes());
for token in tokenizer {
    println!("token: {:?}", token);
}

Since we want to iterate over tokens, we can rely on a familiar convention: implement the IntoIterator trait for our Tokenizer and have it return a concrete iterator. In Rust it’s common practice that whenever you want to return some entity that implements a set of traits, you wrap it in a struct and have that struct implement those traits. This happens everywhere: functions as different as map, peek, iter and split_whitespace all return auxiliary structs that implement the necessary traits.
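As a quick sketch of that convention (using a hypothetical TokenIter wrapper struct around the Tokenizer we define below, and leaving next() unimplemented), it could look roughly like this:

// Hypothetical wrapper struct that does the actual iteration.
pub struct TokenIter<'a> {
    tokenizer: Tokenizer<'a>,
}

impl<'a> Iterator for TokenIter<'a> {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        // a real implementation would lex and return the next token here
        todo!()
    }
}

impl<'a> IntoIterator for Tokenizer<'a> {
    type Item = Token;
    type IntoIter = TokenIter<'a>;

    fn into_iter(self) -> TokenIter<'a> {
        TokenIter { tokenizer: self }
    }
}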

But then we quickly realize that tokenization may fail. For example, if we try to tokenize schedule every day as a Cron expression, our grammar simply cannot cope with it.

let source = "schedule every day";
let tokenizer = Tokenizer::new(source.as_bytes());
// We can't iterate over a tokenization that failed, or can we?
for token in tokenizer {
    println!("token: {:?}", token);
}

Instead, let’s use a tokenize method that returns the result of tokenization. In a real-world tokenizer other requirements may come up too: peeking at the next value, backtracking, and doing all of it fast (with optimizations). Let’s keep it simple for now: we just want to tokenize the data and iterate over the tokens. Our library will be used something like this:

let source = "*/5 1,2,3 * * *";
let tokenizer = Tokenizer::new(source.as_bytes());
// note the '?', or alternatively match on the error
let tokens = tokenizer.tokenize()?;
for token in tokens {
    println!("token: {:?}", token);
}

Implementation

First let’s fill in our basic structures. We need the main Tokenizer struct and a TokenizerError:

// capture some message here that will later be useful to the library user
#[derive(Debug)]
pub struct TokenizerError {
    pub message: String,
}

// a vector of tokens
pub type TokenizerResult = Result<Vec<Token>, TokenizerError>;

pub struct Tokenizer<'a> {
    source: &'a [u8],
    pos: usize,
}

impl<'a> Tokenizer<'a> {
    pub fn new(source: &'a [u8]) -> Self {
        Self { source, pos: 0 }
    }

    pub fn tokenize(mut self) -> TokenizerResult {
        todo!()
    }
}

Now let’s implement the tokenizer. It simply scans the input from left to right, greedily consuming tokens and pushing them into the resulting token vector.

impl<'a> Tokenizer<'a> {
    // create a new tokenizer
    pub fn new(source: &'a [u8]) -> Self {
        Self { source, pos: 0 }
    }

    pub fn tokenize(mut self) -> TokenizerResult {
        let mut tokens = Vec::new();

        while self.pos < self.source.len() {
            // match against each specific token kind and get both the token itself
            // and the number of bytes consumed
            let (token, consumed) = match self.source[self.pos] {
                // greedily consume whitespace
                b' ' | b'\t' => {
                    let count = self.peek_while(is_whitespace);
                    (Token::Whitespace, count)
                }

                b'-' => (Token::Dash, 1),
                b',' => (Token::Comma, 1),
                b'/' => (Token::Slash, 1),
                b'*' => (Token::Star, 1),
                b'0' | b'1' | b'2' | b'3' | b'4' | b'5' | b'6' | b'7' | b'8' | b'9' => {
                    // greedily consume digits
                    let count = self.peek_while(is_digit);

                    (
                        Token::Number(to_u8(&self.source[self.pos..(self.pos + count)])),
                        count,
                    )
                }
                ch => {
                    // we found a character that is not in our alphabet,
                    // so return an error
                    return Err(TokenizerError {
                        message: format!("Could not parse token: {ch}",),
                    })
                }
            };
            // advance the internal state
            self.pos += consumed;
            // push the token
            tokens.push(token);
        }

        Ok(tokens)
    }

    // peek while the predicate holds,
    // without changing any state,
    // then return the number of elements peeked at
    fn peek_while<P>(&self, pred: P) -> usize
    where
        P: Fn(u8) -> bool,
    {
        let mut consumed = 0;

        while (self.pos + consumed) < self.source.len() && pred(self.source[self.pos + consumed]) {
            consumed += 1;
        }

        consumed
    }
}

// check whether a byte is a digit
fn is_digit(v: u8) -> bool {
    v >= b'0' && v <= b'9'
}

// check whether a byte is whitespace
fn is_whitespace(v: u8) -> bool {
    v == b' ' || v == b'\t'
}

// a quick and dirty way to extract a number from a byte string,
// e.g. b"12" -> 12
fn to_u8(vs: &[u8]) -> u8 {
    let mut result: u8 = 0;
    const RADIX: u8 = 10;
    let len = vs.len();
    for i in 0..len {
        result = result + (RADIX.pow((len - i - 1) as u32) * (vs[i] - b'0'));
    }
    result
}
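As an aside, if we don’t mind going through the standard library, a checked alternative could look like the sketch below (to_u8_checked is my own name for it, not part of the article’s code). Unlike to_u8, which overflows on inputs such as 300 (a panic in debug builds), it reports failure via an Option:

// Alternative sketch: lean on str parsing instead of hand-rolling the math.
// Returns None on invalid UTF-8 or on values that do not fit into a u8.
fn to_u8_checked(vs: &[u8]) -> Option<u8> {
    std::str::from_utf8(vs).ok()?.parse().ok()
}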

Some interesting notes and observations:

  • tokenize takes mut self (consuming the tokenizer), since it mutates internal state and is not idempotent: calling tokenize twice would be an error. This makes it very convenient to signal to the client that tokenize may only be called once, and to enforce that condition at compile time (see the sketch just after this list).

  • Our Tokenizer takes source by reference. Even in simple read-only cases like the one we are looking at, you have to think about how long that reference must live relative to the Tokenizer. In general, source must outlive the Tokenizer (or live at least as long as it does); otherwise we would have no valid place to read the data from.

  • In the internal state we maintain the current index pos, eagerly advancing it as the tokens are parsed. Parsing tokens is just one giant match operation (which could be a separate function). As part of this operation, we return the token itself, as well as the number of bytes consumed.

  • Our Cron grammar lends itself to unambiguous tokenization. Therefore, the whole process turns out to be a rather trivial exercise: we just need to carefully grow the index and match tokens until we exhaust the supply of bytes.
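To illustrate the first point above, here is a minimal sketch (not part of the article’s code) of what the compiler rejects:

let tokenizer = Tokenizer::new(source.as_bytes());
let tokens = tokenizer.tokenize()?;   // `tokenizer` is moved into tokenize here
// let more = tokenizer.tokenize()?;  // compile error: use of moved value `tokenizer`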

Testing

Let’s write some tests:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn basic_tokenization() {
        let source = "*/12";
        let tokenizer = Tokenizer::new(source.as_bytes());
        let tokens = tokenizer.tokenize();
        assert!(tokens.is_ok());
        assert_eq!(
            tokens.unwrap(),
            vec![Token::Star, Token::Slash, Token::Number(12)]
        );
    }

    #[test]
    fn failed_tokenization() {
        let source = "why cant't I just say 'every hour'";
        let tokenizer = Tokenizer::new(source.as_bytes());
        let tokens = tokenizer.tokenize();
        assert!(tokens.is_err());
    }
}

100% test coverage. This is exactly how I write in production.
