Writing a virtual machine (interpreter) for a simple bytecode, plus JIT compilation

All my code is in my repositories.

Compiling to bytecode

As I mentioned, PigletC can compile a simple C-like language into text-based bytecode. First I need a program that converts the bytecode from text to binary.

Some components:

  • For simplicity, we will encode each instruction with 4 bytes (type int). This is not quite optimal, since there are only 25 instructions, but it will be easier because I want the instruction arguments to be 4 bytes too.

  • Thus, the output binary file is a sequence of ints, each of which is either an instruction code, an instruction argument, or a special value (see below). Arguments come only after those instructions that take one.

  • After the jump instructions (JUMP, JUMP_IF_TRUE, JUMP_IF_FALSE) comes the index of the target instruction – that is, the ordinal number of the int encoding that instruction.

  • For convenience, I decided that in place of each label the numbers 0xcafe and 0xbabe will be written to the binary file. Thus, when analyzing the bytecode, it is immediately clear that jump instructions can land here. Strictly speaking, this is optional: one could instead just collect the arguments of all the jump instructions.
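For illustration, here is how such an encoding might be produced for a tiny program with one label. The opcode values here are made up; the real assembler assigns its own numbers:

```cpp
#include <cstdint>
#include <vector>

// Illustrative opcode values; the real assembler assigns its own.
enum Op : int32_t { OP_PUSHI = 1, OP_JUMP_IF_TRUE = 2, OP_DONE = 3 };

// Encode "label: PUSHI 1; JUMP_IF_TRUE label; DONE".
// Every int32_t is one cell of the output file.
std::vector<int32_t> encode_example() {
    std::vector<int32_t> out;
    out.push_back(0xcafe);                // label marker...
    out.push_back(0xbabe);                // ...occupies two cells
    int32_t target = (int32_t)out.size(); // jumps land right after the pair
    out.push_back(OP_PUSHI);
    out.push_back(1);                     // the argument follows its instruction
    out.push_back(OP_JUMP_IF_TRUE);
    out.push_back(target);                // argument = index into this array
    out.push_back(OP_DONE);
    return out;
}
```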

Otherwise, there is nothing particularly interesting in the assembler code. It is available here.

Bytecode interpretation

Next, we will write a bytecode interpreter, so that there is something to compare the execution of the JIT-compiled code against.

The state of the virtual machine is determined by the following elements:

  • Stack. You can push 4-byte signed numbers onto the stack.

  • Memory is a tape of 4-byte signed numbers that can be accessed by index.

  • The number of the next instruction to execute.

Initially, the STL containers stack and vector could be used for the stack and the memory, respectively (or a vector for both, or a deque for both) – then the sizes of the memory and the stack would not be limited by any constants. However, I later had to abandon this, because it would not play well with JIT compilation of code fragments (in principle, calls to the stack/vector/deque methods could be emitted into the JIT-compiled code, but I am almost sure they could not be inlined, and without inlining this would be inefficient).
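After that change, the stack and memory become plain arrays with an explicit size counter, roughly like this (the sizes here are arbitrary placeholders, not the values from the original code):

```cpp
#include <cstdint>

// Plain arrays instead of STL containers, so JIT-compiled code can
// access the stack and memory through raw pointers.
// The limits below are illustrative.
constexpr int STACK_MAX = 1 << 16;
constexpr int MEMORY_SIZE = 1 << 16;

int32_t stack[STACK_MAX];
int32_t memory[MEMORY_SIZE];
int32_t stack_size = 0;

void stack_push(int32_t value) { stack[stack_size++] = value; }
int32_t stack_pop() { return stack[--stack_size]; }
```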

In general, the implementation of the bytecode interpreter is quite trivial: it is a huge switch, in which for each instruction it is written what it does.

void store_to_memory(int addr, int value) {
    memory[addr] = value;
}

int stack_pop() {
    return stack[--stack_size];
}

void run() {
        size_t ip = 0;
        while (ip < instructions_number) {
            ip++;
            switch (instructions[ip - 1]) {
                case OP_JUMP: {
                    auto arg = instructions[ip];
                    ip++;
                    ip = arg;
                    break;
                }
                case OP_JUMP_IF_TRUE: {
                    auto arg = instructions[ip];
                    ip++;
                    if (stack_pop()) {
                        ip = arg;
                    }
                    break;
                }
                case OP_JUMP_IF_FALSE: {
                    auto arg = instructions[ip];
                    ip++;
                    if (!stack_pop()) {
                        ip = arg;
                    }
                    break;
                }
                case OP_LOADADDI: {
                    auto arg = instructions[ip];
                    ip++;
                    stack[stack_size - 1] += memory[arg];
                    break;
                }
                case OP_LOADI: {
                    auto arg = instructions[ip];
                    ip++;
                    stack[stack_size++] = memory[arg];
                    break;
                }
                case OP_PUSHI: {
                    auto arg = instructions[ip];
                    ip++;
                    stack[stack_size++] = arg;
                    break;
                }
                case OP_DISCARD: {
                    stack_size--;
                    break;
                }
                case OP_STOREI: {
                    auto addr = instructions[ip];
                    ip++;
                    store_to_memory(addr, stack_pop());
                    break;
                }
                case OP_LOAD: {
                    auto addr = stack_pop();
                    stack[stack_size++] = memory[addr];
                    break;
                }
                case OP_STORE: {
                    auto val = stack_pop();
                    auto addr = stack_pop();
                    store_to_memory(addr, val);
                    break;
                }
                case OP_ADDI: {
                    auto arg = instructions[ip];
                    ip++;
                    stack[stack_size - 1] += arg;
                    break;
                }
                case OP_DUP: {
                    stack[stack_size] = stack[stack_size - 1];
                    stack_size++;
                    break;
                }
                case OP_SUB: {
                    auto arg = stack_pop();
                    stack[stack_size - 1] -= arg;
                    break;
                }
                case OP_ADD: {
                    auto arg = stack_pop();
                    stack[stack_size - 1] += arg;
                    break;
                }
                case OP_DIV: {
                    auto arg = stack_pop();
                    if (arg == 0) {
                        cerr << "ZERO DIVISION\n";
                        return;
                    }
                    stack[stack_size - 1] /= arg;
                    break;
                }
                case OP_MUL: {
                    auto arg = stack_pop();
                    stack[stack_size - 1] *= arg;
                    break;
                }
                case OP_ABORT: {
                    cerr << "OP_ABORT called\n";
                    return;
                }
                case OP_DONE: {
                    cout << "program DONE\n";
                    return;
                }
                case OP_PRINT: {
                    cout << stack_pop() << "\n";
                    break;
                }
                case OP_POP_RES: {
                    stack_pop();
                    break;
                }
                case OP_GREATER_OR_EQUALI: {
                    auto arg = instructions[ip];
                    ip++;
                    stack[stack_size - 1] = stack[stack_size - 1] >= arg;
                    break;
                }
                case OP_GREATER_OR_EQUAL: {
                    auto arg = stack_pop();
                    stack[stack_size - 1] = stack[stack_size - 1] >= arg;
                    break;
                }
                case OP_GREATER: {
                    auto arg = stack_pop();
                    stack[stack_size - 1] = stack[stack_size - 1] > arg;
                    break;
                }
                case OP_LESS: {
                    auto arg = stack_pop();
                    stack[stack_size - 1] = stack[stack_size - 1] < arg;
                    break;
                }
                case OP_LESS_OR_EQUAL: {
                    auto arg = stack_pop();
                    stack[stack_size - 1] = stack[stack_size - 1] <= arg;
                    break;
                }
                case OP_EQUAL: {
                    auto arg = stack_pop();
                    stack[stack_size - 1] = stack[stack_size - 1] == arg;
                    break;
                }
                case 0xcafe: {
                    // skip the 0xbabe that follows
                    ip++;
                    break;
                }
            }
        }
}

Moving on to JIT compilation

Since I’m writing in C++, I’ll be using the C++ version of libjit, with classes and inheritance.

Actually, with libjit you first generate some intermediate representation of your function code, and then libjit turns this representation into native code. The intermediate representation is generated by calling libjit functions.

To create a compiled function, you need to define a class that inherits from jit_function, give it a constructor, and override the build and create_signature methods. Actually, in PigletC a program (at least for now) can consist of only one function, and the bytecode has no instructions for calling a function or returning from one (and since jump instructions can only jump to addresses known at compile time, there is no way around additional support at the bytecode level), but nevertheless create_signature can be written generically:

jit_type_t create_signature() override {
    auto ret_type = (is_void ? jit_type_void : jit_type_int);
    jit_type_t params[num_args];
    for (int i = 0; i < num_args; i++) {
        params[i] = jit_type_int;
    }
    return jit_type_create_signature(jit_abi_cdecl, ret_type, params, num_args, 1);
}

Here we assume that the function can only take arguments of type int and return either int or void.

Now all that is left is to assemble the function based on the bytecode. Here I spent a lot of time thinking about what to do with the stack. When the code is executed by the interpreter, the stack is an array/vector/stack defined in the virtual machine. For JIT compilation, I thought it would be efficient to use the real hardware stack as the virtual machine's stack (that is, the one pointed to by the %rsp register on x86). However, I did not find any functions in libjit that would allow this: first, the library apparently has to be cross-platform, and the stack may differ between architectures; second, libjit uses the stack for its own purposes and therefore cannot give direct access to it.

Therefore, a pointer to the start of memory, a pointer to the start of the stack, and a pointer to the stack size will be passed to the constructor of the JIT-compiled function and used there in the same way as in the interpreter. A side effect is that we can switch between executing a function in JIT mode and in normal mode – that is, if JIT compilation does not support some bytecode instruction, then on encountering it the compiled code returns control to the interpreter, which executes that instruction and switches back to JIT execution (in this case, the pointer to the next instruction also has to be updated). To begin with, let us assume that JIT compilation does not support jump instructions.

Preparing to compile the function would look like this:

auto memory = new_value(jit_type_create_pointer(jit_type_int, 1));
memory = this->new_constant(memory_outer);
auto stack = new_value(jit_type_create_pointer(jit_type_int, 1));
stack = this->new_constant(stack_outer);
jit_value stack_size_ptr = new_value(jit_type_create_pointer(jit_type_int, 1));
stack_size_ptr = this->new_constant(stack_size_ptr_outer);
auto push = [this, &stack, &stack_size_ptr] (jit_value&& value) {
    push_on_stack(stack, stack_size_ptr, std::move(value));
};
auto pop = [this, &stack, &stack_size_ptr] () {
    return pop_from_stack(stack, stack_size_ptr);
};
int& ip = *ip_ptr;

And elsewhere in the class, the helpers themselves:

void push_on_stack(jit_value& stack, jit_value& stack_size_ptr, jit_value&& arg) {
    insn_store_elem(stack, insn_load_elem(stack_size_ptr, new_constant(0), jit_type_int), arg);
    insn_store_elem(stack_size_ptr, new_constant(0),
                        insn_load_elem(stack_size_ptr, new_constant(0), jit_type_int) + new_constant(1));
}

jit_value pop_from_stack(jit_value& stack, jit_value& stack_size_ptr) {
    insn_store_elem(stack_size_ptr, new_constant(0),
                    insn_load_elem(stack_size_ptr, new_constant(0), jit_type_int) - new_constant(1));
    return insn_load_elem(stack, insn_load_elem(stack_size_ptr, new_constant(0), jit_type_int), jit_type_int);
}

jit_value peek_stack(jit_value& stack, jit_value& stack_size_ptr) {
    return insn_load_elem(stack, insn_load_elem(stack_size_ptr, new_constant(0), jit_type_int) - new_constant(1),
                          jit_type_int);
}

All values (constants) and variables that a JIT-compiled function uses must be declared as variables of type jit_value. Line by line, we do the following:

  1. Using jit_type_create_pointer, we create a “pointer to int” type. The first argument, jit_type_int, is the type being pointed to; the second, 1, is how much to increment the type’s use counter. Finally, we pass the type to new_value – creating a variable of the “pointer to int” type in the JIT function to hold the virtual machine’s memory array.

  2. We write the actual memory address into that variable.

  3. Repeat step 1 for the stack.

  4. Repeat step 2 for the stack.

  5. Repeat step 1 for the pointer to the stack size.

  6. Repeat step 2 for the pointer to the stack size.

Further, for convenience, we need functions that “push values onto the stack” and “pop values from the stack”. In quotation marks, because in reality they add instructions to the intermediate representation of our function that perform the corresponding operations.

The general idea is this: we read the value at the stack-size pointer and write back a value one greater or one smaller; then we read (or write) the value at the address of the desired stack cell.
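In plain C++ terms, the code these helpers generate is equivalent to something like this (a model for illustration, not actual libjit output):

```cpp
#include <cstdint>

// What the generated push/pop boil down to: every access goes
// through the raw pointers handed to the JIT-compiled function.
void generated_push(int32_t* stack, int32_t* stack_size_ptr, int32_t value) {
    stack[*stack_size_ptr] = value;        // store at the current top
    *stack_size_ptr = *stack_size_ptr + 1; // then bump the size
}

int32_t generated_pop(int32_t* stack, int32_t* stack_size_ptr) {
    *stack_size_ptr = *stack_size_ptr - 1; // shrink first...
    return stack[*stack_size_ptr];         // ...then read the old top
}
```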

Finally, inside the build method, we declare two helper lambdas to make the stack operations shorter.

Now the processing of the bytecode instruction is written quite simply. Here are some examples:

case OP_STORE: {
    auto val = pop();
    auto index = pop();
    insn_store_elem(memory, index, val);
    break;
}
case OP_LOAD: {
    auto index = pop();
    push(insn_load_elem(memory, index, jit_type_int));
    break;
}
case OP_PRINT: {
    auto tmp = pop().raw();
    this->insn_call_native("print_int", (void*)(&print_int), signature_helper(jit_type_void, jit_type_int, end_params),
        &tmp, 1, 0);
    break;
}

The third instruction shows how to call a native function – that is, one compiled to machine code without libjit’s involvement.
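The print_int helper itself is ordinary native code. Its definition is not shown here, so the following is only a plausible sketch:

```cpp
#include <cstdio>
#include <string>

// Hypothetical definition of the native helper called via insn_call_native;
// the article does not show the original, so this is an assumption.
std::string format_int(int value) {
    return std::to_string(value) + "\n";
}

void print_int(int value) {
    std::fputs(format_int(value).c_str(), stdout);
}
```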

Here you can see that the code that builds libjit’s internal representation is quite similar to the interpreter code of the virtual machine. It seems that, if desired, this code could be made completely identical (at least for most instructions) – it would just call different functions for stack operations and so on, depending on whether we are JIT-compiling or interpreting. That would avoid doing double work when adding new instructions.
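One way this sharing could be arranged is a walker templated on a backend; all names below are made up for the sketch, not taken from the original code:

```cpp
#include <cstdint>
#include <vector>

// Sketch of sharing one instruction walker between the interpreter and
// the JIT: the backend supplies push/pop/add, the walker stays identical.
template <typename Backend>
void emit_add(Backend& b) {
    auto rhs = b.pop();
    auto lhs = b.pop();
    b.push(b.add(lhs, rhs));
}

// An interpreter-style backend: values are plain ints.
struct InterpBackend {
    std::vector<int32_t> stack;
    void push(int32_t v) { stack.push_back(v); }
    int32_t pop() { int32_t v = stack.back(); stack.pop_back(); return v; }
    int32_t add(int32_t a, int32_t b) { return a + b; }
};
// A JIT backend would implement the same three methods by emitting
// libjit instructions and returning jit_value handles instead.
```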

The virtual machine (interpreter) now checks whether the next instruction is a jump. If it is, the interpreter executes it itself. Otherwise it invokes the JIT compiler, which compiles instructions until it encounters one it does not support. The compiled function is then stored in a cache and called.
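A minimal sketch of such a cache, with an assumed chunk representation (the real types and names are not shown in the article):

```cpp
#include <cstddef>
#include <functional>
#include <map>

// Compiled chunks cached by starting instruction index. A chunk runs
// one compiled piece of bytecode and returns the new instruction pointer.
// All names here are illustrative.
using Chunk = std::function<size_t()>;

struct Dispatcher {
    std::map<size_t, Chunk> cache;

    // Return the cached chunk for `ip`, compiling it on first use.
    Chunk& chunk_at(size_t ip, const std::function<Chunk(size_t)>& compile) {
        auto it = cache.find(ip);
        if (it == cache.end())
            it = cache.emplace(ip, compile(ip)).first;
        return it->second;
    }
};
```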

Naturally, for now this all works very slowly – the pieces we compile are quite small, and the interpreter has to step in very often. For now, this is just a demonstration of libjit’s capabilities.

Now we need to add support for jump instructions. In libjit, in addition to variables, you can create labels, which support two operations: specifying where in the code (that is, in libjit’s internal representation) the label sits, and declaring branches to it.

During JIT compilation we now maintain a dictionary mapping instruction addresses to labels. When we meet 0xcafe, we place the label corresponding to that position. When we meet a jump instruction, we emit a branch to the label obtained from the dictionary at the jump’s target address. Code:

case 0xcafe: {
    // skipping 0xbabe
    ip++;
    // ip is now pointing to the instruction after the label
    if (labels.count(ip) == 0) {
        labels[ip] = jit_label_undefined;
    }
    insn_label(labels[ip]);
    break;
}
case OP_JUMP: {
    auto arg = instructions[ip];
    ip++;
    if (labels.count(arg) == 0) {
        labels[arg] = jit_label_undefined;
    }
    insn_branch_if(new_constant(true), labels[arg]);
    break;
}
case OP_JUMP_IF_TRUE: {
    auto arg = instructions[ip];
    ip++;
    if (labels.count(arg) == 0) {
        labels[arg] = jit_label_undefined;
    }
    insn_branch_if(pop(), labels[arg]);
    break;
}
case OP_JUMP_IF_FALSE: {
    auto arg = instructions[ip];
    ip++;
    if (labels.count(arg) == 0) {
        labels[arg] = jit_label_undefined;
    }
    insn_branch_if_not(pop(), labels[arg]);
    break;
}

We start the virtual machine on some simple code, expecting that with JIT compilation it will run much faster – but it turns out to run even a little slower: 1.4 seconds versus the interpreter’s 1.2.

What’s the matter? The problem is that we’re generating pretty complex code, and the libjit optimizer is very weak – so it can’t really do anything about it.

Let’s try to slightly optimize the code we generate to see the speedup from JIT compilation.

Actually, the implementations of the bytecode instructions are relatively simple. It is the stack operations that look so convoluted: we do not simply increment or decrement a variable holding the stack size (one instruction) – we read it from some address and then write it back there, and libjit most likely cannot keep it in a register. Let us simplify all this by making the stack size a local variable:

jit_value stack_size = new_value(jit_type_int);
stack_size = new_constant(0);
auto push = [this, &stack, &stack_size](jit_value&& value) {
    insn_store_elem(stack, stack_size, value);
    stack_size = stack_size + new_constant(1);
};
auto pop = [this, &stack, &stack_size]() {
    stack_size = stack_size - new_constant(1);
    return insn_load_elem(stack, stack_size, jit_type_int);
};
auto peek = [this, &stack, &stack_size]() {
    return insn_load_elem(stack, stack_size - new_constant(1), jit_type_int);
};

This time the speedup is significant: the bytecode interpreter runs in 1.2 seconds, while the JIT-compiled code runs in 0.4 seconds.

Although, on the other hand, one would expect an even more significant speedup. My guess is that libjit’s weak optimizations, combined with the fact that our bytecode comes from a stack-based rather than a register-based virtual machine, do not allow the code to get really fast – after all, the bytecode instructions constantly use the stack, which is much slower than using registers. We can partly address this by replacing certain sequences of instructions with fewer, more complex instructions. This optimization was described in the article on optimizing bytecode interpreters that I referred to at the beginning of this article.
