The globally optimized, eighth and fastest kind of bytecode interpreter
Doing the impossible and kicking common sense to the curb — that is what life in Team Gurren is all about! (C) Kamina
This article enters into a technical debate with a 2015 article by Atakua, whose approach I am going to challenge. Atakua explores seven kinds of bytecode interpreters, but treats them unfairly: the fastest one turns out to be binary translation, which is, strictly speaking, no longer a bytecode interpreter but a form of Ahead-Of-Time compiler. That binary translator turns bytecode into machine code consisting of a chain of calls to precompiled service procedures, the same procedures that in a bytecode interpreter are responsible for executing each opcode.
But Atakua did not squeeze all the possible speed out of bytecode interpreters. So this article is a tutorial: how to write a bytecode interpreter that can outperform JIT/AOT compilation in speed. Interesting? Read on!
A benchmark is attached. There will be a bit of hardcore and not a single image generated by a neural network!
To start with, let's look at the histogram. I use Atakua's benchmark and have added my results to it. The rightmost and smallest blue column (native) is the native code that Atakua gives as an estimate of the optimization limit. It is not entirely fair as a reference point, because it contains no checks of the step count, the stack boundaries, the counters of called procedures, or other attributes of a virtual machine: it is simply a demonstration of the lower bound on time for the chosen prime-finding algorithm.
My results are marked in red. Yes, I was able to beat binary translation. Threefold.
Why is this important and who needs it?
Because a bytecode interpreter is one of the best ways to implement your own runtime environment, and that lets you solve many problems elegantly. For example, you can use it:
to add a scripting language to your product to automate user actions,
to build a variant of regular expressions that analyzes not text but sequences of events coming from mobile application telemetry, from millions of devices and in real time,
to port abandonware to modern architectures (emulators),
to create safe sandboxes for executing and heuristically analyzing untrusted code,
to develop a plugin system for your new browser,
to write a minimum of architecture-dependent code in a cross-platform application,
to make Ethereum smart contracts run on the Solana blockchain, which was never intended for them,
to write a code instrumentation utility, a program monitor, or an emulating debugger for a freshly baked System-on-Chip,
to free humanity from the headache of hunting for drivers,
to write your own Forth for your own pleasure.
Almost a silver bullet, judging by the breadth of applications!
All of this requires an understanding of how fast a bytecode interpreter can be, and a knowledge of the techniques that make it fast, especially if you are not ready to bring in heavy artillery like JIT compilation and want to get by with minimal losses.
This article will help you understand how to write what is probably the fastest bytecode interpreter.
Let's restore a little context. Here is a picture from Atakua's article that explains everything (and if it doesn't, it is worth rereading his article).
His tailrecursive variant is a good one: it shows the best speed among his interpreters. Why? Because at the end of each service procedure there is what Forth programmers call NEXT: inlined code for three operations, ADVANCE_PC, FETCH_DECODE, and finally DISPATCH. The last one, DISPATCH, performs the tail call.
Why is this method the fastest? The processor's branch predictor associates a history with each branch, and history improves predictions. There is logic in the sequence of bytecode commands: for example, a comparison procedure is usually followed by a conditional-branch procedure. Thus each tail DISPATCH carries assumptions about where it will need to jump, and these assumptions are available even before the jump starts executing. The processor therefore begins executing the next handler ahead of time, and if the guess turns out wrong, the speculative results are simply discarded.
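To make that pattern concrete, here is a minimal C sketch of tail-call dispatch in the spirit of Atakua's tailrecursive interpreter. The names, the two opcodes, and the tiny VM state are made up for illustration; the real service routines and dispatch table live in his repository, and the tail calls only turn into jumps when the compiler applies sibling-call optimization (e.g. at -O2).

#include <stdint.h>
#include <stdio.h>

/* Illustrative VM state and two opcodes; not Atakua's exact code. */
typedef struct { uint32_t pc; const uint32_t *prog; int64_t acc; } vm_t;

static void dispatch(vm_t *vm);             /* forward declaration */

static void op_inc(vm_t *vm) {              /* opcode 0 */
    vm->acc++;
    vm->pc += 1;                            /* ADVANCE_PC */
    dispatch(vm);                           /* NEXT: becomes a tail jump under -O2 */
}

static void op_halt(vm_t *vm) {             /* opcode 1 */
    printf("acc = %lld\n", (long long)vm->acc);
}

static void (*const table[])(vm_t *) = { op_inc, op_halt };

static void dispatch(vm_t *vm) {
    table[vm->prog[vm->pc]](vm);            /* FETCH + DISPATCH through the opcode table */
}

int main(void) {
    const uint32_t prog[] = { 0, 0, 0, 1 }; /* inc, inc, inc, halt */
    vm_t vm = { .pc = 0, .prog = prog, .acc = 0 };
    dispatch(&vm);
    return 0;
}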
Atakua did an excellent job of leading us to the implementation of a tailrecursive virtual machine, and to get there he, of course, had to walk through all the earlier variants (they are interesting too). Now we can move forward.
Let's try to optimize the whole program globally: our task is well suited to this, because the program itself, the bytecode interpreter, is compact.
To begin with, we can turn all of the interpreter's state into global variables. This saves on passing function parameters, on prologues and epilogues, and on saving and restoring registers. Tail calls do not need return addresses, so call/ret are unnecessary, and all calls inside service procedures are already inlined.
This is where we can beat the compiler, whose optimizations are local. Global analysis of an arbitrary program would require the compiler to consider too many execution paths. But, unlike the compiler, we know for certain that in tailrecursive all execution paths follow one pattern: service procedures jump into one another until they reach Halt or Break. Here we are smarter, or rather better informed, than the compiler, which is forced to act within the constraints of the most general scenario.
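As a rough C illustration of the first step (global state instead of a pointer passed around), before everything moves into registers and assembler: the names below are hypothetical and the bounds checks are omitted, this is only a sketch of the idea.

#include <stdint.h>

/* Hypothetical global interpreter state: no cpu_t* parameter, no prologue
   spills just to get at the stack and the program counter. */
static uint32_t g_stack[32];
static int32_t  g_sp = -1;
static uint32_t g_pc;
static uint64_t g_steps;

static void sr_Swap_global(void) {          /* bounds checks omitted in this sketch */
    uint32_t tmp = g_stack[g_sp];
    g_stack[g_sp] = g_stack[g_sp - 1];
    g_stack[g_sp - 1] = tmp;
    g_pc += 1;                              /* ADVANCE_PC without dereferencing a struct */
    g_steps++;
}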
Let's look at the data structures that manage the state of the virtual processor and the virtual machine as a whole:
typedef uint32_t Instr_t;

typedef enum {
    Cpu_Running = 0,
    Cpu_Halted,
    Cpu_Break
} cpu_state_t;

/* Simulated processor state */
typedef struct {
    int32_t sp;          /* Stack Pointer */
    uint64_t steps;      /* Statistics - total number of instructions */
    uint32_t stack[STACK_CAPACITY]; /* Data Stack */
    uint32_t pc;         /* Program Counter */
    const Instr_t *pmem; /* Program Memory */
    cpu_state_t state;
} cpu_t;

/* A struct to store information about a decoded instruction */
typedef struct {
    Instr_t opcode;
    int length;          /* size of instruction (1 or 2 for our bytecode) */
    int32_t immediate;   /* argument of opcode if exists */
} decode_t;
Oh, Atakua wrote a very minimalistic virtual machine! What if we put all of this into registers? Then our virtual machine will not have to touch memory at all (apart from the stack) while it runs:
#define sp %rsp
#define steps %r8
#define pc %r9
#define prog_mem %rsi
#define state %r15
#define opcode64 %rdx
#define opcode32 %edx
#define immed64 %r14
#define immed32 %r14d
In Atakua's original virtual machine the stack is 32-bit and holds up to 32 values. We have to live with that: doing it differently would make the comparative benchmark meaningless. But implementing such a stack head-on means dealing with an array addressed through a combination of the stack's base address and the offset of a stack cell. That is less efficient than using the host machine's 64-bit stack. So, for the sake of optimization, we change how the stack is handled:
use 64-bit stack elements instead of 32-bit ones, keeping the upper bits zero,
use the %RSP register as the stack pointer (offsets become multiples of the stack element size).
Thus we remove the code that manually recalculates the stack pointer and instead use processor instructions that push or pop a stack element and update %RSP at the same time. This simplifies addressing and gains speed in the hottest code of a stack machine: the stack operations themselves.
All parts of the interpreter work together: improving performance in one place opens up opportunities for improvement in another. At the same time, by optimizing the interpreter we change nothing in the virtual machine itself. To avoid confusion, here is an explanation from ruv:
You should distinguish between the concepts of “machine” (or “virtual machine”) and “execution environment” (the “running environment”) for this machine.
The term “stack machine” (or, “virtual stack machine”) implies that the machine's instructions take parameters from the stack and push the results onto the stack. A stack is part of a virtual machine. Different stack machines have different “executable” code formats.
The code is “executable” in quotes because the question is who or what executes it. If it is executed directly by a real CPU, then it is truly executable code. Otherwise this code is executed by another program. In Forth terminology that program is called an “address interpreter”; for bytecode it would be called a “bytecode interpreter”.
The interpreter is part of the runtime environment. Service procedures that implement machine instructions are also part of the execution environment. The implementation details of different parts of the runtime environment are tied to each other – that is, if you change the implementation of the address interpreter, you may need to change the service procedures.
Different execution environments can be made for the same virtual machine (and code format).
The code format (threaded code, subroutine code, bytecode, etc.) characterizes the virtual machine, not the execution environment.
The implementation details of the address interpreter (if there is one) characterize the execution environment, not the virtual machine.
If the address interpreter uses a stack, then, formally, it is a different stack, not the stack of a stack machine. In some cases, the address interpreter (or bytecode interpreter) itself may be a state machine and not use a stack at all.
In our case the interpreter itself does not use the stack, since it is built on tail calls and does not need the stack to call the next service procedure. Only one service procedure (Print) uses the stack in the classic way, to call printf(); the rest use the stack to manipulate the data that the virtual machine's bytecode works with.
There is one more important thing about the stack: its boundaries. Since they are checked on every stack operation, we have all the more reason to keep them in registers. The boundary values are computed when the virtual machine starts, before bytecode execution begins.
/* Easy to remember if you read the "b" in the register name as "border" */
#define stack_max %rbp
#define stack_min %rbx
What else (frequently used) can be put into registers to make the interpreter faster? Two things remain: the first is the limit on the number of steps the interpreter may take, and the second is the base address of the array of pointers to procedures, each of which services its own virtual machine opcode.
#define steplimit %rcx
#define routines %rdi
Great! We have placed all the variables in registers and still have registers to spare. Two of them can be used for frequently used constants:
# 1 = Cpu_Halted
#define one %r11
# 2 = Cpu_Break
#define two %r12
And there are still two registers left that can be used to cache the top two elements of the stack. This technique is used in implementations of Forth machines and helps speed up the frequently executed SWAP and OVER. I will show it in detail below.
#define top %rax
#define subtop %r10
Note the choice of %RAX as the register that caches the top of the stack. Some machine instructions, such as DIV, use %RAX as an implicit operand. And if the operand is already the top of the stack, it will not have to be loaded, which saves us one assembler instruction later, in the implementation of the MOD service procedure.
So, we have occupied every register except one. Let's call it the accumulator and use it when needed:
#define acc %r13
And on the third day God created the Remington bolt-action rifle, so that man could shoot dinosaurs and application programmers... Amen! (c)
“But wait!” the compiler people will say. “Can you really allocate all the registers by hand without leaving a single one for the compiler? Even Atakua in his binary translation pinned only one variable to the %r15 register!”
The compiler's hint that binds a global variable to a register is just that, a hint, and the compiler may ignore it. But grabbing all the registers is outright trolling. So let's spare the compiler's feelings and break out the assembler. Which assembler? We will, of course, use an assembler intended to serve as a GCC backend rather than to be written in by humans. An assembler with its operand order turned inside out, so explosive that this is even reflected in its name: GAS.
So, each Atakua service procedure ends with the following sequence:
ADVANCE_PC();
*pdecoded = fetch_decode(pcpu);
DISPATCH();
...and this code is repeated almost everywhere, which makes it an excellent candidate for optimization. What is going on inside it?
#define DISPATCH() service_routines[pdecoded->opcode](pcpu, pdecoded);

#define ADVANCE_PC() do { \
    pcpu->pc += pdecoded->length; \
    pcpu->steps++; \
    if (pcpu->state != Cpu_Running \
        || pcpu->steps >= steplimit) \
        return; \
} while(0);

static inline decode_t fetch_decode(cpu_t *pcpu) {
    return decode(fetch_checked(pcpu), pcpu);
}
Decode places the current instruction into the opcode variable and determines its length. If an instruction is followed by an immediate operand, that operand is placed into the immediate variable. fetch_checked checks whether the program counter has run past the program bytecode:
static inline Instr_t fetch_checked(cpu_t *pcpu) {
    if (!(pcpu->pc < PROGRAM_SIZE)) {
        printf("PC out of bounds\n");
        pcpu->state = Cpu_Break;
        return Instr_Break;
    }
    return fetch(pcpu);
}
Perhaps I had better not show you what the compiler turns this code into (children may be reading us!): even at high optimization levels you cannot look at it without tears. Many people say that compilers are now much better at optimization than humans. But I suspect that is because, while the average compiler was getting smarter, the human it was competing with was doing who knows what. Needless to say, these days some virtual machine developers even allow themselves to have a family!
So we will follow the path Atakua paved: assembler macros will serve us instead of inline for the purpose of code inlining. For quick visual recognition I will write them in capital letters.
.macro FETCH_DECODE
FETCH_CHECKED
DECODE
.endm
These two, FETCH_CHECKED and DECODE, always go together.
.macro FETCH_CHECKED
.if MAX_PROGRAM_SIZE_CHECK
...
.endif
FETCH
.endm
The check against exceeding the 512 program cells can be disabled (via a compile-time variable) so that you can measure how much it slows down execution (almost not at all). If it fires, the bytecode interpreter prints a message and exits, as in the other error-handling cases.
Now let's move on to something more important: FETCH and DECODE. Their job is to obtain the opcode and its immediate operand, if the opcode takes one. But spending a whole conditional branch on figuring out whether an opcode needs an immediate operand is wasteful. It is better to always fetch it, and if the opcode does not need it, that is not our problem. So everything boils down to two lines:
.macro FETCH
mov (prog_mem, pc, 4), opcode32 # prog_mem[pc]
.endm
.macro DECODE
mov 4(prog_mem, pc, 4), immed32 # prog_mem[pc+1]
.endm
Do you remember that in GAS the source operand goes on the left and the destination on the right? Okay, I was just asking, just in case.
An experienced assembly programmer might notice that we could get rid of the prog_mem base address by adding it to pc at program start. I fell into this trap at first, too. The result is that the program gets slightly slower. The reason is that the service procedures Jump and Je, which are responsible for branches within the bytecode, then need to multiply the immediate operand by 4 (the virtual machine's word size in bytes). Since a branch immediate can be negative (for backward jumps), the optimal way to do this is an arithmetic shift, SAR. But even then it is one extra instruction in a frequently executed procedure, and it costs time. On my machine it makes, on average, the difference between 3.02 and 2.94 seconds for the whole run. That sacrifice is worth making if you need to free up the prog_mem register, but there is no such need: we still have enough registers.
Another discarded idea: instead of reading two 32-bit values, read one 64-bit value and use shifts and moves to extract the two halves. But this costs more time than it saves; perhaps it would do better on machines with slower memory access.
Finally, we move on to DISPATCH – the last instruction of each service procedure:
.macro DISPATCH
jmp *(routines, opcode64, 8)
.endm
We jump to an address stored in an array of pointers. The address of the array is in routines, the element index is in opcode64, and each element is 8 bytes. Essentially this means taking the value at routines + opcode64*8 and jumping to that address. Perhaps this detailed explanation will be useful to those who are not familiar with GAS assembler.
An interesting fact about the life of opcode64: it is initialized in FETCH and used in DISPATCH. Until the next FETCH, any service procedure may use it as a temporary register, taking care that its upper half is zero again before the next FETCH. Much the same can be said about immed64, especially in procedures that do not use an immediate value. So we effectively have three free registers, plenty of room to spread out. Just don't try to explain this register-usage strategy to the compiler…
Oh yes, we almost forgot about ADVANCE_PC:
.macro ADVANCE_PC cnt:req
.if \cnt == 1
inc pc
.else
lea \cnt(pc), pc
.endif
.if (STEPLIMIT_CHECK || STEPCNT)
# Old-timers believe that separating the increment from the check
# lets the processor get everything done faster
inc steps
.endif
.if STATE_RUNNING_CHECK
test state, state # Cpu_Running(0) != state
jne handle_state_not_running
.endif
.if STEPLIMIT_CHECK
cmp steps, steplimit # steps >= steplimit
jl handle_steplimit_reached
.endif
.endm
Since each service procedure knows whether it has an immediate operand, I parameterized this macro so that it advances the Program Counter by the amount passed in. In the virtual machine under study there are no opcodes that shift the Program Counter by more than 2, so INC or LEA covers every case optimally.
Speeding up a feedback loop can lead to threshold effects. Slowing down a feedback loop can lead to missed opportunities.
So, our virtual machine is stack-based: service procedures take values from the stack and push their results onto the stack. For convenience there is even a notation for stack diagrams, one that looks a bit like runic script:
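The picture is not reproduced here, but the notation itself is the usual Forth-style stack-effect comment: inside the parentheses, the stack before the operation, then "--", then the stack after it, with the top of the stack on the right. A few examples:

SWAP ( a b -- b a )
OVER ( a b -- a b a )
DUP  ( a -- a a )
DROP ( a -- )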
A typical Atakua service procedure looks like this:
void sr_Swap(cpu_t *pcpu, decode_t *pdecoded) {
    uint32_t tmp1 = pop(pcpu);
    uint32_t tmp2 = pop(pcpu);
    BAIL_ON_ERROR();
    push(pcpu, tmp1);
    push(pcpu, tmp2);
    ADVANCE_PC();
    *pdecoded = fetch_decode(pcpu);
    DISPATCH();
}
Therefore, the first things we need are the auxiliary routines push() and pop(): they are inlined into almost every service procedure. Their distinctive feature is that they check the stack boundaries:
static inline void push(cpu_t *pcpu, uint32_t v) {
    assert(pcpu);
    if (pcpu->sp >= STACK_CAPACITY-1) {
        printf("Stack overflow\n");
        pcpu->state = Cpu_Break;
        return;
    }
    pcpu->stack[++pcpu->sp] = v;
}

static inline uint32_t pop(cpu_t *pcpu) {
    assert(pcpu);
    if (pcpu->sp < 0) {
        printf("Stack underflow\n");
        pcpu->state = Cpu_Break;
        return 0;
    }
    return pcpu->stack[pcpu->sp--];
}
Therefore we must do the same:
.macro PUSH_IMM reg
.if STACK_CHECK
cmp sp, stack_min
jae handle_overflow
.endif
push \reg
.endm
.macro POP_IMM reg
.if STACK_CHECK
cmp sp, stack_max
jb handle_underflow
.endif
pop \reg
.endm
An experienced systems engineer will immediately notice that some of these checks can be avoided: indeed, if a procedure takes two words from the stack and then puts two words back, only one check is needed! Fortunately, we will not have to write a complicated macro that computes such an aggregate check, because we have a classic optimization from Forth implementations: caching the top of the stack in registers.
To clarify this, a picture is required:
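The picture is not reproduced in this text version, so here is a rough textual sketch of the same idea, using the register names defined above:

logical VM stack, top first :   b        a        x   y   z ...
where each value lives      :  top     subtop     host stack, addressed via %rsp
                              (%rax)   (%r10)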
I measured the performance without caching, with the top value of the stack cached, and with the top two values cached, and settled on the last option (it showed the best results).
Look, the SWAP procedure does not touch the stack at all:
RTN Swap
xchg top, subtop
ADVANCE_PC 1
FETCH_DECODE
DISPATCH
(RTN is a very simple macro that builds the header of the next procedure and adds an increment of its call counter, so that you can estimate which procedures are called more often: a small convenience that can be disabled with a compile-time variable):
.macro RTN name
.global srv_\name
.type srv_\name, @function
srv_\name:
.if DBGCNT
incq cnt_\name(%rip)
.endif
.endm
You're probably wondering how many times each service procedure is called when executing our algorithm? Here's the data:
Counters :
cnt_Print : 9592
cnt_Je : 910487889
cnt_Mod : 455189149
cnt_Sub : 455298740
cnt_Over : 1820985370
cnt_Swap : 910387890
cnt_Dup : 0
cnt_Drop : 99998
cnt_Push : 100000
cnt_Nop : 0
cnt_Halt : 1
cnt_Break : 0
cnt_Inc : 455198741
cnt_Jump : 455198741
The last two lines hint directly that Inc and Jump can be automatically combined into a single superinstruction: they follow one another in the bytecode. And there are plenty of such places; the sequence “OVER, OVER, SWAP”, for example, is practically a lab assignment in peephole optimization. I hope I have piqued someone's interest and that soon we will be reading a third article on optimizing virtual machines, with even more impressive results.
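As a teaser for that future article, here is a rough C sketch of what such a peephole pass over the bytecode could look like before execution starts. The opcode numbers and the Instr_OverOverSwap superinstruction are hypothetical; nothing like this exists in the current code.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical opcode numbers, for illustration only. */
enum { Instr_Nop = 0, Instr_Over = 7, Instr_Swap = 8, Instr_OverOverSwap = 64 };

/* One peephole pass: the 1-cell sequence OVER, OVER, SWAP is rewritten into a
   single superinstruction plus two Nops, so that all other instruction
   addresses (and therefore jump targets) stay valid. A real pass would also
   have to verify that no jump lands in the middle of the fused pattern. */
static void fuse_over_over_swap(uint32_t *prog, size_t len) {
    for (size_t i = 0; i + 2 < len; i++) {
        if (prog[i] == Instr_Over && prog[i + 1] == Instr_Over &&
            prog[i + 2] == Instr_Swap) {
            prog[i]     = Instr_OverOverSwap;
            prog[i + 1] = Instr_Nop;
            prog[i + 2] = Instr_Nop;
        }
    }
}

The fused handler would then perform the combined stack effect ( a b -- a b b a ) in one go and advance the program counter past the padding Nops, paying for one dispatch instead of three.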
Of course, sometimes the stack tricks come at a price. Simple procedures like DROP have to pull values through the cache along a chain (which is why more than two stack elements are usually not cached):
RTN Drop
movq subtop, top # subtop -> top
POP_IMM subtop # from stack -> subtop
ADVANCE_PC 1
FETCH_DECODE
DISPATCH
But on the other hand, we force more complex procedures to touch the stack only once, or at most twice. Take a look at OVER for example:
RTN Over
xchg top, subtop
PUSH_IMM top
ADVANCE_PC 1
FETCH_DECODE
DISPATCH
Here is a crude alternative, without using caching of the top elements of the stack (5 stack accesses):
RTN Over
POP_IMM immed64
POP_IMM acc
PUSH_IMM acc
PUSH_IMM immed64
PUSH_IMM acc
ADVANCE_PC 1
FETCH_DECODE
DISPATCH
Yes, of course, it can be done more elegantly with indirect addressing, but even then it is slower than the cached version. My best variant of this kind was the following:
RTN Over
movq 8(sp), acc
PUSH_IMM acc
ADVANCE_PC 1
FETCH_DECODE
DISPATCH
In the same way (almost on autopilot) all the other procedures needed to run the original Primes algorithm are implemented. I did not implement anything beyond what was necessary:
Print
Je
Sub
Dup
Push
Nop
Halt
Break
Inc
Jump
One procedure worth considering is MOD:
RTN Mod
# Since we chose RAX for top, there is no need to do
# "mov top, %rax" to prepare for the division
test subtop, subtop
je handle_divide_zero
xor %rdx, %rdx # rdx = opcode64
div subtop # rdx:rax / operand -> rax, rdx
movq %rdx, top
POP_IMM subtop
ADVANCE_PC 1
FETCH_DECODE
DISPATCH
In it we see that from the point of view of working with the stack, it is as simple as DROP.
A single misleading benchmark can achieve in a minute what years of good engineering work cannot. (c) Dilbert.
Here are my results of profiling the program in gprof.
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 31.23      0.86     0.86                             srv_Swap
 24.37      1.54     0.68                             srv_Over
 19.86      2.09     0.55                             srv_Mod
 16.25      2.54     0.45                             srv_Je
  3.61      2.64     0.10                             srv_Sub
  2.53      2.71     0.07                             srv_Jump
  1.08      2.77     0.03                             srv_Inc
And here are the results of the time measurements using Atakua's original benchmark. Compared to the picture in his article, you can see that computers have become faster since 2015, though, of course, not as much as we would like. So people who understand how to optimize for speed will always have something to do.
So, is an optimized bytecode interpreter of a virtual machine capable of outperforming binary translation? Or, as many novice compiler writers believe, is that impossible? Is JIT (or AOT) our last hope for performance? I think I have managed to answer these questions.
Let's see what the community of compiled virtual machine enthusiasts has to say in response. If such a community exists, then in about 7–9 years I hope to read another article…
The article is written and I have had a lot of fun; now it is time to get back to work. Thank you for your attention!
All the source code can be found in my fork of the Atakua repository. There are interesting things there that did not fit into this article.