Why WebAssembly is bad for Java

As the developer of TeaVM, a compiler from JVM bytecode to JavaScript and WebAssembly, I often recommend that users who for some reason want to generate WebAssembly start with JavaScript instead. To be honest, I haven't worked on the WebAssembly backend for a very long time: I don't implement missing features and I don't fix bugs. People ask me: why is that? Usually I just ignore such questions, because it's impossible to answer them in two sentences, and I don't have time to write more. When I do come across someone's attempt to explain why WebAssembly is a poor target for implementing the JVM (as well as the CLR, JavaScript, and other managed environments), it usually boils down to this: "Java (.NET, JavaScript, pick your favorite) is a managed language with garbage collection and exceptions, so you have to ship a giant runtime." In fact, the situation is somewhat more complicated: the runtime size is not that terrible, and it is not the main source of trouble.

I should also add the disclaimer that what follows is my opinion. I am not the most skilled compiler developer in the world, I have not worked on any famous managed runtimes, and all my compiler experience consists of the pet project mentioned above (which, however, is used in production by some companies) and about two years of work on Kotlin/JS.

How did I even end up here?

That is, how did I end up writing a compiler from JVM bytecode to WebAssembly at all? It all started about 10 years ago, when WebAssembly wasn't even planned. Back then I came across GWT, and I really liked that you could have a common code base, a common IDE, a common build system, and a single development culture for the frontend and the backend (frontend and backend in the classical sense, not the compiler sense). But one thing always seemed strange: why is Java source code taken as the input rather than JVM bytecode? At the time I had some knowledge of how the JVM and its bytecode work, and the latter seemed high-level enough to generate pretty decent-looking JavaScript from it. That would also make it possible to translate code written in any more or less static language like Java or Scala into JavaScript (Kotlin went mainstream a little later). Since I had been interested in writing compilers from an early age, it occurred to me that, for self-education, I could write a proof-of-concept analogue of GWT that takes bytecode as input.

A lot of time passed, my proof of concept became more and more suitable for real projects, and some people even began to use it cautiously. And then WebAssembly arrived. Since TeaVM clearly lacked popularity, I tried to ride the hype and maybe jump on the train ahead of others, so I started writing a WebAssembly backend (in the compiler sense). However, after a while I became disillusioned with the technology: the wasm files turned out to be huge, and performance was not much better than JavaScript. At the same time, the generated WebAssembly code was very difficult to debug and very difficult to interface with JavaScript. Finally, the pace of the specification's development seemed snail-like to me, and it seemed unlikely that I could get through to the committee and "push through" the features a managed runtime needs.

A little about the stack

But before I talk about the difficulties with WebAssembly, I need to explain how the stack works on a real CPU and how compilers use it to call functions. I will use x86 as the example, because it is the only architecture I know more or less well. I believe the situation on ARM is almost identical apart from small details.

So, the CPU itself has no concept of a "stack". The CPU has linear, contiguous memory from which it can read and write values. The stack is implemented using a special register (esp or rsp on x86) that stores the memory address of the top of the stack. Writing to the stack (the push operation) decrements esp and writes the value to the address stored in esp. Reading from the stack (pop) reads the value at the address stored in esp and increments esp. There are also instructions for calling a procedure (call) and returning from one (ret). call %target% pushes the address of the next instruction (the program counter) onto the stack and writes %target% to eip. ret pops a value from the stack into the eip register (which holds the address of the current instruction).

At the same time, nothing stops you from reading and writing the stack directly, bypassing esp and the operations described above. Here, for example, is how a function generated by a typical 32-bit x86 compiler starts:

push  ebp        ; save the caller's frame pointer
mov   ebp, esp   ; start this function's frame
sub   esp, N     ; reserve N bytes for locals

The last instruction reserves N bytes on the stack, where N is the total size in bytes of all the function's local variables. After that, whenever the function does something with a local variable, it accesses memory at an address relative to ebp. For example, the line

local1 = local2 + local3;

could turn into something like

mov eax, [ebp - 8]    ; load local2
add eax, [ebp - 12]   ; add local3
mov [ebp - 4], eax    ; store into local1

And here you should pay attention to the following: a function does not have "its own" stack frame. More precisely, it only exists as long as the function's author wants the program not to break. At the hardware level there are no guarantees. You can read the contents of the stack, and you can write to it; this is perfectly acceptable from the CPU's point of view. And the authors of managed runtimes actively exploit this fact.

What is WebAssembly

I believe some developers think of WebAssembly as an instruction set, like x86-64 or ARM, that is not implemented in real hardware but is executed by the browser. This is partly true; however, there are several fundamental differences:

  1. WebAssembly has no access to the stack. WebAssembly defines entities such as functions and local variables. A compiled function call will most likely use the hardware stack under the hood, but there is no way to access that stack, even for reading.

  2. In WebAssembly you cannot goto to an arbitrary location, not even within a function. You can only declare a block and break out of it.

  3. There is no access to code in WebAssembly. You can't just rewrite it on the fly, as you can on x86.

  4. WebAssembly cannot link against system dynamic libraries. WebAssembly implements nothing resembling POSIX.

Thus, WebAssembly is more like a kind of minimalistic C written down in binary form. Next, I will show how this gets in the way of executing Java efficiently.

Garbage collection

The simplest garbage collector is a fairly trivial thing. Its first phase, mark, boils down to a breadth-first or depth-first traversal of the object graph, starting from the so-called GC roots: the values of static fields and of local variables on the JVM stack. Sample pseudocode:

void mark() {
  for (var root : staticFields) {
    if (root.deref() != null) {
      queue.add(root.deref());
    }
  }
  for (var frame : stack) {
    for (var localVar : frame.type.localVars) {
      if (frame.data[localVar] != null) {
        queue.add(frame.data[localVar]);
      }
    }
  }
  while (queue.notEmpty()) {
    var o = queue.remove();
    if (o.marked) continue;
    o.marked = true;
    for (var field : o.type.fields) {
      var next = o.data[field];
      if (next != null && !next.marked) {
        queue.add(next);
      }
    }
  }
}

On x86, the second loop is an iteration starting from the current value of esp. On entry to mark, the return address is at the top of the stack. Knowing the return address, you can figure out which function called mark, and for that function the stack frame size is known. Adding this size to esp gives the address of the next frame on the stack, and so on: this is how you walk all the frames. Again, knowing the return addresses, you can consult a table that maps those addresses to the offsets at which reference-typed local variables live. Walking those offsets relative to the start of the stack frame lets you visit the variables of that frame (the loop over frame.type.localVars).
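As a cross-check, the mark phase can be written as runnable Java over a toy object graph. `Obj` and `Marker` are illustrative names, not TeaVM's actual runtime classes, and the `roots` list stands in for static fields plus stack-scanned locals:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// A toy heap object: 'fields' are its outgoing references.
class Obj {
    boolean marked;
    final List<Obj> fields = new ArrayList<>();
}

class Marker {
    // 'roots' stands in for static fields plus locals found by stack scanning.
    static void mark(List<Obj> roots) {
        ArrayDeque<Obj> queue = new ArrayDeque<>();
        for (Obj root : roots) {
            if (root != null) queue.add(root);
        }
        while (!queue.isEmpty()) {
            Obj o = queue.remove();
            if (o.marked) continue;   // already visited (handles cycles)
            o.marked = true;
            for (Obj next : o.fields) {
                if (next != null && !next.marked) queue.add(next);
            }
        }
    }
}
```

A breadth-first queue is used here, as in the pseudocode; a depth-first stack would work just as well.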

And here the first problem awaits us: there is no way to access the WebAssembly stack. The only way out is to emulate the stack in software: before EVERY method call, the values of local variables that point to objects have to be duplicated in this emulated stack. Such a structure is called a shadow stack. Roughly, it looks like this:

// method prologue
localRoots = shadowStack.alloc(FRAME_SIZE);

// ...

localRoots[0] = localVar1;
localRoots[1] = localVar2;
// etc.
foo();

// ...

shadowStack.release(FRAME_SIZE);
// method epilogue

Or here is an example from real generated wasm:

;; method prologue

(set_local 12                       ;; reserve room in the shadow stack to store
                                    ;; one 4-byte variable
  (call $meth_otbw_WasmRuntime_allocStack
    (i32.const 1)))
    
;; ...

;; a call (call site)

(i32.store offset=4 align=4         ;; put the value of local 7 into the shadow stack
  (get_local 12)
  (get_local 7))
(set_local 1                        ;; call the method (the useful work)
  (call $meth_jl_Character_forDigit 
    (get_local 1)
    (get_local 3)))
    
    
;; ...

;; method epilogue

(i32.store align=4                  ;; return the occupied 4 bytes to the shadow stack
  (i32.const 728)                   ;; 728 is the memory address holding
                                    ;; the current top-of-stack value
  (i32.sub
    (get_local 12)
    (i32.const 4)))

Clearly, this is, first, slow, and, second, a needless waste of memory. There are tricks that minimize the number of shadow stack writes, but it is expensive in any case.
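To illustrate what the compiler has to emulate, here is a minimal shadow-stack sketch in Java; the flat-array layout and the method names are my own invention, not TeaVM's:

```java
// Minimal shadow-stack sketch: a flat array of object references plus a
// top-of-stack index. Frame layout and names are illustrative only.
class ShadowStack {
    private final Object[] slots;
    private int top;

    ShadowStack(int capacity) { slots = new Object[capacity]; }

    // Called at method entry: reserve frameSize reference slots,
    // returning the base index of the new frame.
    int alloc(int frameSize) {
        int base = top;
        top += frameSize;
        return base;
    }

    // Duplicated before every call: make a reference local visible to the GC.
    void store(int base, int slot, Object value) { slots[base + slot] = value; }

    Object load(int base, int slot) { return slots[base + slot]; }

    // Called at method exit: pop the frame and clear stale references
    // so the GC does not retain dead objects.
    void release(int frameSize) {
        for (int i = top - frameSize; i < top; i++) slots[i] = null;
        top -= frameSize;
    }

    int depth() { return top; }
}
```

A GC scanning for roots would simply iterate `slots[0..top)`, which is exactly the second loop of the mark pseudocode above.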

I suppose the specification authors' motivation was security. Indeed, the ability to simply change the contents of the stack or the stack register is one of the main sources of security holes in software. On the other hand, implementing a GC does not require changing the stack register or writing to arbitrary places in the stack at all; it is enough to read data from there. After all, WebAssembly can already read its own memory, and nothing bad happens. The same goes for optimizations: with a well-written specification the overhead could be minimized, and it would in any case be much smaller than what the shadow stack costs.

Why is there no such problem in JavaScript? Because JavaScript already has a garbage collector. It’s easy enough to turn Java objects into JavaScript objects.

Throwing Exceptions

Here is how exceptions are thrown in the JVM: walk up the stack and look for a return address that lies inside a catch with a matching exception type. Then modify the stack register (esp/rsp on x86) and goto the address of the exception handler. Here is some example pseudocode:

exceptionType = exception.getClass();
for (var frame : stack) {
  callSite = callSites.lookup(frame.returnAddress);
  for (var handler : callSite.exceptionHandlers) {
    if (handler.exceptionType.isAssignableFrom(exceptionType)) {
      stack.top = frame;
      goto handler.address;  // an interprocedural goto, which Java itself lacks
    }
  }
}

That is, a catch that does not catch an exception has no overhead; throw does all the work.

However, WebAssembly does not let you walk the stack, perform a goto, or find out the address of an instruction. Therefore, the only thing left to do is to check some return code at the exit of every method. In addition, before calling a method, you must write the call site's identifier to the shadow stack. In fact, TeaVM stores the "return code" in the shadow stack, writing it over the call site identifier. Here is the pseudocode of a method with try/catch taken into account:

localRoots[0] = localVar1;
localRoots[1] = localVar2;
// etc.
localRoots.callSiteId = 1;
foo();
// etc.
switch (localRoots.callSiteId) {
  case 1: goto nextInstruction;
  case 2: releaseStack(); return;
  case 3: goto firstExceptionHandler;
  case 4: goto secondExceptionHandler;
  // etc.
}
nextInstruction:
// continue after the call
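The scheme can be simulated in plain Java: instead of throwing, the callee flags its caller's frame, and the caller dispatches on the flag after every call. All names here are illustrative:

```java
// Sketch of return-code exception propagation (illustrative names).
// Instead of throwing, a callee overwrites its caller's call-site id;
// the caller checks the id after every call.
class CallFrame {
    int callSiteId;
}

class ReturnCodeDemo {
    static final int NORMAL = 1;
    static final int THREW = 2;

    static int parseOrThrow(CallFrame frame, String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isDigit(s.charAt(i))) {
                frame.callSiteId = THREW;  // the "throw": flag the caller's frame
                return 0;                  // dummy value, caller must not use it
            }
        }
        return Integer.parseInt(s);
    }

    // Returns -1 when the simulated exception reached the "catch" block.
    static int caller(String s) {
        CallFrame frame = new CallFrame();
        frame.callSiteId = NORMAL;
        int value = parseOrThrow(frame, s);
        switch (frame.callSiteId) {          // dispatch emitted after every call
            case NORMAL: return value;       // goto nextInstruction
            default:     return -1;          // goto the exception handler
        }
    }
}
```

The point of the example is the cost model: the switch runs after every single call, whether or not anything was thrown, which is the opposite of the zero-cost scheme native JVMs use.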

Here is an example from a real wasm file:

(block $block_4
  (i32.store offset=4 align=4  ;; save a local variable into the shadow stack
    (get_local 13)
    (get_local 2))
  (i32.store align=4           ;; store the call site id (47) there as well
    (get_local 13)
    (i32.const 47))
  (set_local 3                 ;; call the method (the useful work)
    (call $meth_jl_Integer_parseInt_0
      (get_local 2)))
  (block $block_5
    (block $block_3
      (block $block_2
        (block $block_1        
          (br_table $block_1 $block_2 $block_3
                               ;; read the return code
            (i32.sub
              (i32.load align=4
                (get_local 13))
              (i32.const 47))))
        (br $block_4))
      (br $block_5))
    (br $block_0))             ;; if the code is 48, return from the method
  ;; if the return code is 49, we land here, i.e. in the NumberFormatException handler
  
  ;; handle the exception here
  (br $block_6)                ;; leave the method
)
;; if the return code is 47, we land here, i.e. execution continues after the catch

Why is this not a problem in JavaScript? Because JavaScript supports try/catch out of the box, and it’s easy enough to turn try/catch in Java into try/catch in JavaScript.

null checks

In snippets like the following, the Java virtual machine may have to throw a NullPointerException.

doSomethingWith(o.someField);
doSomethingWith(o.someMethod());

Of course, inserting an explicit null check everywhere is expensive. Therefore, any self-respecting JVM implementation calls something like mprotect on the zero page during initialization (on Windows the similar function VirtualAlloc does this). What is mprotect? Modern CPUs let you assign access rights (read, write, execute) to different areas of memory, which are usually divided into pages of a fixed size. If the program violates these rights, the processor raises an exception, which is handled by the operating system kernel. Program code can ask the kernel to install a handler for such processor exceptions using the signal function.

This has the following effect: reading the field someField from an object is a memory read at the address o.objectAddress + OClass.someFieldOffset. Unless the object is gigantic, OClass.someFieldOffset is much smaller than the memory page size. So if o.objectAddress == 0 (and that is exactly how null is implemented), the instruction that reads someField will read something from the zero page and trigger an exception. This exception lands in the handler that the JVM installed during initialization. The handler constructs a NullPointerException and throws it just as Java code would.

That is, reading a field on x86 takes a single instruction:

mov eax, [ebx + 12]

However, WebAssembly has no POSIX and no concept of signals or memory protection. Therefore, the only thing left to do is to perform explicit null checks every single time. In pseudocode it looks like this:

if (obj == null) {
  localRoots.callSiteId = 1;
  throw NullPointerException();
}
doSomethingWith(obj.someField);

and in Wasm like this:

(set_local 4                  ;; read field q of this
  (i32.load offset=8 align=4
    (get_local 0)))
(if                           ;; if it is null
  (i32.eq
    (get_local 4)
    (i32.const 0))
  (then
    (i32.store align=4        ;; store the call site id in the shadow stack
      (get_local 20)
      (i32.const 471))
    (call $teavm_throwNullPointerException)
                              ;; throw the NPE
    (br $block_0)))
;; otherwise do something with local 4

In JavaScript this problem is partially present. On the one hand, by default JS throws an unhelpful TypeError, which can be caused by a bunch of other reasons. On the other hand, at least it throws something; the virtual machine does not fall apart and the heap is not corrupted. In the TeaVM JavaScript backend you can choose whether to generate strict code or not. Strict code gets the same explicit checks inserted in JS, while non-strict code simply ignores such situations.

Class initialization

In the JVM, classes are initialized lazily. A number of actions cause a class to be initialized: reading or writing a static field, calling a static method, invoking a constructor, or explicit initialization via reflection. When a class is initialized, a hidden method, void <clinit>(), is called.

In HotSpot this is not a problem. The first time a method is called it will most likely be interpreted, i.e. by the time it is compiled to native code we can be sure the class is initialized and therefore add no checks at all. With AOT compilation the situation is more complicated. On the one hand, x86 and ARM do not separate code memory from data memory, so you can rewrite code however you like. On the other hand, there is memory protection, and the operating system may forbid rewriting code after loading. Still, if you manage to get the OS to grant write access to the code, you can overwrite the call to <clinit> with nop instructions once it has run. WebAssembly, however, definitely does not support this under any circumstances. So if we have, say, a static field read:

doSomethingWith(SomeClass.staticField);

then it should be transformed into the following pseudo-code:

if (!SomeClass.<initialized>) {
  SomeClass.<clinit>();
}
doSomethingWith(SomeClass.staticField);

And remember that for the needs of the GC and of exception handling, the method call has to be surrounded by extra instructions? Yes, the situation here is exactly the same, as the generated wasm shows:

(if                              ;; if the class is not initialized
  (i32.eq
    (i32.and
      (i32.load align=4
        (i32.const 6260))        ;; 6260 is the address of the hidden static
                                 ;; field BigInteger.<flags>;
                                 ;; bit zero of <flags> is the initialization flag
      (i32.const 1))
    (i32.const 0))
    (i32.store offset=4 align=4  ;; next, clear the shadow stack:
      (get_local 11)             ;; the previous call left some values in these
      (i32.const 0))             ;; shadow stack slots, but they are no
    (i32.store offset=8 align=4  ;; longer live
      (get_local 11)             ;; and we don't want to keep references to them
      (i32.const 0))
    (i32.store offset=12 align=4
      (get_local 11)
      (i32.const 0))
    (i32.store align=4           ;; store this call site's id, 577
      (get_local 11)
      (i32.const 577))
    (call $initclass_jm_BigInteger)
    (if                          ;; if the call site id changed, an exception was thrown
      (i32.ne
        (i32.load align=4
          (get_local 11))
        (i32.const 577))
      (then
        (br $block_1)))))        ;; in that case, leave the method immediately
(set_local 1                     ;; finally, read the field
  (i32.load align=4
    (i32.const 6208)))

Fortunately, for simple initializers you can prove that the program's semantics will not change if the initializer is invoked at the very start of the program. Such cases turn out to be the vast majority, which partially solves the problem. You can also analyze the method's control flow, establish that the class is already initialized before a given field read or write, and remove the initialization check.
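The guard that the compiler has to emit around every static access can be sketched in Java; the counter exists only to show that the initializer runs once, and all names are illustrative:

```java
// Sketch of the lazy-initialization guard the compiler must emit.
class SomeClass {
    static boolean initialized;
    static int staticField;
    static int clinitRuns;  // instrumentation for this example only

    static void clinit() {
        clinitRuns++;
        staticField = 42;
    }

    // Every static-field access is routed through a guard like this.
    static int readStaticField() {
        if (!initialized) {
            initialized = true;  // set first, as real <clinit> may recurse
            clinit();
        }
        return staticField;
    }
}
```

Unlike the JavaScript trick described below, in WebAssembly this branch stays in the compiled code forever; there is no JIT to notice the guard has become dead and strip it.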

In JavaScript, you can resort to the following trick:

function SomeClass_clinit() {
  // do initialization
}
function SomeClass_callClinit() {
  SomeClass_callClinit = function() { };
  SomeClass_clinit();
}

and a call to SomeClass_callClinit is inserted at each site that reads the field. That would seem to be an overhead as well. But in fact JS is JIT-executed these days. The first time a method runs, it is interpreted; it calls SomeClass_callClinit, which rewrites itself into a no-op. So by the time the JIT decides to compile the method, it sees a call to an empty SomeClass_callClinit and simply removes it.

Specification updates that didn’t help

Since the MVP, useful additions have appeared that could solve some of the problems described here: GC and exception handling. However, on closer inspection they either solve the problems poorly or give rise to new ones.

So, first about GC. Let's start with the fact that de facto it does not exist: six years have passed since I first heard about it, and it still has not made it into the specification and is implemented only in Google Chrome behind an experimental flag. In any case, I have a number of complaints about the specification.

First, the memory overhead. The essence of the specification is that it introduces the concept of a "struct"; structs can be created and their fields read and written. Structs that become unreachable are reclaimed. That is exactly what we need. But for the GC to reclaim structs, their type must be known at runtime, i.e. N bytes at the very beginning of the struct (I suspect 4 or 8) are set aside to store the type. On the other hand, Java supports virtual calls. Here is how languages usually support virtual calls: a class has an invisible field at the beginning that points to a function table. For a virtual call, the compiler generates a sequence of instructions that fetches the function table reference and then calls the function at a known index in that table. For example, this code

class A {
    void foo(int param) { /* do something */ }
    long bar() { /* do something */ return 0; }
}
class B extends A {
    void foo(int param) { /* do something */ }
    long bar() { /* return something */ return 0; }
}

void test(A a) {
  a.foo(23);
  a.bar();
}

can be rewritten in C like this:

typedef struct A A;
typedef struct {
    void (*foo)(A*, int);
    long (*bar)(A*);
} vtable;

struct A {
    vtable* __vtable;
};

void A_foo(A* self, int param) { /* do something */ }
long A_bar(A* self) { /* do something */ return 0; }
vtable A_vtable = {
    &A_foo,
    &A_bar
};

void B_foo(A* self, int param) { /* do something */ }
long B_bar(A* self) { /* return something */ return 0; }
vtable B_vtable = {
    &B_foo,
    &B_bar
};

void test(A* a) {
  a->__vtable->foo(a, 23);
  a->__vtable->bar(a);
}

And creating class instances looks like this:

A* a = malloc(sizeof(A));
a->__vtable = &A_vtable;

A* b = malloc(sizeof(A));  /* B adds no fields of its own, so sizeof(A) is enough */
b->__vtable = &B_vtable;

In fact, this "function table" is exactly what describes the type. So in Java a single hidden field is enough for both the GC and virtual calls to work. That is, a garbage-collector-aware vtable struct for Java could be modified like this:

typedef struct {
    int fieldCount;
    short** fieldOffsets;
    void (*foo)(A*, int);
    long (*bar)(A*);
} vtable;

With the WebAssembly GC specification this will not work. Only the WebAssembly implementation knows the layout of fields in a struct, and in effect it produces something like this:

typedef struct {
    int fieldCount;
    short** fieldOffsets;
} tuple_layout;
typedef struct {
    tuple_layout __layout;
    A data;
} A_tuple;

It turns out that the struct ends up with two reference fields, __layout and __vtable, resulting in wasted memory.

Second, null checks. Struct field accesses throw what the WebAssembly specification calls a trap. A trap is not an exception; it cannot be caught or handled in any way. This is already an improvement: without a null check the virtual machine no longer falls apart in unpredictable ways, it simply crashes with no chance of recovery. However, this is still worse than JavaScript without null checks, where you can at least somehow catch the NPE. And, of course, worse than the ideal option: throwing honest NPEs with no overhead.

The second significant update is exception handling. Browser support is much better here. However, the specification itself is strange. First, there is no inheritance between exception types. This is not the biggest problem: in a catch you can enumerate the identifiers of all subtypes of the declared exception type, although of course that is overhead too. Another nuisance: in the throw instruction, the exception type is a constant. Apparently no one considered any situation other than creating an exception and immediately throwing it. I can't think of anything better than a huge switch that takes an i32 value (the exception type) and, in each branch, throws an exception of the corresponding fixed type.
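In Java terms, that "huge switch" could look roughly like this; the type ids and the set of exception classes are made up for the example:

```java
// Sketch of the workaround: since the exception "tag" in a wasm throw is a
// constant, rethrowing a type chosen at runtime needs a switch over type ids.
// Type ids and the class list here are illustrative.
class ThrowByTypeId {
    static final int NPE = 0, AIOOBE = 1, CCE = 2;

    static RuntimeException make(int typeId, String message) {
        switch (typeId) {
            case NPE:    return new NullPointerException(message);
            case AIOOBE: return new ArrayIndexOutOfBoundsException(message);
            case CCE:    return new ClassCastException(message);
            default:     return new IllegalStateException("unknown type " + typeId);
        }
    }
}
```

In real generated code, each branch would end in a throw instruction with its own constant tag; the switch grows with the number of exception classes in the program.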

How would I fix the problem

First, I would implement stack walking. Such a specification would introduce an additional call instruction (or an extension of the existing one) in which one could specify:

  1. The id of this instruction.

  2. The numbers of the local variables that can be observed via stack walking.

  3. For each variable in the previous list, whether it can be modified via stack walking.

  4. A list of labels.

Next, a set of instructions to:

  1. Create a stack walker.

  2. Advance to the next frame, if possible.

  3. Advance to the next local variable in the frame, if possible.

  4. Read the currently selected variable.

  5. Write to the currently selected variable.

  6. Destroy the stack walker.

  7. Destroy the stack walker, end the current function, unwind the stack to the selected frame, and immediately jump to the label with the specified index.
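Modeled in Java, the proposed walker might have roughly this shape. Nothing like it exists in WebAssembly today, and every name here is hypothetical; the toy implementation over a list of frames only demonstrates the intended iteration protocol:

```java
import java.util.List;

// Hypothetical shape of the proposed stack-walking API, modeled in Java.
interface ProposedStackWalker {
    boolean nextFrame();       // advance to the next frame, if possible
    boolean nextLocal();       // advance to the next observable local in the frame
    Object read();             // read the currently selected variable
    void write(Object value);  // write it, if the call site allowed modification
    void destroy();
}

// Toy implementation: each frame is an array of observable locals.
class ListStackWalker implements ProposedStackWalker {
    private final List<Object[]> frames;
    private int frame = -1, local = -1;

    ListStackWalker(List<Object[]> frames) { this.frames = frames; }

    public boolean nextFrame() {
        if (frame + 1 >= frames.size()) return false;
        frame++;
        local = -1;
        return true;
    }

    public boolean nextLocal() {
        if (frame < 0 || local + 1 >= frames.get(frame).length) return false;
        local++;
        return true;
    }

    public Object read() { return frames.get(frame)[local]; }

    public void write(Object value) { frames.get(frame)[local] = value; }

    public void destroy() { /* nothing to release in the toy version */ }
}
```

A GC would drive it the way the mark pseudocode drives its frame loop: advance frames, advance locals, read (and, for a moving collector, overwrite) each reference.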

Second, I would immediately start working on a trap-handling specification (POSIX signals could serve as inspiration).

Third, I would speed up consideration of the memory control proposal.

Bonus. Some numbers

For this article I went to the trouble of making some measurements. So, let's debunk the myth that a GC implementation means a huge number of bytes in the binary. By my count, the total size of the code responsible for the runtime (GC + exception handling) at the moment (2023-08-19) is 11,340 bytes. For the calculation, I summed the sizes of the wasm functions corresponding to the methods of the classes org.teavm.runtime.Allocator, org.teavm.runtime.GC, and org.teavm.runtime.ExceptionHandling (yes, the GC is also written in Java, in a special subset of it flavored with intrinsics). I measured the following examples:

  • benchmark (simulates a scene using the jbox2d physics library):

    • total size: 771,331

    • code section size: 547,065

    • data section size: 194,177

    • demo

  • pi (computes PI to the specified number of digits):

    • total size: 144,000

    • code section size: 74,996

    • data section size: 55,509

    • demo

If you actually opened these examples, you might have noticed that the WebAssembly versions run many times faster than the JavaScript ones. Yes, this is true for synthetic tests. In real applications the overall performance gain is not significant enough to justify sacrifices like a large binary size (again, see the demos) and the difficulty of debugging and of interacting with JavaScript.
