How the Argentum language makes fast dynamic_cast and dispatch of interface methods with four processor instructions

Why look for methods at runtime at all

Polymorphism, one of the three pillars of object-oriented programming, requires that objects of different classes respond differently to the same requests. For example, calling a method to_string at the object Animal and object Engine will lead to radically different results. In general, having a reference to an object, we do not know the exact code that will respond to the call to_string at the object by reference.

It turns out that the application code and the runtime library of the language must find the correct entry point to the correct method corresponding to this class and method.

A necessary digression about devirtualization. Naturally, modern compilers try to avoid this expensive method lookup operation if possible. If the type of an object is known at the compilation stage, the compiler can remove the dispatch and call the desired method of the desired class directly. This opens up the possibility of inlining the body of this method at the point of call and further numerous optimizations. Unfortunately, there are situations, and they are quite numerous, when devirtualization at compile time is not possible and one has to use runtime dispatch.

How virtual methods are called

To call a virtual method, the compiler builds tables of pointers to methods and uses different algorithms for calculating indices in these tables. The case of single inheritance and a tree class hierarchy is the fastest and most straightforward.

Let’s look at an example in pseudocode a little like C++):

struct Base {
    virtual void a_method() { puts("hello from Base::a_method"); }
    // virtual ~Base(){}  // эта строка не нужна для псевдокода
};
struct Derived : Base {
    void a_method() override {
        puts("hello from Derived::a_method");
    }
    virtual void additional_method() { puts("hello from additional_method"); }
};
 
// Somewhere at the caller site
void some_function(Base& b) {
   b.a_method();                          // {1}
}

In line {1} the compiler does magic: Depending on the real type of the object that the variable refers to b can be called or base Base::a_method or derived method Derived::a_method. This is achieved using method pointer tables and processor instruction pairs. For example, for an x86-64 processor in the Windows ABI, this code looks like this (sorry for the Intel syntax):

mov rcx, b
mov rax, [rcx + offset_to_vmt_ptr]  ; {2}  offset_to_vmt обычно 0
call [rax + offset_to_a_method]     ; {3}

This code works because somewhere inside the referenced object b there is an invisible field, which is usually called vmt_ptr. This is a pointer to a static data structure that contains pointers to the virtual methods of this class.

In line {2} we get a pointer to the virtual method table and the line {3} we load the address of the entry point into the method and call it.

For everything to work, we also need two tables (one for each class) with pointers to methods:

Base_VMT   dq Base_a_method
Drived_VMT dq Derived_a_method, Derived_added_method
In the diagram: if the variable b refers to Base, the mov and call operations will refer to red pointers, if to Derived - to blue.

In the diagram: if the variable b refers to Base,
mov and call operations will refer to red pointers,
if on Derived – in blue.

The whole scheme above requires that somewhere, when creating any instance of the class, someone put in the field vmt_ptr the address of one of the two tables. This is usually done in the constructor.

This calling method is easy to understand, easy to implement, consumes very little memory, provides free casts to base classes, and has negligible overhead when called. It is used in Java when inheriting classes, in C++ when we don’t have multiple inheritance, and generally everywhere where it can be applied.

Complex inheritance

Unfortunately, in real applications, each class falls into several orthogonal roles (serializable, drawable, physical, collidable, evaluable), sometimes roles form groups with other roles (SceneItem: drawable, collidable). And all these roles, classes and groups do not add up to a single tree-like class inheritance hierarchy: Not every graphic element is serializable, but there are some. Not every element that supports collisions works with physics, but some do. Therefore, all modern programming languages ​​have somehow allowed different kinds of multiple inheritance in their class hierarchies.

In Java, Swift, C#, you can inherit implementation from one class and implement multiple interfaces. C++ allows multiple inheritance, although this introduces additional complexity when the same base class can be inherited through different branches, so virtual base classes appeared in the language. Rust implements multiple inheritance as an implementation of traits. Go doesn’t formally use the term inheritance and replaces it with interface delegation and state composition, but if that quacks like inheritance… In general, today we can say that all modern programming languages ​​have deviated from the principle of single inheritance and tree-like class organization.

How a method call is implemented today with complex inheritance

Different languages ​​went different ways:

In Swift and Rust, a protocol/trait implementation reference is a structure consisting of two raw pointers – one pointing to a data structure and the other pointing to a virtual method table (witness table in Swift, vtable in Rust). By doubling the size of each reference, Rust and Swift allow interface methods to be called at the same speed as regular class virtual methods.

Go also stores each interface reference as two pointers, but builds the method tables dynamically on first use and adds the results to a global hash map.

In Java and Kotlin, the first time a method is called, a linear search is performed in the list of implemented interfaces, the search result is cached. If a small number (1-2) of different classes are encountered at any point in the call, the JIT compiler builds special optimized dispatcher code, but if a new class appears, it falls back to linear search.

Perhaps the most convoluted approach is used in C++: each class instance contains many pointers to virtual method tables. Each cast to a base class or derived class, if they cannot be reduced to a tree, causes the pointer to be moved through memory so that the pointer to an object cast to type T points to the table of virtual methods of that type T in that object. This allows virtual methods to be called at the same speed in both single and multiple inheritance.

Each approach is a trade-off between memory footprint, method call complexity, and pointer complexity.

The Java/Kotlin approach allows you to optimize the passing of benchmarks by caching the last methods called and perform “runtime devirtualization” where possible. For the general case of true dynamic dispatch of highly polymorphic interface methods, it all comes down to a linear search in lists of interface names. This is justified if the class implements one or two interfaces, but in general it is very expensive.

Go, Rust, and Swift call methods quickly, but at the cost of doubling the size of each pointer, which can quickly exhaust the register file for parameter passing in method calls and when dealing with local/temporary references. And this, in turn, will cause the registers to be spilled onto the stack. In addition, this complicates the casting of references between types (traits/protocols/interfaces), which in Swift is made dynamic dispatch code inherited from Objective C (with a search in dictionaries of protocol identifiers), while in Rust there is no such casting at all and is implemented by manually written methods ` as_trait_NNN`. Swift has a mechanism to suppress virtualization via template function instantiation for each protocol implementation (keywords “some” vs. “any”) this mechanism does not work for true polymorphic containers. In Rust, the virtualization suppression mechanism is always on and can be turned off with the “dyn” keyword.

C++ does not consume additional memory in each raw pointer, and directly calling a method with multiple inheritance is as fast as with single inheritance, but the complexity has not gone away: this approach leads to a significant complication of the object structure, complication of the code of methods – thunks are needed, to complication constructor code and all type casting operations. In the C++ paradigm, these operations are not so frequent, their complexity can be neglected, but if you try to adopt this approach in systems with introspection, or automatic memory management, then each operation with an object requires access to the object header, which stores marker words, counters and flags will require static_cast which, firstly, is not free, and secondly, is incompatible with virtual inheritance. And such an operation will be required every time a reference to an object is copied and deleted, or every time an object is scanned in the case of GC. That is why in C++ smart pointers store a separate pointer to counters and markers, consuming memory just like in Rust/Swift. By the way, the safe dynamic_cast in C++ requires a search in RTTI data, that is, it is comparable in complexity to that in Swift/Java/Go.

To sum up the section, there are problems with method dispatching in multiple inheritance, and the existing solutions leave much to be desired.

Argentum approach

Argentum program example with classes and interfaces (syntax similar to Java)

//
// Объявляем какой-то интерфейс
//
interface Drawable {
    width() int;
    height() int;
    paint(c Canvas);
}

//
// Реализуем этот интерфейс в каком-нибудь классе
//
class Rectangle {
    + Drawable{
        width() int { right - left }
        height() int { bottom - top }
        paint(c Canvas) {
            c.setFill(color);
            c.drawRect(left, top, right, bottom);
        }
    }
    + Serializable { ... }  // Реализуем еще...
    + Focusable { ... }    // ...несколько интерфейсов
    left = 0;       // Поля
    right = 0;
    top = 0;
    bottom = 0;
}

// Создаем экземпляр.
r = Rectangle;

// Вызываем методы интерфейса
w = Window.initCentered("Hello", r.width(), r.height());
r.paint(w);

At the call point r.paint(w); the compiler will build the code:

; rdx - pointer to `w`
; rcx - pointer to `r`
mov rax, Drawable_interface_id + Drawable_paint_method_index
call [rсx]

For each class, the first field is a pointer to the dispatch function. Our Rectangle will have this function:

Rectangle_dispatch:
	movzx r10, ax
	pext rax, rax, Rectangle_magic_number
	mov rax, Rectangle_interface_table[rax*8]
	jmp [rax + r10*8]

Not all modern processors can pextso for the time being we replace with bextr or:

    and rax, Rectangle_magic_mask
    shr rax, Rectangle_magic_offset

In addition to the dispatch function, Rectangle will have a set of tables:

Rectangle_interface_table:
    dq Rectangle_Drawable_method_table
    dq empty
    dq Rectangle_Serializable_method_table
    dq Rectangle_Focusable_method_table

Rectangle_Drawable_method_table:
    dq Rectangle_Drawable_width   ; method entry points
    dq Rectangle_Drawable_height
    dq Rectangle_Drawable_paint
Rectangle_Serializable_method_table dq ...
Rectangle_Focusable_method_table    dq ...

How and why it works

At the compilation stage, each interface receives a randomly selected unique 48-bit identifier (it is stored in 48 bits of a 64-bit machine word with zeros in the lower 16 bits for the method index).

When calling an interface method, the caller invokes the correct class manager, passing the interface identifier and the 16-bit index of the method in the interface as a parameter.

The dispatcher must distinguish between these identifiers the interfaces implemented in this class. It does not matter how many classes and how many interfaces exist in the application. There may be hundreds or thousands of them. It is important to distinguish only the interfaces of a given class, which are few or, in the worst case, dozens. Moreover, thanks to strong typing, we have a guarantee that only the correct interface identifiers will come (pro dynamic_castwhich it provides us in runtime – below).

If there is only one interface, the dispatcher skips interface selection and immediately transfers control by method index.

MyClass1_dispatcher:
	movzx rax, ax
    jmp MyClass1_InterfaceX_methods[rax*8]

If a class implements two interfaces, their identifiers guaranteed will differ by one bit in one of the 48 bits of their identifiers. The task of the compiler is to find this bit and build a dispatcher that checks this bit:

MyClass2_dispatcher:
	movzx r10, ax              ; забираем индекс метода в r10
	shr rax, BIT_POSITION  ; можно заметить одной ...
	and RAX, 1             ; ...инструкцией pext/bextr
    ; загружаем указатель на одну из двух таблиц методов
	mov rax, MyClass2_itable[rax*8]
	jmp [rax + r10*8]      ; переходим к началу метода
MyClass2_itable:
	dq MyClass2_InterfaceX_mtable
	dq MyClass2_InterfaceY_mtable
MyClass2_InterfaceX_mtable:
	dq MyClass2_InterfaceX_methodA
	dq MyClass2_InterfaceX_methodB
MyClass2_InterfaceY_mtable:
	dq MyClass2_InterfaceY_methodA

If the class implements three interfaces, we need a two-bit selector. When choosing three random 48-bit numbers, on average there will be 17.6 unique selectors from two adjacent bits. So the above approach will very likely work. More interfaces will require a larger selector size.

Example:

Let’s say we have a class that implements five different interfaces. The identifiers of these interfaces have a unique sequence of three bits at offset 4.

Interface

ID (hexadecimal)

ID binary (first N bits)

IA

f78745bed893

0100010110111110110110001 001 0011

IB

9b5aed351b36

1110110100110101000110110 011 0110

IC

08d460f812a6

0110000011111000000100101 010 0110

ID

6d0a3a225df6

0011101000100010010111011 111 0110

IE

54d4c7d9bd0f

1100011111011001101111010 000 1111

The dispatcher of such a class would look like:

class_N_disp:
   movzx r10, ax
   shr rax, 4
   and rax, 7
   mov rax, class_N_itable[rax*8]
   jmp [rax + r10*8]
class_N_itable dq clN_IE, clN_IA, clN_IC, clN_IB, empty, empty, empty, clN_ID
clN_IE dq classN_interfaceIE_method1...
clN_IA...

Naturally, situations may arise when it is impossible to select a selector that has a unique value for each interface in randomly encountered interface identifiers. What are the success probabilities for the basic dispatcher algorithm?

Number of interfaces in a class

Selector width in bits

Unused interface table slots

Average number of unique selectors in 48-bit interface IDs

3

2

1

17.6250000000

4

2

0

4.4062500000

5

3

3

9.4335937500

6

3

2

3.5375976563

7

3

1

0.8843994141

8

3

0

0.1105499268

9

4

7

2.7184523642

10

4

6

1.1893229093

eleven

4

5

0.4459960910

12

4

4

0.1393737784

Starting with seven interfaces in one class, the probability of finding a continuous selector group of bits drops significantly. We can fix this:

  • Using wider tables (+1 bit)

  • Or allowing selectors Not be continuous

  • Or by introducing new levels of tables.

Wide tables

An example of a class with eight interfaces:

Interface

ID Hexadecimal

ID binary (lower bits)

IA

36d9b3d6c5ad

101100111101011011000101101 0110 1

IB

6a26145ca3bf

000101000101110010100011101 1111 1

IC

c4552089b037

001000001000100110110000001 1011 1

ID

917286d627e4

100001101101011000100111111 0010 0

IE

889a043c83da

000001000011110010000011110 1101 0

IF

6b30d1399472

110100010011100110010100011 1001 0

IG

5939e20bb90b

111000100000101110111001000 0101 1

IH

850d80997bcf

100000001001100101111011110 0111 1

There is no 3-bit unique selector among interface IDs, but there is a 4-bit one at position 1.

By increasing the size of the table by an average of 15 machine words, we get significantly better probabilities of finding suitable selectors up to the case when the class implements 13 interfaces.

Breaks in selectors

Often, 48-bit interface identifiers contain the necessary selectors, but they are not in contiguous bits. The ideal option would be to use the instruction pextwhich can pull arbitrary bits from the register by mask, but such an instruction is not present in all processors, and in some places it is executed for indecent 300 ticks. Therefore, let’s try to get by with a cheaper and more common option: N adjacent bits + one separate bit.

Such a sequence can be distinguished by adding just one operation add:

Expression

A binary value in which:
ABC show the right bits,
X – junk bits with arbitrary value

interface identifier id

xABxxxxCxx

mask

0110000100

id & mask

0AB0000C00

adder

0000111100

(id & mask) + adder

0ABCxxxx00

((id & mask) + adder) >> offset

0000000ABC

The code that does this bit allocation is:

movzx r10, ax
and rax, 184h
add rax, 1ch  ; extra `add` instruction
shr rax, 5
mov rax, class_N_itable[rax*8]
jmp [rax + r10*8]

Simultaneous use of an additional instruction add And +1bitallows you to confidently build dispatchers for classes with 20+ interfaces, which exceeds all practical dispatching scenarios.

Application instructions pext will allow you to further increase the probabilities and reduce the size of tables without going beyond the limit of four instructions.

In general, the task of finding the perfect hash function that can be calculated with minimal resources can have many solutions, but extracting a bitmask is the simplest of them.

How this approach speeds up dynamic_cast in the Argentum language.

// Кстати о синтаксисе
// Dynamic_cast в Аргентуме имеет вид:
// выражение ~ Тип и порождает optional(Тип),
// например:
someRandomOject ~ MyPreciousType ? _.callItsMetod();
// Читается как: прокастить `someRandomOject` к типу `MyPreciousType`,
// и если получилось, вызвать у _него_ метод `callItsMetod`.

Each method table in Argentum has a utility method at index 0 that compares the identifier of the interface used for dispatch with the actual interface implemented by that table and returns either this object or null.

If we need to check if an object has an X interface, we call the method with index 0 and the 48-bit identifier of this X interface on that object.

If the interface is implemented in this class, then after passing through the selection of the selector and the reference to the interface table, we will get to the method with index 0, where the identifier X matches the constant encoded in this method. And so the method with index 0 will return this.

If the interface X is not implemented, then after passing through the selection of the selector, we will get into the only interface table and the only check method in which it could be at all. And therefore the only comparison of the identifier X and the identifier of the real interface, which this method table is dedicated to, will establish the failure of dynamic_cast.

By the way, it is precisely because of dynamic cast that the unused entries of interface tables are filled not with zeros, but with links to a special method table with a single element that always returns nullptr.

Thus dynamic_cast to an interface in Argentum always takes 3 machine instructions and is executed in 10 machine instructions:

  • 3 instructions to call method 0 of the specified interface, passing a parameter (can be shortened to 2)

  • 4 dispatcher instructions

  • 3 instructions метода0: comparison, cmov, ret. (may be reduced to 2 if returned zf)

Comparison with existing languages

Any reference in Argentum is just a pointer to the beginning of an object. One machine word.

  • Compared to Swift/Rust/Go, Argentum has no overhead associated with double-width pointers, no register file overflows and spills.
    For example, in the x86/64/Win ABI, only 4 registers are used to pass parameters – two Swift references will occupy them all, and the third parameter of the function will go through memory.

  • Compared to static_cast C++ in Argentum there is no overhead for moving this-pointer over an object (with checks for nullptr).

Each object has only one dispatch-related field, and that is a pointer to the dispatcher.

  • Compared to C++, Argentum has no overhead for multiple pointers to different VMTs and offsets of virtual bases within object data, and no overhead for creating such objects.

Compared to a simple virtual method call with single inheritance, we have four additional instructions.

  • This is orders of magnitude cheaper than dispatching in Java.

  • This is close to C++, where with multiple inheritance we often find ourselves having to adjust this by the amount of offset stored in the VMT. Such a correction is performed in C++ by automatically generated thunk code, which is comparable in complexity to our dispatcher’s four instructions.

  • Rust, Go, and Swift win these four instructions in the call operation, but lose two instructions in each transfer, save, and load operation of the reference due to its double size. And these operations are performed more often than a call.

Argentum supports dynamic_cast to interfaces, which takes three machine instructions in the program code and is completed in 10 instructions.

  • This is many orders of magnitude cheaper than in Swift, Java, Go and dynamic_cast in C++.

  • Rust has no such instruction.

By the way, this dispatch method is also suitable for the case of dynamic reloading of modules that bring new classes and interfaces to the AppDomain:

  • When a new interface is added, it will receive the following randomly generated unique 48-bit identifier. There is no need to rebuild existing dispatchers and tables.

  • The same can be said about classes. Adding them to an application will only require the generation of their own controllers and tables, and will not affect existing ones.

Unlike many other features of Argentum due to the architecture of the language (no memory leaks, no GC, no mutable state, races, deadlocks, etc.), the method of dispatching interface methods described here can be borrowed and applied in other languages.

Links:

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *