A little trick for easy interaction between Rust and C++

At work, I am rewriting a messy C++ codebase in Rust.

Because the codebase relies heavily on callbacks (sigh), Rust sometimes calls C++ and C++ sometimes calls Rust. This works because both languages can expose functions through a C API that the other language can call.

That's all well and good for functions, but what about C++ methods? Here is a little trick that lets you rewrite one C++ method at a time without any headaches. And by the way, it works regardless of the language you are rewriting the project in; it doesn't have to be Rust!

The trick

  1. Make the C++ class a standard layout class. This notion is defined by the C++ standard. In simple terms, it makes the C++ class look like a regular C struct, with some caveats: for example, the class can still use inheritance and a few other features. Most importantly, virtual methods are not allowed. I don't mind this restriction because I never use virtual methods; they are my least favorite feature in any programming language.

  2. Create a Rust struct with exactly the same layout as the C++ class.

  3. Create a Rust function with the C calling convention whose first argument is a pointer (or reference) to that Rust struct. Now you can access every member of the C++ class!

Note: Depending on the C++ code you are working with, the first step may be trivial or even impossible. It depends on the number of virtual methods used and other factors.

In my case, there were a few virtual methods, and all of them could be made non-virtual without trouble.

Sounds too abstract? Let's look at an example!

Example

Here is our C++ class User. It stores the name, UUID and number of comments. The user can write comments (just a string), which we display on the screen:

// Path: user.cpp

#include <cstdint>
#include <cstdio>
#include <cstdlib> // arc4random_buf
#include <cstring>
#include <string>

class User {
  std::string name;
  uint64_t comments_count;
  uint8_t uuid[16];

public:
  User(std::string name_) : name{name_}, comments_count{0} {
    arc4random_buf(uuid, sizeof(uuid));
  }

  void write_comment(const char *comment, size_t comment_len) {
    printf("%s (", name.c_str());
    for (size_t i = 0; i < sizeof(uuid); i += 1) {
      printf("%x", uuid[i]);
    }
    printf(") says: %.*s\n", (int)comment_len, comment);
    comments_count += 1;
  }

  uint64_t get_comment_count() { return comments_count; }
};

int main() {
  User alice{"alice"};
  const char msg[] = "hello, world!";
  alice.write_comment(msg, sizeof(msg) - 1);

  printf("Comment count: %lu\n", alice.get_comment_count());

  // This prints:
  // alice (fe61252cf5b88432a7e8c8674d58d615) says: hello, world!
  // Comment count: 1
}

First, let's make sure the class really is a standard layout class. We add this check to the constructor (it can go anywhere, but the constructor is a good place):

// Path: user.cpp

    static_assert(std::is_standard_layout_v<User>);

Aaaand… the project builds successfully!

Now we move on to the second step: let's define an equivalent struct on the Rust side.

Let's create a new library project in Rust:

$ cargo new --lib user-rs-lib

Let's place our Rust structure in src/lib.rs.

We need to be careful about the alignment and order of the fields. To do this, we mark the struct as repr(C) so that the Rust compiler uses the same layout as C would:

// Path: ./user-rs-lib/src/lib.rs

#[repr(C)]
pub struct UserC {
    pub name: [u8; 32],
    pub comments_count: u64,
    pub uuid: [u8; 16],
}

Note that the fields of the Rust struct can be named differently from the C++ ones if you wish.

It is also important to note that std::string is represented here as an opaque array of 32 bytes. This is because on my machine, with my standard library, sizeof(std::string) is 32. The standard does not guarantee this, so this approach makes the code not very portable. We will look at possible ways around this limitation at the end. My point was to show that using standard library types does not prevent a class from being a standard layout class, but it does create some difficulties.
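The layout assumption can also be made explicit on the Rust side. The following is only a sketch: the expected size of 56 bytes and the offsets 0, 32 and 40 follow from the 32-byte assumption above, and the helper name user_c_offsets is made up for this illustration.

```rust
use std::mem::size_of;

#[repr(C)]
pub struct UserC {
    pub name: [u8; 32], // opaque stand-in for std::string (assumed to be 32 bytes)
    pub comments_count: u64,
    pub uuid: [u8; 16],
}

// Compile-time check: the build fails if the struct ever changes size.
const _: () = assert!(size_of::<UserC>() == 56);

/// Hypothetical helper: byte offsets of (name, comments_count, uuid) in UserC.
pub fn user_c_offsets() -> (usize, usize, usize) {
    let user = UserC { name: [0; 32], comments_count: 0, uuid: [0; 16] };
    let base = &user as *const UserC as usize;
    (
        user.name.as_ptr() as usize - base,
        &user.comments_count as *const u64 as usize - base,
        user.uuid.as_ptr() as usize - base,
    )
}

fn main() {
    println!("size = {}", size_of::<UserC>());
    println!("offsets = {:?}", user_c_offsets());
}
```

If the C++ side later reports a different sizeof(std::string), the array length and the expected size here have to change together.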

Let's forget about this obstacle for now.

Now we can write a stub for the Rust function that will be the equivalent of the C++ method:

// Path: ./user-rs-lib/src/lib.rs

#[no_mangle]
pub extern "C" fn RUST_write_comment(user: &mut UserC, comment: *const u8, comment_len: usize) {
    todo!()
}

Now let's use cbindgen to generate the C header that corresponds to this Rust code:

$ cargo install cbindgen
$ cbindgen -v src/lib.rs --lang=c++ -o ../user-rs-lib.h

And we get the following C header:

// Path: user-rs-lib.h

#include <cstdarg>
#include <cstdint>
#include <cstdlib>
#include <ostream>
#include <new>

struct UserC {
  uint8_t name[32];
  uint64_t comments_count;
  uint8_t uuid[16];
};

extern "C" {

void RUST_write_comment(UserC *user, const uint8_t *comment, uintptr_t comment_len);

} // extern "C"

Now let's go back to C++, include this C header, and add some checks to make sure the layouts actually match. Again, we put these checks in the constructor:

#include "user-rs-lib.h"

class User {
 // [..]

  User(std::string name_) : name{name_}, comments_count{0} {
    arc4random_buf(uuid, sizeof(uuid));

    static_assert(std::is_standard_layout_v<User>);
    static_assert(sizeof(std::string) == 32);
    static_assert(sizeof(User) == sizeof(UserC));
    static_assert(offsetof(User, name) == offsetof(UserC, name));
    static_assert(offsetof(User, comments_count) ==
                  offsetof(UserC, comments_count));
    static_assert(offsetof(User, uuid) == offsetof(UserC, uuid));
  }

  // [..]
};

This ensures that the memory layout of the C++ class and the Rust struct match. We could generate all these checks with a macro or code generator, but for the purposes of this article, we can do it manually.
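As a rough idea of what such a code generator could look like, here is a minimal sketch in Rust that emits the static_assert lines as text. The function name layout_checks is hypothetical, and the sketch assumes the fields have the same names on both sides.

```rust
/// Hypothetical generator: builds the C++ `static_assert` lines that compare
/// a C++ class with its Rust mirror. Assumes identical field names on both sides.
pub fn layout_checks(cpp_class: &str, rust_struct: &str, fields: &[&str]) -> String {
    let mut out = String::new();
    out.push_str(&format!(
        "static_assert(std::is_standard_layout_v<{}>);\n",
        cpp_class
    ));
    out.push_str(&format!(
        "static_assert(sizeof({}) == sizeof({}));\n",
        cpp_class, rust_struct
    ));
    for field in fields {
        out.push_str(&format!(
            "static_assert(offsetof({c}, {f}) == offsetof({r}, {f}));\n",
            c = cpp_class,
            r = rust_struct,
            f = field
        ));
    }
    out
}

fn main() {
    // Prints one check per line, ready to paste into the constructor.
    print!(
        "{}",
        layout_checks("User", "UserC", &["name", "comments_count", "uuid"])
    );
}
```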

Now let's rewrite the C++ method in Rust. For now, we leave out the name field, as it is a bit problematic. Later we will see how we can still use it from Rust:

// Path: ./user-rs-lib/src/lib.rs

#[no_mangle]
pub extern "C" fn RUST_write_comment(user: &mut UserC, comment: *const u8, comment_len: usize) {
    let comment = unsafe { std::slice::from_raw_parts(comment, comment_len) };
    let comment_str = unsafe { std::str::from_utf8_unchecked(comment) };
    println!("({:x?}) says: {}", user.uuid.as_slice(), comment_str);

    user.comments_count += 1;
}
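One caveat: from_utf8_unchecked trusts the C++ caller to always pass valid UTF-8, and passing anything else is undefined behavior. If that guarantee is not available, a more defensive variant could look like the sketch below. The helper name comment_as_str is hypothetical, and replacing invalid bytes with U+FFFD is an assumption about what the program would want.

```rust
use std::borrow::Cow;

/// Hypothetical helper: validates the bytes instead of trusting the caller.
/// Invalid UTF-8 sequences are replaced with U+FFFD.
///
/// # Safety
/// `ptr` must be non-null and point to `len` readable bytes that outlive `'a`.
pub unsafe fn comment_as_str<'a>(ptr: *const u8, len: usize) -> Cow<'a, str> {
    let bytes = unsafe { std::slice::from_raw_parts(ptr, len) };
    String::from_utf8_lossy(bytes)
}

fn main() {
    let valid = b"hello, world!";
    let broken = [b'h', b'i', 0xff]; // 0xff can never appear in valid UTF-8
    unsafe {
        println!("{}", comment_as_str(valid.as_ptr(), valid.len()));
        println!("{}", comment_as_str(broken.as_ptr(), broken.len()));
    }
}
```

Valid input is borrowed without copying; only invalid input allocates a new string.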

We want to build a static library, so we tell cargo by adding the following lines to Cargo.toml:

[lib]
crate-type = ["staticlib"]

And now let's compile the library:

$ cargo build
# This is our artifact:
$ ls target/debug/libuser_rs_lib.a

We can now call our Rust function from C++ in main, albeit with some awkward casts:

// Path: user.cpp

int main() {
  User alice{"alice"};
  const char msg[] = "hello, world!";
  alice.write_comment(msg, sizeof(msg) - 1);

  printf("Comment count: %lu\n", alice.get_comment_count());

  RUST_write_comment(reinterpret_cast<UserC *>(&alice),
                     reinterpret_cast<const uint8_t *>(msg), sizeof(msg) - 1);
  printf("Comment count: %lu\n", alice.get_comment_count());
}

And link (manually) our new Rust library with our C++ program:

$ clang++ user.cpp ./user-rs-lib/target/debug/libuser_rs_lib.a
$ ./a.out
alice (336ff4cec0a2ccbfc0c4e4cb9ba7c152) says: hello, world!
Comment count: 1
([33, 6f, f4, ce, c0, a2, cc, bf, c0, c4, e4, cb, 9b, a7, c1, 52]) says: hello, world!
Comment count: 2

The UUID is printed slightly differently because the Rust implementation uses the default Debug formatting for slices, but the content is the same.
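If matching output matters, the Rust side can format the bytes the same way as the C++ loop. This is a small sketch with made-up helper names. The first function mirrors printf("%x", ..), which does not zero-pad, and the second always prints two digits per byte.

```rust
/// Formats bytes like the C++ loop `printf("%x", uuid[i])`:
/// lowercase hex, no zero padding.
pub fn uuid_like_cpp(uuid: &[u8]) -> String {
    uuid.iter().map(|b| format!("{:x}", b)).collect()
}

/// Zero-padded variant: always two hex digits per byte.
pub fn uuid_padded(uuid: &[u8]) -> String {
    uuid.iter().map(|b| format!("{:02x}", b)).collect()
}

fn main() {
    // The UUID bytes from the run above.
    let uuid: [u8; 16] = [
        0x33, 0x6f, 0xf4, 0xce, 0xc0, 0xa2, 0xcc, 0xbf,
        0xc0, 0xc4, 0xe4, 0xcb, 0x9b, 0xa7, 0xc1, 0x52,
    ];
    println!("{}", uuid_like_cpp(&uuid)); // 336ff4cec0a2ccbfc0c4e4cb9ba7c152
}
```

The two variants only differ when a byte is smaller than 0x10: the unpadded form prints a single digit, so the result can be shorter than 32 characters.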

A few thoughts:

  • The calls alice.write_comment(..) and RUST_write_comment(alice, ..) are strictly equivalent. In fact, in a pure C++ codebase the compiler turns the first form into the second, as you can see in the generated assembly. So our Rust function simply mimics what the C++ compiler would do anyway. However, we are free to put the User argument at any position in the function. In other words, we rely on API compatibility, not ABI compatibility.

  • The Rust implementation can freely read and modify private members of the C++ class. For example, the comments_count field is only accessible in C++ through a getter, but Rust can access it as if it were public. That is because public and private are just rules enforced by the C++ compiler; your CPU neither knows nor cares. Bytes are just bytes. If you can reach the bytes at runtime, it does not matter that they were marked “private” in the source code.

  • We are forced to use tedious casts, which is to be expected: we really are reinterpreting memory from one type (User) as another (UserC). This is allowed by the standard because the C++ class is a standard layout class. If it were not, this would be undefined behavior that would probably work on some platforms and break on others.
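Rust has the same "a method is just a function with an explicit receiver" property, which may make the first point easier to picture. The sketch below is only an analogy: the names bump, bump_free and count_after_three_calls are invented for the illustration and do not come from the project.

```rust
#[repr(C)]
pub struct UserC {
    pub name: [u8; 32],
    pub comments_count: u64,
    pub uuid: [u8; 16],
}

impl UserC {
    // A method: `self` plays the role of C++'s implicit `this`.
    pub fn bump(&mut self) {
        self.comments_count += 1;
    }
}

// A free function with the C calling convention and an explicit receiver.
pub extern "C" fn bump_free(user: &mut UserC) {
    user.comments_count += 1;
}

/// Calls the same logic three different ways and returns the final count.
pub fn count_after_three_calls() -> u64 {
    let mut user = UserC { name: [0; 32], comments_count: 0, uuid: [0; 16] };
    user.bump();            // method-call syntax
    UserC::bump(&mut user); // same method, receiver passed explicitly
    bump_free(&mut user);   // free function
    user.comments_count
}

fn main() {
    println!("{}", count_after_three_calls()); // 3
}
```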

Accessing std::string from Rust

std::string should be considered an opaque type from Rust's perspective, because its representation may differ across platforms or even compiler versions, so we cannot describe its layout precisely.

But we do want to access the underlying bytes of the string. So we need a helper function on the C++ side that extracts those bytes for us.

First, Rust. We define a helper type ByteSliceView, which is a pointer and a length (analogous to std::string_view in C++17 and later, and to &[u8] in Rust). Our Rust function now takes an extra parameter, name:

#[repr(C)]
// Akin to `&[u8]`, for C.
pub struct ByteSliceView {
    pub ptr: *const u8,
    pub len: usize,
}


#[no_mangle]
pub extern "C" fn RUST_write_comment(
    user: &mut UserC,
    comment: *const u8,
    comment_len: usize,
    name: ByteSliceView, // <-- Additional parameter
) {
    let comment = unsafe { std::slice::from_raw_parts(comment, comment_len) };
    let comment_str = unsafe { std::str::from_utf8_unchecked(comment) };

    let name_slice = unsafe { std::slice::from_raw_parts(name.ptr, name.len) };
    let name_str = unsafe { std::str::from_utf8_unchecked(name_slice) };

    println!(
        "{} ({:x?}) says: {}",
        name_str,
        user.uuid.as_slice(),
        comment_str
    );

    user.comments_count += 1;
}
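A detail worth keeping in mind: std::slice::from_raw_parts requires a non-null pointer even when the length is zero. std::string::data() never returns null, but a view built from other C++ sources might. A small conversion helper can guard against that. The method name as_bytes below is hypothetical, and the sketch assumes the C++ side keeps the bytes alive while the slice is in use.

```rust
#[repr(C)]
#[derive(Clone, Copy)]
pub struct ByteSliceView {
    pub ptr: *const u8,
    pub len: usize,
}

impl ByteSliceView {
    /// Hypothetical helper: turns the view into a Rust slice,
    /// mapping a null pointer or zero length to an empty slice.
    ///
    /// # Safety
    /// If `ptr` is non-null, it must point to `len` readable bytes that
    /// stay valid for as long as the returned slice is used.
    pub unsafe fn as_bytes(&self) -> &[u8] {
        if self.ptr.is_null() || self.len == 0 {
            return &[];
        }
        unsafe { std::slice::from_raw_parts(self.ptr, self.len) }
    }
}

fn main() {
    let name = String::from("alice");
    let view = ByteSliceView { ptr: name.as_ptr(), len: name.len() };
    let empty = ByteSliceView { ptr: std::ptr::null(), len: 0 };
    unsafe {
        println!("{:?}", std::str::from_utf8(view.as_bytes()));
        println!("{:?}", empty.as_bytes());
    }
}
```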

We re-run cbindgen, and C++ now has access to the ByteSliceView type. So we write a helper function that converts a std::string to this type, and we pass the extra parameter to the Rust function (we also define a trivial getter get_name() on User, because name is still private):

// Path: user.cpp

ByteSliceView get_std_string_pointer_and_length(const std::string &str) {
  return {
      .ptr = reinterpret_cast<const uint8_t *>(str.data()),
      .len = str.size(),
  };
}

// In main:
int main() {
    // [..]
  RUST_write_comment(reinterpret_cast<UserC *>(&alice),
                     reinterpret_cast<const uint8_t *>(msg), sizeof(msg) - 1,
                     get_std_string_pointer_and_length(alice.get_name()));
}

We build and run again, and lo and behold, the Rust implementation now prints the name:

alice (69b7c41491ccfbd28c269ea4091652d) says: hello, world!
Comment count: 1
alice ([69, b7, c4, 14, 9, 1c, cf, bd, 28, c2, 69, ea, 40, 91, 65, 2d]) says: hello, world!
Comment count: 2

Alternatively, if we can't or don't want to change the Rust signature, we can give the C++ helper function get_std_string_pointer_and_length the C calling convention and have it accept a void pointer, so that Rust can call the helper itself, at the cost of several casts to and from void*.
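Here is roughly what the Rust side of that alternative could look like. So that the sketch runs on its own, the helper is implemented in Rust over a String; in the real project it would be declared in an extern "C" block and implemented in C++ over std::string. The function name_of is a made-up name for the illustration.

```rust
use std::ffi::c_void;

#[repr(C)]
pub struct ByteSliceView {
    pub ptr: *const u8,
    pub len: usize,
}

// Stand-in for the C++ helper. The real one would be written in C++ with
// the C calling convention and would cast the void pointer to `std::string`.
// Assumption for this sketch: the pointer refers to a Rust `String` instead.
extern "C" fn get_std_string_pointer_and_length(s: *const c_void) -> ByteSliceView {
    let s = unsafe { &*(s as *const String) };
    ByteSliceView { ptr: s.as_ptr(), len: s.len() }
}

/// Hypothetical Rust-side accessor: asks the helper for the bytes behind
/// the opaque pointer and copies them into an owned Rust string.
///
/// # Safety
/// `opaque_name` must point to a live string object of the type the helper expects.
pub unsafe fn name_of(opaque_name: *const c_void) -> String {
    let view = get_std_string_pointer_and_length(opaque_name);
    let bytes = unsafe { std::slice::from_raw_parts(view.ptr, view.len) };
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    let name = String::from("alice");
    let opaque = &name as *const String as *const c_void;
    println!("{}", unsafe { name_of(opaque) }); // alice
}
```

With this shape, RUST_write_comment keeps its original three parameters and obtains the name by passing the address of the opaque name field to the helper.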

Improving the std::string situation

  • Instead of modeling std::string as a byte array whose size depends on the platform, we could move this field to the end of the C++ class and remove it from the Rust struct entirely (since it is not used there). This breaks the equality sizeof(User) == sizeof(UserC); it becomes sizeof(User) - sizeof(std::string) == sizeof(UserC). The layout is then exactly the same between C++ and Rust up to the last field, which is fine. However, this is an ABI break if external users depend on the exact layout of the C++ class. The C++ constructors also have to be adapted, since they rely on the order of the fields. This approach is essentially the same as flexible array members in C.

  • If memory allocation is cheap, we can store the name as a pointer: std::string *name; on the C++ side and, on the Rust side, as a pointer to void: name: *const std::ffi::c_void, since pointers have a guaranteed size on every platform. The advantage is that Rust can still access the data inside the std::string by calling a C++ helper function with the C calling convention. However, some may dislike the use of a “naked” pointer in C++.
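The first option can be sketched in Rust alone. CppUserStandIn below is a hypothetical stand-in for the reordered C++ class, with 32 bytes of opaque storage for the string as assumed earlier. The function names are likewise invented for the illustration.

```rust
use std::mem::size_of;

// Rust mirror without the std::string field, which is assumed to have been
// moved to the end of the C++ class.
#[repr(C)]
pub struct UserC {
    pub comments_count: u64,
    pub uuid: [u8; 16],
}

// Stand-in for the C++ object: same leading fields, then bytes that Rust
// never looks at.
#[repr(C)]
pub struct CppUserStandIn {
    pub comments_count: u64,
    pub uuid: [u8; 16],
    pub name: [u8; 32], // opaque std::string storage; the size is an assumption
}

// Only touches the shared prefix, so the trailing field does not matter.
pub extern "C" fn rust_bump_comments(user: *mut UserC) {
    unsafe { (*user).comments_count += 1 };
}

/// Builds a stand-in object, calls the function through a prefix pointer,
/// and returns the resulting comment count.
pub fn bump_through_prefix() -> u64 {
    let mut cpp_user = CppUserStandIn { comments_count: 0, uuid: [7; 16], name: [0; 32] };
    let prefix = &mut cpp_user as *mut CppUserStandIn as *mut UserC;
    rust_bump_comments(prefix);
    cpp_user.comments_count
}

fn main() {
    // Mirrors sizeof(User) - sizeof(std::string) == sizeof(UserC).
    assert_eq!(size_of::<CppUserStandIn>() - 32, size_of::<UserC>());
    println!("{}", bump_through_prefix()); // 1
}
```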

Conclusion

We have successfully rewritten a method of a C++ class. This is a great technique because a C++ class can contain hundreds of methods in real code, and we can rewrite them one by one without breaking or touching others.

Big caveat: the more C++-specific features and standard types a class uses, the harder this technique is to use, because it requires helper functions to convert from one type to another and/or lots of tedious type casts. If a C++ class is essentially a C struct and uses only C types, this will be very easy.

However, I have used this technique often at work and really appreciate its relative simplicity and the ability to approach it incrementally.

All this can also theoretically be automated, for example using tree-sitter or libclang to work with C++ ASTs:

  1. Add a check to the constructor of the C++ class to ensure that it is a standard layout class, for example: static_assert(std::is_standard_layout_v<User>); If the check fails, we skip this class: it requires manual intervention.

  2. Generate an equivalent Rust structure, such as struct UserC.

  3. For each field of the C++ class/Rust struct, add a check to ensure the layout is the same: static_assert(sizeof(User) == sizeof(UserC)); static_assert(offsetof(User, name) == offsetof(UserC, name)); If a check fails, we bail out.

  4. For each C++ method, generate an equivalent empty Rust function, e.g. RUST_write_comment.

  5. A developer implements the Rust function. Or an AI. Or something else.

  6. For each call site in C++, replace the C++ method call with a Rust function call: alice.write_comment(..); becomes RUST_write_comment(alice, ..);

  7. Remove C++ methods that have been rewritten.

And voila, the project is rewritten!
