1. Background
For quite some time, I have been bothered by this thought: Individual programming languages (C++, Rust, Go, etc.) are traditionally viewed as walled gardens. If your main() function is written in C++, you had better find yourself C++ libraries like Qt to build the rest of your codebase with. Do you want to use Flutter to build your app's user interface? Get ready to build the logic in Flutter, too. Do you really want to use that Rust library to make your application safer? You get to either rewrite the whole app in Rust or build an ugly extern "C" wrapper around it that won't fit well in your object-oriented C++ code.
This has been the standard view on using multiple programming languages for many years. However, I've decided that this view is fundamentally flawed, because every compiled language uses the same set of concepts when it is compiled:
- Code is split up into functions that can be reused.
- Functions are identified by a string generated from the function name in the source code. For example, g++ generates _Z3foov as the identifier for void foo(). This string is always reproducible; for example, both Clang and GCC on Linux follow the Itanium C++ ABI convention for mangling function names.
- Functions are called by placing all parameters to that function at specific locations (registers or stack slots, as dictated by the platform's calling convention) and then using a call instruction or equivalent to move control to the function. For example, to call void foo() from earlier, the compiler converts a C++ statement foo(); into the assembly call _Z3foov. The assembler then replaces call with the appropriate opcode, and the linker replaces _Z3foov with the location of the first instruction identified by _Z3foov.
- Functions return by storing their return value (if they have one) at a specific location and then using a ret instruction or equivalent.
- Classes and structs can be boiled down to a collection of primitive types (although some classes do have vtables).
- Class methods are just another function that happens to take a pointer to the class object as the first parameter. In other words, when you write this:
cpp
class Foo {
    void foo(int bar);
    int baz;
};
your code actually compiles to something that is better represented this way:
cpp
class Foo {
    int baz;
};
void foo(Foo *this, int bar);
Since every compiled programming language uses the same concepts to compile, why can't they just interact?
2. What we want to achieve
Before we go any further, I'd like to give an example of what we want to achieve:
cpp
//file1: main.cpp
#include "rustmodule.h"
// or in an ideal C++ 20 world:
// import rustmodule;
int main()
{
    foo();
    return 0;
}
cpp
// file2: rustmodule.h
#pragma once
// this is defined in Rust
void foo();
rust
// file3: rustmodule.rs
pub fn foo() {
    println!("Hello from Rust");
}
We want to be able to compile those files and get an executable file that prints Hello from Rust to stdout.
3. Why this won't just work out of the box
The most obvious reason that compiled programming languages can't just interact with each other is syntax. C++ compilers don't understand Rust, and Rust compilers don't understand C++. Thus neither language can tell what functions or classes the other is making available.
Now, you might be saying "But if I use a C++ .h file to export functions and classes to other .cpp files, certainly I could make a .h file that tells C++ that there is a Rust function fn foo() out there!" If you did say (or at least think) that, congratulations! You are on the right track, but there are some other less obvious things we need to talk about.
The first major blocker to interoperability is name mangling. You can certainly make a .h file with a forward declaration of void foo();, but the C++ compiler will then look for a symbol called _Z3foov, while the Rust compiler will have mangled fn foo() into _ZN10rustmodule3foo17hdf3dc6f68b54be51E. Compiling the C++ code starts out OK, but once the linking stage is reached, the linker will not be able to find _Z3foov since it doesn't exist.
Obviously, we need to change how the name mangling behaves on one side or the other. We'll come back to this thought in a moment.
The second major blocker is data layout. Put simply, different compilers may treat the same struct declaration differently by putting its fields at different locations in memory.
The third and final blocker I want to look at here is standard libraries. If you have a C++ function that returns an std::string, Rust won't be able to understand that. Instead, you need to implement some sort of converter that will convert C++ strings to Rust strings. Similarly, a Rust Vec object won't be usable from C++ unless you convert it to something C++ understands.
4. Name mangling
4.1 extern "C" and why it sucks
The easy way is to use the extern "C" feature that nearly every programming language has:
cpp
// file: main.cpp
#include "rustmodule.h"
int main()
{
    foo();
    return 0;
}
cpp
// file: rustmodule.h
#pragma once
extern "C" void foo();
rust
// file: rustmodule.rs
#[no_mangle]
pub extern "C" fn foo() {
    println!("Hello from Rust");
}
This actually will compile and run (assuming you link all the proper standard libraries)! So why does extern "C" suck? Well, by using extern "C" you give up features like these:
- Function overloads
- Class methods
- Templates
It's possible to create wrappers around the extern "C" functions to crudely emulate these features, but I don't want complex wrappers that provide crude emulation. I want wrappers that directly plumb those features and are human readable! Furthermore, I don't want to have to change the existing source, which means that the ugly #[no_mangle] pub extern "C" must go!
4.2 D language
D is a programming language that has been around since 2001. Although it is not source compatible with C++, it is similar to C++. I personally like D for its intuitive syntax and great features, but for gluing Rust and C++ together, D stands out for two reasons: extern(C++) and pragma(mangle, "foo").
With extern(C++), you can tell D to use C++ name mangling for any symbol. Therefore, the following code will compile:
Call chain: D main() --> C++ foo() --> D bar()
cpp
// file: foo.cpp
#include <iostream>
void bar();
void foo()
{
    std::cout << "Hello from C++\n";
    bar();
}
d
// file: main.d
import std.stdio;
extern(C++) void foo();
extern(C++) void bar()
{
    writeln("Hello from D");
}
void main()
{
    foo();
}
However, it gets better: we can use pragma(mangle, "foo") to manually override name mangling to anything we want! Therefore, the following code compiles:
Call chain: D main() --> Rust foo() --> D bar()
d
// file: main.d
import std.stdio;
pragma(mangle, "_ZN10rustmodule3foo17h18576425cfc60609E") void foo();
pragma(mangle, "bar_d_function") void bar()
{
    writeln("Hello from D");
}
void main()
{
    foo();
}
rust
// file: rustmodule.rs
pub fn foo() {
    println!("Hello from Rust");
    unsafe {
        bar();
    }
}
extern {
    #[link_name = "bar_d_function"]
    fn bar();
}
With pragma(mangle, "foo") we can not only tell D how Rust mangled its function, but also create a function that Rust can see!
You might be wondering why we had to tell Rust to override mangling of bar(). It's because Rust apparently won't apply any name mangling to bar() for the sole reason that it is in an extern block; in my testing, not even marking it as extern "Rust" made any difference. Go figure.
You also might be wondering why we can't use Rust's name mangling overrides instead of D's. Well, Rust only lets you override mangling on function forward declarations marked as extern, so you can't make a function defined in Rust masquerade as a C++ function.
4.3 Using D to glue our basic example
In this example, when main() calls foo() from C++, it is actually calling a D function that can then call the Rust function. It's a little ugly, but it's possibly the best solution available that leaves both the C++ and Rust code in pristine condition.
Call chain: C++ main() --> D foo() --> Rust foo()
cpp
// file: main.cpp
#include "rustmodule.h"
int main()
{
    foo();
    return 0;
}
cpp
// file: rustmodule.h
#pragma once
// this is in Rust
void foo();
rust
// file: rustmodule.rs
pub fn foo() {
    println!("Hello from Rust");
}
d
// file: glue.d
@nogc:
// This is the Rust function.
pragma(mangle, "_ZN10rustmodule3foo17h18576425cfc60609E") void foo_from_rust();
// This is exposed to C++ and serves as nothing more than an alias.
extern(C++) void foo()
{
    foo_from_rust();
}
4.4 Automating the glue
Nobody wants to have to write a massive D file to glue together the C++ and Rust components, though. In fact, nobody even wants to write the C++ header files by hand. For that reason, I created a proof-of-concept tool called polyglot that can scan C++ code and generate wrappers for use from Rust and D. My eventual goal is to also wrap other languages, but as this is a personal project, I am not developing polyglot very quickly and it certainly is nowhere near the point of being ready for production use in serious projects. With that being said, it's really amazing to compile and run the examples and know that you are looking at multiple languages working together.
5. Data layout & Standard libraries
"In the beginning, there was C."
This sentence actually could serve as the introduction to a multitude of blog posts, all of which would come to the conclusion "legacy programming conventions are terrible, but realistically we can't throw everything out and start over from scratch". However, today we will merely be looking at two ways C has contributed to making language interoperability difficult.
5.1 Use #[repr(C)] for structs in Rust
In the first installment of this series, I mentioned that one blocker to language interoperability is struct layout. Specifically, different programming languages may organize data in structs in different ways. How can we overcome that on our way to language interoperability?
Layout differences are mostly differences of alignment, which means that data is just located at different offsets from the beginning of the struct. The problem is that there is not necessarily a way to use keywords like align to completely represent a different language's layout algorithm.
Thankfully, there is a solution. In our example used previously, we were using Rust and C++ together. It turns out that Rust can use the #[repr(C)] representation to override struct layout so that it follows what C does. Given that C++ lays out plain structs like the one below the same way C does, the following code compiles and runs:
cpp
// file: cppmodule.cpp
#include <iostream>
#include <cstdint>
struct Foo
{
    int32_t foo;
    int32_t bar;
    bool baz;
};
void foobar(Foo foo)
{
    std::cout << "foo: " << foo.foo
              << ", bar: " << foo.bar
              << ", baz: " << foo.baz
              << '\n';
}
rust
extern {
    #[link_name = "_Z6foobar3Foo"]
    pub fn foobar(foo: Foo);
}
#[repr(C)]
pub struct Foo {
    pub foo: i32,
    pub bar: i32,
    pub baz: bool,
}
fn main() {
    let f = Foo { foo: 0, bar: 42, baz: true };
    unsafe {
        foobar(f);
    }
}
My proof-of-concept project polyglot automatically wraps C++ structs with #[repr(C)] (and also does so for enums).
The one major downside of this approach is that it requires you to mark structs that you created in your Rust code with #[repr(C)]. In an ideal world, there would be a way to leave your Rust code as is; however, there is currently no solution that I am aware of that does not require #[repr(C)].
5.2 Handling a list of items: Arrays, strings, and buffer overflows
Now that we've covered structs in general, we can look at the next bit of C behavior that turned out to be problematic: handling a list of items.
In C, a list of items is represented by an array. An array that has n elements of type T in it really is just a block of memory with a size n * sizeof(T). This means that all you have to do to find the kth object in the array is take the address of the array and add k * sizeof(T). This seemed like a fine idea back in the early days of programming, but eventually people realized there was a problem: it's easy to accidentally access the seventh element of an array that only has five elements, and if you write something to the seventh element, congratulations, you just corrupted your program's memory! It's even more common to perform an out-of-bounds write when dealing with strings (which, after all, is probably the most used type of array). Out-of-bounds accesses like these have led to countless security vulnerabilities, including the famous Heartbleed bug, which was an out-of-bounds read (you can see a good explanation of how Heartbleed works at xkcd 1354).
Eventually, people started deciding to fix this. In languages like Java, D, and pretty much any other language invented in the last 25 years or so, strings (and arrays) are handled more dynamically: reading from or writing to a string at an invalid location will generally throw an exception; staying in bounds is made easy by the addition of a length or size property, and strings and arrays in many modern languages can be resized in place. Meanwhile, C++, in order to add safer strings while remaining C-compatible, opted to build a class std::string that is used for strings in general (unless you use a framework like Qt that has its own string type).
All of these new string types are nice, but they present a problem for interoperability: how do you pass a string from C++ to Rust (our example languages) and back again?
The answer, unsurprisingly, is "more wrappers". While I have not built real-life working examples of wrappers for string types, what follows is an example of how seamless string conversion could be achieved.
We start with a C++ function that returns an std::string:
cpp
// file: links.cpp
#include <string>
std::string getLink()
{
    return "https://kdab.com";
}
rust
// file: main.rs
mod links;
fn main() {
    println!("{} is the best website!", links::getLink());
}
Normally, we would just create a Rust shim around getLink() like so:
rust
// wrapper file: links.rs
extern {
    #[link_name = "_Z7getLinkB5cxx11v"]
    pub fn getLink() -> String; // ???
}
However, this doesn't work because Rust's String is different from C++'s std::string. To fix this, we need another layer of wrapping. Let's add another C++ file:
cpp
// wrapper file: links_stringwrapping.cpp
#include "links.h" // assuming we made a header file for links.cpp above
#include <cstring>
const char *getLink_return_cstyle_string()
{
    // we need to call strdup to avoid returning a temporary object
    return strdup(getLink().c_str());
}
Now we have a C-style string. Let's try consuming it from Rust. We'll make a new version of links.rs:
rust
// wrapper file: links.rs
#![crate_type = "staticlib"]
use std::ffi::CStr;
use std::os::raw::{c_char, c_void};
extern {
    #[link_name = "_Z28getLink_return_cstyle_stringv"]
    fn getLink_return_cstyle_string() -> *const c_char;
    // The C allocator's free(), needed to release memory obtained from strdup.
    fn free(ptr: *mut c_void);
}
pub fn getLink() -> String {
    let cpp_string = unsafe { getLink_return_cstyle_string() };
    let rust_string = unsafe { CStr::from_ptr(cpp_string) }
        .to_str()
        .expect("This had better work...")
        .to_string();
    // Note that since we strdup'ed the temporary string in C++, we have to manually free it here!
    // strdup allocates with malloc, so the buffer must go back through free(), not Rust's allocator.
    unsafe { free(cpp_string as *mut c_void); }
    return rust_string;
}
With these additions, the code now compiles and runs. This all looks very convoluted, but here's how the program works now:
- Rust's main() calls links::getLink().
- links::getLink() calls getLink_return_cstyle_string(), expecting a C-style string in return.
- getLink_return_cstyle_string() calls the actual getLink() function, converts the returned std::string into a const char *, and returns the const char *.
- Now that links::getLink() has a C-style string, it borrows it as a Rust CStr, copies the contents into an actual String, and frees the C-style string.
- The String is returned to main().
There are a few things to take note of here:
- This process would be relatively easy to reverse so we could pass a String to a C++ function that expects an std::string or even a const char *.
- Rust strings are a bit more complicated because we have to convert from a C-style string to CStr to String, but this is the basic process that will need to be used for any automatic string type conversions.
- This basic process could also be used to convert types like std::vector.
Is this ugly? Yes. Does it suffer from performance issues due to all the string conversions? Yes. But I think this is the most user-friendly way to achieve compatible strings because it allows each language to keep using its native string type without requiring any ugly decorations or wrappers in the user code. All conversions are done in the wrappers.
6. Existing technologies for mixing C++ and Rust
In the previous posts we looked at how to build bindings between C++ and Rust from scratch. However, while building a binding generator from scratch is fun, it's not necessarily an efficient way to integrate Rust into your C++ project. Let's look at some existing technologies for mixing C++ and Rust that you can easily deploy today.
6.1 bindgen
bindgen is an official tool of the Rust project that can create bindings around C headers. It can also wrap C++ headers, but there are limitations to its C++ support. For example, while you can wrap classes, they won't have their constructors or destructors automatically called. You can read more about these limitations on the bindgen C++ support page. Another quirk of bindgen is that it only allows you to call C++ from Rust. If you want to go the other way around, you have to add cbindgen to generate C headers for your Rust code.
6.2 CXX
CXX is a more powerful framework for integrating C++ and Rust. It's used in some well-known projects, such as Chromium. It does an excellent job at integrating C++ and Rust, but it is not an actual binding generator. Instead, all of your bindings have to be manually created. You can read the tutorial to learn more about how CXX works.
6.3 autocxx
Since CXX doesn't generate bindings itself, if you want to use it in your project, you'll need to find a generator that wraps C++ headers with CXX bindings. autocxx is a Google project that does just that, using bindgen to generate Rust bindings around C++ headers. However, it gets better: autocxx can also create C++ bindings for Rust functions.
6.4 CXX-Qt
While CXX is one of the best tools available for C++/Rust bindings, it fails to address Qt users. Since Qt depends so heavily on the moc to enable features like signals and slots, it's almost impossible to use it with a general-purpose binding generator. That's where CXX-Qt comes in. KDAB has created the CXX-Qt crate to allow you to integrate Rust into your C++/Qt application. It works by leveraging CXX to generate most of the bindings but then adds a Qt support layer. This allows you to easily use Rust on the backend of your Qt app, whether you're using Qt Widgets or QML. CXX-Qt is available on GitHub and crates.io.
If you're interested in integrating CXX-Qt into your C++ application, let us know. To learn more about CXX-Qt, you can check out this blog.
6.5 Other options
There are some other binding generators out there that aren't necessarily going to work well for migrating your codebase, but you may want to read about them and keep an eye on them:
- Diplomat
- Crubit
- rust-cpp
In addition, there are continuing efforts to improve C++/Rust interoperability. For example, Google recently announced that they are giving $1 million to the Rust Foundation to improve interoperability.
7. Conclusion
In the world of programming tools and frameworks, there is never a single solution that will work for everybody. However, CXX, CXX-Qt, and autocxx seem to be the best options for anyone who wants to port their C++ codebase to Rust. Even if you aren't looking to completely remove C++ from your codebase, these binding generators may be a good option for you to promote memory safety in critical areas of your application.
Have you successfully integrated Rust in your C++ codebase with one of these tools? Have you used a different tool or perhaps a different programming language entirely? Leave a comment and let us know. Memory-safe programming languages like Rust are here to stay, and it's always good to see programmers change with the times.
Original posts
【1】https://www.kdab.com/mixing-c-and-rust-for-fun-and-profit-part-1/
【2】https://www.kdab.com/mixing-c-and-rust-for-fun-and-profit-part-2/
【3】https://www.kdab.com/mixing-c-and-rust-for-fun-and-profit-part-3/