Mrustc: a Rust compiler written in C++ (github.com/thepowersgang)
182 points by mabynogy on Sept 21, 2017 | hide | past | favorite | 121 comments


This was recently posted to /r/rust with the engagement of the author (mutabah) if you'd like to ask questions or get involved. In particular, here he is answering why he started this project: https://www.reddit.com/r/rust/comments/718gbh/mrustc_alterna...

One of the interesting things about this project is that, because its initial goal is to validate the reproducibility of the main Rust compiler, it doesn't actually need to implement a borrow checker: the borrow checker strictly rejects programs and plays no role in code generation.


> it doesn't actually need to implement a borrow checker

I was under the impression that Rust's lifetime and ownership system is not just used to ensure safety, but also for knowing where the compiler should insert deallocation code for dynamically allocated objects at the end of their lifetimes. Is that not the case? Does mrustc generate code that leaks memory, or is there enough information in the Rust source code to do precise deallocation even without a borrow checker?


That is not the case. Destructors run for things that go out of scope. When you create a reference to something ("borrowing"), that reference is associated with a lifetime. This lifetime represents how long the reference is valid. A reference can never outlive the thing it references (or it would be a dangling pointer), but the opposite is not true: you can create a reference with a lifetime that is shorter than the thing it references.

So basically lifetimes are used only by the borrow checker for intelligent linting. Once the borrow checker is done, the compiler loses interest in lifetimes entirely, and they don't affect codegen.
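The scope-based behaviour described above can be sketched directly. This is a minimal, self-contained example (the `Noisy` and `demo` names are mine, not from the thread): the destructor runs at the end of the value's lexical scope, and the short-lived borrow has no effect on when that happens.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts destructor runs, for demonstration purposes only.
static DROPPED: AtomicUsize = AtomicUsize::new(0);

struct Noisy;

impl Drop for Noisy {
    // Runs when the value's lexical scope ends; the lifetimes of any
    // borrows of it play no role in choosing this point.
    fn drop(&mut self) {
        DROPPED.fetch_add(1, Ordering::SeqCst);
    }
}

// Returns (drops observed inside the inner scope, drops observed after it).
fn demo() -> (usize, usize) {
    let _outer = Noisy;
    let inside;
    {
        let inner = Noisy;
        let _short_borrow = &inner; // a lifetime shorter than `inner` itself
        inside = DROPPED.load(Ordering::SeqCst);
    } // `inner` is dropped here, at the end of its scope
    (inside, DROPPED.load(Ordering::SeqCst))
}

fn main() {
    assert_eq!(demo(), (0, 1)); // nothing dropped inside, one drop after
}
```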


Destructor scopes are determined from syntactical rules and lifetimes don't play any role in them. OTOH, lifetimes are restricted by the shape of the destructor scopes.


If you don't really need the borrow checker to compile Rust, can the original Rust compiler have a mode where it doesn't do the borrow checking? That might make it easier to ease into Rust, because right now, "fighting" the borrow-checker is the hardest problem with learning Rust.


Be careful not to underestimate the failure mode of trying to compile arbitrary Rust code without the borrow checker. It appears that your goal is to let people "ease into Rust" by trading compilation errors for runtime segfaults, but it's just as likely that un-borrow-checked Rust code could produce malformed LLVM IR that will cause totally unpredictable compile-time failures in LLVM with indecipherably cryptic error messages.


That really wasn't my intention -- it mostly makes sense to use Rust for the borrow checker. Instead my only intention was that it simplify the process of learning Rust -- using all its features (except the borrow checker), getting a working program fast, and having some easy wins when learning Rust without constantly stumbling over the borrow checker. Maybe a production mode could always enable the borrow checker.


When folks complain about the "borrow checker" usually they're complaining about a superset of features including move semantics and how data works in Rust. This is far more deeply tied into the whole model of Rust and is frankly very core to the language, "turning it off" won't make it easier to learn, it will make it easier to learn a completely different language.

Besides, all of this is necessary for soundness. Turning off the borrow checker won't magically get you a compiler that is simply more permissive; it will get you a compiler that will very likely produce nonsense if your program would have failed the borrow checker had it been present.


The need for the borrow checker is driven by Rust's chosen resource management model. It sounds like what you really want is an optional mode with garbage collection, in which case I recommend you check out F#.


... or the "original", OCaml, from which Rust was heavily inspired.



F# is not a superset of OCaml. There are many things in OCaml that are not in F#. The second link you pointed out lists them.


This is akin to suggesting that learning Java could be simplified by making its type system optional. Also, if one could turn off safety features, why would they ever switch them back on? They'd have learned a different language, one where those checks are not present.


You still need the borrow checker, and this compiler won't change the kinds of programs you can write. It only works on programs that pass the borrow checker, just like the regular compiler, the difference being that it doesn't check that the program is valid (i.e. passes the checker) but assumes that its validity has already been checked earlier by the regular compiler.

If you give an invalid program to the regular compiler it will be rejected, but if you give the same program to this compiler it might produce nonsense. So the developer of a program would feed it to the regular compiler for checking, but others could take the finished release of the program, assume that it's valid and feed it to this compiler instead, as I understand it.


OTOH that's already the case with C and C++ where the compiler is allowed to assume undefined behavior cannot happen.


The borrow checker is needed for soundness; this means mrustc will miscompile things.

Assuming the input is sound, mrustc will do the right thing. Ignoring the borrow checker will not make things magically work.


Removing the borrow checker would allow you to compile broken code (e.g. use after free, data races). However, given code that compiles on rustc (i.e. passes the borrow checker), you don't need to re-verify the borrow checking to get correct code from mrustc.


(Note: not the OP.) Yeah, but by the same token, the borrow checker prevents me from doing perfectly valid logic and is frustrating in its own right.


That's true of literally all compile time validation.


No I'm not saying I have an issue with compile time validation. Far from it.

My point is that the borrow checker is explicitly meant to protect against data races and such. There are situations where I write code I can prove does not violate any hazards but Rust's "validation" is overly-zealous.


Which ones do you run into most often?


But it's more true of some than others.


Yes, some valid code won't pass the borrow checker. But at the same time, far more invalid code doesn't pass it. Personally I see it as a tool to help me reduce possible runtime errors, and instead have them as compile time errors. This does mean that I often need to design my program around it, but that's true of basically every other language.

The few times I do have an issue I'm usually writing FFI code to interact with C, and I try to minimise that code as much as possible.


And of course in Rust, if you Try you might get an Ok Result, or you may just return in Err.

It's often a good exercise to craft your program to the language you're targeting, to take advantage of its strengths or embrace its traditions.


There are probably assumptions/invariants built into how rustc generates LLVM IR that you will have to guarantee are met if borrowck is disabled.

Do you really want that?


Similarly, I like Haskell, but find its syntax and the whole function and currying thing confusing. Can't I just write Haskell without HKTs, functions, and currying?

-------------------

Jokes aside, if you aren't there for Rust's borrow checker, then I wonder: do you really need Rust at all?


Learning Rust without borrowck is throwing the baby out with the bath water


How about "throwing the baby out but keeping the bath water"?

Which I think is more appropriate here.

LOL


I never understand this. I don't mind the borrow checker, I mind the affine type system. It really bugs me that I can't pass the same String into two functions in series, or why preventing that would be a desirable language property.


IIRC because the first function could deallocate it.
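A minimal sketch of that point: a by-value `String` parameter is freed when the callee returns, while a borrowed parameter leaves the caller's value usable for a second call. The `consume` and `peek` function names are mine, for illustration only.

```rust
// Takes ownership: the String's buffer is deallocated when `s` goes
// out of scope at the end of this function.
fn consume(s: String) -> usize {
    s.len()
}

// Borrows instead: the caller keeps ownership.
fn peek(s: &str) -> usize {
    s.len()
}

fn main() {
    let s = String::from("hello");
    assert_eq!(peek(&s), 5); // borrowing lets us call this twice...
    assert_eq!(peek(&s), 5);
    assert_eq!(consume(s), 5); // ...but after this move, `s` is gone
    // peek(&s); // would not compile: `s` was moved into `consume`
}
```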


Seems like explicitly freeing memory should be regarded as unsafe. Unsafe things aside, the compiler should regard it as a copy instead of a move so multiple functions can safely use the resource without having to pass it by reference. If there is a good reason for an affine type system, I suspect it has to do with parallelism and not manual memory management.


Your suspicion is wrong. It has benefits for parallelism, but is critical for ensuring safety of the more predictable (vs GC) Rust/C++ style of "manual" memory management.

When one doesn't have a pervasive GC to dynamically clean things up, things still need to be freed, but at static locations (either explicit free calls in C, or ends of scopes in Rust and C++). Making this safe means defending against pointers becoming dangling, which Rust does by restricting mutation. This restriction translates into some things that can't be copied, such as mutable references (which can lead to iterator invalidation and use after free, among other undefined behaviours).

In other words, explicit/scope-based freeing being unsafe means the only way to safely manage memory is a garbage collector (tracing or reference counting), which limits how many programs can be written that are verified-safe by a computer. I believe Ada takes the approach you suggest, but this limits it to be most useful for programs that don't allocate (or only do O(1) allocation during startup).

Affineness also allows modeling things like session types better, letting programmers construct their own APIs that defend against mistakes.

You could argue, and people have, that some types could be "autoclone", as in, have `clone` calls automatically inserted where necessary. But when this is raised, a lot of people express dislike, because they feel it would encourage slow code and mean people aren't guided towards the (usually) better solution of using references.


I understand not copying mutable pointers; I don't understand preventing copies of pointers to immutable data. I also don't see why affine types allow rust to get away without a GC in a way that scoped-based memory doesn't.


&T is Copy, so I'm not sure what you mean with the first bit.

For the second one, consider Box&lt;T&gt; vs std::unique_ptr&lt;T&gt;; Rust can statically prevent use after move, C++ cannot, and you'll get a null pointer dereference. That is, you can totally get rid of a GC through RAII alone, but you can't guarantee memory safety (as far as we know!) without affine/linear types.
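The first point, that `&T` is `Copy`, can be seen directly. A small sketch (the `sum_twice` name is illustrative): a shared reference can be duplicated freely, because it grants no mutation.

```rust
// `&T` is Copy: duplicating a shared reference is always allowed,
// and the original stays usable.
fn sum_twice(x: &i32) -> i32 {
    let r1 = x;
    let r2 = r1; // copies the reference; `r1` remains usable
    *r1 + *r2
}

fn main() {
    assert_eq!(sum_twice(&21), 42);
}
```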


To be clear, Rust does scope-based memory management and does allow copying pointers to immutable data.

Affine typing (plus the rest of Rust's system) on top of scoped memory management is needed to avoid problems like returning a pointer into something that is deallocated at the end of a function:

  int &foo() {
    std::vector<int> v = ...;
    return v[0];
  } // return value is dangling
And similarly, avoiding having pointers into things that are destroyed when their parent is modified:

  std::vector<std::unique_ptr<int>> v = ...;
  int &ref = *v[0];
  v.clear();
  // ref is dangling
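For contrast, a sketch of the Rust analogue of the second snippet (the `demo` name is mine): the borrow checker only accepts the version where the borrow into the vector ends before the vector is mutated, so no dangling reference can survive the `clear`.

```rust
// Rust analogue of the C++ vector example above: the borrow into the
// vector must end before `clear`, so `first` can never dangle.
fn demo() -> (i32, usize) {
    let mut v = vec![Box::new(1), Box::new(2)];
    let first_val = {
        let first: &i32 = &v[0]; // borrow into the vector, like `ref` above
        *first
    }; // the borrow ends here...
    v.clear(); // ...so the borrow checker permits this mutation
    (first_val, v.len())
}

fn main() {
    assert_eq!(demo(), (1, 0));
}
```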


Maybe this compiler could help with bootstrapping the official rust compiler. At the moment, it seems to be quite a mess for packagers. Rust is written in Rust, but also needs cargo to build, which is also written in Rust. To break this circular dependency chain, you need to have pre-compiled binaries first in order to bootstrap the build process.


I've been beating my head against the wall for weeks trying to compile rustc/cargo on OpenIndiana. I want it so that I can compile a recent version of Firefox and/or Thunderbird for OpenIndiana, both of which require rustc and cargo.


The typical route is to cross compile a compiler and cargo on another platform and then run the resulting binary. Is that approach infeasible for Openindiana?


I don’t see why that’s an issue. Can you name a C compiler that distros use that isn’t written in C?


C compilers are ubiquitous. Name a platform which does not have at least one C compiler. Rust might come into such a position at some later time as well, but until then at least a transitional solution besides cross-compiling would make packaging easier and reproducible.


Really, the solution is just to make cross compiling as easy as any other kind. Compilation is basically a pure function, so by rights compilers with different targets should be drop-in replacements of each other.


> Compilation is basically a pure function

Sure, a pure function taking the universe as an argument.


No, a pure function taking the original source code as the only argument.


Exactly. There's no reason a compiler should take any input but the source (maybe some config) nor have any output but the translated code. There's no dependence on running on a particular piece of hardware in this equation. Just run the codegen for platform Y on platform X.


Wouldn't it need to link to native libraries, and thus require you to have all those around too?


Sure, I guess in the 10,000 foot view those look like source code :). But there's no fundamental reason you can't have libraries for platform X merely present on a platform Y system, it's just inconvenient today. That's the point I'm really trying to get at.


I'll admit I don't know a whole lot about compilers but don't most of them do optimizations and other changes based on the CPU like its architecture, cache size, and instruction set? That may be what GP is getting at.

Stuff like LLVM and emscripten exists though, so it's probably not as big of a deal as they say.


Sure, based on the target CPU. But there's no fundamental reason I can't do all those optimizations for, say, x86 in a compiler that happens to be running on an Arm CPU. As far as the compiler is concerned it's all just data that gets stuffed in files, just like any other software.


The super-easy cross compilation in go is one of the best parts. Every install comes with the capacity to compile to every go target with no differences in input except setting it by name.

This is really something more languages should strive for.


> This is really something more languages should strive for.

Most languages these days obviate cross-compilation in the first place by being interpreted. Of the languages that remain, the natively-compiled ones, most haven't gone to Go's lengths of writing a custom libc, which is emphatically not recommended for both Windows and Mac (the syscall interfaces aren't stable, and this has broken Go code in the past: https://github.com/golang/go/issues/16272 ). And gc, the primary Go compiler and the one with out-of-the-box cross-compilation, supports relatively few platforms (I count 11, whereas rustc looks like it supports 50-70 platforms). You can use gccgo to get more platforms, but AFAICT gccgo's cross-compilation story isn't nearly as nice: https://github.com/golang/go/wiki/GccgoCrossCompilation .

For Rust, cross-compilation looks like this:

1. Install the libc for the target system, and, if necessary, a compatible linker.

2. Run `rustup target add foo` where "foo" is one of the target triples on https://forge.rust-lang.org/platform-support.html

3. Add the target triple to your Cargo.toml.

Which isn't too shabby. You can read more about it here: https://github.com/japaric/rust-cross


Well, it's not just data - it's also frequently pointers or labels to code in dynamically imported libraries; things which can't always be calculated without the exact library being used on hand.


Dynamic libraries are also made of data. See also my response to coldtea here: https://news.ycombinator.com/item?id=15310255


Be careful, let's not forget the lesson of Ken Thompson's Reflections on Trusting Trust. It's more accurate to say that compilation is a function that takes two arguments: the source code, and the compiler itself; this is how trusting-trust attacks propagate despite total absence from the source code.


You're being pedantic, but your pedantry is incorrect, so allow me to be pedantic in correcting you :p. The output of all pure functions depends on the function and its inputs; this isn't more true for compilers than, say, addition. The problem with Trusting Trust is that the function isn't practically inspectable, not that the function is impure. A trusting trust compiler is still pure.


A functionally pure compiler would always provide the same output for the given source code. The challenges in creating repeatable builds of highly popular open source projects shows that the average compiler is anything but functionally pure. The global state of the system running the compiler has a huge impact on the output of the compiler.


I agree, but that's irrelevant to the Trusting Trust topic.


And even if you could manually verify the binary output of a compiler and prove it correct, you still don't know that the CPU itself is not malicious.


Incidentally, this is the principle that ccache[1] operates on.

[1] https://ccache.samba.org/


>I don’t see why that’s an issue. Can you name a C compiler that distros use that isn’t written in C?

Can you name a distro or environment that doesn't already have a C compiler at arm's reach?

That's not the case with Rust compilers, which is obviously what the issue is.


No, it's not obvious what the issue is. If you want to compile a C compiler, you need a C compiler binary. If you want to compile a rust compiler, you need a rust compiler binary. It's the exact same issue, but people don't whine about the C compiler, probably because it's been that way for 50 years.


Actually, yes. GCC and Clang are both written in C++ now.


gcc and clang are both C++ in my understanding.

Regardless, those compilers are usually grandfathered in; distro policies are different for other languages. It just depends.


Even the C++ front end in GCC is written in C.

https://gcc.gnu.org/git/?p=gcc.git;a=tree;f=gcc/cp


No, they use C++ even for the C front-end. But you are right, you can still see its C past in the source code.

What's important though is that the GCC devs actively avoid the newest C++ features, which allows even very old GCC releases to compile current GCC trunk. IIRC even GCC 4.3 (released in March 2008) is able to compile current trunk (GCC 8.0).

This is somewhat different for Rust, where I think the current policy is that master needs to compile with the last released version (releases every 6 weeks, and just to make it clear: Rust is written in Rust). Before that, the compiler was updated even more frequently. Bootstrapping Rust from zero is therefore quite hard, since Rust is far from being as ubiquitous as C/C++ compilers. Even if you already have a rustc on your system, it's not unlikely that it is too old to compile Rust master. The first Rust compiler was written in OCaml, so you need to compile that first, then compile Rust commit-by-commit until you reach current master. This could take quite some time. IMHO this is why mrustc is great: you just compile current master (if it supports all the features) with it, and then use the generated compiler to compile Rust.


It looks like C, but actually it is a common subset of C and C++.

GCC migrated to being built as C++ with the 4.8 release in 2013.

https://lwn.net/Articles/542457/


Any carefully written C program that avoids certain C99 and C11 things can be said to be C++.

My language implementation compiles as C and C++. Every so often I build it as C++ (like before releases) to check for regressions and flush out any issues caught by the C++ compiler.

I don't think of it as "written in C++", though it is not technically a false statement.

All Scheme programs are really written in Common Lisp; they just need a suitable library of macros and functions ...


Technically not only C anymore, not sure about the C++ frontend specifically, but there is C++ in GCC since... 5? 4?


Some of those files are C++!


Can you name a C compiler that distros use that isn't GCC or Clang?


MSVC. Yeah, Windows isn't a "distro" but it is a major OS and MSVC is a pretty major compiler. But it's binary only so there's no need to bootstrap it unless you're working at MS.


I don't get the issue either, but maybe from ignorance. Can't I just get the Rust compiler in LLVM's IR form and bootstrap from there?


LLVM IR is still very targeted to the architecture it will be run on, it's not intended to be a portable representation.


That's one of the goals, yeah. The author is working on Cargo now.


This is true of a lot of languages: Go, Haskell (even more once Shake becomes its build system), etc.

It makes porting things... interesting.


Usually you port by cross compiling an initial compiler, and bootstrapping from there.


Speaking from personal experience, that can be quite painful. Being able to build rust and cargo, even if only for bootstrapping when porting, would truly be a wondrous thing.


Why, we could always use emscripten to compile rustc to JavaScript and use that for bootstrapping!

On a more serious note, ghc at least solved this problem by allowing you to compile Haskell to C (-fvia-C). Writing a naive C code generator from whatever your backend is using is probably not that hard.


I actually did what steveklabnik said, porting ghc to Alpine Linux. Let's just say, ghc's cross compilation can be frustrating.


Forgot a what in there, just woke up me can not english.


Or by making use of an interpreter as first stage.


Go's bootstrap uses Go 1.4, whose compiler is written in C. So: compile Go 1.4 and use that to compile the Go 1.x compiler (I don't know, but multiple steps might be necessary).


Of the distros, Fedora has the strictest compiler bootstrapping policy, and Fedora seems to be doing great with Rust packaging (for a number of CPU architectures).


Neat project. It looks like this was largely written by one person, and I'm fairly in awe at anyone who can take a big project like a compiler this far alone.

Isn't there a bit of cognitive dissonance in believing that Rust as a language is an important idea (i.e., in the additional code safety and maintainability it conveys), while simultaneously making the effort to rewrite the Rust-implemented compiler in C++?

C++ is fast, but aside from a shared value around performance, it has fairly little in common with the ideas that Rust is built on.


Multiple implementations of a compiler lets you implement the "Diverse Double-Compiling"[0] countermeasure to the famous "Reflections on Trusting Trust"[1] attack. You wouldn't necessarily use the C++ implementation in production, but it still improves the security of the Rust language just by existing.

[0] https://www.dwheeler.com/trusting-trust/dissertation/html/wh... (previous HN discussion at: https://news.ycombinator.com/item?id=12666923 )

[1] http://www.ece.cmu.edu/~ganger/712.fall02/papers/p761-thomps...


DDC is irrelevant here, DDC is an argument to not write the second compiler in C++ and write it in Rust too.

Having a Rust compiler in C++ is a mitigation to the trusting trust attack, period. You don't need DDC for this.

DDC is necessary when you have two self hosted compilers (e.g. GCC and clang). Here we have one self-hosted compiler (rustc), and one in another language (C++). To mitigate trusting trust in rustc, use mrustc to compile rustc, and then use that rustc to compile itself, and now you have a trusted binary (provided you trust your C++ compiler. you can fix this by DDCing the C++ compilers)


By not self-hosting in Rust, it should be easier to bootstrap on a new target. C++ is ubiquitous -- what other language would you choose?


> what other language would you choose?

As far as I can tell the goal of the project isn't to target more platforms (Rust targets quite a few by way of LLVM), so I don't think I'd choose any other language, including C++.

Having a compiler and standard library written in the language that it compiles has some huge benefits for increasing the pool of possible contributors.


> As far as I can tell the goal of the project isn't to target more platforms (Rust targets quite a few by way of LLVM)

There appears to be an attempt to leverage this project to target ESP8266, which AFAIK LLVM does not support: https://github.com/emosenkis/esp-rs


Interesting. Along with your other comment about the borrow checker, I guess you could develop using (or occasionally check against) the rustc compiler for borrow-checker correctness, and deploy using mrustc. That's pretty cool.


> Having a compiler and standard library written in the language that it compiles has some huge benefits for increasing the pool of possible contributors.

But this isn't the official compiler, this is someone's personal project?


> But this isn't the official compiler, this is someone's personal project?

True, but compilers are complicated machines and Rust is still changing at a fairly frantic rate.

The author seems to be doing quite a good job of development today, but if the project has any hope of staying current, it probably needs to increase its bus factor (if something happens, like the author changing jobs, starting a family, or just becoming interested in something else, a single person suddenly has less time to contribute).


Rust is changing, but in a backwards-compatible way. That said, the standard library aggressively makes use of new features, so the challenge isn't the language but compiling libstd.


Could rustc have a way to output desugared code, or code targeting a specific epoch, where new features like generators expand into a backwards-compatible form? This might allow for preprocessed source that could be compiled by something like mrustc even if it doesn't implement every single RFC.


That's sort of MIR, but we have no intentions of stabilizing it any time soon, if ever.


Yes, but I mean outputting actual Rust source, but with generators or async/await expanded into calls to Futures. Similar to how Go is now bootstrapped from a down-level compiler.


Yeah, I get it. But that's what I mean; MIR is the common sub-language that's the same across epochs.

I don't think there's any real plans for a source-based approach. But epochs can only change a limited amount of things for exactly this reason; they minimize the compiler burden of supporting them.


If bootstrapping/portability is the goal, probably C.


Conveniently there are already C++ compilers bootstrappable with C.


Any that are both open-source and C++11 compliant? Guess you can still build g++-4.6 with just C, then a newer g++ from that, but it's a bit of a pain.


If the goal is to break the dependency cycle, a higher level language like Python would make development much easier. C++ is powerful, but not as rapid to develop in.


Why the downvotes?

It's clearly easier to write a compiler in a higher-level language (Python is just an example, but Lisps are suited to this kind of thing). For example, text parsing is easier in Perl/Ruby/Python/Swift/etc. As someone who knows C++, more thought is required to do the same thing as in a higher-level language, although the result runs much faster. If you just wanted to bootstrap the compiler, you'd choose the easiest route to that. It could also be easier to read and understand than a C++ compiler.


Ada, for example.

There is an open source version available in every system supported by gcc.

However, C++17 is quite safe, provided one doesn't follow the "write C with a C++ compiler" idiom.


> However C++17 is quite safe, provided one doesn't follow "write C with C++ compiler" idiom.

No, it's not. There is no use-after-free protection among other things. It doesn't matter how much you write code like C or not: it's just not safe.


Then better stop using LLVM on Rust.

Until then, those comments only annoy those of us who happen to like Rust but don't find it mature enough to replace C++ in the use cases we care about.


I didn't say you can't write useful software in C++. I said that C++ software is not, in general, memory safe. Sometimes the benefits of a particular piece of software (for example, having an excellent production-grade optimizer) outweigh its drawbacks (for example, not being memory-safe).

I don't think you'd find a single LLVM developer who would claim that LLVM is memory safe. Giving invalid IR to LLVM and not running the verifier frequently segfaults it, for example...


I do agree with that, but that is exactly why I always mention the "C with C++ compiler" idiom, which is usually followed by your remark.


Safety is a spectrum; I think declaring the language entirely unsafe might not be conducive to more interesting points.


The Rust compiler infrastructure is built on LLVM (a C++ project), so this doesn’t seem terribly weird.


> C++ is fast, but aside from a shared value around performance, it has fairly little in common with the ideas that Rust is built on.

The core guidelines are very similar to Rust, in spirit and intent.


... but the details are very, very different. Unless I've missed significant revisions, data races and concurrency are a non-goal of the Core Guidelines, but are central to Rust.

That said, I always welcome tooling to make C++ safer; the end game is making programs better, not language partisanship!


I will say this for C++: post C++11, it’s one of the few languages in widespread use to have an explicitly defined memory model. The people working on C++ definitely do care about concurrency and parallelism. I’d still choose Rust over C++ for that kind of program any time I was given the choice, though. =)


Absolutely, I'm not saying they don't care; it's that solving that problem is an explicit non-goal of the GSL work. The C++ committee is clearly working on concurrency related things, I was reading the various coroutines TSes recently in fact, as we're working on similar things in Rust.


SaferCPlusPlus[1] is the library that's probably closer to Rust "in spirit and intent". And more importantly, safety effectiveness. And it does address the data race issue.

[1] shameless plug: https://github.com/duneroadrunner/SaferCPlusPlus


I'm in awe too at how some folks single-handedly have the drive to keep hitting the keyboard and pour all their brain's power into a model defined by electricity and binary. Truly amazing! Bravo.


Lots of discussion and questions answered by the author on the Rust subreddit the other day https://www.reddit.com/r/rust/comments/718gbh/mrustc_alterna...


It's a major milestone for a compiled language to become self-hosting. Another is having a competing implementation of the compiler/runtime. That's legit; cool project, and kudos to the beautiful Rust ecosystem.


If the goal is bootstrapping, another approach would be to use the LLVM C backend to generate a version of rustc and cargo that can be built by a C compiler.


IIRC the C backend for LLVM was deprecated quite a while ago.


It may have been nice to build it on GCC instead of LLVM. First, the existing Rust compiler uses LLVM so this won't be a fully independent implementation. Second, it would provide GCC with a Rust front end.


Curious why this wasn't written in Rust?


The Rust compiler is already written in Rust, which means it can't be deployed to some environments that C++ could.

This compiler in essence "bridges" the gap. Sorta.



