Cranelift code generation comes to Rust (lwn.net)
444 points by ridruejo on March 18, 2024 | hide | past | favorite | 110 comments


This article provides an excellent overview of the current state of the trade-off between optimizer speed and quality of optimization.

In particular, copy-and-patch compilation is still the fastest approach because it uses pre-compiled code, though it leaves little room for optimization.

Cranelift uses e-graphs to represent equivalence on the IR. This allows for more optimizations than the copy-and-patch approach.

Of course, the most optimized output is going to come from a more traditional compiler toolchain like LLVM or GCC. But for users who want to get "fast enough" output as quickly as possible, newer compiler techniques provide a promising alternative.


From what I understand, the big advantage of the e-graphs approach is that the quality of the output is (within limits) a function of the time and memory given.

The more memory, the more nodes can be generated in the e-graph; the more time for the search, the better the selected node.

It might never be as fast as copy-and-patch or as good as LLVM or GCC, but this flexibility is a value in itself.


We actually take a fairly unconventional approach to e-graphs: we have a few linear passes and we do all rewrites eagerly, so we use them to provide a general framework for the fixpoint problem into which we plug in all our rewrites, but we don't have the usual "apply as much CPU time as you want to get better results" property of conventional equality saturation.

I gave a talk about this approach, aegraphs (acyclic e-graphs), here: slides (https://cfallin.org/pubs/egraphs2023_aegraphs_slides.pdf), video (https://vimeo.com/843540328)

(disclosure: Cranelift tech lead 2020-2022 and main author of the e-graphs mid-end, as well as regalloc, isel and its custom DSL, and other bits, along with the excellent team)


I found the video to be great! Informative and understandable even when I'm just casually interested in compilers, with not a lot of real world experience writing them.


When we first reviewed the equality saturation paper, we thought there was one major pro and one major con: [pro] phase-ordering invariance; [con] at the time there was no (believable) way to extract the transformed graph in a non-hand-wavy way.

Personally, I think e-graphs should be combined with Massalin superoptimization — they're natural "duals" — and just turn the whole exercise into a hill-climbing process. You can tune the total effort by the set of passes used, the amount of time to drive the graph to saturation, and the method (and time) for graph extraction.


Can you memoize across invocations so that the time spent optimizing is cumulative across all builds?


Speaking as an e-graph rookie - I believe so, with caveats.

You can certainly store the chosen extractions/optimizations and re-use any that don't change. For example, if a long chain of rewrites yields A ==> B ==> ... ==> Z, you could short-circuit that and assume that Z is still what you want to replace A with. Perhaps each time you compiled, you could only rewrite the 'changed' sections of the program, along with a randomly selected portion to explore for further optimization. (Although you may just want to go all the way and write a tool that constantly adds to your equivalence DB in the background, with compilation just a matter of running your extraction heuristics.)

You probably wouldn't want to store every possible rewrite, though, as the space complexity or retrieval cost is likely a bigger deal than the time complexity (you wouldn't load 2+2=4 from a database, you'd just do the addition). But if you restore only a subset of equivalences that are known to be useful, there's no simple way to keep the optimizer from re-exploring the "rejected candidates" (short of shutting off the eclass entirely) - and it's not even clear to me that you'd want to, as a changed program context may mean that M becomes a better replacement candidate for A than Z.

(Boring prerequisites for this approach also include: consistent serialization of terms; avoiding storage of infinite equivalences like `A ==> A + 0 ==> A + 0 + 0 ==> ...` (and possibly of exponential ones like those associativity rules can produce); and strong heuristics to capture "high-value" equivalences that are both useful and cheaper to look up than to re-derive.)
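
To make that concrete, here is a toy sketch of the kind of cross-build cache I have in mind - everything here is hypothetical and has nothing to do with Cranelift's actual internals; it just shows "reuse the previously extracted replacement if the input term is unchanged":

    use std::collections::HashMap;

    /// A tiny IR term, for illustration only.
    #[derive(Clone, Debug, Hash, PartialEq, Eq)]
    enum Term {
        Const(i64),
        Add(Box<Term>, Box<Term>),
        Mul(Box<Term>, Box<Term>),
    }

    /// Hypothetical cross-build cache: input term -> previously extracted "best"
    /// replacement. In practice this would be persisted to disk and keyed by a
    /// stable hash of the serialized term.
    struct RewriteCache {
        chosen: HashMap<Term, Term>,
    }

    impl RewriteCache {
        fn new() -> Self {
            Self { chosen: HashMap::new() }
        }

        /// Reuse a prior extraction if we have one; otherwise run the (expensive)
        /// e-graph build + extraction and remember the result for next time.
        fn optimize(&mut self, t: &Term, run_egraph: impl Fn(&Term) -> Term) -> Term {
            if let Some(hit) = self.chosen.get(t) {
                return hit.clone();
            }
            let best = run_egraph(t);
            self.chosen.insert(t.clone(), best.clone());
            best
        }
    }

    fn main() {
        let mut cache = RewriteCache::new();
        // Stand-in for the real saturation + extraction step.
        let fake_egraph = |t: &Term| match t {
            Term::Add(a, b) if a == b => Term::Mul(a.clone(), Box::new(Term::Const(2))),
            other => other.clone(),
        };
        let t = Term::Add(Box::new(Term::Const(3)), Box::new(Term::Const(3)));
        let first = cache.optimize(&t, fake_egraph);
        let second = cache.optimize(&t, fake_egraph); // cache hit this time
        println!("{first:?} / {second:?}");
    }

The boring prerequisites from the previous paragraph (stable serialization, not re-exploring rejected candidates, deciding which equivalences are worth keeping) are exactly what this sketch glosses over.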


I always thought that it'd be nice to build an expository version of eq-sat that just operated on straight-line code and didn't try to use the complicated e-graph structure (does it still use pi-nodes? I've not looked at the paper in a decade). The major complexity of eq-sat is in the novel extraction heuristics, and the complicated data structures only mask that.

EDIT: Right. Blech. Theta & Phi nodes.


Is there any literature or guidelines on what to do if I'm willing to spend effectively unlimited CPU cycles in return for a more optimized final output?



Isn't it a bit weird that this isn't just the standard? Like imagine if Chrome was optimized with such a superoptimizer; let the optimizer spend a couple hours every month or so when cutting a new release. Surely that has to be worth it?


I'm not a compiler engineer, but I think it's a diminishing returns issue. Modern optimising compilers are already impressive, and they get more impressive year on year, but only quite slowly. We don't see compilers producing code that runs 30% faster than the code generated last year, for instance.

Such improvements can happen with compiler updates, but not when the starting point is a compiler that's already of decent quality. I recall some GPU driver updates from ATi (years ago) delivered pretty drastic performance improvements, but I believe that's because the original drivers were rather primitive.

(Perhaps a drastic improvement to autovectorisation could give a 30% boost, or better, but this would apply only to certain programs.)

You could grant a compute budget 100x that of the typical build set-up, but no one has built a production-ready compiler to take advantage of that, and if they did I suspect the improvements would be unimpressive. They may also run into a 'complexity ceiling' issue, as the compiler would presumably be even more complex than today's ordinary optimising compilers, which are already enormous.

As Filligree says, superoptimisers tend to only be practical for very short programs. They can't be applied to monstrous codebases like Chromium.


The ability of compilers to make code faster improves 12 (twelve) times more slowly than Moore's law: they double a program's speed every 18 years [1].

[1] https://proebsting.cs.arizona.edu/law.html

This might seem discouraging, but it is not - one can keep reaping the benefits of code optimization for twelve times as long after Moore's law stops working.


Something I missed: compilers are a mix of clever theoretical work and hard graft in implementation. Even if you somehow implement a practical compiler that exhausts the state of the art in compiler theory (a multi-million-dollar project in itself), you might still not be where you want to be in terms of optimisation.


But maybe you can superoptimize some hot sections and encode the superoptimizer's findings somewhere. Then the compiler can validate the optimizations and apply them to that particular piece of code for the rest of the program's life, until the preconditions hold.


As I understand it they just aren't very practical. Perhaps you could use a superoptimizer in a targeted way, assuming that:

1. The codebase has hot loops of very short sequences (or can be automatically reshaped into this pattern)

2. The superoptimising compiler can produce code that significantly outperforms the code generated by ordinary optimising compilers

3. A practical superoptimising compiler exists

I can imagine these assumptions may not hold in practice.

> the compiler can validate the optimizations

A compiler transforms code from one representation to another; it doesn't validate arbitrary transformations.

It's not easy to prove that a given fragment of source-code corresponds to some given assembly code. I've heard of only one instance of this being done. [0]

> until the preconditions hold

I'm not sure what you mean by this.

[0] https://entropy2018.sciencesconf.org/data/myreen.pdf


Doing a full-fledged release-this-to-millions-of-users build of Chrome or Firefox already takes on the order of 24 hours (or at least it did when last I checked a few years ago).


Is it just a couple hours? Even just ordering the optimisation passes is already an NP-complete process, so I could easily imagine superoptimisers would take trillions of years for a large program...


I think the wide variety of hardware and complexities of running on a real system sometimes make this problem more difficult.

Even within a given processor microarchitecture (say, just Zen 2, or just Haswell), different CPUs will be running at different frequencies, have different cooling solutions, and be running different microcode releases, all of which will affect which program is the fastest. And this is without considering cache pressure or memory latency, which is also dependent on any other programs the user happens to be running.

Running a superoptimizer for your linear algebra program that runs on your Cray supercluster can give clear gains. Doing the same for every combination of user hardware seems less feasible - you may find output that is a clear win on the machines you tested, but it may well lose out on other machines.


Yes, it is weird.

One thing that would help is to explicitly reimagine programming as characterizing the search space for the compiler, just as the e-graphs work in the article separates generating alternatives from finding the best alternative.



I am pretty sure the existing release process already takes much more than 2 hours.


Do you even understand what supercompilation is? If you understood it, you wouldn't be thinking of a "couple hours every month". Supercompilation on Chrome would require more than a million hours of CPU time.


Have a look at Unison, which uses constraint programming to do optimal code generation.

https://unison-code.github.io/


Agreed - especially when developing, I'd assume the speed of the optimizer matters more than the quality of optimization. I do wonder, though: if LLVM spent less time on that "phase ordering" problem, what would the trade-off between these two factors look like?


LLVM doesn't really spend any runtime solving the phase ordering problem, since the pass pipelines are static. There have been proposals to dynamically adjust the pipeline based on various factors, but those are a ways out from happening.


Slightly off-topic, but if you fancy writing compilers in your free time, Cranelift has a great Rust library[0] for doing code generation - it’s a pleasure to use!

[0]: https://docs.rs/cranelift-frontend/0.105.3/cranelift_fronten...
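
For a taste of what that looks like, here is a rough sketch of building a tiny `add` function in CLIF with cranelift-frontend, adapted from memory of the crate's documented example - exact types and signatures drift between versions, so treat it as illustrative rather than copy-paste ready:

    use cranelift_codegen::ir::{types, AbiParam, Function, InstBuilder, Signature, UserFuncName};
    use cranelift_codegen::isa::CallConv;
    use cranelift_codegen::settings;
    use cranelift_codegen::verifier::verify_function;
    use cranelift_frontend::{FunctionBuilder, FunctionBuilderContext};

    fn main() {
        // Declare the signature: fn(i32, i32) -> i32.
        let mut sig = Signature::new(CallConv::SystemV);
        sig.params.push(AbiParam::new(types::I32));
        sig.params.push(AbiParam::new(types::I32));
        sig.returns.push(AbiParam::new(types::I32));

        let mut func = Function::with_name_signature(UserFuncName::user(0, 0), sig);
        let mut fb_ctx = FunctionBuilderContext::new();
        {
            let mut builder = FunctionBuilder::new(&mut func, &mut fb_ctx);
            let block = builder.create_block();
            builder.append_block_params_for_function_params(block);
            builder.switch_to_block(block);
            builder.seal_block(block);

            // The block parameters are the function arguments.
            let a = builder.block_params(block)[0];
            let b = builder.block_params(block)[1];
            let sum = builder.ins().iadd(a, b);
            builder.ins().return_(&[sum]);
            builder.finalize();
        }

        // Sanity-check the generated CLIF and print it.
        let flags = settings::Flags::new(settings::builder());
        verify_function(&func, &flags).expect("CLIF should verify");
        println!("{}", func.display());
    }

From there you would hand the function to a codegen context (e.g. via cranelift-module with the cranelift-jit or cranelift-object backends) to actually emit machine code.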


I've seen it used for really simple JIT in Advent of Code puzzles.


Why am I not surprised - and why do I even find this amusing? :D


I see that there are many comments on full debug builds, but for me the most important difference is incremental build time when making minor changes. In my opinion, this is what speeds up development iterations.

Here are my build times when making a trivial change to a print statement in a root function, comparing nightly dev vs adding cranelift + mold, for rust-analyzer[0] (347_290 LoC) and gleam[1] (76_335 LoC):

    $ time cargo build
    Compiling rust-analyzer v0.0.0 (/home/user/repos/rust-analyzer/crates/rust-analyzer)
    # nightly
    Finished `dev` profile [unoptimized] target(s) in 6.60s
    cargo build  4.18s user 2.51s system 100% cpu 6.650 total
    
    # cranelift+mold
    Finished `dev` profile [unoptimized] target(s) in 2.25s
    cargo build  1.77s user 0.36s system 92% cpu 2.305 total

    Compiling gleam v1.0.0 (/home/user/repos/gleam/compiler-cli)
    # nightly
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 4.69s
    cargo build --bin gleam  3.02s user 1.74s system 100% cpu 4.743 total

    # cranelift+mold
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.99s
    cargo build --bin gleam  0.71s user 0.20s system 88% cpu 1.033 total
For me this is the most important metric, and it shows a huge improvement (a rough sketch of the cranelift + mold setup is at the end of this comment). Comparing it to Go building Terraform[2] (371_594 LoC), it is looking promising. The comparison is a bit unfair since the Go build is a release build, which is really nice in CI/CD. I love Go compilation times, and I thought it would be nice to compare with another language to show the huge improvements that Rust has made.

  $ time go build
  go build  3.62s user 0.76s system 171% cpu 2.545 total
I was looking forward to parallel front-end[3], but I have not seen any improvement for these small changes.

[0] https://github.com/rust-lang/rust-analyzer

[1] https://github.com/gleam-lang/gleam

[2] https://github.com/hashicorp/terraform

[3] https://blog.rust-lang.org/2023/11/09/parallel-rustc.html

*edit: code-comments & links + making it easier to see the differences
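
For reference, this is roughly the setup I'd expect for the cranelift + mold numbers above (going from memory of the nightly announcement and the usual mold instructions, so double-check the flags against your toolchain; `-fuse-ld=mold` needs a recent gcc or clang, and `mold -run cargo build` is an alternative):

    # one-time: install the Cranelift backend component
    $ rustup component add rustc-codegen-cranelift-preview --toolchain nightly
    # use Cranelift for the dev profile and mold as the linker for this build
    $ CARGO_PROFILE_DEV_CODEGEN_BACKEND=cranelift \
      RUSTFLAGS="-C link-arg=-fuse-ld=mold" \
      cargo +nightly build -Zcodegen-backend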


For completeness, here is the compilation time for a small project from the axum/examples[0] (125 LoC), also comparing nightly dev vs adding cranelift + mold:

    $ time cargo build
    Compiling example-todos v0.1.0 (/home/user/ws/rust/example-todos)
    # nightly
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 1.65s
    cargo build  1.49s user 0.58s system 123% cpu 1.685 total

    # cranelift+mold
    Compiling example-todos v0.1.0 (/home/user/ws/rust/example-todos)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.55s
    cargo build  0.47s user 0.13s system 102% cpu 0.590 total
[0] https://github.com/tokio-rs/axum/tree/main/examples/todos


This is most definitely self-promotion (I am the author), but also extremely relevant if you are looking at incremental build speed-ups: https://news.ycombinator.com/item?id=39743431


Tried out the instructions from the article on a tiny Bevy project, and compared it to a "normal" build:

> cargo build --release 23.93s user 22.85s system 66% cpu 1:09.88 total

> cargo +nightly build -Zcodegen-backend 23.52s user 21.98s system 68% cpu 1:06.86 total

Seems just marginally faster than a normal release build. Wonder if there is something particular with Bevy that makes this so? The author of the article mentions 40% difference in build speed, but I'm not seeing anything near that.

Edit: just realized I'm caching my release builds with sccache and a local NAS, hence the release builds being as fast as Cranelift+debug builds. Trying it again with just debug builds and without any caching:

> cargo +nightly build 1997.35s user 200.38s system 1878% cpu 1:57.02 total

> cargo +nightly build -Zcodegen-backend 280.96s user 73.06s system 657% cpu 53.850 total

Definitely an improvement once I realized what I did wrong, about half the time spent compiling now :) Neat!


Maybe your build is not limited by code generation, which seems like the only thing that changed here. There was a good thread recently about the variation in what slows down compilation:

https://news.ycombinator.com/item?id=39721922


Edited my comment now, forgot I was caching the release builds with sccache so wasn't actually compiling all the units, but fetching a lot of them instead :/


Which suggests that maybe cargo should just ship with sccache turned on by default, so that the "normal" experience of developing Rust is this seamless. Intelligent caching should always win. I’d like to see rustc get sccache integration that works at the MIR level too so that changing something that doesn’t change the code gen meaningfully still gets a cached response (e.g. changing some comments/whitespace, moving functions around, etc.).

Even cooler would be if LLVM itself could also cache internal expensive parts of compilation and optimization across process instances. That would make a huge impact in cutting down incremental builds.


> I’d like to see rustc get sccache integration that works at the MIR level too so that changing something that doesn’t change the code gen meaningfully still gets a cached response (e.g. changing some comments/whitespace, moving functions around, etc).

Isn't incremental compilation already like that?


A cache on a NAS limits cache speed to network speed: 1 Gbit/s is only about 125 MB/s, while an NVMe disk can do 4000+ MB/s. So consider running sccache locally, or just disable it, for better build performance.


Is that Bevy project open-source by any chance? I'd love to try it out myself.


Bevyengine is open source and very findable by most search engines.

Here is a direct link: https://github.com/bevyengine/bevy


Ah yep, I'm aware of that. Bevy is a framework/engine that devs build on top of. I was referring to the project that OP has built on top of Bevy.


You can use different backends and optimization for different crates. It often makes sense to use optimized LLVM builds for dependencies, and debug LLVM or even Cranelift for your own code.

See https://www.reddit.com/r/rust/comments/1bhpfeb/vastly_improv...


I would not expect this to work without issues, as Rust does not support a stable binary ABI across different compiler versions. How can we be sure that the two "codegen backends" will always be implementing the same binary ABI?


Because the ABI is defined and implemented by shared code in the runtime and compiler. There is no need for cross version compatibility between the two backends, only compatibility within the same version. This isn't particularly new either, FWIW. The Glasgow Haskell Compiler also has an unstable ABI that is not standardized, but LLVM compiled code interoperates seamlessly with code generated by the non-LLVM compiler backend (the mechanism by which that is achieved is likely different though.)


In theory yes, this could be problematic with alternative Rust implementations, but it works fine and many people are using it in this case. We can be sure LLVM and Cranelift codegen backends will be implementing the same binary ABI, because binary ABI decisions are made in the shared code.


Presumably the LLVM and Cranelift backends in the same rustc version generate code with the same ABI, and it would be a bug if that wasn't the case.


If you do the compilation locally, you know the structure of the ABI for the compiled code. You know that regardless of the backend you are using. At this stage, no functions with generics are compiled, only what is non-generic.

What is left is to account for the ABI in any monomorphizations that occur at the boundary, i.e. when your own types are used to instantiate generic functions from the dependency.

When the compiler creates such a monomorphic variant and lowers it, it can provide the necessary details of the ABI.
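
A toy illustration of the boundary being described - the crate split exists only in the comments, and all names are made up:

    // Imagine this part lives in a dependency that was compiled with the LLVM backend.
    pub fn concrete(x: u32) -> u32 {
        // Non-generic: its machine code was already emitted into the dependency, so
        // calling it only requires both backends to follow the same (shared) ABI rules.
        x + 1
    }

    pub fn generic<T: std::fmt::Display>(t: T) -> String {
        // Generic: no machine code exists yet; the body is shipped as crate metadata.
        format!("{t}")
    }

    // Imagine this part is "your" crate, compiled with the Cranelift backend.
    struct MyType(u64);

    impl std::fmt::Display for MyType {
        fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
            write!(f, "MyType({})", self.0)
        }
    }

    fn main() {
        let _ = concrete(41); // call into already-compiled code
        // generic::<MyType> is monomorphized *here*, so the backend compiling this
        // crate emits that instantiation itself and knows its full layout and ABI.
        let _ = generic(MyType(7));
    }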


I suppose that might depend on how the ABI is represented internally. If the ABI is fully described by (one of?) the lowered IR the backends consume (e.g., does MIR fully describe struct layouts/etc.) then I wouldn't expect there to be any issues outside "regular" bugs since all the relevant information would be contained in the inputs.


It is done in the shared code, see https://rustc-dev-guide.rust-lang.org/backend/backend-agnost... for details.


Struct layout happening in generic code makes sense (and is actually required since you can reference it in `const`s). It seems unlikely that they made function calling fully backend agnostic, since it'd require assigning the registers for parameters and result in generic code, and not in the backend.

I'd expect the generic code to lower the function parameters to primitive types (pointers, ints, floats, etc.), but the backend would then distribute those over registers and/or stack. Keeping that compatible would still require an (unstable) specification implemented by all compatible backends.

Unwinding might be tricky as well.


> I'd expect the generic code to lower the function parameters to primitive types (pointers, ints, floats, etc.), but the backend would then distribute those over registers and/or stack

Not sure about that. Calling conventions are defined by the platform ABI, which is what the backend implements, so any conforming backend should still be mutually invokable. That's why a Rust program can expose an ABI-stable C API, and why you can call GCC-built libraries from an LLVM-built executable. The lowering of parameters to registers is constrained by this, because the intermediate representation understands which values are the parameters of a function and then follows the platform ABI when lowering the calls.


Ah, I think that's about what I had in mind. Just didn't know what to look for. Thanks!


This is for local work, so you use the same compiler for your dependencies and for your own code. Only the codegen backend differs.


The Equality Graphs link [0] led me to discover ESC/Java [1] [2]. Has anyone actually tried or had any success with ESC/Java? It's piqued my curiosity to compare it with SpotBugs (formerly known as FindBugs).

[0] https://en.wikipedia.org/wiki/E-graph

[1] https://en.wikipedia.org/wiki/ESC/Java

[2] https://www.kindsoftware.com/products/opensource/escjava2/


Very excited for Cranelift debug builds to speed up development iteration - in particular for WASM/frontend Rust, where iteration speed is competing with the new era of Rust tooling for JS, which sometimes lands builds in under a second (iteration speed in frontend work is crucial).

Sadly, it does not yet support ARM macOS, so us M1-3 users will have to wait a bit :/


But usually - at least for me - I don't build an executable while iterating. And if I do, I try to set up the feature flags so the build is minimal for the tests I am running.


As someone who doesn't work in the frontend much, what tools are you talking about?


There's been a recent trend of rewriting tools in Rust with insane speed gains:

- Rspack (webpack compatible)

- Rolldown (rollup compatible)

- Turbopack

- Oxc (linter)

- Biome (linter and more)

- Bun (written in Zig, does crazy fast bundling)

There are several parts here that are crucial to frontend development.

For production you need:

- Minification of source code

- Bundling of modules and source code into one JS file, or splitting them into multiple files for lazily loading only the parts you need

- Transforming various unsupported high-level constructs into something older target browsers support

- Typechecking/compiling, or stripping TypeScript if that’s in use

Build times could easily go to 10-20 minutes with older tools.

The development loop also suffers: here you'd want the loop from saving your change to seeing it in the UI to be almost instant. Anything else means you'll have to develop crutch methods to work around it (imagine moving and styling components, only to have to sit and wait during each small incremental change).


Thanks for your informative post!


Does anyone by chance have benchmarks of runtime (so not the compile time) when using Cranelift? I'm seeing a mention of "twice as slow" in the article, but that's based on data from 2020. Wondering if it has substantially improved since then.


There are some benchmarks of Cranelift-based Wasm VMs (Wasmtime) vs. LLVM-based Wasm VMs here: https://00f.net/2023/01/04/webassembly-benchmark-2023/

The (perhaps slightly exaggerated but encouraging to me at least!) money quote there is:

> That’s right. The cranelift code generator has become as fast as LLVM. This is extremely impressive considering the fact that cranelift is a relatively young project, written from scratch by a very small (but obviously very talented) team.

In practice, anywhere from 10%-30% slower is maybe reasonable to expect. Compiler microbenchmarks are interesting because they're very "quantized": for any particular benchmark, often either you get the right transforms and achieve the correct optimized inner loop, or you don't. So the game is about getting more and more cases right, and we're slowly getting there.

(disclosure: I was tech lead of Cranelift in 2020-2022)


> JIT compilers often use techniques, such as speculative optimizations, that make it difficult to reuse the compiler outside its original context, since they encode so many assumptions about the specific language for which they were designed.

> The developers of Cranelift chose to use a more generic architecture, which means that Cranelift is usable outside of the confines of WebAssembly.

One would think this has more to do with Wasm being the source language, as it's fairly generic (compared to JS or Python), so there are no specific assumptions to encode.

Great article though. It's quite interesting to see e-matching used in compilers; it took me down memory lane (and I found myself cited on the Wikipedia page for e-graphs).


Can anyone explain why Cranelift is expected to be faster than LLVM? And why those improvements can't also be applied to LLVM?


It uses E-graphs, as explained in the article. That’s a completely different approach to compilation, and to use it in LLVM you’d have to rewrite LLVM.

It’s also unlikely that the resulting code will ever be as fast as a traditional compiler’s output. It’s great for development, but I wouldn’t use it in a release build.


I don't believe the performance difference between Cranelift and LLVM really has much to do with E-graphs. Cranelift had this same performance profile (faster than LLVM but generating slower output) before they switched to E-graphs.

Rather, it's about how much effort Cranelift puts toward optimizing its output - it has fewer, less involved passes (regardless of whether those "passes" are expressed as part of the E-graph framework). More subtly, this means it is also written to generate "okay" code without as much reliance on those passes, while LLVM on the other hand generates a lot of naive code at -O0, which contributes to its slower compile times in that mode.


Right, it's about algorithmic tradeoffs throughout. A good example I wrote about is here: https://cfallin.org/blog/2021/01/22/cranelift-isel-2/ where we use a single-pass algorithm to solve a problem that LLVM has a multi-part fixpoint loop to solve.

Most CPU time during compile is in the register allocator and I took a really careful approach to optimization when I rewrote it a few years ago (more details https://cfallin.org/blog/2022/06/09/cranelift-regalloc2/). We generally try to pay close attention to algorithmic efficiency and avoid altogether the fixpoint loops, etc that one often finds elsewhere. (RA2 does backtrack and have a worklist loop, though it's pretty minimal, and also we're planning to add a single-pass mode.)

(disclosure: I was Cranelift tech lead in 2020-2022)


> why those improvements can't also be applied to LLVM?

LLVM is a huge and bloated ecosystem (it has tons of tools and millions of LoC). The code base itself is also pretty old - or rather, the project shows its age - so there's a lot of legacy code. Another aspect is that it's hard to try new or radical things because of how big the project is.


X is too bloated! We need Y, which is leaner and more tailored to our use-case.

(years pass...)

Y is too bloated! We need Z...


After years of trying (IIRC there was a funded effort from Intel), LLVM still can't parallelize code generation at the function level. Cranelift has had function-level code generation parallelism from day 1.


I have always been told that LLVM doesn't parallelize codegen because the right place for that is in the build system: just do make -j N (or use ninja, which parallelizes automatically). And when you're building lots of source files in a normal project that's definitely true.

It's different when you're a wasm VM that receives big chunk of wasm that contains many source files, and you get it all at once. And for that reason pretty much every wasm compiler parallelizes codegen: Cranelift as you said, and also V8, SpiderMonkey, JSC, and not just VMs but also optimizers like Binaryen. It's a crucial part of their design.

For LLVM, the right design may be what it has today: single-core codegen. (LTO is the main issue there, but that's what thin LTO is for.)


I don't get it. Is your point that new software shouldn't be written to fulfill new usecases? We should always bolt new patterns and architecture onto existing codebases?


Yes. But in the meantime one can experiment with new things and invalidate some assumptions. I don't know about the LLVM project specifically, but it's quite a bit more difficult to try different ideas in a big, old organization and codebase than it is in a greenfield project. Maybe they won't be successful; maybe they will move the whole field ahead.


IIRC Zig wants to replace LLVM for similar reasons; it's also hard to debug.


I've heard compiler writers complain that LLVM's API is inherently slow. Something about using runtime polymorphism everywhere. I don't know how much there is to this though, I haven't investigated it myself.


Is there no native support for M1-M3 Macs currently, and no Windows support either?

Unclear what the roadmap is there, as this update from the most active contributor is inconclusive:

> Windows support has been omitted for now. And macOS currently only supports x86_64, as Apple invented their own calling convention for arm64, for which variadic functions can't easily be implemented as a hack. If you are using an M1 processor, you could try installing the x86_64 version of rustc and then using Rosetta 2. Rosetta 2 will hurt performance though, so you will need to try whether it is faster than the LLVM backend with arm64 rustc.

Source is from Oct 2023 so this could easily be outdated, but I found nothing in the original article: https://bjorn3.github.io/2023/10/31/progress-report-oct-2023...




FTA: “Because optimizations run on an E-graph only add information in the form of new annotations, the order of the optimizations does not change the result. As long as the compiler continues running optimizations until they no longer have any new matches (a process known as equality saturation), the E-graph will contain the representation that would have been produced by the optimal ordering of an equivalent sequence of traditional optimization passes […] In practice, Cranelift sets a limit on how many operations are performed on the graph to prevent it from becoming too large.”

So, in practice, the order of optimizations can change the result? How easy is it to hit that limit?


For example, simplifying arithmetic expressions and propagating constants need to be combined (discovering constants can allow further simplification, and simplification can produce constant results).

With destructive updates you have to decide which variant has the last word (e.g. 2*a or a+a or a<<1), while an equality graph collects progress without such heuristic and imprecise choices, allowing equality saturation.
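
A toy sketch of the difference (nothing to do with Cranelift's actual aegraph representation): keep every equivalent form of `2*a` around, and only choose between them at extraction time, with whatever cost model you like:

    // Toy "eclass": all known-equivalent forms of an expression live together,
    // and nothing is destroyed when a rewrite fires - rewrites only add members.
    #[derive(Debug)]
    enum Expr {
        Var(&'static str),
        Const(i64),
        Add(Box<Expr>, Box<Expr>),
        Mul(Box<Expr>, Box<Expr>),
        Shl(Box<Expr>, u32),
    }

    // A made-up cost model standing in for "how expensive is this on the target".
    fn cost(e: &Expr) -> u32 {
        match e {
            Expr::Var(_) | Expr::Const(_) => 0,
            Expr::Shl(a, _) => 1 + cost(a),
            Expr::Add(a, b) => 2 + cost(a) + cost(b),
            Expr::Mul(a, b) => 4 + cost(a) + cost(b),
        }
    }

    fn main() {
        let a = || Box::new(Expr::Var("a"));
        // One equivalence class for `2*a`; the rewrites never overwrite each other.
        let class = vec![
            Expr::Mul(Box::new(Expr::Const(2)), a()), // original form
            Expr::Add(a(), a()),                      // strength-reduced
            Expr::Shl(a(), 1),                        // shifted
        ];
        // Extraction: pick the cheapest member under the cost model.
        let best = class.iter().min_by_key(|e| cost(e)).unwrap();
        println!("picked {best:?} with cost {}", cost(best));
    }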


Any fresh compilation time benchmarks and comparisons to LLVM?


A few reports comparing Cranelift to LLVM from this day-old reddit thread [1]:

- 29.52s -> 24.47s (17.1%)

- 27s -> 19s (29.6%)

- 11.5s -> 8.4s (26.9%)

- 37.5s -> 29.6s (28.7%) - this measurement from TFA.

To put these numbers in context, all the perf improvements over the last 4 years have made the compiler faster on a variety of workloads by 7%, 17%, 13% and 15%, for an overall speed gain of 37% over those 4 years. [2] So one large change providing a 20-30% improvement is very impressive.

When you add that to the parallel frontend [3] and support for linking with LLD [4], Rust compilation could be substantially faster by this time next year.

[1] - https://old.reddit.com/r/rust/comments/1bgyo8a/try_cranelift...

[2] - https://nnethercote.github.io/2024/03/06/how-to-speed-up-the...

[3] - https://blog.rust-lang.org/2023/11/09/parallel-rustc.html

[4] - https://github.com/rust-lang/rust/issues/39915


Thanks, that's really great news. The only thing I wish for now is aarch64 support on Apple Silicon


The very last paragraph:

> A full debug build of Cranelift itself using the Cranelift backend took 29.6 seconds on my computer, compared to 37.5 with LLVM (a reduction in wall-clock time of 20%). Those wall-clock times don't tell the full story, however, because of parallelism in the build system. Compiling with Cranelift took 125 CPU-seconds, whereas LLVM took 211 CPU-seconds, a difference of 40%. Incremental builds — rebuilding only Cranelift itself, and none of its dependencies — were faster with both backends. 66ms of CPU time compared to 90ms.


Not quite fresh, but

> A paper from 2020 [0] showed that Cranelift was an order of magnitude faster than LLVM, while producing code that was approximately twice as slow on some benchmarks.

[0] https://arxiv.org/pdf/2011.13127.pdf


The changes in Cranelift since 2020 have been quite significant so I would not put any trust in those benchmarks.


From the article:

> A full debug build of Cranelift itself using the Cranelift backend took 29.6 seconds on my computer, compared to 37.5 with LLVM (a reduction in wall-clock time of 20%)

That seems like a much smaller difference than I would have expected.


Those numbers are for the entire compile time: parsing, typechecking, MIR building + MIR optimization, and linking.

I don't have numbers handy, so it's hard to say how much faster Cranelift makes the codegen portion, but it gets into Amdahl's law territory.


And there are faster linkers than the default.

…why is it still the default?


Because the linker provided by the system itself is the one that's guaranteed to be the most broadly compatible. For a toolchain to begin distributing an alternative linker is a lot of work on every supported platform (future versions of Rust may sidestep this issue by making LLD the default on certain platforms, which they get "for free" by dint of shipping alongside LLVM). But also, because there are about ten people on planet Earth qualified to write a production-grade linker, which is why LLD still isn't broadly recommended yet.


Compatibility with niche usecases


That sounds like a good reason to keep the existing one as an opt-in possibility.


Very interesting article. I had not heard of equality graphs before. Here's some pretty good background reading on the subject: https://inst.eecs.berkeley.edu/~cs294-260/sp24/2024-03-04-eq...


It sucks that there is no way to use Cranelift from outside of Rust to create your own toy language. I would have loved to use Cranelift in a toy compiler, but I am not ready to pay the Rust price of complexity.


As far as I know, you should be able to generate textual CLIF files and feed them to Cranelift: https://github.com/bytecodealliance/wasmtime/blob/main/crane...
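
For flavor, a minimal function in the textual CLIF format looks roughly like this (hand-written from memory, so check it against the docs):

    function %add(i32, i32) -> i32 {
    block0(v0: i32, v1: i32):
        v2 = iadd v0, v1
        return v2
    }

If I remember right, files like this can also be fed to the `clif-util` tool in the Wasmtime repository without writing any Rust.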


Would using cranelift frontend[1] with PyO3[2] be an option for you?

[1]: https://docs.rs/cranelift-frontend/0.105.3/cranelift_fronten... [2]: https://pyo3.rs/v0.15.1/


There is always QBE if you need something simpler.


Would it be naive to assume a general compile-time reduction of 20% for all Rust projects by swapping llvm with cranelift?


There are still projects out there that won't hit 20%, or even any noticeable percentage, if the bottleneck isn't codegen (for example). So the "all" part can be wrong, but besides that, 20% is a good number.


IMO Rust debug builds are fast enough, but it's nice to see things are going to get even faster! Hopefully this will eventually make `rust-analyzer` faster and more efficient.


Really looking forward to the death of non-e-graph-based compilation :)


I've tried to look into this a couple of times, including today. To me this looks a lot like unification? But I don't really understand how, operationally, one gets from equivalence classes to instructions. Is there an e-graphs-for-dummies writeup?


The approach that Cranelift uses is what we call the "aegraph" (talk I gave about it: slides https://cfallin.org/pubs/egraphs2023_aegraphs_slides.pdf, video https://vimeo.com/843540328). The basic idea is that eclasses hold sets of operator expressions (think sea-of-node IR), and we keep that alongside the CFG with a "skeleton" of side-effecting ops. We build that, do rewrites, then extract out of it back to a conventional CFG-of-basic-blocks for lowering. The "equivalence" part comes in when doing rewrites: the main difference between an egraph and a conventional sea-of-nodes IR with rewrite rules is that one keeps all representations in the equivalence class around, then chooses the best one later.

We had to solve a few novel problems in working out how to handle control flow, and we're still polishing off some rough edges (search recent issues in the repo for egraphs), but we're mostly happy with how it turned out!

(Disclosure: tech lead of CL for a while; the e-graphs optimizer is "my fault")


If by instructions you mean machine instructions, you don't; it's used for internal optimization passes only.


That helps - so this is really an alternative to pushing analysis attributes around a dataflow graph.


Finally - looking forward to wider adoption.


[flagged]


Seriously, spam at HN? That is pretty unusual to see.



If you have showdead set to "yes" in your settings, you'll see a lot more spam like this. You also get to see comments that add to discussion but were downvoted to oblivion because people don't like certain things being questioned.


It wasn't dead when I replied. I know this; I show dead since I often find my comments flagged or killed because I have some unpopular opinions. Just look at my history and view the sea of downvotes I've collected. ;)


I feel like I'm reading an advertising blurb when reading that article.

I wish them every success, but I hope for a more balanced overview of pros and cons rather than gushing praise at every step...



