It's not really fair to compare async/await systems with Go and Java.
Go and Java with lightweight threads provide a full-blown system, without any limitations. You don't have to "color" your code into async and blocking functions, everything "just works". Go can even interrupt tight inner loops with async pre-emption.
The downside is that Go needs around 2k of stack minimum for each goroutine. Java is similar, but has a lower per-thread constant.
The upside is that Go can be _much_ faster due to these contiguous stacks. With async/await each yield point is basically isomorphic to a segmented stack segment. I did a quick benchmark to show this: https://blog.alex.net/benchmarking-go-and-c-async-await
On the contrary I think it is critical to compare them. You can then decide if the tradeoff is worth it for your use case.
Even at a million tasks Go is still under 3 GiB of memory. So that is roughly 3KiB of memory overhead per task. That is likely negligible if your tasks are doing anything significant and most people don't need to worry about this number of tasks.
So this comparison shows that in many cases it is worth paying that price to avoid function colouring. But of course there are still some use cases where the price isn't worth it.
One thing I wanted to add is that in golang, you end up passing context.Context to all asynchronous functions to handle cancellations and timeouts, so you “color” them regardless. Java folks with structured concurrency have the right idea here.
The correct interpretation here is that golang chose explicit context passing where java chose implicit.
It's similar to explicit DI vs implicits.
The function coloring metaphor doesn't make sense here, since the calling convention is the same and there are no extra function keywords (`async` vs non-async).
This isn't true. I have use cases that don't require cancellations or timeout. The tasks I'm running don't involve the network, they either succeed or error after an expensive calculation.
This is an interesting post. My understanding: Most of the use case for async code is I/O bound operations. So you fire off a bunch of async I/O requests and wait to be notified. Logically, I/O requests normally need a timeout and/or cancel feature.
However, you raise a different point:
The tasks I'm running don't involve the network, they either succeed or error after an expensive calculation.
This sounds like CPU-bound, not I/O-bound. (Please correct me if I misunderstand.) Can you please confirm if you are using Go or a different language? If Go, I guess it still makes sense, as green threads are preferred over system threads. If not Go, it would be nice to hear more about your specific scenario. HN is a great place to learn about different use cases for a technology.
I think I just responded too hastily. I am working in Go. There is file IO going on in addition to the calculation (which because of a NAS or whatever could also be network IO). As a practical matter I had never felt the need to offer cancellation or timeout for these use cases, but I probably should, so mea culpa.
What's the point of multiplexing tasks on a particular core if the tasks don't do any I/O? Then it would be strictly faster to execute the tasks serially across as many cores as possible.
It's not _quite_ the same: you can't call async code from a sync context (hence the color issue), but I can always pass a "context.Background()" or such as a context value if I don't already have one.
> you can't call async code from a sync context (hence the color issue), but I can always pass a "context.Background()" or such as a context value if I don't already have one.
You can always pass context.Background, in this metaphor creating a new tree of color.
You can always call "runtime.block_on(async_handle)", in this metaphor also creating a new tree of color.
You can always pass the async executor to the sync code and spawn async coroutines into it. And you can keep it in a global context as well to avoid parameters. E.g. there is `Handle::current()` for exactly this purpose in Tokio. Function coloring is just a theoretical disadvantage - people like to bring it up in discussions, but it almost never matters in practice, and it is even not universally considered a bad thing. I actually like to see in the signature if a function I'm calling into can suspend for an arbitrary amount of time.
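For what it's worth, a minimal sketch of that escape hatch (assuming tokio 1.x with the rt-multi-thread and time features; the function names are mine, not anyone's real code):

    use std::time::Duration;
    use tokio::runtime::{Handle, Runtime};

    // Plain sync function that suspends on async work via an explicitly
    // passed runtime handle. It must run on a thread that is not itself
    // driving the runtime, otherwise block_on panics.
    fn sync_work(handle: &Handle) {
        handle.block_on(async {
            tokio::time::sleep(Duration::from_millis(10)).await;
        });
    }

    fn main() {
        let rt = Runtime::new().unwrap();
        let handle = rt.handle().clone();
        // The call site stays "uncolored": it's just a normal function call.
        std::thread::spawn(move || sync_work(&handle)).join().unwrap();
    }

The flip side is that block_on parks an OS thread, so it's an escape hatch rather than a way to make the coloring disappear.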
It's not the same because you can have interleaving functions which don't know about context.
Say you have a foreach function that calls each function in a list. In async-await contexts you need a separate version of that function which is itself async and calls await.
With context you can pass closures that already have the context applied.
No, because if the function that includes those lines is itself async, it will now block, while the equivalent Go coroutine will still preserve the stackful asyncness. I.e. you can't close over the yield continuation in Rust, while it is implicitly done in Go.
For a more concrete example: let's say you have a generic function that traverses a tree. You want to compare the leaves of two trees without flattening them, by traversing them concurrently with a coroutine [1]. AFAIK in Rust you currently need two versions of traverse, one sync and one async, as you can neither close over nor abstract over async. In Go, where you have stackful coroutines, this works fine, even when closing over Context.
So yes, in some way Context is a color, but it is a first class value, so you can copy it, close over it and abstract over it, while async-ness (i.e. stackless coroutines) are typically second class in most languages and do not easily mesh with the rest of the language.
[1] this is known as the "same fringe problem" and it is the canonical example of turning internal iterators into external ones.
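To make the duplication concrete, a minimal Rust sketch (assuming tokio's full feature set; the names are illustrative) of a sync combinator and the separate async twin it needs:

    use std::future::Future;

    // Sync version: takes ordinary closures.
    fn for_each_sync<T>(items: Vec<T>, mut f: impl FnMut(T)) {
        for item in items {
            f(item);
        }
    }

    // Async version: has to be its own async function that awaits each
    // call; it cannot be the same function as the sync one.
    async fn for_each_async<T, F, Fut>(items: Vec<T>, mut f: F)
    where
        F: FnMut(T) -> Fut,
        Fut: Future<Output = ()>,
    {
        for item in items {
            f(item).await;
        }
    }

    #[tokio::main]
    async fn main() {
        for_each_sync(vec![1, 2, 3], |x| println!("sync {x}"));
        for_each_async(vec![1, 2, 3], |x| async move { println!("async {x}") }).await;
    }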
How do you figure? What’s requiring my async function to take a context in the first place? Even if it takes one, what’s stopping it from calling a function that doesn’t take one? Similarly, what’s stopping a function that doesn’t take a context from calling one which does?
Erlang / Elixir processes use about 0.5 kB per process, with a preemptive scheduler on the BEAM to make sure work is consistent, and isolated heap space for each one to avoid stop-the-world garbage collection.
It’s as good as it gets for this type of thing as it was designed for it from the ground up.
This was the first thing that occurred to me when I saw the error - should have been pretty straightforward to tweak the settings to give a clearer picture.
Thanks! Looked back at the results and that's about what I'd expect. Erlang/Elixir isn't a silver bullet, but amazing in its own way for other reasons. Ultimate max performance and memory efficiency definitely isn't one of them :).
> Erlang / Elixir processes use about 0.5kb per process
Last I checked it used about 2.5K. Beam reserves around 300 words per process (depends on exact options), and a word is 8 bytes.
You can get it lower (obviously at a cost as soon as you start using the process and need space), but nowhere near 512 bytes; just the process overhead is around 800 bytes.
I don't know about C#, but Rust async/await doesn't allocate anything on the heap by itself. So it is not a universal property of all async/await implementations, contrary to what you imply in your blog post.
Yeah and I think NAOT still embeds the same(?) runtime GC into the native binary. So, for memory usage, I would expect it to be nearly/exactly the same.
I may remember wrongly here, but wasn't ValueTask merely the go-to option when it's expected that the task would finish synchronously most of the time? I think for the really async case with a state machine you just end up boxing the ValueTask and ending up with pretty much the same allocations as Task, just in a slightly different way.
What Rust does is remove one layer of allocations by using the Future itself to store the contents of the state machine (basically, the stack frame of the async function).
It removes all allocations within a task, except for recursive calls. This results in the entire stack of a task being in one allocation (if the task is not recursive, which is almost always the case, for exactly this reason), exactly like you describe the advantage of Go. And unlike Go, that stack space is perfectly sized and will never need to grow, whereas Go is required to reallocate the stack if you hit the limit, an expensive operation that could occur at an unknown point in your program.
Again, no magic. The Future trait impl stores all of the state machine that Rust can see statically. If you have complicated code that needs dynamic dispatch, or if you need recursive calls, Rust will also have to allocate.
This is not at all that different from Go, except that Go preallocates stack without doing any analysis.
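A sketch of the recursion case (hypothetical function; tokio's full feature set assumed for the timer): the recursive call has to be boxed because the state machine would otherwise have unbounded size.

    use std::future::Future;
    use std::pin::Pin;
    use std::time::Duration;

    // Recursive async: the state machine would be infinitely sized if the
    // recursive call were stored inline, so each level is boxed (one heap
    // allocation per level of recursion).
    fn countdown(n: u32) -> Pin<Box<dyn Future<Output = ()> + Send>> {
        Box::pin(async move {
            if n > 0 {
                tokio::time::sleep(Duration::from_millis(1)).await;
                countdown(n - 1).await;
            }
        })
    }

    #[tokio::main]
    async fn main() {
        countdown(3).await;
    }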
Actually... I can fix that! I can use Go's escape analysis machinery to statically check during the compilation if the stack size can be bounded by a lower number than the default 2kb. This way, I can get it to about ~600 bytes per goroutine. It will also help to speed up the code a bit, by eliding the "morestack" checks.
It can be crunched a bit more, by checking if the goroutine uses timers (~100 bytes) and defers (another ~100 bytes). But this will require some tricks.
I'm NOT trying to show that Go is faster than async/await or anything similar. I'm showing that nested async/await calls are incredibly expensive compared to regular nested function calls.
You need to add the go keyword to change a normal function call into a goroutine.
If you removed async/await and Task/Return from the C# code example, it would perform pretty much the same as Go.
If you want to show that async/await calls are expensive, then you should have shown two code samples of C#, one with async/await and one without.
Or could have done the same for Go, show one example with goroutines, and one without.
But I think everyone already knows that async/await and goroutines have their costs.
The problem is more that you are comparing Go without goroutines (without its allocation costs) to a C# example with a poor implementation of async/await.
I'm wondering that as well. The C# code seems really unrealistic though, using a lot of tasks for CPU-bound work.
A fair comparison would at least need to involve some degree of IO.
Does Go maybe automatically parallelize everything it can? That would be one potential answer to the initial question.
The author doesn't have the understanding to have even gotten close to the capability for nuance you're asking for. They just copied code from ChatGPT and ran it and timed it and made graphs and it somehow got on HN.
Racket has full-blown green threads (virtual threads)[1] that serve as an alternative to the async/await paradigm. It also supports use of native (OS) threads[2].
C++ has now both async (built-in) and multiple flavors of stackful coroutines (as external libraries). You can run both on top of ASIO so that you can measure purely difference of the coroutine implementation as opposed to the runtime.
With async/await (which I think C# also uses) you actually unwind the entire stack and start a new one every time you do async callbacks. With fibers / green threads / whatever you instead store stacks in memory.
It took me a while to figure this out, thanks to articles I came across (such as “What Color Is Your Function?”).
There are memory / speed trade-offs, as per usual. If you have a lot of memory, and can keep all the stacks in memory, then go ahead and do that. It will save on all the frivolous construction / destruction of objects.
Having said that, my own experience suggests that when you have a startup, you should just build single-threaded applications that clean up after every request (such as with PHP) and spawn many of them. They will share database pool connections etc. but for the most part it will keep your app safer than if they all shared the same objects and process. The benchmarks say that PHP-FPM is only 50% slower than Swoole, for instance. So why bother? Starting safe beats a 2x speed boost.
And by the way, you should be building distributed systems. There is no reason why some client would have 1 trillion rows in a database, unless the client themselves is a giant centralized platform. Each client should be isolated and have their own database, etc. You can have messaging between clients.
If you think this is too hard, just use https://github.com/Qbix it does it for you out of the box. Even the AI is being run locally this way too.
I missed my opportunity to reply to your comment, but I really appreciate it, and I wanted to find a way to get this back to you. The comment in question:
"Well, I was one of the engineers that made the change :) I'm not sure how much I can tell, but the public reason was: "to make pricing more predictable".
Basically, one of the problems was customers who just set the spot price to 10x of the nominal price and leave the bids unattended. This was usually fine, when the price was 0.2x of the nominal price. But sometimes EC2 instance capacity crunches happened, and these high bids actually started competing with each other. As a result, customers could easily get 100 _times_ higher bill than they expected."
There was more to it than that, but I figure that's a good enough reference point.
Thank you for these improvements. It doesn't change anything, in terms of how much savings I can get by following the latest generations and exotic instance-types, but it does help with the reliability of my workloads.
It's been a huge benefit to me, personally, that I can provide some code that enables the potentiality of servers dying, with the benefit of 80% cost savings without using RIs.
You can also try another product of my former team: https://aws.amazon.com/savingsplans/ - it's similar to RI, but cheaper because it doesn't provide an ironclad guarantee that the instance will be available at all times. It's still a bit more expensive than spot, but not by much.
> You don't have to "color" your code into async and blocking functions
Unpopular opinion: this is a good idea. It encourages you to structure your code in a way such that computation and I/O are separated completely. The slight tedium of refactoring your functions into a different color encourages good upfront design to achieve this separation.
Good upfront design? This stinks of implementation detail leakage affecting high-level design, which should be a cardinal sin.
One should design based on constraints that best match the problem at hand, not some ossified principle turned "universal" that only really exists to mask lower-level deficiencies.
When the implementation detail involves whether or not the function will perform I/O, it is better to let that leak.
Excessive hiding of implementation detail is what leads to things like fetching a collection of user IDs from a database and then fetching each user from an ID separately (the 1+N problem). Excessive hiding of implementation detail is what leads to accidental O(N^2) algorithms. Excessive hiding of implementation detail is what leads to most performance problems.
You don't necessarily know ahead of time if an async or a sequential strategy is better. With colored functions you kind of have to pick ahead of time and hope you picked right or pay a big effort penalty to do a side by side comparison.
async/sync and sequential/parallel are orthogonal concerns. You can write sync code which does work in parallel (on different threads/cores) for something like a numerical algorithm that can be parallelized in some way. Deciding whether something is sync or async is about whether it needs to suspend (in practice, mostly whether it needs to do IO) which is much easier to understand up front. Sometimes it changes of course, in which case you have to do some refactoring. But in a decade of programming professionally in Scala and Rust (both of which have "colored" functions) I can count on one hand the number of times where I had to change something from sync to async and it took more than a few minutes of refactoring to do it.
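For illustration, a sketch of sync-yet-parallel code in Rust using scoped OS threads (the function name and thread count are mine; no async involved):

    // Plain sync code that still uses all cores: the work is split across
    // OS threads with std's scoped threads, no async/await anywhere.
    fn parallel_sum(data: &[u64], n_threads: usize) -> u64 {
        let chunk_len = (data.len() / n_threads).max(1);
        std::thread::scope(|s| {
            let handles: Vec<_> = data
                .chunks(chunk_len)
                .map(|chunk| s.spawn(move || chunk.iter().sum::<u64>()))
                .collect();
            handles.into_iter().map(|h| h.join().unwrap()).sum()
        })
    }

    fn main() {
        let data: Vec<u64> = (0..1_000_000).collect();
        println!("{}", parallel_sum(&data, 8));
    }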
The title is asking a very quantifiable question. And I was hoping the article would be a summary of expertise and deep dives the author took into various languages' runtimes.
But then I read
> With a little help from ChatGPT
And sure enough the results & prose end up pretty much worthless, because the author didn't do any actual critical thinking or technical work. (For instance, there's nothing more to "Go's memory usage ballooned at 1M tasks" beyond the observation. Why not? Just dig in there; it's all documented, you just have to read some stuff.)
The whole idea of async functions is to run tasks concurrently, with simple syntax, on a single thread, hiding the actual concurrency in the runtime and yielding only when a task is waiting for IO.
It doesn’t matter that node is single threaded, nor does the GIL have any impact on a single thread.
It’s once you start doing CPU-heavy work inside the tasks that this will hit you.
Because goroutine stacks start at minimum 2 kB. But it is not obvious if this is the only factor that contributes to memory usage so I believe experimenting to verify theory is still quite valuable.
> "All programs were launched using the release mode if available. Other options were left default."
So it's likely these numbers simply reflect choices in the default configuration, not the ultimate limit with tuning.
I'm starting to dislike these low effort ChatGPT articles. Getting an answer from it is not the same thing as actually learning what's going on, but readers will walk away with the answers uncritically.
Lately I’m seeing a lot of this baked-in assumption that ChatGPT writes good/idiomatic/performant code, as if the code it generates is ~The Answer~. I guess it’s the modern version of copy/pasting from Stack Overflow. The code I’ve had it write isn’t universally terrible but it almost always has some low-hanging performance fruit (in JS for instance, it seems to love spreading arrays). I’m not sure how much I trust it to write good benchmark code!
Adding to the others here, yeah - it has to loop over the whole array, and it allocates a new one. The allocation itself costs something and it also creates work for the garbage collector (freeing the old one). So it’s a lot of work compared to a .push()! Sometimes you do need the immutability aspect though, because a lot of things use referential equality for change detection (React is a big one).
Spreading allocates a new array. Most things you can do with spread you can do with normal functions that either just directly access an entry or mutate an array directly. [first, ...rest] is nice, but will always be slower than direct access.
Spreading an array requires looping over all the array elements, which is O(n), compared to something like Array.push() which is just O(1), although spreading has other benefits like copying rather than mutating.
I've been seeing a lot more "ChatGPT only produces junk code" comments (case in point, this entire thread), and that hasn't really been the case either.
For the majority of those benchmarks it produced the right code on the first go. It struggled more with Java VirtualThreads - probably fewer programs in the training set. It also had a tendency to overcomplicate things (adding unnecessary code). So there were a few iterations needed plus some hand edits.
To play Devil’s Advocate, this is a good test of “how many concurrent tasks will an uncritical developer relying on ChatGPT end up achieving with low effort”.
Right, realistically your product will not primarily be used by the most talented experts, who are optimising as best as possible given intimate knowledge of internals. People will do what basically works, and it's arguably doing a disservice when "benchmarks" avoid showing us also naive performance without the tuning. The tuned performance is worth knowing, but realistically most customers will get the naive performance, ain't nobody got time to learn to tune everything.
I think this applies all the way down to hardware features and programming languages. For example the naive use of sort() in C++ will go faster than in Rust (in C++ that's an unstable sort in Rust it's stable) but may astonish naive programmers (if you don't know what an unstable sort is, or didn't realise that's what the C++ function does). Or opposite example, Rust's Vec::reserve is actively good to call in your code that adds a bunch of things to a Vec, it never destroys amortized growth, but C++ std::vector reserve does destroy amortized growth so you should avoid calling it unless you know the final size of the std::vector.
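A tiny Rust sketch of the Vec::reserve pattern being described (the numbers are arbitrary):

    fn main() {
        // Reserving a known lower bound up front avoids early reallocations;
        // Vec::reserve may over-allocate, so later pushes still get amortized
        // exponential growth (reserve_exact is the one that reserves exactly).
        let mut v: Vec<u64> = Vec::new();
        v.reserve(1_000);
        for i in 0..10_000u64 {
            v.push(i);
        }
        println!("len = {}, capacity = {}", v.len(), v.capacity());
    }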
That is a fair point. I don’t think it saves this particular analysis, of course. Probably what’s important to know is something we might call the “Pareto naive performance”; with just a little bit of effort (the first 20% or so), you can get some significant performance improvement (the first 80% or so). Even the naive programmer just banging together something that works in Elixir is going to quickly find out how to raise Erlang’s default 260k process limit when they hit that, after all.
$ /usr/bin/time -v elixir --erl "-P 10000000" main.exs 1000000
08:42:56.594 [error] Too many processes
** (SystemLimitError) a system limit has been reached
“This limit can be adjusted with the `+P` flag when starting the BEAM… To adjust the limit, you could start the Elixir application using a command like the following:
elixir --erl "+P 1000000"
“If you are getting this error not because of the BEAM limit, but rather because of your operating system limit (like the limit on the number of open files or the number of child processes), you will need to look into how to increase these limits on your specific operating system.”
Yes, again ChatGPT was better than just random search. If writing this blog post teaches me something, then it teaches me to use ChatGPT more not less ;)
BTW, I fixed the 1M benchmark and Elixir is included now.
I’m glad you saw the irony in my reply. I was a little worried it would come off acerbic, like “since you use ChatGPT, you only deserve ChatGPT”. The real intent was double-edged, of course - to demonstrate that ChatGPT isn’t just dumb-codemonkey-as-a-service, it can also quite easily solve a lot of the problems it is purported to create. You have been handling mean comments here (including my own mean comments) with grace, so I took the risk.
> Even the naive programmer just banging together something that works in Elixir is going to quickly find out how to raise Erlang’s default 260k process limit when they hit that, after all.
Sure. Will redo this part. I doubt it would allow Elixir to get a better result in the benchmark, though, as it was already losing significantly at 100k tasks. Any hints on how to decrease Elixir memory usage? Many people criticize using defaults in the comments, but don't suggest any specific settings that would improve the results. And part of the blogging experience is to learn things - also from the author's perspective ;)
Curious what build configuration changes would help with this. Specifically with python/Go because that’s what I’m most familiar with, but any insight into tweaking build configs to increase performance here would be really interesting to me.
.. So tasks.push is called with an IIFE, which returns a promise, which resolves after it has internally awaited another promise...
You could just do this, which avoids about 3 unnecessary allocations:
tasks.push(delay(10000))
And I'm not convinced promisify(setTimeout) is what you want here. I'd just use this:
const delay = timeout => new Promise(resolve => setTimeout(resolve, timeout))
... Which honestly should be part of the standard library in javascript. However, I'm less convinced that replacing the delay function will make a big difference to performance.
Heh confirming the old adage: If you want to know the right answer to a problem online, post the wrong answer. People will come out of the woodwork to correct you.
Yeah Python got done dirty here. I tried a million tasks and I saw a peak of about 166MB. I'm still getting used to async in Python, but this seems to work fine though?
Thank you! I fixed the code and updated the benchmarks. However, your 166 MB doesn't match my testing (I'm getting about 240M at 100k tasks) and 10x more at 1M. Are you sure you haven't misread your RSS?
Ok, this one is funny. Actually ChatGPT got it right and there was a create_task before, but right before committing I "inlined it" and introduced a bug. Thanks for letting me know. Will double check that it didn't mess up the results.
And BTW - I'm very far from copy-pasting from ChatGPT. ChatGPT helped with this code, but it often required many iterations and also some manual cleanup.
Another benchmark of the "I don't know what I'm actually measuring" flavor.
I don't even know where to start:
from the lack of env settings, system settings, compiler settings to the fact that no concurrent work was actually done, this is one giant flawed benchmark.
The only lesson here should be: Don't take any of this seriously.
Just read the text. It lists versions and states it was using defaults in the release mode. Most of those compilers don't have any fancy options actually. E.g go is just `go build`, Python is just `python3 ...`. Rust options are in the project definition. Anyway, I'll add those command line invocations if it causes so much controversy.
Benchmarks for threads might slightly underestimate memory usage. For threads, some amount of memory is spent by the kernel (eg, memory maps themselves) and won’t show as maxrss.
> a high number of concurrent tasks can consume a significant amount of memory,
So Go needs 3gigs for a million tasks. Relatively speaking, for an app that does a million concurrent somethings, 3gb is probably peanuts.
It seems that memory isn’t the limiting factor here.
n = ARGV.first.to_i32
finished = Channel(Nil).new
n.times do
  spawn do
    sleep(10.seconds)
    finished.send(nil)
  end
end
n.times { finished.receive }
Memory usage determined with `ps ax -o pid,rss,args`. Runtime measured with `/usr/bin/time -v`. (Had to run `sudo sysctl vm.max_map_count=2500000` to get the final two rows.) Results:
"System time" rather than "user time" is the majority (7.94s system time, 2.83s user time in the 20.25s wall time run). Is this pointing to memory allocations?
Crystal Fiber docs https://crystal-lang.org/api/1.8.2/Fiber.html says "A Fiber has a stack size of 8 MiB which is usually also assigned to an operating system thread. But only 4KiB are actually allocated at first so the memory footprint is very small." -- and perhaps unsurprisingly, 4KiB times 1000000 is approximately 4 GiB. Nice when the math works out like that :)
To me this is useful as a baseline for something like a websocket server, with some number of idle connections, each Websocket connection mapped to a Fiber that is waiting for I/O activity but mostly idle.
I'm a big Rust fan, so I'm happy Tokio holds up so well. I find it to be one of the simplest and quickest ways to write task based programs.
Gotta wonder whether you can challenge those numbers with C++ though. I suspect you can, but with a lot of work, where Rust is pretty effective written naively.
Of course the big issue with these comparisons is it's hard to be an expert in so many different languages. How do you know if what you're doing is the best way in each language?
I find the opposite to be true, Tokio is a fine library but it makes the whole program very complex.
Perhaps it's because I generally don't write code for many short lived workloads, but I find using real threads to be a lot easier to work with (plus it saves time during compilation). Tokio is great at what it does, but I have to wonder how relevant these types of threads are when "fearless concurrency" is one of the token advantages of the language.
I think this article would've been a lot better if it did some actual calculations in the spawned threads, rather than simply waiting. A smart enough compiler would see that the results of the wait calls, which usually don't come with any other side effects, are never used and optimize them out entirely.
As of today (with .NET 7 and 8 preview) 120MB base footprint memory usage by .NET is indeed surprising. A few questions:
- How are you exactly launching the workload?
- How much memory does your system have?
- Could you try running it with .NET 7 instead?
The last two are especially relevant because in the recent versions each release got extra work to reduce memory footprint on Linux (mostly because it's an important metric for container deployments).
Overall, 120MB sounds about right if you are launching a .NET 6 workload with some extra instrumentation attached while the runtime picks Server GC which is much more enthusiastic about allocating large segments of memory, especially if the system has large amount of RAM.
As for Tasks - C# tasks are in many ways similar to Rust's futures (both are yielding-to-the-caller state machines) and take very little memory, which scales directly with the number of variables that persist across "yield points" (of which you have none here).
> Overall, 120MB sounds about right if you are launching a .NET 6 workload with some extra instrumentation attached while the runtime picks Server GC which is much more enthusiastic about allocating large segments of memory, especially if the system has large amount of RAM.
Also if you have lots of CPU cores then the CLR will (pre-)allocate up-to ~10MB/core for per-hardware-thread GC heaps. So my 16-core (32-thread) desktop will run every .NET program with a starting (OS-level) memory usage of ~200MB for a Hello, World program - though actual total GC usage will be < 1 MB.
Using .NET 6 is perfectly fine! It is the latest LTS after all, it's just that on machines with a lot of RAM the GC can indeed pre-allocate quite a bit of memory if, based on hardware configuration, it chooses to use Server GC. This, however, is being tuned with each subsequent release hence I'm curious to see the numbers on 7 on the author's hardware.
This is not a comparison of 1 million concurrent tasks.
For a bunch of these tests, it is measuring 1 million tasks multiplexed onto a smaller number of OS threads.
In many implementations the cost of concurrency is going to be small enough that this test is really a measure of the stack size of a million functions.
There is sometimes a distinction made between concurrency and parallelism, where "concurrency" just means the programming interface looks parallel but is allowed to use out-of-order task-switching under the hood, and "parallelism" is real-deal usage of CPU cores. Both ways of looking at things are useful in different situations.
The Java tests (Loom in particular), and possibly other language runtimes in this benchmark too, can't be compared without looking at the runtime settings.
The JVM, for example, has a default maximum heap size of 25% to 50% of physical memory, depending on how much memory is available in the system. Given that the code is using ArrayList, there will be more need for a bigger heap, due to array expansion. To eliminate that noise, the code should be changed to a pure Thread[] array with a fixed size based on numTasks.
Once that is done, you may as well run up to 10_000 tasks (virtual threads) within a JVM that consumes no more than 32 MB of RAM (compared to 78MB as originally reported by OP).
These were the best results I could find after playing a bit:
# Run single task
docker run --memory 26m --rm java-virtual-threads 1
# Run 10 tasks
docker run --memory 26m --rm java-virtual-threads 10
# Run 100 tasks
docker run --memory 26m --rm java-virtual-threads 100
# Run 1_000 tasks
docker run --memory 28m --rm java-virtual-threads 1000
# Run 10_000 tasks
docker run --memory 32m -e JAVA_TOOL_OPTIONS=-Xmx25m --rm java-virtual-threads 10000
# Run 100_000 tasks
docker run --memory 200m -e JAVA_TOOL_OPTIONS=-Xmx175m --rm java-virtual-threads 100000
# Run 1_000_000 tasks
docker run --memory 950m -e JAVA_TOOL_OPTIONS=-Xmx750m --rm java-virtual-threads 1000000
I’m surprised at Elixir failing with 1M tasks. I’ve definitely done this sort of test with Erlang without an issue. Although I can’t really remember right now if I pushed it to 1M tasks.
“The maximum number of simultaneously alive Erlang processes is by default 262,144. This limit can be configured at startup. For more information, see the +P command-line flag in the erl(1) manual page in ERTS.”
Almost certainly a system configuration issue. I’m pretty sure the default process limit on BEAM is something like 200K. Additionally, I don’t see mention of updating file handles or anything else on the operating system.
Maybe I’m cynical, but this is what happens if you rely on ChatGPT to give you programming answers without understanding the fundamentals first.
Unless I’m missing something in the writeup, it doesn’t mention how much RAM is on the system. A little odd to mention the CPU and the OS but not the RAM, for a benchmark that doesn’t care about execution speed but does care about memory usage. It makes that Elixir error of “system limit reached” a little harder to figure out.
edit: oh apparently it’s got nothing to do with memory, 1M just happens to be higher than the default process limit for Erlang and thus Elixir.
Lots of pedants like to dump on these sorts of articles but there are enough interesting things to see, even in a relatively trivial comparison like this.
It is interesting that some systems allocate more memory up-front but this doesn't grow much per-task.
It is interesting that in the form the app was written the Go program doesn't appear to scale as well. Sure there are possibly reasons for it so anyone with more knowledge of Go could attempt to write a more "correct" program and ask to rerun it.
It is interesting that with the defaults, things behave as they do, again, someone who cares more about "it's just that the defaults are different" is welcome to investigate that and ask "why" some program defaults might lead to better performance.
It is interesting that as-in most comparisons, the languages/frameworks don't neatly fit into Stereotypes.
It is interesting that most of them probably perform well enough for 99% of applications we might write.
It seems like most people still don't appreciate that YMMV and that trade-offs are always made. Just add it to your stash of brain juice that helps you become a better engineer and stop complaining.
Would be nice to see throughput tested at this level. There may be a good reason for Go to allocate more per thread if each thread is really doing something. Hat tip to Rust and C# tho!
I suspect 1 million concurrent tasks very likely becomes unmanageable in any real-world scenario anyway. 1 KB per task gives 1 GB of total memory consumption, which means 100 KB per task would be enough to hit memory limits on most systems.
There may be scenarios such as network connection handling where those connections are kept idle 99% of the time, but you still need something to handle the very rare case of receiving something. However, are 1 million network connections manageable from a kernel point of view anyway? Wouldn't you reach system limits before that?
> However, are 1 million network connections manageable from a kernel point of view anyway? Wouldn't you reach system limits before that?
You can raise all those defaults as long as you have sufficient memory. Yes, you need to be working with fairly idle connections, otherwise you're likely to run out of CPU. Real world examples include Chat, IMAP, push messaging, that sort of thing where having a live connection means a significant reduction in latency vs periodically opening a connection to poll, but at the same time, there's not a whole lot going on.
I was at WhatsApp when we accidentally hit 2.8M connections on one machine [1] I can't remember if we ever got to 3M, but we ended up doing a lot more work per connection so didn't keep those connection counts for long. Kernel wise, as long as you're mostly inbound connections, everything is well oiled; I ran into problems initiating many thousands of outbound connections/second in an HAProxy use case, although I think FreeBSD 13 may have improved that bottleneck. I hear Linux also works, but I've not scaled it nearly as far.
If you want to do well on this benchmark, rather than calling sleep, the sleeping processes should call erlang:start_timer(10000, self(), []) to schedule a timer and then call erlang:hibernate/3 to wait for the message without having a normal heap. The efficiency guide [1] says a process will start at 326 words of memory, and a word on a 64-bit system is 8 bytes, so rough math says 2.5k per process and 2.5GB for 1M processes, so the observed 4GB isn't that far off, comments on the blog page show someone else got the example to 2.7GB by eliminating extra steps in the program, and reducing initial heap size got it to 1.1GB. Hibernation should make it smaller, but I'm not going to check.
Note also, that BEAM uses several allocation arenas and is slow to release those back to the OS. It's important to measure memory from the OS perspective, but in a real application you'd also measure memory from inside BEAM. There are knobs you can tune if you need to reduce apparent OS memory use at the expense of additional cpu use during allocation. There's tradeoffs everywhere.
Also, yes, 4GB is a lot of memory, but once you start doing real work, you'll probably use even more. Luckily memory is not that expensive and capacities are getting bigger and bigger. We ran some servers with 768GB, but it looks like the current Epycs support 6TB per CPU socket; if everything scales, you could run a billion sleeping BEAM processes on that. :p I recognize this sounds a lot like an argument to accept bloat, but it's different because I'm saying it ;) Also, I think the benefits of BEAM outweigh its memory costs in general; although I've certainly had some fights --- binary:copy/1 before storing binaries into ets/mnesia can be really helpful!
I like the general idea of the test. Things I would have liked to see (apart from basic system tweaks like allowing more file handles) are a performance comparison (how much CPU time is spent) and maybe a comparison between operating systems (this might highlight a lot of windows-linux differences, like Windows not committing memory - better have a large swap file to handle stacks of 100k real threads - or windows having a very different scheduler)
Surely a Sufficiently Smart Compiler could optimize all of these to take only a constant amount of memory (and less time).
I wonder if there are any correctness constraints that would actually prevent such optimizations. For example, the compiler might be required to execute the exact same system calls as an unoptimized program. It seems quite reasonable that the programs using system threads could not be optimized significantly under such a constraint.
I would expect the larger limit that prevents compilers from optimizing this is the large amount of code that defines the semantics of the lightweight threads. For example, I would expect the observable / required behaviors of tokio to not prevent a SCC from optimizing it, but the massive amount of Rust code that defines the lightweight thread semantics cannot be reasonably optimized.
Which programming language / framework is the closest to actually being optimized to like < 4 byte / task? I would make an uneducated guess of Elixir or Go.
To the author: Elixir (well, Erlang's BEAM VM actually) can be configured to allow more than the default 256k tasks (262_144). I'd do that and post the numbers for 1M tasks.
Would be interesting to hear what problems were hit trying to run 100k threads. I remember testing out a million threads on a by-now older desktop, and I just had to do a couple of sysctl setting changes that were easy to find.
For the sake of pitching an analog to Go's concurrency model: Hopac is a concurrency library for F# (partly written in C#) delivering an informal implementation of CSP (similar to Go's goroutines and Clojure's core.async). Since the job does nothing but wait for 10 seconds, the code is:
seq{1..1000000}
|> Seq.Con.iterJobIgnore (fun _ -> timeOutMillis 10000)
|> run
I'm not sure how to calculate total memory usage but instrumenting the following way:
GC.Collect()
GC.Collect()
GC.WaitForPendingFinalizers()
GC.TryStartNoGCRegion 3000000000L
let before = GC.GetTotalMemory false
seq{1..1000000}
|> Seq.Con.iterJobIgnore (fun _ -> timeOutMillis 10000)
|> run
GC.GetTotalMemory false - before |> printfn "%d"
GC.EndNoGCRegion()
Gives the following output, I'm reading 144MB for 1M lightweight thread (in the REPL) unless I did something dumb:
This is on .NET Core 7.0.5, Debian 11, Server GC (which is expected by Hopac). I removed all GC code and ran it as an executable using @compumike's `ps ax -o pid,rss,args` command. I think the only one that triggered a GC was the execution with 10M concurrent jobs.
It would be interesting to see how much the C# memory usage would go down if you allocated a single Task.Delay at the start and had all the tasks wait for it. It would probably result in it finishing in 10sec instead of climbing up to 12, while still running 1 million tasks.
I'm not sure how you would do this in other runtimes, and there is still a risk of new problems cropping up.
1 million Task.Delays means 1 million queued waits in the sleep machinery, which could run into some bad scaling and overhead. But then 1 million tasks waiting on a single awaitable probably hits overhead in other subsystems. My guess would be that 1 million waits is much cheaper than 1 million sleeps, because sleeps have to be sorted by due time while the list of tasks waiting for completion just needs to be sequential. Entries in the sleep queue are also going to be bigger (at minimum, 4-8 bytes for due time + 4-8 bytes per completion callback) while entries in a completion queue would just be 4-8 bytes for the completion callback.
Task.WhenAll also has potentially worse behavior than the other apps if you care about memory, since it will need to allocate a big array of 1 million Tasks in order to pass it in. Many of the other programs instead loop over the tasks, which eliminates that overhead.
The other thing I noticed is that they are wrapping every Task.Delay(…) inside a Task.Run(…) call. That should not be necessary. The inner task will yield back to the caller when it hits the delay call anyway. There is no need to queue it with an immediate yield as if it is background or CPU-bound work.
The Run() call might be creating a more complex state machine and using more memory than necessary (2 tasks vs just 1 for each delay). The memory usage might be even lower if the Run() call is dropped.
I see C is missing. If you're going to do a language comparison, might as well include the one language we know will have the best performance and lowest memory.
There is no lightweight threading concept in standard C/C++, and the cost of one OS thread is at least 1 MB (stack) + 4 KB (TLS) of reserved memory (at least by default, on modern Windows), so for 1M threads that will be 1 TB of RAM :)
C++20 has async/await (but of course named co_await and co_return, because C++ isn't the sort of language where you're allowed nice things; the beatings will continue until morale improves)
However AIUI C++ 20 doesn't actually supply an executor out of the box, so you would need to choose what to do here as for Rust, where they picked both tokio and async-std and you can see they have different performance.
C does only have threads, but you could presumably pull off the same trick in C that Rust does, to get Linux to give you the bare minimum actual resident memory for your thread and you needn't care about the notional "real" size of the thread since we're 64-bit and address space is basically free.
> (but of course named co_await and co_return, because C++ isn't the sort of language where you're allowed nice things; the beatings will continue until morale improves)
Stack memory is allocated lazily on Linux. I thought this benchmark showed that pretty clearly. Otherwise 10k threads would already be in GB territory, when in fact it is below 100 MB.
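For illustration, a Rust sketch of the OS-thread variant that leans on this lazy commit (the thread count and stack size are arbitrary; at higher counts you may also need to raise ulimits and vm.max_map_count):

    use std::time::Duration;

    fn main() {
        let n = 10_000;
        let handles: Vec<_> = (0..n)
            .map(|_| {
                // A small requested stack; Linux only commits the pages that
                // are actually touched, so resident memory stays far below
                // n * stack_size.
                std::thread::Builder::new()
                    .stack_size(128 * 1024)
                    .spawn(|| std::thread::sleep(Duration::from_secs(10)))
                    .unwrap()
            })
            .collect();
        for h in handles {
            h.join().unwrap();
        }
    }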
So how do you write programs without data structures? You copy paste some implementation and modify it slightly at every use site?
Possible optimizations depend on the semantics of the language — C++ can do just orders of magnitude more with for example a string data structure, than what is possible with a C “string”. So no, C is not the most performant language in general — de facto C++ is used for everything performance critical.
Understanding full well that these are pretty artificial results, I'm surprised at just how well Node and Python fared against Go and Java. For all the default assumptions around languages that folks have, they've certainly held their own here.
As others have mentioned, the results come from the difference between async concurrency vs "lightweight threads" (in Go and Java).
In the nodejs test, there's very little javascript code being actually executed here. What's really being tested is nodejs's event queue (libuv), which is written in C. And it's being compared to Go's goroutine system, which is implemented in the Go runtime. But, if I recall correctly, Go's threading model includes a preemptive scheduler (which is very complex) and it allocates a stack per goroutine.
This doesn't test if Javascript or Go code is faster / more efficient. There's almost no javascript or go code being executed here at all. This is testing the execution engines of both languages. Go's execution engine has more features, and as we can see those features have a cost at runtime.
In C# program you'd generally want to call .ConfigureAwait(false) on the task you are about to await if the continuation does not require to be completed on a special thread (like UI).
Pfft doubtful. Rust, C and C++ all have very similar performance. And apparently a javascript framework is currently beating all of those languages in the composite score on techempower, though that seems a bit weird to me:
C doesn't have a built in, optimized async execution engine of any sort. C with threads would lose to any of the async runtimes here (including javascript).
C++ has async now. By all means, write up an async C++ example and compare it. I suspect a good C++ async executor would perform about as well as tokio, but it'd be nice to know for sure.
I'd also be interested to see how hard it is to write that C++ code. One nice thing about Rust and Go is that they perform very well with naively written code (i.e., the sort of thing ChatGPT, or most of your coworkers, would write). Naive C would use kernel threads, so if we're comparing average C code to average Go code at this test, the average Go code will end up faster.
(This isn't always true of rust. I've seen plenty of rust code which uses Box everywhere that runs crazy slow as a result.)
> And apparently a javascript framework is currently beating all of those languages in the composite score on techempower, though that seems a bit weird to me:
Just-js was essentially built to run the techempower benchmark. It's not a real stack you can use. An exercise in "what is possible".
Are you trying to say that you can write, from scratch, code just as safe as or safer than one of the most peer-reviewed crates in Rust? Congrats, the world has been waiting for you. But nonetheless, that's why commoners prefer to stick with what's safer for them.
Rust is going to do well on toy examples. This is because in a naive implementation, for each thread, Rust statically allocates enough space for the largest possible stack the thread will need. On a toy example this will be quite small. In real code, some branches will have large futures that will bloat the size of every thread, even if the thread doesn't take the bloated branch. You can mitigate this issue by heap allocating large futures on cold paths, but this requires insight and tuning.
Some runtimes can handle this for you, but they come with other performance tradeoffs.
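A sketch of that mitigation (hypothetical functions; tokio's full feature set assumed; the 16 KiB buffer just stands in for a branch with a large state machine):

    use std::time::Duration;

    // Hypothetical cold path whose future is large because a big buffer is
    // held across an await point.
    async fn rare_expensive_path() {
        let buf = [0u8; 16 * 1024];
        tokio::time::sleep(Duration::from_millis(100)).await;
        let _ = buf.len();
    }

    async fn common_cheap_path() {}

    async fn handle(rare: bool) {
        if rare {
            // Boxing keeps the large future out of `handle`'s own state
            // machine, so the common case stays small.
            Box::pin(rare_expensive_path()).await;
        } else {
            common_cheap_path().await;
        }
    }

    #[tokio::main]
    async fn main() {
        handle(false).await;
    }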
In real code I saw Rust beating Go- and Java-based code by 10x-50x in memory use. But too complex and too many factors to analyze, so not material for a blog post. Actually this tiny benchmark favors GC-based runtimes because it just sleeps - so it doesn't actually allocate much on the heap.
The Java tests should be optimized by making the ArrayList a fixed size. All this resizing of the huge array backing it and the potential final very high amount of empty backing array slots probably makes the Java result worse than it could be.
I wonder if the parallel stream API with a virtual thread pool would make it better or worse.
The ArrayList is just a growable array and is presumably using an exponential growth strategy so it's equivalent to what several of the others are doing. You could indeed tell Java how big the ArrayList will grow in advance, but you could for example likewise do that with the Rust Vec by asking for Vec::with_capacity(num_tasks) instead of Vec::new().
Since this is amortized exponential growth it's likely making negligible difference, although it is good style to tell data structures what you know about their size early.
The Go approach doesn't appear to pay this penalty, it's just incrementing a (presumably atomic?) counter, although I have no idea where these tasks actually "live" until they decrement the counter again - but most of the others are likewise using a growable array.
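For reference, a sketch of what a pre-sized Rust/tokio variant of the benchmark might look like (assuming tokio's full feature set; the 10-second sleep matches the article's workload):

    use std::time::Duration;

    #[tokio::main]
    async fn main() {
        let num_tasks: usize = std::env::args()
            .nth(1)
            .and_then(|s| s.parse().ok())
            .unwrap_or(1_000_000);

        // Pre-size the handle vector so it never reallocates while spawning.
        let mut handles = Vec::with_capacity(num_tasks);
        for _ in 0..num_tasks {
            handles.push(tokio::spawn(tokio::time::sleep(Duration::from_secs(10))));
        }
        for handle in handles {
            handle.await.unwrap();
        }
    }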
It looks like most of the languages are doing the same thing w.r.t. using a 'List'-style implementation that re-sizes/copies its internal buffer when it runs out of space.
One thing I wonder though is whether the managed languages pay a bigger penalty on the memory usage than the unmanaged languages due to this.
In the unmanaged languages, those intermediate buffers should be freed immediately during the re-size operation.
But in the managed languages, they might stick around until the next GC. So the peak memory would be higher.
In this example the runtime overhead is going to represent more of the memory usage anyway but it quickly isn't negligible. E.g. a list of size 1M is going to be ~8MB (64 bit addresses). So even if the array is re-sized from 1->2->4->8->...->~500K, the end result is not going to be worse than 2X the size.
Tokio seems oddly good compared to the others. Is that because it’s using a thread pool instead of creating something new for every task? I wonder how it would fare if the tasks were more intense, and how the time-to-completion compares between these approaches.
async Rust functions are compiled down to a finite state machine - a stackless coroutine, I think it can be called. I suspect that C#/.NET does something similar, though I am just speculating here. The difference between async-std and tokio would then be explained by lower per-task runtime overhead in tokio.
All the async and green thread runtimes are using thread pools by definition. Avoiding spawning as many OS threads as program's tasks is their main job.
This benchmark wasn't very rigorous, so maybe it has overlooked something, but the result seems plausible. Tokio is a good, mature runtime. Tokio-based frameworks are near top of TechEmpower benchmarks.
Rust's Futures were designed from the start with efficiency in mind. Unlike most async runtimes where chaining of multiple async calls creates new heap allocations, Rust flattens the entire call tree and all await points into a single struct representing a state machine. This allows the runtime to use exactly one heap allocation per task, even if the task is pretty complex.
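A sketch that makes the flattening visible (tokio assumed only for the sleep call; the future is never polled, we just measure it):

    use std::mem::size_of_val;
    use std::time::Duration;

    async fn leaf() {
        tokio::time::sleep(Duration::from_secs(10)).await;
    }

    async fn task() {
        // Both awaits are flattened into `task`'s single state machine;
        // no extra heap allocation is created per await point.
        leaf().await;
        leaf().await;
    }

    fn main() {
        let fut = task();
        // The whole call tree's persistent state lives in this one value.
        println!("future size: {} bytes", size_of_val(&fut));
    }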
It's nice to see, but frankly speaking, with current memory prices all I care about is speed, not so much memory consumption, unless I have very specific requirements.
Using thread pools/connection pools could help a lot both in performance (less time to create threads) and memory use (less fragmentation + less memory to garbage collect after short-lived allocations).
There's a guide somewhere about using connection pools, multiprocessing, and async I/O vs. pthreads, showing that the most obvious solution to the concurrency implementation does not necessarily have the best performance, but I couldn't find it right now.
This is bound to cause problems - the slightest change in your stack - language, runtime, OS - would affect that so much that it's completely pointless to rely on measurements taken now and apply them later (that said, this is a perfectly valid choice when you are locked in - like in the embedded or game-console world, but rarely is that the case outside).
Not for a typical website. Even if you're getting a massive amount of traffic, like a million requests per second, you'd want the request to finish as quickly as possible, so they never pile up to be 1M concurrent tasks. If the tasks have real work to do and can't finish quickly, the having 1M in-flight tasks compete for orders of magnitude fewer CPU cores may not be a good idea — it will waste lots of memory for their state and it will be hard to keep latencies reasonably even. You'd probably use a simpler queue and/or load-balance to more machines.
A scenario with this many tasks spawned is more realistic if you have mostly idle connections. For example, a massive chat server with tons of websockets open.
You might not have a choice. How should a load balancer be implemented? It has to keep those connections alive while the underlying services fulfill the request.
Also I would quickly return something to say the website is busy, and apply some kind of back pressure - since a process can handle only so much. But I have less experience there to say. Though my surprise while working with Java used for web serving was that if the GC took more than 1% (or was it 3%) of some time-sliced window, then the whole process was killed (that's how the team/company had it set up) - e.g. "better fail fast than serve slow" - the latter can quickly affect your business if it's relying on sub-100ms (~) response times.
The go memory usage can probably be explained by the fact that go uses fairly large stacks for goroutines (according to https://go.dev/doc/go1.2#stack_size it’s 8kb per goroutine, even though that’s old)
If it was true, the results would be much worse.
The average from my measurements is a bit above 2 kB.
This website says it is 2 kB: https://tpaschalis.me/goroutines-size/
Stacks changed a lot in Go subsequently. I believe that in 1.2, segmented stacks were still being used. (They now are contiguous and copied on upsize.)
The Rust threaded solution obviously produces actual threads, which will run "actually in parallel" if that is technically possible on your system.
The Rust tokio solution will use tokio's runtime which automatically spawns an appropriate amount of worker threads to parallelize your async tasks (by default it asks the OS how many CPUs there are and creates one such thread for each CPU).
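A sketch of making that explicit with tokio's runtime builder (assuming the rt-multi-thread and time features; the worker count here is arbitrary):

    use std::time::Duration;

    fn main() {
        // By default new_multi_thread() creates one worker per CPU core;
        // this can be pinned to an explicit number instead.
        let rt = tokio::runtime::Builder::new_multi_thread()
            .worker_threads(4)
            .enable_time()
            .build()
            .unwrap();

        rt.block_on(async {
            let task = tokio::spawn(tokio::time::sleep(Duration::from_secs(1)));
            task.await.unwrap();
        });
    }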
> Notably, at 1 million tasks, I observed that the overhead of launching tasks became evident, and most programs required more than 12 seconds to complete.
I didn’t quite get this part. Even if your runtime has no overhead, the fastest way you can execute 1 million 10-second tasks is 10 * 1 million / num_cores, right? Why would these million tasks only take 12 sec?