It's not really fair to compare async/await systems with Go and Java.
Go and Java with lightweight threads provide a full-blown system, without any limitations. You don't have to "color" your code into async and blocking functions, everything "just works". Go can even interrupt tight inner loops with async pre-emption.
The downside is that Go needs around 2k of stack minimum for each goroutine. Java is similar, but has a lower per-thread constant.
The upside is that Go can be _much_ faster due to these contiguous stacks. With async/await, each yield point is basically isomorphic to a segment of a segmented stack. I did a quick benchmark to show this: https://blog.alex.net/benchmarking-go-and-c-async-await
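To make the "everything just works" point concrete, here's a rough Go sketch (the function and numbers are made up for illustration, not taken from the benchmark): a plain blocking function runs concurrently with nothing but the go keyword, and the scheduler parks each goroutine during the blocking call with no async/await annotations anywhere.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetch is an ordinary blocking function: no async keyword, no special
// return type. The time.Sleep (a stand-in for blocking I/O) simply parks
// the goroutine; the scheduler runs other goroutines in the meantime.
func fetch(id int) string {
	time.Sleep(100 * time.Millisecond)
	return fmt.Sprintf("result %d", id)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(id int) { // the only "annotation" is the go keyword at the call site
			defer wg.Done()
			fmt.Println(fetch(id))
		}(i)
	}
	wg.Wait()
}
```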
On the contrary I think it is critical to compare them. You can then decide if the tradeoff is worth it for your use case.
Even at a million tasks Go is still under 3 GiB of memory. So that is roughly 3KiB of memory overhead per task. That is likely negligible if your tasks are doing anything significant and most people don't need to worry about this number of tasks.
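If you want to sanity-check that number yourself, a rough sketch along these lines works (all figures depend on the Go version and on what the goroutines actually do; this just parks a million of them and divides the memory delta):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	const n = 1_000_000

	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)

	var wg sync.WaitGroup
	block := make(chan struct{})
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			wg.Done()
			<-block // park the goroutine so it stays alive
		}()
	}
	wg.Wait() // all goroutines have started

	runtime.GC()
	runtime.ReadMemStats(&after)
	perTask := float64(after.Sys-before.Sys) / n
	fmt.Printf("~%.0f bytes of process memory per goroutine\n", perTask)

	close(block)
}
```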
So this comparison shows that in many cases it is worth paying that price to avoid function colouring. But of course there are still some use cases where the price isn't worth it.
One thing I wanted to add is that in golang, you end up passing context.Context to all asynchronous functions to handle cancellations and timeouts, so you “color” them regardless. Java folks with structured concurrency have the right idea here.
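A hedged sketch of what that "context coloring" looks like in Go (fetchUser/handleRequest are made-up names): every function in the chain grows a ctx parameter so timeouts and cancellation can propagate, much like await has to propagate in async code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Every function that might block takes ctx as its first parameter,
// so cancellation and deadlines can propagate down the call chain.
func fetchUser(ctx context.Context, id int) (string, error) {
	select {
	case <-time.After(50 * time.Millisecond): // stand-in for a network call
		return fmt.Sprintf("user-%d", id), nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func handleRequest(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Millisecond)
	defer cancel()
	_, err := fetchUser(ctx, 42)
	return err
}

func main() {
	err := handleRequest(context.Background())
	fmt.Println(errors.Is(err, context.DeadlineExceeded)) // true: the timeout fired
}
```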
The correct interpretation here is that golang chose explicit context passing where java chose implicit.
It's similar to explicit DI vs implicits.
The function coloring metaphor doesn't quite fit here, since the calling convention is the same and there are no extra function keywords (`async` vs non-async).
This isn't true. I have use cases that don't require cancellations or timeout. The tasks I'm running don't involve the network, they either succeed or error after an expensive calculation.
This is an interesting post. My understanding: most of the use cases for async code are I/O-bound operations. So you fire off a bunch of async I/O requests and wait to be notified. Logically, I/O requests normally need a timeout and/or cancel feature.
However, you raise a different point:
> The tasks I'm running don't involve the network, they either succeed or error after an expensive calculation.
This sounds like CPU-bound, not I/O-bound. (Please correct me if I misunderstand.) Can you please confirm if you are using Go or a different language? If Go, I guess it still makes sense, as green threads are preferred over system threads. If not Go, it would be nice to hear more about your specific scenario. HN is a great place to learn about different use cases for a technology.
I think I just responded too hastily. I am working in Go. There is file IO going on in addition to the calculation (which because of a NAS or whatever could also be network IO). As a practical matter I had never felt the need to offer cancellation or timeout for these use cases, but I probably should, so mea culpa.
What's the point of multiplexing tasks on a particular core if the tasks don't do any I/O? Then it will be strictly faster to execute the tasks serially, spread across as many cores as possible.
It's not _quite_ the same: you can't call async code from a sync context (hence the color issue), but I can always pass a "context.Background()" or such as a context value if I don't already have one.
> you can't call async code from a sync context (hence the color issue), but I can always pass a "context.Background()" or such as a context value if I don't already have one.
You can always pass context.Background, in this metaphor creating a new tree of color.
You can always call "runtime.block_on(async_handle)", in this metaphor also creating a new tree of color.
You can always pass the async executor to the sync code and spawn async coroutines into it. And you can keep it in a global context as well to avoid parameters. E.g. there is `Handle::current()` for exactly this purpose in Tokio. Function coloring is just a theoretical disadvantage - people like to bring it up in discussions, but it almost never matters in practice, and it is not even universally considered a bad thing. I actually like to see in the signature if a function I'm calling into can suspend for an arbitrary amount of time.
It's not the same because you can have interleaving functions which don't know about context.
Say you have a foreach function that calls each function in a list. In async-await contexts you need a separate version of that function which is itself async and calls await.
With context you can pass closures that already have the context applied.
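A hedged Go sketch of what that looks like (forEach/process are illustrative names; generics need Go 1.18+): one generic forEach serves every caller, because the closure already carries the context.

```go
package main

import (
	"context"
	"fmt"
)

// One forEach works for everything: it knows nothing about Context,
// and no second "async" variant is needed.
func forEach[T any](items []T, f func(T) error) error {
	for _, it := range items {
		if err := f(it); err != nil {
			return err
		}
	}
	return nil
}

func process(ctx context.Context, id int) error {
	if err := ctx.Err(); err != nil { // honours cancellation
		return err
	}
	fmt.Println("processing", id)
	return nil
}

func main() {
	ctx := context.Background()
	// The closure already has the context applied.
	_ = forEach([]int{1, 2, 3}, func(id int) error {
		return process(ctx, id)
	})
}
```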
No, because if the function that includes those lines is itself async, it will now block, while the equivalent go coroutine will still preserve the stackful async-ness. I.e. you can't close over the yield continuation in rust, while it is implicitly done in go.
For a more concrete example: let's say you have a generic function that traverses a tree. You want to compare the leaves of two trees without flattening them, by traversing them concurrently with a coroutine [1]. AFAIK in rust you currently need two versions of traverse, one sync and one async, as you can neither close over nor abstract over async. In go, where you have stackful coroutines, this works fine, even when closing over Context (see the rough sketch below).
So yes, in some way Context is a color, but it is a first class value, so you can copy it, close over it and abstract over it, while async-ness (i.e. stackless coroutines) are typically second class in most languages and do not easily mesh with the rest of the language.
[1] this is known as the "same fringe problem" and it is the canonical example of turning internal iterators into external ones.
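For the curious, a rough Go sketch of the same-fringe idea (the types and names are assumed for illustration): a single ordinary traverse closes over a channel (it could just as well close over a Context), and each tree is walked in its own goroutine, so no second "async" variant of traverse is needed.

```go
package main

import "fmt"

type Tree struct {
	Left, Right *Tree
	Leaf        int // only meaningful when Left and Right are nil
}

// traverse is one ordinary recursive function; it simply closes over
// the output channel via its parameter. No async variant needed.
func traverse(t *Tree, out chan<- int) {
	if t == nil {
		return
	}
	if t.Left == nil && t.Right == nil {
		out <- t.Leaf
		return
	}
	traverse(t.Left, out)
	traverse(t.Right, out)
}

func leaves(t *Tree) <-chan int {
	out := make(chan int)
	go func() { // walk each tree in its own goroutine
		traverse(t, out)
		close(out)
	}()
	return out
}

// sameFringe compares leaf sequences without flattening either tree.
// (A production version would take a Context so an early mismatch can
// cancel the other walker instead of leaking its goroutine.)
func sameFringe(a, b *Tree) bool {
	ca, cb := leaves(a), leaves(b)
	for {
		x, okA := <-ca
		y, okB := <-cb
		if okA != okB || x != y {
			return false
		}
		if !okA {
			return true
		}
	}
}

func main() {
	t1 := &Tree{Left: &Tree{Leaf: 1}, Right: &Tree{Left: &Tree{Leaf: 2}, Right: &Tree{Leaf: 3}}}
	t2 := &Tree{Left: &Tree{Left: &Tree{Leaf: 1}, Right: &Tree{Leaf: 2}}, Right: &Tree{Leaf: 3}}
	fmt.Println(sameFringe(t1, t2)) // true: both fringes are 1, 2, 3
}
```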
How do you figure? What’s requiring my async function to take a context in the first place? Even if it takes one, what’s stopping it from calling a function that doesn’t take one? Similarly, what’s stopping a function that doesn’t take a context from calling one which does?
Erlang / Elixir processes use about 0.5kb per process, the BEAM has a preemptive scheduler to make sure work stays consistent, and each process gets an isolated heap to avoid stop-the-world garbage collection.
It’s as good as it gets for this type of thing as it was designed for it from the ground up.
This was the first thing that occurred to me when I saw the error - should have been pretty straightforward to tweak the settings to give a clearer picture.
Thanks! Looked back at the results and that's about what I'd expect. Erlang/Elixir isn't a silver bullet, but amazing in its own way for other reasons. Ultimate max performance and memory efficiency definitely isn't one of them :).
> Erlang / Elixir processes use about 0.5kb per process
Last I checked it used about 2.5K. BEAM reserves around 300 words per process (depends on exact options), but a word is 8 bytes, not 1.
You can get it lower (obviously at a cost, as soon as you start using the process and need space), but nowhere near 512 bytes; just the process overhead is around 800 bytes.
I don't know about C#, but Rust async/await doesn't allocate anything on the heap by itself. So it is not a universal property of all async/await implementations, contrary to what you imply in your blog post.
Yeah and I think NAOT still embeds the same(?) runtime GC into the native binary. So, for memory usage, I would expect it to be nearly/exactly the same.
I may remember wrongly here, but wasn't ValueTask merely the go-to option when it's expected that the task would finish synchronously most of the time? I think for the really async case with a state machine you just end up boxing the ValueTask and ending up with pretty much the same allocations as Task, just in a slightly different way.
What Rust does is let you remove one layer of allocations by using the Future itself to store the contents of the state machine (basically, the stack frame of the async function).
It lets you remove all allocations within a task, except for recursive calls. This results in the entire stack of a task living in one allocation (if the task is not recursive, which is almost always the case, for exactly this reason), exactly like the advantage you describe for Go. And unlike Go, that stack space is perfectly sized and will never need to grow, whereas Go has to reallocate the stack if you hit the limit, an expensive operation that can occur at an unknown point in your program.
Again, no magic. The Future impl stores all of the state machine that Rust can see statically. If you have complicated code that needs dynamic dispatch, or if you need recursive calls, Rust will also have to allocate.
This is not at all that different from Go, except that Go preallocates stack without doing any analysis.
Actually... I can fix that! I can use Go's escape analysis machinery to statically check during the compilation if the stack size can be bounded by a lower number than the default 2kb. This way, I can get it to about ~600 bytes per goroutine. It will also help to speed up the code a bit, by eliding the "morestack" checks.
It can be crunched a bit more, by checking if the goroutine uses timers (~100 bytes) and defers (another ~100 bytes). But this will require some tricks.
I'm NOT trying to show that Go is faster than async/await or anything similar. I'm showing that nested async/await calls are incredibly expensive compared to regular nested function calls.
You need to add the go keyword to turn a normal function call into a goroutine.
If you removed async/await and the Task/return plumbing from the C# code example, it would perform pretty much the same as Go.
If you want to show that async/await calls are expensive, then you should have shown two code samples of C#, one with async/await and one without.
Or you could have done the same for Go: show one example with goroutines and one without (roughly like the sketch below).
But I think everyone already knows that async/await and goroutines have their costs.
The problem is more that you are comparing Go without goroutines (without its allocation costs) to a C# example with a poor implementation of async/await.
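For what it's worth, a hedged sketch of that kind of side-by-side in Go (not the article's benchmark, just an illustration): run the same nested work once as plain calls and once wrapped in goroutines, so the spawn/scheduling cost shows up in isolation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// work is the same trivial nested computation in both variants.
func work(n int) int {
	if n == 0 {
		return 1
	}
	return n + work(n-1)
}

func main() {
	const tasks = 100_000

	start := time.Now()
	for i := 0; i < tasks; i++ {
		_ = work(100) // plain nested calls, no concurrency machinery
	}
	fmt.Println("plain calls:", time.Since(start))

	start = time.Now()
	var wg sync.WaitGroup
	wg.Add(tasks)
	for i := 0; i < tasks; i++ {
		go func() { // same work, plus the goroutine spawn/schedule cost
			defer wg.Done()
			_ = work(100)
		}()
	}
	wg.Wait()
	fmt.Println("goroutines: ", time.Since(start))
}
```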
I'm wondering that as well. The C# code seems really unrealistic though, using a lot of tasks for CPU-bound work.
A fair comparison would at least need to involve some degree of IO.
Does go maybe automatically parallelize everything it can? That would be one potential answer to the initial question.
The author doesn't have the understanding to have even gotten close to the capability for nuance you're asking for. They just copied code from ChatGPT and ran it and timed it and made graphs and it somehow got on HN.
Racket has full-blown green threads (virtual threads)[1] that serve as an alternative to the async/await paradigm. It also supports use of native (OS) threads[2].
C++ has now both async (built-in) and multiple flavors of stackful coroutines (as external libraries). You can run both on top of ASIO so that you can measure purely difference of the coroutine implementation as opposed to the runtime.
With async/await (which I think C# also uses) you actually unwind the entire stack and start a new one every time you do async callbacks. With fibers / green threads / whatever you instead store stacks in memory.
It took me a while to figure this out, thanks to articles I came across (such as “What Color Is Your Function?”).
There are memory / speed trade-offs, as per usual. If you have a lot of memory, and can keep all the stacks in memory, then go ahead and do that. It will save on all the frivolous construction / destruction of objects.
Having said that, my own experience suggests that when you have a startup, you should just build single-threaded applications that clean up after every request (such as with PHP) and spawn many of them. They will share database pool connections etc., but for the most part it will keep your app safer than if they all shared the same objects and process. The benchmarks say that PHP-FPM is only about 50% slower than Swoole, for instance. So why bother? Starting safe beats a 2x speed boost.
And by the way, you should be building distributed systems. There is no reason why some client would have 1 trillion rows in a database, unless the client themselves is a giant centralized platform. Each client should be isolated and have their own database, etc. You can have messaging between clients.
If you think this is too hard, just use https://github.com/Qbix it does it for you out of the box. Even the AI is being run locally this way too.
I missed my opportunity to reply to your comment, but I really appreciate it, and I wanted to find a way to get this back to you. The comment in question:
"Well, I was one of the engineers that made the change :) I'm not sure how much I can tell, but the public reason was: "to make pricing more predictable".
Basically, one of the problems was customers who just set the spot price to 10x of the nominal price and leave the bids unattended. This was usually fine, when the price was 0.2x of the nominal price. But sometimes EC2 instance capacity crunches happened, and these high bids actually started competing with each other. As a result, customers could easily get 100 _times_ higher bill than they expected."
There was more to it than that, but I figure that's a good enough reference point.
Thank you for these improvements. It doesn't change anything, in terms of how much savings I can get by following the latest generations and exotic instance-types, but it does help with the reliability of my workloads.
It's been a huge benefit to me, personally, that I can provide some code that tolerates servers dying, with the benefit of 80% cost savings without using RIs.
You can also try another product of my former team: https://aws.amazon.com/savingsplans/ - it's similar to RI, but cheaper because it doesn't provide an ironclad guarantee that the instance will be available at all times. It's still a bit more expensive than spot, but not by much.
> You don't have to "color" your code into async and blocking functions
Unpopular opinion: this is a good idea. It encourages you to structure your code in a way such that computation and I/O are separated completely. The slight tedium of refactoring your functions into a different color encourages good upfront design to achieve this separation.
Good upfront design? This stinks of implementation detail leakage affecting high-level design, which should be a cardinal sin.
One should design based on constraints that best match the problem at hand, not some ossified principle turned "universal" that only really exists to mask lower-level deficiencies.
When the implementation detail involves whether or not the function will perform I/O, it is better to let that leak.
Excessive hiding of implementation detail is what leads to things like fetching a collection of user IDs from a database and then fetching each user from an ID separately (the 1+N problem). Excessive hiding of implementation detail is what leads to accidental O(N^2) algorithms. Excessive hiding of implementation detail is what leads to most performance problems.
You don't necessarily know ahead of time if an async or a sequential strategy is better. With colored functions you kind of have to pick ahead of time and hope you picked right or pay a big effort penalty to do a side by side comparison.
async/sync and sequential/parallel are orthogonal concerns. You can write sync code which does work in parallel (on different threads/cores), for something like a numerical algorithm that can be parallelized in some way. Deciding whether something is sync or async is about whether it needs to suspend (in practice, mostly whether it needs to do IO), which is much easier to understand up front. Sometimes it changes of course, in which case you have to do some refactoring. But in a decade of programming professionally in Scala and Rust (both of which have "colored" functions) I can count on one hand the number of times where I had to change something from sync to async and it took more than a few minutes of refactoring to do it.
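To make that orthogonality concrete, a rough sketch in Go terms (sumSquares is a made-up example): the function below is completely synchronous from the caller's point of view, it just blocks until it's done, yet it spreads the numerical work across all cores.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// sumSquares has a plain blocking signature: callers see no async-ness,
// yet the work is split across all available cores.
func sumSquares(xs []float64) float64 {
	workers := runtime.NumCPU()
	partial := make([]float64, workers)
	chunk := (len(xs) + workers - 1) / workers

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo := w * chunk
		if lo >= len(xs) {
			break
		}
		hi := lo + chunk
		if hi > len(xs) {
			hi = len(xs)
		}
		wg.Add(1)
		go func(w, lo, hi int) {
			defer wg.Done()
			for _, x := range xs[lo:hi] {
				partial[w] += x * x
			}
		}(w, lo, hi)
	}
	wg.Wait() // block until every worker is done: sequential from the caller's view

	var total float64
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	xs := make([]float64, 1_000_000)
	for i := range xs {
		xs[i] = float64(i)
	}
	fmt.Println(sumSquares(xs))
}
```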