It's not really fair to compare async/await systems with Go and Java.
Go and Java with lightweight threads provide a full-blown system, without any limitations. You don't have to "color" your code into async and blocking functions, everything "just works". Go can even interrupt tight inner loops with async pre-emption.
The downside is that Go needs around 2k of stack minimum for each goroutine. Java is similar, but has a lower per-thread constant.
The upside is that Go can be _much_ faster due to these contiguous stacks. With async/await each yield point is basically isomorphic to a segmented stack segment. I did a quick benchmark to show this: https://blog.alex.net/benchmarking-go-and-c-async-await
On the contrary I think it is critical to compare them. You can then decide if the tradeoff is worth it for your use case.
Even at a million tasks Go is still under 3 GiB of memory. So that is roughly 3KiB of memory overhead per task. That is likely negligible if your tasks are doing anything significant and most people don't need to worry about this number of tasks.
So this comparison shows that in many cases it is worth paying that price to avoid function colouring. But of course there are still some use cases where the price isn't worth it.
One thing I wanted to add is that in golang, you end up passing context.Context to all asynchronous functions to handle cancellations and timeouts, so you “color” them regardless. Java folks with structured concurrency have the right idea here.
The correct interpretation here is that golang chose explicit context passing where java chose implicit.
It's similar to explicit DI vs implicits.
The function coloring metaphor doesn't make sense here, since the calling convention is the same and there are no extra function keywords (`async` vs non-async).
This isn't true. I have use cases that don't require cancellations or timeout. The tasks I'm running don't involve the network, they either succeed or error after an expensive calculation.
This is an interesting post. My understanding: Most of the use case for async code is I/O bound operations. So you fire off a bunch of async I/O requests and wait to be notified. Logically, I/O requests normally need a timeout and/or cancel feature.
However, you raise a different point:
The tasks I'm running don't involve the network, they either succeed or error after an expensive calculation.
This sounds like CPU-bound, not I/O-bound. (Please correct me if I misunderstand.) Can you please confirm if you are using Go or a different language? If Go, I guess it still makes sense, as green threads are preferred over system threads. If not Go, it would be nice to hear more about your specific scenario. HN is a great place to learn about different use cases for a technology.
I think I just responded too hastily. I am working in Go. There is file IO going on in addition to the calculation (which because of a NAS or whatever could also be network IO). As a practical matter I had never felt the need to offer cancellation or timeout for these use cases, but I probably should, so mea culpa.
What's the point of multiplexing tasks on a particular core if the tasks don't do any I/O? Then it would be strictly faster to execute the tasks serially across as many cores as possible.
It's not _quite_ the same: you can't call async code from a sync context (hence the color issue), but I can always pass a "context.Background()" or such as a context value if I don't already have one.
> you can't call async code from a sync context (hence the color issue), but I can always pass a "context.Background()" or such as a context value if I don't already have one.
You can always pass context.Background, in this metaphor creating a new tree of color.
You can always call "runtime.block_on(async_handle)", in this metaphor also creating a new tree of color.
You can always pass the async executor to the sync code and spawn async coroutines into it. And you can keep it in a global context as well to avoid parameters. E.g. there is `Handle::current()` for exactly this purpose in Tokio. Function coloring is just a theoretical disadvantage - people like to bring it up in discussions, but it almost never matters in practice, and it is even not universally considered a bad thing. I actually like to see in the signature if a function I'm calling into can suspend for an arbitrary amount of time.
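For what it's worth, a minimal sketch of that escape hatch (assuming tokio 1.x with the rt-multi-thread and time features; the function names are mine, not anyone's real code):

    use std::time::Duration;
    use tokio::runtime::{Handle, Runtime};

    // Plain sync function that suspends on async work via an explicitly
    // passed runtime handle. It must run on a thread that is not itself
    // driving the runtime, otherwise block_on panics.
    fn sync_work(handle: &Handle) {
        handle.block_on(async {
            tokio::time::sleep(Duration::from_millis(10)).await;
        });
    }

    fn main() {
        let rt = Runtime::new().unwrap();
        let handle = rt.handle().clone();
        // The call site stays "uncolored": it's just a normal function call.
        std::thread::spawn(move || sync_work(&handle)).join().unwrap();
    }

The flip side is that block_on parks an OS thread, so it's an escape hatch rather than a way to make the coloring disappear.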
It's not the same because you can have interleaving functions which don't know about context.
Say you have a foreach function that calls each function in a list. In async-await contexts you need a separate version of that function which is itself async and calls await.
With context you can pass closures that already have the context applied.
No, because if the function that includes those lines is itself async, it will now block, while the equivalent Go coroutine will still preserve the stackful asyncness. I.e. you can't close over the yield continuation in Rust, while it is implicitly done in Go.
For a more concrete example: let's say you have a generic function that traverses a tree. You want to compare the leaves of two trees without flattening them, by traversing them concurrently with a coroutine [1]. AFAIK in Rust you currently need two versions of traverse, one sync and one async, as you can neither close over nor abstract over async. In Go, where you have stackful coroutines, this works fine, even when closing over Context.
So yes, in some way Context is a color, but it is a first class value, so you can copy it, close over it and abstract over it, while async-ness (i.e. stackless coroutines) are typically second class in most languages and do not easily mesh with the rest of the language.
[1] this is known as the "same fringe problem" and it is the canonical example of turning internal iterators into external ones.
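To make the duplication concrete, a minimal Rust sketch (assuming tokio's full feature set; the names are illustrative) of a sync combinator and the separate async twin it needs:

    use std::future::Future;

    // Sync version: takes ordinary closures.
    fn for_each_sync<T>(items: Vec<T>, mut f: impl FnMut(T)) {
        for item in items {
            f(item);
        }
    }

    // Async version: has to be its own async function that awaits each
    // call; it cannot be the same function as the sync one.
    async fn for_each_async<T, F, Fut>(items: Vec<T>, mut f: F)
    where
        F: FnMut(T) -> Fut,
        Fut: Future<Output = ()>,
    {
        for item in items {
            f(item).await;
        }
    }

    #[tokio::main]
    async fn main() {
        for_each_sync(vec![1, 2, 3], |x| println!("sync {x}"));
        for_each_async(vec![1, 2, 3], |x| async move { println!("async {x}") }).await;
    }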
How do you figure? What’s requiring my async function to take a context in the first place? Even if it takes one, what’s stopping it from calling a function that doesn’t take one? Similarly, what’s stopping a function that doesn’t take a context from calling one which does?
Erlang / Elixir processes use about 0.5 kB per process, with a preemptive scheduler on the BEAM to make sure work is consistent, and isolated heap space for each one to avoid stop-the-world garbage collection.
It’s as good as it gets for this type of thing as it was designed for it from the ground up.
This was the first thing that occurred to me when I saw the error - should have been pretty straightforward to tweak the settings to give a clearer picture.
Thanks! Looked back at the results and that's about what I'd expect. Erlang/Elixir isn't a silver bullet, but amazing in its own way for other reasons. Ultimate max performance and memory efficiency definitely isn't one of them :).
> Erlang / Elixir processes use about 0.5kb per process
Last I checked it used about 2.5K. Beam reserves around 300 words per process (depends on exact options), and a word is 8 bytes.
You can get it lower (obviously at a cost as soon as you start using the process and need space), but nowhere near 512 bytes; just the process overhead is around 800 bytes.
I don't know about C#, but Rust async/await doesn't allocate anything on the heap by itself. So it is not a universal property of all async/await implementations, contrary to what you imply in your blog post.
Yeah and I think NAOT still embeds the same(?) runtime GC into the native binary. So, for memory usage, I would expect it to be nearly/exactly the same.
I may remember wrongly here, but wasn't ValueTask merely the go-to option when it's expected that the task would finish synchronously most of the time? I think for the really async case with a state machine you just end up boxing the ValueTask and ending up with pretty much the same allocations as Task, just in a slightly different way.
What Rust does is remove one layer of allocations by using the Future itself to store the contents of the state machine (basically, the stack frame of the async function).
It removes all allocations within a task, except for recursive calls. This results in the entire stack of a task being in one allocation (if the task is not recursive, which is almost always the case, for exactly this reason), exactly like you describe the advantage of Go. And unlike Go, that stack space is perfectly sized and will never need to grow, whereas Go is required to reallocate the stack if you hit the limit, an expensive operation that could occur at an unknown point in your program.
Again, no magic. The Future trait impl stores all of the state machine that Rust can see statically. If you have complicated code that needs dynamic dispatch, or if you need recursive calls, Rust will also have to allocate.
This is not at all that different from Go, except that Go preallocates stack without doing any analysis.
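A sketch of the recursion case (hypothetical function; tokio's full feature set assumed for the timer): the recursive call has to be boxed because the state machine would otherwise have unbounded size.

    use std::future::Future;
    use std::pin::Pin;
    use std::time::Duration;

    // Recursive async: the state machine would be infinitely sized if the
    // recursive call were stored inline, so each level is boxed (one heap
    // allocation per level of recursion).
    fn countdown(n: u32) -> Pin<Box<dyn Future<Output = ()> + Send>> {
        Box::pin(async move {
            if n > 0 {
                tokio::time::sleep(Duration::from_millis(1)).await;
                countdown(n - 1).await;
            }
        })
    }

    #[tokio::main]
    async fn main() {
        countdown(3).await;
    }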
Actually... I can fix that! I can use Go's escape analysis machinery to statically check during the compilation if the stack size can be bounded by a lower number than the default 2kb. This way, I can get it to about ~600 bytes per goroutine. It will also help to speed up the code a bit, by eliding the "morestack" checks.
It can be crunched a bit more, by checking if the goroutine uses timers (~100 bytes) and defers (another ~100 bytes). But this will require some tricks.
I'm NOT trying to show that Go is faster than async/await or anything similar. I'm showing that nested async/await calls are incredibly expensive compared to regular nested function calls.
You need to add the go keyword to change a normal function call into a goroutine.
If you removed async/await and Task/Return from the C# code example, it would perform pretty much the same as Go.
If you want to show that async/await calls are expensive, then you should have shown two code samples of C#, one with async/await and one without.
Or could have done the same for Go, show one example with goroutines, and one without.
But I think everyone already knows that async/await and goroutines have their costs.
The problem is more that you are comparing Go without goroutines (without its allocation costs) to a C# example with a poor implementation of async/await.
I'm wondering that as well. The C# code seems really unrealistic though, using a lot of tasks for CPU-bound work.
A fair comparison would at least need to involve some degree of IO.
Does Go maybe automatically parallelize everything it can? That would be one potential answer to the initial question.
The author doesn't have the understanding to have even gotten close to the capability for nuance you're asking for. They just copied code from ChatGPT and ran it and timed it and made graphs and it somehow got on HN.
Racket has full-blown green threads (virtual threads)[1] that serve as an alternative to the async/await paradigm. It also supports use of native (OS) threads[2].
C++ has now both async (built-in) and multiple flavors of stackful coroutines (as external libraries). You can run both on top of ASIO so that you can measure purely difference of the coroutine implementation as opposed to the runtime.
With async/await (which I think C# also uses) you actually unwind the entire stack and start a new one every time you do async callbacks. With fibers / green threads / whatever you instead store stacks in memory.
It took me a while to figure this out, thanks to articles I came across (such as “What Color Is Your Function?”).
There are memory / speed trade-offs, as per usual. If you have a lot of memory, and can keep all the stacks in memory, then go ahead and do that. It will save on all the frivolous construction / destruction of objects.
Having said that, my own experience suggests that when you have a startup, you should just build single-threaded applications that clean up after every request (such as with PHP) and spawn many of them. They will share database pool connections etc. but for the most part it will keep your app safer than if they all shared the same objects and process. The benchmarks say that PHP-FPM is only 50% slower than Swoole, for instance. So why bother? Starting safe beats a 2x speed boost.
And by the way, you should be building distributed systems. There is no reason why some client would have 1 trillion rows in a database, unless the client themselves is a giant centralized platform. Each client should be isolated and have their own database, etc. You can have messaging between clients.
If you think this is too hard, just use https://github.com/Qbix it does it for you out of the box. Even the AI is being run locally this way too.
I missed my opportunity to reply to your comment, but I really appreciate it, and I wanted to find a way to get this back to you. The comment in question:
"Well, I was one of the engineers that made the change :) I'm not sure how much I can tell, but the public reason was: "to make pricing more predictable".
Basically, one of the problems was customers who just set the spot price to 10x of the nominal price and leave the bids unattended. This was usually fine, when the price was 0.2x of the nominal price. But sometimes EC2 instance capacity crunches happened, and these high bids actually started competing with each other. As a result, customers could easily get 100 _times_ higher bill than they expected."
There was more to it than that, but I figure that's a good enough reference point.
Thank you for these improvements. It doesn't change anything, in terms of how much savings I can get by following the latest generations and exotic instance-types, but it does help with the reliability of my workloads.
It's been a huge benefit to me, personally, that I can provide some code that enables the potentiality of servers dying, with the benefit of 80% cost savings without using RIs.
You can also try another product of my former team: https://aws.amazon.com/savingsplans/ - it's similar to RI, but cheaper because it doesn't provide an ironclad guarantee that the instance will be available at all times. It's still a bit more expensive than spot, but not by much.
> You don't have to "color" your code into async and blocking functions
Unpopular opinion: this is a good idea. It encourages you to structure your code in a way such that computation and I/O are separated completely. The slight tedium of refactoring your functions into a different color encourages good upfront design to achieve this separation.
Good upfront design? This stinks of implementation detail leakage affecting high-level design, which should be a cardinal sin.
One should design based on constraints that best match the problem at hand, not some ossified principle turned "universal" that only really exists to mask lower-level deficiencies.
When the implementation detail involves whether or not the function will perform I/O, it is better to let that leak.
Excessive hiding of implementation detail is what leads to things like fetching a collection of user IDs from a database and then fetching each user from an ID separately (the 1+N problem). Excessive hiding of implementation detail is what leads to accidental O(N^2) algorithms. Excessive hiding of implementation detail is what leads to most performance problems.
You don't necessarily know ahead of time if an async or a sequential strategy is better. With colored functions you kind of have to pick ahead of time and hope you picked right or pay a big effort penalty to do a side by side comparison.
async/sync and sequential/parallel are orthogonal concerns. You can write sync code which does work in parallel (on different threads/cores) for something like a numerical algorithm that can be parallelized in some way. Deciding whether something is sync or async is about whether it needs to suspend (in practice, mostly whether it needs to do IO) which is much easier to understand up front. Sometimes it changes of course, in which case you have to do some refactoring. But in a decade of programming professionally in Scala and Rust (both of which have "colored" functions) I can count on one hand the number of times where I had to change something from sync to async and it took more than a few minutes of refactoring to do it.
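For illustration, a sketch of sync-yet-parallel code in Rust using scoped OS threads (the function name and thread count are mine; no async involved):

    // Plain sync code that still uses all cores: the work is split across
    // OS threads with std's scoped threads, no async/await anywhere.
    fn parallel_sum(data: &[u64], n_threads: usize) -> u64 {
        let chunk_len = (data.len() / n_threads).max(1);
        std::thread::scope(|s| {
            let handles: Vec<_> = data
                .chunks(chunk_len)
                .map(|chunk| s.spawn(move || chunk.iter().sum::<u64>()))
                .collect();
            handles.into_iter().map(|h| h.join().unwrap()).sum()
        })
    }

    fn main() {
        let data: Vec<u64> = (0..1_000_000).collect();
        println!("{}", parallel_sum(&data, 8));
    }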
The title is asking a very quantifiable question. And I was hoping the article would be a summary of expertise and deep dives the author took into various languages' runtimes.
But then I read
> With a little help from ChatGPT
And sure enough the results & prose end up pretty much worthless, because the author didn't do any actual critical thinking or technical work. (For instance, there's nothing more to "Go's memory usage ballooned at 1M tasks" beyond the observation. Why not? Just dig in there; it's all documented, you just have to read some stuff.)
The whole idea of async functions is to run tasks concurrently, with simple syntax, on a single thread, hiding the actual concurrency in the runtime and yielding only when a task is waiting for IO.
It doesn’t matter that node is single threaded, nor does the GIL have any impact on a single thread.
It’s once you start doing CPU-heavy work inside the tasks that this will hit you.
Because goroutine stacks start at minimum 2 kB. But it is not obvious if this is the only factor that contributes to memory usage so I believe experimenting to verify theory is still quite valuable.
> "All programs were launched using the release mode if available. Other options were left default."
So it's likely these numbers simply reflect choices in the default configuration, not the ultimate limit with tuning.
I'm starting to dislike these low effort ChatGPT articles. Getting an answer from it is not the same thing as actually learning what's going on, but readers will walk away with the answers uncritically.
Lately I’m seeing a lot of this baked-in assumption that ChatGPT writes good/idiomatic/performant code, as if the code it generates is ~The Answer~. I guess it’s the modern version of copy/pasting from Stack Overflow. The code I’ve had it write isn’t universally terrible but it almost always has some low-hanging performance fruit (in JS for instance, it seems to love spreading arrays). I’m not sure how much I trust it to write good benchmark code!
Adding to the others here, yeah - it has to loop over the whole array, and it allocates a new one. The allocation itself costs something and it also creates work for the garbage collector (freeing the old one). So it’s a lot of work compared to a .push()! Sometimes you do need the immutability aspect though, because a lot of things use referential equality for change detection (React is a big one).
Spreading allocates a new array. Most things you can do with spread you can do with normal functions that either just directly access an entry or mutate an array directly. [first, ...rest] is nice, but will always be slower than direct access.
Spreading an array requires looping over all the array elements, which is O(n), compared to something like Array.push() which is just O(1), although spreading has other benefits like copying rather than mutating.
I've been seeing a lot more "ChatGPT only produces junk code" comments (case in point, this entire thread), and that hasn't really been the case either.
For the majority of those benchmarks it produced the right code on the first go. It struggled more with Java VirtualThreads - probably fewer programs in the training set. It also had a tendency to overcomplicate things (adding unnecessary code). So there were a few iterations needed plus some hand edits.
To play Devil’s Advocate, this is a good test of “how many concurrent tasks will an uncritical developer relying on ChatGPT end up achieving with low effort”.
Right, realistically your product will not primarily be used by the most talented experts, who are optimising as best as possible given intimate knowledge of internals. People will do what basically works, and it's arguably doing a disservice when "benchmarks" avoid showing us also naive performance without the tuning. The tuned performance is worth knowing, but realistically most customers will get the naive performance, ain't nobody got time to learn to tune everything.
I think this applies all the way down to hardware features and programming languages. For example the naive use of sort() in C++ will go faster than in Rust (in C++ that's an unstable sort in Rust it's stable) but may astonish naive programmers (if you don't know what an unstable sort is, or didn't realise that's what the C++ function does). Or opposite example, Rust's Vec::reserve is actively good to call in your code that adds a bunch of things to a Vec, it never destroys amortized growth, but C++ std::vector reserve does destroy amortized growth so you should avoid calling it unless you know the final size of the std::vector.
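A tiny Rust sketch of the Vec::reserve pattern being described (the numbers are arbitrary):

    fn main() {
        // Reserving a known lower bound up front avoids early reallocations;
        // Vec::reserve may over-allocate, so later pushes still get amortized
        // exponential growth (reserve_exact is the one that reserves exactly).
        let mut v: Vec<u64> = Vec::new();
        v.reserve(1_000);
        for i in 0..10_000u64 {
            v.push(i);
        }
        println!("len = {}, capacity = {}", v.len(), v.capacity());
    }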
That is a fair point. I don’t think it saves this particular analysis, of course. Probably what’s important to know is something we might call the “Pareto naive performance”; with just a little bit of effort (the first 20% or so), you can get some significant performance improvement (the first 80% or so). Even the naive programmer just banging together something that works in Elixir is going to quickly find out how to raise Erlang’s default 260k process limit when they hit that, after all.
$ /usr/bin/time -v elixir --erl "-P 10000000" main.exs 1000000
08:42:56.594 [error] Too many processes
** (SystemLimitError) a system limit has been reached
“This limit can be adjusted with the `+P` flag when starting the BEAM… To adjust the limit, you could start the Elixir application using a command like the following:
elixir --erl "+P 1000000"
“If you are getting this error not because of the BEAM limit, but rather because of your operating system limit (like the limit on the number of open files or the number of child processes), you will need to look into how to increase these limits on your specific operating system.”
Yes, again ChatGPT was better than just random search. If writing this blog post teaches me something, then it teaches me to use ChatGPT more not less ;)
BTW, I fixed the 1M benchmark and Elixir is included now.
I’m glad you saw the irony in my reply. I was a little worried it would come off acerbic, like “since you use ChatGPT, you only deserve ChatGPT”. The real intent was double-edged, of course - to demonstrate that ChatGPT isn’t just dumb-codemonkey-as-a-service, it can also quite easily solve a lot of the problems it is purported to create. You have been handling mean comments here (including my own mean comments) with grace, so I took the risk.
> Even the naive programmer just banging together something that works in Elixir is going to quickly find out how to raise Erlang’s default 260k process limit when they hit that, after all.
Sure. Will redo this part. I doubt it would allow Elixir to get a better result in the benchmark, though, as it was already losing significantly at 100k tasks. Any hints on how to decrease Elixir memory usage? Many people criticize using defaults in the comments, but don't suggest any specific settings that would improve the results. And part of the blogging experience is to learn things - also from the author's perspective ;)
Curious what build configuration changes would help with this. Specifically with python/Go because that’s what I’m most familiar with, but any insight into tweaking build configs to increase performance here would be really interesting to me.
.. So tasks.push is called with an IIFE, which returns a promise, which resolves after it has internally awaited another promise...
You could just do this, which avoids about 3 unnecessary allocations:
tasks.push(delay(10000))
And I'm not convinced promisify(setTimeout) is what you want here. I'd just use this:
const delay = timeout => new Promise(resolve => setTimeout(resolve, timeout))
... Which honestly should be part of the standard library in javascript. However, I'm less convinced that replacing the delay function will make a big difference to performance.
Heh confirming the old adage: If you want to know the right answer to a problem online, post the wrong answer. People will come out of the woodwork to correct you.
Yeah Python got done dirty here. I tried a million tasks and I saw a peak of about 166MB. I'm still getting used to async in Python, but this seems to work fine though?
Thank you! I fixed the code and updated the benchmarks. However, your 166 MB doesn't match my testing (I'm getting about 240M at 100k tasks) and 10x more at 1M. Are you sure you haven't misread your RSS?
Ok, this one is funny. Actually ChatGPT got it right and there was a create_task before, but right before committing I "inlined it" and introduced a bug. Thanks for letting me know. Will double check that it didn't mess up the results.
And BTW - I'm very far from copy-pasting from ChatGPT. ChatGPT helped with this code, but it often required many iterations and also some manual cleanup.
Another benchmark of the "I don't know what I'm actually measuring" flavor.
I don't even know where to start:
from the lack of env settings, system settings, compiler settings to the fact that no concurrent work was actually done, this is one giant flawed benchmark.
The only lesson here should be: Don't take any of this seriously.
Just read the text. It lists versions and states it was using defaults in the release mode. Most of those compilers don't have any fancy options actually. E.g go is just `go build`, Python is just `python3 ...`. Rust options are in the project definition. Anyway, I'll add those command line invocations if it causes so much controversy.
Benchmarks for threads might slightly underestimate memory usage. For threads, some amount of memory is spent by the kernel (eg, memory maps themselves) and won’t show as maxrss.
> a high number of concurrent tasks can consume a significant amount of memory,
So Go needs 3gigs for a million tasks. Relatively speaking, for an app that does a million concurrent somethings, 3gb is probably peanuts.
It seems that memory isn’t the limiting factor here.
n = ARGV.first.to_i32
finished = Channel(Nil).new
n.times do
  spawn do
    sleep(10.seconds)
    finished.send(nil)
  end
end
n.times { finished.receive }
Memory usage determined with `ps ax -o pid,rss,args`. Runtime measured with `/usr/bin/time -v`. (Had to run `sudo sysctl vm.max_map_count=2500000` to get the final two rows.) Results:
"System time" rather than "user time" is the majority (7.94s system time, 2.83s user time in the 20.25s wall time run). Is this pointing to memory allocations?
Crystal Fiber docs https://crystal-lang.org/api/1.8.2/Fiber.html says "A Fiber has a stack size of 8 MiB which is usually also assigned to an operating system thread. But only 4KiB are actually allocated at first so the memory footprint is very small." -- and perhaps unsurprisingly, 4KiB times 1000000 is approximately 4 GiB. Nice when the math works out like that :)
To me this is useful as a baseline for something like a websocket server, with some number of idle connections, each Websocket connection mapped to a Fiber that is waiting for I/O activity but mostly idle.
I'm a big Rust fan, so I'm happy Tokio holds up so well. I find it to be one of the simplest and quickest ways to write task based programs.
Gotta wonder whether you can challenge those numbers with C++ though. I suspect you can, but with a lot of work, where Rust is pretty effective written naively.
Of course the big issue with these comparisons is it's hard to be an expert in so many different languages. How do you know if what you're doing is the best way in each language?
I find the opposite to be true, Tokio is a fine library but it makes the whole program very complex.
Perhaps it's because I generally don't write code for many short lived workloads, but I find using real threads to be a lot easier to work with (plus it saves time during compilation). Tokio is great at what it does, but I have to wonder how relevant these types of threads are when "fearless concurrency" is one of the token advantages of the language.
I think this article would've been a lot better if it did some actual calculations in the spawned threads, rather than simply waiting. A smart enough compiler would see that the results of the wait calls, which usually don't come with any other side effects, are never used and optimize them out entirely.
As of today (with .NET 7 and 8 preview) 120MB base footprint memory usage by .NET is indeed surprising. A few questions:
- How are you exactly launching the workload?
- How much memory does your system have?
- Could you try running it with .NET 7 instead?
The last two are especially relevant because in the recent versions each release got extra work to reduce memory footprint on Linux (mostly because it's an important metric for container deployments).
Overall, 120MB sounds about right if you are launching a .NET 6 workload with some extra instrumentation attached while the runtime picks Server GC which is much more enthusiastic about allocating large segments of memory, especially if the system has large amount of RAM.
As for Tasks - C# tasks are in many ways similar to Rust's futures (both are yielding-to-the-caller state machines) and take very little memory, which scales directly with the number of variables that persist across "yield points" (of which you have none here).
> Overall, 120MB sounds about right if you are launching a .NET 6 workload with some extra instrumentation attached while the runtime picks Server GC which is much more enthusiastic about allocating large segments of memory, especially if the system has large amount of RAM.
Also if you have lots of CPU cores then the CLR will (pre-)allocate up-to ~10MB/core for per-hardware-thread GC heaps. So my 16-core (32-thread) desktop will run every .NET program with a starting (OS-level) memory usage of ~200MB for a Hello, World program - though actual total GC usage will be < 1 MB.
Using .NET 6 is perfectly fine! It is the latest LTS after all, it's just that on machines with a lot of RAM the GC can indeed pre-allocate quite a bit of memory if, based on hardware configuration, it chooses to use Server GC. This, however, is being tuned with each subsequent release hence I'm curious to see the numbers on 7 on the author's hardware.
This is not a comparison of 1 million concurrent tasks.
For a bunch of these tests, it is measuring 1 million tasks multiplexed onto a smaller number of OS threads.
In many implementations the cost of concurrency is going to be small enough that this test is really a measure of the stack size of a million functions.
There is sometimes a distinction made between concurrency and parallelism, where "concurrency" just means the programming interface looks parallel but is allowed to use out-of-order task-switching under the hood, and "parallelism" is real-deal usage of CPU cores. Both ways of looking at things are useful in different situations.
The Java tests (Loom in particular), and possibly other language runtimes in this benchmark too, can't be compared without looking at the runtime settings.
The JVM, for example, has a default maximum heap size of 25% to 50% of physical memory, depending on how much memory is available in the system. Given that the code is using ArrayList, there will be more need for a bigger heap, due to array expansion. To eliminate that noise, the code should be changed to a pure Thread[] array with a fixed size based on numTasks.
Once that is done, you may as well run up to 10_000 tasks (virtual threads) within a JVM that consumes no more than 32 MB of RAM (compared to 78MB as originally reported by OP).
These were the best results I could find after playing a bit:
# Run single task
docker run --memory 26m --rm java-virtual-threads 1
# Run 10 tasks
docker run --memory 26m --rm java-virtual-threads 10
# Run 100 tasks
docker run --memory 26m --rm java-virtual-threads 100
# Run 1_000 tasks
docker run --memory 28m --rm java-virtual-threads 1000
# Run 10_000 tasks
docker run --memory 32m -e JAVA_TOOL_OPTIONS=-Xmx25m --rm java-virtual-threads 10000
# Run 100_000 tasks
docker run --memory 200m -e JAVA_TOOL_OPTIONS=-Xmx175m --rm java-virtual-threads 100000
# Run 1_000_000 tasks
docker run --memory 950m -e JAVA_TOOL_OPTIONS=-Xmx750m --rm java-virtual-threads 1000000
I’m surprised at Elixir failing with 1M tasks. I’ve definitely done this sort of test with Erlang without an issue. Although I can’t really remember right now if I pushed it to 1M tasks.
“The maximum number of simultaneously alive Erlang processes is by default 262,144. This limit can be configured at startup. For more information, see the +P command-line flag in the erl(1) manual page in ERTS.”
Almost certainly a system configuration issue. I’m pretty sure the default process limit on BEAM is something like 200K. Additionally, I don’t see mention of updating file handles or anything else on the operating system.
Maybe I’m cynical, but this is what happens if you rely on ChatGPT to give you programming answers without understanding the fundamentals first.
Unless I’m missing something in the writeup, it doesn’t mention how much RAM is on the system. A little odd to mention the CPU and the OS but not the RAM, for a benchmark that doesn’t care about execution speed but does care about memory usage. It makes that Elixir error of “system limit reached” a little harder to figure out.
edit: oh apparently it’s got nothing to do with memory, 1M just happens to be higher than the default process limit for Erlang and thus Elixir.
Lots of pedants like to dump on these sorts of articles but there are enough interesting things to see, even in a relatively trivial comparison like this.
It is interesting that some systems allocate more memory up-front but this doesn't grow much per-task.
It is interesting that in the form the app was written the Go program doesn't appear to scale as well. Sure there are possibly reasons for it so anyone with more knowledge of Go could attempt to write a more "correct" program and ask to rerun it.
It is interesting that with the defaults, things behave as they do, again, someone who cares more about "it's just that the defaults are different" is welcome to investigate that and ask "why" some program defaults might lead to better performance.
It is interesting that as-in most comparisons, the languages/frameworks don't neatly fit into Stereotypes.
It is interesting that most of them probably perform well enough for 99% of applications we might write.
It seems like most people still don't appreciate that YMMV and that trade-offs are always made. Just add it to your stash of brain juice that helps you become a better engineer and stop complaining.
Would be nice to see throughput tested at this level. There may be a good reason for Go to allocate more per thread if each thread is really doing something. Hat tip to Rust and C# tho!
I suspect 1 million concurrent tasks very likely becomes unmanageable in any real-world scenario anyway. 1 KB per task gives 1 GB of total memory consumption, which means 100 KB per task would be enough to hit memory limits on most systems.
There may be scenarios such as network connection handling where those connections are kept idle 99% of the time, but you still need something to handle the very rare case of receiving something. However, are 1 million network connections manageable from a kernel point of view anyway? Wouldn't you reach system limits before that?
> However, are 1 million network connections manageable from a kernel point of view anyway? Wouldn't you reach system limits before that?
You can raise all those defaults as long as you have sufficient memory. Yes, you need to be working with fairly idle connections, otherwise you're likely to run out of CPU. Real world examples include Chat, IMAP, push messaging, that sort of thing where having a live connection means a significant reduction in latency vs periodically opening a connection to poll, but at the same time, there's not a whole lot going on.
I was at WhatsApp when we accidentally hit 2.8M connections on one machine [1] I can't remember if we ever got to 3M, but we ended up doing a lot more work per connection so didn't keep those connection counts for long. Kernel wise, as long as you're mostly inbound connections, everything is well oiled; I ran into problems initiating many thousands of outbound connections/second in an HAProxy use case, although I think FreeBSD 13 may have improved that bottleneck. I hear Linux also works, but I've not scaled it nearly as far.
If you want to do well on this benchmark, rather than calling sleep, the sleeping processes should call erlang:start_timer(10000, self(), []) to schedule a timer and then call erlang:hibernate/3 to wait for the message without having a normal heap. The efficiency guide [1] says a process will start at 326 words of memory, and a word on a 64-bit system is 8 bytes, so rough math says 2.5k per process and 2.5GB for 1M processes, so the observed 4GB isn't that far off, comments on the blog page show someone else got the example to 2.7GB by eliminating extra steps in the program, and reducing initial heap size got it to 1.1GB. Hibernation should make it smaller, but I'm not going to check.
Note also, that BEAM uses several allocation arenas and is slow to release those back to the OS. It's important to measure memory from the OS perspective, but in a real application you'd also measure memory from inside BEAM. There are knobs you can tune if you need to reduce apparent OS memory use at the expense of additional cpu use during allocation. There's tradeoffs everywhere.
Also, yes, 4GB is a lot of memory, but once you start doing real work, you'll probably use even more. Luckily memory is not that expensive and capacities are getting bigger and bigger. We ran some servers with 768GB, but it looks like the current Epycs support 6TB per CPU socket; if everything scales, you could run a billion sleeping BEAM processes on that. :p I recognize this sounds a lot like an argument to accept bloat, but it's different because I'm saying it ;) Also, I think the benefits of BEAM outweigh its memory costs in general; although I've certainly had some fights --- binary:copy/1 before storing binaries into ets/mnesia can be really helpful!
I like the general idea of the test. Things I would have liked to see (apart from basic system tweaks like allowing more file handles) are a performance comparison (how much CPU time is spent) and maybe a comparison between operating systems (this might highlight a lot of windows-linux differences, like Windows not committing memory - better have a large swap file to handle stacks of 100k real threads - or windows having a very different scheduler)
Surely a Sufficiently Smart Compiler could optimize all of these to take only a constant amount of memory (and less time).
I wonder if there are any correctness constraints that would actually prevent such optimizations. For example, the compiler might be required to execute the exact same system calls as an unoptimized program. It seems quite reasonable that the programs using system threads could not be optimized significantly under such a constraint.
I would expect the larger limit that prevents compilers from optimizing this is the large amount of code that defines the semantics of the lightweight threads. For example, I would expect the observable / required behaviors of tokio to not prevent a SCC from optimizing it, but the massive amount of Rust code that defines the lightweight thread semantics cannot be reasonably optimized.
Which programming language / framework is the closest to actually being optimized to like < 4 byte / task? I would make an uneducated guess of Elixir or Go.
To the author: Elixir (well, Erlang's BEAM VM actually) can be configured to allow more than the default 256k tasks (262_144). I'd do that and post the numbers for 1M tasks.
Would be interesting to hear what problems were hit trying to run 100k threads. I remember testing out a million threads on a by-now older desktop, and I just had to do a couple of sysctl setting changes that were easy to find.
For the sake of pitching an analog to Go's concurrency model: Hopac is a concurrency library for F# (partly written in C#) delivering an informal implementation of CSP (similar to Go's goroutines and Clojure's core.async). Since the job does nothing but wait for 10 seconds, the code is:
seq{1..1000000}
|> Seq.Con.iterJobIgnore (fun _ -> timeOutMillis 10000)
|> run
I'm not sure how to calculate total memory usage but instrumenting the following way:
GC.Collect()
GC.Collect()
GC.WaitForPendingFinalizers()
GC.TryStartNoGCRegion 3000000000L
let before = GC.GetTotalMemory false
seq{1..1000000}
|> Seq.Con.iterJobIgnore (fun _ -> timeOutMillis 10000)
|> run
GC.GetTotalMemory false - before |> printfn "%d"
GC.EndNoGCRegion()
Gives the following output, I'm reading 144MB for 1M lightweight thread (in the REPL) unless I did something dumb:
This is on .NET Core 7.0.5, Debian 11, Server GC (which is expected by Hopac). I removed all GC code and ran it as an executable using @compumike's `ps ax -o pid,rss,args` command. I think the only one that triggered a GC was the execution with 10M concurrent jobs.
It would be interesting to see how much the C# memory usage would go down if you allocated a single Task.Delay at the start and had all the tasks wait for it. It would probably result in it finishing in 10sec instead of climbing up to 12, while still running 1 million tasks.
I'm not sure how you would do this in other runtimes, and there is still a risk of new problems cropping up.
1 million Task.Delays means 1 million queued waits in the sleep machinery, which could run into some bad scaling and overhead. But then 1 million tasks waiting on a single awaitable probably hits overhead in other subsystems. My guess would be that 1 million waits is much cheaper than 1 million sleeps, because sleeps have to be sorted by due time while the list of tasks waiting for completion just needs to be sequential. Entries in the sleep queue are also going to be bigger (at minimum, 4-8 bytes for due time + 4-8 bytes per completion callback) while entries in a completion queue would just be 4-8 bytes for the completion callback.
Task.WhenAll also has potentially worse behavior than the other apps if you care about memory, since it will need to allocate a big array of 1 million Tasks in order to pass it in. Many of the other programs instead loop over the tasks, which eliminates that overhead.
The other thing I noticed is that they are wrapping every Task.Delay(…) inside a Task.Run(…) call. That should not be necessary. The inner task will yield back to the caller when it hits the delay call anyway. There is no need to queue it with an immediate yield as if it is background or CPU-bound work.
The Run() call might be creating a more complex state machine and using more memory than necessary (2 tasks vs just 1 for each delay). The memory usage might be even lower if the Run() call is dropped.
I see C is missing. If you're going to do a language comparison, might as well include the one language we know will have the best performance and lowest memory.
There is no lightweight threading concept in standard C/C++, and the cost of one OS thread is at least 1 MB (stack) + 4 KB (TLS) of reserved memory (at least by default, on modern Windows), so for 1M threads that will be 1 TB of RAM :)
C++20 has async/await (but of course named co_await and co_return, because C++ isn't the sort of language where you're allowed nice things; the beatings will continue until morale improves)
However AIUI C++ 20 doesn't actually supply an executor out of the box, so you would need to choose what to do here as for Rust, where they picked both tokio and async-std and you can see they have different performance.
C does only have threads, but you could presumably pull off the same trick in C that Rust does, to get Linux to give you the bare minimum actual resident memory for your thread and you needn't care about the notional "real" size of the thread since we're 64-bit and address space is basically free.
> (but of course named co_await and co_return, because C++ isn't the sort of language where you're allowed nice things; the beatings will continue until morale improves)
Stack memory is allocated lazily on Linux. I thought this benchmark showed that pretty clearly. Otherwise 10k threads would already be in GB territory, when in fact it is below 100 MB.
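For illustration, a Rust sketch of the OS-thread variant that leans on this lazy commit (the thread count and stack size are arbitrary; at higher counts you may also need to raise ulimits and vm.max_map_count):

    use std::time::Duration;

    fn main() {
        let n = 10_000;
        let handles: Vec<_> = (0..n)
            .map(|_| {
                // A small requested stack; Linux only commits the pages that
                // are actually touched, so resident memory stays far below
                // n * stack_size.
                std::thread::Builder::new()
                    .stack_size(128 * 1024)
                    .spawn(|| std::thread::sleep(Duration::from_secs(10)))
                    .unwrap()
            })
            .collect();
        for h in handles {
            h.join().unwrap();
        }
    }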
So how do you write programs without data structures? You copy paste some implementation and modify it slightly at every use site?
Possible optimizations depend on the semantics of the language — C++ can do just orders of magnitude more with for example a string data structure, than what is possible with a C “string”. So no, C is not the most performant language in general — de facto C++ is used for everything performance critical.
Understanding full well that these are pretty artificial results, I'm surprised at just how well Node and Python fared against Go and Java. For all the default assumptions around languages that folks have, they've certainly held their own here.
As others have mentioned, the results come from the difference between async concurrency vs "lightweight threads" (in Go and Java).
In the nodejs test, there's very little javascript code being actually executed here. What's really being tested is nodejs's event queue (libuv), which is written in C. And it's being compared to Go's goroutine system, which is implemented in the Go runtime. But, if I recall correctly, Go's threading model includes a preemptive scheduler (which is very complex) and it allocates a stack per goroutine.
This doesn't test if Javascript or Go code is faster / more efficient. There's almost no javascript or go code being executed here at all. This is testing the execution engines of both languages. Go's execution engine has more features, and as we can see those features have a cost at runtime.
In C# program you'd generally want to call .ConfigureAwait(false) on the task you are about to await if the continuation does not require to be completed on a special thread (like UI).
Pfft doubtful. Rust, C and C++ all have very similar performance. And apparently a javascript framework is currently beating all of those languages in the composite score on techempower, though that seems a bit weird to me:
C doesn't have a built in, optimized async execution engine of any sort. C with threads would lose to any of the async runtimes here (including javascript).
C++ has async now. By all means, write up an async C++ example and compare it. I suspect a good C++ async executor would perform about as well as tokio, but it'd be nice to know for sure.
I'd also be interested to see how hard it is to write that C++ code. One nice thing about Rust and Go is that they perform very well with naively written code (i.e., the sort of thing ChatGPT, or most of your coworkers, would write). Naive C would use kernel threads, so if we're comparing average C code to average Go code at this test, the average Go code will end up faster.
(This isn't always true of rust. I've seen plenty of rust code which uses Box everywhere that runs crazy slow as a result.)
> And apparently a javascript framework is currently beating all of those languages in the composite score on techempower, though that seems a bit weird to me:
Just-js was essentially built to run the techempower benchmark. It's not a real stack you can use. An exercise in "what is possible".
Are you trying to say that you can write, from scratch, code just as safe as or safer than one of the most peer-reviewed crates in Rust? Congrats, the world has been waiting for you. But nonetheless, that's why commoners prefer to stick with what's safer for them.
Rust is going to do well on toy examples. This is because in a naive implementation, for each thread, Rust statically allocates enough space for the largest possible stack the thread will need. On a toy example this will be quite small. In real code, some branches will have large futures that will bloat the size of every thread, even if the thread doesn't take the bloated branch. You can mitigate this issue by heap allocating large futures on cold paths, but this requires insight and tuning.
Some runtimes can handle this for you, but they come with other performance tradeoffs.
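A sketch of that mitigation (hypothetical functions; tokio's full feature set assumed; the 16 KiB buffer just stands in for a branch with a large state machine):

    use std::time::Duration;

    // Hypothetical cold path whose future is large because a big buffer is
    // held across an await point.
    async fn rare_expensive_path() {
        let buf = [0u8; 16 * 1024];
        tokio::time::sleep(Duration::from_millis(100)).await;
        let _ = buf.len();
    }

    async fn common_cheap_path() {}

    async fn handle(rare: bool) {
        if rare {
            // Boxing keeps the large future out of `handle`'s own state
            // machine, so the common case stays small.
            Box::pin(rare_expensive_path()).await;
        } else {
            common_cheap_path().await;
        }
    }

    #[tokio::main]
    async fn main() {
        handle(false).await;
    }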
In real code I saw Rust beating Go- and Java-based code by 10x-50x in memory use. But too complex and too many factors to analyze, so not material for a blog post. Actually this tiny benchmark favors GC-based runtimes because it just sleeps - so it doesn't actually allocate much on the heap.
The Java tests should be optimized by making the ArrayList a fixed size. All this resizing of the huge array backing it and the potential final very high amount of empty backing array slots probably makes the Java result worse than it could be.
I wonder if the parallel stream API with a virtual thread pool would make it better or worse.
The ArrayList is just a growable array and is presumably using an exponential growth strategy so it's equivalent to what several of the others are doing. You could indeed tell Java how big the ArrayList will grow in advance, but you could for example likewise do that with the Rust Vec by asking for Vec::with_capacity(num_tasks) instead of Vec::new().
Since this is amortized exponential growth it's likely making negligible difference, although it is good style to tell data structures what you know about their size early.
The Go approach doesn't appear to pay this penalty, it's just incrementing a (presumably atomic?) counter, although I have no idea where these tasks actually "live" until they decrement the counter again - but most of the others are likewise using a growable array.
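For reference, a sketch of what a pre-sized Rust/tokio variant of the benchmark might look like (assuming tokio's full feature set; the 10-second sleep matches the article's workload):

    use std::time::Duration;

    #[tokio::main]
    async fn main() {
        let num_tasks: usize = std::env::args()
            .nth(1)
            .and_then(|s| s.parse().ok())
            .unwrap_or(1_000_000);

        // Pre-size the handle vector so it never reallocates while spawning.
        let mut handles = Vec::with_capacity(num_tasks);
        for _ in 0..num_tasks {
            handles.push(tokio::spawn(tokio::time::sleep(Duration::from_secs(10))));
        }
        for handle in handles {
            handle.await.unwrap();
        }
    }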
It looks like most of the languages are doing the same thing w.r.t. using a 'List'-style implementation that re-sizes/copies its internal buffer when it runs out of space.
One thing I wonder though is whether the managed languages pay a bigger penalty on the memory usage than the unmanaged languages due to this.
In the unmanaged languages, those intermediate buffers should be freed immediately during the re-size operation.
But in the managed languages, they might stick around until the next GC. So the peak memory would be higher.
In this example the runtime overhead is going to represent more of the memory usage anyway but it quickly isn't negligible. E.g. a list of size 1M is going to be ~8MB (64 bit addresses). So even if the array is re-sized from 1->2->4->8->...->~500K, the end result is not going to be worse than 2X the size.
Tokio seems oddly good compared to the others. Is that because it’s using a thread pool instead of creating something new for every task? I wonder how it would fare if the tasks were more intense, and how the time-to-completion compares between these approaches.
async Rust functions are compiled down to a finite state machine - a stackless coroutine, I think it can be called. I suspect that C#/.NET does something similar, though I am just speculating here. The difference between async-std and tokio would then be explained by lower per-task runtime overhead in tokio.
All the async and green thread runtimes are using thread pools by definition. Avoiding spawning as many OS threads as program's tasks is their main job.
This benchmark wasn't very rigorous, so maybe it has overlooked something, but the result seems plausible. Tokio is a good, mature runtime. Tokio-based frameworks are near top of TechEmpower benchmarks.
Rust's Futures were designed from the start with efficiency in mind. Unlike most async runtimes where chaining of multiple async calls creates new heap allocations, Rust flattens the entire call tree and all await points into a single struct representing a state machine. This allows the runtime to use exactly one heap allocation per task, even if the task is pretty complex.
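A sketch that makes the flattening visible (tokio assumed only for the sleep call; the future is never polled, we just measure it):

    use std::mem::size_of_val;
    use std::time::Duration;

    async fn leaf() {
        tokio::time::sleep(Duration::from_secs(10)).await;
    }

    async fn task() {
        // Both awaits are flattened into `task`'s single state machine;
        // no extra heap allocation is created per await point.
        leaf().await;
        leaf().await;
    }

    fn main() {
        let fut = task();
        // The whole call tree's persistent state lives in this one value.
        println!("future size: {} bytes", size_of_val(&fut));
    }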
It's nice to see, but frankly speaking, with current memory prices all I care about is speed, not so much memory consumption, unless I have very specific requirements.
Using thread pools/connection pools could help a lot both in performance (less time to create threads) and memory use (less fragmentation + less memory to garbage collect after short-lived allocations).
There's a guide somewhere about using connection pools, multiprocessing, and async I/O vs. pthreads, showing that the most obvious solution to the concurrency implementation does not necessarily have the best performance, but I couldn't find it right now.
This is bound to cause problems - the slightest change in your stack - language, runtime, OS - would affect that so much that it's completely pointless to rely on measurements taken now and apply them later (that said, this is a perfectly valid choice when you are locked in - like in the embedded or game-console world, but rarely is that the case outside).
Not for a typical website. Even if you're getting a massive amount of traffic, like a million requests per second, you'd want the request to finish as quickly as possible, so they never pile up to be 1M concurrent tasks. If the tasks have real work to do and can't finish quickly, the having 1M in-flight tasks compete for orders of magnitude fewer CPU cores may not be a good idea — it will waste lots of memory for their state and it will be hard to keep latencies reasonably even. You'd probably use a simpler queue and/or load-balance to more machines.
A scenario with this many tasks spawned is more realistic if you have mostly idle connections. For example, a massive chat server with tons of websockets open.
You might not have a choice. How should a load balancer be implemented? It has to keep those connections alive while the underlying services fulfill the request.
Also I would quickly return something to say the website is busy, and apply some kind of back pressure - since a process can handle only so much. But I have less experience there to say. Though my surprise while working with Java used for web serving was that if the GC took more than 1% (or was it 3%) of some time-sliced window, then the whole process was killed (that's how the team/company had it set up) - e.g. "better fail fast than serve slow" - the latter can quickly affect your business if it's relying on sub-100ms (~) response times.
The go memory usage can probably be explained by the fact that go uses fairly large stacks for goroutines (according to https://go.dev/doc/go1.2#stack_size it’s 8kb per goroutine, even though that’s old)
If it was true, the results would be much worse.
The average from my measurements is a bit above 2 kB.
This website says it is 2 kB: https://tpaschalis.me/goroutines-size/
Stacks changed a lot in Go subsequently. I believe that in 1.2, segmented stacks were still being used. (They now are contiguous and copied on upsize.)
The Rust threaded solution obviously produces actual threads, which will run "actually in parallel" if that is technically possible on your system.
The Rust tokio solution will use tokio's runtime which automatically spawns an appropriate amount of worker threads to parallelize your async tasks (by default it asks the OS how many CPUs there are and creates one such thread for each CPU).
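A sketch of making that explicit with tokio's runtime builder (assuming the rt-multi-thread and time features; the worker count here is arbitrary):

    use std::time::Duration;

    fn main() {
        // By default new_multi_thread() creates one worker per CPU core;
        // this can be pinned to an explicit number instead.
        let rt = tokio::runtime::Builder::new_multi_thread()
            .worker_threads(4)
            .enable_time()
            .build()
            .unwrap();

        rt.block_on(async {
            let task = tokio::spawn(tokio::time::sleep(Duration::from_secs(1)));
            task.await.unwrap();
        });
    }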
> Notably, at 1 million tasks, I observed that the overhead of launching tasks became evident, and most programs required more than 12 seconds to complete.
I didn’t quite get this part. Even if your runtime has no overhead, the fastest way you can execute 1 million 10-second tasks is 10 * 1 million / num_cores, right? Why would these million tasks only take 12 sec?