How much memory do you need to run 1M concurrent tasks? (pkolaczk.github.io)
253 points by pkolaczk on May 21, 2023 | 214 comments


It's not really fair to compare async/await systems with Go and Java.

Go and Java with lightweight threads provide a full-blown system, without any limitations. You don't have to "color" your code into async and blocking functions, everything "just works". Go can even interrupt tight inner loops with async pre-emption.

The downside is that Go needs around 2k of stack minimum for each goroutine. Java is similar, but has a lower per-thread constant.

The upside is that Go can be _much_ faster due to these contiguous stacks. With async/await, each yield point is basically isomorphic to a segmented-stack boundary. I did a quick benchmark to show this: https://blog.alex.net/benchmarking-go-and-c-async-await


On the contrary I think it is critical to compare them. You can then decide if the tradeoff is worth it for your use case.

Even at a million tasks Go is still under 3 GiB of memory. So that is roughly 3KiB of memory overhead per task. That is likely negligible if your tasks are doing anything significant and most people don't need to worry about this number of tasks.

So this comparison shows that in many cases it is worth paying that price to avoid function colouring. But of course there are still some use cases where the price isn't worth it.


One thing I wanted to add is that in golang, you end up passing context.Context to all asynchronous functions to handle cancellations and timeouts, so you “color” them regardless. Java folks with structured concurrency have the right idea here.
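
For readers less familiar with Go, a minimal sketch (hypothetical names, not from the article) of the pattern being described: as soon as a function needs to honor cancellation or a timeout, ctx appears in its signature and in everything that calls it.

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    // sleepUntilCancelled is a hypothetical leaf function: it takes a
    // context.Context only so that callers can cancel it or impose a deadline.
    func sleepUntilCancelled(ctx context.Context, d time.Duration) error {
        t := time.NewTimer(d)
        defer t.Stop()
        select {
        case <-t.C:
            return nil // slept the full duration
        case <-ctx.Done():
            return ctx.Err() // cancelled or timed out by a caller
        }
    }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), time.Second)
        defer cancel()
        fmt.Println(sleepUntilCancelled(ctx, 10*time.Second)) // prints: context deadline exceeded
    }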


The correct interpretation here is that golang chose explicit context passing where java chose implicit.

It's similar to explicit DI vs implicits.

The function coloring metaphor doesn't make sense since the calling convention is the same and there are no extra function keywords (`async` vs non-async).


This is a convention, and you also don't _have_ to do it.

A function without a context can still be used just fine from any code, you just won't be able to cancel it.


This isn't true. I have use cases that don't require cancellation or timeouts. The tasks I'm running don't involve the network, they either succeed or error after an expensive calculation.


This is an interesting post. My understanding: most of the use cases for async code are I/O-bound operations. So you fire off a bunch of async I/O requests and wait to be notified. Logically, I/O requests normally need a timeout and/or cancel feature.

However, you raise a different point:

    The tasks I'm running don't involve the network, they either succeed or error after an expensive calculation.
This sounds like CPU-bound, not I/O-bound. (Please correct me if I misunderstand.) Can you please confirm if you are using Go or a different language? If Go, I guess it still makes sense, as green threads are preferred over system threads. If not Go, it would be nice to hear more about your specific scenario. HN is a great place to learn about different use cases for a technology.


I think I just responded too hastily. I am working in Go. There is file IO going on in addition to the calculation (which because of a NAS or whatever could also be network IO). As a practical matter I had never felt the need to offer cancellation or timeout for these use cases, but I probably should, so mea culpa.


What's the point of multiplexing tasks on a particular core if the tasks don't do any I/O? It would then be strictly faster to execute the tasks serially, spread across as many cores as possible.


See response to sibling comment.


How is passing in a context “coloring”? This is GC for goroutines — specifically long running ones.


I don’t get how people who mention “no color” don't get this: context.Context IS a color.


It's not _quite_ the same: you can't call async code from a sync context (hence the color issue), but I can always pass a "context.Background()" or such as a context value if I don't already have one.


> you can't call async code from a sync context (hence the color issue), but I can always pass a "context.Background()" or such as a context value if I don't already have one.

You can always pass context.Background, in this metaphor creating a new tree of color.

You can always call "runtime.block_on(async_handle)", in this metaphor also creating a new tree of color.


Go doesn't have async/sync distinction, no keywords so the metaphor doesn't hold.


You can always pass the async executor to the sync code and spawn async coroutines into it. And you can keep it in a global context as well to avoid parameters. E.g. there is `Handle::current()` for exactly this purpose in Tokio. Function coloring is just a theoretical disadvantage - people like to bring it up in discussions, but it almost never matters in practice, and it is not even universally considered a bad thing. I actually like to see in the signature if a function I'm calling into can suspend for an arbitrary amount of time.


It's not the same because you can have interleaving functions which don't know about context.

Say you have a foreach function that calls each function in a list. In async-await contexts you need a separate version of that function which is itself async and calls await.

With context you can pass closures that already have the context applied.
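
A minimal Go sketch of that point (forEach and step here are hypothetical): the iteration helper stays context-free, and the closures capture the context themselves.

    package main

    import (
        "context"
        "fmt"
    )

    // forEach knows nothing about context; it just runs plain func() error values.
    func forEach(fns []func() error) error {
        for _, fn := range fns {
            if err := fn(); err != nil {
                return err
            }
        }
        return nil
    }

    // step is context-aware: it bails out early if the context was cancelled.
    func step(ctx context.Context, name string) error {
        select {
        case <-ctx.Done():
            return ctx.Err()
        default:
            fmt.Println("running", name)
            return nil
        }
    }

    func main() {
        ctx := context.Background()
        // Close over ctx so the context-free forEach can drive context-aware work.
        fns := []func() error{
            func() error { return step(ctx, "a") },
            func() error { return step(ctx, "b") },
        }
        fmt.Println(forEach(fns))
    }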


You can do that with async functions too by having the caller use, for example in rust, 'runtime.block_on(async_handle)'

You're talking about doing this:

    f1 := func() error { return nil }
    ctx_f1 := func(context.Context) error { return nil }
    fns := []func() error{f1, func() error { return ctx_f1(ctx) }} // ctx captured from the enclosing scope
Basically, writing an anonymous conversion function.

In, say, rust, the equivalent in this analogy would be:

    let f1 = || -> Result<(), ()> { Ok(()) };
    let async_f1 = async { || -> Result<(), ()> { Ok(()) } };
    let fns = vec![f1, runtime.block_on(async_f1)];
That conversion function, 'runtime.block_on', doesn't really seem that different. The analogy still seems to hold.


No, because if the function that includes those lines is itself async, it will now block, while the equivalent go coroutine will still preserve the stackful asyncness. I.e. you can't close over the yield continuation in rust, while it is implicitly done in go.

For a more concrete example: let's say you have a generic function that traverses a tree. You want to compare the leaves of two trees without flattening them, by traversing them concurrently with a coroutine [1]. AFAIK in rust you currently need two versions of traverse, one sync and one async, as you can neither close over nor abstract over async. In go, where you have stackful coroutines, this works fine, even when closing over Context.
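
A rough Go sketch of that example (hypothetical Tree type; goroutines plus channels standing in for the coroutines, with the context closed over by the traversal):

    package main

    import (
        "context"
        "fmt"
    )

    type Tree struct {
        Left, Right *Tree
        Value       int // only meaningful for leaves (Left == Right == nil)
    }

    // leaves walks t and sends its leaf values on out. The recursive walk is an
    // ordinary closure; it closes over ctx so the traversal stays cancellable.
    func leaves(ctx context.Context, t *Tree, out chan<- int) {
        defer close(out)
        var walk func(t *Tree) bool
        walk = func(t *Tree) bool {
            if t == nil {
                return true
            }
            if t.Left == nil && t.Right == nil {
                select {
                case out <- t.Value:
                    return true
                case <-ctx.Done():
                    return false
                }
            }
            return walk(t.Left) && walk(t.Right)
        }
        walk(t)
    }

    // sameFringe compares the leaf sequences of two trees without flattening them.
    func sameFringe(ctx context.Context, a, b *Tree) bool {
        ctx, cancel := context.WithCancel(ctx)
        defer cancel() // unblocks both traversals if we return early
        ca, cb := make(chan int), make(chan int)
        go leaves(ctx, a, ca)
        go leaves(ctx, b, cb)
        for {
            va, oka := <-ca
            vb, okb := <-cb
            if !oka || !okb {
                return oka == okb // same fringe iff both run out together
            }
            if va != vb {
                return false
            }
        }
    }

    func main() {
        t1 := &Tree{Left: &Tree{Value: 1}, Right: &Tree{Left: &Tree{Value: 2}, Right: &Tree{Value: 3}}}
        t2 := &Tree{Left: &Tree{Left: &Tree{Value: 1}, Right: &Tree{Value: 2}}, Right: &Tree{Value: 3}}
        fmt.Println(sameFringe(context.Background(), t1, t2)) // true
    }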

So yes, in some way Context is a color, but it is a first class value, so you can copy it, close over it and abstract over it, while async-ness (i.e. stackless coroutines) is typically second class in most languages and does not easily mesh with the rest of the language.

[1] this is known as the "same fringe problem" and it is the canonical example of turning internal iterators into external ones.


How do you figure? What’s requiring my async function to take a context in the first place? Even if it takes one, what’s stopping it from calling a function that doesn’t take one? Similarly, what’s stopping a function that doesn’t take a context from calling one which does?


It's absolutely fair to compare them.

Just like comparing C vs Python performance.

"B...b...but Python is easier to write."

Sure. That is a trade off. And whether the trade off makes sense depends on the situation.


Erlang / Elixir processes use about 0.5kb per process, have a preemptive scheduler on the BEAM to make sure work is consistent, and an isolated heap for each process to avoid stop-the-world garbage collection.

It’s as good as it gets for this type of thing as it was designed for it from the ground up.


And it looks like the SystemLimitError issue they encountered is one flag away: https://elixirforum.com/t/too-many-processes-error/22224/2


This was the first thing that occurred to me when I saw the error - should have been pretty straightforward to tweak the settings to give a clearer picture.


Fixed already.


Thanks! Looked back at the results and that's about what I'd expect. Erlang/Elixir isn't a silver bullet, but amazing in its own way for other reasons. Ultimate max performance and memory efficiency definitely isn't one of them :).


> Erlang / Elixir processes use about 0.5kb per process

Last I checked it used about 2.5K. Beam reserves around 300 words per process (depends on exact options), and a word is 8 bytes, not 1.

You can get it lower (obviously at a cost as soon as you start using the process and need space) but nowhere near 512 bytes; the process overhead alone is around 800 bytes.


I don't know about C#, but Rust async/await doesn't allocate anything on the heap by itself. So it is not a universal property of all async/await implementations, contrary to what you imply in your blog post.


For C#, you might want to try a different version with ValueTask instead of Task. It’s more memory-friendly.

It would also be interesting to try Native AOT compiling both versions…


+1 to Native AOT, I'd love to see the data. Pretty easy to do, add a line of XML to the csproj and a modified `dotnet publish` command.

https://learn.microsoft.com/en-us/dotnet/core/deploying/nati...


Just tried it. No difference whatsoever.


NativeAOT will have negligible impact on memory usage or performance here. JIT performance on average is higher than NAOT.


Yeah and I think NAOT still embeds the same(?) runtime GC into the native binary. So, for memory usage, I would expect it to be nearly/exactly the same.


I may remember wrongly here, but wasn't ValueTask merely the go-to option when it's expected that the task would finish synchronously most of the time? I think for the really async case with a state machine you just end up boxing the ValueTask and ending up with pretty much the same allocations as Task, just in a slightly different way.



There is no magic.

Rust also has to allocate (box) when you need recursive async calls: https://rust-lang.github.io/async-book/07_workarounds/04_rec...

What Rust does is allow you to remove one layer of allocations by using the Future to store the contents of the state machine (basically, the stack frame of the async function).


It lets you remove all allocations within a task, except for recursive calls. This results in the entire stack of a task being in one allocation (if the task is not recursive, which is almost always the case, for exactly this reason), exactly like the advantage you describe for Go. And unlike Go, that stack space is perfectly sized and will never need to grow, whereas Go is required to reallocate the stack if you hit the limit, an expensive operation that could occur at an unknown point in your program.


Again, no magic. Future trait impl stores all the state machine that Rust can see statically. If you have complicated code that needs dynamic dispatch or if you need recursive calls, Rust will also have to allocate.

This is not all that different from Go, except that Go preallocates the stack without doing any analysis.

Actually... I can fix that! I can use Go's escape analysis machinery to statically check during compilation if the stack size can be bounded by a lower number than the default 2kb. This way, I can get it to about ~600 bytes per goroutine. It will also help to speed up the code a bit, by eliding the "morestack" checks.

It can be crunched a bit more, by checking if the goroutine uses timers (~100 bytes) and defers (another ~100 bytes). But this will require some tricks.


In your Go benchmark code, how is the calculation actually being done across threads? I don't see the `go` keyword anywhere in the source code.


But that's the point I'm trying to show. In Go code all calls are the same. There is no sync/async distinction.

In contrast, in C# (or any other similar system) async calls are _expensive_ compared to regular function calls.


You've misunderstood how go routines work. You need to put the "go" keyword before the function call in order for it to be run concurrently.


I know perfectly well how goroutines work.

I'm NOT trying to show that Go is faster than async/await or anything similar. I'm showing that nested async/await calls are incredibly expensive compared to regular nested function calls.


You need to add the go keyword to change a normal function call into a goroutine. If you removed async/await and Task/Return from the C# code example, it would perform pretty much the same as Go.

If you want to show that async/await calls are expensive, then you should have shown two code samples of C#: one with async/await, and one without.

Or you could have done the same for Go: show one example with goroutines, and one without.

But I think everyone already knows that async/await and goroutines have their costs.

The problem is more that you are comparing Go without goroutines (without its allocation costs) to a C# example with a poor implementation of async/await.


Yeah this isn't using goroutines at all? I don't see how this is a good comparison of goroutines vs coroutines.


I'm wondering that as well. The C# code seems really unrealistic though, using a lot of tasks for CPU-bound work. A fair comparison would at least need to involve some degree of IO. Does Go maybe automatically parallelize everything it can? That would be one potential answer to the initial question.


Just like there's no `spawn` or such in the async code.

It's trying to show the overhead of the async machinery, compared to "uncolored" functions.


The author doesn't have the understanding to have even gotten close to the capability for nuance you're asking for. They just copied code from ChatGPT and ran it and timed it and made graphs and it somehow got on HN.


You haven't written a concurrent Go program. Not sure what you're trying to demonstrate.


Yes. Because that's the point. I'm comparing overhead of async calls vs. normal calls.


Note that you’re changing multiple variables, namely the entire compiler and runtime.


If you know a system that both has async/await _and_ full-blown green threads, then I can try to test it.


Racket has full-blown green threads (virtual threads)[1] that serve as an alternative to the async/await paradigm. It also supports use of native (OS) threads[2].

[1] https://docs.racket-lang.org/reference/eval-model.html#%28pa...

[2] https://docs.racket-lang.org/reference/places.html

[&] https://docs.racket-lang.org/reference/futures.html


C++ now has both async/await (built-in) and multiple flavors of stackful coroutines (as external libraries). You can run both on top of ASIO so that you can measure purely the difference of the coroutine implementation as opposed to the runtime.


Isn’t the first example completely synchronous? If that’s what’s being compared, why not use C# for both programs?


With async/await (which I think C# also uses) you actually unwind the entire stack and start a new one every time you do async callbacks. With fibers / green threads / whatever you instead store stacks in memory.

It took me a while to figure this out, thanks to articles I came across (such as "what color is your function?").

There are memory / speed trade-offs, as per usual. If you have a lot of memory, and can keep all the stacks in memory, then go ahead and do that. It will save on all the frivolous construction / destruction of objects.

Having said that, my own experience suggests that when you have a startup, you should just build single-threaded applications that clean up after every request (such as with PHP) and spawn many of them. They will share database pool connections etc. but for the most part it will keep your app safer than if they all shared the same objects and process. The benchmarks say that PHP-FPM is only 50% slower than Swoole, for instance. So why bother? Starting safe beats a 2x speed boost.

And by the way, you should be building distributed systems. There is no reason why some client would have 1 trillion rows in a database, unless the client themselves is a giant centralized platform. Each client should be isolated and have their own database, etc. You can have messaging between clients.

If you think this is too hard, just use https://github.com/Qbix it does it for you out of the box. Even the AI is being run locally this way too.


Cyberax,

I missed my opportunity to reply to your comment, but I really appreciate it, and I wanted to find a way to get this back to you. The comment in question:

"Well, I was one of the engineers that made the change :) I'm not sure how much I can tell, but the public reason was: "to make pricing more predictable". Basically, one of the problems was customers who just set the spot price to 10x of the nominal price and leave the bids unattended. This was usually fine, when the price was 0.2x of the nominal price. But sometimes EC2 instance capacity crunches happened, and these high bids actually started competing with each other. As a result, customers could easily get 100 _times_ higher bill than they expected."

There was more to it than that, but I figure that's a good enough reference point.

Thank you for these improvements. It doesn't change anything, in terms of how much savings I can get by following the latest generations and exotic instance-types, but it does help with the reliability of my workloads.

It's been a huge benefit to me, personally, that I can provide some code that handles the possibility of servers dying, with the benefit of 80% cost savings without using RIs.


Thank you! It means a lot!

You can also try another product of my former team: https://aws.amazon.com/savingsplans/ - it's similar to RI, but cheaper because it doesn't provide an ironclad guarantee that the instance will be available at all times. It's still a bit more expensive than spot, but not by much.


Your benchmark for Go runs in a single goroutine.


Yes. On purpose.


Function coloring is orthogonal to the underlying mechanism.

You can think of the Java / Go approach as adding the bind points automatically.


> You don't have to "color" your code into async and blocking functions

Unpopular opinion: this is a good idea. It encourages you to structure your code in a way such that computation and I/O are separated completely. The slight tedium of refactoring your functions into a different color encourages good upfront design to achieve this separation.


Good upfront design? This stinks of implementation detail leakage affecting high-level design, which should be a cardinal sin.

One should design based on constraints that best match the problem at hand, not some ossified principle turned "universal" that only really exists to mask lower-level deficiencies.


When the implementation detail involves whether or not the function will perform I/O, it is better to let that leak.

Excessive hiding of implementation detail is what leads to things like fetching a collection of user IDs from a database and then fetching each user from an ID separately (the 1+N problem). Excessive hiding of implementation detail is what leads to accidental O(N^2) algorithms. Excessive hiding of implementation detail is what leads to most performance problems.


> When the implementation detail involves whether or not the function will perform I/O, it is better to let that leak.

I guess Haskell is fully based around that idea :)


But async does not help with that. As shown elsethread, in most languages you can block from async functions just fine.


A concurrent algorithm is a design choice and part of the architecture, not an implementation detail.


You don't necessarily know ahead of time if an async or a sequential strategy is better. With colored functions you kind of have to pick ahead of time and hope you picked right or pay a big effort penalty to do a side by side comparison.


async/sync and sequential/parallel are orthogonal concerns. You can write sync code which does work in parallel (on different threads/cores) for something like a numerical algorithm that can be parallelized in some way. Deciding whether something is sync or async is about whether it needs to suspend (in practice, mostly whether it needs to do IO) which is much easier to understand up front. Sometimes it changes of course, in which case you have to do some refactoring. But in a decade of programming professionally in Scala and Rust (both of which have "colored" functions) I can count on one hand the number of times where I had to change something from sync to async and it took more than a few minutes of refactoring to do it.


>num += Calculate(20).Result;

You're sure you should use .Result here instead of awaiting it?


To expand upon this thought, here is the AsyncGuidance doc[1] on why not to use .Result to get the return value of a completed Task in C#.

To make this simple they introduced async Main[2] a few years ago.

[1]: https://github.com/davidfowl/AspNetCoreDiagnosticScenarios/b...

[2]: https://github.com/dotnet/csharplang/blob/main/proposals/csh...


None of what you say is a gotcha. An ignorant programmer stands only to benefit from Go


The title is asking a very quantifiable question. And I was hoping the article would be a summary of expertise and deep dives the author took into various language's runtimes.

But then I read

> With a little help from ChatGPT

And sure enough the results & prose end up pretty much worthless, because the author didn't do any actual critical thinking or technical work. (For instance, there's nothing more to "Go's memory usage ballooned at 1M tasks" beyond the observation. Why not? Just dig in there; it's all documented, you just have to read some stuff.)


It’s also really strange using python and node in an article about concurrency without mentioning the GIL (python) or node single threading.


The whole idea of async functions is to run tasks concurrently and with simple syntax on a single thread, hiding the actual concurrency under the runtime, which only switches tasks when they are waiting for IO.

It doesn’t matter that node is single threaded, nor does the GIL have any impact on a single thread.

It’s once you start doing cpu-heavy work inside the tasks that this will hit you.


No, the single-thread aspect is just a limitation of Python & Node. Rust's Tokio is an example of a multithreaded async framework.


That's what happens when you write a blog about a technical topic you don't know much of anything about I guess


GIL is not a problem for concurrency. It is a problem for parallelism when doing CPU bound work.


Do you mind sharing why Go ballooned at 1M?


Because goroutine stacks start at a minimum of 2 kB. But it is not obvious whether this is the only factor that contributes to memory usage, so I believe experimenting to verify the theory is still quite valuable.
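
A rough sketch of one way to sanity-check that (not the article's benchmark; numbers vary with Go version and GC state): spawn N parked goroutines and look at the runtime's own stack accounting.

    package main

    import (
        "fmt"
        "os"
        "runtime"
        "strconv"
        "sync"
    )

    func main() {
        n := 100000
        if len(os.Args) > 1 {
            n, _ = strconv.Atoi(os.Args[1])
        }

        var before, after runtime.MemStats
        runtime.GC()
        runtime.ReadMemStats(&before)

        var wg sync.WaitGroup
        block := make(chan struct{}) // kept open until the end: parks every goroutine
        wg.Add(n)
        for i := 0; i < n; i++ {
            go func() {
                defer wg.Done()
                <-block
            }()
        }

        runtime.ReadMemStats(&after)
        stack := after.StackSys - before.StackSys
        fmt.Printf("goroutines: %d, stack memory: %d bytes (~%d bytes per goroutine)\n",
            runtime.NumGoroutine(), stack, stack/uint64(n))

        close(block)
        wg.Wait()
    }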


> "All programs were launched using the release mode if available. Other options were left default."

So it's likely these numbers simply reflect choices in the default configuration, not the ultimate limit with tuning.

I'm starting to dislike these low effort ChatGPT articles. Getting an answer from it is not the same thing as actually learning what's going on, but readers will walk away with the answers uncritically.


Lately I’m seeing a lot of this baked-in assumption that ChatGPT writes good/idiomatic/performant code, as if the code it generates is ~The Answer~. I guess it’s the modern version of copy/pasting from Stack Overflow. The code I’ve had it write isn’t universally terrible but it almost always has some low-hanging performance fruit (in JS for instance, it seems to love spreading arrays). I’m not sure how much I trust it to write good benchmark code!


Does spreading arrays come with a performance penalty?


Adding to the others here, yeah - it has to loop over the whole array, and it allocates a new one. The allocation itself costs something and it also creates work for the garbage collector (freeing the old one). So it’s a lot of work compared to a .push()! Sometimes you do need the immutability aspect though, because a lot of things use referential equality for change detection (React is a big one).


Spreading allocates a new array. Most things you can do with spread you can do with normal functions that either just directly access an entry or mutate an array directly. [first, …rest] is nice, but will always be slower than direct access.


Spreading an array requires looping over all the array elements, which is O(n), compared to something like Array.push() which is just O(1), although spreading has other benefits like copying rather than mutating.


I've been seeing a lot more "ChatGPT only produces junk code" comments (case in point, this entire thread), and that hasn't really been the case either.


For the majority of those benchmarks it produced the right code on the first go. It struggled more with Java VirtualThreads - probably fewer programs in the training set. It also had a tendency to overcomplicate things (adding unnecessary code). So there were a few iterations needed plus some hand edits.


To play Devil’s Advocate, this is a good test of “how many concurrent tasks will an uncritical developer relying on ChatGPT end up achieving with low effort”.


Right, realistically your product will not primarily be used by the most talented experts, who are optimising as best as possible given intimate knowledge of internals. People will do what basically works, and it's arguably doing a disservice when "benchmarks" avoid showing us also naive performance without the tuning. The tuned performance is worth knowing, but realistically most customers will get the naive performance, ain't nobody got time to learn to tune everything.

I think this applies all the way down to hardware features and programming languages. For example the naive use of sort() in C++ will go faster than in Rust (in C++ that's an unstable sort in Rust it's stable) but may astonish naive programmers (if you don't know what an unstable sort is, or didn't realise that's what the C++ function does). Or opposite example, Rust's Vec::reserve is actively good to call in your code that adds a bunch of things to a Vec, it never destroys amortized growth, but C++ std::vector reserve does destroy amortized growth so you should avoid calling it unless you know the final size of the std::vector.


That is a fair point. I don’t think it saves this particular analysis, of course. Probably what’s important to know is something we might call the “Pareto naive performance”; with just a little bit of effort (the first 20% or so), you can get some significant performance improvement (the first 80% or so). Even the naive programmer just banging together something that works in Elixir is going to quickly find out how to raise Erlang’s default 260k process limit when they hit that, after all.


Trying to raise the limit, but it didn't work:

    $ /usr/bin/time -v elixir --erl "-P 10000000" main.exs 1000000

    08:42:56.594 [error] Too many processes
    ** (SystemLimitError) a system limit has been reached
Any other ideas how to raise the limit?


I asked ChatGPT and it said

“This limit can be adjusted with the `+P` flag when starting the BEAM… To adjust the limit, you could start the Elixir application using a command like the following:

    elixir --erl "+P 1000000"
“If you are getting this error not because of the BEAM limit, but rather because of your operating system limit (like the limit on the number of open files or the number of child processes), you will need to look into how to increase these limits on your specific operating system.”


Yes, again ChatGPT was better than just a random search. If writing this blog post has taught me something, it's to use ChatGPT more, not less ;)

BTW, I fixed the 1M benchmark and Elixir is included now.


I’m glad you saw the irony in my reply. I was a little worried it would come off acerbic, like “since you use ChatGPT, you only deserve ChatGPT”. The real intent was double-edged, of course - to demonstrate that ChatGPT isn’t just dumb-codemonkey-as-a-service, it can also quite easily solve a lot of the problems it is purported to create. You have been handling mean comments here (including my own mean comments) with grace, so I took the risk.


> Even the naive programmer just banging together something that works in Elixir is going to quickly find out how to raise Erlang’s default 260k process limit when they hit that, after all.

Sure. Will redo this part. I doubt it would allow Elixir to get a better result in the benchmark, though, as it was already losing significantly at 100k tasks. Any hints on how to decrease Elixir memory usage? Many people criticize using defaults in the comments, but don't suggest any specific settings that would improve the results. And a part of the blogging experience is to learn things - also from the author's perspective ;)


    1..num_tasks
      |> Task.async_stream(fn _ ->
          :timer.sleep(10000)
        end, max_concurrency: num_tasks)
      |> Stream.run()
https://elixirforum.com/t/how-much-memory-do-you-need-to-run...


Time to fire a bunch of people then


> I'm starting to dislike these low effort ChatGPT articles.

The next few decades are going to be tough for you, I think. Brace yourself!


Curious what build configuration changes would help with this. Specifically with python/Go because that’s what I’m most familiar with, but any insight into tweaking build configs to increase performance here would be really interesting to me.


The results can be ignored.

You can't just copy-paste ChatGPT without even a sanity check e.g., Python code is wrong:

   await sleep(10)
sleeps 10+ seconds and returns None but the intent is to pass a coroutine to create_task() (await should be removed at the very least)


Javascript is better, but its still sloppy:

    tasks.push(
      (async () => {
        await delay(10000);
      })()
    );
.. So tasks.push is called with an IIFE, which returns a promise, which resolves after it's internally awaited another promise...

You could just do this, which avoids about 3 unnecessary allocations:

    tasks.push(delay(10000))
And I'm not convinced promisify(setTimeout) is what you want here. I'd just use this:

    const delay = timeout => new Promise(resolve => setTimeout(resolve, timeout))
... Which honestly should be part of the standard library in javascript. However, I'm less convinced that replacing the delay function will make a big difference to performance.


The article was updated to remove the cruft you noted. It dropped the 1M memory usage substantially.

Promise timers are now packaged with node at least; they allocate about 20% less memory for the 1mil test (on v18).

    const { setTimeout } = require('node:timers/promises')
https://nodejs.org/api/timers.html#timers-promises-api


Oh I didn't know that! Thanks!


Thanks for the suggestion. I fixed it and indeed it improved memory consumption of NodeJS.


Heh confirming the old adage: If you want to know the right answer to a problem online, post the wrong answer. People will come out of the woodwork to correct you.

I uh, may be in this picture.


Yeah Python got done dirty here. I tried a million tasks and I saw a peak of about 166MB. I'm still getting used to async in Python, but this seems to work fine though?

    import asyncio
    
    
    async def do_task():
        await asyncio.sleep(10)
    
    
    async def main(num_tasks: int):
        tasks = []
    
        for task_id in range(num_tasks):
            task = asyncio.create_task(do_task())
            tasks.append(task)
    
        print("spawned tasks")
        await asyncio.gather(*tasks)
    
    
    if __name__ == "__main__":
        asyncio.run(main(1_000_000))


Thank you! I fixed the code and updated the benchmarks. However, your 166 MB doesn't match my testing (I'm getting about 240 MB at 100k tasks, and 10x more at 1M). Are you sure you haven't misread your RSS?


Nope, that's what it is for me but you're using an older Python. 3.11 has a lot of improvements and not sure why you wouldn't use the latest?


If Python 3.11 is available, then asyncio.TaskGroup could be used instead of asyncio.gather() https://docs.python.org/3/library/asyncio-task.html#task-gro...


Running Ubuntu LTS and that was the version from the official repo. Will try a newer version then.


Ok, this one is funny. Actually ChatGPT got it right and there was a create_task before, but right before committing I "inlined it" and introduced a bug. Thanks for letting me know. Will double check that it didn't mess up the results.

And BTW - I'm very far from copy-pasting from ChatGPT. ChatGPT helped with this code, but it often required many iterations and also some manual cleanup.


Haha, I had to try a PHP build with the parallel extension[1] with some rather simple code[2]:

- single job: 27.8 mb (respectable)

- 10k: 8.3 GB!

At this point, I gave up. I'm 99% sure trying to go to 1 million will crash my computer.

[1]: https://www.php.net/manual/en/intro.parallel.php

[2]: https://gist.github.com/withinboredom/8256055709c0e6a8272b3a...


I should add that the good ole' fork approach[1] only uses 40.7mb for 10k, 40.7mb for 100k, and 40.7mb for 1 million.

Got to love COW memory. :)

[1]: https://gist.github.com/withinboredom/c811f3c433904235a5f196...


Another benchmark of the "I don't know what I'm actually measuring" flavor.

I don't even know where to start: from the lack of env settings, system settings, compiler settings to the fact that no concurrent work was actually done, this is one giant flawed benchmark.

The only lesson here should be: Don't take any of this seriously.


Just read the text. It lists versions and states it was using defaults in the release mode. Most of those compilers don't have any fancy options actually. E.g go is just `go build`, Python is just `python3 ...`. Rust options are in the project definition. Anyway, I'll add those command line invocations if it causes so much controversy.


>from the lack of env settings, system settings, compiler settings

compiler settings?? I bet he used defaults everywhere


Benchmarks for threads might slightly underestimate memory usage. For threads, some amount of memory is spent by the kernel (eg, memory maps themselves) and won’t show as maxrss.

> a high number of concurrent tasks can consume a significant amount of memory,

So Go needs 3gigs for a million tasks. Relatively speaking, for an app that does a million concurrent somethings, 3gb is probably peanuts.

It seems that memory isn’t the limiting factor here.


Crystal 1.8.2 https://crystal-lang.org/:

    n = ARGV.first.to_i32
    finished = Channel(Nil).new
    n.times do
      spawn do
        sleep(10.seconds)
        finished.send(nil)
      end
    end
    n.times { finished.receive }
Memory usage determined with `ps ax -o pid,rss,args`. Runtime measured with `/usr/bin/time -v`. (Had to run `sudo sysctl vm.max_map_count=2500000` to get the final two rows.) Results:

    N        Runtime(s)  RSS(kiB)  RSS(MiB)
    =======================================
          1       10.01      1792         2
       1000       10.01      6016         6
      10000       10.11     45184        44
     100000       11.00    435840       426
    1000000       20.25   4336768      4235
"System time" rather than "user time" is the majority (7.94s system time, 2.83s user time in the 20.25s wall time run). Is this pointing to memory allocations?

Crystal Fiber docs https://crystal-lang.org/api/1.8.2/Fiber.html says "A Fiber has a stack size of 8 MiB which is usually also assigned to an operating system thread. But only 4KiB are actually allocated at first so the memory footprint is very small." -- and perhaps unsurprisingly, 4KiB times 1000000 is approximately 4 GiB. Nice when the math works out like that :)

To me this is useful as a baseline for something like a websocket server, with some number of idle connections, each Websocket connection mapped to a Fiber that is waiting for I/O activity but mostly idle.


I'm a big Rust fan, so I'm happy Tokio holds up so well. I find it to be one of the simplest and quickest ways to write task based programs.

Gotta wonder whether you can challenge those numbers with c++ though. I suspect you can but with a lot of work, where Rust is pretty effective written naively.

Of course the big issue with these comparisons is it's hard to be an expert in so many different languages. How do you know if what you're doing is the best way in each language?


I find the opposite to be true, Tokio is a fine library but it makes the whole program very complex.

Perhaps it's because I generally don't write code for many short lived workloads, but I find using real threads to be a lot easier to work with (plus it saves time during compilation). Tokio is great at what it does, but I have to wonder how relevant these types of threads are when "fearless concurrency" is one of the token advantages of the language.

I think this article would've been a lot better if it did some actual calculations in the spawned threads, rather than simply waiting. A smart enough compiler would see that the results of the wait calls, which usually don't come with any other side effects, are never used and optimize them out entirely.


As of today (with .NET 7 and 8 preview) 120MB base footprint memory usage by .NET is indeed surprising. A few questions:

- How are you exactly launching the workload?

- How much memory does your system have?

- Could you try running it with .NET 7 instead?

The last two are especially relevant because in the recent versions each release got extra work to reduce memory footprint on Linux (mostly because it's an important metric for container deployments).

Overall, 120MB sounds about right if you are launching a .NET 6 workload with some extra instrumentation attached while the runtime picks Server GC which is much more enthusiastic about allocating large segments of memory, especially if the system has large amount of RAM.

As for Tasks - C# tasks are in many ways similar to Rust's futures (both are yielding-to-the-caller state machines) and take very little memory, which scales directly with the amount of variables that persist across "yield points" (which you have none).


> Overall, 120MB sounds about right if you are launching a .NET 6 workload with some extra instrumentation attached while the runtime picks Server GC which is much more enthusiastic about allocating large segments of memory, especially if the system has large amount of RAM.

Also if you have lots of CPU cores then the CLR will (pre-)allocate up-to ~10MB/core for per-hardware-thread GC heaps. So my 16-core (32-thread) desktop will run every .NET program with a starting (OS-level) memory usage of ~200MB for a Hello, World program - though actual total GC usage will be < 1 MB.


I noticed he didn't use the latest version of each language. Maybe that's just what you get on Ubuntu 22.04 LTS?


Using .NET 6 is perfectly fine! It is the latest LTS after all, it's just that on machines with a lot of RAM the GC can indeed pre-allocate quite a bit of memory if, based on hardware configuration, it chooses to use Server GC. This, however, is being tuned with each subsequent release hence I'm curious to see the numbers on 7 on the author's hardware.


This is not a comparison of 1 million concurrent tasks.

For a bunch of these tests, it is measuring 1 million tasks multiplexed onto a smaller number of os threads.

In many implementations the cost of concurrency is going to be small enough that this test is really a measure of the stack size of a million functions.


There is sometimes a distinction made between concurrency and parallelism, where "concurrency" just means the programming interface looks parallel but is allowed to use out-of-order task-switching under the hood, and "parallelism" is real-deal usage of CPU cores. Both ways of looking at things are useful in different situations.


That is a good point, but parallelism degrades to concurrency when you run out of CPU cores.


The Java tests (Loom in particular), and possibly other language runtimes too in this benchmark, can't be compared without looking at the runtime settings.

The JVM for example has a default heap size of 25% to 50%, depending on how much memory is available in the system. Given that the code is using ArrayList, there will be more need for a bigger heap, due to array expansion. To eliminate that noise, the code should be changed to a pure Thread[] array with a fixed size based on numTasks.

Once that is done, you may as well run up to 10_000 tasks (virtual threads) within a JVM that consumes no more than 32 MB of RAM (compared to 78MB as originally reported by OP).

These were the best results I could find after playing a bit:

  # Run single task
  docker run --memory 26m --rm java-virtual-threads 1

  # Run 10 tasks
  docker run --memory 26m --rm java-virtual-threads 10

  # Run 100 tasks
  docker run --memory 26m --rm java-virtual-threads 100

  # Run 1_000 tasks
  docker run --memory 28m --rm java-virtual-threads 1000

  # Run  10_000 tasks
  docker run --memory 32m -e JAVA_TOOL_OPTIONS=-Xmx25m --rm java-virtual-threads 10000

  # Run 100_000 tasks
  docker run --memory 200m -e JAVA_TOOL_OPTIONS=-Xmx175m --rm java-virtual-threads 100000

  # Run 1_000_000 tasks
  docker run --memory 950m -e JAVA_TOOL_OPTIONS=-Xmx750m --rm java-virtual-threads 1000000


I’m surprised at Elixir failing with 1M tasks. I’ve definitely done this sort of test with Erlang without an issue. Although I can’t really remember right now if I pushed it to 1M tasks.


“The maximum number of simultaneously alive Erlang processes is by default 262,144. This limit can be configured at startup. For more information, see the +P command-line flag in the erl(1) manual page in ERTS.”

https://www.erlang.org/doc/efficiency_guide/advanced.html#:~....


Almost certainly a system configuration issue. I’m pretty sure the default process limit on BEAM is something like 200K. Additionally I don’t see mention of updating file handles or anything else on the operating system.

Maybe I’m cynical but this is what happens if you rely on chatGPT to give you programming answers without understanding the fundamentals first.


Unless I’m missing something in the writeup, it doesn’t mention how much RAM is on the system. A little odd to mention the CPU and the OS but not the RAM, for a benchmark that doesn’t care about execution speed but does care about memory usage. It makes figuring out that Elixir error of “system limit reached” a little harder to figure out.

edit: oh apparently it’s got nothing to do with memory, 1M just happens to be higher than the default process limit for Erlang and thus Elixir.


Also someone needs to tell ChatGPT that its Elixir code has a bug ;)


Lots of pedants like to dump on these sorts of articles but there are enough interesting things to see, even in a relatively trivial comparison like this.

It is interesting that some systems allocate more memory up-front but this doesn't grow much per-task.

It is interesting that, as the app was written, the Go program doesn't appear to scale as well. Sure, there are possibly reasons for it, so anyone with more knowledge of Go could attempt to write a more "correct" program and ask to rerun it.

It is interesting that with the defaults, things behave as they do, again, someone who cares more about "it's just that the defaults are different" is welcome to investigate that and ask "why" some program defaults might lead to better performance.

It is interesting that as-in most comparisons, the languages/frameworks don't neatly fit into Stereotypes.

It is interesting that most of them probably perform well enough for 99% of applications we might write.

It seems like most people still don't appreciate that YMMV and that trade-offs are always made. Just add it to your stash of brain juice that helps you become a better engineer and stop complaining.


Would be nice to see throughput tested at this level. There may be a good reason for Go to allocate more per thread if each thread is really doing something. Hat tip to Rust and C# tho!


I suspect 1 million concurrent tasks very likely become unmanageable in any real-world scenario anyway. 1KB per task gives 1GB of total memory consumption, which means 100KB per task would be enough to run into memory limits on most systems.

There may be scenarios such as network connection handling where those connections are kept idle 99% of the time, but you still need something to handle the very rare case of receiving something. However, is 1 million network connections manageable from a kernel point of view anyway? Wouldn't you reach system limits before that?


> However, is 1 million network connections manageable from a kernel point of view anyway? Wouldn't you reach system limits before that?

You can raise all those defaults as long as you have sufficient memory. Yes, you need to be working with fairly idle connections, otherwise you're likely to run out of CPU. Real world examples include Chat, IMAP, push messaging, that sort of thing where having a live connection means a significant reduction in latency vs periodically opening a connection to poll, but at the same time, there's not a whole lot going on.

I was at WhatsApp when we accidentally hit 2.8M connections on one machine [1]. I can't remember if we ever got to 3M, but we ended up doing a lot more work per connection so didn't keep those connection counts for long. Kernel wise, as long as you're mostly inbound connections, everything is well oiled; I ran into problems initiating many thousands of outbound connections/second in an HAProxy use case, although I think FreeBSD 13 may have improved that bottleneck. I hear Linux also works, but I've not scaled it nearly as far.

[1] http://www.erlang-factory.com/upload/presentations/558/efsf2...


Thanks for sharing. What’s your take on elixir consuming an outstanding amount of memory per task in OP benchmark ?


If you want to do well on this benchmark, rather than calling sleep, the sleeping processes should call erlang:start_timer(10000, self(), []) to schedule a timer and then call erlang:hibernate/3 to wait for the message without having a normal heap. The efficiency guide [1] says a process will start at 326 words of memory, and a word on a 64-bit system is 8 bytes, so rough math says 2.5k per process and 2.5GB for 1M processes, so the observed 4GB isn't that far off, comments on the blog page show someone else got the example to 2.7GB by eliminating extra steps in the program, and reducing initial heap size got it to 1.1GB. Hibernation should make it smaller, but I'm not going to check.

Note also, that BEAM uses several allocation arenas and is slow to release those back to the OS. It's important to measure memory from the OS perspective, but in a real application you'd also measure memory from inside BEAM. There are knobs you can tune if you need to reduce apparent OS memory use at the expense of additional cpu use during allocation. There's tradeoffs everywhere.

Also, yes, 4GB is a lot of memory, but once you start doing real work, you'll probably use even more. Luckily memory is not that expensive and capacities are getting bigger and bigger. We ran some servers with 768GB, but it looks like the current Epycs support 6TB per CPU socket; if everything scales, you could run a billion sleeping BEAM processes on that. :p I recognize this sounds a lot like an argument to accept bloat, but it's different because I'm saying it ;) Also, I think the benefits of BEAM outweigh its memory costs in general; although I've certainly had some fights --- binary:copy/1 before storing binaries into ets/mnesia can be really helpful!

[1] https://www.erlang.org/doc/efficiency_guide/processes.html


I like the general idea of the test. Things I would have liked to see (apart from basic system tweaks like allowing more file handles) are a performance comparison (how much CPU time is spent) and maybe a comparison between operating systems (this might highlight a lot of windows-linux differences, like Windows not committing memory - better have a large swap file to handle stacks of 100k real threads - or windows having a very different scheduler)


Surely a Sufficiently Smart Compiler could optimize all of these to only take a constant amount of memory (and less time).

I wonder if there are any correctness constraints that would actually prevent such optimizations. For example, the compiler might be required to execute the exact same system calls as an unoptimized program. It seems quite reasonable that the programs using system threads could not be optimized significantly under such a constraint.

I would expect the larger limit that prevents compilers from optimizing this is the large amount of code that defines the semantics of the lightweight threads. For example, I would expect the observable / required behaviors of tokio to not prevent a SCC from optimizing it, but the massive amount of Rust code that defines the lightweight thread semantics cannot be reasonably optimized.

Which programming language / framework is the closest to actually being optimized to like < 4 byte / task? I would make an uneducated guess of Elixir or Go.


Each BEAM process comes with its own stack / heap space. This uses a minimum of 326 words of memory (https://www.erlang.org/doc/efficiency_guide/processes.html#c...)


To the author: Elixir (well, Erlang's BEAM VM actually) can be configured to allow more than the default 256k tasks (262_144). I'd do that and post the numbers for 1M tasks.


The nodejs code doesn't actually work in parallel. The code being tested is only creating new timeouts but it is working on a single V8 thread.

It has a clock thread on the C++ part of the engine, and when it times out it pushes the continuations onto the event loop, and that is it.


Came here to say that, not sure if it actually would improve the test but at least it would be correct. Also, Node 12 was released 4 years ago.


Would be interesting to hear what problems were hit trying to run 100k threads. I remember testing out a million threads on a by-now older desktop and just had to do a couple of sysctl setting changes that were easy to find.

Not sure if this is what I used but the settings sound familiar: https://www.baeldung.com/linux/max-threads-per-process

Also it would be good to specify how the memory was measured (e.g. to be able to verify they are actually measuring physical memory usage).


For the sake of pitching an analog to Go's concurrency model: Hopac is a concurrency library for F# (partly written in C#) delivering an informal implementation of CSP (similar to Go's goroutines and Clojure core.async). Since the job does nothing but wait for 10 seconds, the code is:

    seq{1..1000000}
    |> Seq.Con.iterJobIgnore (fun _ -> timeOutMillis 10000)
    |> run
I'm not sure how to calculate total memory usage but instrumenting the following way:

    GC.Collect()
    GC.Collect()
    GC.WaitForPendingFinalizers()
    GC.TryStartNoGCRegion 3000000000L
    let before = GC.GetTotalMemory false
    seq{1..1000000}
    |> Seq.Con.iterJobIgnore (fun _ -> timeOutMillis 10000)
    |> run
    GC.GetTotalMemory false - before |> printfn "%d"
    GC.EndNoGCRegion()
Gives the following output; I'm reading 144MB for 1M lightweight threads (in the REPL), unless I did something dumb:

    144449024                                                                                                                                                                                      
    Real: 00:00:10.902, CPU: 00:00:18.570, GC gen0: 3, gen1: 3, gen2: 3                                                                                                                            
    val before: int64 = 53902280L


This is on .net core 7.0.5, Debian 11, Server GC (which is expected by Hopac). I removed all GC code and ran it as an executable using @compumike's `ps ax -o pid,rss,args` command. I think the only one that triggered a GC was the execution with 10M concurrent jobs.

    N           Real(s)    User(s)  RSS(kiB)  RSS(MiB)
    ==================================================
          1       10.08      00.04     42660        41
       1000       10.07      00.06     40736        39
      10000       10.09      00.20     44176        43
     100000       10.15      01.85     58532        57
    1000000       10.88      18.18    182024       177
   10000000       22.28    4m04.84   1452224      1418


It would be interesting to see how much the C# memory usage would go down if you allocated a single Task.Delay at the start and had all the tasks wait for it. It would probably result in it finishing in 10sec instead of climbing up to 12, while still running 1 million tasks. I'm not sure how you would do this in other runtimes, and there is still a risk of new problems cropping up.

1 million Task.Delays means 1 million queued waits in the sleep machinery, which could run into some bad scaling and overhead. But then 1 million tasks waiting on a single awaitable probably hits overhead in other subsystems. My guess would be that 1 million waits is much cheaper than 1 million sleeps, because sleeps have to be sorted by due time while the list of tasks waiting for completion just needs to be sequential. Entries in the sleep queue are also going to be bigger (at minimum, 4-8 bytes for due time + 4-8 bytes per completion callback) while entries in a completion queue would just be 4-8 bytes for the completion callback.

Task.WhenAll also has potentially worse behavior than the other apps if you care about memory, since it will need to allocate a big array of 1 million Tasks in order to pass it in. Many of the other programs instead loop over the tasks, which eliminates that overhead.


The other thing I noticed is that they are wrapping every Task.Delay(…) inside a Task.Run(…) call. That should not be necessary. The inner task will yield back to the caller when it hits the delay call anyway. There is no need to queue it with an immediate yield as if it is background or cpu-bound work.

The Run() call might be creating a more complex state machine and using more memory than necessary (2 tasks vs just 1 for each delay). The memory usage might be even lower if the Run() call is dropped.


Yup - this would do the job faster and better.

Task task = Task.Delay(TimeSpan.FromSeconds(1)); tasks.Add(task);


I was sufficiently curious and just went and tested this using BenchmarkDotNet. The example code is here (https://github.com/J-Bax/CSharpBenchmarkExamples).

The difference is quite significant.

(1) With the authors code (using Task.Run), I get ~428MB of allocations.

(2) Dropping the unnecessary Task.Run(...), I get ~183MB of allocations.

(3) Doing (2) and waiting N times on the same delay, I get ~39MB of allocations.

This was all using .NET 6 too. .NET 7 or the 8 preview might be even better since they are working hard on performance in recent releases.

So even looking at just (2), that puts .NET on par with the rust library.


I see C is missing. If you're going to do a language comparison, might as well include the one language we know will have the best performance and lowest memory


There is no light-weight threading concept in standard C/C++, and the cost of one OS thread is at least 1 MB (stack) + 4 KB (TLS) of reserved memory (at least by default, on modern Windows), so for 1M threads that will be 1 TB of RAM :)


C++ 20 has async/await (but of course spelled co_await and co_return, because C++ isn't the sort of language where you're allowed nice things, the beatings will continue until morale improves)

However AIUI C++ 20 doesn't actually supply an executor out of the box, so you would need to choose what to do here as for Rust, where they picked both tokio and async-std and you can see they have different performance.

C does only have threads, but you could presumably pull off the same trick in C that Rust does, to get Linux to give you the bare minimum actual resident memory for your thread and you needn't care about the notional "real" size of the thread since we're 64-bit and address space is basically free.


> (but of course spelled co_await and co_return, because C++ isn't the sort of language where you're allowed nice things, the beatings will continue until morale improves)

LMAO


co_<whatever> has become a meme in the C++ community.


Stack memory is allocated lazily on Linux. I thought this benchmark shows it pretty clearly. Otherwise 10k threads would already be in GB territory, when in fact they are below 100 MB.


How is your compiler addressing bits like this?


Since when is C the most performant? One can’t even write a performant generic vector with it.

C++ is the de facto language used for high performance.


"One can't even write a performant garbage-collected function in C++"

You're describing features, not performance.


So how do you write programs without data structures? You copy paste some implementation and modify it slightly at every use site?

Possible optimizations depend on the semantics of the language: C++ can do orders of magnitude more with, for example, a string data structure than is possible with a C "string". So no, C is not the most performant language in general; de facto, C++ is used for everything performance-critical.


> You copy paste some implementation and modify it slightly at every use site?

Yes, or use macros. Not saying it's the most convenient, but that's an orthogonal concern.

You wouldn't say hand-written assembly isn't performant, so your objection strikes me as wrong.


The maximum number of simultaneously alive Erlang processes is by default 262,144. This needs to be raised to avoid the `system_limit` error.


Is memory really the right resource bottleneck for this many tasks?


For a workload like websockets, where things are generally idle, this is a decent benchmark.


I get very different results for the 1M tasks than the author here, so it would be interesting to know how the consumed memory is measured.

My result is 565 MB for the 1M tasks, vs the author's 2.6 GB.

My method is described in [1], feel free to point out what I am missing here.

[1] https://gist.github.com/nilsmagnus/38b1b5f63237a2a6317942849...


Understanding full well that these are pretty artificial results, I'm surprised at just how well Node and Python fared against Go and Java. For all the default assumptions around languages that folks have, they've certainly held their own here.


As others have mentioned, the results come from the difference between async concurrency and "lightweight threads" (in Go and Java).

In the Node.js test, there's very little JavaScript code actually being executed. What's really being tested is Node.js's event loop (libuv), which is written in C. And it's being compared to Go's goroutine runtime (these days written mostly in Go and assembly). But, if I recall correctly, Go's threading model includes a preemptive scheduler (which is very complex) and it allocates a stack per goroutine.

This doesn't test whether JavaScript or Go code is faster / more efficient. There's almost no JavaScript or Go code being executed here at all. This is testing the execution engines of both languages. Go's execution engine has more features, and as we can see those features have a cost at runtime.


Some of these tests are doing concurrency. Some are doing parallelism.

Concurrent meaning multiple tasks run within a given stretch of time. Parallel meaning multiple tasks run at a given point of time.

These are not the same. So the tests are not testing the same thing.


> popular languages like Rust, Go, Java, C#, Python, Node.js and Elixir.

Oof, has C/C++ become that unpopular? It would be a useful comparison as a low-level systems language anyway.


In a C# program you'd generally want to call .ConfigureAwait(false) on the task you are about to await if the continuation does not need to be completed on a special context (like the UI thread).

P.S. the statement above is imprecise: https://devblogs.microsoft.com/dotnet/configureawait-faq/


Why no C++ version?


Because C/C++ would kick ass, exposing the rest of the hipster technologies as lame.


Pfft, doubtful. Rust, C and C++ all have very similar performance. And apparently a JavaScript framework is currently beating all of those languages in the composite score on TechEmpower, though that seems a bit weird to me:

https://www.techempower.com/benchmarks/#section=data-r21&tes...

C doesn't have a built-in, optimized async execution engine of any sort. C with threads would lose to any of the async runtimes here (including JavaScript).

C++ has async now. By all means, write up an async C++ example and compare it. I suspect a good C++ async executor would perform about as well as tokio, but it'd be nice to know for sure.

I'd also be interested to see how hard it is to write that C++ code. One nice thing about Rust and Go is that they perform very well with naively written code (i.e., the sort of thing ChatGPT, or most of your coworkers, would write). Naive C would use kernel threads, so if we're comparing average C code to average Go code on this test, the average Go code will end up faster.

(This isn't always true of Rust. I've seen plenty of Rust code which uses Box everywhere and runs crazy slow as a result.)


> And apparently a JavaScript framework is currently beating all of those languages in the composite score on TechEmpower, though that seems a bit weird to me:

Just-js was essentially built to run the TechEmpower benchmark. It's not a real stack you can use. An exercise in "what is possible".


They would also kick memory safety in the ass, unless you’re the kind of dev who doesn’t make those kinds of silly mistakes



`unsafe` doesn't mean that it's not safe, just that the compiler can't prove that it's safe so the human has to do it manually.


Are you trying to say that you can write, from scratch, code just as safe as or safer than one of the most peer-reviewed crates in Rust? Congrats, the world has been waiting for you; nonetheless, that's why commoners prefer to stick to what's safer for them.


Haskell imo is the most glaring omission. Famous for its green threads long before Go was around.


Rust is going to do well on toy examples. This is because, in a naive implementation, each async task's future is statically sized to hold the largest state any of its branches might need. On a toy example this will be quite small. In real code, some branches will have large futures that bloat the size of every task, even if the task never takes the bloated branch. You can mitigate this by heap allocating large futures on cold paths (see the sketch below), but that requires insight and tuning.

Some runtimes can handle this for you, but they come with other performance tradeoffs.
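A minimal sketch of the heap-allocation mitigation, with hypothetical hot_path/cold_path functions (the 16 KiB buffer is just an illustrative size): boxing the rarely-taken branch keeps its state out of every task's future.

    use std::mem::size_of_val;

    // The buffer is held across an await, so it becomes part of the
    // cold_path future's state.
    async fn cold_path() -> u8 {
        let buf = [0u8; 16 * 1024];
        tokio::task::yield_now().await;
        buf[0]
    }

    async fn hot_path() -> u8 {
        tokio::task::yield_now().await;
        0
    }

    async fn handle(rare: bool) -> u8 {
        if rare {
            // Box::pin moves the 16 KiB of cold-path state to the heap, so it no
            // longer inflates the size of every `handle` future.
            Box::pin(cold_path()).await
        } else {
            hot_path().await
        }
    }

    #[tokio::main]
    async fn main() {
        // Size of the handle future; exact numbers depend on the compiler's
        // generator layout, but the cold branch no longer dominates it.
        println!("handle future: {} bytes", size_of_val(&handle(false)));
    }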


In real code I've seen Rust beating Go- and Java-based code by 10x-50x in memory use. But that's too complex, with too many factors to analyze, so it's not material for a blog post. Actually this tiny benchmark favors GC-based runtimes because it just sleeps, so it doesn't actually allocate much on the heap.


The Java test should be optimized by giving the ArrayList an initial capacity. All the resizing of the huge backing array, plus the potentially large number of empty slots left in the final backing array, probably makes the Java result worse than it could be.

I wonder if the parallel stream API with a virtual thread pool would make it better or worse.


The ArrayList is just a growable array and is presumably using an exponential growth strategy, so it's equivalent to what several of the others are doing. You could indeed tell Java how big the ArrayList will grow in advance, but you could likewise do that with the Rust Vec by asking for Vec::with_capacity(num_tasks) instead of Vec::new() (see the sketch below).

Since this is amortized exponential growth it's likely making negligible difference, although it is good style to tell data structures what you know about their size early.

The Go approach doesn't appear to pay this penalty: it just increments a (presumably atomic?) counter, although I have no idea where these tasks actually "live" until they decrement the counter again. Most of the others are likewise using a growable array.
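For the Rust side, the pre-sizing suggestion is a one-line change. A rough sketch of a spawn loop in that shape (not the article's exact code):

    use tokio::time::{sleep, Duration};

    #[tokio::main]
    async fn main() {
        let num_tasks = 1_000_000;
        // Reserve room for all join handles up front instead of letting the Vec
        // grow and copy its buffer ~20 times on the way to 1M entries.
        let mut tasks = Vec::with_capacity(num_tasks);
        for _ in 0..num_tasks {
            tasks.push(tokio::spawn(sleep(Duration::from_secs(10))));
        }
        for task in tasks {
            task.await.unwrap();
        }
    }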


It looks like most of the languages are doing the same thing w.r.t. using a "List"-style implementation that resizes/copies its internal buffer when it runs out of space. One thing I wonder, though, is whether the managed languages pay a bigger memory-usage penalty for this than the unmanaged languages.

In the unmanaged languages, those intermediate buffers should be freed immediately during the re-size operation.

But in the managed languages, they might stick around until the next GC. So the peak memory would be higher.

In this example the runtime overhead is going to account for more of the memory usage anyway, but the list quickly stops being negligible. E.g. a list of 1M entries is going to be ~8 MB (64-bit addresses). So even if the array is resized from 1->2->4->8->...->~500K, the end result is not going to be worse than 2x that size.
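Putting rough numbers on that (assuming 8-byte references and a doubling growth strategy):

    final backing array:  ~1,000,000 * 8 B ≈ 8 MB
    discarded old arrays: (1 + 2 + 4 + ... + ~500K) * 8 B ≈ 8 MB
    worst-case peak:      ≈ 16 MB, i.e. no more than ~2x the final array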


Then this should be applied to all benchmarks, not just Java.


Ideally it's a git(hub) repository and people can add PRs to improve the outcome of their favorite languages.


C#'s list works the same way


Tokio seems oddly good compared to the others. Is that because it’s using a thread pool instead of creating something new for every task? I wonder how it would fare if the tasks were more intense, and how the time-to-completion compares between these approaches.


async Rust functions are compiled down to a finite state machine (a stackless coroutine, I think it can be called). I suspect that C#/.NET does something similar, though I am just speculating here. The difference between async-std and tokio would then be explained by tokio having lower per-task runtime overhead.


All the async and green-thread runtimes are using thread pools by definition; avoiding spawning as many OS threads as the program has tasks is their main job.

This benchmark wasn't very rigorous, so maybe it has overlooked something, but the result seems plausible. Tokio is a good, mature runtime. Tokio-based frameworks are near top of TechEmpower benchmarks.

Rust's Futures were designed from the start with efficiency in mind. Unlike most async runtimes where chaining of multiple async calls creates new heap allocations, Rust flattens the entire call tree and all await points into a single struct representing a state machine. This allows the runtime to use exactly one heap allocation per task, even if the task is pretty complex.
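A small illustration of that flattening, with made-up functions: every nested .await below is compiled into one struct whose size is known at compile time, and the only heap allocation happens in tokio::spawn, once per task.

    use std::mem::size_of_val;
    use tokio::time::{sleep, Duration};

    async fn step() {
        sleep(Duration::from_millis(1)).await;
    }

    async fn middle() {
        step().await;
        step().await;
    }

    async fn whole_task() {
        middle().await;
        middle().await;
    }

    #[tokio::main]
    async fn main() {
        // The entire call tree of `whole_task` is a single state machine;
        // the intermediate awaits allocate nothing.
        println!("task state machine: {} bytes", size_of_val(&whole_task()));
        tokio::spawn(whole_task()).await.unwrap(); // one allocation for the task
    }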


It's nice to see, but frankly speaking, with current memory prices all I care about is speed, not so much memory consumption, unless I have very specific requirements.


this is a terrible benchmark. You're literally doing nothing per task.

There is so much left up to the implementation details that you are not benching anything.


If those 'tasks' can be expressed as state machines, then I'd think log2(<number of states>) * 1e6 bits would be the lower limit.
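For example, if each task could be in one of 16 distinct states, that bound works out to:

    log2(16) = 4 bits per task
    4 bits * 1,000,000 tasks = 4,000,000 bits ≈ 500 kB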


Using thread pools/connection pools could help a lot, both in performance (less time spent creating threads) and in memory use (less fragmentation + less memory to garbage collect after short-lived allocations). There's a guide somewhere about using connection pools, multiprocessing and async I/O vs. pthreads, showing that the most obvious concurrency implementation does not necessarily have the best performance, but I couldn't find it right now.

EDIT: polls->pools typo


Let's understand that Go's runtime and scheduler do a lot, and there's not enough information here to justify such benchmarks.


What's the point of running 1M concurrent tasks? I don't see a good reason tbh.


website with that many requests


This is bound to problems - the slightest change in your stack (language, runtime, OS) would affect the results so much that it's completely pointless to rely on measurements taken now and apply them later. (That said, this is a perfectly valid approach when you are locked in, like in the embedded or game-console world, but that's rarely the case outside.)

For example - https://go.dev/doc/go1.3#stacks https://agis.io/post/contiguous-stacks-golang/

So your measurements in Go 1.2 would probably have differed in Go 1.3 (and this is just one example, along only one of the possible axes that have changed).

It's not like you own the data structure there to control it.


Not for a typical website. Even if you're getting a massive amount of traffic, like a million requests per second, you'd want each request to finish as quickly as possible, so they never pile up to 1M concurrent tasks. If the tasks have real work to do and can't finish quickly, then having 1M in-flight tasks compete for orders of magnitude fewer CPU cores may not be a good idea: it will waste lots of memory on their state and it will be hard to keep latencies reasonably even. You'd probably use a simpler queue and/or load-balance to more machines.

A scenario with this many tasks spawned is more realistic if you have mostly idle connections. For example, a massive chat server with tons of websockets open.


You might not have a choice. How should a load balancer be implemented? It has to keep those connections alive while the underlying services fulfill the request.


Also, I would quickly return something to say the website is busy and apply some kind of back pressure, since a process can only handle so much. But I have less experience there to say. Though my surprise while working with Java for web serving was that if the GC took more than 1% (or was it 3%?) of some time-sliced window, then the whole process was killed (that's how the team/company had it set up) - e.g. "better fail fast than serve slow" - since the latter can quickly affect your business if it relies on sub-100ms (~) response times.


The Go memory usage can probably be explained by the fact that Go uses fairly large stacks for goroutines (according to https://go.dev/doc/go1.2#stack_size it's 8 KB per goroutine, even though that's old).


If that were true, the results would be much worse. The average from my measurements is a bit above 2 kB. This website says it is 2 kB: https://tpaschalis.me/goroutines-size/


Stacks changed a lot in Go subsequently. I believe that in 1.2, segmented stacks were still being used. (They now are contiguous and copied on upsize.)


It also actually runs in parallel, unlike the Rust and JS ones.


The Rust threaded solution obviously produces actual threads, which will run "actually in parallel" if that is technically possible on your system.

The Rust tokio solution will use tokio's runtime which automatically spawns an appropriate amount of worker threads to parallelize your async tasks (by default it asks the OS how many CPUs there are and creates one such thread for each CPU).

The async-std implementation is roughly similar.
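For reference, that default is configurable; a small sketch using tokio's runtime builder (the worker count here is just an example value):

    fn main() {
        // new_multi_thread() defaults to one worker thread per CPU core;
        // worker_threads() overrides that.
        let rt = tokio::runtime::Builder::new_multi_thread()
            .worker_threads(4)
            .enable_all()
            .build()
            .expect("failed to build runtime");

        rt.block_on(async {
            let handles: Vec<_> = (0..8)
                .map(|i| tokio::spawn(async move { format!("task {i} done") }))
                .collect();
            for h in handles {
                println!("{}", h.await.unwrap());
            }
        });
    }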


The Rust version is running in parallel. Tokio's runtime is parallel by default.


I wonder how Ruby performs in this benchmark


World's largest compact car. Yawn.


> Notably, at 1 million tasks, I observed that the overhead of launching tasks became evident, and most programs required more than 12 seconds to complete.

I didn't quite get this part. Even if your runtime has no overhead, the fastest you can execute 1 million 10-second tasks is 10 * 1 million / num_cores seconds, right? Why would these million tasks only take 12 sec?


You can “run” more threads than you have cores when they are all inside sleep()


Because the tasks are all running concurrently. (They're all asleep at the same time, and a sleeping task doesn't occupy a core.)
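A tiny sketch of why the wall clock stays close to the sleep duration (scaled down to 100k tasks; the numbers are illustrative):

    use tokio::time::{sleep, Duration, Instant};

    #[tokio::main]
    async fn main() {
        let start = Instant::now();
        // 100k ten-second sleeps overlap instead of queuing for CPU time, so the
        // elapsed time is ~10s plus spawn/scheduling overhead,
        // not 100_000 * 10s / num_cores.
        let handles: Vec<_> = (0..100_000)
            .map(|_| tokio::spawn(sleep(Duration::from_secs(10))))
            .collect();
        for h in handles {
            h.await.unwrap();
        }
        println!("elapsed: {:?}", start.elapsed());
    }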



