Costs of multi-process architectures (aseigo.blogspot.com)
59 points by cronaldo on July 4, 2014 | 48 comments


My TL;DR: Multi-process architectures may suck on machines where you don't have enough CPUs/cores for your workload, or where the design of your software requires a lot more processes than common sense would dictate.


My constant reluctance about things like goroutines/coroutines is that if they truly get executed in parallel, there is little to no way to manage their machine load as one would in a distributed pipeline.


The Go runtime lets you specify how many OS threads execute goroutines simultaneously (GOMAXPROCS). It schedules goroutines one after the other onto that number of threads. If you allow only 1, you have no parallelism, just concurrency.
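As a rough illustration (my own sketch, not from the article): with GOMAXPROCS forced to 1, the goroutines below interleave on a single OS thread; raise it (or leave the default, which on current Go is the number of CPUs) and the same goroutines can run on separate cores.

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    func main() {
        // Cap the runtime at one OS thread executing Go code:
        // goroutines remain concurrent but never run in parallel.
        runtime.GOMAXPROCS(1)

        var wg sync.WaitGroup
        for i := 0; i < 4; i++ {
            wg.Add(1)
            go func(id int) {
                defer wg.Done()
                sum := 0
                for j := 0; j < 10000000; j++ {
                    sum += j
                }
                fmt.Println("worker", id, "done, sum =", sum)
            }(i)
        }
        wg.Wait()
    }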


Does it have a priority system? Or a way to set up pools? I often find I want to run some part of a program (some of its tasks) on a set number of threads at a certain priority, not throw tasks into some global pool.


Oh come on! How can you say that? Of course there's a cost to multi-process architecture, but it doesn't suck at all! OK, you shouldn't just spawn new processes whenever you need more tasks running concurrently, but how could a modern OS exist without multiprocessing? What would you have us do, go back to MS-DOS? :-)


If you read his article - mainly message passing and consequent context switching.

Most programs with protected memory - that's anything newer than MS-DOS in the Windows world, including Windows 95, 98 and ME - don't need too much message passing.

Stuff like webservers do (hence the perf advantage of nginx over Apache, until they brought in the event MPM). Heck, even web browsers do quite a bit despite his point - they do what they can to send as few messages as possible to compensate.


Yeah, actually I agree with everything in the article but the title. The article does too :-)


I think the core conflict is synchronization overhead vs latency, no? If communication is too granular there's too much synchronization overhead, which can be fixed by buffering, which in turn increases latency. Game engines have been swinging back and forth on this a lot over the past 10 years. First the trend was to have a few "fat threads" (maybe input+gamelogic->physics->visibility->rendering) which ran in parallel but were coupled like a pipeline working on the previous frame's data. Each pipeline stage meant one more frame of latency. Add the rendering API/driver latency, plus whatever the display device adds, and suddenly games had something like 100ms of latency or even more, which is very noticeable. Then people started to make the game loop a simple sequence of subsystem stages again, but subsystems split their work on the current frame internally into small parallel tasks. It will be interesting to see what the perfect game engine architecture looks like for VR, with its ultra-low latency requirements from sensory input to display update.


I could have sworn that Aaron had mentioned that this story was a counterpart to his recent "Multi-process architectures rock! :)" post that all of you defending multi-process architectures apparently didn't find. It's worth a read as well since many of the arguments you've made, Aaron already made.


Erlang does quite nicely as a multi-process architecture. Low context switch overhead, low per-process overhead, built for redundancy and fault-tolerance (implying multiple execution units), tool support. There's more than one way to do it - all the world's not Unix.


These are all problems that have easy solutions - chuck more CPUs at it.

Back in the real world though, I would say the harder problem is still not these; it's the increased operational complexity. Support and maintenance costs go up.

I think the problems are somewhat overplayed here too. We've had zero-copy network stacks for a while. Stack sizes are not a concern, because scheduling becomes a bigger problem faster than using small chunks of memory does as you scale.

Edit: forgot to mention isolcpus or pinning as one method to address the context switching - of course, frequently we can make a greater impact by careful consideration of the syscalls we make from our code.


> These are all problems that have easy solutions - chuck more CPUs at it.

So basically you're telling us that if your application that uses bubble sort everywhere is too slow, we just have to "chuck more CPUs at it"? No thanks.

That line of reasoning makes me angry twice: as an embedded systems engineer (for which that kind of "solution" is simply not an option) and as a user (when I see applications take more and more megabytes and CPU cycles without proportionally increasing functionality).


This is the way of the world.

Don't get angry about it though. Instead, make it easier for stakeholders/customers to choose to do "the right thing". Be transparent and accurate with time-to-fix estimates. That's a hard problem; I don't even think it's perfectly solvable, but I try hard anyway.

The uncomfortable truth: sometimes "the right thing" is to throw more CPU at it. Being late to the consumer electronics market, for example, even if your product is bug-free and perfectly designed, means you lost a whole chunk of available cash to the guy who got there first... With that money, he may be able to make a v2 that completely trumps yours.


Can't agree with the points. It all depends on the use case. In some cases it is even beneficial to remove the OS itself. I am not talking about high-level OSes like Linux, QNX or VxWorks; even MicroC/OS-II is a big overhead for some systems.

Consider how your car decides to deploy the airbags. Do you want a message queue? No; as soon as the hardware inputs meet the condition, the airbags need to be deployed. On the other hand, the same car will have an infotainment system running VxWorks/QNX/WinCE with a multi-process architecture. Most of them even have a separate processor to interface with the vehicle CAN bus and handle power management. Inside the application processor, graphics and the HMI will be distributed across some processes, with the low-level drivers and codecs in another set of processes. The whole thing will just give the user a media player, a map and a phone interface. Some OEMs (e.g. Daimler, Ford) even distribute this whole functionality across different H/W modules.

Dividing gives you maintainability, re-usability, and drop-in replacement alternatives. In most cases it shortens the engineering time, improves quality and reduces product recalls.

The above automotive embedded example is just one use case. There are many areas where you want to distribute your application in many ways.

Finally, I want to ask one small question: when you turn off the reading light in a passenger aircraft, how many processes do you want the switch-off signal to go through before the light turns off? And why?


I'm guessing the study from 2007 is a bit stale now. Intel/AMD/... have almost certainly been trying to decrease the penalty for context switches. I'm curious how much they've changed over time.


I'm not an OS guy, but most aspects of context switching are implemented in software. It seems to me the only places where hardware could help would be faster memory access (which is more of a general optimization!) and the TLB. Awkwardly though, x86 doesn't seem to have a TLB insert instruction, so you just have to miss...?

Oh, and you could shorten the pipeline to make flushes faster and the penalty smaller. But pipeline length has stayed fairly static.


It's interesting that x86 only seems to have TLB tagging for VM guest/host switching usage. Regular context switches require TLB flushing, which (reading the Linux kernel) either flush the entire TLB or a specific set of entries.

I'm surprised no (more) tagging is used here for regular switches. Just not worth it?

Edit: A colleague pointed me to this: http://www.google.com/patents/US6510508 which is used by recent AMD CPUs. Does anyone know of a resource that centralizes more info like this?


Most x86 CPUs from the 386 onward have hardware support for context switching. Generally a hardware solution is superior to a software solution speed-wise, but they're not complete, and therefore mainstream operating systems are not using them.


My understanding is the old P4 architecture had an exceptionally long pipeline (31 stages in Prescott?); would those have been the primary desktop CPU in 2007?


So they just discovered what multicore programming was like back in the day when the OS could only juggle processes.

Worse, I see no mention of modern micro-kernel OSes, many of them used commercially, where the performance is actually quite good.


I wonder if anyone has recently measured the time it takes to do a context switch under Linux on a modern x86-64 processor. Something like the numbers in the table here:

http://blog.codinghorror.com/the-infinite-space-between-word...
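Not a rigorous benchmark, but you can get a ballpark number yourself with a pipe ping-pong between two processes; each round trip forces at least two switches, so the figure below is an upper bound that also includes pipe syscall overhead. For a cleaner number you'd pin both processes to the same core (e.g. taskset -c 0). A quick Go sketch of the idea (my own, assumptions and all):

    package main

    import (
        "fmt"
        "os"
        "os/exec"
        "time"
    )

    const rounds = 100000

    func child() {
        // fds 3 and 4 are handed down by the parent via ExtraFiles.
        in := os.NewFile(3, "ping")
        out := os.NewFile(4, "pong")
        buf := make([]byte, 1)
        for {
            if _, err := in.Read(buf); err != nil {
                return // parent closed its end; we're done
            }
            out.Write(buf)
        }
    }

    func main() {
        if len(os.Args) > 1 && os.Args[1] == "child" {
            child()
            return
        }

        pingR, pingW, _ := os.Pipe() // parent -> child
        pongR, pongW, _ := os.Pipe() // child -> parent

        cmd := exec.Command(os.Args[0], "child")
        cmd.ExtraFiles = []*os.File{pingR, pongW} // become fds 3 and 4 in the child
        if err := cmd.Start(); err != nil {
            panic(err)
        }
        pingR.Close()
        pongW.Close()

        buf := make([]byte, 1)
        start := time.Now()
        for i := 0; i < rounds; i++ {
            pingW.Write(buf)
            pongR.Read(buf)
        }
        elapsed := time.Since(start)

        pingW.Close()
        cmd.Wait()

        // Two hand-offs per round, so divide by 2*rounds.
        fmt.Printf("~%d ns per switch (upper bound, includes pipe overhead)\n",
            elapsed.Nanoseconds()/int64(2*rounds))
    }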


Why not just pin processes to cores, e.g. sched_setaffinity and friends on Linux? That works, right?


Apparently that doesn't work on Android. At GDC this year I was in a hallway huddle of high-end mobile game devs complaining about how wide parallelism doesn't work on Android, because the scheduler stops listening to your affinity requests and just schedules everything serially on one core so that it can save power on the other three.


I've had huge speedups from poorly-threaded software (thousands of threads, all sharing locks and work) just by running multiple processes and pinning them to specific cores. It was also far more stable that way.
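For reference, this is roughly what the pinning end of that looks like on Linux; a minimal Go sketch using golang.org/x/sys/unix (launching each worker under "taskset -c N" gets you the same effect without touching the code):

    package main

    import (
        "fmt"
        "os"
        "runtime"
        "strconv"

        "golang.org/x/sys/unix"
    )

    func main() {
        // Pin the calling thread to the CPU given on the command line,
        // e.g. running "./worker 3" pins it to CPU 3.
        cpu, err := strconv.Atoi(os.Args[1])
        if err != nil {
            panic(err)
        }

        runtime.LockOSThread() // keep this goroutine on the pinned OS thread

        var set unix.CPUSet
        set.Zero()
        set.Set(cpu)

        // pid 0 means "the calling thread"; OS threads and child processes
        // created afterwards inherit the affinity mask.
        if err := unix.SchedSetaffinity(0, &set); err != nil {
            panic(err)
        }

        fmt.Printf("pid %d pinned to CPU %d, doing work...\n", os.Getpid(), cpu)
        // ... the actual (previously thread-happy) workload goes here ...
    }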


I don't know why the author called this "Multi-process architectures suck :(" when he really meant "I suck at multi-process architectures :(". Look at what he lists:

- "Context Switching"

You can find 4 cores in a trivial computer these days (8 cores with hyperthreading). This means you can have 8 processes without context switching at all, and also suggests that if you don't use multiple processes, you can only reach about 10-20% capacity on a multi-core machine.

- "Per-process overhead"

It's true it has overhead, but you don't have to create thousands of processes; we have lightweight concurrency patterns to use within each thread/process.

But even then, you don't have to run more than one primary process per core. You have multiple cores, and they can't work together on a single process.

- "We built for a single-core world" and "We lack the tools"

Those are outright PEBKAC errors, and not architectural problems. We're way past the stage when multi-process architectures were a hardly understood, confusing problem. We have the right tools, for those who look for them.

Cache misses and overly eager context switching are bad, and you can read a lot about this by Martin Thompson on his Mechanical Sympathy blog: http://mechanical-sympathy.blogspot.com/ There are ways to design multi-process systems to take best advantage of all cores, and your machine caches.

But even Thompson's Disruptor architecture is based around message passing and multiple processes. Because, again, in a multi-core world, suggesting anything else is laughable.

Plus, forget computers, multitasking is a fact of life. Animals do it, humans do it, so do computers. We have to check email and talk on the phone at the same time sometimes. We need to walk and chew gum. We constantly argue and write about how much we should multitask and how much we should focus on a single task, because people have the same context switching overhead and so on. Well for both people and computers, there's a balance, and either extreme is counter-productive. It's as simple as that.


> [...] if you don't use multiple processes, you can only reach about 10-20% capacity on a multi-core machine.

No, you can use threads.

> Plus, forget computers, multitasking is a fact of life. Animals do it, humans do it, so do computers. [...]

That's a really bad comparison, and isn't applicable here at all. Computers are very different from humans or animals, both in how they function and in how they're "programmed". Well, at least it wasn't a car analogy.


> No, you can use threads.

I thought that was assumed in the sentence I wrote, as threads are even harder to deal with properly due to their shared memory space. But they still do context switch, have cache misses, require complex (way more complex) synchronization and so on.

If you can get away with separate processes, you're way better off. Threads are a last resort and come with a "use with caution" label.

> That's a really bad comparison, and isn't applicable here at all. Computers are very different from humans or animals, both in how they function and in how they're "programmed". Well, at least it wasn't a car analogy.

No, I didn't just use a random analogy. How much do you know about how our brain works, anyway?

Did you know our memory types are stacked like CPU caches, RAM and disk, going from fast short-term to slow long-term memory? Interesting, right? We literally get "cache misses" when context switching, just like our computer brothers.

And as we keep evolving computer designs, they keep creeping closer and closer to how our brain works and the tradeoffs are startlingly similar. I was very deliberate in my comparison. Think about it like dolphins adopting a quite similar body design and locomotion as fish, despite their completely different origins.

Next step is the always-around-the-corner clockless CPU design. It'll come, eventually (either that, or more likely we'll have hundreds of tiny, simple, specialized, locally clocked cores with some memory on-board).


We have things like caches, true. But I'm not aware of neurology having identified anything remotely resembling multiple cores or hyperthreading in the human or animal brain.

And saying that "there's a balance to be struck" in human multitasking manages to be both trite and wrong; the evidence is not that some people multitask too much and some people multitask too little. Multitasking makes you more stressed and less effective; the overwhelming majority of people should be trying to reduce or eliminate it. Which says very little about how much multitasking our CPUs should be doing.


> But I'm not aware of neurology having identified anything remotely resembling multiple cores or hyperthreading in the human or animal brain.

Um, you do realize that the human brain has ten to a hundred billion neurons that all run in parallel, right?


You mean like all the transistors in a single-core processor run in parallel?


Transistors are much less complex than neurons; it takes multiple transistors even to implement a single logic gate, and the input-output function for a single neuron is much more complex than a single logic gate. (Some neuroscientists might say that processor cores are less complex than neurons, though I'm not sure I would go that far. I would only say that it would take a small number of neurons--much smaller than the number in the brain--to be equivalent to a processor core.)


"But I'm not aware of neurology having identified anything remotely resembling multiple cores or hyperthreading in the human or animal brain."

And yet here you sit, typing and breathing and seeing and thinking about what you're having for dinner tonight while your digestive system continues to work on lunch and your heart continues to beat.


Sounds more like a bunch of specialized hardware controllers than several general-purpose cores. If you want to do something as simple as counting while thinking about something else, the only way is to (ab)use your visual or auditory processors.


> I thought that was assumed in the sentence I wrote

You might have assumed it, but your post didn't mention or even imply it.

> Computers keep creeping closer and closer to how our brain works and the tradeoffs are startlingly similar.

This definitely needs an explanation. In what way are computers and brains "creeping closer and closer"?

> Think about it like dolphins adopting a quite similar body design and locomotion as fish, despite their completely different origins.

This is another bad analogy. CPUs and human brains do not live in the same environment, and CPUs are not subjected to evolution (they could be considered instances of Intelligent Design, though ;-) ).


Seriously, you're going to nitpick the way I used the word "evolution" now? I'm done.


> You can find 4 cores in a trivial computer these days (8 cores with hyperthreading). This means you can have 8 processes without context switching at all

The story changes if those processes have to communicate.

> Cache misses and overly eager context switching are bad, ... There are ways to design multi-process systems to take best advantage of all cores, and your machine caches.

Yeah, but frankly, the situation sucks! It's still quite difficult to detect such problems with certainty, and the solutions often look like hacks.

> Plus, forget computers, multitasking is a fact of life... We have to check email and talk on the phone at the same time sometimes

This kind of misses the point. We can do efficient multitasking of largely independent processes, and have been doing that for a long while. It's using multiple cores on heavily interrelated and interconnected processes where things get quite hairy. This is where current architectures suck.


> You can find 4 cores in a trivial computer these days (8 cores with hyperthreading). This means you can have 8 processes without context switching at all, and also suggests that if you don't use multiple processes, you can only reach about 10-20% capacity on a multi-core machine.

In a typical user machine there can be useful independent single-threaded processes running, all serving the user's needs (and thus using the machine's computational capacity). For servers that have to do only one thing, on the other hand - yes, it's a problem that requires actual engineering.


I agree with all your arguments. I even have a project to back you up (it's Node.js multi-process based): https://github.com/topcloud/socketcluster - It scales linearly. I just ran a benchmark on a 16-core machine and was able to reach 126k concurrent 'active' virtual users sending messages every 6 seconds. To put it in perspective, I was only able to reach 55k concurrent users on an equivalent 8-core machine using the same benchmark test.


"You can find 4 cores in a trivial computer these days (8 cores with hyperthreading)."

There are still plenty of tablets and phones out there which don't. (And the author is targeting those devices with the software he's writing.)

"It's true it has overhead, but you don't have to create thousands of processes,"

Ah, but that might be a natural consequence of your design. I think part of the problem is that we're currently in the middle of a really awkward transition period in computer architectures.

As you point out, almost no computers are single core nowadays. So having a single-process application is obviously not taking advantage of all the computer power you have at your disposal. It's blatantly sub-optimal.

However, if you're going to design a multi-process application framework, you probably want to design it to kick off new processes whenever you're about to do something non-trivial that could introduce significant latency into the mix. But depending on what the user/application is actually doing, that might potentially end up starting dozens, or hundreds, or even thousands of processes.

And our computer architecture is not yet at the point where we have "enough" cores in our CPUs that we can just do that and have it work. We're getting there, but it's likely a decade or maybe two away.[0]

So, we want to create a multi-process architecture to stop wasting the computing power that exists; but we have to be careful and write extra code to manage the creation of these processes, because we can't yet afford to create them "on a whim".

It seems similar in some ways to segmented 16-bit memory models. In an early flat 16-bit memory model, you were very constrained (64k) but at least the environment you were working in was simple. The move to a segmented 16-bit model was in some ways a lot less constrained, but taking advantage of it meant dealing with a bunch of extra complexity. (boo!) It was the next step to flat 32-bit systems which made working with memory painless and simple again, while further lessening the constraints.

When low-end phones and tablets have >=256 cores, then we'll be able to take advantage of multi-process frameworks properly.

It occurs to me - having seen Alan Kay's talk a few weeks ago, being fascinated by the idea of "spending money to get ahead of Moore's law" to put together a computer that would let you write the sort of software that takes advantage of computers a decade from now, but not being able to figure out what that meant if you wanted to do it now - that putting together a >=256 core system might be a good start.


I doubt tablets will ever need 256 cores. Desktop computers don't need that many either if you notice that many of them just run a browser and Excel. It's really when you do specific stuff like, for instance, gaming that you really need more horsepower. 2-4 cores are probably enough until the end of the decade at least: they allow many people to run 2 or 3 programs smoothly. A corollary is that for a program to try to use every core may not be such a good idea in the big picture.

Furthermore, if one observes the hardware evolution of the PC, one notices that it took the direction of heterogeneous multicore architecture (CPU, GPU, etc.) rather than the direction of a homogeneous multicore architecture: there are more "cores" of different types in your PC outside your i7 than inside. The same goes for tablets and phones, which typically feature an ARM design with CPU, GPU and DSP integrated into one chip for footprint and power-consumption reasons. Architectures seem to be evolving into modular hardware, featuring a base CPU backbone to which specialized chips are added. There are a gazillion ARM-based designs, depending on which set of functions you need.

This makes sense for the software and consumer electronics industry. Switching to completely different solutions like many-core chips (Mill CPU, GreenArrays) would have a huge cost. Picking a common, general-purpose "few-cores" CPU and adding in specialized chips as needed is much more affordable.


"I doubt tablets will ever need 256 cores. Desktop computers don't need that much either"

And 640k will be enough for everyone.

Hey, you might even be right about not needing them, but that doesn't mean those devices won't get them. You don't really need a 32-bit CPU to run your microwave or washing machine, but very few of the ones you buy today are running 8-bit microcontrollers with hand-coded assembler.

Though clock speeds stopped increasing some time ago, Moore's law marches on, and we're still getting more and more transistors per buck. Sure, some of them will go to more L1/L2 cache, but at some point the bandwidth to flush them to main memory becomes a bottleneck, so I think we're going to see more and more cores/chip. As that marches on, I think it's going to trickle down to even the low end of CPUs. For instance, you might even be able to buy a bunch of 64-core chips which have a bunch of cores disabled really cheap, because the disabled cores failed factory testing and were switched off. It can't be sold for full price any more, but that doesn't mean it's useless.

And once your $5 CPUs have 32 cores, well, providing the tooling is there, you might as well use a framework that makes use of them.

"many of them just run a browser and Excel."

Bad choice of examples there, I think. Browsers can be pretty multi-process heavy these days, with one main process, plus one per tab, plus extra processes for e.g. decoding streaming video in the tab, or running your (spit) EME plugins (spit) in a sandbox.

Similarly, with Excel, sure most of the time it doesn't do much. But, if you've got a spreadsheet with a bunch of dependent cells/formulas, then if someone updates the right cell, being massively parallel could really speed up value recalculations and propagation throughout the sheet. Some spreadsheets translate really well to map/reduce, and using all the cores could really help there.


"Ah, but that might be a natural consequence of your design."

If a natural consequence of your design is a need to do something that fits the target environment so poorly, then it's a bad design & should be rethought. Software isn't a Platonic ideal, it needs to account for these things if it's to run well.


> However, if you're going to design a multi-process application framework, you probably want to design it to kick off new processes whenever you're about to do something non-trivial that could introduce significant latency into the mix. But depending on what the user/application user is actually doing, that might potentially end up starting dozens, or hundreds, or even thousands of processes.

This is exactly the problem process/thread pools solve. Most if not all modern programming languages include such thread pools either in their stdlib or as battle-tested external libraries. Do you have a specific complaint here?

The really hard part of multi-(thread/machine)process architectures is synchronization, not resource allocation.
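For what it's worth, a toy sketch of that pool shape in Go (mine, not anyone's production code): a fixed set of workers, sized once, draining a shared queue, rather than one execution unit per task.

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    // process stands in for whatever non-trivial, latency-prone work the
    // framework would otherwise have spawned a fresh process for.
    func process(task int) int {
        return task * task
    }

    func main() {
        tasks := make(chan int)
        results := make(chan int)

        // One worker per core: the pool is sized up front instead of
        // creating dozens/hundreds/thousands of execution units on demand.
        var wg sync.WaitGroup
        for w := 0; w < runtime.NumCPU(); w++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for t := range tasks {
                    results <- process(t)
                }
            }()
        }

        // Feed the queue, then shut everything down in order.
        go func() {
            for i := 0; i < 100; i++ {
                tasks <- i
            }
            close(tasks)
            wg.Wait()
            close(results)
        }()

        sum := 0
        for r := range results {
            sum += r
        }
        fmt.Println("sum of squares:", sum)
    }

The synchronization is trivial here precisely because the tasks are independent, which is the easy case; as soon as the workers share state, you're back at the hard part.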


Enlisting more cores in order to get something done faster - you know, by splitting the work - is indeed a serious pain.

I recently wanted to get lpsolve to split an integer-programming branch-and-bound problem across multiple cores, and then get a large on-demand AWS instance to deal with it.

The branch-and-bound algorithm is eminently parallelizable. So it should have been possible.

I came to the conclusion, however, that I would have to rewrite lpsolve for that. That program sticks to one process and there is no way to get it to fork other processes and read back the results.
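Short of rewriting lpsolve itself, the usual workaround is a small driver that farms subproblems out to independent solver processes and reads their output back. A hedged Go sketch: the subproblem files and the idea of solving branches as separate models are purely illustrative, and "lp_solve" here is the stock command-line front end.

    package main

    import (
        "fmt"
        "os/exec"
        "sync"
    )

    func main() {
        // Hypothetical: each file is one branch of the search, written
        // out as its own model by an earlier splitting step.
        subproblems := []string{"branch0.lp", "branch1.lp", "branch2.lp", "branch3.lp"}
        outputs := make([]string, len(subproblems))

        var wg sync.WaitGroup
        for i, model := range subproblems {
            wg.Add(1)
            go func(i int, model string) {
                defer wg.Done()
                // One independent solver process per subproblem; the OS
                // spreads them across cores (or across machines, with a
                // little more plumbing).
                out, err := exec.Command("lp_solve", model).CombinedOutput()
                if err != nil {
                    outputs[i] = fmt.Sprintf("%s failed: %v", model, err)
                    return
                }
                outputs[i] = string(out)
            }(i, model)
        }
        wg.Wait()

        // A real driver would parse the objective values and keep the
        // best bound; here we just dump what came back.
        for i, out := range outputs {
            fmt.Printf("--- %s ---\n%s\n", subproblems[i], out)
        }
    }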


The tendency is for things like lpsolve to be written single-process because, typically, when you need to do it once you need to do it one thousand times, and then your distribution of work ends up using each core available to you for a single lpsolve instance.

1000 lpsolve invocations, each running on a single core, are going to finish faster than the same number of lpsolve invocations each running on 10 cores.


Using multiple processes for this is a big leap, since you have to plan your data-sharing scheme. Assuming the program is currently single-threaded, dropping some OpenMP on it in the right places would be an easier path to using all of your cores.


I get nothing when viewing the site, just a blank page with a list of URLs generated by NoScript. Not a shred of content to be seen without allowing Javascript. It has the trappings of a malicious page.


Sorry about the unconstructive comment, folks. Does anyone have an URL that just serves up HTML with the text of the article? Thanks.




