Precisely: if we are talking about low latency, why would we let communications go from the application (in user-land) to the kernel, then through the network stack, then be received by the kernel on another node, and only then be delivered to the application in user-land there? As a first guess, I would imagine that bypassing the Linux kernel on both nodes and accessing the remote hardware directly would be mandatory for the lowest latency.

If there is some information online about the software stack/architecture of the entire system, I would like to read up on it. I haven't explored all the links I posted above yet.

I'm nowhere near an expert, and HPC is a really specific use case, but there are surely interesting bits to learn from it.