Tidalscale, a hypervisor that allows a single OS to run across multiple servers (tidalscale.com)
63 points by corv on Feb 14, 2016 | hide | past | favorite | 25 comments


So.

It's easy to dismiss this: the fallacies of network programming, etc etc.

But something happened in the last 10 years that many people haven't really thought through: the network became faster than (most spinning) disks. Reading 1 MB from a spinning disk is now roughly as slow as reading 1 MB over the network within the same data center.

That makes you reconsider exactly what a "computer" is. A "supercomputer" is little more than multiple traditional computers with very heavily optimized network interconnects. Now that everyone's interconnects are nearly as fast, why should we consider a cluster of computers any different?

Yes, SSDs change this (especially M.2!). But view the SSD -> network relationship the way we used to view the RAM -> disk relationship.
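A back-of-envelope sketch of the disk-vs-network claim, using commonly cited ballpark latency figures (the constants below are rough assumptions, not measurements):

```python
# Rough, commonly cited ballpark figures (assumptions, not measurements).
DISK_SEEK_MS = 10.0        # one spinning-disk seek
DISK_READ_1MB_MS = 20.0    # read 1 MB sequentially from a spinning disk
NET_READ_1MB_MS = 10.0     # transfer 1 MB over a ~1 Gbps datacenter link
DC_ROUND_TRIP_MS = 0.5     # round trip within the same datacenter

# Fetching 1 MB from a remote machine: one round trip plus the transfer.
remote_ms = DC_ROUND_TRIP_MS + NET_READ_1MB_MS
# Fetching 1 MB from local spinning disk: one seek plus the sequential read.
local_disk_ms = DISK_SEEK_MS + DISK_READ_1MB_MS

print(f"remote read: {remote_ms} ms, local disk read: {local_disk_ms} ms")
```

With these figures the network fetch lands in the same ballpark as (here, ahead of) the local spinning-disk read, which is the point of the comment above.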

I think it's worth rethinking some of the traditional boundaries of what "a computer" is.


True; however, this is competing with supercomputers, which are quite a special case.

It's difficult to define a supercomputer, but essentially a defining factor is being able to read data from memory (not spinning disk or SSD) across the network at nearly the same speed as if it were local. This is typically done with specialised networking (like InfiniBand) that can reach nearly 100 Gb/s, something we are still quite far from on 'commodity hardware'.

I think this company must be doing something very different, or must have made some sort of breakthrough, to do what they're claiming. I don't think faster networks and SSDs are what enabled this, at least not for the most part.


Not a fan of "magic" scaling across multiple servers. I'm not sure I would get enough control to squeeze the most performance out of this "OS" (edit: in terms of data locality for "big computation").

I am using Terracotta for a high performance JVM memory heap across a cluster. I need control and performance.

I wonder how this compares...

Personally, I'd rather stitch my cluster together myself than rely on a monolithic "OS" hypervisor.


This video from the Bay Area FreeBSD User Group, June 2015, does a decent job of explaining how TidalScale implements their hypervisor. Spoiler: it's based on FreeBSD and bhyve.

https://www.youtube.com/watch?v=f-ug6B6ycng


The About page is pretty impressive. I understand how memory and storage can be clustered this way; I guess it's just like virtual memory, with another layer of indirection via their hypervisor. What I don't understand is how you can partition computation the same way and not have performance tank.


Answer: you can't.

Something like MOSIX (OS-level ability to ship processes, their IO, etc. across the network), being less indirect, should theoretically outperform this sort of thing, but people don't deploy those very often either.

Writing apps that decompose explicitly into distributed pieces is almost always harder, but not so much harder that it isn't worth doing.

Implicitly, you take on risks: either your whole VM dies if any physical member dies, or you eat the performance penalty of some kind of application-naive consensus, or some other hideous compromise. There are no good answers, and you get what you pay for.


It's computing's RAID 0, with worse performance and worse reliability.

With traditional enterprise hypervisors, boxes are sized for least TCO and to be large enough to comfortably hold the largest VM... and then the farm is usually sized with at least N+k (k >= 1) capacity, to avoid having to kill off VMs to perform maintenance or being trapped without spare capacity for new VMs. Critical VMs use lock-stepped mirroring on multiple physical boxes to avoid downtime from hardware failure. Furthermore, most enterprise hypervisors allow migrating both storage (disk images) and compute (the running VM's state) to different hardware (hardware failure notwithstanding).

If folks are trying to vertically scale one app or one system to become a giant box without using explicit parallelism, it's going to be slow and painful.

(As a teenager, I once ran an actual Fortran nuclear reactor simulator on a Windows beige box, and later played with making OFED code performant over InfiniBand... don't ever inherit two brands of gear and expect them to work together.)


How does this compare to the clustering capabilities of VMS? If I remember correctly, a VMS cluster can make several physical computers behave like one logical computer with multiple CPUs and disk packs.


VMS did not share memory across the cluster.


Hmm. Proprietary "magic".

"Come, base your entire infrastructure on our hypervisor..."


I don't think I'd use this for my core application (where it's worth building application-specific distribution), but it could be an option in the toolbox for other/internal/non-core apps that have grown beyond a single machine.

(I met the founder at a conference last year where we were both speaking; I was suspicious of the whole model (because it sounds like magic), but talked with them a bit more and it seems legit for a lot of workloads.)


Having been an HPC sysadmin pre- and post-cloud, this unicorn comes around about every ten years. Many previous startups failed because they tried to push expensive custom hardware with fancy interconnects. Most importantly, there isn't a need: scaling apps horizontally instead of vertically is usually more powerful, cheaper, and more reliable, as evidenced by Google's dominance. Plus, Rocks Cluster, Hadoop, Kafka, Kinesis, IPython, Erlang/OTP, Akka and more already make this sort of thing moot; deploying a resilient cluster hasn't been as impressive as summiting Mt. Everest for a decade now.

It feels like trying to sell a magical product (cluster resource management) to get a major advance for free (a single system image). Maybe they can make the file system, RAM, network devices, packets, firewall rules and everything else magically work, but fault handling and latency are big concerns that rapidly increase in complexity. The hypervisor has to guess how to most efficiently schedule what a process wants done, when it would be simpler to code for parallelism (MapReduce / actors) from the beginning.


> there isn't a need because scaling apps horizontally instead of vertically is usually more powerful

No, it isn't more powerful per se. There are many problems that simply cannot be scaled horizontally at all because of the data-exchange patterns between nodes. However, the infrastructure/hardware behind horizontal scaling is simple, and it's basically unlimited.


There are only so many TB of RAM/disk and so many cores that can fit in one box at a sensible cost, whereas it's usually far cheaper and less limiting to just add more boxes. Plus, one giant box is a single point of failure.

It usually indicates developer/provider laziness, cheapness, or lack of imagination when people say something "can't" be done. Have they even tried? Most real problems are decomposable. Plus, any shop using RDMA and fast InfiniBand instead of Ethernet doesn't have to worry as much about node-to-node latency.

There is no magic solution for all use cases, but scaling vertically tends to hit budget or technical limits, where a little forethought could have scaled out using modern tools instead of shifting the burden down the stack.

If you're LMAX, scale up... looking for ET in radio signals, scale out.


If these problems can't be scaled horizontally explicitly by the developer, why would this system be able to scale them horizontally behind the scenes? Any limitations on data exchange between nodes would still apply.


Most of these single-system-image clusters have used custom, fast interconnects akin to InfiniBand or HyperTransport to ameliorate locality issues (latency and bandwidth)... Without that, it's nigh impossible without pinning all of a "process"'s resources to one physical box.

Good luck ever getting more than one box worth of performance or resources out of a single application instance.

Hence MapReduce, actors, promises, message-passing, etc.
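As a toy illustration of the explicit-decomposition style this thread advocates, here is a word count in the MapReduce shape (pure Python on one machine; the function names and phase boundaries are illustrative, not any particular framework's API):

```python
from collections import defaultdict

def map_phase(doc):
    # Map: emit (word, 1) pairs; each doc can be processed on any node.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group values by key; this is the only cross-node data exchange.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values independently, again parallelizable.
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["the"], counts["fox"])  # 3 2
```

The point is that the programmer, not a hypervisor, decides where the data exchange happens: only the shuffle crosses node boundaries, so locality is explicit in the program's structure.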


So, it is basically for people who don't want to, or can't, distribute their work using proper data modeling and multiple distributed processes.

Don't get me wrong, there may be use cases where this is actually the best approach. But having one OS with many, many NUMA nodes (which I guess is how the OS sees the cluster) is also very hard to get right. You need to program NUMA-aware and use local memory where possible; all kinds of locks will become very, very slow, so you need to program lock-free where possible or have some sort of transactional memory. Anyway, I would say that any gain you get by using threads will most likely be overcome by the complexity in memory management and data structures needed to reach good performance.

Message passing is much easier to program and allows easy encapsulation of modules and testability.

Cool project nonetheless :)


Why does this make me think of RAM doublers from the win 3.1 era?



ScaleMP does this, too. Not for the faint of wallet, even for the free edition, considering you have to use approved hardware...

http://www.scalemp.com/products/vsmp-foundation-free/


ScaleMP has been around for a long time. It would be good to get an update or understand the distinction between the two products.


Mosix is still a terrible idea.


Is this NUMA orchestration?


That was my thought as well. Like NUMA but over the network.


I don't see how this is relevant when you can buy servers with 3TB of RAM off the shelf.



