> Since everyone is treating containers as cattle CRIU doesn't seem to get much attention
Nah, it's more like "I don't trust that thing to not cause weird behavior in production".
VM-level snapshots are standard practice[1] because the abstraction there is right-sized for being able to do that reliably. CRIU isn't, because it's trying to solve a much harder problem.
[1]: And even there, beware cloning running memory state, you can get weird interactions from two identical parties trying to talk to the same 3rd service, separated by time. Cloning disk snapshots is much safer, and even there you can screw up because of duplicate machine IDs, crypto keys, nonces, etc.
The thing with VMs is that booting a Linux VM carries much more overhead, which makes checkpointing much more attractive. A container running with Linux namespaces/cgroups can be started in a few milliseconds.
I’m sure there are some niche applications for container checkpointing, but I don’t really see the complexity being worth it.
Maybe checkpointing some long-running batch jobs could save you some money, but you should just make your jobs checkpoint their state to an external store such as Ceph or S3, and make the jobs smart enough to load any state from those stores if they are preempted.
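To make that concrete, here is a rough sketch of a job that checkpoints its own state and resumes after preemption. Everything in it is an assumption for illustration: the names (`load_checkpoint`, `save_checkpoint`, `run_job`), the toy "sum a list" workload, and the local JSON file standing in for Ceph or S3. The `stop_after` parameter only exists to simulate the job being killed mid-run.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_item": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state atomically so a kill mid-write cannot corrupt it."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic when source and destination are on one filesystem.
    os.replace(tmp_path, path)


def run_job(items, path, checkpoint_every=100, stop_after=None):
    """Sum `items`, checkpointing progress every `checkpoint_every` items.

    `stop_after` simulates preemption: the function returns None after
    processing that many items, discarding any un-checkpointed work.
    """
    state = load_checkpoint(path)
    processed = 0
    for i in range(state["next_item"], len(items)):
        if stop_after is not None and processed >= stop_after:
            return None  # simulated preemption: in-memory state is lost
        state["total"] += items[i]
        state["next_item"] = i + 1
        processed += 1
        if state["next_item"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the first run is killed partway through, the second run just picks up from the last saved `next_item` and redoes at most `checkpoint_every - 1` items. With a real object store you would swap the two file helpers for uploads and downloads; the atomic-rename trick is specific to the local-file stand-in used here.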
Firecracker starts running the application in as little as 125 ms. Most of the overhead in a "cloud cold start" comes from the cloud infrastructure, not from the virtualization mechanism.
Yeah, I’ve always had a soft spot for CRIU, but I don’t see it as battle-tested enough to trust big, gnarly, closed-source third-party enterprise Java products to run under it. And if I’ve reduced an open-source piece of kit’s execution footprint enough to trust it, I’d probably reach for Unikraft with checkpointing to a Persistent Volume before CRIU, which feels like the early days of VMware.
Hopefully though, my trepidation is wrong. What is the most complex piece of software others have run under CRIU in production, and for how long?