Disclosure: I work on Google Cloud (and launched Preemptible VMs).

Thanks for the write-up, Max! I want to clarify something though: how do you handle and account for preemption? As we document online, preemption rates have oscillated between 5 and 15% (on average, varying from zone to zone and day to day), and they're also going to be higher for the largest instances (like highcpu-64). If you need to train for longer than our 24-hour limit, or you're getting preempted too much, that's a real drawback. (Note: I'm all for using preemptible for development and/or all batch-y things, but only if you're ready for the trade-off.)

While we don't support preemptible with GPUs yet, that's mostly because the team wanted to see some usage history first. We didn't launch Preemptible VMs until about 18 months after GCE itself went GA, and even then it involved a lot of handwringing over cannibalization and economics. We've looked at it on and off, but the team's first priority is getting K80s to General Availability.

Again, Disclosure: I work on Google Cloud (and love when people love preemptible).



> how do you handle and account for preemption?

I do most of my experiments with Jupyter Notebooks and Keras on top of TensorFlow. Keras has a ModelCheckpoint callback (https://keras.io/callbacks/#modelcheckpoint) which saves the model to disk after each epoch; it's super easy to implement (1 LOC), and a good idea even if I weren't training on a preemptible instance. In the event of an unexpected preemption, I can just retransform the data (easy with a Jupyter-organized workflow), load the last-saved model (1 LOC), and resume training. (A sketch of the pattern follows below.)

The drawback is when the epochs are long, which risks losing more progress than you'd like to a preemption.
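
A minimal, runnable sketch of that checkpoint-and-resume pattern (the toy model, data, and 'model.h5' filepath are placeholders; ModelCheckpoint and load_model are the actual Keras APIs):

    from keras.models import Sequential, load_model
    from keras.layers import Dense
    from keras.callbacks import ModelCheckpoint
    import numpy as np

    # Toy model and data, just to make the sketch self-contained.
    model = Sequential([Dense(1, input_dim=4)])
    model.compile(optimizer='adam', loss='mse')
    x, y = np.random.rand(100, 4), np.random.rand(100, 1)

    # The 1 LOC in question: save the full model after every epoch.
    checkpoint = ModelCheckpoint('model.h5')
    model.fit(x, y, epochs=10, callbacks=[checkpoint])

    # After a preemption: reload the last-saved model (architecture,
    # weights, and optimizer state) and keep training.
    model = load_model('model.h5')
    model.fit(x, y, epochs=10, callbacks=[checkpoint])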


It's really odd that the Keras API's interval is measured in epochs (which is a different wall-clock interval for every model/dataset/hardware configuration). It's much more common to checkpoint based on a time interval.
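
For what it's worth, time-based checkpointing is only a few lines as a custom Keras callback (this TimedCheckpoint class is a hypothetical sketch, not a built-in):

    import time
    from keras.callbacks import Callback

    class TimedCheckpoint(Callback):
        # Save the model whenever `interval_sec` seconds of wall-clock
        # time have elapsed, checking at the end of each batch.
        def __init__(self, filepath, interval_sec=600):
            super(TimedCheckpoint, self).__init__()
            self.filepath = filepath
            self.interval_sec = interval_sec
            self.last_save = time.time()

        def on_batch_end(self, batch, logs=None):
            if time.time() - self.last_save >= self.interval_sec:
                self.model.save(self.filepath)
                self.last_save = time.time()

You'd pass it to model.fit(..., callbacks=[TimedCheckpoint('model.h5')]) like any other callback.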


Oh interesting, I've never seen checkpointing on a time interval. Most Torch examples just dump the model to disk after the epoch finishes.

One reason to use epoch checkpointing is that it ensures all samples of the training data have been seen the same number of times. If your data is large and diverse, with heavy enough augmentation it might not matter very much.
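
The epoch-end dump those Torch examples use looks roughly like this in PyTorch (swapping in PyTorch since modern Torch code is Python; the toy model and 'checkpoint.pth' path are placeholders):

    import torch
    import torch.nn as nn

    # Toy model and data, just to make the sketch self-contained.
    model = nn.Linear(4, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.rand(100, 4), torch.rand(100, 1)

    for epoch in range(10):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
        # Dump model + optimizer state once the epoch finishes.
        torch.save({'epoch': epoch,
                    'model': model.state_dict(),
                    'optimizer': optimizer.state_dict()},
                   'checkpoint.pth')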



