Disclosure: I work on Google Cloud (and launched Preemptible VMs).

Thanks for the write-up, Max! I want to clarify something though: how do you handle and account for preemption? As we document online, preemption rates have oscillated between 5 and 15% (on average, varying from zone to zone and day to day), and they're also going to be higher for the largest instances (like highcpu-64). If you need to train for longer than our 24-hour limit, or you're getting preempted too much, that's a real drawback. (Note: I'm all for using preemptible for development and/or all batch-y things, but only if you're ready for the trade-off.)

While we don't support preemptible with GPUs yet, that's mostly because the team wanted to see some usage history first. We didn't launch Preemptible VMs until about 18 months after GCE itself went GA, and even then it involved a lot of handwringing over cannibalization and economics. We've looked at it on and off, but the team's first priority is getting K80s to General Availability.

Again, Disclosure: I work on Google Cloud (and love when people love preemptible).



> how do you handle and account for preemption?

I do most of my experiments with Jupyter Notebooks and Keras on top of TensorFlow. Keras has a ModelCheckpoint callback (https://keras.io/callbacks/#modelcheckpoint) which saves the model to disk after each epoch; it's super easy to implement (1 LOC), and a good idea even if I weren't training on a preemptible instance. In the event of an unexpected preemption, I can just retransform the data (easy with a Jupyter-organized workflow), load the last-saved model (1 LOC), and resume training. (A sketch of the pattern follows below.)

The drawback is when the epochs are long, which risks losing more progress than you'd like to a preemption.
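
A minimal, runnable sketch of that checkpoint-and-resume pattern (the toy model, data, and 'model.h5' filepath are placeholders; ModelCheckpoint and load_model are the actual Keras APIs):

    from keras.models import Sequential, load_model
    from keras.layers import Dense
    from keras.callbacks import ModelCheckpoint
    import numpy as np

    # Toy model and data, just to make the sketch self-contained.
    model = Sequential([Dense(1, input_dim=4)])
    model.compile(optimizer='adam', loss='mse')
    x, y = np.random.rand(100, 4), np.random.rand(100, 1)

    # The 1 LOC in question: save the full model after every epoch.
    checkpoint = ModelCheckpoint('model.h5')
    model.fit(x, y, epochs=10, callbacks=[checkpoint])

    # After a preemption: reload the last-saved model (architecture,
    # weights, and optimizer state) and keep training.
    model = load_model('model.h5')
    model.fit(x, y, epochs=10, callbacks=[checkpoint])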


It's really odd that the Keras API's interval is measured in epochs (which is a different wall-clock interval for every model/dataset/hardware configuration). It's much more common to checkpoint based on a time interval.
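
For what it's worth, time-based checkpointing is only a few lines as a custom Keras callback (this TimedCheckpoint class is a hypothetical sketch, not a built-in):

    import time
    from keras.callbacks import Callback

    class TimedCheckpoint(Callback):
        # Save the model whenever `interval_sec` seconds of wall-clock
        # time have elapsed, checking at the end of each batch.
        def __init__(self, filepath, interval_sec=600):
            super(TimedCheckpoint, self).__init__()
            self.filepath = filepath
            self.interval_sec = interval_sec
            self.last_save = time.time()

        def on_batch_end(self, batch, logs=None):
            if time.time() - self.last_save >= self.interval_sec:
                self.model.save(self.filepath)
                self.last_save = time.time()

You'd pass it to model.fit(..., callbacks=[TimedCheckpoint('model.h5')]) like any other callback.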


Oh interesting, I've never seen checkpointing on a time interval. Most Torch examples just dump the model to disk after the epoch finishes.

One reason to use epoch checkpointing is that it ensures all samples of the training data have been seen the same number of times. If your data is large and diverse, with heavy enough augmentation it might not matter very much.
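
The epoch-end dump those Torch examples use looks roughly like this in PyTorch (swapping in PyTorch since modern Torch code is Python; the toy model and 'checkpoint.pth' path are placeholders):

    import torch
    import torch.nn as nn

    # Toy model and data, just to make the sketch self-contained.
    model = nn.Linear(4, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.rand(100, 4), torch.rand(100, 1)

    for epoch in range(10):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
        # Dump model + optimizer state once the epoch finishes.
        torch.save({'epoch': epoch,
                    'model': model.state_dict(),
                    'optimizer': optimizer.state_dict()},
                   'checkpoint.pth')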



