This looks cool. How much SpotML-specific code is needed to target SpotML? I'm a...

vishnukool · on Oct 3, 2021

We tried to keep it minimal. All you need to do is specify, in the spotml.yml file, the filename format of the checkpoints to resume from. So let's say your checkpoint files are saved as ckpt00.pt, ckpt01.pt, ckpt03.pt and so on. You can configure the checkpoint filename regex ^ckpt[0-9]{2}$ and spotml resumes by picking the latest matching file.
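
To make the "pick the latest match" idea concrete, here is a rough Python sketch. It is not SpotML's actual code: the function name `latest_checkpoint` is hypothetical, and the pattern is an assumed variant of the regex above that also matches the `.pt` extension, since how SpotML applies the configured regex to filenames isn't stated here. It relies on the zero-padded, fixed-width numbering so that plain string ordering matches numeric ordering.

```python
import re
from typing import Iterable, Optional

# Assumed variant of the regex from the comment, extended to cover ".pt".
CKPT_PATTERN = re.compile(r"^ckpt[0-9]{2}\.pt$")


def latest_checkpoint(
    filenames: Iterable[str], pattern: re.Pattern = CKPT_PATTERN
) -> Optional[str]:
    """Return the highest-numbered matching filename, or None if none match."""
    matches = [name for name in filenames if pattern.match(name)]
    # Zero-padded two-digit suffixes sort correctly as plain strings.
    return max(matches, default=None)


if __name__ == "__main__":
    print(latest_checkpoint(["ckpt00.pt", "ckpt01.pt", "ckpt03.pt", "notes.txt"]))
```

If the numbering were not zero-padded (ckpt9.pt vs ckpt10.pt), string ordering would no longer work and the number would have to be parsed out and compared as an integer.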

To detect whether the training process is still running or has errored out, it registers the PID of the training command when launching the task and then monitors that PID for completion. It also registers and monitors the instance state itself to check for interruptions and to resume.
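
The PID part can be sketched in Python roughly like this. Again, this is an illustration under assumptions and not SpotML's implementation: `run_and_monitor` and the "completed"/"errored" labels are made up for the example, and the instance-state monitoring is left out because it depends on the cloud provider's API.

```python
import subprocess
import sys
import time
from typing import List, Tuple


def run_and_monitor(command: List[str], poll_seconds: float = 0.1) -> Tuple[int, str]:
    """Launch a command, record its PID, and poll until it exits.

    Returns (pid, status), where status is "completed" for exit code 0
    and "errored" for any other exit code (hypothetical labels).
    """
    process = subprocess.Popen(command)
    pid = process.pid  # the "registered" PID of the training command
    # poll() returns None while the process is still running.
    while process.poll() is None:
        time.sleep(poll_seconds)
    status = "completed" if process.returncode == 0 else "errored"
    return pid, status


if __name__ == "__main__":
    # Stand-in for a training command that fails with exit code 3.
    print(run_and_monitor([sys.executable, "-c", "import sys; sys.exit(3)"]))
```

A real monitor would typically persist the PID somewhere outside the process so that a separate watcher can check it, rather than polling from the launcher itself.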

ShamelessC · on Oct 3, 2021

Oh wow that's rather simple and works with more configurations than I expected. Thanks.