
This looks cool. How much SpotML-specific code is needed to target SpotML? I'm assuming it doesn't just magically detect the training loop in an existing pytorch codebase and needs hooks implemented for resume, inference, error handling, etc.


We tried to keep it minimal. All you need to do is specify, in the spotml.yml file, the filename format of the checkpoints to resume from. So let's say your checkpoint files are saved as ckpt00.pt, ckpt01.pt, ckpt03.pt and so on. You can configure the checkpoint filename regex ^ckpt[0-9]{2}$ and spotml resumes by picking the latest matching file.
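To make the idea concrete, here is a rough sketch of picking the latest checkpoint by regex. This is not SpotML's actual code: the function name `latest_checkpoint` is hypothetical, "latest" is assumed to mean the highest number in the filename, and the pattern is assumed to be matched against the full filename, which is why this version includes the `.pt` extension.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(directory: str,
                      pattern: str = r"^ckpt[0-9]{2}\.pt$") -> Optional[Path]:
    """Return the matching checkpoint with the highest name, or None.

    Hypothetical helper: assumes zero-padded, fixed-width numbering so
    that lexicographic order equals numeric order.
    """
    rx = re.compile(pattern)
    matches = [
        p for p in Path(directory).iterdir()
        if p.is_file() and rx.match(p.name)
    ]
    # max() over names works because ckpt00 < ckpt01 < ckpt03 as strings.
    return max(matches, key=lambda p: p.name, default=None)
```

With the files from the example above, this would pick ckpt03.pt. If the numbering were not zero-padded, you would need to parse the number out and compare it as an integer instead.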

To detect whether the training process is still running or has errored out, it registers the PID of the training command when launching the task and then monitors that PID for completion. It also registers and monitors the instance state itself to detect interruptions and resume.
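Roughly, the PID-monitoring part looks like the sketch below. It is only an illustration and not SpotML's implementation: `run_and_monitor` and `classify` are hypothetical names, a zero exit code is assumed to mean the run completed, and the instance-state monitoring is left out entirely.

```python
import subprocess
import sys
import time
from typing import List, Tuple


def run_and_monitor(command: List[str],
                    poll_seconds: float = 1.0) -> Tuple[int, int]:
    """Launch the command, record its PID, and wait until it exits."""
    proc = subprocess.Popen(command)
    pid = proc.pid  # the PID that gets "registered" at launch
    # poll() returns None while the process is still running.
    while proc.poll() is None:
        time.sleep(poll_seconds)
    return pid, proc.returncode


def classify(returncode: int) -> str:
    """Assumption: exit code 0 means success, anything else is an error."""
    return "completed" if returncode == 0 else "errored"


if __name__ == "__main__":
    # Stand-in for a real training command.
    pid, code = run_and_monitor([sys.executable, "-c", "pass"], 0.1)
    print(pid, classify(code))
```

A real monitor would watch a PID it did not spawn itself and would combine this with the instance state, so an exit caused by a spot interruption can be told apart from a training error.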


Oh wow that's rather simple and works with more configurations than I expected. Thanks.



