Thank you!
Interesting, we actually tried AWS Batch ourselves. 1) How were you able to handle spot interruptions and resume from the latest checkpoint? 2) Not to mention falling back to On-Demand on spot interruptions. 3) Switching back to spot from On-Demand would also need an additional process to be set up.
Also, I'm not sure how straightforward it is to detach/attach persistent volumes to retain data across spot interruptions? It can be done, but it's the same rote work each time you want to train something new.
Also, thanks for the suggestions!
We're a team of 2 right now. I used to be in the Bay Area but am in Mexico temporarily.
1. Spot interruptions didn't matter much, as AWS Batch looks for spot capacity with a low interruption probability. Automatic retries kicked in whenever instances did get interrupted.
2. Checkpointing was a pain (we relied mostly on AWS Batch's JobState and S3, not ideal), but the current capability to mount EFS (Elastic File System) looks like it would solve this?
3. No hot swapping on-demand with spot and vice versa. Interestingly, ALB (Application Load Balancer) supports such mixed EC2 configurations (AWS Batch doesn't).
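For what it's worth, the "resume from the latest checkpoint" part can be sketched roughly like this, assuming the job has a shared directory mounted (for example an EFS mount) that survives the interruption. Everything named here is hypothetical: the `CHECKPOINT_DIR` variable, the `/mnt/efs/checkpoints` path, the JSON file layout, and the toy `train` loop that stands in for real training. It is a minimal sketch of the idea, not how either team actually did it.

```python
import json
import os
import tempfile
from pathlib import Path

# Assumption: a shared filesystem is mounted here; name and path are hypothetical.
CHECKPOINT_DIR = Path(os.environ.get("CHECKPOINT_DIR", "/mnt/efs/checkpoints"))


def save_checkpoint(directory: Path, step: int, state: dict) -> Path:
    """Write to a temp file, then rename, so a reader never sees a partial file."""
    directory.mkdir(parents=True, exist_ok=True)
    final_path = directory / f"step_{step:08d}.json"
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_name, final_path)
    return final_path


def load_latest_checkpoint(directory: Path):
    """Return (step, state) from the newest checkpoint, or (0, {}) if none exists."""
    if not directory.is_dir():
        return 0, {}
    # Zero-padded step numbers make lexicographic order match numeric order.
    candidates = sorted(directory.glob("step_*.json"))
    if not candidates:
        return 0, {}
    with candidates[-1].open() as f:
        payload = json.load(f)
    return payload["step"], payload["state"]


def train(directory: Path, total_steps: int, checkpoint_every: int = 10) -> dict:
    """Toy training loop: resumes from the last checkpoint if one is present."""
    start_step, state = load_latest_checkpoint(directory)
    total = state.get("total", 0)
    for step in range(start_step + 1, total_steps + 1):
        total += step  # stand-in for one real training step
        if step % checkpoint_every == 0 or step == total_steps:
            save_checkpoint(directory, step, {"total": total})
    return {"step": max(start_step, total_steps), "total": total}


if __name__ == "__main__":
    # A retried job simply calls train() again and picks up where it left off.
    print(train(CHECKPOINT_DIR, total_steps=100))
```

With this shape, a retried attempt needs no special handling: it runs the same entry point and skips the steps already covered by the newest checkpoint. Anything done since the last checkpoint is lost, so `checkpoint_every` is a trade-off between write overhead and repeated work.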