Show HN: SpotML – Managed ML Training on Cheap AWS/GCP Spot Instances (spotml.io)
157 points by vishnukool on Oct 3, 2021 | 68 comments


Disclosure: I used to work for GCP and launched Preemptible VMs.

Congrats! Can I suggest charging more?

IIUC, your business plan is $10/month for the company, regardless of number of users?

You probably save a company that $10 in a day or less for one GPU: one A100 is ~$3/hr on demand and ~$0.90/hr as Preemptible, saving over $2/hr.

Said another way, your pitch is to recover a lot of the 70% discount that companies aren't going to capture themselves. If you were a managed training service, you could pitch yourself as "half the price of AWS or GCP" and keep a 20%+ margin with both parties being happy. (The catch is that pass-through billing makes that obvious, you need to support lots of bucket security and IAM controls, etc.)

Fwiw, I would also branch out into inference! Preemptible and Spot T4s are commonly used for heavy image models, but many people pay full price. Inference that takes X ms can easily be handled "without errors" in the shutdown time. The risk is handling all the capacity swings.


Thanks for the feedback, yeah at this stage we really just wanted to find out if we can build something useful for the community. Agreed on the pricing suggestion.

Also, interesting point about inference. I'm not sure though how common it is for companies to need GPUs for inference. If you can run a CPU-based inference model, which I thought was the most common setup, it's probably not a big use case?


I've got some comment somewhere on HN that says exactly that "try CPU inference first, it's pretty good".

The need to reach for a T4 comes when someone is doing a big model on images or video and wants sub-second response time. (Think some of the stuff on Snapchat, etc.)


I've worked with people who have needed GPU-powered inference on AWS. Training had to happen on p3s but inference happened on g4s. The pricing is lower (and, depending on the use case, often the overall cost too), and spot savings are usually less dramatic as these instances can be very in demand.


We built a tool called SpotML to make ML training on AWS/GCP cheaper.

Spot Instances are 70% cheaper than On-Demand instances but are prone to interruptions. We mitigate the downside of these interruptions with persistence features, including an optional fallback to On-Demand instances, so you can optimize workflows according to your budget and time constraints.

History: We were working on a neural rendering startup that needed a lot of GAN training, which was getting very expensive. We were blowing roughly $1000 to train a single category class. Training on spot instances was cheaper, but still a mess: it needed a lot of hand-holding/devops work to make it usable. So we built SpotML to automate much of that.

Posting it here to see if the community finds this helpful, so that we can open it up to the larger community.


Neat. Congratulations on the launch, Vishnu!

Apart from the fact that it could deploy to both GCP and AWS, what does it do differently than AWS Batch [0]?

When we had a similar problem, we ran jobs on spots with AWS Batch and it worked nicely enough.

Some suggestions (for a later date):

1a. Add built-in support for Ray [1] (you'd essentially then be competing with Anyscale, which is a VC-funded startup, just to contrast it with another comment on this thread) and dbt [2].

1b. Or: Support deploying coin miners (might help widen the product's reach, and stand it up against the likes of ConsenSys).

2. Get in front of the very many cost optimisation consultants out there, like the Duckbill Group.

If I may, where are you building this product from? And how many are on the team? Thanks.

[0] https://aws.amazon.com/batch/use-cases/

[1] https://ray.io/

[2] https://getdbt.com/


Thank you! Interesting, we actually tried AWS Batch ourselves. 1) How were you able to handle spot interruptions and resuming from the latest checkpoint? 2) Not to mention fallback to On-Demand on spot interruptions. 3) Then switching back to spot from On-Demand would also need an additional process to be set up.

Also, I'm not sure how straightforward it is to detach/attach persistent volumes to retain data across different spot interruptions? The latter can be done, but it's the same rote work each time you wanna train something new.

Also, thanks for the suggestions! We're a team of 2 right now. I used to be in the Bay Area but am in Mexico temporarily.


1. Spot interruptions didn't matter much, as AWS Batch looks for spots with a low interruption probability. Auto retries kicked in whenever those did get interrupted.

2. Checkpointing was a pain (we relied mostly on AWS Batch's JobState and S3, not ideal), but the current capability to mount EFS (Elastic Filesystem) looks like it would solve this?

3. No hot swapping on-demand with spot and vice versa. Interestingly, ALB (Application Load Balancer) supports such mixed EC2 configurations (AWS Batch doesn't).


This looks useful, I like the pricing of free/$9.99 a month.

No one asked, but it's HN so I'll say it anyway. I think it's a questionable "VC business" but a great business for 1-2 people. The road from this to an enterprise sales motion, or even a $10K/year contract, is hard for me to imagine. At some point, it becomes cost effective for my org to build this functionality in house.

However, as a hobbyist / single dev / small team, $120/year is a no-brainer after the first $2-3K I spend on GPUs by mistake. As you know, setting up spots when I just want to get shit done is a pain, and I'll gladly pay you (a little) to make that go away.

Still no one asked but... One thing that plays to your advantage is that the current price point is something anyone in your user group can buy on their own, and there are a lot of us / enough to make a nice business out of.

Good luck!


Thanks, yeah. At this stage we really just want to validate whether this is a real problem in the ML community. Down the line, I suppose, as we scale to handle multi-instance training and other use cases, we could probably charge more, say a % of the cost savings on training.


You should really mention / give attribution / emphasize more that this is a fork of https://spotty.cloud and you took a lot from https://github.com/nimbo-sh/nimbo as well.


Kudos; this is likely a very useful tool!

So useful (and potentially lucrative) that AWS/GCP would likely take over this functionality eventually -- either by building on top of spot instances like you have, or underneath regular instances or re-use that capacity for some other managed service (thereby reducing the price differential). How do you plan to protect yourselves against that?


Ha! That's interesting. We use AWS instances for model development and costs are definitely an issue. Sending this to my team. Good luck!


This looks like an obvious approach (in hindsight, of course) to a general problem. I love your pivot.

Congrats on the idea and godspeed. You’ll probably have a lot of interest if you execute well.


Thank you!


Really excited to try it out. I've had a heck of a time setting up spot instance training on SageMaker in the past, so simplification efforts are much appreciated.


Curious, did you eventually start using SageMaker with spot instances or did you give up on it? Also, what would you say were the biggest pain points with SageMaker?


I'd like to point out that this seems extremely similar to Nimbo (https://nimbo.sh), to the extent that even some of the terminal messages are exactly the same, and even parts of the docs are copy pasted. E.g:

Nimbo docs: "In order to run this job on Nimbo, all you need is one tiny config file and a Conda environment file (to set the remote environment), and Nimbo does the following for you:"

SpotML docs: "In order to run this job on SpotML, all you need is one tiny config file and a Docker file (to set the remote environment), and SpotML does the following for you:"

Make of that what you will :).


Yes, we liked the elegance of both the tool and the docs, so it's very much inspired by it. I must also give credit to another great tool, https://spotty.cloud/, from which this project was adapted.


The optics of direct copying without attribution aren't great for trust in open source software. Count me out, and thanks for pointing me to the place where people are _actually_ working in the open/cooperatively.


Docs are also still copyrighted works. Copying them verbatim might be violating the rights of the original authors.


Thanks for pointing it out. We realize our mistake here; we should've also done proper attribution. We'll be correcting this.


Seems like Nimbo (https://nimbo.sh) has a Business Source License (https://github.com/nimbo-sh/nimbo/blob/master/LICENSE), so you might want to check with them regarding licensing terms for a startup that is using their code and/or docs in "production"?

Otherwise, this idea is interesting and probably generalizable to other applications. Maybe it's not crystal clear to me, but what are the advantages of your service over existing solutions such as Nimbo and Spotty? FWIW it might be worthwhile adding this to your website.

Good luck!


Thanks, makes sense. It doesn't use any "code" from Nimbo. The documentation and the design simplicity of the tool were what appealed to us and what we adopted. The project itself was forked from Spotty, which has an MIT license.

The biggest advantage, which was missing in the open source options, is monitoring of the training job and automatic recovery from spot interruptions, which SpotML does.


A couple years ago I would have had a bigger interest in this. PyTorch added elastic training as part of version 1.9 when they incorporated TorchElastic into it. TensorFlow added an easier way to do fault-tolerant training in 2.4 with the BackupAndRestore callback for Keras models. For non-Keras models it'll take some extra code, but they do include some examples for this too.

Also, there are at least two open source, free solutions for elastic training I know of: RaySGD, https://docs.ray.io/en/master/raysgd/raysgd.html and elastic Horovod, https://horovod.readthedocs.io/en/stable/elastic_include.htm... For me to consider this I'd need a comparison table with the native framework solutions and these solutions, along with what it adds for me.
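For reference, here's a minimal sketch of the Keras fault-tolerance path mentioned above. It assumes TF 2.4+, where the callback lives under tf.keras.callbacks.experimental (it was promoted to tf.keras.callbacks.BackupAndRestore in later releases); the backup directory and dummy data are purely illustrative:

    import numpy as np
    import tensorflow as tf

    # Toy data just to make the example runnable.
    x_train = np.random.rand(1024, 32).astype("float32")
    y_train = np.random.rand(1024, 1).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

    # After a preemption/restart, fit() resumes from the last completed
    # epoch recorded under backup_dir instead of starting from epoch 0.
    backup_cb = tf.keras.callbacks.experimental.BackupAndRestore(
        backup_dir="/mnt/persistent/keras-backup"  # must survive the restart
    )

    model.fit(x_train, y_train, epochs=100, callbacks=[backup_cb])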


Read through the homepage, but not entirely sure --

Why not just train on Spot Instances with a retry implemented?

I see that SpotML has a configurable fallback to On-Demand instances, and perhaps their value prop is that it saves the state of your run up to the interruption and resumes it on the On-Demand instance, but why not just set a retry on the spot instance if it's interrupted?

I'm failing to see what is different about SpotML vs Metaflow's @retry decorator and using AWS Batch: https://docs.metaflow.org/metaflow/failures#retrying-tasks-w...

If you're in the comments still, Vishnu, I would love to hear your thoughts.


Interesting, thanks, we weren't aware of Metaflow.

I've read through the docs, and the one difference that comes to mind is the automatic fallback to On-Demand and resuming on spot when it becomes available again. I can't readily see a way to do this in Metaflow yet, but it's possible I've missed something.


There are a few different ways to deal with spot interruptions. First, it is a good idea to specify multiple instance types in your compute environment, so even if some instance types become unavailable in spot, Batch can use another type automatically.

Second, you can rely on Spot Fleets which handle both spot and on-demand instances seamlessly https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fle...


This is really cool!

I used to have these same pains. My trick with spot instances has been to set my maximum price to the price of a regular instance of that class or higher, and to sync weights to S3 in the background on every save. The former is a parameter when starting the instance in the console or Terraform; the latter is basically a do...while loop. I've noticed that one often gets booted from a spot instance, causing an interruption, because the market price increases a few cents above the "70% savings price." Increasing the maximum to the on-demand price is basically free money: you don't get booted often, and your max price is still just the regular on-demand price.
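As a rough illustration (not the parent's actual script), the background sync can be as simple as a loop around `aws s3 sync`; the bucket and checkpoint directory here are hypothetical, and it assumes the AWS CLI and credentials are already set up on the instance:

    import subprocess
    import time

    CHECKPOINT_DIR = "/home/ubuntu/checkpoints"      # wherever training writes
    S3_DEST = "s3://my-training-bucket/checkpoints"  # hypothetical bucket

    while True:
        # `aws s3 sync` only uploads files that changed since the last run,
        # so calling it repeatedly between checkpoint saves is cheap.
        subprocess.run(["aws", "s3", "sync", CHECKPOINT_DIR, S3_DEST], check=False)
        time.sleep(60)  # tune to your checkpoint cadence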

This has seemed to mitigate about all of the spot downsides (like interruptions or losing data), because you don't easily get kicked out of the spot unless there's a run on machines, and the prices rarely fluctuate that much (at least for the higher-end p3 instances). It also protects the downside risk by capping the instance at a knowable max price. There are cases where the spot price goes higher than the on-demand price, so you still get booted once in a while, but it's very infrequent; as you can imagine, most people don't want to spend more for spot than on-demand.

Anecdotally, I still average out to getting the vast majority of the spot savings with this method, with very few interruptions. Looking at SpotML, it seems to be a lot of tooling that achieves these same goals (assuming one would be interrupted when a spot dies and moves to full-freight on-demand with SpotML), which makes SpotML's solution feel very over-engineered to me, given that the majority of what SpotML provides can be had with a simple maximum cost parameter change when spinning up a spot instance.

I would be very interested in using anything that doesn't have great overhead and saves money. Our bill seems "big" to me (but I realize it may be small to many others), so even these small savings add up. Would you compare the potential benefits of SpotML to the method I described above?


Thanks for the feedback. The biggest upside is for long-running training (say hours or days) where the spot instance gets interrupted: you probably don't want to manually monitor for the next available spot instance and kick-start the training again. SpotML takes care of that part. Optionally, you can also configure it to resume on an On-Demand instance after an interruption until the next spot instance is available. In essence, we try to make these parts easy: i) creating buckets/EBS, ii) the code to save to S3 in a loop, iii) monitoring for interruptions and resuming from the checkpoint.


FWIW your "trick" has been the default for years. Unless you specify otherwise, the default max price is the on-demand price. They changed this back in 2017. Maximum price isn't really used by anyone any more. https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pri...

We spend about $10k/month on spot instances and I don't specify any max price. The way to avoid terminations is just to make sure you spread your workload over a large number of instance types and availability zones.


i always envisioned that if i ever did this, i would set the spot price to the equivalent of infinity/max and then monitor it myself, terminating the instance myself if i determined that continuing to run at the spot price was more expensive.

why? sure, you don't want to pay more than the on demand price, but iirc spot prices often spike very momentarily. so the question becomes whether the cost summed over the spike at the higher spot price exceeds the boot/shutdown/migration overhead you'd pay at the on demand price.

but i've never actually tried it so...


This is interesting. From what I've read, AWS no longer recommends setting spot instance prices, so that they can manage it themselves. I wonder if there's an actual advantage in avoiding spot interruptions by setting the spot price even higher than On-Demand.


yeah true. it's an interesting question that involves not only startup/shutdown of instance/os overheads (i suppose they've probably thought about this) but also overhead for checkpoint/restart of your application.

i also suppose this way of thinking about it comes from thinking around how to minimize cost from a purely mathematical standpoint. when you think about how and why the spot market is operated, and what those short term spikes may actually be, it may run counter to the intended purpose. (cheap capacity that they may recall at any time, because capacity is actually fixed)

funny how market based approaches can gamify things sufficiently that sometimes they obscure the underlying intention or purpose of having a market in the first place.


I'm interested in how they 'hibernate'/save the state of the instances within the shutdown time limit. I was also looking into this for myself; there are ways of using Docker to save the in-process memory a la hibernate, which would work well with this. But, especially for GCP where you only get ~60 seconds between the shutdown signal and the hard stop, I was worried that it wouldn't save fast enough. I often work on pretty high-RAM instances and thought even saving from RAM to disk would take too long for 150-300GB of RAM.

I hadn't heard of Nimbo; maybe I can read how they're doing it since it's open source. Does anyone have any idea how they're saving state so fast (NVMe SSD?)


It uses a mounted EBS (Elastic Block Store) volume, so all the checkpoints, data, etc. are already in persistent storage. The volume is simply re-attached to the next spot/On-Demand instance after an interruption.
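For anyone curious what that looks like mechanically, here's a hedged boto3 sketch (not SpotML's actual code) of re-attaching a persistent volume to a replacement instance; the IDs, region and device name are placeholders:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    VOLUME_ID = "vol-0123456789abcdef0"      # hypothetical persistent EBS volume
    NEW_INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical replacement instance

    # Wait until the volume has detached from the interrupted instance...
    ec2.get_waiter("volume_available").wait(VolumeIds=[VOLUME_ID])

    # ...then attach it to the new instance; the job mounts it and resumes
    # from the checkpoints already sitting on disk.
    ec2.attach_volume(VolumeId=VOLUME_ID,
                      InstanceId=NEW_INSTANCE_ID,
                      Device="/dev/sdf")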


Cool, yeah that makes total sense for ML where you just need to run over epochs; less clear for other workloads.

After looking around, I'm thinking more about CRIU/Docker suspend. The Google stars aligned and I found this https://github.com/checkpoint-restore/criu-image-streamer + https://linuxplumbersconf.org/event/7/contributions/641/atta... which actually seems perfect. I wonder how fast it is.

Edit: Also no GPU support AFAIK, but https://github.com/twosigma/fastfreeze looks really nice, turnkey. I wonder if, by writing to a fast persistent disk, I can get a higher maximum RAM than over the network.

(Or, hacking on the checkpoint idea, have a daemon periodically 'checkpoint' other programs, so even if it's too slow for the 60 seconds, revert to the last checkpoint. Even an rsync-like approach where you only send the changes.)


Oh, I didn't see much in Nimbo at a quick glance, but reading more closely: "We immediately resume training after interruptions, using the last model checkpoint via persistent EBS volume."

Makes sense, just save checkpoints to disk. What I'm doing is more CPU-bound and not straight ML, so less easily checkpointed, sadly. Cool though, it's worth jumping through hoops for a 70% reduction.


How does this work? From the documentation it's clear that AWS credentials are required. Which permissions are necessary? This leads me to the assumption that the SpotML CLI uses boto3 to create the necessary resources (EBS volumes, spot instances, S3 buckets) on my AWS account. If this is the case, how does billing work if this is "just" a CLI?


That's right, it will need AWS credentials with access to create EBS volumes and S3 buckets, spawn instances, etc. In addition to the CLI, a cloud service constantly monitors the progress of the jobs (by registering the PID when launching them) and the instance states. So billing will be based on the hours of training run and the $ saved.
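To make the "CLI creates resources in your account" part concrete, here's an illustrative-only boto3 sketch of the spot-first, on-demand-fallback launch pattern discussed in this thread. The AMI ID and instance type are placeholders, and this is not necessarily how SpotML implements it:

    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2", region_name="us-east-1")

    launch_args = dict(
        ImageId="ami-0123456789abcdef0",  # hypothetical deep learning AMI
        InstanceType="p3.2xlarge",
        MinCount=1,
        MaxCount=1,
    )

    try:
        # Ask for spot capacity first.
        resp = ec2.run_instances(
            **launch_args,
            InstanceMarketOptions={"MarketType": "spot"},
        )
    except ClientError as err:
        # e.g. no spot capacity available: fall back to an on-demand launch.
        print("Spot launch failed (%s), falling back to on-demand"
              % err.response["Error"]["Code"])
        resp = ec2.run_instances(**launch_args)

    print("Launched", resp["Instances"][0]["InstanceId"])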


Looks very useful. One suggestion: the before/after swipe image is great for showing how it works, but not why you would use it. Might be helpful to overlay the ascending cost line, which would be steep in the "before" but gradual in the "after".


Thanks for the suggestion, yes makes sense.


Cheapest spot GPUs these days actually come from Azure.

Source: https://cloudoptimizer.io


Not related to SpotML, but I literally just want to rent a server with a decent GPU or TPU and enough memory to handle image training. Hassle free: just SSH into it, transfer data to it, and run the training script, without breaking my wallet (as in, hundreds of dollars a month).

Seems like the services I've tried have focused heavily on supporting notebooks (Colab, Paperspace).

Any ideas?


Hey! There are several sites which provide this service. A while ago I put together a basic comparison of their features and pricing for different GPUs: https://mlcontests.com/cloud-gpu (best viewed on desktop).

Some offer both notebook and SSH access, some just one of the two. The cheapest are often the P2P ones, where you essentially rent someone else's consumer GPU.


This looks awesome!

Could you give an approximate cost of fine-tuning a model like BERT or even a GAN with this system?

Just want to get a sense of cost of using the system.


Sure. In our own startup, we used to spend roughly $1000 training a StyleGAN model for a class, plus additional latent space manipulation models. In the early days we recklessly wasted a lot of our AWS credits during experimentation. But later on, with spot instances, we were able to bring it down to $250 to $300 per category class, which was a more bearable cost.


This looks cool. How much SpotML-specific code is needed to target SpotML? I'm assuming it doesn't just magically detect the training loop in an existing pytorch codebase and needs hooks implemented for resume, inference, error handling, etc.


We tried to keep it minimal. All you need to do is specify the checkpoint format to resume from in the spotml.yml file. So let's say your checkpoint files are saved as ckpt00.pt, ckpt01.pt, ckpt03.pt and so on. You can configure the checkpoint filename regex, e.g. ^ckpt[0-9]{2}$, and SpotML resumes by picking the latest match.

For detecting whether the training process is still running or has errored out, it registers the training command's PID when launching the task and then monitors that PID for completion. It also registers and monitors the instance state itself, to check for interruptions and to resume.
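In case it helps readers picture the resume step, here's a minimal sketch of "pick the latest checkpoint matching a regex", using the example pattern above extended with the .pt extension; the directory is hypothetical and this isn't SpotML's actual implementation:

    import os
    import re

    CHECKPOINT_DIR = "/mnt/persistent/checkpoints"  # hypothetical mount point
    PATTERN = re.compile(r"^ckpt[0-9]{2}\.pt$")

    # Lexicographic sort works here because the epoch number is zero-padded.
    candidates = sorted(f for f in os.listdir(CHECKPOINT_DIR) if PATTERN.match(f))

    if candidates:
        print("Resuming from", os.path.join(CHECKPOINT_DIR, candidates[-1]))
    else:
        print("No checkpoint found, starting from scratch")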


Oh wow that's rather simple and works with more configurations than I expected. Thanks.


I'm not sure what this provides over what GCP already offers. 4 years ago I switched my co's ML training to use GKE (Google Kubernetes Engine) using a cluster of preemptible nodes.

All you need to do is schedule your jobs (just call the Kubernetes API and schedule your container to run). In the rare case the node gets preempted, your job will be restarted and restored by Kubernetes. Let your node pool scale to near zero when not in use and get billed by the _second_ of compute used.
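For readers who haven't done this, a hedged sketch of "call the Kubernetes API and schedule your container" with the official Python client might look like the following; the image, namespace and job name are placeholders, and the node selector is the standard GKE label for preemptible nodes:

    from kubernetes import client, config

    config.load_kube_config()

    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name="train-job"),
        spec=client.V1JobSpec(
            backoff_limit=10,  # re-run the pod if the node gets preempted
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="OnFailure",
                    node_selector={"cloud.google.com/gke-preemptible": "true"},
                    containers=[
                        client.V1Container(
                            name="trainer",
                            image="gcr.io/my-project/trainer:latest",  # hypothetical image
                            command=["python", "train.py"],
                        )
                    ],
                )
            ),
        ),
    )

    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)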


Reminder: this is Show HN, so please try to be constructive.

However, there's at least a couple of things that matter here that aren't covered by "just use a preemptible node pool":

* SpotML configures checkpoints (yes this is easy, but next point)

* SpotML sends those checkpoints to a persistent volume (by default in GKE, you would not use a cluster-wide persistent volume claim, and instead only have a local ephemeral one, losing your checkpoint)

* SpotML seems to have logic around "retry on preemptible, and then switch to on-demand if needed" (you could do this on GKE by just having two pools, but it won't be as "directed")


Looks like SpotML is a fork of https://github.com/nimbo-sh/nimbo and https://spotty.cloud/

This is a hustle to gauge interest (and collect emails) in a service that is a clone of nimbo.


This is a cool project, will try it out. How does it compare to cortex.dev? They use AWS spot instances too.


Haven't used cortex.dev, but looking at the docs, I'd say primarily simplicity and ease of getting started quickly (<3 mins to get it up and running). Also, with Cortex it's not clear to me yet whether, if a spot instance fails, Cortex can wait for the next spot instance or replace it with On-Demand to keep training going automatically. If not, that would be the second difference.


Do you have an example Dockerfile for an ML job? Say XGBoost with a CSV data file, or TensorFlow with image data?

I'm unable to figure out how to construct the job itself and how to submit the Dockerfile to be executed.

Also, do you support distributed training?


We'll be updating the documentation with examples in a few weeks, once we release it. No, it doesn't support distributed training right now.


This looks promising! Will definitely try it out for my next passion project.


Does it apply to jobs running on multiple instances, e.g using dask?


Good question, I've heard other engineers ask for this. The MVP version doesn't yet handle multiple instances, but it's on our roadmap.


When you do get there, consider the Bulk Insert API on GCP and "Spot Fleets" on AWS (both make it easier for the provider to satisfy the entire request in one go).


Hmm, makes sense, yes. Thanks!


This is likely a very useful tool, wish you luck!


How does it compare with grid.ai?


Thanks for this, I wasn't aware of it. From reading the docs, I don't see anywhere whether Grid automatically handles spot interruptions to resume from the last checkpoint, which was the main focus of our internal tool.

Have you used it, btw? And what has your experience been with Grid?


Ah, good point. They do have persistent storage, so I'm guessing you can add code to do the same. I haven't personally tried it since we developed our own solution in house. It basically does everything grid.ai does but with checkpoints, though our solution does require adding a few lines of SDK code to handle the checkpointing.


I should add that where our solution falls short is in multi-node training, which grid.ai seems to have out of the box.


Cool!




