Disclosure: I used to work for GCP and launched Preemptible VMs.
Congrats! Can I suggest charging more?
IIUC, your business plan is $10/month for the company, regardless of number of users?
You probably save a company that $10 in a day or less for one GPU: One A100 is ~$3/hr on demand and ~$.90/hr as Preemptible, saving over $2/hr.
Said another way, your pitch is to recover a lot of the 70% discount that they aren't going to capture themselves. If you were a managed training service, you could pitch yourself as "half the price of AWS or GCP" and keep a 20%+ margin with both parties being happy. (The problem is that pass-through billing makes that obvious, and you need to support lots of bucket security, IAM controls, etc.)
Fwiw, I would also branch out into inference! Preemptible and Spot T4s are commonly used for heavy image models, but many people pay full price. Inference that takes X ms can easily be handled "without errors" in the shutdown time. The risk is handling all the capacity swings.
Thanks for the feedback! Yeah, at this stage we really just wanted to find out if we could build something useful for the community. Agreed on the pricing suggestion.
Also, interesting point about inference. I'm not sure, though, how common it is for companies to need GPUs for inference. If you can run a CPU-based inference model, which I thought was the most common case, it's probably not a big use case?
I've got some comment somewhere on HN that says exactly that "try CPU inference first, it's pretty good".
The need to reach for a T4 comes when someone is doing a big model on images or video and wants sub-second response time. (Think some of the stuff on Snapchat, etc.)
I've worked with people who have needed GPU-powered inference on AWS. Training had to happen on p3s, but inference happened on g4s. The pricing is lower (and depending on the use case, often the overall cost is too), and spot savings are usually less dramatic, as these instances can be in very high demand.
We built a tool called spotML to make training on AWS/GCP cheaper.
Spot Instances are 70% cheaper than On-Demand instances but are prone to interruptions. We mitigate the downside of these interruptions through the use of persistence features, including optional fallback to On-Demand instances. So you can optimize workflows according to your budget and time constraints.
History: We were working on a neural rendering startup that needed a lot of GAN training, which was getting very expensive. We were blowing roughly $1,000 to train a single category class. Training on Spot instances was cheaper, but still a mess: it needed a lot of hand-holding/devops work to make it usable. So we built SpotML to automate much of that.
Posting it here to see if the community finds this helpful, so that we can open it up to the larger community.
Apart from the fact that it could deploy to both GCP and AWS, what does it do differently than AWS Batch [0]?
When we had a similar problem, we ran jobs on spots with AWS Batch and it worked nicely enough.
Some suggestions (for a later date):
1a. Add built-in support for Ray [1] (you'd essentially be then competing with Anyscale, which is a VC funded startup, just to contrast it with another comment on this thread) and dbt [2].
1b. Or: Support deploying coin miners (might help widen the product's reach; and stand it up against the likes of consensys).
3. Get in front of the very many cost optimisation consultants out there, like the Duckbill Group.
If I may, where are you building this product from? And how many are on the team? Thanks.
Thank you!
Interesting, we actually tried AWS Batch ourselves. 1) How were you able to handle spot interruptions and resume from the latest checkpoint? 2) Not to mention falling back to On-Demand on spot interruptions. 3) And switching back to spot from On-Demand would also need an additional process to be set up.
Also, I'm not sure how straightforward it is to detach/attach persistent volumes to retain data across different spot interruptions. It can be done, but it's the same rote work each time you want to train something new.
Also, thanks for the suggestions!
We're a team of 2 right now. I used to be in the Bay Area but am in Mexico temporarily.
1. Spot interruptions didn't matter much, as AWS Batch looks for spot capacity with low interruption probability. Auto retries kicked in whenever those did get interrupted (a minimal sketch of the retry setup is below this list).
2. Checkpointing was a pain (we relied mostly on AWS Batch's JobState and S3, not ideal), but the current capability to mount EFS (Elastic Filesystem) looks like it would solve this?
3. No hot swapping on-demand with spot and vice versa. Interestingly, ALB (Application Load Balancer) supports such mixed EC2 configurations (AWS Batch doesn't).
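To make point 1 concrete, the retry side is just a parameter on the job submission. A rough boto3 sketch (the job, queue, and definition names are placeholders, not our actual setup):

    import boto3

    batch = boto3.client("batch")

    # Retries cover the common case where the Spot instance backing the
    # job gets reclaimed; Batch simply reschedules the job elsewhere.
    batch.submit_job(
        jobName="gan-training",          # hypothetical names
        jobQueue="spot-gpu-queue",
        jobDefinition="train-gan:1",
        retryStrategy={"attempts": 3},
    )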
This looks useful, I like the pricing of free/$9.99 a month.
No one asked, but it's HN so I'll say it anyway. I think it's a questionable "vc business" but a great business for 1-2 people. The road from this to an enterprise sales motion, or even a 10K/year contract is hard for me to imagine. At some point, it becomes cost effective for my org to build this functionality in house.
However, as a hobbyist / single dev / small team, $120/year is a no-brainer after the first $2-3K I spend on GPUs by mistake. As you know, setting up spots when I just want to get shit done is a pain, and I'll gladly pay you (a little) to make that go away.
Still no one asked but... One thing that plays to your advantage is that the current price point is something anyone in your user group can buy on their own, and there are a lot of us / enough to make a nice business out of.
Thanks, yeah.
At this stage we really just want to validate whether this is a real problem in the ML community.
Down the line, as we scale to handle multi-instance training and other use cases, we could probably charge more, say a % of the cost savings on training.
So useful (and potentially lucrative) that AWS/GCP would likely take over this functionality eventually -- either by building on top of spot instances like you have, or underneath regular instances or re-use that capacity for some other managed service (thereby reducing the price differential). How do you plan to protect yourselves against that?
Really excited to try it out. I've had a heck of a time setting up spot instance training on Sagemaker in the past so simplification efforts are much appreciated.
Curious, did you eventually start using SageMaker with spot instances, or did you give up on it?
Also, what would you say were the biggest pain points with SageMaker?
I'd like to point out that this seems extremely similar to Nimbo (https://nimbo.sh), to the extent that even some of the terminal messages are exactly the same, and even parts of the docs are copy pasted. E.g:
Nimbo docs: "In order to run this job on Nimbo, all you need is one tiny config file and a Conda environment file (to set the remote environment), and Nimbo does the following for you:"
SpotML docs: "In order to run this job on SpotML, all you need is one tiny config file and a Docker file (to set the remote environment), and SpotML does the following for you:"
Yes, we liked the elegance of both the tool and the docs, so it's very much inspired by them.
I must also give credit to another great tool, https://spotty.cloud/, from which this project was adapted.
Optics on direct copying without attribution aren't great for trust in open source software. Count me out and thanks for pointing me to the place where people are _actually_ working in the open/cooperatively.
Otherwise, this idea is interesting and probably generalizable to other applications. Maybe it's not crystal clear to me, but what are the advantages of your service over existing solutions such as Nimbo and Spotty? FWIW it might be worthwhile adding this to your website.
Thanks, makes sense.
It doesn't use any "code" from Nimbo. The documentation and the design simplicity of the tool were what appealed to us, and that's what we adopted.
The project itself was forked from Spotty, which has an MIT license.
The biggest thing missing in the open source options was monitoring of the training job and automatic recovery from spot interruptions, which SpotML does.
A couple of years ago I would have had a bigger interest in this. PyTorch added elastic training in version 1.9 when it incorporated TorchElastic. TensorFlow added an easier way to do fault-tolerant training in 2.4 with the BackupAndRestore callback for Keras models. For non-Keras models it'll take some extra code, but they include some examples for that too.
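For the Keras case it's essentially one callback. A minimal sketch with dummy data (the backup path is arbitrary; in TF 2.4 the callback lived under tf.keras.callbacks.experimental):

    import numpy as np
    import tensorflow as tf

    # Toy data just to make the example runnable.
    x = np.random.rand(256, 32).astype("float32")
    y = np.random.rand(256, 1).astype("float32")

    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

    # Training state is written to backup_dir; if the spot VM is reclaimed
    # and the job restarts, fit() resumes from the last completed epoch.
    backup = tf.keras.callbacks.BackupAndRestore(backup_dir="/mnt/backup")
    model.fit(x, y, epochs=100, callbacks=[backup])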
Read through the homepage, but not entirely sure --
Why not just train on Spot Instances with a retry implemented?
I see that SpotML has a configurable fallback to On-Demand instances, and perhaps their value prop is that it saves the state of your run up to the interruption and resumes it on the On-Demand instance, but why not just set a retry on the Spot Instance if it's interrupted?
Interesting, thanks, we weren't aware of Metaflow.
I've read through the docs; the one difference that comes to mind is the automatic fallback to On-Demand and resuming back to spot when available. I can't readily see a way to do this yet in Metaflow, but it's possible I've missed something.
There are a few different ways to deal with spot interruptions. First, it is a good idea to specify multiple instance types in your compute environment, so even if some instance types become unavailable in spot, Batch can use another type automatically.
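As a rough sketch of what that looks like with boto3 (the instance types, subnet, and roles below are placeholders, and some fields such as the allocation strategy are omitted):

    import boto3

    batch = boto3.client("batch")

    # Managed SPOT compute environment listing several GPU instance
    # types; Batch shifts across them as Spot capacity comes and goes.
    batch.create_compute_environment(
        computeEnvironmentName="spot-gpu-ce",
        type="MANAGED",
        computeResources={
            "type": "SPOT",
            "minvCpus": 0,
            "maxvCpus": 64,
            "instanceTypes": ["p3.2xlarge", "g4dn.xlarge", "g4dn.2xlarge"],
            "subnets": ["subnet-xxxxxxxx"],
            "securityGroupIds": ["sg-xxxxxxxx"],
            "instanceRole": "ecsInstanceRole",
        },
        serviceRole="AWSBatchServiceRole",
    )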
I used to have these same pains. My trick with spot instances has been to set my maximum price to the price of a regular instance of that class or higher and to sync weights to S3 in the background on every save. The former is a parameter when starting the instance in the console or terraform, the latter is basically a do...while loop. I've noticed that often one gets booted from a spot instance causing interruption because the market price increases a few cents above the "70% savings price." Increasing the maximum to on demand is basically free money because you don't get booted often, and your max price is the regular on demand price.
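If anyone wants to replicate it, a rough sketch of both pieces (the launch call runs from your machine, the sync loop runs on the instance; the AMI, price, bucket, and paths are illustrative):

    import time
    import boto3

    ec2 = boto3.client("ec2")
    s3 = boto3.client("s3")

    # 1) Launch a Spot instance with the max price pinned near the
    #    on-demand rate (the number here is illustrative, not a quote).
    ec2.run_instances(
        ImageId="ami-xxxxxxxx",
        InstanceType="p3.2xlarge",
        MinCount=1,
        MaxCount=1,
        InstanceMarketOptions={
            "MarketType": "spot",
            "SpotOptions": {"MaxPrice": "3.06"},
        },
    )

    # 2) The "do...while" part, run on the instance: keep pushing the
    #    latest weights to S3 so an interruption never loses much work.
    while True:
        s3.upload_file("checkpoints/latest.pt",
                       "my-training-bucket", "runs/latest.pt")
        time.sleep(300)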
This has seemed to mitigate just about all of the spot downsides (like interruptions or losing data), because you don't easily get kicked out of the spot instance unless there's a run on machines, and the prices rarely fluctuate that much (at least for the higher-end p3 instances). It has also seemed to prevent data loss and protect the downside risk by capping the instance at a knowable max price. There are times when the spot price goes higher than the on-demand price, so you still get booted once in a while, but it's very infrequent, as you can imagine most people don't want to spend more for spot than for on-demand.
Anecdotally, I still average out to getting the vast majority of the spot savings with this method, with very few interruptions. Looking at SpotML, it seems to be a lot of tooling that achieves these same goals (assuming one would be interrupted when a spot instance dies and moves to full-freight on-demand with SpotML), which makes SpotML's solution feel very over-engineered to me, given that the majority of what SpotML provides can be had with a simple maximum-price parameter change when spinning up a spot instance.
I would be very interested in using anything that doesn't have great overhead and saves money. Our bill seems "big" to me (but I realize it may be small to many others), so even these small savings add up. Would you compare the potential benefits of SpotML to the method I described above?
Thanks for the feedback.
The biggest upside is when you have a long-running training job (say hours or days) and the spot training is interrupted.
You probably don't want to manually monitor for the next available spot instance and kick-start the training again. SpotML takes care of that part.
Also, optionally, you can configure it to resume on an On-Demand instance after an interruption, until the next spot instance is available.
In essence, we try to make these parts easy:
i) creating buckets/EBS volumes
ii) the code to save checkpoints to S3 in a loop
iii) monitoring for interruptions and resuming from the last checkpoint (a bare-bones sketch of this part is below)
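For context, the AWS side of the monitoring in iii) boils down to polling the instance metadata service for the interruption notice. A bare-bones sketch (IMDSv1-style; what you flush when the notice arrives is up to you):

    import time
    import urllib.error
    import urllib.request

    NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

    while True:
        try:
            # A 200 response means a stop/terminate notice has been issued.
            with urllib.request.urlopen(NOTICE_URL, timeout=2) as resp:
                print("Interruption notice:", resp.read().decode())
                # ...flush the latest checkpoint to S3/EBS here...
                break
        except urllib.error.HTTPError:
            pass   # 404: no interruption scheduled yet
        except urllib.error.URLError:
            pass   # metadata service unreachable (e.g. not on EC2)
        time.sleep(5)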
FWIW your "trick" has been the default for years. Unless you specify otherwise, the default max price is the on-demand price. They changed this back in 2017. Maximum price isn't really used by anyone any more. https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pri...
We spend about $10k/month on spot instances and I don't specify any max price. The way to avoid terminations is just to make sure you spread your workload over a large number of instance types and availability zones.
I always envisioned that if I ever did this, I would set the spot price to the equivalent of infinity/max and then monitor it myself, terminating the instance if I determined that continuing to run at the spot price was more expensive.
Why? Sure, you don't want to pay more than the on-demand price, but IIRC spot prices often spike only momentarily. So the question becomes whether the summed cost over the spike at the higher spot price exceeds the boot/shutdown/migration overhead of falling back to on-demand.
This is interesting. From what I've read, AWS no longer recommends setting a spot max price, so that they can manage it themselves.
I wonder if there's an actual advantage in avoiding spot interruptions by setting the spot price even higher than On-Demand.
Yeah, true. It's an interesting question that involves not only instance/OS startup and shutdown overheads (I suppose they've probably thought about this) but also the overhead of checkpointing/restarting your application.
I also suppose this way of thinking about it comes from trying to minimize cost from a purely mathematical standpoint. When you think about how and why the spot market is operated, and what those short-term spikes may actually be, it may run counter to the intended purpose (cheap capacity that they may recall at any time, because capacity is actually fixed).
Funny how market-based approaches can gamify things sufficiently that they sometimes obscure the underlying intention or purpose of having a market in the first place.
I'm interested in how they 'hibernate'/save the state of the instances within the shutdown time limit. I was also looking into this for myself; there are ways of using Docker to save the in-process memory, a la hibernate, which would work well with this. But, especially on GCP where you only get ~60 seconds between the shutdown signal and the hard stop, I was worried that it wouldn't save fast enough. I often work on pretty high-RAM instances and thought even saving from RAM to disk would take too long for 150-300 GB of RAM.
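(Detecting the preemption itself is the easy part; on GCP it's just a metadata poll like the sketch below. It's the flush within that window that worries me.)

    import time
    import urllib.request

    PREEMPT_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                   "instance/preempted")

    def preempted() -> bool:
        # The metadata server returns the string TRUE once preemption starts.
        req = urllib.request.Request(PREEMPT_URL,
                                     headers={"Metadata-Flavor": "Google"})
        with urllib.request.urlopen(req, timeout=2) as resp:
            return resp.read().decode().strip() == "TRUE"

    while not preempted():
        time.sleep(5)
    # ...whatever state-saving fits in the remaining shutdown window goes here...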
I hadn't heard of Nimbo; maybe I can read how they're doing it, since it's open source. Does anyone have any idea how they're saving state so fast (NVMe SSD?)
It uses a mounted EBS (Elastic Block Store) volume, so all the checkpoints, data, etc. are already in persistent storage. This is simply re-attached to the next spot/On-Demand instance after an interruption.
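In boto3 terms, the re-attach step is roughly the following (IDs are placeholders; this is the idea, not our exact implementation):

    import boto3

    ec2 = boto3.client("ec2")

    # Detach the checkpoint/data volume from the interrupted instance,
    # wait until it is free, then attach it to the replacement instance.
    ec2.detach_volume(VolumeId="vol-xxxxxxxx")
    ec2.get_waiter("volume_available").wait(VolumeIds=["vol-xxxxxxxx"])
    ec2.attach_volume(VolumeId="vol-xxxxxxxx",
                      InstanceId="i-yyyyyyyy",
                      Device="/dev/sdf")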
Edit: Also, no GPU support AFAIK, but https://github.com/twosigma/fastfreeze looks really nice, turnkey. I wonder, if I write to a fast persistent disk, whether I can handle more RAM than going over the network.
(Or, hacking on the checkpoint idea: have a daemon periodically 'checkpoint' other programs, so even if saving takes longer than 60 seconds, you can revert to the last checkpoint. Even an rsync-like approach where you only send the changes.)
Oh, I didn't see much in Nimbo at a quick glance, but reading more closely:
> We immediately resume training after interruptions, using the last model checkpoint via persistent EBS volume.
Makes sense, just save checkpoints to disk. What I'm doing is more CPU-bound and not straight ML, so it's less easily checkpointed, sadly. Cool though, it's worth jumping through hoops for a 70% reduction.
How does this work? From the documentation it's clear that AWS credentials are required. Which permissions are necessary? This leads me to assume that the SpotML CLI uses boto3 to create the necessary resources (EBS volumes, spot instances, S3 buckets) in my AWS account. If that's the case, how does billing work if this is "just" a CLI?
That's right, it will need AWS credentials with access to create EBS volumes and S3 buckets, spawn instances, etc.
In addition to the "CLI", a cloud service constantly monitors the progress of the jobs (by registering the PID when launching them) and the instance states. So billing will be based on the hours of training run and the $ saved.
Looks very useful. One suggestion...the before/after swipe image is great for showing how it works, but not why you would use it. Might be helpful to overlay the ascending cost line, which would be steep in the "before", but gradual in the "after".
Not related to SpotML, but I literally just want to rent a server with a decent GPU or TPU and enough memory to handle image training. Hassle-free: just SSH into it, transfer data to it, and run the training script, and it won't break my wallet (as in, hundreds of dollars a month).
Seems like the services I've tried have focused heavily on supporting notebooks (Colab, Paperspace).
Hey! There are several sites which provide this service. A while ago I put together a basic comparison of their features and pricing for different GPUs: https://mlcontests.com/cloud-gpu (best viewed on desktop).
Some offer both notebook and SSH access, some just one of the two. The cheapest are often the P2P ones, where you essentially rent someone else's consumer GPU.
Sure, in our own startup we used to spend roughly $1,000 training a StyleGAN model for a class, plus additional latent space manipulation models.
In the early days we recklessly wasted a lot of our AWS credits during experimentation.
But later on, with spot instances, we were able to bring it down to $250-$300 per category class, which was a more bearable cost.
This looks cool. How much SpotML-specific code is needed to target SpotML? I'm assuming it doesn't just magically detect the training loop in an existing pytorch codebase and needs hooks implemented for resume, inference, error handling, etc.
We tried to keep it minimal. All you need to do is specify the checkpoint file format in the spotml.yml file so training can resume from the last checkpoint.
So let's say your checkpoint files are saved as ckpt00.pt, ckpt01.pt, ckpt03.pt, and so on.
You can configure the checkpoint filename regex, e.g. ^ckpt[0-9]{2}\.pt$, and SpotML resumes by picking the latest one.
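Conceptually the resume step is just "find the newest file matching the pattern", something like this (the directory is illustrative):

    import os
    import re

    CKPT_RE = re.compile(r"^ckpt[0-9]{2}\.pt$")
    ckpt_dir = "/workspace/checkpoints"

    # Pick the most recently written checkpoint matching the pattern.
    candidates = [f for f in os.listdir(ckpt_dir) if CKPT_RE.match(f)]
    if candidates:
        latest = max(candidates,
                     key=lambda f: os.path.getmtime(os.path.join(ckpt_dir, f)))
        print("Resuming from", os.path.join(ckpt_dir, latest))
    else:
        print("No checkpoint found; starting from scratch")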
For detecting whether the training process is still running or has errored out, it registers the training command's PID when launching the task and then monitors that PID for completion. It also registers and monitors the instance state itself to check for interruptions and resume.
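The liveness check itself is nothing exotic; on Linux it reduces to something like this (simplified):

    import errno
    import os

    def pid_alive(pid: int) -> bool:
        """Return True if a process with this PID still exists."""
        try:
            os.kill(pid, 0)   # signal 0 checks existence without killing
        except OSError as e:
            # EPERM means the process exists but belongs to another user.
            return e.errno == errno.EPERM
        return True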
I'm not sure what this provides over what GCP already offers. 4 years ago I switched my co's ML training to use GKE (Google Kubernetes Engine) using a cluster of preemptible nodes.
All you need to do is schedule your jobs (just call the Kubernetes API and schedule your container to run). In the rare case the node gets preempted, your job will be restarted and restored by Kubernetes. Let your node pool scale to near zero when not in use and get billed by the _second_ of compute used.
Reminder: this is Show HN, so please try to be constructive.
However, there's at least a couple of things that matter here that aren't covered by "just use a preemptible node pool":
* SpotML configures checkpoints (yes this is easy, but next point)
* SpotML sends those checkpoints to a persistent volume (by default in GKE, you would not use a cluster-wide persistent volume claim, and instead only have a local ephemeral one, losing your checkpoint)
* SpotML seems to have logic around "retry on preemptible, and then switch to on-demand if needed" (you could do this on GKE by just having two pools, but it won't be as "directed")
Haven't used cortex.dev, but looking at the docs, I'd say primarily simplicity and ease of getting started quickly (<3 mins to get it up and running).
Also, with Cortex it's not clear to me yet whether, if a spot instance fails, Cortex can wait for the next spot instance or replace it with On-Demand to keep training going automatically. If not, that would be the second difference.
When you do get there, consider the Bulk Insert API on GCP and "Spot Fleets" on AWS (both make it easier for the provider to satisfy the entire request in one go).
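On the AWS side, a stripped-down sketch using the newer EC2 Fleet API (the successor to Spot Fleet requests; the launch template name and instance types are made up):

    import boto3

    ec2 = boto3.client("ec2")

    # One request spanning several instance types; EC2 fills it from
    # whichever Spot pools actually have capacity.
    ec2.create_fleet(
        Type="request",
        TargetCapacitySpecification={
            "TotalTargetCapacity": 4,
            "DefaultTargetCapacityType": "spot",
        },
        LaunchTemplateConfigs=[{
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "gpu-training",
                "Version": "$Latest",
            },
            "Overrides": [
                {"InstanceType": "p3.2xlarge"},
                {"InstanceType": "g4dn.xlarge"},
            ],
        }],
    )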
Thanks for this, I wasn't aware of it.
From reading the docs, I don't see anywhere if Grid automatically handles spot interruptions to resume from last checkpoint which was the main focus of our internal tool.
Have you used it, btw? And what has your experience been with Grid?
Ah, good point. They do have persistent storage, so I'm guessing you can add code to do the same. I haven't personally tried it since we developed our own solution in-house. It basically does everything grid.ai does but with checkpoints, though our solution does require adding a few lines of SDK code to handle the checkpointing.