The short of it is that building a database on top of object storage has generally required a complicated, distributed system for consensus/metadata. CAS makes it possible to build these big data systems without any other dependencies. This is a win for simplicity and reliability.
Thanks! Do they mention when the comparison is done? Is it before, after, or during an upload? (For instance, with a 4 TB file in a multipart upload, would I only find out it failed once the whole file has been uploaded?)
(I assume) it will fail if the ETag doesn't match -- the instant it receives the header.
The main point of it is: I have an object that I want to mutate. I think I have the latest version in memory. So I update it in memory and upload it to S3 with the ETag of the version I have, telling S3 to commit only if that is still the latest version. If it "fails", I re-download the object, re-apply the mutation, and try again.
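A minimal sketch of that retry loop in Python with boto3, assuming a client recent enough to expose the `IfMatch` parameter on `put_object` (it maps to the `If-Match` header S3 checks for conditional writes); the bucket, key, and `mutate` callback are placeholders:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def cas_update(bucket: str, key: str, mutate) -> None:
    """Read-modify-write against S3, retrying until the conditional put lands."""
    while True:
        obj = s3.get_object(Bucket=bucket, Key=key)
        etag = obj["ETag"]                      # the version we believe is latest
        new_body = mutate(obj["Body"].read())   # apply the mutation in memory
        try:
            # Commit only if the object still carries the ETag we read.
            s3.put_object(Bucket=bucket, Key=key, Body=new_body, IfMatch=etag)
            return
        except ClientError as e:
            # 412 Precondition Failed: someone else won the race; re-read and retry.
            if e.response["ResponseMetadata"]["HTTPStatusCode"] != 412:
                raise
```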
Practically, they could do both: do an early reject of a given POST if the ETag doesn't match, but re-validate just before swapping the object in (i.e., before committing to that request as the globally successful one).
That said, I'm not sure whether common HTTP libraries look at response headers before they're done sending the request body, or if that's even allowed/possible in HTTP? It seems feasible at first glance with chunked encoding, at least.
Edit: Upon looking a bit, it seems that informational response codes, e.g. 100 (Continue) in combination with an Expect: 100-continue header on the request, could enable just that and avoid an extra GET with If-Match.
I can imagine it might be useful to make this a choice for databases with high-frequency small swaps and occasional large ones:
1) default, load-compare-&-swap for small, fast load/swaps.
2) optional, compare-load-&-swap, letting a large load pass its compare up front instead of chasing the un-hittable moving target that all the fast small swaps would otherwise create during its long upload.
3) if the load itself is stable relative to the compare, it could be pre-loaded into a holding location and then moved into place with as many fast compare-&-swaps as needed (see the sketch below).
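One way to approximate option 3 on S3 (a sketch under assumed names, not anything from the announcement): upload the large blob to an immutable staging key, then do the fast compare-&-swap on a small pointer/manifest object that tells readers which blob is live. This again assumes a boto3 that exposes `IfMatch` on `put_object`; the bucket, keys, and manifest layout are hypothetical.

```python
import json
import uuid
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET, POINTER_KEY = "my-bucket", "db/current.json"   # hypothetical names

def publish_large_blob(body: bytes) -> None:
    # Slow part: upload the big blob under a unique staging key (no contention there).
    staging_key = f"db/blobs/{uuid.uuid4()}"
    s3.put_object(Bucket=BUCKET, Key=staging_key, Body=body)

    # Fast part: compare-&-swap the tiny pointer object so readers cut over.
    while True:
        cur = s3.get_object(Bucket=BUCKET, Key=POINTER_KEY)
        etag, manifest = cur["ETag"], json.loads(cur["Body"].read())
        manifest["data_key"] = staging_key
        try:
            s3.put_object(Bucket=BUCKET, Key=POINTER_KEY,
                          Body=json.dumps(manifest).encode(), IfMatch=etag)
            return
        except ClientError as e:
            if e.response["ResponseMetadata"]["HTTPStatusCode"] != 412:
                raise   # real error; 412 just means we lost the race, so retry
```

The expensive upload never has to win a race; only the small pointer write does.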
To avoid any dependencies other than object storage, we've been making use of this in our database (turbopuffer.com) for consensus and concurrency control since day one. We've been waiting for this since the day we launched on Google Cloud Storage ~1 year ago, and our bet that S3 would get it in a reasonable time frame worked out!
Interesting that what's basically an ad is the top comment - it's not like this is open source or anything - you can't even use it immediately (you have to apply for access). Totally proprietary. At least Elasticsearch is AGPL, saying nothing of OpenSearch, which also supports use of S3.
Someone made an informed technical bet that worked out. Sounds like HN material to me. (Also, is it really a useful ad if you can't easily use the product?)
How? This is a technical forum. Unless you're saying any consumer of S3 can now spam links to their product on this thread with impunity. (Hey, maybe they're using CAS.)
Pretty much all other S3 implementations (including open source ones) support this or equivalent primitives, so this is great for interoperability with existing implementations.
Yes! I’m actively working on it, in fact. We’re waiting on the next release of the Rust `object_store` crate, which will bring support for S3’s native conditional puts.
tpuf’s ANN index uses a variant of SPFresh, yup. These are the only two production implementations I am aware of. I don’t think it is in production at MSFT yet
Yeah, thinking about this more, I now understand ClickHouse to be more of an operational warehouse similar to Materialize, Pinot, Druid, etc., if I understand correctly? So bunching it with BigQuery/Snowflake/Trino/Databricks... wasn't the right category (although operational warehouses certainly can have a ton of overlap).
I left that category out for simplicity (plenty of others didn't make it into the taxonomy either, e.g. queues, NoSQL, time-series, graph, embedded, ...).
Most production storage systems/databases built on top of S3 spend a significant amount of effort building an SSD/memory caching tier (e.g. on top of RocksDB) to make them performant enough for production. But it's not easy to keep that tier in sync with the blob store...
Even with the cache, cold-query latency is bounded by ~50 ms S3 roundtrips [0]. To build a performant system, you have to tightly control roundtrips. S3 Express changes that equation dramatically: it approaches HDD random-read latency (single-digit ms), so we can build production systems that don't need an SSD cache, just a zero-copy, deserialized in-memory cache.
Many systems will probably continue to have an SSD cache (~100 µs random reads), but now MVPs can be built without it, and cold-query latency goes down dramatically. That's a big deal.
We're currently building a vector database on top of object storage, so this is extremely timely for us... I hope GCS ships this ASAP. [1]
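As an illustration of keeping an in-process cache in sync with the blob store without an SSD tier, here's a rough read-through cache that revalidates with a conditional GET (`IfNoneMatch` against the cached ETag). The names are made up, and the 304 handling assumes boto3 surfaces "not modified" as a `ClientError` with HTTP status 304:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
_cache: dict[str, tuple[str, bytes]] = {}   # key -> (etag, bytes)

def cached_get(bucket: str, key: str) -> bytes:
    """Serve from memory when the object is unchanged; otherwise refresh the cache."""
    cached = _cache.get(key)
    try:
        kwargs = {"Bucket": bucket, "Key": key}
        if cached:
            kwargs["IfNoneMatch"] = cached[0]   # conditional GET against the cached ETag
        obj = s3.get_object(**kwargs)
    except ClientError as e:
        if cached and e.response["ResponseMetadata"]["HTTPStatusCode"] == 304:
            return cached[1]                    # not modified: serve the in-memory copy
        raise
    body = obj["Body"].read()
    _cache[key] = (obj["ETag"], body)
    return body
```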
We built HopsFS-S3 [0] for exactly this problem, and have been running it as part of Hopsworks for a number of years now. It's a network-aware, write-through cache for S3 with an HDFS API. Metadata operations are performed on HopsFS, so you don't have the other problems, like listing operations returning a maximum of 1000 files/dirs.
NVMe is what is changing the equation, not SSD. NVMe disks now do up to 8 GB/s, although the crap in the cloud providers barely reaches 2 GB/s - and only on expensive instances. So instead of 40X better throughput than S3, we can get more like 10X. Right now, these workloads are much better on-premises on the cheapest M.2 NVMe disks ($200 for 4 TB with 4 GB/s read/write) backed by an S3 object store like Scality.
The numbers you're giving are throughput (bytes/sec), not latency.
The comment you're replying to is talking mostly about latency - reporting object GET latencies (time to open the object and return the start of its contents) in the single-digit ms with S3 Express, where S3 was ~50 ms before.
BTW EBS can do 4GB/sec per volume. But you will pay for it.
Very excited about being able to build scalable vector databases on DiskANN, like turbopuffer or LanceDB. These changes in latency are game-changing. The best server is no server. The possibility of a low-latency vector database application that runs on Lambda and S3 and is dirt cheap is pretty amazing.
The clear use case is serverless—without the complications of DynamoDB (expensive, 0.03/GB read), DynamoDB+DAX (VPC complications), or Redis (again, VPC requirements).
This instantly makes a number of applications able to run directly on S3, sans any caching system.
Similar to vector databases, this could be really useful for hosting cloud-optimized GeoTIFFs for mapping purposes. At a previous job we were able to do on-the-fly tiling in about 100 ms, but with this new storage class you could probably build something that tiles just as fast as (or even faster than) ArcGIS, with all of its proprietary optimizations and goop.
To take it a step further: GDAL has supported S3 raster data sources out of the box for a while now, so any GDAL-powered system may be able to operate on S3 files as if they were local.
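For instance, a COG on S3 can be opened through GDAL's /vsis3/ virtual filesystem and read in windows, so only the byte ranges covering a tile are fetched. The bucket/key below are placeholders; credentials come from the usual AWS environment variables or GDAL config options:

```python
from osgeo import gdal

gdal.UseExceptions()
# Open a Cloud Optimized GeoTIFF directly from S3 via the /vsis3/ handler.
ds = gdal.Open("/vsis3/my-bucket/imagery/scene.tif")
band = ds.GetRasterBand(1)

# Windowed read: GDAL issues HTTP range requests for just the blocks in this window,
# so a 256x256 tile doesn't require downloading the whole raster.
tile = band.ReadAsArray(xoff=0, yoff=0, win_xsize=256, win_ysize=256)
print(ds.RasterXSize, ds.RasterYSize, tile.shape)
```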
While I never owned a VanMoof, I read they painted pictures of flat screen TVs on the shipping boxes after months of issues with bikes being damaged by the time they got to customers. Clever!