The short of it is that building a database on top of object storage has generally required a complicated, distributed system for consensus/metadata. CAS makes it possible to build these big data systems without any other dependencies. This is a win for simplicity and reliability.
Thanks! Do they mention when the comparison is done? Is it before, after, or during an upload? (For instance, with a 4 TB file in a multipart upload, would I only find out it failed once the whole file has been uploaded?)
(I assume) it will fail if the ETag doesn't match -- the instant it receives the header.
The main point of it is: I have an object that I want to mutate. I think I have the latest version in memory. So I update it in memory and upload it to S3 with the ETag of the version I have, telling S3 to commit only if that is still the latest version. If it "fails", I re-download the object, re-apply the mutation, and try again.
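A minimal sketch of that retry loop in Python with boto3, assuming a client recent enough to expose the `IfMatch` parameter on `put_object` (it maps to the `If-Match` header S3 checks for conditional writes); the bucket, key, and `mutate` callback are placeholders:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def cas_update(bucket: str, key: str, mutate) -> None:
    """Read-modify-write against S3, retrying until the conditional put lands."""
    while True:
        obj = s3.get_object(Bucket=bucket, Key=key)
        etag = obj["ETag"]                      # the version we believe is latest
        new_body = mutate(obj["Body"].read())   # apply the mutation in memory
        try:
            # Commit only if the object still carries the ETag we read.
            s3.put_object(Bucket=bucket, Key=key, Body=new_body, IfMatch=etag)
            return
        except ClientError as e:
            # 412 Precondition Failed: someone else won the race; re-read and retry.
            if e.response["ResponseMetadata"]["HTTPStatusCode"] != 412:
                raise
```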
Practically, they could do both: do an early reject of a given POST if the ETag doesn't match, but re-validate just before swapping the object in (i.e., before committing to that request as the globally successful one).
That said, I'm not sure whether common HTTP libraries look at response headers before they're done sending the request body, or if that's even allowed/possible in HTTP? It seems feasible at first glance with chunked encoding, at least.
Edit: Upon looking a bit, it seems that informational response codes, e.g. 100 (Continue) in combination with an Expect: 100-continue header on the request, could enable just that and avoid an extra GET with If-Match.
I can imagine it might be useful to make this a choice for databases with high-frequency small swaps and occasional large ones:
1) default, load-compare-&-swap for small, fast load/swaps.
2) optional, compare-load-&-swap, letting a large load pass its compare up front instead of chasing the un-hittable moving target that all the fast small swaps would otherwise create during its long upload.
3) if the load itself is stable relative to the compare, it could be pre-loaded into a holding location and then moved into place with as many fast compare-&-swaps as needed (see the sketch below).
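One way to approximate option 3 on S3 (a sketch under assumed names, not anything from the announcement): upload the large blob to an immutable staging key, then do the fast compare-&-swap on a small pointer/manifest object that tells readers which blob is live. This again assumes a boto3 that exposes `IfMatch` on `put_object`; the bucket, keys, and manifest layout are hypothetical.

```python
import json
import uuid
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET, POINTER_KEY = "my-bucket", "db/current.json"   # hypothetical names

def publish_large_blob(body: bytes) -> None:
    # Slow part: upload the big blob under a unique staging key (no contention there).
    staging_key = f"db/blobs/{uuid.uuid4()}"
    s3.put_object(Bucket=BUCKET, Key=staging_key, Body=body)

    # Fast part: compare-&-swap the tiny pointer object so readers cut over.
    while True:
        cur = s3.get_object(Bucket=BUCKET, Key=POINTER_KEY)
        etag, manifest = cur["ETag"], json.loads(cur["Body"].read())
        manifest["data_key"] = staging_key
        try:
            s3.put_object(Bucket=BUCKET, Key=POINTER_KEY,
                          Body=json.dumps(manifest).encode(), IfMatch=etag)
            return
        except ClientError as e:
            if e.response["ResponseMetadata"]["HTTPStatusCode"] != 412:
                raise   # real error; 412 just means we lost the race, so retry
```

The expensive upload never has to win a race; only the small pointer write does.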
To avoid any dependencies other than object storage, we've been making use of this in our database (turbopuffer.com) for consensus and concurrency control since day one. We've been waiting for this since the day we launched on Google Cloud Storage ~1 year ago, and our bet that S3 would get it in a reasonable time frame worked out!
Interesting that what's basically an ad is the top comment - it's not like this is open source or anything - you can't even use it immediately (you have to apply for access). Totally proprietary. At least Elasticsearch is AGPL, saying nothing of OpenSearch, which also supports use of S3.
Someone made an informed technical bet that worked out. Sounds like HN material to me. (Also, is it really a useful ad if you can't easily use the product?)
How? This is a technical forum. Unless you're saying any consumer of S3 can now spam links to their product on this thread with impunity. (Hey, maybe they're using CAS.)
Pretty much all other S3 implementations (including open source ones) support this or equivalent primitives, so this is great for interoperability with existing implementations.
Yes! I’m actively working on it, in fact. We’re waiting on the next release of the Rust `object_store` crate, which will bring support for S3’s native conditional puts.
tpuf’s ANN index uses a variant of SPFresh, yup. These are the only two production implementations I am aware of. I don’t think it is in production at MSFT yet
Yeah, thinking about this more, I now understand ClickHouse to be more of an operational warehouse similar to Materialize, Pinot, Druid, etc., if I understand correctly? So bunching it with BigQuery/Snowflake/Trino/Databricks... wasn't the right category (although operational warehouses certainly can have a ton of overlap).
I left that category out for simplicity (plenty of others didn't make it into the taxonomy either, e.g. queues, NoSQL, time-series, graph, embedded, ...).
Most production storage systems/databases built on top of S3 spend a significant amount of effort building an SSD/memory caching tier (e.g. on top of RocksDB) to make them performant enough for production. But it's not easy to keep that tier in sync with the blob store...
Even with the cache, cold-query latency is bounded by ~50 ms S3 roundtrips [0]. To build a performant system, you have to tightly control roundtrips. S3 Express changes that equation dramatically: it approaches HDD random-read latency (single-digit ms), so we can build production systems that don't need an SSD cache, just a zero-copy, deserialized in-memory cache.
Many systems will probably continue to have an SSD cache (~100 µs random reads), but now MVPs can be built without it, and cold-query latency goes down dramatically. That's a big deal.
We're currently building a vector database on top of object storage, so this is extremely timely for us... I hope GCS ships this ASAP. [1]
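As an illustration of keeping an in-process cache in sync with the blob store without an SSD tier, here's a rough read-through cache that revalidates with a conditional GET (`IfNoneMatch` against the cached ETag). The names are made up, and the 304 handling assumes boto3 surfaces "not modified" as a `ClientError` with HTTP status 304:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
_cache: dict[str, tuple[str, bytes]] = {}   # key -> (etag, bytes)

def cached_get(bucket: str, key: str) -> bytes:
    """Serve from memory when the object is unchanged; otherwise refresh the cache."""
    cached = _cache.get(key)
    try:
        kwargs = {"Bucket": bucket, "Key": key}
        if cached:
            kwargs["IfNoneMatch"] = cached[0]   # conditional GET against the cached ETag
        obj = s3.get_object(**kwargs)
    except ClientError as e:
        if cached and e.response["ResponseMetadata"]["HTTPStatusCode"] == 304:
            return cached[1]                    # not modified: serve the in-memory copy
        raise
    body = obj["Body"].read()
    _cache[key] = (obj["ETag"], body)
    return body
```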
We built HopsFS-S3 [0] for exactly this problem, and have been running it as part of Hopsworks for a number of years now. It's a network-aware, write-through cache for S3 with an HDFS API. Metadata operations are performed on HopsFS, so you don't have the other problems, like listing operations returning a maximum of 1000 files/dirs.
NVMe is what is changing the equation, not SSD. NVMe disks now do up to 8 GB/s, although the crap in the cloud providers barely reaches 2 GB/s - and only on expensive instances. So instead of 40X better throughput than S3, we can get more like 10X. Right now, these workloads are much better on-premises on the cheapest M.2 NVMe disks ($200 for 4 TB with 4 GB/s read/write) backed by an S3 object store like Scality.
The numbers you're giving are throughput (bytes/sec), not latency.
The comment you're replying to is talking mostly about latency - reporting object GET latencies (time to open the object and return the start of its contents) in the single-digit ms with S3 Express, where S3 was ~50 ms before.
BTW EBS can do 4GB/sec per volume. But you will pay for it.
Very excited about being able to build scalable vector databases on DiskANN, like turbopuffer or LanceDB. These changes in latency are game-changing. The best server is no server. The possibility of a low-latency vector database application that runs on Lambda and S3 and is dirt cheap is pretty amazing.
The clear use case is serverless—without the complications of DynamoDB (expensive, 0.03/GB read), DynamoDB+DAX (VPC complications), or Redis (again, VPC requirements).
This instantly makes a number of applications able to run directly on S3, sans any caching system.
Similar to vector databases, this could be really useful for hosting cloud-optimized GeoTIFFs for mapping purposes. At a previous job we were able to do on-the-fly tiling in about 100 ms, but with this new storage class you could probably build something that tiles just as fast as (or even faster than) ArcGIS, with all of its proprietary optimizations and goop.
To take it a step further: GDAL has supported S3 raster data sources out of the box for a while now, so any GDAL-powered system may be able to operate on S3 files as if they were local.
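For instance, a COG on S3 can be opened through GDAL's /vsis3/ virtual filesystem and read in windows, so only the byte ranges covering a tile are fetched. The bucket/key below are placeholders; credentials come from the usual AWS environment variables or GDAL config options:

```python
from osgeo import gdal

gdal.UseExceptions()
# Open a Cloud Optimized GeoTIFF directly from S3 via the /vsis3/ handler.
ds = gdal.Open("/vsis3/my-bucket/imagery/scene.tif")
band = ds.GetRasterBand(1)

# Windowed read: GDAL issues HTTP range requests for just the blocks in this window,
# so a 256x256 tile doesn't require downloading the whole raster.
tile = band.ReadAsArray(xoff=0, yoff=0, win_xsize=256, win_ysize=256)
print(ds.RasterXSize, ds.RasterYSize, tile.shape)
```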
While I never owned a VanMoof, I read they painted pictures of flat screen TVs on the shipping boxes after months of issues with bikes being damaged by the time they got to customers. Clever!