Sirupsen's comments | Hacker News

Our query planner has that built in! We've spent a lot of time making high recall work at any filter selectivity.


The short of it is that building a database on top of object storage has generally required a complicated, distributed system for consensus/metadata. CAS makes it possible to build these big data systems without any other dependencies. This is a win for simplicity and reliability.


Thanks! Do they mention when the comparison is done? Is it before, after, or during an upload? (For instance, with a 4 TB multipart upload, would I only find out it failed once the whole file is uploaded?)


(I assume) it will fail if the ETag doesn't match, the instant it receives the header.

The main point of it is: I have an object that I want to mutate. I think I have the latest version in memory. So I update it in memory and upload it to S3 with the ETag of the version I have, telling S3 to only commit if that is still the latest version. If it "fails", I re-download the object, re-apply the mutation, and try again.
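A minimal sketch of that loop, using the Rust object_store crate that comes up downthread (apply_mutation is a made-up stand-in for whatever change you're applying):

    use object_store::{ObjectStore, PutMode, PutOptions, PutPayload, UpdateVersion};
    use object_store::path::Path;

    async fn mutate_with_cas(store: &dyn ObjectStore, key: &Path) -> object_store::Result<()> {
        loop {
            // Fetch the current object and remember the ETag/version we saw.
            let current = store.get(key).await?;
            let seen = UpdateVersion {
                e_tag: current.meta.e_tag.clone(),
                version: current.meta.version.clone(),
            };
            let bytes = current.bytes().await?;
            let mutated = apply_mutation(&bytes); // stand-in for your mutation

            // Conditional put: commits only if the object is still at `seen`.
            let opts = PutOptions { mode: PutMode::Update(seen), ..Default::default() };
            match store.put_opts(key, PutPayload::from(mutated), opts).await {
                Ok(_) => return Ok(()),
                // Someone else won the race: re-download, re-apply, retry.
                Err(object_store::Error::Precondition { .. }) => continue,
                Err(e) => return Err(e),
            }
        }
    }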


I imagine, for it to make sense, that the comparison is done at the last possible moment, before atomically swapping the file contents.


Practically, they could do both: Do an early reject of a given POST in case the ETag does not match, but re-validate this just before swapping out the objects (and committing to considering the given request as the successful one globally).

That said, I'm not sure if common HTTP libraries look at response headers before they're done sending the request body, or if that's even allowed/possible in HTTP? It seems feasible at first glance with chunked encoding, at least.

Edit: Upon looking a bit, it seems that informational response codes, e.g. 100 (Continue) in combination with Expect: 100-continue in the request, could enable just that and avoid an extra GET with If-Match.
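Sketching the exchange (illustrative headers only; that S3 actually short-circuits like this is my assumption, not something I've verified):

    PUT /bucket/key HTTP/1.1
    If-Match: "3858f62230ac3c915f300c664312c63f"
    Expect: 100-continue
    Content-Length: 4294967296

    ... server can check the ETag before any body is sent ...

    HTTP/1.1 412 Precondition Failed    <- early reject, body never uploaded
      -- or --
    HTTP/1.1 100 Continue               <- client proceeds with the upload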


I can imagine it might be useful to make this a choice for databases with high-frequency small swaps and occasional large ones.

1) Default: load-compare-&-swap, for small, fast swaps.

2) Optional: compare-load-&-swap, to let a large load pass its compare early and cut in front of all the fast small swaps that would otherwise create an un-hittable moving target during its long load.

3) If the load itself is stable relative to the compare, it could be pre-loaded into a holding location, followed by as many fast compare-&-swaps as needed to get it into the right place (sketched below).
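For (3), one way to keep the final swap tiny, assuming the big payload can live at a separate staging key: upload it with a plain put, then CAS a small pointer object at it. A rough sketch with the Rust object_store crate (staging_key/pointer_key are hypothetical names, and the pointer object is assumed to already exist):

    use object_store::{ObjectStore, PutMode, PutOptions, PutPayload, UpdateVersion};
    use object_store::path::Path;

    async fn publish_large(
        store: &dyn ObjectStore,
        staging_key: &Path, // holding location for the big payload
        pointer_key: &Path, // small object the CAS actually swaps
        payload: Vec<u8>,
    ) -> object_store::Result<()> {
        // 1. Long, slow upload with no precondition: nothing to race against.
        store.put(staging_key, PutPayload::from(payload)).await?;

        // 2. Fast compare-&-swap on a tiny pointer referencing the staged payload.
        loop {
            let meta = store.head(pointer_key).await?;
            let seen = UpdateVersion { e_tag: meta.e_tag, version: meta.version };
            let new_pointer = PutPayload::from(staging_key.to_string().into_bytes());
            let opts = PutOptions { mode: PutMode::Update(seen), ..Default::default() };
            match store.put_opts(pointer_key, new_pointer, opts).await {
                Ok(_) => return Ok(()),
                Err(object_store::Error::Precondition { .. }) => continue, // retry
                Err(e) => return Err(e),
            }
        }
    }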


To avoid any dependencies other than object storage, we've been making use of this in our database (turbopuffer.com) for consensus and concurrency control since day one. Been waiting for this since the day we launched on Google Cloud Storage ~1 year ago. Our bet that S3 would get it in a reasonable time-frame worked out!

https://turbopuffer.com/blog/turbopuffer


Interesting that what's basically an ad is the top comment. It's not like this is open source or anything; you can't even use it immediately (you have to apply for access). Totally proprietary. At least Elasticsearch is AGPL, to say nothing of OpenSearch, which also supports use of S3.


Someone made an informed technical bet that worked out. Sounds like HN material to me. (Also, is it really a useful ad if you can't easily use the product?)


Worked out how? There’s no implementation. It’s just conjecture.


It's right there:

> Our bet that S3 would get it in a reasonable time-frame worked out!


How? This is a technical forum. Unless you're saying any consumer of S3 can now spam links to their product on this thread with impunity. (Hey, maybe they're using CAS.)


Oh look, someone is mad on the internet about something silly.


Pretty much all other S3 implementations (including open source ones) support this or equivalent primitives, so this is great for interoperability with existing implementations.


No one owes anyone open source. If they can make the business case work or if it works in their favor, sure.


I don't mind hearing another developer's use case for this feature, even if it's commercial proprietary software.

It's no longer top comment, which is fine.


https://github.com/slatedb/slatedb will, I expect, use this at some point. Object backed DB, which is open source.


Yes! I’m actively working on it, in fact. We’re waiting on the next release of the Rust `object_store` crate, which will bring support for S3’s native conditional puts.

If you want to follow along: https://github.com/slatedb/slatedb/issues/164
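For the curious, the shape this takes in object_store (a sketch based on the documented PutMode API; try_claim is a made-up helper, and the exact S3 mapping may shift in the release):

    use object_store::{ObjectStore, PutMode, PutOptions, PutPayload};
    use object_store::path::Path;

    // Succeeds only if no object exists at `key` yet -- the equivalent of an
    // `If-None-Match: *` conditional put on backends that support it.
    async fn try_claim(store: &dyn ObjectStore, key: &Path, bytes: Vec<u8>)
        -> object_store::Result<bool> {
        let opts = PutOptions { mode: PutMode::Create, ..Default::default() };
        match store.put_opts(key, PutPayload::from(bytes), opts).await {
            Ok(_) => Ok(true),                                           // we won
            Err(object_store::Error::AlreadyExists { .. }) => Ok(false), // someone beat us
            Err(e) => Err(e),
        }
    }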


I mean isn't the news story itself essentially an ad?


I'm glad that bet worked out for you, but what made you think a year ago that S3 would introduce it soon, when it hadn't for the previous 15 years?


It works great. We’ve had SPANN in production since October of 2023 at https://turbopuffer.com/


tpuf’s ANN index uses a variant of SPFresh, yup. These are the only two production implementations I am aware of. I don’t think it is in production at MSFT yet


Ya, the world needed S3 to become fully consistent. That didn't happen until the end of 2020!


Yeah, thinking about this more, I now understand ClickHouse to be more of an operational warehouse, similar to Materialize, Pinot, Druid, etc., if I understand correctly? So bunching it with BigQuery/Snowflake/Trino/Databricks wasn't the right category (although operational warehouses certainly can have a ton of overlap).

I left that category out for simplicity (plenty of others didn't make it into the taxonomy either, e.g. queues, NoSQL, time-series, graph, embedded, ...)


Most production storage systems/databases built on top of S3 spend a significant amount of effort building an SSD/memory caching tier (e.g. on top of RocksDB) to make them performant enough for production. But it's not easy to keep it in sync with blob storage...

Even with the cache, the cold-query latency lower bound to S3 is set by ~50 ms roundtrips [0]. To build a performant system, you have to tightly control roundtrips. S3 Express changes that equation dramatically: it approaches HDD random-read speeds (single-digit ms), so we can build production systems that don't need an SSD cache—just the zero-copy, deserialized in-memory cache.

Many systems will probably continue to have an SSD cache (~100 us random reads), but now MVPs can be built without it, and cold query latency goes down dramatically. That's a big deal.
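To put rough numbers on it (illustrative, using the latencies above): a cold query that needs, say, 4 dependent roundtrips goes from

    4 x ~50 ms  = ~200 ms   (S3)
    4 x ~5 ms   = ~20 ms    (S3 Express)
    4 x ~100 us = ~0.4 ms   (local SSD cache)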

We're currently building a vector database on top of object storage, so this is extremely timely for us... I hope GCS ships this ASAP. [1]

[0]: https://github.com/sirupsen/napkin-math

[1]: https://turbopuffer.com/


We built HopsFS-S3 [0] for exactly this problem, and have been running it as part of Hopsworks for a number of years now. It's a network-aware, write-through cache for S3 with an HDFS API. Metadata operations are performed on HopsFS, so you don't have the other problems, like listing operations returning at most 1000 files/dirs.

NVMe is what is changing the equation, not SSD. NVMe disks now do up to 8 GB/s, although the crap at the cloud providers barely reaches 2 GB/s, and only on expensive instances. So instead of 40X better throughput than S3, we get more like 10X. Right now, these workloads are much better served on-premises on the cheapest M.2 NVMe disks ($200 for 4 TB with 4 GB/s read/write) backed by an S3 object store like Scality.

[0] https://www.hopsworks.ai/post/faster-than-aws-s3


The numbers you're giving are throughput (bytes/sec), not latency.

The comment you're replying to is mostly about latency, reporting that S3 Express object GET latencies (time to open the object and return its head) are in the single-digit ms, where S3 was ~50 ms before.

BTW EBS can do 4GB/sec per volume. But you will pay for it.


Very excited about being able to build scalable vector databases on DiskANN, like turbopuffer or LanceDB. These latency changes are game-changing. The best server is no server. The prospect of a low-latency vector database application that runs on Lambda and S3 and is dirt cheap is pretty amazing.


The clear use case is serverless—without the complications of DynamoDB (expensive, ~$0.03/GB read), DynamoDB+DAX (VPC complications), or Redis (again, VPC requirements).

This instantly makes a number of applications able to run directly on S3, sans any caching system.


Similar to vector databases, this could be really useful for hosting cloud-optimized GeoTIFFs for mapping purposes. At a previous job we were able to do on-the-fly tiling in about 100 ms, but with this new storage class you could probably make something that tiles as fast as or faster than ArcGIS, with all of its proprietary optimizations and goop.

Take it a step further: GDAL has supported S3 raster data sources out of the box for a while now. Any GDAL-powered system can operate on S3 files as if they were local.
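For example, reading a cloud-optimized GeoTIFF in place via GDAL's /vsis3/ virtual filesystem (bucket/key are placeholders; credentials assumed to be in the environment):

    # Inspect the raster without downloading it:
    gdalinfo /vsis3/my-bucket/tiles/scene.tif

    # Cut a 512x512 window straight out of the remote file:
    gdal_translate -srcwin 0 0 512 512 /vsis3/my-bucket/tiles/scene.tif clip.tif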


Emil if you email me at info@turbopuffer.com I can let you into the alpha :)


While I never owned a VanMoof, I read they painted pictures of flat screen TVs on the shipping boxes after months of issues with bikes being damaged by the time they got to customers. Clever!

