While I value the depth the blog authors went into, some of these are not problems with pgvector but problems with similarity search itself, and would be problems with other vector databases as well. There are also solutions to the problems described that the authors do not mention, which I think any reader should take into account.

> 1. Required and Negated Words

This is not a bug in pgvector but a bug in embedding models and simple similarity search itself. You'd run into this issue when doing RAG with any vector DB or ANN search library. You could probably solve this with query expansion, running multiple similarity searches in parallel, and filtering the results down to those containing the required theme, rather than trying to get this all in a single search.
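A rough sketch of the merge-then-filter step, to make the idea concrete. Everything here is hypothetical: `merge_unique`, `filter_results`, and the sample candidates are made-up names and data standing in for the output of several similarity searches, and the word matching is a naive exact-token check, not anything pgvector provides.

```python
import re


def tokenize(text):
    """Lowercase a document and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def merge_unique(*result_lists):
    """Merge several result lists, keeping first-seen order and dropping duplicates."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


def filter_results(results, required=(), negated=()):
    """Keep documents that contain every required word and no negated word."""
    required = {w.lower() for w in required}
    negated = {w.lower() for w in negated}
    kept = []
    for doc in results:
        tokens = tokenize(doc)
        if required <= tokens and not (negated & tokens):
            kept.append(doc)
    return kept


# Hypothetical outputs of two expanded queries run as separate similarity searches.
search_one = ["pgvector supports HNSW indexes", "Qdrant supports HNSW indexes"]
search_two = ["pgvector supports IVFFlat indexes", "pgvector supports HNSW indexes"]

candidates = merge_unique(search_one, search_two)
print(filter_results(candidates, required=["pgvector"], negated=["ivfflat"]))
```

The trade-off is that filtering after retrieval can leave you with fewer than the number of results you asked for, so you would likely over-fetch in each search.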

> 2. Explainability With Highlights

This is a misunderstanding of the purpose of embedding-based semantic search. The point of semantic search is not to match on exact keywords but to match on the /meaning/. If you want exact keyword matching, use full text search. This can also be solved with hybrid search: combining full text search and semantic search and using a re-ranker. PostgreSQL has built-in FTS with tsvector.
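To illustrate the "combine two result lists" part of hybrid search, here is a minimal sketch using reciprocal rank fusion (RRF). Note that RRF is a simple rank-merging formula that I'm swapping in for illustration; it is not a learned re-ranker, and the document IDs and the `k=60` constant are assumptions, not something from the blog post.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document ID for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings: one from full text search, one from vector search.
fts_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

print(reciprocal_rank_fusion([fts_ranking, vector_ranking]))
```

Documents that show up in both lists float to the top, which is usually what you want before handing a short candidate list to a heavier re-ranker.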

> 3. Performant Filters and Order By’s

The authors do not disclose any details about what index they use here or what kind of filtering they are trying to do. As other commenters point out, the StreamingDiskANN in the pgvectorscale extension [0] (complement to pgvector, you can use them together) improves on performance and accuracy of filtering vs pgvector HNSW (see details in [1]).

> 4. Support for sparse vectors, BM25, and other inverse document frequency search modes

This is probably the fairest point in the post. But there exist projects like pg_search from ParadeDB which bring BM25 to Postgres and help solve this [2].

Lastly, I respect companies trying to provide real-world examples of the trade-offs of different systems, and so thank the authors for sharing their experience and spurring discussion.

Disclaimer: I work at Timescale, where we offer pgvector and have also made other extensions for AI/vector workloads on PostgreSQL, namely pgvectorscale and pgai. I've tried to be as even-handed in my analysis as possible, but as with everything on the internet, you can make up your own mind and decide for yourself.

[0]: https://github.com/timescale/pgvectorscale/

[1]: https://www.timescale.com/blog/how-we-made-postgresql-as-fas...

[2]: https://github.com/paradedb/paradedb



I tend to think "semantic" is a better term than "similarity" for describing dense vector search. Question-to-answer pairs are present in the training data, which by itself should discourage "similarity" thinking. Personally, it feels like there was a recent shift to using "similarity" and I don't know why.

IDF-oriented keyword matching techniques are a lot closer to "similarity" search than dense vector ones.

Dense vector search excels at queries containing a semantic concept, like a comparison in a query such as "A vs. B" [1]. Results will shift towards comparisons in general over results containing tokens A or B. A bag-of-words-style model, where all the tokens collapse into a single vector, is ideal for anything where you're trying to get at an idea more than a set of tokens.

> 1. solve this with query expansion, running multiple similarity searches..

It would be a lot easier to solve by just having the required/negated word feature. That's really my point in the blog. Having to experiment and do newfangled things with high latency penalties for this common query pattern is something to seriously consider when choosing pgvector.

> Explainability with highlights

It's rare that a semantic search does not match tokens to some extent. The example pictured in the blog is actually from a dense vector query. Dense vector search isn't that magical and retains some token-level precision. Also, you can use word-level embeddings for the highlights to make them more semantically accurate. However, in our system, we use Jaro-Winkler distance.

In a semantic search context, you do want something matching on meaning, but when it doesn't work well, the information from highlights helps you refactor and get it closer to ideal.
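For anyone curious what string-similarity highlighting can look like, here is a minimal sketch. This is not our production code: `highlight`, the `**` markers, and the 0.9 threshold are all made up for illustration, and the Jaro-Winkler implementation is the textbook version with a prefix scale of 0.1 and a prefix cap of 4 characters.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def highlight(query, text, threshold=0.9):
    """Wrap words in `text` that are close to any query word in ** markers."""
    query_words = [w.lower() for w in query.split()]
    out = []
    for word in text.split():
        bare = word.strip(".,;:!?").lower()
        if bare and any(jaro_winkler(bare, q) >= threshold for q in query_words):
            out.append(f"**{word}**")
        else:
            out.append(word)
    return " ".join(out)


print(highlight("vector indexes", "pgvector builds an index over vectors"))
```

Because the match is fuzzy rather than exact, near-variants like plurals get highlighted too, which is the kind of signal that helps when you're debugging why a result came back.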

> Sparse vectors, BM25, and other IDF modes

I'm a fan of pg_search (and tantivy), but SPLADE is a significant improvement and I think it's a big loss to not have easy access to it.

Appreciate that you enjoyed the post and thank you for starting a discussion on it!

[1]: https://hn.trieve.ai/about



