
Right, makes sense. But then what do you actually do with a database?

Starting with: what do you store in it?

Maybe sentence/vector pairs. But what does that give you? What do you do with that data algorithmically? What's the equivalent of a SELECT statement? What's the application that benefits an end user? That part still seems rather hazy.



I haven't worked in this space, but from what I gather, the idea would be something along the lines of the following:

An autoencoder is a model that takes a high dimensional input, distills it down to a low dimensional middle layer, and then tries to rebuild the high dimensional input. You train the model to minimize reconstruction error, and the point is that you can then run an input through just the first half to get a low-dimensional representation that captures the "essence" of the thing (in the "latent space"). In this representation, similar images should have similar "essences", so their latent vectors should be near each other.

The low dimensional representation must do a good job of capturing the "essence" of your things, otherwise your reconstruction error would be large. The lower the dimension you can use while still reconstructing your things well, the better a job it must be doing at making those few parameters encode the salient features without wasting any information. So similar things should be encoded similarly.
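To make that concrete, here's a minimal autoencoder sketch in PyTorch (the layer sizes are made up for illustration; a real image model would use convolutional layers):

    # Minimal autoencoder: squeeze 784 dims down to 32, then rebuild.
    import torch
    import torch.nn as nn

    class AutoEncoder(nn.Module):
        def __init__(self, input_dim=784, latent_dim=32):
            super().__init__()
            # First half: distill the input down to a small latent vector.
            self.encoder = nn.Sequential(
                nn.Linear(input_dim, 128), nn.ReLU(),
                nn.Linear(128, latent_dim),
            )
            # Second half: try to rebuild the original input from it.
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 128), nn.ReLU(),
                nn.Linear(128, input_dim),
            )

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = AutoEncoder()
    x = torch.rand(64, 784)                     # a batch of flattened images
    loss = nn.functional.mse_loss(model(x), x)  # reconstruction error to minimize
    latent = model.encoder(x)                   # the low-dimensional "essence", shape (64, 32)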

So imagine you've got a database of images, and you have a table of all of the low dimensional encoded vectors. You want to do a reverse image search. The user sends you an image, you run the encoder on it to get the latent representation, and then you want to essentially run "SELECT ei.image_id FROM encoded_images ei ORDER BY distance(encode(input_image), ei.encoding) LIMIT 10".

So you want a database that supports indexes that let you efficiently run vector similarity queries/nearest neighbor search, i.e. that support an efficient "ORDER BY distance(_, indexed_column)". Since the whole process was fuzzy anyway, you may actually want to support an approximate "ORDER BY distance" for speed.
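Stripped of the index, that query is just a nearest-neighbor scan; here's a toy numpy sketch of what it computes (a real vector database would answer this approximately via an index like HNSW instead of scanning every row):

    import numpy as np

    # Stand-ins for the encoded_images table in the pseudo-SQL above.
    encodings = np.random.rand(100_000, 32).astype(np.float32)
    image_ids = np.arange(100_000)

    def nearest_images(query_vec, k=10):
        """Brute-force ORDER BY distance(query, encoding) LIMIT k."""
        dists = np.linalg.norm(encodings - query_vec, axis=1)  # Euclidean distance
        top = np.argsort(dists)[:k]  # an ANN index approximates this sort for speed
        return image_ids[top]

    print(nearest_images(np.random.rand(32).astype(np.float32)))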

In practice, apparently, the encoding might be the output of an intermediate (say, the penultimate) layer of a deep network rather than specifically an autoencoder's bottleneck. Or you may have some other way to hash/encode things into a latent representation that you want to do distance searches on. And of course the images could instead be documents or whatever else you want to run similarity searches on.
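One common version of that trick is to take a pretrained classifier and chop off its final layer, so a forward pass returns the penultimate activations as the embedding. A sketch with torchvision (the model choice and sizes are just an example, not a prescription):

    import torch
    import torchvision.models as models

    # Pretrained ResNet-18; replace the classification head with a no-op
    # so the model outputs its 512-d penultimate-layer features.
    model = models.resnet18(weights="DEFAULT")
    model.fc = torch.nn.Identity()
    model.eval()

    with torch.no_grad():
        batch = torch.rand(1, 3, 224, 224)  # stand-in for a preprocessed image
        embedding = model(batch)            # shape (1, 512)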


What a great explanation, thank you.


Often the use case is search. E.g., you have a basic text search engine to find musicians on your site, which does some string matching, basic tokenization, and so on. But you also want to be able to surface similar types of musicians in search.

In that case you might store vectors representing a user based on some features you've selected, or a word embedding of their common genres/tags.

To actually search this thing, you need something to compare against. You could directly use the word embeddings of the search query. You could also do a search against your existing method, and then use the top results from that as a seed to search your vectors.

Since everything's a vector, you can also ask questions like "what musician is similar to Tom AND Sally?" by looking for vectors near T+S. T-S could represent "like Tom but not like Sally", etc.
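A toy numpy sketch of that arithmetic, assuming you already have one embedding per musician (the random vectors here just stand in for learned ones):

    import numpy as np

    def normalize(v):
        return v / np.linalg.norm(v)

    rng = np.random.default_rng(0)
    embeddings = {name: normalize(rng.standard_normal(64))
                  for name in ["Tom", "Sally", "Alice", "Bob"]}

    def most_similar(query_vec, k=2):
        """Rank musicians by cosine similarity to an arbitrary query vector."""
        q = normalize(query_vec)
        scores = {name: float(q @ v) for name, v in embeddings.items()}
        return sorted(scores, key=scores.get, reverse=True)[:k]

    T, S = embeddings["Tom"], embeddings["Sally"]
    print(most_similar(T + S))  # "similar to Tom AND Sally"
    print(most_similar(T - S))  # "like Tom but not like Sally"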

So the answer to "what do you store?" is: whatever you'll use as the seed to search against.


Wow, that's interesting. So we can do vector arithmetic and the results make sense as a form of embedding/concept logic. Addition seems to work like "and" / set intersection. Subtraction works like set difference ("T\S", also literally written "T-S", i.e. "without"), which logically says "T, but not S", or in terms of predicate calculus, "T(x) & not S(x)".

Perhaps there is also some unary vector operation which directly corresponds to negation (not Sally)? Perhaps multiplying the vector by -1? Or would (-1)S rather pick out "the opposite of Sally in conceptual space" instead of "not / anyone but Sally"? And what about logical disjunction (union)? One could go further here and ask whether there is an analog to logical quantifiers. Then there is of course the question whether anything in vector logic would correspond to relations, i.e. binary predicates like R(x, y), not just unary ones.

(Sorry for rambling, I'm thinking out loud here.)


The vectors are usually unit length (if you use the OpenAI API, anyway), so you can imagine them sitting on the surface of a hypersphere.

You measure the cosine distance between documents, or between search queries and documents. (Cosine is fast; there are other distance metrics.)

The vector database queries will do things like: given one embedding (a document or a query), find the nearest embeddings (documents). Or: given two embeddings (e.g. a query and a context), each with a weight, find the ones that triangulate to being near both.
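A sketch of how that weighted "near both" query might be scored; the linear blend of cosine similarities is my assumption, not any particular product's API:

    import numpy as np

    def cos(a, b):
        # For unit-length vectors this reduces to a plain dot product.
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def near_both(query_emb, context_emb, docs, w_query=0.7, w_context=0.3, k=5):
        """docs is a list of (doc_id, embedding) pairs; blend two similarity
        scores so the winners sit near both the query and the context."""
        scored = sorted(
            ((w_query * cos(query_emb, d) + w_context * cos(context_emb, d), doc_id)
             for doc_id, d in docs),
            reverse=True,
        )
        return [doc_id for _, doc_id in scored[:k]]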


Simple answer: you normally store text in it, but with the state of neural networks these days, most things can be vectorized and searched.

So, coming from a database background myself but working in search, the SELECT statement (and joins) probably isn't the best way to get your head around this. I would think of the vector as a unique key for a record, and of every query as a LIKE statement, but one that returns a probability of a match instead of an exact match.

A great use case is similarity: we want the things closest to what we're looking for, even when there isn't an exact match.

For example: a user gives me a sentence that says, "How long do I have to be with the company before I get a 401K match?". My vector store has a bunch of vectors, including "A new employee will be eligible for 401K after 6 months." and "The 401K program is run by <MEGACORP X>."

I would like to be able to see that the first vector is a closer match to the user's sentence than the second, and by how much. I would also like to do this without having to change my code much based on the structure of the text. Luckily, there is a very simple algorithm for this (cosine similarity) that doesn't change regardless of the sentence structure or the question asked. It also doesn't matter what kind of question/answer pair you have, as long as it can be vectorized, so you could even give me a vector representing an image and I can give you back the most similar image.
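A sketch of that comparison using sentence-transformers as the embedding model (one common open-source choice; the model name is just a typical default, and any embedding model would do):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    query = "How long do I have to be with the company before I get a 401K match?"
    docs = [
        "A new employee will be eligible for 401K after 6 months.",
        "The 401K program is run by <MEGACORP X>.",
    ]

    q = model.encode(query)
    for doc, d in zip(docs, model.encode(docs)):
        sim = float(q @ d) / (np.linalg.norm(q) * np.linalg.norm(d))
        print(f"{sim:.3f}  {doc}")
    # Expect the eligibility sentence to score noticeably higher.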

Here is the most interesting thing about vectors -- with very little effort, they turn the English language into a programming language.

Instead of typing "SELECT document_id, document_name, document_body FROM documents WHERE document_body LIKE '%401K%' AND document_body LIKE '%match%' AND document_body LIKE '%existing employee%'" I can just ask, "How long do I have to be with the company before I get a 401K match?" and I will get back a result and a match probability. How I change my text will change the matches, and can do so in ways that are profound and unexpected. Note that the SQL query above would not return any rows, because none of my documents contain the term "existing". Building the correct SQL query can be quite complex in comparison to just using the text.
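For what that looks like against a vector index in practice, here's a rough sketch with the elasticsearch-py client; the index name, the dense_vector field name, and the embedding model are all assumptions for illustration:

    from elasticsearch import Elasticsearch
    from sentence_transformers import SentenceTransformer

    es = Elasticsearch("http://localhost:9200")
    model = SentenceTransformer("all-MiniLM-L6-v2")  # must match what indexed the docs

    question = "How long do I have to be with the company before I get a 401K match?"
    resp = es.search(
        index="documents",                 # hypothetical index
        knn={
            "field": "body_embedding",     # hypothetical dense_vector field
            "query_vector": model.encode(question).tolist(),
            "k": 10,
            "num_candidates": 100,
        },
    )
    for hit in resp["hits"]["hits"]:
        print(hit["_score"], hit["_source"]["document_name"])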

This is pretty great for long-tail search, Q&A, image search, recommendations, classification, etc.

BTW, I am biased, I work for Elastic (makers of Elasticsearch) and we have been doing traditional search forever, and vector/hybrid search for the last few years.




