
I'm assuming, at its core, a vector database is a collection of vectors all of the same dimension N, for some N, i.e. a set of points in R^N?

(Aside from all the per-vector metadata, and the various dimensionality-reduction, nearest-neighbour, etc. operations described in the article.)



Not really. For starters, not all the vectors in the database are of the same dimension (at least in the ones I've used). In pgvector, for example, you can have multiple tables each with a column of vector type, and those tables can have different dimensions.

In any case your description is strangely reductionist. The important thing about a vector db is that it's typically designed to store embeddings used in various ML applications. So say you're doing NLP: you can tokenize some input, store the token and positional embeddings in a vector db, and then use it for similarity search, training, etc.


Mixed tables, each with its own dimension, makes sense.

> your description is strangely reductionist.

Sure - pure | applied math background, old enough to have used Postgres when it was known as Ingres, to have patched in Spatial relations before it had the GIS functions it has now, and to have written libraries for GIS-linked { 256 | 1024 | 2048 } D vector databases for signal acquisition | processing.

I'm late to the 'modern' discussions & just checking my read - I can think of applications for mixed dimensions and discrete space vectors and there are analogs to trad R^N ops for those cases.


I think it might be an n-dimensional topological manifold with an n-dimensional geodesic supplying the shortest path. Has anyone here actually done topology?


If you take a 2D piece of paper, draw the numbers 1 to 5 horizontally, then look at the distance between 1 and 5, or 1 and 3, they vary. Now roll it into a cylinder: it's a manifold, and now the distance between 1 and 5, and between 1 and 3, are the same. No matter how many numbers you had written on the paper, they would be connected by the shortest path, something like a helix I believe, the helix being the geodesic. Now increase the dimension, but keep rolling it up into that 'cylinder' and using a new geodesic to connect the shortest path. This is my layman's take on it; if you are #math please jump in and put me right!
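The "roll the paper into a cylinder" picture corresponds to a wrap-around (circular) distance: along the cylinder's circumference you can go either way round, and the shortest of the two arcs wins. A toy Python sketch of that metric (my illustration, not from the thread):

```python
def circular_distance(i, j, n):
    """Distance between positions i and j on a cycle of n points
    (the strip of paper rolled into a cylinder): go left or go
    right around the cylinder, whichever arc is shorter."""
    flat = abs(i - j)          # distance on the flat strip
    return min(flat, n - flat) # shortest way round the cycle

# On the flat strip 1..5, positions 1 and 5 are 4 apart;
# rolled into a cylinder of 5 points, they become neighbours.
print(abs(1 - 5))                  # flat distance: 4
print(circular_distance(1, 5, 5))  # wrap-around distance: 1
```

Note this toy version gives d(1,5)=1 but d(1,3)=2, so the distances don't all become equal; the rolled-up surface just changes which points are close.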


Not sure what this comment is driving at. "At its core" a database is a collection of data, sure.


and a "vector database" is a collection of vectors, sure.

The question is whether it's the CompSci | HN | AI domain nomenclature to assume a vector database is made up of vectors that are all of the same dimension over a continuum (e.g. N real numbers for fixed N), or whether vector databases are made up of mixed vectors (no fixed dimension), discrete values, etc.

I ask as the linked article doesn't specify, but does appear to imply the former.


An embedding model will map a string of text (of variable length) to R^D. Each model has its own fixed D, yes. (Typically the vector is a unit vector, for performance reasons.) The main function of a vector database is similarity lookup, so you would calculate approximate nearest neighbours of a vector using a scalar vector distance metric (e.g. cosine similarity). These similarity metrics operate on two vectors of the same dimension.
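A minimal sketch of that distance metric in Python with NumPy; the 3-dimensional vectors here are stand-ins for real embedding output, which would be much higher-dimensional:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of the same dimension.
    For unit vectors this reduces to a plain dot product, which is
    one reason embeddings are often stored normalised."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0, 0.0])
b = np.array([1.0, 1.0, 0.0])
print(cosine_similarity(a, a))            # identical direction -> 1.0
print(round(cosine_similarity(a, b), 4))  # 45 degrees apart -> 0.7071
```

Mixing dimensions breaks immediately here: `np.dot` raises on two vectors of different lengths, which is the mechanical reason a lookup vector has to match the dimension of the stored ones.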

You would not mix and match embedding models (e.g. with differing dimensions) at look-up time. The target vector table assumes you will look it up with a vector created from the exact same embedding model and version that was used to backfill it.

The API documentation for a look up operation may be more illuminating here:

>vector (array of floats)

>The query vector. This should be the same length as the dimension of the index being queried. Each query() request can contain only one of the parameters id or vector.

https://docs.pinecone.io/reference/query
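Conceptually, a query like the one documented above is a top-k nearest-neighbour search against the stored vectors. Real vector databases use approximate indexes (HNSW, IVF, etc.) rather than scanning everything, but a brute-force Python sketch of the semantics might look like this (the class and method names are mine, not Pinecone's API):

```python
import numpy as np

class ToyIndex:
    """Brute-force stand-in for a vector index: stores (id, vector)
    pairs of one fixed dimension and answers top-k similarity queries."""
    def __init__(self, dim):
        self.dim = dim
        self.ids, self.vecs = [], []

    def upsert(self, id_, vec):
        vec = np.asarray(vec, dtype=float)
        assert vec.shape == (self.dim,), "vector must match index dimension"
        self.ids.append(id_)
        self.vecs.append(vec / np.linalg.norm(vec))  # store normalised

    def query(self, vec, top_k=1):
        q = np.asarray(vec, dtype=float)
        q = q / np.linalg.norm(q)
        scores = np.stack(self.vecs) @ q  # cosine similarity via dot product
        order = np.argsort(scores)[::-1][:top_k]
        return [(self.ids[i], float(scores[i])) for i in order]

index = ToyIndex(dim=3)
index.upsert("doc1", [1.0, 0.0, 0.0])
index.upsert("doc2", [0.0, 1.0, 0.0])
print(index.query([0.9, 0.1, 0.0], top_k=1))  # "doc1" is the nearest
```

The `assert` in `upsert` is the toy version of the constraint in the docs: every vector in an index shares one dimension, and the query vector must match it.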


D is fixed for the model but not for the database. You don't need a separate database for each model.


Yes, that's imprecision on my part. For a table D is fixed, but not necessarily across tables in the vector database.


As per my comment earlier, the dimension isn't fixed. The usual use case (storing embeddings) is instructive as to the range of values. For token embeddings, the embedding is often generated via a lookup in a fixed vocabulary mapping tokens to token IDs. So say your vocab is words: the value is a word ID, which would obviously be an integer, not a real. Here's an intro to word embeddings https://wiki.pathmind.com/word2vec and here's one for positional embeddings (the new hotness given how zeitgeisty GPTs are) https://theaisummer.com/positional-embeddings/
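The vocabulary-lookup step described here can be sketched in two moves: a token maps to an integer ID, and the ID indexes a row of an embedding matrix. Toy Python, with random weights standing in for what would be learned parameters in word2vec or a transformer:

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}  # token -> integer id
dim = 4
rng = np.random.default_rng(0)
# In a real model these rows are learned; random values here.
embedding_matrix = rng.normal(size=(len(vocab), dim))

def embed(token):
    """Look the token up in the fixed vocabulary, then index the
    embedding matrix with the resulting integer id."""
    token_id = vocab[token]            # an integer, not a real
    return embedding_matrix[token_id]  # a vector in R^dim

print(embed("cat").shape)  # (4,)
```

So both kinds of values appear in the pipeline: discrete integer IDs on the way in, real-valued vectors in R^dim on the way out.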




