Hacker News

Unfortunately this piece is nebulous on what an embedding is. Apparently it is saved as an array of floats, and it has some string of text it is associated with, and the float arrays are compared by "similarity".

None of these explains what an embedding really is. My best guess is that the embedding represents the meaning of the natural language string it was generated from, such that strings with "similar" embeddings have similar meaning. But that's just speculation.



> My best guess is that the embedding represents the meaning of the natural language string it was generated from, such that strings with "similar" embeddings have similar meaning. But that's just speculation.

Yeah, you've got it. A mapping from words to vectors such that semantic similarity between words is reflected in mathematical similarity between vectors.

An idea of how you might train this thing: let's say the words "king" and "queen" are being embedded. In your training data there are lots of examples where "king" and "queen" are interchangeable, for example in the sentence "The ___ is dead, long live the ____", either word is appropriate in either slot, so each time we see an example like this we nudge "king" and "queen" a little closer together in some sense. However you also find phrases where they are not interchangeable, such as "The first born male will one day be ____". So when you see those examples you nudge "king" a little closer in some sense to other words which appropriately complete the sentence (which does not include "queen" in this case).

In this way, repeated over a giant training set with thousands of words, concepts like "male/female" and "royalty", "person/object" and tons of others end up getting reflected in the relationships between the vectors.

These vectors are then useful representations of words to ML models.
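To make the "nudging" concrete, here's a toy sketch in Python. The 2-D vectors and the hand-rolled update rule are invented for illustration; this is not a real training algorithm like word2vec, just the geometric intuition:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 2-D starting vectors (real embeddings use hundreds of dimensions).
vecs = {w: rng.normal(size=2) for w in ["king", "queen", "table"]}

def nudge(a, b, lr=0.1):
    # Pull word a's vector a little way toward word b's vector.
    vecs[a] += lr * (vecs[b] - vecs[a])

def dist(a, b):
    return float(np.linalg.norm(vecs[a] - vecs[b]))

before = dist("king", "queen")
# Every time "king" and "queen" appear interchangeably in training text,
# nudge their vectors together; an unrelated word like "table" never gets pulled in.
for _ in range(20):
    nudge("king", "queen")
    nudge("queen", "king")
after = dist("king", "queen")
```

After the nudges, "king" and "queen" sit close together while "table" stays far away, which is exactly the similarity structure the vectors are supposed to encode.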


Right, makes sense. But then what do you actually do with a database?

Starting with: what do you store in it?

Maybe sentence/vector pairs. But what does that give you? What do you do with that data algorithmically? What's the equivalent of a SELECT statement? What's the application that benefits an end user? That part still seems rather hazy.


I haven't worked in this space, but from what I gather, the idea would be something along the lines of the following:

An autoencoder is a model that takes a high dimensional input, distills it down to a low dimensional middle layer, and then tries to rebuild the high dimensional input again. You train the model to minimize reconstruction error, and the point is then that you can run an input through just the first half to get a low-dimensional representation that captures the "essence" of the thing (in the "latent space"). In this representation, images that are similar should have similar "essences", so their latent vectors should be near to each other.

The low dimensional representation must do a good job capturing the "essence" of your things, otherwise your reconstruction error would be large. The lower the dimension you manage to use while still managing to reconstruct your things, the better a job it must be doing at making those parameters really encode the salient features of your thing without wasting any information. So similar things should be encoded similarly.
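A minimal sketch of that idea, assuming a purely linear encoder/decoder and made-up data (real autoencoders are deep and nonlinear). The data is 3-D but secretly lives on a 1-D line, so a single latent number suffices to reconstruct it:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: 3-D points that really live on a 1-D line (plus a little noise),
# so a 1-dimensional latent code should be enough to rebuild them.
t = rng.normal(size=(200, 1))
X = t @ np.array([[1.0, 2.0, -1.0]]) + 0.01 * rng.normal(size=(200, 3))

W_enc = 0.1 * rng.normal(size=(1, 3))  # encoder: 3 -> 1
W_dec = 0.1 * rng.normal(size=(3, 1))  # decoder: 1 -> 3
lr = 0.01
for _ in range(500):
    H = X @ W_enc.T              # latent codes, shape (200, 1)
    err = H @ W_dec.T - X        # reconstruction error
    # Gradient descent on the mean squared reconstruction error.
    g_dec = (err.T @ H) / len(X)
    g_enc = ((err @ W_dec).T @ X) / len(X)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

# Final reconstruction error: near the noise floor if the latent code works.
loss = float(((X @ W_enc.T @ W_dec.T - X) ** 2).mean())
```

The row `X @ W_enc.T` is exactly the "run just the first half" step: those 1-D codes are the embeddings you would store and compare.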

So imagine you've got a database of images, and you have a table of all of the low dimensional encoded vectors. You want to do a reverse image search. The user sends you an image, you run the encoder on it to get the latent representation, and then you want to essentially run "SELECT ei.image_id FROM encoded_images ei ORDER BY distance(encode(input_image), ei.encoding) LIMIT 10".

So you want a database that supports indexes that let you efficiently run vector similarity queries/nearest neighbor search, i.e. that support an efficient "ORDER BY distance(_, indexed_column)". Since the whole process was fuzzy anyway, you may actually want to support an approximate "ORDER BY distance" for speed.
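That query can be sketched in plain numpy with brute-force (exact) search; a real vector database uses approximate indexes precisely to avoid scanning every row like this:

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend these are latent vectors for 1,000 images (8-D here for brevity).
encodings = rng.normal(size=(1000, 8))
image_ids = np.arange(1000)

def nearest(query, k=10):
    # Brute-force version of:
    #   SELECT image_id FROM encoded_images
    #   ORDER BY distance(query, encoding) LIMIT k
    d = np.linalg.norm(encodings - query, axis=1)  # distance to every stored vector
    idx = np.argsort(d)[:k]                        # exact top-k; vector DBs approximate this
    return image_ids[idx], d[idx]

ids, dists = nearest(encodings[42])  # query with a known image: it comes back first
```

Everything a vector index does is an optimization of that `argsort` line, trading a little accuracy for not having to touch all 1,000 (or a billion) rows.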

In practice, apparently, the encoding might be taken from the output of an intermediate layer in a deep network rather than from an autoencoder specifically. Or you may have some other way to hash/encode things to produce a latent representation that you want to do distance searches on. And of course images could instead be documents or whatever else you want to run similarity searches on.


What a great explanation, thank you.


Often the use case is search. Ex. You have a basic text search engine to find musicians on your site which does some string matching and basic tokenization and so on. But you want to be able to surface similar types of musicians on search too.

In that case you might store vectors representing a user based on some features you've selected, or a word embedding of their common genres/tags.

To actually search this thing, you need something to compare against. You could directly use the word embeddings of the search query. You could also do a search against your existing method, and then use the top results from that as a seed to search your vectors.

Since everything's a vector, you can also ask questions like "what musician is similar to Tom AND Sally" by looking for vectors near T+S. T-S could represent "like Tom but not like Sally", etc.
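A toy illustration of that arithmetic, with invented 3-D "taste" vectors (the names and numbers are made up for the example):

```python
import numpy as np

# Invented 3-D "taste" vectors; purely illustrative.
artists = {
    "tom":   np.array([0.9, 0.1, 0.0]),
    "sally": np.array([0.1, 0.9, 0.0]),
    "both":  np.array([0.7, 0.7, 0.1]),   # deliberately resembles Tom AND Sally
    "other": np.array([0.0, 0.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = artists["tom"] + artists["sally"]       # T+S: "like Tom AND Sally"
ranked = sorted((cosine(target, v), name) for name, v in artists.items())
best = ranked[-1][1]                             # highest cosine similarity wins

diff = artists["tom"] - artists["sally"]         # T-S: "like Tom but not like Sally"
```

Here `best` comes out as the artist who resembles both, and `diff` points toward Tom's side of taste-space and away from Sally's.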

So the answer to what do you store is, what will be your seed to search against?


Wow, that's interesting. So we can do vector arithmetic and the results make sense as a form of embedding/concept logic. Addition seems to work like "and" / set intersection. Subtraction works like set difference ("T\S", also literally written "T-S", i.e. "without"), which logically says "T, but not S", or in terms of predicate calculus, "T(x) & not S(x)".

Perhaps there is also some unary vector operation which directly corresponds to negation (not Sally)? Perhaps multiplying the vector by -1? Or would (-1)S rather pick out "the opposite of Sally in conceptual space" instead of "not / anyone but, Sally"? And what about logical disjunction (union)? One could go further here, and ask whether there is an analog to logical quantifiers. Then there is of course the question whether there is anything in vector logic which would correspond to relations, binary predicates like R(x, y), not just unary ones, etc.

(Sorry for rambling, I'm thinking out loud here.)


The vectors are usually (if you use OpenAI API anyway) unit in length, and so you can imagine them on the surface of a hypersphere.

You measure the cosine distance between documents, or between search queries and documents. (Cosine is fast, there are other distance metrics).

The vector database queries will do things like given one embedding (document or query) find the nearest embeddings (documents). Or given two embeddings (e.g. a query and a context) with a weight for each one, find the ones that triangulate to being near both.


Simple answer - you normally store text in it, but with the state of neural networks these days most things can be vectorized and searched.

So, coming myself from a database background but working in search, the SELECT statement (and joins) probably isn't the best way to get your head wrapped around things. I would think of the vector as a unique key for a record, and of every query as a LIKE statement, but one that returns a probability of a match instead of an actual match.

A great use case is to think about similarity, where we want the things that are closest to what we want to see, but there isn't an exact match.

For example: a user gives me a sentence that says, "How long do I have to be with the company before I get a 401K match?". My vector store has a bunch of vectors including "A new employee will be eligible for 401K after 6 months." and "The 401K program is run by <MEGACORP X>."

I would like to be able to see that the first vector is a closer match to the user sentence than the second, and by how much. I would also like to do this without having to change my code much based on the structure of the text. Luckily, there is a very simple algorithm for doing this (cosine similarity) that doesn't change regardless of the sentence structure or the question answered. Also, it doesn't matter what kind of question/answer you do as long as it can be vectorized, so you could even give me a vector representing an image and I can give you an image that is most similar.
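To show just the scoring machinery, here is that comparison with crude bag-of-words counts standing in for a real embedding model. (This is a deliberately weak vectorizer: a real neural embedding would match on meaning even with no shared words; only the cosine part is the same.)

```python
from collections import Counter
import math

# Crude stand-in for an embedding model: bag-of-words count vectors.
def vectorize(text):
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

query = vectorize("How long do I have to be with the company before I get a 401K match?")
doc1 = vectorize("A new employee will be eligible for 401K after 6 months.")
doc2 = vectorize("The 401K program is run by MEGACORP X.")
s1, s2 = cosine(query, doc1), cosine(query, doc2)  # doc1 should score higher
```

Swap `vectorize` for a neural embedding model and the rest of the code is unchanged, whatever the sentence structure or even the media type.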

Here is the most interesting thing about vectors -- with very little effort they turn the English language into a programming language.

Instead of typing "SELECT document_id, document_name, document_body FROM documents WHERE (document_body LIKE '%401K%' AND document_body LIKE '%match%' AND document_body LIKE '%existing employee%')" I can just ask, "How long do I have to be with the company before I get a 401K match?" and I will get back a result and a match probability. How I change my text will change the matches, and can do so in ways that are profound and unexpected. Note that the SQL query I gave would not return any values because I didn't have any documents that had the term "existing" in them. Building the correct SQL query could be quite complex in comparison to just using the text.

This is pretty great for long-tailed search, q&a, image search, recommendations, classification, etc.

BTW, I am biased, I work for Elastic (makers of Elasticsearch) and we have been doing traditional search forever, and vector/hybrid search for the last few years.



Great explanation, thank you!


How does each dimension maintain a consistent ("sticky") meaning across scenarios?


Because the model used to compute the embeddings is the same across scenarios. You can infer meaning for each dimension by checking which inputs get embeddings that have large values for the dimension.

If the inputs are images, you may find that some dimension scores e.g. how much blue there is in the image. Though often it's not that simple (there could be multiple dimensions that relate to how blue the image is, especially if the embedding dimensionality is large, which it does tend to be these days. Though you could reduce the embedding dimensionality first using PCA, and see what input images correspond to high/low values of the first principal component, etc.).
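A toy version of that probing idea: we plant a "blueness" signal in one dimension of some fake embeddings, then recover which dimension carries it by checking correlations (the data and the dimension choice are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
# Fake embeddings for 500 "images"; dimension 3 secretly tracks how blue each one is.
blueness = rng.uniform(size=500)
emb = rng.normal(size=(500, 8))
emb[:, 3] = 2.0 * blueness + 0.1 * rng.normal(size=500)

# Probe each dimension: which one correlates most with the known property?
corrs = [abs(np.corrcoef(emb[:, d], blueness)[0, 1]) for d in range(8)]
blue_dim = int(np.argmax(corrs))
```

With real embeddings the signal is rarely confined to one dimension like this, which is why the PCA approach mentioned above is often more revealing.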


Dimensions themselves do not carry any meaning; what matters is the neighborhood structure, which maintains a sense of similarity. Think of it like a very complex point cloud. Applying an n-dimensional rotation leads to a point cloud with the same content.

As for the number of dimensions, in a sense they are a training variable just as the content itself. The more dimensions you utilize for your embeddings, the more complex your relations can be during clustering. Too many dimensions can easily lead to overfitting, however, and too few dimensions usually cannot accurately represent the training corpus.


All the embeddings (vectors) are usually generated at the same time, and regenerated periodically. Does this answer your question?


There are good sibling explanations by @ta20211004_1 and @HarHarVeryFunny, but if I can try in an additional way:

Imagine you wanted to go from words to numbers (which are easier to work with mathematically), like you wanted to assign a number to some words.

How could you do it? Well you could do it randomly: cat could be 2, dog could be 10, sweater could be 4.534 and frog could be 8.

Not super useful, but hey - words are now numbers! How can we make this "better"?

What if we decided on a way to put words on a line - let's say we ordered words by how much they had to do with animals. Let's say 10 meant it's a very animal-related word, and 0 is very not-animal related. So cat and dog would be 10, and maybe zoo would be 9, and fur could be 8. But something like sweater would be 1 (depending on whether the sweater was made from animal wool...?)

What now? Well what's cool is that if you assign words on that "animal-ness" line, you can find the words that are "similar" by looking at the numbers that are close. So, words whose value is around 6 are probably similar in meaning. At least, in terms of how much they relate to animals.

That's the core idea. Ordering words by animal-ness is not that useful in the real world, so maybe we can place words on a 2d grid instead of a line. Horizontally, it would go from 0 to 10 (not animal at all - very animal) and vertically, it could be ordered by brightness - 0 for dark, and 10 for bright.

So now, bright animals will congregate together in one part of the grid, and dark non animals will also live close together. For example, a dark frog might be in the bottom right at position (10, 0) - very animal (right end of the x axis) but not bright (bottom of the y axis). Any other word whose position is close to (10, 0) would presumably also be animal-y and dark.

That's really it. The magic is that... this works in thousands of dimensions. Each dimension being some way that "AIs" see words / our world. It's harder to think about what each dimension "is" or represents. But embeddings are really just that - the position in a space with a huge number of dimensions. Just like dark frogs were (10, 0) in our simple example, the word "frog" might be (0.124, 0.51251, 0.61, 0.2362, 0.236236, ..............) as an embedding.
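The 2-D grid above, sketched in Python (the positions are the made-up ones from the example, with a couple of extra words added for contrast):

```python
import math

# The made-up 2-D "concept grid": (animal-ness, brightness), both 0..10.
words = {
    "dark frog": (10, 0),
    "cat":       (10, 6),
    "sweater":   (1, 5),
    "moon":      (0, 2),
    "black cat": (10, 1),
}

def most_similar(word):
    # Closest other word by plain Euclidean distance on the grid.
    others = [(math.dist(words[word], pos), w) for w, pos in words.items() if w != word]
    return min(others)[1]

closest = most_similar("dark frog")  # another dark, animal-y word should win
```

An embedding model does exactly this lookup, just with thousands of dimensions instead of two.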

That's it!


Wow. Great explanation.

The example you used going from 1 to 2 to n dimensions really made sense


An embedding is a collection of learned vectors.

Each vector is an array of n floats that represent a location of a thing in an n-dimensional space. The idea of learning an embedding is that you have some learning process that will put items that are similar into similar parts of that vector space.

The vectors don't necessarily need to represent words, and the model that produces them doesn't necessarily have to be a language model.

For example, embeddings are widely used to generate recommendations. Say you have a dataset of users clicking on products on a website. You could assume that products that get clicked in the same session are probably similar and use that dataset to learn an embedding for products. This would give you a vector representing each product. When you want to generate recommendations for a product, you take the vector for that product and then search through the set of all product vectors to find those that are closest to it in the vector space.
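A crude sketch of that pipeline, using rows of a session co-occurrence matrix as stand-in "embeddings" (the sessions and product names are invented; a real system would learn dense vectors, e.g. by running word2vec over the sessions):

```python
import numpy as np

# Click sessions: products clicked together are assumed similar.
sessions = [
    ["tent", "sleeping_bag", "stove"],
    ["tent", "sleeping_bag"],
    ["laptop", "mouse", "keyboard"],
    ["laptop", "mouse"],
    ["stove", "tent"],
]
products = sorted({p for s in sessions for p in s})
index = {p: i for i, p in enumerate(products)}

# Crude "embedding": each product's row of the co-occurrence matrix.
cooc = np.zeros((len(products), len(products)))
for s in sessions:
    for a in s:
        for b in s:
            if a != b:
                cooc[index[a], index[b]] += 1

def recommend(product, k=2):
    # Nearest neighbors of the product's vector by cosine similarity.
    v = cooc[index[product]]
    sims = cooc @ v / (np.linalg.norm(cooc, axis=1) * np.linalg.norm(v) + 1e-9)
    sims[index[product]] = -1            # don't recommend the product itself
    top = np.argsort(sims)[::-1][:k]
    return [products[i] for i in top]
```

Camping gear gets recommended with camping gear and computer gear with computer gear, purely from co-click structure, with no product metadata at all.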


An embedding is a way to map words into a high-dimensional "concept space", so they can be processed by ML algorithms. The most popular one is word2vec:

https://jalammar.github.io/illustrated-word2vec/


Sorry, that's even less helpful in the context of a database... but thanks for trying.


A vector database is used for queries like "I have this image, give me a list of the 10 closest images and metrics of how similar they are."

You use a machine learning model (like word2vec, OpenAI, etc.) to produce an "embedding" that describes the image, text, video, etc., which is your "vector".

For all of the other images in your database, you also run them through the same model, and store their embedding vectors in the vector database.

Then, you ask the database "I have this vector, what are the most similar vectors, and what are their primary keys, so I can see what content they refer to".

Think: you want to implement google "search by image". This is the basics of how you'd do that.


Isn't this just locality-sensitive hashing?

Why use the word "embedding" if there are already much more familiar words for it (isn't this the same as feature vector)?

I want to convince myself that this isn't similar to blockchain. In the sense that blockchain renamed an old and simple idea and advertised it as something complex and groundbreaking...

Also, relational databases and graph databases have a rich theory that results in many interesting sub-problems, each interesting in its own right, to contrast this with "document databases", which have no theory and nothing interesting behind them. So, if I were to invest my time learning about one w/o a financial incentive to do so, I'd not want to concentrate on some accidental concept that just happened to solve an immediate problem, but isn't applicable / transferable to other problems.

For example, graph databases and relational databases create interesting storage problems with respect to optimal layout for various database components. If a hash table is all there is to a vector database, then it's not an interesting storage problem.

Similarly with querying the database: if key lookup from a hash table is all there is, then it's not an interesting problem.


Okay, "mapping into concept space" is at least compatible with my meaning theory, but by itself it doesn't say much, since in principle anything can be mapped to anything.


Embeddings are a mapping of some type of thing (pictures, words, sentences, etc) to points in a high-dimensional space (e.g. few hundred dimensions) such that items that are close together in this space have some similarity.

The general idea is that the items you are embedding may vary in very many different ways, so trying to map them into a low dimensional space based on similarity isn't going to be able to capture all of that (e.g. if you wanted to represent faces in a 2-D space, you could only use 2 similarity measures such as eye and skin color). However a high enough dimensional space is able to represent many more axes of similarity.

Embeddings are learnt from examples, with the learning algorithm trying to map items that are similar to be close together in the embedding space, and items that are dissimilar to be distant from each other. For example, one could generate an embedding of face photos based on visual similarity by training it with many photos of each of a large number of people, and have the embedding learn to group all photos of the same person to be close together, and further away from those of other individuals. If you now had a new photo and wanted to know who it is (or who it most looks like), you'd generate the embedding for the new photo and determine what other photos it is close to in the embedding space.

Another example would be to create an embedding of words, trying to capture the meanings of words. The common way to do this is to take advantage of the fact that words are largely defined by use/context, so you can take a lot of texts and embed the constituent words such that words that are physically close together in the text are close together in the embedding space. This works surprisingly well, and words that end up close together in the embedding space can be seen to be related in terms of meaning.

Word embeddings are useful as an input to machine learning models/algorithms where you want the model to "understand" the words, and so it is useful if words with similar meaning have similar representations (i.e. their embeddings are close together), and vice versa.



As opwieurposiu said, embeddings are high-dimensional vectors. Often, they're created by classic math techniques (e.g. principal component analysis), or they are extracted from a model that proved useful for something else.

For example, a neural net model accepts a massive number of input values that directly map to the input. So those initial values don't add any info. But a layer further inside the model, probably close to the end, is smaller and should reflect what the model has learned. Like a lot of deep learning, the values work but don't give much insight.

If I'm wrong, I hope somebody more knowledgeable corrects me. I got my understanding from basic intro tutorials and Wolfram's essay on ChatGPT: https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...


A word or sentence embedding is a long array of numbers that represents the semantic "position" in a high dimensional space, which allows you to find the distance between any two sentences in this semantic space. My understanding of paragraph and document embeddings is that they are an average of all the sentence vectors combined as one point, which lets you find the distance between any two documents in the same way.


Yeah... for a while I wanted to understand what a vector database is, but this article reads like a thinly-veiled advertorial: too many buzzwords, and the content feels like the author doesn't really have a good knowledge of the subject and is just trying to advertise the tech their company is selling.


An embedding is a series of numbers that have been gradually shifted to better fit some purpose. The gradients tell me that if I increase the first number of embedding X a little, the model will perform better, so I do.



