In 2018, I tried to visualize what good writing looks like. My goal was to write a program that could take in a piece of writing, and return some dimensions of that writing that could be used to discern its quality.
My first attempts were really basic. I was dealing with a lot of lengths - lengths of sentences, lengths of words, etc. - which I plotted as bar charts and histograms. I didn't find these plots particularly revealing, and I lost interest in the project.
But embeddings have me excited about this idea again!
An embedding is a numerical representation of text. Instead of representing text as a sequence of characters (a string), embeddings represent text as a sequence of numbers. These numbers enable us to do some interesting things, but before we get into what those are, I want to first show why numerical representations are important by looking at how computers represent color.
The RGB color model is a way of describing colors as 24-bit numbers (for a total of 2^24, or ~16.7 million, unique colors). That 24-bit number is broken down into 3 separate 8-bit dimensions, known as channels. Each channel corresponds to the intensity of red (R), green (G), or blue (B) light, and ranges from 0 to 255.
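To make this concrete, here's a quick Python sketch of how one 24-bit number unpacks into its three channels:

# Unpack a 24-bit RGB color into its three 8-bit channels (values 0-255).
def unpack(color24):
    r = (color24 >> 16) & 0xFF  # top 8 bits
    g = (color24 >> 8) & 0xFF   # middle 8 bits
    b = color24 & 0xFF          # bottom 8 bits
    return (r, g, b)

print(unpack(0xFF8800))  # (255, 136, 0)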
If we have an RGB color and want to make it "more red", we have an easy way of doing so in terms of its underlying representation - we increase the red channel while keeping the other channels constant.
function moreRed(color, amount) {
  const [r, g, b] = color;
  // Increase the red channel, clamping at the 255 maximum
  return [Math.min(255, r + amount), g, b];
}
We can express more complex transformations just as easily. For example:
To brighten a color, we can increase the value of each channel by the same relative amount.
function brighten(color, amount) {
  const [r, g, b] = color;
  // Scale every channel by the same factor, clamping at the 255 maximum
  const scale = (channel) => Math.min(255, channel * amount);
  return [scale(r), scale(g), scale(b)];
}
We can see the effects of this transformation when it is applied to a single color.
These effects are even more apparent when we apply the transformation to an entire image (hover over the image to see how the transformation affects individual pixels):
In the RGB color model, colors with an equal intensity across all three channels (i.e. r = g = b) are all different shades of gray:
This means we can turn an input color "gray" by calculating the average of its three channels, and returning a new color where all channels have that average.
function gray(color) {
  const [r, g, b] = color;
  // Setting all three channels to the average yields a shade of gray
  const avg = (r + g + b) / 3;
  return [avg, avg, avg];
}
When applied to each pixel of an image:
The RGB color model works because it is rooted in reality: it's based on the Trichromatic Theory of Color Vision. In other words, it's a numerical representation that reflects (to some degree of accuracy) how humans actually perceive color. This makes the RGB color model an enabler - it expands what we can do with colors by letting us define functions that transform them, like the ones shown above.
You're familiar with these color transformation functions already - they're the basis of Instagram filters.
Note: This idea of "representations as enablers" comes from Synthesizers for Thought by Linus Lee, which heavily influenced this post.
The RGB color model is a useful frame for understanding embeddings. Just like how the RGB color model is a numerical representation of color based on how humans actually perceive color, an embedding is a numerical representation of text based on how humans actually use text in writing. Similarly, embeddings enhance what we can do with text.
To create an embedding, you take a piece of text and pass it through an embedding model. You can create embeddings of all types of text - individual words, phrases, paragraphs, even entire documents - but we'll stick to sentences in this post. The embedding model takes in the sentence as input, and returns a list of numbers, which is the embedding of that sentence.
You can take any sentence you can think of and run it through the embedding model - the output is always a list of numbers with length N. We refer to N as the dimension of the embedding model. The embedding model we use in this post (sentence-transformers/all-MiniLM-L6-v2) has a dimension of 384.
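Concretely, generating an embedding with this model takes a few lines of Python using the sentence-transformers library:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embedding = model.encode("Let's generate an embedding of this sentence.")
print(embedding.shape)  # (384,) - one number per dimension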
The fact that our embedding model maps any sentence to a list of 384 numbers means we can compare two arbitrary sentences by comparing their embeddings. A bit of intuition from the RGB color model helps here. If two colors look similar, their underlying RGB representations will be similar. The same is true for embeddings: if two sentences have similar meanings, their embeddings will be similar.
We can get a sense of this by comparing the embeddings of 3 different sentences. Since making sense of 384 numbers is impossible, let's instead visualize an embedding by coloring each number in the embedding according to its value (hover over each rectangle to see the value).
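If you want a rough static version of this visualization (this is a sketch, not the code behind the interactive figure), you can render the embedding as a single row of colored cells with matplotlib:

import matplotlib.pyplot as plt

# Render the embedding as a single row of 384 colored cells.
plt.figure(figsize=(12, 0.5))
plt.imshow(embedding.reshape(1, -1), aspect="auto", cmap="coolwarm")
plt.yticks([])
plt.show()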
We can compare the embeddings of "Let's generate an embedding of this sentence." (e1) and "I have a black cat." (e2) by pointwise subtracting each value in e2 from the corresponding value in e1:
Not surprisingly, these differences are much more pronounced than the differences between the embeddings of "Let's generate an embedding of this sentence." and "Let's get an embedding for this sentence."
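In code, this pointwise comparison is just vector subtraction. Here's a sketch reusing the model from above, where e3 is my own label for the near-paraphrase:

e1 = model.encode("Let's generate an embedding of this sentence.")
e2 = model.encode("I have a black cat.")
e3 = model.encode("Let's get an embedding for this sentence.")

# The average pointwise difference between the unrelated sentences
# should be noticeably larger than between the near-paraphrases.
print(abs(e1 - e2).mean())
print(abs(e1 - e3).mean())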
It's useful to think of embeddings as vectors, which means each embedding occupies a point in an N-dimensional space. An embedding model thus maps all possible sentences onto points in the same N-dimensional space. And as we just saw, sentences are mapped in such a way that those with similar meanings sit closer together in this space than those with different meanings.
And because embeddings are vectors, we can use cosine similarity to measure how close any two embeddings are. Cosine similarity is a function that scores the similarity of two vectors based on the angle between them. When used in the context of embeddings, cosine similarity can be thought of as a way of "measuring meaning": if two sentences are related in meaning, their cosine similarity will be closer to 1. If they are unrelated, their cosine similarity will be closer to 0.
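For embeddings stored as numpy arrays, cosine similarity is a one-liner. Here's a minimal version of the cosine_similarity function that appears in the snippet further down:

import numpy as np

def cosine_similarity(a, b):
    # The cosine of the angle between a and b: their dot product
    # divided by the product of their magnitudes.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))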
Cosine similarity brings us back to the main idea of this post: that embeddings enhance what we can do with text.
For example, here's the start of a Paul Graham essay, When to Do What You Love, with certain words highlighted with the help of embeddings and cosine similarity:
There's some debate about whether it's a good idea to "follow your passion". In fact the question is impossible to answer with a simple yes or no. Sometimes you should and sometimes you shouldn't, but the border between should and shouldn't is very complicated. The only way to give a general answer is to trace it.
When people talk about this question, there's always an implicit "instead of". All other things being equal, why wouldn't you work on what interests you the most? So even raising the question implies that all other things aren't equal, and that you have to choose between working on what interests you the most and something else, like what pays the best.
And indeed if your main goal is to make money, you can't usually afford to work on what interests you the most. People pay you for doing what they want, not what you want. But there's an obvious exception: when you both want the same thing. For example, if you love football, and you're good enough at it, you can get paid a lot to play it.
Of course the odds are against you in a case like football, because so many other people like playing it too. This is not to say you shouldn't try though. It depends how much ability you have and how hard you're willing to work.
To create these highlights, I first generate an embedding for each sentence in the essay (e1). I then loop through every word in each sentence, remove that word from the sentence, and generate an embedding for the shortened sentence (e2).
I then calculate the cosine similarity between e1 and e2, and use that value to drive the highlighting. Words that, when removed, cause a larger drop in cosine similarity are highlighted darker. The result is a rough measure of how “important” each word is to the meaning of the sentence in which it appears.
>>> cosine_similarity(e1, e2)
0.9933603 # high similarity (close to 1)
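Here's a minimal sketch of that leave-one-word-out loop, reusing model and cosine_similarity from earlier (the function and its name are mine, not the exact code behind the highlights):

def word_importances(sentence):
    words = sentence.split()
    e1 = model.encode(sentence)
    importances = []
    for i in range(len(words)):
        # Rebuild the sentence with word i removed...
        shortened = " ".join(words[:i] + words[i + 1:])
        e2 = model.encode(shortened)
        # ...and score word i by how much the meaning shifts without it.
        importances.append(1 - cosine_similarity(e1, e2))
    return list(zip(words, importances))

Words with the highest scores get the darkest highlights.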
This technique has its shortcomings. For one, long sentences tend to dilute the importance of any single word. And it'd probably be more informative to remove entire phrases such as "follow your passion" in some places rather than individual words. Even so, I find it's a helpful indicator of where to direct my attention. I like the idea of turning these highlights on after reading something for the first time, as I find it prompts me to read each sentence more closely.
But it isn't the virtues of any particular technique that excite me the most - it's the fact that such techniques are even possible. Here are two other embedding-based techniques for working with text that others have imagined, which go beyond using embeddings to measure meaning.
Below, Amelia Wattenberger demonstrates how embeddings can be used to measure sentences on a scale of “concrete” to “abstract”. (I especially love the view on the right-hand side which plots the entire essay along that scale).
Just as the RGB color model facilitates meaningful transformations of color, embeddings also facilitate meaningful transformations of text.
We can already do this with ChatGPT when we ask it to make a sentence more "concrete":
And this is because, under the hood, the Large Language Models (LLMs) that power ChatGPT are just manipulating embeddings!
Right now, we use natural language to interface with the LLM, which manipulates embeddings in response to our prompts. But interfaces that manipulate embeddings more directly and precisely - like the text editing interface Linus Lee imagines below - are possible:
And what's even more exciting is that embeddings can be generalized to all different types of data. We can create embeddings for images, songs, videos, and even abstract concepts like one's movie taste (these different types of data are called modalities). All embeddings, regardless of the modality, follow the same principle: similar data sit close together in the embedding space. This means that images with similar content, songs with similar sounds, or movies with similar themes will occupy nearby points in their respective spaces.
We can even create multi-modal embeddings, where different modalities, such as images and text, are mapped (or aligned) to the same embedding space. This is the underlying idea behind Generative AI applications that generate images from text descriptions ("a red apple on a wooden table"), or the reverse, where we provide an image and generate a text caption based on its contents.
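As a sketch of what this looks like in code, the sentence-transformers library also ships CLIP models that embed images and text into one shared space (the image file here is a hypothetical example, and cosine_similarity is the function from earlier):

from PIL import Image
from sentence_transformers import SentenceTransformer

# CLIP maps images and text into the same embedding space.
clip = SentenceTransformer("clip-ViT-B-32")
image_embedding = clip.encode(Image.open("apple.jpg"))  # hypothetical image
text_embedding = clip.encode("a red apple on a wooden table")
print(cosine_similarity(image_embedding, text_embedding))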
So while I'm personally most excited about text embeddings, embeddings are truly at the heart of the recent advancements in machine learning. And to get a more complete understanding of how embeddings work and the transformations they enable, we have to take a closer look at embedding models, which we'll do in the next post.
Thank you to Ian Johnson for his valuable feedback on this post.