How LLMs understand words

An introduction to Embedding Vectors

It is fascinating to watch LLMs in action: understanding what we say, generating new content, recalling facts from enormous amounts of training data, and performing various other complex tasks. But aren't these just statistical models? And weren't we taught that computers cannot understand what humans say? So how do these LLMs understand us and do their job so well? Let us break it down in this article.

What happens to the input we provide to these models?

The very first thing an LLM does with its input is tokenization. Tokenization can be done using various techniques, such as word-level, character-level, and subword tokenization. LLMs usually prefer subword tokenization, most commonly byte-pair encoding (BPE) applied at the byte level, for its adaptability and efficiency. This means the entire user input is broken into a bunch of little pieces, called tokens, which the model can then process.

This is how the entire text is converted into tokens
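
To make this concrete, here is a minimal sketch of tokenization in Python using the open-source tiktoken library (OpenAI's published BPE tokenizer); the exact token boundaries and IDs depend on the tokenizer used:

```python
import tiktoken  # pip install tiktoken

# Load the byte-pair-encoding tokenizer used by GPT-2
enc = tiktoken.get_encoding("gpt2")

text = "What is the embedding matrix?"
token_ids = enc.encode(text)                   # text -> list of integer token IDs
pieces = [enc.decode([i]) for i in token_ids]  # decode each ID back to its text piece

print(token_ids)  # integer IDs, one per token
print(pieces)     # small pieces of text; boundaries depend on the tokenizer
```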

What’s next for these tokens?

Now, after tokenizing the user input, each token is converted into its corresponding embedding vector. Wait, hold on a second: what exactly is an "embedding vector"? An embedding vector is simply a long list of numerical values. These vectors serve as the numerical representations of the tokens, allowing the model to work with them. But now the question arises: how are these vectors assigned to the tokens?

Screenshot of vector examples.

Image source: OpenAI

Embedding Matrix and Embedding Vectors

These terms might sound daunting, but they are very easy to understand. Before we get to embedding vectors, let us understand what an embedding matrix is.

An embedding matrix is just a large two-dimensional matrix of numbers (initially random) in which each row corresponds to a token and each column corresponds to one dimension of that token's vector representation. (Some presentations, including 3Blue1Brown's, transpose this and store one token per column; the idea is the same.)
Simplifying this:

  1. Embedding here refers to the learned mapping from a discrete token (like a word) to a numerical representation.

  2. Matrix here just refers to a two-dimensional array of numbers (rows and columns).

Image source: 3Blue1Brown

As we can see, every token or word in the vocabulary gets its own series of numerical values, one value per dimension. That entire series of values is called an embedding vector, and it represents the given token in numerical form.

Embedding vector

Image source: 3Blue1Brown
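
As a rough sketch (using NumPy and made-up toy sizes, not a real model's), an embedding matrix is just a 2-D array with one row per token:

```python
import numpy as np

vocab_size = 6       # how many tokens our toy model knows
embedding_dim = 4    # how many values represent each token

# Start with small random values, just like a real model does before training.
embedding_matrix = np.random.randn(vocab_size, embedding_dim) * 0.02

print(embedding_matrix.shape)  # (6, 4) -> vocab_size x embedding_dim
print(embedding_matrix[3])     # row 3 is the embedding vector of the token with ID 3
```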

The size of this huge matrix is determined by the vocabulary size and the embedding dimension of the model, where:

  • Vocabulary Size: It is the number of unique tokens (words, subwords, or characters) that the model knows. The vocabulary size determines how many rows the matrix will have.

  • Embedding Dimension: It is the length of the vector that represents each token. It defines how many values (features) each token embedding will have and determines how many columns the matrix will have. A larger embedding dimension gives the model a richer multi-dimensional space in which to capture each token's meaning and relationships.

So,
Embedding matrix size = Vocabulary size × Embedding dimension

For example, GPT-3 has a vocabulary size of 50,257 tokens and an embedding dimension of 12,288, so its embedding matrix alone contains 50,257 × 12,288 = 617,558,016 values.

The values in the embedding matrix are initially randomized, but they are gradually adjusted during training on a large dataset. Over time, as the model learns, these values are fine-tuned to capture meaningful relationships between tokens based on the patterns and context in the data.
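
In a framework like PyTorch, this learnable lookup table is exactly what an embedding layer is. Here is a minimal sketch with toy sizes (the GPT-3 numbers appear only in a comment, since the real matrix would be enormous):

```python
import torch
import torch.nn as nn

vocab_size = 6
embedding_dim = 4

# A learnable lookup table: one row (embedding vector) per token.
embedding = nn.Embedding(vocab_size, embedding_dim)

# Number of values in the embedding matrix = vocab_size * embedding_dim.
print(sum(p.numel() for p in embedding.parameters()))  # 24
# For GPT-3 this would be 50_257 * 12_288 = 617_558_016.

# During training, gradients flow into embedding.weight,
# so these initially random rows gradually become meaningful.
token_ids = torch.tensor([0, 3, 5])
vectors = embedding(token_ids)       # shape: (3, 4), one vector per token
print(vectors.shape)
```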

Let us capture everything again with an example:
User Input: "What is the embedding matrix?"
Tokens upon tokenization: ["What", "is", "the", "embedding", "matrix", "?"]
Embedding Matrix

Token          Dim1    Dim2    Dim3    Dim4
"What"         0.12    0.34   -0.23    0.45
"is"          -0.22    0.44    0.19   -0.11
"the"          0.11    0.13   -0.34    0.12
"embedding"   -0.34    1.23    0.75   -0.98
"matrix"       0.45   -0.12    0.67    0.21
"?"            0.09   -0.11    0.31   -0.33

Embedding vectors for the required tokens from the embedding matrix:

  • "What": [0.12, 0.34, -0.23, 0.45]

  • "is": [-0.22, 0.44, 0.19, -0.11]

  • "the": [0.11, 0.13, -0.34, 0.12]

  • "embedding": [-0.34, 1.23, 0.75, -0.98]

  • "matrix": [0.45, -0.12, 0.67, 0.21]

  • "?": [0.09, -0.11, 0.31, -0.33]

And this is how a model represents what the user is saying in a numerical form it can actually work with.

That’s it?

Well, yes! After being looked up in the embedding matrix, these embedding vectors are passed through the Transformer architecture. There, the model captures the relationships between words and their context, adjusting the vectors as they move through its layers, which enables it to predict the next word or generate meaningful responses based on the input.
We will explore the workings of the transformer architecture in the upcoming blogs.

💡
A huge shoutout to OpenAI and 3Blue1Brown for their insightful videos and articles, which have been instrumental in helping me gain a deeper understanding of the under-the-hood workings of large language models.

Conclusion

Thank you for reading this blog! I hope you found it informative and gained some valuable insights into LLMs. Please feel free to share your thoughts in the comments section…