Explore the Interpretability of Embeddings

Published August 23, 2025

This is, I think, an interesting and somewhat profound observation.

Word embeddings are relatively abstract, but behind these abstractions lie some “concrete” meanings. This article uses a few simple computations (transformations) to make certain dimensions/attributes of embeddings more tangible. Specifically, it demonstrates how to find a direction in the embedding space that represents the “animal” attribute. If you’re interested, you can use the same simple method to find the attribute directions you care about.

TL;DR

Given a specific word-embedding implementation, you can take a set of words that share a common attribute and compute the average direction of their embedding vectors. This average direction can then represent that common attribute.

For example, if you take a set of “animals” and compute the average direction of their embedding vectors, that direction would represent the “animal” attribute.
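In code, the whole idea fits in a few lines. Here is a minimal sketch using the sentence-transformers library and the all-MiniLM-L6-v2 model that the rest of this article uses:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Words sharing a common attribute; the mean of their embeddings
# approximates the direction of that attribute.
words = ["tiger", "lion", "elephant"]
attribute_direction = np.mean(model.encode(words), axis=0)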

Overview

If you’ve ever tried to understand the meaning of each dimension in an embedding, you’ve probably heard this idea: each dimension of an embedding can be interpreted as representing some attribute of the word, such as “gender” or “royalty-relatedness.” This is the kind of explanation often given for the classic example man – woman = king – queen.

However, when you actually get a word’s embedding — say, with 768 dimensions — it often seems that no single dimension has such a clear attribute meaning. In reality, these attribute meanings do exist, but the corresponding attribute directions do not align with the axes of the “standard basis.”

So, if these directions exist, how can we find them? This article presents and verifies a very simple method to quickly identify the direction corresponding to a specific attribute and to perform some validation. This can greatly deepen your understanding of embeddings.

Finding a Direction of Interest

This section uses the example of finding the “animal” attribute direction to illustrate how to identify and validate such a direction.

List the Most Representative Words

Let’s think about the problem this way: if there were a direction that represents the “animal” attribute of a word, what direction would that be? Here, using the sentence embeddings provided by the all-MiniLM-L6-v2 model as an example, we will explore how to determine which direction in the embedding’s vector space is most likely to represent the “animal” attribute. The specific method is described as follows:

First, identify $n$ words that are considered most representative of the “animal” attribute; here we take $n = 50$.

Then, compute the average direction avg_vector of these $n$ words (avg_animals_embeddings in the code below); this direction is taken to be the one we are looking for.

Here are the 50 animals used in this example:

import numpy as np
from sentence_transformers import SentenceTransformer

# Load the sentence-embedding model used throughout this article.
model = SentenceTransformer("all-MiniLM-L6-v2")

animals = [
    "tiger", "lion", "elephant", "giraffe", "zebra",
    "rhinoceros", "hippopotamus", "crocodile", "monkey",
    "panda", "koala", "kangaroo", "whale", "dolphin",
    "seal", "penguin", "shark", "snake", "lizard",
    "turtle", "frog", "butterfly", "bee", "ant", "eagle",
    "sparrow", "pigeon", "parrot", "owl", "duck", "chicken",
    "dog", "cat", "pig", "cow", "sheep", "horse", "donkey",
    "rabbit", "squirrel", "fox", "wolf", "bear", "deer",
    "hedgehog", "bat", "mouse", "chameleon", "snail", "jellyfish"
]

Compute the average direction of the embeddings

This average direction (vector) is the “animal” attribute direction we are looking for.

animals_embeddings = model.encode(animals)                    # shape: (50, 384)
avg_animals_embeddings = np.mean(animals_embeddings, axis=0)  # shape: (384,)

Validate this “direction”

Next, select two sets of words: one set that is considered highly related to “animals,” and another that is unrelated to animals. Then compute the projections of both sets onto the avg_vector direction obtained above, and check whether the values match expectations.

The two sets of words chosen here are:

  • Highly related to animals: “Camel”, “Gorilla”, “Cheetah”
  • Unrelated to animals: “Dream”, “Chair”, “Mathematics”

Compute the projections and visualize them

The code is as follows:

animals_words    = ["Camel", "Gorilla", "Cheetah"]
un_animals_words = ["Dream", "Chair", "Mathematics"]

results = {}  # word -> projection score onto the "animal" direction
for word_list in (animals_words, un_animals_words):
    projection_scores = np.dot(model.encode(word_list),
                               avg_animals_embeddings)
    results.update(dict(zip(word_list, projection_scores)))

for word, score in results.items():
    print(f"'{word}': {score:.4f}")

# First 10 components of the average vector, for reference
print(np.round(avg_animals_embeddings[:10], 4))

The projection results are:

'Camel': 0.3887
'Gorilla': 0.4186
'Cheetah': 0.3797
'Dream': 0.2450
'Chair': 0.2823
'Mathematics': 0.1972

Plot the projections of the above two sets of words on the real number axis:

[Figure: projections of the two word groups plotted on the real number axis]
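One way to produce such a plot is the following minimal matplotlib sketch, using the results dict computed above:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 2))
ax.axhline(0, color="gray", linewidth=0.5)
for word, score in results.items():
    ax.plot(score, 0, "o")
    ax.annotate(word, (score, 0), textcoords="offset points",
                xytext=(0, 8), rotation=45)
ax.get_yaxis().set_visible(False)
ax.set_xlabel("projection onto the 'animal' direction")
plt.tight_layout()
plt.show()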

It is quite clear that the avg_vector direction, to some extent, represents a word’s “animal” attribute: words related to animals have large projections in this direction, while unrelated words have small projections.
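A side note on the raw numbers: since every word is projected onto the same fixed vector, only the relative ordering of the scores matters. If you prefer scores on a fixed scale, you can normalize both sides to obtain a cosine similarity in [-1, 1]. A small sketch:

def cosine_to_direction(word_vecs, direction):
    # Normalize the word embeddings and the attribute direction
    # so that the score becomes a cosine similarity.
    direction = direction / np.linalg.norm(direction)
    norms = np.linalg.norm(word_vecs, axis=1, keepdims=True)
    return (word_vecs / norms) @ direction

print(cosine_to_direction(model.encode(animals_words),
                          avg_animals_embeddings))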

Overview of the Principle


In fact, the “average vector” (centroid of word embeddings) of a set of words can, to some extent, represent the “semantic center” of that set. If the words in the set share certain commonalities, then this average vector is likely to represent that commonality.

In the example above, the given set of words deliberately consists entirely of animal names. Therefore, this “average vector” is quite likely to represent the “animal” attribute in the vector space.

Mathematical Derivation

Consider the problem this way: given $n$ vectors $\alpha_1, \dots, \alpha_n$, find a unit vector $\xi$ such that the sum of the projection values of these $n$ vectors in the direction of $\xi$ is maximized.

Let $\bar{\alpha} = \frac{1}{n}\sum_{i=1}^{n}\alpha_i$.

The objective function is

$$S = \sum_{i=1}^{n}(\alpha_i \cdot \xi) = \Big(\sum_{i=1}^{n}\alpha_i\Big) \cdot \xi = n\,\bar{\alpha} \cdot \xi = n\,\|\bar{\alpha}\|\,\|\xi\|\cos\theta$$

Here, $n$ and $\bar{\alpha}$ are given, and $\|\xi\| = 1$, so $S$ reaches its maximum when $\cos\theta$ does.

That is, $S$ is maximized when $\theta = 0$; in other words, when $\xi$ has the same direction as $\bar{\alpha}$, i.e., $\xi = \frac{\bar{\alpha}}{\|\bar{\alpha}\|}$, the sum of the projection values of all the vectors is maximized.
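This conclusion is easy to sanity-check numerically: among random unit directions, none should beat the normalized mean direction on the summed projections. A small sketch, using synthetic vectors whose shared offset stands in for a common attribute:

import numpy as np

rng = np.random.default_rng(0)
# 50 synthetic vectors with a shared offset playing the role
# of a common attribute.
vectors = rng.normal(size=(50, 384)) + 1.0

xi_star = vectors.mean(axis=0)
xi_star /= np.linalg.norm(xi_star)
best_sum = (vectors @ xi_star).sum()

# No random unit direction beats the normalized mean direction.
for _ in range(1000):
    xi = rng.normal(size=384)
    xi /= np.linalg.norm(xi)
    assert (vectors @ xi).sum() <= best_sum + 1e-9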

Projection Calculation

The formula for the projection length of a vector $\alpha$ onto the direction of $\beta$ is:

$$\operatorname{proj} = \frac{\alpha \cdot \beta}{\|\beta\|}$$

The proof is fairly straightforward and will not be elaborated here.
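In numpy this is a one-line helper. Note that the validation code above skips the division by $\|\beta\|$: it is the same positive constant for every word, so it does not change the ordering of the scores.

import numpy as np

def projection_length(alpha: np.ndarray, beta: np.ndarray) -> float:
    # Length of the projection of alpha onto the direction of beta.
    return float(np.dot(alpha, beta) / np.linalg.norm(beta))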

A Deeper Understanding

Modern embedding algorithms are trained on real-world corpora and reflect the correspondence between “language” and the real world in human cognition. In human cognition, the world has “dimensions.” A straightforward example is that we categorize words as “positive” or “negative.” Other possible dimensions might include animal-relatedness, emotional intensity, part of speech, and so on. But how many such “cognitive” dimensions actually exist in human understanding? This is unknown, and various embedding algorithms are essentially attempts to describe these dimensions in a quantitative way.

However, in an actual trained embedding implementation (say, a 768-dimensional embedding), the standard-basis unit vectors are unlikely to align exactly with the above human “cognitive” dimensions. If we denote the basis vectors of the trained embedding as $\alpha_1, \dots, \alpha_n$, and the dimensions of human cognition as $\beta_1, \dots, \beta_n$, then there exists a transformation matrix $T$ that performs the change of basis between these two bases.
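To make the change-of-basis idea concrete, here is a purely hypothetical sketch: if we had an orthonormal matrix B whose columns were the “cognitive” directions, the coordinates of an embedding v in that basis would simply be Bᵀv. No such matrix is known for real models; the B below is random and only illustrates the algebra:

import numpy as np

rng = np.random.default_rng(1)
d = 4  # toy dimensionality

# Hypothetical orthonormal "cognitive" basis (columns of B).
B, _ = np.linalg.qr(rng.normal(size=(d, d)))

v = rng.normal(size=d)        # an embedding in the standard basis
x = B.T @ v                   # its coordinates in the cognitive basis
assert np.allclose(B @ x, v)  # the change of basis round-trips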

Reality, however, is not so ideal. The embedding space does provide an orthogonal basis, but human cognition rarely organizes concepts into such orthogonal bases. For example, words with the “animal” attribute may also carry an “emotional” attribute: a phrase like “bear in the morning” conveys a certain emotional tone in addition to naming an animal.

Although it is difficult to identify an orthogonal “basis” in cognition, it is still possible to find a specific attribute direction using the method described in this article. This is precisely both the limitation and the value of the method presented here.

Additional Notes

In this article, the term Word Embedding generally refers to the Token Embedding within a Sentence Embedding. Here, there is no need to distinguish between the two.

In reality, the situation is more complex. For example, the “animal” attribute discussed in this article is only the “animal” attribute as represented by these 50 specific words; there is no precise notion of a “true” animal attribute. In human language, “animal” is an abstract concept without a digitized, mathematically exact definition.

For the complete implementation code, see: embedding_research_01.py
