---
license: apache-2.0
pipeline_tag: feature-extraction
tags:
- embedding
- text embedding
---

# flan-ul2-text-encoder

The encoder from [flan-ul2](https://huggingface.co/google/flan-ul2), for use as a text embedding model. The weights are 17.44 GB in `bfloat16` precision.


## Basic usage

> Note: this is one way of using the encoder, not the only way. Suggestions and ideas are welcome.

This guide provides a small set of functions for computing the cosine similarity between the embeddings of different texts, where the embeddings come from this encoder.
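
The snippets below assume the following imports, which cover everything the functions use (`torch`, `transformers`, and `scipy` are the third-party dependencies):

```python
from typing import List, Tuple

import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer
```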

## Functions

### load_model_and_tokenizer

<details>
<summary><b>Details</b></summary>

Loads the model and tokenizer for the given model name and returns them as a tuple. The model is loaded in `bfloat16` with `device_map="auto"`, and dropout is disabled via `model.eval()`.

```python
def load_model_and_tokenizer(model_name: str) -> Tuple[AutoModel, AutoTokenizer]:
    """
    Load the model and tokenizer based on the given model name.

    Args:
        model_name (str): The name of the model to be loaded.

    Returns:
        Tuple[AutoModel, AutoTokenizer]: The loaded model and tokenizer.
    """
    model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model.eval()  # Deactivate Dropout
    return model, tokenizer
```

</details>

### get_embeddings

Computes embeddings for the given texts with the provided model and tokenizer, using a position-weighted mean pooling over the encoder's last hidden state, and returns them as a tensor.

<details>
<summary><b>Details</b></summary>


```python
def get_embeddings(model: AutoModel, tokenizer: AutoTokenizer, texts: List[str]) -> torch.Tensor:
    """
    Get the embeddings for the given texts using the provided model and tokenizer.

    Args:
        model (AutoModel): The model to be used for getting embeddings.
        tokenizer (AutoTokenizer): The tokenizer to be used for tokenizing the texts.
        texts (List[str]): The texts for which embeddings are to be calculated.

    Returns:
        torch.Tensor: The calculated embeddings.
    """
    # Tokenize input texts
    batch_tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    # Get the embeddings
    with torch.no_grad():
        last_hidden_state = model(**batch_tokens, output_hidden_states=True, return_dict=True).last_hidden_state

    # Position weights 1..seq_len (weighted mean pooling, as in SGPT)
    weights = (
        torch.arange(start=1, end=last_hidden_state.shape[1] + 1)
        .unsqueeze(0)
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float().to(last_hidden_state.device)
    )

    # Expand the attention mask to the hidden dimension, on the same device as the hidden states
    input_mask_expanded = (
        batch_tokens["attention_mask"]
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float()
        .to(last_hidden_state.device)
    )

    # Perform weighted mean pooling across seq_len: bs, seq_len, hidden_dim -> bs, hidden_dim
    sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)
    sum_mask = torch.sum(input_mask_expanded * weights, dim=1)

    embeddings = sum_embeddings / sum_mask

    return embeddings
```

</details>
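
A quick sanity check of the output shape (a sketch; it reuses `load_model_and_tokenizer` from above):

```python
model, tokenizer = load_model_and_tokenizer("pszemraj/flan-ul2-text-encoder")
embeddings = get_embeddings(model, tokenizer, ["deep learning", "deep diving"])
print(embeddings.shape)  # torch.Size([2, hidden_size]); hidden_size is 4096 for this encoder
```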

### calculate_cosine_similarity

This function calculates and prints the cosine similarity between the first text and all other texts. It does not return anything.

<details>
<summary><b>Details</b></summary>


```python
def calculate_cosine_similarity(embeddings: torch.Tensor, texts: List[str]) -> None:
    """
    Calculate and print the cosine similarity between the first text and all other texts.

    Args:
        embeddings (torch.Tensor): The embeddings for the texts.
        texts (List[str]): The texts for which cosine similarity is to be calculated.
    """
    # SciPy needs float32/float64 NumPy-compatible tensors on CPU (not bfloat16 or CUDA)
    embeddings = embeddings.float().cpu()

    # Calculate cosine similarities against the first text
    for i in range(1, len(embeddings)):
        cosine_sim = 1 - cosine(embeddings[0], embeddings[i])
        print(f'Cosine similarity between "{texts[0]}" and "{texts[i]}" is: {cosine_sim:.3f}')
```

</details>
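
If you would rather avoid the SciPy dependency, the same comparison can be done entirely in PyTorch. This is a sketch, not part of the original guide; `calculate_cosine_similarity_torch` is a hypothetical helper name:

```python
import torch.nn.functional as F

def calculate_cosine_similarity_torch(embeddings: torch.Tensor, texts: List[str]) -> None:
    """Same comparison as above, computed with torch.nn.functional.cosine_similarity."""
    reference = embeddings[0].float()
    for text, emb in zip(texts[1:], embeddings[1:]):
        # cosine_similarity over the hidden dimension returns a scalar tensor
        sim = F.cosine_similarity(reference, emb.float(), dim=0).item()
        print(f'Cosine similarity between "{texts[0]}" and "{text}" is: {sim:.3f}')
```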

## Usage

To use these functions, you need the `transformers`, `torch`, and `scipy` libraries installed, plus `accelerate` for `device_map="auto"`. You can install them with pip:

```bash
pip install transformers torch accelerate scipy
```

Then, you can use the functions in your Python code as needed. For example:

```python
model_name = "pszemraj/flan-ul2-text-encoder"
model, tokenizer = load_model_and_tokenizer(model_name)

texts = [
    "deep learning",
    "artificial intelligence",
    "deep diving",
    "artificial snow",
]

embeddings = get_embeddings(model, tokenizer, texts)
calculate_cosine_similarity(embeddings, texts)
```

This will print the cosine similarity between the first text and all other texts in the `texts` list.

<details>
<summary><b>Customization</b></summary>

You can customize the texts by modifying the `texts` list. You can also use a different model by changing the `model_name` variable.

</details>
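
For longer lists of texts, embedding everything in one call can exhaust memory. Below is a minimal batching sketch; the `get_embeddings_batched` helper and its `batch_size` parameter are assumptions, not part of the original guide:

```python
def get_embeddings_batched(
    model: AutoModel,
    tokenizer: AutoTokenizer,
    texts: List[str],
    batch_size: int = 8,
) -> torch.Tensor:
    """Embed texts in chunks of batch_size and concatenate the results."""
    chunks = [
        get_embeddings(model, tokenizer, texts[i : i + batch_size])
        for i in range(0, len(texts), batch_size)
    ]
    return torch.cat(chunks, dim=0)
```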

## References

This guide is based on the examples provided in the [sGPT repository](https://github.com/Muennighoff/sgpt#symmetric-semantic-search-be).