---
license: apache-2.0
pipeline_tag: feature-extraction
tags:
- embedding
- text embedding
---
# flan-ul2-text-encoder
The encoder from [flan-ul2](https://huggingface.co/google/flan-ul2). This model is 17.44 GB in `bfloat16` precision.
## Basic usage
> Note: this is *one* way of using the encoder, not the only way. Suggestions and ideas are welcome!
This guide provides a small set of functions for computing text embeddings with this encoder and comparing texts via cosine similarity.
## Functions
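The snippets below assume the following imports:
```python
from typing import List, Tuple

import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer
```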
### load_model_and_tokenizer
<details>
<summary><b>Details</b></summary>
This function loads the model and tokenizer based on the given model name. It returns a tuple containing the loaded model and tokenizer.
```python
def load_model_and_tokenizer(model_name: str) -> Tuple[AutoModel, AutoTokenizer]:
    """
    Load the model and tokenizer based on the given model name.
    Args:
        model_name (str): The name of the model to be loaded.
    Returns:
        Tuple[AutoModel, AutoTokenizer]: The loaded model and tokenizer.
    """
    model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model.eval()  # deactivate dropout
    return model, tokenizer
```
</details>
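If you don't have a GPU with enough memory, one option (a sketch, not a tested recipe) is to load the encoder on CPU in full precision, which needs roughly twice the bfloat16 footprint (~35 GB of RAM) and runs much slower:
```python
# Sketch: CPU-only fallback, assuming sufficient system RAM
model = AutoModel.from_pretrained(
    "pszemraj/flan-ul2-text-encoder",
    torch_dtype=torch.float32,  # full precision on CPU
)
tokenizer = AutoTokenizer.from_pretrained("pszemraj/flan-ul2-text-encoder")
model.eval()
```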
### get_embeddings
This function computes embeddings for the given texts by applying weighted mean pooling over the encoder's last hidden state, and returns them as a tensor.
<details>
<summary><b>Details</b></summary>
```python
def get_embeddings(model: AutoModel, tokenizer: AutoTokenizer, texts: List[str]) -> torch.Tensor:
    """
    Get the embeddings for the given texts using the provided model and tokenizer.
    Args:
        model (AutoModel): The model to be used for getting embeddings.
        tokenizer (AutoTokenizer): The tokenizer to be used for tokenizing the texts.
        texts (List[str]): The texts for which embeddings are to be calculated.
    Returns:
        torch.Tensor: The calculated embeddings.
    """
    # Tokenize input texts
    batch_tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # Get the encoder's last hidden state
    with torch.no_grad():
        last_hidden_state = model(**batch_tokens, output_hidden_states=True, return_dict=True).last_hidden_state
    # Position weights 1, 2, ..., seq_len, broadcast to (bs, seq_len, hidden_dim)
    weights = (
        torch.arange(start=1, end=last_hidden_state.shape[1] + 1)
        .unsqueeze(0)
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float()
        .to(last_hidden_state.device)
    )
    # Attention mask expanded to (bs, seq_len, hidden_dim), moved to the same device as the hidden states
    input_mask_expanded = (
        batch_tokens["attention_mask"]
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float()
        .to(last_hidden_state.device)
    )
    # Perform weighted mean pooling across seq_len: bs, seq_len, hidden_dim -> bs, hidden_dim
    sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)
    sum_mask = torch.sum(input_mask_expanded * weights, dim=1)
    embeddings = sum_embeddings / sum_mask
    return embeddings
```
</details>
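Since the encoder is large, embedding a long list of texts in a single call can exhaust memory. A minimal batching wrapper (a hypothetical helper, not part of this repo) could look like:
```python
def get_embeddings_batched(
    model: AutoModel,
    tokenizer: AutoTokenizer,
    texts: List[str],
    batch_size: int = 8,
) -> torch.Tensor:
    """Embed texts in chunks of `batch_size` and concatenate the results."""
    chunks = []
    for i in range(0, len(texts), batch_size):
        chunks.append(get_embeddings(model, tokenizer, texts[i : i + batch_size]))
    return torch.cat(chunks, dim=0)
```
Because the pooling above masks out padding tokens, splitting the texts into chunks with different padded lengths does not change the resulting embeddings.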
### calculate_cosine_similarity
This function calculates and prints the cosine similarity between the first text and all other texts. It does not return anything.
<details>
<summary><b>Details</b></summary>
```python
def calculate_cosine_similarity(embeddings: torch.Tensor, texts: List[str]) -> None:
    """
    Calculate and print the cosine similarity between the first text and all other texts.
    Args:
        embeddings (torch.Tensor): The embeddings for the texts.
        texts (List[str]): The texts for which cosine similarity is to be calculated.
    """
    # Convert to float32 on CPU so scipy/NumPy can handle (possibly bfloat16, GPU) tensors
    embeddings = embeddings.detach().float().cpu()
    # Calculate cosine similarities against the first text
    for i in range(1, len(embeddings)):
        cosine_sim = 1 - cosine(embeddings[0], embeddings[i])
        print('Cosine similarity between "%s" and "%s" is: %.3f' % (texts[0], texts[i], cosine_sim))
```
</details>
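If you prefer to stay in PyTorch and skip the tensor-to-NumPy round trip, here is a sketch of an equivalent using `torch.nn.functional.cosine_similarity`:
```python
import torch.nn.functional as F


def calculate_cosine_similarity_torch(embeddings: torch.Tensor, texts: List[str]) -> None:
    """Same comparison as above, computed entirely in PyTorch."""
    query = embeddings[0].unsqueeze(0).float()         # (1, hidden_dim)
    others = embeddings[1:].float()                    # (n - 1, hidden_dim)
    sims = F.cosine_similarity(query, others, dim=-1)  # (n - 1,)
    for text, sim in zip(texts[1:], sims.tolist()):
        print(f'Cosine similarity between "{texts[0]}" and "{text}" is: {sim:.3f}')
```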
## Usage
To use these functions, you need `torch`, `transformers`, `accelerate` (required for `device_map="auto"`), and `scipy` installed. You can install them with pip:
```bash
pip install torch transformers accelerate scipy
```
Then, you can use the functions in your Python code as needed. For example:
```python
model_name = "pszemraj/flan-ul2-text-encoder"
model, tokenizer = load_model_and_tokenizer(model_name)
texts = [
    "deep learning",
    "artificial intelligence",
    "deep diving",
    "artificial snow",
]
embeddings = get_embeddings(model, tokenizer, texts)
calculate_cosine_similarity(embeddings, texts)
```
This will print the cosine similarity between the first text and all other texts in the `texts` list.
<details>
<summary><b>Customization</b></summary>
You can customize the texts by modifying the `texts` list. You can also use a different model by changing the `model_name` variable.
</details>
## References
This guide is based on the examples provided in the [SGPT repository](https://github.com/Muennighoff/sgpt#symmetric-semantic-search-be).