|
--- |
|
license: apache-2.0 |
|
pipeline_tag: feature-extraction |
|
tags: |
|
- embedding |
|
- text embedding |
|
--- |
|
|
|
# flan-ul2-text-encoder |
|
|
|
The encoder from [flan-ul2](https://huggingface.co/google/flan-ul2). The model is 17.44 GB in `bfloat16` precision.
|
|
|
|
|
## Basic usage
|
|
|
> Note: this is one way of using the encoder, not the only way. Suggestions and ideas are welcome.
|
|
|
This guide provides a small set of functions that compute embeddings with the encoder and calculate the cosine similarity between the embeddings of different texts.
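
The snippets below assume the following imports are in scope. They are collected here for convenience and simply mirror what the functions use:

```python
from typing import List, Tuple

import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer
```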
|
|
|
## Functions |
|
|
|
### load_model_and_tokenizer |
|
|
|
<details> |
|
<summary><b>Details</b></summary> |
|
|
|
This function loads the model and tokenizer based on the given model name. It returns a tuple containing the loaded model and tokenizer. |
|
|
|
```python |
|
def load_model_and_tokenizer(model_name: str) -> Tuple[AutoModel, AutoTokenizer]: |
|
""" |
|
Load the model and tokenizer based on the given model name. |
|
|
|
Args: |
|
model_name (str): The name of the model to be loaded. |
|
|
|
Returns: |
|
Tuple[AutoModel, AutoTokenizer]: The loaded model and tokenizer. |
|
""" |
|
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto") |
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
model.eval() # Deactivate Dropout |
|
return model, tokenizer |
|
``` |
|
|
|
</details> |
|
|
|
### get_embeddings |
|
|
|
This function gets the embeddings for the given texts using the provided model and tokenizer. It returns the calculated embeddings. |
|
|
|
<details> |
|
<summary><b>Details</b></summary> |
|
|
|
|
|
```python |
|
def get_embeddings(model: AutoModel, tokenizer: AutoTokenizer, texts: List[str]) -> torch.Tensor: |
|
""" |
|
Get the embeddings for the given texts using the provided model and tokenizer. |
|
|
|
Args: |
|
model (AutoModel): The model to be used for getting embeddings. |
|
tokenizer (AutoTokenizer): The tokenizer to be used for tokenizing the texts. |
|
texts (List[str]): The texts for which embeddings are to be calculated. |
|
|
|
Returns: |
|
torch.Tensor: The calculated embeddings. |
|
""" |
|
# Tokenize input texts |
|
batch_tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt") |
|
|
|
# Get the embeddings |
|
with torch.no_grad(): |
|
last_hidden_state = model(**batch_tokens, output_hidden_states=True, return_dict=True).last_hidden_state |
|
|
|
# Get weights |
|
weights = ( |
|
torch.arange(start=1, end=last_hidden_state.shape[1] + 1) |
|
.unsqueeze(0) |
|
.unsqueeze(-1) |
|
.expand(last_hidden_state.size()) |
|
.float().to(last_hidden_state.device) |
|
) |
|
|
|
# Get attn mask |
|
input_mask_expanded = ( |
|
batch_tokens["attention_mask"] |
|
.unsqueeze(-1) |
|
.expand(last_hidden_state.size()) |
|
        .float().to(last_hidden_state.device)  # keep on the same device as the hidden states
|
) |
|
|
|
# Perform weighted mean pooling across seq_len: bs, seq_len, hidden_dim -> bs, hidden_dim |
|
sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1) |
|
sum_mask = torch.sum(input_mask_expanded * weights, dim=1) |
|
|
|
embeddings = sum_embeddings / sum_mask |
|
|
|
return embeddings |
|
``` |
|
|
|
</details> |
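
For reference, the position-weighted mean pooling performed above corresponds to the following, where `L` is the padded sequence length, `m_i` is the attention-mask value at position `i`, and `h_i` is the encoder's last hidden state at position `i`:

$$
v = \frac{\sum_{i=1}^{L} i \cdot m_i \cdot h_i}{\sum_{i=1}^{L} i \cdot m_i}
$$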
|
|
|
### calculate_cosine_similarity |
|
|
|
This function calculates and prints the cosine similarity between the first text and all other texts. It does not return anything. |
|
|
|
<details> |
|
<summary><b>Details</b></summary>
|
|
|
|
|
```python |
|
def calculate_cosine_similarity(embeddings: torch.Tensor, texts: List[str]) -> None: |
|
""" |
|
Calculate and print the cosine similarity between the first text and all other texts. |
|
|
|
Args: |
|
embeddings (torch.Tensor): The embeddings for the texts. |
|
texts (List[str]): The texts for which cosine similarity is to be calculated. |
|
""" |
|
# Calculate cosine similarities |
|
for i in range(1, len(embeddings)): |
|
        # scipy's cosine expects CPU float arrays, so cast the bfloat16 embeddings first
        cosine_sim = 1 - cosine(embeddings[0].float().cpu(), embeddings[i].float().cpu())
|
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[i], cosine_sim)) |
|
``` |
|
|
|
</details> |
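
If you would rather avoid the `scipy` dependency, the same comparison can be done directly in PyTorch. This is an optional sketch (the helper name `calculate_cosine_similarity_torch` is not part of the original code):

```python
import torch.nn.functional as F

def calculate_cosine_similarity_torch(embeddings: torch.Tensor, texts: List[str]) -> None:
    """Same comparison as above, using torch.nn.functional.cosine_similarity."""
    for i in range(1, len(embeddings)):
        # cast to float32 for numerical stability with bfloat16 embeddings
        cosine_sim = F.cosine_similarity(embeddings[0].float(), embeddings[i].float(), dim=0).item()
        print(f'Cosine similarity between "{texts[0]}" and "{texts[i]}" is: {cosine_sim:.3f}')
```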
|
|
|
## Usage |
|
|
|
To use these functions, you need `torch`, `transformers`, `accelerate` (required for `device_map="auto"`), and `scipy` installed. You can install them with pip:
|
|
|
```bash |
|
pip install torch transformers accelerate scipy
|
``` |
|
|
|
Then, you can use the functions in your Python code as needed. For example: |
|
|
|
```python |
|
model_name = "pszemraj/flan-ul2-text-encoder" |
|
model, tokenizer = load_model_and_tokenizer(model_name) |
|
|
|
texts = [ |
|
"deep learning", |
|
"artificial intelligence", |
|
"deep diving", |
|
"artificial snow", |
|
] |
|
|
|
embeddings = get_embeddings(model, tokenizer, texts) |
|
calculate_cosine_similarity(embeddings, texts) |
|
``` |
|
|
|
This will print the cosine similarity between the first text and all other texts in the `texts` list. |
|
|
|
<details> |
|
<summary><b>Customization</b></summary> |
|
|
|
You can customize the texts by modifying the `texts` list. You can also use a different model by changing the `model_name` variable. |
|
|
|
</details> |
|
|
|
## References |
|
|
|
This guide is based on the examples provided in the [SGPT repository](https://github.com/Muennighoff/sgpt#symmetric-semantic-search-be).