llama2 forward pass seemingly not working with padded inputs, unless one element in batch is not padded

#13

by joehakim - opened Oct 5, 2023

Discussion

joehakim

Oct 5, 2023

Moved here from this discussion thread: [https://github.com/huggingface/transformers/issues/26601]. Basically this seems to be an issue with padding that only occurs when trust_remote_code=True, so maybe it's related to FlashAttention?

Here's a script to reproduce:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, LlamaTokenizerFast


tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-2-7B-32K-Instruct")

tokenizer = LlamaTokenizerFast.from_pretrained(
    "togethercomputer/Llama-2-7B-32K-Instruct"
)

model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/Llama-2-7B-32K-Instruct",
    trust_remote_code=True,  # the error goes away when this is False
    torch_dtype=torch.float16,
).cuda()

# For comparison, this works in both cases:
# model = MT5ForConditionalGeneration.from_pretrained('google/mt5-xl')

encoded = tokenizer(
    [
        "[INST]\nWrite a poem about cats\n[/INST]\n\n",
        "[INST]\nWrite " + "a poem about" * 400 + " cats\n[/INST]\n\n",
    ],
    return_tensors="pt",
    padding="longest",
).to(model.device)

# Keep only the first batch element, which is padded to the length of the second.
encoded_firstelem = {
    "input_ids": encoded["input_ids"][:1, :],
    "attention_mask": encoded["attention_mask"][:1, :],
}

print(encoded_firstelem)
# {'input_ids': tensor([[    0,     0,     0,  ..., 29962,    13,    13]], device='cuda:0'), 'attention_mask': tensor([[0, 0, 0,  ..., 1, 1, 1]], device='cuda:0')}

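# Works: the full batch, where the second element has no padding.
print(model(**encoded))

# Breaks: only the first element, which is padded.
print(model(**encoded_firstelem))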

mauriceweber

Oct 9, 2023

Hi @joehakim and thanks for reporting this!

I think the error you see when feeding only the first element comes from a mismatch between q_len and max_seqlen_q, because of the unnecessary padding of the first element.

For your specific example, this is caused by the following steps in `modeling_flash_llama.py`:

1. `bsz, q_len, h_size = hidden_states.size()` (L311): this reads the sequence length from the padded input, which is 1215.
2. `unpadded_q, indices_q, cu_seqlens_q, max_seqlen_q = unpad_input(q, attention_mask[:, -q.size(1):])` (L371): here the padding gets removed and your `max_seqlen_q` becomes 18.
3. `attn_output = pad_input(attn_output, indices_q, bsz, max_seqlen_q).reshape(bsz, q_len, h_size)` (L380-382): this is where the error happens, due to the mismatch between `q_len` and `max_seqlen_q`.

So that means you can't process a batch in which the longest actual (unpadded) sequence is shorter than the padded sequence length of the batch.
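
The mismatch described in the steps above can be sketched with plain shape arithmetic. This is only an illustration, not the code in the model repository: `check_reshape` is a hypothetical helper, and it assumes that re-padding yields `bsz * max_seqlen_q * h_size` elements, as the steps above suggest.

```python
def check_reshape(attention_mask, h_size):
    """Compare the element count after re-padding with the count the reshape expects.

    attention_mask: list of rows of 0/1 values, one row per batch element.
    Returns (produced, expected); the reshape can only succeed when they are equal.
    """
    bsz = len(attention_mask)
    q_len = len(attention_mask[0])  # padded length, read from the input shape
    max_seqlen_q = max(sum(row) for row in attention_mask)  # longest unpadded sequence
    produced = bsz * max_seqlen_q * h_size  # elements after re-padding to max_seqlen_q
    expected = bsz * q_len * h_size  # elements needed for reshape(bsz, q_len, h_size)
    return produced, expected


# Numbers from the example above: one row padded to 1215 tokens, 18 real tokens.
only_first = [[0] * (1215 - 18) + [1] * 18]
print(check_reshape(only_first, 4096))  # (73728, 4976640): counts differ

# Full batch: the second row has no padding, so the counts agree.
full_batch = only_first + [[1] * 1215]
print(check_reshape(full_batch, 4096))  # (9953280, 9953280)
```

The hidden size of 4096 is taken from the error message reported later in the thread.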

magicaltommy

Mar 21, 2024

I am encountering the same error, i.e. a mismatch between q_len and max_seqlen_q, which gives:
RuntimeError: shape '[4, 6400, 4096]' is invalid for input of size 14811136

Is there a solution to this issue?
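
The numbers in this error message fit the explanation given earlier in the thread. A small sketch, assuming the input size equals `bsz * max_seqlen_q * h_size`; `implied_max_seqlen` is a hypothetical helper, not part of any library:

```python
def implied_max_seqlen(target_shape, input_size):
    """Recover the re-padded sequence length from a failed reshape.

    target_shape: (bsz, q_len, h_size) as printed in the error message.
    input_size:   number of elements actually available.
    """
    bsz, _q_len, h_size = target_shape
    seqlen, remainder = divmod(input_size, bsz * h_size)
    if remainder:
        raise ValueError("input size is not a multiple of bsz * h_size")
    return seqlen


print(implied_max_seqlen((4, 6400, 4096), 14811136))  # 904
```

Under that assumption, the batch was padded to 6400 tokens while its longest unpadded sequence had 904 tokens.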

ckv

Mar 21, 2024

Hi @mauriceweber - Is there support for batches containing different lengths of unpadded sequences?
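
One possible workaround, following from the explanation earlier in the thread, is to drop the columns that are padding in every row before calling the model, so that the padded length equals the longest unpadded sequence. This is an untested sketch, not a fix confirmed by the maintainers. `shared_left_padding` is a hypothetical helper, and it assumes left padding, as in the `attention_mask` printed in the first post.

```python
def shared_left_padding(attention_mask_rows):
    """Count the leading columns that are padding (0) in every row.

    attention_mask_rows: list of rows of 0/1 values, e.g. attention_mask.tolist().
    Assumes left padding, so each row looks like [0, ..., 0, 1, ..., 1].
    """
    q_len = len(attention_mask_rows[0])
    longest = max(sum(row) for row in attention_mask_rows)  # longest unpadded sequence
    return q_len - longest


mask = [
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
]
k = shared_left_padding(mask)
print(k)  # 2
trimmed = [row[k:] for row in mask]
print(trimmed)  # [[0, 1, 1], [1, 1, 1]]
```

With tensors from the tokenizer, the same offset could be applied as `encoded["input_ids"][:, k:]` and `encoded["attention_mask"][:, k:]`, with `k = shared_left_padding(encoded["attention_mask"].tolist())`. After trimming, at least one row has no padding, which is the case reported as working in the first post.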
