---
license: other
license_name: jamba-open-model-license
license_link: https://www.ai21.com/jamba-open-model-license/
---

# Model Information

Jamba Mini 1.7 brings a new set of improvements to our Jamba open model family. It builds on the novel SSM-Transformer hybrid architecture, 256K context window, and efficiency gains of previous versions, while improving grounding and instruction following.

## Key Improvements

* **Grounding**: Jamba Mini 1.7 provides more complete and accurate answers, grounded fully in the given context.
* **Instruction following**: Jamba Mini 1.7 improves on steerability, following user instructions more reliably.

## Use Cases

Jamba’s long-context efficiency, contextual faithfulness, and steerability make it ideal for a variety of business applications and industries, such as:

* **Finance**: Investment research, digital banking support chatbots, M&A due diligence.
* **Healthcare**: Procurement (RFP creation & response review), medical publication and report generation.
* **Retail**: Brand-aligned product description generation, conversational AI.
* **Education & Research**: Personalized chatbot tutors, grant applications.

The models are released under the [Jamba Open Model License](https://www.ai21.com/jamba-open-model-license/), a permissive license allowing full research use and commercial use under the license terms. If you need to license the model for your specific needs, [talk to us](https://www.ai21.com/contact-sales/).

## Model Details

- **Developed by:** [AI21](https://www.ai21.com)
- **Model type:** Joint Attention and Mamba (Jamba)
- **License:** [Jamba Open Model License](https://www.ai21.com/licenses/jamba-open-model-license)
- **Context length:** 256K
- **Knowledge cutoff date:** August 22nd, 2024
- **Supported languages:** English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic and Hebrew

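If you want to see the hybrid attention/Mamba layout and the 256K context window for yourself, you can inspect the model config. A minimal sketch follows; the field names come from the `transformers` JambaConfig and are shown for illustration, so verify them against the config you actually download:

```python
from transformers import AutoConfig

# Only the small config.json is downloaded here, not the model weights.
config = AutoConfig.from_pretrained("ai21labs/AI21-Jamba-1.7-Mini")

# Field names follow the transformers JambaConfig and may differ between
# releases, hence the defensive getattr calls.
print("context window:", getattr(config, "max_position_embeddings", "n/a"))
print("hidden layers:", getattr(config, "num_hidden_layers", "n/a"))
print("attention layer period / offset:",
      getattr(config, "attn_layer_period", "n/a"),
      getattr(config, "attn_layer_offset", "n/a"))
```
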
## Grounding and instruction-following improvements

| Category      | Benchmark | Jamba Mini 1.6 | Jamba Mini 1.7 |
|---------------|:---------:|:--------------:|:--------------:|
| Grounding     | FACTS     | 0.727          | 0.790          |
| Steerability  | IFEval    | 0.68           | 0.76           |

## Usage

Find step-by-step instructions on how to privately deploy Jamba:

<details>
<summary><strong>Run the model with vLLM</strong></summary>

The recommended way to perform efficient inference with Jamba Mini 1.7 is using [vLLM](https://docs.vllm.ai/en/latest/). First, make sure to install vLLM (version 0.5.4 or higher is required):

```bash
pip install "vllm>=0.5.4"
```

In the example below, `number_gpus` should match the number of GPUs you want to deploy Jamba Mini 1.7 on. A minimum of 2×80GB GPUs is required.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model = "ai21labs/AI21-Jamba-1.7-Mini"
number_gpus = 2

llm = LLM(model=model,
          max_model_len=200*1024,
          tensor_parallel_size=number_gpus)

tokenizer = AutoTokenizer.from_pretrained(model)

messages = [
    {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."},
    {"role": "user", "content": "Hello!"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

sampling_params = SamplingParams(temperature=0.4, top_p=0.95, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

**Output**:

*Seek and you shall find. The path is winding, but the journey is enlightening. What wisdom do you seek from the ancient echoes?*

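Newer vLLM releases also expose an `LLM.chat()` helper that applies the chat template for you, so the explicit tokenizer step above isn't needed. A brief sketch, reusing the `llm`, `messages`, and `sampling_params` objects from the example above, assuming your installed vLLM version includes `LLM.chat()`:

```python
# LLM.chat() applies the model's chat template internally before generating.
# Reuses llm, messages, and sampling_params from the example above.
chat_outputs = llm.chat(messages, sampling_params)
print(chat_outputs[0].outputs[0].text)
```
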
With the default BF16 precision on 2×80GB A100 GPUs and default vLLM configuration, you'll be able to perform inference on prompts up to 200K tokens long. On more than 2×80GB GPUs, you can easily fit the full 256K context.

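For example, with four 80GB GPUs you can open up the full context window simply by raising `max_model_len`; a minimal sketch (the GPU count shown is illustrative):

```python
from vllm import LLM

# With more than 2x80GB GPUs the full 256K-token context fits; four GPUs are
# shown here purely as an illustration.
llm = LLM(model="ai21labs/AI21-Jamba-1.7-Mini",
          max_model_len=256*1024,
          tensor_parallel_size=4)
```
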
> **Note:** vLLM's main branch has some memory utilization improvements specific to the Jamba architecture that allow using the full 256K context length on 2×80GB GPUs. You can build vLLM from source if you wish to make use of them.

</details>

<details>
<summary><strong>Run the model with Transformers</strong></summary>

The following example loads Jamba Mini 1.7 to the GPU in BF16 precision, uses optimized [FlashAttention2](https://github.com/Dao-AILab/flash-attention) and Mamba kernels, and parallelizes the model across multiple GPUs using [`accelerate`](https://huggingface.co/docs/accelerate/index).

> **Note:** In half precision (FP16/BF16), Jamba Mini 1.7 is too large to fit on a single 80GB GPU, so you'll need at least 2 such GPUs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.7-Mini",
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2",
                                             device_map="auto")

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.7-Mini")

messages = [
    {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."},
    {"role": "user", "content": "Hello!"},
]

# Tokenize the chat prompt and move it to the model's device
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Sample with the same settings used in the vLLM example above
outputs = model.generate(input_ids, do_sample=True, temperature=0.4, top_p=0.95, max_new_tokens=100)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

> **Note:** Versions `4.44.0` and `4.44.1` of `transformers` have a bug that restricts the ability to run the Jamba architecture. Make sure you're not using these versions.

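If you want to guard against this in your own scripts, a small version check along these lines can help (a sketch; `packaging` ships as a dependency of `transformers`):

```python
import transformers
from packaging import version

# transformers 4.44.0 and 4.44.1 cannot run the Jamba architecture; fail fast
# if one of those versions is installed.
bad_versions = {version.parse("4.44.0"), version.parse("4.44.1")}
assert version.parse(transformers.__version__) not in bad_versions, (
    f"transformers {transformers.__version__} cannot run Jamba; "
    "please install a different version."
)
```
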
> **Note:** If you're having trouble installing `mamba-ssm` and `causal-conv1d` for the optimized Mamba kernels, you can run Jamba Mini 1.7 without them at the cost of extra latency. To do that, add the kwarg `use_mamba_kernels=False` when loading the model:

```python
model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.7-Mini",
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2",
                                             device_map="auto",
                                             use_mamba_kernels=False)
```

</details>

You can also find all instructions in our [private AI (vLLM) deployment guide](https://docs.ai21.com/docs/vllm).

To get started with our SDK, see the [AI21 Python SDK guide](https://docs.ai21.com/docs/sdk).

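As a taste of what the SDK looks like, here is a minimal chat-completion sketch against AI21's hosted API. It assumes an `AI21_API_KEY` environment variable, and the model identifier shown is illustrative; check the SDK guide for the exact name under which Jamba Mini 1.7 is served:

```python
from ai21 import AI21Client
from ai21.models.chat import ChatMessage

# Reads the API key from the AI21_API_KEY environment variable by default.
client = AI21Client()

# The model name below is illustrative; see the SDK guide for the current
# identifier of Jamba Mini 1.7 on the hosted API.
response = client.chat.completions.create(
    model="jamba-mini",
    messages=[ChatMessage(role="user", content="Summarize the Jamba architecture in one sentence.")],
    max_tokens=100,
)

print(response.choices[0].message.content)
```
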
## Further Documentation

For comprehensive guides and advanced usage:

- [Tokenization Guide](https://docs.ai21.com/docs/tokenization) – Using `ai21-tokenizer`
- [Quantization Guide](https://docs.ai21.com/docs/quantization) – ExpertsInt8, bitsandbytes (see the sketch below)
- [Fine-tuning Guide](https://docs.ai21.com/docs/fine-tuning) – LoRA, qLoRA and full fine-tuning

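For instance, the quantization guide covers ExpertsInt8, an int8 quantization of the MoE expert weights that is supported in vLLM; a minimal sketch (the context length shown is illustrative):

```python
from vllm import LLM

# ExpertsInt8 quantizes the MoE expert weights to int8 at load time, cutting
# memory use enough that Jamba Mini can typically run on a single 80GB GPU.
# The context length below is illustrative.
llm = LLM(model="ai21labs/AI21-Jamba-1.7-Mini",
          max_model_len=100*1024,
          quantization="experts_int8")
```
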
**For more resources to start building, visit our [official documentation](https://docs.ai21.com/docs).**