---
license: other
license_name: jamba-open-model-license
license_link: https://www.ai21.com/jamba-open-model-license/
---

## Model Information

Jamba Large 1.7-FP8 offers new improvements to our Jamba open model family. This new version builds on the novel SSM-Transformer hybrid architecture, 256K context window, and efficiency gains of previous versions, while introducing improvements in grounding, instruction following, and speed.

## Key improvements

* **Grounding**: Jamba Large 1.7-FP8 provides more complete and accurate answers, grounded fully in the given context.
* **Instruction following**: Jamba Large 1.7-FP8 follows user instructions more reliably, improving steerability.
* **Speed**: Jamba Large 1.7-FP8 is faster thanks to FP8 quantization.

## Use cases

Jamba’s long-context efficiency, contextual faithfulness, and steerability make it ideal for a variety of business applications and industries, such as:

* **Finance**: Investment research, digital banking support chatbots, M&A due diligence.
* **Healthcare**: Procurement (RFP creation & response review), medical publication and report generation.
* **Retail**: Brand-aligned product description generation, conversational AI.
* **Education & Research**: Personalized chatbot tutors, grant applications.

The models are released under the [Jamba Open Model License](https://www.ai21.com/jamba-open-model-license/), a permissive license allowing full research use and commercial use under the license terms. If you need to license the model for your needs, [talk to us](https://www.ai21.com/contact-sales/).

## Model Details

* **Developed by:** AI21
* **Model type:** Joint Attention and Mamba (Jamba)
* **Model size:** 94B active / 398B total parameters
* **License:** Jamba Open Model License
* **Context length:** 256K tokens
* **Knowledge cutoff date:** August 22, 2024
* **Supported languages:** English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic, and Hebrew

## Grounding and instruction-following improvements

| Category     | Benchmark | Jamba Large 1.6 | Jamba Large 1.7 |
|--------------|:---------:|:---------------:|:---------------:|
| Grounding    | FACTS     | 0.758           | 0.832           |
| Steerability | IFEval    | 0.782           | 0.84            |

## FP8 Quantization

Jamba Large 1.7-FP8 weights are available in this pre-quantized FP8 format, which is optimized for NVIDIA Hopper-architecture GPUs. As a result:

* The initial GPU memory footprint at inference launch is lower.
* The FP8 model weights take up almost 50% less disk space.
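
If you want to verify the smaller disk footprint, the optional sketch below sums the sizes of the checkpoint's safetensors shards via the Hugging Face Hub API (it assumes the `huggingface_hub` package, which is installed alongside `transformers`):

```python
from huggingface_hub import HfApi

# Fetch per-file metadata for the FP8 checkpoint
info = HfApi().model_info("ai21labs/AI21-Jamba-Large-1.7-FP8", files_metadata=True)

# Sum the sizes of the safetensors weight shards
total_bytes = sum(f.size for f in info.siblings if f.rfilename.endswith(".safetensors"))
print(f"FP8 weight shards on disk: {total_bytes / 1e9:.0f} GB")
```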

## Usage

Find step-by-step instructions on how to privately deploy Jamba:

<details>
<summary><strong>Run the model with vLLM</strong></summary>

The recommended way to perform efficient inference with Jamba Large 1.7-FP8 is using [vLLM](https://docs.vllm.ai/en/latest/). First, make sure to install vLLM (version 0.6.5 or higher is required):

```bash
pip install "vllm>=0.6.5"
```
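
To confirm that the installed version meets this requirement, you can check it from the command line, for example:

```bash
python -c "import vllm; print(vllm.__version__)"
```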

Jamba Large 1.7 is too large to be loaded in full (FP32) or half (FP16/BF16) precision on a single node of 8 x 80GB GPUs. With the pre-quantized FP8 weights of Jamba Large 1.7-FP8, you can deploy the model on a single node of 8 x 80GB GPUs:

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model = "ai21labs/AI21-Jamba-Large-1.7-FP8"

# Shard the model across the 8 GPUs and cap the context length to fit in GPU memory
llm = LLM(model=model,
          tensor_parallel_size=8,
          max_model_len=220*1024,
          )

tokenizer = AutoTokenizer.from_pretrained(model)

messages = [
    {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."},
    {"role": "user", "content": "Hello!"},
]

# Render the chat template to a prompt string; vLLM tokenizes it internally
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

sampling_params = SamplingParams(temperature=0.4, top_p=0.95, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

**Note:** Versions 4.44.0 and 4.44.1 of `transformers` have a bug that prevents the Jamba architecture from running correctly. Make sure you're not using these versions.
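
One way to steer clear of the affected releases is to exclude them explicitly when installing (a minimal sketch; any install method that skips these two versions works just as well):

```bash
pip install "transformers!=4.44.0,!=4.44.1"
```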

**Note:** If you're having trouble installing `mamba-ssm` and `causal-conv1d` for the optimized Mamba kernels, you can run Jamba Large 1.7-FP8 without them, at the cost of extra latency. To do so, pass the kwarg `use_mamba_kernels=False` when loading the model via `AutoModelForCausalLM.from_pretrained()`.

You can also find all instructions in our [private AI (vLLM) deployment guide](https://docs.ai21.com/docs/vllm).
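
If you prefer an OpenAI-compatible HTTP endpoint over the offline `LLM` class, here is a minimal sketch using vLLM's built-in server, assuming the same 8-GPU node and context-length cap as the Python example above:

```bash
vllm serve ai21labs/AI21-Jamba-Large-1.7-FP8 \
    --tensor-parallel-size 8 \
    --max-model-len 225280
```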

</details>

<details>
<summary><strong>Run the model with Transformers</strong></summary>

To load Jamba Large 1.7 in `transformers` on a single node of 8 x 80GB GPUs, we recommend parallelizing it using [accelerate](https://huggingface.co/docs/accelerate/index):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# A device map that distributes the model evenly across 8 GPUs:
# 9 decoder layers per GPU, with the embedding layer on GPU 0 and the
# final layer norm and LM head on GPU 7.
device_map = {"model.embed_tokens": 0, "model.final_layernorm": 7, "lm_head": 7}
device_map.update({f"model.layers.{i}": i // 9 for i in range(72)})

model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-Large-1.7-FP8",
                                             attn_implementation="flash_attention_2",
                                             device_map=device_map)

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-Large-1.7-FP8")

messages = [
    {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."},
    {"role": "user", "content": "Hello!"},
]

input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to(model.device)

outputs = model.generate(input_ids, max_new_tokens=216)

# Decode the output
conversation = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Split the conversation to get only the assistant's response
assistant_response = conversation.split(messages[-1]['content'])[1].strip()
print(assistant_response)
# Output: Seek and you shall find. The path is winding, but the journey is enlightening. What wisdom do you seek from the ancient echoes?
```

**Note:** Versions 4.44.0 and 4.44.1 of `transformers` have a bug that prevents the Jamba architecture from running correctly. Make sure you're not using these versions.

**Note:** If you're having trouble installing `mamba-ssm` and `causal-conv1d` for the optimized Mamba kernels, you can run Jamba Large 1.7 without them, at the cost of extra latency. To do so, pass the kwarg `use_mamba_kernels=False` when loading the model via `AutoModelForCausalLM.from_pretrained()`, as sketched below.
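
A minimal sketch of that fallback, assuming the same checkpoint as above (the `device_map="auto"` here is a simplification; the explicit device map from the previous example works as well):

```python
from transformers import AutoModelForCausalLM

# Fallback loading path: disable the optimized Mamba kernels so that
# mamba-ssm and causal-conv1d are not required (slower inference).
model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/AI21-Jamba-Large-1.7-FP8",
    attn_implementation="flash_attention_2",
    device_map="auto",
    use_mamba_kernels=False,
)
```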

</details>

And to get started with our SDK: [AI21 Python SDK guide](https://docs.ai21.com/docs/sdk)
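
For reference, a minimal sketch of calling Jamba through the AI21 Python SDK (this assumes `pip install ai21`, an `AI21_API_KEY` environment variable, and the `jamba-large` model identifier; check the SDK guide for the exact model names available):

```python
from ai21 import AI21Client
from ai21.models.chat import ChatMessage

client = AI21Client()  # reads AI21_API_KEY from the environment

response = client.chat.completions.create(
    model="jamba-large",  # assumed model identifier; see the SDK guide
    messages=[ChatMessage(role="user", content="Hello!")],
)
print(response.choices[0].message.content)
```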

## Further documentation

For more comprehensive guides and advanced usage:

* [Tokenization guide](https://docs.ai21.com/docs/tokenization) - Using ai21-tokenizer
* [Quantization guide](https://docs.ai21.com/docs/quantization) - ExpertsInt8, bitsandbytes
* [Fine-tuning guide](https://docs.ai21.com/docs/fine-tuning) - LoRA, qLoRA, and full fine-tuning
* [Function-calling guide](https://docs.ai21.com/docs/function-calling)

For more resources to start building, [visit our official documentation](https://docs.ai21.com/home).
|