---
license: mit
base_model: microsoft/Phi-4-multimodal-instruct
quantization_method: bitsandbytes
quantization_config:
  load_in_4bit: true
  bnb_4bit_quant_type: nf4
  bnb_4bit_compute_dtype: torch.bfloat16
  bnb_4bit_use_double_quant: true
tags:
- phi
- phi-4
- phi-4-multimodal
- multimodal
- quantized
- 4bit
- bitsandbytes
- bubblspace
- Automatic Speech Recognition
language:
- ar
- en
- pl
- zh
- fr
- de
- hu
- sv
- es
- ko
- 'no'
---
# Bubbl-P4-multimodal-instruct (4-bit Quantized)
This repository contains a 4-bit quantized version of the `microsoft/Phi-4-multimodal-instruct` model.
Quantization was performed using the `bitsandbytes` library integrated with `transformers`.
## Model Description
* **Original Model:** [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct)
* **Quantization Method:** `bitsandbytes` Post-Training Quantization (PTQ)
* **Precision:** 4-bit
* **Quantization Config:**
    * `load_in_4bit=True`
    * `bnb_4bit_quant_type="nf4"` (NormalFloat 4-bit)
    * `bnb_4bit_compute_dtype=torch.bfloat16` (computation performed in BF16 on compatible GPUs such as the A100)
    * `bnb_4bit_use_double_quant=True` (enables nested quantization for additional memory savings)
This version was created to provide the capabilities of Phi-4-multimodal with a significantly reduced memory footprint, making it suitable for deployment on GPUs with lower VRAM.
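For reference, the snippet below is a minimal sketch of how a checkpoint with this configuration can be produced using `BitsAndBytesConfig`. The exact export script used for this repository is not published; the output directory name is illustrative.
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Mirror the quantization settings listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Quantize the original checkpoint on the fly while loading.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Save the 4-bit weights (serializing bitsandbytes 4-bit weights
# requires recent transformers/bitsandbytes releases).
model.save_pretrained("phi-4-multimodal-4bit")
```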
## Intended Use
This quantized model is primarily intended for scenarios where VRAM resources are constrained, but the advanced multimodal reasoning, language understanding, and instruction-following capabilities of `Phi-4-multimodal-instruct` are desired.
Refer to the [original model card](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) for the full range of intended uses and capabilities of the base model.
## How to Use
You can load this 4-bit quantized model directly using the `transformers` library. Ensure you have `bitsandbytes` and `accelerate` installed (`pip install transformers bitsandbytes accelerate torch torchvision pillow soundfile scipy sentencepiece protobuf`).
```python
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
model_id = "bubblspace/Bubbl-P4-multimodal-instruct"
# Load the processor (requires trust_remote_code)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# Load the model with 4-bit quantization enabled
# The quantization config is loaded automatically from the model's config file
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # Essential for Phi-4 models
    load_in_4bit=True,       # Explicitly activate 4-bit loading (the saved config should also handle this)
    device_map="auto",       # Automatically map model layers to available GPU(s)
    # torch_dtype=torch.bfloat16  # Usually not needed; bnb_4bit_compute_dtype comes from the saved config
)
print("4-bit quantized model loaded successfully!")
# --- Example: Text Inference ---
prompt = "<|user|>\nExplain the benefits of model quantization.<|end|>\n<|assistant|>"
inputs = processor(text=prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=150)
response_text = processor.batch_decode(outputs)[0]
print(response_text)
# --- Example: Image Inference Placeholder ---
# from PIL import Image
# import requests
# url = "your_image_url.jpg"
# image = Image.open(requests.get(url, stream=True).raw)
# image_prompt = "<|user|>\n<|image_1|>\nDescribe this image.<|end|>\n<|assistant|>"
# inputs = processor(text=image_prompt, images=image, return_tensors="pt").to(model.device)
# outputs = model.generate(**inputs, max_new_tokens=100)
# response_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
# print(response_text)
# --- Example: Audio Inference Placeholder ---
# import soundfile as sf
# audio_path = "your_audio.wav"
# audio_array, sampling_rate = sf.read(audio_path)
# audio_prompt = "<|user|>\n<|audio_1|>\nTranscribe this audio.<|end|>\n<|assistant|>"
# inputs = processor(text=audio_prompt, audios=[(audio_array, sampling_rate)], return_tensors="pt").to(model.device)
# # ... generate and decode ...
```
**Important:** Remember to always pass `trust_remote_code=True` when loading both the processor and the model for Phi-4 architectures.
## Hardware Requirements
* Requires a CUDA-enabled GPU.
* The 4-bit quantization significantly reduces VRAM requirements relative to the original BF16 weights (roughly 11-12 GB). This version should fit comfortably on GPUs with ~10 GB of VRAM, and potentially less depending on context length and batch size; evaluate on your own workload (a quick footprint check is sketched after this list).
* Performance gains (inference speed) compared to the original are most noticeable on GPUs that efficiently handle lower-precision operations (e.g., NVIDIA Ampere, Ada Lovelace series like A100, L4, RTX 30/40xx).
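To verify the footprint on your own setup, you can inspect the loaded model directly (this assumes the `model` object from the usage example above):
```python
import torch

# Assumes `model` was loaded as shown in the usage example above.
footprint_gb = model.get_memory_footprint() / 1024**3
print(f"Weight memory footprint: {footprint_gb:.2f} GB")

# Peak GPU memory actually allocated so far in this process (CUDA only).
if torch.cuda.is_available():
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak CUDA memory allocated: {peak_gb:.2f} GB")
```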
## Limitations and Considerations
* **Potential Accuracy Impact:** While 4-bit quantization aims to preserve performance, there might be a slight degradation in accuracy compared to the original BF16 model. Users should evaluate the model's performance on their specific tasks to ensure the trade-off is acceptable.
* **Inference Speed:** Memory usage is significantly reduced, but inference speed may or may not be faster than the original BF16 model; it depends heavily on the hardware, batch size, sequence length, and implementation details. Test on your target hardware (a minimal timing sketch follows this list).
* **Multimodal Evaluation:** Quantization primarily affects the model weights. Thorough evaluation on specific vision and audio tasks is recommended to confirm performance characteristics for multimodal use cases.
* **Inherited Limitations:** This model inherits the limitations, biases, and safety considerations of the original `microsoft/Phi-4-multimodal-instruct` model. Please refer to its model card for detailed information on responsible AI practices.
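A minimal, hypothetical timing sketch for measuring generation throughput on your target hardware (it reuses `model` and `processor` from the usage example; the prompt and token counts are arbitrary):
```python
import time

# Assumes `model` and `processor` from the usage example above.
prompt = "<|user|>\nSummarize the benefits of 4-bit quantization.<|end|>\n<|assistant|>"
inputs = processor(text=prompt, return_tensors="pt").to(model.device)

# Warm-up run to exclude one-time initialization costs.
model.generate(**inputs, max_new_tokens=16)

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"Generated {new_tokens} tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tok/s)")
```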
## License
The model is licensed under the [MIT License](LICENSE), consistent with the original `microsoft/Phi-4-multimodal-instruct` model.
## Citation
Please cite the original work if you use this model:
```bibtex
@misc{phi4multimodal2025,
      title={Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs},
      author={Microsoft},
      year={2025},
      eprint={2503.01743},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
Additionally, if you use this specific 4-bit quantized version, please acknowledge **Bubblspace** ([bubblspace.com](https://bubblspace.com)) and **AIEDX** ([aiedx.com](https://aiedx.com)) for providing this quantized model. You could add a note such as:
> *"We used the 4-bit quantized version of Phi-4-multimodal-instruct provided by Bubblspace/AIEDX, available at huggingface.co/bubblspace/Bubbl-P4-multimodal-instruct."* |