RedHatAI/gemma-3-12b-it-quantized.w8a8

Image-Text-to-Text

text-generation-inference

8-bit precision

compressed-tensors


nm-research committed on Jun 5

Commit 0ee5310 · verified · 1 Parent(s): ac379eb

Update README.md

Files changed (1)

README.md +24 -23


@@ -34,32 +34,33 @@ This model was obtained by quantizing the weights of [google/gemma-3-12b-it](htt
 This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
 ```python
-from vllm.assets.image import ImageAsset
 from vllm import LLM, SamplingParams
-# prepare model
-llm = LLM(
-    model="nm-testing/gemma-3-12b-it-quantized.w8a8",
-    trust_remote_code=True,
-    max_model_len=4096,
-    max_num_seqs=2,
-)
-# prepare inputs
-question = "What is the content of this image?"
-inputs = {
-    "prompt": f"<|user|>\n<|image_1|>\n{question}<|end|>\n<|assistant|>\n",
-    "multi_modal_data": {
-        "image": ImageAsset("cherry_blossom").pil_image.convert("RGB")
-    },
-}
-# generate response
-print("========== SAMPLE GENERATION ==============")
 outputs = llm.generate(inputs, SamplingParams(temperature=0.2, max_tokens=64))
-print(f"PROMPT  : {outputs[0].prompt}")
-print(f"RESPONSE: {outputs[0].outputs[0].text}")
-print("==========================================")
 ```
 vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
@@ -183,7 +184,7 @@ lm_eval \
       <th>Category</th>
       <th>Metric</th>
       <th>google/gemma-3-12b-it</th>
-      <th>nm-testing/gemma-3-12b-it-quantized.w8a8</th>
       <th>Recovery (%)</th>
     </tr>
   </thead>

 This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
 ```python
 from vllm import LLM, SamplingParams
+from vllm.assets.image import ImageAsset
+from transformers import AutoProcessor
+# Define model name once
+model_name = "RedHatAI/gemma-3-12b-it-quantized.w8a8"
+# Load image and processor
+image = ImageAsset("cherry_blossom").pil_image.convert("RGB")
+processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
+# Build multimodal prompt
+chat = [
+    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is the content of this image?"}]},
+    {"role": "assistant", "content": []}
+]
+prompt = processor.apply_chat_template(chat, add_generation_prompt=True)
+# Initialize model
+llm = LLM(model=model_name, trust_remote_code=True)
+# Run inference
+inputs = {"prompt": prompt, "multi_modal_data": {"image": [image]}}
 outputs = llm.generate(inputs, SamplingParams(temperature=0.2, max_tokens=64))
+# Display result
+print("RESPONSE:", outputs[0].outputs[0].text)
 ```
 vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
       <th>Category</th>
       <th>Metric</th>
       <th>google/gemma-3-12b-it</th>
+      <th>RedHatAI/gemma-3-12b-it-quantized.w8a8</th>
       <th>Recovery (%)</th>
     </tr>
   </thead>
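
The table header touched by this commit includes a `Recovery (%)` column, which the diff itself does not define. A minimal sketch, assuming recovery means the quantized model's score expressed as a percentage of the baseline model's score. The helper name `recovery_pct` and the scores below are hypothetical placeholders, not values from the model card:

```python
def recovery_pct(baseline_score: float, quantized_score: float) -> float:
    """Assumed definition: quantized score as a percentage of the baseline score."""
    if baseline_score == 0:
        raise ValueError("baseline_score must be non-zero")
    return 100.0 * quantized_score / baseline_score


# Placeholder scores for illustration only; not results from the model card.
baseline = 50.0   # hypothetical score for google/gemma-3-12b-it
quantized = 49.0  # hypothetical score for the w8a8 model
print(f"Recovery: {recovery_pct(baseline, quantized):.2f}%")
```

Under this assumed definition, a value near 100 indicates that the quantized model scores close to the unquantized baseline on that metric.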