+---
+license: apache-2.0
+base_model:
+- Qwen/Qwen3-Coder-480B-A35B-Instruct
+datasets:
+- codeparrot/github-code-clean
+---
+## Model Details
+This is a mixed int4 model with group_size 128 and symmetric quantization of [Qwen/Qwen3-Coder-480B-A35B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct), generated by [intel/auto-round](https://github.com/intel/auto-round) via **auto-round-light**.
+Please follow the license of the original model.
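To make "group_size 128 and symmetric quantization" concrete, here is a minimal pure-Python sketch of symmetric group-wise int4 quantization. It is illustrative only: it uses plain round-to-nearest, whereas AutoRound tunes the rounding (see the paper under Cite), and the function names are hypothetical rather than part of the auto-round API.

```python
def quantize_group_sym(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1   # 7 for int4
    qmin = -(2 ** (bits - 1))    # -8 for int4
    max_abs = max(abs(w) for w in weights)
    # Symmetric: the zero point is 0, so a single scale per group is enough
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale


def quantize_sym(weights, group_size=128, bits=4):
    """Split a flat weight list into groups and quantize each one."""
    return [
        quantize_group_sym(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Map the integer codes back to floats using each group's scale."""
    out = []
    for q, scale in groups:
        out.extend(v * scale for v in q)
    return out


if __name__ == "__main__":
    import random

    random.seed(0)
    w = [random.uniform(-1, 1) for _ in range(256)]
    groups = quantize_sym(w)
    w_hat = dequantize(groups)
    print(len(groups), "groups, max abs error:",
          max(abs(a - b) for a, b in zip(w, w_hat)))
```

Each group of 128 weights stores its own scale, so an outlier only degrades the precision of its own group rather than the whole tensor.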
+## How To Use
+**INT4 Inference on CPU/Intel GPU/CUDA**
+~~~python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_name = "Intel/Qwen3-Coder-480B-A35B-Instruct-int4-AutoRound"
+# load the tokenizer and the model
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype="auto",
+    device_map="auto"
+)
+prompts = [
+    "Write a quick sort algorithm.",
+    "Write a flappy bird.",
+    "Write a llm quantization algorithm.",
+]
+texts = []
+for prompt in prompts:
+    messages = [
+        {"role": "user", "content": prompt}
+    ]
+    text = tokenizer.apply_chat_template(
+        messages,
+        tokenize=False,
+        add_generation_prompt=True
+    )
+    texts.append(text)
+inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, padding_side="left").to(model.device)
+# conduct text completion
+outputs = model.generate(
+    **inputs,
+    max_new_tokens=65536,
+)
+generated_ids = [
+    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs["input_ids"], outputs)
+]
+decoded_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
+for i, prompt in enumerate(prompts):
+    print(f"Prompt: {prompt}")
+    print(f"Generated: {decoded_outputs[i]}")
+    print("-" * 50)
+"""
+Prompt: Write a quick sort algorithm.
+Generated: Here's a Quick Sort implementation in Python:
+```python
+def quicksort(arr):
+    """
+    Quick Sort algorithm implementation
+    Args:
+        arr: List of comparable elements
+    Returns:
+        Sorted list
+    """
+    # Base case: arrays with 0 or 1 element are already sorted
+    if len(arr) <= 1:
+        return arr
+    # Choose pivot (using middle element)
+    pivot = arr[len(arr) // 2]
+    # Partition array into three parts
+    left = [x for x in arr if x < pivot]      # Elements less than pivot
+    middle = [x for x in arr if x == pivot]   # Elements equal to pivot
+    right = [x for x in arr if x > pivot]     # Elements greater than pivot
+    # Recursively sort left and right partitions, then combine
+    return quicksort(left) + middle + quicksort(right)
+# Alternative in-place version (more memory efficient)
+def quicksort_inplace(arr, low=0, high=None):
+    """
+    In-place Quick Sort implementation
+    Args:
+        arr: List to be sorted in-place
+        low: Starting index
+        high: Ending index
+    """
+    if high is None:
+        high = len(arr) - 1
+    if low < high:
+        # Partition the array and get pivot index
+        pivot_index = partition(arr, low, high)
+        # Recursively sort elements before and after partition
+        quicksort_inplace(arr, low, pivot_index - 1)
+        quicksort_inplace(arr, pivot_index + 1, high)
+def partition(arr, low, high):
+    """
+    Partition function for in-place quicksort
+    """
+    # Choose rightmost element as pivot
+    pivot = arr[high]
+    # Index of smaller element (indicates right position of pivot)
+    i = low - 1
+    for j in range(low, high):
+        # If current element is smaller than or equal to pivot
+        if arr[j] <= pivot:
+            i += 1
+            arr[i], arr[j] = arr[j], arr[i]  # Swap elements
+    # Place pivot in correct position
+    arr[i + 1], arr[high] = arr[high], arr[i + 1]
+    return i + 1
+# Example usage
+if __name__ == "__main__":
+    # Test the simple version
+    test_array
+--------------------------------------------------
+Prompt: Write a flappy bird.
+Generated: # Flappy Bird in PyGame
+Here's a complete implementation of Flappy Bird using PyGame:
+```python
+import pygame
+import sys
+import random
+# Initialize pygame
+pygame.init()
+# Game constants
+WIDTH, HEIGHT = 400, 600
+FPS = 60
+GRAVITY = 0.25
+FLAP_STRENGTH = -5
+PIPE_SPEED = 3
+PIPE_GAP = 150
+PIPE_FREQUENCY = 1800  # milliseconds
+GROUND_HEIGHT = 100
+BIRD_SIZE = 30
+# Colors
+SKY_BLUE = (113, 197, 207)
+GREEN = (111, 196, 69)
+DARK_GREEN = (76, 145, 65)
+BROWN = (160, 120, 40)
+YELLOW = (255, 221, 45)
+RED = (231, 76, 60)
+WHITE = (255, 255, 255)
+BLACK = (0, 0, 0)
+# Set up the display
+screen = pygame.display.set_mode((WIDTH, HEIGHT))
+pygame.display.set_caption("Flappy Bird")
+clock = pygame.time.Clock()
+# Font setup
+font = pygame.font.SysFont(None, 36)
+small_font = pygame.font.SysFont(None, 24)
+class Bird:
+    def __init__(self):
+        self.x = WIDTH // 3
+        self.y = HEIGHT // 2
+        self.velocity = 0
+        self.alive = True
+        self.rotation = 0
+    def flap(self):
+        self.velocity = FLAP_STRENGTH
+    def update(self):
+        # Apply gravity
+        self.velocity += GRAVITY
+        self.y += self.velocity
+        # Rotate bird based on velocity
+        self.rotation = max(-30, min(self.velocity * 3, 90))
+        # Check if bird hits the ground or ceiling
+        if self.y >= HEIGHT - GROUND_HEIGHT - BIRD_SIZE//2:
+            self.y = HEIGHT - GROUND_HEIGHT - BIRD_SIZE//2
+            self.alive = False
+        if self.y <= 0:
+            self.y = 0
+            self.velocity = 0
+    def draw(self
+--------------------------------------------------
+Prompt: Write a llm quantization algorithm.
+Generated: Here's a comprehensive implementation of LLM quantization algorithms, including post-training quantization and QLoRA-style quantization:
+```python
+import torch
+import torch.nn as nn
+from typing import Dict, Tuple, Optional
+import math
+class Quantizer:
+    """Base class for quantization operations"""
+    @staticmethod
+    def symmetric_quantize(tensor: torch.Tensor, bits: int = 8) -> Tuple[torch.Tensor, float]:
+        """
+        Symmetric quantization for weights
+        Returns quantized tensor and scale factor
+        """
+        max_val = tensor.abs().max()
+        scale = max_val / (2 ** (bits - 1) - 1)
+        # Quantize to integer values
+        quantized = torch.round(tensor / scale).clamp(-2**(bits-1), 2**(bits-1)-1)
+        return quantized.to(torch.int8), scale
+    @staticmethod
+    def asymmetric_quantize(tensor: torch.Tensor, bits: int = 8) -> Tuple[torch.Tensor, float, float]:
+        """
+        Asymmetric quantization for activations
+        Returns quantized tensor, scale, and zero point
+        """
+        min_val, max_val = tensor.min(), tensor.max()
+        scale = (max_val - min_val) / (2**bits - 1)
+        zero_point = torch.round(-min_val / scale).clamp(0, 2**bits-1)
+        # Quantize with zero point
+        quantized = torch.round(tensor / scale + zero_point).clamp(0, 2**bits-1)
+        return quantized.to(torch.uint8), scale, zero_point
+    @staticmethod
+    def dequantize(quantized: torch.Tensor, scale: float, zero_point: Optional[float] = None) -> torch.Tensor:
+        """Dequantize tensor back to floating point"""
+        if zero_point is not None:
+            return (quantized.float() - zero_point) * scale
+        else:
+            return quantized.float() * scale
+class NF4Quantizer:
+    """4-bit NormalFloat quantization (NF4)"""
+    def __init__(self):
+        # Pre-defined NF4 values normalized to [-1, 1]
+        self.norm_floats = torch.tensor([
+            -1.0, -0.6962, -0.5251, -0.3949, -0.2844,
+--------------------------------------------------
+"""
+~~~
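The list comprehension that builds `generated_ids` works because the batch is left-padded: every row of `inputs["input_ids"]` has the same padded length, and `generate` returns each padded prompt followed by the new tokens. Slicing each output row at `len(input_ids)` therefore leaves only the generated part. A toy sketch with plain lists (the token ids and the pad id 0 are made up for illustration):

```python
PAD = 0  # hypothetical pad token id


def left_pad(seqs, pad=PAD):
    """Pad every sequence on the left to the length of the longest one."""
    width = max(len(s) for s in seqs)
    return [[pad] * (width - len(s)) + s for s in seqs]


def strip_prompts(input_ids, outputs):
    """Drop the (padded) prompt from the front of each output row."""
    return [out[len(inp):] for inp, out in zip(input_ids, outputs)]


prompts = [[11, 12, 13], [21]]
input_ids = left_pad(prompts)  # [[11, 12, 13], [0, 0, 21]]

# Pretend generate() appended two new tokens to each padded row
new_tokens = [[101, 102], [201, 202]]
outputs = [row + new for row, new in zip(input_ids, new_tokens)]

print(strip_prompts(input_ids, outputs))  # [[101, 102], [201, 202]]
```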
+### Generate the model
+Here is a sample script to reproduce the model. It assumes three 80 GB GPUs (3*80G).
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import transformers
+from auto_round import AutoRound
+model_name = "Qwen/Qwen3-Coder-480B-A35B-Instruct"
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cpu", torch_dtype="auto", trust_remote_code=True)
+block = model.model.layers
+device_map = {}
+for n, m in block.named_modules():
+    if isinstance(m, (torch.nn.Linear, transformers.modeling_utils.Conv1D)):
+        if "experts" in n and ("shared_experts" not in n):
+            # Spread the routed experts over the three GPUs by expert index
+            expert_idx = int(n.split('.')[-2])
+            if expert_idx < 30:
+                device = "cuda:0"
+            elif expert_idx < 95:
+                device = "cuda:1"
+            else:
+                device = "cuda:2"
+        else:
+            # All other layers stay on the first GPU
+            device = "cuda:0"
+        # Drop the first two characters of the name (for example a leading "0.")
+        n = n[2:]
+        device_map.update({n: device})
+autoround = AutoRound(
+    model=model,
+    tokenizer=tokenizer,
+    device_map=device_map,
+    nsamples=512,
+    dataset="github-code-clean",
+)
+autoround.quantize_and_save(format="auto_round", output_dir="./Qwen3-Coder-480B-A35B-Instruct-int4")
+```
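The device map in the script above places routed experts by index: experts 0 to 29 share `cuda:0` with all non-expert layers, experts 30 to 94 go to `cuda:1`, and the rest go to `cuda:2`. The hypothetical helper below reproduces only that rule in plain Python, so the split can be checked without loading the model. The module names and the count of 160 experts per layer are assumptions made for the example.

```python
from collections import Counter


def device_for(name):
    """Return the device the script's rule would assign to a layer name."""
    if "experts" in name and "shared_experts" not in name:
        expert_idx = int(name.split(".")[-2])
        if expert_idx < 30:
            return "cuda:0"
        if expert_idx < 95:
            return "cuda:1"
        return "cuda:2"
    return "cuda:0"


# Illustrative layer names; 160 experts per layer is an assumption
names = [f"0.mlp.experts.{i}.gate_proj" for i in range(160)]
counts = Counter(device_for(n) for n in names)

# 30 experts land on cuda:0, 65 on cuda:1, and 65 on cuda:2
print(dict(counts))
print(device_for("0.self_attn.q_proj"))  # non-expert layers use cuda:0
```

Giving `cuda:0` fewer experts leaves room for the non-expert layers that are also placed there; the thresholds may need adjusting for GPUs with different memory sizes.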
+## Ethical Considerations and Limitations
+The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.
+Therefore, before deploying any applications of the model, developers should perform safety testing.
+## Caveats and Recommendations
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
+Here is a useful link to learn more about Intel's AI software:
+- Intel Neural Compressor [link](https://github.com/intel/neural-compressor)
+## Disclaimer
+The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.
+## Cite
+@article{cheng2023optimize, title={Optimize weight rounding via signed gradient descent for the quantization of llms}, author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi}, journal={arXiv preprint arXiv:2309.05516}, year={2023} }
+[arxiv](https://arxiv.org/abs/2309.05516) [github](https://github.com/intel/auto-round)