# Qwen3-0.6B-T5-xxl-GGUF
## Model Description
This repository provides GGUF-quantized versions of the Qwen3-0.6B-T5-xxl model body. These models are designed for fast, low-resource inference on CPUs. The goal of this project is to replicate the embedding outputs of google/t5-v1_1-xxl using a highly optimized pipeline.
To make this repository fully functional out of the box, the fine-tuned projection head is also included. This allows you to combine the GGUF model with the PyTorch-based head to get the final 4096-dimensional embeddings.
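For reference, the target embeddings can be generated directly from the original encoder. The following is a minimal sketch, not part of this repository's pipeline: it assumes `transformers` and `sentencepiece` are installed, that you have enough memory for the roughly 11B-parameter reference model, and that mean pooling over the encoder's hidden states is the intended comparison point (the pooling choice is an assumption).

```python
# Sketch: producing reference embeddings from google/t5-v1_1-xxl.
# Assumptions: transformers + sentencepiece installed; mean pooling
# is assumed here and may differ from the training setup.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
reference = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")
reference.eval()

inputs = tokenizer("A sprawling fantasy city built into a giant tree.",
                   return_tensors="pt")
with torch.no_grad():
    hidden = reference(**inputs).last_hidden_state  # (1, seq_len, 4096)
reference_embedding = hidden.mean(dim=1)            # mean-pooled (1, 4096)
```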
## Repository Contents
- `qwen3-0.6B-Q4_K_M.gguf`: the model body quantized using the Q4_K_M method. 4-bit, 5-bit, 8-bit, 16-bit, and 32-bit quantizations are available.
- `/projection_head/projection_head.pth`: the PyTorch state dictionary for the final projection layer.
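If the files are not yet on disk, they can be fetched programmatically. Below is a minimal sketch using `huggingface_hub`; the repo id shown is a placeholder that should be replaced with this repository's actual id.

```python
# Minimal download sketch (assumes: pip install huggingface_hub).
# NOTE: "your-username/Qwen3-0.6B-T5-xxl-GGUF" is a placeholder repo id.
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="your-username/Qwen3-0.6B-T5-xxl-GGUF",
    filename="qwen3-0.6B-Q4_K_M.gguf",
)
head_path = hf_hub_download(
    repo_id="your-username/Qwen3-0.6B-T5-xxl-GGUF",
    filename="projection_head/projection_head.pth",
)
```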
## How to Use: Hybrid GGUF + PyTorch Pipeline
This tutorial shows how to use the GGUF model for fast base embedding generation and the PyTorch head for the final projection.
### Step 1: Prerequisites
First, install the necessary libraries. `llama-cpp-python` is required to run GGUF models.

```bash
pip install llama-cpp-python torch numpy
```
### Step 2: Inference Script
The following script encapsulates the entire hybrid pipeline into a convenient class. You can save it as a .py
file and import it into your projects.
```python
import numpy as np
import torch
from torch import nn
from llama_cpp import Llama


class HybridEmbedder:
    """
    A class that encapsulates the hybrid embedding pipeline.
    It loads the models once at initialization for optimal performance.
    """

    def __init__(self, gguf_path: str, head_path: str, n_ctx: int = 512):
        print("Initializing HybridEmbedder...")

        # 1. Load the GGUF body
        print(f"Loading GGUF body from: {gguf_path}")
        self.body_model = Llama(
            model_path=gguf_path,
            embedding=True,
            n_ctx=n_ctx,
            verbose=False,
        )
        print(" -> GGUF body loaded.")

        # 2. Load the PyTorch projection head
        print(f"Loading projection head from: {head_path}")
        input_dim = self.body_model.n_embd()  # embedding width of the GGUF body
        hidden_dim = 2048
        output_dim = 4096
        self.head_model = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, output_dim),
        )
        # map_location keeps loading CPU-safe even if the weights were saved on GPU
        self.head_model.load_state_dict(torch.load(head_path, map_location="cpu"))
        self.head_model.eval()
        print(" -> Projection head loaded.")
        print("\n✅ Embedder is ready to use.")

    def get_embedding(self, text: str) -> torch.Tensor:
        # a) Get the sequence of token embeddings from the GGUF model
        token_embeddings = self.body_model.embed(text)

        # b) Apply mean pooling to get a single sentence vector
        sentence_embedding = np.mean(token_embeddings, axis=0)

        # c) Convert to a PyTorch tensor and add a batch dimension
        sentence_tensor = torch.tensor(sentence_embedding).unsqueeze(0)

        # d) Pass through the projection head
        with torch.no_grad():
            final_embedding = self.head_model(sentence_tensor.float())
        return final_embedding


# --- Example Usage ---
if __name__ == "__main__":
    # Define the paths to your local model files
    GGUF_FILE = "qwen3-0.6B-Q4_K_M.gguf"
    HEAD_FILE = "./projection_head/projection_head.pth"

    # Create an instance of our embedder
    embedder = HybridEmbedder(gguf_path=GGUF_FILE, head_path=HEAD_FILE)

    # Use the embedder to get vectors
    prompt = "A sprawling fantasy city built into a giant tree."
    embedding = embedder.get_embedding(prompt)

    print("\n--- Inference Test ---")
    print(f"Prompt: '{prompt}'")
    print(f"Output dimension: {embedding.shape}")
    print(f"Vector excerpt: {embedding[0, :5]}...")
```
## License

This repository is licensed under the Apache License 2.0.