# Qwen3-0.6B-T5-xxl-GGUF
## Model Description
This repository provides GGUF-quantized versions of the Qwen3-0.6B-T5-xxl model body. These models are designed for fast, low-resource inference on CPUs. The goal of this project is to replicate the embedding outputs of google/t5-v1_1-xxl using a highly optimized pipeline.
To make this repository fully functional out of the box, the fine-tuned projection head is also included. This allows you to combine the GGUF model with the PyTorch-based head to get the final 4096-dimensional embeddings.
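For reference, the target embeddings can be generated directly from the original encoder. The following is a minimal sketch, not part of this repository's pipeline: it assumes `transformers` and `sentencepiece` are installed, that you have enough memory for the roughly 11B-parameter reference model, and that mean pooling over the encoder's hidden states is the intended comparison point (the pooling choice is an assumption).

```python
# Sketch: producing reference embeddings from google/t5-v1_1-xxl.
# Assumptions: transformers + sentencepiece installed; mean pooling
# is assumed here and may differ from the training setup.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
reference = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")
reference.eval()

inputs = tokenizer("A sprawling fantasy city built into a giant tree.",
                   return_tensors="pt")
with torch.no_grad():
    hidden = reference(**inputs).last_hidden_state  # (1, seq_len, 4096)
reference_embedding = hidden.mean(dim=1)            # mean-pooled (1, 4096)
```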
## Repository Contents
- `qwen3-0.6B-Q4_K_M.gguf`: the model body quantized using the Q4_K_M method. 4-bit, 5-bit, 8-bit, 16-bit, and 32-bit quantizations are available.
- `/projection_head/projection_head.pth`: the PyTorch state dictionary for the final projection layer.
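If the files are not yet on disk, they can be fetched programmatically. Below is a minimal sketch using `huggingface_hub`; the repo id shown is a placeholder that should be replaced with this repository's actual id.

```python
# Minimal download sketch (assumes: pip install huggingface_hub).
# NOTE: "your-username/Qwen3-0.6B-T5-xxl-GGUF" is a placeholder repo id.
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="your-username/Qwen3-0.6B-T5-xxl-GGUF",
    filename="qwen3-0.6B-Q4_K_M.gguf",
)
head_path = hf_hub_download(
    repo_id="your-username/Qwen3-0.6B-T5-xxl-GGUF",
    filename="projection_head/projection_head.pth",
)
```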
## How to Use: Hybrid GGUF + PyTorch Pipeline
This tutorial shows how to use the GGUF model for fast base embedding generation and the PyTorch head for the final projection.
### Step 1: Prerequisites
First, install the necessary libraries. `llama-cpp-python` is required to run GGUF models.

```bash
pip install llama-cpp-python torch numpy
```
### Step 2: Inference Script
The following script encapsulates the entire hybrid pipeline into a convenient class. You can save it as a .py
file and import it into your projects.
```python
import numpy as np
import torch
from torch import nn
from llama_cpp import Llama


class HybridEmbedder:
    """
    A class that encapsulates the hybrid embedding pipeline.
    It loads the models once at initialization for optimal performance.
    """

    def __init__(self, gguf_path: str, head_path: str, n_ctx: int = 512):
        print("Initializing HybridEmbedder...")

        # 1. Load the GGUF body
        print(f"Loading GGUF body from: {gguf_path}")
        self.body_model = Llama(
            model_path=gguf_path,
            embedding=True,
            n_ctx=n_ctx,
            verbose=False,
        )
        print(" -> GGUF body loaded.")

        # 2. Load the PyTorch projection head
        print(f"Loading projection head from: {head_path}")
        input_dim = self.body_model.n_embd()  # embedding width of the GGUF body
        hidden_dim = 2048
        output_dim = 4096
        self.head_model = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, output_dim),
        )
        # map_location keeps loading CPU-safe even if the weights were saved on GPU
        self.head_model.load_state_dict(torch.load(head_path, map_location="cpu"))
        self.head_model.eval()
        print(" -> Projection head loaded.")
        print("\n✅ Embedder is ready to use.")

    def get_embedding(self, text: str) -> torch.Tensor:
        # a) Get the sequence of token embeddings from the GGUF model
        token_embeddings = self.body_model.embed(text)

        # b) Apply mean pooling to get a single sentence vector
        sentence_embedding = np.mean(token_embeddings, axis=0)

        # c) Convert to a PyTorch tensor and add a batch dimension
        sentence_tensor = torch.tensor(sentence_embedding).unsqueeze(0)

        # d) Pass through the projection head
        with torch.no_grad():
            final_embedding = self.head_model(sentence_tensor.float())
        return final_embedding


# --- Example Usage ---
if __name__ == "__main__":
    # Define the paths to your local model files
    GGUF_FILE = "qwen3-0.6B-Q4_K_M.gguf"
    HEAD_FILE = "./projection_head/projection_head.pth"

    # Create an instance of our embedder
    embedder = HybridEmbedder(gguf_path=GGUF_FILE, head_path=HEAD_FILE)

    # Use the embedder to get vectors
    prompt = "A sprawling fantasy city built into a giant tree."
    embedding = embedder.get_embedding(prompt)

    print("\n--- Inference Test ---")
    print(f"Prompt: '{prompt}'")
    print(f"Output dimension: {embedding.shape}")
    print(f"Vector excerpt: {embedding[0, :5]}...")
```
## License

This repository is licensed under the Apache License 2.0.