Qwen3-Embedding-0.6B-GraphQL
An embedding model that maps a question in plain English to the GraphQL Type.field that answers it. It's made for schema retrieval in LLM agent pipelines, and appears to be the first open-source embedding model trained for the job. General-purpose embedders, the usual choice, can't reliably tell apart the near-identical field names that fill a real schema, so retrieval suffers.
When an LLM agent has to query a GraphQL API, the hard part isn't writing the query. It's grounding the query in a schema that's often thousands of fields wide and won't fit in a context window. The usual fix is RAG over the schema: embed every Type.field, retrieve the handful relevant to the question, and feed only those to the agent. General-purpose embedders struggle here because real schemas reuse field names everywhere. Dozens of types carry a description, an author, a createdAt, a state. Knowing the field name isn't enough; you have to know whose field it is.
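The retrieval step described above can be sketched in a few lines. This is a minimal illustration only: the 3-dimensional vectors are made-up toy values standing in for real embeddings, and `cosine`, `top_k`, and `index` are hypothetical names, not part of the model or any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=3):
    """Return the k coordinates whose embeddings are closest to the query.

    `index` maps a 'Type.field' coordinate to its embedding.
    """
    scored = [(cosine(query_vec, vec), coord) for coord, vec in index.items()]
    scored.sort(key=lambda pair: -pair[0])
    return [coord for _, coord in scored[:k]]

# Toy vectors (assumption): a real index would hold one model embedding
# per Type.field in the schema.
index = {
    "Room.priceCents": [0.9, 0.1, 0.0],
    "Ticket.priceCents": [0.1, 0.9, 0.0],
    "Room.description": [0.6, 0.0, 0.8],
    "Issue.description": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.1]

# Only these coordinates would be handed to the agent as schema context.
context = top_k(query_vec, index, k=2)
print(context)
```

In a real pipeline the index would be built once per schema and only the query would be embedded per request.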
This is a fine-tune of Qwen/Qwen3-Embedding-0.6B trained for that one task: owner-type disambiguation when field names collide. The agent gets the right coordinate in context instead of a same-named field on the wrong type. At 0.6B it runs on CPU or alongside the agent's own model.
The payoff, on held-out queries against schemas never seen in training:
| metric | base | tuned | lift |
|---|---|---|---|
| exact_match@1 | 0.090 | 0.229 | +155% |
| recall@10 | 0.215 | 0.435 | +102% |
| mrr@10 | 0.121 | 0.285 | +135% |
On an external benchmark against the full GitHub GraphQL schema (6,342 coordinates, 52 queries, never seen in training), using sdl formatting:
| metric | base | tuned | lift |
|---|---|---|---|
| MRR | 0.511 | 0.723 | +41% |
| R@1 | 0.385 | 0.615 | +60% |
| R@5 | 0.654 | 0.865 | +32% |
| P95 rank | 53 | 40 | -25% |
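The lift column in the table above is consistent with relative change against the base model, which the following sketch reproduces from the published numbers. The helper name `lift_percent` is hypothetical.

```python
def lift_percent(base, tuned):
    """Relative change of the tuned score against the base score, in percent."""
    return (tuned - base) / base * 100.0

# (base, tuned) pairs copied from the GitHub-schema table above.
rows = {
    "MRR": (0.511, 0.723),
    "R@1": (0.385, 0.615),
    "R@5": (0.654, 0.865),
    "P95 rank": (53, 40),
}

for name, (base, tuned) in rows.items():
    print(f"{name}: {lift_percent(base, tuned):+.0f}%")
```

For P95 rank a negative lift is the improvement, since a lower rank is better.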
Drop-in for any GraphQL-aware RAG, query builder, or schema search. Ships as SentenceTransformer weights and GGUF builds for llama.cpp / Ollama.
Important: how you format the corpus matters as much as the model. Use SDL snippets or dot_plus_gloss formatting for best results. See "How you format the corpus matters" below for details.
Inference
SentenceTransformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("xthor/Qwen3-Embedding-0.6B-GraphQL")

query = "What's the nightly rate for this room?"

# Candidate Type.field coordinates
coords = [
    "Room.priceCents",
    "RoomUpgradeOffer.priceCents",
    "Ticket.priceCents",
]

# Encode each side with its own prompt
q = model.encode(query, prompt_name="query")
c = model.encode(coords, prompt_name="document")

# q is one vector and c has one row per coordinate, so this yields one score per coordinate
scores = (q @ c.T).tolist()

# Print coordinates from best to worst match
for coord, score in sorted(zip(coords, scores), key=lambda x: -x[1]):
    print(f"{score:.3f} {coord}")
```
Two prompts are wired into the model and must be used for best results:
- prompt_name="query" for natural-language questions
- prompt_name="document" for GraphQL coordinate descriptions in the corpus
Ollama
```shell
# pull one quantization (Q8_0 is a good default: near-lossless, ~650 MB)
hf download xthor/Qwen3-Embedding-0.6B-GraphQL model-q8_0.gguf --local-dir .

cat > Modelfile <<'EOF'
FROM ./model-q8_0.gguf
EOF

ollama create qwen3-graphql-embedder -f Modelfile

# OpenAI-compatible embeddings endpoint
curl -s http://localhost:11434/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen3-graphql-embedder","input":"What is the nightly rate for this room?"}' \
  | jq '.data[0].embedding'
```
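The same endpoint can be called from Python with only the standard library. This is a sketch under assumptions: it expects the Ollama server and the `qwen3-graphql-embedder` model name from the commands above, it assumes the endpoint accepts a list under `input` as the OpenAI embeddings API does, and the helper names `build_request`, `parse_embeddings`, and `embed` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama address used in the curl call
MODEL = "qwen3-graphql-embedder"        # name given to `ollama create`

def build_request(texts, model=MODEL, base_url=BASE_URL):
    """Build the POST request for the OpenAI-compatible embeddings endpoint."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_embeddings(body):
    """Extract one vector per input from a decoded JSON response."""
    return [item["embedding"] for item in body["data"]]

def embed(texts):
    """Send the texts to the server and return their embeddings (needs a running server)."""
    with urllib.request.urlopen(build_request(texts)) as resp:
        return parse_embeddings(json.load(resp))

# Usage, with the server running:
#   vectors = embed(["What is the nightly rate for this room?"])
```

Splitting request building from response parsing keeps both parts checkable without a live server.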
llama.cpp
```shell
hf download xthor/Qwen3-Embedding-0.6B-GraphQL model-q8_0.gguf --local-dir .
./llama-server -m model-q8_0.gguf --embedding --port 8080
# POST http://localhost:8080/embedding { "content": "..." }
```
Available GGUF quantizations
| file | size | use case |
|---|---|---|
| model-f16.gguf | ~1.2 GB | reference quality, parity with safetensors |
| model-q8_0.gguf | ~650 MB | near-lossless; recommended default |
| model-q4_k_m.gguf | ~400 MB | small footprint; accepts a minor quality trade-off |
Results
223 held-out test queries · 28,893-coordinate corpus · 30% real SDLs (GitHub GHES, Saleor, Shopify, AniList) never seen in training.
| metric | baseline | tuned (3 epochs) | lift |
|---|---|---|---|
| exact_match@1 | 0.090 | 0.229 | +0.139 (+155%) |
| recall@3 | 0.130 | 0.318 | +0.188 |
| recall@5 | 0.161 | 0.345 | +0.184 (+114%) |
| recall@10 | 0.215 | 0.435 | +0.220 (+102%) |
| mrr@10 | 0.121 | 0.285 | +0.164 |
| ndcg@10 | 0.143 | 0.320 | +0.177 |
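For readers who want to reproduce metrics of this kind on their own schema, here is a minimal sketch of how they can be computed from per-query ranks. It assumes exactly one correct coordinate per query and 1-based ranks; the function names and the example ranks are illustrative, not taken from the evaluation code behind the table.

```python
import math

def exact_match_at_1(ranks):
    """Share of queries whose correct coordinate is ranked first."""
    return sum(r == 1 for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    """Share of queries whose correct coordinate appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def mrr_at_k(ranks, k):
    """Mean reciprocal rank, counting 0 when the target falls outside the top k."""
    return sum(1.0 / r if r <= k else 0.0 for r in ranks) / len(ranks)

def ndcg_at_k(ranks, k):
    """nDCG with a single relevant item, so the ideal DCG is 1."""
    return sum(1.0 / math.log2(r + 1) if r <= k else 0.0 for r in ranks) / len(ranks)

# Made-up ranks of the correct coordinate for four queries.
ranks = [1, 3, 12, 2]
print(exact_match_at_1(ranks), recall_at_k(ranks, 10))
print(round(mrr_at_k(ranks, 10), 3), round(ndcg_at_k(ranks, 10), 3))
```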
Where the lift comes from
Direct questions ("has my package shipped?", "what's my total?") are already handled well by the base model. The gains come from indirect questions where the user names a concept rather than a field. Those require owner-type reasoning, and that's where the base model falls behind.
Example: rank 101 → 1
"I need to understand what commitments we have regarding support response times. Where can I find that info?"
Correct target: SlaPolicy.description. The schema has 262 .description fields (on Incident, Issue, Resolution, SatisfactionSurvey, …). The task is picking the right owner, not the right field name.
| | base | tuned |
|---|---|---|
| rank in full corpus (18,396 coordinates) | 101 | 1 |
| rank among 262 .description siblings | 12 | 1 |
| cosine(query, target) | 0.428 | 0.383 |
| cosine(query, base top-1 distractor) | 0.484 | 0.303 |
The base model ranks SatisfactionSurvey.description and Incident.description above the target. The fine-tune demotes them: every wrong owner drops to 0.15–0.22 while the target becomes the top hit.
Example: rank 5 → 1
"What's the nightly rate for this room?"
Correct target: Room.priceCents. Six other .priceCents fields exist (upgrade offers, extensions, tickets).
| | base | tuned |
|---|---|---|
| rank in full corpus | 5 | 1 |
| rank among 7 .priceCents siblings | 3 | 1 |
| cosine(query, target) | 0.51 | 0.61 |
| cosine(query, base top-1 distractor) | 0.55 (RoomUpgradeOffer) | 0.43 |
| margin to runner-up | –0.04 (target loses) | +0.12 |
Even on a natural, direct question the base model picks the wrong owner (it ranks RoomUpgradeOffer.priceCents first). The fine-tune reverses the ordering and opens a clear margin.
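The sibling rank and margin reported in these examples can be computed from a score table along the following lines. This is a sketch: `owner_report` is a hypothetical helper, margin is taken here as the target score minus the best same-named sibling, and only the 0.51 and 0.55 scores come from the table above. The other values are made up for illustration.

```python
def owner_report(scores, target):
    """Rank the target among same-named fields and measure its margin.

    `scores` maps 'Type.field' to cosine similarity with the query.
    Returns (rank among siblings, target score minus best other sibling).
    """
    field = target.split(".", 1)[1]
    siblings = {c: s for c, s in scores.items() if c.split(".", 1)[1] == field}
    ordered = sorted(siblings, key=lambda c: -siblings[c])
    rank = ordered.index(target) + 1
    best_other = max(s for c, s in siblings.items() if c != target)
    return rank, siblings[target] - best_other

base_scores = {
    "Room.priceCents": 0.51,              # from the table
    "RoomUpgradeOffer.priceCents": 0.55,  # from the table
    "Ticket.priceCents": 0.53,            # made up
    "Room.description": 0.40,             # made up, not a sibling
}

rank, margin = owner_report(base_scores, "Room.priceCents")
print(rank, round(margin, 2))
```

A negative margin means some other owner of the same field name outranks the target.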
Known limitations
- Formatting sensitivity. With raw dot notation (Type.field), the fine-tune's R@1 is only 0.308 on the GitHub schema. Always use sdl, dot_plus_gloss, or natural formatting for the corpus.
- Same-owner wrong-field rate. same_owner_wrong_field_rate@1 rose from 0.063 to 0.103. The model picks the right owner type more often but occasionally lands on the wrong field within that type. The training signal rewards owner disambiguation; within-owner field disambiguation isn't targeted. The next iteration will add competition sets that share owner and differ by field.
- Tail regression with raw dot notation. When using raw dot notation, the fine-tune's P95 rank (404) is worse than the base model's (123). The model becomes more confident: it either ranks the correct answer first or misses much harder. This is fully mitigated by using sdl (P95 40) or dot_plus_gloss (P95 41) formatting.
- Indirect queries. Queries that don't name or allude to the owner type (e.g., "get the README" → Repository.object) remain hard for both models. The fine-tune does not improve on these.
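The same-owner wrong-field rate mentioned above is not defined formally here. One plausible reading, sketched below, is the share of queries whose top-1 prediction has the correct owner type but a different field. Both the definition and the function name are assumptions, and the example predictions are made up.

```python
def same_owner_wrong_field_rate(top1, gold):
    """Share of queries whose top-1 hit has the right owner type but the wrong field."""
    hits = 0
    for pred, target in zip(top1, gold):
        pred_owner, pred_field = pred.split(".", 1)
        gold_owner, gold_field = target.split(".", 1)
        if pred_owner == gold_owner and pred_field != gold_field:
            hits += 1
    return hits / len(gold)

# Made-up top-1 predictions and their correct coordinates.
top1 = ["Room.priceCents", "Room.description", "Ticket.priceCents", "Issue.title"]
gold = ["Room.priceCents", "Room.priceCents", "Room.priceCents", "Issue.body"]

print(same_owner_wrong_field_rate(top1, gold))
```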
How you format the corpus matters
How you turn each Type.field coordinate into text before embedding it affects retrieval more than the fine-tune does. The benchmark below compares twelve formats on the GitHub GraphQL schema (52 held-out queries).
Use one of these two. They tie at the top:
```text
# sdl: if you parse the schema (MRR 0.723)
type PullRequest { baseRefName: String! }

# dot_plus_gloss: string-only, no parsing needed (MRR 0.715)
PullRequest.baseRefName — the base ref name of a pull request
```
The string-only gloss gives up almost nothing against full schema parsing (MRR 0.715 versus 0.723), so reach for dot_plus_gloss unless you already have parsed types on hand. Whatever you do, don't embed raw Type.field identifiers. With dot formatting, MRR drops to 0.393 and the P95 rank blows out roughly 10x (404 versus 40). The owner type is what carries the signal: drop it entirely (field_only) and retrieval collapses to MRR ~0.05.
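Rendering a coordinate into either recommended format takes only a few lines. The sketch below reproduces the two example strings above. The gloss template ("the <field words> of a <type words>") and the camel-case splitter are assumptions inferred from that single example, not the benchmark's actual generator, and all function names are hypothetical.

```python
import re

def split_camel(name):
    """Turn 'baseRefName' or 'PullRequest' into lowercase words."""
    words = re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", name)
    return " ".join(word.lower() for word in words)

def render_sdl(owner, field, return_type):
    """sdl format: a one-field SDL snippet (needs the parsed return type)."""
    return f"type {owner} {{ {field}: {return_type} }}"

def render_dot_plus_gloss(owner, field):
    """dot_plus_gloss format: raw coordinate, a dash (U+2014), then a plain-English gloss."""
    gloss = f"the {split_camel(field)} of a {split_camel(owner)}"
    return f"{owner}.{field} \u2014 {gloss}"

print(render_sdl("PullRequest", "baseRefName", "String!"))
print(render_dot_plus_gloss("PullRequest", "baseRefName"))
```

The template always writes "a", so owner types that start with a vowel sound would need extra handling.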
Full results
Each format is one way of rendering PullRequest.baseRefName into text before embedding; the example column shows the exact rendering. P95 is the tuned model's 95th-percentile rank, i.e. how badly the worst queries rank. Lower is better.
| format | example (PullRequest.baseRefName →) | base MRR | tuned MRR | P95 |
|---|---|---|---|---|
| sdl | type PullRequest { baseRefName: String! } | 0.511 | 0.723 | 40 |
| dot_plus_gloss | PullRequest.baseRefName — the base ref name of a pull request | 0.551 | 0.715 | 41 |
| semantic | GraphQL field PullRequest.baseRefName. Owner type… Returns: String!… | 0.368 | 0.659 | 39 |
| field_first | base ref name (PullRequest) | 0.571 | 0.652 | 70 |
| natural | the base ref name field on PullRequest | 0.420 | 0.578 | 119 |
| arrow | PullRequest > base ref name | 0.419 | 0.548 | 159 |
| colon | PullRequest: base ref name | 0.400 | 0.488 | 199 |
| split_space | pull request base ref name | 0.391 | 0.447 | 448 |
| signature | PullRequest.baseRefName: String! | 0.334 | 0.408 | 298 |
| dot | PullRequest.baseRefName (raw, no change) | 0.334 | 0.393 | 404 |
| type_only | pull request (field dropped, ablation) | 0.248 | 0.242 | 261 |
| field_only | base ref name (type dropped, ablation) | 0.063 | 0.045 | 3377 |
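A 95th-percentile rank like the P95 column can be computed as follows. This sketch uses the nearest-rank method, which is an assumption: the interpolation method behind the reported numbers is not stated. The function name and the example ranks are illustrative.

```python
def percentile_rank(ranks, pct=95):
    """Nearest-rank percentile: the smallest rank that at least pct% of queries reach."""
    ordered = sorted(ranks)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    position = (pct * len(ordered) + 99) // 100
    return ordered[max(position, 1) - 1]

# Made-up ranks for 20 queries: most are good, two are far down the list.
ranks = [1] * 12 + [2, 3, 3, 5, 8, 13, 120, 900]
print(percentile_rank(ranks))
```

With 52 queries, as in the benchmark, this picks the 50th-best rank, so the two or three worst queries do not dominate the figure.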
Training
| run | epochs | batch | lr | loss |
|---|---|---|---|---|
| qwen3 | 2 | 64 | 5e-5 | cached_mnrl |
| qwen3-e3 | 3 | 64 | 5e-5 | cached_mnrl |
Both: --max-seq-length 256, 4 hard negatives per anchor, bf16, full fine-tune (no LoRA), single H100. Published checkpoint: qwen3-e3.
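To make the loss column concrete, here is a plain-Python illustration of the multiple-negatives ranking objective. It assumes `cached_mnrl` refers to sentence-transformers' CachedMultipleNegativesRankingLoss, which, as far as I understand, optimizes the same objective with gradient caching to fit large batches. The `scale` value and the function name are assumptions, and this is not the training code used for the published checkpoint.

```python
import math

def mnrl_loss(sim, scale=20.0):
    """Cross-entropy over scaled similarities.

    sim[i][j] is the cosine similarity between anchor i and candidate j.
    Candidate i is the positive for anchor i; every other column (other
    in-batch positives and any hard negatives) acts as a negative.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum - logits[i]
    return total / len(sim)

# Two anchors, each with its positive plus one hard negative (third column).
good = [[0.9, 0.1, 0.2], [0.1, 0.9, 0.2]]  # positives score highest
bad = [[0.2, 0.1, 0.9], [0.1, 0.2, 0.9]]   # the hard negative wins
print(mnrl_loss(good) < mnrl_loss(bad))
```

Hard negatives that share a field name but differ in owner type would enter as extra columns, which is how this kind of objective can push same-named fields on the wrong type away from the query.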
Dataset
| split | rows |
|---|---|
| train | 4,788 |
| val | 94 |
| test | 223 |
| corpus | 28,893 |
Built from 7,626 raw seed pairs via world-leakage, per-row strict-leakage, and family-level semantic-dedup filters. The strict-leakage filter is aggressive on real-SDL queries, which is why val/test shrink to ~20% of raw.
Citation
- Base model: Qwen3-Embedding-0.6B
- GitHub Training GitHub-Repo-train-data
- License: Apache 2.0 (inherited from the base)







