Qwen3-Embedding-0.6B-GraphQL

An embedding model that maps a question in plain English to the GraphQL Type.field that answers it. It's made for schema retrieval in LLM agent pipelines, and appears to be the first open-source embedding model trained for the job. General-purpose embedders, the usual choice, can't reliably tell apart the near-identical field names that fill a real schema, so retrieval suffers.

When an LLM agent has to query a GraphQL API, the hard part isn't writing the query. It's grounding the query in a schema that's often thousands of fields wide and won't fit in a context window. The usual fix is RAG over the schema: embed every Type.field, retrieve the handful relevant to the question, and feed only those to the agent. General-purpose embedders struggle here because real schemas reuse field names everywhere. Dozens of types carry a description, an author, a createdAt, a state. Knowing the field name isn't enough; you have to know whose field it is.
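
The first step of that pipeline, listing every Type.field coordinate, can be sketched with the standard library alone. This is a minimal illustration, not the author's tooling: the regexes, the `list_coordinates` helper, and the toy SDL are all assumptions, and the parser only handles single-line field definitions in `type`/`interface` blocks without nested braces. A real schema is better walked with a proper GraphQL parser.

```python
import re

# Matches "type Name { ... }" and "interface Name { ... }" blocks.
# Assumption: block bodies contain no nested braces.
TYPE_BLOCK = re.compile(r"\b(?:type|interface)\s+(\w+)[^{]*\{([^}]*)\}")

# Matches one field per line: name, optional (arguments), then the return type.
FIELD_LINE = re.compile(r"^\s*(\w+)\s*(?:\([^)]*\))?\s*:\s*([^\s#]+)")


def list_coordinates(sdl: str) -> list[tuple[str, str, str]]:
    """Return (owner_type, field_name, return_type) for every field found."""
    coords = []
    for owner, body in TYPE_BLOCK.findall(sdl):
        for line in body.splitlines():
            match = FIELD_LINE.match(line)
            if match:
                coords.append((owner, match.group(1), match.group(2)))
    return coords


# Hypothetical toy schema, only for illustration.
SDL = """
type Room {
  priceCents: Int!
  description: String
}

type Ticket {
  priceCents: Int!
  seat(row: Int!): String
}
"""

coords = list_coordinates(SDL)
for owner, field, _ in coords:
    print(f"{owner}.{field}")
```

Each tuple keeps the owner type and return type next to the field name, which is what the corpus formats discussed later need.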

This is a fine-tune of Qwen/Qwen3-Embedding-0.6B trained for that one task: owner-type disambiguation when field names collide. The agent gets the right coordinate in context instead of a same-named field on the wrong type. At 0.6B parameters, it is small enough to run on CPU or alongside the agent's own model.

The payoff, on held-out queries against schemas never seen in training:

| metric | base | tuned | lift |
| --- | --- | --- | --- |
| exact_match@1 | 0.090 | 0.229 | +155% |
| recall@10 | 0.215 | 0.435 | +102% |
| mrr@10 | 0.121 | 0.285 | +135% |

On an external benchmark against the full GitHub GraphQL schema (6,342 coordinates, 52 queries, never seen in training), using sdl formatting:

| metric | base | tuned | lift |
| --- | --- | --- | --- |
| MRR | 0.511 | 0.723 | +41% |
| R@1 | 0.385 | 0.615 | +60% |
| R@5 | 0.654 | 0.865 | +32% |
| P95 rank | 53 | 40 | -25% |

Drop-in for any GraphQL-aware RAG, query builder, or schema search. Ships as SentenceTransformer weights and GGUF builds for llama.cpp / Ollama.

Important: how you format the corpus matters as much as the model. Use sdl or dot_plus_gloss formatting for best results. See "How you format the corpus matters" below for details.


Inference

SentenceTransformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("xthor/Qwen3-Embedding-0.6B-GraphQL")

query = "What's the nightly rate for this room?"
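# candidate Type.field coordinates; raw dot notation keeps the demo short,
# but a real corpus should use sdl or dot_plus_gloss formatting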
coords = [
    "Room.priceCents",
    "RoomUpgradeOffer.priceCents",
    "Ticket.priceCents",
]

q = model.encode(query, prompt_name="query")
c = model.encode(coords, prompt_name="document")
scores = (q @ c.T).tolist()

for coord, score in sorted(zip(coords, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {coord}")

Two prompts are wired into the model and must be used for best results:

  • prompt_name="query" for natural-language questions
  • prompt_name="document" for GraphQL coordinate descriptions in the corpus

Ollama

# pull one quantization (Q8_0 is a good default: near-lossless, ~650 MB)
hf download xthor/Qwen3-Embedding-0.6B-GraphQL model-q8_0.gguf --local-dir .

cat > Modelfile <<'EOF'
FROM ./model-q8_0.gguf
EOF
ollama create qwen3-graphql-embedder -f Modelfile

# OpenAI-compatible embeddings endpoint
curl -s http://localhost:11434/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen3-graphql-embedder","input":"What is the nightly rate for this room?"}' \
  | jq '.data[0].embedding'

llama.cpp

hf download xthor/Qwen3-Embedding-0.6B-GraphQL model-q8_0.gguf --local-dir .

./llama-server -m model-q8_0.gguf --embedding --port 8080
# POST http://localhost:8080/embedding   { "content": "..." }

Available GGUF quantizations

| file | size | use case |
| --- | --- | --- |
| model-f16.gguf | ~1.2 GB | reference quality, parity with safetensors |
| model-q8_0.gguf | ~650 MB | near-lossless; recommended default |
| model-q4_k_m.gguf | ~400 MB | small footprint; accepts a minor quality trade-off |

Results

223 held-out test queries · 28,893-coordinate corpus · 30% real SDLs (GitHub GHES, Saleor, Shopify, AniList) never seen in training.

| metric | baseline | tuned (3 epochs) | lift |
| --- | --- | --- | --- |
| exact_match@1 | 0.090 | 0.229 | +0.139 (+155%) |
| recall@3 | 0.130 | 0.318 | +0.188 |
| recall@5 | 0.161 | 0.345 | +0.184 (+114%) |
| recall@10 | 0.215 | 0.435 | +0.220 (+102%) |
| mrr@10 | 0.121 | 0.285 | +0.164 |
| ndcg@10 | 0.143 | 0.320 | +0.177 |
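
For readers who want to reproduce these numbers on their own schema, the metrics can be sketched from per-query ranks. This is a minimal sketch that assumes one correct coordinate per query and 1-based ranks. The evaluation script behind the table is not shown here, so the function names and the toy ranks are illustrative assumptions.

```python
import math


def exact_match_at_1(ranks):
    """Share of queries whose correct coordinate is ranked first."""
    return sum(r == 1 for r in ranks) / len(ranks)


def recall_at_k(ranks, k):
    """Share of queries whose correct coordinate appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def mrr_at_k(ranks, k):
    """Mean reciprocal rank, counting 0 when the target is outside the top k."""
    return sum(1.0 / r for r in ranks if r <= k) / len(ranks)


def ndcg_at_k(ranks, k):
    """nDCG with a single relevant item: the ideal DCG is 1, so each query
    contributes 1 / log2(1 + rank) if the target is in the top k."""
    return sum(1.0 / math.log2(1 + r) for r in ranks if r <= k) / len(ranks)


def relative_lift(base, tuned):
    """Relative improvement of tuned over base, e.g. 1.02 for +102%."""
    return (tuned - base) / base


# Hypothetical ranks of the correct coordinate for four queries.
ranks = [1, 3, 12, 2]
print(exact_match_at_1(ranks), recall_at_k(ranks, 3), mrr_at_k(ranks, 10))

# Lift for recall@10 using the values reported in the table.
print(f"{relative_lift(0.215, 0.435):+.0%}")
```

With a single relevant item per query, exact_match@1 and recall@1 coincide, which is why the external benchmark can report the same quantity as R@1.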

[Figure: baseline vs. tuned, headline metrics]

[Figure: recall@k across the sweep]

Where the lift comes from

Direct questions ("has my package shipped?", "what's my total?") are already handled well by the base model. The gains come from indirect questions where the user names a concept rather than a field. Those require owner-type reasoning, and that's where the base model falls behind.

Example: rank 101 → 1

"I need to understand what commitments we have regarding support response times. Where can I find that info?"

Correct target: SlaPolicy.description. The schema has 262 .description fields (on Incident, Issue, Resolution, SatisfactionSurvey, …). The task is picking the right owner, not the right field name.

| | base | tuned |
| --- | --- | --- |
| rank in full corpus (18,396 coordinates) | 101 | 1 |
| rank among 262 .description siblings | 12 | 1 |
| cosine(query, target) | 0.428 | 0.383 |
| cosine(query, base top-1 distractor) | 0.484 | 0.303 |

[Figure: SlaPolicy sibling cosines]

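The base model ranks SatisfactionSurvey.description and Incident.description above the target. The fine-tune demotes them: every wrong owner drops to 0.15–0.22 while the target becomes the top hit. The target's own cosine also falls (0.428 → 0.383); its rank improves because the distractors fall further.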

[Figure: SlaPolicy ranking ladder]

Example: rank 5 → 1

"What's the nightly rate for this room?"

Correct target: Room.priceCents. Six other .priceCents fields exist (upgrade offers, extensions, tickets).

| | base | tuned |
| --- | --- | --- |
| rank in full corpus | 5 | 1 |
| rank among 7 .priceCents siblings | 3 | 1 |
| cosine(query, target) | 0.51 | 0.61 |
| cosine(query, base top-1 distractor) | 0.55 (RoomUpgradeOffer) | 0.43 |
| margin to runner-up | -0.04 (target loses) | +0.12 |

[Figure: Room sibling cosines]

Even on a natural, direct question the base model picks the wrong owner (it ranks RoomUpgradeOffer.priceCents first). The fine-tune reverses the ordering and opens a clear margin.
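
The sibling rank and margin reported in these examples can be computed from a score table. This is a sketch under assumptions: `disambiguation_stats` is a hypothetical helper, and the margin is taken as the target's cosine minus the best competing cosine, which matches the -0.04 in the base column. Only the two cosines marked "reported" come from the table; the other scores are invented placeholders.

```python
def disambiguation_stats(scores, target):
    """scores maps 'Type.field' to its cosine with the query;
    target is the correct coordinate."""
    field = target.split(".", 1)[1]

    # Siblings share the field name and differ only by owner type.
    siblings = [c for c in scores if c.split(".", 1)[1] == field]
    siblings.sort(key=scores.get, reverse=True)

    # Best score among all candidates other than the target.
    best_other = max(s for c, s in scores.items() if c != target)

    return {
        "sibling_rank": siblings.index(target) + 1,
        "margin": round(scores[target] - best_other, 2),
    }


base_scores = {
    "Room.priceCents": 0.51,              # reported (base, target)
    "RoomUpgradeOffer.priceCents": 0.55,  # reported (base, top-1 distractor)
    "Ticket.priceCents": 0.53,            # placeholder, not a reported value
    "Room.description": 0.40,             # placeholder, not a reported value
}

stats = disambiguation_stats(base_scores, "Room.priceCents")
print(stats)
```

A negative margin means some other coordinate outranks the target. A positive margin is the cushion the correct owner has over its closest competitor.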

[Figure: Room ranking ladder]

Known limitations

  1. Formatting sensitivity. With raw dot notation (Type.field), the fine-tune's R@1 is only 0.308 on the GitHub schema. Always use sdl, dot_plus_gloss, or natural formatting for the corpus.

  2. Same-owner wrong-field rate. same_owner_wrong_field_rate@1 rose from 0.063 to 0.103. The model picks the right owner type more often but occasionally lands on the wrong field within that type. The training signal rewards owner disambiguation; within-owner field disambiguation isn't targeted. The next iteration will add competition sets that share owner and differ by field.

  3. Tail regression with raw dot notation. When using raw dot notation, the fine-tune's P95 rank (404) is worse than the base model's (123). The model becomes more confident: it either ranks the correct answer first or misses much harder. This is fully mitigated by using sdl (P95 40) or dot_plus_gloss (P95 41) formatting.

  4. Indirect queries with no owner cue. Queries that don't name or allude to the owner type (e.g., "get the README" → Repository.object) remain hard for both models. The fine-tune does not improve on these.
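
To track limitation 2 on your own data, top-1 errors can be split by whether the owner or the field was wrong. This sketch assumes same_owner_wrong_field_rate@1 is the share of all queries whose top hit has the correct owner type but a different field. The exact definition used for the reported 0.063 and 0.103 is not given here, and the helper name and sample pairs are hypothetical.

```python
def top1_error_breakdown(pairs):
    """pairs: list of (predicted_top1, gold) coordinates written as 'Type.field'.
    Returns the share of queries that fall into each outcome."""
    counts = {
        "exact": 0,
        "same_owner_wrong_field": 0,
        "wrong_owner_same_field": 0,
        "other": 0,
    }
    for predicted, gold in pairs:
        p_owner, p_field = predicted.split(".", 1)
        g_owner, g_field = gold.split(".", 1)
        if predicted == gold:
            counts["exact"] += 1
        elif p_owner == g_owner:
            counts["same_owner_wrong_field"] += 1
        elif p_field == g_field:
            counts["wrong_owner_same_field"] += 1
        else:
            counts["other"] += 1
    total = len(pairs)
    return {name: count / total for name, count in counts.items()}


# Hypothetical predictions for four queries that share one gold coordinate.
pairs = [
    ("Room.priceCents", "Room.priceCents"),    # exact hit
    ("Room.description", "Room.priceCents"),   # right owner, wrong field
    ("Ticket.priceCents", "Room.priceCents"),  # wrong owner, same field name
    ("Ticket.seat", "Room.priceCents"),        # unrelated
]
breakdown = top1_error_breakdown(pairs)
print(breakdown)
```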

[Figure: metric deltas]

How you format the corpus matters

How you turn each Type.field coordinate into text before embedding it affects retrieval more than the fine-tune does. The benchmark below compares twelve formats on the GitHub GraphQL schema (52 held-out queries):

[Figure: embedding style comparison]

Use one of these two. They tie at the top:

# sdl: if you parse the schema (MRR 0.723)
type PullRequest { baseRefName: String! }

# dot_plus_gloss: string-only, no parsing needed (MRR 0.715)
PullRequest.baseRefName — the base ref name of a pull request

dot_plus_gloss needs only string manipulation and gives up almost nothing against sdl (MRR 0.715 vs. 0.723), so reach for it unless you already have parsed types on hand. Whatever you do, don't embed raw Type.field identifiers: with dot formatting, the tuned model's MRR drops to 0.393 and its P95 rank grows roughly 10x (404 vs. 40). The owner type is what carries the signal: drop it entirely (field_only) and retrieval collapses to MRR ~0.05.
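
Both recommended formats can be produced with a few lines of string handling. This is a sketch, not the generator used for the benchmark: the helper names are hypothetical, and the gloss template ("the <field words> of a <type words>") is inferred from the single PullRequest.baseRefName example. It does not handle "a" vs. "an" or acronym-heavy names.

```python
import re

EM_DASH = "\u2014"  # separator that appears in the dot_plus_gloss example


def split_words(identifier: str) -> str:
    """Split camelCase or PascalCase into lowercase words:
    'baseRefName' becomes 'base ref name'."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", identifier).lower()


def render_sdl(owner: str, field: str, return_type: str) -> str:
    """One-field SDL snippet; needs the return type from a parsed schema."""
    return f"type {owner} {{ {field}: {return_type} }}"


def render_dot_plus_gloss(owner: str, field: str) -> str:
    """Raw coordinate followed by a plain-English gloss; string-only."""
    gloss = f"the {split_words(field)} of a {split_words(owner)}"
    return f"{owner}.{field} {EM_DASH} {gloss}"


print(render_sdl("PullRequest", "baseRefName", "String!"))
print(render_dot_plus_gloss("PullRequest", "baseRefName"))
```

Whichever format is chosen, every document in the corpus should go through the same renderer before it is embedded.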

Full results

Each format is one way of rendering PullRequest.baseRefName into text before embedding (the example column shows exactly what). P95 is the 95th-percentile rank, i.e. how badly the worst queries rank. Lower is better.

| format | example (PullRequest.baseRefName →) | base MRR | tuned MRR | tuned P95 |
| --- | --- | --- | --- | --- |
| sdl | type PullRequest { baseRefName: String! } | 0.511 | 0.723 | 40 |
| dot_plus_gloss | PullRequest.baseRefName — the base ref name of a pull request | 0.551 | 0.715 | 41 |
| semantic | GraphQL field PullRequest.baseRefName. Owner type… Returns: String!… | 0.368 | 0.659 | 39 |
| field_first | base ref name (PullRequest) | 0.571 | 0.652 | 70 |
| natural | the base ref name field on PullRequest | 0.420 | 0.578 | 119 |
| arrow | PullRequest > base ref name | 0.419 | 0.548 | 159 |
| colon | PullRequest: base ref name | 0.400 | 0.488 | 199 |
| split_space | pull request base ref name | 0.391 | 0.447 | 448 |
| signature | PullRequest.baseRefName: String! | 0.334 | 0.408 | 298 |
| dot | PullRequest.baseRefName (raw, no change) | 0.334 | 0.393 | 404 |
| type_only | pull request (field dropped, ablation) | 0.248 | 0.242 | 261 |
| field_only | base ref name (type dropped, ablation) | 0.063 | 0.045 | 3377 |
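
P95 can be computed from the per-query ranks of the correct coordinate. The sketch below assumes the nearest-rank percentile method. The method behind the table is not stated, so values near the tail may differ slightly under interpolation. The sample ranks are hypothetical.

```python
import math


def percentile_rank(ranks, pct=95.0):
    """Nearest-rank percentile: the smallest observed rank such that at
    least pct percent of the queries rank at or better than it."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    ordered = sorted(ranks)
    index = max(0, math.ceil(pct * len(ordered) / 100) - 1)
    return ordered[index]


# Hypothetical run of 52 queries: most are rank 1, four miss badly.
ranks = [1] * 48 + [40, 120, 404, 900]
p95 = percentile_rank(ranks)
median = percentile_rank(ranks, pct=50.0)
print(median, p95)
```

With 52 queries, P95 is set by the third-worst query, so a few hard misses can move it a long way while MRR and R@1 barely change.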

Training

| run | epochs | batch | lr | loss |
| --- | --- | --- | --- | --- |
| qwen3 | 2 | 64 | 5e-5 | cached_mnrl |
| qwen3-e3 | 3 | 64 | 5e-5 | cached_mnrl |

Both: --max-seq-length 256, 4 hard negatives per anchor, bf16, full fine-tune (no LoRA), single H100. Published checkpoint: qwen3-e3.
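
To make the objective concrete, here is a pure-Python sketch of a multiple-negatives ranking loss. It assumes cached_mnrl refers to sentence-transformers' CachedMultipleNegativesRankingLoss, which computes the same loss value as the uncached version and only changes how gradients are accumulated. The cosine similarity and scale of 20 are that library's defaults, not confirmed settings for this run, and the 2-D vectors are toy data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnrl_loss(anchors, positives, hard_negatives, scale=20.0):
    """Mean cross-entropy where anchor i must pick positive i out of
    every positive and every hard negative in the batch."""
    # Candidate pool: all positives first, then all hard negatives.
    candidates = positives + [n for negs in hard_negatives for n in negs]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        peak = max(logits)
        # Numerically stable log-sum-exp over the candidate pool.
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)


# Toy batch: two anchors, one hard negative each.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
hard_negatives = [[[-1.0, 0.0]], [[0.0, -1.0]]]

aligned = mnrl_loss(anchors, positives, hard_negatives)
swapped = mnrl_loss(anchors, positives[::-1], hard_negatives)
print(aligned, swapped)
```

Under that assumption, a batch of 64 with 4 hard negatives per anchor gives each anchor 64 + 256 = 320 candidates, of which one is its own positive.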

Dataset

| split | rows |
| --- | --- |
| train | 4,788 |
| val | 94 |
| test | 223 |
| corpus | 28,893 |

Built from 7,626 raw seed pairs via world-leakage, per-row strict-leakage, and family-level semantic-dedup filters. The strict-leakage filter is aggressive on real-SDL queries, which is why val/test shrink to ~20% of raw.

