Qwen3-Embedding-0.6B-GraphQL

An embedding model that maps a question in plain English to the GraphQL Type.field that answers it. It's made for schema retrieval in LLM agent pipelines, and appears to be the first open-source embedding model trained for the job. General-purpose embedders, the usual choice, can't reliably tell apart the near-identical field names that fill a real schema, so retrieval suffers.

When an LLM agent has to query a GraphQL API, the hard part isn't writing the query. It's grounding the query in a schema that's often thousands of fields wide and won't fit in a context window. The usual fix is RAG over the schema: embed every Type.field, retrieve the handful relevant to the question, and feed only those to the agent. General-purpose embedders struggle here because real schemas reuse field names everywhere. Dozens of types carry a description, an author, a createdAt, a state. Knowing the field name isn't enough; you have to know whose field it is.

This is a fine-tune of Qwen/Qwen3-Embedding-0.6B trained for that one task: owner-type disambiguation when field names collide. The agent gets the right coordinate in context instead of a same-named field on the wrong type. At 0.6B parameters it is small enough to run on CPU or alongside the agent's own model.
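To gauge how much field-name collision a given schema has, its coordinates can be grouped by field name. This is a minimal sketch: the coordinate list and the helper name `owners_by_field` are illustrative and not part of the model's tooling.

```python
from collections import defaultdict

def owners_by_field(coords):
    """Group 'Type.field' coordinates by field name -> sorted list of owner types."""
    groups = defaultdict(list)
    for coord in coords:
        owner, field = coord.split(".", 1)
        groups[field].append(owner)
    return {field: sorted(owners) for field, owners in groups.items()}

# Hypothetical coordinates that reuse field names the way real schemas do.
coords = [
    "Issue.description", "Incident.description", "SlaPolicy.description",
    "Issue.author", "PullRequest.author", "Repository.createdAt",
]

# Keep only the field names that appear on more than one owner type.
collisions = {f: o for f, o in owners_by_field(coords).items() if len(o) > 1}
print(collisions)
```

Every field name that appears in `collisions` is a case where the retriever has to choose between owners rather than between field names.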

The payoff, on held-out queries against schemas never seen in training:

metric	base	tuned	lift
exact_match@1	0.090	0.229	+155%
recall@10	0.215	0.435	+102%
mrr@10	0.121	0.285	+135%

On an external benchmark against the full GitHub GraphQL schema (6,342 coordinates, 52 queries, never seen in training), using sdl formatting:

metric	base	tuned	lift
MRR	0.511	0.723	+41%
R@1	0.385	0.615	+60%
R@5	0.654	0.865	+32%
P95 rank	53	40	-25%
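The lift column is a relative change against the base model. The sketch below recomputes it from the rounded GitHub-schema values in the table; `lift_pct` is an illustrative helper name. Applied to rounded table values, the result can differ by a point from a lift computed on unrounded scores.

```python
def lift_pct(base, tuned):
    """Relative change from base to tuned, in percent."""
    return (tuned - base) / base * 100.0

# (base, tuned) pairs copied from the GitHub-schema table.
github = {
    "MRR": (0.511, 0.723),
    "R@1": (0.385, 0.615),
    "R@5": (0.654, 0.865),
    "P95 rank": (53, 40),  # lower is better, so a negative lift is an improvement
}
for name, (base, tuned) in github.items():
    print(f"{name}: {lift_pct(base, tuned):+.0f}%")
```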

Drop-in for any GraphQL-aware RAG, query builder, or schema search. Ships as SentenceTransformer weights and GGUF builds for llama.cpp / Ollama.

Important: how you format the corpus matters as much as the model. Use SDL snippets or dot_plus_gloss formatting for best results. See "How you format the corpus matters" below for details.

Inference

SentenceTransformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("xthor/Qwen3-Embedding-0.6B-GraphQL")

query = "What's the nightly rate for this room?"
# coordinates of Type.field pairs
coords = [
    "Room.priceCents",
    "RoomUpgradeOffer.priceCents",
    "Ticket.priceCents",
]

q = model.encode(query, prompt_name="query")
c = model.encode(coords, prompt_name="document")
scores = (q @ c.T).tolist()

for coord, score in sorted(zip(coords, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {coord}")

Two prompts are wired into the model and must be used for best results:

prompt_name="query" for natural-language questions
prompt_name="document" for GraphQL coordinate descriptions in the corpus
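In a retrieval pipeline the corpus is typically embedded once with the document prompt, and only the query is encoded per request. The sketch below shows the top-k step on its own. The 3-dimensional vectors are made up and stand in for `model.encode(...)` output, and `top_k` is an illustrative helper, not a library function.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus, k=10):
    """corpus: list of (coordinate, vector). Returns the k best (score, coordinate) pairs."""
    scored = [(cosine(query_vec, vec), coord) for coord, vec in corpus]
    return sorted(scored, key=lambda pair: -pair[0])[:k]

# Toy vectors (assumption): in practice these come from the embedding model.
corpus = [
    ("Room.priceCents", [0.9, 0.1, 0.0]),
    ("RoomUpgradeOffer.priceCents", [0.6, 0.6, 0.1]),
    ("Ticket.priceCents", [0.1, 0.2, 0.9]),
]
query_vec = [1.0, 0.0, 0.0]

for score, coord in top_k(query_vec, corpus, k=2):
    print(f"{score:.3f}  {coord}")
```

Only the coordinates returned by `top_k` would then be rendered into the agent's context.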

Ollama

# pull one quantization (Q8_0 is a good default: near-lossless, ~650 MB)
hf download xthor/Qwen3-Embedding-0.6B-GraphQL model-q8_0.gguf --local-dir .

cat > Modelfile <<'EOF'
FROM ./model-q8_0.gguf
EOF
ollama create qwen3-graphql-embedder -f Modelfile

# OpenAI-compatible embeddings endpoint
curl -s http://localhost:11434/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen3-graphql-embedder","input":"What is the nightly rate for this room?"}' \
  | jq '.data[0].embedding'
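The same endpoint can be called from Python with only the standard library. This is a sketch under two assumptions: the server follows the OpenAI embeddings response shape (`data[i].index`, `data[i].embedding`), and the model was created under the name used in the commands above. `build_request` and `embed` are illustrative helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/embeddings"  # endpoint from the curl call

def build_request(texts, model="qwen3-graphql-embedder", url=OLLAMA_URL):
    """Build (but do not send) a POST request for a list of input strings."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(texts, **kwargs):
    """Send the request and return one embedding per input, in input order."""
    with urllib.request.urlopen(build_request(texts, **kwargs)) as resp:
        body = json.load(resp)
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]

# Usage (needs a running Ollama server with the model created as above):
# vectors = embed(["What is the nightly rate for this room?"])
```

The sketch sends the raw text. Whether the query and document prompts from the SentenceTransformers configuration have to be prepended manually for the GGUF builds is not stated here, so compare a few rankings against the SentenceTransformers output before relying on it.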

llama.cpp

hf download xthor/Qwen3-Embedding-0.6B-GraphQL model-q8_0.gguf --local-dir .

./llama-server -m model-q8_0.gguf --embedding --port 8080
# POST http://localhost:8080/embedding   { "content": "..." }

Available GGUF quantizations

file	size	use case
`model-f16.gguf`	~1.2 GB	reference quality, parity with safetensors
`model-q8_0.gguf`	~650 MB	near-lossless; recommended default
`model-q4_k_m.gguf`	~400 MB	small footprint; accepts a minor quality trade-off

Results

223 held-out test queries · 28,893-coordinate corpus · 30% real SDLs (GitHub GHES, Saleor, Shopify, AniList) never seen in training.

metric	baseline	tuned (3 epochs)	lift
exact_match@1	0.090	0.229	+0.139 (+155%)
recall@3	0.130	0.318	+0.188
recall@5	0.161	0.345	+0.184 (+114%)
recall@10	0.215	0.435	+0.220 (+102%)
mrr@10	0.121	0.285	+0.164
ndcg@10	0.143	0.320	+0.177
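For reference, the metrics in this table can be computed from the rank of the gold coordinate per query. The sketch assumes exactly one relevant coordinate per query, which the evaluation setup suggests but does not state outright; `metrics_at_k` is an illustrative name.

```python
import math

def metrics_at_k(ranks, k=10):
    """ranks: 1-based rank of the gold coordinate per query (None if not retrieved)."""
    n = len(ranks)
    hit = [r is not None and r <= k for r in ranks]
    return {
        "exact_match@1": sum(r == 1 for r in ranks) / n,
        f"recall@{k}": sum(hit) / n,
        f"mrr@{k}": sum(1.0 / r for r, h in zip(ranks, hit) if h) / n,
        # With one relevant item the ideal DCG is 1, so nDCG reduces to 1 / log2(1 + rank).
        f"ndcg@{k}": sum(1.0 / math.log2(1 + r) for r, h in zip(ranks, hit) if h) / n,
    }

# Four hypothetical queries: gold at rank 1, rank 3, rank 12, and not retrieved.
print(metrics_at_k([1, 3, 12, None], k=10))
```

Under this single-target assumption, exact_match@1 and recall@1 are the same quantity.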

Where the lift comes from

Direct questions ("has my package shipped?", "what's my total?") are mostly handled well by the base model already. Most of the gain comes from indirect questions where the user names a concept rather than a field. Those require owner-type reasoning, and that's where the base model falls behind.

Example: rank 101 → 1

"I need to understand what commitments we have regarding support response times. Where can I find that info?"

Correct target: SlaPolicy.description. The schema has 262 .description fields (on Incident, Issue, Resolution, SatisfactionSurvey, …). The task is picking the right owner, not the right field name.

	base	tuned
rank in full corpus (18,396 coordinates)	101	1
rank among 262 `.description` siblings	12	1
cosine(query, target)	0.428	0.383
cosine(query, base top-1 distractor)	0.484	0.303

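The base model ranks SatisfactionSurvey.description and Incident.description above the target. The fine-tune demotes them: every wrong owner drops to 0.15–0.22 while the target becomes the top hit. Note that the target's own cosine falls as well (0.428 → 0.383); its rank improves because the competing fields fall further.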
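Rank depends on the ordering of scores, not on their absolute values, which is how a lower cosine can still win. A small sketch: the target and top-1 distractor cosines are taken from the table, the remaining competitor scores are hypothetical, and `rank_of` is an illustrative helper.

```python
def rank_of(target_score, competitor_scores):
    """1-based rank of the target among its competitors (higher score ranks first)."""
    return 1 + sum(score > target_score for score in competitor_scores)

# Base model: target 0.428, top distractor 0.484 (table); other scores are made up.
base_rank = rank_of(0.428, [0.484, 0.450, 0.410])

# Fine-tune: target 0.383, same distractor now 0.303 (table); other scores are made up.
tuned_rank = rank_of(0.383, [0.303, 0.220, 0.150])

print(base_rank, tuned_rank)  # 3 1
```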

Example: rank 5 → 1

"What's the nightly rate for this room?"

Correct target: Room.priceCents. Six other .priceCents fields exist (upgrade offers, extensions, tickets).

	base	tuned
rank in full corpus	5	1
rank among 7 `.priceCents` siblings	3	1
cosine(query, target)	0.51	0.61
cosine(query, base top-1 distractor)	0.55 (`RoomUpgradeOffer`)	0.43
margin to runner-up	–0.04 (target loses)	+0.12

Even on a natural, direct question the base model picks the wrong owner (it ranks RoomUpgradeOffer.priceCents first). The fine-tune reverses the ordering and opens a clear margin.
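The margin row in the table is the target's cosine minus the best competing cosine. A sketch of that calculation: the two named base-model scores come from the table, the third is hypothetical, and `margin_to_best_competitor` is an illustrative name.

```python
def margin_to_best_competitor(target, scores):
    """scores: dict coordinate -> cosine. Target score minus the best other score."""
    best_other = max(s for coord, s in scores.items() if coord != target)
    return scores[target] - best_other

base_scores = {
    "Room.priceCents": 0.51,              # from the table
    "RoomUpgradeOffer.priceCents": 0.55,  # from the table
    "Ticket.priceCents": 0.40,            # hypothetical
}
margin = margin_to_best_competitor("Room.priceCents", base_scores)
print(f"{margin:+.2f}")  # -0.04: negative, so the target loses
```

A pipeline could use a small or negative margin as a signal to pass more than one candidate to the agent.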

Known limitations

Formatting sensitivity. With raw dot notation (Type.field), the fine-tune's R@1 is only 0.308 on the GitHub schema. Use sdl or dot_plus_gloss formatting for the corpus; natural formatting also beats raw dot notation but trails both (tuned MRR 0.578).
Same-owner wrong-field rate. same_owner_wrong_field_rate@1 rose from 0.063 to 0.103. The model picks the right owner type more often but occasionally lands on the wrong field within that type. The training signal rewards owner disambiguation; within-owner field disambiguation isn't targeted. The next iteration will add competition sets that share owner and differ by field.
Tail regression with raw dot notation. When using raw dot notation, the fine-tune's P95 rank (404) is worse than the base model's (123). The model becomes more confident: it either ranks the correct answer first or misses much harder. This is fully mitigated by using sdl (P95 40) or dot_plus_gloss (P95 41) formatting.
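Indirect queries without an owner cue. Queries that neither name nor allude to the owner type (e.g., "get the README" → Repository.object) remain hard for both models, and the fine-tune does not improve on them. These are a different case from the concept-level indirect questions described under "Where the lift comes from".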
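The same-owner wrong-field rate can be tracked by bucketing each top-1 prediction against the gold coordinate. This sketch assumes the metric means "top-1 shares the gold owner type but names a different field"; the exact definition used in the evaluation is not given here. Bucket names other than `same_owner_wrong_field` are illustrative.

```python
from collections import Counter

def classify_top1(gold, predicted):
    """Bucket a top-1 prediction relative to the gold 'Type.field' coordinate."""
    gold_owner, gold_field = gold.split(".", 1)
    pred_owner, pred_field = predicted.split(".", 1)
    if predicted == gold:
        return "exact"
    if pred_owner == gold_owner:
        return "same_owner_wrong_field"
    if pred_field == gold_field:
        return "wrong_owner_same_field"
    return "other"

def top1_rates(pairs):
    """pairs: list of (gold, predicted). Returns each bucket's share of all queries."""
    counts = Counter(classify_top1(gold, pred) for gold, pred in pairs)
    return {bucket: n / len(pairs) for bucket, n in counts.items()}

# Hypothetical predictions for four queries, one per bucket.
pairs = [
    ("Room.priceCents", "Room.priceCents"),
    ("Room.priceCents", "Room.currency"),
    ("SlaPolicy.description", "Incident.description"),
    ("Repository.object", "Issue.title"),
]
print(top1_rates(pairs))
```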

How you format the corpus matters

How you turn each Type.field coordinate into text before embedding it affects retrieval more than the fine-tune does. The benchmark compares twelve formats on the GitHub GraphQL schema (52 held-out queries); the complete table is under "Full results".

Use one of these two. They tie at the top:

# sdl: if you parse the schema (MRR 0.723)
type PullRequest { baseRefName: String! }

# dot_plus_gloss: string-only, no parsing needed (MRR 0.715)
PullRequest.baseRefName — the base ref name of a pull request

dot_plus_gloss gives up almost nothing against sdl (MRR 0.715 vs. 0.723) and needs no schema parsing, so reach for it unless you already have parsed types on hand. Whatever you do, don't embed raw Type.field identifiers: with dot formatting, tuned MRR drops to 0.393 and the P95 rank grows roughly 10x (404 vs. 40). The owner type is what carries the signal: drop it entirely (field_only) and retrieval collapses to MRR ~0.05.
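The two recommended formats can be produced with a few lines of string handling. This is a sketch: the gloss template ("the <field words> of a <type words>") is inferred from the single dot_plus_gloss example shown here and may not match how the benchmark corpus was generated. The helper names are illustrative.

```python
import re

def words(name):
    """Split camelCase or PascalCase into lowercase words: 'baseRefName' -> 'base ref name'."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name).lower()

def sdl(owner, field, return_type):
    """Render a one-field SDL snippet; return_type must come from the parsed schema."""
    return f"type {owner} {{ {field}: {return_type} }}"

def dot_plus_gloss(owner, field):
    """Render 'Type.field' followed by a plain-English gloss; needs no schema parsing."""
    owner_words = words(owner)
    article = "an" if owner_words[0] in "aeiou" else "a"
    # \u2014 is the dash separator used in the dot_plus_gloss example.
    return f"{owner}.{field} \u2014 the {words(field)} of {article} {owner_words}"

print(sdl("PullRequest", "baseRefName", "String!"))
print(dot_plus_gloss("PullRequest", "baseRefName"))
```

The camelCase split is deliberately simple and does not separate runs of capitals (for example an acronym followed by a capitalized word), so a real corpus may need a more careful tokenizer.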

Full results

Each format is one way of rendering PullRequest.baseRefName into text before embedding (the example column shows exactly what). P95 is the 95th-percentile rank of the correct coordinate for the tuned model, i.e. how badly the worst queries rank. Lower is better.

format	example (`PullRequest.baseRefName` →)	base MRR	tuned MRR	tuned P95 rank
`sdl`	`type PullRequest { baseRefName: String! }`	0.511	0.723	40
`dot_plus_gloss`	`PullRequest.baseRefName — the base ref name of a pull request`	0.551	0.715	41
`semantic`	`GraphQL field PullRequest.baseRefName. Owner type… Returns: String!…`	0.368	0.659	39
`field_first`	`base ref name (PullRequest)`	0.571	0.652	70
`natural`	`the base ref name field on PullRequest`	0.420	0.578	119
`arrow`	`PullRequest > base ref name`	0.419	0.548	159
`colon`	`PullRequest: base ref name`	0.400	0.488	199
`split_space`	`pull request base ref name`	0.391	0.447	448
`signature`	`PullRequest.baseRefName: String!`	0.334	0.408	298
`dot`	`PullRequest.baseRefName` (raw, no change)	0.334	0.393	404
`type_only`	`pull request` (field dropped, ablation)	0.248	0.242	261
`field_only`	`base ref name` (type dropped, ablation)	0.063	0.045	3377
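The P95 column can be reproduced from per-query ranks with a percentile function. The sketch uses the nearest-rank method, which is an assumption: the percentile method used for the benchmark is not stated. The ranks are hypothetical.

```python
import math

def percentile_rank(ranks, pct=95):
    """Nearest-rank percentile of a list of 1-based ranks."""
    ordered = sorted(ranks)
    # Integer arithmetic before the division avoids float rounding at exact boundaries.
    index = math.ceil(pct * len(ordered) / 100) - 1
    return ordered[max(index, 0)]

# Hypothetical ranks of the gold coordinate for 20 queries.
ranks = [1] * 12 + [2, 3, 5, 8, 13, 40, 404, 3377]

print(percentile_rank(ranks))      # 404
print(percentile_rank(ranks, 50))  # 1
```

Because P95 looks only at the tail, a format can have a good MRR and still leave a few queries ranked in the hundreds.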

Training

run	epochs	batch	lr	loss
`qwen3`	2	64	5e-5	cached_mnrl
`qwen3-e3`	3	64	5e-5	cached_mnrl

Both: --max-seq-length 256, 4 hard negatives per anchor, bf16, full fine-tune (no LoRA), single H100. Published checkpoint: qwen3-e3.
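cached_mnrl presumably refers to the cached variant of multiple-negatives ranking loss in sentence-transformers. The sketch below shows only the underlying objective on toy similarities, not the training script: each anchor is scored against all in-batch positives plus hard negatives, and cross-entropy pushes its own positive to the top. The scale of 20 is the sentence-transformers default as far as known; the value actually used in training is not stated.

```python
import math

def mnrl_loss(sim_rows, scale=20.0):
    """Mean cross-entropy where row i's correct candidate sits at column i.

    sim_rows[i] holds cosine similarities between anchor i and every candidate:
    first the batch's positives (column i is anchor i's own), then hard negatives.
    """
    total = 0.0
    for i, row in enumerate(sim_rows):
        logits = [scale * s for s in row]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(sim_rows)

# Two anchors; columns 0-1 are in-batch positives, columns 2-3 are hard negatives (toy numbers).
good = [[0.9, 0.1, 0.2, 0.1], [0.1, 0.8, 0.1, 0.3]]
bad = [[0.4, 0.1, 0.5, 0.1], [0.1, 0.3, 0.1, 0.6]]

print(mnrl_loss(good) < mnrl_loss(bad))  # True
```

Hard negatives that share a field name with the positive but sit on a different owner type are what would make this objective teach owner disambiguation.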

Dataset

split	rows
train	4,788
val	94
test	223
corpus	28,893

Built from 7,626 raw seed pairs via world-leakage, per-row strict-leakage, and family-level semantic-dedup filters. The strict-leakage filter is aggressive on real-SDL queries, which is why val/test shrink to ~20% of raw.

Citation

Base model: Qwen3-Embedding-0.6B
Training repository (GitHub): GitHub-Repo-train-data
License: Apache 2.0 (inherited from the base)
