On a GPU VM, TEI tries to load pytorch_model.bin instead of ONNX

#2
by weiywang - opened

ubuntu@ip-10-255-6-200:~/src$ sudo docker run --gpus all -p 8080:80 -v $PWD:/data \
    ghcr.io/huggingface/text-embeddings-inference:latest \
    --model-id janni-t/qwen3-embedding-0.6b-tei-onnx \
    --pooling mean
2025-07-21T18:57:23.815909Z INFO text_embeddings_router: router/src/main.rs:189: Args { model_id: "jan**-/--.-***-*nnx", revision: None, tokenization_workers: None, dtype: None, pooling: Some(Mean), max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hf_token: None, hostname: "77c150589237", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-07-21T18:57:23.912096Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:20: Starting download
2025-07-21T18:57:24.021564Z INFO download_artifacts:download_new_st_config: text_embeddings_core::download: core/src/download.rs:77: Downloading config_sentence_transformers.json
2025-07-21T18:57:24.021618Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:40: Downloading config.json
2025-07-21T18:57:24.021639Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:43: Downloading tokenizer.json
2025-07-21T18:57:24.021665Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:47: Model artifacts downloaded in 109.570159ms
2025-07-21T18:57:24.406347Z WARN text_embeddings_router: router/src/lib.rs:189: Could not find a Sentence Transformers config
2025-07-21T18:57:24.406369Z INFO text_embeddings_router: router/src/lib.rs:193: Maximum number of tokens per request: 32768
2025-07-21T18:57:24.406622Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 8 tokenization workers
2025-07-21T18:57:24.731375Z INFO text_embeddings_router: router/src/lib.rs:235: Starting model backend
2025-07-21T18:57:24.735574Z INFO text_embeddings_backend: backends/src/lib.rs:507: Downloading model.safetensors
2025-07-21T18:57:24.800868Z WARN text_embeddings_backend: backends/src/lib.rs:510: Could not download model.safetensors: request error: HTTP status client error (404 Not Found) for url (https://huggingface.co/janni-t/qwen3-embedding-0.6b-tei-onnx/resolve/main/model.safetensors)
2025-07-21T18:57:24.800884Z INFO text_embeddings_backend: backends/src/lib.rs:515: Downloading model.safetensors.index.json
2025-07-21T18:57:24.821535Z WARN text_embeddings_backend: backends/src/lib.rs:383: safetensors weights not found. Using pytorch_model.bin instead. Model loading will be significantly slower.
2025-07-21T18:57:24.821547Z INFO text_embeddings_backend: backends/src/lib.rs:384: Downloading pytorch_model.bin
Error: Could not create backend

Caused by:
Weights not found: request error: HTTP status client error (404 Not Found) for url (https://huggingface.co/janni-t/qwen3-embedding-0.6b-tei-onnx/resolve/main/pytorch_model.bin)
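A likely explanation, based on the log above: the CUDA image's backend resolves weights by trying model.safetensors first and then falling back to pytorch_model.bin, and it never considers ONNX files; as far as I understand, the ONNX runtime ships only in the CPU image. A repo that contains only ONNX weights therefore has nothing the GPU backend can load. A minimal sketch of that fallback order (helper name and file lists are illustrative, not TEI's actual code):

```python
def resolve_weights(repo_files):
    """Return the weight file the GPU backend would load, or None if startup fails.

    Mirrors the order seen in the log: model.safetensors first,
    then pytorch_model.bin; ONNX files are never considered.
    """
    for candidate in ("model.safetensors", "pytorch_model.bin"):
        if candidate in repo_files:
            return candidate
    return None  # corresponds to "Error: Could not create backend"


# An ONNX-only repo has neither file, so the backend cannot start:
print(resolve_weights(["model.onnx", "config.json", "tokenizer.json"]))  # None
print(resolve_weights(["model.safetensors", "config.json"]))  # model.safetensors
```

If that reading is right, the fixes would be either to run the CPU image (which includes the ONNX backend) or to add safetensors weights to the repo so the GPU backend has something to load.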

Did you find any solution?
