
Serve CLI

The transformers serve CLI is a lightweight option for local or self-hosted serving. It avoids the extra runtime and operational overhead of dedicated inference engines like vLLM. Use it for evaluation, experimentation, and moderate-load deployments. Features like continuous batching increase throughput and lower latency.

For large-scale production deployments, use vLLM or SGLang with a Transformers model as the backend. Learn more in the Inference backends guide.

The transformers serve command spawns a local server compatible with the OpenAI SDK. The server works with many third-party applications and supports the REST APIs below.

  • /v1/chat/completions for text, image, audio, and video requests
  • /v1/responses supports the Responses API
  • /v1/audio/transcriptions for audio transcriptions
  • /v1/models lists available models for third-party integrations
  • /load_model streams model loading progress via SSE

Install the serving dependencies.

pip install transformers[serving]

Run transformers serve to launch a server. The default server address is http://localhost:8000.

transformers serve
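
To confirm the server is running, query one of the endpoints above, for example the model listing (covered in more detail in the v1/models section below).

curl http://localhost:8000/v1/models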

v1/chat/completions

The v1/chat/completions API is based on the Chat Completions API. It supports text, image, audio, and video requests for LLMs, VLMs, and multimodal models. Use it with curl, the InferenceClient, or the OpenAI client.

Text-based completions

from huggingface_hub import InferenceClient

messages = [{"role": "user", "content": "What is the Transformers library known for?"}]
client = InferenceClient("http://localhost:8000")

result = client.chat_completion(messages, model="Qwen/Qwen2.5-0.5B-Instruct", max_tokens=256)
print(result.choices[0].message.content)

The InferenceClient returns a printed string.

The Transformers library is primarily known for its ability to create and manipulate large-scale language models [...]
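
The same request can also be sent with curl. A minimal sketch, assuming the standard OpenAI Chat Completions payload:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [{"role": "user", "content": "What is the Transformers library known for?"}],
    "max_tokens": 256
  }'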

Image-based completions

from huggingface_hub import InferenceClient

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg",
                }
            },
        ],
    }
]
client = InferenceClient("http://localhost:8000")

result = client.chat_completion(messages, model="Qwen/Qwen2.5-VL-7B-Instruct", max_tokens=256)
print(result.choices[0].message.content)

The InferenceClient returns a printed string.

The image depicts an astronaut in a space suit standing on what appears to be the surface of the moon, given the barren, rocky landscape and the dark sky in the background. The astronaut is holding a large egg that has cracked open, revealing a small creature inside. The scene is imaginative and playful, combining elements of space exploration with a whimsical twist involving the egg and the creature.

Audio completions

Multimodal models like Gemma 3n and Qwen2.5-Omni accept audio input using the OpenAI input_audio content type. The audio must be base64-encoded and its format (mp3 or wav) must be specified.

import base64
import httpx
from huggingface_hub import InferenceClient

audio_url = "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama_first_45_secs.mp3"
audio_b64 = base64.b64encode(httpx.get(audio_url, follow_redirects=True).content).decode()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this audio."},
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "mp3"}},
        ],
    }
]
client = InferenceClient("http://localhost:8000")

result = client.chat_completion(messages, model="google/gemma-4-E2B-it", max_tokens=256)
print(result.choices[0].message.content)

The InferenceClient returns a printed string.

This week, I traveled to Chicago to deliver my final farewell address to the nation, following in the tradition of presidents before me. It was an opportunity to say thank you. Whether we've seen eye to eye or rarely agreed at all, my conversations with you, the American people, in living rooms and schools, at farms and on factory floors, at diners, and on distant military outposts, all these conversations are what have kept me honest.

The audio_url content type is an extension not part of the OpenAI standard and may change in future versions.

As a convenience, audio can also be passed by URL using the audio_url content type, avoiding the need for base64 encoding.

# `client` is an OpenAI-compatible client pointed at the server, as in the examples above.
completion = client.chat.completions.create(
    model="google/gemma-3n-E2B-it",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this audio."},
                {"type": "audio_url", "audio_url": {"url": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama_first_45_secs.mp3"}},
            ],
        }
    ],
)

Video completions

The video_url content type is an extension not part of the OpenAI standard and may change in future versions.

Video input is supported using the video_url content type. If the model supports audio (e.g. Gemma 3n, Qwen2.5-Omni), the audio track is automatically extracted from the video and processed alongside the visual frames.

Video processing requires torchcodec. Install it with pip install torchcodec.

from huggingface_hub import InferenceClient

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/concert.mp4"}},
            {"type": "text", "text": "What is happening in the video and what is the song about?"},
        ],
    }
]
client = InferenceClient("http://localhost:8000")

result = client.chat_completion(messages, model="google/gemma-4-E2B-it", max_tokens=256)
print(result.choices[0].message.content)

The InferenceClient returns a printed string.

The video captures a live music performance at a music festival or a large concert. There are several musicians on stage, including a central figure playing an acoustic guitar and singing. The foreground is filled with the backs of the audience, indicating a large crowd watching the show. The stage is dramatically lit with bright spotlights and blue and white stage lighting, with haze and smoke creating an immersive atmosphere.

The lyrics of the song are: "I don't care 'bout street, from that fresh street, 'cause there's no problem, another one I want to be, in the storm..."

v1/responses

The Responses API is OpenAI’s latest API endpoint for generation. It supports stateful interactions and integrates built-in tools to extend a model’s capabilities. OpenAI recommends using the Responses API over the Chat Completions API for new projects.

The v1/responses API supports text, image, audio, and video requests through the curl command and OpenAI client.

Text-based responses

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="<random_string>")

response = client.responses.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    input="Tell me a three sentence bedtime story about a unicorn.",
    max_output_tokens=256,
    stream=False,
)
print(response.output[0].content[0].text)

The OpenAI client returns a printed string.

Once upon a time, in a faraway land, there lived a beautiful unicorn named Luna [...]
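
For token-by-token output, the same request can be streamed by setting stream=True and iterating over the returned events. A minimal sketch with the OpenAI client (the exact event types depend on the client and server versions):

stream = client.responses.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    input="Tell me a three sentence bedtime story about a unicorn.",
    max_output_tokens=256,
    stream=True,
)
# Each event describes an incremental update to the response
for event in stream:
    print(event)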

Image-based responses

The Responses API also supports image, audio, and video inputs. Pass them as a list of messages using the same content types as v1/chat/completions.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="<random_string>")

response = client.responses.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg",
                    }
                },
            ],
        }
    ],
    max_output_tokens=256,
    stream=False,
)
print(response.output[0].content[0].text)

The OpenAI client returns a printed string.

The image depicts an astronaut in a space suit standing on what appears to be the surface of the moon, given the barren, rocky landscape and the dark sky in the background. The astronaut is holding a large egg that has cracked open, revealing a small creature inside. The scene is imaginative and playful, combining elements of space exploration with a whimsical twist involving the egg and the creature.

Audio-based responses

import base64
import httpx
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="<random_string>")

audio_url = "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama_first_45_secs.mp3"
audio_b64 = base64.b64encode(httpx.get(audio_url, follow_redirects=True).content).decode()

response = client.responses.create(
    model="google/gemma-4-E2B-it",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this audio."},
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "mp3"}},
            ],
        }
    ],
    max_output_tokens=256,
    stream=False,
)
print(response.output[0].content[0].text)

The OpenAI client returns a printed string.

This week, I traveled to Chicago to deliver my final farewell address to the nation, following in the tradition of presidents before me. It was an opportunity to say thank you. Whether we've seen eye to eye or rarely agreed at all, my conversations with you, the American people, in living rooms and schools, at farms and on factory floors, at diners, and on distant military outposts, all these conversations are what have kept me honest.

The audio_url content type is an extension not part of the OpenAI standard and may change in future versions.

As a convenience, audio can also be passed by URL using the audio_url content type, avoiding the need for base64 encoding.

response = client.responses.create(
    model="google/gemma-4-E2B-it",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this audio."},
                {"type": "audio_url", "audio_url": {"url": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama_first_45_secs.mp3"}},
            ],
        }
    ],
    max_output_tokens=256,
    stream=False,
)

Video-based responses

The video_url content type is an extension not part of the OpenAI standard and may change in future versions.

Video processing requires torchcodec. Install it with pip install torchcodec.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="<random_string>")

response = client.responses.create(
    model="google/gemma-4-E2B-it",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/concert.mp4"}},
                {"type": "text", "text": "What is happening in the video and what is the song about?"},
            ],
        }
    ],
    max_output_tokens=256,
    stream=False,
)
print(response.output[0].content[0].text)

The OpenAI client returns a printed string.

The video captures a live music performance at a music festival or a large concert. There are several musicians on stage, including a central figure playing an acoustic guitar and singing. The foreground is filled with the backs of the audience, indicating a large crowd watching the show. The stage is dramatically lit with bright spotlights and blue and white stage lighting, with haze and smoke creating an immersive atmosphere.

The lyrics of the song are: "I don't care 'bout street, from that fresh street, 'cause there's no problem, another one I want to be, in the storm..."

v1/audio/transcriptions

The v1/audio/transcriptions endpoint transcribes audio using speech-to-text models. It follows the Audio transcription API format.

curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -F "file=@/path/to/audio.wav" \
  -F "model=openai/whisper-large-v3"

The command returns the following response.

{
  "text": "Transcribed text from the audio file"
}
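
The OpenAI Python client can also be used, assuming the endpoint mirrors OpenAI's Audio transcription API. A minimal sketch:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="<random_string>")

# Upload a local audio file and print the transcription
with open("/path/to/audio.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio_file,
    )
print(transcription.text)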

v1/models

The v1/models endpoint scans your local Hugging Face cache and returns a list of downloaded models in the OpenAI-compatible format. Third-party tools use this endpoint to discover available models.

Download a model before running transformers serve.

transformers download Qwen/Qwen2.5-0.5B-Instruct

Once downloaded, the model appears in /v1/models responses.

curl http://localhost:8000/v1/models

The endpoint returns a JSON object with available models.
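
The listing can also be fetched programmatically with the OpenAI client. A short sketch:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="<random_string>")

# Each entry corresponds to a model found in the local Hugging Face cache
for model in client.models.list():
    print(model.id)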

Loading models

The /load_model endpoint pre-loads a model and streams progress via Server-Sent Events (SSE). The transformers chat CLI uses it automatically so users see download and loading progress instead of a hanging prompt. Use it to warm up a model before sending inference requests.

Request

curl -N -X POST http://localhost:8000/load_model \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-0.5B-Instruct"}'

The model field is a Hugging Face model identifier, optionally with an @revision suffix (Qwen/Qwen2.5-0.5B-Instruct@main). Omitting the revision defaults to main.

Response

The response is an SSE stream (Content-Type: text/event-stream). Each frame is a JSON object on a data: line.

data: {"status": "loading", "model": "Qwen/Qwen2.5-0.5B-Instruct@main", "stage": "processor"}

Every event contains at minimum a status and model field. Additional fields depend on the status.

Field     Present when                  Description
--------  ----------------------------  ---------------------------------------------------------------
status    Always                        loading, ready, or error
model     Always                        Canonical model_id@revision
stage     status == "loading"           One of processor, config, download, weights (see stages below)
progress  download and weights stages   Object with current and total (integer or null)
cached    status == "ready"             true if the model was already in memory
message   status == "error"             Error description

Stages

Loading progresses through these stages in order. Some may be skipped (download is skipped when files are already cached locally).

Stage      Has progress?  Description
---------  -------------  ------------------------------------
processor  No             Loading the tokenizer/processor
config     No             Loading model configuration
download   Yes (bytes)    Downloading model files
weights    Yes (items)    Loading weight tensors into memory

The stream ends with exactly one terminal event, ready (success) or error (failure).

Timeout

transformers serve can handle requests for different models in the same session. Each model loads on demand and stays in GPU memory. Models unload automatically after 300 seconds of inactivity to free up GPU memory. Set --model-timeout to a different value in seconds, or to -1 to disable unloading entirely.

transformers serve --model-timeout 400

Loading examples

The example below shows the event stream for a freshly downloaded model. A model loaded from the local cache skips the download stage, and a model that is already in memory goes straight to a ready event with "cached": true.

data: {"status": "loading", "model": "org/model@main", "stage": "processor"}
data: {"status": "loading", "model": "org/model@main", "stage": "config"}
data: {"status": "loading", "model": "org/model@main", "stage": "download", "progress": {"current": 0, "total": 269100000}}
data: {"status": "loading", "model": "org/model@main", "stage": "download", "progress": {"current": 134600000, "total": 269100000}}
data: {"status": "loading", "model": "org/model@main", "stage": "download", "progress": {"current": 269100000, "total": 269100000}}
data: {"status": "loading", "model": "org/model@main", "stage": "weights", "progress": {"current": 1, "total": 272}}
data: {"status": "loading", "model": "org/model@main", "stage": "weights", "progress": {"current": 272, "total": 272}}
data: {"status": "ready", "model": "org/model@main", "cached": false}

Tool calling

The transformers serve server supports OpenAI-style function calling. Models trained for tool-use generate structured function calls that your application executes.

Tool calling is currently limited to the Qwen model family.

Define tools as a list of function specifications following the OpenAI format.

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="<KEY>")

tools = [
  {
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather in a location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "The city name, e.g. San Francisco"
          },
          "unit": {
            "type": "string",
            "enum": ["celsius", "fahrenheit"],
            "description": "temperature unit"
          }
        },
        "required": ["location"]
      }
    }
  }
]

Customize generation by passing GenerationConfig parameters to the extra_body argument in create.

generation_config = {
  "max_new_tokens": 512,
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": 50,
  "do_sample": True,
  "repetition_penalty": 1.1,
  "no_repeat_ngram_size": 3,
}

response = client.responses.create(
  model="Qwen/Qwen2.5-7B-Instruct",
  instructions="You are a helpful weather assistant. Use the get_weather tool to answer questions.",
  input="What's the weather like in San Francisco?",
  tools=tools,
  stream=True,
  extra_body={"generation_config": json.dumps(generation_config)}
)

for event in response:
  print(event)
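
Once your application has extracted a function call from the streamed events (a function name plus JSON-encoded arguments), it can dispatch the call locally. A hedged sketch with a hypothetical get_weather implementation; the placeholder values stand in for whatever your client parses out of the events:

def get_weather(location: str, unit: str = "celsius") -> dict:
    # Hypothetical local implementation; replace with a real weather lookup.
    return {"location": location, "temperature": 21, "unit": unit}

# Placeholders for the name and arguments extracted from the model's function call.
call_name = "get_weather"
call_arguments = '{"location": "San Francisco", "unit": "celsius"}'

if call_name == "get_weather":
    result = get_weather(**json.loads(call_arguments))
    print(result)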

Port forwarding

Port forwarding lets you serve models from a remote server. Make sure you have SSH access to the server, then run this command on your local machine.

ssh -N -f -L 8000:localhost:8000 your_server_account@your_server_IP -p port_to_ssh_into_your_server

Reproducibility

Use --force-model <repo_id> to ignore the model requested in each API call and always serve the specified model, which produces stable, repeatable runs.

transformers serve \
  --force-model Qwen/Qwen2.5-0.5B-Instruct \
  --continuous-batching \
  --dtype "bfloat16"