Self-Serve
Pay as you go when using Inference Endpoints
Easily deploy your AI models to production on our fully managed platform. Instead of spending weeks configuring infrastructure, focus on building your AI application.
No Hugging Face account? Sign up
These teams are running AI models on Inference Endpoints
Import your favorite model from Hugging Face or browse our catalog of hand-picked, ready-to-deploy models!
ggml-org
gemma-4-26B-A4B-it-GGUF
gemma-4-31B-it
gemma-4-26B-A4B-it
unsloth
Qwen3.5-9B-GGUF
unsloth
Qwen3.5-35B-A3B-GGUF
mistralai
Mistral-Small-4-119B-2603
Fully managed infrastructure, autoscaling, and built-in observability — so you can focus on your model, not the ops.
Don't worry about Kubernetes, CUDA versions, or configuring VPNs. Focus on deploying your model and serving customers.
Automatically scales up as traffic increases and down as it decreases to save on compute costs.
Understand and debug your model through comprehensive logs & metrics.
Deploy with vLLM, TGI, SGLang, TEI, or custom containers.
Download model weights fast and securely with seamless Hugging Face Hub integration.
Stay current with the latest frameworks and optimizations without managing complex upgrades.
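To make the autoscaling idea above concrete, here is a minimal sketch of how a replica count could follow traffic. It is an illustration only, not how Inference Endpoints actually decides when to scale: the function name `desired_replicas`, the per-replica capacity figure, and the replica bounds are all hypothetical.

```python
import math


def desired_replicas(requests_per_second: float,
                     capacity_per_replica: float,
                     min_replicas: int = 0,
                     max_replicas: int = 4) -> int:
    """Return the replica count needed for the given load, clamped to the bounds.

    All names and defaults are hypothetical; a real autoscaler would also
    smooth the signal over time to avoid flapping.
    """
    if capacity_per_replica <= 0:
        raise ValueError("capacity_per_replica must be positive")
    # Round up so a partially loaded replica still counts as a full one.
    needed = math.ceil(max(requests_per_second, 0.0) / capacity_per_replica)
    # Never go below the floor or above the ceiling.
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    # Assumed capacity: each replica handles 20 requests per second.
    for load in (0, 5, 45, 500):
        print(load, desired_replicas(load, capacity_per_replica=20))
```

With a minimum of zero replicas, idle periods cost nothing in this model, which is the cost-saving behaviour the feature list describes; the maximum caps spend during traffic spikes.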
Deploy with TEI, vLLM, SGLang, llama.cpp or bring your own custom container — all with zero infrastructure overhead.
Start with pay-as-you-go pricing, or scale up with a tailored enterprise contract — only pay for the compute you actually use.
Get a custom quote and premium support
“The coolest thing was how easy it was to define a complete custom interface from the model to the inference process. It just took us a couple of hours to adapt our code, and have a functioning and totally custom endpoint.”
Andrea Boscarino, Data Scientist at Musixmatch
“It took off a week's worth of developer time. Thanks to Inference Endpoints, we now basically spend all of our time on R&D, not fiddling with AWS. If you haven't already built a robust, performant, fault-tolerant system for inference, then it's pretty much a no-brainer.”
Bryce Harlan, Senior Software Engineer at Phamily
“We were able to choose an off-the-shelf model that's very common for our customers and set it to handle over 100 requests per second just with a few button clicks. A new standard for easily building your first vector embedding based solution, whether it be semantic search or question answering system.”
Gareth Jones, Senior Product Manager at Pinecone
“You're bringing the potential time delta between testing and production down to potentially less than a day. I've never seen anything that could do this before. I could have it on infrastructure ready to support an existing product.”
Nathan Labenz, Founder at Waymark
Join thousands of developers and teams using Inference Endpoints to deploy their AI models at scale. Start building today with our simple, secure, and scalable infrastructure.