Hugging Face Inference — 300k+ open-source models, free serverless inference.
Free tier: free serverless inference with strict per-account quotas; models can cold-start. Best for prototyping, not production.
API is OpenAI-compatible — point your SDK at https://api-inference.huggingface.co.
Popular models: meta-llama/Llama-3.3-70B-Instruct, mistralai/Mistral-Nemo-Instruct-2407, Qwen/Qwen2.5-72B-Instruct, deepseek-ai/DeepSeek-R1-Distill-Llama-70B. Paid tiers: Pro plan ($9/mo) → 20× rate limit; HF Inference Endpoints (dedicated) start at ~$0.06/hour for CPU, ~$0.60/hour for a small GPU.
Enterprise: HF Enterprise Hub, private model hosting, SOC 2.
Base URL: https://api-inference.huggingface.co
OpenAI-compatible — works with the OpenAI SDK by overriding base_url.
curl https://api-inference.huggingface.co/models/meta-llama/Llama-3.3-70B-Instruct/v1/chat/completions \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello in 5 words"}]
  }'
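The same request from Python, using the OpenAI SDK with base_url overridden. A minimal sketch: it assumes the per-model .../v1 path from the curl example above works as the SDK base URL and that HF_TOKEN is set in the environment.

import os
from openai import OpenAI

# Point the OpenAI SDK at the model-specific endpoint; the token is a
# Hugging Face access token, not an OpenAI key.
client = OpenAI(
    base_url="https://api-inference.huggingface.co/models/meta-llama/Llama-3.3-70B-Instruct/v1",
    api_key=os.environ["HF_TOKEN"],
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello in 5 words"}],
)
print(response.choices[0].message.content)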
Good for: trying any of the 300k+ HF models without setup, and niche or fine-tuned models that other providers do not host.
Not good for: production traffic on the free tier, where cold starts and quotas hurt UX.
The free serverless tier has strict per-account quotas and models can cold-start, so it suits prototyping rather than production. No credit card is required to sign up.
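On a cold model the free tier typically returns a temporary error (commonly HTTP 503) until the model has finished loading. A minimal retry sketch, assuming that behavior; the attempt count and delays below are illustrative, not documented limits.

import os
import time
import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-3.3-70B-Instruct/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

def chat_with_retry(messages, attempts=5, base_delay=5.0):
    """Retry while the model is cold-starting (assumed to surface as HTTP 503)."""
    payload = {"model": "meta-llama/Llama-3.3-70B-Instruct", "messages": messages}
    for attempt in range(attempts):
        resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=120)
        if resp.status_code == 503:
            # Model is still loading; back off and try again.
            time.sleep(base_delay * (attempt + 1))
            continue
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
    raise RuntimeError("Model did not become available in time")

print(chat_with_retry([{"role": "user", "content": "Hello in 5 words"}]))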
Most commonly used: meta-llama/Llama-3.3-70B-Instruct, mistralai/Mistral-Nemo-Instruct-2407, Qwen/Qwen2.5-72B-Instruct, deepseek-ai/DeepSeek-R1-Distill-Llama-70B. The full current list is on Hugging Face Inference's docs page.
Yes, it is OpenAI-compatible: point the OpenAI SDK's base_url at `https://api-inference.huggingface.co` and pass your Hugging Face token as the API key.
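A streaming variant of the SDK sketch above, assuming the OpenAI-compatible endpoint also honors the standard stream parameter.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api-inference.huggingface.co/models/meta-llama/Llama-3.3-70B-Instruct/v1",
    api_key=os.environ["HF_TOKEN"],
)

# Print tokens as they arrive instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello in 5 words"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()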
The Pro plan ($9/mo) raises the rate limit 20×. Dedicated HF Inference Endpoints start at ~$0.06/hour for CPU and ~$0.60/hour for a small GPU. If your traffic is bursty or seasonal, the free tier may be enough; if you need a guaranteed SLA, upgrade to a dedicated endpoint.