Cerebras Inference — Wafer-scale chips → fastest open-model inference (often >2,000 tok/s).
Free tier: 30 RPM, ~1M tokens/day across Llama 3.1/3.3/4 models. Very fast — Cerebras claims the highest tok/s on Llama 70B.
API is OpenAI-compatible — point your SDK at https://api.cerebras.ai/v1.
Models: llama-3.3-70b, llama-3.1-8b, llama-4-scout-17b-16e-instruct. Pay-as-you-go; Llama 3.3 70B at $0.85/1M input tokens, $1.20/1M output. Enterprise tier available.
Reserved capacity + dedicated wafer-scale instances.
Base URL: https://api.cerebras.ai/v1
OpenAI-compatible — works with the OpenAI SDK by overriding base_url.
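As a minimal sketch of that wiring — stdlib only, no SDK required — the same OpenAI-style chat request can be assembled with urllib; the helper name `build_chat_request` is my own, not part of any Cerebras library:

```python
import json
import os
import urllib.request

API_BASE = "https://api.cerebras.ai/v1"

def build_chat_request(model: str, messages: list, api_key: str) -> urllib.request.Request:
    """Build an OpenAI-style POST /chat/completions request (not yet sent)."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

if __name__ == "__main__":
    req = build_chat_request(
        "llama-3.3-70b",
        [{"role": "user", "content": "Hello in 5 words"}],
        os.environ.get("CEREBRAS_API_KEY", ""),
    )
    # with urllib.request.urlopen(req) as resp:   # uncomment to actually send
    #     print(json.load(resp)["choices"][0]["message"]["content"])
    print(req.full_url)
```

The same shape works with the official OpenAI SDK by passing `base_url=API_BASE` and your Cerebras key at client construction.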
curl https://api.cerebras.ai/v1/chat/completions \
  -H "Authorization: Bearer $CEREBRAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Hello in 5 words"}]
  }'
Best for: real-time UX where latency dominates — coding copilots, voice agents, live-transcription summarization.
Not for: vision or multimodal tasks, or workloads that require closed models.
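For the latency-sensitive cases above you would normally stream tokens (`"stream": true` in the request body, standard in OpenAI-compatible APIs) rather than wait for the full completion. A minimal parser for the resulting `data:` server-sent-event lines; the sample stream below is illustrative, not captured from the API:

```python
import json

def extract_deltas(sse_text: str) -> str:
    """Concatenate content deltas from OpenAI-style streaming SSE lines."""
    out = []
    for line in sse_text.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":          # end-of-stream sentinel
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        out.append(delta.get("content") or "")
    return "".join(out)

# Illustrative chunks in the OpenAI streaming shape:
sample = """\
data: {"choices":[{"delta":{"content":"Hello"}}]}
data: {"choices":[{"delta":{"content":" world"}}]}
data: [DONE]
"""
print(extract_deltas(sample))  # -> Hello world
```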
No credit card is required to sign up.
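With the free tier capped at 30 RPM, client-side pacing avoids 429 errors; a minimal sketch (the `Pacer` class is my own, not part of any Cerebras SDK):

```python
import time

class Pacer:
    """Enforce a minimum interval between requests (30 RPM -> one every 2 s)."""

    def __init__(self, rpm: int):
        self.min_interval = 60.0 / rpm
        self._last = 0.0

    def wait(self) -> float:
        """Sleep just long enough to respect the rate; return seconds slept."""
        now = time.monotonic()
        delay = max(0.0, self._last + self.min_interval - now)
        if delay:
            time.sleep(delay)
        self._last = time.monotonic()
        return delay

pacer = Pacer(rpm=30)
print(pacer.min_interval)  # -> 2.0
# Call pacer.wait() before each request in your loop.
```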
Most commonly used: llama-3.3-70b, llama-3.1-8b, llama-4-scout-17b-16e-instruct. The full current list is on the Cerebras Inference docs page.
Yes — point the OpenAI SDK's base URL at `https://api.cerebras.ai/v1` and pass your Cerebras Inference API key.
Pay-as-you-go at the rates above, with an enterprise tier available. If your traffic is bursty or seasonal, the free tier may be enough; if you need a guaranteed SLA, upgrade.
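At those rates, a back-of-envelope cost check is just arithmetic; the helper below is illustrative:

```python
# Pay-as-you-go rates for Llama 3.3 70B (USD per 1M tokens).
PRICE_IN, PRICE_OUT = 0.85, 1.20

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated monthly spend for a given token volume."""
    return input_tokens / 1e6 * PRICE_IN + output_tokens / 1e6 * PRICE_OUT

# Example: 10M input + 2M output tokens in a month.
print(round(cost_usd(10_000_000, 2_000_000), 2))  # -> 10.9
```

Useful for comparing against the free tier's ~1M tokens/day before committing to paid usage.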