Cerebras Inference — Wafer-scale chips → fastest open-model inference (often >2,000 tok/s).
Free tier: 30 RPM, ~1M tokens/day across Llama 3.1/3.3/4 models. Very fast — Cerebras claims the highest tok/s on Llama 70B.
API is OpenAI-compatible — point your SDK at https://api.cerebras.ai/v1.
Models: llama-3.3-70b, llama-3.1-8b, llama-4-scout-17b-16e-instruct. Pay-as-you-go; Llama 3.3 70B at $0.85/1M input tokens, $1.20/1M output. Enterprise tier available.
Reserved capacity + dedicated wafer-scale instances.
Base URL: https://api.cerebras.ai/v1
OpenAI-compatible — works with the OpenAI SDK by overriding base_url.
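As a minimal sketch of that wiring — stdlib only, no SDK required — the same OpenAI-style chat request can be assembled with urllib; the helper name `build_chat_request` is my own, not part of any Cerebras library:

```python
import json
import os
import urllib.request

API_BASE = "https://api.cerebras.ai/v1"

def build_chat_request(model: str, messages: list, api_key: str) -> urllib.request.Request:
    """Build an OpenAI-style POST /chat/completions request (not yet sent)."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

if __name__ == "__main__":
    req = build_chat_request(
        "llama-3.3-70b",
        [{"role": "user", "content": "Hello in 5 words"}],
        os.environ.get("CEREBRAS_API_KEY", ""),
    )
    # with urllib.request.urlopen(req) as resp:   # uncomment to actually send
    #     print(json.load(resp)["choices"][0]["message"]["content"])
    print(req.full_url)
```

The same shape works with the official OpenAI SDK by passing `base_url=API_BASE` and your Cerebras key at client construction.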
curl https://api.cerebras.ai/v1/chat/completions \
  -H "Authorization: Bearer $CEREBRAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Hello in 5 words"}]
  }'
Best for: real-time UX where latency dominates — coding copilots, voice agents, live-transcription summarization.
Not for: vision or multimodal tasks, or workloads that require closed models.
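For the latency-sensitive cases above you would normally stream tokens (`"stream": true` in the request body, standard in OpenAI-compatible APIs) rather than wait for the full completion. A minimal parser for the resulting `data:` server-sent-event lines; the sample stream below is illustrative, not captured from the API:

```python
import json

def extract_deltas(sse_text: str) -> str:
    """Concatenate content deltas from OpenAI-style streaming SSE lines."""
    out = []
    for line in sse_text.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":          # end-of-stream sentinel
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        out.append(delta.get("content") or "")
    return "".join(out)

# Illustrative chunks in the OpenAI streaming shape:
sample = """\
data: {"choices":[{"delta":{"content":"Hello"}}]}
data: {"choices":[{"delta":{"content":" world"}}]}
data: [DONE]
"""
print(extract_deltas(sample))  # -> Hello world
```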
No credit card is required to sign up.
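With the free tier capped at 30 RPM, client-side pacing avoids 429 errors; a minimal sketch (the `Pacer` class is my own, not part of any Cerebras SDK):

```python
import time

class Pacer:
    """Enforce a minimum interval between requests (30 RPM -> one every 2 s)."""

    def __init__(self, rpm: int):
        self.min_interval = 60.0 / rpm
        self._last = 0.0

    def wait(self) -> float:
        """Sleep just long enough to respect the rate; return seconds slept."""
        now = time.monotonic()
        delay = max(0.0, self._last + self.min_interval - now)
        if delay:
            time.sleep(delay)
        self._last = time.monotonic()
        return delay

pacer = Pacer(rpm=30)
print(pacer.min_interval)  # -> 2.0
# Call pacer.wait() before each request in your loop.
```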
Most commonly used: llama-3.3-70b, llama-3.1-8b, llama-4-scout-17b-16e-instruct. The full current list is on the Cerebras Inference docs page.
Yes — point the OpenAI SDK's base URL at `https://api.cerebras.ai/v1` and pass your Cerebras Inference API key.
Pay-as-you-go at the rates above, with an enterprise tier available. If your traffic is bursty or seasonal, the free tier may be enough; if you need a guaranteed SLA, upgrade.
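At those rates, a back-of-envelope cost check is just arithmetic; the helper below is illustrative:

```python
# Pay-as-you-go rates for Llama 3.3 70B (USD per 1M tokens).
PRICE_IN, PRICE_OUT = 0.85, 1.20

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated monthly spend for a given token volume."""
    return input_tokens / 1e6 * PRICE_IN + output_tokens / 1e6 * PRICE_OUT

# Example: 10M input + 2M output tokens in a month.
print(round(cost_usd(10_000_000, 2_000_000), 2))  # -> 10.9
```

Useful for comparing against the free tier's ~1M tokens/day before committing to paid usage.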