AI Inference API — one endpoint, every model
The InferAll AI Inference API runs prompts, images, and video generation against 190+ models across OpenAI, Anthropic, Google, NVIDIA, Replicate, and Runway. One API key, one base URL, and 110+ free open-source models included. It is a unified inference API and AI gateway in one — no GPUs, no per-provider accounts, no API aggregation glue code.
What is an AI inference API?
An AI inference API is an HTTP endpoint that runs a trained model on your input and returns the result. You send a prompt; the API returns a completion, an image, an embedding, or a video. The provider handles GPUs, model weights, batching, and scaling.
Traditionally, each AI inference provider — OpenAI, Anthropic, Google, NVIDIA, and others — ships its own API format, SDK, authentication, and bill. As soon as you want to compare models or route traffic by cost, you end up writing the same translation and aggregation layer that other teams have already built.
InferAll is that layer. The InferAll AI inference API is a single endpoint that fronts every major inference provider, with format compatibility for the OpenAI and Anthropic SDKs you already use.
Why unified inference matters
The list of credible AI inference providers grew from one to six in under two years. Each ships new models monthly. A unified inference API lets you evaluate, switch, and combine providers without rewriting integration code.
AI API aggregation does three things that direct integrations cannot: it normalizes wire formats so your code is provider-independent; it adds a routing layer that picks the cheapest model that meets your quality bar; and it gives you a single bill, a single key, and a single set of observability traces across every model you use.
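As an illustration of the routing idea above, here is a minimal sketch of "cheapest model that meets the quality bar". It is not InferAll's actual router: the `ModelOption` type, the `pick_cheapest` helper, and every price and quality score in `CATALOG` are hypothetical placeholders chosen only to make the example run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    provider: str
    model: str
    usd_per_million_tokens: float  # placeholder price, not a real quote
    quality: float                 # placeholder 0-1 score from your own evals


def pick_cheapest(options, min_quality):
    """Return the cheapest option whose quality meets the bar, or None."""
    eligible = [o for o in options if o.quality >= min_quality]
    return min(eligible, key=lambda o: o.usd_per_million_tokens, default=None)


# Hypothetical catalog: the numbers are illustrative only.
CATALOG = [
    ModelOption("nvidia", "meta/llama-3.1-70b-instruct", 0.0, 0.70),
    ModelOption("openai", "gpt-4o", 5.0, 0.85),
    ModelOption("anthropic", "claude-sonnet-4-6", 6.0, 0.90),
]

choice = pick_cheapest(CATALOG, min_quality=0.80)
print(choice.provider, choice.model)
```

With these placeholder scores, a bar of 0.80 rules out the free model and selects the cheaper of the two remaining options. Lowering the bar would route the same request to the free tier.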
If your AI gateway is just a thin proxy, you still own the routing and fallback logic. InferAll bundles the gateway, the inference API, and the aggregation layer into one product, and bills the premium providers at their published token price with zero markup.
Supported AI inference providers
NVIDIA NIM: Llama 3.1 70B, Mixtral, Nemotron, CodeLlama
OpenAI: GPT-4o, o1, GPT-4 Turbo, DALL-E 3
Anthropic: Claude Sonnet 4, Opus, Haiku
Google: Gemini 2.5 Pro, Gemini 2.5 Flash, Veo 3, Imagen 4
Replicate: Flux Pro, Stable Diffusion XL
Runway: Gen-4.5, Kling 3.0
Inference capabilities
Pricing and billing
The free tier includes 100,000 tokens per month against 110+ open-source models on NVIDIA NIM. No card is required to start: usage within that allowance costs $0, and a card is needed only for paid providers or to continue past the free allowance. Llama 3.1 70B, Mixtral, Nemotron, CodeLlama, and the rest of the open-source roster are entirely free up to that limit.
The Pro plan is $29/month and includes 2M tokens plus access to every premium provider (OpenAI, Anthropic, Google, Replicate, Runway), billed at the provider's standard token price with zero markup. Usage beyond the included tokens is pay-as-you-go. Team and Enterprise plans add more tokens, multiple keys, priority routing, and custom model hosting.
Every request you make across every provider lands on a single bill. See full pricing.
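To make the Pro plan arithmetic concrete, the sketch below estimates one month's bill. The $29 fee and the 2M-token allowance come from the text above. The `estimate_pro_bill` helper, the single blended overage rate, and the $3.00 figure are assumptions for illustration. Real overage depends on which provider and model served each request.

```python
PRO_MONTHLY_USD = 29.00          # from the pricing text
PRO_INCLUDED_TOKENS = 2_000_000  # from the pricing text


def estimate_pro_bill(tokens_used, usd_per_million_overage):
    """Estimate one month on Pro: flat fee plus pay-as-you-go overage.

    Assumes overage is (tokens beyond the allowance) times one blended
    per-million-token rate, which is a simplification.
    """
    overage_tokens = max(0, tokens_used - PRO_INCLUDED_TOKENS)
    overage_usd = overage_tokens / 1_000_000 * usd_per_million_overage
    return round(PRO_MONTHLY_USD + overage_usd, 2)


# Hypothetical blended rate of $3.00 per million tokens.
print(estimate_pro_bill(1_500_000, 3.00))  # inside the allowance
print(estimate_pro_bill(5_000_000, 3.00))  # 3M tokens of overage
```

Under these assumptions, the first month costs the flat $29.00 and the second costs $38.00 ($29 plus 3M overage tokens at $3.00 per million).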
Code examples
The inference API is wire-compatible with the OpenAI Chat Completions and Anthropic Messages formats. Point your existing SDK at api.inferall.ai and it just works.
curl
# Inference via the unified AI inference API
# (-N turns off curl's output buffering so streamed tokens print as they arrive)
curl -N https://api.inferall.ai/ai/v1/generate \
  -H "Authorization: Bearer ifu_..." \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "nvidia",
    "model": "meta/llama-3.1-70b-instruct",
    "messages": [{"role":"user","content":"What is an AI inference API?"}],
    "stream": true
  }'
Python (OpenAI SDK)
# Python: works with the OpenAI SDK, no rewrite required
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferall.ai/v1",
    api_key="ifu_...",
)

# Run inference on a free NVIDIA-hosted model
resp = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Summarize unified inference APIs."}],
    stream=True,
)

# Print each streamed token as it arrives
for chunk in resp:
    if chunk.choices:  # skip any chunk that carries no choices
        print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
TypeScript (Anthropic SDK)
// TypeScript / Node: Anthropic SDK pointed at InferAll
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  baseURL: "https://api.inferall.ai",
  apiKey: process.env.INFERALL_API_KEY,
});

const message = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Compare AI inference providers." }],
});
Full reference for every endpoint, model, and parameter is in the documentation.
Latency and reliability
The InferAll gateway is deployed on Fly.io across global edge regions. Routing overhead is 10-30 ms on top of the provider's own inference latency. Time-to-first-token on free NVIDIA-hosted Llama 3.1 70B is typically 200-400 ms; premium providers behave the same through InferAll as they do directly.
Reliability comes from the aggregation layer. If a provider rate-limits, errors, or times out (30 s by default), the gateway automatically retries the same request against the next-cheapest provider that can serve the model class. A single 5xx from one provider no longer takes down your application.
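The fallback behavior described above can be sketched in a few lines. This is an illustration of the pattern, not the gateway's implementation. `ProviderError`, `call_with_fallback`, and the two stand-in providers (`flaky`, `healthy`) are hypothetical names, and the candidate list is assumed to be pre-sorted cheapest-first.

```python
class ProviderError(Exception):
    """Hypothetical stand-in for a rate limit, 5xx response, or similar failure."""


def call_with_fallback(candidates, request, timeout_s=30.0):
    """Try each (name, call) pair in order and return the first success.

    candidates is assumed to be sorted cheapest-first; each call takes
    (request, timeout_s) and either returns a reply or raises.
    """
    errors = []
    for name, call in candidates:
        try:
            return name, call(request, timeout_s)
        except (ProviderError, TimeoutError) as exc:
            errors.append(f"{name}: {exc}")  # remember why this one failed
    raise ProviderError("all providers failed: " + "; ".join(errors))


def flaky(request, timeout_s):
    # Simulates a provider that is currently rate-limiting.
    raise ProviderError("429 rate limited")


def healthy(request, timeout_s):
    # Simulates a provider that answers normally.
    return f"echo: {request}"


served_by, reply = call_with_fallback([("cheap", flaky), ("backup", healthy)], "hi")
print(served_by, reply)
```

Here the first candidate fails and the request is served by the second. If every candidate fails, the caller gets one error that lists each provider's failure reason.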
Context-size-aware routing handles requests from 128K to 1M+ tokens by selecting a provider whose model can actually fit the prompt — so you don't have to maintain that logic yourself.
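A minimal sketch of context-size-aware selection, assuming each candidate model advertises a context window in tokens and that the list is ordered cheapest-first. The model names, the `route_by_context` helper, and the rule that prompt plus requested output must fit in the window are illustrative assumptions, not the gateway's documented logic.

```python
def fits(prompt_tokens, max_output_tokens, context_window):
    """True if the prompt plus the requested output fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


def route_by_context(models, prompt_tokens, max_output_tokens=1024):
    """Return the first model (assumed cheapest-first) that can fit the request."""
    for name, context_window in models:
        if fits(prompt_tokens, max_output_tokens, context_window):
            return name
    return None  # nothing can fit this prompt


# Hypothetical model names with 128K and 1M token windows.
MODELS = [("small-128k", 128_000), ("large-1m", 1_000_000)]

print(route_by_context(MODELS, 50_000))
print(route_by_context(MODELS, 127_500))
```

The second call shows why the output budget matters: a 127,500-token prompt fits in 128K on its own, but not once 1,024 output tokens are reserved, so it is routed to the larger model.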
Frequently asked questions
What is an AI inference API?
An AI inference API is an HTTP endpoint that runs a trained model on your input and returns the result. Instead of provisioning GPUs and loading model weights yourself, you POST a prompt (or image, or audio) to the API and get a response. InferAll provides a unified AI inference API that routes requests to OpenAI, Anthropic, Google, NVIDIA, Replicate, or Runway — whichever provider hosts the model you asked for.
What is the difference between an AI inference API and an AI gateway?
An AI inference API exposes individual models for inference. An AI gateway sits in front of one or more inference APIs and adds routing, fallback, authentication, and billing. InferAll is both: it is an AI gateway that aggregates 6 inference providers behind a single inference API.
Which AI inference providers does InferAll support?
InferAll aggregates 6 AI inference providers: OpenAI (GPT-4o, o1, DALL-E 3), Anthropic (Claude Sonnet 4, Opus, Haiku), Google (Gemini 2.5, Veo 3, Imagen 4), NVIDIA NIM (110+ free open-source models including Llama 3.1 70B and Mixtral), Replicate (Flux, Stable Diffusion), and Runway (Gen-4.5, Kling 3.0). Switching providers is a parameter change, not a code change.
How much does the inference API cost?
110+ open-source models on NVIDIA NIM are free up to 100,000 tokens per month. Premium providers (OpenAI, Anthropic, Google, Replicate, Runway) are billed at the provider's published token price with zero markup. The $29/month Pro plan includes 2M tokens and access to every model, including image and video generation.
Does the inference API support streaming and tool use?
Yes. The inference API supports Server-Sent Events (SSE) streaming for real-time token output, function calling / tool use translated across providers, and vision inputs. Format compatibility with the OpenAI and Anthropic SDKs means existing streaming and tool-use code works without modification.
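If you read the SSE stream without an SDK, the sketch below shows one way to extract text from it. It assumes the stream follows the OpenAI Chat Completions chunk shape (`data: {...}` lines ending with `data: [DONE]`), which matches the format compatibility described above. The `iter_sse_text` helper and the sample lines are hypothetical.

```python
import json


def iter_sse_text(lines):
    """Yield text deltas from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample stream for illustration.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]

print("".join(iter_sse_text(sample)))  # prints "Hello"
```

In practice the OpenAI and Anthropic SDKs do this parsing for you. The sketch is only useful when you consume the raw HTTP response directly.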
What is the latency of the AI inference API?
InferAll's gateway runs on Fly.io with global edge regions, adding 10-30 ms of routing overhead on top of provider latency. Time-to-first-token on free NVIDIA-hosted models is typically 200-400 ms; premium providers depend on the underlying API. Automatic fallback retries with the next provider if the primary times out (30s default).
Can I use this inference API with OpenAI or Anthropic SDKs?
Yes. The inference API is wire-compatible with the OpenAI Chat Completions format and the Anthropic Messages format. Set OPENAI_BASE_URL=https://api.inferall.ai/v1 or ANTHROPIC_BASE_URL=https://api.inferall.ai with your InferAll API key, and existing applications, Claude Code, Cursor, Continue, Cline, and Aider all work unchanged.
Related reading
Unified AI API — one key for OpenAI, Claude, Gemini, and Llama.
LLM API aggregator — 190+ models across 6 providers, one endpoint.
AI model gateway — intelligent routing with automatic provider fallback.