
AI Inference API — one endpoint, every model

The InferAll AI Inference API runs prompts, images, and video generation against 190+ models across OpenAI, Anthropic, Google, NVIDIA, Replicate, and Runway. One API key, one base URL, and 110+ free open-source models included. It is a unified inference API and AI gateway in one — no GPUs, no per-provider accounts, no API aggregation glue code.


What is an AI inference API?

An AI inference API is an HTTP endpoint that runs a trained model on your input and returns the result. You send a prompt; the API returns a completion, an image, an embedding, or a video. The provider handles GPUs, model weights, batching, and scaling.

Traditionally, each AI inference provider — OpenAI, Anthropic, Google, NVIDIA, and others — ships its own API format, SDK, authentication, and bill. As soon as you want to compare models or route traffic by cost, you end up writing the same translation and aggregation layer that other teams have already built.

InferAll is that layer. The InferAll AI inference API is a single endpoint that fronts every major inference provider, with format compatibility for the OpenAI and Anthropic SDKs you already use.

Why unified inference matters

The list of credible AI inference providers grew from one to six in under two years. Each ships new models monthly. A unified inference API lets you evaluate, switch, and combine providers without rewriting integration code.

AI API aggregation does three things that direct integrations cannot: it normalizes wire formats so your code is provider-independent; it adds a routing layer that picks the cheapest model that meets your quality bar; and it gives you a single bill, a single key, and a single set of observability traces across every model you use.
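
The routing point above can be illustrated with a minimal sketch. The model names, prices, and quality scores here are hypothetical placeholders, not InferAll's catalog or its actual routing algorithm; the sketch only shows what "cheapest model that meets your quality bar" means in code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelOption:
    name: str                      # hypothetical model identifier
    usd_per_million_tokens: float  # hypothetical price
    quality_score: float           # hypothetical 0-1 score from your own evals


def pick_cheapest(options: list[ModelOption], min_quality: float) -> Optional[ModelOption]:
    """Return the lowest-priced option whose quality meets the bar, or None."""
    eligible = [o for o in options if o.quality_score >= min_quality]
    return min(eligible, key=lambda o: o.usd_per_million_tokens, default=None)


CATALOG = [
    ModelOption("small-open-model", 0.0, 0.62),
    ModelOption("mid-tier-model", 3.0, 0.81),
    ModelOption("frontier-model", 15.0, 0.93),
]

choice = pick_cheapest(CATALOG, min_quality=0.80)
print(choice.name if choice else "no model meets the bar")
```

With a direct integration, this selection logic and the per-provider clients behind it live in your own codebase; an aggregation layer moves both behind one endpoint.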

If your AI gateway is just a thin proxy, you still own the routing and fallback logic. InferAll bundles the gateway, the inference API, and the aggregation layer into one product, and bills the premium providers at their published token price with zero markup.

Supported AI inference providers

NVIDIA NIM: 110+ models, free

Llama 3.1 70B, Mixtral, Nemotron, CodeLlama

OpenAI: 15+ models

GPT-4o, o1, GPT-4 Turbo, DALL-E 3

Anthropic: 10+ models

Claude Sonnet 4, Opus, Haiku

Google: 38 models

Gemini 2.5 Pro, Gemini 2.5 Flash, Veo 3, Imagen 4

Replicate: image models

Flux Pro, Stable Diffusion XL

Runway: video models

Gen-4.5, Kling 3.0

Inference capabilities

Text and chat: Completions and multi-turn conversations with any LLM
Streaming: Server-sent events with real-time token output
Tool use: Function calling translated across providers automatically
Vision: Image understanding for screenshots, documents, photos
Image generation: DALL-E 3, Flux, Stable Diffusion via simple API
Video generation: Runway Gen-4.5, Kling 3.0, Google Veo 3
Code generation: Specialized code models including CodeLlama and GPT-4o
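
To make the tool-use entry concrete, here is a rough sketch of the kind of translation involved: converting a tool definition from the OpenAI Chat Completions shape to the Anthropic Messages shape. The function name `openai_tool_to_anthropic` and the `get_weather` tool are hypothetical, and this is not InferAll's implementation; it only shows that the two public formats carry the same JSON Schema under different keys.

```python
from typing import Any


def openai_tool_to_anthropic(tool: dict[str, Any]) -> dict[str, Any]:
    """Map an OpenAI-style function tool to an Anthropic-style tool definition."""
    if tool.get("type") != "function":
        raise ValueError("only function tools are handled in this sketch")
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        # OpenAI nests the JSON Schema under "parameters";
        # Anthropic expects it under "input_schema".
        "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
    }


# Hypothetical tool definition in the OpenAI Chat Completions shape
openai_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

anthropic_tool = openai_tool_to_anthropic(openai_tool)
print(anthropic_tool["name"], sorted(anthropic_tool))
```

A full translation layer would also have to map tool calls and tool results in the message history, which this sketch leaves out.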

Pricing and billing

The free tier includes 100,000 tokens per month against 110+ open-source models on NVIDIA NIM. No card is required to start: you're charged $0 within that allowance, and a card is needed only for paid providers or to continue once the free allowance is used up. Llama 3.1 70B, Mixtral, Nemotron, CodeLlama, and the rest of the open-source roster are entirely free up to that limit.

The Pro plan is $29/month and includes 2M tokens plus access to every premium provider — OpenAI, Anthropic, Google, Replicate, Runway — billed at the provider's standard token price with zero markup. Pay-as-you-go beyond the included tokens. Team and Enterprise plans add more tokens, multiple keys, priority routing, and custom model hosting.
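
As a back-of-the-envelope illustration of the Pro plan arithmetic, the sketch below estimates a monthly bill. The $29 base and 2M included tokens come from the plan description above; the overage rate is a hypothetical parameter you supply, since real pay-as-you-go cost depends on each provider's token price and on how usage is split across models.

```python
PRO_BASE_USD = 29.0              # from the plan description
PRO_INCLUDED_TOKENS = 2_000_000  # from the plan description


def estimate_pro_bill(tokens_used: int, overage_usd_per_million: float) -> float:
    """Estimate a monthly Pro bill, assuming one blended overage rate (a simplification)."""
    if tokens_used < 0:
        raise ValueError("tokens_used must be non-negative")
    overage_tokens = max(0, tokens_used - PRO_INCLUDED_TOKENS)
    overage_cost = overage_tokens / 1_000_000 * overage_usd_per_million
    return round(PRO_BASE_USD + overage_cost, 2)


# Hypothetical blended rate of $5 per million tokens beyond the allowance
print(estimate_pro_bill(1_500_000, overage_usd_per_million=5.0))  # within the allowance
print(estimate_pro_bill(3_000_000, overage_usd_per_million=5.0))  # 1M tokens over
```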

Every request you make across every provider lands on a single bill. See full pricing.

Code examples

The inference API is wire-compatible with the OpenAI Chat Completions and Anthropic Messages formats. Point your existing SDK at api.inferall.ai and it just works.

curl

# Inference via the unified AI inference API
# -N turns off curl's output buffering so streamed tokens print as they arrive
curl -N https://api.inferall.ai/ai/v1/generate \
  -H "Authorization: Bearer ifu_..." \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "nvidia",
    "model": "meta/llama-3.1-70b-instruct",
    "messages": [{"role":"user","content":"What is an AI inference API?"}],
    "stream": true
  }'
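
Because the request above sets "stream": true, the response arrives incrementally as server-sent events. The sketch below parses such a stream, assuming the events follow the OpenAI-style convention of `data: {json}` lines ending with `data: [DONE]`. The exact event shape returned by this endpoint is not specified on this page, so treat the field names as assumptions and check the documentation.

```python
import json
from typing import Iterable, Iterator


def iter_sse_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines (assumed event shape)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # assumed end-of-stream sentinel
        event = json.loads(payload)
        for choice in event.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Canned lines standing in for a real HTTP response body
sample = [
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":", world"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_sse_text(sample)))
```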

Python (OpenAI SDK)

# Python — works with the OpenAI SDK, no rewrite required
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferall.ai/v1",
    api_key="ifu_...",
)

# Run inference on a free NVIDIA-hosted model
resp = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Summarize unified inference APIs."}],
    stream=True,
)
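for chunk in resp:
    # Some chunks can arrive with an empty choices list, so guard before indexing
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)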

TypeScript (Anthropic SDK)

// TypeScript / Node — Anthropic SDK pointed at InferAll
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  baseURL: "https://api.inferall.ai",
  apiKey: process.env.INFERALL_API_KEY,
});

const message = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Compare AI inference providers." }],
});

// Print the returned content blocks
console.log(message.content);

Full reference for every endpoint, model, and parameter is in the documentation.

Latency and reliability

The InferAll gateway is deployed on Fly.io across global edge regions. Routing overhead is 10-30 ms on top of the provider's own inference latency. Time-to-first-token on free NVIDIA-hosted Llama 3.1 70B is typically 200-400 ms; premium providers behave the same through InferAll as they do directly.
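
If you want to check time-to-first-token against your own workload, a small timing helper is enough. This is a generic sketch, not an InferAll tool: `open_stream` is a hypothetical callable that starts the request and returns an iterable of text chunks, and `fake_stream` stands in for a real streaming call.

```python
import time
from typing import Callable, Iterable, Optional


def time_to_first_token(open_stream: Callable[[], Iterable[str]]) -> Optional[float]:
    """Seconds from starting the request until the first non-empty chunk, or None."""
    start = time.perf_counter()
    for chunk in open_stream():
        if chunk:
            return time.perf_counter() - start
    return None


def fake_stream() -> Iterable[str]:
    """Stand-in for a streaming call: waits 50 ms, then yields chunks."""
    time.sleep(0.05)
    yield "Hello"
    yield " world"


ttft = time_to_first_token(fake_stream)
print(f"time to first token: {ttft * 1000:.0f} ms")
```

Passing a callable rather than an already-open stream keeps the request setup inside the timed region, which is where gateway routing overhead would show up.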

Reliability comes from the aggregation layer. If a provider rate-limits, errors, or times out (30s default), the gateway automatically retries the same request against the next cheapest provider that can serve the model class. A single 5xx no longer takes down your application.
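
The fallback behavior described above can be sketched as a loop over providers ordered by cost. Everything here is illustrative: `ProviderError`, `call_with_fallback`, and the two stand-in providers are hypothetical names, and the gateway's real retry policy is not documented on this page.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Hypothetical error standing in for a rate limit, 5xx, or timeout."""


def call_with_fallback(
    providers: Sequence[tuple[str, Callable[[str], str]]],
    prompt: str,
) -> tuple[str, str]:
    """Try providers in order (assumed cheapest first); return (name, result) of the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")  # remember the failure, move to the next provider
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def flaky(prompt: str) -> str:
    raise ProviderError("503 from upstream")


def healthy(prompt: str) -> str:
    return f"echo: {prompt}"


print(call_with_fallback([("cheapest", flaky), ("next-cheapest", healthy)], "hi"))
```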

Context-size-aware routing handles requests from 128K to 1M+ tokens by selecting a provider whose model can actually fit the prompt — so you don't have to maintain that logic yourself.
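
A minimal sketch of context-size-aware selection, under stated assumptions: the route names, context windows, and prices are hypothetical, and `reserved_output_tokens` is an illustrative parameter for leaving room for the reply. It shows the general idea of filtering by context window before choosing on price, not the gateway's actual logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    model: str                     # hypothetical model identifier
    context_window: int            # maximum tokens the model accepts
    usd_per_million_tokens: float  # hypothetical price


def route_by_context(
    routes: list[Route],
    prompt_tokens: int,
    reserved_output_tokens: int = 1024,
) -> Optional[Route]:
    """Pick the cheapest route whose context window fits the prompt plus the reply budget."""
    needed = prompt_tokens + reserved_output_tokens
    fitting = [r for r in routes if r.context_window >= needed]
    return min(fitting, key=lambda r: r.usd_per_million_tokens, default=None)


ROUTES = [
    Route("model-128k", 128_000, 1.0),
    Route("model-1m", 1_000_000, 4.0),
]

for tokens in (50_000, 300_000):
    picked = route_by_context(ROUTES, tokens)
    print(tokens, "->", picked.model if picked else "no route fits")
```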

Frequently asked questions

What is an AI inference API?

An AI inference API is an HTTP endpoint that runs a trained model on your input and returns the result. Instead of provisioning GPUs and loading model weights yourself, you POST a prompt (or an image) to the API and get a response. InferAll provides a unified AI inference API that routes requests to OpenAI, Anthropic, Google, NVIDIA, Replicate, or Runway, whichever provider hosts the model you asked for.

What is the difference between an AI inference API and an AI gateway?

An AI inference API exposes individual models for inference. An AI gateway sits in front of one or more inference APIs and adds routing, fallback, authentication, and billing. InferAll is both: it is an AI gateway that aggregates 6 inference providers behind a single inference API.

Which AI inference providers does InferAll support?

InferAll aggregates 6 AI inference providers: OpenAI (GPT-4o, o1, DALL-E 3), Anthropic (Claude Sonnet 4, Opus, Haiku), Google (Gemini 2.5, Veo 3, Imagen 4), NVIDIA NIM (110+ free open-source models including Llama 3.1 70B and Mixtral), Replicate (Flux, Stable Diffusion), and Runway (Gen-4.5, Kling 3.0). Switching providers is a parameter change, not a code change.

How much does the inference API cost?

110+ open-source models on NVIDIA NIM are free up to 100,000 tokens per month. Premium providers (OpenAI, Anthropic, Google, Replicate, Runway) are billed at the provider's published token price with zero markup. The $29/month Pro plan includes 2M tokens and access to every model, including image and video generation.

Does the inference API support streaming and tool use?

Yes. The inference API supports Server-Sent Events (SSE) streaming for real-time token output, function calling / tool use translated across providers, and vision inputs. Format compatibility with the OpenAI and Anthropic SDKs means existing streaming and tool-use code works without modification.

What is the latency of the AI inference API?

InferAll's gateway runs on Fly.io with global edge regions, adding 10-30 ms of routing overhead on top of provider latency. Time-to-first-token on free NVIDIA-hosted models is typically 200-400 ms; premium providers depend on the underlying API. Automatic fallback retries with the next provider if the primary times out (30s default).

Can I use this inference API with OpenAI or Anthropic SDKs?

Yes. The inference API is wire-compatible with the OpenAI Chat Completions format and the Anthropic Messages format. Set OPENAI_BASE_URL=https://api.inferall.ai/v1 or ANTHROPIC_BASE_URL=https://api.inferall.ai with your InferAll API key, and existing applications, Claude Code, Cursor, Continue, Cline, and Aider all work unchanged.

Related reading

Unified AI API — one key for OpenAI, Claude, Gemini, and Llama.

LLM API aggregator — 190+ models across 6 providers, one endpoint.

AI model gateway — intelligent routing with automatic provider fallback.


Related solutions

Unified AI API: One key for OpenAI, Claude, Gemini, and Llama
LLM API aggregator: 190+ models across 6 providers, one endpoint
AI model gateway: Intelligent routing with automatic provider fallback
Compare AI models: Test GPT-4o, Claude, Gemini, and Llama side-by-side