Question 1

What is an AI inference API?

Accepted Answer

An AI inference API is an HTTP endpoint that runs a trained model on your input and returns the result. Instead of provisioning GPUs and loading model weights yourself, you POST a prompt (or image, or audio) to the API and get a response. InferAll provides a unified AI inference API that routes requests to OpenAI, Anthropic, Google, NVIDIA, Replicate, or Runway — whichever provider hosts the model you asked for.

Question 2

What is the difference between an AI inference API and an AI gateway?

Accepted Answer

An AI inference API exposes individual models for inference. An AI gateway sits in front of one or more inference APIs and adds routing, fallback, authentication, and billing. InferAll is both: it is an AI gateway that aggregates 6 inference providers behind a single inference API.

Question 3

Which AI inference providers does InferAll support?

Accepted Answer

InferAll aggregates 6 AI inference providers: OpenAI (GPT-4o, o1, DALL-E 3), Anthropic (Claude Sonnet 4, Opus, Haiku), Google (Gemini 2.5, Veo 3, Imagen 4), NVIDIA NIM (110+ free open-source models including Llama 3.1 70B and Mixtral), Replicate (Flux, Stable Diffusion), and Runway (Gen-4.5, Kling 3.0). Switching providers is a parameter change, not a code change.

Question 4

How much does the inference API cost?

Accepted Answer

110+ open-source models on NVIDIA NIM are free up to 100,000 tokens per month. Premium providers (OpenAI, Anthropic, Google, Runway) are billed at the provider's published token price with zero markup. The $29/month Pro plan includes 2M tokens and access to every model including image and video generation.

Question 5

Does the inference API support streaming and tool use?

Accepted Answer

Yes. The inference API supports Server-Sent Events (SSE) streaming for real-time token output, function calling / tool use translated across providers, and vision inputs. Format compatibility with the OpenAI and Anthropic SDKs means existing streaming and tool-use code works without modification.

Question 6

What is the latency of the AI inference API?

Accepted Answer

InferAll's gateway runs on Fly.io with global edge regions, adding 10-30 ms of routing overhead on top of provider latency. Time-to-first-token on free NVIDIA-hosted models is typically 200-400 ms; premium providers depend on the underlying API. Automatic fallback retries with the next provider if the primary times out (30s default).

Question 7

Can I use this inference API with OpenAI or Anthropic SDKs?

Accepted Answer

Yes. The inference API is wire-compatible with the OpenAI Chat Completions format and the Anthropic Messages format. Set OPENAI_BASE_URL=https://api.inferall.ai/v1 or ANTHROPIC_BASE_URL=https://api.inferall.ai with your InferAll API key, and existing applications, Claude Code, Cursor, Continue, Cline, and Aider all work unchanged.

AI Inference API — one endpoint, every model

What is an AI inference API?

Why unified inference matters

Supported AI inference providers

Inference capabilities

Pricing and billing

Code examples

Latency and reliability

Frequently asked questions

Related reading

Related solutions