Open models at $0 rate
118+ open-source models hosted on NVIDIA NIM — Llama 3.1 70B, Mixtral, Nemotron, CodeLlama — bill at our open-model rate ($0 input, $0 output) once your account is activated with the $5 starter pack.
InferAll is an Anthropic-compatible gateway with $0 input/output on 118+ NVIDIA-hosted open-source models (within the free-plan daily request limit), pay-as-you-go access to Claude, GPT-4, and Gemini, and automatic provider failover. Point Claude Code, Cline, Cursor, LangChain or any OpenAI-compatible client at one base URL — one key, one bill.
Drop into Claude Code with two env vars — no code changes.
Drop-in for Claude Code
export ANTHROPIC_API_KEY=ifu_...
export ANTHROPIC_BASE_URL=https://api.inferall.ai
claudeOr call the gateway directly with the official SDK — Python (inferall-ai 0.1.0 on PyPI) or TypeScript (@inferall/sdk 0.1.0 on npm).
Python SDK — install
pip install inferall-aiPython SDK — usage
from inferall import Inferall
ai = Inferall() # reads INFERALL_API_KEY
ai.text("Explain quantum computing in two sentences")200 free trial requests on 118+ open NVIDIA NIM models to start. $5 starter pack unlocks ongoing use — see /pricing.
217 models live · 121 free · 33,800 requests / 30d · gateway healthy
Why InferAll
118+ open-source models hosted on NVIDIA NIM — Llama 3.1 70B, Mixtral, Nemotron, CodeLlama — bill at our open-model rate ($0 input, $0 output) once your account is activated with the $5 starter pack.
One key fronts OpenAI, Anthropic, Google, NVIDIA NIM, Replicate, and Runway. Switch providers with a parameter, not a rewrite.
OpenAI-compatible at /ai/v1/generate. Anthropic-compatible at /v1/messages. Existing SDKs, agents, and editors work unchanged.
Pay-as-you-go on premium providers at their published token price with zero markup. Free OSS, premium when you need it, single invoice.
Resilience
Anthropic rate-limits. OpenAI has incidents. Google Gemini quotas reset. Production agents that call a single provider directly inherit every one of those failures.
InferAll's gateway watches for 429s, 529s, 5xx responses, and timeouts (30s default). When the primary provider can't serve a request, the same prompt is retried server-side against the next provider that can run an equivalent model — Anthropic to Google, OpenAI to NVIDIA, and so on. Your client code does not change.
Context-size-aware routing means a 1M-token prompt is only sent to providers whose models can actually fit it. Cheapest-first selection means you pay the least for the same quality bar.
Failover flow
# Anthropic returns 429 / 529 / 5xx?
# InferAll transparently retries against the next provider
# that can serve the requested model class.
POST https://api.inferall.ai/v1/messages
→ Anthropic (rate-limited) [retry]
→ Google Gemini 2.5 [200 OK]Providers
Editor integration
Anything that speaks the OpenAI or Anthropic wire format works with InferAll. That covers most modern coding agents — including the three below.
Anthropic's terminal-native coding agent. Point ANTHROPIC_BASE_URL at InferAll and route Claude Code through free OSS models for cheap turns, premium models for hard ones.
Setup snippet
export ANTHROPIC_API_KEY=ifu_...
export ANTHROPIC_BASE_URL=https://api.inferall.ai
claudeThe autonomous coding agent for VS Code. Cline supports Anthropic-compatible endpoints — set the base URL in settings and your free OSS allowance covers the chatty inner loop.
Setup steps
# In Cline's Anthropic provider settings:
# API Key: ifu_...
# Base URL: https://api.inferall.aiCursor lets you bring your own OpenAI-compatible endpoint. Configure InferAll as a custom provider to use the same key across editors.
Setup steps
# In Cursor → Settings → Models → Custom API:
# OpenAI API Key: ifu_...
# OpenAI Base URL: https://api.inferall.ai/v1Prefer a one-click VS Code install with inference already wired in? Use the dedicated InferAll for VS Code extension — the canonical InferAll-branded experience, with the upstream Cline + base-URL path above still fully supported.
Dedicated Cline and Cursor integration guides are coming to the docs.
Solutions
01
Sign up at /keys and create an InferAll API key (prefix ifu_). Activate the key at /billing with the $5 starter pack — that $5 becomes spendable balance for any model (open NIM at $0 input/output, premium providers at the provider's rate with zero markup).
02
Set ANTHROPIC_BASE_URL=https://api.inferall.ai for Claude Code and Cline, or OPENAI_BASE_URL=https://api.inferall.ai/v1 for the OpenAI SDK. No code changes.
03
Free OSS models for cheap turns, premium providers when you ask for them, automatic failover when something breaks. One bill at the end of the month.
Free
$0/month
Pro
$29/month
Team
$99/month
Enterprise
Custom
InferAll is an Anthropic-compatible and OpenAI-compatible AI gateway. Point your existing Claude Code, Cline, Cursor, or SDK code at api.inferall.ai with one environment variable and get access to 207+ models across OpenAI, Anthropic, Google, NVIDIA NIM, Replicate, and Runway — including $0 input/output on 118+ open NVIDIA NIM models within the free-tier daily request limits.
Set ANTHROPIC_API_KEY to your InferAll key and ANTHROPIC_BASE_URL to https://api.inferall.ai, then run claude as usual. InferAll exposes an Anthropic-compatible /v1/messages endpoint, so Claude Code sends every request through InferAll. Route cheap turns to free NVIDIA-hosted OSS models and let premium Claude or GPT handle hard tasks.
When a provider rate-limits, returns a 5xx, or times out (30s default), InferAll's gateway transparently retries the same request against the next provider that can serve the model class. A single Anthropic outage or rate-limit no longer takes down your agent. Failover happens server-side — no client-side retry logic required.
118+ open-source models hosted on NVIDIA NIM are billed at $0 input and $0 output once your account is activated, capped only by the free-plan daily request rate (100 chat / 50 text / 20 image-analyze / 10 image-generate / 5 video-generate per day). The roster includes Llama 3.1 70B, Mixtral, Nemotron, and CodeLlama. Activation is a $5 starter pack that becomes spendable balance on your account — the open NIM models cost $0 against that balance, premium providers (OpenAI, Anthropic, Google) bill at the provider's published per-token rate with zero markup.
Other gateways aggregate paid providers. InferAll is the only gateway that bundles an Anthropic-compatible endpoint, an open NVIDIA NIM inference tier at $0 input/output, and automatic cross-provider failover into one product — at zero markup on premium tokens.
Yes — both SDKs are live. Python: pip install inferall-ai (0.1.0 on PyPI), import Inferall from inferall, and call text(), chat(), vision(), or generate(). TypeScript: npm install @inferall/sdk (0.1.0 on npm), import { Inferall } from "@inferall/sdk" with the same surface. Both read INFERALL_API_KEY from the environment (the legacy AI_GATEWAY_KEY is still accepted). Prefer the official OpenAI or Anthropic SDKs? Point them at api.inferall.ai and everything works unchanged.
The InferAll gateway is built and maintained by Kindly Robotics, Inc. The infrastructure code is available on GitHub at github.com/kindlyrobotics/infra. The gateway runs on Fly.io for low-latency global inference.