Free OSS tier
100,000 tokens/month on 186 open-source models hosted on NVIDIA NIM — Llama 3.1 405B, Mixtral, Nemotron, CodeLlama. No credit card to start.
InferAll is an Anthropic-compatible gateway with 100,000 free tokens/month on 186 NVIDIA-hosted open-source models, pay-as-you-go access to Claude, GPT-4, and Gemini, and automatic provider failover. Point Claude Code, Cline, or Cursor at one base URL — one key, one bill.
Drop into Claude Code with two env vars — no code changes.
Drop-in for Claude Code
export ANTHROPIC_API_KEY=ifa_...
export ANTHROPIC_BASE_URL=https://api.inferall.ai
claude
Or call the gateway directly from Python with the official SDK (inferall-ai 0.1.0 on PyPI). TypeScript SDK coming soon.
Python SDK — install
pip install inferall-ai
Python SDK — usage
from inferall import Inferall
ai = Inferall() # reads INFERALL_API_KEY
ai.text("Explain quantum computing in two sentences")
Why InferAll
Free tier: 100,000 tokens/month on 186 NVIDIA NIM-hosted open-source models, including Llama 3.1 405B, Mixtral, Nemotron, and CodeLlama. No credit card to start.
One key fronts OpenAI, Anthropic, Google, NVIDIA NIM, Replicate, and Runway. Switch providers with a parameter, not a rewrite.
OpenAI-compatible at /ai/v1/generate. Anthropic-compatible at /v1/messages. Existing SDKs, agents, and editors work unchanged.
Pay-as-you-go on premium providers at their published token price with zero markup. Free OSS, premium when you need it, single invoice.
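To make the Anthropic-compatible surface concrete, here is a minimal stdlib sketch that builds a request against the /v1/messages endpoint named above. The model name, max_tokens value, and header set are illustrative assumptions (the headers follow the standard Anthropic wire format); nothing is sent over the network.

```python
import json
import urllib.request

BASE_URL = "https://api.inferall.ai"

def build_messages_request(api_key: str, model: str, prompt: str,
                           max_tokens: int = 256) -> urllib.request.Request:
    """Build an Anthropic-wire-format request for InferAll's /v1/messages.

    Header names follow the public Anthropic API convention; the gateway's
    exact accepted headers are an assumption here.
    """
    body = json.dumps({
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/messages",
        data=body,
        headers={
            "x-api-key": api_key,            # InferAll key, prefix ifa_
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        method="POST",
    )

req = build_messages_request("ifa_example", "claude-sonnet", "Hello")
print(req.full_url)  # https://api.inferall.ai/v1/messages
```

Passing the request to `urllib.request.urlopen` (or swapping in any HTTP client) would send it; existing Anthropic SDKs do the equivalent of this for you when pointed at the base URL.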
Resilience
Anthropic rate-limits. OpenAI has incidents. Google Gemini quotas reset. Production agents that call a single provider directly inherit every one of those failures.
InferAll's gateway watches for 429s, 529s, 5xx responses, and timeouts (30s default). When the primary provider can't serve a request, the same prompt is retried server-side against the next provider that can run an equivalent model — Anthropic to Google, OpenAI to NVIDIA, and so on. Your client code does not change.
Context-size-aware routing means a 1M-token prompt is only sent to providers whose models can actually fit it. Cheapest-first selection means you pay the least for the same quality bar.
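The selection rule above can be sketched in a few lines: among providers whose model context window fits the prompt, pick the one with the lowest token price. Provider names, context windows, and prices below are made up for illustration; the real routing table lives server-side and is not public.

```python
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    context_window: int    # max tokens the model accepts
    price_per_mtok: float  # USD per million input tokens

# Illustrative numbers only; not InferAll's actual routing table.
PROVIDERS = [
    Provider("premium-large-context", 1_000_000, 3.00),
    Provider("oss-mid-context", 128_000, 0.00),
    Provider("premium-mid-context", 200_000, 1.00),
]

def route(prompt_tokens: int) -> Provider:
    """Cheapest provider whose context window fits the prompt."""
    fits = [p for p in PROVIDERS if p.context_window >= prompt_tokens]
    if not fits:
        raise ValueError("prompt exceeds every provider's context window")
    return min(fits, key=lambda p: p.price_per_mtok)

print(route(50_000).name)   # oss-mid-context (free tier fits)
print(route(500_000).name)  # premium-large-context (only 1M ctx fits)
```

The 1M-token case falls through to the sole large-context provider, exactly the behavior described: a big prompt is only ever sent where it can actually fit.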
Failover flow
# Anthropic returns 429 / 529 / 5xx?
# InferAll transparently retries against the next provider
# that can serve the requested model class.
POST https://api.inferall.ai/v1/messages
→ Anthropic (rate-limited) [retry]
→ Google Gemini 2.5 [200 OK]
Providers
Editor integration
Anything that speaks the OpenAI or Anthropic wire format works with InferAll. That covers most modern coding agents — including the three below.
Anthropic's terminal-native coding agent. Point ANTHROPIC_BASE_URL at InferAll and route Claude Code through free OSS models for cheap turns, premium models for hard ones.
Setup snippet
export ANTHROPIC_API_KEY=ifa_...
export ANTHROPIC_BASE_URL=https://api.inferall.ai
claude
The autonomous coding agent for VS Code. Cline supports Anthropic-compatible endpoints — set the base URL in settings and your free OSS allowance covers the chatty inner loop.
Setup steps
# In Cline's Anthropic provider settings:
# API Key: ifa_...
# Base URL: https://api.inferall.ai
Cursor lets you bring your own OpenAI-compatible endpoint. Configure InferAll as a custom provider to use the same key across editors.
Setup steps
# In Cursor → Settings → Models → Custom API:
# OpenAI API Key: ifa_...
# OpenAI Base URL: https://api.inferall.ai/v1
Prefer a one-click VS Code install with inference already wired in? Use the dedicated InferAll for VS Code extension — the canonical InferAll-branded experience, with the upstream Cline + base-URL path above still fully supported.
Dedicated Cline and Cursor integration guides are coming to the docs.
Solutions
01
Sign up at /keys and get an InferAll API key (prefix ifa_). The free tier needs no credit card.
02
Set ANTHROPIC_BASE_URL=https://api.inferall.ai for Claude Code and Cline, or OPENAI_BASE_URL=https://api.inferall.ai/v1 for the OpenAI SDK. No code changes.
03
Free OSS models for cheap turns, premium providers when you ask for them, automatic failover when something breaks. One bill at the end of the month.
Free: $0/month
Pro: $29/month
Team: $99/month
Enterprise: Custom
InferAll is an Anthropic-compatible and OpenAI-compatible AI gateway. Point your existing Claude Code, Cline, Cursor, or SDK code at api.inferall.ai with one environment variable and get access to 255+ models across OpenAI, Anthropic, Google, NVIDIA NIM, Replicate, and Runway — plus 100,000 free tokens/month on 186 open-source models.
Set ANTHROPIC_API_KEY to your InferAll key and ANTHROPIC_BASE_URL to https://api.inferall.ai, then run claude as usual. InferAll exposes an Anthropic-compatible /v1/messages endpoint, so Claude Code sends every request through InferAll. Route cheap turns to free NVIDIA-hosted OSS models and let premium Claude or GPT handle hard tasks.
When a provider rate-limits, returns a 5xx, or times out (30s default), InferAll's gateway transparently retries the same request against the next provider that can serve the model class. A single Anthropic outage or rate-limit no longer takes down your agent. Failover happens server-side — no client-side retry logic required.
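The failover behavior in that answer can be sketched as a loop over providers that treats 429, 529, any 5xx, and timeouts as retryable. The status codes and the 30-second timeout come from the text; the provider callables, exception types, and function names are illustrative, and the real logic runs server-side in the gateway.

```python
import socket

RETRYABLE_STATUSES = {429, 529}  # plus any 5xx, checked below

class ProviderError(Exception):
    """Illustrative error carrying an upstream HTTP status."""
    def __init__(self, status: int):
        super().__init__(f"upstream status {status}")
        self.status = status

def should_failover(exc: Exception) -> bool:
    if isinstance(exc, socket.timeout):  # 30s default in the gateway
        return True
    if isinstance(exc, ProviderError):
        return exc.status in RETRYABLE_STATUSES or 500 <= exc.status < 600
    return False

def call_with_failover(providers, request):
    """Try each provider in order; move on only for retryable failures."""
    last = None
    for provider in providers:
        try:
            return provider(request)
        except Exception as exc:
            if not should_failover(exc):
                raise
            last = exc
    raise RuntimeError("all providers failed") from last

# Simulated run: first provider is rate-limited, second succeeds.
def anthropic(req):
    raise ProviderError(429)

def gemini(req):
    return {"status": 200, "provider": "gemini"}

print(call_with_failover([anthropic, gemini], {"prompt": "hi"}))
# {'status': 200, 'provider': 'gemini'}
```

A 4xx that is not retryable (say, a 400 for a malformed request) propagates to the caller instead of burning through the provider list, which is the behavior you want for client errors.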
186 open-source models hosted on NVIDIA NIM are free up to 100,000 tokens per month. The roster includes Llama 3.1 405B, Mixtral, Nemotron, and CodeLlama. No credit card required.
Other gateways aggregate paid providers. InferAll is the only gateway that bundles an Anthropic-compatible endpoint, a free OSS inference tier on NVIDIA NIM, and automatic cross-provider failover into one product — at zero markup on premium tokens.
Yes — the Python SDK is live on PyPI today as inferall-ai 0.1.0. Install with pip install inferall-ai, import Inferall from inferall, and call text(), chat(), vision(), or generate(). It reads INFERALL_API_KEY from the environment (the legacy AI_GATEWAY_KEY is still accepted). A TypeScript SDK is in progress and not yet published; in the meantime, point the official OpenAI or Anthropic SDKs at api.inferall.ai and everything works unchanged.
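The environment-variable behavior that answer describes (INFERALL_API_KEY preferred, legacy AI_GATEWAY_KEY still accepted) can be sketched like this. The helper name is ours, not part of the SDK; only the two variable names come from the text.

```python
import os

def resolve_api_key(env=None) -> str:
    """Prefer INFERALL_API_KEY; fall back to legacy AI_GATEWAY_KEY."""
    env = os.environ if env is None else env
    key = env.get("INFERALL_API_KEY") or env.get("AI_GATEWAY_KEY")
    if not key:
        raise RuntimeError("set INFERALL_API_KEY (or legacy AI_GATEWAY_KEY)")
    return key

print(resolve_api_key({"AI_GATEWAY_KEY": "ifa_legacy"}))  # ifa_legacy
print(resolve_api_key({"INFERALL_API_KEY": "ifa_new",
                       "AI_GATEWAY_KEY": "ifa_legacy"}))  # ifa_new
```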
The InferAll gateway is built and maintained by Kindly Robotics, Inc. The infrastructure code is available on GitHub at github.com/kindlyrobotics/infra. The gateway runs on Fly.io for low-latency global inference.