Run Claude Code on free OSS models. Upgrade when you need to.

InferAll is an Anthropic-compatible gateway with 100,000 free tokens/month on 186 NVIDIA-hosted open-source models, pay-as-you-go access to Claude, GPT-4, and Gemini, and automatic provider failover. Point Claude Code, Cline, or Cursor at one base URL — one key, one bill.

Drop into Claude Code with two env vars — no code changes.

Drop-in for Claude Code

export ANTHROPIC_API_KEY=ifa_...
export ANTHROPIC_BASE_URL=https://api.inferall.ai
claude

Or call the gateway directly from Python with the official SDK (inferall-ai 0.1.0 on PyPI). TypeScript SDK coming soon.

Python SDK — install

pip install inferall-ai

Python SDK — usage

from inferall import Inferall

ai = Inferall()  # reads INFERALL_API_KEY
ai.text("Explain quantum computing in two sentences")

Why InferAll

Free OSS tier

100,000 tokens/month on 186 open-source models hosted on NVIDIA NIM — Llama 3.1 405B, Mixtral, Nemotron, CodeLlama. No credit card to start.

Multi-provider routing

One key fronts OpenAI, Anthropic, Google, NVIDIA NIM, Replicate, and Runway. Switch providers with a parameter, not a rewrite.
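The "switch with a parameter" idea boils down to a model-name lookup on the gateway side. A minimal sketch, assuming a routing table like the one below (the model names and provider assignments here are illustrative examples, not InferAll's actual catalog):

```python
# Illustrative sketch of parameter-based provider routing.
# ROUTES is a made-up example table, not InferAll's real routing map.
ROUTES = {
    "claude-sonnet-4": "anthropic",
    "gpt-4o": "openai",
    "gemini-2.5": "google",
    "llama-3.1-405b": "nvidia-nim",
}

def pick_provider(model: str) -> str:
    """Resolve a model name to the provider that serves it."""
    try:
        return ROUTES[model]
    except KeyError:
        raise ValueError(f"no provider serves {model!r}")

print(pick_provider("llama-3.1-405b"))  # nvidia-nim
```

Changing the `model` string in a request is the whole migration; the client code, endpoint, and key stay the same.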

Drop-in compatibility

OpenAI-compatible at /ai/v1/generate. Anthropic-compatible at /v1/messages. Existing SDKs, agents, and editors work unchanged.
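The two wire formats differ only in body shape, which is why existing SDKs work unchanged. A sketch of the same prompt in both schemas (body shapes follow the public Anthropic and OpenAI chat formats; the model names are examples):

```python
import json

# One prompt, two wire formats the gateway accepts.
anthropic_body = {  # POST https://api.inferall.ai/v1/messages
    "model": "claude-sonnet-4",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello"}],
}

openai_body = {  # POST https://api.inferall.ai/ai/v1/generate
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}],
}

print(json.dumps(anthropic_body, indent=2))
```

Tools that emit either shape can point their base URL at InferAll without translation.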

One bill

Pay-as-you-go on premium providers at their published token price with zero markup. Free OSS, premium when you need it, single invoice.

Resilience

Automatic failover across providers

Anthropic rate-limits. OpenAI has incidents. Google Gemini quotas reset. Production agents that call a single provider directly inherit every one of those failures.

InferAll's gateway watches for 429s, 529s, 5xx responses, and timeouts (30s default). When the primary provider can't serve a request, the same prompt is retried server-side against the next provider that can run an equivalent model — Anthropic to Google, OpenAI to NVIDIA, and so on. Your client code does not change.

Context-size-aware routing means a 1M-token prompt is only sent to providers whose models can actually fit it. Cheapest-first selection means you pay the least for the same quality bar.

Failover flow

# Anthropic returns 429 / 529 / 5xx?
# InferAll transparently retries against the next provider
# that can serve the requested model class.

POST https://api.inferall.ai/v1/messages
  → Anthropic (rate-limited)       [retry]
  → Google Gemini 2.5              [200 OK]
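The flow above is an ordered retry loop over providers that can serve the model class. A minimal sketch (the status codes mirror the ones the gateway watches for; `call` stands in for a real upstream request):

```python
# Sketch of a server-side failover loop: try providers in order,
# fall through on retryable failures, stop at the first 200.
RETRYABLE = {429, 529, 500, 502, 503, 504}

def with_failover(providers, call):
    """Return (provider, body) from the first successful upstream."""
    last = None
    for provider in providers:
        status, body = call(provider)
        if status == 200:
            return provider, body
        if status in RETRYABLE:
            last = status
            continue  # next provider that can serve the model class
        raise RuntimeError(f"{provider} failed with non-retryable {status}")
    raise RuntimeError(f"all providers exhausted (last status {last})")

# Simulated upstreams: Anthropic is rate-limited, Gemini answers.
responses = {"anthropic": (429, None), "google": (200, "ok")}
provider, body = with_failover(["anthropic", "google"], responses.get)
print(provider)  # google
```

The client sees only the final 200; the retry happens entirely inside the gateway.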

Providers

OpenAI: GPT-4o, DALL-E 3, o1
Anthropic: Claude Sonnet 4, Opus, Haiku
Google: Gemini 2.5, Veo 3, Imagen 4
NVIDIA: Llama 405B, Mixtral, 186 free models
Replicate: Flux, Stable Diffusion, video
Runway: Gen-4.5, video generation

Editor integration

Use with your favorite editor

Anything that speaks the OpenAI or Anthropic wire format works with InferAll. That covers most modern coding agents — including the three below.

Claude Code

Anthropic's terminal-native coding agent. Point ANTHROPIC_BASE_URL at InferAll and route Claude Code through free OSS models for cheap turns, premium models for hard ones.

Setup snippet

export ANTHROPIC_API_KEY=ifa_...
export ANTHROPIC_BASE_URL=https://api.inferall.ai
claude

Cline

The autonomous coding agent for VS Code. Cline supports Anthropic-compatible endpoints — set the base URL in settings and your free OSS allowance covers the chatty inner loop.

Setup steps

# In Cline's Anthropic provider settings:
#   API Key:  ifa_...
#   Base URL: https://api.inferall.ai

Cursor

Cursor lets you bring your own OpenAI-compatible endpoint. Configure InferAll as a custom provider to use the same key across editors.

Setup steps

# In Cursor → Settings → Models → Custom API:
#   OpenAI API Key:  ifa_...
#   OpenAI Base URL: https://api.inferall.ai/v1

Prefer a one-click VS Code install with inference already wired in? Use the dedicated InferAll for VS Code extension — the canonical InferAll-branded experience, with the upstream Cline + base-URL path above still fully supported.

Dedicated Cline and Cursor integration guides are coming to the docs.

How it works

01

Get your API key

Sign up at /keys and get an InferAll API key (prefix ifa_). The free tier needs no credit card.

02

Point your tool at InferAll

Set ANTHROPIC_BASE_URL=https://api.inferall.ai for Claude Code and Cline, or OPENAI_BASE_URL=https://api.inferall.ai/v1 for the OpenAI SDK. No code changes.
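For the OpenAI-SDK path, the equivalent environment setup looks like this (the official OpenAI SDKs read both variables):

```shell
export OPENAI_API_KEY=ifa_...
export OPENAI_BASE_URL=https://api.inferall.ai/v1
```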

03

Let the gateway route

Free OSS models for cheap turns, premium providers when you ask for them, automatic failover when something breaks. One bill at the end of the month.

Pricing

Free

$0/month

  • 100k tokens/month
  • 186 NVIDIA-hosted OSS models
  • Anthropic + OpenAI compatible
  • Automatic provider failover

Pro

$29/month

  • 2M tokens/month included
  • All 255+ models, all providers
  • Image and video generation
  • Streaming and tool use

Team

$99/month

  • 10M tokens/month included
  • Multiple API keys
  • Priority routing
  • Usage analytics

Enterprise

Custom

  • Unlimited tokens
  • Custom model hosting
  • Priority support
  • Dedicated account team
Get started free

Questions

What is InferAll?

InferAll is an Anthropic-compatible and OpenAI-compatible AI gateway. Point your existing Claude Code, Cline, Cursor, or SDK code at api.inferall.ai with one environment variable and get access to 255+ models across OpenAI, Anthropic, Google, NVIDIA NIM, Replicate, and Runway — plus 100,000 free tokens/month on 186 open-source models.

How do I use InferAll with Claude Code?

Set ANTHROPIC_API_KEY to your InferAll key and ANTHROPIC_BASE_URL to https://api.inferall.ai, then run claude as usual. InferAll exposes an Anthropic-compatible /v1/messages endpoint, so Claude Code sends every request through InferAll. Route cheap turns to free NVIDIA-hosted OSS models and let premium Claude or GPT handle hard tasks.

How does automatic failover work?

When a provider rate-limits, returns a 5xx, or times out (30s default), InferAll's gateway transparently retries the same request against the next provider that can serve the model class. A single Anthropic outage or rate-limit no longer takes down your agent. Failover happens server-side — no client-side retry logic required.

How many free models does InferAll offer?

186 open-source models hosted on NVIDIA NIM are free up to 100,000 tokens per month. The roster includes Llama 3.1 405B, Mixtral, Nemotron, and CodeLlama. No credit card required.

How does InferAll compare to OpenRouter, LiteLLM, Portkey, or Vercel AI Gateway?

Other gateways aggregate paid providers. InferAll is the only gateway that bundles an Anthropic-compatible endpoint, a free OSS inference tier on NVIDIA NIM, and automatic cross-provider failover into one product — at zero markup on premium tokens.

Is there a native InferAll SDK?

Yes — the Python SDK is live on PyPI today as inferall-ai 0.1.0. Install with pip install inferall-ai, import Inferall from inferall, and call text(), chat(), vision(), or generate(). It reads INFERALL_API_KEY from the environment (the legacy AI_GATEWAY_KEY is still accepted). A TypeScript SDK is in progress and not yet published; in the meantime, point the official OpenAI or Anthropic SDKs at api.inferall.ai and everything works unchanged.

Is InferAll open source?

The InferAll gateway is built and maintained by Kindly Robotics, Inc. The infrastructure code is available on GitHub at github.com/kindlyrobotics/infra. The gateway runs on Fly.io for low-latency global inference.