Use case

Run Claude Code on free OSS models. Pay only when you need Claude.

Most Claude Code turns are cheap — file reads, summaries, status checks, small refactors. A minority are complex — multi-file edits, architecture decisions, hard reasoning. Routing the cheap turns to a free open-source model and reserving Claude for the complex ones can take real cost out of a developer's monthly bill. InferAll makes that routing the default by giving Claude Code one base URL with a free OSS allowance behind it.

Setup

Claude Code reads two environment variables to decide where to send requests: ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL. Point those at InferAll and the standard Claude Code flow works against the gateway — no adapter, no proxy, no translation layer.

The free tier needs no credit card. Sign in at /keys, copy the key (prefix ifa_), export the two variables, and run claude.

Two environment variables

# 1. Get an InferAll API key (ifa_...) at /keys — no credit card.
# 2. Point Claude Code at the InferAll gateway via the two
#    Anthropic environment variables.

export ANTHROPIC_API_KEY=ifa_...
export ANTHROPIC_BASE_URL=https://api.inferall.ai

# 3. Run Claude Code. Pick the free OSS model on launch (or with /model).
claude --model meta/llama-3.1-405b-instruct

# That's the whole setup. No proxy to run, no provider keys to wire,
# no adapter to install.

When to use a free model vs Claude

The honest framing is that a free OSS model and Claude aren't interchangeable; they have different sweet spots. The job is to route by task class, not to pick once and hope.

Free OSS models — Llama 3.1 405B, Mixtral, Nemotron, CodeLlama — are strong on classification, single-file edits, summaries of small-to-medium inputs, commit messages, docstring generation, and quick “explain this function” style turns. For most of the chatty inner-loop traffic Claude Code generates, an OSS model is good enough and the latency is similar. This is where the 100,000-token monthly allowance pays for itself.

Claude — claude-sonnet-4 and claude-opus-4 — pulls ahead on multi-file reasoning, planning long edits, debugging tricky failures, anything that requires holding more context together coherently. If a task involves “look at these five files and propose a change that keeps them in sync,” you generally want Claude answering it. That's also where the premium rate is worth paying.

The mixing pattern that works in practice: pin the session to a free model by default, and bump to Claude for the turns you can tell ahead of time will need it. Claude Code's in-session /model command makes this a one-line switch.

Pinning a free model

If you want a session that won't bill against premium accidentally — for instance, a code review pass on a public repository, a docs-rewrite session, or an exploratory question loop — pin the session to a free OSS model on launch.

Pin and switch

# Pin a free OSS model so the session never bills against premium.
# Useful for code review on a public repo, commit-message generation,
# or exploratory questions where Claude-quality isn't required.

claude --model meta/llama-3.1-405b-instruct

# Switch to Claude only for a specific task, then back.
# In a Claude Code session: /model claude-sonnet-4-20250514
# When done: /model meta/llama-3.1-405b-instruct

Free OSS models available

InferAll's free tier covers 186 open-source models hosted on NVIDIA NIM. The lineup that comes up most often for Claude Code workflows:

  • Llama 3.1 405B Instruct — strongest free general-purpose model on the roster for code-adjacent tasks.
  • Mixtral 8x22B Instruct — fast, good at structured outputs and tool calls.
  • Nemotron — NVIDIA's instruction-tuned line, well-suited to reasoning chains.
  • CodeLlama — a code-specialist family if your turns are code-completion-shaped.

The full current list lives in the docs. The roster grows as NIM adds models; the names above are the dependable starting points for Claude Code substitution.

Pricing math (illustrative)

The shape of the savings depends on your mix of cheap vs complex turns. We won't invent specific dollar figures because they shift with provider pricing and your own usage pattern. The structural argument:

If 60% of your Claude Code tokens are cheap turns that an OSS model handles well, route those turns to the free tier (up to 100,000 tokens/month) and to NIM-rate OSS tokens past that, while keeping the complex 40% on Claude. That cuts your effective per-token cost by a meaningful share. The exact share depends on the OSS-vs-Claude rate gap and on how often you stay under the free allowance.

The math below is a worked example. Substitute your own token volumes and your own current Anthropic billing rate to get a number that matches your situation.

Worked example

# Illustrative pricing math — not a guarantee, not a promise.
# Replace these with your own numbers from /docs and /#pricing.

# Hypothetical month of Claude Code usage:
#   Total prompt + completion tokens: 5,000,000
#   Mix:  60% cheap turns (file reads, summaries, classification)
#         40% complex turns (multi-file edits, planning, reasoning)

# Option A: every turn routes to Claude.
#   5,000,000 tokens at Anthropic's published rate         = $X total
#   (where $X depends on the Claude model and the
#   per-million-token rate at the time you do this math.)

# Option B: through InferAll, cheap turns route to free OSS.
#   3,000,000 tokens to NIM-hosted OSS (within / past 100k free):
#     first 100k: free
#     remainder:  billed at NIM rate (typically below Claude)
#   2,000,000 tokens to Claude at the published Anthropic rate.
#
# Net: Option B is meaningfully cheaper than Option A for
# workloads that fit the 60/40 cheap/complex shape. Run the numbers
# against your actual mix before you assume the magnitude.
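The worked example above can be parameterized as a small script. The rates below are invented placeholders, not InferAll or Anthropic pricing; substitute current numbers from /#pricing and the upstream docs before trusting the output.

```python
# Illustrative blended-cost calculator for the worked example above.
# All per-million-token rates are placeholder assumptions.

def blended_cost(total_tokens: int, cheap_share: float,
                 claude_rate: float, oss_rate: float,
                 free_allowance: int = 100_000) -> float:
    """Dollar cost when cheap turns route to OSS and complex turns to Claude.

    Rates are dollars per million tokens; the first `free_allowance`
    OSS tokens are free, matching the free-tier behavior described above.
    """
    cheap = total_tokens * cheap_share
    complex_tokens = total_tokens - cheap
    billable_oss = max(cheap - free_allowance, 0)
    return (billable_oss * oss_rate + complex_tokens * claude_rate) / 1_000_000

# Hypothetical month: 5M tokens, 60/40 cheap/complex split.
# $15/M for Claude and $1/M for OSS are placeholder rates, not real pricing.
all_claude = blended_cost(5_000_000, 0.0, claude_rate=15.0, oss_rate=1.0)
mixed = blended_cost(5_000_000, 0.6, claude_rate=15.0, oss_rate=1.0)
print(f"Option A (all Claude): ${all_claude:.2f}")
print(f"Option B (mixed):      ${mixed:.2f}")
```

With these placeholder rates the mixed route costs less than half of the all-Claude route; the real ratio depends entirely on the rates you plug in.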

For current InferAll plan pricing — what Pro and Team tiers include beyond the free allowance — see /#pricing. For the underlying provider rates we forward at zero markup, the upstream documentation is the source of truth.

Common workflows

Code review

Free OSS first; Claude only for files where the OSS review came back vague. The cheap turns — “is this file consistent with the rest of the module” — are what OSS handles well. The expensive turns — “does this change break invariants in three other files” — are what Claude is worth paying for.

Commit messages

Free OSS is enough. A summary of a diff is the kind of turn where Llama 3.1 405B and Claude produce essentially interchangeable output. Pin a free model for the commit-message generation step and don't spend premium tokens here.
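This step is scriptable. The sketch below is a hypothetical helper, not a documented InferAll feature: it pipes the staged diff into Claude Code's non-interactive print mode (`claude -p`) with a free model pinned. It assumes `claude` is on your PATH and the two environment variables from Setup are exported.

```python
# Hypothetical helper: generate a commit message from the staged diff
# with a free OSS model, so no premium tokens are spent on this turn.
import subprocess

def commit_message_cmd(model: str = "meta/llama-3.1-405b-instruct") -> list[str]:
    # `claude -p` runs a single prompt non-interactively and prints the result;
    # `--model` pins the free OSS model for this one call.
    return ["claude", "-p",
            "Write a one-line commit message for this diff.",
            "--model", model]

def generate_commit_message() -> str:
    diff = subprocess.run(["git", "diff", "--staged"],
                          capture_output=True, text=True, check=True).stdout
    result = subprocess.run(commit_message_cmd(), input=diff,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```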

Refactoring

Claude. Multi-file consistency is where premium models earn their cost. Free OSS can do single-function renames; multi-file refactors with type changes or contract changes are Claude territory.

Exploratory coding

Mix. Pin a free model for the question loop (“what does this library do, what's the pattern here, give me a starter example”) and bump to Claude when you're ready to write the production version. Most of the tokens land on the cheap side; the expensive ones are short and worth their cost.

Long-running agent loops

Be careful. Agent loops can drain the 100,000-token free allowance quickly because they call the model repeatedly on every step. Either cap the loop turns, route the inner-loop calls explicitly to OSS, or watch the dashboard. The cost story for long-running agents is fundamentally different from that of an interactive Claude Code session.
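A client-side budget guard is one way to cap the loop. This is a sketch under assumptions: the per-step token counts come from the response's usage fields (Anthropic Messages API responses report input and output token counts, which the gateway's Anthropic-format endpoint would mirror), and the commented-out client call is hypothetical.

```python
# Client-side guard for long-running agent loops: stop (or switch models)
# before the monthly free allowance is exhausted. The 100k figure is the
# free-tier allowance described above.

class TokenBudget:
    def __init__(self, allowance: int = 100_000, reserve: int = 5_000):
        self.allowance = allowance   # monthly free-tier tokens
        self.reserve = reserve       # stop early, leave some headroom
        self.used = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        # Both prompt and completion tokens count against the allowance.
        self.used += input_tokens + output_tokens

    def exhausted(self) -> bool:
        return self.used >= self.allowance - self.reserve

budget = TokenBudget()
for step in range(1_000):                 # the agent loop
    # response = client.messages.create(...)            # hypothetical call
    # budget.record(response.usage.input_tokens,
    #               response.usage.output_tokens)
    budget.record(1_200, 800)             # simulated per-step usage
    if budget.exhausted():
        print(f"Budget reached after {step + 1} steps; pin OSS or stop.")
        break
```

At a simulated 2,000 tokens per step, the guard trips well before the allowance runs dry, leaving room to finish the turn on OSS or halt cleanly.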

Why we built this

We built InferAll because we wanted exactly this workflow for ourselves — Claude Code routed through a gateway with a free open-source allowance, premium models on tap when we needed them, one base URL and one bill. The market had either OSS gateways we'd have to operate, observability platforms that didn't solve the routing problem, or single-provider keys that locked us into one upstream. So this page is biased: we're describing the workflow we're selling. The technical claims should still hold up if you check them, and the corrections email at the bottom is real. If we got something wrong, tell us.

Frequently asked questions

Will my code be sent to NVIDIA when I use the free OSS tier?

If you target an NVIDIA NIM-hosted model on the free tier, the request body — including any code in the prompt — is forwarded to NVIDIA's NIM inference endpoint for that model. InferAll is the gateway in between; we route the call and meter your usage, but the inference itself happens at NIM. If you don't want code to leave a particular trust boundary, route those calls to a model whose hosting you've already approved (Anthropic, OpenAI, Google) and use the free tier for code that's already in a public repository or for content you don't consider sensitive.

Can I disable fallback so Claude Code never uses Claude?

Yes. The model you select in your Claude Code configuration is the model that runs the call. If you pin an NVIDIA NIM model — for example, meta/llama-3.1-405b-instruct — Claude Code will use that model. InferAll's cross-provider failover triggers on upstream failures (429, 529, 5xx, timeout); it will route to a peer provider that can serve the same model class, not 'upgrade' you to a paid model behind your back. If you want strict pinning to a single upstream, set it in your client and the gateway respects it.
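The failover policy described above can be sketched as a few lines of routing logic. The provider names and the send() stub are hypothetical; the point is only that retry happens within the same model class, never across it.

```python
# Sketch of the failover policy: retry on upstream failure statuses against
# a peer provider serving the same model, never a different (paid) model.

FAILOVER_STATUSES = {429, 500, 502, 503, 529}

def route_with_failover(model: str, providers: list[str], send) -> tuple[str, int]:
    """Try each provider that can serve `model`; fail over on listed statuses."""
    last_status = 0
    for provider in providers:
        status = send(provider, model)     # returns an HTTP status code
        if status not in FAILOVER_STATUSES:
            return provider, status        # success, or a non-retryable error
        last_status = status
    return providers[-1], last_status      # every peer failed

# Simulated upstreams: the first is rate-limited (429), the peer succeeds.
responses = {"nim-us-east": 429, "nim-us-west": 200}
provider, status = route_with_failover(
    "meta/llama-3.1-405b-instruct",
    ["nim-us-east", "nim-us-west"],
    lambda p, m: responses[p],
)
print(provider, status)   # the peer that actually served the call
```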

Which Claude model gets used when I 'upgrade' a turn?

Whatever model ID you put in the request. Claude Code lets you change the model at runtime; you can run a session on a free OSS model and switch to claude-sonnet-4-20250514 or claude-opus-4-20250514 for a specific task. InferAll forwards the request to the named upstream and bills accordingly — premium Claude tokens at Anthropic's published rate, zero markup.

What counts toward the 100k free token allowance?

Tokens routed to any of the 186 NVIDIA NIM-hosted models on the free tier — both prompt and completion tokens — count against your monthly free allowance. Calls to Anthropic, OpenAI, Google, Replicate, and Runway are billed at the upstream's published rate and do not draw from the free pool. The allowance resets monthly. There's no credit card required to use it.

Does Claude Code know it's hitting NVIDIA instead of Anthropic?

Claude Code knows it's speaking to whatever ANTHROPIC_BASE_URL points at; it doesn't see the upstream provider behind that endpoint. InferAll receives the Anthropic-format request, looks at the model name, and forwards to the matching upstream — Anthropic for Claude models, NVIDIA NIM for the free OSS catalog. The client doesn't change. From your perspective, you're configuring Claude Code as usual; the routing is the gateway's job.
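In pseudocode, the routing step reduces to a prefix lookup on the model ID. The table below is illustrative, not InferAll's actual routing configuration.

```python
# Sketch of model-name routing: the gateway inspects the model ID in the
# Anthropic-format request and picks the matching upstream. Prefix table
# is an assumption for illustration only.

UPSTREAMS = {
    "claude-": "anthropic",
    "meta/": "nvidia-nim",
    "mistralai/": "nvidia-nim",
    "gpt-": "openai",
}

def upstream_for(model: str) -> str:
    for prefix, upstream in UPSTREAMS.items():
        if model.startswith(prefix):
            return upstream
    raise ValueError(f"unknown model: {model}")

print(upstream_for("claude-sonnet-4-20250514"))       # anthropic
print(upstream_for("meta/llama-3.1-405b-instruct"))   # nvidia-nim
```

The client never sees this table; it just sends an Anthropic-format request with a model name, which is why no adapter or proxy is needed on your side.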

What about tool calls and file edits — do free OSS models handle them?

The strongest free models on the NVIDIA NIM roster — Llama 3.1 405B, Mixtral, Nemotron — handle tool calls. Quality on multi-step file-editing workflows is generally below Claude's for the same prompt; that's the trade-off you're making by routing cheap turns to OSS. For pure read-and-summarize or quick-question turns the gap is small; for complex multi-file refactors it's wider. The honest answer is to try it on your codebase and route by task class.

Will my free allowance regenerate if I'm only doing cheap turns?

The 100,000-token allowance resets monthly, not continuously. Once you've used it within a month, additional calls to NIM-hosted models are billed at the published rate. Most developers doing typical Claude Code sessions stay inside the allowance for ordinary days of work; heavy automated agent loops can exhaust it. Watch your dashboard if you're running anything long-running.

Related

InferAll home — the gateway, the free tier, the failover story.

InferAll for VS Code — Cline-based agent with the gateway pre-wired, free first run.

Security — data handling, trust boundaries, what leaves your machine.

Rate limits — free-tier caps, per-key behavior, what happens on 429.

Pricing — free tier, Pro, Team, Enterprise.

AI inference API — endpoint surface, supported providers, code examples.

Unified AI API — one key, one bill, every provider.

Last updated: 2026-05-14.

InferAll facts on this page are drawn from this site and the gateway running at api.inferall.ai. Model availability on the NVIDIA NIM-hosted free tier may shift as upstream hosts change their roster. Specifics change. Have a correction? Email contact@kindly.fyi.