Use case

Run Claude Code on free OSS models. Pay only when you need Claude.

Most Claude Code turns are cheap — file reads, summaries, status checks, small refactors. A minority are complex — multi-file edits, architecture decisions, hard reasoning. Routing the cheap turns to a free open-source model and reserving Claude for the complex ones can take real cost out of a developer's monthly bill. InferAll makes that routing the default by giving Claude Code one base URL with a free OSS allowance behind it.

Setup

Claude Code reads two environment variables to decide where to send requests: ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL. Point those at InferAll and the standard Claude Code flow works against the gateway — no adapter, no proxy, no translation layer.

The free tier needs no credit card. Sign in at /keys, copy the key (prefix ifa_), export the two variables, and run claude.

Two environment variables

# 1. Get an InferAll API key (ifa_...) at /keys — no credit card.
# 2. Point Claude Code at the InferAll gateway via the two
#    Anthropic environment variables.

export ANTHROPIC_API_KEY=ifa_...
export ANTHROPIC_BASE_URL=https://api.inferall.ai

# 3. Run Claude Code. Pick the free OSS model on launch (or with /model).
claude --model meta/llama-3.1-405b-instruct

# That's the whole setup. No proxy to run, no provider keys to wire,
# no adapter to install.

When to use a free model vs Claude

The honest framing is that a free OSS model and Claude aren't interchangeable; they have different sweet spots. The job is to route by task class, not to pick once and hope.

Free OSS models — Llama 3.1 405B, Mixtral, Nemotron, CodeLlama — are strong on classification, single-file edits, summaries of small-to-medium inputs, commit messages, docstring generation, and quick “explain this function” style turns. For most of the chatty inner-loop traffic Claude Code generates, an OSS model is good enough and the latency is similar. This is where the 100,000-token monthly allowance pays for itself.

Claude — claude-sonnet-4 and claude-opus-4 — pulls ahead on multi-file reasoning, planning long edits, debugging tricky failures, anything that requires holding more context together coherently. If a task involves “look at these five files and propose a change that keeps them in sync,” you generally want Claude answering it. That's also where the premium rate is worth paying.

The mixing pattern that works in practice: pin the session to a free model by default, and bump to Claude for the turns you can tell ahead of time will need it. Claude Code's in-session /model command makes this a one-line switch.

Pinning a free model

If you want a session that won't bill against premium accidentally — for instance, a code review pass on a public repository, a docs-rewrite session, or an exploratory question loop — pin the session to a free OSS model on launch.

Pin and switch

# Pin a free OSS model so the session never bills against premium.
# Useful for code review on a public repo, commit-message generation,
# or exploratory questions where Claude-quality isn't required.

claude --model meta/llama-3.1-405b-instruct

# Switch to Claude only for a specific task, then back.
# In a Claude Code session: /model claude-sonnet-4-20250514
# When done: /model meta/llama-3.1-405b-instruct

Free OSS models available

InferAll's free tier covers 186 open-source models hosted on NVIDIA NIM. The lineup that comes up most often for Claude Code workflows:

  • Llama 3.1 405B Instruct — strongest free general-purpose model on the roster for code-adjacent tasks.
  • Mixtral 8x22B Instruct — fast, good at structured outputs and tool calls.
  • Nemotron — NVIDIA's instruction-tuned line, well-suited to reasoning chains.
  • CodeLlama — a code-specialist family if your turns are code-completion-shaped.

The full current list lives in the docs. The roster grows as NIM adds models; the names above are the dependable starting points for Claude Code substitution.

Pricing math (illustrative)

The shape of the savings depends on your mix of cheap vs complex turns. We won't invent specific dollar figures because they shift with provider pricing and your own usage pattern. The structural argument:

If 60% of your Claude Code tokens are cheap turns that an OSS model handles well, route those turns to the free tier (up to 100,000 tokens/month) and to NIM-rate OSS tokens past that, while keeping the complex 40% on Claude. That cuts your effective per-token cost by a meaningful share. The exact share depends on the OSS-vs-Claude rate gap and on how often you stay under the free allowance.

The math below is a worked example. Substitute your own token volumes and your own current Anthropic billing rate to get a number that matches your situation.

Worked example

# Illustrative pricing math — not a guarantee, not a promise.
# Replace these with your own numbers from /docs and /#pricing.

# Hypothetical month of Claude Code usage:
#   Total prompt + completion tokens: 5,000,000
#   Mix:  60% cheap turns (file reads, summaries, classification)
#         40% complex turns (multi-file edits, planning, reasoning)

# Option A: every turn routes to Claude.
#   5,000,000 tokens at Anthropic's published rate         = $X total
#   (where $X depends on the Claude model and the
#   per-million-token rate at the time you do this math.)

# Option B: through InferAll, cheap turns route to free OSS.
#   3,000,000 tokens to NIM-hosted OSS (within / past 100k free):
#     first 100k: free
#     remainder:  billed at NIM rate (typically below Claude)
#   2,000,000 tokens to Claude at the published Anthropic rate.
#
# Net: Option B is meaningfully cheaper than Option A for
# workloads that fit the 60/40 cheap/complex shape. Run the numbers
# against your actual mix before you assume the magnitude.
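The worked example above can be parameterized as a small script. The rates below are invented placeholders, not InferAll or Anthropic pricing; substitute current numbers from /#pricing and the upstream docs before trusting the output.

```python
# Illustrative blended-cost calculator for the worked example above.
# All per-million-token rates are placeholder assumptions.

def blended_cost(total_tokens: int, cheap_share: float,
                 claude_rate: float, oss_rate: float,
                 free_allowance: int = 100_000) -> float:
    """Dollar cost when cheap turns route to OSS and complex turns to Claude.

    Rates are dollars per million tokens; the first `free_allowance`
    OSS tokens are free, matching the free-tier behavior described above.
    """
    cheap = total_tokens * cheap_share
    complex_tokens = total_tokens - cheap
    billable_oss = max(cheap - free_allowance, 0)
    return (billable_oss * oss_rate + complex_tokens * claude_rate) / 1_000_000

# Hypothetical month: 5M tokens, 60/40 cheap/complex split.
# $15/M for Claude and $1/M for OSS are placeholder rates, not real pricing.
all_claude = blended_cost(5_000_000, 0.0, claude_rate=15.0, oss_rate=1.0)
mixed = blended_cost(5_000_000, 0.6, claude_rate=15.0, oss_rate=1.0)
print(f"Option A (all Claude): ${all_claude:.2f}")
print(f"Option B (mixed):      ${mixed:.2f}")
```

With these placeholder rates the mixed route costs less than half of the all-Claude route; the real ratio depends entirely on the rates you plug in.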

For current InferAll plan pricing — what Pro and Team tiers include beyond the free allowance — see /#pricing. For the underlying provider rates we forward at zero markup, the upstream documentation is the source of truth.

Common workflows

Code review

Free OSS first; Claude only for files where the OSS review came back vague. The cheap turns — “is this file consistent with the rest of the module” — are what OSS handles well. The expensive turns — “does this change break invariants in three other files” — are what Claude is worth paying for.

Commit messages

Free OSS is enough. A summary of a diff is the kind of turn where Llama 3.1 405B and Claude produce essentially interchangeable output. Pin a free model for the commit-message generation step and don't spend premium tokens here.
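This step is scriptable. The sketch below is a hypothetical helper, not a documented InferAll feature: it pipes the staged diff into Claude Code's non-interactive print mode (`claude -p`) with a free model pinned. It assumes `claude` is on your PATH and the two environment variables from Setup are exported.

```python
# Hypothetical helper: generate a commit message from the staged diff
# with a free OSS model, so no premium tokens are spent on this turn.
import subprocess

def commit_message_cmd(model: str = "meta/llama-3.1-405b-instruct") -> list[str]:
    # `claude -p` runs a single prompt non-interactively and prints the result;
    # `--model` pins the free OSS model for this one call.
    return ["claude", "-p",
            "Write a one-line commit message for this diff.",
            "--model", model]

def generate_commit_message() -> str:
    diff = subprocess.run(["git", "diff", "--staged"],
                          capture_output=True, text=True, check=True).stdout
    result = subprocess.run(commit_message_cmd(), input=diff,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```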

Refactoring

Claude. Multi-file consistency is where premium models earn their cost. Free OSS can do single-function renames; multi-file refactors with type changes or contract changes are Claude territory.

Exploratory coding

Mix. Pin a free model for the question loop (“what does this library do, what's the pattern here, give me a starter example”) and bump to Claude when you're ready to write the production version. Most of the tokens land on the cheap side; the expensive ones are short and worth their cost.

Long-running agent loops

Be careful. Agent loops can drain the 100,000-token free allowance quickly because they call the model repeatedly on every step. Either cap the loop turns, route the inner-loop calls explicitly to OSS, or watch the dashboard. The cost story for long-running agents is fundamentally different from that of an interactive Claude Code session.
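A client-side budget guard is one way to cap the loop. This is a sketch under assumptions: the per-step token counts come from the response's usage fields (Anthropic Messages API responses report input and output token counts, which the gateway's Anthropic-format endpoint would mirror), and the commented-out client call is hypothetical.

```python
# Client-side guard for long-running agent loops: stop (or switch models)
# before the monthly free allowance is exhausted. The 100k figure is the
# free-tier allowance described above.

class TokenBudget:
    def __init__(self, allowance: int = 100_000, reserve: int = 5_000):
        self.allowance = allowance   # monthly free-tier tokens
        self.reserve = reserve       # stop early, leave some headroom
        self.used = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        # Both prompt and completion tokens count against the allowance.
        self.used += input_tokens + output_tokens

    def exhausted(self) -> bool:
        return self.used >= self.allowance - self.reserve

budget = TokenBudget()
for step in range(1_000):                 # the agent loop
    # response = client.messages.create(...)            # hypothetical call
    # budget.record(response.usage.input_tokens,
    #               response.usage.output_tokens)
    budget.record(1_200, 800)             # simulated per-step usage
    if budget.exhausted():
        print(f"Budget reached after {step + 1} steps; pin OSS or stop.")
        break
```

At a simulated 2,000 tokens per step, the guard trips well before the allowance runs dry, leaving room to finish the turn on OSS or halt cleanly.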

Why we built this

We built InferAll because we wanted exactly this workflow for ourselves — Claude Code routed through a gateway with a free open-source allowance, premium models on tap when we needed them, one base URL and one bill. The market had either OSS gateways we'd have to operate, observability platforms that didn't solve the routing problem, or single-provider keys that locked us into one upstream. So this page is biased: we're describing the workflow we're selling. The technical claims should still hold up if you check them, and the corrections email at the bottom is real. If we got something wrong, tell us.

Frequently asked questions

Will my code be sent to NVIDIA when I use the free OSS tier?

If you target an NVIDIA NIM-hosted model on the free tier, the request body — including any code in the prompt — is forwarded to NVIDIA's NIM inference endpoint for that model. InferAll is the gateway in between; we route the call and meter your usage, but the inference itself happens at NIM. If you don't want code to leave a particular trust boundary, route those calls to a model whose hosting you've already approved (Anthropic, OpenAI, Google) and use the free tier for code that's already in a public repository or for content you don't consider sensitive.

Can I disable fallback so Claude Code never uses Claude?

Yes. The model you select in your Claude Code configuration is the model that runs the call. If you pin an NVIDIA NIM model — for example, meta/llama-3.1-405b-instruct — Claude Code will use that model. InferAll's cross-provider failover triggers on upstream failures (429, 529, 5xx, timeout); it will route to a peer provider that can serve the same model class, not 'upgrade' you to a paid model behind your back. If you want strict pinning to a single upstream, set it in your client and the gateway respects it.
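The failover policy described above can be sketched as a few lines of routing logic. The provider names and the send() stub are hypothetical; the point is only that retry happens within the same model class, never across it.

```python
# Sketch of the failover policy: retry on upstream failure statuses against
# a peer provider serving the same model, never a different (paid) model.

FAILOVER_STATUSES = {429, 500, 502, 503, 529}

def route_with_failover(model: str, providers: list[str], send) -> tuple[str, int]:
    """Try each provider that can serve `model`; fail over on listed statuses."""
    last_status = 0
    for provider in providers:
        status = send(provider, model)     # returns an HTTP status code
        if status not in FAILOVER_STATUSES:
            return provider, status        # success, or a non-retryable error
        last_status = status
    return providers[-1], last_status      # every peer failed

# Simulated upstreams: the first is rate-limited (429), the peer succeeds.
responses = {"nim-us-east": 429, "nim-us-west": 200}
provider, status = route_with_failover(
    "meta/llama-3.1-405b-instruct",
    ["nim-us-east", "nim-us-west"],
    lambda p, m: responses[p],
)
print(provider, status)   # the peer that actually served the call
```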

Which Claude model gets used when I 'upgrade' a turn?

Whatever model ID you put in the request. Claude Code lets you change the model at runtime; you can run a session on a free OSS model and switch to claude-sonnet-4-20250514 or claude-opus-4-20250514 for a specific task. InferAll forwards the request to the named upstream and bills accordingly — premium Claude tokens at Anthropic's published rate, zero markup.

What counts toward the 100k free token allowance?

Tokens routed to any of the 186 NVIDIA NIM-hosted models on the free tier — both prompt and completion tokens — count against your monthly free allowance. Calls to Anthropic, OpenAI, Google, Replicate, and Runway are billed at the upstream's published rate and do not draw from the free pool. The allowance resets monthly. There's no credit card required to use it.

Does Claude Code know it's hitting NVIDIA instead of Anthropic?

Claude Code knows it's speaking to whatever ANTHROPIC_BASE_URL points at; it doesn't see the upstream provider behind that endpoint. InferAll receives the Anthropic-format request, looks at the model name, and forwards to the matching upstream — Anthropic for Claude models, NVIDIA NIM for the free OSS catalog. The client doesn't change. From your perspective, you're configuring Claude Code as usual; the routing is the gateway's job.
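In pseudocode, the routing step reduces to a prefix lookup on the model ID. The table below is illustrative, not InferAll's actual routing configuration.

```python
# Sketch of model-name routing: the gateway inspects the model ID in the
# Anthropic-format request and picks the matching upstream. Prefix table
# is an assumption for illustration only.

UPSTREAMS = {
    "claude-": "anthropic",
    "meta/": "nvidia-nim",
    "mistralai/": "nvidia-nim",
    "gpt-": "openai",
}

def upstream_for(model: str) -> str:
    for prefix, upstream in UPSTREAMS.items():
        if model.startswith(prefix):
            return upstream
    raise ValueError(f"unknown model: {model}")

print(upstream_for("claude-sonnet-4-20250514"))       # anthropic
print(upstream_for("meta/llama-3.1-405b-instruct"))   # nvidia-nim
```

The client never sees this table; it just sends an Anthropic-format request with a model name, which is why no adapter or proxy is needed on your side.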

What about tool calls and file edits — do free OSS models handle them?

The strongest free models on the NVIDIA NIM roster — Llama 3.1 405B, Mixtral, Nemotron — handle tool calls. Quality on multi-step file-editing workflows is generally below Claude's for the same prompt; that's the trade-off you're making by routing cheap turns to OSS. For pure read-and-summarize or quick-question turns the gap is small; for complex multi-file refactors it's wider. The honest answer is to try it on your codebase and route by task class.

Will my free allowance regenerate if I'm only doing cheap turns?

The 100,000-token allowance resets monthly, not continuously. Once you've used it within a month, additional calls to NIM-hosted models are billed at the published rate. Most developers doing typical Claude Code sessions stay inside the allowance for ordinary days of work; heavy automated agent loops can exhaust it. Watch your dashboard if you're running anything long-running.

Related

InferAll home — the gateway, the free tier, the failover story.

InferAll for VS Code — Cline-based agent with the gateway pre-wired, free first run.

Security — data handling, trust boundaries, what leaves your machine.

Rate limits — free-tier caps, per-key behavior, what happens on 429.

Pricing — free tier, Pro, Team, Enterprise.

AI inference API — endpoint surface, supported providers, code examples.

Unified AI API — one key, one bill, every provider.

Last updated: 2026-05-14.

InferAll facts on this page are drawn from this site and the gateway running at api.inferall.ai. Model availability on the NVIDIA NIM-hosted free tier may shift as upstream hosts change their roster. Specifics change. Have a correction? Email contact@kindly.fyi.