Use case

A free LLM API for AI agents. Pay only for the hard steps.

Agents are token-hungry in a way chatbots aren't. One user instruction fans out into many model calls — planning, tool selection, reading tool output, reflection loops, retries. The cost compounds fast, and most of those calls are mechanical enough that a capable open-source model handles them fine. InferAll gives your agent framework one OpenAI-compatible base URL with a free OSS allowance behind it, so the high-volume steps run at $0 and you pin a frontier model only for the planning and reasoning that actually needs one.

Why agents make this a cost problem

A chat app is roughly one model call per message. An agent is not: a single “handle this for me” can become a plan step, a tool-selection step, a step that reads and interprets the tool's output, a reflection step that decides whether the result is good enough, a retry when it isn't, and a final synthesis step. That's many round-trips per task, and every one of them spends prompt and completion tokens.

During development and testing the multiplier is worse, because you re-run the same loops repeatedly while you tune prompts and tools. The tokens you spend proving the loop works dwarf the tokens any single production run costs. This is precisely where a free allowance earns its keep — the iteration phase is high-volume and low-stakes, the two properties that make a free OSS model the right call.

InferAll catalogs 118+ open-source models hosted on NVIDIA NIM, all billed at $0 input/$0 output. Activation is the $5 starter pack: that $5 becomes your spendable balance and unlocks paid providers when you want them. Calls to the open NIM models don't draw from that balance, so on the $0 lane the limit you run into is your tier's daily request quota, not the $5.

Setup

Agent frameworks — LangChain, LlamaIndex, CrewAI, and AutoGPT-style loops — are almost all built on the OpenAI client surface. So the integration is the same three changes everywhere: set the OpenAI base_url to https://api.inferall.ai/v1, use your InferAll key (prefix ifu_) as the API key, and set the model to a free OSS ID. No proxy, no adapter, no translation layer.

Sign in at /keys to create a key, then add a card at /billing to activate it with the $5 starter pack — that becomes your spendable balance. The gateway also speaks the Anthropic format at /v1/messages and exposes a native endpoint at /ai/v1/generate, but for agent frameworks the OpenAI-compatible /v1 base URL is the one you want.
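
The three settings can be collected in one place so every framework in a project reads the same values. The following is a minimal sketch, not an official helper: the environment variable name INFERALL_API_KEY and the function gateway_settings are assumptions made for this example, while the base URL and the ifu_ prefix come from the setup text above.

```python
import os

INFERALL_BASE_URL = "https://api.inferall.ai/v1"  # OpenAI-compatible base URL from the setup text


def gateway_settings(model: str, env_var: str = "INFERALL_API_KEY") -> dict:
    """Return the three settings an OpenAI-style client needs.

    `env_var` is a hypothetical variable name chosen for this sketch.
    """
    key = os.environ.get(env_var, "")
    # InferAll keys carry the ifu_ prefix; catch a pasted OpenAI/Anthropic key early.
    if not key.startswith("ifu_"):
        raise ValueError(f"{env_var} should hold an InferAll key starting with 'ifu_'")
    return {"base_url": INFERALL_BASE_URL, "api_key": key, "model": model}
```

With LangChain this would be used as ChatOpenAI(**gateway_settings("meta/llama-3.1-8b-instruct")). With the OpenAI SDK, pass base_url and api_key to OpenAI(...) and the model to each chat.completions.create call.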

OpenAI SDK — agent loop on a free model

# Drive an agent loop on a FREE open-source model via the
# OpenAI-compatible endpoint. Only three things change vs. calling
# OpenAI directly: base_url, the ifu_ key, and the model ID.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferall.ai/v1",   # OpenAI-compatible gateway
    api_key="ifu_...",                       # your InferAll key
)

# A free OSS model handles the high-volume inner-loop steps.
resp = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",     # $0 within the free allowance
    messages=[
        {"role": "system", "content": "You are a tool-using agent."},
        {"role": "user", "content": "List the open issues, then summarize them."},
    ],
    tools=[                                   # function-calling, translated
        {                                     # to the OSS provider serving the call
            "type": "function",
            "function": {
                "name": "list_issues",
                "description": "List open issues in the repo.",
                "parameters": {"type": "object", "properties": {}},
            },
        }
    ],
)

print(resp.choices[0].message)

# For the HARD planning/reasoning step, swap the model for one call:
#   model="claude-sonnet-4-6"   (or "gpt-4o") — billed at the upstream
#   rate, then drop back to the free OSS model for the rest of the loop.

Free for the easy steps, paid for the hard ones

The core move is to route by step class, not to pick one model for the whole agent. An open-source model and a frontier model have different sweet spots, and an agent conveniently separates into steps that map onto each.

The high-volume steps (classify an input, summarize a tool result, decide whether to keep looping, format the final answer, route to the right sub-task) run well on a free OSS model. They're the calls your agent makes most often, so they're where the free allowance does the most work. Pin a model like meta/llama-3.1-8b-instruct for the fast, frequent ones and meta/llama-3.1-70b-instruct when a step needs more capability but still isn't the hard part.

The hard steps — decompose an ambiguous goal into an ordered plan, reason over a large heterogeneous context, recover from a failed multi-step sequence — are where you want a frontier model. Set the model to claude-sonnet-4-6 or gpt-4o for just that call, then drop back to the open NIM model for the rest of the loop. Because the hard steps are the rare ones, you pay frontier rates on a small slice of traffic while the $0-rate open lane absorbs the bulk.

For the heavier planning calls you can also reach for claude-opus-4-8 or a long-context Gemini model like gemini-2.5-pro (or gemini-2.5-flash when you want a cheaper frontier option). All of them are the same one-line model change through the same key and the same base URL.
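
Routing by step class can be as small as a lookup table. This is one possible sketch: the step labels and the choice of which free model serves which step are assumptions for illustration, and only the model IDs come from this page.

```python
FREE_FAST = "meta/llama-3.1-8b-instruct"
FREE_STRONG = "meta/llama-3.1-70b-instruct"
FRONTIER = "claude-sonnet-4-6"

# Step labels are hypothetical; rename them to match your own agent loop.
MODEL_BY_STEP = {
    "classify": FREE_FAST,
    "loop_control": FREE_FAST,
    "format_answer": FREE_FAST,
    "route": FREE_FAST,
    "summarize_tool_output": FREE_STRONG,
    "plan": FRONTIER,
    "reason_large_context": FRONTIER,
    "recover_failed_sequence": FRONTIER,
}


def model_for(step: str) -> str:
    """Return the model ID to send for a given step class.

    Unknown steps fall back to the free fast model, so a newly added
    step never lands on a paid tier by accident.
    """
    return MODEL_BY_STEP.get(step, FREE_FAST)
```

Defaulting unknown steps to the free tier is a deliberate choice here: moving a step to a frontier model then requires an explicit entry in the table.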

LangChain — two tiers, one gateway

# LangChain: same gateway, one ChatOpenAI per "tier".
# Use the free model for the bulk of the loop and a frontier model
# only for the step that genuinely needs it.

from langchain_openai import ChatOpenAI

# Cheap tier — the high-volume default (free within the allowance).
cheap = ChatOpenAI(
    base_url="https://api.inferall.ai/v1",
    api_key="ifu_...",
    model="meta/llama-3.1-8b-instruct",   # fast, $0 within allowance
)

# Frontier tier — pinned for the hard planning / reasoning call only.
planner = ChatOpenAI(
    base_url="https://api.inferall.ai/v1",
    api_key="ifu_...",
    model="claude-sonnet-4-6",            # billed at upstream rate
)

# Route by step. Most calls hit `cheap`; `planner` is the rare one.
plan = planner.invoke("Decompose this goal into ordered steps: ...")
step = cheap.invoke(f"Execute step 1 of the plan: {plan.content}")
print(step.content)

Tool calling on free OSS models

The OpenAI-compatible endpoint accepts the standard tools / function-calling parameters and translates them to whatever provider serves the call. The Llama and Mixtral NIM models support tool calls, so your agent's tool schema reaches an open NIM model unchanged — you don't rewrite anything to move a tool-using loop onto the $0-rate lane.

The honest limitation: smaller open-source models are less reliable than frontier models at complex multi-tool orchestration. Give a model a handful of well-described tools and a simple loop and the OSS models do fine. Give it a large tool catalog, deeply nested arguments, and a long chain where one wrong tool choice cascades, and the gap widens — frontier models hold that together more consistently. The practical rule is the same as the cost rule: free OSS for the simple, high-volume tool steps; a pinned frontier model for the orchestration-heavy planning that decides which tools to call and in what order.
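
One way to apply that rule mechanically is to look at the tool catalog before choosing a model. The sketch below is a rough heuristic, not a measured boundary: the thresholds (five tools, two levels of nesting) are assumptions picked for illustration and should be tuned against your own failure rates.

```python
from __future__ import annotations


def nesting_depth(schema: dict) -> int:
    """Count nested object/array levels in a JSON Schema fragment."""
    children = list(schema.get("properties", {}).values())
    if isinstance(schema.get("items"), dict):
        children.append(schema["items"])
    nested = [nesting_depth(c) for c in children if isinstance(c, dict)]
    is_container = schema.get("type") in ("object", "array")
    return (1 if is_container else 0) + max(nested, default=0)


def needs_frontier(tools: list[dict], max_tools: int = 5, max_depth: int = 2) -> bool:
    """Heuristic: large catalogs or deeply nested arguments go to a frontier model.

    `max_tools` and `max_depth` are hypothetical thresholds.
    """
    if len(tools) > max_tools:
        return True
    return any(
        nesting_depth(t["function"].get("parameters", {})) > max_depth
        for t in tools
    )
```

The list_issues tool from the earlier example has an empty parameter object (depth 1), so a catalog containing only that tool stays on the free model.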

Open OSS models for agents

InferAll catalogs 118+ open-source models hosted on NVIDIA NIM — all billed at $0 input/$0 output. The ones that come up most often for agent workloads:

meta/llama-3.1-70b-instruct — strongest free general-purpose model on the roster for tool-use and reasoning-adjacent steps.
meta/llama-3.1-8b-instruct — fast and cheap to serve; a good default for the high-frequency classification and loop-control calls.
mistralai/mixtral-8x7b-instruct-v0.1 — solid at structured output and straightforward tool calls.
nvidia/nemotron-3-super-120b-a12b — a larger NVIDIA-tuned option when a free step wants more headroom.

The roster shifts as NIM adds and retires models, so don't hardcode a list from memory. Fetch GET https://api.inferall.ai/ai/v1/models for the live catalog and which IDs are currently $0, or browse the docs.
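
Rather than hardcoding one ID, an agent can keep an ordered preference list and check it against the live catalog at startup. This sketch assumes you have already fetched the catalog and extracted the set of currently available model IDs; the response format of the models endpoint is not described on this page, so that parsing step is left out on purpose.

```python
from __future__ import annotations


def first_available(preferred: list[str], live_ids: set[str]) -> str:
    """Return the first preferred model ID that is still in the live catalog.

    `live_ids` is whatever set of IDs you extracted from the catalog response.
    """
    for model_id in preferred:
        if model_id in live_ids:
            return model_id
    raise LookupError("none of the preferred models is in the live catalog")
```

For example, first_available(["meta/llama-3.1-70b-instruct", "mistralai/mixtral-8x7b-instruct-v0.1"], live_ids) falls through to the Mixtral model if the 70B model has been retired.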

Failover, without surprise upgrades

Long-running agents care about reliability: a loop that dies because one upstream returned a transient error is a loop that wasted every token it spent up to that point. The gateway fails over across providers on recoverable upstream failures — 429, 529, 5xx, or timeout — routing to a peer that can serve the same model.

Crucially, failover does not silently upgrade you to a paid model. The model ID you send is the model that runs; if you pin a free OSS model, failover finds another host for that same model rather than swapping in a frontier tier and billing you for it. Cost surprises in an agent come from your own call volume — an unbounded loop calling the model on every step — not from the gateway re-routing you upward. If you're running anything long-lived, cap the loop turns and watch the dashboard.
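
The behavior described above can be pictured with a small model. This is an illustration of the rule (retry on another host, never change the model), not InferAll's actual implementation: the host names, the send callable, and the use of None to represent a timeout are all assumptions of the sketch.

```python
from typing import Callable, Optional, Sequence, Tuple


def is_recoverable(status: Optional[int]) -> bool:
    """True for 429, any 5xx (which includes 529), or a timeout (modeled as None)."""
    return status is None or status == 429 or 500 <= status <= 599


def call_with_failover(
    model: str,
    hosts: Sequence[str],
    send: Callable[[str, str], Tuple[Optional[int], object]],
) -> Tuple[Optional[int], object]:
    """Try each host in order for the SAME model; the model ID is never changed."""
    last_status = None
    for host in hosts:
        status, body = send(host, model)
        if not is_recoverable(status):
            return status, body      # success, or an error that retrying won't fix
        last_status = status         # recoverable: move on to the next host
    raise RuntimeError(f"all hosts failed for {model}; last status: {last_status}")
```

Note that the model argument is passed through untouched on every attempt, which is the property that keeps a pinned free model from turning into a paid one.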

Pricing math (illustrative)

We won't invent dollar figures — they shift with upstream pricing and with how your agent fans out. The structural argument is what matters: an agent splits into many cheap calls and a few expensive ones, and you only pay frontier rates on the few.

If roughly ten of every twelve calls in a task are simple enough for an open NIM model, those ten run at $0 input/$0 output (subject to your tier's daily request quota), while the two hard calls go to a pinned frontier model at its published rate. The development and testing phase, where you re-run loops constantly, is where the $0-rate open lane does the most work. The exact savings depend on your call fan-out and how often you stay under the allowance, so run it against your own numbers.

Worked example

# Illustrative cost shape for an agent — not a guarantee, not a promise.
# Substitute your own per-step token counts and current upstream rates.

# Why agents are different: ONE user task fans out into many calls.
#   plan -> pick tool -> read tool output -> reflect -> retry -> answer
#   Say a single task averages ~12 model calls.

# A hypothetical day of development / testing:
#   200 agent tasks x ~12 calls each            = ~2,400 calls
#   Of those calls:
#     ~10 calls/task are simple   (classify, summarize, loop control)
#     ~2  calls/task are hard     (plan, reason over big context)

# Through InferAll:
#   Simple calls  -> open NIM (meta/llama-3.1-8b-instruct, 70b):
#     billed at $0 input / $0 output against your $5 starter balance
#     (subject to your tier's daily request count quota)
#   Hard calls    -> pinned frontier (claude-sonnet-4-6 / gpt-4o):
#     billed at the upstream's published rate, zero markup.
#
# Net: you pay frontier rates on the ~2-of-12 calls that need a
# frontier model, and $0 (within allowance) on the ~10-of-12 that
# don't. The development/test phase — where you re-run loops
# constantly — is where the free allowance does the most work.
# Run the numbers against your own call fan-out before assuming
# the magnitude.
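
The same cost shape can be written as a small calculator so you can plug in your own numbers. This is a sketch under stated assumptions: the function name and its parameters are made up for this example, rates are expressed per million tokens, and any rate you pass in is your own input, not a price quoted by this page.

```python
def daily_cost_shape(tasks, simple_calls, hard_calls,
                     tokens_in, tokens_out, rate_in, rate_out):
    """Split a day's calls into $0 and paid, and price only the paid ones.

    tokens_in / tokens_out: average tokens per HARD call.
    rate_in / rate_out: frontier rate per million tokens (look up the
    current upstream rate yourself; nothing here is a real price).
    """
    calls_per_task = simple_calls + hard_calls
    paid_calls = tasks * hard_calls
    cost_per_paid_call = (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000
    return {
        "total_calls": tasks * calls_per_task,
        "paid_calls": paid_calls,
        "paid_share": hard_calls / calls_per_task,
        "frontier_cost": paid_calls * cost_per_paid_call,
    }
```

With the hypothetical day above (200 tasks, 10 simple and 2 hard calls each) this reports 2,400 total calls of which 400 are paid; the simple calls contribute nothing to frontier_cost as long as they stay within the daily request quota.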

For current InferAll plan pricing — what Pro and Team tiers include beyond the free allowance — see /#pricing. For the underlying provider rates we forward at zero markup, the upstream documentation is the source of truth.

Common agent workflows

Development and testing loops

Free OSS, almost entirely. While you're tuning prompts and tools you re-run the loop constantly, and that iteration is high-volume and low-stakes — the exact shape the free allowance is built for. Pin a free model for the whole loop during development and only switch pieces to a frontier model once you've confirmed the step genuinely needs it.

Tool-calling agents

Mix. Simple, well-described tools and short chains run on a free model. The orchestration step that picks tools from a large catalog and sequences them is where you pin claude-sonnet-4-6 or gpt-4o for that one call, then hand the chosen tool calls back to the free model to execute.

Reflection / retry loops

Free OSS for the “is this good enough?” check and the retry; it's a frequent, mechanical judgment. Reserve a frontier model for the case where repeated retries keep failing and you need stronger reasoning to figure out why — a rare call relative to the loop's volume.
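
A sketch of that escalation, assuming you supply two callables of your own: attempt(model) runs the step on the given model and is_good_enough(result) is the mechanical check. Both names, and the default of three free retries, are assumptions for illustration.

```python
def run_with_escalation(attempt, is_good_enough, free_model, frontier_model,
                        max_free_retries=3):
    """Retry on the free model; escalate once to the frontier model if every retry fails.

    Returns (model_used, result).
    """
    for _ in range(max_free_retries):
        result = attempt(free_model)
        if is_good_enough(result):
            return free_model, result
    # Repeated failures: spend one frontier call instead of looping further.
    return frontier_model, attempt(frontier_model)
```

Because the frontier call happens at most once per step, the paid share of the loop stays bounded no matter how often the free model misses.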

Planning and decomposition

Frontier. Turning an ambiguous goal into a correct, ordered plan is the call most worth paying for, and it's usually one call near the start of a task. Pinning claude-opus-4-8 or claude-sonnet-4-6 here and free OSS for execution is the canonical split.

Long-running / unbounded agents

Be careful. An agent that calls the model on every step without a turn cap can exhaust your tier's daily request quota quickly, and any steps pinned to a frontier model keep billing against your balance the whole time. Cap the loop, route the inner steps explicitly to free OSS, and watch the dashboard. The cost story for an unbounded agent is fundamentally different from an interactive session.
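
A turn cap is a few lines of code. In this sketch, step is a hypothetical callable you provide that runs one loop iteration and returns (done, output); the default of 20 turns is an arbitrary starting point, not a recommendation from InferAll.

```python
def run_agent(step, max_turns=20):
    """Run `step(turn)` until it reports done, or stop hard at `max_turns`.

    Returns (output, turns_used). Raises RuntimeError when the cap is hit,
    so a runaway loop fails loudly instead of spending quietly.
    """
    for turn in range(1, max_turns + 1):
        done, output = step(turn)
        if done:
            return output, turn
    raise RuntimeError(f"agent hit the {max_turns}-turn cap without finishing")
```

Raising on the cap, rather than returning a partial result, makes the failure visible in logs and tests; swap in a softer fallback if your agent can return partial work safely.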

Why we built this

We built InferAll because we wanted exactly this for our own agents — one OpenAI-compatible base URL, a free open-source allowance for the high-volume inner-loop calls, and frontier models on tap for the planning steps, all behind one key and one bill. The alternatives were OSS gateways we'd have to operate, observability tools that didn't solve routing, or single-provider keys that locked us into one upstream. So this page is biased: we're describing the workflow we're selling. The technical claims should still hold up if you check them, and the corrections email at the bottom is real. If we got something wrong, tell us.

Frequently asked questions

Why are $0-rate open models worth more for agents than for a chatbot?

An agent emits many model calls per user task — planning, tool selection, observing tool output, reflecting, retrying. A single 'do this for me' instruction can fan out into dozens of LLM round-trips. A chatbot is roughly one call per message. So even small per-token costs compound fast in an agent loop, which is exactly why routing the high-volume, low-stakes steps to a model billed at $0 matters: those steps are where the spend compounds. InferAll catalogs 118+ open-source models hosted on NVIDIA NIM — all billed at $0 input/$0 output against your $5 starter balance, with a daily request quota by tier (free: 50 chat/100 text per day; pro: 5,000 chat).

How do I point LangChain (or any agent framework) at the gateway?

Use the OpenAI-compatible endpoint. Set the OpenAI base_url to https://api.inferall.ai/v1, use your InferAll key (prefix ifu_) as the API key, and set the model to a free OSS ID such as meta/llama-3.1-70b-instruct. In LangChain that's ChatOpenAI(base_url=..., api_key=..., model=...); in the OpenAI SDK it's OpenAI(base_url=..., api_key=...). LlamaIndex, CrewAI, and most AutoGPT-style loops accept the same three settings because they're built on the OpenAI client surface. No proxy to run, no adapter to install.

Can free OSS models do tool calling / function calling?

Yes. The OpenAI-compatible endpoint accepts the standard tools / function-calling parameters and translates the request to whichever provider serves the call. The Llama and Mixtral NIM models support tool calls, so your agent's tool schema reaches them. The honest caveat: smaller open-source models are less reliable at complex multi-tool orchestration — long chains where the model has to pick the right tool from many options, thread arguments correctly, and recover from a bad call. They're solid on a handful of well-described tools and simple loops; they wobble on hard orchestration. Route the simple steps to OSS and pin a frontier model for the gnarly planning.

What's the 'free for the easy steps, paid for the hard ones' pattern?

Most of an agent's calls are mechanical: classify an input, summarize a tool result, decide whether to continue a loop, format an answer. Those run well on a free OSS model and burn the bulk of your tokens. A minority of calls are hard: decompose an ambiguous goal into a plan, reason over a large heterogeneous context, untangle a failed multi-step sequence. For those, set the model to a frontier ID — claude-sonnet-4-6 or gpt-4o — for just that call, then drop back to the free model. Because the hard calls are the rare ones, you pay frontier rates on a small slice of traffic while the free allowance absorbs the high-volume remainder.

Will the gateway silently upgrade my agent to a paid model?

No. The model ID in your request is the model that runs. InferAll's failover triggers only on recoverable upstream failures — 429, 529, 5xx, or timeout — and routes to a peer provider that can serve the same model, not to a more expensive tier. If you pin meta/llama-3.1-8b-instruct, your loop stays on that model; failover finds another host for it rather than swapping in claude-opus-4-8 behind your back. Cost surprises from a runaway loop come from your own call volume, not from the gateway re-routing you upward.

How does the pricing actually work?

Activation is a $5 starter pack that becomes spendable balance on your account. Calls to any of the 118+ NVIDIA NIM open-source models bill at $0 input/$0 output — so they don't draw from your balance. Calls you pin to a frontier model — Anthropic, OpenAI, Google — bill at the upstream's published per-token rate with zero markup, drawn from the same balance. On top of that, every tier has daily request count limits per operation (free: 50 chat/100 text per day; pro: 5,000 chat; team and enterprise scale up). Quotas reset daily UTC. Because agent loops call the model repeatedly, watch your dashboard for runaway loops that exhaust the daily quota.
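
For a loop that should stop before it runs into the daily quota, a local counter can mirror the limits quoted above. This is a client-side sketch only: the gateway enforces the real quota, the class name and its methods are assumptions for this example, and only the free and pro chat figures from this answer are included.

```python
from datetime import datetime, timezone

# Daily chat request limits as quoted in the answer above.
DAILY_CHAT_QUOTA = {"free": 50, "pro": 5000}


class DailyQuota:
    """Local mirror of a per-day request quota that resets on the UTC date change."""

    def __init__(self, tier, now=None):
        self.limit = DAILY_CHAT_QUOTA[tier]
        # `now` is injectable so the reset logic can be tested without waiting a day.
        self._now = now or (lambda: datetime.now(timezone.utc))
        self._day = self._now().date()
        self.used = 0

    def try_spend(self) -> bool:
        """Record one request if quota remains; return False when the day's limit is used up."""
        today = self._now().date()
        if today != self._day:          # new UTC day: counter starts over
            self._day, self.used = today, 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True
```

Call try_spend() before each model call and stop or pause the loop when it returns False; the server-side count remains the source of truth if several processes share one key.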

Which open NIM models should an agent use, and where's the live list?

Dependable starting points are meta/llama-3.1-70b-instruct (strongest general-purpose open model on the roster for tool-use and reasoning steps), meta/llama-3.1-8b-instruct (fast and cheap to serve, for high-frequency classification and routing decisions), mistralai/mixtral-8x7b-instruct-v0.1 (good at structured output), and nvidia/nemotron-3-super-120b-a12b (a larger NVIDIA-tuned option). All bill at $0 input/$0 output. The catalog shifts as NIM adds and retires models, so don't hardcode a roster from memory: fetch GET https://api.inferall.ai/ai/v1/models for the current list and which IDs are $0.

Last updated: 2026-05-30.

InferAll facts on this page are drawn from this site and the gateway running at api.inferall.ai. Model availability on NVIDIA NIM may shift as upstream hosts change their roster. Specifics change. Have a correction? Email contact@kindly.fyi.