Use case

A free LLM API for AI agents. Pay only for the hard steps.

Agents are token-hungry in a way chatbots aren't. One user instruction fans out into many model calls — planning, tool selection, reading tool output, reflection loops, retries. The cost compounds fast, and most of those calls are mechanical enough that a capable open-source model handles them fine. InferAll gives your agent framework one OpenAI-compatible base URL with a free OSS allowance behind it, so the high-volume steps run at $0 and you pin a frontier model only for the planning and reasoning that actually needs one.

Why agents make this a cost problem

A chat app is roughly one model call per message. An agent fans out: a single “handle this for me” can become a plan step, a tool-selection step, a step that reads and interprets the tool's output, a reflection step that decides whether the result is good enough, a retry when it isn't, and a final synthesis step. That's six or more round-trips per task, and every one of them spends prompt and completion tokens.

During development and testing the multiplier is worse, because you re-run the same loops repeatedly while you tune prompts and tools. The tokens you spend proving the loop works dwarf the tokens any single production run costs. This is precisely where a free allowance earns its keep — the iteration phase is high-volume and low-stakes, the two properties that make a free OSS model the right call.

InferAll's free tier is 100,000 tokens/month across 110+ open-source models hosted on NVIDIA NIM, billed at $0 within the allowance. No card is required to start; a card is needed only for paid providers or usage past the free trial, not for the free allowance itself.
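To make the fan-out concrete, here is a minimal budgeting sketch. The 100,000-token figure comes from this page; the per-task call count and per-call token count are made-up placeholders, and the function name is hypothetical. Substitute your own measurements.

```python
# Rough budget arithmetic for the free allowance.
# Only calls routed to free OSS models draw from it.
FREE_ALLOWANCE_TOKENS = 100_000  # monthly allowance stated on this page


def tasks_within_allowance(free_calls_per_task: int,
                           tokens_per_call: int,
                           allowance: int = FREE_ALLOWANCE_TOKENS) -> int:
    """Return how many whole agent tasks fit inside the allowance."""
    if free_calls_per_task <= 0 or tokens_per_call <= 0:
        raise ValueError("call and token counts must be positive")
    tokens_per_task = free_calls_per_task * tokens_per_call
    return allowance // tokens_per_task


# Placeholder numbers: 10 free-routed calls per task, 500 tokens each,
# so 5,000 tokens per task.
print(tasks_within_allowance(free_calls_per_task=10, tokens_per_call=500))  # 20
```

With those placeholder numbers the allowance covers about 20 full tasks, which is why the per-call token count is worth measuring before you plan a test run.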

Setup

Agent frameworks — LangChain, LlamaIndex, CrewAI, and AutoGPT-style loops — are almost all built on the OpenAI client surface. So the integration is the same three changes everywhere: set the OpenAI base_url to https://api.inferall.ai/v1, use your InferAll key (prefix ifu_) as the API key, and set the model to a free OSS ID. No proxy, no adapter, no translation layer.

Sign in at /keys to create a key — no card needed to start. Add a card at /billing only for paid providers or beyond the free trial. The gateway also speaks the Anthropic format at /v1/messages and exposes a native endpoint at /ai/v1/generate, but for agent frameworks the OpenAI-compatible /v1 base URL is the one you want.

OpenAI SDK — agent loop on a free model

# Drive an agent loop on a FREE open-source model via the
# OpenAI-compatible endpoint. Only three things change vs. calling
# OpenAI directly: base_url, the ifu_ key, and the model ID.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferall.ai/v1",   # OpenAI-compatible gateway
    api_key="ifu_...",                       # your InferAll key
)

# A free OSS model handles the high-volume inner-loop steps.
resp = client.chat.completions.create(
    model="meta/llama-3.1-70b-instruct",     # $0 within the free allowance
    messages=[
        {"role": "system", "content": "You are a tool-using agent."},
        {"role": "user", "content": "List the open issues, then summarize them."},
    ],
    # Function-calling schema; the gateway translates it for the
    # OSS provider serving the call.
    tools=[
        {
            "type": "function",
            "function": {
                "name": "list_issues",
                "description": "List open issues in the repo.",
                "parameters": {"type": "object", "properties": {}},
            },
        }
    ],
)

print(resp.choices[0].message)

# For the HARD planning/reasoning step, swap the model for one call:
#   model="claude-sonnet-4-6"   (or "gpt-4o") — billed at the upstream
#   rate, then drop back to the free OSS model for the rest of the loop.
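The snippet above makes a single call. A full loop also has to run the requested tools and feed the results back. The following is a minimal sketch of that loop, assuming a hypothetical call_model function that wraps client.chat.completions.create and returns the assistant message as a plain dict in the OpenAI chat format. The fake_model stand-in is only there so the sketch runs without a network call.

```python
import json
from typing import Callable


def run_agent_loop(call_model: Callable[[list], dict],
                   tools: dict,
                   user_prompt: str,
                   max_turns: int = 8) -> str:
    """Loop until the model answers without tool calls, or max_turns is hit."""
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_turns):
        reply = call_model(messages)            # one model round-trip
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return reply.get("content") or ""   # final answer
        for call in tool_calls:
            fn = call["function"]
            args = json.loads(fn["arguments"] or "{}")
            result = tools[fn["name"]](**args)  # run the local tool
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(result),
            })
    raise RuntimeError(f"no final answer after {max_turns} turns")


def fake_model(messages: list) -> dict:
    """Stand-in for a real call: asks for the tool once, then answers."""
    if messages[-1]["role"] == "tool":
        return {"role": "assistant", "content": "2 open issues found."}
    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "list_issues", "arguments": "{}"},
        }],
    }


answer = run_agent_loop(fake_model, {"list_issues": lambda: ["#1", "#2"]},
                        "List the open issues, then summarize them.")
print(answer)  # 2 open issues found.
```

The max_turns cap is the important design choice: the loop raises instead of calling the model indefinitely when it never reaches a final answer.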

Free for the easy steps, paid for the hard ones

The core move is to route by step class, not to pick one model for the whole agent. An open-source model and a frontier model have different sweet spots, and an agent conveniently separates into steps that map onto each.

The high-volume steps (classify an input, summarize a tool result, decide whether to keep looping, format the final answer, route to the right sub-task) run well on a free OSS model. They're the calls your agent makes most often, so they're where the free allowance does the most work. Pin a model like meta/llama-3.1-8b-instruct for the fast, frequent ones and meta/llama-3.1-70b-instruct when a step needs more capability but still isn't the hard part.

The hard steps — decompose an ambiguous goal into an ordered plan, reason over a large heterogeneous context, recover from a failed multi-step sequence — are where you want a frontier model. Set the model to claude-sonnet-4-6 or gpt-4o for just that call, then drop back to the free model for the rest of the loop. Because the hard steps are the rare ones, you pay frontier rates on a small slice of traffic while the free tier absorbs the bulk.

For the heavier planning calls you can also reach for claude-opus-4-8 or a long-context Gemini model like gemini-2.5-pro (or gemini-2.5-flash when you want a cheaper frontier option). All of them are the same one-line model change through the same key and the same base URL.
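Routing by step class can be as small as a lookup table. This is a sketch, not a prescribed API: the step labels and the pick_model helper are hypothetical names, and which step goes to which tier is a judgment you would tune for your own agent. The model IDs are the ones named on this page.

```python
FREE_FAST = "meta/llama-3.1-8b-instruct"     # frequent, mechanical steps
FREE_STRONG = "meta/llama-3.1-70b-instruct"  # needs more capability, still free
FRONTIER = "claude-sonnet-4-6"               # billed at the upstream rate

# Hypothetical step labels mapped to a model tier.
MODEL_BY_STEP = {
    "classify": FREE_FAST,
    "loop_control": FREE_FAST,
    "format_answer": FREE_FAST,
    "route_subtask": FREE_FAST,
    "summarize_tool_output": FREE_STRONG,
    "plan": FRONTIER,
    "reason_large_context": FRONTIER,
    "recover_failed_sequence": FRONTIER,
}


def pick_model(step: str) -> str:
    """Return the model ID for a step label, failing loudly on unknown labels."""
    try:
        return MODEL_BY_STEP[step]
    except KeyError:
        raise ValueError(f"no model configured for step {step!r}") from None


print(pick_model("classify"))  # meta/llama-3.1-8b-instruct
print(pick_model("plan"))      # claude-sonnet-4-6
```

Raising on an unknown label, rather than defaulting to the frontier tier, keeps a typo in a step name from quietly turning into paid traffic.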

LangChain — two tiers, one gateway

# LangChain: same gateway, one ChatOpenAI per "tier".
# Use the free model for the bulk of the loop and a frontier model
# only for the step that genuinely needs it.

from langchain_openai import ChatOpenAI

# Cheap tier — the high-volume default (free within the allowance).
cheap = ChatOpenAI(
    base_url="https://api.inferall.ai/v1",
    api_key="ifu_...",
    model="meta/llama-3.1-8b-instruct",   # fast, $0 within allowance
)

# Frontier tier — pinned for the hard planning / reasoning call only.
planner = ChatOpenAI(
    base_url="https://api.inferall.ai/v1",
    api_key="ifu_...",
    model="claude-sonnet-4-6",            # billed at upstream rate
)

# Route by step. Most calls hit `cheap`; `planner` is the rare one.
plan = planner.invoke("Decompose this goal into ordered steps: ...")
step = cheap.invoke(f"Execute step 1 of the plan: {plan.content}")
print(step.content)

Tool calling on free OSS models

The OpenAI-compatible endpoint accepts the standard tools / function-calling parameters and translates them to whatever provider serves the call. The Llama and Mixtral NIM models support tool calls, so your agent's tool schema reaches a free model unchanged — you don't rewrite anything to move a tool-using loop onto the free tier.

The honest limitation: smaller open-source models are less reliable than frontier models at complex multi-tool orchestration. Give a model a handful of well-described tools and a simple loop and the OSS models do fine. Give it a large tool catalog, deeply nested arguments, and a long chain where one wrong tool choice cascades, and the gap widens — frontier models hold that together more consistently. The practical rule is the same as the cost rule: free OSS for the simple, high-volume tool steps; a pinned frontier model for the orchestration-heavy planning that decides which tools to call and in what order.
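Because smaller models are less reliable on tool calls, one cheap safeguard is to check each proposed call before executing it. The sketch below is an assumption about how you might do that on the client side. check_tool_call is a hypothetical helper, and it covers only tool name, JSON validity, and required or unexpected keys, not full JSON Schema validation.

```python
import json


def check_tool_call(name: str, raw_arguments: str,
                    tool_schemas: dict) -> tuple[bool, str]:
    """Return (ok, reason) for one tool call proposed by the model.

    tool_schemas maps a tool name to its "parameters" schema object.
    """
    schema = tool_schemas.get(name)
    if schema is None:
        return False, f"unknown tool: {name}"
    try:
        args = json.loads(raw_arguments or "{}")
    except json.JSONDecodeError as exc:
        return False, f"arguments are not valid JSON: {exc.msg}"
    if not isinstance(args, dict):
        return False, "arguments must be a JSON object"
    missing = [k for k in schema.get("required", []) if k not in args]
    if missing:
        return False, f"missing required arguments: {missing}"
    unexpected = [k for k in args if k not in schema.get("properties", {})]
    if unexpected:
        return False, f"unexpected arguments: {unexpected}"
    return True, "ok"


schemas = {"list_issues": {"type": "object", "properties": {}}}
print(check_tool_call("list_issues", "{}", schemas))    # (True, 'ok')
print(check_tool_call("delete_repo", "{}", schemas))    # (False, 'unknown tool: delete_repo')
```

A failed check is a natural trigger for a retry on the free model, or for handing that one step to the pinned frontier model.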

Free OSS models for agents

The free tier covers 110+ open-source models hosted on NVIDIA NIM. The ones that come up most often for agent workloads:

  • meta/llama-3.1-70b-instruct — strongest free general-purpose model on the roster for tool-use and reasoning-adjacent steps.
  • meta/llama-3.1-8b-instruct — fast and cheap to serve; a good default for the high-frequency classification and loop-control calls.
  • mistralai/mixtral-8x7b-instruct-v0.1 — solid at structured output and straightforward tool calls.
  • nvidia/nemotron-3-super-120b-a12b — a larger NVIDIA-tuned option when a free step wants more headroom.

The roster shifts as NIM adds and retires models, so don't hardcode a list from memory. Call GET https://api.inferall.ai/ai/v1/models for the live catalog, including which IDs are currently $0, or browse the docs.
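One way to act on that advice is a startup check that compares the model IDs your agent pins against the live catalog. The response schema of the models endpoint is not shown on this page, so this sketch deliberately leaves parsing out: it assumes you have already extracted the set of currently available IDs, and missing_models is a hypothetical helper.

```python
def missing_models(pinned: list[str], live_ids: set[str]) -> list[str]:
    """Return pinned model IDs that are absent from the live catalog."""
    return [model_id for model_id in pinned if model_id not in live_ids]


# live_ids would come from the catalog response; hardcoded here for illustration.
live_ids = {"meta/llama-3.1-8b-instruct", "meta/llama-3.1-70b-instruct"}
pinned = ["meta/llama-3.1-8b-instruct", "mistralai/mixtral-8x7b-instruct-v0.1"]

print(missing_models(pinned, live_ids))  # ['mistralai/mixtral-8x7b-instruct-v0.1']
```

Running the check when the agent starts surfaces a retired model as a clear configuration error instead of a failure halfway through a loop.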

Failover, without surprise upgrades

Long-running agents care about reliability: a loop that dies because one upstream returned a transient error is a loop that wasted every token it spent up to that point. The gateway fails over across providers on recoverable upstream failures — 429, 529, 5xx, or timeout — routing to a peer that can serve the same model.

Crucially, failover does not silently upgrade you to a paid model. The model ID you send is the model that runs; if you pin a free OSS model, failover finds another host for that same model rather than swapping in a frontier tier and billing you for it. Cost surprises in an agent come from your own call volume — an unbounded loop calling the model on every step — not from the gateway re-routing you upward. If you're running anything long-lived, cap the loop turns and watch the dashboard.
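The recoverable-failure rule described above can be mirrored on the client side, for example to decide whether your own loop should retry a step or abort. This is a sketch of the classification as stated on this page, not the gateway's implementation. is_recoverable is a hypothetical name, and passing None to represent a timeout is a convention chosen for this example.

```python
from typing import Optional


def is_recoverable(status: Optional[int]) -> bool:
    """Classify an upstream result as recoverable (429, 5xx incl. 529, timeout)."""
    if status is None:           # None stands for a timeout in this sketch
        return True
    return status == 429 or 500 <= status <= 599


for status in (429, 529, 503, None, 400, 401):
    print(status, is_recoverable(status))
```

Client errors such as 400 or 401 are treated as non-recoverable here, since retrying the same request would fail the same way.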

Pricing math (illustrative)

We won't invent dollar figures — they shift with upstream pricing and with how your agent fans out. The structural argument is what matters: an agent splits into many cheap calls and a few expensive ones, and you only pay frontier rates on the few.

If roughly ten of every twelve calls in a task are simple enough for a free OSS model, those ten run at $0 within the 100,000-token monthly allowance (and at the free tier's $0 OSS rate past it), while the two hard calls go to a pinned frontier model at its published rate. The development and testing phase, where you re-run loops constantly, is where the free allowance does the most work. The exact savings depend on your call fan-out and how often you stay under the allowance, so run it against your own numbers.

Worked example

# Illustrative cost shape for an agent — not a guarantee, not a promise.
# Substitute your own per-step token counts and current upstream rates.

# Why agents are different: ONE user task fans out into many calls.
#   plan -> pick tool -> read tool output -> reflect -> retry -> answer
#   Say a single task averages ~12 model calls.

# A hypothetical day of development / testing:
#   200 agent tasks x ~12 calls each            = ~2,400 calls
#   Of those calls:
#     ~10 calls/task are simple   (classify, summarize, loop control)
#     ~2  calls/task are hard     (plan, reason over big context)

# Through InferAll:
#   Simple calls  -> free OSS (meta/llama-3.1-8b-instruct, 70b):
#     first 100,000 tokens/month: free
#     remainder:                  $0-rate OSS on the free tier
#   Hard calls    -> pinned frontier (claude-sonnet-4-6 / gpt-4o):
#     billed at the upstream's published rate, zero markup.
#
# Net: you pay frontier rates on the ~2-of-12 calls that need a
# frontier model, and $0 (within allowance) on the ~10-of-12 that
# don't. The development/test phase — where you re-run loops
# constantly — is where the free allowance does the most work.
# Run the numbers against your own call fan-out before assuming
# the magnitude.
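The same shape can be turned into a small calculator so you can plug in your own numbers. The sketch below reuses the hypothetical 200 tasks and the 10/2 split from the comments above. split_calls and frontier_cost are illustrative names, and the rates are left as required arguments so that no dollar figure is implied. Take them from the upstream provider's published pricing.

```python
from dataclasses import dataclass


@dataclass
class CallSplit:
    total_calls: int
    free_calls: int
    frontier_calls: int
    frontier_share: float  # fraction of calls billed at frontier rates


def split_calls(tasks: int, simple_per_task: int, hard_per_task: int) -> CallSplit:
    """Split a workload into free-routed and frontier-routed calls."""
    free = tasks * simple_per_task
    frontier = tasks * hard_per_task
    total = free + frontier
    share = frontier / total if total else 0.0
    return CallSplit(total, free, frontier, share)


def frontier_cost(split: CallSplit, input_tokens: int, output_tokens: int,
                  input_rate_per_mtok: float, output_rate_per_mtok: float) -> float:
    """Cost of the frontier calls, given per-million-token rates you supply."""
    per_call = (input_tokens * input_rate_per_mtok
                + output_tokens * output_rate_per_mtok) / 1_000_000
    return split.frontier_calls * per_call


day = split_calls(tasks=200, simple_per_task=10, hard_per_task=2)
print(day.total_calls, day.free_calls, day.frontier_calls)  # 2400 2000 400
print(f"{day.frontier_share:.1%} of calls are billed")      # 16.7% of calls are billed
```

Only the frontier_calls figure feeds frontier_cost, which is the structural point: the paid portion scales with the hard calls, not with total call volume.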

For current InferAll plan pricing — what Pro and Team tiers include beyond the free allowance — see /#pricing. For the underlying provider rates we forward at zero markup, the upstream documentation is the source of truth.

Common agent workflows

Development and testing loops

Free OSS, almost entirely. While you're tuning prompts and tools you re-run the loop constantly, and that iteration is high-volume and low-stakes — the exact shape the free allowance is built for. Pin a free model for the whole loop during development and only switch pieces to a frontier model once you've confirmed the step genuinely needs it.

Tool-calling agents

Mix. Simple, well-described tools and short chains run on a free model. The orchestration step that picks tools from a large catalog and sequences them is where you pin claude-sonnet-4-6 or gpt-4o for that one call, then hand the chosen tool calls back to the free model to execute.

Reflection / retry loops

Free OSS for the “is this good enough?” check and the retry; it's a frequent, mechanical judgment. Reserve a frontier model for the case where repeated retries keep failing and you need stronger reasoning to figure out why — a rare call relative to the loop's volume.
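The escalation rule can be sketched as a function of the attempt number. The threshold of three free attempts is an arbitrary placeholder and model_for_attempt is a hypothetical helper. The model IDs are the ones used elsewhere on this page.

```python
def model_for_attempt(attempt: int,
                      escalate_after: int = 3,
                      free_model: str = "meta/llama-3.1-8b-instruct",
                      frontier_model: str = "claude-sonnet-4-6") -> str:
    """Pick the model for a 1-based retry attempt.

    The first `escalate_after` attempts stay on the free model; later
    attempts go to the frontier model.
    """
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    return free_model if attempt <= escalate_after else frontier_model


for attempt in range(1, 5):
    print(attempt, model_for_attempt(attempt))
```

With the default threshold, attempts 1 through 3 stay on the free model and attempt 4 is the first to reach the frontier model.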

Planning and decomposition

Frontier. Turning an ambiguous goal into a correct, ordered plan is the call most worth paying for, and it's usually one call near the start of a task. Pinning claude-opus-4-8 or claude-sonnet-4-6 here and using free OSS for execution is the canonical split.

Long-running / unbounded agents

Be careful. An agent that calls the model on every step without a turn cap can drain the 100,000-token allowance quickly and then keep billing. Cap the loop, route the inner steps explicitly to free OSS, and watch the dashboard. The cost story for an unbounded agent is fundamentally different from an interactive session.
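Alongside a turn cap, a token budget gives the loop a hard stop. This is a minimal sketch with hypothetical names (TokenBudget, BudgetExceeded). It assumes each response reports prompt and completion token counts in the OpenAI-style usage field, which you would pass to record after every call.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the loop has used up its token budget."""


class TokenBudget:
    def __init__(self, limit: int) -> None:
        if limit <= 0:
            raise ValueError("limit must be positive")
        self.limit = limit
        self.used = 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Add one call's token usage to the running total."""
        self.used += prompt_tokens + completion_tokens

    @property
    def remaining(self) -> int:
        return max(self.limit - self.used, 0)

    def ensure_available(self) -> None:
        """Call before each model request; raises once the budget is spent."""
        if self.used >= self.limit:
            raise BudgetExceeded(f"used {self.used} of {self.limit} tokens")


budget = TokenBudget(limit=1_000)
budget.record(prompt_tokens=600, completion_tokens=150)
print(budget.remaining)  # 250
budget.record(prompt_tokens=300, completion_tokens=100)
print(budget.remaining)  # 0
```

Calling ensure_available at the top of each turn stops the loop on the next iteration after the budget is crossed, so the overshoot is bounded by a single call.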

Why we built this

We built InferAll because we wanted exactly this for our own agents — one OpenAI-compatible base URL, a free open-source allowance for the high-volume inner-loop calls, and frontier models on tap for the planning steps, all behind one key and one bill. The alternatives were OSS gateways we'd have to operate, observability tools that didn't solve routing, or single-provider keys that locked us into one upstream. So this page is biased: we're describing the workflow we're selling. The technical claims should still hold up if you check them, and the corrections email at the bottom is real. If we got something wrong, tell us.

Frequently asked questions

Why is a free tier worth more for agents than for a chatbot?

An agent emits many model calls per user task — planning, tool selection, observing tool output, reflecting, retrying. A single 'do this for me' instruction can fan out into dozens of LLM round-trips. A chatbot is roughly one call per message. So the same free allowance stretches across far fewer user-visible actions in an agent, which is exactly why routing the high-volume, low-stakes steps to a free model matters: those steps are where the token count compounds. InferAll's free tier covers 100,000 tokens/month across 110+ open-source models hosted on NVIDIA NIM, billed at $0 within the allowance.

How do I point LangChain (or any agent framework) at the gateway?

Use the OpenAI-compatible endpoint. Set the OpenAI base_url to https://api.inferall.ai/v1, use your InferAll key (prefix ifu_) as the API key, and set the model to a free OSS ID such as meta/llama-3.1-70b-instruct. In LangChain that's ChatOpenAI(base_url=..., api_key=..., model=...); in the OpenAI SDK it's OpenAI(base_url=..., api_key=...). LlamaIndex, CrewAI, and most AutoGPT-style loops accept the same three settings because they're built on the OpenAI client surface. No proxy to run, no adapter to install.

Can free OSS models do tool calling / function calling?

Yes. The OpenAI-compatible endpoint accepts the standard tools / function-calling parameters and translates the request to whichever provider serves the call. The Llama and Mixtral NIM models support tool calls, so your agent's tool schema reaches them. The honest caveat: smaller open-source models are less reliable at complex multi-tool orchestration — long chains where the model has to pick the right tool from many options, thread arguments correctly, and recover from a bad call. They're solid on a handful of well-described tools and simple loops; they wobble on hard orchestration. Route the simple steps to OSS and pin a frontier model for the gnarly planning.

What's the 'free for the easy steps, paid for the hard ones' pattern?

Most of an agent's calls are mechanical: classify an input, summarize a tool result, decide whether to continue a loop, format an answer. Those run well on a free OSS model and burn the bulk of your tokens. A minority of calls are hard: decompose an ambiguous goal into a plan, reason over a large heterogeneous context, untangle a failed multi-step sequence. For those, set the model to a frontier ID — claude-sonnet-4-6 or gpt-4o — for just that call, then drop back to the free model. Because the hard calls are the rare ones, you pay frontier rates on a small slice of traffic while the free allowance absorbs the high-volume remainder.

Will the gateway silently upgrade my agent to a paid model?

No. The model ID in your request is the model that runs. InferAll's failover triggers only on recoverable upstream failures — 429, 529, 5xx, or timeout — and routes to a peer provider that can serve the same model, not to a more expensive tier. If you pin meta/llama-3.1-8b-instruct, your loop stays on that model; failover finds another host for it rather than swapping in claude-opus-4-8 behind your back. Cost surprises from a runaway loop come from your own call volume, not from the gateway re-routing you upward.

What counts against the 100k free token allowance?

Prompt and completion tokens routed to any of the 110+ NVIDIA NIM-hosted models on the free tier count toward the monthly allowance. Calls you pin to a frontier model — Anthropic, OpenAI, Google — are billed at the upstream's published rate and don't draw from the free pool. The allowance resets monthly, not continuously. No card is required to start — you're charged $0 within the free tier; a card is needed only for paid providers or to continue past the free trial. Because agent loops call the model repeatedly, watch the dashboard if you're running anything long-lived or unbounded.

Which free models should an agent use, and where's the live list?

Dependable starting points on the free tier are meta/llama-3.1-70b-instruct (strongest general-purpose free model on the roster for tool-use and reasoning steps), meta/llama-3.1-8b-instruct (fast and cheap to serve, for high-frequency classification and routing decisions), mistralai/mixtral-8x7b-instruct-v0.1 (good at structured output), and nvidia/nemotron-3-super-120b-a12b (a larger NVIDIA-tuned option). The catalog shifts as NIM adds and retires models, so don't hardcode a roster from memory. Call GET https://api.inferall.ai/ai/v1/models for the current list and which IDs are $0.

Related

Claude Code on free OSS models — the same route-by-task pattern, applied to Claude Code.

InferAll home — the gateway, the free tier, the failover story.

InferAll for VS Code — Cline-based agent with the gateway pre-wired, free first run.

Rate limits — free-tier caps, per-key behavior, what happens on 429.

Pricing — free tier, Pro, Team, Enterprise.

AI inference API — endpoint surface, supported providers, code examples.

Unified AI API — one key, one bill, every provider.

Last updated: 2026-05-30.

InferAll facts on this page are drawn from this site and the gateway running at api.inferall.ai. Model availability on the NVIDIA NIM-hosted free tier may shift as upstream hosts change their roster. Specifics change. Have a correction? Email contact@kindly.fyi.