The clearest fit is “we want what LiteLLM does but we don't want to run it.” If your team would spend a week standing up a LiteLLM proxy and another week wiring it into the rest of your platform (billing, alerting, capacity, on-call), and the result is a routing layer that doesn't differentiate your product, InferAll is the shortcut. Set two environment variables, your SDK code keeps working, and the gateway is somebody else's operational problem.
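Here's a minimal Python sketch of that two-variable switch. The gateway URL below is a placeholder, not the real endpoint; the variable names are the standard ones the OpenAI SDK already reads from the environment, which is exactly why the existing code keeps working:

```python
# The two variables (the URL below is a placeholder; the real
# endpoint is in the InferAll docs):
#   export OPENAI_BASE_URL="https://api.inferall.example/v1"
#   export OPENAI_API_KEY="<your InferAll key>"

from openai import OpenAI

# No code changes: the SDK reads OPENAI_BASE_URL and OPENAI_API_KEY
# from the environment on its own.
client = OpenAI()

resp = client.chat.completions.create(
    model="meta/llama-3.1-405b-instruct",  # model ID is illustrative
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
```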
The free open-source inference tier is the other big one, and it's not something LiteLLM can give you on its own. LiteLLM routes; it doesn't host models. To get free OSS inference behind a LiteLLM proxy you'd need to bring your own NVIDIA NIM contract, your own self-hosted vLLM cluster, or a deal with a community host. InferAll bundles 100,000 tokens per month across 186 NVIDIA-hosted models (Llama 3.1 405B, Mixtral, Nemotron, CodeLlama) into the service. For chatty coding-agent traffic that burns through tokens on cheap turns, that allowance is real money.
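One way to see that catalog from the client side, assuming the gateway mirrors the standard OpenAI /v1/models listing (whether the free NVIDIA-hosted catalog is flagged in the response is an assumption; at minimum the model IDs come back):

```python
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_BASE_URL / OPENAI_API_KEY as above

# Enumerate what the gateway exposes and spot the OSS models from the
# free tier by their IDs.
for m in client.models.list():
    print(m.id)
```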
The third reason is Claude Code in particular. Both LiteLLM and InferAll expose Anthropic-format endpoints, but with LiteLLM you're also operating the proxy that sits between Claude Code and the upstream Anthropic API. With InferAll, the Anthropic-compatible /v1/messages endpoint is hosted, and the free tier means Claude Code can run against free OSS models without setting up billing first. That's the workflow we built the gateway around.
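A sketch of hitting that hosted /v1/messages endpoint with the Anthropic SDK; the gateway host is a placeholder and the model ID is illustrative. Claude Code itself can be pointed the same way via its ANTHROPIC_BASE_URL environment variable, so nothing runs on your side:

```python
import anthropic

# Placeholder host; the real endpoint is in the InferAll docs.
client = anthropic.Anthropic(
    base_url="https://api.inferall.example",
    api_key="<your InferAll key>",
)

# Anthropic-format request served by the gateway, answered by an OSS
# model on the free tier (model ID is illustrative).
resp = client.messages.create(
    model="meta/llama-3.1-405b-instruct",
    max_tokens=256,
    messages=[{"role": "user", "content": "Explain this stack trace."}],
)
print(resp.content[0].text)
```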
Finally, the billing surface comes with the service. Free, Pro, Team, and Enterprise plans live at /#pricing: no separate Postgres for spend tracking, no separate auth layer, no separate admin UI to keep up with the OSS release cadence.