Most AI apps are locked to one provider. When OpenAI has an outage, you're down. When Anthropic raises prices, you rebuild. When a better model launches, you rewrite your integration.

InferAll gives you one [AI inference API](/solutions/ai-inference-api) key that routes to any provider: OpenAI, Anthropic, Google, NVIDIA, Replicate, and Runway. Switching is a parameter change, not a rewrite.

Signing up means funding a key with a $5 starter pack, which becomes usage credit you can spend on any model, open or premium, at the provider's published rate with zero markup.

---

### One prompt, four providers

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferall.ai/v1",
    api_key="ifu_your_key_here",  # get one at inferall.ai/keys
)

prompt = "What are the tradeoffs between SQL and NoSQL databases?"

# Route to any provider by changing just `model`
for model in [
    "meta/llama-3.1-70b-instruct",          # NVIDIA NIM, open model
    "google/gemma-4-31b-it",                # Google Gemma, open model
    "qwen/qwen3-coder-480b-a35b-instruct",  # Qwen, open model
    "anthropic/claude-sonnet-4-6",          # Anthropic Claude, premium
]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    print(f"\n=== {model.split('/')[-1]} ===")
    print(response.choices[0].message.content)
```

The first three route to NVIDIA NIM open models at our open-model rate. The last bills at Anthropic's published per-token rate with zero markup. All four come off the same `ifu_` key, on the same invoice.

---

### Automatic failover

InferAll falls back to the next provider automatically when one fails. If the primary returns a 500, hits a rate limit, or times out, the gateway retries along the configured fallback chain, with no extra code in your application.

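
For contrast, here is a minimal sketch of the kind of client-side fallback loop that gateway-level failover lets you leave out. It is an illustration only, not InferAll's implementation: `ProviderError`, `call_with_fallback`, and `fake_send` are hypothetical names, and the transport is simulated so the example runs without a network or an API key.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Hypothetical stand-in for a 5xx, rate-limit, or timeout failure."""


def call_with_fallback(
    chain: Sequence[str],
    send: Callable[[str], str],
) -> tuple[str, str]:
    """Try each model in `chain` in order; return (model, reply) from the first success."""
    last_error = None
    for model in chain:
        try:
            return model, send(model)
        except ProviderError as exc:
            last_error = exc  # remember the failure and try the next entry
    raise RuntimeError(f"all providers failed: {list(chain)}") from last_error


def fake_send(model: str) -> str:
    # Simulated transport (assumption): pretend the Anthropic route is down.
    if model.startswith("anthropic/"):
        raise ProviderError("503 from primary")
    return f"reply from {model}"


used, reply = call_with_fallback(
    ["anthropic/claude-sonnet-4-6", "meta/llama-3.1-70b-instruct"],
    fake_send,
)
print(used)  # meta/llama-3.1-70b-instruct
```

With failover handled at the gateway, none of this loop lives in your application; a single call is enough:
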
```python
# This call retries on NVIDIA if Anthropic fails:
response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",  # primary
    messages=[{"role": "user", "content": "Explain neural networks."}],
)
# provider=anthropic attempted first, nvidia fallback on failure
```

No retries in your application code, no provider-specific error handling.

---

### Compare providers on the same task

```python
import asyncio

import httpx


async def compare(prompt: str, models: list[str]):
    async with httpx.AsyncClient() as http:
        # One request per model, all sent concurrently
        tasks = [
            http.post(
                "https://api.inferall.ai/v1/chat/completions",
                headers={"Authorization": "Bearer ifu_your_key"},
                json={
                    "model": m,
                    "messages": [{"role": "user", "content": prompt}],
                    "max_tokens": 150,
                },
                timeout=30,
            )
            for m in models
        ]
        # return_exceptions=True keeps one failure from cancelling the rest
        responses = await asyncio.gather(*tasks, return_exceptions=True)

    for model, resp in zip(models, responses):
        if isinstance(resp, Exception):
            print(f"{model}: error")
        elif resp.status_code != 200:
            print(f"{model}: HTTP {resp.status_code}")
        else:
            data = resp.json()
            text = data["choices"][0]["message"]["content"]
            print(f"\n{model.split('/')[-1]}:\n{text[:300]}")


asyncio.run(compare(
    "Write a haiku about distributed systems.",
    ["meta/llama-3.1-70b-instruct", "google/gemma-4-31b-it", "qwen/qwen3.5-122b-a10b"],
))
```

---

### Route by task type

Different providers excel at different tasks. InferAll lets you route at the application layer:

```python
def get_model(task: str) -> str:
    if task == "code":
        return "qwen/qwen3-coder-480b-a35b-instruct"  # strong open coder
    elif task == "reasoning":
        return "nvidia/nemotron-3-super-120b-a12b"    # largest open model
    elif task == "fast":
        return "meta/llama-3.1-8b-instruct"           # fastest open model
    else:
        return "meta/llama-3.1-70b-instruct"          # balanced default
```

These are all open models on NVIDIA NIM, billed at our open-model rate against your starter balance. Swap in `gpt-4o` or `claude-sonnet-4-6` when a task earns premium spend.

```python
task = "code"
model = get_model(task)

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
```

---

### Open models available today

All of these route through NVIDIA NIM at our open-model rate:

```sh
curl https://api.inferall.ai/ai/v1/models \
  | python3 -c "
import sys, json
models = json.load(sys.stdin)
nim = [(k, v) for k, v in models.items()
       if v.get('provider') == 'nvidia' and v.get('type') == 'token']
print(f'{len(nim)} open token models')
for k, _ in sorted(nim)[:10]:
    print(f'  {k}')
"
```

---

### Get started

Sign up at [inferall.ai/keys](https://inferall.ai/keys) and fund a key with the $5 starter pack. That $5 becomes usage credit you can spend on any model, open or premium, at the provider's published rate with zero markup. Then point your existing OpenAI SDK at `https://api.inferall.ai/v1` and pass any model ID in this post.
