What we've shipped

Changelog

Reverse-chronological list of what's actually shipped on the InferAll AI gateway and the marketing site. Each entry links to the public PR where the change landed; gateway-side changes that live in the private infra repo are marked as such.

2026-07-13

402 responses gained X-Inferall-Gate-Reason for programmatic SDK branching

Every 402 payment-required response now includes X-Inferall-Gate-Reason: paid | cap | no_card | card_required. Lets SDK wrappers branch on the reason without JSON-parsing the body — e.g. auto-swap the request to a $0 NIM model when reason=paid, or halt retries entirely on reason=cap|no_card|card_required. Companion X-Inferall-Upgrade-URL header (and the upgrade_url JSON field) also now carry a ?src=402_<reason> query tag, so /billing landings from terminal-rendered 402 errors capture the reason in PostHog and let us measure the missing middle step: hit-gate → click-URL → open-checkout → pay. All four 402-emitting sites tagged: /ai/v1/generate (index.ts), /v1/messages (compat.ts), /v1/chat/completions (openai-compat.ts), and the free-model card-required gate (limiter.ts checkCardRequired). Full documentation on the response-headers table at /docs/rate-limits.
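
As a rough illustration of the branching described above, an SDK wrapper could decide what to do with a 402 from the headers alone. This is a minimal sketch, not the gateway's or the SDK's actual code: the function name `plan_for_402` and the action labels it returns are hypothetical; only the header names and the four reason values come from this entry.

```python
from typing import Mapping, Optional

# Reason values listed in the entry above; the action labels are illustrative.
SWAP_REASONS = {"paid"}
HALT_REASONS = {"cap", "no_card", "card_required"}


def plan_for_402(headers: Mapping[str, str]) -> dict:
    """Decide how a wrapper might react to a 402 without parsing the JSON body."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    reason = lowered.get("x-inferall-gate-reason", "").strip().lower()
    upgrade_url: Optional[str] = lowered.get("x-inferall-upgrade-url")

    if reason in SWAP_REASONS:
        action = "swap_to_free_model"  # retry once on a $0 NIM model
    elif reason in HALT_REASONS:
        action = "halt_retries"  # retrying cannot succeed until billing changes
    else:
        action = "unknown_reason"  # header missing or a value this sketch does not know
    return {"action": action, "reason": reason, "upgrade_url": upgrade_url}
```

Which $0 model to swap to is left to the caller; the sketch only classifies the response.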

infra (private)

$5 starter pack now unlocks unlimited NVIDIA NIM access — the 200-call trial cap is gone

The pre-payment 200-call trial was gated on has_card (any Stripe customer relationship), and has_card was set at /api/billing/start when we create the Stripe Customer, before any card is attached. Bot clusters were minting throwaway Stripe customers and burning NIM quota per key without ever paying. The new gate is has_paid_successfully: a settled real-money charge. In exchange, the 200-call cap is gone entirely: once your $5 activation settles, NIM open-source models stay $0 in/out with no per-user request cap beyond the standard free-plan daily limits (100 chat / 50 text / etc.). The $5 also becomes spendable balance for premium providers (OpenAI, Anthropic, Google) at zero markup. We reframed the marketing on /pricing, /billing, /auth, /docs, /docs/rate-limits, and /keys to match. The trade-off: there is no longer a free pre-payment trial, but if you want to evaluate before committing, reply to any /billing email and we'll grant an evaluation credit manually.

infra (private)

2026-06-25

Paid users with multi-key accounts no longer hit the trial gate or card-required 402

If you made a $5 starter-pack purchase and then created additional API keys, those new keys could end up with the right tier but the wrong has_paid_successfully flag — leading to surprise 402s when you tried a premium model from a fresh key. Two fixes shipped: (1) the trial gate now bypasses entirely when has_paid_successfully=true on the calling key, regardless of tier — your access is governed by balance and rate limits, not by the 200-call trial cap; (2) new keys created after a successful payment now inherit has_paid_successfully from the user's existing key set at create time, so the inconsistency doesn't recur. Defense-in-depth: the bypass covers any pre-existing mis-tagged keys still in the DB; the inheritance prevents new ones from being created in the bad state. Caught via a new real-time observability signal (X-ALERT:claude_cli_blocked) that fires when a Claude Code / SDK user accumulates 3+ trial-block 402s in 5 minutes — surfacing the issue within an hour of the alert going live, on the first qualifying user.
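
The two fixes can be sketched roughly as follows. This is an illustration under stated assumptions, not the gateway's code: the dict-shaped key records and both function names are hypothetical; only the has_paid_successfully flag and the two behaviours come from this entry.

```python
from typing import Iterable, Mapping


def inherited_paid_flag(existing_keys: Iterable[Mapping[str, object]]) -> bool:
    """Fix (2): a new key starts as paid if any existing key on the account is paid."""
    return any(bool(key.get("has_paid_successfully")) for key in existing_keys)


def trial_gate_applies(key: Mapping[str, object]) -> bool:
    """Fix (1): the trial gate is skipped for paid keys, regardless of tier."""
    return not bool(key.get("has_paid_successfully"))
```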

infra (private)

GET /ai/v1/key/status — trial counter, balance, savings, all behind your existing API key

Returns a one-shot snapshot of where your account stands: tier, trial { used, remaining, cap=200 }, balance in cents and USD, has_card, has_paid_successfully, and a savings_7d block with vs_gpt4o_mini_usd / vs_claude_sonnet_usd (computed from your free-NIM tokens routed in the last 7 days). Auth is via the same API key you use for /v1/chat/completions (Bearer or x-api-key) — no separate dashboard JWT. Lets SDKs and CLIs render upgrade nudges proactively ("150 of 200 free calls remaining, $0.64 saved vs Claude Sonnet so far") without scraping headers. Pairs with two response headers also added: X-InferAll-Trial-Usage: N/200 and X-InferAll-Trial-Status: active | halfway | near_cap, so per-call awareness comes for free without an extra round-trip.
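
A client could turn the snapshot into the kind of nudge quoted above along these lines. This is a hedged sketch: the exact JSON nesting (a `trial` object with `remaining` and `cap`, and `savings_7d.vs_claude_sonnet_usd` as a number) is assumed from the field names in this entry, and `upgrade_nudge` is a hypothetical helper.

```python
def upgrade_nudge(status: dict) -> str:
    """Format a one-line nudge from an already-decoded /ai/v1/key/status body."""
    trial = status["trial"]
    saved_usd = status["savings_7d"]["vs_claude_sonnet_usd"]
    return (
        f"{trial['remaining']} of {trial['cap']} free calls remaining, "
        f"${saved_usd:.2f} saved vs Claude Sonnet so far"
    )
```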

infra (private)

All trial-block / card-required 402 messages are now URL-first and under 120 characters

The trial-block error messages were running 200–300 characters with the action URL embedded mid-sentence. In a terminal where the SDK renders the error inline (Claude Code, in particular), the URL often got truncated or buried — users would read the prose, not see a clickable URL, and bounce. Rewrote the four 402 surfaces (trial_blocked_no_card, checkCardRequired, trial_blocked_paid, trial_blocked_cap) to lead with the URL, then a short one-line explanation, then stop. Sample: "https://inferall.ai/billing — visit to activate 200 free NIM-model calls. No payment required, just initiating checkout unlocks the trial." Same gate, same flow, sharper terminal rendering.

infra (private)

2026-06-24

X-InferAll-Trial-Usage header surfaces your free-trial count on every call

Every successful /v1/chat/completions response from a free-trial (tier=pending) key now includes the header `X-InferAll-Trial-Usage: N/200`, where N is the post-success count over the 200-call lifetime trial cap. SDK clients can surface "you've used 20 of 200 free calls" inline without an extra usage endpoint. Paid keys don't see the header (no relevant cap). Threaded through all four response paths — NVIDIA NIM and OpenAI passthrough, plus the Gemini- and Anthropic-translated paths — so the counter is consistent regardless of which provider answered the request. Closes a real friction case: heavy free users who pick NIM models that work well for them never bump into the upgrade gate and never realise there's a cap waiting. Same JSON body shape; OpenAI SDK clients keep working unchanged.
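
Reading the header on the client side might look like the sketch below. The helper names are hypothetical, and the code assumes the value is always two integers separated by a slash, as in the `N/200` format described above.

```python
from typing import Mapping, Optional, Tuple


def parse_trial_usage(value: str) -> Tuple[int, int]:
    """Split an 'N/200'-style value into (used, cap)."""
    used_text, cap_text = value.strip().split("/", 1)
    return int(used_text), int(cap_text)


def trial_usage_message(headers: Mapping[str, str]) -> Optional[str]:
    """Return an inline usage message, or None when the header is absent."""
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("x-inferall-trial-usage")
    if raw is None:
        return None  # paid keys do not receive the header
    used, cap = parse_trial_usage(raw)
    return f"you've used {used} of {cap} free calls"
```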

infra (private)

Bare-name model resolution: "nemotron-3-super-120b-a12b" works as well as "nvidia/nemotron-3-super-120b-a12b"

Pricing lookup was strict-match: if you sent model="nemotron-3-super-120b-a12b" without the "nvidia/" provider prefix, the gateway couldn't tell it was a free NIM model and the trial gate fired with a billing_required 402 — wrong outcome for a model that costs $0. Now the lookup falls back through common provider prefixes (nvidia, meta, deepseek-ai, google, anthropic, openai) when the bare name doesn't match directly. Affects every code path that depends on per-model pricing: trial allowance, free-model detection, cost reservation, and the inline-NIM-alternative suggestion in 402 copy. Bare gpt-oss-20b and gpt-oss-120b also resolve correctly now.
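
The fallback can be pictured as a direct lookup followed by a walk through the listed prefixes. This is a minimal sketch assuming the pricing table is a mapping keyed by full `provider/model` IDs; `resolve_model` is a hypothetical name, and the real lookup may order or filter prefixes differently.

```python
from typing import Mapping, Optional

# Prefix order as listed in the entry above.
PROVIDER_PREFIXES = ("nvidia", "meta", "deepseek-ai", "google", "anthropic", "openai")


def resolve_model(name: str, pricing: Mapping[str, object]) -> Optional[str]:
    """Return the pricing-table key for a model name, trying provider prefixes."""
    if name in pricing:
        return name  # exact match wins
    if "/" not in name:
        # Bare name: try each known provider prefix in order.
        for prefix in PROVIDER_PREFIXES:
            candidate = f"{prefix}/{name}"
            if candidate in pricing:
                return candidate
    return None  # unknown model
```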

infra (private)

2026-06-23

402 trial-blocked-paid now offers a free NIM alternative inline

Users hitting the trial gate on a paid model (e.g. gpt-4, claude-sonnet, gemini-2.5-flash) used to see a 402 that said "add $5 to unlock premium providers, your trial covers NIM." True but not actionable — they had to figure out which NIM model to swap to. Now the 402 message also says: "Or try a $0 NIM model right now — e.g. swap model=meta/llama-3.1-8b-instruct or meta/llama-3.1-70b-instruct (or any of the 118 NIM models) and your trial covers them today." Gives stuck users a one-line script edit to keep going without paying. The $5 pack CTA stays in place; X-Inferall-Upgrade-URL header stays. Within an hour of the deploy a Go-SDK user that had been retrying gemini-2.5-flash for 30+ attempts swapped to meta/llama-3.1-70b-instruct and made 22 successful calls.

infra (private)

2026-06-22

Activation funnel — closed two billing bugs, shipped 4 telemetry layers, surfaced an outreach pipeline

Telemetry expansion: every 402 from the gateway's trial-block / card-required / credits-required gates now carries a structured error_code in ai_usage_log plus an X-Inferall-Upgrade-URL header for SDK/IDE retry loggers that don't parse JSON bodies. The trial-block path previously emitted 402s with no log entry at all; that entire cohort is now visible. Closed two billing bugs: (1) finalize_spend was debiting tier=pending trial users' balance on every free-NIM call because the MINIMUM_FREE_MODEL_COST_CENTS floor flowed into estimated_cost_cents; finalize_spend is now gated on real provider cost instead. (2) The Stripe webhook endpoint was only subscribed to 4 of 9 handled events; it is now subscribed to all 9, so charge.succeeded / charge.refunded / checkout.session.expired actually fire in production. checkout-expired events now also persist to billing_errors so dormant-checkout outreach can be queried programmatically. Added a per-user gate-storm tracker that emits [ALERT:gate_storm] when one user hits 15 trial-block 402s in 60s, which surfaces SDK auto-retry loops in real time instead of via hourly sweeps.
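
A per-user gate-storm tracker of this kind is commonly built as a sliding window over recent timestamps. The sketch below shows that general technique with the thresholds from this entry (15 hits in 60 seconds); the class name, the in-memory storage, and the caller-supplied clock are assumptions, not the gateway's implementation.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class GateStormTracker:
    """Sliding-window counter of trial-block 402s per user."""

    def __init__(self, threshold: int = 15, window_seconds: float = 60.0) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, user_id: str, now: float) -> bool:
        """Record one 402 at time `now`; return True if the user is in a storm."""
        hits = self._hits[user_id]
        hits.append(now)
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        return len(hits) >= self.threshold
```

As written, `record` keeps returning True for every further hit while the user stays above the threshold; a real alerting path would likely de-duplicate so one storm produces one alert.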

infra (private) · site (private)

2026-06-15

Reliability sweep — 21 hardening fixes across the gateway

Shipped a 21-commit run covering: billing race conditions (the mode=payment fix that unblocked our first paid customer, plus the mirror fix for mode=subscription); Stripe checkout improvements (abandoned-cart recovery, allow_promotion_codes, product description on the Stripe page, Stripe Customer idempotency keys to defeat double-click races); a charge.refunded handler so refunds actually subtract from balance; a change to the compat /v1/messages fallback engine so it short-circuits on non-recoverable upstream errors instead of cascading the whole chain; and a full timeout sweep. Every outbound fetch in the gateway codebase (Supabase REST, Stripe API, and all five provider adapters: minimax/replicate/runway/stability/elevenlabs) now uses fetchWithTimeout with explicit budgets, so a hung upstream socket can no longer pin a worker forever.

infra (private)

Pricing-claim correction — what the tiers actually do

Earlier entries in this changelog (and across the marketing site) described the Free tier as "100,000 tokens/month" — that was never the actual gateway behavior. The real model: open NVIDIA NIM models bill at $0 input/$0 output regardless of tier, premium providers (OpenAI, Anthropic, Google) bill at the provider's published per-token rate with zero markup, and every tier has a daily request count limit per operation (Free: 50 chat/100 text per day; Pro: 5,000 chat; Team and Enterprise scale up). Quotas reset daily UTC. Marketing copy across the homepage, /billing, /keys, /solutions/*, /use-cases/*, /extension, and the major blog posts was rewritten to describe this honestly. Old entries below are left as the historical record but the framing they describe is superseded by this one.

site (private)

2026-06-13

$5 Activation pack — cheapest path to start using the gateway

Added a $5 one-time activation pack as the entry pricing tier alongside Pro ($29/mo). The $5 becomes spendable balance: open NVIDIA NIM models bill at $0 input/$0 output against your balance, premium providers (OpenAI, Anthropic, Google) bill at the provider's published rate with zero markup. Real card + real money cleared by Stripe is the bot defense — bots don't burn $5 on a card they can't extract value from, and real evaluators get to start using the gateway without committing to $29/mo. Uses inline Stripe priceData (same pattern as the existing $100 credit pack) so no preconfigured Stripe Price ID was needed.

infra (private)

2026-06-12

Free tier framing — honest about the card requirement

Removed "no credit card required" / "no card needed to start" claims across the marketing site (hero, /pricing, /keys, /billing, /solutions/*, /use-cases/*, /integrations). The free tier still costs $0 within the 100,000-token monthly allowance on 118+ NVIDIA NIM open-source models, but activation requires a verified card on file — that's the gate that's holding back the bot signup wave that started in early June, and reversing it would re-open ~$7k/mo in upstream Anthropic outflow. The site now says so up-front. Pending-tier (no card yet) keys get a small trial allowance to evaluate before the activation prompt.

infra (private)

2026-06-03

ElevenLabs TTS — voice generation via the gateway

ElevenLabs text-to-speech is now live as a gateway provider. POST /ai/v1/generate with provider: "elevenlabs", operation: "tts" — returns binary audio/mpeg. Supports eleven_multilingual_v2 and named voice aliases (confident_narrator, warm_narrator, friendly_narrator, deep_narrator) or raw ElevenLabs voice UUIDs. Same ifu_ key as all other providers.

infra (private)

Model catalog expanded to 207 — GPT-4.1, o3, o4-mini, Claude 4.x

Added the full GPT-4.1 family (gpt-4.1 at $2/$8/M, gpt-4.1-mini at $0.40/$1.60/M, gpt-4.1-nano at $0.10/$0.40/M), OpenAI reasoning models o3 ($10/$40/M), o4-mini ($1.10/$4.40/M), and o1-pro ($150/$600/M). Also added five Anthropic Claude 4 variants that were live on Anthropic's API but absent from our catalog: claude-opus-4-1, claude-opus-4-5, claude-opus-4-6, claude-opus-4-7, and claude-sonnet-4-5. Fixed a routing bug where o4-mini requests didn't route to OpenAI. Catalog is now 207 models, 0 dead.

infra (private)

First-call quickstart for returning users

The /keys page now shows a persistent copy-paste curl snippet for pending-tier users who haven't made an API call yet — so the first-call path is visible even after you close and reopen the tab. Previously the snippet only appeared immediately after creating a new key; returning users saw their key prefix but no guidance. Also added a copy button and telemetry to track the never-activated cohort.

PR #86 · PR #88

Trial usage bar now renders even when the usage fetch fails

Fixed a bug where the trial progress bar (N/200 requests) was invisible to pending-tier users if the /ai/v1/usage fetch returned a non-401 error. The bar now defaults to showing 0/200 rather than hiding entirely, so users near the cap always see a warning.

PR #85 (private)

Stripe usage reporting fixed — paying developers now billed correctly

The daily GitHub Actions cron that reports usage to Stripe for metered billing was silently failing due to three missing secrets. Set and verified; first successful run confirmed. Usage-based subscriptions are now tracked correctly.

infra (private)

2026-05-30

Start free — no credit card required

You no longer need a card on file to try InferAll. Create a key and start calling the 110+ free NVIDIA NIM open-source models (Llama 3.1 70B/8B, Mixtral, Nemotron, CodeLlama, and more) right away — $0. A card is needed only when you want a paid provider (OpenAI, Anthropic, Google) or to continue past the free trial. Same ifu_ key, same OpenAI- and Anthropic-compatible endpoints — the card wall that used to sit in front of the free tier is gone.

infra (private)

2026-05-29

Automated model-health checks — dead models get flagged, not served

The gateway now monitors its whole catalog. A daily check (and a public GET /ai/v1/models/health endpoint) compares every model against its provider's live model list and flags any the provider has stopped serving — so when a provider deprecates, renames, or rotates a model ID (the drift that retired DALL·E, sunset Gemini 1.5, and regularly rotates NVIDIA NIM IDs) it gets caught and fixed instead of 404'ing your calls. Detection is registry-based and $0, so monitoring never burns the free tier it's watching.
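
Registry-based detection boils down to a set comparison between the catalog and each provider's live model list. The following is a sketch of that idea only: the data shapes and the name `find_dead_models` are hypothetical, and fetching the live lists is left out.

```python
from typing import Dict, Iterable, List, Set


def find_dead_models(
    catalog: Dict[str, Iterable[str]],
    live: Dict[str, Set[str]],
) -> List[str]:
    """Return 'provider/model' IDs that are in the catalog but not served live."""
    dead = []
    for provider, model_ids in catalog.items():
        if provider not in live:
            # No live list was fetched for this provider: skip rather than
            # flag its whole catalog on what may be a transient failure.
            continue
        for model_id in model_ids:
            if model_id not in live[provider]:
                dead.append(f"{provider}/{model_id}")
    return sorted(dead)
```

Because the check only compares ID lists, it makes no generation calls, which is consistent with the $0 detection cost described above.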

infra (private)

Deprecated model IDs keep working; catalog drift cleaned up

If your app is still pinned to a deprecated Claude snapshot the provider has retired, the gateway now transparently routes it to the current equivalent and labels the response with the model that actually served it — so you keep working while you migrate instead of failing during an incident. We also removed catalog IDs providers no longer serve (the catalog now lists only models that actually run), repointed a stale fallback hop, and moved OpenAI image generation to the current gpt-image-1 (DALL·E was retired upstream) at the same price.

infra (private)

Large requests no longer get cut off mid-generation

Raised the upstream timeout for the primary provider attempt so large, long-running generations complete instead of being aborted at 30 seconds — while keeping fallback hops short so a real provider outage still fails over quickly. Fixes a class of large requests that were erroring after the primary and every fallback hit the old 30s ceiling.

infra (private)

Gateway hardened for reliability — multi-region, health checks, more memory

api.inferall.ai now runs on two machines across two regions (sjc + iad) with HTTP health checks, so a single machine or a regional edge incident no longer takes the gateway down — the proxy routes around an unhealthy instance. Per-machine memory was raised 256MB → 512MB to remove GC stalls under streaming load. Resolves the intermittent connection resets observed during a Fly SJC edge incident on 2026-05-28.

infra fly.toml (private)

Catalog expanded: GPT-5.5 family + latest Claude

Added GPT-5.5, GPT-5.5 Pro, GPT-5.4 (+ mini/nano), GPT-5, and Claude Opus 4.8 / Sonnet 4.6 to the model catalog (GET /ai/v1/models). Premium tokens bill at the provider's published price with zero markup, on your existing ifu_ key.

gateway change (private)

First-call quickstart: free $0 curl on any stack

After you create a key, the quickstart now offers a copy-paste curl that makes a $0 call on a free NVIDIA NIM open model via the OpenAI-compatible endpoint — alongside the Claude Code snippet — so you can confirm your key works in one paste, with no premium spend.

PR #42

Marketing/docs model IDs now verified in CI

Fixed two broken model IDs in the OpenRouter comparison and Claude Code use-case pages (they named models not in the catalog, so copy-paste examples 404'd), and added a CI check that fails if any blog / comparison / use-case / docs page references a model ID not in the live catalog. Prevents the bug class going forward.
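
A check of this kind can be approximated by extracting model references from page text and comparing them with the catalog. The sketch below is an assumption-heavy illustration: the regular expression only recognises `model="..."` and `"model": "..."` forms, and `unknown_model_ids` is a hypothetical name; the actual CI check may detect references differently.

```python
import re
from typing import List, Set

# Matches model="id", model='id', "model": "id" and similar forms.
MODEL_REF = re.compile(r"""["']?model["']?\s*[:=]\s*["']([^"']+)["']""")


def unknown_model_ids(page_text: str, catalog: Set[str]) -> List[str]:
    """Return referenced model IDs that are missing from the catalog."""
    referenced = MODEL_REF.findall(page_text)
    return sorted({model_id for model_id in referenced if model_id not in catalog})
```

In CI, a non-empty result for any page would fail the build.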

PR #43 · PR #44

2026-05-26

Gateway 429 messages now point at /billing for upgrades

Both 429 responses from the gateway's limiter (spending-limit-exceeded and daily-rate-limit-exceeded) now lead with https://inferall.ai/billing as the primary upgrade CTA. Previously the spending-limit message pointed only at the Stripe portal (meant for existing subscribers adjusting their card cap), and the daily-rate-limit message had no upgrade pointer at all.

gateway change (private)

Billing UX: tier-specific CTAs, current-plan indicator, Most-popular badge, success banner

Replaced the generic "Select" buttons on /billing with tier-specific labels ("Upgrade to Pro", "Upgrade to Team", "Get free API key" → /keys, "Contact sales"). The current tier gets a left-border accent and a disabled "Current plan" label. Pro carries a "· Most popular" badge. Post-checkout, ?success=true now triggers a "Thanks — your subscription is active" banner.

PR #36 · PR #37 (hotfix)

scripts/founder-stats.sh — reusable Stripe-aggregate refresh

Single shell script anyone with the Stripe CLI authed can run to refresh founder metrics: customer count, signup velocity, active subscriptions + plan breakdown, paid invoices, churn, void invoices, funnel breakdown. Outputs human-readable markdown + a JSON snapshot. Aggregate-only — no customer IDs, emails, or payment details written to disk.

PR #35

PostHog activation-funnel analytics + Replicate passthrough note + honesty fixes

PostHog wired to the marketing site (project 359601) — pageviews on soft navigation, plus explicit funnel events (inferall_auth_success with identify, inferall_key_created with tier, inferall_first_call_snippet_copied). Removed a fabricated aggregateRating from JSON-LD (no real ratings exist; honesty + Google structured-data policy). /live now acknowledges that the gateway passes any Replicate or NIM model id through directly, so the routable catalog is much larger than the 207 enumerated.

PR #34

VS Code extension: security + Code-of-Conduct contacts routed to InferAll

SECURITY.md and all eight Code-of-Conduct translations (en + ja, ar-sa, ko, es, zh-tw, pt-BR, zh-cn) now route security reports and abuse reports to the InferAll team instead of the upstream Cline fork's contact addresses.

inferall-vscode#4 · inferall-vscode#5

2026-05-25

Clonable quickstart in /examples

Python (inferall-ai) and TypeScript (@inferall/sdk) hello-world scripts plus a three-path README (Claude Code env vars, Python SDK, TypeScript SDK). Linked from the docs Quick-start.

PR #32

/live page — per-vendor model breakdown + source mix

Server-rendered with 1h ISR. Shows total / free / paid counts, the table of every vendor (google, nvidia, openai, meta, mistralai, anthropic, runway, microsoft, qwen, …), and live vs static source mix. Counts match what `curl /ai/v1/models | jq 'keys|length'` returns.

PR #31

/keys: ready-to-paste Claude Code snippet right after key creation

When you create a new key, the page now embeds a two-env-var snippet with the actual key inline and its own Copy button. Time from key → working `claude` command is one paste.

PR #30

Live gateway stats on the homepage hero and /status

Homepage hero now shows a one-line live readout: total models, free models, gateway healthy/degraded — fetched server-side from the public endpoints. /status replaces its previous placeholder with the same data while the full multi-region status page is built out.

PR #29

User-key prefix: `ifu_` (legacy `kr_user_` still accepted)

Dashboard-created user keys now mint with the InferAll-branded `ifu_` prefix instead of the legacy `kr_user_`. Docs, SDK examples, /compare snippets, and the security page were updated to match. Existing keys remain valid forever.

gateway change (private) · PR #28

Model-count copy aligned to verifiable numbers

Site now says "200+ models" / "150+ free open-source models" / "120+ NVIDIA-hosted models" — each rounded down from what /ai/v1/models actually enumerates today (207 / 158 / 123). Survives catalog fluctuation without overclaiming.
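
The rounding rule implied by these numbers is a floor to the nearest ten. A small sketch (the helper name and the step of 10 are inferred from the three examples, not stated by the site):

```python
def marketing_floor(count: int, step: int = 10) -> str:
    """Round a live count down to a multiple of `step` and append '+'."""
    return f"{(count // step) * step}+"
```

Flooring rather than rounding to nearest means the claim stays true as long as the catalog does not shrink below the rounded figure.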

PR #27

Free tier framing: card on file required, $0 within the allowance

Removed every "no credit card" claim. The Free tier still costs $0 within the 100,000-token monthly allowance on NVIDIA NIM open-source models, but activation requires a card on file — we say so upfront across the marketing site and on /keys.

PR #26

Branded social-share assets

Open Graph + Twitter image, generated favicon, web manifest, sitemap expansion, /docs metadata. Link previews on Hacker News, X, Reddit, Slack now render correctly.

PR #25

2026-05-24

TypeScript SDK published — `@inferall/sdk` 0.1.0

The native TypeScript SDK is live on npm. `npm install @inferall/sdk`. Reads `INFERALL_API_KEY` from the environment; same `Inferall` class shape as the Python SDK.

PR #24

SLA published — 99.9% monthly uptime target + tiered service credits

Concrete SLA at /sla: 99.9% gateway control-plane uptime per calendar month, tiered service credits (10% / 25% / 50% of monthly fee), exclusions, failover behavior, and incident response.
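
For a sense of scale, a 99.9% monthly target leaves roughly 43 minutes of allowed downtime in a 30-day month. The arithmetic is sketched below; the function name is hypothetical, and the credit thresholds are not modelled because this entry does not state them.

```python
def downtime_budget_minutes(days_in_month: int, uptime_target: float = 0.999) -> float:
    """Minutes of downtime permitted in a month at the given uptime target."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1.0 - uptime_target)
```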

PR #22

2026-05-23

Billing page: Team tier aligned to 10M tokens + Enterprise contact-sales tile

Pricing tiles on /billing now reflect the real Team tier (10M tokens/month) and add an Enterprise contact-sales tile linking to contact@kindly.fyi.

PR #19

Docs: SDK examples refreshed to shipped `inferall-ai` + `ifa_` keys

/docs SDK examples now show the actual published Python (`inferall-ai`) and TypeScript (`@inferall/sdk`) shapes, with `ifa_…`-prefixed keys.

PR #20

Python SDK restored to the hero — `pip install inferall-ai` 0.1.0

Homepage hero leads with the published Python SDK install + a minimal usage snippet.

PR #16

Earlier

Gateway: automatic cross-provider failover on 429 / 529 / 5xx / timeout

Server-side retry against the next capable provider in the chain when an upstream rate-limits or errors. Context-size-aware, cheapest-first selection. A single Anthropic outage or Google quota reset no longer surfaces to your application.
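
The retry loop can be sketched as a walk down an ordered provider chain. This is an illustration under assumptions, not the gateway's code: the chain is taken as already filtered by context size and sorted cheapest-first, `send` is a hypothetical callable returning `(status, body)` and raising `TimeoutError` on timeout, and both function names are invented for the example.

```python
from typing import Callable, Optional, Sequence, Tuple

# (provider, status or None on timeout, body)
Result = Tuple[str, Optional[int], str]


def should_fail_over(status: int) -> bool:
    """429 and every 5xx (which includes 529) move on to the next provider."""
    return status == 429 or 500 <= status <= 599


def call_with_failover(
    chain: Sequence[str],
    send: Callable[[str], Tuple[int, str]],
) -> Result:
    """Try providers in order; return the first non-retryable response."""
    if not chain:
        raise ValueError("provider chain is empty")
    last: Result = (chain[0], None, "")
    for provider in chain:
        try:
            status, body = send(provider)
        except TimeoutError:
            last = (provider, None, "timeout")
            continue
        if should_fail_over(status):
            last = (provider, status, body)
            continue
        return provider, status, body
    # Every provider failed: surface the last failure to the caller.
    return last
```

Other 4xx responses are returned as-is in this sketch, since retrying a malformed request on another provider would not help.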

gateway change (private)