Rate limits

InferAll applies rate limits at two layers: (1) per-tier daily request-count caps enforced by the InferAll gateway (separate caps per operation type), and (2) upstream provider rate limits, which the gateway handles by transparently retrying the request against alternate providers via automatic failover.

Per-tier daily request limits

Each plan has separate daily caps per operation type, reset at 00:00 UTC. Upstream providers may apply additional per-second or burst limits, which InferAll handles via automatic failover.

Plan       | text    | chat   | image-analyze | image-generate | video-generate
Free       | 100     | 50     | 20            | 10             | 5
Pro        | 10,000  | 5,000  | 2,000         | 500            | 50
Team       | 10,000  | 5,000  | 2,000         | 500            | 200
Enterprise | 100,000 | 50,000 | 20,000        | 5,000          | 1,000

Values are requests per day per operation. Counts reset at 00:00 UTC. The gateway is burst-tolerant within the daily cap: there is no per-second sublimit. Each plan also includes a monthly dollar amount of premium-provider usage ($5 Free, $20 Pro, $100 Team) before prepaid balance is drawn down; see /pricing for the full structure.
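
For client-side budgeting, the caps in the table above and the 00:00 UTC reset can be sketched as follows. This is a minimal illustration and not part of any InferAll SDK: DAILY_CAPS, remaining_today, and seconds_until_reset are hypothetical names, and the gateway remains the source of truth for actual counts.

```python
from datetime import datetime, timedelta, timezone

# Daily request caps copied from the table above (requests per day per operation).
DAILY_CAPS = {
    "free": {"text": 100, "chat": 50, "image-analyze": 20,
             "image-generate": 10, "video-generate": 5},
    "pro": {"text": 10_000, "chat": 5_000, "image-analyze": 2_000,
            "image-generate": 500, "video-generate": 50},
    "team": {"text": 10_000, "chat": 5_000, "image-analyze": 2_000,
             "image-generate": 500, "video-generate": 200},
    "enterprise": {"text": 100_000, "chat": 50_000, "image-analyze": 20_000,
                   "image-generate": 5_000, "video-generate": 1_000},
}


def remaining_today(plan, operation, used):
    """Requests left today according to a local counter, never below zero."""
    return max(DAILY_CAPS[plan][operation] - used, 0)


def seconds_until_reset(now=None):
    """Seconds from `now` until the next 00:00 UTC reset."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return (next_midnight - now).total_seconds()
```

A local counter like this only helps you avoid sending requests that are certain to be rejected; it cannot see usage from other clients sharing the same account.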

Trial allowance (new accounts)

Each new account gets a lifetime allowance of 200 free requests against the 118+ open NVIDIA NIM models ($0 input/$0 output) before any charge is required. The trial unlocks as soon as you visit /billing and start the checkout flow (a Stripe Customer record is created at that point — the card itself isn't charged until you actually pick a pack).

After the 200 requests are used or you complete a $5 starter pack, you graduate to the free-plan daily caps in the table above and any premium-provider usage starts billing against your prepaid balance.

Response headers

Rate-limited responses (HTTP 429 or 529) include Retry-After. Quota-exhausted responses use HTTP 402 with a JSON body that includes the remaining balance, the amount needed, and a link for adding credits.

Retry-After: On 429 / 529 responses, the number of seconds to wait before retrying. Honor this when present.
HTTP 402 body: balance_usd, needed_usd, add_credits_url. Returned when your monthly included usage plus prepaid balance can't cover the next request.
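
The distinction between rate-limited and quota-exhausted responses can be sketched as a small dispatch function. This is an illustrative sketch only: classify_response and its return shape are hypothetical, the headers argument is assumed to be a plain dict keyed exactly as "Retry-After" (real HTTP clients usually provide a case-insensitive mapping), and the value types of the 402 body fields are not specified by this page.

```python
def classify_response(status_code, headers, body):
    """Map a gateway response to a client action.

    Returns a tuple (action, detail):
      ("retry", seconds)       for 429 / 529; seconds is None without a usable Retry-After
      ("add_credits", info)    for 402; info holds the documented body fields
      ("pass_through", None)   for every other status code
    """
    if status_code in (429, 529):
        raw = headers.get("Retry-After")
        # Only accept a plain non-negative integer number of seconds.
        seconds = int(raw) if raw is not None and raw.strip().isdigit() else None
        return ("retry", seconds)
    if status_code == 402:
        info = {
            "balance_usd": body.get("balance_usd"),
            "needed_usd": body.get("needed_usd"),
            "add_credits_url": body.get("add_credits_url"),
        }
        return ("add_credits", info)
    return ("pass_through", None)
```

Retrying a 402 will not help until credits are added, so it is worth routing it to a different code path than 429 / 529.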

Exponential backoff

When InferAll returns 429 or 529, retry with exponential backoff and jitter. Prefer Retry-After when present.

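# Retry 429 / 529 responses from InferAll.
# Honor Retry-After when present, otherwise use exponential backoff with jitter.
# Uses the third-party `requests` library; any HTTP client works the same way.
import random
import time

import requests

MAX_RETRIES = 5

def post_with_retry(url, headers, body):
    for attempt in range(MAX_RETRIES + 1):
        response = requests.post(url, headers=headers, json=body)
        # Return on success, on non-retryable errors, or once retries are used up.
        if response.status_code not in (429, 529) or attempt == MAX_RETRIES:
            return response

        # Prefer the server-supplied delay (in seconds) when available.
        retry_after = response.headers.get("Retry-After", "")
        if retry_after.isdigit() and int(retry_after) > 0:
            delay_seconds = int(retry_after)
        else:
            delay_seconds = (2 ** attempt) + random.uniform(0, 1)
        time.sleep(delay_seconds)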

Recommended defaults: max 5 retries, base delay 1 second, multiplier 2, jitter up to 1 second. The OpenAI and Anthropic SDKs implement compatible behavior by default, so they can be used as drop-in clients against InferAll.
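
To make the recommended defaults concrete, the sketch below computes the wait before each retry. The function name backoff_delays and its parameters are hypothetical and chosen for illustration; only the default values come from the recommendation above.

```python
import random


def backoff_delays(max_retries=5, base_delay=1.0, multiplier=2.0,
                   max_jitter=1.0, rng=None):
    """Return the wait in seconds before each retry.

    Retry number `attempt` (starting at 0) waits
    base_delay * multiplier ** attempt, plus uniform jitter in [0, max_jitter].
    """
    rng = rng or random.Random()
    return [
        base_delay * multiplier ** attempt + rng.uniform(0, max_jitter)
        for attempt in range(max_retries)
    ]
```

With jitter disabled (max_jitter=0.0) the defaults give waits of 1, 2, 4, 8, and 16 seconds, 31 seconds in total. With up to 1 second of jitter per retry, the worst case is 36 seconds of waiting before the client gives up, not counting request time.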

Upstream provider limits

Individual upstream providers (Anthropic, OpenAI, Google, NVIDIA, Replicate, Runway) apply their own rate limits. InferAll's gateway watches for 429, 529, other 5xx responses, and 30-second timeouts, and transparently retries the same request against the next provider that can serve the requested model class. From your client's perspective, upstream rate limits typically do not surface as failures; see the SLA page for the full failover behavior.
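
The failover rule described above can be pictured with a small simulation. This is a conceptual sketch only and not InferAll's gateway code: should_fail_over, route_with_failover, the ordered provider list, and the send callable are all hypothetical, and the real provider selection logic is not described on this page.

```python
def should_fail_over(status_code):
    """True for the statuses named above: 429 and any 5xx (which includes 529)."""
    return status_code == 429 or 500 <= status_code <= 599


def route_with_failover(providers, send):
    """Try each provider in order until one returns a usable response.

    providers: ordered list of provider names able to serve the model class.
    send: callable(provider) -> (status_code, payload); it is assumed to raise
          TimeoutError when the provider exceeds the timeout.
    Returns (provider, status_code, payload) for the first usable response,
    or the last failed attempt if every provider failed (None if the list is empty).
    """
    last_attempt = None
    for provider in providers:
        try:
            status_code, payload = send(provider)
        except TimeoutError:
            last_attempt = (provider, None, None)
            continue
        if not should_fail_over(status_code):
            return (provider, status_code, payload)
        last_attempt = (provider, status_code, payload)
    return last_attempt
```

In this model a client only sees a 429 or 5xx when every candidate provider has failed, which matches the statement that upstream rate limits typically do not surface as failures.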

Need higher limits?

Contact us for the Enterprise tier — custom daily request limits, custom monthly included usage, priority routing, and negotiated SLAs are available.