Rate limits

InferAll applies rate limits at two layers: (1) per-tier token allowances enforced by the InferAll gateway, and (2) upstream provider rate limits that are transparently retried against alternate providers via automatic failover.

Per-tier token allowances

Token allowances are the monthly budgets enforced by InferAll itself. Upstream providers may apply additional per-second or burst limits.

Tier        Monthly allowance     Includes
Free        100k tokens / month   186 NVIDIA-hosted OSS models
Pro         2M tokens / month     All 255+ models, all providers
Team        10M tokens / month    Multiple keys, priority routing
Enterprise  Custom                Negotiated; contact sales

Per-second and requests-per-minute limits per tier will be documented here once finalized. Until then, treat the gateway as burst-tolerant and honor any Retry-After header you receive.

Response headers

Every successful response includes headers describing your remaining budget. Rate-limited responses include Retry-After.

X-RateLimit-Remaining   Tokens remaining in your current billing window.
X-RateLimit-Limit       Token allowance for your current tier.
X-RateLimit-Reset       Unix timestamp when the current window resets.
Retry-After             On 429 / 529 responses: seconds to wait before retrying.
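These headers can be read client-side to track your remaining budget. A minimal sketch, where the parse_rate_limit helper and the sample values are illustrative, not part of the API:

```python
from datetime import datetime, timezone

def parse_rate_limit(headers):
    """Summarize the X-RateLimit-* headers documented above."""
    remaining = int(headers.get("X-RateLimit-Remaining", "0"))
    limit = int(headers.get("X-RateLimit-Limit", "0"))
    # X-RateLimit-Reset is a Unix timestamp for the end of the window.
    resets_at = datetime.fromtimestamp(
        int(headers.get("X-RateLimit-Reset", "0")), tz=timezone.utc
    )
    used_fraction = 1 - remaining / limit if limit else 0.0
    return {
        "remaining": remaining,
        "limit": limit,
        "resets_at": resets_at,
        "used_fraction": used_fraction,
    }

# Example with the Free tier's 100k-token allowance (values are made up):
info = parse_rate_limit({
    "X-RateLimit-Remaining": "75000",
    "X-RateLimit-Limit": "100000",
    "X-RateLimit-Reset": "1735689600",
})
print(info["used_fraction"])  # 0.25
```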

Exponential backoff

When InferAll returns 429 or 529, retry with exponential backoff and jitter. Prefer Retry-After when present.

# 429 / 529 response from InferAll
# Honor Retry-After when present, otherwise use exponential backoff.

import random
import time

import requests

MAX_RETRIES = 5

def jitter():
    # Up to 1 second of random jitter (the recommended default below).
    return random.uniform(0, 1)

def post_with_backoff(url, h, body):
    response = None
    for attempt in range(MAX_RETRIES):
        response = requests.post(url, headers=h, json=body)
        if response.status_code not in (429, 529):
            return response

        # Prefer the server-supplied delay when available.
        delay_seconds = (
            int(response.headers.get("Retry-After", "0"))
            or (2 ** attempt) + jitter()
        )
        time.sleep(delay_seconds)
    return response  # last 429/529 response after exhausting retries

Recommended defaults: max 5 retries, base delay 1 second, multiplier 2, jitter up to 1 second. The OpenAI and Anthropic SDKs implement compatible retry behavior by default, so using them as drop-in clients against InferAll works out of the box.
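As a concrete sketch of the drop-in pattern, the OpenAI SDK can be pointed at the gateway; the base URL and environment-variable name below are assumptions for illustration, not documented values:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferall.example/v1",  # hypothetical gateway endpoint
    api_key=os.environ["INFERALL_API_KEY"],      # hypothetical env var
    max_retries=5,  # matches the recommended default above
)
```

The SDK's built-in retry logic honors Retry-After and applies exponential backoff with jitter, so no extra retry code is needed on top of it.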

Upstream provider limits

Individual upstream providers (Anthropic, OpenAI, Google, NVIDIA, Replicate, Runway) apply their own rate limits. InferAll's gateway watches for 429, 529, and 5xx responses and 30-second timeouts, and transparently retries the same request against the next provider that can serve the requested model class. From your client's perspective, upstream rate limits typically do not surface as failures — see the SLA page for the full failover behavior.
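The failover conditions above can be summarized as a predicate; this is an illustrative sketch of the described behavior, not InferAll's actual gateway code:

```python
def should_fail_over(status_code=None, timed_out=False):
    """True if, per the docs above, the gateway would retry on another provider."""
    if timed_out:  # upstream exceeded the 30-second timeout
        return True
    if status_code in (429, 529):  # upstream rate limit or overload
        return True
    if status_code is not None and 500 <= status_code < 600:  # 5xx server errors
        return True
    return False

print(should_fail_over(status_code=429))  # True
print(should_fail_over(status_code=200))  # False
```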

Need higher limits?

Contact us for the Enterprise tier — custom token allowances, priority routing, and negotiated limits are available.