Rate limits
InferAll applies rate limits at two layers: (1) per-tier token allowances enforced by the InferAll gateway, and (2) upstream provider rate limits, which the gateway handles transparently by retrying the request against an alternate provider (automatic failover).
Per-tier token allowances
Token allowances are the monthly budgets enforced by InferAll itself. Upstream providers may apply additional per-second or burst limits.
Per-second and requests-per-minute limits per tier will be documented here once they are finalized. Until then, treat the gateway as burst-tolerant and honor any Retry-After header you receive.
Response headers
Every successful response includes headers describing your remaining budget. Rate-limited responses include Retry-After.
X-RateLimit-Remaining: Tokens remaining in your current billing window.
X-RateLimit-Limit: Token allowance for your current tier.
X-RateLimit-Reset: Unix timestamp when the current window resets.
Retry-After: On 429 / 529 responses, seconds to wait before retrying.
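For example, a client can track its remaining budget by reading these headers after each call. A minimal sketch, assuming a requests-style response object; the read_budget helper is illustrative, not part of any SDK.

from datetime import datetime, timezone

def read_budget(response):
    # Parse InferAll budget headers; fall back to 0 if a header is absent.
    headers = response.headers
    return {
        "remaining": int(headers.get("X-RateLimit-Remaining", "0")),
        "limit": int(headers.get("X-RateLimit-Limit", "0")),
        "resets_at": datetime.fromtimestamp(
            int(headers.get("X-RateLimit-Reset", "0")), tz=timezone.utc
        ),
    }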
Exponential backoff
When InferAll returns 429 or 529, retry with exponential backoff and jitter. Prefer Retry-After when present.
# 429 / 529 response from InferAll
# Honor Retry-After when present, otherwise use exponential backoff with jitter.
import random
import time

MAX_RETRIES = 5

def post_with_backoff(http, url, headers, body):
    attempt = 0
    while attempt < MAX_RETRIES:
        response = http.post(url, headers=headers, json=body)
        if response.status_code not in (429, 529):
            return response
        # Prefer the server-supplied delay when available.
        delay_seconds = (
            int(response.headers.get("Retry-After", "0"))
            or (2 ** attempt) + random.random()  # jitter of up to 1 second
        )
        time.sleep(delay_seconds)
        attempt += 1
    return response  # retries exhausted; surface the last 429 / 529

Recommended defaults: maximum 5 retries, base delay 1 second, multiplier 2, jitter up to 1 second. The OpenAI and Anthropic SDKs implement compatible retry behavior by default, so using them as drop-in clients against InferAll works with no extra configuration.
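As an illustration, here is a minimal sketch of pointing the official OpenAI Python SDK at InferAll. The base URL and API key below are placeholders, not documented InferAll values; the SDK's built-in max_retries backoff handles 429 responses.

from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-INFERALL-GATEWAY/v1",  # placeholder, substitute your gateway URL
    api_key="YOUR_INFERALL_API_KEY",              # placeholder key
    max_retries=5,  # the SDK backs off with jitter on 429 responses by default
)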
Upstream provider limits
Individual upstream providers (Anthropic, OpenAI, Google, NVIDIA, Replicate, Runway) apply their own rate limits. InferAll's gateway watches for 429, 529, 5xx, and 30-second timeouts and transparently retries the same request against the next provider that can serve the requested model class. From your client's perspective, upstream rate limits typically do not surface as failures; see the SLA page for the full failover behavior.
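For illustration only, that failover behaves roughly like the sketch below. This is not InferAll's actual implementation; the provider.send interface is a hypothetical stand-in, and the routing order is simplified.

RETRIABLE_STATUSES = {429, 529}
UPSTREAM_TIMEOUT_SECONDS = 30

def route_with_failover(request, providers):
    # providers: upstreams able to serve the requested model class, in routing order.
    for provider in providers:
        try:
            response = provider.send(request, timeout=UPSTREAM_TIMEOUT_SECONDS)
        except TimeoutError:
            continue  # upstream timed out: fail over to the next provider
        if response.status_code in RETRIABLE_STATUSES or response.status_code >= 500:
            continue  # rate-limited or 5xx upstream: fail over to the next provider
        return response  # success: the upstream limit never surfaces to the client
    raise RuntimeError("all upstream providers exhausted")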
Need higher limits?
Contact us for the Enterprise tier — custom token allowances, priority routing, and negotiated limits are available.