
Fix: Retry Storm (429)

A retry storm occurs when an agent responds to 429 rate-limit errors by immediately retrying, creating a feedback loop that amplifies load instead of recovering. Left unchecked, this can lock your agent out of the API for extended periods and degrade service for other consumers on shared infrastructure.

Symptoms

A storming agent shows a sudden burst of near-identical requests after a single 429, a rising share of 429 responses, little or no delay between attempts, and eventually an extended lockout from the API.

Root causes

Retry storms typically start when an agent lacks proper backoff logic and retries immediately on any failure. Contributing factors include: no maximum retry cap, ignoring the Retry-After header, multiple concurrent operations all retrying independently, and no circuit breaker to halt calls after repeated failures.
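
One way to address the "multiple concurrent operations all retrying independently" factor is to share backoff state, so that a 429 seen by any worker pauses all of them. The sketch below is illustrative; the `SharedCooldown` class and its method names are assumptions, not part of any SDK:

```python
import threading
import time

class SharedCooldown:
    """Shared cooldown: when one worker sees a 429, all workers pause.

    A minimal sketch; the class name and API are illustrative.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._resume_at = 0.0  # monotonic timestamp when requests may resume

    def report_429(self, retry_after):
        # Extend the cooldown window; never shorten it
        with self._lock:
            self._resume_at = max(self._resume_at,
                                  time.monotonic() + retry_after)

    def wait_if_needed(self):
        # Block the calling worker until the shared window has passed
        with self._lock:
            pause = self._resume_at - time.monotonic()
        if pause > 0:
            time.sleep(pause)
```

Each worker calls `wait_if_needed()` before sending and `report_429(retry_after)` when rate-limited, so a single 429 throttles the whole pool instead of one thread.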

Step-by-step fix

  1. Stop all retry loops immediately. If your agent is actively storming, kill the process or disable the retry logic. The priority is stopping the flood of requests.
  2. Read the Retry-After header. Check the last 429 response for the Retry-After header value (in seconds). Do not send any request until this period has elapsed.
  3. Implement exponential backoff with jitter. Replace fixed-interval retries with exponential backoff: start at 1s, double on each failure (2s, 4s, 8s...), and add random jitter of 0-50% to avoid synchronized retries across agents.
  4. Cap maximum retries. Set a hard limit of 3-5 retries per operation. After exhausting retries, fail the operation gracefully and log the error for manual review.
  5. Add a circuit breaker. After N consecutive failures (e.g. 3), stop all API calls for a cooldown period (e.g. 60 seconds). Only resume after the cooldown and then with a single probe request.
  6. Prioritize essential operations during recovery. When rate-limited, queue only critical operations (heartbeats, urgent tool calls) and defer non-essential work until the rate limit window resets.
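
Step 5 can be sketched as a minimal circuit breaker. The class below is illustrative (the threshold and cooldown defaults mirror the values suggested above, but the names are assumptions, not a library API):

```python
import time

class CircuitBreaker:
    """Open the circuit after N consecutive failures; probe after a cooldown.

    A minimal sketch; thresholds and names are illustrative.
    """
    def __init__(self, failure_threshold=3, cooldown=60.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # After the cooldown elapses, let a probe request through
        return time.monotonic() - self.opened_at >= self.cooldown

    def record_success(self):
        # A successful probe closes the circuit again
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

Wrap every API call in `allow_request()`; when it returns False, skip the call entirely rather than queueing a retry.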

Code example

Python retry logic with exponential backoff and Retry-After header support:

import time, random, requests

def call_with_backoff(url, payload, max_retries=4):
    delay = 1.0  # base delay in seconds; doubles after each 429
    for attempt in range(max_retries):
        resp = requests.post(url, json=payload)
        if resp.status_code != 429:
            return resp

        # Respect Retry-After header if present
        retry_after = resp.headers.get("Retry-After")
        if retry_after:
            wait = float(retry_after)
        else:
            wait = delay
            delay = min(delay * 2, 60)  # exponential growth, capped at 60s

        jitter = wait * random.uniform(0, 0.5)  # 0-50% random jitter
        print(f"429 received. Waiting {wait + jitter:.1f}s...")
        time.sleep(wait + jitter)

    raise RuntimeError("Max retries exceeded")

You can also check rate-limit headers proactively before hitting the limit:

# Check remaining quota before sending a request
curl -s -o /dev/null -w "%{http_code}" \
  -D - https://api.delx.ai/v1/a2a \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"message/send","params":{"message":{"role":"user","parts":[{"type":"text","text":"ping"}]}},"id":1}' \
  | grep -i "x-ratelimit"

# Example response headers:
# X-RateLimit-Limit: 60
# X-RateLimit-Remaining: 45
# Retry-After: 0
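
The same proactive check works in code before each send. The header name below matches the example response above; `should_throttle` and its `reserve` threshold are illustrative helpers, not part of any SDK:

```python
def remaining_quota(headers):
    """Parse X-RateLimit-Remaining from a response's headers, if present."""
    value = headers.get("X-RateLimit-Remaining")
    return int(value) if value is not None else None

def should_throttle(headers, reserve=5):
    """Return True when remaining quota drops below a safety reserve."""
    remaining = remaining_quota(headers)
    return remaining is not None and remaining < reserve
```

When `should_throttle` returns True, defer non-essential work (step 6) instead of spending the last few requests in the window.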

Prevention

Prevention amounts to shipping the controls above before the first storm: make exponential backoff with jitter the default retry policy, enforce a hard retry cap, add a circuit breaker, and honor Retry-After and the X-RateLimit-* headers on every call rather than only after an incident.

Validation

After implementing backoff, track 429 frequency and response latency percentiles over one full operational cycle (24 hours minimum). Confirm that: (1) 429 responses are followed by appropriate delays, not immediate retries, (2) the agent eventually recovers and resumes normal operation, and (3) CPU/memory usage remains stable during rate-limited periods. If 429s still cluster, increase your base delay or reduce concurrency further.
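
The two signals the validation step tracks can be computed from a simple request log. The sketch below assumes a log of `(status_code, latency_seconds)` tuples; the function name and log format are illustrative:

```python
import statistics

def rate_limit_health(samples):
    """Summarize a window of (status_code, latency_seconds) samples.

    Returns the 429 rate and the p95 latency, the two signals tracked
    during validation. A minimal sketch; the log format is assumed.
    """
    statuses = [status for status, _ in samples]
    latencies = sorted(latency for _, latency in samples)
    rate_429 = statuses.count(429) / len(statuses)
    # quantiles(n=20) splits into 5% slices; the last cut point is p95
    p95 = statistics.quantiles(latencies, n=20)[-1]
    return {"rate_429": rate_429, "p95_latency": p95}
```

If `rate_429` stays high or `p95_latency` climbs during rate-limited periods, increase the base delay or reduce concurrency as described above.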
