Webhook Retry Strategies (2026): Idempotency, Backoff, Dead Letters
Your webhook handler will fail. The receiving service will be momentarily down, the database will be slow, the network will hiccup, your code will throw a transient error. The question isn't "will retries happen?" — it's "are you ready when they do?"
This guide covers the four pillars of a resilient webhook receiver: idempotency, safe retry semantics on the receiver side, understanding the sender's retry policy, and dead-letter handling for the requests that never succeed. Code examples are in Node.js (Express + Postgres), but the patterns are language-agnostic.
Quick recipe: dedupe by event ID before doing real work, return 2xx fast, treat duplicates as no-ops, and set up a dead-letter queue for events that fail too many times. The rest is sender-specific tuning.
Why retries are unavoidable
Webhook senders (Stripe, GitHub, Shopify, etc.) decide an event was "delivered" based on whether your endpoint returned a 2xx HTTP status. A 5xx, a timeout, a TCP reset, your laptop closing mid-deploy: each counts as a "failure," and the sender will try again, often aggressively. (A 4xx is also a failure, but some senders treat it as terminal rather than retrying; see "Designing for the sender, not against it" below.)
This means your handler is going to see the same event multiple times. Sometimes 2-3 times during a routine outage; up to 17 times for Stripe over 3 days; up to 50 times for GitHub. If your code charges a credit card or sends an email, naïve handling = duplicate charges, duplicate emails, angry customers.
The good news: the fix is mostly mechanical. Once you have idempotency-by-event-ID, retries become benign.
Pillar 1: Idempotency by event ID
Every webhook payload from a serious provider includes a unique event ID:
| Provider | Event ID field | Format |
|---|---|---|
| Stripe | id (top-level) | evt_1ABC... |
| GitHub | X-GitHub-Delivery (header) | UUID v4 |
| Shopify | X-Shopify-Webhook-Id (header) | UUID v4 |
| Slack | X-Slack-Request-Timestamp + body hash | composite |
| Square | event_id (in body) | UUID v4 |
| HubSpot | eventId (per event in array) | numeric |
| SendGrid | sg_event_id (per event) | base64 |
The pattern: persist the event ID before doing real work, in a unique-indexed table. If the insert fails because the ID already exists, you've seen this event before — return 200 OK and do nothing.
CREATE TABLE processed_webhook_events (
  event_id    TEXT PRIMARY KEY,
  source      TEXT NOT NULL,  -- 'stripe', 'github', etc.
  received_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
// Express + node-postgres
// verifyStripeSignature and handleStripeEvent are app-specific helpers, not shown here.
import express from "express";
import { Pool } from "pg";

const pool = new Pool();
const app = express();

app.post(
  "/webhooks/stripe",
  express.raw({ type: "application/json" }),
  async (req, res) => {
    // 1. Verify signature first (always; never skip).
    const event = verifyStripeSignature(req); // throws if invalid
    // 2. Try to record the event ID. Unique constraint = dedupe.
    try {
      await pool.query(
        "INSERT INTO processed_webhook_events (event_id, source) VALUES ($1, $2)",
        [event.id, "stripe"],
      );
    } catch (err) {
      // 23505 is Postgres's unique_violation error code.
      if ((err as { code?: string }).code === "23505") {
        // Duplicate: Stripe is retrying. We already handled this event.
        return res.json({ received: true, duplicate: true });
      }
      throw err;
    }
    // 3. Now do real work. If this throws, the row stays in the table
    //    but the event is unprocessed. See "transactional handlers" below.
    await handleStripeEvent(event);
    res.json({ received: true });
  },
);
This is the most important pattern in the entire guide. Get this right and retries become free.
Transactional handlers (the next subtle bug)
The simple version above has a gap: if handleStripeEvent throws after the event ID is recorded, retries see "duplicate" and skip the event, but the work never happened. Two fixes:
Option A — Mark events as pending then processed. Use a status column instead of pure existence:
ALTER TABLE processed_webhook_events
  ADD COLUMN status TEXT NOT NULL DEFAULT 'pending',
  ADD COLUMN processed_at TIMESTAMPTZ;
On retry, if a row exists with status='pending', the previous attempt either died mid-flight or is still running. Once you are confident it is dead, pick up the work and re-run it. If status='processed', return 200 immediately.
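The pending/processed flow can be sketched as a small decision function. This is a minimal sketch rather than a definitive implementation: `decideAction`, the `defer` outcome, and the five-minute staleness threshold are assumptions added for illustration, to separate an attempt that died from one that is still in progress.

```typescript
type EventStatus = "pending" | "processed";
type Action = "process" | "resume" | "defer" | "skip";

interface ExistingRow {
  status: EventStatus;
  receivedAt: Date;
}

// Hypothetical threshold: a 'pending' row older than this is treated as abandoned.
const STALE_PENDING_MS = 5 * 60 * 1000;

/**
 * Decide what to do with an incoming delivery.
 * `existing` is the row already stored for this event ID, or undefined when
 * the INSERT of a fresh 'pending' row succeeded (first delivery).
 */
function decideAction(
  existing: ExistingRow | undefined,
  now: Date = new Date(),
  stalePendingMs: number = STALE_PENDING_MS,
): Action {
  if (existing === undefined) return "process"; // first time we see this event
  if (existing.status === "processed") return "skip"; // done: answer 200, do nothing
  const ageMs = now.getTime() - existing.receivedAt.getTime();
  // 'pending' and old: the previous attempt most likely died mid-flight.
  // 'pending' and recent: another attempt may still be running, so back off.
  return ageMs >= stalePendingMs ? "resume" : "defer";
}
```

On `process` or `resume`, run the work and then set `status = 'processed'` and `processed_at = now()`. On `defer`, a 5xx response lets the sender try again later. Re-running external calls on `resume` is only safe when those calls are themselves idempotent.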
Option B — Wrap the insert + work in a single DB transaction. If the work throws, the insert rolls back, and the next retry sees no row. This is the cleanest pattern when your business logic is also DB-bound:
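// A transaction must run on ONE connection. pool.query() may use a different
// client for each call, so check out a dedicated client first.
const client = await pool.connect();
try {
  await client.query("BEGIN");
  await client.query(
    "INSERT INTO processed_webhook_events (event_id, source) VALUES ($1, $2)",
    [event.id, "stripe"],
  );
  // The handler must do its DB writes through the same client to join the txn.
  await handleStripeEvent(event, client);
  await client.query("COMMIT");
} catch (err) {
  await client.query("ROLLBACK");
  if ((err as { code?: string }).code === "23505") {
    return res.json({ received: true, duplicate: true });
  }
  throw err; // sender will retry; the rollback removed the dedupe row
} finally {
  client.release(); // always return the client to the pool
}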
Option B is the right answer when handlers stay inside one database. Option A is necessary when your handler does external API calls (sending an email, calling Slack, etc.) that can't be rolled back.
Pillar 2: Return 2xx fast — defer work to a queue
Most providers time out at 5-30 seconds. If you do all your processing inline, every slow handler == retry storm. The solution: acknowledge fast, work asynchronously.
app.post(
  "/webhooks/stripe",
  express.raw({ type: "application/json" }), // signature check needs the raw body
  async (req, res) => {
    const event = verifyStripeSignature(req);
    // Sync path: verify + enqueue + return. Dedupe here or in the worker.
    await enqueueForProcessing(event); // e.g. BullMQ / SQS / DB-backed jobs
    res.status(200).json({ received: true });
  },
);
// Worker process: handles the actual logic, retries on its own schedule.
// `worker.process` is schematic; the exact call depends on your queue library.
worker.process("stripe-events", async (job) => {
  await handleStripeEvent(job.data);
});
Trade-off: now you have two retry layers (the sender's, and your worker's). Make sure your worker also dedupes by event ID before doing real work — same pattern as Pillar 1.
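The worker's "own schedule" is commonly exponential backoff with jitter. Here is a minimal sketch, assuming a 1-second base delay and a 5-minute cap; the function name, the defaults, and the choice of "full jitter" are illustrative assumptions, not something a particular queue library prescribes.

```typescript
interface BackoffOptions {
  baseMs?: number; // ceiling of the delay before the first retry
  capMs?: number; // upper bound on any single delay
  random?: () => number; // injectable for tests; normally returns a value in [0, 1)
}

/**
 * Delay before retry number `attempt` (0-based), using exponential backoff
 * with full jitter: a random value between 0 and min(cap, base * 2^attempt).
 * Jitter spreads retries out so failed jobs do not all wake up at once.
 */
function backoffDelayMs(attempt: number, opts: BackoffOptions = {}): number {
  const { baseMs = 1000, capMs = 300000, random = Math.random } = opts;
  const ceiling = Math.min(capMs, baseMs * 2 ** Math.max(0, attempt));
  return Math.floor(random() * ceiling);
}
```

Many queue libraries ship their own backoff settings; a hand-rolled function like this is mainly useful for DB-backed job tables where you compute the next run time yourself.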
Pillar 3: Understand each sender's retry policy
You can't tune your dead-letter strategy without knowing the upstream retry budget. The current (2026) policies:
| Provider | Retry attempts | Backoff schedule | Total window |
|---|---|---|---|
| Stripe | Up to 17 | Exponential, ~immediately → ~3 days | 3 days |
| GitHub | Up to 50 | Exponential | ~8 hours |
| Shopify | Up to 19 | Exponential, ~hours apart | 48 hours |
| Slack | Up to 3 | 1 min, 5 min, 30 min | ~36 min |
| Twilio | Configurable (3 default) | Exponential per Webhook config | varies |
| Square | Up to 70 (!) | Exponential, ~immediately → ~72 hours | 72 hours |
| HubSpot | Up to 10 | Exponential | ~8 hours |
| SendGrid | Up to ~24 hours of retries | Exponential | 24 hours |
Two things this table tells you:
- The retry windows are LONG. If Stripe gives you 3 days and Square gives you 72 hours, your handler stability matters over days, not seconds. A "blip" outage that lasts 30 minutes will resolve itself before any of these senders give up.
- Slack is the outlier. ~36 minutes of retries means a longer outage drops Slack events on the floor. If Slack signals are critical to your app, you need defensive replay tooling.
Source: each provider's published retry docs as of 2026-04. Re-verify before quoting in production.
Designing for the sender, not against it
A common anti-pattern: returning 4xx for "expected" failures (like a duplicate event you don't care to process). Stripe and most others stop retrying on 4xx; they treat it as "your endpoint rejected the event, that's terminal."
The right responses:
- 2xx: "I have this. Please don't retry." Use even when you're skipping a duplicate or ignoring an event type.
- 5xx or timeout: "I'm broken. Please retry." Use for transient infra problems.
- 4xx: "Don't ever try again." Reserve for malformed requests or signature failures — explicit "stop retrying" intent.
If your endpoint returns 4xx during a partial outage, you'll silently lose events you actually wanted.
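One way to keep these rules consistent is to route every handler result through a single mapping. The sketch below is illustrative only: the `Outcome` names and the choice of 400 and 503 as the specific codes are assumptions, not part of any provider's contract.

```typescript
type Outcome =
  | "processed"
  | "duplicate"
  | "ignored_event_type"
  | "invalid_signature"
  | "malformed_payload"
  | "transient_failure";

/** Map a handler outcome to the HTTP status that tells the sender what to do next. */
function statusForOutcome(outcome: Outcome): number {
  switch (outcome) {
    case "processed":
    case "duplicate":
    case "ignored_event_type":
      return 200; // "I have this. Please don't retry."
    case "invalid_signature":
    case "malformed_payload":
      return 400; // explicit "stop retrying" intent
    case "transient_failure":
      return 503; // "I'm broken. Please retry."
  }
}
```

Centralising the decision makes it harder for a transient database error on the duplicate path to leak out as a 4xx by accident.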
Pillar 4: Dead-letter handling
Eventually some events fail every retry. Maybe a customer was deleted between the event firing and your retry, maybe a downstream API changed schemas. You need:
- A dead-letter table that captures fully-failed events.
- An alert when something lands there.
- A manual replay path to reprocess after fixing the bug.
CREATE TABLE webhook_dead_letters (
  id          BIGSERIAL PRIMARY KEY,
  source      TEXT NOT NULL,
  event_id    TEXT NOT NULL,
  raw_headers JSONB NOT NULL,
  raw_body    BYTEA NOT NULL,  -- preserve EXACTLY what arrived
  last_error  TEXT NOT NULL,
  attempts    INT NOT NULL,
  received_at TIMESTAMPTZ NOT NULL,
  resolved_at TIMESTAMPTZ
);
Critical: store the raw bytes of the request, not the parsed JSON. When you fix the handler and want to replay, you need the exact bytes the signature was computed over — otherwise verification fails and you can't retry it cleanly.
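Writing a row into this table might look like the following sketch. The `FailedDelivery` shape, the helper names, and the default of 5 worker attempts are hypothetical; only the column names come from the schema above.

```typescript
interface FailedDelivery {
  source: string;
  eventId: string;
  rawHeaders: Record<string, string | string[] | undefined>;
  rawBody: Uint8Array; // exact bytes as received (a Node Buffer qualifies)
  lastError: unknown;
  attempts: number;
  receivedAt: Date;
}

const DEAD_LETTER_INSERT = `
  INSERT INTO webhook_dead_letters
    (source, event_id, raw_headers, raw_body, last_error, attempts, received_at)
  VALUES ($1, $2, $3, $4, $5, $6, $7)`;

/** True once the worker has used up its own retry budget (assumed default: 5). */
function shouldDeadLetter(attempts: number, maxAttempts: number = 5): boolean {
  return attempts >= maxAttempts;
}

/** Build the positional parameters for DEAD_LETTER_INSERT. */
function deadLetterParams(d: FailedDelivery): unknown[] {
  const message =
    d.lastError instanceof Error ? d.lastError.message : String(d.lastError);
  return [
    d.source,
    d.eventId,
    JSON.stringify(d.rawHeaders), // serialized for the JSONB column
    d.rawBody, // bytes are passed through untouched, never re-serialized
    message,
    d.attempts,
    d.receivedAt,
  ];
}
```

With node-postgres this would be used as `await pool.query(DEAD_LETTER_INSERT, deadLetterParams(delivery))` once `shouldDeadLetter` returns true.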
Once a row lands here, alert (Slack, PagerDuty, email — whatever you use). Manual replay is then:
// `db` is assumed to be a pg-promise database object (oneOrNone / none);
// processWebhook is the app's shared verify-and-handle entry point.
async function replayDeadLetter(id: number) {
  const row = await db.oneOrNone(
    "SELECT * FROM webhook_dead_letters WHERE id = $1",
    [id],
  );
  if (!row) throw new Error("not found");
  // Replay through the same handler: your idempotency table
  // ensures we don't double-process if it ALSO hit the live path.
  await processWebhook(row.source, row.raw_headers, row.raw_body);
  await db.none(
    "UPDATE webhook_dead_letters SET resolved_at = now() WHERE id = $1",
    [id],
  );
}
Test your retry handling without waiting for production
The hardest part of this whole architecture is testing. Real Stripe retries happen on Stripe's schedule, days apart. You can't reliably write an integration test against "what happens on the 4th retry."
Two patterns that work:
Use HookRay to capture and replay
HookRay gives you a public webhook URL that captures every incoming event. Once you've captured a real event, you can replay it manually as many times as you want, simulating retry behavior:
- Get a free HookRay URL (no signup required for the first 100 captures).
- Point Stripe / GitHub / Shopify at the URL in their dashboard.
- Trigger a real event (test mode is fine).
- From HookRay's UI, click "Replay" to re-send the captured webhook to your local handler (localhost:3000 via tunnel, or HookRay Pro forwards directly).
- Click Replay 5 times in a row to verify your idempotency table catches duplicates.
This is the fastest "did my retry handling actually work?" loop. See the Webhook.site → HookRay migration guide if you're moving from a tool that doesn't support replay.
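The same "replay it five times" check can be scripted. The helper below is a generic sketch, not HookRay's API: `replayDelivery` and the injected `send` function are hypothetical names, and `send` is expected to POST the captured bytes to your handler and resolve with the HTTP status.

```typescript
type Send = (
  body: Uint8Array,
  headers: Record<string, string>,
) => Promise<number>;

interface ReplayReport {
  statuses: number[];
  allAcknowledged: boolean; // true when every replay returned a 2xx
}

/** Re-send one captured delivery `times` times, sequentially, as a sender's retries would arrive. */
async function replayDelivery(
  send: Send,
  body: Uint8Array,
  headers: Record<string, string>,
  times: number = 5,
): Promise<ReplayReport> {
  const statuses: number[] = [];
  for (let i = 0; i < times; i++) {
    statuses.push(await send(body, headers));
  }
  return {
    statuses,
    allAcknowledged: statuses.every((s) => s >= 200 && s < 300),
  };
}
```

A passing run needs two things: every replay is acknowledged with a 2xx, and the side effect (the charge, the email, the row in processed_webhook_events) happened exactly once.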
Provider CLIs
Stripe and GitHub both ship CLI tools that forward real events to localhost:
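# Stripe: forward events to localhost; replay a past event with `stripe events resend evt_xxx`
stripe listen --forward-to localhost:3000/webhooks/stripe
# GitHub: needs the cli/gh-webhook extension; --events is required
gh webhook forward --repo=owner/repo --events=push --url=http://localhost:3000/webhooks/github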
These are great for the happy path but they don't simulate the sender's retry storm — for that you need replay (HookRay or roll your own).
Common retry bugs (and how to spot them)
Bug 1: Returning 200 too early. You return 200 then crash before persisting the event. Sender thinks it's done; you lost the data. Fix: persist (or enqueue) before returning 200.
Bug 2: Idempotency on the wrong key. Using a synthetic key (like ${customer_id}_${event_type}) instead of the provider's event ID. Two distinct events with the same composite key collide; legitimate events get dropped as "duplicates." Fix: always dedupe on the provider's event ID.
Bug 3: Returning 4xx for expected duplicates. This stops retries, which sounds good — until you realize that all transient errors during the duplicate path also become 4xx. You silently break legitimate retry. Fix: return 200 OK with {duplicate: true} body for known duplicates; reserve 4xx for truly malformed requests.
Bug 4: Inline external API calls. Your handler calls Stripe's API to fetch related data, the call hangs for 30 seconds, the webhook times out, the sender retries, your handler hangs again, your queue fills up. Fix: enqueue + ack fast (Pillar 2).
Bug 5: Lost events during deploys. Your handler is mid-processing when the container is replaced. The event gets a 5xx (or worse, a half-completed write) and the sender retries. Without graceful shutdown handling, the handler can die after recording the event ID but before finishing the work, so the retry is dropped as a duplicate. Fix: drain the in-flight queue before deploying, OR use the pending status pattern from Option A above.
Summary checklist
Before declaring your webhook handler "production-ready," verify:
- Every handler dedupes by the provider's event ID
- The dedupe table has a UNIQUE constraint on event_id
- Either Option A (pending → processed) or Option B (transactional) is in use
- Handlers return 2xx on duplicates, NOT 4xx
- Long-running work is enqueued, not done inline
- You have a dead-letter table that stores raw headers + raw body
- You alert on dead-letter inserts (Slack/PagerDuty/email)
- You have a tested manual-replay path
- Your retry handling has been verified with HookRay replay or a similar tool
Related guides:
- Webhook Signature Verification (HMAC-SHA256) — the security pillar that pairs with this reliability pillar
- Stripe Webhook Best Practices — Stripe-specific deeper dive
- The 7 Best Webhook Testing Tools — tooling overview including replay support
- Hookdeck vs HookRay — when to graduate from a testing tool to production webhook infrastructure