
Webhook Retry Strategies (2026): Idempotency, Backoff, Dead Letters

Your webhook handler will fail. The receiving service will be momentarily down, the database will be slow, the network will hiccup, your code will throw a transient error. The question isn't "will retries happen?" — it's "are you ready when they do?"

This guide covers the four pillars of a resilient webhook receiver: idempotency, safe retry semantics on the receiver side, understanding the sender's retry policy, and dead-letter handling for the requests that never succeed. Code examples are in Node.js (Express + Postgres), but the patterns are language-agnostic.

Quick recipe: dedupe by event ID before doing real work, return 2xx fast, treat duplicates as no-ops, and set up a dead-letter queue for events that fail too many times. The rest is sender-specific tuning.

Why retries are unavoidable

Webhook senders (Stripe, GitHub, Shopify, etc.) decide an event was "delivered" based on whether your endpoint returned a 2xx HTTP status. Anything else — 4xx, 5xx, timeout, TCP reset, your laptop closed mid-deploy — is a "failure" and the sender will try again, often aggressively.

This means your handler is going to see the same event multiple times. Sometimes 2-3 times during a routine outage; up to 17 times for Stripe over 3 days; up to 50 times for GitHub. If your code charges a credit card or sends an email, naïve handling = duplicate charges, duplicate emails, angry customers.

The good news: the fix is mostly mechanical. Once you have idempotency-by-event-ID, retries become benign.

Pillar 1: Idempotency by event ID

Every webhook payload from a serious provider includes a unique event ID:

Provider | Event ID field | Format
---------|----------------|-------
Stripe   | id (top-level) | evt_1ABC...
GitHub   | X-GitHub-Delivery (header) | UUID v4
Shopify  | X-Shopify-Webhook-Id (header) | UUID v4
Slack    | X-Slack-Request-Timestamp + body hash | composite
Square   | event_id (in body) | UUID v4
HubSpot  | eventId (per event in array) | numeric
SendGrid | sg_event_id (per event) | base64
The pattern: persist the event ID before doing real work, in a unique-indexed table. If the insert fails because the ID already exists, you've seen this event before — return 200 OK and do nothing.

CREATE TABLE processed_webhook_events (
  event_id  TEXT PRIMARY KEY,
  source    TEXT NOT NULL,           -- 'stripe', 'github', etc.
  received_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

// Express + node-postgres
import express from "express";
import { Pool } from "pg";

const pool = new Pool();
const app = express();

app.post(
  "/webhooks/stripe",
  express.raw({ type: "application/json" }),
  async (req, res) => {
    // 1. Verify signature first (always — never skip).
    const event = verifyStripeSignature(req); // throws if invalid

    // 2. Try to record the event ID. Unique constraint = dedupe.
    try {
      await pool.query(
        "INSERT INTO processed_webhook_events (event_id, source) VALUES ($1, $2)",
        [event.id, "stripe"],
      );
    } catch (err) {
      if ((err as { code?: string }).code === "23505") {
        // Duplicate — Stripe is retrying. We already handled this event.
        return res.json({ received: true, duplicate: true });
      }
      throw err;
    }

    // 3. Now do real work. If this throws, the row stays in the table
    //    but the event is unprocessed. See "transactional handlers" below.
    await handleStripeEvent(event);

    res.json({ received: true });
  },
);

This is the most important pattern in the entire guide. Get this right and retries become free.

Transactional handlers (the next subtle bug)

The simple version above has a failure window: if handleStripeEvent throws after we recorded the event ID, retries see "duplicate" and skip the event even though the work never happened. Two fixes:

Option A — Mark events as pending then processed. Use a status column instead of pure existence:

ALTER TABLE processed_webhook_events
  ADD COLUMN status TEXT NOT NULL DEFAULT 'pending',
  ADD COLUMN processed_at TIMESTAMPTZ;

On retry, if a row exists with status='pending', you know the previous attempt died mid-flight. Pick up the work and re-run it. If status='processed', return 200 immediately.
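Option A's retry path can be sketched as one function. This is illustrative, not a drop-in: processOnce and QueryFn are made-up names, `query` stands in for your pg client, and `handle` is the real business logic, injected so the flow is visible without a live database.

```typescript
// Hypothetical sketch of the pending → processed flow from Option A.
type QueryFn = (sql: string, params: unknown[]) => Promise<{ rows: any[] }>;

async function processOnce(
  eventId: string,
  query: QueryFn,
  handle: () => Promise<void>,
): Promise<"processed" | "duplicate"> {
  // Claim the event. ON CONFLICT DO NOTHING is safe under concurrent retries.
  await query(
    `INSERT INTO processed_webhook_events (event_id, source)
     VALUES ($1, 'stripe') ON CONFLICT (event_id) DO NOTHING`,
    [eventId],
  );
  const { rows } = await query(
    "SELECT status FROM processed_webhook_events WHERE event_id = $1",
    [eventId],
  );
  if (rows[0]?.status === "processed") return "duplicate";

  // status = 'pending': either the first delivery, or a retry after a
  // previous attempt crashed mid-work. Re-run the work.
  await handle();

  await query(
    `UPDATE processed_webhook_events
     SET status = 'processed', processed_at = now()
     WHERE event_id = $1`,
    [eventId],
  );
  return "processed";
}
```

One caveat: two concurrent deliveries can both observe 'pending' and run the work twice. If that matters, take the row with SELECT ... FOR UPDATE inside a transaction before re-running.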

Option B — Wrap the insert + work in a single DB transaction. If the work throws, the insert rolls back, and the next retry sees no row. This is the cleanest pattern when your business logic is also DB-bound:

// Transactions must run on a single connection — check out a client;
// pool.query() may use a different connection per statement.
const client = await pool.connect();
try {
  await client.query("BEGIN");
  await client.query(
    "INSERT INTO processed_webhook_events (event_id, source) VALUES ($1, $2)",
    [event.id, "stripe"],
  );
  await handleStripeEvent(event, client); // its DB writes must use `client` to join the txn
  await client.query("COMMIT");
} catch (err) {
  await client.query("ROLLBACK");
  if ((err as { code?: string }).code === "23505") {
    return res.json({ received: true, duplicate: true });
  }
  throw err; // sender will retry; rollback means we're idempotent
} finally {
  client.release();
}

Option B is the right answer when handlers stay inside one database. Option A is necessary when your handler does external API calls (sending an email, calling Slack, etc.) that can't be rolled back.

Pillar 2: Return 2xx fast — defer work to a queue

Most providers time out after 5-30 seconds. If you do all your processing inline, every slow handler turns into a retry storm. The solution: acknowledge fast, do the work asynchronously.

app.post(
  "/webhooks/stripe",
  express.raw({ type: "application/json" }),
  async (req, res) => {
    const event = verifyStripeSignature(req);

    // Keep the sync path tiny: verify, enqueue, ack.
    // (The Pillar 1 dedupe can run here, in the worker, or both.)
    await enqueueForProcessing(event); // BullMQ / SQS / DB-backed jobs
    res.status(200).json({ received: true });
  },
);

// Worker process — handles the actual logic, retries on its own schedule.
worker.process("stripe-events", async (job) => {
  await handleStripeEvent(job.data);
});

Trade-off: now you have two retry layers (the sender's, and your worker's). Make sure your worker also dedupes by event ID before doing real work — same pattern as Pillar 1.

Pillar 3: Understand each sender's retry policy

You can't tune your dead-letter strategy without knowing the upstream retry budget. The current (2026) policies:

Provider | Retry attempts | Backoff schedule | Total window
---------|----------------|------------------|-------------
Stripe   | Up to 17 | Exponential, ~immediately → ~3 days | 3 days
GitHub   | Up to 50 | Exponential | ~8 hours
Shopify  | Up to 19 | Exponential, ~hours apart | 48 hours
Slack    | Up to 3 | 1 min, 5 min, 30 min | ~36 min
Twilio   | Configurable (3 default) | Exponential, per-webhook config | varies
Square   | Up to 70 (!) | Exponential, ~immediately → ~72 hours | 72 hours
HubSpot  | Up to 10 | Exponential | ~8 hours
SendGrid | Retries for ~24 hours | Exponential | 24 hours

Two things this table tells you:

  1. The retry windows are LONG. If Stripe gives you 3 days and Square gives you 72 hours, your handler stability matters over days, not seconds. A "blip" outage that lasts 30 minutes will resolve itself before any of these senders give up.
  2. Slack is the outlier. ~36 minutes of retries means a longer outage drops Slack events on the floor. If Slack signals are critical to your app, you need defensive replay tooling.

Source: each provider's published retry docs as of 2026-04. Re-verify before quoting in production.
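The schedules above belong to the senders, but once you add a worker queue (Pillar 2) you own a retry schedule too. A common choice is exponential backoff with full jitter. A sketch, with illustrative constants and an injectable random source so it can be tested deterministically:

```typescript
// Full-jitter exponential backoff for worker-side retries.
// baseMs/capMs are illustrative defaults, not any provider's values.
function backoffDelayMs(
  attempt: number,                     // 1-based attempt counter
  baseMs = 1_000,
  capMs = 60 * 60 * 1_000,             // never wait more than an hour
  random: () => number = Math.random,  // injectable for tests
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  return Math.floor(random() * ceiling); // uniform in [0, ceiling)
}
```

The jitter matters: without it, every event that failed during the same outage retries at the same instant and re-creates the spike that caused the failure.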

Designing for the sender, not against it

A common anti-pattern: returning 4xx for "expected" failures (like a duplicate event you don't care to process). Most senders treat a 4xx as terminal: "your endpoint rejected the event, stop retrying." An errant 4xx can therefore drop an event permanently (check your provider's docs; a few retry on any non-2xx).

The right responses:

  • 2xx: "I have this. Please don't retry." Use even when you're skipping a duplicate or ignoring an event type.
  • 5xx or timeout: "I'm broken. Please retry." Use for transient infra problems.
  • 4xx: "Don't ever try again." Reserve for malformed requests or signature failures — explicit "stop retrying" intent.

If your endpoint returns 4xx during a partial outage, you'll silently lose events you actually wanted.
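One way to keep handlers from improvising status codes is to centralize the mapping in a single function. A sketch; the outcome names here are made up for illustration, not a provider API:

```typescript
// Map handler outcomes to the retry semantics above.
type Outcome =
  | "ok"                 // processed successfully
  | "duplicate"          // already handled (Pillar 1)
  | "ignored_event_type" // event we deliberately skip
  | "bad_signature"      // verification failed
  | "transient_failure"; // infra hiccup

function statusFor(outcome: Outcome): number {
  switch (outcome) {
    case "ok":
    case "duplicate":
    case "ignored_event_type":
      return 200; // "I have this. Don't retry."
    case "bad_signature":
      return 400; // "Don't ever try again."
    default:
      return 500; // transient failure: "I'm broken. Please retry."
  }
}
```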

Pillar 4: Dead-letter handling

Eventually some events fail every retry. Maybe a customer was deleted between the event firing and your retry, maybe a downstream API changed schemas. You need:

  1. A dead-letter table that captures fully-failed events.
  2. An alert when something lands there.
  3. A manual replay path to reprocess after fixing the bug.

CREATE TABLE webhook_dead_letters (
  id BIGSERIAL PRIMARY KEY,
  source        TEXT NOT NULL,
  event_id      TEXT NOT NULL,
  raw_headers   JSONB NOT NULL,
  raw_body      BYTEA NOT NULL,         -- preserve EXACTLY what arrived
  last_error    TEXT NOT NULL,
  attempts      INT  NOT NULL,
  received_at   TIMESTAMPTZ NOT NULL,
  resolved_at   TIMESTAMPTZ
);

Critical: store the raw bytes of the request, not the parsed JSON. When you fix the handler and want to replay, you need the exact bytes the signature was computed over — otherwise verification fails and you can't retry it cleanly.
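Writing the dead letter from the worker is a single insert. A sketch under the schema above: deadLetter is a hypothetical helper name, and `query` stands in for your pg client so the shape is visible without a live database.

```typescript
// Record a fully-failed event, preserving the raw bytes for later replay.
// Column names match the webhook_dead_letters table above.
async function deadLetter(
  query: (sql: string, params: unknown[]) => Promise<unknown>,
  evt: {
    source: string;
    eventId: string;
    rawHeaders: Record<string, string>;
    rawBody: Buffer; // the exact bytes the signature was computed over
    receivedAt: Date;
  },
  lastError: string,
  attempts: number,
): Promise<void> {
  await query(
    `INSERT INTO webhook_dead_letters
       (source, event_id, raw_headers, raw_body, last_error, attempts, received_at)
     VALUES ($1, $2, $3, $4, $5, $6, $7)`,
    [
      evt.source,
      evt.eventId,
      JSON.stringify(evt.rawHeaders),
      evt.rawBody,
      lastError,
      attempts,
      evt.receivedAt,
    ],
  );
  // ...then fire the alert (Slack/PagerDuty/email).
}
```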

Once a row lands here, alert (Slack, PagerDuty, email — whatever you use). Manual replay is then:

async function replayDeadLetter(id: number) {
  const row = await db.oneOrNone(
    "SELECT * FROM webhook_dead_letters WHERE id = $1",
    [id],
  );
  if (!row) throw new Error("not found");

  // Replay through the same handler — your idempotency table
  // ensures we don't double-process if it ALSO hit the live path.
  await processWebhook(row.source, row.raw_headers, row.raw_body);
  await db.none(
    "UPDATE webhook_dead_letters SET resolved_at = now() WHERE id = $1",
    [id],
  );
}

Test your retry handling without waiting for production

The hardest part of this whole architecture is testing. Real Stripe retries happen on Stripe's schedule, days apart. You can't reliably write an integration test against "what happens on the 4th retry."

Two patterns that work:

Use HookRay to capture and replay

HookRay gives you a public webhook URL that captures every incoming event. Once you've captured a real event, you can replay it manually as many times as you want, simulating retry behavior:

  1. Get a free HookRay URL (no signup required for the first 100 captures).
  2. Point Stripe / GitHub / Shopify at the URL in their dashboard.
  3. Trigger a real event (test mode is fine).
  4. From HookRay's UI, click "Replay" — re-send the captured webhook to your local handler (localhost:3000 via tunnel, or HookRay Pro forwards directly).
  5. Click Replay 5 times in a row to verify your idempotency table catches duplicates.

This is the fastest "did my retry handling actually work?" loop. See the Webhook.site → HookRay migration guide if you're moving from a tool that doesn't support replay.

Provider CLIs

Stripe and GitHub both ship CLI tools that forward real events to localhost:

# Stripe — also supports `--resend-event-id evt_xxx` for replay
stripe listen --forward-to localhost:3000/webhooks/stripe

# GitHub
gh webhook forward --repo owner/repo --url localhost:3000/webhooks/github

These are great for the happy path but they don't simulate the sender's retry storm — for that you need replay (HookRay or roll your own).

Common retry bugs (and how to spot them)

Bug 1: Returning 200 too early. You return 200 then crash before persisting the event. Sender thinks it's done; you lost the data. Fix: persist (or enqueue) before returning 200.

Bug 2: Idempotency on the wrong key. Using a synthetic key (like ${customer_id}_${event_type}) instead of the provider's event ID. Two distinct events with the same composite key collide; legitimate events get dropped as "duplicates." Fix: always dedupe on the provider's event ID.

Bug 3: Returning 4xx for expected duplicates. This stops retries, which sounds good — until you realize that all transient errors during the duplicate path also become 4xx. You silently break legitimate retry. Fix: return 200 OK with {duplicate: true} body for known duplicates; reserve 4xx for truly malformed requests.

Bug 4: Inline external API calls. Your handler calls Stripe's API to fetch related data, the call hangs for 30 seconds, the webhook times out, the sender retries, your handler hangs again, your queue fills up. Fix: enqueue + ack fast (Pillar 2).

Bug 5: Lost events during deploys. Your handler is mid-processing when the container is replaced. The event gets a 5xx (or worse, a half-completed write) and the sender retries. Without graceful shutdown handling, your retry table doesn't capture the in-flight ID. Fix: drain the in-flight queue before deploying, OR use the pending status pattern from Option A above.
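One way to close that deploy window is a small shutdown hook. This is a sketch: installGracefulShutdown is a made-up name, and `server` is whatever app.listen() returned. The exit function is injectable purely so the hook can be tested.

```typescript
// Stop accepting new webhooks, let in-flight ones drain, then exit.
// If draining hangs past the deadline, exit non-zero; the orchestrator
// restarts the process and the sender's 5xx/timeout retry re-delivers.
function installGracefulShutdown(
  server: { close: (cb: () => void) => void },
  exit: (code: number) => void = (c) => process.exit(c),
  graceMs = 10_000,
): () => void {
  return () => {
    server.close(() => exit(0)); // callback fires once in-flight requests finish
    setTimeout(() => exit(1), graceMs).unref(); // hard deadline; unref so the
    // pending timer never keeps an otherwise-drained process alive
  };
}

// Usage: process.on("SIGTERM", installGracefulShutdown(server));
```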

Summary checklist

Before declaring your webhook handler "production-ready," verify:

  • Every handler dedupes by the provider's event ID
  • The dedupe table has a UNIQUE constraint on event_id
  • Either Option A (pending → processed) or Option B (transactional) is in use
  • Handlers return 2xx on duplicates, NOT 4xx
  • Long-running work is enqueued, not done inline
  • You have a dead-letter table that stores raw headers + raw body
  • You alert on dead-letter inserts (Slack/PagerDuty/email)
  • You have a tested manual-replay path
  • Your retry handling has been verified with HookRay replay or a similar tool

Ready to test your webhooks?

Get a free webhook URL in 5 seconds. No signup required.

Start Testing — Free