
Why I Add an Async Outbox Before Reaching for Kafka

A solo-dev pattern for reliable emails and webhooks without a message bus


The first time you ship “send email after signup” as an inline API call, it works—until it doesn’t. One slow provider, one transient timeout, or one deploy at the wrong time and you start dropping side effects (emails, webhooks, audit logs) or, worse, sending duplicates. As a solo creator, the painful part isn’t just the bug—it’s the operational overhead of fixing it repeatedly without a team. Here’s the architectural decision I now default to: add an async outbox (in the same database) before reaching for a full message bus.

1) The decision: do you need a message bus yet?

When you’re building solo, reliability problems show up in the least glamorous places:

  • password reset emails that sometimes don’t arrive
  • webhook deliveries that randomly fail
  • “welcome” sequences that send twice
  • background jobs that disappear during deploys

The common root cause is coupling: your request path is doing too much work, and your side effects aren’t transactional with your core write.

The tempting solution is “add Kafka/RabbitMQ/SQS.” But running (or even integrating) a message bus is not free: schema evolution, retries, DLQs, observability, idempotency, consumer deployments, and a new failure domain.

My default for early-stage systems is an async outbox: write an “event to send” into the same database transaction as your core change, then have a worker deliver it with retries.

Key idea: if the business write commits, the side effect is guaranteed to be recorded—even if delivery happens later.


2) Context (The Problem Space)

Requirements & constraints

For a solo system design, I optimize for:

  • Correctness over immediacy: “email eventually sent” beats “sometimes sent instantly.”
  • Low operational load: fewer moving pieces, fewer dashboards.
  • Cost predictability: one database and one worker is usually enough.
  • Deploy safety: deploys shouldn’t drop side effects.

Typical scale expectations

This pattern holds comfortably for:

  • tens to hundreds of requests/sec
  • thousands to millions of outbox rows/day
  • side effects like email/webhook/analytics events

Non-functional requirements

  • At-least-once delivery (with idempotency on the consumer/provider side)
  • Retry with backoff
  • Observability: know what’s stuck and why
  • No phantom sends: don’t send an email if the user row didn’t commit

Why “just do it inline” doesn’t fit

Inline side effects fail in subtle ways:

  • you commit the DB write, then the email API times out → user exists, but no email
  • you send the email, then the DB transaction rolls back → email references a user that “doesn’t exist”
  • you retry the request, and now you send duplicates

The outbox is basically admitting that distributed systems exist even in a monolith: your DB plus third-party APIs already form a distributed system.


3) Options considered

Below are the common approaches I’ve used/seen, and where they break.

Comparison table

| Option | What it is | Pros | Cons | Best when |
|---|---|---|---|---|
| Inline side effects | Call email/webhook provider inside the request | Simple, low latency | Not transactional; timeouts hurt UX; duplicates on retries | Truly non-critical side effects |
| Background job queue only | Push job to Redis/queue from request | Async, faster requests | Still not transactional unless enqueue shares the transaction boundary | You can tolerate occasional lost jobs |
| Async outbox (DB) | Write outbox row in same DB transaction; worker delivers | Transactional recording, fewer components, great for solo | Adds polling/worker; needs idempotency + cleanup | MVPs to mid-scale systems |
| CDC (change data capture) | Stream DB changes to consumers (Debezium, logical replication) | Near real-time, scalable | Operational complexity, schema coupling, infra overhead | Data platforms, multiple consumers |
| Full message bus | Kafka/RabbitMQ/SQS + producers/consumers | High throughput, decoupling, replay | More infra, more failure modes, more tooling | Many services/teams, high event volume |

Option notes (the “gotchas”)

Inline side effects

  • Works until your provider latency spikes.
  • Forces you to choose between slow user experience and unreliable delivery.

Background queue only (Redis etc.)

  • Better UX, but if enqueue happens after commit and the process crashes in between, you lose the job.
  • If enqueue happens before commit and the commit fails, you send an email for a transaction that never happened.

Async outbox

  • You trade a bit of implementation complexity for a big jump in correctness.
  • Your DB becomes both the system of record and the “durable queue.”

CDC or message bus

  • Great when multiple consumers need the same events, or you need replay.
  • Usually too much surface area for a solo codebase early on.

4) The decision (What I chose)

I choose Async Outbox in the primary database as the default for emails/webhooks/audit events.

Primary reasons (ranked)

  1. Transactional integrity: the outbox row is committed with the business write.
  2. Operational simplicity: no new infra tier (beyond a worker process).
  3. Deploy resilience: if the worker is down, events accumulate; nothing is lost.
  4. Debuggability: outbox table is a truth source you can query with SQL.

What I gave up

  • Instant delivery: outbox is “near real-time,” not truly synchronous.
  • DB load: polling adds reads/writes; you must index correctly.
  • Exactly-once: you usually get at-least-once; duplicates are handled via idempotency.

Implementation overview

Data model

A minimal outbox table:

CREATE TABLE outbox_events (
  id            BIGSERIAL PRIMARY KEY,
  topic         TEXT NOT NULL,
  payload       JSONB NOT NULL,
  idempotency_key TEXT NOT NULL,
  status        TEXT NOT NULL DEFAULT 'pending', -- pending, processing, sent, failed
  attempts      INT  NOT NULL DEFAULT 0,
  available_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  locked_at     TIMESTAMPTZ,
  lock_owner    TEXT,
  created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Prevent duplicate logical events (e.g., signup welcome email)
CREATE UNIQUE INDEX outbox_idempotency_uk
  ON outbox_events(topic, idempotency_key);

-- Fast fetching of ready work
CREATE INDEX outbox_ready_idx
  ON outbox_events(status, available_at)
  WHERE status IN ('pending', 'failed');

The important design choice here is idempotency_key. This is what keeps “at-least-once delivery” from becoming “user got 3 emails.”

Examples of idempotency keys:

  • welcome_email:user_id=123
  • webhook:invoice_paid:invoice_id=abc
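One way to keep these keys consistent across call sites is a tiny helper that sorts the parameters, so the same logical event always produces the same string. This helper is my sketch, not part of the pattern itself:

```typescript
// Build a deterministic idempotency key like "welcome_email:user_id=123".
// Sorting the parameter names means two call sites can't accidentally
// produce different keys for the same logical event.
function idempotencyKey(
  name: string,
  params: Record<string, string | number>
): string {
  const kv = Object.keys(params)
    .sort()
    .map((k) => `${k}=${params[k]}`)
    .join(':');
  return `${name}:${kv}`;
}

// idempotencyKey('welcome_email', { user_id: 123 })
//   → 'welcome_email:user_id=123'
```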

Writing to the outbox in the same transaction

Pseudocode (Node/TypeScript-ish, but the idea is language-agnostic):

await db.tx(async (tx) => {
  const { rows: [user] } = await tx.query(
    `INSERT INTO users(email) VALUES ($1) RETURNING id, email`,
    [email]
  );

  await tx.query(
    `INSERT INTO outbox_events(topic, payload, idempotency_key)
     VALUES ($1, $2::jsonb, $3)
     ON CONFLICT (topic, idempotency_key) DO NOTHING`,
    [
      'email.welcome',
      JSON.stringify({ userId: user.id, email: user.email }),
      `welcome_email:user_id=${user.id}`
    ]
  );
});

This is the core win: either both rows commit, or neither does.

Worker: claim rows safely (skip locked)

The worker loop should:

  1. fetch a small batch of ready events
  2. atomically mark them as processing (lock)
  3. deliver
  4. mark sent or schedule retry

In Postgres, a common pattern is FOR UPDATE SKIP LOCKED:

WITH next AS (
  SELECT id
  FROM outbox_events
  WHERE status IN ('pending', 'failed')
    AND available_at <= now()
  ORDER BY created_at
  LIMIT 50
  FOR UPDATE SKIP LOCKED
)
UPDATE outbox_events e
SET status = 'processing',
    locked_at = now(),
    lock_owner = $1,
    updated_at = now()
FROM next
WHERE e.id = next.id
RETURNING e.id, e.topic, e.payload, e.attempts;

This lets you run multiple worker instances without double-processing the same row.
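One gap worth closing in this loop: if a worker crashes after claiming rows, those rows sit in processing forever. The locked_at column exists for exactly this; a periodic reclaim sweep is one fix (the 5-minute timeout below is an assumption — pick something comfortably above your slowest delivery):

```sql
-- Reclaim events from workers that died mid-processing.
-- Reclaimed rows simply become 'pending' again and will be retried.
UPDATE outbox_events
SET status = 'pending',
    locked_at = NULL,
    lock_owner = NULL,
    updated_at = now()
WHERE status = 'processing'
  AND locked_at < now() - interval '5 minutes';
```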

Retry policy with backoff

I usually start with exponential backoff capped at a few minutes.

function nextBackoffSeconds(attempt: number): number {
  // 1, 2, 4, 8, 16, 32, 60, 60...
  return Math.min(60, 2 ** Math.max(0, attempt));
}

async function markFailed(id: number, attempts: number, err: Error) {
  const delay = nextBackoffSeconds(attempts);
  await db.query(
    `UPDATE outbox_events
     SET status='failed',
         attempts = attempts + 1,
         available_at = now() + ($2 || ' seconds')::interval,
         updated_at = now(),
         payload = jsonb_set(payload, '{last_error}', to_jsonb($3::text), true)
     WHERE id = $1`,
    [id, delay, err.message]
  );
}

A few deliberate choices:

  • store last_error to make SQL-based debugging possible
  • cap backoff to avoid “retry storms”
  • keep it simple; you can add jitter later
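When you do add jitter, "full jitter" (a uniformly random delay between 0 and the capped exponential value) is a common choice, because it spreads out events that all failed at the same moment. A sketch, with a function name of my choosing:

```typescript
// Full jitter: uniform random delay in [0, cap], where cap grows
// exponentially with the attempt count up to 60 seconds. Prevents a
// burst of failures from retrying in lockstep ("retry storms").
function jitteredBackoffSeconds(attempt: number): number {
  const cap = Math.min(60, 2 ** Math.max(0, attempt));
  return Math.random() * cap;
}
```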

5) Results & learnings

Because this is a general pattern (not tied to one product), I’ll share the kinds of numbers I’ve repeatedly observed in production-ish solo systems.

Performance impact

  • Request latency: moving side effects out of the request typically drops p95 latency by 200–1500ms (depending on provider latency). The DB write for an outbox row is usually single-digit milliseconds when indexed.
  • Throughput: a single worker polling every 250–1000ms and processing batches of 50 can comfortably handle hundreds to low thousands of events/min on modest hardware, assuming the downstream provider isn’t the bottleneck.
  • DB load: the outbox table can become write-heavy. With proper partial indexes and batch updates, I typically see outbox overhead remain <5–10% of total DB CPU for small-to-mid workloads.

What worked well

  • Debugging becomes SQL-native: “show me pending emails older than 10 minutes” is a query.
  • Deploys are less scary: if the worker is down for 10 minutes, you process the backlog.

Unexpected challenges

  • Idempotency is non-negotiable: you will send duplicates eventually (timeouts, provider ambiguity). Design for it.
  • Poison messages: some payloads will fail forever (bad email, invalid webhook URL). You need a terminal state and alerting.
  • Table growth: you must archive or delete sent events.
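Cleanup can be as simple as a scheduled delete; a minimal sketch assuming a 30-day retention window (archive sent rows to a cold table first if you need the history):

```sql
-- Run periodically (cron, pg_cron, or the worker itself).
-- On large tables, delete in batches to avoid long-held locks.
DELETE FROM outbox_events
WHERE status = 'sent'
  AND updated_at < now() - interval '30 days';
```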

What I’d do differently next time

  • Add a dead status after N attempts and a lightweight admin view early.
  • Add basic metrics (counts by status, oldest pending age) before problems happen.
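The dead status mentioned above is a one-line policy decision at the point where markFailed runs. A minimal sketch — the function name and the max-attempts threshold are my assumptions:

```typescript
type OutboxStatus = 'pending' | 'processing' | 'sent' | 'failed' | 'dead';

// After a delivery failure: keep retrying (via 'failed') until maxAttempts,
// then park the event in a terminal 'dead' state for alerting and manual
// inspection, so poison messages stop consuming worker cycles.
function statusAfterFailure(
  attempts: number,
  maxAttempts: number = 10
): OutboxStatus {
  return attempts + 1 >= maxAttempts ? 'dead' : 'failed';
}
```

The worker would then exclude dead from its ready-work query, and a metric like "dead events in the last hour" becomes your poison-message alarm.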

6) When this doesn’t work

The async outbox is not a universal answer.

Choose something else when:

  • You need fan-out to many consumers with different replay needs. Outbox can do it, but it becomes awkward; a proper bus or CDC can be cleaner.
  • Event volume is very high (e.g., tens of thousands/sec). Polling a relational DB becomes expensive; you’ll want streaming infrastructure.
  • You require strict ordering across partitions (e.g., per-customer ordering at scale). You can approximate ordering, but it gets complex.
  • Your primary DB is already the bottleneck. Turning it into a queue adds load; offloading to a dedicated queue might be healthier.
  • You can’t tolerate duplicates at all and downstream isn’t idempotent. You can get closer with provider-side idempotency keys, transactional inbox patterns, or exactly-once semantics in specific systems—but complexity rises quickly.

7) Key takeaways

  • Treat third-party APIs as unreliable dependencies; design side effects as async and retryable.
  • If you only adopt one reliability pattern early: write an outbox row in the same DB transaction as your business change.
  • Optimize for operational simplicity first; a worker + Postgres is often enough for a long time.
  • Assume at-least-once delivery and make events idempotent with explicit keys.
  • Plan for lifecycle: retries, dead-lettering (even if it’s just a dead status), and cleanup/archival.

8) Closing

If you’re building solo, the async outbox is one of those “boring” patterns that quietly saves weeks of debugging later.

What’s your default for side effects in early-stage systems—inline calls, a queue, an outbox, or straight to a message bus? And what failure pushed you there?