
Why I Add an Async Outbox Before Reaching for Kafka

A solo-dev pattern for reliable emails and webhooks without a message bus


The first time you ship “send email after signup” as an inline API call, it works—until it doesn’t. One slow provider, one transient timeout, or one deploy at the wrong time and you start dropping side effects (emails, webhooks, audit logs) or, worse, sending duplicates. As a solo creator, the painful part isn’t just the bug—it’s the operational overhead of fixing it repeatedly without a team. Here’s the architectural decision I now default to: add an async outbox (in the same database) before reaching for a full message bus.

1) The decision: do you need a message bus yet?

When you’re building solo, reliability problems show up in the least glamorous places:

  • password reset emails that sometimes don’t arrive
  • webhook deliveries that randomly fail
  • “welcome” sequences that send twice
  • background jobs that disappear during deploys

The common root cause is coupling: your request path is doing too much work, and your side effects aren’t transactional with your core write.

The tempting solution is “add Kafka/RabbitMQ/SQS.” But running (or even integrating) a message bus is not free: schema evolution, retries, DLQs, observability, idempotency, consumer deployments, and a new failure domain.

My default for early-stage systems is an async outbox: write an “event to send” into the same database transaction as your core change, then have a worker deliver it with retries.

Key idea: if the business write commits, the side effect is guaranteed to be recorded—even if delivery happens later.


2) Context (The Problem Space)

Requirements & constraints

For a solo system design, I optimize for:

  • Correctness over immediacy: “email eventually sent” beats “sometimes sent instantly.”
  • Low operational load: fewer moving pieces, fewer dashboards.
  • Cost predictability: one database and one worker is usually enough.
  • Deploy safety: deploys shouldn’t drop side effects.

Typical scale expectations

This pattern holds comfortably for:

  • tens to hundreds of requests/sec
  • thousands to millions of outbox rows/day
  • side effects like email/webhook/analytics events

Non-functional requirements

  • At-least-once delivery (with idempotency on the consumer/provider side)
  • Retry with backoff
  • Observability: know what’s stuck and why
  • No phantom sends: don’t send an email if the user row didn’t commit

Why “just do it inline” doesn’t fit

Inline side effects fail in subtle ways:

  • you commit the DB write, then the email API times out → user exists, but no email
  • you send the email, then the DB transaction rolls back → email references a user that “doesn’t exist”
  • you retry the request, and now you send duplicates

The outbox is basically admitting that distributed systems exist even in a monolith: your DB plus third-party APIs already form a distributed system.


3) Options considered

Below are the common approaches I’ve used/seen, and where they break.

Comparison table

| Option | What it is | Pros | Cons | Best when |
|---|---|---|---|---|
| Inline side effects | Call email/webhook provider inside the request | Simple, low latency | Not transactional; timeouts hurt UX; duplicates on retries | Truly non-critical side effects |
| Background job queue only | Push job to Redis/queue from request | Async, faster requests | Still not transactional unless enqueue shares the transaction boundary | You can tolerate occasional lost jobs |
| Async outbox (DB) | Write outbox row in same DB transaction; worker delivers | Transactional recording, fewer components, great for solo | Adds polling/worker; needs idempotency + cleanup | MVPs to mid-scale systems |
| CDC (change data capture) | Stream DB changes to consumers (Debezium, logical replication) | Near real-time, scalable | Operational complexity, schema coupling, infra overhead | Data platforms, multiple consumers |
| Full message bus | Kafka/RabbitMQ/SQS + producers/consumers | High throughput, decoupling, replay | More infra, more failure modes, more tooling | Many services/teams, high event volume |

Option notes (the “gotchas”)

Inline side effects

  • Works until your provider latency spikes.
  • Forces you to choose between slow user experience and unreliable delivery.

Background queue only (Redis etc.)

  • Better UX, but if enqueue happens after commit and the process crashes in between, you lose the job.
  • If enqueue happens before commit and the commit fails, you send an email for a transaction that never happened.

Async outbox

  • You trade a bit of implementation complexity for a big jump in correctness.
  • Your DB becomes both the system of record and the “durable queue.”

CDC or message bus

  • Great when multiple consumers need the same events, or you need replay.
  • Usually too much surface area for a solo codebase early on.

4) The decision (What I chose)

I choose Async Outbox in the primary database as the default for emails/webhooks/audit events.

Primary reasons (ranked)

  1. Transactional integrity: the outbox row is committed with the business write.
  2. Operational simplicity: no new infra tier (beyond a worker process).
  3. Deploy resilience: if the worker is down, events accumulate; nothing is lost.
  4. Debuggability: outbox table is a truth source you can query with SQL.

What I gave up

  • Instant delivery: outbox is “near real-time,” not truly synchronous.
  • DB load: polling adds reads/writes; you must index correctly.
  • Exactly-once: you usually get at-least-once; duplicates are handled via idempotency.

Implementation overview

Data model

A minimal outbox table:

CREATE TABLE outbox_events (
  id            BIGSERIAL PRIMARY KEY,
  topic         TEXT NOT NULL,
  payload       JSONB NOT NULL,
  idempotency_key TEXT NOT NULL,
  status        TEXT NOT NULL DEFAULT 'pending', -- pending, processing, sent, failed
  attempts      INT  NOT NULL DEFAULT 0,
  available_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  locked_at     TIMESTAMPTZ,
  lock_owner    TEXT,
  created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Prevent duplicate logical events (e.g., signup welcome email)
CREATE UNIQUE INDEX outbox_idempotency_uk
  ON outbox_events(topic, idempotency_key);

-- Fast fetching of ready work
CREATE INDEX outbox_ready_idx
  ON outbox_events(status, available_at)
  WHERE status IN ('pending', 'failed');

The important design choice here is idempotency_key. This is what keeps “at-least-once delivery” from becoming “user got 3 emails.”

Examples of idempotency keys:

  • welcome_email:user_id=123
  • webhook:invoice_paid:invoice_id=abc
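One way to keep these keys consistent across call sites is a tiny helper that sorts the parameters, so the same logical event always produces the same string. This helper is my sketch, not part of the pattern itself:

```typescript
// Build a deterministic idempotency key like "welcome_email:user_id=123".
// Sorting the parameter names means two call sites can't accidentally
// produce different keys for the same logical event.
function idempotencyKey(
  name: string,
  params: Record<string, string | number>
): string {
  const kv = Object.keys(params)
    .sort()
    .map((k) => `${k}=${params[k]}`)
    .join(':');
  return `${name}:${kv}`;
}

// idempotencyKey('welcome_email', { user_id: 123 })
//   → 'welcome_email:user_id=123'
```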

Writing to the outbox in the same transaction

Pseudocode (Node/TypeScript-ish, but the idea is language-agnostic):

await db.tx(async (tx) => {
  const { rows: [user] } = await tx.query(
    `INSERT INTO users(email) VALUES ($1) RETURNING id, email`,
    [email]
  );

  await tx.query(
    `INSERT INTO outbox_events(topic, payload, idempotency_key)
     VALUES ($1, $2::jsonb, $3)
     ON CONFLICT (topic, idempotency_key) DO NOTHING`,
    [
      'email.welcome',
      JSON.stringify({ userId: user.id, email: user.email }),
      `welcome_email:user_id=${user.id}`
    ]
  );
});

This is the core win: either both rows commit, or neither does.

Worker: claim rows safely (skip locked)

The worker loop should:

  1. fetch a small batch of ready events
  2. atomically mark them as processing (lock)
  3. deliver
  4. mark sent or schedule retry

In Postgres, a common pattern is FOR UPDATE SKIP LOCKED:

WITH next AS (
  SELECT id
  FROM outbox_events
  WHERE status IN ('pending', 'failed')
    AND available_at <= now()
  ORDER BY created_at
  LIMIT 50
  FOR UPDATE SKIP LOCKED
)
UPDATE outbox_events e
SET status = 'processing',
    locked_at = now(),
    lock_owner = $1,
    updated_at = now()
FROM next
WHERE e.id = next.id
RETURNING e.id, e.topic, e.payload, e.attempts;

This lets you run multiple worker instances without double-processing the same row.
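One gap worth closing in this loop: if a worker crashes after claiming rows, those rows sit in processing forever. The locked_at column exists for exactly this; a periodic reclaim sweep is one fix (the 5-minute timeout below is an assumption — pick something comfortably above your slowest delivery):

```sql
-- Reclaim events from workers that died mid-processing.
-- Reclaimed rows simply become 'pending' again and will be retried.
UPDATE outbox_events
SET status = 'pending',
    locked_at = NULL,
    lock_owner = NULL,
    updated_at = now()
WHERE status = 'processing'
  AND locked_at < now() - interval '5 minutes';
```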

Retry policy with backoff

I usually start with exponential backoff capped at a few minutes.

function nextBackoffSeconds(attempt: number): number {
  // 1, 2, 4, 8, 16, 32, 60, 60...
  return Math.min(60, 2 ** Math.max(0, attempt));
}

async function markFailed(id: number, attempts: number, err: Error) {
  const delay = nextBackoffSeconds(attempts);
  await db.query(
    `UPDATE outbox_events
     SET status='failed',
         attempts = attempts + 1,
         available_at = now() + ($2 || ' seconds')::interval,
         updated_at = now(),
         payload = jsonb_set(payload, '{last_error}', to_jsonb($3::text), true)
     WHERE id = $1`,
    [id, delay, err.message]
  );
}

A few deliberate choices:

  • store last_error to make SQL-based debugging possible
  • cap backoff to avoid “retry storms”
  • keep it simple; you can add jitter later
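When you do add jitter, "full jitter" (a uniformly random delay between 0 and the capped exponential value) is a common choice, because it spreads out events that all failed at the same moment. A sketch, with a function name of my choosing:

```typescript
// Full jitter: uniform random delay in [0, cap], where cap grows
// exponentially with the attempt count up to 60 seconds. Prevents a
// burst of failures from retrying in lockstep ("retry storms").
function jitteredBackoffSeconds(attempt: number): number {
  const cap = Math.min(60, 2 ** Math.max(0, attempt));
  return Math.random() * cap;
}
```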

5) Results & learnings

Because this is a general pattern (not tied to one product), I’ll share the kinds of numbers I’ve repeatedly observed in production-ish solo systems.

Performance impact

  • Request latency: moving side effects out of the request typically drops p95 latency by 200–1500ms (depending on provider latency). The DB write for an outbox row is usually single-digit milliseconds when indexed.
  • Throughput: a single worker polling every 250–1000ms and processing batches of 50 can comfortably handle hundreds to low thousands of events/min on modest hardware, assuming the downstream provider isn’t the bottleneck.
  • DB load: the outbox table can become write-heavy. With proper partial indexes and batch updates, I typically see outbox overhead remain <5–10% of total DB CPU for small-to-mid workloads.

What worked well

  • Debugging becomes SQL-native: “show me pending emails older than 10 minutes” is a query.
  • Deploys are less scary: if the worker is down for 10 minutes, you process the backlog.

Unexpected challenges

  • Idempotency is non-negotiable: you will send duplicates eventually (timeouts, provider ambiguity). Design for it.
  • Poison messages: some payloads will fail forever (bad email, invalid webhook URL). You need a terminal state and alerting.
  • Table growth: you must archive or delete sent events.
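Cleanup can be as simple as a scheduled delete; a minimal sketch assuming a 30-day retention window (archive sent rows to a cold table first if you need the history):

```sql
-- Run periodically (cron, pg_cron, or the worker itself).
-- On large tables, delete in batches to avoid long-held locks.
DELETE FROM outbox_events
WHERE status = 'sent'
  AND updated_at < now() - interval '30 days';
```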

What I’d do differently next time

  • Add a dead status after N attempts and a lightweight admin view early.
  • Add basic metrics (counts by status, oldest pending age) before problems happen.
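The dead status mentioned above is a one-line policy decision at the point where markFailed runs. A minimal sketch — the function name and the max-attempts threshold are my assumptions:

```typescript
type OutboxStatus = 'pending' | 'processing' | 'sent' | 'failed' | 'dead';

// After a delivery failure: keep retrying (via 'failed') until maxAttempts,
// then park the event in a terminal 'dead' state for alerting and manual
// inspection, so poison messages stop consuming worker cycles.
function statusAfterFailure(
  attempts: number,
  maxAttempts: number = 10
): OutboxStatus {
  return attempts + 1 >= maxAttempts ? 'dead' : 'failed';
}
```

The worker would then exclude dead from its ready-work query, and a metric like "dead events in the last hour" becomes your poison-message alarm.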

6) When this doesn’t work

The async outbox is not a universal answer.

Choose something else when:

  • You need fan-out to many consumers with different replay needs. Outbox can do it, but it becomes awkward; a proper bus or CDC can be cleaner.
  • Event volume is very high (e.g., tens of thousands/sec). Polling a relational DB becomes expensive; you’ll want streaming infrastructure.
  • You require strict ordering across partitions (e.g., per-customer ordering at scale). You can approximate ordering, but it gets complex.
  • Your primary DB is already the bottleneck. Turning it into a queue adds load; offloading to a dedicated queue might be healthier.
  • You can’t tolerate duplicates at all and downstream isn’t idempotent. You can get closer with provider-side idempotency keys, transactional inbox patterns, or exactly-once semantics in specific systems—but complexity rises quickly.

7) Key takeaways

  • Treat third-party APIs as unreliable dependencies; design side effects as async and retryable.
  • If you only adopt one reliability pattern early: write an outbox row in the same DB transaction as your business change.
  • Optimize for operational simplicity first; a worker + Postgres is often enough for a long time.
  • Assume at-least-once delivery and make events idempotent with explicit keys.
  • Plan for lifecycle: retries, dead-lettering (even if it’s just a dead status), and cleanup/archival.

8) Closing

If you’re building solo, the async outbox is one of those “boring” patterns that quietly saves weeks of debugging later.

What’s your default for side effects in early-stage systems—inline calls, a queue, an outbox, or straight to a message bus? And what failure pushed you there?