Why I Add an Async Outbox Before Reaching for Kafka
A solo-dev pattern for reliable emails and webhooks without a message bus
The first time you ship “send email after signup” as an inline API call, it works—until it doesn’t. One slow provider, one transient timeout, or one deploy at the wrong time and you start dropping side effects (emails, webhooks, audit logs) or, worse, sending duplicates. As a solo creator, the painful part isn’t just the bug—it’s the operational overhead of fixing it repeatedly without a team. Here’s the architectural decision I now default to: add an async outbox (in the same database) before reaching for a full message bus.
1) The decision: do you need a message bus yet?
When you’re building solo, reliability problems show up in the least glamorous places:
- password reset emails that sometimes don’t arrive
- webhook deliveries that randomly fail
- “welcome” sequences that send twice
- background jobs that disappear during deploys
The common root cause is coupling: your request path is doing too much work, and your side effects aren’t transactional with your core write.
The tempting solution is “add Kafka/RabbitMQ/SQS.” But running (or even integrating) a message bus is not free: schema evolution, retries, DLQs, observability, idempotency, consumer deployments, and a new failure domain.
My default for early-stage systems is an async outbox: write an “event to send” into the same database transaction as your core change, then have a worker deliver it with retries.
Key idea: if the business write commits, the side effect is guaranteed to be recorded—even if delivery happens later.
2) Context (The Problem Space)
Requirements & constraints
For a solo system design, I optimize for:
- Correctness over immediacy: “email eventually sent” beats “sometimes sent instantly.”
- Low operational load: fewer moving pieces, fewer dashboards.
- Cost predictability: one database and one worker is usually enough.
- Deploy safety: deploys shouldn’t drop side effects.
Typical scale expectations
This pattern holds comfortably for:
- tens to hundreds of requests/sec
- thousands to millions of outbox rows/day
- side effects like email/webhook/analytics events
Non-functional requirements
- At-least-once delivery (with idempotency on the consumer/provider side)
- Retry with backoff
- Observability: know what’s stuck and why
- No phantom sends: don’t send an email if the user row didn’t commit
Why “just do it inline” doesn’t fit
Inline side effects fail in subtle ways:
- you commit the DB write, then the email API times out → user exists, but no email
- you send the email, then the DB transaction rolls back → email references a user that “doesn’t exist”
- you retry the request, and now you send duplicates
The outbox is basically admitting that distributed systems exist even in a monolith: your DB plus your third-party APIs already form a distributed system.
3) Options considered
Below are the common approaches I’ve used/seen, and where they break.
Comparison table
| Option | What it is | Pros | Cons | Best when |
| --- | --- | --- | --- | --- |
| Inline side effects | Call email/webhook provider inside request | Simple, low latency | Not transactional, timeouts hurt UX, duplicates on retries | Truly non-critical side effects |
| Background job queue only | Push job to Redis/queue from request | Async, faster requests | Still not transactional unless the enqueue is in the same transaction boundary | You can tolerate occasional lost jobs |
| Async outbox (DB) | Write outbox row in same DB transaction; worker delivers | Transactional recording, fewer components, great for solo | Adds polling/worker, needs idempotency + cleanup | MVPs to mid-scale systems |
| CDC (change data capture) | Stream DB changes to consumers (Debezium, logical replication) | Near real-time, scalable | Operational complexity, schema coupling, infra overhead | Data platforms, multiple consumers |
| Full message bus | Kafka/RabbitMQ/SQS + producers/consumers | High throughput, decoupling, replay | More infra, more failure modes, more tooling | Many services/teams, high event volume |
Option notes (the “gotchas”)
Inline side effects
- Works until your provider latency spikes.
- Forces you to choose between slow user experience and unreliable delivery.
Background queue only (Redis etc.)
- Better UX, but if enqueue happens after commit and the process crashes in between, you lose the job.
- If enqueue happens before commit and the commit fails, you send an email for a transaction that never happened.
Async outbox
- You trade a bit of implementation complexity for a big jump in correctness.
- Your DB becomes both the system of record and the “durable queue.”
CDC or message bus
- Great when multiple consumers need the same events, or you need replay.
- Usually too much surface area for a solo codebase early on.
4) The decision (What I chose)
I choose Async Outbox in the primary database as the default for emails/webhooks/audit events.
Primary reasons (ranked)
- Transactional integrity: the outbox row is committed with the business write.
- Operational simplicity: no new infra tier (beyond a worker process).
- Deploy resilience: if the worker is down, events accumulate; nothing is lost.
- Debuggability: outbox table is a truth source you can query with SQL.
What I gave up
- Instant delivery: outbox is “near real-time,” not truly synchronous.
- DB load: polling adds reads/writes; you must index correctly.
- Exactly-once: you usually get at-least-once; duplicates are handled via idempotency.
Implementation overview
Data model
A minimal outbox table:
CREATE TABLE outbox_events (
id BIGSERIAL PRIMARY KEY,
topic TEXT NOT NULL,
payload JSONB NOT NULL,
idempotency_key TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'pending', -- pending, processing, sent, failed
attempts INT NOT NULL DEFAULT 0,
available_at TIMESTAMPTZ NOT NULL DEFAULT now(),
locked_at TIMESTAMPTZ,
lock_owner TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Prevent duplicate logical events (e.g., signup welcome email)
CREATE UNIQUE INDEX outbox_idempotency_uk
ON outbox_events(topic, idempotency_key);
-- Fast fetching of ready work
CREATE INDEX outbox_ready_idx
ON outbox_events(status, available_at)
WHERE status IN ('pending', 'failed');
The important design choice here is idempotency_key. This is what keeps “at-least-once delivery” from becoming “user got 3 emails.”
Examples of idempotency keys:
- `welcome_email:user_id=123`
- `webhook:invoice_paid:invoice_id=abc`
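To keep keys like these consistent across the codebase, I like to centralize how they're built. A tiny sketch; these helper names are mine, not from any library:

// Deterministic idempotency keys: the same logical event always
// produces the same key, so ON CONFLICT DO NOTHING can dedupe it.
function welcomeEmailKey(userId: number): string {
  return `welcome_email:user_id=${userId}`;
}

function invoicePaidWebhookKey(invoiceId: string): string {
  return `webhook:invoice_paid:invoice_id=${invoiceId}`;
}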
Writing to the outbox in the same transaction
Pseudocode (Node/TypeScript-ish, but the idea is language-agnostic):
await db.tx(async (tx) => {
  // RETURNING hands back the new row; grab the first result row.
  const { rows: [user] } = await tx.query(
    `INSERT INTO users(email) VALUES ($1) RETURNING id, email`,
    [email]
  );

  await tx.query(
    `INSERT INTO outbox_events(topic, payload, idempotency_key)
     VALUES ($1, $2::jsonb, $3)
     ON CONFLICT (topic, idempotency_key) DO NOTHING`,
    [
      'email.welcome',
      JSON.stringify({ userId: user.id, email: user.email }),
      `welcome_email:user_id=${user.id}`
    ]
  );
});
This is the core win: either both rows commit, or neither does.
Worker: claim rows safely (skip locked)
The worker loop should:
- fetch a small batch of ready events
- atomically mark them as processing (lock)
- deliver
- mark sent or schedule retry
In Postgres, a common pattern is FOR UPDATE SKIP LOCKED:
WITH next AS (
SELECT id
FROM outbox_events
WHERE status IN ('pending', 'failed')
AND available_at <= now()
ORDER BY created_at
LIMIT 50
FOR UPDATE SKIP LOCKED
)
UPDATE outbox_events e
SET status = 'processing',
locked_at = now(),
lock_owner = $1,
updated_at = now()
FROM next
WHERE e.id = next.id
RETURNING e.id, e.topic, e.payload, e.attempts;
This lets you run multiple worker instances without double-processing the same row.
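To tie the pieces together, here's a minimal worker loop sketch in the same TypeScript-ish style. It assumes a `db` client exposing `query`, the claim query above saved as `CLAIM_SQL`, and a `deliver(topic, payload)` function you write for your provider; all three are placeholder names.

const WORKER_ID = `worker-${process.pid}`;

async function workerTick(): Promise<void> {
  // Claim a batch atomically via the FOR UPDATE SKIP LOCKED query above.
  const { rows: events } = await db.query(CLAIM_SQL, [WORKER_ID]);

  for (const event of events) {
    try {
      // deliver() is your integration point: email API, webhook POST, etc.
      await deliver(event.topic, event.payload);
      await db.query(
        `UPDATE outbox_events
            SET status = 'sent', updated_at = now()
          WHERE id = $1`,
        [event.id]
      );
    } catch (err) {
      // Schedule a retry with backoff; markFailed is shown in the next section.
      await markFailed(event.id, event.attempts, err as Error);
    }
  }
}

// Poll on a fixed interval. If a worker crashes mid-batch, its rows stay
// 'processing'; a periodic sweep of stale locked_at rows (not shown here)
// should return them to 'pending'.
async function runWorker(): Promise<never> {
  while (true) {
    await workerTick();
    await new Promise((resolve) => setTimeout(resolve, 500));
  }
}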
Retry policy with backoff
I usually start with exponential backoff capped at a few minutes.
function nextBackoffSeconds(attempt: number): number {
// 1, 2, 4, 8, 16, 32, 60, 60...
return Math.min(60, 2 ** Math.max(0, attempt));
}
async function markFailed(id: number, attempts: number, err: Error) {
const delay = nextBackoffSeconds(attempts);
await db.query(
`UPDATE outbox_events
SET status='failed',
attempts = attempts + 1,
available_at = now() + ($2 || ' seconds')::interval,
updated_at = now(),
payload = jsonb_set(payload, '{last_error}', to_jsonb($3::text), true)
WHERE id = $1`,
[id, delay, err.message]
);
}
A few deliberate choices:
- store `last_error` to make SQL-based debugging possible
- cap backoff to avoid “retry storms”
- keep it simple; you can add jitter later (a sketch follows below)
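When you do add jitter, “full jitter” is a one-line change to the backoff function. A sketch:

// Full jitter: pick a random delay up to the exponential cap, so a
// burst of failures doesn't retry in lockstep against the provider.
function nextBackoffSecondsWithJitter(attempt: number): number {
  const cap = Math.min(60, 2 ** Math.max(0, attempt));
  // Random delay in [1, cap] seconds.
  return Math.max(1, Math.ceil(Math.random() * cap));
}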
5) Results & learnings
Because this is a general pattern (not tied to one product), I’ll share the kinds of numbers I’ve repeatedly observed in production-ish solo systems.
Performance impact
- Request latency: moving side effects out of the request typically drops p95 latency by 200–1500ms (depending on provider latency). The DB write for an outbox row is usually single-digit milliseconds when indexed.
- Throughput: a single worker polling every 250–1000ms and processing batches of 50 can comfortably handle hundreds to low thousands of events/min on modest hardware, assuming the downstream provider isn’t the bottleneck.
- DB load: the outbox table can become write-heavy. With proper partial indexes and batch updates, I typically see outbox overhead remain <5–10% of total DB CPU for small-to-mid workloads.
What worked well
- Debugging becomes SQL-native: “show me pending emails older than 10 minutes” is a query (see below).
- Deploys are less scary: if the worker is down for 10 minutes, you process the backlog.
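For example, the “stuck events” check is one query against the schema above:

-- Events that should have been delivered by now but weren't.
SELECT id, topic, attempts, available_at, created_at
FROM outbox_events
WHERE status IN ('pending', 'failed')
  AND created_at < now() - interval '10 minutes'
ORDER BY created_at
LIMIT 100;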
Unexpected challenges
- Idempotency is non-negotiable: you will send duplicates eventually (timeouts, provider ambiguity). Design for it.
- Poison messages: some payloads will fail forever (bad email, invalid webhook URL). You need a terminal state and alerting.
- Table growth: you must archive or delete sent events.
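Cleanup can start as a scheduled statement like this sketch; the 30-day retention window is an arbitrary choice:

-- Delete delivered events older than 30 days; archive them first
-- if you need them for auditing.
DELETE FROM outbox_events
WHERE status = 'sent'
  AND updated_at < now() - interval '30 days';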
What I’d do differently next time
- Add a `dead` status after N attempts and a lightweight admin view early.
- Add basic metrics (counts by status, oldest pending age) before problems happen.
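The metrics can start as a single query against the schema above:

-- Counts by status plus the age of the oldest event in each status.
SELECT status,
       count(*) AS events,
       min(created_at) AS oldest_created_at
FROM outbox_events
GROUP BY status;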
6) When this doesn’t work
The async outbox is not a universal answer.
Choose something else when:
- You need fan-out to many consumers with different replay needs. Outbox can do it, but it becomes awkward; a proper bus or CDC can be cleaner.
- Event volume is very high (e.g., tens of thousands/sec). Polling a relational DB becomes expensive; you’ll want streaming infrastructure.
- You require strict ordering across partitions (e.g., per-customer ordering at scale). You can approximate ordering, but it gets complex.
- Your primary DB is already the bottleneck. Turning it into a queue adds load; offloading to a dedicated queue might be healthier.
- You can’t tolerate duplicates at all and downstream isn’t idempotent. You can get closer with provider-side idempotency keys, transactional inbox patterns, or exactly-once semantics in specific systems—but complexity rises quickly.
7) Key takeaways
- Treat third-party APIs as unreliable dependencies; design side effects as async and retryable.
- If you only adopt one reliability pattern early: write an outbox row in the same DB transaction as your business change.
- Optimize for operational simplicity first; a worker + Postgres is often enough for a long time.
- Assume at-least-once delivery and make events idempotent with explicit keys.
- Plan for lifecycle: retries, dead-lettering (even if it’s just a `dead` status), and cleanup/archival.
8) Closing
If you’re building solo, the async outbox is one of those “boring” patterns that quietly saves weeks of debugging later.
What’s your default for side effects in early-stage systems—inline calls, a queue, an outbox, or straight to a message bus? And what failure pushed you there?