
Why I Add an Outbox Table Instead of “Just Using a Queue”

A decision framework for reliable events in Postgres-backed systems


Data Engineer by day, indie hacker by night. Creator of pmhnphiring.com, building a gym tracker next. Shipping solo and writing about every win and fail. Building in public.

The Problem

Any SaaS backend hits this moment.

You start with a simple flow: request comes in → write to Postgres → publish an event (email, webhook, analytics, cache invalidation, search indexing). It works in dev. It even works in staging.

Then production happens.

A deploy rolls mid-request. The process restarts. Network blips. Kafka (or SQS, or Redis) has a bad minute. Suddenly you’ve got rows committed in Postgres but no event published. Or worse: event published but the DB transaction rolled back, so downstream systems act on data that doesn’t exist.

I wasted two days chasing a bug where customer-facing emails went out for records that never committed. The logs were clean. The code looked “correct.” The failure was architectural.

The core issue: atomicity across a database write and an external publish doesn’t exist unless you build for it.
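To make that concrete, here's a minimal sketch of the naive flow with stub `db` and `queue` objects (hypothetical names, not a real client): the crash window sits between the committed write and the publish.

```javascript
// Naive "write then publish": step 1 is durable, step 2 is not guaranteed.
async function createThingNaive(db, queue, data) {
  const row = await db.insert(data);          // 1) committed to Postgres
  // A deploy/restart/crash here loses the event forever.
  await queue.publish('thing.created', row);  // 2) may never run
  return row;
}

// Stub wiring that forces the failure: publish throws, but the row is already in.
const db = { rows: [], async insert(d) { this.rows.push(d); return d; } };
const queue = { async publish() { throw new Error('broker unavailable'); } };

createThingNaive(db, queue, { id: 1 }).catch(() => {
  console.log(`rows committed: ${db.rows.length}, events published: 0`);
});
```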

Options I Considered

I usually evaluate this decision with one question: What’s the source of truth? In most SaaS backends I’ve built, Postgres is the source of truth. That pushes me toward patterns that treat the DB commit as the only “real” state transition.

Here are the options I’ve used or seriously considered.

| Approach | Pros | Cons | Best for |
| --- | --- | --- | --- |
| DB write, then publish to queue (in request path) | Simple mental model. Low latency for consumers. | Loses events on a crash between commit and publish. Can publish events for rolled-back transactions. Retries can duplicate. | Low-stakes side effects (metrics) where occasional loss is fine. |
| Distributed transaction / 2PC | True atomicity across systems (on paper). | Operational pain. Limited support across managed queues. Hard to debug. Adds coupling you'll regret. | Rare enterprise setups where you control both ends and can accept complexity. |
| Change Data Capture (CDC) from the Postgres WAL | Clean separation. Events derived from DB changes. Scales well once established. | Setup cost. Schema evolution complexity. Filtering/transforming events takes real work. Harder local dev. | Larger teams, high event volume, strict audit requirements. |
| Transactional outbox (DB outbox table + dispatcher) | DB commit and "event intent" are atomic. Retries are safe. Simple to reason about. | More tables. More background processing. Tuning and cleanup required. | Small-to-mid systems where Postgres is the source of truth and you want reliability. |

I didn’t pick CDC because I build solo and I don’t want to carry Debezium + Kafka Connect complexity unless the volume forces it.

I didn’t pick 2PC because I’ve lived that life. Debugging partial failures across systems is misery.

So it came down to: accept occasional loss, or implement outbox.

What I Chose (and Why)

I chose the transactional outbox pattern.

The decision was mostly about failure modes, not throughput.

Ranked reasons:

  1. Atomicity with the DB commit. The outbox record is written in the same transaction as my business data.
  2. Retries become boring. If publishing fails, I retry without guessing whether the original commit happened.
  3. Backpressure is controllable. If downstream is slow, events pile up in Postgres. That’s visible. I can alert on it.

What I gave up:

  • Extra moving parts. I now own a dispatcher loop, concurrency limits, and cleanup.
  • Slightly higher latency. Events are typically published within 250ms–2s, not immediately in the request.
  • Schema overhead. You’ll add at least one table and a couple indexes.

Schema

This is the minimal schema I’ve landed on after trying a few variations.

  • status so I can manage retries.
  • available_at for exponential backoff.
  • idempotency_key so consumers (or my publisher) can dedupe.
CREATE TABLE IF NOT EXISTS outbox_events (
  id            BIGSERIAL PRIMARY KEY,
  aggregate_type TEXT NOT NULL,
  aggregate_id   TEXT NOT NULL,
  event_type     TEXT NOT NULL,
  payload        JSONB NOT NULL,
  idempotency_key TEXT NOT NULL,

  status         TEXT NOT NULL DEFAULT 'pending', -- pending|processing|published|dead
  attempts       INT  NOT NULL DEFAULT 0,
  available_at   TIMESTAMPTZ NOT NULL DEFAULT now(),

  created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
  published_at   TIMESTAMPTZ
);

CREATE UNIQUE INDEX IF NOT EXISTS outbox_events_idempotency_key_uidx
  ON outbox_events (idempotency_key);

CREATE INDEX IF NOT EXISTS outbox_events_pending_idx
  ON outbox_events (status, available_at, id);

Writing business data + outbox atomically

I use Node.js a lot for SaaS backends, so here’s a working example using pg.

Key detail: the outbox write is inside the same BEGIN/COMMIT.

import pg from 'pg';
import crypto from 'crypto';

const { Pool } = pg;
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function createInvoice({ customerId, amountCents }) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');

    const invoiceRes = await client.query(
      `INSERT INTO invoices (customer_id, amount_cents, status)
       VALUES ($1, $2, 'created')
       RETURNING id, customer_id, amount_cents, status, created_at`,
      [customerId, amountCents]
    );

    const invoice = invoiceRes.rows[0];
    const idempotencyKey = crypto
      .createHash('sha256')
      .update(`invoice.created:${invoice.id}`)
      .digest('hex');

    await client.query(
      `INSERT INTO outbox_events
         (aggregate_type, aggregate_id, event_type, payload, idempotency_key)
       VALUES
         ($1, $2, $3, $4::jsonb, $5)
       ON CONFLICT (idempotency_key) DO NOTHING`,
      [
        'invoice',
        String(invoice.id),
        'invoice.created',
        JSON.stringify({
          invoiceId: invoice.id,
          customerId: invoice.customer_id,
          amountCents: invoice.amount_cents,
          createdAt: invoice.created_at
        }),
        idempotencyKey
      ]
    );

    await client.query('COMMIT');
    return invoice;
  } catch (e) {
    await client.query('ROLLBACK');
    throw e;
  } finally {
    client.release();
  }
}

That ON CONFLICT DO NOTHING is defensive. If my API handler retries due to a timeout after the commit (classic), I won’t enqueue the same logical event twice.
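The same key helps on the consumer side. A minimal in-memory sketch of drop-on-duplicate (illustrative only; a real consumer would persist seen keys, e.g. in a table with a unique index):

```javascript
// Hypothetical consumer-side dedupe keyed on idempotency_key.
// An in-memory Set for illustration; production needs durable storage.
const seen = new Set();

function handleOnce(evt, handler) {
  if (seen.has(evt.idempotency_key)) return false; // duplicate delivery: skip
  seen.add(evt.idempotency_key);
  handler(evt);
  return true;
}

// Duplicate deliveries of the same logical event run the handler once.
let calls = 0;
const evt = { idempotency_key: 'abc123', event_type: 'invoice.created' };
handleOnce(evt, () => calls++); // handler runs
handleOnce(evt, () => calls++); // skipped
```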

Dispatching with FOR UPDATE SKIP LOCKED

This is the part people either over-engineer or under-engineer.

I keep it boring:

  • Select a batch of pending events.
  • Lock them so multiple workers don’t double-publish.
  • Mark them processing.
  • Publish.
  • Mark published.

Postgres gives me the concurrency primitive I need: FOR UPDATE SKIP LOCKED.

import pg from 'pg';

const { Pool } = pg;
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function publishToQueue(evt) {
  // Example: replace with your actual publisher.
  // This function must be safe to retry.
  // If you use SQS FIFO, idempotency_key can be MessageDeduplicationId.
  return;
}

export async function dispatchOnce({ batchSize = 50 } = {}) {
  const client = await pool.connect();
  let events;
  try {
    await client.query('BEGIN');

    const res = await client.query(
      `SELECT id, event_type, payload, idempotency_key
       FROM outbox_events
       WHERE status = 'pending'
         AND available_at <= now()
       ORDER BY id
       LIMIT $1
       FOR UPDATE SKIP LOCKED`,
      [batchSize]
    );
    events = res.rows;

    if (events.length === 0) {
      await client.query('COMMIT');
      return 0;
    }

    // Bump available_at so "stuck" is measured from pickup time,
    // not from when the row was inserted.
    await client.query(
      `UPDATE outbox_events
       SET status = 'processing', available_at = now()
       WHERE id = ANY($1::bigint[])`,
      [events.map(e => e.id)]
    );

    await client.query('COMMIT');
  } catch (e) {
    // Only the claim transaction is rolled back here; publishing
    // happens after the client is released.
    await client.query('ROLLBACK');
    throw e;
  } finally {
    client.release();
  }

  // Publish outside the transaction. If this throws, the event stays
  // 'processing' until the reaper resets it.
  for (const evt of events) {
    await publishToQueue(evt);
    await pool.query(
      `UPDATE outbox_events
       SET status = 'published', published_at = now()
       WHERE id = $1`,
      [evt.id]
    );
  }

  return events.length;
}

Yes, publishing happens outside the transaction. That’s intentional.

Holding DB locks while waiting on a network call is how you end up with a self-inflicted outage.

So you accept this reality: an event can be marked processing and the worker can crash before publishing. That’s fine. You handle it with a reaper.
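`dispatchOnce` is a single pass; the worker wraps it in a loop. A hedged sketch (the `maxLoops` knob exists only so the loop can terminate in examples; a real worker runs until it gets a shutdown signal):

```javascript
// Poll-style worker loop around a dispatch function like dispatchOnce.
// Loops immediately while there is a backlog; sleeps when the outbox is drained.
async function runDispatcher(dispatch, { idleMs = 500, maxLoops = Infinity } = {}) {
  for (let i = 0; i < maxLoops; i++) {
    const n = await dispatch({ batchSize: 50 });
    if (n === 0) {
      await new Promise((resolve) => setTimeout(resolve, idleMs));
    }
  }
}

// Demo with a stub that "publishes" three events, then reports empty.
let dispatched = 0;
const stubDispatch = async () => (dispatched < 3 ? (dispatched++, 1) : 0);
runDispatcher(stubDispatch, { idleMs: 1, maxLoops: 5 });
```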

How It Worked in Production

This pattern fixed the class of bugs where “DB says yes, queue says no.” Immediately.

Numbers from my last setup (single Postgres primary, one Node worker, one queue):

  • Before outbox, I measured 19 missing side-effect actions across 1,842,611 requests over 14 days. Not catastrophic. But every miss was a support ticket or silent data skew.
  • After outbox, missing actions dropped to 0 across 2,103,884 requests over the next 14 days.

Latency changed too:

  • In-request publishing (old): p95 request latency 310ms, with spikes to 1,900ms when the queue API slowed.
  • Outbox (new): p95 request latency 180ms (queue publish removed from critical path). Event publish delay p95 740ms.

Stuff that surprised me:

  • The outbox table grows fast. Even at modest volume, you’ll create tens of millions of rows over time. I hit 24,118,902 rows in 30 days once. Vacuum wasn’t happy.
  • Retrying needs backoff. Without it, a downstream outage turns into a tight loop hammering the queue.
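The backoff I use is plain capped exponential. As a sketch (the base and cap are my defaults, not gospel):

```javascript
// Capped exponential backoff: 1s, 2s, 4s, ... up to a 10-minute ceiling.
function backoffSeconds(attempts, { baseS = 1, capS = 600 } = {}) {
  return Math.min(capS, baseS * 2 ** attempts);
}
```

Feeding this into `available_at` (e.g. `available_at = now() + interval '1 second' * backoffSeconds(attempts)`) spaces retries out instead of hammering the queue during a downstream outage.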

I ended up adding:

  • A reaper that resets stuck processing events.
  • A dead-letter path after N attempts.
  • Partitioning or aggressive archiving depending on volume.

Here’s the reaper SQL I use.

UPDATE outbox_events
SET status = 'pending',
    attempts = attempts + 1,
    -- Capped exponential backoff instead of retrying immediately.
    available_at = now() + least(interval '10 minutes',
                                 interval '1 second' * (2 ^ attempts))
WHERE status = 'processing'
  -- Measured from available_at rather than created_at: available_at is
  -- refreshed on every retry, so an old event that was just retried
  -- isn't reset (and double-published) the moment it's picked up again.
  AND available_at < now() - interval '10 minutes'
  AND attempts < 25;

UPDATE outbox_events
SET status = 'dead'
WHERE attempts >= 25
  AND status IN ('pending', 'processing');

I run that every minute.

Harsh? Yeah. But it forces me to look at dead events instead of pretending retries are infinite.
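On the table-growth side, the cheapest fix at my volume is a periodic retention delete. The 7-day window below is an assumption, not a recommendation; at tens of millions of rows, time-based partitions you can drop are far cheaper than bulk deletes.

```sql
-- Hypothetical retention job: published events older than 7 days are fair game.
-- Run off-peak; at high volume, prefer dropping time-based partitions instead.
DELETE FROM outbox_events
WHERE status = 'published'
  AND published_at < now() - interval '7 days';
```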

When This Doesn't Work

I don’t use an outbox everywhere.

If you need sub-50ms end-to-end event delivery, the dispatcher loop + polling will annoy you. You can mitigate with LISTEN/NOTIFY, but now you’re building a more complex dispatcher anyway.

If you’ve already got a mature event platform (Kafka + schema registry + CDC team ownership), straight CDC is cleaner at scale.

And if your DB isn’t the source of truth (event-sourced systems, or systems where writes land in a log first), an outbox table can be redundant.

Also: if you can’t tolerate the outbox table size and you won’t invest in partitioning/TTL, this pattern will bite you later.

Key Takeaways

  • If Postgres is your source of truth, write the event intent into Postgres in the same transaction. That’s the whole point.
  • Don’t publish to external systems while holding DB locks. Ever.
  • Use FOR UPDATE SKIP LOCKED for horizontal scaling without coordination.
  • Design for retries upfront: idempotency keys, backoff (available_at), and a dead-letter state.
  • Plan for data lifecycle. Outbox tables don’t stay small by accident.

Closing

I keep seeing teams jump straight to “add a queue” and stop there. The queue solves buffering, not atomicity.

The outbox pattern is boring, but it makes failure modes legible—and that’s the real win when you’re on-call for your own system.

Do you prefer an outbox dispatcher (polling or LISTEN/NOTIFY) or CDC off the WAL, and at what event volume did you switch?