
Why I Use Partial Indexes for “Active Jobs” in Postgres

Keeping 50ms p95 while 8,000+ listings churn daily


The Problem

My job board has a simple read path on paper: show “active” jobs, let users filter by location, remote/hybrid, specialty, and recency.

In production it wasn’t simple.

I had 8,000+ active listings, ~2,000 companies, and ~200 listing updates per day coming from 10+ sources. Every source had its own “is this still active?” logic, so listings flipped between active and expired constantly. Users mostly care about active jobs. My database still had to keep expired rows for dedupe and audit.

The obvious approach was: add composite indexes for the filters.

That worked… until it didn’t. The indexes grew with expired rows too. Write amplification got worse. Autovacuum started showing up in my p95 latency charts. The ingestion pipeline didn’t fall over, but it got annoyingly close.

I wanted fast reads without paying the index tax on rows nobody queries.

Options I Considered

I ended up looking at three real options.

| Approach | Pros | Cons | Best For |
| --- | --- | --- | --- |
| One big composite index across all jobs | Simple mental model. Queries “just work.” | Index includes expired rows; grows forever. Higher write cost on every update. Hard to tune. | Small datasets or low-churn tables |
| Table partitioning (active vs expired, or by time) | Physical separation. Can drop old partitions. Vacuum is easier. | Operational overhead. More DDL, more footguns. Partition pruning depends on query shape. Supabase migrations get trickier. | Very large tables (10M+ rows), strict retention rules |
| Partial indexes on status='active' | Small index. Low write cost for expired rows. Keeps the query planner happy. | Queries must match the predicate. You’ll create multiple indexes if you have multiple “active” query patterns. | Medium/large tables where most queries target a subset |

Why not partitioning?

Partitioning is legit. If I had 50,000,000 rows with strict retention, I’d go there.

My table wasn’t that big. The churn was the issue.

Also: I’m running this on Postgres via Supabase. Partitioning is doable, but every migration becomes more delicate (especially if you need to change partition keys later). I’ve shipped enough DDL changes at 1:00 AM to know what I’m signing up for.

Why not “just cache it”?

I didn’t want Redis as a band-aid for avoidable index mistakes.

Caching helps, but my traffic pattern isn’t “same query repeated.” It’s lots of combinations: location + remote + posted_at + specialty. You cache the top few, sure, but the database still needs to handle the long tail.

So I stayed in Postgres and fixed the root cause.

What I Chose (and Why)

I moved from “index everything” to “index what users actually query”: active jobs.

That meant partial indexes.

Schema (simplified)

I keep a single jobs table with a status field. Expired rows stay. They’re useful for dedupe and for avoiding re-ingesting the same job from a source that republishes.

-- Postgres
create type job_status as enum ('active', 'expired');

create table if not exists public.jobs (
  id bigserial primary key,
  company_id bigint not null,
  title text not null,
  location text,
  remote boolean not null default false,
  specialty text,
  status job_status not null default 'active',
  posted_at timestamptz not null,
  updated_at timestamptz not null default now()
);

The partial indexes

My hottest query is: active jobs ordered by recency, filtered by a couple of fields.

So I created indexes that only cover status='active'.

-- Fast “feed” ordering for active jobs
create index if not exists jobs_active_posted_at_id_idx
on public.jobs (posted_at desc, id desc)
where status = 'active';

-- Common filter: remote + recency
create index if not exists jobs_active_remote_posted_at_id_idx
on public.jobs (remote, posted_at desc, id desc)
where status = 'active';

-- Common filter: location (text) + recency
create index if not exists jobs_active_location_posted_at_id_idx
on public.jobs (location, posted_at desc, id desc)
where status = 'active' and location is not null;

Why the (posted_at desc, id desc) tail everywhere?

Because I paginate by recency and I need a stable tie-breaker. Two jobs can share the same posted_at down to the second (scrapers do that). Without id, keyset pagination gets weird.
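For reference, the keyset query itself looks roughly like this ($1/$2 stand for the posted_at and id of the last row on the previous page). The row-value comparison lines up with the index column order, which is what lets Postgres walk the index instead of sorting:

```sql
-- Keyset pagination over the active feed.
-- ($1, $2) = (posted_at, id) of the last row on the previous page.
select id, company_id, title, location, remote, specialty, posted_at
from public.jobs
where status = 'active'
  and (posted_at, id) < ($1, $2)  -- row-value comparison, matches (posted_at desc, id desc)
order by posted_at desc, id desc
limit 25;
```

Because both index columns share the same direction, the row-value comparison stays index-compatible; mixing asc and desc columns would break that.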

The trade-offs (what I gave up)

  • I gave up “one index to rule them all.” Now I have a small set of indexes that map to real query shapes.
  • I gave up some flexibility. If I forget status='active' in a query, the planner won’t use the partial index. You feel it immediately.
  • I accepted more schema work during feature development. Every new filter is a question: does it deserve an index?
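The cheap way to check whether a query honors the contract is explain. Exact plans depend on table size and statistics, but the shape to look for is roughly:

```sql
-- Predicate matches: with enough rows, expect an Index Scan
-- using jobs_active_posted_at_id_idx.
explain
select id, title
from public.jobs
where status = 'active'
order by posted_at desc, id desc
limit 25;

-- Predicate missing: the planner can't prove status = 'active',
-- so the partial index is off the table and you get a sort or seq scan.
explain
select id, title
from public.jobs
order by posted_at desc, id desc
limit 25;
```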

That said, the ingestion pipeline stopped paying for rows nobody reads.

Querying from Next.js (Supabase)

This is roughly what my server route does for the jobs feed.

import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

type FeedParams = {
  remote?: boolean;
  location?: string;
  limit?: number;
};

export async function fetchActiveJobsFeed(params: FeedParams) {
  const limit = Math.min(params.limit ?? 25, 50);

  let q = supabase
    .from("jobs")
    .select("id, company_id, title, location, remote, specialty, posted_at")
    .eq("status", "active") // must match the partial index predicate
    .order("posted_at", { ascending: false })
    .order("id", { ascending: false }) // stable tie-breaker for pagination
    .limit(limit);

  if (params.remote !== undefined) q = q.eq("remote", params.remote);
  if (params.location) q = q.eq("location", params.location);

  const { data, error } = await q;
  if (error) throw error;
  return data;
}

That eq("status", "active") isn’t optional anymore. It’s part of the contract.
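One way to make that contract harder to forget (a sketch of an option, not something this setup uses) is a view that bakes the predicate in. Supabase exposes views through its API, so .from("active_jobs") queries it like a table:

```sql
-- Every query through this view automatically matches the partial indexes.
create view public.active_jobs as
select id, company_id, title, location, remote, specialty, posted_at
from public.jobs
where status = 'active';
```

The trade-off is one more object to keep in sync when the jobs table changes shape.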

How It Worked in Production

Before partial indexes, I tried a single composite index that included status but covered the whole table. It “worked” until the table accumulated expired rows.

The symptoms were boring and painful:

  • p95 for the main feed query drifted from 50ms to 120ms over a couple weeks.
  • Ingestion updates (status flips + updated_at) started taking long enough that my cron window got tight.
  • Autovacuum activity correlated with read latency spikes.

After moving to partial indexes, the feed stabilized:

  • 50ms p95 for the job feed query (active + ordered by recency) under normal load.
  • Write cost dropped because expired rows stopped participating in the biggest indexes.
  • Index bloat slowed down visibly. I still vacuum, but it’s not constantly fighting giant indexes that include dead weight.
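To keep an eye on that, per-index sizes come straight out of the standard Postgres catalogs (nothing Supabase-specific; 'jobs' matches the table above):

```sql
-- On-disk size of each index on the jobs table, biggest first
select indexrelname as index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) as size
from pg_stat_user_indexes
where relname = 'jobs'
order by pg_relation_size(indexrelid) desc;
```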

The surprise: I initially created too many partial indexes.

I mirrored every filter permutation (remote + specialty + location + …). Bad move. Postgres can combine bitmap scans sometimes, but you still pay maintenance overhead per index. I deleted the low-value ones and kept only what matched real traffic.
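A sketch of how that cleanup can go; idx_scan counts are cumulative since the last stats reset, so give them a few days of real traffic first. The index name in the drop is a hypothetical low-value one, not one of the three keepers:

```sql
-- Candidates for deletion: indexes real traffic never touched
select indexrelname, idx_scan
from pg_stat_user_indexes
where relname = 'jobs'
  and idx_scan = 0;

-- Drop without blocking reads/writes.
-- Note: drop index concurrently can't run inside a transaction block.
drop index concurrently if exists jobs_active_specialty_idx;
```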

I got that traffic data by logging normalized filter shapes (not raw text) from the API: remote=true, location=CA, specialty=child-adolescent, etc. Two days of logs made the index decisions obvious.
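If you'd rather collect this server-side, pg_stat_statements groups queries by normalized shape, which gets you similar data without app changes; a sketch, assuming the extension is enabled (it is on Supabase):

```sql
-- Most frequent normalized query shapes, with average latency in ms
select query, calls, mean_exec_time
from pg_stat_statements
order by calls desc
limit 20;
```

I still prefer app-side filter-shape logs for this, since they map directly to product filters rather than SQL text.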

When This Doesn’t Work

Partial indexes break down when the “hot subset” isn’t stable.

If users query all statuses equally, you don’t have a subset to target. Same story if your predicate changes constantly (today it’s status='active', tomorrow it’s “active OR sponsored OR pinned”).

Also: if you have hundreds of tenants and each tenant mostly queries its own rows, partial indexes per-tenant are a trap. You’ll drown in indexes. At that point I’d rather use a composite index on (tenant_id, posted_at, id) and keep the schema boring.
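That boring multi-tenant version would look something like this (tenant_id is hypothetical; my jobs table doesn't have one):

```sql
-- One composite index serves every tenant's "recent rows" query
create index if not exists jobs_tenant_posted_at_id_idx
on public.jobs (tenant_id, posted_at desc, id desc);
```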

And if you genuinely need strict retention and cheap drops, partitioning wins.

Key Takeaways

  • Index what users query, not what exists in the table. My users query status='active' almost exclusively.
  • Partial indexes are a write-optimization tool as much as a read-optimization tool.
  • Keep index count low. I started with 9 partial indexes and ended with 3 that mattered.
  • Make query shape a contract. If the app forgets the predicate (status='active'), performance becomes random.
  • Use real traffic to drive index design. Two days of filter-shape logs saved me from guessing.

Closing

Partial indexes gave me predictable reads without turning my ingest pipeline into an index-maintenance job.

If you’re running Postgres for a “mostly-active” dataset: do you model it as a status column with partial indexes, or do you physically split hot/cold data (partitioning or separate tables)? Where did your approach start hurting?