Why I Use pg_trgm Fuzzy Search Instead of Full-Text Search

Production notes from a 7,556-job board with fast, typo-tolerant filters


Search is where job boards quietly fail. Users don’t type perfect keywords, job titles aren’t standardized across sources, and “PMHNP” gets spelled five different ways. In my PMHNP Hiring job board (7,556+ jobs, 1,368+ companies, ~200 daily updates), I had to pick a search strategy that was fast (<50ms p95 for common queries), typo-tolerant, and cheap to operate as a solo creator. This post walks through the architecture decision I made: choosing PostgreSQL's pg_trgm similarity search over classic full-text search, and what I gave up to get there.

Context: search in a scraped job board is messy by default

PMHNP Hiring is a job board I built for Psychiatric Mental Health Nurse Practitioners. The “product” looks simple—filter by location, remote, company, posted date, and search by title/company. The system behind it is less clean because the data comes from 10+ sources with different formatting and inconsistent fields.

A few constraints shaped the search decision:

  • Scale & churn: 7,556+ active jobs, 1,368+ companies, and ~200 daily updates (incremental ingestion, not full refresh).
  • UX reality: users type partial queries (“psych”, “nurse prac”), acronyms (“PMHNP”), and typos (“psychiatric nurse practioner”).
  • Performance target: keep the common feed queries (filters + search) at ~50ms p95 at the database layer.
  • Operational simplicity: I’m a solo creator. I didn’t want a separate search cluster to babysit.
  • Security model: Supabase + PostgreSQL with RLS. I wanted search to stay inside Postgres so it inherits the same access control semantics.

This is where the first non-obvious problem appears: classic full-text search (FTS) is great for “documents”, but job titles and company names behave more like short strings where typo tolerance and partial matches dominate.

In a job board, search is less about linguistic relevance and more about forgiving messy input.

I wasn’t building Google. I needed a search that:

  1. Works well on short fields: title, company_name, and sometimes location_text.
  2. Supports “contains-like” behavior: users often remember only part of a title.
  3. Handles typos: similarity, not exact token match.
  4. Composes with filters: search + (state, remote, posted_at, source) should stay fast.
  5. Plays nicely with ingestion: updates come daily; the index must handle churn.

I also didn’t want relevance tuning to turn into a second product. If I had to spend days tweaking ts_rank weights, that’s a smell.

Options considered

Below are the realistic choices I evaluated for Postgres/Supabase.

| Option | What it is | Pros | Cons | Best when |
| --- | --- | --- | --- | --- |
| ILIKE '%query%' + B-tree | Naive substring match | Simple, no extensions | Slow on large tables; can’t use B-tree with leading wildcard | Tiny datasets or admin tools |
| PostgreSQL Full-Text Search (tsvector) | Token-based search using dictionaries | Good for long text; ranking; language support | Weak on typos/partial strings; tuning needed; titles are short | Articles, descriptions, “document” search |
| pg_trgm (trigram similarity) | String similarity via overlapping 3-char chunks | Typo-tolerant; fast with GIN/GiST; great for short fields | Not semantic; can match weirdly; needs threshold tuning | Names, titles, short text, “forgiving” search |
| External search (Meilisearch/Typesense/Elastic) | Dedicated search engine | Great relevance; typo handling; faceting | Extra infra; sync complexity; more moving parts; cost | High scale, complex ranking, multi-field relevance |

Why I didn’t stick with ILIKE

ILIKE feels tempting early on, especially when you’re “vibe coding” fast. But it collapses once you hit a few thousand rows and mix it with filters.

ILIKE '%pmhnp%' forces a scan unless you add specialized indexing. On 7k jobs it might still feel okay—until you add multi-tenant rules, joins to companies, and a few concurrent users.
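
If you want to see this on your own table, EXPLAIN makes it obvious: a leading wildcard rules out a plain B-tree index, so the planner has no option but a full scan. (Worth noting: a trigram GIN index can actually accelerate ILIKE, but at that point you're most of the way to pg_trgm anyway.)

-- With only a B-tree index on title, the leading '%' makes the index
-- unusable, so the planner falls back to scanning every row.
explain
select id, title
from jobs
where title ilike '%pmhnp%';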

Why full-text search wasn’t the right default

FTS shines when you search bodies of text. But for job boards, most searches are:

  • "pmhnp"
  • "psychiatric"
  • "remote"
  • "headway" (company)

FTS tokenization can hurt you here:

  • Typos don’t match.
  • Partial tokens don’t match unless you add prefix operators and accept recall/precision trade-offs.
  • Acronyms and short tokens can behave weirdly depending on dictionaries.
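
The difference is easy to demonstrate in psql. The English stemmer turns the misspelling into a lexeme that doesn't exist in the document, while trigram overlap barely notices it (scores are approximate):

-- FTS: 'practioner' stems to a different lexeme, so there is no match.
select to_tsvector('english', 'psychiatric nurse practitioner')
       @@ plainto_tsquery('english', 'practioner');  -- false

-- pg_trgm: the typo still shares most of its trigrams with the correct
-- spelling, scoring ~0.85, far above any sensible threshold.
select similarity('practitioner', 'practioner');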

Why I didn’t jump to an external search engine

I love dedicated search engines—but operating them is a commitment:

  • You need a sync pipeline (DB → search index) that is correct under retries.
  • You now have two sources of truth for availability.
  • You have to decide how search respects RLS / access control.

For PMHNP Hiring, the cost and complexity weren’t justified. Postgres could do “good enough” search with less risk.

The decision: pg_trgm + a search vector column I can index

I chose PostgreSQL’s pg_trgm extension and built search around a single normalized field (title + company + location) that I could index with GIN.

Primary reasons (ranked):

  1. Typo tolerance on short fields without building relevance infrastructure.
  2. Composable performance with filters (state, remote, posted_at).
  3. Operational simplicity: no extra services; works inside Supabase.
  4. Predictable indexing story: GIN trigram indexes are battle-tested.

What I gave up:

  • No semantic relevance (synonyms, intent, “psych NP” == “PMHNP”).
  • Similarity search can return “surprising” matches unless you tune thresholds.
  • Ranking is simpler; you’re not doing sophisticated scoring.

Implementation overview

I model jobs and companies relationally, but for search I avoid doing multiple similarity checks across joins at query time. Instead, I denormalize a search_text field on jobs.

1) Enable pg_trgm and add an indexed field

-- One-time
create extension if not exists pg_trgm;

-- Add a denormalized search field
alter table jobs add column if not exists search_text text;

-- Keep it simple: lowercased, whitespace-normalized text
-- (I populate this during ingestion / upserts)

-- GIN index for fast trigram search
create index if not exists jobs_search_text_trgm
on jobs using gin (search_text gin_trgm_ops);

Why a denormalized search_text?

  • Similarity across multiple columns (title, company_name) can prevent index use or force multiple index scans.
  • Joining companies for every search adds overhead; my feed queries already join for display.
  • With 200 daily updates, recomputing search_text is cheap and keeps reads fast.
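
For rows ingested before the column existed, a one-time backfill can approximate the worker's normalization in SQL. A sketch (the regex is a simplification of what the ingestion code does):

-- One-time backfill mirroring normalize() from the ingestion worker:
-- lowercase, strip punctuation, collapse whitespace.
update jobs j
set search_text = trim(regexp_replace(
      regexp_replace(
        lower(concat_ws(' ', j.title, c.name, j.location_text,
                        case when j.remote then 'remote' end)),
        '[^a-z0-9\s]', ' ', 'g'),
      '\s+', ' ', 'g'))
from companies c
where c.id = j.company_id
  and j.search_text is null;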

2) Populate search_text during upsert (pipeline-friendly)

My ingestion pipeline is: Cron (Vercel) → scraper → dedupe → upsert into Postgres. During the upsert, I compute search_text.

// Inside the ingestion worker (supabase-js). `db` is a Supabase client:
//   import { createClient } from "@supabase/supabase-js";
//   const db = createClient(SUPABASE_URL, SERVICE_ROLE_KEY);
const normalize = (s: string) =>
  s
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .replace(/\s+/g, " ")
    .trim();

const searchText = normalize([
  job.title,
  job.companyName,
  job.locationText,
  job.remote ? "remote" : "",
].filter(Boolean).join(" "));

const { error } = await db.from("jobs").upsert({
  id: job.id,
  title: job.title,
  company_id: job.companyId,
  location_text: job.locationText,
  remote: job.remote,
  posted_at: job.postedAt,
  search_text: searchText,
  source: job.source,
});
if (error) throw error; // surface upsert failures to the cron run

This is a deliberate trade: write-time work for read-time speed.

3) Query pattern: similarity + filters + stable pagination

I treat search as “filtering” rather than a separate endpoint. Most users search while also filtering by state/remote.

-- Parameterized query idea
-- :q is the normalized query string
-- :min_sim is a tuned threshold (e.g., 0.2 to 0.35)

select
  j.id, j.title, j.posted_at, j.remote,
  c.name as company_name
from jobs j
join companies c on c.id = j.company_id
where
  (:state is null or j.state = :state)
  and (:remote is null or j.remote = :remote)
  and (:q = '' or j.search_text % :q)
order by
  case when :q = '' then 0 else similarity(j.search_text, :q) end desc,
  j.posted_at desc,
  j.id desc
limit :limit;

Notes:

  • The % operator is trigram “similarity match” (uses the trigram index).
  • similarity() is used only for ordering when a query exists.
  • The secondary ordering by posted_at, id keeps results stable.
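
Because one of my constraints was keeping search inside Postgres so RLS keeps applying, this query also wraps cleanly into a SQL function the app can call as a single RPC. A sketch only; the function name and column types here are my illustration, not production code:

-- Hypothetical wrapper, callable as supabase.rpc('search_jobs', {...}).
-- It defaults to SECURITY INVOKER, so RLS on jobs/companies still applies.
create or replace function search_jobs(
  q text default '',
  p_state text default null,
  p_remote boolean default null,
  p_limit int default 50
)
returns table (id bigint, title text, posted_at timestamptz,
               remote boolean, company_name text)
language sql stable
as $$
  select j.id, j.title, j.posted_at, j.remote, c.name
  from jobs j
  join companies c on c.id = j.company_id
  where (p_state is null or j.state = p_state)
    and (p_remote is null or j.remote = p_remote)
    and (q = '' or j.search_text % q)
  order by
    case when q = '' then 0 else similarity(j.search_text, q) end desc,
    j.posted_at desc,
    j.id desc
  limit p_limit;
$$;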

4) Tuning similarity threshold without guesswork

The biggest footgun with pg_trgm is threshold tuning. Too low: irrelevant matches. Too high: you miss typos.

In Postgres you can set it per session:

-- Example: bump threshold for stricter matches
select set_limit(0.28);

-- Then run the search
select id, title
from jobs
where search_text % 'pmhnp remote';

In practice, I ended up using a slightly lower threshold for shorter queries and a higher one for longer queries (because long queries naturally have more trigrams).
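
Note that set_limit() is session-scoped, which gets risky behind a connection pooler; the transaction-local GUC is the safer way to vary the threshold per request. A sketch of the short-query case (the cutoff value is illustrative, not a tuned number):

begin;
-- Scoped to this transaction only, so pooled connections are unaffected.
-- Short queries have few trigrams, so a looser threshold catches typos.
set local pg_trgm.similarity_threshold = 0.22;
select id, title from jobs where search_text % 'pmhnp';
commit;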

Results & learnings (with real numbers)

After shipping pg_trgm search + the right supporting indexes (composite indexes for filters, plus connection pooling with PgBouncer), the database layer for the most common “feed + search” requests stabilized around:

  • ~50ms p95 query time for typical filtered listing queries (state/remote + optional search).
  • Search remained fast even with daily churn (~200 updates/day) because GIN index maintenance overhead at this scale is manageable.

What worked well:

  • Typos stopped mattering for the most common cases (company names, “psychiatric”, “practitioner”).
  • I didn’t need to invent a ranking model. Similarity + recency was “good enough”.
  • Keeping search in Postgres meant fewer moving parts and fewer failure modes.

Unexpected challenges:

  • Certain short queries (like "np") matched too broadly. The fix wasn’t more indexing—it was product constraints: a minimum query length, or requiring at least one non-trivial token (see the sketch after this list).
  • Similarity ordering can be noisy when many rows are “close enough”. Recency as a tie-breaker helped.
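
A minimal version of that guard, in the same style as the ingestion worker (the 3-character cutoff is illustrative):

// Skip similarity search for trivial inputs; the caller falls back to
// the plain filtered, recency-ordered feed.
const shouldUseSimilarity = (rawQuery: string): boolean => {
  const q = rawQuery
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .trim();
  // "np" alone matches half the table; require one token of 3+ chars.
  return q.split(/\s+/).some((token) => token.length >= 3);
};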

What I’d do differently:

  • Add a lightweight synonym layer (application-side) for domain terms (e.g., map “psych np” → “pmhnp psychiatric”). This is cheaper than building semantic search and improves relevance a lot.
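
A sketch of what that layer could look like (the mappings are examples, not a vetted domain list):

// Hypothetical application-side expansion, applied before the query is
// normalized. Expanding (not replacing) keeps the user's tokens in play.
const SYNONYMS: Record<string, string> = {
  "psych np": "pmhnp psychiatric nurse practitioner",
  "mental health np": "pmhnp psychiatric",
};

const expandQuery = (q: string): string => {
  const key = q.toLowerCase().trim();
  return key in SYNONYMS ? `${key} ${SYNONYMS[key]}` : key;
};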

When this doesn’t work

pg_trgm isn’t a universal answer. I’d pick something else if:

  • You need semantic relevance: synonyms, intent understanding, “director of nursing” matching “DON”, etc. That’s where a dedicated search engine or embeddings start to win.
  • You’re searching long bodies of text (job descriptions). FTS (or hybrid FTS + trgm) is often better.
  • Your dataset is massive and high-churn (hundreds of millions of rows). GIN index size and maintenance can become expensive.
  • You need advanced faceting + ranking beyond what SQL can comfortably express.

A pragmatic hybrid that I’d consider later (sketched below):

  • FTS for descriptions (token relevance)
  • pg_trgm for titles/company (typo tolerance)
  • Merge/rank results in SQL or application layer
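
Roughly, the shape would be something like this (assuming a maintained tsvector column such as description_tsv, which my current schema doesn't have; the greatest() merge is deliberately crude):

-- Hedged sketch of a hybrid query: FTS recall on descriptions,
-- trigram recall on short fields, merged with a naive combined score.
select j.id, j.title
from jobs j
where j.description_tsv @@ plainto_tsquery('english', :q)
   or j.search_text % :q
order by greatest(
    ts_rank(j.description_tsv, plainto_tsquery('english', :q)),
    similarity(j.search_text, :q)
  ) desc,
  j.posted_at desc
limit :limit;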

But I wouldn’t start there unless search quality is the core differentiator.

Key takeaways

  1. Match the search tool to the shape of your data: short strings behave differently than documents.
  2. Optimize for the operational budget you actually have: one Postgres instance you understand beats two systems you barely monitor.
  3. Denormalize intentionally when it removes joins from your hot path; pay the cost at ingestion time.
  4. Thresholds are product decisions: minimum query length and similarity limits are UX levers, not just database knobs.
  5. Use recency as a stabilizer: in job boards, “newer” is often a better tie-breaker than perfect relevance.

Closing

If you’ve built search for a marketplace or job board: did you stick with Postgres (FTS/trgm) or graduate to a dedicated search engine? I’m especially curious where your tipping point was—data size, relevance requirements, or team/ops maturity.