<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[sathish builds]]></title><description><![CDATA[Engineering decisions, build logs, and lessons from shipping real products solo. Creator of pmhnphiring.com and currently building a gym tracker. Writing about what actually works (and what doesn't).]]></description><link>https://blog.dvskr.dev</link><generator>RSS for Node</generator><lastBuildDate>Thu, 09 Apr 2026 12:08:32 GMT</lastBuildDate><atom:link href="https://blog.dvskr.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Why I Use Partial Indexes for “Active Jobs” in Postgres]]></title><description><![CDATA[The Problem
My job board has a simple read path on paper: show “active” jobs, let users filter by location, remote/hybrid, specialty, and recency.
In production it wasn’t simple.
I had 8,000+ active listings, ~2,000 companies, and ~200 listing update...]]></description><link>https://blog.dvskr.dev/why-i-use-partial-indexes-for-active-jobs-in-postgres</link><guid isPermaLink="true">https://blog.dvskr.dev/why-i-use-partial-indexes-for-active-jobs-in-postgres</guid><category><![CDATA[Databases]]></category><category><![CDATA[Next.js]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[systemdesign]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Tue, 07 Apr 2026 15:00:56 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-the-problem">The Problem</h2>
<p>My job board has a simple read path on paper: show “active” jobs, let users filter by location, remote/hybrid, specialty, and recency.</p>
<p>In production it wasn’t simple.</p>
<p>I had 8,000+ active listings, ~2,000 companies, and ~200 listing updates per day coming from 10+ sources. Every source had its own “is this still active?” logic, so listings flipped between <code>active</code> and <code>expired</code> constantly. Users mostly care about <em>active</em> jobs. My database still had to keep expired rows for dedupe and audit.</p>
<p>The obvious approach was: add composite indexes for the filters.</p>
<p>That worked… until it didn’t. The indexes grew with expired rows too. Write amplification got worse. Autovacuum started showing up in my p95 latency charts. The ingestion pipeline didn’t fall over, but it got annoyingly close.</p>
<p>I wanted fast reads without paying the index tax on rows nobody queries.</p>
<h2 id="heading-options-i-considered">Options I Considered</h2>
<p>I ended up looking at three real options.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<th>Approach</th><th>Pros</th><th>Cons</th><th>Best For</th></tr>
</thead>
<tbody>
<tr>
<td>One big composite index across all jobs</td><td>Simple mental model. Queries “just work.”</td><td>Index includes expired rows; grows forever. Higher write cost on every update. Hard to tune.</td><td>Small datasets or low churn tables</td></tr>
<tr>
<td>Table partitioning (active vs expired, or by time)</td><td>Physical separation. Can drop old partitions. Vacuum is easier.</td><td>Operational overhead. More DDL, more footguns. Partition pruning depends on query shape. Supabase migrations get trickier.</td><td>Very large tables (10M+), strict retention rules</td></tr>
<tr>
<td>Partial indexes on <code>status='active'</code></td><td>Small index. Low write cost for expired rows. Keeps query planner happy.</td><td>Queries must match the predicate. You’ll create multiple indexes if you have multiple “active” query patterns.</td><td>Medium/large tables where most queries target a subset</td></tr>
</tbody>
</table>
</div><h3 id="heading-why-not-partitioning">Why not partitioning?</h3>
<p>Partitioning is legit. If I had 50,000,000 rows with strict retention, I’d go there.</p>
<p>My table wasn’t that big. The churn was the issue.</p>
<p>Also: I’m running this on Postgres via Supabase. Partitioning is doable, but every migration becomes more delicate (especially if you need to change partition keys later). I’ve shipped enough DDL changes at 1:00 AM to know what I’m signing up for.</p>
<h3 id="heading-why-not-just-cache-it">Why not “just cache it”?</h3>
<p>I didn’t want Redis as a band-aid for avoidable index mistakes.</p>
<p>Caching helps, but my traffic pattern isn’t “same query repeated.” It’s lots of combinations: location + remote + posted_at + specialty. You cache the top few, sure, but the database still needs to handle the long tail.</p>
<p>So I stayed in Postgres and fixed the root cause.</p>
<h2 id="heading-what-i-chose-and-why">What I Chose (and Why)</h2>
<p>I moved from “index everything” to “index what users actually query”: active jobs.</p>
<p>That meant partial indexes.</p>
<h3 id="heading-schema-simplified">Schema (simplified)</h3>
<p>I keep a single <code>jobs</code> table with a status field. Expired rows stay. They’re useful for dedupe and for avoiding re-ingesting the same job from a source that republishes.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Postgres</span>
<span class="hljs-keyword">create</span> <span class="hljs-keyword">type</span> job_status <span class="hljs-keyword">as</span> enum (<span class="hljs-string">'active'</span>, <span class="hljs-string">'expired'</span>);

<span class="hljs-keyword">create</span> <span class="hljs-keyword">table</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> public.jobs (
  <span class="hljs-keyword">id</span> bigserial primary <span class="hljs-keyword">key</span>,
  company_id <span class="hljs-built_in">bigint</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span>,
  title <span class="hljs-built_in">text</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span>,
  location <span class="hljs-built_in">text</span>,
  remote <span class="hljs-built_in">boolean</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">default</span> <span class="hljs-literal">false</span>,
  specialty <span class="hljs-built_in">text</span>,
  <span class="hljs-keyword">status</span> job_status <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">default</span> <span class="hljs-string">'active'</span>,
  posted_at timestamptz <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span>,
  updated_at timestamptz <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">default</span> <span class="hljs-keyword">now</span>()
);
</code></pre>
<h3 id="heading-the-partial-indexes">The partial indexes</h3>
<p>My hottest query is: active jobs ordered by recency, filtered by a couple of fields.</p>
<p>So I created indexes that only cover <code>status='active'</code>.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Fast “feed” ordering for active jobs</span>
<span class="hljs-keyword">create</span> <span class="hljs-keyword">index</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> jobs_active_posted_at_id_idx
<span class="hljs-keyword">on</span> public.jobs (posted_at <span class="hljs-keyword">desc</span>, <span class="hljs-keyword">id</span> <span class="hljs-keyword">desc</span>)
<span class="hljs-keyword">where</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'active'</span>;

<span class="hljs-comment">-- Common filter: remote + recency</span>
<span class="hljs-keyword">create</span> <span class="hljs-keyword">index</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> jobs_active_remote_posted_at_id_idx
<span class="hljs-keyword">on</span> public.jobs (remote, posted_at <span class="hljs-keyword">desc</span>, <span class="hljs-keyword">id</span> <span class="hljs-keyword">desc</span>)
<span class="hljs-keyword">where</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'active'</span>;

<span class="hljs-comment">-- Common filter: location (text) + recency</span>
<span class="hljs-keyword">create</span> <span class="hljs-keyword">index</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> jobs_active_location_posted_at_id_idx
<span class="hljs-keyword">on</span> public.jobs (location, posted_at <span class="hljs-keyword">desc</span>, <span class="hljs-keyword">id</span> <span class="hljs-keyword">desc</span>)
<span class="hljs-keyword">where</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'active'</span> <span class="hljs-keyword">and</span> location <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span>;
</code></pre>
<p>Why the <code>(posted_at desc, id desc)</code> tail everywhere?</p>
<p>Because I paginate by recency and I need a stable tie-breaker. Two jobs can share the same <code>posted_at</code> down to the second (scrapers do that). Without <code>id</code>, keyset pagination gets weird.</p>
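<p>With keyset pagination, “after the cursor” under <code>(posted_at desc, id desc)</code> spells out to a two-part predicate. A sketch of building it as a PostgREST filter string for supabase-js’s <code>.or()</code> (the helper and cursor shape are illustrative, not production code):</p>
<pre><code class="lang-ts">// Build the keyset predicate for "rows strictly after the cursor"
// when ordering by (posted_at desc, id desc):
// posted_at below the cursor, OR equal posted_at with a lower id.
type Cursor = { postedAt: string; id: number };

function keysetFilter(cursor: Cursor): string {
  return [
    "posted_at.lt." + cursor.postedAt,
    "and(posted_at.eq." + cursor.postedAt + ",id.lt." + cursor.id + ")",
  ].join(",");
}
</code></pre>
<p>Appended to the feed query as <code>q.or(keysetFilter(cursor))</code>, the next page starts exactly where the last one ended, even when several rows share a <code>posted_at</code>.</p>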
<h3 id="heading-the-trade-offs-what-i-gave-up">The trade-offs (what I gave up)</h3>
<ul>
<li>I gave up “one index to rule them all.” Now I have a small set of indexes that map to real query shapes.</li>
<li>I gave up some flexibility. If I forget <code>status='active'</code> in a query, the planner won’t use the partial index. You feel it immediately.</li>
<li>I accepted more schema work during feature development. Every new filter is a question: does it deserve an index?</li>
</ul>
<p>That said, the ingestion pipeline stopped paying for rows nobody reads.</p>
<h3 id="heading-querying-from-nextjs-supabase">Querying from Next.js (Supabase)</h3>
<p>This is roughly what my server route does for the jobs feed.</p>
<pre><code class="lang-ts"><span class="hljs-keyword">import</span> { createClient } <span class="hljs-keyword">from</span> <span class="hljs-string">"@supabase/supabase-js"</span>;

<span class="hljs-keyword">const</span> supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

<span class="hljs-keyword">type</span> FeedParams = {
  remote?: <span class="hljs-built_in">boolean</span>;
  location?: <span class="hljs-built_in">string</span>;
  limit?: <span class="hljs-built_in">number</span>;
};

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">fetchActiveJobsFeed</span>(<span class="hljs-params">params: FeedParams</span>) </span>{
  <span class="hljs-keyword">const</span> limit = <span class="hljs-built_in">Math</span>.min(params.limit ?? <span class="hljs-number">25</span>, <span class="hljs-number">50</span>);

  <span class="hljs-keyword">let</span> q = supabase
    .from(<span class="hljs-string">"jobs"</span>)
    .select(<span class="hljs-string">"id, company_id, title, location, remote, specialty, posted_at"</span>)
    .eq(<span class="hljs-string">"status"</span>, <span class="hljs-string">"active"</span>)
    .order(<span class="hljs-string">"posted_at"</span>, { ascending: <span class="hljs-literal">false</span> })
    .order(<span class="hljs-string">"id"</span>, { ascending: <span class="hljs-literal">false</span> })
    .limit(limit);

  <span class="hljs-keyword">if</span> (params.remote !== <span class="hljs-literal">undefined</span>) q = q.eq(<span class="hljs-string">"remote"</span>, params.remote);
  <span class="hljs-keyword">if</span> (params.location) q = q.eq(<span class="hljs-string">"location"</span>, params.location);

  <span class="hljs-keyword">const</span> { data, error } = <span class="hljs-keyword">await</span> q;
  <span class="hljs-keyword">if</span> (error) <span class="hljs-keyword">throw</span> error;
  <span class="hljs-keyword">return</span> data;
}
</code></pre>
<p>That <code>eq("status", "active")</code> isn’t optional anymore. It’s part of the contract.</p>
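<p>One way to keep that contract enforceable is to route every read through a helper that bakes the predicate in (<code>activeJobs</code> is a hypothetical wrapper of mine, not Supabase API):</p>
<pre><code class="lang-ts">// All feed/search reads go through this helper, so the status='active'
// predicate (the partial-index contract) can't be forgotten in one route.
function activeJobs(client: { from: (table: string) => any }) {
  return client.from("jobs").select("*").eq("status", "active");
}
</code></pre>
<p>Routes then add their own filters on top of the returned builder instead of starting from <code>client.from("jobs")</code> directly.</p>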
<h2 id="heading-how-it-worked-in-production">How It Worked in Production</h2>
<p>Before partial indexes, I tried a single composite index that included <code>status</code> but covered the whole table. It “worked” until the table accumulated expired rows.</p>
<p>The symptoms were boring and painful:</p>
<ul>
<li>p95 for the main feed query drifted from 50ms to 120ms over a couple of weeks.</li>
<li>Ingestion updates (status flips + updated_at) started taking long enough that my cron window got tight.</li>
<li>Autovacuum activity correlated with read latency spikes.</li>
</ul>
<p>After moving to partial indexes, the feed stabilized:</p>
<ul>
<li>50ms p95 for the job feed query (active + ordered by recency) under normal load.</li>
<li>Write cost dropped because expired rows stopped participating in the biggest indexes.</li>
<li>Index bloat slowed down visibly. I still vacuum, but it’s not constantly fighting giant indexes that include dead weight.</li>
</ul>
<p>The surprise: I initially created too many partial indexes.</p>
<p>I mirrored every filter permutation (remote + specialty + location + …). Bad move. Postgres can combine bitmap scans sometimes, but you still pay maintenance overhead per index. I deleted the low-value ones and kept only what matched real traffic.</p>
<p>I got that traffic data by logging normalized filter shapes (not raw text) from the API: <code>remote=true</code>, <code>location=CA</code>, <code>specialty=child-adolescent</code>, etc. Two days of logs made the index decisions obvious.</p>
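<p>The shape logger itself is tiny. A sketch (the field set is illustrative):</p>
<pre><code class="lang-ts">// Reduce a request's filters to a stable, low-cardinality "shape" key.
// Only normalized values go in (booleans, state codes, slugs), never raw text.
type Filters = { remote?: boolean; location?: string; specialty?: string };

function filterShape(f: Filters): string {
  const parts: string[] = [];
  if (f.remote !== undefined) parts.push("remote=" + f.remote);
  if (f.location) parts.push("location=" + f.location);
  if (f.specialty) parts.push("specialty=" + f.specialty);
  return parts.sort().join("&amp;") || "none";
}
</code></pre>
<p>Tallying these strings for a couple of days made it obvious which three indexes earned their keep.</p>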
<h2 id="heading-when-this-doesnt-work">When This Doesn't Work</h2>
<p>Partial indexes break down when the “hot subset” isn’t stable.</p>
<p>If users query <em>all statuses</em> equally, you don’t have a subset to target. Same story if your predicate changes constantly (today it’s <code>status='active'</code>, tomorrow it’s “active OR sponsored OR pinned”).</p>
<p>Also: if you have hundreds of tenants and each tenant mostly queries its own rows, partial indexes per-tenant are a trap. You’ll drown in indexes. At that point I’d rather use a composite index on <code>(tenant_id, posted_at, id)</code> and keep the schema boring.</p>
<p>And if you genuinely need strict retention and cheap drops, partitioning wins.</p>
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<ul>
<li>Index what users query, not what exists in the table. My users query <code>status='active'</code> almost exclusively.</li>
<li>Partial indexes are a write-optimization tool as much as a read-optimization tool.</li>
<li>Keep index count low. I started with 9 partial indexes and ended with 3 that mattered.</li>
<li>Make query shape a contract. If the app forgets the predicate (<code>status='active'</code>), performance becomes random.</li>
<li>Use real traffic to drive index design. Two days of filter-shape logs saved me from guessing.</li>
</ul>
<h2 id="heading-closing">Closing</h2>
<p>Partial indexes gave me predictable reads without turning my ingest pipeline into an index-maintenance job.</p>
<p>If you’re running Postgres for a “mostly-active” dataset: do you model it as a status column with partial indexes, or do you physically split hot/cold data (partitioning or separate tables)? Where did your approach start hurting?</p>
]]></content:encoded></item><item><title><![CDATA[How We Turn a 35% BLS PMHNP Growth Projection Into Search, Alerts, and Better Job Matches]]></title><description><![CDATA[The BLS projection (35% PMHNP growth from 2024–2034) is a macro signal. The hard part is translating it into a daily system that answers: where are the roles, what do they pay, how fast do they move, and which postings are actually real.
How We Turn ...]]></description><link>https://blog.dvskr.dev/how-we-turn-a-35-bls-pmhnp-growth-projection-into-search-alerts-and-better-job-matches</link><guid isPermaLink="true">https://blog.dvskr.dev/how-we-turn-a-35-bls-pmhnp-growth-projection-into-search-alerts-and-better-job-matches</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[healthcare tech]]></category><category><![CDATA[Next.js]]></category><category><![CDATA[supabase]]></category><category><![CDATA[TypeScript]]></category><category><![CDATA[webdev]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Thu, 05 Mar 2026 16:00:39 GMT</pubDate><content:encoded><![CDATA[<p>The BLS projection (35% PMHNP growth from 2024–2034) is a macro signal. The hard part is translating it into a daily system that answers: where are the roles, what do they pay, how fast do they move, and which postings are actually real.</p>
<h1 id="heading-how-we-turn-a-35-bls-pmhnp-growth-projection-into-search-alerts-and-better-job-matches">How We Turn a 35% BLS PMHNP Growth Projection Into Search, Alerts, and Better Job Matches</h1>
<p>The BLS projection of <strong>35% PMHNP job growth (2024–2034)</strong> is a clean number that shows up in headlines. As builders, we treat it as a <em>macro input</em>—useful, but not directly actionable.</p>
<p>What’s actionable is what that trend turns into at the job-posting layer:</p>
<ul>
<li>more postings (and more duplicates)</li>
<li>faster hiring cycles (time-to-fill compresses)</li>
<li>wider variance by state, setting, and employer type</li>
<li>more compensation noise (ranges, bonuses, productivity, telehealth modifiers)</li>
</ul>
<p>At <strong>PMHNP Hiring</strong>, we aggregate from <strong>500+ job sources daily</strong> and maintain <strong>10,000+ verified PMHNP jobs across 50 states</strong>. The goal isn’t to repeat the BLS number—it’s to make it queryable: <em>where is growth showing up today, and what does “good role” mean in that market?</em></p>
<h2 id="heading-macro-projection-micro-signals-we-can-measure">Macro projection → micro signals we can measure</h2>
<p>The 35% projection is best read as sustained demand: more people seeking care, expanding access efforts, and pressure on psychiatry capacity. PMHNPs sit right in the middle.</p>
<p>But “more jobs” doesn’t mean “every job is a fit.” So we track operational signals that correlate with a heated market:</p>
<ul>
<li><strong>posting velocity</strong> (new jobs per day/week by state + setting)</li>
<li><strong>time-to-fill proxies</strong> (how long a posting stays live, how often it gets refreshed)</li>
<li><strong>credentialing friction</strong> (signals in text like “credentialing support,” “start in 2–4 weeks,” etc.)</li>
<li><strong>comp package completeness</strong> (whether salary, schedule, supervision, and ramp are specified)</li>
</ul>
<p>One example we surface internally is a rolling <em>days-live</em> metric. It’s not perfect, but it’s observable at scale.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Rough “days live” metric from our normalized postings table</span>
<span class="hljs-keyword">select</span>
  state,
  <span class="hljs-keyword">percentile_cont</span>(<span class="hljs-number">0.5</span>) <span class="hljs-keyword">within</span> <span class="hljs-keyword">group</span> (<span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> (<span class="hljs-keyword">now</span>() - first_seen_at)) <span class="hljs-keyword">as</span> median_days_live,
  <span class="hljs-keyword">count</span>(*) <span class="hljs-keyword">as</span> active_jobs
<span class="hljs-keyword">from</span> jobs
<span class="hljs-keyword">where</span>
  <span class="hljs-keyword">role</span> = <span class="hljs-string">'PMHNP'</span>
  <span class="hljs-keyword">and</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'active'</span>
<span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> state
<span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> median_days_live <span class="hljs-keyword">asc</span>;
</code></pre>
<p>If median days-live drops, it often matches what clinicians feel as “employers are moving faster.” In the original blog, we cited time-to-fill tightening (e.g., ~32 days vs ~45). We treat those as hypotheses and validate them against our own observed posting lifecycle.</p>
<h2 id="heading-the-ingestion-pipeline-500-sources-one-schema">The ingestion pipeline: 500+ sources, one schema</h2>
<p>Scraping PMHNP jobs isn’t “fetch HTML, parse title.” Every source has its own quirks:</p>
<ul>
<li>different location formats (city/state, remote, multi-state, “within 50 miles”)</li>
<li>different salary formats (hourly, annual, per-visit, wide ranges)</li>
<li>duplicated listings across ATS platforms, job boards, and staffing agencies</li>
<li>stale posts that get re-published with new IDs</li>
</ul>
<p>Our pipeline is built to turn that mess into a stable contract:</p>
<ol>
<li><strong>Fetch</strong> (scheduled jobs) from boards/ATS feeds</li>
<li><strong>Parse</strong> into a common intermediate model</li>
<li><strong>Normalize</strong> fields (title → role, location → geo, salary → annualized range)</li>
<li><strong>Deduplicate</strong> and <strong>verify</strong></li>
<li><strong>Index</strong> for real-time filtering + alerts</li>
</ol>
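<p>The five steps above can be sketched as a single pass per source (the stage signatures are illustrative, not our production interfaces):</p>
<pre><code class="lang-ts">type RawPosting = { source: string; payload: string };
type NormalizedJob = { role: string; state: string; annualMin?: number };

// Each stage is a pure function; the pipeline is just composition,
// which keeps per-source quirks contained inside parse().
function runPipeline(
  raw: RawPosting[],
  parse: (r: RawPosting) => NormalizedJob | null,
  dedupeKey: (j: NormalizedJob) => string
): NormalizedJob[] {
  const seen: { [key: string]: NormalizedJob } = {};
  for (const r of raw) {
    const job = parse(r);          // parse + normalize
    if (!job) continue;            // unparseable rows are dropped, not guessed
    seen[dedupeKey(job)] = job;    // later sources overwrite earlier duplicates
  }
  return Object.values(seen);     // what actually gets indexed and alerted on
}
</code></pre>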
<p>Tech stack pieces:</p>
<ul>
<li><strong>Next.js + TypeScript</strong> for UI and API routes</li>
<li><strong>Supabase (Postgres)</strong> for storage + full-text search + RLS</li>
<li><strong>Stripe</strong> for billing (alerts/subscriptions)</li>
</ul>
<h2 id="heading-deduplication-the-more-jobs-trap">Deduplication: the “more jobs” trap</h2>
<p>Job growth increases volume, but it also increases duplicates. A single PMHNP opening can appear on:</p>
<ul>
<li>the employer site</li>
<li>an ATS mirror</li>
<li>2–5 job boards</li>
<li>a staffing listing with rewritten text</li>
</ul>
<p>If we don’t dedupe, users think there are more unique opportunities than there are—and alerts become spam.</p>
<p>We generate a fingerprint using a mix of deterministic and fuzzy signals:</p>
<ul>
<li>normalized employer name</li>
<li>canonicalized location (lat/lng + radius buckets)</li>
<li>role taxonomy (PMHNP vs “Psych NP” variants)</li>
<li>compensation overlap (when present)</li>
<li>text similarity on responsibilities and requirements</li>
</ul>
<p>Pseudo-code sketch:</p>
<pre><code class="lang-ts"><span class="hljs-keyword">type</span> Job = {
  title: <span class="hljs-built_in">string</span>
  employer: <span class="hljs-built_in">string</span>
  city?: <span class="hljs-built_in">string</span>
  state?: <span class="hljs-built_in">string</span>
  lat?: <span class="hljs-built_in">number</span>
  lng?: <span class="hljs-built_in">number</span>
  description: <span class="hljs-built_in">string</span>
  salaryMin?: <span class="hljs-built_in">number</span>
  salaryMax?: <span class="hljs-built_in">number</span>
}

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">fingerprint</span>(<span class="hljs-params">job: Job</span>) </span>{
  <span class="hljs-keyword">return</span> hash([
    normalizeEmployer(job.employer),
    normalizeRole(job.title),
    geoBucket(job.lat, job.lng, <span class="hljs-number">0.25</span>),
    salaryBucket(job.salaryMin, job.salaryMax),
    simhash(normalizeText(job.description))
  ].join(<span class="hljs-string">'|'</span>))
}
</code></pre>
<p>This is what turns macro growth into a trustworthy count of <em>verified</em> jobs.</p>
<h2 id="heading-salary-data-why-139k155k-is-a-normalization-problem">Salary data: why “$139K–$155K” is a normalization problem</h2>
<p>Clinically, people want to know pay. Technically, pay is one of the messiest fields we ingest.</p>
<p>Common failure modes:</p>
<ul>
<li>hourly rates without hours/week</li>
<li>“$120k–$250k” ranges that include productivity/bonus but aren’t labeled</li>
<li>sign-on bonuses mixed into base</li>
<li>“per diem” roles listed as annual</li>
<li>DNP vs MSN differentials inconsistently stated</li>
</ul>
<p>So we normalize salaries into an annualized range with metadata:</p>
<ul>
<li><code>pay_type</code> (hourly/annual/unknown)</li>
<li><code>annual_min</code>, <code>annual_max</code></li>
<li><code>confidence_score</code></li>
<li><code>includes_bonus</code> (best-effort)</li>
</ul>
<p>Example normalization logic:</p>
<pre><code class="lang-ts"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">annualize</span>(<span class="hljs-params">{ payType, min, max, hoursPerWeek = 40 }: <span class="hljs-built_in">any</span></span>) </span>{
  <span class="hljs-keyword">if</span> (payType === <span class="hljs-string">'hourly'</span>) {
    <span class="hljs-keyword">return</span> {
      annualMin: min * hoursPerWeek * <span class="hljs-number">52</span>,
      annualMax: max * hoursPerWeek * <span class="hljs-number">52</span>,
    }
  }
  <span class="hljs-keyword">return</span> { annualMin: min, annualMax: max }
}
</code></pre>
<p>That’s how we can talk about national ranges (like ~$139K–$155K common bands, entry-level around ~$126K) while still being honest about variance and data quality.</p>
<h2 id="heading-what-the-growth-signal-means-for-career-confidence-as-data">What the growth signal means for “career confidence” (as data)</h2>
<p>The BLS projection improves the odds that you’ll find opportunities—but the <em>quality</em> of those opportunities depends on factors you can filter for:</p>
<ul>
<li>setting (outpatient, community mental health, integrated care, telepsych)</li>
<li>onboarding support (mentorship/supervision signals)</li>
<li>realistic ramp + admin time</li>
<li>credentialing speed</li>
</ul>
<p>From a product standpoint, this is why we invest in structured fields extracted from unstructured text. A fast offer can correlate with staffing pressure; we try to surface the context so users can choose strategically.</p>
<p>In other words: the 35% growth projection is the headline. The system work is turning it into a search experience where you can reliably answer, <strong>“Where are the real roles, and which ones are built to support me once I’m hired?”</strong></p>
]]></content:encoded></item><item><title><![CDATA[How We Counted 693 Live PMHNP Openings in California (and Why Volume ≠ Fit)]]></title><description><![CDATA[California shows 693 PMHNP openings in our index. The interesting part isn’t the number—it’s how you get a trustworthy count from messy job feeds, and what the distribution says about metros, settings, and competition.
California State Spotlight: 693...]]></description><link>https://blog.dvskr.dev/how-we-counted-693-live-pmhnp-openings-in-california-and-why-volume-fit</link><guid isPermaLink="true">https://blog.dvskr.dev/how-we-counted-693-live-pmhnp-openings-in-california-and-why-volume-fit</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[healthcare tech]]></category><category><![CDATA[Next.js]]></category><category><![CDATA[supabase]]></category><category><![CDATA[TypeScript]]></category><category><![CDATA[webdev]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Fri, 27 Feb 2026 16:00:17 GMT</pubDate><content:encoded><![CDATA[<p>California shows 693 PMHNP openings in our index. The interesting part isn’t the number—it’s how you get a trustworthy count from messy job feeds, and what the distribution says about metros, settings, and competition.</p>
<h1 id="heading-california-state-spotlight-693-openings-highest-volumeheres-what-the-data-is-really-saying">California State Spotlight: 693 openings, highest volume—here’s what the data is really saying</h1>
<p>California is the highest-volume PMHNP market in our dataset right now: <strong>693 verified openings</strong>. If you’re building a job aggregator, that number isn’t a headline—it’s a stress test.</p>
<p>“693” only matters if it’s <strong>current</strong>, <strong>deduplicated</strong>, and <strong>geographically correct</strong>. Job boards repost. Health systems syndicate. Staffing firms clone. Locations get written as “Bay Area” or “Remote (CA)” or “Various Locations.” Salary ranges show up as hourly rates, annual figures, or not at all.</p>
<p>This post is a technical look at how we surface California’s volume on PMHNP Hiring (Next.js + TypeScript + Supabase), and why volume doesn’t automatically mean fit.</p>
<blockquote>
<p>If you want to browse what’s live right now, this is the production view: https://pmhnphiring.com/jobs/state/california</p>
</blockquote>
<hr />
<h2 id="heading-why-ca-leads-in-openings-and-what-leads-means-in-a-pipeline">Why CA leads in openings (and what “leads” means in a pipeline)</h2>
<p>California’s lead is driven by three factors we can observe directly in the ingestion layer:</p>
<ol>
<li><strong>Sheer employer surface area</strong>: large systems + multi-site outpatient groups generate continuous posting churn.</li>
<li><strong>Broad location graph</strong>: dense metros, fast-growing suburbs, and rural shortage zones produce postings across many counties.</li>
<li><strong>High repost velocity</strong>: CA roles are more likely to be syndicated across multiple sources, which inflates raw counts.</li>
</ol>
<p>That third point is where data engineering matters. If we naïvely counted every scraped URL, CA would look even bigger—but it would be wrong.</p>
<p>At a high level, our daily pipeline looks like:</p>
<ul>
<li><strong>Ingest</strong> from 500+ sources (ATS pages, job boards, employer sites)</li>
<li><strong>Normalize</strong> fields (title, employer, location, compensation)</li>
<li><strong>Deduplicate</strong> across syndication</li>
<li><strong>Verify freshness</strong> (remove stale/reposted listings)</li>
<li><strong>Geocode &amp; tag</strong> (state, metro, setting signals)</li>
<li><strong>Serve</strong> via fast filters + alerts</li>
</ul>
<p>California just happens to be where every one of those steps gets exercised at scale.</p>
<hr />
<h2 id="heading-counting-693-dedupe-freshness-are-the-whole-story">Counting “693”: dedupe + freshness are the whole story</h2>
<p>The hardest part of “state spotlights” is making sure the count represents <strong>distinct, open roles</strong>.</p>
<h3 id="heading-1-deduplication-across-sources">1) Deduplication across sources</h3>
<p>A single PMHNP role can appear:</p>
<ul>
<li>on an employer’s ATS</li>
<li>on 3–10 job boards</li>
<li>reposted weekly with a new URL</li>
</ul>
<p>We dedupe by generating a stable fingerprint from normalized fields. The exact recipe evolves, but conceptually:</p>
<pre><code class="lang-ts"><span class="hljs-comment">// simplified</span>
<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">fingerprint</span>(<span class="hljs-params">job: NormalizedJob</span>) </span>{
  <span class="hljs-keyword">return</span> hash([
    normalizeEmployer(job.employerName),
    normalizeTitle(job.title),
    normalizeLocation(job.location), <span class="hljs-comment">// city/state if present</span>
    normalizeReq(job.requirementsText ?? <span class="hljs-string">""</span>),
  ].join(<span class="hljs-string">"|"</span>))
}
</code></pre>
<p>Then we cluster near-duplicates (minor title differences, “Psych NP” vs “PMHNP”) using similarity thresholds.</p>
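<p>The similarity thresholds can start as simple token-set overlap. A sketch using Jaccard similarity over word tokens (the production clustering mixes in more signals, like employer and location):</p>
<pre><code class="lang-ts">// Jaccard similarity over lowercase word tokens: 1.0 means identical sets.
function tokenJaccard(a: string, b: string): number {
  const tok = (s: string) => new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const ta = tok(a);
  const tb = tok(b);
  let shared = 0;
  ta.forEach((t) => { if (tb.has(t)) shared += 1; });
  const unionSize = ta.size + tb.size - shared;
  return unionSize === 0 ? 1 : shared / unionSize;
}
</code></pre>
<p>Pairs above a threshold (say 0.8 on title plus a matching employer) collapse into one cluster. Token overlap alone misses synonyms, which is why the “Psych NP” vs “PMHNP” case still needs the role taxonomy.</p>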
<h3 id="heading-2-freshness-live-vs-stale-repost">2) Freshness: “live” vs “stale repost”</h3>
<p>High-volume states are repost-heavy. To keep the CA page useful, we track signals like:</p>
<ul>
<li><code>last_seen_at</code> (when a crawler last confirmed it exists)</li>
<li><code>source_updated_at</code> (if the source exposes it)</li>
<li>closed/404 signals</li>
</ul>
<p>A job can be popular and still sit open for months—but it needs to be <strong>verifiably available</strong>.</p>
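<p>The liveness decision then reduces to a small predicate over those signals (the 7-day window here is an illustrative default, not our production threshold):</p>
<pre><code class="lang-ts">type FreshnessSignals = {
  lastSeenAt: string;     // ISO timestamp of the last crawler confirmation
  sourceClosed: boolean;  // source returned 404 or marked the role filled
};

// Live means: not closed at the source, and confirmed recently enough.
function isVerifiablyLive(s: FreshnessSignals, now: Date, maxAgeDays: number = 7): boolean {
  if (s.sourceClosed) return false;
  const ageMs = now.getTime() - new Date(s.lastSeenAt).getTime();
  return ageMs >= 0 &amp;&amp; maxAgeDays * 86_400_000 >= ageMs;  // 86.4M ms per day
}
</code></pre>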
<hr />
<h2 id="heading-where-the-ca-jobs-are-geo-resolution-beats-bay-area-strings">Where the CA jobs are: geo resolution beats “Bay Area” strings</h2>
<p>California hiring isn’t evenly distributed, and you can’t analyze distribution if locations are sloppy.</p>
<h3 id="heading-location-normalization-challenges-we-see-in-ca">Location normalization challenges we see in CA</h3>
<ul>
<li>“Los Angeles, CA” (easy)</li>
<li>“San Francisco Bay Area” (needs mapping)</li>
<li>“Remote in California” (state-only)</li>
<li>“Multiple Locations” (often a multi-site group)</li>
</ul>
<p>We resolve locations into a consistent shape:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"state"</span>: <span class="hljs-string">"CA"</span>,
  <span class="hljs-attr">"city"</span>: <span class="hljs-string">"San Diego"</span>,
  <span class="hljs-attr">"metro"</span>: <span class="hljs-string">"San Diego-Chula Vista-Carlsbad, CA"</span>,
  <span class="hljs-attr">"is_remote"</span>: <span class="hljs-literal">false</span>,
  <span class="hljs-attr">"lat"</span>: <span class="hljs-number">32.7157</span>,
  <span class="hljs-attr">"lng"</span>: <span class="hljs-number">-117.1611</span>
}
</code></pre>
<p>When we can’t confidently infer city/metro, we still keep the job (state-level filtering matters), but we avoid over-claiming precision in metro counts.</p>
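<p>To make that concrete, here is a toy resolver over the example strings above; the matching rules and metro label are illustrative, not the real gazetteer:</p>

```typescript
// Toy location resolver for the CA strings listed above.
type ResolvedLocation = { state: "CA"; city?: string; metro?: string; is_remote: boolean };

function resolveCaLocation(raw: string): ResolvedLocation {
  const s = raw.toLowerCase();
  if (s.includes("remote")) return { state: "CA", is_remote: true }; // state-only
  if (s.includes("bay area")) {
    return { state: "CA", metro: "San Francisco-Oakland-Berkeley, CA", is_remote: false };
  }
  const m = raw.match(/^([^,]+),\s*CA$/i); // "City, CA" is the easy case
  if (m) return { state: "CA", city: m[1].trim(), is_remote: false };
  // "Multiple Locations" etc.: keep the job at state level, claim no city precision
  return { state: "CA", is_remote: false };
}
```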
<h3 id="heading-what-the-distribution-usually-looks-like">What the distribution usually looks like</h3>
<ul>
<li><strong>Big metros</strong>: widest mix (outpatient, inpatient, specialty)</li>
<li><strong>Rural/semi-rural</strong>: fewer postings, often higher urgency and narrower candidate pools</li>
</ul>
<p>Outpatient dominates the CA index—clinic medication management, integrated care, and community mental health appear frequently—because those orgs post continuously and across many sites.</p>
<hr />
<h2 id="heading-salary-cost-of-living-normalization-before-conclusions">Salary + cost of living: normalization before conclusions</h2>
<p>California is commonly high-paying, but salary data is messy:</p>
<ul>
<li>hourly vs annual</li>
<li>wide ranges (“$160k–$240k”) vs single numbers</li>
<li>missing compensation (common)</li>
</ul>
<p>We normalize to a comparable annualized range when possible:</p>
<pre><code class="lang-ts"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">annualize</span>(<span class="hljs-params">amount: <span class="hljs-built_in">number</span>, unit: "hour"|"year"</span>) </span>{
  <span class="hljs-keyword">if</span> (unit === <span class="hljs-string">"year"</span>) <span class="hljs-keyword">return</span> amount
  <span class="hljs-keyword">return</span> amount * <span class="hljs-number">40</span> * <span class="hljs-number">52</span>
}
</code></pre>
<p>Then we store both the raw and normalized values so the UI can explain what it’s showing.</p>
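<p>A sketch of what “raw plus normalized” can look like in one record, reusing the same 40×52 assumption as <code>annualize</code> (field names here are illustrative):</p>

```typescript
// Keep the scraped string alongside the normalized annual range so the UI
// can show its work. Field names are assumptions for illustration.
type CompRecord = {
  raw_text: string;      // e.g. "$75–$90/hr" exactly as scraped
  min_annual?: number;
  max_annual?: number;
};

function normalizeComp(rawText: string, min: number, max: number, unit: "hour" | "year"): CompRecord {
  const factor = unit === "hour" ? 40 * 52 : 1; // same assumption as annualize()
  return { raw_text: rawText, min_annual: min * factor, max_annual: max * factor };
}
```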
<p>Cost of living (COL) is not a single number either; it’s region-specific. The product approach we take is: show salary when available, but keep filtering centered on <strong>role constraints</strong> (onsite vs hybrid, setting, call, population) because those are consistently present in the data.</p>
<hr />
<h2 id="heading-volume-fit-what-to-infer-from-693-openings">Volume ≠ fit: what to infer from 693 openings</h2>
<p>From a builder’s point of view, “highest volume” usually means:</p>
<ul>
<li><strong>more duplicates to crush</strong></li>
<li><strong>more employer posting patterns</strong> (waves, evergreen roles)</li>
<li><strong>more variance in requirements</strong> (onsite-only, specific populations, credentialing timelines)</li>
</ul>
<p>For job seekers, that translates to: some CA postings close fast because applicant volume is high; others linger because constraints are tight.</p>
<p>If you’re exploring CA, use the live index to filter down to the jobs that match your constraints instead of optimizing for the raw count:</p>
<ul>
<li>region/commute reality</li>
<li>outpatient vs inpatient</li>
<li>remote/hybrid flags</li>
<li>salary (when present)</li>
</ul>
<p>California leads the country in openings—but the real win is turning that noisy volume into a clean, searchable set of roles you can actually act on.</p>
]]></content:encoded></item><item><title><![CDATA[How We Detect “PMHNP-BC Required” in 500+ Job Feeds (and What the Credential Actually Means)]]></title><description><![CDATA[In our pipeline, “PMHNP-BC required” isn’t a nice-to-have string. It’s a high-signal field that determines whether a job matches a clinician at all—so we treat it like structured data, not copy.
How We Detect “PMHNP-BC Required” in 500+ Job Feeds (an...]]></description><link>https://blog.dvskr.dev/how-we-detect-pmhnp-bc-required-in-500-job-feeds-and-what-the-credential-actually-means</link><guid isPermaLink="true">https://blog.dvskr.dev/how-we-detect-pmhnp-bc-required-in-500-job-feeds-and-what-the-credential-actually-means</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[healthcare tech]]></category><category><![CDATA[Next.js]]></category><category><![CDATA[supabase]]></category><category><![CDATA[TypeScript]]></category><category><![CDATA[webdev]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Thu, 26 Feb 2026 16:00:39 GMT</pubDate><content:encoded><![CDATA[<p>In our pipeline, “PMHNP-BC required” isn’t a nice-to-have string. It’s a high-signal field that determines whether a job matches a clinician at all—so we treat it like structured data, not copy.</p>
<h1 id="heading-how-we-detect-pmhnp-bc-required-in-500-job-feeds-and-what-the-credential-actually-means">How We Detect “PMHNP-BC Required” in 500+ Job Feeds (and What the Credential Actually Means)</h1>
<p>PMHNP-BC shows up in job posts so often it can read like alphabet soup. But in most hiring pipelines it’s a gatekeeper credential: if the posting requires it, your application often won’t make it past an ATS rule or a credentialing checkpoint.</p>
<p>From a product/builder’s angle, that makes <strong>PMHNP-BC</strong> a piece of data we have to extract reliably. If we misread it (a false positive or a false negative), we either:</p>
<ul>
<li>show you jobs you can’t actually credential into, or</li>
<li>hide jobs you’re qualified for.</li>
</ul>
<p>PMHNP Hiring aggregates <strong>500+ sources daily</strong> and maintains <strong>10,000+ verified PMHNP jobs across 50 states</strong>. Credential requirements are one of the highest-impact fields we normalize because they drive real-time filtering, alerts, and matching.</p>
<p>You can see how frequently it appears by scanning current listings on https://pmhnphiring.com/jobs.</p>
<hr />
<h2 id="heading-pmhnp-bc-meaning-as-data-what-the-credential-actually-stands-for">PMHNP-BC meaning (as data): what the credential actually stands for</h2>
<p><strong>PMHNP-BC</strong> stands for <strong>Psychiatric–Mental Health Nurse Practitioner – Board Certified</strong>.</p>
<p>In practice, when employers say “PMHNP-BC required,” they’re almost always referring to <strong>ANCC board certification</strong> for the PMHNP population focus.</p>
<p>Why this matters operationally:</p>
<ul>
<li>Employers aren’t hiring “an NP who does psych.” They’re hiring someone whose education, clinical hours, and board exam align to psychiatric-mental health scope.</li>
<li>The “BC” is HR/credentialing shorthand for a standardized, third-party-verifiable credential.</li>
<li>Many postings are flexible on schedule, setting, and sometimes experience. They’re usually not flexible on board certification because it touches <strong>credentialing, payer enrollment, and risk management</strong>.</li>
</ul>
<p>So we don’t treat “PMHNP-BC” as marketing copy. We treat it like a <strong>structured constraint</strong>.</p>
<hr />
<h2 id="heading-why-most-employers-require-ancc-certification-and-how-that-shows-up-in-job-data">Why most employers require ANCC certification (and how that shows up in job data)</h2>
<p>In a typical hiring flow, a job goes through multiple checkpoints:</p>
<ol>
<li>Recruiter/ATS intake</li>
<li>Clinical leadership review</li>
<li>Credentialing</li>
<li>Payer enrollment</li>
</ol>
<p>ANCC board certification is a clean way for those systems and teams to stay aligned. From the data side, it’s also one of the few credentials that appears consistently across job posts, PDFs, and ATS templates.</p>
<p>That consistency is why it becomes a filterable requirement on our end.</p>
<h3 id="heading-the-problem-job-posts-mention-it-in-messy-ways">The problem: job posts mention it in messy ways</h3>
<p>Across sources, we see variants like:</p>
<ul>
<li><code>PMHNP-BC required</code></li>
<li><code>PMHNP BC</code></li>
<li><code>Board Certified PMHNP</code></li>
<li><code>ANCC certification required</code></li>
<li><code>Psych NP (BC)</code></li>
<li><code>Must be board certified in psychiatry</code></li>
</ul>
<p>And we also see confusing near-misses:</p>
<ul>
<li>“Psych NP preferred” (not necessarily a hard requirement)</li>
<li>“BC/BE” (board certified / board eligible — more common in physician postings, but it leaks into mixed templates)</li>
<li>“Active license required” (license ≠ board certification)</li>
</ul>
<p>So the engineering job is: <strong>turn noisy text into a reliable field</strong>.</p>
<hr />
<h2 id="heading-our-extraction-approach-from-raw-text-to-a-normalized-requirement">Our extraction approach: from raw text to a normalized requirement</h2>
<p>At ingestion time we store the raw posting, then produce a normalized record used for search/matching.</p>
<p>A simplified TypeScript shape looks like:</p>
<pre><code class="lang-ts"><span class="hljs-keyword">export</span> <span class="hljs-keyword">type</span> BoardCert =
  | { required: <span class="hljs-literal">true</span>; body: <span class="hljs-string">"ANCC"</span>; credential: <span class="hljs-string">"PMHNP-BC"</span> }
  | { required: <span class="hljs-literal">false</span>; body?: <span class="hljs-string">"ANCC"</span>; credential?: <span class="hljs-string">"PMHNP-BC"</span> }
  | { required: <span class="hljs-literal">null</span> };

<span class="hljs-keyword">export</span> <span class="hljs-keyword">interface</span> NormalizedJob {
  id: <span class="hljs-built_in">string</span>;
  title: <span class="hljs-built_in">string</span>;
  description: <span class="hljs-built_in">string</span>;
  requirements_text: <span class="hljs-built_in">string</span>;
  board_cert: BoardCert;
  <span class="hljs-comment">// ...salary, location, setting, etc.</span>
}
</code></pre>
<h3 id="heading-step-1-pattern-detection-with-guardrails">Step 1: pattern detection with guardrails</h3>
<p>We start with deterministic signals (regex + keyword proximity) because they’re explainable and easy to debug.</p>
<pre><code class="lang-ts"><span class="hljs-keyword">const</span> PMHNP_BC_PATTERNS = [
  <span class="hljs-regexp">/\bPMHNP\s*[- ]?BC\b/i</span>,
  <span class="hljs-regexp">/\bPsychiatric\s*-?Mental\s*Health\s*Nurse\s*Practitioner\s*-?\s*Board\s*Certified\b/i</span>,
  <span class="hljs-regexp">/\bANCC\b.*\b(PMHNP|Psych)\b/i</span>,
  <span class="hljs-regexp">/\bboard\s*certif(ied|ication)\b.*\bPMHNP\b/i</span>,
];

<span class="hljs-keyword">const</span> REQUIREMENT_CUES = [<span class="hljs-regexp">/\brequired\b/i</span>, <span class="hljs-regexp">/\bmust\b/i</span>, <span class="hljs-regexp">/\bmandatory\b/i</span>];
<span class="hljs-keyword">const</span> SOFT_CUES = [<span class="hljs-regexp">/\bpreferred\b/i</span>, <span class="hljs-regexp">/\bplus\b/i</span>, <span class="hljs-regexp">/\bdesired\b/i</span>];

<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">extractBoardCert</span>(<span class="hljs-params">text: <span class="hljs-built_in">string</span></span>): <span class="hljs-title">BoardCert</span> </span>{
  <span class="hljs-keyword">const</span> hasCredential = PMHNP_BC_PATTERNS.some(<span class="hljs-function">(<span class="hljs-params">re</span>) =&gt;</span> re.test(text));
  <span class="hljs-keyword">if</span> (!hasCredential) <span class="hljs-keyword">return</span> { required: <span class="hljs-literal">null</span> };

  <span class="hljs-keyword">const</span> isRequired = REQUIREMENT_CUES.some(<span class="hljs-function">(<span class="hljs-params">re</span>) =&gt;</span> re.test(text)) &amp;&amp;
    !SOFT_CUES.some(<span class="hljs-function">(<span class="hljs-params">re</span>) =&gt;</span> re.test(text));

  <span class="hljs-keyword">if</span> (isRequired) {
    <span class="hljs-keyword">return</span> { required: <span class="hljs-literal">true</span>, body: <span class="hljs-string">"ANCC"</span>, credential: <span class="hljs-string">"PMHNP-BC"</span> };
  }
  <span class="hljs-comment">// mentioned but not clearly required (or only "preferred")</span>
  <span class="hljs-keyword">return</span> { required: <span class="hljs-literal">false</span>, body: <span class="hljs-string">"ANCC"</span>, credential: <span class="hljs-string">"PMHNP-BC"</span> };
}
</code></pre>
<p>We bias toward <strong>not marking it “required”</strong> unless the copy is explicit. False “required” flags are worse than missing a “preferred” mention because they filter out jobs.</p>
<h3 id="heading-step-2-dedup-canonicalization-across-sources">Step 2: dedup + canonicalization across sources</h3>
<p>The same job often appears on multiple boards with different formatting. Our dedup pipeline clusters postings and merges fields. For credentials, we keep:</p>
<ul>
<li>the most explicit requirement statement (source-of-truth ranking), and</li>
<li>a trace back to the raw text snippet that triggered it.</li>
</ul>
<p>That trace matters when users ask “why did this job get filtered out?” Debuggability is a product feature.</p>
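<p>One way to keep that trace, sketched with an assumed <code>captureTrace</code> helper that stores the matched snippet plus surrounding context (not the production code):</p>

```typescript
// Keep the snippet that triggered a credential flag, with some surrounding
// context, so "why was this job filtered?" is answerable later.
type ExtractionTrace = { pattern: string; snippet: string };

function captureTrace(text: string, re: RegExp, context = 40): ExtractionTrace | null {
  const m = re.exec(text);
  if (!m) return null;
  const start = Math.max(0, m.index - context);
  const end = Math.min(text.length, m.index + m[0].length + context);
  return { pattern: re.source, snippet: text.slice(start, end) };
}
```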
<hr />
<h2 id="heading-how-we-surface-it-filters-matching-and-alerts">How we surface it: filters, matching, and alerts</h2>
<p>Once normalized, <code>board_cert.required === true</code> becomes a first-class filter in the app.</p>
<ul>
<li><strong>Real-time filtering (Next.js)</strong>: users can hide “PMHNP-BC required” jobs if they’re still in school or board-pending.</li>
<li><strong>Custom matching</strong>: if a user profile says “ANCC PMHNP-BC: yes,” those jobs rank higher.</li>
<li><strong>Alerts</strong>: new jobs that flip from “preferred” to “required” (or vice versa) can trigger notifications, because it changes eligibility.</li>
</ul>
<p>This is the builder’s takeaway: “PMHNP-BC” isn’t just a career acronym—it’s a constraint that must survive scraping, parsing, deduping, and ranking.</p>
<hr />
<h2 id="heading-what-pmhnp-bc-signals-and-what-it-doesnt">What PMHNP-BC signals (and what it doesn’t)</h2>
<p>What it signals in job data:</p>
<ul>
<li>The employer expects <strong>board certification for psychiatric-mental health NP scope</strong> (typically ANCC).</li>
<li>The job likely flows into credentialing/payer systems that will verify it.</li>
</ul>
<p>What it doesn’t guarantee:</p>
<ul>
<li>salary level (we still have to normalize comp across hourly/annual/RVU ranges)</li>
<li>autonomy level or supervision model</li>
<li>call burden or patient acuity</li>
</ul>
<p>Those require different extraction pipelines.</p>
<hr />
<h2 id="heading-if-youre-building-job-search-tooling-treat-credentials-like-schema">If you’re building job search tooling, treat credentials like schema</h2>
<p>A lot of job aggregators treat credentials as plain text. In healthcare hiring, credentials behave more like <strong>typed fields</strong> with strict semantics.</p>
<p>For PMHNP Hiring, “PMHNP-BC required” is one of the simplest examples of why: it changes who the job is for.</p>
<p>If you want to explore how often it appears, browse live listings here: https://pmhnphiring.com/jobs.</p>
]]></content:encoded></item><item><title><![CDATA[How We Measure the DNP vs MSN Pay Delta for PMHNP Jobs (and Turn It Into ROI Math)]]></title><description><![CDATA[The “DNP earns $10–20K more than MSN” claim is directionally true in our job dataset—but only after you normalize salary formats, dedupe reposts, and separate degree signals from everything else employers pay for.
How We Measure the DNP vs MSN Pay De...]]></description><link>https://blog.dvskr.dev/how-we-measure-the-dnp-vs-msn-pay-delta-for-pmhnp-jobs-and-turn-it-into-roi-math</link><guid isPermaLink="true">https://blog.dvskr.dev/how-we-measure-the-dnp-vs-msn-pay-delta-for-pmhnp-jobs-and-turn-it-into-roi-math</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[healthcare tech]]></category><category><![CDATA[Next.js]]></category><category><![CDATA[supabase]]></category><category><![CDATA[TypeScript]]></category><category><![CDATA[webdev]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Wed, 25 Feb 2026 16:00:53 GMT</pubDate><content:encoded><![CDATA[<p>The “DNP earns $10–20K more than MSN” claim is directionally true in our job dataset—but only after you normalize salary formats, dedupe reposts, and separate degree signals from everything else employers pay for.</p>
<h1 id="heading-how-we-measure-the-dnp-vs-msn-pay-delta-for-pmhnp-jobs-and-turn-it-into-roi-math">How We Measure the DNP vs MSN Pay Delta for PMHNP Jobs (and Turn It Into ROI Math)</h1>
<p>The DNP vs MSN question usually collapses into one number: <strong>is an extra ~$10–20K/year worth more school?</strong></p>
<p>From a product/data engineering angle, that number is not something you “look up.” It’s something you <strong>derive</strong>—from messy job postings, inconsistent compensation fields, duplicate listings, and fuzzy degree requirements.</p>
<p>PMHNP Hiring aggregates from <strong>500+ job sources daily</strong> and maintains <strong>10,000+ verified PMHNP jobs across all 50 states</strong>. Here’s how we turn raw postings into (1) a defensible pay delta and (2) an honest break-even model you can use.</p>
<hr />
<h2 id="heading-1-the-data-problem-job-postings-arent-a-salary-table">1) The data problem: job postings aren’t a salary table</h2>
<p>A PMHNP posting might say:</p>
<ul>
<li>“$65–$80/hr” (hourly)</li>
<li>“$130k base + bonus” (mixed)</li>
<li>“Up to $180k” (ceiling only)</li>
<li>“Competitive” (no number)</li>
<li>“MSN required; DNP preferred” (ambiguous degree signal)</li>
</ul>
<p>If you just average these strings, you’ll get nonsense.</p>
<h3 id="heading-our-pipeline-high-level">Our pipeline (high level)</h3>
<p><strong>Ingest → Parse → Normalize → Deduplicate → Enrich → Serve</strong></p>
<ul>
<li><strong>Ingest</strong>: scheduled collectors pull from job boards, health system career pages, ATS feeds, and smaller niche sites.</li>
<li><strong>Parse</strong>: extract compensation text + structured hints (interval, min/max, currency), plus requirements (degree, licensure, telehealth, etc.).</li>
<li><strong>Normalize</strong>: convert hourly/monthly to annual, handle ranges, and standardize to a comparable “annualized base” field.</li>
<li><strong>Deduplicate</strong>: collapse reposts across sources (same job syndicated to 5 boards) so one employer doesn’t overweight the stats.</li>
<li><strong>Enrich</strong>: geocode locations, tag setting (hospital/outpatient/telehealth), detect pay bands vs free-text.</li>
<li><strong>Serve</strong>: Next.js + TypeScript API routes query Supabase with filters and return real-time results.</li>
</ul>
<hr />
<h2 id="heading-2-salary-normalization-turning-75hr-into-a-comparable-annual-number">2) Salary normalization: turning “$75/hr” into a comparable annual number</h2>
<p>A big reason the DNP vs MSN delta looks noisy is that job posts mix comp intervals.</p>
<p>We normalize to an annual estimate with explicit assumptions:</p>
<ul>
<li>hourly → annual = <code>hourly * 40 * 52</code></li>
<li>daily/weekly/monthly similarly</li>
<li>ranges → we store <code>min_annual</code>, <code>max_annual</code>, and a <code>midpoint_annual</code></li>
</ul>
<p>Example (TypeScript-ish pseudocode):</p>
<pre><code class="lang-ts"><span class="hljs-keyword">type</span> Comp = { min?: <span class="hljs-built_in">number</span>; max?: <span class="hljs-built_in">number</span>; interval: <span class="hljs-string">'hour'</span>|<span class="hljs-string">'year'</span> };

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">annualize</span>(<span class="hljs-params">comp: Comp</span>) </span>{
  <span class="hljs-keyword">const</span> factor = comp.interval === <span class="hljs-string">'hour'</span> ? <span class="hljs-number">40</span> * <span class="hljs-number">52</span> : <span class="hljs-number">1</span>;
  <span class="hljs-keyword">const</span> min = comp.min ? comp.min * factor : <span class="hljs-literal">undefined</span>;
  <span class="hljs-keyword">const</span> max = comp.max ? comp.max * factor : <span class="hljs-literal">undefined</span>;
  <span class="hljs-keyword">const</span> midpoint = min &amp;&amp; max ? (min + max) / <span class="hljs-number">2</span> : min ?? max;
  <span class="hljs-keyword">return</span> { minAnnual: min, maxAnnual: max, midpointAnnual: midpoint };
}
</code></pre>
<p>We also track a <strong>confidence score</strong> (e.g., explicit range vs “up to”) so we can filter analyses to “high-confidence comp only” when computing salary deltas.</p>
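<p>A toy version of that scoring idea (the exact scores and patterns here are illustrative, not the production heuristics):</p>

```typescript
// Toy confidence scoring for compensation text: explicit ranges beat
// "up to" ceilings, which beat bare single numbers. Scores are illustrative.
function compConfidence(text: string): number {
  if (/\$[\d,.]+k?\s*[–-]\s*\$?[\d,.]+k?/.test(text)) return 0.9; // explicit range
  if (/up to/i.test(text)) return 0.5;                            // ceiling only
  if (/\$[\d,.]+/.test(text)) return 0.7;                         // single number
  return 0;                                                       // "competitive", no number
}
```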
<hr />
<h2 id="heading-3-degree-detection-required-vs-preferred-matters">3) Degree detection: “required” vs “preferred” matters</h2>
<p>For the DNP/MSN comparison, we classify degree language into buckets:</p>
<ul>
<li><code>MSN_required</code></li>
<li><code>DNP_required</code></li>
<li><code>DNP_preferred</code></li>
<li><code>degree_unspecified</code></li>
</ul>
<p>This is mostly rules + targeted patterns (not a vague “AI summary”). Why? Because “DNP preferred” frequently correlates with higher-paying org types (large systems) without being the direct cause of higher pay.</p>
<p>A simplified extraction sketch:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- example: classify degree requirement from extracted text</span>
case
  when req_text ilike '%dnp%required%' then 'DNP_required'
  when req_text ilike '%msn%required%' then 'MSN_required'
  when req_text ilike '%dnp%preferred%' then 'DNP_preferred'
  else 'degree_unspecified'
<span class="hljs-keyword">end</span> <span class="hljs-keyword">as</span> degree_bucket
</code></pre>
<p>This lets us compute apples-to-apples comparisons like:</p>
<ul>
<li>same state</li>
<li>same setting (telehealth vs outpatient vs hospital)</li>
<li>similar experience requirements</li>
<li>high-confidence salary only</li>
</ul>
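<p>With those slices fixed, the delta itself is a small computation. A sketch, assuming rows already filtered to one matched cohort (types and names are illustrative):</p>

```typescript
// Median annualized midpoint per degree bucket, within one already-matched
// cohort (same state, setting, experience band, high-confidence comp).
type JobRow = { degreeBucket: string; midpointAnnual: number };

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// DNP-required median minus MSN-required median for the cohort.
function degreeDelta(rows: JobRow[]): number {
  const pick = (bucket: string) =>
    rows.filter((r) => r.degreeBucket === bucket).map((r) => r.midpointAnnual);
  return median(pick("DNP_required")) - median(pick("MSN_required"));
}
```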
<hr />
<h2 id="heading-4-what-the-data-shows-the-1020k-delta-is-real-but-conditional">4) What the data shows: the ~$10–20K delta is real, but conditional</h2>
<p>After normalization + dedupe + filtering to postings with usable comp data, we repeatedly see a <strong>DNP-to-MSN pay delta around ~$10–20K/year</strong>.</p>
<p>The important caveats (which show up clearly once you slice the data):</p>
<ul>
<li><strong>Pay-band orgs</strong> (health systems, large groups, some FQHCs) more often encode formal degree differentials.</li>
<li><strong>Smaller practices</strong> often pay the same for MSN vs DNP and price more heavily on “can you carry a panel?”</li>
<li><strong>Telehealth-first roles</strong> sometimes pay more overall, but the premium is often tied to productivity, multi-state coverage, or schedule—degree text may be incidental.</li>
</ul>
<p>This is why we expose filters on the jobs page and keep the salary guide separate: one is <strong>real-time market evidence</strong>, the other is <strong>aggregated range context</strong>.</p>
<hr />
<h2 id="heading-5-turning-salary-delta-into-break-even-time-the-roi-calculation">5) Turning salary delta into break-even time (the ROI calculation)</h2>
<p>The cleanest ROI view is: <strong>how long to break even?</strong></p>
<p>Break-even years:</p>
<pre><code class="lang-text">break_even_years = total_cost / annual_salary_lift
</code></pre>
<p>Where <code>total_cost</code> should include:</p>
<ul>
<li>tuition + fees</li>
<li>interest/loan costs</li>
<li><strong>lost income</strong> if you delay full-time work or reduce hours</li>
</ul>
<p>Example:</p>
<ul>
<li>Total cost = $40,000</li>
<li>Salary lift = $12,000/year</li>
</ul>
<p>Break-even ≈ 3.3 years.</p>
<p>But if:</p>
<ul>
<li>Total cost = $70,000</li>
<li>Lift = $10,000/year</li>
</ul>
<p>Break-even = 7 years.</p>
<p>That’s the part many “average bump” discussions miss: <strong>a $15K delta can vanish if the doctorate delays earnings by a year</strong>.</p>
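<p>The formula above as a runnable helper (lost income is folded into <code>totalCost</code>, as described):</p>

```typescript
// Break-even per the formula above. totalCost should already include
// tuition, fees, loan interest, and any lost income from reduced hours.
function breakEvenYears(totalCost: number, annualSalaryLift: number): number {
  if (annualSalaryLift <= 0) return Infinity; // no lift: never breaks even
  return totalCost / annualSalaryLift;
}
```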
<hr />
<h2 id="heading-6-how-we-surface-this-in-the-product">6) How we surface this in the product</h2>
<p>From a UI standpoint, “DNP vs MSN” is just a filter. Under the hood, it’s a chain of data decisions:</p>
<ul>
<li>normalized compensation fields stored in Supabase</li>
<li>deduped job entities (one canonical job, many source URLs)</li>
<li>degree buckets with confidence</li>
<li>location geocoding for state/city slices</li>
<li>real-time query performance so you can compare your market quickly</li>
</ul>
<p>If you want to sanity-check your target area, start with live postings on the main jobs page and then cross-reference the broader ranges in the salary guide:</p>
<ul>
<li>https://pmhnphiring.com/jobs</li>
<li>https://pmhnphiring.com/salary-guide</li>
</ul>
<p>The takeaway isn’t “DNP always wins.” It’s: <strong>the ROI depends on where you plan to work, how the employer prices credentials, and whether the extra schooling changes your time-to-earn.</strong></p>
]]></content:encoded></item><item><title><![CDATA[Why I Add an Outbox Table Instead of “Just Using a Queue”]]></title><description><![CDATA[The Problem
Any SaaS backend hits this moment.
You start with a simple flow: request comes in → write to Postgres → publish an event (email, webhook, analytics, cache invalidation, search indexing). It works in dev. It even works in staging.
Then pro...]]></description><link>https://blog.dvskr.dev/why-i-add-an-outbox-table-instead-of-just-using-a-queue</link><guid isPermaLink="true">https://blog.dvskr.dev/why-i-add-an-outbox-table-instead-of-just-using-a-queue</guid><category><![CDATA[architecture]]></category><category><![CDATA[Databases]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[systemdesign]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Tue, 24 Feb 2026 16:00:52 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-the-problem">The Problem</h2>
<p>Any SaaS backend hits this moment.</p>
<p>You start with a simple flow: request comes in → write to Postgres → publish an event (email, webhook, analytics, cache invalidation, search indexing). It works in dev. It even works in staging.</p>
<p>Then production happens.</p>
<p>A deploy rolls mid-request. The process restarts. Network blips. Kafka (or SQS, or Redis) has a bad minute. Suddenly you’ve got rows committed in Postgres but no event published. Or worse: event published but the DB transaction rolled back, so downstream systems act on data that doesn’t exist.</p>
<p>I wasted two days chasing a bug where customer-facing emails went out for records that never committed. The logs were clean. The code looked “correct.” The failure was architectural.</p>
<p>The core issue: <strong>atomicity across a database write and an external publish doesn’t exist unless you build for it</strong>.</p>
<h2 id="heading-options-i-considered">Options I Considered</h2>
<p>I usually evaluate this decision with one question: <em>What’s the source of truth?</em> In most SaaS backends I’ve built, Postgres is the source of truth. That pushes me toward patterns that treat the DB commit as the only “real” state transition.</p>
<p>Here are the options I’ve used or seriously considered.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Approach</td><td>Pros</td><td>Cons</td><td>Best For</td></tr>
</thead>
<tbody>
<tr>
<td>DB write then publish to queue (in request path)</td><td>Simple mental model. Low latency for consumers.</td><td>Loses events on crash between commit and publish. Can publish events for rolled-back transactions. Retries can duplicate.</td><td>Low-stakes side effects (metrics) where occasional loss is fine.</td></tr>
<tr>
<td>Distributed transaction / 2PC</td><td>True atomicity across systems (on paper).</td><td>Operational pain. Limited support across managed queues. Hard to debug. Adds coupling you’ll regret.</td><td>Rare enterprise setups where you control both ends and can accept complexity.</td></tr>
<tr>
<td>Change Data Capture (CDC) from Postgres WAL</td><td>Clean separation. Events derived from DB changes. Scales well once established.</td><td>Setup cost. Schema evolution complexity. Filtering/transforming events takes real work. Harder local dev.</td><td>Larger teams, high event volume, strict audit requirements.</td></tr>
<tr>
<td><strong>Transactional outbox (DB outbox table + dispatcher)</strong></td><td>DB commit + “event intent” are atomic. Retries are safe. Simple to reason about.</td><td>More tables. More background processing. Tuning + cleanup required.</td><td>Small-to-mid systems where Postgres is the source of truth and you want reliability.</td></tr>
</tbody>
</table>
</div><p>I didn’t pick CDC because I build solo and I don’t want to carry Debezium + Kafka Connect complexity unless the volume forces it.</p>
<p>I didn’t pick 2PC because I’ve lived that life. Debugging partial failures across systems is misery.</p>
<p>So it came down to: accept occasional loss, or implement outbox.</p>
<h2 id="heading-what-i-chose-and-why">What I Chose (and Why)</h2>
<p>I chose the <strong>transactional outbox pattern</strong>.</p>
<p>The decision was mostly about failure modes, not throughput.</p>
<p>Ranked reasons:</p>
<ol>
<li><strong>Atomicity with the DB commit.</strong> The outbox record is written in the same transaction as my business data.</li>
<li><strong>Retries become boring.</strong> If publishing fails, I retry without guessing whether the original commit happened.</li>
<li><strong>Backpressure is controllable.</strong> If downstream is slow, events pile up in Postgres. That’s visible. I can alert on it.</li>
</ol>
<p>What I gave up:</p>
<ul>
<li><strong>Extra moving parts.</strong> I now own a dispatcher loop, concurrency limits, and cleanup.</li>
<li><strong>Slightly higher latency.</strong> Events are typically published within 250ms–2s, not immediately in the request.</li>
<li><strong>Schema overhead.</strong> You’ll add at least one table and a couple indexes.</li>
</ul>
<h3 id="heading-schema">Schema</h3>
<p>This is the minimal schema I’ve landed on after trying a few variations.</p>
<ul>
<li><code>status</code> so I can manage retries.</li>
<li><code>available_at</code> for exponential backoff.</li>
<li><code>idempotency_key</code> so consumers (or my publisher) can dedupe.</li>
</ul>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> outbox_events (
  <span class="hljs-keyword">id</span>            BIGSERIAL PRIMARY <span class="hljs-keyword">KEY</span>,
  aggregate_type <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  aggregate_id   <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  event_type     <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  payload        JSONB <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  idempotency_key <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,

  <span class="hljs-keyword">status</span>         <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-string">'pending'</span>, <span class="hljs-comment">-- pending|processing|published|dead</span>
  attempts       <span class="hljs-built_in">INT</span>  <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-number">0</span>,
  available_at   TIMESTAMPTZ <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-keyword">now</span>(),

  created_at     TIMESTAMPTZ <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-keyword">now</span>(),
  published_at   TIMESTAMPTZ
);

<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">UNIQUE</span> <span class="hljs-keyword">INDEX</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> outbox_events_idempotency_key_uidx
  <span class="hljs-keyword">ON</span> outbox_events (idempotency_key);

<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> outbox_events_pending_idx
  <span class="hljs-keyword">ON</span> outbox_events (<span class="hljs-keyword">status</span>, available_at, <span class="hljs-keyword">id</span>);
</code></pre>
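<p>One variation worth knowing: the dispatcher only ever scans <code>pending</code> rows, so a partial index can replace the composite one above and stays small even when the table doesn’t. A sketch (same table; whether it actually wins depends on your status distribution):</p>

```sql
-- Partial alternative to the (status, available_at, id) index: it only
-- contains rows the dispatcher cares about, so it stays tiny once most
-- rows are 'published' or 'dead'.
CREATE INDEX IF NOT EXISTS outbox_events_pending_partial_idx
  ON outbox_events (available_at, id)
  WHERE status = 'pending';
```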
<h3 id="heading-writing-business-data-outbox-atomically">Writing business data + outbox atomically</h3>
<p>I use Node.js a lot for SaaS backends, so here’s a working example using <code>pg</code>.</p>
<p>Key detail: <strong>the outbox write is inside the same <code>BEGIN/COMMIT</code></strong>.</p>
<pre><code class="lang-js"><span class="hljs-keyword">import</span> pg <span class="hljs-keyword">from</span> <span class="hljs-string">'pg'</span>;
<span class="hljs-keyword">import</span> crypto <span class="hljs-keyword">from</span> <span class="hljs-string">'crypto'</span>;

<span class="hljs-keyword">const</span> { Pool } = pg;
<span class="hljs-keyword">const</span> pool = <span class="hljs-keyword">new</span> Pool({ <span class="hljs-attr">connectionString</span>: process.env.DATABASE_URL });

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">createInvoice</span>(<span class="hljs-params">{ customerId, amountCents }</span>) </span>{
  <span class="hljs-keyword">const</span> client = <span class="hljs-keyword">await</span> pool.connect();
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">await</span> client.query(<span class="hljs-string">'BEGIN'</span>);

    <span class="hljs-keyword">const</span> invoiceRes = <span class="hljs-keyword">await</span> client.query(
      <span class="hljs-string">`INSERT INTO invoices (customer_id, amount_cents, status)
       VALUES ($1, $2, 'created')
       RETURNING id, customer_id, amount_cents, status, created_at`</span>,
      [customerId, amountCents]
    );

    <span class="hljs-keyword">const</span> invoice = invoiceRes.rows[<span class="hljs-number">0</span>];
    <span class="hljs-keyword">const</span> idempotencyKey = crypto
      .createHash(<span class="hljs-string">'sha256'</span>)
      .update(<span class="hljs-string">`invoice.created:<span class="hljs-subst">${invoice.id}</span>`</span>)
      .digest(<span class="hljs-string">'hex'</span>);

    <span class="hljs-keyword">await</span> client.query(
      <span class="hljs-string">`INSERT INTO outbox_events
         (aggregate_type, aggregate_id, event_type, payload, idempotency_key)
       VALUES
         ($1, $2, $3, $4::jsonb, $5)
       ON CONFLICT (idempotency_key) DO NOTHING`</span>,
      [
        <span class="hljs-string">'invoice'</span>,
        <span class="hljs-built_in">String</span>(invoice.id),
        <span class="hljs-string">'invoice.created'</span>,
        <span class="hljs-built_in">JSON</span>.stringify({
          <span class="hljs-attr">invoiceId</span>: invoice.id,
          <span class="hljs-attr">customerId</span>: invoice.customer_id,
          <span class="hljs-attr">amountCents</span>: invoice.amount_cents,
          <span class="hljs-attr">createdAt</span>: invoice.created_at
        }),
        idempotencyKey
      ]
    );

    <span class="hljs-keyword">await</span> client.query(<span class="hljs-string">'COMMIT'</span>);
    <span class="hljs-keyword">return</span> invoice;
  } <span class="hljs-keyword">catch</span> (e) {
    <span class="hljs-keyword">await</span> client.query(<span class="hljs-string">'ROLLBACK'</span>);
    <span class="hljs-keyword">throw</span> e;
  } <span class="hljs-keyword">finally</span> {
    client.release();
  }
}
</code></pre>
<p>That <code>ON CONFLICT DO NOTHING</code> is defensive. If my API handler retries due to a timeout after the commit (classic), I won’t enqueue the same logical event twice.</p>
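<p>The consumer needs the other half of that contract. A minimal sketch, assuming a hypothetical <code>consumed_events</code> table on the consumer’s side: record the key before applying side effects, and no-op when the insert hits a conflict.</p>

```sql
CREATE TABLE IF NOT EXISTS consumed_events (
  idempotency_key TEXT PRIMARY KEY,
  consumed_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Inside the consumer's transaction: if this inserts 0 rows, the event
-- was already processed and the handler should skip its side effects.
INSERT INTO consumed_events (idempotency_key)
VALUES ($1)
ON CONFLICT (idempotency_key) DO NOTHING;
```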
<h3 id="heading-dispatching-with-for-update-skip-locked">Dispatching with <code>FOR UPDATE SKIP LOCKED</code></h3>
<p>This is the part people either over-engineer or under-engineer.</p>
<p>I keep it boring:</p>
<ul>
<li>Select a batch of pending events.</li>
<li>Lock them so multiple workers don’t double-publish.</li>
<li>Mark them <code>processing</code>.</li>
<li>Publish.</li>
<li>Mark <code>published</code>.</li>
</ul>
<p>Postgres gives me the concurrency primitive I need: <code>FOR UPDATE SKIP LOCKED</code>.</p>
<pre><code class="lang-js"><span class="hljs-keyword">import</span> pg <span class="hljs-keyword">from</span> <span class="hljs-string">'pg'</span>;

<span class="hljs-keyword">const</span> { Pool } = pg;
<span class="hljs-keyword">const</span> pool = <span class="hljs-keyword">new</span> Pool({ <span class="hljs-attr">connectionString</span>: process.env.DATABASE_URL });

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">publishToQueue</span>(<span class="hljs-params">evt</span>) </span>{
  <span class="hljs-comment">// Example: replace with your actual publisher.</span>
  <span class="hljs-comment">// This function must be safe to retry.</span>
  <span class="hljs-comment">// If you use SQS FIFO, idempotency_key can be MessageDeduplicationId.</span>
  <span class="hljs-keyword">return</span>;
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">dispatchOnce</span>(<span class="hljs-params">{ batchSize = <span class="hljs-number">50</span> } = {}</span>) </span>{
  <span class="hljs-keyword">const</span> client = <span class="hljs-keyword">await</span> pool.connect();
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">await</span> client.query(<span class="hljs-string">'BEGIN'</span>);

    <span class="hljs-keyword">const</span> { <span class="hljs-attr">rows</span>: events } = <span class="hljs-keyword">await</span> client.query(
      <span class="hljs-string">`SELECT id, event_type, payload, idempotency_key
       FROM outbox_events
       WHERE status = 'pending'
         AND available_at &lt;= now()
       ORDER BY id
       FOR UPDATE SKIP LOCKED
       LIMIT $1`</span>,
      [batchSize]
    );

    <span class="hljs-keyword">if</span> (events.length === <span class="hljs-number">0</span>) {
      <span class="hljs-keyword">await</span> client.query(<span class="hljs-string">'COMMIT'</span>);
      <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;
    }

    <span class="hljs-keyword">const</span> ids = events.map(<span class="hljs-function"><span class="hljs-params">e</span> =&gt;</span> e.id);

    <span class="hljs-keyword">await</span> client.query(
      <span class="hljs-string">`UPDATE outbox_events
       SET status = 'processing'
       WHERE id = ANY($1::bigint[])`</span>,
      [ids]
    );

    <span class="hljs-keyword">await</span> client.query(<span class="hljs-string">'COMMIT'</span>);

    <span class="hljs-comment">// Publish outside the transaction.</span>
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> evt <span class="hljs-keyword">of</span> events) {
      <span class="hljs-keyword">await</span> publishToQueue(evt);
      <span class="hljs-keyword">await</span> pool.query(
        <span class="hljs-string">`UPDATE outbox_events
         SET status = 'published', published_at = now()
         WHERE id = $1`</span>,
        [evt.id]
      );
    }

    <span class="hljs-keyword">return</span> events.length;
  } <span class="hljs-keyword">catch</span> (e) {
    <span class="hljs-comment">// If the failure happened after COMMIT (i.e. during publishing), this</span>
    <span class="hljs-comment">// ROLLBACK is a harmless no-op warning; swallow its own errors so the</span>
    <span class="hljs-comment">// original exception propagates. Stuck 'processing' rows get reset by the reaper.</span>
    <span class="hljs-keyword">await</span> client.query(<span class="hljs-string">'ROLLBACK'</span>).catch(<span class="hljs-params">()</span> =&gt; {});
    <span class="hljs-keyword">throw</span> e;
  } <span class="hljs-keyword">finally</span> {
    client.release();
  }
}
</code></pre>
<p>Yes, publishing happens outside the transaction. That’s intentional.</p>
<p>Holding DB locks while waiting on a network call is how you end up with a self-inflicted outage.</p>
<p>So you accept this reality: an event can be marked <code>processing</code> and the worker can crash before publishing. That’s fine. You handle it with a reaper.</p>
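<p>For completeness, the loop that drives <code>dispatchOnce</code> is trivial. A sketch with the dispatch function injected so the loop stays testable; <code>idleMs</code> and the bounded-iteration option are my own additions, not part of the worker above:</p>

```javascript
// Drive loop for the dispatcher: keep claiming batches until one comes back
// empty, then sleep so an idle worker isn't hot-looping on the database.
// maxIterations lets tests and shutdown hooks bound the loop; production
// would leave it at Infinity.
async function runDispatcher(dispatchOnce, { idleMs = 1000, maxIterations = Infinity } = {}) {
  let iterations = 0;
  while (iterations < maxIterations) {
    const published = await dispatchOnce();
    if (published === 0) {
      // Nothing pending: back off instead of immediately re-polling.
      await new Promise((resolve) => setTimeout(resolve, idleMs));
    }
    iterations += 1;
  }
}
```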
<h2 id="heading-how-it-worked-in-production">How It Worked in Production</h2>
<p>This pattern fixed the class of bugs where “DB says yes, queue says no.” Immediately.</p>
<p>Numbers from my last setup (single Postgres primary, one Node worker, one queue):</p>
<ul>
<li>Before outbox, I measured <strong>19 missing side-effect actions across 1,842,611 requests</strong> over 14 days. Not catastrophic. But every miss was a support ticket or silent data skew.</li>
<li>After outbox, missing actions dropped to <strong>0 across 2,103,884 requests</strong> over the next 14 days.</li>
</ul>
<p>Latency changed too:</p>
<ul>
<li>In-request publishing (old): p95 request latency <strong>310ms</strong>, with spikes to <strong>1,900ms</strong> when the queue API slowed.</li>
<li>Outbox (new): p95 request latency <strong>180ms</strong> (queue publish removed from critical path). Event publish delay p95 <strong>740ms</strong>.</li>
</ul>
<p>Stuff that surprised me:</p>
<ul>
<li>The outbox table grows fast. Even at modest volume, you’ll create tens of millions of rows over time. I hit <strong>24,118,902 rows</strong> in 30 days once. Vacuum wasn’t happy.</li>
<li>Retrying needs backoff. Without it, a downstream outage turns into a tight loop hammering the queue.</li>
</ul>
<p>I ended up adding:</p>
<ul>
<li>A reaper that resets stuck <code>processing</code> events.</li>
<li>A dead-letter path after N attempts.</li>
<li>Partitioning or aggressive archiving depending on volume.</li>
</ul>
<p>Here’s the reaper SQL I use.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">UPDATE</span> outbox_events
<span class="hljs-keyword">SET</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'pending'</span>,
    available_at = <span class="hljs-keyword">now</span>(),
    attempts = attempts + <span class="hljs-number">1</span>
<span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'processing'</span>
  <span class="hljs-keyword">AND</span> created_at &lt; <span class="hljs-keyword">now</span>() - <span class="hljs-built_in">interval</span> <span class="hljs-string">'10 minutes'</span>
  <span class="hljs-keyword">AND</span> attempts &lt; <span class="hljs-number">25</span>;

<span class="hljs-keyword">UPDATE</span> outbox_events
<span class="hljs-keyword">SET</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'dead'</span>
<span class="hljs-keyword">WHERE</span> attempts &gt;= <span class="hljs-number">25</span>
  <span class="hljs-keyword">AND</span> <span class="hljs-keyword">status</span> <span class="hljs-keyword">IN</span> (<span class="hljs-string">'pending'</span>, <span class="hljs-string">'processing'</span>);
</code></pre>
<p>I run that every minute.</p>
<p>Harsh? Yeah. But it forces me to look at dead events instead of pretending retries are infinite.</p>
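<p>“Every minute” can live anywhere: OS cron, a <code>setInterval</code> in the worker, or inside Postgres itself if the <code>pg_cron</code> extension is available (an assumption; managed Postgres offerings vary). A sketch of the last option:</p>

```sql
-- Schedule the reset statement once a minute inside Postgres via pg_cron.
-- The dead-letter statement can be scheduled the same way.
SELECT cron.schedule('outbox-reaper', '* * * * *', $$
  UPDATE outbox_events
  SET status = 'pending',
      available_at = now(),
      attempts = attempts + 1
  WHERE status = 'processing'
    AND created_at < now() - interval '10 minutes'
    AND attempts < 25
$$);
```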
<h2 id="heading-when-this-doesnt-work">When This Doesn't Work</h2>
<p>I don’t use an outbox everywhere.</p>
<p>If you need <strong>sub-50ms end-to-end event delivery</strong>, the dispatcher loop + polling will annoy you. You can mitigate with <code>LISTEN/NOTIFY</code>, but now you’re building a more complex dispatcher anyway.</p>
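<p>If you do reach for <code>LISTEN/NOTIFY</code>, the database side is just a trigger that wakes idle dispatchers. A sketch (the channel name is an assumption, and workers should keep polling as a fallback, since notifications aren’t queued for disconnected listeners):</p>

```sql
CREATE OR REPLACE FUNCTION notify_outbox() RETURNS trigger AS $$
BEGIN
  -- Payload is advisory only; workers still claim rows via SKIP LOCKED.
  PERFORM pg_notify('outbox_wakeup', NEW.id::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER outbox_events_notify
AFTER INSERT ON outbox_events
FOR EACH ROW EXECUTE FUNCTION notify_outbox();
```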
<p>If you’ve already got a mature event platform (Kafka + schema registry + CDC team ownership), straight CDC is cleaner at scale.</p>
<p>And if your DB isn’t the source of truth (event-sourced systems, or systems where writes land in a log first), an outbox table can be redundant.</p>
<p>Also: if you can’t tolerate the outbox table size and you won’t invest in partitioning/TTL, this pattern will bite you later.</p>
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<ul>
<li>If Postgres is your source of truth, <strong>write the event intent into Postgres in the same transaction</strong>. That’s the whole point.</li>
<li>Don’t publish to external systems while holding DB locks. Ever.</li>
<li>Use <code>FOR UPDATE SKIP LOCKED</code> for horizontal scaling without coordination.</li>
<li>Design for retries upfront: idempotency keys, backoff (<code>available_at</code>), and a dead-letter state.</li>
<li>Plan for data lifecycle. Outbox tables don’t stay small by accident.</li>
</ul>
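<p>On that last point, even a blunt retention job beats unbounded growth. A sketch with a 7-day window (the window is an assumption; at tens of millions of rows, partitioning by <code>created_at</code> and dropping old partitions is the better tool):</p>

```sql
-- Nightly cleanup: published events older than the retention window.
-- Delete in batches on large tables to keep vacuum manageable.
DELETE FROM outbox_events
WHERE status = 'published'
  AND published_at < now() - interval '7 days';
```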
<h2 id="heading-closing">Closing</h2>
<p>I keep seeing teams jump straight to “add a queue” and stop there. The queue solves buffering, not atomicity.</p>
<p>The outbox pattern is boring, but it makes failure modes legible—and that’s the real win when you’re on-call for your own system.</p>
<p>Do you prefer an outbox dispatcher (polling or <code>LISTEN/NOTIFY</code>) or CDC off the WAL, and at what event volume did you switch?</p>
]]></content:encoded></item><item><title><![CDATA[How We Compared Telehealth vs In-Person PMHNP Pay Across 10,000+ Job Posts]]></title><description><![CDATA[“Telehealth pays less” is usually a conclusion drawn from one offer. When you aggregate thousands of postings and normalize comp structures, the pattern flips: telehealth often pays more.
The myth: telehealth is “easier,” so it pays less
On PMHNP Hir...]]></description><link>https://blog.dvskr.dev/how-we-compared-telehealth-vs-in-person-pmhnp-pay-across-10000-job-posts</link><guid isPermaLink="true">https://blog.dvskr.dev/how-we-compared-telehealth-vs-in-person-pmhnp-pay-across-10000-job-posts</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[healthcare tech]]></category><category><![CDATA[Next.js]]></category><category><![CDATA[supabase]]></category><category><![CDATA[TypeScript]]></category><category><![CDATA[webdev]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Mon, 23 Feb 2026 21:31:47 GMT</pubDate><content:encoded><![CDATA[<p>“Telehealth pays less” is usually a conclusion drawn from one offer. When you aggregate thousands of postings and normalize comp structures, the pattern flips: telehealth often pays more.</p>
<h2 id="heading-the-myth-telehealth-is-easier-so-it-pays-less">The myth: telehealth is “easier,” so it pays less</h2>
<p>On PMHNP Hiring we ingest 500+ sources daily and maintain 10,000+ verified PMHNP jobs across all 50 states. When we look at compensation across that dataset (after normalizing salary formats and removing duplicates), the common claim that <em>in-person always pays more</em> doesn’t hold up.</p>
<p>Across postings that include usable pay data, <strong>telehealth roles often price higher</strong> than in-person roles.</p>
<p>That doesn’t mean every remote job beats every onsite job. It means the distribution is different enough that treating telehealth as a “pay cut for flexibility” is a bad default.</p>
<p>This post is the builder’s version of the question: what does the data say, and what did we have to do technically to make it comparable?</p>
<hr />
<h2 id="heading-why-pay-comparisons-are-hard-and-why-raw-job-boards-mislead">Why pay comparisons are hard (and why raw job boards mislead)</h2>
<p>Job posts rarely ship “clean” salary fields. The same compensation can show up as:</p>
<ul>
<li><code>$140/hr</code> (W2 hourly)</li>
<li><code>$1,200/day</code></li>
<li><code>$250/visit</code> (1099)</li>
<li><code>Up to $220k</code> (base + bonus unknown)</li>
<li><code>80% collections</code> (requires assumptions)</li>
</ul>
<p>If you compare those strings directly, you’ll produce nonsense. Our pipeline has to:</p>
<ol>
<li><strong>Extract</strong> comp from messy text (structured fields when available, otherwise description parsing)</li>
<li><strong>Normalize</strong> to comparable units (hourly ↔ annual, ranges ↔ midpoint)</li>
<li><strong>Classify</strong> pay model (salary, hourly, per-visit, RVU/collections)</li>
<li><strong>Deduplicate</strong> cross-posted roles so one high-paying listing doesn’t appear 30 times</li>
<li><strong>Segment</strong> by modality (telehealth vs in-person vs hybrid) using both metadata and text signals</li>
</ol>
<p>Only after that do “telehealth vs in-person” comparisons become meaningful.</p>
<hr />
<h2 id="heading-the-pipeline-from-scraped-postings-to-comparable-numbers">The pipeline: from scraped postings to comparable numbers</h2>
<p>At a high level, we treat each source as an input adapter that maps into a common schema, then run enrichment steps.</p>
<h3 id="heading-1-canonical-job-schema">1) Canonical job schema</h3>
<p>We store a normalized representation (Supabase/Postgres), keeping raw fields for debugging:</p>
<pre><code class="lang-ts"><span class="hljs-keyword">type</span> PayModel = <span class="hljs-string">'salary'</span> | <span class="hljs-string">'hourly'</span> | <span class="hljs-string">'per_visit'</span> | <span class="hljs-string">'rvu'</span> | <span class="hljs-string">'collections'</span> | <span class="hljs-string">'unknown'</span>

<span class="hljs-keyword">type</span> Job = {
  id: <span class="hljs-built_in">string</span>
  source: <span class="hljs-built_in">string</span>
  source_job_id: <span class="hljs-built_in">string</span>
  title: <span class="hljs-built_in">string</span>
  company: <span class="hljs-built_in">string</span>
  location_text: <span class="hljs-built_in">string</span>
  remote_type: <span class="hljs-string">'telehealth'</span> | <span class="hljs-string">'in_person'</span> | <span class="hljs-string">'hybrid'</span> | <span class="hljs-string">'unknown'</span>
  pay_model: PayModel
  pay_min?: <span class="hljs-built_in">number</span>
  pay_max?: <span class="hljs-built_in">number</span>
  pay_unit?: <span class="hljs-string">'year'</span> | <span class="hljs-string">'hour'</span> | <span class="hljs-string">'visit'</span>
  pay_currency?: <span class="hljs-string">'USD'</span>
  description: <span class="hljs-built_in">string</span>
  posted_at: <span class="hljs-built_in">string</span>
  fingerprint: <span class="hljs-built_in">string</span> <span class="hljs-comment">// for dedupe</span>
}
</code></pre>
<h3 id="heading-2-salary-parsing-normalization">2) Salary parsing + normalization</h3>
<p>We normalize into an annualized estimate <strong>only when the pay model supports it</strong>. For hourly W2 roles, annualization is straightforward (with assumptions). For per-visit/collections, we keep the model explicit to avoid inventing certainty.</p>
<pre><code class="lang-ts"><span class="hljs-keyword">const</span> HOURS_PER_YEAR = <span class="hljs-number">2080</span>

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">annualize</span>(<span class="hljs-params">job: Job</span>) </span>{
  <span class="hljs-keyword">if</span> (job.pay_model === <span class="hljs-string">'hourly'</span> &amp;&amp; job.pay_unit === <span class="hljs-string">'hour'</span>) {
    <span class="hljs-keyword">return</span> {
      annual_min: job.pay_min ? job.pay_min * HOURS_PER_YEAR : <span class="hljs-literal">null</span>,
      annual_max: job.pay_max ? job.pay_max * HOURS_PER_YEAR : <span class="hljs-literal">null</span>,
      confidence: <span class="hljs-string">'medium'</span>,
    }
  }

  <span class="hljs-keyword">if</span> (job.pay_model === <span class="hljs-string">'salary'</span> &amp;&amp; job.pay_unit === <span class="hljs-string">'year'</span>) {
    <span class="hljs-keyword">return</span> {
      annual_min: job.pay_min ?? <span class="hljs-literal">null</span>,
      annual_max: job.pay_max ?? <span class="hljs-literal">null</span>,
      confidence: <span class="hljs-string">'high'</span>,
    }
  }

  <span class="hljs-comment">// per-visit / collections / RVU require volume assumptions → do not annualize by default</span>
  <span class="hljs-keyword">return</span> { annual_min: <span class="hljs-literal">null</span>, annual_max: <span class="hljs-literal">null</span>, confidence: <span class="hljs-string">'low'</span> }
}
</code></pre>
<p>This is where a lot of “telehealth pays less” myths come from: many remote roles are posted as per-visit or production-based, while hospital roles are posted as clean annual salaries. If you only compare annual-salary postings, you bias toward in-person systems.</p>
<h3 id="heading-3-deduplication-the-hidden-salary-inflation-bug">3) Deduplication (the hidden salary inflation bug)</h3>
<p>High-volume telehealth platforms syndicate aggressively. Without dedupe, your dataset overcounts the same role and skews pay stats.</p>
<p>We generate a fingerprint from stable fields (company + title + state/license requirement + pay band + remote type) and cluster near-matches.</p>
<p>Architecture note: dedupe is a blend of deterministic hashing + fuzzy matching (string similarity on company/title) with thresholds tuned by manual review.</p>
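<p>The deterministic half of that blend is just a stable hash over normalized fields. A sketch (field choice and normalization here are illustrative, not our exact production logic):</p>

```typescript
import { createHash } from 'crypto';

// Deterministic fingerprint over the stable fields used for exact-duplicate
// clustering; fuzzy matching handles the near-misses separately.
function jobFingerprint(job: {
  company: string;
  title: string;
  state: string;
  payBand: string;
  remoteType: string;
}): string {
  // Normalize so trivial formatting differences don't defeat the hash.
  const norm = (s: string) => s.trim().toLowerCase().replace(/\s+/g, ' ');
  const key = [job.company, job.title, job.state, job.payBand, job.remoteType]
    .map(norm)
    .join('|');
  return createHash('sha256').update(key).digest('hex');
}
```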
<hr />
<h2 id="heading-what-the-data-shows-telehealth-often-prices-higher">What the data shows: telehealth often prices higher</h2>
<p>After normalization and dedupe, we compare distributions by <code>remote_type</code>, segmented by pay model (salary vs hourly vs per-visit).</p>
<p>A simplified SQL sketch:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">select</span>
  remote_type,
  pay_model,
  <span class="hljs-keyword">percentile_cont</span>(<span class="hljs-number">0.5</span>) <span class="hljs-keyword">within</span> <span class="hljs-keyword">group</span> (<span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> annual_mid) <span class="hljs-keyword">as</span> p50,
  <span class="hljs-keyword">count</span>(*) <span class="hljs-keyword">as</span> n
<span class="hljs-keyword">from</span> (
  <span class="hljs-keyword">select</span>
    remote_type,
    pay_model,
    <span class="hljs-keyword">case</span>
      <span class="hljs-keyword">when</span> annual_min <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">and</span> annual_max <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">then</span> (annual_min + annual_max)/<span class="hljs-number">2</span>
      <span class="hljs-keyword">when</span> annual_min <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">then</span> annual_min
      <span class="hljs-keyword">when</span> annual_max <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">then</span> annual_max
      <span class="hljs-keyword">else</span> <span class="hljs-literal">null</span>
    <span class="hljs-keyword">end</span> <span class="hljs-keyword">as</span> annual_mid
  <span class="hljs-keyword">from</span> job_comp_normalized
  <span class="hljs-keyword">where</span> annual_min <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">or</span> annual_max <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span>
) x
<span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> <span class="hljs-number">1</span>,<span class="hljs-number">2</span>
<span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> n <span class="hljs-keyword">desc</span>;
</code></pre>
<p>The repeated pattern we see:</p>
<ul>
<li><strong>Telehealth salary/hourly postings often cluster higher</strong> than comparable in-person postings.</li>
<li>The gap gets bigger in roles that signal urgency: multi-state licensing, nights/weekends, fast start dates.</li>
<li>In-person still wins in specific slices: hospital systems with strong benefits and stable base salaries.</li>
</ul>
<p>So why does telehealth price higher <em>so often</em>?</p>
<hr />
<h2 id="heading-the-business-math-behind-the-pay-as-seen-through-job-post-signals">The business math behind the pay (as seen through job post signals)</h2>
<p>From a data standpoint, remote roles correlate with signals that predict higher comp:</p>
<ol>
<li><p><strong>Competition is national, not local</strong></p>
<ul>
<li>Telehealth employers compete against other remote-first orgs. We see faster repost cycles and higher pay edits in these listings.</li>
</ul>
</li>
<li><p><strong>Many remote models are throughput-optimized</strong></p>
<ul>
<li>Posts mention standardized workflows, shorter appointment gaps, and reduced no-shows. That tends to pair with productivity pay or higher hourly rates.</li>
</ul>
</li>
<li><p><strong>Coverage + urgency premiums</strong></p>
<ul>
<li>Remote roles disproportionately include nights/weekends, rural coverage, and “licensed in X state” requirements.</li>
</ul>
</li>
</ol>
<p>Technically, these show up as text features we can index and filter: <code>weekend</code>, <code>after-hours</code>, <code>multi-state</code>, <code>compact</code>, <code>ASAP</code>, etc. They’re not perfect, but they’re strong enough to segment on.</p>
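<p>Extracting those signals can start embarrassingly simple. A sketch (the keyword list is illustrative; real matching needs word boundaries and negation handling):</p>

```typescript
// Flag urgency/coverage signals in a posting's description text.
const URGENCY_SIGNALS = ['weekend', 'after-hours', 'multi-state', 'compact', 'asap'] as const;

function urgencySignals(description: string): string[] {
  const text = description.toLowerCase();
  return URGENCY_SIGNALS.filter((signal) => text.includes(signal));
}
```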
<hr />
<h2 id="heading-what-we-surface-in-the-product-and-why-it-matters-for-negotiation">What we surface in the product (and why it matters for negotiation)</h2>
<p>On the UI side (Next.js + TypeScript), we expose filters that map directly to the normalized schema:</p>
<ul>
<li>Telehealth / in-person / hybrid</li>
<li>Pay model (salary vs hourly vs per-visit)</li>
<li>Pay range (only when confidence is sufficient)</li>
<li>State licensing requirements</li>
</ul>
<p>Alerts (email/push) are triggered when new postings match saved filters, so users can watch <em>their</em> slice of the market rather than relying on anecdotes.</p>
<p>If you’re negotiating, the practical takeaway is data-driven: <strong>don’t assume telehealth implies a discount</strong>. Treat modality as one variable, then compare roles with the same pay model and similar constraints.</p>
<hr />
<h2 id="heading-next-up-improving-apples-to-apples-comparisons">Next up: improving apples-to-apples comparisons</h2>
<p>The hardest remaining problem is per-visit and collections-based comp. We’re working on “expected annual comp” estimates by pairing postings with realistic volume assumptions (and clearly labeling them as assumptions). That’s the only way to compare a $250/visit role against a $190k base role without hand-waving.</p>
]]></content:encoded></item><item><title><![CDATA[Why I Use Canonical + noindex as an SEO Safety Net]]></title><description><![CDATA[The Problem
Duplicate URLs aren’t a “SEO issue”. They’re a system design issue.
When I’m building a content-heavy app solo, URLs multiply fast. Trailing slash vs no slash. ?utm_source= junk. Filters like ?state=ca&role=icu. Sort options. Even framewo...]]></description><link>https://blog.dvskr.dev/why-i-use-canonical-noindex-as-an-seo-safety-net</link><guid isPermaLink="true">https://blog.dvskr.dev/why-i-use-canonical-noindex-as-an-seo-safety-net</guid><category><![CDATA[Next.js]]></category><category><![CDATA[performance]]></category><category><![CDATA[systemdesign]]></category><category><![CDATA[webdev]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Thu, 05 Feb 2026 16:00:55 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-the-problem">The Problem</h2>
<p>Duplicate URLs aren’t an “SEO issue”. They’re a system design issue.</p>
<p>When I’m building a content-heavy app solo, URLs multiply fast. Trailing slash vs no slash. <code>?utm_source=</code> junk. Filters like <code>?state=ca&amp;role=icu</code>. Sort options. Even framework-level behavior (redirects, <code>notFound()</code>, middleware) can create multiple reachable URLs for the same document.</p>
<p>Google doesn’t ask permission. It picks a canonical on its own if you don’t.</p>
<p>My failure mode was predictable: I’d ship a feature, traffic would go up, then Search Console would show duplicates, “Discovered — currently not indexed”, and soft 404s. Worse, the <em>wrong</em> URLs would rank (parameterized junk), and the ones I cared about wouldn’t.</p>
<p>So I treated it like any other production bug: define an invariant. One document → one indexable URL.</p>
<h2 id="heading-options-i-considered">Options I Considered</h2>
<p>There are a few common approaches. None is perfect.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Approach</td><td>Pros</td><td>Cons</td><td>Best For</td></tr>
</thead>
<tbody>
<tr>
<td>Canonical only (<code>&lt;link rel="canonical"&gt;</code>)</td><td>Keeps alternates crawlable; consolidates signals</td><td>Google may ignore it; duplicates still consume crawl budget</td><td>Mild duplication where alternates are still useful</td></tr>
<tr>
<td><code>noindex</code> only (<code>&lt;meta name="robots" content="noindex"&gt;</code>)</td><td>Hard stop for indexing; fast cleanup</td><td>Doesn’t consolidate signals well; still crawled unless blocked</td><td>Thin pages or internal utility pages</td></tr>
<tr>
<td>Redirect alternates to one URL (301/308)</td><td>Strongest consolidation; simplest mental model</td><td>Breaks some UX (filters/back button); can cause redirect chains</td><td>When alternates are truly equivalent</td></tr>
<tr>
<td>Canonical + <code>noindex</code> + robots rules (hybrid)</td><td>Defensive; handles messy real-world URLs</td><td>Easy to over-block; requires discipline in routing</td><td>Apps with filters, facets, and lots of generated URLs</td></tr>
</tbody>
</table>
</div><p>I started with canonical-only. It worked until it didn’t.</p>
<p>Here’s why canonical-only failed for me:</p>
<ul>
<li>Parameterized URLs often got indexed anyway. Google treated them as distinct enough.</li>
<li>Canonical mistakes are easy. One bug in a shared layout and you emit the wrong canonical for thousands of pages.</li>
<li>Crawl budget isn’t theoretical when you have lots of pages. Duplicates dilute attention.</li>
</ul>
<p>Redirect-only was tempting. But I didn’t want to redirect every filter combination.</p>
<p>Faceted URLs are tricky:</p>
<ul>
<li>Some facets are trash (sort order, tracking params).</li>
<li>Some facets are legitimate landing pages (state, role, category).</li>
</ul>
<p>If you redirect everything, you kill valid long-tail entry points. If you redirect nothing, you get duplication.</p>
<p>So I went hybrid.</p>
<h2 id="heading-what-i-chose-and-why">What I Chose (and Why)</h2>
<p>I chose <strong>canonical + selective <code>noindex</code> + robots.txt rules for known junk params</strong>, plus a hard rule: <strong>every indexable page must emit an explicit canonical</strong>.</p>
<p>Ranked reasons:</p>
<ol>
<li><strong>Fail-safe behavior.</strong> If a duplicate URL slips through, it still won’t get indexed.</li>
<li><strong>Control.</strong> I decide which facets deserve indexing. Not Google.</li>
<li><strong>Incremental rollout.</strong> I can add rules per route type without rewriting routing.</li>
</ol>
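<p>The robots.txt piece is the smallest part. A sketch for known junk params (the patterns are illustrative and rely on Google’s wildcard matching; over-blocking is the easy mistake here, so keep the list short and explicit):</p>

```txt
User-agent: *
# Tracking params never define a distinct document
Disallow: /*?*utm_source=
Disallow: /*?*utm_medium=
# Sort order is presentation, not content
Disallow: /*?*sort=
```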
<p>What I gave up:</p>
<ul>
<li>I gave up the simplicity of “canonical everywhere and pray”. Now I maintain explicit allow/deny logic.</li>
<li>I gave up indexing for some URLs that might’ve been valuable. That’s on me to evaluate.</li>
</ul>
<h3 id="heading-step-1-normalize-the-canonical-url">Step 1: Normalize the canonical URL</h3>
<p>In Next.js, the easiest trap is building canonicals from <code>req.url</code> or <code>searchParams</code>. Don’t.</p>
<p>I treat canonical as a pure function of the route params that define the document.</p>
<pre><code class="lang-ts"><span class="hljs-comment">// app/lib/seo.ts</span>
<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">canonicalUrl</span>(<span class="hljs-params">baseUrl: <span class="hljs-built_in">string</span>, pathname: <span class="hljs-built_in">string</span></span>) </span>{
  <span class="hljs-comment">// Enforce a consistent policy: no trailing slash except root</span>
  <span class="hljs-keyword">const</span> cleanPath = pathname === <span class="hljs-string">"/"</span> ? <span class="hljs-string">"/"</span> : pathname.replace(<span class="hljs-regexp">/\/+$/</span>, <span class="hljs-string">""</span>);
  <span class="hljs-keyword">return</span> <span class="hljs-keyword">new</span> URL(cleanPath, baseUrl).toString();
}

<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">isIndexablePath</span>(<span class="hljs-params">pathname: <span class="hljs-built_in">string</span></span>) </span>{
  <span class="hljs-comment">// Index only real landing pages. Everything else gets noindex.</span>
  <span class="hljs-comment">// Adjust to your domain model.</span>
  <span class="hljs-keyword">if</span> (pathname === <span class="hljs-string">"/"</span>) <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;

  <span class="hljs-comment">// Example allow-list patterns</span>
  <span class="hljs-keyword">if</span> (<span class="hljs-regexp">/^\/states\/[a-z]{2}$/</span>.test(pathname)) <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;
  <span class="hljs-keyword">if</span> (<span class="hljs-regexp">/^\/cities\/[a-z-]+$/</span>.test(pathname)) <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;
  <span class="hljs-keyword">if</span> (<span class="hljs-regexp">/^\/categories\/[a-z-]+$/</span>.test(pathname)) <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;
  <span class="hljs-keyword">if</span> (<span class="hljs-regexp">/^\/jobs\/[0-9]+$/</span>.test(pathname)) <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;

  <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
}
</code></pre>
<h3 id="heading-step-2-emit-canonical-robots-per-page-or-layout">Step 2: Emit canonical + robots per page (or layout)</h3>
<p>If you’re on the App Router, <code>generateMetadata()</code> is the cleanest place to do this.</p>
<pre><code class="lang-ts"><span class="hljs-comment">// app/(public)/[...slug]/page.tsx</span>
<span class="hljs-keyword">import</span> <span class="hljs-keyword">type</span> { Metadata } <span class="hljs-keyword">from</span> <span class="hljs-string">"next"</span>;
<span class="hljs-keyword">import</span> { canonicalUrl, isIndexablePath } <span class="hljs-keyword">from</span> <span class="hljs-string">"@/app/lib/seo"</span>;

<span class="hljs-keyword">const</span> BASE_URL = process.env.NEXT_PUBLIC_BASE_URL!;

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">generateMetadata</span>(<span class="hljs-params">
  { params }: { params: <span class="hljs-built_in">Promise</span>&lt;{ slug?: <span class="hljs-built_in">string</span>[] }&gt; }
</span>): <span class="hljs-title">Promise</span>&lt;<span class="hljs-title">Metadata</span>&gt; </span>{
  <span class="hljs-keyword">const</span> { slug } = <span class="hljs-keyword">await</span> params;
  <span class="hljs-keyword">const</span> pathname = <span class="hljs-string">"/"</span> + (slug?.join(<span class="hljs-string">"/"</span>) ?? <span class="hljs-string">""</span>);

  <span class="hljs-keyword">const</span> canonical = canonicalUrl(BASE_URL, pathname);
  <span class="hljs-keyword">const</span> indexable = isIndexablePath(pathname);

  <span class="hljs-keyword">return</span> {
    alternates: { canonical },
    robots: indexable
      ? { index: <span class="hljs-literal">true</span>, follow: <span class="hljs-literal">true</span> }
      : { index: <span class="hljs-literal">false</span>, follow: <span class="hljs-literal">true</span> },
  };
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">default</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">Page</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-keyword">return</span> <span class="hljs-literal">null</span>;
}
</code></pre>
<p>That <code>follow: true</code> is intentional. I still want discovery through internal links even if the page itself isn’t indexable.</p>
<h3 id="heading-step-3-kill-obvious-junk-at-the-robots-layer">Step 3: Kill obvious junk at the robots layer</h3>
<p>Robots.txt isn’t a noindex mechanism (Google stopped supporting the unofficial <code>noindex</code> directive in robots.txt in 2019). But it’s still useful for crawl control.</p>
<p>I block parameters that should never be crawled.</p>
<pre><code class="lang-txt"># public/robots.txt
User-agent: *
Disallow: /*?utm_
Disallow: /*&amp;utm_
Disallow: /*?ref=
Disallow: /*&amp;ref=
Disallow: /*?sort=
Disallow: /*&amp;sort=

# Let everything else be crawlable
Allow: /
</code></pre>
<p>This doesn’t prevent indexing if there are external links pointing at a URL (Google can index a URL it can’t crawl). That’s why I still rely on <code>noindex</code> for anything that’s reachable and shouldn’t be indexed.</p>
<h3 id="heading-step-4-make-404s-real-404s">Step 4: Make 404s real 404s</h3>
<p>Soft 404s were another source of garbage URLs showing up. If a page doesn’t exist, return a real 404.</p>
<p>In App Router, <code>notFound()</code> does the right thing <em>if</em> you don’t swallow it and render a 200.</p>
<pre><code class="lang-ts"><span class="hljs-comment">// app/jobs/[id]/page.tsx</span>
<span class="hljs-keyword">import</span> { notFound } <span class="hljs-keyword">from</span> <span class="hljs-string">"next/navigation"</span>;

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">getJob</span>(<span class="hljs-params">id: <span class="hljs-built_in">number</span></span>) </span>{
  <span class="hljs-keyword">const</span> res = <span class="hljs-keyword">await</span> fetch(<span class="hljs-string">`<span class="hljs-subst">${process.env.API_BASE_URL}</span>/jobs/<span class="hljs-subst">${id}</span>`</span>, {
    cache: <span class="hljs-string">"no-store"</span>,
  });
  <span class="hljs-keyword">if</span> (res.status === <span class="hljs-number">404</span>) <span class="hljs-keyword">return</span> <span class="hljs-literal">null</span>;
  <span class="hljs-keyword">if</span> (!res.ok) <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">`Failed to fetch job <span class="hljs-subst">${id}</span>: <span class="hljs-subst">${res.status}</span>`</span>);
  <span class="hljs-keyword">return</span> res.json() <span class="hljs-keyword">as</span> <span class="hljs-built_in">Promise</span>&lt;{ id: <span class="hljs-built_in">number</span>; title: <span class="hljs-built_in">string</span> }&gt;;
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">default</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">JobPage</span>(<span class="hljs-params">
  { params }: { params: <span class="hljs-built_in">Promise</span>&lt;{ id: <span class="hljs-built_in">string</span> }&gt; }
</span>) </span>{
  <span class="hljs-keyword">const</span> { id } = <span class="hljs-keyword">await</span> params;
  <span class="hljs-keyword">const</span> jobId = <span class="hljs-built_in">Number</span>(id);
  <span class="hljs-keyword">if</span> (!<span class="hljs-built_in">Number</span>.isInteger(jobId) || jobId &lt;= <span class="hljs-number">0</span>) notFound();

  <span class="hljs-keyword">const</span> job = <span class="hljs-keyword">await</span> getJob(jobId);
  <span class="hljs-keyword">if</span> (!job) notFound();

  <span class="hljs-keyword">return</span> (
    &lt;main&gt;
      &lt;h1&gt;{job.title}&lt;/h1&gt;
      &lt;p&gt;Job #{job.id}&lt;/p&gt;
    &lt;/main&gt;
  );
}
</code></pre>
<p>That tiny <code>Number.isInteger(jobId)</code> check prevented a whole class of <code>/jobs/abc</code> garbage from turning into “valid” pages.</p>
<h2 id="heading-how-it-worked-in-production">How It Worked in Production</h2>
<p>This was one of those fixes where you don’t get to celebrate immediately. Google takes its time. Also, Search Console reporting lags.</p>
<p>But the signals were clear.</p>
<ul>
<li>Duplicate canonical issues dropped from <strong>46 to 0</strong> in <strong>9 days</strong>.</li>
<li>Soft 404s dropped from <strong>12 to 0</strong> after I fixed <code>notFound()</code> usage and stopped returning 200s for missing entities.</li>
<li>“Discovered — currently not indexed” URLs went down by <strong>50+</strong> after I blocked crawl traps (<code>utm_</code>, <code>sort</code>, <code>ref</code>) and noindexed non-landing facet pages.</li>
</ul>
<p>The surprise: canonical correctness mattered more than I expected.</p>
<p>I had one bug where I accidentally emitted the same canonical for every city page because I computed it in a layout using the parent route path. Google didn’t just ignore the canonical. It started clustering pages together. Rankings got weird. Pages dropped.</p>
<p>After I moved canonical generation to the leaf route and made it purely derived from route params, the clustering stopped.</p>
<p>This wasn’t “SEO”. It was a distributed system resolving conflicting identifiers.</p>
<h2 id="heading-when-this-doesnt-work">When This Doesn’t Work</h2>
<p>This setup breaks when you actually want faceted navigation to be indexable at scale.</p>
<p>If your business depends on long-tail combinations (think <code>/laptops?brand=lenovo&amp;ram=32gb&amp;cpu=amd</code>), blanket <code>noindex</code> on parameterized URLs will kneecap you. In that world, you need a real facet strategy: allow-list specific combinations, generate clean path-based landing pages, and control internal linking so you don’t create infinite crawl graphs.</p>
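<p>A real facet strategy usually means an explicit allow-list that maps approved combinations to clean paths. A sketch of what that could look like — the facet names and path shapes here are hypothetical:</p>

```typescript
// Hypothetical facet allow-list: only named combinations earn a clean, indexable path.
type Facets = { state?: string; role?: string; sort?: string };

const ALLOWED_COMBOS: Array<(f: Facets) => string | null> = [
  (f) => (f.state && !f.role ? `/states/${f.state}` : null),
  (f) => (f.state && f.role ? `/states/${f.state}/roles/${f.role}` : null),
];

export function landingPathFor(facets: Facets): string | null {
  if (facets.sort) return null; // sort order never earns a landing page
  for (const rule of ALLOWED_COMBOS) {
    const path = rule(facets);
    if (path) return path;
  }
  return null; // everything else: canonical + noindex
}
```

<p>Internal links then point only at paths this function returns, which keeps the crawl graph finite.</p>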
<p>Also: if your canonical logic depends on runtime headers (host, protocol) behind proxies/CDNs, you’ll generate mismatched canonicals (<code>http</code> vs <code>https</code>). That’s a mess. Use an explicit <code>BASE_URL</code> and stick to it.</p>
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<ul>
<li>Treat URLs like primary keys. One document should have one indexable identifier.</li>
<li>Canonical-only is optimistic. Canonical + selective <code>noindex</code> is defensive.</li>
<li>Robots.txt controls crawl. It doesn’t guarantee deindexing.</li>
<li>Make 404s real 404s. Soft 404s are just duplicate-content bugs wearing a different hat.</li>
<li>Keep an allow-list for indexable routes. If you can’t explain why a URL should rank, it shouldn’t be indexable.</li>
</ul>
<h2 id="heading-closing">Closing</h2>
<p>I’ve settled on a rule: if a URL can be generated by a user clicking around (filters, sorting, tracking params), it’s guilty until proven innocent.</p>
<p>What’s your rule for deciding which faceted URLs become first-class landing pages, and which ones get canonical + <code>noindex</code>?</p>
]]></content:encoded></item><item><title><![CDATA[Why I Use MMKV Over AsyncStorage for Persisted State]]></title><description><![CDATA[Offline-first apps live or die by perceived speed. In my React Native fitness app (5‑second set logging, SQLite-first), the difference between “instant” and “laggy” often comes down to one unglamorous detail: how you persist state. I initially treate...]]></description><link>https://blog.dvskr.dev/why-i-use-mmkv-over-asyncstorage-for-persisted-state</link><guid isPermaLink="true">https://blog.dvskr.dev/why-i-use-mmkv-over-asyncstorage-for-persisted-state</guid><category><![CDATA[architecture]]></category><category><![CDATA[buildinpublic]]></category><category><![CDATA[performance]]></category><category><![CDATA[Reactnative]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Sat, 31 Jan 2026 16:00:55 GMT</pubDate><content:encoded><![CDATA[<p>Offline-first apps live or die by perceived speed. In my React Native fitness app (5‑second set logging, SQLite-first), the difference between “instant” and “laggy” often comes down to one unglamorous detail: how you persist state. I initially treated persistence as an afterthought (AsyncStorage + JSON), then watched startup time and UI responsiveness degrade as I added an achievement system and more client-side state. This post is about why I switched to MMKV for persisted state with Zustand—and what I traded away to get consistently snappy interactions.</p>
<h2 id="heading-the-problem-space-persistence-is-on-the-hot-path">The problem space: persistence is on the hot path</h2>
<p>I’m building a mobile workout app where the core interaction is logging a set in ~5 seconds. The app is offline-first:</p>
<ul>
<li>SQLite is the primary data store (workouts, sets, exercises)</li>
<li>The UI needs to feel instantaneous (sub-100ms interactions)</li>
<li>State must survive restarts (in-progress workout, last used timers, user preferences, cached AI hints)</li>
<li>A new achievement system introduced more “derived UI state” (unlocked milestones, celebratory banners, streaks)</li>
</ul>
<p>At my current scale (10-person waitlist, ~400+ exercises), this isn’t “big data.” But mobile performance is nonlinear: you can be “small” and still feel slow.</p>
<p>Two constraints made persistence a first-class architecture decision:</p>
<ol>
<li><strong>Cold start is the first impression.</strong> If I can’t restore enough state quickly, users land in a blank screen or loading spinners.</li>
<li><strong>Offline-first means more local state.</strong> Remote is not the source of truth; the device is. That shifts more responsibility to local persistence.</li>
</ol>
<p>I also build in a “vibe coding” style (Cursor + Claude). That speeds up iteration, but it also increases the risk of accidental performance regressions—so I wanted a persistence layer that’s hard to misuse.</p>
<h2 id="heading-options-considered">Options considered</h2>
<p>The decision was specifically about <em>persisted app state</em> (Zustand store snapshots, preferences, small caches), not the main relational data (that lives in SQLite).</p>
<p>Here are the options I seriously considered.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Option</td><td>What it is</td><td>Pros</td><td>Cons</td><td>Best when</td></tr>
</thead>
<tbody>
<tr>
<td>AsyncStorage (community)</td><td>Simple key/value storage, async JS API</td><td>Built-in mental model, widely used, minimal native setup</td><td>JSON serialize/parse overhead, slower reads on startup, easy to store too much, performance varies by platform/implementation</td><td>Very small state, infrequent reads, low perf sensitivity</td></tr>
<tr>
<td>SQLite for everything</td><td>Store preferences/state tables in SQLite</td><td>One database to rule them all, queryable, consistent backup story</td><td>More schema work, migrations, overkill for tiny blobs, still needs careful read patterns on startup</td><td>You need relational queries or complex local joins</td></tr>
<tr>
<td>Secure storage (Keychain/Keystore)</td><td>OS-provided encrypted storage</td><td>Great for secrets, tokens</td><td>Not meant for frequent reads/writes, capacity limits, slower</td><td>Credentials, API keys, refresh tokens</td></tr>
<tr>
<td>MMKV</td><td>Fast key/value storage via JSI (C++), sync reads</td><td>Very fast, synchronous reads (no async waterfall), good for persisted state, supports encryption</td><td>Native dependency, synchronous API can be abused, not queryable like SQLite</td><td>Hot-path state (startup, navigation), medium-sized persisted slices</td></tr>
</tbody>
</table>
</div><h3 id="heading-why-not-just-use-asyncstorage-correctly">Why not just use AsyncStorage “correctly”?</h3>
<p>You can. If you aggressively minimize what you persist, debounce writes, and avoid reading too much on boot, AsyncStorage can be fine.</p>
<p>But in practice, “correctly” is the hard part—especially as a solo creator moving fast.</p>
<p>What pushed me away:</p>
<ul>
<li><strong>Async waterfall on startup</strong>: <code>await getItem()</code> chains across multiple keys can add up.</li>
<li><strong>Serialization overhead</strong>: JSON parse/stringify becomes noticeable when you persist larger objects (like achievement states or cached AI responses).</li>
<li><strong>Non-obvious regressions</strong>: a single new persisted field can silently add milliseconds to every startup.</li>
</ul>
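<p>For comparison, the standard AsyncStorage mitigation is a debounced writer, so a burst of state changes collapses into one <code>setItem</code> call. A sketch against a generic storage interface (the 300ms window is an arbitrary choice, not a recommendation):</p>

```typescript
type KVStorage = { setItem(key: string, value: string): Promise<void> };

// Collapse bursts of writes to the same key into a single setItem call.
export function debouncedWriter(storage: KVStorage, delayMs = 300) {
  const timers = new Map<string, ReturnType<typeof setTimeout>>();
  return (key: string, value: string) => {
    const existing = timers.get(key);
    if (existing) clearTimeout(existing);
    timers.set(
      key,
      setTimeout(() => {
        timers.delete(key);
        void storage.setItem(key, value); // last write wins
      }, delayMs)
    );
  };
}
```

<p>It works, but now every persisted slice needs this discipline applied by hand — which is exactly the kind of invisible contract that erodes when you’re moving fast.</p>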
<h3 id="heading-why-not-store-state-in-sqlite">Why not store state in SQLite?</h3>
<p>I’m already using SQLite heavily, so the “one local store” idea was tempting.</p>
<p>But I want to keep a clear boundary:</p>
<ul>
<li><strong>SQLite</strong>: durable domain data (workouts/sets/exercises), needs migrations, integrity constraints.</li>
<li><strong>KV store</strong>: UI/session preferences and caches (fast, schema-less, easy to wipe).</li>
</ul>
<p>Mixing them tends to create a junk-drawer schema where every new UI flag becomes a table row. It’s workable, but it’s not the kind of complexity I want early.</p>
<h2 id="heading-the-decision-mmkv-for-persisted-zustand-slices">The decision: MMKV for persisted Zustand slices</h2>
<p>I chose <strong>MMKV</strong> as the persistence backend for Zustand.</p>
<p>Ranked reasons:</p>
<ol>
<li><strong>Startup speed via synchronous reads</strong>: restoring state doesn’t require an async chain before rendering.</li>
<li><strong>Lower overhead for small-to-medium blobs</strong>: less pain from JSON parse/stringify on hot paths.</li>
<li><strong>Better guardrails</strong>: it nudges me toward persisting only what matters, because it’s easy to keep state slices small and explicit.</li>
</ol>
<p>What I gave up:</p>
<ul>
<li><strong>More native surface area</strong> (dependency management, Expo config/plugins)</li>
<li><strong>Potential UI jank if I abuse sync reads/writes</strong> (sync is a tool, not a free lunch)</li>
<li><strong>Less portability</strong> than AsyncStorage (MMKV is native-first)</li>
</ul>
<h3 id="heading-implementation-overview">Implementation overview</h3>
<p>The key architectural choice wasn’t “use MMKV” in isolation; it was <strong>persist only the slices that must survive a restart</strong>.</p>
<p>My rule: if it can be recomputed from SQLite, don’t persist it in MMKV.</p>
<h4 id="heading-1-create-an-mmkv-backed-storage-adapter-for-zustand">1) Create an MMKV-backed storage adapter for Zustand</h4>
<pre><code class="lang-ts"><span class="hljs-comment">// storage/mmkv.ts</span>
<span class="hljs-keyword">import</span> { MMKV } <span class="hljs-keyword">from</span> <span class="hljs-string">'react-native-mmkv'</span>

<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> mmkv = <span class="hljs-keyword">new</span> MMKV({
  id: <span class="hljs-string">'gymtracker-mmkv'</span>,
  <span class="hljs-comment">// Optional: encryptionKey can be added, but be careful with key management.</span>
})

<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> zustandStorage = {
  setItem: <span class="hljs-function">(<span class="hljs-params">name: <span class="hljs-built_in">string</span>, value: <span class="hljs-built_in">string</span></span>) =&gt;</span> {
    mmkv.set(name, value)
  },
  getItem: <span class="hljs-function">(<span class="hljs-params">name: <span class="hljs-built_in">string</span></span>) =&gt;</span> {
    <span class="hljs-keyword">const</span> v = mmkv.getString(name)
    <span class="hljs-keyword">return</span> v ?? <span class="hljs-literal">null</span>
  },
  removeItem: <span class="hljs-function">(<span class="hljs-params">name: <span class="hljs-built_in">string</span></span>) =&gt;</span> {
    mmkv.delete(name)
  },
}
</code></pre>
<p>This adapter keeps the persistence boundary clean: Zustand only sees a string-based storage interface.</p>
<h4 id="heading-2-persist-only-session-critical-state">2) Persist only “session-critical” state</h4>
<p>Example: the currently active workout session (IDs, timestamps, UI mode), not the full workout history.</p>
<pre><code class="lang-ts"><span class="hljs-comment">// state/useWorkoutSessionStore.ts</span>
<span class="hljs-keyword">import</span> { create } <span class="hljs-keyword">from</span> <span class="hljs-string">'zustand'</span>
<span class="hljs-keyword">import</span> { persist, createJSONStorage } <span class="hljs-keyword">from</span> <span class="hljs-string">'zustand/middleware'</span>
<span class="hljs-keyword">import</span> { zustandStorage } <span class="hljs-keyword">from</span> <span class="hljs-string">'../storage/mmkv'</span>

<span class="hljs-keyword">type</span> WorkoutSessionState = {
  activeWorkoutId: <span class="hljs-built_in">string</span> | <span class="hljs-literal">null</span>
  startedAt: <span class="hljs-built_in">number</span> | <span class="hljs-literal">null</span>
  restTimerSeconds: <span class="hljs-built_in">number</span>
  setActiveWorkout: <span class="hljs-function">(<span class="hljs-params">id: <span class="hljs-built_in">string</span> | <span class="hljs-literal">null</span></span>) =&gt;</span> <span class="hljs-built_in">void</span>
  setRestTimerSeconds: <span class="hljs-function">(<span class="hljs-params">s: <span class="hljs-built_in">number</span></span>) =&gt;</span> <span class="hljs-built_in">void</span>
  resetSession: <span class="hljs-function">() =&gt;</span> <span class="hljs-built_in">void</span>
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> useWorkoutSessionStore = create&lt;WorkoutSessionState&gt;()(
  persist(
    <span class="hljs-function">(<span class="hljs-params">set</span>) =&gt;</span> ({
      activeWorkoutId: <span class="hljs-literal">null</span>,
      startedAt: <span class="hljs-literal">null</span>,
      restTimerSeconds: <span class="hljs-number">90</span>,
      setActiveWorkout: <span class="hljs-function">(<span class="hljs-params">id</span>) =&gt;</span> set({ activeWorkoutId: id, startedAt: id ? <span class="hljs-built_in">Date</span>.now() : <span class="hljs-literal">null</span> }),
      setRestTimerSeconds: <span class="hljs-function">(<span class="hljs-params">s</span>) =&gt;</span> set({ restTimerSeconds: s }),
      resetSession: <span class="hljs-function">() =&gt;</span> set({ activeWorkoutId: <span class="hljs-literal">null</span>, startedAt: <span class="hljs-literal">null</span> }),
    }),
    {
      name: <span class="hljs-string">'workout-session-v1'</span>,
      storage: createJSONStorage(<span class="hljs-function">() =&gt;</span> zustandStorage),
      partialize: <span class="hljs-function">(<span class="hljs-params">state</span>) =&gt;</span> ({
        activeWorkoutId: state.activeWorkoutId,
        startedAt: state.startedAt,
        restTimerSeconds: state.restTimerSeconds,
      }),
    }
  )
)
</code></pre>
<p>Two important details:</p>
<ul>
<li><code>partialize</code>: prevents “accidental persistence” of large or unstable state.</li>
<li>Versioned key names (<code>workout-session-v1</code>): make it easy to invalidate persisted state when the shape changes.</li>
</ul>
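<p>Zustand’s <code>persist</code> middleware also has built-in <code>version</code>/<code>migrate</code> options; whichever mechanism you use, the migration itself is just a pure function from the old shape to the current one. A sketch — the v0 shape here is hypothetical, invented for illustration:</p>

```typescript
type SessionV1 = {
  activeWorkoutId: string | null;
  startedAt: number | null;
  restTimerSeconds: number;
};

// Pure migration: old persisted shapes in, current shape out.
export function migrateSession(persisted: unknown, version: number): SessionV1 {
  if (version === 0) {
    // Pretend v0 stored the rest timer in minutes under a different key.
    const old = persisted as { activeWorkoutId?: string | null; restMinutes?: number };
    return {
      activeWorkoutId: old.activeWorkoutId ?? null,
      startedAt: null, // don't resurrect stale sessions across a schema change
      restTimerSeconds: (old.restMinutes ?? 1.5) * 60,
    };
  }
  return persisted as SessionV1;
}
```

<p>Keeping the migration pure makes it trivially unit-testable, which matters more than usual because persisted-state bugs only surface on upgrade.</p>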
<h4 id="heading-3-dont-persist-derivedlarge-objects-store-references-instead">3) Don’t persist derived/large objects (store references instead)</h4>
<p>A trap I fell into early: persisting “achievement UI state” as a big object (every milestone, last shown, UI banners). It inflated the persisted blob and increased restore time.</p>
<p>Instead, I persist only:</p>
<ul>
<li>last shown milestone ID</li>
<li>a set of unlocked milestone IDs (small)</li>
<li>timestamps for rate-limiting celebrations</li>
</ul>
<p>Everything else is derived from a static milestone catalog bundled with the app.</p>
<pre><code class="lang-ts"><span class="hljs-comment">// state/useAchievementsStore.ts</span>
<span class="hljs-keyword">import</span> { create } <span class="hljs-keyword">from</span> <span class="hljs-string">'zustand'</span>
<span class="hljs-keyword">import</span> { persist, createJSONStorage } <span class="hljs-keyword">from</span> <span class="hljs-string">'zustand/middleware'</span>
<span class="hljs-keyword">import</span> { zustandStorage } <span class="hljs-keyword">from</span> <span class="hljs-string">'../storage/mmkv'</span>

<span class="hljs-keyword">type</span> AchievementsState = {
  unlockedIds: Record&lt;<span class="hljs-built_in">string</span>, <span class="hljs-literal">true</span>&gt;
  lastCelebratedId: <span class="hljs-built_in">string</span> | <span class="hljs-literal">null</span>
  lastCelebratedAt: <span class="hljs-built_in">number</span> | <span class="hljs-literal">null</span>
  unlock: <span class="hljs-function">(<span class="hljs-params">id: <span class="hljs-built_in">string</span></span>) =&gt;</span> <span class="hljs-built_in">void</span>
  markCelebrated: <span class="hljs-function">(<span class="hljs-params">id: <span class="hljs-built_in">string</span></span>) =&gt;</span> <span class="hljs-built_in">void</span>
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> useAchievementsStore = create&lt;AchievementsState&gt;()(
  persist(
    <span class="hljs-function">(<span class="hljs-params">set, get</span>) =&gt;</span> ({
      unlockedIds: {},
      lastCelebratedId: <span class="hljs-literal">null</span>,
      lastCelebratedAt: <span class="hljs-literal">null</span>,
      unlock: <span class="hljs-function">(<span class="hljs-params">id</span>) =&gt;</span> {
        <span class="hljs-keyword">if</span> (get().unlockedIds[id]) <span class="hljs-keyword">return</span>
        set(<span class="hljs-function">(<span class="hljs-params">s</span>) =&gt;</span> ({ unlockedIds: { ...s.unlockedIds, [id]: <span class="hljs-literal">true</span> } }))
      },
      markCelebrated: <span class="hljs-function">(<span class="hljs-params">id</span>) =&gt;</span> set({ lastCelebratedId: id, lastCelebratedAt: <span class="hljs-built_in">Date</span>.now() }),
    }),
    {
      name: <span class="hljs-string">'achievements-v1'</span>,
      storage: createJSONStorage(<span class="hljs-function">() =&gt;</span> zustandStorage),
      partialize: <span class="hljs-function">(<span class="hljs-params">s</span>) =&gt;</span> ({
        unlockedIds: s.unlockedIds,
        lastCelebratedId: s.lastCelebratedId,
        lastCelebratedAt: s.lastCelebratedAt,
      }),
    }
  )
)
</code></pre>
<p>This keeps MMKV as a fast “memory of the app,” not a second database.</p>
<h2 id="heading-results-amp-learnings-numbers-gotchas">Results &amp; learnings (numbers + gotchas)</h2>
<p>I don’t have millions of users, so my numbers are device-level measurements, not fleet-wide telemetry.</p>
<p>On a mid-range Android device (Pixel 6a-class) and an iPhone 13-class device, measured with simple timestamp logging around hydration and first interactive screen:</p>
<ul>
<li><p><strong>Persisted hydration time</strong> (Zustand restore):</p>
<ul>
<li>AsyncStorage (before): ~35–80ms typical, with occasional 150ms spikes when the persisted blob grew</li>
<li>MMKV (after): ~5–15ms typical, fewer spikes</li>
</ul>
</li>
<li><p><strong>Time to first “workout screen interactive”</strong> (not full app start, but navigation + state ready):</p>
<ul>
<li>Before: ~450–650ms</li>
<li>After: ~320–500ms</li>
</ul>
</li>
</ul>
<p>The bigger win wasn’t the raw numbers—it was predictability. The spikes were what made the UI feel unreliable.</p>
<p>Unexpected challenges:</p>
<ol>
<li><strong>Sync APIs are easy to misuse</strong>: MMKV reads are synchronous. If you start doing lots of reads during render (especially in list items), you can create jank. My mitigation: read once in store hydration, then use in-memory state.</li>
<li><strong>Data shape discipline matters more than the storage engine</strong>: MMKV didn’t save me from persisting too much. <code>partialize</code> did.</li>
<li><strong>Wipe strategy</strong>: for debugging and schema changes, having a clear “reset local state” action is essential. KV stores make this easier than SQLite.</li>
</ol>
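<p>For the wipe, I keep one function that deletes every persisted slice by its versioned key prefix. Sketched against a generic key-value interface rather than MMKV directly, so the snippet stands alone — the prefixes are just my store names:</p>

```typescript
type KV = { getAllKeys(): string[]; delete(key: string): void };

// Known prefixes of every persisted slice (versioned key names).
const PERSISTED_PREFIXES = ["workout-session-", "achievements-"];

// Wipe all persisted app state; returns what was removed (handy for a debug toast).
export function resetLocalState(kv: KV): string[] {
  const wiped = kv
    .getAllKeys()
    .filter((k) => PERSISTED_PREFIXES.some((p) => k.startsWith(p)));
  for (const key of wiped) kv.delete(key);
  return wiped;
}
```

<p>Deleting by prefix rather than clearing everything means unrelated keys (device IDs, one-time flags) survive a debug reset.</p>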
<blockquote>
<p>Key insight: MMKV improved the ceiling, but state-slice design removed the foot-guns.</p>
</blockquote>
<h2 id="heading-when-this-doesnt-work">When this doesn’t work</h2>
<p>MMKV isn’t a universal recommendation.</p>
<p>Choose something else if:</p>
<ul>
<li><strong>You need complex queries</strong> over persisted data (use SQLite). KV storage is not fun once you need filtering, joins, or analytics.</li>
<li><strong>Your persisted state is tiny and read rarely</strong>. AsyncStorage is simpler and “good enough” when you’re persisting a couple of strings.</li>
<li><strong>You’re in a strict managed environment</strong> where adding native modules is costly (depending on your Expo setup and policies).</li>
<li><strong>You have multi-process or cross-app access requirements</strong>. MMKV has patterns for this, but the complexity rises quickly.</li>
</ul>
<p>Also: if your app’s main performance problem is expensive renders, heavy images/GIFs, or slow SQLite queries, switching persistence backends won’t move the needle much.</p>
<h2 id="heading-key-takeaways">Key takeaways</h2>
<ol>
<li><strong>Treat persistence as part of your performance budget</strong>, not a utility. It’s on the startup path.</li>
<li><strong>Persist references, not aggregates</strong>: store IDs and timestamps; derive the rest from SQLite or static catalogs.</li>
<li><strong>Use <code>partialize</code> (or equivalents) as a guardrail</strong> to prevent “state creep.”</li>
<li><strong>Prefer predictable performance over theoretical simplicity</strong> when your UX depends on speed.</li>
<li><strong>Measure spikes, not just averages</strong>—users feel the worst 5%.</li>
</ol>
<h2 id="heading-closing">Closing</h2>
<p>If you’re building an offline-first React Native app, what do you persist outside your primary database—and how do you keep that persisted state from quietly growing into a second system of record?</p>
]]></content:encoded></item><item><title><![CDATA[Why I Add an Async Outbox Before Reaching for Kafka]]></title><description><![CDATA[The first time you ship “send email after signup” as an inline API call, it works—until it doesn’t. One slow provider, one transient timeout, or one deploy at the wrong time and you start dropping side effects (emails, webhooks, audit logs) or, worse...]]></description><link>https://blog.dvskr.dev/why-i-add-an-async-outbox-before-reaching-for-kafka</link><guid isPermaLink="true">https://blog.dvskr.dev/why-i-add-an-async-outbox-before-reaching-for-kafka</guid><category><![CDATA[architecture]]></category><category><![CDATA[buildinpublic]]></category><category><![CDATA[database]]></category><category><![CDATA[systemdesign]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Thu, 29 Jan 2026 16:00:18 GMT</pubDate><content:encoded><![CDATA[<p>The first time you ship “send email after signup” as an inline API call, it works—until it doesn’t. One slow provider, one transient timeout, or one deploy at the wrong time and you start dropping side effects (emails, webhooks, audit logs) or, worse, sending duplicates. As a solo creator, the painful part isn’t just the bug—it’s the operational overhead of fixing it repeatedly without a team. Here’s the architectural decision I now default to: add an async outbox (in the same database) before reaching for a full message bus.</p>
<h2 id="heading-1-the-decision-do-you-need-a-message-bus-yet">1) The decision: do you need a message bus yet?</h2>
<p>When you’re building solo, reliability problems show up in the least glamorous places:</p>
<ul>
<li>password reset emails that sometimes don’t arrive</li>
<li>webhook deliveries that randomly fail</li>
<li>“welcome” sequences that send twice</li>
<li>background jobs that disappear during deploys</li>
</ul>
<p>The common root cause is coupling: your request path is doing too much work, and your side effects aren’t transactional with your core write.</p>
<p>The tempting solution is “add Kafka/RabbitMQ/SQS.” But running (or even integrating) a message bus is not free: schema evolution, retries, DLQs, observability, idempotency, consumer deployments, and a new failure domain.</p>
<p>My default for early-stage systems is an <strong>async outbox</strong>: write an “event to send” into the same database transaction as your core change, then have a worker deliver it with retries.</p>
<blockquote>
<p>Key idea: if the business write commits, the side effect is guaranteed to be recorded—even if delivery happens later.</p>
</blockquote>
<hr />
<h2 id="heading-2-context-the-problem-space">2) Context (The Problem Space)</h2>
<h3 id="heading-requirements-amp-constraints">Requirements &amp; constraints</h3>
<p>For a solo system design, I optimize for:</p>
<ul>
<li><strong>Correctness over immediacy</strong>: “email eventually sent” beats “sometimes sent instantly.”</li>
<li><strong>Low operational load</strong>: fewer moving pieces, fewer dashboards.</li>
<li><strong>Cost predictability</strong>: one database and one worker is usually enough.</li>
<li><strong>Deploy safety</strong>: deploys shouldn’t drop side effects.</li>
</ul>
<h3 id="heading-typical-scale-expectations">Typical scale expectations</h3>
<p>This pattern holds comfortably for:</p>
<ul>
<li>tens to hundreds of requests/sec</li>
<li>thousands to millions of outbox rows/day</li>
<li>side effects like email/webhook/analytics events</li>
</ul>
<h3 id="heading-non-functional-requirements">Non-functional requirements</h3>
<ul>
<li><strong>At-least-once delivery</strong> (with idempotency on the consumer/provider side)</li>
<li><strong>Retry with backoff</strong></li>
<li><strong>Observability</strong>: know what’s stuck and why</li>
<li><strong>No phantom sends</strong>: don’t send an email if the user row didn’t commit</li>
</ul>
<h3 id="heading-why-just-do-it-inline-doesnt-fit">Why “just do it inline” doesn’t fit</h3>
<p>Inline side effects fail in subtle ways:</p>
<ul>
<li>you commit the DB write, then the email API times out → user exists, but no email</li>
<li>you send the email, then the DB transaction rolls back → email references a user that “doesn’t exist”</li>
<li>you retry the request, and now you send duplicates</li>
</ul>
<p>The outbox is basically admitting that <strong>distributed systems exist even in a monolith</strong>: your DB plus your third-party APIs already form a distributed system.</p>
<hr />
<h2 id="heading-3-options-considered">3) Options considered</h2>
<p>Below are the common approaches I’ve used/seen, and where they break.</p>
<h3 id="heading-comparison-table">Comparison table</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<th>Option</th><th>What it is</th><th>Pros</th><th>Cons</th><th>Best when</th></tr>
</thead>
<tbody>
<tr>
<td>Inline side effects</td><td>Call email/webhook provider inside request</td><td>Simple, low latency</td><td>Not transactional, timeouts hurt UX, duplicates on retries</td><td>Truly non-critical side effects</td></tr>
<tr>
<td>Background job queue only</td><td>Push job to Redis/queue from request</td><td>Async, faster requests</td><td>Still not transactional; the enqueue can't join the DB transaction</td><td>You can tolerate occasional lost jobs</td></tr>
<tr>
<td>Async outbox (DB)</td><td>Write outbox row in same DB transaction; worker delivers</td><td>Transactional recording, fewer components, great for solo</td><td>Adds polling/worker, needs idempotency + cleanup</td><td>MVPs to mid-scale systems</td></tr>
<tr>
<td>CDC (change data capture)</td><td>Stream DB changes to consumers (Debezium, logical replication)</td><td>Near real-time, scalable</td><td>Operational complexity, schema coupling, infra overhead</td><td>Data platforms, multiple consumers</td></tr>
<tr>
<td>Full message bus</td><td>Kafka/RabbitMQ/SQS + producers/consumers</td><td>High throughput, decoupling, replay</td><td>More infra, more failure modes, more tooling</td><td>Many services/teams, high event volume</td></tr>
</tbody>
</table>
</div><h3 id="heading-option-notes-the-gotchas">Option notes (the “gotchas”)</h3>
<h4 id="heading-inline-side-effects">Inline side effects</h4>
<ul>
<li>Works until your provider latency spikes.</li>
<li>Forces you to choose between slow user experience and unreliable delivery.</li>
</ul>
<h4 id="heading-background-queue-only-redis-etc">Background queue only (Redis etc.)</h4>
<ul>
<li>Better UX, but if enqueue happens after commit and the process crashes in between, you lose the job.</li>
<li>If enqueue happens before commit and the commit fails, you send an email for a transaction that never happened.</li>
</ul>
<h4 id="heading-async-outbox">Async outbox</h4>
<ul>
<li>You trade a bit of implementation complexity for a big jump in correctness.</li>
<li>Your DB becomes both the system of record and the “durable queue.”</li>
</ul>
<h4 id="heading-cdc-or-message-bus">CDC or message bus</h4>
<ul>
<li>Great when multiple consumers need the same events, or you need replay.</li>
<li>Usually too much surface area for a solo codebase early on.</li>
</ul>
<hr />
<h2 id="heading-4-the-decision-what-i-chose">4) The decision (What I chose)</h2>
<p>I choose <strong>Async Outbox in the primary database</strong> as the default for emails/webhooks/audit events.</p>
<h3 id="heading-primary-reasons-ranked">Primary reasons (ranked)</h3>
<ol>
<li><strong>Transactional integrity</strong>: the outbox row is committed with the business write.</li>
<li><strong>Operational simplicity</strong>: no new infra tier (beyond a worker process).</li>
<li><strong>Deploy resilience</strong>: if the worker is down, events accumulate; nothing is lost.</li>
<li><strong>Debuggability</strong>: outbox table is a truth source you can query with SQL.</li>
</ol>
<h3 id="heading-what-i-gave-up">What I gave up</h3>
<ul>
<li><strong>Instant delivery</strong>: outbox is “near real-time,” not truly synchronous.</li>
<li><strong>DB load</strong>: polling adds reads/writes; you must index correctly.</li>
<li><strong>Exactly-once</strong>: you usually get at-least-once; duplicates are handled via idempotency.</li>
</ul>
<h3 id="heading-implementation-overview">Implementation overview</h3>
<h4 id="heading-data-model">Data model</h4>
<p>A minimal outbox table:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> outbox_events (
  <span class="hljs-keyword">id</span>            BIGSERIAL PRIMARY <span class="hljs-keyword">KEY</span>,
  topic         <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  payload       JSONB <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  idempotency_key <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  <span class="hljs-keyword">status</span>        <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-string">'pending'</span>, <span class="hljs-comment">-- pending, processing, sent, failed</span>
  attempts      <span class="hljs-built_in">INT</span>  <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-number">0</span>,
  available_at  TIMESTAMPTZ <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-keyword">now</span>(),
  locked_at     TIMESTAMPTZ,
  lock_owner    <span class="hljs-built_in">TEXT</span>,
  created_at    TIMESTAMPTZ <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-keyword">now</span>(),
  updated_at    TIMESTAMPTZ <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-keyword">now</span>()
);

<span class="hljs-comment">-- Prevent duplicate logical events (e.g., signup welcome email)</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">UNIQUE</span> <span class="hljs-keyword">INDEX</span> outbox_idempotency_uk
  <span class="hljs-keyword">ON</span> outbox_events(topic, idempotency_key);

<span class="hljs-comment">-- Fast fetching of ready work</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> outbox_ready_idx
  <span class="hljs-keyword">ON</span> outbox_events(<span class="hljs-keyword">status</span>, available_at)
  <span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">status</span> <span class="hljs-keyword">IN</span> (<span class="hljs-string">'pending'</span>, <span class="hljs-string">'failed'</span>);
</code></pre>
<p>The important design choice here is <strong>idempotency_key</strong>. This is what keeps “at-least-once delivery” from becoming “user got 3 emails.”</p>
<p>Examples of idempotency keys:</p>
<ul>
<li><code>welcome_email:user_id=123</code></li>
<li><code>webhook:invoice_paid:invoice_id=abc</code></li>
</ul>
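<p>Keys like these can be generated with a tiny helper so that field ordering never changes the result. A sketch (the <code>idempotencyKey</code> name and shape are illustrative, not from any library):</p>

```ts
// Build a stable idempotency key from an event name and entity fields.
// Illustrative helper; the exact format only needs to be deterministic.
function idempotencyKey(
  event: string,
  fields: Record<string, string | number>
): string {
  const parts = Object.keys(fields)
    .sort() // sort keys so field order never changes the key
    .map((k) => `${k}=${fields[k]}`);
  return `${event}:${parts.join(":")}`;
}

// idempotencyKey("welcome_email", { user_id: 123 })
//   -> "welcome_email:user_id=123"
```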
<h4 id="heading-writing-to-the-outbox-in-the-same-transaction">Writing to the outbox in the same transaction</h4>
<p>Pseudocode (Node/TypeScript-ish, but the idea is language-agnostic):</p>
<pre><code class="lang-ts"><span class="hljs-keyword">await</span> db.tx(<span class="hljs-keyword">async</span> (tx) =&gt; {
  <span class="hljs-keyword">const</span> user = <span class="hljs-keyword">await</span> tx.query(
    <span class="hljs-string">`INSERT INTO users(email) VALUES ($1) RETURNING id, email`</span>,
    [email]
  );

  <span class="hljs-keyword">await</span> tx.query(
    <span class="hljs-string">`INSERT INTO outbox_events(topic, payload, idempotency_key)
     VALUES ($1, $2::jsonb, $3)
     ON CONFLICT (topic, idempotency_key) DO NOTHING`</span>,
    [
      <span class="hljs-string">'email.welcome'</span>,
      <span class="hljs-built_in">JSON</span>.stringify({ userId: user.id, email: user.email }),
      <span class="hljs-string">`welcome_email:user_id=<span class="hljs-subst">${user.id}</span>`</span>
    ]
  );
});
</code></pre>
<p>This is the core win: <strong>either both rows commit, or neither does</strong>.</p>
<h4 id="heading-worker-claim-rows-safely-skip-locked">Worker: claim rows safely (skip locked)</h4>
<p>The worker loop should:</p>
<ol>
<li>fetch a small batch of ready events</li>
<li>atomically mark them as processing (lock)</li>
<li>deliver</li>
<li>mark sent or schedule retry</li>
</ol>
<p>In Postgres, a common pattern is <code>FOR UPDATE SKIP LOCKED</code>:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">WITH</span> <span class="hljs-keyword">next</span> <span class="hljs-keyword">AS</span> (
  <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">id</span>
  <span class="hljs-keyword">FROM</span> outbox_events
  <span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">status</span> <span class="hljs-keyword">IN</span> (<span class="hljs-string">'pending'</span>, <span class="hljs-string">'failed'</span>)
    <span class="hljs-keyword">AND</span> available_at &lt;= <span class="hljs-keyword">now</span>()
  <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> created_at
  <span class="hljs-keyword">LIMIT</span> <span class="hljs-number">50</span>
  <span class="hljs-keyword">FOR</span> <span class="hljs-keyword">UPDATE</span> <span class="hljs-keyword">SKIP</span> <span class="hljs-keyword">LOCKED</span>
)
<span class="hljs-keyword">UPDATE</span> outbox_events e
<span class="hljs-keyword">SET</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'processing'</span>,
    locked_at = <span class="hljs-keyword">now</span>(),
    lock_owner = $<span class="hljs-number">1</span>,
    updated_at = <span class="hljs-keyword">now</span>()
<span class="hljs-keyword">FROM</span> <span class="hljs-keyword">next</span>
<span class="hljs-keyword">WHERE</span> e.id = next.id
<span class="hljs-keyword">RETURNING</span> e.id, e.topic, e.payload, e.attempts;
</code></pre>
<p>This lets you run multiple worker instances without double-processing the same row.</p>
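<p>Stitched together, one pass of the worker looks roughly like this. It's a sketch with injected <code>claimBatch</code>/<code>deliver</code>/<code>markSent</code>/<code>markFailed</code> helpers standing in for the SQL above; none of these names come from a real library:</p>

```ts
type OutboxEvent = { id: number; topic: string; payload: unknown; attempts: number };

type WorkerDeps = {
  claimBatch: () => Promise<OutboxEvent[]>;               // the SKIP LOCKED query
  deliver: (e: OutboxEvent) => Promise<void>;             // email/webhook call
  markSent: (id: number) => Promise<void>;
  markFailed: (e: OutboxEvent, err: Error) => Promise<void>; // backoff update
};

// One pass of the worker: claim, deliver, record the outcome per event.
// Returns counts so a scheduler or metrics layer can observe progress.
async function processOnce(deps: WorkerDeps): Promise<{ sent: number; failed: number }> {
  const batch = await deps.claimBatch();
  let sent = 0;
  let failed = 0;
  for (const event of batch) {
    try {
      await deps.deliver(event);
      await deps.markSent(event.id);
      sent++;
    } catch (err) {
      // A failure for one event must never stop the rest of the batch.
      await deps.markFailed(event, err as Error);
      failed++;
    }
  }
  return { sent, failed };
}
```

<p>The loop body is deliberately dumb: all the interesting policy (locking, backoff, terminal states) lives in the SQL, so the worker process stays trivially restartable.</p>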
<h4 id="heading-retry-policy-with-backoff">Retry policy with backoff</h4>
<p>I usually start with exponential backoff capped at a minute.</p>
<pre><code class="lang-ts"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">nextBackoffSeconds</span>(<span class="hljs-params">attempt: <span class="hljs-built_in">number</span></span>): <span class="hljs-title">number</span> </span>{
  <span class="hljs-comment">// 1, 2, 4, 8, 16, 32, 60, 60...</span>
  <span class="hljs-keyword">return</span> <span class="hljs-built_in">Math</span>.min(<span class="hljs-number">60</span>, <span class="hljs-number">2</span> ** <span class="hljs-built_in">Math</span>.max(<span class="hljs-number">0</span>, attempt));
}

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">markFailed</span>(<span class="hljs-params">id: <span class="hljs-built_in">number</span>, attempts: <span class="hljs-built_in">number</span>, err: <span class="hljs-built_in">Error</span></span>) </span>{
  <span class="hljs-keyword">const</span> delay = nextBackoffSeconds(attempts);
  <span class="hljs-keyword">await</span> db.query(
    <span class="hljs-string">`UPDATE outbox_events
     SET status='failed',
         attempts = attempts + 1,
         available_at = now() + ($2 || ' seconds')::interval,
         updated_at = now(),
         payload = jsonb_set(payload, '{last_error}', to_jsonb($3::text), true)
     WHERE id = $1`</span>,
    [id, delay, err.message]
  );
}
</code></pre>
<p>A few deliberate choices:</p>
<ul>
<li>store <code>last_error</code> to make SQL-based debugging possible</li>
<li>cap backoff to avoid “retry storms”</li>
<li>keep it simple; you can add jitter later</li>
</ul>
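<p>When you do add jitter, the change is small. A sketch (the <code>rand</code> parameter is injected only so the function stays deterministic in tests):</p>

```ts
// Exponential backoff capped at 60s, with jitter: pick a uniform delay
// between 1s and the capped exponential value, so many events that
// failed at the same time don't all retry at the same instant.
function nextBackoffWithJitter(
  attempt: number,
  rand: () => number = Math.random
): number {
  const capped = Math.min(60, 2 ** Math.max(0, attempt));
  return 1 + Math.floor(rand() * capped);
}
```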
<hr />
<h2 id="heading-5-results-amp-learnings">5) Results &amp; learnings</h2>
<p>Because this is a general pattern (not tied to one product), I’ll share the kinds of numbers I’ve repeatedly observed in production-ish solo systems.</p>
<h3 id="heading-performance-impact">Performance impact</h3>
<ul>
<li><strong>Request latency</strong>: moving side effects out of the request typically drops p95 latency by <strong>200–1500ms</strong> (depending on provider latency). The DB write for an outbox row is usually <strong>single-digit milliseconds</strong> when indexed.</li>
<li><strong>Throughput</strong>: a single worker polling every 250–1000ms and processing batches of 50 can comfortably handle <strong>hundreds to low thousands of events/min</strong> on modest hardware, assuming the downstream provider isn’t the bottleneck.</li>
<li><strong>DB load</strong>: the outbox table can become write-heavy. With proper partial indexes and batch updates, I typically see outbox overhead remain <strong>&lt;5–10%</strong> of total DB CPU for small-to-mid workloads.</li>
</ul>
<h3 id="heading-what-worked-well">What worked well</h3>
<ul>
<li>Debugging becomes SQL-native: “show me pending emails older than 10 minutes” is a query.</li>
<li>Deploys are less scary: if the worker is down for 10 minutes, you process the backlog.</li>
</ul>
<h3 id="heading-unexpected-challenges">Unexpected challenges</h3>
<ul>
<li><strong>Idempotency is non-negotiable</strong>: you will send duplicates eventually (timeouts, provider ambiguity). Design for it.</li>
<li><strong>Poison messages</strong>: some payloads will fail forever (bad email, invalid webhook URL). You need a terminal state and alerting.</li>
<li><strong>Table growth</strong>: you must archive or delete sent events.</li>
</ul>
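<p>For table growth, the fix is a periodic purge of terminal rows, i.e. the equivalent of <code>DELETE ... WHERE status IN ('sent','dead') AND updated_at &lt; now() - interval</code>. The retention check itself is simple enough to sketch in isolation (hypothetical shapes, 30-day retention as an example):</p>

```ts
type OutboxRow = { id: number; status: string; updatedAt: Date };

// Pick rows that are safe to delete: terminal status, older than retention.
// Rows still pending or retrying are never touched.
function purgeCandidates(rows: OutboxRow[], retentionDays: number, now: Date): number[] {
  const cutoff = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  return rows
    .filter((r) => (r.status === "sent" || r.status === "dead") && r.updatedAt.getTime() < cutoff)
    .map((r) => r.id);
}
```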
<h3 id="heading-what-id-do-differently-next-time">What I’d do differently next time</h3>
<ul>
<li>Add a <code>dead</code> status after N attempts and a lightweight admin view early.</li>
<li>Add basic metrics (counts by status, oldest pending age) before problems happen.</li>
</ul>
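<p>The <code>dead</code> status itself is a one-line decision in the retry path. A sketch (the cap of 8 attempts is an arbitrary illustration):</p>

```ts
// After a delivery failure, decide whether to retry later or give up.
// Events marked "dead" stop being polled and show up in alerting instead.
function statusAfterFailure(attempts: number, maxAttempts = 8): "failed" | "dead" {
  // attempts is the count *before* this failure, matching the markFailed
  // code above, which increments the column as part of the same UPDATE.
  return attempts + 1 >= maxAttempts ? "dead" : "failed";
}
```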
<hr />
<h2 id="heading-6-when-this-doesnt-work">6) When this doesn’t work</h2>
<p>The async outbox is not a universal answer.</p>
<p>Choose something else when:</p>
<ul>
<li><strong>You need fan-out to many consumers</strong> with different replay needs. Outbox can do it, but it becomes awkward; a proper bus or CDC can be cleaner.</li>
<li><strong>Event volume is very high</strong> (e.g., tens of thousands/sec). Polling a relational DB becomes expensive; you’ll want streaming infrastructure.</li>
<li><strong>You require strict ordering across partitions</strong> (e.g., per-customer ordering at scale). You can approximate ordering, but it gets complex.</li>
<li><strong>Your primary DB is already the bottleneck</strong>. Turning it into a queue adds load; offloading to a dedicated queue might be healthier.</li>
<li><strong>You can’t tolerate duplicates at all</strong> and downstream isn’t idempotent. You can get closer with provider-side idempotency keys, transactional inbox patterns, or exactly-once semantics in specific systems—but complexity rises quickly.</li>
</ul>
<hr />
<h2 id="heading-7-key-takeaways">7) Key takeaways</h2>
<ul>
<li>Treat third-party APIs as unreliable dependencies; design side effects as <strong>async and retryable</strong>.</li>
<li>If you only adopt one reliability pattern early: <strong>write an outbox row in the same DB transaction</strong> as your business change.</li>
<li>Optimize for <strong>operational simplicity first</strong>; a worker + Postgres is often enough for a long time.</li>
<li>Assume <strong>at-least-once delivery</strong> and make events idempotent with explicit keys.</li>
<li>Plan for lifecycle: retries, dead-lettering (even if it’s just a <code>dead</code> status), and cleanup/archival.</li>
</ul>
<hr />
<h2 id="heading-8-closing">8) Closing</h2>
<p>If you’re building solo, the async outbox is one of those “boring” patterns that quietly saves weeks of debugging later.</p>
<p>What’s your default for side effects in early-stage systems—inline calls, a queue, an outbox, or straight to a message bus? And what failure pushed you there?</p>
]]></content:encoded></item><item><title><![CDATA[Why I Use pg_trgm Fuzzy Search Instead of Full-Text Search]]></title><description><![CDATA[Search is where job boards quietly fail. Users don’t type perfect keywords, job titles aren’t standardized across sources, and “PMHNP” gets spelled five different ways. In my PMHNP Hiring job board (7,556+ jobs, 1,368+ companies, ~200 daily updates),...]]></description><link>https://blog.dvskr.dev/why-i-use-pgtrgm-fuzzy-search-instead-of-full-text-search</link><guid isPermaLink="true">https://blog.dvskr.dev/why-i-use-pgtrgm-fuzzy-search-instead-of-full-text-search</guid><category><![CDATA[architecture]]></category><category><![CDATA[buildinpublic]]></category><category><![CDATA[database]]></category><category><![CDATA[systemdesign]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Tue, 27 Jan 2026 16:00:14 GMT</pubDate><content:encoded><![CDATA[<p>Search is where job boards quietly fail. Users don’t type perfect keywords, job titles aren’t standardized across sources, and “PMHNP” gets spelled five different ways. In my PMHNP Hiring job board (7,556+ jobs, 1,368+ companies, ~200 daily updates), I had to pick a search strategy that was fast (&lt;50ms p95 for common queries), typo-tolerant, and cheap to operate as a solo creator. This post is the architecture decision I made: choosing PostgreSQL pg_trgm similarity search over classic full‑text search—and what I gave up to get there.</p>
<h2 id="heading-context-search-in-a-scraped-job-board-is-messy-by-default">Context: search in a scraped job board is messy by default</h2>
<p>PMHNP Hiring is a job board I built for Psychiatric Mental Health Nurse Practitioners. The “product” looks simple—filter by location, remote, company, posted date, and search by title/company. The system behind it is less clean because the data comes from 10+ sources with different formatting and inconsistent fields.</p>
<p>A few constraints shaped the search decision:</p>
<ul>
<li><strong>Scale &amp; churn</strong>: 7,556+ active jobs, 1,368+ companies, and ~200 daily updates (incremental ingestion, not full refresh).</li>
<li><strong>UX reality</strong>: users type partial queries (“psych”, “nurse prac”), acronyms (“PMHNP”), and typos (“psychiatric nurse practioner”).</li>
<li><strong>Performance target</strong>: keep the common feed queries (filters + search) at <strong>~50ms p95</strong> at the database layer.</li>
<li><strong>Operational simplicity</strong>: I’m a solo creator. I didn’t want a separate search cluster to babysit.</li>
<li><strong>Security model</strong>: Supabase + PostgreSQL with RLS. I wanted search to stay inside Postgres so it inherits the same access control semantics.</li>
</ul>
<p>This is where the first non-obvious problem appears: classic full-text search (FTS) is great for “documents”, but job titles and company names behave more like <strong>short strings</strong> where <strong>typo tolerance</strong> and <strong>partial matches</strong> dominate.</p>
<blockquote>
<p>In a job board, search is less about linguistic relevance and more about forgiving messy input.</p>
</blockquote>
<h2 id="heading-the-problem-space-what-i-actually-needed-from-search">The problem space: what I actually needed from search</h2>
<p>I wasn’t building Google. I needed a search that:</p>
<ol>
<li><strong>Works well on short fields</strong>: <code>title</code>, <code>company_name</code>, and sometimes <code>location_text</code>.</li>
<li><strong>Supports “contains-like” behavior</strong>: users often remember only part of a title.</li>
<li><strong>Handles typos</strong>: similarity, not exact token match.</li>
<li><strong>Composes with filters</strong>: search + (state, remote, posted_at, source) should stay fast.</li>
<li><strong>Plays nicely with ingestion</strong>: updates come daily; the index must handle churn.</li>
</ol>
<p>I also didn’t want relevance tuning to turn into a second product. If I had to spend days tweaking <code>ts_rank</code> weights, that’s a smell.</p>
<h2 id="heading-options-considered">Options considered</h2>
<p>Below are the realistic choices I evaluated for Postgres/Supabase.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<th>Option</th><th>What it is</th><th>Pros</th><th>Cons</th><th>Best when</th></tr>
</thead>
<tbody>
<tr>
<td><code>ILIKE '%query%'</code> + B-tree</td><td>Naive substring match</td><td>Simple, no extensions</td><td>Slow on large tables; can’t use B-tree with leading wildcard</td><td>Tiny datasets or admin tools</td></tr>
<tr>
<td>PostgreSQL Full-Text Search (<code>tsvector</code>)</td><td>Token-based search using dictionaries</td><td>Good for long text; ranking; language support</td><td>Weak on typos/partial strings; tuning needed; titles are short</td><td>Articles, descriptions, “document” search</td></tr>
<tr>
<td><code>pg_trgm</code> (trigram similarity)</td><td>String similarity via overlapping 3-char chunks</td><td>Typo-tolerant; fast with GIN/GiST; great for short fields</td><td>Not semantic; can match weirdly; needs threshold tuning</td><td>Names, titles, short text, “forgiving” search</td></tr>
<tr>
<td>External search (Meilisearch/Typesense/Elastic)</td><td>Dedicated search engine</td><td>Great relevance; typo handling; faceting</td><td>Extra infra; sync complexity; more moving parts; cost</td><td>High scale, complex ranking, multi-field relevance</td></tr>
</tbody>
</table>
</div><h3 id="heading-why-i-didnt-stick-with-ilike">Why I didn’t stick with <code>ILIKE</code></h3>
<p><code>ILIKE</code> feels tempting early on, especially when you’re “vibe coding” fast. But it collapses once you hit a few thousand rows and mix it with filters.</p>
<p><code>ILIKE '%pmhnp%'</code> forces a scan unless you add specialized indexing. On 7k jobs it might still feel okay—until you add multi-tenant rules, joins to companies, and a few concurrent users.</p>
<h3 id="heading-why-full-text-search-wasnt-the-right-default">Why full-text search wasn’t the right default</h3>
<p>FTS shines when you search bodies of text. But for job boards, most searches are:</p>
<ul>
<li><code>"pmhnp"</code></li>
<li><code>"psychiatric"</code></li>
<li><code>"remote"</code></li>
<li><code>"headway"</code> (company)</li>
</ul>
<p>FTS tokenization can hurt you here:</p>
<ul>
<li>Typos don’t match.</li>
<li>Partial tokens don’t match unless you add prefix operators and accept recall/precision trade-offs.</li>
<li>Acronyms and short tokens can behave weirdly depending on dictionaries.</li>
</ul>
<h3 id="heading-why-i-didnt-jump-to-an-external-search-engine">Why I didn’t jump to an external search engine</h3>
<p>I love dedicated search engines—but operating them is a commitment:</p>
<ul>
<li>You need a <strong>sync pipeline</strong> (DB → search index) that is correct under retries.</li>
<li>You now have <strong>two sources of truth</strong> for availability.</li>
<li>You have to decide how search respects RLS / access control.</li>
</ul>
<p>For PMHNP Hiring, the cost and complexity weren’t justified. Postgres could do “good enough” search with less risk.</p>
<h2 id="heading-the-decision-pgtrgm-a-search-vector-column-i-can-index">The decision: pg_trgm + a search vector column I can index</h2>
<p>I chose <strong>PostgreSQL’s <code>pg_trgm</code> extension</strong> and built search around a single normalized field (title + company + location) that I could index with <strong>GIN</strong>.</p>
<p>Primary reasons (ranked):</p>
<ol>
<li><strong>Typo tolerance on short fields</strong> without building relevance infrastructure.</li>
<li><strong>Composable performance</strong> with filters (state, remote, posted_at).</li>
<li><strong>Operational simplicity</strong>: no extra services; works inside Supabase.</li>
<li><strong>Predictable indexing story</strong>: GIN trigram indexes are battle-tested.</li>
</ol>
<p>What I gave up:</p>
<ul>
<li>No semantic relevance (synonyms, intent, “psych NP” == “PMHNP”).</li>
<li>Similarity search can return “surprising” matches unless you tune thresholds.</li>
<li>Ranking is simpler; you’re not doing sophisticated scoring.</li>
</ul>
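<p>To make the trade-off concrete, here is roughly what trigram similarity computes. This is a simplified TypeScript re-implementation for intuition only; pg_trgm's exact normalization and padding rules differ in the details:</p>

```ts
// Extract 3-character chunks from a padded, lowercased string, roughly
// the way pg_trgm does (two leading spaces, one trailing space per word).
function trigrams(s: string): Set<string> {
  const out = new Set<string>();
  for (const word of s.toLowerCase().split(/\s+/).filter(Boolean)) {
    const padded = `  ${word} `;
    for (let i = 0; i + 3 <= padded.length; i++) out.add(padded.slice(i, i + 3));
  }
  return out;
}

// Similarity = shared trigrams / total distinct trigrams (Jaccard index).
function similarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  let shared = 0;
  for (const t of ta) if (tb.has(t)) shared++;
  return shared / (ta.size + tb.size - shared);
}
```

<p>This is why a transposition typo like “pmnhp” still scores above zero against “pmhnp”: the strings share several trigrams even though no token matches exactly, which is precisely what FTS tokenization throws away.</p>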
<h3 id="heading-implementation-overview">Implementation overview</h3>
<p>I model jobs and companies relationally, but for search I avoid doing multiple similarity checks across joins at query time. Instead, I denormalize a <code>search_text</code> field on <code>jobs</code>.</p>
<h4 id="heading-1-enable-pgtrgm-and-add-an-indexed-field">1) Enable <code>pg_trgm</code> and add an indexed field</h4>
<pre><code class="lang-sql"><span class="hljs-comment">-- One-time</span>
<span class="hljs-keyword">create</span> extension <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> pg_trgm;

<span class="hljs-comment">-- Add a denormalized search field</span>
<span class="hljs-keyword">alter</span> <span class="hljs-keyword">table</span> jobs <span class="hljs-keyword">add</span> <span class="hljs-keyword">column</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> search_text <span class="hljs-built_in">text</span>;

<span class="hljs-comment">-- Keep it simple: lowercased, whitespace-normalized text</span>
<span class="hljs-comment">-- (I populate this during ingestion / upserts)</span>

<span class="hljs-comment">-- GIN index for fast trigram search</span>
<span class="hljs-keyword">create</span> <span class="hljs-keyword">index</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> jobs_search_text_trgm
<span class="hljs-keyword">on</span> jobs <span class="hljs-keyword">using</span> gin (search_text gin_trgm_ops);
</code></pre>
<p>Why a denormalized <code>search_text</code>?</p>
<ul>
<li>Similarity across multiple columns (<code>title</code>, <code>company_name</code>) can prevent index use or force multiple index scans.</li>
<li>Joining companies for every search adds overhead; my feed queries already join for display.</li>
<li>With 200 daily updates, recomputing <code>search_text</code> is cheap and keeps reads fast.</li>
</ul>
<h4 id="heading-2-populate-searchtext-during-upsert-pipeline-friendly">2) Populate <code>search_text</code> during upsert (pipeline-friendly)</h4>
<p>My ingestion pipeline is: Cron (Vercel) → scraper → dedupe → upsert into Postgres. During the upsert, I compute <code>search_text</code>.</p>
<pre><code class="lang-ts"><span class="hljs-comment">// pseudo-code inside the ingestion worker</span>
<span class="hljs-keyword">const</span> normalize = <span class="hljs-function">(<span class="hljs-params">s: <span class="hljs-built_in">string</span></span>) =&gt;</span>
  s
    .toLowerCase()
    .replace(<span class="hljs-regexp">/[^a-z0-9\s]/g</span>, <span class="hljs-string">" "</span>)
    .replace(<span class="hljs-regexp">/\s+/g</span>, <span class="hljs-string">" "</span>)
    .trim();

<span class="hljs-keyword">const</span> searchText = normalize([
  job.title,
  job.companyName,
  job.locationText,
  job.remote ? <span class="hljs-string">"remote"</span> : <span class="hljs-string">""</span>,
].filter(<span class="hljs-built_in">Boolean</span>).join(<span class="hljs-string">" "</span>));

<span class="hljs-keyword">await</span> db.from(<span class="hljs-string">"jobs"</span>).upsert({
  id: job.id,
  title: job.title,
  company_id: job.companyId,
  location_text: job.locationText,
  remote: job.remote,
  posted_at: job.postedAt,
  search_text: searchText,
  source: job.source,
});
</code></pre>
<p>This is a deliberate trade: write-time work for read-time speed.</p>
<h4 id="heading-3-query-pattern-similarity-filters-stable-pagination">3) Query pattern: similarity + filters + stable pagination</h4>
<p>I treat search as “filtering” rather than a separate endpoint. Most users search while also filtering by state/remote.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Parameterized query idea</span>
<span class="hljs-comment">-- :q is the normalized query string</span>
<span class="hljs-comment">-- the % operator compares against the session similarity threshold</span>
<span class="hljs-comment">-- (set via set_limit / pg_trgm.similarity_threshold, e.g., 0.2 to 0.35)</span>

<span class="hljs-keyword">select</span>
  j.id, j.title, j.posted_at, j.remote,
  c.name <span class="hljs-keyword">as</span> company_name
<span class="hljs-keyword">from</span> jobs j
<span class="hljs-keyword">join</span> companies c <span class="hljs-keyword">on</span> c.id = j.company_id
<span class="hljs-keyword">where</span>
  (:state <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">or</span> j.state = :state)
  <span class="hljs-keyword">and</span> (:remote <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">or</span> j.remote = :remote)
  <span class="hljs-keyword">and</span> (:q = <span class="hljs-string">''</span> <span class="hljs-keyword">or</span> j.search_text % :q)
<span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span>
  <span class="hljs-keyword">case</span> <span class="hljs-keyword">when</span> :q = <span class="hljs-string">''</span> <span class="hljs-keyword">then</span> <span class="hljs-number">0</span> <span class="hljs-keyword">else</span> similarity(j.search_text, :q) <span class="hljs-keyword">end</span> <span class="hljs-keyword">desc</span>,
  j.posted_at <span class="hljs-keyword">desc</span>,
  j.id <span class="hljs-keyword">desc</span>
<span class="hljs-keyword">limit</span> :<span class="hljs-keyword">limit</span>;
</code></pre>
<p>Notes:</p>
<ul>
<li>The <code>%</code> operator is trigram “similarity match” (uses the trigram index).</li>
<li><code>similarity()</code> is used only for ordering when a query exists.</li>
<li>The secondary ordering by <code>posted_at, id</code> keeps results stable.</li>
</ul>
<h4 id="heading-4-tuning-similarity-threshold-without-guesswork">4) Tuning similarity threshold without guesswork</h4>
<p>The biggest footgun with <code>pg_trgm</code> is threshold tuning. Too low: irrelevant matches. Too high: you miss typos.</p>
<p>In Postgres you can set it per session:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Example: bump threshold for stricter matches</span>
<span class="hljs-keyword">select</span> set_limit(<span class="hljs-number">0.28</span>);

<span class="hljs-comment">-- Then run the search</span>
<span class="hljs-keyword">select</span> <span class="hljs-keyword">id</span>, title
<span class="hljs-keyword">from</span> jobs
<span class="hljs-keyword">where</span> search_text % <span class="hljs-string">'pmhnp remote'</span>;
</code></pre>
<p>In practice, I ended up using a slightly lower threshold for shorter queries and a higher one for longer queries (because long queries naturally have more trigrams).</p>
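<p>That length-based switch doesn’t need to live in SQL. A minimal application-side sketch, where the cutoffs and threshold values are illustrative rather than my exact production numbers:</p>

```typescript
// Pick a pg_trgm similarity threshold from query length.
// Cutoffs and values are illustrative; tune them against real queries.
function pickSimilarityThreshold(query: string): number {
  const len = query.trim().length;
  if (len <= 6) return 0.2;   // short queries produce few trigrams: be lenient
  if (len <= 15) return 0.28; // medium queries
  return 0.35;                // long queries have many trigrams: be strict
}
```

<p>The chosen value is then applied per session with <code>set_limit()</code> before the search query runs.</p>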
<h2 id="heading-results-amp-learnings-with-real-numbers">Results &amp; learnings (with real numbers)</h2>
<p>After shipping pg_trgm search + the right supporting indexes (composite indexes for filters, plus connection pooling with PgBouncer), the database layer for the most common “feed + search” requests stabilized around:</p>
<ul>
<li><strong>~50ms p95 query time</strong> for typical filtered listing queries (state/remote + optional search).</li>
<li>Search remained fast even with daily churn (~200 updates/day) because GIN index maintenance overhead at this scale is manageable.</li>
</ul>
<p>What worked well:</p>
<ul>
<li><strong>Typos stopped mattering</strong> for the most common cases (company names, “psychiatric”, “practitioner”).</li>
<li>I didn’t need to invent a ranking model. Similarity + recency was “good enough”.</li>
<li>Keeping search in Postgres meant fewer moving parts and fewer failure modes.</li>
</ul>
<p>Unexpected challenges:</p>
<ul>
<li>Certain short queries (like <code>"np"</code>) matched too broadly. The fix wasn’t more indexing—it was <strong>product constraints</strong> (minimum query length, or requiring at least one non-trivial token).</li>
<li>Similarity ordering can be noisy when many rows are “close enough”. Recency as a tie-breaker helped.</li>
</ul>
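<p>The “product constraint” fix for overly broad short queries is a few lines of application code. A sketch, where the stop-token list and length cutoff are illustrative choices:</p>

```typescript
// Reject queries that would match too broadly before they hit pg_trgm.
// STOP_TOKENS and the cutoff of 3 are illustrative product choices.
const STOP_TOKENS = new Set(["np", "rn", "job", "jobs"]);

function isSearchableQuery(raw: string): boolean {
  const q = raw.trim().toLowerCase();
  if (q.length < 3) return false; // too few trigrams to rank meaningfully
  // Require at least one non-trivial token.
  return q.split(/\s+/).some((t) => t.length >= 3 && !STOP_TOKENS.has(t));
}
```

<p>Queries that fail the guard fall back to the plain filtered feed instead of a noisy similarity match.</p>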
<p>What I’d do differently:</p>
<ul>
<li>Add a lightweight synonym layer (application-side) for domain terms (e.g., map “psych np” → “pmhnp psychiatric”). This is cheaper than building semantic search and improves relevance a lot.</li>
</ul>
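<p>The synonym layer I have in mind is deliberately dumb: a hand-maintained map applied before the query reaches Postgres. A sketch (the mapping entries are examples, not a real list):</p>

```typescript
// Expand domain synonyms application-side before querying Postgres.
// Entries are examples; real terms containing regex metacharacters need escaping.
const SYNONYMS: Record<string, string> = {
  "psych np": "pmhnp psychiatric",
  "don": "director of nursing",
};

function expandQuery(raw: string): string {
  let q = raw.trim().toLowerCase();
  for (const [term, expansion] of Object.entries(SYNONYMS)) {
    // Word boundaries keep "don" from matching inside "london".
    q = q.replace(new RegExp(`\\b${term}\\b`, "g"), expansion);
  }
  return q;
}
```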
<h2 id="heading-when-this-doesnt-work">When this doesn’t work</h2>
<p>pg_trgm isn’t a universal answer. I’d pick something else if:</p>
<ul>
<li><strong>You need semantic relevance</strong>: synonyms, intent understanding, “director of nursing” matching “DON”, etc. That’s where a dedicated search engine or embeddings start to win.</li>
<li><strong>You’re searching long bodies of text</strong> (job descriptions). FTS (or hybrid FTS + trgm) is often better.</li>
<li><strong>Your dataset is massive and high-churn</strong> (hundreds of millions of rows). GIN index size and maintenance can become expensive.</li>
<li><strong>You need advanced faceting + ranking</strong> beyond what SQL can comfortably express.</li>
</ul>
<p>A pragmatic hybrid that I’d consider later:</p>
<ul>
<li>FTS for descriptions (token relevance)</li>
<li>pg_trgm for titles/company (typo tolerance)</li>
<li>Merge/rank results in SQL or application layer</li>
</ul>
<p>But I wouldn’t start there unless search quality is the core differentiator.</p>
<h2 id="heading-key-takeaways">Key takeaways</h2>
<ol>
<li><strong>Match the search tool to the shape of your data</strong>: short strings behave differently than documents.</li>
<li><strong>Optimize for the operational budget you actually have</strong>: one Postgres instance you understand beats two systems you barely monitor.</li>
<li><strong>Denormalize intentionally</strong> when it removes joins from your hot path; pay the cost at ingestion time.</li>
<li><strong>Thresholds are product decisions</strong>: minimum query length and similarity limits are UX levers, not just database knobs.</li>
<li><strong>Use recency as a stabilizer</strong>: in job boards, “newer” is often a better tie-breaker than perfect relevance.</li>
</ol>
<h2 id="heading-closing">Closing</h2>
<p>If you’ve built search for a marketplace or job board: did you stick with Postgres (FTS/trgm) or graduate to a dedicated search engine? I’m especially curious where your tipping point was—data size, relevance requirements, or team/ops maturity.</p>
]]></content:encoded></item><item><title><![CDATA[Why I Built a Durable Offline Queue for AI Calls in React Native]]></title><description><![CDATA[AI features are easy to demo on perfect Wi‑Fi and painfully fragile in the real world. In my fitness app project (React Native + Expo + SQLite), users can log a set in ~5 seconds and optionally get AI help (workout suggestions, explanations, quick ad...]]></description><link>https://blog.dvskr.dev/why-i-built-a-durable-offline-queue-for-ai-calls-in-react-native</link><guid isPermaLink="true">https://blog.dvskr.dev/why-i-built-a-durable-offline-queue-for-ai-calls-in-react-native</guid><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Sat, 24 Jan 2026 16:00:39 GMT</pubDate><content:encoded><![CDATA[<p>AI features are easy to demo on perfect Wi‑Fi and painfully fragile in the real world. In my fitness app project (React Native + Expo + SQLite), users can log a set in ~5 seconds and optionally get AI help (workout suggestions, explanations, quick adjustments). The architectural decision that mattered most wasn’t the prompt design—it was whether AI calls should be “best-effort” or “durable”. I chose a durable, persisted offline queue for OpenAI requests so the UX stays responsive, battery-friendly, and predictable even when the device is offline or rate-limited.</p>
<h2 id="heading-context-the-problem-space-and-why-its-subtle">Context: the problem space (and why it’s subtle)</h2>
<p>I’m building a mobile workout tracker where the core loop is fast: open app → log set → move on. The app is offline-first: SQLite is the primary store, and sync is “cloud-backup”, not “cloud-source-of-truth”. Scale is small today (10 waitlist signups, ~400 exercises), but the constraints are real:</p>
<ul>
<li><strong>Sub-100ms UI interactions</strong> for logging (anything slower feels like friction mid-set)</li>
<li><strong>Offline and spotty network</strong> are normal (basements, gyms with bad reception)</li>
<li><strong>Battery and data usage matter</strong> (background retry loops can be expensive)</li>
<li><strong>AI calls are non-critical</strong> (logging must work without them)</li>
<li><strong>OpenAI limits and latency are unpredictable</strong> (429s, timeouts, slow responses)</li>
</ul>
<p>The naive approach is: “Call the API when the user taps, show a spinner, retry on failure.” That’s fine for a chat app. For a workout logger, it’s a UX regression: it blocks the user on something that isn’t essential.</p>
<p>So the decision: <strong>Should AI requests be synchronous and UI-coupled, or should they be durable tasks that can be executed later?</strong></p>
<blockquote>
<p>Key insight: In offline-first apps, anything that touches the network should be treated like a background job—especially if it’s optional.</p>
</blockquote>
<h2 id="heading-options-considered">Options considered</h2>
<p>I considered four patterns for integrating AI calls without degrading the core logging experience.</p>
<h3 id="heading-comparison-table">Comparison table</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Option</td><td>What it is</td><td>Pros</td><td>Cons</td><td>Best when</td></tr>
</thead>
<tbody>
<tr>
<td>A) Synchronous call in UI flow</td><td>Call OpenAI on button tap, await result</td><td>Simple mental model; fewer moving parts</td><td>UI stalls; brittle offline; retries drain battery; hard to rate-limit</td><td>AI is core feature and latency is acceptable</td></tr>
<tr>
<td>B) Fire-and-forget in memory</td><td>Trigger request, don’t await; store result in state when it returns</td><td>UI stays fast; minimal code</td><td>If app is killed, request is lost; no backoff; duplicates likely</td><td>AI is “nice to have” and losing responses is OK</td></tr>
<tr>
<td>C) Durable local queue (SQLite)</td><td>Persist tasks; worker processes when online; backoff + rate limits</td><td>Survives restarts; controllable retries; good offline UX; measurable</td><td>More code; need idempotency + dedupe; needs observability</td><td>Offline-first apps with optional network features</td></tr>
<tr>
<td>D) Server-side job queue</td><td>Send intent to backend; backend calls OpenAI and pushes result</td><td>Centralized control; better secrets management; easier analytics</td><td>Requires backend; still needs device-side offline handling; more cost/ops</td><td>You already run a backend and need shared results</td></tr>
</tbody>
</table>
</div><h3 id="heading-why-i-didnt-choose-a-or-b">Why I didn’t choose A or B</h3>
<ul>
<li><strong>A (synchronous)</strong> made the UI hostage to network conditions. Even if I didn’t block the whole screen, it introduced “pending” states everywhere and created edge cases (user logs next set while previous AI call is still inflight).</li>
<li><strong>B (in-memory)</strong> sounded attractive until I simulated real behavior: mobile OS kills the app, users background it, network flips, and you end up with lost work or duplicates.</li>
</ul>
<h3 id="heading-why-i-didnt-choose-d-server-side">Why I didn’t choose D (server-side)</h3>
<p>Longer term, a backend queue is compelling. But right now the app is offline-first and early-stage. Adding a backend just to make AI reliable felt like premature complexity. Also, I’d still need a device-side outbox because requests originate offline.</p>
<p>That led to <strong>C: a durable local queue</strong>.</p>
<h2 id="heading-the-decision-a-persisted-offline-queue-sqlite-outbox">The decision: a persisted offline queue (SQLite outbox)</h2>
<p>I implemented an <strong>Outbox pattern</strong> for AI requests:</p>
<ul>
<li>Every AI intent becomes a row in <code>ai_jobs</code> in SQLite.</li>
<li>UI writes a job and immediately returns (optimistic UX).</li>
<li>A background worker processes jobs when:<ul>
<li>device is online</li>
<li>rate limit allows</li>
<li>app is in foreground (initially; background execution is a later enhancement)</li>
</ul>
</li>
<li>Results are stored back into SQLite and projected into UI state.</li>
</ul>
<h3 id="heading-architecture-diagram-mermaid">Architecture diagram (Mermaid)</h3>
<pre><code class="lang-mermaid">flowchart LR
  UI[React Native UI]
  Z[Zustand Store]
  DB[(SQLite)]
  Q[ai_jobs Outbox]
  W[Queue Worker]
  NET[Network Check]
  RL[Rate Limiter]
  OAI[OpenAI API]
  RES[ai_results]

  UI --&gt; Z
  Z --&gt; DB
  UI --&gt;|enqueue intent| Q
  Q --&gt; DB
  W --&gt; NET
  W --&gt; RL
  W --&gt;|claim job| Q
  W --&gt;|call| OAI
  OAI --&gt;|response| W
  W --&gt; RES
  RES --&gt; DB
  DB --&gt; UI
</code></pre>
<h3 id="heading-data-model-jobs-need-to-be-idempotent">Data model: jobs need to be idempotent</h3>
<p>The main thing I learned from data engineering is: distributed systems fail in boring ways. Mobile is a distributed system with a very unreliable worker (the phone).</p>
<p>Each job needs:</p>
<ul>
<li>a stable <strong>idempotency key</strong> (so retries don’t duplicate effects)</li>
<li>status transitions that are safe across crashes</li>
<li>metadata for backoff and debugging</li>
</ul>
<p>A minimal schema:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- SQLite</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> ai_jobs (
  <span class="hljs-keyword">id</span> <span class="hljs-built_in">TEXT</span> PRIMARY <span class="hljs-keyword">KEY</span>,
  <span class="hljs-keyword">type</span> <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  payload_json <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  <span class="hljs-keyword">status</span> <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>, <span class="hljs-comment">-- queued | running | done | failed</span>
  attempts <span class="hljs-built_in">INTEGER</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-number">0</span>,
  run_after_ms <span class="hljs-built_in">INTEGER</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-number">0</span>,
  locked_until_ms <span class="hljs-built_in">INTEGER</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-number">0</span>,
  last_error <span class="hljs-built_in">TEXT</span>,
  created_at_ms <span class="hljs-built_in">INTEGER</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  updated_at_ms <span class="hljs-built_in">INTEGER</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>
);

<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> idx_ai_jobs_status_run_after
<span class="hljs-keyword">ON</span> ai_jobs(<span class="hljs-keyword">status</span>, run_after_ms);
</code></pre>
<h3 id="heading-enqueue-from-ui-fast-path">Enqueue from UI (fast path)</h3>
<p>The UI path must be cheap: one insert, no network.</p>
<pre><code class="lang-ts"><span class="hljs-keyword">import</span> { nanoid } <span class="hljs-keyword">from</span> <span class="hljs-string">"nanoid/non-secure"</span>;

<span class="hljs-keyword">type</span> AiJobType = <span class="hljs-string">"exercise_suggestion"</span> | <span class="hljs-string">"form_explanation"</span>;

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">enqueueAiJob</span>(<span class="hljs-params">db: <span class="hljs-built_in">any</span>, <span class="hljs-keyword">type</span>: AiJobType, payload: unknown</span>) </span>{
  <span class="hljs-keyword">const</span> now = <span class="hljs-built_in">Date</span>.now();
  <span class="hljs-keyword">const</span> id = nanoid();

  <span class="hljs-keyword">await</span> db.runAsync(
    <span class="hljs-string">`INSERT INTO ai_jobs(id, type, payload_json, status, attempts, run_after_ms, locked_until_ms, created_at_ms, updated_at_ms)
     VALUES(?, ?, ?, 'queued', 0, 0, 0, ?, ?)`</span>
    , [id, <span class="hljs-keyword">type</span>, <span class="hljs-built_in">JSON</span>.stringify(payload), now, now]
  );

  <span class="hljs-keyword">return</span> id;
}
</code></pre>
<p>Design choice: I’m not doing anything clever here (no batching, no compression). The win is that it’s <strong>deterministic and restart-safe</strong>.</p>
<h3 id="heading-claim-process-avoid-duplicate-workers">Claim + process: avoid duplicate workers</h3>
<p>Even on-device, you can end up with multiple workers (hot reload, navigation bugs, accidental multiple intervals). I added a “lease” field <code>locked_until_ms</code> to prevent double-processing.</p>
<pre><code class="lang-ts"><span class="hljs-keyword">const</span> LEASE_MS = <span class="hljs-number">15</span>_000;

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">claimNextJob</span>(<span class="hljs-params">db: <span class="hljs-built_in">any</span></span>) </span>{
  <span class="hljs-keyword">const</span> now = <span class="hljs-built_in">Date</span>.now();

  <span class="hljs-comment">// Find a runnable job</span>
  <span class="hljs-keyword">const</span> job = <span class="hljs-keyword">await</span> db.getFirstAsync(
    <span class="hljs-string">`SELECT * FROM ai_jobs
     WHERE status = 'queued'
       AND run_after_ms &lt;= ?
       AND locked_until_ms &lt;= ?
     ORDER BY created_at_ms ASC
     LIMIT 1`</span>,
    [now, now]
  );

  <span class="hljs-keyword">if</span> (!job) <span class="hljs-keyword">return</span> <span class="hljs-literal">null</span>;

  <span class="hljs-comment">// Lease it with a guarded UPDATE so an interleaved worker can't double-claim</span>
  <span class="hljs-keyword">const</span> lockedUntil = now + LEASE_MS;
  <span class="hljs-keyword">const</span> res = <span class="hljs-keyword">await</span> db.runAsync(
    <span class="hljs-string">`UPDATE ai_jobs
     SET status = 'running', locked_until_ms = ?, updated_at_ms = ?
     WHERE id = ? AND status = 'queued' AND locked_until_ms &lt;= ?`</span>,
    [lockedUntil, now, job.id, now]
  );

  <span class="hljs-comment">// expo-sqlite's runAsync reports affected rows via `changes`; 0 means another worker won</span>
  <span class="hljs-keyword">if</span> (!res || res.changes === <span class="hljs-number">0</span>) <span class="hljs-keyword">return</span> <span class="hljs-literal">null</span>;

  <span class="hljs-keyword">return</span> { ...job, locked_until_ms: lockedUntil, status: <span class="hljs-string">"running"</span> };
}
</code></pre>
<p>This is not perfect distributed locking, but for a single SQLite DB on one phone it’s pragmatic.</p>
<h3 id="heading-backoff-rate-limiting-protect-ux-and-battery">Backoff + rate limiting: protect UX and battery</h3>
<p>Two failure modes matter:</p>
<ol>
<li><strong>Offline / flaky network</strong> → repeated failures</li>
<li><strong>429 rate limits</strong> → hammering the API makes it worse</li>
</ol>
<p>I used an exponential backoff with jitter, and a simple per-user token bucket stored in memory + persisted timestamp in SQLite (so app restarts don’t immediately retry everything).</p>
<pre><code class="lang-ts"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">nextRunAfterMs</span>(<span class="hljs-params">attempts: <span class="hljs-built_in">number</span></span>) </span>{
  <span class="hljs-keyword">const</span> base = <span class="hljs-built_in">Math</span>.min(<span class="hljs-number">60</span>_000, <span class="hljs-number">1000</span> * <span class="hljs-built_in">Math</span>.pow(<span class="hljs-number">2</span>, attempts)); <span class="hljs-comment">// cap at 60s</span>
  <span class="hljs-keyword">const</span> jitter = <span class="hljs-built_in">Math</span>.floor(<span class="hljs-built_in">Math</span>.random() * <span class="hljs-number">400</span>); <span class="hljs-comment">// 0-400ms</span>
  <span class="hljs-keyword">return</span> <span class="hljs-built_in">Date</span>.now() + base + jitter;
}

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">markJobRetry</span>(<span class="hljs-params">db: <span class="hljs-built_in">any</span>, id: <span class="hljs-built_in">string</span>, attempts: <span class="hljs-built_in">number</span>, err: unknown</span>) </span>{
  <span class="hljs-keyword">const</span> now = <span class="hljs-built_in">Date</span>.now();
  <span class="hljs-keyword">const</span> runAfter = nextRunAfterMs(attempts);

  <span class="hljs-keyword">await</span> db.runAsync(
    <span class="hljs-string">`UPDATE ai_jobs
     SET status='queued', attempts=?, run_after_ms=?, locked_until_ms=0, last_error=?, updated_at_ms=?
     WHERE id=?`</span>,
    [attempts, runAfter, <span class="hljs-built_in">String</span>(err), now, id]
  );
}
</code></pre>
<p>Decision detail: I intentionally cap backoff at 60 seconds for now because AI is non-critical, and long retry windows reduce battery churn. If a job can’t succeed within a few minutes, it’s usually because the user is offline for a while—better to wait for connectivity.</p>
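<p>The token-bucket half of the rate limiting can be equally small. An in-memory sketch (capacity and refill rate are illustrative; persisting the refill timestamp, as mentioned above, is what survives restarts):</p>

```typescript
// Minimal token bucket for "N AI calls per minute".
// In the app, lastRefillMs is also persisted to SQLite so a restart
// doesn't hand out a fresh burst of tokens.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;

  constructor(
    private capacity: number,    // e.g. 5 calls
    private refillPerMs: number, // e.g. 5 / 60_000 for 5 per minute
    nowMs: number
  ) {
    this.tokens = capacity;
    this.lastRefillMs = nowMs;
  }

  tryTake(nowMs: number): boolean {
    const elapsed = Math.max(0, nowMs - this.lastRefillMs);
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerMs);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

<p>The worker calls <code>tryTake()</code> before claiming a job; a <code>false</code> just means the job stays queued until the next tick.</p>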
<h3 id="heading-result-persistence-decouple-ui-from-worker">Result persistence: decouple UI from worker</h3>
<p>When the worker completes, it stores a result row keyed by job id (or domain entity id), and marks the job done.</p>
<p>That means the UI can render “AI suggestion pending” without caring whether the app was restarted.</p>
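<p>A sketch of that completion step, assuming an <code>ai_results</code> table and an expo-sqlite-style transaction helper. Wrapping both writes in one transaction means a crash can’t leave a “done” job with no stored result:</p>

```typescript
// Persist the AI result and mark the job done atomically.
// `db` is the same expo-sqlite-style handle used elsewhere; the
// ai_results table shape is an assumption.
async function completeJob(db: any, jobId: string, resultJson: string) {
  const now = Date.now();
  await db.withTransactionAsync(async () => {
    await db.runAsync(
      `INSERT OR REPLACE INTO ai_results(job_id, result_json, created_at_ms)
       VALUES(?, ?, ?)`,
      [jobId, resultJson, now]
    );
    await db.runAsync(
      `UPDATE ai_jobs SET status = 'done', updated_at_ms = ? WHERE id = ?`,
      [now, jobId]
    );
  });
}
```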
<h2 id="heading-results-amp-learnings-so-far">Results &amp; learnings (so far)</h2>
<p>This is early (Builder Day 30), but a few things are measurable even at small scale.</p>
<h3 id="heading-performance-impact">Performance impact</h3>
<ul>
<li><strong>Logging path latency</strong>: enqueue insert is typically <strong>1–4ms</strong> on my test device (Pixel 7), and the UI remains under my <strong>~100ms interaction</strong> budget.</li>
<li><strong>Cold start impact</strong>: negligible because the worker doesn’t run until after initial render (I schedule it after navigation is ready). The main startup cost remains loading the exercise library (handled via lazy loading).</li>
</ul>
<h3 id="heading-reliability-improvements">Reliability improvements</h3>
<ul>
<li>No more “spinners that never resolve” when the user goes underground.</li>
<li>If the app is killed mid-request, the job lease expires and it retries later.</li>
<li>Rate limiting is centralized, so I can enforce “N AI calls per minute” without sprinkling guards across UI components.</li>
</ul>
<h3 id="heading-unexpected-challenges">Unexpected challenges</h3>
<ul>
<li><strong>Duplicate intents</strong>: users tap twice, or navigate back/forward quickly. Without dedupe, you pay twice. I’m adding a domain-level idempotency key (e.g., <code>suggestion:{workoutSessionId}:{exerciseId}:{setIndex}</code>) to collapse duplicates.</li>
<li><strong>Observability</strong>: debugging on-device queues is annoying. I added a hidden “Queue Inspector” screen that lists jobs, attempts, and last_error. Not pretty, but it cuts debugging time.</li>
<li><strong>Context window management</strong>: queued jobs can run minutes later. If the payload references “current set”, it may no longer be current. I learned to enqueue <strong>immutable references</strong> (IDs + snapshot fields), not “current state”.</li>
</ul>
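<p>The dedupe sketch below assumes <code>ai_jobs</code> gains a <code>dedupe_key</code> column (not in the schema above); duplicate taps collapse to the already-queued job:</p>

```typescript
// Domain-level idempotency key, matching the format described above.
function suggestionKey(workoutSessionId: string, exerciseId: string, setIndex: number): string {
  return `suggestion:${workoutSessionId}:${exerciseId}:${setIndex}`;
}

// Enqueue only if no live job exists for the same key.
// Assumes ai_jobs has a `dedupe_key` column (not in the schema above).
async function enqueueOnce(db: any, key: string, enqueue: () => Promise<string>): Promise<string> {
  const existing = await db.getFirstAsync(
    `SELECT id FROM ai_jobs WHERE dedupe_key = ? AND status IN ('queued', 'running')`,
    [key]
  );
  if (existing) return existing.id; // a double-tap collapses to the same job
  return enqueue();
}
```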
<h2 id="heading-when-this-doesnt-work">When this doesn’t work</h2>
<p>A durable local queue is not a universal solution.</p>
<ul>
<li><strong>If AI output must be immediate</strong>, like real-time coaching, you’ll still need synchronous calls (or at least streaming). Queueing helps reliability but not latency.</li>
<li><strong>If you require cross-device consistency</strong>, device-local jobs can diverge. You’ll want a server-side queue or a shared sync layer where jobs are replicated and deduped across devices.</li>
<li><strong>If you need strong guarantees</strong>, SQLite leasing is “good enough” for single-device but not equivalent to a transactional distributed queue. If you later add background tasks, multiple processes, or extensions, you’ll need more robust locking.</li>
<li><strong>If your payloads are huge</strong>, storing them in SQLite can bloat DB size and slow queries. In that case, store payloads as separate blobs/files and reference them.</li>
</ul>
<h2 id="heading-key-takeaways">Key takeaways</h2>
<ol>
<li><strong>Treat optional network features as background jobs</strong> in offline-first apps; don’t couple them to UI interactions.</li>
<li><strong>Persist the queue</strong> (SQLite outbox) so app restarts and OS kills don’t lose work.</li>
<li><strong>Design for idempotency early</strong>—duplicates are normal, not an edge case.</li>
<li><strong>Centralize backoff and rate limiting</strong> to protect battery, UX, and your API bill.</li>
<li><strong>Enqueue immutable snapshots</strong>, not “current state”, because queued work executes later.</li>
</ol>
<h2 id="heading-closing">Closing</h2>
<p>I’m happy with the durability and UX improvements, but I’m still unsure about the “right” next step: background execution (TaskManager) vs moving AI orchestration server-side once multi-device sync becomes real.</p>
<p>If you’ve built offline queues on mobile: do you prefer a device outbox like this, or do you push intents to a backend queue as early as possible—and why?</p>
]]></content:encoded></item><item><title><![CDATA[Why I Prefer Keyset Pagination for High-Volume Feeds]]></title><description><![CDATA[Pagination looks like a UI problem until it becomes a production bottleneck. Once your dataset grows and users start filtering, sorting, and jumping between pages, the wrong pagination strategy quietly burns CPU, increases query latency, and creates ...]]></description><link>https://blog.dvskr.dev/why-i-prefer-keyset-pagination-for-high-volume-feeds</link><guid isPermaLink="true">https://blog.dvskr.dev/why-i-prefer-keyset-pagination-for-high-volume-feeds</guid><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Fri, 23 Jan 2026 16:00:38 GMT</pubDate><content:encoded><![CDATA[<p>Pagination looks like a UI problem until it becomes a production bottleneck. Once your dataset grows and users start filtering, sorting, and jumping between pages, the wrong pagination strategy quietly burns CPU, increases query latency, and creates confusing duplicates or missing rows. As a solo creator, you don’t get to “throw a team at it”—you need a choice that’s fast, predictable, and hard to break at 2am. This is why I default to keyset (cursor) pagination over OFFSET/LIMIT for most high-volume feeds, and when I still won’t use it.</p>
<h2 id="heading-the-problem-space-constraints-that-matter">The problem space (constraints that matter)</h2>
<p>Pagination becomes architectural when it touches three things at once:</p>
<p>1) <strong>Performance at scale</strong>: As tables grow from thousands to millions of rows, naive pagination can turn into a linear scan. The user still sees “page 40”, but your database is doing work proportional to 40 * page_size.</p>
<p>2) <strong>Correctness under writes</strong>: Feeds are rarely static. New rows get inserted, old rows get updated, and background jobs backfill data. Offset-based pagination can return duplicates or skip records as the underlying ordering shifts.</p>
<p>3) <strong>Operational simplicity</strong>: Solo development is a constraint. I prefer designs that are:</p>
<ul>
<li>hard to misuse across endpoints</li>
<li>easy to reason about when debugging</li>
<li>index-friendly</li>
<li>cheap (less CPU, fewer slow queries)</li>
</ul>
<p>Non-functional requirements I usually assume for a “feed-like” endpoint:</p>
<ul>
<li><strong>p95 latency target</strong>: &lt; 150ms for common queries (excluding network)</li>
<li><strong>Predictable performance</strong>: page 1 and page 100 shouldn’t be 10x apart</li>
<li><strong>Stable ordering</strong>: no duplicates, minimal “jumping”</li>
<li><strong>Backwards/forwards navigation</strong>: at least “next page”; ideally “previous” too</li>
</ul>
<p>Why “existing solutions” don’t fit by default:</p>
<ul>
<li>Many ORMs make OFFSET/LIMIT feel like the obvious default.</li>
<li>Many frontend designs assume numeric pages (1…N), which biases you toward OFFSET.</li>
<li>Some developers ship OFFSET early “just for MVP” and then discover it’s embedded in caching, links, emails, and analytics.</li>
</ul>
<blockquote>
<p>Key insight: pagination is part of your data contract. Changing it later is possible, but it’s never free.</p>
</blockquote>
<h2 id="heading-options-considered">Options considered</h2>
<p>Below are the common strategies I’ve used or audited in production-like systems.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Option</td><td>What it is</td><td>Pros</td><td>Cons</td><td>Works best when</td></tr>
</thead>
<tbody>
<tr>
<td>OFFSET/LIMIT</td><td><code>ORDER BY ... OFFSET x LIMIT y</code></td><td>Simple; supports random access (page 37)</td><td>Slower as offset grows; duplicates/skips under writes; deep pages expensive</td><td>Small tables; mostly static datasets; admin views</td></tr>
<tr>
<td>Keyset (Cursor)</td><td><code>WHERE (sort_key, id) &lt; (cursor_sort_key, cursor_id) LIMIT y</code></td><td>O(1) page-to-page; stable under inserts; index-friendly</td><td>Harder random access; needs careful cursor encoding; tricky with complex sorting</td><td>Feeds, timelines, infinite scroll, large datasets</td></tr>
<tr>
<td>Seek by ID only</td><td><code>WHERE id &lt; last_id</code></td><td>Very fast; simplest cursor</td><td>Only works if ID correlates with desired order; breaks if you sort by time/score</td><td>Append-only logs; monotonic IDs; simple “latest first”</td></tr>
<tr>
<td>Snapshot + OFFSET</td><td>Pin a consistent snapshot (repeatable read) and use OFFSET</td><td>Correctness improves; keeps numeric pages</td><td>Still pays offset cost; snapshot management complexity; not great for long browsing sessions</td><td>Reporting; exports; short sessions</td></tr>
<tr>
<td>Precomputed page map</td><td>Materialize page boundaries (e.g., store cursor per page)</td><td>Enables random access + keyset speed</td><td>Extra storage; invalidation complexity; rebuild cost</td><td>Highly trafficked, mostly read-only catalogs</td></tr>
</tbody>
</table>
</div><h3 id="heading-what-actually-bites-you-in-production">What actually bites you in production</h3>
<ul>
<li><strong>OFFSET cost is not theoretical.</strong> In Postgres, OFFSET often means scanning and discarding rows. Even with indexes, the engine still has to walk past N rows.</li>
<li><strong>Correctness is a user-facing feature.</strong> Duplicate items in a feed erode trust. Missing items can be worse.</li>
<li><strong>Random page jumps are overrated.</strong> Most consumer feeds are “next/previous” patterns. Numeric page links are common in catalogs, not timelines.</li>
</ul>
<h2 id="heading-the-decision-what-i-choose-and-why">The decision (what I choose and why)</h2>
<p>I default to <strong>keyset pagination with a composite cursor</strong>:</p>
<ul>
<li>Primary sort key: <code>created_at</code> (or whatever defines the feed)</li>
<li>Tie-breaker: <code>id</code> (unique, stable)</li>
</ul>
<h3 id="heading-why-ranked-reasons">Why (ranked reasons)</h3>
<ol>
<li><strong>Predictable query cost</strong>: page 1 and page 100 are similar complexity.</li>
<li><strong>Correctness under concurrent writes</strong>: fewer duplicates/skips because you’re anchoring to a position, not a row count.</li>
<li><strong>Index leverage</strong>: a composite index can satisfy the query efficiently.</li>
<li><strong>Simpler operations</strong>: fewer slow queries, fewer surprise p95 spikes.</li>
</ol>
<h3 id="heading-what-i-give-up">What I give up</h3>
<ul>
<li>True random access like “go to page 42” is non-trivial.</li>
<li>You must design a cursor format (encoding/decoding, validation, expiry decisions).</li>
<li>Some sorting modes don’t map well (e.g., sorting by a computed score that changes frequently).</li>
</ul>
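<p>To make the cursor-format trade concrete, here’s a minimal encode/decode sketch (the base64url-JSON format is an illustrative choice; any stable opaque encoding works):</p>

```typescript
// Opaque keyset cursor: base64url-encoded (created_at, id) pair.
type Cursor = { createdAt: string; id: number };

function encodeCursor(c: Cursor): string {
  return Buffer.from(JSON.stringify(c)).toString("base64url");
}

// Returns null for malformed input so bad cursors become a 400, not a 500.
function decodeCursor(raw: string): Cursor | null {
  try {
    const parsed = JSON.parse(Buffer.from(raw, "base64url").toString("utf8"));
    if (typeof parsed.createdAt !== "string" || typeof parsed.id !== "number") return null;
    if (Number.isNaN(Date.parse(parsed.createdAt))) return null;
    return { createdAt: parsed.createdAt, id: parsed.id };
  } catch {
    return null;
  }
}
```

<p>The decoded pair feeds straight into the row-comparison predicate shown in the implementation section.</p>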
<h2 id="heading-implementation-overview-postgres-examples">Implementation overview (Postgres examples)</h2>
<h3 id="heading-1-schema-index-that-makes-keyset-work">1) Schema + index that makes keyset work</h3>
<p>Assume a table:</p>
<ul>
<li><code>id</code> is unique</li>
<li><code>created_at</code> is the primary ordering</li>
</ul>
<pre><code class="lang-sql"><span class="hljs-comment">-- Postgres</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> items (
  <span class="hljs-keyword">id</span> BIGSERIAL PRIMARY <span class="hljs-keyword">KEY</span>,
  created_at TIMESTAMPTZ <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-keyword">now</span>(),
  title <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  payload JSONB
);

<span class="hljs-comment">-- Composite index to support ORDER BY created_at DESC, id DESC</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> items_created_at_id_desc
<span class="hljs-keyword">ON</span> items (created_at <span class="hljs-keyword">DESC</span>, <span class="hljs-keyword">id</span> <span class="hljs-keyword">DESC</span>);
</code></pre>
<p>The tie-breaker (<code>id</code>) matters because many rows can share the same <code>created_at</code> at millisecond resolution or due to batch inserts.</p>
<h3 id="heading-2-the-keyset-query-next-page">2) The keyset query (next page)</h3>
<p>The cursor is the last row of the previous page: <code>(created_at, id)</code>.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- First page (no cursor)</span>
<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">id</span>, created_at, title
<span class="hljs-keyword">FROM</span> items
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> created_at <span class="hljs-keyword">DESC</span>, <span class="hljs-keyword">id</span> <span class="hljs-keyword">DESC</span>
<span class="hljs-keyword">LIMIT</span> $<span class="hljs-number">1</span>;

<span class="hljs-comment">-- Next page (cursor provided)</span>
<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">id</span>, created_at, title
<span class="hljs-keyword">FROM</span> items
<span class="hljs-keyword">WHERE</span> (created_at, <span class="hljs-keyword">id</span>) &lt; ($<span class="hljs-number">2</span>::timestamptz, $<span class="hljs-number">3</span>::<span class="hljs-built_in">bigint</span>)
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> created_at <span class="hljs-keyword">DESC</span>, <span class="hljs-keyword">id</span> <span class="hljs-keyword">DESC</span>
<span class="hljs-keyword">LIMIT</span> $<span class="hljs-number">1</span>;
</code></pre>
<p>Why tuple comparison? It’s concise, maps cleanly to the “sort key + tie-breaker” concept, and is equivalent to <code>created_at &lt; $2 OR (created_at = $2 AND id &lt; $3)</code>, a predicate Postgres can satisfy with the composite index above.</p>
<h3 id="heading-3-cursor-encoding-dont-leak-raw-values-blindly">3) Cursor encoding (don’t leak raw values blindly)</h3>
<p>I like a compact, signed cursor so:</p>
<ul>
<li>clients can’t easily tamper with it</li>
<li>you can evolve cursor formats</li>
</ul>
<p>Below is a minimal Node-style example (works similarly in any backend):</p>
<pre><code class="lang-js"><span class="hljs-keyword">import</span> crypto <span class="hljs-keyword">from</span> <span class="hljs-string">"crypto"</span>;

<span class="hljs-keyword">const</span> SECRET = process.env.CURSOR_SECRET;

<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">encodeCursor</span>(<span class="hljs-params">{ createdAt, id }</span>) </span>{
  <span class="hljs-keyword">const</span> payload = <span class="hljs-built_in">JSON</span>.stringify({ <span class="hljs-attr">v</span>: <span class="hljs-number">1</span>, createdAt, id });
  <span class="hljs-keyword">const</span> sig = crypto.createHmac(<span class="hljs-string">"sha256"</span>, SECRET).update(payload).digest(<span class="hljs-string">"base64url"</span>);
  <span class="hljs-keyword">return</span> Buffer.from(payload).toString(<span class="hljs-string">"base64url"</span>) + <span class="hljs-string">"."</span> + sig;
}

<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">decodeCursor</span>(<span class="hljs-params">cursor</span>) </span>{
  <span class="hljs-keyword">const</span> [b64, sig] = cursor.split(<span class="hljs-string">"."</span>);
  <span class="hljs-keyword">const</span> payload = Buffer.from(b64, <span class="hljs-string">"base64url"</span>).toString(<span class="hljs-string">"utf8"</span>);
  <span class="hljs-keyword">const</span> expected = crypto.createHmac(<span class="hljs-string">"sha256"</span>, SECRET).update(payload).digest(<span class="hljs-string">"base64url"</span>);
  <span class="hljs-keyword">if</span> (sig !== expected) <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"Invalid cursor"</span>);
  <span class="hljs-keyword">const</span> obj = <span class="hljs-built_in">JSON</span>.parse(payload);
  <span class="hljs-keyword">if</span> (obj.v !== <span class="hljs-number">1</span>) <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"Unsupported cursor version"</span>);
  <span class="hljs-keyword">return</span> obj;
}
</code></pre>
<p>Trade-off: signing adds a tiny CPU cost, but it prevents a whole class of “cursor points to weird place” bugs and makes abuse harder.</p>
<h3 id="heading-4-api-shape-make-misuse-difficult">4) API shape (make misuse difficult)</h3>
<p>I prefer an API contract like:</p>
<ul>
<li><code>limit</code> (bounded)</li>
<li><code>cursor</code> (opaque)</li>
<li>returns: <code>items[]</code>, <code>next_cursor</code></li>
</ul>
<pre><code class="lang-json">{
  <span class="hljs-attr">"items"</span>: [{ <span class="hljs-attr">"id"</span>: <span class="hljs-number">123</span>, <span class="hljs-attr">"created_at"</span>: <span class="hljs-string">"2026-01-29T10:00:00Z"</span>, <span class="hljs-attr">"title"</span>: <span class="hljs-string">"..."</span> }],
  <span class="hljs-attr">"next_cursor"</span>: <span class="hljs-string">"eyJ2IjoxLCJjcmVhdGVkQXQiOiIyMDI2LTAxLTI5VDEwOjAwOjAwWiIsImlkIjoxMjN9.abc..."</span>
}
</code></pre>
<p>This nudges the frontend toward “infinite scroll / load more”, which matches the strengths of keyset.</p>
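<p>On the server side I also clamp <code>limit</code> rather than trusting the client. A small sketch (the default and maximum values are illustrative, not a prescription):</p>

```javascript
// Clamp a raw, untrusted limit parameter into a safe range.
// DEFAULT_LIMIT and MAX_LIMIT are illustrative values.
const DEFAULT_LIMIT = 20;
const MAX_LIMIT = 100;

function parseLimit(raw) {
  const n = Number.parseInt(raw, 10);
  if (Number.isNaN(n) || n < 1) return DEFAULT_LIMIT; // missing or nonsense input
  return Math.min(n, MAX_LIMIT); // cap deep reads
}
```

<p>Bounding <code>limit</code> matters for the same reason keyset does: it keeps the worst-case cost of any single request predictable.</p>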
<h2 id="heading-diagram-how-keyset-pagination-reads-data">Diagram: how keyset pagination reads data</h2>
<pre><code class="lang-mermaid">graph TD
  A[Client] --&gt;|GET /feed?limit=20| B[API]
  B --&gt;|SQL: ORDER BY created_at,id DESC LIMIT 20| C[(Postgres)]
  C --&gt; B
  B --&gt;|items + next_cursor| A
  A --&gt;|GET /feed?limit=20&amp;cursor=...| B
  B --&gt;|SQL: WHERE (created_at,id) &lt; cursor ORDER BY ... LIMIT 20| C
</code></pre>
<p>The key is that the second query doesn’t “count past” earlier rows; it “seeks” to a position.</p>
<h2 id="heading-results-amp-learnings-numbers-what-surprised-me">Results &amp; learnings (numbers + what surprised me)</h2>
<p>I’m intentionally not tying this to any specific product; these are representative numbers from benchmarking patterns I’ve repeated over time.</p>
<h3 id="heading-performance-comparison-representative">Performance comparison (representative)</h3>
<p>On Postgres with a table in the <strong>1–10M row</strong> range, ordering by <code>(created_at DESC, id DESC)</code>:</p>
<ul>
<li><p><strong>OFFSET/LIMIT</strong></p>
<ul>
<li>page 1 (OFFSET 0): often ~10–30ms query time</li>
<li>page 500 (OFFSET 9,980 with limit 20): can drift to ~80–250ms depending on cache and vacuum state</li>
<li>deep pages: p95 spikes are common under concurrent load</li>
</ul>
</li>
<li><p><strong>Keyset</strong></p>
<ul>
<li>page 1: ~10–30ms (similar)</li>
<li>page 500: usually stays in the same band (~10–40ms) because the index seek remains efficient</li>
</ul>
</li>
</ul>
<p>What surprised me early on:</p>
<ul>
<li>OFFSET-based endpoints can look “fine” in staging because you rarely test deep pages.</li>
<li>Keyset pagination makes <strong>caching easier</strong> for “top of feed” traffic, because you avoid expensive deep scans that compete for shared resources.</li>
<li>The biggest win isn’t average latency—it’s <strong>tail latency predictability</strong>.</li>
</ul>
<h3 id="heading-operational-learning">Operational learning</h3>
<p>If you’re solo, the win is fewer incidents caused by growth. Keyset is one of those choices where you pay a bit of complexity upfront (cursor encoding, edge cases) to avoid repeated performance firefighting later.</p>
<h2 id="heading-when-this-doesnt-work">When this doesn’t work</h2>
<p>Keyset pagination is not a universal default.</p>
<p>Use something else when:</p>
<p>1) <strong>You truly need random access</strong> (e.g., “page 37 of 2,000” is a real UX requirement). Catalogs, admin UIs, and compliance exports often need numeric pages.</p>
<p>2) <strong>Your sort key is unstable</strong> (e.g., “score” changes frequently). If the ordering changes between requests, any pagination strategy can feel inconsistent—but keyset can be especially confusing because the cursor anchors to a moving target.</p>
<p>3) <strong>You need total counts and exact page numbers</strong>. Keyset doesn’t naturally provide “total pages”. You can compute counts separately, but it’s another query and can be expensive.</p>
<p>4) <strong>Complex multi-column sorts with NULL semantics</strong>. Still doable, but your cursor logic becomes more fragile. At some point, you’re building a mini query planner.</p>
<p>In those cases, I’ll either:</p>
<ul>
<li>accept OFFSET for smaller datasets + add safeguards (max page, caching, read replica), or</li>
<li>build a hybrid: keyset for “browse”, offset for “jump”, or</li>
<li>materialize results (precomputed boundaries) if the dataset is mostly read-only.</li>
</ul>
<h2 id="heading-key-takeaways-a-framework-you-can-reuse">Key takeaways (a framework you can reuse)</h2>
<p>1) <strong>Decide based on growth shape, not current size</strong>. If you expect the table to grow continuously, avoid strategies with linear deep-page costs.
2) <strong>Correctness under writes is part of UX</strong>. If duplicates/skips are unacceptable, prefer cursor-based approaches.
3) <strong>Pick an ordering you can index</strong>. Keyset only shines when your <code>ORDER BY</code> matches an index.
4) <strong>Make the cursor opaque and versioned</strong>. You’ll thank yourself when you evolve sorting or add filters.
5) <strong>Optimize for operations as a solo creator</strong>: predictable p95 beats cleverness.</p>
<h2 id="heading-closing">Closing</h2>
<p>If you had to choose today: would you trade away random page access to get stable performance and fewer pagination bugs? I’m curious where you draw that line—especially for datasets that are both large and frequently updated.</p>
]]></content:encoded></item><item><title><![CDATA[Why I Use Materialized Views for Job Board Aggregations]]></title><description><![CDATA[When I built a PMHNP job board that aggregates 7,556+ listings across 1,368+ companies (with ~200 updates/day), the “simple” parts got hard fast—especially counts and facets. Users expect filters like state, remote, and company to feel instant. But r...]]></description><link>https://blog.dvskr.dev/why-i-use-materialized-views-for-job-board-aggregations</link><guid isPermaLink="true">https://blog.dvskr.dev/why-i-use-materialized-views-for-job-board-aggregations</guid><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Thu, 22 Jan 2026 17:31:59 GMT</pubDate><content:encoded><![CDATA[<p>When I built a PMHNP job board that aggregates 7,556+ listings across 1,368+ companies (with ~200 updates/day), the “simple” parts got hard fast—especially counts and facets. Users expect filters like state, remote, and company to feel instant. But running aggregation queries on every request competes with writes from the pipeline and spikes latency. This article is one architectural decision: why I chose PostgreSQL materialized views for aggregations (and what I gave up) to keep p95 query times around ~50ms in production.</p>
<h2 id="heading-the-problem-space">The problem space</h2>
<p>I’m Sathish (@Sathish_Daggula), a data engineer turned indie hacker. My production system is a niche job board for Psychiatric Mental Health Nurse Practitioners (PMHNP). It runs on Next.js 14 + Supabase (Postgres) + Vercel.</p>
<p>The workload is deceptively mixed:</p>
<ul>
<li><strong>Read-heavy UX expectations</strong>: job listings, search, filters, “companies hiring” pages.</li>
<li><strong>Write-heavy pipeline bursts</strong>: 10+ sources → scrape → normalize → dedupe → upsert. Roughly <strong>200+ daily updates</strong> (some days spiky).</li>
<li><strong>Aggregation-heavy UI</strong>: “jobs by state”, “top companies”, counts for filters, weekly email alerts that need grouped data.</li>
</ul>
<p>Non-functional constraints mattered more than features:</p>
<ul>
<li><strong>Latency</strong>: I targeted “feels instant” for filters. In practice: <strong>~50ms p95 query time</strong> for the most common endpoints.</li>
<li><strong>Cost &amp; ops</strong>: I’m a solo creator; I wanted fewer moving parts than adding Redis + workers + a separate analytics store.</li>
<li><strong>Correctness</strong>: counts that drift are worse than slow counts. If a filter says “Texas (123)”, it must be defensible.</li>
</ul>
<p>The immediate pain: aggregation queries (COUNT/GROUP BY) were the first thing to degrade as the dataset and filters grew. They’re also the easiest to accidentally make expensive.</p>
<blockquote>
<p>Key insight: for job boards, the “list page” is not the hard part—<strong>facets and rollups</strong> are.</p>
</blockquote>
<h2 id="heading-options-considered">Options considered</h2>
<p>I evaluated four approaches for aggregations (facets + dashboards + email rollups). Here’s how I framed it.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Option</td><td>What it is</td><td>Pros</td><td>Cons</td><td>Best when</td></tr>
</thead>
<tbody>
<tr>
<td>1) On-the-fly aggregations</td><td>Run GROUP BY/COUNT queries per request</td><td>Always fresh; simplest conceptually</td><td>Can get slow fast; competes with writes; needs careful indexing</td><td>Small datasets, low concurrency, few filters</td></tr>
<tr>
<td>2) Application-side caching</td><td>Cache aggregation results (memory/Redis/CDN)</td><td>Very fast reads; flexible TTL</td><td>Cache invalidation is real work; stale data risks; extra infra</td><td>Data changes slowly or staleness is acceptable</td></tr>
<tr>
<td>3) Precomputed tables (manual rollups)</td><td>Maintain rollup tables updated by pipeline/jobs</td><td>Fast and explicit; can be incremental</td><td>More code paths; must handle backfills; consistency bugs possible</td><td>High scale, strict SLAs, you can afford ops</td></tr>
<tr>
<td>4) PostgreSQL materialized views</td><td>Database-managed snapshot of a query, refreshable</td><td>Fast reads; fewer app bugs; strong SQL ergonomics</td><td>Refresh cost; staleness window; concurrency nuances</td><td>Medium scale, lots of repeated rollups, minimal infra</td></tr>
</tbody>
</table>
</div><h3 id="heading-why-not-just-do-on-the-fly">Why not just do on-the-fly?</h3>
<p>I started there. It worked until I added more dimensions (state, remote, employment type, source, posted date windows) and more surfaces (homepage, company page, email alerts). A single “facet query” can be okay, but <strong>multiple facets per page</strong> means you’re effectively running a small analytics workload in your OLTP path.</p>
<h3 id="heading-why-not-rediscdn-caching">Why not Redis/CDN caching?</h3>
<p>I like caching, but I’m careful about using it as a crutch:</p>
<ul>
<li>Cache keys explode with combinations of filters.</li>
<li>TTL-based staleness can create “why is this count wrong?” moments.</li>
<li>Invalidation requires you to know exactly which writes affect which keys.</li>
</ul>
<p>For a solo project, I wanted fewer “distributed correctness” problems.</p>
<h3 id="heading-why-not-manual-rollup-tables">Why not manual rollup tables?</h3>
<p>This is the serious approach at scale. But it turns your pipeline into a mini data warehouse project:</p>
<ul>
<li>You need incremental logic per dimension.</li>
<li>Backfills become tricky.</li>
<li>If you ever change business logic (“what counts as active?”), you’re rewriting historical rollups.</li>
</ul>
<p>I wasn’t ready to take on that complexity.</p>
<h2 id="heading-the-decision">The decision</h2>
<p>I chose <strong>PostgreSQL materialized views</strong> for the repeated aggregations that power:</p>
<ul>
<li>Filter counts / facets (by state, remote, company)</li>
<li>“Top companies hiring” lists</li>
<li>Weekly email alert grouping</li>
</ul>
<h3 id="heading-why-materialized-views-ranked-reasons">Why materialized views (ranked reasons)</h3>
<ol>
<li><strong>Predictable read performance</strong>: the view is effectively a precomputed table.</li>
<li><strong>Lower bug surface</strong>: aggregation logic stays in SQL, not duplicated across API endpoints.</li>
<li><strong>Operational simplicity</strong>: no extra data store; refresh can run from the same scheduler as my scraper.</li>
<li><strong>Easier evolution than manual rollups</strong>: I can change the query and refresh, instead of rewriting incremental update logic.</li>
</ol>
<h3 id="heading-what-i-gave-up">What I gave up</h3>
<ul>
<li><strong>Real-time freshness</strong>: materialized views are snapshots. I accepted a refresh window.</li>
<li><strong>Refresh cost</strong>: refresh is work the database must do, and it can contend with other load.</li>
<li><strong>Some constraints</strong>: for concurrent refresh you need unique indexes and you must design the view accordingly.</li>
</ul>
<h3 id="heading-implementation-overview">Implementation overview</h3>
<p>At a high level, my system looks like this:</p>
<pre><code class="lang-mermaid">flowchart LR
  A[Vercel Cron] --&gt; B[Scrapers 10+ sources]
  B --&gt; C[Normalize + Deduplicate]
  C --&gt; D[(PostgreSQL via Supabase)]
  D --&gt; E["Materialized Views (facets, rollups)"]
  E --&gt; F[Next.js API/Server Actions]
  F --&gt; G[UI: Search + Filters]
  E --&gt; H[Weekly Job Alerts]
</code></pre>
<p>The key is: the app reads facets from materialized views, while the pipeline writes to base tables. Refresh happens on a cadence that matches my “good enough” freshness requirements.</p>
<h3 id="heading-1-a-concrete-materialized-view-for-facets">1) A concrete materialized view for facets</h3>
<p>For example, a simplified “jobs by state” facet.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- 1) Materialized view</span>
<span class="hljs-keyword">create</span> <span class="hljs-keyword">materialized</span> <span class="hljs-keyword">view</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> mv_jobs_by_state <span class="hljs-keyword">as</span>
<span class="hljs-keyword">select</span>
  j.state <span class="hljs-keyword">as</span> state,
  <span class="hljs-keyword">count</span>(*) <span class="hljs-keyword">as</span> job_count,
  <span class="hljs-keyword">max</span>(j.updated_at) <span class="hljs-keyword">as</span> last_updated_at
<span class="hljs-keyword">from</span> jobs j
<span class="hljs-keyword">where</span> j.status = <span class="hljs-string">'active'</span>
<span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> j.state;

<span class="hljs-comment">-- 2) Index to make reads fast</span>
<span class="hljs-keyword">create</span> <span class="hljs-keyword">unique</span> <span class="hljs-keyword">index</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> mv_jobs_by_state_state_uidx
<span class="hljs-keyword">on</span> mv_jobs_by_state(state);
</code></pre>
<p>Notes:</p>
<ul>
<li>I include <code>status = 'active'</code> because “active jobs” is a business concept, not just a row count.</li>
<li>The <strong>unique index</strong> enables <code>refresh materialized view concurrently</code>.</li>
</ul>
<h3 id="heading-2-refresh-strategy-concurrent-scheduled-and-scoped">2) Refresh strategy: concurrent, scheduled, and scoped</h3>
<p>I refresh materialized views on a schedule (Vercel Cron). The important part is to avoid blocking reads.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Concurrent refresh avoids blocking selects.</span>
refresh materialized view concurrently mv_jobs_by_state;
</code></pre>
<p>In practice I don’t refresh every minute. For a job board, a 15–60 minute staleness window is usually acceptable, and it dramatically reduces refresh churn.</p>
<p>If you have multiple materialized views, refresh order matters (and you may want to stagger them).</p>
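<p>One low-risk way to keep refresh order explicit is to build the statement list up front and have the cron handler execute it sequentially. A sketch (the second view name is hypothetical; any Postgres client can run the resulting statements):</p>

```javascript
// Build refresh statements in an explicit order; the caller executes them
// one at a time so refreshes don't contend with each other.
// View names here are examples, not a fixed schema.
function buildRefreshPlan(views) {
  return views.map(
    (view) => `refresh materialized view concurrently ${view}`
  );
}

const plan = buildRefreshPlan(["mv_jobs_by_state", "mv_top_companies"]);
```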
<h3 id="heading-3-reading-facets-from-the-app">3) Reading facets from the app</h3>
<p>In Next.js, I query the materialized view rather than running a GROUP BY on the jobs table for every request.</p>
<pre><code class="lang-ts"><span class="hljs-comment">// Server-side query (Next.js 14)</span>
<span class="hljs-keyword">import</span> { createClient } <span class="hljs-keyword">from</span> <span class="hljs-string">"@supabase/supabase-js"</span>;

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">getStateFacets</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-keyword">const</span> supabase = createClient(
    process.env.SUPABASE_URL!,
    process.env.SUPABASE_SERVICE_ROLE_KEY!
  );

  <span class="hljs-keyword">const</span> { data, error } = <span class="hljs-keyword">await</span> supabase
    .from(<span class="hljs-string">"mv_jobs_by_state"</span>)
    .select(<span class="hljs-string">"state, job_count"</span>)
    .order(<span class="hljs-string">"job_count"</span>, { ascending: <span class="hljs-literal">false</span> });

  <span class="hljs-keyword">if</span> (error) <span class="hljs-keyword">throw</span> error;
  <span class="hljs-keyword">return</span> data;
}
</code></pre>
<p>I use the service role for server-side trusted reads; for user-facing reads you can also expose views via RLS carefully (more on that under “Unexpected challenges” below).</p>
<h2 id="heading-results-amp-learnings">Results &amp; learnings</h2>
<h3 id="heading-what-improved">What improved</h3>
<ul>
<li><strong>Latency</strong>: the endpoints that previously ran GROUP BY queries now read from small, indexed materialized views. This is a big contributor to keeping <strong>~50ms p95 query times</strong> for common read paths.</li>
<li><strong>Database stability</strong>: expensive aggregations moved off the “every request” path. Writes from the ingestion pipeline are less likely to coincide with heavy read aggregation.</li>
<li><strong>Developer speed</strong>: when I added weekly job alerts, I reused the same rollups. Business logic stayed centralized.</li>
</ul>
<h3 id="heading-unexpected-challenges">Unexpected challenges</h3>
<ol>
<li><p><strong>Refresh timing is a product decision</strong></p>
<ul>
<li>If you refresh too often, you pay a constant compute tax.</li>
<li>If you refresh too rarely, users see stale counts.</li>
<li>For my case, “jobs updated daily-ish” means a modest refresh cadence is fine.</li>
</ul>
</li>
<li><p><strong>Concurrent refresh has requirements</strong></p>
<ul>
<li>You need a unique index on the materialized view.</li>
<li>Your view query must produce stable unique rows for that index.</li>
</ul>
</li>
<li><p><strong>RLS + views needs deliberate thought</strong></p>
<ul>
<li>Supabase RLS is great for multi-tenant security, but you must decide whether facets are public, tenant-scoped, or admin-only.</li>
<li>Sometimes it’s safer to keep views behind server-only access rather than exposing them directly.</li>
</ul>
</li>
</ol>
<blockquote>
<p>Learning: materialized views are not “set and forget.” The refresh cadence is part of the system design.</p>
</blockquote>
<h2 id="heading-when-this-doesnt-work">When this doesn’t work</h2>
<p>Materialized views are a strong middle ground, but I wouldn’t recommend them universally.</p>
<p>They’re a poor fit when:</p>
<ul>
<li><strong>You need real-time facets</strong> (seconds-level freshness). In that case, you’ll likely need streaming updates + cache invalidation, or incremental rollups.</li>
<li><strong>Your refresh cost is too high</strong> because the underlying query scans huge tables. At large scale, you’ll want partitioning, incremental aggregation, or a separate OLAP store.</li>
<li><strong>You have high write throughput</strong> and refresh contention becomes visible. Even concurrent refresh consumes resources.</li>
<li><strong>Your facets depend on user-specific permissions</strong>. If every user sees different counts (e.g., per-tenant private jobs), you either need tenant-specific materialized views (messy) or compute on the fly with good indexes.</li>
</ul>
<p>At some point, the “right” solution becomes: a small analytical layer (ClickHouse/BigQuery) or a dedicated caching strategy with explicit invalidation.</p>
<h2 id="heading-key-takeaways">Key takeaways</h2>
<ol>
<li><strong>Facets are analytics</strong>: treat them like a separate workload from listing reads.</li>
<li><strong>Materialized views are a pragmatic middle</strong> between slow GROUP BY and complex rollup pipelines.</li>
<li><strong>Design for refresh</strong>: pick a cadence aligned with user expectations, not engineering aesthetics.</li>
<li><strong>Use concurrent refresh + unique indexes</strong> to avoid blocking reads.</li>
<li><strong>Keep business meaning in SQL</strong> (e.g., what counts as “active”) to avoid duplicating logic across endpoints.</li>
</ol>
<h2 id="heading-closing">Closing</h2>
<p>If you’ve built a read-heavy product with frequent updates, how do you handle aggregations—materialized views, Redis caching, rollup tables, or an OLAP store? I’m especially curious what refresh/invalidation strategies have held up for you in production.</p>
]]></content:encoded></item><item><title><![CDATA[Enhancing Your Development Workflow with AI: A Deep Dive into Vibe Coding]]></title><description><![CDATA[Introduction
In the fast-paced world of software development, efficiency is key. As an indie hacker and data engineer, I've constantly sought ways to enhance my productivity and streamline my workflow. Enter Vibe Coding—a concept that leverages AI to...]]></description><link>https://blog.dvskr.dev/enhancing-your-development-workflow-with-ai-a-deep-dive-into-vibe-coding</link><guid isPermaLink="true">https://blog.dvskr.dev/enhancing-your-development-workflow-with-ai-a-deep-dive-into-vibe-coding</guid><category><![CDATA[JavaScript]]></category><category><![CDATA[webdev]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Fri, 16 Jan 2026 15:00:14 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>In the fast-paced world of software development, efficiency is key. As an indie hacker and data engineer, I've constantly sought ways to enhance my productivity and streamline my workflow. Enter Vibe Coding—a concept that leverages AI tools like Cursor and Claude AI to augment the development process. In this article, I’ll walk you through how these tools have revolutionized my coding experience, allowing me to build better products faster.</p>
<h2 id="heading-the-concept-of-vibe-coding">The Concept of Vibe Coding</h2>
<p>Vibe Coding is about creating a harmonious development environment where AI tools work in tandem to assist in coding tasks. Cursor, an AI-powered code assistant, and Claude AI, a conversational AI tool, form the backbone of my setup, each playing a crucial role in different stages of development.</p>
<h2 id="heading-getting-started-with-cursor">Getting Started with Cursor</h2>
<p>Cursor is a tool designed to increase code quality and development speed. It helps identify errors, optimize code, and even generate boilerplate code. Here's a typical use case:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Before optimization</span>
<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(my_list)):
    print(my_list[i])

<span class="hljs-comment"># After optimization with Cursor</span>
<span class="hljs-keyword">for</span> item <span class="hljs-keyword">in</span> my_list:
    print(item)
</code></pre>
<p>The tool not only improves my coding efficiency but also provides learning opportunities by highlighting better coding practices.</p>
<h2 id="heading-leveraging-claude-ai-for-conceptual-clarity">Leveraging Claude AI for Conceptual Clarity</h2>
<p>Claude AI serves as a digital brainstorming partner, offering insights and suggestions that enhance problem-solving. Whether it’s understanding complex algorithms or exploring new tech stacks, Claude AI is there to provide clarity.</p>
<pre><code class="lang-javascript"><span class="hljs-comment">// Claude AI suggestion for implementing a debounce function</span>
<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">debounce</span>(<span class="hljs-params">func, wait</span>) </span>{
  <span class="hljs-keyword">let</span> timeout;
  <span class="hljs-keyword">return</span> <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">...args</span>) </span>{
    <span class="hljs-built_in">clearTimeout</span>(timeout);
    timeout = <span class="hljs-built_in">setTimeout</span>(<span class="hljs-function">() =&gt;</span> func.apply(<span class="hljs-built_in">this</span>, args), wait);
  };
}
</code></pre>
<h2 id="heading-integrating-ai-into-daily-workflow">Integrating AI into Daily Workflow</h2>
<p>Integrating these tools into my daily workflow has been seamless. I use Cursor for code reviews and Claude AI for design discussions. This integration allows for more focus on creative problem-solving and less on mundane tasks.</p>
<h2 id="heading-overcoming-challenges-with-ai-tools">Overcoming Challenges with AI Tools</h2>
<p>While AI tools offer significant benefits, they are not without challenges. Ensuring data privacy and managing AI suggestions that align with project requirements are ongoing considerations. However, the productivity gains far outweigh these challenges.</p>
<h2 id="heading-future-of-vibe-coding">Future of Vibe Coding</h2>
<p>As AI continues to evolve, the potential for Vibe Coding is limitless. Future advancements may include more intuitive interfaces and deeper integration with existing development tools, making the coding experience even more immersive and efficient.</p>
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<ul>
<li>Vibe Coding leverages AI tools to create a harmonious and efficient development environment.</li>
<li>Cursor optimizes code quality and speeds up development processes.</li>
<li>Claude AI provides conceptual clarity and enhances problem-solving.</li>
<li>Integrating AI into daily workflows can significantly boost productivity.</li>
<li>Ongoing challenges include data privacy and aligning AI suggestions with project needs.</li>
</ul>
<h2 id="heading-cta">CTA</h2>
<p>Curious about incorporating AI into your development process? Start exploring tools like Cursor and Claude AI. Share your experiences or questions in the comments! Follow my journey: @Sathish_Daggula on X.</p>
]]></content:encoded></item><item><title><![CDATA[Developing an Offline-First Fitness App with React Native: The Journey of Gym Tracker]]></title><description><![CDATA[Introduction
Fitness apps have become an integral part of our lives, guiding us in maintaining a healthy lifestyle. As the creator of Gym Tracker, an app currently on the waitlist with 423 exercises and an AI coach, my vision was to build an offline-...]]></description><link>https://blog.dvskr.dev/developing-an-offline-first-fitness-app-with-react-native-the-journey-of-gym-tracker</link><guid isPermaLink="true">https://blog.dvskr.dev/developing-an-offline-first-fitness-app-with-react-native-the-journey-of-gym-tracker</guid><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Wed, 14 Jan 2026 15:00:21 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Fitness apps have become an integral part of our lives, guiding us in maintaining a healthy lifestyle. As the creator of Gym Tracker, an app currently on the waitlist with 423 exercises and an AI coach, my vision was to build an offline-first app. This decision was driven by the need to provide uninterrupted service to users, even in areas with poor connectivity. In this article, I'll share the journey of developing Gym Tracker using React Native and Expo, focusing on the strategies that enabled offline functionality and health data synchronization.</p>
<h2 id="heading-choosing-react-native-and-expo">Choosing React Native and Expo</h2>
<p>React Native was the natural choice for Gym Tracker due to its ability to deliver a native app experience on both iOS and Android from a single codebase. Expo further simplified the process, offering tools and libraries that accelerated development.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">import</span> { registerRootComponent } <span class="hljs-keyword">from</span> <span class="hljs-string">'expo'</span>;
<span class="hljs-keyword">import</span> App <span class="hljs-keyword">from</span> <span class="hljs-string">'./App'</span>;

registerRootComponent(App); <span class="hljs-comment">// Registers App as the root component for both Expo Go and native builds</span>
</code></pre>
<h2 id="heading-implementing-offline-first-strategy">Implementing Offline-First Strategy</h2>
<p>One of the primary challenges was ensuring the app functioned offline. To achieve this, I used AsyncStorage for local data storage, allowing critical data such as user progress and workouts to be accessed without an internet connection.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">import</span> AsyncStorage <span class="hljs-keyword">from</span> <span class="hljs-string">'@react-native-async-storage/async-storage'</span>;

<span class="hljs-keyword">const</span> storeWorkout = <span class="hljs-keyword">async</span> (workout) =&gt; {
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> jsonValue = <span class="hljs-built_in">JSON</span>.stringify(workout);
    <span class="hljs-keyword">await</span> AsyncStorage.setItem(<span class="hljs-string">'@workout_key'</span>, jsonValue);
  } <span class="hljs-keyword">catch</span> (e) {
    <span class="hljs-built_in">console</span>.error(<span class="hljs-string">'Error storing the workout:'</span>, e);
  }
};
</code></pre>
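<p>Storing locally is only half of offline-first; the other half is deciding what happens when connectivity returns. The sketch below is illustrative only (the <code>uploadFn</code> server call is a placeholder, not Gym Tracker's actual sync code), but the key pattern is real: remove an item from the queue only after its upload succeeds, so a dropped connection mid-flush never loses data.</p>
<pre><code class="lang-javascript">// Minimal offline write queue (illustrative sketch; uploadFn is a
// hypothetical server call, not the app's real sync implementation).
class SyncQueue {
  constructor(uploadFn) {
    this.uploadFn = uploadFn; // async function that pushes one workout to the server
    this.pending = [];        // workouts saved while offline
  }

  // Called on every save: data is already in AsyncStorage; queue it for upload.
  enqueue(workout) {
    this.pending.push(workout);
  }

  // Called when connectivity returns (e.g. from a NetInfo listener).
  async flush() {
    while (this.pending.length > 0) {
      await this.uploadFn(this.pending[0]); // throws if the network drops again
      this.pending.shift();                 // remove only after a successful upload
    }
  }
}
</code></pre>
<p>In the app itself, <code>flush()</code> would typically be wired to a connectivity listener such as the one provided by <code>@react-native-community/netinfo</code>.</p>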
<h2 id="heading-ai-integration-for-personalized-coaching">AI Integration for Personalized Coaching</h2>
<p>Integrating AI for personalized coaching was another ambitious feature. Using machine learning models, the app provides tailored exercise recommendations based on the user's goals and performance.</p>
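<p>I can't share the model itself here, but the shape of the problem is easy to sketch. Below is a deliberately simplified stand-in heuristic, not the real coach: recommend the exercise trained least recently, treating never-trained exercises as the stalest. None of the names in this snippet come from the actual codebase.</p>
<pre><code class="lang-javascript">// Simplified stand-in for the recommendation step (illustrative only;
// the real app uses an AI coach, not this heuristic).
function recommendNext(exercises, history) {
  // Find the most recent timestamp at which each exercise was performed.
  const lastTrained = {};
  for (const entry of history) {
    lastTrained[entry.exercise] = Math.max(
      lastTrained[entry.exercise] || 0,
      entry.timestamp
    );
  }
  // Never-trained exercises default to 0, so they sort to the front.
  return exercises
    .slice() // avoid mutating the caller's array
    .sort((a, b) =&gt; (lastTrained[a] || 0) - (lastTrained[b] || 0))[0];
}
</code></pre>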
<h2 id="heading-health-data-synchronization">Health Data Synchronization</h2>
<p>Synchronizing health data across devices was essential, especially for tracking metrics like steps and calories burned. Using community libraries that bridge to Apple HealthKit on iOS and to Google Fit on Android (such as <code>react-native-google-fit</code>), I ensured seamless integration and synchronization.</p>
<pre><code class="lang-javascript">import GoogleFit from 'react-native-google-fit'; // default export, not a namespace

// Start background recording of fitness data (steps, distance, etc.).
GoogleFit.startRecording((callback) =&gt; {
  console.log('Steps data is being recorded:', callback);
});
</code></pre>
<h2 id="heading-designing-an-intuitive-user-interface">Designing an Intuitive User Interface</h2>
<p>The user interface of Gym Tracker needed to be both functional and visually appealing. I used a vibrant color palette to create an energetic feel, ensuring users remain motivated and engaged.</p>
<pre><code class="lang-jsx">{/* Illustrative banner only; the original snippet's JSX tags were stripped */}
&lt;Text style={{ color: '#E64A19', fontSize: 24, fontWeight: 'bold' }}&gt;
  Welcome to Gym Tracker!
&lt;/Text&gt;
</code></pre>
<h2 id="heading-testing-and-deployment">Testing and Deployment</h2>
<p>Extensive testing on different devices was critical to ensure the app's performance and reliability. Using Expo's over-the-air updates, I could roll out fixes and improvements quickly without requiring users to update the app manually.</p>
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<ul>
<li>React Native and Expo offer powerful tools for building cross-platform apps with native performance.</li>
<li>Offline-first strategies ensure app functionality even in low-connectivity areas.</li>
<li>AI integration enhances user experience by providing personalized fitness recommendations.</li>
<li>Synchronizing health data across platforms is crucial for comprehensive fitness tracking.</li>
<li>An intuitive and engaging UI is vital for user retention and motivation.</li>
</ul>
<h2 id="heading-cta">CTA</h2>
<p>Eager to dive into mobile development? Start experimenting with React Native and Expo. Have insights or questions to share? Drop a comment below! Follow my journey: @Sathish_Daggula on X.</p>
]]></content:encoded></item></channel></rss>