<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[sathish builds]]></title><description><![CDATA[Engineering decisions, build logs, and lessons from shipping real products solo. Creator of pmhnphiring.com and currently building a gym tracker. Writing about what actually works (and what doesn't).]]></description><link>https://blog.dvskr.dev</link><generator>RSS for Node</generator><lastBuildDate>Thu, 09 Apr 2026 12:08:32 GMT</lastBuildDate><atom:link href="https://blog.dvskr.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Why I Use Partial Indexes for “Active Jobs” in Postgres]]></title><description><![CDATA[The Problem
My job board has a simple read path on paper: show “active” jobs, let users filter by location, remote/hybrid, specialty, and recency.
In production it wasn’t simple.
I had 8,000+ active listings, ~2,000 companies, and ~200 listing update...]]></description><link>https://blog.dvskr.dev/why-i-use-partial-indexes-for-active-jobs-in-postgres</link><guid isPermaLink="true">https://blog.dvskr.dev/why-i-use-partial-indexes-for-active-jobs-in-postgres</guid><category><![CDATA[Databases]]></category><category><![CDATA[Next.js]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[systemdesign]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Tue, 07 Apr 2026 15:00:56 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-the-problem">The Problem</h2>
<p>My job board has a simple read path on paper: show “active” jobs, let users filter by location, remote/hybrid, specialty, and recency.</p>
<p>In production it wasn’t simple.</p>
<p>I had 8,000+ active listings, ~2,000 companies, and ~200 listing updates per day coming from 10+ sources. Every source had its own “is this still active?” logic, so listings flipped between <code>active</code> and <code>expired</code> constantly. Users mostly care about <em>active</em> jobs. My database still had to keep expired rows for dedupe and audit.</p>
<p>The obvious approach was: add composite indexes for the filters.</p>
<p>That worked… until it didn’t. The indexes grew with expired rows too. Write amplification got worse. Autovacuum started showing up in my p95 latency charts. The ingestion pipeline didn’t fall over, but it got annoyingly close.</p>
<p>I wanted fast reads without paying the index tax on rows nobody queries.</p>
<h2 id="heading-options-i-considered">Options I Considered</h2>
<p>I ended up looking at three real options.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<th>Approach</th><th>Pros</th><th>Cons</th><th>Best For</th></tr>
</thead>
<tbody>
<tr>
<td>One big composite index across all jobs</td><td>Simple mental model. Queries “just work.”</td><td>Index includes expired rows; grows forever. Higher write cost on every update. Hard to tune.</td><td>Small datasets or low churn tables</td></tr>
<tr>
<td>Table partitioning (active vs expired, or by time)</td><td>Physical separation. Can drop old partitions. Vacuum is easier.</td><td>Operational overhead. More DDL, more footguns. Partition pruning depends on query shape. Supabase migrations get trickier.</td><td>Very large tables (10M+), strict retention rules</td></tr>
<tr>
<td>Partial indexes on <code>status='active'</code></td><td>Small index. Low write cost for expired rows. Keeps query planner happy.</td><td>Queries must match the predicate. You’ll create multiple indexes if you have multiple “active” query patterns.</td><td>Medium/large tables where most queries target a subset</td></tr>
</tbody>
</table>
</div><h3 id="heading-why-not-partitioning">Why not partitioning?</h3>
<p>Partitioning is legit. If I had 50,000,000 rows with strict retention, I’d go there.</p>
<p>My table wasn’t that big. The churn was the issue.</p>
<p>Also: I’m running this on Postgres via Supabase. Partitioning is doable, but every migration becomes more delicate (especially if you need to change partition keys later). I’ve shipped enough DDL changes at 1:00 AM to know what I’m signing up for.</p>
<h3 id="heading-why-not-just-cache-it">Why not “just cache it”?</h3>
<p>I didn’t want Redis as a band-aid for avoidable index mistakes.</p>
<p>Caching helps, but my traffic pattern isn’t “same query repeated.” It’s lots of combinations: location + remote + posted_at + specialty. You cache the top few, sure, but the database still needs to handle the long tail.</p>
<p>So I stayed in Postgres and fixed the root cause.</p>
<h2 id="heading-what-i-chose-and-why">What I Chose (and Why)</h2>
<p>I moved from “index everything” to “index what users actually query”: active jobs.</p>
<p>That meant partial indexes.</p>
<h3 id="heading-schema-simplified">Schema (simplified)</h3>
<p>I keep a single <code>jobs</code> table with a status field. Expired rows stay. They’re useful for dedupe and for avoiding re-ingesting the same job from a source that republishes.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Postgres</span>
<span class="hljs-keyword">create</span> <span class="hljs-keyword">type</span> job_status <span class="hljs-keyword">as</span> enum (<span class="hljs-string">'active'</span>, <span class="hljs-string">'expired'</span>);

<span class="hljs-keyword">create</span> <span class="hljs-keyword">table</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> public.jobs (
  <span class="hljs-keyword">id</span> bigserial primary <span class="hljs-keyword">key</span>,
  company_id <span class="hljs-built_in">bigint</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span>,
  title <span class="hljs-built_in">text</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span>,
  location <span class="hljs-built_in">text</span>,
  remote <span class="hljs-built_in">boolean</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">default</span> <span class="hljs-literal">false</span>,
  specialty <span class="hljs-built_in">text</span>,
  <span class="hljs-keyword">status</span> job_status <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">default</span> <span class="hljs-string">'active'</span>,
  posted_at timestamptz <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span>,
  updated_at timestamptz <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">default</span> <span class="hljs-keyword">now</span>()
);
</code></pre>
<h3 id="heading-the-partial-indexes">The partial indexes</h3>
<p>My hottest query is: active jobs ordered by recency, filtered by a couple of fields.</p>
<p>So I created indexes that only cover <code>status='active'</code>.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Fast “feed” ordering for active jobs</span>
<span class="hljs-keyword">create</span> <span class="hljs-keyword">index</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> jobs_active_posted_at_id_idx
<span class="hljs-keyword">on</span> public.jobs (posted_at <span class="hljs-keyword">desc</span>, <span class="hljs-keyword">id</span> <span class="hljs-keyword">desc</span>)
<span class="hljs-keyword">where</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'active'</span>;

<span class="hljs-comment">-- Common filter: remote + recency</span>
<span class="hljs-keyword">create</span> <span class="hljs-keyword">index</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> jobs_active_remote_posted_at_id_idx
<span class="hljs-keyword">on</span> public.jobs (remote, posted_at <span class="hljs-keyword">desc</span>, <span class="hljs-keyword">id</span> <span class="hljs-keyword">desc</span>)
<span class="hljs-keyword">where</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'active'</span>;

<span class="hljs-comment">-- Common filter: location (text) + recency</span>
<span class="hljs-keyword">create</span> <span class="hljs-keyword">index</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> jobs_active_location_posted_at_id_idx
<span class="hljs-keyword">on</span> public.jobs (location, posted_at <span class="hljs-keyword">desc</span>, <span class="hljs-keyword">id</span> <span class="hljs-keyword">desc</span>)
<span class="hljs-keyword">where</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'active'</span> <span class="hljs-keyword">and</span> location <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span>;
</code></pre>
<p>Why the <code>(posted_at desc, id desc)</code> tail everywhere?</p>
<p>Because I paginate by recency and I need a stable tie-breaker. Two jobs can share the same <code>posted_at</code> down to the second (scrapers do that). Without <code>id</code>, keyset pagination gets weird.</p>
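<p>With keyset pagination, “after the cursor” under <code>(posted_at desc, id desc)</code> spells out to a two-part predicate. A sketch of building it as a PostgREST filter string for supabase-js’s <code>.or()</code> (the helper and cursor shape are illustrative, not production code):</p>
<pre><code class="lang-ts">// Build the keyset predicate for "rows strictly after the cursor"
// when ordering by (posted_at desc, id desc):
// posted_at below the cursor, OR equal posted_at with a lower id.
type Cursor = { postedAt: string; id: number };

function keysetFilter(cursor: Cursor): string {
  return [
    "posted_at.lt." + cursor.postedAt,
    "and(posted_at.eq." + cursor.postedAt + ",id.lt." + cursor.id + ")",
  ].join(",");
}
</code></pre>
<p>Appended to the feed query as <code>q.or(keysetFilter(cursor))</code>, the next page starts exactly where the last one ended, even when several rows share a <code>posted_at</code>.</p>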
<h3 id="heading-the-trade-offs-what-i-gave-up">The trade-offs (what I gave up)</h3>
<ul>
<li>I gave up “one index to rule them all.” Now I have a small set of indexes that map to real query shapes.</li>
<li>I gave up some flexibility. If I forget <code>status='active'</code> in a query, the planner won’t use the partial index. You feel it immediately.</li>
<li>I accepted more schema work during feature development. Every new filter is a question: does it deserve an index?</li>
</ul>
<p>That said, the ingestion pipeline stopped paying for rows nobody reads.</p>
<h3 id="heading-querying-from-nextjs-supabase">Querying from Next.js (Supabase)</h3>
<p>This is roughly what my server route does for the jobs feed.</p>
<pre><code class="lang-ts"><span class="hljs-keyword">import</span> { createClient } <span class="hljs-keyword">from</span> <span class="hljs-string">"@supabase/supabase-js"</span>;

<span class="hljs-keyword">const</span> supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

<span class="hljs-keyword">type</span> FeedParams = {
  remote?: <span class="hljs-built_in">boolean</span>;
  location?: <span class="hljs-built_in">string</span>;
  limit?: <span class="hljs-built_in">number</span>;
};

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">fetchActiveJobsFeed</span>(<span class="hljs-params">params: FeedParams</span>) </span>{
  <span class="hljs-keyword">const</span> limit = <span class="hljs-built_in">Math</span>.min(params.limit ?? <span class="hljs-number">25</span>, <span class="hljs-number">50</span>);

  <span class="hljs-keyword">let</span> q = supabase
    .from(<span class="hljs-string">"jobs"</span>)
    .select(<span class="hljs-string">"id, company_id, title, location, remote, specialty, posted_at"</span>)
    .eq(<span class="hljs-string">"status"</span>, <span class="hljs-string">"active"</span>)
    .order(<span class="hljs-string">"posted_at"</span>, { ascending: <span class="hljs-literal">false</span> })
    .order(<span class="hljs-string">"id"</span>, { ascending: <span class="hljs-literal">false</span> })
    .limit(limit);

  <span class="hljs-keyword">if</span> (params.remote !== <span class="hljs-literal">undefined</span>) q = q.eq(<span class="hljs-string">"remote"</span>, params.remote);
  <span class="hljs-keyword">if</span> (params.location) q = q.eq(<span class="hljs-string">"location"</span>, params.location);

  <span class="hljs-keyword">const</span> { data, error } = <span class="hljs-keyword">await</span> q;
  <span class="hljs-keyword">if</span> (error) <span class="hljs-keyword">throw</span> error;
  <span class="hljs-keyword">return</span> data;
}
</code></pre>
<p>That <code>eq("status", "active")</code> isn’t optional anymore. It’s part of the contract.</p>
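<p>One way to keep that contract enforceable is to route every read through a helper that bakes the predicate in (<code>activeJobs</code> is a hypothetical wrapper of mine, not Supabase API):</p>
<pre><code class="lang-ts">// All feed/search reads go through this helper, so the status='active'
// predicate (the partial-index contract) can't be forgotten in one route.
function activeJobs(client: { from: (table: string) => any }) {
  return client.from("jobs").select("*").eq("status", "active");
}
</code></pre>
<p>Routes then add their own filters on top of the returned builder instead of starting from <code>client.from("jobs")</code> directly.</p>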
<h2 id="heading-how-it-worked-in-production">How It Worked in Production</h2>
<p>Before partial indexes, I tried a single composite index that included <code>status</code> but covered the whole table. It “worked” until the table accumulated expired rows.</p>
<p>The symptoms were boring and painful:</p>
<ul>
<li>p95 for the main feed query drifted from 50ms to 120ms over a couple of weeks.</li>
<li>Ingestion updates (status flips + updated_at) started taking long enough that my cron window got tight.</li>
<li>Autovacuum activity correlated with read latency spikes.</li>
</ul>
<p>After moving to partial indexes, the feed stabilized:</p>
<ul>
<li>50ms p95 for the job feed query (active + ordered by recency) under normal load.</li>
<li>Write cost dropped because expired rows stopped participating in the biggest indexes.</li>
<li>Index bloat slowed down visibly. I still vacuum, but it’s not constantly fighting giant indexes that include dead weight.</li>
</ul>
<p>The surprise: I initially created too many partial indexes.</p>
<p>I mirrored every filter permutation (remote + specialty + location + …). Bad move. Postgres can combine bitmap scans sometimes, but you still pay maintenance overhead per index. I deleted the low-value ones and kept only what matched real traffic.</p>
<p>I got that traffic data by logging normalized filter shapes (not raw text) from the API: <code>remote=true</code>, <code>location=CA</code>, <code>specialty=child-adolescent</code>, etc. Two days of logs made the index decisions obvious.</p>
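<p>The shape logger itself is tiny. A sketch (the field set is illustrative):</p>
<pre><code class="lang-ts">// Reduce a request's filters to a stable, low-cardinality "shape" key.
// Only normalized values go in (booleans, state codes, slugs), never raw text.
type Filters = { remote?: boolean; location?: string; specialty?: string };

function filterShape(f: Filters): string {
  const parts: string[] = [];
  if (f.remote !== undefined) parts.push("remote=" + f.remote);
  if (f.location) parts.push("location=" + f.location);
  if (f.specialty) parts.push("specialty=" + f.specialty);
  return parts.sort().join("&amp;") || "none";
}
</code></pre>
<p>Tallying these strings for a couple of days made it obvious which three indexes earned their keep.</p>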
<h2 id="heading-when-this-doesnt-work">When This Doesn't Work</h2>
<p>Partial indexes break down when the “hot subset” isn’t stable.</p>
<p>If users query <em>all statuses</em> equally, you don’t have a subset to target. Same story if your predicate changes constantly (today it’s <code>status='active'</code>, tomorrow it’s “active OR sponsored OR pinned”).</p>
<p>Also: if you have hundreds of tenants and each tenant mostly queries its own rows, partial indexes per-tenant are a trap. You’ll drown in indexes. At that point I’d rather use a composite index on <code>(tenant_id, posted_at, id)</code> and keep the schema boring.</p>
<p>And if you genuinely need strict retention and cheap drops, partitioning wins.</p>
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<ul>
<li>Index what users query, not what exists in the table. My users query <code>status='active'</code> almost exclusively.</li>
<li>Partial indexes are a write-optimization tool as much as a read-optimization tool.</li>
<li>Keep index count low. I started with 9 partial indexes and ended with 3 that mattered.</li>
<li>Make query shape a contract. If the app forgets the predicate (<code>status='active'</code>), performance becomes random.</li>
<li>Use real traffic to drive index design. Two days of filter-shape logs saved me from guessing.</li>
</ul>
<h2 id="heading-closing">Closing</h2>
<p>Partial indexes gave me predictable reads without turning my ingest pipeline into an index-maintenance job.</p>
<p>If you’re running Postgres for a “mostly-active” dataset: do you model it as a status column with partial indexes, or do you physically split hot/cold data (partitioning or separate tables)? Where did your approach start hurting?</p>
]]></content:encoded></item><item><title><![CDATA[How We Turn a 35% BLS PMHNP Growth Projection Into Search, Alerts, and Better Job Matches]]></title><description><![CDATA[The BLS projection (35% PMHNP growth from 2024–2034) is a macro signal. The hard part is translating it into a daily system that answers: where are the roles, what do they pay, how fast do they move, and which postings are actually real.
How We Turn ...]]></description><link>https://blog.dvskr.dev/how-we-turn-a-35-bls-pmhnp-growth-projection-into-search-alerts-and-better-job-matches</link><guid isPermaLink="true">https://blog.dvskr.dev/how-we-turn-a-35-bls-pmhnp-growth-projection-into-search-alerts-and-better-job-matches</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[healthcare tech]]></category><category><![CDATA[Next.js]]></category><category><![CDATA[supabase]]></category><category><![CDATA[TypeScript]]></category><category><![CDATA[webdev]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Thu, 05 Mar 2026 16:00:39 GMT</pubDate><content:encoded><![CDATA[<p>The BLS projection (35% PMHNP growth from 2024–2034) is a macro signal. The hard part is translating it into a daily system that answers: where are the roles, what do they pay, how fast do they move, and which postings are actually real.</p>
<h1 id="heading-how-we-turn-a-35-bls-pmhnp-growth-projection-into-search-alerts-and-better-job-matches">How We Turn a 35% BLS PMHNP Growth Projection Into Search, Alerts, and Better Job Matches</h1>
<p>The BLS projection of <strong>35% PMHNP job growth (2024–2034)</strong> is a clean number that shows up in headlines. As builders, we treat it as a <em>macro input</em>—useful, but not directly actionable.</p>
<p>What’s actionable is what that trend turns into at the job-posting layer:</p>
<ul>
<li>more postings (and more duplicates)</li>
<li>faster hiring cycles (time-to-fill compresses)</li>
<li>wider variance by state, setting, and employer type</li>
<li>more compensation noise (ranges, bonuses, productivity, telehealth modifiers)</li>
</ul>
<p>At <strong>PMHNP Hiring</strong>, we aggregate from <strong>500+ job sources daily</strong> and maintain <strong>10,000+ verified PMHNP jobs across 50 states</strong>. The goal isn’t to repeat the BLS number—it’s to make it queryable: <em>where is growth showing up today, and what does “good role” mean in that market?</em></p>
<h2 id="heading-macro-projection-micro-signals-we-can-measure">Macro projection → micro signals we can measure</h2>
<p>The 35% projection is best read as sustained demand: more people seeking care, expanding access efforts, and pressure on psychiatry capacity. PMHNPs sit right in the middle.</p>
<p>But “more jobs” doesn’t mean “every job is a fit.” So we track operational signals that correlate with a heated market:</p>
<ul>
<li><strong>posting velocity</strong> (new jobs per day/week by state + setting)</li>
<li><strong>time-to-fill proxies</strong> (how long a posting stays live, how often it gets refreshed)</li>
<li><strong>credentialing friction</strong> (signals in text like “credentialing support,” “start in 2–4 weeks,” etc.)</li>
<li><strong>comp package completeness</strong> (whether salary, schedule, supervision, and ramp are specified)</li>
</ul>
<p>One example we surface internally is a rolling <em>days-live</em> metric. It’s not perfect, but it’s observable at scale.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Rough “days live” metric from our normalized postings table</span>
<span class="hljs-keyword">select</span>
  state,
  <span class="hljs-keyword">percentile_cont</span>(<span class="hljs-number">0.5</span>) <span class="hljs-keyword">within</span> <span class="hljs-keyword">group</span> (<span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> (<span class="hljs-keyword">now</span>() - first_seen_at)) <span class="hljs-keyword">as</span> median_days_live,
  <span class="hljs-keyword">count</span>(*) <span class="hljs-keyword">as</span> active_jobs
<span class="hljs-keyword">from</span> jobs
<span class="hljs-keyword">where</span>
  <span class="hljs-keyword">role</span> = <span class="hljs-string">'PMHNP'</span>
  <span class="hljs-keyword">and</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'active'</span>
<span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> state
<span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> median_days_live <span class="hljs-keyword">asc</span>;
</code></pre>
<p>If median days-live drops, it often matches what clinicians feel as “employers are moving faster.” In the original blog, we cited time-to-fill tightening (e.g., ~32 days vs ~45). We treat those as hypotheses and validate them against our own observed posting lifecycle.</p>
<h2 id="heading-the-ingestion-pipeline-500-sources-one-schema">The ingestion pipeline: 500+ sources, one schema</h2>
<p>Scraping PMHNP jobs isn’t “fetch HTML, parse title.” Every source has its own quirks:</p>
<ul>
<li>different location formats (city/state, remote, multi-state, “within 50 miles”)</li>
<li>different salary formats (hourly, annual, per-visit, wide ranges)</li>
<li>duplicated listings across ATS platforms, job boards, and staffing agencies</li>
<li>stale posts that get re-published with new IDs</li>
</ul>
<p>Our pipeline is built to turn that mess into a stable contract:</p>
<ol>
<li><strong>Fetch</strong> (scheduled jobs) from boards/ATS feeds</li>
<li><strong>Parse</strong> into a common intermediate model</li>
<li><strong>Normalize</strong> fields (title → role, location → geo, salary → annualized range)</li>
<li><strong>Deduplicate</strong> and <strong>verify</strong></li>
<li><strong>Index</strong> for real-time filtering + alerts</li>
</ol>
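<p>The five steps above can be sketched as a single pass per source (the stage signatures are illustrative, not our production interfaces):</p>
<pre><code class="lang-ts">type RawPosting = { source: string; payload: string };
type NormalizedJob = { role: string; state: string; annualMin?: number };

// Each stage is a pure function; the pipeline is just composition,
// which keeps per-source quirks contained inside parse().
function runPipeline(
  raw: RawPosting[],
  parse: (r: RawPosting) => NormalizedJob | null,
  dedupeKey: (j: NormalizedJob) => string
): NormalizedJob[] {
  const seen: { [key: string]: NormalizedJob } = {};
  for (const r of raw) {
    const job = parse(r);          // parse + normalize
    if (!job) continue;            // unparseable rows are dropped, not guessed
    seen[dedupeKey(job)] = job;    // later sources overwrite earlier duplicates
  }
  return Object.values(seen);     // what actually gets indexed and alerted on
}
</code></pre>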
<p>Tech stack pieces:</p>
<ul>
<li><strong>Next.js + TypeScript</strong> for UI and API routes</li>
<li><strong>Supabase (Postgres)</strong> for storage + full-text search + RLS</li>
<li><strong>Stripe</strong> for billing (alerts/subscriptions)</li>
</ul>
<h2 id="heading-deduplication-the-more-jobs-trap">Deduplication: the “more jobs” trap</h2>
<p>Job growth increases volume, but it also increases duplicates. A single PMHNP opening can appear on:</p>
<ul>
<li>the employer site</li>
<li>an ATS mirror</li>
<li>2–5 job boards</li>
<li>a staffing listing with rewritten text</li>
</ul>
<p>If we don’t dedupe, users think there are more unique opportunities than there are—and alerts become spam.</p>
<p>We generate a fingerprint using a mix of deterministic and fuzzy signals:</p>
<ul>
<li>normalized employer name</li>
<li>canonicalized location (lat/lng + radius buckets)</li>
<li>role taxonomy (PMHNP vs “Psych NP” variants)</li>
<li>compensation overlap (when present)</li>
<li>text similarity on responsibilities and requirements</li>
</ul>
<p>Pseudo-code sketch:</p>
<pre><code class="lang-ts"><span class="hljs-keyword">type</span> Job = {
  title: <span class="hljs-built_in">string</span>
  employer: <span class="hljs-built_in">string</span>
  city?: <span class="hljs-built_in">string</span>
  state?: <span class="hljs-built_in">string</span>
  lat?: <span class="hljs-built_in">number</span>
  lng?: <span class="hljs-built_in">number</span>
  description: <span class="hljs-built_in">string</span>
  salaryMin?: <span class="hljs-built_in">number</span>
  salaryMax?: <span class="hljs-built_in">number</span>
}

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">fingerprint</span>(<span class="hljs-params">job: Job</span>) </span>{
  <span class="hljs-keyword">return</span> hash([
    normalizeEmployer(job.employer),
    normalizeRole(job.title),
    geoBucket(job.lat, job.lng, <span class="hljs-number">0.25</span>),
    salaryBucket(job.salaryMin, job.salaryMax),
    simhash(normalizeText(job.description))
  ].join(<span class="hljs-string">'|'</span>))
}
</code></pre>
<p>This is what turns macro growth into a trustworthy count of <em>verified</em> jobs.</p>
<h2 id="heading-salary-data-why-139k155k-is-a-normalization-problem">Salary data: why “$139K–$155K” is a normalization problem</h2>
<p>Clinically, people want to know pay. Technically, pay is one of the messiest fields we ingest.</p>
<p>Common failure modes:</p>
<ul>
<li>hourly rates without hours/week</li>
<li>“$120k–$250k” ranges that include productivity/bonus but aren’t labeled</li>
<li>sign-on bonuses mixed into base</li>
<li>“per diem” roles listed as annual</li>
<li>DNP vs MSN differentials inconsistently stated</li>
</ul>
<p>So we normalize salaries into an annualized range with metadata:</p>
<ul>
<li><code>pay_type</code> (hourly/annual/unknown)</li>
<li><code>annual_min</code>, <code>annual_max</code></li>
<li><code>confidence_score</code></li>
<li><code>includes_bonus</code> (best-effort)</li>
</ul>
<p>Example normalization logic:</p>
<pre><code class="lang-ts"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">annualize</span>(<span class="hljs-params">{ payType, min, max, hoursPerWeek = 40 }: <span class="hljs-built_in">any</span></span>) </span>{
  <span class="hljs-keyword">if</span> (payType === <span class="hljs-string">'hourly'</span>) {
    <span class="hljs-keyword">return</span> {
      annualMin: min * hoursPerWeek * <span class="hljs-number">52</span>,
      annualMax: max * hoursPerWeek * <span class="hljs-number">52</span>,
    }
  }
  <span class="hljs-keyword">return</span> { annualMin: min, annualMax: max }
}
</code></pre>
<p>That’s how we can talk about national ranges (like ~$139K–$155K common bands, entry-level around ~$126K) while still being honest about variance and data quality.</p>
<h2 id="heading-what-the-growth-signal-means-for-career-confidence-as-data">What the growth signal means for “career confidence” (as data)</h2>
<p>The BLS projection improves the odds that you’ll find opportunities—but the <em>quality</em> of those opportunities depends on factors you can filter for:</p>
<ul>
<li>setting (outpatient, community mental health, integrated care, telepsych)</li>
<li>onboarding support (mentorship/supervision signals)</li>
<li>realistic ramp + admin time</li>
<li>credentialing speed</li>
</ul>
<p>From a product standpoint, this is why we invest in structured fields extracted from unstructured text. A fast offer can correlate with staffing pressure; we try to surface the context so users can choose strategically.</p>
<p>In other words: the 35% growth projection is the headline. The system work is turning it into a search experience where you can reliably answer, <strong>“Where are the real roles, and which ones are built to support me once I’m hired?”</strong></p>
]]></content:encoded></item><item><title><![CDATA[How We Counted 693 Live PMHNP Openings in California (and Why Volume ≠ Fit)]]></title><description><![CDATA[California shows 693 PMHNP openings in our index. The interesting part isn’t the number—it’s how you get a trustworthy count from messy job feeds, and what the distribution says about metros, settings, and competition.
California State Spotlight: 693...]]></description><link>https://blog.dvskr.dev/how-we-counted-693-live-pmhnp-openings-in-california-and-why-volume-fit</link><guid isPermaLink="true">https://blog.dvskr.dev/how-we-counted-693-live-pmhnp-openings-in-california-and-why-volume-fit</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[healthcare tech]]></category><category><![CDATA[Next.js]]></category><category><![CDATA[supabase]]></category><category><![CDATA[TypeScript]]></category><category><![CDATA[webdev]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Fri, 27 Feb 2026 16:00:17 GMT</pubDate><content:encoded><![CDATA[<p>California shows 693 PMHNP openings in our index. The interesting part isn’t the number—it’s how you get a trustworthy count from messy job feeds, and what the distribution says about metros, settings, and competition.</p>
<h1 id="heading-california-state-spotlight-693-openings-highest-volumeheres-what-the-data-is-really-saying">California State Spotlight: 693 openings, highest volume—here’s what the data is really saying</h1>
<p>California is the highest-volume PMHNP market in our dataset right now: <strong>693 verified openings</strong>. If you’re building a job aggregator, that number isn’t a headline—it’s a stress test.</p>
<p>“693” only matters if it’s <strong>current</strong>, <strong>deduplicated</strong>, and <strong>geographically correct</strong>. Job boards repost. Health systems syndicate. Staffing firms clone. Locations get written as “Bay Area” or “Remote (CA)” or “Various Locations.” Salary ranges show up as hourly rates, annual figures, or not at all.</p>
<p>This post is a technical look at how we surface California’s volume on PMHNP Hiring (Next.js + TypeScript + Supabase), and why volume doesn’t automatically mean fit.</p>
<blockquote>
<p>If you want to browse what’s live right now, this is the production view: https://pmhnphiring.com/jobs/state/california</p>
</blockquote>
<hr />
<h2 id="heading-why-ca-leads-in-openings-and-what-leads-means-in-a-pipeline">Why CA leads in openings (and what “leads” means in a pipeline)</h2>
<p>California’s lead is driven by three factors we can observe directly in the ingestion layer:</p>
<ol>
<li><strong>Sheer employer surface area</strong>: large systems + multi-site outpatient groups generate continuous posting churn.</li>
<li><strong>Broad location graph</strong>: dense metros, fast-growing suburbs, and rural shortage zones produce postings across many counties.</li>
<li><strong>High repost velocity</strong>: CA roles are more likely to be syndicated across multiple sources, which inflates raw counts.</li>
</ol>
<p>That third point is where data engineering matters. If we naïvely counted every scraped URL, CA would look even bigger—but it would be wrong.</p>
<p>At a high level, our daily pipeline looks like:</p>
<ul>
<li><strong>Ingest</strong> from 500+ sources (ATS pages, job boards, employer sites)</li>
<li><strong>Normalize</strong> fields (title, employer, location, compensation)</li>
<li><strong>Deduplicate</strong> across syndication</li>
<li><strong>Verify freshness</strong> (remove stale/reposted listings)</li>
<li><strong>Geocode &amp; tag</strong> (state, metro, setting signals)</li>
<li><strong>Serve</strong> via fast filters + alerts</li>
</ul>
<p>California just happens to be where every one of those steps gets exercised at scale.</p>
<hr />
<h2 id="heading-counting-693-dedupe-freshness-are-the-whole-story">Counting “693”: dedupe + freshness are the whole story</h2>
<p>The hardest part of “state spotlights” is making sure the count represents <strong>distinct, open roles</strong>.</p>
<h3 id="heading-1-deduplication-across-sources">1) Deduplication across sources</h3>
<p>A single PMHNP role can appear:</p>
<ul>
<li>on an employer’s ATS</li>
<li>on 3–10 job boards</li>
<li>reposted weekly with a new URL</li>
</ul>
<p>We dedupe by generating a stable fingerprint from normalized fields. The exact recipe evolves, but conceptually:</p>
<pre><code class="lang-ts"><span class="hljs-comment">// simplified</span>
<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">fingerprint</span>(<span class="hljs-params">job: NormalizedJob</span>) </span>{
  <span class="hljs-keyword">return</span> hash([
    normalizeEmployer(job.employerName),
    normalizeTitle(job.title),
    normalizeLocation(job.location), <span class="hljs-comment">// city/state if present</span>
    normalizeReq(job.requirementsText ?? <span class="hljs-string">""</span>),
  ].join(<span class="hljs-string">"|"</span>))
}
</code></pre>
<p>Then we cluster near-duplicates (minor title differences, “Psych NP” vs “PMHNP”) using similarity thresholds.</p>
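<p>The similarity thresholds can start as simple token-set overlap. A sketch using Jaccard similarity over word tokens (the production clustering mixes in more signals, like employer and location):</p>
<pre><code class="lang-ts">// Jaccard similarity over lowercase word tokens: 1.0 means identical sets.
function tokenJaccard(a: string, b: string): number {
  const tok = (s: string) => new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const ta = tok(a);
  const tb = tok(b);
  let shared = 0;
  ta.forEach((t) => { if (tb.has(t)) shared += 1; });
  const unionSize = ta.size + tb.size - shared;
  return unionSize === 0 ? 1 : shared / unionSize;
}
</code></pre>
<p>Pairs above a threshold (say 0.8 on title plus a matching employer) collapse into one cluster. Token overlap alone misses synonyms, which is why the “Psych NP” vs “PMHNP” case still needs the role taxonomy.</p>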
<h3 id="heading-2-freshness-live-vs-stale-repost">2) Freshness: “live” vs “stale repost”</h3>
<p>High-volume states are repost-heavy. To keep the CA page useful, we track signals like:</p>
<ul>
<li><code>last_seen_at</code> (when a crawler last confirmed it exists)</li>
<li><code>source_updated_at</code> (if the source exposes it)</li>
<li>closed/404 signals</li>
</ul>
<p>A job can be popular and still sit open for months—but it needs to be <strong>verifiably available</strong>.</p>
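<p>The liveness decision then reduces to a small predicate over those signals (the 7-day window here is an illustrative default, not our production threshold):</p>
<pre><code class="lang-ts">type FreshnessSignals = {
  lastSeenAt: string;     // ISO timestamp of the last crawler confirmation
  sourceClosed: boolean;  // source returned 404 or marked the role filled
};

// Live means: not closed at the source, and confirmed recently enough.
function isVerifiablyLive(s: FreshnessSignals, now: Date, maxAgeDays: number = 7): boolean {
  if (s.sourceClosed) return false;
  const ageMs = now.getTime() - new Date(s.lastSeenAt).getTime();
  return ageMs >= 0 &amp;&amp; maxAgeDays * 86_400_000 >= ageMs;  // 86.4M ms per day
}
</code></pre>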
<hr />
<h2 id="heading-where-the-ca-jobs-are-geo-resolution-beats-bay-area-strings">Where the CA jobs are: geo resolution beats “Bay Area” strings</h2>
<p>California hiring isn’t evenly distributed, and you can’t analyze distribution if locations are sloppy.</p>
<h3 id="heading-location-normalization-challenges-we-see-in-ca">Location normalization challenges we see in CA</h3>
<ul>
<li>“Los Angeles, CA” (easy)</li>
<li>“San Francisco Bay Area” (needs mapping)</li>
<li>“Remote in California” (state-only)</li>
<li>“Multiple Locations” (often a multi-site group)</li>
</ul>
<p>We resolve locations into a consistent shape:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"state"</span>: <span class="hljs-string">"CA"</span>,
  <span class="hljs-attr">"city"</span>: <span class="hljs-string">"San Diego"</span>,
  <span class="hljs-attr">"metro"</span>: <span class="hljs-string">"San Diego-Chula Vista-Carlsbad, CA"</span>,
  <span class="hljs-attr">"is_remote"</span>: <span class="hljs-literal">false</span>,
  <span class="hljs-attr">"lat"</span>: <span class="hljs-number">32.7157</span>,
  <span class="hljs-attr">"lng"</span>: <span class="hljs-number">-117.1611</span>
}
</code></pre>
<p>When we can’t confidently infer city/metro, we still keep the job (state-level filtering matters), but we avoid over-claiming precision in metro counts.</p>
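<p>To make that concrete, here is a toy resolver over the example strings above; the matching rules and metro label are illustrative, not the real gazetteer:</p>

```typescript
// Toy location resolver for the CA strings listed above.
type ResolvedLocation = { state: "CA"; city?: string; metro?: string; is_remote: boolean };

function resolveCaLocation(raw: string): ResolvedLocation {
  const s = raw.toLowerCase();
  if (s.includes("remote")) return { state: "CA", is_remote: true }; // state-only
  if (s.includes("bay area")) {
    return { state: "CA", metro: "San Francisco-Oakland-Berkeley, CA", is_remote: false };
  }
  const m = raw.match(/^([^,]+),\s*CA$/i); // "City, CA" is the easy case
  if (m) return { state: "CA", city: m[1].trim(), is_remote: false };
  // "Multiple Locations" etc.: keep the job at state level, claim no city precision
  return { state: "CA", is_remote: false };
}
```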
<h3 id="heading-what-the-distribution-usually-looks-like">What the distribution usually looks like</h3>
<ul>
<li><strong>Big metros</strong>: widest mix (outpatient, inpatient, specialty)</li>
<li><strong>Rural/semi-rural</strong>: fewer postings, often higher urgency and narrower candidate pools</li>
</ul>
<p>Outpatient dominates the CA index—clinic medication management, integrated care, and community mental health appear frequently—because those orgs post continuously and across many sites.</p>
<hr />
<h2 id="heading-salary-cost-of-living-normalization-before-conclusions">Salary + cost of living: normalization before conclusions</h2>
<p>California is commonly high-paying, but salary data is messy:</p>
<ul>
<li>hourly vs annual</li>
<li>wide ranges (“$160k–$240k”) vs single numbers</li>
<li>missing compensation (common)</li>
</ul>
<p>We normalize to a comparable annualized range when possible:</p>
<pre><code class="lang-ts"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">annualize</span>(<span class="hljs-params">amount: <span class="hljs-built_in">number</span>, unit: "hour"|"year"</span>) </span>{
  <span class="hljs-keyword">if</span> (unit === <span class="hljs-string">"year"</span>) <span class="hljs-keyword">return</span> amount
  <span class="hljs-keyword">return</span> amount * <span class="hljs-number">40</span> * <span class="hljs-number">52</span>
}
</code></pre>
<p>Then we store both the raw and normalized values so the UI can explain what it’s showing.</p>
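<p>A sketch of what “raw plus normalized” can look like in one record, reusing the same 40×52 assumption as <code>annualize</code> (field names here are illustrative):</p>

```typescript
// Keep the scraped string alongside the normalized annual range so the UI
// can show its work. Field names are assumptions for illustration.
type CompRecord = {
  raw_text: string;      // e.g. "$75–$90/hr" exactly as scraped
  min_annual?: number;
  max_annual?: number;
};

function normalizeComp(rawText: string, min: number, max: number, unit: "hour" | "year"): CompRecord {
  const factor = unit === "hour" ? 40 * 52 : 1; // same assumption as annualize()
  return { raw_text: rawText, min_annual: min * factor, max_annual: max * factor };
}
```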
<p>Cost of living (COL) is not a single number either; it’s region-specific. The product approach we take is: show salary when available, but keep filtering centered on <strong>role constraints</strong> (onsite vs hybrid, setting, call, population) because those are consistently present in the data.</p>
<hr />
<h2 id="heading-volume-fit-what-to-infer-from-693-openings">Volume ≠ fit: what to infer from 693 openings</h2>
<p>From a builder’s point of view, “highest volume” usually means:</p>
<ul>
<li><strong>more duplicates to crush</strong></li>
<li><strong>more employer posting patterns</strong> (waves, evergreen roles)</li>
<li><strong>more variance in requirements</strong> (onsite-only, specific populations, credentialing timelines)</li>
</ul>
<p>For job seekers, that translates to: some CA postings close fast because applicant volume is high; others linger because constraints are tight.</p>
<p>If you’re exploring CA, use the live index to filter down to the jobs that match your constraints instead of optimizing for the raw count:</p>
<ul>
<li>region/commute reality</li>
<li>outpatient vs inpatient</li>
<li>remote/hybrid flags</li>
<li>salary (when present)</li>
</ul>
<p>California leads the country in openings—but the real win is turning that noisy volume into a clean, searchable set of roles you can actually act on.</p>
]]></content:encoded></item><item><title><![CDATA[How We Detect “PMHNP-BC Required” in 500+ Job Feeds (and What the Credential Actually Means)]]></title><description><![CDATA[In our pipeline, “PMHNP-BC required” isn’t a nice-to-have string. It’s a high-signal field that determines whether a job matches a clinician at all—so we treat it like structured data, not copy.
How We Detect “PMHNP-BC Required” in 500+ Job Feeds (an...]]></description><link>https://blog.dvskr.dev/how-we-detect-pmhnp-bc-required-in-500-job-feeds-and-what-the-credential-actually-means</link><guid isPermaLink="true">https://blog.dvskr.dev/how-we-detect-pmhnp-bc-required-in-500-job-feeds-and-what-the-credential-actually-means</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[healthcare tech]]></category><category><![CDATA[Next.js]]></category><category><![CDATA[supabase]]></category><category><![CDATA[TypeScript]]></category><category><![CDATA[webdev]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Thu, 26 Feb 2026 16:00:39 GMT</pubDate><content:encoded><![CDATA[<p>In our pipeline, “PMHNP-BC required” isn’t a nice-to-have string. It’s a high-signal field that determines whether a job matches a clinician at all—so we treat it like structured data, not copy.</p>
<h1 id="heading-how-we-detect-pmhnp-bc-required-in-500-job-feeds-and-what-the-credential-actually-means">How We Detect “PMHNP-BC Required” in 500+ Job Feeds (and What the Credential Actually Means)</h1>
<p>PMHNP-BC shows up in job posts so often it can read like alphabet soup. But in most hiring pipelines it’s a gatekeeper credential: if the posting requires it, your application often won’t make it past an ATS rule or a credentialing checkpoint.</p>
<p>From a product/builder’s angle, that makes <strong>PMHNP-BC</strong> a piece of data we have to extract reliably. If we misread it (a false positive or a false negative), we either:</p>
<ul>
<li>show you jobs you can’t actually credential into, or</li>
<li>hide jobs you’re qualified for.</li>
</ul>
<p>PMHNP Hiring aggregates <strong>500+ sources daily</strong> and maintains <strong>10,000+ verified PMHNP jobs across 50 states</strong>. Credential requirements are one of the highest-impact fields we normalize because they drive real-time filtering, alerts, and matching.</p>
<p>You can see how frequently it appears by scanning current listings on https://pmhnphiring.com/jobs.</p>
<hr />
<h2 id="heading-pmhnp-bc-meaning-as-data-what-the-credential-actually-stands-for">PMHNP-BC meaning (as data): what the credential actually stands for</h2>
<p><strong>PMHNP-BC</strong> stands for <strong>Psychiatric–Mental Health Nurse Practitioner – Board Certified</strong>.</p>
<p>In practice, when employers say “PMHNP-BC required,” they’re almost always referring to <strong>ANCC board certification</strong> for the PMHNP population focus.</p>
<p>Why this matters operationally:</p>
<ul>
<li>Employers aren’t hiring “an NP who does psych.” They’re hiring someone whose education, clinical hours, and board exam align to psychiatric-mental health scope.</li>
<li>The “BC” is HR/credentialing shorthand for a standardized, third-party-verifiable credential.</li>
<li>Many postings are flexible on schedule, setting, and sometimes experience. They’re usually not flexible on board certification because it touches <strong>credentialing, payer enrollment, and risk management</strong>.</li>
</ul>
<p>So we don’t treat “PMHNP-BC” as marketing copy. We treat it like a <strong>structured constraint</strong>.</p>
<hr />
<h2 id="heading-why-most-employers-require-ancc-certification-and-how-that-shows-up-in-job-data">Why most employers require ANCC certification (and how that shows up in job data)</h2>
<p>In a typical hiring flow, a job goes through multiple checkpoints:</p>
<ol>
<li>Recruiter/ATS intake</li>
<li>Clinical leadership review</li>
<li>Credentialing</li>
<li>Payer enrollment</li>
</ol>
<p>ANCC board certification is a clean way for those systems and teams to stay aligned. From the data side, it’s also one of the few credentials that appears consistently across job posts, PDFs, and ATS templates.</p>
<p>That consistency is why it becomes a filterable requirement on our end.</p>
<h3 id="heading-the-problem-job-posts-mention-it-in-messy-ways">The problem: job posts mention it in messy ways</h3>
<p>Across sources, we see variants like:</p>
<ul>
<li><code>PMHNP-BC required</code></li>
<li><code>PMHNP BC</code></li>
<li><code>Board Certified PMHNP</code></li>
<li><code>ANCC certification required</code></li>
<li><code>Psych NP (BC)</code></li>
<li><code>Must be board certified in psychiatry</code></li>
</ul>
<p>And we also see confusing near-misses:</p>
<ul>
<li>“Psych NP preferred” (not necessarily a hard requirement)</li>
<li>“BC/BE” (board certified / board eligible — more common in physician postings, but it leaks into mixed templates)</li>
<li>“Active license required” (license ≠ board certification)</li>
</ul>
<p>So the engineering job is: <strong>turn noisy text into a reliable field</strong>.</p>
<hr />
<h2 id="heading-our-extraction-approach-from-raw-text-to-a-normalized-requirement">Our extraction approach: from raw text to a normalized requirement</h2>
<p>At ingestion time we store the raw posting, then produce a normalized record used for search/matching.</p>
<p>A simplified TypeScript shape looks like:</p>
<pre><code class="lang-ts"><span class="hljs-keyword">export</span> <span class="hljs-keyword">type</span> BoardCert =
  | { required: <span class="hljs-literal">true</span>; body: <span class="hljs-string">"ANCC"</span>; credential: <span class="hljs-string">"PMHNP-BC"</span> }
  | { required: <span class="hljs-literal">false</span>; body?: <span class="hljs-string">"ANCC"</span>; credential?: <span class="hljs-string">"PMHNP-BC"</span> }
  | { required: <span class="hljs-literal">null</span> };

<span class="hljs-keyword">export</span> <span class="hljs-keyword">interface</span> NormalizedJob {
  id: <span class="hljs-built_in">string</span>;
  title: <span class="hljs-built_in">string</span>;
  description: <span class="hljs-built_in">string</span>;
  requirements_text: <span class="hljs-built_in">string</span>;
  board_cert: BoardCert;
  <span class="hljs-comment">// ...salary, location, setting, etc.</span>
}
</code></pre>
<h3 id="heading-step-1-pattern-detection-with-guardrails">Step 1: pattern detection with guardrails</h3>
<p>We start with deterministic signals (regex + keyword proximity) because they’re explainable and easy to debug.</p>
<pre><code class="lang-ts"><span class="hljs-keyword">const</span> PMHNP_BC_PATTERNS = [
  <span class="hljs-regexp">/\bPMHNP\s*[- ]?BC\b/i</span>,
  <span class="hljs-regexp">/\bPsychiatric\s*-?Mental\s*Health\s*Nurse\s*Practitioner\s*-?\s*Board\s*Certified\b/i</span>,
  <span class="hljs-regexp">/\bANCC\b.*\b(PMHNP|Psych)\b/i</span>,
  <span class="hljs-regexp">/\bboard\s*certif(ied|ication)\b.*\bPMHNP\b/i</span>,
];

<span class="hljs-keyword">const</span> REQUIREMENT_CUES = [<span class="hljs-regexp">/\brequired\b/i</span>, <span class="hljs-regexp">/\bmust\b/i</span>, <span class="hljs-regexp">/\bmandatory\b/i</span>];
<span class="hljs-keyword">const</span> SOFT_CUES = [<span class="hljs-regexp">/\bpreferred\b/i</span>, <span class="hljs-regexp">/\bplus\b/i</span>, <span class="hljs-regexp">/\bdesired\b/i</span>];

<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">extractBoardCert</span>(<span class="hljs-params">text: <span class="hljs-built_in">string</span></span>): <span class="hljs-title">BoardCert</span> </span>{
  <span class="hljs-keyword">const</span> hasCredential = PMHNP_BC_PATTERNS.some(<span class="hljs-function">(<span class="hljs-params">re</span>) =&gt;</span> re.test(text));
  <span class="hljs-keyword">if</span> (!hasCredential) <span class="hljs-keyword">return</span> { required: <span class="hljs-literal">null</span> };

  <span class="hljs-keyword">const</span> isRequired = REQUIREMENT_CUES.some(<span class="hljs-function">(<span class="hljs-params">re</span>) =&gt;</span> re.test(text)) &amp;&amp;
    !SOFT_CUES.some(<span class="hljs-function">(<span class="hljs-params">re</span>) =&gt;</span> re.test(text));

  <span class="hljs-keyword">if</span> (isRequired) {
    <span class="hljs-keyword">return</span> { required: <span class="hljs-literal">true</span>, body: <span class="hljs-string">"ANCC"</span>, credential: <span class="hljs-string">"PMHNP-BC"</span> };
  }
  <span class="hljs-comment">// mentioned but not clearly required (or only "preferred")</span>
  <span class="hljs-keyword">return</span> { required: <span class="hljs-literal">false</span>, body: <span class="hljs-string">"ANCC"</span>, credential: <span class="hljs-string">"PMHNP-BC"</span> };
}
</code></pre>
<p>We bias toward <strong>not marking it “required”</strong> unless the copy is explicit. False “required” flags are worse than missing a “preferred” mention because they filter out jobs.</p>
<h3 id="heading-step-2-dedup-canonicalization-across-sources">Step 2: dedup + canonicalization across sources</h3>
<p>The same job often appears on multiple boards with different formatting. Our dedup pipeline clusters postings and merges fields. For credentials, we keep:</p>
<ul>
<li>the most explicit requirement statement (source-of-truth ranking), and</li>
<li>a trace back to the raw text snippet that triggered it.</li>
</ul>
<p>That trace matters when users ask “why did this job get filtered out?” Debuggability is a product feature.</p>
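<p>One way to keep that trace, sketched with an assumed <code>captureTrace</code> helper that stores the matched snippet plus surrounding context (not the production code):</p>

```typescript
// Keep the snippet that triggered a credential flag, with some surrounding
// context, so "why was this job filtered?" is answerable later.
type ExtractionTrace = { pattern: string; snippet: string };

function captureTrace(text: string, re: RegExp, context = 40): ExtractionTrace | null {
  const m = re.exec(text);
  if (!m) return null;
  const start = Math.max(0, m.index - context);
  const end = Math.min(text.length, m.index + m[0].length + context);
  return { pattern: re.source, snippet: text.slice(start, end) };
}
```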
<hr />
<h2 id="heading-how-we-surface-it-filters-matching-and-alerts">How we surface it: filters, matching, and alerts</h2>
<p>Once normalized, <code>board_cert.required === true</code> becomes a first-class filter in the app.</p>
<ul>
<li><strong>Real-time filtering (Next.js)</strong>: users can hide “PMHNP-BC required” jobs if they’re still in school or board-pending.</li>
<li><strong>Custom matching</strong>: if a user profile says “ANCC PMHNP-BC: yes,” those jobs rank higher.</li>
<li><strong>Alerts</strong>: new jobs that flip from “preferred” to “required” (or vice versa) can trigger notifications, because it changes eligibility.</li>
</ul>
<p>This is the builder’s takeaway: “PMHNP-BC” isn’t just a career acronym—it’s a constraint that must survive scraping, parsing, deduping, and ranking.</p>
<hr />
<h2 id="heading-what-pmhnp-bc-signals-and-what-it-doesnt">What PMHNP-BC signals (and what it doesn’t)</h2>
<p>What it signals in job data:</p>
<ul>
<li>The employer expects <strong>board certification for psychiatric-mental health NP scope</strong> (typically ANCC).</li>
<li>The job likely flows into credentialing/payer systems that will verify it.</li>
</ul>
<p>What it doesn’t guarantee:</p>
<ul>
<li>salary level (we still have to normalize comp across hourly/annual/RVU ranges)</li>
<li>autonomy level or supervision model</li>
<li>call burden or patient acuity</li>
</ul>
<p>Those require different extraction pipelines.</p>
<hr />
<h2 id="heading-if-youre-building-job-search-tooling-treat-credentials-like-schema">If you’re building job search tooling, treat credentials like schema</h2>
<p>A lot of job aggregators treat credentials as plain text. In healthcare hiring, credentials behave more like <strong>typed fields</strong> with strict semantics.</p>
<p>For PMHNP Hiring, “PMHNP-BC required” is one of the simplest examples of why: it changes who the job is for.</p>
<p>If you want to explore how often it appears, browse live listings here: https://pmhnphiring.com/jobs.</p>
]]></content:encoded></item><item><title><![CDATA[How We Measure the DNP vs MSN Pay Delta for PMHNP Jobs (and Turn It Into ROI Math)]]></title><description><![CDATA[The “DNP earns $10–20K more than MSN” claim is directionally true in our job dataset—but only after you normalize salary formats, dedupe reposts, and separate degree signals from everything else employers pay for.
How We Measure the DNP vs MSN Pay De...]]></description><link>https://blog.dvskr.dev/how-we-measure-the-dnp-vs-msn-pay-delta-for-pmhnp-jobs-and-turn-it-into-roi-math</link><guid isPermaLink="true">https://blog.dvskr.dev/how-we-measure-the-dnp-vs-msn-pay-delta-for-pmhnp-jobs-and-turn-it-into-roi-math</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[healthcare tech]]></category><category><![CDATA[Next.js]]></category><category><![CDATA[supabase]]></category><category><![CDATA[TypeScript]]></category><category><![CDATA[webdev]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Wed, 25 Feb 2026 16:00:53 GMT</pubDate><content:encoded><![CDATA[<p>The “DNP earns $10–20K more than MSN” claim is directionally true in our job dataset—but only after you normalize salary formats, dedupe reposts, and separate degree signals from everything else employers pay for.</p>
<h1 id="heading-how-we-measure-the-dnp-vs-msn-pay-delta-for-pmhnp-jobs-and-turn-it-into-roi-math">How We Measure the DNP vs MSN Pay Delta for PMHNP Jobs (and Turn It Into ROI Math)</h1>
<p>The DNP vs MSN question usually collapses into one number: <strong>is an extra ~$10–20K/year worth more school?</strong></p>
<p>From a product/data engineering angle, that number is not something you “look up.” It’s something you <strong>derive</strong>—from messy job postings, inconsistent compensation fields, duplicate listings, and fuzzy degree requirements.</p>
<p>PMHNP Hiring aggregates from <strong>500+ job sources daily</strong> and maintains <strong>10,000+ verified PMHNP jobs across all 50 states</strong>. Here’s how we turn raw postings into (1) a defensible pay delta and (2) an honest break-even model you can use.</p>
<hr />
<h2 id="heading-1-the-data-problem-job-postings-arent-a-salary-table">1) The data problem: job postings aren’t a salary table</h2>
<p>A PMHNP posting might say:</p>
<ul>
<li>“$65–$80/hr” (hourly)</li>
<li>“$130k base + bonus” (mixed)</li>
<li>“Up to $180k” (ceiling only)</li>
<li>“Competitive” (no number)</li>
<li>“MSN required; DNP preferred” (ambiguous degree signal)</li>
</ul>
<p>If you just average these strings, you’ll get nonsense.</p>
<h3 id="heading-our-pipeline-high-level">Our pipeline (high level)</h3>
<p><strong>Ingest → Parse → Normalize → Deduplicate → Enrich → Serve</strong></p>
<ul>
<li><strong>Ingest</strong>: scheduled collectors pull from job boards, health system career pages, ATS feeds, and smaller niche sites.</li>
<li><strong>Parse</strong>: extract compensation text + structured hints (interval, min/max, currency), plus requirements (degree, licensure, telehealth, etc.).</li>
<li><strong>Normalize</strong>: convert hourly/monthly to annual, handle ranges, and standardize to a comparable “annualized base” field.</li>
<li><strong>Deduplicate</strong>: collapse reposts across sources (same job syndicated to 5 boards) so one employer doesn’t overweight the stats.</li>
<li><strong>Enrich</strong>: geocode locations, tag setting (hospital/outpatient/telehealth), detect pay bands vs free-text.</li>
<li><strong>Serve</strong>: Next.js + TypeScript API routes query Supabase with filters and return real-time results.</li>
</ul>
<hr />
<h2 id="heading-2-salary-normalization-turning-75hr-into-a-comparable-annual-number">2) Salary normalization: turning “$75/hr” into a comparable annual number</h2>
<p>A big reason the DNP vs MSN delta looks noisy is that job posts mix comp intervals.</p>
<p>We normalize to an annual estimate with explicit assumptions:</p>
<ul>
<li>hourly → annual = <code>hourly * 40 * 52</code></li>
<li>daily/weekly/monthly similarly</li>
<li>ranges → we store <code>min_annual</code>, <code>max_annual</code>, and a <code>midpoint_annual</code></li>
</ul>
<p>Example (TypeScript-ish pseudocode):</p>
<pre><code class="lang-ts"><span class="hljs-keyword">type</span> Comp = { min?: <span class="hljs-built_in">number</span>; max?: <span class="hljs-built_in">number</span>; interval: <span class="hljs-string">'hour'</span>|<span class="hljs-string">'year'</span> };

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">annualize</span>(<span class="hljs-params">comp: Comp</span>) </span>{
  <span class="hljs-keyword">const</span> factor = comp.interval === <span class="hljs-string">'hour'</span> ? <span class="hljs-number">40</span> * <span class="hljs-number">52</span> : <span class="hljs-number">1</span>;
  <span class="hljs-keyword">const</span> min = comp.min ? comp.min * factor : <span class="hljs-literal">undefined</span>;
  <span class="hljs-keyword">const</span> max = comp.max ? comp.max * factor : <span class="hljs-literal">undefined</span>;
  <span class="hljs-keyword">const</span> midpoint = min &amp;&amp; max ? (min + max) / <span class="hljs-number">2</span> : min ?? max;
  <span class="hljs-keyword">return</span> { minAnnual: min, maxAnnual: max, midpointAnnual: midpoint };
}
</code></pre>
<p>We also track a <strong>confidence score</strong> (e.g., explicit range vs “up to”) so we can filter analyses to “high-confidence comp only” when computing salary deltas.</p>
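<p>A toy version of that scoring idea (the exact scores and patterns here are illustrative, not the production heuristics):</p>

```typescript
// Toy confidence scoring for compensation text: explicit ranges beat
// "up to" ceilings, which beat bare single numbers. Scores are illustrative.
function compConfidence(text: string): number {
  if (/\$[\d,.]+k?\s*[–-]\s*\$?[\d,.]+k?/.test(text)) return 0.9; // explicit range
  if (/up to/i.test(text)) return 0.5;                            // ceiling only
  if (/\$[\d,.]+/.test(text)) return 0.7;                         // single number
  return 0;                                                       // "competitive", no number
}
```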
<hr />
<h2 id="heading-3-degree-detection-required-vs-preferred-matters">3) Degree detection: “required” vs “preferred” matters</h2>
<p>For the DNP/MSN comparison, we classify degree language into buckets:</p>
<ul>
<li><code>MSN_required</code></li>
<li><code>DNP_required</code></li>
<li><code>DNP_preferred</code></li>
<li><code>degree_unspecified</code></li>
</ul>
<p>This is mostly rules + targeted patterns (not a vague “AI summary”). Why? Because “DNP preferred” frequently correlates with higher-paying org types (large systems) without being the direct cause of higher pay.</p>
<p>A simplified extraction sketch:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- example: classify degree requirement from extracted text</span>
case
  when req_text ilike '%dnp%required%' then 'DNP_required'
  when req_text ilike '%msn%required%' then 'MSN_required'
  when req_text ilike '%dnp%preferred%' then 'DNP_preferred'
  else 'degree_unspecified'
<span class="hljs-keyword">end</span> <span class="hljs-keyword">as</span> degree_bucket
</code></pre>
<p>This lets us compute apples-to-apples comparisons like:</p>
<ul>
<li>same state</li>
<li>same setting (telehealth vs outpatient vs hospital)</li>
<li>similar experience requirements</li>
<li>high-confidence salary only</li>
</ul>
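<p>With those slices fixed, the delta itself is a small computation. A sketch, assuming rows already filtered to one matched cohort (types and names are illustrative):</p>

```typescript
// Median annualized midpoint per degree bucket, within one already-matched
// cohort (same state, setting, experience band, high-confidence comp).
type JobRow = { degreeBucket: string; midpointAnnual: number };

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// DNP-required median minus MSN-required median for the cohort.
function degreeDelta(rows: JobRow[]): number {
  const pick = (bucket: string) =>
    rows.filter((r) => r.degreeBucket === bucket).map((r) => r.midpointAnnual);
  return median(pick("DNP_required")) - median(pick("MSN_required"));
}
```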
<hr />
<h2 id="heading-4-what-the-data-shows-the-1020k-delta-is-real-but-conditional">4) What the data shows: the ~$10–20K delta is real, but conditional</h2>
<p>After normalization + dedupe + filtering to postings with usable comp data, we repeatedly see a <strong>DNP-to-MSN pay delta around ~$10–20K/year</strong>.</p>
<p>The important caveats (which show up clearly once you slice the data):</p>
<ul>
<li><strong>Pay-band orgs</strong> (health systems, large groups, some FQHCs) more often encode formal degree differentials.</li>
<li><strong>Smaller practices</strong> often pay the same for MSN vs DNP and price more heavily on “can you carry a panel?”</li>
<li><strong>Telehealth-first roles</strong> sometimes pay more overall, but the premium is often tied to productivity, multi-state coverage, or schedule—degree text may be incidental.</li>
</ul>
<p>This is why we expose filters on the jobs page and keep the salary guide separate: one is <strong>real-time market evidence</strong>, the other is <strong>aggregated range context</strong>.</p>
<hr />
<h2 id="heading-5-turning-salary-delta-into-break-even-time-the-roi-calculation">5) Turning salary delta into break-even time (the ROI calculation)</h2>
<p>The cleanest ROI view is: <strong>how long to break even?</strong></p>
<p>Break-even years:</p>
<pre><code class="lang-text">break_even_years = total_cost / annual_salary_lift
</code></pre>
<p>Where <code>total_cost</code> should include:</p>
<ul>
<li>tuition + fees</li>
<li>interest/loan costs</li>
<li><strong>lost income</strong> if you delay full-time work or reduce hours</li>
</ul>
<p>Example:</p>
<ul>
<li>Total cost = $40,000</li>
<li>Salary lift = $12,000/year</li>
</ul>
<p>Break-even ≈ 3.3 years.</p>
<p>But if:</p>
<ul>
<li>Total cost = $70,000</li>
<li>Lift = $10,000/year</li>
</ul>
<p>Break-even = 7 years.</p>
<p>That’s the part many “average bump” discussions miss: <strong>a $15K delta can vanish if the doctorate delays earnings by a year</strong>.</p>
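<p>The formula above as a runnable helper (lost income is folded into <code>totalCost</code>, as described):</p>

```typescript
// Break-even per the formula above. totalCost should already include
// tuition, fees, loan interest, and any lost income from reduced hours.
function breakEvenYears(totalCost: number, annualSalaryLift: number): number {
  if (annualSalaryLift <= 0) return Infinity; // no lift: never breaks even
  return totalCost / annualSalaryLift;
}
```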
<hr />
<h2 id="heading-6-how-we-surface-this-in-the-product">6) How we surface this in the product</h2>
<p>From a UI standpoint, “DNP vs MSN” is just a filter. Under the hood, it’s a chain of data decisions:</p>
<ul>
<li>normalized compensation fields stored in Supabase</li>
<li>deduped job entities (one canonical job, many source URLs)</li>
<li>degree buckets with confidence</li>
<li>location geocoding for state/city slices</li>
<li>real-time query performance so you can compare your market quickly</li>
</ul>
<p>If you want to sanity-check your target area, start with live postings on the main jobs page and then cross-reference the broader ranges in the salary guide:</p>
<ul>
<li>https://pmhnphiring.com/jobs</li>
<li>https://pmhnphiring.com/salary-guide</li>
</ul>
<p>The takeaway isn’t “DNP always wins.” It’s: <strong>the ROI depends on where you plan to work, how the employer prices credentials, and whether the extra schooling changes your time-to-earn.</strong></p>
]]></content:encoded></item><item><title><![CDATA[Why I Add an Outbox Table Instead of “Just Using a Queue”]]></title><description><![CDATA[The Problem
Any SaaS backend hits this moment.
You start with a simple flow: request comes in → write to Postgres → publish an event (email, webhook, analytics, cache invalidation, search indexing). It works in dev. It even works in staging.
Then pro...]]></description><link>https://blog.dvskr.dev/why-i-add-an-outbox-table-instead-of-just-using-a-queue</link><guid isPermaLink="true">https://blog.dvskr.dev/why-i-add-an-outbox-table-instead-of-just-using-a-queue</guid><category><![CDATA[architecture]]></category><category><![CDATA[Databases]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[systemdesign]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Tue, 24 Feb 2026 16:00:52 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-the-problem">The Problem</h2>
<p>Any SaaS backend hits this moment.</p>
<p>You start with a simple flow: request comes in → write to Postgres → publish an event (email, webhook, analytics, cache invalidation, search indexing). It works in dev. It even works in staging.</p>
<p>Then production happens.</p>
<p>A deploy rolls mid-request. The process restarts. Network blips. Kafka (or SQS, or Redis) has a bad minute. Suddenly you’ve got rows committed in Postgres but no event published. Or worse: event published but the DB transaction rolled back, so downstream systems act on data that doesn’t exist.</p>
<p>I wasted two days chasing a bug where customer-facing emails went out for records that never committed. The logs were clean. The code looked “correct.” The failure was architectural.</p>
<p>The core issue: <strong>atomicity across a database write and an external publish doesn’t exist unless you build for it</strong>.</p>
<h2 id="heading-options-i-considered">Options I Considered</h2>
<p>I usually evaluate this decision with one question: <em>What’s the source of truth?</em> In most SaaS backends I’ve built, Postgres is the source of truth. That pushes me toward patterns that treat the DB commit as the only “real” state transition.</p>
<p>Here are the options I’ve used or seriously considered.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Approach</td><td>Pros</td><td>Cons</td><td>Best For</td></tr>
</thead>
<tbody>
<tr>
<td>DB write then publish to queue (in request path)</td><td>Simple mental model. Low latency for consumers.</td><td>Loses events on crash between commit and publish. Can publish events for rolled-back transactions. Retries can duplicate.</td><td>Low-stakes side effects (metrics) where occasional loss is fine.</td></tr>
<tr>
<td>Distributed transaction / 2PC</td><td>True atomicity across systems (on paper).</td><td>Operational pain. Limited support across managed queues. Hard to debug. Adds coupling you’ll regret.</td><td>Rare enterprise setups where you control both ends and can accept complexity.</td></tr>
<tr>
<td>Change Data Capture (CDC) from Postgres WAL</td><td>Clean separation. Events derived from DB changes. Scales well once established.</td><td>Setup cost. Schema evolution complexity. Filtering/transforming events takes real work. Harder local dev.</td><td>Larger teams, high event volume, strict audit requirements.</td></tr>
<tr>
<td><strong>Transactional outbox (DB outbox table + dispatcher)</strong></td><td>DB commit + “event intent” are atomic. Retries are safe. Simple to reason about.</td><td>More tables. More background processing. Tuning + cleanup required.</td><td>Small-to-mid systems where Postgres is the source of truth and you want reliability.</td></tr>
</tbody>
</table>
</div><p>I didn’t pick CDC because I build solo and I don’t want to carry Debezium + Kafka Connect complexity unless the volume forces it.</p>
<p>I didn’t pick 2PC because I’ve lived that life. Debugging partial failures across systems is misery.</p>
<p>So it came down to: accept occasional loss, or implement outbox.</p>
<h2 id="heading-what-i-chose-and-why">What I Chose (and Why)</h2>
<p>I chose the <strong>transactional outbox pattern</strong>.</p>
<p>The decision was mostly about failure modes, not throughput.</p>
<p>Ranked reasons:</p>
<ol>
<li><strong>Atomicity with the DB commit.</strong> The outbox record is written in the same transaction as my business data.</li>
<li><strong>Retries become boring.</strong> If publishing fails, I retry without guessing whether the original commit happened.</li>
<li><strong>Backpressure is controllable.</strong> If downstream is slow, events pile up in Postgres. That’s visible. I can alert on it.</li>
</ol>
<p>What I gave up:</p>
<ul>
<li><strong>Extra moving parts.</strong> I now own a dispatcher loop, concurrency limits, and cleanup.</li>
<li><strong>Slightly higher latency.</strong> Events are typically published within 250ms–2s, not immediately in the request.</li>
<li><strong>Schema overhead.</strong> You’ll add at least one table and a couple indexes.</li>
</ul>
<h3 id="heading-schema">Schema</h3>
<p>This is the minimal schema I’ve landed on after trying a few variations.</p>
<ul>
<li><code>status</code> so I can manage retries.</li>
<li><code>available_at</code> for exponential backoff.</li>
<li><code>idempotency_key</code> so consumers (or my publisher) can dedupe.</li>
</ul>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> outbox_events (
  <span class="hljs-keyword">id</span>            BIGSERIAL PRIMARY <span class="hljs-keyword">KEY</span>,
  aggregate_type <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  aggregate_id   <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  event_type     <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  payload        JSONB <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  idempotency_key <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,

  <span class="hljs-keyword">status</span>         <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-string">'pending'</span>, <span class="hljs-comment">-- pending|processing|published|dead</span>
  attempts       <span class="hljs-built_in">INT</span>  <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-number">0</span>,
  available_at   TIMESTAMPTZ <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-keyword">now</span>(),

  created_at     TIMESTAMPTZ <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-keyword">now</span>(),
  published_at   TIMESTAMPTZ
);

<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">UNIQUE</span> <span class="hljs-keyword">INDEX</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> outbox_events_idempotency_key_uidx
  <span class="hljs-keyword">ON</span> outbox_events (idempotency_key);

<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> outbox_events_pending_idx
  <span class="hljs-keyword">ON</span> outbox_events (<span class="hljs-keyword">status</span>, available_at, <span class="hljs-keyword">id</span>);
</code></pre>
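<p>One variation worth knowing: the dispatcher only ever scans <code>pending</code> rows, so a partial index can replace the composite one above and stays small even when the table doesn’t. A sketch (same table; whether it actually wins depends on your status distribution):</p>

```sql
-- Partial alternative to the (status, available_at, id) index: it only
-- contains rows the dispatcher cares about, so it stays tiny once most
-- rows are 'published' or 'dead'.
CREATE INDEX IF NOT EXISTS outbox_events_pending_partial_idx
  ON outbox_events (available_at, id)
  WHERE status = 'pending';
```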
<h3 id="heading-writing-business-data-outbox-atomically">Writing business data + outbox atomically</h3>
<p>I use Node.js a lot for SaaS backends, so here’s a working example using <code>pg</code>.</p>
<p>Key detail: <strong>the outbox write is inside the same <code>BEGIN/COMMIT</code></strong>.</p>
<pre><code class="lang-js"><span class="hljs-keyword">import</span> pg <span class="hljs-keyword">from</span> <span class="hljs-string">'pg'</span>;
<span class="hljs-keyword">import</span> crypto <span class="hljs-keyword">from</span> <span class="hljs-string">'crypto'</span>;

<span class="hljs-keyword">const</span> { Pool } = pg;
<span class="hljs-keyword">const</span> pool = <span class="hljs-keyword">new</span> Pool({ <span class="hljs-attr">connectionString</span>: process.env.DATABASE_URL });

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">createInvoice</span>(<span class="hljs-params">{ customerId, amountCents }</span>) </span>{
  <span class="hljs-keyword">const</span> client = <span class="hljs-keyword">await</span> pool.connect();
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">await</span> client.query(<span class="hljs-string">'BEGIN'</span>);

    <span class="hljs-keyword">const</span> invoiceRes = <span class="hljs-keyword">await</span> client.query(
      <span class="hljs-string">`INSERT INTO invoices (customer_id, amount_cents, status)
       VALUES ($1, $2, 'created')
       RETURNING id, customer_id, amount_cents, status, created_at`</span>,
      [customerId, amountCents]
    );

    <span class="hljs-keyword">const</span> invoice = invoiceRes.rows[<span class="hljs-number">0</span>];
    <span class="hljs-keyword">const</span> idempotencyKey = crypto
      .createHash(<span class="hljs-string">'sha256'</span>)
      .update(<span class="hljs-string">`invoice.created:<span class="hljs-subst">${invoice.id}</span>`</span>)
      .digest(<span class="hljs-string">'hex'</span>);

    <span class="hljs-keyword">await</span> client.query(
      <span class="hljs-string">`INSERT INTO outbox_events
         (aggregate_type, aggregate_id, event_type, payload, idempotency_key)
       VALUES
         ($1, $2, $3, $4::jsonb, $5)
       ON CONFLICT (idempotency_key) DO NOTHING`</span>,
      [
        <span class="hljs-string">'invoice'</span>,
        <span class="hljs-built_in">String</span>(invoice.id),
        <span class="hljs-string">'invoice.created'</span>,
        <span class="hljs-built_in">JSON</span>.stringify({
          <span class="hljs-attr">invoiceId</span>: invoice.id,
          <span class="hljs-attr">customerId</span>: invoice.customer_id,
          <span class="hljs-attr">amountCents</span>: invoice.amount_cents,
          <span class="hljs-attr">createdAt</span>: invoice.created_at
        }),
        idempotencyKey
      ]
    );

    <span class="hljs-keyword">await</span> client.query(<span class="hljs-string">'COMMIT'</span>);
    <span class="hljs-keyword">return</span> invoice;
  } <span class="hljs-keyword">catch</span> (e) {
    <span class="hljs-keyword">await</span> client.query(<span class="hljs-string">'ROLLBACK'</span>);
    <span class="hljs-keyword">throw</span> e;
  } <span class="hljs-keyword">finally</span> {
    client.release();
  }
}
</code></pre>
<p>That <code>ON CONFLICT DO NOTHING</code> is defensive. If my API handler retries due to a timeout after the commit (classic), I won’t enqueue the same logical event twice.</p>
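<p>The consumer needs the other half of that contract. A minimal sketch, assuming a hypothetical <code>consumed_events</code> table on the consumer’s side: record the key before applying side effects, and no-op when the insert hits a conflict.</p>

```sql
CREATE TABLE IF NOT EXISTS consumed_events (
  idempotency_key TEXT PRIMARY KEY,
  consumed_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Inside the consumer's transaction: if this inserts 0 rows, the event
-- was already processed and the handler should skip its side effects.
INSERT INTO consumed_events (idempotency_key)
VALUES ($1)
ON CONFLICT (idempotency_key) DO NOTHING;
```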
<h3 id="heading-dispatching-with-for-update-skip-locked">Dispatching with <code>FOR UPDATE SKIP LOCKED</code></h3>
<p>This is the part people either over-engineer or under-engineer.</p>
<p>I keep it boring:</p>
<ul>
<li>Select a batch of pending events.</li>
<li>Lock them so multiple workers don’t double-publish.</li>
<li>Mark them <code>processing</code>.</li>
<li>Publish.</li>
<li>Mark <code>published</code>.</li>
</ul>
<p>Postgres gives me the concurrency primitive I need: <code>FOR UPDATE SKIP LOCKED</code>.</p>
<pre><code class="lang-js"><span class="hljs-keyword">import</span> pg <span class="hljs-keyword">from</span> <span class="hljs-string">'pg'</span>;

<span class="hljs-keyword">const</span> { Pool } = pg;
<span class="hljs-keyword">const</span> pool = <span class="hljs-keyword">new</span> Pool({ <span class="hljs-attr">connectionString</span>: process.env.DATABASE_URL });

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">publishToQueue</span>(<span class="hljs-params">evt</span>) </span>{
  <span class="hljs-comment">// Example: replace with your actual publisher.</span>
  <span class="hljs-comment">// This function must be safe to retry.</span>
  <span class="hljs-comment">// If you use SQS FIFO, idempotency_key can be MessageDeduplicationId.</span>
  <span class="hljs-keyword">return</span>;
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">dispatchOnce</span>(<span class="hljs-params">{ batchSize = <span class="hljs-number">50</span> } = {}</span>) </span>{
  <span class="hljs-keyword">const</span> client = <span class="hljs-keyword">await</span> pool.connect();
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">await</span> client.query(<span class="hljs-string">'BEGIN'</span>);

    <span class="hljs-keyword">const</span> { <span class="hljs-attr">rows</span>: events } = <span class="hljs-keyword">await</span> client.query(
      <span class="hljs-string">`SELECT id, event_type, payload, idempotency_key
       FROM outbox_events
       WHERE status = 'pending'
         AND available_at &lt;= now()
       ORDER BY id
       FOR UPDATE SKIP LOCKED
       LIMIT $1`</span>,
      [batchSize]
    );

    <span class="hljs-keyword">if</span> (events.length === <span class="hljs-number">0</span>) {
      <span class="hljs-keyword">await</span> client.query(<span class="hljs-string">'COMMIT'</span>);
      <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;
    }

    <span class="hljs-keyword">const</span> ids = events.map(<span class="hljs-function"><span class="hljs-params">e</span> =&gt;</span> e.id);

    <span class="hljs-keyword">await</span> client.query(
      <span class="hljs-string">`UPDATE outbox_events
       SET status = 'processing'
       WHERE id = ANY($1::bigint[])`</span>,
      [ids]
    );

    <span class="hljs-keyword">await</span> client.query(<span class="hljs-string">'COMMIT'</span>);

    <span class="hljs-comment">// Publish outside the transaction.</span>
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> evt <span class="hljs-keyword">of</span> events) {
      <span class="hljs-keyword">await</span> publishToQueue(evt);
      <span class="hljs-keyword">await</span> pool.query(
        <span class="hljs-string">`UPDATE outbox_events
         SET status = 'published', published_at = now()
         WHERE id = $1`</span>,
        [evt.id]
      );
    }

    <span class="hljs-keyword">return</span> events.length;
  } <span class="hljs-keyword">catch</span> (e) {
    <span class="hljs-comment">// If the failure happened after COMMIT (i.e. during publishing), this</span>
    <span class="hljs-comment">// ROLLBACK is a harmless no-op warning; swallow its own errors so the</span>
    <span class="hljs-comment">// original exception propagates. Stuck 'processing' rows get reset by the reaper.</span>
    <span class="hljs-keyword">await</span> client.query(<span class="hljs-string">'ROLLBACK'</span>).catch(<span class="hljs-params">()</span> =&gt; {});
    <span class="hljs-keyword">throw</span> e;
  } <span class="hljs-keyword">finally</span> {
    client.release();
  }
}
</code></pre>
<p>Yes, publishing happens outside the transaction. That’s intentional.</p>
<p>Holding DB locks while waiting on a network call is how you end up with a self-inflicted outage.</p>
<p>So you accept this reality: an event can be marked <code>processing</code> and the worker can crash before publishing. That’s fine. You handle it with a reaper.</p>
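<p>For completeness, the loop that drives <code>dispatchOnce</code> is trivial. A sketch with the dispatch function injected so the loop stays testable; <code>idleMs</code> and the bounded-iteration option are my own additions, not part of the worker above:</p>

```javascript
// Drive loop for the dispatcher: keep claiming batches until one comes back
// empty, then sleep so an idle worker isn't hot-looping on the database.
// maxIterations lets tests and shutdown hooks bound the loop; production
// would leave it at Infinity.
async function runDispatcher(dispatchOnce, { idleMs = 1000, maxIterations = Infinity } = {}) {
  let iterations = 0;
  while (iterations < maxIterations) {
    const published = await dispatchOnce();
    if (published === 0) {
      // Nothing pending: back off instead of immediately re-polling.
      await new Promise((resolve) => setTimeout(resolve, idleMs));
    }
    iterations += 1;
  }
}
```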
<h2 id="heading-how-it-worked-in-production">How It Worked in Production</h2>
<p>This pattern fixed the class of bugs where “DB says yes, queue says no.” Immediately.</p>
<p>Numbers from my last setup (single Postgres primary, one Node worker, one queue):</p>
<ul>
<li>Before outbox, I measured <strong>19 missing side-effect actions across 1,842,611 requests</strong> over 14 days. Not catastrophic. But every miss was a support ticket or silent data skew.</li>
<li>After outbox, missing actions dropped to <strong>0 across 2,103,884 requests</strong> over the next 14 days.</li>
</ul>
<p>Latency changed too:</p>
<ul>
<li>In-request publishing (old): p95 request latency <strong>310ms</strong>, with spikes to <strong>1,900ms</strong> when the queue API slowed.</li>
<li>Outbox (new): p95 request latency <strong>180ms</strong> (queue publish removed from critical path). Event publish delay p95 <strong>740ms</strong>.</li>
</ul>
<p>Stuff that surprised me:</p>
<ul>
<li>The outbox table grows fast. Even at modest volume, you’ll create tens of millions of rows over time. I hit <strong>24,118,902 rows</strong> in 30 days once. Vacuum wasn’t happy.</li>
<li>Retrying needs backoff. Without it, a downstream outage turns into a tight loop hammering the queue.</li>
</ul>
<p>I ended up adding:</p>
<ul>
<li>A reaper that resets stuck <code>processing</code> events.</li>
<li>A dead-letter path after N attempts.</li>
<li>Partitioning or aggressive archiving depending on volume.</li>
</ul>
<p>Here’s the reaper SQL I use.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">UPDATE</span> outbox_events
<span class="hljs-keyword">SET</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'pending'</span>,
    available_at = <span class="hljs-keyword">now</span>(),
    attempts = attempts + <span class="hljs-number">1</span>
<span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'processing'</span>
  <span class="hljs-keyword">AND</span> created_at &lt; <span class="hljs-keyword">now</span>() - <span class="hljs-built_in">interval</span> <span class="hljs-string">'10 minutes'</span>
  <span class="hljs-keyword">AND</span> attempts &lt; <span class="hljs-number">25</span>;

<span class="hljs-keyword">UPDATE</span> outbox_events
<span class="hljs-keyword">SET</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'dead'</span>
<span class="hljs-keyword">WHERE</span> attempts &gt;= <span class="hljs-number">25</span>
  <span class="hljs-keyword">AND</span> <span class="hljs-keyword">status</span> <span class="hljs-keyword">IN</span> (<span class="hljs-string">'pending'</span>, <span class="hljs-string">'processing'</span>);
</code></pre>
<p>I run that every minute.</p>
<p>Harsh? Yeah. But it forces me to look at dead events instead of pretending retries are infinite.</p>
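<p>“Every minute” can live anywhere: OS cron, a <code>setInterval</code> in the worker, or inside Postgres itself if the <code>pg_cron</code> extension is available (an assumption; managed Postgres offerings vary). A sketch of the last option:</p>

```sql
-- Schedule the reset statement once a minute inside Postgres via pg_cron.
-- The dead-letter statement can be scheduled the same way.
SELECT cron.schedule('outbox-reaper', '* * * * *', $$
  UPDATE outbox_events
  SET status = 'pending',
      available_at = now(),
      attempts = attempts + 1
  WHERE status = 'processing'
    AND created_at < now() - interval '10 minutes'
    AND attempts < 25
$$);
```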
<h2 id="heading-when-this-doesnt-work">When This Doesn't Work</h2>
<p>I don’t use an outbox everywhere.</p>
<p>If you need <strong>sub-50ms end-to-end event delivery</strong>, the dispatcher loop + polling will annoy you. You can mitigate with <code>LISTEN/NOTIFY</code>, but now you’re building a more complex dispatcher anyway.</p>
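<p>If you do reach for <code>LISTEN/NOTIFY</code>, the database side is just a trigger that wakes idle dispatchers. A sketch (the channel name is an assumption, and workers should keep polling as a fallback, since notifications aren’t queued for disconnected listeners):</p>

```sql
CREATE OR REPLACE FUNCTION notify_outbox() RETURNS trigger AS $$
BEGIN
  -- Payload is advisory only; workers still claim rows via SKIP LOCKED.
  PERFORM pg_notify('outbox_wakeup', NEW.id::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER outbox_events_notify
AFTER INSERT ON outbox_events
FOR EACH ROW EXECUTE FUNCTION notify_outbox();
```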
<p>If you’ve already got a mature event platform (Kafka + schema registry + CDC team ownership), straight CDC is cleaner at scale.</p>
<p>And if your DB isn’t the source of truth (event-sourced systems, or systems where writes land in a log first), an outbox table can be redundant.</p>
<p>Also: if you can’t tolerate the outbox table size and you won’t invest in partitioning/TTL, this pattern will bite you later.</p>
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<ul>
<li>If Postgres is your source of truth, <strong>write the event intent into Postgres in the same transaction</strong>. That’s the whole point.</li>
<li>Don’t publish to external systems while holding DB locks. Ever.</li>
<li>Use <code>FOR UPDATE SKIP LOCKED</code> for horizontal scaling without coordination.</li>
<li>Design for retries upfront: idempotency keys, backoff (<code>available_at</code>), and a dead-letter state.</li>
<li>Plan for data lifecycle. Outbox tables don’t stay small by accident.</li>
</ul>
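<p>On that last point, even a blunt retention job beats unbounded growth. A sketch with a 7-day window (the window is an assumption; at tens of millions of rows, partitioning by <code>created_at</code> and dropping old partitions is the better tool):</p>

```sql
-- Nightly cleanup: published events older than the retention window.
-- Delete in batches on large tables to keep vacuum manageable.
DELETE FROM outbox_events
WHERE status = 'published'
  AND published_at < now() - interval '7 days';
```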
<h2 id="heading-closing">Closing</h2>
<p>I keep seeing teams jump straight to “add a queue” and stop there. The queue solves buffering, not atomicity.</p>
<p>The outbox pattern is boring, but it makes failure modes legible—and that’s the real win when you’re on-call for your own system.</p>
<p>Do you prefer an outbox dispatcher (polling or <code>LISTEN/NOTIFY</code>) or CDC off the WAL, and at what event volume did you switch?</p>
]]></content:encoded></item><item><title><![CDATA[How We Compared Telehealth vs In-Person PMHNP Pay Across 10,000+ Job Posts]]></title><description><![CDATA[“Telehealth pays less” is usually a conclusion drawn from one offer. When you aggregate thousands of postings and normalize comp structures, the pattern flips: telehealth often pays more.
The myth: telehealth is “easier,” so it pays less
On PMHNP Hir...]]></description><link>https://blog.dvskr.dev/how-we-compared-telehealth-vs-in-person-pmhnp-pay-across-10000-job-posts</link><guid isPermaLink="true">https://blog.dvskr.dev/how-we-compared-telehealth-vs-in-person-pmhnp-pay-across-10000-job-posts</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[healthcare tech]]></category><category><![CDATA[Next.js]]></category><category><![CDATA[supabase]]></category><category><![CDATA[TypeScript]]></category><category><![CDATA[webdev]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Mon, 23 Feb 2026 21:31:47 GMT</pubDate><content:encoded><![CDATA[<p>“Telehealth pays less” is usually a conclusion drawn from one offer. When you aggregate thousands of postings and normalize comp structures, the pattern flips: telehealth often pays more.</p>
<h2 id="heading-the-myth-telehealth-is-easier-so-it-pays-less">The myth: telehealth is “easier,” so it pays less</h2>
<p>On PMHNP Hiring we ingest 500+ sources daily and maintain 10,000+ verified PMHNP jobs across all 50 states. When we look at compensation across that dataset (after normalizing salary formats and removing duplicates), the common claim that <em>in-person always pays more</em> doesn’t hold up.</p>
<p>Across postings that include usable pay data, <strong>telehealth roles often price higher</strong> than in-person roles.</p>
<p>That doesn’t mean every remote job beats every onsite job. It means the distribution is different enough that treating telehealth as a “pay cut for flexibility” is a bad default.</p>
<p>This post is the builder’s version of the question: what does the data say, and what did we have to do technically to make it comparable?</p>
<hr />
<h2 id="heading-why-pay-comparisons-are-hard-and-why-raw-job-boards-mislead">Why pay comparisons are hard (and why raw job boards mislead)</h2>
<p>Job posts rarely ship “clean” salary fields. The same compensation can show up as:</p>
<ul>
<li><code>$140/hr</code> (W2 hourly)</li>
<li><code>$1,200/day</code></li>
<li><code>$250/visit</code> (1099)</li>
<li><code>Up to $220k</code> (base + bonus unknown)</li>
<li><code>80% collections</code> (requires assumptions)</li>
</ul>
<p>If you compare those strings directly, you’ll produce nonsense. Our pipeline has to:</p>
<ol>
<li><strong>Extract</strong> comp from messy text (structured fields when available, otherwise description parsing)</li>
<li><strong>Normalize</strong> to comparable units (hourly ↔ annual, ranges ↔ midpoint)</li>
<li><strong>Classify</strong> pay model (salary, hourly, per-visit, RVU/collections)</li>
<li><strong>Deduplicate</strong> cross-posted roles so one high-paying listing doesn’t appear 30 times</li>
<li><strong>Segment</strong> by modality (telehealth vs in-person vs hybrid) using both metadata and text signals</li>
</ol>
<p>Only after that do “telehealth vs in-person” comparisons become meaningful.</p>
<hr />
<h2 id="heading-the-pipeline-from-scraped-postings-to-comparable-numbers">The pipeline: from scraped postings to comparable numbers</h2>
<p>At a high level, we treat each source as an input adapter that maps into a common schema, then run enrichment steps.</p>
<h3 id="heading-1-canonical-job-schema">1) Canonical job schema</h3>
<p>We store a normalized representation (Supabase/Postgres), keeping raw fields for debugging:</p>
<pre><code class="lang-ts"><span class="hljs-keyword">type</span> PayModel = <span class="hljs-string">'salary'</span> | <span class="hljs-string">'hourly'</span> | <span class="hljs-string">'per_visit'</span> | <span class="hljs-string">'rvu'</span> | <span class="hljs-string">'collections'</span> | <span class="hljs-string">'unknown'</span>

<span class="hljs-keyword">type</span> Job = {
  id: <span class="hljs-built_in">string</span>
  source: <span class="hljs-built_in">string</span>
  source_job_id: <span class="hljs-built_in">string</span>
  title: <span class="hljs-built_in">string</span>
  company: <span class="hljs-built_in">string</span>
  location_text: <span class="hljs-built_in">string</span>
  remote_type: <span class="hljs-string">'telehealth'</span> | <span class="hljs-string">'in_person'</span> | <span class="hljs-string">'hybrid'</span> | <span class="hljs-string">'unknown'</span>
  pay_model: PayModel
  pay_min?: <span class="hljs-built_in">number</span>
  pay_max?: <span class="hljs-built_in">number</span>
  pay_unit?: <span class="hljs-string">'year'</span> | <span class="hljs-string">'hour'</span> | <span class="hljs-string">'visit'</span>
  pay_currency?: <span class="hljs-string">'USD'</span>
  description: <span class="hljs-built_in">string</span>
  posted_at: <span class="hljs-built_in">string</span>
  fingerprint: <span class="hljs-built_in">string</span> <span class="hljs-comment">// for dedupe</span>
}
</code></pre>
<h3 id="heading-2-salary-parsing-normalization">2) Salary parsing + normalization</h3>
<p>We normalize into an annualized estimate <strong>only when the pay model supports it</strong>. For hourly W2 roles, annualization is straightforward (with assumptions). For per-visit/collections, we keep the model explicit to avoid inventing certainty.</p>
<pre><code class="lang-ts"><span class="hljs-keyword">const</span> HOURS_PER_YEAR = <span class="hljs-number">2080</span>

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">annualize</span>(<span class="hljs-params">job: Job</span>) </span>{
  <span class="hljs-keyword">if</span> (job.pay_model === <span class="hljs-string">'hourly'</span> &amp;&amp; job.pay_unit === <span class="hljs-string">'hour'</span>) {
    <span class="hljs-keyword">return</span> {
      annual_min: job.pay_min ? job.pay_min * HOURS_PER_YEAR : <span class="hljs-literal">null</span>,
      annual_max: job.pay_max ? job.pay_max * HOURS_PER_YEAR : <span class="hljs-literal">null</span>,
      confidence: <span class="hljs-string">'medium'</span>,
    }
  }

  <span class="hljs-keyword">if</span> (job.pay_model === <span class="hljs-string">'salary'</span> &amp;&amp; job.pay_unit === <span class="hljs-string">'year'</span>) {
    <span class="hljs-keyword">return</span> {
      annual_min: job.pay_min ?? <span class="hljs-literal">null</span>,
      annual_max: job.pay_max ?? <span class="hljs-literal">null</span>,
      confidence: <span class="hljs-string">'high'</span>,
    }
  }

  <span class="hljs-comment">// per-visit / collections / RVU require volume assumptions → do not annualize by default</span>
  <span class="hljs-keyword">return</span> { annual_min: <span class="hljs-literal">null</span>, annual_max: <span class="hljs-literal">null</span>, confidence: <span class="hljs-string">'low'</span> }
}
</code></pre>
<p>This is where a lot of “telehealth pays less” myths come from: many remote roles are posted as per-visit or production-based, while hospital roles are posted as clean annual salaries. If you only compare annual-salary postings, you bias toward in-person systems.</p>
<h3 id="heading-3-deduplication-the-hidden-salary-inflation-bug">3) Deduplication (the hidden salary inflation bug)</h3>
<p>High-volume telehealth platforms syndicate aggressively. Without dedupe, your dataset overcounts the same role and skews pay stats.</p>
<p>We generate a fingerprint from stable fields (company + title + state/license requirement + pay band + remote type) and cluster near-matches.</p>
<p>Architecture note: dedupe is a blend of deterministic hashing + fuzzy matching (string similarity on company/title) with thresholds tuned by manual review.</p>
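<p>The deterministic half of that blend is just a stable hash over normalized fields. A sketch (field choice and normalization here are illustrative, not our exact production logic):</p>

```typescript
import { createHash } from 'crypto';

// Deterministic fingerprint over the stable fields used for exact-duplicate
// clustering; fuzzy matching handles the near-misses separately.
function jobFingerprint(job: {
  company: string;
  title: string;
  state: string;
  payBand: string;
  remoteType: string;
}): string {
  // Normalize so trivial formatting differences don't defeat the hash.
  const norm = (s: string) => s.trim().toLowerCase().replace(/\s+/g, ' ');
  const key = [job.company, job.title, job.state, job.payBand, job.remoteType]
    .map(norm)
    .join('|');
  return createHash('sha256').update(key).digest('hex');
}
```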
<hr />
<h2 id="heading-what-the-data-shows-telehealth-often-prices-higher">What the data shows: telehealth often prices higher</h2>
<p>After normalization and dedupe, we compare distributions by <code>remote_type</code>, segmented by pay model (salary vs hourly vs per-visit).</p>
<p>A simplified SQL sketch:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">select</span>
  remote_type,
  pay_model,
  <span class="hljs-keyword">percentile_cont</span>(<span class="hljs-number">0.5</span>) <span class="hljs-keyword">within</span> <span class="hljs-keyword">group</span> (<span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> annual_mid) <span class="hljs-keyword">as</span> p50,
  <span class="hljs-keyword">count</span>(*) <span class="hljs-keyword">as</span> n
<span class="hljs-keyword">from</span> (
  <span class="hljs-keyword">select</span>
    remote_type,
    pay_model,
    <span class="hljs-keyword">case</span>
      <span class="hljs-keyword">when</span> annual_min <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">and</span> annual_max <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">then</span> (annual_min + annual_max)/<span class="hljs-number">2</span>
      <span class="hljs-keyword">when</span> annual_min <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">then</span> annual_min
      <span class="hljs-keyword">when</span> annual_max <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">then</span> annual_max
      <span class="hljs-keyword">else</span> <span class="hljs-literal">null</span>
    <span class="hljs-keyword">end</span> <span class="hljs-keyword">as</span> annual_mid
  <span class="hljs-keyword">from</span> job_comp_normalized
  <span class="hljs-keyword">where</span> annual_min <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">or</span> annual_max <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span>
) x
<span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> <span class="hljs-number">1</span>,<span class="hljs-number">2</span>
<span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> n <span class="hljs-keyword">desc</span>;
</code></pre>
<p>The repeated pattern we see:</p>
<ul>
<li><strong>Telehealth salary/hourly postings often cluster higher</strong> than comparable in-person postings.</li>
<li>The gap gets bigger in roles that signal urgency: multi-state licensing, nights/weekends, fast start dates.</li>
<li>In-person still wins in specific slices: hospital systems with strong benefits and stable base salaries.</li>
</ul>
<p>So why does telehealth price higher <em>so often</em>?</p>
<hr />
<h2 id="heading-the-business-math-behind-the-pay-as-seen-through-job-post-signals">The business math behind the pay (as seen through job post signals)</h2>
<p>From a data standpoint, remote roles correlate with signals that predict higher comp:</p>
<ol>
<li><p><strong>Competition is national, not local</strong></p>
<ul>
<li>Telehealth employers compete against other remote-first orgs. We see faster repost cycles and higher pay edits in these listings.</li>
</ul>
</li>
<li><p><strong>Many remote models are throughput-optimized</strong></p>
<ul>
<li>Posts mention standardized workflows, shorter appointment gaps, and reduced no-shows. That tends to pair with productivity pay or higher hourly rates.</li>
</ul>
</li>
<li><p><strong>Coverage + urgency premiums</strong></p>
<ul>
<li>Remote roles disproportionately include nights/weekends, rural coverage, and “licensed in X state” requirements.</li>
</ul>
</li>
</ol>
<p>Technically, these show up as text features we can index and filter: <code>weekend</code>, <code>after-hours</code>, <code>multi-state</code>, <code>compact</code>, <code>ASAP</code>, etc. They’re not perfect, but they’re strong enough to segment on.</p>
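<p>Extracting those signals can start embarrassingly simple. A sketch (the keyword list is illustrative; real matching needs word boundaries and negation handling):</p>

```typescript
// Flag urgency/coverage signals in a posting's description text.
const URGENCY_SIGNALS = ['weekend', 'after-hours', 'multi-state', 'compact', 'asap'] as const;

function urgencySignals(description: string): string[] {
  const text = description.toLowerCase();
  return URGENCY_SIGNALS.filter((signal) => text.includes(signal));
}
```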
<hr />
<h2 id="heading-what-we-surface-in-the-product-and-why-it-matters-for-negotiation">What we surface in the product (and why it matters for negotiation)</h2>
<p>On the UI side (Next.js + TypeScript), we expose filters that map directly to the normalized schema:</p>
<ul>
<li>Telehealth / in-person / hybrid</li>
<li>Pay model (salary vs hourly vs per-visit)</li>
<li>Pay range (only when confidence is sufficient)</li>
<li>State licensing requirements</li>
</ul>
<p>Alerts (email/push) are triggered when new postings match saved filters, so users can watch <em>their</em> slice of the market rather than relying on anecdotes.</p>
<p>If you’re negotiating, the practical takeaway is data-driven: <strong>don’t assume telehealth implies a discount</strong>. Treat modality as one variable, then compare roles with the same pay model and similar constraints.</p>
<hr />
<h2 id="heading-next-up-improving-apples-to-apples-comparisons">Next up: improving apples-to-apples comparisons</h2>
<p>The hardest remaining problem is per-visit and collections-based comp. We’re working on “expected annual comp” estimates by pairing postings with realistic volume assumptions (and clearly labeling them as assumptions). That’s the only way to compare a $250/visit role against a $190k base role without hand-waving.</p>
]]></content:encoded></item><item><title><![CDATA[Why I Use Canonical + noindex as an SEO Safety Net]]></title><description><![CDATA[The Problem
Duplicate URLs aren’t a “SEO issue”. They’re a system design issue.
When I’m building a content-heavy app solo, URLs multiply fast. Trailing slash vs no slash. ?utm_source= junk. Filters like ?state=ca&role=icu. Sort options. Even framewo...]]></description><link>https://blog.dvskr.dev/why-i-use-canonical-noindex-as-an-seo-safety-net</link><guid isPermaLink="true">https://blog.dvskr.dev/why-i-use-canonical-noindex-as-an-seo-safety-net</guid><category><![CDATA[Next.js]]></category><category><![CDATA[performance]]></category><category><![CDATA[systemdesign]]></category><category><![CDATA[webdev]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Thu, 05 Feb 2026 16:00:55 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-the-problem">The Problem</h2>
<p>Duplicate URLs aren’t an “SEO issue”. They’re a system design issue.</p>
<p>When I’m building a content-heavy app solo, URLs multiply fast. Trailing slash vs no slash. <code>?utm_source=</code> junk. Filters like <code>?state=ca&amp;role=icu</code>. Sort options. Even framework-level behavior (redirects, <code>notFound()</code>, middleware) can create multiple reachable URLs for the same document.</p>
<p>Google doesn’t ask permission. It picks a canonical on its own if you don’t.</p>
<p>My failure mode was predictable: I’d ship a feature, traffic would go up, then Search Console would show duplicates, “Discovered — currently not indexed”, and soft 404s. Worse, the <em>wrong</em> URLs would rank (parameterized junk), and the ones I cared about wouldn’t.</p>
<p>So I treated it like any other production bug: define an invariant. One document → one indexable URL.</p>
<h2 id="heading-options-i-considered">Options I Considered</h2>
<p>There are a few common approaches. None is perfect.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Approach</td><td>Pros</td><td>Cons</td><td>Best For</td></tr>
</thead>
<tbody>
<tr>
<td>Canonical only (<code>&lt;link rel="canonical"&gt;</code>)</td><td>Keeps alternates crawlable; consolidates signals</td><td>Google may ignore it; duplicates still consume crawl budget</td><td>Mild duplication where alternates are still useful</td></tr>
<tr>
<td><code>noindex</code> only (<code>&lt;meta name="robots" content="noindex"&gt;</code>)</td><td>Hard stop for indexing; fast cleanup</td><td>Doesn’t consolidate signals well; still crawled unless blocked</td><td>Thin pages or internal utility pages</td></tr>
<tr>
<td>Redirect alternates to one URL (301/308)</td><td>Strongest consolidation; simplest mental model</td><td>Breaks some UX (filters/back button); can cause redirect chains</td><td>When alternates are truly equivalent</td></tr>
<tr>
<td>Canonical + <code>noindex</code> + robots rules (hybrid)</td><td>Defensive; handles messy real-world URLs</td><td>Easy to over-block; requires discipline in routing</td><td>Apps with filters, facets, and lots of generated URLs</td></tr>
</tbody>
</table>
</div><p>I started with canonical-only. It worked until it didn’t.</p>
<p>Here’s why canonical-only failed for me:</p>
<ul>
<li>Parameterized URLs often got indexed anyway. Google treated them as distinct enough.</li>
<li>Canonical mistakes are easy. One bug in a shared layout and you emit the wrong canonical for thousands of pages.</li>
<li>Crawl budget isn’t theoretical when you have lots of pages. Duplicates dilute attention.</li>
</ul>
<p>Redirect-only was tempting. But I didn’t want to redirect every filter combination.</p>
<p>Faceted URLs are tricky:</p>
<ul>
<li>Some facets are trash (sort order, tracking params).</li>
<li>Some facets are legitimate landing pages (state, role, category).</li>
</ul>
<p>If you redirect everything, you kill valid long-tail entry points. If you redirect nothing, you get duplication.</p>
<p>So I went hybrid.</p>
<h2 id="heading-what-i-chose-and-why">What I Chose (and Why)</h2>
<p>I chose <strong>canonical + selective <code>noindex</code> + robots.txt rules for known junk params</strong>, plus a hard rule: <strong>every indexable page must emit an explicit canonical</strong>.</p>
<p>Ranked reasons:</p>
<ol>
<li><strong>Fail-safe behavior.</strong> If a duplicate URL slips through, it still won’t get indexed.</li>
<li><strong>Control.</strong> I decide which facets deserve indexing. Not Google.</li>
<li><strong>Incremental rollout.</strong> I can add rules per route type without rewriting routing.</li>
</ol>
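<p>The robots.txt piece is the smallest part. A sketch for known junk params (the patterns are illustrative and rely on Google’s wildcard matching; over-blocking is the easy mistake here, so keep the list short and explicit):</p>

```txt
User-agent: *
# Tracking params never define a distinct document
Disallow: /*?*utm_source=
Disallow: /*?*utm_medium=
# Sort order is presentation, not content
Disallow: /*?*sort=
```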
<p>What I gave up:</p>
<ul>
<li>I gave up the simplicity of “canonical everywhere and pray”. Now I maintain explicit allow/deny logic.</li>
<li>I gave up indexing for some URLs that might’ve been valuable. That’s on me to evaluate.</li>
</ul>
<h3 id="heading-step-1-normalize-the-canonical-url">Step 1: Normalize the canonical URL</h3>
<p>In Next.js, the easiest trap is building canonicals from <code>req.url</code> or <code>searchParams</code>. Don’t.</p>
<p>I treat canonical as a pure function of the route params that define the document.</p>
<pre><code class="lang-ts"><span class="hljs-comment">// app/lib/seo.ts</span>
<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">canonicalUrl</span>(<span class="hljs-params">baseUrl: <span class="hljs-built_in">string</span>, pathname: <span class="hljs-built_in">string</span></span>) </span>{
  <span class="hljs-comment">// Enforce a consistent policy: no trailing slash except root</span>
  <span class="hljs-keyword">const</span> cleanPath = pathname === <span class="hljs-string">"/"</span> ? <span class="hljs-string">"/"</span> : pathname.replace(<span class="hljs-regexp">/\/+$/</span>, <span class="hljs-string">""</span>);
  <span class="hljs-keyword">return</span> <span class="hljs-keyword">new</span> URL(cleanPath, baseUrl).toString();
}

<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">isIndexablePath</span>(<span class="hljs-params">pathname: <span class="hljs-built_in">string</span></span>) </span>{
  <span class="hljs-comment">// Index only real landing pages. Everything else gets noindex.</span>
  <span class="hljs-comment">// Adjust to your domain model.</span>
  <span class="hljs-keyword">if</span> (pathname === <span class="hljs-string">"/"</span>) <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;

  <span class="hljs-comment">// Example allow-list patterns</span>
  <span class="hljs-keyword">if</span> (<span class="hljs-regexp">/^\/states\/[a-z]{2}$/</span>.test(pathname)) <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;
  <span class="hljs-keyword">if</span> (<span class="hljs-regexp">/^\/cities\/[a-z-]+$/</span>.test(pathname)) <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;
  <span class="hljs-keyword">if</span> (<span class="hljs-regexp">/^\/categories\/[a-z-]+$/</span>.test(pathname)) <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;
  <span class="hljs-keyword">if</span> (<span class="hljs-regexp">/^\/jobs\/[0-9]+$/</span>.test(pathname)) <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;

  <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
}
</code></pre>
<h3 id="heading-step-2-emit-canonical-robots-per-page-or-layout">Step 2: Emit canonical + robots per page (or layout)</h3>
<p>If you’re on the App Router, <code>generateMetadata()</code> is the cleanest place to do this.</p>
<pre><code class="lang-ts"><span class="hljs-comment">// app/(public)/[...slug]/page.tsx</span>
<span class="hljs-keyword">import</span> <span class="hljs-keyword">type</span> { Metadata } <span class="hljs-keyword">from</span> <span class="hljs-string">"next"</span>;
<span class="hljs-keyword">import</span> { canonicalUrl, isIndexablePath } <span class="hljs-keyword">from</span> <span class="hljs-string">"@/app/lib/seo"</span>;

<span class="hljs-keyword">const</span> BASE_URL = process.env.NEXT_PUBLIC_BASE_URL!;

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">generateMetadata</span>(<span class="hljs-params">
  { params }: { params: <span class="hljs-built_in">Promise</span>&lt;{ slug?: <span class="hljs-built_in">string</span>[] }&gt; }
</span>): <span class="hljs-title">Promise</span>&lt;<span class="hljs-title">Metadata</span>&gt; </span>{
  <span class="hljs-keyword">const</span> { slug } = <span class="hljs-keyword">await</span> params;
  <span class="hljs-keyword">const</span> pathname = <span class="hljs-string">"/"</span> + (slug?.join(<span class="hljs-string">"/"</span>) ?? <span class="hljs-string">""</span>);

  <span class="hljs-keyword">const</span> canonical = canonicalUrl(BASE_URL, pathname);
  <span class="hljs-keyword">const</span> indexable = isIndexablePath(pathname);

  <span class="hljs-keyword">return</span> {
    alternates: { canonical },
    robots: indexable
      ? { index: <span class="hljs-literal">true</span>, follow: <span class="hljs-literal">true</span> }
      : { index: <span class="hljs-literal">false</span>, follow: <span class="hljs-literal">true</span> },
  };
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">default</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">Page</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-keyword">return</span> <span class="hljs-literal">null</span>;
}
</code></pre>
<p>That <code>follow: true</code> is intentional. I still want discovery through internal links even if the page itself isn’t indexable.</p>
<h3 id="heading-step-3-kill-obvious-junk-at-the-robots-layer">Step 3: Kill obvious junk at the robots layer</h3>
<p>Robots.txt isn’t a noindex mechanism (Google stopped supporting the unofficial <code>noindex</code> directive in robots.txt in 2019). But it’s still useful for crawl control.</p>
<p>I block parameters that should never be crawled.</p>
<pre><code class="lang-txt"># public/robots.txt
User-agent: *
Disallow: /*?utm_
Disallow: /*&amp;utm_
Disallow: /*?ref=
Disallow: /*&amp;ref=
Disallow: /*?sort=
Disallow: /*&amp;sort=

# Let everything else be crawlable
Allow: /
</code></pre>
<p>This doesn’t prevent indexing if there are external links pointing at a URL (Google can index a URL it can’t crawl). That’s why I still rely on <code>noindex</code> for anything that’s reachable and shouldn’t be indexed.</p>
<h3 id="heading-step-4-make-404s-real-404s">Step 4: Make 404s real 404s</h3>
<p>Soft 404s were another source of garbage URLs showing up. If a page doesn’t exist, return a real 404.</p>
<p>In App Router, <code>notFound()</code> does the right thing <em>if</em> you don’t swallow it and render a 200.</p>
<pre><code class="lang-ts"><span class="hljs-comment">// app/jobs/[id]/page.tsx</span>
<span class="hljs-keyword">import</span> { notFound } <span class="hljs-keyword">from</span> <span class="hljs-string">"next/navigation"</span>;

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">getJob</span>(<span class="hljs-params">id: <span class="hljs-built_in">number</span></span>) </span>{
  <span class="hljs-keyword">const</span> res = <span class="hljs-keyword">await</span> fetch(<span class="hljs-string">`<span class="hljs-subst">${process.env.API_BASE_URL}</span>/jobs/<span class="hljs-subst">${id}</span>`</span>, {
    cache: <span class="hljs-string">"no-store"</span>,
  });
  <span class="hljs-keyword">if</span> (res.status === <span class="hljs-number">404</span>) <span class="hljs-keyword">return</span> <span class="hljs-literal">null</span>;
  <span class="hljs-keyword">if</span> (!res.ok) <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">`Failed to fetch job <span class="hljs-subst">${id}</span>: <span class="hljs-subst">${res.status}</span>`</span>);
  <span class="hljs-keyword">return</span> res.json() <span class="hljs-keyword">as</span> <span class="hljs-built_in">Promise</span>&lt;{ id: <span class="hljs-built_in">number</span>; title: <span class="hljs-built_in">string</span> }&gt;;
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">default</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">JobPage</span>(<span class="hljs-params">
  { params }: { params: <span class="hljs-built_in">Promise</span>&lt;{ id: <span class="hljs-built_in">string</span> }&gt; }
</span>) </span>{
  <span class="hljs-keyword">const</span> { id } = <span class="hljs-keyword">await</span> params;
  <span class="hljs-keyword">const</span> jobId = <span class="hljs-built_in">Number</span>(id);
  <span class="hljs-keyword">if</span> (!<span class="hljs-built_in">Number</span>.isInteger(jobId) || jobId &lt;= <span class="hljs-number">0</span>) notFound();

  <span class="hljs-keyword">const</span> job = <span class="hljs-keyword">await</span> getJob(jobId);
  <span class="hljs-keyword">if</span> (!job) notFound();

  <span class="hljs-keyword">return</span> (
    &lt;main&gt;
      &lt;h1&gt;{job.title}&lt;/h1&gt;
      &lt;p&gt;Job #{job.id}&lt;/p&gt;
    &lt;/main&gt;
  );
}
</code></pre>
<p>That tiny <code>Number.isInteger(jobId)</code> check prevented a whole class of <code>/jobs/abc</code> garbage from turning into “valid” pages.</p>
<h2 id="heading-how-it-worked-in-production">How It Worked in Production</h2>
<p>This was one of those fixes where you don’t get to celebrate immediately. Google takes its time. Also, Search Console reporting lags.</p>
<p>But the signals were clear.</p>
<ul>
<li>Duplicate canonical issues dropped from <strong>46 to 0</strong> in <strong>9 days</strong>.</li>
<li>Soft 404s dropped from <strong>12 to 0</strong> after I fixed <code>notFound()</code> usage and stopped returning 200s for missing entities.</li>
<li>“Discovered — currently not indexed” URLs went down by <strong>50+</strong> after I blocked crawl traps (<code>utm_</code>, <code>sort</code>, <code>ref</code>) and noindexed non-landing facet pages.</li>
</ul>
<p>The surprise: canonical correctness mattered more than I expected.</p>
<p>I had one bug where I accidentally emitted the same canonical for every city page because I computed it in a layout using the parent route path. Google didn’t just ignore the canonical. It started clustering pages together. Rankings got weird. Pages dropped.</p>
<p>After I moved canonical generation to the leaf route and made it purely derived from route params, the clustering stopped.</p>
<p>This wasn’t “SEO”. It was a distributed system resolving conflicting identifiers.</p>
<h2 id="heading-when-this-doesnt-work">When This Doesn’t Work</h2>
<p>This setup breaks when you actually want faceted navigation to be indexable at scale.</p>
<p>If your business depends on long-tail combinations (think <code>/laptops?brand=lenovo&amp;ram=32gb&amp;cpu=amd</code>), blanket <code>noindex</code> on parameterized URLs will kneecap you. In that world, you need a real facet strategy: allow-list specific combinations, generate clean path-based landing pages, and control internal linking so you don’t create infinite crawl graphs.</p>
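<p>A real facet strategy usually means an explicit allow-list that maps approved combinations to clean paths. A sketch of what that could look like — the facet names and path shapes here are hypothetical:</p>

```typescript
// Hypothetical facet allow-list: only named combinations earn a clean, indexable path.
type Facets = { state?: string; role?: string; sort?: string };

const ALLOWED_COMBOS: Array<(f: Facets) => string | null> = [
  (f) => (f.state && !f.role ? `/states/${f.state}` : null),
  (f) => (f.state && f.role ? `/states/${f.state}/roles/${f.role}` : null),
];

export function landingPathFor(facets: Facets): string | null {
  if (facets.sort) return null; // sort order never earns a landing page
  for (const rule of ALLOWED_COMBOS) {
    const path = rule(facets);
    if (path) return path;
  }
  return null; // everything else: canonical + noindex
}
```

<p>Internal links then point only at paths this function returns, which keeps the crawl graph finite.</p>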
<p>Also: if your canonical logic depends on runtime headers (host, protocol) behind proxies/CDNs, you’ll generate mismatched canonicals (<code>http</code> vs <code>https</code>). That’s a mess. Use an explicit <code>BASE_URL</code> and stick to it.</p>
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<ul>
<li>Treat URLs like primary keys. One document should have one indexable identifier.</li>
<li>Canonical-only is optimistic. Canonical + selective <code>noindex</code> is defensive.</li>
<li>Robots.txt controls crawl. It doesn’t guarantee deindexing.</li>
<li>Make 404s real 404s. Soft 404s are just duplicate-content bugs wearing a different hat.</li>
<li>Keep an allow-list for indexable routes. If you can’t explain why a URL should rank, it shouldn’t be indexable.</li>
</ul>
<h2 id="heading-closing">Closing</h2>
<p>I’ve settled on a rule: if a URL can be generated by a user clicking around (filters, sorting, tracking params), it’s guilty until proven innocent.</p>
<p>What’s your rule for deciding which faceted URLs become first-class landing pages, and which ones get canonical + <code>noindex</code>?</p>
]]></content:encoded></item><item><title><![CDATA[Why I Use MMKV Over AsyncStorage for Persisted State]]></title><description><![CDATA[Offline-first apps live or die by perceived speed. In my React Native fitness app (5‑second set logging, SQLite-first), the difference between “instant” and “laggy” often comes down to one unglamorous detail: how you persist state. I initially treate...]]></description><link>https://blog.dvskr.dev/why-i-use-mmkv-over-asyncstorage-for-persisted-state</link><guid isPermaLink="true">https://blog.dvskr.dev/why-i-use-mmkv-over-asyncstorage-for-persisted-state</guid><category><![CDATA[architecture]]></category><category><![CDATA[buildinpublic]]></category><category><![CDATA[performance]]></category><category><![CDATA[Reactnative]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Sat, 31 Jan 2026 16:00:55 GMT</pubDate><content:encoded><![CDATA[<p>Offline-first apps live or die by perceived speed. In my React Native fitness app (5‑second set logging, SQLite-first), the difference between “instant” and “laggy” often comes down to one unglamorous detail: how you persist state. I initially treated persistence as an afterthought (AsyncStorage + JSON), then watched startup time and UI responsiveness degrade as I added an achievement system and more client-side state. This post is about why I switched to MMKV for persisted state with Zustand—and what I traded away to get consistently snappy interactions.</p>
<h2 id="heading-the-problem-space-persistence-is-on-the-hot-path">The problem space: persistence is on the hot path</h2>
<p>I’m building a mobile workout app where the core interaction is logging a set in ~5 seconds. The app is offline-first:</p>
<ul>
<li>SQLite is the primary data store (workouts, sets, exercises)</li>
<li>The UI needs to feel instantaneous (sub-100ms interactions)</li>
<li>State must survive restarts (in-progress workout, last used timers, user preferences, cached AI hints)</li>
<li>A new achievement system introduced more “derived UI state” (unlocked milestones, celebratory banners, streaks)</li>
</ul>
<p>At my current scale (10-person waitlist, ~400+ exercises), this isn’t “big data.” But mobile performance is nonlinear: you can be “small” and still feel slow.</p>
<p>Two constraints made persistence a first-class architecture decision:</p>
<ol>
<li><strong>Cold start is the first impression.</strong> If I can’t restore enough state quickly, users land in a blank screen or loading spinners.</li>
<li><strong>Offline-first means more local state.</strong> Remote is not the source of truth; the device is. That shifts more responsibility to local persistence.</li>
</ol>
<p>I also build in a “vibe coding” style (Cursor + Claude). That speeds up iteration, but it also increases the risk of accidental performance regressions—so I wanted a persistence layer that’s hard to misuse.</p>
<h2 id="heading-options-considered">Options considered</h2>
<p>The decision was specifically about <em>persisted app state</em> (Zustand store snapshots, preferences, small caches), not the main relational data (that lives in SQLite).</p>
<p>Here are the options I seriously considered.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Option</td><td>What it is</td><td>Pros</td><td>Cons</td><td>Best when</td></tr>
</thead>
<tbody>
<tr>
<td>AsyncStorage (community)</td><td>Simple key/value storage, async JS API</td><td>Built-in mental model, widely used, minimal native setup</td><td>JSON serialize/parse overhead, slower reads on startup, easy to store too much, performance varies by platform/implementation</td><td>Very small state, infrequent reads, low perf sensitivity</td></tr>
<tr>
<td>SQLite for everything</td><td>Store preferences/state tables in SQLite</td><td>One database to rule them all, queryable, consistent backup story</td><td>More schema work, migrations, overkill for tiny blobs, still needs careful read patterns on startup</td><td>You need relational queries or complex local joins</td></tr>
<tr>
<td>Secure storage (Keychain/Keystore)</td><td>OS-provided encrypted storage</td><td>Great for secrets, tokens</td><td>Not meant for frequent reads/writes, capacity limits, slower</td><td>Credentials, API keys, refresh tokens</td></tr>
<tr>
<td>MMKV</td><td>Fast key/value storage via JSI (C++), sync reads</td><td>Very fast, synchronous reads (no async waterfall), good for persisted state, supports encryption</td><td>Native dependency, synchronous API can be abused, not queryable like SQLite</td><td>Hot-path state (startup, navigation), medium-sized persisted slices</td></tr>
</tbody>
</table>
</div><h3 id="heading-why-not-just-use-asyncstorage-correctly">Why not just use AsyncStorage “correctly”?</h3>
<p>You can. If you aggressively minimize what you persist, debounce writes, and avoid reading too much on boot, AsyncStorage can be fine.</p>
<p>But in practice, “correctly” is the hard part—especially as a solo creator moving fast.</p>
<p>What pushed me away:</p>
<ul>
<li><strong>Async waterfall on startup</strong>: <code>await getItem()</code> chains across multiple keys can add up.</li>
<li><strong>Serialization overhead</strong>: JSON parse/stringify becomes noticeable when you persist larger objects (like achievement states or cached AI responses).</li>
<li><strong>Non-obvious regressions</strong>: a single new persisted field can silently add milliseconds to every startup.</li>
</ul>
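<p>For comparison, the standard AsyncStorage mitigation is a debounced writer, so a burst of state changes collapses into one <code>setItem</code> call. A sketch against a generic storage interface (the 300ms window is an arbitrary choice, not a recommendation):</p>

```typescript
type KVStorage = { setItem(key: string, value: string): Promise<void> };

// Collapse bursts of writes to the same key into a single setItem call.
export function debouncedWriter(storage: KVStorage, delayMs = 300) {
  const timers = new Map<string, ReturnType<typeof setTimeout>>();
  return (key: string, value: string) => {
    const existing = timers.get(key);
    if (existing) clearTimeout(existing);
    timers.set(
      key,
      setTimeout(() => {
        timers.delete(key);
        void storage.setItem(key, value); // last write wins
      }, delayMs)
    );
  };
}
```

<p>It works, but now every persisted slice needs this discipline applied by hand — which is exactly the kind of invisible contract that erodes when you’re moving fast.</p>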
<h3 id="heading-why-not-store-state-in-sqlite">Why not store state in SQLite?</h3>
<p>I’m already using SQLite heavily, so the “one local store” idea was tempting.</p>
<p>But I want to keep a clear boundary:</p>
<ul>
<li><strong>SQLite</strong>: durable domain data (workouts/sets/exercises), needs migrations, integrity constraints.</li>
<li><strong>KV store</strong>: UI/session preferences and caches (fast, schema-less, easy to wipe).</li>
</ul>
<p>Mixing them tends to create a junk-drawer schema where every new UI flag becomes a table row. It’s workable, but it’s not the kind of complexity I want early.</p>
<h2 id="heading-the-decision-mmkv-for-persisted-zustand-slices">The decision: MMKV for persisted Zustand slices</h2>
<p>I chose <strong>MMKV</strong> as the persistence backend for Zustand.</p>
<p>Ranked reasons:</p>
<ol>
<li><strong>Startup speed via synchronous reads</strong>: restoring state doesn’t require an async chain before rendering.</li>
<li><strong>Lower overhead for small-to-medium blobs</strong>: less pain from JSON parse/stringify on hot paths.</li>
<li><strong>Better guardrails</strong>: it nudges me toward persisting only what matters, because it’s easy to keep state slices small and explicit.</li>
</ol>
<p>What I gave up:</p>
<ul>
<li><strong>More native surface area</strong> (dependency management, Expo config/plugins)</li>
<li><strong>Potential UI jank if I abuse sync reads/writes</strong> (sync is a tool, not a free lunch)</li>
<li><strong>Less portability</strong> than AsyncStorage (MMKV is native-first)</li>
</ul>
<h3 id="heading-implementation-overview">Implementation overview</h3>
<p>The key architectural choice wasn’t “use MMKV” in isolation; it was <strong>persist only the slices that must survive a restart</strong>.</p>
<p>My rule: if it can be recomputed from SQLite, don’t persist it in MMKV.</p>
<h4 id="heading-1-create-an-mmkv-backed-storage-adapter-for-zustand">1) Create an MMKV-backed storage adapter for Zustand</h4>
<pre><code class="lang-ts"><span class="hljs-comment">// storage/mmkv.ts</span>
<span class="hljs-keyword">import</span> { MMKV } <span class="hljs-keyword">from</span> <span class="hljs-string">'react-native-mmkv'</span>

<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> mmkv = <span class="hljs-keyword">new</span> MMKV({
  id: <span class="hljs-string">'gymtracker-mmkv'</span>,
  <span class="hljs-comment">// Optional: encryptionKey can be added, but be careful with key management.</span>
})

<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> zustandStorage = {
  setItem: <span class="hljs-function">(<span class="hljs-params">name: <span class="hljs-built_in">string</span>, value: <span class="hljs-built_in">string</span></span>) =&gt;</span> {
    mmkv.set(name, value)
  },
  getItem: <span class="hljs-function">(<span class="hljs-params">name: <span class="hljs-built_in">string</span></span>) =&gt;</span> {
    <span class="hljs-keyword">const</span> v = mmkv.getString(name)
    <span class="hljs-keyword">return</span> v ?? <span class="hljs-literal">null</span>
  },
  removeItem: <span class="hljs-function">(<span class="hljs-params">name: <span class="hljs-built_in">string</span></span>) =&gt;</span> {
    mmkv.delete(name)
  },
}
</code></pre>
<p>This adapter keeps the persistence boundary clean: Zustand only sees a string-based storage interface.</p>
<h4 id="heading-2-persist-only-session-critical-state">2) Persist only “session-critical” state</h4>
<p>Example: the currently active workout session (IDs, timestamps, UI mode), not the full workout history.</p>
<pre><code class="lang-ts"><span class="hljs-comment">// state/useWorkoutSessionStore.ts</span>
<span class="hljs-keyword">import</span> { create } <span class="hljs-keyword">from</span> <span class="hljs-string">'zustand'</span>
<span class="hljs-keyword">import</span> { persist, createJSONStorage } <span class="hljs-keyword">from</span> <span class="hljs-string">'zustand/middleware'</span>
<span class="hljs-keyword">import</span> { zustandStorage } <span class="hljs-keyword">from</span> <span class="hljs-string">'../storage/mmkv'</span>

<span class="hljs-keyword">type</span> WorkoutSessionState = {
  activeWorkoutId: <span class="hljs-built_in">string</span> | <span class="hljs-literal">null</span>
  startedAt: <span class="hljs-built_in">number</span> | <span class="hljs-literal">null</span>
  restTimerSeconds: <span class="hljs-built_in">number</span>
  setActiveWorkout: <span class="hljs-function">(<span class="hljs-params">id: <span class="hljs-built_in">string</span> | <span class="hljs-literal">null</span></span>) =&gt;</span> <span class="hljs-built_in">void</span>
  setRestTimerSeconds: <span class="hljs-function">(<span class="hljs-params">s: <span class="hljs-built_in">number</span></span>) =&gt;</span> <span class="hljs-built_in">void</span>
  resetSession: <span class="hljs-function">() =&gt;</span> <span class="hljs-built_in">void</span>
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> useWorkoutSessionStore = create&lt;WorkoutSessionState&gt;()(
  persist(
    <span class="hljs-function">(<span class="hljs-params">set</span>) =&gt;</span> ({
      activeWorkoutId: <span class="hljs-literal">null</span>,
      startedAt: <span class="hljs-literal">null</span>,
      restTimerSeconds: <span class="hljs-number">90</span>,
      setActiveWorkout: <span class="hljs-function">(<span class="hljs-params">id</span>) =&gt;</span> set({ activeWorkoutId: id, startedAt: id ? <span class="hljs-built_in">Date</span>.now() : <span class="hljs-literal">null</span> }),
      setRestTimerSeconds: <span class="hljs-function">(<span class="hljs-params">s</span>) =&gt;</span> set({ restTimerSeconds: s }),
      resetSession: <span class="hljs-function">() =&gt;</span> set({ activeWorkoutId: <span class="hljs-literal">null</span>, startedAt: <span class="hljs-literal">null</span> }),
    }),
    {
      name: <span class="hljs-string">'workout-session-v1'</span>,
      storage: createJSONStorage(<span class="hljs-function">() =&gt;</span> zustandStorage),
      partialize: <span class="hljs-function">(<span class="hljs-params">state</span>) =&gt;</span> ({
        activeWorkoutId: state.activeWorkoutId,
        startedAt: state.startedAt,
        restTimerSeconds: state.restTimerSeconds,
      }),
    }
  )
)
</code></pre>
<p>Two important details:</p>
<ul>
<li><code>partialize</code>: prevents “accidental persistence” of large or unstable state.</li>
<li>Versioned key names (<code>workout-session-v1</code>): make it easy to invalidate persisted state when the shape changes.</li>
</ul>
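<p>Zustand’s <code>persist</code> middleware also has built-in <code>version</code>/<code>migrate</code> options; whichever mechanism you use, the migration itself is just a pure function from the old shape to the current one. A sketch — the v0 shape here is hypothetical, invented for illustration:</p>

```typescript
type SessionV1 = {
  activeWorkoutId: string | null;
  startedAt: number | null;
  restTimerSeconds: number;
};

// Pure migration: old persisted shapes in, current shape out.
export function migrateSession(persisted: unknown, version: number): SessionV1 {
  if (version === 0) {
    // Pretend v0 stored the rest timer in minutes under a different key.
    const old = persisted as { activeWorkoutId?: string | null; restMinutes?: number };
    return {
      activeWorkoutId: old.activeWorkoutId ?? null,
      startedAt: null, // don't resurrect stale sessions across a schema change
      restTimerSeconds: (old.restMinutes ?? 1.5) * 60,
    };
  }
  return persisted as SessionV1;
}
```

<p>Keeping the migration pure makes it trivially unit-testable, which matters more than usual because persisted-state bugs only surface on upgrade.</p>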
<h4 id="heading-3-dont-persist-derivedlarge-objects-store-references-instead">3) Don’t persist derived/large objects (store references instead)</h4>
<p>A trap I fell into early: persisting “achievement UI state” as a big object (every milestone, last shown, UI banners). It inflated the persisted blob and increased restore time.</p>
<p>Instead, I persist only:</p>
<ul>
<li>last shown milestone ID</li>
<li>a set of unlocked milestone IDs (small)</li>
<li>timestamps for rate-limiting celebrations</li>
</ul>
<p>Everything else is derived from a static milestone catalog bundled with the app.</p>
<pre><code class="lang-ts"><span class="hljs-comment">// state/useAchievementsStore.ts</span>
<span class="hljs-keyword">import</span> { create } <span class="hljs-keyword">from</span> <span class="hljs-string">'zustand'</span>
<span class="hljs-keyword">import</span> { persist, createJSONStorage } <span class="hljs-keyword">from</span> <span class="hljs-string">'zustand/middleware'</span>
<span class="hljs-keyword">import</span> { zustandStorage } <span class="hljs-keyword">from</span> <span class="hljs-string">'../storage/mmkv'</span>

<span class="hljs-keyword">type</span> AchievementsState = {
  unlockedIds: Record&lt;<span class="hljs-built_in">string</span>, <span class="hljs-literal">true</span>&gt;
  lastCelebratedId: <span class="hljs-built_in">string</span> | <span class="hljs-literal">null</span>
  lastCelebratedAt: <span class="hljs-built_in">number</span> | <span class="hljs-literal">null</span>
  unlock: <span class="hljs-function">(<span class="hljs-params">id: <span class="hljs-built_in">string</span></span>) =&gt;</span> <span class="hljs-built_in">void</span>
  markCelebrated: <span class="hljs-function">(<span class="hljs-params">id: <span class="hljs-built_in">string</span></span>) =&gt;</span> <span class="hljs-built_in">void</span>
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> useAchievementsStore = create&lt;AchievementsState&gt;()(
  persist(
    <span class="hljs-function">(<span class="hljs-params">set, get</span>) =&gt;</span> ({
      unlockedIds: {},
      lastCelebratedId: <span class="hljs-literal">null</span>,
      lastCelebratedAt: <span class="hljs-literal">null</span>,
      unlock: <span class="hljs-function">(<span class="hljs-params">id</span>) =&gt;</span> {
        <span class="hljs-keyword">if</span> (get().unlockedIds[id]) <span class="hljs-keyword">return</span>
        set(<span class="hljs-function">(<span class="hljs-params">s</span>) =&gt;</span> ({ unlockedIds: { ...s.unlockedIds, [id]: <span class="hljs-literal">true</span> } }))
      },
      markCelebrated: <span class="hljs-function">(<span class="hljs-params">id</span>) =&gt;</span> set({ lastCelebratedId: id, lastCelebratedAt: <span class="hljs-built_in">Date</span>.now() }),
    }),
    {
      name: <span class="hljs-string">'achievements-v1'</span>,
      storage: createJSONStorage(<span class="hljs-function">() =&gt;</span> zustandStorage),
      partialize: <span class="hljs-function">(<span class="hljs-params">s</span>) =&gt;</span> ({
        unlockedIds: s.unlockedIds,
        lastCelebratedId: s.lastCelebratedId,
        lastCelebratedAt: s.lastCelebratedAt,
      }),
    }
  )
)
</code></pre>
<p>This keeps MMKV as a fast “memory of the app,” not a second database.</p>
<h2 id="heading-results-amp-learnings-numbers-gotchas">Results &amp; learnings (numbers + gotchas)</h2>
<p>I don’t have millions of users, so my numbers are device-level measurements, not fleet-wide telemetry.</p>
<p>On a mid-range Android device (Pixel 6a-class) and an iPhone 13-class device, measured with simple timestamp logging around hydration and first interactive screen:</p>
<ul>
<li><p><strong>Persisted hydration time</strong> (Zustand restore):</p>
<ul>
<li>AsyncStorage (before): ~35–80ms typical, with occasional 150ms spikes when the persisted blob grew</li>
<li>MMKV (after): ~5–15ms typical, fewer spikes</li>
</ul>
</li>
<li><p><strong>Time to first “workout screen interactive”</strong> (not full app start, but navigation + state ready):</p>
<ul>
<li>Before: ~450–650ms</li>
<li>After: ~320–500ms</li>
</ul>
</li>
</ul>
<p>The bigger win wasn’t the raw numbers—it was predictability. The spikes were what made the UI feel unreliable.</p>
<p>Unexpected challenges:</p>
<ol>
<li><strong>Sync APIs are easy to misuse</strong>: MMKV reads are synchronous. If you start doing lots of reads during render (especially in list items), you can create jank. My mitigation: read once in store hydration, then use in-memory state.</li>
<li><strong>Data shape discipline matters more than the storage engine</strong>: MMKV didn’t save me from persisting too much. <code>partialize</code> did.</li>
<li><strong>Wipe strategy</strong>: for debugging and schema changes, having a clear “reset local state” action is essential. KV stores make this easier than SQLite.</li>
</ol>
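<p>For the wipe, I keep one function that deletes every persisted slice by its versioned key prefix. Sketched against a generic key-value interface rather than MMKV directly, so the snippet stands alone — the prefixes are just my store names:</p>

```typescript
type KV = { getAllKeys(): string[]; delete(key: string): void };

// Known prefixes of every persisted slice (versioned key names).
const PERSISTED_PREFIXES = ["workout-session-", "achievements-"];

// Wipe all persisted app state; returns what was removed (handy for a debug toast).
export function resetLocalState(kv: KV): string[] {
  const wiped = kv
    .getAllKeys()
    .filter((k) => PERSISTED_PREFIXES.some((p) => k.startsWith(p)));
  for (const key of wiped) kv.delete(key);
  return wiped;
}
```

<p>Deleting by prefix rather than clearing everything means unrelated keys (device IDs, one-time flags) survive a debug reset.</p>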
<blockquote>
<p>Key insight: MMKV improved the ceiling, but state-slice design removed the foot-guns.</p>
</blockquote>
<h2 id="heading-when-this-doesnt-work">When this doesn’t work</h2>
<p>MMKV isn’t a universal recommendation.</p>
<p>Choose something else if:</p>
<ul>
<li><strong>You need complex queries</strong> over persisted data (use SQLite). KV storage is not fun once you need filtering, joins, or analytics.</li>
<li><strong>Your persisted state is tiny and read rarely</strong>. AsyncStorage is simpler and “good enough” when you’re persisting a couple of strings.</li>
<li><strong>You’re in a strict managed environment</strong> where adding native modules is costly (depending on your Expo setup and policies).</li>
<li><strong>You have multi-process or cross-app access requirements</strong>. MMKV has patterns for this, but the complexity rises quickly.</li>
</ul>
<p>Also: if your app’s main performance problem is expensive renders, heavy images/GIFs, or slow SQLite queries, switching persistence backends won’t move the needle much.</p>
<h2 id="heading-key-takeaways">Key takeaways</h2>
<ol>
<li><strong>Treat persistence as part of your performance budget</strong>, not a utility. It’s on the startup path.</li>
<li><strong>Persist references, not aggregates</strong>: store IDs and timestamps; derive the rest from SQLite or static catalogs.</li>
<li><strong>Use <code>partialize</code> (or equivalents) as a guardrail</strong> to prevent “state creep.”</li>
<li><strong>Prefer predictable performance over theoretical simplicity</strong> when your UX depends on speed.</li>
<li><strong>Measure spikes, not just averages</strong>—users feel the worst 5%.</li>
</ol>
<h2 id="heading-closing">Closing</h2>
<p>If you’re building an offline-first React Native app, what do you persist outside your primary database—and how do you keep that persisted state from quietly growing into a second system of record?</p>
]]></content:encoded></item><item><title><![CDATA[Why I Add an Async Outbox Before Reaching for Kafka]]></title><description><![CDATA[The first time you ship “send email after signup” as an inline API call, it works—until it doesn’t. One slow provider, one transient timeout, or one deploy at the wrong time and you start dropping side effects (emails, webhooks, audit logs) or, worse...]]></description><link>https://blog.dvskr.dev/why-i-add-an-async-outbox-before-reaching-for-kafka</link><guid isPermaLink="true">https://blog.dvskr.dev/why-i-add-an-async-outbox-before-reaching-for-kafka</guid><category><![CDATA[architecture]]></category><category><![CDATA[buildinpublic]]></category><category><![CDATA[database]]></category><category><![CDATA[systemdesign]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Thu, 29 Jan 2026 16:00:18 GMT</pubDate><content:encoded><![CDATA[<p>The first time you ship “send email after signup” as an inline API call, it works—until it doesn’t. One slow provider, one transient timeout, or one deploy at the wrong time and you start dropping side effects (emails, webhooks, audit logs) or, worse, sending duplicates. As a solo creator, the painful part isn’t just the bug—it’s the operational overhead of fixing it repeatedly without a team. Here’s the architectural decision I now default to: add an async outbox (in the same database) before reaching for a full message bus.</p>
<h2 id="heading-1-the-decision-do-you-need-a-message-bus-yet">1) The decision: do you need a message bus yet?</h2>
<p>When you’re building solo, reliability problems show up in the least glamorous places:</p>
<ul>
<li>password reset emails that sometimes don’t arrive</li>
<li>webhook deliveries that randomly fail</li>
<li>“welcome” sequences that send twice</li>
<li>background jobs that disappear during deploys</li>
</ul>
<p>The common root cause is coupling: your request path is doing too much work, and your side effects aren’t transactional with your core write.</p>
<p>The tempting solution is “add Kafka/RabbitMQ/SQS.” But running (or even integrating) a message bus is not free: schema evolution, retries, DLQs, observability, idempotency, consumer deployments, and a new failure domain.</p>
<p>My default for early-stage systems is an <strong>async outbox</strong>: write an “event to send” into the same database transaction as your core change, then have a worker deliver it with retries.</p>
<blockquote>
<p>Key idea: if the business write commits, the side effect is guaranteed to be recorded—even if delivery happens later.</p>
</blockquote>
<hr />
<h2 id="heading-2-context-the-problem-space">2) Context (The Problem Space)</h2>
<h3 id="heading-requirements-amp-constraints">Requirements &amp; constraints</h3>
<p>For a solo system design, I optimize for:</p>
<ul>
<li><strong>Correctness over immediacy</strong>: “email eventually sent” beats “sometimes sent instantly.”</li>
<li><strong>Low operational load</strong>: fewer moving pieces, fewer dashboards.</li>
<li><strong>Cost predictability</strong>: one database and one worker is usually enough.</li>
<li><strong>Deploy safety</strong>: deploys shouldn’t drop side effects.</li>
</ul>
<h3 id="heading-typical-scale-expectations">Typical scale expectations</h3>
<p>This pattern holds comfortably for:</p>
<ul>
<li>tens to hundreds of requests/sec</li>
<li>thousands to millions of outbox rows/day</li>
<li>side effects like email/webhook/analytics events</li>
</ul>
<h3 id="heading-non-functional-requirements">Non-functional requirements</h3>
<ul>
<li><strong>At-least-once delivery</strong> (with idempotency on the consumer/provider side)</li>
<li><strong>Retry with backoff</strong></li>
<li><strong>Observability</strong>: know what’s stuck and why</li>
<li><strong>No phantom sends</strong>: don’t send an email if the user row didn’t commit</li>
</ul>
<h3 id="heading-why-just-do-it-inline-doesnt-fit">Why “just do it inline” doesn’t fit</h3>
<p>Inline side effects fail in subtle ways:</p>
<ul>
<li>you commit the DB write, then the email API times out → user exists, but no email</li>
<li>you send the email, then the DB transaction rolls back → email references a user that “doesn’t exist”</li>
<li>you retry the request, and now you send duplicates</li>
</ul>
<p>The outbox is basically admitting that <strong>distributed systems exist even in a monolith</strong>: your DB plus your third-party APIs already form a distributed system.</p>
<hr />
<h2 id="heading-3-options-considered">3) Options considered</h2>
<p>Below are the common approaches I’ve used/seen, and where they break.</p>
<h3 id="heading-comparison-table">Comparison table</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<th>Option</th><th>What it is</th><th>Pros</th><th>Cons</th><th>Best when</th></tr>
</thead>
<tbody>
<tr>
<td>Inline side effects</td><td>Call email/webhook provider inside request</td><td>Simple, low latency</td><td>Not transactional, timeouts hurt UX, duplicates on retries</td><td>Truly non-critical side effects</td></tr>
<tr>
<td>Background job queue only</td><td>Push job to Redis/queue from request</td><td>Async, faster requests</td><td>Still not transactional; the enqueue can't join the DB transaction</td><td>You can tolerate occasional lost jobs</td></tr>
<tr>
<td>Async outbox (DB)</td><td>Write outbox row in same DB transaction; worker delivers</td><td>Transactional recording, fewer components, great for solo</td><td>Adds polling/worker, needs idempotency + cleanup</td><td>MVPs to mid-scale systems</td></tr>
<tr>
<td>CDC (change data capture)</td><td>Stream DB changes to consumers (Debezium, logical replication)</td><td>Near real-time, scalable</td><td>Operational complexity, schema coupling, infra overhead</td><td>Data platforms, multiple consumers</td></tr>
<tr>
<td>Full message bus</td><td>Kafka/RabbitMQ/SQS + producers/consumers</td><td>High throughput, decoupling, replay</td><td>More infra, more failure modes, more tooling</td><td>Many services/teams, high event volume</td></tr>
</tbody>
</table>
</div><h3 id="heading-option-notes-the-gotchas">Option notes (the “gotchas”)</h3>
<h4 id="heading-inline-side-effects">Inline side effects</h4>
<ul>
<li>Works until your provider latency spikes.</li>
<li>Forces you to choose between slow user experience and unreliable delivery.</li>
</ul>
<h4 id="heading-background-queue-only-redis-etc">Background queue only (Redis etc.)</h4>
<ul>
<li>Better UX, but if enqueue happens after commit and the process crashes in between, you lose the job.</li>
<li>If enqueue happens before commit and the commit fails, you send an email for a transaction that never happened.</li>
</ul>
<h4 id="heading-async-outbox">Async outbox</h4>
<ul>
<li>You trade a bit of implementation complexity for a big jump in correctness.</li>
<li>Your DB becomes both the system of record and the “durable queue.”</li>
</ul>
<h4 id="heading-cdc-or-message-bus">CDC or message bus</h4>
<ul>
<li>Great when multiple consumers need the same events, or you need replay.</li>
<li>Usually too much surface area for a solo codebase early on.</li>
</ul>
<hr />
<h2 id="heading-4-the-decision-what-i-chose">4) The decision (What I chose)</h2>
<p>I choose <strong>Async Outbox in the primary database</strong> as the default for emails/webhooks/audit events.</p>
<h3 id="heading-primary-reasons-ranked">Primary reasons (ranked)</h3>
<ol>
<li><strong>Transactional integrity</strong>: the outbox row is committed with the business write.</li>
<li><strong>Operational simplicity</strong>: no new infra tier (beyond a worker process).</li>
<li><strong>Deploy resilience</strong>: if the worker is down, events accumulate; nothing is lost.</li>
<li><strong>Debuggability</strong>: outbox table is a truth source you can query with SQL.</li>
</ol>
<h3 id="heading-what-i-gave-up">What I gave up</h3>
<ul>
<li><strong>Instant delivery</strong>: outbox is “near real-time,” not truly synchronous.</li>
<li><strong>DB load</strong>: polling adds reads/writes; you must index correctly.</li>
<li><strong>Exactly-once</strong>: you usually get at-least-once; duplicates are handled via idempotency.</li>
</ul>
<h3 id="heading-implementation-overview">Implementation overview</h3>
<h4 id="heading-data-model">Data model</h4>
<p>A minimal outbox table:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> outbox_events (
  <span class="hljs-keyword">id</span>            BIGSERIAL PRIMARY <span class="hljs-keyword">KEY</span>,
  topic         <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  payload       JSONB <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  idempotency_key <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  <span class="hljs-keyword">status</span>        <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-string">'pending'</span>, <span class="hljs-comment">-- pending, processing, sent, failed</span>
  attempts      <span class="hljs-built_in">INT</span>  <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-number">0</span>,
  available_at  TIMESTAMPTZ <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-keyword">now</span>(),
  locked_at     TIMESTAMPTZ,
  lock_owner    <span class="hljs-built_in">TEXT</span>,
  created_at    TIMESTAMPTZ <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-keyword">now</span>(),
  updated_at    TIMESTAMPTZ <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-keyword">now</span>()
);

<span class="hljs-comment">-- Prevent duplicate logical events (e.g., signup welcome email)</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">UNIQUE</span> <span class="hljs-keyword">INDEX</span> outbox_idempotency_uk
  <span class="hljs-keyword">ON</span> outbox_events(topic, idempotency_key);

<span class="hljs-comment">-- Fast fetching of ready work</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> outbox_ready_idx
  <span class="hljs-keyword">ON</span> outbox_events(<span class="hljs-keyword">status</span>, available_at)
  <span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">status</span> <span class="hljs-keyword">IN</span> (<span class="hljs-string">'pending'</span>, <span class="hljs-string">'failed'</span>);
</code></pre>
<p>The important design choice here is <strong>idempotency_key</strong>. This is what keeps “at-least-once delivery” from becoming “user got 3 emails.”</p>
<p>Examples of idempotency keys:</p>
<ul>
<li><code>welcome_email:user_id=123</code></li>
<li><code>webhook:invoice_paid:invoice_id=abc</code></li>
</ul>
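<p>Keys like these can be generated with a tiny helper so that field ordering never changes the result. A sketch (the <code>idempotencyKey</code> name and shape are illustrative, not from any library):</p>

```ts
// Build a stable idempotency key from an event name and entity fields.
// Illustrative helper; the exact format only needs to be deterministic.
function idempotencyKey(
  event: string,
  fields: Record<string, string | number>
): string {
  const parts = Object.keys(fields)
    .sort() // sort keys so field order never changes the key
    .map((k) => `${k}=${fields[k]}`);
  return `${event}:${parts.join(":")}`;
}

// idempotencyKey("welcome_email", { user_id: 123 })
//   -> "welcome_email:user_id=123"
```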
<h4 id="heading-writing-to-the-outbox-in-the-same-transaction">Writing to the outbox in the same transaction</h4>
<p>Pseudocode (Node/TypeScript-ish, but the idea is language-agnostic):</p>
<pre><code class="lang-ts"><span class="hljs-keyword">await</span> db.tx(<span class="hljs-keyword">async</span> (tx) =&gt; {
  <span class="hljs-keyword">const</span> user = <span class="hljs-keyword">await</span> tx.query(
    <span class="hljs-string">`INSERT INTO users(email) VALUES ($1) RETURNING id, email`</span>,
    [email]
  );

  <span class="hljs-keyword">await</span> tx.query(
    <span class="hljs-string">`INSERT INTO outbox_events(topic, payload, idempotency_key)
     VALUES ($1, $2::jsonb, $3)
     ON CONFLICT (topic, idempotency_key) DO NOTHING`</span>,
    [
      <span class="hljs-string">'email.welcome'</span>,
      <span class="hljs-built_in">JSON</span>.stringify({ userId: user.id, email: user.email }),
      <span class="hljs-string">`welcome_email:user_id=<span class="hljs-subst">${user.id}</span>`</span>
    ]
  );
});
</code></pre>
<p>This is the core win: <strong>either both rows commit, or neither does</strong>.</p>
<h4 id="heading-worker-claim-rows-safely-skip-locked">Worker: claim rows safely (skip locked)</h4>
<p>The worker loop should:</p>
<ol>
<li>fetch a small batch of ready events</li>
<li>atomically mark them as processing (lock)</li>
<li>deliver</li>
<li>mark sent or schedule retry</li>
</ol>
<p>In Postgres, a common pattern is <code>FOR UPDATE SKIP LOCKED</code>:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">WITH</span> <span class="hljs-keyword">next</span> <span class="hljs-keyword">AS</span> (
  <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">id</span>
  <span class="hljs-keyword">FROM</span> outbox_events
  <span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">status</span> <span class="hljs-keyword">IN</span> (<span class="hljs-string">'pending'</span>, <span class="hljs-string">'failed'</span>)
    <span class="hljs-keyword">AND</span> available_at &lt;= <span class="hljs-keyword">now</span>()
  <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> created_at
  <span class="hljs-keyword">LIMIT</span> <span class="hljs-number">50</span>
  <span class="hljs-keyword">FOR</span> <span class="hljs-keyword">UPDATE</span> <span class="hljs-keyword">SKIP</span> <span class="hljs-keyword">LOCKED</span>
)
<span class="hljs-keyword">UPDATE</span> outbox_events e
<span class="hljs-keyword">SET</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'processing'</span>,
    locked_at = <span class="hljs-keyword">now</span>(),
    lock_owner = $<span class="hljs-number">1</span>,
    updated_at = <span class="hljs-keyword">now</span>()
<span class="hljs-keyword">FROM</span> <span class="hljs-keyword">next</span>
<span class="hljs-keyword">WHERE</span> e.id = next.id
<span class="hljs-keyword">RETURNING</span> e.id, e.topic, e.payload, e.attempts;
</code></pre>
<p>This lets you run multiple worker instances without double-processing the same row.</p>
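<p>Stitched together, one pass of the worker looks roughly like this. It's a sketch with injected <code>claimBatch</code>/<code>deliver</code>/<code>markSent</code>/<code>markFailed</code> helpers standing in for the SQL above; none of these names come from a real library:</p>

```ts
type OutboxEvent = { id: number; topic: string; payload: unknown; attempts: number };

type WorkerDeps = {
  claimBatch: () => Promise<OutboxEvent[]>;               // the SKIP LOCKED query
  deliver: (e: OutboxEvent) => Promise<void>;             // email/webhook call
  markSent: (id: number) => Promise<void>;
  markFailed: (e: OutboxEvent, err: Error) => Promise<void>; // backoff update
};

// One pass of the worker: claim, deliver, record the outcome per event.
// Returns counts so a scheduler or metrics layer can observe progress.
async function processOnce(deps: WorkerDeps): Promise<{ sent: number; failed: number }> {
  const batch = await deps.claimBatch();
  let sent = 0;
  let failed = 0;
  for (const event of batch) {
    try {
      await deps.deliver(event);
      await deps.markSent(event.id);
      sent++;
    } catch (err) {
      // A failure for one event must never stop the rest of the batch.
      await deps.markFailed(event, err as Error);
      failed++;
    }
  }
  return { sent, failed };
}
```

<p>The loop body is deliberately dumb: all the interesting policy (locking, backoff, terminal states) lives in the SQL, so the worker process stays trivially restartable.</p>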
<h4 id="heading-retry-policy-with-backoff">Retry policy with backoff</h4>
<p>I usually start with exponential backoff capped at a minute.</p>
<pre><code class="lang-ts"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">nextBackoffSeconds</span>(<span class="hljs-params">attempt: <span class="hljs-built_in">number</span></span>): <span class="hljs-title">number</span> </span>{
  <span class="hljs-comment">// 1, 2, 4, 8, 16, 32, 60, 60...</span>
  <span class="hljs-keyword">return</span> <span class="hljs-built_in">Math</span>.min(<span class="hljs-number">60</span>, <span class="hljs-number">2</span> ** <span class="hljs-built_in">Math</span>.max(<span class="hljs-number">0</span>, attempt));
}

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">markFailed</span>(<span class="hljs-params">id: <span class="hljs-built_in">number</span>, attempts: <span class="hljs-built_in">number</span>, err: <span class="hljs-built_in">Error</span></span>) </span>{
  <span class="hljs-keyword">const</span> delay = nextBackoffSeconds(attempts);
  <span class="hljs-keyword">await</span> db.query(
    <span class="hljs-string">`UPDATE outbox_events
     SET status='failed',
         attempts = attempts + 1,
         available_at = now() + ($2 || ' seconds')::interval,
         updated_at = now(),
         payload = jsonb_set(payload, '{last_error}', to_jsonb($3::text), true)
     WHERE id = $1`</span>,
    [id, delay, err.message]
  );
}
</code></pre>
<p>A few deliberate choices:</p>
<ul>
<li>store <code>last_error</code> to make SQL-based debugging possible</li>
<li>cap backoff to avoid “retry storms”</li>
<li>keep it simple; you can add jitter later</li>
</ul>
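<p>When you do add jitter, the change is small. A sketch (the <code>rand</code> parameter is injected only so the function stays deterministic in tests):</p>

```ts
// Exponential backoff capped at 60s, with jitter: pick a uniform delay
// between 1s and the capped exponential value, so many events that
// failed at the same time don't all retry at the same instant.
function nextBackoffWithJitter(
  attempt: number,
  rand: () => number = Math.random
): number {
  const capped = Math.min(60, 2 ** Math.max(0, attempt));
  return 1 + Math.floor(rand() * capped);
}
```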
<hr />
<h2 id="heading-5-results-amp-learnings">5) Results &amp; learnings</h2>
<p>Because this is a general pattern (not tied to one product), I’ll share the kinds of numbers I’ve repeatedly observed in production-ish solo systems.</p>
<h3 id="heading-performance-impact">Performance impact</h3>
<ul>
<li><strong>Request latency</strong>: moving side effects out of the request typically drops p95 latency by <strong>200–1500ms</strong> (depending on provider latency). The DB write for an outbox row is usually <strong>single-digit milliseconds</strong> when indexed.</li>
<li><strong>Throughput</strong>: a single worker polling every 250–1000ms and processing batches of 50 can comfortably handle <strong>hundreds to low thousands of events/min</strong> on modest hardware, assuming the downstream provider isn’t the bottleneck.</li>
<li><strong>DB load</strong>: the outbox table can become write-heavy. With proper partial indexes and batch updates, I typically see outbox overhead remain <strong>&lt;5–10%</strong> of total DB CPU for small-to-mid workloads.</li>
</ul>
<h3 id="heading-what-worked-well">What worked well</h3>
<ul>
<li>Debugging becomes SQL-native: “show me pending emails older than 10 minutes” is a query.</li>
<li>Deploys are less scary: if the worker is down for 10 minutes, you process the backlog.</li>
</ul>
<h3 id="heading-unexpected-challenges">Unexpected challenges</h3>
<ul>
<li><strong>Idempotency is non-negotiable</strong>: you will send duplicates eventually (timeouts, provider ambiguity). Design for it.</li>
<li><strong>Poison messages</strong>: some payloads will fail forever (bad email, invalid webhook URL). You need a terminal state and alerting.</li>
<li><strong>Table growth</strong>: you must archive or delete sent events.</li>
</ul>
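<p>For table growth, the fix is a periodic purge of terminal rows, i.e. the equivalent of <code>DELETE ... WHERE status IN ('sent','dead') AND updated_at &lt; now() - interval</code>. The retention check itself is simple enough to sketch in isolation (hypothetical shapes, 30-day retention as an example):</p>

```ts
type OutboxRow = { id: number; status: string; updatedAt: Date };

// Pick rows that are safe to delete: terminal status, older than retention.
// Rows still pending or retrying are never touched.
function purgeCandidates(rows: OutboxRow[], retentionDays: number, now: Date): number[] {
  const cutoff = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  return rows
    .filter((r) => (r.status === "sent" || r.status === "dead") && r.updatedAt.getTime() < cutoff)
    .map((r) => r.id);
}
```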
<h3 id="heading-what-id-do-differently-next-time">What I’d do differently next time</h3>
<ul>
<li>Add a <code>dead</code> status after N attempts and a lightweight admin view early.</li>
<li>Add basic metrics (counts by status, oldest pending age) before problems happen.</li>
</ul>
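<p>The <code>dead</code> status itself is a one-line decision in the retry path. A sketch (the cap of 8 attempts is an arbitrary illustration):</p>

```ts
// After a delivery failure, decide whether to retry later or give up.
// Events marked "dead" stop being polled and show up in alerting instead.
function statusAfterFailure(attempts: number, maxAttempts = 8): "failed" | "dead" {
  // attempts is the count *before* this failure, matching the markFailed
  // code above, which increments the column as part of the same UPDATE.
  return attempts + 1 >= maxAttempts ? "dead" : "failed";
}
```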
<hr />
<h2 id="heading-6-when-this-doesnt-work">6) When this doesn’t work</h2>
<p>The async outbox is not a universal answer.</p>
<p>Choose something else when:</p>
<ul>
<li><strong>You need fan-out to many consumers</strong> with different replay needs. Outbox can do it, but it becomes awkward; a proper bus or CDC can be cleaner.</li>
<li><strong>Event volume is very high</strong> (e.g., tens of thousands/sec). Polling a relational DB becomes expensive; you’ll want streaming infrastructure.</li>
<li><strong>You require strict ordering across partitions</strong> (e.g., per-customer ordering at scale). You can approximate ordering, but it gets complex.</li>
<li><strong>Your primary DB is already the bottleneck</strong>. Turning it into a queue adds load; offloading to a dedicated queue might be healthier.</li>
<li><strong>You can’t tolerate duplicates at all</strong> and downstream isn’t idempotent. You can get closer with provider-side idempotency keys, transactional inbox patterns, or exactly-once semantics in specific systems—but complexity rises quickly.</li>
</ul>
<hr />
<h2 id="heading-7-key-takeaways">7) Key takeaways</h2>
<ul>
<li>Treat third-party APIs as unreliable dependencies; design side effects as <strong>async and retryable</strong>.</li>
<li>If you only adopt one reliability pattern early: <strong>write an outbox row in the same DB transaction</strong> as your business change.</li>
<li>Optimize for <strong>operational simplicity first</strong>; a worker + Postgres is often enough for a long time.</li>
<li>Assume <strong>at-least-once delivery</strong> and make events idempotent with explicit keys.</li>
<li>Plan for lifecycle: retries, dead-lettering (even if it’s just a <code>dead</code> status), and cleanup/archival.</li>
</ul>
<hr />
<h2 id="heading-8-closing">8) Closing</h2>
<p>If you’re building solo, the async outbox is one of those “boring” patterns that quietly saves weeks of debugging later.</p>
<p>What’s your default for side effects in early-stage systems—inline calls, a queue, an outbox, or straight to a message bus? And what failure pushed you there?</p>
]]></content:encoded></item><item><title><![CDATA[Why I Use pg_trgm Fuzzy Search Instead of Full-Text Search]]></title><description><![CDATA[Search is where job boards quietly fail. Users don’t type perfect keywords, job titles aren’t standardized across sources, and “PMHNP” gets spelled five different ways. In my PMHNP Hiring job board (7,556+ jobs, 1,368+ companies, ~200 daily updates),...]]></description><link>https://blog.dvskr.dev/why-i-use-pgtrgm-fuzzy-search-instead-of-full-text-search</link><guid isPermaLink="true">https://blog.dvskr.dev/why-i-use-pgtrgm-fuzzy-search-instead-of-full-text-search</guid><category><![CDATA[architecture]]></category><category><![CDATA[buildinpublic]]></category><category><![CDATA[database]]></category><category><![CDATA[systemdesign]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Tue, 27 Jan 2026 16:00:14 GMT</pubDate><content:encoded><![CDATA[<p>Search is where job boards quietly fail. Users don’t type perfect keywords, job titles aren’t standardized across sources, and “PMHNP” gets spelled five different ways. In my PMHNP Hiring job board (7,556+ jobs, 1,368+ companies, ~200 daily updates), I had to pick a search strategy that was fast (&lt;50ms p95 for common queries), typo-tolerant, and cheap to operate as a solo creator. This post is the architecture decision I made: choosing PostgreSQL pg_trgm similarity search over classic full‑text search—and what I gave up to get there.</p>
<h2 id="heading-context-search-in-a-scraped-job-board-is-messy-by-default">Context: search in a scraped job board is messy by default</h2>
<p>PMHNP Hiring is a job board I built for Psychiatric Mental Health Nurse Practitioners. The “product” looks simple—filter by location, remote, company, posted date, and search by title/company. The system behind it is less clean because the data comes from 10+ sources with different formatting and inconsistent fields.</p>
<p>A few constraints shaped the search decision:</p>
<ul>
<li><strong>Scale &amp; churn</strong>: 7,556+ active jobs, 1,368+ companies, and ~200 daily updates (incremental ingestion, not full refresh).</li>
<li><strong>UX reality</strong>: users type partial queries (“psych”, “nurse prac”), acronyms (“PMHNP”), and typos (“psychiatric nurse practioner”).</li>
<li><strong>Performance target</strong>: keep the common feed queries (filters + search) at <strong>~50ms p95</strong> at the database layer.</li>
<li><strong>Operational simplicity</strong>: I’m a solo creator. I didn’t want a separate search cluster to babysit.</li>
<li><strong>Security model</strong>: Supabase + PostgreSQL with RLS. I wanted search to stay inside Postgres so it inherits the same access control semantics.</li>
</ul>
<p>This is where the first non-obvious problem appears: classic full-text search (FTS) is great for “documents”, but job titles and company names behave more like <strong>short strings</strong> where <strong>typo tolerance</strong> and <strong>partial matches</strong> dominate.</p>
<blockquote>
<p>In a job board, search is less about linguistic relevance and more about forgiving messy input.</p>
</blockquote>
<h2 id="heading-the-problem-space-what-i-actually-needed-from-search">The problem space: what I actually needed from search</h2>
<p>I wasn’t building Google. I needed a search that:</p>
<ol>
<li><strong>Works well on short fields</strong>: <code>title</code>, <code>company_name</code>, and sometimes <code>location_text</code>.</li>
<li><strong>Supports “contains-like” behavior</strong>: users often remember only part of a title.</li>
<li><strong>Handles typos</strong>: similarity, not exact token match.</li>
<li><strong>Composes with filters</strong>: search + (state, remote, posted_at, source) should stay fast.</li>
<li><strong>Plays nicely with ingestion</strong>: updates come daily; the index must handle churn.</li>
</ol>
<p>I also didn’t want relevance tuning to turn into a second product. If I had to spend days tweaking <code>ts_rank</code> weights, that’s a smell.</p>
<h2 id="heading-options-considered">Options considered</h2>
<p>Below are the realistic choices I evaluated for Postgres/Supabase.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<th>Option</th><th>What it is</th><th>Pros</th><th>Cons</th><th>Best when</th></tr>
</thead>
<tbody>
<tr>
<td><code>ILIKE '%query%'</code> + B-tree</td><td>Naive substring match</td><td>Simple, no extensions</td><td>Slow on large tables; can’t use B-tree with leading wildcard</td><td>Tiny datasets or admin tools</td></tr>
<tr>
<td>PostgreSQL Full-Text Search (<code>tsvector</code>)</td><td>Token-based search using dictionaries</td><td>Good for long text; ranking; language support</td><td>Weak on typos/partial strings; tuning needed; titles are short</td><td>Articles, descriptions, “document” search</td></tr>
<tr>
<td><code>pg_trgm</code> (trigram similarity)</td><td>String similarity via overlapping 3-char chunks</td><td>Typo-tolerant; fast with GIN/GiST; great for short fields</td><td>Not semantic; can match weirdly; needs threshold tuning</td><td>Names, titles, short text, “forgiving” search</td></tr>
<tr>
<td>External search (Meilisearch/Typesense/Elastic)</td><td>Dedicated search engine</td><td>Great relevance; typo handling; faceting</td><td>Extra infra; sync complexity; more moving parts; cost</td><td>High scale, complex ranking, multi-field relevance</td></tr>
</tbody>
</table>
</div><h3 id="heading-why-i-didnt-stick-with-ilike">Why I didn’t stick with <code>ILIKE</code></h3>
<p><code>ILIKE</code> feels tempting early on, especially when you’re “vibe coding” fast. But it collapses once you hit a few thousand rows and mix it with filters.</p>
<p><code>ILIKE '%pmhnp%'</code> forces a scan unless you add specialized indexing. On 7k jobs it might still feel okay—until you add multi-tenant rules, joins to companies, and a few concurrent users.</p>
<h3 id="heading-why-full-text-search-wasnt-the-right-default">Why full-text search wasn’t the right default</h3>
<p>FTS shines when you search bodies of text. But for job boards, most searches are:</p>
<ul>
<li><code>"pmhnp"</code></li>
<li><code>"psychiatric"</code></li>
<li><code>"remote"</code></li>
<li><code>"headway"</code> (company)</li>
</ul>
<p>FTS tokenization can hurt you here:</p>
<ul>
<li>Typos don’t match.</li>
<li>Partial tokens don’t match unless you add prefix operators and accept recall/precision trade-offs.</li>
<li>Acronyms and short tokens can behave weirdly depending on dictionaries.</li>
</ul>
<h3 id="heading-why-i-didnt-jump-to-an-external-search-engine">Why I didn’t jump to an external search engine</h3>
<p>I love dedicated search engines—but operating them is a commitment:</p>
<ul>
<li>You need a <strong>sync pipeline</strong> (DB → search index) that is correct under retries.</li>
<li>You now have <strong>two sources of truth</strong> for availability.</li>
<li>You have to decide how search respects RLS / access control.</li>
</ul>
<p>For PMHNP Hiring, the cost and complexity weren’t justified. Postgres could do “good enough” search with less risk.</p>
<h2 id="heading-the-decision-pgtrgm-a-search-vector-column-i-can-index">The decision: pg_trgm + a search vector column I can index</h2>
<p>I chose <strong>PostgreSQL’s <code>pg_trgm</code> extension</strong> and built search around a single normalized field (title + company + location) that I could index with <strong>GIN</strong>.</p>
<p>Primary reasons (ranked):</p>
<ol>
<li><strong>Typo tolerance on short fields</strong> without building relevance infrastructure.</li>
<li><strong>Composable performance</strong> with filters (state, remote, posted_at).</li>
<li><strong>Operational simplicity</strong>: no extra services; works inside Supabase.</li>
<li><strong>Predictable indexing story</strong>: GIN trigram indexes are battle-tested.</li>
</ol>
<p>What I gave up:</p>
<ul>
<li>No semantic relevance (synonyms, intent, “psych NP” == “PMHNP”).</li>
<li>Similarity search can return “surprising” matches unless you tune thresholds.</li>
<li>Ranking is simpler; you’re not doing sophisticated scoring.</li>
</ul>
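<p>To make the trade-off concrete, here is roughly what trigram similarity computes. This is a simplified TypeScript re-implementation for intuition only; pg_trgm's exact normalization and padding rules differ in the details:</p>

```ts
// Extract 3-character chunks from a padded, lowercased string, roughly
// the way pg_trgm does (two leading spaces, one trailing space per word).
function trigrams(s: string): Set<string> {
  const out = new Set<string>();
  for (const word of s.toLowerCase().split(/\s+/).filter(Boolean)) {
    const padded = `  ${word} `;
    for (let i = 0; i + 3 <= padded.length; i++) out.add(padded.slice(i, i + 3));
  }
  return out;
}

// Similarity = shared trigrams / total distinct trigrams (Jaccard index).
function similarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  let shared = 0;
  for (const t of ta) if (tb.has(t)) shared++;
  return shared / (ta.size + tb.size - shared);
}
```

<p>This is why a transposition typo like “pmnhp” still scores above zero against “pmhnp”: the strings share several trigrams even though no token matches exactly, which is precisely what FTS tokenization throws away.</p>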
<h3 id="heading-implementation-overview">Implementation overview</h3>
<p>I model jobs and companies relationally, but for search I avoid doing multiple similarity checks across joins at query time. Instead, I denormalize a <code>search_text</code> field on <code>jobs</code>.</p>
<h4 id="heading-1-enable-pgtrgm-and-add-an-indexed-field">1) Enable <code>pg_trgm</code> and add an indexed field</h4>
<pre><code class="lang-sql"><span class="hljs-comment">-- One-time</span>
<span class="hljs-keyword">create</span> extension <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> pg_trgm;

<span class="hljs-comment">-- Add a denormalized search field</span>
<span class="hljs-keyword">alter</span> <span class="hljs-keyword">table</span> jobs <span class="hljs-keyword">add</span> <span class="hljs-keyword">column</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> search_text <span class="hljs-built_in">text</span>;

<span class="hljs-comment">-- Keep it simple: lowercased, whitespace-normalized text</span>
<span class="hljs-comment">-- (I populate this during ingestion / upserts)</span>

<span class="hljs-comment">-- GIN index for fast trigram search</span>
<span class="hljs-keyword">create</span> <span class="hljs-keyword">index</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> jobs_search_text_trgm
<span class="hljs-keyword">on</span> jobs <span class="hljs-keyword">using</span> gin (search_text gin_trgm_ops);
</code></pre>
<p>Why a denormalized <code>search_text</code>?</p>
<ul>
<li>Similarity across multiple columns (<code>title</code>, <code>company_name</code>) can prevent index use or force multiple index scans.</li>
<li>Joining companies for every search adds overhead; my feed queries already join for display.</li>
<li>With 200 daily updates, recomputing <code>search_text</code> is cheap and keeps reads fast.</li>
</ul>
<h4 id="heading-2-populate-searchtext-during-upsert-pipeline-friendly">2) Populate <code>search_text</code> during upsert (pipeline-friendly)</h4>
<p>My ingestion pipeline is: Cron (Vercel) → scraper → dedupe → upsert into Postgres. During the upsert, I compute <code>search_text</code>.</p>
<pre><code class="lang-ts"><span class="hljs-comment">// pseudo-code inside the ingestion worker</span>
<span class="hljs-keyword">const</span> normalize = <span class="hljs-function">(<span class="hljs-params">s: <span class="hljs-built_in">string</span></span>) =&gt;</span>
  s
    .toLowerCase()
    .replace(<span class="hljs-regexp">/[^a-z0-9\s]/g</span>, <span class="hljs-string">" "</span>)
    .replace(<span class="hljs-regexp">/\s+/g</span>, <span class="hljs-string">" "</span>)
    .trim();

<span class="hljs-keyword">const</span> searchText = normalize([
  job.title,
  job.companyName,
  job.locationText,
  job.remote ? <span class="hljs-string">"remote"</span> : <span class="hljs-string">""</span>,
].filter(<span class="hljs-built_in">Boolean</span>).join(<span class="hljs-string">" "</span>));

<span class="hljs-keyword">await</span> db.from(<span class="hljs-string">"jobs"</span>).upsert({
  id: job.id,
  title: job.title,
  company_id: job.companyId,
  location_text: job.locationText,
  remote: job.remote,
  posted_at: job.postedAt,
  search_text: searchText,
  source: job.source,
});
</code></pre>
<p>This is a deliberate trade: write-time work for read-time speed.</p>
<h4 id="heading-3-query-pattern-similarity-filters-stable-pagination">3) Query pattern: similarity + filters + stable pagination</h4>
<p>I treat search as “filtering” rather than a separate endpoint. Most users search while also filtering by state/remote.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Parameterized query idea</span>
<span class="hljs-comment">-- :q is the normalized query string</span>
<span class="hljs-comment">-- the % operator compares against the session similarity threshold</span>
<span class="hljs-comment">-- (set via set_limit / pg_trgm.similarity_threshold, e.g., 0.2 to 0.35)</span>

<span class="hljs-keyword">select</span>
  j.id, j.title, j.posted_at, j.remote,
  c.name <span class="hljs-keyword">as</span> company_name
<span class="hljs-keyword">from</span> jobs j
<span class="hljs-keyword">join</span> companies c <span class="hljs-keyword">on</span> c.id = j.company_id
<span class="hljs-keyword">where</span>
  (:state <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">or</span> j.state = :state)
  <span class="hljs-keyword">and</span> (:remote <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span> <span class="hljs-keyword">or</span> j.remote = :remote)
  <span class="hljs-keyword">and</span> (:q = <span class="hljs-string">''</span> <span class="hljs-keyword">or</span> j.search_text % :q)
<span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span>
  <span class="hljs-keyword">case</span> <span class="hljs-keyword">when</span> :q = <span class="hljs-string">''</span> <span class="hljs-keyword">then</span> <span class="hljs-number">0</span> <span class="hljs-keyword">else</span> similarity(j.search_text, :q) <span class="hljs-keyword">end</span> <span class="hljs-keyword">desc</span>,
  j.posted_at <span class="hljs-keyword">desc</span>,
  j.id <span class="hljs-keyword">desc</span>
<span class="hljs-keyword">limit</span> :<span class="hljs-keyword">limit</span>;
</code></pre>
<p>Notes:</p>
<ul>
<li>The <code>%</code> operator is trigram “similarity match” (uses the trigram index).</li>
<li><code>similarity()</code> is used only for ordering when a query exists.</li>
<li>The secondary ordering by <code>posted_at, id</code> keeps results stable.</li>
</ul>
<h4 id="heading-4-tuning-similarity-threshold-without-guesswork">4) Tuning similarity threshold without guesswork</h4>
<p>The biggest footgun with <code>pg_trgm</code> is threshold tuning. Too low: irrelevant matches. Too high: you miss typos.</p>
<p>In Postgres you can set it per session:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Example: bump threshold for stricter matches</span>
<span class="hljs-keyword">select</span> set_limit(<span class="hljs-number">0.28</span>);

<span class="hljs-comment">-- Then run the search</span>
<span class="hljs-keyword">select</span> <span class="hljs-keyword">id</span>, title
<span class="hljs-keyword">from</span> jobs
<span class="hljs-keyword">where</span> search_text % <span class="hljs-string">'pmhnp remote'</span>;
</code></pre>
<p>In practice, I ended up using a slightly lower threshold for shorter queries and a higher one for longer queries (because long queries naturally have more trigrams).</p>
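<p>That length-based switch doesn’t need to live in SQL. A minimal application-side sketch, where the cutoffs and threshold values are illustrative rather than my exact production numbers:</p>

```typescript
// Pick a pg_trgm similarity threshold from query length.
// Cutoffs and values are illustrative; tune them against real queries.
function pickSimilarityThreshold(query: string): number {
  const len = query.trim().length;
  if (len <= 6) return 0.2;   // short queries produce few trigrams: be lenient
  if (len <= 15) return 0.28; // medium queries
  return 0.35;                // long queries have many trigrams: be strict
}
```

<p>The chosen value is then applied per session with <code>set_limit()</code> before the search query runs.</p>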
<h2 id="heading-results-amp-learnings-with-real-numbers">Results &amp; learnings (with real numbers)</h2>
<p>After shipping pg_trgm search + the right supporting indexes (composite indexes for filters, plus connection pooling with PgBouncer), the database layer for the most common “feed + search” requests stabilized around:</p>
<ul>
<li><strong>~50ms p95 query time</strong> for typical filtered listing queries (state/remote + optional search).</li>
<li>Search remained fast even with daily churn (~200 updates/day) because GIN index maintenance overhead at this scale is manageable.</li>
</ul>
<p>What worked well:</p>
<ul>
<li><strong>Typos stopped mattering</strong> for the most common cases (company names, “psychiatric”, “practitioner”).</li>
<li>I didn’t need to invent a ranking model. Similarity + recency was “good enough”.</li>
<li>Keeping search in Postgres meant fewer moving parts and fewer failure modes.</li>
</ul>
<p>Unexpected challenges:</p>
<ul>
<li>Certain short queries (like <code>"np"</code>) matched too broadly. The fix wasn’t more indexing—it was <strong>product constraints</strong> (minimum query length, or requiring at least one non-trivial token).</li>
<li>Similarity ordering can be noisy when many rows are “close enough”. Recency as a tie-breaker helped.</li>
</ul>
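<p>The “product constraint” fix for overly broad short queries is a few lines of application code. A sketch, where the stop-token list and length cutoff are illustrative choices:</p>

```typescript
// Reject queries that would match too broadly before they hit pg_trgm.
// STOP_TOKENS and the cutoff of 3 are illustrative product choices.
const STOP_TOKENS = new Set(["np", "rn", "job", "jobs"]);

function isSearchableQuery(raw: string): boolean {
  const q = raw.trim().toLowerCase();
  if (q.length < 3) return false; // too few trigrams to rank meaningfully
  // Require at least one non-trivial token.
  return q.split(/\s+/).some((t) => t.length >= 3 && !STOP_TOKENS.has(t));
}
```

<p>Queries that fail the guard fall back to the plain filtered feed instead of a noisy similarity match.</p>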
<p>What I’d do differently:</p>
<ul>
<li>Add a lightweight synonym layer (application-side) for domain terms (e.g., map “psych np” → “pmhnp psychiatric”). This is cheaper than building semantic search and improves relevance a lot.</li>
</ul>
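<p>The synonym layer I have in mind is deliberately dumb: a hand-maintained map applied before the query reaches Postgres. A sketch (the mapping entries are examples, not a real list):</p>

```typescript
// Expand domain synonyms application-side before querying Postgres.
// Entries are examples; real terms containing regex metacharacters need escaping.
const SYNONYMS: Record<string, string> = {
  "psych np": "pmhnp psychiatric",
  "don": "director of nursing",
};

function expandQuery(raw: string): string {
  let q = raw.trim().toLowerCase();
  for (const [term, expansion] of Object.entries(SYNONYMS)) {
    // Word boundaries keep "don" from matching inside "london".
    q = q.replace(new RegExp(`\\b${term}\\b`, "g"), expansion);
  }
  return q;
}
```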
<h2 id="heading-when-this-doesnt-work">When this doesn’t work</h2>
<p>pg_trgm isn’t a universal answer. I’d pick something else if:</p>
<ul>
<li><strong>You need semantic relevance</strong>: synonyms, intent understanding, “director of nursing” matching “DON”, etc. That’s where a dedicated search engine or embeddings start to win.</li>
<li><strong>You’re searching long bodies of text</strong> (job descriptions). FTS (or hybrid FTS + trgm) is often better.</li>
<li><strong>Your dataset is massive and high-churn</strong> (hundreds of millions of rows). GIN index size and maintenance can become expensive.</li>
<li><strong>You need advanced faceting + ranking</strong> beyond what SQL can comfortably express.</li>
</ul>
<p>A pragmatic hybrid that I’d consider later:</p>
<ul>
<li>FTS for descriptions (token relevance)</li>
<li>pg_trgm for titles/company (typo tolerance)</li>
<li>Merge/rank results in SQL or application layer</li>
</ul>
<p>But I wouldn’t start there unless search quality is the core differentiator.</p>
<h2 id="heading-key-takeaways">Key takeaways</h2>
<ol>
<li><strong>Match the search tool to the shape of your data</strong>: short strings behave differently than documents.</li>
<li><strong>Optimize for the operational budget you actually have</strong>: one Postgres instance you understand beats two systems you barely monitor.</li>
<li><strong>Denormalize intentionally</strong> when it removes joins from your hot path; pay the cost at ingestion time.</li>
<li><strong>Thresholds are product decisions</strong>: minimum query length and similarity limits are UX levers, not just database knobs.</li>
<li><strong>Use recency as a stabilizer</strong>: in job boards, “newer” is often a better tie-breaker than perfect relevance.</li>
</ol>
<h2 id="heading-closing">Closing</h2>
<p>If you’ve built search for a marketplace or job board: did you stick with Postgres (FTS/trgm) or graduate to a dedicated search engine? I’m especially curious where your tipping point was—data size, relevance requirements, or team/ops maturity.</p>
]]></content:encoded></item><item><title><![CDATA[Why I Built a Durable Offline Queue for AI Calls in React Native]]></title><description><![CDATA[AI features are easy to demo on perfect Wi‑Fi and painfully fragile in the real world. In my fitness app project (React Native + Expo + SQLite), users can log a set in ~5 seconds and optionally get AI help (workout suggestions, explanations, quick ad...]]></description><link>https://blog.dvskr.dev/why-i-built-a-durable-offline-queue-for-ai-calls-in-react-native</link><guid isPermaLink="true">https://blog.dvskr.dev/why-i-built-a-durable-offline-queue-for-ai-calls-in-react-native</guid><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Sat, 24 Jan 2026 16:00:39 GMT</pubDate><content:encoded><![CDATA[<p>AI features are easy to demo on perfect Wi‑Fi and painfully fragile in the real world. In my fitness app project (React Native + Expo + SQLite), users can log a set in ~5 seconds and optionally get AI help (workout suggestions, explanations, quick adjustments). The architectural decision that mattered most wasn’t the prompt design—it was whether AI calls should be “best-effort” or “durable”. I chose a durable, persisted offline queue for OpenAI requests so the UX stays responsive, battery-friendly, and predictable even when the device is offline or rate-limited.</p>
<h2 id="heading-context-the-problem-space-and-why-its-subtle">Context: the problem space (and why it’s subtle)</h2>
<p>I’m building a mobile workout tracker where the core loop is fast: open app → log set → move on. The app is offline-first: SQLite is the primary store, and sync is “cloud-backup”, not “cloud-source-of-truth”. Scale is small today (10 waitlist signups, ~400 exercises), but the constraints are real:</p>
<ul>
<li><strong>Sub-100ms UI interactions</strong> for logging (anything slower feels like friction mid-set)</li>
<li><strong>Offline and spotty network</strong> are normal (basements, gyms with bad reception)</li>
<li><strong>Battery and data usage matter</strong> (background retry loops can be expensive)</li>
<li><strong>AI calls are non-critical</strong> (logging must work without them)</li>
<li><strong>OpenAI limits and latency are unpredictable</strong> (429s, timeouts, slow responses)</li>
</ul>
<p>The naive approach is: “Call the API when the user taps, show a spinner, retry on failure.” That’s fine for a chat app. For a workout logger, it’s a UX regression: it blocks the user on something that isn’t essential.</p>
<p>So the decision: <strong>Should AI requests be synchronous and UI-coupled, or should they be durable tasks that can be executed later?</strong></p>
<blockquote>
<p>Key insight: In offline-first apps, anything that touches the network should be treated like a background job—especially if it’s optional.</p>
</blockquote>
<h2 id="heading-options-considered">Options considered</h2>
<p>I considered four patterns for integrating AI calls without degrading the core logging experience.</p>
<h3 id="heading-comparison-table">Comparison table</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Option</td><td>What it is</td><td>Pros</td><td>Cons</td><td>Best when</td></tr>
</thead>
<tbody>
<tr>
<td>A) Synchronous call in UI flow</td><td>Call OpenAI on button tap, await result</td><td>Simple mental model; fewer moving parts</td><td>UI stalls; brittle offline; retries drain battery; hard to rate-limit</td><td>AI is core feature and latency is acceptable</td></tr>
<tr>
<td>B) Fire-and-forget in memory</td><td>Trigger request, don’t await; store result in state when it returns</td><td>UI stays fast; minimal code</td><td>If app is killed, request is lost; no backoff; duplicates likely</td><td>AI is “nice to have” and losing responses is OK</td></tr>
<tr>
<td>C) Durable local queue (SQLite)</td><td>Persist tasks; worker processes when online; backoff + rate limits</td><td>Survives restarts; controllable retries; good offline UX; measurable</td><td>More code; need idempotency + dedupe; needs observability</td><td>Offline-first apps with optional network features</td></tr>
<tr>
<td>D) Server-side job queue</td><td>Send intent to backend; backend calls OpenAI and pushes result</td><td>Centralized control; better secrets management; easier analytics</td><td>Requires backend; still needs device-side offline handling; more cost/ops</td><td>You already run a backend and need shared results</td></tr>
</tbody>
</table>
</div><h3 id="heading-why-i-didnt-choose-a-or-b">Why I didn’t choose A or B</h3>
<ul>
<li><strong>A (synchronous)</strong> made the UI hostage to network conditions. Even if I didn’t block the whole screen, it introduced “pending” states everywhere and created edge cases (user logs next set while previous AI call is still inflight).</li>
<li><strong>B (in-memory)</strong> sounded attractive until I simulated real behavior: mobile OS kills the app, users background it, network flips, and you end up with lost work or duplicates.</li>
</ul>
<h3 id="heading-why-i-didnt-choose-d-server-side">Why I didn’t choose D (server-side)</h3>
<p>Longer term, a backend queue is compelling. But right now the app is offline-first and early-stage. Adding a backend just to make AI reliable felt like premature complexity. Also, I’d still need a device-side outbox because requests originate offline.</p>
<p>That led to <strong>C: a durable local queue</strong>.</p>
<h2 id="heading-the-decision-a-persisted-offline-queue-sqlite-outbox">The decision: a persisted offline queue (SQLite outbox)</h2>
<p>I implemented an <strong>Outbox pattern</strong> for AI requests:</p>
<ul>
<li>Every AI intent becomes a row in <code>ai_jobs</code> in SQLite.</li>
<li>UI writes a job and immediately returns (optimistic UX).</li>
<li>A background worker processes jobs when:<ul>
<li>device is online</li>
<li>rate limit allows</li>
<li>app is in foreground (initially; background execution is a later enhancement)</li>
</ul>
</li>
<li>Results are stored back into SQLite and projected into UI state.</li>
</ul>
<h3 id="heading-architecture-diagram-mermaid">Architecture diagram (Mermaid)</h3>
<pre><code class="lang-mermaid">flowchart LR
  UI[React Native UI]
  Z[Zustand Store]
  DB[(SQLite)]
  Q[ai_jobs Outbox]
  W[Queue Worker]
  NET[Network Check]
  RL[Rate Limiter]
  OAI[OpenAI API]
  RES[ai_results]

  UI --&gt; Z
  Z --&gt; DB
  UI --&gt;|enqueue intent| Q
  Q --&gt; DB
  W --&gt; NET
  W --&gt; RL
  W --&gt;|claim job| Q
  W --&gt;|call| OAI
  OAI --&gt;|response| W
  W --&gt; RES
  RES --&gt; DB
  DB --&gt; UI
</code></pre>
<h3 id="heading-data-model-jobs-need-to-be-idempotent">Data model: jobs need to be idempotent</h3>
<p>The main thing I learned from data engineering is: distributed systems fail in boring ways. Mobile is a distributed system with a very unreliable worker (the phone).</p>
<p>Each job needs:</p>
<ul>
<li>a stable <strong>idempotency key</strong> (so retries don’t duplicate effects)</li>
<li>status transitions that are safe across crashes</li>
<li>metadata for backoff and debugging</li>
</ul>
<p>A minimal schema:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- SQLite</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> ai_jobs (
  <span class="hljs-keyword">id</span> <span class="hljs-built_in">TEXT</span> PRIMARY <span class="hljs-keyword">KEY</span>,
  <span class="hljs-keyword">type</span> <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  payload_json <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  <span class="hljs-keyword">status</span> <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>, <span class="hljs-comment">-- queued | running | done | failed</span>
  attempts <span class="hljs-built_in">INTEGER</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-number">0</span>,
  run_after_ms <span class="hljs-built_in">INTEGER</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-number">0</span>,
  locked_until_ms <span class="hljs-built_in">INTEGER</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-number">0</span>,
  last_error <span class="hljs-built_in">TEXT</span>,
  created_at_ms <span class="hljs-built_in">INTEGER</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  updated_at_ms <span class="hljs-built_in">INTEGER</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>
);

<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> idx_ai_jobs_status_run_after
<span class="hljs-keyword">ON</span> ai_jobs(<span class="hljs-keyword">status</span>, run_after_ms);
</code></pre>
<h3 id="heading-enqueue-from-ui-fast-path">Enqueue from UI (fast path)</h3>
<p>The UI path must be cheap: one insert, no network.</p>
<pre><code class="lang-ts"><span class="hljs-keyword">import</span> { nanoid } <span class="hljs-keyword">from</span> <span class="hljs-string">"nanoid/non-secure"</span>;

<span class="hljs-keyword">type</span> AiJobType = <span class="hljs-string">"exercise_suggestion"</span> | <span class="hljs-string">"form_explanation"</span>;

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">enqueueAiJob</span>(<span class="hljs-params">db: <span class="hljs-built_in">any</span>, <span class="hljs-keyword">type</span>: AiJobType, payload: unknown</span>) </span>{
  <span class="hljs-keyword">const</span> now = <span class="hljs-built_in">Date</span>.now();
  <span class="hljs-keyword">const</span> id = nanoid();

  <span class="hljs-keyword">await</span> db.runAsync(
    <span class="hljs-string">`INSERT INTO ai_jobs(id, type, payload_json, status, attempts, run_after_ms, locked_until_ms, created_at_ms, updated_at_ms)
     VALUES(?, ?, ?, 'queued', 0, 0, 0, ?, ?)`</span>
    , [id, <span class="hljs-keyword">type</span>, <span class="hljs-built_in">JSON</span>.stringify(payload), now, now]
  );

  <span class="hljs-keyword">return</span> id;
}
</code></pre>
<p>Design choice: I’m not doing anything clever here (no batching, no compression). The win is that it’s <strong>deterministic and restart-safe</strong>.</p>
<h3 id="heading-claim-process-avoid-duplicate-workers">Claim + process: avoid duplicate workers</h3>
<p>Even on-device, you can end up with multiple workers (hot reload, navigation bugs, accidental multiple intervals). I added a “lease” field <code>locked_until_ms</code> to prevent double-processing.</p>
<pre><code class="lang-ts"><span class="hljs-keyword">const</span> LEASE_MS = <span class="hljs-number">15</span>_000;

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">claimNextJob</span>(<span class="hljs-params">db: <span class="hljs-built_in">any</span></span>) </span>{
  <span class="hljs-keyword">const</span> now = <span class="hljs-built_in">Date</span>.now();

  <span class="hljs-comment">// Find a runnable job</span>
  <span class="hljs-keyword">const</span> job = <span class="hljs-keyword">await</span> db.getFirstAsync(
    <span class="hljs-string">`SELECT * FROM ai_jobs
     WHERE status = 'queued'
       AND run_after_ms &lt;= ?
       AND locked_until_ms &lt;= ?
     ORDER BY created_at_ms ASC
     LIMIT 1`</span>,
    [now, now]
  );

  <span class="hljs-keyword">if</span> (!job) <span class="hljs-keyword">return</span> <span class="hljs-literal">null</span>;

  <span class="hljs-comment">// Lease it with a guarded UPDATE so an interleaved worker can't double-claim</span>
  <span class="hljs-keyword">const</span> lockedUntil = now + LEASE_MS;
  <span class="hljs-keyword">const</span> res = <span class="hljs-keyword">await</span> db.runAsync(
    <span class="hljs-string">`UPDATE ai_jobs
     SET status = 'running', locked_until_ms = ?, updated_at_ms = ?
     WHERE id = ? AND status = 'queued' AND locked_until_ms &lt;= ?`</span>,
    [lockedUntil, now, job.id, now]
  );

  <span class="hljs-comment">// expo-sqlite's runAsync reports affected rows via `changes`; 0 means another worker won</span>
  <span class="hljs-keyword">if</span> (!res || res.changes === <span class="hljs-number">0</span>) <span class="hljs-keyword">return</span> <span class="hljs-literal">null</span>;

  <span class="hljs-keyword">return</span> { ...job, locked_until_ms: lockedUntil, status: <span class="hljs-string">"running"</span> };
}
</code></pre>
<p>This is not perfect distributed locking, but for a single SQLite DB on one phone it’s pragmatic.</p>
<h3 id="heading-backoff-rate-limiting-protect-ux-and-battery">Backoff + rate limiting: protect UX and battery</h3>
<p>Two failure modes matter:</p>
<ol>
<li><strong>Offline / flaky network</strong> → repeated failures</li>
<li><strong>429 rate limits</strong> → hammering the API makes it worse</li>
</ol>
<p>I used an exponential backoff with jitter, and a simple per-user token bucket stored in memory + persisted timestamp in SQLite (so app restarts don’t immediately retry everything).</p>
<pre><code class="lang-ts"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">nextRunAfterMs</span>(<span class="hljs-params">attempts: <span class="hljs-built_in">number</span></span>) </span>{
  <span class="hljs-keyword">const</span> base = <span class="hljs-built_in">Math</span>.min(<span class="hljs-number">60</span>_000, <span class="hljs-number">1000</span> * <span class="hljs-built_in">Math</span>.pow(<span class="hljs-number">2</span>, attempts)); <span class="hljs-comment">// cap at 60s</span>
  <span class="hljs-keyword">const</span> jitter = <span class="hljs-built_in">Math</span>.floor(<span class="hljs-built_in">Math</span>.random() * <span class="hljs-number">400</span>); <span class="hljs-comment">// 0-400ms</span>
  <span class="hljs-keyword">return</span> <span class="hljs-built_in">Date</span>.now() + base + jitter;
}

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">markJobRetry</span>(<span class="hljs-params">db: <span class="hljs-built_in">any</span>, id: <span class="hljs-built_in">string</span>, attempts: <span class="hljs-built_in">number</span>, err: unknown</span>) </span>{
  <span class="hljs-keyword">const</span> now = <span class="hljs-built_in">Date</span>.now();
  <span class="hljs-keyword">const</span> runAfter = nextRunAfterMs(attempts);

  <span class="hljs-keyword">await</span> db.runAsync(
    <span class="hljs-string">`UPDATE ai_jobs
     SET status='queued', attempts=?, run_after_ms=?, locked_until_ms=0, last_error=?, updated_at_ms=?
     WHERE id=?`</span>,
    [attempts, runAfter, <span class="hljs-built_in">String</span>(err), now, id]
  );
}
</code></pre>
<p>Decision detail: I intentionally cap backoff at 60 seconds for now because AI is non-critical, and long retry windows reduce battery churn. If a job can’t succeed within a few minutes, it’s usually because the user is offline for a while—better to wait for connectivity.</p>
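<p>The token-bucket half of the rate limiting can be equally small. An in-memory sketch (capacity and refill rate are illustrative; persisting the refill timestamp, as mentioned above, is what survives restarts):</p>

```typescript
// Minimal token bucket for "N AI calls per minute".
// In the app, lastRefillMs is also persisted to SQLite so a restart
// doesn't hand out a fresh burst of tokens.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;

  constructor(
    private capacity: number,    // e.g. 5 calls
    private refillPerMs: number, // e.g. 5 / 60_000 for 5 per minute
    nowMs: number
  ) {
    this.tokens = capacity;
    this.lastRefillMs = nowMs;
  }

  tryTake(nowMs: number): boolean {
    const elapsed = Math.max(0, nowMs - this.lastRefillMs);
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerMs);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

<p>The worker calls <code>tryTake()</code> before claiming a job; a <code>false</code> just means the job stays queued until the next tick.</p>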
<h3 id="heading-result-persistence-decouple-ui-from-worker">Result persistence: decouple UI from worker</h3>
<p>When the worker completes, it stores a result row keyed by job id (or domain entity id), and marks the job done.</p>
<p>That means the UI can render “AI suggestion pending” without caring whether the app was restarted.</p>
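<p>A sketch of that completion step, assuming an <code>ai_results</code> table and an expo-sqlite-style transaction helper. Wrapping both writes in one transaction means a crash can’t leave a “done” job with no stored result:</p>

```typescript
// Persist the AI result and mark the job done atomically.
// `db` is the same expo-sqlite-style handle used elsewhere; the
// ai_results table shape is an assumption.
async function completeJob(db: any, jobId: string, resultJson: string) {
  const now = Date.now();
  await db.withTransactionAsync(async () => {
    await db.runAsync(
      `INSERT OR REPLACE INTO ai_results(job_id, result_json, created_at_ms)
       VALUES(?, ?, ?)`,
      [jobId, resultJson, now]
    );
    await db.runAsync(
      `UPDATE ai_jobs SET status = 'done', updated_at_ms = ? WHERE id = ?`,
      [now, jobId]
    );
  });
}
```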
<h2 id="heading-results-amp-learnings-so-far">Results &amp; learnings (so far)</h2>
<p>This is early (Builder Day 30), but a few things are measurable even at small scale.</p>
<h3 id="heading-performance-impact">Performance impact</h3>
<ul>
<li><strong>Logging path latency</strong>: enqueue insert is typically <strong>1–4ms</strong> on my test device (Pixel 7), and the UI remains under my <strong>~100ms interaction</strong> budget.</li>
<li><strong>Cold start impact</strong>: negligible because the worker doesn’t run until after initial render (I schedule it after navigation is ready). The main startup cost remains loading the exercise library (handled via lazy loading).</li>
</ul>
<h3 id="heading-reliability-improvements">Reliability improvements</h3>
<ul>
<li>No more “spinners that never resolve” when the user goes underground.</li>
<li>If the app is killed mid-request, the job lease expires and it retries later.</li>
<li>Rate limiting is centralized, so I can enforce “N AI calls per minute” without sprinkling guards across UI components.</li>
</ul>
<h3 id="heading-unexpected-challenges">Unexpected challenges</h3>
<ul>
<li><strong>Duplicate intents</strong>: users tap twice, or navigate back/forward quickly. Without dedupe, you pay twice. I’m adding a domain-level idempotency key (e.g., <code>suggestion:{workoutSessionId}:{exerciseId}:{setIndex}</code>) to collapse duplicates.</li>
<li><strong>Observability</strong>: debugging on-device queues is annoying. I added a hidden “Queue Inspector” screen that lists jobs, attempts, and last_error. Not pretty, but it cuts debugging time.</li>
<li><strong>Context window management</strong>: queued jobs can run minutes later. If the payload references “current set”, it may no longer be current. I learned to enqueue <strong>immutable references</strong> (IDs + snapshot fields), not “current state”.</li>
</ul>
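<p>The dedupe sketch below assumes <code>ai_jobs</code> gains a <code>dedupe_key</code> column (not in the schema above); duplicate taps collapse to the already-queued job:</p>

```typescript
// Domain-level idempotency key, matching the format described above.
function suggestionKey(workoutSessionId: string, exerciseId: string, setIndex: number): string {
  return `suggestion:${workoutSessionId}:${exerciseId}:${setIndex}`;
}

// Enqueue only if no live job exists for the same key.
// Assumes ai_jobs has a `dedupe_key` column (not in the schema above).
async function enqueueOnce(db: any, key: string, enqueue: () => Promise<string>): Promise<string> {
  const existing = await db.getFirstAsync(
    `SELECT id FROM ai_jobs WHERE dedupe_key = ? AND status IN ('queued', 'running')`,
    [key]
  );
  if (existing) return existing.id; // a double-tap collapses to the same job
  return enqueue();
}
```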
<h2 id="heading-when-this-doesnt-work">When this doesn’t work</h2>
<p>A durable local queue is not a universal solution.</p>
<ul>
<li><strong>If AI output must be immediate</strong>, like real-time coaching, you’ll still need synchronous calls (or at least streaming). Queueing helps reliability but not latency.</li>
<li><strong>If you require cross-device consistency</strong>, device-local jobs can diverge. You’ll want a server-side queue or a shared sync layer where jobs are replicated and deduped across devices.</li>
<li><strong>If you need strong guarantees</strong>, SQLite leasing is “good enough” for single-device but not equivalent to a transactional distributed queue. If you later add background tasks, multiple processes, or extensions, you’ll need more robust locking.</li>
<li><strong>If your payloads are huge</strong>, storing them in SQLite can bloat DB size and slow queries. In that case, store payloads as separate blobs/files and reference them.</li>
</ul>
<h2 id="heading-key-takeaways">Key takeaways</h2>
<ol>
<li><strong>Treat optional network features as background jobs</strong> in offline-first apps; don’t couple them to UI interactions.</li>
<li><strong>Persist the queue</strong> (SQLite outbox) so app restarts and OS kills don’t lose work.</li>
<li><strong>Design for idempotency early</strong>—duplicates are normal, not an edge case.</li>
<li><strong>Centralize backoff and rate limiting</strong> to protect battery, UX, and your API bill.</li>
<li><strong>Enqueue immutable snapshots</strong>, not “current state”, because queued work executes later.</li>
</ol>
<h2 id="heading-closing">Closing</h2>
<p>I’m happy with the durability and UX improvements, but I’m still unsure about the “right” next step: background execution (TaskManager) vs moving AI orchestration server-side once multi-device sync becomes real.</p>
<p>If you’ve built offline queues on mobile: do you prefer a device outbox like this, or do you push intents to a backend queue as early as possible—and why?</p>
]]></content:encoded></item><item><title><![CDATA[Why I Prefer Keyset Pagination for High-Volume Feeds]]></title><description><![CDATA[Pagination looks like a UI problem until it becomes a production bottleneck. Once your dataset grows and users start filtering, sorting, and jumping between pages, the wrong pagination strategy quietly burns CPU, increases query latency, and creates ...]]></description><link>https://blog.dvskr.dev/why-i-prefer-keyset-pagination-for-high-volume-feeds</link><guid isPermaLink="true">https://blog.dvskr.dev/why-i-prefer-keyset-pagination-for-high-volume-feeds</guid><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Fri, 23 Jan 2026 16:00:38 GMT</pubDate><content:encoded><![CDATA[<p>Pagination looks like a UI problem until it becomes a production bottleneck. Once your dataset grows and users start filtering, sorting, and jumping between pages, the wrong pagination strategy quietly burns CPU, increases query latency, and creates confusing duplicates or missing rows. As a solo creator, you don’t get to “throw a team at it”—you need a choice that’s fast, predictable, and hard to break at 2am. This is why I default to keyset (cursor) pagination over OFFSET/LIMIT for most high-volume feeds, and when I still won’t use it.</p>
<h2 id="heading-the-problem-space-constraints-that-matter">The problem space (constraints that matter)</h2>
<p>Pagination becomes architectural when it touches three things at once:</p>
<p>1) <strong>Performance at scale</strong>: As tables grow from thousands to millions of rows, naive pagination can turn into a linear scan. The user still sees “page 40”, but your database is doing work proportional to 40 * page_size.</p>
<p>2) <strong>Correctness under writes</strong>: Feeds are rarely static. New rows get inserted, old rows get updated, and background jobs backfill data. Offset-based pagination can return duplicates or skip records as the underlying ordering shifts.</p>
<p>3) <strong>Operational simplicity</strong>: Solo development is a constraint. I prefer designs that are:</p>
<ul>
<li>hard to misuse across endpoints</li>
<li>easy to reason about when debugging</li>
<li>index-friendly</li>
<li>cheap (less CPU, fewer slow queries)</li>
</ul>
<p>Non-functional requirements I usually assume for a “feed-like” endpoint:</p>
<ul>
<li><strong>p95 latency target</strong>: &lt; 150ms for common queries (excluding network)</li>
<li><strong>Predictable performance</strong>: page 1 and page 100 shouldn’t be 10x apart</li>
<li><strong>Stable ordering</strong>: no duplicates, minimal “jumping”</li>
<li><strong>Backwards/forwards navigation</strong>: at least “next page”; ideally “previous” too</li>
</ul>
<p>Why “existing solutions” don’t fit by default:</p>
<ul>
<li>Many ORMs make OFFSET/LIMIT feel like the obvious default.</li>
<li>Many frontend designs assume numeric pages (1…N), which biases you toward OFFSET.</li>
<li>Some developers ship OFFSET early “just for MVP” and then discover it’s embedded in caching, links, emails, and analytics.</li>
</ul>
<blockquote>
<p>Key insight: pagination is part of your data contract. Changing it later is possible, but it’s never free.</p>
</blockquote>
<h2 id="heading-options-considered">Options considered</h2>
<p>Below are the common strategies I’ve used or audited in production-like systems.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Option</td><td>What it is</td><td>Pros</td><td>Cons</td><td>Works best when</td></tr>
</thead>
<tbody>
<tr>
<td>OFFSET/LIMIT</td><td><code>ORDER BY ... OFFSET x LIMIT y</code></td><td>Simple; supports random access (page 37)</td><td>Slower as offset grows; duplicates/skips under writes; deep pages expensive</td><td>Small tables; mostly static datasets; admin views</td></tr>
<tr>
<td>Keyset (Cursor)</td><td><code>WHERE (sort_key, id) &lt; (cursor_sort_key, cursor_id) LIMIT y</code></td><td>O(1) page-to-page; stable under inserts; index-friendly</td><td>Harder random access; needs careful cursor encoding; tricky with complex sorting</td><td>Feeds, timelines, infinite scroll, large datasets</td></tr>
<tr>
<td>Seek by ID only</td><td><code>WHERE id &lt; last_id</code></td><td>Very fast; simplest cursor</td><td>Only works if ID correlates with desired order; breaks if you sort by time/score</td><td>Append-only logs; monotonic IDs; simple “latest first”</td></tr>
<tr>
<td>Snapshot + OFFSET</td><td>Pin a consistent snapshot (repeatable read) and use OFFSET</td><td>Correctness improves; keeps numeric pages</td><td>Still pays offset cost; snapshot management complexity; not great for long browsing sessions</td><td>Reporting; exports; short sessions</td></tr>
<tr>
<td>Precomputed page map</td><td>Materialize page boundaries (e.g., store cursor per page)</td><td>Enables random access + keyset speed</td><td>Extra storage; invalidation complexity; rebuild cost</td><td>Highly trafficked, mostly read-only catalogs</td></tr>
</tbody>
</table>
</div><h3 id="heading-what-actually-bites-you-in-production">What actually bites you in production</h3>
<ul>
<li><strong>OFFSET cost is not theoretical.</strong> In Postgres, OFFSET often means scanning and discarding rows. Even with indexes, the engine still has to walk past N rows.</li>
<li><strong>Correctness is a user-facing feature.</strong> Duplicate items in a feed erode trust. Missing items can be worse.</li>
<li><strong>Random page jumps are overrated.</strong> Most consumer feeds are “next/previous” patterns. Numeric page links are common in catalogs, not timelines.</li>
</ul>
<h2 id="heading-the-decision-what-i-choose-and-why">The decision (what I choose and why)</h2>
<p>I default to <strong>keyset pagination with a composite cursor</strong>:</p>
<ul>
<li>Primary sort key: <code>created_at</code> (or whatever defines the feed)</li>
<li>Tie-breaker: <code>id</code> (unique, stable)</li>
</ul>
<h3 id="heading-why-ranked-reasons">Why (ranked reasons)</h3>
<ol>
<li><strong>Predictable query cost</strong>: page 1 and page 100 are similar complexity.</li>
<li><strong>Correctness under concurrent writes</strong>: fewer duplicates/skips because you’re anchoring to a position, not a row count.</li>
<li><strong>Index leverage</strong>: a composite index can satisfy the query efficiently.</li>
<li><strong>Simpler operations</strong>: fewer slow queries, fewer surprise p95 spikes.</li>
</ol>
<h3 id="heading-what-i-give-up">What I give up</h3>
<ul>
<li>True random access like “go to page 42” is non-trivial.</li>
<li>You must design a cursor format (encoding/decoding, validation, expiry decisions).</li>
<li>Some sorting modes don’t map well (e.g., sorting by a computed score that changes frequently).</li>
</ul>
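<p>To make the cursor-format trade concrete, here’s a minimal encode/decode sketch (the base64url-JSON format is an illustrative choice; any stable opaque encoding works):</p>

```typescript
// Opaque keyset cursor: base64url-encoded (created_at, id) pair.
type Cursor = { createdAt: string; id: number };

function encodeCursor(c: Cursor): string {
  return Buffer.from(JSON.stringify(c)).toString("base64url");
}

// Returns null for malformed input so bad cursors become a 400, not a 500.
function decodeCursor(raw: string): Cursor | null {
  try {
    const parsed = JSON.parse(Buffer.from(raw, "base64url").toString("utf8"));
    if (typeof parsed.createdAt !== "string" || typeof parsed.id !== "number") return null;
    if (Number.isNaN(Date.parse(parsed.createdAt))) return null;
    return { createdAt: parsed.createdAt, id: parsed.id };
  } catch {
    return null;
  }
}
```

<p>The decoded pair feeds straight into the row-comparison predicate shown in the implementation section.</p>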
<h2 id="heading-implementation-overview-postgres-examples">Implementation overview (Postgres examples)</h2>
<h3 id="heading-1-schema-index-that-makes-keyset-work">1) Schema + index that makes keyset work</h3>
<p>Assume a table:</p>
<ul>
<li><code>id</code> is unique</li>
<li><code>created_at</code> is the primary ordering</li>
</ul>
<pre><code class="lang-sql"><span class="hljs-comment">-- Postgres</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> items (
  <span class="hljs-keyword">id</span> BIGSERIAL PRIMARY <span class="hljs-keyword">KEY</span>,
  created_at TIMESTAMPTZ <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-keyword">now</span>(),
  title <span class="hljs-built_in">TEXT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
  payload JSONB
);

<span class="hljs-comment">-- Composite index to support ORDER BY created_at DESC, id DESC</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> items_created_at_id_desc
<span class="hljs-keyword">ON</span> items (created_at <span class="hljs-keyword">DESC</span>, <span class="hljs-keyword">id</span> <span class="hljs-keyword">DESC</span>);
</code></pre>
<p>The tie-breaker (<code>id</code>) matters because many rows can share the same <code>created_at</code> at millisecond resolution or due to batch inserts.</p>
<h3 id="heading-2-the-keyset-query-next-page">2) The keyset query (next page)</h3>
<p>The cursor is the last row of the previous page: <code>(created_at, id)</code>.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- First page (no cursor)</span>
<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">id</span>, created_at, title
<span class="hljs-keyword">FROM</span> items
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> created_at <span class="hljs-keyword">DESC</span>, <span class="hljs-keyword">id</span> <span class="hljs-keyword">DESC</span>
<span class="hljs-keyword">LIMIT</span> $<span class="hljs-number">1</span>;

<span class="hljs-comment">-- Next page (cursor provided)</span>
<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">id</span>, created_at, title
<span class="hljs-keyword">FROM</span> items
<span class="hljs-keyword">WHERE</span> (created_at, <span class="hljs-keyword">id</span>) &lt; ($<span class="hljs-number">2</span>::timestamptz, $<span class="hljs-number">3</span>::<span class="hljs-built_in">bigint</span>)
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> created_at <span class="hljs-keyword">DESC</span>, <span class="hljs-keyword">id</span> <span class="hljs-keyword">DESC</span>
<span class="hljs-keyword">LIMIT</span> $<span class="hljs-number">1</span>;
</code></pre>
<p>Why tuple comparison? It’s concise, maps cleanly to the “sort key + tie-breaker” concept, and is equivalent to <code>created_at &lt; $2 OR (created_at = $2 AND id &lt; $3)</code>, a predicate Postgres can satisfy with the composite index above.</p>
<h3 id="heading-3-cursor-encoding-dont-leak-raw-values-blindly">3) Cursor encoding (don’t leak raw values blindly)</h3>
<p>I like a compact, signed cursor so:</p>
<ul>
<li>clients can’t easily tamper with it</li>
<li>you can evolve cursor formats</li>
</ul>
<p>Below is a minimal Node-style example (works similarly in any backend):</p>
<pre><code class="lang-js"><span class="hljs-keyword">import</span> crypto <span class="hljs-keyword">from</span> <span class="hljs-string">"crypto"</span>;

<span class="hljs-keyword">const</span> SECRET = process.env.CURSOR_SECRET;

<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">encodeCursor</span>(<span class="hljs-params">{ createdAt, id }</span>) </span>{
  <span class="hljs-keyword">const</span> payload = <span class="hljs-built_in">JSON</span>.stringify({ <span class="hljs-attr">v</span>: <span class="hljs-number">1</span>, createdAt, id });
  <span class="hljs-keyword">const</span> sig = crypto.createHmac(<span class="hljs-string">"sha256"</span>, SECRET).update(payload).digest(<span class="hljs-string">"base64url"</span>);
  <span class="hljs-keyword">return</span> Buffer.from(payload).toString(<span class="hljs-string">"base64url"</span>) + <span class="hljs-string">"."</span> + sig;
}

<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">decodeCursor</span>(<span class="hljs-params">cursor</span>) </span>{
  <span class="hljs-keyword">const</span> [b64, sig] = cursor.split(<span class="hljs-string">"."</span>);
  <span class="hljs-keyword">const</span> payload = Buffer.from(b64, <span class="hljs-string">"base64url"</span>).toString(<span class="hljs-string">"utf8"</span>);
  <span class="hljs-keyword">const</span> expected = crypto.createHmac(<span class="hljs-string">"sha256"</span>, SECRET).update(payload).digest(<span class="hljs-string">"base64url"</span>);
  <span class="hljs-keyword">if</span> (sig !== expected) <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"Invalid cursor"</span>);
  <span class="hljs-keyword">const</span> obj = <span class="hljs-built_in">JSON</span>.parse(payload);
  <span class="hljs-keyword">if</span> (obj.v !== <span class="hljs-number">1</span>) <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">"Unsupported cursor version"</span>);
  <span class="hljs-keyword">return</span> obj;
}
</code></pre>
<p>Trade-off: signing adds a tiny CPU cost, but it prevents a whole class of “cursor points to weird place” bugs and makes abuse harder.</p>
<h3 id="heading-4-api-shape-make-misuse-difficult">4) API shape (make misuse difficult)</h3>
<p>I prefer an API contract like:</p>
<ul>
<li><code>limit</code> (bounded)</li>
<li><code>cursor</code> (opaque)</li>
<li>returns: <code>items[]</code>, <code>next_cursor</code></li>
</ul>
<pre><code class="lang-json">{
  <span class="hljs-attr">"items"</span>: [{ <span class="hljs-attr">"id"</span>: <span class="hljs-number">123</span>, <span class="hljs-attr">"created_at"</span>: <span class="hljs-string">"2026-01-29T10:00:00Z"</span>, <span class="hljs-attr">"title"</span>: <span class="hljs-string">"..."</span> }],
  <span class="hljs-attr">"next_cursor"</span>: <span class="hljs-string">"eyJ2IjoxLCJjcmVhdGVkQXQiOiIyMDI2LTAxLTI5VDEwOjAwOjAwWiIsImlkIjoxMjN9.abc..."</span>
}
</code></pre>
<p>This nudges the frontend toward “infinite scroll / load more”, which matches the strengths of keyset.</p>
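<p>On the server side I also clamp <code>limit</code> rather than trusting the client. A small sketch (the default and maximum values are illustrative, not a prescription):</p>

```javascript
// Clamp a raw, untrusted limit parameter into a safe range.
// DEFAULT_LIMIT and MAX_LIMIT are illustrative values.
const DEFAULT_LIMIT = 20;
const MAX_LIMIT = 100;

function parseLimit(raw) {
  const n = Number.parseInt(raw, 10);
  if (Number.isNaN(n) || n < 1) return DEFAULT_LIMIT; // missing or nonsense input
  return Math.min(n, MAX_LIMIT); // cap deep reads
}
```

<p>Bounding <code>limit</code> matters for the same reason keyset does: it keeps the worst-case cost of any single request predictable.</p>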
<h2 id="heading-diagram-how-keyset-pagination-reads-data">Diagram: how keyset pagination reads data</h2>
<pre><code class="lang-mermaid">graph TD
  A[Client] --&gt;|GET /feed?limit=20| B[API]
  B --&gt;|SQL: ORDER BY created_at,id DESC LIMIT 20| C[(Postgres)]
  C --&gt; B
  B --&gt;|items + next_cursor| A
  A --&gt;|GET /feed?limit=20&amp;cursor=...| B
  B --&gt;|SQL: WHERE (created_at,id) &lt; cursor ORDER BY ... LIMIT 20| C
</code></pre>
<p>The key is that the second query doesn’t “count past” earlier rows; it “seeks” to a position.</p>
<h2 id="heading-results-amp-learnings-numbers-what-surprised-me">Results &amp; learnings (numbers + what surprised me)</h2>
<p>I’m intentionally not tying this to any specific product; these are representative numbers from benchmarking patterns I’ve repeated over time.</p>
<h3 id="heading-performance-comparison-representative">Performance comparison (representative)</h3>
<p>On Postgres with a table in the <strong>1–10M row</strong> range, ordering by <code>(created_at DESC, id DESC)</code>:</p>
<ul>
<li><p><strong>OFFSET/LIMIT</strong></p>
<ul>
<li>page 1 (OFFSET 0): often ~10–30ms query time</li>
<li>page 500 (OFFSET 9,980 with limit 20): can drift to ~80–250ms depending on cache and vacuum state</li>
<li>deep pages: p95 spikes are common under concurrent load</li>
</ul>
</li>
<li><p><strong>Keyset</strong></p>
<ul>
<li>page 1: ~10–30ms (similar)</li>
<li>page 500: usually stays in the same band (~10–40ms) because the index seek remains efficient</li>
</ul>
</li>
</ul>
<p>What surprised me early on:</p>
<ul>
<li>OFFSET-based endpoints can look “fine” in staging because you rarely test deep pages.</li>
<li>Keyset pagination makes <strong>caching easier</strong> for “top of feed” traffic, because you avoid expensive deep scans that compete for shared resources.</li>
<li>The biggest win isn’t average latency—it’s <strong>tail latency predictability</strong>.</li>
</ul>
<h3 id="heading-operational-learning">Operational learning</h3>
<p>If you’re solo, the win is fewer incidents caused by growth. Keyset is one of those choices where you pay a bit of complexity upfront (cursor encoding, edge cases) to avoid repeated performance firefighting later.</p>
<h2 id="heading-when-this-doesnt-work">When this doesn’t work</h2>
<p>Keyset pagination is not a universal default.</p>
<p>Use something else when:</p>
<p>1) <strong>You truly need random access</strong> (e.g., “page 37 of 2,000” is a real UX requirement). Catalogs, admin UIs, and compliance exports often need numeric pages.</p>
<p>2) <strong>Your sort key is unstable</strong> (e.g., “score” changes frequently). If the ordering changes between requests, any pagination strategy can feel inconsistent—but keyset can be especially confusing because the cursor anchors to a moving target.</p>
<p>3) <strong>You need total counts and exact page numbers</strong>. Keyset doesn’t naturally provide “total pages”. You can compute counts separately, but it’s another query and can be expensive.</p>
<p>4) <strong>Complex multi-column sorts with NULL semantics</strong>. Still doable, but your cursor logic becomes more fragile. At some point, you’re building a mini query planner.</p>
<p>In those cases, I’ll either:</p>
<ul>
<li>accept OFFSET for smaller datasets + add safeguards (max page, caching, read replica), or</li>
<li>build a hybrid: keyset for “browse”, offset for “jump”, or</li>
<li>materialize results (precomputed boundaries) if the dataset is mostly read-only.</li>
</ul>
<h2 id="heading-key-takeaways-a-framework-you-can-reuse">Key takeaways (a framework you can reuse)</h2>
<p>1) <strong>Decide based on growth shape, not current size</strong>. If you expect the table to grow continuously, avoid strategies with linear deep-page costs.
2) <strong>Correctness under writes is part of UX</strong>. If duplicates/skips are unacceptable, prefer cursor-based approaches.
3) <strong>Pick an ordering you can index</strong>. Keyset only shines when your <code>ORDER BY</code> matches an index.
4) <strong>Make the cursor opaque and versioned</strong>. You’ll thank yourself when you evolve sorting or add filters.
5) <strong>Optimize for operations as a solo creator</strong>: predictable p95 beats cleverness.</p>
<h2 id="heading-closing">Closing</h2>
<p>If you had to choose today: would you trade away random page access to get stable performance and fewer pagination bugs? I’m curious where you draw that line—especially for datasets that are both large and frequently updated.</p>
]]></content:encoded></item><item><title><![CDATA[Why I Use Materialized Views for Job Board Aggregations]]></title><description><![CDATA[When I built a PMHNP job board that aggregates 7,556+ listings across 1,368+ companies (with ~200 updates/day), the “simple” parts got hard fast—especially counts and facets. Users expect filters like state, remote, and company to feel instant. But r...]]></description><link>https://blog.dvskr.dev/why-i-use-materialized-views-for-job-board-aggregations</link><guid isPermaLink="true">https://blog.dvskr.dev/why-i-use-materialized-views-for-job-board-aggregations</guid><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Thu, 22 Jan 2026 17:31:59 GMT</pubDate><content:encoded><![CDATA[<p>When I built a PMHNP job board that aggregates 7,556+ listings across 1,368+ companies (with ~200 updates/day), the “simple” parts got hard fast—especially counts and facets. Users expect filters like state, remote, and company to feel instant. But running aggregation queries on every request competes with writes from the pipeline and spikes latency. This article is one architectural decision: why I chose PostgreSQL materialized views for aggregations (and what I gave up) to keep p95 query times around ~50ms in production.</p>
<h2 id="heading-the-problem-space">The problem space</h2>
<p>I’m Sathish (@Sathish_Daggula), a data engineer turned indie hacker. My production system is a niche job board for Psychiatric Mental Health Nurse Practitioners (PMHNP). It runs on Next.js 14 + Supabase (Postgres) + Vercel.</p>
<p>The workload is deceptively mixed:</p>
<ul>
<li><strong>Read-heavy UX expectations</strong>: job listings, search, filters, “companies hiring” pages.</li>
<li><strong>Write-heavy pipeline bursts</strong>: 10+ sources → scrape → normalize → dedupe → upsert. Roughly <strong>200+ daily updates</strong> (some days spiky).</li>
<li><strong>Aggregation-heavy UI</strong>: “jobs by state”, “top companies”, counts for filters, weekly email alerts that need grouped data.</li>
</ul>
<p>Non-functional constraints mattered more than features:</p>
<ul>
<li><strong>Latency</strong>: I targeted “feels instant” for filters. In practice: <strong>~50ms p95 query time</strong> for the most common endpoints.</li>
<li><strong>Cost &amp; ops</strong>: I’m a solo creator; I wanted fewer moving parts than adding Redis + workers + a separate analytics store.</li>
<li><strong>Correctness</strong>: counts that drift are worse than slow counts. If a filter says “Texas (123)”, it must be defensible.</li>
</ul>
<p>The immediate pain: aggregation queries (COUNT/GROUP BY) were the first thing to degrade as the dataset and filters grew. They’re also the easiest to accidentally make expensive.</p>
<blockquote>
<p>Key insight: for job boards, the “list page” is not the hard part—<strong>facets and rollups</strong> are.</p>
</blockquote>
<h2 id="heading-options-considered">Options considered</h2>
<p>I evaluated four approaches for aggregations (facets + dashboards + email rollups). Here’s how I framed it.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Option</td><td>What it is</td><td>Pros</td><td>Cons</td><td>Best when</td></tr>
</thead>
<tbody>
<tr>
<td>1) On-the-fly aggregations</td><td>Run GROUP BY/COUNT queries per request</td><td>Always fresh; simplest conceptually</td><td>Can get slow fast; competes with writes; needs careful indexing</td><td>Small datasets, low concurrency, few filters</td></tr>
<tr>
<td>2) Application-side caching</td><td>Cache aggregation results (memory/Redis/CDN)</td><td>Very fast reads; flexible TTL</td><td>Cache invalidation is real work; stale data risks; extra infra</td><td>Data changes slowly or staleness is acceptable</td></tr>
<tr>
<td>3) Precomputed tables (manual rollups)</td><td>Maintain rollup tables updated by pipeline/jobs</td><td>Fast and explicit; can be incremental</td><td>More code paths; must handle backfills; consistency bugs possible</td><td>High scale, strict SLAs, you can afford ops</td></tr>
<tr>
<td>4) PostgreSQL materialized views</td><td>Database-managed snapshot of a query, refreshable</td><td>Fast reads; fewer app bugs; strong SQL ergonomics</td><td>Refresh cost; staleness window; concurrency nuances</td><td>Medium scale, lots of repeated rollups, minimal infra</td></tr>
</tbody>
</table>
</div><h3 id="heading-why-not-just-do-on-the-fly">Why not just do on-the-fly?</h3>
<p>I started there. It worked until I added more dimensions (state, remote, employment type, source, posted date windows) and more surfaces (homepage, company page, email alerts). A single “facet query” can be okay, but <strong>multiple facets per page</strong> means you’re effectively running a small analytics workload in your OLTP path.</p>
<h3 id="heading-why-not-rediscdn-caching">Why not Redis/CDN caching?</h3>
<p>I like caching, but I’m careful about using it as a crutch:</p>
<ul>
<li>Cache keys explode with combinations of filters.</li>
<li>TTL-based staleness can create “why is this count wrong?” moments.</li>
<li>Invalidation requires you to know exactly which writes affect which keys.</li>
</ul>
<p>For a solo project, I wanted fewer “distributed correctness” problems.</p>
<h3 id="heading-why-not-manual-rollup-tables">Why not manual rollup tables?</h3>
<p>This is the serious approach at scale. But it turns your pipeline into a mini data warehouse project:</p>
<ul>
<li>You need incremental logic per dimension.</li>
<li>Backfills become tricky.</li>
<li>If you ever change business logic (“what counts as active?”), you’re rewriting historical rollups.</li>
</ul>
<p>I wasn’t ready to take on that complexity.</p>
<h2 id="heading-the-decision">The decision</h2>
<p>I chose <strong>PostgreSQL materialized views</strong> for the repeated aggregations that power:</p>
<ul>
<li>Filter counts / facets (by state, remote, company)</li>
<li>“Top companies hiring” lists</li>
<li>Weekly email alert grouping</li>
</ul>
<h3 id="heading-why-materialized-views-ranked-reasons">Why materialized views (ranked reasons)</h3>
<ol>
<li><strong>Predictable read performance</strong>: the view is effectively a precomputed table.</li>
<li><strong>Lower bug surface</strong>: aggregation logic stays in SQL, not duplicated across API endpoints.</li>
<li><strong>Operational simplicity</strong>: no extra data store; refresh can run from the same scheduler as my scraper.</li>
<li><strong>Easier evolution than manual rollups</strong>: I can change the query and refresh, instead of rewriting incremental update logic.</li>
</ol>
<h3 id="heading-what-i-gave-up">What I gave up</h3>
<ul>
<li><strong>Real-time freshness</strong>: materialized views are snapshots. I accepted a refresh window.</li>
<li><strong>Refresh cost</strong>: refresh is work the database must do, and it can contend with other load.</li>
<li><strong>Some constraints</strong>: for concurrent refresh you need unique indexes and you must design the view accordingly.</li>
</ul>
<h3 id="heading-implementation-overview">Implementation overview</h3>
<p>At a high level, my system looks like this:</p>
<pre><code class="lang-mermaid">flowchart LR
  A[Vercel Cron] --&gt; B[Scrapers 10+ sources]
  B --&gt; C[Normalize + Deduplicate]
  C --&gt; D[(PostgreSQL via Supabase)]
  D --&gt; E["Materialized Views (facets, rollups)"]
  E --&gt; F[Next.js API/Server Actions]
  F --&gt; G[UI: Search + Filters]
  E --&gt; H[Weekly Job Alerts]
</code></pre>
<p>The key is: the app reads facets from materialized views, while the pipeline writes to base tables. Refresh happens on a cadence that matches my “good enough” freshness requirements.</p>
<h3 id="heading-1-a-concrete-materialized-view-for-facets">1) A concrete materialized view for facets</h3>
<p>For example, a simplified “jobs by state” facet.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- 1) Materialized view</span>
<span class="hljs-keyword">create</span> <span class="hljs-keyword">materialized</span> <span class="hljs-keyword">view</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> mv_jobs_by_state <span class="hljs-keyword">as</span>
<span class="hljs-keyword">select</span>
  j.state <span class="hljs-keyword">as</span> state,
  <span class="hljs-keyword">count</span>(*) <span class="hljs-keyword">as</span> job_count,
  <span class="hljs-keyword">max</span>(j.updated_at) <span class="hljs-keyword">as</span> last_updated_at
<span class="hljs-keyword">from</span> jobs j
<span class="hljs-keyword">where</span> j.status = <span class="hljs-string">'active'</span>
<span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> j.state;

<span class="hljs-comment">-- 2) Index to make reads fast</span>
<span class="hljs-keyword">create</span> <span class="hljs-keyword">unique</span> <span class="hljs-keyword">index</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">exists</span> mv_jobs_by_state_state_uidx
<span class="hljs-keyword">on</span> mv_jobs_by_state(state);
</code></pre>
<p>Notes:</p>
<ul>
<li>I include <code>status = 'active'</code> because “active jobs” is a business concept, not just a row count.</li>
<li>The <strong>unique index</strong> enables <code>refresh materialized view concurrently</code>.</li>
</ul>
<h3 id="heading-2-refresh-strategy-concurrent-scheduled-and-scoped">2) Refresh strategy: concurrent, scheduled, and scoped</h3>
<p>I refresh materialized views on a schedule (Vercel Cron). The important part is to avoid blocking reads.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Concurrent refresh avoids blocking selects.</span>
refresh materialized view concurrently mv_jobs_by_state;
</code></pre>
<p>In practice I don’t refresh every minute. For a job board, a 15–60 minute staleness window is usually acceptable, and it dramatically reduces refresh churn.</p>
<p>If you have multiple materialized views, refresh order matters (and you may want to stagger them).</p>
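<p>One low-risk way to keep refresh order explicit is to build the statement list up front and have the cron handler execute it sequentially. A sketch (the second view name is hypothetical; any Postgres client can run the resulting statements):</p>

```javascript
// Build refresh statements in an explicit order; the caller executes them
// one at a time so refreshes don't contend with each other.
// View names here are examples, not a fixed schema.
function buildRefreshPlan(views) {
  return views.map(
    (view) => `refresh materialized view concurrently ${view}`
  );
}

const plan = buildRefreshPlan(["mv_jobs_by_state", "mv_top_companies"]);
```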
<h3 id="heading-3-reading-facets-from-the-app">3) Reading facets from the app</h3>
<p>In Next.js, I query the materialized view rather than running a GROUP BY on the jobs table for every request.</p>
<pre><code class="lang-ts"><span class="hljs-comment">// Server-side query (Next.js 14)</span>
<span class="hljs-keyword">import</span> { createClient } <span class="hljs-keyword">from</span> <span class="hljs-string">"@supabase/supabase-js"</span>;

<span class="hljs-keyword">export</span> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">getStateFacets</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-keyword">const</span> supabase = createClient(
    process.env.SUPABASE_URL!,
    process.env.SUPABASE_SERVICE_ROLE_KEY!
  );

  <span class="hljs-keyword">const</span> { data, error } = <span class="hljs-keyword">await</span> supabase
    .from(<span class="hljs-string">"mv_jobs_by_state"</span>)
    .select(<span class="hljs-string">"state, job_count"</span>)
    .order(<span class="hljs-string">"job_count"</span>, { ascending: <span class="hljs-literal">false</span> });

  <span class="hljs-keyword">if</span> (error) <span class="hljs-keyword">throw</span> error;
  <span class="hljs-keyword">return</span> data;
}
</code></pre>
<p>I use the service role for server-side trusted reads; for user-facing reads you can also expose views via RLS carefully (more on that under “Unexpected challenges” below).</p>
<h2 id="heading-results-amp-learnings">Results &amp; learnings</h2>
<h3 id="heading-what-improved">What improved</h3>
<ul>
<li><strong>Latency</strong>: the endpoints that previously ran GROUP BY queries now read from small, indexed materialized views. This is a big contributor to keeping <strong>~50ms p95 query times</strong> for common read paths.</li>
<li><strong>Database stability</strong>: expensive aggregations moved off the “every request” path. Writes from the ingestion pipeline are less likely to coincide with heavy read aggregation.</li>
<li><strong>Developer speed</strong>: when I added weekly job alerts, I reused the same rollups. Business logic stayed centralized.</li>
</ul>
<h3 id="heading-unexpected-challenges">Unexpected challenges</h3>
<ol>
<li><p><strong>Refresh timing is a product decision</strong></p>
<ul>
<li>If you refresh too often, you pay a constant compute tax.</li>
<li>If you refresh too rarely, users see stale counts.</li>
<li>For my case, “jobs updated daily-ish” means a modest refresh cadence is fine.</li>
</ul>
</li>
<li><p><strong>Concurrent refresh has requirements</strong></p>
<ul>
<li>You need a unique index on the materialized view.</li>
<li>Your view query must produce stable unique rows for that index.</li>
</ul>
</li>
<li><p><strong>RLS + views needs deliberate thought</strong></p>
<ul>
<li>Supabase RLS is great for multi-tenant security, but you must decide whether facets are public, tenant-scoped, or admin-only.</li>
<li>Sometimes it’s safer to keep views behind server-only access rather than exposing them directly.</li>
</ul>
</li>
</ol>
<blockquote>
<p>Learning: materialized views are not “set and forget.” The refresh cadence is part of the system design.</p>
</blockquote>
<h2 id="heading-when-this-doesnt-work">When this doesn’t work</h2>
<p>Materialized views are a strong middle ground, but I wouldn’t recommend them universally.</p>
<p>They’re a poor fit when:</p>
<ul>
<li><strong>You need real-time facets</strong> (seconds-level freshness). In that case, you’ll likely need streaming updates + cache invalidation, or incremental rollups.</li>
<li><strong>Your refresh cost is too high</strong> because the underlying query scans huge tables. At large scale, you’ll want partitioning, incremental aggregation, or a separate OLAP store.</li>
<li><strong>You have high write throughput</strong> and refresh contention becomes visible. Even concurrent refresh consumes resources.</li>
<li><strong>Your facets depend on user-specific permissions</strong>. If every user sees different counts (e.g., per-tenant private jobs), you either need tenant-specific materialized views (messy) or compute on the fly with good indexes.</li>
</ul>
<p>At some point, the “right” solution becomes: a small analytical layer (ClickHouse/BigQuery) or a dedicated caching strategy with explicit invalidation.</p>
<h2 id="heading-key-takeaways">Key takeaways</h2>
<ol>
<li><strong>Facets are analytics</strong>: treat them like a separate workload from listing reads.</li>
<li><strong>Materialized views are a pragmatic middle</strong> between slow GROUP BY and complex rollup pipelines.</li>
<li><strong>Design for refresh</strong>: pick a cadence aligned with user expectations, not engineering aesthetics.</li>
<li><strong>Use concurrent refresh + unique indexes</strong> to avoid blocking reads.</li>
<li><strong>Keep business meaning in SQL</strong> (e.g., what counts as “active”) to avoid duplicating logic across endpoints.</li>
</ol>
<h2 id="heading-closing">Closing</h2>
<p>If you’ve built a read-heavy product with frequent updates, how do you handle aggregations—materialized views, Redis caching, rollup tables, or an OLAP store? I’m especially curious what refresh/invalidation strategies have held up for you in production.</p>
]]></content:encoded></item><item><title><![CDATA[Enhancing Your Development Workflow with AI: A Deep Dive into Vibe Coding]]></title><description><![CDATA[Introduction
In the fast-paced world of software development, efficiency is key. As an indie hacker and data engineer, I've constantly sought ways to enhance my productivity and streamline my workflow. Enter Vibe Coding—a concept that leverages AI to...]]></description><link>https://blog.dvskr.dev/enhancing-your-development-workflow-with-ai-a-deep-dive-into-vibe-coding</link><guid isPermaLink="true">https://blog.dvskr.dev/enhancing-your-development-workflow-with-ai-a-deep-dive-into-vibe-coding</guid><category><![CDATA[JavaScript]]></category><category><![CDATA[webdev]]></category><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Fri, 16 Jan 2026 15:00:14 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>In the fast-paced world of software development, efficiency is key. As an indie hacker and data engineer, I've constantly sought ways to enhance my productivity and streamline my workflow. Enter Vibe Coding—a concept that leverages AI tools like Cursor and Claude AI to augment the development process. In this article, I’ll walk you through how these tools have revolutionized my coding experience, allowing me to build better products faster.</p>
<h2 id="heading-the-concept-of-vibe-coding">The Concept of Vibe Coding</h2>
<p>Vibe Coding is about creating a harmonious development environment where AI tools work in tandem to assist in coding tasks. Cursor, an AI-powered code assistant, and Claude AI, a conversational AI tool, form the backbone of my setup, each playing a crucial role in different stages of development.</p>
<h2 id="heading-getting-started-with-cursor">Getting Started with Cursor</h2>
<p>Cursor is a tool designed to increase code quality and development speed. It helps identify errors, optimize code, and even generate boilerplate code. Here's a typical use case:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Before optimization</span>
<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(my_list)):
    print(my_list[i])

<span class="hljs-comment"># After optimization with Cursor</span>
<span class="hljs-keyword">for</span> item <span class="hljs-keyword">in</span> my_list:
    print(item)
</code></pre>
<p>The tool not only improves my coding efficiency but also provides learning opportunities by highlighting better coding practices.</p>
<h2 id="heading-leveraging-claude-ai-for-conceptual-clarity">Leveraging Claude AI for Conceptual Clarity</h2>
<p>Claude AI serves as a digital brainstorming partner, offering insights and suggestions that enhance problem-solving. Whether it’s understanding complex algorithms or exploring new tech stacks, Claude AI is there to provide clarity.</p>
<pre><code class="lang-javascript"><span class="hljs-comment">// Claude AI suggestion for implementing a debounce function</span>
<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">debounce</span>(<span class="hljs-params">func, wait</span>) </span>{
  <span class="hljs-keyword">let</span> timeout;
  <span class="hljs-keyword">return</span> <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">...args</span>) </span>{
    <span class="hljs-built_in">clearTimeout</span>(timeout);
    timeout = <span class="hljs-built_in">setTimeout</span>(<span class="hljs-function">() =&gt;</span> func.apply(<span class="hljs-built_in">this</span>, args), wait);
  };
}
</code></pre>
<h2 id="heading-integrating-ai-into-daily-workflow">Integrating AI into Daily Workflow</h2>
<p>Integrating these tools into my daily workflow has been seamless. I use Cursor for code reviews and Claude AI for design discussions. This integration allows for more focus on creative problem-solving and less on mundane tasks.</p>
<h2 id="heading-overcoming-challenges-with-ai-tools">Overcoming Challenges with AI Tools</h2>
<p>While AI tools offer significant benefits, they are not without challenges. Ensuring data privacy and managing AI suggestions that align with project requirements are ongoing considerations. However, the productivity gains far outweigh these challenges.</p>
<h2 id="heading-future-of-vibe-coding">Future of Vibe Coding</h2>
<p>As AI continues to evolve, the potential for Vibe Coding is limitless. Future advancements may include more intuitive interfaces and deeper integration with existing development tools, making the coding experience even more immersive and efficient.</p>
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<ul>
<li>Vibe Coding leverages AI tools to create a harmonious and efficient development environment.</li>
<li>Cursor optimizes code quality and speeds up development processes.</li>
<li>Claude AI provides conceptual clarity and enhances problem-solving.</li>
<li>Integrating AI into daily workflows can significantly boost productivity.</li>
<li>Ongoing challenges include data privacy and aligning AI suggestions with project needs.</li>
</ul>
<h2 id="heading-cta">CTA</h2>
<p>Curious about incorporating AI into your development process? Start exploring tools like Cursor and Claude AI. Share your experiences or questions in the comments! Follow my journey: @Sathish_Daggula on X.</p>
]]></content:encoded></item><item><title><![CDATA[Developing an Offline-First Fitness App with React Native: The Journey of Gym Tracker]]></title><description><![CDATA[Introduction
Fitness apps have become an integral part of our lives, guiding us in maintaining a healthy lifestyle. As the creator of Gym Tracker, an app currently on the waitlist with 423 exercises and an AI coach, my vision was to build an offline-...]]></description><link>https://blog.dvskr.dev/developing-an-offline-first-fitness-app-with-react-native-the-journey-of-gym-tracker</link><guid isPermaLink="true">https://blog.dvskr.dev/developing-an-offline-first-fitness-app-with-react-native-the-journey-of-gym-tracker</guid><dc:creator><![CDATA[Sathish]]></dc:creator><pubDate>Wed, 14 Jan 2026 15:00:21 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Fitness apps have become an integral part of our lives, guiding us in maintaining a healthy lifestyle. As the creator of Gym Tracker, an app currently on the waitlist with 423 exercises and an AI coach, my vision was to build an offline-first app. This decision was driven by the need to provide uninterrupted service to users, even in areas with poor connectivity. In this article, I'll share the journey of developing Gym Tracker using React Native and Expo, focusing on the strategies that enabled offline functionality and health data synchronization.</p>
<h2 id="heading-choosing-react-native-and-expo">Choosing React Native and Expo</h2>
<p>React Native was the natural choice for Gym Tracker due to its ability to deliver a native app experience on both iOS and Android from a single codebase. Expo further simplified the process, offering tools and libraries that accelerated development.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">import</span> { registerRootComponent } <span class="hljs-keyword">from</span> <span class="hljs-string">'expo'</span>;
<span class="hljs-keyword">import</span> App <span class="hljs-keyword">from</span> <span class="hljs-string">'./App'</span>;

registerRootComponent(App); <span class="hljs-comment">// Registers App as the root component for both Expo Go and native builds</span>
</code></pre>
<h2 id="heading-implementing-offline-first-strategy">Implementing Offline-First Strategy</h2>
<p>One of the primary challenges was ensuring the app functioned offline. To achieve this, I used AsyncStorage for local data storage, allowing critical data such as user progress and workouts to be accessed without an internet connection.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">import</span> AsyncStorage <span class="hljs-keyword">from</span> <span class="hljs-string">'@react-native-async-storage/async-storage'</span>;

<span class="hljs-keyword">const</span> storeWorkout = <span class="hljs-keyword">async</span> (workout) =&gt; {
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> jsonValue = <span class="hljs-built_in">JSON</span>.stringify(workout);
    <span class="hljs-keyword">await</span> AsyncStorage.setItem(<span class="hljs-string">'@workout_key'</span>, jsonValue);
  } <span class="hljs-keyword">catch</span> (e) {
    <span class="hljs-built_in">console</span>.error(<span class="hljs-string">'Error storing the workout:'</span>, e);
  }
};
</code></pre>
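<p>Storing locally is only half of offline-first; the other half is deciding what happens when connectivity returns. The sketch below is illustrative only (the <code>uploadFn</code> server call is a placeholder, not Gym Tracker's actual sync code), but the key pattern is real: remove an item from the queue only after its upload succeeds, so a dropped connection mid-flush never loses data.</p>
<pre><code class="lang-javascript">// Minimal offline write queue (illustrative sketch; uploadFn is a
// hypothetical server call, not the app's real sync implementation).
class SyncQueue {
  constructor(uploadFn) {
    this.uploadFn = uploadFn; // async function that pushes one workout to the server
    this.pending = [];        // workouts saved while offline
  }

  // Called on every save: data is already in AsyncStorage; queue it for upload.
  enqueue(workout) {
    this.pending.push(workout);
  }

  // Called when connectivity returns (e.g. from a NetInfo listener).
  async flush() {
    while (this.pending.length > 0) {
      await this.uploadFn(this.pending[0]); // throws if the network drops again
      this.pending.shift();                 // remove only after a successful upload
    }
  }
}
</code></pre>
<p>In the app itself, <code>flush()</code> would typically be wired to a connectivity listener such as the one provided by <code>@react-native-community/netinfo</code>.</p>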
<h2 id="heading-ai-integration-for-personalized-coaching">AI Integration for Personalized Coaching</h2>
<p>Integrating AI for personalized coaching was another ambitious feature. Using machine learning models, the app provides tailored exercise recommendations based on the user's goals and performance.</p>
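<p>I can't share the model itself here, but the shape of the problem is easy to sketch. Below is a deliberately simplified stand-in heuristic, not the real coach: recommend the exercise trained least recently, treating never-trained exercises as the stalest. None of the names in this snippet come from the actual codebase.</p>
<pre><code class="lang-javascript">// Simplified stand-in for the recommendation step (illustrative only;
// the real app uses an AI coach, not this heuristic).
function recommendNext(exercises, history) {
  // Find the most recent timestamp at which each exercise was performed.
  const lastTrained = {};
  for (const entry of history) {
    lastTrained[entry.exercise] = Math.max(
      lastTrained[entry.exercise] || 0,
      entry.timestamp
    );
  }
  // Never-trained exercises default to 0, so they sort to the front.
  return exercises
    .slice() // avoid mutating the caller's array
    .sort((a, b) =&gt; (lastTrained[a] || 0) - (lastTrained[b] || 0))[0];
}
</code></pre>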
<h2 id="heading-health-data-synchronization">Health Data Synchronization</h2>
<p>Synchronizing health data across devices was essential, especially for tracking metrics like steps and calories burned. Using community libraries that bridge to Apple HealthKit on iOS and to Google Fit on Android (such as <code>react-native-google-fit</code>), I ensured seamless integration and synchronization.</p>
<pre><code class="lang-javascript">import GoogleFit from 'react-native-google-fit'; // default export, not a namespace

// Start background recording of fitness data (steps, distance, etc.).
GoogleFit.startRecording((callback) =&gt; {
  console.log('Steps data is being recorded:', callback);
});
</code></pre>
<h2 id="heading-designing-an-intuitive-user-interface">Designing an Intuitive User Interface</h2>
<p>The user interface of Gym Tracker needed to be both functional and visually appealing. I used a vibrant color palette to create an energetic feel, ensuring users remain motivated and engaged.</p>
<pre><code class="lang-jsx">{/* Illustrative banner only; the original snippet's JSX tags were stripped */}
&lt;Text style={{ color: '#E64A19', fontSize: 24, fontWeight: 'bold' }}&gt;
  Welcome to Gym Tracker!
&lt;/Text&gt;
</code></pre>
<h2 id="heading-testing-and-deployment">Testing and Deployment</h2>
<p>Extensive testing on different devices was critical to ensure the app's performance and reliability. Using Expo's over-the-air updates, I could roll out fixes and improvements quickly without requiring users to update the app manually.</p>
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<ul>
<li>React Native and Expo offer powerful tools for building cross-platform apps with native performance.</li>
<li>Offline-first strategies ensure app functionality even in low-connectivity areas.</li>
<li>AI integration enhances user experience by providing personalized fitness recommendations.</li>
<li>Synchronizing health data across platforms is crucial for comprehensive fitness tracking.</li>
<li>An intuitive and engaging UI is vital for user retention and motivation.</li>
</ul>
<h2 id="heading-cta">CTA</h2>
<p>Eager to dive into mobile development? Start experimenting with React Native and Expo. Have insights or questions to share? Drop a comment below! Follow my journey: @Sathish_Daggula on X.</p>
]]></content:encoded></item></channel></rss>