Postgres-Backed Job Queue

Problem

The URL shortener needed its click buffer moved out of process, and the obvious answers were SQS, BullMQ, or RabbitMQ. But the workload — thousands of jobs an hour, not thousands a second — sits squarely in the range where the standard advice is "just use Postgres." I wanted to find out, by building it, what that advice actually costs: what a queue needs beyond INSERT and SELECT, and where Postgres genuinely runs out of road.

So this is a case study. The deliverable is the write-up as much as the library.

Architecture

One table is the queue. Jobs are rows: payload (JSONB), state (pending → claimed → done / failed / dead), run_at, attempts, claimed_until. Three actors operate on it:

Producers INSERT a row — and if they do it inside an existing business transaction, enqueueing is atomic with the work that caused it. That transactional-enqueue property is the single biggest thing brokers can't give you, and it falls out of this design for free.

Workers claim with the canonical query:

UPDATE jobs SET state = 'claimed', claimed_until = now() + interval '60s'
WHERE id IN (
  SELECT id FROM jobs
  WHERE state = 'pending' AND run_at <= now()
  ORDER BY run_at
  FOR UPDATE SKIP LOCKED
  LIMIT 10
)
RETURNING *;

SKIP LOCKED is the whole trick — N workers polling concurrently never block or double-claim. Completion deletes or marks the row; failure increments attempts and reschedules run_at with exponential backoff plus jitter.

The scheduler is a small sweeper loop: anything claimed past its claimed_until (a crashed worker) goes back to pending; anything past max attempts moves to a dead_letter table with its last error, so failures are queryable with SQL instead of grep.

A partial index on (run_at) WHERE state = 'pending' keeps the claim query indexed on exactly the rows that matter. Workers are plain Node processes; Docker Compose runs the demo topology (producer, three workers, scheduler).

Trade-offs

Polling over LISTEN/NOTIFY. I started with NOTIFY for instant pickup and removed it: it adds connection-lifetime concerns and a thundering-herd on every enqueue, to improve a latency (≤ poll interval, 250 ms) that batch workloads don't feel. The sweep stays as the source of truth either way — NOTIFY can only ever be an optimization, never the mechanism.
Visibility timeout over heartbeats. Crashed workers are detected by lease expiry, not liveness pings. Simpler, with one sharp edge: a slow job that outlives its lease runs twice. Handlers must be idempotent, and the README says so in bold.
One table over per-queue tables. A queue column plus the partial index performs fine at this scale and keeps operations (vacuum, metrics, dead-letter triage) uniform.
Honest ceiling. Update-heavy churn means bloat; autovacuum needs tuning past ~100 jobs/sec sustained, and hot-standby failover semantics are on you. The case study ends with a section on when to leave — signals that it's time for SQS/Rabbit, written before I need them.

What I'd do differently

Build the observability first, not last. A queue's hard bugs are invisible in logs and obvious in three numbers — queue depth, claim age, retry rate — which I added only after losing an evening to a stuck worker that one max(now() - run_at) query would have exposed in seconds. I'd also design the payload schema versioning up front: the first time a handler's payload shape changed with jobs still in flight, I had to write a migration shim I could have made unnecessary on day one.