Postgres-Backed Job Queue
A case study in building a reliable job queue on plain Postgres — SKIP LOCKED claiming, visibility timeouts, retries with backoff, and a dead-letter table.
Problem
The URL shortener needed its click buffer moved out of process, and the
obvious answers were SQS, BullMQ, or RabbitMQ. But the workload — thousands
of jobs an hour, not thousands a second — sits squarely in the range where
the standard advice is "just use Postgres." I wanted to find out, by
building it, what that advice actually costs: what a queue needs beyond
INSERT and SELECT, and where Postgres genuinely runs out of road.
So this is a case study. The deliverable is the write-up as much as the library.
Architecture
One table is the queue. Jobs are rows: payload (JSONB), state
(pending → claimed → done / failed / dead), run_at, attempts,
claimed_until. Three actors operate on it:
Producers INSERT a row — and if they do it inside an existing business
transaction, enqueueing is atomic with the work that caused it. That
transactional-enqueue property is the single biggest thing brokers can't
give you, and it falls out of this design for free.
Workers claim with the canonical query:
UPDATE jobs SET state = 'claimed', claimed_until = now() + interval '60s'
WHERE id IN (
SELECT id FROM jobs
WHERE state = 'pending' AND run_at <= now()
ORDER BY run_at
FOR UPDATE SKIP LOCKED
LIMIT 10
)
RETURNING *;SKIP LOCKED is the whole trick — N workers polling concurrently never
block or double-claim. Completion deletes or marks the row; failure
increments attempts and reschedules run_at with exponential backoff plus
jitter.
The scheduler is a small sweeper loop: anything claimed past its
claimed_until (a crashed worker) goes back to pending; anything past max
attempts moves to a dead_letter table with its last error, so failures are
queryable with SQL instead of grep.
A partial index on (run_at) WHERE state = 'pending' keeps the claim query
indexed on exactly the rows that matter. Workers are plain Node processes;
Docker Compose runs the demo topology (producer, three workers, scheduler).
Trade-offs
- Polling over LISTEN/NOTIFY. I started with NOTIFY for instant pickup and removed it: it adds connection-lifetime concerns and a thundering-herd on every enqueue, to improve a latency (≤ poll interval, 250 ms) that batch workloads don't feel. The sweep stays as the source of truth either way — NOTIFY can only ever be an optimization, never the mechanism.
- Visibility timeout over heartbeats. Crashed workers are detected by lease expiry, not liveness pings. Simpler, with one sharp edge: a slow job that outlives its lease runs twice. Handlers must be idempotent, and the README says so in bold.
- One table over per-queue tables. A
queuecolumn plus the partial index performs fine at this scale and keeps operations (vacuum, metrics, dead-letter triage) uniform. - Honest ceiling. Update-heavy churn means bloat; autovacuum needs tuning past ~100 jobs/sec sustained, and hot-standby failover semantics are on you. The case study ends with a section on when to leave — signals that it's time for SQS/Rabbit, written before I need them.
What I'd do differently
Build the observability first, not last. A queue's hard bugs are invisible
in logs and obvious in three numbers — queue depth, claim age, retry rate —
which I added only after losing an evening to a stuck worker that one
max(now() - run_at) query would have exposed in seconds. I'd also design
the payload schema versioning up front: the first time a handler's payload
shape changed with jobs still in flight, I had to write a migration shim I
could have made unnecessary on day one.