ZenRows · internal walkthrough
An async batch-scraping API. You POST a pile of URLs; it queues them, scrapes each through the gateway, stores results in S3, and calls you back when the whole job finishes.
swipe / arrow keys / tap edges →
The one-sentence version
Scraping one URL is a synchronous call to the gateway. Scraping 50,000 URLs reliably — with retries, backpressure, partial failure, and a callback at the end — is a different problem. That's Conveyor.
Validate params, mint a job + run, drop every URL onto a queue. Caller gets an ID instantly.
Workers pull from the queue, call the gateway per URL, write the body to S3.
Per-task status, automatic re-attempts, at-least-once delivery, dense result pages.
On completion, fire a Step Functions token and/or a signed webhook.
Domain model · SPEC §2
Three nested entities. Getting this split right is the whole mental model.
The Job is a reusable submission. A Run is one drain of it. Rerun → a fresh Run on the same Job.
State machines · SPEC §2.4
JOB — about task ingestion
“closed” = no more URLs accepted. Runs can still be in flight.
RUN — about execution
completed = done==total & job closed. Tasks fail; runs never do.
Invariant: at most one non-terminal run per job. Rerun is 409 while a run is live — stop first.
Deployment · SPEC §1.2
cmd/conveyor/main.go branches on CONVEYOR_ROLE. Same image runs the API Lambda and the worker.
AWS Lambda (container image) behind the shared ALB. Serves the /v1/* Gin surface. Scales to zero.
ECS Fargate, long-lived. Drains the SQS queues, calls the gateway, settles tasks. Autoscaled 2→10.
Single process: HTTP + in-memory queues + filesystem store. What I booted to test PR #83.
The priority dispatch loop in internal/worker/runner/ is shared by the prod worker and conveyor-dev — so local dev exercises the real loop.
End to end
Caller never blocks. Everything after “202 Accepted” is the worker draining the queue at its own pace.
Storage · SPEC §5
One table, jobs + runs + tasks + events keyed by Job#<id>. GSI1 = jobs-by-caller, GSI2 = dispatch discovery (active#/idle#).
Run-scoped keys jobs/<job>/runs/<run>/…. 30-day physical sweep; a softer result_ttl_days visibility window on top.
Partition Sem#<job>, one lease row per in-flight message. The live leases are the in-flight count — no counter to drift.
Concurrency idiom: every Job/Run write is version-conditioned (optimistic lock). But stats counters use commutative ADD and skip the version — so a worker bumping completed never conflicts with a handler flipping status. That carve-out is what keeps settlement from being a retry storm.
Worker · SPEC §6.2
The worker loop, condensed. Every step is idempotent and safe to redeliver.
pending → processing (conditional)successful, bump statsfailed + Problem JSONcompleted → fire SFN token / webhookThroughout, the worker heartbeats its lease every 10s so it doesn't age out mid-scrape. Then: release permit, ack message.
Dispatch · SPEC §6.1
Not one big queue — a queue per Run (conveyor-{env}-run-{run_id}). The worker holds an in-memory map of active queues and round-robins them.
active# add, idle# remove), full resync every 10minQueueDoesNotExist (404 = canonical removal) or 10-min emptyBackpressure · SPEC §6.7
Each job caps how many of its messages can be in flight at once — max_inflight, default 100. This protects the gateway and keeps one giant job from starving everyone else.
Strongly-consistent COUNT of live leases. Below the cap → insert a lease & process. At the cap → bounce the message (ChangeMessageVisibility=0) and move on.
Crashed worker stops heartbeating → its lease ages out in 30s (under the 60s visibility timeout). No sweeper, no orphaned counter. Briefly overshooting the cap is accepted by design.
Authoritative across all containers — the DDB lease set is shared. This is the “source of truth” PR #83 builds on.
The bug · COR-315
The old path checked the ceiling after receiving. At saturation, the worker pulled messages it couldn't process, bounced them, pulled them again… a treadmill.
9.7–16.6×
receives per delivered message
~2%
of nominal processing rate
203/3334
stuck — queue effectively frozen
The fix · PR #83 · COR-316
Carlos's call: gate before the receive. The dispatcher tracks a local in-flight count per job and simply skips a queue when it's already at max_inflight. You never receive a message you'd just bounce.
BEFORE — bounce after the fact
AFTER — gate before the poll
~80 lines in dispatcher.go: jobEntry gains maxInflight + an atomic localInflight; pickRandom skips capped queues; processOnce ++/-- around the handler. The DDB lease stays authoritative across containers — the gate just stops the wasteful local polling. COR-314/315 stay as defense-in-depth.
Verified locally · head-to-head test
Same dispatcher, same fake-SQS, only the gate toggled. 500 tasks, cap 5, 50 workers. I ran it three times:
| run | receives off→on | amplification | fewer receives | elapsed |
|---|---|---|---|---|
| 1 | 3396 → 2327 | 6.79× → 4.65× | 31.5% | 11.1s → 7.6s |
| 2 | 2396 → 795 | 4.79× → 1.59× | 66.8% | 5.5s → 2.2s |
| 3 | 2442 → 991 | 4.88× → 1.98× | 59.4% | 7.5s → 2.1s |
Every run: strictly fewer receives, lower amplification, faster — with identical completion (500/500) and zero DLQ on both sides. The gate costs nothing in correctness. Full -race suite green; conveyor-dev ran a real 3-task job end-to-end against a mock gateway.
How the layers stack
Local in-flight count skips capped queues. Stops amplification at the source. Inert until you're at the limit.
Brief skip after a throttle, so a stale local view doesn't hammer one queue.
When a message does get bounced, back it off exponentially (5s→300s) so it doesn't re-deliver instantly.
The gate handles the normal path; 314/315 catch the races — peer-container drift, restarts — where the local count is briefly stale. The DDB lease underneath all of it stays the cross-container truth.
Recap
A Job is a template; a Run drains it; a Task is one URL. The API persists and enqueues; workers round-robin per-run queues, scrape through the gateway, store to S3, and notify on completion. A per-job lease semaphore is the throttle — and PR #83 makes that throttle proactive instead of reactive, killing receive amplification.
SPEC.md is authoritative · PR #83 verified green locally