ZenRows · internal walkthrough

Conveyor

An async batch-scraping API. You POST a pile of URLs; it queues them, scrapes each through the gateway, stores results in S3, and calls you back when the whole job finishes.

Go · one binary API = Lambda Worker = ECS Fargate DynamoDB + SQS + S3

swipe / arrow keys / tap edges →

The one-sentence version

A durable inbox for scrape jobs

Scraping one URL is a synchronous call to the gateway. Scraping 50,000 URLs reliably — with retries, backpressure, partial failure, and a callback at the end — is a different problem. That's Conveyor.

Accept & persist

Validate params, mint a job + run, drop every URL onto a queue. Caller gets an ID instantly.

Drain & scrape

Workers pull from the queue, call the gateway per URL, write the body to S3.

Track & retry

Per-task status, automatic re-attempts, at-least-once delivery, dense result pages.

Notify

On completion, fire a Step Functions token and/or a signed webhook.

Domain model · SPEC §2

Job → Run → Task

Three nested entities. Getting this split right is the whole mental model.

The Job is a reusable submission. A Run is one drain of it. Rerun → a fresh Run on the same Job.

State machines · SPEC §2.4

Lifecycle

JOB — about task ingestion

“closed” = no more URLs accepted. Runs can still be in flight.

RUN — about execution

completed = done==total & job closed. Tasks fail; runs never do.

Invariant: at most one non-terminal run per job. Rerun is 409 while a run is live — stop first.

Deployment · SPEC §1.2

One binary, two roles

cmd/conveyor/main.go branches on CONVEYOR_ROLE. Same image runs the API Lambda and the worker.

role=api

API

AWS Lambda (container image) behind the shared ALB. Serves the /v1/* Gin surface. Scales to zero.

role=worker

Worker

ECS Fargate, long-lived. Drains the SQS queues, calls the gateway, settles tasks. Autoscaled 2→10.

local

conveyor-dev

Single process: HTTP + in-memory queues + filesystem store. What I booted to test PR #83.

The priority dispatch loop in internal/worker/runner/ is shared by the prod worker and conveyor-dev — so local dev exercises the real loop.

End to end

The request flow

Caller never blocks. Everything after “202 Accepted” is the worker draining the queue at its own pace.

Storage · SPEC §5

Where state lives

DynamoDB

Single table

One table, jobs + runs + tasks + events keyed by Job#<id>. GSI1 = jobs-by-caller, GSI2 = dispatch discovery (active#/idle#).

S3

Result bodies

Run-scoped keys jobs/<job>/runs/<run>/…. 30-day physical sweep; a softer result_ttl_days visibility window on top.

DDB lease

Semaphore

Partition Sem#<job>, one lease row per in-flight message. The live leases are the in-flight count — no counter to drift.

Concurrency idiom: every Job/Run write is version-conditioned (optimistic lock). But stats counters use commutative ADD and skip the version — so a worker bumping completed never conflicts with a handler flipping status. That carve-out is what keeps settlement from being a retry storm.

Worker · SPEC §6.2

What one task costs

The worker loop, condensed. Every step is idempotent and safe to redeliver.

Receive from a random per-run queue (short-poll, ≤10 msgs)
Acquire a permit from the job's lease semaphore — or bounce
Load task, skip if already terminal
Flip pending → processing (conditional)

Call the gateway with merged params + forwarded auth
Success → body to S3, task successful, bump stats
Fail + retries left → re-enqueue; else failed + Problem JSON
Last task → run completed → fire SFN token / webhook

Throughout, the worker heartbeats its lease every 10s so it doesn't age out mid-scrape. Then: release permit, ack message.

Dispatch · SPEC §6.1

One queue per run

Not one big queue — a queue per Run (conveyor-{env}-run-{run_id}). The worker holds an in-memory map of active queues and round-robins them.

Discovery via GSI2: cold-start scan, then delta queries every 10s (active# add, idle# remove), full resync every 10min
Pick a random queue from the local map
Poll short, process, sleep-on-empty
Evict on QueueDoesNotExist (404 = canonical removal) or 10-min empty

Backpressure · SPEC §6.7

The per-job ceiling

Each job caps how many of its messages can be in flight at once — max_inflight, default 100. This protects the gateway and keeps one giant job from starving everyone else.

Acquire

Strongly-consistent COUNT of live leases. Below the cap → insert a lease & process. At the cap → bounce the message (ChangeMessageVisibility=0) and move on.

Self-healing

Crashed worker stops heartbeating → its lease ages out in 30s (under the 60s visibility timeout). No sweeper, no orphaned counter. Briefly overshooting the cap is accepted by design.

Authoritative across all containers — the DDB lease set is shared. This is the “source of truth” PR #83 builds on.

The bug · COR-315

Receive amplification

The old path checked the ceiling after receiving. At saturation, the worker pulled messages it couldn't process, bounced them, pulled them again… a treadmill.

9.7–16.6×
receives per delivered message

~2%
of nominal processing rate

203/3334
stuck — queue effectively frozen

The fix · PR #83 · COR-316

Don't poll what you can't process

Carlos's call: gate before the receive. The dispatcher tracks a local in-flight count per job and simply skips a queue when it's already at max_inflight. You never receive a message you'd just bounce.

BEFORE — bounce after the fact

AFTER — gate before the poll

~80 lines in dispatcher.go: jobEntry gains maxInflight + an atomic localInflight; pickRandom skips capped queues; processOnce ++/-- around the handler. The DDB lease stays authoritative across containers — the gate just stops the wasteful local polling. COR-314/315 stay as defense-in-depth.

Verified locally · head-to-head test

Does it actually help? Yes.

Same dispatcher, same fake-SQS, only the gate toggled. 500 tasks, cap 5, 50 workers. I ran it three times:

run	receives off→on	amplification	fewer receives	elapsed
1	3396 → 2327	6.79× → 4.65×	31.5%	11.1s → 7.6s
2	2396 → 795	4.79× → 1.59×	66.8%	5.5s → 2.2s
3	2442 → 991	4.88× → 1.98×	59.4%	7.5s → 2.1s

Every run: strictly fewer receives, lower amplification, faster — with identical completion (500/500) and zero DLQ on both sides. The gate costs nothing in correctness. Full -race suite green; conveyor-dev ran a real 3-task job end-to-end against a mock gateway.

How the layers stack

Three lines of defense

COR-316 · PR #83 · steady state

Pre-poll gate

Local in-flight count skips capped queues. Stops amplification at the source. Inert until you're at the limit.

COR-314 · edge case

Per-queue cooldown

Brief skip after a throttle, so a stale local view doesn't hammer one queue.

COR-315 · edge case

Adaptive bounce visibility

When a message does get bounced, back it off exponentially (5s→300s) so it doesn't re-deliver instantly.

The gate handles the normal path; 314/315 catch the races — peer-container drift, restarts — where the local count is briefly stale. The DDB lease underneath all of it stays the cross-container truth.

Recap

Conveyor in one breath

A Job is a template; a Run drains it; a Task is one URL. The API persists and enqueues; workers round-robin per-run queues, scrape through the gateway, store to S3, and notify on completion. A per-job lease semaphore is the throttle — and PR #83 makes that throttle proactive instead of reactive, killing receive amplification.

Job→Run→Task per-run SQS DDB lease semaphore pre-poll gate ✓

SPEC.md is authoritative · PR #83 verified green locally