ZenRows · internal walkthrough

Conveyor

An async batch-scraping API. You POST a pile of URLs; it queues them, scrapes each through the gateway, stores results in S3, and calls you back when the whole job finishes.

Go · one binary API = Lambda Worker = ECS Fargate DynamoDB + SQS + S3

swipe / arrow keys / tap edges →

The one-sentence version

A durable inbox for scrape jobs

Scraping one URL is a synchronous call to the gateway. Scraping 50,000 URLs reliably — with retries, backpressure, partial failure, and a callback at the end — is a different problem. That's Conveyor.

Accept & persist

Validate params, mint a job + run, drop every URL onto a queue. Caller gets an ID instantly.

Drain & scrape

Workers pull from the queue, call the gateway per URL, write the body to S3.

Track & retry

Per-task status, automatic re-attempts, at-least-once delivery, dense result pages.

Notify

On completion, fire a Step Functions token and/or a signed webhook.

Domain model · SPEC §2

Job → Run → Task

Three nested entities. Getting this split right is the whole mental model.

JOB the template config · params · format status: open→closed max_inflight · webhook no execution state RUN one execution status: running⇄pending → completed/stopped stats{total,done,ok,fail} 1 job → many runs (rerun) TASK one URL in one run pending→processing → successful/failed

The Job is a reusable submission. A Run is one drain of it. Rerun → a fresh Run on the same Job.

State machines · SPEC §2.4

Lifecycle

JOB — about task ingestion

open closed deleted close / last_batch

“closed” = no more URLs accepted. Runs can still be in flight.

RUN — about execution

running pending completed stopped

completed = done==total & job closed. Tasks fail; runs never do.

Invariant: at most one non-terminal run per job. Rerun is 409 while a run is live — stop first.

Deployment · SPEC §1.2

One binary, two roles

cmd/conveyor/main.go branches on CONVEYOR_ROLE. Same image runs the API Lambda and the worker.

role=api

API

AWS Lambda (container image) behind the shared ALB. Serves the /v1/* Gin surface. Scales to zero.

role=worker

Worker

ECS Fargate, long-lived. Drains the SQS queues, calls the gateway, settles tasks. Autoscaled 2→10.

local

conveyor-dev

Single process: HTTP + in-memory queues + filesystem store. What I booted to test PR #83.

The priority dispatch loop in internal/worker/runner/ is shared by the prod worker and conveyor-dev — so local dev exercises the real loop.

End to end

The request flow

Caller APILambda DynamoDBjob·run·task SQSper-run queue WorkerFargate Gateway S3 Step Fns/ webhook POST /v1/jobs enqueue receive scrape URL store body on complete settle + stats

Caller never blocks. Everything after “202 Accepted” is the worker draining the queue at its own pace.

Storage · SPEC §5

Where state lives

DynamoDB

Single table

One table, jobs + runs + tasks + events keyed by Job#<id>. GSI1 = jobs-by-caller, GSI2 = dispatch discovery (active#/idle#).

S3

Result bodies

Run-scoped keys jobs/<job>/runs/<run>/…. 30-day physical sweep; a softer result_ttl_days visibility window on top.

DDB lease

Semaphore

Partition Sem#<job>, one lease row per in-flight message. The live leases are the in-flight count — no counter to drift.

Concurrency idiom: every Job/Run write is version-conditioned (optimistic lock). But stats counters use commutative ADD and skip the version — so a worker bumping completed never conflicts with a handler flipping status. That carve-out is what keeps settlement from being a retry storm.

Worker · SPEC §6.2

What one task costs

The worker loop, condensed. Every step is idempotent and safe to redeliver.

  • Receive from a random per-run queue (short-poll, ≤10 msgs)
  • Acquire a permit from the job's lease semaphore — or bounce
  • Load task, skip if already terminal
  • Flip pending → processing (conditional)
  • Call the gateway with merged params + forwarded auth
  • Success → body to S3, task successful, bump stats
  • Fail + retries left → re-enqueue; else failed + Problem JSON
  • Last task → run completed → fire SFN token / webhook

Throughout, the worker heartbeats its lease every 10s so it doesn't age out mid-scrape. Then: release permit, ack message.

Dispatch · SPEC §6.1

One queue per run

Not one big queue — a queue per Run (conveyor-{env}-run-{run_id}). The worker holds an in-memory map of active queues and round-robins them.

  • Discovery via GSI2: cold-start scan, then delta queries every 10s (active# add, idle# remove), full resync every 10min
  • Pick a random queue from the local map
  • Poll short, process, sleep-on-empty
  • Evict on QueueDoesNotExist (404 = canonical removal) or 10-min empty
dispatcher pickRandom() run A run B run N

Backpressure · SPEC §6.7

The per-job ceiling

Each job caps how many of its messages can be in flight at once — max_inflight, default 100. This protects the gateway and keeps one giant job from starving everyone else.

Acquire

Strongly-consistent COUNT of live leases. Below the cap → insert a lease & process. At the cap → bounce the message (ChangeMessageVisibility=0) and move on.

Self-healing

Crashed worker stops heartbeating → its lease ages out in 30s (under the 60s visibility timeout). No sweeper, no orphaned counter. Briefly overshooting the cap is accepted by design.

Authoritative across all containers — the DDB lease set is shared. This is the “source of truth” PR #83 builds on.

The bug · COR-315

Receive amplification

The old path checked the ceiling after receiving. At saturation, the worker pulled messages it couldn't process, bounced them, pulled them again… a treadmill.

SQS queue 3,334 msgs worker at the cap throttle bounce back receive can't process re-deliver (visibility timeout) — and again, and again

9.7–16.6×
receives per delivered message

~2%
of nominal processing rate

203/3334
stuck — queue effectively frozen

The fix · PR #83 · COR-316

Don't poll what you can't process

Carlos's call: gate before the receive. The dispatcher tracks a local in-flight count per job and simply skips a queue when it's already at max_inflight. You never receive a message you'd just bounce.

BEFORE — bounce after the fact

receive at cap? bounce

AFTER — gate before the poll

localInflight < cap? receive else: skip queue, no SQS call

~80 lines in dispatcher.go: jobEntry gains maxInflight + an atomic localInflight; pickRandom skips capped queues; processOnce ++/-- around the handler. The DDB lease stays authoritative across containers — the gate just stops the wasteful local polling. COR-314/315 stay as defense-in-depth.

Verified locally · head-to-head test

Does it actually help? Yes.

Same dispatcher, same fake-SQS, only the gate toggled. 500 tasks, cap 5, 50 workers. I ran it three times:

runreceives off→onamplificationfewer receiveselapsed
13396 → 23276.79×4.65×31.5%11.1s → 7.6s
22396 → 7954.79×1.59×66.8%5.5s → 2.2s
32442 → 9914.88×1.98×59.4%7.5s → 2.1s

Every run: strictly fewer receives, lower amplification, faster — with identical completion (500/500) and zero DLQ on both sides. The gate costs nothing in correctness. Full -race suite green; conveyor-dev ran a real 3-task job end-to-end against a mock gateway.

How the layers stack

Three lines of defense

COR-316 · PR #83 · steady state

Pre-poll gate

Local in-flight count skips capped queues. Stops amplification at the source. Inert until you're at the limit.

COR-314 · edge case

Per-queue cooldown

Brief skip after a throttle, so a stale local view doesn't hammer one queue.

COR-315 · edge case

Adaptive bounce visibility

When a message does get bounced, back it off exponentially (5s→300s) so it doesn't re-deliver instantly.

The gate handles the normal path; 314/315 catch the races — peer-container drift, restarts — where the local count is briefly stale. The DDB lease underneath all of it stays the cross-container truth.

Recap

Conveyor in one breath

A Job is a template; a Run drains it; a Task is one URL. The API persists and enqueues; workers round-robin per-run queues, scrape through the gateway, store to S3, and notify on completion. A per-job lease semaphore is the throttle — and PR #83 makes that throttle proactive instead of reactive, killing receive amplification.

Job→Run→Task per-run SQS DDB lease semaphore pre-poll gate ✓

SPEC.md is authoritative · PR #83 verified green locally

← → / swipe