4 min read · by ScrapeNest Engineering

Scheduled scraping, without the cron-job graveyard

Recurring scrapes are easy to start and hard to run well: missed windows, overlapping runs, silent failures when you run out of budget. Here's how we built scheduling so none of that is your problem.

Tags: scheduling, automation, product

Most scraping that matters is recurring. You don't check a competitor's prices once; you check them every hour. You don't pull a catalog on a whim; you pull it every night. And almost everyone starts the same way: a cron line on a box somewhere that curls an endpoint and hopes.

That works right up until it doesn't. The box reboots during your window and the run just never happens. Two runs overlap because last night's scrape was slow, and now you're paying twice and racing yourself. You run out of budget mid-month and every run fails silently, so you find out three days later when the dashboard is empty. Cron is a timer, not a system, and recurring scraping needs a system.

So we built scheduling into ScrapeNest. You give us a cron expression and a timezone; we run the job on that cadence and hand you a full history of what happened. Here's what we decided, and why.

A schedule is just a job that repeats

The most important design decision is the one you can't see: a scheduled run is a normal job. When your schedule fires, it creates exactly the same job you would have submitted by hand - same engine, same options, same artifacts, same webhooks. It bills the same way too: credits by engine weight (Light 1, Standard 5, Stealth 30), charged only when the job actually delivers content.
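To make the billing rule concrete, here is a minimal sketch of "engine weight, charged only on delivery". The weights come from the paragraph above; the lowercase engine keys for Standard and Stealth, the `credits_charged` helper, and the sample day of runs are illustrative assumptions, not ScrapeNest's actual billing code.

```python
# Credit weights per engine, as quoted above (Light 1, Standard 5, Stealth 30).
# The lowercase keys are assumed identifiers for illustration.
ENGINE_CREDITS = {"light": 1, "standard": 5, "stealth": 30}


def credits_charged(engine: str, delivered: bool) -> int:
    """Credits for one job: the engine weight if content was delivered, else 0."""
    return ENGINE_CREDITS[engine] if delivered else 0


# Hypothetical day of hourly Standard runs where 2 of the 24 delivered nothing.
runs = [("standard", True)] * 22 + [("standard", False)] * 2
total = sum(credits_charged(engine, delivered) for engine, delivered in runs)
print(total)  # 22 delivered runs x 5 credits = 110
```

The same function applies whether the job was submitted by hand or minted by a schedule, which is the point of treating a schedule as a trigger.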

That means there is nothing new to learn and nothing new to reconcile. Your existing job.completed webhooks fire for scheduled runs. Your usage analytics already count them. The artifact you get from a 3 a.m. scheduled crawl is byte-for-byte the artifact you'd get from a manual one. Scheduling is a trigger, not a separate product.

Exactly once per window, even when things go wrong

Under the hood, schedules run on Temporal's scheduling engine, not a loop that sleeps and prays. That buys us the guarantees that are annoying to build yourself:

  • Missed windows are handled deliberately. If our scheduler is briefly unavailable when your window arrives, the run isn't silently dropped; it's reconciled according to well-defined catch-up semantics.
  • Overlap is a policy, not an accident. If a run is still going when the next one is due, you choose: skip it, queue exactly one, or allow the overlap. The default is skip, because that's what you almost always want and never remember to enforce.
  • Every fire is recorded. Not just the successes. Each run shows up in the history as it happened.
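The three overlap choices can be sketched as a small decision function. This is an illustration of the behavior described in the list, not the production scheduler: the `OverlapPolicy` enum, its member names, and `on_window` are hypothetical. Temporal's own schedule overlap policies include Skip, BufferOne, and AllowAll, but whether our three options map onto those one-to-one is an assumption here.

```python
from enum import Enum


class OverlapPolicy(Enum):
    SKIP = "skip"            # drop the new window (the default described above)
    QUEUE_ONE = "queue_one"  # hold at most one pending run
    ALLOW = "allow"          # start a concurrent run


def on_window(policy: OverlapPolicy, running: bool, queued: bool) -> str:
    """Return what to do when a window arrives: 'start', 'queue', or 'skip'."""
    # Nothing in flight, or overlap explicitly allowed: start right away.
    if not running or policy is OverlapPolicy.ALLOW:
        return "start"
    # A run is in flight: queue the new one only if the single slot is free.
    if policy is OverlapPolicy.QUEUE_ONE and not queued:
        return "queue"
    return "skip"


# Previous run still going, nothing queued yet:
for policy in OverlapPolicy:
    print(policy.value, "->", on_window(policy, running=True, queued=False))
```

Note that "queue exactly one" degrades to a skip once the slot is taken, so a slow run can never build an unbounded backlog.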

Running out of budget shouldn't mean failing loudly

Here's the failure mode we cared about most. You've set an hourly schedule, and halfway through the month you exhaust your credits. What should happen?

The wrong answer is what cron gives you: the job runs, fails, and maybe pages someone. The run buys you nothing useful, and the noise costs you attention.

Our answer: the run is skipped, not failed. It costs nothing, it's recorded in the history as skipped_quota, and if you want to know about it, there's a dedicated schedule.run_skipped webhook. A skipped run is a normal, expected thing - a schedule that quietly waits for next month's credits instead of throwing errors at you is behaving correctly. The same applies if you downgrade to a plan that no longer includes the engine a schedule uses: it pauses itself and tells you, rather than firing into a wall.
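In code terms, the check happens before a job is ever created. The sketch below is a simplified model of that decision, not the real implementation: `resolve_fire`, its parameters, and the `"run"` and `"paused"` labels are hypothetical, and comparing the balance against the engine's credit weight is an assumption. Only `skipped_quota` is the status name that actually appears in the history.

```python
def resolve_fire(engine: str, engine_cost: int, credits_left: int,
                 plan_engines: set[str]) -> str:
    """Decide the outcome of one schedule fire before any job is created."""
    # The plan no longer includes this engine: the schedule pauses itself.
    if engine not in plan_engines:
        return "paused"
    # Not enough credits for this engine: record a skip, not a failure.
    if credits_left < engine_cost:
        return "skipped_quota"
    return "run"


# Hypothetical plan that includes Light and Standard but not Stealth.
plan = {"light", "standard"}
print(resolve_fire("standard", 5, credits_left=120, plan_engines=plan))  # run
print(resolve_fire("standard", 5, credits_left=3, plan_engines=plan))    # skipped_quota
print(resolve_fire("stealth", 30, credits_left=120, plan_engines=plan))  # paused
```

Because no job is created in the skipped and paused cases, nothing is billed and nothing shows up as a failed job.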

What it looks like

Creating a schedule is one call:

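from scrapenest import ScrapeNestClient

# Authenticate with your API key (elided here).
client = ScrapeNestClient(api_key="sn_live_...")

# Create an hourly schedule that scrapes the pricing page.
client.schedules.create(
    name="hourly-prices",
    cron="0 * * * *",          # minute 0 of every hour
    timezone="Europe/Paris",   # the cron expression is evaluated in this timezone
    job_type="light",          # Light engine
    target_url="https://example.com/pricing",
)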

From there you can pause and resume without losing the definition, edit the cadence, and page through the run history to see exactly which fires minted a job and which were skipped. The same controls live in the console under Schedules, with one-click pause and a run log per schedule.
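The `timezone` argument in the call above matters most for daily schedules. As a small illustration using only Python's standard library, here is how a hypothetical nightly cron of `0 3 * * *` in Europe/Paris lands on different UTC instants in winter and summer. The `as_utc` helper and the dates are chosen for the example and are not part of the ScrapeNest client.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

PARIS = ZoneInfo("Europe/Paris")


def as_utc(year: int, month: int, day: int, hour: int, tz: ZoneInfo) -> datetime:
    """Convert a local wall-clock hour in tz to the equivalent UTC instant."""
    local = datetime(year, month, day, hour, tzinfo=tz)
    return local.astimezone(timezone.utc)


winter = as_utc(2025, 1, 15, 3, PARIS)  # CET, UTC+1
summer = as_utc(2025, 7, 15, 3, PARIS)  # CEST, UTC+2
print(winter.hour, summer.hour)  # 2 1
```

A cron line on a UTC server would need editing twice a year to keep a 3 a.m. local run; a schedule that carries its own timezone does not.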

Scheduling is available on Starter and up. The maximum frequency and the number of schedules you can keep scale with the plan, mostly as a guardrail: a per-minute schedule on the most expensive engine adds up fast, and we'd rather make that a deliberate choice than a surprise invoice.
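For a sense of scale, here is a back-of-the-envelope ceiling on what a cadence can cost. It assumes a 30-day month and that every run delivers content (runs that deliver nothing are not charged); `monthly_credit_ceiling` is a hypothetical helper, and the per-job weights are the Light and Stealth figures quoted earlier.

```python
def monthly_credit_ceiling(fires_per_hour: int, credits_per_job: int, days: int = 30) -> int:
    """Upper bound on credits per month if every scheduled run delivers content."""
    return fires_per_hour * 24 * days * credits_per_job


print(monthly_credit_ceiling(60, 30))  # per-minute Stealth: 1296000
print(monthly_credit_ceiling(1, 1))    # hourly Light: 720
```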

The scheduling docs have the full cron reference, overlap policies, and plan limits. If you've been running a scraping cron job you don't trust, this is the part you get to delete.