2026-02-13

Cron reliability for OpenClaw assistants: timeouts, retries, idempotency, and best-effort delivery on Hetzner

Make OpenClaw cron workflows reliable on Hetzner with bounded timeouts, selective retries, idempotent side effects, and explicit strict vs best-effort delivery policies.

Want this set up for you?
Basic Setup is £249 (24–48h). Email alex@clawsetup.co.uk.

Abstract: Cron is where OpenClaw either becomes genuinely useful or quietly unreliable. On a Hetzner VPS, reminders, follow-ups, and recurring checks all depend on the Gateway process, so reliability comes from service supervision and job design, not schedule syntax alone. This deep-dive shows practical patterns that SetupClaw applies in Basic Setup: bounded timeouts, controlled retries, idempotent side effects, and explicit best-effort delivery rules.

Most people think cron reliability is about writing a good expression and moving on. I used to think that too. But when scheduled automations fail in production, the root cause is usually not the schedule string. It is runtime behaviour: stuck tasks, duplicate side effects, missing observability, and assumptions about uptime that were never true.

That is exactly why this topic matters for SetupClaw. If the assistant is supposed to run while you sleep, cron is not a side feature. It is the backbone. And backbones need boring, explicit engineering.

Cron jobs run inside your Gateway, so uptime is step zero

In current OpenClaw architecture, cron jobs persist on disk and execute inside the Gateway process.

If the Gateway is down, cron does not run. If the host is rebooting, cron does not run. So before tuning retries or schedules, verify supervision and uptime posture first. On Hetzner, that usually means systemd health, restart policy, and a known recovery path after reboot.

A lot of “cron is broken” incidents are really “the service was not alive when the schedule hit.”
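
To make step zero concrete, here is a minimal Python sketch of that verification, assuming the Gateway runs under systemd. The unit name openclaw-gateway.service is an assumption; substitute whatever your install actually uses.

#!/usr/bin/env python3
"""Minimal supervision check, assuming the Gateway runs as a systemd
unit. The unit name below is an assumption; substitute your own."""
import subprocess
import sys

UNIT = "openclaw-gateway.service"  # hypothetical unit name

def unit_is_active(unit: str) -> bool:
    # `systemctl is-active` exits 0 only when the unit is active.
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", unit], check=False
    )
    return result.returncode == 0

def unit_is_enabled(unit: str) -> bool:
    # `systemctl is-enabled` exits 0 when the unit starts on boot.
    result = subprocess.run(
        ["systemctl", "is-enabled", "--quiet", unit], check=False
    )
    return result.returncode == 0

if __name__ == "__main__":
    active = unit_is_active(UNIT)
    enabled = unit_is_enabled(UNIT)
    print(f"{UNIT}: active={active} enabled-on-boot={enabled}")
    # Fail loudly if either condition does not hold: cron cannot be
    # trusted until the process that runs it survives a reboot.
    sys.exit(0 if (active and enabled) else 1)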

Choose schedule type by failure impact, not by habit

One-shot “at” jobs are useful for low-stakes reminders. But if a workflow is recurring or business-critical, prefer recurring schedules (“every” or “cron”) and design the task to tolerate missed windows.

The hidden problem with one-shot thinking is brittleness. If something goes wrong at exactly the scheduled time, the task may be gone with no safe catch-up path. Recurring schedules plus idempotent logic give you a chance to recover without duplicating impact.

So the real decision is not “which syntax looks cleaner.” It is “how do we recover safely when this run does not happen exactly on time?”
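
Here is a minimal sketch of that recovery logic in Python, independent of any particular scheduler API. The state-file path and the one-hour interval are assumptions for illustration.

"""Missed-window recovery for a recurring job: a guard that the job
itself runs on every tick. Path and interval are illustrative."""
import json
import time
from pathlib import Path

STATE = Path("/var/lib/assistant/last_success.json")  # hypothetical path
INTERVAL_S = 3600  # job is expected to succeed at least hourly

def should_run_now() -> bool:
    """True if the current window has not yet succeeded.

    Because the schedule recurs, a run missed while the Gateway was
    down is picked up by the next tick instead of being lost.
    """
    if not STATE.exists():
        return True  # never succeeded: first run or fresh host
    last = json.loads(STATE.read_text())["ts"]
    return (time.time() - last) >= INTERVAL_S

def record_success() -> None:
    STATE.parent.mkdir(parents=True, exist_ok=True)
    STATE.write_text(json.dumps({"ts": time.time()}))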

Timeouts should protect scheduler capacity

Long-running tasks that hang instead of failing are dangerous because they block resources quietly. Practical reliability means setting per-run timeouts that fail fast when a task is stuck.

This is especially important when teams mix lightweight reminder jobs with heavy analysis jobs. If a complex job needs more time, isolate it in a separate session/sub-agent flow and keep reminder paths short and predictable.

Timeouts are not only about failure. They are about protecting scheduler capacity so one stuck run does not starve other work.
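
A minimal sketch of a bounded per-run timeout in Python, running each job as a child process so a stuck run can actually be killed. The budgets below are illustrative, not OpenClaw defaults.

"""Bounded per-run timeout: run each job as a subprocess so the
scheduler slot is always released. Budgets are illustrative."""
import subprocess

# Short budget for reminder paths, longer for heavy analysis jobs.
TIMEOUTS_S = {"reminder": 30, "analysis": 600}

def run_job(kind: str, argv: list[str]) -> bool:
    budget = TIMEOUTS_S[kind]
    try:
        subprocess.run(argv, timeout=budget, check=True)
        return True
    except subprocess.TimeoutExpired:
        # Fail fast and release the slot; subprocess.run has already
        # killed the child when the timeout fired.
        print(f"job kind={kind} exceeded {budget}s, killed")
        return False
    except subprocess.CalledProcessError as exc:
        print(f"job kind={kind} failed with exit code {exc.returncode}")
        return False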

Retries are useful only when they are selective

A retry policy without classification creates noise. Retry transient failures, such as network hiccups and temporary provider errors, and avoid retrying deterministic failures such as configuration mistakes.

Use bounded retries with backoff and a maximum attempt count. Then make sure each retry remains idempotent; otherwise you are not adding resilience, you are adding duplicate side effects.

People worry that retries will spam users. That happens when retry and idempotency are designed separately. They should be designed together.
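
A minimal sketch of that combined design in Python. The split into transient and permanent error classes is an assumption; map your real exception types onto the two classes.

"""Selective retry with exponential backoff and a hard attempt cap."""
import time

class TransientError(Exception):
    """Network hiccup, timeout, temporary provider error: worth retrying."""

class PermanentError(Exception):
    """Deterministic failure (bad config, auth): retrying cannot help."""

def run_with_retries(task, max_attempts: int = 3, base_delay_s: float = 2.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except PermanentError:
            # Do not retry deterministic mistakes; surface them at once.
            raise
        except TransientError as exc:
            if attempt == max_attempts:
                raise
            delay = base_delay_s * (2 ** (attempt - 1))
            # Log every retry reason so the weekly review has data.
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)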

Idempotency is the control that prevents “reliable damage”

If you remember one idea from this article, use this one: every scheduled side effect should be safe to run more than once.

In practice, that means a run marker, state key, checksum guard, or destination-specific dedupe token. A common pattern is a deterministic key per job window (for example job-id + UTC date/hour) to suppress duplicate sends across retries and restarts.
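
A minimal sketch of the window-key pattern in Python, using a marker directory on disk. The marker path is an assumption; a database row or state store works the same way.

"""Deterministic dedupe key per job window: one side effect per
job-id + UTC hour, across retries and restarts."""
from datetime import datetime, timezone
from pathlib import Path

MARKERS = Path("/var/lib/assistant/sent-markers")  # hypothetical path

def window_key(job_id: str) -> str:
    # Same key for every retry and restart inside one UTC hour.
    now = datetime.now(timezone.utc)
    return f"{job_id}-{now:%Y-%m-%dT%H}"

def send_once(job_id: str, send) -> bool:
    """Run `send()` at most once per job window."""
    MARKERS.mkdir(parents=True, exist_ok=True)
    marker = MARKERS / window_key(job_id)
    if marker.exists():
        return False  # this window already produced its side effect
    send()
    marker.touch()  # mark success only after the side effect lands
    return True

Note the ordering: the marker is written only after the side effect lands, so a crash in between can at worst cause one duplicate, never a silent drop.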

Without idempotency, reliability tuning can backfire. The system becomes very good at repeating mistakes.

Best-effort delivery is a policy, not a fallback

Some notifications are important but non-critical. In those cases, best-effort delivery can be the correct policy, as long as you still record outcome and expose missed deliveries in logs.

That is the key nuance. Best-effort should not mean invisible failure. It means the overall operational task can complete even if a notification path transiently fails, while preserving auditability.

For Telegram workflows, reliability also includes destination correctness. A message delivered to the wrong chat is not reliable delivery.
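
A minimal sketch of a best-effort send in Python that still preserves auditability and destination correctness. The log path and the allow-listed chat id are assumptions.

"""Best-effort delivery: never raise, but always record the outcome,
and refuse destinations that are not explicitly allow-listed."""
import json
import time
from pathlib import Path

DELIVERY_LOG = Path("/var/log/assistant/deliveries.jsonl")  # hypothetical
ALLOWED_CHAT_IDS = {"-1001234567890"}  # hypothetical pinned destination

def notify_best_effort(chat_id: str, text: str, send) -> None:
    outcome = {"ts": time.time(), "chat_id": chat_id, "ok": False}
    if chat_id not in ALLOWED_CHAT_IDS:
        # Wrong destination is a failure, not a delivery.
        outcome["error"] = "destination not allow-listed"
    else:
        try:
            send(chat_id, text)
            outcome["ok"] = True
        except Exception as exc:  # best-effort: swallow, but record
            outcome["error"] = repr(exc)
    DELIVERY_LOG.parent.mkdir(parents=True, exist_ok=True)
    with DELIVERY_LOG.open("a") as fh:
        fh.write(json.dumps(outcome) + "\n")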

Cron and heartbeat are complementary, not interchangeable

Cron gives precise timing contracts. Heartbeat gives context-aware periodic awareness. Trying to use one for everything usually creates either brittle timing or noisy checks.

A practical split is simple: use cron for exact scheduled obligations, and heartbeat for broader situational checks that can tolerate batching.

This keeps costs and alert noise under control while preserving predictable obligations.

Why this still needs PR-only and memory discipline

Reliable scheduling does not automatically make changes safe. If scheduled tasks touch code or configuration, they should still flow through PR-only controls and branch protection.

And reliability improves when operational constraints are captured in memory. Retry limits, escalation paths, strict versus best-effort job classes: these should be retrievable context, not operator folklore.

Otherwise each incident starts from scratch.

Practical implementation steps

Step one: validate gateway supervision before touching cron config

Confirm the Gateway process is supervised, persistent across reboot, and observable through status and logs.

Step two: classify jobs by criticality

Mark each job as strict or best-effort, then define destination, timeout, and retry policy accordingly.
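
One way to make that classification explicit is a small policy record per job; here is a minimal Python sketch with illustrative values.

"""One explicit policy record per job. Field values are illustrative."""
from dataclasses import dataclass

@dataclass(frozen=True)
class JobPolicy:
    job_id: str
    strict: bool          # strict jobs must alert on failure
    destination: str      # pinned delivery target
    timeout_s: int
    max_attempts: int

POLICIES = [
    JobPolicy("morning-brief", strict=True,
              destination="-1001234567890", timeout_s=60, max_attempts=3),
    JobPolicy("weekly-digest", strict=False,
              destination="-1001234567890", timeout_s=300, max_attempts=1),
]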

Step three: set bounded timeouts per job type

Keep reminder paths short. Move heavy analysis into isolated flows so core reminders are not blocked.

Step four: implement selective retries with backoff

Retry only transient failures. Set a clear max attempts value. Log every retry reason.

Step five: enforce idempotency at side-effect boundaries

Use run markers or dedupe keys so retries and restarts cannot duplicate user-visible outputs.

Step six: add a weekly reliability review

Review failed runs, timeout frequency, delivery outcomes, and drift in destination policies before they become incidents.
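
A minimal sketch of what that review can read from, assuming the JSONL delivery log written by the best-effort example above.

"""Weekly review: summarize delivery outcomes and top failure reasons."""
import json
from collections import Counter
from pathlib import Path

LOG = Path("/var/log/assistant/deliveries.jsonl")  # hypothetical path

def weekly_summary() -> None:
    if not LOG.exists():
        print("no delivery log found")
        return
    outcomes = Counter()
    errors = Counter()
    for line in LOG.read_text().splitlines():
        rec = json.loads(line)
        outcomes["ok" if rec["ok"] else "failed"] += 1
        if not rec["ok"]:
            errors[rec.get("error", "unknown")] += 1
    print(f"deliveries ok={outcomes['ok']} failed={outcomes['failed']}")
    for err, n in errors.most_common(5):
        print(f"  {n}x {err}")

if __name__ == "__main__":
    weekly_summary()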

This is the kind of operations-first baseline SetupClaw Basic Setup (£249) is built for: a scheduler that stays predictable under real conditions, with explicit policies for retries, idempotency, and delivery behaviour, and a handoff SOP your team can operate without guesswork.

Want this set up for you?
Basic Setup is £249 (24–48h). Email alex@clawsetup.co.uk.