Browser automation reliability for OpenClaw: handling CAPTCHA, MFA prompts, and safe fallbacks on Hetzner

A practical SetupClaw browser reliability baseline: handle CAPTCHA/MFA with safe checkpoints, use execute/assist modes, and apply bounded retries plus secure escalation.

Abstract: Browser automation often fails in production for predictable reasons (CAPTCHA challenges, MFA prompts, session expiry, and UI changes), not because the assistant is inherently unreliable. The practical fix is not “retry harder,” but a fallback model that switches safely from execute mode to assist mode when hard gates appear. This guide explains a SetupClaw-ready approach for reliable browser workflows on Hetzner with clear escalation, bounded retries, and human checkpoints.

If your browser automation works perfectly in tests and fails in production, you are not alone. And you are probably not looking at a random bug.

In many real systems, common failures come from anti-bot controls, authentication checkpoints, or UI drift. Teams often respond by increasing retries or removing safeguards. That usually makes outcomes worse, not better.

A stronger approach is to treat these failure classes as expected and design safe fallback behaviour from day one.

Start by classifying failure types before writing fallback logic

Reliability improves quickly when you stop treating all failures the same.

For OpenClaw browser workflows, practical failure classes are: anti-bot challenge, MFA interruption, session expiry, selector drift, and upstream UI change. Each class needs a specific response path.

If your response to every failure is “retry,” you are mixing transient problems with deterministic gates.
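
As a rough illustration of that classification step, the sketch below maps each failure class named above to one response path. The detection heuristics, the extra TRANSIENT bucket, and the response path names are all assumptions made for the example; they are not OpenClaw settings or APIs.

```python
from enum import Enum


class FailureClass(Enum):
    ANTI_BOT_CHALLENGE = "anti_bot_challenge"
    MFA_INTERRUPTION = "mfa_interruption"
    SESSION_EXPIRY = "session_expiry"
    SELECTOR_DRIFT = "selector_drift"
    UPSTREAM_UI_CHANGE = "upstream_ui_change"
    TRANSIENT = "transient"  # hypothetical extra bucket for timeouts and slow loads


# Hypothetical response paths; the names are illustrative only.
RESPONSE_PATH = {
    FailureClass.ANTI_BOT_CHALLENGE: "pause_and_notify_operator",
    FailureClass.MFA_INTERRUPTION: "pause_and_request_confirmation",
    FailureClass.SESSION_EXPIRY: "reauthenticate_then_resume",
    FailureClass.SELECTOR_DRIFT: "stop_and_escalate",
    FailureClass.UPSTREAM_UI_CHANGE: "stop_and_escalate",
    FailureClass.TRANSIENT: "bounded_retry",
}


def classify(page_text: str, timed_out: bool = False,
             selector_missing: bool = False) -> FailureClass:
    """Very rough text heuristics; a real detector would be site-specific."""
    text = page_text.lower()
    if "captcha" in text or "verify you are human" in text:
        return FailureClass.ANTI_BOT_CHALLENGE
    if "verification code" in text or "two-factor" in text:
        return FailureClass.MFA_INTERRUPTION
    if "session expired" in text or "sign in again" in text:
        return FailureClass.SESSION_EXPIRY
    if selector_missing:
        return FailureClass.SELECTOR_DRIFT
    if timed_out:
        return FailureClass.TRANSIENT
    # Unknown failures default to the conservative path: stop and escalate.
    return FailureClass.UPSTREAM_UI_CHANGE
```

Only one class in this table ever reaches a retry path. Everything else either pauses for a human or stops.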

CAPTCHA should be a manual checkpoint, not a retry target

CAPTCHA is usually an intentional gate, not a timing glitch.

Repeated automated retries can increase blocking and reduce reliability for future attempts. A safer policy is to treat CAPTCHA as a controlled human checkpoint: notify the operator, pause execution, and resume only after confirmation.

That keeps the workflow predictable and avoids escalating anti-bot controls.

MFA should stay enabled, with human-in-the-loop design

When MFA interrupts flow, some teams are tempted to weaken auth controls “for reliability.” That trade-off is expensive.

A better model is checkpointed automation: pause at the MFA step, request operator confirmation through a trusted channel, verify the authenticated session state, then continue.

You keep security and reliability together instead of choosing one against the other.
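
A minimal sketch of that checkpoint might look like the following. The three callables are hypothetical hooks you would wire to your own notification channel and browser session; none of them is an OpenClaw API.

```python
from typing import Callable


def run_mfa_checkpoint(
    notify_operator: Callable[[str], None],
    wait_for_confirmation: Callable[[], bool],
    session_is_authenticated: Callable[[], bool],
) -> str:
    """Pause at an MFA prompt and continue only after a verified human handoff."""
    notify_operator("MFA prompt reached; automation paused until you confirm.")
    if not wait_for_confirmation():
        return "aborted"  # operator declined or the wait timed out
    if not session_is_authenticated():
        return "escalated"  # confirmed, but the session is still not logged in
    return "resumed"


# Usage with stand-in hooks: the operator confirms and the session checks out.
sent = []
outcome = run_mfa_checkpoint(sent.append, lambda: True, lambda: True)
```

The important detail is the second check. Operator confirmation alone is not treated as proof that the session is authenticated.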

Use execute mode and assist mode intentionally

A useful design pattern is two operating modes.

In execute mode, automation performs approved actions end-to-end. In assist mode, triggered by hard gates like CAPTCHA or MFA, automation gathers context, drafts next steps, and asks for human input.

This is graceful degradation. The workflow still helps even when full automation is not possible.
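
One way to picture the mode switch is a workflow object that flips to assist mode at the first hard gate and records a handoff for the operator. The class, field names, and the detect_gate hook are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    EXECUTE = "execute"
    ASSIST = "assist"


HARD_GATES = {"captcha", "mfa"}  # the hard gates named in the text


@dataclass
class Workflow:
    steps: list[str]
    mode: Mode = Mode.EXECUTE
    completed: list[str] = field(default_factory=list)
    handoff: dict = field(default_factory=dict)

    def run(self, detect_gate) -> Mode:
        """detect_gate(step) returns a gate name or None (hypothetical hook)."""
        for i, step in enumerate(self.steps):
            gate = detect_gate(step)
            if gate in HARD_GATES:
                # Switch to assist mode: gather context instead of pushing on.
                self.mode = Mode.ASSIST
                self.handoff = {
                    "gate": gate,
                    "blocked_step": step,
                    "completed": list(self.completed),
                    "remaining": self.steps[i:],
                }
                return self.mode
            self.completed.append(step)
        return self.mode
```

The handoff dictionary is what makes assist mode useful: the operator sees what was done, what is blocked, and what remains.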

Route browser actions by risk level

Not all request sources should have identical automation power.

Group or public route requests should use constrained automation mode. Privileged browser actions should be reserved for trusted private routes with explicit approvals.

High-risk examples that should require private route plus approval: payment submits, account/permission changes, credential resets, and destructive operations.

This reduces the chance that noisy channel context triggers high-impact browser behaviour.
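
The routing rule can be sketched as a small policy function. The route labels, action names, and decision strings below are hypothetical; the only parts taken from the text are that high-risk actions need a private route plus approval and that shared routes run constrained.

```python
from dataclasses import dataclass

# High-risk categories from the text, under illustrative names.
HIGH_RISK_ACTIONS = {
    "payment_submit",
    "permission_change",
    "credential_reset",
    "destructive_operation",
}


@dataclass(frozen=True)
class Request:
    action: str
    route: str  # "private", "group", or "public" (hypothetical labels)
    approved: bool = False


def decide(req: Request) -> str:
    if req.action in HIGH_RISK_ACTIONS:
        # Both conditions are required: trusted private route AND explicit approval.
        if req.route == "private" and req.approved:
            return "allow"
        return "deny"
    # Low-risk actions run, but shared routes stay in constrained mode.
    return "allow" if req.route == "private" else "allow_constrained"
```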

Verify state before and after important actions

Many expensive failures are partial failures that looked successful at first.

Add pre-action and post-action checks for session state, expected page context, and completion markers. If state checks fail, stop and escalate rather than continuing blindly.

State assertions are one of the cheapest reliability controls you can add.
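
A sketch of such a guard, assuming each check is a plain callable returning True or False. The function and exception names are invented for this example.

```python
from typing import Callable


class StateCheckFailed(RuntimeError):
    """Raised so the caller stops and escalates instead of continuing."""


def guarded_action(
    action: Callable[[], None],
    preconditions: dict[str, Callable[[], bool]],
    postconditions: dict[str, Callable[[], bool]],
) -> None:
    # Pre-action checks: session state and expected page context.
    for name, check in preconditions.items():
        if not check():
            raise StateCheckFailed(f"precondition failed: {name}")
    action()
    # Post-action checks: completion markers that prove the action landed.
    for name, check in postconditions.items():
        if not check():
            raise StateCheckFailed(f"postcondition failed: {name}")
```

Because a failed precondition raises before the action runs, a stale session never reaches the submit step. A failed postcondition surfaces the partial failure immediately instead of several steps later.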

Use bounded retries for transient issues only

Retries are useful when the failure class is transient, such as temporary latency or occasional loading instability.

For deterministic gates such as CAPTCHA, MFA, and policy blocks, retries should be minimal or none, followed by immediate escalation. This avoids wasted runs, extra cost, and escalating anti-bot defences.

Define concrete stop conditions up front: maximum retry count, maximum elapsed time, and immediate escalation when deterministic gates are detected.

Reliability is not about trying forever. It is about stopping at the right point.
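
The stop conditions above can be sketched as a small retry wrapper. The exception classes, default limits, and exponential backoff are assumptions chosen for the example, not recommended values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Temporary latency or loading instability; safe to retry."""


class DeterministicGate(Exception):
    """CAPTCHA, MFA, or policy block; retrying will not help."""


def run_with_bounded_retries(
    operation: Callable[[], T],
    max_attempts: int = 3,
    max_elapsed_s: float = 60.0,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            out_of_time = time.monotonic() - start >= max_elapsed_s
            if attempt == max_attempts or out_of_time:
                raise  # stop condition reached; the caller escalates
            sleep(base_delay_s * 2 ** (attempt - 1))  # exponential backoff
        # DeterministicGate is deliberately not caught: it propagates on the
        # first occurrence so escalation is immediate.
    raise AssertionError("unreachable")
```

Only TransientError is retried. Anything else, including a detected gate, leaves the loop on the first attempt.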

Cron workflows need skip-with-alert behaviour

Scheduled browser jobs can become noisy quickly when hard gates appear.

Instead of infinite retries, use preflight checks and skip-with-alert behaviour for deterministic blocks. Notify operators with enough context to decide the next action.

This protects schedule reliability and keeps alert noise manageable.
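
For a scheduled job, skip-with-alert could be sketched like this. The preflight, job, and alert hooks, along with the alert fields, are hypothetical placeholders for whatever your scheduler and notification path provide.

```python
from typing import Callable, Optional


def run_scheduled_job(
    job_name: str,
    preflight: Callable[[], Optional[str]],
    job: Callable[[], None],
    alert: Callable[[dict], None],
) -> str:
    """Run one scheduled browser job, skipping cleanly on a deterministic block."""
    blocker = preflight()  # None when clear, or a gate name such as "captcha"
    if blocker is not None:
        # One alert with context, then skip this run. No retry loop.
        alert({
            "job": job_name,
            "status": "skipped",
            "blocker": blocker,
            "next_step": "operator review before the next scheduled run",
        })
        return "skipped"
    job()
    return "completed"
```

Each blocked run produces exactly one alert, which keeps a stuck site from flooding the operator channel.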

Use Telegram for escalation, but keep policy strict

Telegram is useful for operator checkpoints and confirmations during browser incidents.

Keep allowlist and mention policies strict while using it for escalation. Do not widen channel permissions during incident pressure.

The escalation path should help recovery without expanding attack surface.
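
As a loose sketch of what "strict" can mean for incoming confirmations, the check below accepts a confirmation only from an allowlisted sender and, outside private chats, only when the bot is explicitly mentioned. The user ID, bot username, and "confirm" keyword are made up for the example, and this is not how OpenClaw's own channel policy is configured; only the "private" chat type label matches Telegram's Bot API.

```python
ALLOWED_OPERATOR_IDS = {111111111}  # hypothetical Telegram user IDs
BOT_MENTION = "@setupclaw_bot"      # hypothetical bot username


def accept_confirmation(sender_id: int, chat_type: str, text: str) -> bool:
    """Return True only for an allowlisted operator sending a valid confirmation."""
    if sender_id not in ALLOWED_OPERATOR_IDS:
        return False  # allowlist is never widened at runtime
    if chat_type != "private" and BOT_MENTION not in text:
        return False  # group messages must mention the bot explicitly
    return text.replace(BOT_MENTION, "").strip().lower() == "confirm"
```

The allowlist is a module-level constant on purpose. Changing it means a code change and a review, not a quick edit during an incident.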

Keep fallback rules versioned and reviewed

Fallback logic changes are production behaviour changes.

Handle them through PR-reviewed updates so policy drift is tracked and reversible. Ad hoc edits after incidents often fix one case while breaking another silently.

Reliability improves when fallback decisions are auditable.

Store challenge patterns in durable memory

Recurring anti-bot and MFA patterns are operational knowledge.

Capture site-specific challenge behaviour, chosen fallback action, and resolution outcomes in durable runbooks/memory. This reduces mean time to recovery for repeated issues.

Without this, every incident starts from scratch.
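
A very small version of that memory is an append-only JSON Lines file. The file layout and field names below are assumptions for illustration; a runbook page or an existing memory store would serve the same purpose.

```python
import json
from pathlib import Path


def record_challenge(path: Path, site: str, challenge: str,
                     fallback: str, outcome: str) -> None:
    """Append one incident record as a single JSON line."""
    entry = {"site": site, "challenge": challenge,
             "fallback": fallback, "outcome": outcome}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def known_patterns(path: Path, site: str) -> list[dict]:
    """Return earlier records for a site so the next incident starts with context."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        entries = [json.loads(line) for line in f if line.strip()]
    return [e for e in entries if e["site"] == site]
```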

Practical implementation steps

Step one: create a reliability matrix per workflow

For each site and flow, define expected challenge type, fallback action, approval owner, and retry policy.
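
The matrix can live in whatever format your team reviews most easily. As one possible shape, the sketch below holds it as typed rows; the sites, flows, owners, and values are invented examples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MatrixRow:
    site: str
    flow: str
    expected_challenge: str
    fallback_action: str
    approval_owner: str
    max_retries: int


# Invented example rows; replace with your own sites and owners.
RELIABILITY_MATRIX = [
    MatrixRow("billing.example.com", "download_invoice", "mfa",
              "assist_mode_checkpoint", "finance-ops", 0),
    MatrixRow("status.example.com", "read_dashboard", "slow_load",
              "bounded_retry", "platform-ops", 3),
]


def lookup(site: str, flow: str) -> Optional[MatrixRow]:
    """Return the matrix row for a site and flow, or None if it is not covered."""
    for row in RELIABILITY_MATRIX:
        if row.site == site and row.flow == flow:
            return row
    return None
```

A lookup that returns None is itself a useful signal: the flow has no agreed fallback yet and should not run unattended.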

Step two: implement execute/assist mode switch

Trigger assist mode automatically on deterministic gates and request a human checkpoint.

Step three: add state verification guards

Check preconditions and postconditions around high-impact browser actions.

Step four: tune retry policy by failure class

Allow bounded retries for transient faults. Escalate early for CAPTCHA/MFA/policy blocks.

Step five: wire safe escalation notifications

Use Telegram notifications for operator decisions while preserving strict channel access policy.

Step six: review incidents and update by PR

Record outcomes, update the fallback matrix, and merge reliability changes through a reviewed workflow.

After CAPTCHA/MFA checkpoint escalation, pass criteria should be explicit:

expected authenticated state present
target page/context validated
intended action result confirmed
no policy boundary violations in logs
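
These four criteria can be made checkable rather than left as a judgement call. The sketch below assumes you can collect each signal as a boolean or a count; the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointResult:
    authenticated: bool
    page_context_valid: bool
    action_confirmed: bool
    policy_violations: int  # count of policy boundary violations found in logs


def failed_criteria(result: CheckpointResult) -> list[str]:
    """Return the names of unmet criteria; an empty list means the run passes."""
    failures = []
    if not result.authenticated:
        failures.append("expected authenticated state present")
    if not result.page_context_valid:
        failures.append("target page/context validated")
    if not result.action_confirmed:
        failures.append("intended action result confirmed")
    if result.policy_violations > 0:
        failures.append("no policy boundary violations in logs")
    return failures
```

Returning the list of failed criteria, rather than a single pass or fail flag, gives the operator something concrete to act on.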

You cannot guarantee fully unattended browser automation on every hostile or fast-changing site. What you can do is make failures predictable, contained, and recoverable, which is exactly what a practical SetupClaw baseline should deliver.