Abstract: On a Hetzner VPS, OpenClaw reliability is mostly a service-supervision problem before it is a model problem. A hardened systemd setup gives you automatic recovery, predictable startup behaviour, cleaner incident triage, and tighter local-state boundaries. This deep-dive explains how SetupClaw treats systemd as the operational backbone for Telegram continuity, cron reliability, and safe day-two maintenance.
A lot of OpenClaw outages look like channel issues at first. Telegram appears flaky, cron jobs seem to skip, and operators assume the model or API provider is at fault. In practice, many of these incidents come from a simpler source: the gateway process is not supervised well enough.
That is why SetupClaw Basic Setup treats systemd hardening as foundational. If OpenClaw is meant to be always-on, “it runs in my shell” is not an operating model. You need restart policy, boot persistence, clear logs, and a runtime user model that limits blast radius when something goes wrong.
Why systemd matters more than tmux for production-like operation
Running OpenClaw inside tmux is fine for testing. It is not enough for unattended reliability.
A tmux session can disappear after reboot, logout, or accidental process termination. A properly configured systemd service gives explicit restart behaviour, service state visibility, and predictable lifecycle commands. That difference is what keeps Telegram control and scheduled automations stable overnight.
The practical rule is simple: if the service needs to survive you, it needs supervision that survives your shell.
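As a concrete sketch of what that supervision can look like, here is a minimal user-level unit file. The unit name, paths, and ExecStart command are assumptions for illustration; your installed OpenClaw version may ship its own service setup, which should take precedence.

```ini
# ~/.config/systemd/user/openclaw.service  (illustrative; adapt paths)
[Unit]
Description=OpenClaw gateway
After=network-online.target
Wants=network-online.target

[Service]
# Replace with the actual gateway start command for your install.
ExecStart=%h/.openclaw/bin/openclaw gateway
Restart=always
RestartSec=5

[Install]
WantedBy=default.target
```

Enabling it with `systemctl --user enable --now openclaw.service` gives you lifecycle commands (`status`, `restart`, `stop`) that survive your shell, which a tmux pane never will.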
Restart policy is a reliability control, not a convenience toggle
The most important setting to verify is restart behaviour. A documented restart policy (commonly Restart=always or on-failure, depending on service behavior) improves automatic recovery from transient failures.
Does this mean a broken config can cause a restart loop? Yes, and that is useful when paired with logs and status checks: a crash loop is visible and diagnosable. A silently dead process in an abandoned shell is worse, because it can look healthy until someone notices missed actions hours later.
For SetupClaw, this directly supports cron and channel continuity. Jobs cannot run and messages cannot flow if the gateway process is not up.
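To make that policy explicit, and to turn a runaway crash loop into a clear failed state, a systemd drop-in can pin the restart behaviour. The numbers below are illustrative tuning values, not OpenClaw defaults:

```ini
# Created via: systemctl --user edit openclaw.service
[Unit]
# Give up after 5 failures within 10 minutes, leaving the unit in a
# visible "failed" state for triage instead of restarting forever.
StartLimitIntervalSec=600
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=10
```

With this in place, `systemctl --user status openclaw.service` answers the first triage question directly: is it running, restarting, or failed.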
Least-privilege runtime is non-negotiable on an internet-connected host
Another recurring mistake is running automation services as root to avoid permission friction. It feels faster during setup, but it expands blast radius across the host.
A safer model is a dedicated non-root account (or user-scoped service), with OpenClaw state under that user’s ~/.openclaw and only the permissions required for intended tasks. This keeps memory/session artefacts scoped and lowers cross-user exposure risk.
Least privilege does not remove all risk. It does make failures smaller and recovery safer.
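If you run the gateway as a system-level service rather than a per-user one, the same idea can be expressed with systemd sandboxing directives. This is a sketch under the assumption of a dedicated `openclaw` account; test each directive against your workload before enabling it, since strict sandboxing can break legitimate file access:

```ini
# Drop-in for a system-level unit running under a dedicated account
[Service]
User=openclaw
Group=openclaw
# Deny privilege escalation and writes outside the service's own state.
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=tmpfs
BindPaths=/home/openclaw/.openclaw
PrivateTmp=yes
```

Even if a task misbehaves, the writable surface is reduced to the service's own state directory rather than the whole host.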
Linger is the hidden setting that explains “it stopped overnight”
On headless VPS setups, user services can stop when the user session ends unless linger is enabled. This is a common source of intermittent availability problems.
SetupClaw workflows typically include doctor/status checks that can detect this condition and guide remediation; verify exact behavior for your installed OpenClaw version.
If your incident pattern is “worked yesterday, dead this morning”, check linger before changing anything else.
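Checking and fixing linger is two commands. The username `openclaw` here is an assumption; substitute the account your service actually runs under:

```shell
# Is linger enabled for the service user? (prints Linger=yes or Linger=no)
loginctl show-user openclaw --property=Linger

# Enable it so user services keep running with no active login session
sudo loginctl enable-linger openclaw
```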
Environment consistency beats interactive shell habits
Another subtle failure class is credential drift. Operators export variables in an interactive shell and assume the service sees them. systemd often does not inherit that environment the way people expect.
A safer baseline is to centralise runtime secrets in a service-visible location, typically ~/.openclaw/.env and managed auth profiles. This keeps tokens and API keys consistent across restart, reboot, and user login state.
Use strict file permissions for .env and auth material, and avoid storing secrets in shell history or ad-hoc scripts.
The result is fewer false alarms where Telegram or provider calls fail simply because the service process restarted without expected environment variables.
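Locking down the env file is a one-time step. A minimal sketch, following the `~/.openclaw/.env` convention described above:

```shell
# Create the state directory and the env file with owner-only permissions.
mkdir -p "$HOME/.openclaw"
touch "$HOME/.openclaw/.env"
chmod 700 "$HOME/.openclaw"
chmod 600 "$HOME/.openclaw/.env"

# Verify: mode 600 means only the owner can read or write the secrets file.
stat -c '%a' "$HOME/.openclaw/.env"
```

A systemd unit can then load it with `EnvironmentFile=%h/.openclaw/.env`, so the service sees the same variables on every start, independent of anyone's interactive shell.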
Logging should be two-layered so triage is fast
When something breaks, you need to answer one question quickly: is this an application issue or a supervisor/runtime issue?
That is why SetupClaw favours two log layers:
- OpenClaw application logs (openclaw logs --follow, control UI)
- systemd/journal logs for the service wrapper and startup failures
This split shortens diagnosis time. You can distinguish “gateway process never started” from “gateway started but hit channel/auth/config errors.” Without that distinction, teams waste time changing the wrong thing.
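In practice the two layers map to two commands. The `openclaw logs` invocation is the one named in this article; the unit name `openclaw.service` is an assumption from the earlier setup sketch:

```shell
# Layer 1: application logs (channel, auth, and config errors live here)
openclaw logs --follow

# Layer 2: supervisor/journal logs (did the process even start?)
journalctl --user -u openclaw.service -e

# Startup failures only, since the current boot:
journalctl --user -u openclaw.service -b --priority=err
```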
Config changes and restart expectations should be documented
Not all changes behave the same way. Some configuration updates reload cleanly; others require a full restart (especially around gateway, discovery, and network paths).
If operators do not know which class a change belongs to, normal reconnect behaviour gets misread as failure. SetupClaw handoff should therefore include explicit restart expectations and a short checklist for post-change validation.
This is a small documentation step that prevents a lot of reactive troubleshooting.
Practical implementation steps on Hetzner
Step one: run OpenClaw as a supervised service
Install or verify the documented user service setup and use service commands for lifecycle operations, not ad hoc shell runs.
Step two: verify restart and boot persistence
Confirm restart policy and confirm service availability after reboot/logout scenarios. Check linger status for user services.
Step three: enforce least-privilege runtime
Run under a dedicated non-root context. Verify state and workspace path ownership are scoped to the service user.
Step four: standardise environment handling
Move credentials to service-visible .env/auth profile locations. Remove hidden dependencies on interactive shell exports.
Step five: validate log visibility and triage ladder
Use a consistent incident path: openclaw gateway status, openclaw status --deep, openclaw doctor, then application and journal logs.
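The ladder above can be captured as a short script every operator runs on any incident. The `openclaw` subcommands are the ones named in this article; the journal command and unit name are assumptions from the service sketch earlier:

```shell
#!/usr/bin/env sh
# Incident triage ladder: cheapest signal first, raw logs last.
openclaw gateway status     # is the gateway process up and reachable?
openclaw status --deep      # deeper health: channels, auth, config
openclaw doctor             # guided checks and remediation hints
openclaw logs --follow      # application-layer logs (Ctrl-C to exit)
# If none of the above is conclusive, check the supervisor layer:
#   journalctl --user -u openclaw.service -e
```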
Step six: document restart-sensitive changes
Include which config areas require restart and what health checks to run immediately after applying changes.
How this fits the wider SetupClaw safety model
systemd hardening and a PR-only change workflow are complementary, not interchangeable. One protects runtime continuity and diagnosability; the other protects repository change governance.
Likewise, service hardening supports Telegram reliability but does not replace channel access policy. It supports cron uptime but does not replace schedule correctness. In other words, this is the backbone layer, not the whole stack.
That is why it belongs in Basic Setup. It gives customers a deployment they can operate, not just launch.