
Self-healing — failures shouldn't wake you up
Hermes is a self-healing infrastructure CLI written in Go (v0.8.0 in my stack). Its philosophy rests on three things: a whitelist of permitted actions, verification after every fix, and learning from recurring failures. The architecture has five stages: detect → diagnose → fix → verify → learn. It runs as a cron job or a webhook responder and persists its history to SQLite/JSON. In my setup it handles autoheal for Kami and for OpenClaw (the engine behind Kaylee). For you, it's a pattern you can adopt with any CLI, or even with bash scripts: the five stages fit any production system, not just AI agents.
90% of failures are the same 10 problems on repeat. Hermes solves them on its own, and wakes you only for something genuinely new.
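The five stages can be sketched in a few dozen lines of Go. This is a minimal illustration of the pattern, not Hermes's actual code: the `Incident`, `Action`, and `Healer` types, the symptom names, and the in-memory history map are all assumptions made for the example (the real tool persists history to SQLite/JSON).

```go
package main

import (
	"errors"
	"fmt"
)

// Incident is what the detect stage hands to the rest of the pipeline.
// (Hypothetical type; field names are assumptions for this sketch.)
type Incident struct {
	Service string
	Symptom string
}

// Action is one permitted repair: Run applies the fix,
// Verify re-checks health after the fix has run.
type Action struct {
	Name   string
	Run    func(Incident) error
	Verify func(Incident) bool
}

// Healer holds the whitelist and a simple in-memory history.
type Healer struct {
	Whitelist map[string]Action // symptom -> the only action allowed for it
	History   map[string]int    // symptom -> number of verified fixes
}

// ErrEscalate marks every outcome that should reach a human.
var ErrEscalate = errors.New("escalate to human")

// Heal runs diagnose -> fix -> verify -> learn for one detected incident.
func (h *Healer) Heal(inc Incident) error {
	// Diagnose: only symptoms with a whitelisted action are handled.
	act, ok := h.Whitelist[inc.Symptom]
	if !ok {
		return fmt.Errorf("%w: no whitelisted action for %q", ErrEscalate, inc.Symptom)
	}
	// Fix: apply the permitted action.
	if err := act.Run(inc); err != nil {
		return fmt.Errorf("%w: %s failed: %v", ErrEscalate, act.Name, err)
	}
	// Verify: a fix that ran is not the same as a fix that worked.
	if !act.Verify(inc) {
		return fmt.Errorf("%w: %s ran but %s is still unhealthy", ErrEscalate, act.Name, inc.Service)
	}
	// Learn: record that this action resolved this symptom.
	h.History[inc.Symptom]++
	return nil
}

func main() {
	healthy := false // stands in for a real health probe
	h := &Healer{
		Whitelist: map[string]Action{
			"container_exited": {
				Name:   "restart_container",
				Run:    func(Incident) error { healthy = true; return nil },
				Verify: func(Incident) bool { return healthy },
			},
		},
		History: map[string]int{},
	}

	for _, symptom := range []string{"container_exited", "disk_full"} {
		err := h.Heal(Incident{Service: "web", Symptom: symptom})
		if errors.Is(err, ErrEscalate) {
			fmt.Println("wake a human:", err)
			continue
		}
		fmt.Println("handled and resolved:", symptom)
	}
}
```

The known symptom is repaired and verified quietly, while the unknown one (`disk_full` has no whitelisted action here) is escalated instead of guessed at. That split is the whole point of the whitelist.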
PagerDuty at 3 AM because a Docker container crashed
Hermes tried a restart, it worked, and it sent a morning email: 'handled and resolved'
Running the same fix script for the fifth time this week
Hermes remembers 'what worked for what' and applies it automatically
PagerDuty, BetterStack, Grafana OnCall — $21-$100+/month per user
Hermes is open and transparent, with repair rules stored as JSON
Monitoring without action = noise
Monitoring + action pipeline = a real solution
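To make "repair rules stored as JSON" and "remembers what worked for what" concrete, here is a rough Go sketch. The rule schema, the `Outcome` record, and the action names are hypothetical, invented for this example; Hermes's real rule format may look different. The idea is only that rules list the permitted actions per symptom, and past verified fixes decide which action is tried first.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Rule is a hypothetical JSON schema: one symptom, its permitted actions.
type Rule struct {
	Symptom string   `json:"symptom"`
	Actions []string `json:"actions"`
}

// Outcome is one hypothetical row of healing history.
type Outcome struct {
	Symptom  string
	Action   string
	Verified bool // true only if the post-fix check passed
}

// rulesJSON stands in for a rules file on disk.
const rulesJSON = `[
  {"symptom": "container_exited", "actions": ["restart_container", "recreate_container"]},
  {"symptom": "disk_full", "actions": ["prune_logs"]}
]`

// LoadRules parses the JSON rules and indexes them by symptom.
func LoadRules(data []byte) (map[string]Rule, error) {
	var rules []Rule
	if err := json.Unmarshal(data, &rules); err != nil {
		return nil, err
	}
	bySymptom := make(map[string]Rule, len(rules))
	for _, r := range rules {
		bySymptom[r.Symptom] = r
	}
	return bySymptom, nil
}

// RankActions orders a rule's permitted actions by how often each one
// produced a verified fix for this symptom. The sort is stable, so
// actions with equal counts keep the order written in the rule.
func RankActions(rule Rule, history []Outcome) []string {
	wins := map[string]int{}
	for _, o := range history {
		if o.Symptom == rule.Symptom && o.Verified {
			wins[o.Action]++
		}
	}
	ranked := append([]string(nil), rule.Actions...) // copy; leave the rule untouched
	sort.SliceStable(ranked, func(i, j int) bool {
		return wins[ranked[i]] > wins[ranked[j]]
	})
	return ranked
}

func main() {
	rules, err := LoadRules([]byte(rulesJSON))
	if err != nil {
		panic(err)
	}
	history := []Outcome{
		{Symptom: "container_exited", Action: "restart_container", Verified: false},
		{Symptom: "container_exited", Action: "recreate_container", Verified: true},
	}
	fmt.Println(RankActions(rules["container_exited"], history))
}
```

Ranking never adds an action that is not already in the rule, so learning only reorders the whitelist and cannot widen it.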
Here's how:
Senior engineer drowning in on-call rotations? A self-healing pattern meaningfully cuts the load within a week.
One or two servers, lots of services. Hermes looks after them even while you're on vacation.
Customers shouldn't have to know about your failures. Hermes makes sure they don't.
A foundational pattern for any agent that acts in the real world — it needs fallback and verification.
Hermes is implemented inside Kaylee + the delegator
The classic book — where these ideas come from
How to build good healthchecks inside containers
The agent that runs Hermes on my VPS
The store behind healing_history — Hermes's memory
Want Hermes inside your infrastructure?
It's a mindset shift — from reactive to autonomous. Ready to see how it's built?
Full-Stack Developer & AI Specialist
Hermes handled 40+ incidents for me in six months — without me even knowing something was wrong. This approach turned the VPS into 'fire and forget.' This guide is based on real failures — I started with a whitelist that was too aggressive and had to rein it back.