Approval recorded in Slack before execution
OnCallZero investigates production incidents, proposes a bounded fix, requests Slack approval when policy requires it, executes the fix automatically once approved, and verifies recovery on real infrastructure.
OnCallZero is an autonomous incident resolution system for Kubernetes. It investigates production incidents, proposes safe recovery actions, executes them after Slack approval, and verifies recovery automatically.
Live incident proof · approval, execution, verification · 53–69s recovery
Evidence from production-like runs on a live Hetzner k3s environment. These artifacts show the incident, the approval recorded in Slack, the executed remediation, and the verification checks that confirmed recovery.
Executed remediation and recovery verification
OnCallZero is not the alerting or paging layer. It starts from monitoring signals, investigates the incident, selects policy-bounded recovery, asks for approval in Slack when required, executes automatically after approval, and verifies the system recovered.
Starts with signals from your monitoring tools, then moves straight into diagnosis. Correlates pod status, rollout history, image versions, and recent events to identify a likely cause and the safest recovery path.
Builds a recovery plan with blast radius assessment, risk scoring, dependency checks, and rollback safety. When policy requires a human gate, approval happens in Slack before any execution begins.
After approval, runs the selected action in your infrastructure, validates each step, and confirms that health checks have recovered. Stops immediately if a policy or verification check fails.
OnCallZero doesn't bypass your controls or stop at notification. It investigates, executes, and verifies within them. The same policies your team enforces manually become bounded automation with a full audit trail.
Every action type has explicit allow, deny, or approval-required rules per namespace, severity, and time window. These rules are checked before execution can proceed.
If the agent can't verify it's safe to proceed, it stops. Unknown states result in no action, not best-effort guesses.
Actions are capped by scope. One deployment, not the namespace. Three pods, not the cluster. Boundaries are hard limits, not suggestions.
Every tool call, every decision, every approval, every verification result — timestamped and logged. Reconstructable for any incident, any time.
| Action | Non-critical | Production | Critical path |
|---|---|---|---|
| Rollback deployment | Auto | Approval | Deny |
| Scale replicas | Auto | Auto | Approval |
| Restart pod | Auto | Approval | Approval |
| Modify resources | Approval | Deny | Deny |
| Delete resource | Deny | Deny | Deny |
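As an illustration, the matrix above could be encoded as a fail-closed lookup. This is a minimal sketch, not OnCallZero's actual configuration format: the action keys, tier names, and the `decide` function are hypothetical, and the namespace, severity, and time-window dimensions are left out for brevity.

```python
from enum import Enum


class Decision(Enum):
    AUTO = "auto"
    APPROVAL = "approval"
    DENY = "deny"


# Rows and columns mirror the policy table; all key names are illustrative.
POLICY = {
    "rollback_deployment": {
        "non_critical": Decision.AUTO,
        "production": Decision.APPROVAL,
        "critical_path": Decision.DENY,
    },
    "scale_replicas": {
        "non_critical": Decision.AUTO,
        "production": Decision.AUTO,
        "critical_path": Decision.APPROVAL,
    },
    "restart_pod": {
        "non_critical": Decision.AUTO,
        "production": Decision.APPROVAL,
        "critical_path": Decision.APPROVAL,
    },
    "modify_resources": {
        "non_critical": Decision.APPROVAL,
        "production": Decision.DENY,
        "critical_path": Decision.DENY,
    },
    "delete_resource": {
        "non_critical": Decision.DENY,
        "production": Decision.DENY,
        "critical_path": Decision.DENY,
    },
}


def decide(action: str, tier: str) -> Decision:
    """Look up the policy decision; anything not listed is denied."""
    return POLICY.get(action, {}).get(tier, Decision.DENY)
```

Defaulting to `DENY` for any action or tier missing from the matrix matches the rule that unknown states result in no action.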
2:14 AM — PagerDuty fires. On-call engineer wakes up, opens laptop, tries to remember which service this is.
2:28 AM — Still investigating. Checking logs, guessing at root cause, Slacking teammates who are asleep.
2:47 AM — Manual rollback. Fingers crossed. No blast radius check. Error rate still climbing.
3:12 AM — Resolved. 58 minutes of downtime. Engineer burned out. Postmortem tomorrow.
2:14 AM — Incident signal ingested. OnCallZero begins investigation immediately. Correlates pods, deploy history, and recent events.
2:14 AM — Recovery action selected. Likely cause identified. Blast radius, rollback path, and policy constraints checked.
2:14 AM — Approved in Slack. Approval opens the gate. The bounded rollback executes automatically in production.
~2:15 AM — Recovery verified. 53–69s in proven runs. Health checks pass, artifacts are captured, and the engineer wakes up to proof instead of manual triage.
No. Every mutating action requires explicit approval in Slack before execution. You see the diagnosis, proposed action, risk level, and blast radius — then you decide. If you do not approve, nothing happens. There is no autonomous execution mode in the current version.
OnCallZero runs real Kubernetes health checks after every action. If verification fails — replicas are not healthy, pods are still crashing, or the rollout did not complete — the incident is marked FAILED, not RESOLVED. The system never claims success without proof. If execution fails, it stops immediately and notifies you.
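The verification rule described above can be sketched as a small function. This is an assumption-laden illustration rather than the product's implementation: `verify_recovery` and its `crashing_pods` parameter are hypothetical, while `readyReplicas`, `updatedReplicas`, and `availableReplicas` are the field names Kubernetes reports in a Deployment's status.

```python
def verify_recovery(desired: int, status: dict, crashing_pods: int) -> str:
    """Return "RESOLVED" only when every check passes, otherwise "FAILED".

    `status` is a Deployment's .status as a dict. Kubernetes omits
    zero-valued counters, so a missing field is treated as 0.
    """
    ready = status.get("readyReplicas", 0)
    updated = status.get("updatedReplicas", 0)
    available = status.get("availableReplicas", 0)

    replicas_healthy = desired > 0 and ready >= desired and available >= desired
    rollout_complete = updated >= desired
    no_crashes = crashing_pods == 0

    if replicas_healthy and rollout_complete and no_crashes:
        return "RESOLVED"
    # Any failed or missing signal is a failure, never a best-effort success.
    return "FAILED"
```

An empty or partial status yields `FAILED`, so the sketch never reports success without evidence.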
Two proven scenarios: stuck rollouts (ProgressDeadlineExceeded, resolved by rolling back to the previous healthy revision) and CrashLoopBackOff (a crash-looping pod, resolved by replacing the pod and verifying recovery). Both are proven on a live Hetzner k3s cluster with multiple successful runs. More scenarios are being added.
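For readers unfamiliar with how these two conditions appear in the Kubernetes API, here is a rough sketch of detecting them from Deployment and Pod objects represented as plain dicts. The function names are hypothetical and this is not a description of OnCallZero's internals. The field paths and reason strings are the ones Kubernetes itself reports.

```python
def is_stuck_rollout(deployment: dict) -> bool:
    """True if the Deployment reports a Progressing=False condition
    with reason ProgressDeadlineExceeded."""
    conditions = deployment.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == "Progressing"
        and c.get("status") == "False"
        and c.get("reason") == "ProgressDeadlineExceeded"
        for c in conditions
    )


def is_crash_looping(pod: dict) -> bool:
    """True if any container in the Pod is waiting in CrashLoopBackOff."""
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for cs in statuses:
        waiting = (cs.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            return True
    return False
```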
OnCallZero connects to the Kubernetes API using a ServiceAccount with scoped RBAC permissions. It receives alerts via webhook from Prometheus or Alertmanager and sends approval requests through Slack. No agents on nodes. No sidecars. No kernel modules.
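To make the webhook step concrete, the sketch below extracts firing alerts from an Alertmanager webhook body. It assumes the standard Alertmanager payload shape (a top-level `alerts` list whose entries carry `status`, `labels`, and `startsAt`). The `firing_alerts` function and the presence of a `namespace` label are assumptions for illustration, and how OnCallZero actually parses alerts is not documented here.

```python
import json


def firing_alerts(body: str) -> list:
    """Parse an Alertmanager webhook body and keep only firing alerts."""
    payload = json.loads(body)
    result = []
    for alert in payload.get("alerts") or []:
        if alert.get("status") != "firing":
            continue  # resolved alerts need no remediation
        labels = alert.get("labels") or {}
        result.append(
            {
                "alertname": labels.get("alertname"),
                # Assumes alerts are labelled with a namespace; may be None.
                "namespace": labels.get("namespace"),
                "starts_at": alert.get("startsAt"),
            }
        )
    return result
```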
No. OnCallZero is not an alerting or incident management platform. It is the execution layer that sits between your monitoring stack and your infrastructure. PagerDuty tells you something broke. OnCallZero helps fix it — with your approval. They work together.
You see the diagnosis and proposed action in Slack before anything happens. If it is wrong, click Reject. The system also applies structural controls: blast radius limits, cooldown enforcement, kill switch support, and approval gates. The AI proposes. You decide.
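The structural controls listed above can be pictured as a single gate that every action must pass. The following is a hypothetical sketch: the `may_execute` function, its parameters, and the order of checks are assumptions, not the product's actual logic.

```python
from typing import Optional


def may_execute(
    now: float,
    last_action_at: Optional[float],
    cooldown_s: float,
    kill_switch_on: bool,
    approved: bool,
) -> bool:
    """Return True only if no structural control blocks the action.

    Times are seconds (for example, from time.time()).
    """
    if kill_switch_on:
        return False  # kill switch overrides everything, including approval
    if not approved:
        return False  # no approval, no execution
    if last_action_at is not None and now - last_action_at < cooldown_s:
        return False  # still inside the cooldown window
    return True
```

Each check can only block the action, so adding a control never widens what the system is allowed to do.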
OnCallZero is designed to fail closed. Every action goes through policy checks, scope limits, approval control, and post-action verification. If anything is uncertain, it stops. It is built for production safety first, not best-effort automation.
OnCallZero is free for early design partners. We are looking for a small number of engineering teams to use it and give honest feedback. After the design partner phase, pricing will be per cluster per month, with no per-seat charges.
OnCallZero investigates Kubernetes incidents, asks for approval in Slack when policy requires it, executes bounded recovery automatically after approval, and verifies the outcome with proof artifacts.