Real incident artifacts · Slack approvals · executed recoveries · verified outcomes

OnCallZero. Kubernetes incidents.
Resolved in production.
53–69 seconds.

Not alerts. Not dashboards. Actual incident resolution.

OnCallZero investigates production incidents, proposes a bounded fix, executes it automatically after Slack approval when policy requires it, and verifies recovery on real infrastructure.

Real incidents, not demos
Slack approval captured
Actions executed in production
Recovery verified with artifacts
incident 4dd38e92 · proving-nginx
verified
Incident signal
Rollout stalled during deployment in proving-nginx.
09:04:12 UTC
Investigation
New revision failing readiness checks. Previous revision still healthy.
09:04:15 UTC · 4 tools · 3s
Proposed action
Rollback to the previous healthy revision within policy scope.
09:04:18 UTC · risk: medium
Approval
Approved by @andrii in Slack. Execution gate opened for the scoped rollback.
slack-approval-required
Recovery verified
Rollback executed automatically. 2/2 replicas healthy. Recovery verification passed.
09:05:10 UTC · 58s total
Slack approval recorded
Execution logs captured
Recovery verification passed
Full audit trail available

What is OnCallZero

OnCallZero is an autonomous incident resolution system for Kubernetes. It investigates production incidents, proposes safe recovery actions, executes them after Slack approval, and verifies recovery automatically.

Live incident proof · approval, execution, verification · 53–69s recovery

This is not a demo.
These are real incidents.

The evidence below comes from production-like runs on a live Hetzner k3s environment. The artifacts show the incident, the approval recorded in Slack, the executed remediation, and the verification checks that confirmed recovery.

Slack approval record
Slack approval record showing proposed rollback, human approval, and verified recovery for the stuck rollout scenario

Approval recorded in Slack before execution

Execution artifact
Execution artifact showing bounded rollback execution and recovery verification for the stuck rollout scenario

Executed remediation and recovery verification

From incident signal to verified recovery.
Less manual firefighting.

OnCallZero is not the alerting or paging layer. It starts from monitoring signals, investigates the incident, selects policy-bounded recovery, asks for approval in Slack when required, executes automatically after approval, and verifies the system recovered.

01

Detect and investigate

Starts with signals from your monitoring tools, then moves straight into diagnosis. Correlates pod status, rollout history, image versions, and recent events to identify a likely cause and the safest recovery path.

09:04:12 · SIGNAL · ImagePullBackOff · proving-nginx
09:04:13 · QUERY · pod status, image pull events
09:04:14 · QUERY · rollout history, previous revision image
09:04:15 · LIKELY CAUSE · bad image tag in latest revision
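The correlation step above can be sketched in a few lines of Python. This is a minimal illustration, not OnCallZero's actual logic: the `Revision` type, the `likely_cause` function, and the image names in the usage note are hypothetical, and the revision list is assumed to be sorted oldest to newest.

```python
from dataclasses import dataclass
from typing import List

# Waiting reasons Kubernetes reports when a container image cannot be pulled.
IMAGE_PULL_REASONS = {"ImagePullBackOff", "ErrImagePull"}


@dataclass
class Revision:
    number: int
    image: str
    healthy: bool


def likely_cause(waiting_reasons: List[str], revisions: List[Revision]) -> str:
    """Return a hedged diagnosis, or "unknown" when the evidence is thin."""
    if len(revisions) < 2:
        return "unknown"  # nothing to compare against, so make no claim
    latest, previous = revisions[-1], revisions[-2]
    pull_failed = any(r in IMAGE_PULL_REASONS for r in waiting_reasons)
    image_changed = latest.image != previous.image
    if pull_failed and image_changed and previous.healthy and not latest.healthy:
        return f"bad image tag in latest revision ({latest.image})"
    return "unknown"
```

Called with `["ImagePullBackOff"]` and two revisions such as `Revision(3, "nginx:1.25", True)` and `Revision(4, "nginx:1.25-typo", False)`, the sketch names the latest image as the likely cause. With no pull failures it returns "unknown" rather than guessing.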
02

Select and approve bounded action

Builds a recovery plan with blast radius assessment, risk scoring, dependency checks, and rollback safety. When policy requires a human gate, approval happens in Slack before any execution begins.

Rollback to previous healthy revision
medium risk
Blast radius: 1 deployment, 2 pods
scoped
Policy: approval in Slack before execution
Slack gate
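A rough sketch of how a plan like the one above might be checked against a scope limit before it reaches the Slack gate. The `RecoveryPlan` and `ScopeLimit` types, the `gate` function, and the state names it returns are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPlan:
    action: str
    deployments: int  # number of deployments the action touches
    pods: int         # number of pods the action touches
    risk: str         # "low", "medium", or "high"


@dataclass(frozen=True)
class ScopeLimit:
    max_deployments: int
    max_pods: int


def within_blast_radius(plan: RecoveryPlan, limit: ScopeLimit) -> bool:
    """Scope limits are hard caps: exceeding either one fails the check."""
    return plan.deployments <= limit.max_deployments and plan.pods <= limit.max_pods


def gate(plan: RecoveryPlan, limit: ScopeLimit, approval_required: bool) -> str:
    """Decide the plan's next state before anything is executed."""
    if not within_blast_radius(plan, limit):
        return "rejected"
    return "awaiting-slack-approval" if approval_required else "ready"
```

For the plan shown above (1 deployment, 2 pods, medium risk) and a limit of 1 deployment and 3 pods, `gate(...)` with `approval_required=True` returns "awaiting-slack-approval". A plan touching 5 pods under the same limit would be rejected before any approval is requested.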
03

Execute and verify recovery

After approval, runs the selected action in your infrastructure, validates each step, and confirms health checks recovered. Stops immediately if policy or verification checks fail.

Policy and Slack approval check passed · 0.4s
Rollback executed automatically · 1.7s
Post-action health checks passed · 52s
Recovery verified · 1m 09s
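The stop-on-first-failure behavior described above could look roughly like this. It is a sketch under assumptions: `run_recovery` and the step names are hypothetical, and each step is modeled as a plain callable that returns True when it passes.

```python
from typing import Callable, List, Tuple


def run_recovery(steps: List[Tuple[str, Callable[[], bool]]]) -> Tuple[bool, List[str]]:
    """Run steps in order and stop at the first one that fails or raises."""
    log: List[str] = []
    for name, step in steps:
        try:
            passed = bool(step())
        except Exception as exc:  # a step that errors counts as a failed step
            log.append(f"FAILED {name}: {exc}")
            return False, log
        if not passed:
            log.append(f"FAILED {name}")
            return False, log  # later steps are never started
        log.append(f"ok {name}")
    return True, log
```

If the steps are a policy check, a rollback, and a health check, and the rollback returns False, the health check is never called and the function reports failure along with a log of what ran.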

Every action governed.
Every decision auditable.

OnCallZero doesn't bypass your controls or stop at notification. It investigates, executes, and verifies within them. The same policies your team enforces manually become bounded automation with a full audit trail.

Permission matrix

Every action type has explicit allow/deny/approval-required rules per namespace, severity, and time window before execution can proceed.

Fail-closed by default

If the agent can't verify it's safe to proceed, it stops. Unknown states result in no action, not best-effort guesses.

Blast radius enforcement

Actions are capped by scope. One deployment, not the namespace. Three pods, not the cluster. Boundaries are hard limits, not suggestions.

Full audit trail

Every tool call, every decision, every approval, every verification result — timestamped and logged. Reconstructable for any incident, any time.

permission-matrix.yaml
Active
Action               Non-critical  Production  Critical path
Rollback deployment  Auto          Approval    Deny
Scale replicas       Auto          Auto        Approval
Restart pod          Auto          Approval    Approval
Modify resources     Approval      Deny        Deny
Delete resource      Deny          Deny        Deny
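One way to picture the matrix above is as a lookup that denies anything it does not recognize. The sketch below copies the table's values; the key names (`rollback_deployment`, `non_critical`, and so on) and the `decide` function are assumptions, and it models only the environment tier, not the namespace, severity, or time-window rules.

```python
# Values copied from the matrix above; key names are illustrative only.
MATRIX = {
    "rollback_deployment": {"non_critical": "auto", "production": "approval", "critical_path": "deny"},
    "scale_replicas": {"non_critical": "auto", "production": "auto", "critical_path": "approval"},
    "restart_pod": {"non_critical": "auto", "production": "approval", "critical_path": "approval"},
    "modify_resources": {"non_critical": "approval", "production": "deny", "critical_path": "deny"},
    "delete_resource": {"non_critical": "deny", "production": "deny", "critical_path": "deny"},
}


def decide(action: str, tier: str) -> str:
    """Return "auto", "approval", or "deny". Unknown inputs fail closed to "deny"."""
    return MATRIX.get(action, {}).get(tier, "deny")
```

`decide("rollback_deployment", "production")` returns "approval", while an action that is not in the matrix at all, such as a hypothetical `drain_node`, returns "deny".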

What changes when incidents get resolved,
not just escalated.

Without OnCallZero

2:14 AM — PagerDuty fires. On-call engineer wakes up, opens laptop, tries to remember which service this is.

2:28 AM — Still investigating. Checking logs, guessing at root cause, Slacking teammates who are asleep.

2:47 AM — Manual rollback. Fingers crossed. No blast radius check. Error rate still climbing.

3:12 AM — Resolved. 58 minutes of downtime. Engineer burned out. Postmortem tomorrow.

With OnCallZero

2:14 AM — Incident signal ingested. OnCallZero begins investigation immediately. Correlates pods, deploy history, and recent events.

2:14 AM — Recovery action selected. Likely cause identified. Blast radius, rollback path, and policy constraints checked.

2:14 AM — Approved in Slack. Approval opens the gate. The bounded rollback executes automatically in production.

~2:15 AM — Recovery verified. 53–69s in proven runs. Health checks pass, artifacts are captured, and the engineer wakes up to proof instead of manual triage.

Common questions

Does OnCallZero execute actions without my permission?

No. Every mutating action requires explicit approval in Slack before execution. You see the diagnosis, proposed action, risk level, and blast radius — then you decide. If you do not approve, nothing happens. There is no autonomous execution mode in the current version.

What happens if the rollback fails or makes things worse?

OnCallZero runs real Kubernetes health checks after every action. If verification fails — replicas are not healthy, pods are still crashing, or the rollout did not complete — the incident is marked FAILED, not RESOLVED. The system never claims success without proof. If execution fails, it stops immediately and notifies you.
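As a rough illustration of "no success without proof", a verification check might compare the desired replica count with the counters Kubernetes reports in a Deployment's status. The `verify_recovery` function is hypothetical; `readyReplicas`, `updatedReplicas`, and `availableReplicas` are standard Deployment status fields.

```python
def verify_recovery(desired_replicas: int, status: dict) -> str:
    """Return "RESOLVED" only when every counter matches the desired count."""
    ready = status.get("readyReplicas", 0)
    updated = status.get("updatedReplicas", 0)
    available = status.get("availableReplicas", 0)
    # Missing fields default to 0, so an incomplete status can never pass.
    if desired_replicas > 0 and ready == updated == available == desired_replicas:
        return "RESOLVED"
    return "FAILED"
```

With 2 desired replicas and a status reporting 2 ready, 2 updated, and 2 available, the result is "RESOLVED". If only 1 replica is ready, or the status is empty, the result is "FAILED".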

What Kubernetes incidents does it handle today?

Two proven scenarios: stuck rollouts (ProgressDeadlineExceeded leading to rollback to the previous healthy revision) and CrashLoopBackOff (pod crash-looping leading to pod replacement and recovery verification). Both are proven on a live Hetzner k3s cluster with multiple successful runs. More scenarios are being added.
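Both scenarios can be recognized from fields Kubernetes already exposes. The sketch below works on plain dictionaries shaped like a Deployment's `status` and a Pod's `status`; the function names and the scenario labels it returns are assumptions, not OnCallZero's internal names.

```python
from typing import List, Optional


def rollout_stuck(deployment_status: dict) -> bool:
    """True when the Progressing condition reports ProgressDeadlineExceeded."""
    for cond in deployment_status.get("conditions", []):
        if (cond.get("type") == "Progressing"
                and cond.get("status") == "False"
                and cond.get("reason") == "ProgressDeadlineExceeded"):
            return True
    return False


def crash_looping(pod_status: dict) -> bool:
    """True when any container is waiting with reason CrashLoopBackOff."""
    for cs in pod_status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            return True
    return False


def classify(deployment_status: dict, pod_statuses: List[dict]) -> Optional[str]:
    """Map raw status to a scenario, or None when neither scenario matches."""
    if rollout_stuck(deployment_status):
        return "stuck-rollout"
    if any(crash_looping(p) for p in pod_statuses):
        return "crash-loop"
    return None
```

Returning None for anything outside the two scenarios keeps the sketch in line with the fail-closed stance: an unrecognized state maps to no action.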

How does it connect to my cluster?

OnCallZero connects to the Kubernetes API using a ServiceAccount with scoped RBAC permissions. It receives Prometheus alerts through an Alertmanager webhook and sends approval requests through Slack. No agents on nodes. No sidecars. No kernel modules.
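To make the webhook side concrete, here is a small sketch that pulls firing alerts out of an Alertmanager webhook body. The payload fields used (`alerts`, `status`, `labels`, `startsAt`) follow Alertmanager's webhook format; the `firing_alerts` function and the shape of its output are hypothetical, and whether a `namespace` label is present depends on how your alert rules are written.

```python
import json
from typing import List


def firing_alerts(body: str) -> List[dict]:
    """Extract the firing alerts from an Alertmanager webhook JSON body."""
    payload = json.loads(body)
    signals = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no investigation
        labels = alert.get("labels", {})
        signals.append({
            "alertname": labels.get("alertname"),
            "namespace": labels.get("namespace"),  # None if the rule sets no such label
            "starts_at": alert.get("startsAt"),
        })
    return signals
```

A receiver built this way only reads the alert. Any action against the cluster would still go through the ServiceAccount's scoped permissions and the approval flow described above.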

Do I need to replace PagerDuty or incident.io?

No. OnCallZero is not an alerting or incident management platform. It is the execution layer that sits between your monitoring stack and your infrastructure. PagerDuty tells you something broke. OnCallZero helps fix it — with your approval. They work together.

What if the AI proposes the wrong action?

You see the diagnosis and proposed action in Slack before anything happens. If it is wrong, click Reject. The system also applies structural controls: blast radius limits, cooldown enforcement, kill switch support, and approval gates. The AI proposes. You decide.
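Two of the structural controls named above, cooldown enforcement and a kill switch, can be sketched as a single gate checked before any action runs. The `SafetyGate` class, its parameters, and the 300-second cooldown in the usage note are assumptions for illustration only.

```python
import time
from typing import Callable, Dict


class SafetyGate:
    """Blocks actions while the kill switch is on or a target is cooling down."""

    def __init__(self, cooldown_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.kill_switch = False
        self._last_run: Dict[str, float] = {}

    def allow(self, target: str) -> bool:
        if self.kill_switch:
            return False  # the kill switch overrides everything else
        now = self.clock()
        last = self._last_run.get(target)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # this target was acted on too recently
        self._last_run[target] = now  # record the action time for this target
        return True
```

With `SafetyGate(cooldown_seconds=300)`, a second action against the same target within five minutes is refused, and setting `kill_switch = True` refuses every action regardless of cooldown state.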

Is it safe for production?

OnCallZero is designed fail-closed. Every action goes through policy checks, scope limits, approval control, and post-action verification. If anything is uncertain, it stops. It is built for production safety first, not best-effort automation.

How much does it cost?

OnCallZero is free for early design partners. We are looking for a small number of engineering teams to use it and give honest feedback. After the design partner phase, pricing will be per-cluster per-month with no per-seat charges.

You don’t need more alerts.
You need resolution.

OnCallZero investigates Kubernetes incidents, asks for approval in Slack when policy requires it, executes bounded recovery automatically after approval, and verifies the outcome with proof artifacts.