Autonomous Kubernetes incident remediation. Human-approved.

Rollout failure detected, rollback approved in Slack, executed with audit evidence.

First-party proof on our own k3s / Hetzner cluster: one complete rollback proof packet, one partial CrashLoopBackOff runtime export, one pending evidence packet. This is owned-infrastructure evidence, not customer production proof.

Hero packet: OCZ-PROOF-STUCK-ROLLOUT-ROLLBACK-002
Scenario: ProgressDeadlineExceeded -> rollback_k8s_deployment
Environment: Owned Hetzner k3s proving cluster, namespace oncallzero-workloads
Status:
  validation_status: complete
  namespace: oncallzero-workloads
  scenario: stuck-rollout-rollback

Proof packets

Three packets, three honest statuses.

The homepage uses the strongest real packet as the hero and keeps partial or placeholder evidence visibly downgraded.

Complete

Stuck rollout / ProgressDeadlineExceeded -> rollback_k8s_deployment

Manifest confirms signed webhook acceptance, Slack approval, Slack signature validation, action guard, rollback execution, strict verification, RBAC evidence, and audit IDs for one owned-cluster run.

Partial runtime export

CrashLoopBackOff -> delete_pod

Real reconstructed runtime export with Slack screenshot, incident export, proof-flow log, post-restore Kubernetes state, runtime image, git metadata, and audit IDs. Missing raw request, raw kubectl-before, full audit records, standalone policy export, and structured Slack approval JSON.

Pending evidence

Bad deployment rollback

Placeholder packet only. The manifest says no public runtime artifact is present for this scenario yet, so it is not used as proof for a completed remediation.
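
The three statuses above can be read as a function of which artifacts a packet actually contains. The sketch below is illustrative only: the required-artifact set and all names (PacketStatus, REQUIRED_ARTIFACTS, classify_packet) are assumptions assembled from the artifact names mentioned on this page, not the manifest schema.

```python
from enum import Enum


class PacketStatus(Enum):
    COMPLETE = "complete"
    PARTIAL = "partial"
    PENDING = "pending"


# Hypothetical required set, built from artifact names this page mentions.
REQUIRED_ARTIFACTS = frozenset({
    "incident_export", "proof_flow_log", "audit_ids",
    "raw_request", "kubectl_before", "full_audit_records",
    "policy_export", "slack_approval_json",
})


def classify_packet(present: set[str]) -> PacketStatus:
    """Classify a packet by the required artifacts it actually contains."""
    found = REQUIRED_ARTIFACTS & present
    if not found:
        return PacketStatus.PENDING      # placeholder only, never used as proof
    if found == REQUIRED_ARTIFACTS:
        return PacketStatus.COMPLETE
    return PacketStatus.PARTIAL          # real evidence, visibly downgraded


def missing_artifacts(present: set[str]) -> list[str]:
    """List what a partial packet still lacks, in a stable order."""
    return sorted(REQUIRED_ARTIFACTS - present)
```

Under these assumptions, a packet holding only the incident export, proof-flow log, and audit IDs classifies as partial, and missing_artifacts names the five gaps listed for the CrashLoopBackOff export.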

Inline evidence

Real text artifacts, not fake terminal replay.

These are concise excerpts from checked-in proof packet files. No asciinema recording is embedded because no real .cast file exists, and the cropped terminal screenshots are not used as proof visuals.

verification-verdict.txt
== STRICT VERIFICATION VERDICT ==
Incident: b9abfcc7-1078-44f9-bffd-c2c54628d8a6

2026-05-23 13:31:45 [info] verification_node.verdict
  attempt=1
  confidence=0.95
  incident_id=b9abfcc7-1078-44f9-bffd-c2c54628d8a6
  node=verify
  passed=True
  reason_preview='Deployment rollout is healthy with all pods ready and available. No new readiness probe failures in current pods. Old failing pod is terminated. Metrics and events confirm stability and no regressions'

== VERDICT ==
Strict verification: PASSED (confidence=0.95)
All pods ready: YES (2/2)
Rollout complete: YES (NewReplicaSetAvailable)
Old stuck RS terminated: YES (proving-nginx-7f486885f6: 0 desired)
Deployment healthy: YES
Incident status: resolved

action-guard-evidence.txt
== ACTION GUARD EVIDENCE ==
Tool: rollback_k8s_deployment
Incident: b9abfcc7-1078-44f9-bffd-c2c54628d8a6

== GUARD CONTEXT (from incident.json + logs) ==
- approval_status: approved (U0AKT8RE8N9, khmuraandriy)
- tool_name approved: rollback_k8s_deployment
- namespace approved: oncallzero-workloads
- ONCALLZERO_ALLOWED_NAMESPACES: ["oncallzero-workloads"] (k8s ConfigMap)
- target namespace in tool_args: oncallzero-workloads -> matches allowlist

2026-05-23 13:31:07 [critical] AUDIT_ACTION_EXECUTE
  incident_id=b9abfcc7-1078-44f9-bffd-c2c54628d8a6
  tool_name=rollback_k8s_deployment

2026-05-23 13:31:07 [critical] AUDIT_ACTION_RESULT
  success=True
  duration_seconds=0.12
  tool_name=rollback_k8s_deployment

== VERDICT ==
action_guard: ALLOWED (no PERMISSION DENIED in log sequence, execution proceeded)
namespace_check: PASSED
approval_check: PASSED
tool_match: PASSED

kubectl-before-and-after.txt
=== Injected state ===
namespace: oncallzero-workloads
deployment: proving-nginx

NAME            READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES       SELECTOR
proving-nginx   2/2     1            2           60d   nginx        nginx:1.27   app=proving-nginx

Available=True MinimumReplicasAvailable Deployment has minimum availability.
Progressing=False ProgressDeadlineExceeded ReplicaSet "proving-nginx-7f486885f6" has timed out progressing.

NAME                             READY   STATUS    RESTARTS   AGE
proving-nginx-76d7745467-cxnqr   1/1     Running   0          63s
proving-nginx-76d7745467-pdtck   1/1     Running   0          57s
proving-nginx-7f486885f6-n928x   0/1     Running   0          47s

=== Final state ===
namespace: oncallzero-workloads
deployment: proving-nginx

NAME            READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES       SELECTOR
proving-nginx   2/2     2            2           60d   nginx        nginx:1.27   app=proving-nginx

Progressing=True NewReplicaSetAvailable ReplicaSet "proving-nginx-584f49ffbd" has successfully progressed.
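
The injected and final states differ in the Deployment's Progressing condition. A minimal sketch of that check follows, assuming the conditions have already been read from the Deployment status as a list of dicts with the standard type, status, and reason fields. The function name is_rollout_stuck is hypothetical and is not taken from the packet.

```python
def is_rollout_stuck(conditions: list[dict]) -> bool:
    """True when the Progressing condition reports ProgressDeadlineExceeded."""
    for cond in conditions:
        if (
            cond.get("type") == "Progressing"
            and cond.get("status") == "False"
            and cond.get("reason") == "ProgressDeadlineExceeded"
        ):
            return True
    return False


# Condition values copied from the injected and final states above.
injected_state = [
    {"type": "Available", "status": "True", "reason": "MinimumReplicasAvailable"},
    {"type": "Progressing", "status": "False", "reason": "ProgressDeadlineExceeded"},
]
final_state = [
    {"type": "Progressing", "status": "True", "reason": "NewReplicaSetAvailable"},
]
```

Here is_rollout_stuck(injected_state) is true and is_rollout_stuck(final_state) is false. Note that the injected state still reports Available=True, so availability alone would not flag this rollout.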

rbac-can-i.txt
=== can-i patch deployments in oncallzero-workloads ===
yes
=== can-i get replicasets in oncallzero-workloads ===
yes
=== can-i list pods in oncallzero-workloads ===
yes
=== can-i delete pods in oncallzero-workloads ===
yes

=== RoleBinding: oncallzero-workload-mutation ===
Name:         oncallzero-workload-mutation
Role:
  Kind:  Role
  Name:  oncallzero-workload-mutation
Subjects:
  Kind            Name        Namespace
  ----            ----        ---------
  ServiceAccount  oncallzero  oncallzero

=== Role: oncallzero-workload-mutation rules ===
{'apiGroups': ['apps'], 'resources': ['deployments'], 'verbs': ['get', 'list', 'watch', 'patch', 'update']}
{'apiGroups': [''], 'resources': ['pods'], 'verbs': ['delete']}
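
The two Role rules above can be evaluated with a few lines of matching logic. This is a simplified sketch, not the Kubernetes authorizer: it ignores resourceNames, subresources, and any other Role or ClusterRole bound to the same ServiceAccount. The helper names rule_allows and role_allows are hypothetical.

```python
# Rules copied from the Role excerpt above.
ROLE_RULES = [
    {"apiGroups": ["apps"], "resources": ["deployments"],
     "verbs": ["get", "list", "watch", "patch", "update"]},
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["delete"]},
]


def rule_allows(rule: dict, api_group: str, resource: str, verb: str) -> bool:
    """Match one rule, treating "*" as a wildcard in each field."""
    def matches(values: list, wanted: str) -> bool:
        return "*" in values or wanted in values

    return (
        matches(rule["apiGroups"], api_group)
        and matches(rule["resources"], resource)
        and matches(rule["verbs"], verb)
    )


def role_allows(rules: list, api_group: str, resource: str, verb: str) -> bool:
    """A request is allowed if any single rule allows it."""
    return any(rule_allows(r, api_group, resource, verb) for r in rules)
```

With these rules, patching deployments and deleting pods are allowed, while deleting deployments is not. kubectl auth can-i evaluates every binding for the subject, so a yes for a verb this Role does not list (for example listing pods) would have to come from a binding that this excerpt does not show, assuming the excerpt lists the Role's rules in full.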
Anatomy of one run

One incident path, tied to artifacts.

Stages use timestamps from proof-flow.log, incident.json, and metadata.json. When an export timestamp is not present, the page says so.

2026-05-23 13:30:51

Detected

Signed Prometheus webhook accepted for the stalled proving-nginx rollout. Evidence: proof-flow.log.

2026-05-23 13:30:52

Investigated

Analysis started with Kubernetes wedge tools: pod status, logs, deployment status, events, HPA, and pod issue description. Evidence: proof-flow.log.

2026-05-23 13:31:06

Slack approved

Slack callback recorded approve_action from khmuraandriy. Evidence: signature validation.

2026-05-23 13:31:07

Action guarded

Approved tool and namespace matched the allowlist before execution proceeded. Evidence: action guard.

2026-05-23 13:31:07

Rollback executed

AUDIT_ACTION_RESULT success=True for rollback_k8s_deployment. Evidence: proof-flow.log.

2026-05-23 13:31:45

Verification passed

Verification verdict passed with confidence 0.95; deployment healthy and pods ready. Evidence: verdict.

timestamp unavailable

Audit/proof exported

Audit IDs and proof packet files are checked in; the packet does not expose a separate proof export timestamp. Evidence: audit IDs.
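
The stage timestamps above can be turned into offsets from detection. The sketch below does that arithmetic for this single owned-cluster run only; the stage keys and the function name seconds_since_detection are hypothetical labels, and the export stage is left out because it has no timestamp.

```python
from datetime import datetime

# Timestamps copied from the stages above.
STAGES = [
    ("detected", "2026-05-23 13:30:51"),
    ("investigated", "2026-05-23 13:30:52"),
    ("slack_approved", "2026-05-23 13:31:06"),
    ("action_guarded", "2026-05-23 13:31:07"),
    ("rollback_executed", "2026-05-23 13:31:07"),
    ("verification_passed", "2026-05-23 13:31:45"),
]


def seconds_since_detection(stages: list) -> dict:
    """Map each stage name to whole seconds elapsed since the first stage."""
    fmt = "%Y-%m-%d %H:%M:%S"
    start = datetime.strptime(stages[0][1], fmt)
    return {
        name: int((datetime.strptime(ts, fmt) - start).total_seconds())
        for name, ts in stages
    }


offsets = seconds_since_detection(STAGES)
# offsets["slack_approved"] == 15, offsets["verification_passed"] == 54
```

These offsets describe one run and are not a recovery-time claim.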

Safety model

SRE-facing controls shown in the packet.

The safety language below separates what is shown in this run from what remains pending proof.

Shown in proof packet

Slack approval gate: approval_status: approved, approver U0AKT8RE8N9, and callback evidence from Slackbot via app.oncallzero.com.

Shown in proof packet

Signed webhook and Slack signature validation: the Prometheus webhook is accepted only with the configured headers; the Slack signature is marked valid with clock_skew_seconds=0.

Implemented in this run

Action guard: executed tool matched approved tool rollback_k8s_deployment, target namespace matched oncallzero-workloads, and execution proceeded without a permission-denied log.
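
The three guard checks named in action-guard-evidence.txt (approval_check, tool_match, namespace_check) can be sketched as one function. This is a minimal illustration under assumptions: the Approval dataclass, the guard_action signature, and the shape of tool_args are hypothetical, and the real guard may check more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approval:
    status: str      # e.g. "approved"
    tool_name: str   # the tool the human approved
    namespace: str   # the namespace the human approved


def guard_action(approval: Approval, tool_name: str, tool_args: dict,
                 allowed_namespaces: list) -> dict:
    """Return each named check plus an overall "allowed" flag."""
    target_ns = tool_args.get("namespace")
    checks = {
        "approval_check": approval.status == "approved",
        "tool_match": approval.tool_name == tool_name,
        # Target must match both the approved namespace and the allowlist.
        "namespace_check": (
            target_ns == approval.namespace and target_ns in allowed_namespaces
        ),
    }
    checks["allowed"] = all(checks.values())
    return checks


approval = Approval("approved", "rollback_k8s_deployment", "oncallzero-workloads")
result = guard_action(
    approval,
    "rollback_k8s_deployment",
    {"namespace": "oncallzero-workloads"},
    ["oncallzero-workloads"],
)
```

With the values from this run, every check passes. Swapping in a different tool name or a namespace outside the allowlist flips the corresponding check and the overall flag to false.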

Implemented in this run

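Namespace/action scope: the inline RBAC excerpt shows can-i yes for patching deployments, getting replicasets, and listing and deleting pods in the owned workload namespace, and the Role grants deployment get/list/watch/patch/update plus pod delete. That limits this packet's scope to the shown cluster and namespace.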

Shown in proof packet

Verification verdict: deployment returned to 2/2 ready, rollout complete, active ReplicaSet proving-nginx-584f49ffbd, verification passed.
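
The deterministic lines of the verdict (pods ready, rollout complete, old ReplicaSet scaled to zero) can be expressed as a small check. This sketch covers only those three lines and does not model the confidence value or the free-text reason; the function name verify_rollout and its parameters are hypothetical.

```python
def verify_rollout(desired: int, ready: int, progressing_reason: str,
                   old_rs_desired: int) -> tuple:
    """Return (passed, checks) for the three verdict lines modeled here."""
    checks = {
        "all_pods_ready": ready == desired,
        "rollout_complete": progressing_reason == "NewReplicaSetAvailable",
        "old_rs_scaled_down": old_rs_desired == 0,
    }
    return all(checks.values()), checks


# Values from the verdict: 2/2 ready, NewReplicaSetAvailable, old RS at 0 desired.
passed, checks = verify_rollout(2, 2, "NewReplicaSetAvailable", 0)
```

A run that still reports ProgressDeadlineExceeded, or whose old ReplicaSet still has desired replicas, fails under this sketch.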

Shown in proof packet

Audit IDs: five entries cover incident opened, action proposal, approval request, approval decision, and tool call. Full raw audit JSON is not exposed as a public artifact.

Implemented in this run

Deny-by-default posture is evidenced for inbound webhooks: no API key returns 401, correct API key without webhook secret returns 403, signed webhook returns 202.
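
The 401 / 403 / 202 ladder above can be sketched as a single decision function. Everything specific here is an assumption: the header names X-API-Key and X-Webhook-Signature are hypothetical, and modeling the webhook secret as an HMAC-SHA256 signature over the body is one plausible reading of "signed webhook", not a description of the actual implementation.

```python
import hashlib
import hmac


def sign_body(webhook_secret: str, body: bytes) -> str:
    """Assumed signing scheme: hex HMAC-SHA256 of the raw body."""
    return hmac.new(webhook_secret.encode(), body, hashlib.sha256).hexdigest()


def webhook_status(headers: dict, body: bytes, api_key: str,
                   webhook_secret: str) -> int:
    """Deny by default: each missing or wrong credential stops the request."""
    presented_key = headers.get("X-API-Key", "")
    if not hmac.compare_digest(presented_key.encode(), api_key.encode()):
        return 401  # no API key, or the wrong one
    presented_sig = headers.get("X-Webhook-Signature", "")
    expected_sig = sign_body(webhook_secret, body)
    if not hmac.compare_digest(presented_sig.encode(), expected_sig.encode()):
        return 403  # valid API key, but unsigned or badly signed
    return 202      # authenticated and signed: accepted for processing
```

A request with no headers returns 401, one with only the API key returns 403, and one carrying both the key and a matching signature returns 202, mirroring the three responses in the packet.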

Pending proof

No kill-switch artifact is present in this public packet. No broad multi-tenant, multi-cluster, or customer-production safety claim is made here.

Early access

Join early access

For SRE / Platform teams who want to review the proof packet or discuss whether approval-gated Kubernetes remediation fits their environment.

No customer production proof yet. First-party owned-infra proof only.

Prefer email? Contact oncallzero@gmail.com

Claim firewall

What this page does not claim.

  • No customer production proof.
  • No customer validation claim.
  • No production readiness or enterprise readiness claim.
  • No recovery-time guarantee or SLA claim.
  • No claim that all Kubernetes remediation paths are proven.
  • No live operational dashboard claim.
  • No asciinema hero until a real checked-in .cast file exists.