Runbook — Incident Response (skeleton)

Status: Skeleton (Phase 0) · Owner: Eng Lead / On-call · Expanded in Phase 9.

Severity levels

Sev     Definition                                  Response
Sev-1   Prod down / data loss / security breach     Page immediately; all-hands; status page red
Sev-2   Major feature broken, no workaround         Page on-call; fix within hours
Sev-3   Degraded / workaround exists                Next business day
Sev-4   Minor / cosmetic                            Backlog
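
If the severity levels need to be consumed by tooling (for example, to decide whether an alert should page), they could be encoded roughly as follows. This is a minimal sketch, not an existing module: the names `Severity`, `Policy`, `POLICIES`, and `should_page` are hypothetical, and the `page_on_call` flag is inferred from the Response column (only Sev-1 and Sev-2 mention paging).

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the table above; a lower number is more urgent."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class Policy:
    definition: str
    response: str
    page_on_call: bool  # assumption: inferred from the Response column


# Hypothetical lookup table mirroring the runbook's severity table.
POLICIES = {
    Severity.SEV1: Policy(
        "Prod down / data loss / security breach",
        "Page immediately; all-hands; status page red",
        True,
    ),
    Severity.SEV2: Policy(
        "Major feature broken, no workaround",
        "Page on-call; fix within hours",
        True,
    ),
    Severity.SEV3: Policy(
        "Degraded / workaround exists",
        "Next business day",
        False,
    ),
    Severity.SEV4: Policy("Minor / cosmetic", "Backlog", False),
}


def should_page(sev: Severity) -> bool:
    """Return True when the declared severity calls for paging."""
    return POLICIES[sev].page_on_call
```

Using `IntEnum` keeps the levels comparable, so a check such as `sev <= Severity.SEV2` reads as "Sev-2 or worse".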

First responder checklist

  1. Acknowledge the alert; declare severity.
  2. Open an incident channel; assign an Incident Commander.
  3. Check /api/health and observability dashboards (traces/metrics/logs).
  4. Mitigate first (rollback / failover), root-cause second.
  5. Communicate: status page + stakeholders at a fixed cadence.
  6. After resolution: write a blameless post-mortem within 48h.
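
Steps 5 and 6 both involve deadlines, which can be computed rather than remembered. The sketch below is illustrative only: the helper names are hypothetical, and the 30-minute cadence in the usage example is an assumption, since this runbook does not yet fix a cadence. Only the 48-hour post-mortem window comes from the checklist.

```python
from datetime import datetime, timedelta, timezone

# From step 6 of the checklist: post-mortem within 48h of resolution.
POSTMORTEM_WINDOW = timedelta(hours=48)


def postmortem_due(resolved_at: datetime) -> datetime:
    """Latest time by which the blameless post-mortem should be written."""
    return resolved_at + POSTMORTEM_WINDOW


def update_times(declared_at: datetime, cadence: timedelta, count: int) -> list[datetime]:
    """Times of the next `count` stakeholder updates at a fixed cadence."""
    return [declared_at + cadence * i for i in range(1, count + 1)]


if __name__ == "__main__":
    declared = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    # Assumed cadence of 30 minutes; substitute whatever the team agrees on.
    for t in update_times(declared, timedelta(minutes=30), 3):
        print("update due:", t.isoformat())
    print("post-mortem due:", postmortem_due(declared).isoformat())
```

Timezone-aware datetimes are used throughout so that deadlines stay unambiguous for responders in different regions.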

Useful entry points

  • Health: GET /api/health (reports dependency status, e.g. deps.mongo)
  • Rollback: blue-green toggle (Phase 9 §6.2) — to be documented.
  • DB: MongoDB Atlas console; backups and point-in-time recovery (PITR), target RPO ≤ 15 min.
  • Secrets rotation: per ADR-001.
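
As a starting point for the health entry above, a responder could query the endpoint and list unhealthy dependencies with a few lines of standard-library Python. This is a sketch under stated assumptions: the response shape (a JSON object with a `deps` mapping whose values are `"ok"` when healthy) is a guess based on the `deps.mongo` hint and is not confirmed by this runbook, and `failing_deps`, `fetch_health`, and the base URL are hypothetical.

```python
import json
from urllib.request import urlopen


def failing_deps(payload: dict) -> list[str]:
    """Names of dependencies whose state is not "ok".

    Assumes a payload such as {"status": "ok", "deps": {"mongo": "ok"}};
    the real /api/health schema may differ.
    """
    deps = payload.get("deps", {})
    return sorted(name for name, state in deps.items() if state != "ok")


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET {base_url}/api/health and decode the JSON body.

    urlopen raises urllib.error.HTTPError for non-2xx responses, so a
    caller should treat that exception as an unhealthy signal too.
    """
    with urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical base URL; replace with the real service address.
    health = fetch_health("http://localhost:3000")
    print("failing dependencies:", failing_deps(health) or "none")
```

Keeping `failing_deps` separate from the network call makes the parsing logic easy to adjust once the actual response schema is documented.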

Phase 9 fills in: on-call rota, paging tooling, per-dependency playbooks, DR drill linkage.