Runbook — Incident Response (skeleton)
Status: Skeleton (Phase 0) · Owner: Eng Lead / On-call · Expanded in Phase 9.
Severity levels
| Sev | Definition | Response |
|---|---|---|
| Sev-1 | Prod down / data loss / security breach | Page immediately; all-hands; status page red |
| Sev-2 | Major feature broken, no workaround | Page on-call; fix within hours |
| Sev-3 | Degraded / workaround exists | Next business day |
| Sev-4 | Minor / cosmetic | Backlog |
First responder checklist
- Acknowledge the alert; declare severity.
- Open an incident channel; assign an Incident Commander.
- Check
/api/healthand observability dashboards (traces/metrics/logs). - Mitigate first (rollback / failover), root-cause second.
- Communicate: status page + stakeholders at a fixed cadence.
- After resolution: write a blameless post-mortem within 48h.
Useful entry points
- Health:
GET /api/health(deps.mongoetc.) - Rollback: blue-green toggle (Phase 9 §6.2) — to be documented.
- DB: MongoDB Atlas console; backups/PITR (RPO ≤ 15m target).
- Secrets rotation: per ADR-001.
Phase 9 fills in: on-call rota, paging tooling, per-dependency playbooks, DR drill linkage.