Production Readiness & Go-Live Runbook

Phase: 9 (QA, Security, Deployment & Handover) · Last updated: 2026-06-28 · Gate: G9

This runbook is the operational checklist for taking the platform from the verified dev build to production. It complements 01_Status.md (what's built) and incident_response.md.

1. Readiness checklist

Area | Status | Evidence / Notes
Production build | ✅ Green | pnpm -C apps/web build exits 0 (lint + type-check + bundle)
Type safety | | pnpm typecheck clean across the workspace
API: auth on every route | | getApiContext + requirePermission; verified by audit
API: tenant isolation | | All queries pass tenantScopedFilter; no cross-tenant leak found
API: error handling | | handle() maps Zod / ApiError / Mongoose Cast/Validation/duplicate → 4xx; never leaks 500s for client errors
RBAC enforcement | | Scoped API keys denied ungranted reads/writes (403)
Input validation | ✅ (data) / ⚠️ (feature) | CRUD entities Zod-validated; some AI/workflow routes hand-validate fields
Audit log | | Immutable; write actions recorded
Accessibility | ✅ baseline | Focus-visible on shared controls, aria-labels on icon buttons, labelled inputs
i18n | | en / ja / zh; lang attribute set per tenant
API reference | | OpenAPI at /api/openapi.json, Swagger at /api/docs, Scalar at /api/docs/scalar
Health check | | GET /api/health → {status:"ok",deps:{mongo:"up"}}
External pentest | ⏳ Outstanding | Required before sign-off (Phase 9 O9.2)
Perf / soak test | ⏳ Outstanding | Load/spike/72h soak vs NFR targets
Backup + DR drill | ⏳ Outstanding | See §6; Atlas backups + restore rehearsal
Prod deploy pipeline | ⏳ Outstanding | Blue-green / canary; see §3
Observability + alerting | ⏳ Partial | Health + logs present; OTEL hook unconfigured (see §5)

Honest summary: the application code is production-grade and fully verified in the dev environment against MongoDB Atlas. The outstanding items (⏳) are infrastructure & process deliverables that live outside the codebase (deployment platform, monitoring stack, pentest engagement, DR rehearsal). They must be completed before a real go-live.
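
The error-handling row above says handle() maps validation and database errors to 4xx responses. As an illustration only, the sketch below shows one way such a mapping could look. It is not the project's handle() implementation: the ErrorLike shape, the ApiError status field, and the specific status codes (400, 409) are assumptions. The error names ZodError, CastError, ValidationError and the duplicate-key code 11000 are the ones Zod, Mongoose, and MongoDB use.

```typescript
// Minimal sketch of an error-to-status mapping (assumed, not the real handle()).
interface ErrorLike {
  name?: string;
  code?: number;
  status?: number; // assumed field carried by the app's ApiError
}

function statusFor(err: ErrorLike): number {
  // Application errors carry their own status (hypothetical shape).
  if (err.name === "ApiError" && typeof err.status === "number") return err.status;
  // Zod schema validation failure.
  if (err.name === "ZodError") return 400;
  // Mongoose cast (e.g. malformed ObjectId) or schema validation failure.
  if (err.name === "CastError" || err.name === "ValidationError") return 400;
  // MongoDB duplicate-key error, e.g. a unique (tenantId, code) collision.
  if (err.code === 11000) return 409;
  // Anything else is a genuine server fault.
  return 500;
}
```

Only errors that fall through every branch become 500s, which is the property the checklist row describes.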

2. Environment variables

Set these in the production secret store (never in the repo). Full list in .env.example.

  • Required: MONGODB_URI, MONGODB_DB_NAME, AUTH_SECRET (strong random), NEXT_PUBLIC_APP_URL, NODE_ENV=production.
  • Disable in prod: AUTH_DEV_BYPASS (must be unset/false), SEED_OWNER_PASSWORD.
  • AI (optional): AI_PROVIDER, AI_MODEL, OPENAI_API_KEY / ANTHROPIC_API_KEY.
  • Push (optional): VAPID public/private keys for web push.
  • Future infra: S3_* (document storage), OTEL_EXPORTER_OTLP_ENDPOINT (traces/metrics), REDIS_URL (if moving workflow/maintenance engines out of process).
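
A pre-deploy check can catch a missing required variable or a dev-only flag left on. The sketch below is one possible shape for such a check, assuming the variable names listed above. The function name checkProdEnv and the rule that only an unset, empty, or "false" value counts as disabled are assumptions, not part of the codebase.

```typescript
// Hypothetical pre-deploy environment check; names of variables come from the
// runbook, everything else is illustrative.
type Env = Record<string, string | undefined>;

const REQUIRED: readonly string[] = [
  "MONGODB_URI",
  "MONGODB_DB_NAME",
  "AUTH_SECRET",
  "NEXT_PUBLIC_APP_URL",
];

function checkProdEnv(env: Env): string[] {
  const problems: string[] = [];

  // Every required variable must be present and non-empty.
  for (const name of REQUIRED) {
    if (!env[name]) problems.push(`${name} is missing`);
  }
  if (env["NODE_ENV"] !== "production") {
    problems.push("NODE_ENV must be production");
  }

  // Assumption: only unset, empty, or "false" counts as disabled.
  const bypass = env["AUTH_DEV_BYPASS"];
  if (bypass !== undefined && bypass !== "" && bypass.toLowerCase() !== "false") {
    problems.push("AUTH_DEV_BYPASS must be unset or false");
  }
  if (env["SEED_OWNER_PASSWORD"]) {
    problems.push("SEED_OWNER_PASSWORD must not be set");
  }
  return problems;
}
```

In a Node process this could be called as checkProdEnv(process.env) during the deploy pipeline, failing the step when the returned list is non-empty.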

3. Build & deploy

pnpm install --frozen-lockfile
pnpm -C apps/web build      # standalone output
# Node host:
node apps/web/.next/standalone/server.js   # behind a TLS-terminating reverse proxy
  • Target a Node 22 LTS runtime (Vercel, a container, or a managed Node host). Per project constraint there is no Docker setup in-repo; add one for the chosen platform at deploy time.
  • Roll out blue-green (or canary): deploy the new version alongside the old, run §7 smoke checks against it, then switch the load-balancer/alias. Keep the previous release warm for instant rollback.

4. Database (MongoDB Atlas)

  • Use a dedicated production cluster + database; the app user needs readWrite only on its DB.
  • Indexes are declared in the Mongoose models (tenantId-first compound indexes, unique (tenantId, code) on assets/work orders) and created on connect — confirm they exist after first boot (db.<col>.getIndexes()).
  • Enable Atlas continuous backups (PITR) with retention meeting the RPO target.
  • Restrict network access to the app's egress IPs; enable Atlas auditing.
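
The post-boot index check can be automated by comparing the indexes the models are expected to declare with what getIndexes() reports. The sketch below assumes the getIndexes() output has been fetched and passed in as plain objects (each with a key document and an optional unique flag). The helper names and the expected list are illustrative, not taken from the codebase.

```typescript
// Sketch: report expected indexes that are absent from a collection.
type IndexKey = Record<string, 1 | -1>;

interface IndexInfo {
  key: IndexKey;
  unique?: boolean;
  name?: string;
}

// Field order matters for compound indexes, so keep it in the signature.
function keySignature(key: IndexKey): string {
  return Object.entries(key)
    .map(([field, direction]) => `${field}:${direction}`)
    .join(",");
}

function missingIndexes(expected: IndexInfo[], actual: IndexInfo[]): IndexInfo[] {
  return expected.filter(
    (want) =>
      !actual.some(
        (have) =>
          keySignature(have.key) === keySignature(want.key) &&
          Boolean(have.unique) === Boolean(want.unique),
      ),
  );
}
```

An empty result means every expected index exists with the expected uniqueness; a non-unique index where a unique one was expected is reported as missing.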

5. Observability

  • Liveness/readiness: GET /api/health (checks Mongo). Wire it to the platform health probe.
  • Logs: the app logs unhandled API errors via console.error('[api] unhandled error', …) — ship stdout/stderr to the log aggregator.
  • Metrics/traces: an OTEL endpoint env var (OTEL_EXPORTER_OTLP_ENDPOINT) is reserved; wire an OpenTelemetry exporter and dashboards (latency, error rate, Mongo op time) before go-live.
  • Alerts: page on health-check failures, 5xx rate, and Atlas connection saturation. Link alerts to incident_response.md.
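
For wiring the health probe, it helps to know what a failing response looks like. The sketch below shows one way a health payload could be derived from the Mongoose connection state, where readyState 1 means connected. The success shape matches the response documented in §1; the degraded shape and the 503 status are assumptions, so confirm them against the real /api/health handler before configuring probes or alerts.

```typescript
// Sketch of a health payload builder (not the app's actual handler).
type DepState = "up" | "down";

interface HealthBody {
  status: "ok" | "degraded";
  deps: { mongo: DepState };
}

interface HealthResponse {
  httpStatus: number;
  body: HealthBody;
}

// mongoReadyState follows Mongoose's connection.readyState:
// 0 = disconnected, 1 = connected, 2 = connecting, 3 = disconnecting.
function buildHealth(mongoReadyState: number): HealthResponse {
  const mongo: DepState = mongoReadyState === 1 ? "up" : "down";
  if (mongo === "up") {
    return { httpStatus: 200, body: { status: "ok", deps: { mongo } } };
  }
  // Assumed failure shape: non-2xx so platform probes treat it as unhealthy.
  return { httpStatus: 503, body: { status: "degraded", deps: { mongo } } };
}
```

Returning a non-2xx status on dependency failure lets the platform probe and the paging alert key off the status code alone, without parsing the body.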

6. Backup & disaster recovery

  • Backup: Atlas PITR (above) + a scheduled mongodump to object storage as a second copy.
  • Restore drill (do before go-live): restore the latest snapshot into a parallel cluster, point a staging deploy at it, run §7 smoke checks, and record actual RPO/RTO. Document the result as DR_drill_v1.
  • Document the failover steps (promote restored cluster, rotate MONGODB_URI, redeploy).
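
Recording actual RPO/RTO from the drill comes down to three timestamps: the last write that survived the restore, the moment the (simulated) failure was declared, and the moment the restored service passed smoke checks. The sketch below computes both figures in minutes. The field names are illustrative, and the timestamps are expected as ISO 8601 strings.

```typescript
// Sketch: derive measured RPO and RTO from drill timestamps (ISO 8601).
interface DrillTimes {
  lastRecoveredWrite: string; // newest write present after the restore
  failureDeclared: string;    // when the simulated outage began
  serviceRestored: string;    // when smoke checks passed on the restored stack
}

interface DrillResult {
  rpoMinutes: number; // data-loss window
  rtoMinutes: number; // downtime window
}

function measureDrill(t: DrillTimes): DrillResult {
  const lastWrite = Date.parse(t.lastRecoveredWrite);
  const failure = Date.parse(t.failureDeclared);
  const restored = Date.parse(t.serviceRestored);
  if ([lastWrite, failure, restored].some((ms) => Number.isNaN(ms))) {
    throw new Error("all drill timestamps must be valid ISO 8601 strings");
  }
  const msPerMinute = 60_000;
  return {
    rpoMinutes: (failure - lastWrite) / msPerMinute,
    rtoMinutes: (restored - failure) / msPerMinute,
  };
}
```

Comparing these two numbers with the RPO/RTO targets gives the pass/fail line for the DR_drill_v1 record.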

7. Go-live smoke checklist

Run against the new release before cutover:

  1. GET /api/health → mongo: up.
  2. Log in as a real user → lands on /dashboard.
  3. List a few entities (/assets, /work-orders, /systems) — data renders, pagination works.
  4. Create + edit + delete a throwaway record (e.g. a tag) — round-trips and audits.
  5. Run a report → CSV and PDF download.
  6. AI chat answers a data question (if AI configured).
  7. AUTH_DEV_BYPASS is off — an unauthenticated API call returns 401.
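
Steps 1 and 7 are easy to script as a cutover gate. The sketch below evaluates two already-collected HTTP results: the response from GET /api/health and the response from any protected API route called without credentials. The HttpResult shape and the function names are assumptions. Fetching the responses is left to the caller so the same check works with any HTTP client.

```typescript
// Sketch: evaluate smoke steps 1 and 7 from collected HTTP results.
interface HttpResult {
  status: number;
  body?: unknown;
}

// Safely read body.deps.mongo from an untyped JSON body.
function mongoState(body: unknown): unknown {
  if (typeof body !== "object" || body === null) return undefined;
  const deps = (body as { deps?: unknown }).deps;
  if (typeof deps !== "object" || deps === null) return undefined;
  return (deps as { mongo?: unknown }).mongo;
}

function smokeFailures(health: HttpResult, unauthenticated: HttpResult): string[] {
  const failures: string[] = [];
  // Step 1: health endpoint responds and reports Mongo as up.
  if (health.status !== 200) {
    failures.push(`health returned HTTP ${health.status}`);
  } else if (mongoState(health.body) !== "up") {
    failures.push("health did not report mongo: up");
  }
  // Step 7: a call without credentials must be rejected with 401.
  if (unauthenticated.status !== 401) {
    failures.push(`unauthenticated call returned HTTP ${unauthenticated.status}, expected 401`);
  }
  return failures;
}
```

An empty list means both automated checks passed; the remaining steps (login, CRUD round-trip, report export, AI chat) still need a manual or browser-driven pass.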

8. Rollback

  • Switch the load-balancer/alias back to the previous release (kept warm from §3).
  • If a bad schema/index migration shipped, roll the code back first; Mongoose additive index changes are backward-compatible, but verify no field was renamed/removed in the bad release.
  • Capture a post-incident note via incident_response.md.

9. Outstanding before sign-off (G9)

External penetration test (no open Critical/High) · performance + soak pass · backup/restore + DR drill with documented RPO/RTO · production deploy pipeline + observability/alerting wired · operations & customer-success training · 5-day hypercare plan.