Production Readiness & Go-Live Runbook

Phase: 9 (QA, Security, Deployment & Handover) · Last updated: 2026-06-28 · Gate: G9

This runbook is the operational checklist for taking the platform from the verified dev build to production. It complements 01_Status.md (what's built) and incident_response.md.

1. Readiness checklist

Area	Status	Evidence / Notes
Production build	✅ Green	`pnpm -C apps/web build` exits 0 (lint + type-check + bundle)
Type safety	✅	`pnpm typecheck` clean across the workspace
API: auth on every route	✅	`getApiContext` + `requirePermission`; verified by audit
API: tenant isolation	✅	All queries pass `tenantScopedFilter`; no cross-tenant leak found
API: error handling	✅	`handle()` maps Zod / `ApiError` / Mongoose Cast/Validation/duplicate → 4xx; never leaks 500s for client errors
RBAC enforcement	✅	Scoped API keys denied ungranted reads/writes (403)
Input validation	✅ (data) / ⚠️ (feature)	CRUD entities Zod-validated; some AI/workflow routes hand-validate fields
Audit log	✅	Immutable; write actions recorded
Accessibility	✅ baseline	Focus-visible on shared controls, aria-labels on icon buttons, labelled inputs
i18n	✅	en / ja / zh; `lang` attribute set per tenant
API reference	✅	OpenAPI at `/api/openapi.json`, Swagger at `/api/docs`, Scalar at `/api/docs/scalar`
Health check	✅	`GET /api/health` → `{status:"ok",deps:{mongo:"up"}}`
External pentest	⏳ Outstanding	Required before sign-off (Phase 9 O9.2)
Perf / soak test	⏳ Outstanding	Load/spike/72h soak vs NFR targets
Backup + DR drill	⏳ Outstanding	See §6; Atlas backups + restore rehearsal
Prod deploy pipeline	⏳ Outstanding	Blue-green / canary; see §3
Observability + alerting	⏳ Partial	Health + logs present; OTEL hook unconfigured (see §5)

Honest summary: the application code is production-grade and verified in the dev environment against MongoDB Atlas, with one caveat flagged above (⚠️ some AI/workflow routes hand-validate their fields). The outstanding items (⏳) are infrastructure and process deliverables that live outside the codebase (deployment platform, monitoring stack, pentest engagement, DR rehearsal). They must be completed before a real go-live.
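To make the "API: error handling" row concrete, the mapping it describes can be sketched as below. This is an illustration only, not the project's `handle()`: the `ApiError` class here is a hypothetical stand-in, and the specific status codes (400, 409, 422) are assumptions, since the checklist only says these errors map to 4xx. The `name` values and the duplicate-key code 11000 are the ones Zod, Mongoose and MongoDB use.

```typescript
// Hypothetical stand-in for the project's ApiError; the real class may differ.
class ApiError extends Error {
  status: number;
  constructor(status: number, message: string) {
    super(message);
    this.name = "ApiError";
    this.status = status;
  }
}

interface ErrorLike {
  name?: string;
  code?: number;
  status?: number;
}

// Map a thrown value to an HTTP status. The exact 4xx codes are assumptions.
function statusFor(err: unknown): number {
  if (typeof err !== "object" || err === null) return 500;
  const e = err as ErrorLike;
  if (e.name === "ApiError" && typeof e.status === "number") return e.status;
  if (e.name === "ZodError") return 400; // request failed schema validation
  if (e.name === "CastError") return 400; // e.g. malformed ObjectId in a path param
  if (e.name === "ValidationError") return 422; // Mongoose schema validation
  if (e.code === 11000) return 409; // MongoDB duplicate-key error
  return 500; // genuinely unexpected: log it and let alerting see it
}
```

Checking `name` rather than `instanceof` keeps the mapping working when an error crosses a bundling boundary and arrives as a different class instance.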

2. Environment variables

Set these in the production secret store (never in the repo). Full list in .env.example.

Required: MONGODB_URI, MONGODB_DB_NAME, AUTH_SECRET (strong random), NEXT_PUBLIC_APP_URL, NODE_ENV=production.
Disable in prod: AUTH_DEV_BYPASS (must be unset/false), SEED_OWNER_PASSWORD.
AI (optional): AI_PROVIDER, AI_MODEL, OPENAI_API_KEY / ANTHROPIC_API_KEY.
Push (optional): VAPID public/private keys for web push.
Future infra: S3_* (document storage), OTEL_EXPORTER_OTLP_ENDPOINT (traces/metrics), REDIS_URL (if moving workflow/maintenance engines out of process).
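The rules above could be enforced at boot rather than by inspection. The following is a minimal sketch, assuming a hypothetical `checkProdEnv` helper that is not part of the codebase; the 32-character minimum for AUTH_SECRET is an assumed reading of "strong random", not a documented requirement.

```typescript
type Env = Record<string, string | undefined>;

const REQUIRED = ["MONGODB_URI", "MONGODB_DB_NAME", "AUTH_SECRET", "NEXT_PUBLIC_APP_URL"];

// Returns a list of human-readable problems; an empty list means the env is acceptable.
function checkProdEnv(env: Env): string[] {
  const problems: string[] = [];
  for (const key of REQUIRED) {
    if (!env[key]) problems.push(key + " is missing");
  }
  if (env["NODE_ENV"] !== "production") problems.push("NODE_ENV must be production");

  // AUTH_DEV_BYPASS must be unset or explicitly false.
  const bypass = (env["AUTH_DEV_BYPASS"] ?? "").toLowerCase();
  if (bypass !== "" && bypass !== "false") problems.push("AUTH_DEV_BYPASS must be unset or false");

  // The seed password is a dev/bootstrap convenience and must not exist in prod.
  if (env["SEED_OWNER_PASSWORD"]) problems.push("SEED_OWNER_PASSWORD must not be set");

  // Assumed threshold: treat anything under 32 characters as too weak.
  const secret = env["AUTH_SECRET"];
  if (secret && secret.length < 32) problems.push("AUTH_SECRET is shorter than 32 characters");

  return problems;
}
```

A startup hook would call this with the process environment and refuse to boot if the list is non-empty, so a misconfigured release fails before it can take traffic.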

3. Build & deploy

pnpm install --frozen-lockfile
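pnpm -C apps/web build      # standalone output (requires output: 'standalone' in the Next.js config)
# Node host:
# (standalone output omits public/ and .next/static; copy them next to server.js or serve them from a CDN)
node apps/web/.next/standalone/server.js   # behind a TLS-terminating reverse proxy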

Target a Node 22 LTS runtime (Vercel, a container, or a managed Node host). Per project constraint there is no Docker setup in-repo; add one at deploy time if the chosen platform requires it.
Roll out blue-green (or canary): deploy the new version alongside the old, run §7 smoke checks against it, then switch the load-balancer/alias. Keep the previous release warm for instant rollback.
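The cutover rule above (switch only after the smoke checks pass, keep the old release warm) can be modelled as a small state transition. This is a sketch under assumptions: `Router`, `cutover` and `rollback` are hypothetical names, and a real deployment would translate them into load-balancer or alias operations on the chosen platform.

```typescript
interface Router {
  live: string; // release currently receiving traffic
  previous: string | null; // release kept warm for instant rollback
}

// Promote the candidate only if its smoke checks passed; otherwise hold.
function cutover(router: Router, candidate: string, smokePassed: boolean): Router {
  if (!smokePassed) return router;
  return { live: candidate, previous: router.live };
}

// Point traffic back at the warm release. The bad release is not kept as a
// rollback target, so a second rollback cannot land on it by accident.
function rollback(router: Router): Router {
  if (router.previous === null) {
    throw new Error("no previous release kept warm");
  }
  return { live: router.previous, previous: null };
}
```

Keeping both functions pure makes the sequence easy to rehearse in staging before it is wired to real traffic.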

4. Database (MongoDB Atlas)

Use a dedicated production cluster + database; the app user needs readWrite only on its DB.
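Indexes are declared in the Mongoose models (tenantId-first compound indexes, unique (tenantId, code) on assets/work orders) and created on connect. This relies on Mongoose's autoIndex behaviour, which is on by default, so confirm it has not been disabled for production and that the indexes exist after first boot (db.<col>.getIndexes()).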
Enable Atlas continuous backups (PITR) with retention meeting the RPO target.
Restrict network access to the app's egress IPs; enable Atlas auditing.
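The post-boot index check above can be automated by comparing the expected index definitions with what `getIndexes()` returns. The sketch below assumes the result has the usual shape (an array of objects with a `key` document and an optional `unique` flag); `missingIndexes` is a hypothetical helper, and any field name other than tenantId and code is illustrative.

```typescript
interface IndexInfo {
  key: Record<string, number | string>;
  unique?: boolean;
  name?: string;
}

interface ExpectedIndex {
  key: Record<string, number>;
  unique?: boolean;
}

// Field order matters for compound indexes, so the signature preserves it.
function keySignature(key: Record<string, number | string>): string {
  return Object.keys(key)
    .map((field) => field + ":" + String(key[field]))
    .join(",");
}

// Return every expected index that has no matching index in the collection.
function missingIndexes(expected: ExpectedIndex[], actual: IndexInfo[]): ExpectedIndex[] {
  return expected.filter((want) => {
    const found = actual.some(
      (have) =>
        keySignature(have.key) === keySignature(want.key) &&
        (!want.unique || have.unique === true)
    );
    return !found;
  });
}
```

Feeding it the `getIndexes()` output for each collection and failing the deploy on a non-empty result catches a silently skipped index build before tenant-scoped queries start scanning.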

5. Observability

Liveness/readiness: GET /api/health (checks Mongo). Wire it to the platform health probe.
Logs: the app logs unhandled API errors via console.error('[api] unhandled error', …) — ship stdout/stderr to the log aggregator.
Metrics/traces: an OTEL endpoint env var (OTEL_EXPORTER_OTLP_ENDPOINT) is reserved; wire an OpenTelemetry exporter and dashboards (latency, error rate, Mongo op time) before go-live.
Alerts: page on health-check failures, 5xx rate, and Atlas connection saturation. Link alerts to incident_response.md.
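As an illustration of how the probe and the paging rule above fit together, here is a minimal sketch. It assumes the health body shape from the readiness checklist (`{status:"ok",deps:{mongo:"up"}}`); the function names and the thresholds passed in are hypothetical placeholders, not agreed NFR values.

```typescript
interface HealthBody {
  status: string;
  deps: Record<string, unknown>;
}

function isHealthBody(x: unknown): x is HealthBody {
  if (typeof x !== "object" || x === null) return false;
  const o = x as { status?: unknown; deps?: unknown };
  return typeof o.status === "string" && typeof o.deps === "object" && o.deps !== null;
}

// Healthy means HTTP 200, status "ok", and every dependency reporting "up".
function isHealthy(httpStatus: number, body: unknown): boolean {
  if (httpStatus !== 200 || !isHealthBody(body)) return false;
  const deps = body.deps;
  return body.status === "ok" && Object.keys(deps).every((name) => deps[name] === "up");
}

interface WindowStats {
  consecutiveHealthFailures: number;
  requests: number;
  serverErrors: number; // responses with a 5xx status in the window
}

interface AlertThresholds {
  healthFailuresToPage: number; // placeholder value, to be agreed with ops
  max5xxRate: number; // fraction between 0 and 1, placeholder value
}

// Page when the probe keeps failing or the 5xx rate exceeds the threshold.
function shouldPage(stats: WindowStats, t: AlertThresholds): boolean {
  const rate = stats.requests === 0 ? 0 : stats.serverErrors / stats.requests;
  return stats.consecutiveHealthFailures >= t.healthFailuresToPage || rate > t.max5xxRate;
}
```

Atlas connection saturation is not modelled here; it would normally come from Atlas's own alerting rather than from the application.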

6. Backup & disaster recovery

Backup: Atlas PITR (above) + a scheduled mongodump to object storage as a second copy.
Restore drill (do before go-live): restore the latest snapshot into a parallel cluster, point a staging deploy at it, run §7 smoke checks, and record the achieved recovery point (data lost) and recovery time (time to restore service) against the RPO/RTO targets. Document the result as DR_drill_v1.
Document the failover steps (promote restored cluster, rotate MONGODB_URI, redeploy).
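The numbers recorded in the drill reduce to two subtractions. The sketch below shows one way to compute them from three timestamps noted during the rehearsal; the type and function names are hypothetical, and the targets used in any call are placeholders because this runbook does not state the RPO/RTO values.

```typescript
interface DrillTimes {
  incidentAt: number; // epoch ms: simulated moment of failure
  lastRecoverableAt: number; // epoch ms: newest data present after the restore
  serviceRestoredAt: number; // epoch ms: smoke checks green on the restored cluster
}

interface Targets {
  rpoMinutes: number;
  rtoMinutes: number;
}

interface DrillResult {
  dataLossMinutes: number;
  downtimeMinutes: number;
  meetsRpo: boolean;
  meetsRto: boolean;
}

const MS_PER_MINUTE = 60000;

function evaluateDrill(t: DrillTimes, targets: Targets): DrillResult {
  // Achieved recovery point: how much data, measured in time, was lost.
  const dataLossMinutes = (t.incidentAt - t.lastRecoverableAt) / MS_PER_MINUTE;
  // Achieved recovery time: how long the service was unavailable.
  const downtimeMinutes = (t.serviceRestoredAt - t.incidentAt) / MS_PER_MINUTE;
  return {
    dataLossMinutes,
    downtimeMinutes,
    meetsRpo: dataLossMinutes <= targets.rpoMinutes,
    meetsRto: downtimeMinutes <= targets.rtoMinutes,
  };
}
```

Recording the three raw timestamps in DR_drill_v1, not just the pass/fail outcome, lets later drills be compared against the first one.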

7. Go-live smoke checklist

Run against the new release before cutover:

GET /api/health → mongo: up.
Log in as a real user → lands on /dashboard.
List a few entities (/assets, /work-orders, /systems) — data renders, pagination works.
Create + edit + delete a throwaway record (e.g. a tag) — round-trips and audits.
Run a report → CSV and PDF download.
AI chat answers a data question (if AI configured).
AUTH_DEV_BYPASS is off — an unauthenticated API call returns 401.
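The API-level items in this checklist (health and the 401 check) lend themselves to a scripted probe; the browser-level items (login, reports, AI chat) still need a person or an end-to-end test tool. A minimal sketch, assuming `/api/assets` is a protected API route (the checklist names `/assets` only as a page) and that the caller injects a fetch-compatible function:

```typescript
interface SmokeCheck {
  name: string;
  path: string;
  expectStatus: number;
}

interface SmokeResult {
  name: string;
  ok: boolean;
  detail: string;
}

// Minimal structural type so the runner works with the global fetch or a test double.
type FetchLike = (url: string) => Promise<{ status: number }>;

const CHECKS: SmokeCheck[] = [
  { name: "health", path: "/api/health", expectStatus: 200 },
  // Assumed route: any protected API endpoint works for the bypass check.
  { name: "unauthenticated call rejected", path: "/api/assets", expectStatus: 401 },
];

// Run every check in order and collect results instead of stopping at the first failure.
async function runSmoke(
  baseUrl: string,
  fetchFn: FetchLike,
  checks: SmokeCheck[] = CHECKS
): Promise<SmokeResult[]> {
  const results: SmokeResult[] = [];
  for (const check of checks) {
    try {
      const res = await fetchFn(baseUrl + check.path);
      results.push({
        name: check.name,
        ok: res.status === check.expectStatus,
        detail: "got " + res.status + ", expected " + check.expectStatus,
      });
    } catch (err) {
      results.push({ name: check.name, ok: false, detail: String(err) });
    }
  }
  return results;
}

// An empty result list is treated as a failure: no evidence is not a pass.
function readyForCutover(results: SmokeResult[]): boolean {
  return results.length > 0 && results.every((r) => r.ok);
}
```

On Node 22 the global `fetch` can be passed directly as `fetchFn`, with `baseUrl` pointing at the new release before the alias is switched.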

8. Rollback

Switch the load-balancer/alias back to the previous release (kept warm from §3).
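If a bad schema/index migration shipped, roll the code back first. Additive index changes declared in the Mongoose models are backward-compatible with the previous release, but verify that the bad release did not rename or remove a field, because the old code would then read or write documents in a shape it does not expect.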
Capture a post-incident note via incident_response.md.
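The field check in the rollback step can be done mechanically by diffing the field lists of the two releases. This is a sketch with hypothetical helper names; the field lists would have to be exported from each release's models (in Mongoose, for example, from `Object.keys(schema.paths)`), and the field names in any example are illustrative.

```typescript
// Fields the previous release relies on that the new release no longer defines.
function removedFields(previous: string[], next: string[]): string[] {
  return previous.filter((field) => next.indexOf(field) === -1);
}

// Fields introduced by the new release. A rename shows up as one removed
// field plus one added field, so both lists are worth reading together.
function addedFields(previous: string[], next: string[]): string[] {
  return next.filter((field) => previous.indexOf(field) === -1);
}

// Rolling back is only safe from a schema point of view if nothing was removed.
function safeToRollBack(previous: string[], next: string[]): boolean {
  return removedFields(previous, next).length === 0;
}
```

A non-empty `removedFields` result does not block the rollback by itself; it signals that documents written by the bad release may need a data fix before the old code reads them.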

9. Outstanding before sign-off (G9)

External penetration test (no open Critical/High) · performance + soak pass · backup/restore + DR drill with documented RPO/RTO · production deploy pipeline + observability/alerting wired · operations & customer-success training · 5-day hypercare plan.