Phase 9 — QA, Security, Deployment & Handover
Status: In progress 🚧 · Owner: QA Lead + DevOps Lead · Duration: 2 weeks · Gate: G9
1. Overview
Phase 9 takes the assembled product through the final-quality bar and into production: end-to-end QA across all modules, third-party penetration test, performance + soak testing, security hardening, DR drill, production deployment, hypercare, training, and handover to operations and customer success. This is the gate that says "the platform is ready to take real customer load."
2. Objectives
- O9.1 — Full-system QA pass: functional, integration, e2e, performance, accessibility, i18n.
- O9.2 — Independent penetration test with no Critical or High findings open at sign-off.
- O9.3 — Performance + soak test pass against published NFRS targets.
- O9.4 — Backup, restore, and disaster-recovery drills passed.
- O9.5 — Production environment hardened: secrets, IAM, network, observability, alerting.
- O9.6 — Documentation set complete: runbooks, API, admin, user guides, in-app help.
- O9.7 — Successful go-live with 5-day hypercare and zero Sev-1.
- O9.8 — Operations and Customer Success teams trained and signed off.
3. Scope
3.1 In-scope
- Test execution across all phases (regression of P0–P8).
- DAST + SAST + dependency scan + SBOM in CI; pentest by an external firm.
- Performance: load (steady-state targets), spike (3× burst), soak (72-hour run).
- DR: full backup + restore in a parallel environment; documented RPO/RTO.
- Production deployment pipeline (blue-green or canary).
- Observability: alerts, dashboards, runbook links.
- Compliance: ISO 27001 / 27017 / 27018 alignment review; APPI / GDPR DPIA finalised.
- Documentation pass.
- Training (engineering ops, customer success, sales-engineer enablement).
- Hypercare for 5 working days.
3.2 Out-of-scope
- Customer-specific feature work.
- Acceptance testing by a real customer (separate engagement; the KTC implementation in `KTC_SDLC_Technical_Document.docx` is the first such engagement and runs against this platform).
4. Dependencies
- All prior phases (P0–P8) completed and gates signed.
5. Test programme
5.1 Functional regression
- Re-run every Acceptance Criterion from P0–P8 in the staging environment.
- Capture results in `docs/quality/regression_run_v1.md`.
5.2 End-to-end scenarios
- Scenario A — Tenant onboarding → user invite → import asset ledger → connect gateway → see telemetry on dashboard.
- Scenario B — Operator creates a planned maintenance workflow → cron triggers → work orders generated → assigned → completed → KPIs update on dashboard.
- Scenario C — Anomaly fires → AI explains → workflow creates P2 work order → mobile approval → executed offline → synced.
- Scenario D — Auditor pulls a monthly KPI report; AI chat answers compliance questions from RAG.
5.3 Performance
- Steady-state load: NFRS targets sustained for 30 minutes (API, ingest, dashboard render).
- Spike: 3× burst for 5 minutes; latency degrades gracefully and recovers.
- Soak: 72 hours at 60% of steady-state; no memory leak, no error rate growth.
- Tools: k6 for HTTP + WebSocket; Playwright for browser soak runs.
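The soak criterion above ("no memory leak, no error rate growth") can be made mechanical by fitting a trend line to samples taken during the 72-hour run. The following is a minimal sketch, not the project's actual tooling: the function names, the sample format, and both thresholds (`max_rss_growth_pct`, `max_error_rate_growth`) are assumptions chosen for illustration.

```python
from typing import Sequence


def slope_per_hour(hours: Sequence[float], values: Sequence[float]) -> float:
    """Ordinary least-squares slope of `values` against elapsed hours."""
    n = len(hours)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired samples")
    mean_x = sum(hours) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in hours)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, values))
    return sxy / sxx


def soak_passes(
    hours: Sequence[float],
    rss_mb: Sequence[float],
    error_rate: Sequence[float],
    max_rss_growth_pct: float = 5.0,       # hypothetical threshold
    max_error_rate_growth: float = 0.001,  # hypothetical threshold
) -> bool:
    """Pass when the fitted trends over the whole run stay under both thresholds."""
    duration = max(hours) - min(hours)
    # Projected growth across the run, from the fitted slope.
    rss_growth_mb = slope_per_hour(hours, rss_mb) * duration
    rss_growth_pct = 100.0 * rss_growth_mb / rss_mb[0]
    error_growth = slope_per_hour(hours, error_rate) * duration
    return rss_growth_pct <= max_rss_growth_pct and error_growth <= max_error_rate_growth
```

Using a fitted slope rather than a first-versus-last comparison makes the check less sensitive to a single noisy sample at either end of the run.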
5.4 Security
- SAST + dependency scan + SBOM in CI (already from P0).
- DAST: OWASP ZAP baseline + active scans in staging.
- Pentest: external firm, 2-week engagement, scope = production-like staging. Findings triaged: Critical/High → must close; Medium → close or accept with rationale; Low → backlog.
- Threat model review: revisit STRIDE from Phase 1; document changes.
- Compliance: ISO 27001 / 27017 / 27018 control mapping; APPI / GDPR DPIA final.
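The pentest triage rule above (Critical/High must close; Medium closes or is accepted with rationale; Low goes to backlog) can be expressed as a small gate check. This is a sketch under assumptions: the `Finding` record, its field names, and the idea of evaluating the rule in code are illustrative and not taken from the pentest firm's report format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional


class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class Finding:
    ident: str                               # hypothetical finding identifier
    severity: Severity
    closed: bool = False
    accepted_rationale: Optional[str] = None  # only meaningful for Medium


def blocking_findings(findings: Iterable[Finding]) -> List[Finding]:
    """Return the findings that still block sign-off under the triage rule."""
    blockers = []
    for f in findings:
        if f.closed:
            continue
        if f.severity in (Severity.CRITICAL, Severity.HIGH):
            blockers.append(f)  # must be closed, no acceptance route
        elif f.severity is Severity.MEDIUM and not (f.accepted_rationale or "").strip():
            blockers.append(f)  # open Medium needs a written rationale
        # Open Low findings go to the backlog and never block.
    return blockers
```

An empty result from `blocking_findings` corresponds to the sign-off condition in O9.2.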
5.5 DR drill
- Snapshot prod-like staging.
- Restore to a parallel env from latest backup.
- Validate RPO (≤ 15 min) and RTO (≤ 4 h).
- Document deltas; update runbook.
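Validating the drill comes down to two time differences: the achieved RPO is the gap between the last write that survived the restore and the simulated incident, and the achieved RTO is the gap between the incident and the moment service is restored. A minimal sketch follows; the function name and the three timestamp parameters are assumptions for illustration, while the targets come from the list above.

```python
from datetime import datetime, timedelta

RPO_TARGET = timedelta(minutes=15)  # from the drill targets above
RTO_TARGET = timedelta(hours=4)


def evaluate_drill(
    last_recoverable_write: datetime,
    incident_at: datetime,
    service_restored_at: datetime,
) -> dict:
    """Compute achieved RPO/RTO for one drill and compare them to the targets."""
    achieved_rpo = incident_at - last_recoverable_write
    achieved_rto = service_restored_at - incident_at
    return {
        "rpo": achieved_rpo,
        "rto": achieved_rto,
        "rpo_met": achieved_rpo <= RPO_TARGET,
        "rto_met": achieved_rto <= RTO_TARGET,
    }
```

Recording the raw timestamps in the drill report, rather than only pass/fail, keeps the deltas available for the runbook update.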
5.6 Accessibility & i18n
- Axe-core on every route; no Critical/Serious violations.
- Manual screen-reader walkthrough (NVDA + VoiceOver).
- Translation coverage check (English + Japanese) — no untranslated strings on user-facing routes.
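The translation coverage check can be automated by comparing message catalogs key by key. The sketch below assumes catalogs are nested dictionaries of strings (for example, loaded from JSON locale files); that layout and the function names are assumptions, since the catalog format is not specified here.

```python
from typing import Dict, List


def flatten(catalog: dict, prefix: str = "") -> Dict[str, str]:
    """Flatten a nested message catalog into dotted keys."""
    flat: Dict[str, str] = {}
    for key, value in catalog.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = str(value)
    return flat


def missing_translations(source: dict, target: dict) -> List[str]:
    """Keys present in the source locale but absent or blank in the target."""
    src = flatten(source)
    tgt = flatten(target)
    return sorted(k for k in src if not tgt.get(k, "").strip())
```

For example, running `missing_translations(en, ja)` in CI and failing on a non-empty result would enforce the "no untranslated strings" rule for keys; it does not detect strings hard-coded in components, which still need a manual pass.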
6. Deployment
6.1 Topology
- Cloud platform per ADR-002 (decided in Phase 0).
- Environments: dev, staging, production.
- Production sizing: 3× app instances behind LB; MongoDB Atlas M30+ (or dedicated cluster); Redis cluster; S3 bucket with versioning + lifecycle; CDN in front.
- Region: primary in the agreed region (residency-aligned); DR target in a different AZ at minimum.
- Secrets: managed secret store (no plaintext secrets in environment files or on disk).
6.2 Release strategy
- Blue-green preferred (toggle via LB / DNS).
- Canary option for risky releases (5% → 25% → 100% over 24h).
- Automatic rollback on SLO breach.
- Database migrations: forward-only, backward-compatible per release.
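The canary progression with automatic rollback can be sketched as a short control loop. This is an illustration under assumptions, not the deployment pipeline itself: `set_traffic_percent` and `slo_healthy` are hypothetical callables standing in for the load balancer API and the SLO check, and a real run would hold each step for a bake period before evaluating health.

```python
from typing import Callable, Sequence

CANARY_STEPS = (5, 25, 100)  # percentages from the release strategy above


def run_canary(
    set_traffic_percent: Callable[[int], None],
    slo_healthy: Callable[[int], bool],
    steps: Sequence[int] = CANARY_STEPS,
) -> bool:
    """Advance through the canary steps; roll back to 0% on the first SLO breach.

    Returns True when the release reaches the final step, False after a rollback.
    """
    for percent in steps:
        set_traffic_percent(percent)
        if not slo_healthy(percent):
            set_traffic_percent(0)  # automatic rollback: new version gets no traffic
            return False
    return True
```

Keeping the traffic switch and the health check as injected callables makes the same loop usable in a rehearsal on a non-production slot.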
6.3 Observability in production
- Logs → Loki / vendor; structured, sampled.
- Metrics → Prometheus; alerts on SLO burn.
- Traces → Tempo / vendor; 10% sample.
- Status page (public) — auto-updated from health probes.
- On-call rota with PagerDuty / Opsgenie.
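"Alerts on SLO burn" usually means alerting on burn rate: the observed error rate divided by the error budget implied by the SLO target. A burn rate of 1 consumes the budget exactly over the SLO period; higher values exhaust it sooner. The sketch below is illustrative only: the function names, the two-window paging rule, and the default threshold of 14.4 are assumptions, not values from the NFRS.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - SLO target)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target
    return (failed / total) / error_budget


def should_page(short_window_burn: float, long_window_burn: float,
                threshold: float = 14.4) -> bool:  # hypothetical default
    """Page only when both windows exceed the threshold.

    Requiring the short window as well lets the alert clear quickly once the
    burn stops; requiring the long window filters out brief blips.
    """
    return short_window_burn >= threshold and long_window_burn >= threshold
```

For example, with a 99.9% target, 20 failures in 1,000 requests is a 2% error rate against a 0.1% budget, so `burn_rate(20, 1000, 0.999)` is roughly 20.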
6.4 Cutover plan (production)
- Final regression on staging passes.
- Pentest findings closed.
- Backup taken.
- Maintenance window scheduled.
- Deploy via blue-green.
- Smoke tests post-cutover.
- 5-day hypercare with daily standup.
6.5 Hypercare
- Heightened on-call.
- Daily standup with engineering + product + ops.
- Per-day report (incidents, SLA, customer signals).
- Exit criteria: zero Sev-1 for 5 days, SLOs met, customer signals stable.
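The exit criteria can be evaluated directly from the per-day reports. The sketch below assumes each report is reduced to three fields and that "5 days" means the five most recent consecutive days; the `DailyReport` shape and field names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Sequence

HYPERCARE_DAYS = 5  # from the exit criteria above


@dataclass(frozen=True)
class DailyReport:
    sev1_incidents: int
    slos_met: bool
    customer_signals_stable: bool


def hypercare_can_exit(reports: Sequence[DailyReport],
                       required_days: int = HYPERCARE_DAYS) -> bool:
    """True when the most recent `required_days` reports are all clean."""
    if len(reports) < required_days:
        return False
    window = reports[-required_days:]
    return all(
        r.sev1_incidents == 0 and r.slos_met and r.customer_signals_stable
        for r in window
    )
```

Because only the trailing window is checked, a Sev-1 on day 1 extends hypercare until five clean days follow it, rather than failing the phase outright.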
7. Training & handover
7.1 Training audiences
- Customer Success (admin features, troubleshooting, ticket triage).
- Sales Engineering (demo paths, KTC reference walkthrough).
- Engineering Ops (runbooks, on-call, incident response).
- Customer trainers ("train-the-trainer" material).
7.2 Materials
- Recorded modules (≤15 min each) per topic.
- Quick-reference cheat sheets.
- Sandbox tenant with KTC reference data for hands-on.
7.3 Handover
- All runbooks reviewed and current.
- All ADRs in `Accepted` or `Superseded` state; no `Proposed` ADRs open.
- All Phase docs (`Phase_0.md` → `Phase_9.md`) tagged final.
- Backlog handover: open items, deferred decisions, customer-specific TODOs.
8. Documentation completeness checklist
- FRS, NFRS, System Architecture, Data Model, Security Architecture, Threat Model
- Every phase doc (P0–P9) reviewed and final
- All ADRs accepted or superseded
- All runbooks current: auth, API keys, gateway, ingestion, dashboard, workflow, AI, mobile, incident response, DR, on-call
- User guides: admin, operator, developer, mobile
- OpenAPI spec at `v1.0`; SDK at matching version
- In-app help content reviewed and translated (EN + JA)
- Public docs site (Docusaurus / similar) live with all of the above
- Changelog at v1.0
9. Acceptance Criteria (gate G9)
- AC9.1 — All functional acceptance criteria from P0–P8 re-pass in regression.
- AC9.2 — Pentest report shows no Critical / High findings open; Medium / Low triaged with rationale.
- AC9.3 — Performance NFRS targets met in load, spike, and soak tests.
- AC9.4 — DR drill completed with RPO ≤ 15 min and RTO ≤ 4 h recorded.
- AC9.5 — Production deployed via blue-green; smoke tests pass; rollback procedure exercised on a non-prod slot to verify.
- AC9.6 — Five days of hypercare completed with zero Sev-1 and SLA met.
- AC9.7 — Documentation checklist (§8) 100% complete.
- AC9.8 — Training delivered and acknowledged by Customer Success, Sales Engineering, Engineering Ops.
10. Test Requirements
- Coverage gates retained: ≥80% unit on business logic across the repo.
- Contract tests pass against published OpenAPI v1.0.
- All e2e scenarios in §5.2 pass.
- Lighthouse mobile gates retained.
11. Documentation Requirements
- `docs/quality/regression_run_v1.md`
- `docs/quality/perf_report_v1.md`
- `docs/security/pentest_report_v1.md` (with closure log)
- `docs/security/threat_model_final.md`
- `docs/operations/dr_drill_v1.md`
- `docs/operations/cutover_v1.md`
- `docs/operations/hypercare_v1.md`
- `docs/training/customer_success.md`, `sales_engineering.md`, `engineering_ops.md`
- `CHANGELOG.md` at v1.0
- `docs/release_notes/v1.0.md`
12. Sign-off Criteria (Gate G9)
- All Acceptance Criteria met.
- Production go/no-go meeting held with Engineering Lead, Product Owner, Security Lead, DevOps Lead, Customer Success Lead, Sales Engineering Lead.
- Signed `_gates/Gate_G9_signoff.md`.
- Tagged `v1.0.0` and announced.
13. Risks & Mitigations
| Risk | L | I | Mitigation |
|---|---|---|---|
| Pentest finds Critical close to gate | 3 | 5 | Pentest scheduled to start at end of P7 so closure window exists in P9. |
| Performance under load fails NFRS | 2 | 4 | Spikes already done in earlier phases; tuning playbook ready. |
| DR drill exposes backup gap | 2 | 4 | Tabletop DR exercise in P2 is meant to surface gaps early; the drill is the formal proof. |
| Production secret leak during cutover | 1 | 5 | Pre-cutover checklist; rotation script ready; audit during cutover. |
| Hypercare extends due to instability | 3 | 3 | Phase 10 (post-launch hardening) reserved in calendar even though not in scope here. |
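
The table can be ranked if L and I are read as likelihood and impact on a 1 to 5 scale and combined as a product. That reading and the product scoring are assumptions; the table itself does not define the columns. A minimal sketch using the rows above:

```python
from typing import List, Tuple

# (risk, L, I) copied from the table above.
RISKS: List[Tuple[str, int, int]] = [
    ("Pentest finds Critical close to gate", 3, 5),
    ("Performance under load fails NFRS", 2, 4),
    ("DR drill exposes backup gap", 2, 4),
    ("Production secret leak during cutover", 1, 5),
    ("Hypercare extends due to instability", 3, 3),
]


def ranked_by_score(risks: List[Tuple[str, int, int]]) -> List[Tuple[str, int]]:
    """Score each risk as L * I (assumed scheme) and sort highest first.

    The sort is stable, so equal scores keep their table order.
    """
    scored = [(name, likelihood * impact) for name, likelihood, impact in risks]
    return sorted(scored, key=lambda item: -item[1])
```

Under this scheme the pentest risk scores 15 and ranks first, followed by hypercare instability at 9.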
14. After v1.0
This phase ends the build plan; post-launch is its own roadmap document. Likely first-30-day priorities:
- Customer-driven hardening (any production incident).
- Predictive maintenance use cases (AI Phase 2).
- Native mobile shell via Capacitor (if PWA gaps surface).
- Marketplace / public widget API (post-launch).
- Customer-specific connectors per inbound demand.