
Phase 9 — QA, Security, Deployment & Handover

Status: In progress 🚧 · Owner: QA Lead + DevOps Lead · Duration: 2 weeks · Gate: G9

1. Overview

Phase 9 takes the assembled product through the final quality bar and into production: end-to-end QA across all modules, a third-party penetration test, performance and soak testing, security hardening, a DR drill, production deployment, hypercare, training, and handover to operations and customer success. This is the gate that says "the platform is ready to take real customer load."

2. Objectives

  • O9.1 — Full-system QA pass: functional, integration, e2e, performance, accessibility, i18n.
  • O9.2 — Independent penetration test with no Critical or High findings open at sign-off.
  • O9.3 — Performance + soak test pass against published NFRS targets.
  • O9.4 — Backup, restore, and disaster-recovery drills passed.
  • O9.5 — Production environment hardened: secrets, IAM, network, observability, alerting.
  • O9.6 — Documentation set complete: runbooks, API, admin, user guides, in-app help.
  • O9.7 — Successful go-live with 5-day hypercare and zero Sev-1.
  • O9.8 — Operations and Customer Success teams trained and signed off.

3. Scope

3.1 In-scope

  • Test execution across all phases (regression of P0–P8).
  • DAST + SAST + dependency scan + SBOM in CI; pentest by an external firm.
  • Performance: load (steady-state targets), spike (3× burst), soak (72-hour run).
  • DR: full backup + restore in a parallel environment; documented RPO/RTO.
  • Production deployment pipeline (blue-green or canary).
  • Observability: alerts, dashboards, runbook links.
  • Compliance: ISO 27001 / 27017 / 27018 alignment review; APPI / GDPR DPIA finalised.
  • Documentation pass.
  • Training (engineering ops, customer success, sales-engineer enablement).
  • Hypercare for 5 working days.

3.2 Out-of-scope

  • Customer-specific feature work.
  • Acceptance testing by a real customer (separate engagement; the KTC implementation in KTC_SDLC_Technical_Document.docx is the first such engagement and runs against this platform).

4. Dependencies

  • All prior phases (P0–P8) completed and gates signed.

5. Test programme

5.1 Functional regression

  • Re-run every Acceptance Criterion from P0–P8 in the staging environment.
  • Capture results in docs/quality/regression_run_v1.md.

5.2 End-to-end scenarios

  • Scenario A — Tenant onboarding → user invite → import asset ledger → connect gateway → see telemetry on dashboard.
  • Scenario B — Operator creates a planned maintenance workflow → cron triggers → work orders generated → assigned → completed → KPIs update on dashboard.
  • Scenario C — Anomaly fires → AI explains → workflow creates P2 work order → mobile approval → executed offline → synced.
  • Scenario D — Auditor pulls a monthly KPI report; AI chat answers compliance questions from RAG.

5.3 Performance

  • Steady-state load: NFRS targets sustained for 30 minutes (API, ingest, dashboard render).
  • Spike: 3× burst for 5 minutes; latency degrades gracefully and recovers.
  • Soak: 72 hours at 60% of steady-state; no memory leak, no error rate growth.
  • Tools: k6 for HTTP + WebSocket; Playwright for browser soak.
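
The soak criterion above ("no memory leak, no error rate growth") can be made checkable by fitting a trend line to samples collected during the run. The following is a minimal sketch, not the project's test harness: the function names, the (hour, value) sample format, and both growth thresholds are assumptions for illustration. Only the multipliers and durations in load_profile come from the bullets above.

```python
from statistics import mean


def load_profile(steady_rps: float) -> dict[str, dict[str, float]]:
    """Derive the three test stages from a steady-state request rate.

    Multipliers and durations follow section 5.3; the dict layout is illustrative.
    """
    return {
        "steady": {"rps": steady_rps, "minutes": 30},
        "spike": {"rps": steady_rps * 3, "minutes": 5},
        "soak": {"rps": steady_rps * 0.6, "minutes": 72 * 60},
    }


def slope_per_hour(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of value against elapsed hours."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    xs = [t for t, _ in samples]
    ys = [v for _, v in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:
        raise ValueError("samples must span more than one point in time")
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / denom


def soak_passes(
    memory_mb: list[tuple[float, float]],
    error_rate: list[tuple[float, float]],
    max_mem_growth_mb_per_h: float = 1.0,   # hypothetical threshold
    max_err_growth_per_h: float = 0.0001,   # hypothetical threshold
) -> bool:
    """Pass only if neither memory nor error rate trends upward beyond its limit."""
    return (
        slope_per_hour(memory_mb) <= max_mem_growth_mb_per_h
        and slope_per_hour(error_rate) <= max_err_growth_per_h
    )
```

A trend line is a deliberately simple choice here: it flags slow, steady growth over 72 hours, but it would not catch a leak that only appears late in the run, so plotting the raw series alongside it is still worthwhile.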

5.4 Security

  • SAST + dependency scan + SBOM in CI (already from P0).
  • DAST: OWASP ZAP baseline + active scans in staging.
  • Pentest: external firm, 2-week engagement, scope = production-like staging. Findings triaged: Critical/High → must close; Medium → close or accept with rationale; Low → backlog.
  • Threat model review: revisit STRIDE from Phase 1; document changes.
  • Compliance: ISO 27001 / 27017 / 27018 control mapping; APPI / GDPR DPIA final.
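
The triage rule for pentest findings can be expressed as a small sign-off check. This is a sketch under assumptions: the Finding fields and function names are hypothetical and do not reflect any report format the external firm will use. Only the severity policy (Critical/High must close; Medium closes or is accepted with rationale; Low goes to backlog) is taken from the text above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class Finding:
    # Hypothetical record shape for one pentest finding.
    finding_id: str
    severity: Severity
    closed: bool = False
    acceptance_rationale: Optional[str] = None


def blocks_signoff(finding: Finding) -> bool:
    """Apply the triage policy to a single finding."""
    if finding.closed:
        return False
    if finding.severity in (Severity.CRITICAL, Severity.HIGH):
        return True  # must be closed, cannot be accepted
    if finding.severity is Severity.MEDIUM:
        # Open Mediums are allowed only with a written rationale.
        return not (finding.acceptance_rationale or "").strip()
    return False  # Low findings move to the backlog


def signoff_blockers(findings: list[Finding]) -> list[str]:
    """Return the IDs of findings that still prevent sign-off."""
    return [f.finding_id for f in findings if blocks_signoff(f)]
```

An empty list from signoff_blockers corresponds to the pentest condition for the gate; anything else names the findings to chase.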

5.5 DR drill

  • Snapshot prod-like staging.
  • Restore to a parallel env from latest backup.
  • Validate RPO (≤ 15 min) and RTO (≤ 4 h).
  • Document deltas; update runbook.
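
Validating RPO and RTO comes down to two time differences measured during the drill. The sketch below assumes three timestamps are recorded (the parameter names are illustrative, not an existing runbook schema) and compares the results with the targets stated above.

```python
from datetime import datetime, timedelta

RPO_LIMIT = timedelta(minutes=15)  # target from this section
RTO_LIMIT = timedelta(hours=4)     # target from this section


def evaluate_drill(
    last_recoverable_write: datetime,
    failure_declared: datetime,
    service_restored: datetime,
) -> dict[str, object]:
    """Compute achieved RPO/RTO for one drill and compare with the limits.

    RPO = data-loss window (failure time minus newest restorable write).
    RTO = downtime (restore time minus failure time).
    """
    achieved_rpo = failure_declared - last_recoverable_write
    achieved_rto = service_restored - failure_declared
    if achieved_rpo < timedelta(0) or achieved_rto < timedelta(0):
        raise ValueError("timestamps are out of order")
    return {
        "achieved_rpo": achieved_rpo,
        "achieved_rto": achieved_rto,
        "rpo_ok": achieved_rpo <= RPO_LIMIT,
        "rto_ok": achieved_rto <= RTO_LIMIT,
    }
```

Recording the achieved values, not only pass/fail, gives the drill report the deltas the last bullet asks for.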

5.6 Accessibility & i18n

  • Run axe-core on every route; no Critical/Serious violations.
  • Manual screen-reader walkthrough (NVDA + VoiceOver).
  • Translation coverage check (English + Japanese) — no untranslated strings on user-facing routes.
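
The translation coverage check can be approximated by comparing message catalogues key by key. This sketch assumes the strings are available as flat key-to-text dictionaries (for example, loaded from JSON locale files); the catalogue format and function names are assumptions, since the document does not say how strings are stored.

```python
def missing_translations(source: dict[str, str], target: dict[str, str]) -> list[str]:
    """Return keys present in the source locale but absent or blank in the target."""
    return sorted(
        key for key in source
        if not target.get(key, "").strip()
    )


def coverage(source: dict[str, str], target: dict[str, str]) -> float:
    """Fraction of source keys that have a non-blank target string."""
    if not source:
        return 1.0
    return 1.0 - len(missing_translations(source, target)) / len(source)
```

For example, with en = {"save": "Save", "cancel": "Cancel"} and ja = {"save": "保存"}, missing_translations(en, ja) reports "cancel". The check only detects absent or empty entries; strings copied over untranslated still need a manual review.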

6. Deployment

6.1 Topology

  • Cloud platform per ADR-002 (decided in Phase 0).
  • Environments: dev, staging, production.
  • Production sizing: 3× app instances behind LB; MongoDB Atlas M30+ (or dedicated cluster); Redis cluster; S3 bucket with versioning + lifecycle; CDN in front.
  • Region: primary in the agreed region (residency-aligned); DR target in a different AZ at minimum.
  • Secrets: managed secret store (no secrets in plaintext env files on disk).

6.2 Release strategy

  • Blue-green preferred (toggle via LB / DNS).
  • Canary option for risky releases (5% → 25% → 100% over 24h).
  • Automatic rollback on SLO breach.
  • Database migrations: forward-only, backward-compatible per release.
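
The canary progression with automatic rollback can be sketched as a loop over traffic steps. This is illustrative only: set_traffic and slo_healthy are hypothetical callbacks standing in for whatever the load balancer and monitoring stack expose, and the hold time between steps (spread over 24h in the plan above) is left to the caller.

```python
from typing import Callable, Sequence

CANARY_STEPS: tuple[int, ...] = (5, 25, 100)  # percentages from the release strategy


def run_canary(
    set_traffic: Callable[[int], None],
    slo_healthy: Callable[[], bool],
    steps: Sequence[int] = CANARY_STEPS,
) -> bool:
    """Shift traffic to the new release step by step.

    After each step the SLO check is consulted; on a breach, traffic to the
    new release is set back to 0% and the rollout stops.
    Returns True if the rollout reached the final step, False if rolled back.
    """
    for percent in steps:
        set_traffic(percent)
        if not slo_healthy():
            set_traffic(0)  # automatic rollback
            return False
    return True
```

In practice slo_healthy would block for the observation window of each step before returning, so that a breach is judged on real traffic and not on the instant after the switch.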

6.3 Observability in production

  • Logs → Loki / vendor; structured, sampled.
  • Metrics → Prometheus; alerts on SLO burn.
  • Traces → Tempo / vendor; 10% sample.
  • Status page (public) — auto-updated from health probes.
  • On-call rota with PagerDuty / Opsgenie.
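
"Alerts on SLO burn" usually means alerting on burn rate: the observed error ratio divided by the error budget (1 minus the SLO target). A burn rate of 1 consumes the budget exactly over the SLO period; higher values consume it proportionally faster. The sketch below shows the arithmetic only. The function names, the two-window rule, and any threshold passed in are assumptions, since the document does not define the SLO targets or alert policy.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error ratio divided by the allowed error budget."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    error_ratio = bad_events / total_events
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(short_window_rate: float, long_window_rate: float, threshold: float) -> bool:
    """Page only when both a short and a long window exceed the threshold.

    Requiring both windows reduces pages for brief blips (short window only)
    and for incidents that have already recovered (long window only).
    """
    return short_window_rate >= threshold and long_window_rate >= threshold
```

As a worked example, 50 failed requests out of 10,000 against an assumed 99.9% target gives an error ratio of 0.005 and a budget of 0.001, so the burn rate is about 5: the budget is being spent five times faster than the SLO allows.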

6.4 Cutover plan (production)

  1. Final regression on staging passes.
  2. Pentest Critical/High findings closed; Medium closed or accepted with rationale.
  3. Backup taken.
  4. Maintenance window scheduled.
  5. Deploy via blue-green.
  6. Smoke tests post-cutover.
  7. 5-day hypercare with daily standup.

6.5 Hypercare

  • Heightened on-call.
  • Daily standup with engineering + product + ops.
  • Per-day report (incidents, SLA, customer signals).
  • Exit criteria: zero Sev-1 for 5 days, SLOs met, customer signals stable.
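
The exit criteria can be evaluated mechanically from the per-day reports. The sketch below assumes each report is reduced to three fields (the DayReport shape is hypothetical) and interprets "zero Sev-1 for 5 days" as the five most recent consecutive reporting days, which is an assumption about how the window is counted.

```python
from dataclasses import dataclass

HYPERCARE_DAYS = 5  # from the hypercare plan


@dataclass(frozen=True)
class DayReport:
    # Hypothetical summary of one daily hypercare report.
    sev1_incidents: int
    slos_met: bool
    customer_signals_stable: bool


def hypercare_can_exit(reports: list[DayReport], required_days: int = HYPERCARE_DAYS) -> bool:
    """True when the latest `required_days` reports are all clean."""
    if len(reports) < required_days:
        return False
    recent = reports[-required_days:]
    return all(
        r.sev1_incidents == 0 and r.slos_met and r.customer_signals_stable
        for r in recent
    )
```

Under this interpretation a Sev-1 on day 3 restarts the count, so hypercare would run until five clean days follow the incident.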

7. Training & handover

7.1 Training audiences

  • Customer Success (admin features, troubleshooting, ticket triage).
  • Sales Engineering (demo paths, KTC reference walkthrough).
  • Engineering Ops (runbooks, on-call, incident response).
  • Customer trainers ("train-the-trainer" material).

7.2 Materials

  • Recorded modules (≤15 min each) per topic.
  • Quick-reference cheat sheets.
  • Sandbox tenant with KTC reference data for hands-on.

7.3 Handover

  • All runbooks reviewed and current.
  • All ADRs in Accepted or Superseded state — no Proposed open.
  • All Phase docs (Phase_0.md–Phase_9.md) tagged final.
  • Backlog handover: open items, deferred decisions, customer-specific TODOs.

8. Documentation completeness checklist

  • FRS, NFRS, System Architecture, Data Model, Security Architecture, Threat Model
  • Every phase doc (P0–P9) reviewed and final
  • All ADRs in Accepted or Superseded state (none left Proposed)
  • All runbooks current: auth, API keys, gateway, ingestion, dashboard, workflow, AI, mobile, incident response, DR, on-call
  • User guides: admin, operator, developer, mobile
  • OpenAPI spec at v1.0; SDK at matching version
  • In-app help content reviewed and translated (EN + JA)
  • Public docs site (Docusaurus / similar) live with all of the above
  • Changelog at v1.0

9. Acceptance Criteria (gate G9)

  • AC9.1 — All functional acceptance criteria from P0–P8 re-pass in regression.
  • AC9.2 — Pentest report shows no Critical / High findings open; Medium / Low triaged with rationale.
  • AC9.3 — Performance NFRS targets met in load, spike, and soak tests.
  • AC9.4 — DR drill completed with RPO ≤ 15 min and RTO ≤ 4 h recorded.
  • AC9.5 — Production deployed via blue-green; smoke tests pass; rollback procedure exercised on a non-prod slot to verify.
  • AC9.6 — Five days of hypercare completed with zero Sev-1 and SLA met.
  • AC9.7 — Documentation checklist (§8) 100% complete.
  • AC9.8 — Training delivered and acknowledged by Customer Success, Sales Engineering, Engineering Ops.

10. Test Requirements

  • Coverage gates retained: ≥80% unit on business logic across the repo.
  • Contract tests pass against published OpenAPI v1.0.
  • All e2e scenarios in §5.2 pass.
  • Lighthouse mobile gates retained.

11. Documentation Requirements

  • docs/quality/regression_run_v1.md
  • docs/quality/perf_report_v1.md
  • docs/security/pentest_report_v1.md (with closure log)
  • docs/security/threat_model_final.md
  • docs/operations/dr_drill_v1.md
  • docs/operations/cutover_v1.md
  • docs/operations/hypercare_v1.md
  • docs/training/customer_success.md, sales_engineering.md, engineering_ops.md
  • CHANGELOG.md at v1.0
  • docs/release_notes/v1.0.md

12. Sign-off Criteria (Gate G9)

  • All Acceptance Criteria met.
  • Production go/no-go meeting held with Engineering Lead, Product Owner, Security Lead, DevOps Lead, Customer Success Lead, Sales Engineering Lead.
  • Signed _gates/Gate_G9_signoff.md.
  • Tagged v1.0.0 and announced.

13. Risks & Mitigations

| Risk | Likelihood (L) | Impact (I) | Mitigation |
|---|---|---|---|
| Pentest finds Critical close to gate | 3 | 5 | Pentest scheduled to start at end of P7 so closure window exists in P9. |
| Performance under load fails NFRS | 2 | 4 | Spikes already done in earlier phases; tuning playbook ready. |
| DR drill exposes backup gap | 2 | 4 | Tabletop DR in P2 hopefully surfaced gaps; drill is the formal proof. |
| Production secret leak during cutover | 1 | 5 | Pre-cutover checklist; rotation script ready; audit during cutover. |
| Hypercare extends due to instability | 3 | 3 | Phase 10 (post-launch hardening) reserved in calendar even though not in scope here. |

14. After v1.0

This phase ends the build plan; post-launch is its own roadmap document. Likely first-30-day priorities:

  • Customer-driven hardening (any production incident).
  • Predictive maintenance use cases (AI Phase 2).
  • Native mobile shell via Capacitor (if PWA gaps surface).
  • Marketplace / public widget API (post-launch).
  • Customer-specific connectors per inbound demand.