Incident Response
Caution: Detailed runbooks will be added here during the implementation phase. This page currently documents the incident classification framework.
Severity Classification
| Severity | Definition | Response time |
|---|---|---|
| P1 — Critical | All assessments failing, data loss risk, security breach | Immediate |
| P2 — High | Assessments failing for subset of clients, EDS unavailable | Within 1 hour |
| P3 — Medium | Degraded quality scores across multiple jobs, notification system down | Within 4 hours |
| P4 — Low | Single job failure, minor performance degradation | Next business day |
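The response times in the table can be turned into concrete deadlines. The following is a minimal sketch, not part of any existing tooling: the names `Severity`, `RESPONSE_WINDOW`, `BUSINESS_DAY_END`, and `response_deadline` are hypothetical, and it assumes that "Next business day" means 17:00 on the next Monday-to-Friday date, ignoring public holidays.

```python
from datetime import datetime, time, timedelta
from enum import Enum


class Severity(Enum):
    P1 = "Critical"
    P2 = "High"
    P3 = "Medium"
    P4 = "Low"


# Fixed response windows taken from the table. P4 is handled separately.
RESPONSE_WINDOW = {
    Severity.P1: timedelta(0),        # Immediate
    Severity.P2: timedelta(hours=1),  # Within 1 hour
    Severity.P3: timedelta(hours=4),  # Within 4 hours
}

# Assumption: a business day ends at 17:00 local time.
BUSINESS_DAY_END = time(17, 0)


def next_business_day(day):
    """Return the first Monday-to-Friday date after `day` (holidays ignored)."""
    candidate = day + timedelta(days=1)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate += timedelta(days=1)
    return candidate


def response_deadline(severity, reported_at):
    """Return the latest time by which a response should have started."""
    if severity is Severity.P4:
        day = next_business_day(reported_at.date())
        return datetime.combine(day, BUSINESS_DAY_END, tzinfo=reported_at.tzinfo)
    return reported_at + RESPONSE_WINDOW[severity]
```

For example, a P2 incident reported on a Friday at 16:00 would be due by 17:00 the same day, while a P4 incident reported at the same moment would be due by 17:00 the following Monday under these assumptions.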
Common Incidents
- EDS write failure: Check KMS connectivity, blob store availability, and the DEK generation service
- Stuck collection job: Check Azure API availability and whether the service principal credential has expired
- AI inference timeout: Check Claude API status and retry queue depth
- High 429 error rate: Reduce Phase 3 concurrency and check for Azure-side throttling policy changes
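As an illustration of the last item, reducing concurrency in response to 429 responses could follow an additive-increase, multiplicative-decrease rule. This is a sketch only: the function name `adjust_concurrency`, the 5% threshold, and the floor and ceiling values are assumptions chosen for illustration, not settings from the actual Phase 3 collector.

```python
def adjust_concurrency(current, status_codes, threshold=0.05, floor=1, ceiling=32):
    """Suggest a new concurrency level from recent HTTP status codes.

    Halves the level when the share of 429 responses exceeds `threshold`,
    otherwise raises it by one, staying within [floor, ceiling].
    """
    if not status_codes:
        return current  # No recent traffic, so nothing to react to.

    rate_429 = sum(1 for code in status_codes if code == 429) / len(status_codes)

    if rate_429 > threshold:
        # Multiplicative decrease: back off quickly while being throttled.
        return max(floor, current // 2)
    # Additive increase: recover slowly once throttling subsides.
    return min(ceiling, current + 1)
```

Halving on throttling and recovering one step at a time keeps the collector from oscillating around the throttling limit. If the 429 rate stays high even at the floor, a throttling policy change on the Azure side becomes the more likely cause.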