Incident Response

Caution: Detailed runbooks will be added here during the implementation phase. For now, this page documents only the incident classification framework.

Severity Classification

Severity      | Definition                                                               | Response time
P1 - Critical | All assessments failing, data loss risk, security breach                 | Immediate
P2 - High     | Assessments failing for subset of clients, EDS unavailable               | Within 1 hour
P3 - Medium   | Degraded quality scores across multiple jobs, notification system down   | Within 4 hours
P4 - Low      | Single job failure, minor performance degradation                        | Next business day
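
The response times above could be encoded so that tooling computes a response deadline automatically. The following is a minimal sketch, not part of the platform: the names `Severity`, `RESPONSE_WINDOW`, `next_business_day`, and `response_deadline` are hypothetical, and "next business day" is assumed to mean the same clock time on the next weekday (Monday to Friday, ignoring holidays).

```python
from datetime import datetime, timedelta
from enum import Enum


class Severity(Enum):
    """Severity levels from the classification table."""
    P1 = "Critical"
    P2 = "High"
    P3 = "Medium"
    P4 = "Low"


# Fixed response windows taken from the table; P4 is handled separately.
RESPONSE_WINDOW = {
    Severity.P1: timedelta(0),        # immediate
    Severity.P2: timedelta(hours=1),
    Severity.P3: timedelta(hours=4),
}


def next_business_day(moment: datetime) -> datetime:
    """Same clock time on the next weekday (assumption: Mon-Fri, no holidays)."""
    candidate = moment + timedelta(days=1)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate += timedelta(days=1)
    return candidate


def response_deadline(severity: Severity, reported_at: datetime) -> datetime:
    """Latest time by which a response is due for an incident."""
    if severity is Severity.P4:
        return next_business_day(reported_at)
    return reported_at + RESPONSE_WINDOW[severity]
```

For example, `response_deadline(Severity.P2, reported_at)` returns a time one hour after `reported_at`, while a P4 incident reported on a Friday is due the following Monday under the stated assumption.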

Common Incidents

  • EDS write failure: Check KMS connectivity, blob store availability, and the DEK generation service
  • Stuck collection job: Check Azure API availability and service principal credential expiry
  • AI inference timeout: Check Claude API status and retry queue depth
  • High 429 error rate: Reduce Phase 3 concurrency and check for Azure-side throttling policy changes
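
Until the detailed runbooks exist, the checks listed above could be kept in a small lookup table so that on-call tooling can print the first steps for a given incident. This is only an illustrative sketch: `TRIAGE_CHECKS` and `triage_checklist` are hypothetical names, and the keys are simply the incident names from the list, lowercased with underscores.

```python
# First-response checks, copied from the "Common Incidents" list.
TRIAGE_CHECKS = {
    "eds_write_failure": [
        "KMS connectivity",
        "blob store availability",
        "DEK generation service",
    ],
    "stuck_collection_job": [
        "Azure API availability",
        "service principal credential expiry",
    ],
    "ai_inference_timeout": [
        "Claude API status",
        "retry queue depth",
    ],
    "high_429_error_rate": [
        "reduce Phase 3 concurrency",
        "Azure-side throttling policy changes",
    ],
}


def triage_checklist(incident: str) -> list[str]:
    """Return the checks for an incident name such as 'Stuck collection job'."""
    key = incident.strip().lower().replace(" ", "_")
    if key not in TRIAGE_CHECKS:
        raise KeyError(f"no triage checklist defined for {incident!r}")
    # Return a copy so callers cannot mutate the shared table.
    return list(TRIAGE_CHECKS[key])
```

Calling `triage_checklist("AI inference timeout")` returns the two checks for that incident; an unknown incident name raises `KeyError` rather than returning an empty list, so that gaps in the table are noticed.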