Monitoring
Key Metrics to Watch
| Metric | Healthy threshold | Alert threshold |
|---|---|---|
| Assessment job completion rate | > 98% | < 95% |
| Average collection duration | < 60s | > 90s |
| Average quality score | > 80% | < 60% (per job) |
| Phase 1 failure rate | 0% | Any failure |
| API error rate (403) | < 5% of calls | > 15% |
| API throttling rate (429) | < 2% of calls | > 10% |
| EDS write success rate | 100% | Any failure |
| Admin notification delivery time | < 2 min | > 5 min |
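The thresholds above leave a band between "healthy" and "alert" (for example, a completion rate of 96%). A minimal sketch of how a check could classify a reading is shown below. The metric keys, the `Threshold` type, and the `"warning"` label for the in-between band are assumptions made for illustration; they are not part of the CYC tooling.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Threshold:
    """Pair of predicates taken from one row of the metrics table."""
    healthy: Callable[[float], bool]
    alert: Callable[[float], bool]


# Hypothetical metric keys. Values are percentages, seconds, or minutes,
# matching the units used in the table.
THRESHOLDS = {
    "job_completion_rate_pct": Threshold(lambda v: v > 98, lambda v: v < 95),
    "avg_collection_duration_s": Threshold(lambda v: v < 60, lambda v: v > 90),
    "avg_quality_score_pct": Threshold(lambda v: v > 80, lambda v: v < 60),
    "phase1_failure_rate_pct": Threshold(lambda v: v == 0, lambda v: v > 0),
    "api_403_rate_pct": Threshold(lambda v: v < 5, lambda v: v > 15),
    "api_429_rate_pct": Threshold(lambda v: v < 2, lambda v: v > 10),
    "eds_write_success_rate_pct": Threshold(lambda v: v == 100, lambda v: v < 100),
    "admin_notification_delay_min": Threshold(lambda v: v < 2, lambda v: v > 5),
}


def classify(metric: str, value: float) -> str:
    """Return "alert", "healthy", or "warning" (assumed name for the gap)."""
    threshold = THRESHOLDS[metric]
    if threshold.alert(value):
        return "alert"
    if threshold.healthy(value):
        return "healthy"
    return "warning"
```

For example, `classify("job_completion_rate_pct", 96.0)` falls between the two thresholds and is reported as `"warning"`, while any non-zero Phase 1 failure rate is reported as `"alert"`.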
Dashboard
The CYC operations dashboard shows:
- Jobs completed in last 24 hours with quality score distribution
- Error rate by domain (Defender, Cost, Sentinel, etc.)
- Collection duration percentiles (p50, p90, p99)
- Degraded quality job list with recommended actions
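To make the duration percentiles concrete, the sketch below computes p50, p90, and p99 from a list of collection durations using the nearest-rank method. This is an illustration only: the source does not state which percentile method the dashboard uses, and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def duration_summary(durations_s: Sequence[float]) -> Dict[str, float]:
    """Summarize collection durations (seconds) as p50, p90, and p99."""
    return {f"p{p}": percentile(durations_s, p) for p in (50, 90, 99)}
```

A p99 well above the 90s alert threshold while p50 stays under 60s would suggest a small number of slow jobs rather than a general slowdown.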
Audit Log Review
Review the EDS audit log weekly for:
- Jobs in `collecting` state for more than 10 minutes (stuck jobs)
- Jobs where `deleted_at` is null and TTL should have expired (deletion failures)
- Unusual access patterns in the access log
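The first two checks can be scripted. The sketch below assumes each audit record is available as a dictionary. Only the `collecting` state and the `deleted_at` field come from the text above; `job_id`, `state`, `state_entered_at`, and `ttl_expires_at` are hypothetical field names chosen for illustration and should be mapped to the actual EDS schema.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List

STUCK_AFTER = timedelta(minutes=10)  # limit from the review checklist

Job = Dict[str, Any]


def find_stuck_jobs(jobs: List[Job], now: datetime) -> List[str]:
    """IDs of jobs that have been in the collecting state for over 10 minutes."""
    return [
        job["job_id"]
        for job in jobs
        if job["state"] == "collecting"
        and now - job["state_entered_at"] > STUCK_AFTER
    ]


def find_deletion_failures(jobs: List[Job], now: datetime) -> List[str]:
    """IDs of jobs whose TTL has passed but deleted_at is still unset."""
    return [
        job["job_id"]
        for job in jobs
        if job["deleted_at"] is None and job["ttl_expires_at"] < now
    ]
```

Both functions take `now` as a parameter, for example `datetime.now(timezone.utc)`, so the checks stay deterministic and easy to test.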