Component 1 — Data Collector
What It Is
The Data Collector is the first active component in the CYC Assess pipeline. It is a stateless Python service that authenticates to a client tenant using a read-only service principal, executes all data collection calls across four structured phases, and writes a single raw collection JSON file to the Ephemeral Data Store.
It has no knowledge of checklist items, scoring logic, or AI prompts. Its only responsibility is to collect and store.
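The flow described above (job payload in, four phases run in order, one raw JSON document out) can be sketched roughly as follows. This is a minimal illustration, not the actual service: `run_collection`, the `Phase` signature, the payload keys `job_id` and `tenant_id`, the `raw/<job_id>.json` key layout, and the dictionary standing in for the Ephemeral Data Store are all assumptions made for the example.

```python
import json
from typing import Any, Callable

# Hypothetical phase signature: takes the job payload, returns that phase's raw data.
Phase = Callable[[dict[str, Any]], dict[str, Any]]


def run_collection(job: dict[str, Any], phases: list[Phase], store: dict[str, str]) -> str:
    """Run every phase in order and write one raw collection JSON document."""
    document: dict[str, Any] = {
        "job_id": job["job_id"],        # assumed payload key
        "tenant_id": job["tenant_id"],  # assumed payload key
        "phases": {},
    }
    for number, phase in enumerate(phases, start=1):
        # No state is kept between runs: each phase sees only the job payload.
        document["phases"][f"phase_{number}"] = phase(job)
    key = f"raw/{job['job_id']}.json"   # assumed key layout
    store[key] = json.dumps(document)   # the only write the collector performs
    return key


# Usage with placeholder phases; a plain dict stands in for the Ephemeral Data Store.
store: dict[str, str] = {}
job = {"job_id": "job-001", "tenant_id": "tenant-abc"}
phases = [lambda payload, n=n: {"calls_made": n} for n in range(1, 5)]
key = run_collection(job, phases, store)
```

Because everything the run needs arrives in `job`, any number of workers could execute `run_collection` in parallel for different tenants without coordinating with each other.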
IP protection by design
Even if the Data Collector were fully reverse-engineered, an observer would find only standard Azure API calls — no CYC methodology is exposed.
Position in the Pipeline
Design Principles
| Principle | Definition |
|---|---|
| Stateless | No local state between runs. All inputs received via job payload. Horizontally scalable. |
| Non-destructive | Read-only permissions only. Architecturally incapable of modifying client resources. |
| Fault-tolerant | Individual call failures are logged and collection continues. A run always produces a structurally complete output file, with failures recorded in collection_errors. |
| Opaque | No proprietary logic. Query strings retrieved from encrypted vault at runtime, not embedded in source. |
| Time-bounded | 30-second timeout per API call. Target total run time under 60 seconds for a 20-subscription tenant. |
| Privacy-enforcing | Raw data written to Ephemeral Data Store with 48-hour TTL by default. No other writes. |
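The Fault-tolerant and Time-bounded principles can be illustrated with a small wrapper that passes the per-call timeout to each call and records failures instead of aborting the run. This is a sketch under stated assumptions: `collect_all`, the `CollectionCall` signature, and the error fields `call`, `error_type`, and `message` are hypothetical, and the real `collection_errors` schema is the one defined on the Error Handling page.

```python
from typing import Any, Callable

CALL_TIMEOUT_SECONDS = 30  # per-call timeout from the Time-bounded principle

# Hypothetical call signature: each call takes a timeout in seconds and returns raw data.
CollectionCall = Callable[[float], Any]


def collect_all(calls: dict[str, CollectionCall]) -> dict[str, Any]:
    """Run every call; record failures instead of aborting the run."""
    results: dict[str, Any] = {}
    errors: list[dict[str, str]] = []
    for name, call in calls.items():
        try:
            results[name] = call(CALL_TIMEOUT_SECONDS)
        except Exception as exc:  # a failure is isolated to this one call
            results[name] = None  # keep the key so the output shape stays stable
            errors.append(
                {"call": name, "error_type": type(exc).__name__, "message": str(exc)}
            )
    return {"results": results, "collection_errors": errors}


def list_subscriptions(timeout: float) -> list[str]:
    return ["sub-a", "sub-b"]  # placeholder data, not a real API response


def list_policies(timeout: float) -> list[str]:
    raise TimeoutError(f"no response within {timeout} seconds")  # simulated timeout


output = collect_all({"subscriptions": list_subscriptions, "policies": list_policies})
```

Here the simulated timeout in `list_policies` ends up as one entry in `collection_errors`, while the `subscriptions` result is still collected and every expected key is present in the output.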
Collection Phases Summary
Subsections
| Page | Content |
|---|---|
| Authentication | Credential sets, required Azure role assignments |
| Collection Phases | Phase 1–4 detail, API calls, Sentinel conditional block |
| Error Handling | Error classification, retry policy, collection_errors schema |
| Quality Scoring | Quality score calculation, thresholds, collection_quality schema |
| Output Schema | Full collection JSON schema |
| Storage Model | Two-tier storage, ephemeral vs retained |