Component 1 — Data Collector

What It Is

The Data Collector is the first active component in the CYC Assess pipeline. It is a stateless Python service that authenticates to a client tenant using a read-only service principal, executes all data collection calls across four structured phases, and writes a single raw collection JSON file to the Ephemeral Data Store.

It has no knowledge of checklist items, scoring logic, or AI prompts. Its only responsibility is to collect and store.
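
The collect-and-store shape described above can be sketched roughly as follows. This is a minimal illustration, not the actual service: `run_collection`, the `Phase` type, and the field names `job_id`, `tenant_id`, `collected_at`, and `phases` are all hypothetical, and the real schema is defined on the Output Schema page. Authentication and the write to the Ephemeral Data Store are left out.

```python
import json
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List

# Hypothetical type: a phase maps a call name to a zero-argument collection call.
Phase = Dict[str, Callable[[], Any]]


def run_collection(job_payload: Dict[str, Any], phases: List[Phase]) -> str:
    """Run every phase in order and return one raw collection JSON document.

    All inputs arrive through job_payload and nothing is kept between runs,
    so the function is stateless. Field names here are assumptions.
    """
    raw: Dict[str, Any] = {
        "job_id": job_payload["job_id"],
        "tenant_id": job_payload["tenant_id"],
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "phases": {},
    }
    for index, phase in enumerate(phases, start=1):
        # Store the raw result of each call; no scoring or interpretation.
        raw["phases"][f"phase_{index}"] = {
            name: call() for name, call in phase.items()
        }
    return json.dumps(raw)
```

Because the function only runs the calls it is handed and serializes their results, nothing about checklist items or scoring needs to live in it.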

IP protection by design

Even if the Data Collector were fully reverse-engineered, an observer would find only standard Azure API calls — no CYC methodology is exposed.

Position in the Pipeline

Design Principles

| Principle | Definition |
| --- | --- |
| Stateless | No local state between runs. All inputs received via job payload. Horizontally scalable. |
| Non-destructive | Read-only permissions only. Architecturally incapable of modifying client resources. |
| Fault-tolerant | Individual call failures logged and collection continues. Always produces a complete output file. |
| Opaque | No proprietary logic. Query strings retrieved from encrypted vault at runtime, not embedded in source. |
| Time-bounded | 30-second timeout per API call. Target under 60 seconds for a 20-subscription tenant. |
| Privacy-enforcing | Raw data written to Ephemeral Data Store with 48-hour TTL by default. No other writes. |
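
One way the fault-tolerant and time-bounded principles could combine is sketched below. It is an assumption-laden illustration, not the collector's actual code: `collect_all` and the error record fields (`call`, `error_type`, `message`) are hypothetical, and the real `collection_errors` schema and retry policy are described on the Error Handling page. The timeout is enforced with the standard library's `concurrent.futures`, which is only one possible choice.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Dict, List

CALL_TIMEOUT_SECONDS = 30  # per-call timeout stated in the design principles

logger = logging.getLogger("collector")


def collect_all(
    calls: Dict[str, Callable[[], Any]],
    timeout: float = CALL_TIMEOUT_SECONDS,
) -> Dict[str, Any]:
    """Run each call with a timeout, log failures, and keep going.

    Every call name appears in "results" (None on failure), so the output
    is structurally complete even when some calls fail.
    """
    results: Dict[str, Any] = {}
    errors: List[Dict[str, str]] = []
    # One worker per call so a timed-out call cannot block later ones.
    pool = ThreadPoolExecutor(max_workers=max(1, len(calls)))
    try:
        for name, fn in calls.items():
            future = pool.submit(fn)
            try:
                results[name] = future.result(timeout=timeout)
            except FutureTimeout:
                # The worker thread cannot be killed; its result is ignored.
                logger.warning("call %s timed out after %ss", name, timeout)
                results[name] = None
                errors.append({"call": name, "error_type": "timeout",
                               "message": f"no result within {timeout}s"})
            except Exception as exc:  # any failure raised by the call itself
                logger.warning("call %s failed: %s", name, exc)
                results[name] = None
                errors.append({"call": name,
                               "error_type": type(exc).__name__,
                               "message": str(exc)})
    finally:
        # Do not wait for timed-out calls still running in the background.
        pool.shutdown(wait=False)
    return {"results": results, "collection_errors": errors}
```

In this sketch a failed or slow call costs at most one timeout window and one error record, and the remaining calls still run.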

Collection Phases Summary

Subsections

| Page | Content |
| --- | --- |
| Authentication | Credential sets, required Azure role assignments |
| Collection Phases | Phase 1–4 detail, API calls, Sentinel conditional block |
| Error Handling | Error classification, retry policy, collection_errors schema |
| Quality Scoring | Quality score calculation, thresholds, collection_quality schema |
| Output Schema | Full collection JSON schema |
| Storage Model | Two-tier storage, ephemeral vs retained |