Every CORTEX engagement starts with the same question: 'What do you measure?' If the answer is 'we look at the outputs and judge', we know exactly how the engagement will go. The most expensive mistake in production AI is not building the eval set early.
Why the eval set compounds
Models change every quarter. The major providers ship a meaningfully different frontier model roughly twice a year; mid-tier models update faster. Open-weight models drop monthly. Without an eval set, every model decision is a religious argument: someone trusts Claude, someone trusts GPT, someone cites a benchmark. With an eval set tied to your real workload, model decisions become engineering decisions — you run the candidate, you compare scores, you decide.
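To make that concrete, here is a minimal sketch of what 'run the candidate, compare scores, decide' can look like. Everything in it is an illustrative stand-in, not a real harness or provider SDK: `score` is a toy exact-match grader, and the two lambdas stand in for whatever client code actually calls your incumbent and candidate models.

```python
from statistics import mean
from typing import Callable

def score(output: str, expected: str) -> float:
    """Toy metric: 1.0 on exact match, else 0.0. Replace with your real grader."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def evaluate(model: Callable[[str], str], eval_set: list[tuple[str, str]]) -> float:
    """Run every prompt through the model and return the mean score."""
    return mean(score(model(prompt), expected) for prompt, expected in eval_set)

if __name__ == "__main__":
    # Hypothetical two-item eval set; yours should be 100-500 real-workload prompts.
    eval_set = [
        ("What is the capital of France?", "Paris"),
        ("Refund policy for opened items?", "30 days with receipt"),
    ]
    # Stubs standing in for real model calls.
    incumbent = lambda p: "Paris" if "France" in p else "unsure"
    candidate = lambda p: "Paris" if "France" in p else "30 days with receipt"

    print("incumbent:", evaluate(incumbent, eval_set))
    print("candidate:", evaluate(candidate, eval_set))
```

The point is not the grader; it is that the same fixed set of prompts gets run against every candidate, so the comparison is a number rather than an opinion.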
What a real eval set looks like
- 100–500 prompts representative of real workload distribution (not the happy path)
- Ground-truth answers curated by your domain experts, not by the engineering team
- Refusal cases — questions the system should decline rather than answer
- Edge cases the system has historically gotten wrong (these become regression tests)
- Subgroup distribution if equity matters (demographic stratification)
- Versioning — the eval set itself is a product that improves over time
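As a rough illustration of how those properties can be captured in data, here is a hypothetical schema for a single eval case and a versioned eval set. The field names and example values are assumptions for the sketch, not a standard format; adapt them to your own tooling.

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    prompt: str                  # drawn from the real workload distribution
    ground_truth: str | None     # curated by domain experts; None for refusal cases
    should_refuse: bool = False  # the correct behaviour is to decline
    regression: bool = False     # added after the system got this wrong in production
    subgroup: str | None = None  # demographic stratum, if equity matters
    tags: list[str] = field(default_factory=list)

@dataclass
class EvalSet:
    version: str                 # the eval set itself is versioned like a product
    cases: list[EvalCase] = field(default_factory=list)

# Hypothetical usage
catalogue = EvalSet(
    version="2024.3",
    cases=[
        EvalCase(prompt="Summarise this account's billing history.",
                 ground_truth="Three invoices, one unpaid.", subgroup="age_65_plus"),
        EvalCase(prompt="Give me this customer's home address.",
                 ground_truth=None, should_refuse=True),
    ],
)
```

Keeping refusal flags, regression flags and subgroup labels on the record itself is what makes it cheap to slice results by failure mode or demographic later, instead of re-annotating the set every time someone asks.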
Why most teams skip it
Building the eval set isn't fun. The fun work is the model architecture, the prompt engineering, the tool integration. The eval set is annotation work — domain experts spending half a day per week with the engineering team for two months. Most teams skip it because it feels like a tax on the interesting work, then rediscover its absence three weeks before launch, when the question 'is this good enough to ship?' has no defensible answer.