What does a good offline eval set look like for an agent?

Question

Accepted Answer

A good offline evaluation set for an agent is diverse and representative, capturing a wide spectrum of user intents, contexts, and potential edge cases encountered in real-world scenarios. It meticulously includes ground truth labels or expected outputs for each input, allowing for objective measurement of the agent's performance against predefined success criteria. This set often incorporates a mix of common interactions and challenging failure modes, specifically designed to stress-test the agent's robustness and identify areas for improvement. Key characteristics involve clear performance metrics-such as accuracy, relevance, safety, or adherence to specific guidelines-that can be programmatically evaluated. Furthermore, a robust eval set is periodically updated and curated, reflecting evolving user behaviors, new product features, or identified gaps in previous agent versions. Its composition ensures sufficient coverage without excessive redundancy, making evaluations efficient yet comprehensive for iterative agent development.