As AI agents and multimodal models become more prevalent, understanding how to evaluate GenAI is no longer optional – it's essential.
Generative AI is harder to assess than traditional software. This week on Chain of Thought, we’re joined by Chip Huyen (Storyteller, Tép Studio) and Vivienne Zhang (Senior Product Manager, Generative AI Software, Nvidia) for a discussion on AI evaluation best practices.
Before we hear from our guests, Vikram Chatterji (CEO, Galileo) and Conor Bronsdon (Developer Awareness, Galileo) give their takes on the complexities of AI evals and how to overcome them. They cover the use of objective criteria in evaluating open-ended tasks, the role of hallucinations in AI models, and the importance of human-in-the-loop systems.
Afterwards, Chip and Vivienne sit down with Atin Sanyal (Co-Founder & CTO, Galileo) to explore common evaluation approaches, best practices for building frameworks, and implementation lessons. They also discuss the nuances of evaluating AI coding assistants and agentic systems.
Show Notes:
Chapters:
00:00 Challenges in Evaluating Generative AI
05:45 Evaluating AI Agents
13:08 Are Hallucinations Bad?
17:12 Human in the Loop Systems
20:49 Panel Discussion Begins
22:57 Challenges in Evaluating Intelligent Systems
24:37 User Feedback and Iterative Improvement
26:47 Post-Deployment Evaluations and Common Mistakes
28:52 Hallucinations in AI: Definitions and Challenges
34:17 Evaluating AI Coding Assistants
38:15 Agentic Systems: Use Cases and Evaluations
43:00 Trends in AI Models and Hardware
45:42 Future of AI in Enterprises
47:16 Conclusion and Final Thoughts