discourse

Roman Rizzi 3a647c8e50 FEATURE: Use evals to compare LLMs and Personas' prompts (#36027)	2025-11-18 10:39:52 -03:00

Implemented an eval "comparison matrix" that lets you run the same evals across multiple personas or multiple LLMs and have a judge model declare a winner with per-candidate scores. The CLI adds --compare personas|llms, keeps persona selection (auto-prepending default for persona mode), and always ensures a judge is configured. A dedicated ComparisonRunner reuses Workbench results to build candidate outputs and sends them to Judge#compare, which crafts a rubric-aware comparison prompt and parses structured winner/ratings JSON. Outputs are streamed to the console, and individual run logs still get written. The README documents how to use the new flag and what each mode does.
lib	FEATURE: Use evals to compare LLMs and Personas' prompts (#36027)	2025-11-18 10:39:52 -03:00
personas	FEATURE: Use evals to compare LLMs and Personas' prompts (#36027)	2025-11-18 10:39:52 -03:00
run	FEATURE: Use evals to compare LLMs and Personas' prompts (#36027)	2025-11-18 10:39:52 -03:00