discourse/plugins/discourse-ai/evals/lib
Roman Rizzi 3a647c8e50
FEATURE: Use evals to compare LLMs and Personas' prompts (#36027)
Implemented an eval “comparison matrix” that lets you run the same evals
across multiple personas or multiple LLMs and have a judge model declare
a winner with per-candidate scores. The CLI adds --compare
personas|llms, keeps persona selection (auto-prepending the default
persona in persona mode), and always ensures a judge is configured. A dedicated
ComparisonRunner reuses Workbench results to build candidate outputs and
sends them to Judge#compare, which crafts a rubric-aware comparison
prompt and parses structured winner/ratings JSON. Outputs are streamed
to the console and individual run logs still get written. README
documents how to use the new flag and what each mode does.
2025-11-18 10:39:52 -03:00
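
The commit message says Judge#compare "parses structured winner/ratings JSON" but does not show the schema. The sketch below illustrates what such a parsing step might look like. It is not the plugin's code: the JSON keys ("winner", "ratings", "candidate", "score"), the Rating struct, and the parse_comparison helper are all hypothetical names chosen for illustration.

```ruby
require "json"

# Hypothetical value object for one per-candidate score.
Rating = Struct.new(:candidate, :score, keyword_init: true)

# Assumed reply shape (not taken from the source):
#   {"winner": "<label>", "ratings": [{"candidate": "<label>", "score": <number>}]}
# Returns { winner: String or nil, ratings: [Rating, ...] }.
def parse_comparison(raw, candidates)
  data = JSON.parse(raw)

  ratings = Array(data["ratings"]).map do |r|
    Rating.new(candidate: r["candidate"].to_s, score: Float(r["score"]))
  end

  # Reject scores for labels that were never part of the comparison.
  unknown = ratings.map(&:candidate) - candidates
  raise ArgumentError, "unknown candidates: #{unknown.join(", ")}" unless unknown.empty?

  # Treat a winner outside the candidate list as "no winner declared".
  winner = candidates.include?(data["winner"]) ? data["winner"] : nil

  { winner: winner, ratings: ratings }
rescue JSON::ParserError, TypeError => e
  raise ArgumentError, "judge reply was not valid comparison JSON: #{e.message}"
end

raw = '{"winner":"llm_a","ratings":[{"candidate":"llm_a","score":8},{"candidate":"llm_b","score":6.5}]}'
result = parse_comparison(raw, %w[llm_a llm_b])
puts "winner: #{result[:winner]}"
result[:ratings].each { |r| puts "#{r.candidate}: #{r.score}" }
```

Validating candidate labels against the list that was actually compared is one way to guard against a judge model inventing or misspelling a label; whether the real Judge#compare does this is not stated in the commit message.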
prompts FEATURE: Cover all LLM features with evals (#35693) 2025-11-13 12:24:56 -03:00
runners FEATURE: Use evals to compare LLMs and Personas' prompts (#36027) 2025-11-18 10:39:52 -03:00
boot.rb
cli.rb FEATURE: Use evals to compare LLMs and Personas' prompts (#36027) 2025-11-18 10:39:52 -03:00
comparison_runner.rb FEATURE: Use evals to compare LLMs and Personas' prompts (#36027) 2025-11-18 10:39:52 -03:00
eval.rb REFACTOR: centralize eval orchestration around feature-driven playground (#35718) 2025-10-30 13:08:38 -03:00
features.rb FEATURE: Use evals to compare LLMs and Personas' prompts (#36027) 2025-11-18 10:39:52 -03:00
judge.rb FEATURE: Use evals to compare LLMs and Personas' prompts (#36027) 2025-11-18 10:39:52 -03:00
llm_repository.rb FEATURE: Cover all LLM features with evals (#35693) 2025-11-13 12:24:56 -03:00
persona_prompt_loader.rb FEATURE: Use evals to compare LLMs and Personas' prompts (#36027) 2025-11-18 10:39:52 -03:00
recorder.rb FEATURE: Use evals to compare LLMs and Personas' prompts (#36027) 2025-11-18 10:39:52 -03:00
structured_logger.rb REFACTOR: centralize eval orchestration around feature-driven playground (#35718) 2025-10-30 13:08:38 -03:00
workbench.rb FEATURE: Use evals to compare LLMs and Personas' prompts (#36027) 2025-11-18 10:39:52 -03:00