mirror of
https://gh.wpcy.net/https://github.com/discourse/discourse.git
synced 2026-05-28 07:19:11 +08:00
Implemented an eval “comparison matrix” that runs the same evals across multiple personas or multiple LLMs and has a judge model declare a winner with per-candidate scores. The CLI adds --compare personas|llms, keeps persona selection (automatically prepending the default persona in persona mode), and always ensures a judge is configured. A dedicated ComparisonRunner reuses Workbench results to build candidate outputs and sends them to Judge#compare, which crafts a rubric-aware comparison prompt and parses the structured winner/ratings JSON. Outputs are streamed to the console, and individual run logs are still written. The README documents how to use the new flag and what each mode does.

| Name | | |
|---|---|---|
| runners | ||
| support | ||
| comparison_runner_spec.rb | ||
| eval_spec.rb | ||
| features_spec.rb | ||
| judge_spec.rb | ||
| llm_repository_spec.rb | ||
| persona_prompt_loader_spec.rb | ||
| recorder_spec.rb | ||
| workbench_spec.rb | ||
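
The commit message says Judge#compare parses structured winner/ratings JSON, but the schema itself is not shown on this page. As a rough illustration only, the Ruby sketch below assumes a hypothetical payload with a `winner` label and a `ratings` array of `candidate`/`rating` pairs. The names `Verdict` and `parse_verdict` are invented for this example and are not taken from the Discourse code.

```ruby
require "json"

# Hypothetical verdict container; the real Judge#compare return value may differ.
Verdict = Struct.new(:winner, :ratings, keyword_init: true)

# Parse a judge reply (assumed JSON shape) and keep only known candidates.
# Returns an empty verdict when the reply is malformed.
def parse_verdict(raw, candidate_labels)
  data = JSON.parse(raw)

  ratings = Array(data["ratings"]).each_with_object({}) do |entry, acc|
    label = entry["candidate"]
    next unless candidate_labels.include?(label)
    acc[label] = Integer(entry["rating"]) # raises on nil or non-numeric text
  end

  # Ignore a winner that is not one of the candidates we actually ran.
  winner = data["winner"]
  winner = nil unless candidate_labels.include?(winner)

  Verdict.new(winner: winner, ratings: ratings)
rescue JSON::ParserError, ArgumentError, TypeError
  Verdict.new(winner: nil, ratings: {})
end

raw = '{"winner":"persona_b","ratings":[' \
      '{"candidate":"persona_a","rating":6},' \
      '{"candidate":"persona_b","rating":9}]}'
verdict = parse_verdict(raw, ["persona_a", "persona_b"])
puts verdict.winner
puts verdict.ratings.inspect
```

Filtering against the list of candidate labels is a defensive choice in this sketch: a judge model can return labels that were never run, and dropping them keeps the per-candidate scores aligned with the comparison that was actually executed.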