Mirror of https://gh.wpcy.net/https://github.com/discourse/discourse.git (synced 2026-05-23 20:04:04 +08:00)
Implemented an eval “comparison matrix” that runs the same evals across multiple personas or multiple LLMs and has a judge model declare a winner with per-candidate scores. The CLI adds --compare personas|llms, keeps persona selection (automatically prepending the default persona in persona mode), and always ensures a judge is configured. A dedicated ComparisonRunner reuses Workbench results to build candidate outputs and sends them to Judge#compare, which builds a rubric-aware comparison prompt and parses the structured winner/ratings JSON from the reply. Output is streamed to the console, and individual run logs are still written. The README documents how to use the new flag and what each mode does.

| Name |
|---|
| prompts |
| runners |
| boot.rb |
| cli.rb |
| comparison_runner.rb |
| eval.rb |
| features.rb |
| judge.rb |
| llm_repository.rb |
| persona_prompt_loader.rb |
| recorder.rb |
| structured_logger.rb |
| workbench.rb |
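
The comparison flow described in the commit message above can be pictured with a minimal Ruby sketch. It is not the code in comparison_runner.rb or judge.rb: the `Candidate` struct, the helper names `candidate_labels`, `build_comparison_prompt`, and `parse_verdict`, and the exact JSON keys (`winner`, `ratings`, `candidate`, `score`) are all assumptions made for illustration. The sketch only shows the general shape: prepend the default persona in persona mode, assemble a rubric-aware prompt from candidate outputs, and parse a structured verdict.

```ruby
require "json"

# Hypothetical stand-in for one persona's or one LLM's eval output.
Candidate = Struct.new(:label, :output, keyword_init: true)

# In persona mode, make sure the default persona is compared too, listed first.
# (Assumed behavior, based on "auto-prepending default" in the commit message.)
def candidate_labels(mode, requested)
  labels = requested.dup
  labels.unshift("default") if mode == :personas && !labels.include?("default")
  labels
end

# Build a rubric-aware prompt that asks the judge for structured JSON.
# The JSON shape requested here is an assumption, not the real prompt.
def build_comparison_prompt(rubric, candidates)
  sections = candidates.map { |c| "### #{c.label}\n#{c.output}" }.join("\n\n")
  <<~PROMPT
    Compare the candidates below against this rubric:
    #{rubric}

    #{sections}

    Reply with JSON: {"winner": "<label>", "ratings": [{"candidate": "<label>", "score": <1-10>}]}
  PROMPT
end

# Parse the judge's reply and reject a winner that is not a known candidate.
def parse_verdict(raw, candidates)
  data = JSON.parse(raw)
  known = candidates.map(&:label)
  winner = data.fetch("winner")
  raise ArgumentError, "unknown winner: #{winner}" unless known.include?(winner)

  ratings = data.fetch("ratings").to_h { |r| [r.fetch("candidate"), r.fetch("score")] }
  { winner: winner, ratings: ratings }
end
```

Validating the winner against the known candidate labels is a design choice of this sketch: a judge model can return a label that was never offered, and failing loudly there is safer than recording a verdict for a nonexistent candidate.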