mirror of
https://gh.wpcy.net/https://github.com/discourse/discourse.git
synced 2026-06-19 02:05:37 +08:00
Reported through a few user tests, the agent was unreliable in several ways:

- used `tags.tag_name` instead of `tags.name`
- did not use `current_user_id` for "my posts" prompts
- treated plural nouns as singular
- used unparseable date defaults like "today"

A few issues were behind this:

- The `DbSchema` tool was returning a dense, comma-separated one-line-per-table format that qwen was unable to deal with. It is now line-per-column, so schema accuracy, originally flaky, is now 5/5 PASSING on qwen and Gemini.
- The prompt was teaching the wrong thing: the `-- null boolean :opt_flag = #null` example made models use `#null` as a default value. We now have a "Parameter rules" section, ISO date examples that match the "no natural-language defaults" rule below them, explicit `current_user_id` guidance for first-person prompts, and a plural-noun rule that applies to each plural noun independently in the same prompt (e.g. "categories and tags" → BOTH list params, not one of each).
- The eval runner now captures `name` and `description` separately, not just `sql`. The description text is graded directly rather than grading the SQL string.

Tested against qwen 3.5 122B (our hosted model) + Gemini 3.1 Flash Lite (judge GPT-5.2): 20/20 each.

New eval cases ship in this PR: https://github.com/discourse/discourse-ai-evals/pull/18

| | | |
|---|---|---|
| runners | ||
| support | ||
| agent_prompt_loader_spec.rb | ||
| console_formatter_spec.rb | ||
| eval_spec.rb | ||
| features_spec.rb | ||
| judge_spec.rb | ||
| llm_repository_spec.rb | ||
| recorder_spec.rb | ||
| workbench_compare_spec.rb | ||
| workbench_spec.rb | ||
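
The `DbSchema` change described in the commit message above (one line per column instead of one dense line per table) can be illustrated with a minimal sketch. This is not the actual tool code: `format_schema`, the input hash shape, and the exact output layout are all assumptions made for illustration only.

```ruby
# Hypothetical helper, NOT the real DbSchema tool implementation.
# Assumes the schema arrives as { table_name => { column_name => type } }.
def format_schema(tables)
  tables.map do |table, columns|
    # One line per column, indented under the table name.
    lines = columns.map { |name, type| "  #{name} #{type}" }
    (["#{table}:"] + lines).join("\n")
  end.join("\n\n")
end

# Illustrative data only; column lists are not taken from the real schema.
schema = {
  "tags" => { "id" => "integer", "name" => "string" },
  "posts" => { "id" => "integer", "user_id" => "integer" },
}

puts format_schema(schema)
```

Putting each column on its own line gives a model a distinct, short token sequence per column name, which is presumably why a mistake such as `tags.tag_name` becomes less likely than with a long comma-separated row.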