mirror of
https://gh.wpcy.net/https://github.com/discourse/discourse.git
synced 2026-06-19 07:43:46 +08:00
Reported through a few user tests, the agent was unreliable in three ways: - `tags.tag_name` instead of `tags.name` - did not use `current_user_id` for "my posts" prompts - plural nouns as singular - and used unparse-able date defaults like "today". Few issues: - `DbSchema` tool was returning a dense one-line-per-table comma that qwen was unable to deal with. Now line-per-column so schema accuracy originally flaky is now 5/5 PASSING on qwen and Gemini. - The prompt was teaching the wrong thing where the `-- null boolean :opt_flag = #null` example made models use `#null` as a default value. We now have a "Parameter rules" section, ISO date examples that match the "no natural-language defaults" rule below them, explicit `current_user_id` guidance for first-person prompts, and a plural-noun rule that applies to each plural noun independently in the same prompt (e.g. "categories and tags" → BOTH list params, not one of each). - Eval runner now captures `name` and `description` separately, not just `sql`. The description text is graded directly rather than grading the SQL string. Tested against qwen 3.5 122B (our hosted model) + Gemini 3.1 Flash Lite (judge GPT-5.2): 20/20 each. New eval cases ship in this PR https://github.com/discourse/discourse-ai-evals/pull/18
38 lines
1.5 KiB
Ruby
Vendored
38 lines
1.5 KiB
Ruby
Vendored
# frozen_string_literal: true
|
|
|
|
require_relative "../../../evals/lib/runners/data_explorer"
|
|
require_relative "../support/runner_helper"
|
|
|
|
RSpec.describe DiscourseAi::Evals::Runners::DataExplorer do
|
|
fab!(:llm, :fake_model)
|
|
let(:execution_context) { DiscourseAi::Completions::ExecutionContext.new }
|
|
let(:runner) { described_class.new("query_generation") }
|
|
let(:eval_case) { OpenStruct.new(args: { input: "Show me signups by month" }) }
|
|
|
|
def stub_structured_response(name:, description:, sql:)
|
|
payload = { name: name, description: description, sql: sql }
|
|
structured =
|
|
instance_double(DiscourseAi::Completions::StructuredOutput).tap do |double|
|
|
allow(double).to receive(:read_buffered_property) { |key| payload[key] }
|
|
end
|
|
|
|
stub_runner_bot { |blk| blk.call(structured, nil, :structured_output) }
|
|
end
|
|
|
|
describe "#run" do
|
|
it "captures name, description, and sql separately" do
|
|
stub_structured_response(
|
|
name: "Signups by month",
|
|
description: "Counts user signups grouped by month.",
|
|
sql: "SELECT date_trunc('month', created_at) AS month FROM users GROUP BY month",
|
|
)
|
|
|
|
result = runner.run(eval_case, llm, execution_context: execution_context)
|
|
|
|
expect(result[:raw]).to include("SELECT date_trunc")
|
|
expect(result[:metadata][:name]).to eq("Signups by month")
|
|
expect(result[:metadata][:description]).to eq("Counts user signups grouped by month.")
|
|
expect(result[:metadata][:feature]).to eq("query_generation")
|
|
end
|
|
end
|
|
end
|