mirror of
https://gh.wpcy.net/https://github.com/discourse/discourse.git
synced 2026-06-19 03:05:45 +08:00
User testing showed the agent was unreliable in four ways:

- it wrote `tags.tag_name` instead of `tags.name`
- it did not use `current_user_id` for "my posts" prompts
- it treated plural nouns as singular
- it used unparseable date defaults like "today"

Root causes and fixes:

- The `DbSchema` tool was returning a dense, comma-separated line per table that qwen was unable to deal with. It now returns one line per column, so schema accuracy, originally flaky, now passes 5/5 on qwen and Gemini.
- The prompt was teaching the wrong thing: the `-- null boolean :opt_flag = #null` example made models use `#null` as a default value. The prompt now has a "Parameter rules" section, ISO date examples that match the "no natural-language defaults" rule below them, explicit `current_user_id` guidance for first-person prompts, and a plural-noun rule that applies to each plural noun independently in the same prompt (e.g. "categories and tags" → BOTH list params, not one of each).
- The eval runner now captures `name` and `description` separately, not just `sql`. The description text is graded directly rather than grading the SQL string.

Tested against qwen 3.5 122B (our hosted model) and Gemini 3.1 Flash Lite (judge: GPT-5.2): 20/20 each.

New eval cases ship in this PR: https://github.com/discourse/discourse-ai-evals/pull/18
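To illustrate the difference between the two schema layouts described above, here is a minimal Ruby sketch. It is not the actual `DbSchema` tool output: the `SCHEMA` hash, its column types, and the helper names `format_dense` and `format_per_column` are all hypothetical, chosen only to show why a line-per-column layout leaves less room to misread a column name.

```ruby
# Hypothetical schema data; table and column names are for illustration only.
SCHEMA = {
  "tags" => { "id" => "integer", "name" => "string" },
  "posts" => { "id" => "integer", "user_id" => "integer", "raw" => "text" },
}.freeze

# Dense layout: one line per table, columns joined by commas.
def format_dense(schema)
  schema
    .map { |table, cols| "#{table}: #{cols.map { |col, type| "#{col} #{type}" }.join(", ")}" }
    .join("\n")
end

# Line-per-column layout: every column gets its own fully qualified line.
def format_per_column(schema)
  schema
    .flat_map { |table, cols| cols.map { |col, type| "#{table}.#{col} #{type}" } }
    .join("\n")
end

puts format_dense(SCHEMA)
puts format_per_column(SCHEMA)
```

In the second layout each column appears next to its table name (`tags.name`), which is the kind of explicitness the change above relies on. The exact format the real tool emits may differ.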
62 lines
1.8 KiB
Ruby
Vendored
# frozen_string_literal: true

require_relative "base"

module DiscourseAi
  module Evals
    module Runners
      class DataExplorer < Base
        # Structured-output fields collected from the agent's reply.
        STRUCTURED_KEYS = %i[name description sql].freeze

        def self.can_handle?(full_feature_name)
          full_feature_name&.start_with?("data_explorer:")
        end

        def run(eval_case, llm, execution_context:)
          args = eval_case.args
          agent = resolve_agent(agent_class: DiscourseDataExplorer::AiQueryGenerator)
          user = Discourse.system_user

          context =
            DiscourseAi::Agents::BotContext.new(
              user: user,
              skip_show_thinking: true,
              feature_name: "evals/data_explorer_query_generation",
              messages: [{ type: :user, content: args[:input] }],
            )

          bot = DiscourseAi::Agents::Bot.as(user, agent: agent, model: llm)
          captured = capture_structured_fields(bot, context, execution_context:)

          # The SQL is the primary result; name and description travel in the metadata.
          sql = captured[:sql].to_s.strip
          metadata = {
            feature: feature_name,
            name: captured[:name].to_s.strip,
            description: captured[:description].to_s.strip,
          }

          wrap_result(sql, metadata)
        end

        private

        # Accumulates streamed chunks into one string buffer per structured key.
        def capture_structured_fields(bot, context, execution_context:)
          buffers = STRUCTURED_KEYS.index_with { +"" }

          bot.reply(context, execution_context:) do |partial, _, type|
            if type == :structured_output
              STRUCTURED_KEYS.each do |key|
                chunk = partial.read_buffered_property(key)
                buffers[key] << chunk.to_s if chunk
              end
            elsif type.blank?
              # Untyped partials are appended to the SQL buffer.
              buffers[:sql] << partial.to_s
            end
          end

          buffers
        end
      end
    end
  end
end
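
The buffering in `capture_structured_fields` can be tried outside Discourse with a small stand-alone sketch. Everything here is an assumption made for illustration: `FakePartial` is a hypothetical stand-in for the streamed partial object, `collect` is a hypothetical helper that takes an array of `[partial, type]` pairs instead of the `bot.reply` block, and plain Ruby `to_h` and `nil?` replace ActiveSupport's `index_with` and `blank?`.

```ruby
# Hypothetical stand-in for a streamed structured-output partial.
# It returns whatever chunk it holds for a key, or nil.
FakePartial = Struct.new(:chunks) do
  def read_buffered_property(key)
    chunks[key]
  end
end

KEYS = %i[name description sql].freeze

# Mirrors the accumulation logic: one mutable string buffer per key.
def collect(stream)
  buffers = KEYS.to_h { |key| [key, +""] }

  stream.each do |partial, type|
    if type == :structured_output
      KEYS.each do |key|
        chunk = partial.read_buffered_property(key)
        buffers[key] << chunk.to_s if chunk
      end
    elsif type.nil?
      # Untyped partials are treated as raw SQL text.
      buffers[:sql] << partial.to_s
    end
  end

  buffers
end

stream = [
  [FakePartial.new({ name: "Top ", sql: "SELECT " }), :structured_output],
  [
    FakePartial.new({ name: "tags", description: "Most used tags", sql: "name FROM tags" }),
    :structured_output,
  ],
]

result = collect(stream)
puts result[:name]
puts result[:description]
puts result[:sql]
```

Keeping a separate buffer per key is what lets `name` and `description` be graded on their own, while the untyped branch still captures output from a reply that arrives as plain text.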