discourse/plugins/discourse-ai/lib/agents/tools/github_search_code.rb
Régis Hanol 36a8a51ef0
FEATURE: Route all GitHub API requests through one rate-limited client (#40637)
GitHub oneboxes and the discourse-github plugin talked to GitHub's REST and
GraphQL API with no rate-limit awareness. On busy instances this exhausted
GitHub's limits (60 requests/hour unauthenticated, 5000 authenticated), and
because there was no backoff every render kept hitting GitHub and re-failing
-- which GitHub's docs warn can get an integration banned. The recently
added PR-status onebox multiplied the number of calls and made it far worse.

GitHub access was also fragmented: the core onebox engines used OpenURI, the
discourse-github plugin used Octokit, and the discourse-ai bot tools used
FinalDestination::HTTP -- three HTTP stacks, three tokens, and inconsistent
(or entirely missing) error and rate-limit handling.

This introduces a single client, Discourse::GithubApi, that every GitHub
data-API request now flows through. It is built on Faraday with the SSRF-safe
FinalDestination adapter and:

- authenticates per token (Bearer) and returns plain string-keyed Hashes
  (get/post) or raw bodies (raw_get) -- one response shape, no Octokit/Sawyer
- only ever sends the access token to api.github.com and
  raw.githubusercontent.com, rejecting any other absolute URL, so a
  user-derived path can never leak a token to an arbitrary host
- backs off on rate limits both reactively (403/429) and proactively (when
  X-RateLimit-Remaining hits 0), honouring Retry-After / X-RateLimit-Reset,
  via a shared Redis flag (GithubRateLimit) keyed per token so each token's
  budget and the shared unauthenticated/IP budget back off independently
- short-circuits while backing off without ever sleeping, so onebox rendering
  and post baking degrade to a plain link instead of blocking a request
- caches ETags and sends If-None-Match, so unchanged resources return 304s
  that do not count against the rate limit
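
The backoff behaviour listed above can be sketched roughly as follows. This is
an illustration only, not the code of Discourse::GithubApi or GithubRateLimit:
BackoffRegistry, Response, fetch_with_backoff and wait_seconds are hypothetical
names, an in-memory Hash stands in for the shared Redis flag, and every 403 is
treated as a rate-limit response for simplicity.

```ruby
# Hypothetical stand-in for the shared, per-token backoff flag.
# A Hash replaces Redis so the sketch runs on its own.
class BackoffRegistry
  def initialize(clock: -> { Time.now.to_f })
    @until_by_key = {}
    @clock = clock
  end

  # Marks one token identity as backing off for `seconds`.
  def back_off!(key, seconds)
    @until_by_key[key] = @clock.call + seconds
  end

  def backing_off?(key)
    deadline = @until_by_key[key]
    !deadline.nil? && deadline > @clock.call
  end
end

# Minimal response shape assumed for the sketch (lower-case header names).
Response = Struct.new(:status, :headers, :body)

# Picks the wait from Retry-After, then X-RateLimit-Reset, then a default.
def wait_seconds(headers, default_wait, now: Time.now.to_i)
  if headers["retry-after"]
    headers["retry-after"].to_i
  elsif headers["x-ratelimit-reset"]
    [headers["x-ratelimit-reset"].to_i - now, 1].max
  else
    default_wait
  end
end

# Returns the body, or nil when the call is skipped or rate limited.
# It never sleeps; the caller decides how to degrade (e.g. a plain link).
def fetch_with_backoff(registry, token_key, default_wait: 60)
  return nil if registry.backing_off?(token_key) # short-circuit, no request

  response = yield
  headers = response.headers
  limited = [403, 429].include?(response.status)         # reactive
  exhausted = headers["x-ratelimit-remaining"].to_s == "0" # proactive

  if limited || exhausted
    registry.back_off!(token_key, wait_seconds(headers, default_wait))
  end

  limited ? nil : response.body
end
```

Because the registry is keyed by token identity, a backoff recorded for one
token leaves other tokens, and the unauthenticated identity, free to continue.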

Every caller was moved onto it:

- the 6 core GitHub onebox engines, via a slimmed Onebox::Mixins::GithubApi
  adapter that keeps their public methods and translates client errors back
  to the OpenURI::HTTPError vocabulary they already rescue (engines unchanged)
- the github_blob raw.githubusercontent.com fetch
- the discourse-github plugin (badges, linkback, permalinks, token validator),
  which no longer uses the octokit and sawyer gems (they stay in the Gemfile
  for the discourse-code-review official plugin, which still depends on them)
- the discourse-ai bot's GitHub tools (search code, diff, file content,
  search files)

Also adds a GithubOneboxBackoff admin problem check that surfaces while one of
the onebox token identities is backing off -- scoped to the tokens resolved by
Onebox::GithubAccess (each configured github_onebox_access_tokens entry plus
the unauthenticated client) so a backoff on the AI bot or linkback token is not
misattributed to onebox. Its message points admins at the relevant setting with
the {{setting:...}} link marker, which problem-check messages now expand too.
Onebox token resolution is centralised in Onebox::GithubAccess, and the onebox
cache TTL for transient GitHub failures is shortened so they recover quickly.

GitHub OAuth login, theme git-clone, the inbound webhook, and the Oneboxer
FinalDestination URL-resolution special-cases for github.com are intentionally
out of scope -- they are different concerns, not the rate-limited data API.
2026-06-15 10:59:10 +02:00

# frozen_string_literal: true

require "base64"
require "cgi"
require "uri"

module DiscourseAi
  module Agents
    module Tools
      class GithubSearchCode < Tool
        MAX_GH_RESULTS = 1_000
        PER_PAGE = 30
        # GitHub's search API exposes at most MAX_GH_RESULTS results per query,
        # so pages beyond this one can never return anything.
        MAX_ALLOWED_PAGE = (MAX_GH_RESULTS / PER_PAGE.to_f).ceil

        def self.signature
          {
            name: name,
            description: "Searches for code in a GitHub repository",
            parameters: [
              {
                name: "repo",
                description: "The repository name in the format 'owner/repo'",
                type: "string",
                required: true,
              },
              {
                name: "query",
                description: "The search query (e.g., a function name, variable, or code snippet)",
                type: "string",
                required: true,
              },
              {
                name: "page",
                description: "Results page to retrieve (GitHub returns up to 30 results per page)",
                type: "integer",
                required: false,
              },
              {
                name: "ignore_paths",
                description:
                  "File path prefixes to exclude from results (e.g., ['config/locales/', 'spec/'])",
                type: "array",
                item_type: "string",
                required: false,
              },
            ],
          }
        end

        def self.name
          "github_search_code"
        end

        def repo
          parameters[:repo]
        end

        def query
          parameters[:query]
        end

        # Clamps the requested page to 1..MAX_ALLOWED_PAGE; a missing or
        # non-positive value falls back to the first page.
        def page
          requested = parameters[:page].to_i
          requested = 1 if requested <= 0
          [requested, MAX_ALLOWED_PAGE].min
        end

        def ignore_paths
          Array(parameters[:ignore_paths]).map(&:to_s)
        end

        def description_args
          { repo: repo, query: query, page: page }
        end

        def invoke
          api_url = build_url

          begin
            search_data =
              github_client.get(api_url, accept: "application/vnd.github.v3.text-match+json")
          rescue Discourse::GithubApi::Error => e
            return { error: "Failed to perform code search. #{e.message}" }
          end

          formatted_results = format_results(search_data["items"])
          grouped = group_by_file(formatted_results)
          trimmed = trim_results(grouped)

          total_count = search_data["total_count"].to_i
          total_pages =
            if total_count <= 0
              1
            else
              [(total_count.to_f / PER_PAGE).ceil, MAX_ALLOWED_PAGE].min
            end

          result = { search_results: trimmed }
          result[:page] = page if page > 1
          result[:total_results] = total_count
          result[:next_page] = page + 1 if page < total_pages
          if search_data["incomplete_results"]
            result[:notes] = "GitHub marked the search results as incomplete."
          end
          result
        end

        private

        def build_url
          base_query = "#{query} repo:#{repo}"
          encoded_params = URI.encode_www_form({ q: base_query, page: page, per_page: PER_PAGE })
          "https://api.github.com/search/code?#{encoded_params}"
        end

        # Cap blob fetches to avoid excessive API calls for line-number enrichment.
        # Files beyond this limit still appear in results but without line numbers.
        MAX_BLOB_FETCHES = 10

        def format_results(items)
          return [] if items.blank?

          file_cache = {}
          blob_fetches = 0
          ignored = ignore_paths

          results =
            items.flat_map do |item|
              text_matches = item["text_matches"]
              next [] if text_matches.blank?

              path = item["path"]
              next [] if ignored.any? { |prefix| path.start_with?(prefix) }

              repo_full_name = item.dig("repository", "full_name") || repo
              ref =
                extract_ref(item) || item.dig("repository", "default_branch") ||
                  fetch_default_branch(repo_full_name)
              sha = item["sha"]
              cache_key = blob_cache_key(repo_full_name, path, ref, sha)

              if file_cache.key?(cache_key)
                file_details = file_cache[cache_key]
              elsif blob_fetches < MAX_BLOB_FETCHES
                file_details = fetch_file_content(repo_full_name, path, ref, file_cache, sha)
                blob_fetches += 1
              else
                file_details = nil
              end

              text_matches.map do |match|
                fragment = match["fragment"]
                next if fragment.blank?

                content = file_details&.dig(:content)
                total_lines = file_details&.dig(:total_lines)
                line_range = derive_line_range(content, fragment)

                {
                  file: path,
                  lines: format_line_label(line_range),
                  total_file_lines: total_lines,
                  content: fragment,
                }
              end
            end

          results.compact
        end

        def group_by_file(results)
          return [] if results.blank?

          grouped = {}
          results.each do |entry|
            file = entry[:file]
            grouped[file] ||= { file: file, total_lines: entry[:total_file_lines], matches: [] }
            grouped[file][:matches] << { lines: entry[:lines], content: entry[:content] }
          end
          grouped.values
        end

        # Keeps the output within a rough character budget (max_chars), counting
        # file names, line labels, and fragment text. The last match that fits
        # only partially is truncated; files that no longer fit are dropped.
        def trim_results(grouped_results)
          return [] if grouped_results.blank?

          max_chars = 20_000
          used_chars = 0

          grouped_results.each_with_object([]) do |file_entry, acc|
            file = file_entry[:file].to_s
            file_overhead = file.length + file_entry[:total_lines].to_s.length
            remaining = max_chars - used_chars
            break acc if remaining <= file_overhead

            trimmed_matches = []
            match_chars = 0

            file_entry[:matches].each do |match|
              lines = match[:lines].to_s
              content = match[:content].to_s
              entry_length = lines.length + content.length
              match_remaining = remaining - file_overhead - match_chars
              break if match_remaining <= 0

              if entry_length > match_remaining
                allowed = match_remaining - lines.length
                break if allowed <= 0
                content = content[0...allowed]
                entry_length = lines.length + content.length
              end

              trimmed_matches << { lines: match[:lines], content: content }
              match_chars += entry_length
            end

            next if trimmed_matches.empty?

            acc << {
              file: file_entry[:file],
              total_lines: file_entry[:total_lines],
              matches: trimmed_matches,
            }
            used_chars += file_overhead + match_chars
          end
        end

        def format_line_label(line_range)
          return nil if line_range.nil?

          if line_range[:start_line] == line_range[:end_line]
            line_range[:start_line].to_s
          else
            "#{line_range[:start_line]}-#{line_range[:end_line]}"
          end
        end

        def blob_cache_key(repo_full_name, path, ref, blob_sha)
          suffix = ref.presence || blob_sha.presence || "main"
          "#{repo_full_name || repo}@#{suffix}:#{path}"
        end

        def fetch_file_content(repo_full_name, path, ref, cache, blob_sha)
          repo_full_name ||= repo
          cache_key = blob_cache_key(repo_full_name, path, ref, blob_sha)
          return cache[cache_key] if cache.key?(cache_key)

          owner, repo_name = repo_full_name.to_s.split("/", 2)
          if owner.blank? || repo_name.blank?
            cache[cache_key] = nil
            return nil
          end

          url =
            if blob_sha.present?
              "https://api.github.com/repos/#{owner}/#{repo_name}/git/blobs/#{blob_sha}"
            else
              actual_ref = ref.presence || fetch_default_branch(repo_full_name)
              "https://api.github.com/repos/#{owner}/#{repo_name}/contents/#{path}?ref=#{actual_ref}"
            end

          begin
            data = github_client.get(url)
            decoded = ensure_utf8(Base64.decode64(data["content"].to_s))
            cache[cache_key] = { content: decoded, total_lines: count_lines(decoded) }
          rescue Discourse::GithubApi::Error
            cache[cache_key] = nil
          end

          cache[cache_key]
        end

        def extract_ref(item)
          url = item["url"]
          return if url.blank?

          uri =
            begin
              URI.parse(url)
            rescue StandardError
              nil
            end
          return unless uri&.query

          CGI.parse(uri.query || "")["ref"]&.first
        end

        def derive_line_range(file_content, fragment)
          return if file_content.blank? || fragment.blank?

          file = normalize_line_endings(file_content)
          snippet = normalize_line_endings(fragment)

          # GitHub fragments mirror contiguous file sections, so match directly.
          index = file.index(snippet)
          return if index.nil?

          prefix = file[0...index]
          start_line = prefix.count("\n") + 1
          line_count = snippet.each_line.count
          { start_line: start_line, end_line: start_line + line_count - 1 }
        end

        def normalize_line_endings(text)
          text.gsub("\r\n", "\n")
        end

        def count_lines(content)
          return 0 if content.blank?

          normalized = normalize_line_endings(content)
          normalized.count("\n") + (normalized.end_with?("\n") ? 0 : 1)
        end

        def ensure_utf8(text)
          return "" if text.nil?

          result = text.dup
          result.force_encoding(Encoding::UTF_8)
          return result if result.valid_encoding?

          result.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "")
        end
      end
    end
  end
end
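
The page arithmetic in invoke and the line-number lookup in derive_line_range
can be tried outside Discourse with a small standalone sketch. It restates the
same logic in plain Ruby, with ActiveSupport's blank? replaced by nil/empty
checks; the pagination helper and the sample strings are hypothetical and only
serve the illustration.

```ruby
PER_PAGE = 30
MAX_ALLOWED_PAGE = (1_000 / PER_PAGE.to_f).ceil # 34 pages cover 1,000 results

# Hypothetical helper mirroring the total_pages / next_page logic in invoke.
def pagination(total_count:, page:)
  total_pages =
    if total_count <= 0
      1
    else
      [(total_count.to_f / PER_PAGE).ceil, MAX_ALLOWED_PAGE].min
    end
  { total_pages: total_pages, next_page: (page + 1 if page < total_pages) }
end

# Plain-Ruby restatement of derive_line_range: find the fragment in the file
# and count the newlines before it to get a 1-based starting line.
def derive_line_range(file_content, fragment)
  return nil if file_content.nil? || file_content.empty?
  return nil if fragment.nil? || fragment.empty?

  file = file_content.gsub("\r\n", "\n")
  snippet = fragment.gsub("\r\n", "\n")

  index = file.index(snippet)
  return nil if index.nil?

  start_line = file[0...index].count("\n") + 1
  { start_line: start_line, end_line: start_line + snippet.each_line.count - 1 }
end

file = "def a\n  1\nend\n\ndef b\n  2\nend\n"
p derive_line_range(file, "def b\n  2")      # lines 5 to 6 of the sample file
p pagination(total_count: 95, page: 1)       # 4 pages in total, next page is 2
p pagination(total_count: 50_000, page: 34)  # capped at 34 pages, no next page
```

One consequence of matching the fragment verbatim is that a fragment which
does not occur in the fetched file yields no line range at all, which is why
the tool can return matches without line numbers.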