mirror of
https://gh.wpcy.net/https://github.com/discourse/discourse.git
synced 2026-06-19 07:43:46 +08:00

GitHub oneboxes and the discourse-github plugin talked to GitHub's REST
and GraphQL APIs with no rate-limit awareness. On busy instances this
exhausted GitHub's limits (60 requests/hour unauthenticated, 5,000
requests/hour authenticated), and because there was no backoff, every
render kept hitting GitHub and re-failing -- which GitHub's docs warn
can get an integration banned. The recently added PR-status onebox
multiplied the number of calls and made the problem far worse.

GitHub access was also fragmented: the core onebox engines used OpenURI,
the discourse-github plugin used Octokit, and the discourse-ai bot tools
used FinalDestination::HTTP -- three HTTP stacks, three tokens, and
inconsistent (or entirely missing) error and rate-limit handling.

This introduces a single client, Discourse::GithubApi, that every GitHub
data-API request now flows through. It is built on Faraday with the
SSRF-safe FinalDestination adapter and:

- authenticates per token (Bearer) and returns plain string-keyed Hashes
  (get/post) or raw bodies (raw_get) -- one response shape, no
  Octokit/Sawyer
- only ever sends the access token to api.github.com and
  raw.githubusercontent.com, rejecting any other absolute URL, so a
  user-derived path can never leak a token to an arbitrary host
- backs off on rate limits both reactively (403/429) and proactively
  (when X-RateLimit-Remaining hits 0), honouring Retry-After /
  X-RateLimit-Reset, via a shared Redis flag (GithubRateLimit) keyed per
  token so each token's budget and the shared unauthenticated/IP budget
  back off independently
- short-circuits while backing off without ever sleeping, so onebox
  rendering and post baking degrade to a plain link instead of blocking
  a request
- caches ETags and sends If-None-Match, so unchanged resources return
  304s that do not count against the rate limit
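
The backoff behaviour in the list above can be pictured with a minimal
sketch. This is not the Discourse::GithubApi or GithubRateLimit code:
BackoffSketch and its methods are hypothetical names, an in-memory Hash
stands in for the shared Redis flag, and the rule for when a 403 counts
as a rate limit is an assumption made for illustration.

```ruby
require "digest"

# Hypothetical, in-memory stand-in for a per-token backoff flag.
class BackoffSketch
  DEFAULT_SECONDS = 60 # assumed fallback when GitHub sends no hint

  def initialize
    @until = {} # identity key => epoch second at which the backoff ends
  end

  # One key per token; a nil token maps to the shared unauthenticated budget.
  def key_for(token)
    token ? "token:#{Digest::SHA256.hexdigest(token)[0, 12]}" : "anonymous"
  end

  # Callers check this first and skip the request instead of sleeping.
  def backing_off?(token, now: Time.now.to_i)
    (@until[key_for(token)] || 0) > now
  end

  # Record a response. `headers` is a Hash with lowercase String keys.
  def observe(token, status:, headers:, now: Time.now.to_i)
    # Reactive: 429, or a 403 that carries Retry-After (assumed heuristic).
    limited = status == 429 || (status == 403 && headers.key?("retry-after"))
    # Proactive: the budget is spent even though this response succeeded.
    exhausted = headers["x-ratelimit-remaining"] == "0"
    return unless limited || exhausted

    @until[key_for(token)] = now + delay(headers, now)
  end

  private

  # Prefer Retry-After (seconds), then X-RateLimit-Reset (epoch seconds).
  def delay(headers, now)
    if headers["retry-after"].to_s.match?(/\A\d+\z/)
      headers["retry-after"].to_i
    elsif headers["x-ratelimit-reset"].to_s.match?(/\A\d+\z/)
      [headers["x-ratelimit-reset"].to_i - now, 1].max
    else
      DEFAULT_SECONDS
    end
  end
end
```

A caller in this sketch would test backing_off? before each request and
fall back to a plain link when it returns true, which is what keeps
rendering from blocking.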

Every caller was moved onto it:

- the 6 core GitHub onebox engines, via a slimmed
  Onebox::Mixins::GithubApi adapter that keeps their public methods and
  translates client errors back to the OpenURI::HTTPError vocabulary
  they already rescue (engines unchanged)
- the github_blob raw.githubusercontent.com fetch
- the discourse-github plugin (badges, linkback, permalinks, token
  validator), which no longer uses the octokit and sawyer gems (they
  stay in the Gemfile for the discourse-code-review official plugin,
  which still depends on them)
- the discourse-ai bot's GitHub tools (search code, diff, file content,
  search files)
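
The error translation mentioned for the onebox adapter can be sketched
as below. The real Onebox::Mixins::GithubApi is not shown here, so
SketchGithubError and translating_client_errors are hypothetical names;
only OpenURI::HTTPError and its (message, io) constructor come from
Ruby's open-uri library.

```ruby
require "open-uri"
require "stringio"

# Hypothetical stand-in for an error raised by the new client.
class SketchGithubError < StandardError
  attr_reader :status

  def initialize(message, status)
    super(message)
    @status = status
  end
end

# Runs the block and re-raises client errors as OpenURI::HTTPError, so
# existing `rescue OpenURI::HTTPError` clauses keep working unchanged.
def translating_client_errors
  yield
rescue SketchGithubError => e
  # OpenURI::HTTPError expects an IO-like object; an empty one suffices here.
  raise OpenURI::HTTPError.new("#{e.status} #{e.message}", StringIO.new)
end
```

An engine that already rescues OpenURI::HTTPError would then need no
changes, which matches the "engines unchanged" note above.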

Also adds a GithubOneboxBackoff admin problem check that surfaces while
one of the onebox token identities is backing off -- scoped to the
tokens resolved by Onebox::GithubAccess (each configured
github_onebox_access_tokens entry plus the unauthenticated client) so a
backoff on the AI bot or linkback token is not misattributed to onebox.
Its message points admins at the relevant setting with the
{{setting:...}} link marker, which problem-check messages now expand
too.

Onebox token resolution is centralised in Onebox::GithubAccess, and the
onebox cache TTL for transient GitHub failures is shortened so they
recover quickly.

GitHub OAuth login, theme git-clone, the inbound webhook, and the
Oneboxer FinalDestination URL-resolution special-cases for github.com
are intentionally out of scope -- they are different concerns, not the
rate-limited data API.
182 lines
5.2 KiB
Ruby
Vendored
# frozen_string_literal: true

module DiscourseAi
  module Agents
    module Tools
      # Bot tool that retrieves the diff and metadata for a GitHub pull
      # request or commit.
      class GithubDiff < Tool
        # Per-file diff chunks of this many characters or more are redacted.
        LARGE_OBJECT_THRESHOLD = 30_000

        def self.signature
          {
            name: name,
            description: "Retrieves the diff for a GitHub pull request or commit",
            parameters: [
              {
                name: "repo",
                description: "The repository name in the format 'owner/repo'",
                type: "string",
                required: true,
              },
              {
                name: "pull_id",
                description: "The pull request number (use this OR sha, not both)",
                type: "integer",
                required: false,
              },
              {
                name: "sha",
                description: "The commit SHA (use this OR pull_id, not both)",
                type: "string",
                required: false,
              },
            ],
          }
        end

        def self.name
          "github_diff"
        end

        def repo
          parameters[:repo]
        end

        def pull_id
          parameters[:pull_id]
        end

        def sha
          parameters[:sha]
        end

        def url
          @url
        end

        def invoke
          return { error: "Must provide either pull_id or sha" } if pull_id.blank? && sha.blank?

          # Prioritize sha if present (LLMs sometimes pass both)
          sha.present? ? fetch_commit : fetch_pull_request
        end

        def description_args
          if sha.present?
            { repo: repo, ref: sha, url: url }
          else
            { repo: repo, ref: pull_id, url: url }
          end
        end

        # Splits a diff into per-file chunks at each "diff --git" header,
        # orders the chunks from shortest to longest, and replaces any chunk
        # of `threshold` characters or more with its header plus a redaction
        # notice. Text before the first header is dropped.
        def self.sort_and_shorten_diff(diff, threshold: LARGE_OBJECT_THRESHOLD)
          file_start_regex = /^diff --git.*/

          prev_start = -1
          prev_match = nil

          split = []

          diff.scan(file_start_regex) do |match|
            match_start = $~.offset(0)[0]

            if prev_start != -1
              full_diff = diff[prev_start...match_start]
              split << [prev_match, full_diff]
            end

            prev_match = match
            prev_start = match_start
          end

          split << [prev_match, diff[prev_start..-1]] if prev_match

          split.sort! { |x, y| x[1].length <=> y[1].length }

          split
            .map do |x, y|
              if y.length < threshold
                y
              else
                "#{x}\nRedacted, Larger than #{threshold} chars"
              end
            end
            .join("\n")
        end

        private

        def fetch_pull_request
          api_url = "https://api.github.com/repos/#{repo}/pulls/#{pull_id}"
          @url = "https://github.com/#{repo}/pull/#{pull_id}"

          fetch_diff(api_url, "PR") do |info, diff|
            source_repo = info.dig("head", "repo", "full_name")
            source_branch = info.dig("head", "ref")
            source_sha = info.dig("head", "sha")
            target_repo = info.dig("base", "repo", "full_name")
            target_branch = info.dig("base", "ref")

            {
              type: "pull_request",
              diff: diff,
              pr_info: {
                title: info["title"],
                state: info["state"],
                source: {
                  repo: source_repo,
                  branch: source_branch,
                  sha: source_sha,
                  url: "https://github.com/#{source_repo}/tree/#{source_branch}",
                },
                target: {
                  repo: target_repo,
                  branch: target_branch,
                },
                author: info.dig("user", "login"),
                created_at: info["created_at"],
                updated_at: info["updated_at"],
              },
            }
          end
        end

        def fetch_commit
          api_url = "https://api.github.com/repos/#{repo}/commits/#{sha}"
          @url = "https://github.com/#{repo}/commit/#{sha}"

          fetch_diff(api_url, "commit") do |info, diff|
            {
              type: "commit",
              diff: diff,
              commit_info: {
                sha: info["sha"],
                message: info.dig("commit", "message"),
                author: info.dig("commit", "author", "name"),
                author_login: info.dig("author", "login"),
                date: info.dig("commit", "author", "date"),
                url: @url,
                stats: {
                  additions: info.dig("stats", "additions"),
                  deletions: info.dig("stats", "deletions"),
                  total: info.dig("stats", "total"),
                },
                files_changed: info["files"]&.length || 0,
              },
            }
          end
        end

        # Requests api_url twice through github_client: once for the JSON
        # metadata and once for the raw diff. The shortened diff and the
        # metadata are yielded to the block, whose value is returned. A
        # Discourse::GithubApi::Error becomes an { error: ... } hash.
        def fetch_diff(api_url, type)
          info = github_client.get(api_url)
          diff_body = github_client.raw_get(api_url, accept: "application/vnd.github.v3.diff")

          diff = self.class.sort_and_shorten_diff(diff_body)
          diff = truncate(diff, max_length: 20_000, percent_length: 0.3, llm: llm)
          yield(info, diff)
        rescue Discourse::GithubApi::Error => e
          { error: "Failed to retrieve the #{type} information. #{e.message}" }
        end
      end
    end
  end
end
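
To see what sort_and_shorten_diff does with a mixed diff, the sketch
below lifts the same logic into a standalone method so it runs without
Discourse. The sample diff and the threshold of 40 are made up for
illustration, and sort_by is used here in place of the original sort!.

```ruby
# Standalone copy of the per-file split, sort, and redact logic.
def sort_and_shorten_diff(diff, threshold: 30_000)
  chunks = []
  prev_start = nil
  prev_header = nil

  diff.scan(/^diff --git.*/) do |header|
    start = Regexp.last_match.begin(0)
    chunks << [prev_header, diff[prev_start...start]] if prev_start
    prev_header = header
    prev_start = start
  end
  chunks << [prev_header, diff[prev_start..]] if prev_header

  chunks
    .sort_by { |_, body| body.length }
    .map do |header, body|
      body.length < threshold ? body : "#{header}\nRedacted, Larger than #{threshold} chars"
    end
    .join("\n")
end

# Two files: one long chunk and one short chunk (fabricated sample).
sample = <<~DIFF
  diff --git a/big.rb b/big.rb
  +#{"x" * 50}
  diff --git a/small.rb b/small.rb
  +ok
DIFF

result = sort_and_shorten_diff(sample, threshold: 40)
puts result
```

With this input the small.rb chunk is printed first and in full, while
the big.rb chunk is reduced to its header line followed by the
redaction notice.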