mirror of
https://gh.wpcy.net/https://github.com/discourse/discourse.git
synced 2026-06-19 03:23:50 +08:00
Introduce an "agentic" execution mode as an alternative to the default fixed-turn/tool-limit approach. In agentic mode, personas use a configurable token budget (`max_turn_tokens`) to govern how long a tool-use session can run, with automatic context compression when the conversation exceeds a configurable threshold percentage (`compression_threshold`) of the model's context window.

Key changes:

- Add `execution_mode`, `max_turn_tokens`, and `compression_threshold` columns to `ai_personas` via migration
- Refactor `Bot#reply` to support token-budget loop control with a thread-local token accumulator, budget exhaustion hints, and a safety valve at 100 completions
- Add `maybe_compress_context` which summarizes middle conversation messages when token usage crosses the compression threshold, preserving system prompt and recent tail messages
- Update `StreamReplyCustomToolsSession` to track accumulated tokens across rounds and handle budget exhaustion in the custom tools path
- Discount cached tokens (Anthropic) in the token accumulator to avoid over-counting reused KV cache prefixes
- Update persona editor UI with execution mode selector and conditional fields (agentic shows token budget/compression; default shows max context posts)
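The budget-driven loop and the compression check described in this message can be sketched roughly as follows. This is an illustration only, not the code in `Bot#reply`: `run_agentic_turn` and `compress?` are hypothetical names, the block standing in for one LLM completion is invented, and treating `compression_threshold` as an integer percentage (e.g. `80`) is an assumption.

```ruby
# Safety valve taken from the commit message: never exceed 100 completions.
MAX_COMPLETIONS = 100

# Hypothetical helper: true once usage reaches the given percentage of the
# model's context window. Assumes compression_threshold is an integer percent.
def compress?(used_tokens:, context_window:, compression_threshold:)
  used_tokens >= context_window * compression_threshold / 100.0
end

# Hypothetical loop: keeps requesting completions until the token budget is
# spent, the block reports it is done, or the safety valve trips.
# The block stands in for one completion and returns [tokens_used, done].
def run_agentic_turn(max_turn_tokens:)
  spent = 0
  completions = 0

  while spent < max_turn_tokens && completions < MAX_COMPLETIONS
    tokens, done = yield(completions)
    spent += tokens
    completions += 1
    break if done
  end

  { spent: spent, completions: completions, exhausted: spent >= max_turn_tokens }
end

# Each simulated completion costs 4,000 tokens and never finishes on its own.
result = run_agentic_turn(max_turn_tokens: 10_000) { |_round| [4_000, false] }
puts result.inspect

needs_compression =
  compress?(used_tokens: 85_000, context_window: 100_000, compression_threshold: 80)
puts needs_compression
```

In this sketch the budget is checked before each completion, so the final round may overshoot `max_turn_tokens`; whether the real loop behaves the same way is not stated in the message.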
58 lines
1.6 KiB
Ruby
Vendored
# frozen_string_literal: true

module DiscourseAi
  module Completions
    class TokenUsageTracker
      def initialize(base_total: nil, base_request: nil, base_response: nil)
        @mutex = Mutex.new
        if base_request.nil? && base_response.nil?
          total = base_total.to_i
          initial_request = total / 2
          @request = initial_request
          @response = total - initial_request
        else
          if !base_total.nil?
            raise ArgumentError, "base_total cannot be combined with base_request/base_response"
          end
          if base_request.nil? || base_response.nil?
            raise ArgumentError, "base_request and base_response must both be provided"
          end

          @request = base_request.to_i
          @response = base_response.to_i
        end
      end

      def add_from_audit_log(log)
        # request_tokens = non-cached input (already excludes cached)
        # cache_write_tokens = newly cached (full cost)
        # cache_read_tokens = served from cache (1/10 cost)
        add_effective(
          request:
            log.request_tokens.to_i + log.cache_write_tokens.to_i +
              (log.cache_read_tokens.to_i * 0.1).to_i,
          response: log.response_tokens.to_i,
        )
      end

      def add_effective(request:, response:)
        @mutex.synchronize do
          @request += request.to_i
          @response += response.to_i
        end
      end

      def request
        @mutex.synchronize { @request }
      end

      def response
        @mutex.synchronize { @response }
      end

      def total
        @mutex.synchronize { @request + @response }
      end
    end
  end
end
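
The two pieces of arithmetic in `TokenUsageTracker` (the 10% weighting of cache reads in `add_from_audit_log`, and the even split of `base_total` in the constructor) can be checked in isolation with a small sketch. `FakeAuditLog`, `effective_request_tokens`, and `split_base_total` are hypothetical names introduced only for this illustration; they copy the expressions above and are not part of the plugin.

```ruby
# Hypothetical stand-in for an audit log record; it models only the four
# fields that add_from_audit_log reads.
FakeAuditLog =
  Struct.new(
    :request_tokens,
    :cache_write_tokens,
    :cache_read_tokens,
    :response_tokens,
    keyword_init: true,
  )

# Same expression as in add_from_audit_log: cache writes count in full,
# cache reads count at 10% (truncated to an integer), nil counts as 0.
def effective_request_tokens(log)
  log.request_tokens.to_i + log.cache_write_tokens.to_i +
    (log.cache_read_tokens.to_i * 0.1).to_i
end

# Same split as the constructor applies when only base_total is given:
# integer-halve for the request side, remainder goes to the response side.
def split_base_total(base_total)
  total = base_total.to_i
  initial_request = total / 2
  [initial_request, total - initial_request]
end

log =
  FakeAuditLog.new(
    request_tokens: 1_200,
    cache_write_tokens: 300,
    cache_read_tokens: 8_000,
    response_tokens: 450,
  )

effective = effective_request_tokens(log)
puts effective # 1200 + 300 + 800 = 2300

request_share, response_share = split_base_total(7)
puts "#{request_share} / #{response_share}" # 3 / 4
```

With an odd `base_total` the extra token lands on the response side, so `request + response` always equals the original total.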