discourse/plugins/discourse-ai/lib/completions/token_usage_tracker.rb
Sam b8abe100c5
FEATURE: add agentic execution mode for AI personas (#38230)
Introduce an "agentic" execution mode as an alternative to the
default fixed-turn/tool-limit approach. In agentic mode, personas
use a configurable token budget (`max_turn_tokens`) to govern how
long a tool-use session can run, with automatic context compression
when the conversation exceeds a configurable threshold percentage
(`compression_threshold`) of the model's context window.

Key changes:

- Add `execution_mode`, `max_turn_tokens`, and `compression_threshold`
  columns to `ai_personas` via migration
- Refactor `Bot#reply` to support token-budget loop control with a
  thread-local token accumulator, budget exhaustion hints, and a
  safety valve at 100 completions
- Add `maybe_compress_context` which summarizes middle conversation
  messages when token usage crosses the compression threshold,
  preserving system prompt and recent tail messages
- Update `StreamReplyCustomToolsSession` to track accumulated tokens
  across rounds and handle budget exhaustion in the custom tools path
- Discount cached tokens (Anthropic) in the token accumulator to
  avoid over-counting reused KV cache prefixes
- Update persona editor UI with execution mode selector and
  conditional fields (agentic shows token budget/compression;
  default shows max context posts)
2026-03-05 15:06:54 +11:00
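
The loop control described in the commit message above (a token budget, a compression threshold, and a safety valve at 100 completions) can be illustrated with a minimal sketch. This is not the plugin's `Bot#reply`; `run_agentic_turn`, `should_compress?`, `round_costs`, and `MAX_COMPLETIONS` are hypothetical names, and the sketch assumes `compression_threshold` is stored as a whole-number percentage.

```ruby
# Hypothetical sketch of budget-governed loop control; not the plugin's code.
MAX_COMPLETIONS = 100 # safety valve named in the commit message

# Assumes compression_threshold is a whole-number percentage (e.g. 80).
# Integer arithmetic avoids float rounding in the comparison.
def should_compress?(context_tokens:, context_window:, compression_threshold:)
  context_tokens * 100 > context_window * compression_threshold
end

# round_costs stands in for the tokens consumed by each completion round.
# A round starts only while budget remains and the safety valve is not hit,
# so the final round may overshoot the budget.
def run_agentic_turn(max_turn_tokens:, round_costs:)
  spent = 0
  completions = 0

  round_costs.each do |cost|
    break if spent >= max_turn_tokens
    break if completions >= MAX_COMPLETIONS

    spent += cost
    completions += 1
  end

  { spent: spent, completions: completions }
end
```

In this sketch the budget is checked before each round rather than after, so a session stops at the first round boundary at or past `max_turn_tokens`; whether the real loop checks before or after a round is not stated in the commit message.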


# frozen_string_literal: true

module DiscourseAi
  module Completions
    # Mutex-guarded running tally of request and response tokens.
    class TokenUsageTracker
      # Seed with either base_total alone, or with both base_request and
      # base_response. Mixing the two forms raises ArgumentError.
      def initialize(base_total: nil, base_request: nil, base_response: nil)
        @mutex = Mutex.new

        if base_request.nil? && base_response.nil?
          # Only a total is known: split it in half, with any odd token
          # going to the response side.
          total = base_total.to_i
          initial_request = total / 2
          @request = initial_request
          @response = total - initial_request
        else
          if !base_total.nil?
            raise ArgumentError, "base_total cannot be combined with base_request/base_response"
          end

          if base_request.nil? || base_response.nil?
            raise ArgumentError, "base_request and base_response must both be provided"
          end

          @request = base_request.to_i
          @response = base_response.to_i
        end
      end

      def add_from_audit_log(log)
        # request_tokens = non-cached input (already excludes cached)
        # cache_write_tokens = newly cached (full cost)
        # cache_read_tokens = served from cache (1/10 cost)
        add_effective(
          request:
            log.request_tokens.to_i + log.cache_write_tokens.to_i +
              (log.cache_read_tokens.to_i * 0.1).to_i,
          response: log.response_tokens.to_i,
        )
      end

      # Adds already-discounted token counts to the running totals.
      def add_effective(request:, response:)
        @mutex.synchronize do
          @request += request.to_i
          @response += response.to_i
        end
      end

      def request
        @mutex.synchronize { @request }
      end

      def response
        @mutex.synchronize { @response }
      end

      def total
        @mutex.synchronize { @request + @response }
      end
    end
  end
end
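
As a worked example of the accounting in `add_from_audit_log` and of the `base_total` split in `initialize`, the arithmetic can be reproduced in isolation. `AuditLogRow`, `effective_request_tokens`, and `split_base_total` are hypothetical stand-ins written for this illustration; only the four field names and the formulas are taken from the class above.

```ruby
# Hypothetical stand-in for an audit log record; it carries only the four
# fields that add_from_audit_log reads.
AuditLogRow =
  Struct.new(
    :request_tokens,
    :cache_write_tokens,
    :cache_read_tokens,
    :response_tokens,
    keyword_init: true,
  )

# Mirrors the request-side formula: non-cached input and cache writes count
# in full, cache reads count at one tenth (truncated to an integer).
def effective_request_tokens(log)
  log.request_tokens.to_i + log.cache_write_tokens.to_i +
    (log.cache_read_tokens.to_i * 0.1).to_i
end

# Mirrors the base_total branch: half to request, remainder to response.
def split_base_total(base_total)
  total = base_total.to_i
  request = total / 2
  [request, total - request]
end

row =
  AuditLogRow.new(
    request_tokens: 1_200,
    cache_write_tokens: 300,
    cache_read_tokens: 8_000,
    response_tokens: 450,
  )

puts effective_request_tokens(row) # 1200 + 300 + 800 = 2300
puts split_base_total(1_001).inspect # [500, 501]
```

Because every field passes through `to_i`, a `nil` value counts as zero rather than raising, and a cache read of fewer than 10 tokens contributes nothing after truncation.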