discourse/plugins/discourse-ai/lib/completions/llm.rb
Rafael dos Santos Silva 18a0a8daeb
FEATURE: Add AWS Bedrock Converse API provider (#38903)
## Summary

Adds a new `aws_bedrock_converse` inference provider that uses the
official AWS SDK (`aws-sdk-bedrockruntime`) and the Converse API. It
runs alongside the existing `aws_bedrock` provider and is fully additive:
existing configurations are left untouched.

### Why a new provider?

The existing `aws_bedrock` provider manually handles SigV4 signing, URL
construction, binary event stream decoding, and maintains a hardcoded
model ID mapping table. It only supports Claude and Nova models.

The new provider delegates all of this to the official AWS SDK, which
means:

- **Model-agnostic** — works with any model available on Bedrock
(Claude, Nova, Kimi, MiniMax, Mistral, Llama, DeepSeek, NVIDIA, Qwen,
GLM, etc.) without any model-specific code
- **Application Inference Profiles** — users can set cross-region
profiles (`us.anthropic.claude-sonnet-4-20250514-v1:0`) or application
inference profile ARNs directly as the model name
- **Bedrock API Key auth** — supports the new AWS Bedrock API keys
(Bearer token auth) in addition to IAM access keys, STS role assumption,
and automatic credential resolution from environment/instance profiles
- **No maintenance burden** — no model ID mapping table to update when
AWS adds new models, no manual SigV4 signing, no binary event stream
decoding
- **Native tools only** — no XML tool fallback; uses the Converse API's
built-in tool support
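
To illustrate the model-agnostic point, here is a minimal sketch of a Converse request built as a plain hash. The helper name `build_converse_request` and the account ID in the ARN are hypothetical; the hash keys follow the `converse` parameters of the AWS SDK for Ruby as I understand them, and this is not the code added by this PR.

```ruby
# Hypothetical helper: builds keyword arguments for a Converse call.
# Nothing in here is model-specific; only model_id differs between models.
def build_converse_request(model_id:, user_text:, system_prompt: nil, max_tokens: 512)
  request = {
    model_id: model_id,
    messages: [{ role: "user", content: [{ text: user_text }] }],
    inference_config: { max_tokens: max_tokens },
  }
  # The system prompt is optional and travels outside the message list.
  request[:system] = [{ text: system_prompt }] if system_prompt
  request
end

# Cross-region inference profile used directly as the model name.
profile_request =
  build_converse_request(
    model_id: "us.anthropic.claude-sonnet-4-20250514-v1:0",
    user_text: "Hello",
    system_prompt: "You are a helpful bot",
  )

# Application inference profile ARN (placeholder account ID and profile name).
arn_request =
  build_converse_request(
    model_id: "arn:aws:bedrock:us-east-1:123456789012:application-inference-profile/example",
    user_text: "Hello",
  )

# With the aws-sdk-bedrockruntime gem and credentials available, the call
# would be roughly:
#   Aws::BedrockRuntime::Client.new(region: "us-east-1").converse(**profile_request)
```

Both requests share the same message structure, which is why no model ID mapping table is needed.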

### Authentication options (priority order)

| Config | Auth method |
|---|---|
| `role_arn` set | STS AssumeRole (SigV4) |
| `access_key_id` set | Static IAM credentials (SigV4) |
| API key set (no `access_key_id`/`role_arn`) | Bearer token (Bedrock API key) |
| Nothing set | SDK auto-resolves (env vars, instance profile, ECS task role) |
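
The priority order in the table can be sketched as a small resolver. The method name, the config keys (`:role_arn`, `:access_key_id`, `:api_key`) and the returned symbols are assumptions made for illustration; the provider's actual parameter names may differ.

```ruby
# Hypothetical resolver mirroring the priority table above.
# Blank strings are treated the same as missing values.
def bedrock_auth_method(config)
  present = ->(key) { !config[key].to_s.strip.empty? }

  return :sts_assume_role if present.call(:role_arn)
  return :static_iam_credentials if present.call(:access_key_id)
  return :bearer_token if present.call(:api_key)

  # Nothing configured: let the SDK resolve credentials on its own.
  :sdk_default_chain
end

role_and_key = bedrock_auth_method(role_arn: "arn:aws:iam::123456789012:role/example", api_key: "k")
only_api_key = bedrock_auth_method(api_key: "k", access_key_id: "")
nothing_set = bedrock_auth_method({})
```

Checking `role_arn` first means an API key stored on the same model is ignored whenever a role or IAM key is also configured.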

### Features supported

- Streaming and non-streaming completions
- Native tool use with tool_choice (auto/any/specific tool)
- Structured output via Converse API's `output_config` (models that
support it)
- Extended thinking / adaptive thinking with signature preservation for
multi-turn
- Interleaved thinking with tool calls (thinking blocks preserved per
tool_call message)
- Prompt caching via `cache_point` blocks
- Effort parameter (low/medium/high/max)
- `extra_model_fields` provider param for arbitrary
`additionalModelRequestFields` (beta features like `anthropic_beta`, 1M
context, interleaved thinking)
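
As a rough sketch of how two of these features map onto Converse request structures: the helper names and the input tool format (`:name`, `:description`, `:parameters`) are hypothetical, and the mapping from `:auto`/`:any`/tool name is an assumption about how a caller might express the choice. The output shapes (`tool_spec`, `tool_choice`, `cache_point`) follow the Converse API as I understand it.

```ruby
# Hypothetical mapping from a generic tool_choice value to the Converse shape.
def converse_tool_choice(choice)
  case choice
  when nil, :auto
    { auto: {} }
  when :any
    { any: {} }
  else
    # Anything else is treated as the name of a specific tool.
    { tool: { name: choice.to_s } }
  end
end

# Hypothetical builder for the tool_config section of a Converse request.
def converse_tool_config(tools, choice: :auto)
  specs =
    tools.map do |tool|
      {
        tool_spec: {
          name: tool[:name],
          description: tool[:description],
          input_schema: { json: tool[:parameters] },
        },
      }
    end

  { tools: specs, tool_choice: converse_tool_choice(choice) }
end

# Appends a cache checkpoint after the given content blocks.
def with_cache_point(content_blocks)
  content_blocks + [{ cache_point: { type: "default" } }]
end

search_tool = {
  name: "search",
  description: "Search the forum",
  parameters: { type: "object", properties: { query: { type: "string" } } },
}

tool_config = converse_tool_config([search_tool], choice: "search")
cached_content = with_cache_point([{ text: "Long, reusable system context" }])
```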

### New files

- `lib/completions/endpoints/aws_bedrock_converse.rb` — endpoint using
`Aws::BedrockRuntime::Client`
- `lib/completions/dialects/converse.rb` — unified Converse API dialect
- `lib/completions/dialects/converse_tools.rb` — tool formatting
- `lib/completions/converse_message_processor.rb` — response processing
for SDK typed objects

## Tested against real Bedrock API

All tests performed using Bedrock API Key auth (Bearer token) against
live endpoints with 9 different models from 8 providers:

| Test | Claude Sonnet 4 | Claude Haiku 4.5 | Kimi K2.5 | MiniMax M2 | DeepSeek 3.2 | NVIDIA Nemotron 3 120B | Qwen3 Next 80B | GLM 5 | Mistral Small |
|---|---|---|---|---|---|---|---|---|---|
| Non-streaming text | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Streaming text | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Multi-turn conversation | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Tool use (non-streaming) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Tool use (streaming) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | model unsupported |
| Structured output (non-streaming) | - | ✅ | model unsupported | ✅ | ✅ | ✅ | ✅ | ✅ | model unsupported |
| Structured output (streaming) | - | ✅ | model unsupported | ✅ | ✅ | ✅ | ✅ | ✅ | model unsupported |
| Bearer token auth | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Cross-region inference profile | ✅ | ✅ | - | - | - | - | - | - | - |
| Audit logging + token tracking | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

> **Notes:**
> - Claude Sonnet 4 structured output was not tested: the feature requires
Claude 4.5 or later, and cross-region profiles for those models were not
available in the test region.
> - Kimi K2.5 and Mistral Small do not support Bedrock's native
structured output.
> - Mistral Small does not support streaming tool use.
> - All "model unsupported" results are model-level limitations, not code
issues: the Converse API correctly surfaces the error.

## Test plan

- [ ] Existing `aws_bedrock` provider tests pass (`bin/rspec
spec/lib/completions/endpoints/aws_bedrock_spec.rb`)
- [ ] New provider tests pass (`bin/rspec
spec/lib/completions/endpoints/aws_bedrock_converse_spec.rb`)
- [ ] Create an LLM model with provider "AWS Bedrock (Converse API)" in
admin UI
- [ ] Verify basic completion works with a Bedrock API key (just region
+ API key, no IAM keys needed)
- [ ] Verify tool use works in AI bot conversations
- [ ] Verify structured output works with a supported model (Claude
Haiku 4.5+)
2026-03-30 12:37:30 -03:00


# frozen_string_literal: true

# A facade that abstracts multiple LLMs behind a single interface.
#
# Internally, it consists of the combination of a dialect and an endpoint.
# After receiving a prompt using our generic format, it translates it to
# the target model and routes the completion request through the correct gateway.
#
# Use the .proxy method to instantiate an object.
# It chooses the correct dialect and endpoint for the model you want to interact with.
#
# Tests of modules that perform LLM calls can use .with_prepared_responses to return canned responses
# instead of relying on WebMock stubs like we did in the past.
#
module DiscourseAi
  module Completions
    class Llm
      UNKNOWN_MODEL = Class.new(StandardError)

      class << self
        def presets
          LlmPresets.all
        end

        def provider_names
          providers = %w[
            aws_bedrock
            aws_bedrock_converse
            anthropic
            vllm
            hugging_face
            cohere
            open_ai
            google
            azure
            samba_nova
            mistral
            open_router
            groq
          ]

          if !Rails.env.production?
            providers << "fake"
            providers << "ollama"
          end

          providers
        end

        def tokenizer_names
          DiscourseAi::Tokenizer::BasicTokenizer.available_llm_tokenizers.map(&:name)
        end

        def valid_provider_models
          return @valid_provider_models if defined?(@valid_provider_models)

          valid_provider_models = []
          models_by_provider.each do |provider, models|
            valid_provider_models.concat(models.map { |model| "#{provider}:#{model}" })
          end

          @valid_provider_models = Set.new(valid_provider_models)
        end

        def with_prepared_responses(responses, llm: nil)
          @canned_response = DiscourseAi::Completions::Endpoints::CannedResponse.new(responses)
          @canned_llm = llm
          @prompts = []
          @prompt_options = []

          yield(@canned_response, llm, @prompts, @prompt_options)
        ensure
          # Don't leak prepared response if there's an exception.
          @canned_response = nil
          @canned_llm = nil
          @prompts = nil
        end

        def record_prompt(prompt, options)
          @prompts << prompt.dup if @prompts
          @prompt_options << options if @prompt_options
        end

        def prompt_options
          @prompt_options
        end

        def prompts
          @prompts
        end

        def proxy(model)
          llm_model =
            if model.is_a?(LlmModel)
              model
            elsif model.is_a?(Numeric)
              LlmModel.find_by(id: model)
            else
              # String identifiers carry the LlmModel id after the last ":".
              model_name_without_prov = model.split(":").last.to_i
              LlmModel.find_by(id: model_name_without_prov)
            end

          raise UNKNOWN_MODEL if llm_model.nil?

          dialect_klass = DiscourseAi::Completions::Dialects::Dialect.dialect_for(llm_model)

          if @canned_response
            if @canned_llm && @canned_llm != model
              raise "Invalid LLM call, expected #{@canned_llm} but got #{model}"
            end

            return new(dialect_klass, nil, llm_model, gateway: @canned_response)
          end

          gateway_klass = DiscourseAi::Completions::Endpoints::Base.endpoint_for(llm_model)

          new(dialect_klass, gateway_klass, llm_model)
        end
      end

      def initialize(dialect_klass, gateway_klass, llm_model, gateway: nil)
        @dialect_klass = dialect_klass
        @gateway_klass = gateway_klass
        @gateway = gateway
        @llm_model = llm_model
      end

      # @param prompt { DiscourseAi::Completions::Prompt | String | Array } - Our generic prompt object, or a String / Array of messages that gets wrapped in one.
      # @param user { User } - User requesting the completion.
      # @param temperature { Float - Optional } - The temperature to use for the completion.
      # @param top_p { Float - Optional } - The top_p to use for the completion.
      # @param max_tokens { Integer - Optional } - The maximum number of tokens to generate.
      # @param stop_sequences { Array<String> - Optional } - The stop sequences to use for the completion.
      # @param feature_name { String - Optional } - The feature name to use for the completion.
      # @param feature_context { Hash - Optional } - The feature context to use for the completion.
      # @param partial_tool_calls { Boolean - Optional } - If true, the completion will return partial tool calls.
      # @param output_thinking { Boolean - Optional } - If true, the completion will return the thinking output for thinking models.
      # @param response_format { Hash - Optional } - JSON schema passed to the API as the desired structured output.
      # @param [Experimental] extra_model_params { Hash - Optional } - Other params that are not available across models. e.g. response_format JSON schema.
      # @param cancel_manager { Optional } - Passed through to the gateway unchanged.
      # @param execution_context { DiscourseAi::Completions::ExecutionContext - Optional } - Explicit per-call context for token tracking and audit logging.
      #
      # @param &on_partial_blk { Block - Optional } - The passed block will get called with the LLM partial response.
      #
      # @returns String | ToolCall - Completion result.
      # if multiple tools or a tool and a message come back, the result will be an array of ToolCall / String objects.
      #
      def generate(
        prompt,
        temperature: nil,
        top_p: nil,
        max_tokens: nil,
        stop_sequences: nil,
        user:,
        feature_name: nil,
        feature_context: nil,
        partial_tool_calls: false,
        output_thinking: false,
        response_format: nil,
        extra_model_params: nil,
        cancel_manager: nil,
        execution_context: nil,
        &partial_read_blk
      )
        self.class.record_prompt(
          prompt,
          {
            temperature: temperature,
            top_p: top_p,
            max_tokens: max_tokens,
            stop_sequences: stop_sequences,
            user: user,
            feature_name: feature_name,
            feature_context: feature_context,
            partial_tool_calls: partial_tool_calls,
            output_thinking: output_thinking,
            response_format: response_format,
            extra_model_params: extra_model_params,
          },
        )

        model_params = { max_tokens: max_tokens, stop_sequences: stop_sequences }

        if SiteSetting.ai_llm_temperature_top_p_enabled
          model_params[:temperature] = temperature if temperature
          model_params[:top_p] = top_p if top_p
        end

        # internals expect symbolized keys, so we normalize here
        response_format =
          JSON.parse(response_format.to_json, symbolize_names: true) if response_format &&
          response_format.is_a?(Hash)

        model_params[:response_format] = response_format if response_format
        model_params.merge!(extra_model_params) if extra_model_params

        if prompt.is_a?(String)
          prompt =
            DiscourseAi::Completions::Prompt.new(
              "You are a helpful bot",
              messages: [{ type: :user, content: prompt }],
            )
        elsif prompt.is_a?(Array)
          prompt = DiscourseAi::Completions::Prompt.new(messages: prompt)
        end

        if !prompt.is_a?(DiscourseAi::Completions::Prompt)
          raise ArgumentError, "Prompt must be either a string, array, or Prompt object"
        end

        # Drop unset params so the dialect and gateway only see explicit values.
        model_params.keys.each { |key| model_params.delete(key) if model_params[key].nil? }

        dialect = dialect_klass.new(prompt, llm_model, opts: model_params)

        gateway = @gateway || gateway_klass.new(llm_model)
        gateway.perform_completion!(
          dialect,
          user,
          model_params,
          feature_name: feature_name,
          feature_context: feature_context,
          partial_tool_calls: partial_tool_calls,
          output_thinking: output_thinking,
          cancel_manager: cancel_manager,
          execution_context:,
          &partial_read_blk
        )
      end

      def max_prompt_tokens
        llm_model.max_prompt_tokens
      end

      def tokenizer
        llm_model.tokenizer_class
      end

      attr_reader :llm_model

      private

      attr_reader :dialect_klass, :gateway_klass
    end
  end
end