discourse/app/services/search_indexer.rb
Joffrey JAFFEUX 7ecb945ec4
FEATURE: Add full-text search for chat messages (#34704)
## Overview

This PR introduces comprehensive search functionality for chat messages,
enabling users to search through their chat history both globally across
all accessible channels and within specific channels.

### Search Capabilities

**All-Channel Search**: When no channel is specified, users can search
across all channels they have access to. The search respects channel
permissions through `ChannelFetcher.all_secured_channel_ids`, ensuring
users only see results from channels they can view.

**Per-Channel Search**: Users can scope their search to a specific
channel by providing a `channel_id` parameter, useful for finding
messages within a particular conversation context.

**Search Features**:
- Full-text search using PostgreSQL's tsvector/tsquery
- Advanced filters: `@username` to filter by author, `#channel` to
filter by channel slug
- Sort options: relevance (default) or latest
- Pagination support
- Search data weighted by relevance

## Site Setting: `chat_search_enabled`

This feature is gated behind the `chat_search_enabled` site setting,
which is currently:
- **Default**: `false`
- **Hidden**: `true`
- **Client-accessible**: `true`

### Deployment Strategy

Due to the need for chat messages to be indexed before search becomes
useful, we're implementing a two-phase deployment:

**Phase 1 (Initial Merge)**:
- `chat_search_enabled` remains `false` and hidden
- The `register_search_index` uses default (true) instead of `chat_search_enabled` value
- This allows the reindexing infrastructure to begin indexing existing
chat messages even if we don't show the UI yet

**Wait Period**:
- Wait at least one week after Phase 1 deployment
- `Jobs::ReindexSearch` runs every 2 hours and will progressively index
all chat messages
- This ensures most sites have a significant part of their chat history indexed

**Phase 2 (Follow-up Merge)**:
- Set `chat_search_enabled` default to `true` and unhide it
- Update the `register_search_index` enabled proc uses the default
(true) instead of using the `chat_search_enabled` setting
- Users can now access search with pre-indexed data

**Rationale**: Without this phased approach, users would see the search
UI immediately but receive no results until the reindexing job runs,
creating a confusing experience. By pre-indexing while the UI is hidden,
we ensure search works immediately when enabled.

## New Plugin API: `register_search_index`

This PR introduces a new plugin API that allows plugins to register
custom search indexes that integrate seamlessly with Discourse's search
infrastructure.

### API Signature

```ruby
register_search_index(
  model_class:,              # The ActiveRecord model to index
  search_data_class:,        # The model for storing search data
  index_version:,            # Version number for re-indexing
  search_data:,              # Proc that returns weighted search data
  load_unindexed_record_ids:,# Proc that finds records needing indexing
  enabled:                   # Optional proc to enable/disable (default: -> { true })
)
```

### How It Works

**Integration with SearchIndexer**: When `SearchIndexer.index(obj)` is
called, it checks registered search handlers for the object's type. If a
handler matches, it:
1. Calls the `search_data` proc with the object and an `IndexerHelper`
instance
2. Receives weighted search data (`:a_weight`, `:b_weight`, `:c_weight`,
`:d_weight`)
3. Updates the corresponding search data table with PostgreSQL's
tsvector

**Integration with Jobs::ReindexSearch**: The scheduled job (runs every
2 hours) calls `rebuild_registered_search_handlers`, which:
1. Iterates through all registered search handlers
2. Skips handlers where `enabled` proc returns `false`
3. Calls `load_unindexed_record_ids` to find records needing indexing
4. Indexes up to `limit` records per handler (default: 10,000)

### Chat Implementation Example

```ruby
register_search_index(
  model_class: Chat::Message,
  search_data_class: Chat::MessageSearchData,
  index_version: 1,
  search_data: proc { |message, indexer_helper|
    {
      a_weight: message.message,
      d_weight: indexer_helper.scrub_html(message.cooked)[0..600_000]
    }
  },
  load_unindexed_record_ids: proc { |limit:, index_version:|
    Chat::Message
      .joins("LEFT JOIN chat_message_search_data ON chat_message_id = chat_messages.id")
      .where(
        "chat_message_search_data.locale IS NULL OR 
         chat_message_search_data.locale != ? OR 
         chat_message_search_data.version != ?",
        SiteSetting.default_locale,
        index_version
      )
      .order("chat_messages.id ASC")
      .limit(limit)
      .pluck(:id)
  }
)
```

Co-authored-by: Martin Brennan <mjrbrennan@gmail.com>
Co-authored-by: Loïc Guitaut <5648+Flink@users.noreply.github.com>
2025-10-22 11:30:35 +02:00

450 lines
13 KiB
Ruby
Vendored

# frozen_string_literal: true
class SearchIndexer
MIN_POST_BLURB_INDEX_VERSION = 4
POST_INDEX_VERSION = 5
TOPIC_INDEX_VERSION = 4
CATEGORY_INDEX_VERSION = 3
USER_INDEX_VERSION = 3
TAG_INDEX_VERSION = 3
# version to apply when issuing a background reindex
REINDEX_VERSION = 0
TS_VECTOR_PARSE_REGEX = /('([^']*|'')*'\:)(([0-9]+[A-D]?,?)+)/
def self.disable
@disabled = true
end
def self.enable
@disabled = false
end
def self.with_indexing
prior = @disabled
enable
yield
ensure
@disabled = prior
end
def self.update_index(table:, id:, a_weight: nil, b_weight: nil, c_weight: nil, d_weight: nil)
raw_data = { a: a_weight, b: b_weight, c: c_weight, d: d_weight }
# The version used in excerpts
search_data = raw_data.transform_values { |data| Search.prepare_data(data || "", :index) }
# The version used to build the index
indexed_data =
search_data.transform_values do |data|
data.gsub(/\S+/) { |word| word[0...SiteSetting.search_max_indexed_word_length] }
end
table_name = "#{table}_search_data"
foreign_key = "#{table}_id"
# for user login and name use "simple" lowercase stemmer
stemmer = table == "user" ? "simple" : Search.ts_config
ranked_index = <<~SQL
setweight(to_tsvector('#{stemmer}', #{Search.wrap_unaccent("coalesce(:a,''))")}, 'A') ||
setweight(to_tsvector('#{stemmer}', #{Search.wrap_unaccent("coalesce(:b,''))")}, 'B') ||
setweight(to_tsvector('#{stemmer}', #{Search.wrap_unaccent("coalesce(:c,''))")}, 'C') ||
setweight(to_tsvector('#{stemmer}', #{Search.wrap_unaccent("coalesce(:d,''))")}, 'D')
SQL
tsvector = DB.query_single("SELECT #{ranked_index}", indexed_data)[0]
additional_lexemes = []
# we also want to index parts of a domain name
# that way stemmed single word searches will match
additional_words = []
tsvector
.scan(/'(([a-zA-Z0-9]+\.)+[a-zA-Z0-9]+)'\:([\w+,]+)/)
.reduce(additional_lexemes) do |array, (lexeme, _, positions)|
count = 0
if lexeme !~ /\A(\d+\.)?(\d+\.)*(\*|\d+)\z/
loop do
count += 1
break if count >= 10 # Safeguard here to prevent infinite loop when a term has many dots
term, _, remaining = lexeme.partition(".")
break if remaining.blank?
additional_words << [term, positions]
array << "'#{remaining}':#{positions}"
lexeme = remaining
end
end
array
end
extra_domain_word_terms =
if additional_words.length > 0
DB
.query_single(
"SELECT to_tsvector(?, ?)",
stemmer,
additional_words.map { |term, _| term }.join(" "),
)
.first
.scan(TS_VECTOR_PARSE_REGEX)
.map do |term, _, indexes|
new_indexes =
indexes
.split(",")
.map do |index|
existing_positions = additional_words[index.to_i - 1]
if existing_positions
existing_positions[1]
else
index
end
end
.join(",")
"#{term}#{new_indexes}"
end
.join(" ")
end
tsvector = "#{tsvector} #{additional_lexemes.join(" ")} #{extra_domain_word_terms}"
if (max_dupes = SiteSetting.max_duplicate_search_index_terms) > 0
reduced = []
tsvector
.scan(TS_VECTOR_PARSE_REGEX)
.each do |term, _, indexes|
family_counts = Hash.new(0)
new_index_array = []
indexes
.split(",")
.each do |index|
family = nil
family = index[-1] if index[-1].match?(/[A-D]/)
# title dupes can completely dominate the index
# so we limit them to 1
if (family_counts[family] += 1) <= (family == "A" ? 1 : max_dupes)
new_index_array << index
end
end
reduced << "#{term.strip}#{new_index_array.join(",")}"
end
tsvector = reduced.join(" ")
end
indexed_data =
if table.to_s == "post"
clean_post_raw_data!(search_data[:d])
else
search_data.values.select { |d| d.length > 0 }.join(" ")
end
handler = DiscoursePluginRegistry.search_handlers.find { |h| h[:table_name] == table }
index_version = handler ? handler[:index_version] : const_get("#{table.upcase}_INDEX_VERSION")
params = {
"raw_data" => indexed_data,
"#{foreign_key}" => id,
"locale" => SiteSetting.default_locale,
"version" => index_version,
"search_data" => tsvector,
}
yield params if block_given?
if handler
handler[:search_data_class].upsert(params)
else
table_name.camelize.constantize.upsert(params)
end
rescue => e
if Rails.env.test?
raise
else
# TODO is there any way we can safely avoid this?
# best way is probably pushing search indexer into a dedicated process so it no longer happens on save
# instead in the post processor
Discourse.warn_exception(
e,
message: "Unexpected error while indexing #{table} for search",
env: {
id: id,
},
)
end
end
def self.update_topics_index(topic_id, title, cooked)
# a bit inconsistent that we use title as A and body as B when in
# the post index body is D
update_index(
table: "topic",
id: topic_id,
a_weight: title,
b_weight: HtmlScrubber.scrub(cooked)[0...Topic::MAX_SIMILAR_BODY_LENGTH],
)
end
def self.update_posts_index(
post_id:,
topic_title:,
category_name:,
topic_tags:,
cooked:,
private_message:
)
update_index(
table: "post",
id: post_id,
a_weight: topic_title,
b_weight: category_name,
c_weight: topic_tags,
# The tsvector resulted from parsing a string can be double the size of
# the original string. Since there is no way to estimate the length of
# the expected tsvector, we limit the input to ~50% of the maximum
# length of a tsvector (1_048_576 bytes).
d_weight: HtmlScrubber.scrub(cooked)[0..600_000],
) { |params| params["private_message"] = private_message }
end
def self.update_users_index(user_id, username, name, custom_fields)
update_index(
table: "user",
id: user_id,
a_weight: username,
b_weight: name,
c_weight: custom_fields,
)
end
def self.update_categories_index(category_id, name)
update_index(table: "category", id: category_id, a_weight: name)
end
def self.update_tags_index(tag_id, name)
update_index(table: "tag", id: tag_id, a_weight: name.downcase)
end
def self.queue_category_posts_reindex(category_id)
return if @disabled
DB.exec(<<~SQL, category_id: category_id, version: REINDEX_VERSION)
UPDATE post_search_data
SET version = :version
FROM posts
INNER JOIN topics ON posts.topic_id = topics.id
INNER JOIN categories ON topics.category_id = categories.id
WHERE post_search_data.post_id = posts.id
AND categories.id = :category_id
SQL
end
def self.queue_users_reindex(user_ids)
return if @disabled
DB.exec(<<~SQL, user_ids: user_ids, version: REINDEX_VERSION)
UPDATE user_search_data
SET version = :version
WHERE user_search_data.user_id IN (:user_ids)
SQL
end
def self.queue_post_reindex(topic_id)
return if @disabled
DB.exec(<<~SQL, topic_id: topic_id, version: REINDEX_VERSION)
UPDATE post_search_data
SET version = :version
FROM posts
WHERE post_search_data.post_id = posts.id
AND posts.topic_id = :topic_id
SQL
end
def self.index(obj, force: false)
return if @disabled
# Check registered search handlers for this object type
handler = DiscoursePluginRegistry.search_handlers.find { |h| obj.is_a?(h[:model_class]) }
if handler
indexer_helper = IndexerHelper.new
search_weights = handler[:search_data].call(obj, indexer_helper)
update_index(table: handler[:table_name], id: obj.id, **search_weights)
return
end
category_name = nil
tag_names = nil
topic = nil
if Topic === obj
topic = obj
elsif Post === obj
topic = obj.topic
end
category_name = topic.category&.name if topic
if topic
tags = topic.tags.select(:id, :name).to_a
if tags.present?
tag_names =
(tags.map(&:name) + Tag.where(target_tag_id: tags.map(&:id)).pluck(:name)).join(" ")
end
end
if Post === obj && obj.raw.present? &&
(force || obj.saved_change_to_cooked? || obj.saved_change_to_topic_id?)
if topic
SearchIndexer.update_posts_index(
post_id: obj.id,
topic_title: topic.title,
category_name: category_name,
topic_tags: tag_names,
cooked: obj.cooked,
private_message: topic.private_message?,
)
SearchIndexer.update_topics_index(topic.id, topic.title, obj.cooked) if obj.is_first_post?
end
end
if User === obj && (obj.saved_change_to_username? || obj.saved_change_to_name? || force)
SearchIndexer.update_users_index(
obj.id,
obj.username_lower || "",
obj.name ? obj.name.downcase : "",
obj.user_custom_fields.searchable.map(&:value).join(" "),
)
end
if Topic === obj && (obj.saved_change_to_title? || force)
if obj.posts
if post = obj.posts.find_by(post_number: 1)
SearchIndexer.update_posts_index(
post_id: post.id,
topic_title: obj.title,
category_name: category_name,
topic_tags: tag_names,
cooked: post.cooked,
private_message: obj.private_message?,
)
SearchIndexer.update_topics_index(obj.id, obj.title, post.cooked)
end
end
end
if Category === obj && (obj.saved_change_to_name? || force)
SearchIndexer.queue_category_posts_reindex(obj.id)
SearchIndexer.update_categories_index(obj.id, obj.name)
end
if Tag === obj && (obj.saved_change_to_name? || force)
SearchIndexer.update_tags_index(obj.id, obj.name)
end
end
def self.clean_post_raw_data!(raw_data)
urls = Set.new
raw_data.scan(Discourse::Utils::URI_REGEXP) { urls << $& }
urls.each do |url|
begin
case File.extname(URI(url).path || "")
when Oneboxer::VIDEO_REGEX
raw_data.gsub!(url, I18n.t("search.video"))
when Oneboxer::AUDIO_REGEX
raw_data.gsub!(url, I18n.t("search.audio"))
end
rescue URI::InvalidURIError
end
end
raw_data
end
private_class_method :clean_post_raw_data!
class HtmlScrubber < Nokogiri::XML::SAX::Document
attr_reader :scrubbed
def initialize
@scrubbed = +""
end
def self.scrub(html)
return +"" if html.blank?
begin
document = Nokogiri.HTML5("<div>#{html}</div>", encoding: Encoding::UTF_8)
rescue ArgumentError
return +""
end
nodes = document.css("div.#{CookedPostProcessor::LIGHTBOX_WRAPPER_CSS_CLASS}")
if nodes.present?
nodes.each do |node|
node.traverse do |child_node|
next if child_node == node
if %w[a img].exclude?(child_node.name)
child_node.remove
elsif child_node.name == "a"
ATTRIBUTES.each { |attribute| child_node.remove_attribute(attribute) }
end
end
end
end
document.css("img.emoji").each { |node| node.remove_attribute("alt") }
document
.css("a[href]")
.each do |node|
if node["href"] == node.text || MENTION_CLASSES.include?(node["class"])
node.remove_attribute("href")
end
if node["class"] == "anchor" && node["href"].starts_with?("#")
node.remove_attribute("href")
end
end
html_scrubber = new
Nokogiri::HTML4::SAX::Parser.new(html_scrubber, Encoding::UTF_8).parse(document.to_html)
html_scrubber.scrubbed.squish
end
MENTION_CLASSES = %w[mention mention-group]
ATTRIBUTES = %w[alt title href data-video-title]
def start_element(_name, attributes = [])
attributes = Hash[*attributes.flatten]
ATTRIBUTES.each do |attribute_name|
if attributes[attribute_name].present? &&
!(attribute_name == "href" && UrlHelper.is_local(attributes[attribute_name]))
characters(attributes[attribute_name])
end
end
end
def characters(str)
scrubbed << " #{str} "
end
end
class IndexerHelper
def scrub_html(html)
HtmlScrubber.scrub(html)
end
end
end