discourse/lib/compression/zip.rb
Sam fa54f62348
FEATURE: extract text from document uploads for LLM prompts (#39634)
Document attachments (doc, docx, xls, xlsx, rtf, csv, md, txt) are now
converted to text before being included in LLM prompts, instead of
being forwarded as raw base64 payloads. PDFs remain the only format
sent as a raw upload, capped at 10MB.

New converters under lib/completions:

- DocToText shells out to antiword
- DocxToText parses OOXML directly with size and depth limits
- XlsToText shells out to xls2csv
- XlsxToText parses OOXML and shared strings into CSV-style text
- RtfToText is a custom RTF tokenizer with destination/group handling

Plain text formats (csv, md, txt) are read with a 1MB byte cap and
UTF-8 normalization. Extracted text is truncated to 100k characters,
with a preamble noting the original filename and size.
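
The plain-text path described above could look roughly like the sketch below. Only the 1MB byte cap, the UTF-8 normalization, and the 100k-character truncation come from this message; the module name `PlainTextSketch`, the method `extract_plain_text`, the constant names, and the preamble wording are hypothetical and are not taken from the actual converters.

```ruby
require "stringio"

# Hypothetical sketch: names and preamble format are assumptions,
# not the real implementation.
module PlainTextSketch
  MAX_BYTES = 1_048_576 # 1MB cap on bytes read from the upload
  MAX_CHARS = 100_000   # cap on characters kept after normalization

  module_function

  # Reads at most MAX_BYTES from io, coerces the bytes to valid UTF-8,
  # truncates to MAX_CHARS characters, and prepends a short preamble.
  def extract_plain_text(io, filename:, size:)
    raw = io.read(MAX_BYTES).to_s.dup # read returns nil on empty input
    # Drop invalid byte sequences, including a multibyte character
    # that the byte cap may have cut in half.
    text = raw.force_encoding(Encoding::UTF_8).scrub("")
    text = text[0, MAX_CHARS]
    "File: #{filename} (#{size} bytes)\n\n#{text}"
  end
end
```

Reading with a byte cap before normalizing keeps memory bounded even when the upload is far larger than the text that ends up in the prompt.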

Dialect trimming now uses token-aware truncation against a per-message
budget so large extracted documents collapse cleanly under the prompt
limit, rather than the previous step-based slicing of raw content.

Other changes:

- LlmModel.normalize_attachment_types is shared with UploadEncoder and
  collapses "markdown" to "md" so the canonical extension is consistent
  across model config, UI defaults, and encoder output
- ai-llm-attachment-types adds csv, xls, xlsx to the default choices
- Locale strings clarify that vision controls images and
  allowed_attachment_types controls documents

---------

Co-authored-by: Rafael Silva <xfalcox@gmail.com>
2026-05-05 08:16:23 +10:00

97 lines
2.6 KiB
Ruby

# frozen_string_literal: true

require "zip"
require "compression/safe_zip_reader"

module Compression
  class Zip < Strategy
    # Upper bound on archive entries, passed to SafeZipReader on extraction.
    MAX_ZIP_ENTRIES = 10_000

    def extension
      ".zip"
    end

    # Zips the file or directory at path/target_name and returns the
    # path of the created .zip file.
    def compress(path, target_name)
      absolute_path = sanitize_path("#{path}/#{target_name}")
      zip_filename = "#{absolute_path}.zip"

      ::Zip::File.open(zip_filename, create: true) do |zipfile|
        if File.directory?(absolute_path)
          entries = Dir.entries(absolute_path) - %w[. ..]
          write_entries(entries, absolute_path, "", zipfile)
        else
          put_into_archive(absolute_path, zipfile, target_name)
        end
      end

      zip_filename
    end

    private

    def extract_folder(entry, entry_path)
      FileUtils.mkdir_p(entry_path)
    end

    # Runs SafeZipReader validation (with the entry cap) before yielding
    # the opened archive to the caller.
    def get_compressed_file_stream(compressed_file_path)
      zip_file = ::Zip::File.open(compressed_file_path)
      Compression::SafeZipReader.new(zip_file, max_entries: MAX_ZIP_ENTRIES).validate!
      yield(zip_file)
    end

    def build_entry_path(dest_path, entry, _)
      File.join(dest_path, entry.name)
    end

    def decompression_results_path(dest_path, _)
      dest_path
    end

    # Streams one entry to disk in chunks and returns the size budget left.
    # Each read is charged a full chunk_size, so the accounting is
    # conservative; ExtractFailed is raised once the budget goes negative.
    def extract_file(entry, entry_path, available_size)
      remaining_size = available_size

      if ::File.exist?(entry_path)
        raise DestinationFileExistsError, "Destination '#{entry_path}' already exists"
      end

      ::File.open(entry_path, "wb") do |os|
        entry.get_input_stream do |is|
          entry.set_extra_attributes_on_path(entry_path)

          buf = "".dup
          while (buf = is.sysread(chunk_size, buf))
            remaining_size -= chunk_size
            raise ExtractFailed if remaining_size.negative?
            os << buf
          end
        end
      end

      remaining_size
    end

    # A helper method to make the recursion work.
    def write_entries(entries, base_path, path, zipfile)
      entries.each do |e|
        zipfile_path = path == "" ? e : File.join(path, e)
        disk_file_path = File.join(base_path, zipfile_path)

        if File.directory? disk_file_path
          recursively_deflate_directory(disk_file_path, zipfile, base_path, zipfile_path)
        else
          put_into_archive(disk_file_path, zipfile, zipfile_path)
        end
      end
    end

    def recursively_deflate_directory(disk_file_path, zipfile, base_path, zipfile_path)
      zipfile.mkdir zipfile_path
      subdir = Dir.entries(disk_file_path) - %w[. ..]
      write_entries subdir, base_path, zipfile_path, zipfile
    end

    def put_into_archive(disk_file_path, zipfile, zipfile_path)
      zipfile.add(zipfile_path, disk_file_path)
    end
  end
end
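
The budget accounting in `extract_file` can be tried in isolation with in-memory streams. The sketch below is a standalone illustration, not part of this file: `copy_with_budget` and `BudgetExceeded` are hypothetical names standing in for the extraction loop and `ExtractFailed`, and `IO#read` is used in place of the zip entry's `sysread`. Like the original loop, it charges a full `chunk_size` per read rather than the number of bytes actually read.

```ruby
require "stringio"

# Hypothetical stand-in for ExtractFailed.
class BudgetExceeded < StandardError; end

# Copies input to output in chunks, charging chunk_size against the
# budget for every read. Returns the budget left over, or raises
# BudgetExceeded as soon as the budget goes negative.
def copy_with_budget(input, output, budget, chunk_size: 4)
  remaining = budget

  # IO#read(n) returns nil at end of input, which ends the loop.
  while (buf = input.read(chunk_size))
    remaining -= chunk_size
    raise BudgetExceeded if remaining.negative?
    output << buf
  end

  remaining
end
```

With 4-byte chunks, a 10-byte input takes three reads and is charged 12 bytes, so a budget of 12 succeeds with nothing left over while a budget of 11 raises on the final, partial chunk. That rounding up is what makes the accounting conservative.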