discourse/db/post_migrate/20260422135650_strip_upload_label_escapes.rb
Régis Hanol 49474b0188
FIX: Prevent backslash accumulation in upload markdown labels (#39461)
#39133 backslash-escaped markdown characters in upload filenames at
generation time. That worked for freshly uploaded files but exposed a
latent bug in the rich-editor round-trip: markdown-it kept `\_` in the
parsed image token's content, ProseMirror stored it verbatim in the
`alt` attribute, and on save the serializer re-escaped each `\` to `\\`.
Every edit doubled the backslashes (1 → 2 → 4 → … → 2^N) until the post
exceeded `max_post_length` and became uneditable.

Escaping the raw is fundamentally fragile — nothing treats the stored
raw as canonical, so any parse/serialize cycle either drops the escape
or re-applies it. Fix it at parse time instead:

- Revert the generation-side escaping from #39133 in `UploadMarkdown`,
  `uploads.js`, `inline_uploads.rb`, `to-markdown.js` and `sanitizeAlt`.
- Add a `literalize_upload_labels` core ruler in the markdown-it engine.
  After inline parsing runs, it walks `image` / `link_open` tokens whose
  URL starts with `upload://` and collapses their children into a single
  literal text token, rebuilt from the children `content` plus the
  `markup` of emphasis/strong/strikethrough delimiters. So `_foo_`,
  `**foo**`, `~~foo~~`, `` `foo` ``, `\_foo`, linkified URLs, hashtags
and mentions inside upload labels all render literally. Reference-style
  links (`[label][ref]` with `[ref]: upload://…`) get the same treatment
  for free since they go through the same tokens.

The raw now stays canonical: the filename goes in verbatim, cooks the
same way on every pass, and the textarea and rich editor round-trip
identically.

Because escaping is gone, the structural characters `[`, `]` and `|`
(which would break the link/image syntax and can't be escaped without
reintroducing the doubling) are stripped from labels at every generation
point — `UploadMarkdown`, the HTML-anchor and hotlinked-image
conversions in `inline_uploads.rb`, `uploads.js` and `to-markdown.js`.

The multi-token scan-forward in `renderAttachment` (engine.js) and in
ProseMirror's `link.js` parser is kept: it still matters for non-upload
attachment links like
`[**bold**|attachment](https://example.com/x.pdf)`,
where the label legitimately contains inline formatting the new ruler
doesn't touch.

`StripUploadLabelEscapes` (post-deploy migration) heals posts already
damaged by the regression, batching a scoped `regexp_replace` across the
`posts` table. The lookahead `(?=[^\]\[]*\]\(upload://)` bounds each
match to an upload label — forbidding `[`/`]` between the escape and the
closing `](upload://` keeps user-written `\_` escapes elsewhere in the
raw intact. Scoping on the lookahead alone, rather than anchoring on the
opening `[`, lets a single pass strip every escape in a label (e.g.
`foo\_bar\_baz`), not just the first.

https://meta.discourse.org/t/401231
2026-06-03 18:10:38 +02:00

49 lines
1.7 KiB
Ruby
Vendored

# frozen_string_literal: true
# One-off cleanup for posts whose upload markdown labels accumulated backslash
# escapes through repeated rich-editor edits (regression from
# https://meta.discourse.org/t/401231). The backslashes doubled on each save
# (1 → 2 → 4 → … → 2^N). The engine-side fix prevents further damage; this
# heals existing posts.
#
# Batched + non-transactional so we never hold update locks across the whole
# posts table at once on large sites.
#
# Pattern: strip runs of `\` that escape `_ * ~ | `` inside a label closing
# with `](upload://…`. `\\+` swallows the whole run; the `(?=…)` lookahead
# scopes the match to upload labels — forbidding `[`/`]` in the gap means an
# escape only matches while it sits inside the label that ends in
# `](upload://`, so user-written escapes elsewhere stay intact. Scoping on the
# lookahead alone (rather than consuming the opening `[`) lets a single pass
# strip every escape in a label, e.g. `foo\_bar\_baz`, not just the first.
class StripUploadLabelEscapes < ActiveRecord::Migration[8.0]
disable_ddl_transaction!
BATCH_SIZE = 10_000
def up
min_id, max_id = DB.query_single("SELECT MIN(id), MAX(id) FROM posts")
return if max_id.nil?
current_id = min_id
while current_id <= max_id
DB.exec(<<~SQL, start_id: current_id, end_id: current_id + BATCH_SIZE)
UPDATE posts
SET raw = regexp_replace(
raw,
'\\\\+([_*~|`])(?=[^\\]\\[]*\\]\\(upload://)',
'\\1',
'g'
)
WHERE id >= :start_id
AND id < :end_id
AND raw ~ '\\\\+[_*~|`][^\\]\\[]*\\]\\(upload://'
SQL
current_id += BATCH_SIZE
end
end
def down
raise ActiveRecord::IrreversibleMigration
end
end