mirror of
https://gh.wpcy.net/https://github.com/discourse/discourse.git
synced 2026-06-19 01:44:11 +08:00
#39133 backslash-escaped markdown characters in upload filenames at generation time. That worked for freshly uploaded files but exposed a latent bug in the rich-editor round-trip: markdown-it kept `\_` in the parsed image token's content, ProseMirror stored it verbatim in the `alt` attribute, and on save the serializer re-escaped each `\` to `\\`. Every edit doubled the backslashes (1 → 2 → 4 → … → 2^N) until the post exceeded `max_post_length` and became uneditable. Escaping the raw is fundamentally fragile — nothing treats the stored raw as canonical, so any parse/serialize cycle either drops the escape or re-applies it. Fix it at parse time instead: - Revert the generation-side escaping from #39133 in `UploadMarkdown`, `uploads.js`, `inline_uploads.rb`, `to-markdown.js` and `sanitizeAlt`. - Add a `literalize_upload_labels` core ruler in the markdown-it engine. After inline parsing runs, it walks `image` / `link_open` tokens whose URL starts with `upload://` and collapses their children into a single literal text token, rebuilt from the children `content` plus the `markup` of emphasis/strong/strikethrough delimiters. So `_foo_`, `**foo**`, `~~foo~~`, `` `foo` ``, `\_foo`, linkified URLs, hashtags and mentions inside upload labels all render literally. Reference-style links (`[label][ref]` with `[ref]: upload://…`) get the same treatment for free since they go through the same tokens. The raw now stays canonical: the filename goes in verbatim, cooks the same way on every pass, and the textarea and rich editor round-trip identically. Because escaping is gone, the structural characters `[`, `]` and `|` (which would break the link/image syntax and can't be escaped without reintroducing the doubling) are stripped from labels at every generation point — `UploadMarkdown`, the HTML-anchor and hotlinked-image conversions in `inline_uploads.rb`, `uploads.js` and `to-markdown.js`. The multi-token scan-forward in `renderAttachment` (engine.js) and in ProseMirror's `link.js` parser is kept: it still matters for non-upload attachment links like `[**bold**|attachment](https://example.com/x.pdf)`, where the label legitimately contains inline formatting the new ruler doesn't touch. `StripUploadLabelEscapes` (post-deploy migration) heals posts already damaged by the regression, batching a scoped `regexp_replace` across the `posts` table. The lookahead `(?=[^\]\[]*\]\(upload://)` bounds each match to an upload label — forbidding `[`/`]` between the escape and the closing `](upload://` keeps user-written `\_` escapes elsewhere in the raw intact. Scoping on the lookahead alone, rather than anchoring on the opening `[`, lets a single pass strip every escape in a label (e.g. `foo\_bar\_baz`), not just the first. https://meta.discourse.org/t/401231
49 lines
1.7 KiB
Ruby
Vendored
49 lines
1.7 KiB
Ruby
Vendored
# frozen_string_literal: true
|
|
|
|
# One-off cleanup for posts whose upload markdown labels accumulated backslash
|
|
# escapes through repeated rich-editor edits (regression from
|
|
# https://meta.discourse.org/t/401231). The backslashes doubled on each save
|
|
# (1 → 2 → 4 → … → 2^N). The engine-side fix prevents further damage; this
|
|
# heals existing posts.
|
|
#
|
|
# Batched + non-transactional so we never hold update locks across the whole
|
|
# posts table at once on large sites.
|
|
#
|
|
# Pattern: strip runs of `\` that escape `_ * ~ | `` inside a label closing
|
|
# with `](upload://…`. `\\+` swallows the whole run; the `(?=…)` lookahead
|
|
# scopes the match to upload labels — forbidding `[`/`]` in the gap means an
|
|
# escape only matches while it sits inside the label that ends in
|
|
# `](upload://`, so user-written escapes elsewhere stay intact. Scoping on the
|
|
# lookahead alone (rather than consuming the opening `[`) lets a single pass
|
|
# strip every escape in a label, e.g. `foo\_bar\_baz`, not just the first.
|
|
class StripUploadLabelEscapes < ActiveRecord::Migration[8.0]
|
|
disable_ddl_transaction!
|
|
|
|
BATCH_SIZE = 10_000
|
|
|
|
def up
|
|
min_id, max_id = DB.query_single("SELECT MIN(id), MAX(id) FROM posts")
|
|
return if max_id.nil?
|
|
|
|
current_id = min_id
|
|
while current_id <= max_id
|
|
DB.exec(<<~SQL, start_id: current_id, end_id: current_id + BATCH_SIZE)
|
|
UPDATE posts
|
|
SET raw = regexp_replace(
|
|
raw,
|
|
'\\\\+([_*~|`])(?=[^\\]\\[]*\\]\\(upload://)',
|
|
'\\1',
|
|
'g'
|
|
)
|
|
WHERE id >= :start_id
|
|
AND id < :end_id
|
|
AND raw ~ '\\\\+[_*~|`][^\\]\\[]*\\]\\(upload://'
|
|
SQL
|
|
current_id += BATCH_SIZE
|
|
end
|
|
end
|
|
|
|
def down
|
|
raise ActiveRecord::IrreversibleMigration
|
|
end
|
|
end
|