2
0
Fork 0
mirror of https://github.com/discourse/discourse.git synced 2025-09-07 12:02:53 +08:00
discourse/spec/components
Régis Hanol 501b19b6e0
FIX: server-side HtmlToMarkdown improvements (#9586)
TLDR; this commit vastly improves how whitespaces are handled when converting from HTML to Markdown.
It also adds support for converting HTML <tables> to markdown tables.

The previous 'remove_whitespaces!' method was traversing the whole HTML tree and used a heuristic to remove
leading and trailing whitespaces whenever it was appropriate (ie. mostly before and after HTML block elements)

It was a good idea, but it was very limited and leaded to bad conversion when the html had leading whitespaces on several lines for example.
One such example can be found [here](https://meta.discourse.org/t/86782).

For various reasons, most of the whitespaces in a HTML file is ignored when the page is being displayed in a browser.
The rules that the browsers follow are the [CSS' White Space Processing Rules](https://www.w3.org/TR/css-text-3/#white-space-rules).
They can be quite complicated when you take into account RTL languages and other various tidbits but they boils down to the following:

- Collapse whitespaces down to one space (0x20) inside an inline context (ie. nodes/tags that are being displaying on the same line)
- Remove any leading/trailing whitespaces inside an inline context

One quick & dirty way of getting this 90% solved would be to do 'HTML.gsub!(/[[:space:]]+/, " ")'.
We would also need to hoist <pre> elements in order to not mess with their whitespaces.
Unfortunately, this solution let some whitespaces creep around HTML tags which leads to more '.strip!' calls than I can bear.

I decided to "emulate" the browser's handling of whitespaces and came up with a solution in 4 parts

1. remove_not_allowed!

The HtmlToMarkdown library is recursively "visiting" all the nodes in the HTML in order to convert them to Markdown.
All the nodes that aren't handled by the library (eg. <script>, <style> or any non-textual HTML tags) are "swallowed".
In order to reduce the number of nodes visited, the method 'remove_not_allowed!' will automatically delete all the nodes
that have no "visitor" (eg. a 'visit_<tag>' method) defined.

2. remove_hidden!

Similar purpose as the previous method (eg. reducing number of nodes visited), there's no point trying to convert something that is hidden.
The 'remove_hidden!' method removes any nodes that was hidden using the "hidden" HTML attribute, some CSS or with a width or height equal to 0.

3. hoist_line_breaks!

The 'hoist_line_breaks!' method is there to handle <br> tags. I know those tiny <br> don't do much but they can be quite annoying.
The <br> tags are inline elements but they visually work like a block element (ie. they create a new line).
If you have the following HTML "<i>Foo<br>Bar</i>", it ends up visually similar to "<i>Foo</i><br><i>Bar</i>".
The latter being much more easy to process than the former, so that's what this method is doing.
The "hoist_line_breaks" will hoist <br> tags out of inline tags until their parent is a block element.

4. remove_whitespaces!

The "remove_whitespaces!" is where all the whitespace removal is happening. It's broken down into 4 methods as well

- remove_whitespaces!
- is_inline?
- collapse_spaces!
- remove_trailing_space!

The 'remove_whitespace!' method is recursively walking the HTML tree (skipping <pre> tags).
If a node has any children, they will be chunked into groups of inline elements vs block elements.
For each chunks of inline elements, it will call the "collapse_space!" and "remove_trailing_space!" methods.
For each chunks of block elements, it will call "remote_whitespace!" to keep walking the HTML tree recursively.

The "is_inline?" method determines whether a node is part of a inline context.
A node is inline iif it's a text node or it's an inline tag, but not <br>, and all its children are also inline.

The "collapse_spaces!" method will collapse any kind of (white) space into a single space (" ") character, even accros tags.
For example, if we have "  Foo \n<i> Bar </i>\t42", it will return "Foo <i>Bar </i>42".

Finally, the "remove_trailing_space!" method is there to remove any trailing space that might creep in at the end of the inline chunk.

This solution is not 100% bullet-proof.
It does not support RTL languages at all and has some caveats that I felt were not worth the work to get properly fixed.

FIX: better detection of hidden elements when converting HTML to Markdown
FIX: take into account the 'allowed_href_schemes' site setting when converting HTML <a> to Markdown
FIX: added support for 'mailto:' scheme when converting <a> from HTML to Markdown
FIX: added support for <img> dimensions when converting from HTML to Markdown
FIX: added support for <dl>, <dd> and <dt> when converting from HTML to Markdown
FIX: added support for multilines emphases, strongs and strikes when converting from HTML to Markdown
FIX: added support for <acronym> when converting from HTML to Markdown
DEV: remove unused 'sanitize' gem

Wow, did you just read all that?! Congratz, here's a cookie: 🍪.
2020-04-30 12:21:25 +02:00
..
active_record/connection_adapters DEV: s/\$redis/Discourse\.redis (#8431) 2019-12-03 10:05:53 +01:00
auth FIX: update GitHub screen_name on login via GitHub 2020-04-23 20:54:26 +05:30
common_passwords DEV: Upgrading Discourse to Zeitwerk (#8098) 2019-10-02 14:01:53 +10:00
concern FIX: add category hashtags support for sub-sub categories. 2020-04-06 20:43:38 +05:30
email FIX: Add topic deleted check to email/sender (#9166) 2020-03-13 10:04:15 +10:00
file_store FIX: parallel spec system needs a dedicated upload folder for each worker. (#8547) 2019-12-18 11:21:57 +05:30
freedom_patches DEV: correct spec failures in PG 12 2019-11-26 16:39:14 +11:00
guardian FIX: Only show the review page to users that can see it. Do not publish the reviewable count update message to everyone. (#9556) 2020-04-27 14:51:25 -03:00
highlight_js DEV: Upgrading Discourse to Zeitwerk (#8098) 2019-10-02 14:01:53 +10:00
import DEV: Upgrading Discourse to Zeitwerk (#8098) 2019-10-02 14:01:53 +10:00
middleware DEV: Add test case for /srv/status probers (#9259) 2020-03-24 16:28:07 +11:00
migration DEV: Upgrading Discourse to Zeitwerk (#8098) 2019-10-02 14:01:53 +10:00
onebox/engine FIX: prevents whitelisted_generic_onebox_spec to fail with zeitwerk (#8288) 2019-11-04 09:15:09 +11:00
plugin DEV: update specs followup to 67e96f6 2020-04-23 16:11:17 -04:00
pretty_text SPEC: 'lookup_upload_urls' method should use cdn url if available. 2019-10-14 12:57:33 +05:30
rate_limiter DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
scheduler DEV: reduce logging when no external id is specified 2020-04-08 12:42:28 +10:00
site_settings UX: adds support for a color setting type (#9016) 2020-03-09 10:07:03 +01:00
stylesheet FEATURE: Ability to add components to all themes (#8404) 2019-11-28 16:19:01 +11:00
svg_sprite DEV: Correct references to theme flags 2020-03-13 16:45:55 +00:00
theme_store FEATURE: Allow themes to specify modifiers in their about.json file (#9097) 2020-03-11 13:30:45 +00:00
validators FEATURE: add setting auto_approve_email_domains to auto approve users (#9323) 2020-03-31 23:59:15 +05:30
wizard FIX: Default to light theme in wizard so that previews are displayed 2020-04-02 18:37:45 +01:00
admin_confirmation_spec.rb FEATURE: Add welcome message for admins. (#8293) 2019-11-05 18:15:55 +05:30
admin_user_index_query_spec.rb FEATURE: Approve suspect users is now true by default. The suspect users list was removed (#9151) 2020-03-10 08:56:42 -03:00
archetype_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
avatar_lookup_spec.rb DEV: Upgrading Discourse to Zeitwerk (#8098) 2019-10-02 14:01:53 +10:00
cache_spec.rb DEV: s/\$redis/Discourse\.redis (#8431) 2019-12-03 10:05:53 +01:00
category_badge_spec.rb FIX: Correctly escape category description text (#8107) 2019-10-01 12:04:39 -04:00
composer_messages_finder_spec.rb DEV: Improve flaky time-sensitive specs (#9141) 2020-03-10 22:13:17 +01:00
content_buffer_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
cooked_post_processor_spec.rb DEV: Skip erratic spec for now 2020-04-25 13:20:04 +10:00
crawler_detection_spec.rb FIX: Serve crawler view to Google PageSpeed 2019-11-27 22:15:34 +01:00
current_user_spec.rb DEV: Upgrading Discourse to Zeitwerk (#8098) 2019-10-02 14:01:53 +10:00
directory_helper_spec.rb DEV: Upgrading Discourse to Zeitwerk (#8098) 2019-10-02 14:01:53 +10:00
discourse_diff_spec.rb Remove focus from specs 2019-10-16 14:28:04 -04:00
discourse_event_spec.rb DEV: Upgrading Discourse to Zeitwerk (#8098) 2019-10-02 14:01:53 +10:00
discourse_hub_spec.rb DEV: Upgrading Discourse to Zeitwerk (#8098) 2019-10-02 14:01:53 +10:00
discourse_plugin_registry_spec.rb DEV: correct regression in registry test suite 2019-08-22 16:22:52 +10:00
discourse_plugin_spec.rb DEV: Upgrading Discourse to Zeitwerk (#8098) 2019-10-02 14:01:53 +10:00
discourse_redis_spec.rb Revert "FIX: Redis fallback handler refactoring (#8771)" (#8776) 2020-01-24 09:20:17 +11:00
discourse_spec.rb DEV: Introduce plugin api for conditionally rendering assets (#9200) 2020-03-13 15:30:31 +00:00
discourse_tagging_spec.rb DEV: stop freezing frozen strings 2020-04-30 16:48:53 +10:00
discourse_updates_spec.rb DEV: Improve flaky time-sensitive specs (#9141) 2020-03-10 22:13:17 +01:00
distributed_memoizer_spec.rb DEV: s/\$redis/Discourse\.redis (#8431) 2019-12-03 10:05:53 +01:00
distributed_mutex_spec.rb DEV: Improve flaky time-sensitive specs (#9141) 2020-03-10 22:13:17 +01:00
email_cook_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
email_updater_spec.rb FIX: When admin changes staff email still enforce old email confirm (#9007) 2020-02-20 13:42:57 +10:00
enum_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
excerpt_parser_spec.rb DEV: Add option to keep quoted content in post excerpt. 2020-01-04 18:56:52 +05:30
feed_element_installer_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
feed_item_accessor_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
file_helper_spec.rb DEV: properly clean up temp files in FileHelper spec 2019-05-28 11:33:08 +10:00
filter_best_posts_spec.rb DEV: Prefabrication (test optimization) (#7414) 2019-05-07 13:12:20 +10:00
final_destination_spec.rb FIX: do not follow redirect on same host with path /login or /session 2019-08-07 16:26:55 +05:30
flag_settings_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
gaps_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
global_path_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
guardian_spec.rb FEATURE: Support for publishing topics as pages (#9364) 2020-04-08 12:52:36 -04:00
has_errors_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
hijack_spec.rb FEATURE: Stricter rules for user presence 2020-03-26 17:36:52 +11:00
html_prettify_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
html_to_markdown_spec.rb FIX: server-side HtmlToMarkdown improvements (#9586) 2020-04-30 12:21:25 +02:00
image_sizer_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
inline_oneboxer_spec.rb FIX: Make inline oneboxes work with secured topics in secured contexts (#8895) 2020-02-12 12:11:28 +02:00
js_locale_helper_spec.rb DEV: Add spec to find MF locale for en_US 2020-01-16 14:40:53 +01:00
json_error_spec.rb DEV: Upgrading Discourse to Zeitwerk (#8098) 2019-10-02 14:01:53 +10:00
letter_avatar_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
method_profiler_spec.rb DEV: Upgrading Discourse to Zeitwerk (#8098) 2019-10-02 14:01:53 +10:00
new_post_manager_spec.rb enqueue spam/dmarc failing emails instead of hiding (#8674) 2020-01-21 11:12:00 -05:00
new_post_result_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
oneboxer_spec.rb DEV: do not persist force_custom_user_agent_hosts setting 2020-02-06 11:56:54 -05:00
onpdiff_spec.rb Remove focus from specs 2019-10-16 14:28:04 -04:00
pbkdf2_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
pinned_check_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
plain_text_to_markdown_spec.rb FIX: use URI.regexp to find URLs in plain text 2019-06-07 01:26:06 +02:00
post_action_creator_spec.rb DEV: Improve flaky time-sensitive specs (#9141) 2020-03-10 22:13:17 +01:00
post_creator_spec.rb FIX: use the new duration attribute in set_or_create_timer method. 2020-03-19 21:45:05 +05:30
post_destroyer_spec.rb FIX: Recovered posts with no user will be taken over by system user (#8834) 2020-02-06 10:19:04 +02:00
post_locker_spec.rb DEV: Upgrading Discourse to Zeitwerk (#8098) 2019-10-02 14:01:53 +10:00
post_merger_spec.rb DEV: Prefabrication (test optimization) (#7414) 2019-05-07 13:12:20 +10:00
post_revisor_spec.rb DEV: Improve flaky time-sensitive specs (#9141) 2020-03-10 22:13:17 +01:00
pretty_text_spec.rb DEV: Improve video onebox stripping spec 2020-02-20 11:45:12 -05:00
promotion_spec.rb DEV: Prefabrication (test optimization) (#7414) 2019-05-07 13:12:20 +10:00
quote_comparer_spec.rb DEV: Prefabrication (test optimization) (#7414) 2019-05-07 13:12:20 +10:00
rate_limiter_spec.rb DEV: s/\$redis/Discourse\.redis (#8431) 2019-12-03 10:05:53 +01:00
redis_store_spec.rb DEV: Implement a faster Discourse.cache 2019-11-27 16:11:49 +11:00
retrieve_title_spec.rb DEV: Upgrading Discourse to Zeitwerk (#8098) 2019-10-02 14:01:53 +10:00
rtl_spec.rb DEV: Upgrading Discourse to Zeitwerk (#8098) 2019-10-02 14:01:53 +10:00
s3_helper_spec.rb FIX: Update S3 stubs for more aws-sdk API changes (#8534) 2019-12-11 11:26:52 -08:00
s3_inventory_spec.rb FIX: Use updated_at in the S3 inventory job (#8823) 2020-01-31 11:02:44 +01:00
score_calculator_spec.rb DEV: Prefabrication (test optimization) (#7414) 2019-05-07 13:12:20 +10:00
search_spec.rb FIX: add support for sub-sub category slugs in search 2020-03-20 15:36:50 +11:00
secure_session_spec.rb DEV: correct implementation of expiry api 2019-11-11 11:18:12 +11:00
site_icon_manager_spec.rb DEV: enable frozen string literal on all files 2019-05-13 09:31:32 +08:00
site_setting_extension_spec.rb DEV: use Discourse.cache over Rails.cache 2019-11-27 12:36:19 +11:00
slug_spec.rb FIX: If a prettified slug is a number, return defaultt (#8554) 2019-12-17 10:34:20 +10:00
spam_handler_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
suggested_topics_builder_spec.rb DEV: Default to skipping creating a topic when fabricating categories (#7976) 2019-08-06 11:26:54 +01:00
system_message_spec.rb DEV: Refactor SystemMessage#create specs. 2019-05-30 07:56:36 +08:00
text_cleaner_spec.rb FEATURE: English locale with international date formats 2019-05-20 13:47:20 +02:00
text_sentinel_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
theme_settings_manager_spec.rb DEV: Prefabrication (test optimization) (#7414) 2019-05-07 13:12:20 +10:00
theme_settings_parser_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
timeline_lookup_spec.rb DEV: Upgrading Discourse to Zeitwerk (#8098) 2019-10-02 14:01:53 +10:00
topic_creator_spec.rb DEV: Improve flaky time-sensitive specs (#9141) 2020-03-10 22:13:17 +01:00
topic_publisher_spec.rb DEV: Improve flaky time-sensitive specs (#9141) 2020-03-10 22:13:17 +01:00
topic_query_spec.rb DEV: Improve flaky time-sensitive specs (#9141) 2020-03-10 22:13:17 +01:00
topic_retriever_spec.rb FIX: An opts hash was not, in fact, optional :) 2020-04-20 14:17:13 -04:00
topic_view_spec.rb UX: Do not use avatars as fallback opengraph images for replies (#8605) 2019-12-20 13:17:14 +00:00
topics_bulk_action_spec.rb FIX: Unread topics not clearing when whisper is last post (#8271) 2019-11-01 09:19:43 +10:00
trashable_spec.rb DEV: Upgrading Discourse to Zeitwerk (#8098) 2019-10-02 14:01:53 +10:00
trust_level_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
unread_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00
url_helper_spec.rb FIX: Stop encoding presigned URLs with UrlHelper (#8818) 2020-01-31 09:09:34 +10:00
user_name_suggester_spec.rb FIX: Respect unicode whitelist when suggesting username 2019-10-01 20:33:09 +02:00
version_spec.rb DEV: use #frozen_string_literal: true on all spec 2019-04-30 10:27:42 +10:00