mirror of
https://github.com/discourse/discourse.git
synced 2025-09-07 12:02:53 +08:00
TLDR; this commit vastly improves how whitespaces are handled when converting from HTML to Markdown. It also adds support for converting HTML <tables> to markdown tables. The previous 'remove_whitespaces!' method was traversing the whole HTML tree and used a heuristic to remove leading and trailing whitespaces whenever it was appropriate (ie. mostly before and after HTML block elements) It was a good idea, but it was very limited and leaded to bad conversion when the html had leading whitespaces on several lines for example. One such example can be found [here](https://meta.discourse.org/t/86782). For various reasons, most of the whitespaces in a HTML file is ignored when the page is being displayed in a browser. The rules that the browsers follow are the [CSS' White Space Processing Rules](https://www.w3.org/TR/css-text-3/#white-space-rules). They can be quite complicated when you take into account RTL languages and other various tidbits but they boils down to the following: - Collapse whitespaces down to one space (0x20) inside an inline context (ie. nodes/tags that are being displaying on the same line) - Remove any leading/trailing whitespaces inside an inline context One quick & dirty way of getting this 90% solved would be to do 'HTML.gsub!(/[[:space:]]+/, " ")'. We would also need to hoist <pre> elements in order to not mess with their whitespaces. Unfortunately, this solution let some whitespaces creep around HTML tags which leads to more '.strip!' calls than I can bear. I decided to "emulate" the browser's handling of whitespaces and came up with a solution in 4 parts 1. remove_not_allowed! The HtmlToMarkdown library is recursively "visiting" all the nodes in the HTML in order to convert them to Markdown. All the nodes that aren't handled by the library (eg. <script>, <style> or any non-textual HTML tags) are "swallowed". In order to reduce the number of nodes visited, the method 'remove_not_allowed!' will automatically delete all the nodes that have no "visitor" (eg. a 'visit_<tag>' method) defined. 2. remove_hidden! Similar purpose as the previous method (eg. reducing number of nodes visited), there's no point trying to convert something that is hidden. The 'remove_hidden!' method removes any nodes that was hidden using the "hidden" HTML attribute, some CSS or with a width or height equal to 0. 3. hoist_line_breaks! The 'hoist_line_breaks!' method is there to handle <br> tags. I know those tiny <br> don't do much but they can be quite annoying. The <br> tags are inline elements but they visually work like a block element (ie. they create a new line). If you have the following HTML "<i>Foo<br>Bar</i>", it ends up visually similar to "<i>Foo</i><br><i>Bar</i>". The latter being much more easy to process than the former, so that's what this method is doing. The "hoist_line_breaks" will hoist <br> tags out of inline tags until their parent is a block element. 4. remove_whitespaces! The "remove_whitespaces!" is where all the whitespace removal is happening. It's broken down into 4 methods as well - remove_whitespaces! - is_inline? - collapse_spaces! - remove_trailing_space! The 'remove_whitespace!' method is recursively walking the HTML tree (skipping <pre> tags). If a node has any children, they will be chunked into groups of inline elements vs block elements. For each chunks of inline elements, it will call the "collapse_space!" and "remove_trailing_space!" methods. For each chunks of block elements, it will call "remote_whitespace!" to keep walking the HTML tree recursively. The "is_inline?" method determines whether a node is part of a inline context. A node is inline iif it's a text node or it's an inline tag, but not <br>, and all its children are also inline. The "collapse_spaces!" method will collapse any kind of (white) space into a single space (" ") character, even accros tags. For example, if we have " Foo \n<i> Bar </i>\t42", it will return "Foo <i>Bar </i>42". Finally, the "remove_trailing_space!" method is there to remove any trailing space that might creep in at the end of the inline chunk. This solution is not 100% bullet-proof. It does not support RTL languages at all and has some caveats that I felt were not worth the work to get properly fixed. FIX: better detection of hidden elements when converting HTML to Markdown FIX: take into account the 'allowed_href_schemes' site setting when converting HTML <a> to Markdown FIX: added support for 'mailto:' scheme when converting <a> from HTML to Markdown FIX: added support for <img> dimensions when converting from HTML to Markdown FIX: added support for <dl>, <dd> and <dt> when converting from HTML to Markdown FIX: added support for multilines emphases, strongs and strikes when converting from HTML to Markdown FIX: added support for <acronym> when converting from HTML to Markdown DEV: remove unused 'sanitize' gem Wow, did you just read all that?! Congratz, here's a cookie: 🍪. |
||
---|---|---|
.. | ||
active_record/connection_adapters | ||
auth | ||
common_passwords | ||
concern | ||
file_store | ||
freedom_patches | ||
guardian | ||
highlight_js | ||
import | ||
middleware | ||
migration | ||
onebox/engine | ||
plugin | ||
pretty_text | ||
rate_limiter | ||
scheduler | ||
site_settings | ||
stylesheet | ||
svg_sprite | ||
theme_store | ||
validators | ||
wizard | ||
admin_confirmation_spec.rb | ||
admin_user_index_query_spec.rb | ||
archetype_spec.rb | ||
avatar_lookup_spec.rb | ||
cache_spec.rb | ||
category_badge_spec.rb | ||
composer_messages_finder_spec.rb | ||
content_buffer_spec.rb | ||
cooked_post_processor_spec.rb | ||
crawler_detection_spec.rb | ||
current_user_spec.rb | ||
directory_helper_spec.rb | ||
discourse_diff_spec.rb | ||
discourse_event_spec.rb | ||
discourse_hub_spec.rb | ||
discourse_plugin_registry_spec.rb | ||
discourse_plugin_spec.rb | ||
discourse_redis_spec.rb | ||
discourse_spec.rb | ||
discourse_tagging_spec.rb | ||
discourse_updates_spec.rb | ||
distributed_memoizer_spec.rb | ||
distributed_mutex_spec.rb | ||
email_cook_spec.rb | ||
email_updater_spec.rb | ||
enum_spec.rb | ||
excerpt_parser_spec.rb | ||
feed_element_installer_spec.rb | ||
feed_item_accessor_spec.rb | ||
file_helper_spec.rb | ||
filter_best_posts_spec.rb | ||
final_destination_spec.rb | ||
flag_settings_spec.rb | ||
gaps_spec.rb | ||
global_path_spec.rb | ||
guardian_spec.rb | ||
has_errors_spec.rb | ||
hijack_spec.rb | ||
html_prettify_spec.rb | ||
html_to_markdown_spec.rb | ||
image_sizer_spec.rb | ||
inline_oneboxer_spec.rb | ||
js_locale_helper_spec.rb | ||
json_error_spec.rb | ||
letter_avatar_spec.rb | ||
method_profiler_spec.rb | ||
new_post_manager_spec.rb | ||
new_post_result_spec.rb | ||
oneboxer_spec.rb | ||
onpdiff_spec.rb | ||
pbkdf2_spec.rb | ||
pinned_check_spec.rb | ||
plain_text_to_markdown_spec.rb | ||
post_action_creator_spec.rb | ||
post_creator_spec.rb | ||
post_destroyer_spec.rb | ||
post_locker_spec.rb | ||
post_merger_spec.rb | ||
post_revisor_spec.rb | ||
pretty_text_spec.rb | ||
promotion_spec.rb | ||
quote_comparer_spec.rb | ||
rate_limiter_spec.rb | ||
redis_store_spec.rb | ||
retrieve_title_spec.rb | ||
rtl_spec.rb | ||
s3_helper_spec.rb | ||
s3_inventory_spec.rb | ||
score_calculator_spec.rb | ||
search_spec.rb | ||
secure_session_spec.rb | ||
site_icon_manager_spec.rb | ||
site_setting_extension_spec.rb | ||
slug_spec.rb | ||
spam_handler_spec.rb | ||
suggested_topics_builder_spec.rb | ||
system_message_spec.rb | ||
text_cleaner_spec.rb | ||
text_sentinel_spec.rb | ||
theme_settings_manager_spec.rb | ||
theme_settings_parser_spec.rb | ||
timeline_lookup_spec.rb | ||
topic_creator_spec.rb | ||
topic_publisher_spec.rb | ||
topic_query_spec.rb | ||
topic_retriever_spec.rb | ||
topic_view_spec.rb | ||
topics_bulk_action_spec.rb | ||
trashable_spec.rb | ||
trust_level_spec.rb | ||
unread_spec.rb | ||
url_helper_spec.rb | ||
user_name_suggester_spec.rb | ||
version_spec.rb |