discourse

mirror of https://github.com/discourse/discourse.git synced 2025-09-07 12:02:53 +08:00

History

Régis Hanol 501b19b6e0 FIX: server-side HtmlToMarkdown improvements (#9586 )	2020-04-30 12:21:25 +02:00

TLDR; this commit vastly improves how whitespaces are handled when converting from HTML to Markdown. It also adds support for converting HTML <tables> to markdown tables.

The previous 'remove_whitespaces!' method was traversing the whole HTML tree and used a heuristic to remove leading and trailing whitespaces whenever it was appropriate (ie. mostly before and after HTML block elements). It was a good idea, but it was very limited and led to bad conversions when, for example, the HTML had leading whitespaces on several lines. One such example can be found [here](https://meta.discourse.org/t/86782).

For various reasons, most of the whitespace in an HTML file is ignored when the page is being displayed in a browser. The rules that the browsers follow are the [CSS' White Space Processing Rules](https://www.w3.org/TR/css-text-3/#white-space-rules). They can be quite complicated when you take into account RTL languages and other various tidbits, but they boil down to the following:

- Collapse whitespaces down to one space (0x20) inside an inline context (ie. nodes/tags that are being displayed on the same line)
- Remove any leading/trailing whitespaces inside an inline context

One quick & dirty way of getting this 90% solved would be to do 'HTML.gsub!(/[[:space:]]+/, " ")'. We would also need to hoist <pre> elements in order to not mess with their whitespaces. Unfortunately, this solution lets some whitespaces creep around HTML tags, which leads to more '.strip!' calls than I can bear.

I decided to "emulate" the browser's handling of whitespaces and came up with a solution in 4 parts:

1. remove_not_allowed!

The HtmlToMarkdown library is recursively "visiting" all the nodes in the HTML in order to convert them to Markdown. All the nodes that aren't handled by the library (eg. <script>, <style> or any non-textual HTML tags) are "swallowed". In order to reduce the number of nodes visited, the 'remove_not_allowed!' method will automatically delete all the nodes that have no "visitor" (ie. a 'visit_<tag>' method) defined.

2. remove_hidden!

Similar purpose as the previous method (ie. reducing the number of nodes visited): there's no point trying to convert something that is hidden. The 'remove_hidden!' method removes any node that was hidden using the "hidden" HTML attribute, some CSS, or a width or height equal to 0.

3. hoist_line_breaks!

The 'hoist_line_breaks!' method is there to handle <br> tags. I know those tiny <br> don't do much but they can be quite annoying. The <br> tags are inline elements but they visually work like a block element (ie. they create a new line). If you have the following HTML "<i>Foo<br>Bar</i>", it ends up visually similar to "<i>Foo</i><br><i>Bar</i>". The latter is much easier to process than the former, so that's what this method is doing. The 'hoist_line_breaks!' method will hoist <br> tags out of inline tags until their parent is a block element.

4. remove_whitespaces!

The 'remove_whitespaces!' method is where all the whitespace removal is happening. It's broken down into 4 methods as well:

- remove_whitespaces!
- is_inline?
- collapse_spaces!
- remove_trailing_space!

The 'remove_whitespaces!' method is recursively walking the HTML tree (skipping <pre> tags). If a node has any children, they will be chunked into groups of inline elements vs block elements. For each chunk of inline elements, it will call the 'collapse_spaces!' and 'remove_trailing_space!' methods. For each chunk of block elements, it will call 'remove_whitespaces!' to keep walking the HTML tree recursively.

The 'is_inline?' method determines whether a node is part of an inline context. A node is inline if and only if it's a text node or an inline tag (but not <br>) and all its children are also inline.

The 'collapse_spaces!' method will collapse any kind of (white) space into a single space (" ") character, even across tags. For example, if we have " Foo \n<i> Bar </i>\t42", it will return "Foo <i>Bar </i>42".

Finally, the 'remove_trailing_space!' method is there to remove any trailing space that might creep in at the end of the inline chunk.

This solution is not 100% bullet-proof. It does not support RTL languages at all and has some caveats that I felt were not worth the work to get properly fixed.

FIX: better detection of hidden elements when converting HTML to Markdown
FIX: take into account the 'allowed_href_schemes' site setting when converting HTML <a> to Markdown
FIX: added support for 'mailto:' scheme when converting <a> from HTML to Markdown
FIX: added support for <img> dimensions when converting from HTML to Markdown
FIX: added support for <dl>, <dd> and <dt> when converting from HTML to Markdown
FIX: added support for multiline emphases, strongs and strikes when converting from HTML to Markdown
FIX: added support for <acronym> when converting from HTML to Markdown
DEV: remove unused 'sanitize' gem

Wow, did you just read all that?! Congratz, here's a cookie: 🍪.
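
As a rough illustration of the whitespace collapsing described in the commit message, here is a minimal sketch in plain Ruby. It is not the code from this commit: the real methods walk a parsed HTML tree, whereas this hypothetical `collapse_inline_whitespace` helper works on a flat string that is assumed to hold a single inline chunk (no block elements, no <pre>). It treats tags as transparent, collapses every whitespace run to one space even across tags, and drops the leading and trailing space of the chunk.

```ruby
# Hypothetical helper for illustration only; not part of Discourse.
# Assumes `html` is one inline chunk (no block elements, no <pre>).
def collapse_inline_whitespace(html)
  # Split into tags and text; the capture group keeps the tags in the result.
  tokens = html.split(/(<[^>]+>)/).reject(&:empty?)

  # Pretend a space precedes the chunk so that leading spaces get dropped.
  prev_ends_with_space = true
  last_text = nil

  tokens.each do |token|
    next if token.start_with?("<") # tags are transparent for spacing

    # Collapse any run of whitespace (spaces, tabs, newlines) to one space.
    token.gsub!(/[[:space:]]+/, " ")
    # Drop the leading space if the previous text already ended with one.
    token.sub!(/\A /, "") if prev_ends_with_space
    next if token.empty?

    prev_ends_with_space = token.end_with?(" ")
    last_text = token
  end

  # Remove the trailing space at the very end of the inline chunk.
  last_text&.sub!(/ \z/, "")
  tokens.join
end

puts collapse_inline_whitespace(" Foo \n<i> Bar </i>\t42")
# prints: Foo <i>Bar </i>42
```

The output matches the example given in the commit message. Tracking whether the previous text fragment ended with a space is what allows the collapsing to work across tag boundaries, which a plain per-node `strip` could not do.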
..
active_record/connection_adapters	DEV: s/\$redis/Discourse\.redis (#8431 )	2019-12-03 10:05:53 +01:00
auth	FIX: update GitHub screen_name on login via GitHub	2020-04-23 20:54:26 +05:30
common_passwords	DEV: Upgrading Discourse to Zeitwerk (#8098 )	2019-10-02 14:01:53 +10:00
concern	FIX: add category hashtags support for sub-sub categories.	2020-04-06 20:43:38 +05:30
email	FIX: Add topic deleted check to email/sender (#9166 )	2020-03-13 10:04:15 +10:00
file_store	FIX: parallel spec system needs a dedicated upload folder for each worker. (#8547 )	2019-12-18 11:21:57 +05:30
freedom_patches	DEV: correct spec failures in PG 12	2019-11-26 16:39:14 +11:00
guardian	FIX: Only show the review page to users that can see it. Do not publish the reviewable count update message to everyone. (#9556 )	2020-04-27 14:51:25 -03:00
highlight_js	DEV: Upgrading Discourse to Zeitwerk (#8098 )	2019-10-02 14:01:53 +10:00
import	DEV: Upgrading Discourse to Zeitwerk (#8098 )	2019-10-02 14:01:53 +10:00
middleware	DEV: Add test case for /srv/status probers (#9259 )	2020-03-24 16:28:07 +11:00
migration	DEV: Upgrading Discourse to Zeitwerk (#8098 )	2019-10-02 14:01:53 +10:00
onebox/engine	FIX: prevents whitelisted_generic_onebox_spec to fail with zeitwerk (#8288 )	2019-11-04 09:15:09 +11:00
plugin	DEV: update specs followup to `67e96f6`	2020-04-23 16:11:17 -04:00
pretty_text	SPEC: 'lookup_upload_urls' method should use cdn url if available.	2019-10-14 12:57:33 +05:30
rate_limiter	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
scheduler	DEV: reduce logging when no external id is specified	2020-04-08 12:42:28 +10:00
site_settings	UX: adds support for a color setting type (#9016 )	2020-03-09 10:07:03 +01:00
stylesheet	FEATURE: Ability to add components to all themes (#8404 )	2019-11-28 16:19:01 +11:00
svg_sprite	DEV: Correct references to theme flags	2020-03-13 16:45:55 +00:00
theme_store	FEATURE: Allow themes to specify modifiers in their about.json file (#9097 )	2020-03-11 13:30:45 +00:00
validators	FEATURE: add setting `auto_approve_email_domains` to auto approve users (#9323 )	2020-03-31 23:59:15 +05:30
wizard	FIX: Default to light theme in wizard so that previews are displayed	2020-04-02 18:37:45 +01:00
admin_confirmation_spec.rb	FEATURE: Add welcome message for admins. (#8293 )	2019-11-05 18:15:55 +05:30
admin_user_index_query_spec.rb	FEATURE: Approve suspect users is now true by default. The suspect users list was removed (#9151 )	2020-03-10 08:56:42 -03:00
archetype_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
avatar_lookup_spec.rb	DEV: Upgrading Discourse to Zeitwerk (#8098 )	2019-10-02 14:01:53 +10:00
cache_spec.rb	DEV: s/\$redis/Discourse\.redis (#8431 )	2019-12-03 10:05:53 +01:00
category_badge_spec.rb	FIX: Correctly escape category description text (#8107 )	2019-10-01 12:04:39 -04:00
composer_messages_finder_spec.rb	DEV: Improve flaky time-sensitive specs (#9141 )	2020-03-10 22:13:17 +01:00
content_buffer_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
cooked_post_processor_spec.rb	DEV: Skip erratic spec for now	2020-04-25 13:20:04 +10:00
crawler_detection_spec.rb	FIX: Serve crawler view to Google PageSpeed	2019-11-27 22:15:34 +01:00
current_user_spec.rb	DEV: Upgrading Discourse to Zeitwerk (#8098 )	2019-10-02 14:01:53 +10:00
directory_helper_spec.rb	DEV: Upgrading Discourse to Zeitwerk (#8098 )	2019-10-02 14:01:53 +10:00
discourse_diff_spec.rb	Remove focus from specs	2019-10-16 14:28:04 -04:00
discourse_event_spec.rb	DEV: Upgrading Discourse to Zeitwerk (#8098 )	2019-10-02 14:01:53 +10:00
discourse_hub_spec.rb	DEV: Upgrading Discourse to Zeitwerk (#8098 )	2019-10-02 14:01:53 +10:00
discourse_plugin_registry_spec.rb	DEV: correct regression in registry test suite	2019-08-22 16:22:52 +10:00
discourse_plugin_spec.rb	DEV: Upgrading Discourse to Zeitwerk (#8098 )	2019-10-02 14:01:53 +10:00
discourse_redis_spec.rb	Revert "FIX: Redis fallback handler refactoring (#8771 )" (#8776 )	2020-01-24 09:20:17 +11:00
discourse_spec.rb	DEV: Introduce plugin api for conditionally rendering assets (#9200 )	2020-03-13 15:30:31 +00:00
discourse_tagging_spec.rb	DEV: stop freezing frozen strings	2020-04-30 16:48:53 +10:00
discourse_updates_spec.rb	DEV: Improve flaky time-sensitive specs (#9141 )	2020-03-10 22:13:17 +01:00
distributed_memoizer_spec.rb	DEV: s/\$redis/Discourse\.redis (#8431 )	2019-12-03 10:05:53 +01:00
distributed_mutex_spec.rb	DEV: Improve flaky time-sensitive specs (#9141 )	2020-03-10 22:13:17 +01:00
email_cook_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
email_updater_spec.rb	FIX: When admin changes staff email still enforce old email confirm (#9007 )	2020-02-20 13:42:57 +10:00
enum_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
excerpt_parser_spec.rb	DEV: Add option to keep quoted content in post excerpt.	2020-01-04 18:56:52 +05:30
feed_element_installer_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
feed_item_accessor_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
file_helper_spec.rb	DEV: properly clean up temp files in FileHelper spec	2019-05-28 11:33:08 +10:00
filter_best_posts_spec.rb	DEV: Prefabrication (test optimization) (#7414 )	2019-05-07 13:12:20 +10:00
final_destination_spec.rb	FIX: do not follow redirect on same host with path /login or /session	2019-08-07 16:26:55 +05:30
flag_settings_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
gaps_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
global_path_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
guardian_spec.rb	FEATURE: Support for publishing topics as pages (#9364 )	2020-04-08 12:52:36 -04:00
has_errors_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
hijack_spec.rb	FEATURE: Stricter rules for user presence	2020-03-26 17:36:52 +11:00
html_prettify_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
html_to_markdown_spec.rb	FIX: server-side HtmlToMarkdown improvements (#9586 )	2020-04-30 12:21:25 +02:00
image_sizer_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
inline_oneboxer_spec.rb	FIX: Make inline oneboxes work with secured topics in secured contexts (#8895 )	2020-02-12 12:11:28 +02:00
js_locale_helper_spec.rb	DEV: Add spec to find MF locale for en_US	2020-01-16 14:40:53 +01:00
json_error_spec.rb	DEV: Upgrading Discourse to Zeitwerk (#8098 )	2019-10-02 14:01:53 +10:00
letter_avatar_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
method_profiler_spec.rb	DEV: Upgrading Discourse to Zeitwerk (#8098 )	2019-10-02 14:01:53 +10:00
new_post_manager_spec.rb	enqueue spam/dmarc failing emails instead of hiding (#8674 )	2020-01-21 11:12:00 -05:00
new_post_result_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
oneboxer_spec.rb	DEV: do not persist force_custom_user_agent_hosts setting	2020-02-06 11:56:54 -05:00
onpdiff_spec.rb	Remove focus from specs	2019-10-16 14:28:04 -04:00
pbkdf2_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
pinned_check_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
plain_text_to_markdown_spec.rb	FIX: use URI.regexp to find URLs in plain text	2019-06-07 01:26:06 +02:00
post_action_creator_spec.rb	DEV: Improve flaky time-sensitive specs (#9141 )	2020-03-10 22:13:17 +01:00
post_creator_spec.rb	FIX: use the new duration attribute in `set_or_create_timer` method.	2020-03-19 21:45:05 +05:30
post_destroyer_spec.rb	FIX: Recovered posts with no user will be taken over by system user (#8834 )	2020-02-06 10:19:04 +02:00
post_locker_spec.rb	DEV: Upgrading Discourse to Zeitwerk (#8098 )	2019-10-02 14:01:53 +10:00
post_merger_spec.rb	DEV: Prefabrication (test optimization) (#7414 )	2019-05-07 13:12:20 +10:00
post_revisor_spec.rb	DEV: Improve flaky time-sensitive specs (#9141 )	2020-03-10 22:13:17 +01:00
pretty_text_spec.rb	DEV: Improve video onebox stripping spec	2020-02-20 11:45:12 -05:00
promotion_spec.rb	DEV: Prefabrication (test optimization) (#7414 )	2019-05-07 13:12:20 +10:00
quote_comparer_spec.rb	DEV: Prefabrication (test optimization) (#7414 )	2019-05-07 13:12:20 +10:00
rate_limiter_spec.rb	DEV: s/\$redis/Discourse\.redis (#8431 )	2019-12-03 10:05:53 +01:00
redis_store_spec.rb	DEV: Implement a faster Discourse.cache	2019-11-27 16:11:49 +11:00
retrieve_title_spec.rb	DEV: Upgrading Discourse to Zeitwerk (#8098 )	2019-10-02 14:01:53 +10:00
rtl_spec.rb	DEV: Upgrading Discourse to Zeitwerk (#8098 )	2019-10-02 14:01:53 +10:00
s3_helper_spec.rb	FIX: Update S3 stubs for more aws-sdk API changes (#8534 )	2019-12-11 11:26:52 -08:00
s3_inventory_spec.rb	FIX: Use updated_at in the S3 inventory job (#8823 )	2020-01-31 11:02:44 +01:00
score_calculator_spec.rb	DEV: Prefabrication (test optimization) (#7414 )	2019-05-07 13:12:20 +10:00
search_spec.rb	FIX: add support for sub-sub category slugs in search	2020-03-20 15:36:50 +11:00
secure_session_spec.rb	DEV: correct implementation of expiry api	2019-11-11 11:18:12 +11:00
site_icon_manager_spec.rb	DEV: enable frozen string literal on all files	2019-05-13 09:31:32 +08:00
site_setting_extension_spec.rb	DEV: use Discourse.cache over Rails.cache	2019-11-27 12:36:19 +11:00
slug_spec.rb	FIX: If a prettified slug is a number, return defaultt (#8554 )	2019-12-17 10:34:20 +10:00
spam_handler_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
suggested_topics_builder_spec.rb	DEV: Default to skipping creating a topic when fabricating categories (#7976 )	2019-08-06 11:26:54 +01:00
system_message_spec.rb	DEV: Refactor `SystemMessage#create` specs.	2019-05-30 07:56:36 +08:00
text_cleaner_spec.rb	FEATURE: English locale with international date formats	2019-05-20 13:47:20 +02:00
text_sentinel_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
theme_settings_manager_spec.rb	DEV: Prefabrication (test optimization) (#7414 )	2019-05-07 13:12:20 +10:00
theme_settings_parser_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
timeline_lookup_spec.rb	DEV: Upgrading Discourse to Zeitwerk (#8098 )	2019-10-02 14:01:53 +10:00
topic_creator_spec.rb	DEV: Improve flaky time-sensitive specs (#9141 )	2020-03-10 22:13:17 +01:00
topic_publisher_spec.rb	DEV: Improve flaky time-sensitive specs (#9141 )	2020-03-10 22:13:17 +01:00
topic_query_spec.rb	DEV: Improve flaky time-sensitive specs (#9141 )	2020-03-10 22:13:17 +01:00
topic_retriever_spec.rb	FIX: An `opts` hash was not, in fact, optional :)	2020-04-20 14:17:13 -04:00
topic_view_spec.rb	UX: Do not use avatars as fallback opengraph images for replies (#8605 )	2019-12-20 13:17:14 +00:00
topics_bulk_action_spec.rb	FIX: Unread topics not clearing when whisper is last post (#8271 )	2019-11-01 09:19:43 +10:00
trashable_spec.rb	DEV: Upgrading Discourse to Zeitwerk (#8098 )	2019-10-02 14:01:53 +10:00
trust_level_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
unread_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
url_helper_spec.rb	FIX: Stop encoding presigned URLs with UrlHelper (#8818 )	2020-01-31 09:09:34 +10:00
user_name_suggester_spec.rb	FIX: Respect unicode whitelist when suggesting username	2019-10-01 20:33:09 +02:00
version_spec.rb	DEV: use #frozen_string_literal: true on all spec	2019-04-30 10:27:42 +10:00
