mirror of
https://gh.wpcy.net/https://github.com/discourse/discourse.git
synced 2026-06-18 18:00:27 +08:00
Previously, 437ab337d2 added the normalized_referrer column to the
browser_pageview_events table. That column stores the cleaned referrer
value used by the admin Top Referrers report, but existing
browser_pageview_events rows were not backfilled. Older rows still have
NULL normalized_referrer values, so historical dates can show little or
no referrer data even though the original referrer was recorded.
This PR backfills normalized_referrer for existing
browser_pageview_events rows and adds a mechanism to repeat that work if
the normalization rules change later.
Key technical changes:
1. Add the normalized_referrer_version column to the
browser_pageview_events table, which stores raw browser pageview events.
The new column records which version of the referrer normalization rules
processed each row. This gives the backfill a clear stopping condition
and lets a future rules change reprocess older rows by bumping the
version.
2. Replace the existing AggregateBrowserPageviewDailyRollups scheduled
job, which rebuilt browser pageview country and referrer daily rollups,
with the new MaintainBrowserPageviewRollups scheduled job. The new job
still keeps those rollups current, and it also backfills older
normalized referrers from the same execution path so separate jobs do
not rebuild the same rollups at the same time.
3. Backfill browser_pageview_events rows in batches so large sites do
not need to update all historical pageview events in one run. A day’s
referrer rollup is only rebuilt once every stale referrer row for that
day has been processed, so the Top Referrers report does not show
partial counts between batches.
4. Skip dates that no longer have source rows in the
browser_pageview_events table. Browser pageview events can be pruned by
cleanup, but daily rollups are the permanent report data. Before
rebuilding a date, the backfill checks that browser_pageview_events
still has events for that date. If no events remain, the existing rollup
is left untouched because the job can no longer safely reconstruct it.
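
The batching, version stamp, and rebuild guards described above can be sketched in plain Ruby. This is only an in-memory illustration, not the job's actual code: SketchEvent, SKETCH_VERSION, SKETCH_BATCH_SIZE, sketch_normalize, rebuild_rollup, and backfill_batch are all hypothetical names, and the arrays and hashes stand in for the browser_pageview_events table and the daily referrer rollups.

```ruby
require "date"

SKETCH_VERSION = 1      # hypothetical current normalization version
SKETCH_BATCH_SIZE = 2   # hypothetical batch size, kept tiny for the example

SketchEvent = Struct.new(:date, :referrer, :normalized_referrer, :normalized_referrer_version)

# Placeholder rules only; the real ones live in BrowserPageviewReferrerInspector.
def sketch_normalize(raw)
  raw.to_s.downcase.sub(%r{\Ahttps?://}, "").sub(%r{/+\z}, "")
end

def stale?(event)
  version = event.normalized_referrer_version
  version.nil? || version < SKETCH_VERSION
end

# Rebuilds one day's rollup, unless the source rows are gone or still stale.
def rebuild_rollup(date, events, rollups)
  day_events = events.select { |e| e.date == date }
  return false if day_events.empty?                  # pruned: keep existing rollup
  return false if day_events.any? { |e| stale?(e) }  # unfinished: no partial counts
  rollups[date] = day_events.map(&:normalized_referrer).tally
  true
end

# Stamps one batch of stale rows, then tries to rebuild the dates it touched.
def backfill_batch(events, rollups)
  batch = events.select { |e| stale?(e) }.first(SKETCH_BATCH_SIZE)
  batch.each do |e|
    e.normalized_referrer = sketch_normalize(e.referrer)
    e.normalized_referrer_version = SKETCH_VERSION
  end
  batch.map(&:date).uniq.each { |date| rebuild_rollup(date, events, rollups) }
  batch.size
end

day = Date.new(2026, 6, 1)
pruned_day = Date.new(2026, 5, 1)
events = [
  SketchEvent.new(day, "https://Example.com/a/", nil, nil),
  SketchEvent.new(day, "http://example.com/a", nil, nil),
  SketchEvent.new(day, "https://example.com/b", nil, nil),
]
rollups = { pruned_day => { "old.example/x" => 4 } }

backfill_batch(events, rollups)              # 2 rows stamped, 1 still stale
rebuilt_after_first = rollups.key?(day)      # false: the day is not complete yet
backfill_batch(events, rollups)              # last row stamped, rollup rebuilt
rebuild_rollup(pruned_day, events, rollups)  # no source rows: rollup left as-is
```

Bumping SKETCH_VERSION would make every stamped row stale again, which is how a future rules change could trigger a fresh backfill without a separate flag.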
77 lines
2.3 KiB
Ruby
Vendored
# frozen_string_literal: true

# Normalizes referrer URLs captured by the browser pageview middleware so the
# same logical referrer groups consistently in the top-referrers report. It
# strips scheme, `www.`, port, fragment, trailing slashes, and common tracking
# query params, converts the host to lowercase punycode, and truncates the
# result to MAX_LENGTH bytes.
class BrowserPageviewReferrerInspector
  # Bump when the normalization logic changes significantly to trigger a
  # re-backfill of rows stamped with an older version.
  VERSION = 1

  # TODO: consider vendoring DuckDuckGo's Tracker Radar tracking-parameter list
  # (https://github.com/duckduckgo/tracker-radar) for broader, maintained
  # coverage instead of this hand-curated subset.
  TRACKING_PARAMS = %w[
    utm_source
    utm_medium
    utm_campaign
    utm_term
    utm_content
    fbclid
    gclid
    mc_cid
    mc_eid
    ref_src
    _hsenc
    _hsmi
  ].to_set.freeze

  MAX_LENGTH = 2000

  def self.normalize(raw)
    return nil if raw.blank?

    # Scheme is intentionally dropped: `http://example.com/x` and
    # `https://example.com/x` collapse to the same key so the report groups
    # cross-protocol traffic together.
    uri = Addressable::URI.parse(raw.to_s.strip)
    return nil if uri.nil?

    host = normalize_host(uri.host)
    return nil if host.blank?

    path = uri.path.to_s.sub(%r{/+\z}, "")
    filtered_query = filter_query(uri.query)
    query_str = filtered_query.empty? ? "" : "?#{filtered_query}"

    "#{host}#{path}#{query_str}".byteslice(0, MAX_LENGTH).scrub("")
  rescue Addressable::URI::InvalidURIError, ArgumentError, TypeError
    nil
  end

  def self.normalize_host(host)
    return nil if host.blank?
    normalized = Addressable::URI.parse("http://#{host}").normalized_host
    return nil if normalized.blank?
    normalized.delete_prefix("www.").delete_suffix(".")
  rescue Addressable::URI::InvalidURIError
    nil
  end

  # Filters the raw query string so original percent-encoding is preserved
  # (avoids %20/+ duplicate groupings for rows pointing at the same URL).
  def self.filter_query(query)
    return "" if query.blank?

    query
      .split("&")
      .reject do |pair|
        key = pair.split("=", 2).first.to_s
        TRACKING_PARAMS.include?(key)
      end
      .join("&")
  end
  private_class_method :filter_query
end
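
The query filtering step can be tried without Rails or Addressable. The following is a minimal standalone sketch, assuming a reduced parameter list: SAMPLE_TRACKING_PARAMS and strip_tracking_params are hypothetical names used only here, and the nil/empty check stands in for ActiveSupport's `blank?`.

```ruby
require "set"

# Illustrative subset of the tracking parameters listed in the class.
SAMPLE_TRACKING_PARAMS = %w[utm_source utm_medium fbclid gclid].to_set.freeze

# Drops tracking pairs from a raw query string without decoding it, so the
# surviving pairs keep their original percent-encoding.
def strip_tracking_params(query)
  return "" if query.nil? || query.empty?

  query
    .split("&")
    .reject { |pair| SAMPLE_TRACKING_PARAMS.include?(pair.split("=", 2).first.to_s) }
    .join("&")
end

kept = strip_tracking_params("q=a%20b&utm_source=news&page=2") # => "q=a%20b&page=2"
only_tracking = strip_tracking_params("fbclid=abc")            # => ""
upper_case = strip_tracking_params("UTM_SOURCE=news")          # => "UTM_SOURCE=news"
```

Two details are worth noting. Keys are compared exactly as written, so an upper-case `UTM_SOURCE` is not treated as a tracking parameter in this sketch. A query made up only of tracking parameters collapses to an empty string, which is why `normalize` omits the `?` in that case.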