mirror of
https://gh.wpcy.net/https://github.com/discourse/discourse.git
synced 2026-06-19 00:48:08 +08:00
Adds `Jobs::DetectCrawlerPageviews`, a scheduled job that runs every 10 minutes and assigns a crawler score to each anonymous browser pageview event in the last hour.

Six heuristics:

- Automation user agent (+50)
- Known crawler ASN (+35)
- Pageview velocity per ip+ua (+0 to +35)
- Session churn per ip+ua (+0 to +20)
- Rapid navigation, median gap under 2s (+15)
- Referrer discontinuity per ip+ua (+0 to +10)

Scoring is take-max idempotent: re-runs only update rows where the new score is higher than the existing one, which keeps write pressure on the events table minimal.

Logged-in events are deliberately not scored: crawler behaviour is fundamentally an anonymous-traffic problem, and a separate detector for authenticated traffic can land later if needed.

New site settings (both hidden, default off):

- `detect_crawler_pageviews`: gates the job
- `crawler_automation_user_agents`: regex of automation UA tokens

The `crawler_asns` default also expands from Baidu-only to a conservative list of dedicated crawler-operator ASNs (Baidu, Ahrefs, Babbar, Internet Archive, Yandex). Major search-engine ASNs (Google, Microsoft, Apple, Meta) are intentionally omitted because those networks also carry consumer traffic.

Benchmarked at 20k-1M events; initial scoring 250ms-17s, steady-state re-score 85ms-5.4s, comfortably under the 2-minute statement_timeout.
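The additive heuristics and the take-max rule described above can be sketched in plain Ruby. This is a rough illustration only, not the code from this commit: the method names (`crawler_score`, `apply_take_max!`), the hash-based score store, and the assumption that the points are simply summed without a cap are all hypothetical. Only the point values come from the message above.

```ruby
# Hypothetical sketch; names and structure are assumptions, not Discourse code.
# Point values are taken from the commit message.
FIXED_POINTS = {
  automation_user_agent: 50,
  known_crawler_asn: 35,
  rapid_navigation: 15, # median gap between pageviews under 2s
}.freeze

# Upper bounds for the graded heuristics; each contributes 0..max points.
GRADED_MAX = {
  pageview_velocity: 35,
  session_churn: 20,
  referrer_discontinuity: 10,
}.freeze

# signals maps a heuristic name to true/false (fixed heuristics) or to an
# Integer number of points (graded heuristics). Summing the parts is an
# assumption about how the heuristics combine.
def crawler_score(signals)
  fixed = FIXED_POINTS.sum { |name, points| signals[name] ? points : 0 }
  graded = GRADED_MAX.sum { |name, max| signals.fetch(name, 0).clamp(0, max) }
  fixed + graded
end

# Take-max update: write only when there is no score yet or the new score is
# strictly higher. Returns true when a write happened.
def apply_take_max!(scores, event_id, new_score)
  current = scores[event_id]
  return false unless current.nil? || new_score > current

  scores[event_id] = new_score
  true
end
```

Because an equal or lower score never triggers a write, re-running the same scoring pass over an unchanged window touches no rows, which is the idempotence property the message relies on.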
19 lines
545 B
Ruby
Vendored
# frozen_string_literal: true

RSpec.describe Jobs::DetectCrawlerPageviews do
  it "does nothing when detection is disabled" do
    SiteSetting.experimental_detect_crawler_pageviews = false
    CrawlerScorer.expects(:score!).never

    described_class.new.execute({})
  end

  it "scores the last hour of pageviews when enabled" do
    SiteSetting.experimental_detect_crawler_pageviews = true
    freeze_time

    CrawlerScorer.expects(:score!).with(window_start: 1.hour.ago, window_end: Time.now)

    described_class.new.execute({})
  end
end
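For orientation, the behaviour this spec pins down (skip when the setting is off, otherwise score a one-hour window ending now) might look roughly like the sketch below. It is not the job from the commit: `DetectCrawlerPageviewsSketch`, the `now:` keyword, and the stand-in `SiteSetting` and `CrawlerScorer` modules are hypothetical, and exist only so the example runs outside Discourse.

```ruby
# Stand-ins so the sketch runs without Discourse; both are hypothetical.
module SiteSetting
  class << self
    attr_accessor :experimental_detect_crawler_pageviews
  end
end

module CrawlerScorer
  @calls = []

  class << self
    attr_reader :calls

    # Records the requested window instead of scoring real events.
    def score!(window_start:, window_end:)
      @calls << { window_start: window_start, window_end: window_end }
    end
  end
end

# Hypothetical shape of the job: gated by the site setting, it scores the
# hour that ends at `now`.
class DetectCrawlerPageviewsSketch
  WINDOW_SECONDS = 3600

  def execute(_args, now: Time.now)
    return unless SiteSetting.experimental_detect_crawler_pageviews

    CrawlerScorer.score!(window_start: now - WINDOW_SECONDS, window_end: now)
  end
end
```

Passing `now:` explicitly plays the role that `freeze_time` plays in the spec: both window bounds are derived from a single timestamp, so the expected arguments can be compared exactly.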