mirror of
https://gh.wpcy.net/https://github.com/discourse/discourse.git
synced 2026-06-19 02:05:37 +08:00
Adds `Jobs::DetectCrawlerPageviews`, a scheduled job that runs every 10 minutes and assigns a crawler score to each anonymous browser pageview event in the last hour.

Six heuristics:
- Automation user agent (+50)
- Known crawler ASN (+35)
- Pageview velocity per ip+ua (+0 to +35)
- Session churn per ip+ua (+0 to +20)
- Rapid navigation, median gap under 2s (+15)
- Referrer discontinuity per ip+ua (+0 to +10)

Scoring is take-max idempotent: re-runs only update rows where the new score is higher than the existing one, which keeps write pressure on the events table minimal.

Logged-in events are deliberately not scored: crawler behaviour is fundamentally an anonymous-traffic problem, and a separate detector for authenticated traffic can land later if needed.

New site settings (both hidden, default off):
- `detect_crawler_pageviews`: gates the job
- `crawler_automation_user_agents`: regex of automation UA tokens

The `crawler_asns` default also expands from Baidu-only to a conservative list of dedicated crawler-operator ASNs (Baidu, Ahrefs, Babbar, Internet Archive, Yandex). Major search-engine ASNs (Google, Microsoft, Apple, Meta) are intentionally omitted because those networks also carry consumer traffic.

Benchmarked at 20k-1M events; initial scoring 250ms-17s, steady-state re-score 85ms-5.4s, comfortably under the 2-minute statement_timeout.
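The heuristic weights and the take-max rule from the commit message can be illustrated in plain Ruby. This is only a sketch under assumptions: the constants and the `crawler_score` and `take_max` helpers are hypothetical names, the graded heuristics are passed in as already-computed points, and simple summation is assumed because the commit message does not say how the sub-scores are combined. The real logic lives in `CrawlerScorer`, which is not shown on this page.

```ruby
# Hypothetical sketch; names and the summation rule are assumptions,
# not the actual CrawlerScorer implementation.

# Heuristics that contribute a fixed number of points when they fire.
FIXED_POINTS = {
  automation_user_agent: 50,
  known_crawler_asn: 35,
  rapid_navigation: 15, # median gap under 2s
}.freeze

# Upper bounds for the graded heuristics, taken from the commit message.
GRADED_CAPS = {
  pageview_velocity: 35,
  session_churn: 20,
  referrer_discontinuity: 10,
}.freeze

# flags:  array of fixed heuristic names that fired
# graded: hash of graded heuristic name => points already computed
def crawler_score(flags: [], graded: {})
  fixed = flags.sum { |name| FIXED_POINTS.fetch(name) }
  scaled = graded.sum { |name, points| points.clamp(0, GRADED_CAPS.fetch(name)) }
  fixed + scaled
end

# Take-max update: returns [stored_score, write_needed].
# A write is only needed when there is no score yet or the new one is higher.
def take_max(existing, candidate)
  return [existing, false] if !existing.nil? && candidate <= existing
  [candidate, true]
end
```

With this shape, re-scoring an event that already holds an equal or higher score produces no write, which is the property the commit message relies on to keep write pressure low.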
16 lines
342 B
Ruby
Vendored
# frozen_string_literal: true

module Jobs
  class DetectCrawlerPageviews < ::Jobs::Scheduled
    every 10.minutes

    LOOKBACK = 1.hour

    def execute(args)
      return if !SiteSetting.experimental_detect_crawler_pageviews

      now = Time.now
      CrawlerScorer.score!(window_start: now - LOOKBACK, window_end: now)
    end
  end
end
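Because the job runs every 10 minutes with a one-hour lookback, consecutive windows overlap and each event is normally visited by several runs. The following plain-Ruby sketch counts those visits. It does not use the Discourse job framework; `window_for` and `runs_covering` are hypothetical helpers, and treating both window bounds as inclusive is an assumption, since `CrawlerScorer` is not shown here.

```ruby
# Hypothetical sketch of the window overlap; not part of the job itself.
LOOKBACK_SECONDS = 60 * 60 # mirrors LOOKBACK = 1.hour
INTERVAL_SECONDS = 10 * 60 # mirrors every 10.minutes

# Window a run starting at `now` would score: [window_start, window_end].
def window_for(now)
  [now - LOOKBACK_SECONDS, now]
end

# Number of runs, out of `count` consecutive runs beginning at `first_run`,
# whose window contains `event_time` (bounds assumed inclusive).
def runs_covering(event_time, first_run, count)
  (0...count).count do |i|
    window_start, window_end = window_for(first_run + i * INTERVAL_SECONDS)
    event_time >= window_start && event_time <= window_end
  end
end

first_run = Time.utc(2026, 1, 1, 0, 0, 0)
event_time = first_run - 60 # one minute before the first run
covering = runs_covering(event_time, first_run, 12)
```

Under these assumptions an event is rescored by about six runs before it ages out of the window, which is why it matters that a re-run leaves an unchanged score untouched.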