discourse/app/jobs/scheduled/detect_crawler_pageviews.rb
Krzysztof Kotlarek 9b2c7beadf
DEV: Add anonymous pageview crawler scoring job (#39954)
Adds `Jobs::DetectCrawlerPageviews`, a scheduled job that runs every 10
minutes and assigns a crawler score to each anonymous browser pageview
event in the last hour. Six heuristics contribute to the score:

- Automation user agent (+50)
- Known crawler ASN (+35)
- Pageview velocity per ip+ua (+0 to +35)
- Session churn per ip+ua (+0 to +20)
- Rapid navigation — median gap under 2s (+15)
- Referrer discontinuity per ip+ua (+0 to +10)
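
A minimal sketch of how the six weights above could combine into one
score. Only the point values and the 2s median threshold come from the
list; the method name, its keyword arguments, the linear scaling of the
graded heuristics, and the absence of an overall cap are all assumptions
made for illustration, not the actual `CrawlerScorer` logic.

```ruby
# Hypothetical illustration only; not the CrawlerScorer implementation.
# Maximum points per heuristic, taken from the list above.
WEIGHTS = {
  automation_ua: 50,
  crawler_asn: 35,
  velocity: 35,
  session_churn: 20,
  rapid_navigation: 15,
  referrer_discontinuity: 10,
}.freeze

# Median of a non-empty array of numbers.
def median(values)
  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Graded inputs (velocity, session_churn, referrer_discontinuity) are
# assumed to be pre-normalised to 0.0..1.0 and are scaled linearly.
# gap_seconds is the list of gaps between consecutive pageviews for one
# ip+ua pair.
def crawler_score(automation_ua:, crawler_asn:, velocity:, session_churn:,
                  gap_seconds:, referrer_discontinuity:)
  score = 0
  score += WEIGHTS[:automation_ua] if automation_ua
  score += WEIGHTS[:crawler_asn] if crawler_asn
  score += (WEIGHTS[:velocity] * velocity.clamp(0.0, 1.0)).round
  score += (WEIGHTS[:session_churn] * session_churn.clamp(0.0, 1.0)).round
  # Rapid navigation is all-or-nothing: median gap under 2 seconds.
  score += WEIGHTS[:rapid_navigation] if !gap_seconds.empty? && median(gap_seconds) < 2
  score += (WEIGHTS[:referrer_discontinuity] * referrer_discontinuity.clamp(0.0, 1.0)).round
  score
end
```

Under these assumptions the two boolean signals dominate: an automation
user agent on a known crawler ASN already scores 85 before any
behavioural heuristic fires.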

Scoring is take-max idempotent: re-runs only update rows where the new
score is higher than the existing one, which keeps write pressure on the
events table minimal. Logged-in events are deliberately not scored —
crawler behaviour is fundamentally an anonymous-traffic problem, and a
separate detector for authenticated traffic can land later if needed.
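
The take-max rule can be sketched in plain Ruby as follows. The helper
name and the in-memory hashes are hypothetical stand-ins for the events
table, and treating an unscored row as `nil` is an assumption; the real
job presumably performs the comparison in SQL.

```ruby
# Hypothetical illustration of take-max idempotent scoring.
# existing: { event_id => stored score, or nil if never scored }
# computed: { event_id => freshly computed score }
# Returns only the rows that actually need a write.
def take_max_updates(existing, computed)
  computed.select do |id, score|
    current = existing[id]
    current.nil? || score > current
  end
end

existing = { 1 => 40, 2 => nil, 3 => 80 }
computed = { 1 => 55, 2 => 10, 3 => 60 }

first_run = take_max_updates(existing, computed)
existing.merge!(first_run)

# Re-running with the same computed scores writes nothing.
second_run = take_max_updates(existing, computed)
```

Here the first run writes rows 1 and 2 and leaves row 3 at its higher
stored score; the second run produces no writes, which is the property
that keeps repeated 10-minute runs over an overlapping one-hour window
cheap.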

New site settings (both hidden, default off):

- `detect_crawler_pageviews` — gates the job
- `crawler_automation_user_agents` — regex of automation UA tokens

The `crawler_asns` default also expands from Baidu-only to a
conservative list of dedicated crawler-operator ASNs (Baidu, Ahrefs,
Babbar, Internet Archive, Yandex). Major search-engine ASNs (Google,
Microsoft, Apple, Meta) are intentionally omitted because those networks
also carry consumer traffic.

Benchmarked at 20k to 1M events: initial scoring took 250ms to 17s and
steady-state re-scoring took 85ms to 5.4s, comfortably under the
2-minute statement_timeout.
2026-05-14 12:46:08 +08:00


# frozen_string_literal: true

module Jobs
  class DetectCrawlerPageviews < ::Jobs::Scheduled
    every 10.minutes

    LOOKBACK = 1.hour

    def execute(args)
      return if !SiteSetting.experimental_detect_crawler_pageviews

      now = Time.now
      CrawlerScorer.score!(window_start: now - LOOKBACK, window_end: now)
    end
  end
end