discourse/spec/jobs/detect_crawler_pageviews_spec.rb
Krzysztof Kotlarek 9b2c7beadf
DEV: Add anonymous pageview crawler scoring job (#39954)
Adds `Jobs::DetectCrawlerPageviews`, a scheduled job that runs every 10
minutes and assigns a crawler score to each anonymous browser pageview
event in the last hour. Six heuristics:

- Automation user agent (+50)
- Known crawler ASN (+35)
- Pageview velocity per ip+ua (+0 to +35)
- Session churn per ip+ua (+0 to +20)
- Rapid navigation — median gap under 2s (+15)
- Referrer discontinuity per ip+ua (+0 to +10)
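
As a rough illustration of how the point values above might combine, the sketch below assumes the heuristics are additive (the `+N` notation suggests this, but the message does not say so) and implements only the rapid-navigation rule concretely. The method names, constants, and input shapes are hypothetical and are not taken from the job itself.

```ruby
# Hypothetical constants mirroring the rapid-navigation bullet above.
RAPID_NAVIGATION_POINTS = 15
RAPID_NAVIGATION_MEDIAN_GAP = 2.0 # seconds

# Median of a numeric array; returns nil for an empty array.
def median(values)
  return nil if values.empty?

  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Points for the rapid-navigation heuristic.
# `timestamps` are pageview times in seconds for one ip+ua pair (assumed shape).
def rapid_navigation_points(timestamps)
  gaps = timestamps.sort.each_cons(2).map { |earlier, later| later - earlier }
  gap = median(gaps)
  (gap && gap < RAPID_NAVIGATION_MEDIAN_GAP) ? RAPID_NAVIGATION_POINTS : 0
end

# Assumption: the final score is the plain sum of per-heuristic points.
# `components` maps a heuristic name to the points it contributed.
def crawler_score(components)
  components.values.sum
end
```

A single pageview produces no gaps, so the heuristic contributes nothing in that case rather than raising.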

Scoring is take-max idempotent: re-runs only update rows where the new
score is higher than the existing one, which keeps write pressure on the
events table minimal. Logged-in events are deliberately not scored —
crawler behaviour is fundamentally an anonymous-traffic problem, and a
separate detector for authenticated traffic can land later if needed.
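
The take-max rule can be pictured with an in-memory stand-in for the events table. This is only a sketch of the idea: the real job presumably does this in SQL, and the hash-based storage and method names here are assumptions made for illustration.

```ruby
# existing: Hash of event id => stored score (nil when not yet scored)
# computed: Hash of event id => freshly computed score
# Returns the ids whose stored score would actually change.
def rows_to_update(existing, computed)
  computed.select { |id, score| existing[id].nil? || score > existing[id] }.keys
end

# Applies the take-max rule in place and returns the ids that were written.
def apply_take_max!(existing, computed)
  ids = rows_to_update(existing, computed)
  ids.each { |id| existing[id] = computed[id] }
  ids
end
```

Running `apply_take_max!` twice with the same computed scores writes nothing on the second pass, which is the property that keeps re-runs cheap.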

New site settings (both hidden, default off):

- `detect_crawler_pageviews` — gates the job
- `crawler_automation_user_agents` — regex of automation UA tokens
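
For a sense of how a regex-valued setting like `crawler_automation_user_agents` could be applied, here is a minimal sketch. The token list is a made-up example, not the setting's actual value, and the helper name is hypothetical.

```ruby
# Hypothetical setting value: pipe-separated automation tokens.
AUTOMATION_UA_PATTERN = "HeadlessChrome|PhantomJS|Selenium|Puppeteer|Playwright"

# True when the user agent contains any configured automation token.
# A blank pattern is treated as "no match" so that an unset setting
# does not flag every request (an empty regex matches any string).
def automation_user_agent?(user_agent, pattern = AUTOMATION_UA_PATTERN)
  return false if user_agent.nil? || pattern.nil? || pattern.empty?

  Regexp.new(pattern, Regexp::IGNORECASE).match?(user_agent)
end
```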

The `crawler_asns` default also expands from Baidu-only to a
conservative list of dedicated crawler-operator ASNs (Baidu, Ahrefs,
Babbar, Internet Archive, Yandex). Major search-engine ASNs (Google,
Microsoft, Apple, Meta) are intentionally omitted because those networks
also carry consumer traffic.

Benchmarked at 20k to 1M events: initial scoring took 250ms to 17s and
steady-state re-scoring 85ms to 5.4s, comfortably under the 2-minute
statement_timeout.
2026-05-14 12:46:08 +08:00


# frozen_string_literal: true

RSpec.describe Jobs::DetectCrawlerPageviews do
  it "does nothing when detection is disabled" do
    SiteSetting.experimental_detect_crawler_pageviews = false

    # With the setting off, the job must return without scoring anything.
    CrawlerScorer.expects(:score!).never

    described_class.new.execute({})
  end

  it "scores the last hour of pageviews when enabled" do
    SiteSetting.experimental_detect_crawler_pageviews = true
    freeze_time

    # Time is frozen, so the expected window bounds compare exactly.
    CrawlerScorer.expects(:score!).with(window_start: 1.hour.ago, window_end: Time.now)

    described_class.new.execute({})
  end
end