discourse/app/jobs/scheduled/detect_crawler_pageviews.rb at main - Discourse/discourse


mirror of https://gh.wpcy.net/https://github.com/discourse/discourse.git synced 2026-06-19 02:05:37 +08:00

Krzysztof Kotlarek 9b2c7beadf

DEV: Add anonymous pageview crawler scoring job (#39954)

Adds `Jobs::DetectCrawlerPageviews`, a scheduled job that runs every 10
minutes and assigns a crawler score to each anonymous browser pageview
event in the last hour. Six heuristics:

- Automation user agent (+50)
- Known crawler ASN (+35)
- Pageview velocity per ip+ua (+0 to +35)
- Session churn per ip+ua (+0 to +20)
- Rapid navigation — median gap under 2s (+15)
- Referrer discontinuity per ip+ua (+0 to +10)
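As a rough illustration of how the six heuristics above could combine for one ip+ua group, the sketch below adds up the listed weights. The commit message gives only the point values and ranges, so the additive combination, the `Signals` struct, and the helper names are assumptions for illustration, not the actual `CrawlerScorer` code.

```ruby
# Hypothetical sketch: not the real CrawlerScorer implementation.
# Weights come from the commit message; everything else is assumed.
Signals = Struct.new(
  :automation_ua,    # true if the user agent matches an automation token
  :crawler_asn,      # true if the IP belongs to a known crawler ASN
  :velocity_points,  # precomputed elsewhere, expected in 0..35
  :churn_points,     # precomputed elsewhere, expected in 0..20
  :gaps_seconds,     # gaps between consecutive pageviews, in seconds
  :referrer_points,  # precomputed elsewhere, expected in 0..10
  keyword_init: true
)

# Median of a list of numbers; nil for an empty list.
def median(values)
  return nil if values.empty?

  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Assumption: the heuristics are summed, with no overall cap.
def crawler_score(signals)
  score = 0
  score += 50 if signals.automation_ua
  score += 35 if signals.crawler_asn
  score += signals.velocity_points.clamp(0, 35)
  score += signals.churn_points.clamp(0, 20)

  gap = median(signals.gaps_seconds)
  score += 15 if gap && gap < 2 # rapid navigation: median gap under 2s

  score += signals.referrer_points.clamp(0, 10)
  score
end
```

The ranged heuristics are passed in as precomputed points because the commit message does not say how velocity, churn, or referrer discontinuity map onto their ranges.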

Scoring is take-max idempotent: re-runs only update rows where the new
score is higher than the existing one, which keeps write pressure on the
events table minimal. Logged-in events are deliberately not scored —
crawler behaviour is fundamentally an anonymous-traffic problem, and a
separate detector for authenticated traffic can land later if needed.
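The take-max rule can be pictured with a small in-memory model. This is a sketch only: the real job updates rows in the events table, and the `apply_take_max!` helper and the use of `nil` for "never scored" are assumptions made for illustration.

```ruby
# Hypothetical in-memory model of take-max scoring; it does not attempt
# to reproduce the SQL the real job runs.
#
# existing: Hash of event id => stored score (nil when never scored)
# computed: Hash of event id => score from the current run
# Returns the ids that were actually written.
def apply_take_max!(existing, computed)
  written = []
  computed.each do |event_id, new_score|
    old_score = existing[event_id]
    # Skip the write unless the new score is strictly higher.
    next if old_score && new_score <= old_score

    existing[event_id] = new_score
    written << event_id
  end
  written
end

stored = { 1 => 50, 2 => nil }
computed = { 1 => 35, 2 => 20, 3 => 15 }

first_run = apply_take_max!(stored, computed)   # writes events 2 and 3
second_run = apply_take_max!(stored, computed)  # writes nothing
```

Running the same scores twice leaves the stored values unchanged and performs no writes on the second pass, which is the property that keeps re-runs cheap.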

New site settings (both hidden, default off):

- `detect_crawler_pageviews` — gates the job
- `crawler_automation_user_agents` — regex of automation UA tokens
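A setting holding a regex of user agent tokens might be applied roughly as follows. The token list, the `automation_user_agent?` helper, and the case-insensitive matching are all assumptions; the setting's real default and the way the scorer consumes it are not shown here.

```ruby
# Hypothetical token list; the real default of
# crawler_automation_user_agents is not given in the commit message.
AUTOMATION_TOKENS = "HeadlessChrome|PhantomJS|Selenium"

# Assumption: the setting is a pipe-delimited pattern matched
# case-insensitively anywhere in the user agent string.
def automation_user_agent?(user_agent, pattern = AUTOMATION_TOKENS)
  return false if user_agent.nil? || user_agent.empty?
  return false if pattern.nil? || pattern.empty?

  Regexp.new(pattern, Regexp::IGNORECASE).match?(user_agent)
end
```

Returning `false` for an empty pattern means that a blank setting never flags a user agent in this sketch.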

The `crawler_asns` default also expands from Baidu-only to a
conservative list of dedicated crawler-operator ASNs (Baidu, Ahrefs,
Babbar, Internet Archive, Yandex). Major search-engine ASNs (Google,
Microsoft, Apple, Meta) are intentionally omitted because those networks
also carry consumer traffic.

Benchmarked at 20k-1M events; initial scoring 250ms-17s, steady-state
re-score 85ms-5.4s — comfortably under the 2-minute statement_timeout.

2026-05-14 12:46:08 +08:00



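# frozen_string_literal: true

module Jobs
  # Scheduled job that assigns a crawler score to anonymous browser
  # pageview events from the last hour.
  class DetectCrawlerPageviews < ::Jobs::Scheduled
    every 10.minutes

    # Each run re-scores the trailing hour, so consecutive runs overlap.
    LOOKBACK = 1.hour

    def execute(args)
      # Hidden site setting gates the job; it is off by default.
      return if !SiteSetting.experimental_detect_crawler_pageviews

      now = Time.now
      CrawlerScorer.score!(window_start: now - LOOKBACK, window_end: now)
    end
  end
end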
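
Because the job fires every 10 minutes but looks back a full hour, each event is covered by several consecutive runs. The plain-Ruby sketch below counts them, using integer seconds in place of ActiveSupport durations. It assumes runs fire exactly on multiples of the interval and that both window bounds are inclusive, neither of which is confirmed by the source.

```ruby
# Plain-Ruby sketch; the constant and method names are illustrative only.
INTERVAL = 10 * 60          # mirrors `every 10.minutes`
LOOKBACK_SECONDS = 60 * 60  # mirrors `LOOKBACK = 1.hour`

# Run times whose window [run - LOOKBACK_SECONDS, run] contains an
# event that happened at event_time (seconds since an arbitrary origin).
def runs_covering(event_time)
  # First scheduled run at or after the event.
  first_run = ((event_time + INTERVAL - 1) / INTERVAL) * INTERVAL
  first_run.step(event_time + LOOKBACK_SECONDS, INTERVAL).to_a
end

runs = runs_covering(100)
```

Under these assumptions an event at second 100 is seen by the six runs from second 600 through second 3600, so the scorer visits most events about six times before they age out of the window.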
