wordpress-scraper

Description

Simple, easy-to-use scraper to scrape data from WordPress JSON API

Features

  • Stores crawled documents as MongoDB documents or JSON files
  • Automatically retries on errors
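
To illustrate the retry feature listed above, here is a minimal sketch of a retry wrapper. It is not the scraper's actual code: the name with_retry, the attempt count, and the linear backoff are assumptions chosen for the example.

```python
import time


def with_retry(func, attempts=3, delay=1.0, sleep=time.sleep):
    """Call func(); if it raises, wait and try again, up to `attempts` times.

    Hypothetical helper for illustration only. `sleep` is injectable so the
    wait can be skipped in tests.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as error:  # a real crawler would catch narrower errors
            last_error = error
            if attempt < attempts:
                # Linear backoff: wait 1x, 2x, ... the base delay.
                sleep(delay * attempt)
    # Every attempt failed: re-raise the last error seen.
    raise last_error
```

A request function can then be wrapped as with_retry(lambda: fetch(url)), where fetch stands for whatever function performs the HTTP request.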

Requirements

  • Python 3.7+

Installation

pip install -r requirements.txt

How to use

Basic

Run crawl.py with the site's URL as the argument:

python3 crawl.py https://your.website.here

This will crawl the site using DefaultCrawlSession, which attempts to crawl all posts, categories & tags from the site.
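
For orientation, the following sketch shows how posts, categories and tags can be paged through the WordPress REST API v2 collection endpoints. It is an assumption-laden illustration, not the code of DefaultCrawlSession: collection_url, crawl_resource and the fetch_json callable are hypothetical names, and stopping on a short page is just one simple strategy.

```python
from urllib.parse import urlencode

# The three resource types the default session is described as crawling.
RESOURCES = ("posts", "categories", "tags")


def collection_url(site, resource, page, per_page=100):
    """Build a WordPress REST API v2 collection URL for one page of results."""
    query = urlencode({"per_page": per_page, "page": page})
    return f"{site.rstrip('/')}/wp-json/wp/v2/{resource}?{query}"


def crawl_resource(site, resource, fetch_json, per_page=100):
    """Yield every item of a resource, one page at a time.

    `fetch_json(url)` is a hypothetical callable returning the decoded JSON
    list for a URL. It should return an empty list when the requested page
    is past the end of the collection.
    """
    page = 1
    while True:
        items = fetch_json(collection_url(site, resource, page, per_page))
        yield from items
        if len(items) < per_page:
            break  # a short (or empty) page means nothing is left
        page += 1
```

The API also reports the total page count in the X-WP-TotalPages response header, which a fuller implementation could use instead of the short-page check.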

The crawled JSON files will be stored in the directory ./data/<domain-name>.
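
As a rough sketch of that layout, the snippet below derives the output directory from the site URL and writes a list of items to a JSON file. The function names and the one-file-per-resource naming are assumptions for illustration; the files the scraper actually writes may be organised differently.

```python
import json
from pathlib import Path
from urllib.parse import urlparse


def output_dir(site, root="data"):
    """Map a site URL to its output directory, i.e. <root>/<domain-name>."""
    # netloc is the host part of the URL; the URL must include a scheme.
    return Path(root) / urlparse(site).netloc


def save_items(site, resource, items, root="data"):
    """Write items to <root>/<domain-name>/<resource>.json (assumed naming)."""
    directory = output_dir(site, root)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{resource}.json"
    path.write_text(json.dumps(items, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```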

In most cases, this is sufficient for scraping sites that:

  1. do not require signing in
  2. do not block their JSON API paths

Advanced

For advanced usage and customization, see wpscraper/session.py for the actual crawling procedure, and write your own CrawlSession.

Upcoming Features

  • Rewrite/Refactor
  • MongoDB Connector
  • Async session
  • Authentication Module
  • Cloudflare circumvention
  • Configurable retry policies
  • Full WPv2 API resources support