Orix Au Yeung 61c1c832fe Merge pull request #42 from SoloSynth1/dependabot/pip/pymongo-4.7.1 Bump pymongo from 4.6.2 to 4.7.1	2024-05-03 01:45:11 -07:00
.github	Upgrade to GitHub-native Dependabot	2021-04-29 16:09:56 +00:00
legacy	update cyberthreat.py's endpoint	2021-04-21 14:15:12 +08:00
wpscraper	refactor; make requirements.txt to list top-level dependencies only	2021-04-12 17:02:28 +08:00
.gitignore	refactor legacy scripts; update .gitignore	2021-03-30 22:33:20 +08:00
crawl.py	naively implement MongoDBConnector	2020-11-09 00:28:00 +08:00
file2mongo.py	add file2mongo.py; add RawDocument class	2020-11-10 17:08:48 +08:00
legacy_crawl_all.py	add configurable threads count; update legacy script settings; add new sites in CSV	2021-04-15 14:53:01 +08:00
legacy_main.py	add legacy_crawl_all.py; make legacy_main.py's main function callable	2021-04-13 15:41:15 +08:00
LICENSE	Create LICENSE	2020-11-08 02:40:30 +08:00
README.md	Update README.md	2021-05-07 23:25:56 +08:00
requirements.txt	Bump pymongo from 4.6.2 to 4.7.1	2024-04-30 21:25:21 +00:00

README.md

wordpress-scraper

Description

A simple, easy-to-use scraper for collecting data from the WordPress JSON API.

Features

Stores crawled documents as MongoDB documents or JSON files
Automatically retries on errors
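
As an illustration of the retry behaviour, here is a minimal sketch of a retry wrapper with exponential backoff. It is not the code used in wpscraper; the name call_with_retry and its parameters (max_attempts, base_delay, sleep) are hypothetical and chosen only for this example.

```python
import time


def call_with_retry(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); if it raises, wait and try again, up to max_attempts calls.

    Hypothetical helper for illustration only, not part of wpscraper.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing sleep as a parameter keeps the helper easy to test without real waiting. The retry policy in the project itself may differ.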

Requirements

Python 3.7+

Installation

pip install -r requirements.txt

How to use

Basic

Run crawl.py with the site's URL:

python3 crawl.py https://your.website.here

This crawls the site using DefaultCrawlSession, which attempts to fetch all posts, categories, and tags from the site.
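
For orientation, posts, categories, and tags are exposed by the standard WordPress REST API (v2) as paginated collections under /wp-json/wp/v2/. The sketch below only builds the request URLs. The helper name build_endpoint is hypothetical, and whether DefaultCrawlSession uses exactly these query parameters is an assumption.

```python
from urllib.parse import urlencode

# Standard WordPress REST API (v2) collection routes for the three resource types.
RESOURCES = ("posts", "categories", "tags")


def build_endpoint(site_url, resource, page=1, per_page=100):
    """Return the URL for one page of a resource collection.

    Hypothetical helper for illustration; per_page=100 is the REST API's
    documented maximum page size.
    """
    base = site_url.rstrip("/")
    query = urlencode({"page": page, "per_page": per_page})
    return f"{base}/wp-json/wp/v2/{resource}?{query}"


for resource in RESOURCES:
    print(build_endpoint("https://your.website.here", resource))
```

A full crawl would request successive pages of each collection until the API reports no more results.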

The crawled JSON files will be stored in the directory ./data/<domain-name>.
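
To see where the output for a given site would land, the directory can be derived from the URL roughly as follows. This is a sketch under the assumption that <domain-name> is the URL's host part; the function name output_dir_for is hypothetical, and the project may normalise the domain differently (for example with respect to ports or a www. prefix).

```python
from pathlib import Path
from urllib.parse import urlparse


def output_dir_for(site_url, root="./data"):
    """Map a site URL to its assumed output directory, <root>/<domain-name>."""
    domain = urlparse(site_url).netloc  # e.g. "your.website.here"
    return Path(root) / domain


print(output_dir_for("https://your.website.here").as_posix())
```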

Most of the time, this is sufficient for sites that:

do not require signing in
do not block the JSON API paths
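
One way to check these two conditions before crawling is to probe a single REST API route and look at the HTTP status. The sketch below is not part of the scraper; classify_status and probe are hypothetical names, and the mapping from status codes to causes is a rough heuristic, since sites can block the API in other ways.

```python
import urllib.error
import urllib.request


def classify_status(status):
    """Roughly interpret the HTTP status returned by a WP REST API route."""
    if status == 200:
        return "open"
    if status in (401, 403):
        return "blocked or sign-in required"
    if status == 404:
        return "JSON API not found"
    return "unknown"


def probe(site_url, timeout=10):
    """Request one post from the site's REST API and classify the response."""
    url = site_url.rstrip("/") + "/wp-json/wp/v2/posts?per_page=1"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_status(resp.status)
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; the code is on the error.
        return classify_status(err.code)
```

Calling probe("https://your.website.here") performs a real network request. Connection failures (urllib.error.URLError) are deliberately left uncaught here.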

Advanced

For advanced usage and customization, see wpscraper/session.py for the actual crawling procedures, and write your own CrawlSession.

Upcoming Features

Rewrite/Refactor
MongoDB Connector
Async session
Authentication Module
Cloudflare circumvention
Configurable retry policies
Full WPv2 API resources support