wordpress-scraper
Description
A simple, easy-to-use scraper for collecting data from the WordPress JSON API.
Features
- Stores crawled documents as MongoDB documents or JSON files
- Automatically retries on errors
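To illustrate what retrying on errors means in practice, here is a minimal sketch of a retry wrapper. It is not the code used by this project; the name `with_retries` and its parameters are hypothetical and chosen only for illustration.

```python
import time


def with_retries(func, attempts=3, delay=1.0, exceptions=(Exception,)):
    """Call func() until it succeeds or the attempts run out.

    Hypothetical helper for illustration; not part of wpscraper.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)  # wait before trying again
```

A caller would wrap a single request, for example `with_retries(lambda: download(url), attempts=5)`, where `download` stands for whatever function performs the request.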
Requirements
- Python 3.7+
Installation
pip install -r requirements.txt
How to use
Basic
Just run crawl.py with the site's URL supplied:
python3 crawl.py https://your.website.here
This will crawl the site using DefaultCrawlSession, which attempts to crawl all posts, categories & tags from the site.
The crawled JSON files will be stored in the directory ./data/<domain-name>.
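For orientation, the sketch below shows the kind of WordPress REST API v2 URLs that a crawl of posts, categories and tags would request, and how the output directory follows from the site's domain. The `/wp-json/wp/v2/...` routes and the `per_page` and `page` query parameters belong to the WordPress REST API. The helper names `collection_url` and `data_dir` are hypothetical, and the project's own implementation may differ.

```python
from pathlib import Path
from urllib.parse import urlencode, urlparse

# Resource types mentioned above; each is a WP REST API v2 collection.
RESOURCES = ("posts", "categories", "tags")


def collection_url(site, resource, page=1, per_page=100):
    """Build the URL for one page of a collection (hypothetical helper)."""
    base = site.rstrip("/")
    query = urlencode({"per_page": per_page, "page": page})
    return f"{base}/wp-json/wp/v2/{resource}?{query}"


def data_dir(site, root="data"):
    """Map a site URL to ./data/<domain-name> (hypothetical helper)."""
    return Path(root) / urlparse(site).netloc


if __name__ == "__main__":
    site = "https://your.website.here"
    for resource in RESOURCES:
        print(collection_url(site, resource))
    print(data_dir(site))
```

WordPress caps `per_page` at 100, so crawling a whole collection means requesting successive `page` values until the API reports no more results.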
Most of the time, this will suffice for sites that:
- do not require signing in
- do not block the JSON API paths
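One way to check whether a site meets these two conditions before crawling is to request a single post from the API and look at the status code. The sketch below is an assumption-laden illustration, not part of this project: `fetch_status` and `api_looks_open` are hypothetical names, and a 200 response only suggests that the posts endpoint is publicly reachable.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code of a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 401/403 typically mean sign-in is required or the path is blocked.
        return err.code


def api_looks_open(site, fetch=fetch_status):
    """Return True if the posts endpoint answers with HTTP 200.

    `fetch` is injectable so the check can be exercised without a network.
    """
    url = site.rstrip("/") + "/wp-json/wp/v2/posts?per_page=1"
    return fetch(url) == 200
```

Some sites reject requests that carry the default Python user agent, so a negative result from this probe is not conclusive.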
Advanced
For advanced usage and customization, see wpscraper/session.py for the actual crawling procedures, then write your own CrawlSession.
Upcoming Features
- Rewrite/Refactor
- MongoDB Connector
- Async session
- Authentication Module
- Cloudflare circumvention
- Configurable retry policies
- Full WPv2 API resources support