https://github.com/jtiala/wpdl
⬇️ Scrape pages, posts, images and other data from a WordPress instance.
crawler downloader scraper scraping wordpress
- Host: GitHub
- URL: https://github.com/jtiala/wpdl
- Owner: jtiala
- License: mit
- Created: 2022-12-01T21:48:44.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-02-20T20:43:55.000Z (10 months ago)
- Last Synced: 2024-10-04T18:22:53.280Z (3 months ago)
- Topics: crawler, downloader, scraper, scraping, wordpress
- Language: TypeScript
- Homepage:
- Size: 1.26 MB
- Stars: 9
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
README
# wpdl
[![License](https://img.shields.io/npm/l/wpdl)](https://github.com/jtiala/wpdl/blob/main/LICENSE)
[![Release](https://img.shields.io/github/v/release/jtiala/wpdl?sort=semver)](https://github.com/jtiala/wpdl/releases)
[![npm](https://img.shields.io/npm/v/wpdl)](https://www.npmjs.com/package/wpdl)
[![Conventional Commits](https://img.shields.io/badge/Conventional%20Commits-1.0.0-yellow.svg)](https://conventionalcommits.org)
[![CI](https://github.com/jtiala/wpdl/actions/workflows/ci.yml/badge.svg)](https://github.com/jtiala/wpdl/actions/workflows/ci.yml)

Scrape pages, posts, images and other data from a WordPress instance using the WordPress [REST API](https://developer.wordpress.org/rest-api/). Use simple command line arguments to clean up the scraped data.
![Screenshot of example usage of the tool in a terminal emulator.](https://raw.githubusercontent.com/jtiala/wpdl/main/usage.png)
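To give a feel for what scraping through the REST API involves, here is a minimal TypeScript sketch of paging through a collection such as `pages` or `posts`. It is not taken from the `wpdl` source: the helper names `buildListUrl` and `fetchAll` and the `FetchLike` type are hypothetical. Only the `/wp-json/wp/v2/...` route, the `page` and `per_page` query parameters, and the `X-WP-TotalPages` response header come from the WordPress REST API itself.

```typescript
// Minimal structural type for the parts of fetch() this sketch relies on
// (hypothetical; it lets the function run against a stub as well as native fetch).
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  headers: { get(name: string): string | null };
  json(): Promise<unknown>;
}>;

// Build the collection URL for one page of results.
// The WordPress REST API caps per_page at 100.
function buildListUrl(baseUrl: string, resource: string, page: number, perPage = 100): string {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return `${root}/wp-json/wp/v2/${resource}?per_page=${perPage}&page=${page}`;
}

// Fetch every page of a collection and return the concatenated items.
async function fetchAll(baseUrl: string, resource: string, fetchImpl: FetchLike): Promise<unknown[]> {
  const items: unknown[] = [];
  let page = 1;
  let totalPages = 1;
  do {
    const response = await fetchImpl(buildListUrl(baseUrl, resource, page));
    if (!response.ok) {
      throw new Error(`Request for ${resource} page ${page} failed with status ${response.status}`);
    }
    // WordPress reports the number of available pages in this response header.
    totalPages = Number(response.headers.get("X-WP-TotalPages") ?? "1");
    const batch = await response.json();
    if (Array.isArray(batch)) {
      items.push(...batch);
    }
    page += 1;
  } while (page <= totalPages);
  return items;
}
```

On a Node.js version with native fetch, the global `fetch` can be passed as the third argument, for example `fetchAll("https://your-wp-instance.com", "pages", fetch)`. Taking the fetch function as a parameter also makes the loop easy to exercise against a stub.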
## Prerequisites
Node.js v19 or newer (for native fetch support).
## Usage examples
The following commands use the latest version of `wpdl` published on [npm](https://www.npmjs.com/package/wpdl). To run the script locally, clone this repo and replace `npx wpdl` with `npx .`.
Scrape pages and posts:
```bash
npx wpdl --url https://your-wp-instance.com --pages --posts
```

Scrape pages and clean up the HTML by filtering out all `img` elements and elements with the class `foo`. Also remove all elements without text content. From the JSON files, remove all the Jetpack and Yoast SEO data:
```bash
npx wpdl --url https://your-wp-instance.com --pages --elementFilter img --classFilter foo --jsonFilter "jetpack_*" --jsonFilter "yoast_*" --removeEmptyElements
```

To see the full usage, run:
```bash
npx wpdl -h
```
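
The `--jsonFilter "jetpack_*"` and `--jsonFilter "yoast_*"` arguments in the cleanup example use wildcard patterns. The TypeScript sketch below illustrates one way such patterns could be applied to a scraped JSON document. It is an assumption about the behaviour, not the tool's actual implementation: the names `wildcardToRegExp` and `filterJsonKeys` are hypothetical, and the sketch assumes that `*` matches any run of characters and that keys are filtered at every nesting level.

```typescript
// Any JSON-serialisable value.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Convert a simple wildcard pattern such as "jetpack_*" into an anchored RegExp.
// Every regex metacharacter except "*" is escaped; "*" becomes ".*".
function wildcardToRegExp(pattern: string): RegExp {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&").replace(/\*/g, ".*");
  return new RegExp(`^${escaped}$`);
}

// Return a copy of `value` without any object keys matching one of the patterns.
// Assumption: the filter is applied recursively, including inside arrays.
function filterJsonKeys(value: Json, patterns: string[]): Json {
  const matchers = patterns.map(wildcardToRegExp);
  const walk = (node: Json): Json => {
    if (Array.isArray(node)) {
      return node.map(walk);
    }
    if (node !== null && typeof node === "object") {
      const result: { [key: string]: Json } = {};
      for (const key of Object.keys(node)) {
        if (!matchers.some((matcher) => matcher.test(key))) {
          result[key] = walk(node[key]);
        }
      }
      return result;
    }
    return node; // primitives are returned unchanged
  };
  return walk(value);
}
```

For instance, calling `filterJsonKeys(post, ["jetpack_*", "yoast_*"])` on an object with the illustrative keys `id`, `jetpack_featured_media_url` and `yoast_head` would keep only `id`.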