https://github.com/casoon/site-scraper
- Host: GitHub
- URL: https://github.com/casoon/site-scraper
- Owner: casoon
- License: mit
- Created: 2025-10-27T08:02:53.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-12-02T08:54:12.000Z (6 months ago)
- Last Synced: 2025-12-05T02:31:09.040Z (6 months ago)
- Language: TypeScript
- Size: 30.3 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Site Scraper
[CI](https://github.com/casoon/site-scraper/actions/workflows/ci.yml)
A small Node.js CLI tool that creates static copies of websites. It crawls from a starting URL, saves HTML files along with stylesheets and scripts locally, and replaces images with placeholders based on configuration.
## Prerequisites
- Node.js >= 24 (for native `fetch` support)
- pnpm as package manager (npm or yarn work as well, but the commands below are for pnpm)
## Installation
```sh
pnpm install
```
## Usage
```sh
pnpm run dev <url> [--maxDepth 2] [--concurrency 8] [--placeholder external|local] [--sitemap] [--allowExternalAssets]
```
Example:
```sh
pnpm run dev https://www.example.com --maxDepth 2 --placeholder local
```
### Output
- All results are automatically saved to `./output/`.
- If the folder already exists, it will be deleted and recreated before the run.
- HTML files are stored in a folder structure matching the URL paths.
- Assets (CSS/JS/Fonts) are downloaded and internal references are rewritten.
- Images can be replaced with external placeholders (`external`) or locally generated PNGs (`local`, which optionally uses `sharp`).
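
How a URL path turns into a location below `./output/` can be illustrated with a small sketch. This is not taken from the project's source: the function name `urlToOutputPath` is hypothetical, and the convention that extension-less paths are saved as `index.html` inside a folder of the same name is an assumption.

```typescript
// Hypothetical helper, not the project's actual implementation.
// Assumption: a path whose last segment has no file extension is treated
// as a directory and stored as <path>/index.html.
function urlToOutputPath(pageUrl: string, outputDir = "output"): string {
  // The WHATWG URL parser already resolves "." and ".." segments
  // and separates the query string from the pathname.
  const { pathname } = new URL(pageUrl);
  const segments = pathname.split("/").filter((s) => s.length > 0);

  // Heuristic: "about.html" looks like a file, "blog" does not.
  const last = segments[segments.length - 1];
  const looksLikeFile = last !== undefined && /\.[a-z0-9]+$/i.test(last);
  if (!looksLikeFile) segments.push("index.html");

  return [outputDir, ...segments].join("/");
}
```

Under these assumptions, `https://www.example.com/blog/post` maps to `output/blog/post/index.html` and `https://www.example.com/about.html?x=1` maps to `output/about.html`. The query string is ignored in this sketch, so pages that differ only in their query would collide.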
### Options
- `--maxDepth`: Maximum crawl depth relative to the start page (default: `2`).
- `--concurrency`: Number of parallel downloads (default: `8`).
- `--placeholder`: Image replacement mode, either `external` (external placeholders) or `local` (locally generated PNGs).
- `--sitemap`: When set (default: `true`), entries from `/sitemap.xml` or `/sitemap_index.xml` are also used as starting points.
- `--allowExternalAssets`: When `false`, external CSS/JS/assets are not downloaded (default: `true`).
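
To make the interplay of `--maxDepth` and `--concurrency` concrete, here is a minimal sketch of a depth-limited, breadth-first crawl with a fixed number of parallel workers. It is an illustration only, not the tool's actual code: `crawl`, `LinkExtractor`, and `getLinks` are hypothetical names, and treating sitemap entries as additional depth-0 start URLs is an assumption about how `--sitemap` behaves.

```typescript
// Hypothetical callback: fetches a page and returns the links found on it.
type LinkExtractor = (url: string) => Promise<string[]>;

// Sketch of a level-by-level crawl. Returns every visited URL with the
// depth at which it was first discovered.
async function crawl(
  startUrls: string[], // start page plus, by assumption, sitemap entries
  getLinks: LinkExtractor,
  maxDepth = 2,
  concurrency = 8,
): Promise<Map<string, number>> {
  const depthByUrl = new Map<string, number>();
  for (const url of startUrls) depthByUrl.set(url, 0);
  let frontier = [...depthByUrl.keys()];

  for (let depth = 0; frontier.length > 0; depth++) {
    const next: string[] = [];
    let cursor = 0;

    // Each worker pulls the next unprocessed URL from the shared frontier.
    const worker = async (): Promise<void> => {
      while (cursor < frontier.length) {
        const url = frontier[cursor++];
        const links = await getLinks(url);
        if (depth >= maxDepth) continue; // visited, but not expanded further
        for (const link of links) {
          if (!depthByUrl.has(link)) {
            depthByUrl.set(link, depth + 1);
            next.push(link);
          }
        }
      }
    };

    // At most `concurrency` pages of the current level are in flight at once.
    const workerCount = Math.min(concurrency, frontier.length);
    await Promise.all(Array.from({ length: workerCount }, () => worker()));
    frontier = next;
  }
  return depthByUrl;
}
```

In this sketch, pages exactly `maxDepth` links away from a start URL are still visited, but their links are not followed. Processing one depth level at a time keeps the depth bookkeeping simple, at the cost of waiting for the slowest page of each level before moving on.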
## Build
To create a compiled output in `dist/`:
```sh
pnpm run build
```
## Linting & Formatting
This project uses [Biome](https://biomejs.dev/) for linting and formatting:
```sh
pnpm run check # Check lint + format
pnpm run check:fix # Auto-fix issues
pnpm run lint # Lint only
pnpm run format # Format only
```
## License
This project is licensed under the [MIT License](LICENSE).