Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Chromium / Puppeteer site crawler
https://github.com/ReedD/crawler
- Host: GitHub
- URL: https://github.com/ReedD/crawler
- Owner: ReedD
- Created: 2017-08-20T01:48:54.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2020-03-30T02:58:20.000Z (almost 5 years ago)
- Last Synced: 2024-07-31T17:16:45.026Z (6 months ago)
- Topics: bot, chromium, crawler, puppeteer, redis, scraper
- Language: JavaScript
- Homepage:
- Size: 30.3 KB
- Stars: 48
- Watchers: 6
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-puppeteer - ReedD/crawler - BFS site crawler. (Rendering and web scraping)
- awesome-puppeteer-zh - ReedD/crawler - BFS site crawler. (Rendering and web scraping / Contributing)
README
# Chromium / [Puppeteer](https://github.com/GoogleChrome/puppeteer) site crawler
[![styled with prettier](https://img.shields.io/badge/styled_with-prettier-ff69b4.svg)](https://github.com/prettier/prettier)
This crawler does a BFS starting from a given site entry point. It will not leave the entry point's domain, and it will not crawl a page more than once. Given a shared Redis host/cluster, the crawler can be distributed across multiple machines or processes. Discovered pages are stored in a MongoDB collection, each with a URL, its outbound URLs, and its radius (link depth) from the origin.
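For orientation, here is a minimal in-memory sketch of that BFS using Puppeteer alone. It is not the project's code: the real crawler keeps its frontier and visited set in Redis and writes each page document to MongoDB, and the `maxRadius` default and entry URL below are only illustrative.

```js
// Minimal, self-contained sketch of the BFS described above (illustrative,
// not the project's implementation).
const puppeteer = require('puppeteer');

async function crawl(entryUrl, maxRadius = 2) {
  const origin = new URL(entryUrl).origin;
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const visited = new Set();
  const pages = [];                              // { url, outboundUrls, radius }
  const queue = [{ url: entryUrl, radius: 0 }];  // BFS frontier

  while (queue.length > 0) {
    const { url, radius } = queue.shift();
    if (visited.has(url) || radius > maxRadius) continue;
    visited.add(url);

    try {
      await page.goto(url, { waitUntil: 'networkidle2' });
    } catch (err) {
      continue; // skip pages that fail to load
    }

    // Collect the absolute hrefs rendered on the page.
    const hrefs = await page.$$eval('a[href]', links => links.map(a => a.href));
    const outboundUrls = [...new Set(hrefs)];
    pages.push({ url, outboundUrls, radius });

    // Only follow links that stay on the entry point's domain.
    for (const href of outboundUrls) {
      if (href.startsWith(origin) && !visited.has(href)) {
        queue.push({ url: href, radius: radius + 1 });
      }
    }
  }

  await browser.close();
  return pages;
}

crawl('https://www.dadoune.com').then(pages =>
  console.log(`crawled ${pages.length} pages`),
);
```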
## Installation
```
yarn
```

## Usage
### Basic
```bash
./crawl -u https://www.dadoune.com
```
### Distributed
```bash
# Terminal 1
./crawl -u https://www.dadoune.com
```

```bash
# Terminal 2
./crawl -r
```
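Distribution works because every process points at the same Redis host. The sketch below shows the kind of shared frontier this implies; it is hypothetical, and the key names and helper functions are illustrative rather than taken from the project's code.

```js
// Illustrative only: a shared Redis frontier that multiple crawler
// processes can push to and pop from (key names are made up here).
const Redis = require('ioredis');
const redis = new Redis(); // every worker connects to the same host/cluster

async function enqueue(url, radius) {
  // SADD returns 1 only the first time a URL is added, so each page is
  // queued, and therefore crawled, at most once across all workers.
  if (await redis.sadd('crawler:seen', url)) {
    await redis.rpush('crawler:queue', JSON.stringify({ url, radius }));
  }
}

async function dequeue() {
  const job = await redis.lpop('crawler:queue');
  return job ? JSON.parse(job) : null;
}
```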
### Debug
```bash
DEBUG=crawler:* ./crawl -u https://www.dadoune.com
```

### Options
- `--maxRadius` or `-m`: the maximum link depth the crawler will explore from the entry URL.
- `--resume` or `-r`: resume crawling after a process exits prematurely, or add additional crawlers to an existing crawl.
- `--url` or `-u`: the entry point URL to kick off the crawler.
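For example, assuming the long-form flags parse the same way as their short aliases above, a depth-limited crawl could be started and later resumed like this (the depth value is illustrative):

```bash
# Start a crawl that explores at most 3 links away from the entry URL
./crawl --url https://www.dadoune.com --maxRadius 3

# Resume it later, or attach another worker to the same crawl
./crawl --resume
```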