https://github.com/shuddha2021/nodejs-crawler
A lightweight and efficient web crawler built with Node.js
asynchronous axios cheerio dataextraction javascript nodejs opensource webcrawler webscraping
- Host: GitHub
- URL: https://github.com/shuddha2021/nodejs-crawler
- Owner: shuddha2021
- Created: 2024-04-18T00:07:36.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-04-18T00:14:49.000Z (9 months ago)
- Last Synced: 2024-11-06T19:50:22.972Z (about 2 months ago)
- Topics: asynchronous, axios, cheerio, dataextraction, javascript, nodejs, opensource, webcrawler, webscraping
- Language: JavaScript
- Homepage:
- Size: 851 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Web Crawler
A lightweight and configurable web crawler built with Node.js.
## Description
This crawler recursively extracts links from websites using Node.js, Axios, and Cheerio. It respects depth limits and avoids duplicate visits for efficient crawling.
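As a quick illustration of the Axios + Cheerio pairing described above, link extraction for a single page might look like the following minimal sketch. The function name `extractLinks` and the URL are placeholders, not necessarily what `crawler.js` uses:

```javascript
const axios = require('axios');
const cheerio = require('cheerio');

// Fetch one page and return the absolute URL of every link on it.
async function extractLinks(url) {
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);
  return $('a[href]')
    .map((_, el) => new URL($(el).attr('href'), url).href)
    .get(); // .get() converts the cheerio collection to a plain array
}

extractLinks('https://example.com')
  .then((links) => console.log(links))
  .catch((err) => console.error(err.message));
```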
## Features
- Asynchronous operation for optimal performance
- Recursive link extraction with configurable depth
- Deduplication of visited URLs
- Targeted crawling capability (e.g., specific domain)
- Extensible codebase for easy customization
- Error handling and reporting

## Technologies
- Node.js
- Axios (HTTP requests)
- Cheerio (HTML parsing)

## Implementation
The crawler fetches and parses HTML using Axios and Cheerio, respectively. It maintains a set of visited URLs and recursively follows links within the configured depth limit and target domain. The process continues until all links are crawled or the maximum depth is reached.
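A minimal sketch of that loop follows. It uses the `MAX_DEPTH` and `targetDomain` settings mentioned under Usage, but the function name `crawl`, the seed URL, and the configuration values are illustrative assumptions rather than the exact contents of `crawler.js`:

```javascript
const axios = require('axios');
const cheerio = require('cheerio');

const MAX_DEPTH = 2;                // configurable depth limit
const targetDomain = 'example.com'; // restrict crawling to one domain

const visited = new Set();          // deduplicate visited URLs

async function crawl(url, depth = 0) {
  // Stop at the depth limit, on repeat visits, or outside the target domain.
  if (depth > MAX_DEPTH || visited.has(url)) return;
  if (new URL(url).hostname !== targetDomain) return;
  visited.add(url);
  console.log(`crawling (depth ${depth}): ${url}`);

  try {
    const { data: html } = await axios.get(url);
    const $ = cheerio.load(html);

    // Collect absolute links, then follow each one level deeper.
    const links = $('a[href]')
      .map((_, el) => new URL($(el).attr('href'), url).href)
      .get();

    await Promise.all(links.map((link) => crawl(link, depth + 1)));
  } catch (err) {
    // Report failures without aborting the rest of the crawl.
    console.error(`failed: ${url} (${err.message})`);
  }
}

crawl(`https://${targetDomain}/`);
```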
## Usage
1. Clone the repository
2. Install dependencies: `npm install`
3. Configure `MAX_DEPTH` and `targetDomain` in `crawler.js` (see the example after this list)
4. Run: `node crawler.js`
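For step 3, the two settings could look like this near the top of `crawler.js` (the values here are placeholders, not the repository's defaults):

```javascript
// crawler.js -- example configuration (placeholder values)
const MAX_DEPTH = 3;                 // how many levels of links to follow
const targetDomain = 'example.com';  // only URLs on this host are crawled
```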
## Contributing
Contributions are welcome! Open issues or submit pull requests.
## License
[MIT License](LICENSE)