https://github.com/shuddha2021/nodejs-crawler

# Web Crawler

A lightweight and configurable web crawler built with Node.js.

## Description

This crawler recursively extracts links from websites using Node.js, Axios, and Cheerio. It stops following links beyond a configurable depth and skips URLs it has already visited, so no page is fetched twice.

## Features

- Asynchronous HTTP requests that do not block the Node.js event loop
- Recursive link extraction with configurable depth
- Deduplication of visited URLs
- Targeted crawling (e.g., restricted to a specific domain)
- Extensible codebase for easy customization
- Error handling and reporting
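
The deduplication and targeted-crawling features can be illustrated with Node's built-in `URL` class. This is a minimal sketch, not the code in `crawler.js`: the helper names `normalizeUrl`, `isTargetDomain`, and `claimUrl` are hypothetical, and treating subdomains as part of the target domain is an assumption the README does not state.

```javascript
// Hypothetical helpers; the repository's crawler.js may handle this differently.

// Resolve an href against the page it was found on and drop the fragment,
// so "/docs" and "/docs#intro" map to the same visited-set key.
function normalizeUrl(href, baseUrl) {
  try {
    const url = new URL(href, baseUrl);
    // Skip non-web schemes such as mailto: or javascript:
    if (url.protocol !== 'http:' && url.protocol !== 'https:') return null;
    url.hash = '';
    return url.href;
  } catch {
    return null; // malformed href
  }
}

// Assumption: subdomains (blog.example.com) count as part of example.com.
function isTargetDomain(url, targetDomain) {
  const { hostname } = new URL(url);
  return hostname === targetDomain || hostname.endsWith(`.${targetDomain}`);
}

// Return the normalized URL if it should be crawled, or null if it is
// malformed, off-domain, or already in the visited set.
function claimUrl(href, baseUrl, targetDomain, visited) {
  const url = normalizeUrl(href, baseUrl);
  if (url === null || !isTargetDomain(url, targetDomain)) return null;
  if (visited.has(url)) return null;
  visited.add(url);
  return url;
}
```

Using a `Set` keyed on the normalized URL keeps the duplicate check cheap and avoids re-fetching a page that is linked in several slightly different forms.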

## Technologies

- Node.js
- Axios (HTTP requests)
- Cheerio (HTML parsing)

## Implementation

The crawler fetches pages with Axios and parses the HTML with Cheerio. It maintains a set of visited URLs and recursively follows links that fall within the configured depth limit and target domain. Crawling stops when no unvisited links remain or the maximum depth is reached.
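
The flow described above might look roughly like the following. This is a sketch under stated assumptions rather than the contents of `crawler.js`: `MAX_DEPTH` and `targetDomain` are the names given in the Usage section, but their values here are placeholders, and `fetchLinks`, `crawl`, and the options object are hypothetical.

```javascript
// Placeholder values; the README names these settings but not their defaults.
const MAX_DEPTH = 2;
const targetDomain = 'example.com';

// Hypothetical adapter: fetch a page with Axios, collect hrefs with Cheerio.
// The packages are required lazily so the crawl logic below can be exercised
// with a stub when they are not installed.
async function fetchLinks(url) {
  const axios = require('axios');
  const cheerio = require('cheerio');
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);
  return $('a[href]').map((_, el) => $(el).attr('href')).get();
}

// Depth-limited recursive crawl. `visited` is shared across the recursion
// so each URL is fetched at most once.
async function crawl(
  url,
  { maxDepth = MAX_DEPTH, domain = targetDomain, getLinks = fetchLinks } = {},
  depth = 0,
  visited = new Set()
) {
  if (depth > maxDepth || visited.has(url)) return visited;
  visited.add(url);

  let hrefs;
  try {
    hrefs = await getLinks(url);
  } catch (err) {
    // Report the failure and keep crawling the rest of the site.
    console.error(`Failed to fetch ${url}: ${err.message}`);
    return visited;
  }

  for (const href of hrefs) {
    let next;
    try {
      next = new URL(href, url); // resolve relative links
    } catch {
      continue; // skip malformed hrefs
    }
    next.hash = '';
    if (next.protocol !== 'http:' && next.protocol !== 'https:') continue;
    if (next.hostname !== domain) continue; // stay on the target domain
    await crawl(next.href, { maxDepth, domain, getLinks }, depth + 1, visited);
  }
  return visited;
}
```

A call such as `crawl('https://example.com/')` would resolve to the set of visited URLs. Links are followed one at a time here for clarity; because a page is marked visited the first time it is reached, a page first reached at the depth limit is not expanded even if a shorter path to it exists.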

## Usage

1. Clone the repository
2. Install dependencies: `npm install`
3. Configure `MAX_DEPTH` and `targetDomain` in `crawler.js`
4. Run: `node crawler.js`

## Contributing

Contributions are welcome! Open issues or submit pull requests.

## License

[MIT License](LICENSE)