https://github.com/shuddha2021/nodejs-crawler
A lightweight and efficient web crawler built with Node.js
asynchronous axios cheerio dataextraction javascript nodejs opensource webcrawler webscraping
- Host: GitHub
- URL: https://github.com/shuddha2021/nodejs-crawler
- Owner: shuddha2021
- Created: 2024-04-18T00:07:36.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-04-18T00:14:49.000Z (9 months ago)
- Last Synced: 2024-11-06T19:50:22.972Z (about 2 months ago)
- Topics: asynchronous, axios, cheerio, dataextraction, javascript, nodejs, opensource, webcrawler, webscraping
- Language: JavaScript
- Homepage:
- Size: 851 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Web Crawler
A lightweight and configurable web crawler built with Node.js.
## Description
This crawler recursively extracts links from websites using Node.js, Axios, and Cheerio. It respects depth limits and avoids duplicate visits for efficient crawling.
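As a quick illustration of the Axios + Cheerio pairing described above, link extraction for a single page might look like the following minimal sketch. The function name `extractLinks` and the URL are placeholders, not necessarily what `crawler.js` uses:

```javascript
const axios = require('axios');
const cheerio = require('cheerio');

// Fetch one page and return the absolute URL of every link on it.
async function extractLinks(url) {
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);
  return $('a[href]')
    .map((_, el) => new URL($(el).attr('href'), url).href)
    .get(); // .get() converts the cheerio collection to a plain array
}

extractLinks('https://example.com')
  .then((links) => console.log(links))
  .catch((err) => console.error(err.message));
```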
## Features
- Asynchronous operation for optimal performance
- Recursive link extraction with configurable depth
- Deduplication of visited URLs
- Targeted crawling capability (e.g., specific domain)
- Extensible codebase for easy customization
- Error handling and reporting

## Technologies
- Node.js
- Axios (HTTP requests)
- Cheerio (HTML parsing)

## Implementation
The crawler fetches and parses HTML using Axios and Cheerio, respectively. It maintains a set of visited URLs and recursively follows links within the configured depth limit and target domain. The process continues until all links are crawled or the maximum depth is reached.
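A minimal sketch of that loop follows. It uses the `MAX_DEPTH` and `targetDomain` settings mentioned under Usage, but the function name `crawl`, the seed URL, and the configuration values are illustrative assumptions rather than the exact contents of `crawler.js`:

```javascript
const axios = require('axios');
const cheerio = require('cheerio');

const MAX_DEPTH = 2;                // configurable depth limit
const targetDomain = 'example.com'; // restrict crawling to one domain

const visited = new Set();          // deduplicate visited URLs

async function crawl(url, depth = 0) {
  // Stop at the depth limit, on repeat visits, or outside the target domain.
  if (depth > MAX_DEPTH || visited.has(url)) return;
  if (new URL(url).hostname !== targetDomain) return;
  visited.add(url);
  console.log(`crawling (depth ${depth}): ${url}`);

  try {
    const { data: html } = await axios.get(url);
    const $ = cheerio.load(html);

    // Collect absolute links, then follow each one level deeper.
    const links = $('a[href]')
      .map((_, el) => new URL($(el).attr('href'), url).href)
      .get();

    await Promise.all(links.map((link) => crawl(link, depth + 1)));
  } catch (err) {
    // Report failures without aborting the rest of the crawl.
    console.error(`failed: ${url} (${err.message})`);
  }
}

crawl(`https://${targetDomain}/`);
```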
## Usage
1. Clone the repository
2. Install dependencies: `npm install`
3. Configure `MAX_DEPTH` and `targetDomain` in `crawler.js` (see the example after this list)
4. Run: `node crawler.js`
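For step 3, the two settings could look like this near the top of `crawler.js` (the values here are placeholders, not the repository's defaults):

```javascript
// crawler.js -- example configuration (placeholder values)
const MAX_DEPTH = 3;                 // how many levels of links to follow
const targetDomain = 'example.com';  // only URLs on this host are crawled
```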
## Contributing
Contributions are welcome! Open issues or submit pull requests.
## License
[MIT License](LICENSE)