Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/akashrajpurohit/node-crawler

A Node.js crawler that scrapes a website on a live domain and crawls it to find all URLs on that domain




Host: GitHub
URL: https://github.com/akashrajpurohit/node-crawler
Owner: AkashRajpurohit
Created: 2019-05-24T04:50:14.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2023-12-05T02:15:24.000Z (about 1 year ago)
Last Synced: 2024-12-18T19:40:26.004Z (about 2 months ago)
Topics: crawler, node-crawler, nodejs, url
Language: JavaScript
Homepage:
Size: 47.9 KB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 4
Metadata Files:
- Readme: README.md


README

# Nodejs Crawler

### A basic Node.js crawler that crawls a domain and collects all the URLs found on it
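
As a rough illustration of what crawling a single domain involves, the sketch below resolves each discovered link against the page it came from, keeps only same-origin URLs, and visits them breadth-first. The names `sameOriginUrls`, `crawl`, and the in-memory `pages` map are hypothetical and are not taken from this repository; a real crawler would fetch each page over HTTP instead of reading from a map.

```javascript
// Hypothetical helper: resolve raw hrefs against the page URL and keep
// only the ones on the same origin, without duplicates.
function sameOriginUrls(pageUrl, hrefs) {
  const base = new URL(pageUrl);
  const seen = new Set();
  for (const href of hrefs) {
    let resolved;
    try {
      resolved = new URL(href, base);
    } catch {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    if (resolved.origin !== base.origin) continue; // other site, mailto:, etc.
    resolved.hash = ''; // a fragment points into the same document
    seen.add(resolved.href);
  }
  return [...seen];
}

// Breadth-first traversal. `getHrefs` stands in for "download the page
// and return the href values found in it".
function crawl(startUrl, getHrefs) {
  const start = new URL(startUrl).href;
  const visited = new Set([start]);
  const queue = [start];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const next of sameOriginUrls(current, getHrefs(current))) {
      if (!visited.has(next)) {
        visited.add(next);
        queue.push(next);
      }
    }
  }
  return [...visited];
}

// Assumed site layout, used only for this example.
const pages = {
  'http://localhost:4000/': ['/index.html', '/about.html', 'https://example.com/'],
  'http://localhost:4000/index.html': ['/contact.html', '/about.html#team'],
  'http://localhost:4000/about.html': ['/blog.html'],
};
const found = crawl('http://localhost:4000', (url) => pages[url] || []);
console.log(found);
```

The `visited` set is what keeps the traversal from looping when pages link back to each other.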

Sample input HTML page served at `localhost:4000`:

```html

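<title>Hello World</title>

<a href="/index.html">Home</a>

<a href="/about.html">About</a>

<a href="/contact.html">Contact</a>

<a href="/blog.html">Blogs</a>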

Hello

World

Lorem ipsum dolor sit amet, consectetur adipisicing elit. Nulla, laudantium, omnis. Ea quaerat minima, nostrum doloremque repellendus! Ratione quasi, non eligendi quidem at culpa animi vitae id eius corrupti deleniti.
Some image

This is some more dummy text

Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quaerat vitae dolor, atque, excepturi numquam cumque ut iusto, odio perferendis cum rem saepe eveniet voluptatum fuga debitis et illo distinctio eligendi!

Hi there, this is empty div with no children :(

Different section

```

Output:
```
💻💻💻 Scraping...

{ links:
   [ { linkText: 'Home', linkUrl: '/index.html' },
     { linkText: 'About', linkUrl: '/about.html' },
     { linkText: 'Contact', linkUrl: '/contact.html' },
     { linkText: 'Blogs', linkUrl: '/blog.html' } ],
  requestTime: 64,
  title: 'Hello World',
  url: 'http://localhost:4000' }

🥳🥳🥳 Done...
```
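
To make the shape of that output concrete, here is a minimal sketch of how a page's title and links could be turned into such an object. It is not the repository's implementation: `extractPageInfo` is a hypothetical name, the regex-based matching is a deliberate simplification (a real crawler would normally use an HTML parser), and `requestTime` is omitted because no HTTP request is made here.

```javascript
// Hypothetical, simplified extraction of the page title and anchor tags.
function extractPageInfo(html, url) {
  const titleMatch = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  const anchorPattern =
    /<a\b[^>]*\bhref\s*=\s*["']([^"']*)["'][^>]*>([\s\S]*?)<\/a>/gi;

  const links = [];
  let match;
  while ((match = anchorPattern.exec(html)) !== null) {
    links.push({
      // Drop any nested tags inside the anchor and trim whitespace.
      linkText: match[2].replace(/<[^>]*>/g, '').trim(),
      linkUrl: match[1],
    });
  }

  return {
    links,
    title: titleMatch ? titleMatch[1].trim() : '',
    url,
  };
}

// Assumed sample markup, modelled on the link list in the output above.
const sampleHtml = `
  <title>Hello World</title>
  <a href="/index.html">Home</a>
  <a href="/about.html">About</a>
  <a href="/contact.html">Contact</a>
  <a href="/blog.html">Blogs</a>
`;

const info = extractPageInfo(sampleHtml, 'http://localhost:4000');
console.log(info);
```

The `linkUrl` values are kept exactly as written in the markup, matching the relative paths shown in the output.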