# Link Scraper

![CI_NPM_PUBLISH](https://github.com/sujeetkrjaiswal/link-scraper/workflows/CI_NPM_PUBLISH/badge.svg?branch=master)

A command-line utility to fetch links from a given seed URL. It can also recursively fetch links up to a given depth.

This utility provides an interactive command-line user interface as well as plain command-line options.

## Usage

### Command line options

```text
-u, --url          Seed URL
-w, --whitelisted  Whitelisted URLs (comma-separated)
-o, --outFile      Output file name
-e, --extension    Output file extension(s), e.g. tsv,md
-d, --depth        Depth limit to recursively scrape
-q, --query        Consider Query Params for URL Uniqueness
-h, --hash         Consider Hash Params for URL Uniqueness
-s, --secure       Scrape only secured URLs (https: only)
--no-hash          Ignore Hash Params for URL Uniqueness
--no-query         Ignore Query Params for URL Uniqueness
--no-secure        Allow scraping Non secure URLs (http: & https:)
--help             Display help for command
```
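
The same list can be printed from an installed copy of the tool (assuming `link-scraper` is available on the `PATH`; installation itself is not covered in this section):

```bash
# print the full option list
link-scraper --help
```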

### Using CLI Interactive Questions

![Medium.com sample Log](assets/medium.com-sample.png)
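
Presumably (an assumption based on the screenshot above and Example 2 below), running the tool with no options starts the fully interactive flow and prompts for every field:

```bash
# start the interactive questions with nothing pre-set (assumed behaviour)
link-scraper
```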

### Example 1
Scrape `Medium.com` to a depth of `2`, following only `secure` URLs (`https:`) on the whitelisted domains `medium.com` and `help.medium.com`,
and save the output inside the `data` folder with the file name `medium-links` in both `md` and `tsv` formats.

For URL uniqueness, consider the `query` params and ignore the `hash` params.
```bash
link-scraper -u https://medium.com/ -w https://medium.com,https://help.medium.com -o data/medium-links -e tsv,md -d 2 -qs --no-hash
```
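
With these flags, URLs that differ only in their hash fragment (e.g. `https://medium.com/story` and `https://medium.com/story#responses`) should be treated as the same URL, while URLs that differ in their query params remain distinct. The output would presumably be written as `data/medium-links.tsv` and `data/medium-links.md`; the exact file naming is an assumption based on the `-o` and `-e` options.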

### Example 2

Partially initialize the application from the command line and supply the remaining fields through the interactive interface.

Via the command-line options: set the depth to `2` and the extension to `tsv`, scrape only `https:` URLs, and for the uniqueness test
consider the `query` params and ignore the `hash` params.

The seed URL, whitelisted domains, and output file path will then be entered through the interactive prompts.

```bash
link-scraper -qs --no-hash -d 2 -e tsv
```
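
For comparison, a fully non-interactive run with the same settings might look like the following; the seed URL, whitelist, and output path here are placeholders rather than values from the README:

```bash
# hypothetical fully non-interactive equivalent: every field supplied via flags
link-scraper -u https://example.com/ -w https://example.com -o data/example-links -e tsv -d 2 -qs --no-hash
```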