Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sujeetkrjaiswal/link-scraper
A command-line utility to fetch links from a given seed URL. It can also recursively fetch links up to a given depth.
- Host: GitHub
- URL: https://github.com/sujeetkrjaiswal/link-scraper
- Owner: sujeetkrjaiswal
- License: mit
- Created: 2018-06-02T11:07:17.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2022-06-25T12:35:56.000Z (over 2 years ago)
- Last Synced: 2024-10-01T19:15:06.485Z (about 1 month ago)
- Topics: nodejs, scraper
- Language: TypeScript
- Homepage:
- Size: 1000 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 7
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
Awesome Lists containing this project
README
# Link Scraper
![CI_NPM_PUBLISH](https://github.com/sujeetkrjaiswal/link-scraper/workflows/CI_NPM_PUBLISH/badge.svg?branch=master)
A command-line utility to fetch links from a given seed URL. It can also recursively fetch links up to a given depth.
This utility provides an interactive command-line interface as well as command-line options.
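To make the recursive behaviour concrete, here is a minimal TypeScript sketch of the idea (illustrative only, not the package's actual code: the `crawl` and `normalize` names, the options shape, and the regex-based link extraction are assumptions):

```typescript
// Minimal sketch of a depth-limited, whitelist-filtered link crawl.
// Requires Node 18+ (global fetch). Not the real link-scraper implementation.

interface Options {
  depth: number;       // how many levels to follow beyond the seed (-d)
  whitelist: string[]; // only links starting with one of these prefixes are followed (-w)
  keepQuery: boolean;  // treat ?query as part of URL identity (-q / --no-query)
  keepHash: boolean;   // treat #hash as part of URL identity (-h / --no-hash)
}

// Normalize a link according to the uniqueness options; return null for malformed URLs.
function normalize(raw: string, base: string, opts: Options): string | null {
  try {
    const url = new URL(raw, base);
    if (!opts.keepQuery) url.search = "";
    if (!opts.keepHash) url.hash = "";
    return url.href;
  } catch {
    return null;
  }
}

// Breadth-first traversal: fetch each page in the current level,
// collect new links, and descend until the depth limit is reached.
async function crawl(seed: string, opts: Options): Promise<Set<string>> {
  const seen = new Set<string>([seed]);
  let frontier = [seed];
  for (let level = 0; level < opts.depth && frontier.length > 0; level++) {
    const next: string[] = [];
    for (const page of frontier) {
      let html: string;
      try {
        html = await (await fetch(page)).text();
      } catch {
        continue; // skip pages that fail to load
      }
      // Rough href extraction; a real scraper would use an HTML parser.
      for (const match of html.matchAll(/href="([^"]+)"/g)) {
        const link = normalize(match[1], page, opts);
        if (!link || seen.has(link)) continue;
        seen.add(link);
        if (opts.whitelist.some((prefix) => link.startsWith(prefix))) next.push(link);
      }
    }
    frontier = next;
  }
  return seen;
}

// Roughly what Example 1 below asks the CLI to do.
crawl("https://medium.com/", {
  depth: 2,
  whitelist: ["https://medium.com", "https://help.medium.com"],
  keepQuery: true,
  keepHash: false,
}).then((links) => console.log([...links].join("\n")));
```

The flags listed under Usage map onto these ideas: `-d` sets the depth, `-w` the whitelisted URLs, and `-q`/`-h` (with their `--no-` counterparts) control whether query and hash params count towards URL uniqueness.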
## Usage
### Command line options
```text
-u, --url Seed URL
-w, --whitelisted Whitelisted URLs (comma separated)
-o, --outFile Output file name
-e, --extension Output file extensions (e.g. tsv,md)
-d, --depth depth limit to recursively scrape
-q, --query Consider Query Params for URL Uniqueness
-h, --hash Consider Hash Params for URL Uniqueness
-s, --secure Scrape only secured URLs (https: only)
--no-hash Ignore Hash Params for URL Uniqueness
--no-query Ignore Query Params for URL Uniqueness
--no-secure Allow scraping Non secure URLs (http: & https:)
--help display help for command
```
### Using CLI Interactive Questions
![Medium.com sample Log](assets/medium.com-sample.png)
### Example 1
Scrape `Medium.com` to `depth 2`, keeping only `secure` URLs (`https:` only) and only the whitelisted domains `medium.com` and `help.medium.com`.
Save the output inside the `data` folder with the file name `medium-links` in `md` and `tsv` formats. For URL uniqueness, consider the `query` params and ignore the `hash` params.
```bash
link-scraper -u https://medium.com/ -w https://medium.com,https://help.medium.com -o data/medium-links -e tsv,md -d 2 -qs --no-hash
```
### Example 2
Partially initialize the application from the command line and fill in the remaining fields through the interactive interface.
For the command-line options: set the depth to `2` and the extension to `tsv`, consider only `https:` URLs, and for the uniqueness test
consider the `query` params and ignore the `hash` params. The seed URL, whitelisted domains, and output file path will be entered through the interactive interface.
```bash
link-scraper -qs --no-hash -d 2 -e tsv
```