Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/shivamsaraswat/webxcrawler
WebXCrawler is a fast static crawler to crawl a website and get all the links.
- Host: GitHub
- URL: https://github.com/shivamsaraswat/webxcrawler
- Owner: shivamsaraswat
- License: MIT
- Created: 2023-01-16T13:51:34.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-08-18T07:16:58.000Z (5 months ago)
- Last Synced: 2024-08-18T08:29:25.289Z (5 months ago)
- Topics: crawler, crawling, python, scraping, webcrawler, webxcrawler
- Language: Python
- Homepage:
- Size: 139 KB
- Stars: 0
- Watchers: 1
- Forks: 2
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE.md
Awesome Lists containing this project
README
# WebXCrawler
WebXCrawler is a fast static crawler to crawl a website and get all the links.
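The idea behind a static link crawler is simple: fetch a page, collect every `<a href>` on it, and repeat up to a fixed depth. The following is a minimal sketch of that idea in plain Python; it is only an illustration, not WebXCrawler's actual implementation:

```python
# Illustrative sketch of a static link crawler, NOT WebXCrawler's code:
# fetch pages up to a fixed depth and collect every <a href> found.
from urllib.parse import urljoin
from urllib.request import urlopen
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(url, depth=2, seen=None):
    """Recursively visit pages, returning the set of URLs discovered."""
    seen = set() if seen is None else seen
    if depth == 0 or url in seen:
        return seen
    seen.add(url)
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
    except OSError:
        return seen  # unreachable page: skip it
    parser = LinkParser()
    parser.feed(html)
    for href in parser.links:
        link = urljoin(url, href)  # resolve relative links
        if link.startswith(("http://", "https://")):  # skip mailto:, javascript:, etc.
            crawl(link, depth - 1, seen)
    return seen

print("\n".join(sorted(crawl("https://example.com"))))
```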
## Installation Through PIP
To install dependencies, use the following command:
```bash
pip3 install -r requirements.txt
```

## Installation Through Poetry
This package is built with [Poetry](https://python-poetry.org/). To set up the virtual environment and install dependencies, run the following (use `sudo` if you run into permission errors):
```bash
poetry install
```
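Once the dependencies are installed, the crawler can be run inside Poetry's virtual environment. Assuming the same entry point as the examples later in this README, an invocation would look like this:

```bash
poetry run python3 webxcrawler -u https://example.com
```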
## Installation with Docker

This tool can also be used with [Docker](https://www.docker.com/). To build the Docker image, run the following (use `sudo` if you run into permission errors):
```bash
docker build -t webxcrawler:latest .
```
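After the build, a quick sanity check is to ask the container for its help text. This assumes the image's entrypoint is the crawler itself, as the run example further below suggests:

```bash
docker run --rm webxcrawler:latest -h
```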
## Using WebXCrawler

To run WebXCrawler on a website, use the `-u` flag and pass the URL as its argument:
```bash
python3 webxcrawler -u URL
```

For an overview of all options, use the following command:
```bash
python3 webxcrawler -h
```

The output below shows the currently supported options.
```bash
██╗ ██╗███████╗██████╗ ██╗ ██╗ ██████╗██████╗ █████╗ ██╗ ██╗██╗ ███████╗██████╗
██║ ██║██╔════╝██╔══██╗╚██╗██╔╝██╔════╝██╔══██╗██╔══██╗██║ ██║██║ ██╔════╝██╔══██╗
██║ █╗ ██║█████╗ ██████╔╝ ╚███╔╝ ██║ ██████╔╝███████║██║ █╗ ██║██║ █████╗ ██████╔╝
██║███╗██║██╔══╝ ██╔══██╗ ██╔██╗ ██║ ██╔══██╗██╔══██║██║███╗██║██║ ██╔══╝ ██╔══██╗
╚███╔███╔╝███████╗██████╔╝██╔╝ ██╗╚██████╗██║ ██║██║ ██║╚███╔███╔╝███████╗███████╗██║ ██║
╚══╝╚══╝ ╚══════╝╚═════╝ ╚═╝ ╚═╝ ╚═════╝╚═╝ ╚═╝╚═╝ ╚═╝ ╚══╝╚══╝ ╚══════╝╚══════╝╚═╝ ╚═╝
Coded with Love by Shivam Saraswat (@cybersapien)

usage: python3 webxcrawler [-h] -u URL [-d int] [-t int] [-o file_path] [-V]

WebXCrawler is a fast static crawler to crawl a website and get all the links.

options:
  -h, --help            show this help message and exit
  -u URL, --url URL     URL to crawl
  -d int, --depth int   maximum depth to crawl (default 2)
  -t int, --threads int
                        number of threads to use (default 2)
  -o file_path, --output file_path
                        file to write output to
  -V, --version         show webxcrawler version number and exit

Example: python3 webxcrawler -u https://example.com -d 2 -t 10 -o /tmp/tofile
```

## Using the Docker Container
A typical run through Docker would look as follows:
```bash
docker run -it --rm webxcrawler -u URL
```

**NOTE:** Also check out the **[Golang version of WebXCrawler](https://github.com/shivamsaraswat/webxcrawler_go)**.
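**Tip:** Since `--rm` discards the container's filesystem when the run finishes, saving results with `-o` typically needs a bind mount. Something along these lines should work; the `/output` path inside the container is an assumption, not part of the image:

```bash
docker run -it --rm -v "$PWD:/output" webxcrawler -u https://example.com -o /output/links.txt
```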