https://github.com/mehdieidi/offliner
Offliner is a tool for making a website viewable offline. It is a concurrent web crawler that saves all pages and static files to a directory.
- Host: GitHub
- URL: https://github.com/mehdieidi/offliner
- Owner: mehdieidi
- License: gpl-3.0
- Created: 2021-12-04T19:43:02.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2022-01-07T12:42:28.000Z (over 4 years ago)
- Last Synced: 2024-06-20T12:40:45.002Z (about 2 years ago)
- Topics: concurrency, concurrent, concurrent-programming, crawler, go, golang, goroutine, multiprocessing, multithreading, process, scraper, thread
- Language: Go
- Homepage:
- Size: 22.1 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# offliner
Offliner is a tool for making a website viewable offline. It is a concurrent web crawler that crawls a website and saves all of its pages and static files to a directory.
It can use either multi-processing or multi-threading (goroutines) as its concurrency model.

## Features
* Serial scraping.
* Multi-threaded scraping.
* Multi-process scraping.
* Save static files (CSS, JS, images).
* Edit the links on the pages to reference the local files.
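
The link-editing feature can be pictured as a mapping from each same-site URL to a path inside the output directory. The following is only a sketch of that idea, not offliner's actual code: the `localPath` helper, the `index.html` convention for directory-style URLs, and the `example.com` URLs are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// localPath maps a link found on a page to a relative path inside the
// output directory. It reports false for links that cannot be parsed or
// that point to another host; those would be left unchanged.
// Hypothetical helper, not taken from offliner.
func localPath(pageURL, href string) (string, bool) {
	page, err := url.Parse(pageURL)
	if err != nil {
		return "", false
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", false
	}
	abs := page.ResolveReference(ref) // resolves relative links against the page
	if abs.Host != page.Host {
		return "", false
	}
	p := path.Clean("/" + abs.Path)
	if p == "/" || strings.HasSuffix(abs.Path, "/") {
		// Assumed convention: directory-style URLs are stored as index.html.
		p = path.Join(p, "index.html")
	}
	return strings.TrimPrefix(p, "/"), true
}

func main() {
	page := "https://example.com/blog/"
	for _, href := range []string{"/css/site.css", "post.html", "https://other.org/x.js"} {
		if p, ok := localPath(page, href); ok {
			fmt.Printf("%s -> %s\n", href, p)
		} else {
			fmt.Printf("%s -> (unchanged)\n", href)
		}
	}
}
```

A real rewriter would also need to handle query strings and fragments, which this sketch ignores.
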
## Usage
Provide the full URL of the start page to begin scraping, and use the flags below to control the features. To use the multi-process mode, the "process" program must be in the same directory as the "offliner" program.
```
-h show help.
-url full URL of the start page.
-f save static files too. It also edits the pages so the links reference the local files.
-a use multi-processing instead of multi-threading as the concurrency model.
-n maximum number of the pages to be saved. (default is 100)
-p maximum number of the execution units (goroutines or processes) to run at the same time. (default is 50)
-s run the scraper in a non-concurrent (serial) fashion.
```
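
For readers who want to see how such flags are typically declared in Go, here is a minimal sketch using the standard `flag` package. Only the flag names and defaults come from the list above; the `config` struct, its field names, and the `parseFlags` function are hypothetical and may differ from what offliner does internally.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// config mirrors the flags listed above. The struct and its field names
// are hypothetical; only the flag names and defaults come from the README.
type config struct {
	URL        string // -url
	SaveStatic bool   // -f
	UseProcess bool   // -a
	MaxPages   int    // -n
	MaxUnits   int    // -p
	Serial     bool   // -s
}

// parseFlags parses args (without the program name) into a config.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("offliner", flag.ContinueOnError)
	fs.StringVar(&c.URL, "url", "", "full URL of the start page")
	fs.BoolVar(&c.SaveStatic, "f", false, "save static files too")
	fs.BoolVar(&c.UseProcess, "a", false, "use multi-processing instead of multi-threading")
	fs.IntVar(&c.MaxPages, "n", 100, "maximum number of pages to save")
	fs.IntVar(&c.MaxUnits, "p", 50, "maximum number of execution units")
	fs.BoolVar(&c.Serial, "s", false, "run serially")
	if err := fs.Parse(args); err != nil {
		return c, err // the flag package returns flag.ErrHelp for -h
	}
	if c.URL == "" {
		return c, errors.New("the -url flag is required")
	}
	return c, nil
}

func main() {
	// Sample arguments matching the first example in this README.
	c, err := parseFlags([]string{"-url=https://urmia.ac.ir", "-n=100", "-p=90", "-f"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", c)
}
```
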
## Examples
Multi-threaded scraping. Save at most 100 pages using at most 90 goroutines. Save static files too.
```
./offliner -url=https://urmia.ac.ir -n=100 -p=90 -f
```
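
The two limits in this example (`-n` pages, `-p` goroutines) can be enforced with a page counter and a buffered channel used as a semaphore. The sketch below shows one common way to do that in Go; it is an assumption about the general technique, not offliner's implementation. The `crawl` function, the `fetch` callback, and the in-memory link graph are hypothetical stand-ins for real HTTP fetching.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// crawl visits pages starting at start. At most maxPages pages are claimed
// (the start page always counts as one) and at most maxWorkers calls to
// fetch run at the same time. fetch returns the links found on a page.
// It returns the claimed pages in sorted order.
func crawl(start string, maxPages, maxWorkers int, fetch func(string) []string) []string {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		visited = map[string]bool{start: true}
	)
	sem := make(chan struct{}, maxWorkers) // counting semaphore

	var visit func(u string)
	visit = func(u string) {
		defer wg.Done()
		sem <- struct{}{} // acquire a slot
		links := fetch(u)
		<-sem // release before spawning children, so no goroutine waits while holding a slot
		for _, l := range links {
			mu.Lock()
			claim := !visited[l] && len(visited) < maxPages
			if claim {
				visited[l] = true
			}
			mu.Unlock()
			if claim {
				wg.Add(1)
				go visit(l)
			}
		}
	}

	wg.Add(1)
	go visit(start)
	wg.Wait()

	pages := make([]string, 0, len(visited))
	for u := range visited {
		pages = append(pages, u)
	}
	sort.Strings(pages)
	return pages
}

func main() {
	// A tiny in-memory "site" standing in for real HTTP requests.
	site := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/c"},
		"/b": {"/c", "/d"},
	}
	fetch := func(u string) []string { return site[u] }
	fmt.Println(crawl("/", 3, 2, fetch))
}
```

Claiming a page under the mutex before fetching it keeps the page limit exact even when many goroutines discover links at once.
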
Multi-process scraping. Save at most 100 pages using at most 50 processes.
```
./offliner -url=https://urmia.ac.ir -n=100 -p=50 -a
```
Serial scraping. Save at most 100 pages. Save static files too.
```
./offliner -url=https://urmia.ac.ir -n=100 -s -f
```
## Todo
* Improve multi-processing design.
* Add a logger.
* Make the scraper a separate package (library).
## License
GNU General Public License v3.0