https://github.com/mehdieidi/offliner

Offliner is a tool to make a website offline viewable. It's a concurrent web crawler which saves all the pages and static files in a directory.
https://github.com/mehdieidi/offliner

concurrency concurrent concurrent-programming crawler go golang goroutine multiprocessing multithreading process scraper thread

Last synced: 5 months ago
JSON representation

Offliner is a tool to make a website offline viewable. It's a concurrent web crawler which saves all the pages and static files in a directory.

Host: GitHub
URL: https://github.com/mehdieidi/offliner
Owner: mehdieidi
License: gpl-3.0
Created: 2021-12-04T19:43:02.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2022-01-07T12:42:28.000Z (over 4 years ago)
Last Synced: 2024-06-20T12:40:45.002Z (about 2 years ago)
Topics: concurrency, concurrent, concurrent-programming, crawler, go, golang, goroutine, multiprocessing, multithreading, process, scraper, thread
Language: Go
Homepage:
Size: 22.1 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # offliner

Offliner is a tool to make a website offline viewable. It's a concurrent web crawler which crawls a website and saves all the pages and static files in a directory.


It can use both, multi-processing & multi-threading as its concurrency model.

![offliner](screenshot.png)

## Features

* Serial scraping.

* Multi-threaded scraping.

* Multi-process scraping.

* Save static files (css, js, img).

* Edit the links on the pages to reference the local files.

## Usage

You need to provide a full URL to start the scraping. You can use the defined flags to control the features. If you intend to use the multi-process form, the "process" program must exist in the same directory as the "offliner" program.

```

-h     show help.

-url   full URL of the start page.

-f     save static files too. It also edits the pages so the links reference the local files.

-a     use multi-processing instead of multi-threading as the concurrency model.

-n     maximum number of the pages to be saved. (default is 100)

-p     maximum number of the execution units (goroutines or processes) to run at the same time. (default is 50)

-s     run the scraper in a non-concurrent (serial) fashion.

```

## Examples

Multi-threaded scraping. Save max 100 pages using max 90 goroutines. Save static files too.

```

./offliner -url=https://urmia.ac.ir -n=100 -p=90 -f

```

Multi-process scraping. Save max 100 pages using max 50 processes.

```

./offliner -url=https://urmia.ac.ir -n=100 -p=50 -a

```

Serial scraping. Save max 100 pages. Save static files too.

```

./offliner -url=https://urmia.ac.ir -n=100 -s -f

```

## Todo

* Improve multi-processing design.

* Add a logger.

* Make the scraper a separate package (library).

## License

GNU General Public License v3.0

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/mehdieidi/offliner

Awesome Lists containing this project

README