https://github.com/aurelg/linkbak

linkbak is a web page archiver: it reads a list of links and dumps the corresponding pages in HTML and PDF.

Host: GitHub
URL: https://github.com/aurelg/linkbak
Owner: aurelg
License: gpl-3.0
Created: 2018-08-30T16:11:12.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2022-12-08T01:13:56.000Z (about 2 years ago)
Last Synced: 2024-08-02T14:08:25.782Z (6 months ago)
Topics: archive, backup, crawler, html, pdf, python3
Language: JavaScript
Homepage:
Size: 93.8 KB
Stars: 14
Watchers: 2
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE.md

# What is `linkbak`

`linkbak` is a web page archiver: it reads a list of links and dumps the
corresponding pages in HTML and PDF. It is somewhat similar to
[bookmark-archiver](https://github.com/pirate/bookmark-archiver), but lighter
(no UI) and faster.

The HTML content is extracted with Python's `requests`/`readability`, and PDFs
are generated with `chromium` in `headless` mode. For even better readability,
the DOM (extracted by `chromium`, again in `headless` mode) is parsed by
[Mozilla's readability](https://github.com/mozilla/readability) and processed by
[Pandoc](https://pandoc.org) to produce MOBI, EPUB, Markdown and a cleaner PDF
output.

Moreover, links can be processed in parallel. Previously failed attempts can be
either ignored or retried, and a custom timeout is supported.
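
As an illustration of the parallel processing described above, here is a minimal
sketch using Python's `concurrent.futures`. It is not taken from the `linkbak`
source: `archive_all`, `fake_fetch`, and the `jobs`/`timeout` parameters are
hypothetical names chosen for this example, and the real tool may be organized
differently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def archive_all(links, fetch, jobs=10, timeout=30.0):
    """Run fetch(link, timeout) for each link, at most `jobs` at a time.

    Returns a dict mapping each link to True on success, False on failure.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = {pool.submit(fetch, link, timeout): link for link in links}
        for future in as_completed(futures):
            link = futures[future]
            try:
                future.result()
                results[link] = True
            except Exception:
                # Record the failure so the link can be retried or skipped later.
                results[link] = False
    return results


def fake_fetch(link, timeout):
    # Hypothetical stand-in for a real download; fails for one example URL.
    if "broken" in link:
        raise TimeoutError(f"{link} did not answer within {timeout}s")


status = archive_all(
    ["https://example.org/a", "https://example.org/broken"], fake_fetch, jobs=2
)
print(status)
```

The returned mapping is what a later run could consult to decide whether a
failed link should be retried or ignored.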

## Input

- Atom (URL or local)
- RSS (URL or local)
- HTML (local)
- text file containing a list of URLs (one per line)
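
To make two of these input formats concrete, the sketch below extracts links
from a plain-text list and from an Atom feed using only the Python standard
library. The function names and the sample feed are assumptions made for this
example; `linkbak` itself may parse its inputs differently.

```python
import xml.etree.ElementTree as ET

# XML namespace used by Atom feeds.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def links_from_text(text):
    """Return the non-empty lines of a plain-text link list."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def links_from_atom(xml_text):
    """Return the href of the first <link> element of each Atom entry."""
    root = ET.fromstring(xml_text)
    links = []
    for entry in root.iter(f"{ATOM_NS}entry"):
        link = entry.find(f"{ATOM_NS}link")
        if link is not None and link.get("href"):
            links.append(link.get("href"))
    return links


# Hypothetical feed, used only to exercise the parser.
sample_feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><link href="https://example.org/one"/></entry>
  <entry><link href="https://example.org/two"/></entry>
</feed>"""

print(links_from_atom(sample_feed))
print(links_from_text("https://example.org/a\n\nhttps://example.org/b\n"))
```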

## Output

Pages (HTML/PDF) are stored in output directories named after the SHA-256 hash
of each link to avoid collisions. An additional JSON index is also written to
keep track of which links are stored in which directory.
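
The directory naming can be sketched as follows. This assumes the link is hashed
as a UTF-8 string with no normalization, which the text above does not specify,
so the digests produced here are not guaranteed to match the ones `linkbak`
writes. `link_directory` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def link_directory(output_root, link):
    """Return output_root/<SHA-256 hex digest of the link>."""
    # Assumption: the raw URL string is hashed as UTF-8 bytes.
    digest = hashlib.sha256(link.encode("utf-8")).hexdigest()
    return Path(output_root) / digest


target = link_directory("output", "https://example.org/page")
print(target.name)  # a 64-character hexadecimal directory name
```

Because the name depends only on the link, processing the same link twice maps
to the same directory.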

Downloaded files can be viewed in your browser:

- start Python's built-in web server: `cd output && python -m http.server`
- open your browser at `http://localhost:8000`

# Installation

The easy way, with Docker:

- Retrieve the image from Docker Hub: `docker pull aurelg/linkbak`
- Or build the image locally: `git clone https://github.com/aurelg/linkbak.git && docker build -t linkbak linkbak/`

If you want to install it manually, just clone this repository and make sure you
have the following dependencies installed:

- `chromium` (or `google-chrome`)
- `texlive`
- `pandoc`
- `nodejs` (and a few packages that can be installed with `npm install ...`: `fs`, `jsdom` and `https://github.com/mozilla/readability`)

# Example

Example: `lnk2bak.py -v -j10 https://github.com/shaarli/Shaarli/releases.atom`

Or with Docker:

```
docker run \
  -v $(pwd):/workdir \
  -u $(id -u):$(id -g) \
  --rm -ti linkbak \
  /linkbak/src/linkbak/lnk2bak.py -j1 -vvv links.txt
```

You may want to define an alias like:

`alias linkbak='docker run -v $(pwd):/workdir -u $(id -u):$(id -g) --rm -ti aurelg/linkbak /linkbak/src/linkbak/lnk2bak.py'`

The `lnk2bak.py -v -j10 ...` example above downloads the HTML and generates a
PDF for each link found in the Shaarli Atom feed on GitHub, allowing up to 10
downloads in parallel.

Output:

```
.
├── 394a30c14c9f36....
│   ├── index.html
│   ├── metadata.json
│   └── output.pdf
├── 4357bbfb8b7788....
│   ├── index.html
│   ├── metadata.json
│   └── output.pdf
├── 51ec955a6fe728....
│   ├── index.html
│   ├── metadata.json
│   └── output.pdf
...

10 directories, 31 files
```

If the HTML, metadata or PDF cannot be retrieved, an error message is written to
a log file named `{index.html,metadata.json,output.pdf}.log`, respectively.
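
These log files make it possible to list which steps failed for which link. The
sketch below assumes the `.log` files sit in the same per-link directory as the
regular outputs, which the text does not state explicitly; `failed_steps` is a
hypothetical helper, not part of `linkbak`.

```python
from pathlib import Path

# Log file names as described above.
LOG_NAMES = ("index.html.log", "metadata.json.log", "output.pdf.log")


def failed_steps(output_root):
    """Map each link directory name to the list of log files found in it."""
    failures = {}
    for link_dir in sorted(Path(output_root).iterdir()):
        if not link_dir.is_dir():
            continue
        logs = [name for name in LOG_NAMES if (link_dir / name).exists()]
        if logs:
            failures[link_dir.name] = logs
    return failures
```

For example, `failed_steps("output")` would return an empty dict when every
link was archived without errors.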

In each link directory, a `metadata.json` file containing the `sha256` and the
URL is written:

```
{
  "id": "394a30c14c9f36830d77dca945ed6d558ea3ede08b9009bbffa3b6e92dc68f30",
  "link": "https://github.com/shaarli/Shaarli/releases/tag/v0.9.6"
}
```

Once all links are processed, these `metadata.json` files are merged into
`results.json`:

```
[
  {
    "id": "51ec955a6fe728451be9c8ae654f1012e376e77ae45ad8235ef9dd67b3f016d8",
    "link": "https://github.com/shaarli/Shaarli/releases/tag/v0.8.7"
  },
  {
    "id": "ea2cf19731ad7a1378e6d7d1b4dc84c65ee8808328db98dd80cc17cce6728bb3",
    "link": "https://github.com/shaarli/Shaarli/releases/tag/v0.9.3"
  },
  {
    "id": "394a30c14c9f36830d77dca945ed6d558ea3ede08b9009bbffa3b6e92dc68f30",
    "link": "https://github.com/shaarli/Shaarli/releases/tag/v0.9.6"
  },
  ...
]
```
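
The merge step can be approximated with a few lines of Python. This is a sketch
under the assumption that each link directory sits directly under the output
root and holds one `metadata.json`; the function names are hypothetical and the
ordering of entries in the real `results.json` may differ.

```python
import json
from pathlib import Path


def merge_metadata(output_root):
    """Collect every <dir>/metadata.json under output_root into one list."""
    entries = []
    for meta in sorted(Path(output_root).glob("*/metadata.json")):
        with meta.open(encoding="utf-8") as handle:
            entries.append(json.load(handle))
    return entries


def write_results(output_root):
    """Write the merged list to output_root/results.json and return it."""
    entries = merge_metadata(output_root)
    target = Path(output_root) / "results.json"
    target.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return entries
```

Calling `write_results("output")` after a run would rebuild the index from
whatever link directories are present.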