{"id":13570796,"url":"https://github.com/aurelg/linkbak","last_synced_at":"2025-04-04T07:32:06.531Z","repository":{"id":43367809,"uuid":"146775021","full_name":"aurelg/linkbak","owner":"aurelg","description":"linkbak is a web page archiver : it reads a list of links and dumps the corresponding pages in HTML and PDF.","archived":false,"fork":false,"pushed_at":"2022-12-08T01:13:56.000Z","size":96,"stargazers_count":14,"open_issues_count":2,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-11-05T03:36:49.785Z","etag":null,"topics":["archive","backup","crawler","html","pdf","python3"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aurelg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-08-30T16:11:12.000Z","updated_at":"2024-02-29T15:09:58.000Z","dependencies_parsed_at":"2023-01-24T02:20:26.161Z","dependency_job_id":null,"html_url":"https://github.com/aurelg/linkbak","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aurelg%2Flinkbak","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aurelg%2Flinkbak/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aurelg%2Flinkbak/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aurelg%2Flinkbak/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aurelg","download_url":"https://codeload.github.com/aurelg/linkbak/tar.gz/refs/heads/master","host":{"n
ame":"GitHub","url":"https://github.com","kind":"github","repositories_count":247138959,"owners_count":20890139,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["archive","backup","crawler","html","pdf","python3"],"created_at":"2024-08-01T14:00:55.041Z","updated_at":"2025-04-04T07:32:01.523Z","avatar_url":"https://github.com/aurelg.png","language":"JavaScript","funding_links":[],"categories":["JavaScript"],"sub_categories":[],"readme":"# What is `linkbak`\n\n`linkbak` is a web page archiver: it reads a list of links and dumps the\ncorresponding pages in HTML and PDF. It is somewhat similar to\n[bookmark-archiver](https://github.com/pirate/bookmark-archiver), but lighter\n(no UI) and faster.\n\nThe HTML content is extracted with Python's `requests`/`readability`; PDFs are\ngenerated with `chromium` in `headless` mode. For even better readability,\nthe DOM (extracted by `chromium`, again in `headless` mode) is parsed by\n[Mozilla's readability](https://github.com/mozilla/readability) and processed by\n[Pandoc](https://pandoc.org) to produce MOBI, EPUB, Markdown and a cleaner PDF\noutput.\n\nMoreover, links can be processed in parallel. Previously failed attempts can be\neither ignored or retried, and a custom timeout is supported.\n\n## Input\n\n- Atom (URL or local)\n- RSS (URL or local)\n- HTML (local)\n- text file containing a list of URLs (one per line)\n\n## Output\n\nPages (HTML/PDF) are stored in output directories identified by the sha256 of\nthe links to avoid collisions. 
An additional JSON index is also written to keep\ntrack of which links are stored in which directory.\n\nDownloaded files can be browsed with your browser:\n\n- start Python's built-in web server: `cd output \u0026\u0026 python -m http.server`\n- open your browser at `http://localhost:8000`\n\n# Installation\n\nThe easy way, with Docker:\n\n- Retrieve from Docker Hub: `docker pull aurelg/linkbak`\n- Or create your image locally: `git clone https://github.com/aurelg/linkbak.git \u0026\u0026 docker build -t linkbak linkbak/`\n\nIf you want to install it manually, just clone this repository and make sure you\nhave the following dependencies installed:\n\n- `chromium` (or `google-chrome`)\n- `texlive`\n- `pandoc`\n- `nodejs` (and a few packages that can be installed with `npm install ...`: `fs`, `jsdom` and `https://github.com/mozilla/readability`)\n\n# Example\n\nExample: `lnk2bak.py -v -j10 https://github.com/shaarli/Shaarli/releases.atom`\n\nOr with Docker:\n\n```\ndocker run \\\n  -v $(pwd):/workdir \\\n  -u $(id -u):$(id -g) \\\n  --rm -ti linkbak \\\n  /linkbak/src/linkbak/lnk2bak.py -j1 -vvv links.txt\n```\n\nYou may want to define an alias like:\n\n`alias linkbak='docker run -v $(pwd):/workdir -u $(id -u):$(id -g) --rm -ti aurelg/linkbak /linkbak/src/linkbak/lnk2bak.py'`\n\nThe first example above downloads HTML and generates PDFs for each of the links found in\nthe Shaarli Atom feed on GitHub, allowing up to 10 downloads in parallel.\n\nOutput:\n\n```\n.\n├── 394a30c14c9f36....\n│   ├── index.html\n│   ├── metadata.json\n│   └── output.pdf\n├── 4357bbfb8b7788....\n│   ├── index.html\n│   ├── metadata.json\n│   └── output.pdf\n├── 51ec955a6fe728....\n│   ├── index.html\n│   ├── metadata.json\n│   └── output.pdf\n...\n\n10 directories, 31 files\n```\n\nIf the HTML, metadata or PDF cannot be retrieved, an error message is written to\na log file named `{index.html,metadata.json,output.pdf}.log`, respectively.\n\nIn each link directory, a `metadata.json` file containing 
the `sha256` and the\nURL is written:\n\n```\n{\n \"id\": \"394a30c14c9f36830d77dca945ed6d558ea3ede08b9009bbffa3b6e92dc68f30\",\n \"link\": \"https://github.com/shaarli/Shaarli/releases/tag/v0.9.6\"\n}\n```\n\nAll these `metadata.json` files are merged into `results.json` once all\nlinks are processed:\n\n```\n[\n {\n  \"id\": \"51ec955a6fe728451be9c8ae654f1012e376e77ae45ad8235ef9dd67b3f016d8\",\n  \"link\": \"https://github.com/shaarli/Shaarli/releases/tag/v0.8.7\"\n },\n {\n  \"id\": \"ea2cf19731ad7a1378e6d7d1b4dc84c65ee8808328db98dd80cc17cce6728bb3\",\n  \"link\": \"https://github.com/shaarli/Shaarli/releases/tag/v0.9.3\"\n },\n {\n  \"id\": \"394a30c14c9f36830d77dca945ed6d558ea3ede08b9009bbffa3b6e92dc68f30\",\n  \"link\": \"https://github.com/shaarli/Shaarli/releases/tag/v0.9.6\"\n },\n ...\n]\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faurelg%2Flinkbak","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faurelg%2Flinkbak","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faurelg%2Flinkbak/lists"}