Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/charlesroelli/org-board

Org mode's web archiver.
https://github.com/charlesroelli/org-board

Last synced: 3 months ago
JSON representation

Org mode's web archiver.

Awesome Lists containing this project

README

        

org-board
=========

Last updated: Wed 30 May 2018 20:06:55 CEST

* Motivation

org-board is a bookmarking and web archival system for Emacs Org
mode, building on ideas from Pinboard . It
archives your bookmarks so that you can access them even when
you're not online, or when the site hosting them goes down.
`wget' is used as a backend for archival, so any of its options
can be used directly from org-board. This means you can download
whole sites for archival with a couple of keystrokes, while
keeping track of your archives from a simple Org file.

* Summary

In org-board, a bookmark is represented by an Org heading of any
level, with a `URL' property containing one or more URLs. Once
such a heading is created, a call to `org-board-archive' creates a
unique ID and directory for the entry via `org-attach', archives
the contents and requisites of the page(s) listed in the `URL'
property using `wget', and saves them inside the entry's
directory. A link to the (timestamped) root archive folder is
created in the property `ARCHIVED_AT'. Multiple archives can be
made for each entry. Additional options to pass to `wget' can be
specified via the property `WGET_OPTIONS'. The variable
`org-board-after-archive-functions' (defaulting to nil) holds a
list of functions to run after each archival operation.

* User commands

`org-board-archive' archives the current entry, creating a unique
ID and directory via `org-attach' if necessary.

`org-board-archive-dry-run' shows the `wget' invocation that will
run for this entry in the echo area.

`org-board-new' prompts for a URL to add to the current entry's
properties, then archives the entry immediately.

`org-board-delete-all' deletes all the archives for this entry by
deleting the `org-attach' directory.

`org-board-open' opens the bookmark at point in a browser.
Default to the built-in browser, `eww', and with prefix, the
native operating system browser.

`org-board-diff' uses `zdiff' (if available) or `ediff' to
recursively diff two archives of the same entry.

`org-board-diff3' uses `ediff' to recursively diff three archives
of the same entry.

`org-board-cancel' cancels the current org-board archival process.

`org-board-run-after-archive-function' prompts for a function and
an archive in the current entry, and applies the function to the
archive.

These are all bound in the `org-board-keymap' variable (not bound
to any key by default).

* Customizable options

`org-board-wget-program' is the path to the wget program.

`org-board-wget-switches' are the command line options to use with
`wget'. By default these are included as:

"-e robots=off" ignores robots.txt files.
"--page-requisites" downloads all page requisites (CSS, images).
"--adjust-extension" add a ".html" extension where needed.
"--convert-links" convert external links to internal.

`org-board-agent-header-alist' is an alist mapping agent names to
their respective header/user-agent arguments. Set a
`WGET_OPTIONS' property to a key of this alist (say,
`Mac-OS-10.8') and org-board will replace the key with its
corresponding value before calling wget. This is useful for some
sites that refuse to serve pages to `wget'.

`org-board-wget-show-buffer' controls whether the archival process
buffer is shown in a window (defaults to true).

`org-board-log-wget-invocation' controls whether to log the
archival process command in the root of the archival directory
(defaults to true).

`org-board-domain-regexp-alist' applies certain options when a
domain matches a regular expression. See the docstring for
details. As an example, this is used to make sure that `wget'
does not send a User Agent string when archiving from Google
Cache, which will not normally serve pages to it.

`org-board-after-archive-functions' (default nil) holds a list of
functions to run after an archival takes place. This is intended
for user extensions to `org-board'. The functions receive three
arguments: a list of URLs downloaded, the folder name where they
were downloaded and the process filter event string (see the Elisp
manual for details on the possible values of this string). For an
example use of `org-board-after-archive-functions', see the
"Example usage" section below.

* Known limitations

Options like "--header: 'Agent X" cannot be specified as
properties, because the property API splits on spaces, and such an
option has to be passed to `wget' as one argument. To work around
this, add these types of options to `org-board-agent-header-alist'
instead, where the property API is not involved.

At the moment, only one archive can be done at a time.

* Example usage

** Archiving

I recently found a list of articles on linkers that I wanted to
bookmark and keep locally for offline reading. In a dedicated org
file for bookmarks I created this entry:

** TODO Linkers (20-part series)
:PROPERTIES:
:URL: http://a3f.at/lists/linkers
:WGET_OPTIONS: --recursive -l 1 --span-hosts
:END:

Where the `URL' property is a page that already lists the URLs
that I wanted to download. I specified the recursive property for
`wget' along with a depth of 1 ("-l 1") so that each linked page
would be downloaded. With point inside the entry, I run "M-x
org-board-archive". An `org-attach' directory is created and
`wget' starts downloading the pages to it. Afterwards, the end
the entry looks like this:

** TODO Linkers (20-part series)
:PROPERTIES:
:URL: http://a3f.at/lists/linkers
:WGET_OPTIONS: --recursive -l 1 --span-hosts
:ID: D3BCE79F-C465-45D5-847E-7733684B9812
:ARCHIVED_AT: [2016-08-30-Tue-15-03-56]
:END:

The value in the `ARCHIVED_AT' property is a link that points to
the root of the timestamped archival directory. The ID property
was automatically generated by `org-attach'.

** Diffing

You can diff between two archives done for the same entry using
`org-board-diff', so you can see how a page has changed over time.
The diff recurses through the directory structure of an archive
and will highlight any changes that have been made. `ediff' is
used if `zdiff' is not available (both are capable of recursing
through a directory structure, but `zdiff' is possibly more
intuitive to use). `org-board-diff3' also offers diffing between
three different archive directories.

** `org-board-after-archive-functions'

`org-board-after-archive-functions' is a list of functions run
after an archive is finished. You can use it to do anything you
like with newly archived pages. For example, you could add a
function that copies the new archive to an external hard disk, or
opens the archived page in your browser as soon as it is done
downloading. You could also, for instance, copy all of the media
files that were downloaded to your own media folder, and pop up a
Dired buffer inside that folder to give you the chance to
organize them.

Here is an example function that copies the archived page to an
external service called `IPFS' , a decentralized
versioning and storage system geared towards web content (thanks
to Alan Schmitt):

(defun org-board-add-to-ipfs (urls output-folder event &rest _rest)
"Add the downloaded site to IPFS."
(unless (string-match "exited abnormally" event)
(let* ((parsed-url (url-generic-parse-url (car urls)))
(domain (url-host parsed-url))
(path (url-filename parsed-url))
(output (shell-command-to-string
(concat "ipfs add -r "
(concat output-folder domain))))
(ipref
(nth 1 (split-string
(car (last (split-string output "\n" t))) " "))))
(with-current-buffer (get-buffer-create "*org-board-post-archive*")
(princ (format "your file is at %s\n"
(concat "http://localhost:8080/ipfs/" ipref path))
(current-buffer))))))

(eval-after-load "org-board"
'(add-hook 'org-board-after-archive-functions 'org-board-add-to-ipfs))

Note that for forward compatibility, it's best to add to a final
`&rest' argument to every function listed in
`org-board-after-archive-functions', since a future update may
provide each function with additional arguments (like a marker
pointing to a buffer position where the archive was initiated, for
example).

For more information on `org-board-after-archive-functions', see
its docstring and the docstring of
`org-board-test-after-archive-function'.

You can also interactively run an after-archive function with the
command `org-board-run-after-archive-function'. See its docstring
for details.

* Getting started

** Installation

There are two ways to install the package. One way is to clone
this repository and add the directory to your load-path manually.

(add-to-list 'load-path "/path/to/org-board")
(require 'org-board)

Alternatively, you can download the package directly from Emacs
using MELPA . M-x
package-install RET org-board RET will take care of it.

** Keybindings

The following keymap is defined in `org-board-keymap':

| Key | Command |
| a | org-board-archive |
| r | org-board-archive-dry-run |
| n | org-board-new |
| k | org-board-delete-all |
| o | org-board-open |
| d | org-board-diff |
| 3 | org-board-diff3 |
| c | org-board-cancel |
| x | org-board-run-after-archive-function |
| O | org-attach-reveal-in-emacs |
| ? | Show help for this keymap. |

To install the keymap give it a prefix key, e.g.:

(global-set-key (kbd "C-c o") org-board-keymap)

Then typing `C-c o a' would run `org-board-archive', for example.

* Miscellaneous

The location of `wget' should be picked up automatically from the
`PATH' environment variable. If it is not, then the variable
`org-board-wget-program' can be customized.

Other options are already set so that archiving bookmarks is done
pretty much automatically. With no `WGET_OPTIONS' specified, by
default `org-board-archive' will just download the page and its
requisites (images and CSS), and nothing else.

** Support for org-capture from Firefox (thanks to Alan Schmitt):

On the Firefox side, install org-capture from here:

http://chadok.info/firefox-org-capture/

Alternatively, you can do it manually by following the
instructions here:

http://weblog.zamazal.org/org-mode-firefox/
(in the “The advanced way” section)

When org-capture is installed, add `(require 'org-protocol)' to
your init file (`~/.emacs').

Then create a capture template like this:

(setq org-board-capture-file "my-org-board.org")

(setq org-capture-templates
`(...
("c" "capture through org protocol" entry
(file+headline ,org-board-capture-file "Unsorted")
"* %?%:description\n:PROPERTIES:\n:URL: %:link\n:END:\n\n Added %U")
...))

And add a hook to `org-capture-before-finalize-hook':

(defun do-org-board-dl-hook ()
(when (equal (buffer-name)
(concat "CAPTURE-" org-board-capture-file))
(org-board-archive)))

(add-hook 'org-capture-before-finalize-hook 'do-org-board-dl-hook)

* Acknowledgements

Thanks to Alan Schmitt for the code to combine `org-board' and
`org-capture', and for the example function used in the
documentation of `org-board-after-archive-functions' above.