Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/akirak/orgabilize.el

(WIP) Convert web pages to readable Org
https://github.com/akirak/orgabilize.el

Last synced: 7 days ago
JSON representation

(WIP) Convert web pages to readable Org

Awesome Lists containing this project

README

        

* orgabilize.el
Orgabilize (org + readable + -ize) is an Org-centric web page archiver and helper for Emacs.

It provides the following functionalities:

- Insert a link to a URL with its title.
- Insert the table of contents of a URL.
- Convert a web page to an Org file in an archive directory so you can read/quote/annotate it.
- View the source of a URL.

Upstream sources are stored (cached) locally, so operations are fast.

It uses [[https://gitlab.com/gardenappl/readability-cli][readability-cli]] to preprocess HTML.
It is an alternative to [[https://github.com/alphapapa/org-web-tools][org-web-tools]] which uses pandoc, but =orgabilize= focuses on producing highly readable Org output from modern web pages.

On the other hand, *orgabilize is less reliable than org-web-tools*, because it basically assumes HTML markup is valid — It fails to process some web pages marked up invalidly.
It also cannot read the contents of client-rendered web pages, but it can still read titles and metadata from them.
See [[#limitations][Limitations]] for details.
** Installation
This package is not available from any package registry yet.
Install the package from source using =package-vc= (built-in since Emacs 29) or its alternative.

You also have to install [[https://gitlab.com/gardenappl/readability-cli][readability-cli]].
I don't provide an instruction for installing the program.
** Configuration
Set =orgabilize-executable= to the name of the executable of readability-cli.
** Usage
:PROPERTIES:
:CREATED_TIME: [2021-04-11 Sun 13:14]
:END:
*** Inserting a link
To insert a link to a web page with its title into an Org buffer, use =orgabilize-insert-org-link=.
*** Inserting a table of contents into an Org buffer
To insert a list of headings in a web page, use =orgabilize-insert-org-toc=.

To insert a list of links in a =nav= element from a web site, use =orgabilize-insert-toc-from-nav=.
First type a URL, and then select a nav element with its =aria-label= attribute.
This is typically useful for inserting the table of contents of an entire website.
*** Archiving a web page to an Org file in a directory
To convert a web page to Org, use =orgabilize-org-archive= command.
The buffer is saved to a file in =orgabilize-org-archive-directory=.
** Limitations
:PROPERTIES:
:CUSTOM_ID: limitations
:END:
At present, it fails to process many web pages.
This may improve in the future as tests are added and bugs are fixed, but it is not going to be perfect, because how this package works.

Another issue is the it cannot handle web pages that are supposed to be rendered in a browser (e.g. single-page applications).
For these pages, you can use =orgabilize-save-file-as-url= command to save an offline version as a URL.
A more thorough solution would be to integrate with a headless browser, but I don't have an implementation plan for it yet.
** Todo
- [ ] Improve the Org conversion
- [ ] Add tests
- [ ] Generate image links
- [ ] Handle more invalid web pages
- [ ] Integrate with a headless browser to handle client-rendered web pages