https://github.com/sophicshift/org-mode-hs

Libraries and tool for parsing Org Mode documents with customizable exporters. 🦄

#+title: org-mode-hs

This repository provides a parser and exporters for Org Mode documents. The Org document is parsed into an AST similar to =org-element='s, and the exporters are highly configurable using HTML, Markdown or LaTeX templates.

* Table of contents :TOC:
- [[#org-cli-horg][org-cli (horg)]]
- [[#usage][Usage]]
- [[#installation-from-source][Installation from source]]
- [[#customizing-templates][Customizing templates]]
- [[#org-parser-library][org-parser library]]
- [[#how-to-test-and-play-with-it][How to test and play with it]]
- [[#progress][Progress]]
- [[#comparison-to-pandoc][Comparison to Pandoc]]
- [[#org-exporters-library][org-exporters library]]
- [[#defining-a-new-export-backend][Defining a new export backend]]

* org-cli (horg)

You can find compiled binaries for Linux and macOS on the [[https://github.com/lucasvreis/org-mode-hs/releases][releases page]].

** Usage
The =horg= CLI tool has some basic help pages. To see them, type:
#+begin_src bash
# Global help
horg -h
# Exporter help
horg export -h
#+end_src

To convert to HTML, for instance, you can type:
#+begin_src bash
# Save to out.html
horg export html -i myfile.org -o out.html

# Print to stdout
horg export html -i myfile.org --stdout
#+end_src
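
To convert several files in one go, the two flags shown above can be wrapped in a shell loop. The following is a minimal sketch and not part of =horg=: the function name =export_all_html= is made up for illustration, and it assumes =horg= is on your =PATH=.
#+begin_src bash
# Export every .org file in a directory to HTML, next to its source.
# Hypothetical helper: only the "horg export html -i ... -o ..." call
# comes from the usage shown above.
export_all_html() {
  local dir="${1:-.}"   # directory to scan, defaults to the current one
  local f
  for f in "$dir"/*.org; do
    # Skip the unexpanded pattern when the directory has no .org files.
    [ -e "$f" ] || continue
    # foo.org -> foo.html
    horg export html -i "$f" -o "${f%.org}.html"
  done
}
#+end_src
Calling =export_all_html notes= would then write =notes/foo.html= for each =notes/foo.org=.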

*** Using Pandoc
The Pandoc exporter does not output Pandoc formats directly. Instead, it generates a JSON AST that can be fed into Pandoc. You can pipe this JSON into Pandoc to convert to Markdown or any other supported format:

#+begin_src bash
horg export pandoc -i myfile.org --stdout | pandoc --from json --to markdown
#+end_src

The JSON API version may be incompatible with your installed version of Pandoc. The CI compiles =horg= with the latest Pandoc API available at build time, so upgrading to the latest released Pandoc binary has a good chance of fixing the problem.
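
As a rough sketch of how the pipeline can be generalized to other output formats, the wrapper below takes the target format as an argument and, if Pandoc rejects the JSON, prints the installed Pandoc version to help diagnose a mismatch. The function name =org_via_pandoc= is hypothetical. It only uses the =horg= invocation shown above plus Pandoc's =--from=, =--to= and =--version= options.
#+begin_src bash
# Convert an Org file to a Pandoc output format via horg's JSON AST.
# Usage: org_via_pandoc INPUT.org FORMAT   (result goes to stdout)
org_via_pandoc() {
  local input="$1" format="$2"
  # The pipeline's exit status is Pandoc's, so a rejected JSON AST
  # (for instance because of an API version mismatch) lands in this branch.
  if ! horg export pandoc -i "$input" --stdout | pandoc --from json --to "$format"; then
    echo "conversion failed; installed Pandoc is:" >&2
    pandoc --version | head -n 1 >&2
    return 1
  fi
}
#+end_src
For example, =org_via_pandoc myfile.org rst > myfile.rst= would produce reStructuredText.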

** Installation from source
*** Using =cabal=
#+begin_src bash
git clone https://github.com/lucasvreis/org-mode-hs.git
cd org-mode-hs
cabal install horg
#+end_src

Please note that this library and some of its dependencies are not on Hackage yet, so you need to clone this repository first.

** Customizing templates
You can use the =horg init-templates= command to populate a =.horg= directory in the current directory with the default templates, which you can then modify.
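
A small sketch of this step, assuming =horg= is on your =PATH=. The helper name =init_horg_templates= is invented for illustration. It runs the command inside a chosen project directory and lists what was created, so you can see which template files are available to edit.
#+begin_src bash
# Populate DIR/.horg with horg's default templates and list the result.
init_horg_templates() {
  local dir="${1:-.}"
  # Run in a subshell so the caller's working directory is left untouched.
  (
    cd "$dir" || exit 1
    horg init-templates && ls -A .horg
  )
}
#+end_src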

Detailed documentation on how the templates work is TODO.

* org-parser library
** How to test and play with it
*** Testing the parser in =ghci=

This assumes you have =cabal= installed.

Clone the parser repository, =cd= into =org-parser= and run =cabal repl test= inside it. Cabal will be busy downloading dependencies and building for a while. You can then call the convenience function ~prettyParse~ like so:

: prettyParse orgDocument [text|
: This is a test.
: |]

You can write the contents to be parsed between =[text|= and =|]=. More generally, you can call

: prettyParse [parser to parse] [string to parse]

where =[parser to parse]= can be basically any of the functions from =Org.Parser.Document=, =Org.Parser.Elements= or =Org.Parser.Objects= whose types are wrapped by the =OrgParser= or =Marked OrgParser= monads. You don't need to import those modules yourself, as they are already imported in the ~test~ namespace.

*** Unit tests
You can view the unit tests under [[org-parser/test][org-parser/test]]. They aim to cover as many corner cases as possible against org-element, so you can take a look there to see what already works, and how well it works.

** Progress
In terms of the spec (see below the table for other features), the following components are implemented:
| Component | Type | Parse |
|---------------------+-------------------+-------|
| Heading | X | X |
| Section | X | X |
|---------------------+-------------------+-------|
| Affiliated Keywords | X | X |
|---------------------+-------------------+-------|
| GreaterBlock | X | X |
| Drawer | X | X |
| FootnoteDefinition | X | X |
| Item | X | X |
| List | X | X |
| PropertyDrawer | X | X |
| Table | X | X |
|---------------------+-------------------+-------|
| BabelCall | parsed as keyword | |
| Comment Block | X | X |
| Clock | X | X |
| Example Block | X | X |
| Export Block | X | X |
| Src Block | X | X |
| Verse Block | X | |
| Planning | X | X |
| Comment | X | X |
| FixedWidth | X (ExampleBlock) | X |
| HorizontalRule | X | X |
| Keyword | X | X |
| LaTeXEnvironment | X | X |
| NodeProperty | X | X |
| Paragraph | X | X |
| TableRow | X | X |
| TableHRule | X | X |
|---------------------+-------------------+-------|
| OrgEntity | X | X |
| LaTeXFragment | X | X |
| ExportSnippet | X | X |
| FootnoteReference | X | X |
| InlineBabelCall | X | X |
| InlineSrcBlock | X | X |
| RadioLink | wontfix | |
| PlainLink | wontfix | |
| AngleLink | X (Link) | X |
| RegularLink | X (Link) | X |
| Image | X | X |
| LineBreak | X | X |
| Macro | X | X |
| Citation | X | X |
| RadioTarget | wontfix | |
| Target | X | X |
| StatisticsCookie | X | X |
| Subscript | X | X |
| Superscript | X | X |
| TableCell | X | X |
| Timestamp | X | X |
| Plain | X | X |
| Markup | X | X |
(Thanks @tecosaur for the table)

*** Going beyond what is listed in the spec

~org-element-parse-buffer~ does not parse /everything/ that is eventually parsed or processed when exporting an Org document. Org features that the parser alone does not handle (and that the spec therefore does not describe) include: content from keywords like =#+title:=, which is parsed later by the exporter itself; references in lines of =src= or =example= blocks and link resolution, which happen in a post-processing step; and =#+include:= keywords, =TODO= keywords and radio links, which are handled in a pre-processing step.

Since the aspects listed above are genuine /org-mode features/, and not optional extensions, it is preferable that they be resolved in the AST output by this parser. Below is a table with more Org features that are not listed in the spec but are planned to be supported:

| Feature | Implemented? |
|--------------------------------------------+------------------------------------------------------------------------------------|
| ​=#+include:= keywords | not yet |
| Src/example blocks switches and references | yes |
| Resolving all inner links | some |
| Parsing image links into =Image=​s | yes |
| Processing radio links | no; conformant implementation /requires/ parsing twice. May be added under a flag. |
| Per-file TODO keywords | not yet (on the way, some work is done) |
| Macro definitions and substitution | not yet (on the way, some work is done) |

** Comparison to Pandoc
The main difference between =org-parser= and the Pandoc Org Reader is that this one parses into an AST that is more similar to org-element's AST, while Pandoc's parses into the =Pandoc= AST, which cannot express all Org elements directly. As a result, some Org features are either unsupported by the reader or "projected" onto =Pandoc= in ways that carry less information about the Org source. In contrast, this parser aims to represent Org documents more faithfully before "projecting" them into formats like HTML or the Pandoc AST itself. So you can expect more Org-specific features to be parsed, and hopefully more accurate parsing in general.

Also, if you are a developer mainly interested in rendering Org documents to HTML, Pandoc is a very big library to depend upon, with very long build times (at least on my computer, sadly).

Indeed, my initial plan was to fork the Org Reader and make it a standalone package, but this quickly proved unfeasible as the reader is very tangled with the rest of Pandoc. Also, some accuracy improvements to the reader were hard to make without deeper changes to the parser. For example, consider the following Org snippet:
#+begin_src org
This is a single paragraph. Because this single paragraph
,#+should not be ended by this funny line, because this funny
line is not a keyword. Not even this incomplete
\begin{LaTeX}
environment should break this paragraph apart.
#+end_src
Pandoc breaks this single paragraph into three, because it looks for a new "block start" (the start of a new Org element) at each line. If it finds one, it aborts the current element (block) and starts the new one. Only later does the parser decide whether the started block actually parses correctly through to its end, which is not the case for the =\begin{LaTeX}= in this example.
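
The difference between the two strategies can be illustrated with a toy line scanner. This is a deliberately simplified sketch in Bash, not the actual code of Pandoc or of this parser. The function names and the two regular expressions are assumptions made for the example, and the "validated" variant is only one possible way to avoid the problem. Each function reads the lines of one would-be paragraph on stdin and prints how many times it would break it.
#+begin_src bash
# Simplified patterns for elements that may interrupt a paragraph.
kw_re='^#\+[^[:space:]:]+:'     # a keyword needs the full "#+KEY:" form
env_re='^\\begin\{([^}]+)\}'    # an environment needs a matching \end{NAME}

# Eager strategy: any line that merely *looks like* a block start
# ends the current paragraph.
count_breaks_eager() {
  local breaks=0 first=1 line
  while IFS= read -r line; do
    if [ "$first" -eq 0 ]; then
      case "$line" in
        '#+'*|'\begin{'*) breaks=$((breaks + 1)) ;;
      esac
    fi
    first=0
  done
  echo "$breaks"
}

# Validated strategy: a line ends the paragraph only if the element
# it would start is actually well-formed.
count_breaks_validated() {
  local -a lines=()
  local line
  while IFS= read -r line; do lines+=("$line"); done
  local breaks=0 i j ok closing
  for ((i = 1; i < ${#lines[@]}; i++)); do
    ok=0
    if [[ ${lines[i]} =~ $kw_re ]]; then
      ok=1
    elif [[ ${lines[i]} =~ $env_re ]]; then
      # Look ahead for the closing line of the same environment.
      closing="\\end{${BASH_REMATCH[1]}}"
      for ((j = i + 1; j < ${#lines[@]}; j++)); do
        if [[ ${lines[j]} == "$closing"* ]]; then
          ok=1
          i=$j   # skip over the environment body
          break
        fi
      done
    fi
    breaks=$((breaks + ok))
  done
  echo "$breaks"
}
#+end_src
Fed the five lines of the snippet above (without the leading comma, which is only an Org escape), the eager variant reports two breaks, giving three paragraphs, while the validated variant reports none.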

Another noteworthy difference is that =org-parser= uses a different parsing library, ~megaparsec~. Pandoc uses the older ~parsec~, but also bundles many features in its own library.

* org-exporters library
This library provides functions for post-processing of the Org AST and exporting to various formats with =ondim=.

** Defining a new export backend
Basically:
- Use the [[https://github.com/lucasvreis/ondim][~ondim~ library]] to create an Ondim template system for the desired format, if it does not already exist.
- Import ~Org.Exporters.Common~ and create an ~ExportBackend~ for your format.
- Create auxiliary functions for loading templates and rendering the document.