https://github.com/conversation/upmark
An HTML to Markdown converter.
converter html markdown peg ruby
- Host: GitHub
- URL: https://github.com/conversation/upmark
- Owner: conversation
- License: mit
- Created: 2011-09-21T01:09:29.000Z (about 13 years ago)
- Default Branch: main
- Last Pushed: 2024-04-21T22:53:16.000Z (7 months ago)
- Last Synced: 2024-07-20T10:50:32.188Z (4 months ago)
- Topics: converter, html, markdown, peg, ruby
- Language: Ruby
- Homepage: Upmark has the skills to convert your HTML to Markdown.
- Size: 114 KB
- Stars: 28
- Watchers: 14
- Forks: 5
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.md
# Upmark
An HTML to Markdown converter.
## Installation
> gem install upmark
## Usage
In Ruby:
```ruby
require "upmark"
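# The markup in this sample was lost in extraction; these tags are illustrative.
html = "<p>messenger <strong>bag</strong> skateboard</p>"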
markdown = Upmark.convert(html)
puts markdown
```
From the command-line:
> upmark foo.html
You can also pipe poorly formatted HTML documents through `tidy` before piping them into `upmark`:
> cat bar.html | tidy -asxhtml -indent -quiet --show-errors 0 --show-warnings 0 --show-body-only 1 --wrap 0 | upmark
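
The same clean-up step can also be driven from Ruby. The following is a minimal sketch, assuming `tidy` is installed and on your `PATH`; `tidy_xhtml` and `TIDY_FLAGS` are hypothetical names used for illustration and are not part of Upmark. The flags are copied from the command above.

```ruby
require "open3"

# Flags copied from the shell pipeline above.
TIDY_FLAGS = %w[
  -asxhtml -indent -quiet
  --show-errors 0 --show-warnings 0
  --show-body-only 1 --wrap 0
].freeze

# Hypothetical helper (not part of Upmark): feeds the HTML to `tidy`
# on stdin and returns the cleaned-up XHTML from stdout.
def tidy_xhtml(html)
  xhtml, status = Open3.capture2("tidy", *TIDY_FLAGS, stdin_data: html)
  # tidy exits with 1 for warnings and 2 for errors; only treat errors as fatal.
  raise "tidy failed (exit status #{status.exitstatus})" if status.exitstatus.to_i >= 2
  xhtml
end
```

With the gem loaded, the result can then be handed to the converter, for example `Upmark.convert(tidy_xhtml(File.read("bar.html")))`.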
## Features
Upmark will convert the following (arbitrarily nested) HTML elements to Markdown:
* `strong`
* `em`
* `p`
* `a`
* `h1`, `h2`, `h3`, `h4`, `h5`, `h6`
* `ul`
* `ol`
* `br`

It will also pass through block and span-level HTML elements (e.g. `table`, `div`, `span`, etc) which aren't used by Markdown.
## How it works
Upmark defines a parsing expression grammar (PEG) using the very awesome [Parslet](https://github.com/kschiess/parslet/) gem. This PEG is then used to convert HTML into Markdown in 4 steps:
1. Parse the XHTML into an abstract syntax tree (AST).
2. Normalize the AST into a nested hash of HTML elements.
3. Mark the block and span-level subtrees which should be ignored (`table`, `div`, `span`, etc).
4. Convert the AST leaves into Markdown.
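
To make the four steps concrete, here is a toy sketch of the same pipeline. It is not Upmark's implementation: it swaps the Parslet grammar for a small hand-written `StringScanner` parser, handles only attribute-free, well-formed tags, and every method name in it (`parse`, `normalize`, `mark`, `to_markdown`, `convert`) is an assumption made for illustration.

```ruby
require "strscan"

# Elements passed through as raw HTML (step 3). Illustrative subset.
PASS_THROUGH = %w[table div span].freeze

# Step 1: parse a tiny XHTML subset (no attributes, no self-closing tags)
# into a raw tree of [:element, name, children] and [:text, string] nodes.
def parse(scanner, closing = nil)
  nodes = []
  until scanner.eos?
    if scanner.scan(%r{</(\w+)>})
      raise "unexpected </#{scanner[1]}>" unless scanner[1] == closing
      return nodes
    elsif scanner.scan(/<(\w+)>/)
      name = scanner[1]
      nodes << [:element, name, parse(scanner, name)]
    else
      text = scanner.scan(/[^<]+/) or raise "cannot parse at offset #{scanner.pos}"
      nodes << [:text, text]
    end
  end
  raise "missing </#{closing}>" if closing
  nodes
end

# Step 2: normalize the raw tree into nested hashes.
def normalize(node)
  kind, value, children = node
  return { text: value } if kind == :text
  { element: value.downcase, children: children.map { |c| normalize(c) } }
end

# Step 3: flag the subtrees that should be left as HTML.
def mark(node)
  return node if node.key?(:text)
  node.merge(ignore: PASS_THROUGH.include?(node[:element]),
             children: node[:children].map { |c| mark(c) })
end

# Re-serialize an ignored subtree unchanged.
def to_html(node)
  return node[:text] if node.key?(:text)
  inner = node[:children].map { |c| to_html(c) }.join
  "<#{node[:element]}>#{inner}</#{node[:element]}>"
end

# Step 4: convert everything that is not flagged into Markdown.
def to_markdown(node)
  return node[:text] if node.key?(:text)
  return to_html(node) if node[:ignore]
  inner = node[:children].map { |c| to_markdown(c) }.join
  case node[:element]
  when "strong" then "**#{inner}**"
  when "em"     then "*#{inner}*"
  when "p"      then "#{inner}\n\n"
  else inner
  end
end

def convert(xhtml)
  parse(StringScanner.new(xhtml))
    .map { |node| to_markdown(mark(normalize(node))) }
    .join.strip
end

puts convert("<p>messenger <strong>bag</strong> skateboard</p><div><em>raw</em></div>")
# prints:
# messenger **bag** skateboard
#
# <div><em>raw</em></div>
```

Keeping the marking pass separate from the conversion pass means the converter only has to check a flag on each node, rather than knowing which elements Markdown cannot express.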