Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/JohannesKaufmann/html-to-markdown
⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules.
converter go golang goquery html html-to-markdown markdown
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/JohannesKaufmann/html-to-markdown
- Owner: JohannesKaufmann
- License: mit
- Created: 2018-05-15T13:26:26.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2024-09-04T17:15:47.000Z (3 months ago)
- Last Synced: 2024-09-06T00:28:13.684Z (3 months ago)
- Topics: converter, go, golang, goquery, html, html-to-markdown, markdown
- Language: Go
- Homepage:
- Size: 603 KB
- Stars: 833
- Watchers: 10
- Forks: 82
- Open Issues: 11
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md
Awesome Lists containing this project
- awesome-go - html-to-markdown - Convert HTML to Markdown. Even works with entire websites and can be extended through rules. (Text Processing / Markup Languages)
- project-awesome - JohannesKaufmann/html-to-markdown - ⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules. (Go)
- go-awesome - html-to-markdown - Convert HTML to Markdown (Open source library / Word Processing)
- jimsghstars - JohannesKaufmann/html-to-markdown - ⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules. (Go)
- awesome-go-extra - html-to-markdown - 05-15T13:26:26Z|2022-08-15T12:25:18Z| (Bot Building / Markup Languages)
- my-awesome - JohannesKaufmann/html-to-markdown - to-markdown,markdown pushed_at:2024-12 star:2.4k fork:0.1k ⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules. (Go)
- starred - JohannesKaufmann/html-to-markdown - ⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules. (Go)
- awesome-go - html-to-markdown - Convert HTML to Markdown. Even works with entire websites and can be extended through rules. Stars:`2.4K`. (Text Processing / Markup Languages)
- awesome-go - html-to-markdown - A tool for converting HTML to Markdown. (Text Processing / Markup Languages)
README
# html-to-markdown
[![Go Report Card](https://goreportcard.com/badge/github.com/JohannesKaufmann/html-to-markdown)](https://goreportcard.com/report/github.com/JohannesKaufmann/html-to-markdown)
[![codecov](https://codecov.io/gh/JohannesKaufmann/html-to-markdown/branch/master/graph/badge.svg)](https://codecov.io/gh/JohannesKaufmann/html-to-markdown)
![GitHub MIT License](https://img.shields.io/github/license/JohannesKaufmann/html-to-markdown)
[![GoDoc](https://godoc.org/github.com/JohannesKaufmann/html-to-markdown?status.png)](http://godoc.org/github.com/JohannesKaufmann/html-to-markdown)

![Gopher, the mascot of Golang, is wearing a party hat and holding a balloon. Next to the Gopher is a machine that converts characters associated with HTML to characters associated with Markdown.](/logo_five_years.png)
Convert HTML into Markdown with Go. It uses an [HTML Parser](https://github.com/PuerkitoBio/goquery) to avoid `regexp` as much as possible. That should prevent some [weird cases](https://stackoverflow.com/a/1732454) and allows it to be used when the input is totally unknown.
## Installation
```
go get github.com/JohannesKaufmann/html-to-markdown
```
## Usage
```go
package main

import (
	"fmt"
	"log"

	md "github.com/JohannesKaufmann/html-to-markdown"
)

func main() {
	converter := md.NewConverter("", true, nil)

	html := `<strong>Important</strong>`

	markdown, err := converter.ConvertString(html)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("md ->", markdown)
}
```
If you are already using [goquery](https://github.com/PuerkitoBio/goquery) you can pass a selection to `Convert`.
```go
markdown, err := converter.Convert(selec)
```
### Using it on the command line
If you want to use `html-to-markdown` on the command line without writing any Go code, check out [`html2md`](https://github.com/suntong/html2md#usage), a CLI wrapper for `html-to-markdown` that has all the following options and plugins built in.
## Options
The third parameter to `md.NewConverter` is `*md.Options`.
For example, you can change the delimiter that surrounds bold text ("`**`") to a different one (for example "`__`") by changing the value of `StrongDelimiter`.
```go
opt := &md.Options{
	StrongDelimiter: "__", // default: **
	// ...
}
converter := md.NewConverter("", true, opt)
```
For all the possible options, look at the [godocs](https://godoc.org/github.com/JohannesKaufmann/html-to-markdown/#Options), and for an example, look at the [options example](/examples/options/main.go).
## Adding Rules
```go
converter.AddRules(
	md.Rule{
		Filter: []string{"del", "s", "strike"},
		Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
			// You need to return a pointer to a string (md.String is just a helper function).
			// If you return nil the next function for that html element
			// will be picked. For example you could only convert an element
			// if it has a certain class name and fallback if not.
			content = strings.TrimSpace(content)
			return md.String("~" + content + "~")
		},
	},
	// more rules
)
```
For more information have a look at the example [add_rules](/examples/add_rules/main.go).
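The "return nil to fall back" behaviour described in the comments above can be modelled without the library. The following is a minimal sketch, not the library's implementation: the `replacement` type, the `str` helper and `applyRules` are hypothetical names, and the order in which rules are tried (most recently added first) is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// replacement models the shape described above: it returns a pointer to a
// string, or nil to signal "not handled, try the next rule".
// This type and everything below are hypothetical and only illustrate the idea.
type replacement func(content, class string) *string

// str is a small helper that returns a pointer to s.
func str(s string) *string { return &s }

// applyRules tries the most recently added rule first (an assumption of this
// sketch) and falls back to earlier rules whenever a rule returns nil.
func applyRules(rules []replacement, content, class string) string {
	for i := len(rules) - 1; i >= 0; i-- {
		if out := rules[i](content, class); out != nil {
			return *out
		}
	}
	return content // no rule handled the element: keep the content unchanged
}

func main() {
	rules := []replacement{
		// General rule: always wraps the content in "~".
		func(content, class string) *string {
			return str("~" + strings.TrimSpace(content) + "~")
		},
		// Specific rule: only handles elements with the class "highlight".
		func(content, class string) *string {
			if class != "highlight" {
				return nil // fall back to the general rule
			}
			return str("==" + strings.TrimSpace(content) + "==")
		},
	}

	fmt.Println(applyRules(rules, " old ", "highlight")) // handled by the specific rule
	fmt.Println(applyRules(rules, " old ", ""))          // handled by the general rule
}
```

Returning a pointer rather than a plain string is what makes the fallback possible: an empty string is a valid replacement, so `nil` is needed as a separate "not handled" signal.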
## Using Plugins
If you want plugins (GitHub Flavored Markdown features like strikethrough, tables, ...) you can pass them to `Use`.
```go
import "github.com/JohannesKaufmann/html-to-markdown/plugin"

// Use the `GitHubFlavored` plugin from the `plugin` package.
converter.Use(plugin.GitHubFlavored())
```
Or if you only want to use the `Strikethrough` plugin. You can change the character that distinguishes the text that is crossed out by setting the first argument to a different value (for example "~~" instead of "~").
```go
converter.Use(plugin.Strikethrough(""))
```
For more information have a look at the example [github_flavored](/examples/github_flavored/main.go).
---
These are the plugins located in the [plugin folder](/plugin) which you can use by importing "github.com/JohannesKaufmann/html-to-markdown/plugin".

| Name | Description |
| --------------------- | ------------------------------------------------------------------------------------------- |
| GitHubFlavored | GitHub's Flavored Markdown contains `TaskListItems`, `Strikethrough` and `Table`. |
| TaskListItems | (Included in `GitHubFlavored`). Converts `<input>` checkboxes into `- [x] Task`. |
| Strikethrough | (Included in `GitHubFlavored`). Converts `<strike>`, `<s>`, and `<del>` to the `~~` syntax. |
| Table | (Included in `GitHubFlavored`). Converts a `<table>` into a Markdown table. |
| TableCompat | |
| | |
| VimeoEmbed | |
| YoutubeEmbed | |
| | |
| ConfluenceCodeBlock | Converts code block elements that are used in Atlassian’s Wiki "Confluence". |
| ConfluenceAttachments | Converts Confluence attachment elements. |

These are the plugins in other repositories:

| Name | Description |
| ---------------------------- | ------------------- |
| \[Plugin Name\]\(Your Link\) | A short description |

If you write a plugin, feel free to open a PR that adds your plugin to this list.
## Writing Plugins
Have a look at the [plugin folder](/plugin) for a reference implementation. The most basic one is [Strikethrough](/plugin/strikethrough.go).
## Security
This library produces markdown that is readable and can be changed by humans.
Once you convert this markdown back to HTML (e.g. using [goldmark](https://github.com/yuin/goldmark) or [blackfriday](https://github.com/russross/blackfriday)) you need to be careful of malicious content.
This library does NOT sanitize untrusted content. Use an HTML sanitizer such as [bluemonday](https://github.com/microcosm-cc/bluemonday) before displaying the HTML in the browser.
## Other Methods
[Godoc](https://godoc.org/github.com/JohannesKaufmann/html-to-markdown)
### `func (c *Converter) Keep(tags ...string) *Converter`
Determines which elements are to be kept and rendered as HTML.
### `func (c *Converter) Remove(tags ...string) *Converter`
Determines which elements are to be removed altogether i.e. converted to an empty string.
## Escaping
Some characters have a special meaning in markdown. For example, the character "\*" can be used for lists, emphasis and dividers. By placing a backslash before that character (e.g. "\\\*") you can "escape" it. Then the character will render as a raw "\*" without the _"markdown meaning"_ applied.
But why is "escaping" even necessary?
```md
Paragraph 1
-
Paragraph 2
```
The markdown above doesn't seem that problematic. But "Paragraph 1" (with only one hyphen below it) will be recognized as a _setext heading_.
```html
<h2>Paragraph 1</h2>
<p>Paragraph 2</p>
```
A well-placed backslash character would prevent that...
```md
Paragraph 1
\-
Paragraph 2
```

---
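To make the setext case concrete, here is a minimal sketch of how a line of hyphens (or equals signs) directly below a line of text could be escaped. It only illustrates the idea and is not how this library implements escaping; `isSetextUnderline` and `escapeSetext` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// isSetextUnderline reports whether line consists only of "-" or only of "="
// characters, which Markdown would read as a setext heading underline.
func isSetextUnderline(line string) bool {
	t := strings.TrimSpace(line)
	if t == "" {
		return false
	}
	return strings.Trim(t, "-") == "" || strings.Trim(t, "=") == ""
}

// escapeSetext puts a backslash in front of an underline-like line that
// directly follows a non-empty line, keeping any leading indentation.
func escapeSetext(markdown string) string {
	lines := strings.Split(markdown, "\n")
	for i := 1; i < len(lines); i++ {
		prev := strings.TrimSpace(lines[i-1])
		if prev == "" || !isSetextUnderline(lines[i]) {
			continue
		}
		trimmed := strings.TrimLeft(lines[i], " ")
		indent := lines[i][:len(lines[i])-len(trimmed)]
		lines[i] = indent + `\` + trimmed
	}
	return strings.Join(lines, "\n")
}

func main() {
	// Prints the three lines from the example above, with "\-" in the middle.
	fmt.Println(escapeSetext("Paragraph 1\n-\nParagraph 2"))
}
```

A line of hyphens that follows a blank line is left alone in this sketch, because it cannot turn the preceding text into a heading.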
How to configure escaping? Depending on the `EscapeMode` option, the markdown output is going to be different.
```go
opt = &md.Options{
	EscapeMode: "basic", // default
}
```
Let's try it out with this HTML input:

| | |
| -------- | ----------------------------------------------------- |
| input | `<p>fake **bold** and real <strong>bold</strong></p>` |
| | |
| | **With EscapeMode "basic"** |
| output | `fake \*\*bold\*\* and real **bold**` |
| rendered | fake \*\*bold\*\* and real **bold** |
| | |
| | **With EscapeMode "disabled"** |
| output | `fake **bold** and real **bold**` |
| rendered | fake **bold** and real **bold** |

With **basic** escaping, we get some escape characters (the backslash "\\") but the output renders correctly.
With escaping **disabled**, the fake and real bold can't be distinguished in the markdown, so both will render as bold.
---
So now you know the purpose of escaping. However, if you encounter some content where the escaping breaks, you can manually disable it. But please also open an issue!
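The difference between the two modes can be sketched in a few lines of plain Go. This is an illustration under stated assumptions, not the library's code: `convertText` and `convertStrong` are hypothetical helpers, and the set of escaped characters is chosen for the example only.

```go
package main

import (
	"fmt"
	"strings"
)

// escaper backslash-escapes a few characters that carry meaning in Markdown.
// The character set is an illustrative assumption, not the library's list.
var escaper = strings.NewReplacer(
	`\`, `\\`,
	`*`, `\*`,
	`_`, `\_`,
	"`", "\\`",
)

// convertText is a hypothetical helper: it escapes plain text from the HTML
// input unless mode is "disabled".
func convertText(text, mode string) string {
	if mode == "disabled" {
		return text
	}
	return escaper.Replace(text)
}

// convertStrong wraps already-converted content in the bold delimiter.
// The delimiter itself is markup produced by the converter, so it is not escaped.
func convertStrong(content string) string { return "**" + content + "**" }

func main() {
	for _, mode := range []string{"basic", "disabled"} {
		out := convertText("fake **bold** and real ", mode) +
			convertStrong(convertText("bold", mode))
		fmt.Println(mode, "->", out)
	}
}
```

The point of the sketch is that only text taken from the input is escaped, while the delimiters the converter adds itself stay untouched. That is why the fake and the real bold remain distinguishable in the "basic" output.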
## Issues
If you find HTML snippets (or even full websites) that don't produce the expected results, please open an issue!
## Contributing & Testing
Please first discuss the change you wish to make, by opening an issue. I'm also happy to guide you to where a change is most likely needed.
_Note: The outside API should not change because of backwards compatibility..._
You don't have to be afraid of breaking the converter, since there are many "Golden File Tests":
Add your problematic HTML snippet to one of the `input.html` files in the `testdata` folder. Then run `go test -update` and have a look at which `.golden` files changed in Git.
You can now change the internal logic and inspect what impact your change has by running `go test -update` again.
_Note: Before submitting your change as a PR, make sure that you run those tests and check the files into Git..._
## Related Projects
- [turndown (js)](https://github.com/domchristie/turndown), a very good library written in JavaScript.
- [lunny/html2md](https://github.com/lunny/html2md), which is using [regex instead of goquery](https://stackoverflow.com/a/1732454). I came across a few edge cases when using it (leaving some HTML comments, ...) so I wrote my own.
## License
This project is licensed under the terms of the MIT license.