https://github.com/aarzilli/sandblast
Library to extract text from HTML files
- Host: GitHub
- URL: https://github.com/aarzilli/sandblast
- Owner: aarzilli
- License: bsd-3-clause
- Created: 2014-07-16T13:35:33.000Z (almost 12 years ago)
- Default Branch: master
- Last Pushed: 2015-12-20T13:55:28.000Z (over 10 years ago)
- Last Synced: 2025-03-25T06:23:56.631Z (about 1 year ago)
- Language: Go
- Size: 21.5 KB
- Stars: 11
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: COPYING
README
Library that uses Readability-like heuristics to extract text from an HTML document.
Example:
```go
import (
	"bytes"
	"fmt"
	"log"

	"github.com/aarzilli/sandblast"
	"golang.org/x/net/html"
)
…
// raw_html is a []byte holding the HTML document.
node, err := html.Parse(bytes.NewReader(raw_html))
if err != nil {
	log.Fatal("Parsing error: ", err)
}
title, text := sandblast.Extract(node)
fmt.Printf("Title: %s\n%s", title, text)
…
```
See also `example/extract.go`, a command-line utility that extracts text from a URL.
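
To give a rough idea of what "Readability-like heuristics" means, here is a minimal sketch of one common idea from that family: score candidate blocks by how much text they contain and scale the score down when most of that text is link text. This is an illustration only, not sandblast's actual algorithm. The `block` type, the `linkDensity`, `score` and `pickMain` functions, and the weights are all hypothetical, and the sketch works on pre-split blocks instead of a parsed HTML tree so that it needs only the standard library.

```go
package main

import (
	"fmt"
	"strings"
)

// block is a hypothetical stand-in for one candidate element of a page:
// its visible text and the part of that text that sits inside links.
type block struct {
	text     string
	linkText string
}

// linkDensity returns the fraction of a block's bytes that are link text,
// clamped to the range [0, 1].
func linkDensity(b block) float64 {
	if len(b.text) == 0 {
		return 0
	}
	d := float64(len(b.linkText)) / float64(len(b.text))
	if d > 1 {
		return 1
	}
	return d
}

// score rewards longer, comma-rich text and scales the result down
// by link density. The weights are arbitrary (an assumption of this sketch).
func score(b block) float64 {
	base := float64(len(b.text))/100 + float64(strings.Count(b.text, ","))
	return base * (1 - linkDensity(b))
}

// pickMain returns the index of the highest-scoring block,
// or -1 if there are no blocks.
func pickMain(blocks []block) int {
	best, bestScore := -1, -1.0
	for i, b := range blocks {
		if s := score(b); s > bestScore {
			best, bestScore = i, s
		}
	}
	return best
}

func main() {
	blocks := []block{
		// A navigation menu: all of its text is link text.
		{text: "Home About Contact", linkText: "Home About Contact"},
		// A paragraph of body text with no links.
		{text: "The main text of a page, without menus, sidebars, and footers."},
	}
	fmt.Println(pickMain(blocks)) // prints 1
}
```

The menu block scores zero because its link density is 1, so the paragraph wins even though both are short. A real extractor would compute these values while walking the tree returned by `html.Parse`.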