Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/msoap/html2data
Library and cli for extracting data from HTML via CSS selectors
cli css-selector extract-data golang homebrew html library parser scrapping
Last synced: 3 days ago
- Host: GitHub
- URL: https://github.com/msoap/html2data
- Owner: msoap
- License: mit
- Created: 2016-01-10T18:40:23.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2024-09-30T21:05:08.000Z (about 1 month ago)
- Last Synced: 2024-10-31T20:25:49.702Z (10 days ago)
- Topics: cli, css-selector, extract-data, golang, homebrew, html, library, parser, scrapping
- Language: Go
- Homepage:
- Size: 7.15 MB
- Stars: 68
- Watchers: 3
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
html2data
=========

[![Go Reference](https://pkg.go.dev/badge/github.com/msoap/html2data.svg)](https://pkg.go.dev/github.com/msoap/html2data)
[![Go](https://github.com/msoap/html2data/actions/workflows/go.yml/badge.svg)](https://github.com/msoap/html2data/actions/workflows/go.yml)
[![Coverage Status](https://coveralls.io/repos/github/msoap/html2data/badge.svg?branch=master)](https://coveralls.io/github/msoap/html2data?branch=master)
[![Sourcegraph](https://sourcegraph.com/github.com/msoap/html2data/-/badge.svg)](https://sourcegraph.com/github.com/msoap/html2data?badge)
[![Report Card](https://goreportcard.com/badge/github.com/msoap/html2data)](https://goreportcard.com/report/github.com/msoap/html2data)

Library and cli-utility for extracting data from HTML via CSS selectors

Install
-------

Install package and command line utility:

    go install github.com/msoap/html2data/cmd/html2data@latest

Install package only:

    go get -u github.com/msoap/html2data

Methods
-------

* `FromReader(io.Reader)` - create document for parsing
* `FromURL(URL, [config URLCfg])` - create document from http(s) URL
* `FromFile(file)` - create document from local file
* `doc.GetData(css map[string]string)` - get texts by CSS selectors
* `doc.GetDataFirst(css map[string]string)` - get texts by CSS selectors, get first entry for each selector or ""
* `doc.GetDataNested(outerCss string, css map[string]string)` - extract nested data by CSS-selectors from another CSS-selector
* `doc.GetDataNestedFirst(outerCss string, css map[string]string)` - extract nested data by CSS-selectors from another CSS-selector, get first entry for each selector or ""
* `doc.GetDataSingle(css string)` - get one result by one CSS selector

or with config:

* `doc.GetData(css map[string]string, html2data.Cfg{DontTrimSpaces: true})`
* `doc.GetDataNested(outerCss string, css map[string]string, html2data.Cfg{DontTrimSpaces: true})`
* `doc.GetDataSingle(css string, html2data.Cfg{DontTrimSpaces: true})`

Pseudo-selectors
----------------

* `:attr(attr_name)` - get an attribute instead of text, for example URLs from links: `a:attr(href)`
* `:html` - get HTML instead of text
* `:get(N)` - get the n-th element from the list

Example
-------

```go
package main

import (
    "fmt"
    "log"

    "github.com/msoap/html2data"
)

func main() {
    doc := html2data.FromURL("http://example.com")
    // or with config
    // doc := html2data.FromURL("http://example.com", html2data.URLCfg{UA: "userAgent", TimeOut: 10, DontDetectCharset: false})
    if doc.Err != nil {
        log.Fatal(doc.Err)
    }

    // get title
    title, _ := doc.GetDataSingle("title")
    fmt.Println("Title is:", title)

    title, _ = doc.GetDataSingle("title", html2data.Cfg{DontTrimSpaces: true})
    fmt.Println("Title as is, with spaces:", title)

    texts, _ := doc.GetData(map[string]string{"h1": "h1", "links": "a:attr(href)"})
    // get all H1 headers:
    if textOne, ok := texts["h1"]; ok {
        for _, text := range textOne {
            fmt.Println(text)
        }
    }
    // get all urls from links
    if links, ok := texts["links"]; ok {
        for _, text := range links {
            fmt.Println(text)
        }
    }
}
```

Command line utility
--------------------

[![Homebrew formula exists](https://img.shields.io/badge/homebrew-🍺-d7af72.svg)](https://github.com/msoap/html2data#install-1)

### Usage

    html2data [options] URL "css selector"
    html2data [options] URL :name1 "css1" :name2 "css2"...
    html2data [options] file.html "css selector"
    cat file.html | html2data "css selector"

### Options

* `-user-agent="Custom UA"` -- set custom user-agent
* `-find-in="outer.css.selector"` -- search in the specified elements instead of the whole document
* `-json` -- get result as JSON
* `-dont-trim-spaces` -- get text as is
* `-dont-detect-charset` -- don't detect charset and convert text
* `-timeout=10` -- set the timeout for loading the URL

### Install
Download binaries from: [releases](https://github.com/msoap/html2data/releases) (OS X/Linux/Windows/RaspberryPi)

Or install from homebrew (macOS):

    brew tap msoap/tools
    brew install html2data

    # update:
    brew upgrade html2data

Using snap (Ubuntu or any Linux distribution with snap):

    # install stable version:
    sudo snap install html2data

    # install the latest version:
    sudo snap install --edge html2data

    # update
    sudo snap refresh html2data

From source:

    go install github.com/msoap/html2data/cmd/html2data@latest

### Examples

Get title of page:

    html2data https://go.dev/ title

Last blog posts:

    html2data https://go.dev/blog/ 'div#blogindex p.blogtitle a'

Getting RSS URL:

    html2data https://go.dev/blog/ 'link[type="application/atom+xml"]:attr(href)'

More examples from [wiki](https://github.com/msoap/html2data/wiki/Examples).