https://github.com/msoap/html2data

Library and cli for extracting data from HTML via CSS selectors
cli css-selector extract-data golang homebrew html library parser scrapping

Host: GitHub
URL: https://github.com/msoap/html2data
Owner: msoap
License: mit
Created: 2016-01-10T18:40:23.000Z (about 10 years ago)
Default Branch: master
Last Pushed: 2024-09-30T21:05:08.000Z (over 1 year ago)
Last Synced: 2025-03-24T03:57:57.552Z (10 months ago)
Topics: cli, css-selector, extract-data, golang, homebrew, html, library, parser, scrapping
Language: Go
Homepage:
Size: 7.15 MB
Stars: 69
Watchers: 3
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

html2data
=========

[![Go Reference](https://pkg.go.dev/badge/github.com/msoap/html2data.svg)](https://pkg.go.dev/github.com/msoap/html2data)

[![Go](https://github.com/msoap/html2data/actions/workflows/go.yml/badge.svg)](https://github.com/msoap/html2data/actions/workflows/go.yml)

[![Coverage Status](https://coveralls.io/repos/github/msoap/html2data/badge.svg?branch=master)](https://coveralls.io/github/msoap/html2data?branch=master)

[![Sourcegraph](https://sourcegraph.com/github.com/msoap/html2data/-/badge.svg)](https://sourcegraph.com/github.com/msoap/html2data?badge)

[![Report Card](https://goreportcard.com/badge/github.com/msoap/html2data)](https://goreportcard.com/report/github.com/msoap/html2data)

Library and command-line utility for extracting data from HTML via CSS selectors

Install
-------

Install package and command line utility:

    go install github.com/msoap/html2data/cmd/html2data@latest

Install package only:

    go get -u github.com/msoap/html2data

Methods
-------

  * `FromReader(io.Reader)` - create a document from an `io.Reader`
  * `FromURL(URL, [config URLCfg])` - create a document from an http(s) URL
  * `FromFile(file)` - create a document from a local file
  * `doc.GetData(css map[string]string)` - get texts by CSS selectors
  * `doc.GetDataFirst(css map[string]string)` - get texts by CSS selectors, returning the first entry for each selector or ""
  * `doc.GetDataNested(outerCss string, css map[string]string)` - extract nested data by CSS selectors inside the elements matched by another CSS selector
  * `doc.GetDataNestedFirst(outerCss string, css map[string]string)` - same as `GetDataNested`, but returning the first entry for each selector or ""
  * `doc.GetDataSingle(css string)` - get one result by one CSS selector

  or with config:

  * `doc.GetData(css map[string]string, html2data.Cfg{DontTrimSpaces: true})`
  * `doc.GetDataNested(outerCss string, css map[string]string, html2data.Cfg{DontTrimSpaces: true})`
  * `doc.GetDataSingle(css string, html2data.Cfg{DontTrimSpaces: true})`

Pseudo-selectors
----------------

  * `:attr(attr_name)` - get an attribute value instead of text, for example URLs from links: `a:attr(href)`
  * `:html` - get the HTML instead of text
  * `:get(N)` - get the N-th element from the list of matches

Example
-------

```go
package main

import (
    "fmt"
    "log"

    "github.com/msoap/html2data"
)

func main() {
    doc := html2data.FromURL("http://example.com")
    // or with config
    // doc := html2data.FromURL("http://example.com", html2data.URLCfg{UA: "userAgent", TimeOut: 10, DontDetectCharset: false})
    if doc.Err != nil {
        log.Fatal(doc.Err)
    }

    // get title
    title, _ := doc.GetDataSingle("title")
    fmt.Println("Title is:", title)

    title, _ = doc.GetDataSingle("title", html2data.Cfg{DontTrimSpaces: true})
    fmt.Println("Title as is, with spaces:", title)

    texts, _ := doc.GetData(map[string]string{"h1": "h1", "links": "a:attr(href)"})
    // get all H1 headers:
    if textOne, ok := texts["h1"]; ok {
        for _, text := range textOne {
            fmt.Println(text)
        }
    }
    // get all urls from links
    if links, ok := texts["links"]; ok {
        for _, text := range links {
            fmt.Println(text)
        }
    }
}
```

Command line utility
--------------------

[![Homebrew formula exists](https://img.shields.io/badge/homebrew-🍺-d7af72.svg)](https://github.com/msoap/html2data#install-1)

### Usage

    html2data [options] URL "css selector"
    html2data [options] URL :name1 "css1" :name2 "css2"...
    html2data [options] file.html "css selector"
    cat file.html | html2data "css selector"

### Options

  * `-user-agent="Custom UA"` -- set a custom user-agent
  * `-find-in="outer.css.selector"` -- search inside the specified elements instead of the whole document
  * `-json` -- get the result as JSON
  * `-dont-trim-spaces` -- get text as is
  * `-dont-detect-charset` -- don't detect the charset and convert text
  * `-timeout=10` -- set the timeout for loading the URL

### Install

Download binaries from: [releases](https://github.com/msoap/html2data/releases) (OS X/Linux/Windows/RaspberryPi)

Or install from Homebrew (macOS):

    brew tap msoap/tools
    brew install html2data

    # update:
    brew upgrade html2data

Using snap (Ubuntu or any Linux distribution with snap):

    # install stable version:
    sudo snap install html2data

    # install the latest version:
    sudo snap install --edge html2data

    # update
    sudo snap refresh html2data

From source:

    go install github.com/msoap/html2data/cmd/html2data@latest

### Examples

Get title of page:

    html2data https://go.dev/ title

Last blog posts:

    html2data https://go.dev/blog/ 'div#blogindex p.blogtitle a'

Getting RSS URL:

    html2data https://go.dev/blog/ 'link[type="application/atom+xml"]:attr(href)'

More examples from [wiki](https://github.com/msoap/html2data/wiki/Examples).
