https://github.com/nfx/go-htmltable
Structured HTML table data extraction from URLs in Go that has almost no external dependencies
data-extraction go go-generics html
Last synced: 7 days ago
- Host: GitHub
- URL: https://github.com/nfx/go-htmltable
- Owner: nfx
- License: mit
- Created: 2022-09-17T10:00:39.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-06-10T21:29:36.000Z (5 months ago)
- Last Synced: 2024-06-18T23:16:30.113Z (5 months ago)
- Topics: data-extraction, go, go-generics, html
- Language: Go
- Homepage: https://pkg.go.dev/github.com/nfx/go-htmltable
- Size: 417 KB
- Stars: 114
- Watchers: 4
- Forks: 7
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# HTML table data extractor for Go
[![GoDoc](https://img.shields.io/badge/go-documentation-blue.svg)](https://pkg.go.dev/mod/github.com/nfx/go-htmltable)
[![MIT license](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/nfx/go-htmltable/blob/main/LICENSE)
[![codecov](https://codecov.io/gh/nfx/go-htmltable/branch/main/graph/badge.svg)](https://codecov.io/gh/nfx/go-htmltable)
[![build](https://github.com/nfx/go-htmltable/workflows/build/badge.svg?branch=main)](https://github.com/nfx/go-htmltable/actions?query=workflow%3Abuild+branch%3Amain)

`htmltable` enables structured data extraction from HTML tables and URLs and requires almost no external dependencies. Tested with Go 1.18.x and 1.19.x.
## Installation
```bash
go get github.com/nfx/go-htmltable
```

## Usage

You can retrieve a slice of `header`-annotated types using the `NewSlice*` constructors:
```go
type Ticker struct {
	Symbol   string `header:"Symbol"`
	Security string `header:"Security"`
	CIK      string `header:"CIK"`
}

url := "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
out, _ := htmltable.NewSliceFromURL[Ticker](url)
fmt.Println(out[0].Symbol)
fmt.Println(out[0].Security)

// Output:
// MMM
// 3M
```

An error is returned if no table on the page has the specified columns:
```go
page, _ := htmltable.NewFromURL("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
_, err := page.FindWithColumns("invalid", "column", "names")
fmt.Println(err)

// Output:
// cannot find table with columns: invalid, column, names
```

You can also use the lower-level API to work with the extracted data:
```go
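page, _ := htmltable.NewFromString(`<body>
	<h1>foo</h1>
	<table>
		<tr><th>a</th><th>b</th></tr>
		<tr><td>1</td><td>2</td></tr>
		<tr><td>3</td><td>4</td></tr>
	</table>
	<h1>bar</h1>
	<table>
		<tr><th>b</th><th>c</th><th>d</th></tr>
		<tr><td>1</td><td>2</td><td>5</td></tr>
		<tr><td>3</td><td>4</td><td>6</td></tr>
	</table>
</body>`)

fmt.Printf("found %d tables\n", page.Len())
_ = page.Each2("c", "d", func(c, d string) error {
	fmt.Printf("c:%s d:%s\n", c, d)
	return nil
})

// Output: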
// found 2 tables
// c:2 d:5
// c:4 d:6
```

Complex [tables with row and column spans](https://en.wikipedia.org/wiki/List_of_AMD_chipsets#AM4_chipsets) are natively supported as well. You can annotate `string`, `int`, and `bool` fields. A `bool` field is `true` if the lowercased cell value equals one of `yes`, `y`, `true`, or `t`.
![Wikipedia, AMD AM4 chipsets](doc/colspans-rowspans.png)
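The `bool` rule described above can be sketched in a few lines. This is only an illustration of the stated behavior, not the library's actual code: `isTruthy` is a hypothetical helper name, and the sketch assumes the cell text is compared as-is after lowercasing, with no whitespace trimming.

```go
package main

import (
	"fmt"
	"strings"
)

// isTruthy is a hypothetical helper, not part of htmltable's API.
// It mirrors the documented rule: a cell counts as true when its
// lowercased text is one of "yes", "y", "true", or "t".
// Assumption: no trimming is applied before the comparison.
func isTruthy(cell string) bool {
	switch strings.ToLower(cell) {
	case "yes", "y", "true", "t":
		return true
	default:
		return false
	}
}

func main() {
	// Print how a few sample cell values would be interpreted.
	for _, cell := range []string{"Yes", "TRUE", "t", "No", "", "1"} {
		fmt.Printf("%q -> %v\n", cell, isTruthy(cell))
	}
}
```

Under this rule, values such as `No`, `1`, or an empty cell all map to `false`. The AM4 example that follows relies on it for fields such as `MultiGpuCrossFire` and `AMDStoreMI`.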
```go
type AM4 struct {
	Model             string `header:"Model"`
	ReleaseDate       string `header:"Release date"`
	PCIeSupport       string `header:"PCIesupport[a]"`
	MultiGpuCrossFire bool   `header:"Multi-GPU CrossFire"`
	MultiGpuSLI       bool   `header:"Multi-GPU SLI"`
	USBSupport        string `header:"USBsupport[b]"`
	SATAPorts         int    `header:"Storage features SATAports"`
	RAID              string `header:"Storage features RAID"`
	AMDStoreMI        bool   `header:"Storage features AMD StoreMI"`
	Overclocking      string `header:"Processoroverclocking"`
	TDP               string `header:"TDP"`
	SupportExcavator  string `header:"CPU support[14] Excavator"`
	SupportZen        string `header:"CPU support[14] Zen"`
	SupportZenPlus    string `header:"CPU support[14] Zen+"`
	SupportZen2       string `header:"CPU support[14] Zen 2"`
	SupportZen3       string `header:"CPU support[14] Zen 3"`
	Architecture      string `header:"Architecture"`
}

am4Chipsets, _ := htmltable.NewSliceFromURL[AM4]("https://en.wikipedia.org/wiki/List_of_AMD_chipsets")
fmt.Println(am4Chipsets[2].Model)
fmt.Println(am4Chipsets[2].SupportZen2)

// Output:
// X370
// Varies[c]
```

Finally, you're encouraged to plug in your own structured logger:
```go
htmltable.Logger = func(_ context.Context, msg string, fields ...any) {
	fmt.Printf("[INFO] %s %v\n", msg, fields)
}
htmltable.NewFromURL("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")

// Output:
// [INFO] found table [columns [Symbol Security SEC filings GICSSector GICS Sub-Industry Headquarters Location Date first added CIK Founded] count 504]
// [INFO] found table [columns [Date Added Ticker Added Security Removed Ticker Removed Security Reason] count 308]
```

## Inspiration

This library aims to be something like [pandas.read_html](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html) or the [table_extract](https://docs.rs/table-extract/latest/table_extract/) Rust crate, but more idiomatic for Go.