{"id":18404127,"url":"https://github.com/nfx/go-htmltable","last_synced_at":"2025-04-05T12:06:11.089Z","repository":{"id":59619982,"uuid":"537767063","full_name":"nfx/go-htmltable","owner":"nfx","description":"Structured HTML table data extraction from URLs in Go that has almost no external dependencies","archived":false,"fork":false,"pushed_at":"2025-03-24T23:20:00.000Z","size":425,"stargazers_count":120,"open_issues_count":1,"forks_count":8,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-29T11:08:14.911Z","etag":null,"topics":["data-extraction","go","go-generics","html"],"latest_commit_sha":null,"homepage":"https://pkg.go.dev/github.com/nfx/go-htmltable","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nfx.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-09-17T10:00:39.000Z","updated_at":"2025-01-18T15:25:35.000Z","dependencies_parsed_at":"2023-12-17T10:27:07.972Z","dependency_job_id":"3fdd8e19-d1d1-43ae-941d-7b16aad6097c","html_url":"https://github.com/nfx/go-htmltable","commit_stats":{"total_commits":11,"total_committers":2,"mean_commits":5.5,"dds":0.09090909090909094,"last_synced_commit":"0baac98de3f1b5c5152f07d67bacf6ab6ea599e0"},"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nfx%2Fgo-htmltable","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nfx%2Fgo-htmltable/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nfx%2Fgo-htmlt
able/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nfx%2Fgo-htmltable/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nfx","download_url":"https://codeload.github.com/nfx/go-htmltable/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247332604,"owners_count":20921853,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-extraction","go","go-generics","html"],"created_at":"2024-11-06T02:50:39.341Z","updated_at":"2025-04-05T12:06:11.066Z","avatar_url":"https://github.com/nfx.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# HTML table data extractor for Go\n\n[![GoDoc](https://img.shields.io/badge/go-documentation-blue.svg)](https://pkg.go.dev/mod/github.com/nfx/go-htmltable)\n[![MIT license](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/nfx/go-htmltable/blob/main/LICENSE)\n[![codecov](https://codecov.io/gh/nfx/go-htmltable/branch/main/graph/badge.svg)](https://codecov.io/gh/nfx/go-htmltable)\n[![build](https://github.com/nfx/go-htmltable/workflows/build/badge.svg?branch=main)](https://github.com/nfx/go-htmltable/actions?query=workflow%3Abuild+branch%3Amain)\n\n\n`htmltable` enables structured data extraction from HTML tables and URLs and requires almost no external dependencies. 
Tested with Go 1.18.x and 1.19.x.\n\n## Installation\n\n```bash\ngo get github.com/nfx/go-htmltable\n```\n\n## Usage\n\nYou can retrieve a slice of `header`-annotated types using the `NewSlice*` constructors:\n\n```go\ntype Ticker struct {\n    Symbol   string `header:\"Symbol\"`\n    Security string `header:\"Security\"`\n    CIK      string `header:\"CIK\"`\n}\n\nurl := \"https://en.wikipedia.org/wiki/List_of_S%26P_500_companies\"\nout, _ := htmltable.NewSliceFromURL[Ticker](url)\nfmt.Println(out[0].Symbol)\nfmt.Println(out[0].Security)\n\n// Output: \n// MMM\n// 3M\n```\n\nAn error is returned if no table on the page has the specified columns:\n\n```go\npage, _ := htmltable.NewFromURL(\"https://en.wikipedia.org/wiki/List_of_S%26P_500_companies\")\n_, err := page.FindWithColumns(\"invalid\", \"column\", \"names\")\nfmt.Println(err)\n\n// Output: \n// cannot find table with columns: invalid, column, names\n```\n\nYou can also use a lower-level API to work with the extracted data:\n\n```go\npage, _ := htmltable.NewFromString(`\u003cbody\u003e\n    \u003ch1\u003efoo\u003c/h2\u003e\n    \u003ctable\u003e\n        \u003ctr\u003e\u003ctd\u003ea\u003c/td\u003e\u003ctd\u003eb\u003c/td\u003e\u003c/tr\u003e\n        \u003ctr\u003e\u003ctd\u003e 1 \u003c/td\u003e\u003ctd\u003e2\u003c/td\u003e\u003c/tr\u003e\n        \u003ctr\u003e\u003ctd\u003e3  \u003c/td\u003e\u003ctd\u003e4   \u003c/td\u003e\u003c/tr\u003e\n    \u003c/table\u003e\n    \u003ch1\u003ebar\u003c/h2\u003e\n    \u003ctable\u003e\n        \u003ctr\u003e\u003cth\u003eb\u003c/th\u003e\u003cth\u003ec\u003c/th\u003e\u003cth\u003ed\u003c/th\u003e\u003c/tr\u003e\n        \u003ctr\u003e\u003ctd\u003e1\u003c/td\u003e\u003ctd\u003e2\u003c/td\u003e\u003ctd\u003e5\u003c/td\u003e\u003c/tr\u003e\n        \u003ctr\u003e\u003ctd\u003e3\u003c/td\u003e\u003ctd\u003e4\u003c/td\u003e\u003ctd\u003e6\u003c/td\u003e\u003c/tr\u003e\n    \u003c/table\u003e\n\u003c/body\u003e`)\n\nfmt.Printf(\"found %d tables\\n\", page.Len())\n_ = 
page.Each2(\"c\", \"d\", func(c, d string) error {\n    fmt.Printf(\"c:%s d:%s\\n\", c, d)\n    return nil\n})\n\n// Output: \n// found 2 tables\n// c:2 d:5\n// c:4 d:6\n```\n\nComplex [tables with row and column spans](https://en.wikipedia.org/wiki/List_of_AMD_chipsets#AM4_chipsets) are supported natively as well. You can annotate `string`, `int`, and `bool` fields. A `bool` field is `true` if its lowercased value is one of `yes`, `y`, `true`, `t`.\n\n![Wikipedia, AMD AM4 chipsets](doc/colspans-rowspans.png)\n\n```go\ntype AM4 struct {\n    Model             string `header:\"Model\"`\n    ReleaseDate       string `header:\"Release date\"`\n    PCIeSupport       string `header:\"PCIesupport[a]\"`\n    MultiGpuCrossFire bool   `header:\"Multi-GPU CrossFire\"`\n    MultiGpuSLI       bool   `header:\"Multi-GPU SLI\"`\n    USBSupport        string `header:\"USBsupport[b]\"`\n    SATAPorts         int    `header:\"Storage features SATAports\"`\n    RAID              string `header:\"Storage features RAID\"`\n    AMDStoreMI        bool   `header:\"Storage features AMD StoreMI\"`\n    Overclocking      string `header:\"Processoroverclocking\"`\n    TDP               string `header:\"TDP\"`\n    SupportExcavator  string `header:\"CPU support[14] Excavator\"`\n    SupportZen        string `header:\"CPU support[14] Zen\"`\n    SupportZenPlus    string `header:\"CPU support[14] Zen+\"`\n    SupportZen2       string `header:\"CPU support[14] Zen 2\"`\n    SupportZen3       string `header:\"CPU support[14] Zen 3\"`\n    Architecture      string `header:\"Architecture\"`\n}\nam4Chipsets, _ := htmltable.NewSliceFromURL[AM4](\"https://en.wikipedia.org/wiki/List_of_AMD_chipsets\")\nfmt.Println(am4Chipsets[2].Model)\nfmt.Println(am4Chipsets[2].SupportZen2)\n\n// Output:\n// X370\n// Varies[c]\n```\n\nFinally, you're encouraged to plug in your own structured logger:\n\n```go\nhtmltable.Logger = func(_ context.Context, msg string, fields ...any) {\n    
fmt.Printf(\"[INFO] %s %v\\n\", msg, fields)\n}\nhtmltable.NewFromURL(\"https://en.wikipedia.org/wiki/List_of_S%26P_500_companies\")\n\n// Output:\n// [INFO] found table [columns [Symbol Security SEC filings GICSSector GICS Sub-Industry Headquarters Location Date first added CIK Founded] count 504]\n// [INFO] found table [columns [Date Added Ticker Added Security Removed Ticker Removed Security Reason] count 308]\n```\n\n## Inspiration\n\nThis library aims to be something like [pandas.read_html](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html) or the [table_extract](https://docs.rs/table-extract/latest/table_extract/) Rust crate, but more idiomatic for Go.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnfx%2Fgo-htmltable","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnfx%2Fgo-htmltable","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnfx%2Fgo-htmltable/lists"}