{"id":15364821,"url":"https://github.com/hexilee/unhtml","last_synced_at":"2025-04-15T07:31:05.194Z","repository":{"id":57504546,"uuid":"150932184","full_name":"Hexilee/unhtml","owner":"Hexilee","description":"HTML unmarshaler for golang","archived":false,"fork":false,"pushed_at":"2018-10-03T11:05:28.000Z","size":102,"stargazers_count":56,"open_issues_count":1,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-28T18:21:17.898Z","etag":null,"topics":["go","golang","html-parser","unmarshaller"],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Hexilee.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-09-30T05:02:37.000Z","updated_at":"2025-03-04T20:08:25.000Z","dependencies_parsed_at":"2022-08-30T03:41:11.769Z","dependency_job_id":null,"html_url":"https://github.com/Hexilee/unhtml","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hexilee%2Funhtml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hexilee%2Funhtml/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hexilee%2Funhtml/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hexilee%2Funhtml/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Hexilee","download_url":"https://codeload.github.com/Hexilee/unhtml/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249026717,"owners_cou
nt":21200497,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["go","golang","html-parser","unmarshaller"],"created_at":"2024-10-01T13:13:21.288Z","updated_at":"2025-04-15T07:31:04.928Z","avatar_url":"https://github.com/Hexilee.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Coverage Status](https://coveralls.io/repos/github/Hexilee/unhtml/badge.svg)](https://coveralls.io/github/Hexilee/unhtml)\n[![Go Report Card](https://goreportcard.com/badge/github.com/Hexilee/unhtml)](https://goreportcard.com/report/github.com/Hexilee/unhtml)\n[![Build Status](https://travis-ci.org/Hexilee/unhtml.svg?branch=master)](https://travis-ci.org/Hexilee/unhtml)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/Hexilee/unhtml/blob/master/LICENSE)\n[![Documentation](https://godoc.org/github.com/Hexilee/unhtml?status.svg)](https://godoc.org/github.com/Hexilee/unhtml)\n\nTable of Contents\n=================\n\n* [Example \u0026amp; Performance](#example--performance)\n* [Tips \u0026amp; Features](#tips--features)\n  * [Types](#types)\n  * [Root](#root)\n  * [Selector](#selector)\n     * [Struct](#struct)\n     * [Slice](#slice)\n  * [Tags](#tags)\n     * [html](#html)\n     * [attr](#attr)\n     * [converter](#converter)\n\n\n### Example \u0026 Performance\n\nA HTML file\n\n```html\n\u003c!DOCTYPE html\u003e\n\u003chtml lang=\"en\"\u003e\n\u003chead\u003e\n    \u003cmeta charset=\"UTF-8\"\u003e\n    \u003ctitle\u003eTitle\u003c/title\u003e\n\u003c/head\u003e\n\u003cbody\u003e\n    \u003cdiv id=\"test\"\u003e\n        
\u003cul\u003e\n            \u003cli\u003e0\u003c/li\u003e\n            \u003cli\u003e1\u003c/li\u003e\n            \u003cli\u003e2\u003c/li\u003e\n            \u003cli\u003e3\u003c/li\u003e\n        \u003c/ul\u003e\n        \u003cdiv\u003e\n            \u003cp\u003eHexilee\u003c/p\u003e\n            \u003cp\u003e20\u003c/p\u003e\n            \u003cp\u003etrue\u003c/p\u003e\n        \u003c/div\u003e\n        \u003cp\u003eHello World!\u003c/p\u003e\n        \u003cp\u003e10\u003c/p\u003e\n        \u003cp\u003e3.14\u003c/p\u003e\n        \u003cp\u003etrue\u003c/p\u003e\n    \u003c/div\u003e\n\u003c/body\u003e\n\u003c/html\u003e\n```\n\nRead it\n\n```go\nAllTypeHTML, _ := ioutil.ReadFile(\"testHTML/all-type.html\")\n```\n\nIf we want to parse it and get the values we want, like the following structs, how should we do it?\n\n\n```go\npackage example\n\ntype (\n\tPartTypesStruct struct {\n\t\tSlice   []int    \n\t\tStruct  TestUser \n\t\tString  string   \n\t\tInt     int      \n\t\tFloat64 float64  \n\t\tBool    bool     \n\t}\n\n\tTestUser struct {\n\t\tName      string \n\t\tAge       uint   \n\t\tLikeLemon bool   \n\t}\n)\n```\n\nIn the traditional way, we should do it like this:\n\n```go\npackage example\n\nimport (\n\t\"bytes\"\n\t\"github.com/PuerkitoBio/goquery\"\n\t\"strconv\"\n)\n\nfunc parsePartTypesLogically() (PartTypesStruct, error) {\n\tdoc, err := goquery.NewDocumentFromReader(bytes.NewReader(AllTypeHTML))\n\tpartTypes := PartTypesStruct{}\n\tif err == nil {\n\t\tselection := doc.Find(partTypes.Root())\n\t\tpartTypes.Slice = make([]int, 0)\n\t\tselection.Find(`ul \u003e li`).Each(func(i int, selection *goquery.Selection) {\n\t\t\tInt, parseErr := strconv.Atoi(selection.Text())\n\t\t\tif parseErr != nil {\n\t\t\t\terr = parseErr\n\t\t\t}\n\t\t\tpartTypes.Slice = append(partTypes.Slice, Int)\n\t\t})\n\t\tif err == nil {\n\t\t\tpartTypes.Struct.Name = selection.Find(`#test \u003e div \u003e p:nth-child(1)`).Text()\n\t\t\tInt, parseErr := 
strconv.Atoi(selection.Find(`#test \u003e div \u003e p:nth-child(2)`).Text())\n\t\t\tif err = parseErr; err == nil {\n\t\t\t\tpartTypes.Struct.Age = uint(Int)\n\t\t\t\tBool, parseErr := strconv.ParseBool(selection.Find(`#test \u003e div \u003e p:nth-child(3)`).Text())\n\t\t\t\tif err = parseErr; err == nil {\n\t\t\t\t\tpartTypes.Struct.LikeLemon = Bool\n\n\t\t\t\t\tString := selection.Find(`#test \u003e p:nth-child(3)`).Text()\n\t\t\t\t\tInt, parseErr := strconv.Atoi(selection.Find(`#test \u003e p:nth-child(4)`).Text())\n\t\t\t\t\tif err = parseErr; err != nil {\n\t\t\t\t\t\treturn partTypes, err\n\t\t\t\t\t}\n\n\t\t\t\t\tFloat64, parseErr := strconv.ParseFloat(selection.Find(`#test \u003e p:nth-child(5)`).Text(), 0)\n\t\t\t\t\tif err = parseErr; err != nil {\n\t\t\t\t\t\treturn partTypes, err\n\t\t\t\t\t}\n\n\t\t\t\t\tBool, parseErr := strconv.ParseBool(selection.Find(`#test \u003e p:nth-child(6)`).Text())\n\t\t\t\t\tif err = parseErr; err != nil {\n\t\t\t\t\t\treturn partTypes, err\n\t\t\t\t\t}\n\t\t\t\t\tpartTypes.String = String\n\t\t\t\t\tpartTypes.Int = Int\n\t\t\t\t\tpartTypes.Float64 = Float64\n\t\t\t\t\tpartTypes.Bool = Bool\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n\treturn partTypes, err\n}\n\n```\n\nIt works pretty well, but is boring. 
And now, you can do it like this:\n\n```go\npackage main\n\nimport (\n\t\"encoding/json\"\n\t\"fmt\"\n\t\"github.com/Hexilee/unhtml\"\n\t\"io/ioutil\"\n)\n\ntype (\n\tPartTypesStruct struct {\n\t\tSlice   []int    `html:\"ul \u003e li\"`\n\t\tStruct  TestUser `html:\"#test \u003e div\"`\n\t\tString  string   `html:\"#test \u003e p:nth-child(3)\"`\n\t\tInt     int      `html:\"#test \u003e p:nth-child(4)\"`\n\t\tFloat64 float64  `html:\"#test \u003e p:nth-child(5)\"`\n\t\tBool    bool     `html:\"#test \u003e p:nth-child(6)\"`\n\t}\n\t\n\tTestUser struct {\n\t\tName      string `html:\"p:nth-child(1)\"`\n\t\tAge       uint   `html:\"p:nth-child(2)\"`\n\t\tLikeLemon bool   `html:\"p:nth-child(3)\"`\n\t}\n)\n\nfunc (PartTypesStruct) Root() string {\n\treturn \"#test\"\n}\n\nfunc main() {\n\t// Errors are ignored here for brevity; check them in real code.\n\tAllTypeHTML, _ := ioutil.ReadFile(\"testHTML/all-type.html\")\n\tallTypes := PartTypesStruct{}\n\t_ = unhtml.Unmarshal(AllTypeHTML, \u0026allTypes)\n\tresult, _ := json.Marshal(\u0026allTypes)\n\tfmt.Println(string(result))\n}\n```\n\nResult:\n\n```json\n{\n  \"Slice\": [\n    0,\n    1,\n    2,\n    3\n  ],\n  \"Struct\": {\n    \"Name\": \"Hexilee\",\n    \"Age\": 20,\n    \"LikeLemon\": true\n  },\n  \"String\": \"Hello World!\",\n  \"Int\": 10,\n  \"Float64\": 3.14,\n  \"Bool\": true\n}\n```\n\nI think it really improves development efficiency, but what about its performance?\n\nThere are two benchmarks:\n\n```go\nfunc BenchmarkUnmarshalPartTypes(b *testing.B) {\n\tassert.NotNil(b, AllTypeHTML)\n\tfor i := 0; i \u003c b.N; i++ {\n\t\tpartTypes := PartTypesStruct{}\n\t\tassert.Nil(b, Unmarshal(AllTypeHTML, \u0026partTypes))\n\t}\n}\n\nfunc BenchmarkParsePartTypesLogically(b *testing.B) {\n\tassert.NotNil(b, AllTypeHTML)\n\tfor i := 0; i \u003c b.N; i++ {\n\t\t_, err := parsePartTypesLogically()\n\t\tassert.Nil(b, err)\n\t}\n}\n```\n\nTest it:\n\n```bash\n\u003e go test -bench=.\ngoos: darwin\ngoarch: amd64\npkg: github.com/Hexilee/unhtml\nBenchmarkUnmarshalPartTypes-4        \t   30000\t     54096 
ns/op\nBenchmarkParsePartTypesLogically-4   \t   30000\t     45188 ns/op\nPASS\nok  \tgithub.com/Hexilee/unhtml\t4.098s\n```\n\nNot bad, considering how small the demo HTML is. In real-world development with more complicated HTML, the two approaches perform almost the same.\n\n### Tips \u0026 Features\n\nThe only API this package exposes is the function:\n\n```go\nfunc Unmarshal(data []byte, v interface{}) error\n```\n\nwhose signature matches `Unmarshal` in the standard library's `json` and `xml` packages. However, you can customize its behavior through the data types in your code.\n\n#### Types\n\nThis package supports every kind in the `reflect` package except `Ptr/Uintptr/Interface/Chan/Func`.\n\nThe following fields are invalid and will cause `UnmarshalerItemKindError`.\n\n```go\ntype WrongFieldsStruct struct {\n    Ptr *int\n    Uintptr uintptr\n    Interface io.Reader\n    Chan chan int\n    Func func()\n}\n```\n\nHowever, when you call the function `Unmarshal`, you **MUST** pass a pointer; otherwise you will get an `UnmarshaledKindMustBePtrError`.\n\n```go\na := 1\n\n// Wrong\nUnmarshal([]byte(\"\"), a)\n\n// Right\nUnmarshal([]byte(\"\"), \u0026a)\n```\n\n#### Root\n\nReturns the root selector.\n\nA `Root() string` method is only supported on the root type, like\n\n```go\nfunc (PartTypesStruct) Root() string {\n\treturn \"#test\"\n}\n```\n\nIf you define it for a field type, such as `TestUser`\n\n```go\nfunc (TestUser) Root() string {\n\treturn \"#test\"\n}\n```\n\nthen, in `PartTypesStruct`, the selector in the field's tag will be overridden.\n\n```go\ntype (\n\tPartTypesStruct struct {\n\t\t...\n\t\tStruct  TestUser `html:\"#test \u003e div\"`\n\t\t...\n\t}\n)\n\n// effectively becomes\ntype (\n\tPartTypesStruct struct {\n\t\t...\n\t\tStruct  TestUser `html:\"#test\"`\n\t\t...\n\t}\n)\n```\n\n\n\n#### Selector\n\nThis package is based on `github.com/PuerkitoBio/goquery` and supports standard CSS selectors.\n\nYou can define a field's selector in its tag, like this:\n\n```go\ntype 
(\n\tPartTypesStruct struct {\n\t   ...\n\t\tInt     int      `html:\"#test \u003e p:nth-child(4)\"`\n\t\t...\n\t}\n)\n```\n\nIn most cases, this package will find the `#test \u003e p:nth-child(4)` element and try to parse its `innerText` as int.\n\nHowever, when the field type is `Struct` or `Slice`, it will be more complex.\n\n##### Struct\n\n```go\ntype (\n\tPartTypesStruct struct {\n\t\t...\n\t\tStruct  TestUser `html:\"#test \u003e div\"`\n\t\t...\n\t}\n\n\tTestUser struct {\n\t\tName      string `html:\"p:nth-child(1)\"`\n\t\tAge       uint   `html:\"p:nth-child(2)\"`\n\t\tLikeLemon bool   `html:\"p:nth-child(3)\"`\n\t}\n)\n\nfunc (PartTypesStruct) Root() string {\n\treturn \"#test\"\n}\n```\n\nFirst, it will call `*goquery.Selection.Find(\"#test\")`, we get:\n\n```html\n    \u003cdiv id=\"test\"\u003e\n        \u003cul\u003e\n            \u003cli\u003e0\u003c/li\u003e\n            \u003cli\u003e1\u003c/li\u003e\n            \u003cli\u003e2\u003c/li\u003e\n            \u003cli\u003e3\u003c/li\u003e\n        \u003c/ul\u003e\n        \u003cdiv\u003e\n            \u003cp\u003eHexilee\u003c/p\u003e\n            \u003cp\u003e20\u003c/p\u003e\n            \u003cp\u003etrue\u003c/p\u003e\n        \u003c/div\u003e\n        \u003cp\u003eHello World!\u003c/p\u003e\n        \u003cp\u003e10\u003c/p\u003e\n        \u003cp\u003e3.14\u003c/p\u003e\n        \u003cp\u003etrue\u003c/p\u003e\n    \u003c/div\u003e\n```\n\nThen, it will call `*goquery.Selection.Find(\"#test \u003e div\")`, we get\n\n```html\n\u003cdiv\u003e\n    \u003cp\u003eHexilee\u003c/p\u003e\n    \u003cp\u003e20\u003c/p\u003e\n    \u003cp\u003etrue\u003c/p\u003e\n\u003c/div\u003e\n```\n\nThen, in `TestUser`, it will call\n\n```go\n*goquery.Selection.Find(\"p:nth-child(1)\") // as Name\n*goquery.Selection.Find(\"p:nth-child(2)\") // as Age\n*goquery.Selection.Find(\"p:nth-child(3)\") // as LikeLemon\n```\n\n##### Slice\n\n```go\ntype (\n\tPartTypesStruct struct {\n\t\tSlice   []int    `html:\"ul \u003e 
li\"`\t\t...\n\t}\n)\n\nfunc (PartTypesStruct) Root() string {\n\treturn \"#test\"\n}\n```\n\nAs above, we get\n\n```html\n    \u003cdiv id=\"test\"\u003e\n        \u003cul\u003e\n            \u003cli\u003e0\u003c/li\u003e\n            \u003cli\u003e1\u003c/li\u003e\n            \u003cli\u003e2\u003c/li\u003e\n            \u003cli\u003e3\u003c/li\u003e\n        \u003c/ul\u003e\n        \u003cdiv\u003e\n            \u003cp\u003eHexilee\u003c/p\u003e\n            \u003cp\u003e20\u003c/p\u003e\n            \u003cp\u003etrue\u003c/p\u003e\n        \u003c/div\u003e\n        \u003cp\u003eHello World!\u003c/p\u003e\n        \u003cp\u003e10\u003c/p\u003e\n        \u003cp\u003e3.14\u003c/p\u003e\n        \u003cp\u003etrue\u003c/p\u003e\n    \u003c/div\u003e\n```\n\nThen it will call `*goquery.Selection.Find(\"ul \u003e li\")`, we get\n\n```html\n  \u003cli\u003e0\u003c/li\u003e\n  \u003cli\u003e1\u003c/li\u003e\n  \u003cli\u003e2\u003c/li\u003e\n  \u003cli\u003e3\u003c/li\u003e\n```\n\nThen, it will call `*goquery.Selection.Each(func(int, *goquery.Selection))`, iterate the list and parse values for slice.\n\n#### Tags\n\nThis package supports three tags, `html`, `attr` and `converter`\n\n##### html\n\nProvide the `css selector` of this field.\n\n##### attr\n\nBy default, this package regards the `innerText` of a element as its `value`\n\n```html\n\u003ca href=\"https://google.com\"\u003eGoogle\u003c/a\u003e\n```\n\n```go\ntype Link struct {\n    Text string `html:\"a\"`\n}\n```\n\nYou will get `Text = Google`. 
However, what should we do if we want to get `href`?\n\n```go\ntype Link struct {\n    Href string `html:\"a\" attr:\"href\"`\n    Text string `html:\"a\"`\n}\n```\n\nYou will get `link.Href == \"https://google.com\"`\n\n##### converter\n\nSometimes you want to process the raw data before it is stored:\n\n```html\n\u003cp\u003e2018-10-01 00:00:01\u003c/p\u003e\n```\n\nYou might try to unmarshal it like this:\n\n```go\ntype Birthday struct {\n\tTime time.Time `html:\"p\"`\n}\n\nfunc TestConverter(t *testing.T) {\n\tbirthday := Birthday{}\n\tassert.Nil(t, Unmarshal([]byte(BirthdayHTML), \u0026birthday))\n\tassert.Equal(t, 2018, birthday.Time.Year())\n\tassert.Equal(t, time.October, birthday.Time.Month())\n\tassert.Equal(t, 1, birthday.Time.Day())\n}\n```\n\nThis will certainly fail, because nothing tells `unhtml` how to convert a string to `time.Time`; it treats the field as a struct.\n\nInstead, you can use `converter`:\n\n```go\ntype Birthday struct {\n    Time time.Time `html:\"p\" converter:\"StringToTime\"`\n}\n\nconst TimeStandard = `2006-01-02 15:04:05`\n\nfunc (Birthday) StringToTime(str string) (time.Time, error) {\n\treturn time.Parse(TimeStandard, str)\n}\n\nfunc TestConverter(t *testing.T) {\n\tbirthday := Birthday{}\n\tassert.Nil(t, Unmarshal([]byte(BirthdayHTML), \u0026birthday))\n\tassert.Equal(t, 2018, birthday.Time.Year())\n\tassert.Equal(t, time.October, birthday.Time.Month())\n\tassert.Equal(t, 1, birthday.Time.Day())\n}\n```\n\nNow the test passes.\n\nThe converter's signature **MUST** be\n\n```go\nfunc (inputType) (resultType, error)\n```\n\n`resultType` **MUST** be the same as the field type, which can be any type.\n\n`inputType` **MUST NOT** violate the requirements in [Types](#types).\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhexilee%2Funhtml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhexilee%2Funhtml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhexilee%2Funhtml/lists"}