{"id":13413988,"url":"https://github.com/foolin/pagser","last_synced_at":"2025-04-22T07:31:37.655Z","repository":{"id":44830540,"uuid":"256959515","full_name":"foolin/pagser","owner":"foolin","description":"Pagser is a simple, extensible, configurable parse and deserialize html page to struct based on goquery and struct tags for golang crawler","archived":false,"fork":false,"pushed_at":"2023-10-15T14:31:46.000Z","size":133,"stargazers_count":103,"open_issues_count":7,"forks_count":7,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-07-31T20:53:13.927Z","etag":null,"topics":["colly","crawler","deserialization","go","golang","goquery","html","page","parser","scrapy"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/foolin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-04-19T09:22:00.000Z","updated_at":"2024-07-05T08:54:28.000Z","dependencies_parsed_at":"2024-01-09T08:11:41.712Z","dependency_job_id":null,"html_url":"https://github.com/foolin/pagser","commit_stats":null,"previous_names":[],"tags_count":16,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/foolin%2Fpagser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/foolin%2Fpagser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/foolin%2Fpagser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/foolin%2Fpagser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/
foolin","download_url":"https://codeload.github.com/foolin/pagser/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250195033,"owners_count":21390230,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["colly","crawler","deserialization","go","golang","goquery","html","page","parser","scrapy"],"created_at":"2024-07-30T20:01:54.547Z","updated_at":"2025-04-22T07:31:37.357Z","avatar_url":"https://github.com/foolin.png","language":"Go","readme":"# Pagser\n\n[![go-doc-img]][go-doc] [![travis-img]][travis] [![go-report-card-img]][go-report-card] [![Coverage Status][cov-img]][cov]\n\n**Pagser** is inspired by \u003cu\u003e**pag**\u003c/u\u003ee par\u003cu\u003e**ser**\u003c/u\u003e.\n\n**Pagser** is a simple, extensible, configurable library that parses and deserializes an HTML page into a struct, based on [goquery](https://github.com/PuerkitoBio/goquery) and struct tags, for Golang crawlers.\n\n## Contents\n\n- [Install](#install)\n- [Features](#features)\n- [Docs](#docs)\n- [Usage](#usage)\n- [Configuration](#configuration)\n- [Struct Tag Grammar](#struct-tag-grammar)\n- [Functions](#functions)\n    - [Builtin functions](#builtin-functions)\n    - [Extension functions](#extension-functions)\n    - [Custom function](#custom-function)\n    - [Function interface](#function-interface)\n    - [Call Syntax](#call-syntax)\n    - [Priority Order](#priority-order)\n    - [More Examples](#more-examples)\n- [Examples](#examples)\n- [Dependencies](#dependencies)\n\n\n## Install\n\n```bash\ngo get -u github.com/foolin/pagser\n```\n\nOr get a specific 
version:\n```bash\ngo get github.com/foolin/pagser@{version}\n```\nFor the list of {version} releases, see: \u003chttps://github.com/foolin/pagser/releases\u003e\n\n\n## Features\n\n* **Simple** - Uses Golang struct tag syntax.\n* **Easy** - Easy to use in your spider/crawler/colly application.\n* **Extensible** - Supports extension functions.\n* **Struct tag grammar** - The grammar is simple, like \\`pagser:\"a-\u003eattr(href)\"\\`.\n* **Nested Structure** - Supports nested structures for nodes.\n* **Configurable** - Supports configuration.\n* **Implicit type conversion** - Output strings are automatically converted to int, int64, float64...\n* **GoQuery/Colly** - Works with any [goquery](https://github.com/PuerkitoBio/goquery)-based project, such as [go-colly](https://github.com/gocolly/colly).\n\n## Docs\n\nSee [Pagser](https://pkg.go.dev/github.com/foolin/pagser)\n\n\n## Usage\n\n```golang\n\npackage main\n\nimport (\n\t\"encoding/json\"\n\t\"github.com/foolin/pagser\"\n\t\"log\"\n)\n\nconst rawPageHtml = `\n\u003c!doctype html\u003e\n\u003chtml\u003e\n\u003chead\u003e\n    \u003cmeta charset=\"utf-8\"\u003e\n    \u003ctitle\u003ePagser Title\u003c/title\u003e\n\t\u003cmeta name=\"keywords\" content=\"golang,pagser,goquery,html,page,parser,colly\"\u003e\n\u003c/head\u003e\n\n\u003cbody\u003e\n\t\u003ch1\u003eH1 Pagser Example\u003c/h1\u003e\n\t\u003cdiv class=\"navlink\"\u003e\n\t\t\u003cdiv class=\"container\"\u003e\n\t\t\t\u003cul class=\"clearfix\"\u003e\n\t\t\t\t\u003cli id=''\u003e\u003ca href=\"/\"\u003eIndex\u003c/a\u003e\u003c/li\u003e\n\t\t\t\t\u003cli id='2'\u003e\u003ca href=\"/list/web\" title=\"web site\"\u003eWeb page\u003c/a\u003e\u003c/li\u003e\n\t\t\t\t\u003cli id='3'\u003e\u003ca href=\"/list/pc\" title=\"pc page\"\u003ePc Page\u003c/a\u003e\u003c/li\u003e\n\t\t\t\t\u003cli id='4'\u003e\u003ca href=\"/list/mobile\" title=\"mobile page\"\u003eMobile 
Page\u003c/a\u003e\u003c/li\u003e\n\t\t\t\u003c/ul\u003e\n\t\t\u003c/div\u003e\n\t\u003c/div\u003e\n\u003c/body\u003e\n\u003c/html\u003e\n`\n\ntype PageData struct {\n\tTitle    string   `pagser:\"title\"`\n\tKeywords []string `pagser:\"meta[name='keywords']-\u003eattrSplit(content)\"`\n\tH1       string   `pagser:\"h1\"`\n\tNavs     []struct {\n\t\tID   int    `pagser:\"-\u003eattrEmpty(id, -1)\"`\n\t\tName string `pagser:\"a-\u003etext()\"`\n\t\tUrl  string `pagser:\"a-\u003eattr(href)\"`\n\t} `pagser:\".navlink li\"`\n}\n\nfunc main() {\n\t//New default config\n\tp := pagser.New()\n\n\t//data parser model\n\tvar data PageData\n\t//parse html data\n\terr := p.Parse(\u0026data, rawPageHtml)\n\t//check error\n\tif err != nil {\n\t\tlog.Fatal(err)\n\t}\n\n\t//print data\n\tlog.Printf(\"Page data json: \\n-------------\\n%v\\n-------------\\n\", toJson(data))\n}\n\nfunc toJson(v interface{}) string {\n\tdata, _ := json.MarshalIndent(v, \"\", \"\\t\")\n\treturn string(data)\n}\n\n```\n\nRun output:\n```\n\nPage data json: \n-------------\n{\n\t\"Title\": \"Pagser Title\",\n\t\"Keywords\": [\n\t\t\"golang\",\n\t\t\"pagser\",\n\t\t\"goquery\",\n\t\t\"html\",\n\t\t\"page\",\n\t\t\"parser\",\n\t\t\"colly\"\n\t],\n\t\"H1\": \"H1 Pagser Example\",\n\t\"Navs\": [\n\t\t{\n\t\t\t\"ID\": -1,\n\t\t\t\"Name\": \"Index\",\n\t\t\t\"Url\": \"/\"\n\t\t},\n\t\t{\n\t\t\t\"ID\": 2,\n\t\t\t\"Name\": \"Web page\",\n\t\t\t\"Url\": \"/list/web\"\n\t\t},\n\t\t{\n\t\t\t\"ID\": 3,\n\t\t\t\"Name\": \"Pc Page\",\n\t\t\t\"Url\": \"/list/pc\"\n\t\t},\n\t\t{\n\t\t\t\"ID\": 4,\n\t\t\t\"Name\": \"Mobile Page\",\n\t\t\t\"Url\": \"/list/mobile\"\n\t\t}\n\t]\n}\n-------------\n\n```\n\n## Configuration\n\n```golang\n\ntype Config struct {\n\tTagName    string //struct tag name, default is `pagser`\n\tFuncSymbol   string //Function symbol, default is `-\u003e`\n\tDebug        bool   //Debug mode, debug will print some log, default is `false`\n}\n\n```\n\n\n\n## Struct Tag Grammar\n\n```\n[goquery 
selector]-\u003e[function]\n```\nExample:\n```golang\n\ntype ExamData struct {\n\tHref string `pagser:\".navLink li a-\u003eattr(href)\"`\n}\n```\n\n\u003e 1. Struct tag name: `pagser`  \n\u003e 2. [goquery](https://github.com/PuerkitoBio/goquery) selector: `.navLink li a`   \n\u003e 3. Function symbol: `-\u003e`  \n\u003e 4. Function name: `attr`  \n\u003e 5. Function arguments: `href` \n\n![grammar](grammar.png)\n\n## Functions\n\n### Builtin functions\n\n\u003e - text() gets the element text and returns a string. This is the default function when no function is defined in the struct tag.\n\n\u003e - eachText() gets the text of each element and returns []string.\n\n\u003e - html() gets the element's inner html and returns a string.\n\n\u003e - eachHtml() gets the inner html of each element and returns []string.\n\n\u003e - outerHtml() gets the element's outer html and returns a string.\n\n\u003e - eachOutHtml() gets the outer html of each element and returns []string.\n\n\u003e - attr(name) gets the element's attribute value and returns a string.\n\n\u003e - eachAttr() gets the attribute value of each element and returns []string.\n\n\u003e - attrSplit(name, sep) gets the attribute value, splits it by the separator, and returns []string.\n\n\u003e - attr('value') gets the attribute named `value` and returns a string, e.g. \u003cinput value='xxxx' /\u003e returns \"xxxx\".\n\n\u003e - textSplit(sep) gets the element text, splits it by the separator, and returns []string.\n\n\u003e - eachTextJoin(sep) gets the text of each element, joins it with the separator, and returns a string.\n\n\u003e - eq(index) reduces the set of matched elements to the one at the specified index and returns a Selection for a nested struct.\n\n\u003e - ...\n\nFor more builtin functions, see the docs: \u003chttps://pkg.go.dev/github.com/foolin/pagser?tab=doc#BuiltinFunctions\u003e\n\n### Extension functions\n\n\u003e- Markdown() //convert html to markdown format.\n\n\u003e- UgcHtml() //sanitize html\n\nExtension functions must be registered before use, for example:\n```golang\nimport \"github.com/foolin/pagser/extensions/markdown\"\n\np := pagser.New()\n\n//Register 
Markdown\nmarkdown.Register(p)\n\n```\n\n### Custom function\n\n#### Function interface\n```golang\n\ntype CallFunc func(node *goquery.Selection, args ...string) (out interface{}, err error)\n\n```\n\n#### Define global function\n```golang\n\n// A global function must be registered with p.RegisterFunc(\"MyGlob\", MyGlobalFunc) before it is used.\nfunc MyGlobalFunc(node *goquery.Selection, args ...string) (out interface{}, err error) {\n\treturn \"Global-\" + node.Text(), nil\n}\n\ntype PageData struct{\n  MyGlobalValue string    `pagser:\"-\u003eMyGlob()\"`\n}\n\nfunc main(){\n\n    p := pagser.New()\n\n    //Register global function `MyGlob`\n    p.RegisterFunc(\"MyGlob\", MyGlobalFunc)\n\n    //Todo\n\n    //data parser model\n    var data PageData\n    //parse html data\n    err := p.Parse(\u0026data, rawPageHtml)\n\n    //...\n}\n\n```\n\n\n#### Define struct function\n```golang\n\n// MyFunc returns a string, so the field is declared as string.\ntype PageData struct{\n  MyFuncValue string    `pagser:\"-\u003eMyFunc()\"`\n}\n\n// This method is called automatically; no registration is needed.\nfunc (d PageData) MyFunc(node *goquery.Selection, args ...string) (out interface{}, err error) {\n\treturn \"Struct-\" + node.Text(), nil\n}\n\n\nfunc main(){\n\n    p := pagser.New()\n\n    //Todo\n\n    //data parser model\n    var data PageData\n    //parse html data\n    err := p.Parse(\u0026data, rawPageHtml)\n\n    //...\n}\n\n```\n\n#### Call Syntax\n\n\u003e **Note**: all function arguments are strings; single quotes are optional.\n\n1. Function call with no arguments\n\u003e -\u003efn()\n\n2. Function call with one argument (single quotes are optional)\n\n\u003e -\u003efn(one)\n\u003e\n\u003e -\u003efn('one')\n\n3. Function call with many arguments\n\n\u003e -\u003efn(one, two, three, ...)\n\u003e\n\u003e -\u003efn('one', 'two', 'three', ...)\n\n\n4. 
Function calls with single quotes and an escape character\n\n\u003e -\u003efn('it\\\\'s ok', 'two,xxx', 'three', ...)\n\n\n### Priority Order\n\nFunction lookup priority order:\n\n\u003e struct method -\u003e parent method -\u003e ... -\u003e global\n\n\n### More Examples\nSee the advanced example: \u003chttps://github.com/foolin/pagser/tree/master/_examples/advance\u003e\n\n## Implicit type conversion\nThe output string is converted automatically to the type of the target field, such as int, int64, or float64.\n\n**Supported types:**\n\n- bool\n- float32\n- float64\n- int\n- int32\n- int64\n- string\n- []bool\n- []float32\n- []float64\n- []int\n- []int32\n- []int64\n- []string\n\n\n\n## Examples\n\n### Crawl page example\n\n```golang\n\npackage main\n\nimport (\n\t\"encoding/json\"\n\t\"github.com/foolin/pagser\"\n\t\"log\"\n\t\"net/http\"\n)\n\ntype PageData struct {\n\tTitle    string `pagser:\"title\"`\n\tRepoList []struct {\n\t\tNames       []string `pagser:\"h1-\u003etextSplit('/', true)\"`\n\t\tDescription string   `pagser:\"h1 + p\"`\n\t\tStars       string   `pagser:\"a.muted-link-\u003eeqAndText(0)\"`\n\t\tRepo        string   `pagser:\"h1 a-\u003eattrConcat('href', 'https://github.com', $value, '?from=pagser')\"`\n\t} `pagser:\"article.Box-row\"`\n}\n\nfunc main() {\n\tresp, err := http.Get(\"https://github.com/trending\")\n\tif err != nil {\n\t\tlog.Fatal(err)\n\t}\n\tdefer resp.Body.Close()\n\n\t//New default config\n\tp := pagser.New()\n\n\t//data parser model\n\tvar data PageData\n\t//parse html data\n\terr = p.ParseReader(\u0026data, resp.Body)\n\t//check error\n\tif err != nil {\n\t\tlog.Fatal(err)\n\t}\n\n\t//print data\n\tlog.Printf(\"Page data json: \\n-------------\\n%v\\n-------------\\n\", toJson(data))\n}\n\nfunc toJson(v interface{}) string {\n\tdata, _ := json.MarshalIndent(v, \"\", \"\\t\")\n\treturn string(data)\n}\n\n\n```\n\nRun output:\n```\n\n2020/04/25 12:26:04 Page data json: \n-------------\n{\n\t\"Title\": \"Trending  repositories on GitHub today · 
GitHub\",\n\t\"RepoList\": [\n\t\t{\n\t\t\t\"Names\": [\n\t\t\t\t\"pcottle\",\n\t\t\t\t\"learnGitBranching\"\n\t\t\t],\n\t\t\t\"Description\": \"An interactive git visualization to challenge and educate!\",\n\t\t\t\"Stars\": \"16,010\",\n\t\t\t\"Repo\": \"https://github.com/pcottle/learnGitBranching?from=pagser\"\n\t\t},\n\t\t{\n\t\t\t\"Names\": [\n\t\t\t\t\"jackfrued\",\n\t\t\t\t\"Python-100-Days\"\n\t\t\t],\n\t\t\t\"Description\": \"Python - 100天从新手到大师\",\n\t\t\t\"Stars\": \"83,484\",\n\t\t\t\"Repo\": \"https://github.com/jackfrued/Python-100-Days?from=pagser\"\n\t\t},\n\t\t{\n\t\t\t\"Names\": [\n\t\t\t\t\"brave\",\n\t\t\t\t\"brave-browser\"\n\t\t\t],\n\t\t\t\"Description\": \"Next generation Brave browser for macOS, Windows, Linux, Android.\",\n\t\t\t\"Stars\": \"5,963\",\n\t\t\t\"Repo\": \"https://github.com/brave/brave-browser?from=pagser\"\n\t\t},\n\t\t{\n\t\t\t\"Names\": [\n\t\t\t\t\"MicrosoftDocs\",\n\t\t\t\t\"azure-docs\"\n\t\t\t],\n\t\t\t\"Description\": \"Open source documentation of Microsoft Azure\",\n\t\t\t\"Stars\": \"3,798\",\n\t\t\t\"Repo\": \"https://github.com/MicrosoftDocs/azure-docs?from=pagser\"\n\t\t},\n\t\t{\n\t\t\t\"Names\": [\n\t\t\t\t\"ahmetb\",\n\t\t\t\t\"kubectx\"\n\t\t\t],\n\t\t\t\"Description\": \"Faster way to switch between clusters and namespaces in kubectl\",\n\t\t\t\"Stars\": \"6,979\",\n\t\t\t\"Repo\": \"https://github.com/ahmetb/kubectx?from=pagser\"\n\t\t},\n\n        //...        \n\n\t\t{\n\t\t\t\"Names\": [\n\t\t\t\t\"serverless\",\n\t\t\t\t\"serverless\"\n\t\t\t],\n\t\t\t\"Description\": \"Serverless Framework – Build web, mobile and IoT applications with serverless architectures using AWS Lambda, Azure Functions, Google CloudFunctions \\u0026 more! 
–\",\n\t\t\t\"Stars\": \"35,502\",\n\t\t\t\"Repo\": \"https://github.com/serverless/serverless?from=pagser\"\n\t\t},\n\t\t{\n\t\t\t\"Names\": [\n\t\t\t\t\"vuejs\",\n\t\t\t\t\"vite\"\n\t\t\t],\n\t\t\t\"Description\": \"Experimental no-bundle dev server for Vue SFCs\",\n\t\t\t\"Stars\": \"1,573\",\n\t\t\t\"Repo\": \"https://github.com/vuejs/vite?from=pagser\"\n\t\t}\n\t]\n}\n-------------\n```\n\n### Colly Example\n\nWorking with colly:\n```golang\n\np := pagser.New()\n\n\n// Call the callback for the body element of every visited page\ncollector.OnHTML(\"body\", func(e *colly.HTMLElement) {\n\t//data parser model\n\tvar data PageData\n\t//parse html data from the element's goquery selection\n\terr := p.ParseSelection(\u0026data, e.DOM)\n\t//check error\n\tif err != nil {\n\t\tlog.Println(err)\n\t\treturn\n\t}\n\n})\n\n```\n\n- [See Examples](https://github.com/foolin/pagser/tree/master/_examples)\n- [See Tests](https://github.com/foolin/pagser/blob/master/parse_test.go)\n\n## Dependencies\n\n- github.com/PuerkitoBio/goquery\n\n- github.com/spf13/cast\n\n**Extensions:**\n\n- github.com/mattn/godown\n\n- github.com/microcosm-cc/bluemonday\n\n\n\n[go-doc]: https://pkg.go.dev/github.com/foolin/pagser\n[go-doc-img]: https://godoc.org/github.com/foolin/pagser?status.svg\n[travis]: https://travis-ci.org/foolin/pagser\n[travis-img]: https://travis-ci.org/foolin/pagser.svg?branch=master\n[go-report-card]: https://goreportcard.com/report/github.com/foolin/pagser\n[go-report-card-img]: https://goreportcard.com/badge/github.com/foolin/pagser\n[cov-img]: https://codecov.io/gh/foolin/pagser/branch/master/graph/badge.svg\n[cov]: https://codecov.io/gh/foolin/pagser\n","funding_links":[],"categories":["Text Processing","文本处理`解析和操作文本的代码库`","Template Engines","文本处理","Specific Formats","Bot Building"],"sub_categories":["Scrapers","查询语","刮刀","HTTP 
Clients"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffoolin%2Fpagser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffoolin%2Fpagser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffoolin%2Fpagser/lists"}