{"id":13464602,"url":"https://github.com/wspl/creeper","last_synced_at":"2025-04-05T07:04:11.910Z","repository":{"id":57494660,"uuid":"82251024","full_name":"wspl/creeper","owner":"wspl","description":":paw_prints: Creeper - The Next Generation Crawler Framework (Go)","archived":false,"fork":false,"pushed_at":"2017-05-16T12:14:14.000Z","size":402,"stargazers_count":781,"open_issues_count":5,"forks_count":57,"subscribers_count":46,"default_branch":"master","last_synced_at":"2025-03-29T06:04:40.555Z","etag":null,"topics":["crawler","cross-platform","framework","golang","language","script","spider"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wspl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-02-17T03:01:50.000Z","updated_at":"2025-02-27T15:21:02.000Z","dependencies_parsed_at":"2022-08-28T15:10:38.407Z","dependency_job_id":null,"html_url":"https://github.com/wspl/creeper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wspl%2Fcreeper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wspl%2Fcreeper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wspl%2Fcreeper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wspl%2Fcreeper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wspl","download_url":"https://codeload.github.com/wspl/creeper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kin
d":"github","repositories_count":247299831,"owners_count":20916190,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","cross-platform","framework","golang","language","script","spider"],"created_at":"2024-07-31T14:00:47.089Z","updated_at":"2025-04-05T07:04:11.894Z","avatar_url":"https://github.com/wspl.png","language":"Go","readme":"[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg?style=flat)](https://opensource.org/licenses/Apache-2.0)\n[![Go Report Card](https://goreportcard.com/badge/github.com/wspl/creeper)](https://goreportcard.com/report/github.com/wspl/creeper)\n[![Gitter](https://img.shields.io/gitter/room/nwjs/nw.js.svg)](https://gitter.im/creeper-project/Lobby?utm_source=share-link\u0026utm_medium=link\u0026utm_campaign=share-link)\n![Creeper](https://raw.githubusercontent.com/wspl/creeper/master/art/Creeper.png)\n## About\n\nCreeper is a *next-generation* crawler that fetches web pages using creeper scripts. 
As a cross-platform embedded crawler, it can be used in news apps, subscription programs, etc.\n\n**Warning:** This project is still in early-stage development; please do not use it in production environments.\n\n## Get Started\n\n#### Installation\n\n```\n$ go get github.com/wspl/creeper\n```\n\n#### Hello World!\n\nCreate `hacker_news.crs`\n\n```\npage(@page=1) = \"https://news.ycombinator.com/news?p={@page}\"\n\nnews[]: page -\u003e $(\"tr.athing\")\n    title: $(\".title a.storylink\").text\n    site: $(\".title span.sitestr\").text\n    link: $(\".title a.storylink\").href\n```\n\nThen, create `main.go`\n\n```go\npackage main\n\nimport \"github.com/wspl/creeper\"\n\nfunc main() {\n\tc := creeper.Open(\"./hacker_news.crs\")\n\tc.Array(\"news\").Each(func(c *creeper.Creeper) {\n\t\tprintln(\"title: \", c.String(\"title\"))\n\t\tprintln(\"site: \", c.String(\"site\"))\n\t\tprintln(\"link: \", c.String(\"link\"))\n\t\tprintln(\"===\")\n\t})\n}\n```\n\nBuild and run. The console will print something like:\n\n```\ntitle:  Samsung chief Lee arrested as S.Korean corruption probe deepens\nsite:  reuters.com\nlink:  http://www.reuters.com/article/us-southkorea-politics-samsung-group-idUSKBN15V2RD\n===\ntitle:  ReactOS 0.4.4 Released\nsite:  reactos.org\nlink:  https://reactos.org/project-news/reactos-044-released\n===\ntitle:  FeFETs: How this new memory stacks up against existing non-volatile memory\nsite:  semiengineering.com\nlink:  http://semiengineering.com/what-are-fefets/\n```\n\n## Script Spec\n\n### Town\n\nA town is a lambda-like expression that stores an (im)mutable string. Most of the time, it is used to store a URL.\n\n```\npage(@page=1, ext) = \"https://news.ycombinator.com/news?p={@page}\u0026ext={ext}\"\n```\n\nWhen you need a town, use it as if you were calling a function:\n\n```\nnews[]: page(ext=\"Hello World!\") -\u003e $(\"tr.athing\")\n```\n\nYou might have noticed that the `@page` parameter is not passed in this call. 
That is because it is a special parameter.\n\nAn expression such as `name=\"something\"` in a town definition means that the parameter `name` has the default value `\"something\"`.\n\nIncidentally, `@page` is a parameter that automatically increments when the current page has no more content.\n\n\n### Node\n\nNodes form a tree that represents the structure of the data you are going to crawl.\n\n```\nnews[]: page -\u003e $(\"tr.athing\")\n\ttitle: $(\".title a.storylink\").text\n\tsite: $(\".title span.sitestr\").text\n\tlink: $(\".title a.storylink\").href\n```\n\nLike `yaml`, nodes express their hierarchy through indentation.\n\n#### Node Name\n\nEvery node has a name. `title` is a field name, representing a plain string value. `news[]` is an array name, representing a parent structure with multiple child items.\n\n#### Page\n\nPage indicates where to fetch the field data from. It can be a town expression or a field reference.\n\nField references are an advanced use of nodes; you can find the details in [./eh.crs](./eh.crs).\n\nIf a node has both a page and a fun, the page goes on the left of `-\u003e` and the fun goes on the right. 
That is, `page -\u003e fun`.\n\n#### Fun\n\nA fun describes how the data is processed.\n\nAll supported funs are listed below:\n\n| Name      | Parameters                       | Description                              |\n| --------- | -------------------------------- | ---------------------------------------- |\n| $         | (selector: string)               | Relative CSS selector (select from parent node)|\n| $root     | (selector: string)               | Absolute CSS selector (select from body)|\n| html      |                                  | inner HTML                               |\n| text      |                                  | inner text                               |\n| outerHTML |                                  | outer HTML                               |\n| attr      | (attr: string)                   | attribute value                          |\n| style     |                                  | style attribute value                    |\n| href      |                                  | href attribute value                     |\n| src       |                                  | src attribute value                      |\n| class     |                                  | class attribute value                    |\n| id        |                                  | id attribute value                       |\n| calc      | (prec: int)                      | calculate arithmetic expression          |\n| match     | (regexp: string)                 | match first sub-string via regular expression |\n| expand    | (regexp: string, target: string) | expand matched strings to target string  |\n\n\n\n## Author\n\nPlutonist\n\n\u003e [impl.moe](https://impl.moe) · GitHub [@wspl](https://github.com/wspl) 
\n","funding_links":[],"categories":["All","开源类库","Go"],"sub_categories":["爬虫"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwspl%2Fcreeper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwspl%2Fcreeper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwspl%2Fcreeper/lists"}