{"id":19590090,"url":"https://github.com/sparticleinc/content-parser","last_synced_at":"2026-06-10T13:31:17.305Z","repository":{"id":198825683,"uuid":"700723535","full_name":"sparticleinc/content-parser","owner":"sparticleinc","description":"Supports extracting content according to a preset configuration file when parsing HTML. Modified from @postlight/parser; the specific changes are listed in the README.","archived":false,"fork":false,"pushed_at":"2023-10-31T02:13:02.000Z","size":23805,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-11-21T13:03:15.972Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sparticleinc.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE-APACHE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-05T07:02:54.000Z","updated_at":"2025-03-18T06:09:55.000Z","dependencies_parsed_at":"2023-11-12T04:00:17.181Z","dependency_job_id":null,"html_url":"https://github.com/sparticleinc/content-parser","commit_stats":null,"previous_names":["sparticleinc/content-parser"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sparticleinc/content-parser","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sparticleinc%2Fcontent-parser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sparticleinc%2Fcontent-parser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sparticleinc%2Fcontent-parser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
repositories/sparticleinc%2Fcontent-parser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sparticleinc","download_url":"https://codeload.github.com/sparticleinc/content-parser/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sparticleinc%2Fcontent-parser/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34155422,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-10T02:00:07.152Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-11T08:23:23.989Z","updated_at":"2026-06-10T13:31:17.266Z","avatar_url":"https://github.com/sparticleinc.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"### Changes\n\n1. If no Extractor matches the page, all of the page's HTML is returned by default. Previously, the default generic content extractor was used instead.\n2. The Extractor configuration file gains a content.includedPaths field that accepts multiple match paths; if any one of them matches, that Extractor is used. Each path string is treated as a regular expression string.\n3. The domain field of the Extractor configuration file can now hold any string, although it is usually just a domain name. During matching, the domain counts as a match as long as it is contained in the URL. Previously, an exact match was required.\n4. The package has been renamed to @sparticle/content-parser.\n5. 
The package exports getExtractor so that downstream programs can obtain the Extractor in advance and do other work with it.\n\n# Postlight Parser - Extracting content from chaos\n\n[![CircleCI](https://circleci.com/gh/postlight/parser.svg?style=svg\u0026circle-token=3026c2b527d3767750e767872d08991aeb4f8f10)](https://circleci.com/gh/postlight/mercury-parser) [![Greenkeeper badge](https://badges.greenkeeper.io/postlight/mercury-parser.svg)](https://greenkeeper.io/) [![Apache License][license-apach-badge]][license-apach] [![MIT License][license-mit-badge]][license-mit]\n[![Gitter chat](https://badges.gitter.im/postlight/mercury.png)](https://gitter.im/postlight/mercury)\n\n[license-apach-badge]: https://img.shields.io/badge/License-Apache%202.0-blue.svg?style=flat-square\n[license-apach]: https://github.com/postlight/mercury-parser/blob/master/LICENSE-APACHE\n[license-mit-badge]: https://img.shields.io/badge/License-MIT%202.0-blue.svg?style=flat-square\n[license-mit]: https://github.com/postlight/mercury-parser/blob/master/LICENSE-MIT\n\n[Postlight](https://postlight.com)'s Parser extracts the bits that humans care about from any URL you give it. That includes article content, titles, authors, published dates, excerpts, lead images, and more.\n\nPostlight Parser powers [Postlight Reader](https://reader.postlight.com/), a browser extension that removes ads and distractions, leaving only text and images for a beautiful reading view on any site.\n\nPostlight Parser allows you to easily create custom parsers using simple JavaScript and CSS selectors. This allows you to proactively manage parsing and migration edge cases. There are [many examples available](https://github.com/postlight/parser/tree/master/src/extractors/custom) along with [documentation](https://github.com/postlight/parser/blob/master/src/extractors/custom/README.md).\n\n## How? 
Like this.\n\n### Installation\n\n```bash\n# If you're using yarn\nyarn add @postlight/parser\n\n# If you're using npm\nnpm install @postlight/parser\n```\n\n### Usage\n\n```javascript\nimport Parser from '@postlight/parser';\n\nParser.parse(url).then(result =\u003e console.log(result));\n\n// NOTE: When used in the browser, you can omit the URL argument\n// and simply run `Parser.parse()` to parse the current page.\n```\n\nThe result looks like this:\n\n```json\n{\n \"title\": \"Thunder (mascot)\",\n \"content\": \"... \u003cp\u003e\u003cb\u003eThunder\u003c/b\u003e is the \u003ca href=\\\"https://en.wikipedia.org/wiki/Stage_name\\\"\u003estage name\u003c/a\u003e for the...\",\n \"author\": \"Wikipedia Contributors\",\n \"date_published\": \"2016-09-16T20:56:00.000Z\",\n \"lead_image_url\": null,\n \"dek\": null,\n \"next_page_url\": null,\n \"url\": \"https://en.wikipedia.org/wiki/Thunder_(mascot)\",\n \"domain\": \"en.wikipedia.org\",\n \"excerpt\": \"Thunder Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos\",\n \"word_count\": 4677,\n \"direction\": \"ltr\",\n \"total_pages\": 1,\n \"rendered_pages\": 1\n}\n```\n\nIf Parser is unable to find a field, that field will return `null`.\n\n#### `parse()` Options\n\n##### Content Formats\n\nBy default, Postlight Parser returns the `content` field as HTML. However, you can override this behavior by passing in options to the `parse` function, specifying whether or not to scrape all pages of an article, and what type of output to return (valid values are `'html'`, `'markdown'`, and `'text'`). 
For example:\n\n```javascript\nParser.parse(url, { contentType: 'markdown' }).then(result =\u003e\n console.log(result)\n);\n```\n\nThis returns the page's `content` as GitHub-flavored Markdown:\n\n```json\n\"content\": \"...**Thunder** is the [stage name](https://en.wikipedia.org/wiki/Stage_name) for the...\"\n```\n\n##### Custom Request Headers\n\nYou can include custom headers in requests by passing name-value pairs to the `parse` function as follows:\n\n```javascript\nParser.parse(url, {\n headers: {\n Cookie: 'name=value; name2=value2; name3=value3',\n 'User-Agent':\n 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1',\n },\n}).then(result =\u003e console.log(result));\n```\n\n##### Pre-fetched HTML\n\nYou can use Postlight Parser to parse custom or pre-fetched HTML by passing an HTML string to the `parse` function as follows:\n\n```javascript\nParser.parse(url, {\n html:\n '\u003chtml\u003e\u003cbody\u003e\u003carticle\u003e\u003ch1\u003eThunder (mascot)\u003c/h1\u003e\u003cp\u003eThunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos\u003c/p\u003e\u003c/article\u003e\u003c/body\u003e\u003c/html\u003e',\n}).then(result =\u003e console.log(result));\n```\n\nNote that the URL argument is still supplied, in order to identify the web site and use its custom parser, if it has any, though it will not be used for fetching content.\n\n#### The command-line parser\n\nPostlight Parser also ships with a CLI, meaning you can use it from your command line like so:\n\n![Postlight Parser CLI Basic Usage](./assets/parser-basic-usage.gif)\n\n```bash\n# Install Postlight Parser globally\nyarn global add @postlight/parser\n# or\nnpm -g install @postlight/parser\n\n# Then\npostlight-parser https://postlight.com/trackchanges/mercury-goes-open-source\n\n# Pass optional --format argument to set content type 
(html|markdown|text)\npostlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --format=markdown\n\n# Pass optional --header.name=value arguments to include custom headers in the request\npostlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --header.Cookie=\"name=value; name2=value2; name3=value3\" --header.User-Agent=\"Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1\"\n\n# Pass optional --extend argument to add a custom type to the response\npostlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend credit=\"p:last-child em\"\n\n# Pass optional --extend-list argument to add a custom type with multiple matches\npostlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend-list categories=\".meta__tags-list a\"\n\n# Get the value of attributes by adding a pipe to --extend or --extend-list\npostlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend-list links=\".body a|href\"\n\n# Pass optional --add-extractor argument to add a custom extractor at runtime.\npostlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --add-extractor ./src/extractors/fixtures/postlight.com/index.js\n```\n\n## License\n\nLicensed under either of the below, at your preference:\n\n- Apache License, Version 2.0\n ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)\n- MIT license\n ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)\n\n## Contributing\n\nFor details on how to contribute to Postlight Parser, including how to write a custom content extractor for any site, see [CONTRIBUTING.md](./CONTRIBUTING.md)\n\nUnless it is explicitly stated otherwise, any contribution intentionally submitted for inclusion in the work, as defined in the Apache-2.0 license, shall be dual licensed as above without any additional terms or 
conditions.\n\n---\n\n🔬 A Labs project from your friends at [Postlight](https://postlight.com). Happy coding!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsparticleinc%2Fcontent-parser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsparticleinc%2Fcontent-parser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsparticleinc%2Fcontent-parser/lists"}