{"id":13468354,"url":"https://github.com/extractus/article-extractor","last_synced_at":"2025-04-27T03:59:20.237Z","repository":{"id":39669005,"uuid":"47049710","full_name":"extractus/article-extractor","owner":"extractus","description":"To extract main article from given URL with Node.js","archived":false,"fork":false,"pushed_at":"2025-02-09T03:52:35.000Z","size":7745,"stargazers_count":1695,"open_issues_count":6,"forks_count":149,"subscribers_count":16,"default_branch":"main","last_synced_at":"2025-04-27T03:59:13.762Z","etag":null,"topics":["article","article-extractor","article-parser","crawler","extract","nodejs","readability","scraper"],"latest_commit_sha":null,"homepage":"https://extractor-demos.pages.dev/article-extractor","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/extractus.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-11-29T04:20:27.000Z","updated_at":"2025-04-24T16:30:29.000Z","dependencies_parsed_at":"2023-02-09T02:31:21.632Z","dependency_job_id":"f4ff663f-cb47-4b7e-a32c-68af0b0e6443","html_url":"https://github.com/extractus/article-extractor","commit_stats":{"total_commits":516,"total_committers":17,"mean_commits":"30.352941176470587","dds":"0.18992248062015504","last_synced_commit":"197a2b59e70d5afdcc140fff2eade49c7085964f"},"previous_names":["ndaidong/article-parser"],"tags_count":100,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/extractus%2Farticle-extractor","tags_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/extractus%2Farticle-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/extractus%2Farticle-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/extractus%2Farticle-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/extractus","download_url":"https://codeload.github.com/extractus/article-extractor/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251085194,"owners_count":21533841,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["article","article-extractor","article-parser","crawler","extract","nodejs","readability","scraper"],"created_at":"2024-07-31T15:01:09.320Z","updated_at":"2025-04-27T03:59:20.211Z","avatar_url":"https://github.com/extractus.png","language":"JavaScript","readme":"# @extractus/article-extractor\n\nExtract the main article, main image and metadata from a URL.\n\n[![npm version](https://badge.fury.io/js/@extractus%2Farticle-extractor.svg)](https://badge.fury.io/js/@extractus%2Farticle-extractor)\n![CodeQL](https://github.com/extractus/article-extractor/workflows/CodeQL/badge.svg)\n![CI test](https://github.com/extractus/article-extractor/workflows/ci-test/badge.svg)\n\n(This library is derived from [article-parser](https://www.npmjs.com/package/article-parser), renamed.)\n\n## Demo\n\n- [Give it a try!](https://extractor-demos.pages.dev/article-extractor)\n- [Example 
FaaS](https://extractus.deno.dev/extract?apikey=rn0wbHos2e73W6ghQf705bdF\u0026type=article\u0026url=https://github.blog/2022-11-17-octoverse-2022-10-years-of-tracking-open-source/)\n\n\n## Install \u0026 Usage\n\n### Node.js\n\n```bash\nnpm i @extractus/article-extractor\n\n# pnpm\npnpm i @extractus/article-extractor\n\n# yarn\nyarn add @extractus/article-extractor\n```\n\n```ts\n// es6 module\nimport { extract } from '@extractus/article-extractor'\n```\n\n### Deno\n\n```ts\nimport { extract } from 'https://esm.sh/@extractus/article-extractor'\n\n// deno \u003e 1.28\nimport { extract } from 'npm:@extractus/article-extractor'\n```\n\n### Browser\n\n```ts\nimport { extract } from 'https://esm.sh/@extractus/article-extractor'\n```\n\nPlease check [the examples](examples) for reference.\n\n\n## APIs\n\n- [extract()](#extract)\n- [extractFromHtml()](#extractfromhtml)\n- [Transformations](#transformations)\n  - [`transformation` object](#transformation-object)\n  - [.addTransformations](#addtransformationsobject-transformation--array-transformations)\n  - [.removeTransformations](#removetransformationsarray-patterns)\n  - [Priority order](#priority-order)\n- [`sanitize-html`'s options](#sanitize-htmls-options)\n\n---\n\n### `extract()`\n\nLoad and extract article data. 
Returns a Promise object.\n\n#### Syntax\n\n```ts\nextract(String input)\nextract(String input, Object parserOptions)\nextract(String input, Object parserOptions, Object fetchOptions)\n```\n\nExample:\n\n```js\nimport { extract } from '@extractus/article-extractor'\n\nconst input = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'\n\n// here we use top-level await, assuming the current platform supports it\ntry {\n  const article = await extract(input)\n  console.log(article)\n} catch (err) {\n  console.error(err)\n}\n```\n\nThe result - `article` - can be `null` or an object with the following structure:\n\n```ts\n{\n  url: String,\n  title: String,\n  description: String,\n  image: String,\n  author: String,\n  favicon: String,\n  content: String,\n  published: Date String,\n  type: String, // page type\n  source: String, // original publisher\n  links: Array, // list of alternative links\n  ttr: Number, // time to read in seconds, 0 = unknown\n}\n```\n\n\n#### Parameters\n\n##### `input` *required*\n\nURL string linking to the article, or the HTML content of that web page.\n\n##### `parserOptions` *optional*\n\nObject with all or several of the following properties:\n\n  - `wordsPerMinute`: Number, to estimate time to read. Default `300`.\n  - `descriptionTruncateLen`: Number, max number of characters for the generated description. Default `210`.\n  - `descriptionLengthThreshold`: Number, min number of characters required for the description. Default `180`.\n  - `contentLengthThreshold`: Number, min number of characters required for the content. 
Default `200`.\n\nFor example:\n\n```js\nimport { extract } from '@extractus/article-extractor'\n\nconst article = await extract('https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html', {\n  descriptionLengthThreshold: 120,\n  contentLengthThreshold: 500\n})\n\nconsole.log(article)\n```\n\n##### `fetchOptions` *optional*\n\n`fetchOptions` is an object that can have the following properties:\n\n- `headers`: to set request headers\n- `proxy`: another endpoint to forward the request to\n- `agent`: an HTTP proxy agent\n- `signal`: AbortController signal or AbortSignal timeout to terminate the request\n\nFor example, you can use this parameter to set custom request headers, as below:\n\n```js\nimport { extract } from '@extractus/article-extractor'\n\nconst url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'\nconst article = await extract(url, {}, {\n  headers: {\n    'user-agent': 'Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1'\n  }\n})\n\nconsole.log(article)\n```\n\nYou can also specify a proxy endpoint to load remote content, instead of fetching directly.\n\nFor example:\n\n```js\nimport { extract } from '@extractus/article-extractor'\n\nconst url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'\n\nawait extract(url, {}, {\n  headers: {\n    'user-agent': 'Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1'\n  },\n  proxy: {\n    target: 'https://your-secret-proxy.io/loadXml?url=',\n    headers: {\n      'Proxy-Authorization': 'Bearer YWxhZGRpbjpvcGVuc2VzYW1l...'\n    },\n  }\n})\n```\n\nPassing requests through a proxy is useful when running `@extractus/article-extractor` in the browser. 
See [examples/browser-article-parser](examples/browser-article-parser) for a reference example.\n\nFor more info about proxy authentication, please refer to [HTTP authentication](https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication).\n\nFor deeper customization, you can consider using a [Proxy](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Proxy) to replace `fetch` behavior with your own handlers.\n\nAnother way to work with a proxy is to use the `agent` option instead of `proxy`, as below:\n\n```js\nimport { extract } from '@extractus/article-extractor'\n\nimport { HttpsProxyAgent } from 'https-proxy-agent'\n\nconst proxy = 'http://abc:RaNdoMpasswORd_country-France@proxy.packetstream.io:31113'\n\nconst url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'\n\nconst article = await extract(url, {}, {\n  agent: new HttpsProxyAgent(proxy),\n})\nconsole.log('Run article-extractor with proxy:', proxy)\nconsole.log(article)\n```\n\nFor more info about [https-proxy-agent](https://www.npmjs.com/package/https-proxy-agent), check [its repo](https://github.com/TooTallNate/proxy-agents).\n\nBy default, there is no request timeout. 
You can use the `signal` option to cancel the request when needed.\n\nThe common way is to use an AbortController:\n\n```js\nconst controller = new AbortController()\n\n// stop after 5 seconds\nsetTimeout(() =\u003e {\n  controller.abort()\n}, 5000)\n\nconst data = await extract(url, null, {\n  signal: controller.signal,\n})\n```\n\nA newer solution is AbortSignal's `timeout()` static method:\n\n```js\n// stop after 5 seconds\nconst data = await extract(url, null, {\n  signal: AbortSignal.timeout(5000),\n})\n```\n\nFor more info:\n\n- [AbortController constructor](https://developer.mozilla.org/en-US/docs/Web/API/AbortController)\n- [AbortSignal: timeout() static method](https://developer.mozilla.org/en-US/docs/Web/API/AbortSignal/timeout_static)\n\n\n### `extractFromHtml()`\n\nExtract article data from an HTML string. Returns a Promise object, the same as the `extract()` method above.\n\n#### Syntax\n\n```ts\nextractFromHtml(String html)\nextractFromHtml(String html, String url)\nextractFromHtml(String html, String url, Object parserOptions)\n```\n\nExample:\n\n```js\nimport { extractFromHtml } from '@extractus/article-extractor'\n\nconst url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'\n\nconst res = await fetch(url)\nconst html = await res.text()\n\n// you can do whatever you want with this raw html here: clean it up, remove ad banners, etc.\n// just ensure an HTML string is returned\n\nconst article = await extractFromHtml(html, url)\nconsole.log(article)\n```\n\n#### Parameters\n\n##### `html` *required*\n\nHTML string which contains the article you want to extract.\n\n##### `url` *optional*\n\nURL string that indicates the source of that HTML content.\n`article-extractor` may use this info to handle internal/relative links.\n\n##### `parserOptions` *optional*\n\nSee [parserOptions](#parseroptions-optional) above.\n\n\n---\n\n### Transformations\n\nSometimes the default extraction algorithm may not work well. 
That is when we need transformations.\n\nBy adding functions before and after the main extraction step, we aim to produce the best possible result.\n\nThere are two methods for working with transformations:\n\n- `addTransformations(Object transformation | Array transformations)`\n- `removeTransformations(Array patterns)`\n\nFirst, let's look at the `transformation` object.\n\n#### `transformation` object\n\nIn `@extractus/article-extractor`, `transformation` is an object with the following properties:\n\n- `patterns`: required, a list of regexps to match the URLs\n- `pre`: optional, a function to process raw HTML\n- `post`: optional, a function to process extracted article\n\nBasically, the meaning of `transformation` can be interpreted like this:\n\n\u003e for the URLs that match these `patterns` \u003cbr\u003e\n\u003e run the `pre` function to normalize the HTML content \u003cbr\u003e\n\u003e then extract the main article content from the normalized HTML, and if successful \u003cbr\u003e\n\u003e run the `post` function to normalize the extracted article content\n\n![article-extractor extraction process](https://res.cloudinary.com/pwshub/image/upload/v1657336822/documentation/article-parser_extraction_process.png)\n\nHere is an example transformation:\n\n```ts\n{\n  patterns: [\n    /([\\w]+.)?domain.tld\\/*/,\n    /domain.tld\\/articles\\/*/\n  ],\n  pre: (document) =\u003e {\n    // remove each .advertise-area element and its siblings from the raw HTML content\n    document.querySelectorAll('.advertise-area').forEach((element) =\u003e {\n      if (element.nodeName === 'DIV') {\n        while (element.nextSibling) {\n          element.parentNode.removeChild(element.nextSibling)\n        }\n        element.parentNode.removeChild(element)\n      }\n    })\n    return document\n  },\n  post: (document) =\u003e {\n    // in the extracted article, replace all h4 tags with h2\n    document.querySelectorAll('h4').forEach((element) =\u003e {\n      const h2Element = 
document.createElement('h2')\n      h2Element.innerHTML = element.innerHTML\n      element.parentNode.replaceChild(h2Element, element)\n    })\n    // replace small-sized images with the original version\n    document.querySelectorAll('img').forEach((element) =\u003e {\n      const src = element.getAttribute('src')\n      if (src.includes('domain.tld/pics/150x120/')) {\n        const fullSrc = src.replace('/pics/150x120/', '/pics/original/')\n        element.setAttribute('src', fullSrc)\n      }\n    })\n    return document\n  }\n}\n```\n\n- To write better transformation logic, please refer to [linkedom](https://github.com/WebReflection/linkedom) and [Document Object](https://developer.mozilla.org/en-US/docs/Web/API/Document).\n\n#### `addTransformations(Object transformation | Array transformations)`\n\nAdd a single transformation or a list of transformations. For example:\n\n```ts\nimport { addTransformations } from '@extractus/article-extractor'\n\naddTransformations({\n  patterns: [\n    /([\\w]+.)?abc.tld\\/*/\n  ],\n  pre: (document) =\u003e {\n    // do something with document\n    return document\n  },\n  post: (document) =\u003e {\n    // do something with document\n    return document\n  }\n})\n\naddTransformations([\n  {\n    patterns: [\n      /([\\w]+.)?def.tld\\/*/\n    ],\n    pre: (document) =\u003e {\n      // do something with document\n      return document\n    },\n    post: (document) =\u003e {\n      // do something with document\n      return document\n    }\n  },\n  {\n    patterns: [\n      /([\\w]+.)?xyz.tld\\/*/\n    ],\n    pre: (document) =\u003e {\n      // do something with document\n      return document\n    },\n    post: (document) =\u003e {\n      // do something with document\n      return document\n    }\n  }\n])\n```\n\nTransformations without `patterns` will be ignored.\n\n#### `removeTransformations(Array patterns)`\n\nRemoves transformations that match the specified patterns.\n\nFor example, we can remove all added 
transformations above:\n\n```js\nimport { removeTransformations } from '@extractus/article-extractor'\n\nremoveTransformations([\n  /([\\w]+.)?abc.tld\\/*/,\n  /([\\w]+.)?def.tld\\/*/,\n  /([\\w]+.)?xyz.tld\\/*/\n])\n```\n\nCalling `removeTransformations()` without a parameter will remove all current transformations.\n\n#### Priority order\n\nWhile processing an article, more than one transformation can be applied.\n\nSuppose that we have the following transformations:\n\n```ts\n[\n  {\n    patterns: [\n      /http(s?):\\/\\/google.com\\/*/,\n      /http(s?):\\/\\/goo.gl\\/*/\n    ],\n    pre: function_one,\n    post: function_two\n  },\n  {\n    patterns: [\n      /http(s?):\\/\\/goo.gl\\/*/,\n      /http(s?):\\/\\/google.inc\\/*/\n    ],\n    pre: function_three,\n    post: function_four\n  }\n]\n```\n\nAs you can see, an article from `goo.gl` certainly matches both of them.\n\nIn this scenario, `@extractus/article-extractor` will execute both transformations, one by one:\n\n`function_one` -\u003e `function_three` -\u003e extraction -\u003e `function_two` -\u003e `function_four`\n\n---\n\n### `sanitize-html`'s options\n\n`@extractus/article-extractor` uses [sanitize-html](https://github.com/apostrophecms/sanitize-html) to clean up HTML content.\n\nHere are the [default options](src/config.js#L5).\n\nDepending on the needs of your content system, you might want to keep some HTML tags/attributes while ignoring others.\n\nThere are two methods to access and modify these options in `@extractus/article-extractor`:\n\n- `getSanitizeHtmlOptions()`\n- `setSanitizeHtmlOptions(Object sanitizeHtmlOptions)`\n\nRead the [sanitize-html](https://github.com/apostrophecms/sanitize-html#default-options) docs for more info.\n\n---\n\n## Test\n\n```bash\ngit clone https://github.com/extractus/article-extractor.git\ncd article-extractor\npnpm i\npnpm test\n```\n\n![article-extractor-test.png](https://i.imgur.com/TbRCUSS.png?110222)\n\n\n## Quick evaluation\n\n```bash\ngit clone 
https://github.com/extractus/article-extractor.git\ncd article-extractor\npnpm i\npnpm eval {URL_TO_PARSE_ARTICLE}\n```\n\n## License\n\nThe MIT License (MIT)\n\n## Support the project\n\nIf you find this open source project valuable, you can support it in the following ways:\n\n- Give it a star ⭐\n- Buy me a coffee: https://paypal.me/ndaidong 🍵\n- Subscribe to the [Article Extractor service](https://rapidapi.com/pwshub-pwshub-default/api/article-extractor2) on RapidAPI 😉\n\nThank you.\n\n---\n","funding_links":["https://paypal.me/ndaidong"],"categories":["JavaScript","scraper"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fextractus%2Farticle-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fextractus%2Farticle-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fextractus%2Farticle-extractor/lists"}