{"id":13393341,"url":"https://github.com/ageitgey/node-unfluff","last_synced_at":"2025-05-14T10:11:06.783Z","repository":{"id":18228292,"uuid":"21369952","full_name":"ageitgey/node-unfluff","owner":"ageitgey","description":"Automatically extract body content (and other cool stuff) from an html document","archived":false,"fork":false,"pushed_at":"2023-05-26T18:52:19.000Z","size":1279,"stargazers_count":2155,"open_issues_count":37,"forks_count":218,"subscribers_count":56,"default_branch":"master","last_synced_at":"2025-04-09T22:44:25.595Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"djangonauts/django-rest-framework-gis","license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ageitgey.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2014-07-01T00:09:17.000Z","updated_at":"2025-04-05T21:40:57.000Z","dependencies_parsed_at":"2024-01-13T17:10:59.842Z","dependency_job_id":"51ed2c27-2a01-4fe8-bd16-a553135f41da","html_url":"https://github.com/ageitgey/node-unfluff","commit_stats":{"total_commits":101,"total_committers":19,"mean_commits":5.315789473684211,"dds":0.5643564356435644,"last_synced_commit":"cf8131d85f0ab2a791be2fbb609771a662dd73f5"},"previous_names":[],"tags_count":21,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ageitgey%2Fnode-unfluff","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ageitgey%2Fnode-unfluff/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ageitgey%2Fnode-unfluff/releases","manifests_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ageitgey%2Fnode-unfluff/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ageitgey","download_url":"https://codeload.github.com/ageitgey/node-unfluff/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248695312,"owners_count":21146952,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-30T17:00:50.664Z","updated_at":"2025-04-13T09:56:20.218Z","avatar_url":"https://github.com/ageitgey.png","language":"HTML","readme":"# unfluff\n\nAn automatic web page content extractor for Node.js!\n\n[![Build Status](https://travis-ci.org/ageitgey/node-unfluff.svg?branch=master)](https://travis-ci.org/ageitgey/node-unfluff)\n\nAutomatically grab the main\ntext out of a webpage like this:\n\n```\nextractor = require('unfluff');\ndata = extractor(my_html_data);\nconsole.log(data.text);\n```\n\nIn other words, it turns pretty webpages into boring plain text/json data:\n\n![](https://cloud.githubusercontent.com/assets/896692/3478577/b82f39cc-033d-11e4-9e68-226c9a7bc1c0.jpg)\n\nThis might be useful for:\n- Writing your own Instapaper clone\n- Easily building ML data sets from web pages\n- Reading your favorite articles from the console?\n\nPlease don't use this for:\n- Stealing other peoples' web pages\n- Making crappy spam sites with stolen content from other sites\n- Being a jerk\n\n## Credits / Thanks\n\nThis library is largely based on [python-goose](https://github.com/grangier/python-goose)\nby [Xavier 
Grangier](https://github.com/grangier) which is in turn based on [goose](https://github.com/GravityLabs/goose)\nby [Gravity Labs](https://github.com/GravityLabs). However, it's not an exact\nport so it may behave differently on some pages and the feature set is a little\nbit different.  If you are looking for a Python or Scala/Java/JVM solution,\ncheck out those libraries!\n\n## Install\n\nTo install the command-line `unfluff` utility:\n\n    npm install -g unfluff\n\nTo install the `unfluff` module for use in your Node.js project:\n\n    npm install --save unfluff\n\n## Usage\n\nYou can use `unfluff` from Node.js or right on the command line!\n\n### Extracted data elements\n\nThis is what `unfluff` will try to grab from a web page:\n- `title` - The document's title (from the \u0026lt;title\u0026gt; tag)\n- `softTitle` - A version of `title` with less truncation\n- `date` - The document's publication date\n- `copyright` - The document's copyright line, if present\n- `author` - The document's author\n- `publisher` - The document's publisher (website name)\n- `text` - The main text of the document with all the junk thrown away\n- `image` - The main image for the document (what's used by Facebook, etc.)\n- `videos` - An array of videos that were embedded in the article. Each video has src, width and height.\n- `tags` - Any tags or keywords that could be found by checking \u0026lt;rel\u0026gt; tags or by looking at href urls.\n- `canonicalLink` - The [canonical url](https://support.google.com/webmasters/answer/139066?hl=en) of the document, if given.\n- `lang` - The language of the document, either detected or supplied by you.\n- `description` - The description of the document, from \u0026lt;meta\u0026gt; tags\n- `favicon` - The url of the document's [favicon](http://en.wikipedia.org/wiki/Favicon).\n- `links` - An array of links embedded within the article text. 
(text and href for each)\n\nThis is returned as a simple JSON object.\n\n### Command line interface\n\nYou can pass a webpage to unfluff and it will try to parse out the interesting\nbits.\n\nYou can either pass in a file name:\n\n```\nunfluff my_file.html\n```\n\nOr you can pipe it in:\n\n```\ncurl -s \"http://somesite.com/page\" | unfluff\n```\n\nYou can easily chain this together with other Unix commands to do cool stuff.\nFor example, you can download a web page, parse it and then use\n[jq](http://stedolan.github.io/jq/) to print just the body text.\n\n```\ncurl -s \"https://www.polygon.com/2014/6/26/5842180/shovel-knight-review-pc-3ds-wii-u\" | unfluff | jq -r .text\n```\n\nAnd here's how to find the top 10 most common words in an article:\n\n```\ncurl -s \"https://www.polygon.com/2014/6/26/5842180/shovel-knight-review-pc-3ds-wii-u\" | unfluff | jq -r .text | tr -cs '[:alnum:]' '[\\n*]' | sort | uniq -c | sort -nr | head -10\n```\n\n### Module Interface\n\n#### `extractor(html, language)`\n\nhtml: The html you want to parse\n\nlanguage (optional): The document's two-letter language code. 
This will be\nauto-detected as best as possible, but there might be cases where you want to\noverride it.\n\nThe extraction algorithm depends heavily on the language, so it probably won't work\nif you have the language set incorrectly.\n\n```javascript\nextractor = require('unfluff');\n\ndata = extractor(my_html_data);\n```\n\nOr supply the language code yourself:\n\n```javascript\nextractor = require('unfluff');\n\ndata = extractor(my_html_data, 'en');\n```\n\n`data` will then be a json object that looks like this:\n\n```json\n{\n  \"title\": \"Shovel Knight review\",\n  \"softTitle\": \"Shovel Knight review: rewrite history\",\n  \"date\": \"2014-06-26T13:00:03Z\",\n  \"copyright\": \"2016 Vox Media Inc Designed in house\",\n  \"author\": [\n    \"Griffin McElroy\"\n  ],\n  \"publisher\": \"Polygon\",\n  \"text\": \"Shovel Knight is inspired by the past in all the right ways — but it's far from stuck in it. [.. snip ..]\",\n  \"image\": \"http://cdn2.vox-cdn.com/uploads/chorus_image/image/34834129/jellyfish_hero.0_cinema_1280.0.png\",  \n  \"tags\": [],\n  \"videos\": [],\n  \"canonicalLink\": \"http://www.polygon.com/2014/6/26/5842180/shovel-knight-review-pc-3ds-wii-u\",\n  \"lang\": \"en\",\n  \"description\": \"Shovel Knight is inspired by the past in all the right ways — but it's far from stuck in it.\",\n  \"favicon\": \"http://cdn1.vox-cdn.com/community_logos/42931/favicon.ico\",\n  \"links\": [\n    { \"text\": \"Six Thirty\", \"href\": \"http://www.sixthirty.co/\" }\n  ]\n}\n```\n\n#### `extractor.lazy(html, language)`\n\nLazy version of `extractor(html, language)`.\n\nThe text extraction algorithm can be somewhat slow on large documents.  
If you\nonly need access to elements like `title` or `image`, you can use the\nlazy extractor to get them more quickly without running the full processing\npipeline.\n\nThis returns an object just like the regular extractor except all fields\nare replaced by functions and evaluation is only done when you call those\nfunctions.\n\n```javascript\nextractor = require('unfluff');\n\ndata = extractor.lazy(my_html_data, 'en');\n\n// Access whichever data elements you need directly.\nconsole.log(data.title());\nconsole.log(data.softTitle());\nconsole.log(data.date());\nconsole.log(data.copyright());\nconsole.log(data.author());\nconsole.log(data.publisher());\nconsole.log(data.text());\nconsole.log(data.image());\nconsole.log(data.tags());\nconsole.log(data.videos());\nconsole.log(data.canonicalLink());\nconsole.log(data.lang());\nconsole.log(data.description());\nconsole.log(data.favicon());\n```\n\nSome of these data elements require calculating intermediate representations\nof the HTML document. Everything is cached, so looking up multiple data elements\nand looking them up multiple times should be as fast as possible.\n\n### Demo\n\nThe easiest way to try out `unfluff` is to just install it:\n\n```\n$ npm install -g unfluff\n$ curl -s \"http://www.cnn.com/2014/07/07/world/americas/mexico-earthquake/index.html\" | unfluff\n```\n\nBut if you can't be bothered, you can check out\n[fetch text](http://fetchtext.herokuapp.com/). It's a site by\n[Andy Jiang](https://twitter.com/andyjiang) that uses `unfluff`. You send an\nemail with a url and it emails back with the cleaned content of that url. It\nshould give you a good idea of how `unfluff` handles different urls.\n\n### What is broken\n\n- Parsing web pages in languages other than English is poorly tested and is probably\n  buggy right now.\n- This definitely won't work yet for languages like Chinese / Arabic / Korean /\n  etc. that need smarter word tokenization.\n- This has only been tested on a limited set of web pages. 
There are probably lots\n  of lurking bugs with web pages that haven't been tested yet.\n","funding_links":[],"categories":["HTML","Tools","Content extraction"],"sub_categories":["Node"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fageitgey%2Fnode-unfluff","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fageitgey%2Fnode-unfluff","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fageitgey%2Fnode-unfluff/lists"}