{"id":16777043,"url":"https://github.com/fb55/readabilitysax","last_synced_at":"2025-04-12T18:49:01.708Z","repository":{"id":1550438,"uuid":"1930909","full_name":"fb55/readabilitySAX","owner":"fb55","description":"a fast and platform independent readability port (JS)","archived":false,"fork":false,"pushed_at":"2023-11-06T04:27:01.000Z","size":449,"stargazers_count":245,"open_issues_count":10,"forks_count":35,"subscribers_count":8,"default_branch":"master","last_synced_at":"2024-12-06T18:32:14.428Z","etag":null,"topics":["javascript","readability","readabilitysax"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fb55.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2011-06-21T19:02:39.000Z","updated_at":"2024-11-11T19:20:27.000Z","dependencies_parsed_at":"2024-06-18T13:56:26.030Z","dependency_job_id":"dcb5269b-9887-497e-a929-6b56554428e5","html_url":"https://github.com/fb55/readabilitySAX","commit_stats":{"total_commits":352,"total_committers":10,"mean_commits":35.2,"dds":"0.16761363636363635","last_synced_commit":"a51ac74d92330d27be5bd8b5cd24ba6a9dd7c77a"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fb55%2FreadabilitySAX","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fb55%2FreadabilitySAX/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fb55%2FreadabilitySAX/releases","manifests_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/fb55%2FreadabilitySAX/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fb55","download_url":"https://codeload.github.com/fb55/readabilitySAX/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248618218,"owners_count":21134199,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["javascript","readability","readabilitysax"],"created_at":"2024-10-13T07:11:45.507Z","updated_at":"2025-04-12T18:49:01.680Z","avatar_url":"https://github.com/fb55.png","language":"HTML","readme":"# readabilitySAX\n\na fast and platform independent readability port\n\n## About\n\nThis is a port of the algorithm used by the\n[Readability](http://code.google.com/p/arc90labs-readability/) bookmarklet to\nextract relevant pieces of information from websites, using a SAX parser.\n\nThe advantage over other ports, e.g.\n[arrix/node-readability](https://github.com/arrix/node-readability), is a\nsmaller memory footprint and a much faster execution. In my tests, most pages,\neven large ones, were finished within 15ms (on node, see below for more\ninformation). It works with Rhino, so it runs on\n[YQL](http://developer.yahoo.com/yql \"Yahoo! Query Language\"), which may have\ninteresting uses. 
And it works within a browser.\n\nThe Readability extraction algorithm was completely ported, but some adjustments\nwere made:\n\n-   `\u003carticle\u003e` and `\u003csection\u003e` tags are recognized and gain a higher value\n\n-   If a heading is part of the page's `\u003ctitle\u003e`, it is removed (Readability\n    removed any single `\u003ch2\u003e`, and ignored other tags)\n\n-   `henry` and `instapaper-body` are classes to show an algorithm like this\n    where the content is. readabilitySAX recognizes them and adds additional\n    points\n\n-   Every bit of code that was taken from the original algorithm was optimized,\n    eg. RegExps should now perform faster (they were optimized \u0026 use\n    `RegExp#test` instead of `String#match`, which doesn't force the interpreter\n    to build an array)\n\n-   Some improvements made by\n    [GGReadability](https://github.com/curthard89/COCOA-Stuff/tree/master/GGReadability)\n    (an Obj-C port of Readability) were adopted\n    -   Images get additional scores when their `height` or `width` attributes\n        are high - icon sized images (\u003c= 32px) get skipped\n    -   Additional classes \u0026 ids are checked\n\n## How To\n\n### Install readabilitySAX\n\n    npm install readabilitySAX\n\n##### CLI\n\nA command line interface (CLI) may be installed via\n\n    npm install -g readabilitySAX\n\nIt's then available via\n\n    readability \u003cdomain\u003e [\u003cformat\u003e]\n\nTo get this readme, just run\n\n    readability https://github.com/FB55/readabilitySAX\n\nThe format is optional (it's either `text` or `html`, the default value is\n`text`).\n\n### Usage\n\n##### Node\n\nJust run `require(\"readabilitySAX\")`. You'll get an object containing three\nmethods:\n\n-   `Readability(settings)`: The readability constructor. It works as a handler\n    for `htmlparser2`. 
Read more about it\n    [in the wiki](https://github.com/FB55/readabilitySAX/wiki/The-Readability-constructor)!\n\n-   `WritableStream(settings, cb)`: A constructor that unites `htmlparser2` and\n    the `Readability` constructor. It's a writable stream, so simply `.write`\n    all your data to it. Your callback will be called once `.end` has been called.\n    Bonus: You can also `.pipe` data into it!\n\n-   `createWritableStream(settings, cb)`: Returns a new instance of the\n    `WritableStream`. (It's a simple factory method.)\n\nThere are two methods available that are deprecated and **will be removed** in a\nfuture version:\n\n-   `get(link, [settings], callback)`: Gets a webpage and processes it.\n\n-   `process(data)`: Takes a string, runs readabilitySAX and returns the page.\n\n**Please don't use those two methods anymore**. Streams are the way you should\nbuild interfaces in node, and that's what I want to encourage people to use.\n\n##### Browsers\n\nI started to implement simplified SAX-\"parsers\" for Rhino/YQL (using E4X) and\nthe browser (using the DOM) to increase the overall performance on those\nplatforms. The DOM version is inside the `/browsers` dir.\n\nA demo of how to use readabilitySAX inside a browser may be found at\n[jsFiddle](http://jsfiddle.net/pXqYR/embedded/). Some basic example files are\ninside the `/browsers` directory.\n\n##### YQL\n\nA table using E4X-based events is available as the community table\n`redabilitySAX`, as well as\n[here](https://github.com/FB55/yql-tables/tree/master/readabilitySAX).\n\n## Parsers (on node)\n\nMost SAX parsers (such as sax.js) fail when a document is malformed XML, even if it's\ncorrect HTML. readabilitySAX should be used with\n[htmlparser2](http://npm.im/htmlparser2), my fork of the `htmlparser`-module\n(used by eg. `jsdom`), which corrects most faults. 
It's listed as a dependency,\nso npm should install it with readabilitySAX.\n\n## Performance\n\n##### Speed\n\nUsing a package of 724 pages from [CleanEval](http://cleaneval.sigwac.org.uk)\n(their website seems to be down, try to google it), readabilitySAX processed all\nof them in 5768 ms, that's an average of 7.97 ms per page.\n\nThe benchmark was done using `tests/benchmark.js` on a MacBook (late 2010) and\nis probably far from perfect.\n\nPerformance is the main goal of this project. The current speed should be good\nenough to run readabilitySAX on a single-threaded web server with an average\nnumber of requests. That's an accomplishment!\n\n##### Accuracy\n\nThe main goal of CleanEval is to evaluate the accuracy of an algorithm.\n\n**_// TODO_**\n\n## Todo\n\n-   Add documentation \u0026 examples\n-   Add support for URLs containing hash-bangs (`#!`)\n-   Allow fetching articles with more than one page\n-   Don't remove all images inside `\u003ca\u003e` tags\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffb55%2Freadabilitysax","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffb55%2Freadabilitysax","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffb55%2Freadabilitysax/lists"}