{"id":13583423,"url":"https://github.com/ArchiveBox/readability-extractor","last_synced_at":"2025-04-06T18:32:19.790Z","repository":{"id":46830339,"uuid":"285590444","full_name":"ArchiveBox/readability-extractor","owner":"ArchiveBox","description":"Javascript/Node wrapper around Mozilla's Readability library so that ArchiveBox can call it as a oneshot CLI command to extract each page's article text.","archived":false,"fork":false,"pushed_at":"2024-04-11T08:19:12.000Z","size":96,"stargazers_count":32,"open_issues_count":0,"forks_count":13,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-05-01T11:38:28.647Z","etag":null,"topics":["archivebox","internet-archiving","node","readability","wrapper"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ArchiveBox.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"pirate","patreon":"theSquashSH"}},"created_at":"2020-08-06T14:20:01.000Z","updated_at":"2024-08-01T16:28:57.560Z","dependencies_parsed_at":"2024-01-03T04:43:22.611Z","dependency_job_id":"29649ee0-7147-4622-b415-7f999754d1dd","html_url":"https://github.com/ArchiveBox/readability-extractor","commit_stats":{"total_commits":26,"total_committers":4,"mean_commits":6.5,"dds":0.5,"last_synced_commit":"ef2cf5431f4cf5bfeee96b32af490a301b17c94c"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveBox%2Freadability-extractor","tags_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveBox%2Freadability-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveBox%2Freadability-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveBox%2Freadability-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ArchiveBox","download_url":"https://codeload.github.com/ArchiveBox/readability-extractor/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247531229,"owners_count":20953918,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["archivebox","internet-archiving","node","readability","wrapper"],"created_at":"2024-08-01T15:03:28.187Z","updated_at":"2025-04-06T18:32:19.239Z","avatar_url":"https://github.com/ArchiveBox.png","language":"JavaScript","readme":"# Readability-Extractor\n\nThis is a tiny JS wrapper library around Mozilla's article-text extraction tool https://github.com/mozilla/readability.\n\nIt's designed to be used as an [ArchiveBox](https://github.com/pirate/ArchiveBox) extractor.\n\n## Install\n\n```bash\nnpm install -g 'git+https://github.com/pirate/readability-extractor'\n\n# which is equivalent to this:\ncurl https://raw.githubusercontent.com/pirate/readability-extractor/master/readability-extractor \u003e /usr/local/bin/readability-extractor\nchmod +x /usr/local/bin/readability-extractor\n```\n\n## Usage\n```bash\n# readability-extractor \u003cinput HTML path\u003e \u003coriginal url?\u003e \u003csuggested 
encoding?\u003e \u003e \u003coutput JSON path\u003e\nreadability-extractor some_article.html 'https://example.com/original/url/some/article.html' 'UTF-8' \u003e some_article.json\n```\n```json\n{\n    \"title\": \"Title autodetected from article html\",\n    \"byline\": \"Autodetected author...\",\n    \"excerpt\": \"Autodetected short description\",\n    \"dir\": \"ltr\",\n    \"length\": 1337,\n    \"lang\": null,\n    \"charset\": \"UTF-8\",\n    \"content\": \"\u003cdiv id=\\\"readability-page-1\\\" class=\\\"page\\\"\u003eabc some article body text...\u003c/div\u003e\",\n    \"textContent\": \"abc some article body text...\"\n}\n```\n\n## ArchiveBox Integration\n\n```bash\n# You usually don't have to run these commands.\n# Readability is on by default and ArchiveBox will find any\n# installed version in your $PATH automatically.\n\n# However, if you explicitly want to turn readability on\n# and/or specify a manual path to the binary, you can do this:\narchivebox config --set SAVE_READABILITY=True\narchivebox config --set READABILITY_BINARY=\"$(which readability-extractor)\"\n\n# Test a oneshot archive using only singlefile+readability\narchivebox add --extract=singlefile,readability 'https://example.com'\n```\n","funding_links":["https://github.com/sponsors/pirate","https://patreon.com/theSquashSH"],"categories":["JavaScript"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FArchiveBox%2Freadability-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FArchiveBox%2Freadability-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FArchiveBox%2Freadability-extractor/lists"}