Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ArchiveBox/readability-extractor
JavaScript/Node wrapper around Mozilla's Readability library so that ArchiveBox can call it as a one-shot CLI command to extract each page's article text.
archivebox internet-archiving node readability wrapper
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/ArchiveBox/readability-extractor
- Owner: ArchiveBox
- Created: 2020-08-06T14:20:01.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2024-04-11T08:19:12.000Z (5 months ago)
- Last Synced: 2024-05-01T11:38:28.647Z (5 months ago)
- Topics: archivebox, internet-archiving, node, readability, wrapper
- Language: JavaScript
- Homepage:
- Size: 93.8 KB
- Stars: 32
- Watchers: 4
- Forks: 13
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
README
# Readability-Extractor
This is a tiny JS wrapper library around Mozilla's article-text extraction tool https://github.com/mozilla/readability.
It's designed to be used as an [ArchiveBox](https://github.com/pirate/ArchiveBox) extractor.
## Install
```bash
npm install -g 'git+https://github.com/pirate/readability-extractor'

# which is equivalent to this:
curl https://raw.githubusercontent.com/pirate/readability-extractor/master/readability-extractor > /usr/local/bin/readability-extractor
chmod +x /usr/local/bin/readability-extractor
```
## Usage
```bash
# Arguments: HTML file, original URL, charset; the JSON result is written to stdout
readability-extractor some_article.html 'https://example.com/original/url/some/article.html' 'UTF-8' > some_article.json
```
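Because the command writes its result as JSON to stdout, a downstream script can parse that output and pick out the fields it needs. The following is a minimal sketch, assuming the output has the shape documented in this section; the `summarizeArticle` helper is hypothetical and not part of readability-extractor.

```javascript
// Hypothetical helper (not part of readability-extractor): parse the
// extractor's JSON output and keep only a few fields of interest.
function summarizeArticle(jsonText) {
  const article = JSON.parse(jsonText);
  return {
    title: article.title || '(untitled)',
    byline: article.byline || null,
    // `length` as reported in the extractor output
    length: article.length || 0,
    // first 80 characters of the plain-text body as a preview
    preview: (article.textContent || '').slice(0, 80),
  };
}

// Inline sample mirroring the documented output fields (illustrative values).
const sample = JSON.stringify({
  title: 'Title autodetected from article html',
  byline: 'Autodetected author...',
  length: 1337,
  textContent: 'abc some article body text...',
});

console.log(summarizeArticle(sample));
```

With a real output file, the same helper could be called as `summarizeArticle(require('fs').readFileSync('some_article.json', 'utf8'))`. The sample output that follows shows the fields read here.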
```json
{
"title": "Title autodetected from article html",
"byline": "Autodetected author...",
"excerpt": "Autodetected short description",
"dir": "ltr",
"length": 1337,
"lang": null,
"charset": "UTF-8",
"content": "abc some article body text...",
"textContent": "abc some article body text..."
}
```
## ArchiveBox Integration
```bash
# You don't have to run these commands usually.
# Readability is on by default and ArchiveBox will find any
# installed version in your $PATH automatically.

# However, if you explicitly want to turn readability on
# and/or specify a manual path to the binary, you can do this:
archivebox config --set SAVE_READABILITY=True
archivebox config --set READABILITY_BINARY="$(which readability-extractor)"

# test archiving oneshot using only singlefile+readability
archivebox add --extract=singlefile,readability 'https://example.com'
```
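The extractor is meant to be called as a one-shot CLI command by another program. As a rough illustration of that pattern (this is not ArchiveBox's actual implementation), the sketch below shows how a Node script might run the binary once per page and parse the JSON it prints. The helper names `buildExtractorArgs` and `runExtractor` are hypothetical; the argument order is taken from the Usage example, and the binary is assumed to be on `$PATH`.

```javascript
const { execFileSync } = require('child_process');

// Hypothetical helper: argument order follows the Usage example
// (HTML file, original URL, charset).
function buildExtractorArgs(htmlPath, url, charset = 'UTF-8') {
  if (!htmlPath || !url) {
    throw new Error('htmlPath and url are required');
  }
  return [htmlPath, url, charset];
}

// Hypothetical helper: run the binary once and parse the JSON on stdout.
// `binary` defaults to the name installed on $PATH; pass a full path to override.
function runExtractor(htmlPath, url, { binary = 'readability-extractor', charset } = {}) {
  const stdout = execFileSync(binary, buildExtractorArgs(htmlPath, url, charset), {
    encoding: 'utf8',            // return stdout as a string instead of a Buffer
    maxBuffer: 64 * 1024 * 1024, // extracted articles can be large
  });
  return JSON.parse(stdout);
}
```

A caller would then use something like `runExtractor('some_article.html', 'https://example.com/original/url/some/article.html')` and read `title` or `textContent` from the returned object. Using `execFileSync` with an argument array avoids shell quoting issues with URLs.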