https://github.com/na9da/haskell-justext
Tool for removing boilerplate from HTML pages
- Host: GitHub
- URL: https://github.com/na9da/haskell-justext
- Owner: na9da
- License: bsd-3-clause
- Created: 2018-02-03T05:33:24.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2018-02-03T05:39:25.000Z (almost 8 years ago)
- Last Synced: 2025-03-25T08:43:04.208Z (10 months ago)
- Topics: haskell, justext, library
- Language: Haskell
- Size: 6.84 KB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# haskell-jusText
This is a Haskell clone of the Python [jusText](https://github.com/miso-belica/jusText) project. It removes boilerplate content (navigation, headers, footers and similar) from HTML pages, leaving just the main content. jusText applies a set of heuristics to identify the main content of a page. You can read more about them in the [thesis work](https://is.muni.cz/th/45523/fi_d/phdthesis.pdf) by Jan Pomikálek.
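To give a rough idea of what these heuristics look like, here is a minimal sketch of the context-free classification step as it is done in the Python jusText: each paragraph is labelled by its length, link density and stopword density. This is not this library's API. The `Paragraph` and `Thresholds` types and all function names are hypothetical, the default values are the ones the Python implementation is believed to use, and the sketch leaves out the copyright-symbol check and the later context-sensitive pass that reclassifies short and near-good paragraphs.

```haskell
import Data.Char (toLower)

-- | Paragraph classes assigned by the context-free pass.
data ParagraphClass = Good | NearGood | Short | Bad
  deriving (Eq, Show)

-- | Hypothetical paragraph record (not the library's own type).
data Paragraph = Paragraph
  { paraText      :: String -- ^ visible text of the paragraph
  , linkCharCount :: Int    -- ^ number of characters inside links
  }

-- | Tunable thresholds; names are illustrative.
data Thresholds = Thresholds
  { lengthLow      :: Int
  , lengthHigh     :: Int
  , stopwordsLow   :: Double
  , stopwordsHigh  :: Double
  , maxLinkDensity :: Double
  }

-- | Assumed defaults, taken from the Python jusText.
defaultThresholds :: Thresholds
defaultThresholds = Thresholds
  { lengthLow      = 70
  , lengthHigh     = 200
  , stopwordsLow   = 0.30
  , stopwordsHigh  = 0.32
  , maxLinkDensity = 0.2
  }

-- | Safe division that yields 0 for an empty denominator.
ratio :: Int -> Int -> Double
ratio _ 0 = 0
ratio n d = fromIntegral n / fromIntegral d

-- | Share of the paragraph's characters that sit inside links.
linkDensity :: Paragraph -> Double
linkDensity p = ratio (linkCharCount p) (length (paraText p))

-- | Share of whitespace-separated words found in the (lowercase) stoplist.
stopwordDensity :: [String] -> Paragraph -> Double
stopwordDensity stoplist p = ratio (length (filter isStop ws)) (length ws)
  where
    ws       = words (paraText p)
    isStop w = map toLower w `elem` stoplist

-- | Context-free classification of a single paragraph.
classify :: Thresholds -> [String] -> Paragraph -> ParagraphClass
classify t stoplist p
  | linkDensity p > maxLinkDensity t = Bad
  | len < lengthLow t = if linkCharCount p > 0 then Bad else Short
  | sd >= stopwordsHigh t = if len > lengthHigh t then Good else NearGood
  | sd >= stopwordsLow t = NearGood
  | otherwise = Bad
  where
    len = length (paraText p)
    sd  = stopwordDensity stoplist p
```

The intuition is that main content tends to be long, ordinary prose (many stopwords, few links), while boilerplate tends to be short, link-heavy, or made of keyword lists with few stopwords.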
# Building
```
stack install
haskell-jusText
```
Stopword files for different languages are available in the [original repo](https://github.com/miso-belica/jusText/tree/dev/justext/stoplists).
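
A minimal sketch of reading one of these files, assuming the stoplists are plain UTF-8 text with one word per line. The `parseStoplist` and `loadStoplist` helpers are hypothetical and not part of this library:

```haskell
import Data.Char (isSpace, toLower)
import Data.List (dropWhileEnd)
import System.IO (IOMode (ReadMode), hGetContents, hSetEncoding, utf8, withFile)

-- | Turn file contents into a list of lowercase stopwords,
-- trimming surrounding whitespace and skipping blank lines.
parseStoplist :: String -> [String]
parseStoplist = filter (not . null) . map (map toLower . trim) . lines
  where
    trim = dropWhileEnd isSpace . dropWhile isSpace

-- | Read a stoplist file as UTF-8, regardless of the system locale.
loadStoplist :: FilePath -> IO [String]
loadStoplist path = withFile path ReadMode $ \h -> do
  hSetEncoding h utf8
  contents <- hGetContents h
  -- Force the whole file before the handle is closed (lazy IO).
  length contents `seq` pure (parseStoplist contents)
```

For example, `loadStoplist "English.txt"` would return the words from a local copy of a stoplist, where the path is whatever you saved the file as.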