https://github.com/miso-belica/jusText

Heuristic based boilerplate removal tool

html-parser html-parsing python text-extraction

Host: GitHub
URL: https://github.com/miso-belica/jusText
Owner: miso-belica
License: bsd-2-clause
Created: 2013-02-10T11:42:20.000Z (over 13 years ago)
Default Branch: main
Last Pushed: 2024-05-09T15:55:14.000Z (about 2 years ago)
Last Synced: 2024-10-22T11:41:00.903Z (over 1 year ago)
Topics: html-parser, html-parsing, python, text-extraction
Language: Python
Homepage: https://pypi.python.org/pypi/jusText
Size: 1.01 MB
Stars: 725
Watchers: 21
Forks: 79
Open Issues: 10
Metadata Files:
- Readme: README.rst
- Changelog: CHANGELOG.rst
- License: LICENSE.rst

README

.. _jusText: http://code.google.com/p/justext/
.. _Python: http://www.python.org/
.. _lxml: http://lxml.de/

jusText
=======
.. image:: https://github.com/miso-belica/jusText/actions/workflows/run-tests.yml/badge.svg
   :target: https://github.com/miso-belica/jusText/actions/workflows/run-tests.yml

jusText is a tool for removing boilerplate content, such as navigation
links, headers, and footers, from HTML pages. It is designed to preserve
mainly text containing full sentences and is therefore well suited for
creating linguistic resources such as Web corpora. You can also try it online.
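
The idea of keeping text that looks like full sentences can be illustrated
with a toy sketch. It is not the jusText implementation: the function name,
the tiny stop word set, and both thresholds are assumptions picked for
illustration only. A block is treated as boilerplate when it is short or
contains few common function words.

.. code-block:: python

    # Illustrative only: a toy stop word heuristic, not jusText's real algorithm.
    STOPWORDS = frozenset({
        "a", "an", "and", "are", "for", "in", "is", "it",
        "of", "on", "that", "the", "to", "with",
    })


    def looks_like_boilerplate(text, min_length=70, min_stopword_density=0.3):
        """Return True when text is short or contains few stop words.

        Both thresholds are hypothetical values chosen for this sketch.
        """
        words = text.lower().split()
        if len(text) < min_length or not words:
            return True
        # Strip trailing punctuation so "output." still matches "output".
        stopword_count = sum(
            1 for word in words if word.strip(".,;:!?") in STOPWORDS
        )
        return stopword_count / len(words) < min_stopword_density

Navigation bars tend to fail the stop word test even when they are long,
while ordinary prose passes it easily.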

This is a fork of the original (currently unmaintained) jusText_ code hosted
on Google Code.

Adaptations of the algorithm to other languages:

- C++
- Go
- Java

Some libraries using jusText:

- chirp
- lazynlp
- off-topic-memento-toolkit
- pears
- readability calculator
- sky

Some alternatives that were maintained as of January 2020:

- dragnet
- html2text
- inscriptis
- newspaper
- python-readability
- trafilatura

Installation
------------
Make sure you have Python_ 2.7+/3.5+ and pip installed (on Windows or Linux).
Then simply run:

.. code-block:: bash

    $ [sudo] pip install justext

Dependencies
------------
::

    lxml (version depends on your Python version)

Usage
-----
.. code-block:: bash

    $ python -m justext -s Czech -o text.txt http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
    $ python -m justext -s English -o plain_text.txt english_page.html
    $ python -m justext --help  # for more info

Python API
----------
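.. code-block:: python

    import requests
    import justext

    # Download the page and classify its paragraphs using the English stoplist.
    response = requests.get("http://planet.python.org/")
    paragraphs = justext.justext(response.content, justext.get_stoplist("English"))

    # Print only the paragraphs that were not classified as boilerplate.
    for paragraph in paragraphs:
        if not paragraph.is_boilerplate:
            print(paragraph.text)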
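
When only the cleaned text is needed, the filtering loop can be wrapped in a
small helper. The sketch below is a convenience and not part of the jusText
API: ``join_content`` is a hypothetical name, and it relies only on the
``text`` and ``is_boilerplate`` attributes used in the example above.

.. code-block:: python

    def join_content(paragraphs, separator="\n\n"):
        """Join the text of all paragraphs not flagged as boilerplate.

        ``paragraphs`` is any iterable of objects exposing ``text`` and
        ``is_boilerplate``, such as the result of ``justext.justext(...)``.
        """
        return separator.join(
            paragraph.text
            for paragraph in paragraphs
            if not paragraph.is_boilerplate
        )

With the ``paragraphs`` variable from the example above,
``join_content(paragraphs)`` would return the extracted text as one string.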

Testing
-------
Run tests via:

.. code-block:: bash

    $ py.test-2.7 && py.test-3.5 && py.test-3.6 && py.test-3.7 && py.test-3.8 && py.test-3.9

Acknowledgements
----------------
.. _`Natural Language Processing Centre`: http://nlp.fi.muni.cz/en/nlpc
.. _`Masaryk University in Brno`: http://nlp.fi.muni.cz/en
.. _PRESEMT: http://presemt.eu/
.. _`Lexical Computing Ltd.`: http://lexicalcomputing.com/
.. _`PhD research`: http://is.muni.cz/th/45523/fi_d/phdthesis.pdf

This software was developed at the `Natural Language Processing Centre`_ of
`Masaryk University in Brno`_ with financial support from PRESEMT_ and
`Lexical Computing Ltd.`_ It also relates to the `PhD research`_ of Jan Pomikálek.
