{"id":17166611,"url":"https://github.com/miso-belica/justext","last_synced_at":"2025-04-13T23:56:30.033Z","repository":{"id":6872895,"uuid":"8121947","full_name":"miso-belica/jusText","owner":"miso-belica","description":"Heuristic based boilerplate removal tool","archived":false,"fork":false,"pushed_at":"2024-05-09T15:55:14.000Z","size":1061,"stargazers_count":725,"open_issues_count":10,"forks_count":79,"subscribers_count":21,"default_branch":"main","last_synced_at":"2024-10-29T15:13:03.506Z","etag":null,"topics":["html-parser","html-parsing","python","text-extraction"],"latest_commit_sha":null,"homepage":"https://pypi.python.org/pypi/jusText","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/miso-belica.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGELOG.rst","contributing":null,"funding":null,"license":"LICENSE.rst","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2013-02-10T11:42:20.000Z","updated_at":"2024-10-17T10:15:58.000Z","dependencies_parsed_at":"2024-06-18T14:03:48.354Z","dependency_job_id":"855373f3-7bbc-40c4-bd5e-67659ab68ebd","html_url":"https://github.com/miso-belica/jusText","commit_stats":{"total_commits":180,"total_committers":5,"mean_commits":36.0,"dds":"0.23333333333333328","last_synced_commit":"7fae5d457613cc96cf9ccb070e4aa2e70455f0e2"},"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miso-belica%2FjusText","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miso-belica%2FjusText/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/m
iso-belica%2FjusText/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miso-belica%2FjusText/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/miso-belica","download_url":"https://codeload.github.com/miso-belica/jusText/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248799925,"owners_count":21163403,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["html-parser","html-parsing","python","text-extraction"],"created_at":"2024-10-14T23:06:08.610Z","updated_at":"2025-04-13T23:56:30.011Z","avatar_url":"https://github.com/miso-belica.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":".. _jusText: http://code.google.com/p/justext/\n.. _Python: http://www.python.org/\n.. _lxml: http://lxml.de/\n\njusText\n=======\n.. image:: https://github.com/miso-belica/jusText/actions/workflows/run-tests.yml/badge.svg\n  :target: https://github.com/miso-belica/jusText/actions/workflows/run-tests.yml\n\nProgram jusText is a tool for removing boilerplate content, such as navigation\nlinks, headers, and footers from HTML pages. It is\n`designed \u003cdoc/algorithm.rst\u003e`_ to preserve\nmainly text containing full sentences and it is therefore well suited for\ncreating linguistic resources such as Web corpora. 
You can\n`try it online \u003chttp://nlp.fi.muni.cz/projects/justext/\u003e`_.\n\nThis is a fork of original (currently unmaintained) code of jusText_ hosted\non Google Code.\n\n\nAdaptations of the algorithm to other languages:\n\n- `C++ \u003chttps://github.com/endredy/jusText\u003e`_\n- `Go \u003chttps://github.com/JalfResi/justext\u003e`_\n- `Java \u003chttps://github.com/wizenoze/justext-java\u003e`_\n\n\nSome libraries using jusText:\n\n- `chirp \u003chttps://github.com/9b/chirp\u003e`_\n- `lazynlp \u003chttps://github.com/chiphuyen/lazynlp\u003e`_\n- `off-topic-memento-toolkit \u003chttps://github.com/oduwsdl/off-topic-memento-toolkit\u003e`_\n- `pears \u003chttps://github.com/PeARSearch/PeARS-orchard\u003e`_\n- `readability calculator \u003chttps://github.com/joaopalotti/readability_calculator\u003e`_\n- `sky \u003chttps://github.com/kootenpv/sky\u003e`_\n\n\nSome currently (Jan 2020) maintained alternatives:\n\n- `dragnet \u003chttps://github.com/dragnet-org/dragnet\u003e`_\n- `html2text \u003chttps://github.com/Alir3z4/html2text\u003e`_\n- `inscriptis \u003chttps://github.com/weblyzard/inscriptis\u003e`_\n- `newspaper \u003chttps://github.com/codelucas/newspaper\u003e`_\n- `python-readability \u003chttps://github.com/buriy/python-readability\u003e`_\n- `trafilatura \u003chttps://github.com/adbar/trafilatura\u003e`_\n\n\nInstallation\n------------\nMake sure you have Python_ 2.7+/3.5+ and `pip \u003chttps://pip.pypa.io/en/stable/\u003e`_\n(`Windows \u003chttp://docs.python-guide.org/en/latest/starting/install/win/\u003e`_,\n`Linux \u003chttp://docs.python-guide.org/en/latest/starting/install/linux/\u003e`_) installed.\nRun simply:\n\n.. code-block:: bash\n\n  $ [sudo] pip install justext\n\n\nDependencies\n------------\n::\n\n  lxml (version depends on your Python version)\n\n\nUsage\n-----\n.. 
code-block:: bash\n\n  $ python -m justext -s Czech -o text.txt http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/\n  $ python -m justext -s English -o plain_text.txt english_page.html\n  $ python -m justext --help # for more info\n\n\nPython API\n----------\n.. code-block:: python\n\n  import requests\n  import justext\n\n  response = requests.get(\"http://planet.python.org/\")\n  paragraphs = justext.justext(response.content, justext.get_stoplist(\"English\"))\n  for paragraph in paragraphs:\n    if not paragraph.is_boilerplate:\n      print(paragraph.text)\n\n\nTesting\n-------\nRun tests via\n\n.. code-block:: bash\n\n  $ py.test-2.7 \u0026\u0026 py.test-3.5 \u0026\u0026 py.test-3.6 \u0026\u0026 py.test-3.7 \u0026\u0026 py.test-3.8 \u0026\u0026 py.test-3.9\n\n\nAcknowledgements\n----------------\n.. _`Natural Language Processing Centre`: http://nlp.fi.muni.cz/en/nlpc\n.. _`Masaryk University in Brno`: http://nlp.fi.muni.cz/en\n.. _PRESEMT: http://presemt.eu/\n.. _`Lexical Computing Ltd.`: http://lexicalcomputing.com/\n.. _`PhD research`: http://is.muni.cz/th/45523/fi_d/phdthesis.pdf\n\nThis software has been developed at the `Natural Language Processing Centre`_ of\n`Masaryk University in Brno`_ with financial support from PRESEMT_ and\n`Lexical Computing Ltd.`_ It also relates to the `PhD research`_ of Jan Pomikálek.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmiso-belica%2Fjustext","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmiso-belica%2Fjustext","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmiso-belica%2Fjustext/lists"}