Ultimate Website Sitemap Parser
- Host: GitHub
- URL: https://github.com/gatenlp/ultimate-sitemap-parser
- Owner: GateNLP
- License: gpl-3.0
- Created: 2018-11-27T10:05:18.000Z (about 6 years ago)
- Default Branch: main
- Last Pushed: 2024-12-24T01:18:24.000Z (11 days ago)
- Last Synced: 2024-12-29T15:19:07.812Z (5 days ago)
- Topics: python, python-3, python3, robots-txt, sitemap, sitemap-xml, xml-sitemap, xml-sitemap-parser
- Language: Python
- Homepage: https://mediacloud.org/
- Size: 345 KB
- Stars: 186
- Watchers: 11
- Forks: 65
- Open Issues: 5
Metadata Files:
- Readme: README.rst
- Contributing: docs/contributing.rst
- License: LICENSE
README
Ultimate Sitemap Parser
-----------------------

.. image:: https://img.shields.io/pypi/pyversions/ultimate-sitemap-parser
   :alt: PyPI - Python Version
   :target: https://github.com/GateNLP/ultimate-sitemap-parser

.. image:: https://img.shields.io/pypi/v/ultimate-sitemap-parser
   :alt: PyPI - Version
   :target: https://pypi.org/project/ultimate-sitemap-parser/

.. image:: https://img.shields.io/conda/vn/conda-forge/ultimate-sitemap-parser
   :alt: Conda Version
   :target: https://anaconda.org/conda-forge/ultimate-sitemap-parser

.. image:: https://img.shields.io/pepy/dt/ultimate-sitemap-parser
   :target: https://pepy.tech/project/ultimate-sitemap-parser
   :alt: Pepy Total Downloads

**Ultimate Sitemap Parser (USP) is a performant and robust Python library for parsing and crawling sitemaps.**

Features
========

- Supports all sitemap formats:

  - XML sitemaps
  - Google News sitemaps and Image sitemaps
  - plain text sitemaps
  - RSS 2.0 / Atom 0.3 / Atom 1.0 sitemaps
  - Sitemaps linked from ``robots.txt``

- Field-tested with ~1 million URLs as part of the Media Cloud project
- Tolerant of the more common sitemap bugs
- Tries to find sitemaps not listed in ``robots.txt``
- Uses fast and memory-efficient Expat XML parsing
- Doesn't consume much memory even with massive sitemap hierarchies
- Provides the generated sitemap tree as an easy-to-use object tree
- Supports using a custom web client
- Uses a small number of actively maintained third-party modules
- Reasonably tested

Installation
============

.. code:: sh

    pip install ultimate-sitemap-parser

or using Anaconda:

.. code:: sh

    conda install -c conda-forge ultimate-sitemap-parser

Usage
=====

.. code:: python

    from usp.tree import sitemap_tree_for_homepage

    tree = sitemap_tree_for_homepage('https://www.example.org/')

    for page in tree.all_pages():
        print(page.url)

``sitemap_tree_for_homepage()`` returns a tree of ``AbstractSitemap`` subclass objects that represent the sitemap
hierarchy found on the website; see the reference of ``AbstractSitemap`` subclasses in the documentation.
``AbstractSitemap.all_pages()`` returns a generator, so pages can be iterated over efficiently without loading the entire tree into memory.

For more examples and details, see the documentation.
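
Because ``all_pages()`` is a generator, it can be combined with other lazy iterators so that only as many pages as needed are pulled from the tree. The following is a minimal sketch, not part of the library: ``unique_urls()`` and ``first_unique_urls()`` are hypothetical helper names, and the sketch assumes that the same URL may be listed in more than one sitemap. The ``sample_pages`` objects are stand-ins so the deduplication step can be tried without network access.

```python
from itertools import islice
from types import SimpleNamespace


def unique_urls(pages):
    """Lazily yield each page URL once, in first-seen order."""
    seen = set()
    for page in pages:
        if page.url not in seen:
            seen.add(page.url)
            yield page.url


def first_unique_urls(homepage_url, limit=10):
    """Return up to `limit` distinct page URLs for a site (needs network access)."""
    # Imported lazily so unique_urls() can be tried without the library installed.
    from usp.tree import sitemap_tree_for_homepage

    tree = sitemap_tree_for_homepage(homepage_url)
    # islice() stops pulling from the generator once `limit` URLs were produced.
    return list(islice(unique_urls(tree.all_pages()), limit))


# Stand-in objects (not real sitemap pages) to show the deduplication offline.
sample_pages = [
    SimpleNamespace(url="https://www.example.org/a"),
    SimpleNamespace(url="https://www.example.org/b"),
    SimpleNamespace(url="https://www.example.org/a"),
]
sample_result = list(unique_urls(sample_pages))
```

Calling ``first_unique_urls('https://www.example.org/')`` would fetch the site's sitemaps and stop iterating once ten distinct URLs have been collected. Note that the set of seen URLs grows with the number of distinct pages, so this trades some memory for deduplication.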