https://github.com/gatenlp/ultimate-sitemap-parser
Ultimate Website Sitemap Parser
- Host: GitHub
- URL: https://github.com/gatenlp/ultimate-sitemap-parser
- Owner: GateNLP
- License: gpl-3.0
- Created: 2018-11-27T10:05:18.000Z (over 7 years ago)
- Default Branch: main
- Last Pushed: 2025-04-28T10:00:26.000Z (about 1 year ago)
- Last Synced: 2025-05-15T09:08:10.060Z (about 1 year ago)
- Topics: python, python-3, python3, robots-txt, sitemap, sitemap-xml, xml-sitemap, xml-sitemap-parser
- Language: Python
- Homepage: https://ultimate-sitemap-parser.readthedocs.io
- Size: 409 KB
- Stars: 208
- Watchers: 9
- Forks: 68
- Open Issues: 2
Metadata Files:
- Readme: README.rst
- Contributing: docs/contributing.rst
- License: LICENSE
Ultimate Sitemap Parser
-----------------------

.. image:: https://img.shields.io/pypi/pyversions/ultimate-sitemap-parser
   :alt: PyPI - Python Version
   :target: https://github.com/GateNLP/ultimate-sitemap-parser

.. image:: https://img.shields.io/pypi/v/ultimate-sitemap-parser
   :alt: PyPI - Version
   :target: https://pypi.org/project/ultimate-sitemap-parser/

.. image:: https://img.shields.io/conda/vn/conda-forge/ultimate-sitemap-parser
   :alt: Conda Version
   :target: https://anaconda.org/conda-forge/ultimate-sitemap-parser

.. image:: https://img.shields.io/pepy/dt/ultimate-sitemap-parser
   :alt: Pepy Total Downloads
   :target: https://pepy.tech/project/ultimate-sitemap-parser

**Ultimate Sitemap Parser (USP) is a performant and robust Python library for parsing and crawling sitemaps.**

Features
========
- Supports all sitemap formats:

  - XML sitemaps
  - Google News sitemaps and Image sitemaps
  - Plain text sitemaps
  - RSS 2.0 / Atom 0.3 / Atom 1.0 sitemaps
  - Sitemaps linked from ``robots.txt``

- Field-tested with ~1 million URLs as part of the Media Cloud project
- Tolerates the more common sitemap bugs
- Tries to find sitemaps not listed in ``robots.txt``
- Uses fast and memory-efficient Expat XML parsing
- Doesn't consume much memory even with massive sitemap hierarchies
- Provides the generated sitemap tree as an easy-to-use object tree
- Supports using a custom web client
- Uses a small number of actively maintained third-party modules
- Reasonably tested

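
To give a feel for why Expat-style parsing keeps memory use low, here is a minimal sketch that streams an XML sitemap through the standard library's ``xml.parsers.expat`` and collects the ``<loc>`` values. It is an illustration only and not USP's own implementation; the function name ``sitemap_urls`` and the ``SAMPLE`` document are hypothetical.

.. code:: python

    import xml.parsers.expat


    def sitemap_urls(chunks):
        """Collect the text of every <loc> element from an iterable of byte chunks."""
        urls = []
        current = []  # character data of the <loc> element being read
        in_loc = False

        def local_name(name):
            # With namespace_separator=" ", names arrive as "<namespace-uri> <tag>"
            return name.rsplit(" ", 1)[-1]

        def start(name, attrs):
            nonlocal in_loc
            if local_name(name) == "loc":
                in_loc = True
                current.clear()

        def end(name):
            nonlocal in_loc
            if in_loc and local_name(name) == "loc":
                urls.append("".join(current).strip())
                in_loc = False

        def chars(data):
            # Expat may deliver one text node in several pieces
            if in_loc:
                current.append(data)

        parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
        parser.StartElementHandler = start
        parser.EndElementHandler = end
        parser.CharacterDataHandler = chars

        # Feed the document piece by piece; no full element tree is ever built
        for chunk in chunks:
            parser.Parse(chunk, False)
        parser.Parse(b"", True)
        return urls


    # Hypothetical sample sitemap used only for this sketch
    SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>https://www.example.org/</loc></url>
      <url><loc>https://www.example.org/about</loc></url>
    </urlset>""".replace(b"\n    ", b"\n")

    print(sitemap_urls([SAMPLE[:60], SAMPLE[60:]]))

Because the handlers only keep the text of the element currently being read, memory use stays roughly constant however large the input is. Note that this sketch matches every ``loc`` element regardless of namespace, so it would also pick up image locations in an Image sitemap.
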
Installation
============
.. code:: sh

    pip install ultimate-sitemap-parser

or using Anaconda:

.. code:: sh

    conda install -c conda-forge ultimate-sitemap-parser

Usage
=====
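.. code:: python

    from usp.tree import sitemap_tree_for_homepage

    # Discover and parse the sitemaps of the website
    tree = sitemap_tree_for_homepage('https://www.example.org/')

    # all_pages() yields every page found anywhere in the sitemap tree
    for page in tree.all_pages():
        print(page.url)
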
``sitemap_tree_for_homepage()`` returns a tree of ``AbstractSitemap`` subclass objects that represent the sitemap
hierarchy found on the website; see the reference of ``AbstractSitemap`` subclasses in the documentation. ``AbstractSitemap.all_pages()`` returns a generator, so you can iterate over pages efficiently without loading the entire tree into memory.

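
Because ``all_pages()`` is a generator, it can be consumed lazily and stopped early. The sketch below is one way to take the first few distinct URLs; the helper ``first_unique_urls`` and the stand-in generator ``fake_pages`` are hypothetical names introduced for illustration, and the only assumption about the page objects is the ``url`` attribute used in the example above.

.. code:: python

    from itertools import islice
    from types import SimpleNamespace


    def first_unique_urls(pages, limit):
        """Return up to `limit` distinct URLs, pulling from `pages` lazily."""
        seen = set()

        def unique():
            for page in pages:
                if page.url not in seen:
                    seen.add(page.url)
                    yield page.url

        # islice stops requesting pages as soon as `limit` URLs are collected
        return list(islice(unique(), limit))


    def fake_pages():
        # Stand-in for tree.all_pages() so that the sketch runs offline
        for path in ["/", "/about", "/", "/contact", "/blog"]:
            yield SimpleNamespace(url="https://www.example.org" + path)


    print(first_unique_urls(fake_pages(), 3))

With the library installed, passing ``tree.all_pages()`` in place of ``fake_pages()`` should behave the same way, since the helper only reads ``page.url``.
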
For more examples and details, see the `documentation <https://ultimate-sitemap-parser.readthedocs.io>`_.