Ultimate Website Sitemap Parser
https://github.com/gatenlp/ultimate-sitemap-parser


.. image:: https://travis-ci.org/mediacloud/ultimate-sitemap-parser.svg?branch=develop
    :target: https://travis-ci.org/mediacloud/ultimate-sitemap-parser
    :alt: Build Status

.. image:: https://readthedocs.org/projects/ultimate-sitemap-parser/badge/?version=latest
    :target: https://ultimate-sitemap-parser.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://coveralls.io/repos/github/mediacloud/ultimate-sitemap-parser/badge.svg?branch=develop
    :target: https://coveralls.io/github/mediacloud/ultimate-sitemap-parser?branch=develop
    :alt: Coverage Status

.. image:: https://badge.fury.io/py/ultimate-sitemap-parser.svg
    :target: https://badge.fury.io/py/ultimate-sitemap-parser
    :alt: PyPI package

.. image:: https://pepy.tech/badge/ultimate-sitemap-parser
    :target: https://pepy.tech/project/ultimate-sitemap-parser
    :alt: Download stats

Website sitemap parser for Python 3.5+.

Features
========

- Supports all sitemap formats:

  - XML sitemaps
  - Google News sitemaps
  - plain text sitemaps
  - RSS 2.0 / Atom 0.3 / Atom 1.0 sitemaps
  - Sitemaps linked from ``robots.txt``

- Field-tested with ~1 million URLs as part of the Media Cloud project
- Tolerates the more common sitemap bugs
- Tries to find sitemaps not listed in ``robots.txt``
- Uses fast and memory-efficient Expat XML parsing
- Doesn't consume much memory even with massive sitemap hierarchies
- Provides the generated sitemap tree as an easy-to-use object tree
- Supports using a custom web client
- Uses a small number of actively maintained third-party modules
- Reasonably tested

Installation
============

.. code:: sh

    pip install ultimate-sitemap-parser

Usage
=====

.. code:: python

    from usp.tree import sitemap_tree_for_homepage

    tree = sitemap_tree_for_homepage('https://www.nytimes.com/')
    print(tree)

``sitemap_tree_for_homepage()`` returns a tree of ``AbstractSitemap`` subclass objects that represents the sitemap
hierarchy found on the website; see the reference of ``AbstractSitemap`` subclasses.
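
To get a feel for the hierarchy, the tree can be walked recursively. The following is only a minimal sketch: ``walk_sitemaps()`` is a hypothetical helper, not part of the library, and it assumes that every sitemap object exposes a ``url`` attribute and that index-type sitemaps expose their children through a ``sub_sitemaps`` attribute. Check the reference before relying on either name.

.. code:: python

    from typing import Iterator, Tuple


    def walk_sitemaps(sitemap, depth: int = 0) -> Iterator[Tuple[int, str]]:
        """Yield (depth, url) for a sitemap and all of its descendants.

        Hypothetical helper; assumes ``sitemap.url`` exists and that index
        sitemaps list their children in ``sitemap.sub_sitemaps``.
        """
        yield depth, sitemap.url
        # Leaf sitemaps are assumed to have no ``sub_sitemaps`` attribute
        for child in getattr(sitemap, "sub_sitemaps", []):
            yield from walk_sitemaps(child, depth + 1)


    # Example use with the ``tree`` object from the snippet above:
    #
    #     for depth, url in walk_sitemaps(tree):
    #         print("  " * depth + url)
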

If you just want to list all the pages found in all of the website's sitemaps, use the ``all_pages()`` method:

.. code:: python

    # all_pages() returns an iterator
    for page in tree.all_pages():
        print(page)

The ``all_pages()`` method returns an iterator yielding ``SitemapPage`` objects; see the reference of ``SitemapPage``.
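
Because the same page may be listed in more than one sitemap, it can be useful to deduplicate URLs while iterating. The sketch below is one way to do that without loading every page into memory first; ``unique_urls()`` is a hypothetical helper, not part of the library, and it assumes each page object exposes a ``url`` attribute.

.. code:: python

    from typing import Iterable, Iterator


    def unique_urls(pages: Iterable) -> Iterator[str]:
        """Yield each page URL once, in first-seen order.

        Hypothetical helper; assumes every page object has a ``url`` attribute.
        """
        seen = set()
        for page in pages:
            url = page.url
            if url not in seen:
                seen.add(url)
                yield url


    # Example use with the ``tree`` object from the snippets above:
    #
    #     for url in unique_urls(tree.all_pages()):
    #         print(url)

Only the set of already-seen URLs is kept in memory, so the pages themselves are still consumed lazily from the iterator.
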