Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/gatenlp/ultimate-sitemap-parser
Ultimate Website Sitemap Parser
Ultimate Website Sitemap Parser
- Host: GitHub
- URL: https://github.com/gatenlp/ultimate-sitemap-parser
- Owner: GateNLP
- License: other
- Created: 2018-11-27T10:05:18.000Z (almost 6 years ago)
- Default Branch: develop
- Last Pushed: 2023-05-17T11:16:44.000Z (over 1 year ago)
- Last Synced: 2024-10-01T21:14:03.563Z (2 days ago)
- Topics: python, python-3, python3, robots-txt, sitemap, sitemap-xml, xml-sitemap, xml-sitemap-parser
- Language: Python
- Homepage: https://mediacloud.org/
- Size: 111 KB
- Stars: 179
- Watchers: 11
- Forks: 64
- Open Issues: 22
Metadata Files:
- Readme: README.rst
- License: LICENSE.txt
README

.. image:: https://travis-ci.org/mediacloud/ultimate-sitemap-parser.svg?branch=develop
    :target: https://travis-ci.org/mediacloud/ultimate-sitemap-parser
    :alt: Build Status

.. image:: https://readthedocs.org/projects/ultimate-sitemap-parser/badge/?version=latest
    :target: https://ultimate-sitemap-parser.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://coveralls.io/repos/github/mediacloud/ultimate-sitemap-parser/badge.svg?branch=develop
    :target: https://coveralls.io/github/mediacloud/ultimate-sitemap-parser?branch=develop
    :alt: Coverage Status

.. image:: https://badge.fury.io/py/ultimate-sitemap-parser.svg
    :target: https://badge.fury.io/py/ultimate-sitemap-parser
    :alt: PyPI package

.. image:: https://pepy.tech/badge/ultimate-sitemap-parser
    :target: https://pepy.tech/project/ultimate-sitemap-parser
    :alt: Download stats

Website sitemap parser for Python 3.5+.

Features
========

- Supports all sitemap formats:

  - XML sitemaps
  - Google News sitemaps
  - plain text sitemaps
  - RSS 2.0 / Atom 0.3 / Atom 1.0 sitemaps
  - Sitemaps linked from robots.txt

- Field-tested with ~1 million URLs as part of the Media Cloud project
- Tolerates the more common sitemap bugs
- Tries to find sitemaps not listed in ``robots.txt``
- Uses fast and memory-efficient Expat XML parsing
- Doesn't consume much memory even with massive sitemap hierarchies
- Provides the generated sitemap tree as an easy-to-use object tree
- Supports using a custom web client
- Uses a small number of actively maintained third-party modules
- Reasonably tested

Installation
============

.. code:: sh

    pip install ultimate-sitemap-parser
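
To confirm that the package landed in the active environment, the installed version can be queried from Python. This is a minimal sketch that uses only the standard library's ``importlib.metadata`` (available from Python 3.8, so newer than the 3.5 minimum stated above); the helper name ``installed_version`` is made up for illustration and is not part of this package:

.. code:: python

    from importlib.metadata import PackageNotFoundError, version
    from typing import Optional


    def installed_version(distribution: str) -> Optional[str]:
        """Return the installed version of a distribution, or None if absent."""
        try:
            return version(distribution)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            return None


    # Prints a version string if the install succeeded, otherwise None.
    print(installed_version("ultimate-sitemap-parser"))
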

Usage
=====

.. code:: python

    from usp.tree import sitemap_tree_for_homepage

    tree = sitemap_tree_for_homepage('https://www.nytimes.com/')
    print(tree)

``sitemap_tree_for_homepage()`` returns a tree of ``AbstractSitemap`` subclass objects that represent the sitemap
hierarchy found on the website; see the reference of ``AbstractSitemap`` subclasses in the project documentation.

If you'd like to just list all the pages found in all of the sitemaps within the website, consider using the ``all_pages()`` method:


.. code:: python

    # all_pages() returns an Iterator
    for page in tree.all_pages():
        print(page)

The ``all_pages()`` method returns an iterator yielding ``SitemapPage`` objects; see the reference of ``SitemapPage`` in the project documentation.
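
Because ``all_pages()`` walks every sitemap in the tree, the same URL may show up more than once if a site lists it in several sitemaps. The sketch below collects each URL once, in first-seen order. It assumes only that every yielded object exposes a ``url`` attribute, as ``SitemapPage`` does; ``unique_page_urls`` is a hypothetical helper, not part of the library:

.. code:: python

    from typing import Any, Iterable, Iterator


    def unique_page_urls(pages: Iterable[Any]) -> Iterator[str]:
        """Yield each page's URL once, in the order it was first seen."""
        seen = set()
        for page in pages:
            # Assumption: each page object has a ``url`` attribute.
            url = page.url
            if url not in seen:
                seen.add(url)
                yield url

With the ``tree`` object from the usage example, ``unique_page_urls(tree.all_pages())`` can be iterated lazily in the same way as ``all_pages()`` itself. The set of seen URLs is held in memory, so the footprint grows with the number of distinct URLs.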