Sitemap.xml Checker with Python

This Python script checks each URL listed in a website's sitemap.xml file to verify it is accessible, recording its HTTP status code. It also checks the URLs against the site's robots.txt file to flag any that are disallowed.
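
For reference, the robots.txt check can be expressed with the standard library's urllib.robotparser; the sketch below is illustrative (example.com is a placeholder domain) and may not match how sm.py implements it:

import urllib.robotparser

# Load and parse the site's robots.txt (example.com is a placeholder).
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# True if the rules allow a generic crawler ("*") to fetch the given URL.
print(rp.can_fetch("*", "https://example.com/some/page.html"))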

Prerequisites

- Python 3.x installed on your system.
- Required Python libraries: requests, pandas, openpyxl.

You can install the necessary libraries using pip:

pip install requests pandas openpyxl

Usage

1. Clone this repository or download the sm.py script.

2. Run the script from the command line, providing the domain name as an argument:

python sm.py yourdomain.com

Replace yourdomain.com with the actual domain you want to check.

3. The script will perform the following actions (a minimal sketch of this workflow appears after the list):

- Fetch the robots.txt file from the specified domain.
- Parse the sitemap.xml file to extract all listed URLs.
- Check each URL's HTTP status code.
- Identify URLs disallowed by robots.txt.
- Save the results to an Excel file named sitemap_report.xlsx.
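
For orientation, here is a minimal, illustrative sketch of that workflow. It assumes the standard sitemap XML namespace and a generic "*" user agent for the robots.txt check; the function name check_sitemap and other details are placeholders and may differ from the actual sm.py:

import sys
import time
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

import pandas as pd
import requests

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def check_sitemap(domain):
    base = f"https://{domain}"

    # Fetch robots.txt so each URL can be tested against its rules.
    rp = RobotFileParser()
    rp.set_url(f"{base}/robots.txt")
    rp.read()

    # Fetch sitemap.xml and collect every <loc> entry.
    response = requests.get(f"{base}/sitemap.xml", timeout=10)
    root = ET.fromstring(response.content)
    urls = [loc.text for loc in root.findall(".//sm:loc", SITEMAP_NS)]

    rows = []
    for url in urls:
        status = requests.get(url, timeout=10).status_code
        allowed = "Allowed" if rp.can_fetch("*", url) else "Disallowed"
        rows.append({"URL": url, "Status Code": status,
                     "Allowed by robots.txt": allowed})
        time.sleep(1)  # pause between requests to be polite to the server

    # Write the report; pandas uses openpyxl for .xlsx output.
    pd.DataFrame(rows).to_excel("sitemap_report.xlsx", index=False)

if __name__ == "__main__":
    check_sitemap(sys.argv[1])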

Output

The output Excel file, sitemap_report.xlsx, contains the following columns:

- URL: The URL extracted from the sitemap.
- Status Code: The HTTP status code returned by the URL.
- Allowed by robots.txt: Indicates whether the URL is allowed or disallowed by the site's robots.txt file.
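
Because the report is a plain spreadsheet, it is easy to post-process. As a hypothetical usage example, the snippet below loads the report with pandas and lists URLs that did not return HTTP 200:

import pandas as pd

# Load the generated report (reading .xlsx files requires openpyxl).
df = pd.read_excel("sitemap_report.xlsx")

# Keep only URLs whose status code is not 200 OK.
broken = df[df["Status Code"] != 200]
print(broken[["URL", "Status Code"]].to_string(index=False))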

Notes

- Ensure that the domain you specify has a sitemap.xml file accessible at the root level (e.g., https://yourdomain.com/sitemap.xml).
- The script uses a default User-Agent header that mimics a mobile browser. You can modify this in the HEADERS dictionary within the script if needed (see the sketch after these notes).
- The script includes a delay between requests to avoid overwhelming the server.
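
To illustrate those last two notes, a HEADERS dictionary and request delay might look like the sketch below; the exact User-Agent string and sleep interval in sm.py may differ:

import time
import requests

# Illustrative mobile User-Agent; edit this to change how requests identify themselves.
HEADERS = {
    "User-Agent": ("Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
                   "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 "
                   "Mobile/15E148 Safari/604.1")
}

for url in ["https://example.com/", "https://example.com/about"]:
    status = requests.get(url, headers=HEADERS, timeout=10).status_code
    print(url, status)
    time.sleep(1)  # delay between requests to avoid overwhelming the server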

License

This project is licensed under the MIT License.