Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/evemilano/sitemap.xml-checker-con-python
This Python script checks the URLs listed in a website's sitemap.xml file to ensure they are accessible and returns their HTTP status codes. It also verifies the site's robots.txt file to identify any disallowed URLs.
- Host: GitHub
- URL: https://github.com/evemilano/sitemap.xml-checker-con-python
- Owner: evemilano
- Created: 2024-12-25T18:23:58.000Z (10 days ago)
- Default Branch: main
- Last Pushed: 2024-12-25T22:34:31.000Z (9 days ago)
- Last Synced: 2024-12-25T23:21:01.315Z (9 days ago)
- Topics: evemilano, gpt, python, requests, seo, sitemap
- Language: Python
- Homepage: https://www.evemilano.com/blog/analizzare-sitemap-python-gpt/
- Size: 1.95 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
Sitemap.xml Checker with Python
This Python script checks the URLs listed in a website's sitemap.xml file to ensure they are accessible and returns their HTTP status codes. It also verifies the site's robots.txt file to identify any disallowed URLs.
Prerequisites
- Python 3.x installed on your system.
- Required Python libraries: requests, pandas, openpyxl. You can install the necessary libraries using pip:
pip install requests pandas openpyxl
Usage
1. Clone this repository or download the sm.py script.
2. Run the script from the command line, providing the domain name as an argument:
python sm.py yourdomain.com
Replace yourdomain.com with the actual domain you want to check.
3. The script will perform the following actions (a minimal sketch of the whole flow follows this list):
- Fetch the robots.txt file from the specified domain.
- Parse the sitemap.xml file to extract all listed URLs.
- Check each URL's HTTP status code.
- Identify URLs disallowed by robots.txt.
- Save the results to an Excel file named sitemap_report.xlsx.
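As a rough illustration of that flow, here is a minimal sketch. It is not the repository's actual sm.py: the function name, the XML handling, and the one-second delay are assumptions made for this example.

```python
# Minimal sketch of the flow above; not the actual sm.py from this repository.
import sys
import time
import urllib.robotparser
import xml.etree.ElementTree as ET

import pandas as pd
import requests


def check_sitemap(domain: str) -> None:
    base = f"https://{domain}"

    # Fetch robots.txt so each URL can later be tested against its rules.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{base}/robots.txt")
    robots.read()

    # Fetch sitemap.xml and pull every <loc> entry out of it.
    response = requests.get(f"{base}/sitemap.xml", timeout=10)
    response.raise_for_status()
    namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    urls = [loc.text.strip()
            for loc in ET.fromstring(response.content).iterfind(".//sm:loc", namespace)]

    # Request each URL, record its status code and robots.txt verdict,
    # and pause between requests so the server is not overwhelmed.
    rows = []
    for url in urls:
        status = requests.get(url, timeout=10).status_code
        rows.append({
            "URL": url,
            "Status Code": status,
            "Allowed by robots.txt": robots.can_fetch("*", url),
        })
        time.sleep(1)  # assumed interval; the real script's delay may differ

    # Write the report; pandas uses openpyxl to produce the .xlsx file.
    pd.DataFrame(rows).to_excel("sitemap_report.xlsx", index=False)


if __name__ == "__main__":
    check_sitemap(sys.argv[1])  # e.g. python sm.py yourdomain.com
```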
Output
The output Excel file, sitemap_report.xlsx, contains the following columns (a short example of inspecting the file follows this list):
- URL: The URL extracted from the sitemap.
- Status Code: The HTTP status code returned by the URL.
- Allowed by robots.txt: Indicates whether the URL is allowed or disallowed by the site's robots.txt file.
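For example, the finished report can be read back with pandas. The column names below come from this README; the filter is just one way to surface problem URLs:

```python
import pandas as pd

# Load the generated report and print any URL that did not return HTTP 200.
df = pd.read_excel("sitemap_report.xlsx")
print(df[df["Status Code"] != 200].to_string(index=False))
```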
Notes
- Ensure that the domain you specify has a sitemap.xml file accessible at the root level (e.g., https://yourdomain.com/sitemap.xml).
- The script uses a default User-Agent header to mimic a mobile browser. You can modify this in the HEADERS dictionary within the script if needed (a sketch follows these notes).
- The script includes a delay between requests to avoid overwhelming the server.
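A sketch of those two customizations, assuming a HEADERS dictionary as described above; the User-Agent string and the one-second pause are illustrative values, not the script's actual ones:

```python
import time
import requests

# HEADERS is the dictionary the README says can be edited; this mobile
# User-Agent string is only an example, not the script's real value.
HEADERS = {
    "User-Agent": ("Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
                   "AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"),
}

response = requests.get("https://yourdomain.com/", headers=HEADERS, timeout=10)
print(response.status_code)
time.sleep(1)  # illustrative delay between consecutive requests
```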
License
This project is licensed under the MIT License.