https://github.com/devsrv/py-crawler
A Python-based web crawler that systematically browses websites and generates a comprehensive CSV report of all discovered URLs along with their HTTP status codes.
- Host: GitHub
- URL: https://github.com/devsrv/py-crawler
- Owner: devsrv
- Created: 2024-12-20T14:38:00.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-14T12:23:45.000Z (about 1 year ago)
- Last Synced: 2025-05-15T17:14:19.843Z (9 months ago)
- Language: Python
- Size: 4.88 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Web Crawler
A Python-based web crawler that systematically browses websites and generates a comprehensive CSV report of all discovered URLs along with their HTTP status codes.
## Features
- Crawls all pages within a specified domain
- Respects same-origin policy (only crawls URLs from the same domain)
- Generates a CSV report with URLs and their HTTP status codes
- Handles relative and absolute URLs
- Implements polite crawling with built-in delays
- Filters out non-web schemes and fragments
- Robust error handling for failed requests
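
Taken together, these features describe a breadth-first crawl of a single domain. The following is a minimal sketch of that flow, not the repository's actual code: the function and variable names, the use of `requests`, and the `BeautifulSoup` HTML parser are all assumptions made for illustration.
```python
import csv
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup  # assumed HTML parser; the real script may differ


def crawl(start_url: str, output_csv: str) -> None:
    """Breadth-first crawl of start_url's domain, writing results to a CSV file."""
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}

    with open(output_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["URL", "Status_Code"])

        while queue:
            url = queue.popleft()
            try:
                response = requests.get(url, timeout=10)
                # Only record the status code when it signals a problem (>= 300).
                status = response.status_code if response.status_code >= 300 else ""
            except requests.RequestException:
                writer.writerow([url, "request failed"])
                continue

            writer.writerow([url, status])
            print(f"Crawled {url}")

            # Queue links that belong to the same domain and have not been seen yet.
            for tag in BeautifulSoup(response.text, "html.parser").find_all("a", href=True):
                link = urljoin(url, tag["href"]).split("#")[0].split("?")[0]
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    queue.append(link)

            time.sleep(1)  # polite delay between requests
```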
## Prerequisites
- Python 3.x
## Installation
1. Clone the repository:
```bash
git clone https://github.com/devsrv/py-crawler.git
cd py-crawler
```
2. Create and activate a virtual environment:
```bash
sudo apt-get install python3-venv  # Debian/Ubuntu only; skip if the venv module is already available
python3 -m venv venv
source venv/bin/activate
```
3. Install the required dependencies:
```bash
pip install --upgrade pip
pip install -r requirements.txt
```
## Usage
1. Run the script:
```bash
python index.py
```
2. When prompted:
- Enter the website URL you want to crawl (e.g., https://example.com)
- Specify the output CSV filename (e.g., links.csv)
3. The crawler will:
- Start crawling from the provided URL
- Save discovered URLs to the specified CSV file
- Record HTTP status codes for each URL
- Print progress information to the console
## Output Format
The script generates a CSV file with the following columns:
- `URL`: The discovered URL
- `Status_Code`: HTTP status code (only recorded if ≥ 300 or if the request failed)
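
Because the status column is only populated for problem URLs, the report is easy to filter after a crawl. A short post-processing sketch, assuming the `links.csv` filename from the usage example above:
```python
import csv

# Print every URL the crawler flagged with a status code (>= 300) or a failed request.
with open("links.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["Status_Code"]:
            print(row["URL"], row["Status_Code"])
```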
## Features in Detail
### URL Processing
- Removes URL fragments (#) and query parameters (?)
- Converts relative URLs to absolute URLs
- Validates URLs against the original domain
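
A rough illustration of this normalization using the standard library's `urllib.parse`; the helper names are illustrative, not taken from the script:
```python
from urllib.parse import urljoin, urlparse, urlunparse


def normalize(base_url: str, href: str) -> str:
    """Resolve a (possibly relative) href against base_url and strip query/fragment."""
    absolute = urljoin(base_url, href)
    parts = urlparse(absolute)
    # Keep scheme, host and path; drop params, query string and fragment.
    return urlunparse((parts.scheme, parts.netloc, parts.path, "", "", ""))


def same_domain(url: str, original_domain: str) -> bool:
    """Only URLs on the original domain are queued for crawling."""
    return urlparse(url).netloc == original_domain
```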
### Rate Limiting
- Implements a 1-second delay between requests to prevent server overload
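
One simple way to implement such a delay is to pause after every request. This is a hedged sketch only: the one-second pause matches the behavior described above, but the helper name and timeout value are assumptions.
```python
import time
from typing import Optional

import requests

REQUEST_DELAY_SECONDS = 1  # matches the built-in delay described above


def polite_get(url: str) -> Optional[requests.Response]:
    """Fetch a URL, then pause so the target server is not overloaded."""
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        response = None
    time.sleep(REQUEST_DELAY_SECONDS)
    return response
```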
## Contributing
Feel free to submit issues and enhancement requests.