# Web Crawler

A Python-based web crawler that systematically browses websites and generates a comprehensive CSV report of all discovered URLs along with their HTTP status codes.

## Features

- Crawls all pages within a specified domain
- Enforces a same-domain policy (only crawls URLs on the starting site's domain)
- Generates a CSV report with URLs and their HTTP status codes
- Handles relative and absolute URLs
- Implements polite crawling with built-in delays
- Filters out non-web schemes and fragments
- Robust error handling for failed requests
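
Taken together, these behaviors amount to a breadth-first crawl loop. The sketch below shows one way such a loop could look; it is not the project's actual code, and it assumes the `requests` and `beautifulsoup4` packages (plausible entries in `requirements.txt`, but unverified):

```python
import csv
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, output_csv):
    """Breadth-first crawl restricted to start_url's domain (illustrative sketch)."""
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}

    with open(output_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["URL", "Status_Code"])
        while queue:
            url = queue.popleft()
            try:
                response = requests.get(url, timeout=10)
            except requests.RequestException as exc:
                writer.writerow([url, type(exc).__name__])  # record the failure
                continue
            status = response.status_code
            # Per the output format below: status recorded only for redirects/errors
            writer.writerow([url, status if status >= 300 else ""])
            for link in BeautifulSoup(response.text, "html.parser").find_all("a", href=True):
                absolute = urljoin(url, link["href"]).split("#")[0]  # resolve + drop fragment
                if urlparse(absolute).netloc == domain and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
            time.sleep(1)  # polite delay between requests
```

The full query-string and fragment handling is shown under "URL Processing" below.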

## Prerequisites

- Python 3.x

## Installation

1. Clone the repository:

```bash
git clone https://github.com/devsrv/py-crawler.git
cd py-crawler
```

2. Create and activate a virtual environment:

```bash
sudo apt-get install python3-venv  # Debian/Ubuntu only; skip if venv is already available
python3 -m venv venv
source venv/bin/activate
```

3. Install the required dependencies:

```bash
pip install --upgrade pip
pip install -r requirements.txt
```

## Usage

1. Run the script:

```bash
python index.py
```

2. When prompted:

- Enter the website URL you want to crawl (e.g., https://example.com)
- Specify the output CSV filename (e.g., links.csv)

3. The crawler will:
- Start crawling from the provided URL
- Save discovered URLs to the specified CSV file
- Record HTTP status codes for each URL
- Print progress information to the console

## Output Format

The script generates a CSV file with the following columns:

- `URL`: The discovered URL
- `Status_Code`: HTTP status code (only recorded if ≥ 300 or if the request failed)
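
For illustration, a report from a hypothetical crawl of https://example.com might look like this, assuming every discovered URL gets a row and the status column is left blank for successful (2xx) responses:

```csv
URL,Status_Code
https://example.com/,
https://example.com/old-page,301
https://example.com/missing,404
```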

## Features in Detail

### URL Processing

- Removes URL fragments (#) and query parameters (?)
- Converts relative URLs to absolute URLs
- Validates URLs against the original domain
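
These steps map naturally onto Python's `urllib.parse` module. The helper below is a hedged sketch; the function name `normalize` and its signature are illustrative, not taken from the script:

```python
from typing import Optional
from urllib.parse import urljoin, urlparse

def normalize(base_url: str, href: str, allowed_domain: str) -> Optional[str]:
    """Resolve an href found on base_url and apply the filters above."""
    absolute = urljoin(base_url, href)          # relative -> absolute
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https"):   # drop mailto:, javascript:, tel:, ...
        return None
    if parts.netloc != allowed_domain:          # same-domain validation
        return None
    # Strip the query string and fragment before deduplication
    return parts._replace(query="", fragment="").geturl()
```

For example, `normalize("https://example.com/a/", "../b?page=2#top", "example.com")` yields `https://example.com/b`, while a `mailto:` link or an off-domain URL returns `None`.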

### Rate Limiting

- Implements a 1-second delay between requests to prevent server overload
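
In code this usually reduces to a `time.sleep` call after each fetch. The loop below is illustrative only, with `pending_urls` and `fetch_and_record` standing in for whatever queue and handler the script actually uses:

```python
import time

REQUEST_DELAY_SECONDS = 1  # fixed pause; keeps load on the target server low

for url in pending_urls:        # hypothetical work queue
    fetch_and_record(url)       # hypothetical fetch/report helper
    time.sleep(REQUEST_DELAY_SECONDS)
```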

## Contributing

Feel free to submit issues and enhancement requests.