https://github.com/azshurith/depth-crawler

A simple yet powerful Python web crawler that explores a given domain up to a specified depth and outputs a JSON sitemap of URLs and page titles.

Host: GitHub
URL: https://github.com/azshurith/depth-crawler
Owner: Azshurith
Created: 2025-01-07T05:23:58.000Z (over 1 year ago)
Default Branch: master
Last Pushed: 2025-06-23T09:31:18.000Z (about 1 year ago)
Last Synced: 2025-07-01T00:08:02.062Z (about 1 year ago)
Topics: crawler, puppeteer, python
Language: Python
Homepage:
Size: 10.7 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Depth Crawler (Python)

A simple yet powerful **Python** web crawler that explores a given domain up to a specified depth and outputs a JSON sitemap of URLs and page titles.

## 🚀 Features

- Crawls recursively within a domain up to your chosen depth
- Records each page’s URL and its HTML `<title>`
- Outputs results as a JSON file
- Displays crawl stats: pages visited, links found, total & average time per link
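
The features above can be sketched as a breadth-first crawl with a depth limit. This is a minimal illustration, not the project's actual code: `PageParser`, `fetch_html`, and `crawl` are hypothetical names, and the sketch uses only the standard library in place of `requests` and `beautifulsoup4` so it runs without extra installs. It also assumes that depth 0 means the start page and that "within a domain" means an exact host match.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects the <title> text and every <a href> value from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.hrefs = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_html(url):
    """Default fetcher (hypothetical helper): download a page as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_depth, fetch=fetch_html):
    """Breadth-first crawl of one domain; the start page is depth 0."""
    domain = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])
    seen = {start_url}
    pages = []
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        parser = PageParser()
        parser.feed(html)
        # Resolve relative links, drop #fragments, keep same-domain links only.
        links = []
        for href in parser.hrefs:
            absolute = urldefrag(urljoin(url, href)).url
            if urlparse(absolute).netloc == domain and absolute not in links:
                links.append(absolute)
        pages.append({"url": url, "title": parser.title.strip(), "links": links})
        # Only follow links while we are still above the depth limit.
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages
```

Passing `fetch` as a parameter keeps the traversal logic testable without network access: any function that maps a URL to an HTML string can stand in for the real downloader.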

## 🔧 Requirements

- Python 3.x
- `requests` – for HTTP requests
- `beautifulsoup4` – for parsing HTML

## ⚙️ Installation

Clone the repository and install dependencies:

```bash
git clone https://github.com/Azshurith/Depth-Crawler-Python.git
cd Depth-Crawler-Python
make install
```

## 🛠 Makefile Commands

The project includes a simple `Makefile` with the following commands:

- **Install dependencies**
```bash
make install
```

- **Run the crawler**
```bash
make build
```

This will execute `python ./src/Main.py`.

## 🎯 Usage

Run the crawler:
```bash
make build
```

- You’ll be prompted to enter the target URL (e.g., `https://example.com`) and crawl depth.
- A file called `sitemap.json` will be generated with the results.
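
The two prompted values are worth validating before a crawl starts. The sketch below shows one way to do that; `parse_inputs` is a hypothetical helper, and the rules it enforces (an `http` or `https` URL and a non-negative integer depth) are assumptions rather than documented behavior of the project.

```python
from urllib.parse import urlparse


def parse_inputs(url_text, depth_text):
    """Validate the prompted URL and depth; raise ValueError on bad input."""
    url = url_text.strip()
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"expected an http(s) URL, got {url!r}")
    depth = int(depth_text)  # raises ValueError on non-numeric input
    if depth < 0:
        raise ValueError("depth must be 0 or greater")
    return url, depth
```

In an interactive script, the arguments would come from two `input()` calls, one for the URL and one for the depth.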

## 📊 Output Format

```json
[
  {
    "url": "https://example.com",
    "title": "Example Domain",
    "links": ["https://example.com/page1", ...]
  },
  ...
]
```
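
A file in this shape can be produced with the standard `json` module. The sketch below assumes the crawl results are already a list of dictionaries with `url`, `title`, and `links` keys; `sitemap_to_json` and `save_sitemap` are hypothetical names, not functions from the project.

```python
import json
from pathlib import Path


def sitemap_to_json(pages):
    """Serialize the page records as an indented JSON array."""
    return json.dumps(pages, indent=2, ensure_ascii=False)


def save_sitemap(pages, path="sitemap.json"):
    """Write the serialized sitemap to disk and return the file path."""
    target = Path(path)
    target.write_text(sitemap_to_json(pages), encoding="utf-8")
    return target
```

`ensure_ascii=False` keeps non-ASCII page titles readable in the file instead of escaping them as `\uXXXX` sequences.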

## 📈 Crawl Report

After the run, you'll see a summary showing:

- Total pages visited
- Total links gathered
- Max depth reached
- Total time taken
- Average time per link
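
The summary fields can be derived from the page records and a timer. The sketch below is an illustration only: `build_report` and `format_report` are hypothetical names, and it assumes "average time per link" means total elapsed time divided by the total number of links gathered. The maximum depth is passed in because the documented output format does not store a depth per page.

```python
def build_report(pages, max_depth_reached, elapsed_seconds):
    """Compute the summary numbers from a list of page records."""
    total_links = sum(len(page["links"]) for page in pages)
    # Guard against division by zero when no links were found.
    average = elapsed_seconds / total_links if total_links else 0.0
    return {
        "pages_visited": len(pages),
        "links_gathered": total_links,
        "max_depth": max_depth_reached,
        "total_seconds": elapsed_seconds,
        "average_seconds_per_link": average,
    }


def format_report(report):
    """Render the report as plain text, one field per line."""
    return "\n".join([
        f"Total pages visited: {report['pages_visited']}",
        f"Total links gathered: {report['links_gathered']}",
        f"Max depth reached: {report['max_depth']}",
        f"Total time taken: {report['total_seconds']:.2f}s",
        f"Average time per link: {report['average_seconds_per_link']:.2f}s",
    ])
```

To obtain `elapsed_seconds`, one option is to record `time.perf_counter()` before and after the crawl and take the difference.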

## ✅ How to Contribute

- Fork the repo and submit PRs for improvements
- Open issues for bugs or feature suggestions

## 📝 License

Add your license info here (e.g., MIT)

## 👤 Author

Your name or GitHub profile link here

---

**Happy crawling!**
