https://github.com/azshurith/depth-crawler
A simple yet powerful Python web crawler that explores a given domain up to a specified depth and outputs a JSON sitemap of URLs and page titles.
https://github.com/azshurith/depth-crawler
crawler puppeteer python
Last synced: 2 months ago
JSON representation
A simple yet powerful Python web crawler that explores a given domain up to a specified depth and outputs a JSON sitemap of URLs and page titles.
- Host: GitHub
- URL: https://github.com/azshurith/depth-crawler
- Owner: Azshurith
- Created: 2025-01-07T05:23:58.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2025-06-23T09:31:18.000Z (12 months ago)
- Last Synced: 2025-07-01T00:08:02.062Z (12 months ago)
- Topics: crawler, puppeteer, python
- Language: Python
- Homepage:
- Size: 10.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Depth Crawler (Python)
A simple yet powerful **Python** web crawler that explores a given domain up to a specified depth and outputs a JSON sitemap of URLs and page titles.
## 🚀 Features
- Crawls recursively within a domain up to your chosen depth
- Records each page’s URL and its HTML ``
- Outputs results as a JSON file
- Displays crawl stats: pages visited, links found, total & average time per link
## 🔧 Requirements
- Python 3.x
- `requests` – for HTTP requests
- `beautifulsoup4` – for parsing HTML
## ⚙️ Installation
Clone the repository and install dependencies:
```bash
git clone https://github.com/Azshurith/Depth-Crawler-Python.git
cd Depth-Crawler-Python
make install
```
## 🛠 Makefile Commands
The project includes a simple `Makefile` with the following commands:
- **Install dependencies**
```bash
make install
```
- **Run the crawler**
```bash
make build
```
This will execute `python ./src/Main.py`.
## 🎯 Usage
Run the crawler:
```bash
make build
```
- You’ll be prompted to enter the target URL (e.g., `https://example.com`) and crawl depth.
- A file called `sitemap.json` will be generated with the results.
## 📊 Output Format
```json
[
{
"url": "https://example.com",
"title": "Example Domain",
"links": ["https://example.com/page1", ...]
},
...
]
```
## 📈 Crawl Report
After the run, you'll see a summary showing:
- Total pages visited
- Total links gathered
- Max depth reached
- Total time taken
- Average time per link
## ✅ How to Contribute
- Fork the repo and submit PRs for improvements
- Open issues for bugs or feature suggestions
## 📝 License
Add your license info here (e.g., MIT)
## 👤 Author
Your name or GitHub profile link here
---
**Happy crawling!**