https://github.com/yagna123k/scout
Smart concurrent web scraper with adaptive rate limiting & live CLI dashboard. Persists full HTML + text snippets into MongoDB for later NLP, indexing, or analysis. Achieves ~6x faster scraping with multi-threaded architecture vs single-threaded baseline.
adaptive-rate-limiting beautifulsoup concurrency mongodb multithreading performance python systems-engineering web-scraping
- Host: GitHub
- URL: https://github.com/yagna123k/scout
- Owner: Yagna123k
- Created: 2025-07-13T18:43:53.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-07-13T20:55:55.000Z (12 months ago)
- Last Synced: 2025-10-24T22:34:08.946Z (8 months ago)
- Topics: adaptive-rate-limiting, beautifulsoup, concurrency, mongodb, multithreading, performance, python, systems-engineering, web-scraping
- Language: Python
- Size: 4.88 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Scout Scrapper
**Scout Scrapper** is a smart concurrent web scraper that:
- Uses a **multi-threaded architecture** to achieve up to **6x faster scraping** than a single-threaded baseline.
- Features **adaptive rate limiting**, automatically slowing down under high latency or error rates to avoid HTTP 429 (Too Many Requests) responses and bans.
- Stores the **full HTML content** and a short **text snippet** of each page in **MongoDB**, enabling later analysis, indexing, or NLP.
- Provides a **real-time CLI dashboard** showing completed requests, failures, average latency, and dynamic sleep adjustments.
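
One way the adaptive rate limiting described above could work is sketched below. This is a minimal illustration, not the project's actual code: the class name `AdaptiveThrottle`, the latency and error thresholds, and the 1.5x / 0.9x adjustment factors are all assumptions chosen for the example.

```python
import threading
from collections import deque


class AdaptiveThrottle:
    """Adjusts a shared per-request sleep from recent latency and error rate."""

    def __init__(self, base_sleep=0.5, min_sleep=0.1, max_sleep=5.0,
                 latency_limit=2.0, error_limit=0.2, window=20):
        self.sleep = base_sleep
        self.min_sleep = min_sleep
        self.max_sleep = max_sleep
        self.latency_limit = latency_limit   # seconds (hypothetical threshold)
        self.error_limit = error_limit       # failure fraction (hypothetical)
        self.samples = deque(maxlen=window)  # recent (latency, ok) pairs
        self._lock = threading.Lock()        # workers call record() concurrently

    def record(self, latency, ok):
        """Record one request outcome and return the updated sleep in seconds."""
        with self._lock:
            self.samples.append((latency, ok))
            count = len(self.samples)
            avg_latency = sum(lat for lat, _ in self.samples) / count
            error_rate = sum(1 for _, good in self.samples if not good) / count
            if avg_latency > self.latency_limit or error_rate > self.error_limit:
                # Back off multiplicatively when the server looks stressed.
                self.sleep = min(self.sleep * 1.5, self.max_sleep)
            else:
                # Recover gradually while responses stay fast and clean.
                self.sleep = max(self.sleep * 0.9, self.min_sleep)
            return self.sleep
```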
---
## Example run
```
Done: 20, Fail: 0, Avg Lat: 2.60s, Sleep: 0.90s ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:09
Scraped 20 URLs with 10 workers
Success: 20, Failures: 0
Avg latency: 2.58s
```
---
## Benchmark Results
| Mode              | Total Time | Improvement  |
|-------------------|------------|--------------|
| Single-threaded   | 56.46 s    | baseline     |
| Multi-threaded    | 8.94 s     | ~6.3x faster |
| Adaptive Throttle | 11.00 s    | ~5.1x faster |
Benchmarked over 20 real news and media URLs with simulated delays.
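
The improvement column is the single-threaded time divided by each mode's time. The short check below reproduces that arithmetic from the figures in the table; the dictionary and the `speedup` helper are names introduced only for this example.

```python
# Total times in seconds, copied from the benchmark table.
timings = {
    "Single-threaded": 56.46,
    "Multi-threaded": 8.94,
    "Adaptive Throttle": 11.00,
}

baseline = timings["Single-threaded"]


def speedup(mode):
    """Return how many times faster `mode` is than the single-threaded run."""
    return baseline / timings[mode]


for mode, seconds in timings.items():
    print(f"{mode:<18} {seconds:>6.2f} s  {speedup(mode):.1f}x")
```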
---
## What does it store?
Each scraped page is saved in MongoDB like:
```json
{
  "url": "https://www.bbc.com",
  "status": 200,
  "latency": 1.74,
  "timestamp": "2025-07-13T19:15:22.234Z",
  "html": "...",
  "snippet": "BBC Homepage World Business Technology ..."
}
```
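
A record with this shape could be assembled roughly as follows. This is a sketch under stated assumptions, not the repository's code: it swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without extra dependencies, and the `build_record` name, the 200-character snippet length, and the string timestamp format are all hypothetical.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def build_record(url, status, latency, html, snippet_chars=200):
    """Build a dict shaped like the stored document shown above."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Collapse runs of whitespace so the snippet reads as one line.
    text = " ".join(" ".join(extractor.parts).split())
    now = datetime.now(timezone.utc)
    return {
        "url": url,
        "status": status,
        "latency": round(latency, 2),
        "timestamp": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "html": html,
        "snippet": text[:snippet_chars],  # snippet length is an assumption
    }
```

The resulting dict could then be passed to PyMongo's `collection.insert_one(record)`; which database and collection the project writes to is not stated in this README.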
## Tech Stack
* **Python** with `ThreadPoolExecutor` for concurrency
* **Rich** for live dashboards
* **BeautifulSoup** for text extraction
* **MongoDB** (Atlas or local) for persistence
* **Dotenv** for secure environment configs
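
To illustrate the `ThreadPoolExecutor` piece of the stack, here is a minimal sketch of fanning URLs out to a worker pool. It is not taken from the repository: `fake_fetch` is a hypothetical stand-in that sleeps instead of making an HTTP request, and `scrape_all` is a name chosen for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(url, delay=0.05):
    """Stand-in for an HTTP request: sleeps, then reports a 200 status."""
    start = time.perf_counter()
    time.sleep(delay)
    return {"url": url, "status": 200, "latency": time.perf_counter() - start}


def scrape_all(urls, workers=10, fetch=fake_fetch):
    """Run `fetch` over `urls` on a thread pool and collect the results."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as exc:
                # Record the failure and keep processing the other URLs.
                results.append({"url": futures[future], "status": None,
                                "error": str(exc)})
    return results


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(20)]
    start = time.perf_counter()
    results = scrape_all(urls, workers=10)
    print(f"Scraped {len(results)} URLs in {time.perf_counter() - start:.2f}s")
```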
---
## How to run
### Install dependencies
```bash
pip install -r requirements.txt
```
### Add your `.env`
```
MONGO_URI=mongodb+srv://username:password@cluster.mongodb.net/?retryWrites=true
```
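
A script consuming this variable might validate it before connecting. The sketch below assumes the `.env` file has already been loaded into the process environment (for example with python-dotenv's `load_dotenv()`); the `get_mongo_uri` helper is hypothetical and not part of the project.

```python
import os


def get_mongo_uri(env=None):
    """Return MONGO_URI from the environment, failing fast if it is unusable."""
    env = os.environ if env is None else env
    uri = env.get("MONGO_URI", "").strip()
    if not uri:
        raise RuntimeError("MONGO_URI is not set; add it to .env or export it")
    # MongoDB connection strings use one of these two schemes.
    if not uri.startswith(("mongodb://", "mongodb+srv://")):
        raise RuntimeError("MONGO_URI must start with mongodb:// or mongodb+srv://")
    return uri
```

Failing at startup gives a clearer error than a connection timeout later in the run.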
### Run it
```bash
python main.py 10 True
```
* `10` = number of concurrent workers
* `True` = adaptive throttling on
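
How `main.py` reads these two positional arguments is not shown in this README. The sketch below is one plausible way to parse them; the `parse_cli` name and the default values are assumptions. It also handles a common pitfall: `bool("False")` evaluates to `True` in Python, so the flag is compared as text.

```python
import sys


def parse_cli(argv):
    """Parse `[workers] [adaptive]` in the order used by the command above."""
    workers = int(argv[0]) if len(argv) > 0 else 10  # default is an assumption
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # bool("False") is True in Python, so compare the text explicitly.
    adaptive = argv[1].strip().lower() in ("true", "1", "yes") if len(argv) > 1 else False
    return workers, adaptive


if __name__ == "__main__":
    print(parse_cli(sys.argv[1:]))
```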
### Benchmark modes
```bash
python benchmark.py
```
Runs the single-threaded, multi-threaded, and adaptive modes in turn and prints a timing comparison.