https://github.com/yagna123k/scout

Smart concurrent web scraper with adaptive rate limiting & live CLI dashboard. Persists full HTML + text snippets into MongoDB for later NLP, indexing, or analysis. Achieves ~6x faster scraping with multi-threaded architecture vs single-threaded baseline.
adaptive-rate-limiting beautifulsoup concurrency mongodb multithreading performance python systems-engineering web-scraping

# 🚀 Scout Scrapper

**Scout Scrapper** is a smart concurrent web scraper that:

- 🧵 Uses a **multi-threaded architecture** to scrape up to **6x faster** than a single-threaded baseline.
- ⚖️ Features **adaptive rate limiting**: it automatically slows down when latency or error rates rise, to avoid HTTP 429 (Too Many Requests) responses and bans.
- 🗂 Stores the **full HTML content** and a short **text snippet** of each page in **MongoDB**, enabling later analysis, indexing, or NLP.
- 📊 Provides a **real-time CLI dashboard** showing completed requests, failures, average latency, and dynamic sleep adjustments.
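
The adaptive rate limiting described above could be sketched roughly as follows. This is only an illustration, not the project's actual code: the class name `AdaptiveThrottle`, the latency threshold, and the back-off and recovery factors are all assumptions chosen for the example.

```python
import threading


class AdaptiveThrottle:
    """Hypothetical controller: lengthens the sleep after slow or failed
    requests and shortens it again after fast, successful ones."""

    def __init__(self, base_sleep=0.5, min_sleep=0.1, max_sleep=5.0,
                 latency_threshold=2.0):
        self.sleep = base_sleep
        self.min_sleep = min_sleep
        self.max_sleep = max_sleep
        self.latency_threshold = latency_threshold  # seconds (assumed value)
        self._lock = threading.Lock()  # workers report results concurrently

    def record(self, latency, ok):
        """Update the shared sleep interval after one request and return it."""
        with self._lock:
            if not ok or latency > self.latency_threshold:
                # Back off: the server looks slow or is rejecting requests.
                self.sleep = min(self.sleep * 1.5, self.max_sleep)
            else:
                # Recover gradually while responses are fast and successful.
                self.sleep = max(self.sleep * 0.9, self.min_sleep)
            return self.sleep


throttle = AdaptiveThrottle()
after_slow = throttle.record(latency=3.0, ok=True)  # slow response: sleep grows
after_fast = throttle.record(latency=0.4, ok=True)  # fast response: sleep shrinks
```

Each worker would call `time.sleep(throttle.sleep)` before its next request. Backing off quickly and recovering slowly is one common design choice; other policies would fit the same interface.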

---

## 📸 Example run

```
Done: 20, Fail: 0, Avg Lat: 2.60s, Sleep: 0.90s ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:09
Scraped 20 URLs with 10 workers
Success: 20, Failures: 0
Avg latency: 2.58s
```

---

## 🚀 Benchmark Results

| Mode | Total Time | Improvement |
|-------------------|------------|-------------|
| Single-threaded | 56.46 s | baseline |
| Multi-threaded | 8.94 s | ~6.3x faster|
| Adaptive Throttle | 11.00 s | ~5.1x faster|

Benchmarked over 20 real news and media URLs with simulated delays.
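
A comparison of this kind can be reproduced in miniature without any network access. The sketch below is not the project's `benchmark.py`: `fake_fetch`, the 50 ms delay, and the placeholder URLs are assumptions used only to show how a sequential loop and a `ThreadPoolExecutor` run can be timed side by side.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, delay=0.05):
    """Stand-in for an HTTP request: sleeps to simulate network latency."""
    time.sleep(delay)
    return url


def run_single(urls):
    """Fetch URLs one after another and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [fake_fetch(u) for u in urls]
    return results, time.perf_counter() - start


def run_multi(urls, workers=10):
    """Fetch URLs with a thread pool and return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fake_fetch, urls))  # map preserves input order
    return results, time.perf_counter() - start


urls = [f"https://example.com/page/{i}" for i in range(20)]  # placeholder URLs
single_results, single_time = run_single(urls)
multi_results, multi_time = run_multi(urls)
speedup = single_time / multi_time
print(f"single: {single_time:.2f}s, multi: {multi_time:.2f}s, speedup: ~{speedup:.1f}x")
```

The Improvement column in the table is the same ratio: 56.46 / 8.94 ≈ 6.3 and 56.46 / 11.00 ≈ 5.1.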

---

## 🔍 What does it store?

Each scraped page is saved in MongoDB as a document like this:

```json
{
  "url": "https://www.bbc.com",
  "status": 200,
  "latency": 1.74,
  "timestamp": "2025-07-13T19:15:22.234Z",
  "html": "...",
  "snippet": "BBC Homepage World Business Technology ..."
}
```
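
As a rough sketch, a record with these fields could be assembled like this. The function name `build_document` and the 200-character snippet length are assumptions. The project extracts text with BeautifulSoup; the standard-library `html.parser` is swapped in here only to keep the example free of dependencies.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def build_document(url, status, latency, html, snippet_len=200):
    """Assemble one record with the same fields as the example above."""
    extractor = _TextExtractor()
    extractor.feed(html)
    text = " ".join(extractor.parts)
    return {
        "url": url,
        "status": status,
        "latency": round(latency, 2),
        # isoformat() writes the UTC offset as "+00:00" rather than "Z".
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "html": html,
        "snippet": text[:snippet_len],
    }


doc = build_document(
    "https://example.com",
    200,
    0.1234,
    "<html><head><style>p{}</style></head>"
    "<body><h1>Hello</h1><p>World</p></body></html>",
)
```

Assuming the project talks to MongoDB through PyMongo, a dictionary like `doc` can be passed directly to `Collection.insert_one`.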

## 🛠️ Tech Stack

* **Python** with `ThreadPoolExecutor` for concurrency
* **Rich** for live dashboards
* **BeautifulSoup** for text extraction
* **MongoDB** (Atlas or local) for persistence
* **Dotenv** for loading configuration, such as the MongoDB URI, from a `.env` file instead of hard-coding it

---

## 🚀 How to run

### 🔥 Install dependencies

```bash
pip install -r requirements.txt
```

### 📂 Add your `.env`

```
MONGO_URI=mongodb+srv://username:password@cluster.mongodb.net/?retryWrites=true
```

### 🚀 Run it

```bash
python main.py 10 True
```

* `10` = number of concurrent workers
* `True` = adaptive throttling on
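
The README does not show how `main.py` reads these two arguments, so the following is only a sketch of one way to do it. The function name `parse_args`, the default values, and the accepted spellings of "true" are assumptions.

```python
def parse_args(argv):
    """Parse `<workers> <adaptive>` from a list of strings.

    The defaults (10 workers, adaptive on) are illustrative assumptions.
    """
    workers = int(argv[0]) if len(argv) > 0 else 10
    if workers < 1:
        raise ValueError("workers must be at least 1")

    if len(argv) > 1:
        # bool("False") is True in Python, so compare the text explicitly.
        adaptive = argv[1].strip().lower() in ("true", "1", "yes")
    else:
        adaptive = True
    return workers, adaptive


print(parse_args(["10", "True"]))   # (10, True)
print(parse_args(["4", "False"]))   # (4, False)
```

In a script, the same function would be called as `parse_args(sys.argv[1:])`. Comparing the string explicitly matters because any non-empty string, including `"False"`, is truthy in Python.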

### ⚡ Benchmark modes

```bash
python benchmark.py
```

Runs the single-threaded, multi-threaded, and adaptive modes in turn and prints a timing comparison.