## 📚 fk-web-crawler

#### 📌 Summary
A web crawler built with Scrapy that stores scraped data and change logs in MongoDB, with change detection, scheduling, report generation, and a REST API served by FastAPI.

#### 🧠 Overview
A fully featured Python-based web crawler that:
- Crawls all pages and scrapes book data from `books.toscrape.com`
- Stores scraped book data, including metadata and a raw HTML snapshot, in MongoDB Atlas
- Detects changes to existing books using a fingerprinting strategy
- Logs changes and insertions to the database
- Can resume from the last successful crawl
- Schedules daily updates with APScheduler
- Serves data through a FastAPI REST API with API key authentication and rate limiting

#### 🚀 Features
- ✅ Scrapy-powered web crawler
- ✅ MongoDB Atlas integration with deduplication
- ✅ Hash/fingerprint-based change detection
- ✅ Change logging and raw HTML snapshots
- ✅ APScheduler-powered daily job
- ✅ Daily change reports in JSON
- ✅ RESTful API server built with FastAPI:
  - `/books` with filtering, pagination, and sorting
  - `/books/{book_id}` for book details
  - `/changes` for recent change logs
- ✅ API key authentication
- ✅ Rate limiting (100 req/hr per IP)
- ✅ OpenAPI (Swagger) docs

---

#### 📁 Project Structure

```
FK-CRAWLER/
│
├── api/                            # FastAPI server
│   ├── auth/
│   │   └── security.py             # API-key authentication
│   ├── models/                     # Pydantic models
│   │   ├── book.py
│   │   ├── change.py
│   │   └── schemas.py              # Serialization for books and logs
│   ├── routes/                     # API endpoints
│   │   ├── __init__.py
│   │   ├── books.py
│   │   └── changes.py
│   ├── utils/                      # API helpers
│   │   └── rate_limiter.py
│   ├── __init__.py
│   ├── main_test.py                # For testing
│   └── main.py                     # Main API entry point
│
├── crawler/                        # Web scraping logic
│   └── fkcrawling/
│       ├── spiders/                # Scrapy spiders
│       │   ├── __init__.py
│       │   ├── book_schema.py      # Pydantic model
│       │   ├── crawling_spider.py  # Web crawler
│       │   └── mongodb_client.py   # MongoDB connection
│       ├── __init__.py
│       ├── items.py
│       ├── middlewares.py
│       ├── pipelines.py
│       └── settings.py             # Scrapy config
│
├── scheduler/                      # Daily job scheduler
│   ├── daily_scheduler.py
│   └── crawler_runner.py
│
├── utilities/                      # Helper utilities
│   ├── logs/
│   │   └── activity.log            # Activity logging
│   ├── reports/
│   │   └── report.json             # Generated report
│   ├── assets/
│   │   └── images/                 # Images
│   ├── generate_report.py          # Daily change report
│   └── log_config.py               # Log setup
│
├── tests/                          # Unit & integration tests
│   ├── __init__.py
│   ├── test_db.py
│   └── test_crawler.py
├── .env                            # API_KEY and MongoDB URI (kept out of version control)
├── .gitignore
├── requirements.txt                # Required packages
└── README.md                       # This file
```
---

## 🔧 Setup Instructions

### 📦 Requirements

- Python 3.10+
- VSCode
- MongoDB Atlas account

### ๐Ÿ“ 1. Clone the Repository

```bash
git clone https://github.com/pointer2Alvee/fk-web-crawler.git
cd fk-web-crawler
```

### ๐Ÿ“ 2. Install Dependencies

```bash
pip install -r requirements.txt
```
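
Optionally, create and activate a virtual environment first so the project's dependencies stay isolated; this is standard Python practice rather than something the repo requires:

```bash
python -m venv .venv
source .venv/bin/activate      # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
```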

### ⚙️ 3. Create `.env` File

Create a `.env` file at the root:

```
MONGODB_URI=mongodb+srv://<username>:<password>@cluster.mongodb.net/<database>
API_KEY=<your-api-key>
```

> ✅ `.env` is automatically loaded using `dotenv`.
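
For reference, here is a minimal sketch of how these values are typically read with `python-dotenv`; the variable names match the template above, but the exact loading code in this repo may differ:

```python
# Minimal sketch: load configuration from .env with python-dotenv.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

MONGODB_URI = os.getenv("MONGODB_URI")
API_KEY = os.getenv("API_KEY")

if not MONGODB_URI or not API_KEY:
    raise RuntimeError("MONGODB_URI and API_KEY must be set in .env")
```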

---

## 🕷️ Run Crawler

```bash
cd crawler/fkcrawling
scrapy crawl fkcrawler
```

- Inserts newly scraped books into the MongoDB collection `books`
- Deduplicates unchanged books and logs changes (if any) to the MongoDB collection `change_log` (see the sketch below)
- Logs output to `/logs/activity.log`
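
The change detection is fingerprint-based. Below is a minimal sketch of the general idea, assuming a SHA-256 hash over the scraped fields keyed by `source_url` (field names follow the sample documents later in this README; the actual logic in `pipelines.py` may differ):

```python
# Sketch of fingerprint-based deduplication and change logging (not the repo's exact code).
import hashlib
import json
from datetime import datetime, timezone

def compute_fingerprint(book: dict) -> str:
    # Hash a stable JSON serialization of the scraped fields (raw HTML excluded).
    payload = {k: v for k, v in book.items() if k not in ("raw_html", "_id")}
    return hashlib.sha256(
        json.dumps(payload, sort_keys=True, default=str).encode("utf-8")
    ).hexdigest()

def upsert_book(db, book: dict) -> None:
    """db is a pymongo Database with 'books' and 'change_log' collections."""
    book["fingerprint"] = compute_fingerprint(book)
    existing = db.books.find_one({"source_url": book["source_url"]})

    if existing is None:
        db.books.insert_one(book)  # brand-new book
    elif existing["fingerprint"] != book["fingerprint"]:
        # Book changed: replace the stored copy and record which fields differed.
        changed_fields = {
            k: book[k] for k in book
            if k not in ("_id", "raw_html", "fingerprint")
            and existing.get(k) != book[k]
        }
        db.books.replace_one({"_id": existing["_id"]}, book)
        db.change_log.insert_one({
            "source_url": book["source_url"],
            "name": book.get("name"),
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "changes": changed_fields,
        })
    # Identical fingerprint -> duplicate, nothing to insert (deduplication).
```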

---

## 🗓️ Run Scheduler - Runs Crawler + Change Report Generator

In `daily_scheduler.py`, set `hour` and `minute` to the time you want the job to run. For example, `hour=13, minute=15` schedules the job daily at 13:15 (1:15 PM).

```bash
cd scheduler
python daily_scheduler.py
```

- Crawls every day using APScheduler (see the sketch below)
- Detects new books or changes
- Logs them to MongoDB and the filesystem
- Generates a daily change report in JSON
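
For orientation, here is a minimal sketch of a daily APScheduler job of this shape, using a blocking scheduler with a cron trigger; the body of `daily_job` is a placeholder for the project's crawler runner and report generator, not the repo's exact code:

```python
# Sketch of a daily job with APScheduler (cron trigger at 13:15).
from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.triggers.cron import CronTrigger

def daily_job():
    # In this project the job would run the crawler (crawler_runner.py)
    # and then generate the change report (generate_report.py).
    print("running crawl + change report...")

scheduler = BlockingScheduler()
scheduler.add_job(daily_job, CronTrigger(hour=13, minute=15))  # fires daily at 13:15
scheduler.start()  # blocks the process and keeps the schedule alive
```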

---

## 🧪 Run FastAPI Server

```bash
cd api
uvicorn main:app --reload
```

- API is hosted at `http://127.0.0.1:8000/`
- Swagger docs: `http://127.0.0.1:8000/docs`

---

## 🔐 API Key Usage

All endpoints are protected via API key.

### Headers:

```
FKCRAWLER-API-KEY: <your-api-key>
```
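
For example, with `curl` (replace the placeholder with your own key):

```bash
curl -H "FKCRAWLER-API-KEY: <your-api-key>" http://127.0.0.1:8000/books
```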

---

## 📂 API Endpoints

| Endpoint | Method | Description |
|-------------------|--------|------------------------------------------|
| `/books` | GET | Get all books (filter, sort, paginate) |
| `/books/{id}` | GET | Get book by MongoDB ObjectId |
| `/changes` | GET | Get recent changes |
| `/docs` | GET | Swagger UI (OpenAPI spec) |
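
A hedged example of combining filtering, sorting, and pagination on `/books`; the query parameter names shown here (`category`, `sort_by`, `page`, `limit`) are illustrative assumptions, so check the Swagger UI at `/docs` for the authoritative names:

```bash
curl -H "FKCRAWLER-API-KEY: <your-api-key>" \
  "http://127.0.0.1:8000/books?category=Poetry&sort_by=price_with_tax&page=1&limit=20"
```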

---

## 📤 Daily Report Output

On a successful run, you'll get:

```bash
/reports/
└── change_report_YYYY-MM-DD.json
```

Includes:
- New insertions
- Fields changed
- Source URLs and timestamps
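
A purely illustrative sketch of what one report could look like, based on the fields listed above and the `change_log` structure shown later in this README (not actual output):

```json
{
  "date": "2025-06-27",
  "new_books": 3,
  "changed_books": 1,
  "changes": [
    {
      "source_url": "https://books.toscrape.com/catalogue/.../index.html",
      "fields_changed": ["price_with_tax", "availability"],
      "timestamp": "2025-06-27T10:00:00Z"
    }
  ]
}
```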

---

## 🧪 Testing

Unit and integration tests live in `/tests/` and are implemented with `pytest`, covering:
- DB operations
- Crawling output

Make sure you're in the root of the project and run:

```bash
pytest tests/
```

Example output summary:
```
===================== test session starts =================
collected 2 items

tests/test_crawler.py .... [66%]
tests/test_db.py . [83%]
====================== 2 passed in 2.31s ==================

```
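
For reference, a minimal sketch of what a `pytest`-based DB check could look like; the fixture, database name, and assertions here are assumptions for illustration, not the repo's actual tests:

```python
# Illustrative test sketch (tests/test_db.py-style); not the project's real tests.
import os

import pytest
from pymongo import MongoClient

@pytest.fixture(scope="module")
def books_collection():
    client = MongoClient(os.environ["MONGODB_URI"])
    yield client["fkcrawler"]["books"]  # database name is an assumption
    client.close()

def test_books_have_required_fields(books_collection):
    doc = books_collection.find_one()
    assert doc is not None, "the crawler should have inserted at least one book"
    for field in ("name", "price_with_tax", "source_url", "fingerprint"):
        assert field in doc
```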

---

## Demonstration

The original README embedded screenshots here; the image files live under `utilities/assets/images/` in the repo.

### MongoDB

*(screenshots)*

### Log

*(screenshot)*

### Report

*(screenshot)*

### FastAPI

*(screenshot)*

#### FastAPI - GET /books

*(screenshots)*

#### FastAPI - GET /books/{book_id} & GET /changes

*(screenshots)*

## 💡 Sample MongoDB Documents

**books Document Structure**
```json
{
  "_id": ObjectId("123..."),
  "name": "A Light in the Attic",
  "description": "It's hard to..",
  "category": "Poetry",
  "price_with_tax": 12.99,
  "price_with_out_tax": 12.99,
  "availability": "22",
  "review": 0,
  "cover_image_url": "https://books.toscrape.com/../fe72aea293c.jpg",
  "rating": 3,
  "crawl_timestamp": "2025-06-27T10:00:00Z",
  "source_url": "https://books.toscrape.com/catalogue/.../index.html",
  "raw_html": "...",
  "fingerprint": "abc123..."
}
```
**change_log Document Structure**

```json
{
  "_id": ObjectId("123..."),
  "source_url": "https://books.toscrape.com/catalogue/.../index.html",
  "name": "A Light in the Attic",
  "timestamp": "2025-06-27T10:00:00Z",
  "changes": { ... }
}
```
---

## 🧾 Deliverables Checklist (from the assignment PDF)

| Requirement | Status |
|-------------------------------------------|------------|
| ✅ Crawler using Scrapy | Done |
| ✅ Scheduler with change detection | Done |
| ✅ Change log storage/collection | Done |
| ✅ FastAPI server | Done |
| ✅ API key + rate limiting | Done |
| ✅ Swagger UI | Done |
| ✅ `.env` support | Done |
| ✅ Daily reports (JSON) | Done |
| ✅ Screenshots/logs of scheduler/crawler | ✔️ See `/logs` |

---

## 📬 Postman / Swagger UI

Use [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs) to interactively test all endpoints.

---

## 🧠 Future Improvements
- Add email alerts for major changes
- Add export formats: CSV, PDF, Excel

---

## 🧑‍💻 Author

**Alvee**  
📧 pointer2alvee@gmail.com  
🔗 [GitHub](https://github.com/pointer2Alvee)

---

### 🙏 Acknowledgements
- Open-source contributors and online resources
- YouTube videos:
  * [1](https://www.youtube.com/watch?v=mBoX_JCKZTE), [2](https://www.youtube.com/watch?v=GogxAQ2JP4A), [3](https://www.youtube.com/watch?v=rvFsGRvj9jo)
---

## 📄 License
MIT License – feel free to use, improve, and contribute!