## fk-web-crawler
Take-home assignment for filerskeepers.
#### Summary
A web crawler system built with Scrapy that stores the scraped data and change logs in MongoDB, with change detection, a scheduler, report generation, and a REST API built with FastAPI.
#### Overview
A fully featured Python-based web crawler that:
- Crawls every page of `books.toscrape.com` and scrapes the book data
- Stores the scraped book data, including metadata and a raw HTML snapshot, in MongoDB Atlas
- Detects changes to existing books using a fingerprinting strategy (see the sketch after this list)
- Logs changes and new insertions to the database
- Can resume from the last successful crawl
- Schedules daily updates using APScheduler
- Serves the data through a FastAPI RESTful API with rate limiting and API-key authentication
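The change detection mentioned above relies on fingerprinting. A minimal sketch of the idea; the field list and helper names below are illustrative assumptions, not the exact spider code:

```python
import hashlib
import json

# Fields assumed to feed the fingerprint; the real spider may track a different set.
FINGERPRINT_FIELDS = (
    "name", "description", "category", "price_with_tax",
    "price_with_out_tax", "availability", "rating",
)

def compute_fingerprint(book: dict) -> str:
    """Hash the tracked fields so any change yields a different fingerprint."""
    payload = {field: book.get(field) for field in FINGERPRINT_FIELDS}
    serialized = json.dumps(payload, sort_keys=True, default=str)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()

def has_changed(stored_doc: dict, scraped_book: dict) -> bool:
    """Compare the fingerprint stored in MongoDB with a freshly computed one."""
    return stored_doc.get("fingerprint") != compute_fingerprint(scraped_book)
```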
#### Features
- Scrapy-powered web crawler
- MongoDB Atlas integration with deduplication
- Hash/fingerprint-based change detection
- Change logging and raw HTML snapshots
- APScheduler-powered daily job
- Daily change reports in JSON
- RESTful API served with FastAPI:
  - `/books` with filtering, pagination, sorting
  - `/books/{book_id}` for book details
  - `/changes` to get recent change logs
- API key authentication
- Rate limiting (100 req/hr per IP)
- OpenAPI (Swagger) docs
---
#### Project Structure
```
FK-CRAWLER/
│
├── api/                        # FastAPI server
│   ├── auth/
│   │   └── security.py         # API-key authentication
│   ├── models/                 # Pydantic models
│   │   ├── book.py
│   │   ├── change.py
│   │   └── schemas.py          # Serialization for books and logs
│   ├── routes/                 # API endpoints
│   │   ├── __init__.py
│   │   ├── books.py
│   │   └── changes.py
│   ├── utils/                  # API utilities
│   │   └── rate_limiter.py
│   ├── __init__.py
│   ├── main_test.py            # For testing
│   └── main.py                 # Main API file
│
├── crawler/                    # Web scraping logic
│   └── fkcrawling/
│       ├── spiders/            # Scrapy spiders
│       │   ├── __init__.py
│       │   ├── book_schema.py      # Pydantic model
│       │   ├── crawling_spider.py  # Web crawler
│       │   └── mongodb_client.py   # MongoDB connection
│       ├── __init__.py
│       ├── items.py
│       ├── middlewares.py
│       ├── pipelines.py
│       └── settings.py         # Scrapy config
│
├── scheduler/                  # Daily job scheduler
│   ├── daily_scheduler.py
│   └── crawler_runner.py
│
├── utilities/                  # Helper utilities
│   ├── logs/
│   │   └── activity.log        # Activity logging
│   ├── reports/
│   │   └── report.json         # Generated report
│   ├── assets/
│   │   └── images/             # Images
│   ├── generate_report.py      # Daily changes report
│   └── log_config.py           # Log setup
│
├── tests/                      # Unit & integration tests
│   ├── __init__.py
│   ├── test_db.py
│   └── test_crawler.py
├── .env                        # Secure API_KEY and MongoDB URI
├── .gitignore
├── requirements.txt            # Required packages
└── README.md                   # This file
```
---
## Setup Instructions
### Requirements
- Python 3.10+
- VSCode
- MongoDB Atlas account
### 1. Clone the Repository
```bash
git clone https://github.com/pointer2Alvee/fk-web-crawler.git
cd fk-web-crawler
```
### 2. Install Dependencies
```bash
pip install -r requirements.txt
```
### 3. Create `.env` File
Create a `.env` file at the root:
```
MONGODB_URI=mongodb+srv://<username>:<password>@cluster.mongodb.net/<database>
API_KEY=<your-api-key>
```
> `.env` is loaded automatically using `dotenv`.
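For reference, a minimal sketch of how these values are typically read with `python-dotenv`; the fail-fast check at the end is an illustrative addition:

```python
import os
from dotenv import load_dotenv

# Load MONGODB_URI and API_KEY from the project-root .env into the environment.
load_dotenv()

MONGODB_URI = os.getenv("MONGODB_URI")
API_KEY = os.getenv("API_KEY")

if not MONGODB_URI or not API_KEY:
    raise RuntimeError("MONGODB_URI and API_KEY must be set in .env")
```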
---
## Run Crawler
```bash
cd crawler/fkcrawling
scrapy crawl fkcrawler
```
- Inserts newly scraped books into the MongoDB collection `books`
- Deduplicates existing books and logs any changes to the MongoDB collection `change_log`
- Writes log output to `/logs/activity.log`
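For orientation, a simplified sketch of the crawl logic behind `scrapy crawl fkcrawler`; the CSS selectors below work on `books.toscrape.com` but are illustrative rather than a copy of `crawling_spider.py`:

```python
import scrapy

class FkCrawlerSpider(scrapy.Spider):
    name = "fkcrawler"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        # Follow every book detail page on the current listing page.
        for href in response.css("article.product_pod h3 a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_book)

        # Paginate until every listing page has been visited.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_book(self, response):
        # A pipeline would compute the fingerprint, deduplicate, and write to MongoDB.
        yield {
            "name": response.css("div.product_main h1::text").get(),
            "price_with_tax": response.css("p.price_color::text").get(),
            "source_url": response.url,
            "raw_html": response.text,  # stored as the raw HTML snapshot
        }
```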
---
## Run Scheduler (Crawler + Change Report Generator)
In `daily_scheduler.py`, set the `hour` and `minute` values to the time you want the job to run. For example, `hour=13, minute=15` schedules the crawl daily at 13:15 (1:15 PM).
```bash
cd scheduler
python daily_scheduler.py
```
- Crawls every day using APScheduler
- Detects new books or changes
- Logs them to MongoDB and the filesystem
- Generates a daily change report in JSON
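A minimal sketch of how the daily job can be registered with APScheduler; `run_crawler_and_report` is a stand-in name for whatever `crawler_runner.py` actually exposes:

```python
from apscheduler.schedulers.blocking import BlockingScheduler

def run_crawler_and_report():
    """Placeholder: trigger the Scrapy crawl, then generate the daily change report."""
    ...

scheduler = BlockingScheduler()
# Fire once a day at 13:15 local time, matching the hour/minute values above.
scheduler.add_job(run_crawler_and_report, "cron", hour=13, minute=15)

if __name__ == "__main__":
    scheduler.start()
```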
---
## Run FastAPI Server
```bash
cd api
uvicorn main:app --reload
```
- API is hosted at `http://127.0.0.1:8000/`
- Swagger docs: `http://127.0.0.1:8000/docs`
---
## API Key Usage
All endpoints are protected via API key.
### Headers:
```
FKCRAWLER-API-KEY: <your-api-key>
```
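One common way to enforce such a header in FastAPI is an `APIKeyHeader` dependency. This is a hedged sketch of the idea, not the exact contents of `api/auth/security.py` (rate limiting lives separately in `utils/rate_limiter.py`):

```python
import os
from fastapi import HTTPException, Security
from fastapi.security import APIKeyHeader

# Header name as documented above; the expected value comes from API_KEY in .env.
api_key_header = APIKeyHeader(name="FKCRAWLER-API-KEY", auto_error=False)

def verify_api_key(api_key: str | None = Security(api_key_header)) -> str:
    """Reject requests whose FKCRAWLER-API-KEY header is missing or wrong."""
    if not api_key or api_key != os.getenv("API_KEY"):
        raise HTTPException(status_code=401, detail="Invalid or missing API key")
    return api_key
```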
---
## API Endpoints
| Endpoint | Method | Description |
|-------------------|--------|------------------------------------------|
| `/books` | GET | Get all books (filter, sort, paginate) |
| `/books/{id}` | GET | Get book by MongoDB ObjectId |
| `/changes` | GET | Get recent changes |
| `/docs` | GET | Swagger UI (OpenAPI spec) |
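A hedged sketch of what a `/books` handler with filtering, sorting, and pagination could look like; the database and collection names and the query parameters are assumptions and may differ from `api/routes/books.py`:

```python
import os
from fastapi import FastAPI, Query
from pymongo import MongoClient

app = FastAPI()

# Assumed database/collection names; the real project may use different ones.
books = MongoClient(os.getenv("MONGODB_URI"))["fkcrawler"]["books"]

@app.get("/books")
def list_books(
    category: str | None = Query(None, description="Filter by book category"),
    sort_by: str = Query("name", description="Field to sort on"),
    page: int = Query(1, ge=1),
    page_size: int = Query(20, ge=1, le=100),
):
    query = {"category": category} if category else {}
    cursor = (
        books.find(query, {"_id": 0, "raw_html": 0})  # hide the bulky HTML snapshot
        .sort(sort_by, 1)
        .skip((page - 1) * page_size)
        .limit(page_size)
    )
    return list(cursor)
```

For example: `GET /books?category=Poetry&page=2&page_size=10`.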
---
## Daily Report Output
On successful run, you'll get:
```bash
/reports/
└── change_report_YYYY-MM-DD.json
```
Includes:
- New insertions
- Fields changed
- Source URLs and timestamps
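A minimal sketch of how such a dated report could be assembled from the `change_log` collection. Timestamps are assumed to be stored as ISO-8601 strings (as in the sample documents later in this README), and the database name is an assumption:

```python
import json
import os
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient

def generate_daily_report(output_dir: str = "reports") -> str:
    """Write the last 24 hours of change_log entries to change_report_YYYY-MM-DD.json."""
    change_log = MongoClient(os.getenv("MONGODB_URI"))["fkcrawler"]["change_log"]

    since = (datetime.now(timezone.utc) - timedelta(days=1)).strftime("%Y-%m-%dT%H:%M:%SZ")
    entries = list(change_log.find({"timestamp": {"$gte": since}}, {"_id": 0}))

    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, f"change_report_{today}.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"date": today, "changes": entries}, fh, indent=2)
    return path
```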
---
## Testing
Unit and integration tests are in `/tests/`, implemented with `pytest` for:
- DB operations
- Crawling output
Make sure you're in the root of the project and run:
```bash
pytest tests/
```
Sample output summary:
```
===================== test session starts =================
collected 2 items
tests/test_crawler.py .... [66%]
tests/test_db.py . [83%]
====================== 2 passed in 2.31s ==================
```
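As an illustration of the style of test that can run here, a small self-contained `pytest` case around the fingerprint idea (this mirrors the sketch earlier in this README, not necessarily the shipped tests):

```python
import hashlib
import json

def compute_fingerprint(book: dict) -> str:
    """Inlined copy of the fingerprint sketch so the test stands alone."""
    serialized = json.dumps(book, sort_keys=True, default=str)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()

def test_fingerprint_changes_when_a_field_changes():
    original = {"name": "A Light in the Attic", "price_with_tax": 12.99}
    updated = {"name": "A Light in the Attic", "price_with_tax": 14.99}
    assert compute_fingerprint(original) != compute_fingerprint(updated)

def test_fingerprint_is_stable_for_identical_books():
    book = {"name": "A Light in the Attic", "price_with_tax": 12.99}
    assert compute_fingerprint(book) == compute_fingerprint(book)
```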
---
## Demonstration
### MongoDB
### Log
### Report
### FastAPI
#### FastAPI - GET /books
#### FastAPI - GET /books/{book_id} & GET /changes
## Sample MongoDB Documents
**books Document Structure**
```json
{
  "_id": ObjectId("123..."),
  "name": "A Light in the Attic",
  "description": "It's hard to..",
  "category": "Poetry",
  "price_with_tax": 12.99,
  "price_with_out_tax": 12.99,
  "availability": "22",
  "review": 0,
  "cover_image_url": "https://books.toscrape.com/../fe72aea293c.jpg",
  "rating": 3,
  "crawl_timestamp": "2025-06-27T10:00:00Z",
  "source_url": "https://books.toscrape.com/catalogue/.../index.html",
  "raw_html": "...",
  "fingerprint": "abc123..."
}
```
**change_log Document Structure**
```json
{
  "_id": ObjectId("123..."),
  "source_url": "https://books.toscrape.com/catalogue/.../index.html",
  "name": "A Light in the Attic",
  "timestamp": "2025-06-27T10:00:00Z",
  "changes": Object
}
```
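The `changes` field above holds a per-field diff. One plausible way to build it; the tracked fields and the `{old, new}` shape are assumptions:

```python
TRACKED_FIELDS = ("name", "description", "price_with_tax", "availability", "rating")

def diff_books(old_doc: dict, new_doc: dict) -> dict:
    """Return {field: {"old": ..., "new": ...}} for every tracked field that changed."""
    changes = {}
    for field in TRACKED_FIELDS:
        if old_doc.get(field) != new_doc.get(field):
            changes[field] = {"old": old_doc.get(field), "new": new_doc.get(field)}
    return changes

# Example: diff_books({"price_with_tax": 12.99}, {"price_with_tax": 14.99})
# -> {"price_with_tax": {"old": 12.99, "new": 14.99}}
```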
---
## Deliverables Checklist (from PDF)
| Requirement                              | Status         |
|------------------------------------------|----------------|
| Crawler using Scrapy                     | Done           |
| Scheduler with change detection          | Done           |
| Change log storage/collection            | Done           |
| FastAPI server                           | Done           |
| API key + rate limiting                  | Done           |
| Swagger UI                               | Done           |
| `.env` support                           | Done           |
| Daily reports (JSON + CSV)               | Done           |
| Screenshot/logs of scheduler/crawler     | See `/logs`    |
---
## Postman / Swagger UI
Use [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs) to interactively test all endpoints.
---
## Future Improvements
- Add email alerts for major changes
- Add export formats: CSV, PDF, Excel
---
## Author
**Alvee**
Email: pointer2alvee@gmail.com
[GitHub](https://github.com/pointer2Alvee)
---
### Acknowledgements
- Open-source contributors and the wider internet
- YouTube videos: [1](https://www.youtube.com/watch?v=mBoX_JCKZTE), [2](https://www.youtube.com/watch?v=GogxAQ2JP4A), [3](https://www.youtube.com/watch?v=rvFsGRvj9jo)
---
## License
MIT License โ feel free to use, improve, and contribute!