https://github.com/devilisback100/ai_web_scrapper
Last synced: 12 months ago
- Host: GitHub
- URL: https://github.com/devilisback100/ai_web_scrapper
- Owner: devilisback100
- Created: 2025-06-21T10:07:50.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-21T10:21:19.000Z (about 1 year ago)
- Last Synced: 2025-06-21T11:21:19.188Z (about 1 year ago)
- Language: Python
- Size: 15.6 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# AI Chapter Rewriter & Reviewer
An AI-powered content pipeline that scrapes literature chapters from Wikisource, rewrites them with large language models (LLMs), supports human-in-the-loop edits, captures screenshots of the source pages, and stores every final version in ChromaDB, where search results are re-ranked by a reinforcement learning (RL) score.
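
The staged flow described above (AI writer, AI reviewer, human edit) can be sketched as a list of functions applied in order. This is a minimal illustration, not the project's actual code: `ChapterDraft`, `run_pipeline`, and the three placeholder stages are hypothetical names, and the placeholders only tidy whitespace where a real stage would call an LLM or prompt a person.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A stage takes the current chapter text and returns the revised text.
Stage = Callable[[str], str]


@dataclass
class ChapterDraft:
    """Chapter text plus a log of the stages that have processed it."""
    text: str
    history: List[str] = field(default_factory=list)


def run_pipeline(source_text: str, stages: List[Tuple[str, Stage]]) -> ChapterDraft:
    """Apply each (name, stage) pair in order and record the stage names."""
    draft = ChapterDraft(text=source_text)
    for name, stage in stages:
        draft.text = stage(draft.text)
        draft.history.append(name)
    return draft


# Placeholder stages (assumptions): stand-ins for LLM calls and human input.
def ai_writer(text: str) -> str:
    return text.strip()


def ai_reviewer(text: str) -> str:
    return " ".join(text.split())  # collapse runs of whitespace


def human_editor(text: str) -> str:
    return text  # accept the reviewed draft unchanged


if __name__ == "__main__":
    result = run_pipeline(
        "  It was  a dark\nnight.  ",
        [("writer", ai_writer), ("reviewer", ai_reviewer), ("human", human_editor)],
    )
    print(result.text)
    print(result.history)
```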
---
## Features
| Module | Description |
|-------------------------|-----------------------------------------------------------------------------|
| Scraping | Fetches chapter content from Wikisource using `requests`. |
| AI Writing & Review | Rewrites chapters with an AI writer and reviewer (LLM-powered). |
| Human-in-the-loop | Allows manual review and iterative improvements before finalizing. |
| Screenshot Capture | Uses `Playwright` to capture full-page screenshots of the source. |
| Agentic Flow | Structured AI agents (writer, reviewer, editor) in a modular pipeline. |
| ChromaDB Storage | Saves rewritten chapters with metadata for versioning and retrieval. |
| RL-based Search         | Retrieves chapters by semantic similarity, then re-ranks them by RL score.  |
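
One way to picture the RL-based search row is a weighted blend of vector similarity and a stored feedback score. The sketch below is an assumption about how such re-ranking could work, not the repository's implementation: `rerank`, `rl_weight`, the distance-to-similarity mapping, and the `[0, 1]` score range are all hypothetical. The distances stand in for what a vector store query would return.

```python
from typing import Dict, List, Tuple


def rerank(
    distances: Dict[str, float],
    rl_scores: Dict[str, float],
    rl_weight: float = 0.3,
) -> List[Tuple[str, float]]:
    """Blend similarity with an RL score and return results best first.

    distances: chapter id -> vector distance (smaller means closer)
    rl_scores: chapter id -> feedback score in [0, 1]; missing ids count as 0
    rl_weight: share of the final score taken from the RL score
    """
    ranked = []
    for chapter_id, distance in distances.items():
        similarity = 1.0 / (1.0 + distance)  # map distance into (0, 1]
        rl = rl_scores.get(chapter_id, 0.0)
        combined = (1.0 - rl_weight) * similarity + rl_weight * rl
        ranked.append((chapter_id, combined))
    # Highest combined score first.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    hits = {"chapter1_v1": 0.0, "chapter1_v2": 1.0}
    feedback = {"chapter1_v2": 1.0}
    # With equal weighting, the well-rated version overtakes the closer one.
    print(rerank(hits, feedback, rl_weight=0.5))
```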
---
## Tech Stack
- **Python** – Core programming language
- **Playwright** – Full-page screenshot automation
- **Requests** – HTML content fetching
- **ChromaDB** – Vector DB for document storage and search
- **tqdm** – Progress bars for long-running steps
- **Dotenv** – Environment variable management
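
As a rough illustration of the Requests entry above, the sketch below fetches a page and keeps the text of its `<p>` elements. It assumes chapter text lives in paragraph tags and uses the standard library's `html.parser` as a stand-in, since the project's own parsing is not shown here. `ParagraphText`, `extract_paragraphs`, and `fetch_chapter` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List


class ParagraphText(HTMLParser):
    """Collect the text inside <p> elements, one string per paragraph."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: List[str] = []
        self._in_p = False
        self._parts: List[str] = []

    def flush(self) -> None:
        """Store the buffered paragraph, if any, with whitespace normalized."""
        text = " ".join("".join(self._parts).split())
        if text:
            self.paragraphs.append(text)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # tolerate a previous <p> that was never closed
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._parts.append(data)


def extract_paragraphs(html: str) -> List[str]:
    """Return the paragraph texts found in an HTML string."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep a trailing paragraph with no closing tag
    return parser.paragraphs


def fetch_chapter(url: str, timeout: float = 10.0) -> List[str]:
    """Download a page with requests and return its paragraph texts."""
    import requests  # third-party; imported here so parsing works without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return extract_paragraphs(response.text)
```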
---
## Installation
```bash
git clone https://github.com/devilisback100/AI_web_scrapper.git
cd AI_web_scrapper
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
playwright install