Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/stasonjatham/scraproxy
API wrapper for Playwright to scrape websites, capture screenshots, and record videos of the scraping process. Built with FastAPI for high performance and includes interactive Redoc documentation. Fully Dockerized for easy setup and deployment.
api fastapi playwright python python3 scraper scraping
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/stasonjatham/scraproxy
- Owner: StasonJatham
- Created: 2024-10-17T19:28:21.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-10-24T08:24:32.000Z (4 months ago)
- Last Synced: 2024-10-25T18:35:10.367Z (3 months ago)
- Topics: api, fastapi, playwright, python, python3, scraper, scraping
- Language: Python
- Homepage: https://sandkiste.io
- Size: 69.3 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Scraproxy
This repository provides a high-performance web scraping API built with FastAPI and Playwright. It enables users to automate browsing, capture screenshots, minimize HTML, extract text, and even record videos of browsing sessions. The API supports various advanced features like network request/response tracking and cookie banner handling, making it ideal for automation, scraping, and content extraction.
## Features
- Browse Web Pages: Gather detailed information such as network data, logs, redirects, cookies, and performance metrics.
- Screenshots: Capture live screenshots or retrieve them from cache, with support for full-page screenshots and thumbnails.
- Minify HTML: Shrink HTML content by removing comments and unnecessary whitespace.
- Extract Text: Extract clean, plain text from HTML content.
- Video Recording: Record a browsing session and retrieve the video as a webm file.
- Reader Mode: Extract the main readable content and title from an HTML page, similar to “reader mode” in browsers.
- Markdown Conversion: Convert HTML content into Markdown format.
- Authentication: Optional Bearer token authentication using API_KEY.
## Endpoints
### 1. /browse
Browse a webpage and retrieve various details like page title, meta description, network data, logs, cookies, and more.
### 2. /screenshot
Capture a screenshot of the specified URL, with support for full-page captures and thumbnails.
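The feature list mentions serving screenshots from cache, and the setup section names a CACHE_EXPIRATION_SECONDS setting. The idea of an expiring cache keyed by URL can be sketched as follows. This is an in-memory stand-in written for illustration, not the project's Diskcache-based code; the `TTLCache` class, its `get_or_create` method, and the `producer` callback are hypothetical names.

```python
import time


class TTLCache:
    """In-memory stand-in for an expiring cache keyed by URL (illustrative only)."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self._store = {}    # url -> (expires_at, value)

    def get_or_create(self, url, producer):
        """Return the cached value for url, calling producer(url) on a miss or after expiry."""
        now = self.clock()
        hit = self._store.get(url)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = producer(url)
        self._store[url] = (now + self.ttl, value)
        return value
```

Here `producer` would be whatever function actually renders the screenshot; a repeated request inside the expiration window returns the stored bytes without calling it again.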
### 3. /minimize
Minimize HTML content by removing comments and unnecessary whitespace.
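To make "removing comments and unnecessary whitespace" concrete, here is a minimal regex-based sketch. It is an assumption about what such a step could look like, not the endpoint's actual implementation, and `minimize_html` is a hypothetical name.

```python
import re

_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)  # HTML comments, including multi-line ones
_BETWEEN_TAGS = re.compile(r">\s+<")             # whitespace that only separates two tags
_WHITESPACE = re.compile(r"\s+")                 # any remaining run of whitespace


def minimize_html(html: str) -> str:
    """Strip comments, drop whitespace between tags, and collapse the rest to single spaces."""
    html = _COMMENT.sub("", html)
    html = _BETWEEN_TAGS.sub("><", html)
    return _WHITESPACE.sub(" ", html).strip()
```

A regex pass like this is deliberately naive: it also collapses whitespace inside `<pre>` blocks and removes the space between adjacent inline elements, so a production minifier would need to be parser-aware.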
### 4. /extract_text
Extract plain text from the provided HTML content.
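As an illustration of plain-text extraction, the sketch below uses only the standard library's `html.parser`. It is not taken from the project; the `extract_text` function and the choice to skip `script` and `style` content are assumptions.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text nodes, ignoring script and style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def extract_text(html: str) -> str:
    """Return the text nodes of an HTML string joined by single spaces."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.parts)
```

Joining text nodes with a space keeps block elements apart, at the cost of approximate spacing around inline tags and punctuation.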
### 5. /reader
Extract the main readable content from an HTML page.
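Reader-mode extraction is usually done with a content-scoring heuristic. The sketch below swaps in a much simpler rule for illustration: take the `<title>` text and the paragraphs found inside `<article>`. It is an assumption, not the endpoint's algorithm, and `read_page` and its return keys are hypothetical.

```python
from html.parser import HTMLParser


class _ReaderSketch(HTMLParser):
    """Keeps the <title> text and the <p> blocks nested inside <article>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._article_depth = 0
        self._buffer = None  # text pieces of the paragraph being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "article":
            self._article_depth += 1
        elif tag == "p" and self._article_depth:
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "article" and self._article_depth:
            self._article_depth -= 1
        elif tag == "p" and self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._buffer is not None:
            self._buffer.append(data)


def read_page(html: str) -> dict:
    """Return the page title and the article paragraphs joined by blank lines."""
    parser = _ReaderSketch()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "content": "\n\n".join(parser.paragraphs)}
```

Pages that do not use `<article>` would yield empty content with this rule, which is why real reader modes rely on scoring instead of a fixed tag.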
### 6. /markdown
Convert the provided HTML content into Markdown format.
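To show what HTML-to-Markdown conversion involves, here is a deliberately small sketch covering headings, paragraphs, bold, emphasis, and links. It is not the project's converter, which may well use a dedicated library; `html_to_markdown` and the supported tag set are assumptions.

```python
from html.parser import HTMLParser


class _MarkdownSketch(HTMLParser):
    """Maps a handful of HTML tags to their Markdown equivalents."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*"}

    def __init__(self):
        super().__init__()
        self.out = []
        self._href = None  # target of the link currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.out.append(self.HEADINGS[tag])
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag == "a":
            self._href = dict(attrs).get("href") or ""
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in self.HEADINGS or tag == "p":
            self.out.append("\n\n")  # blank line after each block element
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag == "a" and self._href is not None:
            self.out.append(f"]({self._href})")
            self._href = None

    def handle_data(self, data):
        self.out.append(data)


def html_to_markdown(html: str) -> str:
    """Convert a small subset of HTML into Markdown text."""
    parser = _MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()
```

Lists, images, code blocks, and tables are left out on purpose; unknown tags are dropped and only their text is kept.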
### 7. /video
Record a video of a browsing session and return the video file in webm format. This allows for capturing the entire session from page load to interaction.
## Technology Stack
- FastAPI: For building high-performance, modern APIs.
- Playwright: For automating web browser interactions and scraping.
- Docker: Containerized for consistent environments and easy deployment.
- Diskcache: Efficient caching to reduce redundant scraping requests.
- Pillow: For image processing, optimization, and thumbnail creation.
## Setup
1. Clone the repository.
2. Create a .env file or set the necessary environment variables:
   - API_KEY (optional): For authentication.
   - CACHE_EXPIRATION_SECONDS: Cache expiration time in seconds (default is 3600).
   - PLAYWRIGHT_BROWSERS_PATH (optional): Set a custom path for Playwright browsers.
3. Build and run the Docker container: `docker-compose up --build`
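For orientation, the environment variables listed in step 2 could be read along the lines of the sketch below. Only the variable names and the 3600-second default come from this README; the `Settings` dataclass and `load_settings` function are hypothetical and do not describe the project's actual configuration code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    api_key: Optional[str]                  # None means authentication is disabled
    cache_expiration_seconds: int
    playwright_browsers_path: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, treating unset or empty optional values as None."""
    return Settings(
        api_key=env.get("API_KEY") or None,
        cache_expiration_seconds=int(env.get("CACHE_EXPIRATION_SECONDS", "3600")),
        playwright_browsers_path=env.get("PLAYWRIGHT_BROWSERS_PATH") or None,
    )
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check without touching the real environment.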
## Usage
1. To browse a webpage, send a GET request to /browse with the URL.
2. For screenshots, use the /screenshot endpoint, providing the target URL.
3. Use the /minimize and /extract_text endpoints to process raw HTML.
4. Use the /video endpoint to record and retrieve the video of a browsing session.
5. Use /reader to extract readable content and /markdown to convert HTML into Markdown format.
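The GET request from step 1 can be sketched with the standard library as shown below. Several details are assumptions not stated in this README: the base URL and port, the `url` query parameter name, and the `build_request` helper itself. The Bearer header follows the optional API_KEY authentication described under Features.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumption: wherever the container is exposed


def build_request(endpoint: str, target_url: str, api_key: Optional[str] = None) -> Request:
    """Build a GET request for an endpoint such as /browse or /screenshot."""
    query = urlencode({"url": target_url})  # assumption: the target is passed as ?url=...
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    return Request(f"{BASE_URL}{endpoint}?{query}", headers=headers, method="GET")
```

With the service running, the request could be sent with `urllib.request.urlopen(build_request("/browse", "https://example.com"))`. The interactive Redoc documentation mentioned in the project description is the place to confirm the real parameter names.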