https://github.com/sadadyes/post-archiver
A tool to scrape YouTube community posts
- Host: GitHub
- URL: https://github.com/sadadyes/post-archiver
- Owner: sadadYes
- License: MIT
- Created: 2024-11-05T14:34:12.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-12T06:59:52.000Z (about 1 year ago)
- Last Synced: 2025-01-05T15:48:17.012Z (12 months ago)
- Topics: beautifulsoup, beautifulsoup4, playwright, playwright-python, python, python3, scraper, youtube
- Language: Python
- Homepage: https://post.sadad.rest/
- Size: 55.7 KB
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# DEPRECATION WARNING! This project is no longer maintained!
## This project has been superseded by [post-archiver-improved](https://github.com/sadadYes/post-archiver-improved)
# YouTube Community Scraper
A Python tool to scrape posts from YouTube community tabs.
## Features
- Scrape posts from YouTube community tabs
- Download images from posts
- Collect post comments
- Multi-browser support (Chromium, Firefox, WebKit)
- Automatic browser installation
- Proxy support (HTTP/HTTPS with auth, SOCKS5 without auth)
- Progress saving
- Configurable output directory
## Installation
Install using pip:
```bash
pip install post-archiver
```
Or install from source:
```bash
git clone https://github.com/sadadYes/post-archiver.git
cd post-archiver
pip install -e .
```
## Requirements
- Python 3.7 or higher
- No manual browser setup required; browsers are installed automatically on first use
## Usage
```
usage: post-archiver [OPTIONS] url [amount]
YouTube Community Posts Scraper
positional arguments:
  url                   YouTube channel community URL
  amount                Amount of posts to get (default: max)

options:
  -h, --help            show this help message and exit
  -c, --get-comments    Get comments from posts (WARNING: This is slow) (default: False)
  -i, --get-images      Get images from posts (default: False)
  -d, --download-images
                        Download images (requires --get-images)
  -q IMAGE_QUALITY, --image-quality IMAGE_QUALITY
                        Image quality: src, sd, or all (default: all)
  --proxy PROXY         Proxy file or single proxy string
  -o OUTPUT, --output OUTPUT
                        Output directory (default: current directory)
  -v, --verbose         Show basic progress information
  -t, --trace           Show detailed debug information
  --browser {chromium,firefox,webkit}
                        Browser to use (default: chromium)
  --version             show program's version number and exit
  --member-only         Only get membership-only posts (requires --cookies)
  --browser-cookies {chrome,firefox,edge,opera}
                        Get cookies from browser (requires browser-cookie3)

Proxy format:
  Single proxy: scheme://username:password@host:port
  Proxy file: One proxy per line using the same format
  Supported schemes: http, https
  Note: SOCKS5 proxies are supported but without authentication

Amount:
  Specify number of posts to scrape (default: max)
  Use 'max' or any number <= 0 to scrape all posts

Examples:
  post-archiver https://www.youtube.com/@channel/posts
  post-archiver https://www.youtube.com/@channel/posts 50
  post-archiver -c -i -d -q src https://www.youtube.com/@channel/posts max
  post-archiver --browser firefox https://www.youtube.com/@channel/posts
  post-archiver --proxy proxies.txt https://www.youtube.com/@channel/posts 100
  post-archiver --proxy http://username:password@host:port https://www.youtube.com/@channel/posts
  post-archiver --proxy https://username:password@host:port https://www.youtube.com/@channel/posts
  post-archiver --proxy socks5://host:port https://www.youtube.com/@channel/posts
```
## Browser Support
The scraper supports three browser engines:
- Chromium (default)
- Firefox
- WebKit
The appropriate browser will be automatically installed when first used. You can specify which browser to use with the `--browser` option.
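
If you ever need to install an engine manually (for example, to debug a failed automatic install), the equivalent step is running Playwright's own installer. The sketch below is only an illustration of that mechanism; `ensure_browser` is a hypothetical helper and not part of post-archiver's API:

```python
import subprocess
import sys

def ensure_browser(browser: str = "chromium") -> None:
    """Install the requested Playwright browser engine if it is missing.

    `python -m playwright install <engine>` is idempotent, so calling it
    when the engine is already present is a cheap no-op.
    """
    subprocess.run(
        [sys.executable, "-m", "playwright", "install", browser],
        check=True,
    )

ensure_browser("firefox")  # e.g. before running with --browser firefox
```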
## Proxy Support
The scraper supports the following proxy types:
- HTTP proxies with authentication
- HTTPS proxies with authentication
- SOCKS5 proxies (without authentication)
**Note:** SOCKS5 proxies with authentication are not supported due to limitations in the underlying browser automation.
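
These proxy settings correspond to Playwright's `proxy` launch option, which takes a server plus optional credentials (credentials cannot be used with SOCKS5). A minimal sketch, assuming a hypothetical proxy host and using the Playwright sync API directly rather than post-archiver's internals:

```python
from playwright.sync_api import sync_playwright

# Hypothetical HTTP proxy with authentication; for SOCKS5 use
# {"server": "socks5://host:port"} and omit the credentials.
proxy = {
    "server": "http://proxy.example.com:8080",
    "username": "username",
    "password": "password",
}

with sync_playwright() as p:
    browser = p.chromium.launch(proxy=proxy)
    page = browser.new_page()
    page.goto("https://www.youtube.com/@channel/posts")
    print(page.title())
    browser.close()
```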
## Logging
Two levels of logging are available:
- `--verbose (-v)`: Shows basic progress information
- `--trace (-t)`: Shows detailed debug information including browser console messages
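
One plausible reading of these flags (an assumption, not confirmed by the source) is that `-v` corresponds to INFO-level logging while `-t` enables DEBUG output and forwards the page's console messages. A sketch using the standard `logging` module and Playwright's `console` event:

```python
import logging

from playwright.sync_api import sync_playwright

# Assumed mapping: -v -> INFO, -t -> DEBUG plus browser console forwarding.
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("post-archiver")

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Forward the page's console output to the Python logger, which is how
    # "browser console messages" could surface under --trace.
    page.on("console", lambda msg: log.debug("console[%s] %s", msg.type, msg.text))
    page.goto("https://www.youtube.com/@channel/posts")
    browser.close()
```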
## License
MIT License