https://github.com/chanmeng666/douban-elite-scraper
【Stars are like virtual high-fives - come on, don't leave us hanging!⭐️】A streamlined Python scraper for archiving elite posts from Douban groups into well-structured Markdown files with images, designed for efficient content preservation and offline reading.
beautifulsoup chinese-social-media content-archiving data-collection data-mining douban image-downloader markdown python web-scraping
- Host: GitHub
- URL: https://github.com/chanmeng666/douban-elite-scraper
- Owner: ChanMeng666
- License: mit
- Created: 2024-11-17T02:06:45.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-01-07T10:16:47.000Z (10 months ago)
- Last Synced: 2025-02-01T01:41:23.228Z (9 months ago)
- Topics: beautifulsoup, chinese-social-media, content-archiving, data-collection, data-mining, douban, image-downloader, markdown, python, web-scraping
- Language: Python
- Homepage:
- Size: 19.5 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT
# 🔍 Douban Elite Scraper
Archive elite posts from Douban groups with style
## ✨ Features
### 🎯 Smart Content Extraction
Intelligently scrapes elite posts while respecting Douban's access patterns and rate limits.
### 📸 Complete Media Preservation
Downloads and organizes all images associated with each post, maintaining the original content integrity.
### 📝 Clean Markdown Generation
Converts posts into well-structured Markdown files, perfect for offline reading and archival.
### 🔒 Robust Error Handling
Comprehensive error management for network issues, missing content, and file system operations.
### 🚦 Rate Limiting Protection
Built-in delays and smart request handling to avoid overwhelming Douban's servers.
### 📊 Metadata Preservation
Retains important post information including author details and source URLs.
## 🚀 Installation
1. Clone the repository:
```bash
git clone https://github.com/ChanMeng666/douban-elite-scraper.git
cd douban-elite-scraper
```
2. Install required dependencies:
```bash
pip install -r requirements.txt
```
## 💻 Usage
1. Run the scraper:
```bash
python main.py
```
2. Configure the target group, and any posts to skip, by editing `main.py` before you run it:
```python
# Skip specific posts by title
skip_titles = ["够用就好2"]
# Target group URL
base_url = "https://www.douban.com/group/662976/?type=elite#topics"
```
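As an illustration of what `skip_titles` is for, the following is a minimal sketch of title-based filtering. The helper `filter_posts`, the dictionary layout of a post, and the `example.com` URLs are assumptions made for this example; the actual logic in `main.py` may differ.

```python
# Hypothetical helper, not taken from the project: shows one way
# a skip list of titles could be applied to a list of scraped posts.
def filter_posts(posts, skip_titles):
    """Return the posts whose title is not in skip_titles (exact match, whitespace stripped)."""
    skip = {title.strip() for title in skip_titles}
    return [post for post in posts if post["title"].strip() not in skip]


# Placeholder data; the URLs are illustrative only.
posts = [
    {"title": "够用就好2", "url": "https://example.com/topic/1"},
    {"title": "Another post", "url": "https://example.com/topic/2"},
]

kept = filter_posts(posts, ["够用就好2"])
print([post["title"] for post in kept])
```

Matching on the exact title keeps the behavior predictable; a substring or regex match would skip more posts than intended when titles overlap.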
## 📁 Project Structure
```
douban-elite-scraper/
├── main.py # Main script and entry point
├── scraper.py # Core scraping functionality
└── requirements.txt # Project dependencies
```
## 📦 Output Format
Each scraped post creates:
```
Post_Title_123abc/
├── post.md
├── image_1.jpg
├── image_2.jpg
└── image_3.jpg
```
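A folder name such as `Post_Title_123abc` suggests a sanitized title followed by a short unique suffix. The sketch below shows one way to build such a name. The function `post_folder_name`, the choice of a SHA-256 prefix of the URL as the suffix, and the 50-character limit are all assumptions; the README does not say how the project derives the suffix.

```python
import hashlib
import re


# Hypothetical helper: the project's real naming scheme may differ.
def post_folder_name(title, url, max_len=50):
    """Build a file-system-safe folder name from a post title plus a short suffix."""
    # Drop characters that are invalid on common file systems.
    cleaned = re.sub(r'[\\/:*?"<>|]', "", title)
    # Collapse runs of whitespace into single underscores and cap the length.
    cleaned = re.sub(r"\s+", "_", cleaned.strip())[:max_len] or "untitled"
    # Assumed suffix: first 6 hex digits of a hash of the URL, so two posts
    # with the same title still get distinct folders.
    suffix = hashlib.sha256(url.encode("utf-8")).hexdigest()[:6]
    return f"{cleaned}_{suffix}"


print(post_folder_name("Post: Title?", "https://example.com/topic/1"))
```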
The `post.md` file contains:
- Post title
- Author information
- Original URL
- Post content
- Image references
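The fields listed above could be assembled into a Markdown document roughly as follows. This is a sketch under assumptions: the function `render_post_markdown`, its parameters, and the exact layout of the metadata lines are hypothetical, and the project's real `post.md` may be formatted differently.

```python
# Hypothetical renderer: illustrates the listed fields, not the project's exact output.
def render_post_markdown(title, author, url, paragraphs, image_files):
    """Return Markdown text with title, author, source URL, body, and image references."""
    lines = [f"# {title}", "", f"- Author: {author}", f"- Source: {url}", ""]
    for paragraph in paragraphs:
        lines.extend([paragraph, ""])
    # Images are referenced by relative file name, so the folder stays portable.
    for index, filename in enumerate(image_files, start=1):
        lines.extend([f"![Image {index}]({filename})", ""])
    return "\n".join(lines).rstrip() + "\n"


markdown = render_post_markdown(
    title="Sample post",
    author="Sample author",
    url="https://example.com/topic/1",
    paragraphs=["First paragraph.", "Second paragraph."],
    image_files=["image_1.jpg", "image_2.jpg"],
)
print(markdown)
```

Using relative image paths means the whole post folder can be moved or zipped without breaking the references.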
## ⚙️ Configuration
The scraper includes several configurable options in the `DoubanScraper` class:
- User-Agent headers
- File naming patterns
- Rate limiting delays
- Output formatting
## 🛡️ Rate Limiting
By default, the scraper waits 2 seconds between requests. To change the delay, adjust the `time.sleep` call in `main.py`:
```python
time.sleep(2) # Adjust delay as needed
```
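To show where such a delay typically sits in a request loop, here is a minimal sketch. The function `fetch_all` and its `fetch` and `sleep` parameters are assumptions introduced for this example, not part of the project; passing the sleep function in makes the pacing testable without actually waiting.

```python
import time


# Hypothetical helper: shows a fixed pause between consecutive requests.
def fetch_all(urls, fetch, delay=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first request, only between requests.
            sleep(delay)
        results.append(fetch(url))
    return results


# Demo with a stand-in fetch function and a recording sleep, so nothing blocks.
pauses = []
pages = fetch_all(
    ["https://example.com/a", "https://example.com/b"],
    fetch=lambda url: f"<html>{url}</html>",
    delay=2.0,
    sleep=pauses.append,
)
print(len(pages), pauses)
```

In real use, `fetch` would wrap the HTTP call and `sleep` would be left at its default of `time.sleep`.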
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## ⚠️ Legal Notice
This tool is for educational purposes only. Make sure your use complies with Douban's terms of service and keeps appropriate rate limiting in place. You are responsible for how you use this tool.
## 📜 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🔧 Advanced Configuration
The `DoubanScraper` class provides additional configuration options:
```python
scraper = DoubanScraper(
headers={'User-Agent': 'your-custom-user-agent'},
delay=3, # Custom delay between requests
output_format='markdown' # Output format
)
```
See `scraper.py` for more configuration options.
## 🙋‍♀️ Author
Created and maintained by [Chan Meng](https://github.com/ChanMeng666).