https://github.com/chanmeng666/douban-review-scraper
A web scraper for Douban movie comments with data processing capabilities, focusing on collecting and analyzing user reviews.
- Host: GitHub
- URL: https://github.com/chanmeng666/douban-review-scraper
- Owner: ChanMeng666
- License: mit
- Created: 2024-11-25T11:37:43.000Z
- Default Branch: master
- Last Pushed: 2024-12-07T12:59:51.000Z
- Last Synced: 2024-12-07T13:35:26.724Z
- Topics: beautifulsoup4, data-processing, douban, movie-reviews, pandas, python, web-scraping
- Language: Python
- Homepage:
- Size: 74.2 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Douban Movie Reviews Scraper
A powerful tool for collecting and analyzing Douban movie reviews
# Features
- Robust scraping with rate limiting and retry mechanisms
- Advanced data cleaning and normalization
- Sentiment categorization of comments (positive/negative/neutral)
- Efficient CSV export functionality
- Comprehensive error handling and logging
- Built-in protection against API rate limits
- Detailed comment metadata extraction
- Configurable scraping parameters
# Requirements
- Python 3.10+ (the pinned `numpy==2.1.3` does not support older Python versions)
- Required packages:
```
beautifulsoup4==4.12.3
numpy==2.1.3
pandas==2.2.3
python-dateutil==2.9.0.post0
pytz==2024.2
requests~=2.32.3
six==1.16.0
soupsieve==2.6
tzdata==2024.2
```
# Installation
1. Clone the repository:
```bash
git clone https://github.com/ChanMeng666/douban-review-scraper.git
```
2. Navigate to the project directory:
```bash
cd douban-review-scraper
```
3. Install dependencies:
```bash
pip install -r requirements.txt
```
# Configuration
Edit `config.py` to customize your scraping parameters:
```python
MOVIE_ID = 'your_movie_id' # Douban movie ID
MAX_PAGES = 50 # Maximum pages to scrape
REQUEST_TIMEOUT = 30 # Request timeout in seconds
RETRY_TIMES = 3 # Number of retry attempts
```
# Usage
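Settings such as `REQUEST_TIMEOUT` and `RETRY_TIMES` are typically consumed by a retry loop around each page request. The sketch below is an assumption about how that could look, not the project's actual code: `fetch_with_retries`, `BASE_DELAY`, and the injected `fetch` callable are hypothetical names introduced for illustration, and treating `RETRY_TIMES` as "retries after the first attempt" is also an assumption.

```python
import time
from typing import Callable, Optional

# Values mirroring config.py; BASE_DELAY is an assumed extra setting.
REQUEST_TIMEOUT = 30  # seconds, passed through to the fetch callable
RETRY_TIMES = 3       # assumed meaning: retries after the first failed attempt
BASE_DELAY = 1.0      # hypothetical: initial backoff in seconds


def fetch_with_retries(
    fetch: Callable[[float], str],
    retries: int = RETRY_TIMES,
    timeout: float = REQUEST_TIMEOUT,
    base_delay: float = BASE_DELAY,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(timeout) up to retries + 1 times with exponential backoff.

    Returns the first successful result, or None if every attempt fails.
    """
    for attempt in range(retries + 1):
        try:
            return fetch(timeout)
        except Exception as exc:  # a real scraper would catch narrower errors
            if attempt == retries:
                print(f"giving up after {attempt + 1} attempts: {exc}")
                return None
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * (2 ** attempt))
    return None
```

With the `requests` package, the `fetch` argument could be something like `lambda t: requests.get(url, timeout=t).text`, where `url` is whichever page is being requested. Injecting the callable keeps the retry logic independent of the HTTP library and easy to test without network access.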
1. Configure your target movie ID in `config.py`
2. Run the scraper:
```bash
python main.py
```
# Output Format
The scraper generates CSV files containing:
- `timestamp`: Comment timestamp
- `content`: Cleaned comment text
- `rating`: User rating (1-5)
- `user_id`: Douban user ID
- `category`: Comment category (positive/negative/neutral)
# Important Notes
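One way to act on the robots.txt note in this section is to check a URL against the site's rules before requesting it. The sketch below uses the standard library's `urllib.robotparser`; the rules in `SAMPLE_ROBOTS_TXT`, the example URLs, the `my-scraper` user agent, and the `is_allowed` helper are all made up for illustration and do not reflect Douban's real robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Douban's actual robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/reviews/1"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/x"))
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` would download the real file instead of parsing a string.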
- Respect Douban's robots.txt and API limitations
- Update cookies periodically for reliable operation
- Consider using proxies for large-scale scraping
- Check Douban's terms of service before use
# Contributing
Contributions are welcome! Please feel free to submit a Pull Request. Here's how you can contribute:
1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
# License
This project is licensed under the MIT License - see the LICENSE file for details.
# Author
**Chan Meng**
- LinkedIn: [chanmeng666](https://www.linkedin.com/in/chanmeng666/)
- GitHub: [ChanMeng666](https://github.com/ChanMeng666)
# Show your support
Give a ⭐️ if this project helped you!
---
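As a rough sketch of consuming the CSV described under Output Format, the snippet below tallies comments per `category` and averages `rating` using only the standard library. The two sample rows, the timestamp format, and the `summarize` helper are invented for illustration; real files may differ in column order, encoding, or empty values.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only, using the documented column names.
SAMPLE_CSV = """\
timestamp,content,rating,user_id,category
2024-12-01 10:00:00,Great film,5,user_a,positive
2024-12-02 11:30:00,Too slow,2,user_b,negative
"""


def summarize(csv_file):
    """Return (comments per category, mean rating or None) for a reviews CSV."""
    counts = Counter()
    ratings = []
    for row in csv.DictReader(csv_file):
        counts[row["category"]] += 1
        if row["rating"]:  # skip rows that carry no rating
            ratings.append(float(row["rating"]))
    mean = sum(ratings) / len(ratings) if ratings else None
    return counts, mean


counts, mean_rating = summarize(io.StringIO(SAMPLE_CSV))
print(dict(counts), mean_rating)
```

For a file on disk, pass a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is the location of the exported CSV.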