https://github.com/chanmeng666/douban-review-scraper
[One star = One happy developer doing a little dance] A robust Python scraper for collecting and analyzing movie reviews from Douban.com, featuring comprehensive data processing and analysis capabilities.
beautifulsoup4 data-analysis data-processing douban movie-reviews pandas python sentiment-analysis text-mining web-scraping
- Host: GitHub
- URL: https://github.com/chanmeng666/douban-review-scraper
- Owner: ChanMeng666
- License: mit
- Created: 2024-11-25T11:37:43.000Z (3 months ago)
- Default Branch: master
- Last Pushed: 2025-01-07T10:45:23.000Z (about 1 month ago)
- Last Synced: 2025-01-07T11:32:21.791Z (about 1 month ago)
- Topics: beautifulsoup4, data-analysis, data-processing, douban, movie-reviews, pandas, python, sentiment-analysis, text-mining, web-scraping
- Language: Python
- Homepage:
- Size: 90.8 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Douban Movie Reviews Scraper
A powerful tool for collecting and analyzing Douban movie reviews
# Features
- Robust scraping with rate limiting and retry mechanisms
- Advanced data cleaning and normalization
- Sentiment analysis categorization
- Efficient CSV export functionality
- Comprehensive error handling and logging
- Built-in protection against API rate limits
- Detailed comment metadata extraction
- Configurable scraping parameters

# Requirements
- Python 3.8+
- Required packages:
```
beautifulsoup4==4.12.3
numpy==2.1.3
pandas==2.2.3
python-dateutil==2.9.0.post0
pytz==2024.2
requests~=2.32.3
six==1.16.0
soupsieve==2.6
tzdata==2024.2
```

# Installation
1. Clone the repository:
```bash
git clone https://github.com/ChanMeng666/douban-review-scraper.git
```

2. Navigate to the project directory:
```bash
cd douban-review-scraper
```

3. Install dependencies:
```bash
pip install -r requirements.txt
```

# Configuration
Edit `config.py` to customize your scraping parameters:
```python
MOVIE_ID = 'your_movie_id' # Douban movie ID
MAX_PAGES = 50 # Maximum pages to scrape
REQUEST_TIMEOUT = 30 # Request timeout in seconds
RETRY_TIMES = 3 # Number of retry attempts
```

# Usage
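For a sense of how the `RETRY_TIMES` setting from `config.py` might be applied when the scraper runs, here is a minimal sketch of a retry loop with exponential backoff. It is not taken from the project's source: the helper name `with_retries`, the `base_delay` parameter, and the doubling backoff schedule are all assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Mirrors the value shown in config.py; treated here as the number of
# extra attempts after the first failure (an assumption).
RETRY_TIMES = 3


def with_retries(
    action: Callable[[], T],
    retries: int = RETRY_TIMES,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run action(); on any exception, wait and try again.

    The wait doubles after each failure (base_delay, 2*base_delay, ...).
    The last exception is re-raised once the retries are used up.
    """
    for attempt in range(retries + 1):
        try:
            return action()
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

In practice `action` could wrap a call such as `requests.get(url, timeout=REQUEST_TIMEOUT)`. The `sleep` parameter is injectable only so the helper can be exercised without real delays.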
1. Configure your target movie ID in `config.py`
2. Run the scraper:
```bash
python main.py
```

# Output Format
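As a rough illustration of working with the CSV columns described in this section, the sketch below reads a few rows and tallies them by `category`. It uses only the standard library so it runs without extra packages; pandas, already a dependency, would work equally well. The sample rows and the timestamp format are invented for the example and are not real scraper output.

```python
import csv
import io
from collections import Counter

# Made-up rows using the documented column names; not real data.
SAMPLE_CSV = """timestamp,content,rating,user_id,category
2024-12-01 10:00:00,Great film,5,user_a,positive
2024-12-02 11:30:00,Too slow,2,user_b,negative
2024-12-03 09:15:00,It was fine,3,user_c,neutral
"""


def summarize(csv_file):
    """Return (comment count per category, mean rating or None)."""
    rows = list(csv.DictReader(csv_file))
    counts = Counter(row["category"] for row in rows)
    # Skip rows with an empty rating field.
    ratings = [int(row["rating"]) for row in rows if row["rating"]]
    mean = sum(ratings) / len(ratings) if ratings else None
    return counts, mean


counts, mean_rating = summarize(io.StringIO(SAMPLE_CSV))
print(dict(counts), mean_rating)
```

To run this against a real export, pass an open file instead of the `io.StringIO` wrapper, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the scraper wrote its CSV.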
The scraper generates CSV files containing:
- `timestamp`: Comment timestamp
- `content`: Cleaned comment text
- `rating`: User rating (1-5)
- `user_id`: Douban user ID
- `category`: Comment category (positive/negative/neutral)

# Important Notes
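One of the notes in this section is to respect `robots.txt`. As a minimal sketch of what such a check could look like, the example below uses Python's standard `urllib.robotparser`. The rules in `SAMPLE_ROBOTS`, the `example-bot` user agent, and the `example.com` URLs are placeholders, not Douban's actual policy.

```python
from urllib.robotparser import RobotFileParser

# Placeholder rules for illustration only.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, agent: str = "example-bot") -> bool:
    """Return True if the parsed rules permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


parser = build_parser(SAMPLE_ROBOTS)
print(is_allowed(parser, "https://example.com/public/page"))
print(is_allowed(parser, "https://example.com/private/page"))
```

Against a live site, the parser can instead be pointed at the real file with `parser.set_url(...)` followed by `parser.read()`, and the check run before each request.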
- Respect Douban's robots.txt and API limitations
- Update cookies periodically for reliable operation
- Consider using proxies for large-scale scraping
- Check Douban's terms of service before use

# Contributing
Contributions are welcome! Please feel free to submit a Pull Request. Here's how you can contribute:
1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

# License
This project is licensed under the MIT License - see the LICENSE file for details.
# Author
**Chan Meng**
- LinkedIn: [chanmeng666](https://www.linkedin.com/in/chanmeng666/)
- GitHub: [ChanMeng666](https://github.com/ChanMeng666)

# Show your support
Give a ⭐️ if this project helped you!
---