https://github.com/chanmeng666/douban-review-scraper
  
  
「One star = One happy developer doing a little dance ⭐️」 A robust Python scraper for collecting and analyzing movie reviews from Douban.com, featuring comprehensive data processing and analysis capabilities.
  
beautifulsoup4 data-analysis data-processing douban movie-reviews pandas python sentiment-analysis text-mining web-scraping
    
- Host: GitHub
- URL: https://github.com/chanmeng666/douban-review-scraper
- Owner: ChanMeng666
- License: MIT
- Created: 2024-11-25T11:37:43.000Z
- Default Branch: master
- Last Pushed: 2025-01-07T10:45:23.000Z
- Last Synced: 2025-02-17T14:45:15.721Z
- Topics: beautifulsoup4, data-analysis, data-processing, douban, movie-reviews, pandas, python, sentiment-analysis, text-mining, web-scraping
- Language: Python
- Homepage:
- Size: 90.8 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0

Metadata Files:

- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

 
 Douban Movie Reviews Scraper
 A powerful tool for collecting and analyzing Douban movie reviews
 
 
 
 
# Features
- Robust scraping with rate limiting and retry mechanisms
- Advanced data cleaning and normalization
- Sentiment analysis categorization
- Efficient CSV export functionality
- Comprehensive error handling and logging
- Built-in protection against API rate limits
- Detailed comment metadata extraction
- Configurable scraping parameters
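
The retry behavior listed above can be illustrated with a small sketch. This is not the repository's code: `fetch_with_retry`, its parameters, and the exponential backoff policy are all assumptions made for the example. The `sleep` argument is injectable so the function can be exercised without real delays.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying on any exception with exponential backoff.

    Hypothetical helper: the name, signature, and backoff policy are
    illustrative and are not taken from the repository.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            # Wait base_delay * 2**attempt, plus a little jitter so that
            # parallel clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)
    raise AssertionError("retries must be non-negative")
```

In a scraper, `fetch` would typically wrap a call such as `requests.get(url, timeout=REQUEST_TIMEOUT)` followed by `raise_for_status()`, with `retries` set from `RETRY_TIMES`.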
# Requirements
- Python 3.8+
- Required packages:
  ```
  beautifulsoup4==4.12.3
  numpy==2.1.3
  pandas==2.2.3
  python-dateutil==2.9.0.post0
  pytz==2024.2
  requests~=2.32.3
  six==1.16.0
  soupsieve==2.6
  tzdata==2024.2
  ```
# Installation
1. Clone the repository:
```bash
git clone https://github.com/ChanMeng666/douban-review-scraper.git
```
2. Navigate to the project directory:
```bash
cd douban-review-scraper
```
3. Install dependencies:
```bash
pip install -r requirements.txt
```
# Configuration
Edit `config.py` to customize your scraping parameters:
```python
MOVIE_ID = 'your_movie_id'  # Douban movie ID
MAX_PAGES = 50              # Maximum pages to scrape
REQUEST_TIMEOUT = 30        # Request timeout in seconds
RETRY_TIMES = 3             # Number of retry attempts
```
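
As a rough illustration of how `MOVIE_ID` and `MAX_PAGES` might drive pagination, the sketch below builds one URL per page. The URL template, the `start`/`limit` query parameters, the page size of 20, and the `page_urls` helper are assumptions for this example and are not taken from the repository.

```python
from urllib.parse import urlencode

MOVIE_ID = "your_movie_id"  # placeholder, as in config.py
MAX_PAGES = 50

# Assumptions for illustration only: the URL template and page size below
# are not taken from the repository.
BASE_URL = "https://movie.douban.com/subject/{movie_id}/comments"
PAGE_SIZE = 20


def page_urls(movie_id: str, max_pages: int, page_size: int = PAGE_SIZE):
    """Yield one comments-page URL per page, up to max_pages."""
    for page in range(max_pages):
        query = urlencode({"start": page * page_size, "limit": page_size})
        yield f"{BASE_URL.format(movie_id=movie_id)}?{query}"


for url in page_urls(MOVIE_ID, 2):
    print(url)
```

Capping the loop at `max_pages` keeps a misconfigured run from requesting pages indefinitely.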
# Usage
1. Configure your target movie ID in `config.py`
2. Run the scraper:
```bash
python main.py
```
# Output Format
The scraper generates CSV files containing:
- `timestamp`: Comment timestamp
- `content`: Cleaned comment text
- `rating`: User rating (1-5)
- `user_id`: Douban user ID
- `category`: Comment category (positive/negative/neutral)
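
To make the column layout concrete, here is a minimal sketch that writes rows with these five fields. It uses the standard-library `csv` module rather than pandas to stay dependency-free. The rating thresholds in `categorize` are a stand-in assumption, since the README does not say how the scraper assigns categories, and the sample rows are made up.

```python
import csv
import io

FIELDS = ["timestamp", "content", "rating", "user_id", "category"]


def categorize(rating: int) -> str:
    """Map a 1-5 rating to a category (assumed thresholds, not the repo's)."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return "neutral"


def to_csv(rows) -> str:
    """Serialize review dicts to CSV text with the documented columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Derive the category from the rating before writing the row.
        writer.writerow({**row, "category": categorize(row["rating"])})
    return buffer.getvalue()


# Made-up sample rows for illustration.
sample = [
    {"timestamp": "2024-11-25 11:37:43", "content": "Great film",
     "rating": 5, "user_id": "user_a"},
    {"timestamp": "2024-11-26 09:00:00", "content": "Not for me",
     "rating": 2, "user_id": "user_b"},
]
print(to_csv(sample))
```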
# Important Notes
- Respect Douban's robots.txt and API limitations
- Update cookies periodically for reliable operation
- Consider using proxies for large-scale scraping
- Check Douban's terms of service before use
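
One way to act on the robots.txt note is to check each URL against the site's rules before requesting it. The sketch below uses the standard-library `urllib.robotparser`. The `is_allowed` helper, the user-agent string, and the example rules are hypothetical; the rules shown are not Douban's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration; these are not Douban's actual robots.txt.
example_rules = [
    "User-agent: *",
    "Disallow: /private/",
]
print(is_allowed(example_rules, "my-scraper", "https://example.com/private/page"))
print(is_allowed(example_rules, "my-scraper", "https://example.com/public/page"))
```

To check a live site instead, `RobotFileParser.set_url()` followed by `read()` downloads and parses the real file.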
# Contributing
Contributions are welcome! Please feel free to submit a Pull Request. Here's how you can contribute:
1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
# License
This project is licensed under the MIT License - see the LICENSE file for details.
# Author
**Chan Meng**
- LinkedIn: [chanmeng666](https://www.linkedin.com/in/chanmeng666/)
- GitHub: [ChanMeng666](https://github.com/ChanMeng666)
# Show your support
Give a ⭐️ if this project helped you!
---