Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/chanmeng666/douban-review-scraper

A web scraper for Douban movie comments with data processing capabilities, focusing on collecting and analyzing user reviews.

beautifulsoup4 data-processing douban movie-reviews pandas python web-scraping

Last synced: 8 days ago

README

🎬 Douban Movie Reviews Scraper

A powerful tool for collecting and analyzing Douban movie reviews

# 🚀 Features

- 🔄 Robust scraping with rate limiting and retry mechanisms (a sketch of this pattern follows this list)
- 🧹 Advanced data cleaning and normalization
- 📊 Sentiment analysis categorization
- 💾 Efficient CSV export functionality
- 🔍 Comprehensive error handling and logging
- 🛡️ Built-in protection against API rate limits
- 📝 Detailed comment metadata extraction
- 🎯 Configurable scraping parameters
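
The README does not spell out how the rate limiting and retries work; below is a minimal sketch of a fetch helper using `requests`, assuming a random polite delay and exponential backoff. The function name, delay values, and header handling are illustrative assumptions, not the project's actual implementation.

```python
import random
import time

import requests


def fetch_page(url, headers=None, max_retries=3, timeout=30):
    """Fetch a URL with a polite delay and simple retry/backoff.

    Illustrative sketch only; the real scraper's delays, headers,
    and retry policy may differ.
    """
    for attempt in range(1, max_retries + 1):
        try:
            # Rate limiting: wait a random 1-3 seconds before each request.
            time.sleep(random.uniform(1.0, 3.0))
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            print(f"Attempt {attempt}/{max_retries} failed for {url}: {exc}")
            if attempt == max_retries:
                raise
            # Exponential backoff before retrying.
            time.sleep(2 ** attempt)
```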

# 🛠️ Requirements

- Python 3.8+
- Required packages:
```
beautifulsoup4==4.12.3
numpy==2.1.3
pandas==2.2.3
python-dateutil==2.9.0.post0
pytz==2024.2
requests~=2.32.3
six==1.16.0
soupsieve==2.6
tzdata==2024.2
```

# 📦 Installation

1. Clone the repository:
```bash
git clone https://github.com/ChanMeng666/douban-review-scraper.git
```

2. Navigate to the project directory:
```bash
cd douban-review-scraper
```

3. Install dependencies:
```bash
pip install -r requirements.txt
```

# ⚙️ Configuration

Edit `config.py` to customize your scraping parameters:

```python
MOVIE_ID = 'your_movie_id' # Douban movie ID
MAX_PAGES = 50 # Maximum pages to scrape
REQUEST_TIMEOUT = 30 # Request timeout in seconds
RETRY_TIMES = 3 # Number of retry attempts
```
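
For orientation, here is a hedged sketch of how these settings could drive page iteration. The comment-page URL layout (20 reviews per page via a `start` offset) is an assumption for illustration; the project's actual request logic may differ.

```python
from urllib.parse import urlencode

import config  # the project's config.py shown above


def comment_page_urls():
    """Yield review-page URLs built from the configured movie ID.

    Assumes Douban-style pagination with a 'start' offset of 20 comments
    per page; this is illustrative, not taken from the project code.
    """
    base = f"https://movie.douban.com/subject/{config.MOVIE_ID}/comments"
    for page in range(config.MAX_PAGES):
        params = {"start": page * 20, "limit": 20, "status": "P"}
        yield f"{base}?{urlencode(params)}"
```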

# 🚀 Usage

1. Configure your target movie ID in `config.py`
2. Run the scraper:
```bash
python main.py
```

# 📊 Output Format

The scraper generates CSV files containing:
- `timestamp`: Comment timestamp
- `content`: Cleaned comment text
- `rating`: User rating (1-5)
- `user_id`: Douban user ID
- `category`: Comment category (positive/negative/neutral)
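
Because the output is plain CSV with the columns above, it can be inspected directly with pandas, which is already a project dependency. The file name below is a placeholder; use whichever CSV the scraper actually writes.

```python
import pandas as pd

# Placeholder file name; substitute the CSV produced by your run.
df = pd.read_csv("douban_reviews.csv", parse_dates=["timestamp"])

print(df.head())
print(df["category"].value_counts())  # positive / negative / neutral split
print(df["rating"].describe())        # distribution of 1-5 ratings
```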

# ⚠️ Important Notes

- Respect Douban's robots.txt and API limitations
- Update cookies periodically for reliable operation (see the session sketch after this list)
- Consider using proxies for large-scale scraping
- Check Douban's terms of service before use
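
Cookies and proxies can be supplied through the standard `requests` session interface; the sketch below shows one way to wire them in. The cookie names, proxy address, and URL are placeholders, and the README does not say where the project itself reads these values from.

```python
import requests

session = requests.Session()

# Placeholder cookie values: copy real ones from a logged-in browser session.
session.cookies.update({"bid": "your_bid_cookie", "dbcl2": "your_login_cookie"})

# Optional proxy for large-scale scraping; replace with a working proxy URL.
session.proxies.update({
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
})

session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; research use)"})

# Placeholder movie ID in the URL.
response = session.get(
    "https://movie.douban.com/subject/your_movie_id/comments", timeout=30
)
print(response.status_code)
```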

# 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. Here's how you can contribute:

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

# 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

# 👥 Author

**Chan Meng**
- LinkedIn: [chanmeng666](https://www.linkedin.com/in/chanmeng666/)
- GitHub: [ChanMeng666](https://github.com/ChanMeng666)

# 🌟 Show your support

Give a ⭐️ if this project helped you!

---


Made with ❤️ by Chan Meng