Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/chanmeng666/douban-review-scraper

A web scraper for Douban movie comments with data processing capabilities, focusing on collecting and analyzing user reviews.

beautifulsoup4 data-processing douban movie-reviews pandas python web-scraping

Last synced: 8 days ago

README

🎬 Douban Movie Reviews Scraper

A powerful tool for collecting and analyzing Douban movie reviews

# 🚀 Features

- 🔄 Robust scraping with rate limiting and retry mechanisms (a sketch of this pattern follows this list)
- 🧹 Advanced data cleaning and normalization
- 📊 Sentiment analysis categorization
- 💾 Efficient CSV export functionality
- 🔍 Comprehensive error handling and logging
- 🛡️ Built-in protection against API rate limits
- 📝 Detailed comment metadata extraction
- 🎯 Configurable scraping parameters
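
The README does not spell out how the rate limiting and retries work; below is a minimal sketch of a fetch helper using `requests`, assuming a random polite delay and exponential backoff. The function name, delay values, and header handling are illustrative assumptions, not the project's actual implementation.

```python
import random
import time

import requests


def fetch_page(url, headers=None, max_retries=3, timeout=30):
    """Fetch a URL with a polite delay and simple retry/backoff.

    Illustrative sketch only; the real scraper's delays, headers,
    and retry policy may differ.
    """
    for attempt in range(1, max_retries + 1):
        try:
            # Rate limiting: wait a random 1-3 seconds before each request.
            time.sleep(random.uniform(1.0, 3.0))
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            print(f"Attempt {attempt}/{max_retries} failed for {url}: {exc}")
            if attempt == max_retries:
                raise
            # Exponential backoff before retrying.
            time.sleep(2 ** attempt)
```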

# 🛠️ Requirements

- Python 3.8+
- Required packages:
```
beautifulsoup4==4.12.3
numpy==2.1.3
pandas==2.2.3
python-dateutil==2.9.0.post0
pytz==2024.2
requests~=2.32.3
six==1.16.0
soupsieve==2.6
tzdata==2024.2
```

# 📦 Installation

1. Clone the repository:
```bash
git clone https://github.com/ChanMeng666/douban-review-scraper.git
```

2. Navigate to the project directory:
```bash
cd douban-review-scraper
```

3. Install dependencies:
```bash
pip install -r requirements.txt
```

# ⚙️ Configuration

Edit `config.py` to customize your scraping parameters:

```python
MOVIE_ID = 'your_movie_id' # Douban movie ID
MAX_PAGES = 50 # Maximum pages to scrape
REQUEST_TIMEOUT = 30 # Request timeout in seconds
RETRY_TIMES = 3 # Number of retry attempts
```
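
For orientation, here is a hedged sketch of how these settings could drive page iteration. The comment-page URL layout (20 reviews per page via a `start` offset) is an assumption for illustration; the project's actual request logic may differ.

```python
from urllib.parse import urlencode

import config  # the project's config.py shown above


def comment_page_urls():
    """Yield review-page URLs built from the configured movie ID.

    Assumes Douban-style pagination with a 'start' offset of 20 comments
    per page; this is illustrative, not taken from the project code.
    """
    base = f"https://movie.douban.com/subject/{config.MOVIE_ID}/comments"
    for page in range(config.MAX_PAGES):
        params = {"start": page * 20, "limit": 20, "status": "P"}
        yield f"{base}?{urlencode(params)}"
```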

# 🚀 Usage

1. Configure your target movie ID in `config.py`
2. Run the scraper:
```bash
python main.py
```

# 📊 Output Format

The scraper generates CSV files containing:
- `timestamp`: Comment timestamp
- `content`: Cleaned comment text
- `rating`: User rating (1-5)
- `user_id`: Douban user ID
- `category`: Comment category (positive/negative/neutral)
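
Because the output is plain CSV with the columns above, it can be inspected directly with pandas, which is already a project dependency. The file name below is a placeholder; use whichever CSV the scraper actually writes.

```python
import pandas as pd

# Placeholder file name; substitute the CSV produced by your run.
df = pd.read_csv("douban_reviews.csv", parse_dates=["timestamp"])

print(df.head())
print(df["category"].value_counts())  # positive / negative / neutral split
print(df["rating"].describe())        # distribution of 1-5 ratings
```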

# ⚠️ Important Notes

- Respect Douban's robots.txt and API limitations
- Update cookies periodically for reliable operation (see the session sketch after this list)
- Consider using proxies for large-scale scraping
- Check Douban's terms of service before use
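
Cookies and proxies can be supplied through the standard `requests` session interface; the sketch below shows one way to wire them in. The cookie names, proxy address, and URL are placeholders, and the README does not say where the project itself reads these values from.

```python
import requests

session = requests.Session()

# Placeholder cookie values: copy real ones from a logged-in browser session.
session.cookies.update({"bid": "your_bid_cookie", "dbcl2": "your_login_cookie"})

# Optional proxy for large-scale scraping; replace with a working proxy URL.
session.proxies.update({
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
})

session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; research use)"})

# Placeholder movie ID in the URL.
response = session.get(
    "https://movie.douban.com/subject/your_movie_id/comments", timeout=30
)
print(response.status_code)
```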

# 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. Here's how you can contribute:

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

# 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

# 👥 Author

**Chan Meng**
- LinkedIn: [chanmeng666](https://www.linkedin.com/in/chanmeng666/)
- GitHub: [ChanMeng666](https://github.com/ChanMeng666)

# 🌟 Show your support

Give a ⭐️ if this project helped you!

---


Made with ❤️ by Chan Meng