https://github.com/ahmed122000/hotels_scraper
A Python-based web scraper and Flask API for extracting hotel data from Booking.com. Features include detailed room information, amenities, and JSON export functionality. Perfect for travel data analysis and exploration!
- Host: GitHub
- URL: https://github.com/ahmed122000/hotels_scraper
- Owner: Ahmed122000
- License: mit
- Created: 2024-01-30T18:30:45.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-01-30T03:22:03.000Z (over 1 year ago)
- Last Synced: 2025-02-02T17:53:48.761Z (over 1 year ago)
- Topics: beautifulsoup4, flask, json, pytohn3, rest-api, scraping, scraping-websites
- Language: Python
- Homepage:
- Size: 2.74 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# 🏨 Hotels Scraper - Web Scraping & API Service
> A powerful Python web scraping solution for extracting comprehensive hotel data from Booking.com with a RESTful Flask API for seamless integration.
---
## 📑 Table of Contents
- [Overview](#-overview)
- [Features](#-features)
- [Tech Stack](#-tech-stack)
- [Project Structure](#-project-structure)
- [Installation](#-installation)
- [API Documentation](#-api-documentation)
- [Usage Guide](#-usage-guide)
- [Configuration](#-configuration)
- [Database Schema](#-database-schema)
- [Troubleshooting](#-troubleshooting)
- [Legal Notice](#-legal-notice)
---
## 📊 Overview
This project provides a complete web scraping and API service for hotel data extraction. It combines:
- **Selenium WebDriver** for dynamic page rendering
- **BeautifulSoup** for HTML parsing
- **Flask** for REST API endpoints
- **JSON** for data persistence
**Use Cases**:
- Travel price comparison
- Market research
- Amenity analysis
- Availability tracking
---
## ✨ Features
### 🕷️ Web Scraper Features
| Feature | Description |
|---------|-------------|
| **Dynamic Rendering** | JavaScript-enabled browsing with Selenium |
| **Hotel Information** | Scrapes address, images, ratings, reviews |
| **Room Details** | Room types, prices, availability, capacity |
| **Amenities** | Comprehensive list of hotel amenities |
| **Images** | Downloads and stores hotel photos |
| **Pagination** | Automatically handles multi-page results |
| **Error Recovery** | Handles timeouts and connection failures |
### 🌐 API Endpoints
| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/codes` | GET | List available city codes |
| `/scrape` | GET | Start scraping process |
| `/download/<file_name>` | GET | Download JSON files |
| `/status` | GET | Check scraper status |
| `/history` | GET | View scraping history |
### 📊 Data Extraction
**Hotel Information**:
- Hotel name and official rating
- Address and GPS coordinates
- Phone number and website
- Check-in/check-out times
- Number of rooms and floors
**Room Details**:
- Room type and size
- Price per night
- Occupancy capacity
- Available dates
- Special offers
**Amenities**:
- WiFi availability
- Parking options
- Pool facilities
- Fitness center
- Pet policies
- Accessibility features
---
## 🛠️ Tech Stack
| Component | Technology |
|-----------|-----------|
| **Language** | Python 3.8+ |
| **Web Framework** | Flask 2.0+ |
| **Web Driver** | Selenium 4.0+ |
| **HTML Parser** | BeautifulSoup 4.0+ |
| **Browser** | Firefox (Geckodriver) |
| **Data Format** | JSON |
| **Logging** | Python logging module |
---
## 📂 Project Structure
```plaintext
Hotels_Scraper/
├── app.py                      # Flask application entry point
├── booking_hotels.py           # Scraper implementation
├── requirements.txt            # Python dependencies
├── .env.example                # Environment configuration template
│
├── config/
│   ├── cities.json             # City codes and configuration
│   ├── selectors.json          # CSS selectors for scraping
│   └── user_agents.txt         # Browser user agents
│
├── data/
│   ├── raw/                    # Raw scraped data
│   ├── processed/              # Cleaned data
│   └── hotels_data_*.json      # Output JSON files
│
├── logs/
│   ├── scraper.log             # Scraping operations log
│   └── api.log                 # API request log
│
├── templates/
│   ├── index.html              # Web interface
│   ├── results.html            # Results page
│   └── error.html              # Error page
│
├── static/
│   ├── css/
│   │   └── style.css
│   ├── js/
│   │   └── app.js
│   └── images/
│       └── logo.png
│
├── tests/
│   ├── test_scraper.py
│   ├── test_api.py
│   └── test_selectors.py
│
├── docs/
│   ├── API.md                  # API documentation
│   ├── SCRAPING.md             # Scraping guide
│   └── EXAMPLES.md             # Usage examples
│
└── README.md                   # This file
```
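The `logs/` directory above is fed by Python's `logging` module. The sketch below shows one way a file logger for `logs/scraper.log` could be set up. The `make_logger` helper and the log format are assumptions for illustration, not code taken from the repository.

```python
import logging
from pathlib import Path


def make_logger(name: str, log_file: str) -> logging.Logger:
    """Return a logger that appends timestamped lines to log_file."""
    # Create the log directory on first use (e.g. ./logs).
    Path(log_file).parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(log_file, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calling `make_logger("scraper", "logs/scraper.log").info("scraped %d hotels", 47)` would append one formatted line to the file.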
---
## 🚀 Installation
### Prerequisites
- **Python 3.8+**
- **Firefox browser**
- **Geckodriver** (Firefox WebDriver)
- **pip** (Python package manager)
### Step 1: Install System Dependencies
**Linux (Ubuntu/Debian)**:
```bash
sudo apt-get update
sudo apt-get install python3 python3-pip firefox
wget https://github.com/mozilla/geckodriver/releases/download/v0.33.0/geckodriver-v0.33.0-linux64.tar.gz
tar -xvf geckodriver-v0.33.0-linux64.tar.gz
sudo mv geckodriver /usr/local/bin/
```
**Mac**:
```bash
brew install python3 geckodriver
brew install --cask firefox
```
**Windows**:
1. Download Firefox from https://www.mozilla.org/
2. Download Geckodriver from https://github.com/mozilla/geckodriver/releases
3. Add Geckodriver to PATH
### Step 2: Clone Repository
```bash
git clone https://github.com/Ahmed122000/Hotels_Scraper.git
cd Hotels_Scraper
```
### Step 3: Create Virtual Environment
```bash
python3 -m venv venv
source venv/bin/activate # Linux/Mac
# or
venv\Scripts\activate # Windows
```
### Step 4: Install Dependencies
```bash
pip install -r requirements.txt
```
**requirements.txt**:
```
beautifulsoup4==4.11.1
selenium==4.10.0
requests==2.31.0
flask==2.3.0
python-dotenv==1.0.0
```
### Step 5: Configuration
Create a `.env` file from the template:
```bash
cp .env.example .env
```
Edit `.env`:
```ini
# Flask configuration
FLASK_ENV=development
DEBUG=True
SECRET_KEY=your-secret-key
# Scraping configuration
HEADLESS_BROWSER=True
TIMEOUT=30
RETRY_ATTEMPTS=3
# Data paths
DATA_DIR=./data
LOG_DIR=./logs
```
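A minimal sketch of how these values might be read at startup, assuming they are available in the process environment (for example after `python-dotenv` has loaded `.env`). The `Settings` class and `load_settings` helper are hypothetical names, not part of the repository; the defaults mirror the values shown above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the scraping-related .env values."""
    headless_browser: bool
    timeout: int
    retry_attempts: int
    data_dir: str
    log_dir: str


def _as_bool(value: str) -> bool:
    # Treat common truthy spellings as True; everything else is False.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from an environment mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        headless_browser=_as_bool(env.get("HEADLESS_BROWSER", "True")),
        timeout=int(env.get("TIMEOUT", "30")),
        retry_attempts=int(env.get("RETRY_ATTEMPTS", "3")),
        data_dir=env.get("DATA_DIR", "./data"),
        log_dir=env.get("LOG_DIR", "./logs"),
    )
```

Environment variables are always strings, so converting them to `bool` and `int` in one place keeps type handling out of the scraping code.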
### Step 6: Run Application
```bash
python app.py
```
The application will be available at `http://localhost:5000`.
---
## 📡 API Documentation
### 1. Get City Codes
**Endpoint**: `GET /codes`
**Description**: Retrieve available city codes for scraping
**Response**:
```json
{
  "status": "success",
  "cities": {
    "cairo": "290692",
    "alexandria": "290263",
    "hurghada": "290029",
    "sharm_el_sheikh": "290039",
    "giza": "290693"
  }
}
```
**Example**:
```bash
curl -X GET http://localhost:5000/codes
```
---
### 2. Start Scraping
**Endpoint**: `GET /scrape`
**Query Parameters**:
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `city` | string | Yes | City name (e.g., "cairo") |
| `city_code` | string | Yes | City code from `/codes` |
| `pages` | integer | No | Number of pages to scrape (default: 1) |
| `format` | string | No | Output format: "json" or "csv" (default: "json") |
| `max_results` | integer | No | Maximum hotels to scrape (default: 50) |
**Response**:
```json
{
  "status": "success",
  "message": "Scraping for cairo completed. Data saved",
  "data_file": "cairo_hotels_1672503492.json",
  "download_link": "http://localhost:5000/download/cairo_hotels_1672503492.json",
  "metadata": {
    "city": "cairo",
    "hotels_count": 47,
    "pages_scraped": 1,
    "duration_seconds": 45.23,
    "timestamp": "2023-12-31T10:30:00Z"
  }
}
```
**Example**:
```bash
# Scrape 2 pages of Cairo hotels
curl -X GET "http://localhost:5000/scrape?city=cairo&city_code=290692&pages=2&format=json"
```
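The query string can also be built and checked in code before the request is sent. The sketch below uses only the parameters and defaults from the table above; `build_scrape_url` is a hypothetical client-side helper, and the validation rules are assumptions rather than documented server behaviour.

```python
from urllib.parse import urlencode

ALLOWED_FORMATS = {"json", "csv"}


def build_scrape_url(base_url, city, city_code, pages=1,
                     output_format="json", max_results=50):
    """Return a /scrape URL with validated query parameters."""
    if not city or not city_code:
        raise ValueError("city and city_code are required")
    if pages < 1 or max_results < 1:
        raise ValueError("pages and max_results must be positive")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}")
    # urlencode escapes values and keeps the insertion order of the dict.
    query = urlencode({
        "city": city,
        "city_code": city_code,
        "pages": pages,
        "format": output_format,
        "max_results": max_results,
    })
    return f"{base_url.rstrip('/')}/scrape?{query}"
```

For example, `build_scrape_url("http://localhost:5000", "cairo", "290692", pages=2)` yields the same request as the `curl` call above, with `max_results=50` appended.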
---
### 3. Download Data
**Endpoint**: `GET /download/<file_name>`
**Description**: Download previously saved JSON/CSV files
**Parameters**:
| Parameter | Type | Description |
|-----------|------|-------------|
| `file_name` | string | Name of file in data directory |
**Response**: File download or error
**Example**:
```bash
curl -X GET http://localhost:5000/download/cairo_hotels_1672503492.json \
-o hotels_data.json
```
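Because `file_name` arrives from the URL, a handler for this route would normally confine lookups to the data directory. The sketch below shows that check with the standard library. `resolve_download` is a hypothetical helper and the repository's handler may differ; Flask's `send_from_directory` performs a comparable check.

```python
from pathlib import Path
from typing import Optional


def resolve_download(data_dir: str, file_name: str) -> Optional[Path]:
    """Return the file's path if it exists inside data_dir, else None."""
    base = Path(data_dir).resolve()
    candidate = (base / file_name).resolve()
    # Reject names such as "../app.py" that escape the data directory.
    if base not in candidate.parents:
        return None
    return candidate if candidate.is_file() else None
```

A `None` result would map naturally to a 404 response.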
---
### 4. Scraping Status
**Endpoint**: `GET /status`
**Description**: Check current scraping status
**Response**:
```json
{
  "status": "idle",
  "current_job": null,
  "completed_jobs": 5,
  "failed_jobs": 1,
  "average_duration": 42.5
}
```
---
### 5. Scraping History
**Endpoint**: `GET /history`
**Description**: View previous scraping operations
**Response**:
```json
{
  "history": [
    {
      "id": 1,
      "city": "cairo",
      "hotels_count": 47,
      "timestamp": "2023-12-31T10:30:00Z",
      "status": "completed",
      "file": "cairo_hotels_1672503492.json"
    }
  ]
}
```
---
## 💻 Usage Guide
### Web Interface
1. **Start Application**:
   ```bash
   python app.py
   ```
2. **Open Browser**:
   Navigate to `http://localhost:5000`
3. **Select City**:
   Choose from dropdown menu
4. **Configure Scraping**:
   - Number of pages
   - Output format
   - Maximum results
5. **Start Scraping**:
   Click "Scrape" button
6. **Download Results**:
   Click download link when complete
### Programmatic Usage
**Python Example**:
```python
import json

import requests

# Get available cities
response = requests.get('http://localhost:5000/codes')
response.raise_for_status()
cities = response.json()['cities']

# Scrape hotels
scrape_url = 'http://localhost:5000/scrape'
params = {
    'city': 'cairo',
    'city_code': cities['cairo'],
    'pages': 2,
    'format': 'json'
}
response = requests.get(scrape_url, params=params)
result = response.json()

# Download data and save it locally
if result['status'] == 'success':
    file_url = result['download_link']
    data_response = requests.get(file_url)
    with open('hotels.json', 'w', encoding='utf-8') as f:
        json.dump(data_response.json(), f, indent=2)
```
---
## ⚙️ Configuration
### Browser Configuration
Edit `config/selectors.json` to update the CSS selectors when Booking.com changes its page layout:
```json
{
  "hotel_name": ".hotel-name",
  "price": ".price-tag",
  "rating": ".hotel-rating",
  "address": ".hotel-address",
  "amenities": ".amenity-list"
}
```
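A selector file that is missing a key tends to fail late, in the middle of a scrape. The sketch below validates the file contents up front. `parse_selectors` and `REQUIRED_SELECTORS` are hypothetical names, and the required keys are assumed to be the five shown above.

```python
import json

REQUIRED_SELECTORS = {"hotel_name", "price", "rating", "address", "amenities"}


def parse_selectors(raw_json: str) -> dict:
    """Parse selector JSON and fail early if a required key is missing."""
    selectors = json.loads(raw_json)
    missing = REQUIRED_SELECTORS - set(selectors)
    if missing:
        raise KeyError(f"missing selectors: {sorted(missing)}")
    return selectors
```

The function takes the file contents as a string, so it could be called as `parse_selectors(Path("config/selectors.json").read_text())` with `pathlib.Path` imported.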
### User Agents
Add browser user agents to `config/user_agents.txt`:
```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...
```
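One common use for such a list is to pick a different user agent per browser session. The sketch below assumes one user agent per line; skipping blank lines and `#` comments is an added convenience, not documented behaviour. `parse_user_agents` and `pick_user_agent` are hypothetical helpers.

```python
import random
from typing import List, Optional


def parse_user_agents(text: str) -> List[str]:
    """Return non-empty, non-comment lines from a user_agents.txt body."""
    agents = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            agents.append(line)
    return agents


def pick_user_agent(agents: List[str],
                    rng: Optional[random.Random] = None) -> str:
    """Choose one user agent at random; pass a seeded rng for repeatable tests."""
    if not agents:
        raise ValueError("no user agents configured")
    return (rng or random).choice(agents)
```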
---
## 📊 Database Schema
### Hotel Object
```json
{
  "hotel_id": "12345",
  "name": "Hotel Cairo",
  "address": "123 Nile Street, Cairo, Egypt",
  "city": "Cairo",
  "coordinates": {
    "latitude": 30.0444,
    "longitude": 31.2357
  },
  "phone": "+20123456789",
  "website": "https://hotelcairo.example.com",
  "rating": 4.5,
  "review_count": 234,
  "rooms_count": 50,
  "check_in_time": "14:00",
  "check_out_time": "12:00",
  "amenities": [
    "Free WiFi",
    "Pool",
    "Fitness Center"
  ],
  "images": [
    "https://example.com/image1.jpg",
    "https://example.com/image2.jpg"
  ],
  "rooms": [
    {
      "type": "Single Room",
      "price": 45.00,
      "capacity": 1,
      "available": true
    }
  ]
}
```
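Once a hotel object has been loaded with `json.load`, it is an ordinary Python dictionary. The sketch below shows two small queries against the fields above. Both helper names are hypothetical, and the code assumes every room carries a numeric `price`.

```python
from typing import Optional


def cheapest_available_room(hotel: dict) -> Optional[dict]:
    """Return the lowest-priced room marked available, or None."""
    rooms = [r for r in hotel.get("rooms", []) if r.get("available")]
    return min(rooms, key=lambda r: r["price"]) if rooms else None


def has_amenity(hotel: dict, name: str) -> bool:
    """Case-insensitive amenity lookup."""
    wanted = name.strip().lower()
    return any(a.lower() == wanted for a in hotel.get("amenities", []))
```

Helpers like these cover the price comparison and amenity analysis use cases without requiring a database.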
---
## 🐛 Troubleshooting
### Issue: Geckodriver Not Found
**Solution**:
```bash
# Add the directory that contains the geckodriver binary to PATH
export PATH="$PATH:/path/to/geckodriver-directory"
```
### Issue: Timeout During Scraping
**Solution**: Increase timeout in `.env`:
```ini
TIMEOUT=60
```
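`TIMEOUT` sits next to `RETRY_ATTEMPTS` in the same file. How the scraper combines them is not documented here, so the sketch below shows only one common pattern: retrying a timed-out call with exponential backoff. `with_retries` is a hypothetical helper. It catches Python's built-in `TimeoutError`; a Selenium-based scraper would more likely catch `selenium.common.exceptions.TimeoutException`.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or attempts run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between tries.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last timeout
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists so tests can replace the real delay with a recording function.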
### Issue: CSS Selectors Not Working
**Solution**:
1. Update `config/selectors.json`
2. Run tests: `python -m pytest tests/test_selectors.py`
### Issue: Browser Memory Issues
**Solution**: Enable headless mode in `.env`:
```ini
HEADLESS_BROWSER=True
```
---
## 📈 Performance Tips
1. **Parallel Scraping**: Use threading for multiple cities
2. **Caching**: Cache city codes and selectors
3. **Rate Limiting**: Add delays between requests to avoid overloading Booking.com's servers
4. **Data Compression**: Compress large JSON files
5. **Database**: Use SQLite/PostgreSQL for large datasets
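
Tips 1 and 3 can be combined. The sketch below runs one task per city on a small thread pool and delays each task before it starts. `scrape_cities` and the `scrape_one` callback are hypothetical; in a real run, `scrape_one` would wrap the scraper, and each worker would need its own browser instance.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def scrape_cities(cities, scrape_one, max_workers=2, delay_seconds=1.0):
    """Run scrape_one(city, code) for each city on a small thread pool.

    Each task sleeps delay_seconds first, a crude form of rate limiting.
    Returns {city: result or exception}.
    """
    def task(item):
        city, code = item
        time.sleep(delay_seconds)
        try:
            return city, scrape_one(city, code)
        except Exception as exc:  # keep one failure from aborting the batch
            return city, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(task, cities.items()))
```

A low `max_workers` value keeps the request rate modest even when many cities are queued.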
---
## ⚖️ Legal Notice
**Important**: This scraper is designed for **educational purposes** only.
- **Respect robots.txt**: Check Booking.com's robots.txt
- **Rate Limiting**: Use reasonable request delays
- **Terms of Service**: Comply with Booking.com's ToS
- **Ethical Use**: Do not resell or commercially use data
- **Legal Compliance**: Check local laws regarding web scraping
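
The robots.txt check can be automated with Python's standard `urllib.robotparser`. The sketch below evaluates a URL against rules supplied as text. `is_allowed` is a hypothetical helper, and any rules passed to it in examples are illustrative, not Booking.com's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the rules `User-agent: *` and `Disallow: /private/`, a URL under `/private/` is rejected and other paths are allowed.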
---
## 🔄 Roadmap
- [ ] Rotating proxy support
- [ ] CAPTCHA handling
- [ ] Browser pool management
- [ ] PostgreSQL integration
- [ ] Scheduled scraping
- [ ] REST API authentication
- [ ] Email notifications
- [ ] Data validation pipeline
---
## 📝 Contributing
1. Fork repository
2. Create feature branch (`git checkout -b feature/enhancement`)
3. Commit changes (`git commit -m 'Add enhancement'`)
4. Push to branch (`git push origin feature/enhancement`)
5. Open Pull Request
---
## 📄 License
This project is licensed under the **MIT License** - see [LICENSE](LICENSE) for details.
---
## 🙏 Acknowledgments
- [Selenium Documentation](https://selenium.dev/)
- [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Flask Documentation](https://flask.palletsprojects.com/)
- Booking.com for API inspiration
---
## 📞 Support
For issues, questions, or suggestions:
- Open an issue on GitHub
- Email: ahmedhesham122000@gmail.com
- Check [docs/](docs/) for detailed guides
---
**Built with ❤️ for data enthusiasts**