https://github.com/ahmed122000/hotels_scraper

A Python-based web scraper and Flask API for extracting hotel data from Booking.com. Features include detailed room information, amenities, and JSON export functionality. Perfect for travel data analysis and exploration!

# 🏨 Hotels Scraper - Web Scraping & API Service

[![Python](https://img.shields.io/badge/Python-3.8+-blue?style=flat-square&logo=python)](https://www.python.org/)
[![Flask](https://img.shields.io/badge/Flask-2.0+-black?style=flat-square&logo=flask)](https://flask.palletsprojects.com/)
[![Selenium](https://img.shields.io/badge/Selenium-4.0+-green?style=flat-square&logo=selenium)](https://selenium.dev/)
[![BeautifulSoup](https://img.shields.io/badge/BeautifulSoup-4.0+-yellow?style=flat-square&logo=python)](https://www.crummy.com/software/BeautifulSoup/)
[![License](https://img.shields.io/badge/License-MIT-black?style=flat-square)](LICENSE)

> A powerful Python web scraping solution for extracting comprehensive hotel data from Booking.com with a RESTful Flask API for seamless integration.

---

## 📑 Table of Contents

- [Overview](#-overview)
- [Features](#-features)
- [Tech Stack](#-tech-stack)
- [Project Structure](#-project-structure)
- [Installation](#-installation)
- [API Documentation](#-api-documentation)
- [Usage Guide](#-usage-guide)
- [Configuration](#-configuration)
- [Database Schema](#-database-schema)
- [Troubleshooting](#-troubleshooting)
- [Legal Notice](#-legal-notice)

---

## 📊 Overview

This project provides a complete web scraping and API service for hotel data extraction. It combines:
- **Selenium WebDriver** for dynamic page rendering
- **BeautifulSoup** for HTML parsing
- **Flask** for REST API endpoints
- **JSON** for data persistence

**Use Cases**:
- Travel price comparison
- Market research
- Amenity analysis
- Availability tracking

---

## ✨ Features

### 🕷️ Web Scraper Features

| Feature | Description |
|---------|-------------|
| **Dynamic Rendering** | JavaScript-enabled browsing with Selenium |
| **Hotel Information** | Scrapes address, images, ratings, reviews |
| **Room Details** | Room types, prices, availability, capacity |
| **Amenities** | Comprehensive list of hotel amenities |
| **Images** | Downloads and stores hotel photos |
| **Pagination** | Automatically handles multi-page results |
| **Error Recovery** | Handles timeouts and connection failures |
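
The scraper's actual recovery logic is not shown in this README. As a rough sketch of the idea behind the **Error Recovery** row, a retry wrapper with exponential backoff might look like the following. `with_retries`, its parameters, and the choice of exception types are assumptions made for illustration, not the project's code.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
) -> T:
    """Call operation(), retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait 1x, 2x, 4x ... the base delay before the next try
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

Here `attempts` plays the role that `RETRY_ATTEMPTS` plays in `.env`. Any callable that loads a page could be passed as `operation`.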

### 🌐 API Endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/codes` | GET | List available city codes |
| `/scrape` | GET | Start scraping process |
| `/download/<file_name>` | GET | Download a saved data file |
| `/status` | GET | Check scraper status |
| `/history` | GET | View scraping history |

### 📊 Data Extraction

**Hotel Information**:
- Hotel name and official rating
- Address and GPS coordinates
- Phone number and website
- Check-in/check-out times
- Number of rooms and floors

**Room Details**:
- Room type and size
- Price per night
- Occupancy capacity
- Available dates
- Special offers

**Amenities**:
- WiFi availability
- Parking options
- Pool facilities
- Fitness center
- Pet policies
- Accessibility features

---

## 🛠️ Tech Stack

| Component | Technology |
|-----------|-----------|
| **Language** | Python 3.8+ |
| **Web Framework** | Flask 2.0+ |
| **Web Driver** | Selenium 4.0+ |
| **HTML Parser** | BeautifulSoup 4.0+ |
| **Browser** | Firefox (Geckodriver) |
| **Data Format** | JSON |
| **Logging** | Python logging module |

---

## 📂 Project Structure

```plaintext
Hotels_Scraper/
├── app.py                     # Flask application entry point
├── booking_hotels.py          # Scraper implementation
├── requirements.txt           # Python dependencies
├── .env.example               # Environment configuration template
│
├── config/
│   ├── cities.json            # City codes and configuration
│   ├── selectors.json         # CSS selectors for scraping
│   └── user_agents.txt        # Browser user agents
│
├── data/
│   ├── raw/                   # Raw scraped data
│   ├── processed/             # Cleaned data
│   └── hotels_data_*.json     # Output JSON files
│
├── logs/
│   ├── scraper.log            # Scraping operations log
│   └── api.log                # API request log
│
├── templates/
│   ├── index.html             # Web interface
│   ├── results.html           # Results page
│   └── error.html             # Error page
│
├── static/
│   ├── css/
│   │   └── style.css
│   ├── js/
│   │   └── app.js
│   └── images/
│       └── logo.png
│
├── tests/
│   ├── test_scraper.py
│   ├── test_api.py
│   └── test_selectors.py
│
├── docs/
│   ├── API.md                 # API documentation
│   ├── SCRAPING.md            # Scraping guide
│   └── EXAMPLES.md            # Usage examples
│
└── README.md                  # This file
```

---

## 🚀 Installation

### Prerequisites

- **Python 3.8+**
- **Firefox browser**
- **Geckodriver** (Firefox WebDriver)
- **pip** (Python package manager)

### Step 1: Install System Dependencies

**Linux (Ubuntu/Debian)**:
```bash
sudo apt-get update
sudo apt-get install python3 python3-pip firefox
wget https://github.com/mozilla/geckodriver/releases/download/v0.33.0/geckodriver-v0.33.0-linux64.tar.gz
tar -xvf geckodriver-v0.33.0-linux64.tar.gz
sudo mv geckodriver /usr/local/bin/
```

**Mac**:
```bash
brew install python3 firefox
brew install geckodriver
```

**Windows**:
1. Download Firefox from https://www.mozilla.org/
2. Download Geckodriver from https://github.com/mozilla/geckodriver/releases
3. Add Geckodriver to PATH

### Step 2: Clone Repository

```bash
git clone https://github.com/Ahmed122000/Hotels_Scraper.git
cd Hotels_Scraper
```

### Step 3: Create Virtual Environment

```bash
python3 -m venv venv

# Linux/macOS
source venv/bin/activate

# Windows (Command Prompt): run this line instead of the one above
venv\Scripts\activate
```

### Step 4: Install Dependencies

```bash
pip install -r requirements.txt
```

**requirements.txt**:
```
beautifulsoup4==4.11.1
selenium==4.10.0
requests==2.31.0
flask==2.3.0
python-dotenv==1.0.0
```

### Step 5: Configuration

Create `.env` file from template:
```bash
cp .env.example .env
```

Edit `.env`:
```ini
# Flask configuration
FLASK_ENV=development
DEBUG=True
SECRET_KEY=your-secret-key

# Scraping configuration
HEADLESS_BROWSER=True
TIMEOUT=30
RETRY_ATTEMPTS=3

# Data paths
DATA_DIR=./data
LOG_DIR=./logs
```

### Step 6: Run Application

```bash
python app.py
```

Application will be available at: `http://localhost:5000`

---

## 📡 API Documentation

### 1. Get City Codes

**Endpoint**: `GET /codes`

**Description**: Retrieve available city codes for scraping

**Response**:
```json
{
  "status": "success",
  "cities": {
    "cairo": "290692",
    "alexandria": "290263",
    "hurghada": "290029",
    "sharm_el_sheikh": "290039",
    "giza": "290693"
  }
}
```

**Example**:
```bash
curl -X GET http://localhost:5000/codes
```

---

### 2. Start Scraping

**Endpoint**: `GET /scrape`

**Query Parameters**:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `city` | string | Yes | City name (e.g., "cairo") |
| `city_code` | string | Yes | City code from `/codes` |
| `pages` | integer | No | Number of pages to scrape (default: 1) |
| `format` | string | No | Output format: "json" or "csv" (default: "json") |
| `max_results` | integer | No | Maximum hotels to scrape (default: 50) |

**Response**:
```json
{
  "status": "success",
  "message": "Scraping for cairo completed. Data saved",
  "data_file": "cairo_hotels_1672503492.json",
  "download_link": "http://localhost:5000/download/cairo_hotels_1672503492.json",
  "metadata": {
    "city": "cairo",
    "hotels_count": 47,
    "pages_scraped": 1,
    "duration_seconds": 45.23,
    "timestamp": "2023-12-31T10:30:00Z"
  }
}
```

**Example**:
```bash
# Scrape 2 pages of Cairo hotels
curl -X GET "http://localhost:5000/scrape?city=cairo&city_code=290692&pages=2&format=json"
```
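
The parameter table above implies some validation before a scrape starts. The handler itself is not shown in this README, so the following is only a sketch of how the documented required fields and defaults could be enforced. `parse_scrape_params` is a hypothetical helper, and the rule that numeric values must be positive is an assumption.

```python
from typing import Any, Dict, Mapping

ALLOWED_FORMATS = {"json", "csv"}


def parse_scrape_params(args: Mapping[str, str]) -> Dict[str, Any]:
    """Validate raw query-string values and apply the documented defaults."""
    missing = [name for name in ("city", "city_code") if not args.get(name)]
    if missing:
        raise ValueError(f"missing required parameter(s): {', '.join(missing)}")

    try:
        pages = int(args.get("pages", 1))
        max_results = int(args.get("max_results", 50))
    except ValueError:
        raise ValueError("pages and max_results must be integers") from None
    if pages < 1 or max_results < 1:
        raise ValueError("pages and max_results must be positive")

    output_format = args.get("format", "json").lower()
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {output_format!r}")

    return {
        "city": args["city"],
        "city_code": args["city_code"],
        "pages": pages,
        "format": output_format,
        "max_results": max_results,
    }
```

In a Flask view, `request.args` could be passed directly as `args`, with a `ValueError` translated into a 400 response.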

---

### 3. Download Data

**Endpoint**: `GET /download/<file_name>`

**Description**: Download previously saved JSON/CSV files

**Parameters**:

| Parameter | Type | Description |
|-----------|------|-------------|
| `file_name` | string | Name of file in data directory |

**Response**: File download or error

**Example**:
```bash
curl -X GET http://localhost:5000/download/cairo_hotels_1672503492.json \
-o hotels_data.json
```
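
Because `file_name` comes from the URL, a server would normally confirm that it points to a file inside the data directory before sending it. The endpoint's implementation is not shown in this README; the sketch below illustrates one possible check using only the standard library. `resolve_download_path` and the allowed suffixes are assumptions.

```python
from pathlib import Path

ALLOWED_SUFFIXES = {".json", ".csv"}


def resolve_download_path(data_dir: str, file_name: str) -> Path:
    """Return the absolute path for file_name, rejecting anything outside data_dir."""
    base = Path(data_dir).resolve()
    candidate = (base / file_name).resolve()
    # After resolving "..", the file must sit directly inside the data directory
    if candidate.parent != base:
        raise ValueError("file_name must refer to a file inside the data directory")
    if candidate.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {candidate.suffix!r}")
    return candidate
```

Flask's `send_from_directory` performs a comparable safety check and is usually the simpler choice inside a view function.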

---

### 4. Scraping Status

**Endpoint**: `GET /status`

**Description**: Check current scraping status

**Response**:
```json
{
  "status": "idle",
  "current_job": null,
  "completed_jobs": 5,
  "failed_jobs": 1,
  "average_duration": 42.5
}
```
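
A client can poll this endpoint until the scraper is free again. The sketch below assumes that `"idle"` is the value reported when no job is running, as in the sample response; the other status values are not documented here. `wait_until_idle` is a hypothetical helper that takes the fetch function as an argument so it can be tested without a running server.

```python
import time
from typing import Any, Callable, Dict


def wait_until_idle(
    fetch_status: Callable[[], Dict[str, Any]],
    interval: float = 2.0,
    max_checks: int = 30,
) -> Dict[str, Any]:
    """Poll fetch_status() until it reports "idle" or max_checks is exhausted."""
    for attempt in range(max_checks):
        status = fetch_status()
        if status.get("status") == "idle":
            return status
        if attempt < max_checks - 1:
            time.sleep(interval)  # pause between polls
    raise TimeoutError(f"scraper still busy after {max_checks} checks")
```

With `requests` installed, `fetch_status` could be `lambda: requests.get("http://localhost:5000/status", timeout=10).json()`.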

---

### 5. Scraping History

**Endpoint**: `GET /history`

**Description**: View previous scraping operations

**Response**:
```json
{
  "history": [
    {
      "id": 1,
      "city": "cairo",
      "hotels_count": 47,
      "timestamp": "2023-12-31T10:30:00Z",
      "status": "completed",
      "file": "cairo_hotels_1672503492.json"
    }
  ]
}
```
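
As a small illustration of consuming this response, the sketch below totals the hotels collected per city across completed runs. `hotels_per_city` is a hypothetical helper, and it assumes every entry has the fields shown in the sample above.

```python
from collections import defaultdict
from typing import Any, Dict


def hotels_per_city(history_response: Dict[str, Any]) -> Dict[str, int]:
    """Sum hotels_count per city, counting only runs marked "completed"."""
    totals: Dict[str, int] = defaultdict(int)
    for entry in history_response.get("history", []):
        if entry.get("status") == "completed":
            totals[entry["city"]] += entry.get("hotels_count", 0)
    return dict(totals)
```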

---

## 💻 Usage Guide

### Web Interface

1. **Start Application**:
```bash
python app.py
```

2. **Open Browser**:
Navigate to `http://localhost:5000`

3. **Select City**:
Choose from dropdown menu

4. **Configure Scraping**:
- Number of pages
- Output format
- Maximum results

5. **Start Scraping**:
Click "Scrape" button

6. **Download Results**:
Click download link when complete

### Programmatic Usage

**Python Example**:
```python
import json

import requests

BASE_URL = 'http://localhost:5000'

# Get available cities
response = requests.get(f'{BASE_URL}/codes', timeout=30)
response.raise_for_status()
cities = response.json()['cities']

# Scrape hotels. The sample response suggests this call returns only after
# scraping has finished, so allow a generous timeout.
params = {
    'city': 'cairo',
    'city_code': cities['cairo'],
    'pages': 2,
    'format': 'json',
}
response = requests.get(f'{BASE_URL}/scrape', params=params, timeout=600)
response.raise_for_status()
result = response.json()

# Download data
if result['status'] == 'success':
    data_response = requests.get(result['download_link'], timeout=60)
    data_response.raise_for_status()
    with open('hotels.json', 'w', encoding='utf-8') as f:
        json.dump(data_response.json(), f, indent=2)
```

---

## ⚙️ Configuration

### Browser Configuration

Edit `config/selectors.json` for different Booking.com layouts:

```json
{
  "hotel_name": ".hotel-name",
  "price": ".price-tag",
  "rating": ".hotel-rating",
  "address": ".hotel-address",
  "amenities": ".amenity-list"
}
```
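
How the scraper loads this file is not shown in this README. One possible approach, sketched below, is to start from built-in defaults and let `config/selectors.json` override individual keys, rejecting unknown keys so that typos are caught early. `load_selectors` and `DEFAULT_SELECTORS` are hypothetical names; the default values simply repeat the sample above.

```python
import json
from pathlib import Path
from typing import Dict

# Defaults copied from the sample selectors.json shown above
DEFAULT_SELECTORS: Dict[str, str] = {
    "hotel_name": ".hotel-name",
    "price": ".price-tag",
    "rating": ".hotel-rating",
    "address": ".hotel-address",
    "amenities": ".amenity-list",
}


def load_selectors(path: str) -> Dict[str, str]:
    """Merge overrides from a JSON file into the default selectors."""
    selectors = dict(DEFAULT_SELECTORS)
    file = Path(path)
    if file.exists():
        overrides = json.loads(file.read_text(encoding="utf-8"))
        unknown = set(overrides) - set(DEFAULT_SELECTORS)
        if unknown:
            raise ValueError(f"unknown selector key(s): {sorted(unknown)}")
        selectors.update(overrides)
    return selectors
```

The resulting values could then be passed to BeautifulSoup's `select` or `select_one`, which accept CSS selectors.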

### User Agents

Add browser user agents to `config/user_agents.txt`:
```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...
```
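
A minimal sketch of how such a file could be consumed, assuming one user agent per line and that a random entry is chosen for each browser session. `load_user_agents` and `pick_user_agent` are hypothetical helpers, not the project's own functions.

```python
import random
from pathlib import Path
from typing import List, Optional


def load_user_agents(path: str) -> List[str]:
    """Read one user agent per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def pick_user_agent(agents: List[str], rng: Optional[random.Random] = None) -> str:
    """Return a random user agent; pass a seeded Random for repeatable choices."""
    if not agents:
        raise ValueError("no user agents available")
    return (rng or random).choice(agents)
```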

---

## 📊 Database Schema

### Hotel Object

```json
{
  "hotel_id": "12345",
  "name": "Hotel Cairo",
  "address": "123 Nile Street, Cairo, Egypt",
  "city": "Cairo",
  "coordinates": {
    "latitude": 30.0444,
    "longitude": 31.2357
  },
  "phone": "+20123456789",
  "website": "https://hotelcairo.example.com",
  "rating": 4.5,
  "review_count": 234,
  "rooms_count": 50,
  "check_in_time": "14:00",
  "check_out_time": "12:00",
  "amenities": [
    "Free WiFi",
    "Pool",
    "Fitness Center"
  ],
  "images": [
    "https://example.com/image1.jpg",
    "https://example.com/image2.jpg"
  ],
  "rooms": [
    {
      "type": "Single Room",
      "price": 45.00,
      "capacity": 1,
      "available": true
    }
  ]
}
```
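
To show how a record with this shape might be used for price comparison, the sketch below picks the cheapest room that is currently available. `cheapest_available_room` is a hypothetical helper and assumes every room carries the `price` and `available` fields shown above.

```python
from typing import Any, Dict, Optional


def cheapest_available_room(hotel: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the lowest-priced available room, or None if none is available."""
    rooms = [room for room in hotel.get("rooms", []) if room.get("available")]
    return min(rooms, key=lambda room: room["price"], default=None)
```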

---

## 🐛 Troubleshooting

### Issue: Geckodriver Not Found
**Solution**:
```bash
# Add the directory that contains the geckodriver binary to PATH
export PATH="$PATH:/path/to/geckodriver-directory"
```
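
To confirm that the driver is discoverable before launching the scraper, a quick check with the standard library can help. `find_geckodriver` is a hypothetical helper; it only wraps `shutil.which`, which searches `PATH` by default.

```python
import shutil
from typing import Optional


def find_geckodriver(path: Optional[str] = None) -> Optional[str]:
    """Return the full path to geckodriver, or None if it is not on the search path."""
    return shutil.which("geckodriver", path=path)


if __name__ == "__main__":
    location = find_geckodriver()
    print(location or "geckodriver not found on PATH")
```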

### Issue: Timeout During Scraping
**Solution**: Increase timeout in `.env`:
```ini
TIMEOUT=60
```

### Issue: CSS Selectors Not Working
**Solution**:
1. Update `config/selectors.json`
2. Run tests: `python -m pytest tests/test_selectors.py`

### Issue: Browser Memory Issues
**Solution**: Enable headless mode in `.env`:
```ini
HEADLESS_BROWSER=True
```

---

## 📈 Performance Tips

1. **Parallel Scraping**: Use threading for multiple cities
2. **Caching**: Cache city codes and selectors
3. **Rate Limiting**: Add delays between requests so the target site is not overloaded
4. **Data Compression**: Compress large JSON files
5. **Database**: Use SQLite/PostgreSQL for large datasets
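
As a sketch of the first tip, several cities can be scraped concurrently with a small thread pool. The function that scrapes a single city is passed in, so the example makes no claim about the project's internals. `scrape_cities` and `scrape_one` are hypothetical names, and whether the Flask server handles concurrent `/scrape` requests is not stated in this README.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict


def scrape_cities(
    scrape_one: Callable[[str, str], Any],
    cities: Dict[str, str],
    max_workers: int = 2,
) -> Dict[str, Any]:
    """Run scrape_one(city, code) for every city using a small thread pool."""
    results: Dict[str, Any] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(scrape_one, city, code): city
            for city, code in cities.items()
        }
        for future in as_completed(futures):
            city = futures[future]
            try:
                results[city] = future.result()
            except Exception as exc:  # one failing city should not stop the rest
                results[city] = {"status": "error", "message": str(exc)}
    return results
```

Keeping `max_workers` low is consistent with the rate-limiting tip, since each worker drives its own stream of requests.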

---

## ⚖️ Legal Notice

**Important**: This scraper is designed for **educational purposes** only.

- **Respect robots.txt**: Check Booking.com's robots.txt
- **Rate Limiting**: Use reasonable request delays
- **Terms of Service**: Comply with Booking.com's ToS
- **Ethical Use**: Do not resell or commercially use data
- **Legal Compliance**: Check local laws regarding web scraping

---

## 🔄 Roadmap

- [ ] Rotating proxy support
- [ ] CAPTCHA handling
- [ ] Browser pool management
- [ ] PostgreSQL integration
- [ ] Scheduled scraping
- [ ] REST API authentication
- [ ] Email notifications
- [ ] Data validation pipeline

---

## 📝 Contributing

1. Fork repository
2. Create feature branch (`git checkout -b feature/enhancement`)
3. Commit changes (`git commit -m 'Add enhancement'`)
4. Push to branch (`git push origin feature/enhancement`)
5. Open Pull Request

---

## 📄 License

This project is licensed under the **MIT License** - see [LICENSE](LICENSE) for details.

---

## 🙏 Acknowledgments

- [Selenium Documentation](https://selenium.dev/)
- [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Flask Documentation](https://flask.palletsprojects.com/)
- Booking.com for API inspiration

---

## 📞 Support

For issues, questions, or suggestions:
- Open an issue on GitHub
- Email: ahmedhesham122000@gmail.com
- Check [docs/](docs/) for detailed guides

---

**Built with ❤️ for data enthusiasts**