https://github.com/ahmed122000/hotels_scraper

A Python-based web scraper and Flask API for extracting hotel data from Booking.com. Features include detailed room information, amenities, and JSON export functionality. Perfect for travel data analysis and exploration!
Host: GitHub
URL: https://github.com/ahmed122000/hotels_scraper
Owner: Ahmed122000
License: mit
Created: 2024-01-30T18:30:45.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2025-01-30T03:22:03.000Z (over 1 year ago)
Last Synced: 2025-02-02T17:53:48.761Z (over 1 year ago)
Topics: beautifulsoup4, flask, json, pytohn3, rest-api, scraping, scraping-websites
Language: Python
Homepage:
Size: 2.74 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# 🏨 Hotels Scraper - Web Scraping & API Service

[![Python](https://img.shields.io/badge/Python-3.8+-blue?style=flat-square&logo=python)](https://www.python.org/)

[![Flask](https://img.shields.io/badge/Flask-2.0+-black?style=flat-square&logo=flask)](https://flask.palletsprojects.com/)

[![Selenium](https://img.shields.io/badge/Selenium-4.0+-green?style=flat-square&logo=selenium)](https://selenium.dev/)

[![BeautifulSoup](https://img.shields.io/badge/BeautifulSoup-4.0+-yellow?style=flat-square&logo=python)](https://www.crummy.com/software/BeautifulSoup/)

[![License](https://img.shields.io/badge/License-MIT-black?style=flat-square)](LICENSE)

> A powerful Python web scraping solution for extracting comprehensive hotel data from Booking.com with a RESTful Flask API for seamless integration.

---

## 📑 Table of Contents

- [Overview](#-overview)

- [Features](#-features)

- [Tech Stack](#-tech-stack)

- [Project Structure](#-project-structure)

- [Installation](#-installation)

- [API Documentation](#-api-documentation)

- [Usage Guide](#-usage-guide)

- [Configuration](#-configuration)

- [Database Schema](#-database-schema)

- [Troubleshooting](#-troubleshooting)

- [Legal Notice](#-legal-notice)

---

## 📊 Overview

This project provides a complete web scraping and API service for hotel data extraction. It combines:

- **Selenium WebDriver** for dynamic page rendering

- **BeautifulSoup** for HTML parsing

- **Flask** for REST API endpoints

- **JSON** for data persistence

**Use Cases**:

- Travel price comparison

- Market research

- Amenity analysis

- Availability tracking

---

## ✨ Features

### 🕷️ Web Scraper Features

| Feature | Description |
|---------|-------------|
| **Dynamic Rendering** | JavaScript-enabled browsing with Selenium |
| **Hotel Information** | Scrapes address, images, ratings, reviews |
| **Room Details** | Room types, prices, availability, capacity |
| **Amenities** | Comprehensive list of hotel amenities |
| **Images** | Downloads and stores hotel photos |
| **Pagination** | Automatically handles multi-page results |
| **Error Recovery** | Handles timeouts and connection failures |
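
The "Error Recovery" row does not say how failures are retried. One common approach is retrying with exponential backoff, sketched below using only the standard library. `with_retries` and `flaky_fetch` are hypothetical names for illustration, not functions from this repository, and Selenium raises its own exception types that you would add to the `except` clause.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on TimeoutError/ConnectionError, retry with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except (TimeoutError, ConnectionError):
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...


# Hypothetical flaky operation: fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("page load timed out")
    return "<html>ok</html>"


# Record the delays instead of sleeping so the demo runs instantly.
delays = []
result = with_retries(flaky_fetch, attempts=3, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper easy to test; in real use the default `time.sleep` applies.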

### 🌐 API Endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/codes` | GET | List available city codes |
| `/scrape` | GET | Start scraping process |
| `/download/<file_name>` | GET | Download JSON files |
| `/status` | GET | Check scraper status |
| `/history` | GET | View scraping history |

### 📊 Data Extraction

**Hotel Information**:

- Hotel name and official rating

- Address and GPS coordinates

- Phone number and website

- Check-in/check-out times

- Number of rooms and floors

**Room Details**:

- Room type and size

- Price per night

- Occupancy capacity

- Available dates

- Special offers

**Amenities**:

- WiFi availability

- Parking options

- Pool facilities

- Fitness center

- Pet policies

- Accessibility features

---

## 🛠️ Tech Stack

| Component | Technology |
|-----------|-----------|
| **Language** | Python 3.8+ |
| **Web Framework** | Flask 2.0+ |
| **Web Driver** | Selenium 4.0+ |
| **HTML Parser** | BeautifulSoup 4.0+ |
| **Browser** | Firefox (Geckodriver) |
| **Data Format** | JSON |
| **Logging** | Python logging module |

---

## 📂 Project Structure

```plaintext
Hotels_Scraper/
├── app.py                        # Flask application entry point
├── booking_hotels.py             # Scraper implementation
├── requirements.txt              # Python dependencies
├── .env.example                  # Environment configuration template
│
├── config/
│   ├── cities.json              # City codes and configuration
│   ├── selectors.json           # CSS selectors for scraping
│   └── user_agents.txt          # Browser user agents
│
├── data/
│   ├── raw/                     # Raw scraped data
│   ├── processed/               # Cleaned data
│   └── hotels_data_*.json       # Output JSON files
│
├── logs/
│   ├── scraper.log              # Scraping operations log
│   └── api.log                  # API request log
│
├── templates/
│   ├── index.html               # Web interface
│   ├── results.html             # Results page
│   └── error.html               # Error page
│
├── static/
│   ├── css/
│   │   └── style.css
│   ├── js/
│   │   └── app.js
│   └── images/
│       └── logo.png
│
├── tests/
│   ├── test_scraper.py
│   ├── test_api.py
│   └── test_selectors.py
│
├── docs/
│   ├── API.md                   # API documentation
│   ├── SCRAPING.md              # Scraping guide
│   └── EXAMPLES.md              # Usage examples
│
└── README.md                    # This file
```

---

## 🚀 Installation

### Prerequisites

- **Python 3.8+**

- **Firefox browser**

- **Geckodriver** (Firefox WebDriver)

- **pip** (Python package manager)

### Step 1: Install System Dependencies

**Linux (Ubuntu/Debian)**:

```bash

sudo apt-get update

sudo apt-get install python3 python3-pip firefox

wget https://github.com/mozilla/geckodriver/releases/download/v0.33.0/geckodriver-v0.33.0-linux64.tar.gz

tar -xvf geckodriver-v0.33.0-linux64.tar.gz

sudo mv geckodriver /usr/local/bin/

```

**Mac**:

```bash

brew install python3 firefox

brew install geckodriver

```

**Windows**:

1. Download Firefox from https://www.mozilla.org/

2. Download Geckodriver from https://github.com/mozilla/geckodriver/releases

3. Add Geckodriver to PATH

### Step 2: Clone Repository

```bash

git clone https://github.com/Ahmed122000/Hotels_Scraper.git

cd Hotels_Scraper

```

### Step 3: Create Virtual Environment

```bash

python3 -m venv venv

source venv/bin/activate        # Linux/Mac

# or

venv\Scripts\activate            # Windows

```

### Step 4: Install Dependencies

```bash

pip install -r requirements.txt

```

**requirements.txt**:

```

beautifulsoup4==4.11.1

selenium==4.10.0

requests==2.31.0

flask==2.3.0

python-dotenv==1.0.0

```

### Step 5: Configuration

Create a `.env` file from the template:

```bash

cp .env.example .env

```

Edit `.env`:

```ini

# Flask configuration

FLASK_ENV=development

DEBUG=True

SECRET_KEY=your-secret-key

# Scraping configuration

HEADLESS_BROWSER=True

TIMEOUT=30

RETRY_ATTEMPTS=3

# Data paths

DATA_DIR=./data

LOG_DIR=./logs

```
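
The variables above are plain strings, so the application has to convert them to booleans and integers somewhere. A minimal sketch of such a loader follows; `ScraperSettings`, `as_bool`, and `load_settings` are hypothetical names, and whether `app.py` reads its configuration this way is an assumption. With `python-dotenv` installed, calling `load_dotenv()` first would copy the `.env` values into `os.environ`.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ScraperSettings:
    headless: bool
    timeout: int
    retry_attempts: int
    data_dir: str
    log_dir: str


def as_bool(value: str) -> bool:
    """Treat common truthy spellings ("True", "1", "yes", "on") as True."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> ScraperSettings:
    """Read scraper settings, falling back to the defaults shown in .env.example."""
    return ScraperSettings(
        headless=as_bool(env.get("HEADLESS_BROWSER", "True")),
        timeout=int(env.get("TIMEOUT", "30")),
        retry_attempts=int(env.get("RETRY_ATTEMPTS", "3")),
        data_dir=env.get("DATA_DIR", "./data"),
        log_dir=env.get("LOG_DIR", "./logs"),
    )


# Demo with an explicit mapping instead of the real environment.
settings = load_settings({"HEADLESS_BROWSER": "False", "TIMEOUT": "60"})
print(settings)
```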

### Step 6: Run Application

```bash

python app.py

```

The application will be available at `http://localhost:5000`.

---

## 📡 API Documentation

### 1. Get City Codes

**Endpoint**: `GET /codes`

**Description**: Retrieve available city codes for scraping

**Response**:

```json

{

  "status": "success",

  "cities": {

    "cairo": "290692",

    "alexandria": "290263",

    "hurghada": "290029",

    "sharm_el_sheikh": "290039",

    "giza": "290693"

  }

}

```

**Example**:

```bash

curl -X GET http://localhost:5000/codes

```
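
The keys in the `cities` object are lowercase with underscores. If your input is a display name such as "Sharm El Sheikh", a small normalizer makes the lookup reliable. This is a sketch against the sample response above; `city_key` is a hypothetical helper, not part of the API.

```python
def city_key(name: str) -> str:
    """Normalize a display name to the key style used in the /codes response."""
    return "_".join(name.strip().lower().split())


# Sample payload copied from the /codes response above.
codes_response = {
    "status": "success",
    "cities": {
        "cairo": "290692",
        "alexandria": "290263",
        "hurghada": "290029",
        "sharm_el_sheikh": "290039",
        "giza": "290693",
    },
}

cities = codes_response["cities"]
code = cities.get(city_key("Sharm El Sheikh"))  # None if the city is not listed
print(code)  # 290039
```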

---

### 2. Start Scraping

**Endpoint**: `GET /scrape`

**Query Parameters**:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `city` | string | Yes | City name (e.g., "cairo") |
| `city_code` | string | Yes | City code from `/codes` |
| `pages` | integer | No | Number of pages to scrape (default: 1) |
| `format` | string | No | Output format: "json" or "csv" (default: "json") |
| `max_results` | integer | No | Maximum hotels to scrape (default: 50) |

**Response**:

```json

{

  "status": "success",

  "message": "Scraping for cairo completed. Data saved",

  "data_file": "cairo_hotels_1672503492.json",

  "download_link": "http://localhost:5000/download/cairo_hotels_1672503492.json",

  "metadata": {

    "city": "cairo",

    "hotels_count": 47,

    "pages_scraped": 1,

    "duration_seconds": 45.23,

    "timestamp": "2023-12-31T10:30:00Z"

  }

}

```

**Example**:

```bash

# Scrape 2 pages of Cairo hotels

curl -X GET "http://localhost:5000/scrape?city=cairo&city_code=290692&pages=2&format=json"

```
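
When building the request in code, it helps to check the parameters before sending them. The sketch below assembles the same URL as the `curl` example using the parameter names and defaults from the table above; `build_scrape_url` is a hypothetical helper, and the validation rules are assumptions about what the server accepts.

```python
from urllib.parse import urlencode

ALLOWED_FORMATS = {"json", "csv"}


def build_scrape_url(base_url, city, city_code, pages=1, fmt="json", max_results=50):
    """Validate the query parameters and return the full /scrape URL."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}")
    if pages < 1 or max_results < 1:
        raise ValueError("pages and max_results must be positive")
    query = urlencode({
        "city": city,
        "city_code": city_code,
        "pages": pages,
        "format": fmt,
        "max_results": max_results,
    })
    return f"{base_url.rstrip('/')}/scrape?{query}"


url = build_scrape_url("http://localhost:5000", "cairo", "290692", pages=2)
print(url)
```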

---

### 3. Download Data

**Endpoint**: `GET /download/<file_name>`

**Description**: Download previously saved JSON/CSV files

**Parameters**:

| Parameter | Type | Description |
|-----------|------|-------------|
| `file_name` | string | Name of file in data directory |

**Response**: File download or error

**Example**:

```bash

curl -X GET http://localhost:5000/download/cairo_hotels_1672503492.json \

  -o hotels_data.json

```
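
Because `file_name` comes from the URL, a download route should make sure the resolved path stays inside the data directory. How this project validates the name is not documented here, so the following is only a sketch of one way to do it; `resolve_download` is a hypothetical helper. Flask's `send_from_directory` performs a comparable check when serving files.

```python
from pathlib import Path


def resolve_download(data_dir: str, file_name: str) -> Path:
    """Return the absolute path of file_name, rejecting anything outside data_dir."""
    base = Path(data_dir).resolve()
    target = (base / file_name).resolve()
    if base not in target.parents:
        raise ValueError(f"refusing path outside data directory: {file_name}")
    return target


print(resolve_download("data", "cairo_hotels_1672503492.json").name)

try:
    resolve_download("data", "../app.py")
except ValueError as exc:
    print("rejected:", exc)
```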

---

### 4. Scraping Status

**Endpoint**: `GET /status`

**Description**: Check current scraping status

**Response**:

```json

{

  "status": "idle",

  "current_job": null,

  "completed_jobs": 5,

  "failed_jobs": 1,

  "average_duration": 42.5

}

```

---

### 5. Scraping History

**Endpoint**: `GET /history`

**Description**: View previous scraping operations

**Response**:

```json

{

  "history": [

    {

      "id": 1,

      "city": "cairo",

      "hotels_count": 47,

      "timestamp": "2023-12-31T10:30:00Z",

      "status": "completed",

      "file": "cairo_hotels_1672503492.json"

    }

  ]

}

```

---

## 💻 Usage Guide

### Web Interface

1. **Start Application**:

   ```bash

   python app.py

   ```

2. **Open Browser**:

   Navigate to `http://localhost:5000`

3. **Select City**:

   Choose from dropdown menu

4. **Configure Scraping**:

   - Number of pages

   - Output format

   - Maximum results

5. **Start Scraping**:

   Click "Scrape" button

6. **Download Results**:

   Click download link when complete

### Programmatic Usage

**Python Example**:

```python
import json

import requests

BASE_URL = 'http://localhost:5000'

# Get available cities
response = requests.get(f'{BASE_URL}/codes')
response.raise_for_status()
cities = response.json()['cities']

# Scrape hotels
params = {
    'city': 'cairo',
    'city_code': cities['cairo'],
    'pages': 2,
    'format': 'json'
}
response = requests.get(f'{BASE_URL}/scrape', params=params)
response.raise_for_status()
result = response.json()

# Download the saved file and write it locally
if result['status'] == 'success':
    file_url = result['download_link']
    data_response = requests.get(file_url)
    data_response.raise_for_status()
    with open('hotels.json', 'w', encoding='utf-8') as f:
        json.dump(data_response.json(), f, indent=2)
```

---

## ⚙️ Configuration

### Selector Configuration

Edit `config/selectors.json` to update the CSS selectors when Booking.com's page layout changes:

```json

{

  "hotel_name": ".hotel-name",

  "price": ".price-tag",

  "rating": ".hotel-rating",

  "address": ".hotel-address",

  "amenities": ".amenity-list"

}

```
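
A typo or a missing key in this file would only show up once a scrape is running. Checking the file when it is loaded catches that earlier. The sketch below assumes the five keys shown above are all required; `load_selectors` and `REQUIRED_SELECTORS` are hypothetical names.

```python
import json

# Assumption: every key from the sample selectors.json is required.
REQUIRED_SELECTORS = {"hotel_name", "price", "rating", "address", "amenities"}


def load_selectors(raw_json: str) -> dict:
    """Parse selector config and fail early on missing or empty entries."""
    selectors = json.loads(raw_json)
    missing = REQUIRED_SELECTORS - set(selectors)
    if missing:
        raise KeyError(f"missing selectors: {sorted(missing)}")
    empty = [k for k, v in selectors.items() if not isinstance(v, str) or not v.strip()]
    if empty:
        raise ValueError(f"empty selectors: {sorted(empty)}")
    return selectors


sample = """
{
  "hotel_name": ".hotel-name",
  "price": ".price-tag",
  "rating": ".hotel-rating",
  "address": ".hotel-address",
  "amenities": ".amenity-list"
}
"""
selectors = load_selectors(sample)
print(selectors["price"])  # .price-tag
```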

### User Agents

Add browser user agents to `config/user_agents.txt`:

```

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...

```
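
One plausible use of this file is picking a user agent at random for each browser session. The sketch below shows that idea with placeholder strings; `parse_user_agents` and `pick_user_agent` are hypothetical helpers, and support for `#` comment lines is an assumption, not something the project documents.

```python
import random
from typing import List


def parse_user_agents(text: str) -> List[str]:
    """Return one user agent per non-empty line, skipping '#' comment lines."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def pick_user_agent(agents: List[str], rng=random) -> str:
    """Choose a user agent at random; rng can be a seeded random.Random for tests."""
    if not agents:
        raise ValueError("no user agents configured")
    return rng.choice(agents)


# Placeholder content standing in for config/user_agents.txt.
sample = """
# one user agent per line
ExampleBrowser/1.0 (hypothetical placeholder)

ExampleBrowser/2.0 (hypothetical placeholder)
"""
agents = parse_user_agents(sample)
chosen = pick_user_agent(agents, rng=random.Random(0))
print(len(agents), chosen in agents)  # 2 True
```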

---

## 📊 Database Schema

### Hotel Object

```json

{

  "hotel_id": "12345",

  "name": "Hotel Cairo",

  "address": "123 Nile Street, Cairo, Egypt",

  "city": "Cairo",

  "coordinates": {

    "latitude": 30.0444,

    "longitude": 31.2357

  },

  "phone": "+20123456789",

  "website": "https://hotelcairo.example.com",

  "rating": 4.5,

  "review_count": 234,

  "rooms_count": 50,

  "check_in_time": "14:00",

  "check_out_time": "12:00",

  "amenities": [

    "Free WiFi",

    "Pool",

    "Fitness Center"

  ],

  "images": [

    "https://example.com/image1.jpg",

    "https://example.com/image2.jpg"

  ],

  "rooms": [

    {

      "type": "Single Room",

      "price": 45.00,

      "capacity": 1,

      "available": true

    }

  ]

}

```
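
Once a file is downloaded, each hotel is an ordinary dictionary, so basic checks and queries need only the standard library. The sketch below validates a trimmed copy of the object above and finds its cheapest available room. `validate_hotel`, `cheapest_available_room`, the choice of required fields, and the "Double Room" entry are all assumptions added for illustration.

```python
# Assumption: these fields are treated as mandatory for downstream analysis.
REQUIRED_FIELDS = ("hotel_id", "name", "address", "city", "rooms")


def validate_hotel(hotel: dict) -> list:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in hotel]
    coords = hotel.get("coordinates")
    if coords is not None:
        if not -90 <= coords.get("latitude", 0) <= 90:
            problems.append("latitude out of range")
        if not -180 <= coords.get("longitude", 0) <= 180:
            problems.append("longitude out of range")
    return problems


def cheapest_available_room(hotel: dict):
    """Return the lowest-priced room marked available, or None if there is none."""
    rooms = [r for r in hotel.get("rooms", []) if r.get("available")]
    return min(rooms, key=lambda r: r["price"], default=None)


hotel = {
    "hotel_id": "12345",
    "name": "Hotel Cairo",
    "address": "123 Nile Street, Cairo, Egypt",
    "city": "Cairo",
    "coordinates": {"latitude": 30.0444, "longitude": 31.2357},
    "rooms": [
        {"type": "Single Room", "price": 45.00, "capacity": 1, "available": True},
        # Hypothetical second room, added only to make the comparison meaningful.
        {"type": "Double Room", "price": 70.00, "capacity": 2, "available": True},
    ],
}

print(validate_hotel(hotel))                    # []
print(cheapest_available_room(hotel)["type"])   # Single Room
```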

---

## 🐛 Troubleshooting

### Issue: Geckodriver Not Found

**Solution**:

```bash

# Add to PATH or specify in environment

export PATH=$PATH:/path/to/geckodriver

```
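
To confirm from Python whether the driver is visible on `PATH`, `shutil.which` performs the same lookup the shell does. This is a small diagnostic sketch; `find_driver` is a hypothetical helper.

```python
import shutil


def find_driver(name: str = "geckodriver"):
    """Return the full path of the driver executable, or None if it is not on PATH."""
    return shutil.which(name)


path = find_driver()
print(path or "geckodriver is not on PATH")
```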

### Issue: Timeout During Scraping

**Solution**: Increase timeout in `.env`:

```ini

TIMEOUT=60

```

### Issue: CSS Selectors Not Working

**Solution**: 

1. Update `config/selectors.json`

2. Run tests: `python -m pytest tests/test_selectors.py`

### Issue: Browser Memory Issues

**Solution**: Enable headless mode in `.env`:

```ini

HEADLESS_BROWSER=True

```

---

## 📈 Performance Tips

1. **Parallel Scraping**: Use threading for multiple cities

2. **Caching**: Cache city codes and selectors

3. **Rate Limiting**: Add delays between requests to avoid overloading the site

4. **Data Compression**: Compress large JSON files

5. **Database**: Use SQLite/PostgreSQL for large datasets
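
Tips 1 and 3 pull in opposite directions: threads speed things up, while rate limiting slows them down. A shared limiter lets several workers run while keeping request starts spaced apart. The following is a sketch under stated assumptions: `RateLimiter` and `scrape_city` are hypothetical, and the body of `scrape_city` is a placeholder for the real request.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Schedule calls so their start slots are at least min_interval seconds apart."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay:
            time.sleep(delay)


def scrape_city(city, limiter):
    limiter.wait()
    # Placeholder for the real work, e.g. a GET request to /scrape.
    return city, time.monotonic()


limiter = RateLimiter(min_interval=0.2)
cities = ["cairo", "giza", "alexandria"]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=3) as pool:
    # pool.map returns results in the same order as the input list.
    results = list(pool.map(lambda c: scrape_city(c, limiter), cities))

print([city for city, _ in results])
```

A real interval would be much longer than 0.2 seconds; the short value only keeps the demo quick.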

---

## ⚖️ Legal Notice

**Important**: This scraper is designed for **educational purposes** only.

- **Respect robots.txt**: Check Booking.com's robots.txt

- **Rate Limiting**: Use reasonable request delays

- **Terms of Service**: Comply with Booking.com's ToS

- **Ethical Use**: Do not resell or commercially use data

- **Legal Compliance**: Check local laws regarding web scraping
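
The robots.txt check can be automated with the standard library's `urllib.robotparser`. The sketch below parses a made-up rule set so it runs offline; the rules, the `hotels-scraper` agent name, and the `example.com` URLs are hypothetical and are not Booking.com's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Booking.com's real robots.txt.
sample_robots = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(sample_robots.splitlines())

print(parser.can_fetch("hotels-scraper", "https://example.com/hotels/cairo"))   # True
print(parser.can_fetch("hotels-scraper", "https://example.com/private/data"))  # False
```

For a live site, `set_url()` followed by `read()` fetches the real file before calling `can_fetch()`.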

---

## 🔄 Roadmap

- [ ] Rotating proxy support

- [ ] CAPTCHA handling

- [ ] Browser pool management

- [ ] PostgreSQL integration

- [ ] Scheduled scraping

- [ ] REST API authentication

- [ ] Email notifications

- [ ] Data validation pipeline

---

## 📝 Contributing

1. Fork repository

2. Create feature branch (`git checkout -b feature/enhancement`)

3. Commit changes (`git commit -m 'Add enhancement'`)

4. Push to branch (`git push origin feature/enhancement`)

5. Open Pull Request

---

## 📄 License

This project is licensed under the **MIT License** - see [LICENSE](LICENSE) for details.

---

## 🙏 Acknowledgments

- [Selenium Documentation](https://selenium.dev/)

- [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

- [Flask Documentation](https://flask.palletsprojects.com/)

- Booking.com for API inspiration

---

## 📞 Support

For issues, questions, or suggestions:

- Open an issue on GitHub

- Email: ahmedhesham122000@gmail.com

- Check [docs/](docs/) for detailed guides

---

**Built with ❤️ for data enthusiasts**