{"id":24938292,"url":"https://github.com/ahmed122000/hotels_scraper","last_synced_at":"2026-05-09T00:36:19.876Z","repository":{"id":237076539,"uuid":"750487463","full_name":"Ahmed122000/Hotels_Scraper","owner":"Ahmed122000","description":"A Python-based web scraper and Flask API for extracting hotel data from Booking.com. Features include detailed room information, amenities, and JSON export functionality. Perfect for travel data analysis and exploration!","archived":false,"fork":false,"pushed_at":"2025-01-30T03:22:03.000Z","size":2869,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-02T17:53:48.761Z","etag":null,"topics":["beautifulsoup4","flask","json","pytohn3","rest-api","scraping","scraping-websites"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Ahmed122000.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-30T18:30:45.000Z","updated_at":"2025-01-30T03:40:19.000Z","dependencies_parsed_at":"2025-02-02T17:53:49.838Z","dependency_job_id":null,"html_url":"https://github.com/Ahmed122000/Hotels_Scraper","commit_stats":null,"previous_names":["ahmed122000/booking_scrapping","ahmed122000/hotels_scraper"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ahmed122000%2FHotels_Scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ahmed122000%2FHotels_Scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ahmed122000%2FHotels_Scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ahmed122000%2FHotels_Scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Ahmed122000","download_url":"https://codeload.github.com/Ahmed122000/Hotels_Scraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246068293,"owners_count":20718503,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beautifulsoup4","flask","json","pytohn3","rest-api","scraping","scraping-websites"],"created_at":"2025-02-02T17:53:57.609Z","updated_at":"2026-05-09T00:36:19.839Z","avatar_url":"https://github.com/Ahmed122000.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🏨 Hotels Scraper - Web Scraping \u0026 API Service\n\n[![Python](https://img.shields.io/badge/Python-3.8+-blue?style=flat-square\u0026logo=python)](https://www.python.org/)\n[![Flask](https://img.shields.io/badge/Flask-2.0+-black?style=flat-square\u0026logo=flask)](https://flask.palletsprojects.com/)\n[![Selenium](https://img.shields.io/badge/Selenium-4.0+-green?style=flat-square\u0026logo=selenium)](https://selenium.dev/)\n[![BeautifulSoup](https://img.shields.io/badge/BeautifulSoup-4.0+-yellow?style=flat-square\u0026logo=python)](https://www.crummy.com/software/BeautifulSoup/)\n[![License](https://img.shields.io/badge/License-MIT-black?style=flat-square)](LICENSE)\n\n\u003e A powerful Python web scraping solution for extracting comprehensive hotel data from Booking.com with a RESTful Flask API for seamless integration.\n\n---\n\n## 📑 Table of Contents\n\n- [Overview](#-overview)\n- [Features](#-features)\n- [Tech Stack](#-tech-stack)\n- [Project Structure](#-project-structure)\n- [Installation](#-installation)\n- [API Documentation](#-api-documentation)\n- [Usage Guide](#-usage-guide)\n- [Configuration](#-configuration)\n- [Database Schema](#-database-schema)\n- [Troubleshooting](#-troubleshooting)\n- [Legal Notice](#-legal-notice)\n\n---\n\n## 📊 Overview\n\nThis project provides a complete web scraping and API service for hotel data extraction. It combines:\n- **Selenium WebDriver** for dynamic page rendering\n- **BeautifulSoup** for HTML parsing\n- **Flask** for REST API endpoints\n- **JSON** for data persistence\n\n**Use Cases**:\n- Travel price comparison\n- Market research\n- Amenity analysis\n- Availability tracking\n\n---\n\n## ✨ Features\n\n### 🕷️ Web Scraper Features\n\n| Feature | Description |\n|---------|-------------|\n| **Dynamic Rendering** | JavaScript-enabled browsing with Selenium |\n| **Hotel Information** | Scrapes address, images, ratings, reviews |\n| **Room Details** | Room types, prices, availability, capacity |\n| **Amenities** | Comprehensive list of hotel amenities |\n| **Images** | Downloads and stores hotel photos |\n| **Pagination** | Automatically handles multi-page results |\n| **Error Recovery** | Handles timeouts and connection failures |\n\n### 🌐 API Endpoints\n\n| Endpoint | Method | Purpose |\n|----------|--------|---------|\n| `/codes` | GET | List available city codes |\n| `/scrape` | GET | Start scraping process |\n| `/download/\u003cfile\u003e` | GET | Download JSON files |\n| `/status` | GET | Check scraper status |\n| `/history` | GET | View scraping history |\n\n### 📊 Data Extraction\n\n**Hotel Information**:\n- Hotel name and official rating\n- Address and GPS coordinates\n- Phone number and website\n- Check-in/check-out times\n- Number of rooms and floors\n\n**Room Details**:\n- Room type and size\n- Price per night\n- Occupancy capacity\n- Available dates\n- Special offers\n\n**Amenities**:\n- WiFi availability\n- Parking options\n- Pool facilities\n- Fitness center\n- Pet policies\n- Accessibility features\n\n---\n\n## 🛠️ Tech Stack\n\n| Component | Technology |\n|-----------|-----------|\n| **Language** | Python 3.8+ |\n| **Web Framework** | Flask 2.0+ |\n| **Web Driver** | Selenium 4.0+ |\n| **HTML Parser** | BeautifulSoup 4.0+ |\n| **Browser** | Firefox (Geckodriver) |\n| **Data Format** | JSON |\n| **Logging** | Python logging module |\n\n---\n\n## 📂 Project Structure\n\n```plaintext\nHotels_Scraper/\n├── app.py                        # Flask application entry point\n├── booking_hotels.py             # Scraper implementation\n├── requirements.txt              # Python dependencies\n├── .env.example                  # Environment configuration template\n│\n├── config/\n│   ├── cities.json              # City codes and configuration\n│   ├── selectors.json           # CSS selectors for scraping\n│   └── user_agents.txt          # Browser user agents\n│\n├── data/\n│   ├── raw/                     # Raw scraped data\n│   ├── processed/               # Cleaned data\n│   └── hotels_data_*.json       # Output JSON files\n│\n├── logs/\n│   ├── scraper.log              # Scraping operations log\n│   └── api.log                  # API request log\n│\n├── templates/\n│   ├── index.html               # Web interface\n│   ├── results.html             # Results page\n│   └── error.html               # Error page\n│\n├── static/\n│   ├── css/\n│   │   └── style.css\n│   ├── js/\n│   │   └── app.js\n│   └── images/\n│       └── logo.png\n│\n├── tests/\n│   ├── test_scraper.py\n│   ├── test_api.py\n│   └── test_selectors.py\n│\n├── docs/\n│   ├── API.md                   # API documentation\n│   ├── SCRAPING.md              # Scraping guide\n│   └── EXAMPLES.md              # Usage examples\n│\n└── README.md                    # This file\n```\n\n---\n\n## 🚀 Installation\n\n### Prerequisites\n\n- **Python 3.8+**\n- **Firefox browser**\n- **Geckodriver** (Firefox WebDriver)\n- **pip** (Python package manager)\n\n### Step 1: Install System Dependencies\n\n**Linux (Ubuntu/Debian)**:\n```bash\nsudo apt-get update\nsudo apt-get install python3 python3-pip firefox\nwget https://github.com/mozilla/geckodriver/releases/download/v0.33.0/geckodriver-v0.33.0-linux64.tar.gz\ntar -xvf geckodriver-v0.33.0-linux64.tar.gz\nsudo mv geckodriver /usr/local/bin/\n```\n\n**Mac**:\n```bash\nbrew install python3 firefox\nbrew install geckodriver\n```\n\n**Windows**:\n1. Download Firefox from https://www.mozilla.org/\n2. Download Geckodriver from https://github.com/mozilla/geckodriver/releases\n3. Add Geckodriver to PATH\n\n### Step 2: Clone Repository\n\n```bash\ngit clone https://github.com/Ahmed122000/Hotels_Scraper.git\ncd Hotels_Scraper\n```\n\n### Step 3: Create Virtual Environment\n\n```bash\npython3 -m venv venv\nsource venv/bin/activate        # Linux/Mac\n# or\nvenv\\Scripts\\activate            # Windows\n```\n\n### Step 4: Install Dependencies\n\n```bash\npip install -r requirements.txt\n```\n\n**requirements.txt**:\n```\nbeautifulsoup4==4.11.1\nselenium==4.10.0\nrequests==2.31.0\nflask==2.3.0\npython-dotenv==1.0.0\n```\n\n### Step 5: Configuration\n\nCreate `.env` file from template:\n```bash\ncp .env.example .env\n```\n\nEdit `.env`:\n```ini\n# Flask configuration\nFLASK_ENV=development\nDEBUG=True\nSECRET_KEY=your-secret-key\n\n# Scraping configuration\nHEADLESS_BROWSER=True\nTIMEOUT=30\nRETRY_ATTEMPTS=3\n\n# Data paths\nDATA_DIR=./data\nLOG_DIR=./logs\n```\n\n### Step 6: Run Application\n\n```bash\npython app.py\n```\n\nApplication will be available at: `http://localhost:5000`\n\n---\n\n## 📡 API Documentation\n\n### 1. Get City Codes\n\n**Endpoint**: `GET /codes`\n\n**Description**: Retrieve available city codes for scraping\n\n**Response**:\n```json\n{\n  \"status\": \"success\",\n  \"cities\": {\n    \"cairo\": \"290692\",\n    \"alexandria\": \"290263\",\n    \"hurghada\": \"290029\",\n    \"sharm_el_sheikh\": \"290039\",\n    \"giza\": \"290693\"\n  }\n}\n```\n\n**Example**:\n```bash\ncurl -X GET http://localhost:5000/codes\n```\n\n---\n\n### 2. Start Scraping\n\n**Endpoint**: `GET /scrape`\n\n**Query Parameters**:\n\n| Parameter | Type | Required | Description |\n|-----------|------|----------|-------------|\n| `city` | string | Yes | City name (e.g., \"cairo\") |\n| `city_code` | string | Yes | City code from `/codes` |\n| `pages` | integer | No | Number of pages to scrape (default: 1) |\n| `format` | string | No | Output format: \"json\" or \"csv\" (default: \"json\") |\n| `max_results` | integer | No | Maximum hotels to scrape (default: 50) |\n\n**Response**:\n```json\n{\n  \"status\": \"success\",\n  \"message\": \"Scraping for cairo completed. Data saved\",\n  \"data_file\": \"cairo_hotels_1672503492.json\",\n  \"download_link\": \"http://localhost:5000/download/cairo_hotels_1672503492.json\",\n  \"metadata\": {\n    \"city\": \"cairo\",\n    \"hotels_count\": 47,\n    \"pages_scraped\": 1,\n    \"duration_seconds\": 45.23,\n    \"timestamp\": \"2023-12-31T10:30:00Z\"\n  }\n}\n```\n\n**Example**:\n```bash\n# Scrape 2 pages of Cairo hotels\ncurl -X GET \"http://localhost:5000/scrape?city=cairo\u0026city_code=290692\u0026pages=2\u0026format=json\"\n```\n\n---\n\n### 3. Download Data\n\n**Endpoint**: `GET /download/\u003cfile_name\u003e`\n\n**Description**: Download previously saved JSON/CSV files\n\n**Parameters**:\n\n| Parameter | Type | Description |\n|-----------|------|-------------|\n| `file_name` | string | Name of file in data directory |\n\n**Response**: File download or error\n\n**Example**:\n```bash\ncurl -X GET http://localhost:5000/download/cairo_hotels_1672503492.json \\\n  -o hotels_data.json\n```\n\n---\n\n### 4. Scraping Status\n\n**Endpoint**: `GET /status`\n\n**Description**: Check current scraping status\n\n**Response**:\n```json\n{\n  \"status\": \"idle\",\n  \"current_job\": null,\n  \"completed_jobs\": 5,\n  \"failed_jobs\": 1,\n  \"average_duration\": 42.5\n}\n```\n\n---\n\n### 5. Scraping History\n\n**Endpoint**: `GET /history`\n\n**Description**: View previous scraping operations\n\n**Response**:\n```json\n{\n  \"history\": [\n    {\n      \"id\": 1,\n      \"city\": \"cairo\",\n      \"hotels_count\": 47,\n      \"timestamp\": \"2023-12-31T10:30:00Z\",\n      \"status\": \"completed\",\n      \"file\": \"cairo_hotels_1672503492.json\"\n    }\n  ]\n}\n```\n\n---\n\n## 💻 Usage Guide\n\n### Web Interface\n\n1. **Start Application**:\n   ```bash\n   python app.py\n   ```\n\n2. **Open Browser**:\n   Navigate to `http://localhost:5000`\n\n3. **Select City**:\n   Choose from dropdown menu\n\n4. **Configure Scraping**:\n   - Number of pages\n   - Output format\n   - Maximum results\n\n5. **Start Scraping**:\n   Click \"Scrape\" button\n\n6. **Download Results**:\n   Click download link when complete\n\n### Programmatic Usage\n\n**Python Example**:\n```python\nimport requests\nimport json\n\n# Get available cities\nresponse = requests.get('http://localhost:5000/codes')\ncities = response.json()['cities']\n\n# Scrape hotels\nscrape_url = 'http://localhost:5000/scrape'\nparams = {\n    'city': 'cairo',\n    'city_code': cities['cairo'],\n    'pages': 2,\n    'format': 'json'\n}\n\nresponse = requests.get(scrape_url, params=params)\nresult = response.json()\n\n# Download data\nif result['status'] == 'success':\n    file_url = result['download_link']\n    data_response = requests.get(file_url)\n    with open('hotels.json', 'w') as f:\n        json.dump(data_response.json(), f, indent=2)\n```\n\n---\n\n## ⚙️ Configuration\n\n### Browser Configuration\n\nEdit `config/selectors.json` for different Booking.com layouts:\n\n```json\n{\n  \"hotel_name\": \".hotel-name\",\n  \"price\": \".price-tag\",\n  \"rating\": \".hotel-rating\",\n  \"address\": \".hotel-address\",\n  \"amenities\": \".amenity-list\"\n}\n```\n\n### User Agents\n\nAdd browser user agents to `config/user_agents.txt`:\n```\nMozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...\nMozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...\n```\n\n---\n\n## 📊 Database Schema\n\n### Hotel Object\n\n```json\n{\n  \"hotel_id\": \"12345\",\n  \"name\": \"Hotel Cairo\",\n  \"address\": \"123 Nile Street, Cairo, Egypt\",\n  \"city\": \"Cairo\",\n  \"coordinates\": {\n    \"latitude\": 30.0444,\n    \"longitude\": 31.2357\n  },\n  \"phone\": \"+20123456789\",\n  \"website\": \"https://hotelcairo.example.com\",\n  \"rating\": 4.5,\n  \"review_count\": 234,\n  \"rooms_count\": 50,\n  \"check_in_time\": \"14:00\",\n  \"check_out_time\": \"12:00\",\n  \"amenities\": [\n    \"Free WiFi\",\n    \"Pool\",\n    \"Fitness Center\"\n  ],\n  \"images\": [\n    \"https://example.com/image1.jpg\",\n    \"https://example.com/image2.jpg\"\n  ],\n  \"rooms\": [\n    {\n      \"type\": \"Single Room\",\n      \"price\": 45.00,\n      \"capacity\": 1,\n      \"available\": true\n    }\n  ]\n}\n```\n\n---\n\n## 🐛 Troubleshooting\n\n### Issue: Geckodriver Not Found\n**Solution**:\n```bash\n# Add to PATH or specify in environment\nexport PATH=$PATH:/path/to/geckodriver\n```\n\n### Issue: Timeout During Scraping\n**Solution**: Increase timeout in `.env`:\n```ini\nTIMEOUT=60\n```\n\n### Issue: CSS Selectors Not Working\n**Solution**: \n1. Update `config/selectors.json`\n2. Run tests: `python -m pytest tests/test_selectors.py`\n\n### Issue: Browser Memory Issues\n**Solution**: Enable headless mode in `.env`:\n```ini\nHEADLESS_BROWSER=True\n```\n\n---\n\n## 📈 Performance Tips\n\n1. **Parallel Scraping**: Use threading for multiple cities\n2. **Caching**: Cache city codes and selectors\n3. **Rate Limiting**: Add delays between requests (ethical)\n4. **Data Compression**: Compress large JSON files\n5. **Database**: Use SQLite/PostgreSQL for large datasets\n\n---\n\n## ⚖️ Legal Notice\n\n**Important**: This scraper is designed for **educational purposes** only.\n\n- **Respect robots.txt**: Check Booking.com's robots.txt\n- **Rate Limiting**: Use reasonable request delays\n- **Terms of Service**: Comply with Booking.com's ToS\n- **Ethical Use**: Do not resell or commercially use data\n- **Legal Compliance**: Check local laws regarding web scraping\n\n---\n\n## 🔄 Roadmap\n\n- [ ] Rotating proxy support\n- [ ] CAPTCHA handling\n- [ ] Browser pool management\n- [ ] PostgreSQL integration\n- [ ] Scheduled scraping\n- [ ] REST API authentication\n- [ ] Email notifications\n- [ ] Data validation pipeline\n\n---\n\n## 📝 Contributing\n\n1. Fork repository\n2. Create feature branch (`git checkout -b feature/enhancement`)\n3. Commit changes (`git commit -m 'Add enhancement'`)\n4. Push to branch (`git push origin feature/enhancement`)\n5. Open Pull Request\n\n---\n\n## 📄 License\n\nThis project is licensed under the **MIT License** - see [LICENSE](LICENSE) for details.\n\n---\n\n## 🙏 Acknowledgments\n\n- [Selenium Documentation](https://selenium.dev/)\n- [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)\n- [Flask Documentation](https://flask.palletsprojects.com/)\n- Booking.com for API inspiration\n\n---\n\n## 📞 Support\n\nFor issues, questions, or suggestions:\n- Open an issue on GitHub\n- Email: ahmedhesham122000@gmail.com\n- Check [docs/](docs/) for detailed guides\n\n---\n\n**Built with ❤️ for data enthusiasts**\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fahmed122000%2Fhotels_scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fahmed122000%2Fhotels_scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fahmed122000%2Fhotels_scraper/lists"}