https://github.com/amirespahbodi/google-maps-scraper
google map scraper. extract title, phone, address, latitude and longitude, category, website URL, rating, reviews number, email, active_hours, reviews and first picture of listing
https://github.com/amirespahbodi/google-maps-scraper
dynamic-website google-map-scraper google-map-scraping google-maps-scraper google-maps-scraper-python google-maps-scraping playwright playwright-python python3 web-crawling web-scraping webcrawling webscraping
Last synced: about 2 months ago
JSON representation
google map scraper. extract title, phone, address, latitude and longitude, category, website URL, rating, reviews number, email, active_hours, reviews and first picture of listing
- Host: GitHub
- URL: https://github.com/amirespahbodi/google-maps-scraper
- Owner: AmirEspahbodi
- License: mit
- Created: 2024-11-16T08:39:46.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-23T02:36:12.000Z (over 1 year ago)
- Last Synced: 2024-12-23T03:24:38.629Z (over 1 year ago)
- Topics: dynamic-website, google-map-scraper, google-map-scraping, google-maps-scraper, google-maps-scraper-python, google-maps-scraping, playwright, playwright-python, python3, web-crawling, web-scraping, webcrawling, webscraping
- Language: Python
- Homepage:
- Size: 239 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Google Maps Scraper
## Overview
This repository contains a Google Maps scraping system composed of two cooperating services:
- **Scraper worker**: Uses Playwright to open multiple browser contexts/tabs, search Google Maps, scroll listings, extract details, and persist results to Excel (and optionally download listing images). The worker pulls work items from a Redis queue.
- **API server**: A FastAPI service that accepts scrape requests, validates duplicates against Redis queues and stored output files, and exposes a status endpoint to inspect current queue and processing state.
A helper script (`run_app.py`) runs both services concurrently using a thread pool.
## Key Features
- Asynchronous Playwright-driven scraping with multiple browser tabs/contexts.
- Redis-backed queue for search requests and in-process tracking.
- Excel output via Pandas + OpenPyXL.
- Optional image download and storage.
- FastAPI endpoints for requesting scrapes and observing system status.
## Tech Stack
- **Python 3.12** (Poetry-managed dependencies).
- **Playwright** for browser automation.
- **FastAPI** + **Uvicorn** for the API server.
- **Redis** for queueing and coordination.
- **Pydantic Settings** for environment configuration.
- **lxml** for HTML parsing.
- **Pandas / OpenPyXL** for Excel output.
- **aiofiles / httpx** for async file and HTTP operations.
## Architecture & Data Flow
1. The API server receives a scrape request (listing category + query details) and enqueues it in Redis.
2. The scraper worker reads queued requests, launches Playwright browsers/tabs, navigates to Google Maps, and performs search + scrolling.
3. Listings are extracted, deduplicated across tabs, and enriched with details (title, category, address, phone, website, hours, coordinates, images).
4. Results are written to Excel files in the configured storage directory.
## Design Patterns & Structure
- **BO/DAO/DTO layering**:
- `data/bo`: Business logic orchestration for search, scrolling, extraction, and multi-tab coordination.
- `data/dao`: Infrastructure wrappers (Redis access, Playwright page wrapper).
- `data/dto`: Request/response schemas for the API.
- **Singleton RuntimeResource**: Centralized Playwright lifecycle management for browsers and tabs.
- **Async concurrency**: Heavy use of `asyncio.gather` for multi-tab scraping and page interactions.
## Project Structure
```
.
├── run_app.py # Runs scraper + API server together
├── scraper/
│ ├── run_scraper.py # Entry point for the scraper worker
│ ├── processes/ # Scraper processes
│ ├── data/ # bo/dao/dto + static data (user agents, referers)
│ ├── config/ # Pydantic settings for scraper + Redis
│ └── utils/ # Helper utilities (Excel, image storage, singleton)
└── server/
├── main.py # API server entry point
├── apis/ # FastAPI routes + schemas
├── core/ # Server configuration + Redis connection
└── utils/ # File-based request status helpers
```
## Requirements
- Python 3.12
- Redis (local or remote)
- Playwright browsers installed (Chromium/WebKit)
## Setup
1. **Install dependencies**
```bash
poetry install
```
2. **Install Playwright browsers**
```bash
poetry run playwright install
```
3. **Configure environment**
```bash
cp .env.sample .env
```
Update values as needed (Redis host/port, storage directories, separators, etc.).
## Running the System
### Run both services together
```bash
poetry run python run_app.py
```
### Run scraper only
```bash
cd scraper
poetry run python run_scraper.py
```
### Run API server only
```bash
cd server
poetry run python main.py
```
## API Endpoints
### `POST /request`
Queues a new scrape request.
**Request body**
```json
{
"listing_category": "restaurant",
"listing_type": "restaurant",
"verb": "near",
"city": "Berlin",
"province": "Berlin"
}
```
**Response**
Returns lists of imported, not-imported, in-process, and waiting requests.
### `GET /status`
Returns current status:
- `imported_requests` and `not_imported_requests` (based on storage files)
- `in_process` (current Redis processing item)
- `waiting_to_scrape` (queue contents)
## Output & Storage
- **Excel files**: Written to `NOT_IMPORTED_SHEETS_DIRECTORY` with a filename derived from the query fields and separators.
- **Images**: Saved under `PICTURES_DIRECTORY` (if image URLs are found).
## Configuration Reference
Environment variables are defined in `.env.sample` and loaded via Pydantic Settings. Key variables include:
- `REDIS_HOST`, `REDIS_PORT`
- `REDIS_REQUESTED_SEARCH_QUERY_QUEUE_NAME`, `REDIS_IN_PROCESSING_SEARCH_QUERY`
- `SEARCH_QUERY_ITEMS_SEPARATOR`, `LISTING_TYPE_ITEMS_SEPARATOR`
- `PICTURES_DIRECTORY`, `NOT_IMPORTED_SHEETS_DIRECTORY`, `IMPORTED_SHEETS_DIRECTORY`
- API server settings: `APP_NAME`, `HOST`, `PORT`, `ENVIRONMENT`, `WORKER`
## Notes & Considerations
- Scraping Google Maps may violate Google’s terms of service. Use responsibly and ensure compliance with local laws and platform policies.
- Playwright is configured to run in non-headless mode by default in `RuntimeResource`.
- Redis queues use simple list operations; scaling or fault tolerance might require additional coordination logic.
## License
This project is licensed under the terms of the MIT license. See [LICENSE](LICENSE).