https://github.com/damikaalwis-gif/adscrapex

AdScrapeX is a web scraping project built with Scrapy and Scrapy Playwright to extract data from popular classified ad websites in Sri Lanka, including vehicles, properties, and job listings.

Host: GitHub
URL: https://github.com/damikaalwis-gif/adscrapex
Owner: DamikaAlwis-Gif
Created: 2024-09-23T05:36:54.000Z (8 months ago)
Default Branch: main
Last Pushed: 2024-11-03T10:30:55.000Z (7 months ago)
Last Synced: 2024-11-03T11:22:34.141Z (7 months ago)
Topics: classifieds, ikman-lk, playwright, python, scrapy, scrapy-spider, webscraping
Language: Python
Homepage:
Size: 47.9 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# AdScrapeX

### Overview

**AdScrapeX** is a Python-based web scraping project using **Scrapy** and **Scrapy Playwright** to crawl and extract data from various classified advertisement websites. It’s designed to efficiently scrape information such as vehicle listings, property ads, and general classifieds, providing structured data that can be further processed or analyzed.

### Features

- Scrapes multiple classified ads websites efficiently.
- Handles dynamic websites using Scrapy Playwright for better interaction with JavaScript-heavy pages.
- Can be easily customized to add more classified ad websites.
- Built-in support for pagination and dynamic content handling.
- Error handling and logging mechanisms to ensure robust data scraping.
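
The dynamic-content handling listed above relies on Scrapy Playwright, which is normally switched on through the project's Scrapy settings. The fragment below is a minimal sketch of such a `settings.py` entry, based on the scrapy-playwright documentation rather than copied from this repository. The browser type and launch options are illustrative assumptions.

```python
# settings.py (sketch; the values below are assumptions, not this project's actual config)

# Route HTTP and HTTPS requests through the Playwright download handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Assumed choices: headless Chromium.
PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}
```

With these settings in place, a request is rendered in the browser only when it carries `meta={"playwright": True}`; other requests still use Scrapy's regular downloader.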

### Supported Websites

This project currently supports scraping the following classified ad websites:

- [Ikman.lk](https://ikman.lk) - Sri Lanka's largest marketplace for vehicles, properties, jobs, and more.
- [PatPat.lk](https://patpat.lk) - Online marketplace in Sri Lanka, specializing in vehicles and properties.
- [Adz.lk](https://adz.lk) - A classified ads platform in Sri Lanka for vehicles, properties, and services.
- [HitAd.lk](https://hitad.lk) - A leading classified ads site in Sri Lanka, listing jobs, vehicles, real estate, and services.
- [Sunday Observer](https://www.sundayobserver.lk/classifieds) - The classified section of the Sunday Observer, a major Sri Lankan newspaper.

## Installation

### 1. Clone the Repository

```bash
git clone https://github.com/DamikaAlwis-Gif/AdScrapeX.git
```
### 2. Create a Virtual Environment

Create and activate a virtual environment to manage project dependencies.

```bash
python -m venv venv
venv\Scripts\activate # On Windows
# On macOS/Linux: source venv/bin/activate
```
### 3. Install Required Packages

Install the project dependencies listed in `requirements.txt`.

```bash
pip install -r requirements.txt
```
### 4. Install Browser Binaries

Once Playwright is installed, you need to install the browser binaries (Chromium, Firefox, and WebKit) that Playwright will automate.

```bash
playwright install
```

### 5. Set Up Environment Variables

Navigate to the `ad_scraper` folder and create a `.env` file based on the `.env.example` file.

```bash
cd ad_scraper
```

Then set your ScrapeOps API key in the `.env` file:

```bash
SCRAPEOPS_API_KEY=your_scrapeops_api_key
```
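
The README does not show how the project reads this key at runtime. As a rough sketch, assuming the key is looked up by the name `SCRAPEOPS_API_KEY`, the standard-library code below checks the process environment first and then falls back to the `.env` file. `load_env_file` and `get_scrapeops_api_key` are hypothetical helpers written for illustration; a library such as python-dotenv would be a common alternative.

```python
import os
from pathlib import Path


def load_env_file(path):
    """Parse simple KEY=value lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_scrapeops_api_key(env_path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("SCRAPEOPS_API_KEY")
    if key:
        return key
    if Path(env_path).exists():
        return load_env_file(env_path).get("SCRAPEOPS_API_KEY")
    return None
```

Returning `None` when the key is missing lets the caller decide whether to fail early or run without ScrapeOps.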

## Usage

### 1. Locate the Spiders

The spiders are in the `spiders` folder.

### 2. Replace the start_urls with Your Desired URLs

Before running a spider, replace the `start_urls` in its spider file with the URLs you want to scrape.
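
If you want to seed a spider with several listing pages at once, the list can be generated rather than typed by hand. The sketch below is only an illustration: `build_start_urls` is a hypothetical helper, the domain is a placeholder, and the `page` query parameter is an assumed pagination scheme that may not match any of the supported sites.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_start_urls(listing_url, pages):
    """Return one URL per page by setting a `page` query parameter (assumed scheme)."""
    parts = urlsplit(listing_url)
    query = dict(parse_qsl(parts.query))
    urls = []
    for page in range(1, pages + 1):
        # Keep existing query parameters and overwrite only the page number.
        query["page"] = str(page)
        urls.append(urlunsplit(parts._replace(query=urlencode(query))))
    return urls


# Placeholder domain; substitute a listing URL from one of the supported sites.
start_urls = build_start_urls("https://example.com/ads/vehicles?sort=date", pages=3)
```

The resulting list can be assigned to the `start_urls` attribute of the spider class. Check how the target site actually paginates before relying on this pattern.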

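### 3. Run the Spider

To start a spider, run the following command in the terminal, replacing `spider_name` with the name of the spider you want to run. The `-L INFO` option sets the log level to INFO.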

```bash
scrapy crawl spider_name -L INFO
```
