
# 📚 Web Scraping Project with Scrapy and MongoDB

![Python Version](https://img.shields.io/badge/python-3.9%2B-blue)
![Scrapy Version](https://img.shields.io/badge/scrapy-2.6.1-brightgreen)
![MongoDB](https://img.shields.io/badge/mongodb-database-brightgreen)
![License](https://img.shields.io/github/license/AathifZahir/Py-Scrap)
![Last Commit](https://img.shields.io/github/last-commit/AathifZahir/Py-Scrap)

---

## 📋 Table of Contents

- [📖 About the Project](#about-the-project)
- [⚙️ Getting Started](#getting-started)
- [📦 Prerequisites](#prerequisites)
- [🔧 Setup Instructions](#setup-instructions)
- [💡 Running the Scraper](#running-the-scraper)
- [📂 Project Structure](#project-structure)
- [✨ Customization](#customization)
- [📜 License](#license)
- [📚 References](#references)

---

## 📖 About the Project

This project demonstrates how to build a web scraper using **Scrapy**, a powerful Python framework for web scraping, and store the extracted data in **MongoDB**, a flexible NoSQL database.

### Objective

The scraper is designed to extract product information from **Amazon**. It:
- Extracts relevant data like product names, prices, ratings, and images.
- Handles pagination to scrape data across multiple pages.
- Stores the extracted data in a MongoDB database for further analysis or processing.

The scraper can be used for various products, not just books. You can search for any product category by passing the desired keyword.
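
For illustration, a single scraped item stored in MongoDB might look like the dictionary below. The field names here are assumptions made for the sake of the example; the actual fields are defined in `items.py`.

```python
# Hypothetical example of one scraped item as stored in MongoDB.
# Field names are illustrative; the real schema lives in books/items.py.
{
    "name": "Example Wireless Mouse",
    "price": "$24.99",
    "rating": "4.5 out of 5 stars",
    "image_url": "https://m.media-amazon.com/images/I/example.jpg",
}
```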

---

## ⚙️ Getting Started

Follow these steps to get the project running on your local machine. 🚀

### 📦 Prerequisites

Before running the scraper, ensure you have the following installed:
- 🐍 Python 3.9 or higher
- 🕷️ Scrapy
- 💾 MongoDB
- 🔗 pymongo

---

### 🔧 Setup Instructions

1. **Clone the Repository**:

```bash
git clone https://github.com/AathifZahir/Py-Scrap.git
cd Py-Scrap
```

2. **Set Up a Virtual Environment**:

```bash
python -m venv venv
```

3. **Activate the Virtual Environment**:

- On Windows:

```bash
venv\Scripts\activate
```

- On Unix or macOS:

```bash
source venv/bin/activate
```

4. **Install the Required Packages**:

```bash
pip install scrapy pymongo
```

5. **Configure MongoDB**:

Ensure MongoDB is installed and running on your local machine. The default connection settings in the project are:
- Host: `localhost`
- Port: `27017`
- Database: `books_db`
- Collection: `books`

If your MongoDB configuration differs, update the settings in `settings.py` accordingly.
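
For reference, the MongoDB-related entries in `settings.py` typically look something like the sketch below. The setting names (`MONGO_URI`, `MONGO_DATABASE`, `MONGO_COLLECTION`) and the pipeline class name are assumptions; match them to whatever keys the project's `pipelines.py` actually reads.

```python
# books/settings.py (excerpt) - illustrative setting names; adjust to match
# the keys the project's pipeline actually reads.
MONGO_URI = "mongodb://localhost:27017"
MONGO_DATABASE = "books_db"
MONGO_COLLECTION = "books"

# Enable the MongoDB pipeline so scraped items are written to the database.
ITEM_PIPELINES = {
    "books.pipelines.MongoDBPipeline": 300,
}
```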

---

## 💡 Running the Scraper

To execute the scraper, use the following command:

```bash
scrapy crawl book -a keyword="laptops"
```

The scraper will:
- Use the passed keyword (default is "books").
- Start at the specified Amazon search URL.
- Navigate through the pages and extract data like product names, prices, ratings, and images.
- Store the extracted data in the MongoDB database.

If you do not pass a keyword, the scraper will default to searching for "books". Example:

```bash
scrapy crawl book
```
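
Once a crawl has finished, you can quickly confirm that items were written to MongoDB with a few lines of `pymongo`. The database and collection names below assume the defaults listed in the setup section.

```python
from pymongo import MongoClient

# Connect using the default local MongoDB settings (localhost:27017).
client = MongoClient("mongodb://localhost:27017")
collection = client["books_db"]["books"]

# Print how many items were stored and show one sample document.
print(collection.count_documents({}))
print(collection.find_one())
```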

---

## 📂 Project Structure

The project follows Scrapy's standard structure:

```
books_scraper/
├── books/
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders/
│       ├── __init__.py
│       └── book_spider.py
├── scrapy.cfg
└── README.md
```

- `items.py`: Defines the data structure for the scraped items.
- `pipelines.py`: Contains the pipeline for processing and storing items in MongoDB.
- `settings.py`: Configuration settings for the Scrapy project, including MongoDB connection details.
- `spiders/book_spider.py`: The main spider responsible for scraping Amazon.
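
As a reference point, a Scrapy-to-MongoDB pipeline usually follows the pattern shown below. This is a minimal sketch rather than the project's exact code; the class name and settings keys are assumptions and may differ from what `pipelines.py` contains.

```python
# books/pipelines.py - minimal sketch of a MongoDB pipeline.
# Class name and settings keys are illustrative, not the project's exact code.
import pymongo


class MongoDBPipeline:
    def __init__(self, mongo_uri, mongo_db, collection):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection

    @classmethod
    def from_crawler(cls, crawler):
        # Read connection details from settings.py, with local defaults.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "books_db"),
            collection=crawler.settings.get("MONGO_COLLECTION", "books"),
        )

    def open_spider(self, spider):
        # Open one client when the crawl starts.
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Insert each scraped item as a document in the configured collection.
        self.db[self.collection_name].insert_one(dict(item))
        return item
```

The pipeline is activated through `ITEM_PIPELINES` in `settings.py`, as shown in the setup section.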

---

## ✨ Customization

To adapt the scraper for different keywords or websites:

1. **Pass a Keyword Dynamically**:

The spider accepts a search keyword at runtime via the `-a` argument (see the spider sketch after this list). Example for scraping products matching "laptops":

```bash
scrapy crawl book -a keyword="laptops"
```

By default, if no keyword is passed, the scraper will search for "books".

2. **Update the `start_urls`**:

Modify the `start_urls` list in `book_spider.py` to point to a different website or category.

3. **Adjust the Parsing Logic**:

Ensure the CSS selectors in the `parse` method of `book_spider.py` accurately target the desired data fields on the new website.

4. **Handle Pagination**:

If the target website uses a different pagination structure, update the pagination handling logic in the `parse` method accordingly.
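
These customization points come together in the spider roughly as follows. This is a minimal sketch assuming an Amazon-style search URL and generic CSS selectors; the real selectors and URL handling live in `book_spider.py` and will differ.

```python
# books/spiders/book_spider.py - rough sketch of the keyword, parsing, and
# pagination flow. Selectors and URL format are assumptions, not the real code.
import scrapy


class BookSpider(scrapy.Spider):
    name = "book"

    def __init__(self, keyword="books", *args, **kwargs):
        # The -a keyword="..." argument arrives here; "books" is the default.
        super().__init__(*args, **kwargs)
        self.start_urls = [f"https://www.amazon.com/s?k={keyword}"]

    def parse(self, response):
        # Adjust these CSS selectors to match the target site's markup.
        for product in response.css("div.s-result-item"):
            yield {
                "name": product.css("h2 span::text").get(),
                "price": product.css("span.a-price span.a-offscreen::text").get(),
                "rating": product.css("span.a-icon-alt::text").get(),
                "image_url": product.css("img.s-image::attr(src)").get(),
            }

        # Follow the "next page" link if the site paginates results.
        next_page = response.css("a.s-pagination-next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```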

---

## 📜 License

This project is licensed under the MIT License. See the `LICENSE` file for details. 📄

---

## 📚 References

For more detailed information on the tools and techniques used in this project, refer to the following resources:
- 📖 [Scrapy Documentation](https://docs.scrapy.org/en/latest/)
- 🗃️ [Scrapy MongoDB Pipeline](https://github.com/julien-duponchelle/scrapy-mongodb)
- 📰 [Web Scraping With Scrapy and MongoDB](https://realpython.com/web-scraping-with-scrapy-and-mongodb/)

---

## ⭐ Support

If you like this project, please give it a ⭐ by clicking the star button at the top of the repository! It helps others discover the project and motivates me to improve it further. ❤️
