An open API service indexing awesome lists of open source software.

https://github.com/rogendo/web-scraping

This repository contains automation dockerized and undockerized scripts that scrape various websites.
https://github.com/rogendo/web-scraping

beautifulsoup4 scrapy selenium webscraping

Last synced: about 2 months ago
JSON representation

This repository contains automation dockerized and undockerized scripts that scrape various websites.

Awesome Lists containing this project

README

          

# Web Scraping

Welcome to the Web Scraping repository! This repository contains various web scraping scripts written in Python to extract data from different websites.

## Table of Contents

- [Introduction](#introduction)
- [Installation](#installation)
- [Usage](#usage)
- [Directories](#directories)
- [Contributing](#contributing)
- [License](#license)

## Introduction

This repository contains a collection of web scraping scripts that can be used to extract data from various websites. Each script is designed to scrape specific data and save it in a structured format such as CSV or JSON.

## Installation

To get started, clone the repository and set up a virtual environment:

```sh
git clone https://github.com/Rogendo/Web-Scraping.git
cd Web-Scraping
python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`
pip install -r requirements.txt

```

## Directories
- Notebooks
- Docker
- Scripts


### Notebooks
The Notebooks directory contains Jupyter notebooks that implement various web scraping tasks. Notebooks are great workspaces for developing and testing scraping code.

Directory: Notebooks
Description: Contains Jupyter notebooks for various scraping tasks.
Usage: Open the notebooks using Jupyter and run the cells to execute the scraping tasks.

### Docker

The Docker directory contains Dockerized scripts that automate the scraping tasks and run them in Docker containers. Docker containers provide isolated environments, ensuring that your web scraping scripts run consistently across different machines without conflicts from varying system configurations.
Docker too helps with:

- Dependency Management: Docker containers encapsulate all dependencies, libraries, and tools required for your web scraping scripts, eliminating the need to install and configure them on each machine.

- Cross-Platform Compatibility: Docker containers can run on any system that supports Docker, making it easy to move your web scraping setup between development, testing, and production environments.

- Cloud Deployment: Docker containers can be deployed on cloud platforms, allowing you to scale your web scraping operations based on demand.

- Scalability: Horizontal Scaling: Docker makes it easy to scale your web scraping operations by adding more containers to handle increased load. This is particularly useful for large-scale scraping projects.
- Load Balancing: Docker Swarm and Kubernetes can be used to manage and distribute the load across multiple containers, ensuring efficient resource utilization.

- Quick Setup: Docker allows for rapid setup of development environments, enabling you to start scraping quickly without spending time on environment configuration.
- Version Control: Docker images can be versioned, making it easy to track changes and roll back to previous versions if necessary.
- Resource Efficiency:
Lightweight: Docker containers are lightweight compared to virtual machines, leading to faster startup times and lower resource consumption.
- Optimized Resource Usage: Containers share the host system's kernel, making them more efficient in terms of CPU and memory usage.
- Sandboxing: Docker containers provide a level of security by isolating applications from the host system and other containers.
- Reproducible Builds: Docker images can be built and shared, ensuring that anyone can reproduce the same environment and run the web scraping scripts without issues.

#### Usage: Build and run the Docker containers using the provided Dockerfiles.

### Scripts
The Scripts directory contains standalone Python scripts that extract data from various websites.

##### Usage:
Run the scripts directly using Python.
```sh
python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`
pip install -r requirements.txt

python script_name.py
```

## Contributing
Contributions are welcome! If you have any improvements or new scripts to add, please open a pull request. Make sure to follow the coding standards and include a detailed description of your changes.

## License
This project is licensed under the MIT License. See the LICENSE file for more details.