https://github.com/developer-sumit/web-groper-python

WebGroper is a Python class designed to recursively scrape and download media files (images, PDFs, etc.) from a specified website directory, such as the /wp-content/uploads directory of a WordPress site.



Host: GitHub
URL: https://github.com/developer-sumit/web-groper-python
Owner: developer-sumit
Created: 2024-11-08T07:22:42.000Z (8 months ago)
Default Branch: master
Last Pushed: 2024-11-08T08:28:00.000Z (8 months ago)
Last Synced: 2025-02-16T08:42:46.366Z (5 months ago)
Topics: package, python-package, web-scr, web-scraping, website-scraper, wordpress-scraper, wordpress-website-scraper
Language: Python
Homepage:
Size: 4.88 KB
Stars: 0
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# WebGroper

WebGroper is a Python class designed to recursively scrape and download media files (images, PDFs, etc.) from a specified website directory, such as the `/wp-content/uploads` directory of a WordPress site.

## Features

- Recursively traverses URLs to find and download media files.

- Ignores resized images generated by WordPress.

- Saves downloaded files in a structured directory.
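The three features above can be illustrated with a minimal, offline sketch. This is not WebGroper's actual implementation: the helper names (`LinkCollector`, `classify_links`, `local_path`) are hypothetical, the standard-library `html.parser` stands in for `beautifulsoup4` so the snippet runs without dependencies, and the exact on-disk layout WebGroper uses is an assumption. Only the regex is taken from the usage example in this README.

```python
import re
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse

# Pattern from this README's usage example: matches WordPress-style
# resized variants such as "logo-150x150.png".
RESIZED = re.compile(r"-\d+x\d+\.[a-z]+")


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def classify_links(page_url, html):
    """Split a directory listing into subdirectory URLs and file URLs."""
    collector = LinkCollector()
    collector.feed(html)
    subdirs, files = [], []
    for href in collector.hrefs:
        url = urljoin(page_url, href)
        # Stay below the current directory: skips parent links and the
        # "?C=N;O=D" style sort links found in some server listings.
        if "?" in href or url == page_url or not url.startswith(page_url):
            continue
        if url.endswith("/"):
            subdirs.append(url)  # a recursive traversal would visit these next
        elif not RESIZED.search(url.rsplit("/", 1)[-1]):
            files.append(url)  # original media file, not a resized copy
    return subdirs, files


def local_path(output_directory, file_url):
    """Mirror the URL path under the output directory (assumed layout)."""
    relative = urlparse(file_url).path.lstrip("/")
    return str(PurePosixPath(output_directory) / relative)


# Example with an inline listing instead of a real HTTP request.
listing = (
    '<a href="/wp-content/">Parent Directory</a>'
    '<a href="2024/">2024/</a>'
    '<a href="logo.png">logo.png</a>'
    '<a href="logo-150x150.png">logo-150x150.png</a>'
)
page = "https://example.com/wp-content/uploads/"
subdirs, files = classify_links(page, listing)
print(subdirs, files)
print(local_path("groped_data", files[0]))
```

A real traversal would fetch each page over HTTP, call something like `classify_links` on the response, recurse into the subdirectories, and pause between downloads.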

## Requirements

- Python 3.x

- `requests` library

- `beautifulsoup4` library

## Installation

1. Clone the repository or download the script.

2. Install the required libraries using pip:

    ```sh
    pip install requests beautifulsoup4
    ```

## Usage

1. Create an instance of the `WebGroper` class with the desired parameters.

2. Call the `traverse_url_recursive` method with the starting URL.

Example:

```python
from webgroper import WebGroper

# Initialize the WebGroper class
web_groper = WebGroper(
    output_directory="groped_data",
    time_between_download_requests=1,
    ignore_sizes_regex=r"-\d+x\d+\.[a-z]+"
)

# Start scraping from the specified URL
web_groper.traverse_url_recursive("https://example-wordpress-site.com/wp-content/uploads/")
```
