Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/developer-sumit/web-groper-python
Last synced: 2 months ago
- Host: GitHub
- URL: https://github.com/developer-sumit/web-groper-python
- Owner: developer-sumit
- Created: 2024-11-08T07:22:42.000Z (3 months ago)
- Default Branch: master
- Last Pushed: 2024-11-08T08:28:00.000Z (3 months ago)
- Last Synced: 2024-11-08T09:28:14.513Z (3 months ago)
- Language: Python
- Size: 4.88 KB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# WebGroper
WebGroper is a Python class designed to recursively scrape and download media files (images, PDFs, etc.) from a specified website directory, such as the `/wp-content/uploads` directory of a WordPress site.
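
WordPress stores resized copies of an uploaded image next to the original, with a `-<width>x<height>` suffix before the extension (for example `photo-150x150.jpg`). The sketch below shows one way a pattern like the `ignore_sizes_regex` value from this README's usage example could be applied to skip those copies. It is an illustration only: `is_resized_copy` is a hypothetical helper, not part of WebGroper, and using `re.search` here is an assumption about how the class applies the pattern.

```python
import re

# Pattern copied from the README's usage example. It matches the
# "-<width>x<height>.<ext>" suffix that WordPress adds to resized copies.
RESIZED_SUFFIX = re.compile(r"-\d+x\d+\.[a-z]+")


def is_resized_copy(filename: str) -> bool:
    """Hypothetical helper: True if the name looks like a WordPress resize."""
    return RESIZED_SUFFIX.search(filename) is not None


names = ["photo.jpg", "photo-150x150.jpg", "photo-1024x768.jpg", "report.pdf"]

# Keep only the files that do not look like resized copies.
originals = [name for name in names if not is_resized_copy(name)]
print(originals)
```

Note that the pattern is unanchored and only matches lowercase extensions, so a name such as `photo-150x150.JPG` would not be treated as a resized copy under this sketch.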
## Features
- Recursively traverses URLs to find and download media files.
- Ignores resized images generated by WordPress.
- Saves downloaded files in a structured directory.
## Requirements
- Python 3.x
- `requests` library
- `beautifulsoup4` library
## Installation
1. Clone the repository or download the script.
2. Install the required libraries using pip:
```sh
pip install requests beautifulsoup4
```
## Usage
1. Create an instance of the `WebGroper` class with the desired parameters.
2. Call the `traverse_url_recursive` method with the starting URL.

Example:
```python
from webgroper import WebGroper

# Initialize the WebGroper class
web_groper = WebGroper(
    output_directory="groped_data",
    time_between_download_requests=1,
    ignore_sizes_regex=r"-\d+x\d+\.[a-z]+"
)

# Start scraping from the specified URL
web_groper.traverse_url_recursive("https://example-wordpress-site.com/wp-content/uploads/")