https://github.com/lousousa/webscrape-wizard
An efficient and easy-to-use web scraping application utilizing wget for downloading and scp for secure file transfer.
- Host: GitHub
- URL: https://github.com/lousousa/webscrape-wizard
- Owner: lousousa
- Created: 2023-12-15T02:21:34.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-12-15T03:24:38.000Z (over 2 years ago)
- Last Synced: 2024-01-28T08:10:18.935Z (over 2 years ago)
- Topics: bash, bash-script
- Language: Shell
- Homepage:
- Size: 10.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# WebScrape Wizard 
## Description
A web scraping tool that reads URLs from an input file, downloads each one into its specified directory, transfers the results to a remote server over SSH, and removes the local copies after the transfer to keep the workspace clean.
## Introduction
### Purpose
To streamline the process of bulk web scraping and secure data transfer.
### Features
- Automated scraping from a list of URLs.
- Organized storage of scraped data.
- Secure transfer of data to a remote server.
- Automatic local data cleanup post-transfer.
## Prerequisites
### Dependencies
- `wget` for web scraping.
- `scp` for secure file transfer.
- Appropriate permissions for file operations and server access.
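A quick way to confirm the two command-line dependencies are installed before running anything is to look them up on the `PATH`. The snippet below is a minimal sketch, not part of the project; `missing_commands` is a hypothetical helper name.

```shell
# Hypothetical helper: print the names of any given commands not found on PATH.
# The function body runs in a subshell so its variables do not leak.
missing_commands() (
  missing=""
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
  done
  # Drop the leading space before printing.
  printf '%s\n' "${missing# }"
)

missing="$(missing_commands wget scp)"
if [ -n "$missing" ]; then
  echo "Missing dependencies: $missing" >&2
fi
```

If nothing is printed, both `wget` and `scp` were found.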
## Usage
### Basic Usage
Before running the script, ensure the following steps are completed:
1. **Prepare the `.env` file**: create a `.env` file in the project's root directory and add the SSH connection details used for the file transfer, such as `SSH_USER`, `SSH_HOST`, and `SSH_PATH`.
2. **Configure `input.txt`**: prepare `input.txt` by listing **directory names** and **URLs** separated by a semicolon (`;`), with each pair on a new line.
3. **Run the script**: execute the script to start the scraping process. It reads each directory and URL pair from `input.txt`, downloads the data with `wget`, saves it in the specified directory, transfers it to the server via `scp`, and then deletes the local copy.
```bash
source scraper.sh
```
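The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `scraper.sh`: every value in the example files is a placeholder, `plan_run` is a hypothetical helper, and the `wget -P` and `scp -r` invocations are a guess at how the download and transfer might be issued. The sketch only prints the commands (a dry run), so nothing is downloaded or transferred.

```shell
# Sketch only: the real scraper.sh may differ. All values are placeholders.
workdir="$(mktemp -d)"

# Example .env with the three variables named in the README.
cat > "$workdir/.env" <<'EOF'
SSH_USER=deploy
SSH_HOST=example.com
SSH_PATH=/srv/scrapes
EOF

# Example input.txt: one "directory;URL" pair per line.
cat > "$workdir/input.txt" <<'EOF'
docs;https://example.com/docs/
blog;https://example.com/blog/
EOF

# Load the SSH settings into the current shell.
. "$workdir/.env"

# Hypothetical helper: print the commands a run might issue for each pair.
plan_run() {
  while IFS=';' read -r dir url; do
    # Skip blank or incomplete lines.
    [ -n "$dir" ] && [ -n "$url" ] || continue
    echo "wget -P $dir $url"                          # assumed download step
    echo "scp -r $dir $SSH_USER@$SSH_HOST:$SSH_PATH"  # assumed transfer step
    echo "rm -rf $dir"                                # assumed cleanup step
  done < "$1"
}

plan_run "$workdir/input.txt"
```

Printing the plan first is a cheap way to check that `input.txt` parses as intended before any files are fetched or deleted.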