https://github.com/lousousa/webscrape-wizard

An efficient and easy-to-use web scraping application utilizing wget for downloading and scp for secure file transfer.

bash bash-script

Last synced: 5 months ago

Host: GitHub
URL: https://github.com/lousousa/webscrape-wizard
Owner: lousousa
Created: 2023-12-15T02:21:34.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2023-12-15T03:24:38.000Z (over 2 years ago)
Last Synced: 2024-01-28T08:10:18.935Z (over 2 years ago)
Topics: bash, bash-script
Language: Shell
Homepage:
Size: 10.7 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# WebScrape Wizard ![cli](https://badges.aleen42.com/src/cli.svg)

## Description
A robust web scraping tool that automates data extraction from URLs listed in an input file, organizes scraped data into specified directories, securely transfers data to a remote server, and maintains a clean local workspace by removing local copies after transfer.

## Introduction

### Purpose
To streamline the process of bulk web scraping and secure data transfer.

### Features
- Automated scraping from a list of URLs.
- Organized storage of scraped data.
- Secure transfer of data to a remote server.
- Automatic local data cleanup post-transfer.

## Prerequisites

### Dependencies
- `wget` for web scraping.
- `scp` for secure file transfer.
- Appropriate permissions for file operations and server access.
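A quick way to confirm the dependencies above are installed is to look them up on `PATH` before running the scraper. The snippet below is a minimal sketch; `check_deps` is a hypothetical helper written for illustration and is not part of the repository.

```bash
# Hypothetical helper (not part of the repository): reports tools missing from PATH.
check_deps() {
  local missing=0 tool
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the tool name.
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing dependency: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Check the two tools this README lists as dependencies.
if check_deps wget scp; then
  echo "all dependencies found"
fi
```

If either tool is reported missing, install it with your system's package manager before continuing.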

## Usage

### Basic Usage
Before running the script, ensure the following steps are completed:
1. **Prepare the `.env` file**: create a `.env` file in the root directory of the project and add the SSH connection details used for the file transfer, such as `SSH_USER`, `SSH_HOST`, and `SSH_PATH`.
2. **Configure `input.txt`**: list one **directory name** and one **URL** per line, separated by a semicolon (`;`).
3. **Run the script**: execute the script to start the scraping process. It will read URLs from `input.txt`, scrape data using `wget`, save it in specified directories, transfer it to the server via `scp`, and then clean up local data.

```bash
source scraper.sh
```
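To make the configuration and the per-line workflow concrete, here is a hedged sketch of how one `directory;URL` line could map to the download, transfer, and cleanup steps. It only prints the commands instead of running them. The `plan_jobs` function, the example values, the `wget -P` and `scp -r` options, and the remote path layout are all assumptions for illustration; the actual `scraper.sh` may work differently.

```bash
# Sketch only: prints the commands a run might execute for each "directory;URL" line.
# SSH_USER, SSH_HOST and SSH_PATH are the variable names from the README;
# the wget/scp options and the remote path layout are assumptions.
plan_jobs() {
  local input_file=$1 dir url
  # Split each line on the first ';'. Any later ';' stays inside the URL.
  while IFS=';' read -r dir url || [ -n "$dir" ]; do
    # Skip blank or incomplete lines.
    if [ -z "$dir" ] || [ -z "$url" ]; then
      continue
    fi
    echo "wget -P $dir $url"
    echo "scp -r $dir $SSH_USER@$SSH_HOST:$SSH_PATH"
    echo "rm -rf $dir"
  done < "$input_file"
}

# Example values only; in the real project these would live in .env.
SSH_USER=deploy
SSH_HOST=example.com
SSH_PATH=/srv/scraped

# Example input file with two "directory;URL" pairs.
example_input=$(mktemp)
printf '%s\n' 'docs;https://example.com/docs' 'blog;https://example.org/blog' > "$example_input"

plan_jobs "$example_input"
rm -f "$example_input"
```

Printing the planned commands first is a cheap way to check that `input.txt` parses as intended before anything is downloaded or deleted.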
