https://github.com/4rnv/scrappy

Script to scrape URLs from a webpage and archive them on the Wayback Machine.

Host: GitHub
URL: https://github.com/4rnv/scrappy
Owner: 4rnv
Created: 2024-04-04T20:45:18.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2024-04-04T20:56:37.000Z (about 1 year ago)
Last Synced: 2025-01-18T01:21:47.571Z (4 months ago)
Topics: python, scraper, wayback-machine
Language: Python
Homepage:
Size: 1000 Bytes
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

A Python script to scrape URLs from a webpage and archive them to the Wayback Machine. It uses Beautiful Soup to find the anchor tags on a page, then saves the linked URLs using the Archive.org API. The name is meant to be read as scrap.py: scraping plus Python.
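
The link-collection step can be sketched roughly as follows. This is an illustration, not the script's actual code: it swaps Beautiful Soup for the standard library's `html.parser` so it runs without extra installs, and the names `AnchorCollector` and `extract_links` are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects absolute http(s) URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Skip mailto:, javascript: and similar schemes, and drop duplicates.
        if absolute.startswith(("http://", "https://")) and absolute not in self.links:
            self.links.append(absolute)


def extract_links(html, base_url):
    """Return the unique http(s) links found in `html`, in page order."""
    parser = AnchorCollector(base_url)
    parser.feed(html)
    return parser.links
```

With Beautiful Soup the same idea is usually written as a loop over `soup.find_all("a")`, reading each tag's `href` attribute.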

# Usage

Clone this repo or download it as a ZIP. Install the modules listed in `requirements.txt` using `pip install -r requirements.txt`. Then run the script in your terminal and follow the on-screen instructions.

IMPORTANT: `time.sleep(5)` delays the archival of each URL by 5 seconds. This avoids flooding the API with requests, which sometimes causes the server to refuse the connection. A healthy gap between requests prevents that.
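
A minimal sketch of that throttling, under stated assumptions: the `https://web.archive.org/save/` endpoint is the public Save Page Now URL and may not be the one the script calls, and `save_to_wayback` and `archive_all` are hypothetical names. The `save` and `sleep` parameters exist only so the loop can be tried without network access.

```python
import time
import urllib.request

# Assumed endpoint (Save Page Now); the script itself may use a different one.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_to_wayback(url, timeout=60):
    """Ask the Wayback Machine to capture `url`; returns the HTTP status code."""
    request = urllib.request.Request(
        SAVE_ENDPOINT + url, headers={"User-Agent": "scrappy-example"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def archive_all(urls, delay=5, save=save_to_wayback, sleep=time.sleep):
    """Archive each URL in turn, pausing `delay` seconds between requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # pause between requests, not before the first one
        try:
            results[url] = save(url)
        except OSError as error:  # covers URLError, refused connections, timeouts
            results[url] = error  # record the failure and move on to the next URL
    return results
```

Catching the error per URL means one refused connection does not abort the rest of the batch. If refusals continue, raising `delay` is the first thing to try.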
