An open API service indexing awesome lists of open source software.

https://github.com/4rnv/scrappy

Script to scrap URLs from a webpage and archive them on the Wayback machine.
https://github.com/4rnv/scrappy

python scraper wayback-machine

Last synced: 2 months ago
JSON representation

Script to scrap URLs from a webpage and archive them on the Wayback machine.

Awesome Lists containing this project

README

        

A Python script to scrape URLs from a webpage and archive them to the Wayback machine. Uses Beautiful Soup to parse a page for anchor tags and then saves them using the Archive.org API. The name is supposed to be scrap.py as in scrapping plus python.

# Usage

Clone or ZIP this repo. Install the modules mentioned in `requirements.txt` using `pip install -r requirements.txt`. Then run the script in your terminal and follow the screen instructions.

IMPORTANT: `time.sleep(5)` delays archival of each URL for 5 seconds. This is to avoid overloading the API with excess requests, due to which sometimes the server refuses the connection. A healthy gap between each request prevents that.