https://github.com/4rnv/scrappy
Script to scrape URLs from a webpage and archive them on the Wayback Machine.
python scraper wayback-machine
- Host: GitHub
- URL: https://github.com/4rnv/scrappy
- Owner: 4rnv
- Created: 2024-04-04T20:45:18.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-04-04T20:56:37.000Z (about 1 year ago)
- Last Synced: 2025-01-18T01:21:47.571Z (4 months ago)
- Topics: python, scraper, wayback-machine
- Language: Python
- Homepage:
- Size: 1000 Bytes
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
A Python script to scrape URLs from a webpage and archive them to the Wayback Machine. It uses Beautiful Soup to parse a page for anchor tags and then saves the linked URLs using the Archive.org API. The name is meant to be read as scrap.py, as in scraping plus Python.
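The link-collection step described above can be sketched roughly as follows. This is not the script's own code: to stay dependency-free it swaps Beautiful Soup for the standard library's `html.parser`, and the names `AnchorCollector`, `extract_links`, and `wayback_save_url` are hypothetical. It also assumes that archiving goes through the Wayback Machine's `https://web.archive.org/save/` endpoint, which the README does not confirm.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: the public "Save Page Now" endpoint; the README only says "Archive.org API".
SAVE_ENDPOINT = "https://web.archive.org/save/"


class AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return unique absolute http(s) links in document order."""
    parser = AnchorCollector()
    parser.feed(html)
    seen = set()
    links = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        # Skip mailto:, javascript:, and other non-web schemes, plus repeats.
        if absolute.startswith(("http://", "https://")) and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


def wayback_save_url(url):
    """Build the request URL that would ask the Wayback Machine to archive `url`."""
    return SAVE_ENDPOINT + url
```

With Beautiful Soup installed, `extract_links` could instead iterate over `BeautifulSoup(html, "html.parser").find_all("a", href=True)`; the filtering and de-duplication would stay the same.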
# Usage
Clone this repo or download it as a ZIP. Install the modules listed in `requirements.txt` using `pip install -r requirements.txt`. Then run the script in your terminal and follow the on-screen instructions.
IMPORTANT: `time.sleep(5)` delays the archival of each URL by 5 seconds. This avoids flooding the API with requests, which can cause the server to refuse the connection. A healthy gap between requests prevents that.
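A minimal sketch of the throttling idea, under some assumptions: `archive_all` and its parameters are hypothetical names, `archive_one` stands in for whatever function actually submits a URL, and the pause is placed between requests rather than after every one. The `sleep` parameter exists only so the delay can be replaced in tests.

```python
import time


def archive_all(urls, archive_one, delay_seconds=5, sleep=time.sleep):
    """Call archive_one(url) for each URL, pausing between requests.

    Returns a dict mapping each URL to True (archived) or False (failed).
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        try:
            archive_one(url)
            results[url] = True
        except OSError:
            # Covers refused connections and similar network errors;
            # record the failure and keep going with the next URL.
            results[url] = False
    return results
```

Catching `OSError` keeps one refused connection from aborting the whole run, so failed URLs can be retried later.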