https://github.com/airborne-commando/link-extractor-and-archive
A link extractor and archive tool that uses archive.ph as its archiving service; best suited to simple, barebones sites.
- Host: GitHub
- URL: https://github.com/airborne-commando/link-extractor-and-archive
- Owner: airborne-commando
- License: gpl-2.0
- Created: 2024-10-27T02:28:04.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-22T01:15:59.000Z (over 1 year ago)
- Last Synced: 2025-05-17T18:09:20.338Z (12 months ago)
- Topics: archive, cli, gui-python, python, terminal, webarchive, webarchiving
- Language: Python
- Homepage:
- Size: 70.3 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# link extractor and archive
Both a GUI and a CLI variant of this script are included.
To run either one, clone this repo or download a release from the release page, then run the following in a virtual environment on Linux:
pip install -r requirments.txt
Once everything is installed, run the CLI with
python extractor.py --weburl [weburl]
or the GUI with
python extractor-gui.py
# Features of the GUI not present in CLI
* Save results as JSON
* Log
* Cleaner extraction
* Exclusion of URLs from extraction based on user input
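
The exclusion and JSON-save features could look roughly like the sketch below. It is an illustration only, not the code in `extractor-gui.py`: the function names `filter_links` and `links_to_json`, the substring-matching rule, and the `{"links": [...]}` layout are all assumptions.

```python
import json


def filter_links(links, exclude_terms):
    """Drop every link that contains one of the user-supplied substrings.

    Hypothetical helper: the real GUI may match exclusions differently.
    """
    return [
        link for link in links
        if not any(term in link for term in exclude_terms)
    ]


def links_to_json(links):
    """Serialize the links to a JSON string (the layout is an assumption)."""
    return json.dumps({"links": links}, indent=2)


extracted = [
    "https://www.spacejam.com/1996/cmp/sitemap.html",
    "https://policies.warnerbros.com/privacy/",
    "https://www.spacejam.com/1996/cmp/bball/bballframes.html",
]

# Keep only the Space Jam pages by excluding the policy links.
kept = filter_links(extracted, ["policies.warnerbros.com"])
print(links_to_json(kept))
```

Writing the returned string to a file gives a saved list that can be curated later before archiving.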
Example output from the Space Jam site:
https://www.spacejam.com/1996/cmp/pressbox/pressboxframes.html
https://www.spacejam.com/1996/cmp/jamcentral/jamcentralframes.html
https://www.spacejam.com/1996/cmp/bball/bballframes.html
https://www.spacejam.com/1996/cmp/tunes/tunesframes.html
https://www.spacejam.com/1996/cmp/lineup/lineupframes.html
https://www.spacejam.com/1996/cmp/jump/jumpframes.html
https://www.spacejam.com/1996/cmp/junior/juniorframes.html
https://shop.looneytunes.com/spacejam96?utm_source=SpaceJam1996&utm_medium=Website&utm_campaign=Theatrical2021
https://www.spacejam.com/1996/cmp/souvenirs/souvenirsframes.html
https://www.spacejam.com/1996/cmp/sitemap.html
https://www.spacejam.com/1996/cmp/behind/behindframes.html
https://policies.warnerbros.com/privacy/
http://policies.warnerbros.com/terms/en-us/
http://policies.warnerbros.com/terms/en-us/#accessibility
https://policies.warnerbros.com/privacy/en-us/#adchoices
Before the cleaner extraction, the output used to be broken up into the base URL and relative paths:
https://www.spacejam.com/1996/
cmp/pressbox/pressboxframes.html
cmp/jamcentral/jamcentralframes.html
cmp/bball/bballframes.html
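
One way to get full URLs instead of broken-up fragments is to resolve each `href` against the page URL. The sketch below uses only the Python standard library and is not necessarily how `extractor.py` works; `LinkCollector` and `extract_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href targets and resolves them against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                # urljoin leaves absolute URLs alone and completes relative ones.
                absolute = urljoin(self.base_url, value)
                if absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the de-duplicated absolute links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


sample = (
    '<a href="cmp/pressbox/pressboxframes.html">Press Box</a>'
    '<a href="https://policies.warnerbros.com/privacy/">Privacy</a>'
)
for link in extract_links(sample, "https://www.spacejam.com/1996/"):
    print(link)
```

Because the base URL ends with a slash, the relative path is appended to `https://www.spacejam.com/1996/` rather than replacing its last segment.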
Be sure you have tkinter installed on your system.

# Archive Tool
python archive.py --file
Be sure you have a links.txt, and that it is curated to contain only what you want archived on archive.ph.
You may edit the delay between archival requests: look for `time.sleep` in `archive.py`. The default is 10 seconds, but you may change it to something longer.
The tool uses archive.ph as the archive service for everything, because the Wayback Machine will rate limit.
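
The batch behaviour described above (read a list of links, submit each one, pause between submissions) can be sketched as follows. This is a hedged illustration rather than the contents of `archive.py`: `read_links`, `archive_all`, and the `submit` callable are hypothetical, and the actual request to archive.ph is deliberately left out.

```python
import time


def read_links(path):
    """Return the non-empty, stripped lines of a links file such as links.txt."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def archive_all(links, submit, delay_seconds=10, sleep=time.sleep):
    """Call submit(link) for each link, pausing between submissions.

    `submit` stands in for whatever sends the link to the archive service.
    """
    results = []
    for index, link in enumerate(links):
        results.append((link, submit(link)))
        # No need to wait after the final link.
        if index < len(links) - 1:
            sleep(delay_seconds)
    return results


def fake_submit(link):
    # Placeholder so the sketch runs without touching the network.
    return "submitted"


print(archive_all(["https://www.spacejam.com/1996/"], fake_submit, delay_seconds=0))
```

Raising `delay_seconds` has the same effect as lengthening the `time.sleep` value mentioned above: fewer requests per minute.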
For a single link, use
python archive.py --url
# Archive GUI tool

Pretty self-explanatory: it performs the same functions as the archive tool above.
Feel free to try this on the website spacejam:
https://www.spacejam.com/1996/