Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/MarcoDiFrancesco/InternetArchiveUpdater
Save web pages in Internet Archive on schedule using Heroku free tier
https://github.com/MarcoDiFrancesco/InternetArchiveUpdater
python
Last synced: 3 months ago
JSON representation
Save web pages in Internet Archive on schedule using Heroku free tier
- Host: GitHub
- URL: https://github.com/MarcoDiFrancesco/InternetArchiveUpdater
- Owner: MarcoDiFrancesco
- Created: 2020-08-08T10:15:51.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2020-10-26T00:12:19.000Z (about 4 years ago)
- Last Synced: 2024-06-28T10:35:39.320Z (5 months ago)
- Topics: python
- Language: Python
- Homepage:
- Size: 13.7 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Internet Archive Updater
Save your webisites in [archive.org](http://web.archive.org/) repeatedly using the free tier of [Heroku](https://heroku.com/).
![https://i.imgur.com/Bxmb9S1.png](https://i.imgur.com/Bxmb9S1.png)
## How to install it
**Clone** this repository:
```Bash
git clone https://github.com/MarcoDiFrancesco/InternetArchiveUpdater
```**Change the URLs** you want Internet Archive to save in [links.txt](/links.txt). For example:
```txt
https://example.com/
https://www.facebook.com/
https://www.youtube.com/example
```**Change how often to update** your web page in [clock.py](/clock.py). Default is 12 minutes:
```Python
@sched.scheduled_job("interval", minutes=12)
^^^^^^^^^^
OR
@sched.scheduled_job("interval", hours=3)
^^^^^^^
OR
@sched.scheduled_job("interval", days=1)
^^^^^^
```**Set up an Heroku application**. If you are not familiar on how to run Python applications with Heroku follow this step-by-step YouTube tutorial by Andres Sevilla: [https://youtu.be/Ven-pqwk3ec](https://youtu.be/Ven-pqwk3ec?t=189) (start at 3:09)
Once the application is deployed in your `Heroku Dashboard` -> `Resources` you should see the `clock` task, you can enable it to run your application:
![https://i.imgur.com/AhI1wUO.png](https://i.imgur.com/AhI1wUO.png)
**Now your pages are being saved!**
# Internet Archive Website Uploader
Save your entire webisite in Internet Archive with only the domanin link given.
For example given the link in [scraper.py](/scraper.py):
```Plaintext
https://github.com/MarcoDiFrancesco
```the program will follow all links inside the page to save pages like:
![https://i.imgur.com/s4aswN1.png](https://i.imgur.com/s4aswN1.png)
```Plaintext
https://github.com/MarcoDiFrancesco/DoexercisesTools
https://github.com/MarcoDiFrancesco/HomeAutomation
...
```and will follow all links inside those pages:
```Plaintext
https://github.com/MarcoDiFrancesco/DoexercisesTools/blob/master/README.md
https://github.com/MarcoDiFrancesco/DoexercisesTools/blob/master/getMarks.py
...
```and so on!
## Limitations
Currently in web.archive.org it's possible to upload 20093 pages per day. The program can anyway stay up for months without ever stopping.