https://github.com/machawk1/use
Experiments in examining the use of web archives.
https://github.com/machawk1/use
Last synced: about 2 months ago
JSON representation
Experiments in examining the use of web archives.
- Host: GitHub
- URL: https://github.com/machawk1/use
- Owner: machawk1
- Created: 2020-12-09T19:09:24.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2020-12-09T19:09:53.000Z (over 4 years ago)
- Last Synced: 2025-03-24T09:45:21.021Z (about 2 months ago)
- Size: 4.79 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# use
These are some experiments being conducted by Jess Ogden, Shawn Walker and Ed
Summers to examine the use of web archives on the web. We are looking for the
traces of web archives in a web archive (CommonCrawl). If that sounds
confusingly meta then you are understanding correctly :)At the moment `load.py` downloads all the WAT files for a given CommonCrawl
snapshot and looks for links to known web archives, and writes out a CSV of
data about those links including:* source_url: the url linking to the web archive
* source_host: the host name of the source url
* archive_url: the URL of a web archive resource
* archive_service: the archive service, e.g. InternetArchive, ArchiveToday, etc
* link_text: the text of the hypertext link
* path: the CSS selector path to the link in the source page
* link_count: the total number of hyperlinks on the page
* warc: the CommonCrawl WARC file where the response is stored
* offset: the byte offset into the WARC file where the response is
* inflated_length: the inflated length of the response
* deflated_length: the compressed length of the response## Install
git clone https://github.com/edsu/use.git
cd use pip
install -r requirements.txt## Run
You give `load.py` the snapshot ID of a [CommonCrawl] dataset. For
example:./load.py CC-MAIN-2020-45
Wait a looooong time. Look at load.log to see what's happening then when
it's done you will have a CSV file:CC-MAIN-2020-45.csv
[CommonCrawl]: https://commoncrawl.org/
[CommonCrawl dataset]: https://commoncrawl.org/the-data/get-started/