{"id":15818610,"url":"https://github.com/edsu/zhang-webarchiving","last_synced_at":"2026-01-08T09:42:29.017Z","repository":{"id":14735394,"uuid":"17456263","full_name":"edsu/zhang-webarchiving","owner":"edsu","description":"Notes for my talk about Web Archiving to Jane Zhang's Digital Curation class.","archived":false,"fork":false,"pushed_at":"2014-03-05T22:16:16.000Z","size":200,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-10-06T06:03:28.160Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":"Unmaintained","scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/edsu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-03-05T21:40:52.000Z","updated_at":"2018-12-21T00:09:48.000Z","dependencies_parsed_at":"2022-09-11T07:21:45.697Z","dependency_job_id":null,"html_url":"https://github.com/edsu/zhang-webarchiving","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edsu%2Fzhang-webarchiving","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edsu%2Fzhang-webarchiving/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edsu%2Fzhang-webarchiving/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edsu%2Fzhang-webarchiving/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/edsu","download_url":"https://codeload.github.com/edsu/zhang-webarchiving/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https
://github.com","kind":"github","repositories_count":246586036,"owners_count":20801028,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-05T06:03:42.233Z","updated_at":"2026-01-08T09:42:28.989Z","avatar_url":"https://github.com/edsu.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"A Brief Look at Web Archiving\n=============================\n\nNotes for Jane Zhang's Digital Curation class at Catholic University.\nMarch 5, 2014.\n\nHi\n--\n\n* How many people use social media, have a website, run a webserver?\n* Why do we love the Web? Why do we hate the Web?\n* The Web needs to be cared for, and it needs archivists.\n* [What is the archive, to archive, an archive?](http://blogs.loc.gov/digitalpreservation/2014/02/what-do-you-mean-by-archive-genres-of-usage-for-digital-preservers/)\n* Archivy: select, appraise, arrange, describe, preserve, make available\n* My background\n* NDF Talk: [Web as a Preservation Medium](http://inkdroid.org/journal/2013/11/26/the-web-as-a-preservation-medium/)\n\nWho Cares?\n----------\n\n* The Internet Archive and Library of Congress have got this covered, right?\n* [Supreme Court Opinions: Clicks That Lead Nowhere](http://www.nytimes.com/2013/09/24/us/politics/in-supreme-court-opinions-clicks-that-lead-nowhere.html)\n* [UK Conservative Party deletes links](http://www.theguardian.com/politics/2013/nov/13/conservative-party-archive-speeches-internet)\n\nHow much of the Web is archived?\n--------------------------------\n\n* Not a [solved 
problem](http://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives).\n* IA: 366 billion\n* IIPC: 75 billion\n* Google: 1T URLs \n* generous guesstimate: 44%\n\nDespair?\n--------\n\nEven if archivists in a particular country were to preserve every record generated throughout the land, they would still have only a sliver of a window into that country’s experience. But of course in practice, this record universum is substantially reduced through deliberate and inadvertent destruction by records creators and managers, leaving a sliver of a sliver from which archivists select what they will preserve. And they do not preserve much.\n\nThe archival record is best understood as a sliver of a sliver of a sliver of a window into process. It is a fragile thing, an enchanted thing, defined not by its connection to “reality”, but by its open-ended layerings of construction and reconstruction.\n\n-- Verne Harris - [The Archival Sliver](http://www.nyu.edu/classes/bkg/methods/harris.pdf)\n\nlower case \"p\" politics\n-----------------------\n\n* [Losing My Revolution](http://arxiv.org/abs/1209.3026)\n* [Wikileaks](http://www.wikileaks.org)\n* [NSA Files](http://www.theguardian.com/world/the-nsa-files+content/document)\n* [Manning Transcripts](https://pressfreedomfoundation.org/bradley-manning-transcripts)\n\nLibrary of Congress\n-------------------\n\n* team of 6 + InternetArchive\n* [selection](http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html)\n* [notification](http://www.loc.gov/webarchiving/notice_to_webmasters.html)\n* seed lists, scoping\n* quality control \n* embargo period\n* access!\n\nThe Wider Web\n-------------\n\n* [Internet Archive](http://archive.org)\n* [IIPC](http://netpreserve.org)\n* [perma.cc](http://perma.cc)\n* [ArchiveIt](http://archiveit.org)\n  * e.g. 
[Columbia University Human Rights Watch Archive](http://library.columbia.edu/locations/chrdr/hrwa.html)\n  * [2014 Winter Olympics](https://archive-it.org/collections/4200) selected with [UNT Nomination Tool](http://digital2.library.unt.edu/nomination/)\n* [Hanzo](http://www.hanzoarchives.com/)\n* [ArchiveTeam](http://archiveteam.org/)\n* [ArchiveSocial](http://archivesocial.com/)\n* [CommonCrawl](http://commoncrawl.org/)\n* [Social Feed Manager](https://github.com/gwu-libraries/social-feed-manager)\n\nNuts \u0026 Bolts\n------------\n\n* [1/10 Americans think HTML is an STD](http://www.latimes.com/business/technology/la-fi-tn-1-10-americans-html-std-study-finds-20140304,0,1188415.story)\n* 25 years old: HTTP, HTML, URL \n* robots.txt\n* [OpenWayback](https://github.com/iipc/openwayback)\n* [Heritrix](https://github.com/iipc/heritrix3)\n* [wget](https://www.gnu.org/software/wget/)\n* [pywb](https://github.com/ikreymer/pywb)\n* [WARC](http://www.iso.org/iso/catalogue_detail.htm?csnumber=44717): ISO 28500:2009. Example capture: `wget -H -Dwww.cua.edu -r -l 2 --warc-file=cua --convert-links=\"on\" http://lis.cua.edu/courses/index.cfm`\n* [Memento](http://tools.ietf.org/html/rfc7089) RFC 7089\n\nInterlude: Web Packages\n-----------------------\n\n* Demo Facebook and Twitter \"archive\" packages.\n* [Packaging on the Web](https://github.com/w3ctag/packaging-on-the-web)\n* [ResourceSync](http://www.niso.org/workrooms/resourcesync/)\n\nChallenges\n----------\n\n* scoping (backlinks)\n* streaming video / audio\n* dynamic content / ghost\n* funding (sustainable)\n* copyright\n* storage space\n* format migration\n* digital preservation significant characteristics?\n* collection development: seedlists, inventory\n* single point of failure (IA)\n\nWhat can you do?\n----------------\n\n* Big data is great, but start with [small data](https://www.ideals.illinois.edu/handle/2142/39750):\n  * your organization's web presence\n  * local blogs\n  * local government\n  * local arts scene / 
businesses\n* Website owners:\n  * Permalinks/Cool URIs\n  * robots.txt\n  * sitemaps\n* [Personal Digital Archiving](http://visions.indstate.edu/pda2014/)\n  * outreach with your community\n  * best practices / guidance\n* Keep an open mind.\n* Have a whole class about web archiving!\n\nLearn More\n----------\n\n* [ArchiveIt](https://archive-it.org/) webinars for libraries, archives, classes.\n* [More Podcast Less Process](http://keepingcollections.org/more-podcast-less-process/) - Episode #7 on Web Archiving - Alex Thurman (Columbia) and Lily Pregill (New York Art Resources Consortium)\n* [Web Curators Mailing List](http://netpreserve.org/web-curators-mailing-list)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedsu%2Fzhang-webarchiving","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fedsu%2Fzhang-webarchiving","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedsu%2Fzhang-webarchiving/lists"}