{"id":18906780,"url":"https://github.com/oduwsdl/aiu","last_synced_at":"2025-12-14T04:02:02.416Z","repository":{"id":52216263,"uuid":"132295077","full_name":"oduwsdl/aiu","owner":"oduwsdl","description":"A library for interacting with web archive collections at Archive-It, Trove, Pandora, and more.","archived":false,"fork":false,"pushed_at":"2021-11-05T19:26:13.000Z","size":104361,"stargazers_count":8,"open_issues_count":6,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-11T17:32:34.106Z","etag":null,"topics":["archiveit","metadata","metadata-extraction","webarchives"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oduwsdl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2018-05-06T00:41:09.000Z","updated_at":"2024-04-09T12:35:10.000Z","dependencies_parsed_at":"2022-08-24T20:00:13.368Z","dependency_job_id":null,"html_url":"https://github.com/oduwsdl/aiu","commit_stats":null,"previous_names":["oduwsdl/archiveit_utilities"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oduwsdl%2Faiu","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oduwsdl%2Faiu/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oduwsdl%2Faiu/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oduwsdl%2Faiu/manifests","owner_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/owners/oduwsdl","download_url":"https://codeload.github.com/oduwsdl/aiu/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249006462,"owners_count":21197280,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["archiveit","metadata","metadata-extraction","webarchives"],"created_at":"2024-11-08T09:18:42.004Z","updated_at":"2025-12-14T04:01:52.461Z","avatar_url":"https://github.com/oduwsdl.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Build Status](https://travis-ci.org/oduwsdl/aiu.svg?branch=master)](https://travis-ci.org/oduwsdl/aiu)\n\n# AIU\n\nAIU is a Python library for extracting information from web archive collections. The work is done through different classes, each specific to a different web archive collection host. Each class performs screen-scraping and API analysis (if available) in order to acquire general collection metadata, seed lists, and seed metadata.\n\n## Installation\n\nThis package requires Python 3 and is called `aiu` on PyPI. 
Installation is handled via `pip`:\n\n`pip install aiu`\n\n## Using the `ArchiveItCollection` class\n\nThe class named `ArchiveItCollection` has many methods for extracting information about an Archive-It collection using its collection identifier.\n\nFor example, to use iPython to get information about Archive-It collection number 5728, one can execute the following:\n\n```\nIn [1]: from aiu import ArchiveItCollection\n\nIn [2]: aic = ArchiveItCollection(5728)\n\nIn [3]: aic.get_collection_name()\nOut[3]: 'Social Media'\n\nIn [4]: aic.get_collectedby()\nOut[4]: 'Willamette University'\n\nIn [5]: aic.get_description()\nOut[5]: 'Social media content created by Willamette University.'\n\nIn [6]: aic.get_collection_uri()\nOut[6]: 'https://archive-it.org/collections/5728'\n\nIn [7]: aic.get_archived_since()\nOut[7]: 'Apr, 2015'\n\nIn [8]: aic.is_private()\nOut[8]: False\n\nIn [9]: len(aic.list_seed_uris())\nOut[9]: 113\n\nIn [10]: aic.list_seed_uris()[0]\nOut[10]: 'http://blog.willamette.edu/mba/'\n\nIn [11]: seed_url = aic.list_seed_uris()[0]\n\nIn [12]: aic.get_seed_metadata(seed_url)\nOut[12]:\n{'collection_web_pages': [{'title': 'Willamette MBA Blog',  \n   'description': ['Blog for the Willamette University Atkinson Graduate School of Management']}]}\n\n```\n\nFrom this session we now know that the collection's name is _Social Media_, it was collected by _Willamette University_, it has been archived since _April 2015_, it is not private, and it has 113 seeds.\n\nExamine the source in `aiu/archiveit_collection.py` for a full list of methods to use with this class.\n\n## Using the `TroveCollection` class\n\nThe class named `TroveCollection` has many methods for extracting information about a [National Library of Australia (NLA)](https://www.nla.gov.au/) [Trove](https://trove.nla.gov.au/website) collection using its collection identifier. 
**Note: Because NLA has different collection policies than Archive-It, not all methods, or their outputs, are mirrored between `TroveCollection` and `ArchiveItCollection`.**\n\nFor example, to use iPython to get information about Trove collection number 13742, one can execute the following:\n\n```\nIn [1]: from aiu import TroveCollection\n\nIn [2]: tc = TroveCollection(13742)\n\nIn [3]: tc.get_collection_name()\nOut[3]: 'Iconic Australian Brands'\n\nIn [4]: tc.get_collectedby()\nOut[4]:\n{'National Library of Australia': 'http://www.nla.gov.au/',\n 'State Library of Queensland': 'http://www.slq.qld.gov.au/'}\n\nIn [5]: tc.get_archived_since()\nOut[5]: 'Feb 2000'\n\nIn [6]: tc.get_archived_until()\nOut[6]: 'Mar 2021'\n\nIn [7]: len(tc.list_seed_uris())\nOut[7]: 64\n\nIn [8]: tc.get_breadcrumbs()\nOut[8]: [0, 15023]\n\n```\n\nFrom this session we now know that the collection's name is _Iconic Australian Brands_, it was collected by the _National Library of Australia_ and the _State Library of Queensland_, it has been archived since _Feb 2000_ and contains mementos up to _Mar 2021_, it has 64 seeds, and it is a subcollection of the collections with identifiers 0 and 15023 -- the breadcrumbs that lead to this collection.\n\nExamine the source in `aiu/trove_collection.py` for a full list of methods to use with this class.\n\n## Using the `PandoraCollection` class\n\nThe class named `PandoraCollection` has many methods for extracting information about a [National Library of Australia (NLA)](https://www.nla.gov.au/) [Pandora](http://pandora.nla.gov.au/) collection using its collection identifier. 
**Note: Because NLA has different collection policies than Archive-It, not all methods, or their outputs, are mirrored among `TroveCollection`, `ArchiveItCollection`, and `PandoraCollection`.**\n\nFor example, to use iPython to get information about Pandora collection number 12022, one can execute the following:\n\n```\nIn [1]: from aiu import PandoraCollection\n\nIn [2]: pc = PandoraCollection(12022)\n\nIn [3]: pc.get_collection_name()\nOut[3]: 'Fact sheets (Victoria. EPA Victoria) - Australian Internet Sites'\n\nIn [4]: pc.get_title_pages()\nOut[4]:\n{'136318': ('https://webarchive.nla.gov.au/tep/136318', 'Air'),\n '136347': ('https://webarchive.nla.gov.au/tep/136347',\n  'How to reduce noise from your business'),\n '136317': ('https://webarchive.nla.gov.au/tep/136317', 'Land'),\n '136346': ('https://webarchive.nla.gov.au/tep/136346', 'Landfill gas'),\n '136314': ('https://webarchive.nla.gov.au/tep/136314', 'Litter'),\n '136316': ('https://webarchive.nla.gov.au/tep/136316',\n  'Noise (EPA fact sheet)'),\n '136319': ('https://webarchive.nla.gov.au/tep/136319', 'Odour'),\n '136312': ('https://webarchive.nla.gov.au/tep/136312', 'Waste'),\n '136313': ('https://webarchive.nla.gov.au/tep/136313', 'Water')}\n\nIn [5]: len(pc.list_memento_urims())\nOut[5]: 10\n\nIn [6]: pc.list_seed_uris()\nOut[6]:\n['http://www.epa.vic.gov.au/~/media/Publications/1465.pdf',\n 'http://www.epa.vic.gov.au/~/media/Publications/1481.pdf',\n 'http://www.epa.vic.gov.au/~/media/Publications/1466.pdf',\n 'http://www.epa.vic.gov.au/~/media/Publications/1479.pdf',\n 'http://www.epa.vic.gov.au/~/media/Publications/1486%201.pdf',\n 'http://www.epa.vic.gov.au/~/media/Publications/1467.pdf',\n 'http://www.epa.vic.gov.au/~/media/Publications/1468.pdf',\n 'http://www.epa.vic.gov.au/~/media/Publications/1469.pdf',\n 'http://www.epa.vic.gov.au/~/media/Publications/1470.pdf']\n\nIn [7]: pc.get_collectedby()\nOut[7]: {'State Library of Victoria': 'http://www.slv.vic.gov.au/'}\n\n```\n\nExamine the source 
in `aiu/pandora_collection.py` for a full list of methods to use with this class.\n\n## Using the `PandoraSubject` class\n\nThe class named `PandoraSubject` has many methods for extracting information about a [National Library of Australia (NLA)](https://www.nla.gov.au/) [Pandora](http://pandora.nla.gov.au/) subject using its subject identifier. **Note: Because NLA has different collection policies than Archive-It, not all methods, or their outputs, are mirrored among `TroveCollection`, `ArchiveItCollection`, `PandoraCollection`, and `PandoraSubject`.**\n\nFor example, to use iPython to get information about Pandora subject number 83, one can execute the following:\n\n```\nIn [1]: from aiu import PandoraSubject\n\nIn [2]: ps = PandoraSubject(83)\n\nIn [3]: ps.get_subject_name()\nOut[3]: 'Humanities'\n\nIn [4]: len(ps.get_title_pages())\nOut[4]: 71\n\nIn [5]: len(ps.list_memento_urims())\nOut[5]: 246\n\nIn [6]: len(ps.list_seed_uris())\nOut[6]: 71\n\nIn [7]: ps.subject_id\nOut[7]: '83'\n\nIn [8]: ps.get_collectedby()\nOut[8]:\n{'National Library of Australia': 'http://www.nla.gov.au/',\n 'Australian Institute of Aboriginal and Torres Strait Islander Studies': 'http://www.aiatsis.gov.au',\n 'State Library of New South Wales': 'http://www.sl.nsw.gov.au/',\n 'State Library of Victoria': 'http://www.slv.vic.gov.au/',\n 'State Library of Western Australia': 'http://www.slwa.wa.gov.au/',\n 'State Library of South Australia': 'http://www.slsa.sa.gov.au/'}\n\nIn [9]: ps.list_subcategories()\nOut[9]: ['84', '85', '86']\n\n```\n\nExamine the source in `aiu/pandora_collection.py` for a full list of methods to use with this class.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foduwsdl%2Faiu","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foduwsdl%2Faiu","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foduwsdl%2Faiu/lists"}