{"id":13721926,"url":"https://github.com/archivesunleashed/docker-aut","last_synced_at":"2025-04-27T11:33:00.922Z","repository":{"id":5942639,"uuid":"54291459","full_name":"archivesunleashed/docker-aut","owner":"archivesunleashed","description":"Docker image for the Archives Unleashed Toolkit","archived":false,"fork":false,"pushed_at":"2022-11-17T02:17:18.000Z","size":930,"stargazers_count":12,"open_issues_count":0,"forks_count":2,"subscribers_count":9,"default_branch":"main","last_synced_at":"2024-08-23T20:17:15.345Z","etag":null,"topics":["archives-unleashed","aut","docker","docker-image","spark","webarchives"],"latest_commit_sha":null,"homepage":"https://archivesunleashed.org/","language":"Dockerfile","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/archivesunleashed.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-03-19T23:18:35.000Z","updated_at":"2022-11-21T13:45:36.000Z","dependencies_parsed_at":"2023-01-11T17:01:51.498Z","dependency_job_id":null,"html_url":"https://github.com/archivesunleashed/docker-aut","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/archivesunleashed%2Fdocker-aut","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/archivesunleashed%2Fdocker-aut/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/archivesunleashed%2Fdocker-aut/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/archivesunleashed%2Fdocker-aut/manifests","owner_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/owners/archivesunleashed","download_url":"https://codeload.github.com/archivesunleashed/docker-aut/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224069554,"owners_count":17250454,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["archives-unleashed","aut","docker","docker-image","spark","webarchives"],"created_at":"2024-08-03T01:01:22.838Z","updated_at":"2024-11-11T07:51:32.213Z","avatar_url":"https://github.com/archivesunleashed.png","language":"Dockerfile","funding_links":[],"categories":["Training/Documentation"],"sub_categories":[],"readme":"# docker-aut\n[![LICENSE](https://img.shields.io/badge/license-Apache-blue.svg?style=flat-square)](./LICENSE)\n[![Contribution Guidelines](http://img.shields.io/badge/CONTRIBUTING-Guidelines-blue.svg)](./CONTRIBUTING.md)\n\n## Attention\n\nThe `main` branch aligns with the `main` branch of [The Archives Unleashed Toolkit](https://github.com/archivesunleashed/aut). It can be unstable at times. Stable [branches](https://github.com/archivesunleashed/docker-aut/branches) are available for each [AUT release](https://github.com/archivesunleashed/aut/releases).\n\n## Introduction\n\nThis is the Docker image for [Archives Unleashed Toolkit](https://github.com/archivesunleashed/aut). [AUT](https://github.com/archivesunleashed/aut) documentation can be found [here](https://aut.docs.archivesunleashed.org/docs/home). 
If you need a hand installing Docker, check out our [Docker Install Instructions](https://github.com/archivesunleashed/docker-aut/wiki/Docker-Install-Instructions), and if you want a quick tutorial, check out our [Toolkit Lesson](https://aut.docs.archivesunleashed.org/docs/toolkit-walkthrough).\n\nThe Archives Unleashed Toolkit is part of the broader [Archives Unleashed Project](http://archivesunleashed.org/).\n\n## Requirements\n\nInstall the following dependencies:\n\n1. [Docker](https://www.docker.com/get-docker)\n\n## Use\n\n### Build and Run\n\nYou can build and run this Docker image locally with the following steps:\n\n1. `git clone https://github.com/archivesunleashed/docker-aut.git`\n2. `cd docker-aut`\n3. `docker build -t aut .`\n4. `docker run --rm -it aut`\n\n### Overrides\n\nYou can add any Spark flags to the run command if you need to.\n\n```\ndocker run --rm -it aut /spark/bin/spark-shell --packages \"io.archivesunleashed:aut:1.2.1-SNAPSHOT\" --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s\n```\n\nOnce the shell starts, you should see:\n\n```bash\nWARNING: An illegal reflective access operation has occurred\nWARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)\nWARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\nWARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\nWARNING: All illegal access operations will be denied in a future release\n21/11/01 17:27:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\nUsing Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\nSetting default log level to \"WARN\".\nTo adjust logging level use sc.setLogLevel(newLevel). 
For SparkR, use setLogLevel(newLevel).\nSpark context Web UI available at http://5f477f5dcab5:4040\nSpark context available as 'sc' (master = local[*], app id = local-1635787667490).\nSpark session available as 'spark'.\nWelcome to\n      ____              __\n     / __/__  ___ _____/ /__\n    _\\ \\/ _ \\/ _ `/ __/  '_/\n   /___/ .__/\\_,_/_/ /_/\\_\\   version 3.1.1\n      /_/\n         \nUsing Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.13)\nType in expressions to have them evaluated.\nType :help for more information.\n\nscala\u003e \n```\n\n### PySpark\n\nIt is also possible to start an interactive PySpark console. This requires specifying Python bindings and the `aut` package, both of which are included in the Docker image under `/aut/target`.\n\nTo launch an interactive PySpark console:\n\n```\ndocker run --rm -it aut /spark/bin/pyspark --py-files /aut/target/aut.zip --jars /aut/target/aut-1.2.1-SNAPSHOT-fatjar.jar\n```\n\nOnce the console starts, you should see:\n\n```bash\nPython 3.9.2 (default, Feb 28 2021, 17:03:44) \n[GCC 10.2.1 20210110] on linux\nType \"help\", \"copyright\", \"credits\" or \"license\" for more information.\nWARNING: An illegal reflective access operation has occurred\nWARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)\nWARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\nWARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\nWARNING: All illegal access operations will be denied in a future release\n21/11/01 17:41:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\nUsing Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\nSetting default log level to \"WARN\".\nTo adjust logging level use sc.setLogLevel(newLevel). 
For SparkR, use setLogLevel(newLevel).\nWelcome to\n      ____              __\n     / __/__  ___ _____/ /__\n    _\\ \\/ _ \\/ _ `/ __/  '_/\n   /__ / .__/\\_,_/_/ /_/\\_\\   version 3.1.1\n      /_/\n\nUsing Python version 3.9.2 (default, Feb 28 2021 17:03:44)\nSpark context Web UI available at http://d03127085be4:4040\nSpark context available as 'sc' (master = local[*], app id = local-1635788517329).\nSparkSession available as 'spark'.\n\u003e\u003e\u003e \n```\n\n## Example\n\n\n### Spark Shell (Scala)\n\nWhen the image is running, you will be brought to the Spark Shell interface. Try running the following command.\n\nType\n\n```\n:paste\n```\n\nAnd then paste the following script in:\n\n```scala\nimport io.archivesunleashed._\n\nRecordLoader.loadArchives(\"/aut-resources/Sample-Data/*.gz\", sc).webgraph().show(10)\n```\n\nPress Ctrl+D in order to execute the script. You should then see the following:\n\n```\n+--------------+--------------------+--------------------+------+               \n|    crawl_date|                 src|                dest|anchor|\n+--------------+--------------------+--------------------+------+\n|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |\n|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |\n|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |\n|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |\n|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |\n|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |\n|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |\n|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |\n|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |\n|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |\n+--------------+--------------------+--------------------+------+\nonly showing top 10 rows\n```\n\nIn this case, things are working! 
Try substituting your own data (mounted using the command above).\n\nTo quit Spark Shell, you can exit using \u003ckbd\u003eCTRL\u003c/kbd\u003e+\u003ckbd\u003ec\u003c/kbd\u003e.\n\n### PySpark\n\nWhen the image is running, you will be brought to the PySpark interface. Try running the following commands:\n\n```python\nfrom aut import *\nWebArchive(sc, sqlContext, \"/aut-resources/Sample-Data/*.gz\").webgraph().show(10)\n```\n\nYou should then see the following:\n\n```\n+--------------+--------------------+--------------------+------+               \n|    crawl_date|                 src|                dest|anchor|\n+--------------+--------------------+--------------------+------+\n|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |\n|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |\n|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |\n|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |\n|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |\n|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |\n|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |\n|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |\n|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |\n|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |\n+--------------+--------------------+--------------------+------+\nonly showing top 10 rows\n```\n\nIn this case, things are working! 
Try substituting your own data (mounted using the command above).\n\nTo quit the PySpark console, you can exit using \u003ckbd\u003eCTRL\u003c/kbd\u003e+\u003ckbd\u003ec\u003c/kbd\u003e.\n\n## Resources\n\nThis build also includes the [aut resources](https://github.com/archivesunleashed/aut-resources) repository, which contains NER libraries as well as sample data from the University of Toronto (located in `/aut-resources`).\n\nThe ARC and WARC files are drawn from the [Canadian Political Parties \u0026 Political Interest Groups Archive-It Collection](https://archive-it.org/collections/227), collected by the University of Toronto. We are grateful that they've provided this material to us.\n\nIf you use their material, please cite it along the following lines:\n\n- University of Toronto Libraries, Canadian Political Parties and Interest Groups, Archive-It Collection 227, Canadian Action Party, http://wayback.archive-it.org/227/20051.2.191340/http://canadianactionparty.ca/Default2.asp\n\nYou can find more information about this collection at [WebArchives.ca](http://webarchives.ca/about).\n\n## Acknowledgements\n\nThis work is primarily supported by the [Andrew W. Mellon Foundation](https://uwaterloo.ca/arts/news/multidisciplinary-project-will-help-historians-unlock). Additional funding for the Toolkit has come from the U.S. National Science Foundation, Columbia University Library's Mellon-funded Web Archiving Incentive Award, the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, and the Ontario Ministry of Research and Innovation's Early Researcher Award program. 
Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farchivesunleashed%2Fdocker-aut","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farchivesunleashed%2Fdocker-aut","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farchivesunleashed%2Fdocker-aut/lists"}