{"id":28376964,"url":"https://github.com/bellingcat/auto-archiver-api","last_synced_at":"2025-06-26T18:32:18.121Z","repository":{"id":278589166,"uuid":"604649223","full_name":"bellingcat/auto-archiver-api","owner":"bellingcat","description":"API to manage users/sheets/URLs and call the auto-archiver in dedicated workers. ","archived":false,"fork":false,"pushed_at":"2025-05-21T02:41:00.000Z","size":1323,"stargazers_count":5,"open_issues_count":9,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-05-30T00:42:02.281Z","etag":null,"topics":["celery","digital-preservation","fastapi","web-archiving"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bellingcat.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-02-21T14:03:46.000Z","updated_at":"2025-05-30T00:13:15.000Z","dependencies_parsed_at":"2025-02-20T16:26:39.136Z","dependency_job_id":"6e30cea5-39ee-4c56-aead-d5d9e2027dc1","html_url":"https://github.com/bellingcat/auto-archiver-api","commit_stats":null,"previous_names":["bellingcat/auto-archiver-api"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/bellingcat/auto-archiver-api","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bellingcat%2Fauto-archiver-api","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bellingcat%2Fauto-archiver-api/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bellingcat%2Fauto-archiver
-api/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bellingcat%2Fauto-archiver-api/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bellingcat","download_url":"https://codeload.github.com/bellingcat/auto-archiver-api/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bellingcat%2Fauto-archiver-api/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262122864,"owners_count":23262486,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["celery","digital-preservation","fastapi","web-archiving"],"created_at":"2025-05-30T00:38:47.474Z","updated_at":"2025-06-26T18:32:18.043Z","avatar_url":"https://github.com/bellingcat.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Auto Archiver API\n\n[![CI](https://github.com/bellingcat/auto-archiver-api/workflows/CI/badge.svg)](https://github.com/bellingcat/auto-archiver-api/actions/workflows/ci.yaml)\n\nA web API that uses celery workers to process URL archive requests via [bellingcat/auto-archiver](https://github.com/bellingcat/auto-archiver), it allows authentication via Google OAuth Apps and enables CORS, everything runs on docker.\n\n![image](https://github.com/user-attachments/assets/905d697d-b83e-437b-87d1-cc86d3c8d8bf)\n\n## setup\nTo properly set up the API you need to install `docker` and to have these files, see more on the sections below:\n1. a `.env.prod` and `.env.dev` to configure the API, stays at the root level\n2. 
a `user-groups.yaml` to manage user permissions\n  1. note that all local files referenced in `user-groups.yaml` and any orchestration.yaml files should be relative to the home directory, so if your service account is in `secrets/orchestration.yaml` use that path and not just `orchestration.yaml`.\n  2. go through the example file and configure it according to your needs.\n3. you will need to create and reference at least one `secrets/orchestration.yaml` file; you can do so by following the instructions in the [auto-archiver](https://github.com/bellingcat/auto-archiver#installation), which automatically generates one for you. If you use the archive sheets feature you will also need to create an `orchestrationsheets-sheets.yaml` file that has the `gsheet_feeder_db` feeder and database enabled and configured; the auto-archiver has [extensive documentation](https://auto-archiver.readthedocs.io/en/latest/) on how to set this up.\n\nDo not commit those files; they are .gitignored by default.\nWe also advise you to keep any sensitive files in the `secrets/` folder which is pinned and gitignored.\n\nWe have examples for both of those files (`.env.example` and `user-groups.example.yaml`), and here's how to set them up whether you're in development or production:\n\n### setup for DEVELOPMENT\n```bash\n# copy and modify the .env.dev file according to your needs\ncp .env.example .env.dev\n# copy the user-groups.example.yaml and modify it accordingly\ncp user-groups.example.yaml user-groups.dev.yaml\n# run the APP, make sure VPNs are off\nmake dev\n# check it's running by calling the health endpoint\ncurl 'http://localhost:8004/health'\n# \u003e {\"status\":\"ok\"}\n```\nnow go to http://localhost:8004/docs#/ and you should see the API documentation\n\n### setup for PRODUCTION\n```bash\n# copy and modify the .env.prod file according to your needs\ncp .env.example .env.prod\n# copy the user-groups.example.yaml and modify it accordingly\ncp user-groups.example.yaml 
user-groups.yaml\n# deploy the app\nmake prod\n# check it's running by calling the health endpoint\ncurl 'http://localhost:8004/health'\n# \u003e {\"status\":\"ok\"}\n```\nnow go to http://localhost:8004/docs#/ and you should see the API documentation\n\n## User, Domains, Groups, and permissions management\nthere are 2 ways to access the API\n1. via an API token which has full control/privileges to archive/search\n2. via a Google Auth token which goes through the user access model\n\n#### User access model\nThe permissions are defined solely via the `user-groups.yaml` file\n- users belong to groups which determine their access level/quotas/orchestration setup\n  - users are assigned to groups explicitly (via email)\n  - users are assigned to groups implicitly (via email domains) as domains can be associated to groups\n  - users that are not explicitly or implicitly in the system belong to the `default` group, restrict their permissions if you do not wish them to be able to search/archive\n  - if a user is assigned to one group which is not explicitly defined, a warning will be thrown, it may be necessary to do that if you discontinue a given group but the database still has entries for it and so\n- groups determine\n  - which orchestrator to use for single URL archives and for spreadsheet archives see [GroupPermissions](app/shared/user_groups.py)\n  - a set of permissions\n    - `read` can be [`all`], [] or a comma separated list of group names, meaning people in this group can access either all, none, or those belonging to explicitly listed groups.\n      - the group itself must be included in the list, otherwise the user cannot search archives of that group\n    - `read_public` a boolean that enables the user to search public archives\n    - `archive_url` a boolean that enables the user to archive links in this group\n    - `archive_sheet` a boolean that enables the user to archive spreadsheets\n    - `manually_trigger_sheet` a boolean that enables the user to 
manually trigger a sheet archive for sheets in this group\n    - `sheet_frequency` a list of options for the sheet archiving frequency, currently max permissions is `[\"hourly\", \"daily\"]`\n    - `max_sheets` defines the maximum amount of spreadsheets someone can have in total (`-1` means no limit)\n    - `max_archive_lifespan_months` defines the lifespan of an archive before being deleted from S3, users will be notified 1 month in advance with instructions to download TODO\n    - `max_monthly_urls` how many total URLs someone can archive per month (`-1` means no limit)\n    - `max_monthly_mbs` how many MBs of data someone can archive per month (`-1` means no limit)\n    - `priority` one of `high` or `low`, this will be used to give archiving priority\n  - group names are all lower-case\n\n\n## development of web/worker without docker\n\n\u003c!-- * `pipenv install --editable ../../auto-archiver` --\u003e\nWe advise you to use `make prod` but you can also spin up redis and run the API (uvicorn) and worker (celery) individually like so:\n* console 1 - `make dev-redis-only` to spin up redis, turn off any VPNs\n* console 2 - `export ENVIRONMENT_FILE=.env.dev` then `poetry run celery --app=app.worker.main.celery worker --loglevel=debug --logfile=/aa-api/logs/celery.log -Q high_priority,low_priority --concurrency=1`\n  * or with watchdog for dev auto-reload `watchmedo auto-restart --patterns=\"*.py\" --recursive --ignore-directories -- celery -- --app=app.worker.main.celery worker --loglevel=debug --logfile=/aa-api/logs/celery.log -Q high_priority,low_priority --concurrency=1`\n* console 3 - `export ENVIRONMENT_FILE=.env.dev` then `poetry run uvicorn main:app --host 0.0.0.0 --reload`\n\n\n## Database migrations\ncheck https://alembic.sqlalchemy.org/en/latest/tutorial.html#the-migration-environment\n```bash\n# set the env variables\nexport ENVIRONMENT_FILE=.env.alembic\n# create a new migration with description in app/migrations\npoetry run alembic revision -m \"create 
account table\"\n# perform all migrations\npoetry run alembic upgrade head\n# downgrade by one migration\npoetry run alembic downgrade -1\n```\n\n## Release\nUpdate the version in [config.py](app/web/config.py)\n\nMake sure environment and user-groups files are up to date.\n\nThen `make prod`.\n\n\n## Development\n```bash\n# make sure all development dependencies are installed\npoetry install --with dev\n\n# this project uses pre-commit to enforce code style and formatting, set that up locally\npoetry run pre-commit install\n\n# you can test pre-commit with\npoetry run pre-commit run --all-files\n\n# this means pre-commit will always run with git commit, to skip it use\ngit commit --no-verify\n\n# see the Makefile for more commands, but linting and formatting can be done with\nmake lint\n\n# run all tests\nmake test\n```\n\n### Testing\n```bash\n# set the testing environment variables\nexport ENVIRONMENT_FILE=.env.test\n# run tests and generate coverage\npoetry run coverage run -m pytest -vv --disable-warnings --color=yes app/tests/\n# get coverage report in command line\npoetry run coverage report\n# get coverage report in HTML format\npoetry run coverage html\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbellingcat%2Fauto-archiver-api","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbellingcat%2Fauto-archiver-api","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbellingcat%2Fauto-archiver-api/lists"}