{"id":22599595,"url":"https://github.com/singularityhub/container-executable-discovery","last_synced_at":"2025-03-28T19:47:08.050Z","repository":{"id":92775651,"uuid":"569514070","full_name":"singularityhub/container-executable-discovery","owner":"singularityhub","description":"GitHub action to assist in creating a cache of container executables.","archived":false,"fork":false,"pushed_at":"2024-06-05T00:47:38.000Z","size":46,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-02-02T23:07:51.182Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/singularityhub.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-11-23T01:59:05.000Z","updated_at":"2024-12-20T05:30:05.000Z","dependencies_parsed_at":"2023-04-21T09:08:35.004Z","dependency_job_id":"192bb046-a7f0-4516-b02b-0fe62a3d0079","html_url":"https://github.com/singularityhub/container-executable-discovery","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/singularityhub%2Fcontainer-executable-discovery","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/singularityhub%2Fcontainer-executable-discovery/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/singularityhub%2Fcontainer-executable-discovery/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/singularityhub%2Fcontainer-executable-discovery/man
ifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/singularityhub","download_url":"https://codeload.github.com/singularityhub/container-executable-discovery/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246093100,"owners_count":20722395,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-08T11:10:34.239Z","updated_at":"2025-03-28T19:47:08.029Z","avatar_url":"https://github.com/singularityhub.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Container Executable Discovery\n\nThis is a GitHub action that discovers 🗺️ container executables! It is used by\nthe [shpc-registry-cache](https://github.com/singularityhub/shpc-registry-cache). \n\n\n## What does it do?\n\nYou can provide a listing of container resource identifiers (via a text file)\nand it will store namespaced (based on OCI or Docker registry)\nidentifiers from the repository root in your location of choice (defaults to\nyour repository root). In addition to the cache of json files with container\nexecutables that are discovered on the path, we save a `counts.json` \n(essentially a summary across counts) and `skips.json` (a cache of containers \nwhose filesystems could not be extracted and that we should not try again).\n\n## How does it work?\n\nYou will need to provide a text file with container URIs to check. An example\nis provided in the repository here [containers.txt](containers.txt). 
The idea\nwould be that you might dynamically generate this file from a resource (e.g., for\nthe shpc registry cache we derive this list from the [BioContainers](https://depot.galaxyproject.org/singularity/)\ndepot). Once you have the list, the action does the following:\n\n- We install [shpc](https://github.com/singularityhub/singularity-hpc) and the [guts software](https://github.com/singularityhub/guts)\n- We run the [update_biocontainers.py](scripts/update_biocontainers.py) script that:\n  - Parses the latest listing of containers from the [BioContainers Depot](https://depot.galaxyproject.org/singularity/)\n  - Generates a unique list of containers and latest (first appearing) tag [^1].\n  - Reads in the [skips.json](skips.json) - a cached list of containers that we skip because their guts were not extractable [^2].\n  - For every new identifier to add: \n    - Prepares a directory to store the new cache entry (a json file)\n    - Uses the [pipelib](https://vsoch.github.io/pipelib/getting_started/user-guide.html) software to sort tags and get the latest.\n    - Uses the guts [ManifestGenerator](https://singularityhub.github.io/guts/getting_started/user-guide.html#manifest) to retrieve a listing of paths and associated files within.\n    - Filters out known patterns that are not executables of interest.\n    - Writes this output of aliases to the filesystem under the container identifier as a json file.\n- After new aliases are added, [calculate_frequency.py](.github/scripts/calculate_frequency.py) is run to update global [counts.json](counts.json)\n\nThe result is alias-level data for each container, along with a global set of counts.\n\n[^1]: For the step that grabs the \"latest\" tag, since the container URI (without any tag) can be used to get a listing of all tags, it isn't important to get the latest tag exactly right - this can be easily obtained later in a workflow from the unique resource identifier without a tag.  
\n[^2]: There are several reasons for skipping a container. One is that the guts software is not able to extract every set of container guts to the filesystem. A container that attempts to extract particular locations, or that takes up too much space for the GitHub runner, will be skipped. Another reason is the pipelib software failing to filter a meaningful set of versioned tags and sort them (e.g., the listing comes back empty and there are no tags known to retrieve). In practice this is a small percentage of the total.\n\n\n### Singularity Registry HPC\n\nAs an example of the usage of this cache, we use these cache entries to populate \nthe [Singularity HPC Registry](https://github.com/singularityhub/shpc-registry).\nAt a high level, shpc-registry provides install configuration files for containers.\nDocker or other OCI registry containers are installed to an HPC system via module software,\nand to make this work really well, we need to know their aliases. This is where data from\nthe cache comes in! 
Specifically, for this use case, this means we:\n\n- Identify a new container, C, not in the registry from the executable cache here\n- Create a set of global executable counts, G\n- Define a set of counts from G in C as S\n- Rank order S from least to greatest\n- Include any entries in S that have a frequency \u003c 10\n- Include any entries in S that have any portion of the name matching the container identifier\n- Beyond those, add the next 10 executables with the lowest frequencies that are also \u003c 1,000\n\nThe frequencies are calculated across the cache here, included in [counts.json](counts.json).\nThis produces a container configuration file with a set of executables that are likely\nthe most distinctive for that container, based on data from the cache.\n\nTo learn more about Singularity Registry HPC you can:\n\n- 📖️ Read the [documentation](https://singularity-hpc.readthedocs.io/en/latest/) 📖️\n- ⭐️ Browse the [container module collection](https://singularityhub.github.io/shpc-registry/) ⭐️\n\n## Usage\n\nYou will minimally need a text file, with one container unique resource identifier (with or without a namespace) per line.\nSee [containers.txt](containers.txt) and [biocontainers.txt](biocontainers.txt) for examples. A table\nof variables for the action is shown below, along with example usage. The assumption is that you are\nrunning the action after having checked out the repository you want to store the cache in.\n\n### Variables\n\n| Name | Description | Required | Default |\n|------|-------------|----------|---------|\n| token | a `${{ secrets.GITHUB_TOKEN }}` to open a pull request with updates | true | unset |\n| root | Path of the cache root (defaults to PWD) | false | pwd |\n| listing | text file with listing of containers, one per line. 
| true | unset |\n| namespace | namespace to add to each container in the listing | false | unset |\n| org-letter-prefix | set to true to add a letter directory before the organization name (e.g., docker.io/l/library/ubuntu:latest) | false | false |\n| repo-letter-prefix | set to true to add a letter directory before the repository name (e.g., docker.io/library/u/ubuntu:latest) | false | false |\n| registry-letter-prefix | set to true to add a letter directory before the registry name (e.g., d/docker.io/library/ubuntu:latest) | false | false |\n| dry_run | don't push changes (dry run only) | false | false |\n| branch | branch to push to | false | main |\n\nAs an example of namespace, see the [biocontainers.txt](biocontainers.txt) file. We would\nwant to define namespace as \"quay.io/biocontainers\" in the action, as the text file only has\npartial names. For pushing, make sure your repository allows pushes from actions.\n\n\n### Examples\n\nHere is a \"vanilla\" example updating a container executable cache in the checked out\nrepository present working directory from the [containers.txt](containers.txt) file.\n\n```yaml\nname: Update Container Cache\non:\n  workflow_dispatch:\n  schedule:\n  - cron: 0 0 * * 3\n\njobs:\n  default-run:\n    name: Update Cache\n    runs-on: ubuntu-latest\n    steps:\n      - name: Checkout Repository\n        uses: actions/checkout@v3\n      - name: Update from Containers\n        uses: singularityhub/container-executable-discovery@main\n        with:\n          token: ${{ secrets.GITHUB_TOKEN }}\n          listing: containers.txt\n```\n\nThe remaining recipes assume you have the \"on\" and \"name\" directives (these are just jobs).\n\nDo the same, but for a dry run (no GitHub token required):\n\n\n```yaml\njobs:\n  dry-run:\n    name: Update Cache\n    runs-on: ubuntu-latest\n    steps:\n      - name: Checkout Repository\n        uses: actions/checkout@v3\n      - name: Update from Containers\n        uses: 
singularityhub/container-executable-discovery@main\n        with:\n          listing: containers.txt\n          dry_run: true\n```\n\nSet a namespace (e.g., as we'd need for [biocontainers.txt](biocontainers.txt))\n\n```yaml\njobs:\n  namespace:\n    name: Update Cache (Namespace)\n    runs-on: ubuntu-latest\n    steps:\n      - name: Checkout Repository\n        uses: actions/checkout@v3\n      - name: Update from Containers\n        uses: singularityhub/container-executable-discovery@main\n        with:\n          token: ${{ secrets.GITHUB_TOKEN }}\n          listing: biocontainers.txt\n          namespace: quay.io/biocontainers\n```\n\nSet an organization (the repository organization or username) prefix, e.g.,\nquay.io/vanessa/salad:latest would be stored under `quay.io/v/vanessa/salad:latest.json`.\n\n\n```yaml\njobs:\n  org-prefix:\n    name: Update Cache (Org Prefix)\n    runs-on: ubuntu-latest\n    steps:\n      - name: Checkout Repository\n        uses: actions/checkout@v3\n      - name: Update from Containers\n        uses: singularityhub/container-executable-discovery@main\n        with:\n          token: ${{ secrets.GITHUB_TOKEN }}\n          org-letter-prefix: true\n          listing: containers.txt\n```\nOr set a repository prefix, e.g., quay.io/vanessa/salad:latest would be stored under `quay.io/vanessa/s/salad:latest.json`:\n\n```yaml\njobs:\n  repo-prefix:\n    name: Update Cache (Repo Prefix)\n    runs-on: ubuntu-latest\n    steps:\n      - name: Checkout Repository\n        uses: actions/checkout@v3\n      - name: Update from Containers\n        uses: singularityhub/container-executable-discovery@main\n        with:\n          token: ${{ secrets.GITHUB_TOKEN }}\n          repo-letter-prefix: true\n          listing: containers.txt\n```\n\nFinally, set a registry prefix (more unlikely since there are few, but available)\ne.g., quay.io/vanessa/salad:latest would be stored under `q/quay.io/vanessa/salad:latest.json`:\n\n```yaml\njobs:\n  
registry-prefix:\n    name: Update Cache (Registry Prefix)\n    runs-on: ubuntu-latest\n    steps:\n      - name: Checkout Repository\n        uses: actions/checkout@v3\n      - name: Update from Containers\n        uses: singularityhub/container-executable-discovery@main\n        with:\n          token: ${{ secrets.GITHUB_TOKEN }}\n          registry-letter-prefix: true\n          listing: containers.txt\n```\n\nAnd that's it! If you have a dynamic listing of containers, you'll likely want to write a step\nbefore using the action to generate the file. \n\n### Assets Saved\n\nThe pull request will update or create (within the cache root):\n\n - a counts.json file with total counts across the cache\n - a skips.json to store as a cache of containers to skip\n - a namespaced hierarchy (according to your preferences), e.g., `quay.io/vanessa/salad:latest.json`, each a lookup dictionary with paths as keys, and binaries / assets discovered there as values.\n\nNote that we filter out patterns that are likely not executables. See the [scripts](scripts) folder to see this logic!\n\n## Container Discovery Library\n\nThe action is powered by a python library [container_discovery](lib) that is provided\nand installed alongside the action. Since this is primarily used here, we don't \npublish to pypi. If you want to install it for your own use:\n\n```bash\n$ git clone https://github.com/singularityhub/container-executable-discovery\n$ cd container-executable-discovery/lib\n$ pip install .\n```\n\nAnd then interact with the `container_discovery` module. You can look at \nexamples under [scripts](scripts) - this is how the action runs!\n\n## Contribution\n\nThis registry showcases a container executable cache, and specifically includes over 8K containers\nfrom BioContainers. If you would like to add another source of container identifiers contributions are \nvery much welcome! 
\n\n## License\n\nThis code is licensed under the MPL 2.0 [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsingularityhub%2Fcontainer-executable-discovery","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsingularityhub%2Fcontainer-executable-discovery","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsingularityhub%2Fcontainer-executable-discovery/lists"}