{"id":22727098,"url":"https://github.com/microbiomedata/refscan","last_synced_at":"2025-10-29T21:44:11.152Z","repository":{"id":248304168,"uuid":"828323072","full_name":"microbiomedata/refscan","owner":"microbiomedata","description":"Command-line program that scans NMDC database for referential integrity violations","archived":false,"fork":false,"pushed_at":"2025-04-09T20:58:14.000Z","size":366,"stargazers_count":2,"open_issues_count":5,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-09T21:40:12.448Z","etag":null,"topics":["linkml","mongodb","referential-integrity"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/refscan/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microbiomedata.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-13T19:26:32.000Z","updated_at":"2025-03-11T06:34:16.000Z","dependencies_parsed_at":"2024-09-12T18:26:23.508Z","dependency_job_id":"19eada64-ebc8-4f84-8883-b362da6fa1e8","html_url":"https://github.com/microbiomedata/refscan","commit_stats":null,"previous_names":["microbiomedata/refscan"],"tags_count":27,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microbiomedata%2Frefscan","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microbiomedata%2Frefscan/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microbiomedata%2Frefscan/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposi
tories/microbiomedata%2Frefscan/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microbiomedata","download_url":"https://codeload.github.com/microbiomedata/refscan/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248591973,"owners_count":21130156,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["linkml","mongodb","referential-integrity"],"created_at":"2024-12-10T17:09:29.978Z","updated_at":"2025-10-29T21:44:11.130Z","avatar_url":"https://github.com/microbiomedata.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# refscan\n\n`refscan` is a command-line tool people can use to **scan** the [NMDC](https://microbiomedata.org/) MongoDB database\nfor referential integrity violations.\n\n```mermaid\n%% This is the source code of a Mermaid diagram, which GitHub will render as a diagram.\n%% Note: PyPI does not render Mermaid diagrams, and instead displays their source code.\n%%       Reference: https://github.com/pypi/warehouse/issues/13083\ngraph LR\n    schema[LinkML\u003cbr\u003eschema]\n    database[(MongoDB\u003cbr\u003edatabase)]\n    script[[\"refscan\"]]\n    violations[\"List of\u003cbr\u003eviolations\"]\n    references[\"List of\u003cbr\u003ereferences\"]:::dashed_border\n    schema --\u003e script\n    database --\u003e script\n    script -.-\u003e references\n    script --\u003e violations\n    \n    classDef dashed_border stroke-dasharray: 5 5\n```\n\nIn addition to using refscan to scan the NMDC MongoDB database for referential integrity 
violations,\npeople can use `refscan` to generate **graphs** (diagrams) depicting which collections' documents (or which classes'\ninstances) can contain references to which _other_ collections' documents (or classes' instances) while still being\nschema compliant.\n\n\u003c!-- Note: We removed the hard-coded Table of Contents because—nowadays—GitHub automatically derives/presents one. --\u003e\n\n## How it works\n\nHere is a summary of how each of `refscan`'s main functions works under the hood.\n\n### Scan\n\n`refscan` does this in two stages:\n1. It uses the LinkML schema to determine where references _can_ exist in a MongoDB database that conforms to the schema.\n   \u003e **Example:** The schema might say that, if a document in the `biosample_set` collection has a field named\n   \u003e `associated_studies`, that field must contain a list of `id`s of documents in the `study_set` collection.\n2. It scans the MongoDB database to check the integrity of all the references that _do_ exist.\n   \u003e **Example:** For each document in the `biosample_set` collection that _has_ a field named `associated_studies`,\n   \u003e for each value in that field, confirm there _is_ a document having that `id` in the `study_set` collection.\n\n### Graph\n\n`refscan` does this in three stages:\n1. It uses the LinkML schema to determine where references _can_ exist in a MongoDB database that conforms to the schema.\n2. It formats that list of references into a data structure compatible with [`Cytoscape.js`](https://js.cytoscape.org/).\n3. 
It outputs an HTML document that uses `Cytoscape.js` to visualize that data structure as a graph.\n\n## Assumptions\n\n`refscan` was designed under the assumption that **every document** in **every collection described by the schema** has\na **field named `type`**, whose value is the [class_uri](https://linkml.io/linkml/code/metamodel.html#linkml_runtime.linkml_model.meta.ClassDefinition.class_uri) of the schema class the document represents an instance\nof. `refscan` uses that `class_uri` value (in that `type` field) to determine the _name_ of that schema class,\nwhose definition `refscan` then uses to determine _which fields_ of that document can contain references.\n\n## Usage\n\n### Install\n\nAssuming you have `pipx` installed, you can install the tool by running the following command:\n\n```shell\npipx install refscan\n```\n\n\u003e [`pipx`](https://pipx.pypa.io/stable/) is a tool people can use to\n\u003e [download and install](https://pipx.pypa.io/stable/#where-does-pipx-install-apps-from)\n\u003e Python scripts that are hosted on PyPI.\n\u003e You can [install `pipx`](https://pipx.pypa.io/stable/installation/) by running `$ python -m pip install pipx`.\n\n### Run\n\nOnce installed, you can display the tool's `--help` snippet by running:\n\n```shell\nrefscan --help\n```\n\nAt the time of this writing, the tool's `--help` snippet is:\n\n```console\n Usage: refscan [OPTIONS] COMMAND [ARGS]...\n\n╭─ Options ──────────────────────────────────────────────────────────────────────────────╮\n│ --help          Show this message and exit.                                            │\n╰────────────────────────────────────────────────────────────────────────────────────────╯\n╭─ Commands ─────────────────────────────────────────────────────────────────────────────╮\n│ version   Show version number and exit.                                                │\n│ scan      Scan the NMDC MongoDB database for referential integrity violations.         
│\n│ graph     Generate an interactive graph of the references described by a schema.       │\n╰────────────────────────────────────────────────────────────────────────────────────────╯\n```\n\n\u003c!-- Note: The above snippet was captured from a terminal window whose width was 90 characters. --\u003e\n\nEach command has its own `--help` snippet.\n\n#### The `scan` command\n\nAt the time of this writing, the `--help` snippet for the `scan` command is:\n\n```console\n Usage: refscan scan [OPTIONS]\n\n Scan the NMDC MongoDB database for referential integrity violations.\n\n╭─ Options ──────────────────────────────────────────────────────────────────────────────╮\n│ *  --schema                               FILE  Filesystem path at which the YAML file │\n│                                                 representing the schema is located.    │\n│                                                 [default: None]                        │\n│                                                 [required]                             │\n│    --database-name                        TEXT  Name of the database.                  │\n│                                                 [default: nmdc]                        │\n│    --mongo-uri                            TEXT  Connection string for accessing the    │\n│                                                 MongoDB server. 
If you have Docker     │\n│                                                 installed, you can spin up a temporary │\n│                                                 MongoDB server at the default URI by   │\n│                                                 running: $ docker run --rm --detach -p │\n│                                                 27017:27017 mongo                      │\n│                                                 [env var: MONGO_URI]                   │\n│                                                 [default: mongodb://localhost:27017]   │\n│    --verbose                                    Show verbose output.                   │\n│    --skip-source-collection,--skip        TEXT  Name of collection you do not want to  │\n│                                                 search for referring documents. Option │\n│                                                 can be used multiple times.            │\n│                                                 [default: None]                        │\n│    --reference-report                     FILE  Filesystem path at which you want the  │\n│                                                 program to generate its reference      │\n│                                                 report.                                │\n│                                                 [default: references.tsv]              │\n│    --violation-report                     FILE  Filesystem path at which you want the  │\n│                                                 program to generate its violation      │\n│                                                 report.                                │\n│                                                 [default: violations.tsv]              │\n│    --no-scan                                    Generate a reference report, but do    │\n│                                                 not scan the database for violations.  
│\n│    --locate-misplaced-documents                 For each referenced document not found │\n│                                                 in any of the collections the schema   │\n│                                                 allows, also search for it in all      │\n│                                                 other collections.                     │\n│    --help                                       Show this message and exit.            │\n╰────────────────────────────────────────────────────────────────────────────────────────╯\n```\n\n\u003c!-- Note: The above snippet was captured from a terminal window whose width was 90 characters. --\u003e\n\n##### The MongoDB connection string (`--mongo-uri`)\n\nAs documented in the `--help` snippet above, you can provide the MongoDB connection string to the tool via either\n(a) the `--mongo-uri` option; or (b) an environment variable named `MONGO_URI`. The latter can come in handy\nwhen the MongoDB connection string contains information you don't want to appear in your shell history,\nsuch as a password.\n\nHere's how you could create that environment variable:\n\n```shell  \nexport MONGO_URI='mongodb://username:password@localhost:27017'\n```\n\n##### The schema (`--schema`)\n\nAs documented in the `--help` snippet above, you can provide the path to a YAML-formatted LinkML schema file to the tool\nvia the `--schema` option.\n\n\u003cdetails\u003e\n\n\u003csummary\u003e\nShow/hide tips for getting a schema file\n\u003c/summary\u003e\n\n---\n\nIf you have `curl` installed, you can download a YAML file from GitHub by running the following command (after replacing\nthe `{...}` placeholders and customizing the path):\n\n```shell\n# Download the raw content of https://github.com/{user_or_org}/{repo}/blob/{branch}/path/to/schema.yaml\ncurl -o schema.yaml https://raw.githubusercontent.com/{user_or_org}/{repo}/{branch}/path/to/schema.yaml\n```\n\nFor example:\n\n```shell\n# Download the raw content of 
https://github.com/microbiomedata/nmdc-schema/blob/main/nmdc_schema/nmdc_materialized_patterns.yaml\ncurl -o schema.yaml https://raw.githubusercontent.com/microbiomedata/nmdc-schema/main/nmdc_schema/nmdc_materialized_patterns.yaml\n\n# Download the raw content of https://github.com/microbiomedata/nmdc-schema/blob/v11.2.1/nmdc_schema/nmdc_materialized_patterns.yaml\ncurl -o schema.yaml https://raw.githubusercontent.com/microbiomedata/nmdc-schema/v11.2.1/nmdc_schema/nmdc_materialized_patterns.yaml\n```\n\n---\n\u003c/details\u003e\n\n##### Output\n\nWhile `refscan` is running, it will display console output indicating what it's currently doing.\n\n![Screenshot of refscan console output](./docs/refscan-screenshot.png)\n\nOnce the scan is complete, the reference report (TSV file) and violation report (TSV file) will be available\nin the current directory (or in custom directories, if any were specified via CLI options).\n\n#### The `graph` command\n\nAt the time of this writing, the `--help` snippet for the `graph` command is:\n\n```console\n Usage: refscan graph [OPTIONS]\n\n Generate an interactive graph of the references described by a schema.\n\n╭─ Options ──────────────────────────────────────────────────────────────────────────────╮\n│ *  --schema         FILE                Filesystem path at which the YAML file         │\n│                                         representing the schema is located.            │\n│                                         [default: None]                                │\n│                                         [required]                                     │\n│    --graph          FILE                Filesystem path at which you want refscan to   │\n│                                         generate the graph.                            
│\n│                                         [default: graph.html]                          │\n│    --subject        [collection|class]  Whether you want each node of the graph to     │\n│                                         represent a collection or a class.             │\n│                                         [default: collection]                          │\n│    --verbose                            Show verbose output.                           │\n│    --help                               Show this message and exit.                    │\n╰────────────────────────────────────────────────────────────────────────────────────────╯\n```\n\n\u003c!-- Note: The above snippet was captured from a terminal window whose width was 90 characters. --\u003e\n\n### Update\n\nYou can update the tool to [the latest version available on PyPI](https://pypi.org/project/refscan/) by running:\n\n```shell\npipx upgrade refscan\n```\n\n### Uninstall\n\nYou can uninstall the tool from your computer by running:\n\n```shell\npipx uninstall refscan\n```\n\n## Development\n\nWe use [`uv`](https://docs.astral.sh/uv/) to both (a) manage dependencies and (b) build distributable packages that can be published to PyPI.\n\n- `pyproject.toml`: [Configuration file](https://packaging.python.org/en/latest/guides/writing-pyproject-toml/) for `uv` and other tools\n- `uv.lock`: List of dependencies, both direct and [indirect/transitive](https://en.wikipedia.org/wiki/Transitive_dependency)\n\n\u003e Note: We initialized this repository using Poetry. We switched from Poetry to `uv` at around commit `#1449ceca`.\n\n### Clone repository\n\n```shell\ngit clone https://github.com/microbiomedata/refscan.git\ncd refscan\n```\n\n### Set up Python virtual environment\n\nYou can set up a Python virtual environment by issuing the following command from the root directory of the repository:\n\n```shell\nuv sync\n```\n\nThat command will:\n1. 
**Create a Python virtual environment** at `.venv` (if one doesn't already exist there)\n2. **Install all dependencies** described in `uv.lock` into that Python virtual environment\n3. Uninstall all dependencies _not_ described in `uv.lock` from that Python virtual environment\n\n### Activate Python virtual environment\n\nNow that you have set up a Python virtual environment, you can activate it by issuing the following command:\n\n```shell\nsource .venv/bin/activate\n```\n\n\u003e Note: Once you're ready to _deactivate_ the Python virtual environment, you can do so by running `$ deactivate`.\n\n### Make changes\n\nEdit the tool's source code and documentation however you want.\n\nWhile editing the tool's source code, you can run the tool as you normally would in order to test things out.\n\n```shell\nuv run refscan --help\n```\n\n### Check types\n\nWe use [mypy](https://mypy.readthedocs.io/en/stable/) as the static type checker for `refscan`.\n\nYou can perform static type checking by running the following command from the root directory of the repository:\n\n```shell\nuv run mypy\n```\n\n### Run tests\n\nWe use [pytest](https://docs.pytest.org/en/8.2.x/) as the testing framework for `refscan`.\n\nTests are defined in the `tests` directory.\n\nYou can run the tests by running the following command from the root directory of the repository:\n\n```shell\nuv run pytest\n```\n\n### Format code\n\nWe use [`ruff`](https://docs.astral.sh/ruff/formatter/) as the code _formatter_ for `refscan`.\n\nWe mostly use it with its default rules. All of the ways we deviate from those are listed\nin the `[tool.ruff]` section of `pyproject.toml`.\n\nYou can _check_ the code's compliance with the \"formatter rules\" by running this command from the root directory of the repository:\n\n```shell\nuv run ruff format --check\n```\n\nThat will output a _list_ of files that don't comply. 
To see the violations themselves, you can run:\n\n```shell\nuv run ruff format --diff\n```\n\nYou can _format_ the code by omitting the `--check` and `--diff` flags:\n\n```shell\nuv run ruff format\n```\n\n### Lint code\n\nWe also use [`ruff`](https://docs.astral.sh/ruff/linter/) as the code _linter_ for `refscan`.\n\nWe use it with its [default rules](https://docs.astral.sh/ruff/rules/), **plus** some additional ones,\nall of which are listed in the `[tool.ruff.lint]` section of `pyproject.toml`.\n\nYou can _check_ the code's compliance with the \"linter rules\" by running this command from the root directory of the repository:\n\n```shell\nuv run ruff check\n```\n\n## Building and publishing\n\n### Build for production\n\nWhenever someone publishes a [GitHub Release](https://github.com/microbiomedata/refscan/releases) in this repository,\na [GitHub Actions workflow](.github/workflows/build-and-publish-package-to-pypi.yml)\nwill automatically build a package and publish it to [PyPI](https://pypi.org/project/refscan/).\nThat package will have a version identifier that matches the name of the Git tag associated with the Release.\n\n### Test the build process locally\n\nIn case you want to test the build process locally, you can do so by running:\n\n```shell\nuv build\n```\n\n\u003e That will create both a\n\u003e [source distribution](https://setuptools.pypa.io/en/latest/deprecated/distutils/sourcedist.html#creating-a-source-distribution)\n\u003e file (whose name ends with `.tar.gz`) and a\n\u003e [wheel](https://packaging.python.org/en/latest/specifications/binary-distribution-format/#binary-distribution-format)\n\u003e file (whose name ends with `.whl`) in the `dist` directory.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrobiomedata%2Frefscan","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrobiomedata%2Frefscan","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrobiomedata%2Frefscan/lists"}