{"id":43171949,"url":"https://github.com/opencitations/oc_monitor","last_synced_at":"2026-02-01T02:35:40.538Z","repository":{"id":262824461,"uuid":"880782255","full_name":"opencitations/oc_monitor","owner":"opencitations","description":"View the latest results at https://ocmonitor.opencitations.net/.","archived":false,"fork":false,"pushed_at":"2026-01-26T04:19:27.000Z","size":557,"stargazers_count":0,"open_issues_count":0,"forks_count":2,"subscribers_count":3,"default_branch":"main","last_synced_at":"2026-01-26T19:28:10.944Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"isc","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/opencitations.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-10-30T10:53:24.000Z","updated_at":"2026-01-26T04:19:30.000Z","dependencies_parsed_at":"2024-11-14T14:21:26.770Z","dependency_job_id":"a718bd51-0aec-4c76-9bb4-e32171355c65","html_url":"https://github.com/opencitations/oc_monitor","commit_stats":null,"previous_names":["opencitations/oc_monitor"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/opencitations/oc_monitor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opencitations%2Foc_monitor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opencitations%2Foc_monitor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opencitations%2Foc_monitor/releases","manife
sts_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opencitations%2Foc_monitor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/opencitations","download_url":"https://codeload.github.com/opencitations/oc_monitor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opencitations%2Foc_monitor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28965430,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-01T02:14:24.993Z","status":"ssl_error","status_checked_at":"2026-02-01T02:13:55.706Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-02-01T02:35:39.996Z","updated_at":"2026-02-01T02:35:40.533Z","avatar_url":"https://github.com/opencitations.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SPARQL Data Quality Monitoring Tool\n\nThis software provides tools to monitor the quality of data in OpenCitations Meta and OpenCitations Index by querying their SPARQL endpoints. It includes two monitoring classes: **MetaMonitor** and **IndexMonitor**, which are designed to run a series of tests on the triplestores and generate reports on data quality issues.\n\nThe reports are generated in JSON format, which is then converted to HTML for easier visualization. 
\u003c!-- The tool supports customizable configurations via command-line arguments for flexibility. --\u003e\n\n## Table of Contents\n\n- [SPARQL Data Quality Monitoring Tool](#sparql-data-quality-monitoring-tool)\n  - [Table of Contents](#table-of-contents)\n  - [Overview](#overview)\n  - [Installation](#installation)\n  - [Usage](#usage)\n    - [Configuration Files](#configuration-files)\n    - [Command-Line Arguments](#command-line-arguments)\n    - [Examples](#examples)\n      - [Run Both MetaMonitor and IndexMonitor (Default)](#run-both-metamonitor-and-indexmonitor-default)\n      - [Run Only MetaMonitor](#run-only-metamonitor)\n      - [Custom Output Path](#custom-output-path)\n      - [Run Only IndexMonitor with Custom Paths](#run-only-indexmonitor-with-custom-paths)\n  - [Output Structure](#output-structure)\n  - [Filename Details](#filename-details)\n  - [License](#license)\n\n## Overview\n\nThe tool uses SPARQL queries to check for potential issues in the data stored in the specified SPARQL endpoints:\n\n- **MetaMonitor** is used for the **OpenCitations Meta** endpoint.\n- **IndexMonitor** is used for the **OpenCitations Index** endpoint.\n\nEach monitor runs a series of pre-configured tests defined in the related configuration file (in JSON format). The tests produce a report (JSON file) that details which issues are detected, including metadata such as runtime, whether the tests passed or failed, and any errors encountered during the process. The JSON report is then converted into HTML to be read more easily.\n\nTwo classes are responsible for interrogating the triplestores, the `MetaMonitor` class for OpenCitations Meta and the `IndexMonitor` class for OpenCitations Index, which can both be found inside the `data_monitor` module. They both require, upon instantiation, the path to the appropriate configuration file and the base path for the output files. 
In both classes, the `run_tests()` method interrogates the endpoint specified in the related config file and produces the JSON output.\n\nThe JSON output can be converted into an HTML page by using the `generate_html()` method of the `ReportVisualiser` class, inside the `html_vis` module.\n\n## Installation\n\n1. **Clone the repository**:\n\n    ```bash\n    git clone https://github.com/opencitations/oc_monitor.git\n    ```\n\n2. **Install dependencies**:\n\n    The project's dependencies and virtual environment are managed with [Poetry](https://python-poetry.org/docs/). If you already use Poetry and have it installed on your machine, you can create the virtual environment by running:\n\n    ```bash\n    poetry install\n    ```\n\n    and then, to activate it:\n\n    ```bash\n    poetry shell\n    ```\n\n    If you're not using Poetry, you can install the required Python libraries in your preferred environment with `pip` and the `requirements.txt` file:\n\n    ```bash\n    pip install -r requirements.txt\n    ```\n\n3. **Ensure proper configuration files**:\n\n   Make sure you have the necessary configuration files (e.g., `meta_monitor_config.json` and `index_monitor_config.json`) in the project folder. See the section on [Configuration Files](#configuration-files) for details. The configuration files provided in this repository should work out of the box.\n\n## Usage\n\nTo run the process with the default configuration:\n\n```bash\ncd oc_monitor\npython -m main\n```\n\n### Configuration Files\n\nThe configuration files for both MetaMonitor and IndexMonitor are in JSON format and contain details about the endpoint to be queried and the tests to run. The `endpoint` field stores the URL of the endpoint to interrogate. 
The fields for each test include:\n\n- `label`: A short name for the tested issue.\n- `description`: A brief description of the issue.\n- `query`: The SPARQL query used to perform the check.\n- `to_run`: A boolean flag (true or false) indicating whether to run this specific test.\n\nExample configuration (custom_meta_monitor_config.json):\n\n```json\n{\n    \"endpoint\": \"https://k8s.opencitations.net/meta/sparql\",\n    \"tests\": [\n        {\n            \"label\": \"duplicate_br\",\n            \"to_run\": true,\n            \"description\": \"A single value for a given external ID scheme (e.g. DOI value) is associated with more than one BR.\",\n            \"query\": \"PREFIX datacite: \u003chttp://purl.org/spar/datacite/\u003e\\nPREFIX literal: \u003chttp://www.essepuntato.it/2010/06/literalreification/\u003e\\nPREFIX fabio: \u003chttp://purl.org/spar/fabio/\u003e\\n\\nASK {\\n    ?br1 datacite:hasIdentifier/literal:hasLiteralValue ?lit ;\\n    a fabio:Expression .\\n    ?br2 datacite:hasIdentifier/literal:hasLiteralValue ?lit ;\\n    a fabio:Expression .\\n    FILTER(?br1 != ?br2)\\n}\"\n        }\n    ]\n}\n        \n```\n\n### Command-Line Arguments\n\nThe script allows users to further customise the default behaviour of the software via command-line arguments. Here are the available options:\n\n| Argument               | Description                                                                                                      | Default Value                          |\n|------------------------|------------------------------------------------------------------------------------------------------------------|----------------------------------------|\n| `--meta_config`         | Filepath for the MetaMonitor configuration file.                                                                 | `meta_monitor_config.json`             |\n| `--index_config`        | Filepath for the IndexMonitor configuration file.                                           
                     | `index_monitor_config.json`            |\n| `--run`                | Specify which monitor to run: `meta`, `index`, or `both`.                                                         | `both`                                 |\n| `--output_base_path`    | Base folder for reports output. The folder structure follows `monitor_results/\u003cmeta_reports\\|index_reports\u003e/\u003cYYYYMMDD\u003e/`.  | `monitor_results`                              |\n\n### Examples\n\n#### Run Both MetaMonitor and IndexMonitor (Default)\n\nTo run both monitors using default configurations and output paths:\n\n```bash\ncd oc_monitor\npython -m main\n```\n\nThis will generate reports in:\n\n- `monitor_results/meta_reports/YYYYMMDD/`\n- `monitor_results/index_reports/YYYYMMDD/`\n\n#### Run Only MetaMonitor\n\nTo run only the MetaMonitor and specify a custom configuration file:\n\n```bash\ncd oc_monitor\npython -m main --run meta --meta_config my_meta_config.json\n```\n\n#### Custom Output Path\n\nTo run both monitors but specify a custom output base path:\n\n```bash\ncd oc_monitor\npython -m main --output_base_path /my/custom/path\n```\n\nThe reports will be saved in:\n\n- `/my/custom/path/meta_reports/YYYYMMDD/`\n- `/my/custom/path/index_reports/YYYYMMDD/`\n\n#### Run Only IndexMonitor with Custom Paths\n\nTo run only the IndexMonitor and specify both the configuration file and a custom output path:\n\n```bash\ncd oc_monitor\npython -m main --run index --index_config custom_index_config.json --output_base_path custom_reports\n```\n\nThe reports will be saved in `custom_reports/index_reports/YYYYMMDD/`.\n\n## Output Structure\n\nThe output reports are stored in a folder structure that follows this pattern:\n\n```bash\nmonitor_results/\n  ├── meta_reports/\n  │    └── YYYYMMDD/\n  │         ├── output_meta_monitor_YYYYMMDD.json\n  │         └── meta_monitor_vis_YYYYMMDD.html\n  └── index_reports/\n       └── YYYYMMDD/\n            ├── output_index_monitor_YYYYMMDD.json\n            
└── index_monitor_vis_YYYYMMDD.html\n```\n\nThe JSON output file stores information about the test results along with details on the execution process (associated configuration file, date and time of the execution, runtime, raised errors, etc.). Each test result in the output file is associated with the label and description of the issue and the SPARQL query that has been run for the test itself.\n\nExample JSON output (output_meta_monitor_20241020.json):\n\n```json\n{\n    \"endpoint\": \"https://k8s.opencitations.net/meta/sparql\",\n    \"collection\": \"OpenCitations Meta\",\n    \"datetime\": \"20/10/2024, 17:29:10\",\n    \"running_time\": 1.0028636455535889,\n    \"config_fp\": \"custom_meta_monitor_config.json\",\n    \"monitoring_results\": [\n        {\n            \"label\": \"duplicate_br\",\n            \"description\": \"A single value for a given external ID scheme (e.g. DOI value) is associated with more than one BR.\",\n            \"query\": \"PREFIX datacite: \u003chttp://purl.org/spar/datacite/\u003e\\nPREFIX literal: \u003chttp://www.essepuntato.it/2010/06/literalreification/\u003e\\nPREFIX fabio: \u003chttp://purl.org/spar/fabio/\u003e\\n\\nASK {\\n    ?br1 datacite:hasIdentifier/literal:hasLiteralValue ?lit ;\\n    a fabio:Expression .\\n    ?br2 datacite:hasIdentifier/literal:hasLiteralValue ?lit ;\\n    a fabio:Expression .\\n    FILTER(?br1 != ?br2)\\n}\",\n            \"run\": {\n                \"got_result\": true,\n                \"running_time\": 1.0028636455535889,\n                \"error\": null\n            },\n            \"passed\": false\n        }\n    ]\n}\n```\n\nThe above JSON report is then converted into an HTML document (although some information is left out, e.g. 
the SPARQL query for each test) and stored in the same directory:\n\n![alt text](visual.png)\n\n## Filename Details\n\n- JSON report: `output_\u003cmonitor_type\u003e_monitor_YYYYMMDD.json`\n- HTML report: `\u003cmonitor_type\u003e_monitor_vis_YYYYMMDD.html`\n\nHere `\u003cmonitor_type\u003e` is `meta` or `index`.\n\nIf the script is run multiple times on the same day, files created after the first run get a numeric suffix (e.g., `output_meta_monitor_YYYYMMDD_1.json`, `meta_monitor_vis_YYYYMMDD_1.html`, etc.).\n\n## License\n\nThis project is licensed under the ISC License. See the [LICENSE.md](LICENSE.md) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopencitations%2Foc_monitor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopencitations%2Foc_monitor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopencitations%2Foc_monitor/lists"}