{"id":49191449,"url":"https://github.com/opencitations/meta_prov_fixer","last_synced_at":"2026-04-23T07:01:10.050Z","repository":{"id":302153025,"uuid":"1009056861","full_name":"opencitations/meta_prov_fixer","owner":"opencitations","description":"A toolkit to detect and fix issues in the OpenCitations Meta provenance dataset.","archived":false,"fork":false,"pushed_at":"2026-03-16T15:12:44.000Z","size":4413,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-17T01:48:25.601Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"isc","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/opencitations.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-06-26T14:03:55.000Z","updated_at":"2026-03-16T15:11:57.000Z","dependencies_parsed_at":"2025-08-06T10:16:56.055Z","dependency_job_id":"7a647205-9934-421b-805a-162076ef51f1","html_url":"https://github.com/opencitations/meta_prov_fixer","commit_stats":null,"previous_names":["eliarizzetto/meta_prov_fixer"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/opencitations/meta_prov_fixer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opencitations%2Fmeta_prov_fixer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opencitations%2Fmeta_prov_fixer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposi
tories/opencitations%2Fmeta_prov_fixer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opencitations%2Fmeta_prov_fixer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/opencitations","download_url":"https://codeload.github.com/opencitations/meta_prov_fixer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opencitations%2Fmeta_prov_fixer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32169657,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-23T02:19:40.750Z","status":"ssl_error","status_checked_at":"2026-04-23T02:17:55.737Z","response_time":53,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-23T07:01:05.700Z","updated_at":"2026-04-23T07:01:10.039Z","avatar_url":"https://github.com/opencitations.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# meta-prov-fixer\n\n[![Run tests](https://github.com/opencitations/meta_prov_fixer/actions/workflows/test.yml/badge.svg)](https://github.com/opencitations/meta_prov_fixer/actions/workflows/test.yml)\n[![Coverage](https://byob.yarr.is/opencitations/meta_prov_fixer/coverage)](https://opencitations.github.io/meta_prov_fixer/)\n![Python 
versions](https://img.shields.io/badge/python-3.11%20%7C%203.12%20%7C%203.13-blue)\n[![License: ISC](https://img.shields.io/badge/license-ISC-green.svg)](LICENSE)\n\nA toolkit to detect and fix issues in the OpenCitations Meta provenance dataset.\n\nThis repository provides a set of fixers that detect issues by reading local RDF dump files and apply corrective updates to both the triplestore and the RDF files. The pipeline coordinates the fixers and supports checkpointing and logging.\n\n## Features\n\n- Pipeline orchestration (ordered fixers with checkpointing and progress bar)\n- Multiple fixers implemented:\n  - `FillerFixer` — remove filler snapshots and rename/adjust remaining snapshots\n  - `DateTimeFixer` — normalize ill-formed datetime values (make them offset-aware with a consistent format and remove microseconds)\n  - `MissingPrimSourceFixer` — add primary source quads for creation snapshots missing them\n  - `MultiPAFixer` — normalize snapshots with multiple `prov:wasAttributedTo` values\n  - `MultiObjectFixer` — reset graphs where snapshots have too many objects for single-valued properties (creating a new creation snapshot)\n\n## Requirements\n\nThe project uses Python \u003e=3.11 and manages dependencies with [UV](https://docs.astral.sh/uv/) (see `pyproject.toml`). Key runtime dependencies are:\n\n- rdflib\n- SPARQLWrapper\n- tqdm\n- tzdata\n- docker\n\nMake sure you have UV installed. Then, install dependencies with:\n\n```bash\nuv sync\n```\n\n## Quick usage\n\nThe main CLI entrypoint is `meta_prov_fixer/main.py`. It accepts the following options (brief):\n\n- `-e`, `--endpoint` **(required)** — SPARQL endpoint URL.\n- `-i`, `--data-dir` **(required)** — Path to the directory containing the RDF files to process.\n- `-o`, `--out-dir` **(required)** — Directory where fixed files will be written. 
If this is the same as `--data-dir` and `--overwrite-ok` is not set, an error will be raised.\n- `-m`, `--meta-dumps` **(required)** — Path to a JSON file with a list of `[date, URL]` pairs; the loader validates the structure (see \"Input format for `--meta-dumps`\" below).\n- `--chunk-size` — Number of detected issues included in each SPARQL update (default: `100`).\n- `--failed-queries-fp` — File path where failed SPARQL update queries are logged (default: `prov_fix_failed_queries_\u003cYYYY-MM-DD\u003e.txt`).\n- `-l`, `--log-fp` — File path for the run log (default: `provenance_fix_\u003cYYYY-MM-DD\u003e.log`).\n- `--overwrite-ok` — Allow overwriting input files when `--out-dir` equals `--data-dir` and the input files are decompressed `.json` (default: not set).\n- `--checkpoint-fp` — Path for the checkpoint file used to resume a run (default: `fix_prov.checkpoint.json`).\n- `--cache-fp` — Path for the issues cache file (default: `filler_issues.cache.json`).\n- `--dry-run-db` — Skip SPARQL updates to the endpoint (default: `False`). Useful for testing or when you only want to write fixed files.\n- `--dry-run-files` — Skip writing fixed files to `--out-dir` (default: `False`). Useful when you only want to update the database.\n- `--dry-run-issues-dir` — Directory in which issues found during a dry run are written as JSON-Lines files (default: `None`). 
Works with `--dry-run-db`.\n- `--dry-run-process-id` — Optional identifier for parallel execution (e.g., directory name like 'br', 'ar') to create unique filenames (default: `None`).\n\n### Example\n\nDetect issues from RDF files, fix them on the triplestore, and write corrected copies of the invalid files:\n\n```shell\nuv run python meta_prov_fixer/main.py -e http://localhost:8890/sparql/ -i \"../meta_prov/br\" -o \"../fixed/br\" -m meta_dumps.json\n```\n\n### Dry-run mode\n\nRun without updating the database (only write fixed files):\n\n```shell\nuv run python meta_prov_fixer/main.py -e http://localhost:8890/sparql/ -i \"../meta_prov/br\" -o \"../fixed/br\" -m meta_dumps.json --dry-run-db\n```\n\nRun in dry-run mode and log all detected issues to JSON-Lines files for analysis:\n\n```shell\nuv run python meta_prov_fixer/main.py -e http://localhost:8890/sparql/ -i \"../meta_prov/br\" -o \"../fixed/br\" -m meta_dumps.json --dry-run-db --dry-run-issues-dir \"issues_output\"\n```\n\n\u003c!-- For detailed documentation on dry-run mode, issues logging, and parallel execution, see [DRY_RUN_USAGE.md](DRY_RUN_USAGE.md). --\u003e\n\n## Input format for `--meta-dumps`\n\nThe `--meta-dumps` argument expects a JSON file containing a top-level array of two-item arrays (date and URL). 
Example (`meta_dumps.json`):\n\n```json\n[\n  [\"2022-12-19\", \"https://doi.org/10.6084/m9.figshare.21747536.v1\"],\n  [\"2022-12-20\", \"https://doi.org/10.6084/m9.figshare.21747536.v2\"],\n  [\"2023-02-15\", \"https://doi.org/10.6084/m9.figshare.21747536.v3\"],\n  [\"2023-06-28\", \"https://doi.org/10.6084/m9.figshare.21747536.v4\"],\n  [\"2023-10-26\", \"https://doi.org/10.6084/m9.figshare.21747536.v5\"],\n  [\"2024-04-06\", \"https://doi.org/10.6084/m9.figshare.21747536.v6\"],\n  [\"2024-06-17\", \"https://doi.org/10.6084/m9.figshare.21747536.v7\"],\n  [\"2025-02-02\", \"https://doi.org/10.6084/m9.figshare.21747536.v8\"],\n  [\"2025-06-06\", \"https://doi.org/10.5281/zenodo.15855112\"]\n]\n```\n\nThe date format must be ISO-style (YYYY-MM-DD). The CLI loader validates the structure and will raise an error for invalid files.\n\n## Output and logging\n\n- A log file is written to the path supplied with `-l/--log-fp` (default includes date in filename).\n- A checkpoint file (default: `fix_prov.checkpoint.json`) is used to resume the pipeline if interrupted. The pipeline clears the checkpoint after successful completion.\n\n## Developer notes\n\n- The pipeline uses a per-file and per-fixer checkpointing mechanism so long-running runs can be resumed after interruptions.\n- Dry-run mode with issues logging is supported via `--dry-run-db` and `--dry-run-issues-dir`. This allows you to process files, detect issues, write fixed files, and log all detected issues to JSON-Lines files without updating the SPARQL endpoint. \u003c!-- See [DRY_RUN_USAGE.md](DRY_RUN_USAGE.md) for detailed documentation. --\u003e\n- The `dry_run_callback` parameter in `src.fix_provenance_process()` allows custom callbacks for handling detected issues. 
The `meta_prov_fixer.dry_run_utils.create_dry_run_issues_callback()` function provides a ready-to-use callback that writes issues to JSON-Lines files with automatic chunking and parallel execution safety.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopencitations%2Fmeta_prov_fixer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopencitations%2Fmeta_prov_fixer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopencitations%2Fmeta_prov_fixer/lists"}