{"id":19710868,"url":"https://github.com/vemonet/csvw-ontomap","last_synced_at":"2026-06-07T22:33:58.043Z","repository":{"id":210910635,"uuid":"727744618","full_name":"vemonet/csvw-ontomap","owner":"vemonet","description":"🗺️ ️Generate CSVW metadata for tabular data files, and map columns to terms in a given OWL ontology using semantic search","archived":false,"fork":false,"pushed_at":"2024-03-05T13:01:23.000Z","size":122,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-02T08:23:19.280Z","etag":null,"topics":["csvw","data-extraction","linked-data","ontology-mapping","owl-ontology"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vemonet.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-12-05T13:43:30.000Z","updated_at":"2025-11-03T13:53:27.000Z","dependencies_parsed_at":"2025-02-27T16:26:09.416Z","dependency_job_id":"74a81030-4423-4651-8c13-9b331429c77b","html_url":"https://github.com/vemonet/csvw-ontomap","commit_stats":{"total_commits":9,"total_committers":1,"mean_commits":9.0,"dds":0.0,"last_synced_commit":"e728cfc17f53672dc89257818c0ee432347ddbb9"},"previous_names":["vemonet/csvw-ontomap"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/vemonet/csvw-ontomap","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vemonet%2Fcsvw-ontomap","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vem
onet%2Fcsvw-ontomap/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vemonet%2Fcsvw-ontomap/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vemonet%2Fcsvw-ontomap/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vemonet","download_url":"https://codeload.github.com/vemonet/csvw-ontomap/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vemonet%2Fcsvw-ontomap/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34041087,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-07T02:00:07.652Z","response_time":124,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csvw","data-extraction","linked-data","ontology-mapping","owl-ontology"],"created_at":"2024-11-11T22:08:42.739Z","updated_at":"2026-06-07T22:33:58.028Z","avatar_url":"https://github.com/vemonet.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# 🔎 CSVW OntoMap 🗺️\n\n\u003c!-- [![PyPI - Version](https://img.shields.io/pypi/v/csvw-ontomap.svg?logo=pypi\u0026label=PyPI\u0026logoColor=silver)](https://pypi.org/project/csvw-ontomap/)\n[![PyPI - Python 
Version](https://img.shields.io/pypi/pyversions/csvw-ontomap.svg?logo=python\u0026label=Python\u0026logoColor=silver)](https://pypi.org/project/csvw-ontomap/)\n[![license](https://img.shields.io/pypi/l/csvw-ontomap.svg?color=%2334D058)](https://github.com/vemonet/csvw-ontomap/blob/main/LICENSE.txt)\n[![Publish package](https://github.com/vemonet/csvw-ontomap/actions/workflows/publish.yml/badge.svg)](https://github.com/vemonet/csvw-ontomap/actions/workflows/publish.yml) --\u003e\n\n[![Test package](https://github.com/vemonet/csvw-ontomap/actions/workflows/test.yml/badge.svg)](https://github.com/vemonet/csvw-ontomap/actions/workflows/test.yml)\n\n[![Hatch project](https://img.shields.io/badge/%F0%9F%A5%9A-Hatch-4051b5.svg)](https://github.com/pypa/hatch) [![linting - Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/charliermarsh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff) [![code style - Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) [![types - Mypy](https://img.shields.io/badge/types-Mypy-blue.svg)](https://github.com/python/mypy)\n\n\u003c/div\u003e\n\nAutomatically generate descriptive [CSVW](https://csvw.org) (CSV on the Web) metadata for tabular data files:\n\n- **Extract column datatypes**: detect whether they are categorical, and which values are accepted, using [`ydata-profiling`](https://github.com/ydataai/ydata-profiling).\n- **Ontology mappings**: when provided with a URL to an OWL ontology, text embeddings are generated for all classes and properties and stored in a local [Qdrant](https://github.com/qdrant/qdrant) vector database. Similarity search is then used to match each data column to the most relevant ontology terms.\n- Currently supports: CSV, Excel, SPSS files. 
Any format that can be loaded into a Pandas DataFrame could easily be added; create an issue on GitHub to request a new format.\n    - Processed files need to contain a single sheet; if a file contains multiple sheets, only the first one will be processed.\n\n\u003e [!WARNING]\n\u003e\n\u003e The library does not yet check whether the VectorDB has been fully loaded. It skips loading if there are at least 2 vectors in the DB. If you stop the loading process halfway through, you will need to delete the VectorDB folder to make sure the library runs the ontology loading again.\n\n## 📦️ Installation\n\nThis package requires Python \u003e=3.8. Install it with:\n\n```bash\npip install git+https://github.com/vemonet/csvw-ontomap.git\n```\n\n## 🪄 Usage\n\n### ⌨️ Use as a command-line interface\n\nYou can use the package from your terminal after installing `csvw-ontomap` with pip:\n\n```bash\ncsvw-ontomap tests/resources/*.csv\n```\n\nStore the CSVW metadata report as JSON-LD in a file:\n\n```bash\ncsvw-ontomap tests/resources/*.csv -o csvw-report.json\n```\n\nStore the CSVW metadata report as a CSV file:\n\n```bash\ncsvw-ontomap tests/resources/*.csv -o csvw-report.csv\n```\n\nProvide the URL to an OWL ontology that will be used to map the column names:\n\n```bash\ncsvw-ontomap tests/resources/*.csv -m https://semanticscience.org/ontology/sio.owl\n```\n\nSpecify the path to store the vectors (default is `data/vectordb`):\n\n```bash\ncsvw-ontomap tests/resources/*.csv -m https://semanticscience.org/ontology/sio.owl -d data/vectordb\n```\n\n### 🐍 Use with Python\n\nUse this package in Python scripts:\n\n```python\nfrom csvw_ontomap import CsvwProfiler, OntomapConfig\nimport json\n\nprofiler = CsvwProfiler(\n    ontologies=[\"https://semanticscience.org/ontology/sio.owl\"],\n    vectordb_path=\"data/vectordb\",\n    config=OntomapConfig(       # Optional\n        comment_best_matches=3, # Add the ontology matches as comment\n        search_threshold=0,     # Between 0 and 
1\n    ),\n)\ncsvw_report = profiler.profile_files([\n    \"tests/resources/*.csv\",\n    \"tests/resources/*.xlsx\",\n    \"tests/resources/*.spss\",\n])\nprint(json.dumps(csvw_report, indent=2))\n```\n\n## 🧑‍💻 Development setup\n\nThis final section of the README is for those who want to run the package in development and contribute code.\n\n\n### 📥️ Clone\n\nClone the repository:\n\n```bash\ngit clone https://github.com/vemonet/csvw-ontomap\ncd csvw-ontomap\n```\n\n### 🐣 Install dependencies\n\nInstall [Hatch](https://hatch.pypa.io). It automatically handles virtual environments and makes sure all dependencies are installed when you run a script in the project:\n\n```bash\npipx install hatch\n```\n\n### ☑️ Run tests\n\nMake sure the existing tests still work by running the test suite and linting checks. Note that any pull request to this repository on GitHub will automatically trigger the test suite:\n\n```bash\nhatch run test\n```\n\nTo display all logs when debugging:\n\n```bash\nhatch run test -s\n```\n\n\n### ♻️ Reset the environment\n\nIf you are facing issues with dependencies not updating properly, you can reset the virtual environment with:\n\n```bash\nhatch env prune\n```\n\nManually trigger installing the dependencies in a local virtual environment:\n\n```bash\nhatch -v env create\n```\n\n### 🏷️ New release process\n\nThe deployment of new releases is done automatically by a GitHub Actions workflow when a new release is created on GitHub. To release a new version:\n\n1. Make sure the `PYPI_TOKEN` secret has been defined in the GitHub repository (in Settings \u003e Secrets \u003e Actions). You can get an API token from PyPI at [pypi.org/manage/account](https://pypi.org/manage/account).\n2. Increment the `version` number in the `pyproject.toml` file in the root folder of the repository.\n\n    ```bash\n    hatch version fix\n    ```\n\n3. 
Create a new release on GitHub, which will automatically trigger the publish workflow, and publish the new release to PyPI.\n\nYou can also do it locally:\n\n```bash\nhatch build\nhatch publish\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvemonet%2Fcsvw-ontomap","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvemonet%2Fcsvw-ontomap","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvemonet%2Fcsvw-ontomap/lists"}