<h1 align="center">Valentine 💘</h1>
<p align="center"><em>(Schema-) Matching DataFrames Made Easy</em></p>

<p 
align="center">
  <a href="https://github.com/delftdata/valentine/actions/workflows/build.yml">
    <img src="https://github.com/delftdata/valentine/actions/workflows/build.yml/badge.svg" alt="Build">
  </a>
  <a href="https://codecov.io/gh/delftdata/valentine">
    <img src="https://codecov.io/gh/delftdata/valentine/branch/master/graph/badge.svg?token=4QR0X315CL" alt="codecov">
  </a>
  <a href="https://app.codacy.com/gh/delftdata/valentine/dashboard">
    <img src="https://app.codacy.com/project/badge/Grade/85cfebfc9c6a43359c5b2e56a5fdf3a3" alt="Codacy Badge">
  </a>
  <a href="https://github.com/astral-sh/ruff">
    <img src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json" alt="Ruff">
  </a>
  <a href="https://pypi.org/project/valentine/">
    <img src="https://img.shields.io/pypi/v/valentine.svg" alt="PyPI version">
  </a>
  <a href="https://pypi.org/project/valentine/">
    <img src="https://img.shields.io/pypi/pyversions/valentine.svg" alt="Python versions">
  </a>
  <a href="https://pypi.org/project/valentine/">
    <img src="https://img.shields.io/pypi/dm/valentine.svg" alt="PyPI downloads">
  </a>
  <a href="https://github.com/delftdata/valentine/blob/master/LICENSE">
    <img src="https://img.shields.io/github/license/delftdata/valentine.svg" alt="License">
  </a>
  <a href="https://delftdata.github.io/valentine/">
    <img src="https://img.shields.io/badge/docs-GitHub%20Pages-blue.svg" alt="Docs">
  </a>
</p>

---

A Python package for capturing potential relationships among columns of different tabular datasets, given as
pandas DataFrames.  
Valentine is based on the paper [**Valentine: Evaluating Matching Techniques for Dataset Discovery**](https://ieeexplore.ieee.org/abstract/document/9458921).

You can find more information about the research supporting Valentine [here](https://delftdata.github.io/valentine/).


## Experimental suite version

The original experimental suite version of Valentine, as first published for the needs of the research paper, can still be found [here](https://github.com/delftdata/valentine/tree/v1.1).

## Installation instructions
### Requirements

*   *Python* >=3.10,<3.15

To install Valentine, simply run:

```shell
pip install valentine
```


## Usage
Valentine can be used to find matches among columns of a given pair of pandas DataFrames.

### Matching methods
In order to do so, the user can choose one of the following matching methods:

1.   `Coma(int: max_n, bool: use_instances, bool: use_schema, float: delta, float: threshold)` is a pure Python implementation of the [COMA 3.0](https://sourceforge.net/projects/coma-ce/) schema matching algorithm.
     *    **Parameters**:
           *    **max_n**(*int*) - Maximum number of matches to keep per column; 0 means unlimited (default: 0).
           *    **use_instances**(*bool*) - Whether to use TF-IDF instance-based matching on data values (default: False).
           *    **use_schema**(*bool*) - Whether to use schema-based matching on column names, paths, and structure (default: True).
           *    **delta**(*float*) - Fraction from the best score within which matches are kept (default: 0.15).
           *    **threshold**(*float*) - Absolute minimum similarity score to keep a match (default: 0.0).

2.   
`Cupid(float: w_struct, float: leaf_w_struct, float: th_accept)` is the Python implementation of the paper [Generic Schema Matching with Cupid](https://www.vldb.org/conf/2001/P049.pdf).
     *    **Parameters**:
          *    **w_struct**(*float*) - Structural similarity threshold, default is 0.2.
          *    **leaf_w_struct**(*float*) - Structural similarity threshold at leaf level, default is 0.2.
          *    **th_accept**(*float*) - Acceptance similarity threshold, default is 0.7.

3.   `DistributionBased(float: threshold1, float: threshold2)` is the Python implementation of the paper [Automatic Discovery of Attributes in Relational Databases](https://dl.acm.org/doi/10.1145/1989323.1989336).
     *    **Parameters**:
          *    **threshold1**(*float*) - The threshold for phase 1 of the method, default is 0.15.
          *    **threshold2**(*float*) - The threshold for phase 2 of the method, default is 0.15.

4.   `JaccardDistanceMatcher(float: threshold_dist)` is a baseline method that uses Jaccard similarity between columns to assess their correspondence score, optionally enhanced by a string similarity measure of choice.
     *    **Parameters**:
          *    **threshold_dist**(*float*) - Acceptance threshold for assessing two strings as equal, default is 0.8.
          *    **distance_fun**(*StringDistanceFunction*) - String similarity function used to assess whether two strings are equal. The enumeration class type `StringDistanceFunction` can be imported from `valentine.algorithms.jaccard_distance`. 
Functions currently supported are:
               * `StringDistanceFunction.Levenshtein`: [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance)
               * `StringDistanceFunction.DamerauLevenshtein`: [Damerau-Levenshtein distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)
               * `StringDistanceFunction.Hamming`: [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance)
               * `StringDistanceFunction.Jaro`: [Jaro distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance)
               * `StringDistanceFunction.JaroWinkler`: [Jaro-Winkler distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance)
               * `StringDistanceFunction.Exact`: String equality `==`

5.   `SimilarityFlooding(str: coeff_policy, str: formula)` is the Python implementation of the paper [Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching](https://ieeexplore.ieee.org/document/994702).
     * **Parameters**:
        *    **coeff_policy**(*str*) - Policy for deciding the weight coefficients of the propagation graph. Choice of "inverse_product" or "inverse_average" (default).
        *    **formula**(*str*) - Formula on which the iterative fixpoint computation is based. Choice of "basic", "formula_a", "formula_b" and "formula_c" (default).

### Matching DataFrame Pair

After selecting one of the matching methods, the user can initiate the pairwise matching process as follows:

```python
matches = valentine_match(df1, df2, matcher, df1_name, df2_name)
```

where `df1` and `df2` are the two pandas DataFrames for which we want to find matches, and `matcher` is one of Coma, Cupid, DistributionBased, JaccardDistanceMatcher or SimilarityFlooding. The user can also input a name for each DataFrame (defaults are "table_1" and "table_2"). 
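To make the shape of the result concrete, the following stdlib-only sketch mimics a naive matcher in the spirit of `JaccardDistanceMatcher` above. This is an illustration of the result format only, not Valentine's implementation; `naive_match` and `jaccard` are hypothetical helper names.

```python
# Illustration only: a naive column matcher in the spirit of Valentine's
# baseline JaccardDistanceMatcher, NOT the library's implementation.
# It scores every column pair by the Jaccard overlap of their value sets
# and returns a dict shaped like Valentine's results:
#   {(("table_1", col_a), ("table_2", col_b)): score, ...}

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def naive_match(t1, t2, t1_name="table_1", t2_name="table_2"):
    scores = {
        ((t1_name, c1), (t2_name, c2)): jaccard(v1, v2)
        for c1, v1 in t1.items()
        for c2, v2 in t2.items()
    }
    # Sort from high to low similarity, as Valentine's results are sorted
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

t1 = {"Authors": ["smith", "jones", "lee"], "EID": ["e1", "e2", "e3"]}
t2 = {"Authors": ["jones", "lee", "wu"], "Title": ["a", "b", "c"]}
print(naive_match(t1, t2))
```

Here the two "Authors" columns share two of four distinct values, so they score 0.5 while the disjoint pairs score 0.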
The function `valentine_match` returns a `MatcherResults` object, which is a dictionary with additional convenience methods, such as `one_to_one`, `take_top_percent`, `get_metrics` and more. It stores as keys column pairs from the two DataFrames and as values the corresponding similarity scores.

### Matching DataFrame Batch

After selecting one of the matching methods, the user can initiate the batch matching process as follows:

```python
matches = valentine_match_batch(df_iter_1, df_iter_2, matcher, df_iter_1_names, df_iter_2_names)
```

where `df_iter_1` and `df_iter_2` are two iterable structures containing the pandas DataFrames for which we want to find matches, and `matcher` is one of Coma, Cupid, DistributionBased, JaccardDistanceMatcher or SimilarityFlooding. The user can also input an iterable with names for each DataFrame. The function `valentine_match_batch` likewise returns a `MatcherResults` object, as described above.


### MatcherResults instance
The `MatcherResults` instance has some convenience methods that the user can use to either obtain a subset of the data or to transform the data. This instance is a dictionary and is sorted upon instantiation, from high similarity to low similarity.

```python
top_n_matches = matches.take_top_n(5)

top_n_percent_matches = matches.take_top_percent(25)

one_to_one_matches = matches.one_to_one()
```


### Measuring effectiveness
The `MatcherResults` instance returned by `valentine_match` or `valentine_match_batch` also has a `get_metrics` method

```python
metrics = matches.get_metrics(ground_truth)
```

that returns all effectiveness metrics, such as Precision, Recall, F1-score and others, as described in the original Valentine paper. 
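For intuition, the core of such metrics can be sketched by hand from a matches dictionary and a ground-truth list. The snippet below is a schematic with made-up scores, not Valentine's `get_metrics` implementation; it assumes keys follow the `((table, column), (table, column))` shape shown in the example further down.

```python
# Schematic precision/recall/F1 computation over a Valentine-style
# matches dict; illustration only, not the library's get_metrics.
matches = {
    (("table_1", "Cited by"), ("table_2", "Cited by")): 0.87,
    (("table_1", "Authors"), ("table_2", "Authors")): 0.86,
    (("table_1", "EID"), ("table_2", "Title")): 0.40,   # a wrong match
}
ground_truth = [("Cited by", "Cited by"), ("Authors", "Authors"), ("EID", "EID")]

predicted = {(a[1], b[1]) for a, b in matches}   # drop the table names
truth = set(ground_truth)
tp = len(predicted & truth)                      # true positives
precision = tp / len(predicted)
recall = tp / len(truth)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 2 of 3 predictions are correct here
```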
To compute these metrics, the user also needs to provide the ground truth of matches against which the metrics will be calculated. The ground truth can be given as a list of tuples representing column matches that should hold (see the example below).

By default, all the core metrics are computed with default parameters, but the user can customize which metrics to run and with what parameters, and can implement custom metrics by extending the `Metric` base class. Some predefined sets of metrics are available as well.

```python
from valentine.metrics import F1Score, PrecisionTopNPercent, METRICS_PRECISION_INCREASING_N
metrics_custom = matches.get_metrics(ground_truth, metrics={F1Score(one_to_one=False), PrecisionTopNPercent(n=70)})
metrics_predefined_set = matches.get_metrics(ground_truth, metrics=METRICS_PRECISION_INCREASING_N)
```


### Example
The following block of code shows: 1) how to run a matcher from Valentine on two DataFrames storing information about authors and their publications, and then 2) how to assess its effectiveness based on a given ground truth (a more extensive example is shown in [`valentine_example.py`](https://github.com/delftdata/valentine/blob/master/examples/valentine_example.py)):

```python
import os
import pandas as pd
from valentine import valentine_match
from valentine.algorithms import Coma

# Load data using pandas
d1_path = os.path.join('data', 'authors1.csv')
d2_path = os.path.join('data', 'authors2.csv')
df1 = pd.read_csv(d1_path)
df2 = pd.read_csv(d2_path)

# Instantiate the matcher and run it
matcher = Coma(use_instances=True)
matches = valentine_match(df1, df2, matcher)

print(matches)

# If ground truth is available, Valentine can calculate the metrics
ground_truth = [('Cited by', 'Cited by'),
                ('Authors', 'Authors'),
                ('EID', 'EID')]

metrics = matches.get_metrics(ground_truth)

print(metrics)
```

The output of the above code block is:

```
{
     (('table_1', 'Cited by'), ('table_2', 'Cited by')): 0.86994505,
     (('table_1', 'Authors'), ('table_2', 'Authors')): 0.8679843,
     (('table_1', 'EID'), ('table_2', 'EID')): 0.8571245
}
{
     'Recall': 1.0,
     'F1Score': 1.0,
     'RecallAtSizeofGroundTruth': 1.0,
     'Precision': 1.0,
     'PrecisionTop10Percent': 1.0
}
```

## Cite Valentine
```
Original Valentine paper:
@inproceedings{koutras2021valentine,
  title={Valentine: Evaluating Matching Techniques for Dataset Discovery},
  author={Koutras, Christos and Siachamis, George and Ionescu, Andra and Psarakis, Kyriakos and Brons, Jerry and Fragkoulis, Marios and Lofi, Christoph and Bonifati, Angela and Katsifodimos, Asterios},
  booktitle={2021 IEEE 37th International Conference on Data Engineering (ICDE)},
  pages={468--479},
  year={2021},
  organization={IEEE}
}

Demo paper:
@article{koutras2021demo,
  title={Valentine in Action: Matching Tabular Data at Scale},
  author={Koutras, Christos and Psarakis, Kyriakos and Siachamis, George and Ionescu, Andra and Fragkoulis, Marios and Bonifati, Angela and Katsifodimos, Asterios},
  journal={Proceedings of the VLDB Endowment},
  volume={14},
  number={12},
  pages={2871--2874},
  year={2021},
  publisher={VLDB Endowment}
}
```