{"id":23430748,"url":"https://github.com/iesl/stance","last_synced_at":"2025-07-09T07:36:57.054Z","repository":{"id":73985615,"uuid":"198517830","full_name":"iesl/stance","owner":"iesl","description":"Learned string similarity for entity names using optimal transport.","archived":false,"fork":false,"pushed_at":"2020-11-17T21:29:48.000Z","size":73,"stargazers_count":35,"open_issues_count":3,"forks_count":3,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-12T23:14:35.665Z","etag":null,"topics":["aliases","entity-resolution","optimal-transport","record-linkage","stance","string-distance","string-matching","string-similarity"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/iesl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-07-23T22:32:52.000Z","updated_at":"2025-02-02T13:36:15.000Z","dependencies_parsed_at":null,"dependency_job_id":"d50a7818-30c9-4aab-9d50-46f80b1753ff","html_url":"https://github.com/iesl/stance","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/iesl/stance","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iesl%2Fstance","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iesl%2Fstance/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iesl%2Fstance/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iesl%2Fstance/manifests","own
er_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/iesl","download_url":"https://codeload.github.com/iesl/stance/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iesl%2Fstance/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264414627,"owners_count":23604440,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aliases","entity-resolution","optimal-transport","record-linkage","stance","string-distance","string-matching","string-similarity"],"created_at":"2024-12-23T09:46:49.118Z","updated_at":"2025-07-09T07:36:57.047Z","avatar_url":"https://github.com/iesl.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Stance #\n**S**imilarity of \n**T**ransport \n**A**ligned \n**N**eural \n**C**haracter \n**E**ncodings\n\n[Optimal Transport-based Alignment of Learned Character Representations for String Similarity](https://www.aclweb.org/anthology/P19-1592)\nDerek Tam, Nicholas Monath, Ari Kobren, Aaron Traylor, Rajarshi Das, Andrew McCallum.\nAssociation for Computational Linguistics (ACL). 2019.\n\n\n## Dependencies ##\nPython 3.6\\\nPyTorch 0.4\\\nnumpy 1.13.3\\\nscikit-learn 0.21.1 \\\ncython \\\nnose \n\n## Dataset ## \n\nThe datasets are available at this Google Drive [link](https://drive.google.com/drive/folders/1LeGQWdXOwkDxVJ2UXieqbh0-ZyqraAyj) \\[Updated 9/7/19\\]. Place the `data` directory under the top-level `stance` directory.\n\nTraining files are of the form `query \\t positive \\t negative`. 
For example, \n```\nWilliam Paget, 1st Baron Paget \\t William Lord Paget \\t William George Stevens \nWilliam Paget, 1st Baron Paget \\t William Lord Paget \\t William Tighe  \nWilliam Paget, 1st Baron Paget \\t William Lord Paget \\t Edward Paget    \n```\n\nDev and Test files are of the form `query \\t candidate \\t label `, where label is 1 (if the candidate is an alias of the query) or 0 (if the candidate is not an alias of the query). For example, \n\n```\npeace agreement peace negotiation       1      \npeace agreement interim peace treaty    1      \npeace agreement Peace Accord    1  \n```\n\n\n## Setup ##\n\nFirst, install the baselines by running `source bin/install_baseline.sh`  (from https://github.com/mblondel/soft-dtw)\n\nFor each session, run `source bin/setup.sh` to set environment variables.\n\nIf running on your own dataset, create the vocab for the dataset by running `bin/make_vocab.sh` with the training file, vocab file name, tokenizer, and minimum count as arguments. For example, `sh bin/make_vocab.sh data/artist/artist.train data/artist/artist.vocab Char 5`. Vocab files are provided for the datasets we released.\n\n\\* Note: the vocab only has to be created once per dataset.\n\n## Training Models ##\n\nFirst, create a config JSON file (sample file at `config/artist/STANCE.json`).\n\nThen, train the model by running `bin/run/train_model.sh` with the config JSON file as an argument. For example, `sh bin/run/train_model.sh config/artist/STANCE.json`\n\nSee below for how to grid search train models. \n\n## Evaluating Models ##\n\nThere are two options: \n1) evaluating the model on the entire test file (can take a long time to run)\n\n    * For the first option, run `bin/run/eval_model.sh`, passing in the experiment directory as the argument. For example, `sh bin/run/eval_model.sh exp_out/artist/Stance/Char/2019-05-30-10-36-55/`. 
\n\n2) sharding the test file and evaluating the model in parallel  \n\n\n    * For the second option, first shard the test file by running `bin/shard_test.sh` and passing in the test file and number of shards as arguments. For example, `sh bin/shard_test.sh data/disease/disease.test 10 0`.\n\n        \\* This only has to be done once per dataset\n\n    * Then, set up a script by running `src/main/eval/setup_parallel_test.py` that will evaluate the model on each shard in parallel, passing in the experiment directory, number of shards, and gpu type as arguments. When using grid search, the experiment directory has to be the configuration directory with the best model. For example, `python src/main/eval/setup_parallel_test.py -e exp_out/artist/Stance/Char/2019-05-30-10-36-55 -n 10 -g 1080ti-short`\n    \n       \\* The script assumes a Slurm manager  \n\n    * Finally, run the script, which will be at `exp_out/{dataset}/{model}/{tokenizer}/{timestamp}/parallel_test.sh`. For example, `sh exp_out/artist/Stance/Char/2019-05-30-10-36-55/parallel_test.sh`. \n    \n3) Calculate the score on the shards\n   \n   * Run `src/main/eval/score_shards.py`. The experiment directory has to be the same experiment directory passed into `src/main/eval/setup_parallel_test.py` earlier. For example, `python src/main/eval/score_shards.py -e exp_out/artist/Stance/Char/2019-05-30-10-36-55`. The test scores will appear in `exp_out/{dataset}/{model}/{tokenizer}/{timestamp}/test_scores.json` \n\n## Grid Search Train Models ##\n\nFirst, create a grid search config JSON file (sample file at `config/artist/grid_search_STANCE.json`)\n\nThen, create a script to train each model configuration in parallel by running `src/main/setup/setup_grid_search_train.py` with the grid search config file and gpu type as arguments. 
For example, `python src/main/setup/setup_grid_search_train.py -c config/artist/grid_search_STANCE.json -g gpu`.\n\n\\* The script assumes a Slurm manager \n\nFinally, run the script, which will be at `exp_out/{dataset}/{model}/{tokenizer}/{timestamp}/grid_search_config.sh`. For example, `sh exp_out/artist/Stance/Char/2019-05-30-15-08-47/grid_search_config.sh`.\n\n## Citing ##\n\nPlease cite: \n\n```\n@inproceedings{tam2019optimal,\n    title = \"Optimal Transport-based Alignment of Learned Character Representations for String Similarity\",\n    author = \"Tam, Derek  and\n      Monath, Nicholas  and\n      Kobren, Ari  and\n      Traylor, Aaron  and\n      Das, Rajarshi  and\n      McCallum, Andrew\",\n    booktitle = \"Association for Computational Linguistics (ACL)\",\n    year = \"2019\"\n}\n```\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiesl%2Fstance","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fiesl%2Fstance","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiesl%2Fstance/lists"}