{"id":23440850,"url":"https://github.com/living-with-machines/deezymatch","last_synced_at":"2025-05-16T14:04:37.040Z","repository":{"id":39341462,"uuid":"261796758","full_name":"Living-with-machines/DeezyMatch","owner":"Living-with-machines","description":"A Flexible Deep Learning Approach to Fuzzy String Matching","archived":false,"fork":false,"pushed_at":"2024-10-16T14:52:18.000Z","size":2554,"stargazers_count":144,"open_issues_count":30,"forks_count":34,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-04-03T10:12:34.066Z","etag":null,"topics":["deep-learning","hacktoberfest","hut23","hut23-96","machine-learning","natural-language-processing","nlp"],"latest_commit_sha":null,"homepage":"https://living-with-machines.github.io/DeezyMatch/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Living-with-machines.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-06T15:12:51.000Z","updated_at":"2025-02-28T21:13:49.000Z","dependencies_parsed_at":"2024-12-31T15:34:18.559Z","dependency_job_id":null,"html_url":"https://github.com/Living-with-machines/DeezyMatch","commit_stats":{"total_commits":452,"total_committers":6,"mean_commits":75.33333333333333,"dds":0.1084070796460177,"last_synced_commit":"b3e550422187a7c5872efbbc469f7071b53bc0ab"},"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Living-with-machines%2FDeezyMatch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/Living-with-machines%2FDeezyMatch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Living-with-machines%2FDeezyMatch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Living-with-machines%2FDeezyMatch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Living-with-machines","download_url":"https://codeload.github.com/Living-with-machines/DeezyMatch/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248546394,"owners_count":21122306,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","hacktoberfest","hut23","hut23-96","machine-learning","natural-language-processing","nlp"],"created_at":"2024-12-23T16:19:21.176Z","updated_at":"2025-04-12T09:33:14.183Z","avatar_url":"https://github.com/Living-with-machines.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n    \u003cbr\u003e\n    \u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://raw.githubusercontent.com/Living-with-machines/DeezyMatch/master/figs/DM_logo.png\" \n         alt=\"DeezyMatch logo\" width=\"30%\" align=\"center\"\u003e\n    \u003c/p\u003e\n    \u003ch2\u003eA Flexible Deep Neural Network Approach to Fuzzy String Matching\u003c/h2\u003e\n\u003c/div\u003e\n \n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://pypi.org/project/DeezyMatch/\"\u003e\n        \u003cimg alt=\"PyPI\" src=\"https://img.shields.io/pypi/v/DeezyMatch\"\u003e\n    
\u003c/a\u003e\n    \u003ca href=\"https://github.com/Living-with-machines/DeezyMatch/blob/master/LICENSE\"\u003e\n        \u003cimg alt=\"License\" src=\"https://img.shields.io/badge/License-MIT-yellow.svg\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://mybinder.org/v2/gh/Living-with-machines/DeezyMatch/HEAD?filepath=examples\"\u003e\n        \u003cimg alt=\"Binder\" src=\"https://mybinder.org/badge_logo.svg\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/Living-with-machines/DeezyMatch/actions/workflows/dm_ci.yml/badge.svg\"\u003e\n        \u003cimg alt=\"Integration Tests badge\" src=\"https://github.com/Living-with-machines/DeezyMatch/actions/workflows/dm_ci.yml/badge.svg\"\u003e\n    \u003c/a\u003e\n    \u003cbr/\u003e\n\u003c/p\u003e\n\nDeezyMatch can be used in the following tasks:\n\n- Fuzzy string matching\n- Candidate ranking/selection\n- Query expansion\n- Toponym matching\n\nOr as a component in tasks requiring fuzzy string matching and candidate ranking, such as:\n- Record linkage\n- Entity linking \n\nTable of contents\n-----------------\n\n- [Installation and setup](#installation)\n- [Data and directory structure in tutorials](#data-and-directory-structure-in-tutorials)\n    * [The input file](#the-input-file)\n    * [The vocabulary file](#the-vocabulary-file)\n    * [The datasets](#the-datasets)\n- [Run DeezyMatch: the quick tour](#run-deezymatch-the-quick-tour)\n- [Run DeezyMatch: the complete tour](#run-deezymatch-the-complete-tour)\n    * [Train a new model](#train-a-new-model)\n    * [Plot the log file](#plot-the-log-file)\n    * [Finetune a pretrained model](#finetune-a-pretrained-model)\n    * [Model inference](#model-inference)\n    * [Generate query and candidate vectors](#generate-query-and-candidate-vectors)\n    * [Combine vector representations](#combine-vector-representations)\n    * [Candidate ranking](#candidate-ranking)\n    * [Candidate ranking on-the-fly](#candidate-ranking-on-the-fly)\n    * [Tips / 
Suggestions on DeezyMatch functionalities](#tips--suggestions-on-deezymatch-functionalities)\n- [Examples on how to run DeezyMatch on Jupyter notebooks](./examples)\n- [How to cite DeezyMatch](#how-to-cite-deezymatch)\n- [Credits](#credits)\n\n## Installation\n\nWe strongly recommend installation via Anaconda (refer to [Anaconda website and follow the instructions](https://docs.anaconda.com/anaconda/install/)).\n\n* Create a new environment for DeezyMatch\n\n```bash\nconda create -n py39deezy python=3.9\n```\n\n* Activate the environment:\n\n```bash\nconda activate py39deezy\n```\n\n* DeezyMatch can be installed in different ways:\n\n  1. **Install DeezyMatch via [PyPi](https://pypi.org/project/DeezyMatch/)** (which tends to be the most user-friendly option):\n      \n      * Install DeezyMatch:\n\n      ```bash\n      pip install DeezyMatch\n      ```\n\n  2. **Install DeezyMatch from the source code**:\n\n      * Clone DeezyMatch source code:\n\n      ```bash\n      git clone https://github.com/Living-with-machines/DeezyMatch.git\n      ```\n\n      * Install DeezyMatch dependencies:\n\n      ```\n      cd /path/to/my/DeezyMatch\n      pip install -r requirements.txt\n      ```\n\n        \u003e :warning: If you get `ModuleNotFoundError: No module named '_swigfaiss'` error when running `candidateRanker.py`, one way to solve this issue is by:\n        \u003e \n        \u003e ```bash\n        \u003e pip install faiss-cpu --no-cache\n        \u003e ```\n        \u003e\n        \u003e Refer to [this page](https://github.com/facebookresearch/faiss/issues/821).\n\n      * DeezyMatch can be installed using one of the following two options:\n      \n        * Install DeezyMatch in non-editable mode:\n\n          ```\n          cd /path/to/my/DeezyMatch\n          python setup.py install\n          ```\n\n        * Install DeezyMatch in editable mode:\n\n          ```\n          cd /path/to/my/DeezyMatch\n          pip install -v -e .\n          ```\n\n* We have provided 
some [Jupyter Notebooks to show how different components in DeezyMatch can be run](./examples). To allow the newly created `py39deezy` environment to show up in the notebooks:\n\n  ```bash\n  python -m ipykernel install --user --name py39deezy --display-name \"Python (py39deezy)\"\n  ```\n\n## Data and directory structure in tutorials\n\nYou can create a new directory for your experiments. Note that this directory can be created outside of the DeezyMatch source code (after installation, the DeezyMatch command-line tool and modules are accessible from anywhere on your local machine).\n\nIn the tutorials, we assume the following directory structure (i.e. we assume the commands are run from the main DeezyMatch directory):\n\n```bash\nDeezyMatch\n   ├── dataset\n   │   ├── characters_v001.vocab\n   │   ├── dataset-string-matching_train.txt\n   │   ├── dataset-string-matching_finetune.txt\n   │   ├── dataset-string-matching_test.txt\n   │   ├── dataset-candidates.txt\n   │   └── dataset-queries.txt\n   └── inputs\n       ├── characters_v001.vocab\n       └── input_dfm.yaml\n```\n\n### The input file\n\nThe input file (`input_dfm.yaml`) allows the user to specify a series of parameters that will define the behaviour of DeezyMatch, without requiring the user to modify the code. 
The input file allows you to configure the following:\n* Type of normalization and preprocessing to be applied to the input string, and tokenization mode (char, ngram, word).\n* Neural network architecture (RNN, GRU, or LSTM) and its hyperparameters (number of layers and directions in the recurrent units, the dimensionality of the hidden layer, learning rate, number of epochs, batch size, early stopping, dropout probability), pooling mode and layers to freeze during fine-tuning.\n* Proportion of data used for training, validation and test.\n\nSee the sample [input file](https://github.com/Living-with-machines/DeezyMatch/blob/master/inputs/input_dfm.yaml) for a complete list of the DeezyMatch options that can be configured from the input file.\n\n### The vocabulary file\nThe vocabulary file (`./inputs/characters_v001.vocab`) combines all characters from the different datasets we have used in our experiments (see [DeezyMatch's paper](https://www.aclweb.org/anthology/2020.emnlp-demos.9/) and [this paper](https://arxiv.org/abs/2009.08114) for a detailed description of the datasets). It consists of 7,540 characters from multiple alphabets, including special characters. You will only need to change the vocabulary file in certain fine-tuning settings.\n\n### The datasets\n\nWe provide the following minimal sample datasets to showcase the functionality of DeezyMatch. Please note that these are very small files that have been provided just for illustration purposes.\n\n* **String matching datasets:** The `dataset-string-matching_xxx.txt` files are small subsets of a larger [toponym matching dataset](https://github.com/ruipds/Toponym-Matching). 
We provide:\n  * `dataset-string-matching_train.txt`: data used for training a DeezyMatch model from scratch [5000 string pairs].\n  * `dataset-string-matching_finetune.txt`: data used for fine-tuning an existing DeezyMatch model (this is an optional step) [2500 string pairs].\n  * `dataset-string-matching_test.txt`: data used for assessing the performance of the DeezyMatch model (this is an optional step, as the training step already produces an intrinsic evaluation) [2495 string pairs].\n  \n  The string matching datasets are composed of an equal number of positive and negative string matches, where:\n  * A positive string match is a pair of strings that can refer to the same entity (e.g. \"Wādī Qānī\" and \"Uàdi Gani\" are different variations of the same place name).\n  * A negative string match is a pair of strings that do not refer to the same entity (e.g. \"Liufangwan\" and \"Wangjiawo\" are **not** variations of the same place name).\n\n  The string matching datasets consist of at least three columns (tab-separated), where the first and second columns contain the two strings being compared, and the third column contains the label (i.e. `TRUE` for a positive match, `FALSE` for a negative match). The dataset can have a number of additional columns, which DeezyMatch will ignore (e.g. the last six columns in the sample datasets).\n\n* **Candidates dataset:** The `dataset-candidates.txt` file lists the potential candidates to which we want to match a query. The dataset we provide lists only 40 candidates (just for illustration purposes), with one candidate per line.\n  \n  In real-world experiments, the candidates file is usually a large file, as it contains all possible name variations of the potential entities in a knowledge base (for example, supposing we want to find the Wikidata entity that corresponds to a certain query, the candidates would be all potential Wikidata names). This dataset lists one candidate per line. 
Additional tab-separated columns are allowed (they may be useful to keep information related to the candidate, such as the identifier in the knowledge base, but this additional information will be ignored by DeezyMatch).\n\n* **Queries dataset:** The `dataset-queries.txt` file lists the set of queries that we want to match with the candidates: 30 queries, with one query per line. The queries dataset is not required if you use DeezyMatch on-the-fly (see more about this [below](#candidate-ranking-on-the-fly)).\n\n## Run DeezyMatch: the quick tour\n\n:warning: Refer to the [installation section](#installation) to set up DeezyMatch on your local machine. In the following tutorials, we assume the directory structure specified in [this section](#data-and-directory-structure-in-tutorials). The outputs of DeezyMatch will be created in the directory from which you are running it, unless otherwise explicitly specified.\n\nWritten in the Python programming language, DeezyMatch can be used as a stand-alone command-line tool or can be integrated as a module with other Python code. In what follows, we describe DeezyMatch's functionality through examples, providing both the command-line and the Python module syntax.\n\nIn this \"quick tour\", we go through all the DeezyMatch main functionalities with minimal explanations. Note that we provide basic examples using the DeezyMatch Python modules here. 
If you want to know more about each module or run DeezyMatch via command line, refer to the relevant part of README (also referenced in this section):\n\n* [Train a new model](#train-a-new-model):\n\n```python\nfrom DeezyMatch import train as dm_train\n\n# train a new model\ndm_train(input_file_path=\"./inputs/input_dfm.yaml\", \n         dataset_path=\"dataset/dataset-string-matching_train.txt\", \n         model_name=\"test001\")\n```\n\n* [Plot the log file](#plot-the-log-file) (stored at `./models/test001/log.txt` and contains loss/accuracy/recall/F1-scores as a function of epoch):\n\n```python\nfrom DeezyMatch import plot_log\n\n# plot log file\nplot_log(path2log=\"./models/test001/log.txt\", \n         output_name=\"t001\")\n```\n\n* [Finetune a pretrained model](#finetune-a-pretrained-model):\n\n```python\nfrom DeezyMatch import finetune as dm_finetune\n\n# fine-tune a pretrained model stored at pretrained_model_path and pretrained_vocab_path \ndm_finetune(input_file_path=\"./inputs/input_dfm.yaml\", \n            dataset_path=\"dataset/dataset-string-matching_finetune.txt\", \n            model_name=\"finetuned_test001\",\n            pretrained_model_path=\"./models/test001/test001.model\", \n            pretrained_vocab_path=\"./models/test001/test001.vocab\")\n```\n\n* [Model inference](#model-inference):\n\n```python\nfrom DeezyMatch import inference as dm_inference\n\n# model inference using a model stored at pretrained_model_path and pretrained_vocab_path (in this case, the model we have just finetuned)\ndm_inference(input_file_path=\"./inputs/input_dfm.yaml\",\n             dataset_path=\"dataset/dataset-string-matching_test.txt\", \n             pretrained_model_path=\"./models/finetuned_test001/finetuned_test001.model\", \n             pretrained_vocab_path=\"./models/finetuned_test001/finetuned_test001.vocab\")\n```\n\n * [Generate candidate vectors](#generate-query-and-candidate-vectors):\n\n```python\nfrom DeezyMatch import inference as 
dm_inference\n\n# generate vectors for candidates (specified in dataset_path) \n# using a model stored at pretrained_model_path and pretrained_vocab_path \ndm_inference(input_file_path=\"./inputs/input_dfm.yaml\",\n             dataset_path=\"dataset/dataset-candidates.txt\", \n             pretrained_model_path=\"./models/finetuned_test001/finetuned_test001.model\", \n             pretrained_vocab_path=\"./models/finetuned_test001/finetuned_test001.vocab\",\n             inference_mode=\"vect\",\n             scenario=\"candidates/test\")\n```\n\n* [Assembling candidates vector representations](#combine-vector-representations):\n\n```python\nfrom DeezyMatch import combine_vecs\n\n# combine vectors stored in candidates/test and save them in combined/candidates_test\ncombine_vecs(rnn_passes=['fwd', 'bwd'], \n             input_scenario='candidates/test', \n             output_scenario='combined/candidates_test', \n             print_every=10)\n```\n\n * [Generate query vectors](#generate-query-and-candidate-vectors) (not required for on-the-fly DeezyMatch):\n \n```python\nfrom DeezyMatch import inference as dm_inference\n\n# generate vectors for queries (specified in dataset_path) \n# using a model stored at pretrained_model_path and pretrained_vocab_path \ndm_inference(input_file_path=\"./inputs/input_dfm.yaml\",\n             dataset_path=\"dataset/dataset-queries.txt\", \n             pretrained_model_path=\"./models/finetuned_test001/finetuned_test001.model\", \n             pretrained_vocab_path=\"./models/finetuned_test001/finetuned_test001.vocab\",\n             inference_mode=\"vect\",\n             scenario=\"queries/test\")\n```\n\n* [Assembling queries vector representations](#combine-vector-representations) (not required for on-the-fly DeezyMatch):\n\n```python\nfrom DeezyMatch import combine_vecs\n\n# combine vectors stored in queries/test and save them in combined/queries_test\ncombine_vecs(rnn_passes=['fwd', 'bwd'], \n             
input_scenario='queries/test', \n             output_scenario='combined/queries_test', \n             print_every=10)\n```\n\n* [Candidate ranker](#candidate-ranking):\n\n```python\nfrom DeezyMatch import candidate_ranker\n\n# Select candidates based on L2-norm distance (aka faiss distance):\n# find candidates from candidate_scenario \n# for queries specified in query_scenario\ncandidates_pd = \\\n    candidate_ranker(query_scenario=\"./combined/queries_test\",\n                     candidate_scenario=\"./combined/candidates_test\", \n                     ranking_metric=\"faiss\", \n                     selection_threshold=5., \n                     num_candidates=2, \n                     search_size=2, \n                     output_path=\"ranker_results/test_candidates_deezymatch\", \n                     pretrained_model_path=\"./models/finetuned_test001/finetuned_test001.model\", \n                     pretrained_vocab_path=\"./models/finetuned_test001/finetuned_test001.vocab\", \n                     number_test_rows=20)\n```\n\n* [Candidate ranking on-the-fly](#candidate-ranking-on-the-fly):\n\n```python\nfrom DeezyMatch import candidate_ranker\n\n# Ranking on-the-fly\n# find candidates from candidate_scenario \n# for queries specified by the `query` argument\ncandidates_pd = \\\n    candidate_ranker(candidate_scenario=\"./combined/candidates_test\",\n                     query=[\"DeezyMatch\", \"kasra\", \"fede\", \"mariona\"],\n                     ranking_metric=\"faiss\", \n                     selection_threshold=5., \n                     num_candidates=1, \n                     search_size=100, \n                     output_path=\"ranker_results/test_candidates_deezymatch_on_the_fly\", \n                     pretrained_model_path=\"./models/finetuned_test001/finetuned_test001.model\", \n                     pretrained_vocab_path=\"./models/finetuned_test001/finetuned_test001.vocab\", \n                     number_test_rows=20)\n```\n\nThe candidate 
ranker can be initialised, to be used multiple times, by running:\n\n```python\nfrom DeezyMatch import candidate_ranker_init\n\n# initializing candidate_ranker via candidate_ranker_init\nmyranker = candidate_ranker_init(candidate_scenario=\"./combined/candidates_test\",\n                                 query=[\"DeezyMatch\", \"kasra\", \"fede\", \"mariona\"],\n                                 ranking_metric=\"faiss\", \n                                 selection_threshold=5., \n                                 num_candidates=1, \n                                 search_size=100, \n                                 output_path=\"ranker_results/test_candidates_deezymatch_on_the_fly\", \n                                 pretrained_model_path=\"./models/finetuned_test001/finetuned_test001.model\", \n                                 pretrained_vocab_path=\"./models/finetuned_test001/finetuned_test001.vocab\", \n                                 number_test_rows=20)\n\n# print the content of myranker by:\nprint(myranker)\n\n# To rank the queries:\nmyranker.rank()\n\n#The results are stored in:\nmyranker.output\n```\n\n## Run DeezyMatch: the complete tour\n\n:warning: Refer to [installation section](#installation) to set up DeezyMatch on your local machine. In the following tutorials, we assume a directory structure specified in [this section](#data-and-directory-structure-in-tutorials). The outputs of DeezyMatch will be created in the directory from which you are running it, unless otherwise explicitly specified.\n\nWritten in the Python programming language, DeezyMatch can be used as a stand-alone command-line tool or can be integrated as a module with other Python codes. 
In what follows, we describe DeezyMatch's functionalities in different examples and by providing both command lines and python modules syntaxes.\n\nIn this \"complete tour\", we go through all the DeezyMatch main functionalities in more detail.\n\n### Train a new model\n\nDeezyMatch `train` module can be used to train a new model:\n\n```python\nfrom DeezyMatch import train as dm_train\n\n# train a new model\ndm_train(input_file_path=\"./inputs/input_dfm.yaml\", \n         dataset_path=\"dataset/dataset-string-matching_train.txt\", \n         model_name=\"test001\")\n```\n\nThe same model can be trained via command line:\n\n```bash\nDeezyMatch -i ./inputs/input_dfm.yaml -d dataset/dataset-string-matching_train.txt -m test001\n```\n\n---\n\nSummary of the arguments/flags:\n\n| Func. argument    | Command-line flag     | Description                                           |\n|------------------ |-------------------    |---------------------------------------------------    |\n| input_file_path   | -i                    | path to the input file                                |\n| dataset_path      | -d                    | path to the dataset                                   |\n| model_name        | -m                    | name of the new model                                 |\n| n_train_examples  | -n                    | number of training examples to be used (optional)     |\n\n---\n\nA new model directory called `test001` will be created in `models` directory (as specified in the input file, see `models_dir` in `./inputs/input_dfm.yaml`).\n\n:warning: Dataset (e.g., `dataset/dataset-string-matching_train.txt` in the above command)\n* The third column (label column) should be one of (case-insensitive): [\"true\", \"false\", \"1\", \"0\"]\n* Delimiter is fixed to `\\t` for now.\n\nDeezyMatch keeps track of some **evaluation metrics** (e.g., loss/accuracy/precision/recall/F1) at each epoch.\n\nDeezyMatch **stores models, vocabularies, input file, log file and 
checkpoints (for each epoch)** in the following directory structure (unless `validation` option in the input file is not equal to 1). When DeezyMatch finishes the last epoch, it will **save the model with least validation loss as well** (`test001.model` in the following directory structure).\n\n```bash\nmodels/\n└── test001\n    ├── checkpoint00001.model\n    ├── checkpoint00001.model_state_dict\n    ├── checkpoint00002.model\n    ├── checkpoint00002.model_state_dict\n    ├── checkpoint00003.model\n    ├── checkpoint00003.model_state_dict\n    ├── checkpoint00004.model\n    ├── checkpoint00004.model_state_dict\n    ├── checkpoint00005.model\n    ├── checkpoint00005.model_state_dict\n    ├── input_dfm.yaml\n    ├── log_t001.png\n    ├── log.txt\n    ├── test001.model\n    ├── test001.model_state_dict\n    └── test001.vocab\n```\n\nDeezyMatch has an **early stopping** functionality, which can be activated by setting the `early_stopping_patience` option in the input file. This option specifies the number of epochs with no improvement after which training will be stopped and the model with the least validation loss will be saved.\n\n### Plot the log file\n\nAs said, DeezyMatch keeps track of some evaluation metrics (e.g., loss/accuracy/precision/recall/F1) at each epoch. It is possible to plot the log-file by:\n\n```python\nfrom DeezyMatch import plot_log\n\n# plot log file\nplot_log(path2log=\"./models/test001/log.txt\", \n         output_name=\"t001\")\n```\n\nor:\n\n```bash\nDeezyMatch -lp ./models/test001/log.txt -lo t001\n```\n\n---\n\nSummary of the arguments/flags:\n\n| Func. 
argument    | Command-line flag     | Description                                   |\n|----------------   |-------------------    |---------------------------------------------- |\n| path2log          | -lp                   | path to the log file                          |\n| output_name       | -lo                   | output name   |\n\n---\n\nThis command generates a figure `log_t001.png` and stores it in `models/test001` directory.\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/Living-with-machines/DeezyMatch/master/figs/log_t001.png\" alt=\"Example output of plot_log module\" width=\"100%\"\u003e\n\u003c/p\u003e\n\n### Finetune a pretrained model\n\nThe `finetune` module can be used to fine-tune a pretrained model:\n\n```python\nfrom DeezyMatch import finetune as dm_finetune\n\n# fine-tune a pretrained model stored at pretrained_model_path and pretrained_vocab_path \ndm_finetune(input_file_path=\"./inputs/input_dfm.yaml\", \n            dataset_path=\"dataset/dataset-string-matching_finetune.txt\", \n            model_name=\"finetuned_test001\",\n            pretrained_model_path=\"./models/test001/test001.model\", \n            pretrained_vocab_path=\"./models/test001/test001.vocab\")\n```\n\n`dataset_path` specifies the dataset to be used for finetuning. The paths to model and vocabulary of the pretrained model are specified in `pretrained_model_path` and `pretrained_vocab_path`, respectively. 
\n\nIt is also possible to fine-tune a model on a specified number of examples/rows from `dataset_path` (see the `n_train_examples` argument):\n\n```python\nfrom DeezyMatch import finetune as dm_finetune\n\n# fine-tune a pretrained model stored at pretrained_model_path and pretrained_vocab_path \ndm_finetune(input_file_path=\"./inputs/input_dfm.yaml\", \n            dataset_path=\"dataset/dataset-string-matching_finetune.txt\", \n            model_name=\"finetuned_test001\",\n            pretrained_model_path=\"./models/test001/test001.model\", \n            pretrained_vocab_path=\"./models/test001/test001.vocab\",\n            n_train_examples=100)\n```\n\nThe same can be done via command line:\n\n```bash\nDeezyMatch -i ./inputs/input_dfm.yaml -d dataset/dataset-string-matching_finetune.txt -m finetuned_test001 -f ./models/test001/test001.model -v ./models/test001/test001.vocab -n 100 \n```\n\n---\n\nSummary of the arguments/flags:\n\n| Func. argument            | Command-line flag     | Description                                           |\n|-----------------------    |-------------------    |---------------------------------------------------    |\n| input_file_path           | -i                    | path to the input file                                |\n| dataset_path              | -d                    | path to the dataset                                   |\n| model_name                | -m                    | name of the new, fine-tuned model                     |\n| pretrained_model_path     | -f                    | path to the pretrained model                          |\n| pretrained_vocab_path     | -v                    | path to the pretrained vocabulary                     |\n| ---                       | -pm                   | print all parameters in a model                       |\n| n_train_examples          | -n                    | number of training examples to be used (optional)     |\n\n---\n\n:warning: If `-n` flag (or 
`n_train_examples` argument) is not specified, the train/valid/test proportions are read from the input file.\n\nA new fine-tuned model called `finetuned_test001` will be stored in `models` directory. In this example, two components in the neural network architecture were frozen, that is, not changed during fine-tuning (see `layers_to_freeze` in the input file). When running the above command, DeezyMatch lists the parameters in the model and whether or not they will be used in finetuning:\n\n```\n============================================================\nList all parameters in the model\n============================================================\nemb.weight False\nrnn_1.weight_ih_l0 False\nrnn_1.weight_hh_l0 False\nrnn_1.bias_ih_l0 False\nrnn_1.bias_hh_l0 False\nrnn_1.weight_ih_l0_reverse False\nrnn_1.weight_hh_l0_reverse False\nrnn_1.bias_ih_l0_reverse False\nrnn_1.bias_hh_l0_reverse False\nrnn_1.weight_ih_l1 False\nrnn_1.weight_hh_l1 False\nrnn_1.bias_ih_l1 False\nrnn_1.bias_hh_l1 False\nrnn_1.weight_ih_l1_reverse False\nrnn_1.weight_hh_l1_reverse False\nrnn_1.bias_ih_l1_reverse False\nrnn_1.bias_hh_l1_reverse False\nattn_step1.weight False\nattn_step1.bias False\nattn_step2.weight False\nattn_step2.bias False\nfc1.weight True\nfc1.bias True\nfc2.weight True\nfc2.bias True\n============================================================\n```\n\nThe first column lists the parameters in the model, and the second column specifies if those parameters will be used in the optimization or not. In our example, we set `layers_to_freeze: [\"emb\", \"rnn_1\", \"attn\"]`, so all the parameters except for `fc1` and `fc2` will not be changed during the training.\n\nIt is possible to print all parameters in a model by:\n\n```bash\nDeezyMatch -pm models/finetuned_test001/finetuned_test001.model\n```\n\nwhich generates a similar output as above.\n\n### Model inference\n\nWhen a model is trained/fine-tuned, `inference` module can be used for predictions/model-inference. 
Again, `dataset_path` specifies the dataset to be used for inference, and the paths to model and vocabulary of the pretrained model (the fine-tuned model, in this case) are specified in `pretrained_model_path` and `pretrained_vocab_path`, respectively. \n\n```python\nfrom DeezyMatch import inference as dm_inference\n\n# model inference using a model stored at pretrained_model_path and pretrained_vocab_path \ndm_inference(input_file_path=\"./inputs/input_dfm.yaml\",\n             dataset_path=\"dataset/dataset-string-matching_test.txt\", \n             pretrained_model_path=\"./models/finetuned_test001/finetuned_test001.model\", \n             pretrained_vocab_path=\"./models/finetuned_test001/finetuned_test001.vocab\")\n```\n\nSimilarly via command line:\n\n```bash\nDeezyMatch --deezy_mode inference -i ./inputs/input_dfm.yaml -d dataset/dataset-string-matching_test.txt -m ./models/finetuned_test001/finetuned_test001.model  -v ./models/finetuned_test001/finetuned_test001.vocab  -mode test\n```\n\n---\n\nSummary of the arguments/flags:\n\n| Func. 
argument            | Command-line flag     | Description                                                                       |\n|-----------------------    |-------------------    |---------------------------------------------- |\n| input_file_path           | -i                    | path to the input file                                                            |\n| dataset_path              | -d                    | path to the dataset                                                               |\n| pretrained_model_path     | -m                    | path to the pretrained model                                                      |\n| pretrained_vocab_path     | -v                    | path to the pretrained vocabulary                                                 |\n| inference_mode            | -mode                 | two options:\u003cbr\u003etest (inference, default),\u003cbr\u003evect (generate vector representations)  |\n| scenario                  | -sc, --scenario       | name of the experiment top-directory (only for `inference_mode='vect'`)                                               |\n| cutoff                    | -n                    | number of examples to be used (optional)                                          |\n\n---\n\nThe inference component creates a file: `models/finetuned_test001/pred_results_dataset-string-matching_test.txt` in which:\n\n```\n# s1    s2  prediction  p0  p1  label\nge lan bo er da Грюиссан    0   0.5223  0.4777  0\nLeirvikskjæret  Leirvikskjeret  1   0.4902  0.5098  1\nShih-pa-li-p'u  P'u-k'ou-chen   1   0.4954  0.5046  0\nTopoli  Topolinski  0   0.5248  0.4752  1\n```\n\n`p0` and `p1` are the probabilities assigned to labels 0 and 1 respectively, `prediction` is the predicted label, and `label` is the true label.\n\nFor example, in the first row (\"ge lan bo er da\" and \"Грюиссан\") the actual label is 0 and the predicted label is 0. The model confidence on the predicted label is `p0=0.5223`. 
In this example, DeezyMatch correctly predicts the label in the first two rows, and predicts it incorrectly in the last two rows. \n\n:bangbang: It should be noted, in this example and for showcasing DeezyMatch's functionalities, the dataset used to train the above model was very small (~5000 rows for training, ~2500 rows for fine-tuning). In practice, we use \u003cins\u003elarger datasets\u003c/ins\u003e for training. Whereas the larger the better, the optimal minimum size of a dataset will depend on many factors. You can find some tips and recommendations [in this document](https://github.com/LinkedPasts/LaNC-workshop/blob/main/deezymatch/recommendations.md).\n\n### Generate query and candidate vectors\n\nThe `inference` module can also be used to generate vector representations for a set of strings in a dataset. This is **a required step for candidate selection and ranking** (which we will [discuss later](#candidate-ranking)).\n\nWe first create vector representations for **candidate** mentions (we assume the candidate mentions are stored in `dataset/dataset-candidates.txt`):\n\n```python\nfrom DeezyMatch import inference as dm_inference\n\n# generate vectors for candidates (specified in dataset_path) \n# using a model stored at pretrained_model_path and pretrained_vocab_path \ndm_inference(input_file_path=\"./inputs/input_dfm.yaml\",\n             dataset_path=\"dataset/dataset-candidates.txt\", \n             pretrained_model_path=\"./models/finetuned_test001/finetuned_test001.model\", \n             pretrained_vocab_path=\"./models/finetuned_test001/finetuned_test001.vocab\",\n             inference_mode=\"vect\",\n             scenario=\"candidates/test\")\n```\n\nCompared to the previous section, here we have two additional arguments: \n* `inference_mode=\"vect\"`: generate vector representations for the **first column** in `dataset_path`.\n* `scenario`: directory to store the vectors.\n\nThe same can be done via command line:\n\n```bash\nDeezyMatch 
--deezy_mode inference -i ./inputs/input_dfm.yaml -d dataset/dataset-candidates.txt -m ./models/finetuned_test001/finetuned_test001.model -v ./models/finetuned_test001/finetuned_test001.vocab -mode vect --scenario candidates/test\n```\n\n---\n\nFor summary of the arguments/flags, refer to the table in [model inference](#model-inference). \n\n---\n\nThe resulting directory structure is:\n\n```\ncandidates/\n└── test\n    ├── dataframe.df\n    ├── embeddings\n    │   ├── rnn_bwd_0\n    │   ├── rnn_fwd_0\n    │   ├── rnn_indxs_0\n    │   ├── rnn_bwd_1\n    │   ├── rnn_fwd_1\n    │   ├── rnn_indxs_1\n    │   └── ...\n    ├── input_dfm.yaml\n    └── log.txt\n```\n\nThe `embeddings` directory contains all the vector representations.\n\nWe can do the same for `queries`, using the `dataset-queries.txt` dataset:\n\n```python\nfrom DeezyMatch import inference as dm_inference\n\n# generate vectors for queries (specified in dataset_path) \n# using a model stored at pretrained_model_path and pretrained_vocab_path \ndm_inference(input_file_path=\"./inputs/input_dfm.yaml\",\n             dataset_path=\"dataset/dataset-queries.txt\", \n             pretrained_model_path=\"./models/finetuned_test001/finetuned_test001.model\", \n             pretrained_vocab_path=\"./models/finetuned_test001/finetuned_test001.vocab\",\n             inference_mode=\"vect\",\n             scenario=\"queries/test\")\n```\n\nor via command line:\n\n```bash\nDeezyMatch --deezy_mode inference -i ./inputs/input_dfm.yaml -d dataset/dataset-queries.txt -m ./models/finetuned_test001/finetuned_test001.model -v ./models/finetuned_test001/finetuned_test001.vocab -mode vect --scenario queries/test\n```\n\n---\n\nFor summary of the arguments/flags, refer to the table in [model inference](#model-inference). 
\n\n---\n\nThe resulting directory structure is:\n\n```\nqueries/\n└── test\n    ├── dataframe.df\n    ├── embeddings\n    │   ├── rnn_bwd_0\n    │   ├── rnn_fwd_0\n    │   ├── rnn_indxs_0\n    │   ├── rnn_bwd_1\n    │   ├── rnn_fwd_1\n    │   ├── rnn_indxs_1\n    │   └── ...\n    ├── input_dfm.yaml\n    └── log.txt\n```\n\n:warning: Note that DeezyMatch's candidate ranker can be used on-the-fly, which means that we do not need to have a precomputed set of query vectors when ranking the candidates (they are generated on the spot). The previous step (query vector generation) can therefore be skipped if DeezyMatch is used on-the-fly.\n\n### Combine vector representations \n\nThis step is required if the query or candidate vector representations are stored in several files (\u003cins\u003enormally the case!\u003c/ins\u003e). The `combine_vecs` module assembles those vectors and stores the results in `output_scenario` (see function below).\n\nFor candidate vectors:\n\n```python\nfrom DeezyMatch import combine_vecs\n\n# combine vectors stored in candidates/test and save them in combined/candidates_test\ncombine_vecs(rnn_passes=['fwd', 'bwd'], \n             input_scenario='candidates/test', \n             output_scenario='combined/candidates_test', \n             print_every=10)\n```\n\nSimilarly, for query vectors (this step should be skipped with on-the-fly ranking):\n\n```python\nfrom DeezyMatch import combine_vecs\n\n# combine vectors stored in queries/test and save them in combined/queries_test\ncombine_vecs(rnn_passes=['fwd', 'bwd'], \n             input_scenario='queries/test', \n             output_scenario='combined/queries_test', \n             print_every=10)\n```\n\nHere, `rnn_passes` specifies that `combine_vecs` should assemble all vectors generated in the forward and backward RNN/GRU/LSTM passes (which are stored in the `input_scenario` directory).\n\n**NOTE:** we have a backward pass only if `bidirectional` is set to `True` in the input 
file.\n\nThe results (for both query and candidate vectors) are stored in the `output_scenario` as follows:\n\n```bash\ncombined/\n├── candidates_test\n│   ├── bwd_id.pt\n│   ├── bwd_items.npy\n│   ├── bwd.pt\n│   ├── fwd_id.pt\n│   ├── fwd_items.npy\n│   ├── fwd.pt\n│   └── input_dfm.yaml\n└── queries_test\n    ├── bwd_id.pt\n    ├── bwd_items.npy\n    ├── bwd.pt\n    ├── fwd_id.pt\n    ├── fwd_items.npy\n    ├── fwd.pt\n    └── input_dfm.yaml\n```\n\nThe above steps can be done via command line, for candidate vectors:\n\n```bash\nDeezyMatch --deezy_mode combine_vecs -p fwd,bwd -sc candidates/test -combs combined/candidates_test\n```\n\nFor query vectors:\n\n```bash\nDeezyMatch --deezy_mode combine_vecs -p fwd,bwd -sc queries/test -combs combined/queries_test\n```\n\n---\n\nSummary of the arguments/flags:\n\n| Func. argument    | Command-line flag     | Description                                                                                                   |\n|-----------------  |-------------------    |-------------------------------------------------------------------------------------------------------------  |\n| rnn_passes        | -p                    | RNN/GRU/LSTM passes to be used in assembling vectors (fwd or bwd)                                             |\n| input_scenario    | -sc                   | name of the input top-directory                                                                               |\n| output_scenario   | -combs                | name of the output top-directory                                                                              |\n| input_file_path   | -i                    | path to the input file. 
\"default\": read input file in `input_scenario`                    |\n| print_every       | ---                   | interval to print the progress in assembling vectors                                                          |\n| sel_device        | ---                   | set the device (cpu, cuda, cuda:0, cuda:1, ...).\u003cbr\u003eif \"default\", the device will be read from the input file.    |\n| save_df           | ---                   | save strings of the first column in queries/candidates files (default: True)                                  |\n\n\n### Candidate ranking\n\nCandidate ranker uses the vector representations, generated and assembled in the previous sections, to find a set of candidates (from a dataset or knowledge base) for a given set of queries.\n\nIn the following example, for queries stored in `query_scenario`, we want to find 2 candidates (specified by `num_candidates`) from the candidates stored in `candidate_scenario` (i.e. `candidate_scenario` and `query_scenario` point to the outputs from combining vector representations).\n\n\u003e :warning: As mentioned, it is also possible to do candidate ranking on-the-fly in which query vectors are not read from a dataset (instead, they are generated on-the-fly). 
This is described [in the next subsection](#candidate-ranking-on-the-fly).\n\n```python\nfrom DeezyMatch import candidate_ranker\n\n# Select candidates based on L2-norm distance (aka faiss distance):\n# find candidates from candidate_scenario \n# for queries specified in query_scenario\ncandidates_pd = \\\n    candidate_ranker(query_scenario=\"./combined/queries_test\",\n                     candidate_scenario=\"./combined/candidates_test\", \n                     ranking_metric=\"faiss\", \n                     selection_threshold=5., \n                     num_candidates=2, \n                     search_size=2, \n                     output_path=\"ranker_results/test_candidates_deezymatch\", \n                     pretrained_model_path=\"./models/finetuned_test001/finetuned_test001.model\", \n                     pretrained_vocab_path=\"./models/finetuned_test001/finetuned_test001.vocab\", \n                     number_test_rows=20)\n```\n\nIn this example, `query_scenario` points to the directory that contains all the assembled query vectors (see previous section on [combining vector representations](#combine-vector-representations)) while `candidate_scenario` points to the directory that contains the assembled candidate vectors.\n\nThe retrieval of candidates is performed based on several parameters (`ranking_metric`, `num_candidates`, `selection_threshold` and `search_size` in the example), and using the DeezyMatch model specified in `pretrained_model_path` and using the vocabulary specified in `pretrained_vocab_path`. The output (a dataframe) is saved in the directory specified in `output_path`, but can also be accessed directly from the `candidates_pd` variable. The first few rows of the resulting dataframe are:\n\n```bash\n                                query                                         pred_score                                       1-pred_score  ...                             
candidate_original_ids query_original_id num_all_searches\nid                                                                                                                                           ...                                                                                      \n0                          Lamdom Noi  {'la dom nxy': 0.5271, 'Ouâdi ech Chalta': 0.5...  {'la dom nxy': 0.4729, 'Ouâdi ech Chalta': 0.4...  ...          {'la dom nxy': 0, 'Ouâdi ech Chalta': 34}                 0                2\n1                              Krutoi          {'Krutoy': 0.5236, 'Engeskjæran': 0.4956}  {'Krutoy': 0.47640000000000005, 'Engeskjæran':...  ...                    {'Krutoy': 1, 'Engeskjæran': 6}                 1                2\n2                          Sharuniata          {'Sharunyata': 0.5296, 'Ndetach': 0.5272}  {'Sharunyata': 0.47040000000000004, 'Ndetach':...  ...                   {'Sharunyata': 2, 'Ndetach': 19}                 2                2\n3                         Su Tang Cun  {'Sutangcun': 0.5193, 'sthani xnamay ban hnxng...  {'Sutangcun': 0.4807, 'sthani xnamay ban hnxng...  ...  {'Sutangcun': 3, 'sthani xnamay ban hnxngphi':...                 3                2\n4                       Jowkare Shafi            {'Anfijld': 0.5022, 'Ljublino': 0.5097}  {'Anfijld': 0.4978, 'Ljublino': 0.490299999999...  ...                    {'Anfijld': 10, 'Ljublino': 39}                 4                2\n5   Rongrian Ban Huai Huak Chom Thong  {'rongreiyn ban hnxngbawdæng': 0.4975, 'rongre...  {'rongreiyn ban hnxngbawdæng': 0.5025, 'rongre...  ...  {'rongreiyn ban hnxngbawdæng': 35, 'rongreiyn ...                 5                2\n```\n\nAs mentioned, the retrieval of candidates is based on several parameters:\n* **Ranking metric** (`ranking_metric`): The metric used to rank the candidates based on their vectors. 
Choices are:\n  * `faiss`: L2-norm distance, as implemented in the `faiss` library.\n  * `cosine`: cosine distance.\n  * `conf`: confidence as measured by DeezyMatch prediction outputs.\n* **Selection threshold** (`selection_threshold`): The cutoff for accepting a candidate. Its meaning depends on the ranking metric that has been specified. A candidate will be selected in the following cases:\n  ```text\n  faiss-distance \u003c= selection_threshold\n  cosine-distance \u003c= selection_threshold\n  prediction-confidence \u003e= selection_threshold\n  ```\n  :bangbang: In `conf` (i.e., prediction-confidence), the threshold corresponds to the **minimum** accepted value, while in `faiss` and `cosine` metrics, the threshold is the **maximum** accepted value.\n  :bangbang: The `cosine` and `conf` scores lie in [0, 1], while the `faiss` distance can take any value in [0, +\u0026#8734;).\n* **Calculate prediction** (`calc_predict`): If the selected ranking metric is `faiss` or `cosine`, you can choose to skip prediction (by setting `calc_predict` to `False`), which speeds up the ranking significantly.\n* **Search size** (`search_size`): Unless `calc_predict` is set to `False` (and therefore the prediction step is skipped during ranking), for a given query, DeezyMatch searches for candidates iteratively. At each iteration, the selected ranking metric is computed between the query and a batch of `search_size` candidates. If the number of desired candidates (specified by `num_candidates`) is not reached, a new batch of `search_size` candidates is tested in the next iteration. This continues until `num_candidates` candidates are found or all the candidates have been tested. 
If the role of `search_size` argument is not clear, refer to [Tips / Suggestions on DeezyMatch functionalities](#tips--suggestions-on-deezymatch-functionalities).\n* **Maximum length difference** (`length_diff`): Finally, you can also specify the maximum length difference allowed between the query and the retrieved candidate strings, which may be a useful feature for certain applications.\n\nFinally, **only for testing**, you can use `number_test_rows`. It specifies the number of queries to be used for testing.\n\nThe above results can be generated via command line as well:\n\n```bash\nDeezyMatch --deezy_mode candidate_ranker -qs ./combined/queries_test -cs ./combined/candidates_test -rm faiss -t 5 -n 2 -sz 2 -o ranker_results/test_candidates_deezymatch -mp ./models/finetuned_test001/finetuned_test001.model -v ./models/finetuned_test001/finetuned_test001.vocab -tn 20\n```\n\n---\n\nSummary of the arguments/flags:\n\n| Func. argument        \t| Command-line flag \t| Description                                                                                                                                                                 \t|\n|-----------------------\t|-------------------\t|-------------------------------------------------------------------------------------------------\t|\n| query_scenario        \t| -qs               \t| directory that contains all the assembled query vectors                                                                                                                     \t|\n| candidate_scenario    \t| -cs               \t| directory that contains all the assembled candidate vectors                                                                                                                 \t|\n| ranking_metric        \t| -rm               \t| choices are\u003cbr\u003e`faiss` (used here, L2-norm distance),\u003cbr\u003e`cosine` (cosine distance),\u003cbr\u003e`conf` (confidence as measured by DeezyMatch prediction outputs)      
                     \t|\n| selection_threshold   \t| -t                \t| changes according to the `ranking_metric`, a candidate will be selected if:\u003cbr\u003efaiss-distance \u003c= selection_threshold,\u003cbr\u003ecosine-distance \u003c= selection_threshold,\u003cbr\u003eprediction-confidence \u003e= selection_threshold \t|\n| query                 \t| -q                \t| one string or a list of strings to be used in candidate ranking on-the-fly                                                                                                  \t|\n| num_candidates        \t| -n                \t| number of desired candidates                                                                                                                                                \t|\n| search_size           \t| -sz               \t| number of candidates to be tested at each iteration                                                                                                                         \t|\n| length_diff           \t| -ld               \t| max length difference allowed between query and candidate strings                                                                                                                         \t|\n| calc_predict           \t| -up               \t| whether to calculate prediction (i.e., model inference) or not                                                                                                                         \t|\n| calc_cosine           \t| -cc               \t| whether to calculate cosine similarity or not                                                                                                                         \t|\n| output_path           \t| -o                \t| path to the output file                                                                                                                                                     \t|\n| pretrained_model_path \t| -mp               \t| path 
to the pretrained model                                                                                                                                                \t|\n| pretrained_vocab_path \t| -v                \t| path to the pretrained vocabulary                                                                                                                                           \t|\n| input_file_path       \t| -i                \t| path to the input file. \"default\": read input file in `candidate_scenario`                                    \t|\n| number_test_rows      \t| -tn               \t| number of examples to be used (optional, normally for testing)                                                                                                              \t|\n\n---\n\n**Other methods**\n\n* Select candidates based on cosine distance:\n\n```python\nfrom DeezyMatch import candidate_ranker\n\n# Select candidates based on cosine distance:\n# find candidates from candidate_scenario \n# for queries specified in query_scenario\ncandidates_pd = \\\n    candidate_ranker(query_scenario=\"./combined/queries_test\",\n                     candidate_scenario=\"./combined/candidates_test\", \n                     ranking_metric=\"cosine\", \n                     selection_threshold=0.49, \n                     num_candidates=2, \n                     search_size=2, \n                     output_path=\"ranker_results/test_candidates_deezymatch\", \n                     pretrained_model_path=\"./models/finetuned_test001/finetuned_test001.model\", \n                     pretrained_vocab_path=\"./models/finetuned_test001/finetuned_test001.vocab\", \n                     number_test_rows=20)\n```\n\nNote that the only differences compared to the previous command are `ranking_metric=\"cosine\"` and `selection_threshold=0.49`.\n\n### Candidate ranking on-the-fly\n\nFor a list of input strings (specified in `query` argument), DeezyMatch can rank candidates (stored 
in `candidate_scenario`) on-the-fly. Here, DeezyMatch generates and assembles the vector representations of strings in `query` on-the-fly.\n\n```python\nfrom DeezyMatch import candidate_ranker\n\n# Ranking on-the-fly\n# find candidates from candidate_scenario \n# for queries specified by the `query` argument\ncandidates_pd = \\\n    candidate_ranker(candidate_scenario=\"./combined/candidates_test\",\n                     query=[\"DeezyMatch\", \"kasra\", \"fede\", \"mariona\"],\n                     ranking_metric=\"faiss\", \n                     selection_threshold=5., \n                     num_candidates=1, \n                     search_size=100, \n                     output_path=\"ranker_results/test_candidates_deezymatch_on_the_fly\", \n                     pretrained_model_path=\"./models/finetuned_test001/finetuned_test001.model\", \n                     pretrained_vocab_path=\"./models/finetuned_test001/finetuned_test001.vocab\", \n                     number_test_rows=20)\n```\n\nThe candidate ranker can be initialised, to be used multiple times, by running:\n\n```python\nfrom DeezyMatch import candidate_ranker_init\n\n# initializing candidate_ranker via candidate_ranker_init\nmyranker = candidate_ranker_init(candidate_scenario=\"./combined/candidates_test\",\n                                 query=[\"DeezyMatch\", \"kasra\", \"fede\", \"mariona\"],\n                                 ranking_metric=\"faiss\", \n                                 selection_threshold=5., \n                                 num_candidates=1, \n                                 search_size=100, \n                                 output_path=\"ranker_results/test_candidates_deezymatch_on_the_fly\", \n                                 pretrained_model_path=\"./models/finetuned_test001/finetuned_test001.model\", \n                                 pretrained_vocab_path=\"./models/finetuned_test001/finetuned_test001.vocab\", \n                                 
number_test_rows=20)\n```\n\nThe content of `myranker` can be printed by:\n\n```python\nprint(myranker)\n```\n\nwhich results in:\n\n```bash\n-------------------------\n* Candidate ranker params\n-------------------------\n\nQueries are based on the following list:\n['DeezyMatch', 'kasra', 'fede', 'mariona']\n\ncandidate_scenario:     ./combined/candidates_test\n---Searching params---\nnum_candidates:         1\nranking_metric:         faiss\nselection_threshold:    5.0\nsearch_size:            100\nnumber_test_rows:       20\n---I/O---\ninput_file_path:        default (path: ./combined/candidates_test/input_dfm.yaml)\noutput_path:            ranker_results/test_candidates_deezymatch_on_the_fly\npretrained_model_path:  ./models/finetuned_test001/finetuned_test001.model\npretrained_vocab_path:  ./models/finetuned_test001/finetuned_test001.vocab\n```\n\nTo rank the queries:\n\n```python\nmyranker.rank()\n```\n\nThe results are stored in:\n\n```python\nmyranker.output\n```\n\nAll the query-related parameters can be changed via `set_query` method, for example:\n\n```python\nmyranker.set_query(query=[\"another_example\"])\n```\n\nother parameters include: \n```bash\nquery\nquery_scenario\nranking_metric\nselection_threshold\nnum_candidates\nsearch_size\nnumber_test_rows\noutput_path\n```\n\nAgain, we can rank the candidates for the new query by:\n\n```python\nmyranker.rank()\n# to access output:\nmyranker.output\n```\n\n## Tips / Suggestions on DeezyMatch functionalities\n\n### Candidate ranker\n\n* As already mentioned, based on our experiments, `conf` is not a good metric for ranking candidates. Consider using `faiss` or `cosine`.\n\n* Adding prefix/suffix to input strings (see `prefix_suffix` option in the input file) can greatly enhance the ranking results. 
However, we recommend a one-character-long prefix/suffix (for example '\u003c' and '\u003e'); longer ones may affect the computation time.\n\n* In `candidate_ranker`, the user specifies a `ranking_metric` based on which the candidates are selected and ranked. However, DeezyMatch also reports the values of the other metrics for those candidates. For example, if the user selects `ranking_metric=\"faiss\"`, the candidates are selected based on the `faiss`-distance metric. At the same time, the values of `cosine` and `conf` metrics for **those candidates** (ranked according to the selected metric, in this case faiss) are also reported.\n\n* What is the role of `search_size` in candidate ranker?\n  \n  For a given query, DeezyMatch searches for candidates iteratively. If we set `search_size` to **five**, at each iteration (i.e., one colored region in the figure below), the query vector is compared with **five** potential candidate vectors according to the selected ranking metric (`ranking_metric` argument). In step-1, the five closest candidate vectors, as measured by L2-norm distance, are examined. Four (out of five) candidate vectors pass the threshold (specified by `selection_threshold` argument) in the figure (step-1). However, in this example, we assume `num_candidates` is 10. So, DeezyMatch examines the second batch of potential candidates, again five vectors (as specified by `search_size`). Three (out of five) candidates pass the threshold in step-2. Finally, in the third iteration, three more candidates are found. DeezyMatch collects the information of these ten candidates and goes to the next query.\n\n  This adaptive search algorithm significantly reduces the computation time to find and rank a set of candidates in a (large) dataset. 
\nInstead of searching the whole dataset, DeezyMatch iteratively compares a query vector with the \"most-promising\" candidates.\n\n  \u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/Living-with-machines/DeezyMatch/master/figs/query_candidate_selection.png\" alt=\"role of search_size in candidate ranker\" width=\"70%\"\u003e\n  \u003c/p\u003e\n\n  In most use cases, `search_size` can be set `\u003e= num_candidates`. However, if `num_candidates` is very large, it is better to set the `search_size` to a lower value.\n  \n  Let's clarify this with an example. First, assume `num_candidates=4` (number of desired candidates is 4 for each query). If we set `search_size` to a value less than 4, say 2, DeezyMatch needs to do at least two iterations. In the first iteration, it looks at the closest 2 candidate vectors (as `search_size` is 2). In the second iteration, candidate vectors 3 and 4 are examined. Another choice is `search_size=4`. Here, DeezyMatch looks at 4 candidates in the first iteration; if they all pass the threshold, the process concludes. If not, it searches candidates 5-8 in the next iteration. Now, let's assume `num_candidates=1001` (i.e., number of desired candidates is 1001 for each query). If we set `search_size=1000`, DeezyMatch has to search at least 2000 candidates (2 x 1000 `search_size`). If we set `search_size=100`, DeezyMatch has to search at least 1100 candidates (11 x 100 `search_size`), that is, 900 fewer vectors. 
In the end, it is a trade-off between iterations and `search_size`.\n\n## How to cite DeezyMatch\n\nPlease consider acknowledging DeezyMatch if it helps you to obtain results and figures for publications or presentations, by citing:\n\nACL link: https://www.aclweb.org/anthology/2020.emnlp-demos.9/\n\n```text\nKasra Hosseini, Federico Nanni, and Mariona Coll Ardanuy (2020), DeezyMatch: A Flexible Deep Learning Approach to Fuzzy String Matching, In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 62–69. Association for Computational Linguistics.\n```\n\nand in BibTeX:\n\n```bibtex\n@inproceedings{hosseini-etal-2020-deezymatch,\n    title = \"{D}eezy{M}atch: A Flexible Deep Learning Approach to Fuzzy String Matching\",\n    author = \"Hosseini, Kasra  and\n      Nanni, Federico  and\n      Coll Ardanuy, Mariona\",\n    booktitle = \"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations\",\n    month = oct,\n    year = \"2020\",\n    address = \"Online\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/2020.emnlp-demos.9\",\n    pages = \"62--69\",\n    abstract = \"We present DeezyMatch, a free, open-source software library written in Python for fuzzy string matching and candidate ranking. Its pair classifier supports various deep neural network architectures for training new classifiers and for fine-tuning a pretrained model, which paves the way for transfer learning in fuzzy string matching. This approach is especially useful where only limited training examples are available. The learned DeezyMatch models can be used to generate rich vector representations from string inputs. The candidate ranker component in DeezyMatch uses these vector representations to find, for a given query, the best matching candidates in a knowledge base. 
It uses an adaptive searching algorithm applicable to large knowledge bases and query sets. We describe DeezyMatch{'}s functionality, design and implementation, accompanied by a use case in toponym matching and candidate ranking in realistic noisy datasets.\",\n}\n```\n\nThe results presented in this paper were generated by [DeezyMatch v1.2.0 (Released: Sep 15, 2020)](https://github.com/Living-with-machines/DeezyMatch/releases/tag/v1.2.0).\n\nYou can [reproduce Fig. 2 of DeezyMatch's paper, EMNLP2020, here.](./figs/EMNLP2020_figures/fig2) \n\n## Credits\n\nThis project extensively uses the ideas/neural-network-architecture published in https://github.com/ruipds/Toponym-Matching. \n\nThis work was supported by Living with Machines (AHRC grant AH/S01179X/1) and The Alan Turing Institute (EPSRC grant EP/N510129/1).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fliving-with-machines%2Fdeezymatch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fliving-with-machines%2Fdeezymatch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fliving-with-machines%2Fdeezymatch/lists"}