{"id":17596009,"url":"https://github.com/superbooming/simtester","last_synced_at":"2025-04-30T04:49:49.043Z","repository":{"id":144671764,"uuid":"558723780","full_name":"Superbooming/simtester","owner":"Superbooming","description":"Simtester is an open-source toolkit for evaluating user simulator of task-oriented dialogue system(TOD).","archived":false,"fork":false,"pushed_at":"2023-02-22T09:00:44.000Z","size":418,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-30T04:49:42.313Z","etag":null,"topics":["dialog","dialogue-systems","pytorch","simulator","task-oriented-dialogue"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Superbooming.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-10-28T06:37:49.000Z","updated_at":"2025-01-30T04:58:20.000Z","dependencies_parsed_at":"2023-12-03T14:46:52.277Z","dependency_job_id":null,"html_url":"https://github.com/Superbooming/simtester","commit_stats":{"total_commits":16,"total_committers":2,"mean_commits":8.0,"dds":0.0625,"last_synced_commit":"9583b0a5070702ec4babdfc10d90b4b41480d876"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Superbooming%2Fsimtester","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Superbooming%2Fsimtester/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Superbooming%2Fsimtester/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/Superbooming%2Fsimtester/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Superbooming","download_url":"https://codeload.github.com/Superbooming/simtester/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251644827,"owners_count":21620630,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dialog","dialogue-systems","pytorch","simulator","task-oriented-dialogue"],"created_at":"2024-10-22T08:07:09.047Z","updated_at":"2025-04-30T04:49:49.036Z","avatar_url":"https://github.com/Superbooming.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Simtester\n\n[![Pypi Latest Version](https://img.shields.io/pypi/v/simtester)](https://pypi.org/project/simtester)\n[![License](https://img.shields.io/badge/License-MIT-blue.svg)](./LICENSE)\n[![arXiv](https://img.shields.io/badge/arXiv-Simtester-%23B21B1B)](https://arxiv.org/abs/2204.00763)\n\n[comment]: \u003c\u003e ([![Release]\u0026#40;https://img.shields.io/github/v/release/rucaibox/crslab.svg\u0026#41;]\u0026#40;https://github.com/rucaibox/crslab/releases\u0026#41;)\n\n[comment]: \u003c\u003e ([![Documentation Status]\u0026#40;https://readthedocs.org/projects/crslab/badge/?version=latest\u0026#41;]\u0026#40;https://crslab.readthedocs.io/en/latest/?badge=latest\u0026#41;)\n\n[comment]: \u003c\u003e (| [Docs]\u0026#40;https://crslab.readthedocs.io/en/latest/?badge=latest\u0026#41;)\n\n[comment]: \u003c\u003e (| [中文版]\u0026#40;./README_CN.md\u0026#41;)\n\n**Simtester** is an open-source 
toolkit for evaluating user simulators for task-oriented dialogue (TOD) systems. It is\nbuilt on Python and PyTorch. With Simtester you can easily construct agents (system variants) from either your\nown model or one of our trained models, and construct a tester from different combinations of agents. In tester-based\nevaluation, the user simulator you implement interacts with the agents in the tester and ranks them, and the tester calculates\nthe Exact Distinct score of the user simulator. [[paper]](https://arxiv.org/pdf/2204.00763.pdf)\n\n![Simtester](resource/fig/tester.png)\n\n- [Installation](#Installation)\n- [Quick-Start](#Quick-Start)\n- [Contributions](#Contributions)\n- [Citing](#Citing)\n- [Team](#Team)\n- [License](#License)\n\n[comment]: \u003c\u003e (## Updates)\n\n[comment]: \u003c\u003e (2022.10.28:)\n\n[comment]: \u003c\u003e (-Add )\n\n## Installation\n\nSimtester works with the following operating systems:\n\n- Linux\n- Windows 10\n- macOS X\n\nSimtester requires Python version 3.6 or later.\n\n[comment]: \u003c\u003e (Simtester requires torch version 1.4.0 or later. If you want to use CRSLab with GPU, please ensure that CUDA or CUDAToolkit version is 9.2 or later. 
Please use the combinations shown in this [Link]\u0026#40;https://pytorch-geometric.com/whl/\u0026#41; to ensure the normal operation of PyTorch Geometric.)\n\nYou can install Simtester with pip (the version currently on PyPI is outdated and will be updated soon):\n\n```bash\npip install simtester\n```\n\nOr you can install Simtester from source:\n\n```bash\ngit clone https://github.com/Superbooming/simtester \u0026\u0026 cd simtester\npip install -e .\n```\n\n## Quick-Start\n\nTo construct an agent from your own model, pass your model instance `your_model` and your interaction function\n`your_model_interact_fn`, which takes the dialogue context as input and returns the response.\n\nTo construct an agent from one of our trained models, you only need to pass the model name; the model archive file and\nmodel state file are downloaded automatically. The remaining configuration is loaded from the default config\nfile (in `simtester/config/multiwoz/`). Alternatively, pass the path to your own config file (.yaml format), which\nshould include the model name and the corresponding model directory. See the config file example\nin `simtester/config/multiwoz/soloist/soloist-base.yaml`.\n\n```python\n# construct an agent using your own implemented model\nfrom simtester import Agent\nagent = Agent(your_model, your_model_interact_fn)\n\n# construct an agent using our trained model\nfrom simtester import SoloistAgent\nagent = SoloistAgent('multiwoz-soloist-base') # pass the model name\nagent = SoloistAgent(config='simtester/config/multiwoz/soloist-base.yaml') # or pass a config path\n\n# interact with the agent and get the response\nresponse = agent.interact('I want to book a restaurant in the center of the city.')\n\n# get the dialogue context and start a new dialogue\nagent.get_context()\nagent.start_dialogue()\n```\n\nTo construct a tester from your own agents, pass a list of your agent instances and the ground-truth agent ranks,\nwhere a smaller number indicates a higher rank. If the ranks are omitted, they default to incrementing numbers in list order.\n\nTo construct a tester with a strategy we provide, you only need to pass the strategy name.\n\nWhen you interact with the tester, you can input either an utterance list or the rank your simulator predicts. If an\nutterance list is input, the tester returns a response list containing each agent's response and a\ndialogue state list indicating whether each agent has ended the dialogue. You can set the tester's `end_token`; when an\nagent receives a response that includes the `end_token`, that agent's dialogue ends. If the rank your simulator\npredicts is input, the tester compares the ground-truth rank with the input rank and returns the result. Note that\ninputting the rank ends the dialogue for all agents and starts a new round of dialogue.\n\n```python\nfrom simtester import Tester\n\n# construct a tester with your own agents\ntester = Tester([agentA, agentB, agentC], [2, 1, 3])\n\n# construct a tester with a strategy we provide\ntester = Tester('multiwoz-context-tester')\n\n# interact with the tester\ntester.interact(['I want to book a restaurant.',\n                 'I want to book a hotel.',\n                 'I want to find an attraction.'])\ntester.interact(rank=[2, 3, 1])\n\n# get current tester information and the Exact Distinct score\ntester.get_info()\ntester.get_score()\n\n# reset the tester\ntester.reset()\n```\n\n## Contributions\n\nPlease let us know if you encounter a bug or have any suggestions\nby [filing an issue](https://github.com/Superbooming/simtester/issues).\n\nWe welcome all contributions, from bug fixes to new features and extensions.\n\nWe expect all contributions to be discussed in the issue tracker and submitted as pull requests.\n\n## Citing\n\nIf you find Simtester useful for your research or development, please cite\nour [paper](https://arxiv.org/pdf/2204.00763.pdf):\n\n```\n@article{sun2022metaphorical,\n  title={Metaphorical User Simulators for Evaluating Task-oriented Dialogue Systems},\n  author={Sun, Weiwei and Guo, Shuyu and Zhang, Shuo and Ren, Pengjie and Chen, Zhumin and de Rijke, Maarten and Ren, Zhaochun},\n  journal={arXiv preprint arXiv:2204.00763},\n  year={2022}\n}\n```\n\n## Team\n\n**Simtester** is developed and maintained by the Shandong University IR Lab.\n\n## License\n\n**Simtester** is released under the [MIT License](./LICENSE).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsuperbooming%2Fsimtester","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsuperbooming%2Fsimtester","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsuperbooming%2Fsimtester/lists"}