{"id":20690291,"url":"https://github.com/merck/matcher-mmpdb","last_synced_at":"2026-03-07T20:04:43.850Z","repository":{"id":90243157,"uuid":"536669477","full_name":"Merck/matcher-mmpdb","owner":"Merck","description":null,"archived":false,"fork":false,"pushed_at":"2025-03-27T13:19:11.000Z","size":420,"stargazers_count":7,"open_issues_count":0,"forks_count":2,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-22T17:07:32.078Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Merck.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-09-14T16:33:45.000Z","updated_at":"2025-03-21T16:23:43.000Z","dependencies_parsed_at":"2025-03-26T10:40:17.208Z","dependency_job_id":null,"html_url":"https://github.com/Merck/matcher-mmpdb","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Merck/matcher-mmpdb","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Merck%2Fmatcher-mmpdb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Merck%2Fmatcher-mmpdb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Merck%2Fmatcher-mmpdb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Merck%2Fmatcher-mmpdb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Merck","download_url":"https://codeload.github.com/Merck/matcher-mmpdb/tar.gz/ref
s/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Merck%2Fmatcher-mmpdb/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30229589,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-07T19:01:10.287Z","status":"ssl_error","status_checked_at":"2026-03-07T18:59:58.103Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-16T23:12:32.461Z","updated_at":"2026-03-07T20:04:43.827Z","avatar_url":"https://github.com/Merck.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"This version of mmpdb was derived from mmpdb 2.2-dev1, and includes the following new features to support the [matcher](https://github.com/Merck/matcher) application:\n\n* Support for PostgreSQL\n* Integration with RDKit cartridge for PostgreSQL, including formatting of molecules for robust substructure searching\n* Extension of the mmpdb database schema to enable novel query strategies, as manifested in the [matcher](https://github.com/Merck/matcher) application\n\n# mmpdb 2.2-dev1 - matched molecular pair database generation and analysis\n\n\n## Synopsis\n\n\nA package to identify matched molecular pairs and use them to predict\nproperty changes.\n\n\n------------------\n\n\n## Requirements\n\n\nThe package has been tested on both Python 2.7 and Python 3.6.\n\nYou will 
need a copy of the RDKit cheminformatics toolkit, available\nfrom http://rdkit.org/ . Apart from other standard scientific Python \nlibraries like scipy and numpy, this is the only required third-party\ndependency for normal operation, though several optional third-party\npackages may be used if available.\n\n  - The matched molecular pairs are stored in a SQLite database. The\nAPSW module from https://github.com/rogerbinns/apsw gives slightly\nbetter analysis performance than Python's built-in SQLite module.\n\n  - The fragment file is in JSON Lines format (see http://jsonlines.org/ ).\nThe ujson (https://github.com/esnme/ultrajson) and slightly slower\ncjson (https://github.com/AGProjects/python-cjson ) are both about\n25% faster than Python 2.7's built-in 'json' module.\n\n  - The \"`--memory`\" option in the index command requires the psutil\nmodule (see https://pypi.python.org/pypi/psutil/5.2.2 ) to get memory\nuse information.\n\n\n------------------\n\n\n## How to run the program and get help\n\n\nThe package includes a command-line program named \"mmpdb\". This\nsupports many subcommands. For example:\n\n  \"`mmpdb fragment`\" -- fragment a SMILES file  \n  \"`mmpdb index`\" -- find matched molecular pairs in a fragment file  \n\nUse the \"`--help`\" option to get more information about any of the\ncommands. For example, \"`mmpdb fragment --help`\" will print the\ncommand-line arguments, describe how they are used, and show\nexamples of use.\n\nThe subcommands starting with \"help-\" print additional information\nabout a given topic. The next few sections are the output from\n\n```shell\n   % mmpdb help-analysis\n```\n\nIf you wish to experiment with a simple test set, use\ntests/test_data.smi, with molecular weight and melting point\nproperties in tests/test_data.csv.\n\n\n------------------\n\n\n## Publication\n\n\nAn open-access publication describing this package has been \npublished in the Journal of Chemical Information and Modeling:\n\nA. Dalke, J. 
Hert, C. Kramer. mmpdb: An Open-Source Matched \nMolecular Pair Platform for Large Multiproperty Data Sets. *J. Chem. \nInf. Model.*, **2018**, *58 (5)*, pp 902–910. \nhttps://pubs.acs.org/doi/10.1021/acs.jcim.8b00173\n\n\n------------------\n\n\n## Background\n\n\nThe overall process is:\n\n  1) Fragment structures in a SMILES file, to produce fragments.\n\n  2) Index the fragments to produce matched molecular pairs.\n     (you might include property information at this point)\n\n  3) Load property information.\n\n  4) Find transforms for a given structure; and/or\n\n  5) Predict a property for a structure given the known\n     property for another structure\n\nSome terminology:\nA fragmentation cuts 1, 2, or 3 non-ring bonds to\nconvert a structure into a \"constant\" part and a \"variable\" part. The\nsubstructure in the variable part is a single fragment, and often\nconsidered the R-groups, while the constant part contains one\nfragment for each cut, and is often considered as containing the\ncore.\n\nThe matched molecular pair indexing process finds all pairs which have\nthe same constant part, in order to define a transformation from one\nvariable part to another variable part. A \"rule\" stores information\nabout a transformation, including a list of all the pairs for that\nrule.\n\nThe \"rule environment\" extends the transformation to include\ninformation about the local environment of the attachment points on\nthe constant part. The environment fingerprint is based on the RDKit\ncircular fingerprints for the attachment points. There is one rule\nenvironment for each available radius. Larger radii correspond to more\nspecific environments. The \"rule environment statistics\" table stores\ninformation about the distribution of property changes for all of the\npairs which contain the given rule and environment, with one table\nfor each property.\n\n\n\n#### 1) Fragment structures\n\n\nUse \"`smifrag`\" to see how a given SMILES is fragmented. 
Use \"`fragment`\"\nto fragment all of the compounds in a SMILES file.\n\n\"`mmpdb smifrag`\" is a diagnostic tool to help understand how a given\nSMILES will be fragmented and to experiment with the different\nfragmentation options. For example:\n\n```shell\n  % mmpdb smifrag 'c1ccccc1OC'\n                     |-------------  variable  -------------|       |---------------------  constant  --------------------\n  #cuts | enum.label | #heavies | symm.class | smiles       | order | #heavies | symm.class | smiles           | with-H   \n  ------+------------+----------+------------+--------------+-------+----------+------------+------------------+----------\n    1   |     N      |    2     |      1     | [*]OC        |    0  |    6     |      1     | [*]c1ccccc1      | c1ccccc1 \n    1   |     N      |    6     |      1     | [*]c1ccccc1  |    0  |    2     |      1     | [*]OC            | CO       \n    2   |     N      |    1     |     11     | [*]O[*]      |   01  |    7     |     12     | [*]C.[*]c1ccccc1 | -        \n    1   |     N      |    1     |      1     | [*]C         |    0  |    7     |      1     | [*]Oc1ccccc1     | Oc1ccccc1\n    1   |     N      |    7     |      1     | [*]Oc1ccccc1 |    0  |    1     |      1     | [*]C             | C        \n```\n\nUse \"`mmpdb fragment`\" to fragment a SMILES file and produce a fragment\nfile for the MMP analysis. Start with the test data file named\n\"test_data.smi\" containing the following structures:\n\nOc1ccccc1 phenol  \nOc1ccccc1O catechol  \nOc1ccccc1N 2-aminophenol  \nOc1ccccc1Cl 2-chlorophenol  \nNc1ccccc1N o-phenylenediamine  \nNc1cc(O)ccc1N amidol  \nOc1cc(O)ccc1O hydroxyquinol  \nNc1ccccc1 phenylamine  \nC1CCCC1N cyclopentanol  \n\n```shell\n  % mmpdb fragment test_data.smi -o test_data.fragments\n```\nFragmentation can take a while. You can save time by asking the code\nto reuse fragmentations from a previous run. 
If you do that then the\nfragment command will reuse the old fragmentation parameters. (You\ncannot override them with command-line options.) Here is an example:\n\n```shell\n  % mmpdb fragment data_file.smi -o new_data_file.fragments \\\n         --cache old_data_file.fragments\n```\n\nThe \"`--cache`\" option will greatly improve the fragmentation performance when\nthere are only a few changes from the previous run.\n\nThe fragmentation algorithm is configured to ignore structures which\nare too big or have too many rotatable bonds. There are also options\nwhich change where to make cuts and the number of cuts to make. Use\nthe \"`--help`\" option on each command for details.\n\nUse \"`mmpdb help-smiles-format`\" for details about how to parse different\nvariants of the SMILES file format.\n\nThe \"`--cut-smarts`\" option sets the SMARTS pattern used to determine\nwhich bonds to cut during fragmentation. Use \"`--cut-rgroups`\" or\n\"`--cut-rgroup-file`\" to cut R-groups specified by fragment SMILES.\n\n#### 2) Index the MMPA fragments to create a database\n\n\nThe \"`mmpdb index`\" command indexes the output fragments from \"`mmpdb\nfragment`\" by their variable fragments, that is, it finds\nfragmentations with the same R-groups and puts them together. Here's\nan example:\n\n```shell\n  % mmpdb index test_data.fragments -o test_data.mmpdb\n```\nThe output from this is a SQLite database.\n\nIf you have activity/property data and you do not want the database to\ninclude structures where there is no data, then you can specify\nthe properties file as well:\n\n```shell\n  % mmpdb index test_data.fragments -o test_data.mmpdb --properties test_data.csv\n```\nUse \"`mmpdb help-property-format`\" for property file format details.\n\nFor more help use \"`mmpdb index --help`\".\n\n\n\n#### 3) Add properties to a database\n\n\nUse \"`mmpdb loadprops`\" to add or modify activity/property data in the\ndatabase. 
Here's an example property file named 'test_data.csv' with\nmolecular weight and melting point properties:\n\nID      MW      MP  \nphenol  94.1    41  \ncatechol        110.1   105  \n2-aminophenol   109.1   174  \n2-chlorophenol  128.6   8  \no-phenylenediamine      108.1   102  \namidol  124.1   *  \nhydroxyquinol   126.1   140  \nphenylamine     93.1    -6  \ncyclopentanol   86.1    -19  \n\nThe following loads the property data to the MMPDB database file\ncreated in the previous section:\n\n```shell\n  % mmpdb loadprops -p test_data.csv test_data.mmpdb\n```\n\nUse \"`mmpdb help-property-format`\" for property file format details.\n\nFor more help use \"`mmpdb loadprops --help`\". Use \"`mmpdb list`\" to see\nwhat properties are already loaded.\n\n\n\n#### 4) Identify possible transforms\n\n\nUse \"`mmpdb transform`\" to transform an input structure using the rules\nin a database. For each transformation, it can estimate the effect on\nany properties. The following looks at possible ways to transform\n2-pyridone using the test dataset created in the previous section, and\npredict the effect on the \"MW\" property (the output is reformatted for\nclarity):\n\n```shell\n  % mmpdb transform --smiles 'c1cccnc1O' test_data.mmpdb --property MW\n  ID      SMILES MW_from_smiles MW_to_smiles  MW_radius  \\ \n   1  Clc1ccccn1         [*:1]O      [*:1]Cl          1\n   2   Nc1ccccn1         [*:1]O       [*:1]N          1\n   3    c1ccncc1         [*:1]O     [*:1][H]          1\n\n                               MW_fingerprint  MW_rule_environment_id  \\ \n  tLP3hvftAkp3EUY+MHSruGd0iZ/pu5nwnEwNA+NiAh8                     298\n  tLP3hvftAkp3EUY+MHSruGd0iZ/pu5nwnEwNA+NiAh8                     275\n  tLP3hvftAkp3EUY+MHSruGd0iZ/pu5nwnEwNA+NiAh8                     267\n\n  MW_count  MW_avg  MW_std  MW_kurtosis  MW_skewness  MW_min  MW_q1  \\ \n         1    18.5     NaN          NaN          NaN    18.5   18.5\n         3    -1.0     0.0          NaN          0.0    -1.0   
-1.0\n         4   -16.0     0.0          NaN          0.0   -16.0  -16.0\n\n  MW_median  MW_q3  MW_max  MW_paired_t  MW_p_value\n       18.5   18.5    18.5          NaN         NaN\n       -1.0   -1.0    -1.0  100000000.0         NaN\n      -16.0  -16.0   -16.0  100000000.0         NaN\n```\n\nThis says that \"c1cccnc1O\" can be transformed to \"Clc1ccccn1\" using\nthe transformation \\[\\*:1\\]O\u003e\u003e\\[\\*:1\\]Cl (that is, replace the oxygen with a\nchlorine). The best transformation match has a radius of 1, which\nincludes the aromatic carbon at the attachment point but not the\naromatic nitrogen which is one atom away.\n\nThere is only one pair for this transformation, and it predicts a shift\nin molecular weight of 18.5. This makes sense as the [OH] is replaced\nwith a [Cl].\n\nOn the other hand, there are three pairs which transform it to\npyridine. The standard deviation of course is 0 because it's a simple\nmolecular weight calculation. The 100000000.0 is the mmpdb way of\nwriting \"positive infinity\".\n\nMelting point is more complicated. 
The following shows that in the\ntransformation of 2-pyridone to pyridine there are still 3 matched\npairs and in this case the average shift is -93C with a standard\ndeviation of 76.727C:\n\n```shell\n  % mmpdb transform --smiles 'c1cccnc1O' test_data.mmpdb --property MP\n  ID      SMILES MP_from_smiles MP_to_smiles  MP_radius  \\ \n  1  Clc1ccccn1         [*:1]O      [*:1]Cl          1\n  2   Nc1ccccn1         [*:1]O       [*:1]N          1\n  3    c1ccncc1         [*:1]O     [*:1][H]          1\n\n                               MP_fingerprint  MP_rule_environment_id  \\ \n tLP3hvftAkp3EUY+MHSruGd0iZ/pu5nwnEwNA+NiAh8                     298\n tLP3hvftAkp3EUY+MHSruGd0iZ/pu5nwnEwNA+NiAh8                     275\n tLP3hvftAkp3EUY+MHSruGd0iZ/pu5nwnEwNA+NiAh8                     267\n\n  MP_count  MP_avg  MP_std  MP_kurtosis  MP_skewness  MP_min   MP_q1  \\ \n        1 -97.000     NaN          NaN          NaN     -97  -97.00\n        3 -16.667  75.235         -1.5     -0.33764     -72  -65.75\n        3 -93.000  76.727         -1.5      0.32397    -180 -151.00\n\n  MP_median  MP_q3  MP_max  MP_paired_t  MP_p_value\n       -97 -97.00     -97          NaN         NaN\n       -47  40.00      69       0.3837     0.73815\n       -64 -42.25     -35       2.0994     0.17062\n```\n\nYou might try enabling the \"`--explain`\" option to see why the algorithm\nselected a given transformation.\n\nFor more help use \"`mmpdb transform --help`\".\n\n\n\n#### 5) Use MMP to make a prediction\n\n\nUse \"`mmpdb predict`\" to predict the property change in a transformation\nfrom a given reference structure to a given query structure. Use this\nto limit the transform results when you know the\nstarting and ending structures. 
The following predicts the effect on\nmelting point in transforming 2-pyridone to pyridine:\n\n```shell\n  % mmpdb predict --smiles 'c1cccnc1' --reference 'c1cccnc1O' \\\n            test_data.mmpdb --property MP\n  predicted delta: -93 +/- 76.7268\n```\n\nThis is the same MP_avg and MP_std from the previous section using\n'`transform`'.\n\n```shell\n  % mmpdb predict --smiles 'c1cccnc1' --reference 'c1cccnc1O' \\\n            test_data.mmpdb --property MP --value -41.6\n```\n\nI'll redo the calculation with the molecular weight property, and have\nmmpdb do the trivial calculation of adding the known weight to the\npredicted delta:\n\n```shell\n  % mmpdb predict --smiles 'c1cccnc1' --reference 'c1cccnc1O' \\\n            test_data.mmpdb --property MW --value 95.1\n  predicted delta: -16 predicted value: 79.1 +/- 0\n```\n\nYou might try enabling the \"`--explain`\" option to see why the algorithm\nselected a given transformation, or use \"`--save-details`\" to save the \nlist of possible rules to the file 'pred_detail_rules.txt' and to save \nthe list of rule pairs to \"pred_detail_pairs.txt\".\n\n\n------------------\n\n\n## History and Acknowledgements\n\n\nThe project started as a fork of the matched molecular pair program\n'mmpa' written by Jameed Hussain, then at GlaxoSmithKline Research \u0026\nDevelopment Ltd. Many thanks to them for contributing the code to the\nRDKit project under a free software license.\n\nSince then it has gone through two rewrites. Major changes to the\nfirst version included:\n  - performance improvements\n\n  - support for property prediction\n\n  - environmental fingerprints\n  \nThat version supported both MySQL and SQLite, and used the third-party\n\"peewee.py\" and \"playhouse\" code to help with database\nportability. Many thanks to Charles Leifer for that software.\n\nThe second version dropped MySQL support but added APSW support, which\nwas already available in the peewee/playhouse modules. The major goals
The major goals\nin version 2 were:\n\n  - better support for chiral structures\n\n  - canonical variable fragments, so the transforms are canonical\n    on both the left-hand and right-hand sides. (Previously only\n    the entire transform was canonical.)\n\n\n------------------\n\n\n## Copyright\n\n\nThe mmpdb package is copyright 2015-2018 by F. Hoffmann-La\nRoche Ltd and distributed under the 3-clause BSD license. See [LICENSE](LICENSE)\nfor details.\n\n\n------------------\n\n\n## License information\n\n\nThe software derives from software which is copyright 2012-2013 by\nGlaxoSmithKline Research \u0026 Development Ltd., and distributed under the\n3-clause BSD license. To the best of our knowledge, mmpdb does not contain any\nof the mmpa original source code. We thank the authors for releasing this\npackage and include their license in the credits. See [LICENSE](LICENSE) for details.\n\nThe file fileio.py originates from [chemfp](http://chemfp.com) and is therefore\ncopyright by Andrew Dalke Scientific AB under the MIT license. See\n[LICENSE](LICENSE) for details. Modifications to this file are covered under\nthe mmpdb license.\n\nThe files peewee.py and playhouse/\\*.py are copyright 2010 by Charles\nLeifer and distributed under the MIT license. See [LICENSE](LICENSE) for details.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmerck%2Fmatcher-mmpdb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmerck%2Fmatcher-mmpdb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmerck%2Fmatcher-mmpdb/lists"}