{"id":13751955,"url":"https://github.com/rdkit/mmpdb","last_synced_at":"2026-01-25T19:13:15.785Z","repository":{"id":43115413,"uuid":"100089651","full_name":"rdkit/mmpdb","owner":"rdkit","description":"A package to identify matched molecular pairs and use them to predict property changes.","archived":false,"fork":false,"pushed_at":"2023-12-20T14:36:59.000Z","size":915,"stargazers_count":186,"open_issues_count":20,"forks_count":51,"subscribers_count":19,"default_branch":"master","last_synced_at":"2024-03-25T20:06:00.911Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rdkit.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-08-12T04:45:21.000Z","updated_at":"2024-06-13T04:54:49.317Z","dependencies_parsed_at":"2022-09-26T17:00:51.421Z","dependency_job_id":"cda7c1fe-ee4c-4554-b976-6aee6336e2f7","html_url":"https://github.com/rdkit/mmpdb","commit_stats":{"total_commits":53,"total_committers":7,"mean_commits":7.571428571428571,"dds":0.339622641509434,"last_synced_commit":"29b44a085af32f1d69da89911045d53a9eba9b44"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdkit%2Fmmpdb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdkit%2Fmmpdb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdkit%2Fmmpdb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/rdkit%2Fmmpdb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rdkit","download_url":"https://codeload.github.com/rdkit/mmpdb/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247280272,"owners_count":20912967,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:00:57.530Z","updated_at":"2026-01-25T19:13:15.778Z","avatar_url":"https://github.com/rdkit.png","language":"Python","funding_links":[],"categories":["Ranked by starred repositories"],"sub_categories":[],"readme":"# mmpdb 3.1.4 - matched molecular pair database generation and analysis\n\n\n## Synopsis\n\n\nA package to identify matched molecular pairs and use them to predict\nproperty changes and generate new molecular structures.\n\n\n------------------\n\n## Installation\n\nmmpdb 3.1.4 must be installed before use. (Earlier versions of mmpdb\ncould be run in-place, in the top-level directory.) This will also\nensure that the SciPy, peewee, and click packages are installed.\n\nTo install from PyPI using\n[pip](https://pip.pypa.io/en/stable/user_guide/), which comes with\nPython:\n\nOn macOS and other Unix-like systems:\n```\npython -m pip install mmpdb\n```\nOn Windows:\n```\npy -m pip install mmpdb\n```\n\nIf you are using a virtual environment (e.g. 
with\n[venv](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/#creating-a-virtual-environment)\nor\n[conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html))\nthen the installer will place a copy of the relevant files into the\nvirtual environment's Python package directory, and place the\n`mmpdb` command-line driver in your path.\n\nIf you are not using a virtual environment (and you should be using a\nvirtual environment) then the installer will place the relevant files\ninto Python's system directory. If you do not have write permissions\nto that directory, or only want to install it for your personal use,\nthen add the `--user` flag at the end of the install command.\n\nTo install from the source directory, go to the top-level directory then\ndo:\n\nOn macOS and other Unix-like systems:\n```\npython -m pip install .\n```\n\nOn Windows:\n```\npy -m pip install .\n```\n\nIf you plan to modify the source code and want the installation to use\nthe local directory rather than make a copy into the package\ndirectory, then you need an \"editable\" installation:\n\nOn macOS and other Unix-like systems:\n```\npython -m pip install -e .\n```\n\nOn Windows:\n```\npy -m pip install -e .\n```\n\n(Assuming the dependencies are installed then it is possible to use\nmmpdb without installation by going to the top-level directory and\nusing `python -m mmpdblib` or, for Windows, `py -m mmpdblib`. This is\nnot recommended.)\n\n## Requirements\n\nThe package has been tested on Python 3.9 and 3.10. It should work\nunder newer versions of Python.\n\nYou will need a copy of the RDKit cheminformatics toolkit, available\nfrom http://rdkit.org/ , which in turn requires NumPy. You will also\nneed SciPy, peewee, and click. 
The latter three are listed as\ndependencies in setup.cfg and should be installed automatically.\n\nOptional components you may find useful are:\n\n  - The matched molecular pairs may instead be stored in a Postgres\ndatabase. These were tested using the\n[psycopg2](https://www.psycopg.org/) adapter. See `mmpdb\nhelp-postgres` for more information.\n\n - The \"`--memory`\" option in the index command requires the\n[psutil](https://pypi.python.org/pypi/psutil/) module to get memory\nuse information.\n\nNOTE: mmpdb 2 used a JSON-Lines format for the fragment files, and\nsuggested an optional package with faster JSON parsing. mmpdb 3 no\nlonger uses this format.\n\n\n------------------\n\n\n## How to run the program and get help\n\n\nThe package includes a command-line program named \"mmpdb\". This\nsupports many subcommands. For example:\n\n* \"`mmpdb fragment`\" -- fragment a SMILES file\n\n* \"`mmpdb index`\" -- find matched molecular pairs in a fragment file\n\nUse the \"`--help`\" option to get more information about any of the\ncommands. For example, \"`mmpdb fragment --help`\" will print the\ncommand-line arguments, describe how they are used, and show\nexamples of use.\n\nThe subcommands starting with \"help-\" print additional information\nabout a given topic. Much of the text of this README comes from the\noutput of\n\n```shell\n % mmpdb help-analysis \n % mmpdb help-distributed \n```\n\nIf you wish to experiment with a simple test set, use\n`tests/test_data.smi`, with molecular weight and melting point\nproperties in `tests/test_data.csv`.\n\n\n------------------\n\n\n## Publication\n\n\nAn open-access publication describing this package has been \npublished in the Journal of Chemical Information and Modeling:\n\nA. Dalke, J. Hert, C. Kramer. mmpdb: An Open-Source Matched \nMolecular Pair Platform for Large Multiproperty Data Sets. *J. Chem. \nInf. Model.*, **2018**, *58 (5)*, pp 902–910. 
\nhttps://pubs.acs.org/doi/10.1021/acs.jcim.8b00173\n\nFor more about the methods to scale mmpdb to larger datasets and\ngenerate new molecules, see:\n\nM. Awale, J. Hert, L. Guasch, S. Riniker, C. Kramer.\nThe Playbooks of Medicinal Chemistry Design Moves. *J. Chem. \nInf. Model.*, **2021**, *61 (2)*, pp 729–742.\nhttps://pubs.acs.org/doi/abs/10.1021/acs.jcim.0c01143\n\n\n------------------\n\n\n## Background\n\nThe overall process is:\n\n1) Fragment structures in a SMILES file, to produce fragments.\n\n2) Index the fragments to produce matched molecular pairs. (You might include\nproperty information at this point.)\n\n3) Load property information.\n\n4) Find transforms for a given structure; and/or\n\n5) Predict a property for a structure given the known property for another\nstructure; and/or\n\n6) Apply 1-cut rules to generate new structures from a given structure.\n\nSome terminology:\n\nA fragmentation cuts 1, 2, or 3 non-ring bonds to convert a structure into a\n\"constant\" part and a \"variable\" part. The substructure in the variable part\nis a single fragment, and is often considered the R-groups, while the constant\npart contains one fragment for each cut, and is often considered as containing\nthe core.\n\nThe matched molecular pair indexing process finds all pairs which have the\nsame constant part, in order to define a transformation from one variable part\nto another variable part. A \"rule\" stores information about a transformation,\nincluding a list of all the pairs for that rule.\n\nThe \"rule environment\" extends the transformation to include information about\nthe local environment of the attachment points on the constant part. 
The\nenvironment fingerprint is based on the RDKit circular fingerprints for the\nattachment points, expressed as a canonical SMARTS pattern, and alternatively,\nas a \"pseudo\"-SMILES string, which is a bit less precise but easier to\nunderstand and visualize.\n\nThe fingerprint SMARTS pattern describes the Morgan circular fingerprint\ninvariants around the attachment points. Here's a 3-cut example split across\nthree lines:\n\n```\n[#0;X1;H0;+0;!R:1]-[#6;X4;H1;+0;R](-[#6;X4;H2;+0;R])-[#6;X4;H2;+0;R].\n[#0;X1;H0;+0;!R:2]-[#7;X3;H0;+0;R](-[#6;X4;H2;+0;R])-[#6;X4;H2;+0;R].\n[#0;X1;H0;+0;!R:3]-[#6;X3;H0;+0;R](:[#6;X3;H1;+0;R]):[#6;X3;H1;+0;R]\n```\n\nThe SMARTS modifiers, like \"H0\" to require no hydrogens, are needed to match\nthe Morgan invariants but are quite the eyeful. The pseudo-SMILES alternative\nis:\n\n```\n[*:1]-[CH](-[CH2](~*))-[CH2](~*).\n[*:2]-[N](-[CH2](~*))-[CH2](~*).\n[*:3]-[c](:[cH](~*)):[cH](~*)\n```\n\nThis can be processed by RDKit, if sanitization is disabled, and turned into\nan image.\n\nCAUTION! The \"`(~*)`\" terms are used to represent the SMARTS connectivity\nterms \"X\u003cdigit\u003e\", but they do not necessarily all represent distinct atoms!\n\nThere is one rule environment for each available radius. Larger radii\ncorrespond to more specific environments. The \"rule environment statistics\"\ntable stores information about the distribution of property changes for all of\nthe pairs which contain the given rule and environment, with one table for\neach property.\n\n### 1) Fragment structures\n\nUse \"`smifrag`\" to see how a given SMILES is fragmented. 
Use \"`fragment`\" to\nfragment all of the compounds in a SMILES file.\n\n\"`mmpdb smifrag`\" is a diagnostic tool to help understand how a given SMILES\nwill be fragmented and to experiment with the different fragmentation options.\nFor example:\n\n```shell\n% mmpdb smifrag 'c1ccccc1OC'\n                   |-------------  variable  -------------|       |---------------------  constant  --------------------\n#cuts | enum.label | #heavies | symm.class | smiles       | order | #heavies | symm.class | smiles           | with-H   \n------+------------+----------+------------+--------------+-------+----------+------------+------------------+----------\n  1   |     N      |    2     |      1     | [*]OC        |    0  |    6     |      1     | [*]c1ccccc1      | c1ccccc1 \n  1   |     N      |    6     |      1     | [*]c1ccccc1  |    0  |    2     |      1     | [*]OC            | CO       \n  2   |     N      |    1     |     11     | [*]O[*]      |   01  |    7     |     12     | [*]C.[*]c1ccccc1 | -        \n  1   |     N      |    1     |      1     | [*]C         |    0  |    7     |      1     | [*]Oc1ccccc1     | Oc1ccccc1\n  1   |     N      |    7     |      1     | [*]Oc1ccccc1 |    0  |    1     |      1     | [*]C             | C        \n```\n\nUse \"`mmpdb fragment`\" to fragment a SMILES file and produce a fragment file\nfor the MMP analysis. Start with the test data file named \"test_data.smi\"\ncontaining the following structures:\n\n```text\nOc1ccccc1 phenol  \nOc1ccccc1O catechol  \nOc1ccccc1N 2-aminophenol  \nOc1ccccc1Cl 2-chlorophenol  \nNc1ccccc1N o-phenylenediamine  \nNc1cc(O)ccc1N amidol  \nOc1cc(O)ccc1O hydroxyquinol  \nNc1ccccc1 phenylamine  \nC1CCCC1N cyclopentanol  \n```\n\nthen run the following command to generate a fragment database:\n\n```shell\n% mmpdb fragment test_data.smi -o test_data.fragdb\n```\n\nFragmentation can take a while. You can save time by asking the code to reuse\nfragmentations from a previous run. 
If you do that then the fragment command\nwill reuse the old fragmentation parameters. (You cannot override them with\ncommand-line options.) Here is an example:\n\n```shell\n% mmpdb fragment data_file.smi -o new_data_file.fragdb \\\n       --cache old_data_file.fragdb\n```\n\nThe \"`--cache`\" option will greatly improve the fragment performance when\nthere are only a few changes from the previous run.\n\nThe fragmentation algorithm is configured to ignore structures which are too\nbig or have too many rotatable bonds. There are also options which change\nwhere to make cuts and the number of cuts to make. Use the \"`--help`\" option\non each command for details.\n\nUse \"`mmpdb help-smiles-format`\" for details about how to parse different variants\nof the SMILES file format.\n\n### 2) Index the MMPA fragments to create a database\n\nThe \"`mmpdb index`\" command indexes the output fragments from \"`mmpdb fragment`\"\nby their variable fragments, that is, it finds fragmentations with the same\nR-groups and puts them together. Here's an example:\n\n```shell\n% mmpdb index test_data.fragdb -o test_data.mmpdb\n```\n\nThe output from this is a SQLite database.\n\nIf you have activity/property data and you do not want the database to include\nstructures where there is no data, then you can specify the properties file as\nwell:\n\n```shell\n% mmpdb index test_data.fragdb -o test_data.mmpdb --properties test_data.csv\n```\n\nUse \"`mmpdb help-property-format`\" for more details about the property file\nformat.\n\nFor more help use \"`mmpdb index --help`\".\n\n### 3) Add properties to a database\n\nUse \"`mmpdb loadprops`\" to add or modify activity/property data in the\ndatabase. 
Here's an example property file named 'test_data.csv' with molecular\nweight and melting point properties:\n\n```text\nID      MW      MP  \nphenol  94.1    41  \ncatechol        110.1   105  \n2-aminophenol   109.1   174  \n2-chlorophenol  128.6   8  \no-phenylenediamine      108.1   102  \namidol  124.1   *  \nhydroxyquinol   126.1   140  \nphenylamine     93.1    -6  \ncyclopentanol   86.1    -19  \n```\n\nThe following loads the property data to the MMPDB database file created in\nthe previous section:\n\n```shell\n% mmpdb loadprops -p test_data.csv test_data.mmpdb\nUsing dataset: MMPs from 'test_data.fragdb'\nReading properties from 'tests/test_data.csv'\nRead 2 properties for 9 compounds from 'tests/test_data.csv'\nImported 9 'MW' records (9 new, 0 updated).\nImported 8 'MP' records (8 new, 0 updated).\nNumber of rule statistics added: 533 updated: 0 deleted: 0\nLoaded all properties and re-computed all rule statistics.\n```\n\nUse \"`mmpdb help-property-format`\" for more details about the property file\nformat.\n\nFor more help use \"`mmpdb loadprops --help`\". Use \"`mmpdb list`\" to see what\nproperties are already loaded.\n\n### 4) Identify possible transforms\n\nUse \"`mmpdb transform`\" to transform an input structure using the rules in a\ndatabase. For each transformation, it can estimate the effect on any\nproperties. 
The following looks at possible ways to transform 2-pyridone using\nthe test dataset created in the previous section, and predict the effect on\nthe \"MW\" property (the output is reformatted for clarity):\n\n```shell\n% mmpdb transform --smiles 'c1cccnc1O' test_data.mmpdb --property MW\nID     SMILES    MW_from_smiles    MW_to_smiles    MW_radius\n1    Clc1ccccn1     [*:1]O          [*:1]Cl           1\n2     Nc1ccccn1     [*:1]O          [*:1]N            1\n3     c1ccncc1      [*:1]O          [*:1][H]          1\n\n      MW_smarts                        MW_pseudosmiles    MW_rule_environment_id \n[#0;X1;H0;+0;!R:1]-[#6;X3;H0;+0;R]    [*:1]-[#6](~*)(~*)    299\n[#0;X1;H0;+0;!R:1]-[#6;X3;H0;+0;R]    [*:1]-[#6](~*)(~*)    276\n[#0;X1;H0;+0;!R:1]-[#6;X3;H0;+0;R]    [*:1]-[#6](~*)(~*)    268\n\nMW_count    MW_avg    MW_std    MW_kurtosis    MW_skewness\n    1        18.5\n    3        -1         0            0\n    4       -16         0            0\n\nMW_min  MW_q1  MW_median  MW_q3  MW_max  MW_paired_t    MW_p_value\n 18.5    18.5   18.5      18.5    18.5\n -1      -1     -1        -1      -1      1e+08    \n-16     -16    -16       -16     -16      1e+08 \n```\n\nThis says that \"c1cccnc1O\" can be transformed to \"Clc1ccccn1\" using the\ntransformation \\[\\*:1\\]O\u003e\u003e\\[\\*:1\\]Cl (that is, replace the oxygen with a\nchlorine). The best transformation match has a radius of 1, which includes the\naromatic carbon at the attachment point but not the aromatic nitrogen which is\none atom away.\n\nThere is only one pair for this transformation, and it predicts a shift in\nmolecular weight of 18.5. This makes sense as the [OH] is replaced with a\n[Cl].\n\nOn the other hand, there are three pairs which transform it to pyridine. The\nstandard deviation of course is 0 because it's a simple molecular weight\ncalculation. The 1e+08.0 is the mmpdb way of writing \"positive infinity\".\n\nMelting point is more complicated. 
The following shows that in the\ntransformation of 2-pyridone to pyridine there are still 3 matched pairs and\nin this case the average shift is -93C with a standard deviation of 76.727C:\n\n```shell\n% mmpdb transform --smiles 'c1cccnc1O' test_data.mmpdb --property MP\nID   SMILES    MP_from_smiles   MP_to_smiles   MP_radius   \n1   Clc1ccccn1    [*:1]O           [*:1]Cl        1\n2    Nc1ccccn1    [*:1]O           [*:1]N         1\n3    c1ccncc1     [*:1]O           [*:1][H]       1\n\nMP_smarts                            MP_pseudosmiles     MP_rule_environment_id\n[#0;X1;H0;+0;!R:1]-[#6;X3;H0;+0;R]   [*:1]-[#6](~*)(~*)   299\n[#0;X1;H0;+0;!R:1]-[#6;X3;H0;+0;R]   [*:1]-[#6](~*)(~*)   276\n[#0;X1;H0;+0;!R:1]-[#6;X3;H0;+0;R]   [*:1]-[#6](~*)(~*)   268\n\nMP_count   MP_avg   MP_std   MP_kurtosis   MP_skewness   \n   1       -97            \n   3       -16.667   75.235     -1.5        -0.33764   \n   3       -93       76.727     -1.5        -0.32397   \n\nMP_min   MP_q1   MP_median   MP_q3   MP_max   MP_paired_t   MP_p_value\n -97      -97       -97       -97     -97      \n -72      -65.75    -47        40      69        0.3837      0.73815\n-180     -151       -64       -42.25  -35       -2.0994      0.17062\n```\n\nYou might try enabling the \"`--explain`\" option to see why the algorithm\nselected a given transformation.\n\nFor more help use \"`mmpdb transform --help`\".\n\n### 5) Use MMP to make a prediction\n\nUse \"`mmpdb predict`\" to predict the property change in a transformation from\na given reference structure to a given query structure. Use this when you want\nto limit the transform results when you know the starting and ending\nstructures. 
The following predicts the effect on melting point in\ntransforming 2-pyridone to pyridine:\n\n```shell\n% mmpdb predict --smiles 'c1cccnc1' --reference 'c1cccnc1O' \\\n          test_data.mmpdb --property MP\npredicted delta: -93 +/- 76.7268\n```\n\nThese are the same MP_avg and MP_std values as in the previous section using\n'`transform`'.\n\nThe reference value may also be included in the calculation, to give a\npredicted value.\n\n```shell\n% mmpdb predict --smiles 'c1cccnc1' --reference 'c1cccnc1O' \\\n          test_data.mmpdb --property MP --value -41.6\npredicted delta: -93 predicted value: -134.6 +/- 76.7268\n```\n\nI'll redo the calculation with the molecular weight property, and have mmpdb\ndo the trivial calculation of adding the known weight to the predicted delta:\n\n```shell\n% mmpdb predict --smiles 'c1cccnc1' --reference 'c1cccnc1O' \\\n          test_data.mmpdb --property MW --value 95.1\npredicted delta: -16 predicted value: 79.1 +/- 0\n```\n\nYou might try enabling the \"`--explain`\" option to see why the algorithm\nselected a given transformation, or use \"`--save-details`\" to save the list\nof possible rules to the file `pred_detail_rules.txt` and to save the list of\nrule pairs to `pred_detail_pairs.txt`.\n\n### 6) Use MMP to generate new structures\n\nThe rules in an MMP database give a sort of \"playbook\" about the\ntransformations which might be explored in medicinal chemistry. These rules can\nbe applied to a given structure to generate new related structures, following\na method related to the transform command but ignoring any property\ninformation. Here's an example using the default radius of 0, which means the\nenvironment fingerprint is ignored. 
(The columns have been re-formatted for\nthe documentation.)\n\n```shell\n% mmpdb generate --smiles 'c1ccccc1C(O)C' test_data.mmpdb\nstart             constant  from_smiles  to_smiles          r  pseudosmiles  final\nCC(O)c1ccccc1  *C(C)c1ccccc1  [*:1]O    [*:1][H]            0  [*:1](~*)  CCc1ccccc1\nCC(O)c1ccccc1  *C(C)c1ccccc1  [*:1]O    [*:1]N              0  [*:1](~*)  CC(N)c1ccccc1\nCC(O)c1ccccc1  *C(C)c1ccccc1  [*:1]O    [*:1]Cl             0  [*:1](~*)  CC(Cl)c1ccccc1\nCC(O)c1ccccc1  *C(C)O     [*:1]c1ccccc1 [*:1]c1ccccc1O      0  [*:1](~*)  CC(O)c1ccccc1O\nCC(O)c1ccccc1  *C(C)O     [*:1]c1ccccc1 [*:1]c1ccccc1N      0  [*:1](~*)  CC(O)c1ccccc1N\nCC(O)c1ccccc1  *C(C)O     [*:1]c1ccccc1 [*:1]c1cc(O)ccc1N   0  [*:1](~*)  CC(O)c1cc(O)ccc1N\nCC(O)c1ccccc1  *C(C)O     [*:1]c1ccccc1 [*:1]c1ccc(O)cc1N   0  [*:1](~*)  CC(O)c1ccc(O)cc1N\nCC(O)c1ccccc1  *C(C)O     [*:1]c1ccccc1 [*:1]C1CCCC1        0  [*:1](~*)  CC(O)C1CCCC1\nCC(O)c1ccccc1  *C(C)O     [*:1]c1ccccc1 [*:1]c1ccccc1Cl     0  [*:1](~*)  CC(O)c1ccccc1Cl\nCC(O)c1ccccc1  *C(C)O     [*:1]c1ccccc1 [*:1]c1ccc(N)c(N)c1 0  [*:1](~*)  CC(O)c1ccc(N)c(N)c1\nCC(O)c1ccccc1  *C(C)O     [*:1]c1ccccc1 [*:1]c1cc(O)ccc1O   0  [*:1](~*)  CC(O)c1cc(O)ccc1O\nCC(O)c1ccccc1  *C(C)O     [*:1]c1ccccc1 [*:1]c1ccc(O)c(O)c1 0  [*:1](~*)  CC(O)c1ccc(O)c(O)c1\nCC(O)c1ccccc1  *C(C)O     [*:1]c1ccccc1 [*:1]c1ccc(O)cc1O   0  [*:1](~*)  CC(O)c1ccc(O)cc1O\n\n#pairs  pair_from_id  pair_from_smiles  pair_to_id  pair_to_smiles\n4       2-aminophenol  Nc1ccccc1O     phenylamine        Nc1ccccc1\n3       phenol         Oc1ccccc1      phenylamine        Nc1ccccc1\n1       catechol       Oc1ccccc1O     2-chlorophenol     Oc1ccccc1Cl\n2       phenylamine    Nc1ccccc1      2-aminophenol      Nc1ccccc1O\n2       phenylamine    Nc1ccccc1      o-phenylenediamine Nc1ccccc1N\n1       phenylamine    Nc1ccccc1      amidol             Nc1ccc(O)cc1N\n1       phenylamine    Nc1ccccc1      amidol             Nc1ccc(O)cc1N\n1       phenylamine    Nc1ccccc1      
cyclopentanol      NC1CCCC1\n1       phenol         Oc1ccccc1      2-chlorophenol     Oc1ccccc1Cl\n1       phenol         Oc1ccccc1      amidol             Nc1ccc(O)cc1N\n1       phenol         Oc1ccccc1      hydroxyquinol      Oc1ccc(O)c(O)c1\n1       phenol         Oc1ccccc1      hydroxyquinol      Oc1ccc(O)c(O)c1\n1       phenol         Oc1ccccc1      hydroxyquinol      Oc1ccc(O)c(O)c1\n```\n\nThe second half of the output shows the number of known pairs for the given rule\nenvironment (use `--min-pairs N` to require at least N pairs), and gives a\nrepresentative pair from the dataset.\n\nIn the above example, all of the fragmentations in the specified `--smiles`\nare used. Alternatively, you may specify `--smiles` and one of `--constant` or\n`--query` to use that specific fragmentation, or use `--constant` and\n`--query` (without `--smiles`) to specify the exact pair.\n\nThere is also an option to generate `--subqueries`. This generates all of the\nunique 1-cut fragmentations of the query, and uses them as additional queries.\nI'll use the `--constant` to specify the phenyl group, leaving the\naminomethanol available as the query. I'll use `--subqueries` to include\nfragments of the query. I'll limit the output `--columns` to the start and\nfinal SMILES structures, and the number of pairs. 
I'll use `--explain` to\ndisplay debug information, and finally, I'll use `--no-header` to make the\noutput a bit less complicated:\n\n```shell\n% mmpdb generate --smiles 'c1ccccc1C(O)N' --constant '*c1ccccc1' test_data.mmpdb \\\n     --subqueries --columns start,final,#pairs --explain --no-header\nNumber of subqueries: 4\nSubqueries are: ['*CN', '*CO', '*N', '*O']\nUsing constant SMILES *c1ccccc1 with radius 0.\nEnvironment SMARTS: [#0;X1;H0;+0;!R:1] pseudoSMILES: [*:1](~*)\nNumber of matching environment rules: 42\nQuery SMILES [*:1]C(N)O is not a rule_smiles in the database.\nQuery SMILES [*:1]CN is not a rule_smiles in the database.\nQuery SMILES [*:1]CO is not a rule_smiles in the database.\nNc1ccccc1     Oc1ccccc1       3\nNc1ccccc1     c1ccccc1        2\nNc1ccccc1     Clc1ccccc1      1\nNumber of rules for [*:1]N: 3\nOc1ccccc1     c1ccccc1        4\nOc1ccccc1     Nc1ccccc1       3\nOc1ccccc1     Clc1ccccc1      1\nNumber of rules for [*:1]O: 3\n```\n\n## Distributed computing\n\nThese commands enable MMP generation on a distributed compute cluster, rather\nthan a single machine.\n\nNOTE: This method does not support properties, and you must use the\nSQLite-based \"mmpdb\" files, not Postgres databases. 
The\n[Postgres wiki](https://wiki.postgresql.org/wiki/Converting_from_other_Databases_to_PostgreSQL)\nmentions [pgloader](https://github.com/dimitri/pgloader) as a possible\ntool to have Postgres load a SQLite database.\n\nThese examples assume you work in a queueing environment with a shared\nfile system, and a queueing system which lets you submit a command and\na list of filenames, to enqueue the command once for each filename.\n\nThis documentation will use the command 'qsub' as a wrapper around [GNU\nParallel](https://www.gnu.org/software/parallel/):\n\n```shell\nalias qsub=\"parallel --no-notice -j 1 --max-procs 4\"\n```\n\nThis alias suppresses the request to cite GNU parallel in scientific papers,\nand has it process one filename at a time, with at most 4 processes in\nparallel.\n\nI'll pass the filenames to process via stdin, like this example:\n\n```shell\n% ls /etc/passwd ~/.bashrc | qsub wc\n       2       5      88 /Users/dalke/.bashrc\n     120     322    7630 /etc/passwd\n```\n\nThis output shows that `wc` received only a single filename because with two\nfilenames it also shows a 'total' line.\n\n```shell\n% wc /etc/passwd ~/.bashrc\n     120     322    7630 /etc/passwd\n       2       5      88 /Users/dalke/.bashrc\n     122     327    7718 total\n```\n\n### Distributed fragmentation generation\n\nNOTE: This method can also be used to process larger data sets on a single\nmachine because the `mmpdb merge` step uses less memory than the `mmpdb\nindex`.\n\nThe `fragment` command supports multi-processing with the `-j` flag, which\nscales to about 4 or 8 processors. 
For larger data sets you can break the\nSMILES dataset into multiple files, fragment each file independently, then\nmerge the results.\n\nThese steps are:\n\n* smi_split - split the SMILES file into smaller files\n* fragment - fragment each smaller SMILES file into its own fragdb file\n* fragdb_merge - merge the smaller fragdb files together\n\n#### Use smi_split to create N smaller SMILES files\n\nI'll start with a SMILES file containing a header and 20267 SMILES lines:\n\n```shell\n% head -3 ChEMBL_CYP3A4_hERG.smi\nSMILES  CMPD_CHEMBLID\n[2H]C([2H])([2H])Oc1cc(ncc1C#N)C(O)CN2CCN(C[C@H](O)c3ccc4C(=O)OCc4c3C)CC2       CHEMBL3612928\n[2H]C([2H])(N[C@H]1C[S+]([O-])C[C@@H](Cc2cc(F)c(N)c(O[C@H](COC)C(F)(F)F)c2)[C@@H]1O)c3cccc(c3)C(C)(C)C  CHEMBL2425617\n% wc -l ChEMBL_CYP3A4_hERG.smi\n   20268 ChEMBL_CYP3A4_hERG.smi\n```\n\nBy default the \"smi_split\" command splits a SMILES file into 10 files. (Use\n`-n` or `--num-files` to change the number of files, or use `--num-records` to\nhave N records per file.)\n\n```shell\n% mmpdb smi_split ChEMBL_CYP3A4_hERG.smi\nCreated 10 SMILES files containing 20268 SMILES records.\n```\n\nThe \"20268 SMILES records\" message shows that all 20268 lines were used to generate\nSMILES records, which is a mistake as it includes the header line. 
I'll re-do\nthe command with `--has-header` to have it skip the header:\n\n```shell\n% mmpdb smi_split ChEMBL_CYP3A4_hERG.smi --has-header\nCreated 10 SMILES files containing 20267 SMILES records.\n```\n\nBy default this generates files which look like:\n\n```shell\n% ls -l ChEMBL_CYP3A4_hERG.*.smi\n-rw-r--r--  1 dalke  admin  141307 Feb 10 15:10 ChEMBL_CYP3A4_hERG.0000.smi\n-rw-r--r--  1 dalke  admin  152002 Feb 10 15:10 ChEMBL_CYP3A4_hERG.0001.smi\n-rw-r--r--  1 dalke  admin  127397 Feb 10 15:10 ChEMBL_CYP3A4_hERG.0002.smi\n-rw-r--r--  1 dalke  admin  137930 Feb 10 15:10 ChEMBL_CYP3A4_hERG.0003.smi\n-rw-r--r--  1 dalke  admin  130585 Feb 10 15:10 ChEMBL_CYP3A4_hERG.0004.smi\n-rw-r--r--  1 dalke  admin  150072 Feb 10 15:10 ChEMBL_CYP3A4_hERG.0005.smi\n-rw-r--r--  1 dalke  admin  139620 Feb 10 15:10 ChEMBL_CYP3A4_hERG.0006.smi\n-rw-r--r--  1 dalke  admin  133347 Feb 10 15:10 ChEMBL_CYP3A4_hERG.0007.smi\n-rw-r--r--  1 dalke  admin  131310 Feb 10 15:10 ChEMBL_CYP3A4_hERG.0008.smi\n-rw-r--r--  1 dalke  admin  129344 Feb 10 15:10 ChEMBL_CYP3A4_hERG.0009.smi\n```\n\nThe output filenames are determined by the `--template` option, which defaults\nto `{prefix}.{i:04}.smi`, where `i` is the output file index. See `smi_split\n--help` for details.\n\n#### Fragment the SMILES files\n\nThese files can be fragmented in parallel:\n\n```shell\n% ls ChEMBL_CYP3A4_hERG.*.smi | qsub mmpdb fragment -j 1\n```\n\nI used the `-j 1` flag to have `mmpdb fragment` use only a single thread,\notherwise each of the four fragment commands will use 4 threads even though my\nlaptop only has 4 cores. 
You should adjust the value to match the resources\navailable on your compute node.\n\nThe `parallel` command doesn't forward output until the program is done, so it\ntakes a while to see messages like:\n\n```\nUsing 'ChEMBL_CYP3A4_hERG.0002.fragdb' as the default --output file.\nFragmented record 249/2026 (12.3%)[15:04:16] Conflicting single bond\ndirections around double bond at index 5.\n[15:04:16]   BondStereo set to STEREONONE and single bond directions set to NONE.\n```\n\nIf no `-o`/`--output` is specified, the `fragment` command uses a name based\non the input name. For example, if the input file is\n`ChEMBL_CYP3A4_hERG.0002.smi` then the default output file is\n`ChEMBL_CYP3A4_hERG.0002.fragdb`.\n\n#### Merge the fragment files\n\nNOTE: This step is only needed if you want to use the merged file as a\n`--cache` for new fragmentation. The `fragdb_constants` and `fragdb_partition`\ncommands can work directly on the un-merged fragdb files.\n\nAbout 28 minutes later I have 10 fragdb files:\n\n```shell\n% ls -l ChEMBL_CYP3A4_hERG.*.fragdb\n-rw-r--r--  1 dalke  admin  17862656 Feb 10 15:17 ChEMBL_CYP3A4_hERG.0000.fragdb\n-rw-r--r--  1 dalke  admin  38285312 Feb 10 15:27 ChEMBL_CYP3A4_hERG.0001.fragdb\n-rw-r--r--  1 dalke  admin  15024128 Feb 10 15:16 ChEMBL_CYP3A4_hERG.0002.fragdb\n-rw-r--r--  1 dalke  admin  15929344 Feb 10 15:16 ChEMBL_CYP3A4_hERG.0003.fragdb\n-rw-r--r--  1 dalke  admin  18063360 Feb 10 15:23 ChEMBL_CYP3A4_hERG.0004.fragdb\n-rw-r--r--  1 dalke  admin  20586496 Feb 10 15:24 ChEMBL_CYP3A4_hERG.0005.fragdb\n-rw-r--r--  1 dalke  admin  24911872 Feb 10 15:26 ChEMBL_CYP3A4_hERG.0006.fragdb\n-rw-r--r--  1 dalke  admin  16875520 Feb 10 15:28 ChEMBL_CYP3A4_hERG.0007.fragdb\n-rw-r--r--  1 dalke  admin  12451840 Feb 10 15:28 ChEMBL_CYP3A4_hERG.0008.fragdb\n-rw-r--r--  1 dalke  admin  11010048 Feb 10 15:29 ChEMBL_CYP3A4_hERG.0009.fragdb\n```\n\nI'll merge these with the `fragdb_merge` command:\n\n```shell\n% mmpdb fragdb_merge ChEMBL_CYP3A4_hERG.*.fragdb 
-o ChEMBL_CYP3A4_hERG.fragdb\nMerge complete. #files: 10 #records: 18759 #error records: 1501\n```\n\nThis took about 4 seconds.\n\n#### Use the merged fragment file as cache\n\nThe merged file can be used as a cache file for future fragmentations, such as:\n\n```shell\n% ls ChEMBL_CYP3A4_hERG.*.smi | \\\n    qsub mmpdb fragment --cache ChEMBL_CYP3A4_hERG.fragdb -j 1\n```\n\nThis rebuild using the cache takes about 20 seconds.\n\n# Distributed indexing\n\nThe `mmpdb index` command is single-threaded. It's possible to parallelize\nindexing by partitioning the fragments with the same constant SMILES into\ntheir own fragdb data sets, indexing those files, then merging the results\nback into a full MMP database.\n\nNote: the merge command can only be used to merge MMP databases with distinct\nconstants. It cannot be used to merge arbitrary MMP databases.\n\nNote: the MMP database only stores aggregate information about pair\nproperties, and the aggregate values cannot be meaningfully merged, so the\nmerge command will ignore any properties in the database.\n\n#### Partitioning on all constants\n\nThe `mmpdb fragdb_partition` command splits one or more fragment databases\ninto N smaller files. All of the fragmentations with the same constant are in\nthe same file.\n\nNOTE: the fragdb files from the `fragment` command have a slightly different\nstructure than the ones from the `fragdb_partition` command. The fragment fragdb\nfiles only contain the input records that were fragmented. Each partition\nfragdb file contains *all* of the input records from the input fragment\nfile(s). 
This is needed to handle 1-cut hydrogen matched molecular pairs.\n\nIf you specify multiple fragdb files then by default the results are put into\nfiles matching the template \"partition.{i:04d}.fragdb\", as in the following:\n\n```shell\n% mmpdb fragdb_partition ChEMBL_CYP3A4_hERG.*.fragdb\nAnalyzed 'ChEMBL_CYP3A4_hERG.0000.fragdb': #constants: 48895 #fragmentations: 109087\nAnalyzed 'ChEMBL_CYP3A4_hERG.0001.fragdb': #constants: 70915 #fragmentations: 212777\nAnalyzed 'ChEMBL_CYP3A4_hERG.0002.fragdb': #constants: 52370 #fragmentations: 100594\nAnalyzed 'ChEMBL_CYP3A4_hERG.0003.fragdb': #constants: 49021 #fragmentations: 103350\nAnalyzed 'ChEMBL_CYP3A4_hERG.0004.fragdb': #constants: 52318 #fragmentations: 112930\nAnalyzed 'ChEMBL_CYP3A4_hERG.0005.fragdb': #constants: 55977 #fragmentations: 123463\nAnalyzed 'ChEMBL_CYP3A4_hERG.0006.fragdb': #constants: 64083 #fragmentations: 164259\nAnalyzed 'ChEMBL_CYP3A4_hERG.0007.fragdb': #constants: 51605 #fragmentations: 114113\nAnalyzed 'ChEMBL_CYP3A4_hERG.0008.fragdb': #constants: 44149 #fragmentations: 80613\nAnalyzed 'ChEMBL_CYP3A4_hERG.0009.fragdb': #constants: 35889 #fragmentations: 69029\nAnalyzed 10 databases. 
Found #constants: 467865 #fragmentations: 1190215\nExporting 1 constants to 'partition.0000.fragdb' (#1/10, weight: 334589647)\nExporting 1 constants to 'partition.0001.fragdb' (#2/10, weight: 270409141)\nExporting 1 constants to 'partition.0002.fragdb' (#3/10, weight: 225664391)\nExporting 1 constants to 'partition.0003.fragdb' (#4/10, weight: 117895691)\nExporting 77977 constants to 'partition.0004.fragdb' (#5/10, weight: 52836587)\nExporting 77978 constants to 'partition.0005.fragdb' (#6/10, weight: 52836587)\nExporting 77975 constants to 'partition.0006.fragdb' (#7/10, weight: 52836586)\nExporting 77976 constants to 'partition.0007.fragdb' (#8/10, weight: 52836586)\nExporting 77977 constants to 'partition.0008.fragdb' (#9/10, weight: 52836586)\nExporting 77978 constants to 'partition.0009.fragdb' (#10/10, weight: 52836586)\n```\n\nThe command's `--template` option lets you specify how to generate the output\nfilenames.\n\nWhy are there so few constants in the first files and so many in the others? 
And\nwhat are the \"weight\"s?\n\nI'll use the `fragdb_constants` command to show the distinct constants in each\nfile and the number of occurrences.\n\n```shell\n% mmpdb fragdb_constants partition.0000.fragdb\nconstant        N\n*C      25869\n```\n\nThat's a lot of methyls (25,869 to be precise).\n\nThe indexing command does `N*(N-1)/2` indexing comparisons, plus a 1-cut\nhydrogen match, so the cost estimate for the methyls is `25869*(25869-1)/2+1 =\n334589647`, which is the `weight` value listed above.\n\nI'll next list the three most common and three least common constants in\npartition.0004.fragdb:\n\n```shell\n% mmpdb fragdb_constants partition.0004.fragdb --limit 3\nconstant        N\n*C.*C.*OC       7076\n*C.*Cl  4388\n*C.*C.*CC       3261\n% mmpdb fragdb_constants partition.0004.fragdb | tail -3\n*n1nnnc1SCC(=O)Nc1nc(-c2ccc(Cl)cc2)cs1  1\n*n1nnnc1SCc1nc(N)nc(N2CCOCC2)n1 1\n*n1s/c(=N/C)nc1-c1ccccc1        1\n```\n\nThe values of N are much smaller, so the corresponding weight is significantly\nsmaller.\n\nBy default the partition command tries to split the constants evenly (by\nweight) across `-n` / `--num-files` files, defaulting to 10, which combined\nwith the quadratic weighting is why the first few files have only a single,\nvery common, constant, and why all of the \"1\" counts are used to fill space in\nthe remaining files.\n\nYou can alternatively use `--max-weight` to set an upper bound for the weights\nin each file. 
In this example I'll use the merged fragdb file from the\nprevious step:\n\n```shell\n% mmpdb fragdb_partition ChEMBL_CYP3A4_hERG.fragdb --max-weight 50000000\nAnalyzed 'ChEMBL_CYP3A4_hERG.fragdb': #constants: 467865 #fragmentations: 1190215\nExporting 1 constants to 'ChEMBL_CYP3A4_hERG-partition.0000.fragdb' (#1/11, weight: 334589647)\nExporting 1 constants to 'ChEMBL_CYP3A4_hERG-partition.0001.fragdb' (#2/11, weight: 270409141)\nExporting 1 constants to 'ChEMBL_CYP3A4_hERG-partition.0002.fragdb' (#3/11, weight: 225664391)\nExporting 1 constants to 'ChEMBL_CYP3A4_hERG-partition.0003.fragdb' (#4/11, weight: 117895691)\nExporting 10 constants to 'ChEMBL_CYP3A4_hERG-partition.0004.fragdb' (#5/11, weight: 49918518)\nExporting 11 constants to 'ChEMBL_CYP3A4_hERG-partition.0005.fragdb' (#6/11, weight: 49916276)\nExporting 13 constants to 'ChEMBL_CYP3A4_hERG-partition.0006.fragdb' (#7/11, weight: 49899719)\nExporting 7 constants to 'ChEMBL_CYP3A4_hERG-partition.0007.fragdb' (#8/11, weight: 49896681)\nExporting 43 constants to 'ChEMBL_CYP3A4_hERG-partition.0008.fragdb' (#9/11, weight: 49893145)\nExporting 9 constants to 'ChEMBL_CYP3A4_hERG-partition.0009.fragdb' (#10/11, weight: 49879752)\nExporting 467768 constants to 'ChEMBL_CYP3A4_hERG-partition.0010.fragdb' (#11/11, weight: 17615427)\n```\n\nIf you specify a single fragdb filename then the default output template is\n\"{prefix}-partition.{i:04}.fragdb\" where \"{prefix}\" is the part of the fragdb\nfilename before its extension. The idea is to help organize those files\ntogether.\n\nOdds are, you don't want to index the most common fragments. The next two\nsections help limit which constants are used.\n\n#### Selecting constants\n\nAs you saw, the `mmpdb fragdb_constants` command can be used to list the\nconstants. 
It can also be used to list a subset of the constants.\n\nThe count for each constant quickly decreases to something a bit more\nmanageable.\n\n```shell\n% mmpdb fragdb_constants ChEMBL_CYP3A4_hERG.*.fragdb --limit 20\nconstant        N\n*C      25869\n*C.*C   23256\n*C.*C.*C        21245\n*C.*C.*O        15356\n*C.*O   8125\n*C.*C.*OC       7076\n*C.*OC  6878\n*F      6201\n*C.*F   6198\n*C.*c1ccccc1    5124\n*C.*O.*O        5117\n*c1ccccc1       5073\n*OC     4944\n*Cl     4436\n*C.*Cl  4388\n*O      4300\n*F.*F   4281\n*C.*F.*F        3935\n*C.*C.*F        3656\n*F.*F.*F        3496\n```\n\nI'll select those constants which occur 2,000 times or fewer, and limit\nthe output to the first 5.\n\n```shell\n% mmpdb fragdb_constants ChEMBL_CYP3A4_hERG.*.fragdb --max-count 2000 --limit 5\nconstant        N\n*C.*CC.*O       1954\n*C.*C(F)(F)F    1915\n*C.*C.*OC(C)=O  1895\n*C(F)(F)F       1892\n*Cl.*Cl 1738\n```\n\nOr I can count the number of constants which occur only once (the 1-cut constants\nmight match with a hydrogen substitution while the others will never match).\nI'll use `--no-header` so the number of lines of output matches the number of\nconstants:\n\n```shell\n% mmpdb fragdb_constants ChEMBL_CYP3A4_hERG.fragdb --max-count 1 --no-header | wc -l\n  370524\n```\n\nThese frequent constants are for small fragments. 
I'll limit the selection to\nconstants where each part of the constant has at least 5 heavy atoms:\n\n```shell\n% mmpdb fragdb_constants ChEMBL_CYP3A4_hERG.*.fragdb --min-heavies-per-const-frag 5 --limit 4\nconstant        N\n*c1ccccc1       5073\n*c1ccccc1.*c1ccccc1     1116\n*Cc1ccccc1      1050\n*c1ccc(F)cc1    921\n```\n\nI'll also require `N` be between 10 and 1000.\n\n```shell\n% mmpdb fragdb_constants ChEMBL_CYP3A4_hERG.*.fragdb --min-heavies-per-const-frag 5 \\\n   --min-count 10 --max-count 1000 --no-header | wc -l\n1940\n```\n\nThat's a much more tractable size for this example.\n\nAs you saw earlier, the `mmpdb fragdb_partition` command by default partitions\non all constants. Alternatively, use the `--constants` flag to pass in a list\nof constants to use. This can be a file name, or `-` to accept constants from\nstdin, as in the following three lines:\n\n```shell\n% mmpdb fragdb_constants ChEMBL_CYP3A4_hERG.*.fragdb --min-heavies-per-const-frag 5 \\\n     --min-count 10 --max-count 1000 | \\\n     mmpdb fragdb_partition ChEMBL_CYP3A4_hERG.*.fragdb --constants -\nExporting 1 constants to 'ChEMBL_CYP3A4_hERG.0000.fragdb' (weight: 423661)\nExporting 1 constants to 'ChEMBL_CYP3A4_hERG.0001.fragdb' (weight: 382376)\nExporting 109 constants to 'ChEMBL_CYP3A4_hERG.0002.fragdb' (weight: 382044)\nExporting 261 constants to 'ChEMBL_CYP3A4_hERG.0003.fragdb' (weight: 382013)\nExporting 261 constants to 'ChEMBL_CYP3A4_hERG.0004.fragdb' (weight: 382013)\nExporting 260 constants to 'ChEMBL_CYP3A4_hERG.0005.fragdb' (weight: 382010)\nExporting 261 constants to 'ChEMBL_CYP3A4_hERG.0006.fragdb' (weight: 382010)\nExporting 262 constants to 'ChEMBL_CYP3A4_hERG.0007.fragdb' (weight: 382010)\nExporting 262 constants to 'ChEMBL_CYP3A4_hERG.0008.fragdb' (weight: 382009)\nExporting 262 constants to 'ChEMBL_CYP3A4_hERG.0009.fragdb' (weight: 382003)\n```\n\nNote: the `--constants` parser expects the first line to be a header, which is\nwhy I don't use `--no-header` in the 
`fragdb_constants` command.\nAlternatively, also use `--no-header` in the `fragdb_partition` command if the\ninput does not have a header.\n\n#### Partitioning in parallel\n\nPartitioning large data sets may take significant time because the export\nprocess is single-threaded.\n\nThe `fragdb_partition` command can be configured to export only a subset of the\npartitions using a simple round-robin scheme. If you specify `--task-id n` and\n`--num-tasks N` then the given fragdb_partition will only export partitions\n`i` such that `i % N == n`.\n\nThe expected approach is to create a single constants file which will be\nshared by multiple partition commands.\n\n```shell\n% mmpdb fragdb_constants ChEMBL_CYP3A4_hERG.*.fragdb --min-heavies-per-const-frag 5 \\\n     --min-count 10 --max-count 1000 -o constants.dat\n```\n\nThe following splits the job across two partition commands, with task ids 0\nand 1, respectively:\n\n```shell\n% mmpdb fragdb_partition ChEMBL_CYP3A4_hERG.*.fragdb --constants constants.dat --task-id 0 --num-tasks 2\nExporting 1 constants to 'partition.0000.fragdb' (#1/10, weight: 423661)\nExporting 109 constants to 'partition.0002.fragdb' (#3/10, weight: 382044)\nExporting 261 constants to 'partition.0004.fragdb' (#5/10, weight: 382013)\nExporting 261 constants to 'partition.0006.fragdb' (#7/10, weight: 382010)\nExporting 262 constants to 'partition.0008.fragdb' (#9/10, weight: 382009)\n% mmpdb fragdb_partition ChEMBL_CYP3A4_hERG.*.fragdb --constants constants.dat --task-id 1 --num-tasks 2\nExporting 1 constants to 'partition.0001.fragdb' (#2/10, weight: 382376)\nExporting 261 constants to 'partition.0003.fragdb' (#4/10, weight: 382013)\nExporting 260 constants to 'partition.0005.fragdb' (#6/10, weight: 382010)\nExporting 262 constants to 'partition.0007.fragdb' (#8/10, weight: 382010)\nExporting 262 constants to 'partition.0009.fragdb' (#10/10, weight: 382003)\n```\n\nUse the `--dry-run` option to get an idea of how many files will be created:\n```shell\n% 
mmpdb fragdb_partition ChEMBL_CYP3A4_hERG.*.fragdb --constants constants.dat --dry-run\ni       #constants      weight  filename\n0       10      423661  'partition.0000.fragdb'\n1       10      382376  'partition.0001.fragdb'\n2       10      382044  'partition.0002.fragdb'\n3       10      382013  'partition.0003.fragdb'\n4       10      382013  'partition.0004.fragdb'\n5       10      382010  'partition.0005.fragdb'\n6       10      382010  'partition.0006.fragdb'\n7       10      382010  'partition.0007.fragdb'\n8       10      382009  'partition.0008.fragdb'\n9       10      382003  'partition.0009.fragdb'\n```\n \n#### Indexing in parallel\n\nThe partitioned fragdb files can be indexed in parallel:\n\n```shell\n% ls partition.*.fragdb | qsub mmpdb index\nWARNING: No --output filename specified. Saving to 'partition.0000.mmpdb'.\nWARNING: No --output filename specified. Saving to 'partition.0001.mmpdb'.\nWARNING: No --output filename specified. Saving to 'partition.0002.mmpdb'.\nWARNING: No --output filename specified. Saving to 'partition.0003.mmpdb'.\nWARNING: No --output filename specified. Saving to 'partition.0004.mmpdb'.\nWARNING: No --output filename specified. Saving to 'partition.0005.mmpdb'.\nWARNING: No --output filename specified. Saving to 'partition.0006.mmpdb'.\nWARNING: No --output filename specified. Saving to 'partition.0007.mmpdb'.\nWARNING: No --output filename specified. Saving to 'partition.0008.mmpdb'.\nWARNING: No --output filename specified. 
Saving to 'partition.0009.mmpdb'.\n```\n\n(If you don't like these warning messages, use the `--quiet` flag.)\n\n#### Merging partitioned mmpdb files\n\nThe last step is to merge the partitioned mmpdb files with the `merge` command,\nwhich only works if no two mmpdb files share the same constant:\n\n```shell\n% mmpdb merge partition.*.mmpdb -o ChEMBL_CYP3A4_hERG_distributed.mmpdb\n[Stage 1/7] Merging compound records ...\n[Stage 1/7] Merged 4428 compound records in 0.046 seconds.\n[Stage 2/7] Merging rule_smiles tables ...\n[Stage 2/7] Merged 3159 rule_smiles records in 0.030 seconds.\n[Stage 3/7] Merging rule tables ...\n[Stage 3/7] Merged 21282 rule records in 0.072 seconds.\n[Stage 4/7] Merging environment_fingerprint records ...\n[Stage 4/7] Merged 1753 environment_fingerprint records in 0.035 seconds.\n[Stage 5/7] Merging rule environment records ...\n[Stage 5/7] Merged 143661 rule environment records in 0.47 seconds.\n[Stage 6/7] Merging constant_smiles and pair records ...\n[Stage 6/7] Merged 893 constant SMILES and 203856 pair records in 0.26 seconds\n[Stage 7/7] Indexed and analyzed the merged records in 0.33 seconds.\nMerged 10 files in 1.3 seconds.\n```\n\nLet's take a look:\n\n```shell\n% mmpdb list ChEMBL_CYP3A4_hERG_distributed.mmpdb\n                Name                 #cmpds #rules #pairs #envs  #stats  |-------- Title --------| Properties\nChEMBL_CYP3A4_hERG_distributed.mmpdb   4428  21282 203856 143661      0  Merged MMPs from 10 files \u003cnone\u003e\n```\n\nFinally, I'll cross-check this with a normal `mmpdb index`. I need to create\nthe same subset:\n\n```shell\n% mmpdb fragdb_partition ChEMBL_CYP3A4_hERG.fragdb --constants constants.dat \\\n      -n 1 --template ChEMBL_CYP3A4_hERG_subset.fragdb\nExporting 1940 constants to 'ChEMBL_CYP3A4_hERG_subset.fragdb' (#1/1, weight: 3862149)\n```\n\nThen index the subset:\n\n```shell\n% mmpdb index ChEMBL_CYP3A4_hERG_subset.fragdb\nWARNING: No --output filename specified. 
Saving to 'ChEMBL_CYP3A4_hERG_subset.mmpdb'.\n```\n\nAnd finally, compare the two:\n\n```shell\n% mmpdb list ChEMBL_CYP3A4_hERG_subset.mmpdb ChEMBL_CYP3A4_hERG_distributed.mmpdb\n                Name                 #cmpds #rules #pairs #envs  #stats  |----------------- Title ------------------| Properties\n     ChEMBL_CYP3A4_hERG_subset.mmpdb   4428  21282 203856 143661      0  MMPs from 'ChEMBL_CYP3A4_hERG_subset.fragdb' \u003cnone\u003e\nChEMBL_CYP3A4_hERG_distributed.mmpdb   4428  21282 203856 143661      0  Merged MMPs from 10 files                    \u003cnone\u003e\n```\n\nThey are the same, except for the title.\n\n\n------------------\n\n\n## History and Acknowledgements\n\n\nThe project started as a fork of the matched molecular pair program\n'mmpa' written by Jameed Hussain, then at GlaxoSmithKline Research \u0026\nDevelopment Ltd. Many thanks to them for contributing the code to the\nRDKit project under a free software license.\n\nSince then it has gone through two rewrites before the 1.0\nrelease. Major changes to the first version included:\n\n  - performance improvements,\n\n  - support for property prediction,\n\n  - environmental fingerprints.\n  \nThat version supported both MySQL and SQLite, and used the third-party\n\"peewee.py\" and \"playhouse\" code to help with database\nportability. Many thanks to Charles Leifer for that software.\n\nThe second version dropped MySQL support but added APSW support, which\nwas already available in the peewee/playhouse modules. The major goals\nin version 2 were:\n\n- better support for chiral structures\n\n- canonical variable fragments, so the transforms are canonical\n   on both the left-hand and right-hand sides. (Previously only\n   the entire transform was canonical.)\n\nThe project then forked into three branches:\n\n1. The public GitHub branch, with a few improvements by Christian\n  Kramer\n\n2. 
Andrew Dalke's crowd-funded branch which:\n  - replaced the Morgan fingerprint-based hashed environment\n  fingerprint with its canonical SMARTS equivalent, and a\n  \"pseudo-SMILES\" which might be used in depictions\n  - added Postgres support\n  - added export methods to tab-separated and database\n  dump formats\n\n3. Mahendra Awale's improvements for:\n  - large-database mmpdb generation by partitioning\n   on fragment constants\n  - playbook generation\n\nRoche funded Andrew Dalke to merge these three branches, resulting in\nmmpdb 3.0.\n\n------------------\n\n\n## Copyright\n\n\nThe mmpdb package is copyright 2015-2023 by F. Hoffmann-La Roche Ltd\nand Andrew Dalke Scientific AB, and distributed under the 3-clause BSD\nlicense. See [LICENSE](LICENSE) for details.\n\n\n------------------\n\n\n## License information\n\n\nThe software derives from software which is copyright 2012-2013 by\nGlaxoSmithKline Research \u0026 Development Ltd., and distributed under the\n3-clause BSD license. To the best of our knowledge, mmpdb does not contain any\nof the mmpa original source code. We thank the authors for releasing this\npackage and include their license in the credits. See [LICENSE](LICENSE) for details.\n\nThe file fileio.py originates from [chemfp](http://chemfp.com) and is therefore\ncopyright by Andrew Dalke Scientific AB under the MIT license. See\n[LICENSE](LICENSE) for details. Modifications to this file are covered under\nthe mmpdb license.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frdkit%2Fmmpdb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frdkit%2Fmmpdb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frdkit%2Fmmpdb/lists"}