{"id":17255154,"url":"https://github.com/unixjunkie/propbox","last_synced_at":"2025-03-26T08:15:58.130Z","repository":{"id":144782617,"uuid":"141226631","full_name":"UnixJunkie/propbox","owner":"UnixJunkie","description":"mirror of https://bitbucket.org/dalke/propbox","archived":false,"fork":false,"pushed_at":"2018-07-17T03:32:57.000Z","size":61,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-01-31T09:31:18.631Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/UnixJunkie.png","metadata":{"files":{"readme":"README","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-07-17T03:32:25.000Z","updated_at":"2022-07-15T17:36:31.000Z","dependencies_parsed_at":"2024-03-31T01:15:14.112Z","dependency_job_id":null,"html_url":"https://github.com/UnixJunkie/propbox","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UnixJunkie%2Fpropbox","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UnixJunkie%2Fpropbox/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UnixJunkie%2Fpropbox/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UnixJunkie%2Fpropbox/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/UnixJunkie","download_url":"https://codeload.github.com/UnixJunkie/propbox/tar.gz/refs/heads/maste
r","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245615009,"owners_count":20644376,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-15T07:10:50.077Z","updated_at":"2025-03-26T08:15:58.109Z","avatar_url":"https://github.com/UnixJunkie.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"                      propbox 0.5\n\nSummary\n=======\n\nPropbox is a Python package for computing molecular properties and\nmodels, and handling the dependencies between the calculations.\n\nThe dependencies form a workflow. For example, the steps in building a\nconsensus model may look like this:\n\n  - the input is a SMILES string\n  - turn the SMILES into a molecule\n  - desalt it and standardize the charge model\n  - use the clean molecule to compute logP,\n      molecular weight, and a few other descriptors\n  - use the descriptors to compute model-1,\n      model-2, and model-3\n  - use model-1, model-2, and model-3 to compute\n      a consensus model\n\nRather than arrange the steps by hand, propbox uses a set of resolvers\nto fill out a table of properties. The table starts with the input\ndata - one row per record. You ask the table for the output columns\nyou want. If a property isn't available, the table asks the resolver\nto fill in the missing column. That operation may require additional\ndata, in which case the resolver goes back to the table to ask for\nthose columns. This process continues recursively until it gets to\navailable data. 
(Or if there's a cycle, until Python reaches its\nmaximum recursion depth and raises an exception.) Each resolver then\nresolves the column data and the process unwinds until all of the\nneeded columns are filled in.\n\n\nInstallation\n============\n\nThis package does not yet support the standard Python installer. You\ncan run it from the current directory, or copy/move/link the 'propbox'\nsubdirectory to your location of choice.\n\nLicense\n=======\n\nThe propbox package is distributed under the MIT license. (See\nCOPYING.) The package includes a distribution of the third-party\npylru.py, which is copyright Jay Hutchinson and distributed under the\nGPLv2 or later. (See COPYING.pylru.)\n\n\n\n'rdprops' command-line tool\n===========================\n\nThe 'rdprops' command-line program computes molecular descriptors\nusing the RDKit cheminformatics toolkit from rdkit.org. It implements\nthe descriptors from rdkit.Chem.Descriptors as well as a few versions\nof SMILES strings.\n\nBy default it reads a SMILES file from stdin and writes the results to\nstdout. I'll ask it to read from a named SMILES file instead, and only\nshow the first few lines of output::\n\n  % ./rdprops tests/benzodiazepine.smi | head\n  id\tsmiles\tMolWt\n  1688\tCN1C(=O)CN=C(c2ccc(Cl)cc2)c2cc(Cl)ccc21\t319.191\n  1963\tOCc1nnc2n1-c1ccc(Cl)cc1C(c1ccccc1Cl)=NC2\t359.216\n  2118\tCc1nnc2n1-c1ccc(Cl)cc1C(c1ccccc1)=NC2\t308.772\n  2802\tO=C1CN=C(c2ccccc2Cl)c2cc([N+](=O)[O-])ccc2N1\t315.716\n  2809\tO=C(O)C1N=C(c2ccccc2)c2cc(Cl)ccc2NC1=O\t314.728\n  2997\tO=C1CN=C(c2ccccc2)c2cc(Cl)ccc2N1\t270.719\n  3016\tCN1C(=O)CN=C(c2ccccc2)c2cc(Cl)ccc21\t284.746\n  3261\tClc1ccc2c(c1)C(c1ccccc1)=NCc1nncn1-2\t294.745\n  3299\tCCOC(=O)C1N=C(c2ccccc2F)c2cc(Cl)ccc2NC1=O\t360.772\n\n\nThe default output contains the record identifier (\"id\"), the\ncanonical isomeric SMILES (\"smiles\"), and the molecular weight\n(\"MolWt\"). 
Use the `--columns` option to specify different columns::\n\n  % ./rdprops tests/benzodiazepine.smi --columns 'id,HeavyAtomCount,MolWt' | head\n  id\tHeavyAtomCount\tMolWt\n  1688\t21\t319.191\n  1963\t24\t359.216\n  2118\t22\t308.772\n  2802\t22\t315.716\n  2809\t22\t314.728\n  2997\t19\t270.719\n  3016\t20\t284.746\n  3261\t21\t294.745\n  3299\t25\t360.772\n\nPropbox uses the RDKit descriptor names for the columns, and by\ndefault uses the names for the column headers. You might prefer\na different header::\n\n  % ./rdprops tests/benzodiazepine.smi --columns 'id,HeavyAtomCount,MolWt' --headers 'ID,HEAVIES,MW' | head\n  ID\tHEAVIES\tMW\n  1688\t21\t319.191\n  1963\t24\t359.216\n  2118\t22\t308.772\n  2802\t22\t315.716\n  2809\t22\t314.728\n  2997\t19\t270.719\n  3016\t20\t284.746\n  3261\t21\t294.745\n  3299\t25\t360.772\n\nor perhaps don't want a header at all::\n\n  % ./rdprops tests/benzodiazepine.smi --columns 'id,HeavyAtomCount,MolWt' --no-header | head\n  1688\t21\t319.191\n  1963\t24\t359.216\n  2118\t22\t308.772\n  2802\t22\t315.716\n  2809\t22\t314.728\n  2997\t19\t270.719\n  3016\t20\t284.746\n  3261\t21\t294.745\n  3299\t25\t360.772\n  3369\t21\t302.736\n\nThe default output is tab-separated, but you can change that with the\n`--dialect` option, which can be one of 'tab', 'space', 'whitespace',\n'excel' or 'excel-tab'. 
(The 'whitespace' option is the same as\n'space', and the Excel dialects are as defined by Python's csv\nmodule, and include the special rules for quoting)::\n\n  % ./rdprops tests/benzodiazepine.smi --columns 'id,HeavyAtomCount,MolWt' --dialect excel | head\n  id,HeavyAtomCount,MolWt\n  1688,21,319.191\n  1963,24,359.216\n  2118,22,308.772\n  2802,22,315.716\n  2809,22,314.728\n  2997,19,270.719\n  3016,20,284.746\n  3261,21,294.745\n  3299,25,360.772\n\n\nList the available descriptors\n------------------------------\n\nUse the `--list` option to get a list of the available descriptors::\n\n  % ./rdprops --list | wc -l\n       124\n\nThat's rather a lot, so I'll elide some of them::\n\n  % ./rdprops --list\n  _chargeDescriptors\n  BalabanJ\n  BertzCT\n  cansmiles\n  chargeDescriptorVersion\n  Chi0\n  Chi0n\n  Chi0v\n     ...\n  ExactMolWt\n  FractionCSP3\n  HallKierAlpha\n  HeavyAtomCount\n  HeavyAtomMolWt\n  id\n  input_format\n  input_mol\n  input_record\n     ...\n  mol\n  MolLogP\n  MolMR\n  MolWt\n  MolWt_version\n  nci_iupac_name\n  nci_names\n    ...\n  TPSA\n    ...\n  VSA_EState8\n  VSA_EState9\n\nA future version will include a way to get a description of each\ndescriptor.\n\nWhat's also missing is a naming convention or some other mechanism to\nindicate whether it makes sense to print a descriptor as text. For example,\nthe 'mol' property is the RDKit molecule object for the input\nstructure, after de-salting. It doesn't make sense to display the\nopaque text representation of a molecule object ::\n\n  % ./rdprops tests/benzodiazepine.smi --columns 'id,mol' | head -5\n  id\tmol\n  1688\t\u003crdkit.Chem.rdchem.Mol object at 0x105c44910\u003e\n  1963\t\u003crdkit.Chem.rdchem.Mol object at 0x105c44980\u003e\n  2118\t\u003crdkit.Chem.rdchem.Mol object at 0x105c449f0\u003e\n  2802\t\u003crdkit.Chem.rdchem.Mol object at 0x105c44a60\u003e\n\nSimilarly, the _chargeDescriptors property is another internal\nproperty that shouldn't really be exposed. 
(I'll use this as an\nexample of how the quoting rules work for the 'excel' dialect.)::\n\n  % ./rdprops tests/benzodiazepine.smi --columns 'id,_chargeDescriptors' --dialect excel | head -3\n  id,_chargeDescriptors\n  1688,\"ChargeDescriptor(minCharge=-0.31319991842931816, maxCharge=0.24791727974294836)\"\n  1963,\"ChargeDescriptor(minCharge=-0.38834256479943147, maxCharge=0.16298797813009208)\"\n\nI may move to the convention that a leading '_', and perhaps also a\nleading lowercase character, indicate an internal variable. Or I may\nhave some way to mark certain descriptors as only being for internal\nuse. Then again, I like how IPython supports adapters to, for example,\nshow inline images for a molecule in a table. Perhaps I'll do that.\n\nSpecify the format\n------------------\n\nPropbox uses the filename extension to determine the file format, and\nto see if the file is gzip compressed. The following case-insensitive\nextensions are supported:\n\n  .smi, .ism, .isosmi - SMILES\n  .smi.gz, .ism.gz, .isosmi.gz - gzip compressed SMILES\n\n  .sdf, .sd, .mdl - SD file\n  .sdf.gz, .sd.gz, .mdl.gz - gzip compressed SD file\n\nIf propbox does not recognize the file format extension, or if the\ninput comes from stdin, then it will assume the input is an\nuncompressed file format.\n\nYou can specify the format directly using `--format` instead of\ndepending on propbox's auto-detection code. 
For example, since rdprops\nexpects a SMILES file from stdin, piping in an SD file will cause a\nproblem::\n\n\n  % ./rdprops \u003c tests/CHEMBL11862.sdf\n  [01:33:10] SMILES Parse Error: syntax error for input: CHEMBL11862\n  [01:33:10] SMILES Parse Error: syntax error for input: SciTegic11101117232D\n  Traceback (most recent call last):\n    File \"rdprops\", line 9, in \u003cmodule\u003e\n      rdprops.main()\n    File \"/Users/dalke/cvses/propbox/propbox/rdprops.py\", line 174, in main\n      ids_and_mols = list(batch_reader)\n    File \"/Users/dalke/cvses/propbox/propbox/rdkit_toolkit.py\", line 183, in _read_smiles\n      raise ValueError(\"Line %d is empty\" % (lineno,))\n  ValueError: Line 3 is empty\n  \n\nI'll instead tell it the input is an uncompressed SD file::\n\n  % ./rdprops --format sdf \u003c tests/CHEMBL11862.sdf\n  id\tsmiles\tMolWt\n  CHEMBL11862\tOc1cc2c(cc1O)CNCC2\t165.192\n\nThe supported formats are 'smi', 'smi.gz', 'sdf', and 'sdf.gz', with\nthe expected meanings.\n\n\nUse an SD tag as a title\n------------------------\n\nBy default propbox will use the title line of the SD file as the\nidentifier. Sometimes the identifier is in one of the tags, as in ChEBI\nand older ChEMBL data sets, or you may want to use the InChI or other\nprimary key stored in a tag.\n\nFor example, the title line in CHEMBL11862.sdf is \"CHEMBL11862\"::\n\n  % ./rdprops tests/CHEMBL11862.sdf\n  id\tsmiles\tMolWt\n  CHEMBL11862\tOc1cc2c(cc1O)CNCC2\t165.192\n\nwhile the SD tag 'nci_iupac_name' contains the IUPAC name that I got\nfrom passing the structure over to NCI::\n\n  % ./rdprops --id-tag nci_iupac_name tests/CHEMBL11862.sdf \n  id\tsmiles\tMolWt\n  1,2,3,4-tetrahydroisoquinoline-6,7-diol\tOc1cc2c(cc1O)CNCC2\t165.192\n  \n\n\n\nReader arguments\n----------------\n\nThe RDKit SMILES and SDF readers support a few options:\n\n  SMILES:\n    has_header - Is the first line of the SMILES file a\n       header line? 
(boolean, with default of False)\n       \n    delimiter - How should the fields of a\n       SMILES file be parsed? (One of 'space'/\" \", 'tab'/\"\\t\",\n       'whitespace', or 'to-eol', with default of 'to-eol')\n\n    sanitize - Should the newly parsed molecule be\n       sanitized? (boolean, with default of True)\n\n\n  SDF:\n    strictParsing - Use strict parsing rules? (boolean,\n       with default of True)\n    \n    removeHs - Should hydrogens be removed from the\n       molecule? (boolean, with default of True)\n    \n    sanitize - same as in SMILES\n\n\nThe \"delimiter\" option is a bit unusual. Different people have\ndifferent interpretations of what a SMILES file means. The original\nDaylight definition was that the file contains a SMILES, followed by\nwhitespace, and the rest of the line is the identifier.\n\nIn propbox (and in chemfp) this is called the 'to-eol' delimiter, and\nis the default.\n\nOther people think of a SMILES file as a space, tab, or whitespace\nseparated file, where the first column is the SMILES, the second\ncolumn is the identifier, and additional columns are ignored.  In\npropbox these are referred to as the \"space\", \"tab\", and \"whitespace\"\ndelimiter styles, respectively. (\"Whitespace\" means that each word is\ntreated as its own field.)\n\nYou can specify these reader arguments on the command line. 
For\nexample, \"tests/drugs.smi\" is a file I got from Daylight many years\nago::\n\n  % cat tests/drugs.smi \n  N12CCC36C1CC(C(C2)=CCOC4CC5=O)C4C3N5c7ccccc76 Strychnine\n  c1ccccc1C(=O)OC2CC(N3C)CCC3C2C(=O)OC cocaine\n  COc1cc2c(ccnc2cc1)C(O)C4CC(CC3)C(C=C)CN34 quinine\n  OC(=O)C1CN(C)C2CC3=CCNc(ccc4)c3c4C2=C1 lyseric acid\n  CCN(CC)C(=O)C1CN(C)C2CC3=CNc(ccc4)c3c4C2=C1 LSD\n  C123C5C(O)C=CC2C(N(C)CC1)Cc(ccc4O)c3c4O5 morphine\n  C123C5C(OC(=O)C)C=CC2C(N(C)CC1)Cc(ccc4OC(=O)C)c3c4O5 heroin\n  c1ncccc1C1CCCN1C nicotine\n  CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12 caffeine\n  C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a\n\nTwo of the identifiers, \"lyseric acid\" and \"vitamin a\", contain a\nspace. The default delimiter style is 'to-eol', which is why\nthe following shows the full names::\n\n  % ./rdprops --columns 'id,MolWt' tests/drugs.smi\n  id\tMolWt\n  Strychnine\t334.419\n  cocaine\t303.358\n  quinine\t324.424\n  lyseric acid\t282.343\n  LSD\t323.44\n  morphine\t285.343\n  heroin\t369.417\n  nicotine\t162.236\n  caffeine\t194.194\n  vitamin a\t272.432\n\nTo specify the 'whitespace' delimiter style, use the `-R` parameter,\nwhich takes a NAME=VALUE setting::\n\n  % ./rdprops --columns 'id,MolWt' -R delimiter=whitespace tests/drugs.smi \n  id\tMolWt\n  Strychnine\t334.419\n  cocaine\t303.358\n  quinine\t324.424\n  lyseric\t282.343\n  LSD\t323.44\n  morphine\t285.343\n  heroin\t369.417\n  nicotine\t162.236\n  caffeine\t194.194\n  vitamin\t272.432\n  \n\nThe boolean reader args interpret the strings \"True\", \"true\", or \"1\" as\na true value, and \"False\", \"false\", or \"0\" as a false value. 
For\nexample, the following will skip the first line of drugs.smi on the\nassumption that it's a header line::\n\n  % ./rdprops --columns 'id,MolWt' -R has_header=true tests/drugs.smi\n  id\tMolWt\n  cocaine\t303.358\n  quinine\t324.424\n  lyseric acid\t282.343\n  LSD\t323.44\n  morphine\t285.343\n  heroin\t369.417\n  nicotine\t162.236\n  caffeine\t194.194\n  vitamin a\t272.432\n  \n\nBatch size\n----------\n\nThe 'nci_iupac_name' uses the NCI web service API to turn a SMILES\ninto an IUPAC name. This is mostly a proof-of-concept API, and it's\nrather slow since I make a request for each record. (Does the NCI\nresolver have a batch mode API?) Still, let's give it a whirl::\n\n  % ./rdprops --columns 'id,nci_iupac_name' tests/drugs.smi\n  id\tnci_iupac_name\n  Strychnine\t*\n  cocaine\tmethyl 3-(benzoyloxy)-8-methyl-8-azabicyclo[3.2.1]octane-2-carboxylate\n  quinine\t(5-ethenyl-1-azabicyclo[2.2.2]octan-7-yl)-(6-methoxyquinolin-4-yl)methanol\n  lyseric acid\t*\n  LSD\t*\n  morphine\t*\n  heroin\t*\n  nicotine\t3-(1-methylpyrrolidin-2-yl)pyridine\n  caffeine\t1,3,7-trimethylpurine-2,6-dione\n  vitamin a\t*\n\nThis took about 3 seconds, but you'll notice that there was no output\nuntil everything was ready. This is because propbox by default\nprocesses the records in batches of 1,000 records. 
It will compute the\nproperties for the first 1,000 structures, then display the result,\nthen compute the properties for the second 1,000 structures, then\ndisplay those results, etc.\n\nI can ask it to process one record at a time using the `--batch-size`\nparameter::\n\n  % ./rdprops --columns 'id,nci_iupac_name' --batch-size 1 tests/drugs.smi \n  id\tnci_iupac_name\n  Strychnine\t*\n  cocaine\tmethyl 3-(benzoyloxy)-8-methyl-8-azabicyclo[3.2.1]octane-2-carboxylate\n  quinine\t(5-ethenyl-1-azabicyclo[2.2.2]octan-7-yl)-(6-methoxyquinolin-4-yl)methanol\n  lyseric acid\t*\n  LSD\t*\n  morphine\t*\n  heroin\t*\n  nicotine\t3-(1-methylpyrrolidin-2-yl)pyridine\n  caffeine\t1,3,7-trimethylpurine-2,6-dione\n  vitamin a\t*\n\n\n(Propbox uses a '*' for records which had a problem. There is currently\nno way to use another symbol.)\n\nIn the NCI case there is no timing difference between a batch size of\n1 and of 1,000 records because the propbox NCI client makes one\nrequest at a time. Batch mode exists because in some cases it's faster\nto process N molecules at once than to process each one\nindividually. Eg, in the future propbox might be able to send all of\nthe queries to the server in a single request, which would save a lot\nof network overhead.\n\nUse `--batch-size all` to process all of the structures in a single\nbatch.\n\n\nAdd a resolver\n--------------\n\nUse `-r` or `--resolver` to add a resolver to the built-in resolver.\n\nI'll cover the details in the next section. For an example of how it\nworks, I'll create a simple model based on the molecular weight and\nthe number of hydrogen bond donors. 
The descriptor will be called\n'model', and located in a file called \"model.py\" in the current\ndirectory (or somewhere else on the Python path)::\n\n  % cat model.py\n  \n  from propbox import calculate, collect_resolvers\n  \n  @calculate()\n  def calc_model(MolWt, NumHDonors):\n    return MolWt * 12.34 / (NumHDonors + 1)\n  \n  resolver = collect_resolvers()\n\nThis is a non-standard resolver, so I need to tell rdprops the\nimport path to load it from::\n\n  % ./rdprops --columns 'id,model' -r model.resolver tests/CHEMBL11862.sdf\n  id\tmodel\n  CHEMBL11862\t509.61732\n\n\nTo double-check, I'll get the molecular weight and number of hbond\ndonors to do the math myself::\n\n  % ./rdprops --columns 'id,MolWt,NumHDonors,model' -r model.resolver tests/CHEMBL11862.sdf\n  id\tMolWt\tNumHDonors\tmodel\n  CHEMBL11862\t165.192\t3\t509.61732\n\nAnd what do you know, it matches!\n\n  \u003e\u003e\u003e 165.192 * 12.34 / (3 + 1)\n  509.61732000000001\n  \n\nThe propbox resolver framework\n==============================\n\n\nPropbox is built around two concepts: a table and a resolver. The rows\nof the table are structure records, and the columns are molecular\nproperties, referenced by name. A resolver is an object which can fill\nin columns of a table. A resolver may get columns from the table in\norder to do its job.\n\nCreate a table\n--------------\n\nThere are two ways to create a table: by rows (\"records\") or by\ncolumns. I'll create a table with no resolver and a single column,\n\"smiles\", with some SMILES data::\n\n  \u003e\u003e\u003e import propbox\n  \u003e\u003e\u003e table = propbox.make_table_from_columns(None, {\"smiles\": [\"C\", \"O=O\"]})\n  \u003e\u003e\u003e table.get_values(\"smiles\")\n  ['C', 'O=O']\n\nMissing identifiers will be created automatically::\n\n  \u003e\u003e\u003e table.get_values(\"id\")\n  ['ID1', 'ID2']\n\nor you can specify the identifiers yourself::\n\n  \u003e\u003e\u003e table = propbox.make_table_from_columns(None, \n  ...   
{\"smiles\": [\"C\", \"O=O\"], \"id\": [\"methane\", \"oxygen\"]})\n  \u003e\u003e\u003e table.get_values(\"smiles\")\n  ['C', 'O=O']\n  \u003e\u003e\u003e table.get_values(\"id\")\n  ['methane', 'oxygen']\n  \nUse make_table_from_records() if you have per-record dictionary data::\n\n  \u003e\u003e\u003e table = propbox.make_table_from_records(None,\n  ...   [{\"smiles\": \"O=O\", \"id\": \"oxygen\"}, {\"smiles\": \"c1ccccc1O\", \"id\": \"phenol\"}])\n  \u003e\u003e\u003e table.get_values(\"smiles\")\n  ['O=O', 'c1ccccc1O']\n  \u003e\u003e\u003e table.get_values(\"id\")\n  ['oxygen', 'phenol']\n\n\nI used None as the resolver, but the None object doesn't support the\nresolver protocol, so if I try to get a column that doesn't yet exist,\nI'll get the following::\n\n  \u003e\u003e\u003e table.get_values(\"MW\")\n  Traceback (most recent call last):\n    File \"\u003cstdin\u003e\", line 1, in \u003cmodule\u003e\n    File \"propbox/__init__.py\", line 709, in get_values\n      futures = self.get_futures(name)\n    File \"propbox/__init__.py\", line 693, in get_futures\n      self.resolver.resolve_column(name, self)\n  AttributeError: 'NoneType' object has no attribute 'resolve_column'\n\n\nDefine a resolver\n-----------------\n\nHere's a resolver which returns a constant value::\n\n  from __future__ import print_function\n  import propbox\n  \n  class Constant(propbox.Resolver):\n      output_names = [\"value\"]\n  \n      def __init__(self, value):\n          self.value = value\n  \n      def resolve_column(self, name, table):\n          table.set_values(\"value\", [self.value] * len(table))\n  \n  table = propbox.make_table_from_records(Constant(4), [{}, {}])\n  print(\"ids\", table.get_values(\"id\"))\n  print(\"values\", table.get_values(\"value\"))\n  \n\nThis creates the following output::\n\n  ids ['ID1', 'ID2']\n  values [4, 4]\n\nHere's how this works: the table doesn't know about the 'value' column, so\nit asks the resolver to resolve the column 'value'. 
The table passes\nitself as the 'table' argument, so the resolver can use the table to get or set\ndata.\n\nThe Constant resolver uses len(table) to get the number of rows in the\ntable -- two in this case -- and creates the list [4, 4], which it then\nuses to set the table column named 'value', which is then available to\nthe table.\n\nThe 'output_names' attribute contains a list of the column names that\nthe resolver can compute. It isn't actually used in this case, since\nthe table will ask the resolver to handle any unknown column. I could,\nfor example, ask for 'xyzzy' and it would call the resolver::\n\n  table = propbox.make_table_from_records(Constant(4), [{}, {}])\n  print(\"ids\", table.get_values(\"id\"))\n  print(\"xyzzy\", table.get_values(\"xyzzy\"))\n\n\nHowever, the table does double-check that the resolver adds the\nrequested column, so the above will generate the following error::\n\n  ids ['ID1', 'ID2']\n  Traceback (most recent call last):\n    File \"tmp.py\", line 15, in \u003cmodule\u003e\n      print(\"xyzzy\", table.get_values(\"xyzzy\"))\n    File \"/Users/dalke/cvses/propbox/propbox/__init__.py\", line 709, in get_values\n      futures = self.get_futures(name)\n    File \"/Users/dalke/cvses/propbox/propbox/__init__.py\", line 698, in get_futures\n      % (self.resolver, name))\n  AssertionError: Resolver \u003c__main__.Constant object at 0x1007e74d0\u003e did not set values for column 'xyzzy'\n\n\n\nDefine a Propbox\n----------------\n\nA Propbox is a resolver which contains other resolvers. It uses the\n'output_names' of the other resolvers to figure out which resolver to\nuse. 
For example, I'll modify the Constant resolver so I can specify\nwhich column it will set::\n\n  from __future__ import print_function\n  import propbox\n  \n  class Constant(propbox.Resolver):\n      def __init__(self, descriptor, value):\n          self.value = value\n          self.descriptor = descriptor\n          self.output_names = [descriptor]\n  \n      def resolve_column(self, name, table):\n          table.set_values(self.descriptor, [self.value] * len(table))\n  \n\nthen create a Propbox which contains two Constants; one which sets\n'value' to 8 and the other which sets 'xyzzy' to 13::\n\n  resolver = propbox.Propbox()\n  resolver.add_resolver(Constant(\"value\", 8))\n  resolver.add_resolver(Constant(\"xyzzy\", 13))\n  \nand finally create a table which uses that Propbox resolver::\n\n  table = propbox.make_table_from_records(resolver, [{}, {}])\n  print(\"value\", table.get_values(\"value\"))\n  print(\"xyzzy\", table.get_values(\"xyzzy\"))\n  print(\"unknown\", table.get_values(\"unknown\"))\n  \n\nWhen I run it, I get the following output::\n\n  value [8, 8]\n  xyzzy [13, 13]\n  Traceback (most recent call last):\n    File \"tmp.py\", line 22, in \u003cmodule\u003e\n      print(\"unknown\", table.get_values(\"unknown\"))\n    File \"/Users/dalke/cvses/propbox/propbox/__init__.py\", line 709, in get_values\n      futures = self.get_futures(name)\n    File \"/Users/dalke/cvses/propbox/propbox/__init__.py\", line 693, in get_futures\n      self.resolver.resolve_column(name, self)\n    File \"/Users/dalke/cvses/propbox/propbox/__init__.py\", line 171, in resolve_column\n      raise PropboxKeyError(self, name)\n  propbox.PropboxKeyError: unknown\n\nIn case you were wondering, the PropboxKeyError inherits from the\nregular KeyError, as well as from propbox.PropboxError.\n\n\nA resolver that uses the table\n------------------------------\n\nThe Constant resolver is pretty boring. 
What about a resolver which\nreturns the length of the SMILES string, stored in the 'smiles'\ncolumn, and sets the column 'len'?::\n\n  from __future__ import print_function\n  import propbox\n  \n  class Len(propbox.Resolver):\n      output_names = [\"len\"]\n      def resolve_column(self, name, table):\n          smiles_list = table.get_values(\"smiles\")\n          table.set_values(\"len\", [len(smiles) for smiles in smiles_list])\n  \n  \n  resolver = Len()\n  \n  table = propbox.make_table_from_columns(\n      resolver, {\"smiles\": [\"C\", \"C#N\", \"c1ccccc1O\"]})\n  \n  print(table.get_values(\"len\"))\n\nThe output from this is::\n\n  [1, 3, 9]\n\nwhich I think you expected.\n\nWhat's new here is that the resolver asked the table to get the\n\"smiles\" column. This is a recursive call, since the table was the one\nto call the resolver in the first place.\n\nThe recursion might go several levels deep. I'll also create a\nDoubleLen resolver, which doubles the value of \"len\" and stores it in\n\"len2\". 
Since I have two resolvers, I'll need\nto put them into a Propbox::\n\n  from __future__ import print_function\n  import propbox\n  \n  class Len(propbox.Resolver):\n      output_names = [\"len\"]\n      def resolve_column(self, name, table):\n          smiles_list = table.get_values(\"smiles\")\n          table.set_values(\"len\", [len(smiles) for smiles in smiles_list])\n  \n  class DoubleLen(propbox.Resolver):\n      output_names = [\"len2\"]\n      def resolve_column(self, name, table):\n          len_list = table.get_values(\"len\")\n          table.set_values(\"len2\", [value*2 for value in len_list])\n  \n  resolver = propbox.Propbox([Len(), DoubleLen()])\n  \n  table = propbox.make_table_from_columns(\n      resolver, {\"smiles\": [\"C\", \"C#N\", \"c1ccccc1O\"]})\n  \n  print(table.get_values(\"len2\"))\n\n\nHere's what happened:\n\n  - The column \"len2\" does not exist, so the table asks the\n      Propbox resolver to fill it in;\n  - The Propbox resolver used the output_names to figure out\n      that the DoubleLen resolver could resolve that column;\n  - The DoubleLen resolver needs the values for the \"len\" column\n      from the table;\n  - The column 'len' doesn't exist, so the table asks the\n      Propbox resolver to fill it in;\n  - The Propbox resolver used the output_names to figure out\n      that the Len resolver could resolve that column;\n  - The Len resolver needs the values for the \"smiles\" column\n      from the table;\n  - The table returns the \"smiles\" column;\n  - The Len resolver computes the string lengths and sets the\n      values for the \"len\" column;\n  - The DoubleLen resolver doubles those values, and sets the\n      \"len2\" column;\n  - The calculations are complete and returned to the caller.\n\n\nAll of the intermediate values are stored in the table in case they\nare needed for additional calculations.\n\n\nFutures\n-------\n\nWhat if there was an error during the calculation? 
For that matter,\nhow does a resolver even indicate an error?\n\nYou'll need to understand 'futures' to understand how errors work in\npropbox.\n\nA future is something which wraps either a return value or a raised\nexception. It's often used in modern asynchronous I/O libraries,\nincluding asyncio in Python 3.4, where it is used as a placeholder for the actual\nreturn value, which will be available in the future. (It's called a\n'promise' in some libraries.)\n\nPropbox is not asynchronous, though I want it to go that way. I use the\n'future' concept as a way to keep track of whether something was a return\nvalue or an exception.\n\nHere's what it looks like, using part of the propbox API that should\nonly be used by resolvers. I'll store the value 12 as if it were a\nsuccessfully computed descriptor::\n\n  \u003e\u003e\u003e from propbox import simple_futures\n  \u003e\u003e\u003e future = simple_futures.new_future(12)\n  \u003e\u003e\u003e future\n  \u003cpropbox.simple_futures.Future object at 0x100666490\u003e\n  \u003e\u003e\u003e future.result()\n  12\n\nThis being Python, I can store anything in the future's result::\n\n  \u003e\u003e\u003e future = simple_futures.new_future(\"twelve\")\n  \u003e\u003e\u003e future.result()\n  'twelve'\n\nI can even store an exception instance as the value::\n\n  \u003e\u003e\u003e future = simple_futures.new_future(ValueError(\"must be a string\"))\n  \u003e\u003e\u003e future.result()\n  ValueError('must be a string',)\n\n\nWhat if I have a \"real\" exception, that is, something which shouldn't\nbe treated as a return value? 
I'll create a future that contains an\nexception::\n\n  \u003e\u003e\u003e future = simple_futures.new_future_exception(ValueError(\"must be a string\"))\n  \u003e\u003e\u003e future.result()\n  Traceback (most recent call last):\n    File \"\u003cstdin\u003e\", line 1, in \u003cmodule\u003e\n    File \"propbox/simple_futures.py\", line 34, in result\n      raise self._exception\n  ValueError: must be a string\n\nI asked the future for its result, but since it contained an\nexception, it raised the exception.\n\n(One limitation is that the exception no longer has stack\ninformation. If you enable propbox.DEBUG=1 then you'll see the stack\ntrace printed to stderr when a resolver raises an unhandled exception.)\n\nIf you want the exception value, without going through the try/except\nmechanism, then ask the future for it::\n\n  \u003e\u003e\u003e future.exception()\n  ValueError('must be a string',)\n\nThis will return None if there was no exception.\n\n\nSetting descriptor exceptions\n-----------------------------\n\nTwo sections earlier I used get_values() and set_values() to get and\nset the column values. If there's an error then get_values() by\ndefault will use None as a placeholder error value, and there's no way\nto use set_values() to specify an error.\n\nThe get_values()/set_values() functions are really just wrappers\naround the underlying futures data. You can access the futures\ndirectly with get_futures() and set_futures(). 
In the following, I'll\nmodify the \"len\" resolver so it gives an error if the SMILES string\ncontains the letter 'O'::\n\n\n  from __future__ import print_function\n  import propbox\n  from propbox import simple_futures\n  \n  class Len(propbox.Resolver):\n      output_names = [\"len\"]\n      def resolve_column(self, name, table):\n          smiles_list = table.get_values(\"smiles\")\n          futures = []\n          for smiles in smiles_list:\n              if \"O\" in smiles:\n                  err = ValueError(\"No 'O's allowed: %r\" % (smiles,))\n                  future = simple_futures.new_future_exception(err)\n              else:\n                  future = simple_futures.new_future(len(smiles))\n              futures.append(future)\n          table.set_futures(\"len\", futures)\n  \n  resolver = Len()\n  \n  table = propbox.make_table_from_columns(\n      resolver, {\"smiles\": [\"C\", \"C#N\", \"c1ccccc1O\"]})\n  \n  print(table.get_values(\"len\"))\n\nThe result from when I run this is::\n\n  [1, 3, None]\n\nbecause None is the placeholder error value. I can change that to\nsomething else. In the following I use zero::\n\n  print(table.get_values(\"len\", 0))\n\nwhich creates the output::\n\n  [1, 3, 0]\n\n\nException chaining (advanced)\n-----------------------------\n\nThis is an advanced topic. In almost all cases you can use the\n'Calculator' class in the next section, which handles exception\nchaining automatically.\n\nSuppose you want a \"doubled\" property, which is twice the \"len\"\nproperty. What if \"len\" has an error? Since \"len\" is supposed to be a\nnumber, it's easy to check if it's the None value, and do something\ndifferent in that case. In the following, the 'Len' class is unchanged\nfrom the previous section. 
What's new is the 'DoubleLen' class, the\npropbox which includes both Len and DoubleLen, and the output, where\nthis time I show the future's exception for each record::\n\n  from __future__ import print_function\n  import propbox\n  from propbox import simple_futures\n  \n  class Len(propbox.Resolver):\n      output_names = [\"len\"]\n      def resolve_column(self, name, table):\n          smiles_list = table.get_values(\"smiles\")\n          futures = []\n          for smiles in smiles_list:\n              if \"O\" in smiles:\n                  err = ValueError(\"No 'O's allowed: %r\" % (smiles,))\n                  future = simple_futures.new_future_exception(err)\n              else:\n                  future = simple_futures.new_future(len(smiles))\n              futures.append(future)\n          table.set_futures(\"len\", futures)\n  \n  # This version does not implement exception chaining\n  class DoubleLen(propbox.Resolver):\n      output_names = [\"doubled\"]\n      def resolve_column(self, name, table):\n          len_list = table.get_values(\"len\")\n          futures = []\n          for len_value in len_list:\n              if len_value is None:\n                  err = Exception(\"No 'len' available\")\n                  future = simple_futures.new_future_exception(err)\n              else:\n                  future = simple_futures.new_future(len_value*2)\n              futures.append(future)\n          table.set_futures(\"doubled\", futures)\n          \n  resolver = propbox.Propbox([Len(), DoubleLen()])\n  \n  table = propbox.make_table_from_columns(\n      resolver, {\"smiles\": [\"C\", \"C#N\", \"c1ccccc1O\"]})\n  \n  print(\"id  exception\")\n  for id, double_future in zip(table.get_values(\"id\"),\n                               table.get_futures(\"doubled\")):\n      print(id, double_future.exception())\n\nThis gives the following output::\n\n  id  exception\n  ID1 None\n  ID2 None\n  ID3 No 'len' available\n  \nIt would be nice to know *why* 
'len' isn't available. Propbox\nimplements exception chaining, which is where one resolver exception\ncan wrap another, all the way back to the actual exception that caused\nthe problem.\n\nTo do that correctly, I'll need to make two changes. The first is to\nwrap the ValueError of the Len class inside of a ResolverError\nexception, so that callers know the descriptor that caused the\noriginal problem. That's one new line of code in the following::\n\n  class Len(propbox.Resolver):\n      output_names = [\"len\"]\n      def resolve_column(self, name, table):\n          smiles_list = table.get_values(\"smiles\")\n          futures = []\n          for smiles in smiles_list:\n              if \"O\" in smiles:\n                  err = ValueError(\"No 'O's allowed: %r\" % (smiles,))\n                  # I added the next line for better exception chaining.\n                  # It will include the name of the descriptor that has the problem.\n                  err = propbox.ResolverError(err, table.table_name, \"len\")\n                  future = simple_futures.new_future_exception(err)\n              else:\n                  future = simple_futures.new_future(len(smiles))\n              futures.append(future)\n          table.set_futures(\"len\", futures)\n\n\nThe second is to change DoubleLen to use get_futures() instead of\nget_values(), and if one of the 'len' futures has an exception, to\nwrap it inside of another ResolverError::\n\n\n  class DoubleLen(propbox.Resolver):\n      output_names = [\"doubled\"]\n      def resolve_column(self, name, table):\n          len_futures = table.get_futures(\"len\")\n          futures = []\n          for len_future in len_futures:\n              prev_exception = len_future.exception()\n              if prev_exception is None:\n                  # There was no error\n                  len_value = len_future.result()\n                  future = simple_futures.new_future(len_value*2)\n              else:\n                  # 
Create the chain.\n                  err = propbox.ResolverError(prev_exception, table.table_name, \"doubled\")\n                  future = simple_futures.new_future_exception(err)\n              futures.append(future)\n          table.set_futures(\"doubled\", futures)\n\nThe resulting code now generates::\n\n  id  exception\n  ID1 None\n  ID2 None\n  ID3 doubled -\u003e len: ValueError(\"No 'O's allowed: 'c1ccccc1O'\",)\n\nThis is more helpful because it says that 'doubled' failed because\n'len' failed because of a ValueError.\n\nUse 'get_original_exception()' if you only care about the actual\nexception that caused the problem, and not the full resolver chain, as\nin the following variation, which only prints those records with an\nerror::\n\n  print(\"id  exception\")\n  for id, double_future in zip(table.get_values(\"id\"),\n                               table.get_futures(\"doubled\")):\n      exception = double_future.exception()\n      if exception is not None:\n          print(id, exception.get_original_exception())\n\n\nwhich produces the following output::\n\n  id  exception\n  ID3 No 'O's allowed: 'c1ccccc1O'\n\n\nCalculator\n----------\n\nThe previous section showed the nitty-gritty of how to handle and\nreport errors during a calculation. For the most part, you don't need\nto do that sort of low-level code. Instead, use the 'Calculator'\nclass. 
Here's an example::\n\n\n  from __future__ import print_function\n  import propbox\n  from propbox import simple_futures\n  \n  class Len(propbox.Calculator):\n      input_names = [\"smiles\"]\n      output_names = [\"len\"]\n      def calculate(self, name, table, input_values, output):\n          for (smiles,) in input_values:\n              if \"O\" in smiles:\n                  output.add_exception(ValueError(\"No 'O's allowed: %r\" % (smiles,)))\n              else:\n                  output.add_result(len(smiles))\n                  \n  class DoubleLen(propbox.Calculator):\n      input_names = [\"len\"]\n      output_names = [\"doubled\"]\n      def calculate(self, name, table, input_values, output):\n          for (len_value,) in input_values:\n            output.add_result(len_value*2)\n          \n  resolver = propbox.Propbox([Len(), DoubleLen()])\n  \n  table = propbox.make_table_from_columns(\n      resolver, {\"smiles\": [\"C\", \"C#N\", \"c1ccccc1O\"]})\n  \n  print(\"id  exception\")\n  for id, doubled_future in zip(table.get_values(\"id\"),\n                               table.get_futures(\"doubled\")):\n      print(id, doubled_future.exception() or doubled_future.result())\n  \n\nIf you compare it to the previous section you'll see it's much\nshorter. If you skipped the previous section, then great! You didn't\nneed to read it to understand this section.\n\n(For that matter, you can skip to 'Decorators for simple functions'\nfor an even easier way to handle this code.)\n\nThe 'input_names' contains a list of columns that will be passed in as\ninput, and the calculator must set values for all of the\n'output_names'. This is a bit more strict than a normal resolver,\nwhich doesn't need to list its input names, and only needs to compute\nthe requested output name.\n\nThe Calculator's own resolve_column() will filter out the inputs which\ncontain an exception, and pass only the actual values to the\n'calculate()' method. 
By \"actual values\" I mean there aren't even\nplaceholders for the error values. If one of the inputs causes an\nexception then the Calculator will automatically set up the resolver\nchain.\n\nThe values are passed to the \"calculate()\" function as a list of\nlists, where the order depends on the order in input_names. For\nexample, in the following::\n\n  class Example(propbox.Calculator):\n      input_names = [\"smiles\", \"doubled\"]\n      output_names = [\"example\"]\n      def calculate(self, name, table, input_values, output):\n          print(\"input_values\", input_values)\n          for (smiles, doubled) in input_values:\n            output.add_result(\"2*len(%r)=%d\" % (smiles, doubled))\n\nthe inputs are 'smiles' and 'doubled', which are passed in as::\n\n  input_values [['C', 2], ['C#N', 6]]\n\nThis is row order, so the first element contains the columns for the\nfirst non-error record, the second for the second non-error record,\nand so on.\n\n\nIt's a bit tricky to remember that a single input name still gets a\nlist of lists, even though each inner list has only a single element. I use \"for\n(len_value,) in\" in the following as an explicit reminder that I am\nunpacking a term from a single-element list::\n\n  class DoubleLen(propbox.Calculator):\n      input_names = [\"len\"]\n      output_names = [\"doubled\"]\n      def calculate(self, name, table, input_values, output):\n          for (len_value,) in input_values:\n            output.add_result(len_value*2)\n\nOtherwise it's very easy to make a mistake and do \"for len_value in input_values\".\n\nThe 'name' and 'table' should look familiar by this time. While you\ncan use the table to get or set columns, you really shouldn't because\nyour code will end up interfering with the Calculator code. It's there\nin case the calculator needs to access any of the table configuration\ninformation, or needs to get/set a cache value on the table.\n\n\nThe 'output' term is special. 
Use it to specify information for the\ncurrent record number, or specify information for all of the remaining\nrecords.\n\nThe 'add_result()' method can be used when output_names contains only\na single name. add_result() sets the corresponding column for the\ncurrent record, then advances to the next record.\n\nUse 'add_results()' when there are more outputs. The function takes\nthe result values as a list or tuple, and sets the corresponding\nfutures. Here's an example of it in use::\n\n  class Scaling(propbox.Calculator):\n      input_names = [\"len\"]\n      output_names = [\"tripled\", \"third\"]\n      def calculate(self, name, table, input_values, output):\n          for (n,) in input_values:\n              output.add_results((n*3 ,n/3.0))\n\nIn this case the \"n*3\" sets the 'tripled' column, and 'n/3.0' sets the\n'third' column.\n\n\nSometimes it's more convenient to set all of the results at once,\nwhich you can do with the 'add_column_results()' method::\n\n  class Scaling(propbox.Calculator):\n      input_names = [\"len\"]\n      output_names = [\"tripled\", \"third\"]\n      def calculate(self, name, table, input_values, output):\n          tripled_list = []\n          third_list = []\n          for (n,) in input_values:\n              tripled_list.append(n*3)\n              third_list.append(n/3.0)\n          output.add_column_results( (tripled_list, third_list) )\n\nHowever, this doesn't appear to be one of those cases where the result\nis simpler.\n\n\nYou saw already how to tell the output to use an exception for a given\nrow, by using the 'add_exception()' method::\n\n  class Len(propbox.Calculator):\n      input_names = [\"smiles\"]\n      output_names = [\"len\"]\n      def calculate(self, name, table, input_values, output):\n          for (smiles,) in input_values:\n              if \"O\" in smiles:\n                  output.add_exception(ValueError(\"No 'O's allowed: %r\" % (smiles,)))\n              else:\n                  
output.add_result(len(smiles))\n\nYou can also specify futures directly, either for a record via 'add_futures'\nor for all of the columns via 'add_column_futures'.\n\n\n\nCalculateName / CalculateNames\n------------------------------\n\nYou may have functions which you want to turn into propbox\ndescriptors. The CalculateName and CalculateNames classes are\nsubclasses of Calculator which know how to call a function to compute a\nproperty or a set of properties, respectively.\n\nFor example, here are three functions that might be useful for\npropbox::\n\n  from rdkit import Chem\n  \n  def smilin(smiles):\n      mol = Chem.MolFromSmiles(smiles)\n      if mol is None:\n          raise ValueError(\"RDKit cannot parse the SMILES %r\" % (smiles,))\n      return mol\n  \n  def num_heavies(mol):\n      return sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() \u003e 1)\n  \n  def heavy_range(mol):\n      \"Return the lightest and heaviest element numbers of the heavy atoms\"\n      atomic_nums = []\n      for atom in mol.GetAtoms():\n          atomic_num = atom.GetAtomicNum()\n          if atomic_num \u003e 1:\n              atomic_nums.append(atomic_num)\n      if not atomic_nums:\n          return (0, 0)\n      return min(atomic_nums), max(atomic_nums)\n  \n\nThe first two only return a single value, so I'll use a CalculateName\ninstance for each of them. The last returns two values, so I'll use a\nCalculateNames for it::\n\n  import propbox\n  \n  resolver = propbox.Propbox([\n      propbox.CalculateName([\"smiles\"], \"mol\", smilin),\n      propbox.CalculateName([\"mol\"], \"nHEAVIES\", num_heavies),\n      propbox.CalculateNames([\"mol\"], [\"LIGHTEST_HEAVY\", \"HEAVIEST_HEAVY\"], heavy_range),\n      ])\n  \nThe first parameter is the input_names list. The second parameter is\nthe output name (for CalculateName) or the list of output names (for\nCalculateNames). 
The third parameter is the function to call.\n\nI'll use that resolver to make a table::\n\n  table = propbox.make_table_from_columns(\n      resolver, {\"smiles\": [\"C\", \"C#N\", \"Q\", \"c1ccccc1O\", \"[U](F)(F)(F)(F)(F)F\"]})\n\nthen use the table to generate CSV output, by default as a tab-separated file::\n\n  import sys\n  table.save(sys.stdout, [\"id\", \"nHEAVIES\", \"LIGHTEST_HEAVY\", \"HEAVIEST_HEAVY\"])\n\nThe output in this case is::\n\n  [14:20:24] SMILES Parse Error: syntax error for input: Q\n  id\tnHEAVIES\tLIGHTEST_HEAVY\tHEAVIEST_HEAVY\n  ID1\t1\t6\t6\n  ID2\t2\t6\t7\n  ID3\t*\t*\t*\n  ID4\t7\t6\t8\n  ID5\t7\t9\t92\n\n\nSuppose though you want this in 'excel' format, which uses commas\ninstead of tabs, and knows how to quote terms that contain a\ncomma. And suppose you wanted to use '???' when the nHEAVIES could not\nbe computed, and 'n/a' for when the element range could not be\ncomputed. In that case, use the following::\n\n  import sys\n  table.save(sys.stdout, [\"id\", \"nHEAVIES\", \"LIGHTEST_HEAVY\", \"HEAVIEST_HEAVY\"],\n             dialect=\"excel\", missing_values=[\"x\", \"???\", \"n/a\", \"n/a\"])\n\nwhich generates::\n\n  id,nHEAVIES,LIGHTEST_HEAVY,HEAVIEST_HEAVY\n  ID1,1,6,6\n  ID2,2,6,7\n  ID3,???,n/a,n/a\n  ID4,7,6,8\n  ID5,7,9,92\n\n\nDecorators for simple functions\n-------------------------------\n\nThe previous section assumed that you couldn't modify the code that\ncomputed the property values. If on the other hand you can modify\nthem, then the 'propbox.calculate' decorator will create a\nCalculateName (if output_names is a string) or CalculateNames (if\noutput_names is a list) for each function, and store it as the\nfunction attribute 'propbox_resolver'.\n\nThe function 'collect_resolvers()' will look for resolvers in the\nmodule's namespace. If an object has a 'propbox_resolver' then that\nwill be used as a resolver. Objects which are instances of\npropbox.Resolver will also be treated as a resolver. 
All of the\nresolvers will be placed into a Propbox.\n\nHere's an example::\n\n\n  from __future__ import print_function\n  \n  from rdkit import Chem\n  import propbox\n  #propbox.DEBUG = True # Uncomment for a bit better debugging\n  import sys\n  \n  \n  @propbox.calculate(output_names=\"mol\")\n  def smilin(smiles):\n      mol = Chem.MolFromSmiles(smiles)\n      if mol is None:\n          raise ValueError(\"RDKit cannot parse the SMILES %r\" % (smiles,))\n      return mol\n  \n  @propbox.calculate(output_names=\"nHEAVIES\")\n  def num_heavies(mol):\n      return sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() \u003e 1)\n  \n  @propbox.calculate(output_names=[\"LIGHTEST_HEAVY\", \"HEAVIEST_HEAVY\"])\n  def heavy_range(mol):\n      \"Return the lightest and heaviest element numbers of the heavy atoms\"\n      atomic_nums = []\n      for atom in mol.GetAtoms():\n          atomic_num = atom.GetAtomicNum()\n          if atomic_num \u003e 1:\n              atomic_nums.append(atomic_num)\n      if not atomic_nums:\n          return (0, 0)\n      return min(atomic_nums), max(atomic_nums)\n  \n  \n  resolver = propbox.collect_resolvers()\n  \n  table = propbox.make_table_from_columns(\n      resolver, {\"smiles\": [\"C\", \"C#N\", \"Q\", \"c1ccccc1O\", \"[U](F)(F)(F)(F)(F)F\"]})\n  \n  table.save(sys.stdout, [\"id\", \"nHEAVIES\", \"LIGHTEST_HEAVY\", \"HEAVIEST_HEAVY\"],\n             dialect=\"excel\", missing_values=[\"x\", \"???\", \"n/a\", \"n/a\"])\n\nNot surprisingly, this gives the same output as before::\n\n  ID1,1,6,6\n  ID2,2,6,7\n  ID3,???,n/a,n/a\n  ID4,7,6,8\n  ID5,7,9,92\n\nBut wait! Why didn't I need to configure the list of input_names?\n\nI could have. 
I could have said::\n\n  @propbox.calculate(input_names=[\"smiles\"], output_names=\"mol\")\n  def smilin(s):\n      mol = Chem.MolFromSmiles(s)\n      if mol is None:\n          raise ValueError(\"RDKit cannot parse the SMILES %r\" % (s,))\n      return mol\n\nIf input_names isn't given then the decorator assumes that the function\narguments are the expected property names. In the original function,\nthe function took a 'smiles' parameter, which happened to be the same\nname as the property, so I let it be.\n\nIn this modified version, I changed the function to take an 's'\ninstead of a 'smiles', so I needed to specify the input_names to get\nthe input values from the 'smiles' column instead of the 's' column.\n\nThere's another shortcut. If the function computes a single\ndescriptor, and the function name starts with 'calc_', then the rest\nof the function name will be used as the descriptor.\n\nThat is, the first two functions could be rewritten as::\n\n  @propbox.calculate()\n  def calc_mol(smiles):\n      mol = Chem.MolFromSmiles(smiles)\n      if mol is None:\n          raise ValueError(\"RDKit cannot parse the SMILES %r\" % (smiles,))\n      return mol\n  \n  @propbox.calculate()\n  def calc_nHEAVIES(mol):\n      return sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() \u003e 1)\n\n\nI'll use this technique to rewrite the 'len' and 'doubled' descriptors\nfrom an earlier section::\n\n  from __future__ import print_function\n  \n  import propbox\n  import sys\n  \n  @propbox.calculate()\n  def calc_len(smiles):\n      if \"O\" in smiles:\n          raise ValueError(\"No 'O's allowed: %r\" % (smiles,))\n      return len(smiles)\n  \n  @propbox.calculate()\n  def calc_doubled(len):\n      return len*2\n  \n  \n  resolver = propbox.collect_resolvers()\n  \n  table = propbox.make_table_from_columns(\n      resolver, {\"smiles\": [\"C\", \"C#N\", \"Q\", \"c1ccccc1O\", \"[U](F)(F)(F)(F)(F)F\"]})\n  \n  table.save(sys.stdout, [\"id\", \"doubled\", \"smiles\"])\n  
\nThe output from this is::\n\n  id\tdoubled\tsmiles\n  ID1\t2\tC\n  ID2\t6\tC#N\n  ID3\t2\tQ\n  ID4\t*\tc1ccccc1O\n  ID5\t38\t[U](F)(F)(F)(F)(F)F\n\nIf I want to see the exception message I can traverse the rows myself::\n\n  for id, doubled in table.get_future_rows([\"id\", \"doubled\"]):\n      print(id.result(), doubled.exception() or doubled.result())\n\nThis prints::\n\n  ID1 2\n  ID2 6\n  ID3 2\n  ID4 doubled -\u003e len: ValueError(\"No 'O's allowed: 'c1ccccc1O'\",)\n  ID5 38\n\n\nProperty Aliases\n----------------\n\nYou'll sometimes need multiple names for the same descriptor.\n\nFor example, you might have a resolver which expects \"MW\" for the\nmolecular weight, but you follow the RDKit convention and use\n\"MolWt\". Propbox comes with an \"Aliases\" resolver, which will forward\na request to the right value.\n\nHere's an example, which says that a molecule is \"large\" if it has a\nmolecular weight of more than 75.0 (yes, this is made up)::\n\n  from __future__ import print_function\n  \n  import propbox\n  import sys\n  \n  from rdkit import Chem\n  from rdkit.Chem import Descriptors\n  \n  \n  @propbox.calculate()\n  def calc_mol(smiles):\n    mol = Chem.MolFromSmiles(smiles)\n    if mol is None:\n        raise ValueError(\"RDKit cannot parse the SMILES %r\" % (smiles,))\n    return mol\n  \n  @propbox.calculate()\n  def calc_MolWt(mol):\n      return Descriptors.MolWt(mol)\n  \n\n  # This wants a \"MW\" property, but the molecular weight is available\n  # as the \"MolWt\" property.\n  @propbox.calculate()\n  def calc_is_large(MW):\n      return MW \u003e 75.0\n\n  # Set up an alias from 'MW' to 'MolWt'\n  aliases = propbox.Aliases({\"MW\": \"MolWt\"})\n      \n  resolver = propbox.collect_resolvers()\n  \n  table = propbox.make_table_from_columns(\n      resolver, {\"smiles\": [\"C\", \"C#N\", \"Q\", \"c1ccccc1O\", \"[U](F)(F)(F)(F)(F)F\"]})\n  \n  table.save(sys.stdout, [\"is_large\", \"smiles\"])\n\nThe output table from the above is::\n\n  
is_large\tsmiles\n  False\tC\n  False\tC#N\n  *\tQ\n  True\tc1ccccc1O\n  True\t[U](F)(F)(F)(F)(F)F\n\n\nThe alias resolver also takes part in exception chaining. Printing the\nexception for the 'Q' entry gives::\n\n  MW -\u003e MolWt -\u003e mol: ValueError(\"RDKit cannot parse the SMILES 'Q'\",)\n\n\nThis MW/MolWt example is a bit contrived. It's best that you make\neverything use a consistent naming scheme.\n\nYou're more likely to use aliases as a predictive model evolves. You\nmight start off with a blood-brain barrier model called \"BBB\". After a\nyear, you retrain. Now you have BBB_v1 and BBB_v2. That's fine - you\ncan implement both models.\n\nWhat aliases do is let you define that \"BBB\" means \"the most recent\nvalidated BBB model\", and have it point to BBB_v1 while BBB_v2 is\nunder evaluation. Once it's validated, change the alias to point to\nBBB_v2. (You'll likely need BBB_v1 for a while, since other models may\nhave been validated against it, and now need to be revalidated on\nBBB_v2.)\n\n\nProperty Modules\n----------------\n\nOver time you'll likely run into naming conflicts, where one person\nuses $NAME to mean concept X while another person uses $NAME to mean\nrelated concept Y. Or to make things more fun, you might have a\nresolver based on OEChem, and another based on RDKit, and both provide\noverlapping functionality.\n\nPropbox implements a module system. The main resolver is in the main\nmodule, and gets/sets the columns for the main table.\n\nWhat a propbox.Module does is create a subtable. The resolver for the\nmodule can get/set values in the subtable. The table and a subtable\nare independent, so there is no conflict between the names.\n\nThe exceptions are the two sets of aliases. The module has a set of\noutput aliases which says that property X for the main table should be\nresolved as property Y in the subtable. 
It also has input aliases,\nwhich say that property B for the subtable should be resolved as\nproperty A in the main table.\n\nFor example, the following file (named 'oceania.py') uses OEChem to\ncompute the OEGraphMols as 'mol', and the molecular weight as 'MW'. It\nrequires a SMILES string as the 'smiles' property::\n\n  # This is 'oceania.py', based on OEChem\n  from openeye.oechem import *\n  \n  from propbox import calculate, collect_resolvers\n  \n  @calculate()\n  def calc_mol(smiles):\n      mol = OEGraphMol()\n      if OEParseSmiles(mol, smiles):\n          return mol\n      raise ValueError(\"OEChem cannot parse %r\" % (smiles,))\n  \n  @calculate()\n  def calc_MW(mol):\n      return OECalculateMolecularWeight(mol)\n  \n  \n  resolver = collect_resolvers()\n\n\nWhile the following file (named 'eurasia.py') uses RDKit to compute\nroughly equivalent properties::\n\n  # This is 'eurasia.py', based on RDKit\n  from rdkit import Chem\n  from rdkit.Chem import Descriptors\n  \n  from propbox import calculate, collect_resolvers\n  \n  @calculate()\n  def calc_mol(smiles):\n      mol = Chem.MolFromSmiles(smiles)\n      if mol is None:\n          raise ValueError(\"RDKit cannot parse %r\" % (smiles,))\n      return mol\n  \n  @calculate()\n  def calc_MW(mol):\n      return Descriptors.MolWt(mol)\n  \n  \n  resolver = collect_resolvers()\n\nIn \"smith.py\", I'll try to combine both resolvers into the same Propbox::\n\n\n  import propbox\n  \n  import eurasia, oceania\n  \n  resolver = propbox.Propbox([eurasia.resolver, oceania.resolver])\n\nThis doesn't work. 
Propbox complains, saying::\n\n                   \n  Traceback (most recent call last):\n    File \"smith.py\", line 5, in \u003cmodule\u003e\n      resolver = propbox.Propbox([eurasia.resolver, oceania.resolver])\n    File \"/Users/dalke/cvses/propbox/propbox/__init__.py\", line 134, in __init__\n      self.add_resolver(resolver)\n    File \"/Users/dalke/cvses/propbox/propbox/__init__.py\", line 163, in add_resolver\n      self._name_to_resolver[output_name]))\n  ValueError: Resolver \u003cpropbox.Propbox object at 0x108d5f810\u003e defines the output name 'mol', which was already defined by resolver \u003cpropbox.Propbox object at 0x108d4b910\u003e\n\n\nTo resolve the conflict, I'll place the oceania resolver in its own\npropbox.Module. I'll also say that \"OE_MW\" in the main table is an\nalias for \"MW\" in the subtable, and that \"smiles\" in the subtable is\nan alias for \"smiles\" in the main table::\n\n  import sys\n  import propbox\n  \n  import eurasia, oceania\n  \n  resolver = propbox.Propbox([\n      eurasia.resolver, \n      propbox.Module(\"oceania\", oceania.resolver,\n                     {\"smiles\": \"smiles\"},\n                     {\"OE_MW\": \"MW\"}),\n                     ])\n  \n  table = propbox.make_table_from_columns(\n      resolver, {\"smiles\": [\"CC\", \"CCO\", \"O=O\"]})\n  \n  table.save(sys.stdout, [\"smiles\", \"MW\", \"OE_MW\"])\n\n\n\nAs a result I can now compare the two molecular weights::\n\n  smiles\tMW\tOE_MW\n  CC\t30.07\t30.06904\n  CCO\t46.069\t46.06844\n  O=O\t31.998\t31.9988\n  \n\n\n\nTable configuration\n-------------------\n\nThe table supports a \"config\" dictionary, which can be used to pass\nconfiguration around. 
It's still experimental, and I don't really want\nto document it.\n\nIt exists so you can define configuration information like the object\nto use to de-salt, or if you don't want to specify the object, the\nconfiguration file or the configuration data to use.\n\nI can't help but wonder if it would be better to do configuration\nthrough the resolvers, when I create the resolvers, rather than\nthrough the table's \"config\".\n\nAn additional question is, how do I configure modules? I'm\nexperimenting with namespaces, so \"oceania.SaltRemove_filename\" would\nbe the salt remover for the oceania module. It's still up in the air.\n\n\nTable cache\n-----------\n\nThis is another experimental feature. The get_cache_value() and\nset_cache_value() are used to get/set the subtable information. It's\nalso used for more long-term storage by the resolver.\n\nFor example, if the config defines a SaltRemover filename, then the\nresolver which actually needs to remove the salts must create a\nSaltRemover, configured to use that filename. What then?\n\nObviously I don't want to recreate the SaltRemover for each\nrecord. Instead, my options are to 1) store it in the table (in which\ncase it's reloaded each time I process a batch), 2) store the\ninformation in some sort of local cache for each resolver. But when is\nthe cache reset? Does everything have a unique cache key, or 3)\nsomething else.\n\nThe get/set cache value API is used for #1. I don't think I like it\nthough.\n\n\n\nCredits\n=======\n\nAndrew Dalke, Dalke Scientific, dalke@dalkescientific.com\n9 June 2015, Trollhättan, Sweden\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funixjunkie%2Fpropbox","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Funixjunkie%2Fpropbox","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funixjunkie%2Fpropbox/lists"}