{"id":17255115,"url":"https://github.com/unixjunkie/molenc","last_synced_at":"2025-04-14T05:31:45.249Z","repository":{"id":37401100,"uuid":"148603695","full_name":"UnixJunkie/molenc","owner":"UnixJunkie","description":"MolEnc: a molecular encoder using rdkit and OCaml.","archived":false,"fork":false,"pushed_at":"2025-03-24T16:26:12.000Z","size":9475,"stargazers_count":19,"open_issues_count":18,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-27T19:50:59.374Z","etag":null,"topics":["atom-pairs","chemical-fingerprint","chemoinformatics","counted-unfolded-fingerprint","lbvs","molecular-encoding","ocaml-program","pharmacophore-points","python-script","qsar","rdkit","signature-molecular-descriptor"],"latest_commit_sha":null,"homepage":"","language":"OCaml","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/UnixJunkie.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-13T08:03:25.000Z","updated_at":"2025-03-25T10:14:15.000Z","dependencies_parsed_at":"2023-09-23T06:47:57.249Z","dependency_job_id":"4daf4ec4-4c48-4fee-97c0-6320d5f0315a","html_url":"https://github.com/UnixJunkie/molenc","commit_stats":null,"previous_names":[],"tags_count":141,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UnixJunkie%2Fmolenc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UnixJunkie%2Fmolenc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UnixJunkie%2Fmolenc/releases","manifests_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UnixJunkie%2Fmolenc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/UnixJunkie","download_url":"https://codeload.github.com/UnixJunkie/molenc/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248826623,"owners_count":21167724,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["atom-pairs","chemical-fingerprint","chemoinformatics","counted-unfolded-fingerprint","lbvs","molecular-encoding","ocaml-program","pharmacophore-points","python-script","qsar","rdkit","signature-molecular-descriptor"],"created_at":"2024-10-15T07:10:44.031Z","updated_at":"2025-04-14T05:31:44.599Z","avatar_url":"https://github.com/UnixJunkie.png","language":"OCaml","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Introduction\n\nMolEnc: a molecular encoder using rdkit and OCaml.\n\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3546675.svg)](https://doi.org/10.5281/zenodo.3546675)\n\nThe implemented fingerprint is J-L Faulon's \"Signature Molecular Descriptor\"\n(SMD [1]).\nThis is an unfolded-counted chemical fingerprint.\nSuch fingerprints are less lossy than famous chemical fingerprints like ECFP4.\nSMD encoding doesn't introduce feature collisions upon encoding.\nAlso, a feature dictionary is created at encoding time.\nThis dictionary can be used later on to map a given feature index to an\natom environment.\nMolenc also implements unfolded-counted atom pairs [2].\n\nFor SMD, we recommend using a radius of zero to one 
(molenc.sh -r 0:1 ...) or\nzero to two.\n\nThe current atom typing scheme is:\n(#pi-electrons, element symbol, #HA neighbors, formal charge).\n\nIn the future, we might add pharmacophore feature points [3]\n(Donor, Acceptor, PosIonizable, NegIonizable, Aromatic, Hydrophobe),\nto allow a fuzzier description of molecules.\n\n# How to install the software\n\nFor beginners/non-opam users:\ndownload and execute the latest self-installer\nshell script from the releases page (https://github.com/UnixJunkie/molenc/releases).\n\nThen execute:\n```\n./molenc-5.0.1.sh ~/usr/molenc-5.0.1\n```\n\nThis will create ~/usr/molenc-5.0.1/bin/molenc.sh, among other things\ninside the same directory.\n\nFor opam users:\n```\nopam install molenc\n```\n\nDo not hesitate to contact the author if you have problems installing\nor using the software, or if you have any questions.\n\n# Usage\n\n```\nmolenc.sh -i input.smi -o output.txt\n         [-d encoding.dix]: reuse existing feature dictionary\n         [-r i:j]: fingerprint radius (default=0:1)\n         [--pairs]: use atom pairs instead of Faulon's FP\n         [-m \u003cint\u003e]: maximum allowed atom-pair distance\n                     (default: no limit)\n         [--seq]: sequential mode (disable parallelization)\n         [-v]: debug mode; keep temp files\n         [-n \u003cint\u003e]: max jobs in parallel\n         [-c \u003cint\u003e]: chunk size\n         [--no-std]: don't standardize input file molecules\n                     ONLY USE IF THEY HAVE ALREADY BEEN STANDARDIZED\n```\n\nHow to encode a database of molecules:\n\n```\nmolenc.sh -i molecules.smi -o molecules.txt\n```\n\nHow to encode another database of molecules while reusing the feature\ndictionary created for a previously encoded database:\n\n```\nmolenc.sh -i other_molecules.smi -o other_molecules.txt -d molecules.txt.dix\n```\n\n# Bibliography\n\n[1] Faulon, J. L., Visco, D. P., \u0026 Pophale, R. S. (2003). The signature molecular descriptor. 1. 
Using extended valence sequences in QSAR and QSPR studies. Journal of chemical information and computer sciences, 43(3), 707-720.\n\n[2] Carhart, R. E., Smith, D. H., \u0026 Venkataraghavan, R. (1985). Atom pairs as molecular features in structure-activity studies: definition and applications. Journal of Chemical Information and Computer Sciences, 25(2), 64-73.\n\n[3] Kearsley, S. K., Sallamack, S., Fluder, E. M., Andose, J. D., Mosley, R. T., \u0026 Sheridan, R. P. (1996). Chemical similarity using physiochemical property descriptors. Journal of Chemical Information and Computer Sciences, 36(1), 118-127.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funixjunkie%2Fmolenc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Funixjunkie%2Fmolenc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funixjunkie%2Fmolenc/lists"}