{"id":37670245,"url":"https://github.com/hmcezar/clusttraj","last_synced_at":"2026-01-16T12:02:52.662Z","repository":{"id":63741590,"uuid":"110118148","full_name":"hmcezar/clusttraj","owner":"hmcezar","description":"Python script that receives a molecular dynamics or Monte Carlo trajectory and performs agglomerative clustering to classify similar structures.","archived":false,"fork":false,"pushed_at":"2025-11-01T14:31:13.000Z","size":6666,"stargazers_count":27,"open_issues_count":0,"forks_count":5,"subscribers_count":2,"default_branch":"master","last_synced_at":"2026-01-05T16:27:29.526Z","etag":null,"topics":["clustering","distance-matrix","molecular-dynamics","monte-carlo-trajectory","openbabel","python-script","rmsd","trajectory"],"latest_commit_sha":null,"homepage":"https://hmcezar.github.io/clusttraj/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hmcezar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2017-11-09T13:22:29.000Z","updated_at":"2025-11-01T14:30:41.000Z","dependencies_parsed_at":"2023-09-21T18:41:29.445Z","dependency_job_id":"9b790bbb-5638-4d06-808e-e0898de2ac3d","html_url":"https://github.com/hmcezar/clusttraj","commit_stats":{"total_commits":114,"total_committers":2,"mean_commits":57.0,"dds":0.06140350877192979,"last_synced_commit":"64147b9f68c3364274d01f55895e858606c442e3"},"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"purl":"pkg:github/hmcezar/
clusttraj","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hmcezar%2Fclusttraj","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hmcezar%2Fclusttraj/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hmcezar%2Fclusttraj/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hmcezar%2Fclusttraj/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hmcezar","download_url":"https://codeload.github.com/hmcezar/clusttraj/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hmcezar%2Fclusttraj/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28478417,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T11:59:17.896Z","status":"ssl_error","status_checked_at":"2026-01-16T11:55:55.838Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clustering","distance-matrix","molecular-dynamics","monte-carlo-trajectory","openbabel","python-script","rmsd","trajectory"],"created_at":"2026-01-16T12:02:52.576Z","updated_at":"2026-01-16T12:02:52.648Z","avatar_url":"https://github.com/hmcezar.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# clusttraj - Solvent-Informed Clustering of Trajectories with Python\n\u003cdiv 
align=\"center\"\u003e\n  \u003cimg src=\"imgs/logo.png\" alt=\"clusttraj logo\" width=\"250\"  /\u003e\n\u003c/div\u003e\n\n[![License: GPL v3](https://img.shields.io/badge/License-LGPLv3-blue.svg)](https://www.gnu.org/licenses/lgpl-3.0.html) ![build](https://github.com/hmcezar/clusttraj/actions/workflows/ci.yml/badge.svg) [![docs](https://github.com/hmcezar/clusttraj/actions/workflows/documentation.yml/badge.svg)](https://hmcezar.github.io/clusttraj/) [![codecov](https://codecov.io/gh/hmcezar/clusttraj/graph/badge.svg?token=DYOKR4JZEN)](https://codecov.io/gh/hmcezar/clusttraj) [![PyPI version](https://badge.fury.io/py/clusttraj.svg)](https://badge.fury.io/py/clusttraj) [![DOI](https://img.shields.io/badge/DOI-10.1021%2Facs.jctc.5c00634-blue)](https://doi.org/10.1021/acs.jctc.5c00634)\n\n-----------\nThis Python package receives a molecular dynamics or Monte Carlo trajectory (in .pdb, .xyz or any format supported by OpenBabel), finds the minimum RMSD between the structures with label reordering and optimal alignment, and performs agglomerative clustering (a kind of unsupervised machine learning) to classify similar conformations. 
\n\nThe script calculates the distance (using the minimum RMSD) between each pair of configurations in the trajectory, building an RMSD matrix (stored in the condensed form).\nDifferent strategies can be used to compute distances that correspond to the expected minimum RMSD, such as atom reordering or stepwise alignments.\nNotice that calculating the RMSD matrix might take some time depending on how long your trajectories are and how many atoms there are in each configuration.\nThe RMSD matrix can also be read from a file (with the `-i` option) to avoid recalculating it every time you want to change the linkage method (with `-m`) or the distance cutoff of the clustering.\n\n## Installation\nThe following libraries are used by clusttraj:\n- [argparse](https://docs.python.org/3/library/argparse.html)\n- [NumPy](http://www.numpy.org/)\n- [OpenBabel](http://openbabel.org/)\n- [RMSD](https://github.com/charnley/rmsd)\n- [SciPy](https://www.scipy.org/)\n- [scikit-learn](http://scikit-learn.org/stable/index.html)\n- [matplotlib](https://matplotlib.org/)\n\nWe also have [qmllib](https://github.com/qmlcode/qmllib) as an optional dependency, used by one of the reordering algorithms.\n\nFor `openbabel`, we use the `pip` package `openbabel-wheel`, which provides pre-built `openbabel` packages for Linux and macOS.\nMore details can be seen in the [project's GitHub page](https://github.com/njzjz/openbabel-wheel).\n\nYou can install clusttraj using `pip`:\n```bash\npip install clusttraj\n``` \n\nIf you want to use the `qmllib` reordering algorithm, you can install it with:\n```bash\npip install clusttraj[qml]\n```\n\n## Citation\nIf you use clusttraj in your academic work, please cite:\n\u003e Rafael Bicudo Ribeiro and Henrique Musseli Cezar \u003c/br\u003e\n\u003e \"clusttraj: A Solvent-Informed Clustering Tool for Molecular Modeling\" \u003c/br\u003e\n\u003e Journal of Chemical Theory and Computation, 21, 6759–6768, 2025. 
\u003c/br\u003e\n\u003e https://pubs.acs.org/doi/10.1021/acs.jctc.5c00634\n\n## Usage\nTo see all the options, run the script with the `-h` option:\n```bash\nclusttraj -h\n```\n\nor\n\n```bash\npython -m clusttraj -h\n```\n\nThe mandatory arguments are the path to the file containing the trajectory (in a format that OpenBabel can read with Pybel), and either the maximum RMSD to join two configurations (`-rmsd` option) or the silhouette score option (`-ss`).\n```\nclusttraj trajectory.xyz -rmsd 1.0\n```\nor\n```\nclusttraj trajectory.xyz -ss\n```\n\nAdditional options are available for specifying the input and output files and selecting how the clustering is done.\nThe available methods for the agglomerative clustering are those supported by the `linkage` function of SciPy's hierarchical clustering module.\nA list of the possible methods (selected with `-m`), with a description of each, can be found [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html).\n\nThe default method for the linkage is `average`, since [it was found](https://dx.doi.org/10.1021/ct700119m) to offer a good compromise between the number of clusters and the actual similarity.\nTo learn more about how the clustering is performed using this algorithm, see [UPGMA](https://en.wikipedia.org/wiki/UPGMA).\n\nIf the `-n` option is used, the hydrogens are ignored when performing the Kabsch algorithm to find the superposition and calculating the RMSD.\nThis is useful to avoid classifying structures that differ only by a rotated methyl group as different.\n\nThe `-e` or `--reorder` option tries to reorder the atoms to increase the overlap and reduce the RMSD. \nThe algorithm can be selected with `--reorder-alg`, choosing between qml (default), hungarian, brute or distance. 
\nFor more information about the implementation, see the [RMSD](https://github.com/charnley/rmsd) package.\nThe reorder option can be used together with the `-ns` option, which receives an integer with the number of atoms of the solute.\nWhen the `-ns` option is used, the script will first superpose the configurations considering only the solute atoms and then reorder considering only the solvent atoms (the atoms in the input that come after the first ns atoms).\nFor solute-solvent systems, the use of `-ns` is strongly encouraged.\n\nTo use an already saved RMSD matrix, specify the file containing the RMSD matrix in the condensed form with the `-i` option.\nThe options `-i` and `-od` are mutually exclusive.\n\nThe `-p` flag specifies that PDF plots of some information will be saved.\nIn this case, the filenames will start with the same name used for the clusters output (specified with the `-oc` option).\nWhen the option is used, the following is saved to disk:\n- A plot with the [multidimensional scaling](http://scikit-learn.org/stable/modules/manifold.html#multidimensional-scaling) representation of the RMSD matrix, colored with the clustering information\n- The [dendrogram](https://en.wikipedia.org/wiki/Dendrogram)\n- The cluster classification evolution, which shows how the configurations were classified along the trajectory. 
This might be useful to analyze the quality of your sampling.\n\nIf the `-cc` option is specified (along with a format supported by OpenBabel), the configurations belonging to the same cluster are superposed and printed to a file.\nThe superpositions are done considering the [medoid](https://en.wikipedia.org/wiki/Medoid) of the cluster as reference.\nThe medoid is printed as the first structure in the clustered structure files.\nIf you did not consider the hydrogens while building the RMSD matrix, remember to use the `-n` option even with `-i` in this case, since the superposition is done considering the flag.\n\n## Threading and parallelization\nThe `-np` option specifies the number of processes to be used to calculate the RMSD matrix.\nSince this is the most time-consuming task of the clustering and an embarrassingly parallel problem, it was parallelized using a Python [multiprocessing pool](https://docs.python.org/3/library/multiprocessing.html).\nThe default value for `-np` is 4.\n\nWhen using `-np`, make sure you also set the correct number of threads for `numpy`.\nIf you want to use just the `multiprocessing` parallelization (recommended), use the following bash commands to set the number of `numpy` threads to one:\n```bash\nexport OMP_NUM_THREADS=1\nexport OPENBLAS_NUM_THREADS=1\nexport MKL_NUM_THREADS=1\nexport VECLIB_MAXIMUM_THREADS=1\nexport NUMEXPR_NUM_THREADS=1\n```\n\n## Output\nThe logging is done both to `stdout` and to the file `clusttraj.log`.\nThe number of clusters that were found, as well as the number of members of each cluster, are printed in a table.\nBelow is an example of how this information is printed:\n```\n$ clusttraj trajectory.xyz -rmsd 3.2 -np 4 -p -n -cc xyz\n2024-12-12 17:48:19,268 INFO     [distmat.py:34] \u003cget_distmat\u003e Calculating RMSD matrix using 4 threads\n\n2024-12-12 17:48:23,800 INFO     [distmat.py:38] \u003cget_distmat\u003e Saving condensed RMSD matrix to distmat.npy\n\n2024-12-12 17:48:23,801 
INFO     [classify.py:97] \u003cclassify_structures\u003e Clustering using 'average' method to join the clusters\n\n2024-12-12 17:48:23,803 INFO     [classify.py:105] \u003cclassify_structures\u003e Saving clustering classification to clusters.dat\n\n2024-12-12 17:48:23,804 INFO     [main.py:59] \u003cmain\u003e Writing superposed configurations per cluster to files clusters_confs_*.xyz\n\n2024-12-12 17:48:26,729 INFO     [main.py:102] \u003cmain\u003e A total 100 snapshots were read and 7 cluster(s) was(were) found.\nThe cluster sizes are:\nCluster\tSize\n1\t3\n2\t3\n3\t31\n4\t30\n5\t18\n6\t3\n7\t12\n\n2024-12-12 17:48:26,729 INFO     [main.py:126] \u003cmain\u003e Total wall time: 7.462641 s\n\n```\n\nIn the cluster output file (`-oc` option, default filename `clusters.dat`) the classification for each structure in the trajectory is printed.\nFor example, if the first structure of the trajectory belongs to cluster *7*, the second structure belongs to cluster *4*, the third to cluster *5* and so on, the file `clusters.dat` will start with\n```\n$ head clusters.dat\n7\n4\n5\n3\n4\n7\n6\n7\n4\n3\n```\n\nThe plot of the multidimensional scaling representation (when the `-p` option is used) has each cluster drawn in its own color, as in the following picture:\n![Example MDS](imgs/example_mds.png)\n\nThe dendrogram has a horizontal line indicating the cutoff used for defining the clusters:\n![Example dendrogram](imgs/example_dendrogram.png)\n\nThe evolution of the classification along the trajectory looks like:\n![Example evolution](imgs/example_evo.png)\n\nIf you wish to use the RMSD matrix file for other purposes, bear in mind that the matrix is stored in the condensed form, i.e., only the upper triangle of the matrix (not including the diagonal) is stored, in NumPy's `.npy` format.\nThis means that if you have `N` structures in your trajectory, your file (specified with the `-od` option, default filename `distmat.npy`) will hold `N(N-1)/2` values, with each value 
representing a distance.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhmcezar%2Fclusttraj","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhmcezar%2Fclusttraj","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhmcezar%2Fclusttraj/lists"}