{"id":21864751,"url":"https://github.com/redesignscience/easytrajh5","last_synced_at":"2025-03-21T21:11:28.547Z","repository":{"id":206902587,"uuid":"714897862","full_name":"RedesignScience/easytrajh5","owner":"RedesignScience","description":null,"archived":false,"fork":false,"pushed_at":"2024-08-15T01:34:03.000Z","size":566,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-01-26T15:33:39.515Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RedesignScience.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-06T04:16:47.000Z","updated_at":"2024-08-15T01:32:51.000Z","dependencies_parsed_at":"2023-11-24T05:23:27.341Z","dependency_job_id":"bb524190-84f0-4cee-bce6-9cc40b83fb64","html_url":"https://github.com/RedesignScience/easytrajh5","commit_stats":null,"previous_names":["redesignscience/easytrajh5"],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RedesignScience%2Feasytrajh5","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RedesignScience%2Feasytrajh5/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RedesignScience%2Feasytrajh5/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RedesignScience%2Feasytrajh5/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RedesignScie
nce","download_url":"https://codeload.github.com/RedesignScience/easytrajh5/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244868762,"owners_count":20523590,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-28T04:11:52.680Z","updated_at":"2025-03-21T21:11:28.504Z","avatar_url":"https://github.com/RedesignScience.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# EasyTrajH5\n\nTrajectory management for mdtraj H5 files with an atom selection language\nand efficient data operations via the h5py library.\n\n## Installation\n\n    pip install easytrajh5\n\n## Quick Guide\n\nOur main file object `EasyTrajH5File` is a drop-in replacement\nfor `mdtraj.formats.HDF5TrajectoryFile`:\n\n```python\nfrom easytrajh5.traj import EasyTrajH5File\n \nh5 = EasyTrajH5File('traj.h5')\ntraj = h5.read_as_traj()\n```\nThis loads the data progressively in chunks, allowing online streaming \nin advanced usage. 
\n\nLoad individual frames:\n\n```python\nlast_frame_traj = h5.read_frame_as_traj(-1)\n```\n\nAs we use the `h5py` library, we can use efficient\nfancy indexing to load just certain atoms:\n\n```python\natom_indices = [100, 115, 116]\nthree_atom_traj = h5.read_as_traj(atom_indices=atom_indices)\n```\n\nWe provide atom selection using a new selection language (described in detail below).\nThis is particularly efficient as it only loads the atoms you want, without \nrequiring the entire trajectory to be loaded into memory:\n\n```python\nfrom easytrajh5.traj import EasyTrajH5File\n \nmask = \"intersect {mdtraj name CA} {protein}\"\nca_trace_traj = EasyTrajH5File('traj.h5', atom_mask=mask).read_as_traj()\n```\n\nThere is also a drop-in replacement for `mdtraj.reporters.HDF5Reporter` in openmm\nthat uses `EasyTrajH5File`: \n\n```python\nfrom easytrajh5.traj import EasyTrajH5Reporter\n```\n\n\n## Atom Selection Language\n\nWhy another atom selection language (when we already have AMBER and MDTRAJ)?\nTwo main reasons. \n\nFirst, we wanted user-defined \nresidue selections. These are stored in `easytrajh5/data/select.yaml`. \nEdit this file to create any new residue selections.\n\nSecond, we wanted to fix residue selection. The problem\nis that AMBER uses residue numbering (`:3,5,10-12`) defined in the PDB file \nand not 0-based residue indexing. This means that in PDB files with multiple\nchains, the residue number is not unique. MDTRAJ, on the other hand, uses \n0-based indexing, but only allows you to use ranges (`resi 10 to 15`). \n\nWe've combined these ideas to provide our new flexible 0-based residue indexing \n`resi 3,5,10-12,100-150,300`.\n\nWe also allow you to easily drop into AMBER and MDTRAJ syntax simply by \nusing the `amber` and `mdtraj` keywords. 
When combined with set \noperations, everything is now at your disposal.\n\nSome useful masks:\n\n- no solvent: `not {solvent}`\n- just the protein: `protein`\n- ligand and specific residues: `ligand resi 5,1,22-200`\n- heavy protein atoms: `diff {protein} {amber @/H}`\n- no hydrogens: `not {amber @/H}`\n- ligand and 6 closest residues: `pocket ligand`\n- specified ligand with 10 closest neighbours: `resname UNL near UNL 10`\n\n#### User-defined and operator keywords\n\nIf more than one keyword is specified, it is assumed they are joined with an \"or\"\noperation (e.g. `ligand protein` will return both ligand and protein atom indices).\n\nThe default keywords are:\n- `ligand`, `protein`, `water`, `lipid`, `salt`, `solvent`, `nucleic`\n- as defined in `easytrajh5/data/select.yaml`\n- `ligand` will find the residues `LIG`, `UNL`, `UNK`\n\nSpecial operator keywords:\n\n- `pocket` will find the closest 6 residues to the `ligand` group.\n- `near` will require a following resname, with an optional integer, e.g.:\n    `near ATP`\n    `near ATP 5`\n- `resname` identifies a single residue type\n    `resname LEU`\n- `resi` for 0-indexed residue selections\n    `resi 0,10-13` - selects atoms in the first and 11th to 14th residues\n- `atom` for 0-indexed atom selections\n    `atom 0,55,43,101-105` - selects the first, 56th, 44th, and 102nd to 106th atoms\n\n#### AMBER-style atom selection\n\n- https://parmed.github.io/ParmEd/html/amber.html#amber-mask-syntax\n- `amber :ALA,LYS` - selects all alanine and lysine residues\n\n#### MDTraj-style atom selection \n- https://mdtraj.org/1.9.4/atom_selection.html\n- `mdtraj protein or water` - selects protein and water\n\n#### Set operations\n\nSelections can be combined with set operators: `not`, `intersect`, `merge`, `diff`:\n\n- `intersect {not {amber :ALA}} {protein}`\n- `diff {protein} {not {amber :ALA}}`\n- `not {resname LEU}`\n- `merge {near BSM 8} {amber :ALA}`\n\n#### Use in python\n\nIn your python code, there is a 
`select_mask` function that operates on `parmed.Structure`\nobjects:\n\n```python\nfrom easytrajh5.traj import EasyTrajH5File\nfrom easytrajh5.select import select_mask\nfrom easytrajh5.struct import slice_parmed\n\npmd = EasyTrajH5File(\"traj.h5\").get_topology_parmed()\ni_atoms = select_mask(pmd, \"not {solvent}\")\nsliced_pmd = slice_parmed(pmd, i_atoms)\n```\n\n`easytrajh5.struct` provides some common conversions and loaders for `parmed.Structure` and\n`mdtraj.Trajectory` objects:\n\n```python\nimport parmed, mdtraj\n\n# Signatures only; function bodies are omitted here.\ndef dump_parmed(pmd: parmed.Structure, fname: str): ...\ndef load_parmed(fname: str) -\u003e parmed.Structure: ...\ndef get_parmed_from_pdb(pdb: str) -\u003e parmed.Structure: ...\ndef get_parmed_from_parmed_or_pdb(pdb_or_parmed: str) -\u003e parmed.Structure: ...\ndef get_parmed_from_mdtraj(traj: mdtraj.Trajectory, i_frame=0) -\u003e parmed.Structure: ...\ndef get_parmed_from_openmm(openmm_topology, openmm_positions=None) -\u003e parmed.Structure: ...\ndef get_mdtraj_from_parmed(pmd: parmed.Structure) -\u003e mdtraj.Trajectory: ...\ndef get_mdtraj_from_openmm(openmm_topology, openmm_positions) -\u003e mdtraj.Trajectory: ...\n```\n\n## Use as H5\n\nThere are convenience functions to insert different types\nof data. 
\n\nTo save/load strings:\n\n```python\nh5.set_str_dataset('my_string', 'a string')\nh5.flush()\nnew_str = h5.get_str_dataset('my_string')\n```\n\nTo save/load json:\n\n```python\nh5.set_json_dataset('my_obj', {\"a\": \"b\"})\nh5.flush()\nnew_obj = h5.get_json_dataset('my_obj')\n```\n\nTo insert/extract binary files:\n\n```python\nh5.insert_file_to_dataset('blob', 'blob.bin')\nh5.flush()\nh5.extract_file_from_dataset('blob', 'new_blob.bin')\n```\n\nWe can get information about the h5 file:\n\n```python\nschema_json = h5.get_schema()\ndataset_keys = h5.get_dataset_keys()\nattr_keys = h5.get_attr_keys()\n```\n\nWe can extract data:\n\n```python\ndataset = h5.get_dataset(\"coordinates\")\nvalue_list = dataset[:]\nlast_value = dataset[-1]\n\n# if the attrs are set\nvalue = h5.get_attr('user')\n```\n\nThere is a convenience function to append values to an `h5` file without\nworrying about file or dataset creation:\n\n```python\nfrom easytrajh5.h5 import dump_value_to_h5, EasyH5File\n\ndump_value_to_h5('new.h5', [1,2], 'my_data_set')\ndump_value_to_h5('new.h5', [3,4], 'my_data_set')\ndump_value_to_h5('new.h5', [5,6], 'my_data_set')\n\nreturn_values = EasyH5File('new.h5').get_dataset(\"my_data_set\")[:]\n# [[1,2], [3,4], [5,6]]\n```\n\n## Command-line utility `easyh5`\n\n`easyh5` provides several useful CLI subcommands to interrogate `h5` and related files:\n\n```bash\nUsage: easyh5 [OPTIONS] COMMAND [ARGS]...\n\n  h5: preprocessing and analysis tools\n\nOptions:\n  --help  Show this message and exit.\n\nCommands:\n  dataset        Examine contents of h5\n  insert-parmed  Insert parmed into dataset:parmed of an H5\n  mask           Explore residues/atoms of H5/PDB/PARMED using mask\n  merge          Merge a list of H5 files\n  parmed         Extract parmed from dataset:parmed of an H5 with...\n  pdb            Extract PDB of a frame of an H5\n  schema         Examine layout of H5\n  show-chimera   Use CHIMERA to show H5/PDB/PARMED with mask, needs PARMED\n  show-pymol     Use PYMOL to 
show H5/PDB/PARMED with mask\n  show-vmd       Use VMD to show H5/PDB/PARMED with mask\n```\n\nTo get a schema of the dataset layout and attributes:\n\n```bash\n\u003e easyh5 schema traj.h5\n# {\n# │   'datasets': [\n# ....\n# │   │   {\n# │   │   │   'key': 'coordinates',\n# │   │   │   'shape': [200, 3340, 3],\n# │   │   │   'chunks': [3, 3340, 3],\n# │   │   │   'is_extensible': True,\n# │   │   │   'frame_shape': [3340, 3],\n# │   │   │   'n_frame': 200,\n# │   │   │   'dtype': 'float32',\n# │   │   │   'attr': {'CLASS': 'EARRAY', 'EXTDIM': 0, 'TITLE': None, 'VERSION': '1.1', 'units': 'nanometers'}\n# │   │   },\n#\n# ...\n#\n# │   │   {\n# │   │   │   'key': 'topology',\n# │   │   │   'shape': [1],\n# │   │   │   'dtype': 'string(217329)',\n# │   │   │   'attr': {'CLASS': 'ARRAY', 'FLAVOR': 'python', 'TITLE': None, 'VERSION': '2.4'}\n# │   │   }\n# │   ],\n# │   'attr': {\n# │   │   'CLASS': 'GROUP',\n# │   │   'FILTERS': 65793,\n# │   │   'PYTABLES_FORMAT_VERSION': '2.1',\n# │   │   'TITLE': None,\n# │   │   'VERSION': '1.0',\n# │   │   'application': 'MDTraj',\n# │   │   'conventionVersion': '1.1',\n# │   │   'conventions': 'Pande',\n# │   │   'program': 'MDTraj',\n# │   │   'programVersion': '1.9.7',\n# │   │   'title': 'title'\n# │   }\n# }\n```\n\nOr as a quick summary table:\n\n```bash\n\u003e easyh5 dataset examples/trajectory.h5 \n# Warning: importing 'simtk.openmm' is deprecated.  
Import 'openmm' instead.\n# \n#                   sims/high_bf/trajectory.h5                  \n#                                                               \n#   dataset           shape              dtype       size (MB)  \n#  ──────────────────────────────────────────────────────────── \n#   cell_angles       (1500, 3)          float32       0.02 MB  \n#   cell_lengths      (1500, 3)          float32       0.02 MB  \n#   coordinates       (1500, 25767, 3)   float32     442.32 MB  \n#   kineticEnergy     (1500,)            float32         \u003c1 KB  \n#   potentialEnergy   (1500,)            float32         \u003c1 KB  \n#   temperature       (1500,)            float32         \u003c1 KB  \n#   time              (1500,)            float32         \u003c1 KB  \n#   topology          (1,)               |S2083249     1.99 MB  \n#                                                               \n#   total                                            444.36 MB  \n```\n\nTo get an overview of a dataset:\n\n```bash\n\u003e easyh5 dataset examples/trajectory.h5 coordinates\n# Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.\n# \n#   examples/trajectory\n#      dataset=coordinates\n#      shape=(1500, 25767, 3)\n# \n# [[[1.291678   7.558739   1.5199517 ]\n#   [1.368739   7.5888386  1.4620152 ]\n#   [1.2175218  7.6268845  1.5275735 ]\n#   ...\n#   [2.375777   0.09478953 4.0356894 ]\n#   [3.107005   3.3255231  2.8464174 ]\n#   [3.0329072  3.9307644  1.3600407 ]]\n# \n#  ...\n# \n#  [[2.9693408  7.1466036  1.4656581 ]\n#   [2.9327238  7.198606   1.3871984 ]\n#   [3.0665123  7.171176   1.4781022 ]\n#   ...\n#   [4.944392   0.56028575 4.301907  ]\n#   [2.6180382  0.3969128  1.4842175 ]\n#   [3.281546   4.9666233  2.4855924 ]]]\n```\n\nOr to focus on selected frames, use a numbered list:\n\n```bash\n\u003e easyh5 dataset examples/trajectory.h5 coordinates 1,3,4-10\n# Warning: importing 'simtk.openmm' is deprecated.  
Import 'openmm' instead.\n# \n#   sims/high_bf/trajectory.h5\n#      dataset=coordinates\n#      shape=(1500, 25767, 3)\n# \n# frames(1,3,4-10)=\n# [[[1.2958181  7.5481067  1.5513833 ]\n#   [1.2766361  7.4586782  1.5085387 ]\n#   [1.3654946  7.58469    1.4880756 ]\n#   ...\n#   [2.0149727  0.20826703 3.712016  ]\n#   [3.3603299  3.6615734  2.6487541 ]\n#   [3.1595583  4.0199933  1.509442  ]]\n# \n#  ...\n# \n#  [[1.2550778  7.4836254  1.5989571 ]\n#   [1.278228   7.403505   1.5419852 ]\n#   [1.2919694  7.571082   1.5644412 ]\n#   ...\n#   [2.6036768  5.9193387  3.7148886 ]\n#   [4.2752028  3.8813443  2.6205144 ]\n#   [2.343824   3.9689744  0.05281828]]]\n# \n```\n\nTo check atom selections of the protein:\n\n```bash\n\u003e easyh5 mask sims/high_bf/trajectory.h5 \"amber :PRO\" --res\n# Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.\n# EasyTrajH5File: fname='sims/high_bf/trajectory.h5' mode='a' atom_mask='' is_dry_cache=False\n# open connection:started...\n# open connection:finished in \u003c1ms\n# loading topology:started...\n# loading topology:finished in 134ms\n# select_mask \"amber :PRO\" -\u003e 112 atoms, 8 residues\n# \u003cResidue PRO[7]; chain=1\u003e\n# \u003cResidue PRO[12]; chain=1\u003e\n# \u003cResidue PRO[42]; chain=1\u003e\n# \u003cResidue PRO[48]; chain=1\u003e\n# \u003cResidue PRO[50]; chain=1\u003e\n# \u003cResidue PRO[87]; chain=1\u003e\n# \u003cResidue PRO[129]; chain=1\u003e\n# \u003cResidue PRO[167]; chain=1\u003e\n```\n\nTo extract that as PDB:\n\n```bash\n\u003e easyh5 mask sims/high_bf/trajectory.h5 \"amber :PRO\" --pdb pro.pdb\n```\n\nThere are three sub-commands that help visualize selections in standard viewers:\n\n- `easyh5 show-pymol \u003cPDB\u003e \u003cMASK1\u003e \u003cMASK2\u003e`\n- `easyh5 show-vmd \u003cPDB|PARMED|H5\u003e \u003cMASK1\u003e \u003cMASK2\u003e`\n- `easyh5 show-chimera \u003cPDB|PARMED|H5\u003e \u003cMASK1\u003e \u003cMASK2\u003e`\n\nIt will open the structure or trajectory in the 
corresponding viewer with the first selection\ncolored in green, and the second selection in pink.\n\nA configuration file, `rseed.binary.yaml`, will be created in your system's config directory,\nlisting the full path of PYMOL/VMD/CHIMERA. Change this if your copy of the viewer \nis in a different location.\n\n\n## Miscellaneous utility \n\nIn `easytrajh5.quantity` we have some useful transforms to handle those\npesky unit objects from openmm. These transforms are used in our yaml and\njson convenience functions.\n\n```python\nfrom easytrajh5 import quantity\nfrom parmed import unit\n\nx = 5 * unit.nanosecond\nd = quantity.get_dict_from_quantity(x)\n# {\n#│   'type': 'quantity',\n#│   'value': 5,\n#│   'unit': 'nanosecond',\n#│   'unit_repr': 'Unit({BaseUnit(base_dim=BaseDimension(\"time\"), name=\"nanosecond\", symbol=\"ns\"): 1.0})'\n#}\ny = quantity.get_quantity_from_dict(d)\n# Quantity(value=5, unit=nanosecond)\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fredesignscience%2Feasytrajh5","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fredesignscience%2Feasytrajh5","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fredesignscience%2Feasytrajh5/lists"}