{"id":28183296,"url":"https://github.com/thegreatherrlebert/proteolizard-algorithm","last_synced_at":"2025-05-16T04:15:45.162Z","repository":{"id":53820478,"uuid":"478526483","full_name":"theGreatHerrLebert/proteolizard-algorithm","owner":"theGreatHerrLebert","description":"prototype machine-learning solutions for high-throughput MS with ease","archived":false,"fork":false,"pushed_at":"2023-07-17T00:31:23.000Z","size":1041,"stargazers_count":6,"open_issues_count":9,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-10-31T15:43:58.124Z","etag":null,"topics":["bigdata","deep-learning","ion-mobility-spectrometry","machine-learning","mass-spectrometry","pybind11","tensorflow","timstof"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/theGreatHerrLebert.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-04-06T11:18:31.000Z","updated_at":"2024-03-03T21:07:05.000Z","dependencies_parsed_at":"2023-02-14T07:31:14.594Z","dependency_job_id":null,"html_url":"https://github.com/theGreatHerrLebert/proteolizard-algorithm","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/theGreatHerrLebert%2Fproteolizard-algorithm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/theGreatHerrLebert%2Fproteolizard-algorithm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/theGreatHerrLebert%2Fproteolizard-algorithm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/theGreatHerrLebert%2Fproteolizard-algorithm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/theGreatHerrLebert","download_url":"https://codeload.github.com/theGreatHerrLebert/proteolizard-algorithm/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254464874,"owners_count":22075572,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigdata","deep-learning","ion-mobility-spectrometry","machine-learning","mass-spectrometry","pybind11","tensorflow","timstof"],"created_at":"2025-05-16T04:15:37.998Z","updated_at":"2025-05-16T04:15:45.147Z","avatar_url":"https://github.com/theGreatHerrLebert.png","language":"Python","readme":"# proteolizard-algorithm\n### A collection of algorithms and tooling to process ion-mobility mass-spectrometry raw-data\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"ProteolizardLogo.png\" alt=\"logo\" width=\"350\"/\u003e\n\u003c/p\u003e\n\nThis repository is part of the `proteolizard` project, a free and open-source solution \nfor raw-data access, algorithms and raw-data visualization of mass spectrometry data generated with \nthe Bruker timsTOF device.\n\nWe are a relatively small team of developers and therefore decided to keep things loosely coupled. 
\nThis means that \n\n* **data access**  : [`proteolizard-data`](https://github.com/theGreatHerrLebert/proteolizard-data)\n* **algorithms**   : [`proteolizard-algorithm`](https://github.com/theGreatHerrLebert/proteolizard-algorithm) \n* **visualization**: [`proteolizard-vis`](https://github.com/theGreatHerrLebert/proteolizard-vis) \n\nare made available in separate repositories. \nThis makes it easier for us to develop all pieces independently.\nWe try to keep dependencies as small as possible, which should allow you to swap parts of your \ncustom pipelines for other data-access backends such as \n[`timspy`](https://github.com/MatteoLacki/timspy) or \n[`alphatims`](https://github.com/MannLabs/alphatims).\n\nDevelopment is still ongoing. \nIf you experience unexpected behaviour, bugs or errors, please let us know!\n\n\n## Why proteolizard-algorithm?\n`proteolizard-algorithm` provides you with algorithms and tools that are tailored to deal with the huge \namount of raw-data generated by liquid chromatography coupled to ion-mobility tandem mass-spectrometry (LC-IMS-MS-MS).\nThe additional recording of ion-mobility adds another dimension to experiments while \ndata-sparsity increases as well. \nThis makes a lot of traditional approaches used for LC-MS-MS processing either \ntoo slow or unsuited by design for these datasets. \n\nOur goal is to translate ideas from other data-science disciplines that have to deal with related problems.\nWe especially want to make use of modern hardware such as multicore systems and GPU parallelization. 
\n\n## Navigation\n* [**Build and install proteolizard-algorithm**](#build-and-install-proteolizard-algorithm)\n* [**Locality Sensitive Hashing (LSH)**](#locality-sensitive-hashing-lsh)\n* [**Clustering**](#clustering)\n* [**Supervised (Deep) Learning**](#supervised-deep-learning)\n\n---\n### Build and install proteolizard-algorithm\nWe highly recommend installing all libraries that are part of the `proteolizard` project into a Python [virtual environment](https://docs.python.org/3/tutorial/venv.html) or [conda environment](https://docs.conda.io/en/latest/).\n\nTo use `proteolizard-algorithm`, you will need to install [`proteolizard-data`](https://github.com/theGreatHerrLebert/proteolizard-data) first. After that, build the C++ shared library for Python:\n\n```sh\nshell\u003e git clone https://github.com/theGreatHerrLebert/proteolizard-algorithm\nshell\u003e cd proteolizard-algorithm\n```\n\n```sh\nshell\u003e mkdir build \u0026\u0026 cd build\nshell\u003e cmake ../cpp -DCMAKE_BUILD_TYPE=Release\nshell\u003e make\n```\n\nOr, if you did not install `proteolizard-data` into a global install directory, you also need to set `CMAKE_PREFIX_PATH` to the same installation prefix used for `proteolizard-data`:\n\n```sh\nshell\u003e mkdir build \u0026\u0026 cd build\nshell\u003e cmake ../cpp -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH=path/to/proteolizard-data/install\nshell\u003e make\n```\n\nThen install the built library into a prefix of your choice:\n\n```sh\nshell\u003e cmake --install . --prefix=some/prefix/path\n```\n---\n### Locality Sensitive Hashing (LSH)\nLSH is a stochastic technique to find similar objects, where similarity is estimated using a family of hash functions \nthat are tailored to approximate some similarity measure. One of its main advantages over other algorithms is the fact\nthat similar pairs can be found in linear time, trading the guarantee of finding all similar objects for a high\nprobability of detection.\n\n`proteolizard-algorithm` implements an approximation of the cosine similarity of mass spectra. 
To do so, it allows you to\ngenerate a set of keys for mass spectra in a vectorized representation. These keys can then be used for detection of \nself-collision, reference search or generally anything related to distance matrices like clustering. Key calculation\nis based on [`tensorflow`](https://www.tensorflow.org/) Tensors and can therefore run on the GPU if you have a \n[CUDA](https://developer.nvidia.com/cuda-toolkit)-enabled NVIDIA card and \n[cuDNN](https://developer.nvidia.com/cudnn) is available in your environment. \n\nWe will briefly go over how LSH is performed for timsTOF data.\n\n**TODO**: explain and show workflow plot.\n\nIf you want to learn more about LSH in the context of mass spectrometry, have a look at \nBob et al.[^fn1] or \nWang et al.[^fn2][^fn3]\n\n```python\nimport numpy as np\nimport tensorflow as tf\n\nfrom proteolizarddata.data import PyTimsDataHandle, TimsFrame, MzSpectrum\nfrom proteolizardalgo.hashing import TimsHasher, IsotopeReferenceSearch, ReferencePattern\nfrom proteolizardalgo.utility import create_reference_dict, get_refspec_list, get_ref_pattern_as_spectra\n\n# create a data handle and read a precursor frame\ndh = PyTimsDataHandle('/path/to/data.d')\nframe = dh.get_frame(dh.precursor_frames[250])\n\n# create a set of dense windows indexed by scan and mz-bin\nscan, mz_bin, W = frame.get_dense_windows(window_length=4, resolution=2, min_peaks=5, \n                                          min_intensity=50, overlapping=True)\n\n# create a spectrum hasher\n# a fixed seed makes the generated keys reproducible\nhasher = TimsHasher(trials=256, len_trial=22, seed=42, num_dalton=4, resolution=2)\n\n# for each window, calculate `trials` keys of `len_trial` bits each\nK = hasher.calculate_keys(W)\n\nprint(K)\n```\nThis will give you:\n```python\n\u003ctf.Tensor: shape=(10682, 512), dtype=int32, numpy=\narray([[ 362167, 3700797, 3061941, ..., 1147456, 1968934,   98534],\n       [2538463, 3497250, 2595794, ..., 
2643667, 2048648, 3815282],\n       [2003423, 3821990, 2528830, ..., 1697390, 1763353, 1735530],\n       ...,\n       [2898374, 1166177, 1438584, ..., 2115578,  769518,  448939],\n       [1382299, 3202454, 3824606, ..., 2843920, 1615614, 3689973],\n       [ 877019, 3258715, 4001803, ..., 1603336, 2742681, 2790119]],\n      dtype=int32)\u003e\n```\nwhere shape = (number_windows, number_keys_per_window).\n\n---\n### Clustering\nDUMMY\n\n---\n### Supervised (Deep) Learning\nZohora et al.[^fn4][^fn5]\n\n---\n[^fn1]: Locality-sensitive hashing enables efficient and scalable signal classification in high-throughput mass spectrometry raw data.\nBMC Bioinformatics, 2022. https://doi.org/10.1186/s12859-022-04833-5\n\n[^fn2]: A Fast and Memory-Efficient Spectral Library Search Algorithm Using Locality-Sensitive Hashing. \nProteomics, 2020. https://doi.org/10.1002/pmic.202000002\n\n[^fn3]: msCRUSH: Fast Tandem Mass Spectral Clustering Using Locality Sensitive Hashing.\nJournal of Proteome Research, 2019. https://pubs.acs.org/doi/10.1021/acs.jproteome.8b00448\n\n[^fn4]: DeepIso: A Deep Learning Model for Peptide Feature Detection from LC-MS map.\nScientific Reports, 2019. https://doi.org/10.1038/s41598-019-52954-4\n\n[^fn5]: Deep neural network for detecting arbitrary precision peptide features through attention based segmentation.\nScientific Reports, 2021. https://doi.org/10.1038/s41598-021-97669-7\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthegreatherrlebert%2Fproteolizard-algorithm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthegreatherrlebert%2Fproteolizard-algorithm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthegreatherrlebert%2Fproteolizard-algorithm/lists"}