{"id":17011252,"url":"https://github.com/poke1024/pyalign","last_synced_at":"2025-09-08T18:40:03.578Z","repository":{"id":49754975,"uuid":"377895601","full_name":"poke1024/pyalign","owner":"poke1024","description":"Fast and Versatile Alignments for Python","archived":false,"fork":false,"pushed_at":"2023-06-08T06:45:56.000Z","size":576,"stargazers_count":50,"open_issues_count":8,"forks_count":7,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-09-08T18:40:01.873Z","etag":null,"topics":["alignment","bioinformatics","digital-humanities","gotoh-algorithm","needleman-wunsch-algorithm","smith-waterman-algorithm"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/poke1024.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-06-17T16:29:15.000Z","updated_at":"2025-09-04T01:46:05.000Z","dependencies_parsed_at":"2024-10-27T12:49:29.251Z","dependency_job_id":"8fe39f84-d38a-4382-8ddd-ab9559c9d7c2","html_url":"https://github.com/poke1024/pyalign","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/poke1024/pyalign","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/poke1024%2Fpyalign","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/poke1024%2Fpyalign/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/poke1024%2Fpyalign/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/poke1024%2Fpyalign/manifests
","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/poke1024","download_url":"https://codeload.github.com/poke1024/pyalign/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/poke1024%2Fpyalign/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274229381,"owners_count":25245189,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-08T02:00:09.813Z","response_time":121,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alignment","bioinformatics","digital-humanities","gotoh-algorithm","needleman-wunsch-algorithm","smith-waterman-algorithm"],"created_at":"2024-10-14T06:06:35.677Z","updated_at":"2025-09-08T18:40:03.548Z","avatar_url":"https://github.com/poke1024.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pyalign\n\nFast and simple alignments in Python:\n\n![Example inside Jupyter](https://github.com/poke1024/pyalign/raw/main/docs/jupyter_example.png)\n\nMake sure to take a look at \"other alignment libraries\" below to better understand if this is\nthe right library for you.\n\n\u003chr\u003e\n\nAlignments have been a staple algorithm in bioinformatics for decades now,\nbut most packages implementing tend to be either easy to use and slow, or\nfast but very difficult to use and highly domain specific.\n\npyalign is a small and hopefully rather 
versatile Python package that aims to\nbe fast and easy to use. At its core, it is an optimizer for finding \"optimum\ncorrespondences between sequences\" (Kruskal, 1983) - the main examples of which\nare alignments and dynamic time warping.\n\nGeneral Features:\n\n* easy to install and easy to use\n* robust and efficient implementation of standard algorithms\n* very fast for smaller problem sizes (see below for details)\n* built-in visualization functionality for teaching purposes\n\nIn terms of alignment algorithms:\n\n* computes local, global and semiglobal alignments on pairs of sequences\n* supports different gap costs (commonly used ones as well as custom ones)\n* automatically selects the most suitable algorithm (e.g. Gotoh)\n* no assumptions on matched items, i.e. not limited to characters\n* supports any given similarity or distance function (i.e. can maximize or minimize)\n* can return one as well as *all* optimal alignments and scores\n\nThe implementation should be rather fast due to highly optimized code paths\nfor every special case. While it does *not* support GPUs, here are some facts:\n\n* optimized C++ core employing \u003ca href=\"https://github.com/xtensor-stack/xtensor\"\u003extensor\u003c/a\u003e\n* supports SIMD via batching (i.e. simple SIMD parallelism as first\nsuggested by Alpern et al. 
and more recently by Rudnicki et al.)\n* carefully designed to avoid dynamic memory allocation\n* extensive metaprogramming to provide different optimized code paths for different\nusage patterns - for example, computing \"only single score\" won't write tracebacks,\nwhereas computing \"all alignments\" will track multiple traceback edges\n\n# Installation\n\n## via pip (recommended)\n\npyalign currently provides precompiled packages for Windows (Intel), Linux (Intel)\nand macOS (Intel).\n\n`pip install pyalign`\n\n## on Google Colab\n\nFirst install conda via:\n\n```\n!pip install -q condacolab\nimport condacolab\ncondacolab.install()\n```\n\nThen run:\n\n```\n!git clone https://github.com/poke1024/pyalign \u0026\u0026 cd pyalign \u0026\u0026 conda env create -f environment.yml \u0026\u0026 conda activate pyalign \u0026\u0026 python setup.py install\n```\n\n## locally\n\nInstalling pyalign locally will require a modern C++ compiler. It also requires\nvarious libraries from the [xtensor stack](https://github.com/xtensor-stack)\nwhich are best installed via conda; for the full list of required packages, see\n[environment.yml](environment.yml).\n\nLocal installation via conda:\n\n```\ngit clone https://github.com/poke1024/pyalign\ncd pyalign\nconda env create -f environment.yml\nconda activate pyalign\npython setup.py install\n```\n\n# Example\n\nRunning\n\n```python\nimport pyalign\nalignment = pyalign.global_alignment(\"INDUSTRY\", \"INTEREST\", gap_cost=0, eq=1, ne=-1)\nalignment\n```\n\nwill compute an optimal global alignment between \"INDUSTRY\" and \"INTEREST\".\n\nWe instruct the optimizer to use scores of 1 and -1 (for matching and non-matching letters) and no (i.e. 
0) gap costs.\n\nIn Jupyter, this will give\n\n```\nIN----DUSTRY\n||      ||  \nINTERE--ST--\n```\n\nOf course you can also extract the actual score:\n\n```python\nalignment.score\n```\n\nas\n\n```python\n4.0\n```\n\nIt's also possible to extract the traceback matrix and path and generate\nvisuals (and thus a detailed rationale for the obtained score and solution).\n\nIn contrast to the first example above, which used the simplified high-level\nAPI, we now use the full API, which gives much finer-grained\naccess to different gap costs, solvers and scoring configurations. To make\nthings a bit more interesting, we switch from 0 gap cost to 0.2 (which will\nnot change the result in this case, but shows in the traceback matrix):\n\n```python\nimport pyalign.problems\nimport pyalign.solve\nimport pyalign.gaps\n\npf = pyalign.problems.general(\n    pyalign.problems.Equality(eq=1, ne=-1),\n    direction=\"maximize\")\nsolver = pyalign.solve.GlobalSolver(\n    gap_cost=pyalign.gaps.LinearGapCost(0.2),\n    codomain=pyalign.solve.Solution)\nproblem = pf.new_problem(\"INDUSTRY\", \"INTEREST\")\nsolver.solve(problem)\n```\n\n![traceback and path](https://raw.githubusercontent.com/poke1024/pyalign/main/docs/traceback.svg)\n\nAs a final example, here is how we would modify the `solver` above to get a list of all optimal\nsolutions of a problem:\n\n```python\nfrom typing import List\n\nsolver = pyalign.solve.GlobalSolver(\n    gap_cost=pyalign.gaps.LinearGapCost(0.2),\n    codomain=List[pyalign.solve.Solution])\n```\n\nThis will now return a list of solutions, each with its own traceback, e.g.:\n\n```python\n[\u003cpyalign.solve.Solution at 0x10f0f15b0\u003e,\n \u003cpyalign.solve.Solution at 0x10f0f1580\u003e]\n```\n\nTo learn more about the API, take a look at\n\n[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/poke1024/pyalign-demo/HEAD?filepath=example.ipynb)\n\n# Performance\n\nHere are a few benchmarks. 
The \"pure python\" implementation seen in this\nbenchmark is found at https://github.com/eseraygun/python-alignment.\n\n`+alphabet` means using `pyalign.problems.alphabetic` instead of\nthe simpler `pyalign.problems.general` to construct a problem.\n\n`+SIMD` means feeding groups of equally structured alignment problems into\none `solve` call by using `pyalign.problem.ProblemBatch` - doing this will\ninternally make use of [AVX2](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions)\non Intel and [Neon](https://developer.arm.com/technologies/neon) on ARM processors.\n\nThe following benchmarks were done on an Apple M1 Max. SIMD-128 refers to the M1's 128-bit SIMD.\n\nThe benchmark code can be found under [benchmark.py](demo/py/benchmark.py).\n\n![traceback and path](https://raw.githubusercontent.com/poke1024/pyalign/main/docs/benchmark_10_100.svg)\n\n![traceback and path](https://raw.githubusercontent.com/poke1024/pyalign/main/docs/benchmark_5000_10000.svg)\n\n# Other Alignment Libraries\n\nHere is a short overview of other libraries.\n\n## Nice General Purpose Implementations\n\n* https://github.com/stanfordnlp/string2string (RECOMMENDED!)\n* https://pypi.org/project/textdistance/\n* https://edist.readthedocs.io/en/latest/\n* https://github.com/maxbachmann/RapidFuzz\n\n## For large scale / bioinformatics problems\n\nWhat you will *not* find in pyalign:\n\n* SIMD acceleration for single pairs of sequences as in e.g. (Farrar, 2007)\n* GPU acceleration, see e.g. 
(Barnes, 2020)\n* approximate or randomized algorithms\n* advanced preprocessing or indexing\n\nIf you need any of the above, you might want to take a look at:\n\n* https://pypi.org/project/edlib/\n* https://github.com/smarco/WFA2-lib and https://github.com/kcleal/pywfa\n* https://github.com/vishnubob/ssw\n* https://github.com/lh3/ksw2\n* https://github.com/Daniel-Liu-c0deb0t/block-aligner\n* https://github.com/jeffdaily/parasail\n* https://biopython.org/docs/latest/api/Bio.Align.html\n* http://cudasw.sourceforge.net/homepage.htm\n* https://blast.ncbi.nlm.nih.gov/Blast.cgi\n\n## Even more alignment libraries\n\n* https://github.com/mbreese/swalign/\n* https://github.com/seqan/seqan3\n* https://github.com/wannesm/dtaidistance\n\n# References\n\n## Original Works\n\nAltschul, S. (1998). Generalized affine gap costs for protein sequence alignment. Proteins: Structure, 32.\n\nGotoh, O. (1982). An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162(3), 705–708. https://doi.org/10.1016/0022-2836(82)90398-9\n\nSankoff, D. (1972). Matching Sequences under Deletion/Insertion Constraints. Proceedings of\nthe National Academy of Sciences, 69(1), 4–6. https://doi.org/10.1073/pnas.69.1.4\n\nSmith, T. F., \u0026 Waterman, M. S. (1981). Identification of common\nmolecular subsequences. Journal of Molecular Biology, 147(1), 195–197.\nhttps://doi.org/10.1016/0022-2836(81)90087-5\n\nMiller, W., \u0026 Myers, E. W. (1988). Sequence comparison with concave weighting functions. Bulletin of Mathematical Biology, 50(2), 97–120. https://doi.org/10.1007/BF02459948\n\nNeedleman, S. B., \u0026 Wunsch, C. D. (1970). A general method applicable\nto the search for similarities in the amino acid sequence of two proteins.\nJournal of Molecular Biology, 48(3), 443–453. https://doi.org/10.1016/0022-2836(70)90057-4\n\nWaterman, M. S., Smith, T. F., \u0026 Beyer, W. A. (1976). Some biological sequence metrics.\nAdvances in Mathematics, 20(3), 367–387. 
https://doi.org/10.1016/0001-8708(76)90202-4\n\nWaterman, M. S. (1984). Efficient sequence alignment algorithms. Journal of Theoretical Biology, 108(3), 333–337. https://doi.org/10.1016/S0022-5193(84)80037-5\n\n## Other Algorithms\n\nChakraborty, A., \u0026 Bandyopadhyay, S. (2013). FOGSAA: Fast Optimal Global Sequence Alignment Algorithm. Scientific Reports, 3(1), 1746. https://doi.org/10.1038/srep01746\n\n## Surveys\n\nAluru, S. (Ed.). (2005). Handbook of Computational Molecular Biology.\nChapman and Hall/CRC. https://doi.org/10.1201/9781420036275\n\nStojmirović, A., \u0026 Yu, Y.-K. (2009). Geometric Aspects of Biological Sequence Comparison. Journal of Computational Biology, 16(4), 579–610. https://doi.org/10.1089/cmb.2008.0100\n\nKruskal, J. B. (1983). An Overview of Sequence Comparison: Time Warps,\nString Edits, and Macromolecules. SIAM Review, 25(2), 201–237. https://doi.org/10.1137/1025045\n\nMüller, M. (2007). Information Retrieval for Music and Motion. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-74048-3\n\n## Implementations\n\nAlpern, B., Carter, L., \u0026 Su Gatlin, K. (1995). Microparallelism and high-performance protein matching. Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CDROM)  - Supercomputing ’95, 24-es. https://doi.org/10.1145/224170.224222\n\nBarnes, R. (2020). A Review of the Smith-Waterman GPU Landscape. https://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-152.html\n\nFarrar, M. (2007). Striped Smith-Waterman speeds database searches six times over other SIMD implementations. Bioinformatics, 23(2), 156–161. https://doi.org/10.1093/bioinformatics/btl582\n\nFlouri, T., Kobert, K., Rognes, T., \u0026 Stamatakis, A. (2015). Are all global alignment algorithms and implementations correct? [Preprint]. Bioinformatics. https://doi.org/10.1101/031500\n\nRognes, T. (2011). Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinformatics, 12(1), 221. 
https://doi.org/10.1186/1471-2105-12-221\n\nRudnicki, W. R., Jankowski, A., Modzelewski, A., Piotrowski, A., \u0026 Zadrożny, A. (2009). The new SIMD Implementation of the Smith-Waterman Algorithm on Cell Microprocessor. Fundamenta Informaticae, 96(1–2), 181–194. https://doi.org/10.3233/FI-2009-173\n\nTran, T. T., Liu, Y., \u0026 Schmidt, B. (2016). Bit-parallel approximate pattern matching: Kepler GPU versus Xeon Phi. 26th International Symposium on Computer Architecture and High Performance Computing, 54, 128–138. https://doi.org/10.1016/j.parco.2015.11.001\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpoke1024%2Fpyalign","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpoke1024%2Fpyalign","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpoke1024%2Fpyalign/lists"}