{"id":27645933,"url":"https://github.com/open2c/pairtools","last_synced_at":"2025-04-24T01:14:01.188Z","repository":{"id":39167192,"uuid":"82891794","full_name":"open2c/pairtools","owner":"open2c","description":"Extract 3D contacts (.pairs) from sequencing alignments","archived":false,"fork":false,"pushed_at":"2025-01-31T13:41:24.000Z","size":3518,"stargazers_count":111,"open_issues_count":50,"forks_count":33,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-04-24T01:13:53.309Z","etag":null,"topics":["3d-genome","bioinformatics","file-formatter","hi-c","ngs","pairs-file","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/open2c.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-02-23T06:17:43.000Z","updated_at":"2025-04-12T22:09:18.000Z","dependencies_parsed_at":"2024-01-10T14:45:06.403Z","dependency_job_id":"87da2b86-725b-40df-a44e-f78861cf25ca","html_url":"https://github.com/open2c/pairtools","commit_stats":{"total_commits":533,"total_committers":10,"mean_commits":53.3,"dds":"0.22138836772983117","last_synced_commit":"3679207089c97bba52a316031ee2180586fe78f8"},"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open2c%2Fpairtools","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open2c%2Fpairtools/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open2c%2Fpairtools/releases","man
ifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open2c%2Fpairtools/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/open2c","download_url":"https://codeload.github.com/open2c/pairtools/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250540916,"owners_count":21447427,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["3d-genome","bioinformatics","file-formatter","hi-c","ngs","pairs-file","python"],"created_at":"2025-04-24T01:14:00.643Z","updated_at":"2025-04-24T01:14:01.178Z","avatar_url":"https://github.com/open2c.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pairtools\n\n[![Documentation Status](https://readthedocs.org/projects/pairtools/badge/?version=latest)](http://pairtools.readthedocs.org/en/latest/)\n[![Build Status](https://travis-ci.org/mirnylab/pairtools.svg?branch=master)](https://travis-ci.org/mirnylab/pairtools)\n[![Join the chat on Slack](https://img.shields.io/badge/chat-slack-%233F0F3F?logo=slack)](https://bit.ly/2UaOpAe)\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1490831.svg)](https://doi.org/10.5281/zenodo.1490831)\n\n## Process Hi-C pairs with pairtools\n\n`pairtools` is a simple and fast command-line framework to process sequencing\ndata from a Hi-C experiment.\n\n`pairtools` processes paired-end sequence alignments and performs the following\noperations:\n\n- detect ligation junctions (a.k.a. 
Hi-C pairs) in aligned paired-end sequences of Hi-C DNA molecules\n- sort .pairs files for downstream analyses\n- detect, tag and remove PCR/optical duplicates \n- generate extensive statistics of Hi-C datasets\n- select Hi-C pairs given flexibly defined criteria\n- restore .sam alignments from Hi-C pairs\n- annotate restriction digestion sites\n- get the mutated positions in Hi-C pairs\n\nTo get started:\n- Visit [pairtools tutorials](https://pairtools.readthedocs.io/en/latest/examples/pairtools_walkthrough.html),\n- Take a look at a [quick example](https://github.com/open2c/pairtools#quick-example),\n- Check out the detailed [documentation](http://pairtools.readthedocs.io).\n\n## Data formats\n\n`pairtools` produces and operates on tab-separated files compliant with the\n[.pairs](https://github.com/4dn-dcic/pairix/blob/master/pairs_format_specification.md) \nformat defined by the [4D Nucleome Consortium](https://www.4dnucleome.org/). All\n`pairtools` commands properly manage file headers and keep track of the data\nprocessing history.\n\nAdditionally, `pairtools` defines the [.pairsam format](https://pairtools.readthedocs.io/en/latest/formats.html#pairsam), an extension of .pairs that includes the SAM alignments \nof a sequenced Hi-C molecule. .pairsam complies with the .pairs format, and can be processed by any tool that\noperates on .pairs files.\n\n`pairtools` produces a set of extra columns, which describe properties of alignments, phase, mutations, restriction and complex walks.\nThe full list of possible extra columns is provided in the [`pairtools` format specification](https://pairtools.readthedocs.io/en/latest/formats.html#extra-columns). \n\n## Installation\n\nRequirements:\n\n- Python 3.x\n- Python packages `cython`, `pysam`, `bioframe`, `pyyaml`, `numpy`, `scipy`, `pandas` and `click`.\n- Command-line utilities `sort` (the Unix version), `samtools` and `bgzip` (shipped with `samtools`). 
If available, `pairtools` can compress outputs with `pbgzip` and `lz4`.\n\nFor the full list of recommended versions, see [the requirements section in the pyproject.toml](https://github.com/open2c/pairtools/blob/main/pyproject.toml). \n\nThere are three options for installing pairtools:\n\n1. We highly recommend using the `conda` package manager to install `pairtools` together with all its dependencies. To get it, you can either install the full [Anaconda](https://www.continuum.io/downloads) Python distribution or just the standalone [conda](http://conda.pydata.org/miniconda.html) package manager.\n\nWith `conda`, you can install `pairtools` and all of its dependencies from the [bioconda](https://bioconda.github.io/index.html) channel:\n```sh\n$ conda install -c conda-forge -c bioconda pairtools\n```\n\n2. Alternatively, install non-Python dependencies (`sort`, `samtools`, `bgzip`, `pbgzip` and `lz4`) separately and download `pairtools` with Python dependencies from PyPI using pip:\n```sh\n$ pip install pairtools\n```\n\n3. 
Finally, when the two options above don't work or when you want to modify `pairtools`, build `pairtools` from source via pip's \"editable\" mode:\n```sh\n$ pip install numpy cython pysam \n$ git clone https://github.com/open2c/pairtools\n$ cd pairtools\n$ pip install -e ./ --no-build-isolation\n```\n\n\n## Quick example\n\nSet up a new test folder and download a small Hi-C dataset mapped to the sacCer3 genome:\n```bash\n$ mkdir /tmp/test-pairtools\n$ cd /tmp/test-pairtools\n$ wget https://github.com/open2c/distiller-test-data/raw/master/bam/MATalpha_R1.bam\n```\n\nAdditionally, we will need a .chromsizes file, a TAB-separated plain text table describing the names, sizes and the order of chromosomes in the genome assembly used during mapping:\n```bash\n$ wget https://raw.githubusercontent.com/open2c/distiller-test-data/master/genome/sacCer3.reduced.chrom.sizes\n```\n\nWith `pairtools parse`, we can convert paired-end sequence alignments stored in .sam/.bam format into .pairs, a TAB-separated table of Hi-C ligation junctions:\n\n```bash\n$ pairtools parse -c sacCer3.reduced.chrom.sizes -o MATalpha_R1.pairs.gz --drop-sam MATalpha_R1.bam \n```\n\nInspect the resulting table:\n\n```bash\n$ less MATalpha_R1.pairs.gz\n```\n\n## Pipelines\n\n- We provide a simple working example of a mapping bash pipeline in /examples/.\n- [distiller](https://github.com/open2c/distiller-nf) is a powerful\nHi-C data analysis workflow, based on `pairtools` and [nextflow](https://www.nextflow.io/).\n\n\n## Tools\n\n- `parse`: read .sam/.bam files produced by bwa and form Hi-C pairs\n    - form Hi-C pairs by reporting the outer-most mapped positions and the strand\n    on either side of each molecule;\n    - report unmapped/multimapped (ambiguous alignments)/chimeric alignments as\n    chromosome \"!\", position 0, strand \"-\";\n    - perform upper-triangular flipping of the sides of Hi-C molecules \n    such that the first side has a lower sorting index than the second side;\n    - form 
hybrid pairsam output, where each line contains all available data \n    for one Hi-C molecule (outer-most mapped positions on either side, \n    read ID, pair type, and .sam entries for each alignment);\n    - report .sam tags or mutations of the alignments;\n    - print the .sam header as #-comment lines at the start of the file.\n\n- `parse2`: read .sam/.bam files with long paired-end or single-end reads and form Hi-C pairs from complex walks \n    - identify and rescue chimeric alignments produced by singly-ligated Hi-C \n    molecules with a sequenced ligation junction on one of the sides;\n    - annotate chimeric alignments by restriction fragments and report true junctions and hops (One-Read-Based Interactions Annotation, ORBITA);\n    - perform intra-molecule deduplication of paired-end data when one side reads through the DNA on the other side of the read;\n    - report the index of the pair in the complex walk;\n    - make combinatorial expansion of pairs produced from the same walk.\n\n- `sort`: sort pairs files (the lexicographic order for chromosomes, \n    the numeric order for the positions, the lexicographic order for pair types).\n\n- `merge`: merge sorted .pairs files\n    - merge-sort .pairs files;\n    - combine the .pairs headers from all input files;\n    - check that each .pairs file was mapped to the same reference genome index \n    (by checking the identity of the @SQ sam header lines).\n\n- `select`: select pairs according to specified criteria\n    - select pair entries according to the provided condition. 
A programmable\n    interface allows for arbitrarily complex queries on specific pair types, \n    chromosomes, positions, strands, read IDs (including matches to a\n    wildcard/regexp/list).\n    - optionally print the non-matching entries into a separate file.\n\n- `dedup`: remove PCR duplicates from a sorted triu-flipped .pairs file\n    - remove PCR duplicates by finding pairs of entries with both sides mapped\n    to similar genomic locations (+/- N bp);\n    - optionally output the PCR duplicate entries into a separate file;\n    - detect optical duplicates from the original Illumina read ids;\n    - apply filtering by various properties of pairs (MAPQ; orientation; distance) together with deduplication; \n    - output yaml or convenient tsv deduplication stats into a text file;\n    - NOTE: in order to remove all PCR duplicates, the input must contain \\*all\\* \n      mapped read pairs from a single experimental replicate.\n\n- `maskasdup`: mark all pairs in a pairsam as Hi-C duplicates\n    - change the field pair_type to DD;\n    - change the pair_type tag (Yt:Z:) for all sam alignments;\n    - set the PCR duplicate binary flag for all sam alignments (0x400).\n\n- `split`: split a .pairsam file into .pairs and .sam.\n\n- `flip`: flip pairs to get an upper-triangular matrix\n\n- `header`: manipulate the .pairs/.pairsam header\n    - generate a new header for a headerless .pairs file\n    - transfer the header from one .pairs file to another\n    - set column names for the .pairs file\n    - validate that the header corresponds to the information stored in the .pairs file\n\n- `stats`: calculate various statistics of .pairs files\n\n- `restrict`: identify the span of the restriction fragment forming a Hi-C junction\n\n- `phase`: phase pairs mapped to a diploid genome \n\n## Contributing\n\n[Pull requests](https://akrabat.com/the-beginners-guide-to-contributing-to-a-github-project/) are welcome.\n\nFor development, clone and install in \"editable\" (i.e. 
development) mode with the `-e` option. This way you can also pull changes on the fly.\n```sh\n$ git clone https://github.com/open2c/pairtools.git\n$ cd pairtools\n$ pip install -e .\n```\n\n## Citing `pairtools`\n\nOpen2C*, Nezar Abdennur, Geoffrey Fudenberg, Ilya M. Flyamer, Aleksandra A. Galitsyna*, Anton Goloborodko*, Maxim Imakaev, Sergey V. Venev. \"Pairtools: from sequencing data to chromosome contacts\" bioRxiv, February 13, 2023. doi: https://doi.org/10.1101/2023.02.13.528389\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopen2c%2Fpairtools","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopen2c%2Fpairtools","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopen2c%2Fpairtools/lists"}