https://github.com/deeptools/py2bit
A python library for accessing 2bit files
bioinformatics twobit
- Host: GitHub
- URL: https://github.com/deeptools/py2bit
- Owner: deeptools
- License: mit
- Created: 2016-07-29T08:24:12.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2025-01-21T15:41:54.000Z (over 1 year ago)
- Last Synced: 2025-10-10T18:52:34.106Z (7 months ago)
- Topics: bioinformatics, twobit
- Language: C
- Size: 59.6 KB
- Stars: 21
- Watchers: 2
- Forks: 9
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
# py2bit
A python extension, written in C, for quick access to [2bit](https://genome.ucsc.edu/FAQ/FAQformat.html#format7) files. The extension uses [lib2bit](https://github.com/dpryan79/lib2bit) for file access.
Table of Contents
=================
* [Installation](#installation)
* [Usage](#usage)
  * [Load the extension](#load-the-extension)
  * [Open a 2bit file](#open-a-2bit-file)
  * [Access the list of chromosomes and their lengths](#access-the-list-of-chromosomes-and-their-lengths)
  * [Print file information](#print-file-information)
  * [Fetch a sequence](#fetch-a-sequence)
  * [Fetch per-base statistics](#fetch-per-base-statistics)
  * [Fetch masked blocks](#fetch-masked-blocks)
  * [Close a file](#close-a-file)
* [A note on coordinates](#a-note-on-coordinates)
# Installation
You can install the extension directly from GitHub with:
pip install git+https://github.com/dpryan79/py2bit
# Usage
Basic usage is as follows:
## Load the extension
>>> import py2bit
## Open a 2bit file
This will work if your working directory is the py2bit source code directory.
>>> tb = py2bit.open("test/foo.2bit")
Note that if you would like to include information about soft-masked bases, you need to manually specify that:
>>> tb = py2bit.open("test/foo.2bit", True)
## Access the list of chromosomes and their lengths
`TwoBit` objects contain a dictionary holding the chromosome/contig lengths, which can be accessed with the `chroms()` method.
>>> tb.chroms()
{'chr1': 150L, 'chr2': 100L}
You can directly access a particular chromosome by specifying its name.
>>> tb.chroms('chr1')
150L
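The lengths are stored as a "long" integer type, which is why there's an `L` suffix under Python 2; Python 3 has no separate long type and prints plain integers such as `150`. If you specify a nonexistent chromosome then `None` is returned, so nothing is output.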
>>> tb.chroms("foo")
>>>
## Print file information
The following information about and contained within a 2bit file can be accessed with the `info()` method:
* file size, in bytes (`file size`)
* number of chromosomes/contigs (`nChroms`)
* total sequence length, in bases (`sequence length`)
* total number of hard-masked (N) bases (`hard-masked length`)
* total number of soft-masked (lower case) bases (`soft-masked length`).
Note that `soft-masked length` will only be present if `open("file.2bit", True)` is used, since handling soft-masking increases memory requirements and decreases performance.
>>> tb.info()
{'file size': 161, 'nChroms': 2, 'sequence length': 250, 'hard-masked length': 150, 'soft-masked length': 8}
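The values reported by `info()` can be cross-checked against `chroms()`. The following is a minimal sketch that copies the example output above into plain dictionaries, so it runs without py2bit or a 2bit file. The names `chrom_lengths` and `file_info` are illustrative and not part of the py2bit API.

```python
# Values copied from the chroms() and info() output shown above.
chrom_lengths = {"chr1": 150, "chr2": 100}
file_info = {
    "file size": 161,
    "nChroms": 2,
    "sequence length": 250,
    "hard-masked length": 150,
    "soft-masked length": 8,
}

# In this example the total sequence length is the sum of the
# per-chromosome lengths, and nChroms is the number of entries.
total_length = sum(chrom_lengths.values())
lengths_agree = total_length == file_info["sequence length"]
counts_agree = len(chrom_lengths) == file_info["nChroms"]

# Bases that are not hard-masked (i.e. not N).
unmasked = file_info["sequence length"] - file_info["hard-masked length"]

print(lengths_agree, counts_agree, unmasked)
```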
## Fetch a sequence
The sequence of a full or partial chromosome/contig can be fetched with the `sequence()` method.
>>> tb.sequence("chr1")
'NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACGTACGTACGTagctagctGATCGATCGTAGCTAGCTAGCTAGCTGATCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN'
By default, the whole chromosome/contig is returned. A specific range can also be requested.
>>> tb.sequence("chr1", 24, 74)
'NNNNNNNNNNNNNNNNNNNNNNNNNNACGTACGTACGTagctagctGATC'
The first number is the (0-based) position on the chromosome/contig where the sequence should begin. The second number is the (1-based) position on the chromosome/contig where the sequence should end, so the base at that 0-based index is not included.
If it was requested during file opening that soft-masking information be stored, then lower case bases may be present. If a nonexistent chromosome/contig is specified then a runtime error occurs.
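The start and end arguments behave like a Python slice. The sketch below illustrates this without py2bit: `CHR1` is the example `chr1` sequence rebuilt by hand from the output above, and `fetch()` is a hypothetical stand-in for `sequence()`, not the library's implementation.

```python
# chr1 from the example file, rebuilt from the output shown above:
# 50 N, 50 called bases (8 of them soft-masked), 50 N.
CHR1 = (
    "N" * 50
    + "ACGTACGTACGTagctagctGATCGATCGTAGCTAGCTAGCTAGCTGATC"
    + "N" * 50
)


def fetch(seq, start=0, end=None):
    """Hypothetical stand-in for sequence(): 0-based start, exclusive end."""
    if end is None:
        end = len(seq)
    return seq[start:end]


region = fetch(CHR1, 24, 74)
print(len(region))  # end - start bases are returned
print(region)
```

The length of the returned region is always `end - start`, which is a convenient way to check that a range was specified as intended.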
## Fetch per-base statistics
It's often required to compute the percentage of one or more bases in a chromosome. This can be done with the `bases()` method.
>>> tb.bases("chr1")
{'A': 0.08, 'C': 0.08, 'T': 0.08666666666666667, 'G': 0.08666666666666667}
This returns a dictionary with bases as keys and the fraction of the sequence composed of them as values. Note that this will not sum to 1 if there are any hard-masked bases (the chromosome is 2/3 `N` in this case). One can also request this information over a particular region.
>>> tb.bases("chr1", 24, 74)
{'A': 0.12, 'C': 0.12, 'T': 0.12, 'G': 0.12}
The start and end position are as with the `sequence()` method described above.
If integer counts are preferred, then they can instead be returned.
>>> tb.bases("chr1", 24, 74, False)
{'A': 6, 'C': 6, 'T': 6, 'G': 6}
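To make the meaning of these numbers concrete, here is a pure-Python sketch that reproduces them from the example `chr1` sequence. It assumes, based on the output above, that soft-masked (lower case) bases are counted together with their upper case equivalents and that fractions are relative to the full region length, including `N`. `base_stats()` and `CHR1` are illustrative names, not part of py2bit.

```python
from collections import Counter

# chr1 from the example file, rebuilt from the sequence() output.
CHR1 = (
    "N" * 50
    + "ACGTACGTACGTagctagctGATCGATCGTAGCTAGCTAGCTAGCTGATC"
    + "N" * 50
)


def base_stats(seq, start=0, end=None, fraction=True):
    """Illustrative re-implementation of the bases() output (not py2bit code)."""
    region = seq[start:end].upper()  # count soft-masked bases as well
    if not region:
        raise ValueError("empty region")
    counts = Counter(region)
    result = {base: counts[base] for base in "ACTG"}
    if fraction:
        # Divide by the region length, N included, so the values
        # do not sum to 1 when hard-masked bases are present.
        result = {base: n / len(region) for base, n in result.items()}
    return result


print(base_stats(CHR1, 24, 74, fraction=False))
print(base_stats(CHR1, 24, 74))
```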
## Fetch masked blocks
There are two kinds of masking blocks that can be present in 2bit files: hard-masked and soft-masked. Hard-masked blocks are stretches of NNNN, as are commonly found near telomeres and centromeres. Soft-masked blocks are runs of lowercase A/C/T/G, typically indicating repeat elements or low-complexity stretches. It can sometimes be useful to query this information from 2bit files:
>>> tb.hardMaskedBlocks("chr1")
[(0, 50), (100, 150)]
In this (small) example, there are two stretches of hard-masked sequence, from 0 to 50 and again from 100 to 150 (see the note below about coordinates). If you would instead like to query all blocks overlapping with a specific region, you can specify the region bounds:
>>> tb.hardMaskedBlocks("chr1", 75, 101)
[(100, 150)]
If there are no overlapping regions, then an empty list is returned:
>>> tb.hardMaskedBlocks("chr1", 75, 100)
[]
Instead of `hardMaskedBlocks()`, one can use `softMaskedBlocks()` in an identical manner:
>>> tb = py2bit.open("test/foo.2bit", storeMasked=True)
>>> tb.softMaskedBlocks("chr1")
[(62, 70)]
As shown, you **must** specify `storeMasked=True` or you will receive a runtime error.
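The block coordinates and the overlap behaviour can be illustrated with a short pure-Python sketch. It scans the example `chr1` sequence for runs of `N` (hard-masked) or lower case bases (soft-masked) and keeps the blocks that overlap a half-open query region. `masked_blocks()`, `HARD` and `SOFT` are hypothetical names, and the overlap rule is inferred from the example output above rather than taken from py2bit's source.

```python
import re

# chr1 from the example file, rebuilt from the sequence() output.
CHR1 = (
    "N" * 50
    + "ACGTACGTACGTagctagctGATCGATCGTAGCTAGCTAGCTAGCTGATC"
    + "N" * 50
)

HARD = r"N+"       # runs of hard-masked bases
SOFT = r"[acgt]+"  # runs of soft-masked (lower case) bases


def masked_blocks(seq, pattern, start=0, end=None):
    """Return (start, end) blocks matching pattern that overlap [start, end)."""
    if end is None:
        end = len(seq)
    blocks = [m.span() for m in re.finditer(pattern, seq)]
    # Two half-open intervals overlap if each starts before the other ends.
    return [(s, e) for s, e in blocks if s < end and e > start]


print(masked_blocks(CHR1, HARD))
print(masked_blocks(CHR1, HARD, 75, 101))
print(masked_blocks(CHR1, HARD, 75, 100))
print(masked_blocks(CHR1, SOFT))
```

Because the coordinates are half-open, a query ending at `100` does not overlap a block starting at `100`, which is why the third call returns an empty list.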
## Close a file
A `TwoBit` object can be closed with the `close()` method.
>>> tb.close()
# A note on coordinates
0-based half-open coordinates are used by this python module. So to access the value for the first base on `chr1`, one would specify the starting position as `0` and the end position as `1`. Similarly, bases 100 to 115 would have a start of `99` and an end of `115`. This is simply for the sake of consistency with most other bioinformatics packages.
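If your positions come from a 1-based, inclusive source (a genome browser, for example), the conversion is a one-liner. The helper below is a small sketch with a hypothetical name, `to_half_open()`; it is not provided by py2bit.

```python
def to_half_open(first, last):
    """Convert 1-based inclusive positions to 0-based half-open (start, end)."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    # Only the start shifts; the inclusive 1-based end equals the
    # exclusive 0-based end.
    return first - 1, last


print(to_half_open(1, 1))      # first base of a chromosome
print(to_half_open(100, 115))  # bases 100 to 115
```

The resulting pair can be passed as the start and end arguments of `sequence()`, `bases()`, `hardMaskedBlocks()` or `softMaskedBlocks()`.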