https://github.com/deeptools/py2bit

A python library for accessing 2bit files

Host: GitHub
URL: https://github.com/deeptools/py2bit
Owner: deeptools
License: mit
Created: 2016-07-29T08:24:12.000Z (almost 10 years ago)
Default Branch: master
Last Pushed: 2025-01-21T15:41:54.000Z (over 1 year ago)
Last Synced: 2025-10-10T18:52:34.106Z (7 months ago)
Topics: bioinformatics, twobit
Language: C
Size: 59.6 KB
Stars: 21
Watchers: 2
Forks: 9
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

[![Build Status](https://travis-ci.org/dpryan79/py2bit.svg?branch=master)](https://travis-ci.org/dpryan79/py2bit)

# py2bit

A python extension, written in C, for quick access to [2bit](https://genome.ucsc.edu/FAQ/FAQformat.html#format7) files. The extension uses [lib2bit](https://github.com/dpryan79/lib2bit) for file access.

Table of Contents
=================

 * [Installation](#installation)
 * [Usage](#usage)
   * [Load the extension](#load-the-extension)
   * [Open a 2bit file](#open-a-2bit-file)
   * [Access the list of chromosomes and their lengths](#access-the-list-of-chromosomes-and-their-lengths)
   * [Print file information](#print-file-information)
   * [Fetch a sequence](#fetch-a-sequence)
   * [Fetch per-base statistics](#fetch-per-base-statistics)
   * [Fetch masked blocks](#fetch-masked-blocks)
   * [Close a file](#close-a-file)
 * [A note on coordinates](#a-note-on-coordinates)

# Installation

You can install the extension directly from GitHub with:

    pip install git+https://github.com/dpryan79/py2bit

# Usage

Basic usage is as follows:

## Load the extension

    >>> import py2bit

## Open a 2bit file

This will work if your working directory is the py2bit source code directory.

    >>> tb = py2bit.open("test/foo.2bit")

Note that if you would like to include information about soft-masked bases, you need to manually specify that:

    >>> tb = py2bit.open("test/foo.2bit", True)

## Access the list of chromosomes and their lengths

`TwoBit` objects contain a dictionary holding the chromosome/contig lengths, which can be accessed with the `chroms()` method.

    >>> tb.chroms()

    {'chr1': 150L, 'chr2': 100L}

You can directly access a particular chromosome by specifying its name.

    >>> tb.chroms('chr1')

    150L

Under Python 2 the lengths are "long" integers, which is why there's an `L` suffix; under Python 3 they are printed as plain integers. If you specify a nonexistent chromosome then `None` is returned, so nothing is printed.

    >>> tb.chroms("foo")

    >>>

## Print file information

The following information about and contained within a 2bit file can be accessed with the `info()` method:

 * file size, in bytes (`file size`)

 * number of chromosomes/contigs (`nChroms`)

 * total sequence length, in bases (`sequence length`)

 * total number of hard-masked (N) bases (`hard-masked length`)

 * total number of soft-masked (lower case) bases (`soft-masked length`).

Note that `soft-masked length` will only be present if `open("file.2bit", True)` is used, since handling soft-masking increases memory requirements and decreases performance.

    >>> tb.info()

    {'file size': 161, 'nChroms': 2, 'sequence length': 250, 'hard-masked length': 150, 'soft-masked length': 8}

## Fetch a sequence

The sequence of a full or partial chromosome/contig can be fetched with the `sequence()` method.

    >>> tb.sequence("chr1")

    'NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACGTACGTACGTagctagctGATCGATCGTAGCTAGCTAGCTAGCTGATCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN'

By default, the whole chromosome/contig is returned. A specific range can also be requested.

    >>> tb.sequence("chr1", 24, 74)

    NNNNNNNNNNNNNNNNNNNNNNNNNNACGTACGTACGTagctagctGATC

The first number is the (0-based) position on the chromosome/contig where the sequence should begin. The second number is the (1-based) position on the chromosome where the sequence should end.

If it was requested during file opening that soft-masking information be stored, then lower case bases may be present. If a nonexistent chromosome/contig is specified then a runtime error occurs.
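
Since an unknown chromosome/contig leads to a runtime error, it can be convenient to check a region against the lengths dictionary before fetching. The following is a minimal sketch of such a pre-check; `check_region` is a hypothetical helper, not a py2bit function, and it operates on a plain dictionary shaped like the `chroms()` output shown earlier.

```python
def check_region(chrom_lengths, chrom, start=0, end=None):
    """Validate a 0-based half-open region against a dict of lengths.

    Returns the (start, end) pair, with `end` defaulting to the
    chromosome length. Raises KeyError or ValueError on bad input.
    """
    if chrom not in chrom_lengths:
        raise KeyError(f"unknown chromosome/contig: {chrom!r}")
    length = chrom_lengths[chrom]
    if end is None:
        end = length
    if not 0 <= start < end <= length:
        raise ValueError(
            f"invalid region {chrom}:{start}-{end} (length {length})"
        )
    return start, end


# Lengths as in the chroms() example above (hard-coded for illustration).
lengths = {"chr1": 150, "chr2": 100}
print(check_region(lengths, "chr1", 24, 74))
print(check_region(lengths, "chr2"))
```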

## Fetch per-base statistics

It's often necessary to compute the fraction of a chromosome made up of each base. This can be done with the `bases()` method.

    >>> tb.bases("chr1")

    {'A': 0.08, 'C': 0.08, 'T': 0.08666666666666667, 'G': 0.08666666666666667}

This returns a dictionary with bases as keys and the fraction of the sequence composed of them as values. Note that this will not sum to 1 if there are any hard-masked bases (the chromosome is 2/3 `N` in this case). One can also request this information over a particular region.

    >>> tb.bases("chr1", 24, 74)

    {'A': 0.12, 'C': 0.12, 'T': 0.12, 'G': 0.12}

The start and end position are as with the `sequence()` method described above.

If integer counts are preferred, then they can instead be returned.

    >>> tb.bases("chr1", 24, 74, False)

    {'A': 6, 'C': 6, 'T': 6, 'G': 6}
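
Integer counts make it easy to derive other quantities. As one possible example, the sketch below computes GC content among unmasked bases and the number of hard-masked bases in the region. `gc_fraction` is a hypothetical helper, and the counts are the example output above copied literally.

```python
def gc_fraction(counts):
    """GC fraction among unmasked (A/C/G/T) bases, from integer counts."""
    acgt = sum(counts[base] for base in "ACGT")
    if acgt == 0:
        return None  # the region is entirely hard-masked
    return (counts["G"] + counts["C"]) / acgt


# Counts for chr1:24-74, copied from the output shown above.
region_counts = {"A": 6, "C": 6, "T": 6, "G": 6}
region_length = 74 - 24  # half-open interval, so the length is end - start

# Bases not counted as A/C/G/T are the hard-masked (N) ones.
n_count = region_length - sum(region_counts.values())

print(gc_fraction(region_counts), n_count)
```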

## Fetch masked blocks

There are two kinds of masking blocks that can be present in 2bit files: hard-masked and soft-masked. Hard-masked blocks are stretches of NNNN, as are commonly found near telomeres and centromeres. Soft-masked blocks are runs of lowercase A/C/T/G, typically indicating repeat elements or low-complexity stretches. It can sometimes be useful to query this information from 2bit files:

    >>> tb.hardMaskedBlocks("chr1")

    [(0, 50), (100, 150)]

In this (small) example, there are two stretches of hard-masked sequence, from 0 to 50 and again from 100 to 150 (see the note below about coordinates). If you would instead like to query all blocks overlapping with a specific region, you can specify the region bounds:

    >>> tb.hardMaskedBlocks("chr1", 75, 101)

    [(100, 150)]

If there are no overlapping regions, then an empty list is returned:

    >>> tb.hardMaskedBlocks("chr1", 75, 100)

    []
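
The two region queries above differ only in whether the exclusive end bound reaches position 100. A minimal sketch of that half-open overlap test is shown below; it is an illustration of the coordinate logic in pure Python, not the library's actual implementation, and `overlapping_blocks` is a hypothetical name.

```python
def overlapping_blocks(blocks, start, end):
    """Return the blocks that overlap the half-open query [start, end)."""
    return [(s, e) for s, e in blocks if s < end and e > start]


# Hard-masked blocks for chr1, copied from the output shown above.
blocks = [(0, 50), (100, 150)]

# [75, 101) includes position 100, so the second block overlaps.
print(overlapping_blocks(blocks, 75, 101))

# [75, 100) stops just before position 100, so nothing overlaps.
print(overlapping_blocks(blocks, 75, 100))
```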

Instead of `hardMaskedBlocks()`, one can use `softMaskedBlocks()` in an identical manner:

    >>> tb = py2bit.open("test/foo.2bit", storeMasked=True)

    >>> tb.softMaskedBlocks("chr1")

    [(62, 70)]

As shown, you **must** specify `storeMasked=True` when opening the file or you will receive a runtime error.
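
To see how the reported blocks relate to the sequence itself, the sketch below recovers the same intervals from the example `chr1` sequence with regular expressions. This is only an illustration: the `runs` helper is hypothetical, the sequence is rebuilt by hand from the output shown earlier, and scanning a full chromosome string this way would be far slower than asking the library for the blocks.

```python
import re


def runs(seq, pattern):
    """Return 0-based half-open (start, end) spans matching `pattern`."""
    return [match.span() for match in re.finditer(pattern, seq)]


# The example chr1 sequence, rebuilt piece by piece.
chr1 = (
    "N" * 50
    + "ACGTACGTACGT"
    + "agctagct"
    + "GATCGATCGTAGCTAGCTAGCTAGCTGATC"
    + "N" * 50
)

hard = runs(chr1, r"N+")       # stretches of hard-masked bases
soft = runs(chr1, r"[acgt]+")  # stretches of lowercase (soft-masked) bases
print(hard, soft)
```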

## Close a file

A `TwoBit` object can be closed with the `close()` method.

    >>> tb.close()

# A note on coordinates

0-based half-open coordinates are used by this python module. So to access the value for the first base on `chr1`, one would specify the starting position as `0` and the end position as `1`. Similarly, bases 100 to 115 would have a start of `99` and an end of `115`. This is simply for the sake of consistency with most other bioinformatics packages.
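
When positions come from a source that uses 1-based inclusive coordinates, the conversion is a one-line adjustment. The helpers below are a small sketch of it; both function names are hypothetical and not part of py2bit.

```python
def to_half_open(first, last):
    """Convert 1-based inclusive positions to 0-based half-open ones."""
    if first < 1 or last < first:
        raise ValueError(f"invalid 1-based interval: {first}-{last}")
    return first - 1, last


def to_one_based(start, end):
    """Convert 0-based half-open positions to 1-based inclusive ones."""
    if start < 0 or end <= start:
        raise ValueError(f"invalid half-open interval: {start}-{end}")
    return start + 1, end


print(to_half_open(1, 1))      # the first base
print(to_half_open(100, 115))  # bases 100 to 115
```

Only the start position shifts; the end stays the same, and the length of a half-open interval is simply `end - start`.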
