https://github.com/alexpreynolds/region-selection

Methods for filtering for high-scoring genomic intervals

Host: GitHub
URL: https://github.com/alexpreynolds/region-selection
Owner: alexpreynolds
License: mit
Created: 2022-05-12T22:09:33.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2022-05-17T06:26:27.000Z (over 2 years ago)
Last Synced: 2024-04-25T18:41:26.680Z (7 months ago)
Language: Python
Size: 34.7 MB
Stars: 0
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# region-selection

Methods for filtering for high-scoring genomic intervals

## Usage

### Importing the module and creating a Selection instance

```
>>> from region_selection import Selection
>>> s = Selection()
```

### Specify properties

```
>>> s.method = "pq"
>>> s.input_fn = "/Users/areynolds/Developer/Github/region_selection/tests/windows.fixed.25k.bed"
>>> s.bin_size = 200
>>> s.exclusion_span = 24800
```

The `method` property can be one of `pq`, `wis`, or `maxmean`, which select the priority-queue, weighted interval scheduling, or max-mean window sweep method, respectively.
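
For readers unfamiliar with weighted interval scheduling, the textbook dynamic-programming formulation can be sketched as follows. This is a generic illustration and may differ from what `region_selection` does internally; the function name `weighted_interval_scheduling` and the `(start, end, score)` tuple layout are assumptions made for this example.

```python
from bisect import bisect_right


def weighted_interval_scheduling(intervals):
    """Return (best_total, chosen) for half-open (start, end, score) tuples.

    Textbook dynamic programming; not taken from region_selection.
    """
    # Sort by end coordinate so that every prefix is a valid subproblem.
    ivs = sorted(intervals, key=lambda iv: iv[1])
    ends = [iv[1] for iv in ivs]
    n = len(ivs)

    # best[i] = best total score using only the first i intervals.
    best = [0.0] * (n + 1)
    # prev[i] = how many earlier intervals end at or before interval i starts.
    prev = [0] * n
    for i, (start, _end, score) in enumerate(ivs):
        prev[i] = bisect_right(ends, start, 0, i)
        best[i + 1] = max(best[i], best[prev[i]] + score)

    # Walk back through the table to recover the chosen intervals.
    chosen = []
    i = n
    while i > 0:
        _start, _end, score = ivs[i - 1]
        if best[prev[i - 1]] + score >= best[i - 1]:
            chosen.append(ivs[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    chosen.reverse()
    return best[n], chosen
```

Sorting by end coordinate and using binary search to find the last compatible interval keeps the whole procedure at O(n log n).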

The `input_fn` property points to a file on the file system. It is optional unless you use the `read()` method.

The `bin_size` and `exclusion_span` properties are integers. They represent the size of each element and the distance required between elements (excluding the bin itself).

The default values are 200 and 24800, respectively. This means bins are 200 nt wide, and any two filtered bins must be at least 25000 nt apart (the 24800 nt exclusion span plus the 200 nt bin).
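
The arithmetic behind these defaults can be checked directly. The snippet below is only a worked example; the names `exclusion_bins` and `min_spacing` are illustrative and are not claimed to be attributes of `Selection`.

```python
bin_size = 200          # width of each bin, in nt
exclusion_span = 24800  # required distance beyond the bin itself, in nt

# Number of whole bins covered by the exclusion span.
exclusion_bins = exclusion_span // bin_size

# Exclusion span plus the bin itself (our reading of the 25000 nt figure).
min_spacing = exclusion_span + bin_size

print(exclusion_bins)  # 124
print(min_spacing)     # 25000
```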

### Input data

You can read in data from a four-column, tab-delimited text file:

```
>>> in_df = s.read(s, s.method, s.input_fn)
[region_selection] Reading input file into dataframe...
[region_selection] Read dataframe
```

Otherwise, you must provide a Pandas dataframe containing four columns labeled `Chromosome`, `Start`, `End`, and `Score`, respectively.

In the above snippet, the input dataframe is called `in_df`.
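
If you build the dataframe yourself, something like the following should give the expected four-column layout. The rows here are made-up placeholder values, not data from the repository.

```python
import pandas as pd

# Made-up example bins; the scores are not from any real dataset.
in_df = pd.DataFrame(
    {
        "Chromosome": ["chr1", "chr1", "chr1"],
        "Start": [0, 200, 400],
        "End": [200, 400, 600],
        "Score": [0.12, 0.41, 0.07],
    }
)
```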

### Running the selection method

Use `run()` to run the specified method on the input dataframe `in_df` (or whatever its name is):

```
>>> out_df = s.run(s, s.method, in_df)
[region_selection] Bin size (nt): 200
[region_selection] Exclusion span (nt): 24800
[region_selection] Exclusion bins: 124
[region_selection] Method: Priority-Queue (PQ)
[region_selection] Constructing heap
[region_selection] Constructing qualifying bin list from heap
[region_selection] Returning sorted bin list
[region_selection] Method (runtime in sec): 140.50703937999998
```
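
The log lines above suggest a heap-based greedy selection. A minimal sketch of one such approach, using only the standard library, is shown here. It illustrates the general technique and is not the package's implementation; the helper `select_top_bins`, its `min_spacing` parameter, and its tie-breaking are all hypothetical.

```python
import heapq


def select_top_bins(bins, min_spacing):
    """Greedily keep the highest-scoring bins whose starts are >= min_spacing apart.

    bins: list of (chromosome, start, end, score) tuples.
    Hypothetical helper for illustration; not part of region_selection.
    """
    # heapq is a min-heap, so negate scores to pop the highest score first.
    heap = [(-score, chrom, start, end) for chrom, start, end, score in bins]
    heapq.heapify(heap)

    accepted = []
    while heap:
        neg_score, chrom, start, end = heapq.heappop(heap)
        # Reject the bin if an accepted bin on the same chromosome is too close.
        too_close = any(
            c == chrom and abs(start - s) < min_spacing
            for c, s, _e, _score in accepted
        )
        if not too_close:
            accepted.append((chrom, start, end, -neg_score))

    # Return the accepted bins in genomic order.
    return sorted(accepted, key=lambda b: (b[0], b[1]))
```

The linear scan over accepted bins keeps the sketch short; a real implementation working on genome-scale input would more likely track excluded positions with an index or a per-chromosome mask.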

The result is returned as a Pandas dataframe. Here it is called `out_df`, and you can use all the usual Pandas methods and properties on it:

```
>>> print(out_df.head())
    Chromosome   Start     End  Score
47        chr1    9400   34400   0.41
172       chr1   34400   59400   0.41
304       chr1   60800   85800   0.41
429       chr1   85800  110800   0.41
554       chr1  110800  135800   0.41
```

Or use the `write()` method to write the result to standard output:

```
>>> s.write(out_df)
...
```

Or write the dataframe out with Pandas methods such as `to_csv()`.
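
As a rough illustration of the `to_csv()` route, the sketch below writes a small stand-in dataframe as tab-delimited, header-less text, which is the usual shape of a BED file. The stand-in `out_df` and the in-memory `buffer` are assumptions made to keep the example self-contained; in practice you would pass your own result and a file path.

```python
import io

import pandas as pd

# Stand-in for the dataframe returned by run(); values copied from the
# head() output shown earlier.
out_df = pd.DataFrame(
    {
        "Chromosome": ["chr1", "chr1"],
        "Start": [9400, 34400],
        "End": [34400, 59400],
        "Score": [0.41, 0.41],
    }
)

# Write tab-delimited rows without the header or index column.
buffer = io.StringIO()
out_df.to_csv(buffer, sep="\t", header=False, index=False)
bed_text = buffer.getvalue()
```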