Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/robaina/parallelbam
Tools to parallelize operations on large BAM files
https://github.com/robaina/parallelbam
Last synced: 10 days ago
JSON representation
Tools to parallelize operations on large BAM files
- Host: GitHub
- URL: https://github.com/robaina/parallelbam
- Owner: Robaina
- License: cc-by-4.0
- Created: 2021-08-24T17:31:15.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2022-09-07T08:09:20.000Z (about 2 years ago)
- Last Synced: 2024-10-17T19:38:28.019Z (20 days ago)
- Language: Python
- Size: 326 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.ipynb
- License: LICENSE
Awesome Lists containing this project
README
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Parallelizing operations on SAM/BAM files\n",
"\n",
"SAM/BAM files are typically large, thus, operations on these files are time intensive. This project provides tools to parallelize operations on SAM/BAM files. The workflow follows:\n",
"\n",
"1. Split BAM/SAM file in _n_ chunks\n",
"2. Perform operation in each chunk in a dedicated process and save resulting SAM/BAM chunk \n",
"3. Merge results back into a single SAM/BAM file\n",
"\n",
"Depends on:\n",
"\n",
"1. Samtools\n",
"\n",
"# Installation\n",
"\n",
"1. Git clone project\n",
"2. cd to cloned project directory\n",
"3. ```sudo python setup.py install```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Usage\n",
"\n",
"There is one main function named ```parallelizedBAMoperation```. This function takes as mandatory arguments:\n",
"\n",
"1. path to original bam file (should be ordered)\n",
"2. a callable function to perform the operation on each bam file chunk\n",
"\n",
"The callable function must accept the following two first arguments: (i) path to input bam file and (ii) path to resulting output bam file, in this order.\n",
"\n",
"# Note\n",
"\n",
"Preparing a bam file to run an operation in parallel takes a while, thus is not worth it when the operatin itself takes a short time. For example, preparing a typical bam file for parallelization (in 8 processes) can take almost a minute."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import shutil\n",
"from parallelbam.parallelbam import parallelizeBAMoperation, getNumberOfReads"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def foo(input_bam, output_bam):\n",
" shutil.copyfile(input_bam, output_bam)\n",
" \n",
" \n",
"parallelizeBAMoperation('parallelbam2/tests/toy_sample.bam',\n",
" foo, output_path=None,\n",
" n_processes=4)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1000"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"getNumberOfReads('parallelbam2/tests/toy_sample.bam')"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1000"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"getNumberOfReads('parallelbam2/tests/processed.bam')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.6 ('parallelbam')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
},
"vscode": {
"interpreter": {
"hash": "d7acfd6951bc0fa0485d92016d3c99713d079b52e001e9aed733ba0bc441773f"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}