https://github.com/se7en69/bio-wrangler
Bio-Wrangler is a Python package for bioinformatics data wrangling. It helps load, filter, merge, and summarize bioinformatics datasets from formats like FASTA, FASTQ, VCF, and GFF.
- Host: GitHub
- URL: https://github.com/se7en69/bio-wrangler
- Owner: se7en69
- License: mit
- Created: 2024-10-09T07:52:04.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-10-09T08:13:52.000Z (4 months ago)
- Last Synced: 2024-11-12T19:25:00.156Z (3 months ago)
- Topics: bio-wrangler, bioinformatics, development, fasta, fastq, gff, python, pythonpackage, vcf, wrangler
- Language: Python
- Homepage: https://pypi.org/project/bio-wrangler/
- Size: 123 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
- Code of conduct: CODE_OF_CONDUCT.md
README
# Bio-Wrangler
**Bio-Wrangler** is a Python package designed for wrangling bioinformatics data, including formats such as **FASTA**, **FASTQ**, **VCF**, and **GFF**. The package allows users to load, filter, summarize, and merge bioinformatics datasets into pandas DataFrames, enabling efficient data manipulation and analysis.
## Features
- Load **FASTA**, **FASTQ**, **VCF**, and **GFF** files into pandas DataFrames.
- Filter data based on quality, chromosome, position, and attributes.
- Summarize datasets to extract key statistics.
- Merge multiple datasets.
- Save processed data to CSV or Excel formats.

## Table of Contents
- [Installation](#installation)
- [Usage](#usage)
  - [Loading Data](#loading-data)
  - [Filtering Data](#filtering-data)
  - [Summarizing Data](#summarizing-data)
  - [Merging Datasets](#merging-datasets)
  - [Saving Data](#saving-data)
- [Contributing](#contributing)
- [License](#license)

## Installation
To install Bio-Wrangler, run:

    pip install bio-wrangler
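To confirm that the installation succeeded, one option is to query the installed version through the standard library. This is a minimal sketch: `installed_version` is a hypothetical helper, not part of Bio-Wrangler, and it assumes the distribution name is the same as the one passed to `pip install`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("bio-wrangler")
    print(found if found is not None else "bio-wrangler is not installed")
```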
## Usage
### 1. Loading Data
Bio-Wrangler provides methods to load bioinformatics files into pandas DataFrames.
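To make the loading step concrete, here is a minimal sketch of what turning FASTA text into tabular records involves. It uses only the standard library and is not Bio-Wrangler's implementation; `parse_fasta` and the `id`, `description`, and `sequence` field names are hypothetical, chosen for illustration.

```python
import io
from typing import Dict, Iterable, List


def _make_record(header: str, chunks: List[str]) -> Dict[str, str]:
    # The first whitespace-delimited token of the header is the identifier;
    # the remainder (possibly empty) is kept as a free-text description.
    seq_id, _, description = header.partition(" ")
    return {"id": seq_id, "description": description, "sequence": "".join(chunks)}


def parse_fasta(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse FASTA lines into one record per sequence."""
    records: List[Dict[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


fasta_text = io.StringIO(">seq1 first example\nACGT\nACGT\n>seq2\nGGCC\n")
records = parse_fasta(fasta_text)
for record in records:
    print(record["id"], len(record["sequence"]))
```

Each dictionary corresponds to one row of a table, which is the shape a DataFrame-based loader would expose. The actual column names returned by `load_fasta` are not documented here and may differ.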
#### Example: Loading FASTA, FASTQ, VCF, and GFF Files
    from bio_wrangler.bio_wrangler import BioWrangler

    # Initialize the class
    wrangler = BioWrangler()

    # Load a FASTA file
    fasta_data = wrangler.load_fasta('path/to/sample.fasta')
    print(fasta_data.head())

    # Load a FASTQ file
    fastq_data = wrangler.load_fastq('path/to/sample.fastq')
    print(fastq_data.head())

    # Load a VCF file
    vcf_data = wrangler.load_vcf('path/to/sample.vcf')
    print(vcf_data.head())

    # Load a GFF file
    gff_data = wrangler.load_gff('path/to/sample.gff')
    print(gff_data.head())

### 2. Filtering Data
Bio-Wrangler allows you to filter data based on various criteria such as quality, chromosome, position, and attributes.
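As background for quality-based filtering, the sketch below shows how an average quality score can be derived from a FASTQ quality string and compared against a threshold. It is an illustration under stated assumptions, not the package's code: it assumes Phred+33 encoding, treats the threshold as inclusive, and `mean_phred` and `filter_reads_by_quality` are hypothetical names.

```python
from typing import Dict, List

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def mean_phred(quality: str, offset: int = PHRED_OFFSET) -> float:
    """Average Phred score of a single read's quality string."""
    if not quality:
        raise ValueError("quality string is empty")
    scores = [ord(ch) - offset for ch in quality]
    return sum(scores) / len(scores)


def filter_reads_by_quality(
    reads: List[Dict[str, str]], threshold: float
) -> List[Dict[str, str]]:
    """Keep reads whose mean Phred score is at least `threshold`."""
    return [
        read
        for read in reads
        if read["quality"] and mean_phred(read["quality"]) >= threshold
    ]


reads = [
    {"id": "r1", "sequence": "ACGT", "quality": "IIII"},  # every base scores 40
    {"id": "r2", "sequence": "ACGT", "quality": "!!II"},  # scores 0, 0, 40, 40
]
kept = filter_reads_by_quality(reads, 30.0)
print([read["id"] for read in kept])
```

Whether `filter_fastq_by_quality` uses an inclusive or exclusive comparison is not stated in this README, so boundary cases are worth checking against the package itself.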
#### Example: Filtering FASTQ by Quality
    # Filter FASTQ data by average quality score threshold
    filtered_fastq = wrangler.filter_fastq_by_quality(fastq_data, 30.0)
    print(filtered_fastq.head())

#### Example: Filtering VCF by Chromosome
    # Filter VCF data to retain only records from a specific chromosome
    filtered_vcf = wrangler.filter_by_chromosome(vcf_data, 'chr1')
    print(filtered_vcf.head())

#### Example: Filtering GFF by Attribute
    # Filter GFF data by an attribute key and value (here, ID equal to gene1)
    filtered_gff = wrangler.filter_by_attribute(gff_data, 'ID', 'gene1')
    print(filtered_gff.head())

#### Example: Filtering VCF by Position Range
    # Filter VCF data by a specific position range
    filtered_vcf_range = wrangler.filter_by_position_range(vcf_data, 100000, 500000)
    print(filtered_vcf_range.head())

### 3. Summarizing Data
Bio-Wrangler can generate summary statistics for your datasets.
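The kind of statistics such a summary might contain can be sketched with the standard library. The fields below (read count, length range, mean length, GC fraction) are assumptions chosen for illustration; this README does not list what `summarize_fastq` or `summarize_data` actually return, and `summarize_reads` is a hypothetical helper.

```python
from typing import Dict, List, Union

Number = Union[int, float]


def summarize_reads(reads: List[Dict[str, str]]) -> Dict[str, Number]:
    """Compute simple per-dataset statistics from sequence records."""
    if not reads:
        return {"read_count": 0, "min_length": 0, "max_length": 0,
                "mean_length": 0.0, "gc_fraction": 0.0}
    lengths = [len(read["sequence"]) for read in reads]
    total_bases = sum(lengths)
    # Count G and C bases case-insensitively across all reads.
    gc_bases = sum(
        read["sequence"].upper().count("G") + read["sequence"].upper().count("C")
        for read in reads
    )
    return {
        "read_count": len(reads),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": total_bases / len(reads),
        "gc_fraction": gc_bases / total_bases if total_bases else 0.0,
    }


summary = summarize_reads(
    [{"id": "r1", "sequence": "ACGT"}, {"id": "r2", "sequence": "GGCCAT"}]
)
print(summary)
```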
#### Example: Summarizing FASTQ Data
    # Summarize FASTQ data
    fastq_summary = wrangler.summarize_fastq(fastq_data)
    print(fastq_summary)

#### Example: Summarizing VCF Data
    # Summarize VCF data
    vcf_summary = wrangler.summarize_data(vcf_data)
    print(vcf_summary)

### 4. Merging Datasets
Bio-Wrangler can merge multiple datasets into a single DataFrame.
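This README does not say how rows are combined during a merge. One plausible reading is row-wise concatenation with duplicate variants removed, and the sketch below illustrates that reading only. It is an assumption, not the package's documented behavior; `merge_variants` is a hypothetical helper, and the key uses the standard VCF columns `CHROM`, `POS`, `REF`, and `ALT`.

```python
from typing import Dict, List, Tuple

Variant = Dict[str, object]


def merge_variants(*datasets: List[Variant]) -> List[Variant]:
    """Concatenate variant lists, keeping the first copy of each variant."""
    seen: set = set()
    merged: List[Variant] = []
    for dataset in datasets:
        for record in dataset:
            key: Tuple = (record["CHROM"], record["POS"], record["REF"], record["ALT"])
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged


first = [
    {"CHROM": "chr1", "POS": 100, "REF": "A", "ALT": "G"},
    {"CHROM": "chr1", "POS": 200, "REF": "C", "ALT": "T"},
]
second = [
    {"CHROM": "chr1", "POS": 200, "REF": "C", "ALT": "T"},  # duplicate of a row in `first`
    {"CHROM": "chr2", "POS": 50, "REF": "G", "ALT": "A"},
]
merged = merge_variants(first, second)
print(len(merged))
```

With pandas DataFrames, the equivalent idea would be `pd.concat` followed by `drop_duplicates` on the same key columns. If the merge keeps duplicates instead, merging a dataset with a filtered subset of itself would repeat the shared rows.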
#### Example: Merging VCF Datasets
    # Merge two VCF datasets
    merged_vcf = wrangler.merge_datasets(vcf_data, filtered_vcf)
    print(merged_vcf.head())

### 5. Saving Data
After processing your data, you can save it to CSV or Excel formats using Bio-Wrangler.
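For reference, writing tabular records to CSV can be sketched with the standard library alone. This is a generic illustration, not how `save_data` is implemented; `save_records_csv` and `load_records_csv` are hypothetical helpers, and the column order is taken from the first record.

```python
import csv
import os
import tempfile
from typing import Dict, List


def save_records_csv(records: List[Dict[str, object]], path: str) -> None:
    """Write records to a CSV file with a header row."""
    if not records:
        raise ValueError("no records to write")
    fieldnames = list(records[0].keys())
    # newline="" lets the csv module control line endings on every platform.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)


def load_records_csv(path: str) -> List[Dict[str, str]]:
    """Read a CSV file back into a list of dictionaries (all values are strings)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


variants = [
    {"CHROM": "chr1", "POS": 100, "REF": "A", "ALT": "G"},
    {"CHROM": "chr1", "POS": 200, "REF": "C", "ALT": "T"},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "filtered_vcf_output.csv")
    save_records_csv(variants, out_path)
    round_trip = load_records_csv(out_path)
print(round_trip[0])
```

CSV stores every value as text, so numeric columns such as `POS` come back as strings and need converting on reload. For Excel output, pandas' `DataFrame.to_excel` is the usual route and requires an engine such as `openpyxl` to be installed.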
#### Example: Saving Filtered VCF Data to CSV
    # Save the filtered VCF data to a CSV file
    wrangler.save_data(filtered_vcf, 'filtered_vcf_output.csv', 'csv')

#### Example: Saving Data to Excel
    # Save data to an Excel file
    wrangler.save_data(filtered_fastq, 'filtered_fastq_output.xlsx', 'xlsx')

## Contributing
Contributions to Bio-Wrangler are welcome! If you have any bug reports, feature requests, or pull requests, please follow the guidelines in the **CONTRIBUTING.md** file.
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
---