https://github.com/vmikk/phredsort
`phredsort` is a CLI tool for sorting sequences in a FASTQ file by their quality scores
- Host: GitHub
- URL: https://github.com/vmikk/phredsort
- Owner: vmikk
- License: mit
- Created: 2024-11-04T18:24:16.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-12-31T12:14:47.000Z (9 months ago)
- Last Synced: 2025-04-12T05:59:40.212Z (6 months ago)
- Topics: bash, bioinformatics, cli, fastq, phred-quality-scores, sequence-quality
- Language: Go
- Homepage:
- Size: 480 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
README
# phredsort
[DOI: 10.5281/zenodo.14395125](https://doi.org/10.5281/zenodo.14395125)
[Code coverage (Codecov)](https://codecov.io/gh/vmikk/phredsort)

`phredsort` is a command-line tool for sorting sequences in FASTQ files by their quality scores.
## Usage
Basic usage:
```bash
# Read from `input.fastq.gz` and write to `output.fastq.gz`
phredsort -i input.fastq.gz -o output.fastq.gz

# Read from stdin and write to stdout
zcat input.fastq.gz | phredsort --in - --out - | less -S
```
## Installation
### Download compiled binary (for Linux)
```bash
wget https://github.com/vmikk/phredsort/releases/download/1.3.0/phredsort
chmod +x phredsort
./phredsort --help
```

### Build from source
```bash
git clone --depth 1 https://github.com/vmikk/phredsort
cd phredsort
go build -ldflags="-s -w" phredsort.go
./phredsort --help
```

## Quality metrics
`phredsort` supports several metrics (`--metric` parameter) to assess sequence quality:
#### 1. (Back-transformed) average Phred score (`avgphred`)
- Properly calculated mean quality score that accounts for the logarithmic nature of Phred scores
- Converts Phred scores to error probabilities, calculates their arithmetic mean, then converts back to Phred scale
- Formula: `-10 * log10(mean(10^(-Q/10)))`
- More accurate than simple arithmetic mean of Phred scores, which would overestimate quality

#### 2. Maximum expected error (`maxee`) (as per Edgar & Flyvbjerg, 2014)
- Sum of error probabilities for all bases in a sequence
- Formula: `sum(10^(-Q/10))`
- Higher values indicate lower quality
- Depends on sequence length (longer sequences tend to have higher MaxEE)

#### 3. Maximum expected error percentage (`meep`)
- MaxEE standardized by sequence length
- Represents expected number of errors per 100 bases
- Formula: `(MaxEE * 100) / sequence_length`
- Higher values indicate lower quality
- Allows fair comparison between sequences of different lengths

#### 4. Low quality base count (`lqcount`)
- Number of bases below specified quality threshold
- Useful for binned quality scores (e.g., data from Illumina NovaSeq platform)
- Counts bases with Phred score < threshold (default: 15)
- Higher values indicate lower quality

#### 5. Low quality base percentage (`lqpercent`)
- Percentage of bases below quality threshold
- Formula: `(lqcount * 100) / sequence_length`
- Higher values indicate lower quality
- Normalizes low-quality base count by sequence length