https://github.com/nbtm-sh/samplesheet-utils
Utility script(s) used for creating samplesheets from various sources
https://github.com/nbtm-sh/samplesheet-utils
bioinformatics python samplesheet tool utils
Last synced: 5 months ago
JSON representation
Utility script(s) used for creating samplesheets from various sources
- Host: GitHub
- URL: https://github.com/nbtm-sh/samplesheet-utils
- Owner: nbtm-sh
- Created: 2024-09-17T03:46:22.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2025-07-24T01:55:43.000Z (11 months ago)
- Last Synced: 2025-09-09T12:58:22.371Z (9 months ago)
- Topics: bioinformatics, python, samplesheet, tool, utils
- Language: Python
- Homepage:
- Size: 106 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# samplesheet-utils

`samplesheet-utils` (or `samplesheetutils`) is a collection of scripts and utilities for working with samplesheets and FASTA files at the command line. It is primarily designed for use within pipelines.
## Installation
### pip
```bash
pip3 install samplesheetutils
```
### git
```bash
git clone https://github.com/nbtm-sh/create-samplesheet
cd create-samplesheet
pip3 install .
```
## Commands
### sample-name
This command is used to read the sample name(s) from a FASTA file. This is useful for dynamically creating directories based on the actual sample name.
```bash
sample-name [ARGS] [FASTA(s)]
```
- `-i --index`: Index of the sample you wish to read the name from. This can be an integer, -1 for the last sample, or a range `(1:5)`
- `--sanitize --sanitise`: Replaces any problematic characters in the sample name(s) with an underscore
- `-d --delim`: Change the delimiter between each sample name. By default this is a new-line character
### create-samplesheet
This command is used to create a samplesheet from different inputs, including string, and directories containing FASTA files
```bash
create-samplesheet [ARGS]
```
- `-a --aa-string`: Input a single amino acid sequence
- `-d --directory`: Input a directory containing FASTA files
- `-o --output-file`: Samplesheet filename. Default is `samplesheet.[ext]` [ext] depends on mode
- `-j --json`: Ouptut JSON formatted samplesheet
- `-y --yaml`: Output YAML formatted samplesheet
- `-m --msa-dir`: Directory to search for corresponding MSA files in (Only accessible in yaml output)
- `--yaml-rfaa`: Output YAML formatted sequence files with a samplesheet.csv file
#### `--msa-dir`
When using the YAML output mode (`-y`, `--yaml`), you can provide a path to a directory containg sample's pre-computed multiple sequence alignment files (`.a3m` files). In order for these files to automatically be associated with it's corresponding sample, the filenames must follow the following format:
```
[SAMPLE NAME].a3m
```
**Example Usage:**
```bash
create-samplesheet --directory /home/nathan/experiment/fastas --msa-dir /home/nathan/experiment/fastas/msas --yaml
```
**Directory Structure**
```
/home/nathan/experiment/fastas
├── A1.fasta
├── A2.fasta
└── msas
├── A1.a3m
└── A2.a3m
```
> **_NOTE:_** Assume that each FASTA file contains a sample with the same name as the file itself. `create-samplesheet` will search for a3m files based on the **sample name** in the FASTA file, not the FASTA filename itself.
### truncate-msa
This command can be used to edit an a3m file to target a specific region of the alignment for special use cases. The module will preserve the first sample of the file, and truncate the remaining entries.
```bash
truncate-msa [ARGS] [INPUT_FILE] [REGION_START] [REGION_END]
```
- `-i --in-place` Modify the input file directly, instead of making a new file.
- `-r --inverse` Invert the output, removing the target region instead.
- `--version` Output version information.
- `--debug` Display debug information.
### TODO
- [ ] Finish documentation