https://github.com/moka-guys/samplesheet_validator
A package to validate the formatting of Illumina SampleSheets and which notifies the user of any issues
samplesheet validation-tool
- Host: GitHub
- URL: https://github.com/moka-guys/samplesheet_validator
- Owner: moka-guys
- License: other
- Created: 2020-10-15T07:59:29.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2026-01-30T10:05:39.000Z (5 months ago)
- Last Synced: 2026-01-31T02:22:48.915Z (5 months ago)
- Topics: samplesheet, validation-tool
- Language: Python
- Homepage:
- Size: 171 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
# Samplesheet Validator
This tool is designed to validate NGS samplesheets prior to downstream processing by performing a series of checks.
It can be used as a standalone process but was designed for integration into automated workflows by instantiating the `SamplesheetCheck` class, which records the validation outcome in a boolean attribute (`self.errors`) and the error messages in a dict (`self.errors_dict`).
## Use case
The tool has been designed for:
1. Illumina sequencing runs, whose samplesheet names are expected to end in "_SampleSheet.csv".
2. AVITI runs.
Expected run types include:
1. Panel based NGS testing
2. TSO500
3. Oncodeep
4. Archer
5. MSK
**Please note** this tool has been specifically designed for the Genome Informatics Service at Synnovis (including the use of the [seglh-naming](https://github.com/moka-guys/seglh-naming/) library) and therefore might require modifications for integration into alternative workflows.
## Protocol
Samplesheet validation is carried out in a series of consecutive steps with any errors identified recorded in the log file as per the [config file](samplesheet_validator/config.py).
Checks:
1. Samplesheet path provided is valid.
2. Samplesheet matches expected naming:
- Illumina: checked against the [seglh-naming](https://github.com/moka-guys/seglh-naming/) library.
- AVITI: samplesheet name matches run folder name.
3. The sequencer_id is in the allowed/validated list of sequencers for that run type.
4. The samplesheet is not empty (>10 bytes).
5. Whether the run is a development run. **N.B.** If the run is a dev run, no further samplesheet validation is performed; further checks are only carried out for clinical runs.
6. Samplesheet contains the minimum expected section headers.
7. Content in columns "Sample_ID" and "Sample_Name" matches for each sample in the samplesheet.
8. Samplesheet doesn't contain any illegal characters.
9. Sample name matches the expected naming convention for all samples, assessed against the [seglh-naming](https://github.com/moka-guys/seglh-naming/) library.
10. The test code (pannumber) for each sample is in the list of expected test codes for the run type.
11. Whether any TSO samples have been included on the run. If so, a boolean attribute is set to true.
12. Whether any OKD samples are included on the run. If so, a boolean attribute is set to true.
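A few of the checks above (4, 6 and 7) can be sketched in plain Python. This is a minimal illustration rather than the package's own implementation: the function names, the `REQUIRED_HEADERS` list and the way the `[Data]` section is located are all assumptions made for the example, and the sample rows are placeholders.

```python
import csv
import io

# Assumed for illustration only; the real expected headers are defined by the package.
REQUIRED_HEADERS = ["[Header]", "[Reads]", "[Settings]", "[Data]"]


def is_empty(content: str, min_bytes: int = 10) -> bool:
    """Check 4: treat a samplesheet of 10 bytes or fewer as empty."""
    return len(content.encode("utf-8")) <= min_bytes


def missing_headers(content: str) -> list:
    """Check 6: return the expected section headers that are absent."""
    present = {line.split(",")[0].strip() for line in content.splitlines()}
    return [header for header in REQUIRED_HEADERS if header not in present]


def mismatched_samples(content: str) -> list:
    """Check 7: return Sample_IDs whose Sample_Name differs."""
    lines = content.splitlines()
    # The sample table starts on the line after the [Data] header.
    data_start = next(i for i, line in enumerate(lines) if line.startswith("[Data]")) + 1
    reader = csv.DictReader(io.StringIO("\n".join(lines[data_start:])))
    return [row["Sample_ID"] for row in reader if row["Sample_ID"] != row["Sample_Name"]]


# Placeholder samplesheet; sample names do not follow any real naming convention.
sheet = "\n".join([
    "[Header]",
    "[Reads]",
    "[Settings]",
    "[Data]",
    "Sample_ID,Sample_Name,index",
    "sampleA,sampleA,AAAA",
    "sampleB,sampleX,CCCC",
])

print(is_empty(sheet))            # False
print(missing_headers(sheet))     # []
print(mismatched_samples(sheet))  # ['sampleB']
```

Returning the offending items (rather than a bare boolean) makes it straightforward to build one error message per problem, which fits the idea of collecting errors in a dict.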
## Installation & Usage
### From Python package
1. Clone a copy of the repository locally
`git clone https://github.com/moka-guys/samplesheet_validator.git`
2. cd into the project root directory
3. Install from python package
`python3 setup.py install`
N.B. Requires setuptools to be installed. Use the `--user` flag or install into a virtualenv/pipenv if not installing globally.
4. Execute functionality from within a python script.
```python
from samplesheet_validator.samplesheet_validator import SamplesheetCheck
sscheck_obj = SamplesheetCheck(
samplesheet_path, # str
sequencer_ids, # list
panels, # list
tso_panels, # list
okd_panels, # list
dev_pannos, # list
logdir, # str
illumina, # bool
runname, # str
)
sscheck_obj.ss_checks() # Carry out samplesheet validation
print(sscheck_obj.errors_dict) # View the dictionary of error messages
```
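When the class is used inside a larger workflow, the `errors` flag and `errors_dict` can be turned into a short report. The sketch below is only an illustration: `summarise` is a hypothetical helper, the exact structure of `errors_dict` is assumed (a key mapped to either one message or a list of messages), and a `SimpleNamespace` stands in for a real `SamplesheetCheck` object so the example runs without the package installed.

```python
from types import SimpleNamespace


def summarise(check) -> str:
    """Build a one-line-per-error report from a finished check object."""
    if not check.errors:
        return "Samplesheet passed validation"
    lines = []
    for key, messages in check.errors_dict.items():
        # Accept either a single message or a list of messages per key (assumed shape).
        if isinstance(messages, str):
            messages = [messages]
        lines.extend(f"{key}: {msg}" for msg in messages)
    return "\n".join(lines)


# Stand-in for a SamplesheetCheck instance after ss_checks() has run.
# The key and message are invented for the example.
fake_check = SimpleNamespace(
    errors=True,
    errors_dict={"Illegal characters": ["Sample_ID contains '$'"]},
)
print(summarise(fake_check))
```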
### Command line
To use the validator from the command line set up an environment as below:
```bash
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
```
The script can then be executed with the following options:
```bash
usage: Used to validate a samplesheet using the seglh-naming conventions
Given an input samplesheet, will validate the samplesheet using seglh-naming conventions and output a logfile
options:
-h, --help show this help message and exit
-S SAMPLESHEET_PATH, --samplesheet_path SAMPLESHEET_PATH
Path to samplesheet requiring validation
-SI SEQUENCER_IDS, --sequencer_ids SEQUENCER_IDS
Comma separated string of allowed sequencer IDS
-P PANELS, --panels PANELS
Comma separated string of allowed panel numbers
-T TSO_PANELS, --tso_panels TSO_PANELS
Comma separated string of tso panels
-O OKD_PANELS, --okd_panels OKD_PANELS
Comma separated string of okd panels
-D DEV_PANNOS, --dev_pannos DEV_PANNOS
Comma separated development pan numbers
-L LOGDIR, --logdir LOGDIR
Directory to save the output logfile to
-NSH NO_STREAM_HANDLER, --no_stream_handler NO_STREAM_HANDLER
Provide flag when we dont want a stream handler (prevents
duplication of log messages to terminal if using another
logging instance)
-R RUN_FOLDER_NAME, --runname RUN_FOLDER_NAME
Str for processed folder name
```
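Several options above take a comma separated string, whereas the `SamplesheetCheck` class takes lists. One way to bridge the two is an `argparse` type converter, sketched below. This is an assumption about how such a conversion could be done, not a description of the package's own argument handling; `comma_list` is a hypothetical helper and the values passed in are placeholders.

```python
import argparse


def comma_list(value: str) -> list:
    """Split a comma separated string into a list, dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


parser = argparse.ArgumentParser(description="Sketch of comma separated options")
parser.add_argument("-SI", "--sequencer_ids", type=comma_list, required=True)
parser.add_argument("-P", "--panels", type=comma_list, required=True)

# Placeholder values, not real sequencer IDs or pan numbers.
args = parser.parse_args(["-SI", "SEQ1,SEQ2", "-P", "Pan1, Pan2"])
print(args.sequencer_ids)  # ['SEQ1', 'SEQ2']
print(args.panels)         # ['Pan1', 'Pan2']
```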
## Testing
This repository currently has **93% test coverage**.
Test datasets are stored in [/test/data](../test/data). The script has a full test suite:
* [test_samplesheet_validator.py](../test/test_samplesheet_validator.py)
See [test/README.md](test/README.md) for details about test cases.
These tests should be run before pushing any code, to ensure all tests in the GitHub Actions workflow pass. They can be run as follows:
```bash
python3 -m pytest
```
**N.B. Tests and test cases/files MUST be maintained and updated accordingly in conjunction with script development. This includes ensuring that the arguments passed to pytest in the [pytest.ini](pytest.ini) file are kept up to date**
## Logging
Logging is performed by [ss_logger](samplesheet_validator/ss_logger.py). The directory to save the log file to is supplied as an argument. The output log file is named by the script as follows:
- `$LOGFILE_DIR/$RUNFOLDER_NAME_$TIMESTAMP_samplesheet_validator.log`
The script also collects the error messages as it runs, which can be used by other scripts when this script is used as an import.
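The log file naming pattern described above can be sketched as follows. The function name `build_logfile_path` is hypothetical, and the timestamp format is an assumption, since the README does not state which format `ss_logger` uses.

```python
import os
from datetime import datetime


def build_logfile_path(logdir: str, runfolder_name: str, now: datetime) -> str:
    """Compose a log path following the documented naming pattern."""
    # The timestamp format is an assumption; the package may use another.
    timestamp = now.strftime("%Y%m%d_%H%M%S")
    filename = f"{runfolder_name}_{timestamp}_samplesheet_validator.log"
    return os.path.join(logdir, filename)


# Placeholder directory and run folder name.
print(build_logfile_path("logs", "RUNFOLDER", datetime(2024, 1, 2, 3, 4, 5)))
```

Passing the timestamp in as an argument, rather than calling `datetime.now()` inside the function, keeps the naming logic easy to test.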
### Developed by the Synnovis Genome Informatics Team