https://github.com/cellgeni/nf-processed2irods
Pipeline to move processed datasets to irods
https://github.com/cellgeni/nf-processed2irods
Last synced: 4 months ago
JSON representation
Pipeline to move processed datasets to irods
- Host: GitHub
- URL: https://github.com/cellgeni/nf-processed2irods
- Owner: cellgeni
- License: mit
- Created: 2025-07-31T12:40:18.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-10-01T09:57:40.000Z (8 months ago)
- Last Synced: 2025-10-01T11:36:13.471Z (8 months ago)
- Language: Nextflow
- Size: 62.5 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# nf-processed2i## Pipeline Workflow
1. **Dataset/Sample Discovery** - Reads dataset or sample information from CSV input file
2. **Public Dataset Detection** - Identifies public datasets (GSE*, E-MTAB-*, PRJEB* patterns)
3. **Metadata Parsing** - Extracts metadata from public repositories for public datasets
4. **Quality Control** - Generates mapping QC statistics from STARsolo output (if not already present)
5. **File Collection** - Gathers all data files and metadata files for upload
6. **iRODS Upload** - Transfers files to iRODS with checksums
7. **Metadata Attachment** - Attaches comprehensive metadata to iRODS collections
## Dry Run Mode
The pipeline supports a dry run mode (`--dry_run true`) that allows you to test and validate your pipeline configuration without actually uploading data to iRODS. This is particularly useful for:
### What Dry Run Does:
- **Validates input parameters** - Checks that all required parameters are provided and correctly formatted
- **Processes metadata** - Runs all metadata collection and parsing steps
- **Generates QC reports** - Creates mapping statistics and quality control files
- **Creates output files** - Generates `sample_metadata.csv` and `dataset_metadata.csv` files locally
- **Shows intended operations** - Displays what files would be uploaded and where
### What Dry Run Skips:
- **iRODS file uploads** - No files are actually transferred to iRODS storage
- **iRODS metadata attachment** - No metadata is attached to iRODS collections
- **MD5 checksum validation** - No checksums are computed or compared with iRODS
## Overview
This pipeline processes datasets containing single-cell RNA sequencing results (typically from STARsolo/CellRanger), extracts metadata, performs quality control analysis, and uploads the data with comprehensive metadata to iRODS for long-term storage and data management.
## Contents of Repo:
* `main.nf` — the main Nextflow pipeline that orchestrates data processing and iRODS upload
* `nextflow.config` — configuration script for IBM LSF submission on Sanger's HPC with Singularity containers and global parameters
* `modules/local/reprocess10x/parsemetadata/` — module for parsing metadata from public repositories (GEO, ENA)
* `modules/local/reprocess10x/mappingqc/` — module for extracting mapping quality control statistics
* `modules/local/irods/storefile/` — module for uploading files to iRODS
* `modules/local/irods/attachcollectionmeta/` — module for attaching metadata to iRODS collections
## Pipeline Workflow
1. **Dataset Discovery**: Reads dataset information from CSV input file
2. **Public Dataset Detection**: Identifies public datasets (GSE*, E-MTAB-*, PRJEB* patterns)
3. **Metadata Parsing**: Extracts metadata from public repositories for public datasets
4. **Quality Control**: Generates mapping QC statistics from STARsolo output (if not already present)
5. **File Collection**: Gathers all data files and metadata files for upload
6. **iRODS Upload**: Transfers files to iRODS with checksums
7. **Metadata Attachment**: Attaches comprehensive metadata to iRODS collections
## Examples
### Basic Usage with Datasets
Upload processed datasets to iRODS:
```bash
nextflow run main.nf \
--datasets datasets.csv \
--irodspath "/archive/cellgeni/sanger/"
```
### Basic Usage with Individual Samples
Upload individual samples to iRODS:
```bash
nextflow run main.nf \
--samples samples.csv \
--irodspath "/archive/cellgeni/sanger/"
```
### Custom File Filtering
Specify which file types to ignore during upload:
```bash
nextflow run main.nf \
--datasets datasets.csv \
--irodspath "/archive/cellgeni/sanger/" \
--ignore_pattern ".bam,.fastq.gz"
```
### Disable Public Metadata Collection
Skip automatic metadata retrieval for public datasets:
```bash
nextflow run main.nf \
--datasets datasets.csv \
--irodspath "/archive/cellgeni/sanger/" \
--collect_public_metadata false
```
### Enable Verbose Output
Get detailed logging information:
```bash
nextflow run main.nf \
--datasets datasets.csv \
--irodspath "/archive/cellgeni/sanger/" \
--verbose true
```
### Custom Output Directory
Specify a different output directory:
```bash
nextflow run main.nf \
--datasets datasets.csv \
--irodspath "/archive/cellgeni/sanger/" \
--output_dir "my_results"
```
### Dry Run (Test Mode)
Test the pipeline without actually uploading files to iRODS:
```bash
nextflow run main.nf \
--datasets datasets.csv \
--irodspath "/archive/cellgeni/sanger/" \
--dry_run true
```
This mode will:
- Validate all input parameters and files
- Process metadata and generate QC reports
- Show what would be uploaded without actual iRODS operations
- Create local output files (metadata CSV files) for review
## Pipeline Parameters
### Required Parameters:
* `--datasets` — Path to a CSV file containing dataset information with columns: `id` (dataset identifier) and `path` (local filesystem path to processed data directory)
**OR**
* `--samples` — Path to a CSV file containing sample information with columns: `id` (sample identifier), `path` (local filesystem path), and optionally `dataset_id`
* `--irodspath` — Base path in iRODS where datasets will be stored (e.g., "/archive/cellgeni/sanger/")
### Optional Parameters:
* `--output_dir` — Output directory for pipeline results (`default: "results"`)
* `--publish_mode` — File publishing mode (`default: "copy"`)
* `--ignore_pattern` — Comma-separated list of file patterns to ignore during upload (`default: ".bam,.bai,.cram,.crai,.fastq.gz,.fq.gz,.fastq,.fq,.mate1.bz2,.mate2.bz2,.sh,.bsub,.pl"`)
* `--collect_public_metadata` — Collect metadata from public repositories for public datasets (`default: true`)
* `--parse_mapper_metrics` — Parse mapping QC metrics from STARsolo output (`default: true`)
* `--verbose` — Enable verbose output (`default: false`)
* `--dry_run` — Perform a dry run without uploading files to iRODS (`default: false`)
## Input File Format
The pipeline supports two input modes:
### Option 1: Dataset-based input (`--datasets`)
CSV file with the following structure:
```csv
id,path
GSE123456,/path/to/processed/GSE123456
PRJEB12345,/path/to/processed/PRJEB12345
EGA_DATASET,/path/to/processed/EGA_DATASET
```
Where:
- `id`: Unique dataset identifier (can be GEO accession, ENA project, or custom ID)
- `path`: Absolute path to the directory containing processed single-cell data with sample subdirectories
### Option 2: Sample-based input (`--samples`)
CSV file with the following structure:
```csv
id,path,dataset_id
SAMPLE1,/path/to/processed/SAMPLE1,GSE123456
SAMPLE2,/path/to/processed/SAMPLE2,GSE123456
SAMPLE3,/path/to/processed/SAMPLE3,
```
Where:
- `id`: Unique sample identifier
- `path`: Absolute path to the sample directory containing processed single-cell data
- `dataset_id`: (Optional) Dataset identifier to group samples under. If omitted, samples will be uploaded directly to irodspath/id
## iRODS Path Structure
The pipeline uploads data to different iRODS paths depending on the input type:
- **Datasets (`--datasets`)**: Uploaded to `irodspath/id`
- Example: `/archive/cellgeni/sanger/GSE123456/`
- **Samples with dataset_id (`--samples`)**: Uploaded to `irodspath/dataset_id/id`
- Example: `/archive/cellgeni/sanger/GSE123456/SAMPLE1/`
- **Samples without dataset_id (`--samples`)**: Uploaded to `irodspath/id`
- Example: `/archive/cellgeni/sanger/SAMPLE1/`
## Expected Data Structure
### For Dataset-based Input
Each dataset directory should contain:
- **Sample directories**: Named with sample identifiers containing STARsolo output files
- **QC files**: `*solo_qc.tsv` files (generated automatically if missing)
- **Metadata files**: For public datasets, metadata will be automatically retrieved
Example directory structure:
```
GSE123456/
├── SAMPLE1/
│ ├── Aligned.sortedByCoord.out.bam
│ ├── Log.final.out
│ ├── Solo.out/
│ └── ...
├── SAMPLE2/
│ └── ...
└── GSE123456_solo_qc.tsv
```
### For Sample-based Input
Each sample directory should contain STARsolo output files:
```
SAMPLE1/
├── Aligned.sortedByCoord.out.bam
├── Log.final.out
├── Solo.out/
│ ├── Gene/
│ ├── GeneFull/
│ └── ...
└── ...
```
## Metadata Handling
### Public Datasets
For datasets matching patterns `GSE*`, `E-MTAB-*`, or `PRJEB*`, the pipeline automatically:
- Retrieves sample metadata from public repositories
- Generates accession mapping files
- Creates sample relationship files
- Downloads study metadata
### All Datasets
The pipeline extracts and stores:
- **Sample-level metadata**: Species, sequencing type (paired/single), strand information, read counts, whitelist version
- **Dataset-level metadata**: Study accession numbers, dataset identifiers
- **Technical metadata**: File checksums, upload timestamps
## iRODS Structure
Data is organized in iRODS based on the input type:
### For Dataset-based Input (`--datasets`)
```
/archive/cellgeni/sanger/
├── GSE123456/ # Dataset uploaded to irodspath/id
│ ├── SAMPLE1/ # Individual samples within dataset
│ │ ├── [STARsolo output files]
│ │ └── [metadata attached to collection]
│ ├── SAMPLE2/
│ └── [dataset metadata files]
└── PRJEB12345/ # Another dataset
└── ...
```
### For Sample-based Input (`--samples`)
**With dataset_id specified:**
```
/archive/cellgeni/sanger/
├── GSE123456/ # Dataset ID from CSV
│ ├── SAMPLE1/ # Sample uploaded to irodspath/dataset_id/id
│ │ ├── [STARsolo output files]
│ │ └── [metadata attached to collection]
│ └── SAMPLE2/
└── ...
```
**Without dataset_id specified:**
```
/archive/cellgeni/sanger/
├── SAMPLE1/ # Sample uploaded directly to irodspath/id
│ ├── [STARsolo output files]
│ └── [metadata attached to collection]
├── SAMPLE2/
└── ...
```
## System Requirements
- **Nextflow**: Version 25.04.4 or higher
- **Singularity**: For containerized execution
- **LSF**: For job scheduling on HPC
- **iRODS**: Client tools for data upload
- **Storage**: Sufficient space in `/lustre` and `/nfs` mount points
## Configuration
The pipeline is configured for Sanger's HPC environment with:
- LSF executor with per-job memory limits
- Singularity containers from `/nfs/cellgeni/singularity/images/`
- Bind mounts for `/lustre` and `/nfs` filesystems
- Retry strategy with up to 5 attempts per process
## Monitoring
The pipeline generates comprehensive reports:
- Timeline report: `reports/YYYYMMDD-HH-mm-ss_timeline.html`
- Execution report: `reports/YYYYMMDD-HH-mm-ss_report.html`
- Trace file: `reports/YYYYMMDD-HH-mm-ss_trace.tsv`
## Pipeline Outputs
- **iRODS Collections**: Organized dataset collections with attached metadata
- **QC Reports**: Mapping statistics and quality metrics
- **Metadata Files**: Comprehensive sample and dataset annotations
- **Upload Logs**: Records of transferred files with checksums
## Troubleshooting
### Common Issues:
1. **Missing QC files**: The pipeline will generate them automatically
2. **iRODS connection**: Ensure iRODS client is properly configured
3. **File permissions**: Check read access to input directories
4. **Storage space**: Ensure sufficient space for temporary files
### Log Files:
Check Nextflow work directory (`nf-work/`) for detailed process logs and error messages.
## Version
Current version: 0.0.1
## License
See repository for license information.