https://github.com/cellgeni/nf-metacells
https://github.com/cellgeni/nf-metacells
Last synced: 4 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/cellgeni/nf-metacells
- Owner: cellgeni
- Created: 2025-02-19T15:49:16.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-10-15T11:24:24.000Z (8 months ago)
- Last Synced: 2025-10-16T04:09:35.305Z (8 months ago)
- Language: Python
- Size: 69.3 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# nf-metacells
Pipeline to aggregate metacells/pseudobulks using Nextflow. This pipeline performs metacell-aggregation using SEACells or Hierarchial clustering on single-cell RNA/ATAC sequencing data.
## Usage
```bash
nextflow run main.nf [OPTIONS]
```
### Required Options
- `--filelist` - File containing the list of files to be aggregated
- `--type` - Type of data to be aggregated (gex or atac)
- `--celltype_label` - Cell type label for the aggregation (required for hierarchial clustering; optional for SEACells)
### Optional Options
- `--help` - Show help message
- `--output_dir` - Output directory for the aggregated files
- `--raw` - Convert raw files to H5AD format (if raw files' paths are provided)
- `--delimiter` - Specify delimiter if you want to add sample id to the barcode names
- `--cell_metadata` - Specify metadata.csv file to attach to the .obs section of AnnData object
- `--barcode_column` - Barcode column name in the cell metadata file (specify "obs_names" if you want to use the .obs index column)
### SEACells Options
- `--seacells.enabled` - Enable SEACells aggregation
- `--seacells.n_cells` - Number of metacells to be aggregated
- `--seacells.gamma` - Gamma value for the adaptive bandwidth kernel
- `--seacells.n_top_genes` - Number of top variable genes to be used for aggregation (GEX only). Default: 2000
- `--seacells.n_components` - Number of components to be used for aggregation (PCA for GEX and LSI for ATAC). Default: 50
- `--seacells.convergence_epsilon` - Convergence epsilon for the optimization. Default: 0.00001
- `--seacells.min_iterations` - Minimum number of iterations for the optimization. Default: 10
- `--seacells.max_iterations` - Maximum number of iterations for the optimization. Default: 100
- `--seacells.use_sparse` - Use sparse matrix for the optimization. Default: false
- `--seacells.precomputed` - Use precomputed distance matrix for the aggregation
### Hierarchial Options
- `--hierarchial.enabled` - Enable Hierarchial clustering aggregation
- `--hierarchial.n_min` - Minimum number of clusters for the aggregation
- `--hierarchial.n_max` - Maximum number of clusters for the aggregation
- `--hierarchial.method` - Method for the aggregation (kmeans or louvain)
- `--hierarchial.n_top_genes` - Number of top variable genes to be used for aggregation (GEX only). Default: 2000
- `--hierarchial.n_components` - Number of components to be used for aggregation (PCA for GEX and LSI for ATAC). Default: 50
- `--hierarchial.n_neighbors` - Number of neighbors for the aggregation. Default: 15
- `--hierarchial.precomputed` - Use precomputed distance matrix for the aggregation
## Examples
### 1. Perform metacell aggregation using SEACells
```bash
nextflow run main.nf \
--filelist filelist.csv \
--type gex \
--seacells.enabled \
--seacells.gamma 75
```
### 2. Full SEACells example with all parameters
```bash
nextflow run main.nf \
--filelist samples.csv \
--type gex \
--seacells.enabled \
--seacells.gamma 75 \
--seacells.n_top_genes 2000 \
--seacells.n_components 50 \
--celltype_label celltype \
--seacells.convergence_epsilon 0.00001 \
--seacells.min_iterations 10 \
--seacells.max_iterations 50 \
-resume
```
## Filelist Format
The filelist should be a CSV file with the following format:
```csv
item,filepath
pbmc_10k,/lustre/scratch126/cellgen/team361/data/pbmc/results/pbmc_10k/pbmc10k/pbmc10k.h5ad
pbmc_3k,/lustre/scratch126/cellgen/team361/data/pbmc/results/pbmc3k/pbmc3k/pbmc3k.h5ad
```