{"id":41443651,"url":"https://github.com/cellgeni/cellatac","last_synced_at":"2026-01-23T14:59:21.183Z","repository":{"id":52280481,"uuid":"214141844","full_name":"cellgeni/cellatac","owner":"cellgeni","description":"Sanger Cellular Genetics single-cell ATAC-seq pipeline.","archived":false,"fork":false,"pushed_at":"2023-08-25T09:57:04.000Z","size":212,"stargazers_count":13,"open_issues_count":0,"forks_count":6,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-09-09T23:52:05.561Z","etag":null,"topics":["10x-genomics","atac-seq","atac-seq-pipeline","nextflow-pipeline","single-cell-atac-seq"],"latest_commit_sha":null,"homepage":"","language":"Nextflow","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cellgeni.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2019-10-10T09:29:24.000Z","updated_at":"2025-08-19T13:41:34.000Z","dependencies_parsed_at":"2025-09-09T22:25:41.641Z","dependency_job_id":null,"html_url":"https://github.com/cellgeni/cellatac","commit_stats":null,"previous_names":[],"tags_count":14,"template":false,"template_full_name":null,"purl":"pkg:github/cellgeni/cellatac","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cellgeni%2Fcellatac","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cellgeni%2Fcellatac/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cellgeni%2Fcellatac/releases","manifests_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/cellgeni%2Fcellatac/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cellgeni","download_url":"https://codeload.github.com/cellgeni/cellatac/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cellgeni%2Fcellatac/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28694459,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-23T14:15:13.573Z","status":"ssl_error","status_checked_at":"2026-01-23T14:09:05.534Z","response_time":59,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["10x-genomics","atac-seq","atac-seq-pipeline","nextflow-pipeline","single-cell-atac-seq"],"created_at":"2026-01-23T14:59:21.081Z","updated_at":"2026-01-23T14:59:21.154Z","avatar_url":"https://github.com/cellgeni.png","language":"Nextflow","funding_links":[],"categories":[],"sub_categories":[],"readme":"# cellatac\n\n1. [Introduction](#introduction)\n2. [Basic workflow](#basic-workflow)\n3. [Cellatac functionality](#cellatac-functionality)\n4. [Running cellatac](#running-cellatac)\n    - [Cellatac needs](#cellatac-needs)\n    - [Useful options](#useful-options)\n    - [Example invocations](#example-invocations)\n5. [Downstream analysis](#downstream-analysis)\n6. 
[Outputs](#outputs)\n\n## Introduction\n\nSanger Cellular Genetics ATAC-seq pipeline by Luz Garcia Alonso,\nSimon Murray, Ni Huang and Stijn van Dongen.\n\n**cellatac** takes scATAC-seq aligned data (such as the fragments file from\nCell Ranger ATAC) and outputs a _count matrix of accessible chromatin peaks by\ncell_ (i.e. analogous to the `filtered_peak_bc_matrix` from Cell Ranger ATAC).\nThe output matrix can then be used for downstream analysis in Seurat, Scanpy,\ncisTopic or any other tool.\n\nCell Ranger ATAC identifies the peaks by aggregating the signal of all the\nbarcodes in the sample. There are some papers reporting that this may be\nunsuitable to detect peaks appearing in rare cell types/states. **cellatac**\nuses the [Cusanovich approach](https://www.sciencedirect.com/science/article/pii/S0092867418308559)\nto increase the peak detection sensitivity by, first, identifying cell clusters\non a windows x cell rather than peaks per cell matrix, and then doing a peak\ncalling for each cluster. \n\n## Basic workflow\n\n1. **Compute window coverage**. The genome is broken into 5kb windows and then\n  each cell is scored for insertions in each window, generating a binary matrix\n  (large and sparse) of windows by cells. Note that if multiple samples are\n  provided, these are aggregated into a unique matrix.\n2. **Cluster cells based on window coverage**. Matrix is filtered to retain\n  only top 200k most commonly used windows. Using `Signac`, the binary matrix is\n  normalized with Term Frequency-Inverse Document Frequency (TF-IDF) approach\n  followed by a dimensionality reduction step using Singular Value Decomposition\n  (SVD). The first LSI component is ignored as it often captures sequencing depth\n  (technical variation) rather than biological variation. The 2-30 top remaining\n  components are used to perform graph-based Louvain clustering (at a X\n  resolution) and clusters are reported.\n3. **Accessible chromatin peak calling per cluster**. 
Peaks are called\n  separately on each cluster using `macs2`.\n4. **Merge per-cluster peaks and generate peak by cell matrix.** Peaks from all\n  clusters are merged into a master peak set (i.e. overlapping peaks are\n  aggregated), and the corresponding peak by cell matrix (indicating any reads\n  occurring in each peak for each cell) is reported. Note that if multiple samples\n  are provided, these are aggregated into a unique matrix. **This is the relevant\n  matrix that you should use for clustering.**\n\n## Cellatac functionality\n\n* The clustering approach from the Cusanovich 2018 manuscript.\n* Joint analysis of multiple 10x samples.\n* A clustering step utilising Seurat.\n* User-specified clustering.\n* Peak/cell matrix based on merging per-cluster peaks.\n* Peak/cell matrix per-cluster.\n\n\n## Running cellatac\n\n### Cellatac needs\n\n* Singularity\n\n\n### Useful options\n\n```\n--mermul true           merge multiplets using CR bam file\n--mermul false          [default] use CR fragments.tsv.gz\n\n--usecls __seurat__        [default] use Seurat/Signac approach resembling Cusanovich. It uses Louvain clustering instead.\n--usecls __cusanovich__    use cusanovich-strict approach. It uses bi-clustering of cells and windows based on cosine distances using the ward algorithm.\n--usecls \u003cfilename\u003e        use custom clustering\n\n--mergepeaks true       [default] merge cluster peaks, compute master cell/peak matrix\n--perclusterpeaks false [default] compute per-cluster cell/peak matrix  \n                            Note both can be set to true.\n\n--cellbatchsize 500     [default] parallelisation bucket size (number of cells per bucket)\n--nclades 10            [default] number of clusters to use (only applies to cusanovich-strict approach)\n--sampleid \u003ctag\u003e        use \u003ctag\u003e in naming outputs. Not yet consistently applied\n```\n\n\n### Example invocations\n\nThis pipeline will need a singularity installation.  
It supports two executing\nplatforms, *local* (simply execute on the machine you're currently on) and\n*lsf*. To use the latter specify `-profile lsf`.\n\n\n```\nsource=cellgeni/cellatac\n\nmanifest=/some/path/to/singlecell.csv\nposbam=/some/path/to/possorted_bam.bam\nfragments=/some/path/to/fragments.tsv.gz\n\ncellbatchsize=400\nnclades=10\n\nnextflow run $source        \\\n  --cellcsv $manifest       \\\n  --fragments $fragments    \\\n  --cellbatchsize $cellbatchsize   \\\n  --posbam $posbam          \\\n  --outdir results          \\\n  --sampleid CR12345678     \\\n  -profile local            \\\n  --mermul true             \\\n  --usecls __seurat__       \\\n  --mergepeaks true         \\\n  -with-report reports/report.html \\\n  -resume -w work -ansi-log false \\\n  -config my.config\n```\n\nwhere `my.config` supplies singularity mount options and tells nextflow how\nmany CPUs it can utilise when using the local executor, e.g.\n\n```\nsingularity {\n  runOptions = '-B /some/path1 -B /another/path2'\n  cacheDir = '/home/jovyan/singularity/'\n}\n\nexecutor {\n    cpus   = 56\n    memory = 600.GB\n}\n```\n\nTo run multiple samples:\n\n```\nnextflow run $source        \\\n  --muxfile mux.txt         \\\n  --cellbatchsize $cellbatchsize   \\\n  --outdir results          \\\n  -profile local            \\\n  --usecls __seurat__       \\\n  --mermul false            \\\n  --mergepeaks true         \\\n  -with-report reports/report.html \\\n  -resume -w work -ansi-log false \\\n  -c my.config\n```\n\nwhere `mux.txt` is a tab separated file that looks like this:\n\n```\n1   sampleX   /path/to/cellranger/output/for/sampleX\n2   sampleY   /path/to/cellranger/output/for/sampleY\n3   sampleZ   /path/to/cellranger/output/for/sampleZ\n4   sampleU   /path/to/cellranger/output/for/sampleU\n```\n\nThe first column will be used to make the barcodes in each sample unique across\nthe merged samples. 
As such it can be anything, but it is suggested to simply\nuse a range of integers starting at 1, or to use the last one or two\nsignificant digits of the sample ID provided they are unique to each sample.\n\nThe cellranger output directories need not contain the full output. Currently\nthe pipeline expects these files:\n\n```\nfragments.tsv.gz  possorted_bam.bam singlecell.csv\n```\n\nWhen running multiple samples, the bam file is only used for its header. It is\npossible to substitute the original bam file with the output of `samtools view\n-H possorted_bam.bam`. This can be useful if it is necessary to copy the data\nprior to running this pipeline; it is not necessary in this case to copy the\nfull position sorted bam file (they tend to be very large).  Currently it is\nnecessary that the substituted file has the same name `possorted_bam.bam`.\n\n## Downstream analysis\n\nThe snippet below shows how to read in cellatac output as a Seurat object.\n\n```\n### Load scATAC binary matrix\n# This is analogous to the gene expression count matrix used to analyze single-cell RNA-seq. \n# However, instead of genes, each row of the matrix represents a PEAK of the genome learned by cellatac. \n# The matrix is not binary, \u003e 0 if there is any Tn5 cut site for each single barcode (i.e. 
cell) that map within each peak.\nf_binary_mat \u003c- readMM(file = paste0(cellatac_dir, 'peak_matrix/peaks_bc_matrix.mmtx.gz'))\nregions.names = read.delim(paste0(cellatac_dir, 'peak_matrix/peaks.txt'), header = FALSE, stringsAsFactors = FALSE)\ncells.names = read.delim(paste0(cellatac_dir, 'peak_matrix/bc.txt'), header = FALSE, stringsAsFactors = FALSE)\ncolnames(f_binary_mat) = cells.names$V1\nrownames(f_binary_mat) = regions.names$V1\n\n# Make binary\nf_binary_mat@x[f_binary_mat@x \u003e 0] \u003c- 1\n\n### Get some stats\n# check distributions\nmessage('Matrix size:\\n', 'rows ', f_binary_mat@Dim[1], '\\ncolumns ', f_binary_mat@Dim[2])\nn_cells_with_site = rowSums(f_binary_mat)\noptions(repr.plot.width = 8, repr.plot.height = 4)\npar(mfrow = c(1, 2))\nhist(log10(n_cells_with_site), main = 'No. of Cells Each Site is Observed In', breaks = 50)\nhist(n_cells_with_site, main = 'No. of Cells Each Site is Observed In', breaks = 50)\n\nsites_per_cell = colSums(f_binary_mat)\noptions(repr.plot.width = 8, repr.plot.height = 4)\npar(mfrow = c(1, 2))\nhist(log10(sites_per_cell), main = 'No. of Sites Observed per Cell', breaks = 50)\nhist(sites_per_cell, main = 'No. 
of Sites Observed per Cell', breaks = 50)\n\n# compare coverage vs peak length \npos = sapply(strsplit(rownames(f_binary_mat), split= ':'), tail , 1)\npos_len = sapply(strsplit(pos, split= '-'), function(x) as.numeric(x[2])-as.numeric(x[1]) )\npar(mfrow = c(1, 1))\nplot(pos_len, n_cells_with_site)\n# horizontal line at the 75% cell-frequency threshold (y axis is number of cells)\nabline(h = f_binary_mat@Dim[2]*0.75)\n\n\n# filter non-informative peaks: keep peaks with length \u003c 2k bp and \u003c75% frequency\nf_binary_mat = f_binary_mat[ pos_len \u003c 2000 \u0026 n_cells_with_site \u003c f_binary_mat@Dim[2]*0.75, ]\n\n\n# create Seurat object\nchrom_assay \u003c- CreateChromatinAssay(\n  counts = f_binary_mat,\n  sep = c(\":\", \"-\"),\n  genome = 'hg38',\n  fragments = NULL,\n  min.cells = 5,\n  min.features = 100\n)\n\nso \u003c- CreateSeuratObject(\n  counts = chrom_assay,\n  assay = \"peaks\",\n  meta.data = metadata\n)\n```\n\n## Outputs\n\nThe most important outputs are described below. Cellatac creates a top-level\noutput directory by default called 'results' (change with the `--outdir` option).\n\n```\nresults/qc/seurat-clades.tsv        (cluster annotation)\nresults/qc/seurat.pdf               (cluster-annotated and sample-annotated UMAP plots)\nresults/peak_matrix/bc.txt                    (barcode(cell) labels)\nresults/peak_matrix/peaks.txt                 (peak labels)\nresults/peak_matrix/peaks_bc_matrix.mmtx.gz   (main output object)\nresults/peak_matrix/bc_peaks_matrix.mmtx.gz   (transpose of above)\nresults/cellmetadata/singlecell.tsv           (joined metadata)\nresults/cellmetadata/tagmap.txt               (links two-digit sampletag and samplename)\n```\n\nThe list of all directories with a short description:\n\n```\ncellmetadata      (see above)\npeak_matrix       (see above)\nqc                (see above)\nclus_peak_matrix  (per-cluster results, multiple bundles as in peak_matrix)\nmacs2             (per-cluster macs2 results)\npeaks             (bed files with peak information)\nwin_matrix        (outputs relating to the windows 
selected for clustering, can be ignored)\n```\n\nThe bed files in `peaks` are `allclusters_peaks_sorted.bed` and\n`allclusters_masterlist_sps.bed`.  The first is the concatenated and\nposition-sorted list of all per-cluster peaks.  The second is the result of\nmerging these peaks so that overlapping and book-ended peaks are joined in a\nsingle peak. The infix `sps` indicates it is sample-position-sorted, indicating\nwe use the chromosome order as found in the cellranger bam file.\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcellgeni%2Fcellatac","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcellgeni%2Fcellatac","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcellgeni%2Fcellatac/lists"}