{"id":19749570,"url":"https://github.com/mskcc/temposig","last_synced_at":"2026-02-23T07:03:12.876Z","repository":{"id":55895746,"uuid":"258989998","full_name":"mskcc/tempoSig","owner":"mskcc","description":"Fitting mutational catalog to signatures with maximum likelihood","archived":false,"fork":false,"pushed_at":"2025-03-07T20:26:02.000Z","size":4167,"stargazers_count":7,"open_issues_count":0,"forks_count":6,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-06-20T21:01:59.415Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mskcc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-04-26T09:26:59.000Z","updated_at":"2024-07-04T01:42:10.000Z","dependencies_parsed_at":"2023-02-14T08:17:19.895Z","dependency_job_id":null,"html_url":"https://github.com/mskcc/tempoSig","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mskcc/tempoSig","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mskcc%2FtempoSig","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mskcc%2FtempoSig/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mskcc%2FtempoSig/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mskcc%2FtempoSig/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mskcc","download_url":"https://codeload.github.com/mskcc/tempoSig/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mskcc%2FtempoSig/sbo
m","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262863637,"owners_count":23376449,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T02:27:17.614Z","updated_at":"2025-10-08T18:14:47.368Z","avatar_url":"https://github.com/mskcc.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# tempoSig\nMutational Signature Extraction using Maximum Likelihood and NMF\n\n## Overview\n**tempoSig** implements maximum likelihood-based extraction of mutational signature proportions from a set of mutation count data, given a known set of input signatures (refitting). It also includes de novo extraction based on Bayesian non-negative matrix factorization, which enables the determination of the most likely number of signatures. \n\nThe basic algorithm for refitting is the same as in [mutation-signatures](https://github.com/mskcc/mutation-signatures), but the re-implementation in R/C++ here enables a substantial speed-up on the order of ~100x. This speed-up allows for the fast estimation of p-values via permutation-based sampling. The basic object (S4 class) can store input data, the reference signature list, output exposures of samples, and p-values. Utilities for plotting and file output are also included. 
\n\n## Algorithm\nInput data are of the form of catalog matrix:\n\nMutation context | Tumor Sample Barcode 1 | Tumor Sample Barcode 2\n---------------- | ---------------------- | ----------------------\nA[C\u003eA]A          |                      0 |                      1\nA[C\u003eA]C          |                      3 |                      0\nA[C\u003eA]G          |                      2 |                      5\nA[C\u003eA]T          |                      0 |                      2\nC[C\u003eA]A          |                      0 |                      0\nC[C\u003eA]C          |                      1 |                      1\nC[C\u003eA]G          |                      0 |                      0\nC[C\u003eA]T          |                      1 |                      1\n\nMutation context is the set of categories to which mutation data from sequencing experiments have been classified. Trinucleotide contexts shown above are the pyrimidine bases before and after mutation flanked by upstream and downstream nucleotides (96 in total). Each column corresponding to **Tumor_Sample_Barcode** contains non-negative counts of single nucleotide variants (SNVs). This trinucleotide matrix can be generated from [MAF](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) files using the R package [maftools](https://bioconductor.org/packages/maftools/). 
MAF files can be generated from VCF files using [vcf2maf](https://github.com/mskcc/vcf2maf).\n\nThe other input is the set of reference signature proportions:\n\nMutation context | Signature.1 | Signature.2 | Signature.3 | Signature.4\n---------------- | ---- | ---- | ---- | ----\nA[C\u003eA]A          | 9e-4 | 6e-7 | 0.02 | 0.04\nA[C\u003eA]C          | 2e-3 | 1e-4 | 0.02 | 0.03\n\nBoth [version 2](https://github.com/mskcc/tempoSig/blob/master/inst/extdata/cosmic_snv_signatures_v2.txt) and [version 3](https://github.com/mskcc/tempoSig/blob/master/inst/extdata/cosmic_sigProfiler_SBS_signatures.txt) tables of [COSMIC signature lists](https://cancer.sanger.ac.uk/cosmic/signatures) are included.\n\nThe \"refitting\" (as opposed to de novo discovery) of signature proportions solves the non-negative matrix factorization problem\n\n    X = W * H\n    \nwhere **X** is the catalog matrix, **W** is the signature matrix (assumed to be known and fixed), and **H** is the exposure matrix of dimension (no. of reference signatures x no. of samples). The maximum likelihood estimation (MLE) algorithm formulates this problem in terms of a multinomial statistical model with observed counts **X** of categorical groups (trinucleotide contexts) mixed by given fixed proportions **W**. **tempoSig** uses the quasi-Newton [Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm](https://www.gnu.org/software/gsl/doc/html/multimin.html) for multi-dimensional optimization.\nThe output matrix is the transpose of **H**:\n\nTumor Sample Barcode | Signature.1 | Signature.2 | Signature.3 | Signature.4\n-------------------- | ---- | ---- | ---- | ----\nSAMPLE_1             | 0.37 | 0.01 | 0.00 | 0.00\nSAMPLE_2             | 0.31 | 0.18 | 0.00 | 0.00\n\nEach row is a vector of proportions that add up to 1.\n\nTo estimate the p-value of each proportion, the profile of each reference signature (a column of **W**) is randomly shuffled by permutation to sample the null distribution. 
The exposure vectors inferred from these null samples are compared with the observed vector from the original data, with the p-value defined as the fraction of null samples whose proportions are higher than the observed values.\n\n## Installation\n\n### Install without docker\nCompilation requires GNU Scientific Library [(GSL)](https://www.gnu.org/software/gsl/). In Ubuntu Linux,\n\n    $ sudo apt-get install libgsl-dev\n    \nIn OS-X,\n\n    $ brew install gsl\n\nDependencies include [Rcpp](https://cran.r-project.org/package=Rcpp), [gtools](https://cran.r-project.org/package=gtools), [argparse](https://cran.r-project.org/package=argparse), and [coneproj](https://cran.r-project.org/package=coneproj). Installing **tempoSig** via\n\n    \u003e devtools::install_github(\"mskcc/tempoSig\")\n\nwill also install dependencies.\n\n### Install with docker\nClone this repository,\n\n    $ git clone https://github.com/mskcc/tempoSig.git\n\nGo to the repository folder,\n\n    $ cd tempoSig\n\nBuild the image,\n\n    $ docker build -t temposig .\n\nThe commands below use the following placeholders:\n\n`YOUR_CATALOG_FILE`: the absolute path of your catalog file \\\n`YOUR_OUTPUT_FOLDER`: the absolute path of the folder where output will be saved\n\nCreate the tempoSig program container:\n```bash\ndocker run -it -d \\\n--name tempoSig_container \\\n-v \u003cYOUR_CATALOG_FILE\u003e:/tempoSig/input/catalog.txt \\\n-v \u003cYOUR_OUTPUT_FOLDER\u003e:/tempoSig/output \\\ntemposig\n```\n\nUsage:\n```bash\ndocker exec tempoSig_container \\\n./exec/tempoSig.R input/catalog.txt output/exposure.txt\n```\n\nOther parameters can be appended to the last line of this command. 
For example you can change your command to:\n```bash\ndocker exec tempoSig_container \\\n./exec/tempoSig.R input/catalog.txt output/exposure.txt --pvalue --nperm 1000 --pv.out output/pvalue.txt\n```\n\n## Quick start with command-line interface\n\n### Main inference\nIf you are not interested in interactive usages with more flexibility and functionality, or want to use **tempoSig** as a part of a pipeline, use the command-line script [tempoSig.R](https://github.com/mskcc/tempoSig/blob/master/exec/tempoSig.R). If you installed **tempoSig** as an R package using `install_github`, find the path via\n\n    \u003e system.file('exec', 'tempoSig.R', package = 'tempoSig')\n   \nIf you cloned the repository, the file is located at the `./exec` subdirectory of the github main directory. We denote this package directory path as `PKG_PATH`. The command syntax is\n\n    $ $PKG_PATH/exec/tempoSig.R -h\n     usage: ./tempoSig.R [-h]\n                    [--cosmic_v2 | --cosmic_v3 | --cosmic_v3_SA | --cosmic_v3_exome]\n                    [--sigfile SIGFILE]\n                    [--pvalue]\n                    [--nperm NPERM]\n                    [--seed SEED]\n                    [--pv.out PV.OUT] \n                    [--cbio]\n                    CATALOG OUTPUT\n\n     Fit mutational catalog to signatures\n\n     positional arguments:\n       CATALOG            input catalog data file\n       OUTPUT             output file name\n\n     optional arguments:\n       -h, --help         show this help message and exit\n       --cosmic_v2        use COSMIC v2 reference signatures (default)\n       --cosmic_v3        use COSMIC v3 reference signatures\n       --cosmic_v3_SA     use COSMIC v3 SigAnalyzer reference signatures\n       --cosmic_v3_exome  use COSMIC v3 exome reference signatures\n       --sigfile SIGFILE  custom input reference signature file; overrides\n                          --cosmic_v2/3\n       --pvalue           estimate p-values (default FALSE)\n       --nperm 
NPERM      number of permutations for p-value estimation; default\n                          1000\n       --seed SEED        random number seed\n       --pv.out PV.OUT    p-value output file\n       --cbio             output in cBioPortal format (default FALSE)\n     \nOnly two arguments are mandatory: `CATALOG` and `OUTPUT`, which specify the paths of the input catalog data and the output file to be written, respectively. Both are tab-delimited text files with headers. See [tcga-brca_catalog.txt](https://github.com/mskcc/tempoSig/blob/master/inst/extdata/tcga-brca_catalog.txt) for a catalog file example. For instance,\n\n    $ $PKG_PATH/exec/tempoSig.R $PKG_PATH/extdata/tcga-brca_catalog.txt output.txt\n    \nfits catalog data for 10 samples in `tcga-brca_catalog.txt` to [COSMIC v2 signatures](https://github.com/mskcc/tempoSig/blob/master/inst/extdata/cosmic_snv_signatures_v2.txt) (default). The output file `output.txt` has the following format:\n\nSample Name    | Number of Mutations | Signature.1 | Signature.2 | Signature.3 | Signature.4\n-------------- | ------------------- | ----------- | ----------- | ----------- | -----------\nTCGA.BH.A0EI   | 18                  | 0.61        | 0.01        | 0.00        | 0.00\nTCGA.E9.A22B   | 50                  | 0.51        | 0.22        | 0.00        | 0.00\nTCGA.OL.A5RV   | 10                  | 0.41        | 0.00        | 0.23        | 0.00 \n\nThe following will use the [COSMIC v3 signatures](https://github.com/mskcc/tempoSig/blob/master/inst/extdata/cosmic_sigProfiler_SBS_signatures.txt):\n\n    $ $PKG_PATH/exec/tempoSig.R  --cosmic_v3 $PKG_PATH/extdata/tcga-brca_catalog.txt output_v3.txt\n\nThe output is similar, with the columns corresponding to 67 signatures:\n\nSample Name    | Number of Mutations | SBS.1.      
| SBS.2       | SBS.3\n-------------- | ------------------- | ----------- | ----------- | -----------\nTCGA.BH.A0EI   | 18                  | 0.373       | 8.5e-3      | 0\nTCGA.E9.A22B   | 50                  | 0.310       | 0.180       | 0\nTCGA.OL.A5RV   | 10                  | 0.337       | 0           | 0 \n\nOne can use a custom reference signature list (in the same format as the default version 3 file) via the optional argument `--sigfile SIGFILE`.\n\n### Catalog matrix generation\n\nIf a MAF file contains the column `Ref_Tri` [trinucleotide contexts surrounding the mutation site; use `make_trinuc_maf.py` script in [mutation-signatures](https://github.com/mskcc/mutation-signatures)], the catalog matrix can also be generated using [maf2cat()](https://github.com/mskcc/tempoSig/blob/master/man/maf2cat.Rd) or its command-line wrapper:\n\n    $ ./maf2cat2.R -h\n    usage: ./maf2cat2.R [-h] MAF CATALOG\n\n    Construct mutational catalog from MAF file with Ref_Tri column\n\n    positional arguments:\n      MAF         input MAF file\n      CATALOG     output catalog file\n\n    optional arguments:\n      -h, --help  show this help message and exit\n\nIf the MAF file does not contain the column `Ref_Tri`, use [maf2cat3()](https://github.com/mskcc/tempoSig/blob/master/man/maf2cat3.Rd). 
It requires the reference genome package [BSgenome.Hsapiens.UCSC.hg19](https://bioconductor.org/packages/BSgenome.Hsapiens.UCSC.hg19) to be installed:\n\n    \u003e library(tempoSig)\n    \u003e library(BSgenome.Hsapiens.UCSC.hg19)\n    \u003e maf \u003c- system.file('extdata', 'brca.maf', package = 'tempoSig')\n    \u003e x \u003c- maf2cat3(maf = maf, ref.genome = BSgenome.Hsapiens.UCSC.hg19)\n    \u003e write.table(x, file = 'brca_catalog.txt', row.names = TRUE, col.names = TRUE, sep = '\\t', quote = FALSE)\n    \nIf you prefer not to use the R interface, a command-line script is available, assuming that the BSgenome.Hsapiens.UCSC.hg19 package has been installed:\n\n    $ ./maf2cat3.R -h\n    usage: ./maf2cat3.R [-h] MAF CATALOG\n\n    Construct mutational catalog from MAF file with Ref_Tri column\n\n    positional arguments:\n      MAF         input MAF file\n      CATALOG     output catalog file\n\n    optional arguments:\n      -h, --help  show this help message and exit\n\n### P-value estimation\nOptionally, the statistical significance of the set of proportions (rows in the exposure output) can be estimated by permutation sampling. For each signature, the exposure inference is repeated multiple times after permutation of the reference signature profile. P-values are the fractions of permuted replicates whose proportions (**H0**) are not lower than those of the original (**H1**). The p-value estimation is turned on by the argument `--pvalue`. The number of permutations is 1,000 by default and can be set with `--nperm NPERM`. 
The default output has the format:\n\nSample Name          | Number of Mutations | Signature.1.observed | Signature.1.pvalue   | Signature.2.observed | Signature.2.pvalue\n-------------------- | ------------------- | -------------------- | -------------------- | -------------------- | ------------------\nTCGA.BH.A0EI         | 18                  | 0.61                 | 0                    | 0.066                | 0.05              \nTCGA.E9.A22B         | 50                  | 0.51                 | 0                    | 0.20                 | 0              \nTCGA.OL.A5RV         | 10                  | 0.41                 | 0.6                  | 1.2e-11              | 0.55        \n\nNote that a p-value of 0 indicates that out of `NPERM` samples, none exceeded **H1**, and therefore must be interpreted as *P* \u003c 1/`NPERM`. Alternatively, one can have two output files, one for exposures and the other for p-values, by specifying the argument `--pv.out PV.OUT`. The exposure output `OUTPUT` is in the same format as that without p-value computation. The p-value output `PV.OUT` has the analogous format, with one p-value column per signature.\n\n### De novo inference\n\nSee [vignettes](http://htmlpreview.github.io/?https://github.com/mskcc/tempoSig/blob/master/old/tempoSig.html) for de novo inference.\n\n### Hybrid inference\n\nFor higher sensitivity and specificity, a hybrid approach (\"piggyback inference\"), combining elements of both refitting and de novo, can be used. 
There is a command-line script that invokes a standard 11-signature de novo reference (and, optionally, pre-optimized filtering cutoff parameters):\n\n    $ ./pgback.R -h\n    usage: ./pgback.R [-h] [--filter] [--cutoff CUTOFF] [--seed SEED]\n                      CATALOG OUTPUT\n\n    Perform piggyback de novo inference\n\n    positional arguments:\n      CATALOG          input catalog data file\n      OUTPUT           output file name\n\n    optional arguments:\n      -h, --help       show this help message and exit\n      --filter         filter exposures using CV cutoffs (default FALSE)\n      --cutoff CUTOFF  custom cutoff file\n      --seed SEED      random number seed\n\nUsing the `--filter` option without a cutoff file input will produce 11-signature filtered exposures minimizing false positives in WES and IMPACT data.\n\n## Documentation\n\nSee [vignettes](http://htmlpreview.github.io/?https://github.com/mskcc/tempoSig/blob/master/old/tempoSig.html) for more detailed documentation of interactive usage for refitting as well as de novo extraction.\n\n## Benchmark\n\n\nIn **Fig. 1**, exposure proportions inferred from data sets simulated with breast cancer-like signature proportions were compared to the true values using cosine similarity (the higher the better; ranges from 0 to 1) to assess overall accuracy. Five other existing algorithms for refitting ([deconstructSigs](https://cran.r-project.org/package=deconstructSigs), [YAPSA](https://www.bioconductor.org/packages/YAPSA/), [MutationalPatterns](https://bioconductor.org/packages/MutationalPatterns/), [MutationalCone](https://doi.org/10.1371/journal.pone.0221235), and [decompTumor2Sig](https://www.bioconductor.org/packages/decompTumor2Sig/)) were applied to the same data sets and their results compared to `tempoSig`. 
Although those more recent than `deconstructSigs`, one of the earliest refitting algorithms, exhibited similar performance, the maximum likelihood-based inference (`tempoSig` and `mutation-signatures`) consistently outperformed all others.\n\n\u003cbr\u003e\n\u003cfigure\u003e\n\u003cimg src=\"https://github.com/mskcc/tempoSig/blob/master/old/cosim6.png\" align=\"center\" height=\"480\" width=\"600\"/\u003e\n    \u003cfigcaption\u003e Fig. 1: Accuracy comparison of exposures predicted from simulated data of varying mutation loads with six refitting algorithms. \u003c/figcaption\u003e\n\u003c/figure\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmskcc%2Ftemposig","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmskcc%2Ftemposig","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmskcc%2Ftemposig/lists"}