{"id":13773708,"url":"https://github.com/gersteinlab/MUSIC","last_synced_at":"2025-05-11T06:30:29.266Z","repository":{"id":16033462,"uuid":"18777257","full_name":"gersteinlab/MUSIC","owner":"gersteinlab","description":"MUltiScale enrIchment Calling for ChIP-Seq Datasets","archived":false,"fork":false,"pushed_at":"2019-03-14T16:38:10.000Z","size":981,"stargazers_count":21,"open_issues_count":2,"forks_count":1,"subscribers_count":12,"default_branch":"master","last_synced_at":"2024-03-26T12:26:41.567Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://music.gersteinlab.org","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gersteinlab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-04-14T21:23:23.000Z","updated_at":"2023-07-21T21:53:08.000Z","dependencies_parsed_at":"2022-08-31T01:24:23.190Z","dependency_job_id":null,"html_url":"https://github.com/gersteinlab/MUSIC","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gersteinlab%2FMUSIC","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gersteinlab%2FMUSIC/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gersteinlab%2FMUSIC/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gersteinlab%2FMUSIC/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gersteinlab","download_url":"https://codeload.github.com/gersteinlab/MUSIC/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","re
positories_count":253528016,"owners_count":21922574,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T17:01:19.276Z","updated_at":"2025-05-11T06:30:28.943Z","avatar_url":"https://github.com/gersteinlab.png","language":"C++","readme":"\u003chtml\u003e\n\u003cfont face=\"arial\"\u003e\n\u003c!------------------- \u003ctitle\u003eMUSIC\u003c/title\u003e ----\u003e\n\u003cdiv style=\"text-align:center\"\u003e\u003cimg src=\"music_logo.png\" alt=\"Could not load logo.\" width=\"1000\" align=\"center\"\u003e\u003c/div\u003e\n\u003c!-------------------\u003ch1\u003eMUSIC: MUltiScale enrIchment Calling\u003c/h1\u003e----\u003e\n\u003c!---------- MUSIC is an algorithm for identification of enriched regions at multiple scales.\u003cbr\u003e\nIt takes as input:(1) Mapped ChIP and control reads, (2) Smoothing scale window lengths (in base pairs), \u003cbr\u003e\nand outputs: (1) Enriched regions at multiple scales, (2) Significantly enriched regions from all the scales.\u003cbr\u003e\n\u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\u003cbr\u003e\u003cbr\u003e----------\u003e\n\u003ctitle\u003eMUSIC\u003c/title\u003e\n\u003cbr\u003e\u003cbr\u003e\nMUSIC is an algorithm for identification of enriched regions at multiple scales in the read depth signals from ChIP-Seq experiments.\u003cbr\u003eIt takes as input: \u003cbr\u003e\n\u003cdiv style=\"padding:8px;background-color:#ddd;line-height:1.4;\"\u003e\n- Mapped ChIP and control reads (optional),\u003cbr\u003e\n- Smoothing scale window lengths (in base pairs),\n\u003c/div\u003e\u003cbr\u003e\nand 
outputs:\u003cbr\u003e\n\u003cdiv style=\"padding:8px;background-color:#ddd;line-height:1.4;\"\u003e\n- Enriched regions at multiple scales,\u003cbr\u003e\n- Significantly enriched regions from all the scales.\n\u003c/div\u003e\n\u003cbr\u003e\nUnlike other ER identification methods, MUSIC allows analyzing the scale length spectrum of ChIP-Seq datasets and also selecting a user specific slice in the \nscale length spectrum with custom granularity and generates the enriched regions at each length scale.\n\nMUSIC does not strictly require a control experiment (for example mock IP, input DNA) to be performed. It is, however, strongly advised to generate control datasets matching \nthe sample of interest (See Landt et al 2012).\n\n\u003ch2\u003eDownload and Installation\u003c/h2\u003e\nYou can download MUSIC C++ code \u003ca href=\"https://github.com/gersteinlab/MUSIC/archive/master.zip\"\u003ehere\u003c/a\u003e. There are no dependencies for building MUSIC. After download, type:\n\u003cdiv style=\"padding:8px;background-color:#ddd;line-height:1.4;\"\u003e\n\u003ci\u003e\u003cfont face=\"courier\"\u003e\nunzip MUSIC.zip\u003cbr\u003e\ncd MUSIC\u003cbr\u003e\nmake clean\u003cbr\u003e\nmake\n\u003c/font\u003e\u003c/i\u003e\n\u003c/div\u003e\u003cbr\u003e\nto build MUSIC. The executable is located under directory \u003cfont face=\"courier\"\u003ebin/\u003c/font\u003e. 
It may be useful to install \u003ca href=\"http://samtools.sourceforge.net/\"\u003esamtools\u003c/a\u003e for processing BAM files.\n\nTo get help on which options are available, use:\n\u003cdiv style=\"padding:8px;background-color:#ddd;line-height:1.4;\"\u003e\n\u003cfont face=\"courier\"\u003e\nMUSIC -help\n\u003c/font\u003e\n\u003c/div\u003e\n\n\u003ch2\u003eUsage\u003c/h2\u003e\nMUSIC run starts with preprocessing the reads for ChIP and control samples (Note that we use samtools for converting BAM file to SAM files.):\n\u003cdiv style=\"padding:8px;background-color:#ddd;line-height:1.4;\"\u003e\n\u003cfont face=\"courier\"\u003e\nmkdir chip;mkdir input\u003cbr\u003e\nsamtools view chip.bam | MUSIC -preprocess SAM stdin chip/ \u003cbr\u003e\nsamtools view input.bam | MUSIC -preprocess SAM stdin input/ \u003cbr\u003e\n\u003c/font\u003e\n\u003c/div\u003e\u003cbr\u003e\nIf there are multiple replicates to be pooled, they can be done at once or separately. If done separately, MUSIC pools the reads automatically. 
Then it is \nnecessary to sort the reads and remove duplicates in both the control and ChIP samples:\n\u003cdiv style=\"padding:8px;background-color:#ddd;line-height:1.4;\"\u003e\n\u003cfont face=\"courier\"\u003e\nmkdir chip/sorted;mkdir chip/dedup;mkdir input/sorted;mkdir input/dedup\u003cbr\u003e\nMUSIC -sort_reads chip chip/sorted \u003cbr\u003e\nMUSIC -sort_reads input input/sorted \u003cbr\u003e\nMUSIC -remove_duplicates chip/sorted 2 chip/dedup \u003cbr\u003e\nMUSIC -remove_duplicates input/sorted 2 input/dedup \u003cbr\u003e\n\u003c/font\u003e\n\u003c/div\u003e\u003cbr\u003e\nNext, identify the enriched regions:\n\u003cdiv style=\"padding:8px;background-color:#ddd;line-height:1.4;\"\u003e\n\u003cfont face=\"courier\"\u003e\nMUSIC -get_multiscale_broad_ERs -chip chip/dedup -control input/dedup -mapp Mappability_36bp -l_mapp 36 -begin_l 1000 -end_l 16000 -step 1.5\n\u003c/font\u003e\n\u003c/div\u003e\u003cbr\u003e\nThis command tells MUSIC to identify the enriched regions starting from a 1kb smoothing window length up to 16kb with a multiplicative factor of 1.5, using the default\nvalues for the remaining parameters. The ERs for each scale are dumped. \n\u003cbr\u003e\u003cbr\u003e\nIf there is no control, omit the option that specifies the preprocessed control reads directory (i.e., '-control input/dedup').\n\u003cbr\u003e\u003cbr\u003e\nThere are 3 different ER identification modes by default. \n\u003ch3\u003e-get_TF_peaks\u003c/h3\u003e\nIdentifies point binding events. This mode uses a very small scale level to identify the small transcription factor binding peaks. Use this mode for TFs like CTCF. MUSIC aims\nat trimming the reported regions and identifying the peaks with the densest signal in them.\n\n\u003ch3\u003e-get_multiscale_punctate_ERs\u003c/h3\u003e\nIdentifies punctate enriched regions. Uses 100 base pairs to 2 kbps scale levels. 
This mode is useful for punctate histone marks like H3K4me3, H3K27ac, and marks that behave\nin a mixed manner with a dominating spectrum at punctate scales like H3K4me1.\n\n\u003ch3\u003e-get_multiscale_broad_ERs\u003c/h3\u003e\nIdentifies the ERs at broad scales that can span megabases. Use this option for marks like H3K9me3, H3K27me3, H3K36me3, H3K79me2, H4K20me1... Make sure you select the p-value \nnormalization parameter to balance power and false positive rate.\n\n\u003cbr\u003e\u003cbr\u003e\nMUSIC can save the smoothed tracks in \u003ca href=\"http://genome.ucsc.edu/goldenPath/help/bedgraph.html\"\u003ebedGraph\u003c/a\u003e format that can be viewed locally:\n\u003c/div\u003e\u003cbr\u003e\n\u003cdiv style=\"padding:8px;background-color:#ddd;line-height:1.4;\"\u003e\n\u003cfont face=\"courier\"\u003e\nMUSIC -write_MS_decomposition -chip chip/dedup -control input/dedup -mapp Mappability_36bp -l_mapp 36 \n\u003c/font\u003e\n\u003c/div\u003e\u003cbr\u003e\nThe smoothed bedGraph files are usually very small in size and can easily be stored/transferred.\n\n\u003ch2\u003eOutput format\u003c/h2\u003e\nMUSIC outputs a large number of files that contain the SSERs at different scales (named SSER_....bed). 
\u003cbr\u003e\n\nThe final set of ERs is reported in two files. One is in broadPeak format (http://genome.ucsc.edu/FAQ/FAQformat.html#format13).\u003cbr\u003e\n\nThe other file is in an extended BED format and has 9 columns:\n\u003cdiv style=\"padding:8px;background-color:#ddd;line-height:1.4;\"\u003e\n\u003cfont face=\"courier\"\u003e\n[chromosome]\t[start]\t[end]\t[\".\"]\t[log_10 Q-value]\t[Strand (\"+\")]\t[Summit position]\t[Mappable Trough Position]\t[Fold Change]\n\u003c/font\u003e\n\u003c/div\u003e\u003cbr\u003e\nThe entries are sorted with respect to increasing Q-values.\n\nNote that MUSIC reports log_10(Q-value) and not -log_10(Q-value).\n\n\u003ch2\u003eIssue with Viewing BED File in IGV\u003c/h2\u003e\nIf you are viewing the BED file with IGV, it sometimes does not plot the ERs correctly. If you wish to look at the BED file in IGV, use only the first 6 columns in the file:\n\u003cdiv style=\"padding:8px;background-color:#ddd;line-height:1.4;\"\u003e\n\u003cfont face=\"courier\"\u003e\ncut -f1-6 ERs_1000.0_16000.0_1.50_1750_4.0.bed \u003e ERs.bed\n\u003c/font\u003e\n\u003c/div\u003e\nthen open ERs.bed in IGV.\n\n\u003ch2\u003eParameter Selection Guideline for the Studied ChIP-Seq Datasets\u003c/h2\u003e\nMUSIC has default parameter sets for broad marks (like H3K36me3, H3K27me3), punctate marks (H3K4me3), and point binding (like transcription factors), which work well in most \ncases. When one is not sure about the ER scale spectrum for a signal profile at hand, one can perform a scale spectrum analysis using a large spectrum with dense sampling, get an \nidea about the dominant ER length scale (if there is one) for the signal profile, and match the parameters used in the manuscript.\n\n\u003ch2\u003eParametrization for New ChIP-Seq Datasets\u003c/h2\u003e\nWhen a new ChIP-Seq dataset is going to be processed, it is necessary to choose begin and end length scales and p-value normalization window length. The length
The length\nscales enables one to concentrate on the correct scale spectrum and p-value normalization window length compensates for variation in sequencing depth and allows\ncontrolling the estimated false positive rates. \u003cbr\u003e\n\nThere are two steps to parameter selection:\n\u003cdiv style=\"padding:8px;background-color:#ddd;line-height:1.4;\"\u003e\n\u003cfont face=\"courier\"\u003e\n1. Select a stringent p-value normalization window length: -get_per_win_p_vals_FC option. \n\u003c/font\u003e\n\u003c/div\u003e\u003cbr\u003e\nThis option estimates the false positive and negative rates using a large selection of p-value window lengths. The output is a file where each row look like this:\n\u003cbr\u003e\n...\u003cbr\u003e\nl_win: 1700\tFNR: (FC:0.001) (p-val:0.001)\tFPR: (FC:0.010) (p-val:0.005)\tSentitivity: 0.999\u003cbr\u003e\n...\u003cbr\u003e\n\u003cbr\u003e\nThis option evaluates several windows lengths and estimates the false positive rate and false negative rates. We recommends using the maximum window length (l_win) where false\npositive rate (FPR) for FC and p-val values are below 1%.\u003cbr\u003e\n\u003cbr\u003e\nNote that you can skip going through the file manually if you specify p-value normalization window length as 0 ('-l_p 0') in the peak calling step; which tells MUSIC to select p-value normalization window length \nfrom the above file automatically. For this to work, make sure that you do not delete any files after running -get_per_win_p_vals_FC option, otherwise MUSIC will complain that it cannot find \nthe file. See below for complete automation (of p-value normalization window length selection) with default parameters. \u003cbr\u003e\n\u003cbr\u003e\n\u003cdiv style=\"padding:8px;background-color:#ddd;line-height:1.4;\"\u003e\n\u003cfont face=\"courier\"\u003e\n2. 
Using the p-value normalization window length from step 1, generate the scale-specific ER scale spectrum: -get_scale_spectrum option.\u003cbr\u003e\n\n\u003c/font\u003e\u003cbr\u003e\n\u003c/div\u003e\u003cbr\u003e\nMUSIC generates the spectrum using scale lengths from 100 base pairs to 1 megabase. The output is reported in a text file where each \nrow corresponds to a scale length. In each row, the coverage of the SSERs is given. For example: \n\u003cbr\u003e\n...\u003cbr\u003e\n27\t56815.13\t111\t11675769\u003cbr\u003e\n...\u003cbr\u003e\n\u003cbr\u003e\nwhere the 2nd column is the scale length and the 4th column is the coverage of the ERs that are specific to that scale. It is best to plot the spectrum, i.e., the scale lengths versus \nthe fraction of coverage of the SSERs (4th column in the file), then match the spectrum with the studied HMs in the \nmanuscript. If the spectrum is very different from all of the parametrized ChIP-Seq datasets, it is useful to generate the statistics on ER length distribution \nand distribution of ER-ER distances and use the procedure described in the manuscript to select the scale levels.\u003cbr\u003e\n\u003cbr\u003e\nThe other parameters (namely, gamma and sigma) do not depend on experimental variables and are optimized for minimizing overmerging and maximizing sensitivity.\nWe suggest using the values specified in the manuscript, i.e., gamma=4, sigma=1.5.\u003cbr\u003e\n\n\u003ch2\u003eImportant Note on Punctate ERs\u003c/h2\u003e\nAfter this analysis, if the scale spectrum turns out to be punctate, i.e., the dominating scale is smaller than 10kb, there is a possibility that the p-value normalization window\nlength parameter (l_p) yields low sensitivity. To compensate for this, we recommend running with the default l_p parameter of MUSIC.\n\n\u003ch2\u003eRunning MUSIC with Default Parameters and Automatic Selection of l_p Parameter: \u003c/h2\u003e\nWe just added a new script, run_MUSIC.csh. 
This script automates the parameter selection for -l_p option. This script automates running MUSIC with default parameters. It simply calls MUSIC \nwith above parameters in order. Here is how this script can be used: \n\u003cdiv style=\"padding:8px;background-color:#ddd;line-height:1.4;\"\u003e\n\u003cfont face=\"courier\"\u003e\nread_fp=\"wgEncodeBroadHistoneGm12878H3k27me3StdAlnRep1.bam\";\u003cbr\u003e\nmappability_map=\"mappability/36bp\"; \u003cbr\u003e\ninput_processed_dir=\"input/pruned\";\u003cbr\u003e\nmkdir preprocessed sorted pruned\u003cbr\u003e\nrun_MUSIC.csh -preprocess ${read_fp} preprocessed\u003cbr\u003e\nrun_MUSIC.csh -remove_duplicates preprocessed sorted pruned\u003cbr\u003e\nrun_MUSIC.csh -get_optimal_broad_ERs pruned ${input_processed_dir} ${mappability_map}\u003cbr\u003e\n\u003c/font\u003e\n\u003c/div\u003e\u003cbr\u003e\nMake sure that run_MUSIC.csh is included in PATH. Currently only BAM files are supported in preprocessing. We will add more file types, soon. Note that it is necessary to install samtools for making \nsure this runs smoothly. The above code calls the ER's using the optimal l_p selection with the default parameters.\n\n\u003ch2\u003eMulti-Mappability Signals\u003c/h2\u003e\nUsing Mappability correction increases the accuracy of MUSIC. You can download the multi-mappability signals for several common read lengths \u003ca href=\"http://archive.gersteinlab.org/proj/MUSIC/multimap_profiles/\"\u003ehere\u003c/a\u003e.\n\n\u003ch2\u003eMulti-Mappability Profile Generation\u003c/h2\u003e\nTo generate the multi-mappability profile, MUSIC depends on a short read aligner. By default, \u003ca href=\"http://bowtie-bio.sourceforge.net/bowtie2/index.shtml\"\u003ebowtie2\u003c/a\u003e generated SAM alignments are supported. 
The following are necessary for multi-Mappability profile generation:\n\u003cdiv style=\"padding:8px;background-color:#ddd;line-height:1.4;\"\u003e\n\u003cfont face=\"courier\"\u003e\n- The FASTA file for the genomic sequence with all the chromosomes\u003cbr\u003e\n- The read length\u003cbr\u003e\n- bowtie2 installation and the genome indices for bowtie2\u003cbr\u003e\n\u003c/font\u003e\n\u003c/div\u003e\u003cbr\u003e\nUse the script generate_multimappability_signal.csh under the bin/ directory to generate the multi-mappability profile. For example, to generate the 50 bp multi-Mappability profile for the human genome, \n\u003cdiv style=\"padding:8px;background-color:#ddd;line-height:1.4;\"\u003e\n\u003cfont face=\"courier\"\u003e\ncd bin\u003cbr\u003e\n./generate_multimappability_signal.csh hg19.fa 50 /home/users/music_user/bt2_indices/hg19_indexes/hg19\n\u003c/font\u003e\n\u003c/div\u003e\u003cbr\u003e\nNote that the bowtie2 indices must be supplied on the command line and the path must be absolute. For the above command, they are under /home/users/music_user/bt2_indices/hg19_indexes/hg19.  \n\nThis command processes the FASTA file and writes two temporary scripts (temp_map_reads.csh, temp_process_mapping.csh) that fragment the genome, map the fragments, and then generate the profile.\nNext, it is necessary to run these two scripts, in order:\n\u003cdiv style=\"padding:8px;background-color:#ddd;line-height:1.4;\"\u003e\n\u003cfont face=\"courier\"\u003e\n./temp_map_reads.csh\u003cbr\u003e\n./temp_process_mapping.csh\n\u003c/font\u003e\n\u003c/div\u003e\u003cbr\u003e\nAfter the scripts are run, a multi-mappability profile for each chromosome should be created with the '.bin' extension. 'temp_map_reads.csh' is the most time-consuming script; it maps the fragments to the genome. \nEach line in the script is a command that maps the reads to one chromosome, so the lines can be run in parallel on a cluster. 
It is important to make sure 'temp_map_reads.csh' \nfinishes before running 'temp_process_mapping.csh'.\n\nYou can also email me (arif.o.harmanci@uth.tmc.edu) to generate multimappability profiles for new species.\n\n\u003ch2\u003eDatasets\u003c/h2\u003e\nThe ENCODE datasets can be downloaded from \u003ca href=\"http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/\"\u003eUCSC Genome Browser\u003c/a\u003e. \u003cbr\u003e\nThe H3K36me3 datasets for K562 and GM12878 cell lines can be downloaded from \u003ca href=\"http://archive.gersteinlab.org/proj/MUSIC/h3k36me3.tar.bz2\"\u003ehere\u003c/a\u003e.\n\u003c!--------------- The enriched regions identified by MUSIC for the HMs and Polymerase ChIP-Seq datasets for several cell lines can be downloaded from \u003ca href=\"multiMappability_signals/\"\u003ehere\u003c/a\u003e. ----\u003e\n\u003cdiv style=\"padding:8px;background-color:#ddd;line-height:1.4;\"\u003e\n\u003cb\u003e\u003ci\u003eArif Harmanci (arif.harmanci@yale.edu), Joel Rozowsky, Mark Gerstein, 2014\u003c/i\u003e\u003c/b\u003e\n\u003c/div\u003e\n\u003c/font\u003e\n\u003c/html\u003e","funding_links":[],"categories":["DNase, ATAC, and ChIP-seq"],"sub_categories":["Peak Callers"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgersteinlab%2FMUSIC","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgersteinlab%2FMUSIC","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgersteinlab%2FMUSIC/lists"}