{"id":20425036,"url":"https://github.com/databio/cellspecificopenchromatin","last_synced_at":"2026-04-20T14:33:33.998Z","repository":{"id":86287772,"uuid":"235172139","full_name":"databio/cellSpecificOpenChromatin","owner":"databio","description":"Preprocessing scripts to create a matrix, where columns are different cell types, raws are open chromatin regions and values are signal ATAC-seq signal intensities.","archived":false,"fork":false,"pushed_at":"2020-10-13T20:58:58.000Z","size":400,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":15,"default_branch":"master","last_synced_at":"2025-09-11T10:16:13.972Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/databio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-01-20T18:42:17.000Z","updated_at":"2020-11-12T18:46:16.000Z","dependencies_parsed_at":"2023-03-13T09:28:30.256Z","dependency_job_id":null,"html_url":"https://github.com/databio/cellSpecificOpenChromatin","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/databio/cellSpecificOpenChromatin","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2FcellSpecificOpenChromatin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2FcellSpecificOpenChromatin/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2FcellSpecificOpenChromatin/releases","manifests_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2FcellSpecificOpenChromatin/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/databio","download_url":"https://codeload.github.com/databio/cellSpecificOpenChromatin/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2FcellSpecificOpenChromatin/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32050925,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T11:35:06.609Z","status":"ssl_error","status_checked_at":"2026-04-20T11:34:48.899Z","response_time":94,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-15T07:12:05.454Z","updated_at":"2026-04-20T14:33:33.968Z","avatar_url":"https://github.com/databio.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Cell type specific open chromatin signal matrix\n\n### Prerequisites\nSome of the UCSC tools (https://genome.ucsc.edu/goldenPath/help/bigWig.html):\n- bedGraphToBigWig\n- bigWigMerge\n- bigWigToBedGraph\n\n- bedtools (http://quinlanlab.org/tutorials/bedtools/bedtools.html)\\\nR packages:\n- tidyverse\n- preprocessCore\n- data.table\n\n\n## How to create open chromatin signal matrix:\n\n## 1.-7. 
in */dataDownloadAndPreprocess* folder\nSet up the path to the */dataDownloadAndPreprocess* folder in your *.bash_profile*.\n### 1. Download signal tracks from ENCODE \nOn the ENCODE website, apply the file selection criteria: hg19 / DNA accessibility / bigWig. Download *files.txt*. \\\nThe first line of *files.txt* contains the link to the metadata for these files; use it to download *metadata.tsv*.\\\n\\\nIn RStudio run:\n```filterMetadata.R```.\n\\\nThe outputs of *filterMetadata.R* are *metadata_cells.tsv* and *downloadCells.txt*. \\\n\\\nFrom the terminal: \n```\nmkdir primaryCells\nmv downloadCells.txt primaryCells\ncd primaryCells\nxargs -L 1 curl -O -L \u003c downloadCells.txt\ncd ..\n```\n\n### 2. Organize bigWig files into cell-specific subdirectories and merge \n\nFrom RStudio run the script: \n```organizeFilesIntoCellFolders.R```.\n\\\nThe script contains the function *moveBigWigs(cellMetadata, mainDir)*, which requires 2 inputs: \n- name of the metadata file that maps file names to cell types (*metadata_cells.tsv*) \n- name of the folder containing the downloaded bigWig files (*/primaryCells*)\n\nWithin the directory provided to *moveBigWigs*, cell-specific subdirectories are created and the bigWig files belonging to a given cell type are moved into the corresponding subdirectory. \n\\\nThe following step merges bigWig files coming from the same cell type. The output, in the form of bedGraph files, is moved to a directory specified by the user.\n\n``` \nmkdir primaryCells_bedGraph\ncd primaryCells\nmergeBigWig.sh\ncd ..\n```\nThe following prompts appear:\n```\nWhat is the full path to the folder where outputs should be stored?\npath_to_dataDownloadAndProcess/primaryCells_mergedBigWig\n\nWhat is the full path to the file with chromosome sizes?\npath_to_dataDownloadAndProcess/hg19chrom.sizes\n```\nIf a folder contains a single bigWig file, an error is generated. In these folders, run ```bigWigToBedGraph``` manually and move the output to the corresponding folder with all the other merged bedGraphs. \n\n### 3. 
Download cell specific open chromatin BED files \nOn the ENCODE website, apply the filtering criteria: DNA accessibility / hg19 / primary cell / bed narrowPeak. \\\nAs with the bigWig files, download the text document with links to the BED files (*dataDownloadAndProcess/BED_files/files.txt*) and download the metadata from its first line (*dataDownloadAndProcess/BED_files/metadata.tsv*). \\\nRun the following script to select only BED files that are truly hg19 and have status *released*.\n```\nfilterMetadata_BEDfiles.R\n```\nThis script creates a new text file with the filtered links (*dataDownloadAndProcess/BED_files/downloadBEDfiles.txt*). \\\nTo get the BED files, run the following from the terminal:\n```\ncd BED_files\nxargs -L 1 curl -O -L \u003c downloadBEDfiles.txt\n```\n### 4. Create a set of all possible open chromatin regions across cell types\nTo create a universe of all possible chromatin accessible regions, run the following command from the directory with all the downloaded BED files:\n```\ncat *.bed | sort -k1,1 -k2,2n | bedtools merge -i stdin \u003e MasterPeaks.bed\n```\nYou can now remove all the downloaded BED files and keep only *MasterPeaks.bed*.\nCreate a 4th column in *MasterPeaks.bed* with names for the individual peaks (e.g. chr_start_end); otherwise an error will be generated in the following step.\n### 5. Assign cell specific signal values to the genomic regions defined in step 4\nPlace the *MasterPeaks.bed* file into a directory where it will be the only BED file. \\\nFrom the folder with the final normalised bigWig files, run the following script:\n```\ncellSpecificity_bigWigOverBed.sh\n```\nThe following prompt appears. Give the full path to the *folder* containing *MasterPeaks.bed*, as in the example below.\n```\nWhat is the folder with your BED files?\ndataDownloadAndProcess/BED_files\n```\nA new directory, */MasterPeaks_coverage*, will be created within the folder with the bigWig files. It contains coverage files for the individual cell types in the predefined regions. 
The columns in the TAB files are as follows:\n- *name* - name field from bed, which should be unique (the 4th column in the BED file)\n- *size* - size of bed (sum of exon sizes)\n- *covered* - # bases within exons covered by bigWig\n- *sum* - sum of values over all bases covered\n- *mean0* - average over bases with non-covered bases counting as zeroes\n- *mean* - average over just covered bases\n- *min*  - minimum observed in the area\n- *max*  - maximum observed in the area\n\n### 6. Merge individual signal tracks into matrix\nBefore running the following script, set the path to the folder with the coverage files (the TAB files generated by running ```cellSpecificity_bigWigOverBed.sh```) on line 5 (variable *folderName*).\n```\ncreateOpenMatrix.R\n```\nThe script creates two matrices: one with the maximum coverage value over each region and one with the mean0 value over each region. Rows are individual genomic regions, columns are individual cell types. \n### 7. Normalize matrix\nThe final step in creating the cell specific open chromatin matrix is normalization. This is done by running ```normalizeOpenMatrix.R```, where the previously created matrix is passed to the variable *rawMatrix*. The script creates the open chromatin cell specific matrix in its final form, which is called *meanCoverage_percentile99_01_quantNormalized_round4d.txt*.\\\nThe normalization steps are as follows: \n1) Set all values above the 99th percentile for a given cell type to 1.\n2) Normalize the rest of the values to range from 0 to 1.\n3) Perform quantile normalization so that cell types are comparable.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabio%2Fcellspecificopenchromatin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatabio%2Fcellspecificopenchromatin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabio%2Fcellspecificopenchromatin/lists"}