https://github.com/ptrebert/sciddo
Home of the SCIDDO tool
- Host: GitHub
- URL: https://github.com/ptrebert/sciddo
- Owner: ptrebert
- License: gpl-3.0
- Created: 2018-04-20T08:46:38.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2022-09-15T17:51:40.000Z (almost 4 years ago)
- Last Synced: 2026-04-08T15:47:00.849Z (3 months ago)
- Topics: bioinformatics, chromatin, epigenomics, tool
- Language: Python
- Size: 1 MB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# SCIDDO: Score-based identification of differential chromatin domains
## Publication
Manuscript: [DOI: 10.1093/bioinformatics/btaa960](https://doi.org/10.1093/bioinformatics/btaa960)
bioRxiv preprint: [DOI: 10.1101/441766](https://doi.org/10.1101/441766)
## Use cases
SCIDDO is a tool for the differential analysis of histone chromatin data.
SCIDDO uses chromatin state segmentation maps, e.g., as generated by ChromHMM or EpiCSeg,
for identifying regions of differential chromatin state between individual samples
or groups of replicated samples.
The detected differential chromatin domains can be expected to overlap largely
with regulatory regions or differentially expressed genes (see our manuscript
preprint for detailed results). Moreover, the score-based approach implemented
in SCIDDO affords a straightforward customization of scoring chromatin state
differences to emphasize different aspects of chromatin dynamics.
## Code maturity
SCIDDO is currently in BETA status.
master branch: [Travis CI build status](https://travis-ci.org/ptrebert/sciddo)
dev branch: [Travis CI build status](https://travis-ci.org/ptrebert/sciddo)
## Setup
SCIDDO supports only Linux environments (this is unlikely to change in the future) and is developed using Python 3.6.
Other Python 3.x versions may or may not work, but are not officially supported.
For easy setup, it is highly recommended to install SCIDDO inside a dedicated Conda environment.
A suitable environment is specified in `environments/sciddo_env.yml`.
Otherwise, install the HDF5 library (tested with version 1.8.18) as appropriate for your local environment,
and the necessary Python dependencies from the `requirements.txt` file:
```bash
sudo apt-get install libhdf5
sudo pip install -r requirements.txt
```
Empirically, the setup of PyTables and HDF5 can create some headaches.
In this case, the best advice is to use Conda.
After all dependencies have been installed successfully,
run the SCIDDO setup as appropriate for your environment:
```bash
[sudo] python setup.py install
```
## Execution
### Input and output data formats
SCIDDO supports common text-based input and output data formats. Chromatin state segmentations as tabular (BED-like) files
should be compatible as long as they have a fixed bin width of at least 100 bp. Output files from ChromHMM or EpiCSeg
are supported out-of-the-box, and SCIDDO is designed to be used immediately downstream of these tools (e.g., SCIDDO knows
that ChromHMM segmentation files have the suffix "_segments.bed" and will strip that from file names before determining
possible sample labels). Auxiliary files such as chromatin state label or color mappings are supported in the form of simple
tab-separated "key-value" text files.
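The file-name handling and the key-value files described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not SCIDDO's actual code: the function names, the example file name, and the state labels are hypothetical, and skipping comment lines starting with `#` is an assumption about the file format.

```python
import io
from pathlib import Path

CHROMHMM_SUFFIX = "_segments.bed"  # suffix named in the text above


def strip_segmentation_suffix(file_name):
    """Return the file name without the ChromHMM suffix (or plain extension)."""
    name = Path(file_name).name
    if name.endswith(CHROMHMM_SUFFIX):
        return name[: -len(CHROMHMM_SUFFIX)]
    return Path(name).stem


def read_key_value(handle):
    """Read a two-column, tab-separated key-value file into a dict."""
    mapping = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or fields[0].startswith("#"):
            continue  # skip blank, malformed, or comment lines (assumption)
        mapping[fields[0]] = fields[1]
    return mapping


# Hypothetical file name and state labels, for illustration only.
candidate_label = strip_segmentation_suffix("/data/HepG2_15_segments.bed")
state_labels = read_key_value(io.StringIO("1\tActive TSS\n2\tEnhancer\n"))
print(candidate_label, state_labels)
```

The stripped name is only a candidate; how SCIDDO derives the final sample label from it is not covered by this sketch.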
SCIDDO's internal data management is implemented with the popular [pandas Python package](https://pandas.pydata.org/), and
data are stored in HDF5 files (*.h5) that are created with pandas. The main reason for using
HDF5 files for storing data and metadata is efficiency, but all contents of an HDF5 file can be dumped to text.
After the first step in a SCIDDO analysis of converting the input data to HDF5, all subsequent operations will be performed
on this HDF5 file.
When dumping identified differential chromatin domains (DCDs) or raw candidate regions to text, the output adheres to the
BED column layout (with header) `chromosome, start, end, name, score`, plus additional columns containing statistics and sample/group names.
If downstream tools cannot work with non-standard BED-like text files, a simple
`cut -f 1,2,3,4,5 <file>.tsv > <file>.bed` can be used to restrict the output to the first five,
BED-compliant columns.
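The same column restriction can be done in Python when the header line should also be dropped. The sketch below is a minimal, hypothetical helper (the function name, the `skip_header` option, and the example rows are assumptions, not part of SCIDDO):

```python
import io

BED_COLUMNS = 5  # chromosome, start, end, name, score


def first_bed_columns(src, dst, n_columns=BED_COLUMNS, skip_header=False):
    """Copy the first n_columns tab-separated fields of each line to dst."""
    for index, line in enumerate(src):
        if skip_header and index == 0:
            continue  # drop the header line that the dump writes
        fields = line.rstrip("\n").split("\t")
        dst.write("\t".join(fields[:n_columns]) + "\n")


# Hypothetical dump with one extra statistics column.
dumped = io.StringIO(
    "chromosome\tstart\tend\tname\tscore\textra_stat\n"
    "chr1\t1000\t5000\tdcd_1\t42\t0.01\n"
)
bed = io.StringIO()
first_bed_columns(dumped, bed, skip_header=True)
print(bed.getvalue(), end="")
```

Dropping the header is optional here because some downstream tools reject BED files whose first line is not a data record.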
### Getting help
`sciddo.py --help` or `sciddo.py <subcommand> --help` (e.g., `sciddo.py convert --help`) is your friend.
For step-by-step help on how to use SCIDDO, please refer to the [tutorial hosted as part of this repository](testdata/tutorial.md).
### Standard analysis run
A standard SCIDDO analysis run is split into several distinct steps that are realized by different code modules.
Besides module specific parameters, there are several global parameters to adjust SCIDDO's runtime behavior.
Importantly, these global parameters always have to be specified before the subcommand, i.e.,
```
sciddo.py [GLOBAL_PARAMETERS] <subcommand> [MODULE_PARAMETERS]
```
The global parameters are:
```bash
--workers: number of CPUs to use (no sanity checks!)
--debug: print debug messages to stderr; otherwise, SCIDDO operates silently
--config-dump: folder to dump run configuration (JSON); defaults to current working directory
--no-dump: do not dump run configuration
```
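The ordering rule is typical for command-line tools built around subcommands. The following sketch uses Python's `argparse` to show why a global option placed after the subcommand is rejected. It assumes, without confirmation from the source, that SCIDDO's interface behaves like a standard subparser setup. Only `--workers` and `--debug` are mirrored here, and the default of one worker is an assumption.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the global/module parameter split."""
    parser = argparse.ArgumentParser(prog="sciddo.py")
    # Global parameters are registered on the top-level parser only.
    parser.add_argument("--workers", type=int, default=1)
    parser.add_argument("--debug", action="store_true")
    # Each subcommand gets its own parser for module-specific parameters.
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name in ("convert", "stats", "score", "scan", "dump"):
        subparsers.add_parser(name)
    return parser


parser = build_parser()

# Global parameters before the subcommand are accepted.
args = parser.parse_args(["--workers", "4", "--debug", "scan"])
print(args.workers, args.debug, args.command)

# The same option after the subcommand is unknown to the subparser.
try:
    parser.parse_args(["scan", "--workers", "4"])
except SystemExit:
    print("global option after the subcommand was rejected")
```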
#### Step 1: convert
Convert all input data (state segmentations plus metadata) into a binary HDF5 file. Currently, ChromHMM
and EpiCSeg output files are supported out-of-the-box. This creates the SCIDDO DATA file.
```bash
sciddo.py [GLOBAL_PARAMETERS] convert --help
```
#### Step 2: stats
Compute a bunch of statistics (e.g., state composition per sample) that are potentially needed downstream.
```bash
sciddo.py [GLOBAL_PARAMETERS] stats --help
```
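As an illustration of what "state composition per sample" means, the sketch below computes the fraction of bins assigned to each state. The sample names and per-bin state labels are made up, and the code is not taken from SCIDDO's `stats` module.

```python
from collections import Counter


def state_composition(states):
    """Return the fraction of bins assigned to each chromatin state."""
    counts = Counter(states)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {state: n / total for state, n in counts.items()}


# Hypothetical per-bin state labels for two samples (fixed-width bins).
segmentations = {
    "sample_A": [1, 1, 2, 3, 3, 3, 1, 2],
    "sample_B": [1, 2, 2, 2, 3, 3, 3, 3],
}
composition = {
    name: state_composition(bins) for name, bins in segmentations.items()
}
for name, fractions in composition.items():
    print(name, fractions)
```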
#### Step 3: score
Add scoring schemes (matrices) to the dataset. These can be derived automatically from the state segmentation
model emissions (if provided during the convert step), or can be supplied in the form of a user-defined file.
Note that, in principle, an arbitrary number of scoring schemes can be added to a dataset.
```bash
sciddo.py [GLOBAL_PARAMETERS] score --help
```
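To give an idea of how a state-by-state matrix could be derived from model emissions, the sketch below uses the Euclidean distance between emission vectors as a dissimilarity. This is one simple choice made for illustration, not the derivation SCIDDO implements. The state names and emission probabilities are hypothetical.

```python
import math


def emission_distance_matrix(emissions):
    """Map every ordered pair of states to the distance of their emissions."""
    states = sorted(emissions)
    return {
        (a, b): math.dist(emissions[a], emissions[b])
        for a in states
        for b in states
    }


# Hypothetical emission probabilities for two histone marks per state.
emissions = {
    "TSS": [0.9, 0.1],
    "Enhancer": [0.1, 0.9],
    "Quiescent": [0.0, 0.0],
}
matrix = emission_distance_matrix(emissions)
print(matrix[("TSS", "Enhancer")])
```

With a distance of this kind, identical states score zero and the matrix is symmetric, which keeps a pairwise sample comparison independent of sample order.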
#### Step 4: scan
Scan the dataset for differential chromatin domains. As opposed to the previous commands, this creates a separate
output file per run, i.e., the SCIDDO RUN file.
```bash
sciddo.py [GLOBAL_PARAMETERS] scan --help
```
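As a rough picture of what a score-based scan could look like, the sketch below scores each bin of a pairwise sample comparison and reports the highest-scoring contiguous run of bins. It is an assumption-laden toy: the scoring function, the segmentations, and the bin size are hypothetical, and SCIDDO's actual scan algorithm and statistics are not reproduced here.

```python
def best_scoring_segment(scores):
    """Return (start, end, total) of the best contiguous run; end is exclusive."""
    best = (0, 0, 0.0)
    run_start, run_total = 0, 0.0
    for index, value in enumerate(scores):
        if run_total <= 0:
            # A non-positive running total cannot help; restart the run here.
            run_start, run_total = index, 0.0
        run_total += value
        if run_total > best[2]:
            best = (run_start, index + 1, run_total)
    return best


def bin_score(state_a, state_b):
    """Hypothetical scoring: reward differing states, penalize identical ones."""
    return 2.0 if state_a != state_b else -1.0


# Hypothetical per-bin state labels of two samples on one chromosome.
sample_a = [1, 1, 2, 3, 3, 1, 1, 1]
sample_b = [1, 1, 3, 1, 2, 1, 1, 1]
per_bin = [bin_score(a, b) for a, b in zip(sample_a, sample_b)]

start_bin, end_bin, total = best_scoring_segment(per_bin)
bin_size = 200  # assumed bin width in bp
print(start_bin * bin_size, end_bin * bin_size, total)
```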
#### Step 5: dump
All data and metadata in the SCIDDO DATA and RUN files can be dumped to text files (e.g., TSV tables or BED files) for downstream analysis.
```bash
sciddo.py [GLOBAL_PARAMETERS] dump --help
```