{"id":37723780,"url":"https://github.com/zavolanlab/mirflowz","last_synced_at":"2026-01-16T13:36:12.355Z","repository":{"id":117644452,"uuid":"610267492","full_name":"zavolanlab/mirflowz","owner":"zavolanlab","description":"Snakemake workflow for the mapping and quantification of miRNAs and isomiRs from miRNA-Seq libraries.","archived":false,"fork":false,"pushed_at":"2026-01-02T09:07:36.000Z","size":1334,"stargazers_count":6,"open_issues_count":22,"forks_count":1,"subscribers_count":4,"default_branch":"dev","last_synced_at":"2026-01-08T08:16:00.408Z","etag":null,"topics":["bioinformatics","isomirs","mirna","snakemake","workflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zavolanlab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-03-06T12:40:12.000Z","updated_at":"2026-01-02T09:07:39.000Z","dependencies_parsed_at":"2023-10-03T13:08:31.122Z","dependency_job_id":"9b2c2df3-6e73-4566-8ae3-4d61b7dbb40d","html_url":"https://github.com/zavolanlab/mirflowz","commit_stats":{"total_commits":150,"total_committers":8,"mean_commits":18.75,"dds":0.6133333333333333,"last_synced_commit":"7dbeb42ccfe8964a313b9fba345c73c9c882f785"},"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/zavolanlab/mirflowz","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zavolanlab%2Fmirflowz","tags_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/zavolanlab%2Fmirflowz/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zavolanlab%2Fmirflowz/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zavolanlab%2Fmirflowz/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zavolanlab","download_url":"https://codeload.github.com/zavolanlab/mirflowz/tar.gz/refs/heads/dev","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zavolanlab%2Fmirflowz/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28479033,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T11:59:17.896Z","status":"ssl_error","status_checked_at":"2026-01-16T11:55:55.838Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","isomirs","mirna","snakemake","workflow"],"created_at":"2026-01-16T13:36:12.275Z","updated_at":"2026-01-16T13:36:12.346Z","avatar_url":"https://github.com/zavolanlab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# _MIRFLOWZ_\n\n_MIRFLOWZ_ is a [Snakemake][snakemake] workflow for mapping miRNAs and isomiRs.\n\n## Table of Contents\n\n1. 
[Installation](#installation)\n    - [Cloning the repository](#cloning-the-repository)\n    - [Dependencies](#dependencies)\n    - [Setting up the virtual environment](#setting-up-the-virtual-environment)\n    - [Testing your installation](#testing-your-installation)\n2. [Usage](#usage)\n    - [Preparing inputs](#preparing-inputs)\n    - [Running the workflow](#running-the-workflow)\n    - [Expected output files](#expected-output-files)\n    - [Creating a Snakemake report](#creating-a-snakemake-report)\n3. [Workflow description](#workflow-description)\n    - [Prepare module](#prepare-module)\n    - [Map module](#map-module)\n    - [Quantify module](#quantify-module)\n    - [ASCII-style pileups module](#ascii-style-pileups-module)\n4. [Contributing](#contributing)\n5. [License](#license)\n6. [Contact](#contact)\n\n## Installation\n\nThe workflow lives inside this repository and will be available for you to run\nafter following the installation instructions laid out in this section.\n\n### Cloning the repository\n\nTraverse to the desired path on your file system, then clone the repository and\nchange into it with:\n\n```bash\ngit clone https://github.com/zavolanlab/mirflowz.git\ncd mirflowz\n```\n\n### Dependencies\n\nFor improved reproducibility and reusability of the workflow, as well as an\neasy means to run it on a high performance computing (HPC) cluster managed,\ne.g., by [Slurm][slurm], all steps of the workflow run inside isolated\nenvironments ([Apptainer][apptainer] containers or [Conda][conda]\nenvironments). As a consequence, running this workflow has only a few\nindividual dependencies. These are managed by the package manager Conda, which\nneeds to be installed on your system before proceeding.\n\nIf you do not already have Conda installed globally on your system,\nwe recommend that you install [Miniconda][miniconda-installation]. 
For faster\ncreation of the environment (and Conda environments in general), you can also\ninstall [Mamba][mamba] on top of Conda. In that case, replace `conda` with\n`mamba` in the commands below (particularly in `conda env create`).\n\n### Setting up the virtual environment\n\nCreate and activate the environment with the necessary dependencies using Conda:\n\n```bash\nconda env create -f environment.yml\nconda activate mirflowz\n```\n\nIf you plan to run _MIRFLOWZ_ via Conda, we recommend running the following\ncommand for faster environment creation, especially if you will run it on an\nHPC cluster.\n\n```bash\nconda config --set channel_priority strict\n```\n\nIf you would like to contribute to _MIRFLOWZ_ development, you may find it\nuseful to create your environment with the development dependencies:\n\n```bash\nconda env create -f environment.dev.yml\n```\n\n### Testing your installation\n\nSeveral tests are provided to check the integrity of the installation. Follow\nthe instructions in this section to make sure the workflow is ready to use.\n\n#### Run test workflow on local machine\n\nExecute one of the following commands to run the test workflow on your local\nmachine:\n\n- Test workflow on local machine with **Apptainer**:\n\n```bash\nbash test/test_workflow_local_with_apptainer.sh\n```\n\n- Test workflow on local machine with **Conda**:\n\n```bash\nbash test/test_workflow_local_with_conda.sh\n```\n\n#### Run test workflow on a cluster via Slurm\n\nExecute one of the following commands to run the test workflow on a\nSlurm-managed high-performance computing (HPC) cluster:\n\n- Test workflow with **Apptainer**:\n\n```bash\nbash test/test_workflow_slurm_with_apptainer.sh\n```\n\n\n- Test workflow with **Conda**:\n\n```bash\nbash test/test_workflow_slurm_with_conda.sh\n```\n\n#### Rule graph\n\nExecute the following command to generate a rule graph image for the workflow.\nThe output will be found in the `images/` directory in the repository 
root.\n\n```bash\nbash test/test_rule_graph.sh\n```\n\nYou can see the rule graph below in the\n[workflow description](#workflow-description) section.\n\n#### Clean up test results\n\nAfter successfully running the tests above, you can run the following command\nto remove all artifacts generated by the test runs:\n\n```bash\nbash test/test_cleanup.sh\n```\n\n## Usage\n\nNow that your virtual environment is set up and the workflow is deployed and\ntested, you can go ahead and run the workflow on your samples.\n\n### Preparing inputs\n\nIt is suggested to have all the input files for a given run (or hard links\npointing to them) inside a dedicated directory, for instance under the\n_MIRFLOWZ_ root directory. This way, it is easier to keep the data together,\nset up Apptainer access to them and reproduce analyses.\n\n#### 1. Prepare a sample table\n\nRefer to `test/test_files/sample_table.tsv` to know what this file\nmust look like, or use it as a template.\n\n```bash\ntouch path/to/your/sample/table.tsv\n```\n\u003e Fill the sample table according to the following requirements:\n\u003e\n\u003e - `sample`. Arbitrary name for the miRNA sequencing library.\n\u003e - `sample_file`. Path to the miRNA sequencing library file. The path must be\n\u003e relative to the directory where the workflow will be run.\n\u003e - `adapter`. Sequence of the 3'-end adapter used during library preparation.\n\u003e - `format`. One of `fa`/`fasta` or `fq`/`fastq`, if the library file is in\n\u003e FASTA or FASTQ format, respectively.\n\n#### 2. Prepare genome resources\n\nThere are 4 files you must provide:\n\n1. A **`gzip`ped FASTA** file containing **reference sequences**, typically the\n   genome of the source/organism from which the library was extracted.\n\n2. 
A **`gzip`ped GTF** file with matching **gene annotations** for the\n   reference sequences above.\n\n\u003e _MIRFLOWZ_ expects both the reference sequence and gene annotation files to\n\u003e follow [Ensembl][ensembl] style/formatting. If you obtained these files from\n\u003e a source other than Ensembl, you must ensure that they adhere to the\n\u003e expected format by converting them, if necessary.\n\n3. An **uncompressed GFF3** file with **microRNA annotations** for the reference\n   sequences above.\n\n\u003e _MIRFLOWZ_ expects the miRNA annotations to follow [miRBase][mirbase]\n\u003e style/formatting. If you obtained this file from a source other than miRBase,\n\u003e you must ensure that it adheres to the expected format by converting it, if\n\u003e necessary.\n\n4. An **uncompressed tab-separated file** with a **mapping between the\n   reference names** used in the miRNA annotation file (column 1; \"UCSC style\")\n   and in the gene annotations and reference sequence files (column 2; \"Ensembl\n   style\"). Values in column 1 are expected to be unique, no header is\n   expected, and any additional columns will be ignored. [This resource][chrMap]\n   provides such files for various organisms, and in the expected format.\n\n5. **OPTIONAL**: A **BED6** file with regions for which to produce\n   [ASCII-style pileups][ascii-pileups]. If not provided, no pileups are\n   generated. See [here][bed-format] for the expected format.\n\n\u003e General note: If you want to process the genome resources before use (e.g.,\n\u003e filtering), you can do that, but make sure the formats of any modified\n\u003e resource files meet the formatting expectations outlined above!\n\n\n#### 3. 
Prepare a configuration file\n\nWe recommend creating a copy of the\n[configuration file template](config/config_template.yaml):\n\n```bash\ncp  config/config_template.yaml  path/to/config.yaml\n```\n\nOpen the new copy in your editor of choice and adjust the configuration\nparameters to your liking. The template explains what each of the\nparameters means and how you can meaningfully adjust them.\n\n### Running the workflow\n\nWith all the required files in place, you can now run the workflow locally\nvia Apptainer with the following command:\n\n```bash\nsnakemake \\\n    --snakefile=\"path/to/Snakefile\" \\\n    --cores 4  \\\n    --configfile=\"path/to/config.yaml\" \\\n    --software-deployment-method apptainer \\\n    --apptainer-args \"--bind ${PWD}/../\" \\\n    --printshellcmds \\\n    --rerun-incomplete \\\n    --verbose\n```\n\nLikewise, you can run the workflow locally via Conda with the following\ncommand:\n\n```bash\nsnakemake \\\n    --snakefile=\"path/to/Snakefile\" \\\n    --cores 4  \\\n    --configfile=\"path/to/config.yaml\" \\\n    --software-deployment-method conda \\\n    --printshellcmds \\\n    --rerun-incomplete \\\n    --verbose\n```\n\n\u003e **NOTE:** Depending on your working directory, you may not need to use the\n\u003e parameters `--snakefile` and `--configfile`. For instance, if the `Snakefile`\n\u003e is in the same directory or the `workflow/` directory is beneath the current\n\u003e working directory, there's no need for the `--snakefile` parameter. Refer to\n\u003e the [Snakemake documentation][snakemakeDocu] for more information.\n\nAfter successful execution of the workflow, results and logs will be found in\nthe `results/` and `logs/` directories, respectively.\n\n### Expected output files\n\nUpon successful execution of _MIRFLOWZ_, the tool automatically removes all\nintermediate files generated during the process. The final outputs comprise:\n\n1. A SAM file containing alignments intersecting a pri-miR locus. 
These\nalignments intersect with extended start and/or end positions specified in the\nprovided pri-miR annotations. Please note that they may not contribute to the\nfinal counting and may not appear in the final table.\n\n2. A SAM file containing alignments intersecting a mature miRNA locus. Similar\nto the previous file, these alignments intersect with extended start and/or end\npositions specified in the provided miRNA annotations. They may not contribute\nto the final counting and might be absent from the final table.\n\n3. A BAM file containing the set of alignments contributing to the final\ncounting and its corresponding index file (`.bam.bai`).\n\n4. Table(s) containing the counting data from all libraries for (iso)miRs\nand/or pri-miRs. Each row corresponds to a miRNA species, and each column\nrepresents a sample library. Each read is counted towards all the annotated\nmiRNA species it aligns to, with 1/n, where n is the number of genomic and/or\ntranscriptomic loci that read aligns to.\n\n5. **OPTIONAL**. ASCII-style pileups of read alignments produced for individual\nlibraries, combinations of libraries and/or all libraries of a given run. The\nexact number and nature of the outputs depends on the workflow\ninputs/parameters. See the\n[pileups section](pipeline_documentation.md/#pileup-workflow) for a detailed\ndescription.\n\nTo retain all intermediate files, include `--no-hooks` in the workflow call.\n\n```bash\nsnakemake \\\n    --snakefile=\"path/to/Snakefile\" \\\n    --cores 4  \\\n    --configfile=\"path/to/config.yaml\" \\\n    --software-deployment-method conda \\\n    --printshellcmds \\\n    --rerun-incomplete \\\n    --no-hooks \\\n    --verbose\n```\n\nAfter successful execution of the workflow, the intermediate files will be\nfound in the `results/intermediates` directory.\n\n### Creating a Snakemake report\n\nSnakemake provides the option to generate a detailed HTML report on runtime\nstatistics, workflow topology and results. 
If you want to create a Snakemake\nreport, run the following command:\n\n```bash\nsnakemake \\\n    --snakefile=\"path/to/Snakefile\" \\\n    --configfile=\"path/to/config.yaml\" \\\n    --report=\"snakemake_report.html\"\n```\n\n\u003e **NOTE:** The report must be created after running the workflow so that\n\u003e the runtime statistics and the results are available.\n\n## Workflow description\n\n_MIRFLOWZ_ consists of a main `Snakefile` and four functional modules. In the\n`Snakefile`, the configuration file is validated, and the various modules are\nimported. In addition, handlers for both successful and failed runs are\nset. If the workflow finishes without any errors, all the intermediate files\nare removed; otherwise, a log file is created. To keep the intermediate files\nupon completion, use the `--no-hooks` CLI argument when running the pipeline.\n\nThe modules [(1)](#prepare-module) process the genome resources,\n[(2)](#map-module) map and [(3)](#quantify-module) quantify the reads, and\n[(4)](#ascii-style-pileups-module) generate pileups, as described in detail\nbelow.\n\n\u003e **NOTE:** _MIRFLOWZ_ uses the notation provided by miRBase (_i.e._\n\u003e \"miRNA primary transcript\" for precursors and \"miRNA\" for the canonical\n\u003e mature miRNA). This implies that precursors are named \"pri-miRs\" across the\n\u003e workflow instead of \"pre-miRs\". This decision reflects the lack of a\n\u003e guarantee that \"miRNA primary transcripts\" are full pre-miR (and pre-miR\n\u003e only) sequences.\n\n### Prepare module\n\nThe _MIRFLOWZ_ workflow initially processes and indexes the genome resources\nprovided by the user. The regions corresponding to mature miRNAs are extended\nby a fixed but user-adjustable number of nucleotides on both sides to\naccommodate isomiR species with shifted start and/or end positions. 
If\nnecessary, pri-miR loci are extended to adjust to the new miRNA coordinates.\nIn addition, to account for the different genomic locations at which a miRNA\nsequence can be annotated, the names of these sequences are modified to have the format\n`SPECIES-mir-NUMBER[LETTER]-#` for pri-miRs, and\n`SPECIES-miR-NUMBER[LETTER]-#-ARM` or `SPECIES-miR-NUMBER[LETTER]-#` for mature\nmiRNAs with both or just one arm, respectively, where `#` is the paralog number\n(replica/locus index), included when multiple loci express the same or similar\nmiRNAs, and `LETTER` denotes a sequence variant of the mature miRNA\n(paralogous variant with similar but not identical sequences).\n\n### Map module\n\nThe user-provided short-read small RNA-seq libraries undergo quality filtering\n(skipped if libraries are provided in FASTA rather than FASTQ), followed by\nadapter removal. The resulting reads are independently mapped to both the\ngenome and the transcriptome using two distinct aligners: Segemehl and our\nin-house tool Oligomap.\n\nSegemehl implements a fast heuristic strategy that returns the alignment(s)\nwith the smallest edit distance. Oligomap, on the other hand, implements a\nslower and more restricted approach that reports all the alignments with an\nedit distance of at most 1. The combination of the fast and flexible results\nand the strict selection ensures results with a higher fidelity than either\ntool would provide on its own.\n\nTwo merging steps are performed in order to have all the alignments in a single\nfile. In the first one, the transcriptome and the genome mappings from both\naligners are fused and only those alignments with an NH smaller than the\nprovided threshold are kept. In the second step, transcriptomic coordinates are converted\ninto genomic ones and alignments are combined into a single file. 
Duplicate\nalignments resulting from the partially redundant mapping strategy are\ndiscarded and only the best alignments for each read are retained (_i.e._ the\nones with the smallest edit distance). In addition, because the alignments have\nbeen aggregated, a second filtering according to the new NH is performed.\nIf a read aligns to more loci than a specified threshold, it is removed due to\n(1) performance reasons, as the file size can rapidly increase, and (2) the fact\nthat each read contributes `1/N` to each count, where `N` is the number of\ngenomic loci it aligns to, and a large `N` makes the contribution negligible.\n\nA final filter is applied to further increase the classification accuracy and\nreduce the number of multimappers (defined here as alignments of the same read\naligning to different genomic loci with the same edit distance). Given that\nisomiRs are known to contain more InDels than mismatches when compared to the\ncanonical sequence they come from, as demonstrated by\n[Saunders et al. (2017)][cite_saunders], [Neilsen et al. (2012)][cite_neilsen]\nand [Schumauch et al. (2024)][cite_schumauch], only those multimappers that\ncontain at least as many InDels as mismatches are retained.\nNote that some multimappers might still be present if the number of InDels is\nthe same across alignments.\n\n### Quantify module\n\nThe filtered alignments are subsequently intersected with the user-provided,\npre-processed miRNA annotation files using BEDTools. Each alignment is\nclassified according to the miRNA species it fully intersects with in order\nto do the counts.\n\nCounts are tabulated separately for reads consistent with either miRNA\nprecursors, mature miRNA and/or isomiRs, and all library counts are fused\ninto a single table. Note that a read is only counted towards a given\nmiRNA (or isomiR) species if one of its alignments fully falls within the\n(previously extended) locus annotated for that miRNA. 
Specifically, reads\ncontribute with `1/N` for each miRNA for which that is the case, where `N` is\nthe total number of genomic loci the read aligns to. Under this criterion, the\nprecursor counts contain reads that intersect with its mature arm(s), its\nhairpin sequence and/or the whole precursor itself.\n\n#### isomiRs notation\n\nA sequence is considered to be an isomiR if it has a shift on either end, an\nInDel or a mismatch on its sequence when compared to the canonical miRNA it\nmaps and intersects with.\n\n_MIRFLOWZ_ employs an unambiguous notation to classify isomiRs using the\nformat `miRNA_name|5p-shift|3p-shift|CIGAR|MD|READ_SEQ`, where `5p-shift` and\n`3p-shift` represent the differences between the annotated mature miRNA\nstart and end positions and those of the read alignment, respectively.\n\n### ASCII-style pileups module\n\nFinally, to visualize the distribution of read alignments around miRNA\nloci, ASCII-style alignment pileups are optionally generated for user-defined\nregions of interest.\n\n\nThe schema below is a visual representation of the individual workflow steps\nand how they are related:\n\n\u003e ![rule-graph][rule-graph]\n\n\u003e **NOTE:** For an elaborated description of each rule along with some\n\u003e examples, please, refer to the\n\u003e [workflow documentation](pipeline_documentation.md).\n\n## Contributing\n\n_MIRFLOWZ_ is an open-source project which relies on community contributions.\nYou are welcome to participate by submitting bug reports or feature requests,\ntaking part in discussions, or proposing fixes and other code changes. Please\nrefer to the [contributing guidelines](CONTRIBUTING.md) if you are interested\nin contributing.\n\n## License\n\nThis project is covered by the [MIT License](LICENSE).\n\n## Contact\n\nFor questions or suggestions regarding the code, please use the\n[issue tracker][issue-tracker]. 
Do not hesitate to contact us via\n[email][email] for any other inquiries.\n\n\u0026copy; 2023 [Zavolab, Biozentrum, University of Basel][zavolab]\n\n[apptainer]: \u003chttps://apptainer.org/docs/user/main/index.html\u003e\n[ascii-pileups]: \u003chttps://git.scicore.unibas.ch/zavolan_group/tools/ascii-alignment-pileup\u003e\n[bed-format]: \u003chttps://gist.github.com/deliaBlue/19ad3740c95937378bd9281bd9d1bc72\u003e\n[chrMap]: \u003chttps://github.com/dpryan79/ChromosomeMappings\u003e\n[cite_neilsen]:\u003chttps://www.sciencedirect.com/science/article/pii/S0168952512001126\u003e\n[cite_saunders]: \u003chttps://pubmed.ncbi.nlm.nih.gov/17360642/\u003e\n[cite_schumauch]: \u003chttps://www.biorxiv.org/content/10.1101/2024.03.28.587190v1\u003e\n[cluster execution]: \u003chttps://snakemake.readthedocs.io/en/stable/executing/cluster.html\u003e\n[conda]: \u003chttps://docs.conda.io/projects/conda/en/latest/index.html\u003e\n[email]: \u003czavolab-biozentrum@unibas.ch\u003e\n[ensembl]: \u003chttps://ensembl.org/\u003e\n[issue-tracker]: \u003chttps://github.com/zavolanlab/mirflowz/issues\u003e\n[mamba]: \u003chttps://github.com/mamba-org/mamba\u003e\n[miniconda-installation]: \u003chttps://docs.conda.io/en/latest/miniconda.html\u003e\n[mirbase]: \u003chttps://mirbase.org/\u003e\n[rule-graph]: images/rule_graph.svg\n[slurm]: \u003chttps://slurm.schedmd.com/documentation.html\u003e\n[snakemake]: \u003chttps://snakemake.readthedocs.io/en/stable/\u003e\n[snakemakeDocu]: \u003chttps://snakemake.readthedocs.io/en/stable/executing/cli.html\u003e\n[zavolab]: \u003chttps://www.biozentrum.unibas.ch/research/researchgroups/overview/unit/zavolan/research-group-mihaela-zavolan/\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzavolanlab%2Fmirflowz","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzavolanlab%2Fmirflowz","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzavolanlab%2Fmirflowz/lists"}