{"id":20612671,"url":"https://github.com/datasnakes/htseq-count-cluster","last_synced_at":"2025-10-03T17:19:45.938Z","repository":{"id":27210255,"uuid":"110619307","full_name":"datasnakes/htseq-count-cluster","owner":"datasnakes","description":"A cli for running multiple qsub jobs with HTSeq's htseq-count on a cluster.","archived":false,"fork":false,"pushed_at":"2022-02-25T22:43:19.000Z","size":73,"stargazers_count":4,"open_issues_count":5,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-28T17:33:52.122Z","etag":null,"topics":["cli","cluster","htseq","htseq-count","htseq-count-cluster","rnaseq","sge"],"latest_commit_sha":null,"homepage":"http://htseq-count-cluster.rtfd.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/datasnakes.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-11-14T00:30:40.000Z","updated_at":"2023-11-07T12:47:10.000Z","dependencies_parsed_at":"2022-08-07T12:15:39.041Z","dependency_job_id":null,"html_url":"https://github.com/datasnakes/htseq-count-cluster","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datasnakes%2Fhtseq-count-cluster","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datasnakes%2Fhtseq-count-cluster/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datasnakes%2Fhtseq-count-cluster/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datasnakes%2Fhtseq-count-cluster/manifests","owner_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/owners/datasnakes","download_url":"https://codeload.github.com/datasnakes/htseq-count-cluster/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248761271,"owners_count":21157523,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","cluster","htseq","htseq-count","htseq-count-cluster","rnaseq","sge"],"created_at":"2024-11-16T11:07:36.699Z","updated_at":"2025-10-03T17:19:40.890Z","avatar_url":"https://github.com/datasnakes.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Build Status](https://travis-ci.org/datasnakes/htseq-count-cluster.svg?branch=master)](https://travis-ci.org/datasnakes/htseq-count-cluster) [![PyPI version](https://badge.fury.io/py/HTSeqCountCluster.svg)](https://badge.fury.io/py/HTSeqCountCluster) \n[![GitHub license](https://img.shields.io/github/license/datasnakes/htseq-count-cluster.svg)](https://github.com/datasnakes/htseq-count-cluster/blob/master/LICENSE)\n[![Documentation Status](https://readthedocs.org/projects/htseq-count-cluster/badge/?version=latest)](https://htseq-count-cluster.readthedocs.io/en/latest/?badge=latest)\n\n\n# htseq-count-cluster\n\nA cli wrapper for running [htseq](https://github.com/simon-anders/htseq)'s `htseq-count` on a cluster.\n\nView [documentation](https://tinyurl.com/yb7kz7zz).\n\n## Install\n\n`pip install HTSeqCountCluster`\n\n## Features\n\n- For use with large datasets (we've previously used a dataset of 120 different human samples)\n- For use with SGE/SGI cluster systems\n- Submits 
multiple jobs\n- Command line interface/script\n- Merges counts files into one counts table/csv file\n- Uses `accepted_hits.bam` file output of `tophat`\n\n\n### Examples\n\n#### Run htseq-count-cluster\n\nAfter generating bam output files from tophat, instead of using HTSeq's `htseq-count`, you\ncan use our `htseq-count-cluster` script. This script is intended for use with\nclusters that are using pbs (qsub) for job monitoring.\n\nOur default `htseq-count` command is `htseq-count -f bam -s no file.bam file.gtf -o htseq.out`.\nThis command does not take into account any strandedness (`-s no`) for the input bamfiles (`-f bam`) and uses the default `union` mode. For the default mode `union`, only the aligned read determines how the read pair is counted.\n\n```bash\nhtseq-count-cluster -p path/to/bam-files/ -f samples.csv -g genes.gtf -o path/to/cluster-output/\n```\n\n| Argument |                                                                             Description                                                                             | Required |\n|:--------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------:|\n|   `-p`   | This is the path of your .bam files.  Presently, this script looks for a folder that is the sample name and searches for an accepted_hits.bam file (tophat output). |    Yes   |\n|   `-f`   |                                                     You should have a csv file list of your samples or folder names (no header).                                                    |    Yes   |\n|   `-g`   |                                                           This should be the path to your genes.gtf file.                                                           |    Yes   |\n|   `-o`   |                                                  This should be an existing directory for your output counts files.                      
                           |    Yes   |\n|   `-e`   |                                                             Email address to send script completion to.                                                             |    No    |\n\nThis script uses logzero so there will be color-coded logging information to your shell.\n\nA common Linux practice is to use `screen` to create a new shell and run a program\nso that if it does produce output to the stdout/shell, the user can exit that particular\nshell without the program ending and utilize another shell.\n\n##### Help message output for `htseq-count-cluster`\n\n```\nusage: htseq-count-cluster [-h] -p INPATH -f INFILE -g GTF -o OUTPATH\n                              [-e EMAIL]\n\nThis is a command line wrapper around htseq-count.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -p INPATH, --inpath INPATH\n                        Path of your samples/sample folders.\n  -f INFILE, --infile INFILE\n                        Name or path to your input csv file.\n  -g GTF, --gtf GTF     Name or path to your gtf/gff file.\n  -o OUTPATH, --outpath OUTPATH\n                        Directory of your output counts file. The counts file\n                        will be named.\n  -e EMAIL, --email EMAIL\n                        Email address to send script completion to.\n\n*Ensure that htseq-count is in your path.\n\n\n```\n\n\n#### Merge output counts files\n\nIn order to prep your data for `DESeq2`, `limma` or `edgeR`, it's best to have 1 merged\ncounts file instead of multiple files produced from the `htseq-count-cluster` script. 
We offer this\nas a standalone script as it may be useful to keep those files separate.\n\n```bash\nmerge-counts -d path/to/cluster-output/\n```\n\n##### Help message for `merge-counts`\n\n```\nusage: merge-counts [-h] -d DIRECTORY\n\nMerge multiple counts tables into 1 counts .csv file.\n\nYour output file will be named:  merged_counts_table.csv\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -d DIRECTORY, --directory DIRECTORY\n                        Path to folder of counts files.\n```\n\n## ToDo\n\n- [ ] Monitor jobs.\n- [ ] Enhance wrapper input for other use cases.\n- [ ] Add example data.\n\n\n## Maintainers\n\nShaurita Hutchins | [@sdhutchins](https://github.com/sdhutchins) | [✉](mailto:sdhutchins@outlook.com)  \nRob Gilmore | [@grabear](https://github.com/grabear) | [✉](mailto:robgilmore127@gmail.com)\n\n\n## Help\n\nPlease feel free to [open an issue](https://github.com/datasnakes/htseq-count-cluster/issues/new) if you have a question/feedback/problem\nor [submit a pull request](https://github.com/datasnakes/htseq-count-cluster/compare) to add a feature/refactor/etc. to this project.\n\n## Citation\n\n*Simon Anders, Paul Theodor Pyl, Wolfgang Huber; **HTSeq—a Python framework to work with high-throughput sequencing data**, Bioinformatics, Volume 31, Issue 2, 15 January 2015, Pages 166–169, https://doi.org/10.1093/bioinformatics/btu638*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatasnakes%2Fhtseq-count-cluster","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatasnakes%2Fhtseq-count-cluster","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatasnakes%2Fhtseq-count-cluster/lists"}