{"id":37739334,"url":"https://github.com/gxelab/psite","last_synced_at":"2026-01-16T14:05:35.679Z","repository":{"id":65607867,"uuid":"474568909","full_name":"gxelab/psite","owner":"gxelab","description":"Model-based inference of P-site offsets for ribosome footprints","archived":false,"fork":false,"pushed_at":"2023-08-15T12:37:17.000Z","size":168,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-04-25T12:22:10.283Z","etag":null,"topics":["bioinformatics","deep-sequencing","machine-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gxelab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-03-27T07:42:47.000Z","updated_at":"2024-08-11T13:52:21.650Z","dependencies_parsed_at":"2024-08-11T14:07:35.231Z","dependency_job_id":null,"html_url":"https://github.com/gxelab/psite","commit_stats":{"total_commits":82,"total_committers":2,"mean_commits":41.0,"dds":"0.012195121951219523","last_synced_commit":"66a43be786602c67bf59c99676910a5dfd30270a"},"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/gxelab/psite","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gxelab%2Fpsite","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gxelab%2Fpsite/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gxelab%2Fpsite/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/gxelab%2Fpsite/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gxelab","download_url":"https://codeload.github.com/gxelab/psite/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gxelab%2Fpsite/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28479108,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T11:59:17.896Z","status":"ssl_error","status_checked_at":"2026-01-16T11:55:55.838Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","deep-sequencing","machine-learning"],"created_at":"2026-01-16T14:05:35.577Z","updated_at":"2026-01-16T14:05:35.674Z","avatar_url":"https://github.com/gxelab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PSite\n\n\n[![DOI](https://zenodo.org/badge/474568909.svg)](https://zenodo.org/badge/latestdoi/474568909)\n\n`PSite` is a python package that predicts P-site offsets for footprints generated in ribosome profiling using a Gradient Boosting Trees (GBT) model trained with footprints around both annotated start and stop codons. 
`PSite` can report estimated P-site offsets in two ways:\n\n- append a `PS` tag to each original alignment in `SAM` or `BAM` format, without any other modifications;\n- output a new `BAM` file of the alignments of P-site locations only.\n\nTo demonstrate the usage of the `PS` tag, `PSite` also has a `coverage` module that performs genome-wide calculation of P-site coverage of ribosome footprints at nucleotide resolution.\n\n### Dependency\n- `numpy` \u003e= 1.21.2\n- `pandas` \u003e= 1.3.4\n- `biopython` \u003e= 1.79\n- `scikit-learn` \u003e= 1.1.1\n- `pysam` \u003e= 0.17.0\n- `pyBigWig` \u003e= 0.3.18\n- `click` \u003e= 8.1.2\n- `seaborn` \u003e= 0.11.0\n\nNote: `pandas` 2.0 is now supported.\n\n\n### Install and uninstall\nTo install `PSite`, run:\n```bash\npip install psite\n```\n\nAlternatively, download the package [tarball](https://github.com/gxelab/psite/releases) from the release page and run:\n```\npip install psite-version-py3-none-any.whl\n```\n\nTo uninstall it, simply run:\n```\npip uninstall psite\n```\n\n### Build distributions from source\nRun the following command in the source directory:\n```bash\npython3 -m build\n```\n\n---------------------------------------\n\n### Usage\n`PSite` is designed to be used from the command line on Unix-like operating systems such as Linux or macOS.\n\n```bash\n$ psite -h\nUsage: psite [OPTIONS] COMMAND [ARGS]...\n\n  main interface\n\nOptions:\n  -h, --help  Show this message and exit.\n\nCommands:\n  coverage  calculate the coverage for plus strand and minus strand...\n  pbam      generate bam with only P-site regions\n  predict   load pre-trained model and predict P-site offsets\n  setp      set global fixed P-site offset tag\n  train     train a model for P-site offset prediction\n```\n\n#### `train`\nThis is the core module; it trains the GBT model for P-site offset prediction. It requires transcriptome alignments (`PATH_BAM`) and the corresponding sequences of all transcripts (`PATH_REF`). 
The required bam can be generated by mapping footprints to the reference genome using [STAR](https://github.com/alexdobin/STAR) and outputting transcriptome alignments with the parameter `--quantMode TranscriptomeSAM`. The trained model is saved in `pickle` format for later use.\n\n```bash\n$ psite train -h\nUsage: psite train [OPTIONS] PATH_REF PATH_BAM OUTPUT_PREFIX PATH_TXINFO\n\n  train a model for P-site offset prediction\n\n  path_ref     : reference transcriptome (fasta) matching the bam\n  path_bam     : alignments of RPFs to reference transcriptome\n  output_prefix: output prefix of fitted models and logs\n  path_txinfo  : transcriptome annotation\n\nOptions:\n  -t, --type_rep [longest|principal|kallisto|salmon]\n                                  type of representative transcripts\n                                  [default: longest]\n  -e, --path_exp TEXT             path of transcript expression quant results\n  -i, --ignore_txversion          ignore transcript version in\n                                  \".\\d+\" format  [default: False]\n  -n, --nts INTEGER               flanking nucleotides to consider at each side\n                                  [default: 3]\n  -f, --frac FLOAT                fraction of alignments for training (for\n                                  large datasets)  [default: 1.0]\n  --offset_min INTEGER            lower bound of distance between RPF 5p and\n                                  start codon  [default: 10]\n  --offset_max INTEGER            upper bound of distance between RPF 5p and\n                                  start codon  [default: 14]\n  -d, --max_depth INTEGER         max depth of trees  [default: 3]\n  -m, --min_samples_split INTEGER\n                                  min number of alignments required to split\n                                  an internal node  [default: 6]\n  -k, --keep                      whether to keep intermediate results\n                                  [default: False]\n  -h, --help    
                  Show this message and exit.  [default:\n                                  False]\n```\n\n#### `predict`\nThis module predicts the P-site offset for each alignment using a pre-trained model and appends a `PS` tag (for \"P-site\") to the original alignment. The input can be either genomic alignments or transcriptomic alignments.\n\n```bash\n$ psite predict -h\nUsage: psite predict [OPTIONS] PATH_REF PATH_BAM PATH_MODEL PATH_OUT\n\n  load pre-trained model and predict P-site offsets\n\n  path_ref   : reference transcriptome (fasta) matching the bam\n  path_bam   : alignments of RPFs to reference transcriptome\n  path_model : path to save the fitted model\n  path_out   : output path of bam with PS (for P-site) tag\n\nOptions:\n  -i, --ignore_txversion    ignore transcript version in \".\\d+\"\n                            format  [default: False]\n  -l, --rlen_min INTEGER    lower bound for mapped read length\n  -u, --rlen_max INTEGER    upper bound for mapped read length\n  -c, --chunk_size INTEGER  chunk size for prediction batch  [default: 100000]\n  -h, --help                Show this message and exit.  [default: False]\n```\n\n#### `pbam`\nThis module predicts the P-site offset for each alignment and keeps only the first nucleotide after excluding the P-site offset. Thus, each alignment in the output contains a single site. 
The input can be either genomic alignments or transcriptomic alignments.\n\n```bash\n$ psite pbam -h\nUsage: psite pbam [OPTIONS] PATH_REF PATH_BAM PATH_MODEL PATH_OUT\n\n  generate bam with only P-site regions\n\n  path_ref   : reference transcriptome (fasta) matching the bam\n  path_bam   : alignments of RPFs to reference transcriptome\n  path_model : path to save the fitted model\n  path_out   : output path of bam with P-site regions only\n\nOptions:\n  -f, --out_format [bam|sam]  P-site alignment output format  [default: bam]\n  -i, --ignore_txversion      ignore transcript version in \".\\d+\"\n                              format  [default: False]\n  -l, --rlen_min INTEGER      lower bound for mapped read length\n  -u, --rlen_max INTEGER      upper bound for mapped read length\n  -c, --chunk_size INTEGER    chunk size for prediction batch  [default:\n                              100000]\n  -h, --help                  Show this message and exit.  [default: False]\n```\n\n#### `coverage`\nThis module calculates the genome-wide or transcriptome-wide coverage of RPF P-sites using the `PS` tag generated by the `predict` module.\n\n```bash\n$ psite coverage -h\nUsage: psite coverage [OPTIONS] PATH_BAM PREFIX\n\n  calculate the coverage for plus strand and minus strand separately\n\n  path_bam: sorted alignment bam file with the PS tag (for P-site offset)\n  prefix  : output prefix of P-site coverage tracks in bigWig format\n\nOptions:\n  -l, --rlen_min INTEGER  lower bound for RPF mapped length  [default: 25]\n  -u, --rlen_max INTEGER  upper bound for mapped read length  [default: 40]\n  -q, --mapq_min INTEGER  minimum mapping quality  [default: 10]\n  -i, --ignore_supp       whether to ignore supplementary alignments\n                          [default: False]\n  -h, --help              Show this message and exit.  
[default: False]\n```\n\n#### `setp`\nThis module sets a global fixed value for the \"PS\" tag.\n```bash\n$ psite setp -h\nUsage: psite setp [OPTIONS] PATH_BAM PATH_OUT\n\n  set global fixed P-site offset tag\n\n  path_bam   : alignments of RPFs to reference transcriptome\n  path_out   : output path of bam with PS (for P-site) tag\n\nOptions:\n  -l, --rlen_min INTEGER     lower bound for mapped read length  [default: 27]\n  -u, --rlen_max INTEGER     upper bound for mapped read length  [default: 35]\n  -n, --nucleotides INTEGER  fixed global offset value  [default: 12]\n  -h, --help                 Show this message and exit.\n```\n\n---------------------------------------\n\n### An example workflow to use PSite\n\n##### Prepare input files\nAfter trimming adapters and optionally removing reads derived from rRNAs or tRNAs, map ribosomal footprints to the reference genome with STAR:\n\n```bash\nSTAR --runThreadN 16 --outFilterType BySJout --outFilterMismatchNmax 2 --genomeDir genome_index --readFilesIn sample_RPF.fq.gz  --outFileNamePrefix sample_RPF --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM --outFilterMultimapNmax 1 --outFilterMatchNmin 16 --alignEndsType EndToEnd --outSAMattributes NH HI AS nM NM MD\n```\n\nThe parameter `--quantMode TranscriptomeSAM` will instruct STAR to translate the genomic alignments into transcript alignments, which will be used to train the GBT model. 
Since many uniquely mapped reads in genomic alignments will become multi-mapping reads in transcriptome alignments due to the presence of alternative transcript isoforms, the `--outFilterMultimapNmax 1` parameter is included to exclude only reads that are multi-mapping in the genomic alignments.\n\nPSite needs to know the positions of annotated start codons and stop codons of all protein-coding transcripts, which can be obtained with the helper scripts located in the `scripts` directory:\n```bash\nRscript --vanilla scripts/extract_txinfo_ensembl.R gene_annotations.gtf txinfo.tsv\n```\n\nFor a gene with multiple transcript isoforms, only one representative isoform is used in the analysis. By default, PSite uses the longest transcript. However, a more reasonable choice is the most abundant isoform. Therefore, if transcript abundance as calculated by [kallisto](https://pachterlab.github.io/kallisto/) or [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) is provided, PSite can automatically determine the most abundant transcript isoform for later use:\n```bash\nsalmon quant -p4 --seqBias --gcBias --posBias -l A -i salmon_index -r sample_RNA.fq.gz -o salmon_results\n```\n\n##### Run PSite\nThe first step is to train a GBT model with the `train` module using the transcriptome bam. The fitted model is then saved in [pickle](https://docs.python.org/3/library/pickle.html) format.\n\n```bash\npsite train -i -t salmon -e salmon_results/quant.sf \\\n    all_transcripts.fa sample_RPF.Aligned.toTranscriptome.out.bam output_prefix txinfo.tsv\n```\n\nModel training is slow for large datasets. The `-f` parameter can be used to train on only a fraction of the alignments. This can significantly improve speed and reduce memory usage while maintaining similar accuracy.\n\nOnce the model is successfully trained, it can be used to predict P-site offsets for ribosome footprints that are mapped to the reference genome or the reference transcriptome. 
It should be noted that if you use genome bam for prediction, genomic fasta should be used as input, and vice versa.\n\n```bash\n# with transcriptomic bam\npsite predict -i all_transcripts.fa sample_RPF.Aligned.toTranscriptome.out.bam output_prefix.gbt.pickle sample_RPF.transcriptome.tag.bam\n\n# with genomic bam\npsite predict -i genome.fa sample_RPF.Aligned.sortedByCoord.out.bam output_prefix.gbt.pickle sample_RPF.genome.tag.bam\n```\nexample output:\n\n```\nr1\t16\t2L\t10716\t255\t29M\t*\t0\t0\tTACAATTTATTAAATGGGGACGGACCAAT\tIIIIIIIIIIIIIIIIIIIIIHIIDDDDD\tNH:i:1\tHI:i:1\tAS:i:28\tnM:i:0\tNM:i:0\tMD:Z:29\tjM:B:c,-1\tjI:B:i,-1\tPS:i:10\nr2\t16\t2L\t10836\t255\t33M\t*\t0\t0\tTGTCAACTTTTATCCTTTGTACCTTTCTACAAA\tIIIIIIIIIIIIIIIIIIIIIIIIIIIIDDDDD\tNH:i:1\tHI:i:1\tAS:i:32\tnM:i:0\tNM:i:0\tMD:Z:33\tjM:B:c,-1\tjI:B:i,-1\tPS:i:12\nr3\t16\t2L\t10891\t255\t30M\t*\t0\t0\tCGGGTAAAGGGTATAAAGTCACTACGCGAA\tGGD1HEC?HHIHHGF\u003c1?IHHDHHF0@DD0\tNH:i:1\tHI:i:1\tAS:i:29\tnM:i:0\tNM:i:0\tMD:Z:30\tjM:B:c,-1\tjI:B:i,-1\tPS:i:12\nr4\t16\t2L\t11027\t255\t31M\t*\t0\t0\tTTTCTGTTTGTATGTAAATCGCGTTTAATTT\tIIIIIIIIIIIIIIIIIIIIIIIIIIDDDDD\tNH:i:1\tHI:i:1\tAS:i:30\tnM:i:0\tNM:i:0\tMD:Z:31\tjM:B:c,-1\tjI:B:i,-1\tPS:i:12\nr5\t16\t2L\t11073\t255\t32M\t*\t0\t0\tCGTTCCTATTTTGCTGTCCCCGTTCGATTTTT\tIHHIIIIIIIIIIIIIIHIIIIHIIIIDDDDD\tNH:i:1\tHI:i:1\tAS:i:31\tnM:i:0\tNM:i:0\tMD:Z:32\tjM:B:c,-1\tjI:B:i,-1\tPS:i:12\nr6\t16\t2L\t11077\t255\t29M\t*\t0\t0\tCCTATTTTGCTGTCCCCGTTCGATTTTTA\t@CC?FCD\u003c\u003cCG\u003c/H@HDHHHIHCF0?@@D\tNH:i:1\tHI:i:1\tAS:i:28\tnM:i:0\tNM:i:0\tMD:Z:29\tjM:B:c,-1\tjI:B:i,-1\tPS:i:12\nr7\t16\t2L\t11132\t255\t31M\t*\t0\t0\tAAATTACATCAGGACTAGTACTCGTTTGCGT\tIIIHIIHIIIIIHIHIIIHHIGIIIIDDDBD\tNH:i:1\tHI:i:1\tAS:i:30\tnM:i:0\tNM:i:0\tMD:Z:31\tjM:B:c,-1\tjI:B:i,-1\tPS:i:11\nr8\t16\t2L\t11138\t255\t32M\t*\t0\t0\tCATCAGGACTAGTACTCGTTTGCGTCGTATTT\t1HHHIIHIHGIHIIIIIIHHIIHGGHHDDD@@\tNH:i:1\tHI:i:1\tAS:i:31\tnM:i:0\tNM:i:0\tMD:Z:32\tjM:B:c,-1\tjI:B:i,-1\tPS:i:12\nr9\t16\t2L\t11138\t255\t29M\t*\t0\t0\
tCATCAGGACTAGTACTCGTTTGCGTCGTA\tCFEGFHFCIHIHIIIHDIGIIHCHD\u003cBB?\tNH:i:1\tHI:i:1\tAS:i:28\tnM:i:0\tNM:i:0\tMD:Z:29\tjM:B:c,-1\tjI:B:i,-1\tPS:i:10\nr10\t16\t2L\t11140\t255\t32M\t*\t0\t0\tTCAGGACTAGTACTCGTTTGCGTCGTATTTCT\tFCCHHCHIHIHHE0HHHDECHIH?IHG@0@D@\tNH:i:1\tHI:i:1\tAS:i:31\tnM:i:0\tNM:i:0\tMD:Z:32\tjM:B:c,-1\tjI:B:i,-1\tPS:i:13\n```\n\nIt is also possible to output alignments with P-site locations only, which can be used for downstream applications such as translated ORF prediction with [RibORF](https://github.com/zhejilab/RibORF).\n\n```bash\npsite pbam -f sam genome.fa sample_RPF.Aligned.sortedByCoord.out.bam output_prefix.gbt.pickle sample_RPF.genome.psite.sam\n```\n\nHere are a few lines from an example output:\n\n```\nr1      16      1       531180  255     1M      *       0       0       G       J       NH:i:1  HI:i:1  AS:i:30 nM:i:0  NM:i:0  MD:Z:31\nr2      16      1       531180  255     1M      *       0       0       G       J       NH:i:1  HI:i:1  AS:i:30 nM:i:0  NM:i:0  MD:Z:31\nr3      0       1       629921  255     1M      *       0       0       A       J       NH:i:1  HI:i:1  AS:i:31 nM:i:1  NM:i:1  MD:Z:0C33\nr4      0       1       629921  255     1M      *       0       0       A       J       NH:i:1  HI:i:1  AS:i:31 nM:i:1  NM:i:1  MD:Z:0C33\nr5      0       1       629922  255     1M      *       0       0       T       J       NH:i:1  HI:i:1  AS:i:32 nM:i:0  NM:i:0  MD:Z:33\nr6      0       1       629922  255     1M      *       0       0       T       J       NH:i:1  HI:i:1  AS:i:29 nM:i:1  NM:i:1  MD:Z:0C31\nr7      0       1       629922  255     1M      *       0       0       T       J       NH:i:1  HI:i:1  AS:i:29 nM:i:1  NM:i:1  MD:Z:0C31\nr8      0       1       629922  255     1M      *       0       0       T       J       NH:i:1  HI:i:1  AS:i:29 nM:i:1  NM:i:1  MD:Z:0C31\nr9      0       1       629922  255     1M      *       0       0       T       J       NH:i:1  HI:i:1  AS:i:32 nM:i:0  NM:i:0  MD:Z:33\nr10     0       1 
      629922  255     1M      *       0       0       T       J       NH:i:1  HI:i:1  AS:i:30 nM:i:1  NM:i:1  MD:Z:0C32\n```\n\nPSite also has a module for fast calculation of genome-wide or transcriptome-wide P-site coverage of ribosome footprints. The alignments should be sorted by coordinate before coverage calculation.\n```bash\n# sort bam\nsamtools sort -@ 8 -O bam -o sample_RPF.genome.tag.sorted.bam sample_RPF.genome.tag.bam\n\n# calculate coverage\npsite coverage sample_RPF.genome.tag.sorted.bam sample_RPF.psite_cov\n```\n\nNEW: a complete example of how to run PSite and use PSite output for downstream analyses is available from the [repository associated with the PSite manuscript](https://github.com/gxelab/psite_analysis).\n\n---------------------------------------\n\n#### Other information\nPlease use the [issues](https://github.com/gxelab/psite/issues) panel for questions related to `PSite`, bug reports, or feature requests. If you use `PSite` in your work, you can cite it as follows:\n\n\u003e Chang, Y., Lei, T., Zhang, H., 2023. PSite: inference of read-specific P-site offsets for ribosomal footprints. bioRxiv, 2023.06.27.546788. https://doi.org/10.1101/2023.06.27.546788.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgxelab%2Fpsite","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgxelab%2Fpsite","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgxelab%2Fpsite/lists"}