https://github.com/mt1022/cubar
R Package for Codon Usage Bias Analysis. Comprehensive documentation and tutorials are available at:
https://github.com/mt1022/cubar
bioinformatics codon-usage machine-learning r-package sequence-analysis
Last synced: 4 months ago
- Host: GitHub
- URL: https://github.com/mt1022/cubar
- Owner: mt1022
- License: other
- Created: 2021-04-20T13:22:41.000Z (almost 5 years ago)
- Default Branch: main
- Last Pushed: 2025-10-04T03:06:11.000Z (4 months ago)
- Last Synced: 2025-10-22T03:51:44.065Z (4 months ago)
- Topics: bioinformatics, codon-usage, machine-learning, r-package, sequence-analysis
- Language: R
- Homepage: https://mt1022.github.io/cubar/
- Size: 23.3 MB
- Stars: 10
- Watchers: 1
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: NEWS.md
- License: LICENSE
# cubar
> **Comprehensive Codon Usage Bias Analysis in R**
[CRAN](https://CRAN.R-project.org/package=cubar) • [DOI: 10.5281/zenodo.10155990](https://doi.org/10.5281/zenodo.10155990) • [Lifecycle: stable](https://lifecycle.r-lib.org/articles/stages.html#stable)
## Table of Contents
- [Overview](#overview)
- [Features](#features)
  - [🧬 Codon-Level Analysis](#-codon-level-analysis)
  - [📊 Gene-Level Metrics](#-gene-level-metrics)
  - [🛠️ Utilities \& Tools](#️-utilities--tools)
- [Why Choose cubar?](#why-choose-cubar)
- [Installation](#installation)
  - [Stable Release (Recommended)](#stable-release-recommended)
  - [Development Version](#development-version)
  - [Dependencies](#dependencies)
- [Documentation \& Tutorials](#documentation--tutorials)
  - [🎯 Getting Started](#-getting-started)
  - [📚 Advanced Topics](#-advanced-topics)
- [Example Workflow](#example-workflow)
- [🆘 Getting Help](#-getting-help)
- [Related Packages](#related-packages)
- [License](#license)
- [Acknowledgments](#acknowledgments)
- [Citation](#citation)
## Overview
Codon usage bias refers to the non-uniform usage of synonymous codons (codons that encode the same amino acid) across different organisms, genes, and functional categories. **cubar** is a comprehensive R package for analyzing codon usage bias in coding sequences. It provides a unified framework for calculating established codon usage metrics, conducting sliding-window analyses or differential usage analyses, and optimizing sequences for heterologous expression.
## Features
### 🧬 Codon-Level Analysis
- **RSCU calculation**: Relative synonymous codon usage analysis
- **Amino acid usage**: Frequency of each amino acid in sequences
- **Codon weights**: Calculate weights based on gene expression, tRNA availability, and mRNA stability
- **Optimal codon inference**: Machine learning-based identification of optimal codons
- **Codon-anticodon visualization**: Visualization of codon-tRNA pairing relationships
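As a rough illustration of the first item, the RSCU of a codon is its observed count divided by the mean count of all codons in its synonymous family. The sketch below applies that textbook formula in base R to a made-up sequence. The sequence, the `families` list, and all variable names are illustrative assumptions; cubar's own `est_rscu()` operates on codon count matrices and may differ in details such as pseudocounts.

```r
# Toy coding sequence (hypothetical, not taken from the package data)
cds <- "ATGGCTGCTGCCGCAAAGAAAAAGTAA"

# Split the sequence into consecutive, non-overlapping codons
starts <- seq(1, nchar(cds), by = 3)
codons <- substring(cds, starts, starts + 2)

# Two synonymous families from the standard genetic code
families <- list(
  Ala = c("GCT", "GCC", "GCA", "GCG"),
  Lys = c("AAA", "AAG")
)

# RSCU = observed count / mean count within the synonymous family
rscu <- unlist(lapply(families, function(fam) {
  counts <- sapply(fam, function(cd) sum(codons == cd))
  counts / mean(counts)
}))

print(round(rscu, 3))
```

A value above 1 means the codon is used more often than expected under uniform synonymous usage, and a value below 1 means it is used less often.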
### 📊 Gene-Level Metrics
- **Codon frequency tabulation**: Count codon occurrences across sequences
- **CAI (Codon Adaptation Index)**: Measure similarity to highly expressed genes
- **ENC (Effective Number of Codons)**: Assess codon usage bias strength
- **Fop (Fraction of Optimal codons)**: Calculate proportion of optimal codons
- **tAI (tRNA Adaptation Index)**: Match codon usage to tRNA availability
- **CSCg (Codon Stabilization Coefficients)**: Quantify mRNA stability effects
- **Dp (Deviation from Proportionality)**: Analyze virus-host codon usage relationships
- **GC content metrics**: Overall GC, GC3s (3rd codon positions), GC4d (4-fold degenerate sites)
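To make two of these metrics concrete, the following base R sketch computes a CAI-style score as the geometric mean of per-codon weights, plus the GC fraction at third codon positions. The weights in `w` and the codons in `gene_codons` are invented for illustration, and the GC calculation is simplified (it does not filter out Met, Trp, or stop codons as GC3s does); the package's `get_cai()` and `get_gc3s()` should be preferred for real analyses.

```r
# Hypothetical relative-adaptiveness weights (1 = most favoured codon in its family)
w <- c(GCT = 1.0, GCC = 0.5, GCA = 0.25, GCG = 0.1,
       AAA = 0.5, AAG = 1.0)

# Codons of a toy gene (assumed for illustration)
gene_codons <- c("GCT", "GCC", "AAG", "AAA")

# CAI-style score: geometric mean of the weights of the codons used
cai <- exp(mean(log(w[gene_codons])))

# Fraction of G or C at the third position of these codons
gc3 <- mean(substr(gene_codons, 3, 3) %in% c("G", "C"))

print(c(CAI = cai, GC3 = gc3))
```

The geometric mean is used so that a single strongly disfavoured codon lowers the score more than an arithmetic mean would.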
### 🛠️ Utilities & Tools
- **Sliding window analysis**: Positional codon usage patterns within genes
- **Sequence optimization**: Redesign sequences for optimal expression
- **Differential codon usage**: Statistical comparison between sequence sets
- **Quality control**: Comprehensive CDS validation and preprocessing
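The idea behind a sliding-window analysis can be sketched in a few lines of base R: move a fixed-width window along the codons of a gene and compute a statistic in each window. The helper `slide_stat()`, its arguments, and the toy codons below are hypothetical and do not reproduce the interface of cubar's own sliding-window functions.

```r
# Apply `fun` to successive windows of `width` codons, advancing by `step` codons
slide_stat <- function(codons, width, step, fun) {
  starts <- seq(1, length(codons) - width + 1, by = step)
  vapply(starts, function(s) fun(codons[s:(s + width - 1)]), numeric(1))
}

# Example statistic: fraction of G or C at the third codon position
gc3 <- function(codons) mean(substr(codons, 3, 3) %in% c("G", "C"))

# Toy gene split into codons (assumed for illustration)
codons <- c("ATG", "GCT", "GCC", "AAG", "AAA", "GCG", "GCA", "AAG", "TAA")

# Non-overlapping windows of three codons
window_gc3 <- slide_stat(codons, width = 3, step = 3, fun = gc3)
print(round(window_gc3, 3))
```

Choosing a `step` smaller than `width` gives overlapping windows and a smoother positional profile, at the cost of correlated neighbouring values.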
## Why Choose cubar?
- **🚀 High Performance**: Process large datasets (>100,000 sequences) efficiently using optimized `Biostrings` and `data.table` backends
- **🧬 Flexible Genetic Codes**: Support for all NCBI genetic codes plus custom genetic code tables
- **🔗 R Ecosystem Integration**: Seamlessly integrate with other bioinformatics and data analysis packages
- **📚 Comprehensive Documentation**: Extensive tutorials, examples, and theoretical background
- **🔬 Research Ready**: Implements established metrics with proper citations and validation
## Installation
### Stable Release (Recommended)
Install the latest stable version from CRAN:
```r
install.packages("cubar")
```
### Development Version
Install the latest development version from GitHub:
```r
# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
# Install cubar from GitHub
devtools::install_github("mt1022/cubar", dependencies = TRUE)
```
### Dependencies
**System Requirements:**
- R (≥ 4.1.0)
**Required Packages:**
- `Biostrings` (≥ 2.60.0) - Bioconductor package for sequence manipulation
- `IRanges` (≥ 2.34.0) - Bioconductor infrastructure for range operations
- `data.table` (≥ 1.14.0) - High-performance data manipulation
- `ggplot2` (≥ 3.3.5) - Data visualization
- `rlang` (≥ 0.4.11) - Language tools
**Note:** Bioconductor packages will be installed automatically, but you may need to update your R installation if you encounter compatibility issues.
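If installation fails, it can help to check which of the listed dependencies are already available. The snippet below is a minimal sketch using base R only; the variable names are illustrative, and the commented `BiocManager` commands show one common way to install Bioconductor packages manually, which may or may not be needed on your system.

```r
# Check the R version against the stated minimum
r_ok <- getRversion() >= "4.1.0"
message("R version ", as.character(getRversion()),
        if (r_ok) " meets" else " does not meet", " the 4.1.0 requirement")

# Check which required packages can be loaded
deps <- c("Biostrings", "IRanges", "data.table", "ggplot2", "rlang")
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)
not_found <- deps[!installed]

if (length(not_found) > 0) {
  message("Missing packages: ", paste(not_found, collapse = ", "))
  # Bioconductor packages can be installed manually, for example:
  # install.packages("BiocManager")
  # BiocManager::install(c("Biostrings", "IRanges"))
} else {
  message("All listed dependencies are available.")
}
```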
## Documentation & Tutorials
📖 **Complete documentation** is available within R (`?function_name`) and on our [**package website**](https://mt1022.github.io/cubar/).
### 🎯 Getting Started
- [**Introduction to cubar**](https://mt1022.github.io/cubar/articles/cubar.html) - Basic usage and core functionality
- [**Non-standard Genetic Codes**](https://mt1022.github.io/cubar/articles/non_standard_genetic_code.html) - Working with alternative genetic codes
- [**Codon Optimization**](https://mt1022.github.io/cubar/articles/codon_optimization.html) - Sequence optimization strategies
### 📚 Advanced Topics
- [**Mathematical Foundations**](https://mt1022.github.io/cubar/articles/theory.html) - Detailed theory behind the metrics
- [**Function Reference**](https://mt1022.github.io/cubar/reference/) - Complete function documentation
## Example Workflow
Here's a toy example demonstrating key functionality:
```r
library(cubar)
library(ggplot2)
# 1. Load and quality-check sequences
data(yeast_cds)
clean_cds <- check_cds(yeast_cds)
# 2. Calculate codon frequencies
codon_freq <- count_codons(clean_cds)
# 3. Calculate multiple metrics
enc <- get_enc(codon_freq) # Effective number of codons
gc3s <- get_gc3s(codon_freq) # GC content at 3rd positions
# 4. Calculate CAI with RSCU of highly expressed genes
data(yeast_exp)
yeast_exp <- yeast_exp[yeast_exp$gene_id %in% rownames(codon_freq), ]
high_expr <- head(yeast_exp[order(-yeast_exp$fpkm), ], 500)
rscu_high <- est_rscu(codon_freq[high_expr$gene_id, ])
cai <- get_cai(codon_freq, rscu_high)
# 5. Visualize results
df <- data.frame(ENC = enc, CAI = cai, GC3s = gc3s)
ggplot(df, aes(color = GC3s, x = ENC, y = CAI)) +
  geom_point(alpha = 0.6) +
  scale_color_viridis_c() +
  labs(title = "Codon Usage Bias Relationships",
       x = "Effective Number of Codons", y = "Codon Adaptation Index")
```
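Besides plotting, the same metrics can be summarised numerically. The sketch below reuses only the calls shown in the workflow above and adds base R's `summary()` and `cor()`. It assumes, as the `data.frame()` call in the workflow implies, that `get_enc()` and `get_gc3s()` return numeric vectors of equal length, and it is wrapped in a guard so that it is skipped when cubar is not installed.

```r
if (requireNamespace("cubar", quietly = TRUE)) {
  library(cubar)

  # Same preprocessing steps as in the workflow above
  data(yeast_cds)
  clean_cds <- check_cds(yeast_cds)
  codon_freq <- count_codons(clean_cds)

  enc <- get_enc(codon_freq)
  gc3s <- get_gc3s(codon_freq)

  # Distribution of ENC across genes
  print(summary(enc))

  # Rank correlation between ENC and GC3s
  print(cor(enc, gc3s, method = "spearman", use = "complete.obs"))
} else {
  message("cubar is not installed; skipping the example.")
}
```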
## 🆘 Getting Help
- **📋 GitHub Issues**: [Report bugs, request features, or ask questions](https://github.com/mt1022/cubar/issues)
- **📖 Documentation**: Check function help (`?function_name`) and [online docs](https://mt1022.github.io/cubar/)
## Related Packages
For complementary analysis, consider these R packages:
- **[Biostrings](https://bioconductor.org/packages/release/bioc/html/Biostrings.html)** - Sequence input/output and manipulation
- **[Peptides](https://github.com/dosorio/Peptides)** - Peptide and protein property calculations
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- The R and Bioconductor communities for excellent foundational packages
- Contributors and users who have provided feedback and improvements
- **[GitHub Education](https://education.github.com/)** for providing free access to development tools
- **GitHub Copilot** was used to suggest code snippets during development
## Citation
If you use cubar in your research, please cite:
> Mengyue Liu, Bu Zi, Hebin Zhang, Hong Zhang, cubar: a versatile package for codon usage bias analysis in R, Genetics, 2025, iyaf191, https://doi.org/10.1093/genetics/iyaf191
Please also cite the original studies associated with any codon usage metrics or third-party software you use. You can find the relevant references in the documentation of the corresponding functions (for example, type `?cubar::get_enc` in the R console and check the "References" section in the help page).
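The citation details can also be retrieved from within R with the base function `citation()`. This is a small sketch that assumes the installed package ships citation metadata; the guard avoids an error when cubar is not installed.

```r
if (requireNamespace("cubar", quietly = TRUE)) {
  # Print the citation entry and a BibTeX version of it
  cubar_citation <- citation("cubar")
  print(cubar_citation)
  print(toBibtex(cubar_citation))
} else {
  message("cubar is not installed; see the Installation section.")
}
```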
---
**[📚 Documentation](https://mt1022.github.io/cubar/) • [🐛 Report Bug](https://github.com/mt1022/cubar/issues) • [💡 Request Feature](https://github.com/mt1022/cubar/issues)**