https://github.com/vaishnavpvarma/vaxijen-antigenicity-parser
Automated parser for VaxiJen output. A lightweight Perl tool to extract and tabulate antigenicity predictions from VaxiJen. Designed for bioinformatics, immunoinformatics, and reverse vaccinology workflows, this script helps researchers process large datasets into clean, ready-to-analyze tables.
- Host: GitHub
- URL: https://github.com/vaishnavpvarma/vaxijen-antigenicity-parser
- Owner: vaishnavpvarma
- License: mit
- Created: 2025-07-16T05:47:50.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-08-30T05:48:00.000Z (about 1 month ago)
- Last Synced: 2025-08-30T07:17:40.477Z (about 1 month ago)
- Topics: antigenicity-prediction, bioinformatics, computational-biology, excel-converter, fasta, immunoinformatics, immunology, peptide-analysis, perl-script, proteomics, proteomics-data-analysis, vaccine-development, vaxijen
- Language: Perl
- Homepage:
- Size: 47.9 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
README
# 🧬 VaxiJen Antigenicity Parser
---
## Highlights
- Automates tedious manual parsing of **VaxiJen** webserver outputs.
- Converts raw text outputs into structured **Excel spreadsheets**.
- Handles **large datasets** where copy-pasting results by hand is impractical.
- Written in **Perl**, a classic bioinformatics scripting language well suited to **text parsing**.

---
## ℹ️ Overview
Many bioinformatics webservers, including **VaxiJen** (all versions), do not provide an option to **download results**.
Instead, they display predictions on the webpage, forcing researchers to manually copy, paste, and tabulate the results.

This becomes impractical when working with **hundreds or thousands of peptides**, each output containing:
- A **score** (e.g. VaxiJen score, floating point)
- A **prediction** (e.g. *Probable ANTIGEN* / *NON-ANTIGEN*)

These Perl scripts solve this problem by:
1. Reading raw `.txt` files saved from the webserver output.
2. Extracting IDs, peptide sequences, and prediction values using **Regular Expressions (RegEx)**.
3. Writing them neatly into an Excel file (`.xls` or `.xlsx`) for downstream analysis.

💡 In short:
**Unstructured text in → Structured spreadsheet out.**

**Before:** You have a text file that looks something like this:
```
>protein_sequence_001
AKFPQRSTUVWXYZAB
Some technical text here...
Overall Prediction for the Protective Antigen = 0.7234
More text...
(Probable ANTIGEN)
>protein_sequence_002
MNPQRSTUVWXYZDEF
...
```

**After:** You get a clean Excel file with columns:

| ID | Sequence | VaxiJen Score | Antigenicity |
|----|----------|---------------|--------------|
| protein_sequence_001 | AKFPQRSTUVWXYZAB | 0.7234 | Probable ANTIGEN |
| protein_sequence_002 | MNPQRSTUVWXYZDEF | 0.5621 | Probable NON-ANTIGEN |
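
For readers curious how rows like those in the table above might end up in a spreadsheet, here is a minimal sketch using `Excel::Writer::XLSX` (the module listed under Prerequisites for the VaxiJen script). It is not the repository's actual code: the helper name `write_results_xlsx`, the sheet name, and the column width are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: write parsed rows to an .xlsx file.
# Returns 1 on success, or 0 if Excel::Writer::XLSX is not installed.
sub write_results_xlsx {
    my ($path, @rows) = @_;

    # Load the module at run time so a missing install is reported cleanly.
    eval { require Excel::Writer::XLSX; 1 } or return 0;

    my $workbook = Excel::Writer::XLSX->new($path)
        or die "Cannot create $path: $!";
    my $sheet  = $workbook->add_worksheet('VaxiJen');    # sheet name is an assumption
    my $header = $workbook->add_format(bold => 1);

    # Header row, matching the columns of the table above.
    $sheet->write_row(0, 0,
        ['ID', 'Sequence', 'VaxiJen Score', 'Antigenicity'], $header);
    $sheet->set_column(0, 3, 24);    # widen columns A-D (width is arbitrary)

    # One spreadsheet row per record, starting below the header.
    my $r = 1;
    for my $row (@rows) {
        $sheet->write_row($r++, 0, $row);
    }

    $workbook->close;
    return 1;
}

write_results_xlsx(
    'vaxijen_results.xlsx',
    ['protein_sequence_001', 'AKFPQRSTUVWXYZAB', 0.7234, 'Probable ANTIGEN'],
    ['protein_sequence_002', 'MNPQRSTUVWXYZDEF', 0.5621, 'Probable NON-ANTIGEN'],
) or warn "Excel::Writer::XLSX is not installed; nothing written\n";
```

Loading the module inside `eval` is a design choice for this sketch only: it lets the example report a missing dependency instead of failing at compile time.
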

---
## Key Features
- **Smart Search**: Automatically finds protein sequences and their prediction scores, even when they are scattered across multiple lines
- **Excel Output**: Creates spreadsheets with formatted column headers and column widths
- **Error Handling**: Checks that input files exist and handles common errors gracefully
- **Flexible**: Works with different sequence lengths (optimized for 16-letter sequences but adapts to others)
- **Data Validation**: Only captures valid protein sequences (lines containing only capital letters A-Z)

## How It Works (In Simple Terms)
Think of this script like a very patient assistant who:
1. **Reads every line** of your messy text file, one by one
2. **Looks for patterns** like sequence names (lines starting with ">") and protein sequences (lines with only capital letters)
3. **Connects the dots** between related information that might be several lines apart
4. **Organizes everything** into a neat table structure
5. **Creates a pretty Excel file** with proper formatting and headers

## Technical Approach
### Pattern Matching Strategy
The script uses **regular expressions** (pattern matching rules) to identify different types of data:
- `^>(.+)` finds sequence identifiers
- `^[A-Z]{16}$` finds 16-letter protein sequences
- `Overall Prediction.*= (-?[0-9.]+)` extracts numerical scores
- `Probable (ANTIGEN|NON-ANTIGEN)` captures classification results

### Search Algorithm
Uses a **forward-looking sequential search**:
- Processes the file line by line from top to bottom
- When it finds a sequence identifier, it searches the next 10 lines for the corresponding protein sequence
- Stops searching once it finds what it's looking for (efficient and prevents endless searching)

### Why Not Use Simpler Tools?
**Question**: "Why use Perl instead of AWK or other text processing tools?"
**Answer**: While AWK would be simpler for just extracting text, this script needs to create formatted Excel files with headers, column widths, and styling. AWK can't do that directly - you'd need multiple tools. Perl handles both text parsing and Excel creation in one go, with robust error handling for research workflows.
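
To make the pattern matching and search strategy described in this section more concrete, here is a minimal sketch that applies the four regular expressions listed above with a 10-line forward search. It is an illustration, not the repository's script. The helper name `parse_vaxijen_text` is hypothetical, and two behaviours are assumptions: the 10-line window is applied to the score and prediction as well as the sequence, and the search stops early if the next `>` header appears.

```perl
use strict;
use warnings;

# Hypothetical helper: parse VaxiJen-style text into a list of records.
# Each record is a hash ref with id, sequence, score, and prediction keys.
sub parse_vaxijen_text {
    my ($text) = @_;
    my @lines = split /\r?\n/, $text;
    my @records;

    for my $i (0 .. $#lines) {
        # A line starting with ">" is a sequence identifier.
        next unless $lines[$i] =~ /^>(.+)/;
        my %rec = (id => $1, sequence => '', score => '', prediction => '');

        # Forward-looking search: inspect at most the next 10 lines.
        my $last = $i + 10 > $#lines ? $#lines : $i + 10;
        for my $j ($i + 1 .. $last) {
            my $line = $lines[$j];
            last if $line =~ /^>/;    # assumption: next record begins, stop

            if ($rec{sequence} eq '' && $line =~ /^([A-Z]{16})$/) {
                $rec{sequence} = $1;
            }
            elsif ($rec{score} eq ''
                && $line =~ /Overall Prediction.*= (-?[0-9.]+)/) {
                $rec{score} = $1;
            }
            elsif ($rec{prediction} eq ''
                && $line =~ /Probable (ANTIGEN|NON-ANTIGEN)/) {
                $rec{prediction} = "Probable $1";
            }

            # Stop as soon as every field has been found.
            last if $rec{sequence} ne ''
                 && $rec{score} ne ''
                 && $rec{prediction} ne '';
        }
        push @records, \%rec;
    }
    return @records;
}

my $sample = <<'END';
>protein_sequence_001
AKFPQRSTUVWXYZAB
Some technical text here...
Overall Prediction for the Protective Antigen = 0.7234
More text...
(Probable ANTIGEN)
END

# Print one tab-separated line per record.
for my $r (parse_vaxijen_text($sample)) {
    print join("\t", @{$r}{qw(id sequence score prediction)}), "\n";
}
```

Fields that are not found stay as empty strings, so an incomplete record still yields a row that can be spotted and checked by hand.
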
## ⬇️ Prerequisites
1) Install **Perl 5**
- Linux/macOS: usually preinstalled
- Windows: install via **Strawberry Perl** → https://strawberryperl.com/

2) Install Perl modules:
```bash
cpan Excel::Writer::XLSX # for VaxiJen script (.xlsx)
cpan Spreadsheet::WriteExcel # for AllerTOP script (.xls)
```

## **Installation**
3) Clone the repo:
```bash
git clone https://github.com/vaishnavpvarma/vaxijen-antigenicity-parser.git
cd vaxijen-antigenicity-parser
```
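
Once the modules are installed, a quick check that Perl can actually load them may save a confusing failure later. The snippet below is a sketch of one way to do that; the helper name `missing_modules` is hypothetical and not part of the repository.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of the given modules that fail to load.
sub missing_modules {
    my @wanted = @_;
    my @missing;
    for my $module (@wanted) {
        # Turn Some::Module into Some/Module.pm, the path form require accepts.
        (my $file = "$module.pm") =~ s{::}{/}g;
        push @missing, $module unless eval { require $file; 1 };
    }
    return @missing;
}

my @missing = missing_modules('Excel::Writer::XLSX', 'Spreadsheet::WriteExcel');
if (@missing) {
    print "Missing: @missing (install with cpan)\n";
}
else {
    print "All required modules are installed\n";
}
```
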
## **Usage** (no command-line arguments)
1️⃣ VaxiJen → Excel (.xlsx)

Step A: Edit file paths inside the script
Open scripts/vaxijen_to_excel.pl in your preferred text editor and set:
```
my $input_file = "path/to/vaxijen_output.txt";
my $output_file = "vaxijen_results.xlsx";
```
Step B: Run the script from a terminal
```
perl scripts/vaxijen-antigenicity-parser.pl
```
💡 Tips for file paths (Windows):
> Prefer forward slashes: `C:/Users/Name/Desktop/input.txt`
> Or escape backslashes: `C:\\Users\\Name\\Desktop\\input.txt`
> If a path has spaces, keep the whole path inside the quoted Perl string: `"C:/My Data/results.txt"`
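
The reason for these tips is that a double-quoted Perl string interprets backslash sequences such as `\t` or `\U`, so an unescaped Windows path can silently change. The sketch below shows the equivalent spellings side by side, plus a small existence check in the spirit of the error handling mentioned under Key Features. The helper `check_input` is hypothetical and the paths are placeholders.

```perl
use strict;
use warnings;

# Three ways to write a Windows path in Perl source.
my $forward = "C:/Users/Name/Desktop/input.txt";        # forward slashes
my $escaped = "C:\\Users\\Name\\Desktop\\input.txt";    # escaped backslashes
my $single  = 'C:\Users\Name\Desktop\input.txt';        # single quotes: backslashes stay literal

# The escaped and single-quoted forms hold identical strings.
print "escaped and single-quoted forms match\n" if $escaped eq $single;

# Spaces need no special handling inside a Perl string.
my $spaced = "C:/My Data/results.txt";

# Hypothetical guard: report whether the input file exists before parsing.
sub check_input {
    my ($path) = @_;
    return (-e $path) ? 1 : 0;
}

for my $path ($forward, $spaced) {
    print check_input($path) ? "found: $path\n" : "not found: $path\n";
}
```
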
## Perfect For
- Bioinformatics researchers working with protein predictions (Immunoinformatics)
- Anyone who needs to convert scientific text output into spreadsheet format
- Students learning about data parsing and file processing
- Labs that need to process VaxiJen antigenicity predictions regularly

---
## **More About VaxiJen**
Actual Creators of VaxiJen:
- [Prof. Irini Doychinova](https://pharmfac.mu-sofia.bg/?page_id=5444&lang=en)
- [Darren Flower](https://www.linkedin.com/in/darrenflower/?originalSubdomain=uk)

## **Use VaxiJen**
- [VaxiJen v2.0](https://www.ddg-pharmfac.net/vaxijen/VaxiJen/VaxiJen.html)
- [VaxiJen v3.0](https://www.ddg-pharmfac.net/vaxijen3/home/)
- [Read More Here (BMC Bioinformatics, 2007)](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-4)

---

## Behind the Code (vaxijen-antigenicity-parser)
👨‍🔬 **Vaishnav P. Varma**
[GitHub Profile](https://github.com/vaishnavpvarma) [LinkedIn Profile](https://www.linkedin.com/in/vaishnav-p-varma/)

💻 *Bioinformatician by training | 📸 Photographer by heart | ☕ Turning coffee & curiosity into code*
[Buy Me a Coffee](https://buymeacoffee.com/vaishnavpvarma)
---
✨ Crafted with ❤️, code, and curry in India 🇮🇳