https://github.com/vaishnavpvarma/vaxijen-antigenicity-parser

Automated parser for VaxiJen output 🚀 A lightweight Perl tool to extract and tabulate antigenicity predictions from VaxiJen. Designed for bioinformatics, immunoinformatics, and reverse vaccinology workflows, this script helps researchers process large datasets into clean, ready-to-analyze tables.



Host: GitHub
URL: https://github.com/vaishnavpvarma/vaxijen-antigenicity-parser
Owner: vaishnavpvarma
License: mit
Created: 2025-07-16T05:47:50.000Z (3 months ago)
Default Branch: main
Last Pushed: 2025-08-30T05:48:00.000Z (about 1 month ago)
Last Synced: 2025-08-30T07:17:40.477Z (about 1 month ago)
Topics: antigenicity-prediction, bioinformatics, computational-biology, excel-converter, fasta, immunoinformatics, immunology, peptide-analysis, perl-script, proteomics, proteomics-data-analysis, vaccine-development, vaxijen
Language: Perl
Homepage:
Size: 47.9 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE


# 🧬 VaxiJen Antigenicity Parser

---

## 🌟 Highlights
- 🚀 Automates tedious manual parsing of **VaxiJen** webserver outputs.
- 📊 Converts raw text outputs into structured **Excel spreadsheets**.
- 🧪 Handles **large datasets** where copy–pasting results is not humanly possible.
- 🐪 Written in **Perl**, a classic bioinformatics scripting language, optimized for **text parsing**.

---

## ℹ️ Overview
Many bioinformatics webservers (like **VaxiJen**, all versions) do not provide an option to **download results**.
Instead, they display predictions on the webpage, forcing researchers to manually copy, paste, and tabulate the results.

👉 This becomes impractical when working with **hundreds or thousands of peptides**, each output containing:
- A **score** (e.g. VaxiJen score, floating point)
- A **prediction** (e.g. *Probable ANTIGEN* / *Probable NON-ANTIGEN*)

These Perl scripts solve this problem by:
1. Reading raw `.txt` files saved from the webserver output.
2. Extracting IDs, peptide sequences, and prediction values using **Regular Expressions (RegEx)**.
3. Writing them neatly into an Excel file (`.xls` or `.xlsx`) for downstream analysis.

💡 In short:
**Unstructured text in → Structured spreadsheet out.**

**Before:** You have a text file that looks something like this:
```
>protein_sequence_001
AKFPQRSTUVWXYZAB
Some technical text here...
Overall Prediction for the Protective Antigen = 0.7234
More text...
(Probable ANTIGEN)
>protein_sequence_002
MNPQRSTUVWXYZDEF
...
```

**After:** You get a clean Excel file with columns:

| ID | Sequence | VaxiJen Score | Antigenicity |
|----|----------|---------------|--------------|
| protein_sequence_001 | AKFPQRSTUVWXYZAB | 0.7234 | Probable ANTIGEN |
| protein_sequence_002 | MNPQRSTUVWXYZDEF | 0.5621 | Probable NON-ANTIGEN |

---

## Key Features

- 🔍 **Smart Search**: Automatically finds protein sequences and their prediction scores, even when they're scattered across multiple lines
- 📊 **Excel Output**: Creates professional-looking spreadsheets with proper formatting and column headers
- 🛡️ **Error Handling**: Checks that input files exist and handles common errors gracefully
- 🔧 **Flexible**: Works with different sequence lengths (optimized for 16-letter sequences but adapts to others)
- ✅ **Data Validation**: Only captures valid protein sequences (sequences with only capital letters A-Z)

## How It Works (In Simple Terms)

Think of this script like a very patient assistant who:

1. **Reads every line** of your messy text file, one by one
2. **Looks for patterns** like sequence names (lines starting with ">") and protein sequences (lines with only capital letters)
3. **Connects the dots** between related information that might be several lines apart
4. **Organizes everything** into a neat table structure
5. **Creates a pretty Excel file** with proper formatting and headers

## Technical Approach

### Pattern Matching Strategy
The script uses **regular expressions** (pattern matching rules) to identify different types of data:
- `^>(.+)` finds sequence identifiers
- `^[A-Z]{16}$` finds 16-letter protein sequences
- `Overall Prediction.*= (-?[0-9.]+)` extracts numerical scores
- `Probable (ANTIGEN|NON-ANTIGEN)` captures classification results
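
To see the four patterns in action, here is a minimal sketch that classifies one line at a time. The helper name `classify_line` and the returned type names are made up for this illustration and are not taken from the script itself. Stripping trailing whitespace first is also an assumption, added so that files saved on Windows (with `\r\n` line endings) still match the anchored sequence pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: classify one line of saved VaxiJen output using the
# four patterns listed above. Returns a (type, value) pair, or an empty
# list when the line matches none of them.
sub classify_line {
    my ($line) = @_;
    $line =~ s/\s+$//;    # assumption: drop trailing whitespace, including \r

    return ( id       => $1 )            if $line =~ /^>(.+)/;
    return ( sequence => $line )         if $line =~ /^[A-Z]{16}$/;
    return ( score    => $1 )            if $line =~ /Overall Prediction.*= (-?[0-9.]+)/;
    return ( label    => "Probable $1" ) if $line =~ /Probable (ANTIGEN|NON-ANTIGEN)/;
    return;
}

# Lines taken from the "Before" example earlier in this README.
for my $line (
    ">protein_sequence_001",
    "AKFPQRSTUVWXYZAB",
    "Some technical text here...",
    "Overall Prediction for the Protective Antigen = 0.7234",
    "(Probable ANTIGEN)",
  )
{
    my ( $type, $value ) = classify_line($line);
    printf "%-8s %s\n", $type, $value if defined $type;
}
```

Lines that match no pattern (such as the "Some technical text here..." line) are simply skipped.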

### Search Algorithm
Uses a **forward-looking sequential search**:
- Processes the file line by line from top to bottom
- When it finds a sequence identifier, it searches the next 10 lines for the corresponding protein sequence
- Stops searching once it finds what it's looking for (efficient and prevents endless searching)
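
The search described above can be sketched roughly as follows. This is an illustration, not the script's actual code: the function name `parse_records`, the `$LOOKAHEAD` variable, and the hash keys are hypothetical. Two behaviours are assumptions that the description does not confirm: the lookahead stops early at the next `>` header, and scores and labels are attached to the most recent header. Lines are expected to have trailing whitespace already removed.

```perl
use strict;
use warnings;

my $LOOKAHEAD = 10;    # lines to scan after each ">" header

# Hypothetical sketch: turn an array of lines into one record per header.
sub parse_records {
    my ($lines) = @_;
    my @records;

    for my $i ( 0 .. $#$lines ) {
        my $line = $lines->[$i];

        if ( $line =~ /^>(.+)/ ) {
            my %rec = ( id => $1, sequence => '', score => '', label => '' );

            # Forward search: look at no more than $LOOKAHEAD following
            # lines and stop at the first one that looks like a sequence.
            my $last = $i + $LOOKAHEAD;
            $last = $#$lines if $last > $#$lines;
            for my $j ( $i + 1 .. $last ) {
                last if $lines->[$j] =~ /^>/;    # assumption: stop at next header
                if ( $lines->[$j] =~ /^([A-Z]{16})$/ ) {
                    $rec{sequence} = $1;
                    last;
                }
            }
            push @records, \%rec;
        }
        elsif (@records) {
            # Assumption: score and label belong to the most recent header.
            my $rec = $records[-1];
            $rec->{score} = $1 if $line =~ /Overall Prediction.*= (-?[0-9.]+)/;
            $rec->{label} = "Probable $1"
              if $line =~ /Probable (ANTIGEN|NON-ANTIGEN)/;
        }
    }
    return @records;
}

# Usage with the "Before" example from this README.
my @lines = split /\n/, <<'END';
>protein_sequence_001
AKFPQRSTUVWXYZAB
Some technical text here...
Overall Prediction for the Protective Antigen = 0.7234
More text...
(Probable ANTIGEN)
END

for my $r ( parse_records( \@lines ) ) {
    print join( "\t", @$r{qw(id sequence score label)} ), "\n";
}
```

Capping the inner loop at a fixed number of lines keeps a header with no sequence from pulling in a sequence that belongs to a much later record.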

### Why Not Use Simpler Tools?

**Question**: "Why use Perl instead of AWK or other text processing tools?"

**Answer**: While AWK would be simpler for just extracting text, this script needs to create formatted Excel files with headers, column widths, and styling. AWK can't do that directly - you'd need multiple tools. Perl handles both text parsing and Excel creation in one go, with robust error handling for research workflows.
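
As an illustration of the Excel side, the sketch below writes a header row, column widths, and data rows with `Excel::Writer::XLSX`. The function name `write_xlsx`, the worksheet name, and the column widths are assumptions chosen for this example; the actual script may format its output differently. The `eval`/`require` guard is only there so the sketch reports a missing module instead of crashing.

```perl
use strict;
use warnings;

# Hypothetical sketch: write rows to an .xlsx file with a bold header row.
# Returns 1 on success, 0 if Excel::Writer::XLSX is not installed.
sub write_xlsx {
    my ( $path, $rows ) = @_;

    eval { require Excel::Writer::XLSX; 1 } or return 0;

    my $workbook = Excel::Writer::XLSX->new($path)
      or die "Cannot create $path: $!";
    my $sheet  = $workbook->add_worksheet('VaxiJen');
    my $header = $workbook->add_format( bold => 1 );

    # Column widths are arbitrary values picked for readability.
    $sheet->set_column( 0, 0, 28 );    # ID
    $sheet->set_column( 1, 1, 22 );    # Sequence
    $sheet->set_column( 2, 3, 22 );    # VaxiJen Score, Antigenicity

    $sheet->write_row( 0, 0,
        [ 'ID', 'Sequence', 'VaxiJen Score', 'Antigenicity' ], $header );

    my $row = 1;
    $sheet->write_row( $row++, 0, $_ ) for @$rows;

    $workbook->close();
    return 1;
}

# Usage with the rows from the "After" table in this README.
write_xlsx(
    'vaxijen_results.xlsx',
    [
        [ 'protein_sequence_001', 'AKFPQRSTUVWXYZAB', '0.7234', 'Probable ANTIGEN' ],
        [ 'protein_sequence_002', 'MNPQRSTUVWXYZDEF', '0.5621', 'Probable NON-ANTIGEN' ],
    ]
) or warn "Excel::Writer::XLSX is not installed; see Prerequisites\n";
```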

## ⬇️ Prerequisites
1) Install **Perl 5**
- Linux/macOS: usually preinstalled
- Windows: install via **Strawberry Perl** → https://strawberryperl.com/

2) Install Perl modules:
```bash
cpan Excel::Writer::XLSX # for VaxiJen script (.xlsx)
cpan Spreadsheet::WriteExcel # for AllerTOP script (.xls)
```

## **Installation**

3) Clone the repo:

```bash
git clone https://github.com/vaishnavpvarma/vaxijen-antigenicity-parser.git
cd vaxijen-antigenicity-parser
```
## **🚀 Usage** (no command-line arguments)
1️⃣ VaxiJen → Excel (.xlsx)

Step A: Edit the file paths inside the script

Open `scripts/vaxijen_to_excel.pl` in your preferred text editor and set:
```perl
my $input_file = "path/to/vaxijen_output.txt";
my $output_file = "vaxijen_results.xlsx";
```
Step B: Run the script from a terminal, using the file name of the script you edited in Step A
```bash
perl scripts/vaxijen-antigenicity-parser.pl
```
💡 Tips for file paths (Windows):

- Prefer forward slashes: `C:/Users/Name/Desktop/input.txt`
- Or escape backslashes: `C:\\Users\\Name\\Desktop\\input.txt`
- A path with spaces needs no extra escaping as long as the whole path sits inside the Perl string quotes: `"C:/My Data/results.txt"`
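
The path tips above can be checked with a short, standalone sketch. The paths are the placeholder examples from the tips, not real files, so the existence check at the end is expected to report them as missing on most machines.

```perl
use strict;
use warnings;

# Forward slashes are accepted by Perl on Windows.
my $forward = "C:/Users/Name/Desktop/input.txt";

# In double quotes every backslash must be doubled; a single "\U" or "\N"
# would otherwise be read as an escape sequence.
my $escaped = "C:\\Users\\Name\\Desktop\\input.txt";

# In single quotes backslashes are literal (only \\ and \' are special).
my $single = 'C:\Users\Name\Desktop\input.txt';

# Spaces are fine as long as they are inside the string quotes.
my $spaces = "C:/My Data/results.txt";

print "escaped and single-quoted forms are identical\n"
  if $escaped eq $single;

# Check that an input path exists before trying to parse it.
for my $path ( $forward, $spaces ) {
    my $status = ( -e $path ) ? 'found' : 'missing';
    print "$status: $path\n";
}
```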

## Perfect For

- 🧬 Bioinformatics researchers working with protein predictions (Immunoinformatics)
- 📊 Anyone who needs to convert scientific text output into spreadsheet format
- 🎓 Students learning about data parsing and file processing
- 🔬 Labs that need to process VaxiJen antigenicity predictions regularly

_ _ _ _ _ _

## **More About VaxiJen**
VaxiJen was created by:
- [Prof. Irini Doytchinova](https://pharmfac.mu-sofia.bg/?page_id=5444&lang=en)
- [Darren Flower](https://www.linkedin.com/in/darrenflower/?originalSubdomain=uk)

## **Use VaxiJen**
- [VaxiJen v2.0](https://www.ddg-pharmfac.net/vaxijen/VaxiJen/VaxiJen.html)
- [VaxiJen v3.0](https://www.ddg-pharmfac.net/vaxijen3/home/)
- [Read More Here (BMC Bioinformatics, 2007)](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-4)
---

## ✍ Behind the Code (vaxijen-antigenicity-parser)

👨‍🔬 **Vaishnav P. Varma**
[GitHub Profile](https://github.com/vaishnavpvarma) | [LinkedIn Profile](https://www.linkedin.com/in/vaishnav-p-varma/)

💻 *Bioinformatician by training | 📸 Photographer by heart | ☕ Turning coffee & curiosity into code*

[![Buy Me a Coffee](https://img.shields.io/badge/Buy%20Me%20a%20Coffee-%E2%98%95-yellow)](https://buymeacoffee.com/vaishnavpvarma)

---

✨ Crafted with ❤️, code, and curry in India 🇮🇳
