https://github.com/vaishnavpvarma/vaxijen-antigenicity-parser

Automated parser for VaxiJen output 🚀 A lightweight Perl tool to extract and tabulate antigenicity predictions from VaxiJen. Designed for bioinformatics, immunoinformatics, and reverse vaccinology workflows, this script helps researchers process large datasets into clean, ready-to-analyze tables.


# 🧬 VaxiJen Antigenicity Parser

---

## 🌟 Highlights
- 🚀 Automates tedious manual parsing of **VaxiJen** webserver outputs.
- 📊 Converts raw text outputs into structured **Excel spreadsheets**.
- 🧪 Handles **large datasets** where copy–pasting results is not humanly possible.
- 🐪 Written in **Perl**, a classic bioinformatics scripting language, optimized for **text parsing**.

---

## ℹ️ Overview
Many bioinformatics webservers (such as **VaxiJen**, all versions) do not provide an option to **download results**.
Instead, they display predictions on the webpage, forcing researchers to manually copy, paste, and tabulate the results.

👉 This becomes impractical when working with **hundreds or thousands of peptides**, where each result contains:
- A **score** (e.g. VaxiJen score, floating point)
- A **prediction** (e.g. *Probable ANTIGEN* / *NON-ANTIGEN*)

These Perl scripts solve this problem by:
1. Reading raw `.txt` files saved from the webserver output.
2. Extracting IDs, peptide sequences, and prediction values using **Regular Expressions (RegEx)**.
3. Writing them neatly into an Excel file (`.xls` or `.xlsx`) for downstream analysis.

💡 In short:
**Unstructured text in → Structured spreadsheet out.**

**Before:** You have a text file that looks something like this:
```
>protein_sequence_001
AKFPQRSTUVWXYZAB
Some technical text here...
Overall Prediction for the Protective Antigen = 0.7234
More text...
(Probable ANTIGEN)
>protein_sequence_002
MNPQRSTUVWXYZDEF
...
```

**After:** You get a clean Excel file with columns:
| ID | Sequence | VaxiJen Score | Antigenicity |
|----|----------|---------------|--------------|
| protein_sequence_001 | AKFPQRSTUVWXYZAB | 0.7234 | Probable ANTIGEN |
| protein_sequence_002 | MNPQRSTUVWXYZDEF | 0.5621 | Probable NON-ANTIGEN |

---

## Key Features

- 🔍 **Smart Search**: Automatically finds protein sequences and their prediction scores, even when they're scattered across multiple lines
- 📊 **Excel Output**: Creates spreadsheets with formatted column headers and sensible column widths
- 🛡️ **Error Handling**: Checks that input files exist and reports common errors gracefully
- 🔧 **Flexible**: Works with different sequence lengths (optimized for 16-letter sequences but adapts to others)
- ✅ **Data Validation**: Only captures valid protein sequences (lines consisting solely of capital letters A-Z)

## How It Works (In Simple Terms)

Think of this script like a very patient assistant who:

1. **Reads every line** of your messy text file, one by one
2. **Looks for patterns** like sequence names (lines starting with ">") and protein sequences (lines with only capital letters)
3. **Connects the dots** between related information that might be several lines apart
4. **Organizes everything** into a neat table structure
5. **Creates a pretty Excel file** with proper formatting and headers
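
The steps above can be sketched in a few lines of core Perl. This is a minimal illustration under stated assumptions, not the repository's script: the sub name `parse_vaxijen_text` and the record keys are hypothetical, the sample text mirrors the "Before" example in this README, and the rows are printed tab-separated instead of being written to Excel.

```perl
use strict;
use warnings;

# Hypothetical helper: turn raw VaxiJen-style text into a list of records.
sub parse_vaxijen_text {
    my ($text) = @_;
    my (@records, $current);
    for my $line (split /\r?\n/, $text) {
        if ($line =~ /^>(.+)/) {
            # A new identifier starts a new record.
            $current = { id => $1, sequence => '', score => '', antigenicity => '' };
            push @records, $current;
        }
        elsif (!defined $current) {
            next;    # ignore anything before the first identifier
        }
        elsif ($line =~ /^[A-Z]+$/ && $current->{sequence} eq '') {
            $current->{sequence} = $line;    # first all-capitals line is the sequence
        }
        elsif ($line =~ /Overall Prediction.*= (-?[0-9.]+)/) {
            $current->{score} = $1;
        }
        elsif ($line =~ /Probable (ANTIGEN|NON-ANTIGEN)/) {
            $current->{antigenicity} = "Probable $1";
        }
    }
    return @records;
}

my $sample = <<'END';
>protein_sequence_001
AKFPQRSTUVWXYZAB
Some technical text here...
Overall Prediction for the Protective Antigen = 0.7234
More text...
(Probable ANTIGEN)
END

# Print one tab-separated row per record.
for my $r (parse_vaxijen_text($sample)) {
    print join("\t", @{$r}{qw(id sequence score antigenicity)}), "\n";
}
```

Keeping one hash per record means lines that arrive in any order after the identifier still end up in the same row.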

## Technical Approach

### Pattern Matching Strategy
The script uses **regular expressions** (pattern matching rules) to identify different types of data:
- `^>(.+)` finds sequence identifiers
- `^[A-Z]{16}$` finds 16-letter protein sequences
- `Overall Prediction.*= (-?[0-9.]+)` extracts numerical scores
- `Probable (ANTIGEN|NON-ANTIGEN)` captures classification results
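
To see what each of these patterns captures, the sketch below runs them against single lines. The helper `classify_line` and the `%pattern` table are illustrative names, not part of the script, and the sequence pattern has a capture group added so the matched text can be returned.

```perl
use strict;
use warnings;

# The four patterns listed above, precompiled with qr//.
# Assumption: a capture group is added to the sequence pattern for this demo.
my %pattern = (
    id         => qr/^>(.+)/,
    sequence   => qr/^([A-Z]{16})$/,
    score      => qr/Overall Prediction.*= (-?[0-9.]+)/,
    prediction => qr/Probable (ANTIGEN|NON-ANTIGEN)/,
);

# Hypothetical helper: return (pattern name, captured text) for the first
# pattern that matches, or an empty list if none does.
sub classify_line {
    my ($line) = @_;
    for my $name (qw(id sequence score prediction)) {
        return ($name, $1) if $line =~ $pattern{$name};
    }
    return;
}

for my $line ('>protein_sequence_001',
              'AKFPQRSTUVWXYZAB',
              'Overall Prediction for the Protective Antigen = 0.7234',
              '(Probable ANTIGEN)',
              'Some technical text here...') {
    my ($type, $value) = classify_line($line);
    if (defined $type) { print "$type: $value\n" }
    else               { print "no match: $line\n" }
}
```

Note that `[0-9.]+` is permissive: it would also accept a malformed value such as `1.2.3`, so a stricter pattern may be worth considering if the input is untrusted.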

### Search Algorithm
Uses a **forward-looking sequential search**:
- Processes the file line by line from top to bottom
- When it finds a sequence identifier, it searches the next 10 lines for the corresponding protein sequence
- Stops searching once it finds what it's looking for (efficient and prevents endless searching)
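
A bounded look-ahead of this kind might look as follows. This is a sketch under assumptions: the helper name `find_sequence_after` is hypothetical, and stopping at the next `>` line is an extra safeguard added here rather than something the description above states.

```perl
use strict;
use warnings;

# Hypothetical helper: starting just after line $start, scan at most
# $window lines for a 16-letter sequence and stop at the first hit.
sub find_sequence_after {
    my ($lines, $start, $window) = @_;
    my $last = $start + $window;
    $last = $#{$lines} if $last > $#{$lines};    # do not run past the end
    for my $i ($start + 1 .. $last) {
        last if $lines->[$i] =~ /^>/;    # assumption: never cross into the next entry
        return $lines->[$i] if $lines->[$i] =~ /^[A-Z]{16}$/;
    }
    return;    # nothing found inside the window
}

my @lines = (
    '>protein_sequence_001',
    'Some technical text here...',
    'AKFPQRSTUVWXYZAB',
    '>protein_sequence_002',
    'MNPQRSTUVWXYZDEF',
);

for my $i (0 .. $#lines) {
    next unless $lines[$i] =~ /^>(.+)/;
    my $id  = $1;    # copy the capture before another regex overwrites it
    my $seq = find_sequence_after(\@lines, $i, 10);
    print "$id\t", (defined $seq ? $seq : 'not found'), "\n";
}
```

The fixed window keeps the cost per identifier constant, at the price of missing a sequence that appears more than 10 lines after its header.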

### Why Not Use Simpler Tools?

**Question**: "Why use Perl instead of AWK or other text processing tools?"

**Answer**: While AWK would be simpler for just extracting text, this script needs to create formatted Excel files with headers, column widths, and styling. AWK can't do that directly - you'd need multiple tools. Perl handles both text parsing and Excel creation in one go, with robust error handling for research workflows.
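
For illustration, the Excel side could look roughly like this with `Excel::Writer::XLSX`. It is a sketch, not the script's actual code: the helper name `write_results_xlsx`, the worksheet name, and the column width of 24 are assumptions, and the helper simply returns 0 if the module is not installed.

```perl
use strict;
use warnings;

# Hypothetical helper: write rows to an .xlsx file with a bold header row.
# Returns 1 on success, 0 if Excel::Writer::XLSX is not installed.
sub write_results_xlsx {
    my ($path, $rows) = @_;
    return 0 unless eval { require Excel::Writer::XLSX; 1 };

    my $workbook = Excel::Writer::XLSX->new($path)
        or die "Cannot create $path: $!";
    my $sheet  = $workbook->add_worksheet('VaxiJen');
    my $header = $workbook->add_format(bold => 1);

    $sheet->set_column(0, 3, 24);    # columns A-D, 24 characters wide
    $sheet->write_row(0, 0, ['ID', 'Sequence', 'VaxiJen Score', 'Antigenicity'], $header);

    # One spreadsheet row per result, starting below the header.
    my $row = 1;
    $sheet->write_row($row++, 0, $_) for @{$rows};

    $workbook->close;
    return 1;
}

my $ok = write_results_xlsx('vaxijen_results.xlsx', [
    ['protein_sequence_001', 'AKFPQRSTUVWXYZAB', 0.7234, 'Probable ANTIGEN'],
]);
print $ok ? "Wrote vaxijen_results.xlsx\n"
          : "Excel::Writer::XLSX is not installed\n";
```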

## ⬇️ Prerequisites
1) Install **Perl 5**
- Linux/macOS: usually preinstalled
- Windows: install via **Strawberry Perl** → https://strawberryperl.com/

2) Install Perl modules:
```bash
cpan Excel::Writer::XLSX # for VaxiJen script (.xlsx)
cpan Spreadsheet::WriteExcel # for AllerTOP script (.xls)
```
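
To confirm that the modules are visible to your Perl before running the scripts, a short check like the one below can help. The helper name `module_available` is illustrative and not part of this repository.

```perl
use strict;
use warnings;

# Hypothetical helper: true (1) if the named module can be loaded, else 0.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # Excel::Writer::XLSX -> Excel/Writer/XLSX.pm
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module (qw(Excel::Writer::XLSX Spreadsheet::WriteExcel)) {
    printf "%-26s %s\n", $module,
        module_available($module) ? 'installed' : "missing (try: cpan $module)";
}
```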

## **Installation**

3) Clone the repo:

```bash
git clone https://github.com/vaishnavpvarma/vaxijen-antigenicity-parser.git
cd vaxijen-antigenicity-parser
```
## **🚀 Usage** (no command-line arguments)
1️⃣ VaxiJen → Excel (.xlsx)

Step A: Edit the file paths inside the script.
Open `scripts/vaxijen_to_excel.pl` in your preferred text editor and set:
```perl
my $input_file = "path/to/vaxijen_output.txt";
my $output_file = "vaxijen_results.xlsx";
```
Step B: Run from the command line/terminal:
```bash
perl scripts/vaxijen_to_excel.pl
```
💡 Tips for file paths (Windows):

> Prefer forward slashes: `C:/Users/Name/Desktop/input.txt`

> Or escape backslashes: `C:\\Users\\Name\\Desktop\\input.txt`

> If a path contains spaces, keep the whole path inside the quoted Perl string: `"C:/My Data/results.txt"`
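
The snippet below shows how these path forms behave inside double-quoted Perl strings, together with an early existence check of the kind mentioned under Key Features. The paths are placeholders and the helper name `check_input_file` is hypothetical.

```perl
use strict;
use warnings;

# Illustrative Windows-style paths; the values are placeholders.
my $forward = "C:/Users/Name/Desktop/input.txt";
my $escaped = "C:\\Users\\Name\\Desktop\\input.txt";    # each \\ is one backslash
my $spaces  = "C:/My Data/results.txt";                 # spaces need no escaping

# Hypothetical helper: fail early with a clear message if the input is missing.
sub check_input_file {
    my ($path) = @_;
    die "Input file not found: $path\n"       unless -e $path;
    die "Input file is not readable: $path\n" unless -r $path;
    return 1;
}

print "$escaped\n";    # prints C:\Users\Name\Desktop\input.txt

# Trap the error so the demo reports the problem instead of aborting.
eval { check_input_file($forward) };
print "Check failed: $@" if $@;
```

A single unescaped backslash is risky in a double-quoted string because sequences such as `\U` or `\n` are interpreted by Perl rather than kept literally.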

## Perfect For

- 🧬 Bioinformatics researchers working with protein predictions (Immunoinformatics)
- 📊 Anyone who needs to convert scientific text output into spreadsheet format
- 🎓 Students learning about data parsing and file processing
- 🔬 Labs that need to process VaxiJen antigenicity predictions regularly

_ _ _ _ _ _

## **More About VaxiJen**
VaxiJen was created by:
- [Prof. Irini Doytchinova](https://pharmfac.mu-sofia.bg/?page_id=5444&lang=en)
- [Darren Flower](https://www.linkedin.com/in/darrenflower/?originalSubdomain=uk)

## **Use VaxiJen**
- [VaxiJen v2.0](https://www.ddg-pharmfac.net/vaxijen/VaxiJen/VaxiJen.html)
- [VaxiJen v3.0](https://www.ddg-pharmfac.net/vaxijen3/home/)
- [Read More Here (BMC Bioinformatics, 2007)](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-4)
____________________________________________________________________________________________________________________________________________________________________________

## ✍ Behind the Code (vaxijen-antigenicity-parser)

👨‍🔬 **Vaishnav P. Varma**
[GitHub Profile](https://github.com/vaishnavvarma) [LinkedIn Profile](https://www.linkedin.com/in/vaishnav-p-varma/)

💻 *Bioinformatician by training | 📸 Photographer by heart | ☕ Turning coffee & curiosity into code*

[![Buy Me a Coffee](https://img.shields.io/badge/Buy%20Me%20a%20Coffee-%E2%98%95-yellow)](https://buymeacoffee.com/vaishnavpvarma)

---

✨ Crafted with ❤️, code, and curry in India 🇮🇳