https://github.com/waldronlab/bioanalyzer-backend
Streamline BugSigDB curation with AI-powered scientific paper analysis. Automatically extracts essential microbiome study metadata from research papers, reducing manual curation time and improving data quality for the BugSigDB database.
- Host: GitHub
- URL: https://github.com/waldronlab/bioanalyzer-backend
- Owner: waldronlab
- Created: 2025-10-20T07:33:52.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2026-05-28T12:45:45.000Z (15 days ago)
- Last Synced: 2026-05-28T13:09:15.846Z (15 days ago)
- Topics: ai, api, bioinformatics, microbiome, ml, ncbi, paper-analysis, python
- Language: Python
- Homepage:
- Size: 1.23 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 19
Metadata Files:
- Readme: README.md
# BioAnalyzer Package
Extracts five core BugSigDB curation fields from scientific papers using LLMs. Pulls metadata and full text from PubMed/PMC, then analyzes papers to determine if they're ready for curation.
Runs on Ubuntu with Docker; local installs require Python 3.8+.
**Full documentation:** [docs/](docs/README.md)
## What It Does
Takes a PMID, fetches the paper from PubMed, and extracts:
1. Host Species (Human, Mouse, etc.)
2. Body Site (Gut, Oral, Skin, etc.)
3. Condition (disease/treatment being studied)
4. Sequencing Type (16S, metagenomics, etc.)
5. Sample Size (number of samples/participants)
Each field gets a status: `PRESENT`, `PARTIALLY_PRESENT`, or `ABSENT`, plus a confidence score.
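As a rough illustration of what one per-field result could look like in code, here is a minimal sketch. The class names, the snake_case field keys, the `[0, 1]` confidence range, and the readiness rule are all assumptions made for this example, not the package's actual models.

```python
from dataclasses import dataclass
from enum import Enum


class FieldStatus(str, Enum):
    """The three statuses named in this README."""
    PRESENT = "PRESENT"
    PARTIALLY_PRESENT = "PARTIALLY_PRESENT"
    ABSENT = "ABSENT"


# Illustrative keys for the five core fields (not the package's identifiers).
CORE_FIELDS = ("host_species", "body_site", "condition",
               "sequencing_type", "sample_size")


@dataclass
class FieldResult:
    field: str
    status: FieldStatus
    confidence: float  # assumed to lie in [0.0, 1.0]
    value: str = ""

    def __post_init__(self):
        if self.field not in CORE_FIELDS:
            raise ValueError(f"unknown field: {self.field!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        # Accept plain strings such as "PRESENT" and coerce them to the enum.
        self.status = FieldStatus(self.status)


def is_curation_ready(results):
    """Hypothetical rule: every core field was reported and is PRESENT."""
    by_field = {r.field: r.status for r in results}
    return all(by_field.get(name) is FieldStatus.PRESENT for name in CORE_FIELDS)
```

A real readiness decision would probably also weigh the confidence scores; the all-`PRESENT` rule is only the simplest possible stand-in.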
## Quick Start
### Prerequisites
- Docker 20.0+ (recommended) or Python 3.8+
- NCBI API key (required)
- At least one LLM API key: Gemini (easiest), OpenAI, Anthropic, or Ollama (local)
### Docker (Recommended)
```bash
git clone https://github.com/waldronlab/bioanalyzer-backend.git
cd bioanalyzer-backend
chmod +x install.sh
./install.sh
BioAnalyzer build
BioAnalyzer start
BioAnalyzer status # confirm running
```
API docs at `http://localhost:8000/docs` when the API is running.
#### Custom API host/port (no hardcoded localhost)
The CLI and dev/ops scripts read the API URL from `BIOANALYZER_API_URL`.
You can set it to either the **root URL** (recommended) or the **`/api/v1` base**:
```bash
# Root URL
export BIOANALYZER_API_URL="http://127.0.0.1:8001"
# Or /api/v1 base
export BIOANALYZER_API_URL="http://127.0.0.1:8001/api/v1"
```
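Accepting both URL forms amounts to normalizing the value before use. The sketch below shows one way that could be done; `resolve_api_base` is a hypothetical helper, not the CLI's actual code, and the fallback to `http://localhost:8000` simply mirrors the default address used elsewhere in this README.

```python
import os

API_PREFIX = "/api/v1"


def resolve_api_base(raw=None):
    """Return a URL that always ends in /api/v1, given either accepted form."""
    if raw is None:
        raw = os.environ.get("BIOANALYZER_API_URL", "http://localhost:8000")
    url = raw.strip().rstrip("/")
    if url.endswith(API_PREFIX):
        return url  # already the /api/v1 base
    return url + API_PREFIX  # root URL: append the prefix
```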
### Local Install
```bash
git clone https://github.com/waldronlab/bioanalyzer-backend.git
cd bioanalyzer-backend
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
# Set API keys and the NCBI contact email (see Configuration)
export NCBI_API_KEY=your_key
export EMAIL=you@example.com
export GEMINI_API_KEY=your_key
```
## Usage
### CLI
```bash
# Analyze a paper
BioAnalyzer analyze 12345678
# Batch analysis
BioAnalyzer analyze 12345678,87654321
BioAnalyzer analyze --file pmids.txt
# Retrieve paper data
BioAnalyzer retrieve 12345678
# System management
BioAnalyzer start
BioAnalyzer stop
BioAnalyzer status
# Curator table (sortable/searchable predictions)
BioAnalyzer run table
```
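The batch forms above take either a comma-separated list or a file of PMIDs. A minimal sketch of how such input could be parsed follows; the helper names are hypothetical, and the file layout (PMIDs separated by newlines or commas) is an assumption, since this README does not specify it.

```python
from pathlib import Path


def parse_pmids(text):
    """Split comma- or newline-separated PMIDs, dropping blanks and duplicates."""
    pmids = []
    for token in text.replace("\n", ",").split(","):
        token = token.strip()
        if not token:
            continue
        if not token.isdigit():
            raise ValueError(f"not a PMID: {token!r}")
        if token not in pmids:  # keep first occurrence, preserve order
            pmids.append(token)
    return pmids


def load_pmids(path):
    """Read PMIDs from a text file (assumed layout: one or more per line)."""
    return parse_pmids(Path(path).read_text(encoding="utf-8"))
```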
### API
**v1 (simple, fast):**
```bash
curl http://localhost:8000/api/v1/analyze/12345678
```
**v2 (RAG-enhanced, more accurate):**
```bash
curl "http://localhost:8000/api/v2/analyze/12345678?use_rag=true"
```
v2 uses retrieval-augmented generation (RAG) to improve accuracy, at the cost of more LLM API calls per paper. Use v1 for quick checks and v2 when you need better results.
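The same two calls can be made from Python with only the standard library. This is a sketch: `analyze_url` and `analyze` are hypothetical helper names, the 30-second timeout is an assumption, and the shape of the JSON response is not documented here, so it is returned unparsed beyond `json.load`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def analyze_url(pmid, version="v1", use_rag=False, root="http://localhost:8000"):
    """Build the analyze endpoint URL shown in the curl examples."""
    if version not in ("v1", "v2"):
        raise ValueError("version must be 'v1' or 'v2'")
    pmid = str(pmid)
    if not pmid.isdigit():
        raise ValueError(f"not a PMID: {pmid!r}")
    url = f"{root.rstrip('/')}/api/{version}/analyze/{pmid}"
    if version == "v2" and use_rag:
        url += "?" + urlencode({"use_rag": "true"})
    return url


def analyze(pmid, timeout=30, **kwargs):
    """Call a running API instance and return the decoded JSON body."""
    with urlopen(analyze_url(pmid, **kwargs), timeout=timeout) as response:
        return json.load(response)
```

Calling `analyze(12345678, version="v2", use_rag=True)` needs the API to be running locally.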
## Configuration
### Required
- `NCBI_API_KEY` - Get from [NCBI account settings](https://www.ncbi.nlm.nih.gov/account/settings/)
- `EMAIL` - Contact email for NCBI requests
### LLM Provider
Set one of:
- `GEMINI_API_KEY` - Google Gemini (recommended, cheapest)
- `OPENAI_API_KEY` - OpenAI
- `ANTHROPIC_API_KEY` - Anthropic
- `OLLAMA_BASE_URL` - Local Ollama (default: http://localhost:11434)
Auto-detects provider from available keys. Override with `LLM_PROVIDER=gemini|openai|anthropic|ollama`.
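Provider auto-detection of this kind can be sketched as a lookup over the environment. The function below is illustrative only: `detect_provider` is a hypothetical name, and the priority order (Gemini, OpenAI, Anthropic, Ollama) is an assumption based on the order of the list above.

```python
import os

# (provider name, environment variable) in the assumed priority order.
PROVIDER_ENV_VARS = (
    ("gemini", "GEMINI_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("ollama", "OLLAMA_BASE_URL"),
)


def detect_provider(env=None):
    """Return the provider to use, or None if nothing is configured."""
    env = os.environ if env is None else env
    valid = {name for name, _ in PROVIDER_ENV_VARS}

    # An explicit LLM_PROVIDER always wins over auto-detection.
    override = env.get("LLM_PROVIDER", "").strip().lower()
    if override:
        if override not in valid:
            raise ValueError(f"unsupported LLM_PROVIDER: {override!r}")
        return override

    # Otherwise pick the first provider whose variable is set and non-empty.
    for name, variable in PROVIDER_ENV_VARS:
        if env.get(variable):
            return name
    return None
```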
### RAG Settings (v2 API)
```bash
# Fast (good for batch jobs)
export RAG_SUMMARY_QUALITY=fast
export RAG_RERANK_METHOD=keyword
export RAG_TOP_K_CHUNKS=5
# Balanced (default, good tradeoff)
export RAG_SUMMARY_QUALITY=balanced
export RAG_RERANK_METHOD=hybrid
export RAG_TOP_K_CHUNKS=10
# High accuracy (slower, more expensive)
export RAG_SUMMARY_QUALITY=high
export RAG_RERANK_METHOD=llm
export RAG_TOP_K_CHUNKS=20
```
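If you switch between these profiles often, the three groups of exports can be kept in one place. The sketch below does that in Python. The preset names (`fast`, `balanced`, `high`) and the `apply_rag_preset` helper are assumptions for illustration; only the variable names and values come from the block above.

```python
import os

RAG_PRESETS = {
    "fast": {
        "RAG_SUMMARY_QUALITY": "fast",
        "RAG_RERANK_METHOD": "keyword",
        "RAG_TOP_K_CHUNKS": "5",
    },
    "balanced": {
        "RAG_SUMMARY_QUALITY": "balanced",
        "RAG_RERANK_METHOD": "hybrid",
        "RAG_TOP_K_CHUNKS": "10",
    },
    "high": {
        "RAG_SUMMARY_QUALITY": "high",
        "RAG_RERANK_METHOD": "llm",
        "RAG_TOP_K_CHUNKS": "20",
    },
}


def apply_rag_preset(name, env=None):
    """Write one preset into the environment mapping and return the mapping."""
    if name not in RAG_PRESETS:
        raise ValueError(f"unknown preset: {name!r}")
    env = os.environ if env is None else env
    env.update(RAG_PRESETS[name])
    return env
```

Call `apply_rag_preset("fast")` before starting a batch job from the same Python process. Variables set this way do not affect an API server that is already running.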
### Performance
- `USE_FULLTEXT=true` - Enable full text retrieval (slower but more accurate)
- `API_TIMEOUT=30` - Request timeout in seconds
- `CACHE_VALIDITY_HOURS=24` - How long to cache results
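For reference, settings like these are usually read once at startup and converted to typed values. The following is a minimal sketch under stated assumptions: `PerformanceSettings` and `load_performance_settings` are hypothetical names, the values shown above are treated as defaults, and the accepted spellings of "true" are a guess.

```python
import os
from dataclasses import dataclass

_TRUTHY = {"1", "true", "yes", "on"}  # assumed spellings for a boolean flag


@dataclass(frozen=True)
class PerformanceSettings:
    use_fulltext: bool
    api_timeout: int           # seconds
    cache_validity_hours: int  # hours


def load_performance_settings(env=None):
    """Parse the three performance variables, falling back to assumed defaults."""
    env = os.environ if env is None else env
    settings = PerformanceSettings(
        use_fulltext=env.get("USE_FULLTEXT", "true").strip().lower() in _TRUTHY,
        api_timeout=int(env.get("API_TIMEOUT", "30")),
        cache_validity_hours=int(env.get("CACHE_VALIDITY_HOURS", "24")),
    )
    if settings.api_timeout <= 0 or settings.cache_validity_hours <= 0:
        raise ValueError("timeout and cache validity must be positive")
    return settings
```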
## Architecture
Standard layered setup:
```
app/
├── api/ # FastAPI routes (v1 and v2)
├── services/ # Business logic
│ ├── data_retrieval.py # PubMed fetching
│ ├── bugsigdb_analyzer.py # Field extraction
│ ├── advanced_rag.py # RAG pipeline
│ └── cache_manager.py # SQLite cache
├── models/ # LLM wrappers
│ ├── llm_provider.py # LiteLLM manager
│ └── unified_qa.py # QA interface
└── utils/ # Helpers
```
**Flow:**
1. Fetch paper from PubMed (cached in SQLite)
2. Chunk text if full text available
3. For each field: query LLM (v1) or use RAG pipeline (v2)
4. Validate and score results
5. Cache and return
v2 adds chunk re-ranking and contextual summarization before querying the LLM. Worth the extra cost for better accuracy.
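The five steps can be sketched end to end as below. Everything here is illustrative rather than the package's code: the function names, the chunk sizes, the paper dictionary keys (`full_text`, `abstract`), and the keyword-overlap ranking (a simple stand-in for the real re-ranking and summarization) are assumptions. `fetch` and `ask` are passed in so the flow can be exercised without network or LLM access.

```python
FIELDS = ("host species", "body site", "condition", "sequencing type", "sample size")
STATUSES = {"PRESENT", "PARTIALLY_PRESENT", "ABSENT"}


def chunk_text(text, size=1000, overlap=200):
    """Split text into overlapping character windows (sizes are assumptions)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
        start += size - overlap
    return chunks


def rank_chunks(field, chunks, top_k):
    """Keep the top_k chunks that mention the field's words most often."""
    words = field.lower().split()

    def score(chunk):
        lowered = chunk.lower()
        return sum(lowered.count(word) for word in words)

    return sorted(chunks, key=score, reverse=True)[:top_k]


def validate(answer):
    """Coerce a raw answer into a known status and a confidence in [0, 1]."""
    status = answer.get("status")
    if status not in STATUSES:
        status = "ABSENT"
    confidence = min(max(float(answer.get("confidence", 0.0)), 0.0), 1.0)
    return {"status": status, "confidence": confidence}


def analyze_paper(pmid, fetch, ask, cache, use_rag=False, top_k=10):
    if pmid in cache:
        return cache[pmid]
    paper = fetch(pmid)                                  # 1. fetch the paper
    text = paper.get("full_text") or paper.get("abstract", "")
    chunks = chunk_text(text)                            # 2. chunk the text
    results = {}
    for field in FIELDS:                                 # 3. one query per field
        context = rank_chunks(field, chunks, top_k) if use_rag else chunks
        results[field] = validate(ask(field, context))   # 4. validate and score
    cache[pmid] = results                                # 5. cache and return
    return results
```

A plain `dict` works as the `cache` argument for experimentation.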
## LLM Providers
Uses LiteLLM for provider abstraction. Supports:
- **Gemini** - Good balance of cost and quality
- **OpenAI** - Expensive but reliable
- **Anthropic** - Good for complex reasoning
- **Ollama** - Free but requires local setup
- **Llamafile** - Self-contained local models
Gemini is the default because it's cheap and works well for this use case.
## Performance
- **v1**: ~2-5 seconds per paper, 10-20 papers/min
- **v2**: ~5-10 seconds per paper, 5-10 papers/min
- **Memory**: ~100-200MB base, +50MB per concurrent request
- **Cache hit rate**: 60-80% for frequently analyzed papers
Cache is SQLite-based, stored in `cache/analysis_cache.db`. Results valid for 24 hours by default.
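A time-limited SQLite cache of this kind can be sketched with the standard `sqlite3` module. The table layout, class name, and method names below are assumptions for illustration, not the schema of `analysis_cache.db`.

```python
import json
import sqlite3
import time


class AnalysisCache:
    """Store JSON-serializable results per PMID and expire them after a TTL."""

    def __init__(self, path=":memory:", validity_hours=24):
        self.ttl_seconds = validity_hours * 3600
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS analysis ("
            "pmid TEXT PRIMARY KEY, payload TEXT NOT NULL, stored_at REAL NOT NULL)"
        )

    def put(self, pmid, result, now=None):
        now = time.time() if now is None else now
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO analysis VALUES (?, ?, ?)",
                (str(pmid), json.dumps(result), now),
            )

    def get(self, pmid, now=None):
        now = time.time() if now is None else now
        row = self.conn.execute(
            "SELECT payload, stored_at FROM analysis WHERE pmid = ?", (str(pmid),)
        ).fetchone()
        if row is None or now - row[1] > self.ttl_seconds:
            return None  # missing or older than the validity window
        return json.loads(row[0])
```

The optional `now` argument exists only to make expiry easy to test without waiting.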
## Validation & Benchmarking
BioAnalyzer includes a formal validation workflow to compare automated predictions against expert curator annotations:
- **Ground truth**: Expert annotations in `feedback.csv` for the five core BugSigDB curation fields
- **Predictions**: BioAnalyzer outputs in a predictions CSV (e.g. `analysis_results.csv` or `new.csv`)
- **Alignment**: PMIDs are aligned with `align_pmids.py`
- **Evaluation**: `scripts/eval/confusion_matrix_analysis.py` computes 3-class confusion matrices (`ABSENT`, `PARTIALLY_PRESENT`, `PRESENT`) and per-field accuracy
- **Outputs**: Metrics and PNG confusion matrices are written to `confusion_matrix_results/`
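To make the evaluation step concrete, here is a small sketch of a 3-class confusion matrix and accuracy for a single field. It illustrates the computation only and is not the code in `scripts/eval/confusion_matrix_analysis.py`; the function names and the convention that rows are ground truth and columns are predictions are assumptions.

```python
from collections import Counter

LABELS = ("ABSENT", "PARTIALLY_PRESENT", "PRESENT")


def confusion_matrix(truth, predicted):
    """Rows are ground-truth labels, columns are predicted labels."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    unknown = (set(truth) | set(predicted)) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(zip(truth, predicted))
    return [[counts[(t, p)] for p in LABELS] for t in LABELS]


def accuracy(matrix):
    """Fraction of papers on the diagonal (prediction equals ground truth)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


truth = ["PRESENT", "PRESENT", "ABSENT", "PARTIALLY_PRESENT"]
predicted = ["PRESENT", "PARTIALLY_PRESENT", "ABSENT", "PARTIALLY_PRESENT"]
matrix = confusion_matrix(truth, predicted)
print(matrix)            # [[1, 0, 0], [0, 1, 0], [0, 1, 1]]
print(accuracy(matrix))  # 0.75
```

Running this once per field gives the per-field accuracy described above.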
For sharing/inspection, `create_validation_dataset.py` can generate a flat CSV:
- Columns: `Study, PMID, Experiment, Outcome of the experiment, Prediction`
- Each row = one paper–field comparison
- Used in the `Deliverables/` folder to communicate validation results (methods, ground truth analysis, and confusion-matrix summaries).
## Development
```bash
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Format code
black .
# Lint
flake8 .
```
### Adding Features
- Services go in `app/services/`
- API routes in `app/api/routers/`
- CLI commands in `cli.py`
- Models in `app/api/models/`
## Troubleshooting
**Import errors:**
- Use Docker, or ensure virtual environment is activated
- Check Python version (3.8+)
**API not responding:**
```bash
docker compose ps
docker compose logs
```
**Missing API keys:**
- Check `.env` file or environment variables
- System will warn but continue (with limited functionality)
**Rate limiting:**
- NCBI allows 3 requests/second without an API key and 10 requests/second with one. We throttle automatically.
- LLM providers have their own limits. Check your quota.
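If you call NCBI from your own scripts, a simple way to stay under a per-second limit is to space requests evenly. The class below is a sketch of that idea, not BioAnalyzer's built-in throttle; the `Throttle` name and the injectable `clock` and `sleep` parameters are assumptions added to make it testable.

```python
import time


class Throttle:
    """Space calls so that at most max_per_second happen each second."""

    def __init__(self, max_per_second=3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request has been made yet

    def wait(self):
        """Block until the next request is allowed; return the time slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and self._next_allowed > now:
            delay = self._next_allowed - now
            self._sleep(delay)
            now += delay
        self._next_allowed = now + self.min_interval
        return delay
```

Call `throttle.wait()` immediately before each request, sharing one `Throttle` instance across all code that talks to the same service.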
## Documentation
All documentation lives in the **[docs/](docs/)** folder:
- **[docs/README.md](docs/README.md)** – Index of all documentation
- **Getting started:** [QUICKSTART](docs/QUICKSTART.md), [SETUP_GUIDE](docs/SETUP_GUIDE.md), [QUICK_REFERENCE](docs/QUICK_REFERENCE.md)
- **Architecture:** [ARCHITECTURE](docs/ARCHITECTURE.md), [ARCHITECTURE_FLOW](docs/ARCHITECTURE_FLOW.md)
- **CLI:** [CLI_DOCUMENTATION](docs/CLI_DOCUMENTATION.md)
- **Features:** [RAG_GUIDE](docs/RAG_GUIDE.md), [CURATOR_TABLE_DESIGN](docs/CURATOR_TABLE_DESIGN.md), [CURATOR_TABLE_USER_GUIDE](docs/CURATOR_TABLE_USER_GUIDE.md)
- **Deployment:** [DOCKER_DEPLOYMENT](docs/DOCKER_DEPLOYMENT.md), [PRODUCTION_DEPLOYMENT](docs/PRODUCTION_DEPLOYMENT.md)
- **Development:** [TESTING](docs/TESTING.md)
When the API is running, interactive API docs are available at **http://localhost:8000/docs**.
## License
MIT License - see [LICENSE](LICENSE) file.