{"id":34514603,"url":"https://github.com/waldronlab/bioanalyzer-backend","last_synced_at":"2026-06-08T07:31:35.936Z","repository":{"id":319840546,"uuid":"1079663950","full_name":"waldronlab/bioanalyzer-backend","owner":"waldronlab","description":"Streamline BugSigDB curation with AI-powered scientific paper analysis. Automatically extracts essential microbiome study metadata from research papers, reducing manual curation time and improving data quality for the BugSigDB database.","archived":false,"fork":false,"pushed_at":"2026-05-28T12:45:45.000Z","size":1293,"stargazers_count":1,"open_issues_count":19,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-28T13:09:15.846Z","etag":null,"topics":["ai","api","bioinformatics","microbiome","ml","ncbi","paper-analysis","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/waldronlab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-20T07:33:52.000Z","updated_at":"2026-04-28T09:04:17.000Z","dependencies_parsed_at":null,"dependency_job_id":"a1d38de1-b6cb-45f6-8a7a-e3d16f340a0d","html_url":"https://github.com/waldronlab/bioanalyzer-backend","commit_stats":null,"previous_names":["waldronlab/bioanalyzer-backend"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/waldronlab/bioanalyzer-backend","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wal
dronlab%2Fbioanalyzer-backend","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/waldronlab%2Fbioanalyzer-backend/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/waldronlab%2Fbioanalyzer-backend/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/waldronlab%2Fbioanalyzer-backend/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/waldronlab","download_url":"https://codeload.github.com/waldronlab/bioanalyzer-backend/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/waldronlab%2Fbioanalyzer-backend/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34053434,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-08T02:00:07.615Z","response_time":111,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","api","bioinformatics","microbiome","ml","ncbi","paper-analysis","python"],"created_at":"2025-12-24T04:19:35.178Z","updated_at":"2026-06-08T07:31:35.931Z","avatar_url":"https://github.com/waldronlab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BioAnalyzer Package\n\n[![CI/CD 
Pipeline](https://github.com/waldronlab/bioanalyzer-backend/actions/workflows/ci.yml/badge.svg)](https://github.com/waldronlab/bioanalyzer-backend/actions/workflows/ci.yml)\n[![Python](https://img.shields.io/badge/Python-3.8+-blue.svg)](https://python.org)\n[![FastAPI](https://img.shields.io/badge/FastAPI-0.104+-green.svg)](https://fastapi.tiangolo.com)\n[![Docker](https://img.shields.io/badge/Docker-20.0+-blue.svg)](https://docker.com)\n[![License](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)\n\nExtracts five core BugSigDB curation fields from scientific papers using LLMs. Pulls metadata and full text from PubMed/PMC, then analyzes papers to determine if they're ready for curation.\n\nWorks on Ubuntu with Docker. Python 3.8+ for local installs.\n\n**Full documentation:** [docs/](docs/README.md)\n\n## What It Does\n\nTakes a PMID, fetches the paper from PubMed, and extracts:\n1. Host Species (Human, Mouse, etc.)\n2. Body Site (Gut, Oral, Skin, etc.)\n3. Condition (disease/treatment being studied)\n4. Sequencing Type (16S, metagenomics, etc.)\n5. 
Sample Size (number of samples/participants)\n\nEach field gets a status: `PRESENT`, `PARTIALLY_PRESENT`, or `ABSENT`, plus a confidence score.\n\n## Quick Start\n\n### Prerequisites\n\n- Docker 20.0+ (recommended) or Python 3.8+\n- NCBI API key (required)\n- At least one LLM API key: Gemini (easiest), OpenAI, Anthropic, or Ollama (local)\n\n### Docker (Recommended)\n\n```bash\ngit clone https://github.com/waldronlab/bioanalyzer-backend.git\ncd bioanalyzer-backend\n\nchmod +x install.sh\n./install.sh\n\nBioAnalyzer build\nBioAnalyzer start\nBioAnalyzer status   # confirm running\n```\n\nAPI docs at `http://localhost:8000/docs` when the API is running.\n\n#### Custom API host/port (no hardcoded localhost)\n\nThe CLI and dev/ops scripts read the API URL from `BIOANALYZER_API_URL`.\n\nYou can set it to either the **root URL** (recommended) or the **`/api/v1` base**:\n\n```bash\n# Root URL\nexport BIOANALYZER_API_URL=\"http://127.0.0.1:8001\"\n\n# Or /api/v1 base\nexport BIOANALYZER_API_URL=\"http://127.0.0.1:8001/api/v1\"\n```\n\n### Local Install\n\n```bash\ngit clone https://github.com/waldronlab/bioanalyzer-backend.git\ncd bioanalyzer-backend\n\npython3 -m venv .venv\nsource .venv/bin/activate\n\npip install -e .\n\n# Set API keys\nexport NCBI_API_KEY=your_key\nexport GEMINI_API_KEY=your_key\n```\n\n## Usage\n\n### CLI\n\n```bash\n# Analyze a paper\nBioAnalyzer analyze 12345678\n\n# Batch analysis\nBioAnalyzer analyze 12345678,87654321\nBioAnalyzer analyze --file pmids.txt\n\n# Retrieve paper data\nBioAnalyzer retrieve 12345678\n\n# System management\nBioAnalyzer start\nBioAnalyzer stop\nBioAnalyzer status\n\n# Curator table (sortable/searchable predictions)\nBioAnalyzer run table\n```\n\n### API\n\n**v1 (simple, fast):**\n```bash\ncurl http://localhost:8000/api/v1/analyze/12345678\n```\n\n**v2 (RAG-enhanced, more accurate):**\n```bash\ncurl \"http://localhost:8000/api/v2/analyze/12345678?use_rag=true\"\n```\n\nv2 uses RAG to improve accuracy but costs more API 
calls. Use v1 for quick checks, v2 when you need better results.\n\n## Configuration\n\n### Required\n\n- `NCBI_API_KEY` - Get from [NCBI account settings](https://www.ncbi.nlm.nih.gov/account/settings/)\n- `EMAIL` - Contact email for NCBI requests\n\n### LLM Provider\n\nSet one of:\n- `GEMINI_API_KEY` - Google Gemini (recommended, cheapest)\n- `OPENAI_API_KEY` - OpenAI\n- `ANTHROPIC_API_KEY` - Anthropic\n- `OLLAMA_BASE_URL` - Local Ollama (default: http://localhost:11434)\n\nAuto-detects provider from available keys. Override with `LLM_PROVIDER=gemini|openai|anthropic|ollama`.\n\n### RAG Settings (v2 API)\n\n```bash\n# Fast (good for batch jobs)\nexport RAG_SUMMARY_QUALITY=fast\nexport RAG_RERANK_METHOD=keyword\nexport RAG_TOP_K_CHUNKS=5\n\n# Balanced (default, good tradeoff)\nexport RAG_SUMMARY_QUALITY=balanced\nexport RAG_RERANK_METHOD=hybrid\nexport RAG_TOP_K_CHUNKS=10\n\n# High accuracy (slower, more expensive)\nexport RAG_SUMMARY_QUALITY=high\nexport RAG_RERANK_METHOD=llm\nexport RAG_TOP_K_CHUNKS=20\n```\n\n### Performance\n\n- `USE_FULLTEXT=true` - Enable full text retrieval (slower but more accurate)\n- `API_TIMEOUT=30` - Request timeout in seconds\n- `CACHE_VALIDITY_HOURS=24` - How long to cache results\n\n## Architecture\n\nStandard layered setup:\n\n```\napp/\n├── api/          # FastAPI routes (v1 and v2)\n├── services/     # Business logic\n│   ├── data_retrieval.py      # PubMed fetching\n│   ├── bugsigdb_analyzer.py   # Field extraction\n│   ├── advanced_rag.py         # RAG pipeline\n│   └── cache_manager.py       # SQLite cache\n├── models/       # LLM wrappers\n│   ├── llm_provider.py        # LiteLLM manager\n│   └── unified_qa.py          # QA interface\n└── utils/        # Helpers\n```\n\n**Flow:**\n1. Fetch paper from PubMed (cached in SQLite)\n2. Chunk text if full text available\n3. For each field: query LLM (v1) or use RAG pipeline (v2)\n4. Validate and score results\n5. 
Cache and return\n\nv2 adds chunk re-ranking and contextual summarization before querying the LLM. Worth the extra cost for better accuracy.\n\n## LLM Providers\n\nUses LiteLLM for provider abstraction. Supports:\n- **Gemini** - Good balance of cost and quality\n- **OpenAI** - Expensive but reliable\n- **Anthropic** - Good for complex reasoning\n- **Ollama** - Free but requires local setup\n- **Llamafile** - Self-contained local models\n\nGemini is the default because it's cheap and works well for this use case.\n\n## Performance\n\n- **v1**: ~2-5 seconds per paper, 10-20 papers/min\n- **v2**: ~5-10 seconds per paper, 5-10 papers/min\n- **Memory**: ~100-200MB base, +50MB per concurrent request\n- **Cache hit rate**: 60-80% for frequently analyzed papers\n\nCache is SQLite-based, stored in `cache/analysis_cache.db`. Results valid for 24 hours by default.\n\n## Validation \u0026 Benchmarking\n\nBioAnalyzer includes a formal validation workflow to compare automated predictions against expert curator annotations:\n\n- **Ground truth**: Expert annotations in `feedback.csv` for the five core BugSigDB curation fields  \n- **Predictions**: BioAnalyzer outputs in a predictions CSV (e.g. 
`analysis_results.csv` or `new.csv`)  \n- **Alignment**: PMIDs are aligned with `align_pmids.py`  \n- **Evaluation**: `scripts/eval/confusion_matrix_analysis.py` computes 3-class confusion matrices (`ABSENT`, `PARTIALLY_PRESENT`, `PRESENT`) and per-field accuracy  \n- **Outputs**: Metrics and PNG confusion matrices are written to `confusion_matrix_results/`\n\nFor sharing/inspection, `create_validation_dataset.py` can generate a flat CSV:\n\n- Columns: `Study, PMID, Experiment, Outcome of the experiment, Prediction`  \n- Each row = one paper–field comparison  \n- Used in the `Deliverables/` folder to communicate validation results (methods, ground truth analysis, and confusion-matrix summaries).\n\n## Development\n\n```bash\n# Install dev dependencies (quoted so shells like zsh do not glob the brackets)\npip install -e \".[dev]\"\n\n# Run tests\npytest\n\n# Format code\nblack .\n\n# Lint\nflake8 .\n```\n\n### Adding Features\n\n- Services go in `app/services/`\n- API routes in `app/api/routers/`\n- CLI commands in `cli.py`\n- Models in `app/api/models/`\n\n## Troubleshooting\n\n**Import errors:**\n- Use Docker, or ensure the virtual environment is activated\n- Check Python version (3.8+)\n\n**API not responding:**\n```bash\ndocker compose ps\ndocker compose logs\n```\n\n**Missing API keys:**\n- Check `.env` file or environment variables\n- System will warn but continue (with limited functionality)\n\n**Rate limiting:**\n- NCBI allows 3 requests/second without an API key and 10 requests/second with one. We throttle automatically.\n- LLM providers have their own limits. 
Check your quota.\n\n## Documentation\n\nAll documentation lives in the **[docs/](docs/)** folder:\n\n- **[docs/README.md](docs/README.md)** – Index of all documentation\n- **Getting started:** [QUICKSTART](docs/QUICKSTART.md), [SETUP_GUIDE](docs/SETUP_GUIDE.md), [QUICK_REFERENCE](docs/QUICK_REFERENCE.md)\n- **Architecture:** [ARCHITECTURE](docs/ARCHITECTURE.md), [ARCHITECTURE_FLOW](docs/ARCHITECTURE_FLOW.md)\n- **CLI:** [CLI_DOCUMENTATION](docs/CLI_DOCUMENTATION.md)\n- **Features:** [RAG_GUIDE](docs/RAG_GUIDE.md), [CURATOR_TABLE_DESIGN](docs/CURATOR_TABLE_DESIGN.md), [CURATOR_TABLE_USER_GUIDE](docs/CURATOR_TABLE_USER_GUIDE.md)\n- **Deployment:** [DOCKER_DEPLOYMENT](docs/DOCKER_DEPLOYMENT.md), [PRODUCTION_DEPLOYMENT](docs/PRODUCTION_DEPLOYMENT.md)\n- **Development:** [TESTING](docs/TESTING.md)\n\nWhen the API is running, interactive API docs: **http://localhost:8000/docs**\n\n## License\n\nMIT License - see [LICENSE](LICENSE) file.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwaldronlab%2Fbioanalyzer-backend","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwaldronlab%2Fbioanalyzer-backend","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwaldronlab%2Fbioanalyzer-backend/lists"}