https://github.com/alikhalajii/text-classification-life-sciences
Text classification of Life Science apps
data-analysis data-science datasets feature-importance jupyter-notebook pandas sbert scikit-learn word2vec
- Host: GitHub
- URL: https://github.com/alikhalajii/text-classification-life-sciences
- Owner: alikhalajii
- License: mit
- Created: 2025-08-11T19:26:46.000Z (9 months ago)
- Default Branch: master
- Last Pushed: 2025-08-12T09:57:58.000Z (9 months ago)
- Last Synced: 2025-08-19T14:40:00.751Z (9 months ago)
- Topics: data-analysis, data-science, datasets, feature-importance, jupyter-notebook, pandas, sbert, scikit-learn, word2vec
- Language: Python
- Homepage:
- Size: 36.9 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Text Classification for Life Sciences Applications
## Introduction
In this case study, we develop a classification pipeline to automatically determine whether SAP Fiori apps are relevant to the life sciences domain.
**Problem Statement**
We are working with a catalog of 14,145 SAP Fiori apps, each described by 25 metadata fields (e.g. titles, descriptions, roles). The dataset is ***entirely unlabeled***, so we must manually annotate a small subset to train a classifier. Operating under ***low-resource conditions***, with only a few hundred labeled examples, we aim to build a model that generalizes effectively across the full catalog.
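As an illustration of how a small annotation subset could be drawn reproducibly, here is a minimal sketch. It assumes simple random sampling over row indices, a sample size of 300 (standing in for "a few hundred"), and a fixed seed. None of these choices are taken from the notebook, and the function name `draw_annotation_sample` is hypothetical.

```python
import random

CATALOG_SIZE = 14_145  # number of apps, from the problem statement
SAMPLE_SIZE = 300      # assumption: one possible reading of "a few hundred"
SEED = 42              # assumption: any fixed seed makes the draw repeatable


def draw_annotation_sample(n_items: int, k: int, seed: int) -> list[int]:
    """Return k distinct row indices in [0, n_items), sorted for readability."""
    if not 0 <= k <= n_items:
        raise ValueError("k must be between 0 and n_items")
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return sorted(rng.sample(range(n_items), k))


sample_rows = draw_annotation_sample(CATALOG_SIZE, SAMPLE_SIZE, SEED)
print(f"{len(sample_rows)} apps selected for manual annotation")
```

Simple random sampling is only one option. If relevant apps turn out to be rare, a stratified or keyword-guided sample may give the annotators more positive examples to work with.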
## Objectives
1. **Manual annotation**: Create a labeled dataset by manually tagging a representative sample of SAP Fiori apps as Relevant or Irrelevant to the life sciences domain.
2. **Model development**: Train interpretable classifiers, such as logistic regression and embedding-based models, to predict relevance based on app metadata.
3. **Evaluation & insights**: Assess model performance using metrics like accuracy, precision, recall, and F1-score. Analyze feature importances and embedding spaces to gain insights into model behavior and decision boundaries.
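To make the evaluation metrics in the third objective concrete, the sketch below computes accuracy, precision, recall, and F1-score for the Relevant class in plain Python. The helper `binary_metrics` and the toy labels are hypothetical and purely illustrative. They are not results from this project, which, going by its topics, most likely relies on scikit-learn for this step.

```python
def binary_metrics(y_true, y_pred, positive="Relevant"):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)

    # Fall back to 0.0 when a denominator is zero (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels for illustration only (not taken from the annotated dataset).
y_true = ["Relevant", "Relevant", "Relevant"] + ["Irrelevant"] * 5
y_pred = ["Relevant", "Relevant", "Irrelevant", "Relevant"] + ["Irrelevant"] * 4

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With imbalanced classes, accuracy alone can look strong even when few Relevant apps are found, which is why precision, recall, and F1 are reported alongside it.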
## Notebook structure
---
## Environment Setup
Follow these steps to recreate the full analysis from scratch:
**Create and activate a clean virtual environment**
```bash
python3.12 -m venv .venv_gxp
source .venv_gxp/bin/activate
pip install -U pip && pip install -r requirements.txt
```
**Note:**
You may safely delete any previously generated sub-directories and still reproduce every result.
## LICENSE
This project is licensed under the [MIT License](./LICENSE).