https://github.com/alikhalajii/text-classification-life-sciences

Text classification of Life Science apps
data-analysis data-science datasets feature-importance jupyter-notebook pandas sbert scikit-learn word2vec

Host: GitHub
URL: https://github.com/alikhalajii/text-classification-life-sciences
Owner: alikhalajii
License: mit
Created: 2025-08-11T19:26:46.000Z (11 months ago)
Default Branch: master
Last Pushed: 2025-08-12T09:57:58.000Z (11 months ago)
Last Synced: 2025-08-19T14:40:00.751Z (11 months ago)
Topics: data-analysis, data-science, datasets, feature-importance, jupyter-notebook, pandas, sbert, scikit-learn, word2vec
Language: Python
Homepage:
Size: 36.9 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Text Classification for Life Sciences Applications

## Introduction

In this case study, we develop a classification pipeline to automatically determine whether SAP Fiori apps are relevant to the life sciences domain.

**Problem Statement**
We are working with a catalog of 14,145 SAP Fiori apps, each described by 25 metadata fields (e.g. titles, descriptions, roles). The dataset is ***entirely unlabeled***, so we must manually annotate a small subset to train a classifier. Working under ***low-resource conditions***, with only a few hundred labeled examples, we aim to build a model that generalizes effectively across the full catalog.
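
As a rough illustration of how a subset might be prepared for annotation, the sketch below joins a few text fields into one string per app and draws a reproducible random sample. The column names (`title`, `description`), the helper `build_annotation_sample`, and the toy rows are all hypothetical; the real catalog's field names and the sampling strategy used in the notebook may differ.

```python
import pandas as pd


def build_annotation_sample(catalog, text_columns, n_samples, seed=42):
    """Join text fields into one string per app and draw a random subset to label.

    Hypothetical helper for illustration; not taken from the project notebook.
    """
    combined = (
        catalog[text_columns]
        .fillna("")              # missing fields become empty strings
        .astype(str)
        .agg(" ".join, axis=1)   # one combined string per row
        .str.split()             # collapse repeated or leading whitespace
        .str.join(" ")
    )
    sample = (
        catalog.assign(text=combined)
        .sample(n=n_samples, random_state=seed)  # fixed seed for reproducibility
        .copy()
    )
    sample["label"] = pd.NA  # left empty, to be filled in by the annotator
    return sample


# Made-up rows standing in for the real catalog.
catalog = pd.DataFrame(
    {
        "title": ["Manage Batches", "Post Journal Entries", "Track Clinical Supplies", None],
        "description": ["Monitor batch records", None, "Plan study shipments", "Display cost centers"],
    }
)

sample = build_annotation_sample(catalog, ["title", "description"], n_samples=2)
print(sample[["text", "label"]])
```

Plain random sampling is the simplest choice here; if relevant apps turn out to be rare, a stratified or keyword-guided sample may give the annotator more positive examples to work with.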

## Objectives
1. **Manual annotation**: Create a labeled dataset by manually tagging a representative sample of SAP Fiori apps as Relevant or Irrelevant to the life sciences domain.

2. **Model development**: Train interpretable classifiers, such as logistic regression and embedding-based models, to predict relevance based on app metadata.

3. **Evaluation & insights**: Assess model performance using metrics like accuracy, precision, recall, and F1-score. Analyze feature importances and embedding spaces to gain insights into model behavior and decision boundaries.

## Notebook structure

1. Data cleaning