https://github.com/joseruiz01/mlranking
Listwise Machine Learning model to rank medical documents based on a query
- Host: GitHub
- URL: https://github.com/joseruiz01/mlranking
- Owner: JoseRuiz01
- Created: 2025-03-05T15:13:31.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-16T22:13:45.000Z (about 1 year ago)
- Last Synced: 2025-03-16T22:30:34.598Z (about 1 year ago)
- Topics: listwise, machine-learning-algorithms, medical-documents, query, ranking-algorithm
- Language: Jupyter Notebook
- Homepage:
- Size: 8.46 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
# **Listwise Learning to Rank (LTR) for Lab Test Ranking**
Listwise Learning to Rank (LTR) optimizes the **entire ranking order** for a given query, unlike Pointwise approaches (which score each document on its own) or Pairwise approaches (which compare two documents at a time). It is well suited to ranking **lab tests** by relevance to queries like:
> *"glucose in blood"*, *"bilirubin in plasma"*, *"white blood cells count"*
---
## **Step 1: Define the Listwise LTR Model**
Listwise LTR models learn a *ranking function* that optimizes evaluation metrics such as **NDCG** (*Normalized Discounted Cumulative Gain*).
### Workflow:
1. **Input**: A list of lab tests (documents) for a given query.
2. **Scoring Function**: A model predicts a *relevance score* per test.
3. **Loss Function**:
   - **eXtreme NDCG** (XE-NDCG): a cross-entropy-based surrogate loss for NDCG.
   - **LambdaRank**: also NDCG-focused.
4. **Output**: A ranked list of lab tests based on predicted relevance.
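
To make the target metric concrete, here is a minimal sketch of NDCG for a single query's ranked list. It assumes the common exponential-gain formulation (gain `2^rel - 1` with a log2 position discount); the function names and the example labels are illustrative and are not taken from the repository.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevance labels given in ranked order."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical graded labels for lab tests, in the order a model ranked them.
perfect = [3, 2, 1, 0]
swapped = [2, 3, 1, 0]   # top two results in the wrong order

print(ndcg(perfect))  # 1.0
print(ndcg(swapped))  # strictly below 1.0
```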
---
## **Step 2: Data Preparation**
We calculate *relevance scores* for lab tests using two different scoring procedures.
### 1. Traditional Scoring
*Traditional scoring* is based on direct **keyword matching** between the query and dataset fields. This method prioritizes **exact and partial string matches** in key attributes such as the component and system.
#### 1.1 Define Query Features
- **Component**: Substance measured (e.g., *Glucose*)
- **System**: Environment of measurement (e.g., *Blood*, *Serum/Plasma*)
#### 1.2 Preprocess Dataset
Each lab test includes:
- **Component**
- **System**
#### 1.3 Match Criteria
- **Exact Match**: Full match with the query term.
- **Partial Match**: Synonyms or semantically similar terms.
#### 1.4 Scoring Scheme
- **Exact Match** (Component) = weight(component) * weight(component)
- **Partial Match** (Component) = weight(component)/2 * weight(component)
- **Exact Match** (System) = weight(system) * weight(system)
- **Partial Match** (System) = weight(system)/2 * weight(system)
- **No Match** = 0
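
The scheme above can be sketched as a small function. This is only an illustration under stated assumptions: the weight values are hypothetical (the README does not list them), and a partial match is approximated here by substring overlap, whereas the match criteria also mention synonyms.

```python
# Hypothetical field weights; the actual values are not given in the README.
WEIGHTS = {"component": 3.0, "system": 2.0}

def field_score(query_term, field_value, weight):
    """Score one field following the scoring scheme as written above."""
    q, v = query_term.strip().lower(), field_value.strip().lower()
    if not q or not v:
        return 0.0
    if q == v:                      # exact match
        return weight * weight
    if q in v or v in q:            # partial match (assumed: substring overlap)
        return (weight / 2) * weight
    return 0.0                      # no match

def traditional_score(query, test):
    """Sum the per-field scores for one lab test."""
    return sum(field_score(query[f], test[f], w) for f, w in WEIGHTS.items())

query = {"component": "glucose", "system": "blood"}
tests = [
    {"component": "Glucose", "system": "Blood"},
    {"component": "Glucose", "system": "Serum/Plasma"},
    {"component": "Glucose [Mass/volume]", "system": "Blood"},
]
for t in tests:
    print(t["component"], "|", t["system"], "->", traditional_score(query, t))
```

With these assumed weights, an exact match on both fields scores 13.0, an exact component match alone scores 9.0, and a partial component match plus an exact system match scores 8.5.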
### 2. Embedding-Based Semantic Scoring
This method uses **sentence embeddings** to measure the **semantic similarity** between the query and each field in the dataset.
#### 2.1 Embedding the Query
- Encode the query string into a vector using a pre-trained embedding model.
#### 2.2 Embedding the Dataset
- Each text field (e.g., *component*, *system*) is encoded into a vector representation.
#### 2.3 Cosine Similarity
- Use **cosine similarity** to compare the query vector and each field's embedding:
```python
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity([query_embedding], [cell_embedding])[0][0]
```
- Normalize the similarity score from [-1, 1] to [0, 1]:
```python
normalized_score = (similarity + 1) / 2
```
#### 2.4 Weighted Embedding Score
- Final embedding score for a field:
```python
embedding_score = normalized_score * 5 * weight(field)
```
- Aggregate across all eligible text fields.
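
Steps 2.3 and 2.4 can be tied together in a short, dependency-free sketch. Several things here are assumptions: the toy 3-dimensional vectors stand in for real sentence embeddings, the weights are hypothetical, the plain-Python `cosine` helper replaces the scikit-learn call, and "aggregate" is read as a sum over fields.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def embedding_score(query_vec, field_vecs, weights, scale=5):
    """Sum normalized_score * scale * weight over the given fields."""
    total = 0.0
    for field, vec in field_vecs.items():
        normalized = (cosine(query_vec, vec) + 1) / 2   # [-1, 1] -> [0, 1]
        total += normalized * scale * weights[field]
    return total

# Toy vectors standing in for real sentence embeddings.
query_vec = [1.0, 0.0, 0.0]
field_vecs = {
    "component": [1.0, 0.0, 0.0],   # same direction as the query
    "system": [0.0, 1.0, 0.0],      # orthogonal to the query
}
weights = {"component": 3.0, "system": 2.0}   # hypothetical weights

print(embedding_score(query_vec, field_vecs, weights))  # 20.0
```

Note that an unrelated (orthogonal) field still contributes half of its maximum score, because a cosine of 0 maps to 0.5 after normalization.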
### 3. Combined Scoring
```python
total_score = traditional_score + embedding_score
```
### 4. Normalize Scores
Normalize scores between 0 and 1 using:
- **Normalized Score** = score / max_score
### 5. Export Data
Save the processed data and scores into a new **CSV** file for model training.
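
A minimal sketch of the combine, normalize, and export steps using only the standard library. The row names and score values are made up, the `Total_Score` column name is hypothetical, and the maximum is taken over all rows passed in (the README does not say whether normalization is global or per query). To write a real file, the same writer can be pointed at `open(path, "w", newline="")` instead of the in-memory buffer.

```python
import csv
import io

def add_normalized_scores(rows):
    """Add Total_Score and Normalized_Score (score / max_score) to each row."""
    for r in rows:
        r["Total_Score"] = r["traditional_score"] + r["embedding_score"]
    max_score = max(r["Total_Score"] for r in rows)
    for r in rows:
        r["Normalized_Score"] = r["Total_Score"] / max_score if max_score else 0.0
    return rows

def to_csv(rows):
    """Serialize the rows as CSV text, header first."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Made-up rows for illustration only.
rows = [
    {"Name": "test A", "traditional_score": 13.0, "embedding_score": 20.0},
    {"Name": "test B", "traditional_score": 9.0, "embedding_score": 12.0},
    {"Name": "test C", "traditional_score": 0.0, "embedding_score": 11.0},
]
scored = add_normalized_scores(rows)
csv_text = to_csv(scored)
print(csv_text)
```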
---
## **Step 3: Implement the Listwise LTR Model**
We use **LightGBM** due to its speed, simplicity, and support for listwise ranking.
### 1. Dataset Preparation
- Load data from CSV.
- Encode categorical columns: `Query`, `Name`, `Component`, `System`, `Property`, `Measurement`.
- Create `Score_label` from `Normalized_Score`.
- Split into **train** and **test** sets.
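
The encoding and label steps might look like the sketch below. It is an assumption-laden illustration: the README does not say how `Score_label` is derived, so the five-level binning is hypothetical. It is motivated by LightGBM's ranking objectives expecting integer relevance labels.

```python
def label_encode(values):
    """Map each distinct value to an integer code (sorted for determinism)."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes

def to_score_label(normalized_score, levels=5):
    """Bin a score in [0, 1] into an integer grade 0..levels-1 (assumed scheme)."""
    return round(normalized_score * (levels - 1))

systems = ["Blood", "Serum/Plasma", "Blood", "Urine"]
encoded, mapping = label_encode(systems)
print(encoded)   # [0, 1, 0, 2]
print(mapping)

normalized = [1.0, 0.64, 0.1]
print([to_score_label(s) for s in normalized])  # [4, 3, 0]
```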
### 2. LightGBM Dataset Setup
- **Features**: Encoded columns.
- **Grouping**: Group by `Query` (listwise requirement).
- **Labels**: Use `Score_label`.
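
LightGBM's `Dataset` takes the grouping as a list of group sizes in row order (its `group` argument), so the rows must be sorted by query first. A minimal sketch of deriving those sizes, with a hypothetical helper name:

```python
from itertools import groupby

def group_sizes(sorted_queries):
    """Count consecutive rows per query; rows must already be sorted by query."""
    return [len(list(rows)) for _, rows in groupby(sorted_queries)]

# Query column of a dataset that is already sorted by query.
queries = ["bilirubin in plasma"] * 2 + ["glucose in blood"] * 3
sizes = group_sizes(queries)
print(sizes)  # [2, 3]
```

The sizes must sum to the number of rows; if the data is not sorted by query, the same query would be split into several groups.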
### 3. Train the Model
- **Objective**: `rank_xendcg`
- **Approach**: Simulate *AdaRank*-style boosting and reweighting using LightGBM parameters.
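
A hedged sketch of what the training call could look like. Only the `rank_xendcg` objective comes from this README; every other parameter value is illustrative, and the `train_ranker` helper is hypothetical. The AdaRank-style reweighting mentioned above is not reproduced here.

```python
# Illustrative parameter values; the README only confirms the objective.
params = {
    "objective": "rank_xendcg",
    "metric": "ndcg",
    "eval_at": [5, 10],        # report NDCG@5 and NDCG@10
    "learning_rate": 0.05,
    "num_leaves": 31,
    "min_data_in_leaf": 5,
    "verbose": -1,
}

def train_ranker(X_train, y_train, train_groups, num_boost_round=200):
    """Train a listwise ranker.

    y_train holds integer relevance labels and train_groups the per-query
    group sizes, in the same row order as X_train.
    """
    # Imported lazily so the parameter dict can be inspected without LightGBM.
    import lightgbm as lgb

    train_set = lgb.Dataset(X_train, label=y_train, group=train_groups)
    return lgb.train(params, train_set, num_boost_round=num_boost_round)
```

Calling `train_ranker(X, y, sizes)` returns a booster whose `predict(X_test)` yields one raw score per row; higher means more relevant within a query.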
### 4. Prediction
- Predict and normalize scores.
- Sort by `Query` and `Predicted Score`.
- Save results to `results.csv`.
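
The post-processing could be sketched as follows. The README does not say which normalization is applied to predictions, so per-query min-max scaling is an assumption, as are the row layout and the `raw` key holding the model output.

```python
def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def rank_predictions(rows):
    """Normalize raw scores per query, then sort by query and descending score."""
    by_query = {}
    for r in rows:
        by_query.setdefault(r["Query"], []).append(r)
    ranked = []
    for query, group in by_query.items():
        scaled = min_max([g["raw"] for g in group])
        for r, s in zip(group, scaled):
            ranked.append({"Query": query, "Name": r["Name"], "Predicted Score": s})
    ranked.sort(key=lambda r: (r["Query"], -r["Predicted Score"]))
    return ranked

# Made-up raw model outputs.
rows = [
    {"Query": "glucose in blood", "Name": "A", "raw": 2.0},
    {"Query": "glucose in blood", "Name": "B", "raw": -1.0},
    {"Query": "glucose in blood", "Name": "C", "raw": 0.5},
    {"Query": "bilirubin in plasma", "Name": "D", "raw": 0.3},
    {"Query": "bilirubin in plasma", "Name": "E", "raw": 0.9},
]
ranked = rank_predictions(rows)
for r in ranked:
    print(r)
```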
---
## **Step 4: Enhancing the Dataset**
To improve **NDCG**, we introduced new **features**, expanded **queries**, and added more **data**.
### 1. Expanded Queries
Added queries beyond the original three:
- `calcium in serum`
- `cells in urine`
We also added query variations such as `calcium`, `urine`, and `cells`.
### 2. Dataset Expansion
We queried **LOINC Search** for additional documents:
- bilirubin in plasma / bilirubin
- calcium in serum / calcium
- glucose in blood / glucose
- leukocytes / white blood cells count
- blood / urine / cells
Saved results as CSVs.
---
## **Step 5: Model Evaluation**
We use multiple **metrics** to assess model performance:
| Metric           | Description                                                            | Ideal Value |
|------------------|------------------------------------------------------------------------|-------------|
| **MSE**          | Mean Squared Error; lower is better                                    | 0           |
| **R²**           | R-squared; share of variance explained, higher is better               | 1           |
| **Spearman's ρ** | Rank correlation; higher means closer agreement with the target order  | 1           |
| **NDCG**         | Normalized DCG; higher means better ranking quality                    | 1           |
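
The regression metrics and the ranking metrics measure different things, which a small sketch makes visible. The helpers below are plain-Python illustrations (the Spearman formula assumes no ties), and the data is made up: predictions that preserve the target order exactly but are shifted by a constant get a perfect rank correlation and a strongly negative R².

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination; negative when worse than predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def ranks(values):
    """1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out

def spearman(y_true, y_pred):
    """Spearman's rho via the no-ties formula."""
    n = len(y_true)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(y_true), ranks(y_pred)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

y_true = [0.1, 0.2, 0.3, 0.4]
y_pred = [0.5, 0.6, 0.7, 0.8]   # same order, every score shifted by +0.4

print(mse(y_true, y_pred))       # about 0.16
print(r2(y_true, y_pred))        # about -11.8
print(spearman(y_true, y_pred))  # 1.0
```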
---
### **Dataset Performance Comparison**
| Dataset             | MSE    | R²      | Spearman's ρ | NDCG       | Notes                                      |
|---------------------|--------|---------|--------------|------------|--------------------------------------------|
| **Basic**           | 0.1642 | -2.5187 | 0.7265       | 0.9086     | Initial 3 queries                          |
| **First Enhanced**  | 0.0479 | -1.9010 | 0.4700       | 0.8533     | Added `calcium in serum`                   |
| **Second Enhanced** | 0.0461 | -0.8984 | 0.6024       | 0.9421     | Added `bilirubin`, `glucose`, `leukocytes` |
| **Third Enhanced**  | 0.0252 | -0.4765 | 0.4983       | 0.9398     | Added `blood`, `serum or plasma`           |
| **Fourth Enhanced** | 0.0450 | -1.4383 | 0.4323       | 0.9448     | Added `cells in urine`                     |
| **Fifth Enhanced**  | 0.0191 | -0.6009 | 0.4615       | **0.9517** | Final version with `cells`, `urine`        |
---
### **Per-Query NDCG (Fifth Dataset)**
- bilirubin in plasma: 0.9499
- calcium in serum: 0.9637
- cells in urine: 0.9448
- glucose in blood: 0.9663
- white blood cells count: 0.9339
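
As a quick arithmetic check, the unweighted mean of these five values reproduces the 0.9517 reported for the Fifth Enhanced dataset. This suggests, though the README does not state it, that the dataset-level NDCG is a per-query average.

```python
per_query_ndcg = {
    "bilirubin in plasma": 0.9499,
    "calcium in serum": 0.9637,
    "cells in urine": 0.9448,
    "glucose in blood": 0.9663,
    "white blood cells count": 0.9339,
}

mean_ndcg = sum(per_query_ndcg.values()) / len(per_query_ndcg)
print(round(mean_ndcg, 4))  # 0.9517
```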