https://github.com/joseruiz01/mlranking

Listwise Machine Learning model to rank medical documents based on a query

# 🧪 **Listwise Learning to Rank (LTR) for Lab Test Ranking**

Listwise Learning to Rank (LTR) optimizes the **entire ranking order** for a given query, unlike pointwise approaches, which score each document independently, or pairwise approaches, which compare documents two at a time. It is well suited to ranking **lab tests** by relevance to queries like:

> *"glucose in blood"*, *"bilirubin in plasma"*, *"white blood cells count"*

---

## 🔧 **Step 1: Define the Listwise LTR Model**

Listwise LTR models learn a *ranking function* that optimizes evaluation metrics such as **NDCG** (*Normalized Discounted Cumulative Gain*).

### ⚙️ Workflow:

1. **Input**: A list of lab tests (documents) for a given query.
2. **Scoring Function**: A model predicts a *relevance score* per test.
3. **Loss Function**:
   - **XE-NDCG** (cross-entropy NDCG) – a cross-entropy surrogate loss for NDCG.
   - **LambdaRank** – also NDCG-focused; it weights pairwise gradients by the change in NDCG.
4. **Output**: A ranked list of lab tests based on predicted relevance.
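
To make the target metric concrete, here is a minimal sketch of NDCG for a single query. It assumes the common exponential gain `2^rel - 1` and a `log2` position discount; the README does not say which NDCG variant is used, and the function names and relevance grades are illustrative.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for relevance labels listed in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(labels, scores):
    """NDCG of the ordering induced by `scores`, judged against the true `labels`."""
    # Reorder the true labels by descending predicted score.
    ranked = [label for _, label in
              sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)]
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(ranked) / ideal if ideal > 0 else 0.0

labels = [3, 1, 0, 2]                           # hypothetical graded relevance
perfect = ndcg(labels, [0.9, 0.3, 0.1, 0.6])    # scores agree with the labels
swapped = ndcg(labels, [0.3, 0.9, 0.1, 0.6])    # best test pushed down the list
```

`perfect` evaluates to 1.0 because the predicted order matches the ideal order, while `swapped` falls below 1.0 because the most relevant test is no longer ranked first.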

---

## 🧹 **Step 2: Data Preparation**

We calculate *relevance scores* for lab tests using two different scoring procedures.

### 1. Traditional Scoring
*Traditional scoring* is based on direct **keyword matching** between the query and dataset fields. This method prioritizes **exact and partial string matches** in key attributes such as the component and system.

#### 🔍 1.1 Define Query Features
- **Component**: Substance measured (e.g., *Glucose*)
- **System**: Environment of measurement (e.g., *Blood*, *Serum/Plasma*)

#### 🧼 1.2 Preprocess Dataset
Each lab test includes:
- **Component**
- **System**

#### 🎯 1.3 Match Criteria
- **Exact Match**: Full match with the query term.
- **Partial Match**: Synonyms or semantically similar terms.

#### 🧮 1.4 Scoring Scheme
- **Exact Match** (Component) = weight(component) * weight(component)
- **Partial Match** (Component) = weight(component)/2 * weight(component)
- **Exact Match** (System) = weight(system) * weight(system)
- **Partial Match** (System) = weight(system)/2 * weight(system)
- **No Match** = 0
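
The scheme above can be sketched as a small scoring function. The weight values are hypothetical (the README does not list them), and a partial match is approximated here by substring containment, which is an assumption rather than the project's actual matching rule.

```python
# Hypothetical field weights; the source does not state the real values.
WEIGHTS = {"component": 3.0, "system": 2.0}

def field_score(query_term, field_value, weight):
    """Score one field: exact = w * w, partial = (w / 2) * w, no match = 0."""
    query_term = query_term.strip().lower()
    field_value = field_value.strip().lower()
    if not query_term or not field_value:
        return 0.0
    if query_term == field_value:
        return weight * weight
    # Assumption: "partial" means one string contains the other.
    if query_term in field_value or field_value in query_term:
        return (weight / 2) * weight
    return 0.0

def traditional_score(query, test):
    """Sum the field scores over the query's component and system."""
    return sum(field_score(query[f], test[f], w) for f, w in WEIGHTS.items())

query = {"component": "glucose", "system": "blood"}
exact = traditional_score(query, {"component": "Glucose", "system": "Blood"})
partial = traditional_score(query, {"component": "Glucose", "system": "Serum or Blood"})
no_match = traditional_score(query, {"component": "Bilirubin", "system": "Urine"})
```

With these illustrative weights, an exact match on both fields scores 9 + 4 = 13, while a partial system match lowers the total to 9 + 2 = 11.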

### 2. Embedding-Based Semantic Scoring
This method uses **sentence embeddings** to measure the **semantic similarity** between the query and each field in the dataset.

#### 🧠 2.1 Embedding the Query
- Encode the query string into a vector using a pre-trained embedding model.

#### 📄 2.2 Embedding the Dataset
- Each text field (e.g., *component*, *system*, etc.) is encoded into a vector representation.

#### 📐 2.3 Cosine Similarity
- Use **cosine similarity** to compare the query vector and each field's embedding:

```python
from sklearn.metrics.pairwise import cosine_similarity  # assumed source of this helper
similarity = cosine_similarity([query_embedding], [cell_embedding])[0][0]
```

- Normalize the similarity score from [-1, 1] to [0, 1]:

```python
normalized_score = (similarity + 1) / 2
```

#### ⚖️ 2.4 Weighted Embedding Score
- Final embedding score for a field:

```python
embedding_score = normalized_score * 5 * weight(field)
```
- Aggregate across all eligible text fields.
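
The embedding steps can be put together as follows. This is a sketch under stated assumptions: toy 3-dimensional vectors stand in for real sentence embeddings, the weights are hypothetical, and "aggregate" is taken to mean a plain sum over fields.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def embedding_score(query_vec, field_vecs, weights):
    """Sum of normalized_score * 5 * weight(field) over all fields."""
    total = 0.0
    for field, vec in field_vecs.items():
        normalized = (cosine(query_vec, vec) + 1) / 2   # map [-1, 1] to [0, 1]
        total += normalized * 5 * weights[field]
    return total

# Toy vectors in place of real embeddings; weights are hypothetical.
query_vec = [1.0, 0.0, 0.0]
field_vecs = {"component": [1.0, 0.0, 0.0], "system": [0.0, 1.0, 0.0]}
weights = {"component": 3.0, "system": 2.0}
score = embedding_score(query_vec, field_vecs, weights)
```

Here the component vector is identical to the query (normalized similarity 1.0, contributing 15) and the system vector is orthogonal (normalized similarity 0.5, contributing 5), so `score` is 20.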

### ♻️ 3. Combined Scoring

```python
total_score = traditional_score + embedding_score
```

### ⚖️ 4. Normalize Scores
Normalize scores between 0 and 1 using:
- **Normalized Score** = score / max_score

### 💾 5. Export Data
Save the processed data and scores into a new **CSV** file for model training.
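
Steps 3 to 5 might look like the sketch below, using only the standard library. The rows and score values are made up, the maximum is taken over the whole dataset (the README does not say whether normalization is global or per query), and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Hypothetical rows with already-computed partial scores.
rows = [
    {"Query": "glucose in blood", "Name": "test A", "traditional_score": 13.0, "embedding_score": 20.0},
    {"Query": "glucose in blood", "Name": "test B", "traditional_score": 11.0, "embedding_score": 17.5},
    {"Query": "glucose in blood", "Name": "test C", "traditional_score": 0.0, "embedding_score": 12.5},
]

# Step 3: combined score.
for row in rows:
    row["total_score"] = row["traditional_score"] + row["embedding_score"]

# Step 4: divide by the maximum (assumed global, not per query).
max_score = max(row["total_score"] for row in rows)
for row in rows:
    row["Normalized_Score"] = row["total_score"] / max_score if max_score else 0.0

# Step 5: write CSV; replace the buffer with open(path, "w", newline="") for a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```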

---

## 🛠️ **Step 3: Implement the Listwise LTR Model**

We use **LightGBM** due to its speed, simplicity, and support for listwise ranking.

### 📁 1. Dataset Preparation
- Load data from CSV.
- Encode categorical columns: `Query`, `Name`, `Component`, `System`, `Property`, `Measurement`.
- Create `Score_label` from `Normalized_Score`.
- Split into **train** and **test** sets.
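
A minimal standard-library sketch of these preparation steps follows. Three choices in it are assumptions, not details from the README: categorical values are mapped to integer codes in order of first appearance, `Score_label` is an integer grade from 0 to 4 obtained by bucketing `Normalized_Score`, and the split holds out whole queries so each ranking list stays intact. The sample rows are invented.

```python
import random

CATEGORICAL = ["Query", "Name", "Component", "System", "Property", "Measurement"]

def encode_columns(rows, columns):
    """Replace each categorical value with an integer code, per column."""
    codebooks = {col: {} for col in columns}
    encoded = []
    for row in rows:
        new_row = dict(row)
        for col in columns:
            codes = codebooks[col]
            new_row[col] = codes.setdefault(row[col], len(codes))
        encoded.append(new_row)
    return encoded, codebooks

def add_score_label(rows, levels=5):
    """Assumption: bucket Normalized_Score in [0, 1] into grades 0..levels-1."""
    for row in rows:
        row["Score_label"] = min(int(row["Normalized_Score"] * levels), levels - 1)
    return rows

def split_by_query(rows, test_fraction=0.2, seed=0):
    """Assumption: hold out whole queries rather than individual rows."""
    queries = sorted({row["Query"] for row in rows})
    random.Random(seed).shuffle(queries)
    n_test = max(1, round(len(queries) * test_fraction))
    test_queries = set(queries[:n_test])
    train = [r for r in rows if r["Query"] not in test_queries]
    test = [r for r in rows if r["Query"] in test_queries]
    return train, test

# Invented sample rows; real rows would come from the exported CSV.
samples = [("glucose in blood", 1.0), ("glucose in blood", 0.4),
           ("bilirubin in plasma", 0.7), ("calcium in serum", 0.1)]
rows = [{"Query": q, "Name": f"test {i}", "Component": "Glucose", "System": "Blood",
         "Property": "MCnc", "Measurement": "Qn", "Normalized_Score": s}
        for i, (q, s) in enumerate(samples)]

encoded, codebooks = encode_columns(rows, CATEGORICAL)
encoded = add_score_label(encoded)
train_rows, test_rows = split_by_query(encoded)
```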

### 📊 2. LightGBM Dataset Setup
- **Features**: Encoded columns.
- **Grouping**: Group by `Query` (listwise requirement).
- **Labels**: Use `Score_label`.
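
LightGBM takes grouping as a list of group sizes (rows per query, in row order) rather than as a query column, so the rows must first be sorted so that each query's rows are contiguous. A small sketch of deriving those sizes, with a hypothetical helper name and query ids:

```python
from itertools import groupby

def group_sizes(query_ids):
    """Return the number of consecutive rows for each query, in row order."""
    return [sum(1 for _ in run) for _, run in groupby(query_ids)]

# Rows already sorted by query id.
query_ids = ["q1", "q1", "q1", "q2", "q2", "q3"]
sizes = group_sizes(query_ids)   # [3, 2, 1]
```

The sizes must sum to the number of rows in the dataset; if a query's rows are not contiguous, `groupby` would report it as several separate groups.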

### 🧠 3. Train the Model
- **Objective**: `rank_xendcg`
- **Approach**: Simulate *AdaRank*-style boosting and reweighting using LightGBM parameters.

### 📈 4. Prediction
- Predict and normalize scores.
- Sort by `Query` and `Predicted Score`.
- Save results to `results.csv`.
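
The post-processing can be sketched as below. Min-max scaling within each query is an assumption (the README does not specify the normalization), and the rows and raw scores are invented; the resulting list can then be written to `results.csv` with `csv.DictWriter`.

```python
from collections import defaultdict

def rank_predictions(rows):
    """Scale scores to [0, 1] per query, then sort by query and descending score."""
    by_query = defaultdict(list)
    for row in rows:
        by_query[row["Query"]].append(row)

    ranked = []
    for query in sorted(by_query):
        group = by_query[query]
        raw = [r["Predicted Score"] for r in group]
        lo, span = min(raw), max(raw) - min(raw)
        scaled = [{**r, "Predicted Score": (r["Predicted Score"] - lo) / span if span else 0.0}
                  for r in group]
        ranked.extend(sorted(scaled, key=lambda r: r["Predicted Score"], reverse=True))
    return ranked

# Invented raw model outputs.
rows = [
    {"Query": "glucose in blood", "Name": "A", "Predicted Score": 0.2},
    {"Query": "glucose in blood", "Name": "B", "Predicted Score": 1.4},
    {"Query": "bilirubin in plasma", "Name": "C", "Predicted Score": -0.5},
    {"Query": "glucose in blood", "Name": "D", "Predicted Score": 0.8},
    {"Query": "bilirubin in plasma", "Name": "E", "Predicted Score": 0.5},
]
ranked = rank_predictions(rows)
```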

---

## 🚀 **Step 4: Enhancing the Dataset**

To improve **NDCG**, we introduced new **features**, expanded **queries**, and added more **data**.

### 🔍 1. Expanded Queries
Added queries beyond the original three:
- `calcium in serum`
- `cells in urine`

Shorter variations of these queries, such as `calcium`, `urine`, and `cells`, were also included.

### 📦 2. Dataset Expansion
We queried **LOINC Search** for additional documents:
- bilirubin in plasma / bilirubin
- calcium in serum / calcium
- glucose in blood / glucose
- leukocytes / white blood cells count
- blood / urine / cells

Saved results as CSVs.

---

## 📊 **Step 5: Model Evaluation**

We use multiple **metrics** to assess model performance:

| Metric | Description | Ideal Value |
|------------------|-------------------------------------------------------|-------------|
| **MSE** | Mean Squared Error – lower is better | 0 |
| **R²** | R-squared – proportion of variance explained; higher is better | 1 |
| **Spearman's ρ** | Rank correlation – higher means closer agreement with the reference ranking | 1 |
| **NDCG** | Normalized DCG – higher means better ranking quality | 1 |
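
The regression-style metrics can be computed with the standard library, as in the sketch below (function names and sample values are illustrative). The example also shows how R² can be negative while the rank correlation is perfect: predictions that are shifted by a constant keep the right order but fit the label values worse than simply predicting their mean.

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(y_true, y_pred):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rt, rp = average_ranks(y_true), average_ranks(y_pred)
    mt, mp = sum(rt) / len(rt), sum(rp) / len(rp)
    cov = sum((a - mt) * (b - mp) for a, b in zip(rt, rp))
    denom = math.sqrt(sum((a - mt) ** 2 for a in rt) * sum((b - mp) ** 2 for b in rp))
    return cov / denom

y_true = [0.1, 0.2, 0.3, 0.4]
y_pred = [0.6, 0.7, 0.8, 0.9]     # same order, shifted by 0.5
error = mse(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
rho = spearman(y_true, y_pred)
```

In this toy case `rho` is 1 and `r2` is far below zero, so a negative R² in the comparison table does not by itself mean the ordering is poor.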

---

### 📉 **Dataset Performance Comparison**

| Dataset | MSE | R² | Spearman's ρ | NDCG | Notes |
|---------------------|--------|---------|--------------|------------|--------------------------------------------|
| **Basic** | 0.1642 | -2.5187 | 0.7265 | 0.9086 | Initial 3 queries |
| **First Enhanced** | 0.0479 | -1.9010 | 0.4700 | 0.8533 | Added `calcium in serum` |
| **Second Enhanced** | 0.0461 | -0.8984 | 0.6024 | 0.9421 | Added `bilirubin`, `glucose`, `leukocytes` |
| **Third Enhanced** | 0.0252 | -0.4765 | 0.4983 | 0.9398 | Added `blood`, `serum or plasma` |
| **Fourth Enhanced** | 0.0450 | -1.4383 | 0.4323 | 0.9448 | Added `cells in urine` |
| **Fifth Enhanced** | 0.0191 | -0.6009 | 0.4615 | **0.9517** | Final version with `cells`, `urine` |

---

### 📌 **Per-Query NDCG (Fifth Dataset)**
- bilirubin in plasma: 0.9499
- calcium in serum: 0.9637
- cells in urine: 0.9448
- glucose in blood: 0.9663
- white blood cells count: 0.9339
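
As a quick consistency check, the five per-query values can be averaged. The result matches the 0.9517 reported for the Fifth Enhanced dataset, which suggests (an inference, not something the README states) that the overall NDCG is the unweighted mean over queries.

```python
per_query_ndcg = {
    "bilirubin in plasma": 0.9499,
    "calcium in serum": 0.9637,
    "cells in urine": 0.9448,
    "glucose in blood": 0.9663,
    "white blood cells count": 0.9339,
}

# Unweighted mean over queries, rounded to the table's precision.
mean_ndcg = sum(per_query_ndcg.values()) / len(per_query_ndcg)
rounded = round(mean_ndcg, 4)
```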