{"id":27144174,"url":"https://github.com/joseruiz01/mlranking","last_synced_at":"2025-07-14T17:35:34.010Z","repository":{"id":280841453,"uuid":"943352258","full_name":"JoseRuiz01/MLRanking","owner":"JoseRuiz01","description":"Listwise Machine Learning model to rank medical documents based on a query","archived":false,"fork":false,"pushed_at":"2025-03-16T22:13:45.000Z","size":8873,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-16T22:30:34.598Z","etag":null,"topics":["listwise","machine-learning-algorithms","medical-documents","query","ranking-algorithm"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JoseRuiz01.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-05T15:13:31.000Z","updated_at":"2025-03-16T22:13:48.000Z","dependencies_parsed_at":"2025-03-16T22:41:14.501Z","dependency_job_id":null,"html_url":"https://github.com/JoseRuiz01/MLRanking","commit_stats":null,"previous_names":["joseruiz01/lab-mlrankingassignment","joseruiz01/lab-mlranking","joseruiz01/mlranking"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JoseRuiz01%2FMLRanking","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JoseRuiz01%2FMLRanking/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JoseRuiz01%2FMLRanking/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/repositories/JoseRuiz01%2FMLRanking/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JoseRuiz01","download_url":"https://codeload.github.com/JoseRuiz01/MLRanking/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247809991,"owners_count":20999816,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["listwise","machine-learning-algorithms","medical-documents","query","ranking-algorithm"],"created_at":"2025-04-08T08:58:21.750Z","updated_at":"2025-07-14T17:35:33.993Z","avatar_url":"https://github.com/JoseRuiz01.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🧪 **Listwise Learning to Rank (LTR) for Lab Test Ranking**\n\nListwise Learning to Rank (LTR) optimizes the **entire ranking order** for a given query—unlike Pointwise or Pairwise approaches. It's especially effective for ranking **lab tests** by relevance to queries like:\n\n\u003e *\"glucose in blood\"*, *\"bilirubin in plasma\"*, *\"white blood cells count\"*\n\n---\n\n## 🔧 **Step 1: Define the Listwise LTR Model**\n\nListwise LTR models learn a *ranking function* that optimizes evaluation metrics such as **NDCG** (*Normalized Discounted Cumulative Gain*).\n\n### ⚙️ Workflow:\n\n1. **Input**: A list of lab tests (documents) for a given query.  \n2. **Scoring Function**: A model predicts a *relevance score* per test.  \n3. **Loss Function**:\n   - **eXtreme NDCG** – a direct optimization of NDCG.\n   - **LambdaRank** – also NDCG-focused.\n4. 
**Output**: A ranked list of lab tests based on predicted relevance.\n\n---\n\n## 🧹 **Step 2: Data Preparation**\n\nWe calculate *relevance scores* for lab tests using two different scoring procedures.\n\n\n### 1. Traditional Scoring\n*Traditional scoring* is based on direct **keyword matching** between the query and dataset fields. This method prioritizes **exact and partial string matches** in key attributes such as the component and system.\n\n#### 🔍 1.1 Define Query Features\n- **Component**: Substance measured (e.g., *Glucose*)\n- **System**: Environment of measurement (e.g., *Blood*, *Serum/Plasma*)\n\n#### 🧼 1.2 Preprocess Dataset\nEach lab test includes:\n- **Component**\n- **System**\n\n#### 🎯 1.3 Match Criteria\n- **Exact Match**: Full match with the query term.\n- **Partial Match**: Synonyms or semantically similar terms.\n\n#### 🧮 1.4 Scoring Scheme\n- **Exact Match** (Component) = weight(component) * weight(component)\n- **Partial Match** (Component) = weight(component)/2 * weight(component)\n- **Exact Match** (System) = weight(system) * weight(system)\n- **Partial Match** (System) = weight(system)/2 * weight(system)\n- **No Match** = 0\n\n\n### 2. Embedding-Based Semantic Scoring\nThis method uses **sentence embeddings** to measure the **semantic similarity** between the query and each field in the dataset.\n\n#### 🧠 2.1 Embedding the Query\n- Encode the query string into a vector using a pre-trained embedding model.\n\n#### 📄 2.2 Embedding the Dataset\n- Each text field (e.g., *component*, *system*, etc.) 
is encoded into a vector representation.\n\n#### 📏 2.3 Cosine Similarity\n- Use **cosine similarity** to compare the query vector and each field’s embedding:\n  \n   ```python\n   similarity = cosine_similarity([query_embedding], [cell_embedding])[0][0]\n   ```\n\n- Normalize similarity score from [-1, 1] to [0, 1]:\n  \n  ```python\n  normalized_score = ((similarity + 1) / 2)\n  ```\n#### ⚖️ 2.4 Weighted Embedding Score\n- Final embedding score for a field:\n  \n  ```python\n  embedding_score = normalized_score * 5 * weight(field)\n  ```\n- Aggregate across all eligible text fields.\n  \n### ♻️ 3. Combined Scoring\n\n   ```python\n   total_score = traditional_score + embedding_score\n   ```\n\n### ⚖️ 4. Normalize Scores\nNormalize scores between 0 and 1 using:\n- **Normalized Score** = score / max_score\n\n\n### 💾 5. Export Data\nSave the processed data and scores into a new **CSV** file for model training.\n\n---\n\n## 🛠️ **Step 3: Implement the Listwise LTR Model**\n\nWe use **LightGBM** due to its speed, simplicity, and support for listwise ranking.\n\n### 📁 1. Dataset Preparation\n- Load data from CSV.\n- Encode categorical columns: `Query`, `Name`, `Component`, `System`, `Property`, `Measurement`.\n- Create `Score_label` from `Normalized_Score`.\n- Split into **train** and **test** sets.\n\n### 📊 2. LightGBM Dataset Setup\n- **Features**: Encoded columns.\n- **Grouping**: Group by `Query` (listwise requirement).\n- **Labels**: Use `Score_label`.\n\n### 🧠 3. Train the Model\n- **Objective**: `rank_xendcg`\n- **Approach**: Simulate *AdaRank*-style boosting and reweighting using LightGBM parameters.\n\n### 📈 4. Prediction\n- Predict and normalize scores.\n- Sort by `Query` and `Predicted Score`.\n- Save results to `results.csv`.\n\n---\n\n## 🚀 **Step 4: Enhancing the Dataset**\n\nTo improve **NDCG**, we introduced new **features**, expanded **queries**, and added more **data**.\n\n### 🔍 1. 
Expanded Queries\nAdded queries beyond the original three:\n- `calcium in serum`\n- `cells in urine`  \n...including query variations like `calcium`, `urine`, `cells`, etc.\n\n### 📦 2. Dataset Expansion\nWe queried **LOINC Search** for additional documents:\n- bilirubin in plasma / bilirubin  \n- calcium in serum / calcium  \n- glucose in blood / glucose  \n- leukocytes / white blood cells count  \n- blood / urine / cells  \n\nSaved results as CSVs.\n\n---\n\n## 📊 **Step 5: Model Evaluation**\n\nWe use multiple **metrics** to assess model performance:\n\n| Metric           | Description                                           | Ideal Value |\n|------------------|-------------------------------------------------------|-------------|\n| **MSE**          | Mean Squared Error – lower is better                  | 0           |\n| **R²**           | R-squared – explains variance, higher is better       | 1           |\n| **Spearman's ρ** | Rank correlation – higher shows stronger ranking match| 1           |\n| **NDCG**         | Normalized DCG – higher is better ranking quality     | 1           |\n\n---\n\n### 📉 **Dataset Performance Comparison**\n\n| Dataset           | MSE    | R²      | Spearman ρ | NDCG   | Notes                          |\n|-------------------|--------|---------|------------|--------|---------------------------------|\n| **Basic**         | 0.1642 | -2.5187 | 0.7265     | 0.9086 | Initial 3 queries               |\n| **First Enhanced**| 0.0479 | -1.9010 | 0.4700     | 0.8533 | Added `calcium in serum`        |\n| **Second Enhanced**| 0.0461| -0.8984 | 0.6024     | 0.9421 | Added `bilirubin`, `glucose`, `leukocytes` |\n| **Third Enhanced**| 0.0252 | -0.4765 | 0.4983     | 0.9398 | Added `blood`, `serum or plasma`|\n| **Fourth Enhanced**| 0.0450| -1.4383 | 0.4323     | 0.9448 | Added `cells in urine`          |\n| **Fifth Enhanced** | 0.0191| -0.6009 | 0.4615     | **0.9517** | Final version with `cells`, `urine` |\n\n---\n\n### 📌 **Per-Query NDCG 
(Fifth Dataset)**\n- bilirubin in plasma: 0.9499\n- calcium in serum: 0.9637\n- cells in urine: 0.9448\n- glucose in blood: 0.9663\n- white blood cells count: 0.9339\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoseruiz01%2Fmlranking","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjoseruiz01%2Fmlranking","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoseruiz01%2Fmlranking/lists"}