{"id":25815584,"url":"https://github.com/403errors/cancercareai","last_synced_at":"2026-02-09T06:07:46.786Z","repository":{"id":278565375,"uuid":"936031832","full_name":"403errors/CancerCareAI","owner":"403errors","description":"An AI-powered system for extracting cancer-related information from patient Electronic Health Record (EHR) notes","archived":false,"fork":false,"pushed_at":"2025-02-23T21:15:16.000Z","size":2903,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-02-28T11:08:16.797Z","etag":null,"topics":["accelerate","bitsandbytes","information-retrieval","local-llm-integration","nlp-machine-learning","nltk-python","optimum","pipelines","pro","prompt-engineering","requests-library-python","sentence-transformers","torch","transformers"],"latest_commit_sha":null,"homepage":"https://colab.research.google.com/drive/13bzx0MyOojzwq6f8PcUOp5o_LvXt6B1E","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/403errors.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-20T12:21:19.000Z","updated_at":"2025-02-24T11:48:52.000Z","dependencies_parsed_at":null,"dependency_job_id":"a7640480-7324-4d2c-b7c4-ef8a9a280557","html_url":"https://github.com/403errors/CancerCareAI","commit_stats":null,"previous_names":["403errors/cancercareai"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/403errors/CancerCareAI","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/403errors%2FCancerCareA
I","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/403errors%2FCancerCareAI/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/403errors%2FCancerCareAI/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/403errors%2FCancerCareAI/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/403errors","download_url":"https://codeload.github.com/403errors/CancerCareAI/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/403errors%2FCancerCareAI/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270117041,"owners_count":24530282,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-12T02:00:09.011Z","response_time":80,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["accelerate","bitsandbytes","information-retrieval","local-llm-integration","nlp-machine-learning","nltk-python","optimum","pipelines","pro","prompt-engineering","requests-library-python","sentence-transformers","torch","transformers"],"created_at":"2025-02-28T04:29:30.430Z","updated_at":"2026-02-09T06:07:46.482Z","avatar_url":"https://github.com/403errors.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CancerCareAI: AI-Powered Patient Data Extraction\n\nThis project implements an AI-powered system for 
extracting cancer-related information from patient Electronic Health Record (EHR) notes. It addresses two main tasks:\n\n1.  **Information Retrieval:** Retrieving relevant text chunks based on a user query.\n2.  **Medical Data Extraction:** Extracting structured data (diagnosis and medication details) into a JSON format.\n\n**[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/13bzx0MyOojzwq6f8PcUOp5o_LvXt6B1E?usp=sharing)** \n\n## Flowchart\n\n![Flowchart](./flowchart/CancerCareAI_Flowchart.jpg)\n\n## Project Structure\n\nThe project is implemented in Python and is structured as a single, well-commented Jupyter Notebook (`CancerCareAI.ipynb`).   The notebook is divided into four main sections:\n\n1.  **Project Setup and Data Loading:** Installs dependencies, imports libraries, and loads data from a GitHub repository.\n2.  **Task 1 - Information Retrieval (Pipeline):** Implements a combined keyword-based (BM25) and semantic search (Sentence Transformers, CrossEncoder) pipeline for retrieving relevant sentences.\n3.  **Task 2 - Medical Data Extraction (LLM-based Pipeline):** Uses a quantized Large Language Model (Qwen/Qwen2.5-7B-Instruct-1M) to extract structured data in JSON format.  Includes robust error handling for JSON parsing.\n4.  **Putting it all Together (Main Execution Block):** Provides an interactive interface for the user to select a patient, choose a mode (information retrieval or data extraction), and view the results.\n\n## Task 1: Information Retrieval\n\n**Approach:**\n\nThe information retrieval task uses a multi-stage approach to combine the strengths of different retrieval methods:\n\n1.  **Sentence Tokenization:** Input documents are split into individual sentences using `nltk.sent_tokenize`.  This provides a more granular level of retrieval compared to using entire documents.\n2.  **BM25 Ranking:**  The `rank_bm25` library is used to perform keyword-based ranking.  
This is effective for finding sentences that contain the exact query terms.\n3.  **Semantic Search:**  The `sentence-transformers` library is used with the \"all-MiniLM-L6-v2\" model to find sentences that are semantically similar to the query, even if they don't share exact keywords.\n4.  **Filtering:** The top *N* results from both BM25 and semantic search are combined.  Irrelevant or administrative sentences are removed with regular-expression-based filtering.\n5.  **Cross-Encoder Reranking:** A CrossEncoder model (\"cross-encoder/ms-marco-MiniLM-L-6-v2\") is used to rerank the combined results.  CrossEncoders score each query-sentence pair jointly, so they are generally more accurate, though slower, than the Bi-Encoder used in the initial semantic search.\n6.  **Score Normalization and Combination:** Scores from BM25, semantic search, and the CrossEncoder are normalized to a 0-1 range and combined using weighted averaging. This allows for tuning the influence of each method.\n\n**[YouTube Video Demo (Task 1)](https://youtu.be/_N7l-hswtaU)**\n\n## Task 2: Medical Data Extraction\n\n**Approach:**\n\nThe medical data extraction task leverages the Qwen/Qwen2.5-7B-Instruct-1M large language model (LLM) with 4-bit quantization to extract structured data.\n\n1.  **Model Loading:** The Qwen model and tokenizer are loaded using the `transformers` library.  4-bit quantization (using `bitsandbytes`) is applied to reduce memory usage, enabling the model to run on a T4 GPU in Google Colab.  If no GPU is available, model loading is skipped.\n2.  **Prompt Engineering:** A carefully designed prompt is constructed to instruct the LLM to extract specific data elements (diagnosis characteristics and cancer-related medications) and output them in a strict JSON format.  The prompt includes:\n    *   Clear instructions on the LLM's role and task.\n    *   An example input and expected output.\n    *   Specific guidelines for handling missing data (using `null`).\n3.  **Inference:** The LLM generates text based on the prompt and input passage.  
Inference parameters are set for deterministic output (greedy decoding, low temperature, top-k sampling).\n4.  **JSON Extraction and Error Handling:**  The generated text is parsed to extract the JSON object.  Error handling catches `JSONDecodeError` exceptions and includes a fallback that attempts to recover partial JSON output. A regular expression first locates the JSON code block, which is then parsed.\n5.  **Data Aggregation:** The `merge_extractions` function combines and deduplicates data extracted from multiple documents for the same patient. It prioritizes earlier diagnosis dates and combines medication information.\n\n**[YouTube Video Demo (Task 2)](https://youtu.be/TzEx-vvSADw)**\n\n## Running the Code\n\n1.  **Open in Colab:** The recommended way to run the code is in Google Colab. Use the Colab link provided.\n2.  **Runtime:** Ensure you are using a T4 GPU runtime (Runtime -\u003e Change runtime type). This is *required* for the 4-bit quantization of the Qwen model. If `bitsandbytes` issues occur, try restarting the runtime.\n3.  **Run All:** Execute all cells in the notebook (Runtime -\u003e Run all).\n4.  **Interactive Prompts:** The script will prompt you to:\n    *   Select a patient.\n    *   Choose a mode (1 for Information Retrieval, 2 for Medical Data Extraction).\n    *   Enter a query (for Mode 1).\n\n## Dependencies\n\n*   sentence-transformers\n*   rank\\_bm25\n*   pandas\n*   nltk\n*   bitsandbytes\n*   accelerate\n*   optimum\n*   transformers\n*   torch\n*   requests\n\nThese dependencies are installed at the beginning of the `CancerCareAI.ipynb` notebook using `pip`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F403errors%2Fcancercareai","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F403errors%2Fcancercareai","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F403errors%2Fcancercareai/lists"}