{"id":20405542,"url":"https://github.com/astrabert/breastcancer_contextml","last_synced_at":"2025-06-11T19:05:48.185Z","repository":{"id":239840209,"uuid":"800754002","full_name":"AstraBert/breastcancer_contextml","owner":"AstraBert","description":"On-spot training to enhance the performance of traditional machine learning algorithms, applied to the prediction of breast cancer malignity from ultrasound images","archived":false,"fork":false,"pushed_at":"2024-05-22T09:15:45.000Z","size":4257,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-26T06:32:39.126Z","etag":null,"topics":["academic-project","ai","breast-cancer-prediction","healthcare","image-classification","image-processing","kaggle-competition","machine-learning","qdrant","ultrasound"],"latest_commit_sha":null,"homepage":"https://astrabert.github.io/breastcancer_contextml/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AstraBert.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-14T23:46:37.000Z","updated_at":"2024-05-22T09:15:48.000Z","dependencies_parsed_at":"2024-05-22T10:52:11.912Z","dependency_job_id":null,"html_url":"https://github.com/AstraBert/breastcancer_contextml","commit_stats":null,"previous_names":["astrabert/breastcancer_contextml"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraBert%2Fbreastcancer_contextml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/AstraBert%2Fbreastcancer_contextml/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraBert%2Fbreastcancer_contextml/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraBert%2Fbreastcancer_contextml/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AstraBert","download_url":"https://codeload.github.com/AstraBert/breastcancer_contextml/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241950145,"owners_count":20047591,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["academic-project","ai","breast-cancer-prediction","healthcare","image-classification","image-processing","kaggle-competition","machine-learning","qdrant","ultrasound"],"created_at":"2024-11-15T05:11:53.048Z","updated_at":"2025-03-05T02:17:57.217Z","avatar_url":"https://github.com/AstraBert.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003ebreastcancer-contextml\u003c/h1\u003e\r\n\u003ch2 align=\"center\"\u003ePredicting Breast Cancer Malignity Using Contextual Machine Learning\u003c/h2\u003e\r\n\r\n\u003cdiv align=\"center\"\u003e\r\n    \u003cimg src=\"https://img.shields.io/github/languages/top/AstraBert/breastcancer_contextml\" alt=\"GitHub top language\"\u003e\r\n   \u003cimg src=\"https://img.shields.io/github/commit-activity/t/AstraBert/breastcancer_contextml\" alt=\"GitHub commit activity\"\u003e\r\n   \u003cimg 
src=\"https://img.shields.io/badge/Release-v0.0.0-purple\" alt=\"Static Badge\"\u003e\r\n   \u003cimg src=\"https://img.shields.io/badge/Supported_platforms-Windows/Linux/Mac-brown\" alt=\"Static Badge\"\u003e\r\n   \u003cdiv\u003e\r\n        \u003ca href=\"https://astrabert.github.io/breastcancer_contextml\"\u003e\u003cimg src=\"./results/machine-prediction.png\"\u003e\u003c/a\u003e\r\n        \u003cp\u003e\u003ci\u003eThis image was generated with \u003ca href=\"https://pollinations.ai/\"\u003ePollinations AI\u003c/a\u003e API\u003c/i\u003e\u003c/p\u003e\r\n   \u003c/div\u003e\r\n\u003c/div\u003e\r\n\r\n\u003e ⚠️: _The software provided here is a low-level academic project developed in the context of the course \"Machine Learning in Health Care\", held in Spring Term 2024 by professor Christian Salvatore and professor Claudia Cava._\r\n\r\n\u003e _The code was written with the sole purpose of taking part in the **Automatic Diagnosis of Breast Cancer | IUSS 23-24** Kaggle competition, and **MUST NOT** be used for diagnostics. The authors are not responsible for any misuse or out-of-scope use of the software._\r\n\r\nIn this project, developed as a possible solution to the **Automatic Diagnosis of Breast Cancer | IUSS 23-24** Kaggle competition, we explored how on-spot training could enhance the performance of traditional machine learning methods on tabular data, applying this to the prediction of breast cancer malignity from ultrasound images.\r\n\r\nTo reproduce our results, follow these steps:\r\n\r\n\u003ch3 align=\"center\"\u003e1. 
Set up local environment\u003c/h3\u003e\r\nFirst of all, clone this GitHub repository:\r\n\r\n```bash\r\ngit clone https://github.com/AstraBert/breastcancer_contextml\r\n```\r\n\r\nNow move into the cloned folder and install the required dependencies:\r\n\r\n```bash\r\ncd breastcancer_contextml\r\npython3 -m pip install -r scripts/requirements.txt\r\n```\r\n\r\nYou will also have to pull the Qdrant Docker image:\r\n\r\n```bash\r\ndocker pull qdrant/qdrant:latest\r\n```\r\n\r\nOnce the installation is complete, we can begin building!🚀\r\n\r\n\u003ch3 align=\"center\"\u003e2. Preprocess the data\u003c/h3\u003e\r\n\u003cbr\u003e\r\n\u003cdiv align=\"center\"\u003e\r\n    \u003cimg src=\"results/preprocessing.png\"\u003e\r\n\u003c/div\u003e\r\n\u003cbr\u003e\r\n\r\nThe first piece of preprocessing, i.e. image feature extraction, has already been done (there are no images in this repository): the results, obtained through [pyradiomics](https://github.com/aim-harvard/pyradiomics), are saved in [extracted_features.csv](./data/extracted_features.csv). We have 547 training instances with 102 features, but:\r\n\r\n- Not all the features are equally useful, and we want our model to be built on the best ones\r\n- The benign and malignant training samples are imbalanced (benign ones are significantly more numerous)\r\n\r\nThus we apply PCA (_Principal Component Analysis_) to keep the components that capture most of the variability in the dataset, and we resample the training instances with SMOTE (*Synthetic Minority Oversampling Technique*) so that the two classes are balanced.\r\n\r\n```bash\r\npython3 scripts/preprocessing.py\r\n```\r\n\r\nNow we have all the training data, consisting of 775 instances and 16 features, in [combined_pca.csv](./data/combined_pca.csv) and all the test data in [extracted_test_pca.csv](./data/extracted_test_pca.csv).\r\n\r\n\r\n\u003ch3 align=\"center\"\u003e3. 
Build the back-end contextual architecture\u003c/h3\u003e\r\n\u003cbr\u003e\r\n\u003cdiv align=\"center\"\u003e\r\n    \u003cimg src=\"results/vectors.drawio.png\"\u003e\r\n\u003c/div\u003e\r\n\u003cbr\u003e\r\nIn this step we perform:\r\n\r\n- Vectorization of training and test data with dinov2-large by Facebook-Meta, yielding 547 + 100 vectors with 1024 dimensions each\r\n- Creation of a Qdrant collection with training data vectors\r\n- Implementation of a back-end semantic search architecture\r\n\r\n```bash\r\ndocker run -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant_storage:z qdrant/qdrant\r\npython3 scripts/qdrant_collection.py\r\n```\r\n\r\n\u003ch3 align=\"center\"\u003e4. Train and test the model\u003c/h3\u003e\r\n\u003cbr\u003e\r\n\u003cdiv align=\"center\"\u003e\r\n    \u003cimg src=\"results/contml.drawio.png\"\u003e\r\n\u003c/div\u003e\r\n\u003cbr\u003e\r\nIn this step:\r\n\r\n- For each testing image, selected PCA features are extracted with pyradiomics\r\n- For each vectorized test instance, the 250 most similar images in the Qdrant collection are chosen\r\n- A HistGradientBoosting Classifier is trained on tabular data for the 250 selected training images\r\n- The trained model predicts the test instance, then the procedure moves on to the next one\r\n\r\n```bash\r\npython3 scripts/contextual_machine_learning.py\r\n```\r\n\r\n\u003ch3 align=\"center\"\u003e5. PROs and CONs of the approach\u003c/h3\u003e\r\n\r\n#### PROs\r\n\r\n* On-spot training\r\n* Less data to train on\r\n* Highly customizable\r\n* Potentially extensible to other machine learning and deep learning frameworks such as neural networks, and also with AI\r\n* Training is test-oriented\r\n* Good overall performance\r\n\r\n#### CONs\r\n\r\n* Slow and computationally intense\r\n* Needs good-quality and well preprocessed data\r\n* Needs a solid backend contextual architecture\r\n* Needs a lot of trial and error before finding the right context window size\r\n\r\n\u003ch3 align=\"center\"\u003e6. 
License and Rights of Usage\u003c/h3\u003e\r\n\r\nThe software presented here is open-source and distributed under the MIT license.\r\n\r\nAs stated before, the project was developed for learning purposes and must be used only for such purposes.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastrabert%2Fbreastcancer_contextml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fastrabert%2Fbreastcancer_contextml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastrabert%2Fbreastcancer_contextml/lists"}