https://github.com/maguids/lung-cancer-classification

# Lung Cancer Classification

This project was developed for the "Artificial Intelligence and Data Science Laboratory" course and aims to **correctly classify lung cancer using computed tomography (CT) data**. It was carried out in the first semester of the third year of the Bachelor's Degree in Artificial Intelligence and Data Science.


## Requirements:

- Python
- Numpy -> version ~= 1.19.5
- Matplotlib -> version ~= 3.4.3
- Pandas -> version ~= 1.3.5
- Pylidc -> version ~= 0.2.3
- Pydicom -> version ~= 2.0.0
- Pyradiomics -> version == 3.0.1
- Scikit-image
- Scipy
- Scikit-learn
- Seaborn
- SimpleITK
- opencv-python (cv2)
- Xgboost
- Pyswarm
- Statsmodels


## The Project
The project consists of **the classification of pulmonary nodules** through detailed data analysis and the implementation of AI models, **using computed tomography (CT) images**. The **data comes from the LIDC-IDRI dataset**, available at https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI. Solutions for analyzing and processing the data are proposed throughout the project, **based on studies and projects already carried out in the area**, which are referenced throughout the document. The **classification of pulmonary nodules is performed using different techniques and algorithms**, which are **compared and evaluated** to determine which model best suits the problem.


### The Dataset:
As previously stated, the dataset used in this project is available at the following link:
https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI
It gives access to several files, ranging from CSV files with information about diagnoses, the number of nodules in each patient, and notes about the nodules, to the CT scans themselves.

After **analyzing the various files**, we realized that some of them were **irrelevant**, so we ended up **extracting data directly from the CT scans using pylidc** and comparing it with the CSV files to check whether the information was consistent. The entire process is explained in the project.
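
The consistency check described above could look roughly like the following sketch. It is only an illustration: the column names `patient_id` and `n_nodules`, the helper `compare_nodule_counts`, and the values in both tables are hypothetical and are not taken from the project or from the LIDC-IDRI files.

```python
import pandas as pd


def compare_nodule_counts(extracted: pd.DataFrame, reference: pd.DataFrame) -> pd.DataFrame:
    """Join per-patient nodule counts from two sources and flag agreement.

    Both frames are assumed (hypothetically) to have the columns
    'patient_id' and 'n_nodules'.
    """
    merged = extracted.merge(
        reference,
        on="patient_id",
        how="outer",              # keep patients that appear in only one source
        suffixes=("_ct", "_csv"),
        indicator=True,           # adds a '_merge' column: both / left_only / right_only
    )
    # A row is consistent only if the patient exists in both sources
    # and the two counts are equal.
    merged["consistent"] = (merged["_merge"] == "both") & (
        merged["n_nodules_ct"] == merged["n_nodules_csv"]
    )
    return merged


# Made-up example values, for illustration only.
extracted = pd.DataFrame(
    {"patient_id": ["LIDC-IDRI-0001", "LIDC-IDRI-0002", "LIDC-IDRI-0003"],
     "n_nodules": [2, 1, 3]}
)
reference = pd.DataFrame(
    {"patient_id": ["LIDC-IDRI-0001", "LIDC-IDRI-0002", "LIDC-IDRI-0004"],
     "n_nodules": [2, 4, 1]}
)

report = compare_nodule_counts(extracted, reference)
print(report[["patient_id", "n_nodules_ct", "n_nodules_csv", "consistent"]])
```

An outer join is used so that patients missing from either source show up in the report instead of being silently dropped.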


### Image Pre-Processing and Feature Extraction
We used Pyradiomics to extract the required features from the images. However, before extracting features, the images need to be prepared by converting them to **Hounsfield Units (HU)** and then **normalizing** them.
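
As a minimal sketch of these two preparation steps: stored CT pixel values are mapped to HU with the linear rescale defined in the DICOM header (`RescaleSlope` and `RescaleIntercept`), and the result is then clipped and scaled. The default slope and intercept, the window of -1000 to 400 HU, and the min-max scaling to [0, 1] are assumptions chosen for illustration; the project may use different values.

```python
import numpy as np


def to_hounsfield(pixels, slope=1.0, intercept=-1024.0):
    """Convert stored CT pixel values to Hounsfield Units (HU).

    slope and intercept correspond to the DICOM RescaleSlope and
    RescaleIntercept tags; the defaults here are illustrative only.
    """
    return pixels.astype(np.float32) * slope + intercept


def normalize_hu(hu, hu_min=-1000.0, hu_max=400.0):
    """Clip HU values to a window and rescale them linearly to [0, 1].

    The window limits are an assumption, not the project's settings.
    """
    clipped = np.clip(hu, hu_min, hu_max)
    return (clipped - hu_min) / (hu_max - hu_min)


# Tiny made-up "image" of stored pixel values.
raw = np.array([[0, 24, 1024], [1424, 2000, 3000]], dtype=np.int16)
hu = to_hounsfield(raw)
norm = normalize_hu(hu)
print(hu)
print(norm)
```

With pydicom, the slope and intercept would typically be read from the dataset attributes of the same names rather than hard-coded.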

In this project we decided to extract 2D and 3D features and compare them. To do so, we had to segment the images accordingly:
- 2D Segmentation: We use only one slice, which is obtained from the centroid of the nodule.
- 3D Segmentation: We use multiple slices of the CT scan. These contain information in three dimensions (x, y and z), resulting in a volumetric representation of the objects.
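
The difference between the two segmentations can be sketched with NumPy as follows. The helper names are hypothetical, and the code assumes that the volume and its boolean nodule mask have the same shape with the slice (z) axis last; it is not the project's implementation.

```python
import numpy as np


def centroid_slice_index(mask):
    """Return the index along the last axis (z) closest to the mask centroid."""
    coords = np.argwhere(mask)          # (n_voxels, 3) array of nodule voxel indices
    if coords.size == 0:
        raise ValueError("mask is empty")
    return int(round(float(coords[:, 2].mean())))


def split_2d_3d(volume, mask):
    """Return ((slice, slice_mask), (volume, mask)) for 2D and 3D extraction."""
    k = centroid_slice_index(mask)
    return (volume[:, :, k], mask[:, :, k]), (volume, mask)


# Made-up 4x4x5 volume with a nodule occupying slices 1..3.
volume = np.arange(4 * 4 * 5, dtype=np.float32).reshape(4, 4, 5)
mask = np.zeros(volume.shape, dtype=bool)
mask[1:3, 1:3, 1:4] = True

(img_2d, mask_2d), (img_3d, mask_3d) = split_2d_3d(volume, mask)
print(centroid_slice_index(mask), img_2d.shape, img_3d.shape)
```

The 2D pair feeds a single-slice feature extraction, while the 3D pair keeps the full volumetric mask.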




### Models and Results Comparison
Once we had the features, we combined different models with different feature selection methods:

**Models:**
- Support Vector Machine (SVM)
- Random Forest
- XGBoost

**Feature Selection:**
- Without feature selection
- PCA
- T-test
- Random Forest for feature selection
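
One way to organize the combinations listed above is a grid of scikit-learn pipelines, sketched below. Everything specific here is an assumption made for illustration: the synthetic data from `make_classification` stands in for the radiomic features, the helper `ttest_scores`, the number of kept features or components (5), the standard scaling step, and the 5-fold cross-validation are not taken from the project. XGBoost is left out to keep the dependencies small; `xgboost.XGBClassifier` follows the scikit-learn estimator interface and could be added to the `models` dictionary.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def ttest_scores(X, y):
    """Score each feature by the absolute two-sample t statistic between classes."""
    t_stat, p_val = ttest_ind(X[y == 0], X[y == 1], axis=0)
    return np.abs(t_stat), p_val


# Synthetic stand-in for the feature table (not the project's data).
X, y = make_classification(n_samples=120, n_features=20, n_informative=5, random_state=0)

selectors = {
    "none": "passthrough",  # no feature selection
    "pca": PCA(n_components=5),
    "t-test": SelectKBest(score_func=ttest_scores, k=5),
    "random-forest": SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0)
    ),
}
models = {
    "svm": SVC(),
    "random-forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Cross-validated accuracy for every (feature selection, model) pair.
results = {}
for sel_name, selector in selectors.items():
    for model_name, model in models.items():
        pipe = Pipeline(
            [("scale", StandardScaler()), ("select", selector), ("model", model)]
        )
        results[(sel_name, model_name)] = cross_val_score(pipe, X, y, cv=5)

for key, scores in results.items():
    print(key, round(float(scores.mean()), 3))
```

Putting the selector inside the pipeline means it is refit on each training fold, so the held-out fold never influences which features are kept.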

We tested each of the feature selection methods above with every proposed model, for both 2D and 3D features. To determine the best approach, we made the following tests and comparisons:
- **Comparison between different models for the same feature selection method:** using the ANOVA test and Tukey's multiple comparisons test (post hoc);
- **Comparison between different feature selection methods for the same classification model:** using the ANOVA test and Tukey's multiple comparisons test (post hoc);
- **Comparison between 2D and 3D:** using a paired samples t-test.
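
The statistical comparisons listed above can be sketched with SciPy as follows. The score arrays are made-up numbers used only to show the calls; they are not results from this project. `scipy.stats.tukey_hsd` requires SciPy 1.8 or newer; `pairwise_tukeyhsd` from statsmodels is an alternative for the post hoc step.

```python
import numpy as np
from scipy import stats

# Made-up per-fold accuracies for three models (illustration only).
svm = np.array([0.70, 0.72, 0.71, 0.69, 0.73])
rf = np.array([0.80, 0.81, 0.79, 0.82, 0.80])
xgb = np.array([0.90, 0.89, 0.91, 0.90, 0.88])

# One-way ANOVA: is there any difference between the group means?
f_stat, p_anova = stats.f_oneway(svm, rf, xgb)
print("ANOVA p-value:", p_anova)

# Tukey's HSD post hoc test: which pairs of groups differ?
tukey = stats.tukey_hsd(svm, rf, xgb)
print(tukey.pvalue)  # 3x3 matrix of pairwise p-values

# Paired samples t-test: 2D vs 3D scores measured on the same folds.
scores_2d = np.array([0.80, 0.82, 0.81, 0.79, 0.83])
scores_3d = np.array([0.85, 0.86, 0.87, 0.84, 0.88])
t_stat, p_paired = stats.ttest_rel(scores_2d, scores_3d)
print("paired t-test:", t_stat, p_paired)
```

The paired test is appropriate for the 2D vs 3D comparison only when both score lists come from the same data splits, so that each pair of values is directly comparable.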


## About the repository:

- datasets ➡️ Folder with the created datasets;
- Extração_Dados_Pylidc.ipynb ➡️ Jupyter Notebook with the code used to extract data from the CT scans;
- Implicações éticas e legais no diagnóstico de cancros de pulmão.pdf ➡️ PDF file discussing the ethical and legal implications of using AI in lung cancer detection;
- Lab_IACD_Project1.pdf ➡️ Project statement;
- T1_LabIACD_Project1.pdf ➡️ More information on how to develop the project;
- Project.ipynb ➡️ The project in Jupyter Notebook format;
- notebook.pdf ➡️ The project in PDF format.


## Link to the course:

This course is part of the **first semester** of the **third year** of the **Bachelor's Degree in Artificial Intelligence and Data Science** at **FCUP** and **FEUP** in the academic year 2024/2025. You can find more information about this course at the following link:



Link to Course

FCUP

FEUP