https://github.com/codeamt/mle-capstone-data
  
  
    data preprocessing submodule for Udacity's mle nanodegree program. 
    https://github.com/codeamt/mle-capstone-data
  
        Last synced: 8 months ago 
        JSON representation
    
data preprocessing submodule for Udacity's mle nanodegree program.
- Host: GitHub
- URL: https://github.com/codeamt/mle-capstone-data
- Owner: codeamt
- Created: 2020-06-04T03:14:58.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2024-03-09T21:51:25.000Z (over 1 year ago)
- Last Synced: 2025-01-13T14:19:28.803Z (10 months ago)
- Language: Jupyter Notebook
- Size: 1.1 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 5
- 
            Metadata Files:
            - Readme: README.md
 
Awesome Lists containing this project
README
          Generating COVIDx Dataset
 Data preprocessing submodule for Udacity's Machine Learning Engineer Nanodegree program.
  
 Generates the latest COVIDx Dataset for modeling; from benchmark research model first presented in [[1]](). 
 
 ## Repo Contents
 
 
1 directory, 6 files
## Generating Covidx Training Set 
There are 2 ways to generate the COVIDx Dataset: 
- The [data preprocessing notebook](https://github.com/codeamt/mle-capstone-data/blob/master/data_pre-processing.ipynb) (In Jupyter or Colab) 
- The [command-line tool](https://github.com/codeamt/mle-capstone-data/tree/master/data-cli-tool)
### The Data Pre-Processing Notebook: 
The data preprocessing notebook [covidnet_data_processing.ipynb](https://github.com/codeamt/mle-capstone-data/blob/master/covidnet_data_processsing.ipynb) in this repo includes additional steps for generating .csv labeling files for modeling.
### Setting up and Running data-cli-tool:
#### What you'll need: 
- Linux-based system with Python 3.7+ installed
- And/or virtualenv intalled 
- A Kaggle Authentication Key (kaggle.json file)
#### Running Locally (Linux): 
In a terminal, get the repo via git if you don't have it on your system already, then change into the repo, create a virtual environment and activate, and run the python script: 
```
pip3 install virtualenv
git clone https://github.com/codeamt/mle-capstone-data.git
cd mle-capstone-data-master && virtualenv .
source bin/activate
python3 get_covidx.py --kaggle_file "/path/to/your/kaggle.json"
```
Be sure to upload and extract the output zip file of this pipeline phase to the environment/notebook you use for the [modeling phase](https://github.com/codeamt/mle-capstone-modeling).
## About the Data 
This set aggregates and deduplicates examples to construct COVIDxv3 from the following sources:  
- https://github.com/ieee8023/covid-chestxray-dataset
- https://github.com/agchung/Figure1-COVID-chestxray-dataset
- https://github.com/agchung/Actualmed-COVID-chestxray-dataset
- https://www.kaggle.com/tawsifurrahman/covid19-radiography-database
- https://www.kaggle.com/c/rsna-pneumonia-detection-challenge (which came from: https://nihcc.app.box.com/v/ChestXray-NIHCC)
For more notes on previous versions of the dataset, please refer to the original [COVID-Net](https://github.com/lindawangg/COVID-Net) repo for more detailed [documentation](https://github.com/lindawangg/COVID-Net/blob/master/docs/COVIDx.md). 
### Chest Radiography Images Distribution 

[1] L. Wang and A. Wong, “COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID19 Cases from Chest Radiography Images,” ArXiv200309871 Cs Eess, Mar. 2020 [Online]. Available: http://arxiv.org/abs/2003.09871.