https://github.com/codeamt/mle-capstone-data
data preprocessing submodule for Udacity's mle nanodegree program.
- Host: GitHub
- URL: https://github.com/codeamt/mle-capstone-data
- Owner: codeamt
- Created: 2020-06-04T03:14:58.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2024-03-09T21:51:25.000Z (10 months ago)
- Last Synced: 2024-03-09T22:28:38.334Z (10 months ago)
- Language: Jupyter Notebook
- Size: 1.1 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 5
Metadata Files:
- Readme: README.md
# Generating COVIDx Dataset
Data preprocessing submodule for Udacity's Machine Learning Engineer Nanodegree program.
Generates the latest COVIDx dataset for modeling, following the benchmark research model first presented in [1].
## Repo Contents
1 directory, 6 files
## Generating COVIDx Training Set
There are 2 ways to generate the COVIDx Dataset:
- The [data preprocessing notebook](https://github.com/codeamt/mle-capstone-data/blob/master/data_pre-processing.ipynb) (in Jupyter or Colab)
- The [command-line tool](https://github.com/codeamt/mle-capstone-data/tree/master/data-cli-tool)
### The Data Pre-Processing Notebook:
The data preprocessing notebook [covidnet_data_processing.ipynb](https://github.com/codeamt/mle-capstone-data/blob/master/covidnet_data_processsing.ipynb) in this repo includes additional steps for generating .csv labeling files for modeling.
### Setting up and Running data-cli-tool:
#### What you'll need:
- Linux-based system with Python 3.7+ installed
- virtualenv installed (the commands below install it with pip3)
- A Kaggle authentication key (kaggle.json file)
#### Running Locally (Linux):
In a terminal, clone the repo with git if you don't already have it on your system, change into the repo directory, create and activate a virtual environment, then run the Python script:
```
pip3 install virtualenv
git clone https://github.com/codeamt/mle-capstone-data.git
# git clone creates "mle-capstone-data"; the "-master" suffix only appears in zip downloads
cd mle-capstone-data && virtualenv .
source bin/activate
python3 get_covidx.py --kaggle_file "/path/to/your/kaggle.json"
```
Upload the output zip file from this pipeline phase to the environment or notebook you use for the [modeling phase](https://github.com/codeamt/mle-capstone-modeling), and extract it there.
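How `get_covidx.py` consumes the key passed via `--kaggle_file` is not documented here. The Kaggle API client itself conventionally reads credentials from `~/.kaggle/kaggle.json` and warns when that file is readable by other users. The following is a minimal sketch of preparing the key that way, assuming this convention applies; `install_kaggle_key` is a hypothetical helper, not part of this repo.

```python
import json
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_key(src, config_dir=None):
    """Copy a kaggle.json key into the Kaggle config directory.

    Hypothetical helper for illustration; not part of this repo.
    """
    src = Path(src).expanduser()

    # Fail early if the file is not valid JSON with the two expected fields.
    with src.open() as fh:
        creds = json.load(fh)
    missing = {"username", "key"} - creds.keys()
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")

    # Default to ~/.kaggle, the location the Kaggle client checks by convention.
    config_dir = Path(config_dir).expanduser() if config_dir else Path.home() / ".kaggle"
    config_dir.mkdir(parents=True, exist_ok=True)

    dest = config_dir / "kaggle.json"
    if src.resolve() != dest.resolve():
        shutil.copyfile(src, dest)

    # Owner read/write only (equivalent to chmod 600).
    os.chmod(dest, stat.S_IRUSR | stat.S_IWUSR)
    return dest
```

Calling `install_kaggle_key("/path/to/your/kaggle.json")` once before running the pipeline should be enough if the script relies on the default credential location.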
## About the Data
This set aggregates and deduplicates examples to construct COVIDxv3 from the following sources:
- https://github.com/ieee8023/covid-chestxray-dataset
- https://github.com/agchung/Figure1-COVID-chestxray-dataset
- https://github.com/agchung/Actualmed-COVID-chestxray-dataset
- https://www.kaggle.com/tawsifurrahman/covid19-radiography-database
- https://www.kaggle.com/c/rsna-pneumonia-detection-challenge (which came from: https://nihcc.app.box.com/v/ChestXray-NIHCC)

For notes on previous versions of the dataset, refer to the more detailed [documentation](https://github.com/lindawangg/COVID-Net/blob/master/docs/COVIDx.md) in the original [COVID-Net](https://github.com/lindawangg/COVID-Net) repo.
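This README does not describe how examples from the sources above are deduplicated, and the actual pipeline may use a different criterion (for example patient or file identifiers). As an illustration only, one common approach is to drop files whose bytes are identical by comparing content hashes. The sketch below assumes images already sit on disk; `file_digest` and `deduplicate` are hypothetical names.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def deduplicate(paths):
    """Keep the first path seen for each distinct file content.

    Only exact byte-for-byte duplicates are detected; re-encoded or
    resized copies of the same image will not match.
    """
    seen = {}
    for path in paths:
        # setdefault keeps the earlier path when the digest repeats.
        seen.setdefault(file_digest(path), Path(path))
    return list(seen.values())
```

Because dictionaries preserve insertion order, the result keeps the input ordering, so listing the preferred source first determines which copy survives.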
### Chest Radiography Images Distribution
[1] L. Wang and A. Wong, "COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest Radiography Images," arXiv:2003.09871 [cs, eess], Mar. 2020. [Online]. Available: http://arxiv.org/abs/2003.09871