https://github.com/glencrawford/australia_rain_tomorrow_binary_classification_prediction
Binary classification model to predict whether or not it will rain tomorrow with a Tensorflow/Keras and scikit-learn neural network.
https://github.com/glencrawford/australia_rain_tomorrow_binary_classification_prediction
binary-classification classification keras machine-learning neural-network python scikit-learn tensorflow
Last synced: about 1 month ago
JSON representation
Binary classification model to predict whether or not it will rain tomorrow with a Tensorflow/Keras and scikit-learn neural network.
- Host: GitHub
- URL: https://github.com/glencrawford/australia_rain_tomorrow_binary_classification_prediction
- Owner: GlenCrawford
- Created: 2020-01-27T09:21:21.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-02-04T11:19:37.000Z (over 6 years ago)
- Last Synced: 2025-06-21T21:11:44.558Z (12 months ago)
- Topics: binary-classification, classification, keras, machine-learning, neural-network, python, scikit-learn, tensorflow
- Language: Python
- Homepage:
- Size: 3.66 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Binary classification machine learning model to predict whether it will rain tomorrow in Australia.
This is a neural network that uses binary classification to predict whether, given meteorological observations of a given day at a given weather station in Australia, it will rain there the next day. The model is trained and tested on a dataset containing about 10 years of daily weather observations from numerous Australian weather stations.
There are two separate implementations in this project: one using Tensorflow 2 and Keras, and another using scikit-learn.
The model currently has an accuracy of approximately 87%. Given that it doesn't rain exactly 50% of days, there are a lot more rows in the dataset where the target "RainTomorrow" column has a "No" value than "Yes". This means that you can make a complete guess and be right by random chance about 70% of the time. My goal was therefore to get the model accuracy to somewhere around 90%.
Here is the structure of the dataset used for training and testing, showing the header and two data rows:
| Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RISK_MM | RainTomorrow |
|:----------:|:--------:|:-------:|:-------:|:--------:|:-----------:|:--------:|:-----------:|:-------------:|:----------:|:----------:|:------------:|:------------:|:-----------:|:-----------:|:-----------:|:-----------:|:--------:|:--------:|:-------:|:-------:|:---------:|:-------:|:------------:|
| 2010-10-20 | Sydney | 12.9 | 20.3 | 0.2 | 3 | 10.9 | ENE | 37 | W | E | 11 | 26 | 70 | 57 | 1028.8 | 1025.6 | 3 | 1 | 16.9 | 19.8 | No | 0 | No |
| 2017-06-25 | Brisbane | 11 | 24.2 | 0 | 2.2 | 9.8 | ENE | 20 | SSW | NNE | 2 | 7 | 68 | 53 | 1020.5 | 1017.3 | 6 | 3 | 15.9 | 22.6 | No | 0 | Yes |
The data was sourced from this [Kaggle dataset](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package) compiled by Joe Young and Adam Young, which was in turn sourced from [http://www.bom.gov.au/climate/data](http://www.bom.gov.au/climate/data) and [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/). This data is available under a Creative Commons (CC) Attribution 3.0 licence. For details on the meaning of each observation, see [this page](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml). Copyright Commonwealth of Australia, Bureau of Meteorology.
## Requirements
* Python (developed with version 3.7.4).
* See dependencies.txt for packages and versions (and below to install).
## Data preprocessing
Data preprocessing is done by a combination of Pandas (to drop NaN rows and map Yes/No strings into 1/0 binary integers), scikit-learn (to scale/normalize numeric features by calculating the z-score of each of their values), and Tensorflow to apply one-hot encoding to categorical features. The model's input layer is thus a combination of pre-normalized numeric features and one-hot encoded categorical features.
The following columns were skipped and not used as features for the model; all the rest were used:
* __Date:__ Not relevant.
* __RainToday:__ This is just a boolean representation of the numeric column "Rainfall". Experimented with adding this feature to the model, but had no effect on accuracy.
* __RISK_MM:__ This is the amount of rain for the following day. This was used to create the label/target column "RainTomorrow". This would be used if the model was doing regression, rather than classification.
* __RainTomorrow:__ Used as the training label/target.
The output of the model is just a single sigmoid-activation neuron which predicts target variable "RainTomorrow".
## Setup
* Clone the Git repository.
* Install the dependencies:
```bash
pip install -r dependencies.txt
```
## Run
```bash
python -W ignore tensor_flow.py
```
or
```bash
python -W ignore scikit_learn.py
```
Note that there is a [current bug in TensorFlow](https://github.com/tensorflow/tensorflow/issues/30609) where deprecation warnings are printed at the usage of feature columns, even though the new feature column API is indeed being used. It has been fixed and will be in a future release of TensorFlow. In the meantime, will just have to live with the warning output.
## Monitoring/logging
After training, run:
```
$ tensorboard --logdir logs/fit
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.1.0 at http://localhost:6006/ (Press CTRL+C to quit)
```
Then open the above URL in your browser to view the model in TensorBoard.