https://github.com/in-c0/drought-predictor-neural-network

Classification and Regression model trained on climate features extracted from the ERA5 global dataset
https://github.com/in-c0/drought-predictor-neural-network
Last synced: 2 months ago
JSON representation
Classification and Regression model trained on climate features extracted from the ERA5 global dataset
Host: GitHub
URL: https://github.com/in-c0/drought-predictor-neural-network
Owner: in-c0
Created: 2024-10-21T12:05:38.000Z (7 months ago)
Default Branch: main
Last Pushed: 2024-10-21T12:45:44.000Z (7 months ago)
Last Synced: 2025-01-26T09:29:20.869Z (4 months ago)
Size: 61.5 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

README

        # drought-predictor-neural-network

A drought predictor (Classification & Regression model) trained on climate features selectively extracted from the ERA5 global dataset over the period of 41 years from 1979 to 2020. Code is excluded for copyright reasons.

### Input Variables

| Variable | Description                                      |

|----------|--------------------------------------------------|

| mn2t     | Minimum temperature at 2 meters (°K)             |

| msl      | Mean sea level pressure (Pa)                     |

| mx2t     | Maximum temperature at 2 meters (°K)             |

| q        | Specific humidity (kg kg⁻¹)                      |

| t        | Average temperature at Pressure level 850 hPa (°K) |

| t2       | Average temperature at 2 meters (°K)             |

| tcc      | Total cloud cover (0-1)                          |

| u        | U wind component at pressure level 850 hPa (m s⁻¹) |

| u10      | U wind component at 10 meters (m s⁻¹)            |

| v        | V wind component at pressure level 850 hPa (m s⁻¹) |

| v10      | V wind component at 10 meters (m s⁻¹)            |

| z        | Geopotential (m² s⁻²)                            |

| month    | 1 to 12                                          |

| year     | 1979 to 2020                                     |

| grid ID  | The ID of the grid cell                          |

| SPI      | Standardised Precipitation Index (unitless)      |

### Target Variable

The drought index SPI (Standardised Precipitation Index) is used here as a proxy to characterise the intensity of drought caused by precipitation deficiency. Very low SPI values suggest intense drought, while very high values indicate very wet conditions. 

SPI is our target variable in the regression task, and we will predict it based on the climate variables. 

In the classification task, we will calculate a binary target variable ’Drought’ from SPI.

The ’Drought’ variable will be 1 to indicate the occurrence of drought and 0 to indicate no drought. We will apply a threshold of -1 to SPI, where values below or equal to this threshold indicate drought conditions (i.e. Drought = 1); otherwise, there is no drought (i.e. Drought = 0).

 time series plot of SPI for a single grid cell is shown in Fig. 3. Periods where SPI falls below or is equal to -1 indicate periods of drought.

### Preprocessing

 ```

Loading data

======= Climate_SPI.csv =======

Row, Col:  (15120, 16)

   year  month       u10       v10        mx2t        mn2t       tcc          t2            msl           t         q         u         v           z       SPI  grid_ID

0  1979      1 -2.958586 -0.634959  309.023720  297.283986  0.291988  302.992399  100796.767708  299.970613  0.008493 -4.568461 -2.839551  459.991748 -0.715037      303

1  1979     10 -0.049360 -0.113635  302.851483  289.459347  0.192154  296.133314  101346.796371  293.335081  0.004813 -2.372457 -3.721755  904.832294 -0.246504      303

2  1979     11 -0.404197 -0.341808  308.174168  294.707882  0.306703  301.354577  101135.556510  298.360025  0.005545 -2.794565 -3.715220  751.517519 -0.921090      303

3  1979     12 -1.089172 -0.600055  310.344489  296.948569  0.327175  303.544371  100999.285534  300.545961  0.006324 -3.429895 -3.687178  622.386996 -0.917537      303

4  1979      2 -2.160727 -0.715109  303.806692  294.093896  0.502479  298.759323  101125.883929  297.114917  0.009923 -3.892438 -2.943332  778.958111  0.631926      303

===============================

Excluding year and grid_ID from data

======= Climate_SPI.csv =======

Row, Col:  (15120, 14)

   month       u10       v10        mx2t        mn2t       tcc          t2            msl           t         q         u         v           z       SPI

0      1 -2.958586 -0.634959  309.023720  297.283986  0.291988  302.992399  100796.767708  299.970613  0.008493 -4.568461 -2.839551  459.991748 -0.715037

1     10 -0.049360 -0.113635  302.851483  289.459347  0.192154  296.133314  101346.796371  293.335081  0.004813 -2.372457 -3.721755  904.832294 -0.246504

2     11 -0.404197 -0.341808  308.174168  294.707882  0.306703  301.354577  101135.556510  298.360025  0.005545 -2.794565 -3.715220  751.517519 -0.921090

3     12 -1.089172 -0.600055  310.344489  296.948569  0.327175  303.544371  100999.285534  300.545961  0.006324 -3.429895 -3.687178  622.386996 -0.917537

4      2 -2.160727 -0.715109  303.806692  294.093896  0.502479  298.759323  101125.883929  297.114917  0.009923 -3.892438 -2.943332  778.958111  0.631926

===============================

Adding 'Drought' column (0 or 1) based on SPI

======= Climate_SPI.csv =======

Row, Col:  (2240, 15)

    month       u10       v10        mx2t        mn2t       tcc          t2            msl           t         q         u         v            z       SPI  Drought

9       7 -1.078143  0.902460  292.153382  278.805876  0.079997  285.057087  102283.450857  284.523974  0.003543 -2.473988 -1.114550  1695.127406 -1.099545        1

14     11 -0.753911 -0.576591  307.029682  293.860886  0.184344  300.264339  101172.854167  297.258291  0.005441 -3.071376 -3.726942   793.013513 -1.262581        1

20      6 -0.819211  0.867296  291.798410  280.497951  0.213060  285.672475  102054.188281  285.072712  0.005043 -2.170747 -0.765514  1537.134654 -1.606841        1

23      9 -0.170930  0.643260  301.150361  285.577702  0.054355  293.081444  101787.791667  291.516460  0.003437 -2.213144 -2.092395  1312.511619 -1.105385        1

37     10  0.034087  0.872630  302.460314  288.187859  0.148251  295.186908  101527.905746  292.523171  0.003713 -2.111258 -2.117658  1073.673213 -1.362772        1

===============================

Roughly 14.81% of the data has been labeled as 'Drought'.

The samples are equally distributed across 12 months over the entire dataset.

Month distribution: 

 month

1     1260

2     1260

3     1260

4     1260

5     1260

6     1260

7     1260

8     1260

9     1260

10    1260

11    1260

12    1260

Training set size: 10584 samples (70.0%)

Validation set size: 2268 samples (15.0%)

Test set size: 2268 samples (15.0%)

Total: 15120 samples

Replacing 'month' with 'cos_month' and 'sin_month'...

Training set: 

======= Climate_SPI.csv =======

Row, Col:  (10584, 16)

            u10       v10        mx2t        mn2t       tcc          t2            msl           t         q         u         v            z  Drought  month_normalised  cos(month_normalised)  sin(month_normalised)

6552  -1.479215 -0.349176  308.946537  296.729571  0.200873  302.857344  101032.544010  298.684343  0.007421 -3.278758 -2.823981   694.846833        0          0.000000           1.000000e+00               0.000000

10290  1.215222  1.341728  292.337389  282.794098  0.507675  287.565900  101984.340104  284.021999  0.005509 -0.961927 -0.816961  1463.148627        0          1.570796           6.123234e-17               1.000000

1098  -1.049909  0.459411  301.584464  289.574852  0.202775  295.267790  101702.975000  292.449862  0.006014 -2.335011 -1.833616  1250.147030        0          1.570796           6.123234e-17               1.000000

4525  -0.587676 -0.438277  301.028648  288.192378  0.286398  294.396633  101463.419607  290.270405  0.005417 -2.975901 -3.688355  1019.604987        0          4.712389          -1.836970e-16              -1.000000

4313  -1.335883  0.822051  304.262825  291.893402  0.207135  298.119280  101405.180696  293.967737  0.006583 -3.529105 -1.983720   983.308748        0          1.047198           5.000000e-01               0.866025

===============================

Validation set: 

======= Climate_SPI.csv =======

Row, Col:  (2268, 16)

            u10       v10        mx2t        mn2t       tcc          t2            msl           t         q         u         v            z  Drought  month_normalised  cos(month_normalised)  sin(month_normalised)

6706   1.839478 -0.081112  289.473332  278.865858  0.314341  283.883452  101895.747732  282.425177  0.003976 -0.153143 -1.877165  1323.057953        0          3.665191          -8.660254e-01              -0.500000

10093  0.897742  0.280632  295.250206  284.041292  0.427203  289.746081  101453.679940  284.966471  0.004620 -1.996432 -2.744730   945.190482        0          4.712389          -1.836970e-16              -1.000000

4169  -2.536848  0.726182  303.497044  293.499535  0.409161  298.526789  101526.751512  294.385840  0.007723 -4.536848 -2.243372  1103.272760        0          1.047198           5.000000e-01               0.866025

4660  -1.417957  0.659064  302.306398  289.817686  0.286100  296.016343  101274.662388  293.764555  0.007624 -2.855455 -1.091930   848.800585        0          0.523599           8.660254e-01               0.500000

5675   1.310550  0.547354  294.728937  282.397130  0.147330  288.758396  101700.329687  284.766466  0.004109 -1.696079 -2.510932  1189.587807        1          4.188790          -5.000000e-01              -0.866025

===============================

```

### Transformations

List of transformations applied:

- Removal of columns `year` and `grid_ID`

- Cyclic encoding of `month`, replacing it with `cos_month` and `sin_month`

- Standardisation of `u10`, `v10`, `mx2t`, `mn2t`, `tcc`, `t2`, `msl`, `t`, `q`, `u`, `v`, `z` using `StandardScaler`

- Skewness shifting using `PowerTransformer` (yeo-johanson) transformation on `q`, `tcc`, `v10`, `u`

  

![image](https://github.com/user-attachments/assets/fa72639d-d4b8-4070-a908-831192557290)

![image](https://github.com/user-attachments/assets/23035d36-8020-42d8-aa5b-9c370177aca6)

### Building the model

- A sequential model

- A single dense hidden layer (ReLU)

- Dropout = 0.3

- 100 epochs

- A single output layer (sigmoid)

- Loss function = Binary cross-entropy

- Batch sizes = [64, 128, 256]

- Optimisers = ['SGD', 'Adam']

- sgd_learning_rate = 0.01

- adam_learning_rate = 0.001

![image](https://github.com/user-attachments/assets/63b6e982-0c88-46d0-b4a6-e569cd83a61f)

- With SGD, the model with batch size 128 seems to be performing the best with lowest loss. 

- With Adam, the model with batch size 64 seems to be performing the best, and it achieves the lowest validation loss (~0.309) within 100 epochs. For both optimisers, the models with batch size 256 is converging more slowly and seems to require more than 100 epochs for it to yield results matching the other batch sizes.

The best model is with Adam with batch size 64, with the minimal validation loss 0.3048010468482971

![image](https://github.com/user-attachments/assets/841a3d7f-9f67-4133-b09d-7c558b02a90a)

![image](https://github.com/user-attachments/assets/fd15806f-ab63-45e3-9c5c-2d81016516cd)

The accuracy stabilizes around 86% for both training and validation sets. The validation accuracy initially appears slightly higher than training accuracy in the first 10-20 epochs, but falls afterwards to values slightly lower than training accuracy. The accuracy seems to stabilise around that point, though there is some fluctuation. I expect that there will be no significant increase in accuracy beyond this, indicating that the model has mostly converged at around 86%. 

### Optimisation

The 14 input features we have can be grouped by relevance. My hypothesis was that:

- `q` (humidity) and `tcc`(total cloud cover) are likely the most relevant factors to precipitation/droughtness,

- `cos(month_normalised)` and `sin(month_normalised)` might have a strong correlation due to seasonal nature of climate data,

- `t`, `t2`, `mx2t`, `mn2t` are all related to temperature, and using these all may be redundant, 

- `msl` (mean sea level), `u` (wind), `v` (wind), `u10` (wind), `v10` (wind) will have to be strongly correlated to each other, including temperature, in order to be an indicator of drought (i.e. for sea water to evaporate and form a rain cloud that can move quickly enough to reach the Murray-Darling basin area),

In addition to this grouping, we will train the model with these additional subsets:

- humidity + total cloud cover (`q`, `tcc`)

- humidity + seasonality (`q`, `cos(month_normalised)`, `sin(month_normalised)`)

- humidity + total cloud cover + seasonality (`q`, `tcc`, `cos(month_normalised)`, `sin(month_normalised)`)

- humidity + total cloud cover + seasonality + geopotential (`q`, `tcc`, `cos(month_normalised)`, `sin(month_normalised)`, `z`)

- temperature + msl + wind (`t`, `t2`, `mx2t`, `mn2t`, `msl`, `u`, `v`, `u10`, `v10`)

- temperature (average only) + msl + wind (`t`, `msl`, `u`, `v`, `u10`, `v10`)

- temperature + msl + wind + total cloud cover (`t`, `t2`, `mx2t`, `mn2t`, `msl`, `u`, `v`, `tcc`)

- temperature + msl + wind + total cloud cover + seasonality (`t`, `t2`, `mx2t`, `mn2t`, `msl`, `u`, `v`, `tcc`, `cos(month_normalised)`, `sin(month_normalised)`)

- temperature (average only) + msl + wind + humidity + total cloud cover + seasonality (`t`, `msl`, `u`, `v`, `q`, `tcc`, `cos(month_normalised)`, `sin(month_normalised)`)

- all 14 features

![image](https://github.com/user-attachments/assets/d5a31d23-6070-4488-9490-eaa25ed87829)

The best subset is: temp_avg_msl_wind_humidity_cloud_seasonality with validation accuracy: 0.86816579

### Evaluation

- Accuracy: 0.81569665 (Balanced: 0.74676501)

- Precision: 0.42084942

- Recall: 0.64880952

- F1-Score: 0.51053864

![image](https://github.com/user-attachments/assets/a2cb4e81-281b-4ab2-8340-e03d56c3d7de)

## Regression model

The goal is to predict the intensity of the drought, represented by ‘SPI’. We applied the same transformation and building configuation, except that the output layer uses linear activation and MSE for the loss function. (Metrics MAE)

![image](https://github.com/user-attachments/assets/16bb25b6-adc0-4783-b01c-feb267439438)

![image](https://github.com/user-attachments/assets/00b4aad4-4352-44c6-9e76-9c50cf8d36b0)

![image](https://github.com/user-attachments/assets/7e3588b3-b272-4738-aefd-21ba873f0256)

![image](https://github.com/user-attachments/assets/d7248e5b-9ce2-4901-8557-35c066592fc4)

![image](https://github.com/user-attachments/assets/5103f682-d95a-440e-bce1-6a22d1b0d569)

![image](https://github.com/user-attachments/assets/1450a59e-07fe-44af-ad9e-093f414e2b44)

- Balanced Accuracy on New Data: 0.74614713

- Precision on New Data: 0.44197685

- Mean Absolute Error (MAE): 0.49075953

- Pearson Correlation Coefficient: 0.78394778

## Rooms for Improvement

- The model converges too early; could use deeper or more complex networks

- Using Recurrent layer instead of Dense

- Experimenting with additional layer(s),

- Experimenting with different threshold value,

- F1-score or AUC-PR
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/in-c0/drought-predictor-neural-network

Awesome Lists containing this project

README