Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/simonneutert/wine_quality_data

data-mining data-science data-visualization jupyter jupyter-notebooks pandas plotting python python-3-6 python3 wine wine-quality

Last synced: 3 months ago
JSON representation

Host: GitHub
URL: https://github.com/simonneutert/wine_quality_data
Owner: simonneutert
Created: 2017-10-06T11:01:49.000Z (about 7 years ago)
Default Branch: master
Last Pushed: 2017-12-15T12:16:26.000Z (about 7 years ago)
Last Synced: 2024-03-15T20:06:24.384Z (10 months ago)
Topics: data-mining, data-science, data-visualization, jupyter, jupyter-notebooks, pandas, plotting, python, python-3-6, python3, wine, wine-quality
Language: Jupyter Notebook
Size: 5.17 MB
Stars: 3
Watchers: 3
Forks: 5
Open Issues: 0
Metadata Files:
- Readme: Readme.md

Awesome Lists containing this project

README

# Wine Quality Data Analysis

https://archive.ics.uci.edu/ml/datasets/Wine+Quality

Citation Request:
This dataset is public available for research. The details are described in [Cortez et al., 2009].
Please include this citation if you plan to use this database:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

1. Title: Wine Quality

2. Sources
Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009

3. Past Usage:

In the above reference, two datasets were created, using red and white wine samples.
The inputs include objective tests (e.g. PH values) and the output is based on sensory data
(median of at least 3 evaluations made by wine experts). Each expert graded the wine quality
between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model
these datasets under a regression approach. The support vector machine model achieved the
best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T),
etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity
analysis procedure).

4. Relevant Information:

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine.
For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].
Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables
are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

These datasets can be viewed as classification or regression tasks.
The classes are ordered and not balanced (e.g. there are munch more normal wines than
excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent
or poor wines. Also, we are not sure if all input variables are relevant. So
it could be interesting to test feature selection methods.

5. Number of Instances: red wine - 1599; white wine - 4898.

6. Number of Attributes: 11 + output attribute

Note: several of the attributes may be correlated, thus it makes sense to apply some sort of
feature selection.

7. Attribute information:

For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)

8. Missing Attribute Values: None

# content of the current directory:

```python
% ls
```

Readme.md winequality-red.csv
Wine Quality Analysis.ipynb winequality-white.csv
Wine+Quality+Analysis.html winequality.names.txt

```python
# import libs
%matplotlib inline
import pandas as pd
import seaborn as sns; sns.set(style="whitegrid", palette="muted")
import numpy as np
import matplotlib.pyplot as plt
```

### Create DataFrames for white and red wines

```python
white_wine_df = pd.read_csv('winequality-white.csv', sep=";")
red_wine_df = pd.read_csv('winequality-red.csv', sep=";")
```

### DataFrames for red and white wines combined

```python
ww = white_wine_df.loc[:]
ww["color"] = "white"
rw = red_wine_df.loc[:]
rw["color"] = "red"
wine_df = pd.concat([ww, rw], ignore_index=True)
```

# Data

```python
white_wine_df.head()
```

fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol
quality
color

0
7.0
0.27
0.36
20.7
0.045
45.0
170.0
1.0010
3.00
0.45
8.8
6
white

1
6.3
0.30
0.34
1.6
0.049
14.0
132.0
0.9940
3.30
0.49
9.5
6
white

2
8.1
0.28
0.40
6.9
0.050
30.0
97.0
0.9951
3.26
0.44
10.1
6
white

3
7.2
0.23
0.32
8.5
0.058
47.0
186.0
0.9956
3.19
0.40
9.9
6
white

4
7.2
0.23
0.32
8.5
0.058
47.0
186.0
0.9956
3.19
0.40
9.9
6
white

```python
red_wine_df.head()
```

fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol
quality
color

0
7.4
0.70
0.00
1.9
0.076
11.0
34.0
0.9978
3.51
0.56
9.4
5
red

1
7.8
0.88
0.00
2.6
0.098
25.0
67.0
0.9968
3.20
0.68
9.8
5
red

2
7.8
0.76
0.04
2.3
0.092
15.0
54.0
0.9970
3.26
0.65
9.8
5
red

3
11.2
0.28
0.56
1.9
0.075
17.0
60.0
0.9980
3.16
0.58
9.8
6
red

4
7.4
0.70
0.00
1.9
0.076
11.0
34.0
0.9978
3.51
0.56
9.4
5
red

```python
assert white_wine_df.columns.all() == red_wine_df.columns.all()
",".join(list(white_wine_df.columns))
```

'fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color'

### test for null values and check correct datatypes

```python
assert white_wine_df.notnull().all().all()
white_wine_df.info()
```

RangeIndex: 4898 entries, 0 to 4897
Data columns (total 13 columns):
fixed acidity 4898 non-null float64
volatile acidity 4898 non-null float64
citric acid 4898 non-null float64
residual sugar 4898 non-null float64
chlorides 4898 non-null float64
free sulfur dioxide 4898 non-null float64
total sulfur dioxide 4898 non-null float64
density 4898 non-null float64
pH 4898 non-null float64
sulphates 4898 non-null float64
alcohol 4898 non-null float64
quality 4898 non-null int64
color 4898 non-null object
dtypes: float64(11), int64(1), object(1)
memory usage: 497.5+ KB

__no null values in white wine dataframe found__

```python
assert red_wine_df.notnull().all().all()
red_wine_df.info()
```

RangeIndex: 1599 entries, 0 to 1598
Data columns (total 13 columns):
fixed acidity 1599 non-null float64
volatile acidity 1599 non-null float64
citric acid 1599 non-null float64
residual sugar 1599 non-null float64
chlorides 1599 non-null float64
free sulfur dioxide 1599 non-null float64
total sulfur dioxide 1599 non-null float64
density 1599 non-null float64
pH 1599 non-null float64
sulphates 1599 non-null float64
alcohol 1599 non-null float64
quality 1599 non-null int64
color 1599 non-null object
dtypes: float64(11), int64(1), object(1)
memory usage: 162.5+ KB

__no null values in red wine dataframe found__

### All datatypes are numeric.

# Build categoricals

```python
# this can be crucial :)
white_wine_df["color"] = white_wine_df["color"].astype("category")
red_wine_df["color"] = red_wine_df["color"].astype("category")
wine_df["color"] = wine_df["color"].astype("category")
```

# Means

### White Wines:

```python
white_wine_df.mean()
```

fixed acidity 6.854788
volatile acidity 0.278241
citric acid 0.334192
residual sugar 6.391415
chlorides 0.045772
free sulfur dioxide 35.308085
total sulfur dioxide 138.360657
density 0.994027
pH 3.188267
sulphates 0.489847
alcohol 10.514267
quality 5.877909
dtype: float64

### Red Wines:

```python
red_wine_df.mean()
```

fixed acidity 8.319637
volatile acidity 0.527821
citric acid 0.270976
residual sugar 2.538806
chlorides 0.087467
free sulfur dioxide 15.874922
total sulfur dioxide 46.467792
density 0.996747
pH 3.311113
sulphates 0.658149
alcohol 10.422983
quality 5.636023
dtype: float64

#### Differences between red and white wine means that are greater than 1.0

```python
mean_diff = white_wine_df.mean() - red_wine_df.mean()
mean_diff_abs = mean_diff.apply(lambda x: abs(x))
mean_diff[mean_diff_abs >= 1.0]
```

fixed acidity -1.464850
residual sugar 3.852609
free sulfur dioxide 19.433163
total sulfur dioxide 91.892865
dtype: float64

# Distribution of Quality

```python
sns.countplot(data=white_wine_df, x="quality")
sns.plt.title("White Wines")
```

![png](images/output_29_1.png)

```python
white_wine_df.quality.describe()
```

count 4898.000000
mean 5.877909
std 0.885639
min 3.000000
25% 5.000000
50% 6.000000
75% 6.000000
max 9.000000
Name: quality, dtype: float64

```python
x = sns.countplot(data=red_wine_df, x="quality")
sns.plt.title("Red Wines")
```

![png](images/output_31_1.png)

```python
red_wine_df.quality.describe()
```

count 1599.000000
mean 5.636023
std 0.807569
min 3.000000
25% 5.000000
50% 6.000000
75% 6.000000
max 8.000000
Name: quality, dtype: float64

# What may be important for a high quality rating?

__To find out, the percentual mean differences for low quality to high quality wines over the total mean are calculated, resulting in percentual changes.__

## For white wines:

```python
x = white_wine_df.groupby(["quality"]).mean()
lower_quals = x.loc[:4].mean()
higher_quals = x.loc[7:].mean()
ww_perc_means = (higher_quals - lower_quals) / white_wine_df.mean() * 100
ww_perc_means
```

alcohol 14.068993
chlorides -38.372538
citric acid 7.758984
density -0.254610
fixed acidity -6.235608
free sulfur dioxide -10.177344
pH 1.934262
quality NaN
residual sugar -8.100321
sulphates 1.999155
total sulfur dioxide -18.439304
volatile acidity -27.979143
dtype: float64

### Comparing low quality means to high quality ones, the following attributes differ more than 5 per cent:

```python
ww_perc_means[abs(ww_perc_means) > 5]
```

alcohol 14.068993
chlorides -38.372538
citric acid 7.758984
fixed acidity -6.235608
free sulfur dioxide -10.177344
residual sugar -8.100321
total sulfur dioxide -18.439304
volatile acidity -27.979143
dtype: float64

### Comparing low quality means to high quality ones, the following attributes differ more than 10 per cent:

```python
ww_perc_means[abs(ww_perc_means) > 10]
```

alcohol 14.068993
chlorides -38.372538
free sulfur dioxide -10.177344
total sulfur dioxide -18.439304
volatile acidity -27.979143
dtype: float64

## For red wines:

```python
x = red_wine_df.groupby(["quality"]).mean()
lower_quals = x.loc[:4].mean()
higher_quals = x.loc[7:].mean()
rw_perc_means = (higher_quals - lower_quals) / red_wine_df.mean() * 100
rw_perc_means
```

alcohol 16.023546
chlorides -38.955960
citric acid 77.707371
density -0.134937
fixed acidity 7.811538
free sulfur dioxide 12.783852
pH -3.345302
quality NaN
residual sugar -0.609713
sulphates 26.028988
total sulfur dioxide 7.875629
volatile acidity -71.161438
dtype: float64

### Comparing low quality means to high quality ones, the following attributes differ more than 5 per cent:

```python
rw_perc_means[abs(rw_perc_means) > 5]
```

alcohol 16.023546
chlorides -38.955960
citric acid 77.707371
fixed acidity 7.811538
free sulfur dioxide 12.783852
sulphates 26.028988
total sulfur dioxide 7.875629
volatile acidity -71.161438
dtype: float64

### Comparing low quality means to high quality ones, the following attributes differ more than 10 per cent:

```python
rw_perc_means[abs(rw_perc_means) > 10]
```

alcohol 16.023546
chlorides -38.955960
citric acid 77.707371
free sulfur dioxide 12.783852
sulphates 26.028988
volatile acidity -71.161438
dtype: float64

# What will be taken a closer look at:

* Alcohol
* Chlorides
* Citric Acid
* Sulphates
* Sulfur Dioxides
* Volatile Acidity

# Sulfur Dioxides and Quality

### White Wines

```python
fig, (ax1, ax2) = plt.subplots(1,2)
fig.set_size_inches(14.5, 4.5)
fig.dpi = 300
sns.stripplot(data=white_wine_df, x="quality", y="total sulfur dioxide", jitter=True, ax=ax1)
sns.stripplot(data=white_wine_df, x="quality", y="free sulfur dioxide", jitter=True, ax=ax2)
```

![png](images/output_49_1.png)

```python
high_qual_ww_tsd_mean = white_wine_df[white_wine_df["quality"] >= 7]["total sulfur dioxide"].mean()
high_qual_ww_tsd_mean = format(high_qual_ww_tsd_mean, '.1f')
print(f"The mean for higher quality white wines (quality >= 7) is {high_qual_ww_tsd_mean}")
```

The mean for higher quality white wines (quality >= 7) is 125.2

### Interpretation White Wines

Both plots show, that higher quality white wines tend to have less total sulfur dioxide in it.

### red wine

```python
fig, (ax1, ax2) = plt.subplots(1,2)
fig.set_size_inches(14.5, 4.5)
fig.dpi = 300
sns.stripplot(data=red_wine_df, x="quality", y="total sulfur dioxide", jitter=True, ax=ax1)
sns.stripplot(data=red_wine_df, x="quality", y="free sulfur dioxide", jitter=True, ax=ax2)
```

![png](images/output_53_1.png)

```python
high_qual_rw_tsd_mean = red_wine_df[red_wine_df["quality"] >= 7]["total sulfur dioxide"].mean()
high_qual_rw_tsd_mean = format(high_qual_rw_tsd_mean, '.1f')
print(f"The mean for higher quality red wines (quality >= 7) is {high_qual_rw_tsd_mean}")
```

The mean for higher quality red wines (quality >= 7) is 34.9

## Interpretation Red Wines

For the red wines, there are much lower concentrations of sulfur dioxides. Additionally, there seems to be no direct correlation between sulfur dioxide concentration and percepted quality.

# Conclusion: Sulfur Dioxides and Quality

Regarding high quality white wines (>= 7), those wines have a mean of sulfur dioxides of around 125. Respectively high quality Red Wines (>=7) have a mean concentration of sulfur dioxide of 35.

# Sulphates and Quality

```python
sns.stripplot(data=wine_df, x="quality", y="sulphates", jitter=True, hue="color", split=True)
```

![png](images/output_58_1.png)

## Alcohol in Wine

### White Wine

```python
white_wine_df.groupby("quality")["alcohol"].describe()
```

count
mean
std
min
25%
50%
75%
max

quality

3
20.0
10.345000
1.224089
8.0
9.55
10.45
11.00
12.6

4
163.0
10.152454
1.003217
8.4
9.40
10.10
10.75
13.5

5
1457.0
9.808840
0.847065
8.0
9.20
9.50
10.30
13.6

6
2198.0
10.575372
1.147776
8.5
9.60
10.50
11.40
14.0

7
880.0
11.367936
1.246536
8.6
10.60
11.40
12.30
14.2

8
175.0
11.636000
1.280138
8.5
11.00
12.00
12.60
14.0

9
5.0
12.180000
1.013410
10.4
12.40
12.50
12.70
12.9

### Red Wine

```python
red_wine_df.groupby("quality")["alcohol"].describe()
```

count
mean
std
min
25%
50%
75%
max

quality

3
10.0
9.955000
0.818009
8.4
9.725
9.925
10.575
11.0

4
53.0
10.265094
0.934776
9.0
9.600
10.000
11.000
13.1

5
681.0
9.899706
0.736521
8.5
9.400
9.700
10.200
14.9

6
638.0
10.629519
1.049639
8.4
9.800
10.500
11.300
14.0

7
199.0
11.465913
0.961933
9.2
10.800
11.500
12.100
14.0

8
18.0
12.094444
1.224011
9.8
11.325
12.150
12.875
14.0

### Plotting Alcohol to Quality

```python
sns.barplot(data=wine_df, x="quality", y="alcohol", hue="color")
```

![png](images/output_65_1.png)

```python
sns.lmplot(data=wine_df, x="quality", y="alcohol", hue="color")
```

![png](images/output_66_1.png)

#### Alcohol to Quality relation for Wines equal or greater than 7

```python
hq_wines = wine_df[wine_df.quality >= 7]
sns.lmplot(data=hq_wines, x="quality", y="alcohol", hue="color")
sns.plt.title("Quality >= 7")
```

![png](images/output_68_1.png)

## Heatmap Alcohol to Quality

```python
heat_table = wine_df[["quality", "alcohol"]].copy()
heat_table["alcohol"] = heat_table.alcohol.apply(func=lambda x: round(x * 2) / 2)
heat_table = heat_table.groupby(["quality", "alcohol"])["alcohol"].count().reset_index(name='counts')
sns.heatmap(heat_table.pivot("quality", "alcohol", "counts"))
```

![png](images/output_70_1.png)

# Chlorides

```python
sns.barplot(data=wine_df, hue="color", x="quality", y="chlorides")
```

![png](images/output_72_1.png)

__The less chlorides in a wine the higher the quality.__

## Chlorides and Alcohol

```python
g = sns.PairGrid(wine_df[["alcohol", "chlorides", "quality"]], hue="quality")
g = g.map_diag(plt.hist)
g = g.map_offdiag(plt.scatter)
g = g.add_legend()
```

![png](images/output_75_0.png)

## Acids

```python
sns.pairplot(white_wine_df[["volatile acidity", "citric acid", "quality"]], hue="quality")
```

![png](images/output_77_1.png)

## Bringing the relevant attributes together

```python
sns.pairplot(white_wine_df[["volatile acidity", "citric acid", "quality", "free sulfur dioxide", "chlorides"]], hue="quality")
```

![png](images/output_79_1.png)

```python
white_wine_df[["volatile acidity", "citric acid", "quality"]].groupby("quality").describe(percentiles=[])
```

citric acid
volatile acidity

count
mean
std
min
50%
max
count
mean
std
min
50%
max

quality

3
20.0
0.336000
0.081460
0.21
0.345
0.47
20.0
0.333250
0.140827
0.17
0.26
0.640

4
163.0
0.304233
0.163857
0.00
0.290
0.88
163.0
0.381227
0.173463
0.11
0.32
1.100

5
1457.0
0.337653
0.140814
0.00
0.320
1.00
1457.0
0.302011
0.100066
0.10
0.28
0.905

6
2198.0
0.338025
0.119325
0.00
0.320
1.66
2198.0
0.260564
0.088142
0.08
0.25
0.965

7
880.0
0.325625
0.079183
0.01
0.310
0.74
880.0
0.262767
0.091106
0.08
0.25
0.760

8
175.0
0.326514
0.085439
0.04
0.320
0.74
175.0
0.277400
0.108029
0.12
0.26
0.660

9
5.0
0.386000
0.082037
0.29
0.360
0.49
5.0
0.298000
0.057619
0.24
0.27
0.360

```python
white_wine_df[["quality", "free sulfur dioxide", "chlorides"]].groupby("quality").describe(percentiles=[])
```

chlorides
free sulfur dioxide

count
mean
std
min
50%
max
count
mean
std
min
50%
max

quality

3
20.0
0.054300
0.046468
0.022
0.041
0.244
20.0
53.325000
69.420776
5.0
33.5
289.0

4
163.0
0.050098
0.025888
0.013
0.046
0.290
163.0
23.358896
20.391349
3.0
18.0
138.5

5
1457.0
0.051546
0.026496
0.009
0.047
0.346
1457.0
36.432052
18.145991
2.0
35.0
131.0

6
2198.0
0.045217
0.020453
0.015
0.043
0.255
2198.0
35.650591
15.735679
3.0
34.0
112.0

7
880.0
0.038191
0.010697
0.012
0.037
0.135
880.0
34.125568
13.244737
5.0
33.0
108.0

8
175.0
0.038314
0.013164
0.014
0.036
0.121
175.0
36.720000
16.203675
6.0
35.0
105.0

9
5.0
0.027400
0.007436
0.018
0.031
0.035
5.0
33.400000
13.427584
24.0
28.0
57.0

# Final Conclusion

no attribute alone is strong enough to define a high quality wine, but as the figures show. For a wine to score high, having the acids and sulfur dioxide values all within in a certain range can help.

## Best vs. Worst

when comparing the best (8,9) vs. worst (3,4) we can see that they well overlap each other.

```python
qual3 = white_wine_df[white_wine_df["quality"] == 3]
qual4 = white_wine_df[white_wine_df["quality"] == 4].copy()
qual4.quality = 3
qual8 = white_wine_df[white_wine_df["quality"] == 8]
qual9 = white_wine_df[white_wine_df["quality"] == 9].copy()
qual9.quality = 8
white_wine_sample = pd.concat([qual3, qual4, qual8, qual9], ignore_index=True)
white_wine_sample
sns.pairplot(white_wine_sample[["volatile acidity", "citric acid", "quality"]], hue="quality")
```

![png](images/output_83_1.png)

```python
sns.pairplot(white_wine_sample[["volatile acidity", "citric acid", "quality", "sulphates", "chlorides"]], hue="quality")
```

![png](images/output_84_1.png)

# Seems like labs can't measure a wine's inner spirit (yet).

__But if you have to pick a wine only based on specs, i would suggest white wines close to this values:__

```python
qual8[["quality", "chlorides", "alcohol", "citric acid", "sulphates"]].describe(percentiles=[])
```

quality
chlorides
alcohol
citric acid
sulphates

count
175.0
175.000000
175.000000
175.000000
175.000000

mean
8.0
0.038314
11.636000
0.326514
0.486229

std
0.0
0.013164
1.280138
0.085439
0.147073

min
8.0
0.014000
8.500000
0.040000
0.250000

50%
8.0
0.036000
12.000000
0.320000
0.460000

max
8.0
0.121000
14.000000
0.740000
0.950000

__and red wines close to this values:__

```python
rqual8 = red_wine_df[red_wine_df["quality"] >= 8]
rqual8[["quality", "chlorides", "alcohol", "citric acid", "sulphates"]].describe(percentiles=[])
```

quality
chlorides
alcohol
citric acid
sulphates

count
18.0
18.000000
18.000000
18.000000
18.000000

mean
8.0
0.068444
12.094444
0.391111
0.767778

std
0.0
0.011678
1.224011
0.199526
0.115379

min
8.0
0.044000
9.800000
0.030000
0.630000

50%
8.0
0.070500
12.150000
0.420000
0.740000

max
8.0
0.086000
14.000000
0.720000
1.100000