1 Introduction

The aim of this project is to derive interesting knowledge from a dataset of white and red wine features. That is, I will apply statistical tools as well as other tools (e.g. graphical and tabular ones) to find results that are interesting from a statistical point of view.

2 Presentation of the data

First of all, we load the libraries we need and read both datasets:

library(tidyverse) # we will use tidyverse quite often here...

white <- readr::read_delim("winequality-white.csv", 
                    delim = ";", escape_double = FALSE, trim_ws = TRUE, show_col_types = FALSE)

red <- readr::read_delim("winequality-red.csv", 
                  delim = ";", escape_double = FALSE, trim_ws = TRUE, show_col_types = FALSE)

red %>% head() %>% knitr::kable(caption = "Red wine features (first 6 rows)")

Red wine features (first 6 rows)
fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
7.4	0.70	0.00	1.9	0.076	11	34	0.9978	3.51	0.56	9.4	5
7.8	0.88	0.00	2.6	0.098	25	67	0.9968	3.20	0.68	9.8	5
7.8	0.76	0.04	2.3	0.092	15	54	0.9970	3.26	0.65	9.8	5
11.2	0.28	0.56	1.9	0.075	17	60	0.9980	3.16	0.58	9.8	6
7.4	0.70	0.00	1.9	0.076	11	34	0.9978	3.51	0.56	9.4	5
7.4	0.66	0.00	1.8	0.075	13	40	0.9978	3.51	0.56	9.4	5

white %>% head() %>% knitr::kable(caption = "White wine features (first 6 rows)")

White wine features (first 6 rows)
fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
7.0	0.27	0.36	20.7	0.045	45	170	1.0010	3.00	0.45	8.8	6
6.3	0.30	0.34	1.6	0.049	14	132	0.9940	3.30	0.49	9.5	6
8.1	0.28	0.40	6.9	0.050	30	97	0.9951	3.26	0.44	10.1	6
7.2	0.23	0.32	8.5	0.058	47	186	0.9956	3.19	0.40	9.9	6
7.2	0.23	0.32	8.5	0.058	47	186	0.9956	3.19	0.40	9.9	6
8.1	0.28	0.40	6.9	0.050	30	97	0.9951	3.26	0.44	10.1	6

We can see that all the columns in both tables are numerical, so we will be dealing exclusively with numbers.
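Rather than reading the column types off the preview, the same claim can be checked programmatically. Below is a minimal sketch in base R; wine_sample is a hypothetical stand-in built from three rows and three columns of the red wine preview above, so that the snippet runs without the CSV files. With the real data, the same vapply() call could be applied to red and white directly.

```r
# Hypothetical stand-in for the `red` / `white` tables: three rows copied from
# the red-wine preview above, restricted to three columns.
wine_sample <- data.frame(
  `fixed acidity` = c(7.4, 7.8, 11.2),
  pH              = c(3.51, 3.20, 3.16),
  quality         = c(5, 5, 6),
  check.names     = FALSE  # keep the space in "fixed acidity"
)

# One logical value per column: TRUE if R stores the column as a number
numeric_cols <- vapply(wine_sample, is.numeric, logical(1))
print(numeric_cols)

# TRUE only if no column was read as character, factor, etc.
all_numeric <- all(numeric_cols)
print(all_numeric)
```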

3 Statistical analysis & visualisation of the data

3.1 Mean and median values for both white and red wine features. Comparison of features between white and red wine

We could have called summary() on all the features of each wine type, but the resulting table would not have been easy to read. For this reason, I decided to build the table manually: I merged the white and red wines into one table and calculated the mean and median of every feature for each type.

wine <- bind_rows(
  white %>% mutate(type = "white") %>% relocate(type), 
  red %>% mutate(type = "red") %>% relocate(type)
  )

wine_mean <- wine %>% 
  group_by(type) %>% 
  summarise(across(everything(), mean)) %>% 
  mutate(across(-type, ~ round(.x, digits = 2))) # lambda form; extra args via across(...) are deprecated

wine_median <- wine %>% 
  group_by(type) %>% 
  summarise(across(everything(), median)) %>% 
  mutate(across(-type, ~ round(.x, digits = 2)))

wine_summary <- bind_rows(
  wine_mean %>% mutate(func = "mean") %>% relocate(func, .after = type), 
  wine_median %>% mutate(func = "median") %>% relocate(func, .after = type)
) %>% 
  arrange(desc(type))

wine_summary %>% knitr::kable(caption = "Mean and median values for both white and red types of wine")

Mean and median values for both white and red types of wine
type	func	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
white	mean	6.85	0.28	0.33	6.39	0.05	35.31	138.36	0.99	3.19	0.49	10.51	5.88
white	median	6.80	0.26	0.32	5.20	0.04	34.00	134.00	0.99	3.18	0.47	10.40	6.00
red	mean	8.32	0.53	0.27	2.54	0.09	15.87	46.47	1.00	3.31	0.66	10.42	5.64
red	median	7.90	0.52	0.26	2.20	0.08	14.00	38.00	1.00	3.31	0.62	10.20	6.00

wine_summary_longer <- wine_summary %>% 
  pivot_longer(-c(type, func), names_to = "feature", values_to = "value")

I don’t like this table: it is large and therefore hard to read (imagine what summary() would have produced). So, let’s present the same data as bar charts, split into three groups of features with similar magnitudes (for now, I will use the mean only):

wine_summary_longer %>% 
  filter(func == "mean" & feature %in% c(
    "total sulfur dioxide", "free sulfur dioxide"
    )) %>%
  ggplot(aes(x = feature, y = value, fill = type)) +
  geom_bar(stat = "identity", position = "dodge") + 
  geom_text(aes(label = value), fontface = "bold", vjust = 1.5,
            position = position_dodge(.9), size = 4) +
  labs(x = "Feature", y = "Value", fill = "Type of wine", title = "Mean values for wine features, part 1") +
  theme(
    plot.title = element_text(hjust = 0.5), 
    axis.title.x = element_text(face="bold", size = 12),
    axis.title.y = element_text(face="bold", size = 12),
    legend.title = element_text(face="bold", size = 10)
    )

wine_summary_longer %>% 
  filter(func == "mean" & feature %in% c(
    "alcohol", "fixed acidity", "residual sugar", "quality", "pH"
    )) %>%
  ggplot(aes(x = feature, y = value, fill = type)) +
  geom_bar(stat = "identity", position = "dodge") + 
  geom_text(aes(label = value), fontface = "bold", vjust = 1.5,
            position = position_dodge(.9), size = 4) +
  labs(x = "Feature", y = "Value", fill = "Type of wine", title = "Mean values for wine features, part 2") +
  theme(
    plot.title = element_text(hjust = 0.5), 
    axis.title.x = element_text(face="bold", size = 12),
    axis.title.y = element_text(face="bold", size = 12),
    legend.title = element_text(face="bold", size = 10)
    )

wine_summary_longer %>% 
  filter(func == "mean" & feature %in% c(
    "density", "sulphates", "volatile acidity", "citric acid", "chlorides"
    )) %>%
  ggplot(aes(x = feature, y = value, fill = type)) +
  geom_bar(stat = "identity", position = "dodge") + 
  geom_text(aes(label = value), fontface = "bold", vjust = 1.5,
            position = position_dodge(.9), size = 4) +
  labs(x = "Feature", y = "Value", fill = "Type of wine", title = "Mean values for wine features, part 3") +
  theme(
    plot.title = element_text(hjust = 0.5), 
    axis.title.x = element_text(face="bold", size = 12),
    axis.title.y = element_text(face="bold", size = 12),
    legend.title = element_text(face="bold", size = 10)
    )

3.1.1 Observation

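Let’s point out the features that differ noticeably between white and red wines (we can assume that these are the features that actually distinguish the two types):

free sulfur dioxide (more than 2 times higher in white)
total sulfur dioxide (about 3 times higher in white)
fixed acidity (about 1.2 times higher in red)
residual sugar (about 2.5 times higher in white)
chlorides (just under 2 times higher in red)
citric acid (about 1.2 times higher in white)
sulphates (about 1.35 times higher in red)
volatile acidity (just under 2 times higher in red)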
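The factors in this list can be reproduced from the means in the summary table. The sketch below copies the rounded means from that table by hand (so the ratios are only as precise as the two-decimal rounding, which matters most for chlorides) and reports, for each feature, which wine type is higher and by what factor. The name mean_ratio is introduced here for illustration.

```r
# Means copied from the "Mean and median values" table above (rounded to 2 dp)
features <- c("free sulfur dioxide", "total sulfur dioxide", "fixed acidity",
              "residual sugar", "chlorides", "citric acid", "sulphates",
              "volatile acidity")
white_mean <- c(35.31, 138.36, 6.85, 6.39, 0.05, 0.33, 0.49, 0.28)
red_mean   <- c(15.87,  46.47, 8.32, 2.54, 0.09, 0.27, 0.66, 0.53)

# Ratio of the larger mean to the smaller one, plus which type is higher
mean_ratio <- data.frame(
  feature = features,
  higher  = ifelse(white_mean > red_mean, "white", "red"),
  ratio   = round(pmax(white_mean, red_mean) / pmin(white_mean, red_mean), 2)
)

# Largest relative difference first
mean_ratio <- mean_ratio[order(-mean_ratio$ratio), ]
print(mean_ratio, row.names = FALSE)
```

Sorting by ratio puts total sulfur dioxide, residual sugar and free sulfur dioxide at the top, which are the three features with a difference of more than 2 times.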

3.2 Correlation between wine QUALITY and other features

Let’s check the correlation between the wine quality and the other wine features. We will use the Pearson correlation coefficient, since we only want to measure linear relationships between two numeric variables and are not interested in more general monotonic relationships.

Firstly, we will check the correlation for the white wine:

white_features <- colnames(white)
white_features <- white_features[! white_features %in% c("quality")] # remove quality from other features vector

features_cor_col_names <- c("feature", "cor")
white_features_cor <- data.frame(matrix(nrow = 0, ncol = length(features_cor_col_names)))
colnames(white_features_cor) <- features_cor_col_names

for (feature in white_features) {
  cor_result <- cor(white$quality, white[[feature]], method = "pearson")
  white_features_cor[nrow(white_features_cor) + 1, ] <- list(feature, cor_result)
}

white_features_cor %>% 
  arrange(cor) %>% 
  knitr::kable(caption = "Correlation between white wine QUALITY and each other feature")

Correlation between white wine QUALITY and each other feature
feature	cor
density	-0.3071233
chlorides	-0.2099344
volatile acidity	-0.1947230
total sulfur dioxide	-0.1747372
fixed acidity	-0.1136628
residual sugar	-0.0975768
citric acid	-0.0092091
free sulfur dioxide	0.0081581
sulphates	0.0536779
pH	0.0994272
alcohol	0.4355747

In the above table, we can see that there are indeed some correlations between the white wine quality and some other features. These features are:

density (moderate negative correlation)
alcohol (moderate positive correlation)

Now, let’s do the same for the red wine:

red_features <- colnames(red)
red_features <- red_features[! red_features %in% c("quality")] # remove quality from other features vector

features_cor_col_names <- c("feature", "cor")
red_features_cor <- data.frame(matrix(nrow = 0, ncol = length(features_cor_col_names)))
colnames(red_features_cor) <- features_cor_col_names

for (feature in red_features) {
  cor_result <- cor(red$quality, red[[feature]], method = "pearson")
  red_features_cor[nrow(red_features_cor) + 1, ] <- list(feature, cor_result)
}

red_features_cor %>% 
  arrange(cor) %>% 
  knitr::kable(caption = "Correlation between red wine QUALITY and each other feature")

Correlation between red wine QUALITY and each other feature
feature	cor
volatile acidity	-0.3905578
total sulfur dioxide	-0.1851003
density	-0.1749192
chlorides	-0.1289066
pH	-0.0577314
free sulfur dioxide	-0.0506561
residual sugar	0.0137316
fixed acidity	0.1240516
citric acid	0.2263725
sulphates	0.2513971
alcohol	0.4761663

In the new table, we can see that there are some correlations between red wine quality and other features as well. These are:

volatile acidity (moderate negative correlation) (new one, was not present in white wines)
alcohol (moderate positive correlation) (was present in white wines as well)

What is interesting is that in the red wines the correlation between quality and density is much weaker (-0.17) than in the white wines (-0.31). So, some features correlate with quality in both white and red wines (alcohol), while others do so in only one of the two types.
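The two loops above do the same work for each table, so they could be collapsed into one small helper. The sketch below is one possible way to do it in base R; quality_correlations is a hypothetical name, and the toy data frame is made up purely so that the snippet runs on its own. It relies on the fact that cor() accepts a whole table of predictors and a single target vector.

```r
# Correlate every column except `target` with `target`, sorted ascending.
# `quality_correlations` is a hypothetical helper name, not part of the report.
quality_correlations <- function(data, target = "quality") {
  predictors <- setdiff(colnames(data), target)
  # cor(x, y) with a table x and a vector y returns one row per column of x
  r <- cor(data[, predictors, drop = FALSE], data[[target]], method = "pearson")
  out <- data.frame(feature = predictors, cor = as.numeric(r))
  out[order(out$cor), ]
}

# Made-up toy data, only to show the call; NOT taken from the wine files
toy <- data.frame(
  alcohol = c(9, 10, 11, 12, 13),
  density = c(1.000, 0.998, 0.997, 0.995, 0.994),
  quality = c(5, 5, 6, 6, 7)
)
toy_cor <- quality_correlations(toy)
print(toy_cor)
```

With the real tables, the calls would be quality_correlations(white) and quality_correlations(red), and either result could be passed to knitr::kable() as before.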

3.3 Distribution of wines with regard to some features

3.3.1 Quality

3.3.1.1 White wine

ggplot(white, aes(x = quality)) +
  geom_histogram(fill = "#FBB143", binwidth = 0.5) +
  ggtitle("White wine: Quality distribution")

For the white wine, the most frequent quality scores are 5, 6 and 7 (the minimum is 3 and the maximum is 9). The distribution is slightly asymmetric to the right.

3.3.1.2 Red wine

ggplot(red, aes(x = quality)) +
  geom_histogram(fill = "#F041AF", binwidth = 0.5) +
  ggtitle("Red wine: Quality distribution")

For the red wine, the most frequent quality scores are 5 and 6 (the minimum is 3 and the maximum is 8). Quality 7 is somewhere between frequent and rare. The distribution is slightly asymmetric to the left.

3.3.2 Fixed acidity

3.3.2.1 White wine

ggplot(white, aes(x = `fixed acidity`)) +
  geom_histogram(fill = "#FBB143", binwidth = 0.25) +
  ggtitle("White wine: Fixed acidity distribution")

3.3.2.2 Red wine

ggplot(red, aes(x = `fixed acidity`)) +
  geom_histogram(fill = "#F041AF", binwidth = 0.25) +
  ggtitle("Red wine: Fixed acidity distribution")

So, for the white wine, the most frequent fixed acidity values range from (approx.) 6 to 9. The distribution is symmetric.

For the red wine, they range from 6 to 10. The distribution is asymmetric to the right.

3.3.3 Volatile acidity

3.3.3.1 White wine

ggplot(white, aes(x = `volatile acidity`)) +
  geom_histogram(fill = "#FBB143", binwidth = 0.125 / 2) +
  ggtitle("White wine: Volatile acidity distribution")

3.3.3.2 Red wine

ggplot(red, aes(x = `volatile acidity`)) +
  geom_histogram(fill = "#F041AF", binwidth = 0.125 / 2) +
  ggtitle("Red wine: Volatile acidity distribution")

For the white wine, the volatile acidity values range from 0.15 to 0.45 (again, approximately; this also applies to all subsequent observations). The distribution is slightly asymmetric to the right.

For the red wine, they range from 0.2 to 0.8. The distribution is asymmetric to the right.

3.3.4 Citric acid

3.3.4.1 White wine

ggplot(white, aes(x = `citric acid`)) +
  geom_histogram(fill = "#FBB143", binwidth = 0.125 / 2) +
  ggtitle("White wine: Citric acid distribution")

3.3.4.2 Red wine

ggplot(red, aes(x = `citric acid`)) +
  geom_histogram(fill = "#F041AF", binwidth = 0.125 / 2) +
  ggtitle("Red wine: Citric acid distribution")

For the white wine, the citric acid values range from 0.25 to 0.5. The distribution is slightly asymmetric to the right.

For the red wine, they range from 0 to 0.5. The distribution is asymmetric to the right.

3.3.5 Residual sugar

3.3.5.1 White wine

ggplot(white, aes(x = `residual sugar`)) +
  geom_histogram(fill = "#FBB143", binwidth = 1) +
  ggtitle("White wine: Residual sugar distribution")

3.3.5.2 Red wine

ggplot(red, aes(x = `residual sugar`)) +
  geom_histogram(fill = "#F041AF", binwidth = 1) +
  ggtitle("Red wine: Residual sugar distribution")

For the white wine, the residual sugar values range from 1 to 16. The distribution is asymmetric to the right.

For the red wine, they range from 2 to 3. The distribution is asymmetric to the right.

3.3.6 pH

3.3.6.1 White wine

ggplot(white, aes(x = pH)) +
  geom_histogram(fill = "#FBB143", binwidth = 0.125) +
  ggtitle("White wine: pH distribution")

3.3.6.2 Red wine

ggplot(red, aes(x = pH)) +
  geom_histogram(fill = "#F041AF", binwidth = 0.125) +
  ggtitle("Red wine: pH distribution")

For the white wine, the pH values range from 3 to 3.4. The distribution is symmetric.

For the red wine, they range from 3.1 to 3.5. The distribution is symmetric.

3.3.7 Alcohol

3.3.7.1 White wine

ggplot(white, aes(x = alcohol)) +
  geom_histogram(fill = "#FBB143", binwidth = 0.125) +
  ggtitle("White wine: Alcohol distribution")

3.3.7.2 Red wine

ggplot(red, aes(x = alcohol)) +
  geom_histogram(fill = "#F041AF", binwidth = 0.125) +
  ggtitle("Red wine: Alcohol distribution")

For the white wine, the alcohol values range from 9 to 13. The distribution is asymmetric to the right.

For the red wine, they range from 9.5 to 12.5. The distribution is asymmetric to the right.

3.3.8 Observation

Most of the distributions are asymmetric to the right. There were also some symmetric distributions (pH in both wine types, fixed acidity in the white wine) and one that is slightly asymmetric to the left (red wine quality).
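The asymmetry described in this section was judged by eye from the histograms. One way to put a number on it is the sample skewness: positive values indicate a longer right tail, negative values a longer left tail, and values near 0 a roughly symmetric shape. The sketch below defines a simple moment-based skewness helper (base R has no built-in one) and applies it to two made-up vectors; the values are illustrative and are not taken from the wine data.

```r
# Moment-based sample skewness: mean cubed deviation divided by sd^3
# (population form, no small-sample correction). `skewness` is defined here.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

right_tailed <- c(1, 1, 2, 2, 3, 10)  # made-up values with a long right tail
symmetric    <- c(1, 2, 3, 4, 5)      # made-up symmetric values

skew_right <- skewness(right_tailed)
skew_sym   <- skewness(symmetric)
print(c(right_tailed = skew_right, symmetric = skew_sym))
```

On the real tables, the same helper could be applied to any single column, for example skewness(white$alcohol) or skewness(red$pH).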

3.4 Normality checking

In this section, we will graphically check whether some of the distributions are normal.

Let’s test four distributions: two that look normal (the pH distributions from Section 3.3.6) and two that do not (the residual sugar distributions from Section 3.3.5), so that we can check our expectations.

3.4.1 Normality of pH distribution

qqnorm(red$pH)
qqline(red$pH)

qqnorm(white$pH)
qqline(white$pH)

Looking at the graphs above, we can see that both distributions look normal (the points deviate very little from the line).

So, the pH distribution in both wine types is indeed approximately normal.

3.4.2 Normality of residual sugar distribution

qqnorm(white$`residual sugar`)
qqline(white$`residual sugar`)

qqnorm(red$`residual sugar`)
qqline(red$`residual sugar`)

Looking at the graphs above, we can see that both distributions do NOT look normal (the points deviate strongly from the line).

So, the residual sugar distribution in both wine types is indeed NOT normal.
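The Q-Q plots give a visual answer only. As a possible numeric complement, base R offers shapiro.test() (the Shapiro-Wilk test), whose null hypothesis is that the sample comes from a normal distribution; it accepts between 3 and 5000 non-missing values. The sketch below runs it on two simulated samples, one bell-shaped and one with a long right tail. These samples are generated with a fixed seed and are not the wine data; they only illustrate how the p-values behave.

```r
set.seed(42)  # reproducible simulated samples; these are NOT the wine data

bell_shaped <- rnorm(500, mean = 3.2, sd = 0.15)  # roughly symmetric sample
right_skew  <- rexp(500, rate = 1 / 5)            # sample with a long right tail

# Null hypothesis of shapiro.test(): the sample comes from a normal distribution
p_bell <- shapiro.test(bell_shaped)$p.value
p_skew <- shapiro.test(right_skew)$p.value
print(c(bell_shaped = p_bell, right_skew = p_skew))
```

A very small p-value, as for the right-skewed sample, speaks against normality. Keep in mind that with thousands of observations the test can flag even tiny deviations, so it is best read together with the Q-Q plot rather than instead of it.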

3.5 Testing hypotheses

3.5.1 Check if two features from red and white wines are the same

Let’s take a look at the bar charts of mean values from Section 3.1. We can notice that, for example, the mean alcohol is almost the same in red and white wines, so we might assume that alcohol does not differ between the two types. However, comparing two means by eye is not enough, so it is better to check with a statistical test whether they are indeed the same. At the same time, we can see that total sulfur dioxide is clearly not the same in red and white wines, so we will test it as well (to show that it differs between the wine types).

For this purpose, we will use a two-sided two-sample t-test. Note that R’s t.test() does not assume equal variances by default, so it runs Welch’s variant of Student’s t-test, as the output below shows. For both tests, we will use the same hypotheses:

Null hypothesis: means are equal
Alternative hypothesis: means differ

3.5.1.1 Alcohol

t.test(
  red$alcohol, white$alcohol,
  alternative = "two.sided",
  mu = 0, # hypothesised difference in means (H0: alcohol means in red and white wines are equal)
  paired = FALSE, # the samples are independent, so FALSE
  conf.level = 0.95 # alpha = 0.05 (the most common alpha)
  )

## 
##  Welch Two Sample t-test
## 
## data:  red$alcohol and white$alcohol
## t = -2.859, df = 3100.5, p-value = 0.004278
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.15388669 -0.02868117
## sample estimates:
## mean of x mean of y 
##  10.42298  10.51427

We can see that the p-value (0.004) is not extremely close to 0, but it is still lower than our alpha (which is 0.05), so we reject the null hypothesis in favour of the alternative one, that is, the mean alcohol differs between red and white wines. To be honest, I expected the means to be equal (after all, the p-value is not that close to 0), but we cannot argue with the statistical test.
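To see how a difference this small can still be significant, it may help to rebuild the confidence interval from the other numbers in the same output. The sketch below uses only values printed above (the two means, t and df) and the relation t = difference / standard error; the variable names are introduced here for illustration. Because the printed values are rounded, the result matches the reported interval only to about four decimal places.

```r
# Numbers copied from the t.test() output above
mean_red   <- 10.42298
mean_white <- 10.51427
t_stat     <- -2.859
dof        <- 3100.5

diff_means <- mean_red - mean_white       # estimated difference (red - white)
std_error  <- diff_means / t_stat         # because t = difference / standard error
half_width <- qt(0.975, dof) * std_error  # margin of a two-sided 95% interval

ci <- c(lower = diff_means - half_width, upper = diff_means + half_width)
print(round(ci, 4))
```

The estimated difference is only about 0.09, but the standard error is about 0.03, so the whole interval lies below 0. This is why the test rejects equality even though the two means look almost identical on the bar chart.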

Now, let’s do the same for total sulfur dioxide.

3.5.1.2 Total sulfur dioxide

t.test(
  red$`total sulfur dioxide`, white$`total sulfur dioxide`,
  alternative = "two.sided",
  mu = 0, # hypothesised difference in means (H0: total sulfur dioxide means in red and white wines are equal)
  paired = FALSE, # the samples are independent, so FALSE
  conf.level = 0.95 # alpha = 0.05 (the most common alpha)
  )

## 
##  Welch Two Sample t-test
## 
## data:  red$`total sulfur dioxide` and white$`total sulfur dioxide`
## t = -89.872, df = 3477, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -93.89760 -89.88813
## sample estimates:
## mean of x mean of y 
##  46.46779 138.36066

As before, the p-value is below alpha, but this time it is also very close to 0. So, we reject the null hypothesis in favour of the alternative one, that is, the mean total sulfur dioxide differs between red and white wines. This was to be expected from the beginning.
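Both outputs above are labelled "Welch Two Sample t-test", because t.test() uses var.equal = FALSE unless told otherwise. To make clear what that variant computes, the sketch below calculates Welch’s t statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with t.test(). The helper name welch_by_hand and the two short vectors are made up for illustration; they are not the wine data.

```r
# Welch's t statistic and Welch-Satterthwaite degrees of freedom, by hand.
# `welch_by_hand` is a hypothetical helper name.
welch_by_hand <- function(x, y) {
  vx <- var(x) / length(x)  # squared standard error of mean(x)
  vy <- var(y) / length(y)  # squared standard error of mean(y)
  t_value <- (mean(x) - mean(y)) / sqrt(vx + vy)
  dof <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  c(t = t_value, df = dof)
}

# Made-up samples with unequal sizes and spreads (NOT the wine data)
x <- c(9.4, 9.8, 9.8, 9.8, 9.4, 10.5)
y <- c(8.8, 9.5, 10.1, 9.9, 11.2, 12.0, 10.4, 9.0)

by_hand  <- welch_by_hand(x, y)
built_in <- t.test(x, y)  # var.equal = FALSE is the default, i.e. Welch's test
print(by_hand)
print(c(t = unname(built_in$statistic), df = unname(built_in$parameter)))
```

The non-integer degrees of freedom in the outputs above (for example df = 3100.5) come from this Welch-Satterthwaite formula; the classic equal-variance Student’s test would report a whole number instead.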

4 Conclusion

I have checked all the properties that interested me the most. Here is a short summary of them:

The features that differ the most between red and white wines are free sulfur dioxide, total sulfur dioxide and residual sugar (a difference of more than 2 times).
Regarding the correlation of wine quality with other features, we have seen that alcohol is correlated with quality in both wine types. However, density is correlated with quality in the white wine only, while volatile acidity is correlated with quality in the red wine only.
The majority of the feature distributions in both wine types are asymmetric to the right.
pH is approximately normally distributed in both wine types, as expected from the histograms, while residual sugar is not normally distributed in either wine type (again, as expected).
I tested whether the mean alcohol differs between the two wine types. Looking at the means, I expected it to be the same, but the statistical test showed that it actually differs. I also performed the same test on total sulfur dioxide, where I was sure the means would not be the same, and indeed they were not.

I have indeed gained some interesting knowledge from the analysis of this data, so I am satisfied with the results.

Statistics - R Project

Oleksandr Babenko

2022-07-25