1 Introduction

The aim of this project is to extract interesting knowledge from a dataset of white and red wine features. To do so, I will apply statistical tests together with graphical and tabular tools and look for findings that are interesting from a statistical point of view.

2 Presentation of the data

First of all, we load the libraries we need and read both datasets:

library(tidyverse) # we will use tidyverse quite often here...
white <- readr::read_delim("winequality-white.csv", 
                    delim = ";", escape_double = FALSE, trim_ws = TRUE, show_col_types = FALSE)

red <- readr::read_delim("winequality-red.csv", 
                  delim = ";", escape_double = FALSE, trim_ws = TRUE, show_col_types = FALSE)

red %>% head() %>% knitr::kable(caption = "Red wine features (first 6 rows)")
Red wine features (first 6 rows)
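| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 7.4 | 0.66 | 0.00 | 1.8 | 0.075 | 13 | 40 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |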
white %>% head() %>% knitr::kable(caption = "White wine features (first 6 rows)")
White wine features (first 6 rows)
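| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45 | 170 | 1.0010 | 3.00 | 0.45 | 8.8 | 6 |
| 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14 | 132 | 0.9940 | 3.30 | 0.49 | 9.5 | 6 |
| 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
| 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
| 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |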

In both tables, all columns are numerical, so the whole analysis deals with numeric data only.
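The column types can also be checked programmatically instead of by eye. The following is a minimal sketch on a small stand-in data frame; `toy` is a hypothetical name introduced here, and with the real data one would pass `red` or `white` instead.

```r
# Hypothetical stand-in for `red` / `white`, using two of the real column names.
toy <- data.frame(
  `fixed acidity` = c(7.4, 7.8, 11.2),
  quality = c(5L, 5L, 6L),
  check.names = FALSE
)

# One logical value per column: TRUE if the column is integer or double.
col_is_numeric <- vapply(toy, is.numeric, logical(1))

# TRUE only if every column is numeric.
all_numeric <- all(col_is_numeric)

print(col_is_numeric)
print(all_numeric)
```

`vapply()` is used rather than `sapply()` so that the result is guaranteed to be a logical vector.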

3 Statistical analysis & visualisation of the data

3.1 Mean and median values for both white and red wine features. Comparison of features between white and red wine

We could have called summary() on every feature for each wine type, but the resulting table would have been hard to read. Instead, I built the table manually: I merged the white and red wine tables into one and calculated the mean and median of every feature per wine type.

wine <- bind_rows(
  white %>% mutate(type = "white") %>% relocate(type), 
  red %>% mutate(type = "red") %>% relocate(type)
  )

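# Per-type mean of every feature, rounded to 2 digits.
# (A lambda is used because passing extra arguments through across() is deprecated in dplyr >= 1.1.0.)
wine_mean <- wine %>% 
  group_by(type) %>% 
  summarise(across(everything(), mean)) %>% 
  mutate(across(-type, \(x) round(x, digits = 2)))

# Per-type median of every feature, rounded to 2 digits.
wine_median <- wine %>% 
  group_by(type) %>% 
  summarise(across(everything(), median)) %>% 
  mutate(across(-type, \(x) round(x, digits = 2)))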

wine_summary <- bind_rows(
  wine_mean %>% mutate(func = "mean") %>% relocate(func, .after = type), 
  wine_median %>% mutate(func = "median") %>% relocate(func, .after = type)
) %>% 
  arrange(desc(type))
wine_summary %>% knitr::kable(caption = "Mean and median values for both white and red types of wine")
Mean and median values for both white and red types of wine
| type | func | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|:---|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| white | mean | 6.85 | 0.28 | 0.33 | 6.39 | 0.05 | 35.31 | 138.36 | 0.99 | 3.19 | 0.49 | 10.51 | 5.88 |
| white | median | 6.80 | 0.26 | 0.32 | 5.20 | 0.04 | 34.00 | 134.00 | 0.99 | 3.18 | 0.47 | 10.40 | 6.00 |
| red | mean | 8.32 | 0.53 | 0.27 | 2.54 | 0.09 | 15.87 | 46.47 | 1.00 | 3.31 | 0.66 | 10.42 | 5.64 |
| red | median | 7.90 | 0.52 | 0.26 | 2.20 | 0.08 | 14.00 | 38.00 | 1.00 | 3.31 | 0.62 | 10.20 | 6.00 |
wine_summary_longer <- wine_summary %>% 
  pivot_longer(-c(type, func), names_to = "feature", values_to = "value")

The table is wide and therefore hard to read (the summary() output would have been even larger). So let’s present the same values as bar charts, split across three charts and using the means only for now:

wine_summary_longer %>% 
  filter(func == "mean" & feature %in% c(
    "total sulfur dioxide", "free sulfur dioxide"
    )) %>%
  ggplot(aes(x = feature, y = value, fill = type)) +
  geom_bar(stat = "identity", position = "dodge") + 
  geom_text(aes(label = value), fontface = "bold", vjust = 1.5,
            position = position_dodge(.9), size = 4) +
  labs(x = "Feature", y = "Value", fill = "Type of wine", title = "Mean values for wine features, part 1") +
  theme(
    plot.title = element_text(hjust = 0.5), 
    axis.title.x = element_text(face="bold", size = 12),
    axis.title.y = element_text(face="bold", size = 12),
    legend.title = element_text(face="bold", size = 10)
    )

wine_summary_longer %>% 
  filter(func == "mean" & feature %in% c(
    "alcohol", "fixed acidity", "residual sugar", "quality", "pH"
    )) %>%
  ggplot(aes(x = feature, y = value, fill = type)) +
  geom_bar(stat = "identity", position = "dodge") + 
  geom_text(aes(label = value), fontface = "bold", vjust = 1.5,
            position = position_dodge(.9), size = 4) +
  labs(x = "Feature", y = "Value", fill = "Type of wine", title = "Mean values for wine features, part 2") +
  theme(
    plot.title = element_text(hjust = 0.5), 
    axis.title.x = element_text(face="bold", size = 12),
    axis.title.y = element_text(face="bold", size = 12),
    legend.title = element_text(face="bold", size = 10)
    )

wine_summary_longer %>% 
  filter(func == "mean" & feature %in% c(
    "density", "sulphates", "volatile acidity", "citric acid", "chlorides"
    )) %>%
  ggplot(aes(x = feature, y = value, fill = type)) +
  geom_bar(stat = "identity", position = "dodge") + 
  geom_text(aes(label = value), fontface = "bold", vjust = 1.5,
            position = position_dodge(.9), size = 4) +
  labs(x = "Feature", y = "Value", fill = "Type of wine", title = "Mean values for wine features, part 3") +
  theme(
    plot.title = element_text(hjust = 0.5), 
    axis.title.x = element_text(face="bold", size = 12),
    axis.title.y = element_text(face="bold", size = 12),
    legend.title = element_text(face="bold", size = 10)
    )
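The three chart definitions above differ only in the list of features and the title, so they could be produced by one helper. The following is only a sketch: `plot_feature_means()` and `toy_long` are hypothetical names introduced here, and `geom_col()` is used as the equivalent of `geom_bar(stat = "identity")`.

```r
library(ggplot2)

# Hypothetical helper: dodged bar chart of mean values for the chosen features.
# `summary_long` must contain the columns type, func, feature and value
# (the same layout as wine_summary_longer).
plot_feature_means <- function(summary_long, features, title) {
  keep <- summary_long$func == "mean" & summary_long$feature %in% features
  ggplot(summary_long[keep, ], aes(x = feature, y = value, fill = type)) +
    geom_col(position = "dodge") +
    geom_text(aes(label = value), fontface = "bold", vjust = 1.5,
              position = position_dodge(0.9), size = 4) +
    labs(x = "Feature", y = "Value", fill = "Type of wine", title = title) +
    theme(
      plot.title = element_text(hjust = 0.5),
      axis.title.x = element_text(face = "bold", size = 12),
      axis.title.y = element_text(face = "bold", size = 12),
      legend.title = element_text(face = "bold", size = 10)
    )
}

# Toy input built from the mean values reported in the summary table.
toy_long <- data.frame(
  type = c("white", "white", "red", "red"),
  func = "mean",
  feature = c("free sulfur dioxide", "total sulfur dioxide",
              "free sulfur dioxide", "total sulfur dioxide"),
  value = c(35.31, 138.36, 15.87, 46.47)
)

p <- plot_feature_means(
  toy_long,
  features = c("total sulfur dioxide", "free sulfur dioxide"),
  title = "Mean values for wine features, part 1"
)
```

With the real data, the same function could be called three times on `wine_summary_longer`, once per feature group.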

3.1.1 Observation

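Let’s point out the features whose means differ noticeably between white and red wines (we can assume these features are what actually distinguishes the two types). The ratios are computed from the rounded means in the table above:

  • free sulfur dioxide (about 2.2 times higher in white)

  • total sulfur dioxide (about 3 times higher in white)

  • fixed acidity (about 1.2 times higher in red)

  • residual sugar (about 2.5 times higher in white)

  • chlorides (just under 2 times higher in red)

  • citric acid (about 1.2 times higher in white)

  • sulphates (about 1.3 times higher in red)

  • volatile acidity (about 1.9 times higher in red)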
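The "times" figures in this list are ratios of the two group means. As a rough sketch, they can be recomputed from the rounded means in the summary table of Section 3.1; the vector names below are introduced here for illustration only.

```r
# Mean values copied from the summary table in Section 3.1 (rounded to 2 digits).
white_mean <- c(
  "free sulfur dioxide" = 35.31, "total sulfur dioxide" = 138.36,
  "fixed acidity" = 6.85, "residual sugar" = 6.39,
  "chlorides" = 0.05, "citric acid" = 0.33,
  "sulphates" = 0.49, "volatile acidity" = 0.28
)
red_mean <- c(
  "free sulfur dioxide" = 15.87, "total sulfur dioxide" = 46.47,
  "fixed acidity" = 8.32, "residual sugar" = 2.54,
  "chlorides" = 0.09, "citric acid" = 0.27,
  "sulphates" = 0.66, "volatile acidity" = 0.53
)

# Ratio of the larger mean to the smaller one, and which type has the larger mean.
ratio <- pmax(white_mean, red_mean) / pmin(white_mean, red_mean)
higher <- ifelse(white_mean > red_mean, "white", "red")

result <- data.frame(
  feature = names(white_mean),
  higher = unname(higher),
  ratio = round(unname(ratio), 2)
)

# Largest relative differences first.
print(result[order(-result$ratio), ])
```

Because the inputs are rounded to two digits, the ratios for small-valued features such as chlorides are only approximate.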

3.2 Correlation between wine QUALITY and other features

Let’s check the correlation between wine quality and the other wine features. We will use the Pearson correlation coefficient, since we only want to measure linear relationships between two interval variables rather than general monotonic ones.

Firstly, we will check the correlation for the white wine:

white_features <- colnames(white)
white_features <- white_features[! white_features %in% c("quality")] # remove quality from other features vector

features_cor_col_names <- c("feature", "cor")
white_features_cor <- data.frame(matrix(nrow = 0, ncol = length(features_cor_col_names)))
colnames(white_features_cor) <- features_cor_col_names

for (feature in white_features) {
  cor_result <- cor(white$quality, white[[feature]], method = "pearson")
  white_features_cor[nrow(white_features_cor) + 1, ] <- list(feature, cor_result)
}

white_features_cor %>% 
  arrange(cor) %>% 
  knitr::kable(caption = "Correlation between white wine QUALITY and each other feature")
Correlation between white wine QUALITY and each other feature
| feature | cor |
|:---|---:|
| density | -0.3071233 |
| chlorides | -0.2099344 |
| volatile acidity | -0.1947230 |
| total sulfur dioxide | -0.1747372 |
| fixed acidity | -0.1136628 |
| residual sugar | -0.0975768 |
| citric acid | -0.0092091 |
| free sulfur dioxide | 0.0081581 |
| sulphates | 0.0536779 |
| pH | 0.0994272 |
| alcohol | 0.4355747 |

The table above shows that white wine quality does correlate with some of the other features. The two strongest correlations are:

  • density (moderate negative correlation)

  • alcohol (moderate positive correlation)
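The explicit loop above can also be written without growing a data frame row by row, because `cor()` accepts a whole data frame as its first argument. The sketch below uses synthetic data; `toy` and its values are made up for illustration and are not the wine data.

```r
# Hypothetical toy data standing in for `white`.
set.seed(42)
toy <- data.frame(
  alcohol = rnorm(50, mean = 10.5, sd = 1.2),
  density = rnorm(50, mean = 0.994, sd = 0.003)
)
toy$quality <- round(6 + 0.5 * (toy$alcohol - 10.5) + rnorm(50, sd = 0.7))

features <- setdiff(colnames(toy), "quality")

# cor(x, y) with a data frame x returns one coefficient per column of x
# (as a one-column matrix); [, 1] turns it into a named vector.
cor_vec <- cor(toy[features], toy$quality, method = "pearson")[, 1]

cor_table <- data.frame(feature = names(cor_vec), cor = unname(cor_vec))
cor_table <- cor_table[order(cor_table$cor), ]
print(cor_table)

# The same coefficients computed one feature at a time, as in the loop.
loop_vec <- vapply(features, function(f) cor(toy$quality, toy[[f]]), numeric(1))
```

Both approaches give the same coefficients; the vectorised form simply avoids the bookkeeping of appending rows.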

Now, let’s do the same for the red wine:

red_features <- colnames(red)
red_features <- red_features[! red_features %in% c("quality")] # remove quality from other features vector

features_cor_col_names <- c("feature", "cor")
red_features_cor <- data.frame(matrix(nrow = 0, ncol = length(features_cor_col_names)))
colnames(red_features_cor) <- features_cor_col_names

for (feature in red_features) {
  cor_result <- cor(red$quality, red[[feature]], method = "pearson")
  red_features_cor[nrow(red_features_cor) + 1, ] <- list(feature, cor_result)
}

red_features_cor %>% 
  arrange(cor) %>% 
  knitr::kable(caption = "Correlation between red wine QUALITY and each other feature")
Correlation between red wine QUALITY and each other feature
| feature | cor |
|:---|---:|
| volatile acidity | -0.3905578 |
| total sulfur dioxide | -0.1851003 |
| density | -0.1749192 |
| chlorides | -0.1289066 |
| pH | -0.0577314 |
| free sulfur dioxide | -0.0506561 |
| residual sugar | 0.0137316 |
| fixed acidity | 0.1240516 |
| citric acid | 0.2263725 |
| sulphates | 0.2513971 |
| alcohol | 0.4761663 |

The new table shows that red wine quality also correlates with some of the other features. The two strongest correlations are:

  • volatile acidity (moderate negative correlation; not among the strongest correlations for white wine)

  • alcohol (moderate positive correlation; also present for white wine)

Interestingly, for red wine the correlation between quality and density is much weaker than it was for white wine. So white and red wines share some features that correlate with quality (alcohol), while other correlated features are specific to one type (density for white, volatile acidity for red).

3.3 Distribution of wines with regard to some features

3.3.1 Quality

3.3.1.1 White wine

ggplot(white, aes(x = quality)) +
  geom_histogram(fill = "#FBB143", binwidth = 0.5) +
  ggtitle("White wine: Quality distribution")

For white wine, the most frequent quality scores are 5, 6 and 7 (the observed minimum is 3 and the maximum is 9). The distribution is slightly asymmetric to the right.

3.3.1.2 Red wine

ggplot(red, aes(x = quality)) +
  geom_histogram(fill = "#F041AF", binwidth = 0.5) +
  ggtitle("Red wine: Quality distribution")

For the red wine, the most frequent quality scores are 5 and 6 (the observed minimum is 3 and the maximum is 8). A score of 7 occurs moderately often. The distribution is slightly asymmetric to the left.

3.3.2 Fixed acidity

3.3.2.1 White wine

ggplot(white, aes(x = `fixed acidity`)) +
  geom_histogram(fill = "#FBB143", binwidth = 0.25) +
  ggtitle("White wine: Fixed acidity distribution")

3.3.2.2 Red wine

ggplot(red, aes(x = `fixed acidity`)) +
  geom_histogram(fill = "#F041AF", binwidth = 0.25) +
  ggtitle("Red wine: Fixed acidity distribution")

For the white wine, the most frequent fixed acidity values range from approximately 6 to 9. The distribution is symmetric.

For the red wine, they range from 6 to 10. The distribution is asymmetric to the right.

3.3.3 Volatile acidity

3.3.3.1 White wine

ggplot(white, aes(x = `volatile acidity`)) +
  geom_histogram(fill = "#FBB143", binwidth = 0.125 / 2) +
  ggtitle("White wine: Volatile acidity distribution")

3.3.3.2 Red wine

ggplot(red, aes(x = `volatile acidity`)) +
  geom_histogram(fill = "#F041AF", binwidth = 0.125 / 2) +
  ggtitle("Red wine: Volatile acidity distribution")

For the white wine, the most frequent volatile acidity values range from 0.15 to 0.45 (again, approximately; this also applies to all subsequent observations). The distribution is slightly asymmetric to the right.

For the red wine, they range from 0.2 to 0.8. The distribution is asymmetric to the right.

3.3.4 Citric acid

3.3.4.1 White wine

ggplot(white, aes(x = `citric acid`)) +
  geom_histogram(fill = "#FBB143", binwidth = 0.125 / 2) +
  ggtitle("White wine: Citric acid distribution")

3.3.4.2 Red wine

ggplot(red, aes(x = `citric acid`)) +
  geom_histogram(fill = "#F041AF", binwidth = 0.125 / 2) +
  ggtitle("Red wine: Citric acid distribution")

For the white wine, the most frequent citric acid values range from 0.25 to 0.5. The distribution is slightly asymmetric to the right.

For the red wine, they range from 0 to 0.5. The distribution is asymmetric to the right.

3.3.5 Residual sugar

3.3.5.1 White wine

ggplot(white, aes(x = `residual sugar`)) +
  geom_histogram(fill = "#FBB143", binwidth = 1) +
  ggtitle("White wine: Residual sugar distribution")

3.3.5.2 Red wine

ggplot(red, aes(x = `residual sugar`)) +
  geom_histogram(fill = "#F041AF", binwidth = 1) +
  ggtitle("Red wine: Residual sugar distribution")

For the white wine, the most frequent residual sugar values range from 1 to 16. The distribution is asymmetric to the right.

For the red wine, they range from 2 to 3. The distribution is asymmetric to the right.

3.3.6 pH

3.3.6.1 White wine

ggplot(white, aes(x = pH)) +
  geom_histogram(fill = "#FBB143", binwidth = 0.125) +
  ggtitle("White wine: pH distribution")

3.3.6.2 Red wine

ggplot(red, aes(x = pH)) +
  geom_histogram(fill = "#F041AF", binwidth = 0.125) +
  ggtitle("Red wine: pH distribution")

For the white wine, the most frequent pH values range from 3 to 3.4. The distribution is symmetric.

For the red wine, they range from 3.1 to 3.5. The distribution is symmetric.

3.3.7 Alcohol

3.3.7.1 White wine

ggplot(white, aes(x = alcohol)) +
  geom_histogram(fill = "#FBB143", binwidth = 0.125) +
  ggtitle("White wine: Alcohol distribution")

3.3.7.2 Red wine

ggplot(red, aes(x = alcohol)) +
  geom_histogram(fill = "#F041AF", binwidth = 0.125) +
  ggtitle("Red wine: Alcohol distribution")

For the white wine, the most frequent alcohol values range from 9 to 13. The distribution is asymmetric to the right.

For the red wine, they range from 9.5 to 12.5. The distribution is asymmetric to the right.

3.3.8 Observation

The majority of the distributions are asymmetric to the right. A few are symmetric (fixed acidity for white wine, pH for both types), and one is slightly asymmetric to the left (red wine quality).
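The asymmetry was judged from the histograms by eye. One possible way to put a number on it is the moment-based sample skewness, sketched below. `sample_skewness()` is a hypothetical helper defined here (not a base R function), and the input vectors are made-up values rather than wine data.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
# Positive: longer right tail. Negative: longer left tail. Zero: symmetric sample.
sample_skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

right_tailed <- c(1, 1, 1, 2, 2, 3, 10)  # hypothetical values with a long right tail
symmetric    <- c(1, 2, 3, 4, 5)         # hypothetical symmetric values

skew_right <- sample_skewness(right_tailed)
skew_sym   <- sample_skewness(symmetric)

print(skew_right)
print(skew_sym)
```

With the real data, the function could be applied to a column such as `white$alcohol` to complement the visual impression.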

3.4 Normality checking

In this section, we will check graphically (with Q-Q plots) whether selected distributions are normal.

Let’s test four distributions: two that look normal (the pH distributions from Section 3.3.6) and two that do not (the residual sugar distributions from Section 3.3.5), and see whether the plots confirm our expectations.

3.4.1 Normality of pH distribution

qqnorm(red$pH)
qqline(red$pH)

qqnorm(white$pH)
qqline(white$pH)

Looking at the graphs above, both distributions look normal: the points deviate very little from the reference line.

So, the pH distribution in both wine types can be treated as approximately normal.

3.4.2 Normality of residual sugar distribution

qqnorm(white$`residual sugar`)
qqline(white$`residual sugar`)

qqnorm(red$`residual sugar`)
qqline(red$`residual sugar`)

Looking at the graphs above, neither distribution looks normal: the points deviate strongly from the reference line.

So, the residual sugar distribution is clearly not normal in either wine type.
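A Q-Q plot compares sample quantiles with the quantiles expected under a normal distribution, so "close to the line" can be summarised as a correlation between the two sets of quantiles. The sketch below illustrates this on synthetic samples; `qq_cor()` is a hypothetical helper defined here, and the samples are simulated, not taken from the wine data.

```r
set.seed(1)
normal_sample <- rnorm(500)            # hypothetical, drawn from a normal distribution
skewed_sample <- rexp(500, rate = 1)   # hypothetical, strongly skewed to the right

# qqnorm(plot.it = FALSE) returns the theoretical quantiles (x) and the
# sample values (y) without drawing anything; their correlation is close
# to 1 when the points lie near a straight line.
qq_cor <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}

r_normal <- qq_cor(normal_sample)
r_skewed <- qq_cor(skewed_sample)

print(c(normal = r_normal, skewed = r_skewed))
```

This is only a descriptive summary of the plot, not a formal normality test.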

3.5 Testing hypotheses

3.5.1 Check if two features from red and white wines are the same

Let’s take another look at the bar charts of means from Section 3.1. The mean alcohol content is almost the same in red and white wines, so we might assume that alcohol does not differ between the two types. However, comparing two sample means by eye is not enough, so it is better to check this with a statistical test. At the same time, the mean total sulfur dioxide clearly differs between red and white wines, so we will test it as well, to confirm that difference.

To do this, we will use a two-sided two-sample t-test. Note that t.test() with its default var.equal = FALSE runs Welch’s t-test, which does not assume equal variances in the two groups; this is why the output is titled "Welch Two Sample t-test". For both tests, we will use the same hypotheses:

  • Null hypothesis: means are equal

  • Alternative hypothesis: means differ
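To make clear what `t.test()` computes by default for two independent samples, here is a sketch that reproduces the Welch statistic, the Welch-Satterthwaite degrees of freedom and the two-sided p-value by hand. The samples `x` and `y` are synthetic and purely illustrative.

```r
set.seed(7)
x <- rnorm(40, mean = 10.4, sd = 1.1)  # hypothetical sample 1
y <- rnorm(60, mean = 10.5, sd = 1.2)  # hypothetical sample 2

# Squared standard errors of the two sample means.
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch t statistic: difference in means over the combined standard error.
t_manual <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite approximation of the degrees of freedom.
df_manual <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value from the t distribution.
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# Reference result from R's built-in test (Welch by default).
fit <- t.test(x, y, alternative = "two.sided", mu = 0, paired = FALSE)

print(c(t = t_manual, df = df_manual, p = p_manual))
print(fit)
```

The non-integer degrees of freedom in the test outputs (for example 3100.5) come from this approximation.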

3.5.1.1 Alcohol

t.test(
  red$alcohol, white$alcohol,
  alternative = "two.sided",
  mu = 0, # hypothesised difference in means under the null hypothesis (no difference)
  paired = FALSE, # the samples are independent, so FALSE
  conf.level = 0.95 # corresponds to alpha = 0.05 (the most common choice)
  )
## 
##  Welch Two Sample t-test
## 
## data:  red$alcohol and white$alcohol
## t = -2.859, df = 3100.5, p-value = 0.004278
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.15388669 -0.02868117
## sample estimates:
## mean of x mean of y 
##  10.42298  10.51427

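The p-value (0.004) is not extremely small, but it is still below our alpha of 0.05, so we reject the null hypothesis in favour of the alternative: the mean alcohol content differs between red and white wines. I expected the means to be equal, and the difference is indeed small in absolute terms (10.42 vs 10.51, with a 95% confidence interval for the difference of roughly -0.15 to -0.03). With samples this large (the test reports about 3100 degrees of freedom), even a small difference in means can be statistically significant.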
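To see how large the alcohol difference is in practical terms, it can be expressed relative to the mean itself. The sketch below uses only the numbers printed in the `t.test()` output above; the variable names are introduced here for illustration.

```r
# Sample means and 95% confidence interval copied from the t.test() output.
mean_red   <- 10.42298
mean_white <- 10.51427
ci_diff    <- c(lower = -0.15388669, upper = -0.02868117)  # red minus white

# Absolute and relative difference in means (relative to the white wine mean).
diff_abs <- mean_red - mean_white
diff_rel <- diff_abs / mean_white

# The confidence interval expressed on the same relative scale.
ci_rel <- ci_diff / mean_white

# Percentages, rounded for display.
print(round(100 * c(difference = diff_rel, ci_rel), 2))
```

A difference of under one percent of the mean is consistent with the two bars looking almost identical in the chart.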

Now, let’s do the same for total sulfur dioxide.

3.5.1.2 Total sulfur dioxide

t.test(
  red$`total sulfur dioxide`, white$`total sulfur dioxide`,
  alternative = "two.sided",
  mu = 0, # hypothesised difference in means under the null hypothesis (no difference)
  paired = FALSE, # the samples are independent, so FALSE
  conf.level = 0.95 # corresponds to alpha = 0.05 (the most common choice)
  )
## 
##  Welch Two Sample t-test
## 
## data:  red$`total sulfur dioxide` and white$`total sulfur dioxide`
## t = -89.872, df = 3477, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -93.89760 -89.88813
## sample estimates:
## mean of x mean of y 
##  46.46779 138.36066

As before, the p-value is below alpha, and this time it is practically 0 (< 2.2e-16). So we reject the null hypothesis in favour of the alternative: the mean total sulfur dioxide differs between red and white wines. This was expected from the beginning.

4 Conclusion

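I have checked all the properties that interested me the most. Here is a short summary of them:

  • The means of several features differ clearly between white and red wines, most of all total and free sulfur dioxide, residual sugar, chlorides and volatile acidity (Section 3.1).

  • Alcohol has the strongest positive correlation with quality in both wine types; the strongest negative correlations are with density for white wine and volatile acidity for red wine (Section 3.2).

  • Most feature distributions are asymmetric to the right; the pH distributions look approximately normal, while the residual sugar distributions do not (Sections 3.3 and 3.4).

  • Both mean alcohol and mean total sulfur dioxide differ significantly between red and white wines, although the alcohol difference is small (Section 3.5).

I have indeed gained some interesting knowledge from the analysis of this data, so I am satisfied with the results.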