The goal of this project is to extract interesting knowledge from a dataset describing the features of white and red wines. To do so, I will apply statistical tools together with graphical and tabular ones, looking for findings that are interesting from a statistical point of view.
First of all, let's load the libraries and the data:
library(tidyverse) # we will use tidyverse quite often here...
white <- readr::read_delim("winequality-white.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE, show_col_types = FALSE)
red <- readr::read_delim("winequality-red.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE, show_col_types = FALSE)
red %>% head() %>% knitr::kable(caption = "Red wine features (first 6 rows)")
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|
7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
7.4 | 0.66 | 0.00 | 1.8 | 0.075 | 13 | 40 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
white %>% head() %>% knitr::kable(caption = "White wine features (first 6 rows)")
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|
7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45 | 170 | 1.0010 | 3.00 | 0.45 | 8.8 | 6 |
6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14 | 132 | 0.9940 | 3.30 | 0.49 | 9.5 | 6 |
8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
We can see that in both tables all the columns are numerical, so we will be dealing exclusively with numbers.
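Rather than relying on a visual check of the first rows, the column types can also be verified programmatically. The following is a minimal sketch in base R; `red_sample` is a hypothetical stand-in built from two rows of the red wine table above, not the full dataset.

```r
# Toy stand-in for the real data: two rows copied from the red wine table above,
# restricted to three columns for brevity.
red_sample <- data.frame(
  `fixed acidity` = c(7.4, 7.8),
  pH = c(3.51, 3.20),
  quality = c(5, 5),
  check.names = FALSE  # keep the space in "fixed acidity"
)

# TRUE/FALSE per column: is the column numeric?
numeric_cols <- vapply(red_sample, is.numeric, logical(1))
print(numeric_cols)

# Single yes/no answer for the whole table
all_numeric <- all(numeric_cols)
print(all_numeric)
```

Under the same assumptions, the check should carry over to the full `red` and `white` tables unchanged.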
We could have used the summary() function on all the features of each wine type, but the resulting table would not have looked nice. For this reason, I decided to build the table manually: I merged the white and red wines into one table and calculated the mean and median of every feature for each type.
wine <- bind_rows(
white %>% mutate(type = "white") %>% relocate(type),
red %>% mutate(type = "red") %>% relocate(type)
)
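# Mean of every feature per wine type, rounded to 2 digits
wine_mean <- wine %>%
  group_by(type) %>%
  summarise(across(everything(), mean)) %>%
  mutate(across(-type, ~ round(.x, digits = 2)))
# Median of every feature per wine type, rounded to 2 digits
wine_median <- wine %>%
  group_by(type) %>%
  summarise(across(everything(), median)) %>%
  mutate(across(-type, ~ round(.x, digits = 2)))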
wine_summary <- bind_rows(
wine_mean %>% mutate(func = "mean") %>% relocate(func, .after = type),
wine_median %>% mutate(func = "median") %>% relocate(func, .after = type)
) %>%
arrange(desc(type))
wine_summary %>% knitr::kable(caption = "Mean and median values for both white and red types of wine")
type | func | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
white | mean | 6.85 | 0.28 | 0.33 | 6.39 | 0.05 | 35.31 | 138.36 | 0.99 | 3.19 | 0.49 | 10.51 | 5.88 |
white | median | 6.80 | 0.26 | 0.32 | 5.20 | 0.04 | 34.00 | 134.00 | 0.99 | 3.18 | 0.47 | 10.40 | 6.00 |
red | mean | 8.32 | 0.53 | 0.27 | 2.54 | 0.09 | 15.87 | 46.47 | 1.00 | 3.31 | 0.66 | 10.42 | 5.64 |
red | median | 7.90 | 0.52 | 0.26 | 2.20 | 0.08 | 14.00 | 38.00 | 1.00 | 3.31 | 0.62 | 10.20 | 6.00 |
wine_summary_longer <- wine_summary %>%
pivot_longer(-c(type, func), names_to = "feature", values_to = "value")
I don’t like this table: it is large and therefore hard to read (imagine what would have happened had I used summary()). So, let’s present the same data as bar charts (for now, I will use the means only):
wine_summary_longer %>%
filter(func == "mean" & feature %in% c(
"total sulfur dioxide", "free sulfur dioxide"
)) %>%
ggplot(aes(x = feature, y = value, fill = type)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = value), fontface = "bold", vjust = 1.5,
position = position_dodge(.9), size = 4) +
labs(x = "Feature", y = "Value", fill = "Type of wine", title = "Mean values for wine features, part 1") +
theme(
plot.title = element_text(hjust = 0.5),
axis.title.x = element_text(face="bold", size = 12),
axis.title.y = element_text(face="bold", size = 12),
legend.title = element_text(face="bold", size = 10)
)
wine_summary_longer %>%
filter(func == "mean" & feature %in% c(
"alcohol", "fixed acidity", "residual sugar", "quality", "pH"
)) %>%
ggplot(aes(x = feature, y = value, fill = type)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = value), fontface = "bold", vjust = 1.5,
position = position_dodge(.9), size = 4) +
labs(x = "Feature", y = "Value", fill = "Type of wine", title = "Mean values for wine features, part 2") +
theme(
plot.title = element_text(hjust = 0.5),
axis.title.x = element_text(face="bold", size = 12),
axis.title.y = element_text(face="bold", size = 12),
legend.title = element_text(face="bold", size = 10)
)
wine_summary_longer %>%
filter(func == "mean" & feature %in% c(
"density", "sulphates", "volatile acidity", "citric acid", "chlorides"
)) %>%
ggplot(aes(x = feature, y = value, fill = type)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = value), fontface = "bold", vjust = 1.5,
position = position_dodge(.9), size = 4) +
labs(x = "Feature", y = "Value", fill = "Type of wine", title = "Mean values for wine features, part 3") +
theme(
plot.title = element_text(hjust = 0.5),
axis.title.x = element_text(face="bold", size = 12),
axis.title.y = element_text(face="bold", size = 12),
legend.title = element_text(face="bold", size = 10)
)
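The three charts above differ only in the list of features and in the title, so the repeated plotting code could be wrapped in a function. The following is a sketch under that assumption; `plot_feature_means` and `toy_summary` are hypothetical names introduced here for illustration, and the toy input reuses four means from the summary table.

```r
library(ggplot2)

# Hypothetical helper: dodged bar chart of mean values for the chosen features.
# `summary_long` is expected to have the columns type, func, feature and value.
plot_feature_means <- function(summary_long, features, title) {
  plot_data <- summary_long[
    summary_long$func == "mean" & summary_long$feature %in% features,
  ]
  ggplot(plot_data, aes(x = feature, y = value, fill = type)) +
    geom_col(position = "dodge") +  # same as geom_bar(stat = "identity")
    geom_text(aes(label = value), fontface = "bold", vjust = 1.5,
              position = position_dodge(0.9), size = 4) +
    labs(x = "Feature", y = "Value", fill = "Type of wine", title = title) +
    theme(
      plot.title = element_text(hjust = 0.5),
      axis.title = element_text(face = "bold", size = 12),
      legend.title = element_text(face = "bold", size = 10)
    )
}

# Toy input with four of the means from the summary table above
toy_summary <- data.frame(
  type = c("white", "red", "white", "red"),
  func = "mean",
  feature = c("alcohol", "alcohol", "pH", "pH"),
  value = c(10.51, 10.42, 3.19, 3.31)
)

p <- plot_feature_means(toy_summary, c("alcohol", "pH"),
                        "Mean values for wine features (toy example)")
print(p)
```

With such a helper, each of the three charts would reduce to a single call on `wine_summary_longer` with its own feature list and title.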
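Let’s point out the features whose means differ noticeably between white and red wines (we can assume these are the features that actually set the two types apart):
free sulfur dioxide (more than 2 times higher in white)
total sulfur dioxide (about 3 times higher in white)
fixed acidity (about 1.2 times higher in red)
residual sugar (about 2.5 times higher in white)
chlorides (just under 2 times higher in red)
citric acid (about 1.2 times higher in white)
sulphates (about 1.3 times higher in red)
volatile acidity (just under 2 times higher in red)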
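The "times" figures in this list are ratios of the group means. As a sketch of that arithmetic, the snippet below recomputes a few of them from the rounded means in the summary table; the vector and column names (`white_mean`, `red_mean`, `ratio_table`) are introduced here for illustration only.

```r
# Means copied from the summary table above (already rounded to 2 digits there)
white_mean <- c(`free sulfur dioxide` = 35.31, `total sulfur dioxide` = 138.36,
                `residual sugar` = 6.39, `fixed acidity` = 6.85)
red_mean <- c(`free sulfur dioxide` = 15.87, `total sulfur dioxide` = 46.47,
              `residual sugar` = 2.54, `fixed acidity` = 8.32)

# Ratio of the larger mean to the smaller one, so every ratio is >= 1
mean_ratio <- pmax(white_mean, red_mean) / pmin(white_mean, red_mean)
higher_in <- ifelse(white_mean > red_mean, "white", "red")

ratio_table <- data.frame(
  feature = names(white_mean),
  higher_in = unname(higher_in),
  ratio = round(unname(mean_ratio), 2)
)
print(ratio_table)
```

Because the inputs are rounded, ratios for features with very small means (e.g. chlorides) would be less precise than ratios computed from the raw data.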
Let’s check the correlation between wine quality and the other wine features. We will use the Pearson correlation coefficient, since we only want to measure the linear relationship between two numerical variables, not a general monotonic one (for which Spearman’s coefficient would be used).
Firstly, we will check the correlation for the white wine:
white_features <- colnames(white)
white_features <- white_features[! white_features %in% c("quality")] # remove quality from other features vector
features_cor_col_names <- c("feature", "cor")
white_features_cor <- data.frame(matrix(nrow = 0, ncol = length(features_cor_col_names)))
colnames(white_features_cor) <- features_cor_col_names
for (feature in white_features) {
cor_result <- cor(white$quality, white[[feature]], method = "pearson")
white_features_cor[nrow(white_features_cor) + 1, ] <- list(feature, cor_result)
}
white_features_cor %>%
arrange(cor) %>%
knitr::kable(caption = "Correlation between white wine QUALITY and each other feature")
feature | cor |
---|---|
density | -0.3071233 |
chlorides | -0.2099344 |
volatile acidity | -0.1947230 |
total sulfur dioxide | -0.1747372 |
fixed acidity | -0.1136628 |
residual sugar | -0.0975768 |
citric acid | -0.0092091 |
free sulfur dioxide | 0.0081581 |
sulphates | 0.0536779 |
pH | 0.0994272 |
alcohol | 0.4355747 |
In the table above, we can see that white wine quality is indeed correlated with some of the other features. These are:
density (moderate negative correlation)
alcohol (moderate positive correlation)
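The loop above builds the table one row at a time. Since `cor()` also accepts a whole data frame of features, the same table can be sketched without a loop. `cor_with_quality` is a hypothetical helper, and `toy_wine` contains made-up values for illustration only, not the real data.

```r
# Toy stand-in for the real data (hypothetical values, for illustration only)
toy_wine <- data.frame(
  alcohol = c(9.4, 9.8, 10.1, 11.2, 12.0, 12.8),
  density = c(0.9978, 0.9968, 0.9970, 0.9951, 0.9940, 0.9932),
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.30, 3.19),
  quality = c(5, 5, 6, 6, 7, 7)
)

# Hypothetical helper: Pearson correlation of every feature with `quality`
cor_with_quality <- function(df) {
  features <- setdiff(colnames(df), "quality")
  # cor() accepts a data frame of features and returns a one-column matrix
  cors <- cor(df[features], df$quality, method = "pearson")
  result <- data.frame(feature = features, cor = as.numeric(cors))
  result[order(result$cor), ]  # sort ascending, like arrange(cor)
}

toy_cor <- cor_with_quality(toy_wine)
print(toy_cor)
```

The same function could then be called once for `white` and once for `red`, which would also remove the duplicated loop code.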
Now, let’s do the same for the red wine:
red_features <- colnames(red)
red_features <- red_features[! red_features %in% c("quality")] # remove quality from other features vector
features_cor_col_names <- c("feature", "cor")
red_features_cor <- data.frame(matrix(nrow = 0, ncol = length(features_cor_col_names)))
colnames(red_features_cor) <- features_cor_col_names
for (feature in red_features) {
cor_result <- cor(red$quality, red[[feature]], method = "pearson")
red_features_cor[nrow(red_features_cor) + 1, ] <- list(feature, cor_result)
}
red_features_cor %>%
arrange(cor) %>%
knitr::kable(caption = "Correlation between red wine QUALITY and each other feature")
feature | cor |
---|---|
volatile acidity | -0.3905578 |
total sulfur dioxide | -0.1851003 |
density | -0.1749192 |
chlorides | -0.1289066 |
pH | -0.0577314 |
free sulfur dioxide | -0.0506561 |
residual sugar | 0.0137316 |
fixed acidity | 0.1240516 |
citric acid | 0.2263725 |
sulphates | 0.2513971 |
alcohol | 0.4761663 |
In the new table, we can see that red wine quality is correlated with some of the other features as well. These are:
volatile acidity (moderate negative correlation; new, it did not stand out in white wines)
alcohol (moderate positive correlation; also present in white wines)
Interestingly, red wine quality shows no comparably strong correlation with density, unlike white wine quality. So, some features correlate with quality in both wine types (alcohol), while others do so in only one of them (density in white, volatile acidity in red).
ggplot(white, aes(x = quality)) +
geom_histogram(fill = "#FBB143", binwidth = 0.5) +
ggtitle("White wine: Quality distribution")
For the white wine, the most frequent quality scores are 5, 6 and 7 (3 is the minimum and 9 is the maximum). The distribution is slightly asymmetric to the right.
ggplot(red, aes(x = quality)) +
geom_histogram(fill = "#F041AF", binwidth = 0.5) +
ggtitle("Red wine: Quality distribution")
For the red wine, the most frequent quality scores are 5 and 6 (3 is the minimum and 8 is the maximum). Quality 7 is moderately frequent. The distribution is slightly asymmetric to the left.
ggplot(white, aes(x = `fixed acidity`)) +
geom_histogram(fill = "#FBB143", binwidth = 0.25) +
ggtitle("White wine: Fixed acidity distribution")
ggplot(red, aes(x = `fixed acidity`)) +
geom_histogram(fill = "#F041AF", binwidth = 0.25) +
ggtitle("Red wine: Fixed acidity distribution")
So, for the white wine, the most frequent fixed acidity values range from (approx.) 6 to 9. The distribution is symmetric.
For the red wine, they range from 6 to 10. The distribution is asymmetric to the right.
ggplot(white, aes(x = `volatile acidity`)) +
geom_histogram(fill = "#FBB143", binwidth = 0.125 / 2) +
ggtitle("White wine: Volatile acidity distribution")
ggplot(red, aes(x = `volatile acidity`)) +
geom_histogram(fill = "#F041AF", binwidth = 0.125 / 2) +
ggtitle("Red wine: Volatile acidity distribution")
For the white wine, the most frequent volatile acidity values range from 0.15 to 0.45 (again, approximately; this also applies to all subsequent observations). The distribution is slightly asymmetric to the right.
For the red wine, they range from 0.2 to 0.8. The distribution is asymmetric to the right.
ggplot(white, aes(x = `citric acid`)) +
geom_histogram(fill = "#FBB143", binwidth = 0.125 / 2) +
ggtitle("White wine: Citric acid distribution")
ggplot(red, aes(x = `citric acid`)) +
geom_histogram(fill = "#F041AF", binwidth = 0.125 / 2) +
ggtitle("Red wine: Citric acid distribution")
For the white wine, the most frequent citric acid values range from 0.25 to 0.5. The distribution is slightly asymmetric to the right.
For the red wine, they range from 0 to 0.5. The distribution is asymmetric to the right.
ggplot(white, aes(x = `residual sugar`)) +
geom_histogram(fill = "#FBB143", binwidth = 1) +
ggtitle("White wine: Residual sugar distribution")
ggplot(red, aes(x = `residual sugar`)) +
geom_histogram(fill = "#F041AF", binwidth = 1) +
ggtitle("Red wine: Residual sugar distribution")
For the white wine, the most frequent residual sugar values range from 1 to 16. The distribution is asymmetric to the right.
For the red wine, they range from 2 to 3. The distribution is asymmetric to the right.
ggplot(white, aes(x = pH)) +
geom_histogram(fill = "#FBB143", binwidth = 0.125) +
ggtitle("White wine: pH distribution")
ggplot(red, aes(x = pH)) +
geom_histogram(fill = "#F041AF", binwidth = 0.125) +
ggtitle("Red wine: pH distribution")
For the white wine, the most frequent pH values range from 3 to 3.4. The distribution is symmetric.
For the red wine, they range from 3.1 to 3.5. The distribution is symmetric.
ggplot(white, aes(x = alcohol)) +
geom_histogram(fill = "#FBB143", binwidth = 0.125) +
ggtitle("White wine: Alcohol distribution")
ggplot(red, aes(x = alcohol)) +
geom_histogram(fill = "#F041AF", binwidth = 0.125) +
ggtitle("Red wine: Alcohol distribution")
For the white wine, the most frequent alcohol values range from 9 to 13. The distribution is asymmetric to the right.
For the red wine, they range from 9.5 to 12.5. The distribution is asymmetric to the right.
Most of the distributions are asymmetric to the right. There were also some symmetric distributions and some asymmetric to the left.
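The asymmetry judged by eye from the histograms can also be quantified with the sample skewness. The following is a minimal base R sketch of the moment-based estimator; `sample_skewness` is a hypothetical helper, and the data are simulated stand-ins rather than the wine features.

```r
# Hypothetical helper: moment-based sample skewness.
# Positive values suggest a longer right tail, negative values a longer left tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  m2 <- mean(centered^2)  # second central moment
  m3 <- mean(centered^3)  # third central moment
  m3 / m2^(3 / 2)
}

# Simulated stand-ins (not the wine data)
set.seed(42)
right_skewed <- rexp(1000)   # exponential: long right tail
symmetric    <- rnorm(1000)  # normal: symmetric

skew_right <- sample_skewness(right_skewed)
skew_sym   <- sample_skewness(symmetric)
print(c(right_skewed = skew_right, symmetric = skew_sym))
```

Applied to a wine feature, e.g. `sample_skewness(white$alcohol)`, a clearly positive value would support the "asymmetric to the right" reading of the histogram.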
In this section, we will check graphically whether some of the distributions are normal.
Let’s test four distributions: two that look normal (the pH distributions, see 3.3.6 pH) and two that do not (the residual sugar distributions, see 3.3.5 Residual sugar), and see whether our expectations hold.
qqnorm(red$pH)
qqline(red$pH)
qqnorm(white$pH)
qqline(white$pH)
Looking at the graphs above, we can see that both distributions look normal: the points deviate only slightly from the line.
So, the pH distribution in both wine types is indeed approximately normal.
qqnorm(white$`residual sugar`)
qqline(white$`residual sugar`)
qqnorm(red$`residual sugar`)
qqline(red$`residual sugar`)
Looking at the graphs above, we can see that neither distribution looks normal: the points deviate strongly from the line.
So, the residual sugar distribution is indeed NOT normal in either wine type.
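A Q-Q plot is a visual check, and it could be complemented with a formal test such as Shapiro-Wilk. The sketch below runs `shapiro.test()` on simulated stand-ins (the means, standard deviations and rates are assumed values, not estimates from the wine data). Keep in mind that with thousands of observations this test tends to reject normality even for small deviations, so it complements the plot rather than replaces it.

```r
# Simulated stand-ins (not the wine data)
set.seed(1)
normal_like  <- rnorm(500, mean = 3.2, sd = 0.15)  # roughly pH-like scale (assumed values)
right_skewed <- rexp(500, rate = 0.4)              # long right tail

# shapiro.test() accepts between 3 and 5000 non-missing values
sw_normal <- shapiro.test(normal_like)
sw_skewed <- shapiro.test(right_skewed)

# A small p-value is evidence against normality;
# the W statistic is closer to 1 for normal-looking data
print(c(normal_like = sw_normal$p.value, right_skewed = sw_skewed$p.value))
print(c(normal_like = unname(sw_normal$statistic),
        right_skewed = unname(sw_skewed$statistic)))
```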
Let’s take a look at the mean graphs from Section 3.1. We can notice that, e.g., the mean alcohol value is almost the same in red and white wines, so we might assume that alcohol is the same in both wine types. But a similar mean alone is not enough, so it is better to check this with a statistical test. At the same time, we can see that, e.g., total sulfur dioxide clearly differs between red and white wines, so we will test it as well (to show that it differs between the wine types).
For this, we will use a two-sided two-sample t-test. Note that R’s t.test() does not assume equal variances by default, so it runs Welch’s version of the test rather than the classic Student’s t-test. For both tests, we will use the same hypotheses:
Null hypothesis: means are equal
Alternative hypothesis: means differ
t.test(
red$alcohol, white$alcohol,
alternative = "two.sided",
mu = 0, # hypothesized difference in means under the null hypothesis
paired = FALSE, # samples are independent, so FALSE
conf.level = 0.95 # alpha = 0.05 (most common alpha)
)
##
## Welch Two Sample t-test
##
## data: red$alcohol and white$alcohol
## t = -2.859, df = 3100.5, p-value = 0.004278
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.15388669 -0.02868117
## sample estimates:
## mean of x mean of y
## 10.42298 10.51427
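We can see that the p-value (0.004) is not extremely close to 0, but it is still lower than our alpha (0.05), so we reject the null hypothesis in favor of the alternative one: the mean alcohol value differs between red and white wines. To be honest, I expected the means to be equal, but we cannot argue with the statistical test. Note, though, that the difference itself is small: according to the 95% confidence interval, the mean for red wines is between about 0.03 and 0.15 lower than for white ones. With samples this large, even a small difference can be statistically significant.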
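To see where the `t`, `df` and `p-value` in the output above come from, Welch's statistic can be computed by hand and compared with `t.test()`. The following is a sketch on simulated stand-ins; `welch_t` is a hypothetical helper, and the sample sizes, means and standard deviations are assumed values, not taken from the wine data.

```r
# Hypothetical helper: Welch's t statistic, degrees of freedom and two-sided p-value
welch_t <- function(x, y) {
  se2_x <- var(x) / length(x)  # squared standard error of mean(x)
  se2_y <- var(y) / length(y)  # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)
  # Welch-Satterthwaite approximation of the degrees of freedom
  df <- (se2_x + se2_y)^2 /
    (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))
  p_value <- 2 * pt(-abs(t_stat), df)  # two-sided p-value
  c(t = t_stat, df = df, p_value = p_value)
}

# Simulated stand-ins (not the wine data)
set.seed(7)
x <- rnorm(200, mean = 10.4, sd = 1.1)
y <- rnorm(500, mean = 10.5, sd = 1.2)

by_hand  <- welch_t(x, y)
built_in <- t.test(x, y)  # Welch's test is the default (var.equal = FALSE)
print(by_hand)
print(built_in)
```

The formula shows why large samples matter: the standard errors in the denominator shrink as the sample sizes grow, so even a small difference in means can produce a large `t`.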
Now, let’s do the same for total sulfur dioxide.
t.test(
red$`total sulfur dioxide`, white$`total sulfur dioxide`,
alternative = "two.sided",
mu = 0, # hypothesized difference in means under the null hypothesis
paired = FALSE, # samples are independent, so FALSE
conf.level = 0.95 # alpha = 0.05 (most common alpha)
)
##
## Welch Two Sample t-test
##
## data: red$`total sulfur dioxide` and white$`total sulfur dioxide`
## t = -89.872, df = 3477, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -93.89760 -89.88813
## sample estimates:
## mean of x mean of y
## 46.46779 138.36066
This time, the p-value is not just less than alpha but practically 0. So, we reject the null hypothesis in favor of the alternative one: the mean total sulfur dioxide differs between red and white wines. This was to be expected from the beginning.
I have checked all the properties that interested me the most. Here is a short summary of them:
The features that differ the most between red and white wines are free sulfur dioxide, total sulfur dioxide and residual sugar (their means differ by a factor of more than 2).
Regarding the correlation of wine quality with other features, we have seen that alcohol is correlated with quality in both wine types. However, density is correlated with quality in white wine only, while volatile acidity is correlated with it in red wine only.
The majority of feature distributions in both wine types are asymmetric to the right.
pH is approximately normally distributed in both wine types, as expected from the histograms, while residual sugar is not normally distributed in either wine type (again, as expected).
I tested whether alcohol differs between the two wine types. I expected it to be the same, and the means do look almost equal, but the statistical test showed that they actually differ. I also performed the same test on total sulfur dioxide, where I was sure the means would not be the same, and indeed they were not.
I have indeed gained some interesting knowledge from the analysis of this data, so I am satisfied with the results.