December 13, 2015

K-means Clustering Analysis of Red Wine Quality

I have long been a big fan of sweet wine. Over Thanksgiving, I took a road trip to Napa Valley in California. What counts as a good wine differs from person to person, but wine experts can summarize the indicators of good quality.

In June 2014, Business Insider published an article listing three main characteristics of high-quality red wine: complexity, intensity, and balance. In 2009, a dataset created by Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis provided 1,599 red wine samples with 11 physicochemical attributes associated with quality. The data can be downloaded from the UCI (University of California, Irvine) Machine Learning Repository.

I wrote R code in RStudio to run a k-means clustering analysis of red wine quality based on the data provided on the UCI website, and made the radar chart above to summarize what I found from the clustering analysis.

I was very excited to see that the radar chart corresponds to the three main indicators listed by Business Insider.

- Complexity     
Each attribute contributes a different flavor. The high-quality wine ranks lowest in volatile acidity but 3rd highest in citric acid.

- Intensity     

The excellent-quality wine ranks 2nd highest in alcohol. When the alcohol content of the wine is high, it stimulates the senses and sets the tone for the first sip.

- Balance     

No single attribute is extremely dominant in the high-quality wine, while the average-quality wine ranks high in chlorides and sulphates. The high-quality wine is neither extremely sweet nor extremely dry.

K-means Clustering in R:
Step 1: Scale the data
        Free sulfur dioxide ranges from 1 to 72, while citric acid ranges from 0 to 1. We need to scale the data so that the distances used to form the clusters are not dominated by the attributes with the largest ranges.
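
        To make the effect of scaling concrete, here is a minimal sketch. The four rows are made-up values that only mimic the two ranges mentioned above; they are not taken from the wine data. The column names follow what read.csv() would produce for this file.

```r
# Two hypothetical columns on very different ranges (made-up values)
toy <- data.frame(
  free.sulfur.dioxide = c(1, 15, 34, 72),
  citric.acid         = c(0.00, 0.25, 0.50, 1.00)
)

# On the raw data, Euclidean distances are driven almost entirely
# by free.sulfur.dioxide because its values are much larger
raw_dist <- dist(toy)
print(raw_dist)

# scale() centers each column to mean 0 and rescales it to standard deviation 1,
# so both attributes contribute comparably to the distances
toy_scaled  <- scale(toy)
scaled_dist <- dist(toy_scaled)
print(scaled_dist)

# Check the result: column means and standard deviations after scaling
print(round(colMeans(toy_scaled), 10))
print(apply(toy_scaled, 2, sd))
```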

Step 2: Find the ideal number of clusters
        I chose 6 and 8 to see which one is more appropriate for our analysis.
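
        One common way to shortlist candidate values is an elbow plot of the within-cluster sum of squares. The sketch below uses R's built-in iris measurements as a stand-in for the wine data. The seed, the nstart value and the range of k are arbitrary choices of mine, not part of the original analysis.

```r
set.seed(42)                     # arbitrary seed so the random starts repeat
x <- scale(iris[, 1:4])          # stand-in data, NOT the wine data

max_k <- 10                      # assumed upper limit for this illustration
wss <- numeric(max_k)            # pre-allocate the vector before filling it
for (k in 1:max_k) {
  # nstart = 25 keeps the best of 25 random starts, which smooths the curve
  wss[k] <- kmeans(x, centers = k, nstart = 25)$tot.withinss
}

# Look for the "elbow": the k after which the curve flattens out
plot(1:max_k, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within groups sum of squares")
```

        The curve generally keeps falling as k grows, so the elbow only suggests candidates; it does not pick a single answer.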

Step 3: Plot the clusters
        Here I provide the plot for 6 clusters.

Step 4: Validate whether 6 clusters is more accurate than 8
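
        One way to compare two choices of k is the average silhouette width, where values closer to 1 indicate better separated clusters. The sketch below is an assumption-laden illustration: it uses silhouette() from the cluster package rather than the cluster.stats() call in my script, and it runs on the built-in iris data as a stand-in for the wine data. The helper name avg_sil is hypothetical.

```r
library(cluster)                 # provides silhouette()

set.seed(42)                     # arbitrary seed so the random starts repeat
x <- scale(iris[, 1:4])          # stand-in data, NOT the wine data
d <- dist(x)                     # pairwise Euclidean distances

# Hypothetical helper: average silhouette width of a k-means fit with k clusters
avg_sil <- function(k) {
  fit <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(fit$cluster, d)[, "sil_width"])
}

sil6 <- avg_sil(6)
sil8 <- avg_sil(8)
print(c(k6 = sil6, k8 = sil8))   # the larger value suggests the better k
```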

Step 5: Get the mean of each attribute for each group
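
        The per-cluster means are what a radar chart like mine is built from. Here is a minimal sketch of computing them and ranking the clusters on each attribute, again on the built-in iris data as a stand-in. The choice of 3 clusters, the seed and the variable names are assumptions for illustration only.

```r
set.seed(1)                            # arbitrary seed so the random starts repeat
x <- iris[, 1:4]                       # stand-in data, NOT the wine data
fit <- kmeans(scale(x), centers = 3, nstart = 25)

# Mean of each original (unscaled) attribute within each cluster
cluster_means <- aggregate(x, by = list(cluster = fit$cluster), FUN = mean)
print(cluster_means)

# Rank the clusters on each attribute (1 = highest mean), which gives
# statements such as "2nd highest in alcohol" for the wine data
ranks <- apply(-as.matrix(cluster_means[, -1]), 2, rank)
rownames(ranks) <- cluster_means$cluster
print(ranks)
```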

R code:
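library(cluster)  # clusplot()
library(fpc)      # plotcluster(), cluster.stats()

# Load the red wine data (the UCI file is semicolon-separated)
wine <- read.csv("~/Downloads/winequality-red.csv", sep=";")

# Step 1: scale every column to mean 0 and standard deviation 1
wine_scaled <- scale(wine)

# Step 2: within-cluster sum of squares for 1 to 15 clusters
wss <- numeric(15)
for(i in 1:15) wss[i] <- sum(kmeans(wine_scaled, centers=i)$withinss)
plot(1:15, wss, type='b', xlab="Number of Clusters", ylab='Within groups sum of squares')
fit1 <- kmeans(wine_scaled, 6)
fit2 <- kmeans(wine_scaled, 8)

# Step 3: plot the clusters
plotcluster(wine_scaled, fit1$cluster)
plotcluster(wine_scaled, fit2$cluster)
clusplot(wine_scaled, fit1$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

# Step 4: compare the two clusterings on the pairwise distance matrix
d <- dist(wine_scaled)
cluster.stats(d, fit1$cluster, fit2$cluster)

# Step 5: mean of each attribute within each of the 6 clusters
mydata <- data.frame(wine, cluster=fit1$cluster)
aggregate(. ~ cluster, data=mydata, FUN=mean)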