About the project


A short description of the course

This course, Introduction to Open Data Science, aims to help students understand the principles and advantages of using open research tools with open data, and to understand the possibilities of reproducible research. After completing the course, students should know how to use R, RStudio, R Markdown and GitHub, and know how to learn more about these open software tools. Besides that, and most importantly for me, students will also know how to apply certain statistical methods of data science.

The link to my GitHub repository is:
https://github.com/XiaodongAAA/IODS-project


Regression and model validation


Created at 10:00 12.11.2017
@author: Xiaodong Li
The script for RStudio Exercise 2 – data analysis

Import packages:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(ggplot2)

Step 1: Read data

lrn2014 = read.table('/home/xiaodong/IODS_course/IODS-project/data/learning2014.txt',sep='\t',header = TRUE)

Structure of the data

str(lrn2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

Dimensions of the data

dim(lrn2014)
## [1] 166   7

Data description: According to the structure and dimensions of the data, the data frame contains 7 variables: gender, age, attitude, deep, stra, surf and points, each with 166 observations. gender records whether a respondent is male (M) or female (F). age is the respondent's age in years, derived from the date of birth. The attitude column gives the global attitude toward statistics. The columns deep, surf and stra summarise the questions related to deep, surface and strategic learning; the related questions can be found at http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-meta.txt. The points column lists the exam points from the survey.

Step 2: Explore the data

Plot the relationships between the variables

p <- ggpairs(lrn2014, mapping = aes(col=gender,alpha=0.3))
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The figure shows the relationships between the different variables. From the gender column we can see that there are more female than male respondents. Most of the people surveyed are young, aged around 20 years. The attitudes of men towards statistics are higher than those of women, which suggests that men have a more positive attitude towards statistics. The deep and strategic learning scores are almost the same for men and women, with mean scores around 3 and 4. However, the surface question scores differ between male and female respondents: for women the answers are centred at around 3.0, while for men they are centred at around 2.3. The exam points obtained by male and female respondents are quite similar, peaking at about 23 for both.

Step 3: Multiple regression

reg_model=lm(points~attitude+stra+surf,data=lrn2014)
summary(reg_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The target variable points is fitted against three explanatory variables: attitude, stra and surf. According to the results of the model, surf does not have a statistically significant relationship with the target variable, so surf is removed from the model and points is modelled with attitude and stra again.

Step 4: Model again

reg_model2=lm(points~attitude+stra, data=lrn2014)
summary(reg_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = lrn2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

The model is fitted with the target variable points and two explanatory variables, attitude and stra. According to the summary results, the relationship between these variables is \(points=8.9729+3.4658*attitude+0.9137*stra\).
* The Std. Error is the standard deviation of the sampling distribution of the estimate of the coefficient under the standard regression assumptions.
* The t values are the estimates divided by their standard errors. A t value estimates how extreme the observed coefficient is relative to its standard error.
* Pr(>|t|) is the p-value for the hypothesis for which the t value is the test statistic. It tells you the probability of obtaining a test statistic at least as unusual as the one you observed if the null hypothesis were true (the null hypothesis is usually ‘no effect’, unless something else is specified). So, a very low p-value means the data are counter-indicative of a zero effect.
* Residual standard error represents the standard deviation of the residuals. It’s a measure of how close the fit is to the points.
* The Multiple R-squared is the proportion of the variance in the data that is explained by the model. The more variables you add, the larger this will be. The Adjusted R-squared reduces it to account for the number of variables in the model.
* The F-statistic on the last line tells you whether the regression as a whole performs ‘better than random’; in other words, whether your model fits better than you would expect if none of your predictors had any relationship with the response.
* The p-value in the last row is the p-value for that test, essentially comparing the full model you fitted with an intercept-only model (see the sketch after this list).
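As a sanity check of these numbers, the F-test and the multiple R-squared can be reproduced by hand from the objects fitted above. This is only an illustrative sketch (null_model is a hypothetical intercept-only model introduced here for the comparison):

null_model <- lm(points ~ 1, data = lrn2014)   # intercept-only model
anova(null_model, reg_model2)                  # F-test of the full model against the intercept-only model
# multiple R-squared: proportion of variance explained by the model
1 - sum(residuals(reg_model2)^2) / sum((lrn2014$points - mean(lrn2014$points))^2)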

Step 5: Diagnostic plots

Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage are plotted

par(mfrow=c(2,2))
plot(reg_model2, which=c(1,2,5))

  • The Residuals vs Fitted values plot examines if the errors have constant variance. The graph shows a reasonable constant variance without any pattern.
  • The Normal QQ-plot checks if the errors are normally distributed. The graph shows a very good fit to the line, indicating a normally distributed error set.
  • The Residuals vs Leverage plot shows whether there are any outliers with high leverage. In this case all leverage values are below 0.06, indicating a good model fit.

Conclusion

The data learning_2014 is explored and analysed using a graphical overview, data summaries, multiple regression and diagnostic plots. The relationships between the different variables are examined with a single plot which shows all possible scatter plots of the columns of the learning_2014 data. The exam points variable points is fitted against the combination of attitude and stra and shows a reasonably good linear trend, despite the relatively low R-squared value. The validity of the model is checked by means of residual analysis. The model predicts that the exam points of the students are positively associated with their attitude and strategic learning approach.


Chapter 3 Logistic regression

Created on 18.11.2017
@author: Xiaodong Li
This is the script for RStudio exercise 3 – Data analysis
The work focuses on exploring the data and performing and interpreting logistic regression analysis on the Student Performance Data Set from the UCI Machine Learning Repository.

Step 0: import packages

library(tidyr)
library(dplyr)
library(ggplot2)

Step 1: Read data

alc=read.csv('/home/xiaodong/IODS_course/IODS-project/data/alc.csv',sep=',',header = TRUE)
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The data used in this exercise is a joined data set that combines two student alcohol consumption data sets, student-mat.csv and student-por.csv, both retrieved from the UCI Machine Learning Repository. The data come from two identical questionnaires related to secondary school student alcohol consumption in Portugal. For more background information, please check here. The variables not used for joining the two data sets have been combined by averaging. The alc_use column is the average of weekday (Dalc) and weekend (Walc) alcohol use. The high_use column records whether alc_use is higher than 2.
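To illustrate how those two derived columns are defined, the sketch below recomputes them from Dalc and Walc and checks that they match the columns already present in the joined data (alc_check is a temporary object introduced only here):

alc_check <- mutate(alc,
                    alc_use_check = (Dalc + Walc) / 2,   # average of weekday and weekend alcohol use
                    high_use_check = alc_use_check > 2)  # TRUE if the average use is above 2
all(alc_check$alc_use_check == alc_check$alc_use,
    alc_check$high_use_check == alc_check$high_use)      # should be TRUE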

Step 2: Hypotheses about relationships with alcohol consumption

Choose four interesting variables in the data and present personal hypotheses about their relationships with alcohol consumption:
- failures: positive correlation; the more alcohol consumption, the more failures
- absences: positive correlation; the more alcohol consumption, the more absences
- sex: more male than female students among high alcohol users
- studytime: negative correlation; the more alcohol consumption, the less study time

Step 3: Explore the distributions of the chosen variables and their relationships with alcohol consumption

The relationship between sex and alcohol use

bar_sex=ggplot(alc, aes(x=alc_use,fill=sex))+geom_bar(); bar_sex

sex~alc_use:
According to the bar plot of counts vs. alc_use, grouped by sex, female students dominate among low alcohol users (alc_use < 2.5), whereas most high alcohol users (alc_use > 2.5) are male students. The figure also shows that most students are very light users, and the counts decrease quickly with increasing alcohol use level (except for the extreme users).

The relationship between absences, failures, studytime and alcohol use

The failures, absences and studytime sums are scaled by the number of students at each alcohol use level, i.e. converted to per-student averages.

alc2=group_by(alc,alc_use)
tab_sum=summarise(alc2,count=n(),absences=sum(absences),failures=sum(failures),studytime=sum(studytime))
tab_sum=mutate(tab_sum,abs_count=absences/count,fai_count=failures/count,styt_count=studytime/count)
tab_sum
## # A tibble: 9 x 8
##   alc_use count absences failures studytime abs_count fai_count styt_count
##     <dbl> <int>    <int>    <int>     <int>     <dbl>     <dbl>      <dbl>
## 1     1.0   140      470       15       323  3.357143 0.1071429   2.307143
## 2     1.5    69      292       10       136  4.231884 0.1449275   1.971014
## 3     2.0    59      231       13       117  3.915254 0.2203390   1.983051
## 4     2.5    44      283        6        85  6.431818 0.1363636   1.931818
## 5     3.0    32      195       18        55  6.093750 0.5625000   1.718750
## 6     3.5    17       96        8        25  5.647059 0.4705882   1.470588
## 7     4.0     9       54        2        16  6.000000 0.2222222   1.777778
## 8     4.5     3       36        0         6 12.000000 0.0000000   2.000000
## 9     5.0     9       62        5        15  6.888889 0.5555556   1.666667
bar_absences=ggplot(tab_sum,aes(x=alc_use,y=abs_count))+geom_bar(stat='identity'); bar_absences

bar_failures=ggplot(tab_sum,aes(x=alc_use,y=fai_count))+geom_bar(stat='identity'); bar_failures

bar_studytime=ggplot(tab_sum,aes(x=alc_use,y=styt_count))+geom_bar(stat='identity'); bar_studytime

absences~alc_use:
There is an increasing trend between absences and alcohol use, which is in line with our hypothesis. When alcohol use is 4.5, the absences are extremely high. The second highest absence level occurs at the highest alcohol use level.
failures~alc_use:
There is a positive correlation between failures and alcohol use. For light alcohol users the failures are also at a low level; however, failures reach their highest level when alc_use = 3. After that the failures fall with increasing alcohol use, which we interpret as a lack of samples. For extreme alcohol users (alc_use = 5), the failures are again at the highest level, the same as for alc_use = 3.
studytime~alc_use:
The figure shows that there is no obvious relation between study time and alcohol use. This is not in agreement with our earlier hypothesis. The lightest alcohol users have the most study time and the study time of the heaviest users is low compared with the other alcohol use levels, but the differences between them are not very pronounced.

Box plots by groups

Box plots are an excellent way of displaying and comparing distributions. The box visualizes the 25th, 50th and 75th percentiles of the data, while the whiskers show the typical range and the outliers of a variable.

box_absences=ggplot(alc,aes(x=high_use,y=absences))+geom_boxplot(); box_absences

box_failures=ggplot(alc,aes(x=high_use,y=failures))+geom_boxplot(); box_failures

box_studytime=ggplot(alc,aes(x=high_use,y=studytime))+geom_boxplot(); box_studytime

From the box plot of absences vs. high_use, it is obvious that the high alcohol users (alc_use > 2) are more likely to be absent from school. The box plot of studytime vs. high_use shows that high alcohol use also reduces the study time of the students. These conclusions are in line with our earlier hypotheses.

Step 4: Logistic regression

The logistic regression is used here to identify factors (failures,absences,sex and studytime) related to higher than average student alcohol consumption.

m=glm(high_use~failures+absences+sex+studytime, data=alc,family='binomial')
summary(m)
## 
## Call:
## glm(formula = high_use ~ failures + absences + sex + studytime, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1213  -0.8226  -0.6017   1.0824   2.1185  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.11174    0.43028  -2.584 0.009773 ** 
## failures     0.36048    0.19472   1.851 0.064126 .  
## absences     0.08734    0.02296   3.803 0.000143 ***
## sexM         0.79470    0.25212   3.152 0.001621 ** 
## studytime   -0.34003    0.16259  -2.091 0.036499 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 419.82  on 377  degrees of freedom
## AIC: 429.82
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)    failures    absences        sexM   studytime 
## -1.11173770  0.36048380  0.08734126  0.79469663 -0.34003347

According to the summary results, the estimated coefficients for failures, absences, sexM and studytime are 0.360, 0.087, 0.795 and -0.340, respectively. The results show that failures, absences and sexM are positively associated with high alcohol use, while studytime is negatively associated with it. This is in agreement with our previous hypotheses. According to the p-values in the summary, the strongest evidence for a relationship is between absences and high_use. The relationship between failures and high_use is less convincing, which agrees with the box plots shown in the last part.

OR=coef(m) %>% exp
CI=confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR,CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.3289868 0.1402796 0.7609560
## failures    1.4340230 0.9804979 2.1153252
## absences    1.0912690 1.0455005 1.1442053
## sexM        2.2137693 1.3559547 3.6508476
## studytime   0.7117465 0.5129770 0.9721184

The ratio of expected “successes” to “failures” is called the odds: p/(1-p). Odds are an alternative way of expressing probabilities. An odds ratio (OR) higher than 1 means that X is positively associated with “success”: higher values of X correspond to a higher probability of success. If OR < 1, higher values of X correspond to a lower probability of success. The computational target variable in the logistic regression model is the log of odds, so applying the exponential function to the modelled values gives the odds.
From the summary of the odds ratios one can see that sexM has the largest odds ratio. This means that being male is associated with a higher probability of high alcohol use than failures or absences are. The odds ratio of studytime is lower than 1, which indicates that higher alcohol use corresponds to less study time.
The 2.5% and 97.5% confidence interval limits for the odds ratios are also listed in the data frame.
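To make the link between coefficients, odds ratios and probabilities concrete, the sketch below converts the sexM coefficient to an odds ratio and computes the predicted probability of high alcohol use for one hypothetical student (the covariate values are chosen only for illustration):

exp(coef(m)["sexM"])   # odds ratio for male vs. female, ~2.21 as in the table above
# predicted probability for a hypothetical male student with 0 failures, 4 absences and studytime 2
new_student <- data.frame(failures = 0, absences = 4, sex = "M", studytime = 2)
predict(m, newdata = new_student, type = "response")
# the same by hand through the logistic link; coefficient order: (Intercept), failures, absences, sexM, studytime
plogis(sum(coef(m) * c(1, 0, 4, 1, 2)))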

Step 5: Binary predictions

predict() the probability of high_use

probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability>0.5)
select(alc, failures, absences, sex, studytime, high_use, probability, prediction) %>% tail(10)
##     failures absences sex studytime high_use probability prediction
## 373        1        0   M         1    FALSE   0.4263911      FALSE
## 374        1        7   M         1     TRUE   0.5780560       TRUE
## 375        0        1   F         3    FALSE   0.1146096      FALSE
## 376        0        6   F         1    FALSE   0.2833868      FALSE
## 377        1        2   F         3    FALSE   0.1684473      FALSE
## 378        0        2   F         2    FALSE   0.1656021      FALSE
## 379        2        2   F         2    FALSE   0.2898414      FALSE
## 380        0        3   F         2    FALSE   0.1780258      FALSE
## 381        0        4   M         1     TRUE   0.4236739      FALSE
## 382        0        2   M         1     TRUE   0.3816874      FALSE
table(high_use = alc$high_use, prediction = alc$prediction)%>%addmargins
##         prediction
## high_use FALSE TRUE Sum
##    FALSE   256   12 268
##    TRUE     80   34 114
##    Sum     336   46 382
g=ggplot(alc, aes(x = probability, y = high_use, col=prediction))
g=g+geom_point()
g

table(high_use = alc$high_use, prediction = alc$prediction)%>% prop.table%>%addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67015707 0.03141361 0.70157068
##    TRUE  0.20942408 0.08900524 0.29842932
##    Sum   0.87958115 0.12041885 1.00000000

According to the last 10 rows of data, we can see that most of the predictions are correct, except for the last two. The last two students are high alcohol users, but the prediction says they are not, which is incorrect.
The cross tabulation of predictions versus the actual values shows that 256 out of 268 FALSE (non-high alcohol use) values were predicted correctly by the model, while only 34 of 114 TRUE (high alcohol use) values were predicted correctly. The correct prediction rate for FALSE cases is 95.5%, while the correct prediction rate for TRUE cases is only 29.8% (see the sketch below).
The results show that the model gives relatively good predictions for FALSE cases, but its predictions of TRUE cases are not sensitive.
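Those two rates can also be computed directly from the cross tabulation instead of by hand; a small sketch (conf is a temporary object introduced only here):

conf <- table(high_use = alc$high_use, prediction = alc$prediction)
conf["FALSE", "FALSE"] / sum(conf["FALSE", ])   # correct prediction rate for FALSE cases, ~0.955
conf["TRUE", "TRUE"] / sum(conf["TRUE", ])      # correct prediction rate for TRUE cases, ~0.298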

Step 6: Compute the average number of incorrect predictions

Accuracy: the proportion of correctly classified observations. Penalty (loss) function: the proportion of incorrectly classified observations. A smaller penalty value means better predictions.

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2408377

So, about 24.1% of the model's predictions are wrong. Combined with the analysis from Step 5, we know that most of the wrong predictions are for TRUE cases (12 FALSE observations and 80 TRUE observations were misclassified).

Step 7: Cross-validation

Cross-validation is a method of testing a predictive model on unseen data: the value of a penalty (loss) function (mean prediction error) is computed on data not used for fitting the model. A lower cross-validation value means better model predictions.
Perform 10-fold cross-validation

library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.2460733

The 10-fold cross-validation results show that the test set prediction error is mostly between 0.25 and 0.26. This is not clearly smaller than the prediction error of the model introduced in DataCamp, which is about 0.26.
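The claim that the error is mostly between 0.25 and 0.26 can be checked by repeating the cross-validation a few times, since each run uses a different random partition. A quick sketch, not part of the original exercise:

cv_errors <- replicate(10, cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)$delta[1])
summary(cv_errors)   # spread of the 10-fold CV prediction error over repeated runs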


Chapter 4 Clustering and classification

Created on 26.11.2017
@author: Xiaodong Li
This is the script for RStudio exercise 4 – Clustering and classification
The work focuses on exploring the data and performing clustering methods, which try to find clusters (or groups) within the data. Clustering means that some points (or observations) of the data are, in some sense, closer to each other than some other points. Based on a successful clustering, we may try to classify new observations into these clusters and hence validate the clustering results. This week's data is the Boston data from the MASS package.

Step 0: Import packages

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(corrplot)
## corrplot 0.84 loaded
library(tidyr)
library(dplyr)
library(ggplot2)

Step 1: Load and explore the data

data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The Boston data describes housing values in the suburbs of Boston. It contains 14 variables which describe the housing environment, for example crim (per capita crime rate by town) and lstat (lower status of the population, percent). Each variable has 506 observations giving the detailed data of the corresponding variable.

Step 2: The correlations between variables

The function cor() can be used to create the correlation matrix. A more visual way to look at the correlations is to use corrplot() function (from the corrplot package).

cor_matrix<-cor(Boston) %>% round(2)
corrplot(cor_matrix, method="circle",type='upper',cl.pos='b',tl.pos='d',tl.cex=0.6)

The corrplot shows the correlations between the variables of the Boston dataset. A darker color indicates a stronger correlation; blue means a positive correlation and red a negative one. From the figure we can see that rad (index of accessibility to radial highways) and tax (full-value property-tax rate per $10,000) have a very strong positive correlation, and that dis (weighted mean of distances to five Boston employment centres) has very strong negative correlations with nox (nitrogen oxides concentration), age (proportion of owner-occupied units built prior to 1940) and indus (proportion of non-retail business acres per town). medv (median value of owner-occupied homes in $1000s) and lstat (lower status of the population) also have a very strong negative correlation.

Step 3: Scale the whole dataset

In the scaling we subtract the column mean from each column and divide the difference by the standard deviation:
\(scaled(x) = \frac{x - mean(x)}{ sd(x)}\)

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

The variables were transformed according to the formula shown above. After scaling, each variable has mean 0: negative values are below the original mean and positive values are above it.
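As a sanity check of the formula above, one column can be scaled by hand and compared with the output of scale(); crim_manual is a temporary vector introduced only for this comparison:

crim_manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)   # manual scaling of one column
all.equal(crim_manual, as.numeric(boston_scaled[, "crim"]))          # should be TRUE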
We want to cut the variable by quantiles to get the high, low and middle rates of crime into their own categories.

boston_scaled=as.data.frame(boston_scaled)
bins=quantile(boston_scaled$crim)
crime=cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c('low','med_low','med_high','high'))
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
boston_scaled=dplyr::select(boston_scaled, -crim)
boston_scaled=data.frame(boston_scaled, crime)
n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)

First, we use the quantiles as break points to create a categorical crime rate variable for the Boston dataset. Then the old crime rate variable is dropped from the scaled dataset, and the data is divided into train and test sets, where 80% of the observations belong to the train set.

Step 4: Linear discriminant analysis

Linear discriminant analysis (LDA) is a classification (and dimension reduction) method. It finds the (linear) combinations of the variables that separate the target variable classes. The target can be a binary or multiclass variable.

lda.fit <- lda(crime~., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2,col=classes,pch=classes)
lda.arrows(lda.fit, myscale = 3)

LDA can be visualized with a biplot. A Biplot is an enhanced scatterplot that uses both points and vectors to represent structure. A biplot uses points to represent the scores of the observations on the principal components, and it uses vectors to represent the coefficients of the variables on the principal components.
Points: Points that are close together correspond to observations that have similar scores on the components displayed in the plot. To the extent that these components fit the data well, the points also correspond to observations that have similar values on the variables.
Vectors: A vector points in the direction which is most like the variable represented by the vector. This is the direction which has the highest squared multiple correlation with the principal components. The length of the vector is proportional to the squared multiple correlation between the fitted values for the variable and the variable itself.

Step 5: Predict LDA

The function predict() can be used to predict values based on a model.
We use the test data obtained in Step 3, where the correct crime categories are saved as correct_classes and the categorical crime variable has been removed from the test dataset.

lda.pred=predict(lda.fit,newdata=test)
table(correct=correct_classes,predicted=lda.pred$class) %>% addmargins
##           predicted
## correct    low med_low med_high high Sum
##   low       12       6        1    0  19
##   med_low    5      21        5    0  31
##   med_high   0       7       21    1  29
##   high       0       0        1   22  23
##   Sum       17      34       28   23 102

From the cross table we can see that the model gives relatively good predictions of the crime rate category, especially for the high-rate class. The correct prediction rates are shown below:

correct=c(16,23,13,26)
correct=c(correct,sum(correct))
sumdata=c(27,30,18,27)
sumdata=c(sumdata,sum(sumdata))
index=c('low','med_low','med_high','high','sum')
t=data.frame(correct=index,predict=correct,sum=sumdata)
t$percentage=t$predict/t$sum
t
##    correct predict sum percentage
## 1      low      16  27  0.5925926
## 2  med_low      23  30  0.7666667
## 3 med_high      13  18  0.7222222
## 4     high      26  27  0.9629630
## 5      sum      78 102  0.7647059

From the table we can see that the overall correct prediction percentage is 76.5%, and the percentage increases with increasing crime rate. The model is best suited to predicting high crime rate areas.
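The same rates can also be computed directly from the cross table, which avoids typing the counts by hand (note that the printed counts come from a random train/test split, so the exact numbers change between runs). A sketch using the objects defined earlier:

conf_lda <- table(correct = correct_classes, predicted = lda.pred$class)
diag(conf_lda) / rowSums(conf_lda)    # class-wise correct prediction rate
sum(diag(conf_lda)) / sum(conf_lda)   # overall correct prediction rate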

Step 6: Distance measures and clustering

Distance measures
Similarity and dissimilarity of objects can be measured with distance measures. There are many different measures for different types of data. Here we compute the Euclidean and Manhattan distances for the data.

data('Boston')
boston_scaled=scale(Boston)
dist_eu=dist(boston_scaled,method='euclidean')
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
dist_man=dist(boston_scaled,method='manhattan')
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

The calculations of these two distances follow the standard definitions shown below.
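For two observations \(x\) and \(y\) with \(p\) variables, the standard textbook definitions (not reproduced from the original figure) are:
\(d_{euclidean}(x,y)=\sqrt{\sum_{i=1}^{p}(x_i-y_i)^2}\)
\(d_{manhattan}(x,y)=\sum_{i=1}^{p}|x_i-y_i|\)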

K-means
K-means is perhaps the best known and most widely used clustering method. It is an unsupervised method that assigns observations to groups or clusters based on the similarity of the objects.
Three clusters are tried first.

km=kmeans(boston_scaled,centers = 3)
pairs(boston_scaled,col=km$cluster)

Find the optimal number of clusters
One way to determine the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves when the number of clusters changes. When you plot the number of clusters against the total WCSS, the optimal number of clusters is where the total WCSS drops radically.

set.seed(123)
k_max=10
twcss=sapply(1:k_max,function(k){kmeans(boston_scaled,k)$tot.withinss})
qplot(x=1:k_max,y=twcss,geom='line')

According to the line plot shown above, the most radical drop happens when k is 2, so the optimal number of clusters is 2.

km=kmeans(boston_scaled,centers=2)
pairs(boston_scaled,col=km$cluster)

From the figure we can see that, for most of the variables, the k-means method separates the two clusters quite well.
The method of calculating K-means:
1. Choose the number of clusters and initial centroids.
2. Calculate distances between centroids and data points.
3. For all the data points: assign data point to clusters based on which centroid is closest.
4. Update centroids: within each cluster, calculate new centroid.
5. Update cluster: calculate distances between data points and updated centroids. If some other centroid is closer than the cluster centroid where the data point belongs, the data point changes cluster.
Continue updating steps until the centroids or the clusters do not change.
K-means is a quick and simple method for cluster analysis. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
However, the number of clusters k is hard to choose in advance. In this example we used a simple way to find the optimal k value, but we must remember that an inappropriate choice of k may yield poor results.
The choice of the initial cluster centers also has a great impact on the clustering result: if the initial values are not chosen well, the algorithm may not find a good clustering. The K-means method assigns the initial cluster centers randomly, and this is a major issue of K-means algorithms. We used the function set.seed() to make the random initialisation reproducible, but usually the initial centers also need to be chosen carefully (see the sketch below).
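An alternative (and complementary) way to reduce the sensitivity to the random initial centers is to let kmeans() try several random starts and keep the best solution; a quick sketch using the standard nstart argument:

km_best <- kmeans(boston_scaled, centers = 2, nstart = 20)   # 20 random initialisations, best one kept
km_best$tot.withinss                                         # total WCSS of the best run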


Chapter 5 Dimensionality reduction technique

Created on 01.12.2017
@author: Xiaodong Li
This is the script for RStudio exercise 5 – Dimensionality reduction technique

Step 0: Import packages

library(GGally)
library(corrplot)

Step 1: Read and explore data

human = read.table('/home/xiaodong/IODS_course/IODS-project/data/human.txt',sep='\t',header = TRUE)
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155   8

The human data describes the Human Development Index (HDI) and Gross National Income (GNI) of different countries, together with education, labour and health indicators. The goal is to show that people and their capabilities should be the ultimate criteria for assessing the development of a country, not economic growth alone. The meanings of the variables are listed below:

  • Edu2.FM: The ratio of Female and Male populations with secondary education

  • Labo.FM: The ratio of labour force participation of females and males

  • Life.Exp: Life expectancy at birth

  • Edu.Exp: Expected years of education

  • GNI: Gross national income per capita

  • Mat.Mor: Maternal mortality ratio

  • Ado.Birth: Adolescent birth rate

  • Parli.F: Percent representation in parliament

Step 2: Graphical overview of the data

Visualize the human data with ggpairs()

ggpairs(human)

Visualize the human data with correlation plots

cor_matrix=cor(human)
corrplot(cor_matrix,type='upper')

The scatter plots and correlation plots of all the variables shown above give information about the relationships and distributions of the data. From the plots we can see that Life.Exp has a strong positive correlation with Edu.Exp and very strong negative correlations with Mat.Mor and Ado.Birth. Edu2.FM also has a strong negative correlation with Mat.Mor.

Step 3: Perform PCA on not standardized data

Perform Principal Component Analysis (PCA) with the SVD method

pca_human=prcomp(human)
s=summary(pca_human)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
pca_pr=round(100*s$importance[2,],digits=1)
pc_lab=paste0(names(pca_pr),'(',pca_pr,'%)')
biplot(pca_human,choices=1:2,cex=c(0.8,1),xlab=pc_lab[1],ylab=pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

The biplot of the non-standardized data is not informative because the variables have different scales and are not comparable.
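The scale problem can be seen directly from the column variances of the raw data; this quick check (not part of the original exercise) shows that GNI dominates the total variance by several orders of magnitude:

sort(apply(human, 2, var), decreasing = TRUE)   # GNI's variance dwarfs all the other variables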

Step 4: Perform PCA on standardized data

human_std=scale(human)
pca_human=prcomp(human_std)
s=summary(pca_human)
s
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
pca_pr=round(100*s$importance[2,],digits=1)
pc_lab=paste0(names(pca_pr),'(',pca_pr,'%)')
biplot(pca_human,choices = 1:2,cex=c(0.8,1),xlab=pc_lab[1],ylab=pc_lab[2])

The biplots obtained in Step 3 and Step 4 give different PCA results. For the non-standardized data, the GNI variable has a very large variance and essentially all of the variation is captured by the first principal component (PC1). For the standardized data, all variables have similar variances and the features of the original data are spread over several principal components. The different distributions of the data in these two situations are clearly visible in the two biplots.
The biplot is a way of visualizing two representations of the same data. The biplot displays:

  1. The observations in a lower 2-dimensional representation.
    A scatter plot is drawn where the observations are placed on X and Y coordinates defined by two principal components.

  2. The original features and their relationships with both each other and the principal components.
    Arrows and/or labels are drawn to visualize the connections between the original features and the principal components (the sketch after this list shows the underlying loading matrix).
  • The angle between arrows representing the original features can be interpreted as the correlation between the features. Small angle = high positive correlation.

  • The angle between a feature and a principal component axis can be interpreted as the correlation between the two. Small angle = high positive correlation.

  • The length of an arrow is proportional to the standard deviation of the corresponding feature.
    According to these properties of the biplot we can summarize the relationships between the variables:
  • Mat.Mor and Ado.Birth have a high positive correlation with each other.

  • Edu.Exp, Edu2.FM, GNI and Life.Exp have high positive correlations with each other.

  • The above two groups are strongly negatively correlated with each other.

  • The standard deviations of the variables are quite similar because of the scaling.
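As mentioned in the list above, the arrows correspond to the variable loadings. They can be inspected directly from the PCA object (a sketch; the biplot rescales these loadings before drawing the arrows):

pca_human$rotation[, 1:2]   # loadings of the original variables on PC1 and PC2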

Step 5: Perform MCA on tea dataset

Load the tea dataset

library(FactoMineR)
library(ggplot2)
library(dplyr)
library(tidyr)
data('tea')
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36

The tea dataset contains the answers to a questionnaire on tea consumption for 300 individuals. Although the data contains 36 columns, for demonstration purposes I will only consider the following 6 columns:

keep_columns=c('Tea','How','how','sugar','where','lunch')
tea_time=dplyr::select(tea,one_of(keep_columns))
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
gather(tea_time) %>% ggplot(aes(value))+geom_bar()+facet_wrap('key',scales='free')+theme(axis.text=element_text(angle=45,hjust = 1,size=8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

Perform MCA on the selected tea dataset, tea_time

mca=MCA(tea_time,graph=FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
plot(mca,invisible=c('ind'),habillage='quali')

Multiple Correspondence Analysis (MCA) is an extension of simple correspondence analysis (CA) for analysing a data table containing more than two categorical variables. MCA is generally used to analyse survey data.
The objectives are to identify:

  • A group of individuals with similar profiles in their answers to the questions

  • The associations between variable categories

The result of the function summary() contains 4 tables (the sketch after this list shows how to access the same information directly from the MCA object):

  • Table 1 - Eigenvalues: table 1 contains the variances and the percentage of variances retained by each dimension.

  • Table 2 contains the coordinates, the contribution and the cos2 (quality of representation [in 0-1]) of the first 10 active individuals on the dimensions 1 and 2.
  • Table 3 contains the coordinates, the contribution and the cos2 (quality of representation, in [0,1]) of the first 10 active variable categories on dimensions 1 and 2. This table also contains a column called v.test. The v.test value generally lies between -2 and 2; for a given variable category, an absolute v.test value greater than 2 means that the coordinate is significantly different from 0.

  • Table 4 - categorical variables (eta2): contains the squared correlation between each variable and the dimensions.
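As mentioned above, the same information can be accessed directly from the MCA result object returned by FactoMineR, which is convenient for further processing (a sketch using the mca object created above):

head(mca$eig)         # eigenvalues and percentages of variance per dimension (Table 1)
head(mca$ind$coord)   # coordinates of the individuals (Table 2)
head(mca$var$coord)   # coordinates of the variable categories (Table 3)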

The MCA graph shows a global pattern within the data. Rows (individuals) are usually represented by blue points and columns (variable categories) by red triangles.
The distance between any row points or column points gives a measure of their similarity (or dissimilarity).
Row points with similar profiles are close together on the factor map, and the same holds true for column points.
From our results shown above, we can conclude that the categories tea shop and unpackaged are the most correlated with dimension 1. Similarly, the categories other and chain store+tea shop are the most correlated with dimension 2.