This course, Introduction to Open Data Science, aims to help students understand the principles and advantages of using open research tools with open data, and to understand the possibilities of reproducible research. After this course the students should know how to use R, RStudio, R Markdown and GitHub, and also know how to learn more about these open software tools. Besides, and most importantly for me, the students will also know how to apply certain statistical methods of data science.
The link to my GitHub repository is:
https://github.com/XiaodongAAA/IODS-project
Created at 10:00 12.11.2017
@author: Xiaodong Li
The script for RStudio Exercise 2 – data analysis
Import packages:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(ggplot2)
lrn2014 = read.table('/home/xiaodong/IODS_course/IODS-project/data/learning2014.txt',sep='\t',header = TRUE)
Structure of the data
str(lrn2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
Dimensions of the data
dim(lrn2014)
## [1] 166 7
Data description
According to the structure and dimensions of the data, the data frame contains 7 variables: gender, age, attitude, deep, stra, surf and points. Each variable has 166 observations. gender indicates whether a respondent is male (M) or female (F). age is the age (in years) of the respondent, derived from the date of birth. The attitude column lists the global attitude toward statistics. The columns deep, surf and stra summarise the questions related to deep, surface and strategic learning; the related questions can be found at http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-meta.txt The points column lists the exam points from the survey.
Plot the relationships between the variables
p <- ggpairs(lrn2014, mapping = aes(col=gender,alpha=0.3))
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The figure shows the relationships between the different variables. From the gender column we can see that there are more female than male respondents. Most of the people surveyed are young, around 20 years old. The attitude scores of men are higher than those of women, which suggests that men have a more positive attitude towards statistics. The answers to the deep and strategic learning questions are almost the same for men and women, with mean scores between 3 and 4. However, the surface questions differ between male and female respondents: the answers of women are centred at around 3.0 while those of men are centred at around 2.3. The exam points of male and female respondents are about the same, with a mode of about 23 for both.
reg_model=lm(points~attitude+stra+surf,data=lrn2014)
summary(reg_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The target variable points is fitted to three explanatory variables: attitude, stra and surf. According to the results of the model, surf does not have a statistically significant relationship with the target variable, so surf is removed from the model and points is modelled with attitude and stra again.
reg_model2=lm(points~attitude+stra, data=lrn2014)
summary(reg_model2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = lrn2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
The model is fitted with the target points and two explanatory variables, attitude and stra. According to the summary results, the relationship between these variables is \(points = 8.9729 + 3.4658 \cdot attitude + 0.9137 \cdot stra\).
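As a quick sanity check (a small illustrative sketch with made-up input values, not part of the original exercise), the fitted equation should agree with predict() up to the rounding of the printed coefficients:
new_obs <- data.frame(attitude = 4, stra = 3)  # hypothetical new observation
predict(reg_model2, newdata = new_obs)         # prediction from the fitted model
8.9729 + 3.4658 * 4 + 0.9137 * 3               # manual calculation with the rounded coefficients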
* The Std. Error is the standard deviation of the sampling distribution of the estimate of the coefficient under the standard regression assumptions.
* The t values are the estimates divided by their standard errors. A t value is an estimate of how extreme the observed value is, relative to the standard error.
* Pr(>|t|) is the p-value for the hypothesis for which the t value is the test statistic. It tells you the probability of obtaining a test statistic at least as extreme as the one observed if the null hypothesis were true (the null hypothesis is usually 'no effect', unless something else is specified). So, if the p-value is very low, the data are counter-indicative of zero effect.
* The Residual standard error represents the standard deviation of the residuals. It is a measure of how close the fit is to the data points.
* The Multiple R-squared is the proportion of the variance in the data that is explained by the model. The more variables you add, the larger this will be. The Adjusted R-squared corrects for the number of variables in the model.
* The F-statistic on the last line tells you whether the regression as a whole performs better than random; in other words, whether the model fits better than you would expect if none of the predictors had a relationship with the response.
* The p-value in the last row is the p-value for that test, essentially comparing the full fitted model with an intercept-only model.
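The quantities listed above can also be extracted directly from the summary object; a minimal sketch for reg_model2:
fit_summary <- summary(reg_model2)
fit_summary$coefficients   # estimates, standard errors, t values and Pr(>|t|)
fit_summary$sigma          # residual standard error
fit_summary$r.squared      # multiple R-squared
fit_summary$adj.r.squared  # adjusted R-squared
fit_summary$fstatistic     # F-statistic with its degrees of freedom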
Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage are plotted
par(mfrow=c(2,2))
plot(reg_model2, which=c(1,2,5))
The Residuals vs Fitted values plot examines whether the errors have constant variance. The graph shows a reasonably constant variance without any pattern. The Normal QQ-plot checks whether the errors are normally distributed; the graph shows a very good linear fit, indicating a normally distributed error set. The Residuals vs Leverage plot confirms whether there are any outliers with high leverage. It shows that all leverage values are below 0.06, indicating a good model fit.
The learning2014 data were explored and analysed using a graphical overview, data summary, multiple regression and diagnostic plots. The relationships between the different variables were examined with a single plot that shows all possible scatter plots of the columns of learning2014. The exam points variable was fitted with the explanatory variables attitude and stra and showed a reasonably good linear trend, despite the relatively low R-squared value. The validity of the model was checked by means of residual analysis. The model predicts that the exam points of the students are positively correlated with their attitude and their strategic learning approach.
Created on 18.11.2017. @author: Xiaodong Li. This is the script for RStudio exercise 3 – data analysis. The work focuses on exploring the data and on performing and interpreting a logistic regression analysis of the UCI Machine Learning Repository Student Performance Data Set.
library(tidyr)
library(dplyr)
library(ggplot2)
alc=read.csv('/home/xiaodong/IODS_course/IODS-project/data/alc.csv',sep=',',header = TRUE)
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The data used in this exercise is a joined data set that combines the two student alcohol consumption data sets, student-mat.csv and student-por.csv. Both data sets were retrieved from the UCI Machine Learning Repository. The data come from two identical questionnaires related to secondary school student alcohol consumption in Portugal. The variables not used for joining the two data sets have been combined by averaging. The alc_use column is the average of weekday (Dalc) and weekend (Walc) alcohol use. The high_use column records whether alc_use is higher than 2.
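A minimal sketch of how these two columns could have been created in the data wrangling step (assuming the joined data frame is already called alc and contains Dalc and Walc):
alc <- mutate(alc, alc_use = (Dalc + Walc) / 2)  # average of weekday and weekend use
alc <- mutate(alc, high_use = alc_use > 2)       # TRUE when the average use is above 2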
Choose four interesting variables in the data and present a personal hypothesis about their relationships with alcohol consumption:
- failures: positive correlation, the more alcohol consumption, the more failures
- absences: positive correlation, the more alcohol consumption, the more absences
- sex: there are more male than female students with high alcohol use
- studytime: negative correlation, the more alcohol consumption, the less study time
bar_sex=ggplot(alc, aes(x=alc_use,fill=sex))+geom_bar(); bar_sex
sex ~ alc_use:
According to the count ~ alc_use bar chart grouped by sex, female students dominate among the low alcohol users (alc_use < 2.5), whereas most of the high alcohol users (alc_use > 2.5) are male students. The bar chart also shows that most alcohol users are very light users, and their numbers decrease quickly with increasing alcohol use level (except for the extreme high users).
The failures, absences and studytime are scaled according to the counts of the alcohol users in different levels.
alc2=group_by(alc,alc_use)
tab_sum=summarise(alc2,count=n(),absences=sum(absences),failures=sum(failures),studytime=sum(studytime))
tab_sum=mutate(tab_sum,abs_count=absences/count,fai_count=failures/count,styt_count=studytime/count)
tab_sum
## # A tibble: 9 x 8
## alc_use count absences failures studytime abs_count fai_count styt_count
## <dbl> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 1.0 140 470 15 323 3.357143 0.1071429 2.307143
## 2 1.5 69 292 10 136 4.231884 0.1449275 1.971014
## 3 2.0 59 231 13 117 3.915254 0.2203390 1.983051
## 4 2.5 44 283 6 85 6.431818 0.1363636 1.931818
## 5 3.0 32 195 18 55 6.093750 0.5625000 1.718750
## 6 3.5 17 96 8 25 5.647059 0.4705882 1.470588
## 7 4.0 9 54 2 16 6.000000 0.2222222 1.777778
## 8 4.5 3 36 0 6 12.000000 0.0000000 2.000000
## 9 5.0 9 62 5 15 6.888889 0.5555556 1.666667
bar_absences=ggplot(tab_sum,aes(x=alc_use,y=abs_count))+geom_bar(stat='identity'); bar_absences
bar_failures=ggplot(tab_sum,aes(x=alc_use,y=fai_count))+geom_bar(stat='identity'); bar_failures
bar_studytime=ggplot(tab_sum,aes(x=alc_use,y=styt_count))+geom_bar(stat='identity'); bar_studytime
absences ~ alc_use:
There is an increasing trend of absences with alcohol use, which is in line with our hypothesis. When the alcohol use is 4.5, the absences are extremely high. The second highest absences occur at the highest level of alcohol use.
failures ~ alc_use:
There is a positive correlation between failures and alcohol use. For light alcohol users, the failures are also at a low level; the failures reach their highest level when alc_use = 3. After that the failures fall with increasing alcohol use, which we interpret as a lack of samples. For extreme high alcohol users (alc_use = 5), the failures are again at the highest level, the same as at alc_use = 3.
studytime ~ alc_use:
The figure shows no obvious relationship between study time and alcohol use, which is not in agreement with our hypothesis. The lowest alcohol users have the most study time and the study time of the highest users is low compared with the other alcohol use levels, but the differences between them are not very pronounced.
Box plots are an excellent way of displaying and comparing distributions. The box visualizes the 25th, 50th and 75th percentiles of the data, while the whiskers show the typical range and the outliers of a variable.
box_absences=ggplot(alc,aes(x=high_use,y=absences))+geom_boxplot(); box_absences
box_failures=ggplot(alc,aes(x=high_use,y=failures))+geom_boxplot(); box_failures
box_studytime=ggplot(alc,aes(x=high_use,y=studytime))+geom_boxplot(); box_studytime
From the box plot of absences vs. high_use, it is obvious that the high alcohol users (alc_use > 2) are more likely to be absent from school. The box plot of studytime vs. high_use shows that high alcohol use also reduces the study time of the students. These conclusions are in line with our earlier hypotheses.
The logistic regression is used here to identify factors (failures,absences,sex and studytime) related to higher than average student alcohol consumption.
m=glm(high_use~failures+absences+sex+studytime, data=alc,family='binomial')
summary(m)
##
## Call:
## glm(formula = high_use ~ failures + absences + sex + studytime,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1213 -0.8226 -0.6017 1.0824 2.1185
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.11174 0.43028 -2.584 0.009773 **
## failures 0.36048 0.19472 1.851 0.064126 .
## absences 0.08734 0.02296 3.803 0.000143 ***
## sexM 0.79470 0.25212 3.152 0.001621 **
## studytime -0.34003 0.16259 -2.091 0.036499 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 419.82 on 377 degrees of freedom
## AIC: 429.82
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) failures absences sexM studytime
## -1.11173770 0.36048380 0.08734126 0.79469663 -0.34003347
According to the summary results, the estimated coefficients for failures, absences, sexM and studytime are 0.360, 0.087, 0.795 and -0.340, respectively. The results show that the associations of failures, absences and sexM with high alcohol use are positive, while for studytime the association is negative. This is in agreement with our previous hypotheses. According to the p-values shown in the summary, the most significant relationship is between absences and high_use. The relationship between failures and high_use is less convincing, which agrees with the plots shown in the previous part.
OR=coef(m) %>% exp
CI=confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR,CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.3289868 0.1402796 0.7609560
## failures 1.4340230 0.9804979 2.1153252
## absences 1.0912690 1.0455005 1.1442053
## sexM 2.2137693 1.3559547 3.6508476
## studytime 0.7117465 0.5129770 0.9721184
The ratio of expected "successes" to "failures" is called the odds: p/(1-p). Odds are an alternative way of expressing probabilities: higher odds correspond to a higher probability of success. An odds ratio (OR) higher than 1 means that X is positively associated with "success", while an OR lower than 1 means that X is negatively associated with "success". The computational target variable in the logistic regression model is the log of the odds, so applying the exponential function to the modelled values gives the odds.
From the summary of the odds ratios one can see that sexM gives the largest odds ratio. This means that being male is more strongly associated with high alcohol use than failures and absences are. The odds ratio of studytime is lower than 1, indicating that high alcohol use is associated with less study time.
The 2.5% and 97.5% confidence intervals for the odds ratios are also listed in the data frame.
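As an illustration (a small sketch, not part of the original output), the link-scale predictions of the model are log-odds, so exponentiating them reproduces the odds p/(1-p) implied by the predicted probabilities:
log_odds <- predict(m, type = "link")      # linear predictor = log-odds
p <- predict(m, type = "response")         # predicted probabilities
head(cbind(exp(log_odds), p / (1 - p)))    # the two columns should match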
Use predict() to compute the probability of high_use.
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability>0.5)
select(alc, failures, absences, sex, studytime, high_use, probability, prediction) %>% tail(10)
## failures absences sex studytime high_use probability prediction
## 373 1 0 M 1 FALSE 0.4263911 FALSE
## 374 1 7 M 1 TRUE 0.5780560 TRUE
## 375 0 1 F 3 FALSE 0.1146096 FALSE
## 376 0 6 F 1 FALSE 0.2833868 FALSE
## 377 1 2 F 3 FALSE 0.1684473 FALSE
## 378 0 2 F 2 FALSE 0.1656021 FALSE
## 379 2 2 F 2 FALSE 0.2898414 FALSE
## 380 0 3 F 2 FALSE 0.1780258 FALSE
## 381 0 4 M 1 TRUE 0.4236739 FALSE
## 382 0 2 M 1 TRUE 0.3816874 FALSE
table(high_use = alc$high_use, prediction = alc$prediction)%>%addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 256 12 268
## TRUE 80 34 114
## Sum 336 46 382
g=ggplot(alc, aes(x = probability, y = high_use, col=prediction))
g=g+geom_point()
g
table(high_use = alc$high_use, prediction = alc$prediction)%>% prop.table%>%addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67015707 0.03141361 0.70157068
## TRUE 0.20942408 0.08900524 0.29842932
## Sum 0.87958115 0.12041885 1.00000000
According to the last 10 rows of data, we can see that most of the predictions are correct, except for the last two. The last two samples are high alcohol users, but the prediction says they are not, which is incorrect.
The cross tabulation of predictions versus the actual values shows that 256 out of 268 FALSE values (non-high alcohol users) were predicted correctly by the model, while only 34 of 114 TRUE values (high alcohol users) were predicted correctly. The correct prediction rate for FALSE samples is 95.5%, while the correct prediction rate for TRUE samples is only 29.8%.
The results show that the model gives relatively good predictions for FALSE cases, while it is not sensitive for TRUE cases.
Accuracy: the proportion of correctly classified observations. Penalty (loss) function: the proportion of incorrectly classified observations. A smaller loss function value means better predictions.
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2408377
So, the proportion of wrong predictions of the model is about 24.1%. Combined with the analysis from step 5, we know that most of the wrong predictions concern the TRUE cases (12 wrong predictions for FALSE cases and 80 wrong predictions for TRUE cases).
Cross-validation is a method of testing a predictive model on unseen data. In cross-validation, the value of a penalty (loss) function (mean prediction error) is computed on data not used for fitting the model. A lower cross-validation value means better model predictions.
Perform 10-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.2460733
The 10-fold cross-validation results show that the test set prediction errors are mostly between 0.25 and 0.26. This is not clearly smaller than the prediction error of the model introduced in DataCamp, which is about 0.26.
Created on 26.11.2017. @author: Xiaodong Li. This is the script for RStudio exercise 4 – clustering and classification. The work focuses on exploring data and performing clustering methods, which try to find clusters (or groups) in the data. Clustering means that some points (or observations) of the data are in some sense closer to each other than some other points. Based on a successful clustering, we may try to classify new observations into these clusters and hence validate the results of the clustering. This week's data is the Boston data from the MASS package.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(corrplot)
## corrplot 0.84 loaded
library(tidyr)
library(dplyr)
library(ggplot2)
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston data describes housing values in the suburbs of Boston. It contains 14 variables which describe aspects of the housing environment, such as crim: per capita crime rate by town, and lstat: lower status of the population (percent). Each variable has 506 observations.
The function cor() can be used to create the correlation matrix. A more visual way to look at the correlations is to use the corrplot() function (from the corrplot package).
cor_matrix<-cor(Boston) %>% round(2)
corrplot(cor_matrix, method="circle",type='upper',cl.pos='b',tl.pos='d',tl.cex=0.6)
The corrplot shows the correlations between the variables of the Boston dataset. A darker colour in the corrplot indicates a stronger correlation; blue means a positive correlation while red means a negative correlation. From the figure we can see that rad (index of accessibility to radial highways) and tax (full-value property-tax rate per $10,000) have a very strong positive correlation, and that dis (weighted mean of distances to five Boston employment centres) has very strong negative correlations with nox (nitrogen oxides concentration), age (proportion of owner-occupied units built prior to 1940) and indus (proportion of non-retail business acres per town). medv (median value of owner-occupied homes in $1000s) and lstat (lower status of the population) have a very strong negative correlation too.
In the scaling we subtract the column means from the corresponding columns and divide the difference by the standard deviation.
\(scaled(x) = \frac{x - mean(x)}{ sd(x)}\)
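The scale() call below does exactly this; an equivalent manual version (a sketch for illustration only, using my own object names) would be:
B <- as.matrix(Boston)
boston_manual <- sweep(B, 2, colMeans(B), "-")                  # subtract column means
boston_manual <- sweep(boston_manual, 2, apply(B, 2, sd), "/")  # divide by column standard deviations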
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
The variables changed according to the formula shown above. After the scaling, all the values are centred around 0, which is now the mean of each variable. Negative values are below the original mean and positive values are above it.
We want to cut the variable by quantiles to get the high, low and middle rates of crime into their own categories.
boston_scaled=as.data.frame(boston_scaled)
bins=quantile(boston_scaled$crim)
crime=cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c('low','med_low','med_high','high'))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
boston_scaled=dplyr::select(boston_scaled, -crim)
boston_scaled=data.frame(boston_scaled, crime)
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
First, we use the quantiles as break points and create a categorical crime-rate variable in the Boston dataset. Then the old continuous crime rate variable is dropped from the dataset. Finally, the dataset is divided into train and test sets, with 80% of the data belonging to the train set.
Linear discriminant analysis (LDA) is a classification (and dimension reduction) method. It finds the (linear) combinations of the variables that separate the target variable classes. The target can be a binary or a multiclass variable.
lda.fit <- lda(crime~., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2,col=classes,pch=classes)
lda.arrows(lda.fit, myscale = 3)
LDA can be visualized with a biplot. A biplot is an enhanced scatter plot that uses both points and vectors to represent structure. A biplot uses points to represent the scores of the observations on the principal components, and it uses vectors to represent the coefficients of the variables on the principal components.
Points: Points that are close together correspond to observations that have similar scores on the components displayed in the plot. To the extent that these components fit the data well, the points also correspond to observations that have similar values on the variables.
Vectors: A vector points in the direction which is most like the variable represented by the vector. This is the direction which has the highest squared multiple correlation with the principal components. The length of the vector is proportional to the squared multiple correlation between the fitted values for the variable and the variable itself.
The function predict() can be used to predict values based on a model.
We use the test data obtained in Step 3, where the correct crime categories were saved as correct_classes and the categorical crime variable was removed from the test dataset.
lda.pred=predict(lda.fit,newdata=test)
table(correct=correct_classes,predicted=lda.pred$class) %>% addmargins
## predicted
## correct low med_low med_high high Sum
## low 12 6 1 0 19
## med_low 5 21 5 0 31
## med_high 0 7 21 1 29
## high 0 0 1 22 23
## Sum 17 34 28 23 102
From the cross table we can see that the model gives relatively good predictions of the crime rate, especially for the high-rate category. The correct prediction rates are shown below:
correct=c(16,23,13,26)
correct=c(correct,sum(correct))
sumdata=c(27,30,18,27)
sumdata=c(sumdata,sum(sumdata))
index=c('low','med_low','med_high','high','sum')
t=data.frame(correct=index,predict=correct,sum=sumdata)
t$percentage=t$predict/t$sum
t
## correct predict sum percentage
## 1 low 16 27 0.5925926
## 2 med_low 23 30 0.7666667
## 3 med_high 13 18 0.7222222
## 4 high 26 27 0.9629630
## 5 sum 78 102 0.7647059
From the table we can see that the overall correct prediction percentage is 76.5% and that the percentage increases with the crime rate. The model is most suitable for predicting high crime rate areas.
Distance measurements
Similarity and dissimilarity of objects can be measured with distance measures. There are many different measures for different types of data. Here we compute the Euclidean and Manhattan distances for the data.
data('Boston')
boston_scaled=scale(Boston)
dist_eu=dist(boston_scaled,method='euclidean')
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
dist_man=dist(boston_scaled,method='manhattan')
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
The calculations of these two distances are:
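\(d_{Euclidean}(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}\)
\(d_{Manhattan}(x, y) = \sum_{i=1}^{n}|x_i - y_i|\)
where \(x\) and \(y\) are two observations with \(n\) (scaled) variables each.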
K-means
K-means is maybe the most used and best-known clustering method. It is an unsupervised method that assigns observations to groups or clusters based on the similarity of the objects.
Three clusters are tried first.
km=kmeans(boston_scaled,centers = 3)
pairs(boston_scaled,col=km$cluster)
Find the optimal number of clusters
One way to determine the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves when the number of clusters changes. When you plot the number of clusters against the total WCSS, the optimal number of clusters is where the total WCSS drops radically.
set.seed(123)
k_max=10
twcss=sapply(1:k_max,function(k){kmeans(boston_scaled,k)$tot.withinss})
qplot(x=1:k_max,y=twcss,geom='line')
According to the line figure shown above, the most radical drop happens when k is 2, so the optimal number of clusters is 2.
km=kmeans(boston_scaled,centers=2)
pairs(boston_scaled,col=km$cluster)
From the figure we can see that, for most of the variables, the k-means method separates the observations into the two clusters quite well.
The method of calculating K-means (a minimal R sketch of one update step is shown after this list):
1. Choose the number of clusters and initial centroids.
2. Calculate distances between centroids and data points.
3. For all the data points: assign data point to clusters based on which centroid is closest.
4. Update centroids: within each cluster, calculate new centroid.
5. Update cluster: calculate distances between data points and updated centroids. If some other centroid is closer than the cluster centroid where the data point belongs, the data point changes cluster.
Continue updating steps until the centroids or the clusters do not change.
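For illustration only, a hedged sketch of one assignment/update iteration on the scaled Boston data (this is not the implementation used by kmeans(), and the object names are my own):
set.seed(123)
k <- 2
X <- as.matrix(boston_scaled)
centers <- X[sample(nrow(X), k), ]          # step 1: pick k random initial centroids
# step 2: squared Euclidean distances from every point to every centroid
d <- sapply(1:k, function(j) rowSums((X - matrix(centers[j, ], nrow(X), ncol(X), byrow = TRUE))^2))
cluster <- apply(d, 1, which.min)           # step 3: assign each point to its nearest centroid
# step 4: recompute each centroid as the mean of its cluster
centers <- t(sapply(1:k, function(j) colMeans(X[cluster == j, , drop = FALSE])))
# step 5: repeat steps 2-4 until the assignments no longer change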
K-means is a quick and simple method for cluster analysis. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
However, the choice of the number of clusters k is not obvious. In this example we used a simple way to find the optimal k value, but we must remember that an inappropriate choice of k may yield poor results.
The choice of the initial cluster centres also has a great impact on the clustering result: if the initial values are not chosen well, we may not obtain a valid clustering. The K-means method assigns the initial cluster centres randomly, and this is a major issue of K-means algorithms. We used set.seed() to make the random initialization reproducible, but usually the initial centres need to be chosen carefully.
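One common way to reduce the sensitivity to the random initialization (my own suggestion, not part of the exercise) is to let kmeans() try several random starts and keep the best solution via its nstart argument:
# run K-means 25 times from different random initial centroids and keep the
# solution with the smallest total within-cluster sum of squares
km_best <- kmeans(boston_scaled, centers = 2, nstart = 25)
km_best$tot.withinss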
Created on 01.12.2017
@author: Xiaodong Li
This is the script for RStudio exercise 5 – dimensionality reduction techniques
library(GGally)
library(corrplot)
human = read.table('/home/xiaodong/IODS_course/IODS-project/data/human.txt',sep='\t',header = TRUE)
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155 8
The human data describes the Human Development Index (HDI) and Gross National Income (GNI) situation in different countries, together with education, labour and health measures. The goal is to show that people and their capabilities should be the ultimate criteria for assessing the development of a country, not economic growth alone. The meanings of the different variables are shown below:
Edu2.FM: the ratio of female to male populations with secondary education
Labo.FM: the ratio of female to male labour force participation
Life.Exp: life expectancy at birth
Edu.Exp: expected years of education
GNI: gross national income per capita
Mat.Mor: maternal mortality ratio
Ado.Birth: adolescent birth rate
Parli.F: percentage of female representatives in parliament
Visualize the human data with ggpairs()
ggpairs(human)
Visualize the human data with correlation plots
cor_matrix=cor(human)
corrplot(cor_matrix,type='upper')
The scatter plots and correlation plots between all the variables shown above give information about the relationships and distributions of the data. From the plots we can see that Life.Exp has a strong positive relationship with Edu.Exp and very strong negative relationships with Mat.Mor and Ado.Birth. Edu2.FM has a strong negative relationship with Mat.Mor too.
Perform Principal Component Analysis (PCA) with the SVD method
pca_human=prcomp(human)
s=summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
pca_pr=round(100*s$importance[2,],digits=1)
pc_lab=paste0(names(pca_pr),'(',pca_pr,'%)')
biplot(pca_human,choices=1:2,cex=c(0.8,1),xlab=pc_lab[1],ylab=pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
The biplot of the non-standardized data is not informative because the variables have different scales and are not comparable.
human_std=scale(human)
pca_human=prcomp(human_std)
s=summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
pca_pr=round(100*s$importance[2,],digits=1)
pc_lab=paste0(names(pca_pr),'(',pca_pr,'%)')
biplot(pca_human,choices = 1:2,cex=c(0.8,1),xlab=pc_lab[1],ylab=pc_lab[2])
The biplots obtained from Step 3 and Step 4 give different PCA results. For the non-standardized data, the GNI variable has a very large variance and almost all of the variation is captured by the first principal component (PC1). For the standardized data, all the variables have similar variances and the features of the original data are spread over several principal components. The different distributions of the data in these two situations are shown clearly in the two biplots.
The biplot is a way of visualizing two representations of the same data. The biplot displays:
The observations in a lower, 2-dimensional representation: a scatter plot is drawn where the observations are placed on X and Y coordinates defined by two principal components.
The angle between arrows representing the original features can be interpreted as the correlation between the features. A small angle = a high positive correlation.
The angle between a feature and a principal component axis can be interpreted as the correlation between the two. A small angle = a high positive correlation.
Mat.Mor and Ado.Birth have a high positive relationship with each other.
Edu.Exp, Edu2.FM, GNI and Life.Exp have high positive relationships with each other.
These two groups have a high negative relationship with each other.
The standard deviations of the observations are quite similar because of the scaling process.
Load the tea dataset
library(FactoMineR)
library(ggplot2)
library(dplyr)
library(tidyr)
data('tea')
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
The tea dataset contains the answers to a questionnaire on tea consumption for 300 individuals. Although the data contains 36 columns, for demonstration purposes I will only consider the following 6 columns:
keep_columns=c('Tea','How','how','sugar','where','lunch')
tea_time=dplyr::select(tea,one_of(keep_columns))
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
gather(tea_time) %>% ggplot(aes(value))+geom_bar()+facet_wrap('key',scales='free')+theme(axis.text=element_text(angle=45,hjust = 1,size=8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
Perform MCA on the selected tea dataset, tea_time
mca=MCA(tea_time,graph=FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
plot(mca,invisible=c('ind'),habillage='quali')
Multiple Correspondence Analysis (MCA) is an extension of simple correspondence analysis for analysing a data table that contains more than two categorical variables. MCA is generally used to analyse data from surveys.
The objectives are to identify:
A group of individuals with similar profiles in their answers to the questions
The associations between variable categories
The result of the function summary() contains 4 tables:
Table 1 - Eigenvalues: contains the variances and the percentages of variance retained by each dimension.
Table 2 - Individuals: contains the coordinates, the contribution and the cos2 (quality of representation) of the first 10 individuals on the dimensions.
Table 3 - Categories: contains the coordinates, the contribution and the cos2 (quality of representation, in [0,1]) of the first 10 active variable categories on the dimensions. This table also contains a column called v.test. The value of the v.test is generally between -2 and 2; for a given variable category, if the absolute value of the v.test is larger than 2, the coordinate is significantly different from 0.
Table 4 - Categorical variables (eta2): contains the squared correlation between each variable and the dimensions.
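These tables can also be accessed directly from the MCA result object; a small sketch (the component names follow FactoMineR's documented result structure):
mca$eig                # eigenvalues, percentages of variance, cumulative percentages
head(mca$var$coord)    # coordinates of the variable categories
head(mca$var$contrib)  # contributions of the categories to the dimensions
head(mca$var$cos2)     # quality of representation of the categories
mca$var$eta2           # squared correlations between the variables and the dimensions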
The MCA graph shows a global pattern within the data. Rows (individuals) are usually represented by blue points and columns (variable categories) by red triangles.
The distance between any row points or column points gives a measure of their similarity (or dissimilarity).
Row points with similar profiles are close to each other on the factor map. The same holds true for column points.
For our results shown above, we can conclude that the categories tea shop and unpackaged are the most correlated with dimension 1. Similarly, the categories other and chain store+tea shop are the most correlated with dimension 2.