Unsupervised classification is illustrated on the titanic
dataset. It is a data.frame with 1309 observations and 8 variables
containing information on the passengers of the Titanic. Each
observation represents a passenger described by a set of real
variables: age in years (`age`), ticket price in pounds (`fare`);
a set of counting variables: number of siblings/spouses aboard
(`sibsp`), number of parents/children aboard (`parch`); and a set of
categorical variables: `sex`, ticket class (`pclass`), port of
embarkation (`embarked`) and a binary variable indicating whether the
passenger survived (`survived`). Furthermore, the dataset contains
missing values for three variables: `age`, `fare` and `embarked`.
## pclass survived sex age sibsp parch fare embarked
## 1 1st 1 female 29.0 0 0 211.3375 S
## 16 1st 0 male NA 0 0 25.9250 S
## 38 1st 1 male NA 0 0 26.5500 S
## 169 1st 1 female 38.0 0 0 80.0000 <NA>
## 285 1st 1 female 62.0 0 0 80.0000 <NA>
## 1226 3rd 0 male 60.5 0 0 NA S
First, the dataset must be converted to the MixtComp format.
Categorical variables must be numbered from 1 to the number of
categories (e.g. 3 for `embarked`). This can be done with the
`refactorCategorical` function, which takes as arguments
the vector containing the data, the old labels and the new labels.
Totally missing values must be indicated with a `?`.
titanicMC <- titanic
titanicMC$sex <- refactorCategorical(titanic$sex, c("male", "female"), c(1, 2))
titanicMC$pclass <- refactorCategorical(titanic$pclass, c("1st", "2nd", "3rd"), c(1, 2, 3))
titanicMC$embarked <- refactorCategorical(titanic$embarked, c("C", "Q", "S"), c(1, 2, 3))
titanicMC$survived <- refactorCategorical(titanic$survived, c(0, 1), c(1, 2))
titanicMC[is.na(titanicMC)] = "?"
head(titanicMC)
## pclass survived sex age sibsp parch fare embarked
## 1 1 2 2 29 0 0 211.3375 3
## 2 1 2 1 0.9167 1 2 151.55 3
## 3 1 1 2 2 1 2 151.55 3
## 4 1 1 1 30 1 2 151.55 3
## 5 1 1 2 25 1 2 151.55 3
## 6 1 2 1 48 0 0 26.55 3
The dataset is split into two parts to illustrate learning and prediction.
indTrain <- sample(nrow(titanicMC), floor(0.8 * nrow(titanicMC)))
titanicMCTrain <- titanicMC[indTrain, ]
titanicMCTest <- titanicMC[-indTrain, ]
Then, as all variables are stored as character in a data.frame, a
`model` object indicating which model to use for each
variable is created. In this example, a Gaussian model is used for the
`age` and `fare` variables, a multinomial model for
`sex`, `pclass`, `embarked` and `survived`, and a Poisson model for
`sibsp` and `parch`.
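Such a `model` object can be sketched as a named list mapping each variable to its model. The variable/model pairing follows the text above; the exact model labels (`"Gaussian"`, `"Multinomial"`, `"Poisson"`) are assumptions based on the usual MixtComp naming and may differ across versions:

```r
# Hypothetical model specification: one entry per variable.
# The model names are assumed from standard MixtComp conventions.
model <- list(pclass   = "Multinomial",
              survived = "Multinomial",
              sex      = "Multinomial",
              embarked = "Multinomial",
              age      = "Gaussian",
              fare     = "Gaussian",
              sibsp    = "Poisson",
              parch    = "Poisson")
```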
We choose to run the clustering analysis for 1 to 20 clusters, with 3
runs for each number of clusters. These runs can be parallelized using
the `nCore` parameter.
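A run along these lines could look as follows; `mixtCompLearn` is the RMixtComp learning entry point, but the argument names shown here are assumptions and should be checked against your installed version:

```r
library(RMixtComp)

# Learn mixture models for 1 to 20 classes, 3 runs each.
# nCore parallelizes over the runs; argument names are assumed
# from the RMixtComp documentation.
res <- mixtCompLearn(titanicMCTrain, model = model, nClass = 1:20,
                     criterion = "BIC", nRun = 3, nCore = 2)
```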
The `summary` and `plot` functions give an overview of the results for
the best number of classes according to the chosen criterion (BIC or
ICL). If this number is not the one desired by the user, it can be
changed via the `nClass` parameter.
The `summary` function displays the chosen number of clusters and
some outputs such as the discriminative power, which indicates the
variables contributing most to class separation, and the parameters
associated with the 3 most discriminant variables.
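The output below can be obtained with a call of the following form, assuming the learned object is named `res`:

```r
summary(res)
```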
## ############### MixtCompLearn Run ###############
## nClass: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## Criterion used: BIC
## 1 2 3 4 5 6 7
## BIC -14367.3 -12796.45 -12020.33 -11790.60 -11649.78 -11563.59 -11545.75
## ICL -14367.3 -12822.38 -12039.47 -11850.01 -11709.50 -11627.29 -11601.06
## 8 9 10 11 12 13 14
## BIC -11530.97 -11515.99 -11482.97 -11369.2 -11499.51 -11403.79 -11454.61
## ICL -11600.21 -11594.53 -11560.34 -11477.0 -11577.40 -11526.33 -11599.09
## 15 16 17 18 19 20
## BIC -11464.25 -11568.10 -11605.71 -11539.52 -11533.10 -11602.98
## ICL -11599.09 -11699.51 -11703.22 -11665.41 -11651.53 -11796.56
## Best model: 11 clusters
## ########### MixtComp Run ###########
## Number of individuals: 1047
## Number of variables: 8
## Number of clusters: 11
## Mode: learn
## Time: 0.289 s
## SEM burn-in iterations done: 50/50
## SEM run iterations done: 50/50
## Observed log-likelihood: -10875.48
## BIC: -11369.2
## ICL: -11477
## Discriminative power:
## fare pclass parch sibsp survived age sex embarked
## 0.613 0.424 0.174 0.174 0.158 0.138 0.137 0.135
## Proportions of the mixture:
## 0.08 0.061 0.037 0.047 0.086 0.089 0.144 0.05 0.206 0.054 0.147
## Parameters of the most discriminant variables:
## - fare: Gaussian
## mean sd
## k: 1 73.661 28.324
## k: 2 27.606 7.080
## k: 3 110.054 62.959
## k: 4 37.644 15.213
## k: 5 30.134 19.849
## k: 6 12.502 1.390
## k: 7 7.829 0.110
## k: 8 193.659 115.378
## k: 9 7.842 0.731
## k: 10 27.731 1.826
## k: 11 15.756 4.631
## - pclass: Multinomial
## modality 1 modality 2 modality 3
## k: 1 1.000 0.000 0.000
## k: 2 0.000 1.000 0.000
## k: 3 1.000 0.000 0.000
## k: 4 0.000 0.000 1.000
## k: 5 0.343 0.549 0.108
## k: 6 0.000 1.000 0.000
## k: 7 0.000 0.000 1.000
## k: 8 0.941 0.059 0.000
## k: 9 0.000 0.000 1.000
## k: 10 1.000 0.000 0.000
## k: 11 0.000 0.070 0.930
## - parch: Poisson
## lambda
## k: 1 0.222
## k: 2 0.928
## k: 3 0.500
## k: 4 2.408
## k: 5 0.126
## k: 6 0.000
## k: 7 0.000
## k: 8 1.180
## k: 9 0.000
## k: 10 0.000
## k: 11 0.645
## ####################################
The `plot` function displays the values of the criteria, the
discriminative power of the variables and the parameters of the three
most discriminant variables. More variables can be displayed using the
`nVarMaxToPlot` parameter.
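The list of plots below can be produced by a call like the following (assuming the learned object is named `res`; the `nVarMaxToPlot` argument is described in the text above and its default of 3 is an assumption):

```r
# Plot criteria, discriminative power and parameters of the
# three most discriminant variables.
plot(res, nVarMaxToPlot = 3)
```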
## $criteria
##
## $discrimPowerVar
##
## $proportion
##
## $fare
##
## $pclass
##
## $parch
The most discriminant variables for clustering are `fare`
and `pclass`. The similarity between variables is shown with
the following code:
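One way to obtain such a matrix is sketched below; the function name `computeSimilarityVar` is assumed from the RMixtCompUtilities helpers and may differ in your version:

```r
# Similarity between variables of the best model, rounded for display.
# computeSimilarityVar is assumed to return the variable similarity matrix.
round(computeSimilarityVar(res), 2)
```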
## fare age pclass survived sex embarked sibsp parch
## fare 1.00 0.36 0.41 0.36 0.36 0.38 0.37 0.36
## age 0.36 1.00 0.59 0.73 0.74 0.72 0.71 0.73
## pclass 0.41 0.59 1.00 0.58 0.57 0.58 0.54 0.54
## survived 0.36 0.73 0.58 1.00 0.82 0.73 0.70 0.72
## sex 0.36 0.74 0.57 0.82 1.00 0.74 0.71 0.73
## embarked 0.38 0.72 0.58 0.73 0.74 1.00 0.69 0.70
## sibsp 0.37 0.71 0.54 0.70 0.71 0.69 1.00 0.72
## parch 0.36 0.73 0.54 0.72 0.73 0.70 0.72 1.00
The greatest similarity is between `survived` and `sex`; this relation
is well known in the dataset, a much greater proportion of women having
survived compared to men. On the contrary, there is little similarity
between `fare` and the other variables.
Getters are available to easily access some results:
`getBIC`, `getICL`, `getCompletedData`, `getParam`, `getProportion`,
`getTik`, `getPartition`, … All these functions use the model
maximizing the chosen criterion. If results for another number of
classes are desired, the `extractMixtCompObject` function can be used.
For example:
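The two outputs below can be obtained as follows; the exact signature of `extractMixtCompObject` is an assumption:

```r
# Proportions of the mixture for the best model (11 clusters here).
getProportion(res)

# Proportions for a 2-class model, extracted from the same run
# (assumed signature: extractMixtCompObject(object, nClass)).
res2 <- extractMixtCompObject(res, nClass = 2)
getProportion(res2)
```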
## k: 1 k: 2 k: 3 k: 4 k: 5 k: 6 k: 7
## 0.07950192 0.06130268 0.03735632 0.04693487 0.08620690 0.08908046 0.14367816
## k: 8 k: 9 k: 10 k: 11
## 0.04980843 0.20593870 0.05363985 0.14655172
## k: 1 k: 2
## 0.5047801 0.4952199
Once a model is learnt, one can use it to predict the clusters of new individuals.
The probabilities of belonging to the different classes (here on the log scale) and the associated partition are given by:
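A prediction step along these lines could produce the output below; the function and argument names are assumptions based on the RMixtComp documentation, and `getTik` is assumed to return log-probabilities by default, which would match the `-Inf` values shown:

```r
# Predict cluster membership for the held-out passengers using
# the learned model (argument names are assumed).
resPred <- mixtCompPredict(titanicMCTest, model = model,
                           resLearn = res, nClass = 11)

head(getTik(resPred))       # log-probabilities of class membership
head(getPartition(resPred)) # estimated partition
```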
## [,1] [,2] [,3] [,4] [,5]
## [1,] -Inf -Inf -Inf -Inf 0.000000
## [2,] -Inf -Inf -Inf -Inf 0.000000
## [3,] -0.15226406 -Inf -Inf -Inf -1.957305
## [4,] -55.62446720 -Inf -Inf -Inf 0.000000
## [5,] -Inf -Inf -Inf -Inf 0.000000
## [6,] -0.08527447 -Inf -Inf -Inf -2.504214
## [1] 5 5 1 5 5 1