mixtCompLearn requires three R objects: algo (a list), data (a list, a data.frame or a matrix) and model (a list).
Data must be in one of the following formats. Each variable must be named.
data <- list(
varName1 = c(elem11, elem12, elem13, elem14),
varName2 = c(elem21, elem22, elem23, elem24),
varName3 = c(elem31, elem32, elem33, elem34)
)
data <- data.frame(
varName1 = c(elem11, elem12, elem13, elem14),
varName2 = c(elem21, elem22, elem23, elem24),
varName3 = c(elem31, elem32, elem33, elem34)
)
data <- matrix(c(elem11, elem12, elem13, elem14,
elem21, elem22, elem23, elem24,
elem31, elem32, elem33, elem34),
ncol = 3, dimnames = list(NULL, c("varName1", "varName2", "varName3")))
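As a concrete illustration of the three layouts, the sketch below builds the same toy dataset as a list, a data.frame and a matrix, then checks that all three carry the same named variables. The variable names and values are invented for this example, and the entries are stored as strings as in the document's other examples.

```r
# Toy values, invented for illustration only.
asList <- list(
  height = c("1.72", "1.65", "1.80", "1.58"),
  visits = c("3", "0", "2", "5")
)

asDataFrame <- data.frame(
  height = c("1.72", "1.65", "1.80", "1.58"),
  visits = c("3", "0", "2", "5"),
  stringsAsFactors = FALSE
)

asMatrix <- matrix(
  c("1.72", "1.65", "1.80", "1.58",
    "3", "0", "2", "5"),
  ncol = 2, dimnames = list(NULL, c("height", "visits"))
)

# Every variable is named in all three layouts.
sameNames <- identical(names(asList), names(asDataFrame)) &&
  identical(names(asList), colnames(asMatrix))

# The matrix is filled column by column, so each column holds one variable.
sameValues <- identical(asList$height, asMatrix[, "height"]) &&
  identical(asList$visits, asDataFrame$visits)
```

Because matrix() fills by column, the values of one variable must be listed contiguously, as in the matrix template above.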
model is a list describing the variables used for clustering and the distribution assigned to each of them. Each element corresponds to a variable and contains two elements: the model used (type) and the hyperparameters of the model, if any (paramStr). When a model has no hyperparameters, the user can provide just the model name instead of a list.
Example:
model <- list(varName1 = list(type = "Model1", paramStr = "param1"),
varName2 = "Model2",
varName3 = list(type = "Model3", paramStr = ""))
The model object can contain fewer variables than the data object. Only variables listed in the model object are used for clustering.
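The two ways of writing a model entry, and the rule that the model may cover only part of the data, can be sketched with a small helper. normalizeModel is hypothetical and not part of RMixtComp; it also assumes that a bare model name is equivalent to an empty paramStr, which the example above suggests but does not state.

```r
# Hypothetical helper (not part of RMixtComp): expand the shorthand form
# (a bare model name) into the explicit list(type, paramStr) form and check
# that every model entry refers to a variable of the data.
normalizeModel <- function(model, dataNames) {
  unknown <- setdiff(names(model), dataNames)
  if (length(unknown) > 0) {
    stop("model refers to variables absent from data: ",
         paste(unknown, collapse = ", "))
  }
  lapply(model, function(entry) {
    if (is.character(entry)) {
      # Assumption: no hyperparameters means an empty paramStr.
      list(type = entry, paramStr = "")
    } else {
      entry
    }
  })
}

model <- list(varName1 = list(type = "Model1", paramStr = "param1"),
              varName2 = "Model2")
dataNames <- c("varName1", "varName2", "varName3")

normalized <- normalizeModel(model, dataNames)
# varName3 is in the data but not in the model, so it is not used.
usedVariables <- names(normalized)
```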
algo is a list containing the required parameters of the SEM algorithm. The user can add extra elements; they are copied to the output object.
Example:
algo <- list(nbBurnInIter = 100,
nbIter = 100,
nbGibbsBurnInIter = 100,
nbGibbsIter = 100,
nInitPerClass = 2,
nSemTry = 10,
confidenceLevel = 0.95,
ratioStableCriterion = 0.9,
nStableCriterion = 7,
notes = "You can add any note you wish in non mandatory fields like this one (notes). They will be copied to the output.")
To easily create this list, the function createAlgo can be used. It creates the desired list with default values.
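The idea of filling in defaults while keeping user-supplied extras can be sketched in base R. This is only a stand-in to show the mechanism: makeAlgo and exampleDefaults are hypothetical names, and the values are copied from the example list above, not taken from createAlgo, whose actual defaults are not listed here.

```r
# Values copied from the example algo list above; they are NOT claimed to be
# the defaults of createAlgo.
exampleDefaults <- list(nbBurnInIter = 100, nbIter = 100,
                        nbGibbsBurnInIter = 100, nbGibbsIter = 100,
                        nInitPerClass = 2, nSemTry = 10,
                        confidenceLevel = 0.95,
                        ratioStableCriterion = 0.9, nStableCriterion = 7)

# Hypothetical helper: start from the defaults, override the fields given by
# the caller, and append any extra field (such as notes).
makeAlgo <- function(...) {
  utils::modifyList(exampleDefaults, list(...))
}

algo <- makeAlgo(nbIter = 200, notes = "short test run")

# All required fields are still present after the override.
missingFields <- setdiff(names(exampleDefaults), names(algo))
```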
Eight models are available in RMixtComp:
Available models | Data type | Restrictions | Hyperparameters |
---|---|---|---|
Gaussian | Real | | |
Weibull | Real | >= 0 | |
Poisson | Integer | >= 0 | |
NegativeBinomial | Integer | >= 0 | |
Multinomial | Categorical | | |
Rank_ISR | Rank | | |
Func_CS | Functional | | yes |
Func_SharedAlpha_CS | Functional | | yes |
The Gaussian model is for real data. For a class k, the parameters are the mean (μk) and the standard deviation (σk). The density function is defined by:
$$ f_k(x) = \frac{1}{\sqrt{2\pi\sigma_k^2}}\exp{\left(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right)} $$
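The Gaussian density can be checked numerically. The sketch below writes out the density of a normal distribution with mean mu and standard deviation sigma by hand and compares it with base R's dnorm; the parameter values are arbitrary illustrations.

```r
# Density of the Gaussian model for one class, written out explicitly.
gaussianDensity <- function(x, mu, sigma) {
  exp(-(x - mu)^2 / (2 * sigma^2)) / sqrt(2 * pi * sigma^2)
}

x <- c(-1.5, 0, 0.3, 2.1)
mu <- 0.5     # illustrative class mean
sigma <- 1.2  # illustrative class standard deviation

# Largest absolute difference with the reference implementation.
maxGap <- max(abs(gaussianDensity(x, mu, sigma) - dnorm(x, mean = mu, sd = sigma)))
```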
The Weibull model is for positive real data (usually lifetimes). For a class j, the parameters are the shape kj and the scale λj. The density function is defined by:
$$ f_j(x) = \frac{k_j}{\lambda_j} \left(\frac{x}{\lambda_j}\right)^{k_j-1} \exp{\left(-\left(\frac{x}{\lambda_j}\right)^{k_j}\right)} $$
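As a quick check of the shape/scale parametrization, the sketch below evaluates this Weibull density by hand and compares it with base R's dweibull, which uses the same shape and scale arguments. The parameter values are arbitrary illustrations.

```r
# Weibull density for one class with shape k and scale lambda.
weibullDensity <- function(x, shape, scale) {
  (shape / scale) * (x / scale)^(shape - 1) * exp(-(x / scale)^shape)
}

x <- c(0.2, 1, 2.5, 7)
shape <- 1.5  # illustrative shape k_j
scale <- 2    # illustrative scale lambda_j

# Largest absolute difference with the reference implementation.
maxGap <- max(abs(weibullDensity(x, shape, scale) -
                    dweibull(x, shape = shape, scale = scale)))
```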
The Poisson model is for non-negative integer data. For a class k, the single parameter λk is both the mean and the variance. The probability mass function is defined by:
$$ f_k(x) = \frac{\lambda_k^x}{x!}\exp{(-\lambda_k)} $$
The NegativeBinomial model is for non-negative integer data. For a class k, the parameters are the number of successes (nk) and the probability of success (pk). The probability mass function is defined by:
$$ f_k(x) = \frac{\Gamma(x+n_k)}{x! \Gamma(n_k)} p_k^{n_k}(1-p_k)^x $$
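Both counting models can be checked against base R. The sketch below writes the Poisson mass function with mean lambda and the negative binomial mass function with parameters n and p by hand, then compares them with dpois and dnbinom (whose size and prob arguments play the roles of n and p). The parameter values are arbitrary illustrations.

```r
# Poisson probability mass function with mean (and variance) lambda.
poissonMass <- function(x, lambda) {
  lambda^x / factorial(x) * exp(-lambda)
}

# Negative binomial probability mass function; n need not be an integer.
negBinomialMass <- function(x, n, p) {
  gamma(x + n) / (factorial(x) * gamma(n)) * p^n * (1 - p)^x
}

x <- 0:10
lambda <- 3.2  # illustrative Poisson parameter
n <- 2.5       # illustrative number of successes
p <- 0.4       # illustrative probability of success

gapPoisson <- max(abs(poissonMass(x, lambda) - dpois(x, lambda)))
gapNegBinomial <- max(abs(negBinomialMass(x, n, p) - dnbinom(x, size = n, prob = p)))
```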
The Multinomial model is for categorical data. For a class k, the model has M parameters pk,j, j = 1, ..., M, where M is the number of modalities and pk,j is the probability of belonging to modality j. The parameters must satisfy $\sum_{j=1}^M p_{k,j} = 1$.
The probability mass function is defined by:
$$ f_k(x) = \prod_{j=1}^M p_{k,j}^{a_j} \quad \text{with} \quad a_j = \begin{cases} 1 &\text{if } x = j \\ 0 &\text{otherwise} \end{cases} $$
The hyperparameter M does not need to be specified; it can be guessed from the data. If you want to specify it, add "nModality: M" in the appropriate field of the model object.
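The product with indicator exponents simply selects the probability of the observed modality. The sketch below illustrates this with invented probabilities for one class, and shows a model entry that sets the number of modalities through paramStr; the variable name varCat1 is only an example.

```r
# p[j] is the probability of modality j within one class (illustrative values).
p <- c(0.2, 0.5, 0.3)
M <- length(p)

# Mass function written as the product over modalities with indicator exponents.
multinomialMass <- function(x, p) {
  a <- as.integer(seq_along(p) == x)  # a_j = 1 if x == j, 0 otherwise
  prod(p^a)
}

# Evaluating the mass function at each modality recovers the probabilities.
masses <- vapply(seq_len(M), multinomialMass, numeric(1), p = p)

# Optional: state the number of modalities explicitly in the model object.
model <- list(varCat1 = list(type = "Multinomial",
                             paramStr = paste0("nModality: ", M)))
```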
The Rank_ISR model is for ranking data. For a class k, the two parameters are the central rank (μk) and the probability of making a wrong comparison (πk). See the article for more details. Ranks have their size M as a hyperparameter, but it does not need to be specified; it can be guessed from the data. If you want to specify it, add "nModality: M" in the appropriate field of the model object.
Gaussian data are real values written with a dot as the decimal separator. Missing data are indicated by a ?. Partial data can be provided through intervals denoted by [a:b], where a (resp. b) is a real or -inf (resp. +inf).
data <- list(
varGauss1 = c("2.1", "-0.26", "?", "[0.56:1.28]", "1.21", "[-inf:-0.11]", "[-1.65:+inf]")
)
data <- data.frame(
varGauss1 = c("2.1", "-0.26", "?", "[0.56:1.28]", "1.21", "[-inf:-0.11]", "[-1.65:+inf]")
)
data <- matrix(c("2.1", "-0.26", "?", "[0.56:1.28]", "1.21", "[-inf:-0.11]", "[-1.65:+inf]"), ncol = 1, dimnames = list(NULL, c("varGauss1")))
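To make the three kinds of entries explicit, the sketch below labels each string of a Gaussian variable as observed, missing or interval. classifyEntry is a hypothetical helper, not part of RMixtComp, and its regular expressions only cover plain decimal notation (no scientific notation).

```r
# Hypothetical helper (not part of RMixtComp): label one entry of a
# Gaussian variable as "observed", "missing", "interval" or "invalid".
classifyEntry <- function(entry) {
  number <- "[-+]?[0-9]+(\\.[0-9]+)?"
  lower <- paste0("(-inf|", number, ")")   # a real or -inf
  upper <- paste0("(\\+inf|", number, ")") # a real or +inf
  if (entry == "?") {
    "missing"
  } else if (grepl(paste0("^\\[", lower, ":", upper, "\\]$"), entry)) {
    "interval"
  } else if (grepl(paste0("^", number, "$"), entry)) {
    "observed"
  } else {
    "invalid"
  }
}

varGauss1 <- c("2.1", "-0.26", "?", "[0.56:1.28]", "1.21", "[-inf:-0.11]", "[-1.65:+inf]")
kinds <- vapply(varGauss1, classifyEntry, character(1), USE.NAMES = FALSE)
```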
Weibull data are non-negative real values written with a dot as the decimal separator. Missing data are indicated by a ?. Partial data can be provided through intervals denoted by [a:b], where a and b are non-negative reals (b can be +inf).
data <- list(
varWeib1 = c("2.1", "0.26", "?", "[0.56:1.28]", "1.21", "[0:5.11]", "[1.65:+inf]")
)
data <- data.frame(
varWeib1 = c("2.1", "0.26", "?", "[0.56:1.28]", "1.21", "[0:5.11]", "[1.65:+inf]")
)
data <- matrix(c("2.1", "0.26", "?", "[0.56:1.28]", "1.21", "[0:5.11]", "[1.65:+inf]"), ncol = 1, dimnames = list(NULL, c("varWeib1")))
Counting data are non-negative integers. Missing data are indicated by a ?. Partial data can be provided through intervals denoted by [a:b], where a and b are non-negative integers (b can be +inf).
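By analogy with the other data types, a counting variable could be written as follows in the three accepted layouts. The variable name varPoisson1 and its values are invented for this illustration, and the final check is a hypothetical validation of the syntax described above, not a function of RMixtComp.

```r
# Illustrative counting variable (name and values invented for this example).
data <- list(
  varPoisson1 = c("3", "0", "?", "[2:5]", "1", "[4:+inf]")
)
data <- data.frame(
  varPoisson1 = c("3", "0", "?", "[2:5]", "1", "[4:+inf]")
)
data <- matrix(c("3", "0", "?", "[2:5]", "1", "[4:+inf]"),
               ncol = 1, dimnames = list(NULL, c("varPoisson1")))

# Accepted entries: "?", a non-negative integer, or an interval [a:b]
# with integer bounds, where b may be +inf.
count <- "[0-9]+"
pattern <- paste0("^(\\?|", count, "|\\[", count, ":(", count, "|\\+inf)\\])$")
allValid <- all(grepl(pattern, data[, "varPoisson1"]))
```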
For categorical data, modalities must be consecutive integers with 1 as the minimal value. Missing data are indicated by a ?. For partial data, a list of possible values can be provided as {a_1,...,a_j}, where a_i denotes a modality.
Categorical data before formatting:
varCat1 | varCat2 |
---|---|
married | large |
single | small |
status unknown | medium |
divorced | small or medium |
divorced or single | large |
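The recoding from raw labels to integer codes can be sketched as follows. encodeCategorical is a hypothetical helper, not part of RMixtComp; it assumes that alternatives are written with " or " in the raw data and that the order of the levels vector defines the codes 1, ..., M.

```r
# Hypothetical helper (not part of RMixtComp): encode raw labels as integer
# codes 1..M, with "?" for unknown values and {a,b} for alternatives.
encodeCategorical <- function(raw, levels, missingLabel) {
  vapply(raw, function(value) {
    if (value == missingLabel) {
      return("?")
    }
    options <- strsplit(value, " or ", fixed = TRUE)[[1]]
    codes <- match(options, levels)
    if (anyNA(codes)) stop("unknown label in: ", value)
    codes <- sort(codes)
    if (length(codes) == 1) {
      as.character(codes)
    } else {
      paste0("{", paste(codes, collapse = ","), "}")
    }
  }, character(1), USE.NAMES = FALSE)
}

status <- c("married", "single", "status unknown", "divorced", "divorced or single")
size <- c("large", "small", "medium", "small or medium", "large")

codedStatus <- encodeCategorical(status, c("married", "single", "divorced"), "status unknown")
codedSize <- encodeCategorical(size, c("small", "medium", "large"), "size unknown")
```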
Categorical data after formatting:
data <- list(
varCat1 = c("1", "2", "?", "3", "{2,3}"),
varCat2 = c("3", "1", "2", "{1,2}", "3")
)
data <- data.frame(
varCat1 = c("1", "2", "?", "3", "{2,3}"),
varCat2 = c("3", "1", "2", "{1,2}", "3")
)
data <- matrix(c("1", "2", "?", "3", "{2,3}",
"3", "1", "2", "{1,2}", "3"), ncol = 2, dimnames = list(NULL, c("varCat1", "varCat2")))
The format of a rank is o_1,...,o_j, where o_1 is an integer corresponding to the number of the object ranked in first position. For example, 4,2,1,3 means that the fourth object is ranked first, the second object is in second position, and so on. Missing data can be specified by replacing an object by a ? or a list of potential objects. For example, 4, {2 3}, {2 1}, ? means that the object ranked in second position is either object 2 or object 3, the object ranked in third position is either object 2 or object 1, and the last one can be anything. A totally missing rank is specified by a sequence of ? separated by commas, e.g. ?,?,?,? for a totally missing rank of length 4.
data <- list(
varRank1 = c("1,2,3,4", "2,1,3,4", "?,?,?,?", "4,{2,3},{1,3},{1,2}", "2,{1,3},4,{1,3}")
)
data <- data.frame(
varRank1 = c("1,2,3,4", "2,1,3,4", "?,?,?,?", "4,{2,3},{1,3},{1,2}", "2,{1,3},4,{1,3}")
)
data <- matrix(c("1,2,3,4", "2,1,3,4", "?,?,?,?", "4,{2,3},{1,3},{1,2}", "2,{1,3},4,{1,3}"), ncol = 1, dimnames = list(NULL, c("varRank1")))
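The ordering convention (the i-th entry is the object ranked in position i) is easy to confuse with the inverse convention (the i-th entry is the position of object i). The sketch below parses a fully observed rank and converts between the two; parseRank and isCompleteRank are hypothetical helpers, not part of RMixtComp, and they do not handle partially observed ranks.

```r
# Hypothetical helpers (not part of RMixtComp), for fully observed ranks only.
parseRank <- function(rank) {
  # Entries that are not integers (e.g. "?") become NA.
  suppressWarnings(as.integer(strsplit(rank, ",", fixed = TRUE)[[1]]))
}

# A complete rank of size M must list each object 1..M exactly once.
isCompleteRank <- function(ordering) {
  !anyNA(ordering) && identical(sort(ordering), seq_along(ordering))
}

ordering <- parseRank("4,2,1,3")  # object 4 is first, object 2 is second, ...
positions <- order(ordering)      # positions[i] is the position of object i
```

Here positions[4] is 1 because object 4 is ranked first, and positions[1] is 3 because object 1 appears in third position.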
Missing data | Multinomial | Gaussian | Poisson | NegativeBinomial | Weibull | Rank_ISR | Func_CS | LatentClass |
---|---|---|---|---|---|---|---|---|
Completely missing | ? | ? | ? | ? | ? | ?,?,?,? | | ? |
Finite number of values | {a,b,c} | | | | | 4,{1 2},3,{1 2} | | {a,b,c} |
Bounded interval | | [a:b] | [a:b] | [a:b] | [a:b] | | | |
Right bounded interval | | [-inf:b] | | | | | | |
Left bounded interval | | [a:+inf] | [a:+inf] | [a:+inf] | [a:+inf] | | | |
To perform a (semi-)supervised clustering, the user can add a variable named z_class (possibly with some missing values) with "LatentClass" as its model. Missing data are indicated by a ?. For partial data, a list of possible values can be provided as {a_1,...,a_j}, where a_i denotes a class number.
data <- list(
varGauss1 = c("2.1", "-0.26", "?", "[0.56:1.28]", "1.21", "[-inf:-0.11]", "[-1.65:+inf]"),
z_class = c("1", "1", "{1,3}", "3", "?", "2", "1")
)
data <- data.frame(
varGauss1 = c("2.1", "-0.26", "?", "[0.56:1.28]", "1.21", "[-inf:-0.11]", "[-1.65:+inf]"),
z_class = c("1", "1", "{1,3}", "3", "?", "2", "1")
)
data <- matrix(c("2.1", "-0.26", "?", "[0.56:1.28]", "1.21", "[-inf:-0.11]", "[-1.65:+inf]",
"1", "1", "{1,3}", "3", "?", "2", "1"),
ncol = 2, dimnames = list(NULL, c("varGauss1", "z_class")))
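In practice, known labels often come as a vector with NA for unlabelled observations. The sketch below converts such a vector into the z_class format and pairs it with a matching model object; knownClass and its values are invented for this illustration, and the bare model names rely on the shorthand form described earlier for models without hyperparameters.

```r
# Partially known labels: NA means the class is unknown (illustrative values).
knownClass <- c(1L, 1L, NA, 3L, NA, 2L, 1L)

# Unknown classes are written "?", known ones as their class number.
z_class <- ifelse(is.na(knownClass), "?", as.character(knownClass))

data <- list(
  varGauss1 = c("2.1", "-0.26", "?", "[0.56:1.28]", "1.21", "[-inf:-0.11]", "[-1.65:+inf]"),
  z_class = z_class
)

# z_class is declared with the "LatentClass" model, like any other variable.
model <- list(varGauss1 = "Gaussian", z_class = "LatentClass")

# Number of observations whose class is given.
nLabelled <- sum(data$z_class != "?")
```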