Title: | Multi-Layer Group-Lasso |
---|---|
Description: | Implements a new variable selection procedure designed for settings with redundancy between explanatory variables, which remains applicable to high-dimensional data (Grimonprez et al. (2023) <doi:10.18637/jss.v106.i03>). |
Authors: | Quentin Grimonprez [aut, cre], Samuel Blanck [ctb], Alain Celisse [ths], Guillemette Marot [ths], Yi Yang [ctb], Hui Zou [ctb] |
Maintainer: | Quentin Grimonprez <[email protected]> |
License: | GPL (>=2) |
Version: | 1.0.0 |
Built: | 2024-11-04 03:33:38 UTC |
Source: | https://github.com/modal-inria/mlgl |
This package provides a method combining hierarchical clustering and the group-lasso. Usually, a single partition of the covariates is used in the group-lasso; here, several partitions taken from the hierarchical tree are used simultaneously.
A post-treatment method based on statistical tests (with FWER or FDR control) is provided for selecting the regularization parameter and the optimal groups for that value. This post-treatment can be applied to the classical group-lasso as well as to our method.
The MLGL function performs the hierarchical clustering and the group-lasso. The post-treatment can be performed with the hierarchicalFWER and selFWER functions. The whole pipeline can be run with the fullProcess function; a minimal sketch is given below.
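A minimal end-to-end sketch of that workflow on simulated data (the design, effect sizes and the lambda index k are illustrative; the function calls follow the examples in this manual):

library(MLGL)

set.seed(1)
# Simulated design: 10 blocks of 5 correlated covariates
X <- simuBlockGaussian(n = 60, nBlock = 10, sizeBlock = 5, rho = 0.7)
y <- X[, c(3, 8)] %*% c(2, -2) + rnorm(60, 0, 0.5)

# Group-lasso on all the partitions of the hierarchical tree
res <- MLGL(X, y)

# Hierarchical testing with FWER control for one value of lambda,
# then group selection at level alpha = 0.05
k <- 20
test <- hierarchicalFWER(X, y, res$group[[k]], res$var[[k]])
sel <- selFWER(test, alpha = 0.05)

# Or run the whole pipeline (clustering + MLGL + hierarchical testing) at once
out <- fullProcess(X, y, control = "FWER", alpha = 0.05)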
Quentin Grimonprez
Grimonprez Q, Blanck S, Celisse A, Marot G (2023). "MLGL: An R Package Implementing Correlated Variable Selection by Hierarchical Clustering and Group-Lasso." Journal of Statistical Software, 106(3), 1-33. doi:10.18637/jss.v106.i03.
MLGL, cv.MLGL, fullProcess, hierarchicalFWER
# Simulate gaussian data with block-diagonal variance matrix containing 12 blocks of size 5
X <- simuBlockGaussian(50, 12, 5, 0.7)
# Generate a response variable
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
# Apply MLGL method
res <- MLGL(X, y)
Hierarchical clustering with a distance matrix computed using bootstrap replicates
bootstrapHclust(X, frac = 1, B = 50, method = "ward.D2", nCore = NULL)
- X: data
- frac: fraction of the sample used at each replicate
- B: number of replicates
- method: desired method: "single", "complete", "average", "mcquitty", "ward.D", "ward.D2", "centroid" or "median"
- nCore: number of cores
An object of class hclust
hc <- bootstrapHclust(USArrests, nCore = 1)
Get coefficients from a cv.MLGL object
## S3 method for class 'cv.MLGL'
coef(object, s = c("lambda.1se", "lambda.min"), ...)
- object: cv.MLGL object
- s: either "lambda.1se" or "lambda.min"
- ...: not used. Other arguments to predict
A matrix with estimated coefficients for given values of s.
Quentin Grimonprez
Get coefficients from a MLGL object
## S3 method for class 'MLGL'
coef(object, s = NULL, ...)
- object: MLGL object
- s: values of lambda. If NULL, the values from the object are used
- ...: not used. Other arguments to predict
A matrix with estimated coefficients for given values of s.
Quentin Grimonprez
Compute the group size weight vector with an authorized maximal size
computeGroupSizeWeight(hc, sizeMax = NULL)
- hc: output of hclust
- sizeMax: maximum size of clusters to consider
the weight vector
set.seed(42)
# Simulate gaussian data with block-diagonal variance matrix containing 12 blocks of size 5
X <- simuBlockGaussian(50, 12, 5, 0.7)
# Generate a response variable
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
# Use 20 as the maximal group size
hc <- hclust(dist(t(X)))
w <- computeGroupSizeWeight(hc, sizeMax = 20)
# Apply MLGL method
res <- MLGL(X, y, hc = hc, weightSizeGroup = w)
V-fold cross validation for the MLGL function
cv.MLGL(X, y, nfolds = 5, lambda = NULL, hc = NULL, weightLevel = NULL, weightSizeGroup = NULL, loss = c("ls", "logit"), intercept = TRUE, sizeMaxGroup = NULL, verbose = FALSE, ...)
- X: matrix of size n*p
- y: vector of size n. If loss = "logit", elements of y must be in {-1, 1}
- nfolds: number of folds
- lambda: lambda values for the group-lasso. If not provided, the function generates its own sequence of lambda values
- hc: output of hclust
- weightLevel: a vector of size p with a weight for each level of the hierarchy. A zero indicates that the level will be ignored. If not provided, 1/(height between 2 successive levels) is used
- weightSizeGroup: a vector of size 2*p-1 containing the weight of each group. Default is the square root of the size of each group
- loss: a character string specifying the loss function to use; valid options are "ls" (least squares loss, regression) and "logit" (logistic loss, classification)
- intercept: should an intercept be included in the model?
- sizeMaxGroup: maximum size of selected groups. If NULL, no restriction
- verbose: print some information
- ...: other parameters for MLGL
Hierarchical clustering is performed on all the variables. Then, the partitions from the different levels of the hierarchy are used in the different runs of MLGL for cross-validation.
a cv.MLGL object containing:
- lambda: values of lambda
- cvm: the mean cross-validated error
- cvsd: estimate of the standard error of cvm
- the upper curve: cvm + cvsd
- the lower curve: cvm - cvsd
- lambda.min: the value of lambda that gives the minimum cross-validation error cvm
- lambda.1se: the largest value of lambda such that the error is within 1 standard error of the minimum
- the computation time
Quentin Grimonprez
MLGL, stability.MLGL, predict.cv.gglasso, coef.cv.MLGL, plot.cv.MLGL
set.seed(42)
# Simulate gaussian data with block-diagonal variance matrix containing 12 blocks of size 5
X <- simuBlockGaussian(50, 12, 5, 0.7)
# Generate a response variable
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
# Apply cv.MLGL method
res <- cv.MLGL(X, y)
Perform an F-test
Ftest(X, y, varToTest)
- X: design matrix of size n*p
- y: response vector of length n
- varToTest: vector containing the indices of the columns of X to test
The underlying model is y = X * beta + epsilon.
Null hypothesis: beta[varToTest] = 0. Alternative hypothesis: there exists an index k in varToTest such that beta[k] != 0.
The test statistic is based on a full and a reduced model. Full model: y = X[, varToTest] * beta[varToTest] + epsilon. Reduced model: the null model.
a vector of the same length as varToTest containing the p-values of the test.
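A minimal usage sketch (the simulated data and the tested columns are illustrative):

set.seed(42)
X <- simuBlockGaussian(50, 12, 5, 0.7)
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
# p-values for the null hypothesis that the coefficients of columns 1 to 5 are all zero
pval <- Ftest(X, y, varToTest = 1:5)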
Run a hierarchical clustering followed by a group-lasso on all the different partitions, and then a hierarchical testing procedure. Only for linear regression problems.
fullProcess(X, ...)

## Default S3 method:
fullProcess(X, y, control = c("FWER", "FDR"), alpha = 0.05, test = partialFtest, hc = NULL, fractionSampleMLGL = 1/2, BHclust = 50, nCore = NULL, addRoot = FALSE, Shaffer = FALSE, ...)

## S3 method for class 'formula'
fullProcess(formula, data, control = c("FWER", "FDR"), alpha = 0.05, test = partialFtest, hc = NULL, fractionSampleMLGL = 1/2, BHclust = 50, nCore = NULL, addRoot = FALSE, Shaffer = FALSE, ...)
- X: matrix of size n*p
- ...: other parameters for MLGL
- y: vector of size n
- control: either "FDR" or "FWER"
- alpha: control level for the testing procedure
- test: test used in the testing procedure. Default is partialFtest
- hc: output of hclust
- fractionSampleMLGL: a real number between 0 and 1: the fraction of individuals used in the sample for MLGL (see Details)
- BHclust: number of bootstrap replicates for computing the distance matrix for the hierarchical clustering tree
- nCore: number of cores used for the distance computation. Uses all cores by default
- addRoot: if TRUE, add a common root containing all the groups
- Shaffer: if TRUE, a Shaffer correction is performed (only if control = "FWER")
- formula: an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted
- data: an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula)
The n individuals are divided into two samples. Then the following three steps are performed: 1) bootstrap hierarchical clustering of the variables of X; 2) MLGL on the second sample of individuals; 3) hierarchical testing procedure on the first sample of individuals.
a list containing:
- the output of the MLGL function
- lambdaOpt: the lambda values maximizing the number of rejections
- a vector containing the indices of the selected variables for the first lambdaOpt value
- a vector containing the indices of the selected groups for the first lambdaOpt value
- the selected groups for the first lambdaOpt value
- the selected groups for all lambda values
- the control level
- the test used in the testing procedure
- "FDR" or "FWER"
- the elapsed time
Quentin Grimonprez
MLGL, hierarchicalFDR, hierarchicalFWER, selFDR, selFWER
# least square loss
set.seed(42)
X <- simuBlockGaussian(50, 12, 5, 0.7)
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
res <- fullProcess(X, y)
Apply hierarchical test for each hierarchy, and test external variables for FDR control at level alpha
hierarchicalFDR(X, y, group, var, test = partialFtest, addRoot = FALSE)
- X: original data
- y: associated response
- group: vector with the group indices. group[i] contains the index of the group of the variable var[i]
- var: vector with the variables contained in each group. group[i] contains the index of the group of the variable var[i]
- test: function for testing the nullity of a group of coefficients in linear regression. The function has 3 arguments: X, y and varToTest (see partialFtest)
- addRoot: if TRUE, add a common root containing all the groups
Version of the hierarchical testing procedure of Yekutieli for MLGL output. You can use the selFDR function to select groups at a desired level alpha.
a list containing:
- the p-values of the different tests (without correction)
- the adjusted p-values
- the indices of the groups
- a matrix describing the hierarchical tree
Yekutieli, Daniel. "Hierarchical False Discovery Rate-Controlling Methodology." Journal of the American Statistical Association 103.481 (2008): 309-16.
set.seed(42)
X <- simuBlockGaussian(50, 12, 5, 0.7)
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
res <- MLGL(X, y)
test <- hierarchicalFDR(X, y, res$group[[20]], res$var[[20]])
Apply hierarchical test for each hierarchy, and test external variables for FWER control at level alpha
hierarchicalFWER(X, y, group, var, test = partialFtest, Shaffer = FALSE, addRoot = FALSE)
- X: original data
- y: associated response
- group: vector with the group indices. group[i] contains the index of the group of the variable var[i]
- var: vector with the variables contained in each group. group[i] contains the index of the group of the variable var[i]
- test: function for testing the nullity of a group of coefficients in linear regression. The function has 3 arguments: X, y and varToTest (see partialFtest)
- Shaffer: boolean; if TRUE, a Shaffer correction is performed
- addRoot: if TRUE, add a common root containing all the groups
Version of the hierarchical testing procedure of Meinshausen for MLGL output. You can use the selFWER function to select groups at a desired level alpha.
a list containing:
- the p-values of the different tests (without correction)
- the adjusted p-values
- the indices of the groups
- a matrix describing the hierarchical tree
Meinshausen, Nicolai. "Hierarchical Testing of Variable Importance." Biometrika 95.2 (2008): 265-78.
set.seed(42)
X <- simuBlockGaussian(50, 12, 5, 0.7)
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
res <- MLGL(X, y)
test <- hierarchicalFWER(X, y, res$group[[20]], res$var[[20]])
Apply the Hierarchical Multiple Testing procedure on a MLGL object
HMT(res, X, y, control = c("FWER", "FDR"), alpha = 0.05, test = partialFtest, addRoot = FALSE, Shaffer = FALSE, ...)
- res: MLGL object
- X: matrix of size n*p
- y: vector of size n
- control: either "FDR" or "FWER"
- alpha: control level for the testing procedure
- test: test used in the testing procedure. Default is partialFtest
- addRoot: if TRUE, add a common root containing all the groups
- Shaffer: if TRUE, a Shaffer correction is performed (only if control = "FWER")
- ...: extra parameters for selFDR
a list containing:
- lambdaOpt: the lambda values maximizing the number of rejections
- a vector containing the indices of the selected variables for the first lambdaOpt value
- a vector containing the indices of the selected groups for the first lambdaOpt value
- the selected groups for the first lambdaOpt value
- the indices associated with the optimal lambdas
- the selected groups for all lambda values
- the control level
- the test used in the testing procedure
- "FDR" or "FWER"
- the elapsed time
- hierTest: a list containing the output of the testing function for each lambda. Each element can be used with the selFWER or selFDR functions
- the lambda path
- the number of groups before testing
- the number of groups after testing
hierarchicalFWER, hierarchicalFDR, selFWER, selFDR
set.seed(42)
X <- simuBlockGaussian(50, 12, 5, 0.7)
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
res <- MLGL(X, y)
# perform hierarchical testing with FWER control
out <- HMT(res, X, y, alpha = 0.05)
# test a new value of alpha for a specific lambda
selFWER(out$hierTest[[60]], alpha = 0.1)
Obtain a sparse matrix of the coefficients of the path
listToMatrix(x, row = c("covariates", "lambda"))
- x: MLGL object
- row: "lambda" or "covariates". If row = "covariates", each row of the output matrix represents a covariate; if row = "lambda", each row represents a value of lambda
This function can be used with a MLGL object to obtain a matrix with all estimated coefficients for the p original variables. In case of overlapping groups, coefficients from repeated variables are summed.
a sparse matrix containing the estimated coefficients for different lambdas
# Simulate gaussian data with block-diagonal variance matrix containing 12 blocks of size 5
X <- simuBlockGaussian(50, 12, 5, 0.7)
# Generate a response variable
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
# Apply MLGL method
res <- MLGL(X, y)
# Convert output to sparse matrix format
beta <- listToMatrix(res)
Run a hierarchical clustering followed by a group-lasso on all the different partitions.
MLGL(X, ...)

## Default S3 method:
MLGL(X, y, hc = NULL, lambda = NULL, weightLevel = NULL, weightSizeGroup = NULL, intercept = TRUE, loss = c("ls", "logit"), sizeMaxGroup = NULL, verbose = FALSE, ...)

## S3 method for class 'formula'
MLGL(formula, data, hc = NULL, lambda = NULL, weightLevel = NULL, weightSizeGroup = NULL, intercept = TRUE, loss = c("ls", "logit"), verbose = FALSE, ...)
- X: matrix of size n*p
- ...: other parameters for the gglasso function
- y: vector of size n. If loss = "logit", elements of y must be in {-1, 1}
- hc: output of hclust
- lambda: lambda values for the group-lasso. If not provided, the function generates its own sequence of lambda values
- weightLevel: a vector of size p with a weight for each level of the hierarchy. A zero indicates that the level will be ignored. If not provided, 1/(height between 2 successive levels) is used. Only used if hc is provided
- weightSizeGroup: a vector of size 2*p-1 containing the weight of each group. Default is the square root of the size of each group. Only used if hc is provided
- intercept: should an intercept be included in the model?
- loss: a character string specifying the loss function to use; valid options are "ls" (least squares loss, regression) and "logit" (logistic loss, classification)
- sizeMaxGroup: maximum size of selected groups. If NULL, no restriction
- verbose: print some information
- formula: an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted
- data: an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula)
a MLGL object containing:
- the lambda values
- the intercept values for each lambda
- a list containing the values of the estimated coefficients for each value of lambda
- a list containing the indices of the selected variables for each value of lambda
- a list containing the indices of the selected groups for each value of lambda
- a vector containing the number of non-zero coefficients for each value of lambda
- a vector containing the number of non-zero groups for each value of lambda
- a list containing 4 vectors. var: all variables used. group: associated groups. weight: weight associated with the different groups. level: for each group, the corresponding level of the hierarchy where it appears and disappears (3 indicates the level with a partition of 3 groups)
- the computation time
- the dimension of X
- the output of the hierarchical clustering
- the call executed by the user
Quentin Grimonprez
cv.MLGL, stability.MLGL, listToMatrix, predict.MLGL, coef.MLGL, plot.cv.MLGL
set.seed(42)
# Simulate gaussian data with block-diagonal variance matrix containing 12 blocks of size 5
X <- simuBlockGaussian(50, 12, 5, 0.7)
# Generate a response variable
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
# Apply MLGL method
res <- MLGL(X, y)
Group-lasso with overlapping groups
overlapgglasso(X, y, var, group, lambda = NULL, weight = NULL, loss = c("ls", "logit"), intercept = TRUE, ...)
- X: matrix of size n*p
- y: vector of size n. If loss = "logit", elements of y must be in {-1, 1}
- var: vector containing the variables to use
- group: vector containing the associated groups
- lambda: lambda values for the group-lasso. If not provided, the function generates its own sequence of lambda values
- weight: a vector with the weight of each group. Default is the square root of the size of each group
- loss: a character string specifying the loss function to use; valid options are "ls" (least squares loss, regression) and "logit" (logistic loss, classification)
- intercept: should an intercept be included in the model?
- ...: other parameters for the gglasso function
A group-lasso algorithm (see gglasso) is used to solve a group-lasso with overlapping groups. Each variable j of the original matrix X is duplicated k(j) times in a new dataset, where k(j) is the number of different groups containing variable j. The group-lasso is then run on this expanded dataset, which yields a solution of the group-lasso with overlapping groups.
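A minimal sketch of this duplication idea (not the package's internal code; expandDesign and the toy var/group vectors are illustrative):

# var gives, for each column of the expanded design, the index of the original
# variable; group gives the group each expanded column belongs to.
expandDesign <- function(X, var) {
  Xnew <- X[, var, drop = FALSE]      # duplicate the columns of X according to var
  colnames(Xnew) <- paste0("V", var)  # keep track of the original variable indices
  Xnew
}

# Example: variables 1:10 in two groups of 5, plus an overlapping group {4, ..., 8}
var <- c(1:10, 4:8)
group <- c(rep(1:2, each = 5), rep(3, 5))
# A standard group-lasso on (expandDesign(X, var), group) then solves the
# overlapping problem on the original X; this is what overlapgglasso automates.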
a MLGL object containing:
- the lambda values
- the intercept values for each lambda
- a list containing the values of the estimated coefficients for each value of lambda
- a list containing the indices of the selected variables for each value of lambda
- a list containing the indices of the selected groups for each value of lambda
- a vector containing the number of non-zero coefficients for each value of lambda
- a vector containing the number of non-zero groups for each value of lambda
- a list containing 3 vectors. var: all variables used. group: associated groups. weight: weight associated with the different groups
- the computation time
- the dimension of X
Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. 2009. Group lasso with overlap and graph lasso. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09).
# Least square loss
set.seed(42)
X <- simuBlockGaussian(50, 12, 5, 0.7)
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
var <- c(1:60, 1:8, 7:15)
group <- c(rep(1:12, each = 5), rep(13, 8), rep(14, 9))
res <- overlapgglasso(X, y, var, group)

# Logistic loss
y <- 2 * (rowSums(X[, 1:4]) > 0) - 1
var <- c(1:60, 1:8, 7:15)
group <- c(rep(1:12, each = 5), rep(13, 8), rep(14, 9))
res <- overlapgglasso(X, y, var, group, loss = "logit")
Perform a partial F-test
partialFtest(X, y, varToTest)
- X: design matrix of size n*p
- y: response vector of length n
- varToTest: vector containing the indices of the columns of X to test
The underlying model is y = X * beta + epsilon.
Null hypothesis: beta[varToTest] = 0. Alternative hypothesis: there exists an index k in varToTest such that beta[k] != 0.
The test statistic is based on a full and a reduced model. Full model: y = X * beta + epsilon. Reduced model: y = X[, -varToTest] * beta[-varToTest] + epsilon.
a vector of the same length as varToTest containing the p-values of the test.
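A minimal usage sketch (illustrative data; partialFtest is the default test of the hierarchical testing procedures above):

set.seed(42)
X <- simuBlockGaussian(50, 12, 5, 0.7)
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
# p-values for the null hypothesis beta[6:10] = 0 against the full model
pval <- partialFtest(X, y, varToTest = 6:10)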
Plot the cross-validation error obtained from the cv.MLGL function
## S3 method for class 'cv.MLGL'
plot(x, log.lambda = FALSE, ...)
- x: cv.MLGL object
- log.lambda: if TRUE, use log(lambda) instead of lambda on the x-axis
- ...: other parameters for the plot function
set.seed(42)
# Simulate gaussian data with block-diagonal variance matrix containing 12 blocks of size 5
X <- simuBlockGaussian(50, 12, 5, 0.7)
# Generate a response variable
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
# Apply cv.MLGL method
res <- cv.MLGL(X, y)
# Plot the cv error curve
plot(res)
Plot the path obtained from the fullProcess function
## S3 method for class 'fullProcess'
plot(x, log.lambda = FALSE, lambda.lines = FALSE, lambda.opt = c("min", "max", "both"), ...)
- x: fullProcess object
- log.lambda: if TRUE, use log(lambda) instead of lambda on the x-axis
- lambda.lines: if TRUE, add vertical lines at the lambda values
- lambda.opt: if there are several optimal lambdas, which one to print: "min", "max" or "both"
- ...: other parameters for the plot function
set.seed(42)
# Simulate gaussian data with block-diagonal variance matrix containing 12 blocks of size 5
X <- simuBlockGaussian(50, 12, 5, 0.7)
# Generate a response variable
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
# Apply fullProcess method
res <- fullProcess(X, y)
# Plot the solution path
plot(res)
Plot the path obtained from the HMT function
## S3 method for class 'HMT'
plot(x, log.lambda = FALSE, lambda.lines = FALSE, lambda.opt = c("min", "max", "both"), ...)
- x: HMT object
- log.lambda: if TRUE, use log(lambda) instead of lambda on the x-axis
- lambda.lines: if TRUE, add vertical lines at the lambda values
- lambda.opt: if there are several optimal lambdas, which one to print: "min", "max" or "both"
- ...: other parameters for the plot function
set.seed(42)
# Simulate gaussian data with block-diagonal variance matrix containing 12 blocks of size 5
X <- simuBlockGaussian(50, 12, 5, 0.7)
# Generate a response variable
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
# Apply MLGL method
res <- MLGL(X, y)
out <- HMT(res, X, y)
plot(out)
Plot the path obtained from the MLGL function
## S3 method for class 'MLGL'
plot(x, log.lambda = FALSE, lambda.lines = FALSE, ...)
- x: MLGL object
- log.lambda: if TRUE, use log(lambda) instead of lambda on the x-axis
- lambda.lines: if TRUE, add vertical lines at the lambda values
- ...: other parameters for the plot function
# Simulate gaussian data with block-diagonal variance matrix containing 12 blocks of size 5
set.seed(42)
X <- simuBlockGaussian(50, 12, 5, 0.7)
# Generate a response variable
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
# Apply MLGL method
res <- MLGL(X, y)
# Plot the solution path
plot(res)
Plot the stability path obtained from the stability.MLGL function
## S3 method for class 'stability.MLGL'
plot(x, log.lambda = FALSE, threshold = 0.75, ...)
- x: stability.MLGL object
- log.lambda: if TRUE, use log(lambda) instead of lambda on the x-axis
- threshold: threshold on the selection frequency
- ...: other parameters for the plot function
A list containing:
- the indices of the selected variables for the given threshold
- the indices of the associated groups
- the value of the threshold
set.seed(42)
# Simulate gaussian data with block-diagonal variance matrix containing 12 blocks of size 5
X <- simuBlockGaussian(50, 12, 5, 0.7)
# Generate a response variable
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
# Apply stability.MLGL method
res <- stability.MLGL(X, y)
selected <- plot(res)
print(selected)
Predict fitted values from a cv.MLGL object
## S3 method for class 'cv.MLGL'
predict(object, newx = NULL, s = c("lambda.1se", "lambda.min"), type = c("fit", "coefficients"), ...)
- object: cv.MLGL object
- newx: matrix with new individuals for prediction. If type = "coefficients", this parameter has to be NULL
- s: either "lambda.1se" or "lambda.min"
- type: if "fit", return the fitted values for each value of s; if "coefficients", return the estimated coefficients for each s
- ...: not used. Other arguments to predict
A matrix with fitted values or estimated coefficients for given values of s.
Quentin Grimonprez
Predict fitted values from a MLGL object
## S3 method for class 'MLGL'
predict(object, newx = NULL, s = NULL, type = c("fit", "coefficients"), ...)
- object: MLGL object
- newx: matrix with new individuals for prediction. If type = "coefficients", this parameter has to be NULL
- s: values of lambda. If NULL, the values from the object are used
- type: if "fit", return the fitted values for each value of s; if "coefficients", return the estimated coefficients for each s
- ...: not used. Other arguments to predict
A matrix with fitted values or estimated coefficients for given values of s.
Original code from the gglasso package by Yi Yang <[email protected]> and Hui Zou <[email protected]>.
This function is inspired by the predict function from the gglasso package.
X <- simuBlockGaussian(n = 50, nBlock = 12, sizeBlock = 5, rho = 0.7)
y <- drop(X[, c(2, 7, 12)] %*% c(2, 2, -1)) + rnorm(50, 0, 0.5)
m1 <- MLGL(X, y, loss = "ls")
predict(m1, newx = X)
predict(m1, s = 3, newx = X)
predict(m1, s = 1:3, newx = X)
Print a fullProcess object
## S3 method for class 'fullProcess'
print(x, ...)
- x: fullProcess object
- ...: not used
fullProcess, summary.fullProcess
set.seed(42)
# Simulate gaussian data with block-diagonal variance matrix containing 12 blocks of size 5
X <- simuBlockGaussian(50, 12, 5, 0.7)
# Generate a response variable
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
# Apply fullProcess method
res <- fullProcess(X, y)
print(res)
Print a HMT object
## S3 method for class 'HMT'
print(x, ...)
- x: HMT object
- ...: not used
set.seed(42)
# Simulate gaussian data with block-diagonal variance matrix containing 12 blocks of size 5
X <- simuBlockGaussian(50, 12, 5, 0.7)
# Generate a response variable
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
# Apply MLGL method
res <- MLGL(X, y)
out <- HMT(res, X, y)
print(out)
Print a MLGL object
## S3 method for class 'MLGL'
print(x, ...)
- x: MLGL object
- ...: not used
set.seed(42)
# Simulate gaussian data with block-diagonal variance matrix containing 12 blocks of size 5
X <- simuBlockGaussian(50, 12, 5, 0.7)
# Generate a response variable
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
# Apply MLGL method
res <- MLGL(X, y)
print(res)
Select groups from hierarchical testing procedure with FDR control (hierarchicalFDR)
selFDR(out, alpha = 0.05, global = TRUE, outer = TRUE)
- out: output of the hierarchicalFDR function
- alpha: control level for the test
- global: if FALSE, the provided alpha is the desired control level for each family
- outer: if TRUE, the FDR is controlled only on outer nodes (rejected groups without rejected children). If FALSE, it is controlled on the full tree
See the reference for more details about the method.
If each family is controlled at level alpha, the following controls hold (with delta = 1.44): FDR control on the full tree at level alpha * delta * 2; FDR control on outer nodes at level alpha * L * delta * 2.
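As a quick numerical illustration of these bounds (a sketch; delta = 1.44 as stated above):

alpha <- 0.05
delta <- 1.44
alpha * delta * 2  # full-tree FDR control level: 0.144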
a list containing:
- a vector of booleans: TRUE if the group is selected
- the names of the groups
- the control level for each family of hypotheses
- the control level for the tree (full tree or outer nodes)
Yekutieli, Daniel. "Hierarchical False Discovery Rate-Controlling Methodology." Journal of the American Statistical Association 103.481 (2008): 309-16.
set.seed(42)
X <- simuBlockGaussian(50, 12, 5, 0.7)
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
res <- MLGL(X, y)
test <- hierarchicalFDR(X, y, res$group[[20]], res$var[[20]])
sel <- selFDR(test, alpha = 0.05)
Select groups from hierarchical testing procedure with FWER control (hierarchicalFWER)
selFWER(out, alpha = 0.05)
- out: output of the hierarchicalFWER function
- alpha: control level for the test
Only outer nodes (rejected groups without rejected children) are returned as TRUE.
a list containing:
- a vector of booleans: TRUE if the group is selected
- the names of the groups
Meinshausen, Nicolai. "Hierarchical Testing of Variable Importance." Biometrika 95.2 (2008): 265-78.
set.seed(42)
X <- simuBlockGaussian(50, 12, 5, 0.7)
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
res <- MLGL(X, y)
test <- hierarchicalFWER(X, y, res$group[[20]], res$var[[20]])
sel <- selFWER(test, alpha = 0.05)
Simulate n samples from a multivariate gaussian distribution with zero mean vector and a block-diagonal covariance matrix with 1 on the diagonal and rho within each block.
simuBlockGaussian(n, nBlock, sizeBlock, rho)
- n: number of samples to simulate
- nBlock: number of blocks
- sizeBlock: size of the blocks
- rho: correlation within each block
a matrix of size n * (nBlock * sizeBlock) containing the samples
Quentin Grimonprez
X <- simuBlockGaussian(50, 12, 5, 0.7)
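A quick empirical check of the simulated correlation structure (a sketch; the expected within-block correlation here is rho = 0.7):

X <- simuBlockGaussian(50, 12, 5, 0.7)
# mean empirical correlation within the first block (close to 0.7)
mean(cor(X[, 1:5])[upper.tri(diag(5))])
# correlation between variables from different blocks (close to 0)
cor(X[, 1], X[, 6])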
Stability selection for MLGL
stability.MLGL(X, y, B = 50, fraction = 0.5, hc = NULL, lambda = NULL, weightLevel = NULL, weightSizeGroup = NULL, loss = c("ls", "logit"), intercept = TRUE, verbose = FALSE, ...)
- X: matrix of size n*p
- y: vector of size n. If loss = "logit", elements of y must be in {-1, 1}
- B: number of bootstrap samples
- fraction: fraction of the data used at each of the B subsamples
- hc: output of hclust
- lambda: lambda values for the group-lasso. If not provided, the function generates its own sequence of lambda values
- weightLevel: a vector of size p with a weight for each level of the hierarchy. A zero indicates that the level will be ignored. If not provided, 1/(height between 2 successive levels) is used
- weightSizeGroup: a vector of size 2*p-1 containing the weight of each group. Default is the square root of the size of each group
- loss: a character string specifying the loss function to use; valid options are "ls" (least squares loss, regression) and "logit" (logistic loss, classification)
- intercept: should an intercept be included in the model?
- verbose: print some information
- ...: other parameters for MLGL
Hierarchical clustering is performed on all the variables. Then, the partitions from the different levels of the hierarchy are used in the different runs of MLGL for estimating the probability of selection of each group.
a stability.MLGL object containing:
- the sequence of lambda values
- the number of bootstrap samples
- a matrix of size length(lambda) x number of groups containing the selection probability of each group
- a vector containing the indices of the covariates
- a vector containing the indices of the associated groups of covariates
- the computation time
Quentin Grimonprez
Meinshausen and Buhlmann (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72.4, p. 417-473.
set.seed(42)
# Simulate gaussian data with block-diagonal variance matrix containing 12 blocks of size 5
X <- simuBlockGaussian(50, 12, 5, 0.7)
# Generate a response variable
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
# Apply stability.MLGL method
res <- stability.MLGL(X, y)
Summary of a fullProcess object
## S3 method for class 'fullProcess'
summary(object, ...)
- object: fullProcess object
- ...: not used
set.seed(42)
# Simulate gaussian data with block-diagonal variance matrix containing 12 blocks of size 5
X <- simuBlockGaussian(50, 12, 5, 0.7)
# Generate a response variable
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
# Apply fullProcess method
res <- fullProcess(X, y)
summary(res)
Summary of a HMT object
## S3 method for class 'HMT'
summary(object, ...)
- object: HMT object
- ...: not used
set.seed(42)
# Simulate gaussian data with block-diagonal variance matrix containing 12 blocks of size 5
X <- simuBlockGaussian(50, 12, 5, 0.7)
# Generate a response variable
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
# Apply MLGL method
res <- MLGL(X, y)
out <- HMT(res, X, y)
summary(out)
Summary of a MLGL object
## S3 method for class 'MLGL'
summary(object, ...)
- object: MLGL object
- ...: not used
set.seed(42)
# Simulate gaussian data with block-diagonal variance matrix containing 12 blocks of size 5
X <- simuBlockGaussian(50, 12, 5, 0.7)
# Generate a response variable
y <- X[, c(2, 7, 12)] %*% c(2, 2, -2) + rnorm(50, 0, 0.5)
# Apply MLGL method
res <- MLGL(X, y)
summary(res)
Find all unique groups in hclust results
uniqueGroupHclust(hc)
- hc: output of hclust
A list containing:
- a vector containing the indices of the variables
- a vector containing the index of the group of each variable
Quentin Grimonprez
hc <- hclust(dist(USArrests), "average")
res <- uniqueGroupHclust(hc)