--- title: "Categorical Functional Data Analysis" author: "Cristian Preda, Quentin Grimonprez" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{cfda} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- See for more details about the mathematical background and an overview of the package. ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, fig.align = "center", fig.width = 5, fig.height = 5) ``` ## Dataset Simulation ``` {r load, echo=TRUE, message=FALSE} library(cfda) set.seed(42) ``` We simulate the Jukes Cantor model of nucleotide replacement. Denote the set of nucleotides: $E = \{A, T, G, C\}$. A DNA sequence is seen as a string of nucleotides: `ATGCATTAC`. Each \textbf{site} in the sequence is subject to mutation over time. In the Jukes-Cantor model (1969), each site is a Markovian jump process with continuous time having as generator: $$ Q = \ \ \ \begin{array}{c|cccc} &A&T&G&C \\ \hline A&-3\alpha&\alpha&\alpha&\alpha\\ T&\alpha&-3\alpha&\alpha&\alpha\\ G&\alpha&\alpha&-3\alpha&\alpha\\ C&\alpha&\alpha&\alpha&-3\alpha\\ \hline \end{array}$$ for some $\alpha >0$. Let assume that the process is observed over the period $[0,T_{\max}]$ in $n$ sites with the nucleotide $A$ at time $t=0$. In the following, nucleotides are recoded: $A\leftrightarrow 1$, $T\leftrightarrow 2$, $G\leftrightarrow 3$, $C\leftrightarrow 4$. ```{r genData, echo=TRUE} K <- 4 PJK <- matrix(1 / 3, ncol = K, nrow = K) - diag(rep(1 / 3, K)) lambda_PJK <- c(1, 1, 1, 1) Tmax <- 10 n <- 100 d_JK <- generate_Markov(n = n, K = 4, P = PJK, lambda = lambda_PJK, Tmax = Tmax, labels = c("A", "C", "G", "T")) head(d_JK, 10) ``` The dataset is a data.frame with 3 columns: `id`, `time`, `state`. `id` contains the id of the different individuals (usually integers). `time` contains the time values at which a change occurs, note that for each individual, time values are ordered and start at 0. `state` contains the state that appears at the given time, state must be characters, factors, integers. This data format is used by the most part of cfda's functions. ```{r suumary, echo=TRUE} summary_cfd(d_JK) ``` We can compute the duration of each trajectory. ```{r, echo=TRUE} duration <- compute_duration(d_JK) head(duration) ``` All individuals has a different length. The computation of an optimal encoding requires that the end time of each individual is the same ($T_{\max}$), this can be done with the following function: ```{r cutT, echo=TRUE} d_JKT <- cut_data(d_JK, Tmax = Tmax) ``` Individuals can be plotted: ```{r plot, echo=TRUE, fig.height = 7} plotData(d_JKT) ``` ## Basic statistics Generally, in categorical functional data analysis, the following basic statistics are computed. ### Time spent in each state over a period of length $T_{\max}$: For each individual, the time spent in each of the $K$ states is computed. ```{r, echo=TRUE} timeSpent <- compute_time_spent(d_JKT) timeSpent[1:10, ] ``` The results can be plotted: ```{r, echo=TRUE} boxplot(timeSpent) ``` ### Number of jumps in $[0,T_{\max}]$: For each individual, the number of jumps occurring in $[0,T_{\max}]$ is computed. ```{r, echo=TRUE} nJump <- compute_number_jumps(d_JK) head(nJump) ``` The results can be plotted: ```{r, echo=TRUE} hist(nJump) ``` ### Probabilities to be in some state $x$ at time $t$, $p_{x}(t)$: An other interesting statistic is the evolution of the probability to be in each state. ```{r, echo=TRUE} pt_evol <- estimate_pt(d_JKT) pt_evol$pt[1:K, 1:10] head(pt_evol$t) ``` The output is a list with two elements: `t` the different time values and `pt` a matrix where each row contains the probability to be in a given state for all the time values. The result can be plotted: ```{r, echo=TRUE} plot(pt_evol, ribbon = TRUE) ``` ### Transitions in $[0,T_{\max}]$: The transitions between states can been studied by computing a frequency table counting the number of times each pair of states were observed in successive observation times. ```{r, echo=TRUE} statetable(d_JK) ``` Assuming that the data follows a Markov process, parameters $P$ (transition probability matrix) and $\lambda$ of the process can be estimated. ```{r, echo=TRUE} mark <- estimate_Markov(d_JK) mark ``` The estimated parameters are closed to the ones used. The results can be plotted through a transition graph where each node corresponds to a state with the mean time spent within (corresponding to $1/\lambda$) and arrows correspond to transition probabilities between states. ```{r, echo=TRUE} plot(mark) ``` ## Encoding ### Concept $X$ is a continuous stochastic process with jumps. We define categorical functional data as a set of sample paths of $X$. **cfda** is seen as an extension of the multiple correspondence analysis to a stochastic process. The idea is to find a scalar real random variable $z \in L_2(\Omega)$ that is the *most correlated* to ${X} = \{X_{t}\ : \ t\in [0,T_{\max}]\}$. ### Mathematical Background For a fixed $t\in [0,T_{\max}]$, let $E^{t}$ the projection operator, $$E^{t}(z) = \mathbb{E}(z|X_{t}) = \sum_{x\in S}{\mathbb{E}(z|X_t=x) \mathbf{1}_{X_t=x}}$$ Then, the correlation coefficient between $z$ and $X_{t}$ is: $$\eta^{2}(z, X_{t}) = \frac{var(\mathbb{E}(z|X_{t}))}{var(z)}$$ For $t, s \in [0,T_{\max}]$, the *(simple) correspondence analysis* looks for $z$ such that it maximizes $$\displaystyle\frac{1}{2}\left(\eta^{2}(z, X_{t})+\eta^{2}(z, X_{s})\right)$$ and the solution is: $$(E^{t}+E^{s})z = \lambda z$$ $z$ is called a *principal component*. As an extension, the *functional correspondence analysis* is defined as the solution of the optimization problem: $$ \arg\max_{z}\displaystyle\frac{1}{T_{max}}\int_{0}^{T_{max}}\eta^{2}(z, X_{t})dt$$ Solution: $$\int_{0}^{T_{max}}E^{s}z\ ds = \lambda z$$ Denote $\xi_t = E^{t}z$, $$\int_0^{T_{\max}}K(t,s)\xi_s\ ds = \lambda \xi_t, \ \ \ \ \mbox{with}\ K(s,t) = E^tE^s. $$ It follows that, $$\xi_{t} = \sum_{x\in S}a_{x}(t)\mathbf{1}_{X_t=x}$$ with $a_x(t) =\mathbb{E}(z|X_t= x), \ \ \forall x \in S$. For each $x\in S$, the functions $a_x: [0,T_{\max}|\rightarrow \mathbb{R}$ are called *optimal encoding* of the state $x$. The eigenvalue problem becomes in terms of encodings: $$\lambda a_x(t) = \sum_{y\in S}\displaystyle\int_{0}^{T_{\max}}\frac{p_{x,y}(t,s)}{p_x(t)}a_y(s)ds, \ \ \ \ \ \forall x \in S, \forall t\in [0,T_{\max}], $$ where$p_{x,y}(t,s) = \mathbb{P}(X_t=x, X_s=y)$ and $p_x(t)= \mathbb{P}(X_t=x)$. Under general conditions (continuity in probability of $X$), there exists $\{\lambda_i\}_{i\geq1}$ positive eigen-values and $\{a_x^i\}_{i\geq 1}$ eigen-functions. The following expansion formula holds (Mercer thm.): $$p_{x,y}(t,s) = p_x(t)p_y(s)\sum_{i\geq 1}\lambda_{i}a^i_x(t)a^i_y(s)$$ For $x=y$, one obtains: $$p_x(t) =\displaystyle\frac{1}{\displaystyle\sum_{i\geq 1}\lambda_{i}a^i_x(t)a^i_y(s)}$$ We are interested in approximating the encoding functions, $a_x$, by $$a_{x} \approx \sum_{i=1}^m \alpha_{x,i}\phi_i, $$ where $\mathbf{\alpha_x} = (\alpha_{x,1}, \ldots, \alpha_{x,m}) \in \mathbb{R}^m$ are the expansion coefficients onto a basis of functions defined on $[0,T]$, $\{\phi_1, \ldots, \phi_m\}$, $m \geq 1$. The main result (Deville, 1982) is that $\alpha = (\alpha_{1}, \ldots, \alpha_{K}) \in \mathbb{R}^{Km}$ is the solution of the following eigen-value problem: $$F^{-1}G\alpha = \lambda\alpha, $$ where $G$ and $F$ the matrix defined by: $$G = cov(\{V_{(x,i)}\}_{x\in S; i=1:m }),$$ $$F = \mathbb{E}\left(\{F_{(x,i),(y,j)}\}_{x,y\in S ; i,j=1:m}\right)$$ with the random variables $V_{(x,i)}$ and $F_{(x,i),(x,j)}$ defined by: $$V_{(x,i)} = \int_{0}^{T_{\max}}\phi_i(t)\mathbf{1}_xdt \ \ \ \ \mbox{and}\ \ \ \ F_{(x,i), (y,j)} = \int_{0}^{T_{\max}}\phi_i(t)\phi_{j}(t)\mathbf{1}_x\mathbf{1}_y dt$$ ### Application Firstly, a basis of functions must be defined. We choose a B-splines basis of order 4. ```{r, echo=TRUE} m <- 10 basis <- create.bspline.basis(c(0, Tmax), nbasis = m, norder = 4) ``` The optimal encoding is computed using: ```{r, echo=TRUE} fmca <- compute_optimal_encoding(d_JKT, basis, verbose = FALSE) summary(fmca) ``` By default this function computes bootstrap estimates of the encoding functions in order to have a confidence interval. This is controlled by the `computeCI` arguments. The output is a list containing the different elements computed during the process: `eigenvalues`, `pc`, `alpha`, `F`, `G`, `V` and `basisobj`. Note that `alpha`contains the coefficients of the different encoding functions and `pc`the principal components. These components can be used with classical statistic methods (k-means, regression...). The eigenvalues can be computed using: ```{r, echo=TRUE} plotEigenvalues(fmca, cumulative = TRUE) ``` The first encoding function coefficients $\alpha$'s: ```{r, echo=TRUE} print(fmca$alpha[[1]]) ``` The resulting encoding can be plotted: ```{r, echo=TRUE} plot(fmca) ``` or extracted as a *fd* object (or matrix): ```{r, echo=TRUE} encoding <- get_encoding(fmca, fdObject = TRUE) ``` By default, it plots (and returns) the first encoding functions. Other encoding can be accessed with the `harm` argument. A confidence interval can be plotted using the `addCI` argument: ```{r, echo=TRUE} plot(fmca, addCI = TRUE, coeff = 2, states = "A") ``` Plot the two first components: ```{r, echo=TRUE} plotComponent(fmca, comp = c(1, 2), addNames = FALSE) ```