-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathQuickStart.Rmd
231 lines (193 loc) · 7.45 KB
/
QuickStart.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
---
title: "QuickStart"
author:
- name: Y-h. Taguchi
affiliation: Department of Physics, Chuo University, Tokyo 112-8551, Japan
email: [email protected]
output:
BiocStyle::html_document:
toc: true
bibliography: references.bib
vignette: >
%\VignetteIndexEntry{QuickStart}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
crop = NULL,
comment = "#>"
)
```
# Installation
```{r, eval = FALSE}
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("TDbasedUFE")
```
# Introduction
```{r setup}
library(TDbasedUFE)
```
Here I introduce how we can use TDbasedUFE to process real data.
# Data Analyses
## Gene expression.
Here is how we can process real data. First we prepare the data set from
tximportdata package
```{r}
require(GenomicRanges)
require(rTensor)
library("readr")
library("tximport")
library("tximportData")
dir <- system.file("extdata", package="tximportData")
samples <- read.table(file.path(dir,"samples.txt"), header=TRUE)
samples$condition <- factor(rep(c("A","B"),each=3))
rownames(samples) <- samples$run
samples[,c("pop","center","run","condition")]
files <- file.path(dir,"salmon", samples$run, "quant.sf.gz")
names(files) <- samples$run
tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv"))
txi <- tximport(files, type="salmon", tx2gene=tx2gene)
```
As can be seen on the above, this data set is composed of six samples, which
are divided into two classes, each of which includes three out of six samples.
The number of features is 58288
```{r}
dim(txi$counts)
```
which is too many to execute small desktops, we reduce number of features to
10,000
```{r}
txi[seq_len(3)] <-
lapply(txi[seq_len(3)],
function(x){dim(x);x[seq_len(10000),]})
```
Then we reformat count data, txi\$counts, as a tensor, $Z$, whose dimension is
$10000 \times 3 \times 2$ and HOSVD was applied to $Z$ to get tensor
decomposition using function computeHosvd.
```{r}
Z <- PrepareSummarizedExperimentTensor(matrix(samples$sample,c(3,2)),
rownames(txi$abundance),array(txi$counts,c(dim(txi$counts)[1],3,2)))
dim(attr(Z,"value"))
HOSVD <- computeHosvd(Z)
```
Here 3 and 2 stand for the number of samples in each class and the number of
classes, respectively. HOSVD includes output from HOSVD. Next, we need to decide
which singular value vectors are used for the download analysis interactively.
Since vignettes has no ability to store the output from interactive output from
R, you have to input here as
```{r}
input_all <- selectSingularValueVectorSmall(HOSVD,input_all= c(1,2)) #batch mode
```
in batch mode. Then you get the above graphic. In actual usage you can
activate interactive mode as
```
input_all <- selectSingularValueVectorSmall(HOSVD)
````
In interactive mode, you can see "Next", "Prev" and "Select" radio buttons by
which you can select singular value vectors one by one interactively.
Now 1 and 2 is entered in input_all, since these correspond to constant among
three samples (AAA or BBB) and distinction between A and B, respectively.
Then we try to select features.
```{r}
index <- selectFeature(HOSVD,input_all)
```
We get the above graphic. The left one represents the dependence of "flatness"
of histogram of $P$-values computed with assuming the null hypothesis. More or
less it can select smallest value. Thus it is successful. The right one
represents the histogram of 1-$P$-values. Peak at right end corresponds
to genes selected (The peak at the left end does not have any meaning since
they corresponds to $P \simeq 1$).
Then we try to see top ranked features
```{r}
head(tableFeatures(Z,index))
```
These are associated with $P=0$. Thus, successful.
## Multiomics data analysis
Next we try to see how we can make use of TDbasedUFE to perform multiomics
analyses. In order that, we prepare data set using MOFAdata package as follows.
```{r}
require(MOFAdata)
data("CLL_data")
data("CLL_covariates")
help(CLL_data)
```
CLL_data is composed of four omics data,
- Drugs: viability values in response to 310 different drugs and concentrations
- Methylation: methylation M-values for the 4248 most variable CpG sites
- mRNA: normalized expression values for the 5000 most variable genes
- Mutations: Mutation status for 69 selected genes
each of which was converted to squared matrix (i. e., liner
kernel[@Taguchi2022b])and they are bundle as one tensor using
PrepareSummarizedExperimentTensorSquare function, to which HOSVD is
applied as follows.
```{r}
Z <- PrepareSummarizedExperimentTensorSquare(
sample=matrix(colnames(CLL_data$Drugs),1),
feature=list(Drugs=rownames(CLL_data$Drugs),
Methylation=rownames(CLL_data$Methylation),
mRNA=rownames(CLL_data$mRNA),Mutations=rownames(CLL_data$Mutations)),
value=convertSquare(CLL_data),sampleData=list(CLL_covariates[,1]))
HOSVD <- computeHosvdSqure(Z)
```
CLL_covariate is labeling information, among which we employed the distinction
between male (m) and female (f).
```{r}
table(CLL_covariates[,1])
cond <- list(attr(Z,"sampleData")[[1]],attr(Z,"sampleData")[[1]],seq_len(4))
```
Since a tensor is a bundle of liner kernel, the first two singular value
vectors are dedicated to samples and the third (last) one is dedicated to four
omics classes (Drugs, Methylation, mRNA and mutations). In order to select
which singular value vectors are employed, we execute selectFeatureSquare
function in batch mode as follows
```{r}
input_all <- selectSingularValueVectorLarge(HOSVD,cond,input_all=c(8,1))
```
In actual usage you can activate interactive mode as
```
input_all <- selectSingularValueVectorLarge(HOSVD,cond)
```
and can select the eight singular value vetor for samples that represents
distinction between male and female and the first singular value vectors
to four omics categories that represents the commoness between four omics
data (i.e., constant values for four omics data).
Now we come to the stage to select features. Since there are four features,
we need to optimize SD for each of them. Try to execute selectFeatureSquare
in a batch mode as follows.
```{r}
index <- selectFeatureSquare(HOSVD,input_all,CLL_data,
de=c(0.3,0.03,0.1,0.1),interact=FALSE) #for batch mode
```
In actual usage, you can activate interctive mode
```
index <- selectFeatureSquare(HOSVD,input_all,CLL_data,
de=c(0.3,0.03,0.1,0.1))
```
to see these plots one by one if you hope.
In the above, de=c(0.3,0.03,0.1,0.1) represents initial SD for the optimization
of SD for each of four omics. There might be need for some trial & errors to
get good initial values. Every time you type enter in an interactive mode,
you can see one of four plots for four omics data.
Now the output includes four lists each of which corresponds to one of four
omics data. Each list is composed of two vectors whose length is equivalent to
the number of features in each of omics data. Two vectors, index and p.value,
stores logical vector that shows if individual features are selected and raw
$P$-values.
In order to see selected features for all four omics data, tableFeatureSquare
function must be repeated four times as
```{r}
for (id in seq_len(4)){print(head(tableFeaturesSquare(Z,index,id)))}
```
# Conclusion
In this vignettes, we briefly introduce how we can make use of TDbasedUFE to
perform gene expression analysis and multiomics analysis. I am glad if you can
make use of it as well for your own research.
```{r}
sessionInfo()
```