styler
packagescale()
What are biological networks? What do they represent (networks encode relationships between entities)? Why are biological networks useful and what can we learn from them?
What is a biological pathway? Why are they important, and how are they useful?
What are gene regulatory networks? Why are they important? How do we identify them (data driven correlation, wetlab experiments, others?)? How are they useful?
What are protein-protein interaction (PPI) networks? What information do they represent (direct association, functional association, etc)? Where does PPI information come from (datasets, databases, etc)? What are some ways we can use PPI information when interpreting other biological data (like differential expression? not sure)?
Weighted correlation network analysis (WGCNA) identifies genes (or other biology features) that are highly correlated with each other, and group them into clusters / modules.
Typical inputs of WGCNA is the expression matrix of genes, proteins or other biology features. Then, WGCNA will construct a undirected and weighted co-expression network. The nodes will be genes (or other features), and the edges connecting them will be the pairwise correlation value of their expression level. Modules are simply the clusters of highly connected genes. After the modules are detected, down stream analyses may include summarizing the module by its “eigengenes,” inferring the functionality of a gene from genes in the same module, or compare different modules.
For more information, you may refer to this article or the manual of WGCNA R package.
First, install package WGCNA
.
BiocManager::install("WGCNA")
library(WGCNA)
note: if you receive error message saying some dependencies are not available, it’s probably because you used to try to install it using standard method install.package("WGCNA")
. If that’s the case, try the following to re-install WGCNA:
BiocManager::install("WGCNA",force = T)
Let’s load a toy gene expression matrix. We have 15 samples and 7 genes.
## CRAB Whale Lobst Octop Coral BabyShark Orca
## samp1 6.867731 8.796212 7.820130 6.6081601 15.37307 14.047536 32.96333
## samp2 10.918217 6.775947 16.545788 7.5685123 28.59193 8.923224 19.13865
## samp3 5.821857 8.682239 5.089099 2.8030264 17.99007 14.643098 36.93207
## samp4 17.976404 5.741274 27.596721 8.8660342 34.62918 9.473205 18.68929
## samp5 11.647539 10.866047 14.852970 2.2034360 14.15414 15.949892 42.21472
## samp6 5.897658 11.960800 15.915636 -0.1724085 21.78596 16.336686 44.93125
An optional choice in WGCNA to allow multithreads
allowWGCNAThreads(4)
## Allowing multi-threading with up to 4 threads.
Now we will choose a “soft-thresholding power” to construct the network. There is no single answer in the power to choose.
If you refer to WGCNA faq, there are more details for power choosing.
# A set of soft-thresholding powers to choose from:
<- seq(1, 10)
powers
# Call the network topology analysis function
<- pickSoftThreshold(
sft data = dat,
powerVector = powers,
verbose = 5
)
## pickSoftThreshold: will use block size 7.
## pickSoftThreshold: calculating connectivity for given powers...
## ..working on genes 1 through 7 of 7
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.026500 -12.800 0.38900 1.920 1.820 2.350
## 2 2 0.007510 2.420 -0.18500 1.240 1.180 1.730
## 3 3 0.001090 0.582 0.00340 0.931 0.908 1.430
## 4 4 0.001300 -0.460 -0.14200 0.733 0.708 1.220
## 5 5 0.003130 0.526 -0.03820 0.592 0.556 1.060
## 6 6 0.018100 -1.150 0.04810 0.486 0.440 0.922
## 7 7 0.087400 -2.130 -0.15900 0.404 0.349 0.807
## 8 8 0.063500 -1.570 -0.11100 0.338 0.279 0.708
## 9 9 0.000378 -0.206 0.00319 0.285 0.224 0.621
## 10 10 0.003650 -0.600 -0.00478 0.241 0.181 0.545
par(mfrow = c(1, 2))
<- 0.9
cex1 plot(sft$fitIndices[, 1],
-sign(sft$fitIndices[, 3]) * sft$fitIndices[, 2],
xlab = "Soft Threshold (power)",
ylab = "Scale Free Topology Model Fit, signed R^2",
main = paste("Scale independence")
)text(sft$fitIndices[, 1],
-sign(sft$fitIndices[, 3]) * sft$fitIndices[, 2],
labels = powers, cex = cex1, col = "red"
)abline(h = 0.90, col = "red")
plot(sft$fitIndices[, 1],
$fitIndices[, 5],
sftxlab = "Soft Threshold (power)",
ylab = "Mean Connectivity",
type = "n",
main = paste("Mean connectivity")
)text(sft$fitIndices[, 1],
$fitIndices[, 5],
sftlabels = powers,
cex = cex1, col = "red"
)
Generally, we want to choose a power that give us the “Scale Free Topology Model Fit, signed R^2” around 0.8. This toy dataset is too small and can’t show it properly.
For illustration, we will pick power 4 in this example.
<- 4
picked_power
# fix a namespace conflict issue, force R to use the cor() function in WGCNA:
<- cor
temp_cor <- WGCNA::cor cor
Construct network.
There are a lot of parameters. In most cases, you don’t need to modify them. But, make sure you choose the proper power
, networkType
(“signed” or “unsigned”), and minModuleSize
in your analysis.
- networkType
Using “signed” option, the direction of the correlation will be considered, e.g. two genes won’t be clustered together if their correlation value is a perfect -1. On the other hand, use “unsigned” option to consider the absolute value of correlation.
- minModuleSize
The minimum umber of features, e.g. genes, in a module. In this example with only 7 genes, so I set this argument to be 2. In real analysis, a typical choice may be 20 or 30 for a matrix with several hundreds or thousands of genes.
<- blockwiseModules(
netwk datExpr = dat,
power = picked_power,
networkType = "signed",
deepSplit = 2,
pamRespectsDendro = F,
minModuleSize = 2,
maxBlockSize = 4000,
reassignThreshold = 0,
mergeCutHeight = 0.25,
saveTOMs = T, # Archive the run results in TOM file (saves time)
saveTOMFileBase = "ER",
numericLabels = T,
verbose = 3
)
## Calculating module eigengenes block-wise from all genes
## Flagging genes and samples with too many missing values...
## ..step 1
## ..Working on block 1 .
## TOM calculation: adjacency..
## ..will not use multithreading.
## Fraction of slow calculations: 0.000000
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
## ..saving TOM for block 1 into file ER-block.1.RData
## ....clustering..
## ....detecting modules..
## ....calculating module eigengenes..
## ....checking kME in modules..
## ..merging modules that are too close..
## mergeCloseModules: Merging modules whose distance is less than 0.25
## Calculating new MEs...
Now we can see the module membership assignment:
$colors netwk
## CRAB Whale Lobst Octop Coral BabyShark Orca
## 1 2 1 1 1 2 2
CRAB, Lobst, Octop and Coral are assigned to module 1. Whale, BabyShark and Orca are assigned to module 2.
We can use function labels2colors() to conveniently convert the module membership to color labels.
<- labels2colors(netwk$colors) mergedColors
Plot the dendrogram and the module colors:
plotDendroAndColors(
$dendrograms[[1]],
netwk$blockGenes[[1]]],
mergedColors[netwk"Module colors",
dendroLabels = FALSE,
hang = 0.03,
addGuide = TRUE,
guideHang = 0.05
)
In a real analysis with a lot of genes, you will see much more modules and colors.
In this dendrogram, you will be able to roughly see the size of each module by looking at the area of different colors.
At this point, the WGCNA analysis is finished.
We can summarize the module information in a data frame:
<- data.frame(
module_df gene_id = names(netwk$colors),
modules = labels2colors(netwk$colors)
) module_df
## gene_id modules
## 1 CRAB turquoise
## 2 Whale blue
## 3 Lobst turquoise
## 4 Octop turquoise
## 5 Coral turquoise
## 6 BabyShark blue
## 7 Orca blue