In this hands-on session, we will show how principal component analysis (PCA) is used in common bioinformatics workflows. We will use a toy single-cell RNA-seq (scRNA-seq) dataset and a real NCI microarray data to illustrate it.
The main objectives are the following:
prcomp
function.Analyzing tabular datasets, specially high-dimensional data, involves identifying which features carry more information and whether some features can be recovered from other features. Principal component analysis (PCA) is a dimensionality reduction method that transforms high-dimensions data into lower-dimensions while retaining as much information as possible.
We will work tabular data with numerical entries, using an \(n x m\) matrix to represent feature values (columns) for experimental samples (rows).
PCA offers several advantages when analyzing biological data:
Single-cell RNA-seq allows the gene expression profiling of thousands of cells, one a time. It is often exemplified using the following analogy:
scRNA-seq is a double-edged sword. On the one hand, scRNA-seq provides unprecedented discriminative power to chart all the cell types and states in a given tissue. This has catalyzed the creation of the Human Cell Atlas (HCA), which aims to classify all cells in the human body using scRNA-seq; and has been coined as āthe periodic table of human cellsā.
On the other hand, however, scRNA-seq has a high degree of technical variability, which can be introduced at multiple points of the process. Some examples include the following (Definitions extracted from table 1 of this paper):
As an example, we will create an expression matrix with the following cells:
In the following image you can visualize the redundancy between CD79A and CD79B. Together, they form the heterodimer CD79; which means that for every molecule of CD79A we need one of CD79B. Under the hood they are the same variable, which should be captured by a single principal component.
Image obtained from this link
Load packages
library(pheatmap)
library(tidyverse)
We will create a toy dataset that contains the gene expression of each of the cells. Cells are represented as rows and genes as columns.
# Create toy dataset
toy <- data.frame(
CD3D = c(4, 5, 4, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
CD3E = c(5, 3, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
LYZ = c(0, 0, 0, 0, 5, 7, 6, 0, 0, 0, 0, 0, 0, 0),
S100A8 = c(0, 0, 0, 0, 5, 7, 6, 0, 0, 0, 0, 0, 0, 0),
CD79A = c(0, 0, 0, 0, 0, 0, 0, 6, 4, 4, 3, 5, 0, 0),
CD79B = c(0, 0, 0, 0, 0, 0, 0, 5, 5, 3, 4, 6, 0, 0),
BLNK = c(0, 0, 0, 0, 0, 0, 0, 6, 4, 5, 5, 4, 0, 0),
TOP2A = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 9, 0, 0),
MKI67 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11, 10, 0, 0),
MT_CO1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 18, 23),
MT_ND5 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 22, 25)
)
rownames(toy) <- paste("cell", as.character(1:nrow(toy)))
toy <- as.matrix(toy)
# Visualize
pheatmap(toy, cluster_cols = FALSE, cluster_rows = FALSE, angle_col = 45)
Two ways to perform PCA in R:
prcomp
function.Regardless of the method, we always need to normalize the data (center and scale). We will do it by using the function scale
.
The function scale
takes as arguments the matrix and two parameters: center
and scale
, that can be either a logical value or a numeric-alike vector of length equal to the number of columns of the matrix.
center
: if center
is a numeric-alike vector with length equal to the number of columns of the matrix, then the output will have the result of substracting these values to each of the values in the corresponding columns. If center
is TRUE
then centering is done by subtracting the column means (omitting NAs) of the matrix from their corresponding columns, and if center
is FALSE
, no centering is done.scale
: if scale
is a numeric-alike vector with length equal to the number of columns of the matrix, then, after centering, each column of the matrix is divided by the corresponding value from scale
. If scale
is TRUE
then scaling is done by dividing the (centered) columns of the matrix by their standard deviations, and if scale
is FALSE
, no scaling is done.toy_centered <- scale(toy, center = TRUE, scale = FALSE)
toy_normalized <- scale(toy, center = TRUE, scale = rep((sqrt(nrow(toy)-1)), ncol(toy)))
toy_normalized[1:5,1:5]
## CD3D CD3E LYZ S100A8 CD79A
## cell 1 0.7924289 1.0697789 -0.356593 -0.356593 -0.4358359
## cell 2 1.0697789 0.5150788 -0.356593 -0.356593 -0.4358359
## cell 3 0.7924289 0.7924289 -0.356593 -0.356593 -0.4358359
## cell 4 0.5150788 0.7924289 -0.356593 -0.356593 -0.4358359
## cell 5 -0.3169715 -0.3169715 1.030158 1.030158 -0.4358359
Now we will compute the exact same thing (PCA) in the 2 ways.
In each case we will be tracking the same three pieces of information:
We will be computing an eigenbasis for the covariance matrix
\(\Omega = A^tA\) where \(A\) is the normalized data (centered and scaled). We will use the function eigen
.
The eigenvectors of the covariance matrix are the loadings (principal components):
cov_matrix <- (t(toy_normalized) %*% toy_normalized)
eigen_cov_matrix <- eigen(cov_matrix)
pdir_eigen <- eigen_cov_matrix$vectors
pdir_eigen[1:5,1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.03621728 -0.1281224 -0.2892754 0.37018269 -0.1942142
## [2,] -0.03621728 -0.1281224 -0.2892754 0.37018269 -0.1942142
## [3,] -0.04319892 -0.1949563 0.5855971 -0.07543384 -0.1254525
## [4,] -0.04319892 -0.1949563 0.5855971 -0.07543384 -0.1254525
## [5,] -0.06565032 0.2467930 -0.1395316 -0.42583898 -0.3020970
The eigenvalues contains the variance of the data along each principal component.
var_eigen <- eigen_cov_matrix$values
head(var_eigen)
## [1] 133.1235625 36.7772802 14.4225675 10.9965640 0.4843188 0.2349899
The scores of each data sample in the principal components can be obtained by multiplying the original centered matrix with the matrix with the eigenvectors as columns:
scores_eigen <- toy_centered %*% pdir_eigen
scores_eigen[1:5, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## cell 1 -3.951320 -3.948608 -3.661022 3.654521 -0.25850086
## cell 2 -3.915103 -3.820486 -3.371747 3.284338 -0.06428666
## cell 3 -3.915103 -3.820486 -3.371747 3.284338 -0.06428666
## cell 4 -3.878885 -3.692364 -3.082472 2.914155 0.12992755
## cell 5 -4.057353 -4.745070 4.798428 -0.431462 0.23490185
Now we will use a stand-alone PCA function provided by R, prcomp
, to calculate the principal components.
For the prcomp
function we donāt need to scale the data, it is already implemented within the function. We do need to center the data.
pca_out <- prcomp(toy, center = TRUE, scale = FALSE)
We can use the following keywords to parse the output instance:
Principal components (loadings): ārotationā
pdir_pca <- pca_out$rotation
pdir_pca[1:5,1:5]
## PC1 PC2 PC3 PC4 PC5
## CD3D -0.03621728 -0.1281224 0.2892754 0.37018269 0.1942142
## CD3E -0.03621728 -0.1281224 0.2892754 0.37018269 0.1942142
## LYZ -0.04319892 -0.1949563 -0.5855971 -0.07543384 0.1254525
## S100A8 -0.04319892 -0.1949563 -0.5855971 -0.07543384 0.1254525
## CD79A -0.06565032 0.2467930 0.1395316 -0.42583898 0.3020970
Standard deviations : āsdevā
(We need to multiply by \(n-1\) since, in prcomp
, variances are computed with the usual divisor \(n-1\). See documentation.)
var_pca <- (pca_out$sdev)** 2
head(var_pca)
## [1] 133.1235625 36.7772802 14.4225675 10.9965640 0.4843188 0.2349899
Principal components scores: āxā
scores_pca <- pca_out$x
scores_pca[1:5,1:5]
## PC1 PC2 PC3 PC4 PC5
## cell 1 -3.951320 -3.948608 3.661022 3.654521 0.25850086
## cell 2 -3.915103 -3.820486 3.371747 3.284338 0.06428666
## cell 3 -3.915103 -3.820486 3.371747 3.284338 0.06428666
## cell 4 -3.878885 -3.692364 3.082472 2.914155 -0.12992755
## cell 5 -4.057353 -4.745070 -4.798428 -0.431462 -0.23490185
In summary, we obtain the following:
pca_out$rotation
pca_out$sdev ** 2
pca_out$x
When using prcomp
function can get the proportion of variance explained (PVE) and the cumulative proportion with the summary
function (more info in this book):
summary(pca_out)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 11.5379 6.0644 3.79771 3.31611 0.69593 0.4848 0.48038
## Proportion of Variance 0.6777 0.1872 0.07342 0.05598 0.00247 0.0012 0.00117
## Cumulative Proportion 0.6777 0.8649 0.93829 0.99427 0.99674 0.9979 0.99911
## PC8 PC9 PC10 PC11
## Standard deviation 0.33153 0.24758 0.06164 4.519e-16
## Proportion of Variance 0.00056 0.00031 0.00002 0.000e+00
## Cumulative Proportion 0.99967 0.99998 1.00000 1.000e+00
To reduce the dimensionality, we will choose the number of PC that explains most of the variance in the dataset. To this end, we use a so-called elbow plot:
# Calculate percentage of variance explained (PVE):
pve <- pca_out$sdev ** 2 / sum(pca_out$sdev ** 2) * 100
pve <- round(pve, 2)
pve_df <- data.frame(principal_component = 1:length(pve), pve = pve)
pve_gg <- pve_df %>%
ggplot(aes(principal_component, pve)) +
geom_point() +
geom_vline(xintercept = 4.5, linetype = "dashed", color = "darkblue") +
scale_x_continuous(breaks = 1:length(pve)) +
labs(x = "Principal Component", y = "Percentage of Variance Explained (%)") +
theme_bw()
pve_gg
Some people prefer to plot the cumulative proportion of explained variance:
summ_pca <- summary(pca_out)
cum_pct <- summ_pca$importance["Cumulative Proportion", ] * 100
cum_pct_df <- data.frame(principal_component = 1:length(pve), cum_pct = cum_pct)
cum_pct_gg <- cum_pct_df %>%
ggplot(aes(principal_component, cum_pct)) +
geom_point() +
geom_vline(xintercept = 4.5, linetype = "dashed", color = "darkblue") +
scale_x_continuous(breaks = 1:length(cum_pct)) +
scale_y_continuous(limits = c(0, 100)) +
labs(x = "Principal Component", y = "Cumulative Percentage of Variance Explained (%)") +
theme_bw()
cum_pct_gg
Thus, we conclude that the first 4 PC are significant. Let us express each cell as a vector of the first 4 PC (scores):
toy_reduced <- pca_out$x[, c("PC1", "PC2", "PC3", "PC4")]
pheatmap(toy_reduced, cluster_rows = FALSE, cluster_cols = FALSE, angle_col = 45)
Now we have reduced the dimensionality of the dataset. To interpret each principal component, let us inspect the gene loadings:
gene_loads_reduced <- pca_out$rotation[, c("PC1", "PC2", "PC3", "PC4")]
significant_pcs <- c("PC1", "PC2", "PC3", "PC4")
loadings_gg <- map(significant_pcs, function(x) {
loadings <- gene_loads_reduced[, x]
df <- data.frame(gene = names(loadings), score = loadings)
p <- df %>%
ggplot(aes(fct_reorder(gene, score), score)) +
geom_point() +
labs(x = "", y = x) +
theme_bw() +
coord_flip()
return(p)
})
loadings_gg
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
Clustering means classifying observations into groups that minimize and maximize intra-group and inter-group distances, respectively.
To that end we need a concept of distance, which basically measures how similar or different two vectors are. In our case we will use Euclidean distance, but be aware that there are a plethora of different options that can yield different clustering results.
Let us start by computing all pairwise distances (distance between cells, considering their expression profiles):
dist_mat <- dist(toy_reduced, method = "euclidean")
pheatmap(dist_mat, cluster_rows = FALSE, cluster_cols = FALSE)
hclust_average <- hclust(dist_mat, method = "average")
plot(
hclust_average,
labels = rownames(toy),
main = "Average Linkage",
xlab = "",
sub = "",
ylab = "Cophenetic distance"
)
plot(
hclust_average,
labels = rownames(toy),
main = "Average Linkage",
xlab = "",
sub = "",
ylab = "Cophenetic distance"
)
abline(h = 7.5, col = "red")
Visualize the correspondent cluster to each of the cells:
hc_clusters <- cutree(hclust_average, h = 7.5)
hc_clusters
## cell 1 cell 2 cell 3 cell 4 cell 5 cell 6 cell 7 cell 8 cell 9 cell 10
## 1 1 1 1 2 2 2 3 3 3
## cell 11 cell 12 cell 13 cell 14
## 4 4 5 5
Visualize the number of cells in each cluster:
table(hc_clusters)
## hc_clusters
## 1 2 3 4 5
## 4 3 3 2 2
Cluster | Cell type |
---|---|
1 | T-cells |
2 | Monocytes |
3 | Naive B-cells |
4 | Plasma Cells |
5 | poor-quality cells |
Visualize the centered gene expression across cells and their corresponding cluster:
annotation_rows <- data.frame(cluster = as.character(hc_clusters))
rownames(annotation_rows) <- names(hc_clusters)
annotated_clusters <- c("T-cells", "Monocytes", "Naive B-cells", "Plasma Cells", "poor-quality cells")
names(annotated_clusters) <- as.character(1:5)
annotation_rows$annotation <- annotated_clusters[annotation_rows$cluster]
pheatmap(
toy_normalized,
cluster_rows = FALSE,
cluster_cols = FALSE,
angle_col = 315,
annotation_row = annotation_rows
)
We will use the data from NCI 60 Data from the ISLR package. This data set contains NCI microarray data. The data contains expression levels on 6830 genes from 64 cancer cell lines.
Exercise
Find the principal components of this dataset and cluster the cell lines according to their similarities in gene expression.
We will load the data and keep the information of the table and the tumor types of the cell lines:
library(ISLR)
data_NCI60 <- NCI60$data
labels_NCI60 <- NCI60$labs
rownames(data_NCI60) <- labels_NCI60
We can look at the dimension of the data set which contains information about 64 cell lines of different tumor types and 6830 genes.
dim(data_NCI60)
## [1] 64 6830
We can see the list of the tumor types:
labels_NCI60
## [1] "CNS" "CNS" "CNS" "RENAL" "BREAST"
## [6] "CNS" "CNS" "BREAST" "NSCLC" "NSCLC"
## [11] "RENAL" "RENAL" "RENAL" "RENAL" "RENAL"
## [16] "RENAL" "RENAL" "BREAST" "NSCLC" "RENAL"
## [21] "UNKNOWN" "OVARIAN" "MELANOMA" "PROSTATE" "OVARIAN"
## [26] "OVARIAN" "OVARIAN" "OVARIAN" "OVARIAN" "PROSTATE"
## [31] "NSCLC" "NSCLC" "NSCLC" "LEUKEMIA" "K562B-repro"
## [36] "K562A-repro" "LEUKEMIA" "LEUKEMIA" "LEUKEMIA" "LEUKEMIA"
## [41] "LEUKEMIA" "COLON" "COLON" "COLON" "COLON"
## [46] "COLON" "COLON" "COLON" "MCF7A-repro" "BREAST"
## [51] "MCF7D-repro" "BREAST" "NSCLC" "NSCLC" "NSCLC"
## [56] "MELANOMA" "BREAST" "BREAST" "MELANOMA" "MELANOMA"
## [61] "MELANOMA" "MELANOMA" "MELANOMA" "MELANOMA"
Solution
We will use the function prcomp
to calculate the principal components.
pr_output <- prcomp(data_NCI60, center = TRUE)
As in the previous example, we can calculate the loadings:
pdir_pca <- pr_output$rotation
pdir_pca[1:6,1:6]
## PC1 PC2 PC3 PC4 PC5
## 1 -0.005096247 0.0009839929 0.002116058 0.007628801 -0.012118316
## 2 -0.001642354 0.0034355664 0.008049350 0.004910196 -0.007412249
## 3 -0.002509243 -0.0015838271 0.004746350 0.008769557 -0.012426296
## 4 0.004940063 0.0078435347 0.013716214 0.011378816 -0.014851587
## 5 -0.003365039 -0.0002680479 0.010677453 -0.005249648 -0.003317312
## 6 0.001382038 -0.0034431320 0.003167842 0.007425971 -0.002879251
## PC6
## 1 0.0061392242
## 2 0.0114429879
## 3 -0.0002860562
## 4 -0.0065009935
## 5 -0.0003513787
## 6 0.0003210365
The standard deviations :
var_pca <- pr_output$sdev ** 2
head(var_pca)
## [1] 633.2156 352.9278 279.9189 183.0830 163.5573 149.0968
And the principal components scores
scores_pca <- pr_output$x
scores_pca[1:6,1:6]
## PC1 PC2 PC3 PC4 PC5 PC6
## CNS -19.79578 0.1152691 -5.968917 4.753293 -4.882164 18.92591
## CNS -21.54610 -1.4573503 -9.019584 6.767942 -2.247604 17.07273
## CNS -25.05662 1.5260929 -6.959653 2.785913 -10.819648 16.45389
## RENAL -37.40954 -11.3894784 -5.407097 15.442094 -16.011475 33.09651
## BREAST -50.21864 -1.3461737 -17.599944 15.099862 -13.852847 16.94340
## CNS -26.43520 0.4629819 -16.931456 11.389195 -6.742920 11.85838
We will plot the first 2 principal components score vectors with a dot plot, in order to visualize the data. We will color the observations (cell lines) according to their cancer type. In this way we can see to what extent the observations within a cancer type are similar to each other.
scores <- as.data.frame.array(pr_output$x)
scores$tumor_type <- labels_NCI60
scores %>%
ggplot(aes(x = PC1, y = PC2, color = labels_NCI60)) +
geom_point() +
labs(x = "PC1", y = "PC2") +
theme_classic()
summary(pr_output)$importance[,c(1:14)]
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 25.16378 18.78637 16.73078 13.53082 12.78895 12.21052
## Proportion of Variance 0.14893 0.08301 0.06584 0.04306 0.03847 0.03507
## Cumulative Proportion 0.14893 0.23194 0.29777 0.34083 0.37930 0.41437
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 11.05840 10.94492 10.59140 9.576574 9.449313 9.22659
## Proportion of Variance 0.02876 0.02817 0.02638 0.021570 0.021000 0.02002
## Cumulative Proportion 0.44313 0.47130 0.49769 0.519260 0.540260 0.56028
## PC13 PC14
## Standard deviation 8.821954 8.668634
## Proportion of Variance 0.018300 0.017670
## Cumulative Proportion 0.578580 0.596260
As in the previous example we will represent the cumulative proportion of variance explained:
pve <- pr_output$sdev ** 2 / sum(pr_output$sdev ** 2) * 100
pve <- round(pve, 2)
pve_df <- data.frame(principal_component = 1:length(pve), pve = pve)
summ_pca <- summary(pr_output)
cum_pct <- summ_pca$importance["Cumulative Proportion", ] * 100
cum_pct_df <- data.frame(principal_component = 1:length(pve), cum_pct = cum_pct)
cum_pct_gg <- cum_pct_df %>%
ggplot(aes(principal_component, cum_pct)) +
geom_point() +
scale_x_continuous(breaks = 1:length(cum_pct)) +
scale_y_continuous(limits = c(0, 100)) +
labs(x = "Principal Component", y = "Cumulative Percentage of Variance Explained (%)") +
theme_bw()
cum_pct_gg
In this case, we will select the first 8 PC. We can cluster the samples within the heatmap:
pheatmap(scores[, c("PC1", "PC2", "PC3","PC4","PC5", "PC6", "PC7","PC8")], cluster_rows = TRUE, cluster_cols = FALSE, angle_col = 45, labels_row = labels_NCI60, fontsize_row = 5)
sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Madrid
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ISLR_1.4 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.0
## [5] dplyr_1.1.3 purrr_1.0.2 readr_2.1.4 tidyr_1.3.0
## [9] tibble_3.2.1 ggplot2_3.4.4 tidyverse_2.0.0 pheatmap_1.0.12
## [13] BiocStyle_2.28.1
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.7 utf8_1.2.3 generics_0.1.3
## [4] stringi_1.7.12 hms_1.1.3 digest_0.6.33
## [7] magrittr_2.0.3 evaluate_0.22 grid_4.3.2
## [10] timechange_0.2.0 RColorBrewer_1.1-3 bookdown_0.35
## [13] fastmap_1.1.1 jsonlite_1.8.7 BiocManager_1.30.22
## [16] fansi_1.0.5 scales_1.2.1 jquerylib_0.1.4
## [19] cli_3.6.1 rlang_1.1.1 munsell_0.5.0
## [22] withr_2.5.1 cachem_1.0.8 yaml_2.3.7
## [25] tools_4.3.2 tzdb_0.4.0 colorspace_2.1-0
## [28] vctrs_0.6.4 R6_2.5.1 lifecycle_1.0.3
## [31] pkgconfig_2.0.3 pillar_1.9.0 bslib_0.5.1
## [34] gtable_0.3.4 glue_1.6.2 xfun_0.40
## [37] tidyselect_1.2.0 rstudioapi_0.15.0 knitr_1.44
## [40] farver_2.1.1 htmltools_0.5.6.1 rmarkdown_2.25
## [43] labeling_0.4.3 compiler_4.3.2