Principal Component Analysis

Introduction

Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the number of features in a dataset while retaining most of the information.

Steps to Perform PCA in R

We load the built-in iris dataset, standardize its four numeric measurements, and then compute the PCA with prcomp().

library(factoextra)

library(plotly)


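# Load the built-in iris dataset (four numeric measurements plus Species)
data <- iris
# Run PCA on the numeric columns; scale. = TRUE standardizes each to unit variance
pca_result <- prcomp(data[, 1:4], scale. = TRUE)
pca_result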

Standard deviations (1, .., p=4):
[1] 1.7083611 0.9560494 0.3830886 0.1439265

Rotation (n x k) = (4 x 4):
                    PC1         PC2        PC3        PC4
Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971
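
To make the link between standardization and PCA concrete, here is a minimal sketch that reproduces the standard deviations and loadings above from an eigen decomposition of the correlation matrix. The names `X`, `eig`, `manual_sdev`, and `pca_check` are introduced only for this illustration.

```r
# Sketch: PCA on standardized data is an eigen decomposition of the correlation matrix
X <- iris[, 1:4]
eig <- eigen(cor(X))             # correlation matrix = covariance of the standardized data
manual_sdev <- sqrt(eig$values)  # square roots of the eigenvalues are the component sdevs
pca_check <- prcomp(X, scale. = TRUE)

# Compare the hand-computed values with prcomp()
all.equal(manual_sdev, pca_check$sdev)
# Eigenvectors agree with the rotation matrix up to sign, so compare absolute values
all.equal(abs(eig$vectors), abs(unname(pca_check$rotation)))
```

The comparison uses absolute values because the sign of each component is arbitrary and can differ between the two methods.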

We print the summary of the PCA result, which includes the standard deviation of each principal component and the proportion of variance explained.

summary(pca_result)

Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
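
The proportions in this table can be derived directly from the standard deviations. The following sketch shows one way to do it; `pca_check`, `variances`, `prop_var`, and `cum_var` are illustrative names, not part of the original analysis.

```r
# Sketch: recompute the "Importance of components" rows by hand
pca_check <- prcomp(iris[, 1:4], scale. = TRUE)
variances <- pca_check$sdev^2             # variance of each component
prop_var <- variances / sum(variances)    # proportion of variance
cum_var <- cumsum(prop_var)               # cumulative proportion
round(rbind(prop_var, cum_var), 4)
```

Because the data are standardized, the component variances sum to the number of variables (4 here), so each proportion is simply the variance divided by 4.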

Visualize PCA Results

Scree Plot

A scree plot shows the proportion of variance explained by each principal component.

fviz_eig(pca_result)

Biplot

A biplot shows the scores of the samples and the loadings of the variables on the first two principal components.

plt <- fviz_pca_biplot(pca_result, geom.ind = "point", pointshape = 21,
                       pointsize = 2, fill.ind = iris$Species,
                       col.var = "black", repel = TRUE)
plt
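
The sample scores drawn in the biplot are the standardized data projected onto the loadings. As a rough sketch (with `Z`, `scores`, and `pca_check` as illustrative names), they can be recomputed like this:

```r
# Sketch: scores = standardized data multiplied by the rotation (loading) matrix
pca_check <- prcomp(iris[, 1:4], scale. = TRUE)
Z <- scale(iris[, 1:4])               # center each column and scale to unit variance
scores <- Z %*% pca_check$rotation    # project the samples onto the components

# The result should match the scores stored by prcomp()
all.equal(scores, pca_check$x, check.attributes = FALSE)
head(scores[, 1:2])                   # coordinates used for the first two axes
```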

Interpretation

The scree plot helps decide how many principal components to retain: look for the elbow point where the explained variance starts to level off. Here the first two components together explain about 95.8% of the variance (see the summary above), so little is lost by dropping the rest.
The biplot visualizes both the samples (points) and the variables (arrows). Points that are close to each other represent samples with similar characteristics, while the direction and length of the arrows indicate the contribution of each variable to the principal components.
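
The elbow is judged by eye. A common numeric alternative, shown here as a sketch rather than as the method used in this tutorial, is to keep the smallest number of components whose cumulative variance reaches a chosen cutoff. The 0.95 threshold and the names `threshold`, `cum_var`, and `n_keep` are assumptions made for illustration.

```r
# Sketch: pick the number of components by a cumulative-variance cutoff
pca_check <- prcomp(iris[, 1:4], scale. = TRUE)
cum_var <- cumsum(pca_check$sdev^2) / sum(pca_check$sdev^2)

threshold <- 0.95                          # hypothetical cutoff, adjust as needed
n_keep <- which(cum_var >= threshold)[1]   # first component count reaching the cutoff
n_keep
```

For the iris data this selects two components, in line with the cumulative proportion of 0.9581 reported in the summary.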

Visualization of PCA in a 3D Scatter Plot

A 3D scatter plot lets us see the relationships among the first three principal components simultaneously. According to the summary above, these three components together account for about 99.5% of the total variance.

It also allows for interactive exploration: we can rotate the plot and view it from different angles.

We will create this plot with the plotly package.

# rank. = 3 keeps only the first three components in the rotation and the scores
pca_result2 <- prcomp(data[, 1:4], scale. = TRUE, rank. = 3)
pca_result2

Standard deviations (1, .., p=4):
[1] 1.7083611 0.9560494 0.3830886 0.1439265

Rotation (n x k) = (4 x 3):
                    PC1         PC2        PC3
Sepal.Length  0.5210659 -0.37741762  0.7195664
Sepal.Width  -0.2693474 -0.92329566 -0.2443818
Petal.Length  0.5804131 -0.02449161 -0.1421264
Petal.Width   0.5648565 -0.06694199 -0.6342727
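
Note that the output still lists four standard deviations while the rotation has only three columns. The sketch below checks that `rank.` only truncates the result and does not change the retained components; `full` and `top3` are illustrative names.

```r
# Sketch: rank. = 3 truncates the output but leaves the first three components unchanged
full <- prcomp(iris[, 1:4], scale. = TRUE)
top3 <- prcomp(iris[, 1:4], scale. = TRUE, rank. = 3)

all.equal(full$x[, 1:3], top3$x)                 # same scores
all.equal(full$rotation[, 1:3], top3$rotation)   # same loadings
length(top3$sdev)                                # sdev is still reported for all components
```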

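Next, we create a data frame of the three principal components and negate PC2 and PC3. The sign of a principal component is arbitrary, so this flip is purely a visual preference: it makes the plot look more organised and symmetric in 3D space without changing the structure of the data.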

components <- as.data.frame(pca_result2$x)
components$PC2 <- -components$PC2
components$PC3 <- -components$PC3

fig <- plot_ly(components, 
               x = ~PC1, 
               y = ~PC2, 
               z = ~PC3, 
               color = ~data$Species, 
               colors = c('darkgreen','darkblue','darkred')) %>%
  add_markers(size = 12)

fig <- fig %>%
  layout(title = "3d Visualization of PCA",
         scene = list(bgcolor = "lightgray"))
fig
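
As a sanity check on the sign flip applied to PC2 and PC3 before plotting, the following sketch confirms that negating components leaves the pairwise distances between samples and the variance of each component unchanged. The names `scores` and `flipped` are introduced only for this illustration.

```r
# Sketch: negating components mirrors the point cloud but preserves its geometry
scores <- prcomp(iris[, 1:4], scale. = TRUE, rank. = 3)$x
flipped <- scores
flipped[, c("PC2", "PC3")] <- -flipped[, c("PC2", "PC3")]

# Pairwise distances between samples are identical before and after the flip
all.equal(as.vector(dist(scores)), as.vector(dist(flipped)))
# The variance of each component is identical as well
all.equal(apply(scores, 2, var), apply(flipped, 2, var))
```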