Principal Component Analysis (PCA)


What is Principal Component Analysis?

Principal component analysis, or PCA, is a method that rotates the dataset in such a way that the rotated features are statistically uncorrelated. This rotation is often followed by selecting only a subset of the new features, according to how important they are for explaining the data. In other words, PCA is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

This is easiest to explain by way of example. Here are some triangles arranged in the shape of an oval:

[Figure: data points (triangles) scattered in an oval shape]

Imagine that the triangles are points of data. To find the direction with the most variance, find the straight line along which the data is most spread out when projected onto it. A vertical straight line with the points projected onto it will look like this:

[Figure: the points projected onto a vertical line]

The data is not very spread out here, so it does not have a large variance. This is probably not the principal component.

A horizontal line with the points projected onto it will look like this:

[Figure: the points projected onto a horizontal line]

On this line the data is much more spread out: it has a large variance. In fact, there is no straight line you can draw that has a larger variance than the horizontal one. The horizontal line is therefore the principal component here.
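To make the line-projection idea concrete, here is a small NumPy sketch (the point cloud is made up for illustration) comparing the variance of the data projected onto a vertical versus a horizontal line:

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up oval-shaped point cloud: wide along x, narrow along y.
x = rng.normal(0.0, 3.0, 200)   # large horizontal spread
y = rng.normal(0.0, 1.0, 200)   # small vertical spread
data = np.column_stack([x, y])

def projected_variance(points, direction):
    """Variance of the points after projecting them onto a unit direction."""
    direction = direction / np.linalg.norm(direction)
    return np.var(points @ direction)

print(projected_variance(data, np.array([0.0, 1.0])))  # vertical line: small
print(projected_variance(data, np.array([1.0, 0.0])))  # horizontal line: large
```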

Luckily, we can use math to find the principal component rather than drawing lines and unevenly shaped triangles. This is where eigenvectors and eigenvalues come in.

Eigenvectors and Eigenvalues

Eigenvectors and eigenvalues exist in pairs: every eigenvector has a corresponding eigenvalue. An eigenvector is a direction; in the example above, the eigenvector was the direction of the line (vertical, horizontal, 45 degrees, etc.). An eigenvalue is a number telling you how much variance there is in the data in that direction; in the example above, the eigenvalue is a number telling us how spread out the data is on the line. The eigenvector with the highest eigenvalue is therefore the principal component.

The number of eigenvector/eigenvalue pairs equals the number of dimensions of the data set. The reason for this is that the eigenvectors put the data into a new set of dimensions, and the number of these new dimensions has to equal the original number of dimensions. This sounds complicated, but again an example should make it clear.

Here is the graph of the oval:

At the moment the oval is on the x-y axes, where x could be age and y hours spent on the internet. These are the two dimensions that my data set is currently being measured in. Now remember that the principal component of the oval was the line splitting it lengthways.

It turns out the other eigenvector (there are only two of them, as this is a 2-D problem) is perpendicular to the principal component. As we said, the eigenvectors have to span the whole x-y area; to do this most effectively, the two directions need to be orthogonal to one another. This is why the x and y axes are orthogonal to each other in the first place.

The eigenvectors have given us a much more useful set of axes to frame the data in. We can now frame the data in these new dimensions. It would look like this:

Note that nothing has been done to the data itself. We're just looking at it from a different angle. So getting the eigenvectors takes you from one set of axes to another. These new axes are much more intuitive to the shape of the data. These directions are where there is the most variation, and that is where there is the most information.
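As a sketch of this "change of angle", assuming made-up correlated 2-D data (age versus hours on the internet, as in the example), we can compute the covariance eigenvectors, check that they are orthogonal, and re-express the data on the new axes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up correlated 2-D data: x ~ age, y ~ hours on the internet.
data = rng.multivariate_normal([30.0, 4.0], [[25.0, 9.0], [9.0, 4.0]], size=500)

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: cov is symmetric

# The two eigenvectors are orthogonal, as discussed above.
print(np.dot(eigenvectors[:, 0], eigenvectors[:, 1]))  # ~0.0

# "Looking from a different angle": express the data on the eigenvector axes.
rotated = centered @ eigenvectors
print(np.cov(rotated, rowvar=False).round(3))  # ~diagonal: uncorrelated features
```

The covariance matrix of the rotated data comes out (approximately) diagonal, which is exactly the "statistically uncorrelated features" property from the opening paragraph.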

In short:

Total Sample Variance = Sum of Eigenvalues
Eigenvector with highest eigenvalue = Principal Component
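Both identities are easy to verify numerically; here is a small check on made-up data (total sample variance is the trace of the covariance matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(100, 3))  # made-up 3-D data

cov = np.cov(data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Total sample variance (trace of the covariance matrix) = sum of eigenvalues.
print(np.isclose(np.trace(cov), eigenvalues.sum()))  # True

# The eigenvector with the highest eigenvalue is the principal component.
print(eigenvectors[:, np.argmax(eigenvalues)])
```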


Important concepts

  • Variance: measures how far the values of a variable are spread out from their mean
  • Covariance: measures how much two variables change together
  • Covariance matrix: a matrix that collects the variances and pairwise covariances of multiple variables (see the short numerical illustration after this list)
  • Vectors & matrices: a vector is a 1 x n array, while a matrix is an m x n array
  • Matrix decomposition: a factorization of a matrix into a product of matrices
  • Eigendecomposition: the factorization of a matrix into a standard form, where it is represented in terms of its eigenvalues and eigenvectors
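
To ground the first three definitions, here is a short NumPy illustration (the values are arbitrary):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

# Variance: how far the values of x are spread out from their mean.
print(np.var(x, ddof=1))  # sample variance

# Covariance: how much x and y change together.
print(np.cov(x, y)[0, 1])

# Covariance matrix: variances on the diagonal, covariances off the diagonal.
print(np.cov(np.column_stack([x, y]), rowvar=False))
```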

Matrix Decomposition

There are two ways to perform PCA:

  • Eigenvalue Decomposition (of the covariance matrix)

    The steps of the eigenvalue decomposition are as follows (a minimal Python sketch is given after the list):

    1. Center the original data X
    2. Calculate the covariance matrix C
    3. Calculate the eigenvalues and eigenvectors of C
    4. Find the transformation matrix V by selecting eigenvectors with highest eigenvalues (variance)
    5. Derive the new data set by taking Y = XV
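
A minimal sketch of these five steps, assuming NumPy and a made-up data matrix X (the function and variable names are illustrative):

```python
import numpy as np

def pca_evd(X, n_components):
    """PCA via eigendecomposition of the covariance matrix (steps 1-5)."""
    # 1. Center the original data X.
    X_centered = X - X.mean(axis=0)

    # 2. Calculate the covariance matrix C.
    C = np.cov(X_centered, rowvar=False)

    # 3. Calculate the eigenvalues and eigenvectors of C.
    eigenvalues, eigenvectors = np.linalg.eigh(C)

    # 4. Build the transformation matrix V from the eigenvectors
    #    with the highest eigenvalues (most variance).
    order = np.argsort(eigenvalues)[::-1]
    V = eigenvectors[:, order[:n_components]]

    # 5. Derive the new data set by taking Y = XV.
    return X_centered @ V

X = np.random.default_rng(3).normal(size=(50, 5))  # made-up data
Y = pca_evd(X, n_components=2)
print(Y.shape)  # (50, 2)
```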


  • Singular Value Decomposition

    SVD breaks the (centered) data matrix directly into the product of three other matrices, X = U Σ Vᵀ, without forming the covariance matrix first. The columns of V are the principal directions, and the singular values are related to the eigenvalues of the covariance matrix. A minimal Python sketch is given below.
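
A minimal SVD-based sketch under the same assumptions as above (made-up data, illustrative names); the rows of Vt play the role of the covariance eigenvectors:

```python
import numpy as np

def pca_svd(X, n_components):
    """PCA via singular value decomposition: X = U @ diag(S) @ Vt."""
    X_centered = X - X.mean(axis=0)

    # Break the centered matrix into three other matrices.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

    # Singular values relate to covariance eigenvalues: lambda = s**2 / (n - 1).
    eigenvalues = S**2 / (len(X) - 1)

    # Rows of Vt are the principal directions, already sorted by variance.
    return X_centered @ Vt[:n_components].T, eigenvalues

X = np.random.default_rng(4).normal(size=(50, 5))  # made-up data
Y, eigenvalues = pca_svd(X, n_components=2)
print(Y.shape, eigenvalues.round(3))
```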

    EVD & SVD Summary:

    Both approaches decompose our data matrix into the same important vectors: the principal components.

Application in Python

You can find an application of PCA built from scratch in Python here.