Principal Component Analysis (PCA) is a linear dimensionality reduction method that represents high-dimensional data with a low-dimensional code.
If we can find a linear manifold in a high-dimensional space, we can project the data onto this manifold and represent each datapoint simply by where it lies on the manifold. Given a data matrix X, PCA is usually performed in the following steps:
1. Center the data by subtracting the mean of each feature.
2. Compute XᵀX of the centered data.
3. Find the eigenvectors of XᵀX corresponding to the k largest eigenvalues.
4. Project the data onto these eigenvectors to obtain the low-dimensional code.
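As a concrete illustration, here is a minimal numpy sketch of these steps (the function name pca and all variable names are mine, not from any particular library):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (n samples x m features) onto the top-k
    principal components, following the steps listed above."""
    # 1. Center the data: subtract the per-feature mean.
    X_centered = X - X.mean(axis=0)
    # 2. Form X^T X of the centered data (proportional to the covariance matrix).
    C = X_centered.T @ X_centered
    # 3. Eigendecompose; eigh returns eigenvalues in ascending order,
    #    so reverse the column order and keep the first k eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(C)
    V = eigvecs[:, ::-1][:, :k]
    # 4. Project: the low-dimensional code is Y = X_centered V.
    Y = X_centered @ V
    return Y, V
```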
A typical explanation of PCA:
What PCA does is take m-dimensional data and find the k orthogonal directions (also called principal components) in which the data have the most variance. These k principal components span a lower-dimensional subspace, so we can represent an m-dimensional datapoint by its projections onto the k principal directions. This loses all information about where the datapoint is located in the remaining orthogonal directions. However, since PCA chooses the k components that explain the largest variance, the left-out directions do not cover much variance, so we have not lost much information.
The term “variance” shows up a lot in this typical explanation of PCA. Intuitively this makes sense: we want to choose the directions along which the data spread out the most, which are the directions of larger “variance.” However, resources that justify this point mathematically are limited.
The purpose of this blog post is as follows: to show mathematically that PCA chooses the projection that minimizes the reconstruction error, to explain why this is equivalent to finding the directions of largest variance when the data are centered, and to connect PCA to the Autoencoder.
Say X is our original data matrix and V is a matrix whose columns are orthonormal (Vᵀ V = I). We define Y as:

Y = XV (1)
V is the matrix used for the projection, and Y is the projected matrix. The dimensions of these matrices are as follows: X is n × m, V is m × k, and Y is n × k, with n being the number of samples, m being the number of features, and k being the number of features in the projected space. When we try to project this back to the original space, we can do:

X' = YVᵀ (2)
We call X' the reconstructed matrix. Combining (1) and (2), we have:

X' = XVVᵀ (3)
The information loss (we call this the reconstruction error) through this process is:

‖X − X'‖² = ‖X − XVVᵀ‖² (4)

where ‖·‖ denotes the Frobenius norm.
This reconstruction error is what we want to minimize when choosing a projection matrix: we always want to minimize the information loss, and this is the fundamental metric for evaluating whether a projection is good.
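To make equations (1)–(4) concrete, here is a small numpy sketch that computes the projected matrix, the reconstruction, and the reconstruction error for an arbitrary orthonormal V (the variable names here are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 100, 5, 2              # samples, original features, projected features
X = rng.normal(size=(n, m))      # data matrix, n x m

# An arbitrary matrix V with orthonormal columns (V^T V = I), m x k.
V, _ = np.linalg.qr(rng.normal(size=(m, k)))

Y = X @ V                        # (1) projected matrix, n x k
X_recon = Y @ V.T                # (2), (3) reconstructed matrix, n x m

# (4) reconstruction error: squared Frobenius norm of the difference.
recon_error = np.linalg.norm(X - X_recon, ord='fro') ** 2
print(recon_error)
```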
Proposition 1. Minimizing reconstruction error is equivalent to maximizing trace(Vᵀ Xᵀ XV ).
It’s easy to show that, for any matrix A:

‖A‖² = trace(AᵀA) (5)
Then (4) becomes:

trace((X − XVVᵀ)ᵀ(X − XVVᵀ)) (6)
Expanding this, and using the cyclic property of the trace together with VᵀV = I, we have:

trace((X − XVVᵀ)ᵀ(X − XVVᵀ))
= trace(XᵀX − XᵀXVVᵀ − VVᵀXᵀX + VVᵀXᵀXVVᵀ) (7)
= trace(XᵀX) − 2 trace(VᵀXᵀXV) + trace(VᵀXᵀXV) (8)
= trace(XᵀX) − trace(VᵀXᵀXV) (9)
We want to minimize (9). Since X is fixed, trace(XᵀX) is a constant, so this is equivalent to finding the V that maximizes trace(VᵀXᵀXV), which completes the proof.
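As a sanity check on Proposition 1, the identity in (9) can be verified numerically for a random orthonormal V (a quick sketch reusing the setup above):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 100, 5, 2
X = rng.normal(size=(n, m))
V, _ = np.linalg.qr(rng.normal(size=(m, k)))   # orthonormal columns

# Left-hand side: the reconstruction error (4).
lhs = np.linalg.norm(X - X @ V @ V.T, ord='fro') ** 2
# Right-hand side: trace(X^T X) - trace(V^T X^T X V), as in (9).
rhs = np.trace(X.T @ X) - np.trace(V.T @ X.T @ X @ V)

print(np.isclose(lhs, rhs))                    # True, up to floating-point error
```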
Now, we will prove the following statement:

Proposition 2. trace(VᵀXᵀXV) is maximized when the columns of V are the eigenvectors of XᵀX corresponding to its k largest eigenvalues.
First, we note that XᵀX is symmetric and positive semi-definite, so it can be written as:

XᵀX = λ₁u₁u₁ᵀ + λ₂u₂u₂ᵀ + … + λₘuₘuₘᵀ (12)

where u₁, …, uₘ are orthonormal eigenvectors of XᵀX and λ₁ ≥ λ₂ ≥ … ≥ λₘ ≥ 0 are the corresponding eigenvalues.
Let v₁, …, vₖ be the column vectors of V; we also have:

trace(VᵀXᵀXV) = v₁ᵀXᵀXv₁ + v₂ᵀXᵀXv₂ + … + vₖᵀXᵀXvₖ (13)
From (12) and (13), we will have (proof omitted here):

trace(VᵀXᵀXV) ≤ λ₁ + λ₂ + … + λₖ (14)
It is obvious that this bound is attained when vⱼ = uⱼ for j = 1, …, k. In other words, trace(VᵀXᵀXV) achieves its maximum value when V is composed of the eigenvectors of XᵀX corresponding to its k largest eigenvalues.
At this point, we have completed the proofs of Proposition 1 and Proposition 2. Combining them, we reach the conclusion: to achieve the minimum reconstruction error, we can use a projection matrix composed of the eigenvectors of XᵀX that correspond to its top k eigenvalues.
This is exactly what PCA does. Thus, PCA is constructed to choose the k components that can minimize the reconstruction error.
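This conclusion can also be checked empirically: the top-k eigenvectors of XᵀX should give a reconstruction error no larger than that of a randomly chosen orthonormal V (a sketch under an assumed toy setup):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, k = 200, 6, 2
# Toy data with unequal spread along each feature.
X = rng.normal(size=(n, m)) * np.array([5, 3, 2, 1, 0.5, 0.1])

def recon_error(X, V):
    """Reconstruction error (4) for a projection matrix V."""
    return np.linalg.norm(X - X @ V @ V.T, ord='fro') ** 2

# Top-k eigenvectors of X^T X (eigh returns eigenvalues in ascending order).
_, eigvecs = np.linalg.eigh(X.T @ X)
V_pca = eigvecs[:, ::-1][:, :k]

# A random orthonormal V for comparison.
V_rand, _ = np.linalg.qr(rng.normal(size=(m, k)))

print(recon_error(X, V_pca) <= recon_error(X, V_rand))   # True
```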
Notice that nowhere in this derivation did we center the data. So even if we don't center the data, PCA can still give us the components that minimize the reconstruction error.
What will the situation be if we do center the data?
If X is centered (each feature has zero mean), then XᵀX/(n − 1) is exactly the sample covariance matrix of the data, and for each column vⱼ of V, the quantity vⱼᵀXᵀXvⱼ/(n − 1) is the sample variance of the data projected onto vⱼ. Maximizing trace(VᵀXᵀXV) is therefore the same as maximizing the total variance of the projected data.
To this point, we have mathematically shown why PCA on centered data finds the k components on which the data have the most variance.
So, choosing the components with the largest data variance is an interpretation of PCA that holds when the data are centered. When the data are not centered, we lose this interpretation.
However, we can still perform the PCA steps, and the projection matrix found that way can also minimize the reconstruction error defined in (4).
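Here is a small numerical check of the centered-data interpretation: when X is centered, the total variance of the projected coordinates matches trace(VᵀXᵀXV)/(n − 1) (a sketch with my own variable names):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, k = 500, 4, 2
# Toy data with a non-zero mean.
X = rng.normal(size=(n, m)) + np.array([10.0, -5.0, 3.0, 0.0])

Xc = X - X.mean(axis=0)                  # center the data

# Projection matrix from the top-k eigenvectors of Xc^T Xc.
_, eigvecs = np.linalg.eigh(Xc.T @ Xc)
V = eigvecs[:, ::-1][:, :k]

Y = Xc @ V                               # projected (centered) data

# Total sample variance of the projected coordinates ...
total_proj_var = Y.var(axis=0, ddof=1).sum()
# ... equals trace(V^T Xc^T Xc V) / (n - 1).
print(np.isclose(total_proj_var, np.trace(V.T @ Xc.T @ Xc @ V) / (n - 1)))  # True
```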
The term “reconstruction error” often reminds us of a well-known model called the “Autoencoder.” In fact, we can indeed perform PCA with an Autoencoder.
If we train an Autoencoder network whose hidden layer and output layer are both linear, it will learn hidden units that are a linear function of the data and that minimize the squared reconstruction error. These hidden units will span the same space as the first k components found by PCA.
These hidden units tend to have equal variances, unlike the principal components in PCA, which have the largest variances in decreasing order. However, since they span the same space as the k components PCA finds, the reconstruction error will be the same as what PCA achieves.
We can use gradient descent or any other optimization algorithm to find this hidden code. In this view, PCA is reduced to the same setting as other machine learning algorithms: they are all solving an optimization problem.
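A minimal sketch of this idea, assuming a plain numpy implementation with full-batch gradient descent (all names here are mine, and in practice a deep-learning framework would normally be used): a linear autoencoder trained to minimize the squared reconstruction error should end up with roughly the same error as the PCA projection.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, k = 300, 5, 2
X = rng.normal(size=(n, m)) * np.array([4, 3, 2, 1, 0.5])
Xc = X - X.mean(axis=0)                          # center the data

# Linear autoencoder: code = Xc @ W_enc, reconstruction = code @ W_dec.
W_enc = rng.normal(scale=0.1, size=(m, k))
W_dec = rng.normal(scale=0.1, size=(k, m))
lr = 5e-5

for _ in range(20000):
    code = Xc @ W_enc                            # hidden (linear) code
    recon = code @ W_dec                         # linear reconstruction
    err = recon - Xc
    # Gradients of the squared reconstruction error w.r.t. both weight matrices.
    grad_dec = 2 * code.T @ err
    grad_enc = 2 * Xc.T @ (err @ W_dec.T)
    W_enc -= lr * grad_enc
    W_dec -= lr * grad_dec

ae_error = np.linalg.norm(Xc - (Xc @ W_enc) @ W_dec, ord='fro') ** 2

# PCA reconstruction error with the top-k eigenvectors of Xc^T Xc, for comparison.
_, eigvecs = np.linalg.eigh(Xc.T @ Xc)
V = eigvecs[:, ::-1][:, :k]
pca_error = np.linalg.norm(Xc - Xc @ V @ V.T, ord='fro') ** 2

print(ae_error, pca_error)   # the two errors should be close after enough steps
```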
Note that this equivalence relies on the restriction that the Autoencoder network is linear. If we make the network nonlinear, it becomes a regular Autoencoder and can find encodings that are nonlinear functions of the input. Thus, we can use an Autoencoder to perform nonlinear dimensionality reduction.