From SVD to PCA
The applications of the Singular Value Decomposition (SVD) are manifold. In this post, we will focus on its application to Principal Component Analysis (PCA), a dimensionality reduction technique that is widely used in data science.
First, we will learn how to find eigenvalues and eigenvectors of a matrix. Then, we will learn how to use SVD to find principal components.
- Eigenvalues and eigenvectors
- Basic QR iteration
- Improving the convergence of QR iteration
- Singular Value Decomposition (SVD)
- Principal Component Analysis (PCA)
Eigenvalues and eigenvectors
One of the top 10 algorithms of the 20th century is the QR iteration for computing eigenvalues and eigenvectors of a matrix. It transformed a problem that seems incredibly difficult into one that can be solved numerically, efficiently and reliably.
Before we dive into the QR iteration, let’s discuss the Abel–Ruffini theorem a little bit. This theorem states that there is no general algebraic solution (in radicals) to polynomial equations of degree greater than four. Since the eigenvalues of a matrix are the roots of its characteristic polynomial, there is also no closed-form algebraic formula for the eigenvalues of a matrix of order greater than four. That’s why we need an iterative method such as the QR iteration.
Let’s denote by $A = V \Lambda V^{-1}$ the eigenvalue decomposition of a matrix $A$, where the columns $v_1, \dots, v_n$ of $V$ are the eigenvectors and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ holds the eigenvalues. In this post, many of our algorithms will normalize vectors by dividing them by their norm. For example, if $v$ is a vector, then $v / \lVert v \rVert$ is the normalized vector.
Suppose that $A$ is a symmetric matrix of order $n$. Then $A$ has an eigenvalue decomposition $A = V \Lambda V^*$, where $V$ is unitary. One of the nice things about this expression is that it provides a very simple expression for $A^k$:

$$A^k = V \Lambda^k V^*.$$

This equation can be expanded as follows:

$$A^k = \sum_{i=1}^{n} \lambda_i^k \, v_i v_i^*.$$
Notice that even if $A$ is real, it might have complex eigenvalues and eigenvectors, which is why we are using the conjugate transpose $V^*$ instead of just the transpose.

Remark: one needs to ponder a little bit to understand why the expansion above works; as Figure 1 illustrates, each term in the sum is a matrix of dimension $n \times n$.
Let’s now imagine that the eigenvalues satisfy the following condition:

$$|\lambda_1| > |\lambda_2| \geq \cdots \geq |\lambda_n|.$$

Then, even for moderate values of $k$, we expect $|\lambda_1|^k$ to be much larger than the other powers, so the first term dominates the sum above. This means that $A^k$ is essentially determined by the dominant eigenvector, i.e. the eigenvector corresponding to the largest eigenvalue:

$$A^k \approx \lambda_1^k \, v_1 v_1^*.$$

Now, let’s multiply both sides of this approximation by a random vector $w$:

$$A^k w \approx \lambda_1^k \, v_1 (v_1^* w).$$

Since $v_1^* w$ is a scalar, the vector $A^k w$ has dimension $n$, the same as the dimension of $v_1$, and we can write the above equation as

$$A^k w \approx c \, v_1, \qquad c = \lambda_1^k \, (v_1^* w),$$

so repeatedly applying $A$ to a random vector (and normalizing) pushes it toward the direction of $v_1$. The power iteration algorithm is based on this observation. The algorithm is as follows:
- Initialize $v$ to a random unit vector of dimension $n$.
- Repeat the following steps until convergence:
  - Compute $v \leftarrow A v$.
  - Normalize $v$ by dividing it by its norm.
  - Compute $\lambda = v^* A v$.
The algorithm will converge to the dominant eigenvector of $A$ and the corresponding eigenvalue. Notice that we never need to compute the matrix power $A^k$ explicitly, as the power iteration algorithm only needs to compute the product $A v$ for a vector $v$.
import numpy as np

def power_iteration(A: np.ndarray, num_simulations: int):
    """Returns the dominant eigenpair of A.

    Parameters
    ----------
    A : ndarray
        Square matrix.
    num_simulations : int
        Number of power iterations to perform.

    Returns
    -------
    v : ndarray
        Dominant eigenvector.
    lambda : float
        Dominant eigenvalue.
    """
    n, m = A.shape
    if n != m:
        raise ValueError('A must be a square matrix.')
    # generate a random unit vector
    v = np.random.rand(n)
    v /= np.linalg.norm(v)
    for _ in range(num_simulations):
        Av = A @ v
        v = Av / np.linalg.norm(Av)
    lambda_top = v.T @ A @ v
    return v, lambda_top
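As a quick sanity check, here is a minimal usage sketch; the 2-by-2 example matrix is made up and the reference value comes from NumPy's own symmetric eigensolver:

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
v, lam = power_iteration(A, num_simulations=100)
print(lam)                             # should be close to the dominant eigenvalue
print(np.max(np.linalg.eigvalsh(A)))   # reference value, roughly 3.618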
Although the power iteration algorithm above is great, it is somewhat limited. First, it only finds the dominant eigenpair; it cannot compute all eigenvalues and eigenvectors of a matrix. Second, if we already have a good approximation of one of the eigenvalues, there is no way to use this information to accelerate the convergence of the algorithm.
Now, we will introduce a method called inverse iteration that can be used to find an approximate eigenvector when an approximate eigenvalue is known.
Assume that we have an approximation $\mu$ of one of the eigenvalues of $A$, say $\lambda_j$ (which means that $\mu$ is close to $\lambda_j$). Then consider the matrix

$$A - \mu I,$$

which has a very small eigenvalue $\lambda_j - \mu$, and whose inverse

$$(A - \mu I)^{-1}$$

must have one very big eigenvalue $1 / (\lambda_j - \mu)$. The inverse iteration algorithm is as follows:
- Initialize $v$ to a random unit vector of dimension $n$.
- Repeat the following steps until convergence:
  - Compute $v \leftarrow (A - \mu I)^{-1} v$ (in practice, by solving the linear system $(A - \mu I) w = v$ rather than forming the inverse).
  - Normalize $v$ by dividing it by its norm.
  - Compute $\lambda = v^* A v$.
The above algorithm should allow us to find an approximate eigenvector of $A$ very quickly, since we are using extra information about the eigenvalue.
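Here is a minimal sketch of inverse iteration, assuming a shift mu close to the eigenvalue we are after; the function name and arguments are illustrative, and we solve a linear system instead of forming the inverse explicitly:

def inverse_iteration(A: np.ndarray, mu: float, num_simulations: int):
    """Approximate the eigenpair whose eigenvalue is closest to mu (sketch)."""
    n = A.shape[0]
    v = np.random.rand(n)
    v /= np.linalg.norm(v)
    shifted = A - mu * np.eye(n)
    for _ in range(num_simulations):
        # one step of inverse iteration: v <- (A - mu*I)^{-1} v, then normalize
        w = np.linalg.solve(shifted, v)
        v = w / np.linalg.norm(w)
    lambda_approx = v.T @ A @ v
    return v, lambda_approx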
Actually, we can speed up the convergence of the algorithm even further by using Rayleigh quotient iteration. The idea is to use the following equation to compute the shift at every iteration:

$$\mu = \frac{v^* A v}{v^* v}.$$
The algorithm is as follows:
- Initialize $v$ to a random unit vector of dimension $n$ and set the initial shift $\mu = v^* A v$.
- Repeat the following steps until convergence:
  - Compute $v \leftarrow (A - \mu I)^{-1} v$.
  - Normalize $v$ by dividing it by its norm.
  - Compute the new shift $\mu = v^* A v$ (the Rayleigh quotient), which converges to the eigenvalue $\lambda$.
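A sketch of Rayleigh quotient iteration under the same assumptions; note that near convergence the shifted matrix becomes nearly singular, which a production implementation would have to guard against:

def rayleigh_quotient_iteration(A: np.ndarray, num_simulations: int):
    """Refine an eigenpair by updating the shift with the Rayleigh quotient (sketch)."""
    n = A.shape[0]
    v = np.random.rand(n)
    v /= np.linalg.norm(v)
    mu = v.T @ A @ v                                  # initial Rayleigh quotient shift
    for _ in range(num_simulations):
        w = np.linalg.solve(A - mu * np.eye(n), v)    # v <- (A - mu*I)^{-1} v
        v = w / np.linalg.norm(w)
        mu = v.T @ A @ v                              # update the shift
    return v, mu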
Basic QR iteration
The QR iteration algorithm is a very powerful algorithm that can be used to compute all eigenvalues and eigenvectors of a matrix, instead of just one eigenpair.
From now on, we no longer assume that the matrix $A$ is symmetric. We will only assume that $A$ is diagonalizable, with eigenvalues ordered such that $|\lambda_1| > |\lambda_2| > \cdots > |\lambda_n|$.
We begin with an algorithm called orthogonal iteration, which allows us to recover more than one eigenvalue at once. Since the eigenvectors of a diagonalizable matrix $A$ form a basis, we can build up an orthogonal basis for the subspace spanned by the dominant eigenvectors during our power iteration.
Let’s walk through a special case, in which we assume that the matrix $A$ has distinct eigenvalues and we look for the two dominant eigenpairs. The algorithm is as follows:
- Choose two random orthogonal unit vectors $v_1$ and $v_2$.
- Repeat the following steps until convergence:
  - Compute $w_1 = A v_1$ and $w_2 = A v_2$.
  - Project $w_2$ onto the orthogonal complement of $w_1$.
  - Normalize $w_1$ and $w_2$ to obtain the new $v_1$ and $v_2$.
This means $v_1$ and $v_2$ stay orthogonal to each other and span a subspace of dimension 2. The algorithm will converge to an orthonormal basis of the subspace spanned by the two dominant eigenvectors of $A$ (for a symmetric matrix, to the two dominant eigenvectors themselves).
To find the eigenvalues, we need to compute the following values:

$$\lambda_1 \approx v_1^* A v_1, \qquad \lambda_2 \approx v_2^* A v_2.$$

The reason that we call this method QR iteration is that the matrix $V^* A V$, with $V = [\, v_1 \;\; v_2 \,]$, converges to a matrix that is upper-triangular, with $\lambda_1$ and $\lambda_2$ on the diagonal, as shown in Figure 2.
To see why it is upper-triangular, notice that its lower-left entry is $v_2^* A v_1$; since $v_1$ converges to the dominant eigenvector, $A v_1 \approx \lambda_1 v_1$ and therefore

$$v_2^* A v_1 \approx \lambda_1 \, v_2^* v_1 = 0,$$

and so the lower-left entry converges to 0.
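The two-vector version above can be written down directly. Here is an illustrative sketch for a symmetric matrix; the function and variable names are made up:

def orthogonal_iteration_two(A: np.ndarray, num_simulations: int):
    """Recover the two dominant eigenpairs of a symmetric matrix (sketch)."""
    n = A.shape[0]
    v1 = np.random.rand(n)
    v1 /= np.linalg.norm(v1)
    v2 = np.random.rand(n)
    v2 -= (v1 @ v2) * v1                   # make v2 orthogonal to v1
    v2 /= np.linalg.norm(v2)
    for _ in range(num_simulations):
        w1, w2 = A @ v1, A @ v2
        v1 = w1 / np.linalg.norm(w1)
        w2 -= (v1 @ w2) * v1               # project w2 onto the orthogonal complement of v1
        v2 = w2 / np.linalg.norm(w2)
    lambda1, lambda2 = v1 @ A @ v1, v2 @ A @ v2
    return (v1, lambda1), (v2, lambda2)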
The key idea of the QR iteration algorithm is to use the QR decomposition to construct the basis we need above. We want a set of vectors that are orthogonal to each other and span the same subspace as the eigenvectors of $A$, and the QR decomposition allows us to construct such a basis.
Recall that the QR decomposition of a matrix $A$ is given by

$$A = Q R,$$

where $Q$ is an orthogonal matrix and $R$ is an upper-triangular matrix. If you want to refresh your memory about the QR decomposition, you can check out my other post.
def qr_iteration(A: np.ndarray, num_simulations: int):
    """Returns all eigenpairs of A.

    Parameters
    ----------
    A : ndarray
        Square matrix.
    num_simulations : int
        Number of QR iterations to perform.

    Returns
    -------
    v : ndarray
        All eigenvectors (as columns).
    lambda : ndarray
        All eigenvalues.
    """
    n = A.shape[0]
    v = np.eye(n)      # accumulates the orthogonal transformations
    for _ in range(num_simulations):
        Q, R = np.linalg.qr(A)
        A = R @ Q      # similarity transformation Q^T A Q
        v = v @ Q      # for symmetric A, the columns of v converge to the eigenvectors
    lambda_all = A.diagonal()
    return v, lambda_all
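As a quick check on a small made-up symmetric matrix, the diagonal of the iterated matrix should approach the eigenvalues reported by NumPy:

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
V, lams = qr_iteration(A, num_simulations=200)
print(np.sort(lams))                       # eigenvalues from QR iteration
print(np.sort(np.linalg.eigvalsh(A)))      # reference eigenvalues from NumPy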
Remark: when the matrix is symmetric, there are more efficient ways to find the eigenvalues and eigenvectors, such as the Jacobi method and the Householder method. We will not cover these methods in this post.
Improving the convergence of QR iteration
The QR iteration algorithm is very powerful, but it is not very efficient. The reason is that the QR decomposition is expensive to compute: for a dense matrix it costs $O(n^3)$ operations per iteration, much more than the $O(n^2)$ matrix-vector product used in each step of the power iteration. There are two ways to improve the convergence of the QR iteration algorithm:
- Transform the matrix into upper Hessenberg form.
- Use the shift-invert method.
The first variant is called the Hessenberg QR iteration and the second variant is called the shifted QR iteration. The first one is very easy to implement: we just transform the matrix into upper Hessenberg form and then perform the QR iteration. The second one is a little bit more involved, but it is still easy to implement.
import scipy as sp
import scipy.linalg  # make sure the linalg submodule is available as sp.linalg

def qr_iteration_with_hessenberg(A: np.ndarray, num_simulations: int):
    # reduce A to upper Hessenberg form; calc_q=True also returns P with A = P @ H @ P.T
    H, P = sp.linalg.hessenberg(A, calc_q=True)
    v = P
    for _ in range(num_simulations):
        Q, R = np.linalg.qr(H)
        H = R @ Q
        v = v @ Q  # accumulate transformations; for symmetric A these converge to eigenvectors
    lambda_all = H.diagonal()
    return v, lambda_all
The Hessenberg QR iteration algorithm is much more efficient than the basic QR iteration algorithm, because the QR decomposition of an upper Hessenberg matrix requires only $O(n^2)$ operations, whereas the QR decomposition of a dense matrix requires $O(n^3)$ operations. However, the Hessenberg QR iteration still converges much more slowly than the shifted QR iteration algorithm.
The shifted QR iteration algorithm combines the QR iteration algorithm with the shifting idea from inverse iteration. When we talked about inverse iteration, we stated that if we know an approximation $\mu$ of an eigenvalue, then the matrix

$$(A - \mu I)^{-1}$$

must have one very big eigenvalue, which lets us recover the corresponding eigenvector very quickly. The shifted QR iteration algorithm uses the same idea.
Notice that if $H_k$, which is converging to an upper-triangular matrix, had the following form

$$H_k = \begin{pmatrix} H' & h \\ 0 & \mu \end{pmatrix},$$

then $\mu$ is an eigenvalue of $A$ and the remaining block $H'$ has the same remaining eigenvalues as $A$. The shifted QR iteration algorithm uses this idea to compute the eigenvalues one at a time.
The algorithm is as follows:
- Set $H_0 = Q_0^* A Q_0$, where $Q_0$ is the orthogonal matrix that transforms $A$ into upper Hessenberg form.
- Repeat the following steps until convergence:
  - Compute a shift $\mu_k$ using Wilkinson’s shift given below.
  - Compute the QR decomposition $H_k - \mu_k I = Q_k R_k$.
  - Set $H_{k+1} = R_k Q_k + \mu_k I$.
We are not using Rayleigh quotient iteration to compute the shift, because it requires solving a linear system with $A - \mu I$ at every step, which is as expensive as inverting a matrix. Instead, we use Wilkinson’s shift, which is computed from the trailing $2 \times 2$ submatrix

$$\begin{pmatrix} a & b \\ b & c \end{pmatrix}$$

of $H_k$ and is given by

$$\mu = c - \frac{\operatorname{sign}(\delta)\, b^2}{|\delta| + \sqrt{\delta^2 + b^2}},$$

where $\delta = (a - c)/2$.
Here is the implementation of the shifted QR iteration algorithm.
def wilkinson_shift(a, b, c):
    # Calculate Wilkinson's shift for the trailing 2x2 block [[a, b], [b, c]] of a symmetric matrix
    delta = (a - c) / 2
    shift = c - np.sign(delta) * b**2 / (np.abs(delta) + np.sqrt(delta**2 + b**2))
    return shift
def qr_with_shift(A: np.ndarray, num_iterations: int):
    n, m = A.shape
    if n != m:
        raise ValueError('A must be a square matrix.')
    I = np.eye(n)
    # calc_q=True also returns the orthogonal matrix P such that A = P @ H @ P.T
    H, P = sp.linalg.hessenberg(A, calc_q=True)
    v = P
    for _ in range(num_iterations):
        # Wilkinson's shift from the trailing 2x2 block [[a, b], [b, c]] of H
        u = wilkinson_shift(H[n-2, n-2], H[n-2, n-1], H[n-1, n-1])
        Q, R = np.linalg.qr(H - u*I)
        H = R @ Q + u*I
        v = v @ Q  # accumulate transformations; for symmetric A these converge to eigenvectors
    lambda_all = H.diagonal()
    return v, lambda_all
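Reusing the small symmetric test matrix from before, the shifted iteration should reproduce the same eigenvalues, typically in fewer iterations:

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
V, lams = qr_with_shift(A, num_iterations=50)
print(np.sort(lams))                       # should approach the reference values below
print(np.sort(np.linalg.eigvalsh(A)))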
Singular Value Decomposition (SVD)
For a matrix $A$ of size $m \times n$, the SVD is given by

$$A = U \Sigma V^*,$$

where $U$ is an $m \times m$ orthogonal matrix, $\Sigma$ is an $m \times n$ diagonal matrix containing the singular values, and $V$ is an $n \times n$ orthogonal matrix. How could we find the SVD of a matrix?
Calculating the SVD of a matrix reduces to an eigenvalue problem: we just need to find the eigenvalues and eigenvectors of $A A^*$ and $A^* A$, because

$$A A^* = U \Sigma \Sigma^* U^*, \qquad A^* A = V \Sigma^* \Sigma V^*.$$

In particular, the eigenvectors of $A A^*$ are the columns of $U$, the eigenvectors of $A^* A$ are the columns of $V$, and the nonzero eigenvalues of both matrices are the squares of the singular values of $A$.
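As a small illustration with a made-up matrix, we can recover the singular values from the eigendecomposition of $A^* A$ and compare them against NumPy's SVD:

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])
eigvals, V = np.linalg.eigh(A.T @ A)        # eigenpairs of A^T A (ascending order)
order = np.argsort(eigvals)[::-1]           # sort in descending order
eigvals, V = eigvals[order], V[:, order]
singular_values = np.sqrt(eigvals)
U = (A @ V) / singular_values               # left singular vectors, column by column
print(singular_values)
print(np.linalg.svd(A, compute_uv=False))   # reference singular values from NumPy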
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a very powerful technique that is used to reduce the dimensionality of a dataset. The idea is to find a new orthogonal basis whose vectors are linear combinations of the original features. The new basis vectors are called the principal components, and the coordinates of the data in this basis are called the principal component scores. The principal components are the eigenvectors of the covariance matrix of the dataset, and the corresponding eigenvalues measure how much variance each component explains.
Suppose we have a data matrix $X$ with $n$ samples (rows) and $p$ features (columns), with each feature centered. The covariance matrix of $X$ is given by

$$C = \frac{1}{n-1} X^T X.$$

The covariance matrix $C$ is a symmetric $p \times p$ matrix. The eigenvectors of $C$ are the principal components of $X$, and the eigenvalues of $C$ tell us how much variance each principal component explains.
When a matrix is symmetric, its eigenvectors are orthogonal to each other. This means we can write

$$C = W \Lambda W^T, \qquad W^T W = I,$$

where the columns of $W$ are the principal components and $\Lambda$ is the diagonal matrix of their variances.
What PCA does is to project the data onto the principal components, which are the eigenvectors corresponding to the largest eigenvalues.
The PCA method makes the fundamental assumption that the data is centered, meaning that the mean of each feature is zero. If the data is not centered, then we need to subtract the mean of each feature from the data before applying PCA. It is also common to standardize the data so that the variance of each feature is one: if the features are on very different scales, we divide each feature by its standard deviation before applying PCA, otherwise the features with the largest variance dominate the principal components.
We then obtain a linear transformation of the data whose new variables are uncorrelated.
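Putting this together, here is a minimal PCA sketch on made-up data; the principal components come from the SVD of the centered (and here also standardized) data matrix, and all variable names are illustrative:

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.3, 0.2]])   # made-up correlated data
X_centered = X - X.mean(axis=0)                # make each feature zero-mean
X_std = X_centered / X_centered.std(axis=0)    # give each feature unit variance
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
components = Vt                                # principal components (one per row)
scores = X_std @ Vt.T                          # principal component scores
explained_variance = S**2 / (X.shape[0] - 1)   # eigenvalues of the covariance matrix
print(explained_variance)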