Background

Recently, I have been thinking about how to present an example that is both easy to understand and practical, so I decided to write an article that not only implements PCA but also shows how it works.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a powerful technique for analyzing data with many variables. PCA reduces the dimensionality of the data while preserving as much variance as possible. It is usually applied to:

  • Reduce dimensionality while retaining as much information as possible
  • Transform correlated variables into uncorrelated ones
  • Rank variables by importance and select features

What is PCA

When performing PCA, the first principal component of a set of $p$ variables is the derived variable formed as a linear combination of the original variables that explains the most variance. The second principal component explains the most variance in what is left once the effect of the first component is removed, and we may proceed through $p$ iterations until all the variance is explained.1

Mathematically, PCA seeks a vector $\vec{v}$ such that the original data points $x_i$ projected onto it have the highest variance. Denoting $x_i$ as data point $i$ and $\vec{v}$ as a unit vector, the objective is:

$$ \argmax_{||\vec{v}||=1} \sigma_{\vec{v}^Tx_i}^2 = \argmax_{||\vec{v}||=1} \frac{1}{n}\sum_{i=1}^{n}(\vec{v}^Tx_i-\mu)^2 $$

where $\mu$ is the mean of the projected values $\vec{v}^Tx_i$.
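To make the objective concrete, here is a minimal sketch (using made-up toy points, not the Iris data used later) that computes the projected variance for one candidate unit vector $\vec{v}$:

import jax.numpy as jnp

x = jnp.array([[2.0, 1.0], [0.5, 0.3], [-1.0, -0.8], [-1.5, -0.5]])  # toy data points
v = jnp.array([1.0, 1.0]) / jnp.sqrt(2.0)                            # a candidate unit vector
proj = x @ v                                                         # v^T x_i for every point
print(jnp.mean((proj - proj.mean()) ** 2))                           # the variance PCA maximizes

PCA looks for the $\vec{v}$ that makes this number as large as possible.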

Steps to Get the Components

1. Center $x_i$ by subtracting the mean of $x$, $\mu_x$

Subtracting the mean of $x$ doesn’t affect the variance, but it simplifies the computation by making the projected mean $\mu$ equal to zero. If we re-assign $x_i=x_i-\mu_x$, then the function becomes: $$ \frac{1}{n}\sum_{i=1}^{n}(\vec{v}^Tx_i-\mu)^2 = \frac{1}{n}\sum_{i=1}^{n}(\vec{v}^Tx_i-0)^2 = \frac{1}{n}\sum_{i=1}^{n}(\vec{v}^Tx_i)^2 $$
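As a quick sanity check (with the same kind of toy points), centering drives the projected mean to zero while leaving the variance unchanged:

import jax.numpy as jnp

x = jnp.array([[2.0, 1.0], [0.5, 0.3], [-1.0, -0.8], [-1.5, -0.5]])  # toy data points
v = jnp.array([1.0, 1.0]) / jnp.sqrt(2.0)                            # a unit vector
xc = x - x.mean(axis=0)                                              # centered data
print((xc @ v).mean())                                               # projected mean is ~0
print(jnp.var(x @ v), jnp.var(xc @ v))                               # variance is unchanged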

2. Calculate covariance matrix $C$

Expanding the square and grouping the term $\frac{1}{n}\sum_{i=1}^{n}x_ix_i^T$ in the middle, which is the covariance matrix $C$ 2, we get: $$ \frac{1}{n}\sum_{i=1}^{n}(\vec{v}^Tx_i)^2 = \vec{v}^T\left(\frac{1}{n}\sum_{i=1}^{n}x_ix_i^T\right)\vec{v} = \vec{v}^TC\vec{v} $$
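As a small numerical check (toy data again), building $\frac{1}{n}\sum_{i=1}^{n}x_ix_i^T$ from centered data gives the same matrix as jnp.cov with bias=True (note that jnp.cov defaults to the $\frac{1}{n-1}$ normalization instead):

import jax.numpy as jnp

x = jnp.array([[2.0, 1.0], [0.5, 0.3], [-1.0, -0.8], [-1.5, -0.5]])  # toy data points
xc = x - x.mean(axis=0)                        # centered data
c_manual = (xc.T @ xc) / xc.shape[0]           # (1/n) * sum of x_i x_i^T
c_jnp = jnp.cov(xc.T, bias=True)               # the same matrix via jnp.cov
print(jnp.allclose(c_manual, c_jnp, atol=1e-6))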

3. Calculate eigenvalues & eigenvectors of matrix $C$

By the method of Lagrange multipliers, the maximum variance is achieved when $\vec{v}$ is an eigenvector of $C$. Here are some references 3 4 if you are interested. Denoting $\lambda$ as an eigenvalue of $C$: $$ \sigma_{\vec{v}^Tx_i}^2=\vec{v}^TC\vec{v} =\vec{v}^T\lambda\vec{v} = \lambda\vec{v}^T\vec{v} = \lambda $$
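For readers who want the intermediate step, the derivation is short: maximize $\vec{v}^TC\vec{v}$ subject to $\vec{v}^T\vec{v}=1$ with a multiplier $\lambda$, and set the gradient with respect to $\vec{v}$ to zero:

$$ \mathcal{L}(\vec{v},\lambda)=\vec{v}^TC\vec{v}-\lambda(\vec{v}^T\vec{v}-1), \qquad \frac{\partial\mathcal{L}}{\partial\vec{v}}=2C\vec{v}-2\lambda\vec{v}=0 \;\Rightarrow\; C\vec{v}=\lambda\vec{v} $$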

The eigenvalues $\lambda$ represent the variance of the projected variable. There are some important points to note:

  • Eigenvalues $\lambda$ and eigenvectors $\vec{v}$ come in pairs, and we sort them by eigenvalue, since a greater eigenvalue indicates a larger variance. This can be written as $\lambda_1 \geq \lambda_2\geq…\geq\lambda_p$, where $p$ is the number of features in each $x_i$ and $\vec{v_j}$ is the eigenvector paired with $\lambda_j$
  • The covariance matrix $C$ is symmetric, so its eigenvectors are orthogonal 5, which ensures the transformed variables are uncorrelated (see the quick check right after this list)
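Here is a minimal numerical check of that orthogonality claim, using a small hand-picked symmetric matrix as a stand-in for a covariance matrix:

import jax.numpy as jnp

c = jnp.array([[2.0, 0.8], [0.8, 1.0]])                    # a symmetric 2x2 matrix
vals, vecs = jnp.linalg.eigh(c)                            # eigen-decomposition for symmetric matrices
print(jnp.allclose(vecs.T @ vecs, jnp.eye(2), atol=1e-6))  # the eigenvector columns are orthonormal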

4. Transform $x_i$ to new space by $\vec{v_j}$

Finally, project $x_i$ onto each $\vec{v}_j$ to get the transformed variables, which can also be written as $X\vec{v}_j$ where $X=[x_1, x_2, …, x_n]^T$
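This is also where dimensionality reduction happens: we keep only the first few components. Here is a small sketch (with made-up, already-sorted eigenvalues) of the explained-variance ratio commonly used to decide how many components to keep:

import jax.numpy as jnp

eig_vals = jnp.array([2.5, 0.4, 0.1])  # hypothetical eigenvalues, sorted descending
ratio = eig_vals / eig_vals.sum()      # share of the total variance per component
print(ratio)                           # roughly [0.83 0.13 0.03]
print(jnp.cumsum(ratio))               # keep components until the cumulative share is high enough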

Implementation & Visualization with Python

Let’s implement PCA and visualize it using the Setosa species from the Iris dataset, focusing on sepal length and sepal width.

0. Load the dataset and extract setosa

First, let’s take a look at the dataset:

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import jax.numpy as jnp

df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.show()

Then, extract the setosa species and show it in a figure whose x and y axes both range from -6 to 6.

setosa = df.loc[df['species'] == 'setosa', ['sepal_width', 'sepal_length']]

fig_setosa = px.scatter(setosa, x="sepal_width", y="sepal_length", range_x=[-6, 6], range_y=[-6, 6])
fig_setosa.show()

1. Center the $x_i$

features = jnp.array(setosa)

center = features - features.mean(axis=0)

fig = px.scatter(x=center[:, 0], y=center[:, 1],
                 labels={'x': 'sepal_width', 'y': 'sepal_length'},
                 range_x=[-6, 6],  # keep the same -6 to 6 range as the previous figure
                 range_y=[-6, 6])
fig.show()

2. Calculate the Covariance matrix $C$

cov = jnp.cov(center.T)

3. Calculate eigenvalues & eigenvectors of matrix $C$

First, compute the eigenvalues and eigenvectors of $C$:

eig_vals, eig_vecs = jnp.linalg.eig(cov)
eig_vals, eig_vecs = eig_vals.real, eig_vecs.real  # C is symmetric, so these are real-valued

Then, sort the eigenvalue/eigenvector pairs by eigenvalue in descending order.

sorted_idx = jnp.argsort(eig_vals)[::-1]  # indices from largest to smallest eigenvalue
eig_vals = eig_vals[sorted_idx]
eig_vecs = eig_vecs[:, sorted_idx]

Let’s visualize the eigenvectors in the figure, each scaled by its standard deviation (the square root of its eigenvalue).

pc1 = (eig_vecs[:, 0] * jnp.sqrt(eig_vals[0])).astype('float32')
pc2 = (eig_vecs[:, 1] * jnp.sqrt(eig_vals[1])).astype('float32')

pca_fig = px.scatter(
    x=center[:, 0], y=center[:, 1],
    labels={'x': 'sepal_width', 'y': 'sepal_length'},
    range_x=[-1, 1],  # zoom in around the origin
    range_y=[-1, 1]
)
pca_fig.update_yaxes(
    scaleanchor="x",
    scaleratio=1,
) # equal axis scaling, so the right angle between the two vectors is visible

pca_fig.add_trace(
    go.Scatter(
        x=[0, pc1[0]],
        y=[0, pc1[1]],
        mode='lines',
        name='PC1',
        line=dict(color='red', width=2),
      )
).add_trace(
    go.Scatter(
        x=[0, pc2[0]],
        y=[0, pc2[1]],
        mode='lines',
        name='PC2',
        line=dict(color='green', width=2),
      )
)

pca_fig.show()

4. Transform (Rotate) the original data

In the eigenvector matrix, every column is a unit vector and the columns are mutually orthogonal, so it is effectively a rotation matrix.

rotate = center @ eig_vecs
rotate = rotate.astype('float32')
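As a quick check (assuming the variables from the previous steps are still in scope), the variance of each rotated column should match the corresponding eigenvalue, since jnp.cov uses the same $\frac{1}{n-1}$ normalization as jnp.var with ddof=1:

print(jnp.var(rotate, axis=0, ddof=1))  # per-column variance of the rotated data
print(eig_vals)                         # should be (numerically) the same values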

rotate_fig = px.scatter(x=rotate[:, 0], y=rotate[:, 1],
                 labels={'x': 'sepal_width', 'y': 'sepal_length'},
                 range_x=[-1, 1],
                 range_y=[-1, 1],
                 )
rotate_fig.update_yaxes(
    scaleanchor="x",
    scaleratio=1,
) # to fix ratio

rotate_fig.add_trace(
    go.Scatter(
        x=[0, jnp.sqrt(eig_vals[0]).astype('float32')],
        y=[0, 0],
        mode='lines',
        name='PC1',
        line=dict(color='red', width=2),
      )
).add_trace(
    go.Scatter(
        x=[0, 0],
        y=[0, jnp.sqrt(eig_vals[1]).astype('float32')],
        mode='lines',
        name='PC2',
        line=dict(color='green', width=2),
      )
)

rotate_fig.show()

Conclusion

In this article, we’ve revisited Principal Component Analysis (PCA), breaking down its mathematical foundations and providing an implementation in Python with visualization. PCA is not hard to understand; it involves just a few steps (wrapped up in a short sketch after the list):

  • Centering the data
  • Calculating the covariance matrix
  • Getting the eigenvalues and eigenvectors of the covariance matrix
  • Transforming the data with the eigenvectors
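Putting the steps together, here is a minimal helper (a sketch following the recipe above; the function name pca is my own, and it uses jnp.linalg.eigh since the covariance matrix is symmetric):

import jax.numpy as jnp

def pca(x):
    """Return eigenvalues, eigenvectors (sorted by decreasing variance), and the transformed data."""
    x = jnp.asarray(x, dtype=jnp.float32)
    center = x - x.mean(axis=0)                   # 1. center the data
    cov = jnp.cov(center.T)                       # 2. covariance matrix
    eig_vals, eig_vecs = jnp.linalg.eigh(cov)     # 3. eigen-decomposition of the symmetric C
    order = jnp.argsort(eig_vals)[::-1]           #    sort by decreasing eigenvalue
    eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
    return eig_vals, eig_vecs, center @ eig_vecs  # 4. transform into the new space

For example, pca(jnp.array(setosa)) should reproduce the eigenvalues and rotated coordinates computed step by step above.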