Why do I need to indicate the number of components to be kept in Principal Component Analysis?

Issue

I found that using PCA requires specifying up front the number of components to keep, as in the following code:

model = pca(n_components=3, normalize=True)

Is there a way to specify only the fraction of variance to retain and let the algorithm return the most important components?

Solution

You don’t necessarily need to specify the number of components in advance. You can extract all components and keep only the leading ones that together explain a given fraction of the total variance. See the code below for an example.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import make_spd_matrix
from sklearn.preprocessing import StandardScaler

# generate the data
np.random.seed(100)

N = 1000  # number of samples
K = 10    # number of features

mean = np.zeros(K)
cov = make_spd_matrix(K)
X = np.random.multivariate_normal(mean, cov, N)
print(X.shape)
# (1000, 10)

# rescale the data
scaler = StandardScaler()
X = scaler.fit_transform(X)

# perform the PCA
pca = PCA(n_components=None)
pca.fit(X)

# find the smallest number of components which together
# explain at least a fraction p (e.g. 0.80, i.e. 80%) of the variance
p = 0.80
n_components = 1 + np.argmax(np.cumsum(pca.explained_variance_ratio_) >= p)
print(n_components)
# 6

# extract the values of the selected components
Z = pca.transform(X)[:, :n_components]
print(Z.shape)
# (1000, 6)
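
As a possible shortcut for the manual selection above: scikit-learn's PCA also accepts a float between 0 and 1 as n_components when svd_solver="full", in which case it keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below is a minimal illustration under assumptions of its own: it generates similar synthetic data with its own seeds (so the component count need not match the 6 printed above), and the names rng, pca_auto and Z_auto are introduced here for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import make_spd_matrix
from sklearn.preprocessing import StandardScaler

# synthetic data similar to the example above (assumed seeds, not the same draw)
rng = np.random.default_rng(100)
N, K = 1000, 10
cov = make_spd_matrix(K, random_state=100)
X = rng.multivariate_normal(np.zeros(K), cov, size=N)
X = StandardScaler().fit_transform(X)

# pass the target variance fraction directly instead of a component count
p = 0.80
pca_auto = PCA(n_components=p, svd_solver="full")
Z_auto = pca_auto.fit_transform(X)

print(pca_auto.n_components_)                          # number of components kept
print(Z_auto.shape)                                    # (N, number of components kept)
print(pca_auto.explained_variance_ratio_.sum() >= p)   # retained variance reaches p
```

This gives the projected data in a single fit, whereas the manual approach above has the advantage of exposing the full spectrum of explained variance ratios, which is useful if you want to try several thresholds without refitting.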

Answered By – Flavia Giammarino

This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
