pca
Syntax
pca(ds, [colNames], [k], [normalize], [maxIter], [svdSolver], [randomState])
Arguments
ds is one or more data sources. It is usually generated by the function sqlDS.
colNames is a string vector indicating column names. The default value is the names of all columns in ds.
k is a positive integer indicating the number of principal components. The default value is the number of columns in ds.
normalize is a Boolean value indicating whether to normalize each column. The default value is false.
maxIter is a positive integer indicating the number of iterations when svdSolver="randomized". If it is not specified, maxIter=7 if k<0.1*cols and maxIter=4 otherwise, where cols is the number of columns in ds.
svdSolver is a string. It can take the value of "full", "randomized" or "auto". svdSolver="full" is suitable for situations where k is close to size(colNames); svdSolver="randomized" is suitable for situations where k is much smaller than size(colNames). The default value is "auto", which means the system automatically determines whether to use "full" or "randomized".
randomState is an integer indicating the random seed. It only takes effect when svdSolver="randomized". The default value is int(time(now())).
Details
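Perform principal component analysis on the specified columns of the data source. Return a dictionary with the following keys:
- components: the matrix of principal component coefficients with size(colNames) rows and k columns.
- explainedVarianceRatio: a vector of length k with the fraction of the total variance explained by each of the first k principal components.
- singularValues: a vector of length k with the singular values corresponding to the first k principal components. The square of each singular value is proportional to the variance of that component (the corresponding eigenvalue of the covariance matrix).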
Examples
x = [7,1,1,0,5,2]
y = [0.7, 0.9, 0.01, 0.8, 0.09, 0.23]
t=table(x, y)
ds = sqlDS(<select * from t>);
pca(ds);
// output
components->
#0 #1
--------- ---------
-0.999883 0.015306
-0.015306 -0.999883
explainedVarianceRatio->[0.980301,0.019699]
singularValues->[6.110802,0.866243]
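
The explainedVarianceRatio and singularValues shown above can be cross-checked by hand. The following is a minimal sketch in plain Python, not DolphinDB script and not a description of how pca is implemented. It assumes that each column is mean-centered, that normalize is left at its default of false, and that singularValues are the singular values of the centered data. All variable names are illustrative.

```python
import math

# Example data from this page
x = [7, 1, 1, 0, 5, 2]
y = [0.7, 0.9, 0.01, 0.8, 0.09, 0.23]


def center(col):
    """Subtract the column mean from every value."""
    mean = sum(col) / len(col)
    return [v - mean for v in col]


xc, yc = center(x), center(y)

# Entries of the 2x2 scatter matrix Xc^T * Xc of the centered data
sxx = sum(a * a for a in xc)
syy = sum(b * b for b in yc)
sxy = sum(a * b for a, b in zip(xc, yc))

# Closed-form eigenvalues of a symmetric 2x2 matrix, largest first
half_trace = (sxx + syy) / 2
radius = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
eigenvalues = [half_trace + radius, half_trace - radius]

# Singular values of the centered data are the square roots of these eigenvalues
singular_values = [math.sqrt(ev) for ev in eigenvalues]

# Each component's share of the total variance
total = sum(eigenvalues)
explained_variance_ratio = [ev / total for ev in eigenvalues]

print([round(s, 6) for s in singular_values])
print([round(r, 6) for r in explained_variance_ratio])
```

Under these assumptions the two printed vectors should agree with the singularValues and explainedVarianceRatio entries in the output above, up to rounding. The closed-form eigenvalue step only works for two columns; for wider tables a general SVD routine would be needed.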