randomForestClassifier

Syntax

randomForestClassifier(ds, yColName, xColNames, numClasses, [maxFeatures=0], [numTrees=10], [numBins=32], [maxDepth=32], [minImpurityDecrease=0.0], [numJobs=-1], [randomSeed])

Arguments

ds is the data source(s) used for training. It can be generated with the function sqlDS.

yColName is a string indicating the name of the category column.

xColNames is a string scalar/vector indicating the names of the feature columns.

numClasses is a positive integer indicating the number of categories in the category column. The values in the category column must be integers in [0, numClasses).
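If the raw labels are not already integers in [0, numClasses), they need to be encoded before training. The following is a minimal Python sketch of one way to do that, for illustration only; it is not DolphinDB script, and the helper name encode_labels is hypothetical.

```python
def encode_labels(labels):
    """Map arbitrary class labels to integer codes 0..k-1 (in sorted label order)."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes


codes, classes = encode_labels(["spam", "ham", "spam", "eggs"])
num_classes = len(classes)  # the value to pass as numClasses

# Every encoded label lies in [0, num_classes).
assert all(0 <= c < num_classes for c in codes)
```

The classes list preserves the mapping, so predicted codes can be translated back to the original labels.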

maxFeatures (optional) is an integer or a floating-point number indicating the number of features to consider when looking for the best split. The default value is 0.

  • If maxFeatures is a positive integer, maxFeatures features are considered at each split.

  • If maxFeatures is 0, sqrt(the number of feature columns) features are considered at each split.

  • If maxFeatures is a floating-point number between 0 and 1, int(maxFeatures * the number of feature columns) features are considered at each split.
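The three maxFeatures rules can be sketched as a small Python function. This is an illustration of the documented rules, not DolphinDB's implementation; the function name features_per_split is hypothetical, and truncating the square root to an integer is an assumption, since this page does not state how sqrt is rounded.

```python
import math


def features_per_split(max_features, num_feature_columns):
    """Return how many features are examined at each split under the rules above."""
    if isinstance(max_features, int) and max_features > 0:
        # Positive integer: use it directly.
        return max_features
    if max_features == 0:
        # Default: square root of the feature count (truncation is an assumption).
        return int(math.sqrt(num_feature_columns))
    if isinstance(max_features, float) and 0 < max_features < 1:
        # Fraction of the feature columns, truncated to an integer.
        return int(max_features * num_feature_columns)
    raise ValueError("maxFeatures must be 0, a positive integer, or a float in (0, 1)")


print(features_per_split(3, 10))    # positive integer
print(features_per_split(0, 16))    # default rule
print(features_per_split(0.5, 10))  # fractional rule
```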

numTrees (optional) is a positive integer indicating the number of trees in the random forest. The default value is 10.

numBins (optional) is a positive integer indicating the number of bins used when discretizing continuous features. The default value is 32. Increasing numBins lets the algorithm consider more split candidates and make finer-grained split decisions, but it also increases computation and communication time.
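To see why more bins mean more split candidates, consider the Python sketch below. It assumes equal-frequency (quantile) binning purely for illustration; this page does not state which binning strategy DolphinDB uses, and the helper names bin_edges and bin_index are hypothetical.

```python
import statistics
from bisect import bisect_right


def bin_edges(values, num_bins):
    """Return the num_bins - 1 cut points that divide values into num_bins bins."""
    return statistics.quantiles(values, n=num_bins)


def bin_index(value, edges):
    """Return the 0-based index of the bin that value falls into."""
    return bisect_right(edges, value)


values = [float(v) for v in range(1, 101)]
edges = bin_edges(values, 4)

# Each cut point is one candidate split threshold for this feature.
print(len(edges), "candidate thresholds for 4 bins")
```

With num_bins bins there are num_bins - 1 candidate thresholds per continuous feature, so the work per node grows with numBins.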

maxDepth (optional) is a positive integer indicating the maximum depth of a tree. The default value is 32.

minImpurityDecrease (optional) is the threshold for splitting a node: a node is split only if the split decreases the Gini impurity by at least this value. The default value is 0.
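The splitting criterion can be illustrated with a short Python sketch. It uses the standard Gini impurity, 1 - sum(p_k^2), and weights each child by its share of the parent's samples. Whether DolphinDB applies any further weighting (for example by the node's share of all training samples) is not stated on this page, so treat this as an assumption. The function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)


def should_split(parent, left, right, min_impurity_decrease=0.0):
    """Split only if the decrease is at least the threshold."""
    return impurity_decrease(parent, left, right) >= min_impurity_decrease


parent = [0, 0, 1, 1]
left, right = [0, 0], [1, 1]  # a split that separates the classes perfectly
print(impurity_decrease(parent, left, right))
```

Raising the threshold prunes splits that barely improve purity, which yields smaller trees.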

numJobs (optional) is an integer indicating the degree of parallelism. If it is a positive number, it is the maximum number of concurrently running jobs. If it is -1, all CPU threads are used. If it is any other negative integer, (the number of CPU threads + numJobs + 1) threads are used. The default value is -1.
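The numJobs rule can be worked through with a small Python sketch. The function name effective_jobs is hypothetical, and the behavior for 0, or for negative values that would give fewer than one thread, is not documented here, so the sketch does not try to handle those cases.

```python
def effective_jobs(num_jobs, cpu_threads):
    """Resolve numJobs to a job/thread count according to the rule above."""
    if num_jobs > 0:
        # Positive: used as the maximum number of concurrent jobs.
        return num_jobs
    if num_jobs < 0:
        # -1 gives all threads, -2 gives all but one, and so on.
        return cpu_threads + num_jobs + 1
    raise ValueError("behavior for numJobs = 0 is not documented")


print(effective_jobs(-1, 8))  # all 8 threads
print(effective_jobs(-2, 8))  # all but one thread
print(effective_jobs(4, 8))   # at most 4 concurrent jobs
```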

randomSeed (optional) is the seed used by the random number generator.

Details

Fit a random forest classification model. The result is a dictionary with the following keys: numClasses, minImpurityDecrease, maxDepth, numBins, numTrees, maxFeatures, model, modelName and xColNames. model is a tuple containing the trained trees; modelName is "Random Forest Classifier".

The fitted model can be used as an input to the function predict.

Examples

Fit a random forest classification model with simulated data:

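// Create an empty table with an integer class column and two feature columns
t = table(100:0, `cls`x0`x1, [INT,DOUBLE,DOUBLE])
// Class 0: 50 samples, each feature drawn from a normal distribution with mean -1 and standard deviation 1
cls = take(0, 50)
x0 = norm(-1.0, 1.0, 50)
x1 = norm(-1.0, 1.0, 50)
insert into t values (cls, x0, x1)
// Class 1: 50 samples, each feature drawn from a normal distribution with mean 1 and standard deviation 1
cls = take(1, 50)
x0 = norm(1.0, 1.0, 50)
x1 = norm(1.0, 1.0, 50)
insert into t values (cls, x0, x1)

// Train with 2 classes and the default values for all optional parameters
model = randomForestClassifier(sqlDS(<select * from t>), `cls, `x0`x1, 2);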

Use the fitted model for prediction:

predict(model, t)

Save the fitted model to disk:

saveModel(model, "C:/DolphinDB/Data/classificationModel.txt");

Load a saved model:

loadModel("C:/DolphinDB/Data/classificationModel.txt")