randomForestRegressor

Syntax

randomForestRegressor(ds, yColName, xColNames, [maxFeatures=0], [numTrees=10], [numBins=32], [maxDepth=32], [minImpurityDecrease=0.0], [numJobs=-1], [randomSeed])

Arguments

ds is the data source(s) used for training. It can be generated with the function sqlDS.

yColName is a string indicating the dependent variable column.

xColNames is a string scalar/vector indicating the names of the feature columns.

maxFeatures (optional) is an integer or a floating-point number indicating the number of features to consider when looking for the best split. The default value is 0.

If maxFeatures is a positive integer, maxFeatures features are considered at each split.
If maxFeatures is 0, sqrt(the number of feature columns) features are considered at each split.
If maxFeatures is a floating-point number between 0 and 1, int(maxFeatures * the number of feature columns) features are considered at each split.
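The three cases above can be sketched with a small simulated table. This is only an illustration: the table t and the variable names m1, m2 and m3 are assumptions made for the example, and maxFeatures is passed positionally as the fourth argument, following the syntax line above.

```
// Simulated training data with two feature columns, x1 and x2
x1 = rand(100.0, 100)
x2 = rand(100.0, 100)
y = 6 + x1 - 2 * x2 + norm(0, 10, 100)
t = table(x1, x2, y)

// Positive integer: exactly 1 feature is considered at each split
m1 = randomForestRegressor(sqlDS(<select * from t>), `y, `x1`x2, 1)

// Fraction between 0 and 1: int(0.5 * 2) = 1 feature is considered at each split
m2 = randomForestRegressor(sqlDS(<select * from t>), `y, `x1`x2, 0.5)

// 0 (the default): sqrt(2) features are considered, per the rule above
m3 = randomForestRegressor(sqlDS(<select * from t>), `y, `x1`x2, 0)
```

With only two feature columns the three settings select a similar number of features; the difference becomes noticeable on wider tables.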

numTrees (optional) is a positive integer indicating the number of trees in the random forest. The default value is 10.

numBins (optional) is a positive integer indicating the number of bins used when discretizing continuous features. The default value is 32. Increasing numBins lets the algorithm consider more split candidates and make finer-grained split decisions, at the cost of more computation and communication time.

maxDepth (optional) is a positive integer indicating the maximum depth of a tree. The default value is 32.

minImpurityDecrease (optional) is the impurity threshold for splitting: a node is split only if the split induces a decrease in impurity greater than or equal to this value. The default value is 0.0.

numJobs (optional) is an integer controlling parallelism. If it is positive, it is the maximum number of concurrently running jobs. If it is -1 (the default), all CPU threads are used. If it is any other negative integer, (the number of CPU threads + numJobs + 1) threads are used.

randomSeed (optional) is the seed used by the random number generator.
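The later optional parameters can be set by position. The sketch below assumes that every preceding optional argument is written out with its default value, in the order given in the syntax line; the table t, the variable names, and the values 2, -2 and 42 are chosen only for illustration.

```
// Simulated training data
x1 = rand(100.0, 100)
x2 = rand(100.0, 100)
y = 6 + x1 - 2 * x2 + norm(0, 10, 100)
t = table(x1, x2, y)

// Argument order: ds, yColName, xColNames, maxFeatures, numTrees, numBins,
// maxDepth, minImpurityDecrease, numJobs, randomSeed

// numJobs = 2: at most 2 jobs run concurrently
mA = randomForestRegressor(sqlDS(<select * from t>), `y, `x1`x2, 0, 10, 32, 32, 0.0, 2)

// numJobs = -2: on a machine with 8 CPU threads, 8 + (-2) + 1 = 7 threads are used
mB = randomForestRegressor(sqlDS(<select * from t>), `y, `x1`x2, 0, 10, 32, 32, 0.0, -2)

// randomSeed = 42: fixes the seed of the random number generator
mC = randomForestRegressor(sqlDS(<select * from t>), `y, `x1`x2, 0, 10, 32, 32, 0.0, -1, 42)
```

Fixing randomSeed is the usual way to make repeated training runs comparable; whether two runs are bit-for-bit identical may also depend on the environment.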

Details

Fit a random forest regression model. The result is a dictionary with the following keys: minImpurityDecrease, maxDepth, numBins, numTrees, maxFeatures, model, modelName and xColNames. model is a tuple containing the trained trees; modelName is "Random Forest Regressor".

The fitted model can be used as an input to the function predict.
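As a minimal sketch of how the returned dictionary might be inspected and then passed to predict, assuming a small simulated table t (the table and the variable names are illustrative only):

```
// Simulated training data
x1 = rand(100.0, 100)
x2 = rand(100.0, 100)
y = 6 + x1 - 2 * x2 + norm(0, 10, 100)
t = table(x1, x2, y)

model = randomForestRegressor(sqlDS(<select * from t>), `y, `x1`x2)

// List the keys of the result dictionary
model.keys()

// Look up individual entries by key
model[`modelName]    // "Random Forest Regressor", as described above
model[`xColNames]    // the feature column names used for training

// Use the fitted model for prediction and compute the in-sample RMSE
yhat = predict(model, t)
rmse = sqrt(avg(square(t.y - yhat)))
```

The RMSE here is measured on the training data, so it is likely to understate the error on unseen data.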

Examples

Fit a random forest regression model with simulated data:

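// Simulate two features with random values between 0 and 100
x1 = rand(100.0, 100)
x2 = rand(100.0, 100)
// True coefficients of the linear data-generating process
b0 = 6
b1 = 1
b2 = -2
// Gaussian noise with mean 0 and standard deviation 10
err = norm(0, 10, 100)
y = b0 + b1 * x1 + b2 * x2 + err
t = table(x1, x2, y)
// Train with the default settings, then predict on the training data
model = randomForestRegressor(sqlDS(<select * from t>), `y, `x1`x2)
yhat = predict(model, t);

// Scatter plot of the observed values y and the predictions yhat
plot(y, yhat, , SCATTER);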

Save the trained model to disk:

saveModel(model, "C:/DolphinDB/Data/regressionModel.txt");

Load a saved model:

model=loadModel("C:/DolphinDB/Data/regressionModel.txt");