createGPLearnEngine

Note: This function is not supported in the Community Edition. A trial of Shark is available from the official DolphinDB website.

Syntax

createGPLearnEngine(trainData, targetData, [groupCol=''], [populationSize=1000], [generations=20], [tournamentSize=20], [stoppingCriteria=0.0], [constRange], [windowRange], [initDepth], [initMethod='half'], [initProgram=''], [functionSet], [maxSamples=1.0], [fitnessFunc='mse'], [parsimonyCoefficient=0.001], [crossoverMutationProb=0.9], [subtreeMutationProb=0.01], [hoistMutationProb=0.01], [pointMutationProb=0.01], [eliteCount=0], [restrictDepth=false], [deviceId=0], [seed], [verbose=true], [minimize=true], [useAbsFit=true])

Details

Create a GPLearn engine for training and predicting with symbolic regression.

Arguments

  • trainData is a table where all columns are of FLOAT or DOUBLE type, indicating the training data.
  • targetData is a vector of the same type as trainData, indicating the target data to be predicted.
  • groupCol (optional) is a STRING scalar or vector, indicating the name(s) of the grouping column(s) on which grouped calculation is based. The default value is NULL, meaning no grouping column is specified. Note that the groupCol values themselves are ignored in the calculation.
  • populationSize (optional) is an integer indicating the population size (i.e., the number of programs) in each generation. The default value is 1000.
  • generations (optional) is an integer indicating the number of generations (iterations) to evolve. The default value is 20.
  • tournamentSize (optional) is an integer indicating the number of programs that will compete to become part of the next generation. The default value is 20.
  • stoppingCriteria (optional) is a floating-point scalar, indicating the fitness threshold for early stopping. Evolution ends early if the fitness is smaller than stoppingCriteria. The default value is 0, indicating that evolution ends only when the number of iterations reaches generations.
  • constRange (optional) can be 0 or a 2-element floating-point vector, specifying the range of constants included in the programs. The default is [-1.0, 1.0].
    • 0 means no constants will be included in the candidate programs.
    • For a vector, its 2 elements specify the left and right boundaries (closed) for the range.
  • windowRange (optional) is an integral vector from which a random value is taken as the sliding window size. The default value is [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 7, 14, 21, 48, 35, 42].
  • initDepth (optional) is a 2-element integral vector, indicating the range of tree depths for the initial population of naive formulas. The default value is [2, 6].
  • initMethod (optional) is a string indicating the initialization method. It can be:
    • grow: Nodes are chosen at random from functions, constants and variables.
    • full: Functions are chosen until the initDepth is reached, and then terminals are selected.
    • half (default): Trees are grown through a 50/50 mix of 'full' and 'grow'.
  • initProgram (optional) is metacode or a tuple of metacode used to initialize the population. The default value is NULL. An example is <mavg(price, 10)>, where mavg is a built-in function and price is a column from trainData.
  • functionSet (optional) is a STRING vector specifying the functions used when building and evolving programs. The default value is NULL, indicating all functions can be used. See the Appendix for the available functions.
  • maxSamples (optional) is a floating-point number in [0,1] indicating the fraction of samples involved in fitnessFunc. The default value is 1.0.
  • fitnessFunc (optional) is a FUNCTIONDEF or STRING scalar indicating the fitness function. It can be:
    • 'mse' (default): mean squared error.
    • 'rmse': root mean squared error.
    • 'mae': mean absolute error.
    • 'pearson': Pearson's product-moment correlation coefficient.
    • 'spearmanr': Spearman's rank-order correlation coefficient.
  • parsimonyCoefficient (optional) is a floating-point number indicating the parsimony coefficient. This constant penalizes large programs by adjusting their fitness to be less favorable for selection. The default value is 0.001.
  • crossoverMutationProb (optional) is a floating-point number indicating the probability of performing crossover on a tournament winner. The default value is 0.9.
  • subtreeMutationProb (optional) is a floating-point number indicating the probability of performing subtree mutation on a tournament winner. The default value is 0.01.
  • hoistMutationProb (optional) is a floating-point number indicating the probability of performing hoist mutation on a tournament winner. The default value is 0.01.
  • pointMutationProb (optional) is a floating-point number indicating the probability of performing point mutation on a tournament winner. The default value is 0.01.
Note: The four genetic operation probabilities above (crossoverMutationProb, subtreeMutationProb, hoistMutationProb, and pointMutationProb) must sum to no more than 1.
  • eliteCount (optional) is an INT scalar indicating the number of elites to be preserved. The eliteCount programs with the best fitness are carried over to the next generation without mutation. The default value is 0.
  • restrictDepth (optional) is a Boolean scalar specifying whether to strictly limit the program length to initDepth. The default value is false.
  • deviceId (optional) is an INT scalar or vector specifying the device ID(s) to be used. The default value is 0.
  • seed (optional) is an integer indicating the random seed used for training.
  • verbose (optional) is a Boolean scalar indicating whether to output the training information. The default value is true.
  • minimize (optional) is a Boolean scalar indicating whether to minimize or maximize the fitness score. The default value is true, i.e., a smaller score means better fitness.
  • useAbsFit (optional) is a Boolean scalar that determines whether absolute values are used in fitness calculations for correlation-based fitness functions, i.e., when fitnessFunc is 'pearson', 'spearmanr', or the function corr. The default value is true.
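
The constraint on the genetic operation probabilities can be illustrated with a short Python sketch. This is not Shark code and does not call any DolphinDB API: the helper names check_operation_probs and pick_operation are hypothetical, and the handling of the leftover probability (copying the tournament winner unchanged, as the open-source gplearn library does) is an assumption rather than documented Shark behavior.

```python
import random

# Default values copied from the Syntax line of this page.
DEFAULT_PROBS = {
    "crossoverMutationProb": 0.9,
    "subtreeMutationProb": 0.01,
    "hoistMutationProb": 0.01,
    "pointMutationProb": 0.01,
}


def check_operation_probs(probs, tol=1e-12):
    """Validate the probabilities and return the leftover probability mass.

    Raises ValueError if a probability is outside [0, 1] or if the
    probabilities sum to more than 1.
    """
    for name, p in probs.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {p}")
    total = sum(probs.values())
    if total > 1.0 + tol:
        raise ValueError(f"probabilities sum to {total}, which exceeds 1")
    return max(0.0, 1.0 - total)


def pick_operation(probs, rng):
    """Pick one genetic operation for a tournament winner.

    ASSUMPTION: when no operation is drawn, the winner is copied
    unchanged ("reproduction"). This mirrors gplearn and is not
    confirmed for Shark.
    """
    r = rng.random()  # uniform draw in [0, 1)
    cumulative = 0.0
    for name, p in probs.items():
        cumulative += p
        if r < cumulative:
            return name
    return "reproduction"


leftover = check_operation_probs(DEFAULT_PROBS)   # about 0.07 with the defaults
operation = pick_operation(DEFAULT_PROBS, random.Random(42))
```

With the default values, the four probabilities sum to 0.93, so the check passes; raising pointMutationProb to 0.2 without lowering another probability would push the sum above 1 and be rejected.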

Examples

See Quick Start Guide for Shark GPLearn

Appendix

The following table lists available functions for building and evolving programs. The parameter n indicates the sliding window size taken from windowRange. For all m-functions, if the current window is smaller than n, 0 is returned.

Function          Number of Inputs  Description
add(x, y)         2                 Addition
sub(x, y)         2                 Subtraction
mul(x, y)         2                 Multiplication
div(x, y)         2                 Division, returns 1 if the absolute value of the divisor is less than 0.001
max(x, y)         2                 Maximum value
min(x, y)         2                 Minimum value
sqrt(x)           1                 Square root based on absolute value
log(x)            1                 If x < 0.001, returns 0, otherwise returns log(abs(x))
neg(x)            1                 Negation
reciprocal(x)     1                 Reciprocal, returns 0 if the absolute value of x is less than 0.001
abs(x)            1                 Absolute value
sin(x)            1                 Sine function
cos(x)            1                 Cosine function
tan(x)            1                 Tangent function
sig(x)            1                 Sigmoid function
mdiff(x, n)       1                 n-th order difference of x
mcovar(x, y, n)   2                 Covariance of x and y with a sliding window of size n
mcorr(x, y, n)    2                 Correlation of x and y with a sliding window of size n
mstd(x, n)        1                 Sample standard deviation of x with a sliding window of size n
mmax(x, n)        1                 Maximum value of x with a sliding window of size n
mmin(x, n)        1                 Minimum value of x with a sliding window of size n
msum(x, n)        1                 Sum of x with a sliding window of size n
mavg(x, n)        1                 Average of x with a sliding window of size n
mprod(x, n)       1                 Product of x with a sliding window of size n
mvar(x, n)        1                 Sample variance of x with a sliding window of size n
mvarp(x, n)       1                 Population variance of x with a sliding window of size n
mstdp(x, n)       1                 Population standard deviation of x with a sliding window of size n
mimin(x, n)       1                 Index of the minimum value of x with a sliding window of size n
mimax(x, n)       1                 Index of the maximum value of x with a sliding window of size n
mbeta(x, y, n)    2                 Least squares estimate of the regression coefficient of x on y with a sliding window of size n
mwsum(x, y, n)    2                 Inner product of x and y with a sliding window of size n
mwavg(x, y, n)    2                 Weighted average of x using y as weights with a sliding window of size n
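
The guarded ("protected") behavior of div, sqrt, log, and reciprocal, and the zero-fill rule for m-functions, can be sketched in plain Python. This is an illustration of the descriptions in the table, not the Shark implementation: the function names below are hypothetical, and the reading of "if the current window is smaller than n, 0 is returned" as "positions with fewer than n preceding-and-current elements yield 0" is an assumption.

```python
import math

EPS = 0.001  # threshold quoted in the table above


def protected_div(x, y):
    """Division that returns 1 when |y| < 0.001."""
    return 1.0 if abs(y) < EPS else x / y


def protected_sqrt(x):
    """Square root of the absolute value of x."""
    return math.sqrt(abs(x))


def protected_log(x):
    """Follows the table literally: 0 when x < 0.001, else log(|x|)."""
    return 0.0 if x < EPS else math.log(abs(x))


def protected_reciprocal(x):
    """Reciprocal that returns 0 when |x| < 0.001."""
    return 0.0 if abs(x) < EPS else 1.0 / x


def mavg_sketch(xs, n):
    """Moving average over a sliding window of size n.

    ASSUMPTION: a position whose window holds fewer than n elements
    yields 0, per the note preceding the table.
    """
    out = []
    for i in range(len(xs)):
        if i + 1 < n:
            out.append(0.0)  # window not yet full
        else:
            window = xs[i + 1 - n : i + 1]
            out.append(sum(window) / n)
    return out


print(protected_div(1.0, 0.0005))              # 1.0
print(protected_sqrt(-4.0))                    # 2.0
print(mavg_sketch([1.0, 2.0, 3.0, 4.0], 2))    # [0.0, 1.5, 2.5, 3.5]
```

The guards keep every candidate program defined on all inputs, so a randomly generated formula such as div(x, sub(y, y)) evaluates to a finite value instead of raising an error or producing infinities.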