# createAnomalyDetectionEngine {#createanomalydetectionengine}

**Parent topic:**[Functions](../../Functions/category.md)

## Syntax {#syntax}

`createAnomalyDetectionEngine(name, metrics, dummyTable, outputTable, timeColumn, [keyColumn], [windowSize], [step], [garbageSize], [roundTime=true], [snapshotDir], [snapshotIntervalInMsgCount], [raftGroup],[anomalyDescription])`

## Details {#details}

This function creates an anomaly detection engine and returns a table object. Data inserted into this table is used to calculate specified metrics to detect anomalies.

Note that the following functions are not supported in metrics: `next`, `talibNull`, `linearTimeTrend`, `iterate`, or aggregate function nested within an order-sensitive function \(e.g., `tmsum(sum())`\).

For more application scenarios, see [Streaming Engines](../../Streaming/streaming_engines.md).

### Calculation Rules {#calculation-rules}

The anomaly detection engine uses different calculation rules for the following 3 types of anomaly metrics:

-   Comparison between a column and a constant or between columns. Only non-aggregate functions can be included. For examples: `qty < 4`, `qty > price`, `lt(qty, prev(qty))`, `isNull(qty) == false`, etc. For these metrics, the engine conducts calculations for each row and determines whether to output the result.

-   Comparison between aggregate function result and a constant or between aggregate function results. Non-aggregate functions may be used, but their arguments may only include aggregate functions and/or constants, not columns. For examples: `avg(qty - price) > 10`, `percentile(qty, 90) < 100`, `max(qty) < avg(qty) * 2`, `le(sum(qty), 5)`, etc. For these metrics, the engine conducts calculations at frequencies determined by the parameter *step* and determines whether to output the result.

-   Comparison between aggregate function result and a column, or non-aggregate functions are used and their arguments include aggregate functions and columns. For examples: `avg(qty) > qty`, `le(med(qty), price)`, etc. For these metrics, the engine conducts calculations at frequencies determined by the parameter *step* and determines whether to output the result.


Note: If an aggregate function is used in *metrics*, the parameters *windowSize* and *step* must be specified. The anomaly metrics are calculated in a window of *windowSize* at every step. To facilitate observation and comparison of calculation results, the engine automatically adjusts the starting point of the first window. See [createTimeSeriesEngine](createTimeSeriesEngine.md) for the alignment rules.

If the parameter *keyColumn* is specified, the anomaly metrics are calculated in each group.

### Other Features: {#other-features}

-   Data cleanup: If an aggregate function or order-sensitive function is specified for *metrics*, the engine keeps the historical data. You can specify the parameter *garbageSize* to clean up the data that are no longer needed. If *keyColumn* is specified, data cleanup is performed within each group independently.

-   Snapshot: Snapshot mechanism is used to restore the streaming engine to the latest snapshot after system interruption. \(See parameters *snapshotDir* and *snapshotIntervalInMsgCount*\)

-   High availability: To enable high availability for the streaming engines, specify the parameter *raftGroup* on the leader of a raft group on the subscriber. When a leader is down, the raft group automatically switches to a new leader to resubscribe to the stream table.


## Arguments {#arguments}

**name** is a string indicating the name of the anomaly detection engine. It is the unique identifier of the engine on a data/compute node. It can contain letters, numbers and underscores, and must start with a letter.

**metrics** is metacode or tuple specifying the formulas for anomaly detection. It uses functions or expressions such as `<[sum(qty)>5, avg(qty)>qty, qty<4]>` that indicate anomaly conditions and return Boolean values.

Note: The column names specified in *metrics* are not case-sensitive and can be inconsistent with the column names of the input tables.

**dummyTable** is a table object whose schema must be the same as the stream table to which the engine subscribes.

**outputTable** is the output table of detected anomalies. It can be an in-memory table or a DFS table. Create an empty table before calling the function.

The output columns are in the following order:

\(1\) The first column is of temporal data type and stores the time when an anomaly is detected.

\(2\) The following column\(s\) are *keyColumn* \(if specified\).

\(3\) Then followed by an INT column indicating the position of the condition in metrics;

\(4\) And a STRING/SYMBOL column indicating the content of the anomaly conditions or description.

**timeColumn** is a string indicating the name of the temporal column of the input stream table.

**keyColumn** \(optional\) is a STRING scalar or vector indicating the grouping columns. The anomaly detection engine conducts calculations within each group specified by *keyColumn*.

**windowSize** \(optional\) is a positive integer indicating the size of window for calculation. Only the left boundary is included in the window.

**step** \(optional\) is a positive integer indicating the duration between 2 adjacent windows. The value of *windowSize* must be a multiple of *step*.

Note: If aggregate functions are used in *metrics*, parameters *windowSize* and *step* must be specified.

**garbageSize** \(optional\) is a positive integer. The default value is 2,000 \(in Bytes\).

-   If *keyColumn* is not specified, when the number of historical records in memory exceeds *garbageSize*, the system will clear the records that are no longer needed.

-   If *keyColumn* is specified, garbage collection is conducted separately in each group. When the number of historical records in a group exceeds *garbageSize*, the system will clear the records that are no longer needed within the group.


Note that *garbageSize* only takes effect when aggregate function\(s\) are specified in parameter metrics.

**roundTime** \(optional\) is a Boolean value indicating the method to align the window boundary if the time column is in millisecond or second precision and step is greater than one minute. The default value is true indicating the alignment is based on the multi-minute rule. False means alignment is based on the one-minute rule \(See [Alignment Rules](createTimeSeriesEngine.md)\).

To enable snapshot in the streaming engines, specify parameters *snapshotDir* and *snapshotIntervalInMsgCount*.

**snapshotDir** \(optional\) is a string indicating the directory where the streaming engine snapshot is saved. The directory must already exist, otherwise an exception is thrown. If the *snapshotDir* is specified, the system checks whether a snapshot already exists in the directory when creating a streaming engine. If it exists, the snapshot will be loaded to restore the engine state. Multiple streaming engines can share a directory where the snapshot files are named as the engine names.

The file extension of a snapshot can be:

-   *&lt;engineName&gt;.tmp*: a temporary snapshot
-   *&lt;engineName&gt;.snapshot*: a snapshot that is generated and flushed to disk
-   *&lt;engineName&gt;.old*: if a snapshot with the same name already exists, the previous snapshot is renamed to *&lt;engineName&gt;.old*.

**snapshotIntervalInMsgCount** \(optional\) is a positive integer indicating the number of messages to receive before the next snapshot is saved.

**raftGroup** \(optional\) is an integer greater than 1, indicating the ID of the raft group on the high-availability streaming subscriber specified by the configuration parameter *streamingRaftGroups*. Specify *raftGroup* to enable high availability on the streaming engine. When an engine is created on the leader, it is also created on each follower and the engine snapshot is synchronized to the followers. When a leader is down, the raft group automatically switches to a new leader to resubscribe to the stream table. Note that *snapshotDir* must also be specified when specifying a raft group.

**anomalyDescription** \(optional\) is a STRING vector, where each element indicates a custom description for the corresponding anomaly condition specified in *metrics*.

## Examples {#examples}

Example 1: The following example creates a table "engine" with `createAnomalyDetectionEngine`, then subscribes to the stream table trades and writes data to the engine. The engine conducts calculations by group based on column sym and saves the results to *outputTable*.

```
share streamTable(1000:0, `time`sym`qty, [TIMESTAMP, SYMBOL, INT]) as trades
share table(1000:0, `time`sym`type`metric, [TIMESTAMP, SYMBOL, INT, STRING]) as outputTable
engine = createAnomalyDetectionEngine(name="anomalyDetection1", metrics=<[sum(qty) > 5, avg(qty) > qty, qty < 4]>, dummyTable=trades, outputTable=outputTable, timeColumn=`time, keyColumn=`sym, windowSize=3, step=3)
subscribeTable(tableName="trades", actionName="anomalyDetectionSub1", offset=0, handler=append!{engine}, msgAsTable=true)

def writeData(n){
timev = 2018.10.08T01:01:01.001 + 1..n
symv =take(`A`B, n)
qtyv = n..1
insert into trades values(timev, symv, qtyv)
}

writeData(6);

select * from trades;
```

|time|sym|qty|
|----|---|---|
|2018.10.08T01:01:01.002|A|6|
|2018.10.08T01:01:01.003|B|5|
|2018.10.08T01:01:01.004|A|4|
|2018.10.08T01:01:01.005|B|3|
|2018.10.08T01:01:01.006|A|2|
|2018.10.08T01:01:01.007|B|1|

```
select * from outputTable;
```

|time|sym|type|metric|
|----|---|----|------|
|2018.10.08T01:01:01.003|A|0|sum\(qty\) &gt; 5|
|2018.10.08T01:01:01.004|A|1|avg\(qty\) &gt; qty|
|2018.10.08T01:01:01.005|B|2|qty &lt; 4|
|2018.10.08T01:01:01.006|A|1|avg\(qty\) &gt; qty|
|2018.10.08T01:01:01.006|A|2|qty &lt; 4|
|2018.10.08T01:01:01.006|B|0|sum\(qty\) &gt; 5|
|2018.10.08T01:01:01.007|B|1|avg\(qty\) &gt; qty|
|2018.10.08T01:01:01.007|B|2|qty &lt; 4|

The calculation process of the anomaly detection engine is explained in details below:

\(1\) The indicator `sum(qty)>5` represents the comparison between the aggregate result and a constant, the anomaly detection engine checks this indicator during the calculation in each window.

-   The first window ranges from 2018.10.08T01:01:01.000 to 2018.10.08T01:01:01.002, and the `sum(qty)` of A and B is calculated respectively. At 2018.10.08T01:01:01.003 the engine starts to judge whether it meets the condition `sum(qty)>5`.

-   The second window ranges from 2018.10.08T01:01:01.003 to 2018.10.08T01:01:01.005, at 2018.10.08T01:01:01.006, the engine starts to judge whether it meets the condition `sum(qty)>5`, and so on.


\(2\) The indicator `avg(qty)>qty` represents the comparison between the aggregate result and a certain column. Therefore, whenever data arrives, the anomaly detection engine compares the data with the aggregate result of the previous window. Until the next aggregation is triggered, the engine checks whether the result meets the conditions and outputs it.

-   The first window ranges from 2018.10.08T01:01:01.000 to 2018.10.08T01:01:01.002, the `avg(qty)` of A and B is calculated respectively. Each qty between 2018.10.08T01:01:01.003 and 2018.10.08T01:01:01.005 will be compared with the `avg(qty)` of the previous window.

-   The window moves at 2018.10.08T01:01:01.005.

-   The second window ranges from 2018.10.08T01:01:01.003 to 2018.10.08T01:01:01.005. In the second window, the `avg(qty)` of A and B is calculated, each qty between 2018.10.08T01:01:01.006 and 2018.10.08T01:01:01.008 will be compared with the `avg(qty)` of the previous window, and so on.


\(3\) The metric `qty<4` represents the comparison between a column and a constant. Therefore, whenever a new record arrives, the anomaly detection engine compares the value with 4.

Example 2: The following example demonstrates how the *anomalyDescription* parameter works.

``` {#codeblock_ukl_czt_x1c}
share streamTable(1000:0, `time`temp, [TIMESTAMP, DOUBLE]) as sensordata
share streamTable(1000:0, `time`anomalyType`anomalyString, [TIMESTAMP, INT, SYMBOL]) as outputTable
engine = createAnomalyDetectionEngine(name = "engineB", metrics=<[temp > 65, temp > percentile(temp, 75)]>, dummyTable = sensordata, outputTable = outputTable, timeColumn = `time, windowSize = 6, step = 3, anomalyDescription=["The temperature is higher than 65°C", "The temperature is larger than 75% values of the last window"])
subscribeTable(,tableName = "sensordata", actionName = "sensorAnomalyDetection", offset = 0, handler = append!{engine}, msgAsTable = true)

timev = 2018.10.08T01:01:01.001 + 1..10
tempv = 59 66 57 60 63 51 53 52 56 55
insert into sensordata values(timev, tempv)

sleep(10)
select * from outputTable
```

|time|anomalyType|anomalyString|
|----|-----------|-------------|
|2018.10.08T01:01:01.003|0|The temperature is higher than 65°C|
|2018.10.08T01:01:01.003|1|The temperature is larger than 75% values of the last window|
|2018.10.08T01:01:01.005|1|The temperature is larger than 75% values of the last window|
|2018.10.08T01:01:01.006|1|The temperature is larger than 75% values of the last window|

