# createSnapshotJoinEngine {#createSnapshotJoinEngine}


## Syntax {#syntax}

`createSnapshotJoinEngine(name, leftTable, rightTable, outputTable, metrics, matchingColumn, [timeColumn], [outputElapsedMicroseconds=false], [keepLeftDuplicates=false], [keepRightDuplicates=false], [isInnerJoin=true], [snapshotDir], [snapshotIntervalInMsgCount])`

## Details {#details}

Creates a snapshot join streaming engine that receives streams through its left and right tables and performs either an inner join or a full outer join on the specified *matchingColumn*.

**Return value**: A table object.

**Join type**: inner join or full outer join, controlled by *isInnerJoin*.

**Matching behavior**: match all records or only the latest record in each group, controlled by *keepLeftDuplicates* and *keepRightDuplicates*.

Snapshot join engine vs. lookup join engine:

-   The lookup join engine responds only to new records in the left table, supporting both inner join and left join operations.
-   The snapshot join engine responds to new records in either the left or the right table, supporting both inner join and full outer join operations.

Snapshot join engine vs. equi join engine:

-   The equi join engine joins records immediately upon finding a match. These joined records are not matched again and are removed when the garbage size limit is reached.
-   The snapshot join engine can be configured to join either only the latest record or all records in each group, and it maintains a cache of joined records that can be rematched.

## Arguments {#arguments}

**name** is a string indicating the name of the snapshot join engine. It is the unique identifier of the engine on a data/compute node. It can contain letters, numbers and underscores and must start with a letter.

**leftTable** is a table object whose schema must be the same as the stream table to which the engine subscribes.

**rightTable** is a table object whose schema must be the same as the stream table to which the engine subscribes.
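The following is a minimal sketch of how the two input tables might be fed from stream tables through subscriptions. The table names `trades` and `quotes`, the engine name `demo_SJE`, and the action names are hypothetical and chosen only for illustration; the handler uses partial application of `appendForJoin`, the same function the examples on this page call directly.

```
// hypothetical stream tables that publish the left and right streams
share streamTable(1:0, `timestamp`sym`price, [TIMESTAMP, SYMBOL, DOUBLE]) as trades
share streamTable(1:0, `timestamp`sym`qty, [TIMESTAMP, SYMBOL, INT]) as quotes

// output columns: matching column, two time columns, then the metrics
output = table(100:0, `sym`timestamp1`timestamp2`price`qty, [SYMBOL, TIMESTAMP, TIMESTAMP, DOUBLE, INT])

// leftTable and rightTable share the schemas of the subscribed stream tables
demo_engine = createSnapshotJoinEngine(name="demo_SJE", leftTable=trades, rightTable=quotes,
outputTable=output, metrics=[<price>, <qty>], matchingColumn=`sym, timeColumn=`timestamp)

// route each stream table to one side of the engine:
// the second argument of appendForJoin is true for the left table, false for the right table
subscribeTable(tableName="trades", actionName="demo_left", offset=0, handler=appendForJoin{demo_engine, true}, msgAsTable=true)
subscribeTable(tableName="quotes", actionName="demo_right", offset=0, handler=appendForJoin{demo_engine, false}, msgAsTable=true)

// clean up when the engine is no longer needed
unsubscribeTable(tableName="trades", actionName="demo_left")
unsubscribeTable(tableName="quotes", actionName="demo_right")
dropStreamEngine("demo_SJE")
```

With this setup, records appended to `trades` or `quotes` reach the engine through the subscriptions, so `appendForJoin` does not need to be called manually.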

**outputTable** is a table object to hold the calculation results. Create an empty table and specify the column names and types before calling the function.

The columns of *outputTable* are in the following order:

\(1\) The column\(s\) on which the tables are joined, in the same order as specified in *matchingColumn*.

\(2\) Two time columns, for the left and right tables respectively. If *timeColumn* is specified, they must have the same data type as the time columns it names; otherwise, they must be of TIMESTAMP type.

\(3\) The calculation results of *metrics*.

\(4\) If *outputElapsedMicroseconds* is set to true, two additional columns: a LONG column for the elapsed time and an INT column for the number of records in the batch.

**metrics** is metacode \(can be a tuple\) specifying the calculation formulas. For more information about metacode, refer to [Metaprogramming](../../Programming/Metaprogramming/metaprogramming.md).

-   *metrics* can use one or more expressions, built-in or user-defined functions \(but not aggregate functions\).
-   *metrics* can be functions that return multiple values, in which case the columns in the output table that hold the return values must be specified. For example, ``<func(price) as `col1`col2>``.

To specify a column that exists in both the left and the right tables, use the format *tableName.colName*. By default, the column from the left table is used.

**Note:** The column names specified in *metrics* are not case-sensitive, so their case does not need to match the column names of the input tables.
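As an illustration of the *tableName.colName* format, the sketch below computes the difference between a `price` column that exists in both input tables. The names `demoLeft`, `demoRight`, and `qualified_SJE` are hypothetical, and the sketch assumes that *tableName* refers to the name under which each input table is shared; verify this against your own setup.

```
// both hypothetical input tables contain a column named price
share streamTable(1:0, `timestamp`id`price, [TIMESTAMP, INT, DOUBLE]) as demoLeft
share streamTable(1:0, `timestamp`id`price, [TIMESTAMP, INT, DOUBLE]) as demoRight

// output columns: matching column, two time columns, then one metric
spreadOutput = table(100:0, `id`timestamp1`timestamp2`spread, [INT, TIMESTAMP, TIMESTAMP, DOUBLE])

// assumption: the shared table names qualify the ambiguous column;
// an unqualified price would refer to the left table's column
spreadEngine = createSnapshotJoinEngine(name="qualified_SJE", leftTable=demoLeft, rightTable=demoRight,
outputTable=spreadOutput, metrics=[<demoLeft.price - demoRight.price>], matchingColumn=`id, timeColumn=`timestamp)

// clean up
dropStreamEngine("qualified_SJE")
```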

**matchingColumn** is a STRING scalar/vector/tuple indicating the column\(s\) on which the tables are joined. The matching columns must be of integral, temporal, or literal \(except UUID\) types.

-   When there is only one column to match: if the column has the same name in both tables, specify *matchingColumn* as a STRING scalar; otherwise, specify it as a tuple of two elements. For example, if the column is named "sym" in the left table and "sym1" in the right table, then *matchingColumn* = \[\[\`sym\],\[\`sym1\]\].
-   When there are multiple columns to match: if all matching columns have the same names in both tables, specify *matchingColumn* as a STRING vector; otherwise, specify it as a tuple of two elements. For example, if the columns are named "timestamp" and "sym" in the left table, whereas in the right table they are named "timestamp" and "sym1", then *matchingColumn* = \[\[\`timestamp, \`sym\], \[\`timestamp,\`sym1\]\].
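The four cases above can be written out as follows. This is only a sketch of the value shapes; the variable names are hypothetical, and the column names are the ones used in the descriptions above.

```
// one matching column, same name ("sym") in both tables: STRING scalar
oneSameName = `sym

// one matching column, named "sym" on the left and "sym1" on the right: tuple of two elements
oneDiffName = [[`sym], [`sym1]]

// several matching columns with the same names in both tables: STRING vector
manySameName = `timestamp`sym

// several matching columns with different names: tuple of two STRING vectors,
// the first for the left table and the second for the right table
manyDiffName = [[`timestamp, `sym], [`timestamp, `sym1]]
```

In the tuple forms, the columns are paired by position, so both elements should list the same number of columns in the same order.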

**timeColumn** \(optional\) is a STRING scalar/vector indicating the name of the time column in the left table and the right table. The time columns must have the same data type. If the names of the time column in the left table and the right table are the same, *timeColumn* is a string. Otherwise, it is a vector of 2 strings indicating the time column in each table.

**outputElapsedMicroseconds** \(optional\) is a Boolean value. The default value is false. It determines whether to output:

-   the elapsed time \(in microseconds\) from the ingestion of data to the output of the result for each batch.
-   the total number of records in each batch.

**keepLeftDuplicates** \(optional\) is a Boolean value indicating whether to match all records in each group of the left table. When set to false \(default\), the engine matches only the latest record in each group. When set to true, the engine matches all records in each group.

**keepRightDuplicates** \(optional\) is a Boolean value indicating whether to match all records in each group of the right table. When set to false \(default\), the engine matches only the latest record in each group. When set to true, the engine matches all records in each group.

**isInnerJoin** \(optional\) is a Boolean value to determine whether an inner join or full outer join is performed.

-   If *isInnerJoin*=true \(default\), an inner join is performed. Results are only generated when matches are found between both tables.
-   If *isInnerJoin*=false, a full outer join is performed. Results are generated whether or not a match is found. For unmatched records, the columns from the other table are filled with nulls.

To enable snapshots for the streaming engine, specify the parameters *snapshotDir* and *snapshotIntervalInMsgCount*.

**snapshotDir** \(optional\) is a string indicating the directory where the streaming engine snapshot is saved. The directory must already exist; otherwise, an exception is thrown. If *snapshotDir* is specified, the system checks whether a snapshot already exists in the directory when creating the streaming engine. If one exists, it is loaded to restore the engine state. Multiple streaming engines can share a directory, in which case the snapshot files are named after the engines.

The file extension of a snapshot can be:

-   *&lt;engineName&gt;.tmp*: a temporary snapshot
-   *&lt;engineName&gt;.snapshot*: a snapshot that is generated and flushed to disk
-   *&lt;engineName&gt;.old*: if a snapshot with the same name already exists, the previous snapshot is renamed to *&lt;engineName&gt;.old*.

**snapshotIntervalInMsgCount** \(optional\) is a positive integer indicating the number of messages to receive before the next snapshot is saved.
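A minimal sketch of creating an engine with snapshots enabled is shown below. The directory path, the table names, the engine name `snap_SJE`, and the interval of 1000 messages are all hypothetical values chosen for illustration. Because the directory must exist before the engine is created, the sketch creates it first.

```
// hypothetical snapshot directory; it must exist before the engine is created
snapshotPath = "/home/dolphindb/snapshots"
if(!exists(snapshotPath)){
    mkdir(snapshotPath)
}

// hypothetical input and output tables
share streamTable(1:0, `timestamp`sym`price, [TIMESTAMP, SYMBOL, DOUBLE]) as snapLeft
share streamTable(1:0, `timestamp`sym`qty, [TIMESTAMP, SYMBOL, INT]) as snapRight
snapOutput = table(100:0, `sym`timestamp1`timestamp2`price`qty, [SYMBOL, TIMESTAMP, TIMESTAMP, DOUBLE, INT])

// request a snapshot after every 1000 received messages
snapEngine = createSnapshotJoinEngine(name="snap_SJE", leftTable=snapLeft, rightTable=snapRight,
outputTable=snapOutput, metrics=[<price>, <qty>], matchingColumn=`sym, timeColumn=`timestamp,
snapshotDir=snapshotPath, snapshotIntervalInMsgCount=1000)
```

If a snapshot file for `snap_SJE` is already present in the directory, creating the engine with the same name would load it to restore the engine state, as described for *snapshotDir* above.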

## Examples {#examples}

Example 1. Create a snapshot join engine that inner joins left and right tables, matching all records within each group.

``` {#codeblock_btm_byh_fdc}
// define the input and output tables
share streamTable(1:0, `timestamp`sym1`id`price`val, [TIMESTAMP, SYMBOL, INT, DOUBLE, DOUBLE]) as leftTable
share streamTable(1:0, `timestamp`sym2`id`price`qty, [TIMESTAMP, SYMBOL, INT, DOUBLE, DOUBLE]) as rightTable
output=table(100:0, ["id","sym", "timestamp1", "timestamp2", "factor1", "factor2"], 
[INT, SYMBOL, TIMESTAMP, TIMESTAMP, DOUBLE, DOUBLE])

test_metrics = [<val*10>, <qty>]
// create the engine
test_engine = createSnapshotJoinEngine(name = "test_SJE", leftTable=leftTable, rightTable=rightTable, 
outputTable=output, metrics=test_metrics, matchingColumn = [["id","sym1"],["id","sym2"]], 
timeColumn = `timestamp, isInnerJoin=true, keepLeftDuplicates=true,keepRightDuplicates=true)

// append data to left table
timestamp = 2024.10.10T15:12:01.507+1..10
sym = take(["a","b","c","d"],10)
id = [1,1,2,1,5,2,4,4,1,4]
price = [2.53,7.61,8.07,7.87,7.29,9.39,5.98,9.49,9.20,9.17]
val = [101,108,101,109,104,100,108,100,107,104]
left_data = table(timestamp as timestamp,sym as sym1,id as id,price as price,val as val)
appendForJoin(test_engine,true, left_data)

// append data to right table
timestamp = 2024.10.10T15:12:01.507+1..10
sym = take(["a","b","c","d"],10)
id = [1,2,4,3,5,5,4,2,5,5]
price =  [1.08,9.08,9.97,7.60,1.91,6.77,7.81,8.81,0.61,5.92]
qty =  [208,200,203,202,204,201,206,207,205,205]
right_data = table(timestamp as timestamp,sym as sym2,id as id,price as price,qty as qty)
appendForJoin(test_engine,false, right_data)

select * from output
```

You can see from the output table that all matched records in the left table are calculated and output.

|id|sym|timestamp1|timestamp2|factor1|factor2|
|---|---|----------|----------|-------|-------|
|1|a|2024.10.10T15:12:01.508|2024.10.10T15:12:01.508|1,010|208|
|1|a|2024.10.10T15:12:01.516|2024.10.10T15:12:01.508|1,070|208|
|2|b|2024.10.10T15:12:01.513|2024.10.10T15:12:01.509|1,000|200|
|4|c|2024.10.10T15:12:01.514|2024.10.10T15:12:01.510|1,080|203|
|5|a|2024.10.10T15:12:01.512|2024.10.10T15:12:01.512|1,040|204|
|4|c|2024.10.10T15:12:01.514|2024.10.10T15:12:01.514|1,080|206|
|5|a|2024.10.10T15:12:01.512|2024.10.10T15:12:01.516|1,040|205|

Example 2. Create a snapshot join engine that inner joins left and right tables, matching only the latest record within each group.

``` {#codeblock_rdv_fyh_fdc}
// drop the registered engine if you executed example 1
dropStreamEngine("test_SJE")

// define the input and output tables
share streamTable(1:0, `timestamp`sym1`id`price`val, [TIMESTAMP, SYMBOL, INT, DOUBLE, DOUBLE]) as leftTable
share streamTable(1:0, `timestamp`sym2`id`price`qty, [TIMESTAMP, SYMBOL, INT, DOUBLE, DOUBLE]) as rightTable
output=table(100:0, ["id","sym", "timestamp1", "timestamp2", "factor1", "factor2"], 
[INT, SYMBOL, TIMESTAMP, TIMESTAMP, DOUBLE, DOUBLE])

test_metrics = [<val*10>, <qty>]
// create the engine
test_engine = createSnapshotJoinEngine(name = "test_SJE", leftTable=leftTable, rightTable=rightTable, 
outputTable=output, metrics=test_metrics, matchingColumn = [["id","sym1"],["id","sym2"]], 
timeColumn = `timestamp, isInnerJoin=true, keepLeftDuplicates=false,keepRightDuplicates=true)

// append data to left table
timestamp = 2024.10.10T15:12:01.507+1..10
sym = take(["a","b","c","d"],10)
id = [1,1,2,1,5,2,4,4,1,4]
price = [2.53,7.61,8.07,7.87,7.29,9.39,5.98,9.49,9.20,9.17]
val = [101,108,101,109,104,100,108,100,107,104]
left_data = table(timestamp as timestamp,sym as sym1,id as id,price as price,val as val)
appendForJoin(test_engine,true, left_data)

// append data to right table
timestamp = 2024.10.10T15:12:01.507+1..10
sym = take(["a","b","c","d"],10)
id = [1,2,4,3,5,5,4,2,5,5]
price =  [1.08,9.08,9.97,7.60,1.91,6.77,7.81,8.81,0.61,5.92]
qty =  [208,200,203,202,204,201,206,207,205,205]
right_data = table(timestamp as timestamp,sym as sym2,id as id,price as price,qty as qty)
appendForJoin(test_engine,false, right_data)

select * from output
```

You can see from the output table that only the latest matched records in the left table are calculated and output.

|id|sym|timestamp1|timestamp2|factor1|factor2|
|---|---|----------|----------|-------|-------|
|1|a|2024.10.10T15:12:01.516|2024.10.10T15:12:01.508|1,070|208|
|2|b|2024.10.10T15:12:01.513|2024.10.10T15:12:01.509|1,000|200|
|4|c|2024.10.10T15:12:01.514|2024.10.10T15:12:01.510|1,080|203|
|5|a|2024.10.10T15:12:01.512|2024.10.10T15:12:01.512|1,040|204|
|4|c|2024.10.10T15:12:01.514|2024.10.10T15:12:01.514|1,080|206|
|5|a|2024.10.10T15:12:01.512|2024.10.10T15:12:01.516|1,040|205|

Example 3. Create a snapshot join engine that full outer joins left and right tables, matching all records in the right table and only the latest record in the left table for each group.

``` {#codeblock_jrg_kyh_fdc}
// drop the registered engine if you executed examples 1 and 2
dropStreamEngine("test_SJE")

// define the input and output tables
share streamTable(1:0, `timestamp`sym1`id`price`val, [TIMESTAMP, SYMBOL, INT, DOUBLE, DOUBLE]) as leftTable
share streamTable(1:0, `timestamp`sym2`id`price`qty, [TIMESTAMP, SYMBOL, INT, DOUBLE, DOUBLE]) as rightTable
output=table(100:0, ["id","sym", "timestamp1", "timestamp2", "factor1", "factor2"], 
[INT, SYMBOL, TIMESTAMP, TIMESTAMP, DOUBLE, DOUBLE])

test_metrics = [<val*10>, <qty>]
// create the engine
test_engine = createSnapshotJoinEngine(name = "test_SJE", leftTable=leftTable, rightTable=rightTable, 
outputTable=output, metrics=test_metrics, matchingColumn = [["id","sym1"],["id","sym2"]], 
timeColumn = `timestamp, isInnerJoin=false, keepLeftDuplicates=false,keepRightDuplicates=true)

// append data to left table
timestamp = 2024.10.10T15:12:01.507+1..10
sym = take(["a","b","c","d"],10)
id = [1,1,2,1,5,2,4,4,1,4]
price = [2.53,7.61,8.07,7.87,7.29,9.39,5.98,9.49,9.20,9.17]
val = [101,108,101,109,104,100,108,100,107,104]
left_data = table(timestamp as timestamp,sym as sym1,id as id,price as price,val as val)
appendForJoin(test_engine,true, left_data)

// append data to right table
timestamp = 2024.10.10T15:12:01.507+1..10
sym = take(["a","b","c","d"],10)
id = [1,2,4,3,5,5,4,2,5,5]
price =  [1.08,9.08,9.97,7.60,1.91,6.77,7.81,8.81,0.61,5.92]
qty =  [208,200,203,202,204,201,206,207,205,205]
right_data = table(timestamp as timestamp,sym as sym2,id as id,price as price,qty as qty)
appendForJoin(test_engine,false, right_data)

select * from output
```

The output table shows that while only the latest matched records from the left table are calculated, all unmatched records are still included with null values filled in.

|id|sym|timestamp1|timestamp2|factor1|factor2|
|---|---|----------|----------|-------|-------|
|1|a|2024.10.10T15:12:01.508| |1,010| |
|1|b|2024.10.10T15:12:01.509| |1,080| |
|2|c|2024.10.10T15:12:01.510| |1,010| |
|1|d|2024.10.10T15:12:01.511| |1,090| |
|5|a|2024.10.10T15:12:01.512| |1,040| |
|2|b|2024.10.10T15:12:01.513| |1,000| |
|4|c|2024.10.10T15:12:01.514| |1,080| |
|4|d|2024.10.10T15:12:01.515| |1,000| |
|1|a|2024.10.10T15:12:01.516| |1,070| |
|4|b|2024.10.10T15:12:01.517| |1,040| |
|1|a|2024.10.10T15:12:01.516|2024.10.10T15:12:01.508|1,070|208|
|2|b|2024.10.10T15:12:01.513|2024.10.10T15:12:01.509|1,000|200|
|4|c|2024.10.10T15:12:01.514|2024.10.10T15:12:01.510|1,080|203|
|3|d| |2024.10.10T15:12:01.511| |202|
|5|a|2024.10.10T15:12:01.512|2024.10.10T15:12:01.512|1,040|204|
|5|b| |2024.10.10T15:12:01.513| |201|
|4|c|2024.10.10T15:12:01.514|2024.10.10T15:12:01.514|1,080|206|
|2|d| |2024.10.10T15:12:01.515| |207|
|5|a|2024.10.10T15:12:01.512|2024.10.10T15:12:01.516|1,040|205|
|5|b| |2024.10.10T15:12:01.517| |205|

Example 4. Based on example 2, set *outputElapsedMicroseconds* = true to output two more columns in the output table.

``` {#codeblock_zc3_pyh_fdc}
// drop the registered engine if you executed above examples
dropStreamEngine("test_SJE")

// define the input and output tables
share streamTable(1:0, `timestamp`sym1`id`price`val, [TIMESTAMP, SYMBOL, INT, DOUBLE, DOUBLE]) as leftTable
share streamTable(1:0, `timestamp`sym2`id`price`qty, [TIMESTAMP, SYMBOL, INT, DOUBLE, DOUBLE]) as rightTable
output=table(100:0, ["id","sym", "timestamp1", "timestamp2", "factor1", "factor2", "timecost","batchsize"],
[INT, SYMBOL, TIMESTAMP, TIMESTAMP, DOUBLE, DOUBLE, LONG, INT])

test_metrics = [<val*10>, <qty>]
// create the engine
test_engine = createSnapshotJoinEngine(name = "test_SJE", leftTable=leftTable, rightTable=rightTable, 
outputTable=output, metrics=test_metrics, matchingColumn = [["id","sym1"],["id","sym2"]],
timeColumn = `timestamp, outputElapsedMicroseconds=true, isInnerJoin=true,
keepLeftDuplicates=false,keepRightDuplicates=true)

// append data to left table
timestamp = 2024.10.10T15:12:01.507+1..10
sym = take(["a","b","c","d"],10)
id = [1,1,2,1,5,2,4,4,1,4]
price = [2.53,7.61,8.07,7.87,7.29,9.39,5.98,9.49,9.20,9.17]
val = [101,108,101,109,104,100,108,100,107,104]
left_data = table(timestamp as timestamp,sym as sym1,id as id,price as price,val as val)
appendForJoin(test_engine,true, left_data)

// append data to right table
timestamp = 2024.10.10T15:12:01.507+1..10
sym = take(["a","b","c","d"],10)
id = [1,2,4,3,5,5,4,2,5,5]
price =  [1.08,9.08,9.97,7.60,1.91,6.77,7.81,8.81,0.61,5.92]
qty =  [208,200,203,202,204,201,206,207,205,205]
right_data = table(timestamp as timestamp,sym as sym2,id as id,price as price,qty as qty)
appendForJoin(test_engine,false, right_data)

select * from output
```

The two additional columns in the output table show the elapsed time \(in microseconds\) and the total number of records for each batch.

|id|sym|timestamp1|timestamp2|factor1|factor2|timecost|batchsize|
|---|---|----------|----------|-------|-------|--------|---------|
|1|a|2024.10.10T15:12:01.516|2024.10.10T15:12:01.508|1,070|208|109|10|
|2|b|2024.10.10T15:12:01.513|2024.10.10T15:12:01.509|1,000|200|109|10|
|4|c|2024.10.10T15:12:01.514|2024.10.10T15:12:01.510|1,080|203|109|10|
|5|a|2024.10.10T15:12:01.512|2024.10.10T15:12:01.512|1,040|204|109|10|
|4|c|2024.10.10T15:12:01.514|2024.10.10T15:12:01.514|1,080|206|109|10|
|5|a|2024.10.10T15:12:01.512|2024.10.10T15:12:01.516|1,040|205|109|10|

Example 5. Leave *keepLeftDuplicates* and *keepRightDuplicates* at their default value of false, and omit *timeColumn*. When *timeColumn* is not specified, the time columns in the output table hold the arrival times of the data in the left and right tables, respectively.

``` {#codeblock_dxy_sk3_5dc}
// drop the registered engine if you executed above examples
dropStreamEngine("test_SJE")

// define the input and output tables
share streamTable(1:0, `timestamp`sym1`id`price`val, [TIMESTAMP, SYMBOL, INT, DOUBLE, DOUBLE]) as leftTable
share streamTable(1:0, `timestamp`sym2`id`price`qty, [TIMESTAMP, SYMBOL, INT, DOUBLE, DOUBLE]) as rightTable
output=table(100:0, ["id","sym", "timestamp1", "timestamp2", "factor1", "factor2"],
[INT, SYMBOL, TIMESTAMP, TIMESTAMP, DOUBLE, DOUBLE])

test_metrics = [<val*10>, <qty>]
// create the engine
test_engine = createSnapshotJoinEngine(name = "test_SJE", leftTable=leftTable, rightTable=rightTable, 
outputTable=output, metrics=test_metrics, matchingColumn = [["id","sym1"],["id","sym2"]], isInnerJoin=true)

// append data to left table
timestamp = 2024.10.10T15:12:01.507+1..10
sym = take(["a","b","c","d"],10)
id = [1,1,2,1,5,2,4,4,1,4]
price = [2.53,7.61,8.07,7.87,7.29,9.39,5.98,9.49,9.20,9.17]
val = [101,108,101,109,104,100,108,100,107,104]
left_data = table(timestamp as timestamp,sym as sym1,id as id,price as price,val as val)
appendForJoin(test_engine,true, left_data)

// append data to right table
timestamp = 2024.10.10T15:12:01.507+1..10
sym = take(["a","b","c","d"],10)
id = [1,2,4,3,5,5,4,2,5,5]
price =  [1.08,9.08,9.97,7.60,1.91,6.77,7.81,8.81,0.61,5.92]
qty =  [208,200,203,202,204,201,206,207,205,205]
right_data = table(timestamp as timestamp,sym as sym2,id as id,price as price,qty as qty)
appendForJoin(test_engine,false, right_data)

select * from output
```

You can see from the output table that the time columns contain the arrival times of the data in the left and right tables.

|id|sym|timestamp1|timestamp2|factor1|factor2|
|---|---|----------|----------|-------|-------|
|1|a|2024.12.20T15:05:49.603|2024.12.20T15:05:49.603|1,070|208|
|2|b|2024.12.20T15:05:49.603|2024.12.20T15:05:49.603|1,000|200|
|4|c|2024.12.20T15:05:49.603|2024.12.20T15:05:49.603|1,080|203|
|5|a|2024.12.20T15:05:49.603|2024.12.20T15:05:49.603|1,040|204|
|4|c|2024.12.20T15:05:49.603|2024.12.20T15:05:49.603|1,080|206|
|5|a|2024.12.20T15:05:49.603|2024.12.20T15:05:49.603|1,040|205|

