Best Practices for Array Vectors

Many institutions leverage L1/L2 snapshot market data for quantitative research. DolphinDB, a high-performance real-time computing platform, is optimized for storing and processing large-scale high-frequency market data. To efficiently handle the multi-level structure of snapshot data, DolphinDB implements array vectors that support variable-length two-dimensional arrays. This specialized data strucutre simplifies code for common queries and computations while improving storage and processing performance.

1. Concept

1.1 Overview

In DolphinDB, array vector is a specialized data structure for storing multi-level market data (such as stock bid/ask quotes). Each element of an array vector is a strongly-typed 1D array. This storage method simplifies code for specific queries and computations, while improving data compression and query performance when multiple columns contain redundant data. DolphinDB supports binary operations on array vectors with scalars, vectors, or array vectors, enabling efficient vectorized calculations and boosting performance.

1.2 Array Vector vs. Matrix

Both array vectors and matrices can organize two-dimensional structured data, and their key differences include:

Matrices require fixed row lengths, whereas array vectors do not.
Matrices are stored in column-major order, while array vectors use row-major storage.

1.3 Fast Array Vector vs. Columnar Tuple

Array vectors can be further categorized into fast array vectors and columnar tuples. The fast array vector implementation consists of two underlying vectors: one stores all row data contiguously, while the other stores the end indices of each row. This compact storage method delivers efficient read/write and computing performance but makes modifying row lengths challenging. To overcome this limitation, DolphinDB introduced the columnar tuple structure. A columnar tuple represents a column where each element corresponds to a row with consistent value types. Each element in a columnar tuple is an independent object (scalar or vector), allowing for row modifications.

Key Differences

Fast array vectors currently do not support SYMBOL and STRING types, whereas columnar tuples do.
When using the typestr function to check data types, fast array vector returns "FAST XXX[] VECTOR", while columnar tuple returns "ANY VECTOR". To check if a variable is a columnar tuple, the isColumnarTuple function can be used.
Fast array vector offers higher efficiency in storage, queries, and computations but has lower efficiency in update and delete operations. Additionally, the length of elements in each row cannot be changed. In contrast, columnar tuple performs better in update and delete operations and allows variable row lengths. Note: These update and delete operations are implemented in the underlying C++ code. However, interfaces for updates and deletions are not exposed, meaning script-level modifications or deletions of array vector elements are not supported. Therefore, it is generally recommended for users to use fast array vectors. Unless otherwise specified, "array vectors" generally refers to fast array vectors in DolphinDB documentation.

Application Scenarios

Fast array vector: Suited for storing and computing numerical two-dimensional arrays, such as 10-level quote data.
Columnar tuple: Ideal for scenarios that require frequent updates to elements in a two-dimensional array. For instance, reactive state engines automatically convert input fast array vector columns into columnar tuples to enhance the efficiency of state updates. This conversion is transparent to users, who simply need to pass fast array vector columns to the engine.

2. Functions and Operations

2.1 Creation of Array Vectors

2.1.1 Creating Array Vector Variables

Fast Array Vector

Use array or bigarray to define an empty array vector and populate it with append!:

x = array(INT[], 0).append!([1 2 3, 4 5, 6 7 8, 9 10])
x // output: [[1,2,3],[4,5],[6,7,8],[9,10]]

Use fixedLengthArrayVector to combine vectors/tuples/matrices/tables into an array vector:

vec = 1 2 3
tp = [4 5 6, 7 8 9]
m = matrix(10 11 12, 13 14 15, 16 17 18)
tb = table(19 20 21 as v1, 22 23 24 as v2)
x = fixedLengthArrayVector(vec, tp, m, tb)
x // output: [[1,4,7,10,13,16,19,22],[2,5,8,11,14,17,20,23],[3,6,9,12,15,18,21,24]]

Use arrayVector to split a single vector into an array vector:

x = arrayVector(3 5 8 10, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
x // output: [[1,2,3],[4,5],[6,7,8],[9,10]]

Columnar Tuple

Use setColumnarTuple! to convert a regular tuple into a columnar tuple:

x = [[1,2,3],[4,5],[6,7,8],[9,10]].setColumnarTuple!()
x // output: ([1,2,3],[4,5],[6,7,8],[9,10])

2.1.2 Creating Tables With Array Vector Columns

Assign an array vector as a column in a table:

Fast array vector columns are typed as "XXX[]" (e.g., "INT[]", "DOUBLE[]").
Columnar tuple columns are typed as "ANY".

x = array(INT[], 0).append!([1 2 3, 4 5, 6 7 8, 9 10])
y = [1 2 3, 4 5, 6 7 8, 9 10].setColumnarTuple!()
t = table(1 2 3 4 as id, x as newCol1)
update t set newCol2=x
t["newCol3"] = y
t
/* 
id newCol1 newCol2 newCol3
-- ------- ------- -------
1  [1,2,3] [1,2,3] [1,2,3]
2  [4,5]   [4,5]   [4,5]  
3  [6,7,8] [6,7,8] [6,7,8]
4  [9,10]  [9,10]  [9,10] 
*/

t.schema().colDefs
/*
name    typeString typeInt extra comment
------- ---------- ------- ----- -------
id      INT        4                    
newCol1 INT[]      68                   
newCol2 INT[]      68                   
newCol3 ANY        25   
*/

Use fixedLengthArrayVector to combine multiple table columns into one array vector column:

t = table(1 2 3 4 as id, 1 3 5 6 as v1, 4 7 9 3 as v2)
t = select *, fixedLengthArrayVector(v1, v2) as newCol from t
t
/*
id v1 v2 newCol
-- -- -- ------
1  1  4  [1,4] 
2  3  7  [3,7] 
3  5  9  [5,9] 
4  6  3  [6,3] 
*/

Use toArray and group by to aggregate grouped data into an array vector:
```
t = table(1 1 3 4 as id, 1 3 5 6 as v1)
new_t = select toArray(v1) as newV1 from t group by id
new_t
/*
id newV1
-- -----
1  [1,3]
3  [5]  
4  [6]  
*/
```
Note: Since toArray generates fast array vector data, which does not support SYMBOL or STRING types, toArray cannot be applied to these columns. For grouping SYMBOL or STRING data, use string concatenation instead:
```
t = table(1 1 3 4 as group_id, `a1`a2`a3`a4 as name)
new_t = select concat(name, ";") as name from t group by group_id 
new_t
/*
group_id name 
-------- -----
1        a1;a2
3        a3   
4        a4        
*/
```
Then use split in queries to split the concatenated string:
```
select *, name.split(";")[0] as name0 from new_t
/*
group_id name  name0
-------- ----- -----
1        a1;a2 a1   
3        a3    a3   
4        a4    a4   
*/
```
Use loadText and specify the parameters schema and arrayDelimiter to read tables containing fast array vector columns from text files.When saving with saveText, fast array vector columns are stored with elements separated by arrayDelimiter:
```
x = array(INT[], 0).append!([1 3 5, 2 7 9])
t = table(1 2 as id, x as value) 
saveText(t, "./test.csv")
```
Contents of "./test.csv":
```
id,value
1,"1,3,5"
2,"2,7,9"
```
To load this CSV, specify the column type as "INT[]" in the parameter schema and arrayDelimiter as ",":
```
t = loadText("./test.csv", schema=table(`id`value as name, ["INT", "INT[]"] as type), arrayDelimiter=",")
t
/*
id value  
-- -------
1  [1,3,5]
2  [2,7,9]
*/
```

2.2 Basic Operations on Array Vectors

2.2.1 Accessing Elements in Array Vectors

Array vector elements can be accessed using functions (row, at) or positional indexing (x[index]).

When accessing array vectors using positional indexing,

If index is a scalar or pair (e.g., 0 or 0:3), it operates on columns.
If index is a vector (e.g., [1, 2, 3]), it operates on rows.
Out-of-bounds indices are filled with nulls.

Accessing Rows in Array Vectors

Use the row function to retrieve a single row and return a vector.

x = array(INT[], 0).append!([1 2 3, 4 5, 6 7 8, 9 10])
x.row(1)  // output: [4,5]

y = [1 2 3, 4 5, 6 7 8, 9 10].setColumnarTuple!()
y.row(1)  // output: [4,5]

// Out-of-bound: null-filled
x.row(10) // output: [,,,]
y.row(10) // output: [,,,]

Specify a vector index for positional indexing x[index] to retrieve rows and return an array vector.

x = array(INT[], 0).append!([1 2 3, 4 5, 6 7 8, 9 10])
y = [1 2 3, 4 5, 6 7 8, 9 10].setColumnarTuple!()

// Single row
x[[1]]    // output: [[4,5]]
y[[1]]    // output: ([4,5])

// Multiple rows
x[[1,2]]  // output: [[4,5],[6,7,8]]
y[[1,2]]  // output: ([4,5],[6,7,8])

// Out-of-bound: null-filled
x[[3,4]]  // output: [[9,10],]
y[[3,4]]  // output: ([9,10],)

Accessing Columns in Array Vectors

Specify a scalar index for positional indexing x[index] to retrieve a single column and return a vector.

x = array(INT[], 0).append!([1 2 3, 4 5, 6 7 8, 9 10])
y = [1 2 3, 4 5, 6 7 8, 9 10].setColumnarTuple!()

x[1]      // output: [2,5,7,10]
y[1]      // output: [2,5,7,10]

// Out-of-bound: null-filled
x[10]     // output: [,,,]
y[10]     // output: [,,,]

Specify a pair index for x[start:end] to retrieve multiple columns and return an array vector.

x = array(INT[], 0).append!([1 2 3, 4 5, 6 7 8, 9 10])
y = [1 2 3, 4 5, 6 7 8, 9 10].setColumnarTuple!()

// With end specified
x[0:2]    // output: [[1,2],[4,5],[6,7],[9,10]]
y[0:2]    // output: [[1,2],[4,5],[6,7],[9,10]]

// Open-ended range
x[1:]     // output: [[2,3],[5],[7,8],[10]]
y[1:]     // output: ([2,3],[5],[7,8],[10])

// Out-of-bound: null-filled
x[1:4]    // output: [[2,3,],[5,,],[7,8,],[10,,]]
y[1:4]    // output: [[2,3,],[5,,],[7,8,],[10,,]]

Note:

start must be ≥0 and < end; otherwise, an exception is thrown.
For columnar tuple:
- If x is of STRING/SYMBOL type, or end is not specified, the result is a columnar tuple.
- Otherwise, the result is a fast array vector.

Accessing Specific Elements (Row r, Column c)

Use positional indexing to retrieve elements. For fast array vectors, use x[r, c]; For columnar tuples, use x[c, r].
```
x = array(INT[], 0).append!([1 2 3, 4 5, 6 7 8, 9 10])
y = [[1,2,3],[4,5],[6,7,8],[9,10]].setColumnarTuple!()
r, c = 2, 1

x[r, c]  // output: [7]
y[c, r]  // output: 7
```
Note: Out-of-bound indices will throw an exception.

Locate elements by row first, and then by column.

x = array(INT[], 0).append!([1 2 3, 4 5, 6 7 8, 9 10])
y = [[1,2,3],[4,5],[6,7,8],[9,10]].setColumnarTuple!()
r, c = 2, 1
rows, cols = [1, 2, 3], 0:2

x.row(r).at(c)  // output: 7
y.row(r).at(c)  // output: 7

x[[r]][c]       // output: [7]
y[[r]][c]       // output: [7]

x[rows][cols]   // output: [[4,5],[6,7],[9,10]]
y[rows][cols]   // output: [[4,5],[6,7],[9,10]]

// Out-of-bound: null-filled
x[2 3 4][1:3]   // output: [[7,8],[10,],[,]]
y[2 3 4][1:3]   // output: [[7,8],[10,],[,]]

Locate elements by column first, and then by row.

x = array(INT[], 0).append!([1 2 3, 4 5, 6 7 8, 9 10])
y = [[1,2,3],[4,5],[6,7,8],[9,10]].setColumnarTuple!()
r, c = 2, 1
rows, cols = [1, 2, 3], 0:2

x[c][r]         // output: 7
y[c][r]         // output: 7

x[cols][rows]   // output: [[4,5],[6,7],[9,10]]
y[cols][rows]   // output: [[4,5],[6,7],[9,10]]

x[1:3][2 3 4]   // output: [[7,8],[10,],]
y[1:3][2 3 4]   // output: [[7,8],[10,],]

Accessing Array Vector Columns

To retrieve elements from array vector columns, specify a scalar or pair index for positional indexing x[index]. Out-of-bound indices are filled with nulls.

x = array(INT[], 0).append!([1 2 3, 4 5, 6 7 8, 9 10])
y = [1 2 3, 4 5, 6 7 8, 9 10].setColumnarTuple!()
t = table(1 2 3 4 as id, x as x, y as y)
new_t = select *, x[2] as x_newCol1, x[1:3] as x_newCol2, y[2] as y_newCol1, y[1:3] as y_newCol2 from t
new_t

/*
id x       y       x_newCol1 x_newCol2 y_newCol1 y_newCol2
-- ------- ------- --------- --------- --------- ---------
1  [1,2,3] [1,2,3] 3         [2,3]     3         [2,3]    
2  [4,5]   [4,5]             [5,]                [5,]     
3  [6,7,8] [6,7,8] 8         [7,8]     8         [7,8]    
4  [9,10]  [9,10]            [10,]               [10,]    
*/

2.2.2 Appending Data to Array Vectors

Currently, array vectors support appending rows, but do not support modifying or deleting elements.

Use append! to append rows to array vectors.

// Append a single row
x = array(INT[], 0).append!([1 2 3, 4 5 6])
x.append!(7)       // Append scalar
x.append!([8 9])   // Append vector
x // output: [[1,2,3],[4,5,6],[7],[8,9]]

y = [1 2 3, 4 5 6].setColumnarTuple!()
y.append!(7)       // Append scalar  
y.append!([8 9])   // Append vector
y // output: ([1,2,3],[4,5,6],7,[8,9])

// Append multiple rows
x = array(INT[], 0).append!([1 2 3, 4 5 6])
x.append!([7, 8 9]) 
x // output: [[1,2,3],[4,5,6],[7],[8,9]]

y = [1 2 3, 4 5 6].setColumnarTuple!()
y.append!([7, 8 9]) 
y // output: ([1,2,3],[4,5,6],7,[8,9])

Use tableInsert or append! to add rows to tables containing array vector columns:

x = array(INT[], 0).append!([1 2 3, 4 5, 6 7 8, 9 10])
y = [1 2 3, 4 5, 6 7 8, 9 10].setColumnarTuple!()
t = table(1 2 3 4 as id, x as col1, y as col2)
t.tableInsert(5, 11, 11)
t.tableInsert(6, [12 13 14], [12 13 14])
t.append!(table(7 8 as id, [15 16, 17] as col1, [15 16, 17] as col2))
t

/*
id col1       col2      
-- ---------- ----------
1  [1,2,3]    [1,2,3]   
2  [4,5]      [4,5]     
3  [6,7,8]    [6,7,8]   
4  [9,10]     [9,10]    
5  [11]       11        
6  [12,13,14] [12,13,14]
7  [15,16]    [15,16]   
8  [17]       17         
*/

2.2.3 Converting Array Vectors to Vectors and Matrices

Use the flatten function to convert an array vector into a one-dimensional vector:

x = array(INT[], 0).append!([1 2 3, 4 5 6])
z = flatten(x)
z // output: [1,2,3,4,5,6]

y = [1 2 3, 4 5 6].setColumnarTuple!()
z = flatten(y)
z // output: [1,2,3,4,5,6]

Use the matrix function to convert equal-length array vectors to matrices:

x = array(INT[], 0).append!([1 2 3, 4 5 6])
z = matrix(x)
z

/*
#0 #1 #2
-- -- --
1  2  3 
4  5  6 
*/

y = [1 2 3, 4 5 6].setColumnarTuple!()
z = matrix(y)
z

/*
#0 #1
-- --
1  4 
2  5 
3  6 
*/

2.2.4 Filtering Elements of Fast Array Vectors

Elements of a fast array vector that meet the filtering condition cond can be filtered out using the syntax x[cond].

x = array(INT[], 0).append!([1 2 3, 4 5, 6 7 8, 9 10])
z = x[x>=5]
z // output: [,[5],[6,7,8],[9,10]]

t = table(1 2 3 4 as id, x as x)
new_t = select *, x[x>=5] as newCol from t
new_t

/*
id x       newCol 
-- ------- -------
1  [1,2,3] []  
2  [4,5]   [5]    
3  [6,7,8] [6,7,8]
4  [9,10]  [9,10] 
*/

2.3 Calculations on Array Vectors

2.3.1 Operations With Scalars

The calculation is performed between each element of the array vector and the scalar.

x = array(INT[], 0).append!([1 2 3, 4 5, 6 7 8, 9 10])
z = x + 1
z // output: [[2,3,4],[5,6],[7,8,9],[10,11]]

y = [1 2 3, 4 5, 6 7 8, 9 10].setColumnarTuple!()
z = y + 1
z // output: ([2,3,4],[5,6],[7,8,9],[10,11])

t = table(1 2 3 4 as id, x as x, y as y)
new_t = select *, x + 1 as new_x, y + 1 as new_y from t
new_t

/*
id x       y       new_x   new_y  
-- ------- ------- ------- -------
1  [1,2,3] [1,2,3] [2,3,4] [2,3,4]
2  [4,5]   [4,5]   [5,6]   [5,6]  
3  [6,7,8] [6,7,8] [7,8,9] [7,8,9]
4  [9,10]  [9,10]  [10,11] [10,11]
*/

2.3.2 Operations With Vectors

The vector length must match the number of rows in the array vector. The calculation is performed between each row of the array vector and the corresponding element in the vector.

x = array(INT[], 0).append!([1 2 3, 4 5, 6 7 8, 9 10])
z = x * [1, 2, 3, 4]
z // output: [[1,2,3],[8,10],[18,21,24],[36,40]]

y = [1 2 3, 4 5, 6 7 8, 9 10].setColumnarTuple!()
z = y * [1, 2, 3, 4]
z // output: ([1,2,3],[8,10],[18,21,24],[36,40])

t = table(1 2 3 4 as id, x as x, y as y)
new_t = select *, x*id as new_x, y*id as new_y from t
new_t

/*
id x       y       new_x      new_y     
-- ------- ------- ---------- ----------
1  [1,2,3] [1,2,3] [1,2,3]    [1,2,3]   
2  [4,5]   [4,5]   [8,10]     [8,10]    
3  [6,7,8] [6,7,8] [18,21,24] [18,21,24]
4  [9,10]  [9,10]  [36,40]    [36,40]  
*/

2.3.3 Operations With Array Vectors

The two array vectors must be of the same dimension. The calculation is performed between the corresponding elements of these two array vectors.

x = array(INT[], 0).append!([1 2 3, 4 5, 6 7 8, 9 10])
xx = array(INT[], 0).append!([3 1 2, 3 1, 1 2 3, 0 1])
z = pow(x, xx)
z // output: [[1,2,9],[64,5],[6,49,512],[1,10]]

y = [1 2 3, 4 5, 6 7 8, 9 10].setColumnarTuple!()
yy = [3 1 2, 3 1, 1 2 3, 0 1].setColumnarTuple!()
z = pow(y, yy)
z // output: ([1,2,9],[64,5],[6,49,512],[1,10])

t = table(1 2 3 4 as id, x as x, xx as xx, y as y, yy as yy)
new_t = select *, pow(x, xx) as new_x, pow(y, yy) as new_y from t
new_t

/*
id x       xx      y       yy      new_x      new_y     
-- ------- ------- ------- ------- ---------- ----------
1  [1,2,3] [3,1,2] [1,2,3] [3,1,2] [1,2,9]    [1,2,9]   
2  [4,5]   [3,1]   [4,5]   [3,1]   [64,5]     [64,5]    
3  [6,7,8] [1,2,3] [6,7,8] [1,2,3] [6,49,512] [6,49,512]
4  [9,10]  [0,1]   [9,10]  [0,1]   [1,10]     [1,10]     
*/

Note that direct calculations between fast array vectors and columnar tuples are not supported.

2.3.4 Row-Based Calculations

Row-Based Functions

Row-based functions support array vectors. These functions are named in the format rowFunc, such as rowSum, rowAlign, etc.

Unary functions: row-wise sum

x = array(INT[], 0).append!([1 2 3, 4 5, 6 7 8, 9 10])
z = rowSum(x)
z // output: [6,9,21,19]

y = [1 2 3, 4 5, 6 7 8, 9 10].setColumnarTuple!()
z = rowSum(y)
z // output: [6,9,21,19]

t = table(1 2 3 4 as id, x as x, y as y)
new_t = select *, rowSum(x) as new_x, rowSum(y) as new_y from t
new_t

/*
id x       y       new_x new_y
-- ------- ------- ----- -----
1  [1,2,3] [1,2,3] 6     6    
2  [4,5]   [4,5]   9     9    
3  [6,7,8] [6,7,8] 21    21   
4  [9,10]  [9,10]  19    19   
*/

Binary operations (on array vectors and vectors): Row-wise weighted average

x = array(INT[], 0).append!([1 2 3, 4 5 6, 6 7 8, 9 10 11])
z = rowWavg(x, [1, 1, 2])
z // output: [2.25,5.25,7.25,10.25]

y = [1 2 3, 4 5 6, 6 7 8, 9 10 11].setColumnarTuple!()
z = rowWavg(y, [1, 1, 2])
z // output: [2.25,5.25,7.25,10.25]

t = table(1 2 3 4 as id, x as x, y as y)
new_t = select *, rowWavg(x, [1, 1, 2]) as new_x, rowWavg(y, [1, 1, 2]) as new_y from t
new_t

/*
id x         y         new_x new_y
-- --------- --------- ----- -----
1  [1,2,3]   [1,2,3]   2.25  2.25 
2  [4,5,6]   [4,5,6]   5.25  5.25 
3  [6,7,8]   [6,7,8]   7.25  7.25 
4  [9,10,11] [9,10,11] 10.25 10.25
*/

Binary operations (on array vectors and array vectors): Row-wise correlation coefficient

x = array(INT[], 0).append!([1 2 3, 4 5, 6 7 8, 9 10])
xx = array(INT[], 0).append!([3 1 2, 3 1, 1 2 3, 0 1])
z = rowCorr(x, xx)
z // output: [-0.5,-1,1,1]

y = [1 2 3, 4 5, 6 7 8, 9 10].setColumnarTuple!()
yy = [3 1 2, 3 1, 1 2 3, 0 1].setColumnarTuple!()
z = rowCorr(y, yy)
z // output: [-0.5,-1,1,1]

t = table(1 2 3 4 as id, x as x, xx as xx, y as y, yy as yy)
new_t = select *, rowCorr(x, xx) as new_x, rowCorr(y, yy) as new_y from t
new_t

/*
id x       xx      y       yy      new_x new_y
-- ------- ------- ------- ------- ----- -----
1  [1,2,3] [3,1,2] [1,2,3] [3,1,2] -0.5  -0.5 
2  [4,5]   [3,1]   [4,5]   [3,1]   -1    -1   
3  [6,7,8] [1,2,3] [6,7,8] [1,2,3] 1     1    
4  [9,10]  [0,1]   [9,10]  [0,1]   1     1       
*/

Special Binary Operations: row-wise alignment

For special data alignment rules in financial scenarios, DolphinDB provides the rowAlign and rowAt functions: rowAlign(left, right, how) aligns the left and right array vectors based on specified alignment rules. rowAt(X, [Y]) retrieves the row indices for each true element in X.

left = array(DOUBLE[], 0).append!([9.00 8.98 8.97 8.96 8.95, 8.99 8.97 8.95 8.93 8.91])
right = prev(left)
left // output: [[9,8.98,8.97,8.96,8.949999999999999],[8.99,8.97,8.949999999999999,8.929999999999999,8.91]]
right // output: [,[9,8.98,8.97,8.96,8.949999999999999]]

leftIndex, rightIndex = rowAlign(left, right, how="bid")
leftIndex // output: [[0,1,2,3,4],[-1,0,-1,1,-1,2]]
rightIndex // output: [[-1,-1,-1,-1,-1],[0,-1,1,2,3,4]]

leftResult = rowAt(left, leftIndex)
rightResult = rowAt(right, rightIndex)

leftResult // output: [[9,8.98,8.97,8.96,8.949999999999999],[,8.99,,8.97,,8.949999999999999]]
rightResult // output: [[,,,,],[9,,8.98,8.97,8.96,8.949999999999999]]

Higher-Order Function byRow

The higher-order function byRow can be called on fast array vectors to apply a user-defined function on each row element.

For example, calculate cumulative sum for each row:

x = array(INT[], 0).append!([1 2 3, 4 5, 6 7 8, 9 10])
z = byRow(cumsum, x)
z // output: [[1,3,6],[4,9],[6,13,21],[9,19]]

t = table(1 2 3 4 as id, x as x)
new_t = select *, byRow(cumsum, x) as new_x from t
new_t

/*
id x       new_x        
-- ------- ---------
1  [1,2,3] [1,3,6]  
2  [4,5]   [4,9]    
3  [6,7,8] [6,13,21]
4  [9,10]  [9,19]   
*/

Note:

byRow only supports fast array vectors.
When using the byRow function, the return value of func can only be a scalar or a vector of the same length as the array vector.

// Return value is a scalar
defg foo1(v){
    return last(v)-first(v)
}

// Return value is a vector of the same length as v
def foo2(v){
    return v \ prev(v) - 1
}

x = array(INT[], 0).append!([1 2 3, 4 5, 6 7 8, 9 10])
x1 = byRow(foo1, x)
x2 = byRow(foo2, x)

x1 // output: [2,1,2,1]
x2 // output: [[,1,0.5],[,0.25],[,0.166666666666667,0.142857142857143],[,0.111111111111111]]

t = table(1 2 3 4 as id, x as x)
new_t = select *, byRow(foo1, x) as x1, byRow(foo2, x) as x2 from t
new_t

/*
id x       x1 x2                                    
-- ------- -- --------------------------------------
1  [1,2,3] 2  [,1,0.5]                              
2  [4,5]   1  [,0.25]                               
3  [6,7,8] 2  [,0.166666666666667,0.142857142857143]
4  [9,10]  1  [,0.111111111111111]                  
*/

Higher-Order Functions each and loop

The higher-order functions each and loop can be called on array vectors to apply a user-defined function on each row element. Their differences are:

byRow only supports fast array vectors, while each and loop support both fast array vectors and columnar tuples.
For each and loop, there is no limitation on the return value of func, and it can be a vector of different lengths from the input row elements. If the return value of func is a vector, the result is a tuple, and the corresponding column in the table is columnar tuple.
For byRow, if the return value of func is a vector, the result of byRow is a fast array vector, and the corresponding column in the table is a fast array vector.

// Return value can be a vector of different length from v
def foo3(v){
    return [last(v)-first(v), last(v)+first(v)]
} 

x = array(INT[], 0).append!([1 2 3, 4 5, 6 7 8, 9 10])
z = loop(foo3, x)
z // output: ([2,4],[1,9],[2,14],[1,19])

y = [1 2 3, 4 5, 6 7 8, 9 10].setColumnarTuple!()
z = loop(foo3, y)
z // output: ([2,4],[1,9],[2,14],[1,19])

t = table(1 2 3 4 as id, x as x, y as y)
new_t = select *, loop(foo3, x) as new_x, loop(foo3, y) as new_y from t
new_t

/*
id x       y       new_x  new_y 
-- ------- ------- ------ ------
1  [1,2,3] [1,2,3] [2,4]  [2,4] 
2  [4,5]   [4,5]   [1,9]  [1,9] 
3  [6,7,8] [6,7,8] [2,14] [2,14]
4  [9,10]  [9,10]  [1,19] [1,19]
*/

2.4 API Writing

2.4.1 C++ API

Construct Data

Create a table object in C++. Specify the column type for the array vector as DT_XX_ARRAY. In the following example, the value column is an INT array vector, so it is specified as DT_INT_ARRAY.

int rowNum = 3;
int colNum = 2;
vector<string> colNames = {"id","value"};
vector<DATA_TYPE> colTypes = {DT_INT, DT_INT_ARRAY};
ConstantSP table = Util::createTable(colNames, colTypes, rowNum, rowNum+1);
vector<VectorSP> columnVecs;
columnVecs.reserve(colNum);
for(int i = 0; i < colNum; ++i) {
    columnVecs.emplace_back(table->getColumn(i));
}

Add data row by row to the C++ table object. Create a DdbVector-type vector as a row in the array vector.

for(int i = 0; i < rowNum; ++i) {
    // Construct a row of the array vector  
    DdbVector<int> intVec(0, 10);
    for(int j = 0; j < 3; ++j) {
        intVec.add(i*3+j);
    }
    VectorSP value_row = intVec.createVector(DT_INT);
    columnVecs[0]->setInt(i, i);        // id column
    columnVecs[1]->set(i, value_row);    // value column
}

Upload Data

Connect to DolphinDB and upload the data. In this example, the table is uploaded to the in-memory table myTable in DolphinDB using the upload method. For details, see User Manual > C++ API.

// Connect to the DolphinDB
DBConnection conn;
conn.connect("127.0.0.1", 8848);
conn.login("admin", "123456", false);
// Upload data to the in-memory table
conn.upload("myTable", table);

Query Data

string script = "select * from myTable;";
ConstantSP result = conn.run(script);
std::cout<<"------ check data ------"<<std::endl;
std::cout<<result->getString()<<std::endl;

2.4.2 Java API

Connect & Create Table

Connect to DolphinDB and create a dimension table to receive data. Specify the column type as XX[]. For example, in the following case, the value column is an INT array vector, so it is specified as INT[].

// Connect to the DolphinDB
DBConnection conn = new DBConnection();
boolean success = conn.connect("127.0.0.1", 8848, "admin", "123456");
// Create a dimension table
String ddb_script = "dbName = \"dfs://testDB\"\n" +"../py_api.dita"est\"\n" +
                "if(existsDatabase(dbName)) {\n" +
                "    dropDatabase(dbName)\n" +
                "}\n" +
                "db = database(dbName, VALUE, 1 2 3 4, , 'TSDB')\n" +
                "schemaTB = table(1:0, `id`value, [INT, INT[]])\n" +
                "pt = createTable(db, schemaTB, tbName, sortColumns=`id)";
conn.run(ddb_script);

Construct Data

Construct data in Java and create a table object. Each column corresponds to a list, and the array vector column is a list of type vector, where each row is a vector.

List<String> colNames = Arrays.asList("id", "value");
List<Vector> cols = new ArrayList<>(6);
int rowNum = 4;
List<Integer> idCol = new ArrayList<>(rowNum);
List<Vector> valueCol = new ArrayList<>(rowNum);    // array vector column
for(int i = 0; i < rowNum; ++i) {
    idCol.add(i + 1);
    List<Integer> valueRow = new ArrayList<>(50);   // A row in the array vector
    for(int j = 0; j < 3; ++j) {
        valueRow.add(i*3 +j);
    }
    valueCol.add(new BasicIntVector(valueRow));
}
cols.add(new BasicIntVector(idCol));
cols.add(new BasicArrayVector(valueCol));
BasicTable tb = new BasicTable(colNames, cols);

Upload Data

Insert data into the dimension table. In the example, the table data tb is uploaded to the dimension table (loaded by loadTable('dfs://testDB','test')) using the tableInsert method. For details, see User Manual > Java API.

List<Entity> tbArg = new ArrayList<>(1);
tbArg.add(tb);
conn.run("tableInsert{loadTable('dfs://testDB','test')}", tbArg);

Query Data

BasicTable t;
t = (BasicTable)conn.run("select * from loadTable('dfs://testDB','test')");
System.out.println(t.getString());

2.4.3 Python API

Construct Data

Create a table object in Python. For example, in the following case, the value column is an INT array vector.

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'value': [np.array([1,2,3],dtype=np.int64), np.array([4,5,6],dtype=np.int64), np.array([7,8,9],dtype=np.int64), np.array([10,11,12],dtype=np.int64)]
})

Upload Data

Connect to DolphinDB and upload the data. In the example, the DataFrame df is uploaded to the in-memory table myTable in DolphinDB using the table method. For details, see User Manual > Python API.

# Connect to the DolphinDB
s = ddb.session()
s.connect("127.0.0.1", 8848, "admin", "123456")
# Upload data to the in-memory table
s.table(data=df, tableAliasName="myTable")

Query Data

t = s.run("select * from myTable")
print(t)

3. Application in Level 2 Snapshot Data

3.1 Storing Snapshot Data

A typical application scenario for array vector is storing Level 2 snapshot data. The Level 2 data usually contains 10-level bid/ask prices, trade volumes, trade counts, first 50 bid/ask orders, and other multi-level data. Level 2 snapshot data often needs to be processed by dataset, so it is optimal to store multi-level market data using array vector columns.

The following table shows the table schema used in this tutorial:


Field Name	Data Type
SecurityID	SYMBOL
DateTime	TIMESTAMP
PreClosePx	DOUBLE
OpenPx	DOUBLE
HighPx	DOUBLE
LowPx	DOUBLE
LastPx	DOUBLE
TotalVolumeTrade	INT
TotalValueTrade	DOUBLE
InstrumentStatus	SYMBOL
BidPrice	DOUBLE[]
BidOrderQty	INT[]
BidNumOrders	INT[]
BidOrders	INT[]
OfferPrice	DOUBLE[]
OfferOrderQty	INT[]
OfferNumOrders	INT[]
OfferOrders	INT[]
NumTrades	INT
IOPV	DOUBLE
TotalBidQty	INT
TotalOfferQty	INT
WeightedAvgBidPx	DOUBLE
WeightedAvgOfferPx	DOUBLE
TotalBidNumber	INT
TotalOfferNumber	INT
BidTradeMaxDuration	INT
OfferTradeMaxDuration	INT
NumBidOrders	INT
NumOfferOrders	INT
WithdrawBuyNumber	INT
WithdrawBuyAmount	INT
WithdrawBuyMoney	DOUBLE
WithdrawSellNumber	INT
WithdrawSellAmount	INT
WithdrawSellMoney	DOUBLE
ETFBuyNumber	INT
ETFBuyAmount	INT
ETFBuyMoney	DOUBLE
ETFSellNumber	INT
ETFSellAmount	INT
ETFSellMoney	DOUBLE

Eight fields (BidPrice, OfferPrice, BidOrderQty, OfferOrderQty, BidNumOrders, OfferNumOrders, BidOrders, OfferOrders) use the array vector structure. Column types are specified as DOUBLE[] and INT[] during table creation. Use the following script to create the table:

dbName = "dfs://SH_TSDB_snapshot_ArrayVector"
tbName = "snapshot"
if(existsDatabase(dbName)){
 dropDatabase(dbName)
}
db1 = database(, VALUE, 2020.01.01..2021.01.01)
db2 = database(, HASH, [SYMBOL, 20])
db = database(dbName, COMPO, [db1, db2], , "TSDB")
schemaTable = table(
 array(SYMBOL, 0) as SecurityID,
 array(TIMESTAMP, 0) as DateTime,
 array(DOUBLE, 0) as PreClosePx,
 array(DOUBLE, 0) as OpenPx,
 array(DOUBLE, 0) as HighPx,
 array(DOUBLE, 0) as LowPx,
 array(DOUBLE, 0) as LastPx,
 array(INT, 0) as TotalVolumeTrade,
 array(DOUBLE, 0) as TotalValueTrade,
 array(SYMBOL, 0) as InstrumentStatus,
 array(DOUBLE[], 0) as BidPrice,
 array(INT[], 0) as BidOrderQty,
 array(INT[], 0) as BidNumOrders,
 array(INT[], 0) as BidOrders,
 array(DOUBLE[], 0) as OfferPrice,
 array(INT[], 0) as OfferOrderQty,
 array(INT[], 0) as OfferNumOrders,
 array(INT[], 0) as OfferOrders,
 array(INT, 0) as NumTrades,
 array(DOUBLE, 0) as IOPV,
 array(INT, 0) as TotalBidQty,
 array(INT, 0) as TotalOfferQty,
 array(DOUBLE, 0) as WeightedAvgBidPx,
 array(DOUBLE, 0) as WeightedAvgOfferPx,
 array(INT, 0) as TotalBidNumber,
 array(INT, 0) as TotalOfferNumber,
 array(INT, 0) as BidTradeMaxDuration,
 array(INT, 0) as OfferTradeMaxDuration,
 array(INT, 0) as NumBidOrders,
 array(INT, 0) as NumOfferOrders,
 array(INT, 0) as WithdrawBuyNumber,
 array(INT, 0) as WithdrawBuyAmount,
 array(DOUBLE, 0) as WithdrawBuyMoney,
 array(INT, 0) as WithdrawSellNumber,
 array(INT, 0) as WithdrawSellAmount,
 array(DOUBLE, 0) as WithdrawSellMoney,
 array(INT, 0) as ETFBuyNumber,
 array(INT, 0) as ETFBuyAmount,
 array(DOUBLE, 0) as ETFBuyMoney,
 array(INT, 0) as ETFSellNumber,
 array(INT, 0) as ETFSellAmount,
 array(DOUBLE, 0) as ETFSellMoney
)
db.createPartitionedTable(table=schemaTable, tableName=tbName, partitionColumns=`DateTime`SecurityID, compressMethods={DateTime:"delta"}, sortColumns=`SecurityID`DateTime, keepDuplicates=ALL)

During data import, the fixedLengthArrayVector function combines multi-level data into single array vector columns. The data import script:

 def transform(data){
 t = select SecurityID, DateTime, PreClosePx, OpenPx, HighPx, LowPx, LastPx, TotalVolumeTrade, TotalValueTrade, InstrumentStatus,
        fixedLengthArrayVector(BidPrice0, BidPrice1, BidPrice2, BidPrice3,  BidPrice4, BidPrice5, BidPrice6, BidPrice7, BidPrice8, BidPrice9) as BidPrice,
        fixedLengthArrayVector(BidOrderQty0, BidOrderQty1, BidOrderQty2, BidOrderQty3,  BidOrderQty4, BidOrderQty5, BidOrderQty6, BidOrderQty7, BidOrderQty8, BidOrderQty9) as BidOrderQty,
        fixedLengthArrayVector(BidNumOrders0, BidNumOrders1, BidNumOrders2, BidNumOrders3,  BidNumOrders4, BidNumOrders5, BidNumOrders6, BidNumOrders7, BidNumOrders8, BidNumOrders9) as BidNumOrders,
        fixedLengthArrayVector(BidOrders0, BidOrders1, BidOrders2, BidOrders3,  BidOrders4, BidOrders5, BidOrders6, BidOrders7, BidOrders8, BidOrders9, BidOrders10, BidOrders11, BidOrders12, BidOrders13,  BidOrders14, BidOrders15, BidOrders16, BidOrders17, BidOrders18, BidOrders19, BidOrders20, BidOrders21, BidOrders22, BidOrders23,  BidOrders24, BidOrders25, BidOrders26, BidOrders27, BidOrders28, BidOrders29, BidOrders30, BidOrders31, BidOrders32, BidOrders33,  BidOrders34, BidOrders35, BidOrders36, BidOrders37, BidOrders38, BidOrders39, BidOrders40, BidOrders41, BidOrders42, BidOrders43,  BidOrders44, BidOrders45, BidOrders46, BidOrders47, BidOrders48, BidOrders49) as BidOrders,
        fixedLengthArrayVector(OfferPrice0, OfferPrice1, OfferPrice2, OfferPrice3,  OfferPrice4, OfferPrice5, OfferPrice6, OfferPrice7, OfferPrice8, OfferPrice9) as OfferPrice,
        fixedLengthArrayVector(OfferOrderQty0, OfferOrderQty1, OfferOrderQty2, OfferOrderQty3,  OfferOrderQty4, OfferOrderQty5, OfferOrderQty6, OfferOrderQty7, OfferOrderQty8, OfferOrderQty9) as OfferOrderQty,
        fixedLengthArrayVector(OfferNumOrders0, OfferNumOrders1, OfferNumOrders2, OfferNumOrders3,  OfferNumOrders4, OfferNumOrders5, OfferNumOrders6, OfferNumOrders7, OfferNumOrders8, OfferNumOrders9) as OfferNumOrders,
        fixedLengthArrayVector(OfferOrders0, OfferOrders1, OfferOrders2, OfferOrders3,  OfferOrders4, OfferOrders5, OfferOrders6, OfferOrders7, OfferOrders8, OfferOrders9, OfferOrders10, OfferOrders11, OfferOrders12, OfferOrders13,  OfferOrders14, OfferOrders15, OfferOrders16, OfferOrders17, OfferOrders18, OfferOrders19, OfferOrders20, OfferOrders21, OfferOrders22, OfferOrders23,  OfferOrders24, OfferOrders25, OfferOrders26, OfferOrders27, OfferOrders28, OfferOrders29, OfferOrders30, OfferOrders31, OfferOrders32, OfferOrders33,  OfferOrders34, OfferOrders35, OfferOrders36, OfferOrders37, OfferOrders38, OfferOrders39, OfferOrders40, OfferOrders41, OfferOrders42, OfferOrders43,  OfferOrders44, OfferOrders45, OfferOrders46, OfferOrders47, OfferOrders48, OfferOrders49) as OfferOrders,
        NumTrades, IOPV, TotalBidQty, TotalOfferQty, WeightedAvgBidPx, WeightedAvgOfferPx, TotalBidNumber, TotalOfferNumber, BidTradeMaxDuration,OfferTradeMaxDuration, NumBidOrders, NumOfferOrders, WithdrawBuyNumber, WithdrawBuyAmount, WithdrawBuyMoney,WithdrawSellNumber, WithdrawSellAmount, WithdrawSellMoney, ETFBuyNumber, ETFBuyAmount, ETFBuyMoney, ETFSellNumber, ETFSellAmount, ETFSellMoney
        from data
 return t
}

def loadData(csvDir, dbName, tbName){
 schemaTB = extractTextSchema(csvDir)
 update schemaTB set type = "SYMBOL" where name = "SecurityID"
 loadTextEx(dbHandle=database(dbName), tableName=tbName, partitionColumns=`DateTime`SecurityID, sortColumns=`SecurityID`DateTime, filename=csvDir, schema=schemaTB, transform=transform)
}

// Submit import job
csvDir = "/home/v2/data/testdata/snapshot_100stocks_multi.csv"
dbName, tbName = "dfs://SH_TSDB_snapshot_ArrayVector", "snapshot"
submitJob("loadData", "load Data", loadData{csvDir, dbName, tbName})
getRecentJobs()

3.2 Storing Raw Data in Grouped Calculation

Another key application of array vectors is storing original data during grouped aggregations. DolphinDB provides the toArray function, which can be combined with group by to store data in each group as a row in an array vector column, enabling users to directly examine all data within each group.

For example, aggregating snapshot data by 5 minute while preserving bid/ask prices and volumes:

// Create database for 5-minute bars
def createFiveMinuteBarDB(dbName, tbName){
 if(existsDatabase(dbName)){
  dropDatabase(dbName)
 }
 db = database(dbName, VALUE, 2021.01.01..2021.12.31, , "TSDB")
 colNames = `SecurityID`DateTime`OpenPx`HighPx`LowPx`LastPx`TotalVolume`TotalValueTrade`BidPrice0`BidOrderQty0`OfferPrice0`OfferOrderQty0
 colTypes = [SYMBOL, TIMESTAMP, DOUBLE, DOUBLE, DOUBLE, DOUBLE, LONG, DOUBLE, DOUBLE[], INT[], DOUBLE[], INT[]]
 schemaTable = table(1:0, colNames, colTypes)
 db.createPartitionedTable(table=schemaTable, tableName=tbName, partitionColumns=`DateTime, compressMethods={DateTime:"delta"}, sortColumns=`SecurityID`DateTime, keepDuplicates=ALL)
}
dbName, tbName = "dfs://fiveMinuteBar", "fiveMinuteBar"
createFiveMinuteBarDB(dbName, tbName)

// Calculate 5-minute bars from snapshot data
t = select first(OpenPx) as OpenPx, max(HighPx) as HighPx, min(LowPx) as LowPx, last(LastPx) as LastPx, last(TotalVolumeTrade) as TotalVolumeTrade, last(TotalValueTrade) as TotalValueTrade, toArray(BidPrice[0]) as BidPrice0,  toArray(BidOrderQty[0]) as BidOrderQty0, toArray(OfferPrice[0]) as OfferPrice0, toArray(OfferOrderQty[0]) as OfferOrderQty0 from loadTable("dfs://SH_TSDB_snapshot_ArrayVector", "snapshot") group by SecurityID, interval(DateTime, 5m, "none") as DateTime map

// Save results
loadTable(dbName, tbName).append!(t)

3.3 High-Frequency Factor Calculation Using Snapshot Data

DolphinDB not only provides high-performance access to time series data but also features a built-in vectorized programming language and powerful computing engines, making it efficient for quantitative finance factor development. This includes high-frequency factor calculation based on historical data and stream computation based on real-time Level 2 market data. This section demonstrates the application of array vectors in factor calculation, using high-frequency factor computation based on snapshot data as an example.

3.3.1 Net Incremental Buying Amount

The formula for calculating net incremental buying amount is defined as follows:

Where bidAmtDiff_t represents the buying order increment at time t; bid_i,t represents the i-th level bid price at time t in the snapshot data; the indicator function I is defined as:

The following script implements the factor based on the formula and DolphinDB built-in functions:

@state
def calculateAmtDiff(bid, ask, bidvol, askvol){
    lastBidPrice = prev(bid[0])  // Best bid price of the previous tick
    lastAskPrice = prev(ask[0])  // Best ask price of the previous tick
    lastBidQty = prev(bidvol[0])  // Best bid quantity of the previous tick
    lastAskQty = prev(askvol[0])  // Best ask quantity of the previous tick
    // Calculate buying order increment
    bidAmtDiff = rowSum(bid*bidvol*(bid >= lastBidPrice)) - lastBidPrice*lastBidQty
    // Calculate selling order increment
    askAmtDiff = rowSum(ask*askvol*(ask <= lastAskPrice)) - lastAskPrice*lastAsk
    return bidAmtDiff - askAmtDiff
}

The prev function is used to retrieve the price of the the previous tick, and the rowSum function calculates the sum by row of the array vector column.

Batch Processing

snapshot = loadTable("dfs://SH_TSDB_snapshot_ArrayVector", "snapshot")
res1 = select SecurityID, DateTime, calculateAmtDiff(BidPrice, OfferPrice, BidOrderQty, OfferOrderQty) as amtDiff from snapshot context by SecurityID csort DateTime

Stream Processing

// Create input and output tables
share(streamTable(1:0, snapshot.schema().colDefs.name, snapshot.schema().colDefs.typeString), `snapshotStreamTable)
share(streamTable(1:0, `SecurityID`DateTime`amtDiff, [SYMBOL, TIMESTAMP, DOUBLE]), `res2)
go
// Create streaming engine
createReactiveStateEngine(name="calAmtDiffDemo", metrics=<[DateTime, calculateAmtDiff(BidPrice, OfferPrice, BidOrderQty, OfferOrderQty)]>, dummyTable=snapshotStreamTable, outputTable=res2, keyColumn=`SecurityID)
// Subscribe to snapshot data
subscribeTable(tableName="snapshotStreamTable", actionName="calAmtDiffTest", offset=-1, handler=getStreamEngine("calAmtDiffDemo"), msgAsTable=true)
// Replay data to simulate streams
testData = select * from snapshot where date(DateTime)=2021.12.01 order by DateTime
submitJob("replayData", "replay snapshot data", replay{inputTables=testData, outputTables=snapshotStreamTable, dateColumn=`DateTime, timeColumn=`DateTime, replayRate=1000})

3.3.2 10-Level Net Incremental Buying Amount

The formula for calculating 10-level net incremental buying amount is defined as follows:

Where level10_Diff_t represents the buying order increment at time t; bid_i,t represents the i-th level bid price at time t in the snapshot data; the indicator function I is defined as:

The following script implements the factor using DolphinDB built-in functions:

@state
def level10_Diff(price, qty, buy){
        prevPrice = price.prev()
        left, right = rowAlign(price, prevPrice, how=iif(buy, "bid", "ask"))
        qtyDiff = (qty.rowAt(left).nullFill(0) - qty.prev().rowAt(right).nullFill(0)) 
        amtDiff = rowSum(nullFill(price.rowAt(left), prevPrice.rowAt(right)) * qtyDiff)
        return amtDiff
}

snapshot = loadTable("dfs://SH_TSDB_snapshot_ArrayVector", "snapshot")
res = select SecurityID, DateTime, level10_Diff(BidPrice, BidOrderQty, true) as level10_Diff from snapshot context by SecurityID csort DateTime

The row alignment function rowAlign is used to align the current and previous 10-level prices. The rowAt and nullFill functions are used to retrieve the corresponding level's order volume and price. Finally, the total change amount is calculated using the rowSum function.

4. Conclusion

This tutorial introduced DolphinDB's array vector structure for efficient variable-length 2D array storage and computation, detailing its key features, usage, and applications. DolphinDB’s capability for storing and processing large-scale high-frequency market data makes it an ideal solution for factor development. Array vectors offer a simple, efficient, and flexible approach, particularly suited for handling Level 2 snapshot data.

Appendix

snapshot_100stocks_multi.zip