indexedTable
Syntax
indexedTable(keyColumns, X, [X1], [X2], .....)
or
indexedTable(keyColumns, capacity:size, colNames, colTypes)
or
indexedTable(keyColumns, table)
Arguments
keyColumns is a string scalar or vector indicating the name(s) of the primary key column(s). The column type must be INTEGRAL, TEMPORAL or LITERAL.
For the first scenario: X, X1, .... are vectors.
For the second scenario:
capacity is the amount of memory (in terms of the number of rows) allocated to the table. When the number of rows exceeds capacity, the system will first allocate memory of 1.2~2 times of capacity, copy the data to the new memory space, and release the original memory. For large tables, these steps may use significant amount of memory.
size can be 0 or 1, indicating the initial size (in terms of the number of rows) of the table. It must be 0 if the table contains array vector columns.
-
If size=0, create an empty table.
-
If size=1, create a table with one record, and the initialized values are:
-
false for Boolean type.
-
0 for numeric, temporal, IPADDR, COMPLEX, and POINT types.
-
NULL for LITERAL and INT128 types.
-
colNames is a string vector of column names.
colTypes is a string vector of data types.
For the third scenario, table is a table. Please note that keyColumns cannot have duplicate values.
Details
Create an indexed table, which is a special type of in-memory table with primary key. The primary key can be one column or multiple columns. The indexed table uses a red-black tree to store the primary key index. During queries, as long as the query conditions include the first column of the primary key, data can be located through the index without performing a full table scan. It is recommended to use sliceByKey to improve query performance.
When adding new records to the table, if the primary key of the new record duplicates an existing record, the system updates the record in the table; otherwise, the new record is added to the table.
The following compares the query optimization techniques for indexed and keyed tables.
For indexed tables:
- The first column of keyColumns must be queried, and filter conditions for
this column can only use
=
,in
, orand
. - Columns other than the first column of keyColumns can use range queries
through
between
, comparison operators, etc., with higher query efficiency than using thein
predicate. - The number of distinct columns filtered with
in
should not exceed 2.
For keyed tables:
- All keyColumns must be queried. For such queries, key tables show better performance than indexed tables.
- Filter conditions can only use
=
,in
, orand
. - The number of distinct columns filtered with
in
should not exceed 2.
Query Optimization:
-
If the filtering conditions in a SQL statement satisfy the following conditions at the same time, query performance on an indexed table is optimized and is better than that on an ordinary in-memory table:
- The query contains the first key column in keyColumns, the
filtering condition for the first key column uses only
=
orin
operator, and it is not followed by anor
clause; - There are at most 2
in
operators
- The query contains the first key column in keyColumns, the
filtering condition for the first key column uses only
- When querying an indexed table, it is recommended to use sliceByKey to improve performance.
-
For comparison, please refer to the optimized SQL query in keyed table.
Examples
Example 1. Create an indexed table.
The first scenario:
sym=`A`B`C`D`E
id=5 4 3 2 1
val=52 64 25 48 71
t=indexedTable(`sym`id,sym,id,val)
t;
sym | id | col1 |
---|---|---|
A | 5 | 52 |
B | 4 | 64 |
C | 3 | 25 |
D | 2 | 48 |
E | 1 | 71 |
The second scenario:
t=indexedTable(`sym`id,1:0,`sym`id`val,[SYMBOL,INT,INT])
insert into t values(`A`B`C`D`E,5 4 3 2 1,52 64 25 48 71);
The third scenario:
tmp=table(sym, id, val)
t=indexedTable(`sym`id, tmp);
Create an indexed in-memory partitioned table:
t=indexedTable(`sym`id,sym,id,val)
db=database("",VALUE, sym)
pt=db.createPartitionedTable(t,`pt,`sym).append!(t);
Example 2. Update an indexed table.
t=indexedTable(`sym,1:0,`sym`datetime`price`qty,[SYMBOL,DATETIME,DOUBLE,DOUBLE])
insert into t values(`APPL`IBM`GOOG,2018.06.08T12:30:00 2018.06.08T12:30:00 2018.06.08T12:30:00,50.3 45.6 58.0,5200 4800 7800)
t;
sym | datetime | price | qty |
---|---|---|---|
APPL | 2018.06.08T12:30:00 | 50.3 | 5200 |
IBM | 2018.06.08T12:30:00 | 45.6 | 4800 |
GOOG | 2018.06.08T12:30:00 | 58 | 7800 |
Insert a new row with duplicate primary key value as an existing row. The existing row will be overwritten:
insert into t values(`APPL`IBM`GOOG,2018.06.08T12:30:01 2018.06.08T12:30:01 2018.06.08T12:30:01,65.8 45.2 78.6,5800 8700 4600)
t;
sym | datetime | price | qty |
---|---|---|---|
APPL | 2018.06.08T12:30:01 | 65.8 | 5800 |
IBM | 2018.06.08T12:30:01 | 45.2 | 8700 |
GOOG | 2018.06.08T12:30:01 | 78.6 | 4600 |
The primary key cannot be updated:
update t set sym="C_"+sym;
// Error: Can't update a key column.
Example 3. Query on an indexed table.
In some cases, queries on an indexed table are optimized. In this section we will compare the performance of queries on indexed tables vs ordinary in-memory tables.
For the following examples, we first create an ordinary in-memory table t and an indexed table t1 with 1 million records each:
id=shuffle(1..1000000)
date=take(2012.06.01..2012.06.10, 1000000)
type=take(0..9, 1000000)
val=rand(100.0, 1000000)
t=table(id, date, type, val)
t1=indexedTable(`id`date`type, id, date, type, val);
Use the first key column in the filtering condition:
timer(100) select * from t where id=500000;
// Time elapsed: 177.286 ms
timer(100) select * from t1 where id=500000;
// Time elapsed: 1.245 ms
timer(100) sliceByKey(t1, 500000)
// Time elapsed: 0.742 ms
timer(100) select * from t where id in [500000, 600000, 700000];
// Time elapsed: 1134.429 ms
timer(100) select * from t1 where id in [500000, 600000, 700000];
// Time elapsed: 1.377 ms
If the filtering condition for the first key column does not use =
or the in
operator, then the performance of a query on an indexed
table is not optimized:
timer(100) select * from t where id between 500000:500010;
// Time elapsed: 641.544 ms
timer(100) select * from t1 where id between 500000:500010;
// Time elapsed: 599.752 ms
Use the first key column and the third key column in the filtering conditions:
timer(100) select * from t where id=500000, type in [3,6];
// Time elapsed: 172.808 ms
timer(100) select * from t1 where id=500000, type in [3,6];
// Time elapsed: 1.664 ms
If the filtering conditions do not use the first key column, then the performance of a query on an indexed table is not optimized:
timer(100) select * from t where date in [2012.06.03, 2012.06.06];
// Time elapsed: 490.182 ms
timer(100) select * from t1 where date in [2012.06.03, 2012.06.06];
// Time elapsed: 544.015 ms
timer(100) select * from t where date=2012.06.03, type=8;
// Time elapsed: 205.443 ms
timer(100) select * from t1 where date=2012.06.03, type=8;
// Time elapsed: 204.532 ms
With more than 2 in
operators in the filtering conditions, the
performance of a query on an indexed table is not optimized:
timer(100) select * from t where id in [100,200], date in [2012.06.03, 2012.06.06], type in [3,6];
// Time elapsed: 208.714 ms
timer(100) select * from t1 where id in [100,200], date in [2012.06.03, 2012.06.06], type in [3,6];
// Time elapsed: 198.674 ms
Example 4. Use an indexed table with array vectors to record the 5 levels of quotes data.
sym=["a","b","c "]
time=22:58:52.827 22:58:53.627 22:58:53.827
volume=array(INT[]).append!([[100,110,120,115,125],[200,230,220,225,230],[320,300,310,315,310]])
price=array(DOUBLE[]).append!([[10.5,10.6,10.7,10.77,10.85],[8.6,8.7,8.76,8.83,8.9],[6.3,6.37,6.42,6.48,6.52]])
t=indexedTable(`sym,sym,time,volume,price)
t;
sym | time | volume | price |
---|---|---|---|
a | 22:58:52.827 | [100, 110, 120, 115, 125] | [10.5, 10.6, 10.7, 10.77, 10.85] |
b | 22:58:53.627 | [200, 230, 220, 225, 230] | [8.6, 8.7, 8.76, 8.83, 8.9] |
c | 22:58:53.827 | [320, 300, 310, 315, 310] | [6.3, 6.37, 6.42, 6.48, 6.52] |
// latest quote volume and price
newVolume=array(INT[]).append!([[130,110,110,115,120]])
newPrice= array(DOUBLE[]).append!([[10.55,10.57,10.62,10.68,10.5]])
// update for stock a
update t set volume=newVolume, price=newPrice where sym="a"
t;
sym | time | volume | price |
---|---|---|---|
a | 22:58:52.827 | [130, 110, 110, 115, 120] | [10.55, 10.57, 10.62, 10.68, 10.5] |
b | 22:58:53.627 | [200, 230, 220, 225, 230] | [8.6, 8.7, 8.76, 8.83, 8.9] |
c | 22:58:53.827 | [320, 300, 310, 315, 310] | [6.3, 6.37, 6.42, 6.48, 6.52] |
Note that when updating the array vector column, the number of elements in each column must be consistent with the original column. For example, if the vector of new record contains 4 elements, while the original contains 5 elements, an error is raised:
newVolume=array(INT[]).append!([[130,110,110,120]])
newPrice= array(DOUBLE[]).append!([[10.55,10.57,10.62,10.5]])
update t set volume=newVolume, price=newPrice where sym="a"
// error: Failed to update column: volume
Related function: keyedTable