pivot by
pivot by is a unique feature in DolphinDB and an extension to the ANSI
SQL. It rearranges a column (or multiple columns) of a table (with or without a data
transformation function) on two dimensions. If the pivot by clause
includes more than 2 columns, the last column determines the columns in the result and
all other columns determine the rows in the result. It produces a table with
select statement and a matrix with exec statement.
When used in conjunction with the pivot by clause, the select clause can specify
multiple columns. See pivot and
unpivot.
The following versions contain enhancements for the pivot by statement:
| Version | Enhancements |
|---|---|
| 1.20.4 | 1. If the pivot by clause includes more than 2 columns, the last column determines the columns in the result and all other columns determine the rows in the result. 2. When used in conjunction with the pivot by clause, the select clause can specify multiple columns. |
| 2.00.9 |
As of version 1.30.21/2.00.9, query performance has been optimized
for
|
| 2.00.10.4 |
|
Examples
Example 1
sym = `C`MS`MS`MS`IBM`IBM`C`C`C
price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
timestamp = [09:34:07,09:35:42,09:36:51,09:36:59,09:35:47,09:36:26,09:34:16,09:35:26,09:36:12]
t2 = table(timestamp, sym, qty, price);
t2;
| timestamp | sym | qty | price |
|---|---|---|---|
| 09:34:07 | C | 2200 | 49.6 |
| 09:35:42 | MS | 1900 | 29.46 |
| 09:36:51 | MS | 2100 | 29.52 |
| 09:36:59 | MS | 3200 | 30.02 |
| 09:35:47 | IBM | 6800 | 174.97 |
| 09:36:26 | IBM | 5400 | 175.23 |
| 09:34:16 | C | 1300 | 50.76 |
| 09:35:26 | C | 2500 | 50.32 |
| 09:36:12 | C | 8800 | 51.29 |
select price from t2 pivot by timestamp, sym;
| timestamp | C | IBM | MS |
|---|---|---|---|
| 09:34:07 | 49.6 | ||
| 09:34:16 | 50.76 | ||
| 09:35:26 | 50.32 | ||
| 09:35:42 | 29.46 | ||
| 09:35:47 | 174.97 | ||
| 09:36:12 | 51.29 | ||
| 09:36:26 | 175.23 | ||
| 09:36:51 | 29.52 | ||
| 09:36:59 | 30.02 |
a=select last(price) from t2 pivot by timestamp.minute(), sym;
a;
| minute_timestamp | C | IBM | MS |
|---|---|---|---|
| 09:34m | 50.76 | ||
| 09:35m | 50.32 | 174.97 | 29.46 |
| 09:36m | 51.29 | 175.23 | 30.02 |
typestr a;
// output: TABLE
b=exec count(price) from t2 pivot by timestamp.minute(), sym;
b;
| label | C | IBM | MS |
|---|---|---|---|
| 09:34m | 2 | ||
| 09:35m | 1 | 1 | 1 |
| 09:36m | 1 | 1 | 2 |
typestr b;
// output: FAST DOUBLE MATRIX
Example 2. Calculate the basket value of an ETF's constituents. For simplicity, assume an ETF has 2 constituents AAPL and FB.
symbol=take(`AAPL, 6) join take(`FB, 5)
time=2019.02.27T09:45:01.000000000+[146, 278, 412, 445, 496, 789, 212, 556, 598, 712, 989]
price=173.27 173.26 173.24 173.25 173.26 173.27 161.51 161.50 161.49 161.50 161.51
quotes=table(symbol, time, price)
weights=dict(`AAPL`FB, 0.6 0.4)
ETF = select symbol, time, price*weights[symbol] as price from quotes;
select price from ETF pivot by time, symbol;
| Time | AAPL | FB |
|---|---|---|
| 2019.02.27T09:45:01.000000146 | 103.962 | |
| 2019.02.27T09:45:01.000000212 | 64.604 | |
| 2019.02.27T09:45:01.000000278 | 103.956 | |
| 2019.02.27T09:45:01.000000412 | 103.944 | |
| 2019.02.27T09:45:01.000000445 | 103.95 | |
| 2019.02.27T09:45:01.000000496 | 103.956 | |
| 2019.02.27T09:45:01.000000556 | 64.6 | |
| 2019.02.27T09:45:01.000000598 | 64.596 | |
| 2019.02.27T09:45:01.000000712 | 64.6 | |
| 2019.02.27T09:45:01.000000789 | 103.962 | |
| 2019.02.27T09:45:01.000000989 | 64.604 |
To calculate the basket value of this ETF's constituents at each timestamp that any of its constituents had a trade, we need to forward fill each stock's price in the pivot table, and then sum each row's weighted price. They can be completed in just one SQL statement with "pivot by".
select rowSum(ffill(price)) from ETF pivot by time, symbol;
| Time | rowSum |
|---|---|
| 2019.02.27T09:45:01.000000146 | 103.962 |
| 2019.02.27T09:45:01.000000212 | 168.566 |
| 2019.02.27T09:45:01.000000278 | 168.56 |
| 2019.02.27T09:45:01.000000412 | 168.548 |
| 2019.02.27T09:45:01.000000445 | 168.554 |
| 2019.02.27T09:45:01.000000496 | 168.56 |
| 2019.02.27T09:45:01.000000556 | 168.556 |
| 2019.02.27T09:45:01.000000598 | 168.552 |
| 2019.02.27T09:45:01.000000712 | 168.556 |
| 2019.02.27T09:45:01.000000789 | 168.562 |
| 2019.02.27T09:45:01.000000989 | 168.566 |
Example 3. The names of the newly created columns from a pivot by clause will be automatically adjusted.
Before version 2.00.2, column names in DolphinDB can only use letters, digits or underscores (_), and must start with letters.
Since version 2.00.2, column names generated by pivot by, addColumn can contain special characters or start with digits.
For details, please refer to section Creating tables in Table.
date = take(2021.08.01 2021.08.02 2021.08.03, 12)
sym = take(["IBM N", "_MSFTN", "3_GOOGS", ""], 12).sort()
value = 1..12
t=table(date, sym, value)
re = select value from t pivot by date, sym
| date | NULL | 3_GOOGS | IBM N | _MSFTN |
|---|---|---|---|---|
| 2021.08.01 | 1 | 4 | 7 | 10 |
| 2021.08.02 | 2 | 5 | 8 | 11 |
| 2021.08.03 | 3 | 6 | 9 | 12 |
select _"NULL" from re
| NULL |
|---|
| 1 |
| 2 |
| 3 |
select _"3_GOOGS" from re
| 3_GOOGS |
|---|
| 4 |
| 5 |
| 6 |
In the following example, set removeSpecialCharInColumnName = true. The stock symbols "600300" and "600600" are adjusted to "c600300" and "c600600" as column names.
symbol=take(`600300, 6) join take(`600600, 5)
time=2019.02.27T09:45:01.000000000+[146, 278, 412, 445, 496, 789, 212, 556, 598, 712, 989]
price=173.27 173.26 173.24 173.25 173.26 173.27 161.51 161.50 161.49 161.50 161.51
quotes=table(symbol, time, price)
weights=dict(`600300`600600, 0.6 0.4)
ETF = select symbol, time, price*weights[symbol] as price from quotes;
select last(price) from ETF pivot by time, symbol;
| time | C600300 | C600600 |
|---|---|---|
| 2019.02.27T09:45:01.000000146 | 103.962 | |
| 2019.02.27T09:45:01.000000212 | 64.604 | |
| 2019.02.27T09:45:01.000000278 | 103.956 | |
| 2019.02.27T09:45:01.000000412 | 103.944 | |
| 2019.02.27T09:45:01.000000445 | 103.95 | |
| 2019.02.27T09:45:01.000000496 | 103.956 | |
| 2019.02.27T09:45:01.000000556 | 64.6 | |
| 2019.02.27T09:45:01.000000598 | 64.596 | |
| 2019.02.27T09:45:01.000000712 | 64.6 | |
| 2019.02.27T09:45:01.000000789 | 103.962 | |
| 2019.02.27T09:45:01.000000989 | 64.604 |
Example 4. The pivot by clause includes more than 2 columns.
date = 2020.09.21 + 0 0 0 0 1 1 1 1
sym = `MS`MS`GS`GS`MS`MS`GS`GS$SYMBOL
factorNum = 1 2 1 2 1 2 1 2
factorValue = 1.2 -3.4 -2.5 6.3 1.1 -3.2 -2.1 5.6
t = table(date, sym, factorNum, factorValue);
t;
| date | sym | factorNum | factorValue |
|---|---|---|---|
| 2020.09.21 | MS | 1 | 1.2 |
| 2020.09.21 | MS | 2 | -3.4 |
| 2020.09.21 | GS | 1 | -2.5 |
| 2020.09.21 | GS | 2 | 6.3 |
| 2020.09.22 | MS | 1 | 1.1 |
| 2020.09.22 | MS | 2 | -3.2 |
| 2020.09.22 | GS | 1 | -2.1 |
| 2020.09.22 | GS | 2 | 5.6 |
select factorValue from t pivot by date, sym, factorNum;
| date | sym | 1 | 2 |
|---|---|---|---|
| 2020.09.21 | GS | -2.5 | 6.3 |
| 2020.09.21 | MS | 1.2 | -3.4 |
| 2020.09.22 | GS | -2.1 | 5.6 |
| 2020.09.22 | MS | 1.1 | -3.2 |
Select multiple columns for each factorNum
select factorValue, factorValue>0 as factorSign from t pivot by date, sym, factorNum;
| date | sym | factorValue_1 | factorValue_2 | factorSign_1 | factorSign_2 |
|---|---|---|---|---|---|
| 2020.09.21 | GS | -2.5 | 6.3 | 0 | 1 |
| 2020.09.21 | MS | 1.2 | -3.4 | 1 | 0 |
| 2020.09.22 | GS | -2.1 | 5.6 | 0 | 1 |
| 2020.09.22 | MS | 1.1 | -3.2 | 1 | 0 |
Example 5. Retain duplicate values after applying function
asis
In this example, the following conditions are met:
-
table pt is a DFS table
-
The
fromclause does not contain table joins. -
The first n-1 pivot-by column (i.e., the time column) does not contain DECIMAL128 values.
-
The last pivot-by column (i.e., the sym column) is a partitioning column.
-
The selected column value does not contain null values. Aggregate functions, order-sensitive functions, row-based functions, or fill functions are not used in the
selectclause. -
The table generated by the
pivot byclause can be fully populated. In this example, for each distinct value in the time column, the number of values in the last pivot-by column sym is always equal. When time=10:20:44, the sym A, B, and C all have one record; When time=10:20:45, the sym A, B, and C all have one record; When time=10:20:46, the sym A, B, and C all have two records.
Apply function asis to the value column, all
data can be retained:
time = [10:20:44,10:20:44,10:20:44,10:20:45,10:20:45,10:20:45,10:20:46,10:20:46,10:20:46,10:20:46,10:20:46,10:20:46]
sym = ["A","B","C","A","B","C","A","B","C","A","B","C"]
value = [510,434,999,837,402,615,495,885,745,968,975,165]
t = table(time,sym,value)
db = database("dfs://test_pivot",VALUE,`A`B`C)
pt = db.createPartitionedTable(t,`pt,`sym)
pt.append!(t)
select value from pt pivot by time,sym
| time | A | B | C |
|---|---|---|---|
| 10:20:44 | 510 | 434 | 999 |
| 10:20:45 | 837 | 402 | 615 |
| 10:20:46 | 968 | 975 | 165 |
select asis(value) from pt pivot by time,sym
| time | A | B | C |
|---|---|---|---|
| 10:20:44 | 510 | 434 | 999 |
| 10:20:45 | 837 | 402 | 615 |
| 10:20:46 | 495 | 885 | 745 |
| 10:20:46 | 968 | 975 | 165 |
Example 6. Use pivot by with
select or exec clause to query and aggregate
array vector columns.
Create an in-memory table where the column val1 contains array vectors:
id = `A`B`B`B`C`C`D`D`D
val1 = array(DOUBLE[], 0, 10).append!([1.0, 2.1 2.2 2.4, 2.1 2.2 2.4, 3, 5, 1.6 7, 8.9 100, 2.3, 1.1 4 5.1 4]);
val2 = 2200 1900 2100 3200 6800 5400 1300 2500 8800
timestamp = [09:34:07,09:35:42,09:36:51,09:36:59,09:35:47,09:36:26,09:34:16,09:35:26,09:36:12]
t = table(timestamp, id, val1, val2);
// Query the val1 value for each id every minute
select val1 from t pivot by timestamp.minute(), id;
| minute_timestamp | A | B | C | D |
|---|---|---|---|---|
| 09:34m | [1.0000] | [] | [] | [8.9000,100.0000] |
| 09:35m | [] | [2.1000,2.2000,2.4000] | [5.0000] | [2.3000] |
| 09:36m | [] | [3.0000] | [1.6000,7.0000] | [1.1000,4.0000,5.1000,4.0000] |
Apply aggregation on val1 with select and return
a table:
select max(val1) from t pivot by timestamp.minute(), id;
| minute_timestamp | A | B | C | D |
|---|---|---|---|---|
| 09:34m | 1.0000 | 100.0000 | ||
| 09:35m | 2.4000 | 5.0000 | 2.3000 | |
| 09:36m | 3.0000 | 7.0000 | 5.1000 |
Apply aggregation on val1 with exec and return a
matrix:
exec max(val1) from t pivot by timestamp.minute(), id;
| label | A | B | C | D |
|---|---|---|---|---|
| 09:34m | 1.0000 | 100.0000 | ||
| 09:35m | 2.4000 | 5.0000 | 2.3000 | |
| 09:36m | 3.0000 | 7.0000 | 5.1000 |
