pivot by
pivot by
是 DolphinDB 的独有功能,是对标准 SQL
语句的拓展。它将表中一列或多列的内容按照两个维度重新排列,亦可配合数据转换函数使用。与 select 子句一起使用时返回一个表,而和 exec
语句一起使用时返回一个矩阵。若重新排列后存在行维度存在多个相同值,则会进行去重,只保留最后一个值。 参考 pivot 和 unpivot。
从 2.00.10.4 版本开始,pivot by
支持搭配 select
子句查询表中的数组向量列,返回一个表,且支持对数组向量列应用聚合函数(row 系列函数除外); pivot by
支持搭配 exec
子句对数组向量列应用返回标量的聚合函数(例如 max, sum 等),此时返回一个矩阵,详见例6。
pivot by
在以下版本中进行了优化:
优化版本 | 优化内容 |
---|---|
1.20.4 | pivot by 后面的列数可为两个以上,其中最后一列为列维度,其它各列均为行维度。 使用
pivot by 时,select 子句可使用多个指标。 |
2.00.9 | 对满足以下2个条件的 pivot by 语句进行了优化:
|
2.00.10.4 | 若 pivot by 后面的列在原表中存在多个相同的组合值,当同时满足以下条件时,对至少一个查询列应用
asis 函数,则不会对数据去重而是保留所有数据:
|
例子
例1:行维度为一列(即 pivot by 之后仅有两列)
sym = `C`MS`MS`MS`IBM`IBM`C`C`C
price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
timestamp = [09:34:07,09:35:42,09:36:51,09:36:59,09:35:47,09:36:26,09:34:16,09:35:26,09:36:12]
t2 = table(timestamp, sym, qty, price);
t2;
timestamp | sym | qty | price |
---|---|---|---|
09:34:07 | C | 2200 | 49.6 |
09:35:42 | MS | 1900 | 29.46 |
09:36:51 | MS | 2100 | 29.52 |
09:36:59 | MS | 3200 | 30.02 |
09:35:47 | IBM | 6800 | 174.97 |
09:36:26 | IBM | 5400 | 175.23 |
09:34:16 | C | 1300 | 50.76 |
09:35:26 | C | 2500 | 50.32 |
09:36:12 | C | 8800 | 51.29 |
将表 t2 的 timestamp 列作为行索引,sym 列作为列索引,查看 price 列的值。
select price from t2 pivot by timestamp, sym;
timestamp | C | IBM | MS |
---|---|---|---|
09:34:07 | 49.6 | ||
09:34:16 | 50.76 | ||
09:35:26 | 50.32 | ||
09:35:42 | 29.46 | ||
09:35:47 | 174.97 | ||
09:36:12 | 51.29 | ||
09:36:26 | 175.23 | ||
09:36:51 | 29.52 | ||
09:36:59 | 30.02 |
查看表 t2 每一分钟,不同 sym 对应的 price。
a=select price from t2 pivot by timestamp.minute(), sym;
a;
minute_timestamp | C | IBM | MS |
---|---|---|---|
09:34m | 50.76 | ||
09:35m | 50.32 | 174.97 | 29.46 |
09:36m | 51.29 | 175.23 | 30.02 |
typestr a;
// output
TABLE
//pivot by 与 exec 搭配使用,查看一分钟内各个 sym 对应的 price 的数量,结果为一个矩阵
b=exec count(price) from t2 pivot by timestamp.minute(), sym;
b;
label | C | IBM | MS |
---|---|---|---|
09:34m | 2 | ||
09:35m | 1 | 1 | 1 |
09:36m | 1 | 1 | 2 |
typestr b;
// output
FAST DOUBLE MATRIX
例2:计算 ETF 的内在价值
为简便起见,假设 ETF 有两个成分股:AAPL 和 FB,其成分权重为 [0.6, 0.4]。
symbol=take(`AAPL, 6) join take(`FB, 5)
time=2019.02.27T09:45:01.000000000+[146, 278, 412, 445, 496, 789, 212, 556, 598, 712, 989]
price=173.27 173.26 173.24 173.25 173.26 173.27 161.51 161.50 161.49 161.50 161.51
quotes=table(symbol, time, price)
weights=dict(`AAPL`FB, 0.6 0.4)
ETF = select symbol, time, price*weights[symbol] as price from quotes;
select price from ETF pivot by time, symbol;
Time | AAPL | FB |
---|---|---|
2019.02.27T09:45:01.000000146 | 103.962 | |
2019.02.27T09:45:01.000000212 | 64.604 | |
2019.02.27T09:45:01.000000278 | 103.956 | |
2019.02.27T09:45:01.000000412 | 103.944 | |
2019.02.27T09:45:01.000000445 | 103.95 | |
2019.02.27T09:45:01.000000496 | 103.956 | |
2019.02.27T09:45:01.000000556 | 64.6 | |
2019.02.27T09:45:01.000000598 | 64.596 | |
2019.02.27T09:45:01.000000712 | 64.6 | |
2019.02.27T09:45:01.000000789 | 103.962 | |
2019.02.27T09:45:01.000000989 | 64.604 |
以上结果为在每个时间戳经过 ETF 权重调整的股票价格。 为了计算 ETF 在每个时间戳的内在价值,需要使用 ffill 函数来填充每个股票的 NULL 值,然后将所有股票价格相加。可在一个 SQL 语句中完成。
select rowSum(ffill(price)) from ETF pivot by time, symbol;
Time | rowSum |
---|---|
2019.02.27T09:45:01.000000146 | 103.962 |
2019.02.27T09:45:01.000000212 | 168.566 |
2019.02.27T09:45:01.000000278 | 168.56 |
2019.02.27T09:45:01.000000412 | 168.548 |
2019.02.27T09:45:01.000000445 | 168.554 |
2019.02.27T09:45:01.000000496 | 168.56 |
2019.02.27T09:45:01.000000556 | 168.556 |
2019.02.27T09:45:01.000000598 | 168.552 |
2019.02.27T09:45:01.000000712 | 168.556 |
2019.02.27T09:45:01.000000789 | 168.562 |
2019.02.27T09:45:01.000000989 | 168.566 |
例3:使用 pivot by 时某些新建列名的自动转换
版本说明:
-
2.00.2 版本以前,DolphinDB 中列名仅可使用中文或英文字母、数字或下划线 (_),且必须以中文或英文字母开头。
-
2.00.2 版本开始,由
pivot by
,addColumn
操作产生的列名,支持包含特殊字符,或以数字开头。
详情可以参考:创建表章节。
date = take(2021.08.01 2021.08.02 2021.08.03, 12)
sym = take(["IBM N", "_MSFTN", "3_GOOGS", ""], 12).sort()
value = 1..12
t=table(date, sym, value)
re = select value from t pivot by date, sym
date | NULL | 3_GOOGS | IBM N | _MSFTN |
---|---|---|---|---|
2021.08.01 | 1 | 4 | 7 | 10 |
2021.08.02 | 2 | 5 | 8 | 11 |
2021.08.03 | 3 | 6 | 9 | 12 |
select _"NULL" from re
NULL |
---|
1 |
2 |
3 |
select _"3_GOOGS" from re
3_GOOGS |
---|
4 |
5 |
6 |
下例中,设置 removeSpecialCharInColumnName = true,股票代码 "600300" 与 "600600" 在成为列名后,系统自动将其处理为 "C600300" 与 "C600600"。
symbol=take(`600300, 6) join take(`600600, 5)
time=2019.02.27T09:45:01.000000000+[146, 278, 412, 445, 496, 789, 212, 556, 598, 712, 989]
price=173.27 173.26 173.24 173.25 173.26 173.27 161.51 161.50 161.49 161.50 161.51
quotes=table(symbol, time, price)
weights=dict(`600300`600600, 0.6 0.4)
ETF = select symbol, time, price*weights[symbol] as price from quotes;
select last(price) from ETF pivot by time, symbol;
time | C600300 | C600600 |
---|---|---|
2019.02.27T09:45:01.000000146 | 103.962 | |
2019.02.27T09:45:01.000000212 | 64.604 | |
2019.02.27T09:45:01.000000278 | 103.956 | |
2019.02.27T09:45:01.000000412 | 103.944 | |
2019.02.27T09:45:01.000000445 | 103.95 | |
2019.02.27T09:45:01.000000496 | 103.956 | |
2019.02.27T09:45:01.000000556 | 64.6 | |
2019.02.27T09:45:01.000000598 | 64.596 | |
2019.02.27T09:45:01.000000712 | 64.6 | |
2019.02.27T09:45:01.000000789 | 103.962 | |
2019.02.27T09:45:01.000000989 | 64.604 |
例4:行维度为多列(即 pivot by 之后多于两列)
date = 2020.09.21 + 0 0 0 0 1 1 1 1
sym = `MS`MS`GS`GS`MS`MS`GS`GS$SYMBOL
factorNum = 1 2 1 2 1 2 1 2
factorValue = 1.2 -3.4 -2.5 6.3 1.1 -3.2 -2.1 5.6
t = table(date, sym, factorNum, factorValue);
t;
date | sym | factorNum | factorValue |
---|---|---|---|
2020.09.21 | MS | 1 | 1.2 |
2020.09.21 | MS | 2 | -3.4 |
2020.09.21 | GS | 1 | -2.5 |
2020.09.21 | GS | 2 | 6.3 |
2020.09.22 | MS | 1 | 1.1 |
2020.09.22 | MS | 2 | -3.2 |
2020.09.22 | GS | 1 | -2.1 |
2020.09.22 | GS | 2 | 5.6 |
将表 t 的 date 列和 sym 列作为行索引,factorNum 列作为列索引,查看 factorValue 列的值。
select factorValue from t pivot by date, sym, factorNum;
date | sym | 1 | 2 |
---|---|---|---|
2020.09.21 | GS | -2.5 | 6.3 |
2020.09.21 | MS | 1.2 | -3.4 |
2020.09.22 | GS | -2.1 | 5.6 |
2020.09.22 | MS | 1.1 | -3.2 |
select factorValue, factorValue>0 as factorSign from t pivot by date, sym, factorNum;
date | sym | factorValue_1 | factorValue_2 | factorSign_1 | factorSign_2 |
---|---|---|---|---|---|
2020.09.21 | GS | -2.5 | 6.3 | 0 | 1 |
2020.09.21 | MS | 1.2 | -3.4 | 1 | 0 |
2020.09.22 | GS | -2.1 | 5.6 | 0 | 1 |
2020.09.22 | MS | 1.1 | -3.2 | 1 | 0 |
例5:应用 asis
函数保留全量数据。
本例中满足以下条件:
-
pt 是 DFS 表
-
from 子句不包含表连接
-
pivot by
子句的前 n-1 列,即 time 列不含 DECIMAL128 类型 -
pivot by
子句最后一列 sym 是分区列 -
查询语句 select value 不含聚合函数、行处理函数和填充函数,且 value 列不含空值
-
数据可以填满
pivot by
的宽表结果,即对于 time 列的每个不同的取值,pivot by
最后一列的取值的个数必须相等, time=10:20:44 时,sym 的取值 A、B、C 对应的行数均为1;time=10:20:45 时,sym 的取值 A、B、C 对应的行数均为1;time=10:20:46 时,sym 的取值 A、B、C 对应的行数均为2。
asis
函数,可以保留全量数据。time = [10:20:44,10:20:44,10:20:44,10:20:45,10:20:45,10:20:45,10:20:46,10:20:46,10:20:46,10:20:46,10:20:46,10:20:46]
sym = ["A","B","C","A","B","C","A","B","C","A","B","C"]
value = [510,434,999,837,402,615,495,885,745,968,975,165]
t = table(time,sym,value)
db = database("dfs://test_pivot",VALUE,`A`B`C)
pt = db.createPartitionedTable(t,`pt,`sym)
pt.append!(t)
select value from pt pivot by time,sym
time | A | B | C |
---|---|---|---|
10:20:44 | 510 | 434 | 999 |
10:20:45 | 837 | 402 | 615 |
10:20:46 | 968 | 975 | 165 |
select asis(value) from pt pivot by time,sym
time | A | B | C |
---|---|---|---|
10:20:44 | 510 | 434 | 999 |
10:20:45 | 837 | 402 | 615 |
10:20:46 | 495 | 885 | 745 |
10:20:46 | 968 | 975 | 165 |
例6:pivot by
搭配 select/exec 子句查询数组向量列并进行聚合计算。
id = `A`B`B`B`C`C`D`D`D
val1 = array(DOUBLE[], 0, 10).append!([1.0, 2.1 2.2 2.4, 2.1 2.2 2.4, 3, 5, 1.6 7, 8.9 100, 2.3, 1.1 4 5.1 4]);
val2 = 2200 1900 2100 3200 6800 5400 1300 2500 8800
timestamp = [09:34:07,09:35:42,09:36:51,09:36:59,09:35:47,09:36:26,09:34:16,09:35:26,09:36:12]
t = table(timestamp, id, val1, val2);
//对查询每分钟各个 id 对应的 val1 值
select val1 from t pivot by timestamp.minute(), id;
minute_timestamp | A | B | C | D |
---|---|---|---|---|
09:34m | [1.0000] | [] | [] | [8.9000,100.0000] |
09:35m | [] | [2.1000,2.2000,2.4000] | [5.0000] | [2.3000] |
09:36m | [] | [3.0000] | [1.6000,7.0000] | [1.1000,4.0000,5.1000,4.0000] |
exec max(val1) from t pivot by timestamp.minute(), id;
label | A | B | C | D |
---|---|---|---|---|
09:34m | 1.0000 | 100.0000 | ||
09:35m | 2.4000 | 5.0000 | 2.3000 | |
09:36m | 3.0000 | 7.0000 | 5.1000 |