DataFrame

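DataFrame is one of the most commonly used data structures in DolphinDB pandas. It resembles a two-dimensional table made up of rows and columns. A DataFrame can hold data of different types, such as numeric values, strings, and Boolean values, and supports flexible operations on that data.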

Constructor

DataFrame(data, index=None, lazy)

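data: a DolphinDB table object (in-memory table or distributed table) or a dictionary.
- When data is a table object, its columns become the DataFrame's column data and its column names become the DataFrame's column index.
- When data is a dictionary, its values become the column data and its keys become the column index.
index: the row index of the DataFrame. MultiIndex is not supported yet.
Note: the index can be retrieved through the DataFrame's index attribute.
lazy: optional Boolean, True by default. Specifies whether to create a lazy DataFrame. A lazy DataFrame is a view of the original object, and its computations are not executed immediately; a non-lazy DataFrame is a copy of the original object, and its computations are executed immediately.
Note:
- Call is_lazy() to check whether a DataFrame is lazy.
- When a lazy DataFrame takes part in a computation, it does not produce the result directly but an intermediate object; call compute() to produce the result. Each call to DataFrame.compute() performs a full table scan, so calling it inside a loop hurts performance. Assign the result of DataFrame.compute() to a variable outside the loop and use that variable inside the loop.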
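
The cost of calling compute() in a loop can be pictured with a small pure-Python toy. ToyLazyFrame and its scans counter are hypothetical names invented for this sketch and are not part of DolphinDB pandas; the class only counts how often compute() is called, to show why hoisting the call out of the loop helps.

```python
class ToyLazyFrame:
    """Toy stand-in for a lazy DataFrame (hypothetical, for illustration only)."""

    def __init__(self, rows):
        self._rows = rows
        self.scans = 0  # number of simulated full scans

    def compute(self):
        # Each call walks every row, mimicking a full table scan.
        self.scans += 1
        return list(self._rows)


thresholds = [10, 20, 30]

# Anti-pattern: compute() inside the loop scans the data once per iteration.
slow = ToyLazyFrame([5, 15, 25, 35])
slow_counts = [sum(v > t for v in slow.compute()) for t in thresholds]

# Preferred: compute once outside the loop and reuse the materialized result.
fast = ToyLazyFrame([5, 15, 25, 35])
materialized = fast.compute()
fast_counts = [sum(v > t for v in materialized) for t in thresholds]
```

Both variants produce the same counts, but the first performs three simulated scans and the second only one.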

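Notes:

data cannot be an empty table or an empty dictionary.
When data is a partitioned table, lazy must be True, and index must be a scalar column name. The specified column then serves as the DataFrame's index rather than as data.
When data is a non-partitioned in-memory table, lazy defaults to False. In this case index can be omitted, or specified as a Python list or DolphinDB Vector whose length equals the number of rows in the table. With lazy = True, index must be a scalar column name, and the specified column serves as the DataFrame's index rather than as data.
When data is a dictionary, lazy defaults to False. index can be omitted, or specified as a Python list or DolphinDB Vector whose length equals the length of the values.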
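
The rules above can be restated as a small checker in plain Python. This is only a sketch of the documented constraints: check_constructor_args and the labels "partitioned", "memory", and "dict" are invented for illustration, and the function does not call DolphinDB at all.

```python
def check_constructor_args(data_kind, n_rows, index=None, lazy=None):
    """Return the effective lazy flag, or raise ValueError if a documented rule is violated.

    data_kind is "partitioned", "memory", or "dict" (labels invented for this sketch).
    """
    if n_rows == 0:
        raise ValueError("data cannot be an empty table or an empty dictionary")

    if data_kind == "partitioned":
        if lazy is False:
            raise ValueError("lazy must be True for a partitioned table")
        if not isinstance(index, str):
            raise ValueError("index must be a scalar column name")
        return True

    if data_kind not in ("memory", "dict"):
        raise ValueError(f"unknown data kind: {data_kind!r}")

    # For in-memory tables and dictionaries the documented default is False.
    effective_lazy = False if lazy is None else lazy
    if effective_lazy:
        if data_kind == "memory" and not isinstance(index, str):
            raise ValueError("with lazy=True, index must be a scalar column name")
    elif index is not None:
        # Non-lazy: index must be a list-like with one entry per row.
        if isinstance(index, str) or len(index) != n_rows:
            raise ValueError("index must be a list with one entry per row")
    return effective_lazy
```

For example, check_constructor_args("memory", 4, index="timetag", lazy=True) is accepted, while a three-element list as the index of a four-row table is rejected.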

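Conversion

Only the copy function is currently supported; its deep parameter is not supported yet.

Indexing and Iteration

The following functions are currently supported: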


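Method	Description	Compatibility notes
DataFrame.at	Accesses a single value by a row/column label pair.	When at[i, j] = val is used to modify a DataFrame and val differs in type from the existing data, a type conversion is attempted. An error is raised if the conversion fails.
DataFrame.iat	Accesses a single value by integer row/column position.	Same as DataFrame.at
DataFrame.loc	Accesses a group of rows and columns by label or Boolean array.	Alignable indexes and callables taking one argument (a Series or DataFrame) are not yet supported as input. Selecting the same row more than once is not supported, for example: `df.loc[["aa", "aa"]]`
DataFrame.iloc	Selects by position, using purely integer-location-based indexing.	Same as DataFrame.loc
DataFrame.insert	Inserts a column into the DataFrame at the specified location.	The allow_duplicates parameter is not supported; lazy DataFrames do not support this method
DataFrame.get()	Gets the data of the specified column name.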
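
Because loc does not accept the same row label twice in one request, a label list that is built programmatically may need deduplicating first. A minimal sketch in plain Python follows; unique_labels is a hypothetical helper name, not a library function.

```python
def unique_labels(labels):
    """Drop repeated labels while keeping first-seen order."""
    # dict preserves insertion order, so fromkeys() acts as an ordered set.
    return list(dict.fromkeys(labels))


requested = ["aa", "bb", "aa", "cc", "bb"]
safe_request = unique_labels(requested)  # could then be passed as df.loc[safe_request]
```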

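Binary Operator Functions

All binary operator functions of Python pandas.DataFrame are supported, with the following caveats:

The level parameter is not supported yet.
The axis parameter accepts only integers (0 or 1) and defaults to 1. Strings ("index" or "columns") are not supported.
fill_value cannot be specified when other is a Series.

When using dot, it is recommended that the DataFrame's column names be strings and that the index of the object passed as other also be strings, as the following example shows: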

import pandas as pd
# Create a DataFrame from a dictionary; string keys are recommended
df = pd.DataFrame({"A1": [1,2,3],"A2": [11,12,13]},  ["a", "b", "c"])
# The index of other must be of string type
other = pd.DataFrame({"a": [1,2],"b": [11,12],"c": [1,2]},  ["A1", "A2"])
df.dot(other)
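
To see why the labels matter, note that a labeled dot product pairs each column name of the left operand with the index label of the same name in the right operand. The pure-Python sketch below reproduces that pairing for the two objects above using plain dictionaries; labeled_dot is a hypothetical helper written for illustration and says nothing about how DolphinDB implements dot.

```python
# Column-oriented data plus a row index, mirroring the two DataFrames above.
df_cols = {"A1": [1, 2, 3], "A2": [11, 12, 13]}
df_index = ["a", "b", "c"]
other_cols = {"a": [1, 2], "b": [11, 12], "c": [1, 2]}
other_index = ["A1", "A2"]


def labeled_dot(left_cols, left_index, right_cols, right_index):
    """Matrix product that pairs left column names with right index labels."""
    if sorted(left_cols) != sorted(right_index):
        raise ValueError("column names of left must match index labels of right")
    result = {}
    for out_col, right_values in right_cols.items():
        # Look up right-hand values by label instead of by position.
        by_label = dict(zip(right_index, right_values))
        result[out_col] = [
            sum(left_cols[name][row] * by_label[name] for name in left_cols)
            for row in range(len(left_index))
        ]
    return result


product = labeled_dot(df_cols, df_index, other_cols, other_index)
```

Here product holds one list per output column (a, b, c), each with one entry per row label of df_index. If the labels on the two sides differed in type, for example the integer 1 versus the string "1", the name lookup would fail, which is one reason to keep both sides as strings.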

Computations / Descriptive Statistics

The following functions from Computations / descriptive stats are supported: abs, all, any, autocorr, between, corr, count, cov, diff, kurt, kurtosis, mad, max, mean, median, min, mode, nlargest, nsmallest, prod, sem, skew, std, sum, var, unique, nunique, is_unique, is_monotonic_increasing, is_monotonic_decreasing, value_counts, rank, round.

Some functions do not yet support certain parameters, as listed in the table below:


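Method	Compatibility notes
all	Does not support axis, bool_only, skipna
any	Does not support axis
corr	Supports only method = 'pearson'
corrwith	Does not support axis; supports only method = 'pearson' and numeric_only = False
cummax	No parameters are supported
cummin	No parameters are supported
cumprod	No parameters are supported
cumsum	No parameters are supported
describe	No parameters are supported
diff	Does not support periods, axis
kurt	Does not support axis; supports only skipna = True
kurtosis	Does not support axis; supports only skipna = True
max	Supports only axis = 0; does not support skipna
mean	Does not support axis, skipna
median	Does not support axis, skipna
min	Does not support skipna
mode	Supports only axis = 0 and dropna = True
prod	Supports only axis = 1; does not support skipna, min_count
product	Same as prod
quantile	q must be a scalar; does not support axis, method
rank	Supports only axis = 0; does not support method = 'dense' or na_option = 'bottom'. With pct = True, the result matches that of the DolphinDB built-in function rank
round	decimals cannot be negative
sem	Does not support axis, skipna, ddof
skew	Supports only skipna = True
sum	Does not support skipna, min_count. Applying sum to strings returns a null value
std	Does not support skipna, ddof
var	Does not support skipna, ddof
value_counts	Does not support normalize, sort
nunique	No parameters are supported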
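
Since round rejects a negative decimals value, rounding to tens or hundreds has to be expressed another way. One common approach is to scale down, round with zero decimals, and scale back up. The sketch below shows the arithmetic in plain Python; round_to_power_of_ten is a hypothetical helper, and applying the same divide, round, multiply sequence to a DataFrame column is an assumption that has not been checked against DolphinDB. Values that fall exactly halfway depend on the rounding rule of the engine in use, so they are avoided here.

```python
def round_to_power_of_ten(values, exponent):
    """Round each value to the nearest 10**exponent using only non-negative decimals."""
    scale = 10 ** exponent
    # Scale down, round to 0 decimals, then scale back up.
    return [round(v / scale, 0) * scale for v in values]


rounded = round_to_power_of_ten([1234, 1260, -349], 2)
```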

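Reindexing / Selection / Label Manipulation

Some functions from Reindexing / selection / label manipulation are currently supported. The functions and their compatibility notes are listed below: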


Method	Compatibility notes
idxmin/idxmax	Does not support skipna. With axis = 0, the COMPLEX type is not yet supported; with axis = 1, the COMPLEX, POINT, INT128, DECIMAL32, DECIMAL64, and DECIMAL128 types are not yet supported
reindex	Does not support labels, columns, axis, copy, level, tolerance
reindex_like	Does not support copy, tolerance
align	Supports only other, join
drop	Does not support inplace
sample	Does not support axis, ignore_index
rename	Does not support copy, inplace, level, errors
head / tail	Does not support n = 0

Missing Data Handling

All functions from Missing data handling are currently supported. Parameter support is described in the table below:


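Method	Compatibility notes
backfill	No parameters are supported
bfill	No parameters are supported
dropna	Does not support thresh. In addition, inplace and ignore_index are not supported in lazy mode, and axis cannot be 1 or 'columns'
ffill	No parameters are supported
fillna	Does not support axis, downcast, inplace, limit
interpolate	Does not support axis, limit_area
isna
isnull
notna
notnull
pad	No parameters are supported
replace	Does not support inplace, limit. The to_replace and value parameters do not yet support nested dictionaries, for example: {'a': {'b': None}}.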
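
In standard pandas, a nested dictionary such as {'a': {'b': None}} means "in column a, replace b with None". Because that form is not accepted here, one possible workaround is to break the nested dictionary into per-column replacements first. The sketch below only performs that flattening in plain Python; flatten_nested_replace is a hypothetical helper, and how each triple is then applied to a column is left open.

```python
def flatten_nested_replace(nested):
    """Turn {column: {old: new}} into a list of (column, old, new) triples."""
    return [
        (column, old, new)
        for column, mapping in nested.items()
        for old, new in mapping.items()
    ]


steps = flatten_nested_replace({"a": {"b": None}, "c": {1: 10, 2: 20}})
```

Each triple names one column and one old/new pair, so the replacements can be issued one column at a time without any nesting.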

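DolphinDB-Specific Methods

In addition to methods compatible with Python pandas DataFrame, the DolphinDB pandas DataFrame provides its own method to_table, which converts a DataFrame object into a DolphinDB in-memory table. For example: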

import dolphindb as ddb
import pandas as pd
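# df is assumed to be an existing DataFrame with columns x (INT) and y (STRING).
# Create an empty table t with two columns.
t=table(pair(100,0), ['x','y'].toddb(), [ddb.INT, ddb.STRING].toddb())
# Convert df to a DolphinDB in-memory table with to_table and append it to t.
t.append!(df.to_table())
# A df object cannot be appended to a DolphinDB table object directly.
t.append!(df)   # Error: Only a table can append to another table.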

Creating a DataFrame

Creating a DataFrame from a Dictionary

Lazy creation

import pandas as pd

df = pd.DataFrame({"A": [1,2,3], "B": ["a", "b", "C"]}, index=["A1", "A2", "A3"])
df
// output:
  A  B

A1  1  a
A2  2  b
A3  3  C

Non-lazy creation

df = pd.DataFrame({"A": [1,2,3], "B": ["a", "b", "C"]}, None, False)
// output:
 A  B

0  1  a
1  2  b
2  3  C

df = pd.DataFrame({"A": [1,2,3], "B": ["a", "b", "C"]}, [`x,`y,`z], False)
// output:
 A  B

x  1  a
y  2  b
z  3  C

Creating a DataFrame from an In-Memory Table

Create a DolphinDB in-memory table tableForDf.

timetag =[2018.01.02, 2018.01.03, 2018.01.04, 2018.01.08].toddb()
name = [`AAPL, `AAPL, `GS, `AAPL].toddb()
flag = [`A, `A, `B, `C].toddb()
p = [10, 20, 30, 40].toddb()
tableForDf = table(p as price, flag as bsFlag, name, timetag)

Lazy creation

With lazy = True, index can only be a column name of the table.

df = pd.DataFrame(tableForDf, 'timetag', True)
df
// output:
      price  bsFlag  name
timetag
2018.01.02     10       A  AAPL
2018.01.03     20       A  AAPL
2018.01.04     30       B    GS
2018.01.08     40       C  AAPL

Non-lazy creation

With lazy = False, index cannot be a column name of the table.

level1 = ['0001', '0002', '0002', '0001']
df = pd.DataFrame(tableForDf, level1, False)
df
// output:
price  bsFlag  name     timetag

0001     10       A  AAPL  2018.01.02
0002     20       A  AAPL  2018.01.03
0002     30       B    GS  2018.01.04
0001     40       C  AAPL  2018.01.08

Creating a DataFrame from a Distributed Table

Only lazy DataFrames can be created from a distributed table.

n=1000
month=take(seq(2000.01M, 2016.12M), n);
x=rand(1.0, n);
t=table(month, x);
dbName="dfs://test_pandas"
if(exists(dbName)):
    dropDatabase(dbName)
db=database(dbName, ddb.VALUE, partitionScheme=seq(2000.01M, 2016.12M))
pt=db.createPartitionedTable(t, `pt, `month).append!(t)

# index must be a column name of the table, and lazy must be True
df=pd.DataFrame(pt,index="month", lazy=True)

df=pd.DataFrame(pt,lazy=False)   # Setting lazy to False raises an error: cannot create non-lazy DataFrame with segmented table

Accessing a DataFrame

Create an in-memory table

timetag =[2018.01.02, 2018.01.03, 2018.01.04, 2018.01.08].toddb()
name = [`AAPL, `AAPL, `GS, `AAPL].toddb()
flag = [`A, `A, `B, `C].toddb()
p = [10, 20, 30, 40].toddb()
tableForDf = table(p as price, flag as bsFlag, name, timetag)

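(1) Implicit access. Because a lazy DataFrame is a view of the original object, it does not support implicit access. The following examples use a non-lazy DataFrame.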

level = ['0001', '0002', '0002', '0001']
df_nonlazy=pd.DataFrame(tableForDf, level, False)

Accessing an element

df_nonlazy.iat[0, 3]
// output: 2018.01.02

Accessing columns

df_nonlazy.iloc[:,1]

// output:
0001  A
0002  A
0002  B
0001  C
dtype: STRING

df_nonlazy.iloc[:,1:3] 
// output:
    bsFlag  name
  
0001       A  AAPL
0002       A  AAPL
0002       B    GS
0001       C  AAPL

df_nonlazy.iloc[:,2:]   
// output:
    name     timetag
  
0001  AAPL  2018.01.02
0002  AAPL  2018.01.03
0002    GS  2018.01.04
0001  AAPL  2018.01.08

df_nonlazy.iloc[:,:3] # select columns 0 through 2
// output:
    price  bsFlag  name
  
0001     10       A  AAPL
0002     20       A  AAPL
0002     30       B    GS
0001     40       C  AAPL

Accessing rows

df_nonlazy.iloc[1]
// output:
  price          20
bsFlag           A
  name        AAPL
timetag  2018.01.03
dtype: ANY

df_nonlazy.iloc[1:3] 
// output:
    price  bsFlag  name     timetag
  
0002     20       A  AAPL  2018.01.03
0002     30       B    GS  2018.01.04

Accessing rows and columns together

df_nonlazy.iloc[[0,3], [2]]
// output:
    name
  
0001  AAPL
0001  AAPL

df_nonlazy.iloc[0:3, 0:1]
// output:
    price
  
0001     10
0002     20
0002     30

df_nonlazy.iat[3, 3]
// output:
2018.01.08
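
The slices used with iloc above are half-open, as in ordinary Python: the start position is included and the stop position is excluded. The plain-list sketch below applies the same slices to the column names and row labels of tableForDf, without using any DataFrame at all.

```python
columns = ["price", "bsFlag", "name", "timetag"]
rows = ["0001", "0002", "0002", "0001"]

first_three = columns[:3]   # positions 0, 1, 2
middle = columns[1:3]       # positions 1, 2
from_two = columns[2:]      # positions 2, 3
rows_1_to_2 = rows[1:3]     # positions 1, 2
```

The selected names match the column headers and row labels in the iloc outputs shown above.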

(2) Explicit access

Accessing by column index

df_nonlazy.price # attribute-style access is not supported yet

df_nonlazy['price']
// output:
0001  10
0002  20
0002  30
0001  40
dtype: INT

df_nonlazy[['timetag', 'price']]
// output:
       timetag  price
  
0001  2018.01.02     10
0002  2018.01.03     20
0002  2018.01.04     30
0001  2018.01.08     40

df_nonlazy.loc[:, ['timetag', 'price']]
// output:
       timetag  price
  
0001  2018.01.02     10
0002  2018.01.03     20
0002  2018.01.04     30
0001  2018.01.08     40

Accessing by row index

df_nonlazy.loc[`0001]
// output:
    price  bsFlag  name     timetag
  
0001     10       A  AAPL  2018.01.02
0001     40       C  AAPL  2018.01.08

df_nonlazy.loc[[`0001,`0002]]
// output:
    price  bsFlag  name     timetag
  
0001     10       A  AAPL  2018.01.02
0002     20       A  AAPL  2018.01.03
0002     30       B    GS  2018.01.04
0001     40       C  AAPL  2018.01.08
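
In both outputs above, every row carrying a requested label is returned, and the rows appear in their original table order rather than grouped by the order of the requested labels. The plain-Python sketch below reproduces that observable behavior for the price column; select_by_labels is a hypothetical helper that only mirrors the outputs shown and makes no claim about the actual implementation.

```python
index = ["0001", "0002", "0002", "0001"]
price = [10, 20, 30, 40]


def select_by_labels(index, values, wanted):
    """Keep every row whose label is in wanted, in original row order."""
    wanted = set(wanted)
    return [(label, value) for label, value in zip(index, values) if label in wanted]


one_label = select_by_labels(index, price, ["0001"])
two_labels = select_by_labels(index, price, ["0001", "0002"])
```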

Accessing rows and columns together

df_nonlazy['name'][`0001]
// output:
0001  AAPL
0001  AAPL
dtype: STRING

df_nonlazy.at[`0001, 'name']
// output:
0001  AAPL
0001  AAPL
dtype: STRING

df_nonlazy.loc[[`0001, `0002], ['name']]
// output:
    name
  
0001  AAPL
0002  AAPL
0002    GS
0001  AAPL

(3) Access with a Boolean mask

df_nonlazy.loc[[True,True,False, False]]
df_nonlazy.iloc[[True, True,False, False]]
df_nonlazy[[True, True,False, False]]
// output:
    price  bsFlag  name     timetag
  
0001     10       A  AAPL  2018.01.02
0002     20       A  AAPL  2018.01.03

(4) Conditional access

df_nonlazy[df_nonlazy["price"]>20]
// output:
    price  bsFlag  name     timetag
  
0002     30       B    GS  2018.01.04
0001     40       C  AAPL  2018.01.08
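
Boolean-mask access and conditional access are closely related: a condition such as price > 20 first yields one Boolean per row, and that list is then used as the mask. The sketch below shows both steps on the same four rows using plain Python tuples; apply_mask is a hypothetical helper written only for this illustration.

```python
# (index label, price, bsFlag, name) for the four rows of tableForDf
rows = [
    ("0001", 10, "A", "AAPL"),
    ("0002", 20, "A", "AAPL"),
    ("0002", 30, "B", "GS"),
    ("0001", 40, "C", "AAPL"),
]


def apply_mask(rows, mask):
    """Keep the rows whose mask entry is True."""
    if len(mask) != len(rows):
        raise ValueError("mask length must equal the number of rows")
    return [row for row, keep in zip(rows, mask) if keep]


explicit = apply_mask(rows, [True, True, False, False])

# A condition is just a mask computed from the data: price > 20.
price_mask = [row[1] > 20 for row in rows]
conditional = apply_mask(rows, price_mask)
```

The explicit mask keeps the first two rows and the computed mask keeps the last two, matching the outputs of (3) and (4).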

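Manipulating a DataFrame

Lazy DataFrame objects do not support appending, updating, or deleting data. All DataFrames created in the following examples are non-lazy.

(1) Appending data

Appending a column with insert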

df_nonlazy.insert(0, `tep, [33,33,37,37])

Appending a row by direct assignment (not supported yet)

row = {"price":30, "bsFlag":'B', "name":"GS", "timetag":2018.01.09}
df_nonlazy.iloc[1] = row

(2) Updating data

# Create a sample DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Modify single elements with at and iat
df.at[0, 'A'] = 10
df.iat[1, 1] = 50

# Apply an operation to an entire column with apply
df['A'] = df['A'].apply(lambda x: x * 2)

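If a DataFrame object is built from a table, modifying its data by assignment also modifies the data of the original table.

# Create a sample DataFrame from an in-memory table
t1=table([1,2,3].toddb() as value)
df=pd.DataFrame(t1,lazy=False)

# Modify single elements with at and iat
df.at[0, 'value']=10
df.iat[1, 0] = 20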
t1
// output:
value
10
20
3

(3) Deleting data

Delete data with the drop method

df_nonlazy.drop(['name'], axis=1)