Factor Calculation Solutions Compared: Python + HDF5 vs. DolphinDB

In quantitative trading, calculating high-frequency factors based on L1/L2 market quotes and high-frequency trading data is a common requirement for research and investment strategies. However, the exponential growth in historical L1/L2 market data volumes presents significant challenges for traditional relational databases such as MS SQL Server and MySQL. Even with sharding strategies, these databases often struggle to meet the performance demands of efficient data querying and analysis. As a result, many practitioners have turned to distributed file systems, using HDF5 for data storage and Python for quantitative financial computations.

While HDF5 effectively handles large-scale high-frequency datasets, it introduces several challenges, including complex permission management, difficulties in joining disparate datasets, inefficiencies in data retrieval and querying, and issues related to data redundancy. Additionally, using Python for data processing incurs substantial time overhead due to data transfer bottlenecks.

DolphinDB, an advanced analytical platform powered by a distributed time-series database, has emerged as a preferred solution for many brokerages and private equity firms. This article aims to provide a comparative analysis of the Python + HDF5 factor calculation approach against DolphinDB's integrated factor calculation solution.

1. Testing Environment and Data

1.1 Testing Environment

This test compares factor calculation implementations based on Python + HDF5 and DolphinDB.

  • Python + HDF5 Solution: Uses libraries such as Numpy, Pandas, dolphindb, and Joblib.
  • DolphinDB Integrated Solution: Uses DolphinDB server as the computational platform, with a single-node deployment for this test.

The hardware and software environments used for the test are as follows:

Hardware Environment

| Hardware | Configuration |
| --- | --- |
| CPU | Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz |
| Memory | 512 GB |
| Disk | SSD 500 GB |

Software Environment

| Software | Version |
| --- | --- |
| Operating System | CentOS Linux release 7.9.2009 (Core) |
| DolphinDB | V2.00.8.7 |
| Python | V3.6.8 |
| Numpy | V1.19.3 |
| Pandas | V1.1.5 |

1.2 Test Data

The test dataset consists of snapshot data for three trading days between January 2, 2020, and January 6, 2020. The data covers approximately 1,992 stocks, with an average daily data volume of about 7.6 million records and a total data volume of approximately 22 million records. The original data is stored in DolphinDB, and HDF5 format data files can be exported from DolphinDB for use in the Python + HDF5 solution.
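
For reference, a minimal sketch of this export using the dolphindb Python API and pandas is shown below. The server address, credentials, and the per-day/per-stock file layout are assumptions; the table name matches the query used later in this article.

import dolphindb as ddb
import pandas as pd

# Connect to the DolphinDB server holding the original snapshot data
s = ddb.session()
s.connect("localhost", 8848, "admin", "123456")

# Pull one stock's snapshot records for one trading day into a DataFrame
df = s.run("""
select * from loadTable("dfs://LEVEL2_SZ", "Snap")
where TradeDate = 2020.01.02, SecurityID = `000014
""")

# Save the DataFrame to an HDF5 file named by date and stock code
store = pd.HDFStore("/data/stockdata/20200102/data_20200102_000014.h5")
store["Table"] = df
store.close()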

The snapshot table in DolphinDB contains 55 fields; some key fields are shown below:

| Field Name | Data Type |
| --- | --- |
| TradeDate | DATE |
| OrigTime | TIMESTAMP |
| dbtime | TIMESTAMP |
| SecurityID | SYMBOL |
| OfferPX1 | DOUBLE |
| BidPX1 | DOUBLE |
| OfferSize1 | INT |
| BidSize1 | INT |
| …… | …… |

An example of the snapshot data is as follows:

(Only key columns are shown below. The remaining fields of the 55-column table, including SecurityIDSource, MDStreamID, the deeper quote levels, and the order-queue fields, are omitted; all quote prices and sizes are 0 in this excerpt.)

| # | TradeDate | OrigTime | SendTime | Recvtime | dbtime | ChannelNo | SecurityID | OfferPX1 | BidPX1 | OfferSize1 | BidSize1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2020.01.02 | 2020.01.02 09:00:09.000 | 2020.01.02 09:00:09.264 | 2020.01.02 09:00:09.901 | 2020.01.02 09:00:09.902 | 1014 | 000014 | 0 | 0 | 0 | 0 |
| 1 | 2020.01.02 | 2020.01.02 09:01:09.000 | 2020.01.02 09:01:09.330 | 2020.01.02 09:01:09.900 | 2020.01.02 09:01:09.904 | 1014 | 000014 | 0 | 0 | 0 | 0 |
| 2 | 2020.01.02 | 2020.01.02 09:02:09.000 | 2020.01.02 09:02:09.324 | 2020.01.02 09:02:10.131 | 2020.01.02 09:02:10.139 | 1014 | 000014 | 0 | 0 | 0 | 0 |
| 3 | 2020.01.02 | 2020.01.02 09:03:09.000 | 2020.01.02 09:03:09.223 | 2020.01.02 09:03:10.183 | 2020.01.02 09:03:10.251 | 1014 | 000014 | 0 | 0 | 0 | 0 |
| 4 | 2020.01.02 | 2020.01.02 09:04:09.000 | 2020.01.02 09:04:09.227 | 2020.01.02 09:04:10.188 | 2020.01.02 09:04:10.215 | 1014 | 000014 | 0 | 0 | 0 | 0 |
| 5 | 2020.01.02 | 2020.01.02 09:05:09.000 | 2020.01.02 09:05:09.223 | 2020.01.02 09:05:09.930 | 2020.01.02 09:05:09.936 | 1014 | 000014 | 0 | 0 | 0 | 0 |
| 6 | 2020.01.02 | 2020.01.02 09:06:09.000 | 2020.01.02 09:06:09.218 | 2020.01.02 09:06:10.040 | 2020.01.02 09:06:10.044 | 1014 | 000014 | 0 | 0 | 0 | 0 |
| 7 | 2020.01.02 | 2020.01.02 09:07:09.000 | 2020.01.02 09:07:09.224 | 2020.01.02 09:07:09.922 | 2020.01.02 09:07:09.925 | 1014 | 000014 | 0 | 0 | 0 | 0 |
| 8 | 2020.01.02 | 2020.01.02 09:08:09.000 | 2020.01.02 09:08:09.220 | 2020.01.02 09:08:10.137 | 2020.01.02 09:08:10.154 | 1014 | 000014 | 0 | 0 | 0 | 0 |
| 9 | 2020.01.02 | 2020.01.02 09:09:09.000 | 2020.01.02 09:09:09.215 | 2020.01.02 09:09:10.175 | 2020.01.02 09:09:10.198 | 1014 | 000014 | 0 | 0 | 0 | 0 |

2. Factor Code Implementation

This section compares the implementation of two high-frequency factors—Flow Factor and mathWghtSkew Factor—in DolphinDB and Python.

2.1 High-Frequency Factors

Flow Factor

The Flow Factor relies on historical data and requires four fields: buy_vol, sell_vol, askPrice1, and bidPrice1. It uses functions such as mavg and iif to calculate the flow of buyer funds.

mathWghtSkew Factor

This factor processes multi-level quote data (e.g., BidPX1, BidPX2, ..., BidPX10) and calculates the weighted skewness of the quotes.

2.2 Factor Implementation in DolphinDB

DolphinDB provides optimized functions for time-series calculations, such as mavg (moving average) and rowWavg (row-weighted average). These functions simplify the development of high-frequency factors and deliver excellent computational performance.

Flow Factor Implementation Code:

def flow(buy_vol, sell_vol, askPrice1, bidPrice1){
	buy_vol_ma = round(mavg(buy_vol, 60), 5)
	sell_vol_ma = round(mavg(sell_vol, 60), 5)
	buy_prop = iif(abs(buy_vol_ma+sell_vol_ma) < 0, 0.5 , buy_vol_ma/ (buy_vol_ma+sell_vol_ma))
	spd_tmp = askPrice1 - bidPrice1
	spd = iif(spd_tmp  < 0, 0, spd_tmp)
	spd_ma = round(mavg(spd, 60), 5)
	return iif(spd_ma == 0, 0, buy_prop / spd_ma)
}

mathWghtSkew Factor Implementation Code:

def mathWghtCovar(x, y, w){
	v = (x - rowWavg(x, w)) * (y - rowWavg(y, w))
	return rowWavg(v, w)
}
def mathWghtSkew(x, w){
	x_var = mathWghtCovar(x, x, w)
	x_std = sqrt(x_var)
	x_1 = x - rowWavg(x, w)
	x_2 = x_1*x_1
	len = size(w)
	adj = sqrt((len - 1) * len) \ (len - 2)
	skew = rowWsum(x_2, x_1) \ (x_var * x_std) * adj \ len
	return iif(x_std==0, 0, skew)
}

2.3 Factor Implementation in Python

In Python, the implementation relies on libraries such as NumPy and Pandas.

Flow Factor Implementation Code:

import numpy as np
import pandas as pd

def flow(df):
    # 60-row moving averages of the bid and offer sizes
    buy_vol_ma = np.round(df['BidSize1'].rolling(60).mean(), decimals=5)
    sell_vol_ma = np.round(df['OfferSize1'].rolling(60).mean(), decimals=5)
    buy_prop = np.where(abs(buy_vol_ma + sell_vol_ma) < 0, 0.5, buy_vol_ma / (buy_vol_ma + sell_vol_ma))
    # Bid-ask spread, floored at 0, then its 60-row moving average
    spd = df['OfferPX1'].values - df['BidPX1'].values
    spd = np.where(spd < 0, 0, spd)
    spd = pd.DataFrame(spd)
    spd_ma = np.round(spd.rolling(60).mean(), decimals=5)
    return np.where(spd_ma == 0, 0, pd.DataFrame(buy_prop) / spd_ma)

mathWghtSkew Factor Implementation Code:

def rowWavg(x, w):
    rows = x.shape[0]
    res = [[0]*rows]
    for row in range(rows):
        res[0][row] = np.average(x[row], weights=w)
    res = np.array(res)
    return res
def rowWsum(x, y):
    rows = x.shape[0]
    res = [[0]*rows]
    for row in range(rows):
        res[0][row] = np.dot(x[row],y[row])
    res = np.array(res)
    return res
def mathWghtCovar(x, y, w):
    v = (x - rowWavg(x, w).T)*(y - rowWavg(y, w).T)
    return rowWavg(v, w)
def mathWghtSkew(x, w):
    x_var = mathWghtCovar(x, x, w)
    x_std = np.sqrt(x_var)
    x_1 = x - rowWavg(x, w).T
    x_2 = x_1*x_1
    len = np.size(w)
    adj = np.sqrt((len - 1) * len) / (len - 2)
    skew = rowWsum(x_2, x_1) / (x_var*x_std)*adj / len
    return np.where(x_std == 0, 0, skew)

2.4 Summary

From the perspective of code implementation, there is little difference in difficulty between developing these factors in DolphinDB and in Python. The implementation is straightforward in DolphinDB with its built-in functions. In Python, the sliding-window calculations are expressed with combinations such as rolling plus mean; in DolphinDB, the same calculations could be written with the higher-order moving function combined with aggregates such as mean or avg. However, because such generic combinations carry no knowledge of the specific computation, they cannot be incrementally optimized. To enhance performance, DolphinDB provides specialized moving-window functions (e.g., mavg) that are optimized for their specific computations; these optimized functions can improve performance by up to 100 times compared with the generic combinations.
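
To illustrate the difference, a rough timing sketch via the dolphindb Python API might look like the following. The server address and credentials are assumptions, and each measurement includes one client/server round trip.

import time
import dolphindb as ddb

s = ddb.session()
s.connect("localhost", 8848, "admin", "123456")

# Generate 10 million random doubles on the server
s.run("x = rand(10.0, 10000000)")

# Generic combination: higher-order moving with avg
t0 = time.time()
s.run("y1 = moving(avg, x, 60)")
t1 = time.time()

# Incrementally optimized built-in moving average
s.run("y2 = mavg(x, 60)")
t2 = time.time()

print("moving(avg): %.2f s, mavg: %.2f s" % (t1 - t0, t2 - t1))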

3. Factor Calculation Performance

This section further demonstrates how to invoke these factor calculations and run them in parallel in DolphinDB and Python.

3.1 Factor Calculation in DolphinDB

Factor Calculation

In DolphinDB, the Flow Factor requires moving window calculations, while the mathWghtSkew Factor involves multi-column calculations within the same row. These calculations are performed independently for each stock. DolphinDB simplifies such operations with the context by clause, which allows batch processing of time-series data for multiple stocks. By using context by on the stock SecurityID, DolphinDB's SQL engine can directly perform batch moving window calculations for multiple stocks. The calculation code is as follows:

m = 2020.01.02:2020.01.06
w = 10 9 8 7 6 5 4 3 2 1
res = select dbtime, SecurityID, flow(BidSize1, OfferSize1, OfferPX1, BidPX1) as Flow_val, mathWghtSkew(matrix(BidPX1, BidPX2, BidPX3, BidPX4, BidPX5, BidPX6, BidPX7, BidPX8, BidPX9, BidPX10), w) as Skew_val
    from loadTable("dfs://LEVEL2_SZ", "Snap")
    where TradeDate between m
    context by SecurityID

Parallel Execution

DolphinDB's computational framework breaks a computation job down into smaller tasks that are executed in parallel. When the above SQL is executed, DolphinDB automatically identifies the data partitions involved and runs the calculation on multiple threads, with the degree of parallelism bounded by the workerNum configuration parameter. By adjusting this parameter in the DolphinDB configuration file, users can control how much CPU is devoted to factor calculation. The parallelism of the currently running task can be viewed through DolphinDB's web interface.
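
For reference, on a single-node deployment this is a single line in dolphindb.cfg; the value below is only an illustration:

workerNum=16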

3.2 Factor Calculation in Python

HDF5 Data File Reading

Unlike DolphinDB's in-database calculations, Python must first read the snapshot data from HDF5 files before performing any computation. Two common ways to read HDF5 files in Python are pandas.HDFStore() and pandas.read_hdf(). Although read_hdf() supports reading only selected columns, the performance difference between reading partial and full columns is minimal. Additionally, HDF5 files saved with pandas.to_hdf() consume about 30% more storage space than those saved with pandas.HDFStore(), and because of the larger files, reading the same data with pandas.read_hdf() takes longer than with pandas.HDFStore().

Therefore, this test uses pandas.HDFStore() to store and read the HDF5 data. The HDF5 files are stored in the most generic way, without redundancy removal or other storage optimizations, since such optimizations increase data usage and management complexity, especially in massive-data scenarios.

Saving and Reading HDF5 Files

# Save HDF5 file
def saveByHDFStore(path,df):
    store = pd.HDFStore(path)
    store["Table"] = df
    store.close()

# Read a single HDF5 file
def loadData(path):
    store = pd.HDFStore(path, mode='r')
    data = store["Table"]
    store.close()
    return data
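
For reference, the read_hdf() alternative mentioned above would look like the sketch below. Note that column selection only works when the file was written in the 'table' format (e.g. df.to_hdf(path, key="Table", format="table")); the fixed-format store["Table"] = df assignment above does not support it. The path and column list are assumptions.

import pandas as pd

# Read only selected columns from a table-format HDF5 file
cols = ["dbtime", "SecurityID", "OfferPX1", "BidPX1", "OfferSize1", "BidSize1"]
df = pd.read_hdf("/data/stockdata/20200102/data_20200102_000014.h5",
                 key="Table", columns=cols)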

Factor Calculation

For each stock, daily data is read in a loop, concatenated into a DataFrame, and factor calculations are performed.

import os

def ParallelBySymbol(SecurityIDs, dirPath):
    for SecurityID in SecurityIDs:
        sec_data_list = []
        # Read each trading day's HDF5 file for this stock and concatenate
        for date in os.listdir(dirPath):
            filepath = os.path.join(dirPath, str(date), "data_" + str(date) + "_" + str(SecurityID) + ".h5")
            sec_data_list.append(loadData(filepath))
        df = pd.concat(sec_data_list)
        # Calculate Flow Factor
        df_flow_res = flow(df)
        # Calculate mathWghtSkew
        w = np.array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
        pxs = np.array(df[["BidPX1","BidPX2","BidPX3","BidPX4","BidPX5","BidPX6","BidPX7","BidPX8","BidPX9","BidPX10"]])
        np.seterr(divide='ignore', invalid='ignore')
        df_skew_res = mathWghtSkew(pxs, w)

Parallel Scheduling

We use Python's Joblib library to implement multi-process parallel scheduling. The stocks to be calculated are read from files and split into multiple groups based on the parallelism level.

from joblib import Parallel, delayed

# Stock HDF5 file directory
pathDir = "/data/stockdata"
# Parallelism level
n = 1
# SecurityIDs is the list of all stocks to be calculated
SecurityIDs_splits = np.array_split(SecurityIDs, n)
Parallel(n_jobs=n)(delayed(ParallelBySymbol)(group, pathDir) for group in SecurityIDs_splits)
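
For reference, one possible way to build the SecurityIDs list is to collect the stock codes from one day's file names, assuming the per-day subdirectories and the data_<date>_<SecurityID>.h5 naming used above:

import os
import re

# Take the first day's directory and extract the stock code from each file name
day = sorted(os.listdir(pathDir))[0]
pattern = re.compile(r"data_\d+_(\d+)\.h5$")
SecurityIDs = sorted({m.group(1)
                      for f in os.listdir(os.path.join(pathDir, day))
                      for m in [pattern.match(f)] if m})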

3.3 Validation of Calculation Results

We stored the factor calculation results from both Python + HDF5 and DolphinDB in DolphinDB's distributed tables. A full data comparison was performed using the following code to validate consistency between calculation results.

resTb = select * from loadTable("dfs://tempResultDB", "result_cyc_test")
resTb2 = select * from loadTable("dfs://tempResultDB", "result_cyc_test2")
resTb.eq(resTb2).matrix().rowAnd().not().at()

An empty output (displayed as int[0], i.e., no mismatched rows) indicates that the contents of the two tables are identical.
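
For reference, a minimal sketch of how the Python + HDF5 results could be written into one of these distributed tables via the dolphindb Python API is shown below; result_df and its column layout, as well as the connection details, are assumptions.

import dolphindb as ddb

s = ddb.session()
s.connect("localhost", 8848, "admin", "123456")

# result_df: a hypothetical DataFrame holding dbtime, SecurityID, Flow_val, Skew_val
s.upload({"t": result_df})
s.run('loadTable("dfs://tempResultDB", "result_cyc_test2").append!(t)')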

3.4 Performance Comparison

The performance of Python + HDF5 and DolphinDB is compared under different parallelism levels using a dataset of 1,992 stocks over 3 days (22 million records). The time consumption (in seconds) is measured for different numbers of CPU cores. All tests are performed after flushing the operating system cache.
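
For reference, flushing the operating system cache can be done on Linux with a command such as the following (run as root):

sync; echo 3 > /proc/sys/vm/drop_caches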

| CPU Cores | Python + HDF5 (s) | DolphinDB (s) | (Python + HDF5) / DolphinDB |
| --- | --- | --- | --- |
| 1 | 2000 | 21.0 | 95 |
| 2 | 993 | 11.5 | 86 |
| 4 | 540 | 6.8 | 79 |
| 8 | 255 | 5.8 | 44 |
| 16 | 133 | 4.9 | 27 |
| 24 | 114 | 4.3 | 26 |
| 40 | 106 | 4.2 | 25 |

From the results, we observe that DolphinDB's in-database calculation is nearly 100 times faster than the Python + HDF5 calculation in the single-core case. As the number of CPU cores increases, the performance ratio stabilizes at around 1:25.

DolphinDB's proprietary storage system is significantly more efficient than Python's generic HDF5 file reading. While HDF5 files can be optimized for storage performance, this comes with additional hardware resource costs and increased data management complexity. In contrast, DolphinDB's storage system is designed for high efficiency, ease of use, and seamless integration with its analytical capabilities.

From the perspective of code implementation and optimization, DolphinDB provides optimized moving-window functions that deliver superior performance for moving calculations compared with Python's standard rolling() operations. Additionally, DolphinDB's in-database SQL-based calculations simplify factor implementation and enable seamless parallel execution. While DolphinDB automatically utilizes available CPU resources, Python requires explicit parallel scheduling code, which provides finer control over concurrency but also increases implementation complexity.

4. Conclusion

This article compares the performance, development efficiency, and data management capabilities of DolphinDB and the Python + HDF5 solution for high-frequency factor calculations. The results demonstrate that, with the same number of CPU cores:

Performance

DolphinDB's in-database calculations are approximately 25 times faster than the Python + HDF5 factor calculation solution, while producing identical results to the Python approach.

Development Efficiency

Both solutions offer similar ease of factor implementation, though DolphinDB benefits from built-in optimized functions that reduce the need for custom parallelization.

Data Management & Reading Efficiency

HDF5 files operate in a single-threaded manner for reading and writing, making parallel data operations difficult. Data retrieval efficiency in HDF5 depends heavily on the file’s internal organizational structure (groups and datasets). Optimizing read/write performance often requires custom storage design, sometimes involving redundant data storage to meet different access needs. As data volumes grow to the terabyte level, HDF5 data management complexity increases significantly, leading to substantial time and storage costs in practical use.

In contrast, DolphinDB simplifies data storage, querying, and management through its efficient storage engines and partitioning mechanisms. Users can easily manage and utilize data at the petabyte scale and beyond, similar to operating a conventional database.