Importing HDF5 Files

HDF5 (Hierarchical Data Format version 5) is a file format designed to store and organize large, complex datasets such as images and time series. An HDF5 file serves as a container that holds a variety of data objects, including groups, datasets, and datatypes. Its tree-like structure organizes multiple data objects efficiently, with support for compression and encryption.

DolphinDB has developed an HDF5 plugin to import HDF5 datasets into DolphinDB and supports data type conversions. This page explains how to use the plugin to import HDF5 datasets into DolphinDB.

Installing HDF5 Plugin

The HDF5 plugin is pre-installed by default and can be loaded simply by running the following script.

loadPlugin("hdf5")

HDF5 Data Import Methods

The DolphinDB HDF5 plugin provides the following methods to facilitate HDF5 file import:

  • loadHDF5: Imports a specified HDF5 dataset as an in-memory table.
  • loadPandasHDF5: Imports a specified HDF5 dataset saved by pandas as an in-memory table.
  • loadHDF5Ex: Imports HDF5 datasets directly into either a distributed database or in-memory database.
  • HDF5DS: Divides an HDF5 dataset into multiple smaller data sources, which can be used with the mr function to load data.

The following table outlines common parameters of these methods.

Parameter     Description
filename      A string indicating the path of the input HDF5 file.
datasetName   A string indicating the dataset name, i.e., the table name.
schema        A table with the following columns:
              • name: STRING scalar. Column name.
              • type: STRING scalar. Data type.
startRow      An integer indicating the start row from which to read the HDF5 dataset. If not specified, the dataset is read from the beginning.
rowNum        An integer indicating the number of rows to read from the HDF5 dataset. If not specified, reading continues until the end of the dataset.
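
As a quick illustration, these parameters might be combined as follows. This is a minimal sketch reusing the demo1.h5 file from the examples below; the row range is arbitrary.

// infer the schema first, then read the first 1,000 rows of the "data" dataset
schemaTB = hdf5::extractHDF5Schema("./hdf5file/demo1.h5", "data")
tmpTB = hdf5::loadHDF5("./hdf5file/demo1.h5", "data", schemaTB, 0, 1000)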

Getting Started

Loading to Memory and Importing to Database

Like loadText, loadHDF5 imports data as in-memory tables, but requires you to specify which dataset to import from the HDF5 file.

The following example demonstrates how to import the “data” dataset from demo1.h5.

dataFilePath = "./hdf5file/demo1.h5"
tmpTB = hdf5::loadHDF5(dataFilePath, "data");
select * from tmpTB

Output:

Call the schema function to view the table schema (column names, data types, etc):

tmpTB.schema().colDefs;

Output:

Note:

For HDF5 files saved by pandas, use loadPandasHDF5.
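
For instance, a minimal sketch of loading a pandas-saved file (the file name and group key here are hypothetical):

pdTB = hdf5::loadPandasHDF5("./hdf5file/pandas_demo.h5", "/data")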

To import the data into a database, use the tableInsert or append! function. Ensure that the target database and table are already created before proceeding.

Script for creating the database and table:

create database "dfs://demoDB" partitioned by HASH([SYMBOL,25])
create table "dfs://demoDB"."data"(
    time LONG,
    customerId SYMBOL,
    temp INT,
    amp DOUBLE
)
partitioned by customerId
Note:

In the original data, the time column is of type LONG. To demonstrate how to import data into the database, the time column has been set to LONG during the database and table creation. Detailed instructions on converting the temporal type can be found in the section below.

Then insert the table into the database:

dfsTable = loadTable("dfs://demoDB", "data")
tableInsert(dfsTable, tmpTB)
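
Alternatively, the append! function mentioned above achieves the same result:

append!(dfsTable, tmpTB)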

Importing to Database Directly

In addition to explicitly loading data into an in-memory table before writing it to a database, DolphinDB provides the loadHDF5Ex function, which combines loading, cleaning, and storage into one streamlined process. Ensure that the target database and table are already created before proceeding.

The following example demonstrates how to import demo1.h5 to table “data” in database “demoDB”. To start with an empty database, use the truncate function to remove existing rows.

// remove existing rows
truncate("dfs://demoDB","data")
// import to database
dataFilePath = "./hdf5file/demo1.h5"
hdf5::loadHDF5Ex(database("dfs://demoDB"), "data", "customerId", dataFilePath, "data");

Check the results:

dfsTable = loadTable("dfs://demoDB","data")
select * from dfsTable 

Output:

Unlike loadHDF5, loadHDF5Ex requires users to specify database details, including the database name, table name, and partitioning columns. In addition, loadHDF5Ex offers a transform parameter to clean or process data directly during database import. For more information about the transform parameter, see Data Cleaning and Preprocessing.

Handling Column Names and Data Types

The HDF5 plugin automatically retrieves data types from file attributes during import and converts them to the corresponding DolphinDB data types. Details of supported HDF5 data types and their conversion rules can be found in Plugins > HDF5 > Data Type Mappings.

Note:

While automatic detection is generally reliable, occasional manual adjustments by modifying the schema parameter may be needed.

Extracting HDF5 File Schema

Users can use the extractHDF5Schema function to preview an HDF5 file's schema before loading the data, including details such as column names, data types, and the number of columns. This helps users identify potential issues such as mismatched data types and modify the schema accordingly for proper data loading.

For example, use extractHDF5Schema to get the schema of the sample file hdf5Sample.h5:

dataFilePath = "./hdf5file/hdf5Sample.h5"
hdf5::extractHDF5Schema(dataFilePath,"sample")

Output:

Specifying Column Names and Types

If the column names or data types automatically inferred by the system do not meet expectations, we can specify them by modifying the schema table generated by the extractHDF5Schema function or by creating a schema table directly.
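
For example, a schema table can be constructed directly with the table function. The sketch below mirrors the columns of demo1.h5; adjust the names and types for your own file.

schemaTB = table(`time`customerId`temp`amp as name, `LONG`SYMBOL`INT`DOUBLE as type)
tmpTB = hdf5::loadHDF5("./hdf5file/demo1.h5", "data", schemaTB)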

Specifying Column Names

For example, use the following script to modify the schema and rename column “symbol” to “securityID”:

dataFilePath = "./hdf5file/hdf5Sample.h5"
schemaTB = hdf5::extractHDF5Schema(dataFilePath,"sample")
update schemaTB set name = ["securityID","exchange","cycle","tradingDay","date","time","open","high","low","close","volume","turnover","unixTime"]

Use the loadHDF5 function to import the HDF5 file, specifying the schema parameter as the modified schemaTB.

tmpTB = hdf5::loadHDF5(dataFilePath, "sample", schemaTB);

Call the schema function to view the table schema. Column “symbol” has been renamed to “securityID”.

tmpTB.schema().colDefs;

Output:

Specifying Data Type

The time columns in hdf5Sample.h5 are stored as timestamps of type INT. However, the expected data type for columns tradingDay and date is DATE, and for unixTime it is TIMESTAMP. To convert the data types of the time columns:

dataFilePath = "./hdf5file/hdf5Sample.h5"
schemaTB = hdf5::extractHDF5Schema(dataFilePath,"sample")
update schemaTB set type="DATE" where name="tradingDay" or name="date"
update schemaTB set type="TIMESTAMP" where name="unixTime"

Use the loadHDF5 function to import the HDF5 file, specifying the schema parameter as schemaTB.

tmpTB = hdf5::loadHDF5(dataFilePath, "sample", schemaTB);

Call the schema function to view the table schema:

tmpTB.schema().colDefs;

The data types of columns tradingDay and date have been converted to DATE, and unixTime to TIMESTAMP.

Importing Specific Rows

The loadHDF5 method supports importing specific rows by specifying the startRow and rowNum parameters.

To import 100 rows starting from row 10:

tmpTB = hdf5::loadHDF5(dataFilePath, "sample", schemaTB, 9, 100);
Note: Since row indexing starts at 0, setting startRow=9 will begin reading from the 10th row of the dataset.

Check the total rows:

select count(*) from tmpTB
/*
count
-----
100  
*/

Check the first 5 rows:

select top 5 * from tmpTB;

Output:

Data Cleaning and Preprocessing

DolphinDB provides various built-in functions for common data cleaning tasks, including handling missing values, duplicate values, and outliers, as well as performing data normalization and standardization. This section focuses on several common scenarios.

The HDF5 plugin provides two methods for importing data: the loadHDF5 function and the loadHDF5Ex function (see Getting Started). While both methods support basic data cleaning, loadHDF5Ex introduces a distinctive transform parameter that enables data cleaning and preprocessing during data import. This parameter accepts a unary function that processes the loaded table and returns a cleaned dataset, which is then written directly to the distributed database without intermediate memory storage. Our focus will be on practical applications of the transform parameter in loadHDF5Ex.

Data Type Conversion of Temporal Types

Time columns are typically stored as type INT or LONG in HDF5 files. Directly loading time columns of type INT leads to incorrect results. For example, 9:31:00.000 is stored as 93100000 in hdf5Sample.h5. In such cases, one approach is to use the transform parameter of the loadHDF5Ex function to wrap the replaceColumn! function for data type conversion.
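
As an illustration, the parsing step used below can be tried on a single value (93100000 from the example above):

x = 93100000
temporalParse(format(x, "000000000"), "HHmmssSSS")   // expected output: 09:31:00.000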

First, create a distributed database and a table.

dataFilePath = "./hdf5file/hdf5Sample.h5"
dbPath = "dfs://testDB"
db = database(dbPath,VALUE,2018.01.02..2018.01.30)
schemaTB = hdf5::extractHDF5Schema(dataFilePath,"sample")
update schemaTB set type="DATE" where name="tradingDay" or name="date"
update schemaTB set type="TIME" where name="time"
update schemaTB set type="TIMESTAMP" where name="unixTime"
tb = table(1:0,schemaTB.name,schemaTB.type)
pt = db.createPartitionedTable(tb,`tb1,`date);

Define a user-defined function i2t to encapsulate the data processing logic, converting the time column to TIME type.

def i2t(mutable t){
    // parse the INT time column (e.g., 93100000) into a TIME value (09:31:00.000)
    replaceColumn!(table=t, colName=`time, newCol=t.time.format("000000000").string().temporalParse("HHmmssSSS"))
    return t
}
Note:

When processing data in a user-defined function, use in-place modifications (functions whose names end with !) where possible for optimal performance.

Call the loadHDF5Ex function and assign the function i2t to the transform parameter. The system executes i2t on the imported data and then saves the result to the database.

schemaTB = hdf5::extractHDF5Schema(dataFilePath,"sample")
update schemaTB set type="DATE" where name="tradingDay" or name="date"
update schemaTB set type="TIMESTAMP" where name="unixTime"
hdf5::loadHDF5Ex(database("dfs://testDB"), "tb1", "date", dataFilePath,"sample", schemaTB, , , i2t)

Check the results:

select top 5 * from loadTable("dfs://testDB", "tb1")

Column 'time' is stored as data type TIME instead of INT.

Handling Null Values

Efficient handling of missing values is essential when working with real-world datasets. DolphinDB offers built-in functions like bfill, ffill, interpolate, and nullFill, enabling quick and effective handling of incomplete data in tables. Since the transform parameter only takes a unary function, we can use partial application to convert a DolphinDB built-in function with multiple parameters into a function with only one parameter. For example, to fill null values with 0, the following partial application can be passed as the transform parameter:

nullFill!{, 0}
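
A sketch of the complete call might look like this, assuming the demoDB/data table created earlier is the target and reusing the earlier file path:

dataFilePath = "./hdf5file/demo1.h5"
hdf5::loadHDF5Ex(database("dfs://demoDB"), "data", "customerId", dataFilePath, "data", , , , nullFill!{, 0})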

Importing Multiple HDF5 Files or Datasets in Parallel

An HDF5 file may contain multiple datasets, so DolphinDB supports parallel import of both multiple HDF5 files and multiple datasets within one file. Since importing multiple files in parallel works the same way as for text files, this section focuses on parallel dataset import.

First, use lsTable to list all datasets in the file.

dataFilePath = "./hdf5file/demo1.h5"
datasets = hdf5::lsTable(dataFilePath).tableName
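
The example below writes into a database named dfs://demoDB_multi, which must already exist. A minimal sketch of creating it, assuming the same columns as demoDB with the time column stored as TIMESTAMP to match the conversion in writeData:

create database "dfs://demoDB_multi" partitioned by HASH([SYMBOL, 25])
create table "dfs://demoDB_multi"."data"(
    time TIMESTAMP,
    customerId SYMBOL,
    temp INT,
    amp DOUBLE
)
partitioned by customerId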

Define a data import function and use the submitJob function to assign a thread for each dataset, enabling background batch data import:

def writeData(dbName, tbName, file, datasetName){   // write data to database
    t = hdf5::loadHDF5(file, datasetName)
    replaceColumn!(table=t, colName=`time, newCol=timestamp(t.time))   // convert data type
    tableInsert(loadTable(dbName, tbName), t)
}
dbName = "dfs://demoDB_multi"
tbName = "data"
for(datasetName in datasets){
    jobName = "loadHDF5" + datasetName
    submitJob(jobName, jobName, writeData{dbName, tbName}, dataFilePath, datasetName)
}

Use a select statement to calculate the time taken for the parallel batch dataset import:

select max(endTime) - min(startTime) from getRecentJobs(2);
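
To confirm that the background jobs finished without errors, the job list can also be inspected; an illustrative check:

select jobId, startTime, endTime, errorMsg from getRecentJobs() where jobId like "loadHDF5%"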

Importing a Single Large HDF5 File

To avoid memory overflow (OOM) when importing large files, use the HDF5DS function to split the file into smaller chunks for import. Unlike textChunkDS, HDF5DS allows setting the exact number of data sources to evenly divide the dataset.

First, split the sample data into two equal parts by setting dsNum to 2.

dataFilePath = "./hdf5file/demo1.h5"
datasetName = "data"
ds = hdf5::HDF5DS(dataFilePath, datasetName, , 2)

Then, use the mr function to write the data into the database:

truncate(dbUrl = "dfs://demoDB",tableName = "data")
pt = loadTable("dfs://demoDB", "data")
mr(ds = ds, mapFunc = append!{pt}, parallel = false)
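
The result can then be verified; the row count should match the original dataset:

select count(*) from loadTable("dfs://demoDB", "data")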

Importing Datasets of Composite Datatypes

DolphinDB supports importing complex HDF5 dataset structures, specifically with:

  • Compound type (H5T_COMPOUND)
  • Array type (H5T_ARRAY)
  • Combination of compound and array types

These composite datatypes require specialized handling during the import process to ensure accurate data representation in DolphinDB.

The following examples will demonstrate the import mechanisms for these complex HDF5 dataset structures.

Compound type

Given an HDF5 dataset containing compound type data:

        column 1                   column 2
row 1   struct{a:1 b:2 c:3.7}      struct{a:12 b:22 c:32.7}
row 2   struct{a:11 b:21 c:31.7}   struct{a:13 b:23 c:33.7}

When importing a compound type HDF5 dataset, DolphinDB converts the structure into a table where each compound type element is translated into a row. The compound type keys serve as column names, and their values populate the table using DolphinDB's type conversion rules.

The resulting DolphinDB table is structured as follows:

a b c
1 2 3.7
11 21 31.7
12 22 32.7
13 23 33.7
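
From the user's perspective, loading such a dataset presumably uses the same loadHDF5 call shown earlier; a hypothetical sketch (file path and dataset name are made up for illustration):

t = hdf5::loadHDF5("./hdf5file/compound.h5", "compoundData")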

Array Type

Given an HDF5 dataset containing array type data:

        column 1        column 2
row 1   array(1,2,3)    array(4,5,6)
row 2   array(8,9,10)   array(15,16,17)

When importing an array type HDF5 dataset, DolphinDB converts the structure into a table where each array is translated into a row. Elements with the same index across arrays are grouped into columns, with column names following the pattern array_index (e.g., array_1).

The resulting DolphinDB table is structured as follows:

array_1 array_2 array_3
1 2 3
4 5 6
8 9 10
15 16 17

Combination of Compound and Array Types

Given an HDF5 dataset containing complex combinations of datatypes:

        column 1                                          column 2
row 1   struct{a:array(1,2,3) b:2 c:struct{d:"abc"}}      struct{a:array(7,8,9) b:5 c:struct{d:"def"}}
row 2   struct{a:array(11,21,31) b:0 c:struct{d:"opq"}}   struct{a:array(51,52,53) b:24 c:struct{d:"hjk"}}

For these complex datatype combinations, a prefix convention is used for column naming: A for arrays and C for compound datatypes.

Each nested element is parsed according to its specific type, maintaining consistency with the conventions for arrays and compound datatypes described above.

The resulting DolphinDB table is structured as follows:

Aa_1 Aa_2 Aa_3 b Cc_d
1 2 3 2 abc
7 8 9 5 def
11 21 31 0 opq
51 52 53 24 hjk