10 Commonly Overlooked Details in Creating Databases and Tables
When creating databases and tables, users often lack clarity on how to choose partition columns, partition types, and sort columns. Neglecting these details may lead to issues such as poor query performance, data loss, or insertion errors. Properly configuring partition columns, partition types, and sort columns can significantly improve query performance and CPU utilization, and reduce memory consumption.
This tutorial introduces the details involved in database and table creation to help users optimize their scripts.
1. Introduction to Partitioning and Storage Structures
1.1 Partition Types
Partitioning is one of the most important techniques to manage data and improve the performance of the distributed file system. It makes tables with large datasets more manageable by breaking them into smaller chunks. A well-designed partitioning strategy ensures that only the necessary data is scanned during queries, thereby reducing the system response latency.
A suitable partition type can evenly divide data based on business characteristics. DolphinDB supports the following partition types: RANGE, HASH, VALUE, LIST and COMPO.
- Range partitioning creates a partition for each defined range interval.
- Hash partitioning creates a specified number of partitions using a hash function on the partitioning column.
- Value partitioning creates partitions based on specific values in the partitioning column, typically applied to date and time columns.
- List partitioning partitions data based on specified sets of enum values, which is more flexible than value partitioning.
- Composite partitioning is suitable for large-scale datasets where SQL WHERE or GROUP BY clauses frequently involve multiple columns. It combines two or three dimensions of partition types among RANGE, HASH, VALUE, and LIST.
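For reference, here are minimal sketches of each declaration (the database paths and schemes are illustrative):
//RANGE: partitions [0,5) and [5,10)
dbR = database("dfs://rangeDemo", RANGE, 0 5 10)
//HASH: 10 buckets computed from an integer column
dbH = database("dfs://hashDemo", HASH, [INT, 10])
//VALUE: one partition per trading date
dbV = database("dfs://valueDemo", VALUE, 2020.01.01..2020.01.31)
//LIST: explicit groups of enum values
dbL = database("dfs://listDemo", LIST, [`IBM`ORCL`MSFT, `GOOG`FB])
//COMPO: VALUE on date combined with HASH on symbol
dbC = database("dfs://compoDemo", COMPO, [database(, VALUE, 2020.01.01..2020.01.31), database(, HASH, [SYMBOL, 10])])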
1.2 Partitioning and Data I/O
In DolphinDB, data is managed at the partition level: both writes and reads are handled per partition. Data is written to disk partition by partition, and concurrent writes to the same partition are not allowed. For reads, DolphinDB performs partition pruning based on query conditions, narrowing down the relevant partitions and scanning only those that satisfy the filter. The partition columns and the granularity of partitioning therefore have a significant impact on query efficiency.
1.3 Storage Structures
TSDB and OLAP storage engines differ in their storage structures:
- The OLAP engine stores each column as a columnar file. Data is stored in the same order as it is written, ensuring highly efficient data writing.
- The TSDB engine is based on the LSM-tree (Log-Structured Merge Tree) model and uses a PAX (Partition Attributes Across) storage format. Data in each partition is written to one or more level files. Within each level file, data is sorted by the specified columns, based on which indexes are constructed.
Sort columns are a structure specific to the TSDB engine and play a crucial role in storage and retrieval. They serve two main purposes: indexing data within partitions and deduplicating records. Sort columns are specified with the sortColumns parameter of the createPartitionedTable function. When multiple sort columns are specified, the last must be a temporal or integer column, and all preceding columns form the sort key. If there is only one sort column, that column is the sort key. Records sharing the same sort key are sorted by the last column and stored together in blocks.
The sort key allows queries to quickly locate relevant blocks, significantly enhancing retrieval speed. It also ensures that data within transactions and level files is ordered and deduplicated.
The in-partition indexes include sort keys, their entry information and sparse indexes (such as maximum, minimum, and count statistics). When the sort key entries exceed recommended limits, the indexes can grow significantly, which can adversely impact write performance, query efficiency, and memory usage. To mitigate this, the dimensionality of indexes can be reduced by hashing the sort keys.
2. Ten Commonly Overlooked Details
2.1 Too Few Records per Sort Key
The sortColumns parameter must be specified when creating a table in a TSDB database. It is generally recommended that each sort key correspond to more than 10,000 rows. Otherwise, the following issues may occur:
- More index entries consume more memory.
- Fragmented data storage degrades read and write performance.
- Smaller block size reduces the compression ratio.
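A quick sanity check for this guideline is to divide the expected rows per partition by the number of distinct sort-key values; the figures below mirror the stock-data example used in this section:
//rough check for a date-partitioned trades table
rowsPerPartition = 20000000        //one trading day of records
sortKeyEntries = 5000              //distinct SecurityID values
rowsPerPartition / sortKeyEntries  //4,000 rows per sort key, below the 10,000-row guideline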
2.1.1 Incorrect Example
Stock data (assuming 5,000 stocks, 20 million records per day, and a daily data volume of 800MB) is value-partitioned by trading date.
Scenario 1: Specifying TradeTime as sortColumns leads to an excessive number of index keys, each with very little data.
When only one column is specified, the number of unique values of that column within the partition determines the number of index keys.
In the following script, the number of index keys per partition equals the number of distinct TradeTime values. Even at second-level granularity, there are 4 * 3600 = 14,400 unique values per trading day. In reality, TradeTime has millisecond precision, so the number of index keys is far greater than 14,400.
db=database("dfs://testDB1",VALUE, 2020.01.01..2021.01.01,engine="TSDB")
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt1, partitionColumns=`TradeDate, sortColumns=`TradeTime, keepDuplicates=ALL)
Scenario 2: Specifying SecurityID and TradeTime as sortColumns results in a number of index keys equal to the number of stock symbols.
When multiple columns are specified for sortColumns, the last column must be either a temporal column or an integer column, and the preceding columns are used as the index keys.
In the following script, the number of index keys per partition equals the number of distinct SecurityID values. Assuming there are 5,000 stocks, the number of index keys is 5,000, with each index key corresponding to 4,000 rows.
db=database("dfs://testDB1",VALUE, 2020.01.01..2021.01.01,engine="TSDB")
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt2, partitionColumns=`TradeDate, sortColumns=`SecurityID`TradeTime, keepDuplicates=ALL)
2.1.2 Solution
Scenario 1: Avoid specifying a time column, or any column with a large number of unique values, as the only sort column.
In the following script, TradeTime is replaced with SecurityID as sortColumns. This results in 5,000 index keys per partition, with each index key corresponding to 4,000 rows.
db=database("dfs://testDB1",VALUE, 2020.01.01..2021.01.01,engine="TSDB")
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt3, partitionColumns=`TradeDate, sortColumns=`SecurityID, keepDuplicates=ALL)
To further optimize, apply dimensionality reduction by hashing the sort keys. The number of index keys per partition then drops to 500, with each index key corresponding to 40,000 rows.
db=database("dfs://testDB1",VALUE, 2020.01.01..2021.01.01,engine="TSDB")
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt4, partitionColumns=`TradeDate, sortColumns=`SecurityID, keepDuplicates=ALL, sortKeyMappingFunction=[hashBucket{,500}])
Scenario 2: When multiple columns are specified for sortColumns, specify sortKeyMappingFunction for dimensionality reduction if the number of sort key entries exceeds 1,000. Dimensionality reduction improves range query performance but may slightly reduce point query performance. The number of index keys equals the product of the numbers of distinct values of each sort column after dimensionality reduction or deduplication.
In the following script, SecurityID and TradeTime are specified as sortColumns. The number of index keys per partition is reduced to 500, with each index key corresponding to 40,000 rows.
db=database("dfs://testDB1",VALUE, 2020.01.01..2021.01.01,engine="TSDB")
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt5, partitionColumns=`TradeDate, sortColumns=`SecurityID`TradeTime, keepDuplicates=ALL,sortKeyMappingFunction=[hashBucket{,500}])
2.1.3 Performance Test Results
The five table creation strategies described above were evaluated by importing 10 days of data (20 million records per day) and measuring their query performance.
Range query:
select avg(TradePrice) from pt where TradeDate between 2020.01.04 and 2020.01.05
Sort Columns | Number of Index Keys per Partition | Average Execution Time (ms) | Max Execution Time (ms) |
---|---|---|---|
TradeTime | 28,800 | 167.9 | 215.1 |
SecurityID, TradeTime | 5,000 | 46.3 | 68.6 |
SecurityID | 5,000 | 47.7 | 66.9 |
SecurityID (Hash Dimensionality Reduction) | 500 | 6.6 | 8.5 |
SecurityID (Hash Dimensionality Reduction), TradeTime | 500 | 6.7 | 7.9 |
Point query 1 (TradeDate + SecurityID):
select avg(TradePrice) from pt where TradeDate = 2020.01.04 and SecurityID=`600
Sort Columns | Number of Index Keys per Partition | Average Execution Time (ms) | Max Execution Time (ms) |
---|---|---|---|
TradeTime | 28,800 | 2002.6 | 2562.8 |
SecurityID, TradeTime | 5,000 | 4.0 | 4.7 |
SecurityID | 5,000 | 3.4 | 4.7 |
SecurityID (Hash Dimensionality Reduction) | 500 | 6.5 | 8.5 |
SecurityID (Hash Dimensionality Reduction), TradeTime | 500 | 7.7 | 8.3 |
Point query 2 (TradeDate + SecurityID + TradeTime):
select TradePrice from pt where TradeDate = 2020.01.04 and SecurityID=`3740 and TradeTime =09:30:00.000
Sort Columns | Number of Index Keys per Partition | Average Execution Time (ms) | Max Execution Time (ms) |
---|---|---|---|
TradeTime | 28,800 | 1.4 | 2.1 |
SecurityID, TradeTime | 5,000 | 1.5 | 2.6 |
SecurityID | 5,000 | 1.4 | 2.3 |
SecurityID (Hash Dimensionality Reduction) | 500 | 5.0 | 7.3 |
SecurityID (Hash Dimensionality Reduction), TradeTime | 500 | 6.2 | 8.0 |
Conclusion: If sortColumns is set improperly, or the WHERE clause references no sort column, query performance degrades significantly. Range query performance also suffers when the data under each index key is too sparse; hash dimensionality reduction greatly improves it, though point query performance may slightly decrease.
2.2 String Fields with Special Characters as Value-Partition Columns
When using a STRING column as a value-partition column, appended data must not contain special characters such as spaces, '\n', '\r', or '\t'. If the partition column contains such characters, data insertion will fail.
2.2.1 Incorrect Example
In the following script, special characters in the value-partitioned column cause the insertion error "A string or symbol value-partitioning column can't contain any invisible character".
n=1000
ID=take(["1","2","3"], n);
x=rand(1.0, n);
t=table(ID, x);
//create a DFS table value-partitioned by ID
db=database(directory="dfs://valuedb", partitionType=VALUE, partitionScheme=["1"])
pt = db.createPartitionedTable(t, `pt, `ID);
pt.append!(t);
n=1000
//IDs containing spaces trigger the invisible-character error on append
ID=take(["1 12","2.12","3 12"], n);
x=rand(1.0, n);
t=table(ID, x);
pt.append!(t);
2.2.2 Solution
Remove these invisible characters before insertion, as shown in the sketch below. If the characters carry meaning and must be kept, use hash partitioning for this column instead. Be sure to configure the number of hash partitions based on your data volume to avoid overly fine or coarse granularity.
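For illustration, a sketch of both options (the hashdb path and the 10-bucket count are illustrative; strReplace performs the cleanup):
n = 1000
ID = take(["1 12","2.12","3 12"], n)
x = rand(1.0, n)
//option 1: strip spaces and control characters before appending to a value-partitioned table
cleanID = strReplace(strReplace(strReplace(ID, " ", ""), "\t", ""), "\n", "")
//option 2: keep the characters and hash-partition the column instead
dbh = database(directory="dfs://hashdb", partitionType=HASH, partitionScheme=[STRING, 10])
pth = dbh.createPartitionedTable(table(ID, x), `pt, `ID)
pth.append!(table(ID, x))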
2.3 Data Loss Due to Improper Range Partition Settings
When using range partitioning, records with values outside the defined partition ranges are silently discarded. The tableInsert function returns the number of inserted records, which can be used to check whether any data was lost at the range boundaries.
2.3.1 Incorrect Example
In the following script, the ID values range from 0 to 19, but the partition scheme only defines the ranges [0, 5) and [5, 10):
n=1000000
x=rand(1.0, n)
//ID values from 0 to 19
ID=rand(20, n)
t=table(ID, x);
//Partition ranges: 0-5, 5-10
db=database(directory="dfs://rangedb", partitionType=RANGE, partitionScheme=0 5 10)
pt = db.createPartitionedTable(t, `pt, `ID);
//Only data in range [0, 10) will be inserted
tableInsert(pt,t);
The result shows only 499,972 records inserted, far fewer than 1,000,000: roughly half the values fall in [10, 20) and lie outside the defined ranges.
2.3.2 Solution
To avoid data loss, define a partition scheme that fully covers the expected data range. Alternatively, use the addRangePartitions function to add new ranges later, as sketched below.
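A minimal sketch of both safeguards (the boundary values are illustrative; consult the addRangePartitions documentation for the exact boundary semantics):
//compare the inserted count against the source row count
cnt = tableInsert(pt, t)
if(cnt != t.size()) print("dropped " + string(t.size() - cnt) + " records outside the partition ranges")
//extend the scheme past the current maximum boundary of 10, adding ranges [10,15) and [15,21)
addRangePartitions(database("dfs://rangedb"), 10 15 21)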
2.4 Excessively Fine Partition Granularity
Improper partitioning strategy can result in too many small partitions, which may lead to:
- Errors such as "The number of partitions relevant to the query is too large" when querying wide ranges.
- Excessive use of the level file index cache, which slows down all queries on TSDB databases.
- High scheduling and communication overhead between the controller and data nodes due to too many subtasks.
- Inefficient disk access (e.g., small file reads/writes), increasing system load.
- Exhaustion of controller memory due to large amounts of partition metadata.
Recommendation: Each partition should contain at least 100MB of uncompressed data.
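A back-of-the-envelope check using the figures from the example below:
//3GB of uncompressed daily data split across 5,000 value partitions per day
dailyMB = 3 * 1024.0
partitionsPerDay = 5000
dailyMB / partitionsPerDay   //about 0.6MB per partition, far below the 100MB guideline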
2.4.1 Incorrect Example
Daily stock quote data (50 million records per day) is value-partitioned by TradeDate and then value-partitioned by SecurityID. Assuming 5,000 stocks, this results in more than 5,000 partitions per day, each only approximately 0.6MB.
db1 = database(, VALUE, 2020.01.01..2021.01.01)
db2 = database(, VALUE, `a`b)
db = database(directory="dfs://testDB1", partitionType=COMPO, partitionScheme=[db1, db2], engine="TSDB")
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`TradeDate`SecurityID, sortColumns=`SecurityID`TradeTime, keepDuplicates=ALL)
The above script may cause the following problems:
- The maximum number of partitions a single query can search defaults to 65,536. Querying 14 days triggers "The number of partitions relevant to the query is too large" (5,000 per day × 14 = 70,000 partitions).
- The level file index cache (default 5% of system memory) fills up quickly with too many partitions, degrading TSDB query performance.
2.4.2 Solution
Instead of value-partitioning by SecurityID, use hash partitioning with a partition count chosen so that each partition holds approximately 100MB–1GB. In this case, there are only 25 partitions per day, the daily in-memory data volume is 3GB, and each partition is approximately 120MB.
db1 = database(, VALUE, 2020.01.01..2021.01.01)
db2 = database(, HASH, [SYMBOL, 25])
db = database(directory="dfs://testDB2", partitionType=COMPO, partitionScheme=[db1, db2], engine="TSDB")
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`TradeDate`SecurityID, sortColumns=`SecurityID`TradeTime, keepDuplicates=ALL)
2.4.3 Performance Test Results
Both schemas were tested using 10 days of data (50 million records per day):
select avg(TradePrice) from pt where TradeDate=2020.01.04
Scheme | Average Execution Time (ms) | Max Execution Time (ms) |
---|---|---|
Date (VALUE) + SecurityID (VALUE) | 214.4 | 378.1 |
Date (VALUE) + SecurityID (HASH) | 13.6 | 20.5 |
Conclusion: Overly fine-grained partitions significantly degrade performance. Choose appropriate granularity during schema design.
2.5 Excessively Coarse Partition Granularity
If partition type or fields are poorly chosen, partition granularity can be too coarse. This may lead to:
- Memory shortage during multi-threaded queries.
- Frequent memory-disk swaps, degrading performance.
- Failure to utilize DolphinDB’s distributed architecture, turning parallel tasks into sequential ones.
2.5.1 Incorrect Example
In the following script, 50 million stock quote records per day (approximately 3GB per day) are partitioned by TradeDate only:
//Partitioned by TradeDate
db= database("dfs://testDB1", VALUE, 2020.01.01..2021.01.01)
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
//One partition per day, each with a 3GB volume.
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`TradeDate)
2.5.2 Solution
Add hash partitioning on SecurityID. There are only 25 partitions per day, and each partition is approximately 120MB.
db1 = database(, VALUE, 2020.01.01..2021.01.01)
db2 = database(, HASH, [SYMBOL, 25])
db = database(directory="dfs://testDB2", partitionType=COMPO, partitionScheme=[db1, db2])
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`TradeDate`SecurityID)
2.5.3 Performance Test Results
Both schemas were tested using 10 days of data (50 million records per day):
select avg(TradePrice) from pt where SecurityID=`600
Scheme | Average Execution Time (ms) | Max Execution Time (ms) |
---|---|---|
Date (VALUE) | 560.4 | 683.1 |
Date (VALUE) + SecurityID (HASH) | 31.2 | 40.8 |
Conclusion: Coarse partitions reduce query efficiency. Use composite partitioning to achieve optimal granularity.
2.6 Using Dimension Tables for Large Datasets
Dimension tables are fully loaded into memory and are not automatically released (prior to version 2.00.11). If a large dataset is stored in a dimension table, query performance will degrade as the data volume increases, and memory usage will also increase.
2.6.1 Incorrect Example
Using a dimension table to store more than 200,000 records or datasets with uncertain volume can cause significant performance issues.
Stock data generates 20 million records per day (800MB per day in memory). If this data is stored in a dimension table, it will accumulate to hundreds of millions of rows in just a few days. Reading from the table will load all data into memory, quickly exhausting system resources.
db= database("dfs://testDB1", VALUE, 2020.01.01..2021.01.01)
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createTable(tbSchema,`dt)
2.6.2 Solution
Instead, use a partitioned table that is value-partitioned by trade date and hash-partitioned into 25 buckets on SecurityID:
db1 = database(, VALUE, 2020.01.01..2021.01.01)
db2 = database(, HASH, [SYMBOL, 25])
db = database(directory="dfs://testDB2", partitionType=COMPO, partitionScheme=[db1, db2])
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`TradeDate`SecurityID)
2.6.3 Performance Test Results
Both schemas were tested using one day of data (20 million rows):
select avg(TradePrice) from tb where SecurityID=`600
Table Type | Average Execution Time (ms) | Max Execution Time (ms) |
---|---|---|
Dimension table | 1385.5 | 1545.2 |
Date (VALUE) + SecurityID (HASH) | 36.0 | 60.3 |
Conclusion: Use dimension tables only when the dataset is small and well-scoped. For large datasets, always use partitioned tables to ensure performance.
2.7 Poor Choice of Partition Columns Slows Queries
If frequently queried columns are not specified as partition columns, DolphinDB cannot perform partition pruning, resulting in full partition scans and degraded performance.
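For instance, pruning typically applies only when the partition column appears directly in the filter; wrapping it in a function or expression generally forces a scan of all partitions. A sketch against a table partitioned by TradeDate:
//prunes: the partition column is compared directly
select avg(TradePrice) from pt where TradeDate = 2020.01.04
//typically cannot prune: a function is applied to the partition column
select avg(TradePrice) from pt where month(TradeDate) = 2020.01M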
2.7.1 Incorrect Example
In the following script, the DFS table is partitioned only by SecurityID, while queries frequently filter on both TradeDate and SecurityID. This causes full partition scans for TradeDate queries and partitions that keep growing over time:
db = database("dfs://testDB1",HASH, [SYMBOL, 25])
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`SecurityID)
2.7.2 Solution
Use composite partitioning by both TradeDate and SecurityID:
db1 = database(, VALUE, 2020.01.01..2021.01.01)
db2 = database(, HASH, [SYMBOL, 25])
db = database("dfs://testDB2", partitionType=COMPO, partitionScheme=[db1, db2])
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`TradeDate`SecurityID)
2.7.3 Performance Test Results
Both schemas were tested using 10 days of data (20 million records per day):
select avg(TradePrice) from pt where SecurityID=`600 and TradeDate=2020.01.04
Scheme | Average Execution Time (ms) | Max Execution Time (ms) |
---|---|---|
SecurityID (HASH) | 174.5 | 193.1 |
Date (VALUE) + SecurityID (HASH) | 38.1 | 50.4 |
Conclusion: Choose partition columns based on query filters to improve efficiency.
2.8 Poor Partition Design Causes Write Conflicts
DolphinDB imposes some restrictions on transactions:
- A transaction cannot contain both read and write operations.
- Only one transaction can write to a partition at any given time. Writes to the same partition by multiple concurrent transactions can lead to conflicts.
2.8.1 Incorrect Example
In the following script, the DFS table is partitioned only by SecurityID, while data is written concurrently by TradeDate:
db = database("dfs://testDB",HASH, [SYMBOL, 25])
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`SecurityID)
//concurrent writes by TradeDate then fail with the error "xxx has been owned by transaction xx"
2.8.2 Solution
- Partition the table using the fields by which writes are parallelized (e.g., TradeDate; see the sketch after this list).
- Align the parallel writing program with the partition scheme.
- Use API functions that support partition-aware writing (e.g., PartitionedTableAppender in the C++/Python APIs).
Conclusion: Avoid concurrent writes to the same partition by adjusting either the partition scheme or the data ingestion process.
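A server-side sketch of the first approach (sourceTb, the database path, and the job IDs are illustrative): each background job writes a single TradeDate, so no two concurrent transactions touch the same partition.
//write each day's data in its own job; jobs never share a TradeDate partition
def writeOneDay(dbPath, tbName, data){
    loadTable(dbPath, tbName).append!(data)
}
for(d in 2020.01.01..2020.01.05){
    data = select * from sourceTb where TradeDate = d   //sourceTb: an assumed in-memory source table
    submitJob("w" + strReplace(string(d), ".", ""), "write " + string(d), writeOneDay, "dfs://testDB2", "pt", data)
}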
2.9 Not Partitioning Time-Growing Tables by a Date Field
For tables where data volume grows over time, not partitioning by date will lead to overly large partitions and degraded query performance.
2.9.1 Incorrect Example
In the following script, the DFS table is partitioned only by SecurityID, while the table grows daily:
db = database("dfs://testDB1",HASH, [SYMBOL, 25])
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`SecurityID)
This leads to uncontrolled partition size growth and slow queries on TradeDate.
2.9.2 Solution
Partition the table by both TradeDate and SecurityID:
db1 = database(, VALUE, 2020.01.01..2021.01.01)
db2 = database(, HASH, [SYMBOL, 25])
db = database("dfs://testDB2", partitionType=COMPO, partitionScheme=[db1, db2])
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`TradeDate`SecurityID)
2.9.3 Performance Test Results
Both schemas were tested using 10 days of data (20 million records per day):
select avg(TradePrice) from pt where SecurityID=`600 and TradeDate=2020.01.04
Scheme | Average Execution Time (ms) | Max Execution Time (ms) |
---|---|---|
SecurityID (HASH) | 174.5 | 193.1 |
Date (VALUE) + SecurityID (HASH) | 38.1 | 50.4 |
Conclusion: For time-growing tables, always use a date column as a partition column to prevent unbounded growth and improve query performance.
2.10 Using the OLAP Engine to Store Wide Format Tables
Since the OLAP engine uses columnar storage, each column in a partition is stored as a separate file. Storing wide format tables (e.g., more than 100 columns) results in too many files, causing read/write bottlenecks. TSDB's PAX storage significantly reduces the number of physical files for wide format tables.
2.10.1 Incorrect Example
A table with 203 columns (e.g., 200 factor columns) using the default OLAP engine:
db= database("dfs://testDB1", VALUE, 2020.01.01..2021.01.01)
//use OLAP engine (by default) for a wide format table with 203 columns
colName = `ID`Date`Time
colName.append!(take("factor"+string(0..200),200))
colType = `SYMBOL`DATE`TIME
colType.append!(take(["DOUBLE"],200))
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`Date)
2.10.2 Solution
Use the TSDB engine to store wide format tables:
db= database("dfs://testDB2", VALUE, 2020.01.01..2021.01.01,engine="TSDB")
//use TSDB engine for a wide format table with 203 columns
colName = `ID`Date`Time
colName.append!(take("factor"+string(0..200),200))
colType = `SYMBOL`DATE`TIME
colType.append!(take(["DOUBLE"],200))
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`Date,sortColumns=`ID`Time)
2.10.3 Performance Test Results
Both schemas were tested using 10 days of data (2 million records per day):
select avg(factor2) from pt where ID=`60 and Date=2020.01.04
Engine | Average Execution Time (ms) | Max Execution Time (ms) |
---|---|---|
OLAP | 62.6 | 95.9 |
TSDB | 1.2 | 1.3 |
Conclusion: For wide format tables with more than 100 columns, prefer the TSDB engine for better performance and file management.
3. Summary
This tutorial introduces 10 commonly overlooked details when creating databases and tables. These issues often arise from a lack of understanding of DolphinDB’s partitioning mechanism and storage engine architecture. If you would like to explore the internal principles of DolphinDB in more depth, refer to Database Partitioning.
In summary, the key points for creating databases and tables are:
- Align with business logic: Choose the appropriate storage engine based on your business scenario (e.g., use TSDB for wide format tables).
- Choose partition columns and types carefully: Select fields and partitioning types that align with query filters and data characteristics.
- Design reasonable partition granularity: Aim for 100MB to 1GB of uncompressed data per partition.
- Tune engine-specific parameters: For example, TSDB engine requires careful configuration of sortColumns.
By following these principles, you can greatly enhance data ingestion performance, query efficiency, and system stability.