10 Commonly Overlooked Details in Creating Databases and Tables
When creating databases and tables, users often lack clarity on how to choose partition columns, partition types, and sort columns. Neglecting these details may lead to issues such as poor query performance, data loss, or insertion errors. Properly configuring partition columns, partition types, and sort columns can significantly improve query performance and CPU utilization, and reduce memory consumption.
This tutorial introduces the details involved in database and table creation to help users optimize their scripts.
1. Introduction to Partitioning and Storage Structures
1.1 Partition Types
Partitioning is one of the most important techniques to manage data and improve the performance of the distributed file system. It makes tables with large datasets more manageable by breaking them into smaller chunks. A well-designed partitioning strategy ensures that only the necessary data is scanned during queries, thereby reducing the system response latency.
A suitable partition type can evenly divide data based on business characteristics. DolphinDB supports the following partition types: RANGE, HASH, VALUE, LIST and COMPO.
- Range partitioning creates a partition for each defined range interval.
- Hash partitioning creates a specified number of partitions using a hash function on the partitioning column.
- Value partitioning creates partitions based on specific values in the partitioning column, typically applied to date and time columns.
- List partitioning partitions data based on specified sets of enum values, which is more flexible than value partitioning.
- Composite partitioning is suitable for large-scale datasets where SQL WHERE or GROUP BY clauses frequently involve multiple columns. It combines two or three dimensions of partition types among RANGE, HASH, VALUE, and LIST.
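For reference, here are minimal sketches of each declaration (the database paths and schemes are illustrative):
//RANGE: partitions [0,5) and [5,10)
dbR = database("dfs://rangeDemo", RANGE, 0 5 10)
//HASH: 10 buckets computed from an integer column
dbH = database("dfs://hashDemo", HASH, [INT, 10])
//VALUE: one partition per trading date
dbV = database("dfs://valueDemo", VALUE, 2020.01.01..2020.01.31)
//LIST: explicit groups of enum values
dbL = database("dfs://listDemo", LIST, [`IBM`ORCL`MSFT, `GOOG`FB])
//COMPO: VALUE on date combined with HASH on symbol
dbC = database("dfs://compoDemo", COMPO, [database(, VALUE, 2020.01.01..2020.01.31), database(, HASH, [SYMBOL, 10])])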
1.2 Partitioning and Data I/O
In DolphinDB, data is managed at the partition level: both writes and reads are handled per partition. Data is written to disk partition by partition, and concurrent writes to the same partition are not allowed. For reads, DolphinDB performs partition pruning based on query conditions, narrowing down the relevant partitions and scanning only those that satisfy the filter. The partition columns and the granularity of partitioning therefore have a significant impact on query efficiency.
1.3 Storage Structures
TSDB and OLAP storage engines differ in their storage structures:
- The OLAP engine stores each column as a columnar file. Data is stored in the same order as it is written, ensuring highly efficient data writing.
- The TSDB engine is based on the LSM-tree (Log-Structured Merge Tree) model and uses a PAX (Partition Attributes Across) storage format. Data in each partition is written to one or more level files. Within each level file, data is sorted by the specified columns, based on which indexes are constructed.
Sort columns are a structure specific to the TSDB engine and play a crucial role in storage and retrieval. They serve two main purposes: indexing data within partitions and deduplicating records. Sort columns are specified with the sortColumns parameter of the createPartitionedTable function. When multiple sort columns are specified, the last must be a temporal or integer column, and all preceding columns form the sort key. If there is only one sort column, that column is the sort key. Records sharing the same sort key are sorted by the last column and stored together in blocks.
The sort key allows queries to quickly locate relevant blocks, significantly enhancing retrieval speed. It also ensures that data within transactions and level files is ordered and deduplicated.
The in-partition indexes include sort keys, their entry information and sparse indexes (such as maximum, minimum, and count statistics). When the sort key entries exceed recommended limits, the indexes can grow significantly, which can adversely impact write performance, query efficiency, and memory usage. To mitigate this, the dimensionality of indexes can be reduced by hashing the sort keys.
2. Ten Commonly Overlooked Details
2.1 Too Few Records per Sort Key
The sortColumns parameter must be specified when creating a table in a TSDB database. It is generally recommended that each sort key correspond to more than 10,000 rows. Otherwise, the following issues may occur:
- More index entries consume more memory.
- Fragmented data storage degrades read and write performance.
- Smaller block size reduces the compression ratio.
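A quick sanity check for this guideline is to divide the expected rows per partition by the number of distinct sort-key values; the figures below mirror the stock-data example used in this section:
//rough check for a date-partitioned trades table
rowsPerPartition = 20000000        //one trading day of records
sortKeyEntries = 5000              //distinct SecurityID values
rowsPerPartition / sortKeyEntries  //4,000 rows per sort key, below the 10,000-row guideline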
2.1.1 Incorrect Example
Stock data (assuming 5,000 stocks, 20 million records per day, and a daily data volume of 800MB) is value-partitioned by trading date.
Scenario 1: Specifying TradeTime as sortColumns leads to an excessive number of index keys, each with very little data.
When only one column is specified, the number of unique values of that column within the partition determines the number of index keys.
In the following script, the number of index keys per partition equals the number of distinct TradeTime values. Even at second-level granularity, there are 4 * 3600 = 14,400 unique values per trading day. In reality, TradeTime has millisecond precision, so the number of index keys is far greater than 14,400.
db=database("dfs://testDB1",VALUE, 2020.01.01..2021.01.01,engine="TSDB")
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt1, partitionColumns=`TradeDate, sortColumns=`TradeTime, keepDuplicates=ALL)
Scenario 2: Specifying SecurityID and TradeTime as sortColumns results in a number of index keys equal to the number of stock symbols.
When multiple columns are specified for sortColumns, the last column must be either a temporal column or an integer column, and the preceding columns are used as the index keys.
In the following script, the number of index keys per partition equals the number of distinct SecurityID values. Assuming there are 5,000 stocks, the number of index keys is 5,000, with each index key corresponding to 4,000 rows.
db=database("dfs://testDB1",VALUE, 2020.01.01..2021.01.01,engine="TSDB")
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt2, partitionColumns=`TradeDate, sortColumns=`SecurityID`TradeTime, keepDuplicates=ALL)
2.1.2 Solution
Scenario 1: Avoid specifying a time column, or any column with a large number of unique values, as the only sort column.
In the following script, TradeTime is replaced with SecurityID as sortColumns. This results in 5,000 index keys per partition, with each index key corresponding to 4,000 rows.
db=database("dfs://testDB1",VALUE, 2020.01.01..2021.01.01,engine="TSDB")
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt3, partitionColumns=`TradeDate, sortColumns=`SecurityID, keepDuplicates=ALL)
To further optimize, apply dimensionality reduction by hashing the sort keys. The number of index keys per partition then drops to 500, with each index key corresponding to 40,000 rows.
db=database("dfs://testDB1",VALUE, 2020.01.01..2021.01.01,engine="TSDB")
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt4, partitionColumns=`TradeDate, sortColumns=`SecurityID, keepDuplicates=ALL, sortKeyMappingFunction=[hashBucket{,500}])
Scenario 2: When multiple columns are specified for sortColumns, specify sortKeyMappingFunction for dimensionality reduction if the number of sort key entries exceeds 1,000. Dimensionality reduction improves range query performance but may slightly reduce point query performance. The number of index keys equals the product of the numbers of distinct values of each sort column after dimensionality reduction or deduplication.
In the following script, SecurityID and TradeTime are specified as sortColumns. The number of index keys per partition is reduced to 500, with each index key corresponding to 40,000 rows.
db=database("dfs://testDB1",VALUE, 2020.01.01..2021.01.01,engine="TSDB")
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt5, partitionColumns=`TradeDate, sortColumns=`SecurityID`TradeTime, keepDuplicates=ALL,sortKeyMappingFunction=[hashBucket{,500}])
2.1.3 Performance Test Results
The five table creation strategies described above were evaluated by importing 10 days of data (20 million records per day) and measuring their query performance.
Range query:
select avg(TradePrice) from pt where TradeDate between 2020.01.04 and 2020.01.05
Sort Columns | Number of Index Keys per Partition | Average Execution Time (ms) | Max Execution Time (ms) |
---|---|---|---|
TradeTime | 28,800 | 167.9 | 215.1 |
SecurityID, TradeTime | 5,000 | 46.3 | 68.6 |
SecurityID | 5,000 | 47.7 | 66.9 |
SecurityID (Hash Dimensionality Reduction) | 500 | 6.6 | 8.5 |
SecurityID (Hash Dimensionality Reduction), TradeTime | 500 | 6.7 | 7.9 |
Point query 1 (TradeDate + SecurityID):
select avg(TradePrice) from pt where TradeDate = 2020.01.04 and SecurityID=`600
Sort Columns | Number of Index Keys per Partition | Average Execution Time (ms) | Max Execution Time (ms) |
---|---|---|---|
TradeTime | 28,800 | 2002.6 | 2562.8 |
SecurityID, TradeTime | 5,000 | 4.0 | 4.7 |
SecurityID | 5,000 | 3.4 | 4.7 |
SecurityID (Hash Dimensionality Reduction) | 500 | 6.5 | 8.5 |
SecurityID (Hash Dimensionality Reduction), TradeTime | 500 | 7.7 | 8.3 |
Point query 2 (TradeDate + SecurityID + TradeTime):
select TradePrice from pt where TradeDate = 2020.01.04 and SecurityID=`3740 and TradeTime =09:30:00.000
Sort Columns | Number of Index Keys per Partition | Average Execution Time (ms) | Max Execution Time (ms) |
---|---|---|---|
TradeTime | 28,800 | 1.4 | 2.1 |
SecurityID, TradeTime | 5,000 | 1.5 | 2.6 |
SecurityID | 5,000 | 1.4 | 2.3 |
SecurityID (Hash Dimensionality Reduction) | 500 | 5.0 | 7.3 |
SecurityID (Hash Dimensionality Reduction), TradeTime | 500 | 6.2 | 8.0 |
Conclusion: If sortColumns is set improperly, or the WHERE clause references no sort column, query performance degrades significantly. Range query performance also suffers when the data under each index key is too sparse; hash dimensionality reduction greatly improves it, though point query performance may slightly decrease.
2.2 String Fields with Special Characters as Value-Partition Columns
When using a STRING column as a value-partition column, appended data must not contain special characters such as spaces, '\n', '\r', or '\t'. If the partition column contains such characters, data insertion will fail.
2.2.1 Incorrect Example
In the following script, special characters in the value-partitioned column cause the insertion error "A string or symbol value-partitioning column can't contain any invisible character".
n=1000
ID=take(["1","2","3"], n);
x=rand(1.0, n);
t=table(ID, x);
//create a DFS table value-partitioned by ID
db=database(directory="dfs://valuedb", partitionType=VALUE, partitionScheme=["1"])
pt = db.createPartitionedTable(t, `pt, `ID);
pt.append!(t);
n=1000
//IDs containing spaces trigger the invisible-character error on append
ID=take(["1 12","2.12","3 12"], n);
x=rand(1.0, n);
t=table(ID, x);
pt.append!(t);
2.2.2 Solution
Remove these invisible characters before insertion, as shown in the sketch below. If the characters carry meaning and must be kept, use hash partitioning for this column instead. Be sure to configure the number of hash partitions based on your data volume to avoid overly fine or coarse granularity.
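For illustration, a sketch of both options (the hashdb path and the 10-bucket count are illustrative; strReplace performs the cleanup):
n = 1000
ID = take(["1 12","2.12","3 12"], n)
x = rand(1.0, n)
//option 1: strip spaces and control characters before appending to a value-partitioned table
cleanID = strReplace(strReplace(strReplace(ID, " ", ""), "\t", ""), "\n", "")
//option 2: keep the characters and hash-partition the column instead
dbh = database(directory="dfs://hashdb", partitionType=HASH, partitionScheme=[STRING, 10])
pth = dbh.createPartitionedTable(table(ID, x), `pt, `ID)
pth.append!(table(ID, x))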
2.3 Data Loss Due to Improper Range Partition Settings
When using range partitioning, records with values outside the defined partition ranges are silently discarded. The tableInsert function returns the number of inserted records, which can be used to check whether any data was lost at the range boundaries.
2.3.1 Incorrect Example
In the following script, the ID values range from 0 to 19, but the partition scheme only defines the ranges [0, 5) and [5, 10):
n=1000000
x=rand(1.0, n)
//ID values from 0 to 19
ID=rand(20, n)
t=table(ID, x);
//Partition ranges: 0-5, 5-10
db=database(directory="dfs://rangedb", partitionType=RANGE, partitionScheme=0 5 10)
pt = db.createPartitionedTable(t, `pt, `ID);
//Only data in range [0, 10) will be inserted
tableInsert(pt,t);
The result shows only 499,972 records inserted, far fewer than 1,000,000: roughly half the values fall in [10, 20) and lie outside the defined ranges.
2.3.2 Solution
To avoid data loss, define a partition scheme that fully covers the expected data range. Alternatively, use the addRangePartitions function to add new ranges later, as sketched below.
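A minimal sketch of both safeguards (the boundary values are illustrative; consult the addRangePartitions documentation for the exact boundary semantics):
//compare the inserted count against the source row count
cnt = tableInsert(pt, t)
if(cnt != t.size()) print("dropped " + string(t.size() - cnt) + " records outside the partition ranges")
//extend the scheme past the current maximum boundary of 10, adding ranges [10,15) and [15,21)
addRangePartitions(database("dfs://rangedb"), 10 15 21)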
2.4 Excessively Fine Partition Granularity
Improper partitioning strategy can result in too many small partitions, which may lead to:
- Errors such as "The number of partitions relevant to the query is too large" when querying wide ranges.
- Excessive use of the level file index cache, which slows down all queries on TSDB databases.
- High scheduling and communication overhead between the controller and data nodes due to too many subtasks.
- Inefficient disk access (e.g., small file reads/writes), increasing system load.
- Exhaustion of controller memory due to large amounts of partition metadata.
Recommendation: Each partition should contain at least 100MB of uncompressed data.
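A back-of-the-envelope check using the figures from the example below:
//3GB of uncompressed daily data split across 5,000 value partitions per day
dailyMB = 3 * 1024.0
partitionsPerDay = 5000
dailyMB / partitionsPerDay   //about 0.6MB per partition, far below the 100MB guideline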
2.4.1 Incorrect Example
Daily stock quote data (50 million records per day) is value-partitioned by TradeDate and then value-partitioned by SecurityID. Assuming 5,000 stocks, this results in more than 5,000 partitions per day, each only approximately 0.6MB.
db1 = database(, VALUE, 2020.01.01..2021.01.01)
db2 = database(, VALUE, `a`b)
db = database(directory="dfs://testDB1", partitionType=COMPO, partitionScheme=[db1, db2], engine="TSDB")
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`TradeDate`SecurityID, sortColumns=`SecurityID`TradeTime, keepDuplicates=ALL)
The above script may cause the following problems:
- The maximum number of partitions a single query can search defaults to 65,536. Querying 14 days triggers "The number of partitions relevant to the query is too large" (5,000 per day × 14 = 70,000 partitions).
- The level file index cache (default 5% of system memory) fills up quickly with too many partitions, degrading TSDB query performance.
2.4.2 Solution
Instead of value-partitioning by SecurityID, use hash partitioning with a partition count chosen so that each partition holds approximately 100MB–1GB. In this case, there are only 25 partitions per day, the daily in-memory data volume is 3GB, and each partition is approximately 120MB.
db1 = database(, VALUE, 2020.01.01..2021.01.01)
db2 = database(, HASH, [SYMBOL, 25])
db = database(directory="dfs://testDB2", partitionType=COMPO, partitionScheme=[db1, db2], engine="TSDB")
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`TradeDate`SecurityID, sortColumns=`SecurityID`TradeTime, keepDuplicates=ALL)
2.4.3 Performance Test Results
Both schemas were tested using 10 days of data (50 million records per day):
select avg(TradePrice) from pt where TradeDate=2020.01.04
Scheme | Average Execution Time (ms) | Max Execution Time (ms) |
---|---|---|
Date (VALUE) + SecurityID (VALUE) | 214.4 | 378.1 |
Date (VALUE) + SecurityID (HASH) | 13.6 | 20.5 |
Conclusion: Overly fine-grained partitions significantly degrade performance. Choose appropriate granularity during schema design.
2.5 Excessively Coarse Partition Granularity
If partition type or fields are poorly chosen, partition granularity can be too coarse. This may lead to:
- Memory shortage during multi-threaded queries.
- Frequent memory-disk swaps, degrading performance.
- Failure to utilize DolphinDB’s distributed architecture, turning parallel tasks into sequential ones.
2.5.1 Incorrect Example
In the following script, 50 million stock quote records per day (approximately 3GB per day) are partitioned by TradeDate only:
//Partitioned by TradeDate
db= database("dfs://testDB1", VALUE, 2020.01.01..2021.01.01)
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
//One partition per day, each with a 3GB volume.
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`TradeDate)
2.5.2 Solution
Add hash partitioning on SecurityID. There are only 25 partitions per day, and each partition is approximately 120MB.
db1 = database(, VALUE, 2020.01.01..2021.01.01)
db2 = database(, HASH, [SYMBOL, 25])
db = database(directory="dfs://testDB2", partitionType=COMPO, partitionScheme=[db1, db2])
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`TradeDate`SecurityID)
2.5.3 Performance Test Results
Both schemas were tested using 10 days of data (50 million records per day):
select avg(TradePrice) from pt where SecurityID=`600
Scheme | Average Execution Time (ms) | Max Execution Time (ms) |
---|---|---|
Date (VALUE) | 560.4 | 683.1 |
Date (VALUE) + SecurityID (HASH) | 31.2 | 40.8 |
Conclusion: Coarse partitions reduce query efficiency. Use composite partitioning to achieve optimal granularity.
2.6 Using Dimension Tables for Large Datasets
Dimension tables are fully loaded into memory and are not automatically released (prior to version 2.00.11). If a large dataset is stored in a dimension table, query performance will degrade as the data volume increases, and memory usage will also increase.
2.6.1 Incorrect Example
Using a dimension table to store more than 200,000 records or datasets with uncertain volume can cause significant performance issues.
Stock data generates 20 million records per day (800MB per day in memory). If this data is stored in a dimension table, it will accumulate to hundreds of millions of rows in just a few days. Reading from the table will load all data into memory, quickly exhausting system resources.
db= database("dfs://testDB1", VALUE, 2020.01.01..2021.01.01)
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createTable(tbSchema,`dt)
2.6.2 Solution
Instead, use a partitioned table that is value-partitioned by trade date and hash-partitioned into 25 buckets on SecurityID:
db1 = database(, VALUE, 2020.01.01..2021.01.01)
db2 = database(, HASH, [SYMBOL, 25])
db = database(directory="dfs://testDB2", partitionType=COMPO, partitionScheme=[db1, db2])
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`TradeDate`SecurityID)
2.6.3 Performance Test Results
Both schemas were tested using one day of data (20 million rows):
select avg(TradePrice) from tb where SecurityID=`600
Table Type | Average Execution Time (ms) | Max Execution Time (ms) |
---|---|---|
Dimension table | 1385.5 | 1545.2 |
Date (VALUE) + SecurityID (HASH) | 36.0 | 60.3 |
Conclusion: Use dimension tables only when the dataset is small and well-scoped. For large datasets, always use partitioned tables to ensure performance.
2.7 Poor Choice of Partition Columns Slows Queries
If frequently queried columns are not specified as partition columns, DolphinDB cannot perform partition pruning, resulting in full partition scans and degraded performance.
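For instance, pruning typically applies only when the partition column appears directly in the filter; wrapping it in a function or expression generally forces a scan of all partitions. A sketch against a table partitioned by TradeDate:
//prunes: the partition column is compared directly
select avg(TradePrice) from pt where TradeDate = 2020.01.04
//typically cannot prune: a function is applied to the partition column
select avg(TradePrice) from pt where month(TradeDate) = 2020.01M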
2.7.1 Incorrect Example
In the following script, the DFS table is partitioned only by SecurityID, while queries frequently filter on both TradeDate and SecurityID. This causes full partition scans for TradeDate queries and partitions that keep growing over time:
db = database("dfs://testDB1",HASH, [SYMBOL, 25])
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`SecurityID)
2.7.2 Solution
Use composite partitioning by both TradeDate and SecurityID:
db1 = database(, VALUE, 2020.01.01..2021.01.01)
db2 = database(, HASH, [SYMBOL, 25])
db = database("dfs://testDB2", partitionType=COMPO, partitionScheme=[db1, db2])
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`TradeDate`SecurityID)
2.7.3 Performance Test Results
Both schemas were tested using 10 days of data (20 million records per day):
select avg(TradePrice) from pt where SecurityID=`600 and TradeDate=2020.01.04
Scheme | Average Execution Time (ms) | Max Execution Time (ms) |
---|---|---|
SecurityID (HASH) | 174.5 | 193.1 |
Date (VALUE) + SecurityID (HASH) | 38.1 | 50.4 |
Conclusion: Choose partition columns based on query filters to improve efficiency.
2.8 Poor Partition Design Causes Write Conflicts
DolphinDB imposes some restrictions on transactions:
- A transaction cannot contain both read and write operations.
- Only one transaction can write to a partition at any given time. Writes to the same partition by multiple concurrent transactions can lead to conflicts.
2.8.1 Incorrect Example
In the following script, the DFS table is partitioned only by SecurityID, while data is written concurrently by TradeDate:
db = database("dfs://testDB",HASH, [SYMBOL, 25])
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`SecurityID)
//concurrent writes by TradeDate then fail with the error "xxx has been owned by transaction xx"
2.8.2 Solution
- Partition the table using the fields by which writes are parallelized (e.g., TradeDate; see the sketch after this list).
- Align the parallel writing program with the partition scheme.
- Use API functions that support partition-aware writing (e.g., PartitionedTableAppender in the C++/Python APIs).
Conclusion: Avoid concurrent writes to the same partition by adjusting either the partition scheme or the data ingestion process.
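A server-side sketch of the first approach (sourceTb, the database path, and the job IDs are illustrative): each background job writes a single TradeDate, so no two concurrent transactions touch the same partition.
//write each day's data in its own job; jobs never share a TradeDate partition
def writeOneDay(dbPath, tbName, data){
    loadTable(dbPath, tbName).append!(data)
}
for(d in 2020.01.01..2020.01.05){
    data = select * from sourceTb where TradeDate = d   //sourceTb: an assumed in-memory source table
    submitJob("w" + strReplace(string(d), ".", ""), "write " + string(d), writeOneDay, "dfs://testDB2", "pt", data)
}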
2.9 Not Partitioning Time-Growing Tables by a Date Field
For tables where data volume grows over time, not partitioning by date will lead to overly large partitions and degraded query performance.
2.9.1 Incorrect Example
In the following script, the DFS table is partitioned only by SecurityID, while the table grows daily:
db = database("dfs://testDB1",HASH, [SYMBOL, 25])
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`SecurityID)
This leads to uncontrolled partition size growth and slow queries on TradeDate.
2.9.2 Solution
Partition the table by both TradeDate and SecurityID:
db1 = database(, VALUE, 2020.01.01..2021.01.01)
db2 = database(, HASH, [SYMBOL, 25])
db = database("dfs://testDB2", partitionType=COMPO, partitionScheme=[db1, db2])
colName = `SecurityID`TradeDate`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo
colType = `SYMBOL`DATE`TIME`DOUBLE`INT`DOUBLE`INT`INT
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`TradeDate`SecurityID)
2.9.3 Performance Test Results
Both schemas were tested using 10 days of data (20 million records per day):
select avg(TradePrice) from pt where SecurityID=`600 and TradeDate=2020.01.04
Scheme | Average Execution Time (ms) | Max Execution Time (ms) |
---|---|---|
SecurityID (HASH) | 174.5 | 193.1 |
Date (VALUE) + SecurityID (HASH) | 38.1 | 50.4 |
Conclusion: For time-growing tables, always use a date column as a partition column to prevent unbounded growth and improve query performance.
2.10 Using the OLAP Engine to Store Wide Format Tables
Since the OLAP engine uses columnar storage, each column in a partition is stored as a separate file. Storing wide format tables (e.g., more than 100 columns) results in too many files, causing read/write bottlenecks. TSDB's PAX storage significantly reduces the number of physical files for wide format tables.
2.10.1 Incorrect Example
A table with 203 columns (e.g., 200 factor columns) using the default OLAP engine:
db= database("dfs://testDB1", VALUE, 2020.01.01..2021.01.01)
//use OLAP engine (by default) for a wide format table with 203 columns
colName = `ID`Date`Time
colName.append!(take("factor"+string(0..200),200))
colType = `SYMBOL`DATE`TIME
colType.append!(take(["DOUBLE"],200))
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`Date)
2.10.2 Solution
Use the TSDB engine to store wide format tables:
db= database("dfs://testDB2", VALUE, 2020.01.01..2021.01.01,engine="TSDB")
//use TSDB engine for a wide format table with 203 columns
colName = `ID`Date`Time
colName.append!(take("factor"+string(0..200),200))
colType = `SYMBOL`DATE`TIME
colType.append!(take(["DOUBLE"],200))
tbSchema = table(1:0, colName, colType)
db.createPartitionedTable(table=tbSchema, tableName=`pt, partitionColumns=`Date,sortColumns=`ID`Time)
2.10.3 Performance Test Results
Both schemas were tested using 10 days of data (2 million records per day):
select avg(factor2) from pt where ID=`60 and Date=2020.01.04
Engine | Average Execution Time (ms) | Max Execution Time (ms) |
---|---|---|
OLAP | 62.6 | 95.9 |
TSDB | 1.2 | 1.3 |
Conclusion: For wide format tables with more than 100 columns, prefer the TSDB engine for better performance and file management.
3. Summary
This tutorial introduces 10 commonly overlooked details when creating databases and tables. These issues often arise from a lack of understanding of DolphinDB’s partitioning mechanism and storage engine architecture. If you would like to explore the internal principles of DolphinDB in more depth, refer to Database Partitioning.
In summary, the key points for creating databases and tables are:
- Align with business logic: Choose the appropriate storage engine based on your business scenario (e.g., use TSDB for wide format tables).
- Choose partition columns and types carefully: Select fields and partitioning types that align with query filters and data characteristics.
- Design reasonable partition granularity: Aim for 100MB to 1GB of uncompressed data per partition.
- Tune engine-specific parameters: For example, TSDB engine requires careful configuration of sortColumns.
By following these principles, you can greatly enhance data ingestion performance, query efficiency, and system stability.