10 Commonly Overlooked Details in Data Importing

Whether you're importing from local disk files (such as .csv or .txt) or using plugins to import data from external sources, overlooking certain operational details can lead to failed imports or incorrect results.

This article outlines the 10 most commonly overlooked pitfalls during data import in DolphinDB, covering data formats and type handling, import performance, preprocessing, connection configuration, and concurrent write-write conflicts, along with practical solutions for each.

1. Handle Numeric Column Names

loadText and ploadText are the most commonly used functions for importing text files in DolphinDB. However, DolphinDB requires that all column names start with a letter. When column names in your file start with a digit, these functions behave as follows:

  • If containHeader is not specified:

    If any value in the first row starts with a digit, DolphinDB treats that row as data rather than as a header and assigns default column names such as col0, col1, etc. The first row of the file is then imported as the first row of data in the table.

  • If containHeader = true is specified:

    The first row is treated as the header. If any column name starts with a digit, DolphinDB automatically prefixes it with “c”. For example, "123Name" becomes "c123Name".

Neglecting this behavior may result in unexpected column names in the imported data.

For example, when importing the following colName.csv with column names starting with numbers:

id name totalSale 2023Q1 2023Q2 2023Q3
1 shirt 8000 2000 3000 3000
2 skirt 10000 2000 5000 3000
3 hat 2000 1000 300 700

1.1. Incorrect Operation

When containHeader is not specified, DolphinDB assigns default column names col0, col1, and so on.

For instance, when executing loadText("/xxx/colName.csv") to import the data, the result is as follows:

Note that since “col0” and “col1” contain discrete string values, DolphinDB parses them as SYMBOL type by default, while “col2” through “col5” contain integer data and are parsed as INT. As a result, the header value “totalSale” cannot be parsed as an integer, so the first row of “col2” is left empty, and the header values “2023Q1” through “2023Q3” in “col3” through “col5” retain only the leading numeric portion (2023).
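
A minimal way to reproduce and inspect this behavior (path assumed):

t = loadText("/xxx/colName.csv")
schema(t).colDefs    // default column names col0..col5; the original header values end up in the first data row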

1.2. Correct Operation

When containHeader = true is specified, DolphinDB prefixes a "c" to column names that start with a number, ensuring valid column names.

For example, when executing loadText("/xxx/colName.csv", containHeader = true) to import the data, the result is as follows:

By specifying containHeader = true, column names starting with a digit are handled correctly with a “c” prefix, which is the recommended approach. Therefore, when importing a text file with loadText whose header contains column names starting with a digit, always specify containHeader = true.
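
A minimal sketch of the recommended call (path assumed):

t = loadText("/xxx/colName.csv", containHeader=true)
schema(t).colDefs    // expected column names: id, name, totalSale, c2023Q1, c2023Q2, c2023Q3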

2. Infer Data Types Automatically

The loadText, ploadText, and loadTextEx functions all support the schema parameter, which specifies the data type of each column through a table object. If the schema parameter is not specified, DolphinDB infers the data type of each column from a random sample of rows, which may not always produce the correct types. To address this, DolphinDB provides the extractTextSchema function, which lets users view the inferred data types. It is recommended to review the inferred schema with extractTextSchema before importing. If the inferred types do not meet expectations, adjust them and pass the modified table as the schema parameter to loadText, ploadText, or loadTextEx.

For example, when importing the following type.csv file:

id ticker price
1 300001 25.80
2 300002 6.85
3 300003 7.19

2.1. Incorrect Operation

Executing loadText("/xxx/type.csv") results in the following:

By executing extractTextSchema("/xxx/type.csv"), it can be observed that the “ticker” column is incorrectly inferred as INT type, which does not meet expectations:

2.2. Correct Operation

You can use the extractTextSchema function to first view the inferred results, then specify the “ticker” column's type as SYMBOL. The modified result can then be passed as the schema parameter to the loadText function for importing, which will yield the expected outcome.

schema = extractTextSchema("/xxx/type.csv")
update schema set type = `SYMBOL where name = "ticker"
loadText("/xxx/type.csv", schema=schema)

3. Import Date and Time Data

When using the loadText function to import data that contains dates and times without explicitly specifying schema, the system will automatically infer the column types based on the data. The inference rules are as follows:

  • Data with delimiters:
    • Data containing "-", "/", or "." will be converted to the DATE type (e.g., "2023-04-10", "23.04.10").
    • Data containing ":" will be converted to the SECOND type (e.g., "12:34:56").
  • Numeric strings without delimiters:
    • Data in the “yyMMdd” format that meets 0 ≤ yy ≤ 99, 0 ≤ MM ≤ 12, 1 ≤ dd ≤ 31 will be preferentially parsed as DATE.
    • Data in the “yyyyMMdd” format that meets 1900 ≤ yyyy ≤ 2100, 0 ≤ MM ≤ 12, 1 ≤ dd ≤ 31 will also be preferentially parsed as DATE.

For example, when importing the following notdate.csv file:

id ticker price
1 111011 133.950
2 111012 125.145
3 111013 113.240

3.1. Incorrect Operation

Executing loadText("/xxx/notdate.csv") results in the following:
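
As in the previous section, previewing the inferred schema reveals the problem (path assumed): the “ticker” column is likely inferred as a temporal type because values such as 111011 match the yyMMdd pattern described above.

// preview the column types inferred by DolphinDB before importing
extractTextSchema("/xxx/notdate.csv")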

3.2. Correct Operation

Explicitly specify the column type (e.g., SYMBOL) via the schema parameter during import to ensure correct parsing.

Example:

schema = table(`id`ticker`price as name, `INT`SYMBOL`DOUBLE as type)
loadText("/xxx/notdate.csv", schema = schema)

Result:

4. Improve Import Performance with loadTextEx

In many practical scenarios, data files are imported directly into the distributed database without intermediate in-memory processing. To support this, DolphinDB provides the loadTextEx function, which writes data to the database directly, bypassing in-memory tables. This streamlines the workflow and improves performance.

For example, consider importing the following data.csv file:

ID date vol
23 2023.08.08 34.461863990873098
27 2023.08.08 4.043174418620766
36 2023.08.08 5.356599518563599
98 2023.08.09 5.630887264851481

4.1. Import Files with loadText

Loading the file into memory with loadText before writing it to the database takes 771.618 ms.

timer{
    t = loadText("/xxx/data.csv")
    loadTable("dfs://test_load","data").append!(t)
}

>> Time elapsed: 771.618ms

4.2. Import Files with loadTextEx

In contrast, using loadTextEx to import the file directly to the database takes only 384.097 ms.

timer{
    loadTextEx(database("dfs://test_load"), "data", "date", "/xxx/data.csv", sortColumns=`id)
}

>> Time elapsed: 384.097ms

loadTextEx offers increasing performance benefits as the dataset grows.

5. Preprocess Data with loadTextEx

The loadTextEx function provides a transform parameter that enables data cleaning and preprocessing during import. The processed data is written directly to the distributed database, eliminating the intermediate in-memory step, streamlining the workflow, and improving performance.

For example, when importing the dataWithNULL.csv file, the “vol” column contains null values that need to be replaced with 0 before writing to the database:

ID date vol
52 2023.08.08 5.08143
77 2023.08.09
35 2023.08.08 0.22431
99 2023.08.09

5.1. Import Files with loadText

Using loadText to load the file into memory, applying nullFill! to handle null values, and then writing to the database takes 802.23 ms.

timer{
    t = loadText("/xxx/dataWithNULL.csv")
    t.nullFill!(0.0)
    loadTable("dfs://test_load","data").append!(t)
}

>> Time elapsed: 802.23ms

5.2. Import Files with loadTextEx

In contrast, using loadTextEx with the transform parameter to import and preprocess data in one step takes only 385.086 ms.

timer{
    loadTextEx(database("dfs://test_load"), "data", "date", "/xxx/dataWithNULL.csv", transform = nullFill!{, 0.0}, sortColumns=`id)
}

>> Time elapsed: 385.086 ms

6. Import Long Integer Timestamps

In many datasets, Unix timestamps are stored as long integers (LONG type), representing the number of milliseconds since January 1, 1970 (UTC).

When importing such data:

  • If the type is not specified, the timestamp column is imported as LONG, with values displayed as integers.
  • If the type is explicitly set to TIMESTAMP, the system cannot parse the long integer values as timestamps and imports null values instead.

Neither approach yields the expected result.

The recommended solution is to first import the column as LONG, and then manually convert it using the timestamp function.

Below is the original structure of the sample file time.csv:

timestamp ticker price
1701199585108 SH2575 9.05991137959063
1701101960267 SH1869 9.667245978489518
1701292328832 SH1228 19.817104414105415
1701186220641 SH2471 3.389011020772159

6.1. Incorrect Operation

If the “timestamp” column is specified as TIMESTAMP during import, the result will be all null values.

schema = extractTextSchema("/xxx/time.csv")
update schema set type = "TIMESTAMP" where name = "timestamp"
loadText("/xxx/time.csv", schema=schema)

Result:

6.2. Correct Operation

The correct approach is to import the “timestamp” column as LONG, then use the replaceColumn! and timestamp functions to convert it to TIMESTAMP, achieving the expected result.

t = loadText("/xxx/time.csv")
replaceColumn!(t, `timestamp, timestamp(exec timestamp from t))
t

Although converting long integers to the TIMESTAMP type allows them to be displayed in a readable time format, DolphinDB parses all timestamps as UTC by default. If the original timestamps were generated based on local time, you may need to adjust for time zone differences as described in the next section.

7. Handle UTC Offsets in Long Integer Timestamps

Some databases store time data as Unix timestamps (UTC) and store the UTC offset separately.

In contrast, DolphinDB stores local timestamps without separately saving time zone information.

DolphinDB parses long integer timestamps as UTC by default and does not automatically apply any UTC offset. If the long integers are UTC epoch milliseconds that correspond to local times in a specific time zone (e.g., Beijing Time), the imported values will differ from the expected local times.

For example, in the localtime.csv file below, the “timestamp” column contains long integer values converted from the local time 2023.11.30T16:00:00.000 in Beijing Time:

timestamp ticker price
1701331200000 SH3149 8.676103590987622
1701331200000 SH0803 12.16052254475653
1701331200000 SH2533 12.076009283773601
1701331200000 SH3419 0.239130933769047

7.1. Incorrect Operation

Directly using the timestamp function to convert a LONG value to TIMESTAMP results in UTC time, which is 8 hours behind Beijing Time.

t = loadText("/xxx/localtime.csv")
replaceColumn!(t, `timestamp, timestamp(exec timestamp from t))
t

Result:

7.2. Correct Operation

Use the time zone conversion function localtime to convert the UTC time to local time (UTC+8 in this example).

t = loadText("/xxx/localtime.csv")
replaceColumn!(t, `timestamp, localtime(timestamp(exec timestamp from t)))
t

Result:

Alternatively, use the convertTZ function to convert between two specified time zones.

t = loadText("/xxx/localtime.csv")
replaceColumn!(t, `timestamp, convertTZ(timestamp(exec timestamp from t), "UTC", "Asia/Shanghai"))
t

Result:

8. Use Standard ODBC Connection Strings

DolphinDB supports importing data from external sources via the ODBC plugin. If a connection attempt results in an error such as FATAL: password authentication failed for user 'xxx', even when the username and password are correct, the issue is often caused by an incorrectly configured ODBC connection string. Refer to the Connection String Reference for the correct format.

8.1. Incorrect Operation

In the following example, a DSN connection string is used to connect to PostgreSQL, resulting in the authentication failure mentioned above:

odbcConfig = "DSN=PIE_PGSQL;Server="+ip+";port="+port+";DATABASE="+db+";Uid="+userName+";Password="+password+";"
conn = odbc::connect(odbcConfig)

>> FATAL: password authentication failed for user "pie_dev"

DolphinDB’s ODBC plugin on Linux is built on unixODBC. To verify whether a connection string is correctly configured, you can use the isql tool provided by unixODBC, which operates independently of DolphinDB. If the same DSN string results in an error with isql, the issue is likely with the connection string itself rather than with the DolphinDB plugin.

[28P01][unixODBC]FATAL: password authentication failed for user "pie_dev"
[ISQL]ERROR: Could not SQLDriverConnect

8.2. Correct Operation

Refer to the Connection String Reference for the standard ODBC connection string format for PostgreSQL:

Modify the connection string according to the standard format, and the connection will succeed:

odbcConfig = "Driver={PostgreSQL};Server="+ip+";port="+port+";DATABASE="+db+";Uid="+userName+";Password="+password+";"
conn = odbc::connect(odbcConfig)
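
Once connected, a simple query can be run through the ODBC plugin to confirm that the connection works (a minimal check; the SQL statement is arbitrary):

// run a trivial query through the connection handle
t = odbc::query(conn, "select 1 as ok")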

9. Avoid I/O Conflicts in Data Import

When dealing with compressed data files, it is common to decompress the files before import. If the decompression and data import operations are performed concurrently on the same disk, disk I/O contention may occur, significantly impacting import performance.

For example, importing the mdl_6_28_0.csv file, which is 4.3 GB in size:

9.1. Incorrect Operation

Performing decompression and data import simultaneously on the same disk where DolphinDB stores its data, such as decompressing a .zip file while using loadTextEx to import a CSV, can result in an import time of about 1 minute and 41 seconds.

timer loadTextEx(database("dfs://test_load"), "data", "SecurityID", "/xxx/mdl_6_28_0.csv", sortColumns=`SecurityID`UpdateTime)

>> Time elapsed: 101156.78ms

During the import process, the disk I/O statistics are as follows:

  • Disk read speed: 16.2 KB/s
  • Disk write speed: 32.6 MB/s
  • Disk read volume in the past minute: 1.1 GB
  • Disk write volume in the past minute: 1.1 GB
  • Remaining disk space: 1.3 TB
  • Total disk capacity: 2.2 TB
  • Available disk space: 58.74%

Performing both decompression and import on the same disk concurrently significantly slows down the data import speed.

9.2. Correct Operation

When the disk is idle and no decompression tasks are running, directly using the loadTextEx function to import the same file takes about 46 seconds:

timer loadTextEx(database("dfs://test_load"), "data", "SecurityID", "/xxx/mdl_6_28_0.csv", sortColumns=`SecurityID`UpdateTime)

>> Time elapsed: 46203.117ms

The disk I/O statistics during this process are as follows:

  • Disk read speed: 174.8 MB/s
  • Disk write speed: 138.7 MB/s
  • Disk read volume in the past minute: 3.5 GB
  • Disk write volume in the past minute: 3.5 GB
  • Remaining disk space: 1.3 TB
  • Total disk capacity: 2.2 TB
  • Available disk space: 58.23%

To improve import performance, avoid performing decompression and import on the same disk at the same time. Scheduling these tasks separately or placing them on different disks reduces disk I/O contention and improves import speed.

10. Avoid Concurrent Writes to the Same Partition

When importing data into a distributed database, concurrent writes to the same partition may occur. The behavior depends on the atomic parameter specified during database creation:

  • atomic='TRANS': Concurrent writes to the same partition are not allowed. If a transaction attempts to write to multiple partitions and one of them is locked by another transaction, a write-write conflict occurs and the entire transaction fails.
  • atomic='CHUNK': Concurrent writes to the same partition are allowed. If a transaction encounters a locked partition, it continues writing to other available partitions and repeatedly attempts to write to the locked one for a few minutes. The writes may succeed in some partitions and fail in others. Note that repeated attempts may impact write performance.

When a write-write conflict occurs, error code S00002 is raised with the message: <ChunkInTransaction>filepath has been owned by transaction, indicating the partition is locked by another transaction and cannot be locked again. Ignoring this issue could lead to data import failures.

Consider a database dfs://test_month, partitioned by month and created with atomic='TRANS'.
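
For reference, such a database and table might be created as follows (a sketch; the table schema, partition range, and TSDB engine are assumptions chosen to match the loadTextEx calls below):

// monthly VALUE partitions on the DATE column "date"; atomic='TRANS' is also the default
db = database("dfs://test_month", VALUE, 2023.01M..2023.12M, engine="TSDB", atomic="TRANS")
schemaTb = table(1:0, `ID`date`vol, [INT, DATE, DOUBLE])
db.createPartitionedTable(schemaTb, "data", "date", sortColumns=`ID)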

10.1. Incorrect Operation

For example, consider importing multiple CSV files into the same partition (2023.08M) in parallel.

Submit the import tasks using submitJob:

filePath = "/xxx/dailyData/"
for (i in files(filePath).filename){
    submitJob("loadCSVTest", "load daily files to monthly db", loadTextEx, database("dfs://test_month"), "data", "date", filePath+i)
}

Upon checking the task completion status, only one task succeeds, while others fail due to write-write conflicts, throwing error code S00002.
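
The completion status and error messages of the batch jobs can be inspected with getRecentJobs, for example:

// failed jobs report the S00002 write-write conflict in the errorMsg column
select jobId, startTime, endTime, errorMsg from getRecentJobs() where jobDesc = "load daily files to monthly db"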

10.2. Correct Operation

To avoid write-write conflicts, import the files sequentially:

def loadDaily(filePath){
    for (i in files(filePath).filename){
        loadTextEx(database("dfs://test_month"), "data", `date, filePath+i, sortColumns=`ID)
    }
}

path = "/xxx/dailyData/"
submitJob("loadCSVTest","load daily files to monthly db",loadDaily, path)

Upon checking the task completion status, the data is successfully imported without errors.

11. Conclusion

This article highlights 10 commonly overlooked details in data import with DolphinDB, covering aspects such as data format and column name handling, data type inference, import performance optimization, data preprocessing methods, ODBC connection configuration, I/O conflict management, and concurrent write-write conflicts. Understanding and addressing these factors can significantly improve the speed and quality of data imports.