textChunkDS

Syntax

textChunkDS(filename, chunkSize, [delimiter], [schema], [skipRows=0], [arrayDelimiter], [containHeader], [arrayMarker])

Arguments

filename is a string indicating the input text file name with its absolute path. Currently only .csv files are supported.

chunkSize is a positive integer indicating the size of a file chunk (in MB). The upper limit is max(maxMemSize / workerNum, 2048MB), representing the greater of the maximum available memory per worker or 2048MB.

delimiter (optional) is a STRING scalar indicating the table column separator. It can consist of one or more characters, with the default being a comma (',').

schema (optional) a table. It can have the following columns, among which "name" and "type" columns are required.


Column	Data Type	Description
name	STRING scalar	column name
type	STRING scalar	data type
format	STRING scalar	the format of temporal columns
col	INT scalar or vector	the columns to be loaded

skipRows (optional) is an integer between 0 and 1024 indicating the rows in the beginning of the text file to be ignored. The default value is 0.

arrayDelimiter (optional) is a single character indicating the delimiter for columns holding the array vectors in the file. Since the array vectors cannot be recognized automatically, you must use the schema parameter to update the data type of the type column with the corresponding array vector data type before import.

containHeader (optional) a Boolean value indicating whether the file contains a header row. The default value is null. See loadText for the detailed determining rules.

arrayMarker is a string containing 2 characters or a CHAR pair. These two characters represent the identifiers for the left and right boundaries of an array vector. The default identifiers are double quotes (").

It cannot contain spaces, tabs (\t), or newline characters (\t or \n).
It cannot contain digits or letters.
If one is a double quote ("), the other must also be a double quote.
If the identifier is ', ", or \, a backslash ( \ ) escape character should be used as appropriate. For example, arrayMarker="\"\"".
If delimiter specifies a single character, arrayMarker cannot contain the same character.
If delimiter specifies multiple characters, the left boundary of arrayMarker cannot be the same as the first character of delimiter.

Details

To load an extremely large text file into DolphinDB database, we can first use function textChunkDS to divide the text file into multiple data sources with the size of each data source specified by chunkSize, then use function mr to load data.

When loading data files in DolphinDB, a random sample of the data is analyzed to determine the data type for each column. However, this sampling method does not always accurately determine the column types. It is recommend to use the extractTextSchema function to check the schema of the input file before loading the data. You can specify the intended data type for each column in the "type" field of the schema. For date or time columns particularly, if DolphinDB does not recognize the correct data types, you need to set the temporal type in the "type" field, and provide the date/time format string (e.g. "MM/dd/yyyy") in the "format" field. Refer to Parsing and Format of Temporal Variables for temproal formats in DolphinDB.

Examples

Use the following script to generate the data file of about 3.2GB:

n=30000000
workDir = "/home/DolphinDB"
if(!exists(workDir)) mkdir(workDir)
trades=table(rand(`IBM`MSFT`GM`C`FB`GOOG`V`F`XOM`AMZN`TSLA`PG`S,n) as sym, 2000.01.01+rand(365,n) as date, 10.0+rand(2.0,n) as price1, 100.0+rand(20.0,n) as price2, 1000.0+rand(200.0,n) as price3, 10000.0+rand(2000.0,n) as price4, 10000.0+rand(3000.0,n) as price5, 10000.0+rand(4000.0,n) as price6, rand(10,n) as qty1, rand(100,n) as qty2, rand(1000,n) as qty3, rand(10000,n) as qty4, rand(10000,n) as qty5, rand(10000,n) as qty6)
trades.saveText(workDir + "/trades.txt");

Load the file into a DFS database with functions textChunkDS and mr:

db=database("dfs://db1",VALUE, `IBM`MSFT`GM`C`FB`GOOG`V`F`XOM`AMZN`TSLA`PG`S)
pt=db.createPartitionedTable(trades,`pt,`sym)
ds=textChunkDS(workDir + "/trades.txt",500)
mr(ds,append!{pt},,,false)

Note:

Different data sources here may contain data in the same partition. As DolphinDB does not allow writing to the same partition simultaneously from multiple threads, we must set parameter 'parallel' of function mr to false, otherwise an exception might be thrown.