textChunkDS
Syntax
textChunkDS(filename, chunkSize, [delimiter=','], [schema], [skipRows=0])
Arguments
filename a STRING scalar indicating the absolute path of the input text file.
chunkSize an integer between 1 and 2048 (MB) indicating the size of a file chunk.
delimiter a STRING scalar indicating the table column separator. It can consist of one or more characters, with the default being a comma (',').
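For example, to divide a tab-delimited file into 100 MB chunks, pass '\t' as the delimiter (the file path below is hypothetical):

$ ds = textChunkDS("/home/data/quotes.tsv", 100, '\t')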
schema a table. It can have the following columns, among which the "name" and "type" columns are required.
column name | data type | meaning
---|---|---
name | STRING scalar | column name
type | STRING scalar | data type
format | STRING scalar | the format of temporal columns
col | INT scalar or vector | the columns to be loaded
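If the automatically inferred column types are not appropriate, we can supply a schema table. A minimal sketch, with illustrative column names and types and a hypothetical file path (not part of the example below):

$ schema = table(`sym`date`price as name, `SYMBOL`DATE`DOUBLE as type)
$ ds = textChunkDS("/home/data/quotes.csv", 100, ',', schema)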
skipRows an integer between 0 and 1024 indicating the number of rows at the beginning of the text file to be ignored. The default value is 0.
Details
To load an extremely large text file into a DolphinDB database, we can first use function textChunkDS to divide the text file into multiple data sources, with the size of each data source specified by chunkSize, and then use function mr to load the data.
Examples
Use the following script to generate a data file of about 3.2 GB:
$ n=30000000
$ workDir = "/home/DolphinDB"
$ if(!exists(workDir)) mkdir(workDir)
$ trades=table(rand(`IBM`MSFT`GM`C`FB`GOOG`V`F`XOM`AMZN`TSLA`PG`S,n) as sym, 2000.01.01+rand(365,n) as date, 10.0+rand(2.0,n) as price1, 100.0+rand(20.0,n) as price2, 1000.0+rand(200.0,n) as price3, 10000.0+rand(2000.0,n) as price4, 10000.0+rand(3000.0,n) as price5, 10000.0+rand(4000.0,n) as price6, rand(10,n) as qty1, rand(100,n) as qty2, rand(1000,n) as qty3, rand(10000,n) as qty4, rand(10000,n) as qty5, rand(10000,n) as qty6)
$ trades.saveText(workDir + "/trades.txt");
Load the file into a DFS database with functions textChunkDS and mr:
$ db=database("dfs://db1",VALUE, `IBM`MSFT`GM`C`FB`GOOG`V`F`XOM`AMZN`TSLA`PG`S)
$ pt=db.createPartitionedTable(trades,`pt,`sym)
$ ds=textChunkDS(workDir + "/trades.txt",500)
$ mr(ds,append!{pt},,,false)
Please note that different data sources here may contain data of the same partition. As DolphinDB does not allow multiple threads to write to the same partition simultaneously, we must set the parameter 'parallel' of function mr to false; otherwise, an exception might be thrown.
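After mr finishes, we can verify the load with a simple query. A sketch, assuming the database and partitioned table created in the example above:

$ pt = loadTable("dfs://db1", `pt)
$ select count(*) from pt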