loadText
Syntax
loadText(filename, [delimiter], [schema], [skipRows=0], [arrayDelimiter], [containHeader], [arrayMarker])
Arguments
filename is the input text file name with its absolute path. Currently only .csv files are supported.
delimiter (optional) is a STRING scalar indicating the table column separator. It can consist of one or more characters, with the default being a comma (',').
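For example, a file whose columns are separated by a two-character delimiter could be loaded as follows (a minimal sketch; the file path is hypothetical):
// columns in the file are separated by the two-character delimiter "||"
t=loadText("/path/to/pipes.csv", "||")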
schema (optional) is a table specifying the schema of each column of the text file. It can contain the following columns:
Column | Data Type | Description |
---|---|---|
name | STRING scalar | column name |
type | STRING scalar | data type |
format | STRING scalar | the format of temporal columns |
col | INT scalar or vector | the columns to be loaded |
If "type" specifies a temporal data type, the format of the source data must match a DolphinDB temporal data type. If the format of the source data and the DolphinDB temporal data types are incompatible, you can specify the column type as STRING when loading the data and convert it to a DolphinDB temporal data type using the temporalParse function afterwards.
skipRows (optional) is an integer between 0 and 1024 indicating the number of rows at the beginning of the text file to be ignored. The default value is 0.
arrayDelimiter (optional) is a single character indicating the delimiter for columns holding array vectors in the file. Before import, you must use the schema parameter to update the type column with the corresponding array vector data type.
containHeader (optional) is a Boolean value indicating whether the file contains a header row. The default value is NULL.
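For example, to force the first row to be treated as data rather than as a header (a minimal sketch; the file path is hypothetical):
t=loadText("/path/to/noheader.csv", containHeader=false)    // the resulting columns are named col0, col1, ...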
arrayMarker (optional) is a string containing 2 characters or a CHAR pair. The two characters identify the left and right boundaries of an array vector. The default identifiers are double quotes ("). The following restrictions apply:
- It cannot contain spaces, tabs (\t), or newline characters (\r or \n).
- It cannot contain digits or letters.
- If one identifier is a double quote ("), the other must also be a double quote.
- If an identifier is ', ", or \, a backslash (\) escape character should be used as appropriate. For example, arrayMarker="\"\"".
- If delimiter specifies a single character, arrayMarker cannot contain the same character.
- If delimiter specifies multiple characters, the left boundary of arrayMarker cannot be the same as the first character of delimiter.
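For example, if the array vectors in a file are wrapped in parentheses, such as (1.5,2.5,3.5), they could be loaded as follows (a minimal sketch; the file path and column name "bid" are hypothetical):
schema=extractTextSchema("/path/to/arr.csv")
update schema set type="DOUBLE[]" where name="bid"    // mark the column as an array vector column
t=loadText("/path/to/arr.csv", schema=schema, arrayDelimiter=",", arrayMarker="()")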
Details
Load a text file into memory as a table. loadText loads data in a single thread. To load data with multiple threads, use ploadText.
- How a header row is determined:
  - If containHeader is not specified, the first row of the file is read in string format (the upper limit for the first row is 256 KB) and the column names are parsed from that data. If none of the values in the first row starts with a number, the first row is treated as the header containing the column names of the text file. If at least one of the values in the first row starts with a number, the system uses col0, col1, … as the column names.
  - When containHeader is true, the first row is treated as the header row, and the column names are parsed from that data.
  - When containHeader is false, the system uses col0, col1, … as the column names.
- How the column types are determined:
  - When loading a text file, the system determines the data type of each column based on a random sample of rows. This convenient feature may not always determine the data type of every column accurately. We recommend checking the inferred data types with the extractTextSchema function before loading.
  - When the input file contains dates and times:
    - Data with delimiters (date delimiters "-", "/" and ".", and time delimiter ":") is converted to the corresponding type. For example, "12:34:56" is converted to the SECOND type; "23.04.10" is converted to the DATE type.
    - Data without delimiters in the format "yyMMdd" that satisfies 0<=yy<=99, 0<=MM<=12, 1<=dd<=31 is preferentially parsed as DATE; data in the format "yyyyMMdd" that satisfies 1900<=yyyy<=2100, 0<=MM<=12, 1<=dd<=31 is preferentially parsed as DATE.
  - If a column is not inferred as the expected data type, specify the correct data type for that column in the schema table. You can also specify the data types of all columns. For a temporal column that is not inferred as the expected data type, a format such as "MM/dd/yyyy" must also be specified in the schema table. For details about temporal formats, refer to Parsing and Format of Temporal Variables.
To load a subset of columns, specify the column indices in the "col" column of schema.
As strings in DolphinDB are encoded in UTF-8, input text files must also be UTF-8 encoded.
Column names are adjusted as follows:
- If a column name contains characters other than letters, numbers, or underscores, those characters are converted to underscores.
- If a column name does not start with a letter, "c" is prepended to the column name.
A few examples:
Column name in data files | Adjusted column name |
---|---|
1_test | c1_test |
test-a! | test_a_ |
[test] | c_test_ |
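For instance, a header "1_test" is loaded as column "c1_test"; if a different name is preferred, the column can be renamed after loading (a minimal sketch; the file path is hypothetical, and rename! renames a column of an in-memory table):
t=loadText("/path/to/data.csv")      // the header "1_test" becomes column "c1_test"
rename!(t, "c1_test", "test_1")      // rename the adjusted column after loading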
loadText supports files containing multiple line breaks in a single record.
Examples
Use the following script to generate the data file to be used for the examples:
n=10
sym=rand(`AAPL`ORCL`MS`SUN,n)
permno=take(10001,n)
date=rand(2019.06.01..2019.06.10,n)
open=rand(100.0,n)
high=rand(200.0,n)
close=rand(200.0,n)
pre_close=rand(200.0,n)
change=rand(100.0,n)
vol=rand(10000,n)
amount=rand(100000.0,n)
t=table(sym,permno,date,open,high,close,pre_close,change,vol,amount)
saveText(t,"C:/DolphinDB/Data/stock.csv");
Example: Use loadText without specifying any optional parameters:
tt=loadText("C:/DolphinDB/Data/stock.csv");
tt;
sym | permno | date | open | high | close | pre_close | change | vol | amount |
---|---|---|---|---|---|---|---|---|---|
MS | 10001 | 2019.06.06 | 90.346594 | 80.530542 | 96.474428 | 146.305659 | 0.720236 | 1045 | 90494.568297 |
AAPL | 10001 | 2019.06.07 | 91.165315 | 8.482074 | 85.514922 | 16.259077 | 76.797829 | 7646 | 91623.485996 |
AAPL | 10001 | 2019.06.03 | 45.361885 | 14.077451 | 149.848419 | 89.110375 | 45.499145 | 9555 | 98171.601654 |
MS | 10001 | 2019.06.04 | 8.98688 | 0.591778 | 155.54643 | 132.423187 | 69.95799 | 1202 | 3512.927634 |
MS | 10001 | 2019.06.07 | 62.866173 | 33.465237 | 174.20712 | 102.695818 | 74.580523 | 3524 | 61943.64517 |
MS | 10001 | 2019.06.09 | 32.819915 | 13.319577 | 136.729618 | 63.980405 | 60.66375 | 7078 | 85138.216568 |
MS | 10001 | 2019.06.07 | 90.210866 | 22.728777 | 150.212291 | 59.454705 | 73.916303 | 5306 | 19883.845607 |
AAPL | 10001 | 2019.06.06 | 83.752686 | 71.3501 | 98.211979 | 145.60098 | 94.428343 | 8852 | 9236.020781 |
ORCL | 10001 | 2019.06.01 | 81.64719 | 129.702202 | 182.784373 | 117.575967 | 74.84595 | 2942 | 43394.871242 |
AAPL | 10001 | 2019.06.02 | 10.068382 | 80.875383 | 181.674585 | 138.783821 | 25.298267 | 1088 | 82981.043775 |
schema(tt).colDefs;
name | typeString | typeInt | comment |
---|---|---|---|
sym | SYMBOL | 17 | |
permno | INT | 4 | |
date | DATE | 6 | |
open | DOUBLE | 16 | |
high | DOUBLE | 16 | |
close | DOUBLE | 16 | |
pre_close | DOUBLE | 16 | |
change | DOUBLE | 16 | |
vol | INT | 4 | |
amount | DOUBLE | 16 |
Example: Specify the data type of a column before loading the file.
We may want to change the data type of column "permno" to be SYMBOL. For this, we need to use function extractTextSchema to get the schema table, update it, then load the text file with the revised schema table.
schema=extractTextSchema("C:/DolphinDB/Data/stock.csv");
update schema set type=`SYMBOL where name=`permno;
tt=loadText("C:/DolphinDB/Data/stock.csv",,schema);
schema(tt).colDefs;
name | typeString | typeInt | comment |
---|---|---|---|
sym | SYMBOL | 17 | |
permno | SYMBOL | 17 | |
date | DATE | 6 | |
open | DOUBLE | 16 | |
high | DOUBLE | 16 | |
close | DOUBLE | 16 | |
pre_close | DOUBLE | 16 | |
change | DOUBLE | 16 | |
vol | INT | 4 | |
amount | DOUBLE | 16 |
You can also specify the data types of all columns:
schematable=table(`sym`permno`date`open`high`close`pre_close`change`vol`amount as name,`SYMBOL`SYMBOL`DATE`DOUBLE`DOUBLE`DOUBLE`DOUBLE`DOUBLE`INT`DOUBLE as type)
tt=loadText("C:/DolphinDB/Data/stock.csv",,schematable)
schema(tt).colDefs;
name | typeString | typeInt | comment |
---|---|---|---|
sym | SYMBOL | 17 | |
permno | SYMBOL | 17 | |
date | DATE | 6 | |
open | DOUBLE | 16 | |
high | DOUBLE | 16 | |
close | DOUBLE | 16 | |
pre_close | DOUBLE | 16 | |
change | DOUBLE | 16 | |
vol | INT | 4 | |
amount | DOUBLE | 16 |
Example: Load only a subset of columns.
For example, we may only need to load the following 7 columns: sym, date, open, high, close, vol, amount. Please note that we cannot change the order of columns when loading data. To change the order of columns in the loaded table, use function reorderColumns!.
schema=extractTextSchema("C:/DolphinDB/Data/stock.csv");
schema=select * from schema where name in `sym`date`open`high`close`vol`amount
schema[`col]=[0,2,3,4,5,8,9]
tt=loadText("C:/DolphinDB/Data/stock.csv",,schema);
tt;
sym | date | open | high | close | vol | amount |
---|---|---|---|---|---|---|
SUN | 2019.06.10 | 18.675316 | 72.754005 | 136.463909 | 1376 | 31371.319038 |
AAPL | 2019.06.05 | 42.098717 | 196.873587 | 41.513899 | 3632 | 9950.864129 |
ORCL | 2019.06.05 | 62.223474 | 197.099027 | 123.785675 | 3069 | 38035.800937 |
SUN | 2019.06.03 | 0.18163 | 50.669866 | 4.652098 | 6213 | 1842.198893 |
SUN | 2019.06.06 | 32.54134 | 67.012502 | 130.312294 | 4891 | 55744.156823 |
SUN | 2019.06.07 | 56.899091 | 81.709825 | 61.786176 | 1133 | 69057.849515 |
AAPL | 2019.06.08 | 77.026838 | 38.504431 | 22.68496 | 3672 | 34420.187073 |
ORCL | 2019.06.07 | 62.752656 | 39.33621 | 48.483091 | 4382 | 41601.601639 |
AAPL | 2019.06.02 | 8.5487 | 17.623418 | 141.88325 | 8092 | 15449.159988 |
AAPL | 2019.06.02 | 26.178685 | 197.320455 | 110.52407 | 5541 | 14616.820449 |
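As noted above, the column order cannot be changed during loading; reorderColumns! can rearrange the columns of the loaded table afterwards (a minimal sketch based on the table tt loaded above):
reorderColumns!(tt, `date`sym`open`high`close`vol`amount)    // move the date column to the front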
Example: Skip the first 2 rows when loading.
Please note that as the header is the first line of the text file, it is also skipped.
re=loadText(filename="C:/DolphinDB/Data/stock.csv",skipRows=2)
select count(*) from re;
count |
---|
9 |
Example: Specify the temporal format when loading the file.
Generate the text file to be used:
time=["20190623145457","20190623155423","20190623163025"]
sym=`AAPL`MS`IBM
qty=2200 5400 8670
price=54.78 59.64 65.23
t=table(time,sym,qty,price)
saveText(t,"C:/DolphinDB/Data/t2.csv");
Obtain the text schema with extractTextSchema before loading the file:
extractTextSchema("C:/DolphinDB/Data/t2.csv");
name | type |
---|---|
time | LONG |
sym | SYMBOL |
qty | INT |
price | DOUBLE |
As the extracted schema shows, if we load this text file without specifying the format of column "time", the column cannot be parsed correctly and is loaded as empty values. For this scenario we must specify the data type and format of the column.
schema=extractTextSchema("C:/DolphinDB/Data/t2.csv")
update schema set type = "DATETIME" where name = "time"
schema[`format]=["yyyyMMddHHmmss",,,];
loadText("C:/DolphinDB/Data/t2.csv",,schema);
time | sym | qty | price |
---|---|---|---|
2019.06.23T14:54:57 | AAPL | 2200 | 54.78 |
2019.06.23T15:54:23 | MS | 5400 | 59.64 |
2019.06.23T16:30:25 | IBM | 8670 | 65.23 |
Example: Load array vectors
bid = array(DOUBLE[], 0, 20).append!([1.4799 1.479 1.4787, 1.4796 1.479 1.4784, 1.4791 1.479 1.4784])
ask = array(DOUBLE[], 0, 20).append!([1.4821 1.4825 1.4828, 1.4818 1.482 1.4821, 1.4814 1.4818 1.482])
TradeDate = 2022.01.01 + 1..3
SecurityID = rand(`APPL`AMZN`IBM, 3)
t = table(SecurityID as `sid, TradeDate as `date, bid as `bid, ask as `ask)
t;
saveText(t,filename="/home/t.csv",delimiter=',',append=true)
Load the file with loadText:
path = "/home/t.csv"
schema=extractTextSchema(path);
update schema set type = "DOUBLE[]" where name="bid" or name ="ask"
t = loadText(path, schema=schema, arrayDelimiter=",")
t;
sid | date | bid | ask |
---|---|---|---|
AMZN | 2022.01.02 | [1.4799,1.479,1.4787] | [1.4821,1.4825,1.4828] |
AMZN | 2022.01.03 | [1.4796,1.479,1.4784] | [1.4818,1.482,1.4821] |
IBM | 2022.01.04 | [1.4791,1.479,1.4784] | [1.4814,1.4818,1.482] |
The file t.csv saved above encloses each array vector in double quotes (the default arrayMarker):
sid,date,bid,ask
APPL,2022.01.02,"1.4799,1.479,1.4787","1.4821,1.4825,1.4828"
IBM,2022.01.03,"1.4796,1.479,1.4784","1.4818,1.482,1.4821"
APPL,2022.01.04,"1.4791,1.479,1.4784","1.4814,1.4818,1.482"
If the array vectors in the file are instead enclosed in square brackets, for example:
sid,date,bid,ask
APPL,2022.01.02,[1.4799,1.479,1.4787],[1.4821,1.4825,1.4828]
IBM,2022.01.03,[1.4796,1.479,1.4784],[1.4818,1.482,1.4821]
APPL,2022.01.04,[1.4791,1.479,1.4784],[1.4814,1.4818,1.482]
Load it with the following script:
path = "/home/DolphinDB/Data/t.csv"
schema=extractTextSchema(path);
update schema set type = "DOUBLE[]" where name="bid" or name ="ask"
t = loadText(path, schema=schema, arrayDelimiter=",",arrayMarker="[]")
t;