Binary File Processing

DolphinDB provides a wide range of functions for binary file processing, from reading and writing raw bytes to reading and writing high-level objects.

Read and Write Raw Bytes

The writeBytes function writes the entire buffer to the file. The buffer must be a CHAR scalar or a CHAR vector. If the operation succeeds, the function returns the number of bytes actually written; otherwise, it raises an IOException. The readBytes function reads a given number of bytes from the file. If the file reaches the end or an I/O error occurs, it raises an IOException; otherwise, it returns a buffer containing the given number of bytes. One must therefore know the exact number of bytes to read before calling readBytes.

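// define a file copy function
def fileCopy(source, target){
 s = file(source)
 // seek to the end to get the file length, then rewind to the start
 len = s.seek(0,TAIL)
 s.seek(0,HEAD)
 t = file(target,"w")
 // copy in chunks of at most 1024 bytes; an empty source skips the loop
 if(len > 0){
    do{
       buf = s.readBytes(min(len,1024))
       t.writeBytes(buf)
       len -= buf.size()
    }while(len > 0)
 }
 s.close()
 t.close()
}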

fileCopy("test.txt","testcopy.txt");

The readBytes function always returns a new CHAR vector. As discussed in the section on text file processing, creating a new vector buffer takes some time. To improve performance, we can create a buffer once and reuse it. The read! function accepts such an existing buffer. Another advantage of read! is that one doesn't have to know the exact number of bytes to read: the function returns when the file reaches the end or the given number of bytes have been read. If the returned count is less than requested, the file has reached the end.

// define a file copy function using the read! and write functions
def fileCopy2(source, target){
 s = file(source)
 t = file(target,"w")
 // allocate the buffer once and reuse it for every read
 buf = array(CHAR,1024)
 do{
    numByte = s.read!(buf,0,1024)
    t.write(buf,0, numByte)
 }while(numByte==1024)
 s.close()
 t.close()
}

fileCopy2("test.txt","testcopy.txt");
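The buffer-reuse pattern is not specific to DolphinDB. As a rough illustration, the Python sketch below copies a file through one preallocated buffer; readinto plays the role of read! here, and a short read signals the end of the file. The function name copy_with_reused_buffer, the 1024-byte chunk size, and the temporary file names are assumptions made for this illustration.

```python
import os
import tempfile

def copy_with_reused_buffer(source, target, chunk_size=1024):
    """Copy source to target through one reusable buffer; return bytes copied."""
    buf = bytearray(chunk_size)      # allocated once, reused for every read
    view = memoryview(buf)
    total = 0
    with open(source, "rb") as src, open(target, "wb") as dst:
        while True:
            n = src.readinto(buf)    # number of bytes actually read
            dst.write(view[:n])
            total += n
            if n < chunk_size:       # short read: end of file reached
                break
    return total

tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "test.txt")
dst_path = os.path.join(tmp_dir, "testcopy.txt")
with open(src_path, "wb") as f:
    f.write(bytes(range(256)) * 10)  # 2560 bytes: two full chunks plus 512

copied = copy_with_reused_buffer(src_path, dst_path)
print(copied)  # 2560
```

Because the buffer is overwritten on every read, only the first n bytes are valid after a short read, which is why the write is limited to that slice.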

The performance of the file copy functions is dominated by the write part. To compare readBytes with read! directly, the experiment below times the read part alone.

fileLen = file("test.txt").seek(0, TAIL)
timer(1000){
   fin = file("test.txt")
   len = fileLen
   do{
      buf = fin.readBytes(min(len,1024))
      len -= buf.size()
   }while(len)
};

// Time elapsed: 210.593 ms

timer(1000){
   fin = file("test.txt")
   buf = array(CHAR,1024)
   do{numBytes = fin.read!(buf,0,1024)}while(numBytes==1024)
};

// Time elapsed: 194.519 ms

In this test, read! is about 8% faster than readBytes (194.519 ms versus 210.593 ms for 1000 runs).

Function readRecord! converts a binary file to a DolphinDB object. The binary file is read row by row, and each row must contain fields with fixed data types and lengths. For example, if each row of a binary file contains 6 fields with the following types (lengths in bytes): char(1), boolean(1), short(2), int(4), long(8), and double(8), readRecord! treats every 24 bytes as a new row.
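The 24-byte row size is simply the sum of the field lengths. The Python sketch below checks that arithmetic with the standard struct module and splits a buffer into fixed-length rows. The format string "<b?hiqd" (little endian, no padding) and the sample values are assumptions for illustration and do not describe how readRecord! is implemented.

```python
import struct

# char(1), boolean(1), short(2), int(4), long(8), double(8); "<" disables padding
ROW = struct.Struct("<b?hiqd")
print(ROW.size)  # 24

rows = [(1, True, 10, 100, 1000, 1.5), (2, False, 20, 200, 2000, 2.5)]
data = b"".join(ROW.pack(*r) for r in rows)

# Every ROW.size bytes form one row.
decoded = [ROW.unpack_from(data, offset) for offset in range(0, len(data), ROW.size)]
print(decoded == rows)  # True
```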

The following example shows how to import a binary file binSample.bin with readRecord!.

First, create an in-memory table:

tb=table(1000:0, `id`date`time`last`volume`value`ask1`ask_size1`bid1`bid_size1, [INT,INT,INT,FLOAT,INT,FLOAT,FLOAT,INT,FLOAT,INT])

Open the file with function file, then import it with function readRecord!. The data is loaded into table tb.

dataFilePath="/home/DolphinDB/binSample.bin"
f=file(dataFilePath)
f.readRecord!(tb);
select top 5 * from tb;


id	date	time	ask1	ask_size1	bid1	bid_size1
1	20190902	91804000	11.45	200	11.45	200
2	20190902	92007000	11.45	200	11.45	200
3	20190902	92046000	11.45	1200	11.45	1200
4	20190902	92346000	11.45	1200	11.45	1200
5	20190902	92349000	11.45	5100	11.45	5100

Function readRecord! doesn't support the STRING type. In this example, the date and time columns are loaded as INT. To turn them into temporal types, convert the integers to strings, parse the strings with function temporalParse, and replace the original columns with function replaceColumn!.

tb.replaceColumn!(`date, tb.date.string().temporalParse("yyyyMMdd"))
tb.replaceColumn!(`time, tb.time.format("000000000").temporalParse("HHmmssSSS"))
select top 5 * from tb;


id	date	time	ask1	ask_size1	bid1	bid_size1
1	2019.09.02	09:18:04.000	11.45	200	11.45	200
2	2019.09.02	09:20:07.000	11.45	200	11.45	200
3	2019.09.02	09:20:46.000	11.45	1200	11.45	1200
4	2019.09.02	09:23:46.000	11.45	1200	11.45	1200
5	2019.09.02	09:23:49.000	11.45	5100	11.45	5100
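The conversion depends on zero-padding the time value to nine digits before parsing, because 91804000 has only eight digits. A rough Python equivalent of the two steps is sketched below; the helper names parse_date and parse_time are hypothetical, and the sample values are taken from the first row above.

```python
from datetime import date, time

def parse_date(value):
    """Interpret an integer such as 20190902 as yyyyMMdd."""
    text = str(value)
    return date(int(text[0:4]), int(text[4:6]), int(text[6:8]))

def parse_time(value):
    """Interpret an integer such as 91804000 as HHmmssSSS."""
    text = f"{value:09d}"  # zero-pad to 9 digits: "091804000"
    millis = int(text[6:9])
    return time(int(text[0:2]), int(text[2:4]), int(text[4:6]), millis * 1000)

print(parse_date(20190902))                                     # 2019-09-02
print(parse_time(91804000).isoformat(timespec="milliseconds"))  # 09:18:04.000
```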

Read and Write Multi-byte Integers and Floating-Point Numbers

The write function converts the specified buffer to a stream of bytes and saves it to the file. The buffer can be a scalar or a vector of various types. If an error occurs, an IOException is raised; otherwise, the function returns the number of elements (not the number of bytes) written. The read! function reads a given number of elements into the buffer. For example, if the buffer is an INT vector, the function converts the bytes from the file to INT values.

Both write and read! convert between streams of bytes and multi-byte words, so the byte order, known as endianness, matters. Big-endian order stores the most significant byte at the lowest address, whereas little-endian order stores the least significant byte at the lowest address. The write function always uses the endianness of the operating system. The read! function can convert the endianness if the endianness of the file differs from that of the operating system. The file function takes an optional boolean argument that indicates whether the file uses the little-endian format. By default, it is the endianness of the operating system.
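To make the two byte orders concrete, the Python sketch below packs the 16-bit integer 10 in each order with the standard struct module and then reads the little-endian bytes back as big endian. It illustrates the byte layout only and makes no claim about DolphinDB internals.

```python
import struct

little = struct.pack("<h", 10)   # least significant byte first
big = struct.pack(">h", 10)      # most significant byte first
print(little.hex(), big.hex())   # 0a00 000a

# Reading little-endian bytes with the wrong byte order swaps the two bytes:
# 0x000A becomes 0x0A00, which is 10 * 256.
(misread,) = struct.unpack(">h", little)
print(misread)  # 2560
```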

x=10h
y=0h
file("C:/DolphinDB/test.bin","w").write(x);
// output: 1

file("C:/DolphinDB/test.bin","r",true).read!(y);
// assume the file format is little endian
// output: 1

y;
// output: 10

file("C:/DolphinDB/test.bin","r",false).read!(y);
// assume the file format is big endian
// output: 1

y;
// output: 2560

In this simple experiment, we write a short integer (2 bytes) with a value of 10 to the file and read it back into another short integer variable y, first as little endian and then as big endian. On a little-endian machine, the two readouts are 10 and 2560 (10 * 256), respectively, as expected. If all file operations are performed on the same machine, one doesn't have to worry about endianness. But in a distributed system, one must pay attention to the endianness of network streams and files.

The above example uses a scalar as the buffer for read and write. The next example uses an INT vector as the buffer. It generates one million random integers from 0 to 9999, saves them to a file, then reads them back using a small buffer and calculates the sum.

n=1000000
x=rand(10000,n)
file("test.bin","w").write(x,0,n)
sum=0
buf=array(INT,1024)
fin=file("test.bin")
do{
   len = fin.read!(buf,0, 1024)
   if(len==1024)
      sum +=buf.sum()
   else
      sum += buf.subarray(0:len).sum()
}while(len == 1024)
fin.close()
sum;

// output: 4994363593 (the exact value varies from run to run because x is random)
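The else branch matters because the last read usually fills only part of the buffer: one million elements amount to 976 full reads of 1024 elements plus a final read of 576. The Python sketch below reproduces the same chunked-sum logic with the standard array module; the temporary file name and the random seed are arbitrary choices for this illustration.

```python
import os
import random
import tempfile
from array import array

n = 1_000_000
chunk = 1024
random.seed(0)
values = array("i", (random.randrange(10000) for _ in range(n)))

path = os.path.join(tempfile.mkdtemp(), "test.bin")
with open(path, "wb") as f:
    values.tofile(f)

total = 0
read_sizes = []
with open(path, "rb") as f:
    while True:
        raw = f.read(chunk * values.itemsize)
        part = array("i")
        part.frombytes(raw)          # holds only the elements actually read
        total += sum(part)
        read_sizes.append(len(part))
        if len(part) < chunk:        # short read: end of file
            break

print(len(read_sizes) - 1, read_sizes[-1])  # 976 576
print(total == sum(values))                 # True
```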

In addition to numbers, strings can also be saved to files in binary format. A null character (a byte with value zero) is appended to each string as its delimiter, so a string of n bytes occupies n+1 bytes in the file. The example below demonstrates the use of write and read! with strings. We first generate one million random stock tickers and save them to a file in binary format. Then we use a small buffer to read the entire file sequentially. After each read, we use the dictUpdate! function to update the count of each ticker.

file("test.bin","w").write(rand(`IBM`MSFT`GOOG`YHOO`C`FORD`MS`GS`BIDU,1000000));
// output: 1000000

words=dict(STRING,LONG)
buf=array(STRING,1024)
counts=array(LONG,1024,0,1)
fin=file("test.bin")
do{
   len = fin.read!(buf,0,1024)
   if(len==1024)
      dictUpdate!(words, +, buf, counts)
   else
      dictUpdate!(words, +, buf.subarray(0:len), counts.subarray(0:len))
}while(len==1024)
fin.close();

words;

/* output
MSFT->111294
BIDU->110800
FORD->110916
GS->111233
MS->110859
C->110591
YHOO->111069
GOOG->111972
IBM->111266
*/

words.values().sum();

// output: 1000000
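The n+1 rule can be checked by hand: a ticker such as IBM takes 3 bytes plus one terminator. The Python sketch below encodes a few tickers as null-terminated strings and decodes them again. It mirrors the layout described above under the assumption of plain ASCII text, and it is not taken from DolphinDB's implementation.

```python
tickers = ["IBM", "MSFT", "C"]

# Each string is followed by a single zero byte as its delimiter.
encoded = b"".join(t.encode("ascii") + b"\x00" for t in tickers)
print(len(encoded))  # (3+1) + (4+1) + (1+1) = 11

# Splitting on the zero byte recovers the strings; drop the empty tail piece.
decoded = [piece.decode("ascii") for piece in encoded.split(b"\x00")[:-1]]
print(decoded)  # ['IBM', 'MSFT', 'C']
```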

Read and Write Object

The read! and write functions offer a lot of flexibility for reading and writing binary data. However, one has to know the exact number of elements to write and read, as well as their data types. Therefore, when dealing with complex data structures such as a matrix, table, or tuple, one has to design a complicated protocol to coordinate the write and the read. DolphinDB offers two high-level functions, readObject and writeObject, for reading and writing whole objects. All data structures, including scalar, vector, matrix, set, dictionary, and table, can be used with these two functions.

a1=10.5
a2=1..10
a3=cross(*,1..5,1..10)
a4=set(`IBM`MSFT`GOOG`YHOO)
a5=dict(a4.keys(),125.6 53.2 702.3 39.7)
a6=table(1 2 3 as id, `Jenny`Tom`Jack as name)
a7=(1 2 3, "hello world!", 25.6)

fout=file("test.bin","w")
fout.writeObject(a1)
fout.writeObject(a2)
fout.writeObject(a3)
fout.writeObject(a4)
fout.writeObject(a5)
fout.writeObject(a6)
fout.writeObject(a7)
fout.close();

The script above writes seven objects of different types to a file. The script below reads those seven objects back from the file and prints a short description of each.

fin = file("test.bin")
for(i in 0:7) print typestr fin.readObject()
fin.close();

/* output
DOUBLE
FAST INT VECTOR
INT MATRIX
STRING SET
STRING->DOUBLE Dictionary
TABLE
ANY VECTOR
*/
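For readers who know Python, a loose analogy is the standard pickle module, where each dump writes a self-describing object and each load reads the next one back without the caller specifying its type or length. The sketch below is only an analogy: the file format is unrelated to the one used by writeObject and readObject, and the sample objects are chosen for illustration.

```python
import os
import pickle
import tempfile

objects = [
    10.5,
    list(range(1, 11)),
    {"IBM", "MSFT"},
    {"IBM": 125.6},
    ([1, 2, 3], "hello world!", 25.6),
]

path = os.path.join(tempfile.mkdtemp(), "test.bin")
with open(path, "wb") as fout:
    for obj in objects:
        pickle.dump(obj, fout)       # one self-describing record per object

with open(path, "rb") as fin:
    restored = [pickle.load(fin) for _ in range(len(objects))]

print([type(obj).__name__ for obj in restored])
# ['float', 'list', 'set', 'dict', 'tuple']
```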