Binary File Processing
DolphinDB provides a wide range of functions for binary file processing, from reading and writing raw bytes to reading and writing high-level objects.
Read and Write Raw Bytes
The writeBytes function writes the entire buffer to the file. The buffer must be a CHAR scalar or CHAR vector. If the operation succeeds, the function returns the actual number of bytes written; otherwise, an IOException is raised. The readBytes function reads a given number of bytes from the file. If the end of the file is reached or an IO error occurs, an IOException is raised; otherwise, a buffer containing the requested number of bytes is returned. Therefore, one must know the exact number of bytes to read before calling readBytes.
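// define a file copy function
def fileCopy(source, target){
 s = file(source)
 // seek to the end to get the file length, then rewind to the start
 len = s.seek(0,TAIL)
 s.seek(0,HEAD)
 t = file(target,"w")
 if(len==0) return
 do{
    // readBytes needs an exact count, so never request more than what remains
    buf = s.readBytes(min(len,1024))
    t.writeBytes(buf)
    len -= buf.size()
 }while(len)
}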
fileCopy("test.txt","testcopy.txt");
The readBytes function always returns a new CHAR vector. As discussed earlier in the section on text file processing, creating a new vector buffer takes time. To improve performance, we can create a buffer once and reuse it. read! is such a function: it accepts an existing buffer. Another advantage of read! is that one doesn't have to know the exact number of bytes to read. The function returns when the end of the file is reached or the given number of bytes has been read. If the returned count is less than requested, the file has reached its end.
// define a file copy function using read! and write functions
def fileCopy2(source, target){
 s = file(source)
 t = file(target,"w")
 buf = array(CHAR,1024)
 do{
    numByte = s.read!(buf,0,1024)
    t.write(buf,0, numByte)
 }while(numByte==1024)
}
fileCopy2("test.txt","testcopy.txt");
The running time of a file copy function is dominated by the write part. To compare the performance of readBytes with read!, we run the read-only experiment below.
fileLen = file("test.txt").seek(0, TAIL)
timer(1000){
   fin = file("test.txt")
   len = fileLen
   do{
      buf = fin.readBytes(min(len,1024))
      len -= buf.size()
   }while(len)
};
// Time elapsed: 210.593 ms
timer(1000){
   fin = file("test.txt")
   buf = array(CHAR,1024)
   do{numBytes = fin.read!(buf,0,1024)}while(numBytes==1024)
};
// Time elapsed: 194.519 ms
In this run, read! is about 8% faster than readBytes (194.519 ms versus 210.593 ms for 1000 passes over the file).
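The buffer-reuse pattern behind read! is not specific to DolphinDB. As a rough illustration, written in Python so that it can be run independently of a DolphinDB server, the sketch below copies a file with one preallocated buffer, using Python's readinto in the role that read! plays above. The function name copy_with_reused_buffer and the temporary file names are hypothetical, and the sketch says nothing about how DolphinDB implements read! internally.

```python
import os
import tempfile


def copy_with_reused_buffer(source, target, chunk=1024):
    """Copy source to target, reusing one preallocated buffer for every read."""
    buf = bytearray(chunk)  # allocated once, never reallocated in the loop
    total = 0
    with open(source, "rb") as fin, open(target, "wb") as fout:
        while True:
            n = fin.readinto(buf)  # fills buf in place, returns the byte count
            if n == 0:             # nothing left to read: end of file
                break
            fout.write(memoryview(buf)[:n])  # write only the bytes just read
            total += n
    return total


# Demo on a throwaway file whose size is not a multiple of the buffer size.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "test.txt")
    dst = os.path.join(tmp, "testcopy.txt")
    payload = bytes(range(256)) * 10  # 2560 bytes = 2 full chunks + 512 bytes
    with open(src, "wb") as f:
        f.write(payload)
    copied = copy_with_reused_buffer(src, dst)
    with open(dst, "rb") as f:
        result = f.read()

print(copied, result == payload)  # 2560 True
```

The last read returns 512 rather than 1024 bytes, which is the same short-count signal that fileCopy2 uses to detect the end of the file.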
Function readRecord! converts a binary file to a DolphinDB object. The binary file is read row by row, and each row must consist of fields with fixed data types and lengths. For example, if a binary file contains 6 data fields with the following types (lengths in bytes): char(1), boolean(1), short(2), int(4), long(8), and double(8), the function readRecord! will treat every 24 bytes as a new row.
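The 24-byte row length in the example above is the sum of the field widths: 1 + 1 + 2 + 4 + 8 + 8. The Python sketch below checks that arithmetic and splits a byte string into fixed-length rows. It assumes a packed little-endian layout with no padding between fields, and the two sample rows are made-up values used only for illustration.

```python
import struct

# "<" selects little-endian with no padding, so the field sizes add up exactly:
# b = char(1), ? = boolean(1), h = short(2), i = int(4), q = long(8), d = double(8)
RECORD = struct.Struct("<b?hiqd")
print(RECORD.size)  # 24

# Two hypothetical rows, packed back to back as they would sit in a file.
rows = [
    (1, True, 7, 100, 10**12, 11.45),
    (2, False, 8, 200, 2 * 10**12, 11.5),
]
blob = b"".join(RECORD.pack(*row) for row in rows)

# Every 24 bytes is decoded as one row.
decoded = list(RECORD.iter_unpack(blob))
print(len(blob), decoded == rows)  # 48 True
```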
The following example shows how to import a binary file binSample.bin with readRecord!.
Create an in-memory table:
tb=table(1000:0, `id`date`time`last`volume`value`ask1`ask_size1`bid1`bid_size1, [INT,INT,INT,FLOAT,INT,FLOAT,FLOAT,INT,FLOAT,INT])
Open the file with function file, then import it with function readRecord!. The data will be loaded into table tb.
dataFilePath="/home/DolphinDB/binSample.bin"
f=file(dataFilePath)
f.readRecord!(tb);
select top 5 * from tb;
| id | date | time | last | volume | value | ask1 | ask_size1 | bid1 | bid_size1 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 20190902 | 91804000 | 0 | 0 | 0 | 11.45 | 200 | 11.45 | 200 | 
| 2 | 20190902 | 92007000 | 0 | 0 | 0 | 11.45 | 200 | 11.45 | 200 | 
| 3 | 20190902 | 92046000 | 0 | 0 | 0 | 11.45 | 1200 | 11.45 | 1200 | 
| 4 | 20190902 | 92346000 | 0 | 0 | 0 | 11.45 | 1200 | 11.45 | 1200 | 
| 5 | 20190902 | 92349000 | 0 | 0 | 0 | 11.45 | 5100 | 11.45 | 5100 | 
Function readRecord! doesn't support the string type. In this file, the date and time columns are stored as INT. To obtain temporal columns, convert the integers to strings, parse the strings with function temporalParse, and replace the original columns with function replaceColumn!.
tb.replaceColumn!(`date, tb.date.string().temporalParse("yyyyMMdd"))
tb.replaceColumn!(`time, tb.time.format("000000000").temporalParse("HHmmssSSS"))
select top 5 * from tb;
| id | date | time | last | volume | value | ask1 | ask_size1 | bid1 | bid_size1 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2019.09.02 | 09:18:04.000 | 0 | 0 | 0 | 11.45 | 200 | 11.45 | 200 | 
| 2 | 2019.09.02 | 09:20:07.000 | 0 | 0 | 0 | 11.45 | 200 | 11.45 | 200 | 
| 3 | 2019.09.02 | 09:20:46.000 | 0 | 0 | 0 | 11.45 | 1200 | 11.45 | 1200 | 
| 4 | 2019.09.02 | 09:23:46.000 | 0 | 0 | 0 | 11.45 | 1200 | 11.45 | 1200 | 
| 5 | 2019.09.02 | 09:23:49.000 | 0 | 0 | 0 | 11.45 | 5100 | 11.45 | 5100 | 
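The time column is zero-padded to 9 digits before parsing because a value such as 91804000 has only 8 digits, while the pattern HHmmssSSS expects 9. The Python sketch below reproduces the same conversion for the first row of the table. The helper names int_to_date and int_to_time are hypothetical, and the sketch only mirrors the arithmetic; it is not how temporalParse is implemented.

```python
from datetime import date, time


def int_to_date(yyyymmdd):
    """Split an integer such as 20190902 into year, month and day."""
    return date(yyyymmdd // 10000, yyyymmdd // 100 % 100, yyyymmdd % 100)


def int_to_time(hhmmsssss):
    """Zero-pad to 9 digits (HHmmssSSS), then slice out each field."""
    text = f"{hhmmsssss:09d}"  # 91804000 -> "091804000"
    millis = int(text[6:9])
    return time(int(text[0:2]), int(text[2:4]), int(text[4:6]), millis * 1000)


d = int_to_date(20190902)
t = int_to_time(91804000)
print(d.isoformat(), t.isoformat(timespec="milliseconds"))
# 2019-09-02 09:18:04.000
```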
Read and Write Multi-byte Integers and Floating-Point Numbers
The write function converts the specified buffer to a stream of bytes and saves it to the file. The buffer can be a scalar or a vector of various types. If an error occurs, an IOException is raised. Otherwise, the function returns the number of elements (not the number of bytes) written. The read! function reads a given number of elements into the buffer. For example, if the buffer is an INT vector, the function converts the bytes from the file to INT values. Both write and read! convert between streams of bytes and multi-byte words, and the byte order used in this conversion is called endianness. Big-endian order stores the most significant byte at the lowest address, whereas little-endian order stores the least significant byte at the lowest address. The write function always uses the endianness of the operating system. The read! function can convert the endianness if the endianness of the file differs from that of the operating system. When one opens a file with the file function, an optional boolean argument indicates whether the file uses the little-endian format. By default, the endianness of the operating system is assumed.
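The two byte orders can be checked with a few lines of Python, shown here as a language-neutral illustration rather than as DolphinDB code. The sketch packs the 2-byte integer 10 in each order and then deliberately decodes the little-endian bytes as big endian. The variable names are illustrative only.

```python
import struct
import sys

value = 10  # 0x000A as a 2-byte (short) integer

little = struct.pack("<h", value)  # least significant byte first: 0A 00
big = struct.pack(">h", value)     # most significant byte first:  00 0A
print(little.hex(), big.hex())     # 0a00 000a

# Decoding with the matching byte order recovers the value.
same = struct.unpack("<h", little)[0]

# Decoding little-endian bytes as big endian reads 0x0A00 instead of 0x000A.
misread = struct.unpack(">h", little)[0]
print(same, misread)               # 10 2560

# Byte order of the machine running this script ("little" or "big").
print(sys.byteorder)
```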
x=10h
y=0h
file("C:/DolphinDB/test.bin","w").write(x);
// output: 1
file("C:/DolphinDB/test.bin","r",true).read!(y);
// assume the file format is little endianness
// output: 1
y;
// output: 10
file("C:/DolphinDB/test.bin","r",false).read!(y);
// assume the file format is big endianness
// output: 1
y;
// output: 2560
We perform a simple experiment: write a short integer (2 bytes) with the value 10 to a file, then read it into another short integer variable y under both byte orders, little endian and big endian. As expected, the two readouts are 10 and 2560, respectively. If all file operations are performed on the same machine, one doesn't have to worry about endianness. But in a distributed system, one must pay attention to the endianness of network streams and files.
The above example uses a scalar as the buffer for read and write. The next example uses an INT vector as the buffer. It generates one million non-negative random integers below 10000, saves them to a file, then reads them back using a small buffer and calculates the sum.
n=1000000
x=rand(10000,n)
file("test.bin","w").write(x,0,n)
sum=0
buf=array(INT,1024)
fin=file("test.bin")
do{
   len = fin.read!(buf,0, 1024)
   if(len==1024)
      sum +=buf.sum()
   else
      sum += buf.subarray(0:len).sum()
}while(len == 1024)
fin.close()
sum;
// output: 4994363593
In addition to numbers, strings can also be saved to files in binary format. A null character (a byte with value zero) is appended to each string as a delimiter, so a string of n bytes occupies n+1 bytes in the file. The example below demonstrates the use of write and read! for strings. We first generate one million random stock tickers and save them to a file in binary format. Then we use a small buffer to read the entire file sequentially. After each read, we use the dictUpdate! function to update the count of each ticker.
file("test.bin","w").write(rand(`IBM`MSFT`GOOG`YHOO`C`FORD`MS`GS`BIDU,1000000));
// output: 1000000
words=dict(STRING,LONG)
buf=array(STRING,1024)
counts=array(LONG,1024,0,1)
fin=file("test.bin")
do{
   len = fin.read!(buf,0,1024)
   if(len==1024)
      dictUpdate!(words, +, buf, counts)
   else
      dictUpdate!(words, +, buf.subarray(0:len), counts.subarray(0:len))
}while(len==1024)
fin.close();
words;
/* output
MSFT->111294
BIDU->110800
FORD->110916
GS->111233
MS->110859
C->110591
YHOO->111069
GOOG->111972
IBM->111266
*/
words.values().sum();
// output: 1000000
Read and Write Object
The read! and write functions provide great flexibility for reading and writing binary data. However, one has to know the exact number of elements to write and read as well as their data types. Therefore, when dealing with complex data structures such as a matrix, table, or tuple, one has to design a complicated protocol to coordinate the write and read. DolphinDB offers 2 high-level functions, readObject and writeObject, to read and write whole objects. All data structures, including scalar, vector, matrix, set, dictionary, and table, can use these two functions.
a1=10.5
a2=1..10
a3=cross(*,1..5,1..10)
a4=set(`IBM`MSFT`GOOG`YHOO)
a5=dict(a4.keys(),125.6 53.2 702.3 39.7)
a6=table(1 2 3 as id, `Jenny`Tom`Jack as name)
a7=(1 2 3, "hello world!", 25.6)
fout=file("test.bin","w")
fout.writeObject(a1)
fout.writeObject(a2)
fout.writeObject(a3)
fout.writeObject(a4)
fout.writeObject(a5)
fout.writeObject(a6)
fout.writeObject(a7)
fout.close();
The script above writes 7 different types of objects to a file. The script below reads those 7 objects back from the file and prints a short description of each.
fin = file("test.bin")
for(i in 0:7) print typestr fin.readObject()
fin.close();
/* output
DOUBLE
FAST INT VECTOR
INT MATRIX
STRING SET
STRING->DOUBLE Dictionary
TABLE
ANY VECTOR
*/
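The idea of a self-describing object stream, where the reader needs no prior knowledge of element counts or types, also exists outside DolphinDB. As a loose analogy only, the Python sketch below writes several objects of different kinds to one stream with the standard pickle module and reads them back in order. pickle uses Python's own format, which is unrelated to the format written by writeObject, so the two kinds of files are not interchangeable.

```python
import io
import pickle

# Objects of different shapes, written one after another to the same stream.
objects = [
    10.5,
    list(range(1, 11)),
    {"IBM", "MSFT", "GOOG", "YHOO"},
    {"IBM": 125.6, "MSFT": 53.2},
    ([1, 2, 3], "hello world!", 25.6),
]

stream = io.BytesIO()  # in-memory stand-in for a binary file
for obj in objects:
    pickle.dump(obj, stream)  # each dump records the type and size itself

# Read the objects back in the same order without declaring any types.
stream.seek(0)
restored = [pickle.load(stream) for _ in objects]
kinds = [type(obj).__name__ for obj in restored]
print(kinds)  # ['float', 'list', 'set', 'dict', 'tuple']
```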