# Binary File Processing {#binary-file-processing}

DolphinDB provides a wide range of functions for binary file processing, from reading and writing raw bytes to reading and writing high-level objects.


## Read and Write Raw Bytes {#read-and-write-raw-bytes}

The `writeBytes` function writes the entire buffer to the file. The buffer must be a CHAR scalar or a CHAR vector. If the operation succeeds, the function returns the number of bytes actually written; otherwise, it raises an IOException. The `readBytes` function reads a given number of bytes from the file. If the end of the file is reached or an I/O error occurs, it raises an IOException; otherwise it returns a buffer containing the requested number of bytes. One must therefore know the exact number of bytes to read before calling `readBytes`.

```
// define a file copy function
def fileCopy(source, target){
 s = file(source)
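 len = s.seek(0,TAIL)   // seek to the end to obtain the file length in bytes
 s.seek(0,HEAD)         // rewind to the beginning before reading
 t = file(target,"w")
 if(len==0) return
 do{
    // readBytes needs an exact count, so never request more than what remains
    buf = s.readBytes(min(len,1024))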
    t.writeBytes(buf)
    len -= buf.size()
 }while(len)
}

fileCopy("test.txt","testcopy.txt");
```

Each call to `readBytes` returns a new CHAR vector. As discussed in the section on text file processing, creating a new vector buffer takes time. To improve performance, we can create a buffer once and reuse it. [read!](../../Functions/r/read!.md) is a function that accepts an existing buffer. Another advantage of `read!` is that one doesn't have to know the exact number of bytes to read: the function returns when the end of the file is reached or the given number of bytes has been read. If the returned count is less than the requested count, the file has reached the end.

```
// define a file copy function using read! and write functions
def fileCopy2(source, target){
 s = file(source)
 t = file(target,"w")
 buf = array(CHAR,1024)
 do{
    numByte = s.read!(buf,0,1024)
    t.write(buf,0, numByte)
 }while(numByte==1024)
}

fileCopy2("test.txt","testcopy.txt");
```
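
The buffer-reuse idea behind `read!` is not specific to DolphinDB. As a rough illustration only, the Python sketch below copies a file using one preallocated buffer. `copy_with_reused_buffer` is a hypothetical helper name, and Python's `readinto` merely stands in for `read!`; none of this is DolphinDB API.

```python
import os
import tempfile


def copy_with_reused_buffer(source, target, chunk_size=1024):
    """Copy source to target, reusing one preallocated buffer for every read."""
    buf = bytearray(chunk_size)            # allocated once, outside the loop
    view = memoryview(buf)                 # lets us slice without copying
    total = 0
    with open(source, "rb") as src, open(target, "wb") as dst:
        while True:
            n = src.readinto(buf)          # fills buf in place, returns bytes read
            if n == 0:                     # nothing left: end of file
                break
            dst.write(view[:n])            # write only the valid part of the buffer
            total += n
    return total


# Small demonstration on a temporary file (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "test.txt")
    dst_path = os.path.join(tmp, "testcopy.txt")
    with open(src_path, "wb") as f:
        f.write(b"x" * 2500)               # not a multiple of the buffer size
    print(copy_with_reused_buffer(src_path, dst_path))
```

The last read fills only part of the buffer, which is why the sketch writes `view[:n]` rather than the whole buffer, just as `fileCopy2` passes `numByte` to `write`.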

The running time of a file copy function is dominated by the write operations. To compare `readBytes` with `read!` directly, the experiment below times the read loops alone.

```
fileLen = file("test.txt").seek(0, TAIL)
timer(1000){
   fin = file("test.txt")
   len = fileLen
   do{
      buf = fin.readBytes(min(len,1024))
      len -= buf.size()
   }while(len)
};

// Time elapsed: 210.593 ms

timer(1000){
   fin = file("test.txt")
   buf = array(CHAR,1024)
   do{numBytes = fin.read!(buf,0,1024)}while(numBytes==1024)
};

// Time elapsed: 194.519 ms
```

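In this test, `read!` is roughly 8% faster than `readBytes` \(194.519 ms versus 210.593 ms\), because it reuses one buffer instead of creating a new vector on every call.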

Function `readRecord!` converts a binary file to a DolphinDB object. The binary file is read row by row, and each row must consist of fields with fixed data types and lengths. For example, if a binary file contains 6 data fields with the following types \(length\): char\(1\), boolean\(1\), short\(2\), int\(4\), long\(8\), and double\(8\), the function `readRecord!` will treat every 24 bytes as a new row.
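
The 24-byte row size follows directly from adding up the field lengths. As an illustration outside DolphinDB, the Python sketch below describes the same six field types with the standard `struct` module and splits a byte string into fixed-length rows. The names `RECORD`, `pack_rows`, and `unpack_rows` are hypothetical, and the little-endian, unpadded layout is an assumption made for this sketch, not a statement about the file format `readRecord!` expects.

```python
import struct

# "<" selects a little-endian layout without padding bytes (an assumption here).
# c = char(1), ? = boolean(1), h = short(2), i = int(4), q = long(8), d = double(8)
RECORD = struct.Struct("<c?hiqd")


def pack_rows(rows):
    """Serialize each row to RECORD.size bytes and concatenate them."""
    return b"".join(RECORD.pack(*row) for row in rows)


def unpack_rows(blob):
    """Split blob into fixed-length rows; len(blob) must be a multiple of RECORD.size."""
    return list(RECORD.iter_unpack(blob))


rows = [(b"A", True, 1, 100, 10**10, 1.5), (b"B", False, -2, 200, -5, 2.25)]
blob = pack_rows(rows)
print(RECORD.size)                 # 24
print(len(blob))                   # 48
print(unpack_rows(blob) == rows)   # True
```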

The following example shows how to import a binary file, binSample.bin, with `readRecord!`.

First, create an in-memory table:

```
tb=table(1000:0, `id`date`time`last`volume`value`ask1`ask_size1`bid1`bid_size1, [INT,INT,INT,FLOAT,INT,FLOAT,FLOAT,INT,FLOAT,INT])
```

Open the file with function `file`, then import its contents with function `readRecord!`. The data is loaded into table tb.

```
dataFilePath="/home/DolphinDB/binSample.bin"
f=file(dataFilePath)
f.readRecord!(tb);
select top 5 * from tb;
```

|id|date|time|last|volume|value|ask1|ask\_size1|bid1|bid\_size1|
|---|----|----|----|------|-----|----|----------|----|----------|
|1|20190902|91804000|0|0|0|11.45|200|11.45|200|
|2|20190902|92007000|0|0|0|11.45|200|11.45|200|
|3|20190902|92046000|0|0|0|11.45|1200|11.45|1200|
|4|20190902|92346000|0|0|0|11.45|1200|11.45|1200|
|5|20190902|92349000|0|0|0|11.45|5100|11.45|5100|

Function `readRecord!` doesn't support the string type. In this example, the date and time columns are loaded as INT. To turn them into temporal types, convert the integers to strings, parse the strings with function `temporalParse`, and replace the original columns with function `replaceColumn!`.

```
tb.replaceColumn!(`date, tb.date.string().temporalParse("yyyyMMdd"))
tb.replaceColumn!(`time, tb.time.format("000000000").temporalParse("HHmmssSSS"))
select top 5 * from tb;
```

|id|date|time|last|volume|value|ask1|ask\_size1|bid1|bid\_size1|
|---|----|----|----|------|-----|----|----------|----|----------|
|1|2019.09.02|09:18:04.000|0|0|0|11.45|200|11.45|200|
|2|2019.09.02|09:20:07.000|0|0|0|11.45|200|11.45|200|
|3|2019.09.02|09:20:46.000|0|0|0|11.45|1200|11.45|1200|
|4|2019.09.02|09:23:46.000|0|0|0|11.45|1200|11.45|1200|
|5|2019.09.02|09:23:49.000|0|0|0|11.45|5100|11.45|5100|
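
The integer encodings used in this table can be decoded by hand, which makes the `"yyyyMMdd"` and `"HHmmssSSS"` patterns easier to follow. The Python sketch below is an illustration only; `int_to_date` and `int_to_time` are hypothetical helpers, not DolphinDB functions. Note that the time value must be zero-padded to 9 digits first, which is what `format("000000000")` does in the script above.

```python
from datetime import date, time


def int_to_date(yyyymmdd):
    """Decode an integer such as 20190902 into a date."""
    return date(yyyymmdd // 10000, yyyymmdd // 100 % 100, yyyymmdd % 100)


def int_to_time(hhmmsssss):
    """Decode an integer such as 91804000 (HHmmssSSS) into a time."""
    text = f"{hhmmsssss:09d}"              # zero-pad: 91804000 -> "091804000"
    hour, minute, second = int(text[0:2]), int(text[2:4]), int(text[4:6])
    millis = int(text[6:9])
    return time(hour, minute, second, millis * 1000)   # last argument is microseconds


print(int_to_date(20190902).isoformat())                          # 2019-09-02
print(int_to_time(91804000).isoformat(timespec="milliseconds"))   # 09:18:04.000
```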

## Read and Write Multi-byte Integers and Floating-Point Numbers {#read-and-write-multi-byte-integer-and-floating-number}

The `write` function converts the specified buffer to a stream of bytes and saves it to the file. The buffer can be a scalar or a vector of various types. If an error occurs, an IOException is raised; otherwise, the function returns the number of elements \(not the number of bytes\) written. The `read!` function reads a given number of elements into the buffer. For example, if the buffer is an INT vector, the function converts the bytes from the file to INT values.

Both `write` and `read!` convert between streams of bytes and multi-byte words, so the byte order, known as endianness, matters. Big-endian format stores the most significant byte at the lowest address, whereas little-endian format stores the least significant byte at the lowest address. The `write` function always uses the endianness of the operating system. The `read!` function can convert the endianness if the endianness of the file differs from that of the operating system. The `file` function accepts an optional boolean argument that indicates whether the file uses the little-endian format. By default, the endianness of the operating system is assumed.
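
To make the two byte orders concrete, the Python sketch below (an illustration outside DolphinDB, using only the standard `struct` module) packs the 2-byte integer 10 in both orders and then reads the little-endian bytes back with each interpretation.

```python
import struct
import sys

value = 10                                  # 0x000A as a 2-byte integer
little = struct.pack("<h", value)           # least significant byte first: 0A 00
big = struct.pack(">h", value)              # most significant byte first:  00 0A

same_order = struct.unpack("<h", little)[0]   # interpreted with the order it was written in
swapped = struct.unpack(">h", little)[0]      # same bytes read as big-endian: 0x0A00

print(sys.byteorder)                        # byte order of the machine running this sketch
print(little.hex(), big.hex())              # 0a00 000a
print(same_order, swapped)                  # 10 2560
```

Reading with the wrong byte order does not raise an error; it silently yields a different number, which is why the file's endianness has to be known in advance.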

```
x=10h
y=0h
file("C:/DolphinDB/test.bin","w").write(x);
// output: 1

file("C:/DolphinDB/test.bin","r",true).read!(y);
// assume the file format is little endianness
// output: 1

y;
// output: 10

file("C:/DolphinDB/test.bin","r",false).read!(y);
// assume the file format is big endianness
// output: 1

y;
// output: 2560
```

In this simple experiment, we write a short integer \(2 bytes\) with a value of 10 to the file and read it back into another short integer variable y, first as little-endian and then as big-endian. As expected, the two readouts are 10 and 2560: the bytes 0x0A 0x00 form 0x000A = 10 when read as little-endian and 0x0A00 = 2560 when read as big-endian. If all file operations are performed on the same machine, one doesn't have to worry about endianness. In a distributed system, however, one must pay attention to the endianness of network streams and files.

The example above uses a scalar as the buffer for reading and writing. The next example uses an INT vector as the buffer. It generates one million random integers between 0 and 10000, saves them to a file, then reads them back using a small buffer and calculates the sum.

```
n=1000000
x=rand(10000,n)
file("test.bin","w").write(x,0,n)
sum=0
buf=array(INT,1024)
fin=file("test.bin")
do{
   len = fin.read!(buf,0, 1024)
   if(len==1024)
      sum +=buf.sum()
   else
      sum += buf.subarray(0:len).sum()
}while(len == 1024)
fin.close()
sum;

// output: 4994363593
```
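
The same pattern, reading typed elements in fixed-size chunks and handling a short final chunk, can be sketched in Python for comparison. This is an illustration only: `write_ints` and `sum_ints_chunked` are hypothetical helpers, and the standard `array` module with type code `"i"` (a native-endian signed int, typically 4 bytes) stands in for an INT vector.

```python
import array
import os
import random
import tempfile


def write_ints(path, values):
    """Write the values as native-endian C ints."""
    with open(path, "wb") as f:
        array.array("i", values).tofile(f)


def sum_ints_chunked(path, chunk_elems=1024):
    """Sum all ints in the file, reading at most chunk_elems elements at a time."""
    item_size = array.array("i").itemsize     # bytes per element
    total = 0                                 # Python ints do not overflow
    with open(path, "rb") as f:
        while True:
            raw = f.read(chunk_elems * item_size)
            if not raw:                       # end of file
                break
            chunk = array.array("i")
            chunk.frombytes(raw)              # the last chunk may hold fewer elements
            total += sum(chunk)
    return total


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.bin")
    values = [random.randrange(10000) for _ in range(5000)]
    write_ints(path, values)
    print(sum_ints_chunked(path) == sum(values))   # True
```

The total printed in the DolphinDB example above, 4994363593, is larger than the maximum 32-bit signed integer \(2147483647\), so an accumulator for this kind of sum needs more than 32 bits.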

In addition to numbers, strings can also be saved to files in binary format. A null character \(a byte with value zero\) is appended to each string as a delimiter, so a string of n bytes occupies n+1 bytes in the file. The example below demonstrates the use of `write` and `read!` with strings. We first generate one million random stock tickers and save them to a file in binary format. Then we use a small buffer to read the entire file sequentially. After each read, we use the `dictUpdate!` function to update the count of each ticker.

```
file("test.bin","w").write(rand(`IBM`MSFT`GOOG`YHOO`C`FORD`MS`GS`BIDU,1000000));
// output: 1000000
```

```
words=dict(STRING,LONG)
buf=array(STRING,1024)
counts=array(LONG,1024,0,1)
fin=file("test.bin")
do{
   len = fin.read!(buf,0,1024)
   if(len==1024)
      dictUpdate!(words, +, buf, counts)
   else
      dictUpdate!(words, +, buf.subarray(0:len), counts.subarray(0:len))
}while(len==1024)
fin.close();

words;

/* output
MSFT->111294
BIDU->110800
FORD->110916
GS->111233
MS->110859
C->110591
YHOO->111069
GOOG->111972
IBM->111266
*/

words.values().sum();

// output: 1000000
```
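
The null-terminated layout described in this section can be checked with a few lines of Python. The sketch below is an illustration outside DolphinDB; `encode_strings` and `decode_strings` are hypothetical helpers, and UTF-8 encoding is an assumption that makes no difference for the ASCII tickers used here.

```python
from collections import Counter


def encode_strings(words):
    """Append a zero byte to every string, so n bytes of text occupy n+1 bytes."""
    return b"".join(w.encode("utf-8") + b"\x00" for w in words)


def decode_strings(blob):
    """Split on the zero-byte delimiter; drop the empty piece after the last one."""
    return [piece.decode("utf-8") for piece in blob.split(b"\x00")[:-1]]


tickers = ["IBM", "MSFT", "GOOG", "C", "IBM"]
blob = encode_strings(tickers)

print(len(blob))                              # 20 = (3+4+4+1+3) text bytes + 5 delimiters
print(decode_strings(blob) == tickers)        # True
print(Counter(decode_strings(blob))["IBM"])   # 2
```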

## Read and Write Object {#read-and-write-object}

The `read!` and `write` functions provide great flexibility for reading and writing binary data. However, one has to know the exact number of elements to write and read as well as their data types. Therefore, when dealing with complex data structures such as matrices, tables, or tuples, one has to design a complicated protocol to coordinate writing and reading. DolphinDB offers 2 high-level functions, [readObject](../../Functions/r/readObject.md) and [writeObject](../../Functions/w/writeObject.md), for reading and writing whole objects. All data structures, including scalars, vectors, matrices, sets, dictionaries, and tables, can be used with these two functions.

```
a1=10.5
a2=1..10
a3=cross(*,1..5,1..10)
a4=set(`IBM`MSFT`GOOG`YHOO)
a5=dict(a4.keys(),125.6 53.2 702.3 39.7)
a6=table(1 2 3 as id, `Jenny`Tom`Jack as name)
a7=(1 2 3, "hello world!", 25.6)

fout=file("test.bin","w")
fout.writeObject(a1)
fout.writeObject(a2)
fout.writeObject(a3)
fout.writeObject(a4)
fout.writeObject(a5)
fout.writeObject(a6)
fout.writeObject(a7)
fout.close();
```

The script above writes seven objects of different types to a file. The script below reads those seven objects back from the file and prints a short description of each.

```
fin = file("test.bin")
for(i in 0:7) print typestr fin.readObject()
fin.close();

/* output
DOUBLE
FAST INT VECTOR
INT MATRIX
STRING SET
STRING->DOUBLE Dictionary
TABLE
ANY VECTOR
*/
```
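
For readers more familiar with other environments, the write-several-objects-then-read-them-back-in-order pattern resembles sequential use of Python's `pickle` module. The sketch below is an analogy only: `write_objects` and `read_objects` are hypothetical helpers, and the pickle format is Python's own and unrelated to the format produced by `writeObject`.

```python
import os
import pickle
import tempfile


def write_objects(path, objects):
    """Serialize the objects one after another into a single file."""
    with open(path, "wb") as f:
        for obj in objects:
            pickle.dump(obj, f)


def read_objects(path, count):
    """Read back `count` objects in the order they were written."""
    with open(path, "rb") as f:
        return [pickle.load(f) for _ in range(count)]


objects = [
    10.5,                                # scalar
    list(range(1, 11)),                  # vector-like list
    {"IBM", "MSFT", "GOOG", "YHOO"},     # set
    {"IBM": 125.6, "MSFT": 53.2},        # dictionary
    ([1, 2, 3], "hello world!", 25.6),   # tuple of mixed types
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.bin")
    write_objects(path, objects)
    restored = read_objects(path, len(objects))

for obj in restored:
    print(type(obj).__name__)

print(restored == objects)               # True
```

As with `readObject`, the reader does not need to specify element counts or data types, because each serialized object carries its own type and size information.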

