Precision and Accuracy in DECIMAL Calculations
1. Characteristics of DECIMAL Calculations
1.1. Calculation Logic
In DolphinDB, a DECIMAL value is divided into two parts for storage: a raw integer containing the raw data, and a scale denoting the number of decimal places. For example, the decimal value 1.23 with a precision of 2 decimal places would be stored as the raw integer 123 and the scale: 2. This storage approach allows calculations to be performed directly on the integer data to avoid precision loss.
When calculating DECIMAL values with functions that return Floating results, calculations are first performed on the integers. The eventual result is converted to Floating type to avoid precision loss during calculation. For example, when calculating the average of DECIMAL32(2) values [1.11, 2.22, 3.33], DolphinDB first computes the sum of the integers 111 + 222 + 333 = 666
, and then converts the result to DOUBLE type and divides by the count double(666) / 10^2 / 3 = 2.22
.
For the rules of DECIMAL arithmetic operations, see DECIMAL Data Type.
1.2. Data Types of Results
This section discusses the return types when a computational function takes DECIMAL values as inputs.
Only the following DDB functions take DECIMAL inputs and also return DECIMAL results to retain the full decimal precision throughout: sum
, max
, min
, first
, last
, firstNot
, lastNot
, and their corresponding cum, m, tm and TopN functions, as well as cumPositiveStreak
.
a = decimal64(rand(10,100),4)
typestr(sum(a))
// output: DECIMAL128
typestr(cummax(a))
// output: FAST DECIMAL64 VECTOR
typestr(mmin(a,5))
// output: FAST DECIMAL64 VECTOR
T = 2023.03.23..2023.06.30
typestr(tmsum(T, a, 3))
// output: FAST DECIMAL128 VECTOR
typestr(cumPositiveStreak(a))
// output: FAST DECIMAL128 VECTOR
Note that DECIMAL128 is introduced since version 2.00.10, and the input and output of sum-related functions (including sum
/cumsum
/msum
/tmsum
/msumTopN
/cumsumTopN
/tmsumTopN
/cumPositiveStreak
) are as follows:
Input Type | Output Type |
---|---|
DECIMAL32 | DECIMAL64 |
DECIMAL64 | DECIMAL128 |
DECIMAL128 | DECIMAL128 |
Except the functions listed above, functions such as avg
, std
, var
, and skew
, and their corresponding cum, m, tm, TopN functions, take DECIMAL inputs and return DOUBLE results.
a = decimal64(rand(10,100),4)
typestr(avg(a))
// output: DOUBLE
typestr(cumstd(a))
// output: FAST DOUBLE VECTOR
typestr(mvar(a,5))
// output: FAST DOUBLE VECTOR
T = 2023.03.23..2023.06.30
typestr(tmskew(T, a, 3))
// output: FAST DOUBLE VECTOR
The following table lists the output data types when calculating DECIMAL64 data:
Function | Input Type | Output Type |
---|---|---|
sum/cumsum/msum/tmsum | DECIMAL64 | DECIMAL128 |
msumTopN/cumsumTopN/tmsumTopN | DECIMAL64 | DECIMAL128 |
max/cummax/mmax/tmmax | DECIMAL64 | DECIMAL64 |
min/cummin/mmin/tmmin | DECIMAL64 | DECIMAL64 |
first/mfirst/tmfirst | DECIMAL64 | DECIMAL64 |
last/mlast/tmlast | DECIMAL64 | DECIMAL64 |
firstNot/cumfirstNot | DECIMAL64 | DECIMAL64 |
lastNot/cumlastNot | DECIMAL64 | DECIMAL64 |
cumPositiveStreak | DECIMAL64 | DECIMAL128 |
avg/cumavg/mavg/tmavg/mavgTopN/cumavgTopN/tmavgTopN | DECIMAL64 | DOUBLE |
sum2/cumsum2/msum2/tmsum2 | DECIMAL64 | DOUBLE |
prod/cumprod/mprod/tmprod | DECIMAL64 | DOUBLE |
med/cummed/mmed/tmmed | DECIMAL64 | DOUBLE |
std/cumstd/mstd/tmstd/mstdTopN/cumstdTopN/tmstdTopN | DECIMAL64 | DOUBLE |
skew/mskew/tmskew/mskewTopN/cumskewTopN/tmskewTopN | DECIMAL64 | DOUBLE |
kurtosis/mkurtosis/tmkurtosis/mkurtosisTopN/cumkurtosisTopN/tmkurtosisTopN | DECIMAL64 | DOUBLE |
corr/cumcorr/mcorr/tmcorr/mcorrTopN/cumcorrTopN/tmcorrTopN | DECIMAL64 | DOUBLE |
covar/cumcovar/mcovar/tmcovar/mcovarTopN/cumcovarTopN/tmcovarTopN | DECIMAL64 | DOUBLE |
beta/cumbeta/mbeta/tmbeta/mbetaTopN/cumbetaTopN/tmbetaTopN | DECIMAL64 | DOUBLE |
2. Performance and Accuracy Tradeoffs
2.1. Advantages of DECIMAL
Floating numbers have inherent limitations in accurately representing real values in computer systems. This stems from two main reasons. First, binary floating-point formats can only approximate decimal values that have finite representations, such as 0.1. The binary representation cannot precisely encode such decimal fractions. Second, Floating types have a finite range and precision. If a number exceeds the maximum or minimum representable value, the computer system will round or truncate it, introducing further imprecision. In contrast, the DECIMAL data type is capable of precisely representing and computing with data that has an exact decimal representation.
For example, the Floating type is unable to precisely represent the value 123.0001, whereas the DECIMAL type can accurately represent it.
a =123.0001
print(a)
// output: 123.000100000000003
b = decimal64(`123.0001,15)
print(b)
// output: 123.000100000000000
Calculating avg
of Floating numbers may yield precision errors, whereas DECIMAL returns an accurate result of DOUBLE type.
a = array(DOUBLE,0)
for (i in 1..100){
a.append!(123.0000+0.0003*i)
}
avg(a)
// output: 123.015149999999
avg(a) == 123.01515
// output: false
eqFloat(avg(a),123.01515)
// output: true
b= array(DECIMAL64(4),0)
for (i in 1..100){
b.append!(123.0000+0.0003*i)
}
avg(b)
// output: 123.015150000000
typestr(avg(b))
// output: DOUBLE
avg(b) == 123.01515
// output: true
2.2. Disadvantages of DECIMAL
Overflow Issues
The numeric ranges for the DECIMAL32/DECIMAL64/DECIMAL128 types are as shown in the table below, where S designates the preserved decimal digits.
Underlying Storage Data Types | Byte Size | Scale Range | Valid Numeric Range | Maximum Number of Digits | |
---|---|---|---|---|---|
DECIMAL32 | int32_t | 4 | [0,9] | (-1 * 10 (9 - S), 1 * 10 (9 - S)) | 9 |
DECIMAL64 | int64_t | 8 | [0,18] | (-1 * 10 (18 - S), 1 * 10 (18 - S)) | 18 |
DECIMAL128 | int128_t | 16 | [0,38] | (-1 * 10 (38 - S), 1 * 10 (38 - S)) | 38 |
DECIMAL values may encounter overflow issues constrained by the valid numeric range and maximum number of digits. As of version 2.00.10, DolphinDB supports automatic conversion to a higher precision DECIMAL type to avoid overflow in arithmetic operations:
version 2.00.9.6:
a = decimal32(4.0000,4)
b = decimal32(8.0000,4)
c = a*b
// Server response: 'c = a * b => Decimal math overflow'
version 2.00.10:
a = decimal32(4.0000,4)
b = decimal32(8.0000,4)
c = a*b
print(c)
// output: 32.00000000
typestr(c)
// output: DECIMAL64
However, there still remains overflow risks with DECIMAL128:
a = decimal128(36.00000000,8)
b = a*a*a*a
// Server response: 'b = a * a * a * a => Decimal math overflow'
As shown in the above example, when multiplying two DECIMAL values in DolphinDB, the result scale is determined by summing the operand scales. This can lead to an overflow error if the number of digits exceeds the DECIMAL precision limit. In the case of variable "b", the expected result is 1679616.00000000000000000000000000000000, which contains 39 digits. This exceeds the maximum representation of 38 digits for DECIMAL128 and leads to an overflow.
As of version 2.00.10, the decimalMultiply
function is introduced for DECIMAL multiplication. Compared with the multiply
function (and *
operator), this function allows specifying the precision of result. In cases where multiplying DECIMAL values could lead to overflow due to accumulated scales, decimalMultiply
can be used to set the result precision explicitly and avoid overflow.
a = decimal64(36.00000000,8) decimalMultiply(a, a, 8)
// output: 1296.00000000
Conversion Errors
If we generate DECIMAL values from constants, errors might occur due to the conversion of floating-point numbers:
a = 0.5599
decimal64(a, 4)
// output: 0.5598
To mitigate such errors, using strings to generate DECIMAL types is recommended:
a = "0.5599"
decimal64(a, 4)
// output: 0.5599
Memory Usage
DECIMAL32, DECIMAL64, and DECIMAL128 types occupy 4 bytes, 8 bytes, and 16 bytes of memory. FLOAT and DOUBLE types use 4 bytes and 8 bytes. DECIMAL128 could take up twice the memory of DOUBLE to represent the same value. Additionally, arithmetic overflows in DECIMAL32 and DECIMAL 64 that trigger type promotion may double the memory usage. This can pose a potential memory risk when working with large datasets of DECIMAL type.
Computing Performance
Compared to FLOAT and DOUBLE types, calculations of DECIMAL are slower due to the more complex internal representation. Specific performance comparisons will be presented in the next chapter.
Limitations
Currently, the DECIMAL type in DolphinDB has fewer functions and less compatibility than FLOAT/DOUBLE:
- Some computational functions do not yet support DECIMAL data types.
- Most statistical functions will generate Floating results even when taking DECIMAL numbers as inputs.
- Matrices and sets do not support DECIMAL data types.
- For data type conversion, the DolphinDB system currently does not support the conversion between DECIMAL types and BOOL/CHAR/SYMBOL/UUID/IPADDR/INT128 types or temporal types. To convert STRING or BLOB type to DECIMAL, the string must be convertible to a numeric type.
3. Floating v.s. DECIMAL: Performance Comparison
To evaluate the computational performance of Floating and DECIMAL types, the following example uses several commonly-used computational functions and measures the computation time for each data type using the same dataset.
First, execute the following script to simulate data:
n = 1000000
data1 = rand(float(100.0), n)
data2 = double(data1)
data3 = decimal32(data1, 4)
data4 = decimal64(data1, 4)
data5 = decimal128(data1, 4)
The timer
statement is used to measure the computation time:
timer(100) { sum(data1) } // Executing 100 times to lower single computation errors and magnify differences in time consumption between different types
timer(100) { sum(data2) }
timer(100) { sum(data3) }
timer(100) { sum(data4) }
timer(100) { sum(data5) }
...
Below is the computation time (in ms) for each function and data type:
Function | FLOAT | DOUBLE | DECIMAL32 | DECIMAL64 | DECIMAL128 |
---|---|---|---|---|---|
sum | 52.985 | 64.331 | 28.721 | 57.234 | 107.126 |
sum2 | 173.278 | 211.772 | 176.261 | 180.051 | 603.422 |
prod | 63.650 | 55.641 | 171.718 | 174.389 | 18850.009 |
avg | 178.086 | 210.495 | 26.094 | 106.869 | 258.508 |
std | 362.388 | 345.758 | 278.933 | 319.177 | 968.276 |
kurtosis | 547.751 | 407.561 | 585.859 | 769.879 | 1013.591 |
skew | 546.705 | 439.348 | 622.758 | 842.994 | 1074.128 |
The performance statistics reveal the performance differences when calculating Floating and DECIMAL values in DolphinDB:
(1) For most of the computational functions, computing DECIMAL is slower than FLOAT/DOUBLE types.
(2) During computations, DECIMAL128 is converted into LONG DOUBLE (and DECIMAL32/DECIMAL64 into DOUBLE). A LONG DOUBLE can take up 12 or 16 bytes, depending on the compiler and CPU. Multiplying LONG DOUBLE values can be highly time-consuming when dealing with huge volume of data. In the given example, the values of the vectors are relatively large, resulting in large products and slower calculations. However, when working with smaller value ranges like [0.5, 1.0], the calculation speed of DECIMAL128 is more comparable to that of DECIMAL32/64.
(3) When conducting calculations with DECIMAL types, certain functions exhibit minimal performance differences compared to Floating numbers. DECIMAL32/64 can even achieve performance comparable to or even better than FLOAT/DOUBLE. This is because DECIMAL computations essentially operate on raw integers, allowing results to be returned directly as DECIMAL or converted to Floating types. Additionally, operations on Floating data (such as sum
) may involve loop unrolling and consequently take more time during calculations.
4. Best Practice for Calculating DECIMAL: mavg
In this chapter, the precision of calculations in DECIMAL and Floating types is measured, with a specific focus on the precision loss that occurs during calculations such as moving average.
DolphinDB provides two methods for calculating moving average: mavg
and moving(avg, ...)
. However, the calculation logic differs. The mavg
algorithm involves adding new values to the window and subtracting unneeded values while calculating the average. This process introduces precision issues due to the addition and subtraction operations. On the other hand, moving(avg, ...)
calculates the average of all the elements every time when a window is determined.
First, import test data from tick.csv (with data of DOUBLE type), and calculate mavg
and moving(avg, ...)
:
data = loadText("<yourDirectory>/tick.csv")
t = select
MDTime,
((LastPx - prev(LastPx)) / (prev(LastPx) + 1E-10) * 1000) as val,
mavg(((LastPx - prev(LastPx)) / (prev(LastPx) + 1E-10) * 1000), 20, 1),
moving(avg, ((LastPx - prev(LastPx)) / (prev(LastPx) + 1E-10) * 1000), 20, 1)
from data
The result is as shown below. Calculation errors start to appear between the two methods from 09:31:53.000 onwards.
To mitigate these calculation errors, the following script converts the intermediate calculation results to DECIMAL128 type before invoking mavg
and moving(avg, ...)
:
data = loadText("<yourDirectory>/tick.csv")
t = select
MDTime,
((LastPx - prev(LastPx)) / (prev(LastPx) + 1E-10) * 1000) as val,
mavg(decimal128(((LastPx - prev(LastPx)) / (prev(LastPx) + 1E-10) * 1000), 4), 20, 1),
moving(avg, decimal128(((LastPx - prev(LastPx)) / (prev(LastPx) + 1E-10) * 1000), 4), 20, 1)
from data
The result is as shown below:
As observed, the results of mavg
and moving(avg, ...)
are identical.
Functions like avg
and std
, and their corresponding cum, tm, m, TopN functions, may encounter precision errors in computations due to the precision issues in floating-point calculations. In situations where precision is important, it is recommended to use DECIMAL for calculations.
5. Conclusion
Floating-point representations can introduce small errors and imprecise values due to their internal binary storage. The DECIMAL data types in DolphinDB provide accurate representations to avoid these precision issues. However, using DECIMAL comes with drawbacks such as potential overflow, higher memory usage, and relatively lower performance. Despite these limitations, in certain scenarios, using the DECIMAL types can effectively prevent discrepancies between expected and computed outcomes. In summary, DECIMAL types provide accurate representations at the potential expense of performance and scalability.