The TopN Functions

In data analysis, we often encounter scenarios like calculating the average stock price for the 10 most actively traded days within a 100-day window, or determining the average energy consumption for the 5 highest-temperature records within a 5-minute window for each device. Traditionally, such tasks require extracting data from databases and performing calculations in Python or other systems, which could be time-consuming, especially with large datasets.

DolphinDB offers the TopN built-in functions—including mTopN, tmTopN, and cumTopN. These functions not only simplify programming but also optimize performance through incremental calculations.

Note: The features described are available in DolphinDB versions 2.00.10/1.30.22 and higher.

1. Introduction

The TopN functions are designed to sort data based on a specific metric (such as price changes, trading volumes, etc.) and select the top N elements (or top N%) for further calculation. These functions can be categorized as follows:

  • mTopN Functions: Used for calculations within a sliding window, where the window size can be based on the number of elements or time interval. Examples include msumTopN, mcorrTopN, etc.
  • tmTopN Functions: Used for calculations within a time-based sliding window. Examples include tmsumTopN, tmcorrTopN, etc. N can also be a floating-point number between 0 and 1, representing the top N% of elements.
  • cumTopN Functions: Used for calculations within a cumulative window, such as calculating the sum of trading volumes for the top N days with the highest historical price changes for each stock. Examples include cumsumTopN, cumcorrTopN, etc.
  • aggrTopN Higher-Order Function: Used for calculation logic not covered by the TopN functions above. It can be combined with a user-defined aggregate function and higher-order window functions (such as moving, tmoving) to create customized TopN functions. N can be a floating-point number between 0 and 1, representing the top N% of elements.

The table below shows the functions categorized under the TopN functions:

Basic Function mTopN tmTopN cumTopN
sum msumTopN tmsumTopN cumsumTopN
avg mavgTopN tmavgTopN cumavgTopN
std mstdTopN tmstdTopN cumstdTopN
stdp mstdpTopN tmstdpTopN cumstdpTopN
var mvarTopN tmvarTopN cumvarTopN
varp mvarpTopN tmvarpTopN cumvarpTopN
skew mskewTopN tmskewTopN cumskewTopN
kurtosis mkurtosisTopN tmkurtosisTopN cumkurtosisTopN
beta mbetaTopN tmbetaTopN cumbetaTopN
corr mcorrTopN tmcorrTopN cumcorrTopN
covar mcovarTopN tmcovarTopN cumcovarTopN
wsum mwsumTopN tmwsumTopN cumwsumTopN
Figure 1. Table 1-1 TopN Functions
Note: More TopN functions may be added. For a complete list, please refer to mTopN, tmTopN and cumTopN.

Calculation Rules of the mTopN Functions

Diving into TopN functions, we'll use the mTopN functions as an example to unpack their core principles. In later sections, we'll zoom in on the specific applications of different TopN function categories, revealing their distinctive characteristics and shared traits. This will enable you to find the right function for for real-world problems.

The mTopN functions follow this syntax:

mTopN(X, S, window, top, [ascending=true], [tiesMethod])

The calculation proceeds as follows:

  1. The function sorts X based on S (ascending by default).
  2. Then, it picks the top N values. If there are equal values, the tiesMethod parameter decides how to handle them:
    • latest – Keeps the most recent data.
    • oldest – Keeps the oldest data.
    • all – Includes all tied values.

For full details, check the mTopN User Manual.

2. Sliding Window Calculations

The calculation in a sliding window involves sorting records within the window, selecting the top N records (or top N% of records), and then performing an aggregation.

This section illustrates the concept with practical examples. The following script generates minute-level OHLC data for demonstration:

n = 5*121
timeVector = 2023.04.30T09:30:00.000 + 0..120 * 60000
tradingTime = take(timeVector,n)
windCode = stretch(format(000001..000005, "000000") + "AA", n)
open = (20.00+0.01*0..120) join (30.00-0.01*0..120) join (40.00+0.01*0..120) join (50.00-0.01*0..120) join (60.00+0.01*0..120)
high = (20.50+0.01*0..120) join (31.00-0.01*0..120) join (40.80+0.01*0..120) join (50.90-0.01*0..120) join (60.70+0.01*0..120)
low = (19.50+0.01*0..120) join (29.00-0.01*0..120) join (39.00+0.01*0..120) join (48.00-0.01*0..120) join (59.00+0.01*0..120)
close = (20.00+0.01*0..120) join (30.00-0.01*0..120) join (40.00+0.01*0..120) join (50.00-0.01*0..120) join (60.00+0.01*0..120)
volume= 10000+ take(-100..100,n)
t = table(tradingTime, windCode, open, high, low, close, volume)

2.1 mTopN Functions

The mTopN functions operate on sliding windows that can be defined either by record count or by time interval. For details on window calculation rules, please refer to our comprehensive window functions tutorial.

To find the average price of the 3 records with the highest volume within a 5-record window for each stock in our simulated minute-level data, we can use mavgTopN with the SQL CONTEXT BY clause for grouping.

select windCode, tradingTime, mavgTopN(close, volume, 5, 3, false) as mavgTop3Close from t context by windCode
//output
windCode tradingTime             mavgTop3Close
-------- ----------------------- ------------------
000001AA 2023.04.30T09:30:00.000 20
000001AA 2023.04.30T09:31:00.000 20.005
000001AA 2023.04.30T09:32:00.000 20.01
000001AA 2023.04.30T09:33:00.000 20.02
000001AA 2023.04.30T09:34:00.000 20.03
...

Similarly, calculating the average price change for the 10 highest-volume records within a 100-minute sliding window:

select windCode, tradingTime, mavgTopN(ratios(close), volume, 100, 10, false) as mavgTop10RatioClose from t context by windCode, date(tradingTime)
//output
windCode tradingTime             mavgTop10RatioClose
-------- ----------------------- -------------------
000001AA 2023.04.30T09:30:00.000
000001AA 2023.04.30T09:31:00.000 1.0005
000001AA 2023.04.30T09:32:00.000 1.000499875
000001AA 2023.04.30T09:33:00.000 1.000499750
000001AA 2023.04.30T09:34:00.000 1.000499625
...

The topN funtions also support binary operations including beta, correlation, and covariance calculations. For example, to find the correlation between two factors for the top 3 highest-price records within a 5-record window, we can use the mcorrTopN function along with the SQL CONTEXT BY clause:

select windCode, tradingTime, mcorrTopN(low, close * volume,  log(ratios(close)), 5, 3, false) as mcorrTop3CloseVol from t context by windCode, date(tradingTime)
//output
windCode tradingTime             mcorrTop3CloseVol
-------- ----------------------- -----------------
000001AA 2023.04.30T09:30:00.000
000001AA 2023.04.30T09:31:00.000
000001AA 2023.04.30T09:32:00.000 1.00000
000001AA 2023.04.30T09:33:00.000 0.99999
000001AA 2023.04.30T09:34:00.000 0.99999

2.2 tmTopN Functions

The tmTopN functions use a time-based window defined by a specific interval, such as 5 minutes or 20 seconds. The window slides over data according to a specified temporal-type vector.

For the tmTopN functions, the top parameter can either be an integer (e.g., top N records) or a floating-point number between 0 and 1, which represents a percentage. For example, a value of 0.2 means selecting the top 20% of records within the window.

For example, to calculate the average return of the two records with the highest trading volume within a 3-minute window, we can use tmavgTopN:

select windCode, tradingTime, tmavgTopN(tradingTime, ratios(close), volume, 3m, 2, false) as tmavgTop2RatioClose from t context by windCode, date(tradingTime)
//output
windCode tradingTime             tmavgTop2RatioClose
-------- ----------------------- -------------------
000001AA 2023.04.30T09:30:00.000
000001AA 2023.04.30T09:31:00.000 1.0005
000001AA 2023.04.30T09:32:00.000 1.000499875
000001AA 2023.04.30T09:33:00.000 1.000499625
000001AA 2023.04.30T09:34:00.000 1.000499376
...

The tmTopN functions also encompass correlation-based calculations. For example, you can calculate the correlation between closing prices and volumes for the top 3 records with the highest trading volume in a 5-minute window using tmcorrTopN:

select windCode, tradingTime, tmcorrTopN(tradingTime, close, volume, volume, 5m, 3, false) as tmavgTop3CorrCloseVolume from t context by windCode, date(tradingTime)
//output
windCode tradingTime             tmavgTop3CorrCloseVolume
-------- ----------------------- ------------------------
000001AA 2023.04.30T09:30:00.000
000001AA 2023.04.30T09:31:00.000 0.999999990552169
000001AA 2023.04.30T09:32:00.000 1.000000001625267
000001AA 2023.04.30T09:33:00.000 1.000000006877599
000001AA 2023.04.30T09:34:00.000 1.000000012129931
...

3. Cumulative Window Calculations

Unlike the sliding window calculations of mTopN and tmTopN functions, the cumTopN functions focus on cumulative windows.

For example, to calculate the cumulative trading volume for the top 3 records with the largest price increases in history, you can use cumsumTopN:

select windCode, tradingTime, cumsumTopN(volume, ratios(close), 3, false) as cumsumTop3Volume from t context by windCode
//output
windCode tradingTime             cumsumTop3Volume
-------- ----------------------- ----------------
000001AA 2023.04.30T09:30:00.000
000001AA 2023.04.30T09:31:00.000 9901
000001AA 2023.04.30T09:32:00.000 19803
000001AA 2023.04.30T09:33:00.000 29706
000001AA 2023.04.30T09:34:00.000 29706
...

In the above script, we can see that the second parameter of cumsumTopN, the sorting metric S, is not a column but rather the result of a function applied to a column. This demonstrates the flexibility of the TopN functions.

4. Custom Logic Implementation

We've now covered the built-in TopN functions. Additionally, DolphinDB provides an interface for custom implementations, allowing you to define TopN functions that fit your specific requirements.

Note: The built-in TopN functions are optimized for their respective computation scenarios and generally outperform user-defined TopN functions.

4.1 Implementing a User-Defined TopN Function

The higher-order function aggrTopN follows this syntax:

aggrTopN(func, funcArgs, sortingCol, top, [ascending=true])

The func parameter accepts an aggregate function, which allows you to define a TopN function for aggregation. As a simple example, to compute the average OHLC for the top 40% of records with the highest trading volume using a user-defined TopN function, you can implement it as follows:

// Define a UDAF avgOHLC
defg avgOHLC(price){ 
	return avg(price)
}
select aggrTopN(avgOHLC, funcArgs =(open + high + low + close) , sortingCol=volume, top=0.4, ascending=true) as factor_value from t group by windCode
//output
windCode factor_value
-------- -------------------
000001AA 80.94
000002AA 116.56583
000003AA 160.74
000004AA 196.40
000005AA 240.76167

Similarly, sliding window TopN calculations also support user-defined aggregate functions (UDAFs). For example, the UDAF avgOHLC can be passed into the moving (or tmoving) higher-order function:

select windCode, tradingTime, moving(aggrTopN{avgOHLC,,,0.4,true},(open + high + low + close, volume),10,1) as udfmTopN from t context by windCode
//output
windCode  tradingTime             udfmTopN          
--------- ----------------------- ------------------
000001AA 2023.04.30T09:30:00.000 80 
...
000001AA 2023.04.30T09:39:00.000 80.06
000001AA 2023.04.30T09:40:00.000 80.10
000001AA 2023.04.30T09:41:00.000 80.14
...

4.2 Performance Comparison

Calculation in the example above can also be performed using the built-in mavgTopN function:

select windCode, tradingTime, mavgTopN(open + high + low + close, volume,10,4) as udfmTopN from t context by windCode
//output
windCode  tradingTime             udfmTopN          
--------- ----------------------- ------------------
000001AA 2023.04.30T09:30:00.000 80 
...
000001AA 2023.04.30T09:39:00.000 80.06
000001AA 2023.04.30T09:40:00.000 80.10
000001AA 2023.04.30T09:41:00.000 80.14
...

However, the built-in mTopN functions are optimized for their respective computation scenarios, offering better performance than the combination of moving and a UDAF.

We conducted a performance test between the built-in mavgTopN and its user-defined equivalents.

Test Environment:

  • CPU: Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
  • Operating System: 64-bit CentOS Linux 7 (Core)
  • DolphinDB Version: 2.00.10, Single Node Deployment

Script:

//moving+aggrTopN+avgOHLC
timer(10000)select windCode, tradingTime, moving(aggrTopN{avgOHLC,,,0.4,true},(open + high + low + close, volume),10,1) as udfmTopN from t context by windCode
//8394 ms
//moving+aggrTopN+avg
timer(10000)select windCode, tradingTime, moving(aggrTopN{avg,,,0.4,true},(open + high + low + close, volume),10,1) as udfmTopN from t context by windCode
//6812 ms
//mavgTopN
timer(10000)select windCode, tradingTime, mavgTopN(open + high + low + close, volume,10,4) as udfmTopN from t context by windCode
//1394 ms

Result:

TopN Execution Time (Running 10,000 times)
moving+aggrTopN+avgOHLC 8,394 ms
moving+aggrTopN+avg 6,812 ms
mavgTopN 1,394 ms

5. Stream Computing Scenarios

The TopN functions also support stream computing scenarios. For example, in a reactive state engine, to calculate the average price change of the top 10 records with the highest trading volume within a 100-minute sliding window:

//Factor expressions used in batch computation can be directly used in a streaming engine
factor = <mavgTopN(ratios(close), volume, 100, 10, false)>

// Define the input and output table schemas
share streamTable(1:0, `tradingTime`windCode`open`high`low`close`volume, [TIMESTAMP,STRING,DOUBLE,DOUBLE,DOUBLE,DOUBLE,INT]) as tickStream
result = table(1000:0, `windCode`tradingTime`mavgTop10RatioClose, [STRING,TIMESTAMP,DOUBLE])

//Define the streaming engine
rse = createReactiveStateEngine(name="streamTopN", metrics =[<tradingTime>, factor], dummyTable=t, outputTable=result, keyColumn="windCode")

//Subscribe to the stream table and replay data
subscribeTable(tableName=`tickStream, actionName="mTopN", handler=tableInsert{rse})
replay(inputTables=t.copy().sortBy!(`tradingTime), outputTables=tickStream, timeColumn=`tradingTime)

//Query the stream computation result
select * from result

//To run the script above again, use the following script to unsubscribe from the stream table first
unsubscribeTable(tableName=`tickStream, actionName="mTopN")
dropStreamEngine(`streamTopN)
undef(`tickStream, SHARED)

6. DECIMAL Type Support

Starting from version 2.00.10, the TopN functions support the use of DECIMAL types (including DECIMAL32, DECIMAL64, and DECIMAL128) for the sorting field (S parameter) and the calculation fields (X and Y).

If the calculation fields are of DECIMAL type, the functions msumTopN, tmsumTopN, and cumsumTopN will return results in DECIMAL type, while other TopN functions will return results in DOUBLE type.

A special note for the TopN versions of var, varp, std, stp, corr, covar, beta, and wsum: These functions perform intermediate calculations in DECIMAL128 when processing DECIMAL inputs to maintain precision, though they ultimately return DOUBLE results. However, using DECIMAL for intermediate results comes with trade-offs: computations take longer, and there is a risk of overflow. For performance optimization, the system doesn't throw overflow exceptions, so users should exercise additional caution with extreme values.

DECIMAL128 supports up to 38 significant digits (including both integer and decimal parts). For example, when calculating the variance of price data like 18.2345 (6 digits), squaring it results in 12 digits. With 100 million data points, the total precision used is about 20 (12+8) digits—well within the 38-digit limit, so overflow won't occur. However, if the number of decimal places of 18.2345 increases significantly—for instance, from 4 to 15—the number becomes 17 digits long, and squaring it gives 34 digits. Just 10,000 such numbers could exceed the 38-digit limit of DECIMAL128.

To prevent such issues, in cases where data has many decimal places, consider either converting it to DOUBLE for processing or using functions like decimal32, decimal64, or decimal128 to reduce precision before calculation.