where

The where clause is used to extract only the records that satisfy the specified condition(s).

A SQL query with a where clause first loads and filters the columns specified in the where conditions, then retrieves data from other columns corresponding to the filtered results. When the columns specified in the where conditions account for a large proportion of the queried data, you can specify keyword [HINT_PRELOAD] to preload all data into memory before filtering to improve the query performance.

Examples

sym = `C`MS`MS`MS`IBM`IBM`C`C`C$SYMBOL
price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
timestamp = [09:34:07,09:36:42,09:36:51,09:36:59,09:32:47,09:35:26,09:34:16,09:34:26,09:38:12]
t1 = table(timestamp, sym, qty, price);
t1;
timestamp sym qty price
09:34:07 C 2200 49.6
09:36:42 MS 1900 29.46
09:36:51 MS 2100 29.52
09:36:59 MS 3200 30.02
09:32:47 IBM 6800 174.97
09:35:26 IBM 5400 175.23
09:34:16 C 1300 50.76
09:34:26 C 2500 50.32
09:38:12 C 8800 51.29

Use where clause with one condition:

select * from t1 where sym=`IBM;
timestamp sym qty price
09:32:47 IBM 6800 174.97
09:35:26 IBM 5400 175.23
select * from t1 where timestamp.minute()>=09:36m;
timestamp sym qty price
09:36:42 MS 1900 29.46
09:36:51 MS 2100 29.52
09:36:59 MS 3200 30.02
09:38:12 C 8800 51.29

In where clause, "==" is equivalent to "=".

select * from t1 where sym==`IBM;
timestamp sym qty price
09:32:47 IBM 6800 174.97
09:35:26 IBM 5400 175.23

Use where clause with multiple conditions:

select * from t1 where sym=`IBM and qty>=2000 or timestamp>09:37:00;
timestamp sym qty price
09:32:47 IBM 6800 174.97
09:35:26 IBM 5400 175.23
09:38:12 C 8800 51.29

For multiple conditions in a where clause, symbols "and", "&&" and "," can be used to perform logical "AND" operation, and symbols "or" and "||" can be used to perform logical "OR" operation.

select * from t1 where qty>=2000, timestamp.minute()>=09:36m;
timestamp sym qty price
09:36:51 MS 2100 29.52
09:36:59 MS 3200 30.02
09:38:12 C 8800 51.29
select * from t1 where qty>=2000 and timestamp.minute()>=09:36m;
timestamp sym qty price
09:36:51 MS 2100 29.52
09:36:59 MS 3200 30.02
09:38:12 C 8800 51.29
select * from t1 where qty>=2000 && timestamp.minute()>=09:36m;
timestamp sym qty price
09:36:51 MS 2100 29.52
09:36:59 MS 3200 30.02
09:38:12 C 8800 51.29

Differences between comma "," and "and/&&":

"and" is equivalent to "&&". Both test each condition on the entire table. In contrast, when we use "," in the conditions, data is filtered by the first condition, then the next condition is applied to the filtered dataset, and so on. This reduces the search footprint. Therefore, comma "," is preferred to "and/&&" in a where statement when some of the conditions are very specific. We should use the most specific conditions first when using commas in the conditions.

In the following example, with a very restrictive condition "id=1" as only about 0.1% of the records satisfy this condition, using "," in the where clause takes only about 50% of the time as using "and/&&".

Use where clause with a function:

To select records whose prices are higher than the average price, function avg is applied to all records in the table.

select * from t1 where price>avg(price);
timestamp sym qty price
09:32:47 IBM 6800 174.97
09:35:26 IBM 5400 175.23

To calculate the average price for each stock, we can use the template contextby. The example below selects the records with price greater than the average price for each stock.

select * from t1 where price>contextby(avg, price, sym) order by sym, price;
timestamp sym qty price
09:34:16 C 1300 50.76
09:38:12 C 8800 51.29
09:35:26 IBM 5400 175.23
09:36:59 MS 3200 30.02

We can take a random sample of a number of partitions in a partitioned table. For details please check function sample .

n=1000000
ID=rand(50, n)
x=rand(1.0, n)
t=table(ID, x)
db=database("dfs://rangedb1", RANGE,  0 10 20 30 40 50)
pt = db.createPartitionedTable(t, `pt, `ID)
pt.append!(t)
pt=loadTable(db,`pt);

Table pt has 5 partitions. To take a random sample of 2 partitions, we can use either of the following queries:

x = select * from pt where sample(ID, 0.4);

x = select * from pt where sample(ID, 2);

Use where clause with a nested SQL statement:

t1 = table(`APPL`AMZN`IBM`IBM`AAPL`AMZN as sym, 2022.01.01 + 1..6 as date);
t2 = table(`APPL`AMZN`IBM`IBM`AAPL`AMZN as sym, 1.8 2.3 3.7 3.1 4.2 2.8 as price)
select count(*) from t1 where sym in select sym from t2 where price > 3
// output
3

t3 = table(`APPL`AMZN`IBM`IBM`AAPL`AMZN as sym, 1.9 2.1 2.5 2.6 2.7 3.2 as price)
select * from t2 where price > select avg(price) from t3
sym price
IBM 3.7
IBM 3.1
AAPL 4.2
AMZN 2.8