where

The where clause is used to extract only the records that satisfy the specified condition(s).

A SQL query with a where clause first loads and filters the columns specified in the where conditions, then retrieves data from other columns corresponding to the filtered results. When the columns specified in the where conditions account for a large proportion of the queried data, you can specify keyword [HINT_PRELOAD] to preload all data into memory before filtering to improve the query performance.

Examples

sym = `C`MS`MS`MS`IBM`IBM`C`C`C$SYMBOL
price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
timestamp = [09:34:07,09:36:42,09:36:51,09:36:59,09:32:47,09:35:26,09:34:16,09:34:26,09:38:12]
t1 = table(timestamp, sym, qty, price);
t1;


timestamp	sym	qty	price
09:34:07	C	2200	49.6
09:36:42	MS	1900	29.46
09:36:51	MS	2100	29.52
09:36:59	MS	3200	30.02
09:32:47	IBM	6800	174.97
09:35:26	IBM	5400	175.23
09:34:16	C	1300	50.76
09:34:26	C	2500	50.32
09:38:12	C	8800	51.29

Use where clause with one condition:

select * from t1 where sym=`IBM;


timestamp	sym	qty	price
09:32:47	IBM	6800	174.97
09:35:26	IBM	5400	175.23

select * from t1 where timestamp.minute()>=09:36m;


timestamp	sym	qty	price
09:36:42	MS	1900	29.46
09:36:51	MS	2100	29.52
09:36:59	MS	3200	30.02
09:38:12	C	8800	51.29

In where clause, "==" is equivalent to "=".

select * from t1 where sym==`IBM;


timestamp	sym	qty	price
09:32:47	IBM	6800	174.97
09:35:26	IBM	5400	175.23

Use where clause with multiple conditions:

select * from t1 where sym=`IBM and qty>=2000 or timestamp>09:37:00;


timestamp	sym	qty	price
09:32:47	IBM	6800	174.97
09:35:26	IBM	5400	175.23
09:38:12	C	8800	51.29

For multiple conditions in a where clause, symbols "and", "&&" and "," can be used to perform logical "AND" operation, and symbols "or" and "||" can be used to perform logical "OR" operation.

select * from t1 where qty>=2000, timestamp.minute()>=09:36m;


timestamp	sym	qty	price
09:36:51	MS	2100	29.52
09:36:59	MS	3200	30.02
09:38:12	C	8800	51.29

select * from t1 where qty>=2000 and timestamp.minute()>=09:36m;


timestamp	sym	qty	price
09:36:51	MS	2100	29.52
09:36:59	MS	3200	30.02
09:38:12	C	8800	51.29

select * from t1 where qty>=2000 && timestamp.minute()>=09:36m;


timestamp	sym	qty	price
09:36:51	MS	2100	29.52
09:36:59	MS	3200	30.02
09:38:12	C	8800	51.29

Differences between comma "," and "and/&&":

"and" is equivalent to "&&". Both test each condition on the entire table. In contrast, when we use "," in the conditions, data is filtered by the first condition, then the next condition is applied to the filtered dataset, and so on. This reduces the search footprint. Therefore, comma "," is preferred to "and/&&" in a where statement when some of the conditions are very specific. We should use the most specific conditions first when using commas in the conditions.

In the following example, with a very restrictive condition "id=1" as only about 0.1% of the records satisfy this condition, using "," in the where clause takes only about 50% of the time as using "and/&&".

Use where clause with a function:

To select records whose prices are higher than the average price, function avg is applied to all records in the table.

select * from t1 where price>avg(price);


timestamp	sym	qty	price
09:32:47	IBM	6800	174.97
09:35:26	IBM	5400	175.23

To calculate the average price for each stock, we can use the template contextby. The example below selects the records with price greater than the average price for each stock.

select * from t1 where price>contextby(avg, price, sym) order by sym, price;


timestamp	sym	qty	price
09:34:16	C	1300	50.76
09:38:12	C	8800	51.29
09:35:26	IBM	5400	175.23
09:36:59	MS	3200	30.02

We can take a random sample of a number of partitions in a partitioned table. For details please check function sample .

n=1000000
ID=rand(50, n)
x=rand(1.0, n)
t=table(ID, x)
db=database("dfs://rangedb1", RANGE,  0 10 20 30 40 50)
pt = db.createPartitionedTable(t, `pt, `ID)
pt.append!(t)
pt=loadTable(db,`pt);

Table pt has 5 partitions. To take a random sample of 2 partitions, we can use either of the following queries:

x = select * from pt where sample(ID, 0.4);

x = select * from pt where sample(ID, 2);

Use where clause with a nested SQL statement:

t1 = table(`APPL`AMZN`IBM`IBM`AAPL`AMZN as sym, 2022.01.01 + 1..6 as date);
t2 = table(`APPL`AMZN`IBM`IBM`AAPL`AMZN as sym, 1.8 2.3 3.7 3.1 4.2 2.8 as price)
select count(*) from t1 where sym in select sym from t2 where price > 3
// output: 3

t3 = table(`APPL`AMZN`IBM`IBM`AAPL`AMZN as sym, 1.9 2.1 2.5 2.6 2.7 3.2 as price)
select * from t2 where price > select avg(price) from t3


sym	price
IBM	3.7
IBM	3.1
AAPL	4.2
AMZN	2.8