where
The where clause is used to extract only the records that satisfy the specified condition(s).
A SQL query with a where
clause first loads and filters the columns
specified in the where conditions, then retrieves data from other columns corresponding
to the filtered results. When the columns specified in the where conditions account for
a large proportion of the queried data, you can specify keyword [HINT_PRELOAD] to preload all data into memory before filtering to improve
the query performance.
Examples
sym = `C`MS`MS`MS`IBM`IBM`C`C`C$SYMBOL
price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
timestamp = [09:34:07,09:36:42,09:36:51,09:36:59,09:32:47,09:35:26,09:34:16,09:34:26,09:38:12]
t1 = table(timestamp, sym, qty, price);
t1;
timestamp | sym | qty | price |
---|---|---|---|
09:34:07 | C | 2200 | 49.6 |
09:36:42 | MS | 1900 | 29.46 |
09:36:51 | MS | 2100 | 29.52 |
09:36:59 | MS | 3200 | 30.02 |
09:32:47 | IBM | 6800 | 174.97 |
09:35:26 | IBM | 5400 | 175.23 |
09:34:16 | C | 1300 | 50.76 |
09:34:26 | C | 2500 | 50.32 |
09:38:12 | C | 8800 | 51.29 |
Use where clause with one condition:
select * from t1 where sym=`IBM;
timestamp | sym | qty | price |
---|---|---|---|
09:32:47 | IBM | 6800 | 174.97 |
09:35:26 | IBM | 5400 | 175.23 |
select * from t1 where timestamp.minute()>=09:36m;
timestamp | sym | qty | price |
---|---|---|---|
09:36:42 | MS | 1900 | 29.46 |
09:36:51 | MS | 2100 | 29.52 |
09:36:59 | MS | 3200 | 30.02 |
09:38:12 | C | 8800 | 51.29 |
In where clause, "==" is equivalent to "=".
select * from t1 where sym==`IBM;
timestamp | sym | qty | price |
---|---|---|---|
09:32:47 | IBM | 6800 | 174.97 |
09:35:26 | IBM | 5400 | 175.23 |
Use where clause with multiple conditions:
select * from t1 where sym=`IBM and qty>=2000 or timestamp>09:37:00;
timestamp | sym | qty | price |
---|---|---|---|
09:32:47 | IBM | 6800 | 174.97 |
09:35:26 | IBM | 5400 | 175.23 |
09:38:12 | C | 8800 | 51.29 |
For multiple conditions in a where clause, symbols "and", "&&" and "," can be used to perform logical "AND" operation, and symbols "or" and "||" can be used to perform logical "OR" operation.
select * from t1 where qty>=2000, timestamp.minute()>=09:36m;
timestamp | sym | qty | price |
---|---|---|---|
09:36:51 | MS | 2100 | 29.52 |
09:36:59 | MS | 3200 | 30.02 |
09:38:12 | C | 8800 | 51.29 |
select * from t1 where qty>=2000 and timestamp.minute()>=09:36m;
timestamp | sym | qty | price |
---|---|---|---|
09:36:51 | MS | 2100 | 29.52 |
09:36:59 | MS | 3200 | 30.02 |
09:38:12 | C | 8800 | 51.29 |
select * from t1 where qty>=2000 && timestamp.minute()>=09:36m;
timestamp | sym | qty | price |
---|---|---|---|
09:36:51 | MS | 2100 | 29.52 |
09:36:59 | MS | 3200 | 30.02 |
09:38:12 | C | 8800 | 51.29 |
Differences between comma "," and "and/&&":
"and" is equivalent to "&&". Both test each condition on the entire table. In contrast, when we use "," in the conditions, data is filtered by the first condition, then the next condition is applied to the filtered dataset, and so on. This reduces the search footprint. Therefore, comma "," is preferred to "and/&&" in a where statement when some of the conditions are very specific. We should use the most specific conditions first when using commas in the conditions.
In the following example, with a very restrictive condition "id=1" as only about 0.1% of the records satisfy this condition, using "," in the where clause takes only about 50% of the time as using "and/&&".
Use where clause with a function:
To select records whose prices are higher than the average price, function avg is applied to all records in the table.
select * from t1 where price>avg(price);
timestamp | sym | qty | price |
---|---|---|---|
09:32:47 | IBM | 6800 | 174.97 |
09:35:26 | IBM | 5400 | 175.23 |
To calculate the average price for each stock, we can use the template contextby. The example below selects the records with price greater than the average price for each stock.
select * from t1 where price>contextby(avg, price, sym) order by sym, price;
timestamp | sym | qty | price |
---|---|---|---|
09:34:16 | C | 1300 | 50.76 |
09:38:12 | C | 8800 | 51.29 |
09:35:26 | IBM | 5400 | 175.23 |
09:36:59 | MS | 3200 | 30.02 |
We can take a random sample of a number of partitions in a partitioned table. For details please check function sample .
n=1000000
ID=rand(50, n)
x=rand(1.0, n)
t=table(ID, x)
db=database("dfs://rangedb1", RANGE, 0 10 20 30 40 50)
pt = db.createPartitionedTable(t, `pt, `ID)
pt.append!(t)
pt=loadTable(db,`pt);
Table pt has 5 partitions. To take a random sample of 2 partitions, we can use either of the following queries:
x = select * from pt where sample(ID, 0.4);
x = select * from pt where sample(ID, 2);
Use where clause with a nested SQL statement:
t1 = table(`APPL`AMZN`IBM`IBM`AAPL`AMZN as sym, 2022.01.01 + 1..6 as date);
t2 = table(`APPL`AMZN`IBM`IBM`AAPL`AMZN as sym, 1.8 2.3 3.7 3.1 4.2 2.8 as price)
select count(*) from t1 where sym in select sym from t2 where price > 3
// output
3
t3 = table(`APPL`AMZN`IBM`IBM`AAPL`AMZN as sym, 1.9 2.1 2.5 2.6 2.7 3.2 as price)
select * from t2 where price > select avg(price) from t3
sym | price |
---|---|
IBM | 3.7 |
IBM | 3.1 |
AAPL | 4.2 |
AMZN | 2.8 |