map
The map clause executes the SQL statement in each partition separately, then merges the results.
Scenarios where it's suitable to add the map keyword:
-
Queries that are intended to be performed in each partition separately. (see Example 1)
-
Performance optimization for group by or context by statement. (see Example 2)
When querying grouped data, the system usually performs calculation in each partition, then conduct further calculation with the results from all partitions to make sure the final result is correct. But if the granularity of the GROUP BY columns is smaller than the partitioning columns, then it can be sure that there will be no cross-partition calculations. In that case, we can add the map keyword to the query statement to reduce the cost of calculation and optimize performance.
The execution of group by
and context by
statements on
a partitioned table usually involves 2 steps:
(1) Calculation within each partition
(2) Further calculation with the results from all partitions to make sure the final result is correct.
If the granularity of the GROUP BY columns is smaller than the partitioning columns, then the first step is sufficient to get the final result. In this case, we can add the map keyword to the query to reduce computational overhead by skipping the second step, thus improving performance.
Note:
Aggregate functions and sequence-related functions are generally not allowed in SQL WHERE clauses, otherwise a full table scan would be needed. However, the map keyword allows us to use these functions in the WHERE clauses as the specified calculations will be performed within each partition to filter the records. (see Example 3)
Functions such as isDuplicated
, first
,
firstNot
, etc. whose results are sensitive to the order of the rows
can be used in the where clause of a SQL statement on multiple partitions only if the
map clause is used.
Examples
Example 1. The following query must be performed in each partition.
t = table(0..9 as id, take(1 2 3, 10) as qty)
db=database("dfs://rangedb", RANGE, 0 5 10)
pt = db.createPartitionedTable(t, `pt, `id)
pt.append!(t);
select * from pt;
id | qty |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
3 | 1 |
4 | 2 |
5 | 3 |
6 | 1 |
7 | 2 |
8 | 3 |
9 | 1 |
select first(id), count(*) from pt map;
first_id | count |
---|---|
0 | 5 |
5 | 5 |
Example 2. Proper use of the map keyword can help to optimize performance of queries on grouped data.
t = table(2022.01.01T00:00:00 + rand(10000000, 10000) as dateTime, rand(1000, 10000) as qty)
if(existsDatabase("dfs://valuedb")) dropDatabase("dfs://valuedb")
db=database("dfs://valuedb", VALUE, 2022.02.01..2022.02.05)
pt = db.createPartitionedTable(t, `pt, `dateTime)
pt.append!(t)
timer(1000) select count(*) from pt group by bar(dateTime, 60)
// output
Time elapsed: 4010.31 ms
timer(1000) select count(*) from pt group by bar(dateTime, 60) map
// output
Time elapsed: 3607.331 ms
Example 3.When querying across partitions, add the map keyword to your statement, so aggregate or order-sensitive functions can be used in the where clause for conditional filtering.
t = table(0..9 as id, take(1 2 3, 10) as qty)
db=database("dfs://rangedb", RANGE, 0 5 10)
pt = db.createPartitionedTable(t, `pt, `id)
pt.append!(t);
select * from pt where isDuplicated([id,qty]) = false map;
id | qty |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
3 | 1 |
4 | 2 |
5 | 3 |
6 | 1 |
7 | 2 |
8 | 3 |
9 | 1 |