summary
Syntax
summary(X,[interpolation],[characteristic],[percentile],[precision],[partitionSampling])
Details
summary
generates summary statistics for the input data. It returns
an in-memory table containing the minimum, maximum, count, mean, standard deviation,
and specified percentiles in ascending order.
-
If X is a table,
summary
only computes statistics for the numeric columns. -
If X is a data source, it can only contain numeric columns, otherwise an error will occur during computation.
Arguments
X can be an in-memory table, DFS table or data source generated from sqlDS. Note that data sources with SQL metacode containing table joins are currently not supported.
interpolation (optional) is a string indicating the interpolation method for percentiles. It can be "linear" (default), "nearest", "lower", "higher" and "midpoint".
characteristic (optional) is a string scalar or vector indicating the characteristics to compute. It can be "avg" and/or "std". Default is both characteristics.
percentile (optional) is a DOUBLE vector of percentiles to compute. Each vector element falls between 0 ang 100. The default is [25,50,75], which returns the 25th, 50th, and 75th percentiles.
precision (optional) is a DOUBLE scalar greater than 0. The default value is 1e-3, which means the iteration for computing statistics will stop when the difference between the current and previous result is less than or equal to 1e-3. It is recommended to set precision between [1e-3, 1e-9] - small enough for adequate precision but not too small to impact performance through excessive iterations.
-
For a partitioned table:
-
at least one partition will always be sampled. If the sampling ratio * total partitions < 1, one partition is sampled.
-
The sampling ratio is rounded down if the sampling ratio * total partitions is not an integer. E.g. if ratio=0.26 and total partitions is 10, 2 partitions are sampled.
-
If partitionSampling (integer) > total partitions, all partitions are used.
-
-
partitionSampling has no effect for non-partitioned tables.
Examples
n=2022
data=1..n
value=take(1..3,n)
name=take(`APPLE`IBM`INTEL,n)
t=table(data,value,name);
summary(t, precision=0.001);
// name is not a numeric column and therefore will not be output
name | min | max | nonNullCount | count | avg | std | percentile |
---|---|---|---|---|---|---|---|
data | 1 | 2,022 | 2,022 | 2,022 | 1,011.5 | 583.8454 | [506.24,1011.50,1516.75] |
value | 1 | 3 | 2,022 | 2,022 | 2 | 0.8167 | [1.00,1.99,2.99] |
n = 5000
data1 = take(1..5000000, n)
data2 = rand(10000000, n)
data3 = take("A" + string(0..10), n)
t = table(data1, data2, data3)
dbname = "dfs://summary"
if(existsDatabase(dbname)) {
dropDatabase(dbname)
}
db = database(dbname, HASH, [INT, 10])
pt = createPartitionedTable(db, t, `pt, `data1)
pt.append!(t)
ds = sqlDS(<select data1,data2 from loadTable(db, `pt)>)
query_percentile = [25,50,75,90]
ds_re1 = summary(ds);
//returns the 25th, 50th, 75th and 90th percentiles
ds_re2 = summary(ds, percentile=query_percentile, precision=0.0001);
// the partition sampling ratio is 0.6. As there are 10 partitions in total, 6 partitions will be sampled for statistics computation
ds_re3 = summary(loadTable(db, `pt), percentile=query_percentile, precision=0.0001, partitionSampling=0.6);