Create Databases and Tables

Create Databases

Use function database to create a new database or to get the handle of an existing database.

Syntax: database(directory, [partitionType], [partitionScheme], [locations], [engine='OLAP'], [atomic='TRANS'])

To establish a distributed database in the distributed file system, parameter directory should start with "dfs://".

When we create a new distributed database, we need to specify partitionType and partitionScheme. When we reopen an existing distributed database, we only need to specify directory. We cannot overwrite an existing distributed database with a different partitionType or partitionScheme. We may, however, append new partitions to a value or a range domain.

The table below shows the partition types. PartitionScheme is an integer scalar for the sequential domain, a tuple for the hash domain, and a vector for all other domains. The interpretation of the partition scheme depends on the partition type. partitionScheme supports the following data types: CHAR, SHORT, INT, LONG, DATE, MONTH, TIME, MINUTE, SECOND, DATETIME, DATEHOUR and SYMBOL.

Partition Type Partition Type Symbol Partition Scheme
sequential domain SEQ An integer. It means the number of partitions.
range domain RANGE A vector. Any two adjacent elements of the vector define the range for a partition.
hash domain HASH A tuple. The first element is the data type of partitioning column. The second element is the number of partitions.
value domain VALUE A vector. Each element of the vector defines a partition.
list domain LIST A vector. Each element of the vector defines a partition.
composite domain COMPO A vector. Each element of the vector is a database handle.

After we create a database, we cannot revise the partition type or partition scheme with function database. You can only add new VALUE and RANGE partitions with function addValuePartitions and addRangePartitions.

Create Dimension Table

A dimension table is a non-partition table in a distributed database. It is used to store data sets in small size that are not frequently updated. We can use the function createDimensionTable to create a dimension table.
db=database("dfs://db1",VALUE,1 2 3)
timestamp = [09:34:07,09:36:42,09:36:51,09:36:59,09:32:47,09:35:26,09:34:16,09:34:26,09:38:12]
sym = `C`MS`MS`MS`IBM`IBM`C`C`C
price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
t = table(timestamp, sym, qty, price)

dt=db.createDimensionTable(t,`dt).append!(t)
select * from dt;
timestamp sym qty price
09:34:07 C 2200 49.6
09:36:42 MS 1900 29.46
09:36:51 MS 2100 29.52
09:36:59 MS 3200 30.02
09:32:47 IBM 6800 174.97
09:35:26 IBM 5400 175.23
09:34:16 C 1300 50.76
09:34:26 C 2500 50.32
09:38:12 C 8800 51.29

Create Partitioned Tables

Use function createPartitionedTable to create a distributed table.

We need to provide a model table whose schema is adopted by the newly created distributed table. All partition domains other than the sequential domain must specify partition column or columns. If the database uses TSDB storage engine, sortColumns must be specified.

db=database("dfs://db1",VALUE,1 2 3)
timestamp = [09:34:07,09:36:42,09:36:51,09:36:59,09:32:47,09:35:26,09:34:16,09:34:26,09:38:12]
sym = `C`MS`MS`MS`IBM`IBM`C`C`C
price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
t = table(timestamp, sym, qty, price)

dt=db.createDimensionTable(t,`dt).append!(t)
select * from dt;
timestamp sym qty price
09:34:07 C 2200 49.6
09:36:42 MS 1900 29.46
09:36:51 MS 2100 29.52
09:36:59 MS 3200 30.02
09:32:47 IBM 6800 174.97
09:35:26 IBM 5400 175.23
09:34:16 C 1300 50.76
09:34:26 C 2500 50.32
09:38:12 C 8800 51.29

Range Domain

In a range domain (RANGE), partitions are determined by ranges specified by any two adjacent elements of the partition scheme vector. The starting value is inclusive and the ending value is exclusive. A row with value of the partitioning column falling within a range belongs to the partition defined by this range.

Example 1. Create a distributed database dfs://rangedb of range domain and save an in-memory table as a distributed table in the database.

n=1000000
ID=rand(10, n)
x=rand(1.0, n)
t=table(ID, x)
db=database("dfs://rangedb", RANGE,  0 5 10)
pt=db.createPartitionedTable(t, `pt, `ID)
pt.append!(t)
select count(x) from pt;
count_x
1000000

The database above has 2 partitions: [0,5) and [5,10). Table t is saved as a partitioned table pt with the partitioning column of ID.

To get a database handle, use function database and only specify the database path.

db=database("dfs://rangedb")

Value Domain

n=1000000
month=take(2000.01M..2016.12M, n)
x=rand(1.0, n)
t=table(month, x)
db=database("dfs://valuedb", VALUE, 2000.01M..2016.12M)
pt = db.createPartitionedTable(t, `pt, `month)
pt.append!(t)
select count(x) from pt;
count_x
1000000

The database above has 17*12=204 partitions. Each partition is a month between January 2000 and December 2016. Table t is saved as a partitioned table pt with the partitioning column of month.

For a database with value domain, we can increase the number of partitions after it is created. Please check addValuePartitions for details.

Hash Domain

In a hash domain (HASH), we need to specify the data type of the partitioning column and the number of partitions.

n=1000000
ID=rand(10, n)
x=rand(1.0, n)
t=table(ID, x)
db=database("dfs://hashdb", HASH,  [INT, 2])
pt = db.createPartitionedTable(t, `pt, `ID)
pt.append!(t)
select count(x) from pt;
count_x
1000000

The database dfs://hashdb above has 2 partitions. Table t is saved as a partitioned table pt with the partitioning column of ID.

List Domain

In a list domain (LIST), each element of the partition scheme vector corresponds to a partition.

n=1000000
ticker = rand(`MSFT`GOOG`FB`ORCL`IBM,n);
x=rand(1.0, n)
t=table(ticker, x)
db=database("dfs://listdb", LIST, [`IBM`ORCL`MSFT, `GOOG`FB])
pt = db.createPartitionedTable(t, `pt, `ticker)
pt.append!(t)
select count(x) from pt;
count_x
1000000

The database dfs://listdb above has 2 partitions. The first partition contains 3 tickers and the second contains 2 tickers.

Composite Domain

A composite domain (COMPO) can have 2 or 3 partitioning columns. Each partitioning column can be of range, value, list or hash domain.

n=1000000
ID=rand(100, n)
dates=2017.08.07..2017.08.11
date=rand(dates, n)
x=rand(10.0, n)
t=table(ID, date, x)

dbDate = database(, VALUE, 2017.08.07..2017.08.11)
dbID = database(, RANGE, 0 50 100)
db = database("dfs://compodb", COMPO, [dbDate, dbID])
pt = db.createPartitionedTable(t, `pt, `date`ID)
pt.append!(t)
select count(x) from pt;
count_x
1000000

The database dfs://compodb above has 2 levels of partitions. One level uses the value domain with 5 partitions. The other level uses the range domain with 2 partitions. In total the database has 5*2=10 partitions.

TSDB engine

When creating a partitioned table in a TSDB database, sortColumns must be specified.
dbName = "dfs://tsdb_value_int"
if(existsDatabase(dbName)){
    dropDatabase(dbName)
}
db = database(directory=dbName, partitionType=VALUE, partitionScheme=1..10,engine="TSDB")
n = 10000
t = table(n:n, [`int,`long,`short,`float,`double,`string,`char,`bool,`timestamp], [INT, LONG, SHORT,FLOAT, DOUBLE, STRING, CHAR, BOOL, TIMESTAMP])
pt1 = db.createPartitionedTable(table=t, tableName=`pt1, partitionColumns=`int,sortColumns=`short`int,keepDuplicates=ALL)
t[`int] = rand(100,n)
t[`long] = rand(100000l,n)
t[`short] = rand(10h,n)
t[`float] = rand(100.0f,n)
t[`double] = rand(100.0,n)
t[`string] = rand("A"+string(1..1000),n)
t[`char] = rand(100,n)
t[`bool] = rand([true,false],n)
tem = 2012.06.13T13:30:10.000
ts = array(timestamp,0,n)
for(i in 1..n){
    tem=temporalAdd(tem, 1, `s)
    ts.append!(tem)
}
t[`timestamp] = ts
pt1.append!(t)