Create Databases and Tables
Create Databases
Use function database to create a new database or to get the handle of an existing database.
Syntax: database(directory, [partitionType], [partitionScheme], [locations],
[engine='OLAP'], [atomic='TRANS'])
To establish a distributed database in the distributed file system, parameter directory should start with "dfs://".
When we create a new distributed database, we need to specify partitionType and partitionScheme. When we reopen an existing distributed database, we only need to specify directory. We cannot overwrite an existing distributed database with a different partitionType or partitionScheme. We may, however, append new partitions to a value or a range domain.
The table below shows the partition types. PartitionScheme is an integer scalar for the sequential domain, a tuple for the hash domain, and a vector for all other domains. The interpretation of the partition scheme depends on the partition type. partitionScheme supports the following data types: CHAR, SHORT, INT, LONG, DATE, MONTH, TIME, MINUTE, SECOND, DATETIME, DATEHOUR and SYMBOL.
Partition Type | Partition Type Symbol | Partition Scheme |
---|---|---|
sequential domain | SEQ | An integer. It means the number of partitions. |
range domain | RANGE | A vector. Any two adjacent elements of the vector define the range for a partition. |
hash domain | HASH | A tuple. The first element is the data type of partitioning column. The second element is the number of partitions. |
value domain | VALUE | A vector. Each element of the vector defines a partition. |
list domain | LIST | A vector. Each element of the vector defines a partition. |
composite domain | COMPO | A vector. Each element of the vector is a database handle. |
After we create a database, we cannot revise the partition type or partition scheme
with function database
. You can only add new VALUE and RANGE
partitions with function addValuePartitions and addRangePartitions.
Create Dimension Table
db=database("dfs://db1",VALUE,1 2 3)
timestamp = [09:34:07,09:36:42,09:36:51,09:36:59,09:32:47,09:35:26,09:34:16,09:34:26,09:38:12]
sym = `C`MS`MS`MS`IBM`IBM`C`C`C
price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
t = table(timestamp, sym, qty, price)
dt=db.createDimensionTable(t,`dt).append!(t)
select * from dt;
timestamp | sym | qty | price |
---|---|---|---|
09:34:07 | C | 2200 | 49.6 |
09:36:42 | MS | 1900 | 29.46 |
09:36:51 | MS | 2100 | 29.52 |
09:36:59 | MS | 3200 | 30.02 |
09:32:47 | IBM | 6800 | 174.97 |
09:35:26 | IBM | 5400 | 175.23 |
09:34:16 | C | 1300 | 50.76 |
09:34:26 | C | 2500 | 50.32 |
09:38:12 | C | 8800 | 51.29 |
Create Partitioned Tables
Use function createPartitionedTable to create a distributed table.
We need to provide a model table whose schema is adopted by the newly created distributed table. All partition domains other than the sequential domain must specify partition column or columns. If the database uses TSDB storage engine, sortColumns must be specified.
db=database("dfs://db1",VALUE,1 2 3)
timestamp = [09:34:07,09:36:42,09:36:51,09:36:59,09:32:47,09:35:26,09:34:16,09:34:26,09:38:12]
sym = `C`MS`MS`MS`IBM`IBM`C`C`C
price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
t = table(timestamp, sym, qty, price)
dt=db.createDimensionTable(t,`dt).append!(t)
select * from dt;
timestamp | sym | qty | price |
---|---|---|---|
09:34:07 | C | 2200 | 49.6 |
09:36:42 | MS | 1900 | 29.46 |
09:36:51 | MS | 2100 | 29.52 |
09:36:59 | MS | 3200 | 30.02 |
09:32:47 | IBM | 6800 | 174.97 |
09:35:26 | IBM | 5400 | 175.23 |
09:34:16 | C | 1300 | 50.76 |
09:34:26 | C | 2500 | 50.32 |
09:38:12 | C | 8800 | 51.29 |
Range Domain
In a range domain (RANGE), partitions are determined by ranges specified by any two adjacent elements of the partition scheme vector. The starting value is inclusive and the ending value is exclusive. A row with value of the partitioning column falling within a range belongs to the partition defined by this range.
Example 1. Create a distributed database dfs://rangedb of range domain and save an in-memory table as a distributed table in the database.
n=1000000
ID=rand(10, n)
x=rand(1.0, n)
t=table(ID, x)
db=database("dfs://rangedb", RANGE, 0 5 10)
pt=db.createPartitionedTable(t, `pt, `ID)
pt.append!(t)
select count(x) from pt;
count_x |
---|
1000000 |
The database above has 2 partitions: [0,5) and [5,10). Table t is saved as a partitioned table pt with the partitioning column of ID.
To get a database handle, use function database
and only specify the
database path.
db=database("dfs://rangedb")
Value Domain
n=1000000
month=take(2000.01M..2016.12M, n)
x=rand(1.0, n)
t=table(month, x)
db=database("dfs://valuedb", VALUE, 2000.01M..2016.12M)
pt = db.createPartitionedTable(t, `pt, `month)
pt.append!(t)
select count(x) from pt;
count_x |
---|
1000000 |
The database above has 17*12=204 partitions. Each partition is a month between January 2000 and December 2016. Table t is saved as a partitioned table pt with the partitioning column of month.
For a database with value domain, we can increase the number of partitions after it is created. Please check addValuePartitions for details.
Hash Domain
In a hash domain (HASH), we need to specify the data type of the partitioning column and the number of partitions.
n=1000000
ID=rand(10, n)
x=rand(1.0, n)
t=table(ID, x)
db=database("dfs://hashdb", HASH, [INT, 2])
pt = db.createPartitionedTable(t, `pt, `ID)
pt.append!(t)
select count(x) from pt;
count_x |
---|
1000000 |
The database dfs://hashdb above has 2 partitions. Table t is saved as a partitioned table pt with the partitioning column of ID.
List Domain
In a list domain (LIST), each element of the partition scheme vector corresponds to a partition.
n=1000000
ticker = rand(`MSFT`GOOG`FB`ORCL`IBM,n);
x=rand(1.0, n)
t=table(ticker, x)
db=database("dfs://listdb", LIST, [`IBM`ORCL`MSFT, `GOOG`FB])
pt = db.createPartitionedTable(t, `pt, `ticker)
pt.append!(t)
select count(x) from pt;
count_x |
---|
1000000 |
The database dfs://listdb above has 2 partitions. The first partition contains 3 tickers and the second contains 2 tickers.
Composite Domain
A composite domain (COMPO) can have 2 or 3 partitioning columns. Each partitioning column can be of range, value, list or hash domain.
n=1000000
ID=rand(100, n)
dates=2017.08.07..2017.08.11
date=rand(dates, n)
x=rand(10.0, n)
t=table(ID, date, x)
dbDate = database(, VALUE, 2017.08.07..2017.08.11)
dbID = database(, RANGE, 0 50 100)
db = database("dfs://compodb", COMPO, [dbDate, dbID])
pt = db.createPartitionedTable(t, `pt, `date`ID)
pt.append!(t)
select count(x) from pt;
count_x |
---|
1000000 |
The database dfs://compodb above has 2 levels of partitions. One level uses the value domain with 5 partitions. The other level uses the range domain with 2 partitions. In total the database has 5*2=10 partitions.
TSDB engine
dbName = "dfs://tsdb_value_int"
if(existsDatabase(dbName)){
dropDatabase(dbName)
}
db = database(directory=dbName, partitionType=VALUE, partitionScheme=1..10,engine="TSDB")
n = 10000
t = table(n:n, [`int,`long,`short,`float,`double,`string,`char,`bool,`timestamp], [INT, LONG, SHORT,FLOAT, DOUBLE, STRING, CHAR, BOOL, TIMESTAMP])
pt1 = db.createPartitionedTable(table=t, tableName=`pt1, partitionColumns=`int,sortColumns=`short`int,keepDuplicates=ALL)
t[`int] = rand(100,n)
t[`long] = rand(100000l,n)
t[`short] = rand(10h,n)
t[`float] = rand(100.0f,n)
t[`double] = rand(100.0,n)
t[`string] = rand("A"+string(1..1000),n)
t[`char] = rand(100,n)
t[`bool] = rand([true,false],n)
tem = 2012.06.13T13:30:10.000
ts = array(timestamp,0,n)
for(i in 1..n){
tem=temporalAdd(tem, 1, `s)
ts.append!(tem)
}
t[`timestamp] = ts
pt1.append!(t)