orc
ORC is a self-describing columnar file format designed for Hadoop. It's efficient for storing and retrieving data, making it suitable for large-scale streaming data scenarios. DolphinDB's ORC plugin supports importing and exporting ORC files, automatically converting data types in the process.
Installation (with installPlugin
)
Required server version: DolphinDB 2.00.10 or higher
Supported OS: Linux x86-64 and Linux JIT.
Installation Steps:
(1) Use listRemotePlugins to check plugin information in the plugin repository.
Note: For plugins not included in the provided list, you can install through precompiled binaries or compile from source. These files can be accessed from our GitHub repository by switching to the appropriate version branch.
login("admin", "123456")
listRemotePlugins(, "http://plugins.dolphindb.com/plugins/")
(2) Invoke installPlugin for plugin installation.
installPlugin("orc");
(3) Use loadPlugin to load the plugin before using the plugin methods.
installPlugin("orc");
Method References
extractORCSchema
Syntax
extractORCSchema(fileName)
Details
The method parses the structure of the dataset in the specified ORC file and returns a table with two columns: name and type, representing the column names and data types respectively.
Parameters
- fileName: A STRING scalar indicating the ORC file name.
Examples
orc::extractORCSchema("userdata1.orc")
loadORC
Syntax
loadORC(fileName,[schema],[column],[rowStart],[rowNum])
Details
The method loads ORC file data into a DolphinDB in-memory table and returns it. For supported ORC data types and conversion rules, refer to Data Type Mappings below.
Parameters
- fileName: A STRING scalar indicating the name of the ORC file.
- schema: A table containing column names and data types. To change a column's data type, modify the data type in the schema table.
- column (optional): An INT vector indicating the indices of the columns to read. If not specified, all columns are read.
- rowStart (optional): An INT scalar indicating the starting row from which to read the ORC file. If not specified, reading starts from the beginning of the file.
- rowNum (optional): An INT scalar indicating the number of rows to read. If not specified, reading continues until the end of the file.
Examples
orc::loadORC("userdata1.orc")
Note: Column names starting with underscores are not supported. If the ORC file contains such column names, the system automatically adds the letter "Z" as a prefix.
loadORCEx
Syntax
loadORCEx(dbHandle,tableName,[partitionColumns],fileName,[schema],[column],[rowStart],[rowNum],[transform])
Details
The method converts ORC file data into a DolphinDB DFS table, then returns a table object containing the distributed table's metadata. For supported ORC data types and conversion rules, refer to Data Type Mappings below.
Parameters
- dbHandle: The handle of database.
- tableName: The name of the table.
- partitionColumns: A STRING scalar or vector indicating the partitioning columns. For non-SEQ partitioned databases, specify the partitioning columns. In composite partitions, partitionColumns is a STRING vector.
- fileName: A STRING scalar indicating the name of the ORC file.
- schema: A table containing column names and data types. To change a column's data type, modify the data type in the schema table.
- column (optional): An INT vector indicating the indices of the columns to read. If not specified, all columns are read.
- rowStart (optional): An INT scalar indicating the starting row from which to read the ORC file. If not specified, reading starts from the beginning of the file.
- rowNum (optional): An INT scalar indicating the number of rows to read. If not specified, reading continues until the end of the file.
- transform (optional): A unary function that accepts a table as its parameter. If specified, the partitioned table is created first, and then data is loaded. The specified transform function is applied to the data from the file, and the result is saved to the database.
Examples
- Without specifying transform, data is directly imported into the database without processing.
db = database("dfs://db1", RANGE, 0 500 1000)
orc::loadORCEx(db,`tb,`id,"userdata1.orc")
- Specifying transform to convert numerically represented dates and times (e.g., 20200101) to specific temporal types.
dbPath="dfs://db2"
db=database(dbPath,VALUE,2020.01.01..2020.01.30)
dataFilePath="userdata1.orc"
schemaTB=orc::extractORCSchema(dataFilePath)
update schemaTB set type="DATE" where name="date"
tb=table(1:0,schemaTB.name,schemaTB.type)
tb1=db.createPartitionedTable(tb,`tb1,`date);
def i2d(mutable t){
return t.replaceColumn!(`date,datetimeParse(t.date),"yyyy.MM.dd"))
}
t = orc::loadORCEx(db,`tb1,`date,dataFilePath,datasetName,,,,i2d)
orcDS
Syntax
orcDS(fileName,chunkSize,[schema],[skipRows])
Details
The method creates a vector of data sources based on the input file name. The number of sources is determined by the number of rows in the file and the chunkSize.
Parameters
- fileName: A STRING scalar indicating the ORC file name.
- chunkSize: An INT scalar indicating the number of rows each data source contains.
- schema: A table containing column names and data types. To change the data type of a column automatically determined by the system, modify the data type in the schema table and pass it the loadORC function as an argument.
- skipRows: An INT scalar indicating the number of rows to skip from the beginning of the file. The default value is 0.
Examples
>ds = orc::orcDS("userdata1.orc", 1000)
>size ds;
1
>ds[0];
DataSource<loadORC("userdata1.orc", , 0, 1000) >
saveORC
Syntax
saveORC(table, fileName)
Details
The method saves a DolphinDB in-memory table as an ORC format file. Note that saving table data to an existing ORC file will overwrite the original content of the ORC file.
Parameters
- table: The in-memory table object to be saved.
- fileName: A STRING scalar indicating the name of the ORC file to save to.
Examples
orc::saveORC(tb, "example.orc")
Data Type Mappings
The following is the data type mappings when an ORC file is imported to DolphinDB.
Type in ORC | Type in DolphinDB |
---|---|
boolean | BOOL |
tinyint | CHAR |
smallint | SHORT |
int | INT |
bigint | LONG |
float | FLOAT |
double | DOUBLE |
string | STRING |
char | STRING |
varchar | STRING |
binary | not support |
timestamp | NANOTIMESTAMP |
date | DATE |
struct | not support |
list | not support |
map | not support |
union | not support |
Usage Examples
// Extract the structure of the dataset in the ORC file
orcSchema = orc::extractORCSchema("userdata1.orc")
// Load data from the ORC file into an in-memory table
orcData = orc::loadORC("userdata1.orc")
// Load data from the ORC file into a DFS table
db = database("dfs://db1", RANGE, 0 500 1000)
orc::loadORCEx(db,`tb,`id,"userdata1.orc")
// Create a list of data sources based on the ORC file
ds = orc::orcDS("userdata1.orc", 1000)
// Save the in-memory table as an ORC file
orc::saveORC(orcData, "example.orc")