In-Memory Online Transaction Processing Engine

In the financial industry, particularly for OLTP scenarios, databases must handle intensive workloads characterized by high-frequency updates and queries on relational tables. These systems demand minimal latency, robust concurrent operation handling, strict data consistency, and full ACID compliance.

While traditional database engines store data primarily on disk, this approach creates bottlenecks at both hardware and software levels when managing high-frequency operations and concurrency. Given that these systems often need to maintain relatively small datasets despite high transaction volumes, a key optimization approach is to explore the possibility of storing all data in memory.

The In-Memory Online Transaction Processing (IMOLTP) engine, introduced in version 3.00.0, takes advantage of in-memory storage to significantly reduce latency and enhance concurrency, providing a more efficient solution for OLTP scenarios. Key features of the IMOLTP engine include:

Row-Based Storage: Data is stored in a row-based format, making it well-suited for OLTP scenarios that include high-frequency transactions.
Transaction Support: It achieves full ACID compliance and uses snapshot isolation to maintain data consistency.
Efficient Indexing: B+ tree indexing, including both primary and secondary indexes, are supported to optimize performance for high-concurrency, high-frequency updates and queries.
Persistence and Recovery: Data persistence and fast recovery are ensured through WAL and checkpoint mechanisms.

Architecture

The IMOLTP engine consists of four core modules: indexing, transactions, persistence, and recovery.

Indexing

As an in-memory storage engine, IMOLTP significantly reduces disk I/O, with query operations eliminating the need for disk reads or writes. Without the need for buffer management, the indexing module becomes the primary performance bottleneck rather than disk access.

Data Structure

While lock-free data structures like BwTree theoretically offer better scalability, they are complex to implement, and ensuring correctness can be difficult. Furthermore, BwTree's complex structure can lead to cache misses, causing its performance to often fall short of that of optimized B+ trees in practice.

The IMOLTP engine uses the B+ tree data structure for indexing. It is simpler to implement than lock-free structures and delivers efficient performance for both point queries and range queries.

Algorithm

B+ trees usually use Lock Coupling to enable concurrent operations through a fine-grained locking mechanism. Instead of locking the entire tree, each node has its own read/write lock. The algorithm follows a top-down approach: it first acquires a lock on a parent node, then locks its child node, and releases the parent's lock only after securing access to the child. This process is depicted in the following figure.

The traditional B+ tree design faces a significant performance bottleneck: since all operations must begin at the root node, they all compete for the root's lock. Even with read-only locks, the constant lock acquisition and release operations write to the same memory location. Frequent cache invalidation in multi-core systems leads to increased cache misses, significantly degrading the tree's performance.

To address this challenge, the IMOLTP engine implements the Optimistic-Lock-Coupling (OLC) algorithm. OLC provides an elegant solution that preserves the straightforward nature of Lock Coupling while dramatically improving concurrent performance. The following figure demonstrates how the OLC algorithm works.

The OLC algorithm introduces a version-based approach that fundamentally changes how nodes are accessed. When accessing a node, the operation reads its version number rather than acquiring a lock. Similarly, instead of releasing a lock, it verifies that the version number hasn't changed. This design eliminates the need for write operations during tree traversal, thereby preventing cache invalidation and its associated performance penalties.

Both the B+ tree structure and the OLC algorithm maintain implementation simplicity. This straightforward design makes testing and verification more manageable, ultimately contributing to system reliability—a crucial requirement for OLTP systems.

Transaction

The IMOLTP engine implements transaction support through MV2PL (Multi-Version Two-Phase Locking), which combines MVCC and 2PL protocols to enable snapshot isolation in transactions.

A transaction can include multiple DML operations on IMOLTP tables. The engine enforces atomic execution, meaning all operations within a transaction will either complete successfully or roll back entirely.

Note:

Currently, DDL operations are not supported within transactions. If a transaction includes a DDL operation, an exception will be thrown.

MVCC

The IMOLTP engine uses a row-based data organization where each row (referred to as a tuple) maintains version information with the following structure:

txnId functions as a write lock.
readCount functions as a read lock.
beginTs and endTs control version visibility.

When a transaction begins, the system assigns it a unique, globally incrementing transaction ID (tid). A version is visible to a transaction when beginTs ≤ tid < endTs.

When a write transaction ends, the system assigns it a unique, globally incrementing commit ID (cid). Any modified tuples receive a new version with their version beginTs set to this cid and endTs to infinity. When these new versions are created, the system also updates the endTs of their immediate previous versions to match the same cid.

The IMOLTP engine implements MVCC using an append-only mode, where new data is exclusively added rather than modified in place. When each tuple is updated, the system creates a new version and adds it to the head of a version chain, with all versions maintained in chronological order from most recent to oldest. The index maintains a pointer to the newest version (the chain's head), enabling quick retrieval of the current tuple state during query processing. This leads to growing version chains and increased memory usage. Old versions in the version chain that are no longer visible to active transactions must be cleared to free up memory. The IMOLTP engine handles this via dedicated Garbage Collection (GC) threads.

2PL

In 2PL (Two-Phase Locking), transactions have two phases: locking and unlocking. Each transaction maintains:

Read set: Tuples with read locks.
Write set: Tuples with write locks.

On rollback, the transaction releases read locks from the read set, deletes versions created in the write set, and then releases write locks. On commit, it releases read locks, updates version beginTsand endTs for tuples in the write set, and then releases write locks.

Persistence

Disk-based databases store data in pages and maintain durability through Write-Ahead Logging (WAL). The system first logs all changes and ensures it's written to disk before applying modifications to the actual data pages.

While operating entirely in memory, IMOLTP uses WAL to enable persistence, ensuring the system can recover its state after a restart. It only records redo logs since data lives in memory and doesn't require page flushing or undo logging. Index operations aren't logged either, as indexes are reconstructed in memory during recovery rather than persisted to disk.

When log files grow beyond a specified threshold, IMOLTP performs checkpoints by saving the current system state (snapshot) to disk and clearing older logs, minimizing disk space usage and recovery time.

IMOLTP's MVCC design enables snapshot isolation. During a checkpoint, a read-only transaction is initiated using a transaction ID (tid) to capture the system state at that moment. This process runs without blocking ongoing read or write operations, ensuring smooth system performance.

Recovery

The IMOLTP engine recovers data using checkpoint and log files. Checkpoints capture the system state at a specific timestamp, while logs record subsequent transactions. The logging system maintains two consistency guarantees:

Operations within a transaction are logged sequentially in execution order. For example, if a transaction inserts A and then deletes A, the log will first record the insertion, then the deletion.
Dependent transactions are logged in their actual occurrence order. For example, if transaction Txn1 modifies A and Txn2 modifies A afterward, Txn1's operations will appear in the log before Txn2's operations.

During recovery, the system first loads the checkpoint file to restore the state at a specific timestamp. It then replays logged transactions sequentially to reconstruct the latest state. Primary indexes are rebuilt during recovery, and all secondary indexes are rebuilt after the log replay is complete.

Read and Write Operations

Writing Data

When writing to an IMOLTP database, the data will be written through the following phases:

Transaction start (skipped if already within an active transaction block and not the first statement): Generate and queue a WAL log for transaction initiation.
Primary index construction: Insert primary key into index, abort if uniqueness violations or conflicts (read-write/write-write) occur.
Secondary index construction: Insert all secondary keys into index, abort if any uniqueness violations or conflicts (read-write/write-write) occur.
Data write: Create new tuple version in memory, link it to the head of the version chain, and generate and queue WAL log for the write.
Transaction completion (skipped if already within an active transaction block and not the first statement):
- Commit: If no errors occur, the transaction is committed. A WAL log for the commit is generated and persisted asynchronously.
- Rollback: If an error occurs, all changes are rolled back. A rollback WAL log is generated and persisted asynchronously.

Reading Data

The data retrieval process unfolds as follows:

Transaction start (skipped if already within an active transaction block and not the first statement): No logging needed for read-only queries.
Index selection: Determine the appropriate index based on the filter conditions.
Index lookup: Search using constructed index key. Since the secondary index stores primary keys as its values, the primary key index need to be queried to locate the actual tuple's location.
Version chain traversal: Scan the tuple(s)' version chain to find the version visible to the transaction. Abort the transaction if a read-write conflict is detected.
Filters Applied: Evaluate any remaining filter conditions not covered by the index, adding matching tuples to the result set.
Transaction Completion.

Deleting Data

The data deletion process unfolds as follows:

Transaction start (skipped if already within an active transaction block and not the first statement): Generate and queue a WAL log for transaction initiation.
Data Location: Follow the data retrieval process to find the tuple(s) to be deleted, abort if errors occur.
Mark for Deletion: Create deletion marker at the head of the version chain (to be removed later by the GC thread), and generate and queue WAL log for the write. Abort the transaction if a read-write conflict is detected.
Transaction completion (skipped if already within an active transaction block and not the first statement):
- Commit: If no errors occur, the transaction is committed. A WAL log for the commit is generated and persisted asynchronously.
- Rollback: If an error occurs, all changes are rolled back. A rollback WAL log is generated and persisted asynchronously.

Updating Data

The IMOLTP engine treats an update operation as a combination of a data deletion followed by a data write. The update process follows the steps outlined in the previous sections.

Usage Example

The IMOLTP engine is available from version 3.00.0 onwards, which can be enabled in either standalone mode or within a single data node in a single-machine cluster.

Configuration

Before using the IMOLTP engine, set the configuration parameter enableIMOLTPEngine to true. For more configuration parameters, refer to Configuration > Reference > IMOLTP.

Creating a Database

To create an IMOLTP database, you can use the database function or the CREATE DATABASE statement.

Note:

The directory must start with oltp:// to indicate an IMOLTP in-memory database.
The enginemust be set to IMOLTP to specify the IMOLTP engine.
Partitioning is not currently supported for IMOLTP engine. Parameters partitionType and partitionScheme do not take effect for IMOLTP database, but they must still be properly specified for interface consistency. For instance, the following example specifies a VALUE partition type.

// use the database function
database(directory="oltp://my_imoltp",
    partitionType=VALUE, partitionScheme=1..3, engine="IMOLTP")
    
// use the CREATE DATABASE statement
CREATE DATABASE "oltp://my_imoltp"
PARTITIONED BY VALUE(1..3)
engine="IMOLTP"

Creating a Table

Use the createIMOLTPTable function to create a table within an IMOLTP database. Its syntax is as follows:

createIMOLTPTable(dbHandle, table, tableName, primaryKey, [secondaryKey], [uniqueFlag])

The primaryKey is required to uniquely identify each row in a table, which consists of one or more columns. The secondaryKey is optional and allows multiple secondary indexes per table, with their uniqueness determined by the uniqueFlag.

Note: Selecting appropriate indexes during table creation can significantly enhance query performance. Queries using the primary key index generally perform better than those using only secondary indexes, while secondary indexes are faster than queries without any indexes. If no index is used, the system will perform a full table scan. Although secondary indexes speed up queries, they may slightly reduce the efficiency of write, delete, and update operations due to the overhead of updating the indexes.

The example below shows how to create three types of tables:

t1, a table with no secondary indexes
t2, a table with a unique secondary index
t3, a table with both unique and non-unique secondary indexes

dummyTable = table(1:0, `id`sym`val1`val2, [INT,STRING,LONG,DOUBLE])

// create t1 with id as primary key and with no secondary index
t1 = createIMOLTPTable(dbHandle=database("oltp://my_imoltp"), 
    table=dummyTable, tableName="t1", primaryKey="id")

// create t2 with id as primary key and sym, val1 as unique secondary keys
t2 = createIMOLTPTable(dbHandle=database("oltp://my_imoltp"),
    table=dummyTable, tableName="t2", primaryKey="id",
    secondaryKey=`sym`val1, uniqueFlag=true)

// create t3 with id as primary key, sym as unique secondary key and val1 as non-unique secondary key
t3 = createIMOLTPTable(dbHandle=database("oltp://my_imoltp"),
    table=dummyTable, tableName="t3", primaryKey="id",
    secondaryKey=`sym`val1, uniqueFlag=[true,false])

Writing Data

The IMOLTP in-memory table supports data insertion through the INSERT INTO statement, the append! function, and the tableInsert function.

insert into t3 VALUES (1,"Aa",6,5)

t3.append!(table([2,3,4] as id, ["Bb","Cc","Dd"] as sym,
 [3,6,9] as val1, [2,4,6] as val2))
 
t3.tableInsert(table([5,6,7] as id, ["Ee","Ff","Gg"] as sym,
 [5,6,7] as val1, [7,8,9] as val2))

Note: Insertion fails if duplicate values are detected in primary key or unique secondary indexes.

Examples are provided below:

// Insertion fails due to duplicate primary key: id
 insert into t3 VALUES (1,"Hh",6,5)

// Insertion fails due to duplicate unique secondary key: sym
insert into t3 VALUES (8,"Aa",6,5)

// Insertion succeeds when non-unique secondary key is duplicate
insert into t3 VALUES (8,"Hh",6,5)

Querying Data

The following examples compare the performance when querying data using primary key index, secondary index, and without any index.

// Insert 1 million rows
n = 1000000
t2.tableInsert(table(1..n as id, take("Aa"+string(1..1000),n) as sym,
 stretch(1..1000,n) as val1, 1..n as val2))

// With primary key index: 10 queries take 0.304 ms
timer(10) select * from t2 where id = 1

// With secondary index only: 10 queries take 8.424 ms
timer(10) select * from t2 where sym = "Aa1" and val2 = 1

// Without any index: 10 queries take 3003.61 ms
timer(10) select * from t2 where val2 = 1

For secondary indexes with multiple columns (e.g., sym and val1), queries can only use the index for acceleration if all preceding columns in the index order have equality conditions.

For example, if secondary keys are defined as sym and val1 (secondaryKey=`sym`val1), queries can use the index efficiently when filtering by sym alone or sym AND val1. Filtering by val1 alone cannot speed the query.

// Filter by sym only: 10 queries take 8.241 ms
timer(10) select * from t2 where sym = "Aa1" and val2 = 1

// Filter by val1 only: 10 queries take 3128.887 ms
timer(10) select * from t2 where val1 = 1 and val2 = 1

// Filter by val1 and sym: 10 queries take 0.456 ms
timer(10) select * from t2 where sym = "Aa1" and val1 = 1 and val2 = 1

Transaction Management

For IMOLTP engine offers two transaction modes: default transactions and transaction blocks that encapsulate multiple operations.

Default Transactions

Every SQL statement executes as a separate transaction.

Transaction Blocks

Multiple operations can be encapsulated in a transaction block where all changes either succeed together or fail together. Within the block:

Support DML operations only. DDL statements are not supported.
Can be committed explicitly (via commit command) or auto-commit on successful completion.
Allow manual rollback (via rollback command) to revert changes.

Example 1: Successful transaction

The following transaction block is successfully executed and committed automatically:

transaction{
    insert into t1 VALUES (1,"Aa",1,1)
    insert into t1 VALUES (2,"Bb",2,2)
    insert into t1 VALUES (3,"Cc",3,3)
    // commit
}

// obtain row count: 3
select count(*) from t1

Example 2: Transaction fails, all operations rollback

If the last statement in the transaction block fails, all operations within the block are rolled back:

transaction{
    insert into t1 VALUES (4,"Dd",4,4)
    insert into t1 VALUES (1,"Aa",1,1)
}

// obtain row count: 3. The transaction was rolled back.
select count(*) from t1

Example 3: Manually rollback with the rollback command

transaction{
    insert into t1 VALUES (4,"Dd",4,4)
    if((exec count(*) from t1) > 3){
        rollback
    }
}

// obtain row count: 3. The transaction was rolled back.
select count(*) from t1