Node Startup Exception

When a DolphinDB node fails to start, you won't be able to access its web interface. For a cluster setup, the node’s status will appear in red on the web Cluster tab. To troubleshoot, first determine if the node is:

  • Shut down unexpectedly during startup
  • Stuck in the startup process
  • Taking unusually long to start

Diagnosing the Issue

First, verify if the process is still running using the following command:
ps -ef | grep dolphindb
// If you have renamed the executable file, replace dolphindb with your actual file name.
If no process is found, the node may have crashed. If the process is running, proceed to verify whether the node started successfully by checking for the following log entry:
grep "Job scheduler initialization completed." dolphindb.log

If not found, it suggests the failure occurred during startup.

A log entry with a timestamp after the node was started likely indicates the node started successfully. To confirm, refresh the web interface to see if the node status turns green. If there is no such log entry, the node may be stuck or starting slowly.

For further investigation, search the log for ERROR entries with the following command:
grep "ERROR" dolphindb.log

If a repeated ERROR message appears in the logs after startup and the node process is still running, this indicates the node is stuck at a specific startup stage.

If no repeated ERROR logs appear, the startup is likely progressing normally but slowly, so allow more time and monitor the progress.

Note: To diagnose specific issues, check the node’s runtime log. By default, the log of a data node is named dolphindb.log and is located in the server directory in standalone mode, or under the clusterDemo/log directory in cluster mode. The storage path can be modified with the logFile configuration parameter.
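For example, to redirect the runtime log to a custom location, set logFile in the node's configuration file (the path below is only illustrative, and the DolphinDB process must have write permission to the directory):
logFile=/home/dolphindb/log/datanode.log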

Unexpected Shutdown

If the node crashed after starting successfully (i.e., "Job scheduler initialization completed" was logged), proceed with the Node Crashes troubleshooting procedures.

If the failure occurred during the startup process, review the ERROR logs from the startup phase in the latest runtime log to help pinpoint the issue, ignoring normal shutdown errors like:
...
<ERROR> : The socket server ended.
...
<ERROR> : AsynchronousSubscriberImp::run Shut down the subscription daemon.
...
In cases where the process crashes unexpectedly without formal shutdown, examine the core dump for further details. Run the following commands to view the stack trace:
cd /path/to/dolphindb
gdb dolphindb /path/to/corefile
bt

Share the stack trace with DolphinDB technical support for further analysis.

Process Stuck

When a node becomes stuck during startup, check the latest runtime log for recurring ERROR messages - these often indicate where the system is repeatedly failing. You can monitor the logs in real-time using:
tail -f dolphindb.log

To analyze what each thread is doing during the stuck state, you have two options:

  1. Use pstack to view thread stack traces:
    pstack dolphindb_pid > /tmp/pstack.log 
    // replace dolphindb_pid with the actual process ID
  2. Use gdb to capture detailed thread stack traces:
    gdb -p dolphindb_pid -batch -ex 'thread apply all bt' -ex 'quit' > /tmp/gdb_stacks.log 
    // replace dolphindb_pid with the actual process ID

Share the stack trace with DolphinDB technical support for further analysis.

Slow Startup

When experiencing a slow startup, an ERROR log may not appear. Common causes of slow starts are transaction rollbacks and redo log replays (see Common Issues and Solutions).

To analyze what each thread is doing during startup, you have two options:

  1. Use pstack to view thread stack traces:
    pstack dolphindb_pid > /tmp/pstack.log 
    // replace dolphindb_pid with the actual process ID
  2. Use gdb to capture detailed thread stack traces:
    gdb -p dolphindb_pid -batch -ex 'thread apply all bt' -ex 'quit' > /tmp/gdb_stacks.log 
    // replace dolphindb_pid with the actual process ID

Share the stack trace with DolphinDB technical support for further analysis.

Common Issues and Solutions

This section covers frequently encountered problems that may occur during node startup and provides their solutions to help streamline troubleshooting. For issues not addressed here, please contact DolphinDB technical support team for assistance.

Unexpected Shutdown

License Expiration

When your DolphinDB license approaches expiration, you'll receive notifications on the web/GUI interface 15 days in advance. The nodes will automatically shut down 15 days after expiration. Any attempt to start DolphinDB with an expired license will fail, with WARNING and ERROR messages appearing in the runtime logs:
2023-10-13 09:52:30.007743 <WARNING> :
    The license has expired. Please renew the license and restart the server.
2023-10-13 09:52:30.163238 <ERROR> : 
    The license has expired.

To resolve: Contact DolphinDB support to obtain a renewed license.

Port Conflicts

At startup, DolphinDB uses a network port specified by localSite in the config file. If this port is occupied by another program or a previous DolphinDB session, startup will fail with errors like:
2023-10-26 09:01:31.349118 <ERROR> :Failed to bind the socket on port 8848 with error code 98
2023-10-26 09:01:31.349273 <ERROR> :Failed to bind the socket on port 8848. Shutting down the server. Please try again in a couple of minutes.
Check which program is using the port:
netstat -nlp | grep <port number>

To resolve: Either stop the program that is using the port or wait for the previous DolphinDB node to fully shut down. As a last resort, use kill -9 to force termination, but this may cause data loss.
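For example, a minimal sketch assuming the default port 8848 and that the occupying process is a leftover DolphinDB node that is safe to stop (replace <pid> with the PID reported by netstat):
netstat -nlp | grep 8848
kill -15 <pid>
// request a graceful shutdown first
kill -9 <pid>
// force termination only as a last resort; this may cause data loss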

Corrupted Redo Log

If a data node's redo log file becomes corrupted (possibly due to disk-full errors, system crashes, or bugs during previous runtime), the node will fail to start as the system can't replay logs to restore past transactions.

Note: The location of redo logs depends on your storage engine configuration parameters.

  • OLAP: redoLogDir. Defaults to /log/redoLog.
  • TSDB: TSDBRedoLogDir. Defaults to /log/TSDBRedo.
  • PKEY: PKEYRedoLogDir. Defaults to <ALIAS>/log/PKEYRedo.
For example, consider the following ERROR logs:
2023-12-11 15:18:58.888865 <INFO> :applyTidRedoLog : 
    2853,c686664b-d020-429a-1746-287d670099e9,
        /hdd/hdd7/server/clusterDemo/data/P1-datanode/storage/CHUNKS/multiValueTypeDb1/20231107/Key0/g
z,pt_2,32054400,1046013,0
2023-12-11 15:18:58.895064 <ERROR> :VectorUnmarshall::start Invalid data form 0 type 0
2023-12-11 15:18:58.895233 <ERROR> :The redo log for transaction [2853] comes across error: 
    Failed to unmarshall data.. Invalid message format
2023-12-11 15:18:58.895476 <ERROR> :The ChunkNode failed to initialize with exception 
    [Failed to unmarshall data.. Invalid message format].
2023-12-11 15:18:58.895555 <ERROR> :ChunkNode service comes up with the error message: 
    Failed to unmarshall data.. Invalid message format

It indicates that the replay failed due to an invalid format in the redo log file for transaction ID 2853.

To resolve: Bypass redo log replay with the following steps:

  1. Move the head.log files from <redoLogDir> to a backup location.
  2. Back up the corrupted 2853.log file separately (a shell sketch of steps 1 and 2 follows this list).
  3. Restart the node without these logs. After successful startup, check data integrity, especially for data written just before the previous shutdown, and restore any missing data if needed.
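A minimal shell sketch of steps 1 and 2, assuming the OLAP redoLogDir and the transaction ID from the example above (repeat for TSDBRedoLogDir/PKEYRedoLogDir if those engines are in use); moving instead of deleting keeps the files available for later analysis:
mkdir -p /backup/redoLog
mv <redoLogDir>/head.log /backup/redoLog/
// step 1: move head.log out of the redo log directory
mv <redoLogDir>/2853.log /backup/redoLog/
// step 2: back up the corrupted transaction log separately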

If disk space wasn't the issue, provide both head.log and 2853.log files to DolphinDB technical support for analysis.

Unknown Methods in Function Views/Scheduled Jobs

Deserialization of function views and scheduled jobs may fail if it encounters unknown methods in memory. This typically happens when:

  • Required plugins/modules aren't set to preload
  • Plugin/module updates have changed method names
For example, consider a scheduled job myTest that calls a method from the rabbitmq plugin:
loadPlugin("plugins/rabbitmq/PluginRabbitMQ.txt")

def myTest() {
	HOST="192.168.0.53"
    PORT=5672
    USERNAME="guest"
    PASSWORD="guest"

    conn = rabbitmq::connection(HOST, PORT, USERNAME, PASSWORD);
}

scheduleJob("myTest", "myTest", myTest, 15:50m, startDate=today(), endDate=today()+3, frequency='D')
Without setting preloadModules=plugins::rabbitmq in the configuration, the rabbitmq plugin methods won't be available during startup. This causes the scheduled job deserialization to fail, resulting in error messages in the log file.
2023-10-13 09:55:30.166268 <ERROR> :CodeUnmarshall::start readObjectAndDependency exception: 
    Can't recognize function: rabbitmq::connection
2023-10-13 09:55:30.166338 <ERROR> :Failed to unmarshall the job [myTest]. Can't recognize function: 
    rabbitmq::connection. Invalid message format

To resolve: Add the relevant plugin/module to the preloadModules configuration parameter, e.g., preloadModules=plugins::rabbitmq, and restart the node.

If the error is due to updated plugin/module methods: roll back to the previous plugin/module version, remove the affected function views or scheduled jobs, and then proceed with the plugin/module update.

Unknown Shared Tables in Function Views/Scheduled Jobs

Deserialization of function views and scheduled jobs may fail if it encounters unknown shared tables.

Note: This issue was fixed in version 1.30.23.1/2.00.11.1, ensuring that the node startup proceeds without interruption, and the error is logged.

If a scheduled job references a shared table undefined in startup script startup.dos, the deserialization will fail during startup with errors. Take a scheduled job “myTest” for example:
share table(1 2 3 as id, 1 2 3 as val) as t

def myTest() {
	update t set val = val + 1
}

scheduleJob("myTest", "myTest", myTest, minute(now())+5, today(), today(), 'D')
The shared table t is not defined in startup.dos, causing the scheduled job deserialization to fail with the following errors:
2023-10-23 09:38:27.746184 <WARNING> :Failed to recognize shared variable t
2023-10-23 09:38:27.746343 <ERROR> :CodeUnmarshall::start readObjectAndDependency exception: 
    Failed to deserialize update statement
2023-10-23 09:38:27.746404 <ERROR> :Failed to deserialize update statement. 
    Invalid message format

To resolve:

  1. Check if any undefined shared tables exist in your jobs.
  2. Define these tables in startup.dos script on the node where the scheduled job is located.
  3. Restart the node.

When a scheduled job (e.g. myTest) is added as a function view using addFunctionView(), it may fail to deserialize during startup, showing similar error messages. Note that since deserialization of function views is performed before the startup script is executed, defining shared tables in startup.dos won't help: the table definitions won't be available in time.

To resolve:

For a regular cluster:

  1. Remove the following files from server/clusterDemo/data/dnode1/sysmgmt: aclEditlog.meta, aclCheckPoint.meta, and aclCheckPoint.tmp.
  2. Restart the node.
  3. After the restart, re-add all function views and corresponding permissions.

For a high-availability (HA) cluster:

  • If the cluster is still running or has a majority of controllers active:
    1. Delete the function view with dropFunctionView("myTest").
    2. Generate a checkpoint file for function views to prevent raft log replay from reapplying the previous function views during startup. Use the following command to force checkpoint creation: rpc(getControllerAlias(), aclCheckPoint, true).
  • If the cluster has already been restarted:
    1. On each controller, remove the following files under <HOME_DIR>/<NodeAlias>/raft: raftHardstate, raftWAL, raftSnapshot, raftWAL.old, and raftSnapshot.tmp. Note that this will invalidate all cluster metadata.
    2. Restart the nodes.

Method Name Conflicts

Note: This issue was fixed in version 2.00.11.

When a function view references a module method whose name conflicts with one loaded by preloadModules, function view deserialization fails. For instance, add the ops module's cancelJobEx method as a function view:
use ops
addFunctionView(ops::cancelJobEx)
If preloadModules=ops is set, this creates a naming conflict and causes startup errors.
2023-10-20 08:46:15.733365 <ERROR> :CodeUnmarshall::start readObjectAndDependency exception: 
    Not allowed to overwrite existing functions/procedures [ops::cancelJobEx] by system users.
2023-10-20 08:46:15.733422 <ERROR> :Not allowed to overwrite existing functions/procedures
    [ops::cancelJobEx] by system users.. Invalid message format

To resolve: Remove the corresponding module configured by preloadModules and then restart the node. It is not recommended to add module functions as function views.

Corrupted Scheduled Job File

If a scheduled job file becomes corrupted (possibly due to disk-full errors, system crashes, or bugs during previous runtime), the node will fail to start because the system cannot deserialize the scheduled job file during startup. For instance:
2023-10-13 09:57:30.456789 <ERROR> :CodeUnmarshall::start readObjectAndDependency exception: 
    Failed to deserialize update statement
2023-10-13 09:57:30.456789 <ERROR> :Failed to unmarshall the job [myTest].
    Failed to deserialize update statement. Invalid message format

To resolve: Remove the following files from server/clusterDemo/data/dnode1/sysmgmt: jobEditlog.meta, jobCheckPoint.meta, and jobCheckPoint.tmp. Then, restart the node. After startup, re-submit all scheduled jobs.
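A minimal sketch, assuming the default clusterDemo layout used above; moving instead of deleting keeps the files available for later analysis:
cd server/clusterDemo/data/dnode1/sysmgmt
mkdir -p /backup/sysmgmt
mv jobEditlog.meta jobCheckPoint.meta jobCheckPoint.tmp /backup/sysmgmt/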

Alternatively, gather and package the affected job files, the scheduled job script reported in the error, and the node’s runtime log. Reach out to DolphinDB technical support for further assistance.

Corrupted Function View File

If a function view file becomes corrupted (possibly due to disk-full errors, system crashes, or bugs during previous runtime), the node will fail to start because the system cannot deserialize the function view file during startup. For instance:
2023-10-13 09:59:35.786438 <ERROR> :CodeUnmarshall::start readObjectAndDependency exception:
    Failed to deserialize sql query object
2023-10-13 09:59:35.786438 <ERROR> :Failed to unmarshall the job [myTest1].
    Failed to deserialize sql query object. Invalid message format

To resolve: Remove related files. For more instructions, see solutions in section Unknown shared tables in Function Views/Scheduled Jobs.

After startup, re-add all function views and corresponding permissions.

Alternatively, gather and package the affected files, the function view script reported in the error, and the node’s runtime log. Reach out to DolphinDB technical support for further assistance.

Corrupted Raft File

If a raft file becomes corrupted (possibly due to disk-full errors, system crashes, or bugs during previous runtime), the node will fail to start as the system cannot restore raft data during startup. For instance:
2023-10-13 09:59:35.786438 <WARNING> :[Raft] incomplete hardstate file 
    [/data/server/data/controllerl/raft/raftHardstatel]
2023-10-13 09:59:35.786438 <INFO> :[Raft] Group DFSMaster RaftWAL::reconstruct: 
    read new file with 83213 entries 
2023-10-13 09:59:35.786438 <ERROR> :[Raft] Group DFSMaster RawNode::init: 
    failed to initialize with exception [basic_string::_S_create].
2023-10-13 09:59:35.786438 <ERROR> :Failed to start DFSMaster with the error message:
    basic_string::_S_create 

To resolve: Remove folders <HomeDir>/<nodeAlias>/raft and <HomeDir>/<nodeAlias>/dfsMeta (configured by dfsMetaDir). Then, restart the node. After startup, the metadata of the leader will be automatically synchronized.
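A minimal sketch for one node, using the placeholder conventions above; moving instead of deleting keeps the folders available for later analysis:
mkdir -p /backup/raftMeta
mv <HomeDir>/<nodeAlias>/raft /backup/raftMeta/
mv <HomeDir>/<nodeAlias>/dfsMeta /backup/raftMeta/
// restart the node afterwards; see the note below about waiting for a new raft leader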

Alternatively, gather and package both folders and the node’s runtime log. Reach out to DolphinDB technical support for further assistance.

Note: Do not proceed until another node is elected as the leader of the raft cluster.

Process Stuck

Network Issues Among Cluster Nodes

For multi-machine clusters, network connectivity between all nodes is essential; otherwise, startup may get stuck. A typical issue occurs when restarting a high-availability cluster after network configuration changes, leading to a blank controller web interface. The controller logs errors like:
2023-11-01 16:00:34.992500 <INFO> :New connection from ip = 192.168.0.44 port = 35416
2023-11-01 16:00:35.459220 <INFO> :DomainSiteSocketPool::closeConnection: 
    Close connection to ctl1 #44 with error: epoll-thread: Read/Write failed.
    Connection timed out. siteIndex 0, withinSiteIndex 44

The issue may also occur when launching data nodes, which may report an "IO error type 1: Socket is disconnected/closed or file is closed" error, indicating network connectivity problems.
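A minimal connectivity check between two machines in the cluster (the IP and port are placeholders; nc may need to be installed separately):
ping <node_ip>
// verify basic reachability of the peer machine
nc -zv <node_ip> <node_port>
// verify that the peer's DolphinDB port is reachable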

To resolve: Contact IT support to restore network connections.

Corrupted RSA Key

If an RSA key becomes corrupted (possibly due to disk-full errors, system crashes, or bugs during previous runtime), it can prevent successful communication between nodes, causing the process to get stuck. For instance:
2023-10-25 11:55:04.987161 <ERROR> :Failed to decrypt the message by RSA public key.

To resolve: Remove the <HOME_DIR>/<NodeAlias>/keys folder on all controllers and restart the cluster. This action will prompt DolphinDB to regenerate the RSA key. After startup, re-submit all scheduled jobs.

Slow Startup

Transaction Rollback

If startup logs show "Will process pending transactions" but lack the "ChunkMgmt initialization completed" message, the system is performing a transaction rollback. If the node crashed during a write transaction involving a large amount of data or multiple partitions, the rollback process might take a long time.

To monitor the rollback progress, follow these steps.

  1. Check transaction folders in the <chunkMetaDir>/LOG directory, where a decreasing number of tid-named folders indicates ongoing rollback. Rollback speed can only be estimated by the folder deletion rate.
  2. If there were pending delete transactions before shutdown, also check the <Alias>/<volumes>/LOG directory for tid-named folders. The rollback process is complete when the folder count in this directory reaches zero.

Note: To count the number of folders on Linux, you can use a command such as ll -hrt <chunkMetaDir>/LOG | wc -l.
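To track progress over time, a simple sketch that refreshes the folder count every minute (assuming the watch utility is available; the path is a placeholder):
watch -n 60 "ls <chunkMetaDir>/LOG | wc -l"
// the rollback is progressing as long as the count keeps decreasing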

To resolve: While completing the rollback is recommended to prevent data inconsistency, you can skip it if immediate startup is required:

  1. Safely shut down the node with kill -15 <pid>, or use kill -9 to force a shutdown.
  2. Move the <chunkMetaDir>/LOG and <Alias>/<volumes>/LOG folders to a backup location.
  3. Restart the node and verify the integrity of previously written data.
  4. Recover data if needed.

Redo Log Replay

If startup logs show "Start recovering from redo log. This may take a few minutes" but lack the "Completed CacheEngine GC and RedoLog GC after applying all redo logs and engine is <engineType>" message, the system is replaying redo logs. During this process, logs containing "RedoLog" are recorded:
"applyTidRedoLog : 20716,f7dbaef9-05bc-10b6-f042-a14bc0e9c897,
    /home/DolphinDB/server/clusterDemo/data/node2/storage/CHUNKS/snapshotDB
    /20220803/Key17/5o7,shfe_5,166259,107,0"

DolphinDB has three types of redo log replays (OLAP, TSDB, PKEY storage engines). To monitor their progress:

  1. Check the tid-named .log files in the redoLogDir, TSDBRedoLogDir, and PKEYRedoLogDir directories. Replay completes when no files remain.
  2. Get the directory sizes (using du -sh <redoLogDir> on Linux) and divide each size by the disk read speed to estimate the minimum time required to complete each replay (see the example below).
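For example, a rough estimate under assumed numbers (paths and the disk read speed are illustrative):
du -sh <redoLogDir> <TSDBRedoLogDir> <PKEYRedoLogDir>
// e.g. 20 GB of redo logs read at about 200 MB/s needs at least 20480/200 ≈ 100 seconds to replay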

To resolve: While completing the redo log replay is recommended to prevent data inconsistency, you can skip it if immediate startup is required:

  1. Safely shut down the node with kill -15 <pid>, or use kill -9 to force a shutdown.
  2. Move the head.log file from redoLogDir/TSDBRedoLogDir/PKEYRedoLogDir to a backup location.
  3. Restart the node and verify the integrity of previously written data.
  4. Recover data if needed.

Other Issues

Startup Script Execution Failed or Running Slowly

When startup script startup.dos executes slowly, it delays the initialization of scheduled jobs. Although the node appears green (accessible) in the web interface after redo log replay finishes, scheduled job functions remain unavailable until startup.dos completes its execution.

When startup.dos or postStart.dos encounters errors, they will be logged but the script continues running, skipping the failed line without rollback. The node will still start, but users must handle any problems caused by these script failures. Note that in cluster mode, scripts may fail when accessing DFS tables since they might run before the database is fully initialized.

To resolve: Keep startup scripts simple by limiting them to basic operations like creating shared tables or loading plugins. Avoid lengthy operations or DFS table operations in these scripts. For more details, see Tutorial > Startup Scripts.

You can refer to the following script in startup.dos or postStart.dos to ensure the distributed storage is ready before proceeding.
def isClusterOk() {
    do {
        try {
            meta = rpc(getControllerAlias(), getClusterChunksStatus)
            configReplicaCount = 2 // configured as dfsReplicationFactor

            cnt1 = exec count(*) from meta where state != "COMPLETE"
            cnt2 = exec count(*) from meta where replicaCount != configReplicaCount
  
            if (cnt1 == 0 and cnt2 == 0) {
                break
            } else {
                writeLog("startup isClusterOk: state != 'COMPLETE' cnt: " + string(cnt1) + ",
                " + "replicaCount != " + string(configReplicaCount) + " cnt: " + string(cnt2))
            }
        } catch (err) {
            writeLog("startup isClusterOk: " + err)
        }

        sleep(3*1000)
    } while(1)

    return true
}

res = isClusterOk()