Node Startup Exception
When a DolphinDB node fails to start, you won't be able to access its web interface. In a cluster setup, the node's status appears in red on the web Cluster tab. To troubleshoot, first determine whether the node:
- Shut down unexpectedly during startup
- Is stuck in the startup process
- Is taking unusually long to start
Diagnosing the Issue
First, check whether the DolphinDB process is still running:
ps -ef | grep dolphindb
// If you have renamed the executable file, replace dolphindb with your actual file name.
Then search the runtime log for the startup completion message:
grep "Job scheduler initialization completed." dolphindb.log
If the process has exited and the message is not found, the failure occurred during startup. A log entry with a timestamp later than the most recent startup likely indicates the node started successfully; to confirm, refresh the web interface to see if the node status turns green. If the process is still running but there is no such log entry, the node may be stuck or starting slowly.
grep "ERROR" dolphindb.log
If, after starting, you notice a repeated ERROR message in the logs and the node process is still running, this indicates the node is stuck at a specific startup stage.
If no repeated ERROR logs appear, the startup is likely progressing normally but slowly, so allow more time and monitor the progress.
Note: To diagnose specific issues, check the node's runtime log. By default, the log of a data node is named dolphindb.log and is located in the server directory in standalone mode, or under the clusterDemo/log directory in cluster mode. The log path can be modified with the logFile configuration parameter.
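The checks above can be combined into a quick diagnostic script. The following is a minimal bash sketch, assuming the process name is dolphindb and the runtime log is dolphindb.log in the current directory; adjust both to your deployment.

#!/bin/bash
LOG=dolphindb.log

# 1. Is the DolphinDB process still alive? (Adjust the name if you renamed the executable.)
if pgrep -x dolphindb > /dev/null; then
    echo "DolphinDB process is running."
else
    echo "DolphinDB process is not running (unexpected shutdown)."
fi

# 2. Did startup complete?
if grep -q "Job scheduler initialization completed." "$LOG"; then
    echo "Startup completion message found."
else
    echo "No startup completion message; the node may be stuck or starting slowly."
fi

# 3. Which ERROR messages repeat most often (top 5)?
grep "ERROR" "$LOG" | sed 's/^[0-9: .-]*//' | sort | uniq -c | sort -rn | head -5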
Unexpected Shutdown
If the node crashed after it had started successfully (i.e., the log contains "Job scheduler initialization completed."), proceed with the Node Crashes troubleshooting procedures.
If the node process was shut down (for example, by a kill command), the last log entries typically look like:
...
<ERROR> : The socket server ended.
...
<ERROR> : AsynchronousSubscriberImp::run Shut down the subscription daemon.
...
If the crash produced a core dump file, load it with gdb and run bt at the gdb prompt to print the stack trace:
cd /path/to/dolphindb
gdb dolphindb /path/to/corefile
bt
Share the stack trace with DolphinDB technical support for further analysis.
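It can also help to confirm that core dumps are enabled and to capture the backtrace non-interactively. A minimal sketch, where the executable and core file paths are placeholders to be replaced:

ulimit -c                          # 0 means core dumps are disabled for this shell/service
cat /proc/sys/kernel/core_pattern  # shows where and under what name core files are written
# Dump the backtrace to a file that can be sent to technical support.
gdb -batch -ex 'bt' /path/to/dolphindb /path/to/corefile > /tmp/core_bt.log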
Process Stuck
First, monitor the runtime log in real time to check whether new entries are still being written:
tail -f dolphindb.log
To analyze what each thread is doing during the stuck state, you have two options:
- Use pstack to view thread stack traces:
pstack dolphindb_pid > /tmp/pstack.log
// replace dolphindb_pid with the actual process ID
- Use gdb to capture detailed thread stack traces:
gdb -p dolphindb_pid -batch -ex 'thread apply all bt' -ex 'quit' > /tmp/gdb_stacks.log
// replace dolphindb_pid with the actual process ID
Share the stack trace with DolphinDB technical support for further analysis.
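When it is unclear whether the process is stuck or merely slow, taking several stack snapshots some time apart and comparing them can help: unchanged stacks suggest a genuine hang. A minimal sketch using the pstack command shown above (dolphindb_pid is a placeholder):

# Capture three snapshots 30 seconds apart, then diff the files to see whether threads moved.
PID=dolphindb_pid   # replace with the actual process ID
for i in 1 2 3; do
    pstack "$PID" > "/tmp/pstack_$i.log"
    sleep 30
done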
Slow Startup
During a slow startup, the log may contain no ERROR entries at all. Common causes of slow starts are transaction rollbacks and redo log replays (see Common Issues and Solutions).
To analyze what each thread is doing during startup, you have two options:
- Use pstack to view thread stack traces:
pstack dolphindb_pid > /tmp/pstack.log
// replace dolphindb_pid with the actual process ID
- Use gdb to capture detailed thread stack traces:
gdb -p dolphindb_pid -batch -ex 'thread apply all bt' -ex 'quit' > /tmp/gdb_stacks.log
// replace dolphindb_pid with the actual process ID
Share the stack trace with DolphinDB technical support for further analysis.
Common Issues and Solutions
This section covers frequently encountered problems during node startup and provides their solutions to help streamline troubleshooting. For issues not addressed here, please contact the DolphinDB technical support team for assistance.
Unexpected Shutdown
License Expiration
2023-10-13 09:52:30.007743 <WARNING> :
The license has expired. Please renew the license and restart the server.
2023-10-13 09:52:30.163238 <ERROR> :
The license has expired.
To resolve: Contact DolphinDB support to obtain a renewed license.
Port Conflicts
2023-10-26 09:01:31.349118 <ERROR> :Failed to bind the socket on port 8848 with error code 98
2023-10-26 09:01:31.349273 <ERROR> :Failed to bind the socket on port 8848. Shutting down the server. Please try again in a couple of minutes.
To identify which process is occupying the port, run:
netstat -nlp | grep <port number>
To resolve: Either stop the program occupying the port or wait for the previous DolphinDB node to fully shut down. As a last resort, use kill -9 to force terminate it, but this may cause data loss.
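If you choose to wait for the previous node to release the port, a small loop can poll until the port is free before restarting. A minimal sketch, assuming port 8848; replace it with your node's port:

PORT=8848
while netstat -nlt | grep -q ":$PORT "; do
    echo "Port $PORT is still in use; waiting..."
    sleep 5
done
echo "Port $PORT is free; the node can be restarted."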
Corrupted Redo Log
If a data node's redo log file becomes corrupted (possibly due to disk-full errors, system crashes, or bugs during previous runtime), the node will fail to start as the system can't replay logs to restore past transactions.
Note: The location of redo logs depends on your storage engine configuration parameters.
- OLAP: redoLogDir. Defaults to /log/redoLog.
- TSDB: TSDBRedoLogDir. Defaults to /log/TSDBRedo.
- PKEY: PKEYRedoLogDir. Defaults to <ALIAS>/log/PKEYRedo.
2023-12-11 15:18:58.888865 <INFO> :applyTidRedoLog :
2853,c686664b-d020-429a-1746-287d670099e9,
/hdd/hdd7/server/clusterDemo/data/P1-datanode/storage/CHUNKS/multiValueTypeDb1/20231107/Key0/gz,pt_2,32054400,1046013,0
2023-12-11 15:18:58.895064 <ERROR> :VectorUnmarshall::start Invalid data form 0 type 0
2023-12-11 15:18:58.895233 <ERROR> :The redo log for transaction [2853] comes across error:
Failed to unmarshall data.. Invalid message format
2023-12-11 15:18:58.895476 <ERROR> :The ChunkNode failed to initialize with exception
[Failed to unmarshall data.. Invalid message format].
2023-12-11 15:18:58.895555 <ERROR> :ChunkNode service comes up with the error message:
Failed to unmarshall data.. Invalid message format
It indicates that the replay failed due to an invalid format in the redo log file for transaction ID 2853.
To resolve: Bypass redo log replay with the following steps:
- Move the head.log files from <redoLogDir> to a backup location.
- Back up the corrupted 2853.log file separately.
- Restart the node without these logs. After successful startup, check data integrity, especially for data written just before the previous shutdown, and restore any missing data if needed.
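The first two steps above can be done safely by moving the files to a timestamped backup directory instead of deleting them. A minimal shell sketch, where the redo log path is a placeholder for your configured redoLogDir:

REDO_DIR=/path/to/log/redoLog                       # your configured redoLogDir (or TSDBRedoLogDir/PKEYRedoLogDir)
BACKUP_DIR=/path/to/redoLog_backup_$(date +%Y%m%d%H%M%S)
mkdir -p "$BACKUP_DIR"
mv "$REDO_DIR"/head.log "$BACKUP_DIR"/              # prevents redo log replay on the next startup
cp "$REDO_DIR"/2853.log "$BACKUP_DIR"/              # keep the corrupted transaction log for support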
If disk space wasn't the issue, provide both head.log and 2853.log files to DolphinDB technical support for analysis.
Unknown Methods in Function Views/Scheduled Jobs
Deserialization of function views and scheduled jobs may fail if it encounters unknown methods in memory. This typically happens when:
- Required plugins/modules aren't set to preload
- Plugin/module updates have changed method names
For example, the following script schedules a job myTest that calls a method from the rabbitmq plugin:
loadPlugin("plugins/rabbitmq/PluginRabbitMQ.txt")
def myTest() {
HOST="192.168.0.53"
PORT=5672
USERNAME="guest"
PASSWORD="guest"
conn = rabbitmq::connection(HOST, PORT, USERNAME, PASSWORD);
}
scheduleJob("myTest", "myTest", myTest, 15:50m, startDate=today(), endDate=today()+3, frequency='D')
If the plugin has not been loaded when the node restarts, deserializing the job fails and the log shows:
2023-10-13 09:55:30.166268 <ERROR> :CodeUnmarshall::start readObjectAndDependency exception:
Can't recognize function: rabbitmq::connection
2023-10-13 09:55:30.166338 <ERROR> :Failed to unmarshall the job [myTest]. Can't recognize function:
rabbitmq::connection. Invalid message format
To resolve: Add the relevant plugin/module to the preloadModules configuration parameter, e.g., preloadModules=plugins::rabbitmq, and restart the node.
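As an illustration, the preload setting can be appended to the node's configuration file before restarting. A minimal sketch; the file path is an assumption (dolphindb.cfg for standalone mode, cluster.cfg for data nodes in a cluster), and if preloadModules is already set you should edit the existing line instead of appending a new one:

echo "preloadModules=plugins::rabbitmq" >> /path/to/cluster.cfg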
If the error is due to updated plugin/module methods: roll back to the previous plugin/module version; remove the affected function views or scheduled jobs; then proceed with the plugin/module update.
Unknown Shared Tables in Function Views/Scheduled Jobs
Deserialization of function views and scheduled jobs may fail if it encounters unknown shared tables.
Note: This issue was fixed in version 1.30.23.1/2.00.11.1, ensuring that the node startup proceeds without interruption, and the error is logged.
For example, the following script schedules a job that updates a shared table t:
share table(1 2 3 as id, 1 2 3 as val) as t
def myTest() {
update t set val = val + 1
}
scheduleJob("myTest", "myTest", myTest, minute(now())+5, today(), today(), 'D')
If the shared table t is not defined when the node restarts, the log shows:
2023-10-23 09:38:27.746184 <WARNING> :Failed to recognize shared variable t
2023-10-23 09:38:27.746343 <ERROR> :CodeUnmarshall::start readObjectAndDependency exception:
Failed to deserialize update statement
2023-10-23 09:38:27.746404 <ERROR> :Failed to deserialize update statement.
Invalid message format
To resolve:
- Check if any undefined shared tables exist in your jobs.
- Define these tables in the startup.dos script on the node where the scheduled job is located.
- Restart the node.
When a scheduled job (e.g., myTest) is added as a function view using addFunctionView(), it may fail to deserialize during startup, showing similar error messages. Note that since deserialization of function views is performed before executing the startup script, defining shared tables in startup.dos won't help - the table definitions won't be available in time.
To resolve:
For a regular cluster:
- Remove the following files from server/clusterDemo/data/dnode1/sysmgmt: aclEditlog.meta, aclCheckPoint.meta, and aclCheckPoint.tmp.
- Restart the node.
- After the restart, re-add all function views and corresponding permissions.
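For the first step, moving the files to a backup location (rather than deleting them outright) keeps them available for later analysis. A minimal shell sketch, assuming the clusterDemo layout used in this section; some of the files may not exist, in which case mv simply reports them as missing:

SYSMGMT=server/clusterDemo/data/dnode1/sysmgmt
BACKUP=/tmp/sysmgmt_backup_$(date +%Y%m%d%H%M%S)
mkdir -p "$BACKUP"
mv "$SYSMGMT"/aclEditlog.meta "$SYSMGMT"/aclCheckPoint.meta "$SYSMGMT"/aclCheckPoint.tmp "$BACKUP"/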
For a high-availability (HA) cluster:
- If the cluster is still running or has a majority of controllers active:
  - Delete the function view with dropFunctionView("myTest").
  - Generate a checkpoint file for function views to prevent raft log replay from reapplying the previous function views during startup. Use the following command to force checkpoint creation: rpc(getControllerAlias(), aclCheckPoint, true).
- If the cluster has already been restarted:
  - On each controller, remove the following files under <HOME_DIR>/<NodeAlias>/raft: raftHardstate, raftWAL, raftSnapshot, raftWAL.old, and raftSnapshot.tmp. Note that this will invalidate all cluster metadata.
  - Restart the nodes.
Method Name Conflicts
Note: This issue was fixed in version 2.00.11.
For example, the following script adds the ops module's cancelJobEx method as a function view:
use ops
addFunctionView(ops::cancelJobEx)
At the next startup, deserializing the function view fails with:
2023-10-20 08:46:15.733365 <ERROR> :CodeUnmarshall::start readObjectAndDependency exception:
Not allowed to overwrite existing functions/procedures [ops::cancelJobEx] by system users.
2023-10-20 08:46:15.733422 <ERROR> :Not allowed to overwrite existing functions/procedures
[ops::cancelJobEx] by system users.. Invalid message format
To resolve: Remove the corresponding module configured by preloadModules and then restart the node. It is not recommended to add module functions as function views.
Corrupted Scheduled Job File
If the scheduled job file is corrupted, the log shows messages such as:
2023-10-13 09:57:30.456789 <ERROR> :CodeUnmarshall::start readObjectAndDependency exception:
Failed to deserialize update statement
2023-10-13 09:57:30.456789 <ERROR> :Failed to unmarshall the job [myTest].
Failed to deserialize update statement. Invalid message format
To resolve: Remove the following files from server/clusterDemo/data/dnode1/sysmgmt: jobEditlog.meta, jobCheckPoint.meta, and jobCheckPoint.tmp. Then, restart the node. After startup, re-submit all scheduled jobs.
Alternatively, gather and package the affected job files, the scheduled job script reported in the error, and the node’s runtime log. Reach out to DolphinDB technical support for further assistance.
Corrupted Function View File
If the function view file is corrupted, the log shows messages such as:
2023-10-13 09:59:35.786438 <ERROR> :CodeUnmarshall::start readObjectAndDependency exception:
Failed to deserialize sql query object
2023-10-13 09:59:35.786438 <ERROR> :Failed to unmarshall the job [myTest1].
Failed to deserialize sql query object. Invalid message format
To resolve: Remove the related files. For more instructions, see the solutions in section Unknown Shared Tables in Function Views/Scheduled Jobs.
After startup, re-add all function views and corresponding permissions.
Alternatively, gather and package the affected files, the function view script reported in the error, and the node’s runtime log. Reach out to DolphinDB technical support for further assistance.
Corrupted Raft File
If a controller's raft file is corrupted, the log shows messages such as:
2023-10-13 09:59:35.786438 <WARNING> :[Raft] incomplete hardstate file
[/data/server/data/controllerl/raft/raftHardstatel]
2023-10-13 09:59:35.786438 <INFO> :[Raft] Group DFSMaster RaftWAL::reconstruct:
read new file with 83213 entries
2023-10-13 09:59:35.786438 <ERROR> :[Raft] Group DFSMaster RawNode::init:
failed to initialize with exception [basic_string::_S_create].
2023-10-13 09:59:35.786438 <ERROR> :Failed to start DFSMaster with the error message:
basic_string::_S_create
To resolve: Remove the folders <HomeDir>/<nodeAlias>/raft and <HomeDir>/<nodeAlias>/dfsMeta (configured by dfsMetaDir). Then, restart the node. After startup, the leader's metadata will be automatically synchronized to this node.
Alternatively, gather and package both folders and the node’s runtime log. Reach out to DolphinDB technical support for further assistance.
Note: Do not proceed until another node is elected as the leader of the raft cluster.
Process Stuck
Network Issues Among Cluster Nodes
For example, the log may show connection timeouts to other cluster nodes:
2023-11-01 16:00:34.992500 <INFO> :New connection from ip = 192.168.0.44 port = 35416
2023-11-01 16:00:35.459220 <INFO> :DomainSiteSocketPool::closeConnection:
Close connection to ctl1 #44 with error: epoll-thread: Read/Write failed.
Connection timed out. siteIndex 0, withinSiteIndex 44
The issue may also occur when launching data nodes, which may report an "IO error type 1: Socket is disconnected/closed or file is closed" error, indicating network connectivity problems.
To resolve: Contact IT support to restore network connections.
Corrupted RSA Key
If the RSA key file is corrupted, the log shows:
2023-10-25 11:55:04.987161 <ERROR> :Failed to decrypt the message by RSA public key.
To resolve: Remove the <HOME_DIR>/<NodeAlias>/keys folder on all controllers and restart the cluster. This action will prompt DolphinDB to regenerate the RSA key. After startup, re-submit all scheduled jobs.
Slow Startup
Transaction Rollback
If startup logs show "Will process pending transactions" but lack the "ChunkMgmt initialization completed" message, the system is performing a transaction rollback. If the node crashed during a write transaction involving a large amount of data or multiple partitions, the rollback might take a long time.
To monitor the rollback progress, follow these steps.
- Check transaction folders in the <chunkMetaDir>/LOG directory, where a decreasing number of tid-named folders indicates ongoing rollback. Rollback speed can only be estimated by the folder deletion rate.
- If there were pending delete transactions before shutdown, also check the <Alias>/<volumes>/LOG directory for tid-named folders. The rollback process is complete when the folder count in this directory reaches zero.
Note: To count the number of folders on Linux, you can use a command such as ll -hrt <chunkMetaDir>/LOG | wc -l; the sketch below repeats this count periodically.
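A minimal shell sketch that prints the folder count every 10 seconds (the path is a placeholder for your chunkMetaDir); press Ctrl+C to stop:

# A decreasing count indicates the rollback is progressing.
while true; do
    ls -d /path/to/chunkMetaDir/LOG/*/ 2>/dev/null | wc -l
    sleep 10
done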
To resolve: While completing the rollback is recommended to prevent data inconsistency, you can skip it if immediate startup is required:
- Safely shut down the node with kill -15 pid, or use kill -9 to force shut down.
- Move the <chunkMetaDir>/LOG and <Alias>/<volumes>/LOG folders to a backup location.
- Restart the node and verify data integrity of previously written data.
- Recover data if needed.
Redo Log Replay
"applyTidRedoLog : 20716,f7dbaef9-05bc-10b6-f042-a14bc0e9c897,
/home/DolphinDB/server/clusterDemo/data/node2/storage/CHUNKS/snapshotDB
/20220803/Key17/5o7,shfe_5,166259,107,0"
DolphinDB has three types of redo log replay (for the OLAP, TSDB, and PKEY storage engines). To monitor their progress:
- Check the <tid>.log files in the directories specified by redoLogDir, TSDBRedoLogDir, and PKEYRedoLogDir. Replay is complete when no files remain.
- Get the directory sizes (using du -sh <redoLogDir> on Linux) and divide each size by the disk read speed to estimate the minimum time required to complete each replay, as sketched below.
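A minimal shell sketch of this estimate, assuming a sequential read speed of roughly 200 MB/s; adjust the path and speed to your environment:

REDO_DIR=/path/to/log/redoLog        # or the TSDBRedoLogDir / PKEYRedoLogDir directory
SIZE_MB=$(du -sm "$REDO_DIR" | awk '{print $1}')
# Rough lower bound only; actual replay time also depends on write speed and data layout.
echo "Redo log size: ${SIZE_MB} MB; estimated minimum replay time: $((SIZE_MB / 200)) seconds"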
To resolve: While completing the redo log replay is recommended to prevent data inconsistency, you can skip it if immediate startup is required:
- Safely shut down the node with kill -15 pid, or use kill -9 to force shut down.
- Move the head.log file from the directory specified by redoLogDir, TSDBRedoLogDir, or PKEYRedoLogDir to a backup location.
- Restart the node and verify data integrity of previously written data.
- Recover data if needed.
Other Issues
Startup Script Execution Fails or Runs Slowly
When the startup script startup.dos executes slowly, it delays the initialization of scheduled jobs. Although the node appears green (accessible) in the web interface after redo log replay finishes, scheduled job functions remain unavailable until startup.dos completes execution.
When startup.dos or postStart.dos encounters an error, the error is logged but the script continues running, skipping the failed line without rollback. The node will still start, but users must handle any problems caused by these script failures. Note that in cluster mode, scripts may fail when accessing DFS tables because they might run before the database is fully initialized.
To resolve: Keep startup scripts simple by limiting them to basic operations like creating shared tables or loading plugins. Avoid lengthy operations or DFS table operations in these scripts. For more details, see Tutorial > Startup Scripts.
If a script must access DFS tables, you can first wait until all chunks in the cluster are in the COMPLETE state with the expected number of replicas, for example with a function like the following:
def isClusterOk() {
    do {
        try {
            meta = rpc(getControllerAlias(), getClusterChunksStatus)
            configReplicaCount = 2 // configured as dfsReplicationFactor
            cnt1 = exec count(*) from meta where state != "COMPLETE"
            cnt2 = exec count(*) from meta where replicaCount != configReplicaCount
            if (cnt1 == 0 and cnt2 == 0) {
                break
            } else {
                writeLog("startup isClusterOk: state != 'COMPLETE' cnt: " + string(cnt1) + ", " + "replicaCount != " + string(configReplicaCount) + " cnt: " + string(cnt2))
            }
        } catch (err) {
            writeLog("startup isClusterOk: " + err)
        }
        sleep(3*1000)
    } while(1)
    return true
}
res = isClusterOk()