System Freeze
A cluster may freeze during:
- Node startup or initialization: Refer to Node Startup Exception - Process Stuck for solutions.
- System runtime:
  - Check for machine-related abnormal metrics to see if the cluster load is too high, which could slow down system responsiveness.
  - Provide the collected stack traces to DolphinDB technical support for further troubleshooting. To assist technical support in accurately analyzing stack changes, it is recommended to collect stack traces at least twice, with a 3-5 minute interval.
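For the load check in the runtime items above, a quick first look can be scripted. The sketch below is only a rough rule of thumb and not a DolphinDB tool: it assumes a Linux machine, compares the 1-minute load average from /proc/loadavg with the core count reported by nproc, and uses hypothetical function names.

```shell
#!/bin/bash
# Rough load check (illustrative; function names are hypothetical).

# Succeeds (exit status 0) when the load value is greater than the core count.
# $1 = load average, $2 = number of CPU cores
load_exceeds_cores() {
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load > cores) }'
}

# Read the 1-minute load average (Linux only) and report the result.
report_load() {
    load=$(cut -d ' ' -f 1 /proc/loadavg)
    cores=$(nproc)
    if load_exceeds_cores "$load" "$cores"; then
        echo "load $load exceeds $cores cores: the machine may be overloaded"
    else
        echo "load $load is within $cores cores"
    fi
}
```

Calling report_load on each machine prints one line. A load average that stays well above the core count is a hint to look at CPU, disk, and memory metrics in more detail.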
The following introduces two methods for collecting stack traces: pstack and gdb.
Method 1: pstack
Use pstack to collect stack traces by running the following shell script on each machine in the cluster:
#!/bin/bash
# Create the output directory (-p: no error if it already exists)
mkdir -p /root/output/
# PIDs of the data node and controller processes
dpid=`ps -ef |grep "mode datanode" |grep -v grep | awk '{print $2}'`
cpid=`ps -ef |grep "mode controller" |grep -v grep | awk '{print $2}'`
# Dump the stacks of every data node process
for i in $dpid
do
    cd /ddb/software/server
    pstack $i > /root/output/pstack_dnode_${i}.log
done
# Dump the stacks of every controller process
for i in $cpid
do
    cd /ddb/software/server
    pstack $i > /root/output/pstack_ctrl_${i}.log
done
Then, send the generated stack traces in the /root/output directory to DolphinDB technical support for further troubleshooting.
Method 2: gdb
Use gdb to collect stack traces:
#!/bin/bash
# Create the output directory (-p: no error if it already exists)
mkdir -p /root/output/
# PIDs of the data node and controller processes
dpid=`ps -ef |grep "mode datanode" |grep -v grep | awk '{print $2}'`
cpid=`ps -ef |grep "mode controller" |grep -v grep | awk '{print $2}'`
# Attach gdb in batch mode and log the backtrace of every thread (data nodes)
for i in $dpid
do
    cd /home/dolphindb/server
    gdb --eval-command "set logging file /root/output/pstack_dnode_$i.log" --eval-command "set logging on" --eval-command "thread apply all bt" --batch --pid $i;
done
# Same for the controller processes
for i in $cpid
do
    cd /home/dolphindb/server
    gdb --eval-command "set logging file /root/output/pstack_ctl_$i.log" --eval-command "set logging on" --eval-command "thread apply all bt" --batch --pid $i;
done
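Because collecting at least twice, a few minutes apart, is recommended, the collection step can be wrapped in a loop. The following is a minimal sketch, not part of DolphinDB: collect_rounds, STACK_CMD, and OUT_DIR are hypothetical names, and the round number is added to each file name so that a later round does not overwrite an earlier one.

```shell
#!/bin/bash
# Sketch: repeat stack collection so traces from different times can be compared.
# STACK_CMD, OUT_DIR and collect_rounds are illustrative names.

STACK_CMD="${STACK_CMD:-pstack}"     # command that prints the stacks of a PID
OUT_DIR="${OUT_DIR:-/root/output}"   # same output directory as the scripts above

# collect_rounds ROUNDS INTERVAL_SECONDS LABEL PID...
collect_rounds() {
    rounds="$1"; interval="$2"; label="$3"
    shift 3
    mkdir -p "$OUT_DIR"
    round=1
    while [ "$round" -le "$rounds" ]; do
        for pid in "$@"; do
            # One file per process and round, so earlier rounds are not overwritten
            "$STACK_CMD" "$pid" > "$OUT_DIR/pstack_${label}_${pid}_round${round}.log" 2>&1
        done
        # Wait between rounds, but not after the last one
        if [ "$round" -lt "$rounds" ]; then
            sleep "$interval"
        fi
        round=$((round + 1))
    done
}
```

For example, with dpid computed as in the scripts above, running collect_rounds 2 180 dnode $dpid takes two rounds of data node traces three minutes apart.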
2. Releasing Resources via the Emergency Channel
In certain severe scenarios (such as when all worker threads are blocked and cannot be released), the system may enter a "deadlock" state as follows: thread release → requires task cancellation → requires connection → connection fails due to thread exhaustion.
In this case, you can use the emergency channel to log in to the target DolphinDB process and execute commands (such as canceling jobs) without relying on a remote connection, thereby freeing critical resources and restoring system operation.
Note: This feature is supported starting from DolphinDB 2.00.16 / 3.00.3, and is available on Linux only.
2.1 Principles
The emergency channel is a special mechanism internally designed by DolphinDB, dedicated to handling failures under extreme system conditions such as out-of-memory (OOM) scenarios.
This channel operates using a specially reserved memory segment (no more than 10% of the maximum memory), ensuring minimal connectivity and command execution capabilities even when the main memory is exhausted. With this mechanism, the system provides a fallback path to recover from scenarios where remote connections are unavailable and all main threads are blocked.
2.2 Usage Instructions
- Log in to the server of the node that is experiencing the deadlock.
- Using the same system user as the DolphinDB service process, run the following command in another installation directory:
./dolphindb -attach 1 -target-pid <target process PID>
Parameter explanation:
- -attach 1: Indicates that the current process is used for an attached connection.
- -target-pid: Specifies the PID of the target DolphinDB service process.
Once connected, you will enter an interactive DolphinDB console (the command prompt is usually >, but in some cases it may not be displayed).
You can then execute regular script commands to check the status or release resources, for example:
> clearAllCache();
2.3 Notes
- The emergency channel is intended for emergency operation scenarios and should not be used for routine operations.
- To cancel a job, ensure you have the job ID in advance.
- Always use quit to disconnect after completing operations.
- If the connection fails, check the following:
  - Whether the -target-pid is a valid DolphinDB service process PID.
  - Whether the current user is the same as the user running the target process.
  - Whether the current working directory is the correct DolphinDB installation path.
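The connection checks above can be run before attaching. The sketch below is illustrative only: the function names are hypothetical, it assumes a procps-style ps, and it treats an executable ./dolphindb in the current directory as a stand-in for "the correct installation path". It prints the attach command from section 2.2 rather than running it.

```shell
#!/bin/bash
# Sketch of pre-flight checks before attaching; function names are illustrative.

# Succeeds if a process with the given PID exists and may be signalled by us.
# kill -0 performs the permission and existence check without sending a signal.
pid_is_alive() {
    kill -0 "$1" 2>/dev/null
}

# Succeeds if the process is owned by the user running this script.
pid_owned_by_me() {
    owner_uid=$(ps -o uid= -p "$1" 2>/dev/null | tr -d ' ')
    [ -n "$owner_uid" ] && [ "$owner_uid" = "$(id -u)" ]
}

# Run all checks, then print (not run) the attach command.
preflight_attach() {
    pid="$1"
    pid_is_alive "$pid"    || { echo "PID $pid is not a running process"; return 1; }
    pid_owned_by_me "$pid" || { echo "PID $pid belongs to a different user"; return 1; }
    [ -x ./dolphindb ]     || { echo "no dolphindb executable in $(pwd)"; return 1; }
    echo "./dolphindb -attach 1 -target-pid $pid"
}
```

These checks do not confirm that the PID belongs to a DolphinDB process; that still has to be verified by hand, for example with the ps pipeline used in the stack collection scripts.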