Node Crashes
When working with DolphinDB, you may encounter situations where a node crashes unexpectedly. This can manifest as a "Connection refused" error on the client side, or as the DolphinDB process disappearing from the system's process list. This article provides a step-by-step guide to diagnosing the possible causes of such crashes, along with the tools and techniques you can use to address them.
Checking Node Logs for Diagnostic Information
DolphinDB logs all node activities to log files, which provide insight into the system's behavior and can help identify the cause of a crash.
Locating Log Files
By default, the log of a data node is named dolphindb.log and is located in the server directory in standalone mode, or under the server/log directory in cluster mode. The storage path can be modified with the logFile configuration parameter. If multiple nodes are running in the cluster, use the ps command to check the log path of each node.
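For example, a minimal sketch of finding the log paths from the shell (the arguments shown in the output depend on how your nodes were started):
# List running DolphinDB processes; a node started with an explicit
# -logFile argument shows its log path in the command line.
ps aux | grep dolphindb | grep -v grep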
Common Crash Causes
Several issues outside of DolphinDB itself can lead to node shutdowns or crashes:
- Manual shutdown through the web interface or the stopDataNode function.
- Process termination by the operating system's kill command.
- License expiration.
For example, you can use the following command to search the log of node datanode1:
less datanode1.log | grep "MainServer shutdown"
If the message MainServer shutdown appears around the time of the crash, the process may have been shut down.
Analyzing Log Files
(1) Manually Stopped via Web Interface or stopDataNode
To check whether the node was manually stopped via the web interface or with the stopDataNode function, search the controller log for the message has gone offline:
less controller.log | grep "has gone offline"
(2) Killed by Operating System
To check whether the node process was killed by the operating system, search for the Received signal message in the node log:
less datanode1.log | grep "Received signal"
(3) License Expiration
To determine if the node shutdown was caused by an expired license, search for the following message in the log:
The license has expired
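For example, following the same pattern as above, you can search the log of node datanode1 (adjust the file name to your node):
less datanode1.log | grep "The license has expired"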
To address this issue, you can update the license to avoid future disruptions.
Checking Operating System Logs for OOM Events
The Linux kernel employs the Out Of Memory (OOM) killer to prevent system crashes by terminating processes that consume excessive memory.
Inspecting System Logs
To check if the OOM Killer has terminated the DolphinDB process, use the following command to inspect the system logs:
dmesg -T | grep dolphindb
Addressing OOM Terminations
If the message Out of memory: Kill process is shown, it indicates that DolphinDB exceeded the available memory, causing the system to kill the process.
To address this problem, set the maxMemSize parameter in the configuration files to limit the memory usage of the node. For example, if the machine has 16 GB of memory and is running one node, set maxMemSize to approximately 12 GB.
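A minimal sketch of this configuration, assuming a cluster deployment whose node settings live in a cluster.cfg file (the path below is illustrative; the value is in GB):
# Limit the node to about 12 GB on a 16 GB machine; add this line to
# cluster.cfg (cluster mode) or dolphindb.cfg (standalone mode), then restart the node.
echo "maxMemSize=12" >> /path/to/server/config/cluster.cfg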
Identifying Segmentation Faults
If the dmesg command reveals a "segfault" message, it indicates that a segmentation fault has occurred. This happens when the DolphinDB process attempts to access memory that has not been allocated to it.
Common causes of segmentation faults include:
- Accessing system data areas (often by operating on a pointer at address 0x00).
- Memory access out of bounds (e.g., array index out of range, variable type inconsistency).
- Stack overflow (the default stack size on Linux is 8192 KB, verifiable with the ulimit -s command); see the sketch after this list.
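A sketch of checking and, if needed, raising the stack size limit for the shell session that launches DolphinDB (the 16 MB value below is illustrative):
# Show the current stack size limit (in KB).
ulimit -s
# Raise it to 16 MB for the current session before starting the node.
ulimit -s 16384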
Configuring Core Files
Core dump files are crucial for diagnosing and debugging program crashes. They contain a snapshot of the program's memory state and other vital information at the time of termination.
Enabling Core Dump
To check if core dumps are currently enabled, use the following command:
ulimit -c
A result of 0 indicates that core dumps are disabled, while any other number or "unlimited" means they are enabled. To enable core dumps with unlimited size, run:
ulimit -c unlimited
Note: This setting only applies to the current session. For persistent configuration, use one of the following methods:
- Add the following line to /etc/profile and then restart the server:
ulimit -S -c unlimited >/dev/null 2>&1
Alternatively, run source /etc/profile to make the configuration take effect immediately without restarting the server. To set it for a specific user only, modify that user's ~/.bashrc or ~/.bash_profile file.
- Add the following two lines to /etc/security/limits.conf to enable core dumps for all users:
* soft core unlimited
* hard core unlimited
Note: After enabling core dumps, restart the agent first, followed by the data nodes.
Setting Core Dump File Path
By default, a core dump file is named core.pid, where pid is the process ID of the program that caused the segmentation fault, and it is written to the program's working directory.
/proc/sys/kernel/core_uses_pid specifies whether to append the pid as a suffix to the name of the generated core file. Use the following command to change the setting:
echo "1" > /proc/sys/kernel/core_uses_pid
/proc/sys/kernel/core_pattern specifies the file path and file name format. The following example saves core files to the /corefile directory with names in the format core-<command name>-<pid>-<timestamp>:
echo /corefile/core-%e-%p-%t > /proc/sys/kernel/core_pattern
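Note that the directory referenced by core_pattern must exist and be writable by the user running DolphinDB. A minimal sketch, reusing the /corefile path from the example above:
# Create the target directory and make it writable (illustrative permissions).
mkdir -p /corefile
chmod a+w /corefile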
Parameter Reference:
- %p: process ID
- %u: current user ID
- %g: current group ID
- %s: signal that caused the core dump
- %t: time of core dump (UNIX timestamp)
- %h: hostname
- %e: executable filename
Debugging Core Files
To debug core files, use the GNU Debugger (GDB). On RHEL/CentOS-based systems, install GDB with:
yum install gdb
Debug the core file:
gdb [exec file] [core file]
Use the bt command to display the stack trace for further analysis.
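For example, a sketch with illustrative paths (substitute your actual DolphinDB binary and core file):
# Open the core file against the binary that produced it.
gdb /path/to/server/dolphindb /corefile/core-dolphindb-12345-1700000000
# Then, at the (gdb) prompt, print the call stack:
# (gdb) bt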
By following these steps, you can effectively configure and utilize core dumps to diagnose DolphinDB crashes and identify the root causes of issues.
Preventing Node Crashes
Implementing preventive measures can significantly reduce the occurrence of node crashes in DolphinDB. Here are some best practices:
Avoiding Infinite Recursion
When writing recursive functions, always include a termination condition to prevent stack overflow errors. For example, avoid dangerous recursive patterns such as:
def danger(x) {
    // no termination condition: the recursion never stops and overflows the stack
    return danger(x) + 1
}
danger(1)
Monitoring and Optimizing Memory Usage
High memory usage can trigger OOM events. Monitor and optimize memory usage by appropriately configuring write caches for DFS databases and message queues for streaming.
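As a starting point, a sketch of watching the memory footprint of the DolphinDB process from the shell (assuming the process name is dolphindb):
# Monitor the resident memory (RES) of all DolphinDB processes with top.
top -p $(pgrep -d',' dolphindb)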
Avoiding Concurrent Writes to In-Memory Tables
Avoid concurrent writes to in-memory tables. For example, the following script creates a partitioned in-memory table:
t=table(1:0,`id`val,[INT,INT])
db=database("",RANGE,1 101 201 301)
pt=db.createPartitionedTable(t,`pt,`id)
Running two concurrent write jobs on the same partitioned table can cause a crash:
def writeData(mutable t, id, batchSize, n){
    // each call appends n batches of batchSize rows to table t
    for(i in 1..n){
        idv = take(id, batchSize)
        valv = rand(100, batchSize)
        tmp = table(idv, valv)
        t.append!(tmp)
    }
}
job1=submitJob("write1","",writeData,pt,1..300,1000,1000)
job2=submitJob("write2","",writeData,pt,1..300,1000,1000)
When the crash occurs, a core dump is generated, which can be analyzed with GDB as described above.
Implementing Proper Exception Handling in Custom Plugins
Custom plugins that fail to handle exceptions properly might crash the server. Ensure proper error-handling mechanisms are in place to avoid this. Detailed instructions can be found in the Plugin Development Tutorial.
Conclusion
Node crashes in DolphinDB can occur for various reasons, ranging from system resource constraints to coding practices. This guide outlines the steps to diagnose and resolve these issues:
- Examine DolphinDB logs to identify intentional shutdowns, license expirations, or process terminations.
- Check system logs for OOM events and segmentation faults.
- Configure and analyze core dumps for detailed debugging.
- Implement preventive measures, including proper script design, memory management, and exception handling.
For complex issues, please retain all relevant logs and core files, document the steps taken to reproduce the issue, and contact DolphinDB support for further assistance.