Best Practices for Log Monitoring in DolphinDB
Efficient log monitoring is key to maintaining stability in complex big data systems. DolphinDB, a high-performance time-series database, requires real-time log analytics to support high availability (HA) operations. This guide introduces a lightweight monitoring solution built on Loki, Promtail, and Grafana. It covers system deployment, multi-node configuration, log label extraction, alert rule definition, and common issue diagnosis.
With low storage overhead and high scalability, the solution enables rapid anomaly detection and efficient operations in latency-sensitive environments such as finance and IoT.
1. Architecture Overview
A typical Loki monitoring architecture comprises three core components: Promtail, Loki, and Grafana.
- Promtail acts as the log collection agent. It reads logs from local sources (such as the DolphinDB data node log files), attaches metadata labels for classification, and streams logs to Loki via HTTP.
- Loki serves as the centralized log aggregator. It ingests and indexes logs based on labels rather than full-text content, enabling efficient storage and fast retrieval.
- Grafana offers visualization, query interfaces, and alerting dashboards.
The architecture diagram (Figure 1-1) illustrates the data flow: red arrows represent log ingestion, while blue arrows indicate query and alert flows.

1.1 Promtail Overview
Promtail is a lightweight log forwarder typically deployed on each monitored node. It supports the following features:
- Discover and collect logs from local files or systemd journals (for ARM and AMD64 platforms).
- Attach metadata as labels to log streams.
- Push labeled logs to a designated Loki instance.
Log File Discovery
Before Promtail can push log data to Loki, it must first understand the logging environment. This involves identifying which applications are writing logs and determining which log files should be monitored.
Although Promtail uses Prometheus-style service discovery, its local daemon mode limits cross-node discovery. In Kubernetes environments, Promtail integrates with the Kubernetes API to enrich logs with pod and container metadata, enabling scalable log routing in distributed systems.
Configuration
Promtail is configured using stanzas, allowing fine-grained control over log sources, filters, and label enrichment. For advanced configuration, refer to the Promtail Configuration Guide.
1.2 Loki Overview
Loki is an open source log aggregation system developed by Grafana Labs, optimized for cloud-native observability. Unlike traditional log systems, Loki avoids full-text indexing. Instead, it indexes only the log metadata (labels), which makes it more efficient, scalable, and tightly integrated with Prometheus.
Key Characteristics:
- Horizontal scalability: Loki can scale from small deployments (e.g., Raspberry Pi) to petabyte-scale daily log volumes. Its decoupled read/write paths and microservice architecture make it well-suited for Kubernetes.
- Multi-tenancy: Loki supports tenant isolation through label-based scoping and tenant IDs, allowing multiple clients to share a single Loki instance securely.
- Third-Party integration: Loki is compatible with a wide range of log forwarders and observability tools via plugin-based integration.
- Efficient storage: Log data is stored in compressed chunks using object storage backends (e.g., Amazon S3, GCS). Minimal indexing results in significantly reduced storage costs compared to traditional systems.
- LogQL: Loki's query language, LogQL, enables powerful log filtering and metric extraction, bridging the gap between metrics and logs (see the example queries after this list).
- Alerting with Ruler: Loki includes a Ruler component for real-time alert evaluation based on logs, integrating with Grafana Alerting and Prometheus Alertmanager (a minimal rule sketch follows this list).
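For reference, below are two illustrative LogQL queries of the kind used later in this tutorial. The first filters the stream labeled job="dolphinDB" (the label defined in the Promtail configuration in Section 3.1) to lines containing "ERROR"; the second turns that filter into a metric, counting matching lines over the last 5 minutes, which is the form used for alerting.

{job="dolphinDB"} |= "ERROR"
count_over_time({job="dolphinDB"} |= "ERROR" [5m])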
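This tutorial configures alerting through Grafana rather than the Ruler, but for orientation, Ruler rule files follow the Prometheus rule format with LogQL expressions. A minimal sketch, assuming the job="dolphinDB" label defined later in this tutorial, might look like:

groups:
  - name: dolphindb-log-alerts
    rules:
      - alert: DolphinDBErrorLogs
        # Fire if any ERROR lines appeared in the last 5 minutes and the condition holds for 2 minutes
        expr: sum(count_over_time({job="dolphinDB"} |= "ERROR" [5m])) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: ERROR entries detected in DolphinDB logs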
This tutorial uses Loki 2.5 and the matching Promtail 2.5 release, together with Grafana 9.0.5. The following sections detail the deployment and integration process of Loki-based logging for a DolphinDB HA cluster with three data nodes.
2. Environment Setup
Simulated Server Topology
The deployment environment consists of three nodes simulating a distributed DolphinDB cluster. Each node runs DolphinDB components and log monitoring services as shown below:
IP Address | Hostname | Node Roles | DolphinDB Ports | Monitoring Services | Monitoring Ports |
---|---|---|---|---|---|
10.0.0.80 | vagrant1 | controller, agent, datanode, computenode | 8800, 8801, 8802, 8803 | Grafana, Loki, Promtail | 3000, 3100, 9080 |
10.0.0.81 | vagrant2 | controller, agent, datanode, computenode | 8800, 8801, 8802, 8803 | Promtail | 9080 |
10.0.0.82 | vagrant3 | controller, agent, datanode, computenode | 8800, 8801, 8802, 8803 | Promtail | 9080 |
Software Versions
Component | Version |
---|---|
Grafana | 9.0.5 |
Loki | 2.5 |
Promtail | 2.5 |
Note: Ensure Grafana is pre-installed on the monitoring node (vagrant1) before proceeding.
3. Installation and Deployment
Before proceeding, ensure a high availability DolphinDB cluster is already deployed with multiple data nodes. Refer to High-availability Cluster Deployment or Multi-Container Deployment With Docker Compose for setup instructions.
3.1 Installing Loki and Promtail
File Preparation
Download the installation packages (included with this guide):
- loki-linux-amd64.zip → upload to the monitoring server (10.0.0.80)
- promtail-linux-amd64.zip → upload to all DolphinDB nodes (10.0.0.80, 10.0.0.81, 10.0.0.82)
Install Loki on the Monitoring Server (10.0.0.80)
Create the installation directory:
mkdir -p /usr/local/logsCollect/loki
Create directories for Loki storage and index:
mkdir /data/loki
mkdir /data/loki/{chunks,index}
Unzip the Loki binary to the target directory:
unzip loki-linux-amd64.zip -d /usr/local/logsCollect/loki
Navigate to the installation directory and create the configuration file:
cd /usr/local/logsCollect/loki
vim config.yaml
Add the following content to "config.yaml":
auth_enabled: false  # Enable or disable authentication

server:
  http_listen_port: 3100  # HTTP service listening port

ingester:
  lifecycler:
    address: 10.0.0.80  # IP address of the monitoring server
    ring:
      kvstore:
        store: inmemory  # Storage backend for ring metadata. Options: inmemory, consul, etcd
      replication_factor: 1  # Sets the number of data replicas. A value of 1 disables replication.
    final_sleep: 0s  # Wait time before shutdown for graceful termination
  chunk_idle_period: 5m  # Marks a chunk as complete if it receives no logs for 5 minutes
  chunk_retain_period: 30s  # Waits 30 seconds after chunk completion before writing to storage

schema_config:
  configs:
    - from: 2024-04-01  # Effective start date of this schema configuration
      store: boltdb  # Storage backend for index data. Common values: boltdb, cassandra
      object_store: filesystem  # Object store type for chunk storage. Common values: filesystem, s3, gcs
      schema: v11  # Storage schema version used by Loki
      index:
        prefix: index_  # Prefix for index tables
        period: 168h  # Duration covered by each index table (7 days)

storage_config:
  boltdb:
    directory: /data/loki/index  # Directory path for index files
  filesystem:
    directory: /data/loki/chunks  # Directory path for chunk files

limits_config:
  enforce_metric_name: false  # Do not require a metric name label on log streams
  reject_old_samples: true  # Reject log samples older than the allowed time window
  reject_old_samples_max_age: 168h  # Maximum age of accepted log samples (7 days)
  ingestion_rate_mb: 1024  # Global ingestion rate limit in MB/s
  ingestion_burst_size_mb: 2048  # Global ingestion burst size in MB

chunk_store_config:
  max_look_back_period: 168h  # Maximum lookback duration for log queries (must align with index period)

table_manager:
  retention_deletes_enabled: true  # Enable automatic deletion of expired tables
  retention_period: 168h  # Log retention duration (7 days)
Make sure to modify the server IP and listening port to match your deployment environment. By default, Loki enforces an ingestion rate limit of 4 MB/s. Without proper configuration, high log volume may trigger rate limit errors. The above configuration increases the global ingestion threshold to prevent this.
Run the following command in the installation directory:
nohup ./loki-linux-amd64 -config.file=./config.yaml >./server.log 2>&1 &
To stop Loki:
kill -9 $(pgrep -f "loki-linux-amd64")
To check logs:
tail -200f server.log
Once started successfully, you should see startup logs similar to Figure 3-1.

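In addition to checking the log output, you can verify that Loki is serving requests by querying its standard HTTP endpoints (adjust the address to your deployment):

curl http://10.0.0.80:3100/ready
curl http://10.0.0.80:3100/loki/api/v1/labels

The /ready endpoint returns ready once startup has completed, and the labels endpoint returns the label names currently known to Loki (it will be nearly empty until Promtail starts pushing logs).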
Configure Promtail in the DolphinDB HA Cluster
Follow the steps below on each DolphinDB node (10.0.0.80, 10.0.0.81, 10.0.0.82). The following example uses 10.0.0.80.
Create the installation directory:
mkdir -p /usr/local/logsCollect/promtail
Navigate to the directory where the installation package is located and extract the Promtail archive:
unzip promtail-linux-amd64.zip -d /usr/local/logsCollect/promtail
Navigate to the install directory and create the "promtail.yaml" file:
cd /usr/local/logsCollect/promtail
vim promtail.yaml
Add the following configuration:
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: ./positions.yaml

clients:
  - url: http://10.0.0.80:3100/loki/api/v1/push  # Loki server URL for pushing logs

scrape_configs:
  # DolphinDB server logs
  - job_name: dolphinDB
    static_configs:
      - targets:
          - 10.0.0.80
        labels:
          job: dolphinDB
          host: 10.0.0.80
          __path__: /home/vagrant/v2.00.11.13/server/clusterDemo/log/*.log  # Match all *.log files in directory
    pipeline_stages:
      - regex:
          expression: '^(?P<ts>\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\.\d+)\s(?P<level><\w+>)\s:(?P<message>.*)$'
      - timestamp:
          source: ts
          format: 2006-01-02 15:04:05.000000
          timezone: "Asia/Shanghai"  # Must be an IANA time zone name
      - labels:
          level:
      - output:
          source: message
  # Core dump file monitoring
  - job_name: core_file_monitor
    static_configs:
      - targets:
          - 10.0.0.80
        labels:
          job: core_files
          host: 10.0.0.80
          __path__: /home/vagrant/v2.00.11.13/server/clusterDemo/log/core.*  # Match core.* files
    pipeline_stages:
      - labels:
          filename: __path__  # Extract filename from file path as label
      - output:
          source: filename  # Use filename as log content
      - limit:
          rate: 10  # Max 10 logs per second
          burst: 10  # Allow short bursts of up to 10 logs
          drop: true  # Drop logs exceeding rate limits
Notes:
- job_name: dolphinDB monitors DolphinDB log output.
- job_name: core_file_monitor tracks core dump file creation.
- Adjust __path__ to match your local DolphinDB log directory.
- The url under clients should point to the correct Loki server.
- IP addresses in targets and host should reflect the local node where Promtail is running. These are used to identify the log source in Loki.
Execute the following command in the Promtail install directory:
nohup ./promtail-linux-amd64 -config.file=./promtail.yaml >./server.log 2>&1 &
Stop promtail:
kill -9 $(pgrep -f "promtail-linux-amd64")
Verify startup:
tail -200f server.log
If logs indicate successful startup, proceed to deploy Promtail on the remaining nodes in the cluster.

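On each node, you can also confirm that Promtail is running and has discovered the expected log files, for example by checking its Prometheus metrics endpoint (exposed on the port configured under server, 9080 here) and the positions file it maintains. The commands below assume Promtail was started from the install directory used above:

curl -s http://localhost:9080/metrics | grep "^promtail_"
cat /usr/local/logsCollect/promtail/positions.yaml

The positions file lists each tailed log file together with the offset Promtail has read up to.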
3.2 Alert Configuration in Grafana
3.2.1 Configure Loki as a Data Source in Grafana
Assuming Grafana is already installed, it is accessible via port 3000.
In this tutorial, Grafana runs on the monitoring server at 10.0.0.80, so the default URL is 10.0.0.80:3000, and the default username and password are both admin.
After logging in, navigate to Data Sources in the Grafana interface:

Click "Add data source", then choose Loki from the list, locate the URL field (see figure 3‑4) and enter the following address:

Click "Save & Test" to validate the connection and save the configuration. Go back to the home page and click Explore from the sidebar. In the Data source dropdown, select Loki, click “Log browser”, select the dolphinDB job, then click “Show logs”.

If logs are displayed successfully, the data source is correctly configured. See figure 3‑6 for a sample result.

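You can also type a LogQL query directly into the Explore query field instead of using the Log browser. For example, to view only error lines from a single node, filter on the job and host labels set in promtail.yaml:

{job="dolphinDB", host="10.0.0.80"} |= "ERROR"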
3.2.2 Alert Panel Configuration in Grafana
This section explains how to configure alert rules in Grafana based on logs collected by Loki. In this example, Loki performs log checks every 1 minute to detect new error messages. Each check scans logs generated within the previous 5 minutes. If any error logs are detected, the system enters a 2-minute evaluation period. If the alert condition persists throughout this period, Grafana triggers an alert (e.g., sends an email notification).
To create a new alert rule, click Alerting → Alert rules on the main sidebar, then click New alert rule (Figure 3‑7):

Use the numbered interface elements in Figure 3‑7 to follow the configuration steps below:
- Select the data source: Choose Loki as the data source.
- Set the time range for each evaluation: Configure the query to scan logs from the last 5 minutes during each evaluation cycle.
- Define the query expression: Use count_over_time({job="dolphinDB"} |= "ERROR"[5m]) to count log lines containing "ERROR" within the past 5 minutes. For example, if the current time is 15:25:10, the expression evaluates logs from 15:20:10 to 15:25:10 and records the result at 15:25:10 (see Figure 3‑8).
- Configure the alert condition: Set WHEN last() OF A IS ABOVE 0, where last() retrieves the latest value from the query in step 3, and IS ABOVE 0 triggers the alert if any "ERROR" log entries are detected within the defined time window.
- Apply the expression as the condition: Use the expression from step 4 as the alert condition.
- Configure evaluation frequency and duration: Evaluate every 1m means the check runs once every minute; For 2m defines the alert evaluation window as 2 minutes.
- Handle no data scenarios: If no error logs are generated, the expression in step 3 returns no data, resulting in the "NoData" state. To prevent false alerts, set Alert state if no data or all values are null to "OK". This ensures that the absence of data is not treated as a failure.
Note: Grafana determines whether to trigger an alert based on a boolean result. When last() OF A > 0, the expression in step 4 returns true, initiating the alert evaluation period. If the result is less than or equal to zero, the alert is not triggered. Since a log query alone returns a numeric value rather than a boolean, an additional condition expression is required to convert the query result into a boolean outcome for the alert system.

Figure 3-9 illustrates the possible alert rule states:
- Pending: The alert condition has been met, but the system is still within the evaluation period. If the condition remains true throughout this period, the state transitions to "Alerting".
- Alerting: The system has confirmed the anomaly and the alert is actively triggered.
- NoData: The query expression returned no data.
- Normal: The monitored metric is operating within normal parameters.

Click Preview Alerts to preview the alert output. In Figure 3‑10, the "Info" column displays labels extracted by Promtail (such as filename, job, and host), which are configured under the "labels" of the Promtail config file (as shown in Figure 3‑5).

You can customize alert names, groups, and other metadata as needed (Figure 3‑11):

Click Save & Exit to save and exit the configuration.
3.2.3 Alert Email Configuration
To enable email alerts, Grafana must be configured with a valid SMTP server.
Configure the SMTP section in Grafana by editing the "./grafana-9.0.5/conf/defaults.ini" file.
[smtp]
enabled = true
host = smtp.163.com:465 # Specify the SMTP server and port (163 Mail in this example)
user = xxxxxxxxx@163.com # Use your email address (e.g., a personal 163 Mail account)
password = xxxxxxxx # Use the app-specific authorization code (not your login password)
cert_file =
key_file =
skip_verify = false
from_address = xxxxxxxx@163.com # Same as the user email above
from_name = Grafana
ehlo_identity =
startTLS_policy =
Note: If the password includes # or ;, wrap it with triple quotes (e.g., """#password;""").
You may use a custom SMTP server or third-party services such as Gmail. Note that the authorization code is not your email login password—it is an app-specific password that must be generated in your email provider’s security settings. Once SMTP is configured, restart the Grafana service.
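How you restart Grafana depends on how it was installed. A rough sketch for the standalone binary layout assumed in this tutorial, and the equivalent for a package-based installation, is shown below (adjust paths to your environment):

# Standalone binary install: run from the Grafana install directory
cd grafana-9.0.5
nohup ./bin/grafana-server > grafana.log 2>&1 &

# Package / systemd install
systemctl restart grafana-server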
Navigate to Alerting → Contact points and click New contact point.
Configure it as shown in Figure 3-12. If you want to include dynamic labels in your alert emails, refer to Template annotations and labels.

- Addresses: The recipient email address for the alerts.
- Message and subject: Customize the email body and title based on your needs.
- Disable resolved message: When checked, Grafana will not send an additional email when the alert resolves.
Click Test to verify that emails can be sent correctly. If the setup is correct, a test email will be delivered (see Figure 3-13).

Click Save contact point once confirmed.
Next, go to Notification policies and assign the previously created contact point as the default. Refer to Figure 3-14 for guidance.
Click Save to apply the policy.

3.2.4 Verify the Alert Pipeline
On any data node in the DolphinDB HA cluster, execute the following command multiple times to inject log entries containing the ERROR keyword into the node's log file:
writeLogLevel(ERROR,"This is an ERROR message")
If configured correctly, the logs will flow through Promtail to Loki, and Grafana will send an alert email once the defined conditions are met. A successful alert delivery will appear as shown in Figure 3-15.
This confirms that your alert email pipeline is working as expected.

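If the email does not arrive, a quick way to narrow down the problem is to check in Grafana Explore whether the injected entries actually reached Loki. If the query below returns the test lines, the collection path (Promtail → Loki) is working and the issue lies in the alert rule or SMTP settings:

{job="dolphinDB"} |= "This is an ERROR message"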
4. Common Alert Rules
Silence Alerts During After-Hours
To avoid unnecessary alerts outside trading hours, go to Notification policies and create a new mute timing entry. As shown in Figure 4-1, define silent periods using the following settings:

- Time range: Configure multiple entries to cover after-hours. For example, set 00:00–09:00 and 15:00–23:59.
- Days of the week: Set monday:friday to cover weekdays.
- Days of the month: Set 1:31 to include all days.
- Month: Set 1:12 to include all months.
- Years: Set 2025 to apply this configuration throughout the year.
Packet Loss Detection for High-Frequency Data Ingestion
In the SSE (Shanghai Stock Exchange) tick data stream, both stocks and funds share the same channel. Within each channel, OrderIndex and TradeIndex are expected to be continuous. A jump in sequence typically indicates a packet loss.
This can be monitored via changes in the SeqNo field. The Insight plugin already supports logging such events, and additional plugin support is under development.
To enable this alert, define a log rule in DolphinDB that outputs an error when a discontinuity is detected, as illustrated in Figure 4-2.

In Alert Rules, configure a LogQL expression to count error logs related to packet loss over a 5-minute window, with Evaluate every: 10s and For: 20s.
count_over_time({job="dolphinDB"} |= "wrong applseqnum" [5m])
Monitoring for Missing Stock ID or TradeDate in Real-Time Market Data
If a real-time quote record lacks either the stock ID or TradeDate, this should trigger an error log using writeLogLevel. Since these issues are reported with an ERROR level, you can set up a LogQL query to detect ERROR logs from the corresponding job, as shown in Figure 4-3:

The LogQL query is as follows, evaluated every 1 minute for a duration of 2 minutes, with the alert condition WHEN last() OF A IS ABOVE 0.
count_over_time({job="dolphinDB"} |= "ERROR"[5m])
Client Connection Timeout Alert
To catch connection issues, define an alert that triggers when more than five logs containing timeout or connection failed appear within 5 minutes. The configuration is illustrated in Figure 4-4.

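The configuration in Figure 4-4 is defined in the Grafana UI; following the same pattern as the other rules, an expression along these lines matches the description (a case-insensitive match on either keyword over a 5-minute window), with the alert condition WHEN last() OF A IS ABOVE 5:

count_over_time({job="dolphinDB"} |~ "(?i)(timeout|connection failed)" [5m])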
Metadata Recovery Log Monitoring
Trigger an alert if recovery failures exceed 10 times within 1 minute, configured as shown in Figure 4-5.

The LogQL query is as follows, evaluated every 15 seconds over a 30-second window, triggering when last() OF A IS ABOVE 10.
count_over_time({job="dolphinDB"} |~ "(?i)failed to incrementally recover chunk"[1m])
Alert on Missing Log Entries
Trigger an alert if no logs are generated by a service within 10 minutes, configured as in Figure 4-6.

The query uses count_over_time over a 10-minute window, is evaluated every 1 minute for 2 minutes, and triggers when last() OF A IS BELOW 1.
Note:
- Ensure the query covers at least the time range from now-10m to now.
- To monitor specific services, add labels to the query, for example: count_over_time({filename="/home/vagrant/v2.00.11.13/server/clusterDemo/log/agent.log",job="dolphinDB"}[10m]). This counts log entries for the specified file (agent.log) in the past 10 minutes.
User Login Failure Monitoring
Trigger an alert if a user fails login more than 10 times within 1 hour, configured as in Figure 4-7.

The LogQL query is below:
sum by (remoteIP) (count_over_time({job="dolphinDB"} |~ "failed.*The user name or password is incorrect" | logfmt | remoteIP!="" [1h]))
This query counts login failures grouped by IP (remoteIP) and triggers alerts only when a specific IP exceeds the threshold. Adjust the query time range to cover at least now-1h to now. The query is evaluated every 1 minute for 2 minutes, with the alert condition last() OF A IS ABOVE 10.
Out of Memory Log Monitoring
Trigger an alert if more than 2 "Out of memory" errors occur within 5 minutes, configured as in Figure 4-8.

The LogQL query is evaluated every 1 minute for 2 minutes and triggers when last() OF A IS ABOVE 2.
count_over_time({job="dolphinDB"} |= "Out of memory" [5m])
Core Dump Monitoring
Trigger an alert when a core dump file is generated.
The core_file_monitor job defined in the Promtail configuration (promtail.yaml) is specifically used to monitor the generation of core dump files, as shown in Figure 4-9.

Use the following LogQL expression, with alert evaluation every 1 minute over a 2-minute period, and the condition set to WHEN last() OF A IS ABOVE 0.
count_over_time({job="core_files"}[5m])
Shutdown Detection
Trigger an alert when a node goes offline.

Use the following LogQL query, with evaluation every 15 seconds for a 30-second window, and the trigger condition WHEN last() OF A IS ABOVE 0, as shown in Figure 4-10.
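The exact expression depends on the message your DolphinDB deployment writes when a node goes offline; the keyword below is only an illustrative placeholder (not a confirmed DolphinDB log message) and should be replaced with the string shown in Figure 4-10:

count_over_time({job="dolphinDB"} |~ "(?i)shutdown" [3m])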
Low Disk Space Monitoring
Trigger an alert when the system detects insufficient disk space.

Use the LogQL expression below, evaluated every 1 minute for 2 minutes, triggering when last() OF A IS ABOVE 0, as shown in Figure 4-11.
count_over_time({job="dolphinDB"} |~ "(?i)No space left on device" [5m])
Frequent Node Join/Leave or Network Instability Alert
Trigger an alert when a node frequently disconnects and reconnects—specifically, if more than 10 transitions occur within 3 minutes.

Use the following LogQL query, with evaluation every 15 seconds for a 30-second window, and the alert condition WHEN last() OF A IS ABOVE 10, as shown in Figure 4-12.
count_over_time({job="dolphinDB"} |~ "(?i)HeartBeatSender exception" [3m])
5. FAQ
Promtail Timestamp Parsing Behavior
In the Loki and Promtail log monitoring architecture, if Promtail does not explicitly parse timestamps from log entries, Loki will use the ingestion time (i.e., the time the log is pushed to Loki) as the timestamp. To preserve original log timestamps, you must configure "promtail.yaml" accordingly:
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: ./positions.yaml

clients:
  - url: http://10.0.0.80:3100/loki/api/v1/push  # Loki server endpoint

scrape_configs:
  - job_name: dolphinDB
    static_configs:
      - targets:
          - 10.0.0.80
        labels:
          job: dolphinDB
          host: 10.0.0.80
          __path__: /home/vagrant/v2.00.11.13/server/clusterDemo/log/*.log
    pipeline_stages:
      # Extract timestamp, log level, and message using regex
      - regex:
          expression: '^(?P<ts>\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\.\d+)\s(?P<level><\w+>)\s:(?P<message>.*)$'
      # Parse the extracted timestamp into standard format
      - timestamp:
          source: ts
          format: 2006-01-02 15:04:05.000000  # Go-style layout
          timezone: "Asia/Shanghai"  # Must be an IANA time zone name
      # Attach the log level as a label
      - labels:
          level:
      # Use the extracted message as the log output
      - output:
          source: message
After applying this configuration and restarting Promtail and Loki, you may encounter a "timestamp too old" error (see Figure 5-1).

This occurs because Promtail now uses the timestamp extracted from the log itself, and if the log is too old, Loki will reject it.
To resolve this, increase reject_old_samples_max_age in Loki's "limits_config" section:
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 1680h  # Increase max allowed age for logs (70 days)
  ingestion_rate_mb: 1024
  ingestion_burst_size_mb: 2048
After the restart, logs can be filtered in Grafana using the level label (see Figure 5-2). Additionally, log entries display only a single timestamp (see Figure 5-3), indicating that the original timestamps extracted from the logs have been successfully parsed and applied. This confirms that the timestamp parsing configuration is working correctly, and Loki no longer attaches its own ingestion time.


Note: Once the level field is extracted as a label, it must be queried using label matchers instead of plain text searches. For instance, replace count_over_time({job="dolphinDB"} |= "ERROR"[5m]) with count_over_time({job="dolphinDB", level="ERROR"}[5m]). Be sure to update your alert rules accordingly.
Minimizing Alert Evaluation Delay
To improve alert responsiveness, reduce the evaluation interval and the alerting hold time (evaluation window). The configuration shown in Figure 5-4 sets the minimum recommended values, with Evaluate every: 10s (check every 10 seconds) and For: 20s (evaluate over 20 seconds before firing).

Note: The "For" duration must be at least twice the "Evaluate every"
interval.For example, if Evaluate every = 1m
, then "For" must be at
least 2m.
6. Summary
In high-availability deployments of the distributed time-series database DolphinDB, log monitoring plays a vital role in ensuring system reliability and accelerating issue diagnosis. This document introduces a lightweight, efficient, and scalable logging solution built on Loki, Promtail, and Grafana, delivering a robust and cost-effective framework for log collection, analysis, and visualization tailored to DolphinDB’s operational needs.
7. Appendix
- Promtail Linux amd64 installation package and configuration file:
- Loki Linux amd64 installation package and configuration file: