Kinetica provides a simple alerting system controlled by Host Manager that can monitor the health of the system and resource usage.
Alerts are enabled by default and pre-configured to report on errors, critical system states, and a nominal set of resource usage thresholds.
If an alert is triggered, the alert is stored in memory and also logged to
alert_store.json in the head node's persist directory. In a default
configuration that file is here:
If specified, the system can run a user-provided executable upon receiving each alert. See Run an Executable for details.
Below is a table containing the different alert types and the parameters associated with each.
|Alert Type|Parameters|
|---|---|
|host_status|host [host IP] status [status code]|
|rank_status|rank [rank number] status [status code]|
|memory_absolute|host [host IP] threshold [threshold crossed] available_memory [free memory in bytes] total_memory [total memory on host] percentage_space [percentage of free memory]|
|memory_percentage|host [host IP] threshold [threshold crossed] available_memory [free memory in bytes] total_memory [total memory on host] percentage_space [percentage of free memory]|
|disk_absolute|directory [directory where threshold crossed] host [host IP] threshold [threshold crossed] available_space [free space in bytes] total_space [total space on disk] percentage_space [percentage of free disk space]|
|disk_percentage|directory [directory where threshold crossed] host [host IP] threshold [threshold crossed] available_space [free space in bytes] total_space [total space on disk] percentage_space [percentage of free disk space]|
|rank_cuda_error|rank [rank where error occurred] error_count [number of calls with errors] error_code [most recent CUDA error code] error_string [most recent error message]|
|rank_fallback_allocator|failure_count [number of GPU memory allocation failures] min_failure_size [smallest allocation that failed in bytes] avg_failure_size [average size of allocation failure] max_failure_size [largest allocation that failed in bytes]|
|error_message|rank [rank where error occurred] topic [1-2 word string that indicates what functionality caused the error] message [free text description of error wrapped in single quotes]|
To enable alerts, edit /opt/gpudb/core/etc/gpudb.conf and set the enable_alerts parameter:
enable_alerts = true
Alerting can be modified in any of the following ways by setting the
appropriate parameter in gpudb.conf:
- Run an Executable
- Status Changes
- CUDA Errors
- Fallback Allocator Errors
- System Errors
- Memory Usage
- Disk Usage
- Maximum Alerts
Run an Executable
An executable can be run whenever an alert is triggered. The executable cannot be configured with its own command-line parameters, because the system passes it parameters that describe the alert. It must be present on the host where the head node is located and must be executable by the gpudb user; it does not need to be present on other hosts.
alert_exe = /file/path/to/script
The executable will be passed alert strings in the form:
[alert_type] --[parameter1] [value1] ... --[parameterN] [valueN]
For example:
rank_status --rank 0 --status running
The order of the command-line parameters passed to the executable is not guaranteed.
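Because the parameter order is not guaranteed, a handler is easier to keep correct if it collects parameters by name rather than by position. The following is a minimal sketch of such a handler in Python, not a script shipped with Kinetica. The names `parse_alert` and `main` are hypothetical, and the sketch assumes that every `--name` is followed by exactly one value argument (including the quoted message of an error_message alert).

```python
#!/usr/bin/env python3
"""Hypothetical alert handler; alert_exe would point at this file."""
import sys


def parse_alert(argv):
    """Split [alert_type] --name value ... into (alert_type, params).

    Parameters are collected by name because their order is not guaranteed.
    Assumption: every --name is followed by exactly one value argument.
    """
    if not argv:
        raise ValueError("no alert type given")
    alert_type, rest = argv[0], argv[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected --name value pairs")
    params = {}
    for name, value in zip(rest[0::2], rest[1::2]):
        if not name.startswith("--"):
            raise ValueError(f"unexpected argument: {name}")
        params[name[2:]] = value
    return alert_type, params


def main(argv):
    alert_type, params = parse_alert(argv)
    # Illustrative action only: write one line per alert to stderr.
    details = ", ".join(f"{k}={v}" for k, v in sorted(params.items()))
    print(f"ALERT {alert_type}: {details}", file=sys.stderr)


# Do nothing when the file is run without alert arguments.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

With the example alert above, `parse_alert` returns the alert type `rank_status` and the dictionary `{"rank": "0", "status": "running"}`, regardless of which parameter comes first.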
Status Changes
Alerts can be generated when the status of a host or rank changes. Filters can be applied to control which statuses generate alerts. All the statuses are available here.
alert_host_status = true
alert_host_status_filter = fatal_init_error
alert_rank_status = true
alert_rank_status_filter = fatal_init_error, not_responding, terminated
The host and rank status filters are inclusive: alerts are generated only for the statuses listed. For example, alert_host_status_filter = fatal_init_error, shutdown would generate alerts only for fatal init errors and shutdown events. Leave the filter setting blank to alert on all status changes.
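The inclusive filter rule can be illustrated with a short Python sketch. This only models the behavior described above and is not Kinetica's implementation; `parse_filter` and `should_alert` are hypothetical names.

```python
def parse_filter(setting):
    """Turn a comma-separated filter value into a set of status names."""
    return {s.strip() for s in setting.split(",") if s.strip()}


def should_alert(status, filter_setting):
    """Return True if a change to `status` passes the inclusive filter.

    A blank filter means every status change generates an alert.
    """
    allowed = parse_filter(filter_setting)
    return not allowed or status in allowed
```

Under this model, `should_alert("shutdown", "fatal_init_error, shutdown")` is true, `should_alert("running", "fatal_init_error, shutdown")` is false, and any status passes a blank filter.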
CUDA Errors
CUDA errors on ranks can generate alerts.
alert_rank_cuda_error = true
Fallback Allocator Errors
Rank fallback allocator errors can generate alerts.
alert_rank_fallback_allocator = true
System Errors
System errors can generate alerts in the event of significant or noteworthy runtime issues.
alert_error_messages = true
Memory Usage
Memory usage alerts can be generated based on either an absolute threshold (in bytes) or a percentage. These thresholds are measured against the available host memory as reported by sysconf.
alert_memory_absolute = <mem-free-in-bytes>
# or
alert_memory_percentage = <mem-free-percent>
Only one type of threshold can be enabled at a time, but each setting accepts multiple values, e.g., alert_memory_percentage = 20, 10, 5, 1. If multiple thresholds are crossed, only the lowest one will trigger an alert.
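The "lowest threshold wins" rule can be sketched in Python as follows. This is an illustration of the rule as described, not Kinetica's code: `host_memory` and `lowest_crossed` are hypothetical names, the sysconf keys used assume a Linux host, and treating a value equal to the threshold as crossed is an assumption.

```python
import os


def host_memory():
    """Return (available_bytes, total_bytes) as reported by sysconf.

    Assumption: a Linux host, where SC_AVPHYS_PAGES and SC_PHYS_PAGES exist.
    """
    page = os.sysconf("SC_PAGE_SIZE")
    available = os.sysconf("SC_AVPHYS_PAGES") * page
    total = os.sysconf("SC_PHYS_PAGES") * page
    return available, total


def lowest_crossed(free_value, thresholds):
    """Return the lowest threshold that free_value has fallen to or below.

    Returns None when no threshold is crossed. Counting "equal to" as
    crossed is an assumption of this sketch.
    """
    crossed = [t for t in thresholds if free_value <= t]
    return min(crossed) if crossed else None
```

With thresholds of 20, 10, 5, and 1 percent and 4 percent of memory free, three thresholds are crossed and `lowest_crossed(4.0, [20, 10, 5, 1])` returns 5. With 50 percent free it returns `None`.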
Disk Usage
Disk usage alerts can be generated based on either an absolute threshold (in bytes) or a percentage. These thresholds are measured against the available disk space on the drive(s) hosting each persist directory.
alert_disk_absolute = <disk-free-in-bytes>
# or
alert_disk_percentage = <disk-free-percent>
The monitored persist directories are those specified in the following
Only one type of threshold can be enabled at a time, but each setting accepts multiple values, e.g., alert_disk_percentage = 20, 10, 5, 1. If multiple thresholds are crossed, only the lowest one will trigger an alert.
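To check a directory by hand against a candidate absolute threshold, something like the following Python sketch may help. It uses the standard library's `shutil.disk_usage` and borrows the key names from the disk alert parameters. How Kinetica itself measures free space is not stated here, so the numbers may differ slightly; `disk_report` and `below_absolute` are hypothetical names.

```python
import shutil


def disk_report(directory):
    """Return free bytes, total bytes, and percent free for one directory."""
    usage = shutil.disk_usage(directory)
    return {
        "directory": directory,
        "available_space": usage.free,
        "total_space": usage.total,
        "percentage_space": 100.0 * usage.free / usage.total,
    }


def below_absolute(report, threshold_bytes):
    """Return True if free space is at or below the threshold in bytes."""
    return report["available_space"] <= threshold_bytes
```

For example, `below_absolute(disk_report("/some/persist/dir"), 10 * 1024**3)` reports whether that (placeholder) directory has 10 GiB or less free.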
Maximum Alerts
The number of generated alerts stored in memory and on disk can be limited to a specified maximum.
alert_max_stored_alerts = 100