Kinetica provides a simple alerting system, controlled by the Host Manager, that monitors system health and resource usage.
Alerts are enabled by default and pre-configured to report on errors, critical system states, and a nominal set of resource usage thresholds.
If an alert is triggered, it is stored in memory and also logged to alert_store.json in the head node's persist directory. In a default configuration, that file is located at:
/opt/gpudb/persist/gpudb/rank-0/alert_store.json
Alerts can be viewed in the Kinetica Administration Application under Alerts; previous alerts can also be retrieved via the /admin/show/alerts endpoint.
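As a quick way to inspect previously generated alerts outside of the Administration Application, the alert store can also be read directly from disk. The following is a minimal sketch, assuming only that alert_store.json contains valid JSON; its internal structure is not documented here, so the script simply pretty-prints whatever the file holds.

```python
import json
from pathlib import Path

# Default location of the alert store on the head node; adjust to match
# the persist directory configured in gpudb.conf
ALERT_STORE = Path("/opt/gpudb/persist/gpudb/rank-0/alert_store.json")

def dump_alerts(path: Path = ALERT_STORE) -> None:
    """Load the alert store and pretty-print whatever entries it holds."""
    with path.open() as f:
        alerts = json.load(f)
    print(json.dumps(alerts, indent=2))

if __name__ == "__main__":
    dump_alerts()
```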
If specified, the system can run a user-provided executable upon receiving each alert. See Run an Executable for details.
Below is a table containing the different alert types and the parameters associated with each.
Alert Type | Parameters
---|---
host_status | host [host IP], status [status code]
rank_status | rank [rank number], status [status code]
memory_absolute | host [host IP], threshold [threshold crossed], available_memory [free memory in bytes], total_memory [total memory on host], percentage_space [percent free memory]
memory_percentage | host [host IP], threshold [threshold crossed], available_memory [free memory in bytes], total_memory [total memory on host], percentage_space [percent free memory]
disk_absolute | directory [directory where threshold crossed], host [host IP], threshold [threshold crossed], available_space [free space in bytes], total_space [total space on disk], percentage_space [percentage of free disk space]
disk_percentage | directory [directory where threshold crossed], host [host IP], threshold [threshold crossed], available_space [free space in bytes], total_space [total space on disk], percentage_space [percentage of free disk space]
rank_cuda_error | rank [rank where error occurred], error_count [number of calls with errors], error_code [most recent CUDA error code], error_string [most recent error message]
rank_fallback_allocator | failure_count [number of GPU memory allocation failures], min_failure_size [smallest allocation that failed, in bytes], avg_failure_size [average size of failed allocations], max_failure_size [largest allocation that failed, in bytes]
error_message | rank [rank where error occurred], topic [1-2 word string indicating what functionality caused the error], message [free-text description of the error, wrapped in single quotes]
To enable alerts, edit /opt/gpudb/core/etc/gpudb.conf and set the enable_alerts parameter:
enable_alerts = true
Alerting can be modified in any of the following ways by setting the appropriate parameters in /opt/gpudb/core/etc/gpudb.conf:
An executable can be run each time an alert is triggered. The executable cannot be given its own command-line parameters, as the system passes it parameters that pertain to the alert itself. The executable must be present on the host where the head node is located and must be executable by the gpudb user; it does not need to be present on other hosts.
alert_exe = /file/path/to/script
The executable will be passed alert strings in the form:
[alert_type] --[parameter1] [value1] ... --[parameterN] [valueN]
For instance:
rank_status --rank 0 --status running
Important
The order of the command-line parameters passed to the executable cannot be guaranteed.
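The sketch below shows one way such a handler could be written in Python, under the assumption that the alert string is delivered to the executable as its command-line arguments in the form shown above. It parses the parameters into a dictionary (since their order is not guaranteed) and forwards the alert to syslog; the notification step would be replaced with whatever mechanism (email, chat webhook, etc.) is appropriate, and the script saved to the path configured in alert_exe with execute permission for the gpudb user.

```python
#!/usr/bin/env python3
"""Example alert handler for the alert_exe setting (a sketch, not an
official Kinetica sample).  Assumes the handler is invoked as:

    <alert_type> --<parameter1> <value1> ... --<parameterN> <valueN>

e.g.:  rank_status --rank 0 --status running
"""
import sys
import syslog

def parse_alert(argv):
    """Return (alert_type, {parameter: value}) from the alert arguments.

    Parameters are collected into a dict because their order is not
    guaranteed.
    """
    alert_type = argv[0]
    params = {}
    key = None
    for token in argv[1:]:
        if token.startswith("--"):
            key = token[2:]
            params[key] = ""            # parameter with no value (defensive)
        elif key is not None:
            params[key] = token
            key = None
    return alert_type, params

if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: <alert_type> --<param> <value> ...")
    alert_type, params = parse_alert(sys.argv[1:])
    detail = " ".join(f"{k}={v}" for k, v in sorted(params.items()))
    # Forward the alert to syslog; swap in any other notification mechanism
    syslog.syslog(syslog.LOG_WARNING, f"Kinetica alert {alert_type}: {detail}")
```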
Alerts can be generated when the status of a host or rank changes, and filters can be applied to govern which statuses generate alerts. All of the available statuses are listed here.
alert_host_status = true
alert_host_status_filter = fatal_init_error
alert_rank_status = true
alert_rank_status_filter = fatal_init_error, not_responding, terminated
Important
The host and rank status filters are inclusive, alerting only on the specified statuses; e.g., alert_host_status_filter = fatal_init_error, shutdown would only generate alerts for fatal init errors and shutdown events. Leave the filter setting blank to alert on all status changes.
Rank fallback allocator errors can generate alerts.
alert_rank_fallback_allocator = true
System errors can generate alerts in the event of significant or noteworthy runtime issues.
alert_error_messages = true
Memory usage alerts can be generated based on either an absolute threshold (in bytes) or a percentage. These thresholds will be measured against the available host memory as reported by sysconf.
alert_memory_absolute = <mem-free-in-bytes>
# or
alert_memory_percentage = <mem-free-percent>
Note
While only one type of threshold can be enabled, the settings support
multiple values, e.g., alert_memory_percentage = 20, 10, 5, 1. However, if
multiple thresholds are crossed, only the lowest one will trigger an alert.
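To help choose sensible threshold values, the current available memory on a host can be inspected with the same sysconf values. This is a minimal sketch that works on Linux, where the SC_PAGE_SIZE, SC_PHYS_PAGES, and SC_AVPHYS_PAGES sysconf names are available; the example threshold values are illustrative only.

```python
import os

# Available and total physical memory, derived from sysconf (Linux only)
page_size = os.sysconf("SC_PAGE_SIZE")
available = os.sysconf("SC_AVPHYS_PAGES") * page_size
total = os.sysconf("SC_PHYS_PAGES") * page_size

print(f"available: {available} bytes ({100 * available / total:.1f}% free)")

# Illustrative thresholds:
#   alert_memory_percentage = 10           # alert when < 10% of memory is free
#   alert_memory_absolute = 8589934592     # alert when < 8 GiB is free
```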
Disk usage alerts can be generated based on either an absolute threshold (in bytes) or a percentage. These thresholds will be measured against the available disk space on the drive(s) hosting each persist directory.
alert_disk_absolute = <disk-free-in-bytes>
# or
alert_disk_percentage = <disk-free-percent>
The monitored persist directories are those specified in the following /opt/gpudb/core/etc/gpudb.conf settings:
persist_directory
data_directory
object_directory
sms_directory
text_index_directory
Note
While only one type of threshold can be enabled, the settings support
multiple values, e.g., alert_disk_percentage = 20, 10, 5, 1. However, if
multiple thresholds are crossed, only the lowest one will trigger an alert.
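To choose threshold values, the free space on the drives backing the monitored directories can be checked with standard tools. The following is a minimal Python sketch using shutil.disk_usage; the directory list is a placeholder and would be replaced with the paths actually configured in the settings above.

```python
import shutil

# Placeholder list; replace with the directories configured for
# persist_directory, data_directory, etc. in gpudb.conf
PERSIST_DIRS = [
    "/opt/gpudb/persist",
]

for d in PERSIST_DIRS:
    usage = shutil.disk_usage(d)            # named tuple: total, used, free (bytes)
    pct_free = 100 * usage.free / usage.total
    print(f"{d}: {usage.free} bytes free ({pct_free:.1f}%)")

# Illustrative thresholds:
#   alert_disk_percentage = 20, 10, 5
#   alert_disk_absolute = 10737418240       # alert when < 10 GiB is free
```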
The number of generated alerts stored in memory and on disk can be limited to a specified maximum.
alert_max_stored_alerts = 100