Alert Parameters
The table below lists the alert types and the parameters associated with each.

| Alert Type | Parameters |
|---|---|
| host_status | host [host IP] status [status code] |
| rank_status | rank [rank number] status [status code] |
| memory_absolute | host [host IP] threshold [threshold crossed] available_memory [free mem in bytes] total_memory [total mem on host] percentage_space [percent free mem] |
| memory_percentage | host [host IP] threshold [threshold crossed] available_memory [free mem in bytes] total_memory [total mem on host] percentage_space [percent free mem] |
| disk_absolute | directory [directory where threshold crossed] host [host IP] threshold [threshold crossed] available_space [free space in bytes] total_space [total space on disk] percentage_space [percentage of free disk space] |
| disk_percentage | directory [directory where threshold crossed] host [host IP] threshold [threshold crossed] available_space [free space in bytes] total_space [total space on disk] percentage_space [percentage of free disk space] |
| rank_cuda_error | rank [rank where error occurred] error_count [number of calls with errors] error_code [most recent CUDA error code] error_string [most recent error message] |
| rank_fallback_allocator | failure_count [number of GPU memory allocation failures] min_failure_size [smallest allocation that failed in bytes] avg_failure_size [average size of allocation failure] max_failure_size [largest allocation that failed in bytes] |
| error_message | rank [rank where error occurred] topic [1-2 word string that indicates what functionality caused the error] message [free text description of error wrapped in single quotes] |
Using Alerts
To enable alerts, open /opt/gpudb/core/etc/gpudb.conf and modify the enable_alerts parameter. The related alert settings are covered under the following topics:
- Run an Executable
- Status Changes
- CUDA Errors
- Fallback Allocator Errors
- System Errors
- Memory Usage
- Disk Usage
- Maximum Alerts
Run an Executable
An executable can be run when an alert is triggered. The executable cannot be given its own command-line parameters, because the system passes parameters that pertain to the alert itself. The executable must be present on the host where the head node is located, and it must be executable by the gpudb user. It does not need to be present on other hosts.
The order of the command-line parameters passed to the executable is not guaranteed.
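Because the parameter order is not guaranteed, a handler is safest when it looks parameters up by name rather than by position. The following is a minimal Python sketch, not a definitive implementation. It assumes the parameters arrive as alternating name/value arguments matching the Alert Parameters table (for example, host followed by the host IP, then status followed by the status code), and that a quoted message value arrives as a single argument. The function names parse_alert_args and describe_alert are hypothetical, and the exact argument layout should be verified against a real alert invocation.

```python
#!/usr/bin/env python3
import sys


def parse_alert_args(argv):
    """Pair up alternating name/value arguments into a dict.

    Assumption: arguments arrive as name, value, name, value, ...
    Looking values up by name makes the handler independent of order.
    """
    if len(argv) % 2 != 0:
        raise ValueError("expected name/value pairs, got: %r" % (argv,))
    return dict(zip(argv[0::2], argv[1::2]))


def describe_alert(params):
    """Build a one-line summary with the names in sorted order."""
    return ", ".join("%s=%s" % (k, params[k]) for k in sorted(params))


if __name__ == "__main__":
    # sys.argv[0] is the script path; the rest are the alert parameters.
    print(describe_alert(parse_alert_args(sys.argv[1:])))
```

With this approach, the arguments host 10.0.0.1 status shutdown and status shutdown host 10.0.0.1 produce the same dictionary, so the handler behaves identically for either ordering. The IP address here is only an illustrative value.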
Status Changes
Alerts can be generated when the status of a host or rank changes. Filters control which statuses generate alerts. All the statuses are available here.
The host and rank status filters are inclusive, alerting only on the specified statuses; e.g., alert_host_status_filter = fatal_init_error, shutdown would only display alerts for fatal init errors and shutdown events. Leave the filter setting blank to alert on all status changes.
CUDA Errors
CUDA errors on ranks can generate alerts.
Fallback Allocator Errors
Rank fallback allocator errors can generate alerts.
System Errors
System errors can generate alerts in the event of significant or noteworthy runtime issues.
Memory Usage
Memory usage alerts can be generated based on either an absolute threshold (in bytes) or a percentage. These thresholds are measured against the available host memory as reported by sysconf.
While only one type of threshold can be enabled, the settings support multiple values, e.g., alert_memory_percent = 20, 10, 5, 1. However, if multiple thresholds are crossed, only the lowest one will trigger an alert.
Disk Usage
Disk usage alerts can be generated based on either an absolute threshold (in bytes) or a percentage. These thresholds are measured against the available disk space on the drive(s) hosting each persist directory:
- persist_directory
- data_directory
- object_directory
- sms_directory
- text_index_directory
While only one type of threshold can be enabled, the settings support
multiple values, e.g.,
alert_disk_percent = 20, 10, 5, 1. However, if
multiple thresholds are crossed, only the lowest one will trigger an alert.
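The "only the lowest one" rule, which is stated for both memory and disk thresholds, can be illustrated with a short Python sketch. It assumes that a percentage threshold counts as crossed when the free percentage is at or below it. The function name lowest_crossed_threshold is hypothetical, and the code only models the selection rule, not the system's actual implementation.

```python
def lowest_crossed_threshold(thresholds, free_percent):
    """Return the lowest threshold that free_percent has crossed, or None.

    Assumption: a threshold is "crossed" when the free percentage is
    at or below it.
    """
    crossed = [t for t in thresholds if free_percent <= t]
    return min(crossed) if crossed else None


thresholds = [20, 10, 5, 1]  # values from the example setting

# With 4% free, the 20, 10, and 5 thresholds are all crossed; only 5 is reported.
print(lowest_crossed_threshold(thresholds, 4))   # 5
# With 50% free, no threshold is crossed.
print(lowest_crossed_threshold(thresholds, 50))  # None
```

Under this reading, listing several values gives progressively more urgent alerts as free space shrinks, without repeating an alert for every higher threshold that is also crossed.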