Orchestrator Health Monitoring
Monitoring Orchestrator health involves both the KPIs that Orchestrator itself reports and the Linux server monitoring tools natively offered by the hypervisor.
The useful metrics that Orchestrator can provide include:
- Orchestrator Health & Reachability
a. Use GET /gms/rest/gmsserver/ping to check reachability of the Orchestrator web service, and check the return value for a high-level indicator of database health.
b. If the request returns HTTP 200, the Orchestrator is OK.
c. The response should take at most a few hundred milliseconds. Any trend toward longer response times to this ping (for example, much greater than 1 second) indicates that other checks should be run to see whether Orchestrator is running out of resources (CPU, memory, etc.).
d. The response from the REST request is as follows:
{
  "hostname": "orchestrator.localdomain",
  "dbHealth": true,
  "timeStr": "Thu Jan 30 16:15:31 PST 2020",
  "time": 1580429731303,
  "message": "I am alive!",
  "version": "8.8.3.40500",
  "uptime": "9d 2h 52m 39s"
}
The "dbHealth" attribute indicates whether the Orchestrator process has a healthy connection to its database and whether the database server is up and running. It does not indicate whether the database is reaching saturation.
e. Monitor the "uptime" field to confirm continuous operation of the Orchestrator. If the uptime resets to 0 and no manual restart of the Orchestrator was initiated, the Orchestrator restarted on its own.
**Note:** Orchestrator has a built-in mechanism to restart upon failure to minimize Orchestrator unavailability. During any downtime, the SD-WAN is unaffected and continues to operate. Stats and alarms are buffered by the appliance until Orchestrator communications are reestablished. When the Orchestrator recovers, it resynchronizes with the SD-WAN and resumes its operations.
f. Optional: for a history of Orchestrator reboots, use GET /gms/rest/gms/rebootHistory
1. The API returns an array of reboot occurrences over the past 12 months as follows:
[{"reboot_time":1572461092,"version":"99.99.99.43561"}, … ]
where reboot_time is a Unix epoch timestamp in seconds and version is the Orchestrator release.
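The checks in items a through e can be sketched as follows. This is a minimal illustration, not a supported client: the function names, the 1-second threshold constant, and the `base_url` parameter are assumptions introduced here, and authentication and TLS settings, which a real deployment will likely require, are left out.

```python
import json
import re
import time
import urllib.request

# Assumption: threshold taken from the "much greater than 1 second" guidance.
SLOW_PING_SECONDS = 1.0


def parse_uptime(uptime):
    """Convert an uptime string such as '9d 2h 52m 39s' to seconds."""
    units = {"d": 86400, "h": 3600, "m": 60, "s": 1}
    pairs = re.findall(r"(\d+)\s*([dhms])", uptime)
    return sum(int(number) * units[unit] for number, unit in pairs)


def evaluate_ping(status, body, elapsed, previous_uptime=None):
    """Return a list of warnings for one ping result (empty means healthy)."""
    if status != 200:
        return [f"unexpected HTTP status {status}"]
    warnings = []
    if not body.get("dbHealth"):
        warnings.append("dbHealth is not true")
    if elapsed > SLOW_PING_SECONDS:
        warnings.append(f"slow response: {elapsed:.2f}s")
    uptime = parse_uptime(body.get("uptime", ""))
    if previous_uptime is not None and uptime < previous_uptime:
        warnings.append("uptime went backwards: Orchestrator restarted")
    return warnings


def ping(base_url, timeout=5.0):
    """Call the ping endpoint; return (status, parsed body, elapsed seconds).

    urlopen raises urllib.error.HTTPError for non-2xx responses, so the
    caller should treat that exception as a failed health check.
    """
    start = time.monotonic()
    url = f"{base_url}/gms/rest/gmsserver/ping"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
        return resp.status, body, time.monotonic() - start
```

A poller would call `ping` on a schedule, pass the result to `evaluate_ping` together with the uptime seen on the previous poll, and cross-check any detected restart against known manual restarts.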
- Orchestrator Heap Status
a. The heap size indicates how the Orchestrator is using memory. Heap size should naturally oscillate up and down as memory is consumed and freed. If the used heap trends higher or is pegged near the maximum heap size, additional memory may be required.
b. Use GET /gms/rest/stats/timeseries/metrics?startTime=<x>&endTime=<y> to retrieve heap statistics for every hour between the start and end times.
1. x and y are the start and end times as Unix epoch timestamps in seconds.
2. This API returns other stats as well, but only totalHeapMemory and usedHeapMemory are helpful for health monitoring.
3. The response to this API call is an array of records. Each record represents the heap status for one hour and is structured as follows:
[{
  "stats": {
    "buffersMemory": 2704,
    "usedMemory": 124163080,
    "totalMemory": 387740160,
    "applianceCount": 74,
    "usedSwapMemory": 0,
    "freeSwapMemory": 0,
    "totalHeapMemory": 1933049856,
    "usedHeapMemory": 1052471048,
    "totalSwapMemory": 0,
    "cachedSwapMemory": 77150288,
    "freeMemory": 263577080
  },
  "key": null,
  "timestamp": 1580407637
}, … ]
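The heap check above could be automated along these lines. This is a sketch under stated assumptions: the helper names, the 0.9 "pegged" ratio, and the three-sample window are hypothetical choices made for illustration, since the text gives no specific threshold.

```python
from urllib.parse import urlencode

# Hypothetical threshold: the documentation gives no specific number.
HEAP_PEGGED_RATIO = 0.9


def metrics_url(base_url, start_time, end_time):
    """Build the timeseries URL; times are Unix epoch seconds."""
    query = urlencode({"startTime": int(start_time), "endTime": int(end_time)})
    return f"{base_url}/gms/rest/stats/timeseries/metrics?{query}"


def heap_ratios(records):
    """Return (timestamp, usedHeap / totalHeap) per hourly record, oldest first."""
    ratios = []
    for record in sorted(records, key=lambda r: r["timestamp"]):
        stats = record["stats"]
        total = stats["totalHeapMemory"]
        if total > 0:  # skip records with no usable heap total
            ratios.append((record["timestamp"], stats["usedHeapMemory"] / total))
    return ratios


def heap_is_pegged(records, threshold=HEAP_PEGGED_RATIO, min_samples=3):
    """True when the most recent min_samples hourly ratios all exceed threshold."""
    recent = [ratio for _, ratio in heap_ratios(records)][-min_samples:]
    return len(recent) >= min_samples and all(r > threshold for r in recent)
```

Requiring several consecutive high samples, rather than a single one, is meant to tolerate the normal oscillation of the heap while still catching a sustained plateau near the maximum.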
- For disk space usage, Orchestrator generates two alarms based on how much disk space remains. The two alarms to watch for are:
a. WARNING: Disk partition {0} is more than {1}% used (generated when 70% of the disk is utilized)
b. MAJOR: Disk partition {0} is dangerously full - {1}% used (generated when more than 90% of the disk is utilized)
- For the following resource utilization metrics, use Linux monitoring tools, such as those offered through the hypervisor where the VM resides or those your IT team uses for Linux server monitoring:
a. Total and Used memory
b. CPU utilization
c. Swap space
d. Disk utilization
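Where no monitoring tool is available yet, the four metrics above can be sampled directly on a Linux host. The following is a rough sketch, not a replacement for a proper monitoring agent: it assumes Python is available on the host, the function names are hypothetical, and the 70% and 90% disk thresholds simply mirror the Orchestrator disk alarms described earlier.

```python
import os
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def disk_severity(percent_used):
    """Mirror the Orchestrator disk alarm thresholds (70% and 90%)."""
    if percent_used > 90:
        return "MAJOR"
    if percent_used >= 70:
        return "WARNING"
    return None


def snapshot(path="/"):
    """Collect memory, swap, CPU load, and disk figures (Linux only)."""
    with open("/proc/meminfo") as handle:
        mem = parse_meminfo(handle.read())
    disk = shutil.disk_usage(path)
    disk_percent = 100.0 * disk.used / disk.total
    available = mem.get("MemAvailable", mem.get("MemFree", 0))
    return {
        "mem_total_kb": mem["MemTotal"],
        "mem_used_kb": mem["MemTotal"] - available,
        "swap_used_kb": mem.get("SwapTotal", 0) - mem.get("SwapFree", 0),
        "load_1min": os.getloadavg()[0],  # load average as a coarse CPU proxy
        "disk_percent": disk_percent,
        "disk_severity": disk_severity(disk_percent),
    }
```

The 1-minute load average is only a coarse proxy for CPU utilization; hypervisor-level metrics remain the better source where they are available.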
**Note:** For Orchestrator VM deployments, CPU and memory must be reserved for the virtual machine.