Orchestrator Health Monitoring
Monitoring Orchestrator health involves both the KPIs that Orchestrator itself reports and the Linux server monitoring tools offered natively by the hypervisor.
The useful metrics that Orchestrator provides include:
Orchestrator Health & Reachability
1.0 Use GET /gms/rest/gmsserver/ping to check reachability of the Orchestrator web service, and check the return value for a high-level indicator of database health.
- If this returns 200, the Orchestrator is OK.
- The response should take a few hundred milliseconds at most. Any trend towards a longer response time to this ping (for example, much greater than 1 second) indicates that other checks should be done to see whether Orchestrator is running out of resources (CPU, memory, etc.).
- The response from the REST request is as follows:
{ "hostname":"orchestrator.localdomain", "dbHeatlh": true, "timeStr": "Thu Jan 30 16:51:31 PDT 2020", "time": 1580429731303, "message": "I am alive!", "version": "9.3.1.40717", "uptime": "9d 2h 52m 39s" }
- The "dbHeatlh" attribute indicates whether the Orchestrator process has a healthy connection to its database and whether the database server is up and running. It does not indicate whether the database is reaching saturation.
- The "uptime" field should be monitored for continuous operation of the Orchestrator. If the uptime resets to 0 and no manual restart of the Orchestrator was initiated, then Orchestrator restarted on its own.
NOTE: The Orchestrator has a built-in mechanism to restart upon failure to minimize Orchestrator unavailability. During any downtime, the SD-WAN is unaffected and continues to operate. Stats and alarms are buffered by the appliance until Orchestrator communications are re-established. When the Orchestrator recovers, it resynchronizes with the SD-WAN and resumes its operations.
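The checks described above can be sketched in Python as follows. This is a minimal illustration, not an official client: the function names, the 1-second threshold constant (taken from the example in the text), and the base URL passed to ping() are assumptions, and authentication and TLS certificate handling are omitted. The "dbHeatlh" key is spelled as it appears in the sample response.

```python
import json
import re
import time
import urllib.request

# Assumed threshold, based on the "much greater than 1 second" guidance.
SLOW_RESPONSE_SECONDS = 1.0


def parse_uptime(text):
    """Convert an uptime string such as '9d 2h 52m 39s' into seconds."""
    units = {"d": 86400, "h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)\s*([dhms])", text))


def evaluate_ping(status, elapsed_seconds, body, previous_uptime=None):
    """Return a list of warning strings; an empty list means no issue was found."""
    warnings = []
    if status != 200:
        warnings.append(f"unexpected status {status}")
    if elapsed_seconds > SLOW_RESPONSE_SECONDS:
        warnings.append(f"slow response: {elapsed_seconds:.2f}s")
    # Key spelled as in the sample response shown in this document.
    if body.get("dbHeatlh") is not True:
        warnings.append("database connection reported unhealthy")
    uptime = parse_uptime(body.get("uptime", ""))
    if previous_uptime is not None and uptime < previous_uptime:
        warnings.append("uptime went backwards: Orchestrator restarted")
    return warnings


def ping(base_url, timeout=5.0):
    """Call the ping endpoint and return (status, elapsed_seconds, body).

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so callers
    should treat that exception as a failed health check.
    """
    start = time.monotonic()
    url = base_url + "/gms/rest/gmsserver/ping"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        status = resp.status
        body = json.loads(resp.read().decode("utf-8"))
    return status, time.monotonic() - start, body
```

A monitoring job would call ping() on a schedule, store the parsed uptime from each run, and pass it as previous_uptime on the next run so that an unexpected restart shows up as a warning.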
Optional: for a history of Orchestrator reboots, use GET /gms/rest/gms/rebootHistory
- The API returns an array of reboot occurrences over the past 12 months, as follows:
[{"reboot_time": 15722461092, "version":"9.3.1.45634"},...]
Where reboot_time is the timestamp in seconds in EPOCH format and version is the Orchestrator release.
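As a rough sketch, the reboot history could be filtered to recent events like this. The function name, the 30-day window, and the sample records in the usage lines are hypothetical; the code only assumes that reboot_time is in epoch seconds, as stated above.

```python
from datetime import datetime, timedelta, timezone


def recent_reboots(records, now, window_days=30):
    """Return (time, version) pairs for reboots within the last window_days."""
    cutoff = now - timedelta(days=window_days)
    result = []
    for record in records:
        # reboot_time is treated as epoch seconds, per the description above.
        when = datetime.fromtimestamp(record["reboot_time"], tz=timezone.utc)
        if cutoff <= when <= now:
            result.append((when, record["version"]))
    return sorted(result)


# Hypothetical records for illustration only.
history = [
    {"reboot_time": 1580000000, "version": "9.3.1.40717"},
    {"reboot_time": 1570000000, "version": "9.3.0.40000"},
]
recent = recent_reboots(history, now=datetime(2020, 1, 30, tzinfo=timezone.utc))
```

Comparing the returned list against known maintenance windows is one way to separate planned restarts from restarts that Orchestrator initiated on its own.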
Orchestrator Heap Status
2.0 The heap size is an indicator of how memory is being utilized by the Orchestrator. Heap size should naturally oscillate up and down, as memory is consumed and freed. If the used heap trends higher or is pegged near the maximum heap size, this is an indication that additional memory may be required.
2.1 Use GET /gms/rest/stats/timeseries/metrics?startTime=x&endTime=y to retrieve heap statistics for every hour between the start and end times. The complete list of stats may appear only once in the response. For example, if startTime and endTime span half an hour, such as 1:00 PM to 1:30 PM expressed in epoch time, the response shows the full list of stats once and 30 one-minute stats of totalHeapMemory and usedHeapMemory.
- x and y represent the start and end times in seconds, using EPOCH format.
- There are other stats returned by this API, but only totalHeapMemory and usedHeapMemory are helpful for health monitoring.
- The response to this API call is an array of records. Each record represents the heap for the hour, and it is structured as follows:
[{
"stats": {
"buffersMemory": 2704,
"usedMemory": 124163080,
"totalMemory": 387740160,
"applianceCount": 74,
"usedSwapMemory": 0,
"freeSwapMemory": 0,
"totalHeapMemory": 1933049856,
"usedHeapMemory": 1052471048,
"totalSwapMemory": 0,
"cachedSwapMemory": 77150288,
"freeMemory": 263577080
},
"key": null,
"timestamp": 1580407637
},
…
]
- The minute stats for the heap memory are represented as follows:
{ "stats":{ "totalHeapMemory": 1933049856, "usedHeapMemory": 1052471048, "filled": true }, "key": null, "timestamp": 1580407636 }
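The heap guidance above (watch for a used heap that trends higher or stays pegged near the maximum) could be evaluated along these lines. This is a sketch under stated assumptions: the function names and the 0.90 "pegged" ratio are illustrative choices, not values prescribed by Orchestrator, and the records are expected in the shape shown in the sample responses.

```python
from urllib.parse import urlencode

# Assumed threshold for illustration; no official value is documented here.
PEGGED_RATIO = 0.90


def metrics_path(start_epoch, end_epoch):
    """Build the request path for start and end times given in epoch seconds."""
    query = urlencode({"startTime": start_epoch, "endTime": end_epoch})
    return "/gms/rest/stats/timeseries/metrics?" + query


def heap_ratios(records):
    """Return (timestamp, used/total) pairs, sorted by timestamp."""
    points = []
    for record in records:
        stats = record.get("stats", {})
        total = stats.get("totalHeapMemory")
        used = stats.get("usedHeapMemory")
        if total and used is not None:
            points.append((record["timestamp"], used / total))
    return sorted(points)


def is_pegged(points, threshold=PEGGED_RATIO):
    """True when every sample sits at or above the threshold."""
    return bool(points) and all(ratio >= threshold for _, ratio in points)


def trend_per_hour(points):
    """Least-squares slope of the heap ratio, in ratio units per hour."""
    n = len(points)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in points) / n
    mean_r = sum(r for _, r in points) / n
    denom = sum((t - mean_t) ** 2 for t, _ in points)
    if denom == 0:
        return 0.0
    slope = sum((t - mean_t) * (r - mean_r) for t, r in points) / denom
    return slope * 3600
```

Because heap usage naturally oscillates, a slope computed over only a few samples is noisy; evaluating the trend over a day or more of hourly records is likely to give a more useful signal than reacting to any single reading.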
Orchestrator VM Disk space usage and Linux Tools
- For disk space usage, Orchestrator generates 2 alarms based on how much disk space is remaining. The two alarms to look out for include:
a. WARNING: Disk partition {0} is more than {1}% used (generated when 70% of the disk is utilized)
b. MAJOR: Disk partition {0} is dangerously full - {1}% used (generated when >90% of the disk is utilized)
- For the following resource utilization metrics, it is suggested that Linux monitoring tools be used, such as those offered through the hypervisor where the VM resides or those used by your IT team for Linux server monitoring:
a. Total and used memory (Linux tools: top or free)
b. CPU utilization (Linux tool: top)
c. Swap space
d. Disk utilization
NOTE: For Orchestrator VM deployments, CPU & memory must be reserved for the virtual machine.
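Where a scripted check is preferred over interactive tools, a few of these metrics can be read directly on a Linux host. The sketch below is an assumption-laden illustration: the function names are hypothetical, the 70% and 90% boundaries mirror the alarm descriptions above (exact boundary handling is a guess), and the percentage computed here may differ slightly from what Orchestrator or df reports.

```python
import shutil


def classify_disk(percent_used):
    """Map a disk usage percentage to the alarm levels described above."""
    if percent_used > 90:
        return "MAJOR"
    if percent_used >= 70:
        return "WARNING"
    return "OK"


def disk_percent_used(path="/"):
    """Percentage of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {name: value}; memory fields are in kB."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def swap_used_kb(meminfo):
    """Swap in use, derived from the SwapTotal and SwapFree fields."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)
```

On the Orchestrator VM, parse_meminfo would typically be fed the contents of /proc/meminfo, for example via open("/proc/meminfo").read().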