
Orchestrator Health Monitoring

Monitoring Orchestrator health combines the KPIs that Orchestrator itself reports with the Linux server monitoring tools offered natively by the hypervisor.

The useful metrics that Orchestrator provides include:

  1. Orchestrator Health & Reachability

         a.	Use GET /gms/rest/gmsserver/ping to check reachability of the Orchestrator web service, and check  
                    the return value for a high-level indicator of database health.
    
         b.	If this returns HTTP status 200, the Orchestrator is OK.
    
         c.	The response should take a few hundred milliseconds at most. Any trend towards longer response times 
                    for this ping (for example, much greater than 1 second) indicates that other checks should be done 
                    to see whether Orchestrator is running out of resources (CPU, memory, etc.).
    
         d.	The response from the REST request is as follows:
                                {
                                  "hostname": "orchestrator.localdomain",
                                  "dbHealth": true,
                                  "timeStr": "Thu Jan 30 16:15:31 PST 2020",
                                  "time": 1580429731303,
                                  "message": "I am alive!",
                                  "version": "8.8.3.40500",
                                  "uptime": "9d 2h 52m 39s"
                                }
    
                   The “dbHealth” attribute is an indicator of whether the Orchestrator process has a healthy connection  
                   to its database and whether the database server is up and running.  It is not an indicator of whether the 
                   database is reaching saturation.
    
         e.	The “uptime” field should be monitored for continuous operation of the Orchestrator. If the uptime 
                    resets to 0 and no manual restart of the Orchestrator was initiated, this indicates that the  
                    Orchestrator restarted on its own.
                 **Note:** Orchestrator has a built-in mechanism to restart upon failure to minimize its 
                    unavailability. During any downtime, the SD-WAN is unaffected and continues to operate. Stats and 
                    alarms are buffered by the appliance until communication with Orchestrator is reestablished. When 
                    the Orchestrator recovers, it resynchronizes with the SD-WAN and resumes its operations.
    
          f.	Optional: for a history of Orchestrator reboots, use GET /gms/rest/gms/rebootHistory
                         1.	The API returns an array of reboot occurrences over the past 12 months, as follows:
    
                    [{"reboot_time":1572461092,"version":"99.99.99.43561"}, … ]
                    
                                   Here, reboot_time is the Unix epoch timestamp in seconds and version is the  
                                   Orchestrator release.
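
The ping and reboot-history checks above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not an official client: the function names (`parse_uptime`, `evaluate_ping`, `reboot_datetimes`) and the `SLOW_RESPONSE_SECONDS` constant are hypothetical, the uptime string is assumed to always follow the "9d 2h 52m 39s" pattern of the sample response, and fetching the responses (authentication, TLS, timing the request) is left to your own HTTP client.

```python
import json
import re
from datetime import datetime, timezone

# Assumed threshold, based on the "much greater than 1 second" guidance.
SLOW_RESPONSE_SECONDS = 1.0

_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def parse_uptime(uptime):
    """Convert an uptime string such as '9d 2h 52m 39s' to seconds.

    Assumes the format shown in the sample ping response; anything
    without a recognizable '<number><d|h|m|s>' part raises ValueError.
    """
    parts = re.findall(r"(\d+)([dhms])", uptime)
    if not parts:
        raise ValueError(f"unrecognized uptime format: {uptime!r}")
    return sum(int(value) * _UNIT_SECONDS[unit] for value, unit in parts)


def evaluate_ping(status_code, body, elapsed_seconds, previous_uptime=None):
    """Return a list of warnings for one ping sample (empty list means OK).

    status_code and body come from GET /gms/rest/gmsserver/ping;
    elapsed_seconds is the response time measured by the caller;
    previous_uptime is the uptime in seconds seen on the previous poll.
    """
    if status_code != 200:
        return [f"ping returned HTTP {status_code}"]
    data = json.loads(body)
    warnings = []
    if data.get("dbHealth") is not True:
        warnings.append("dbHealth is not true")
    if elapsed_seconds > SLOW_RESPONSE_SECONDS:
        warnings.append(f"slow response: {elapsed_seconds:.2f}s")
    uptime = parse_uptime(data["uptime"])
    if previous_uptime is not None and uptime < previous_uptime:
        warnings.append("uptime went backwards: Orchestrator restarted")
    return warnings


def reboot_datetimes(history):
    """Convert rebootHistory entries (epoch seconds) to UTC datetimes."""
    return [
        datetime.fromtimestamp(entry["reboot_time"], tz=timezone.utc)
        for entry in history
    ]
```

A poller would call `evaluate_ping` on each sample and pass the previous poll's uptime, so that an unexpected drop in uptime surfaces as a restart warning. Whether a restart was manual still has to be checked against your own change records.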
    
  2. Orchestrator Heap Status

         a.	The heap size is an indicator of how memory is being utilized by the Orchestrator.  Heap size should  
                    naturally oscillate up and down, as memory is consumed and freed.  If the used heap trends higher or  
                    is pegged near the maximum heap size, this is an indication that additional memory may be required.
    
         b.	Use GET /gms/rest/stats/timeseries/metrics?startTime=<x>&endTime=<y> to retrieve heap statistics 
                    for every hour between the start and end times.
    
                         1.	x and y represent the start and end times as Unix epoch timestamps in seconds
                         2.	There are other stats returned by this API, but only totalHeapMemory and 
                                    usedHeapMemory are helpful for health monitoring
                         3.	The response to this API call is an array of records. Each record represents the heap status 
                                    for one hour, and is structured as follows:
                                   [
                                     {
                                       "stats": {
                                         "buffersMemory": 2704,
                                         "usedMemory": 124163080,
                                         "totalMemory": 387740160,
                                         "applianceCount": 74,
                                         "usedSwapMemory": 0,
                                         "freeSwapMemory": 0,
                                         "totalHeapMemory": 1933049856,
                                         "usedHeapMemory": 1052471048,
                                         "totalSwapMemory": 0,
                                         "cachedSwapMemory": 77150288,
                                         "freeMemory": 263577080
                                       },
                                       "key": null,
                                       "timestamp": 1580407637
                                     },
                                     …
                                   ]
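
The heap check described above can be sketched in Python as follows. This is a minimal sketch rather than a recommended policy: the function names (`heap_ratios`, `heap_is_pegged`), the 0.9 threshold, and the three-sample window are all assumptions chosen for illustration. The input is the already-decoded array of records returned by the timeseries metrics call.

```python
def heap_ratios(records):
    """Return (timestamp, usedHeapMemory / totalHeapMemory) pairs, oldest first.

    Records with a zero total heap are skipped to avoid division by zero.
    """
    ratios = []
    for record in records:
        stats = record["stats"]
        total = stats["totalHeapMemory"]
        if total > 0:
            ratios.append((record["timestamp"], stats["usedHeapMemory"] / total))
    return sorted(ratios)


def heap_is_pegged(records, threshold=0.9, min_samples=3):
    """Return True if the most recent samples all sit at or above threshold.

    A healthy heap oscillates, so at least one recent sample should dip
    below the threshold. threshold and min_samples are assumed values.
    """
    recent = [ratio for _, ratio in heap_ratios(records)][-min_samples:]
    return len(recent) >= min_samples and min(recent) >= threshold
```

For the sample record shown above, the used-to-total ratio is roughly 0.54, which is well below the assumed threshold. Tune the threshold and window to your own deployment before alerting on them.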
    
  3. For disk space usage, Orchestrator generates two alarms based on how much disk space is used. The two alarms to look out for are:
    a. WARNING: Disk partition {0} is more than {1}% used (generated when 70% of disk is utilized)
    b. MAJOR: Disk partition {0} is dangerously full - {1}% used (generated when >90% of disk is utilized)
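
If alarm text is forwarded to an external monitoring system, the two disk alarms above can be recognized with a small matcher such as the following Python sketch. The function name `classify_disk_alarm` is hypothetical, and the patterns assume that the `{0}` placeholder is filled with a partition name containing no whitespace and `{1}` with an integer percentage. How the alarm text is obtained from Orchestrator is not covered here.

```python
import re

# Patterns derived from the two alarm message templates listed above.
# Assumption: {0} is a whitespace-free partition name, {1} an integer.
_DISK_ALARM_PATTERNS = (
    ("WARNING", re.compile(
        r"Disk partition (?P<partition>\S+) is more than (?P<percent>\d+)% used")),
    ("MAJOR", re.compile(
        r"Disk partition (?P<partition>\S+) is dangerously full - (?P<percent>\d+)% used")),
)


def classify_disk_alarm(message):
    """Return (severity, partition, percent_used) or None if no pattern matches."""
    for severity, pattern in _DISK_ALARM_PATTERNS:
        match = pattern.search(message)
        if match:
            return severity, match["partition"], int(match["percent"])
    return None
```

Messages that match neither template return `None`, so unrelated alarms pass through untouched.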

  4. For the following resource utilization metrics, use Linux monitoring tools, such as those offered through the hypervisor where the VM resides or those your IT team uses for Linux server monitoring:
    a. Total and Used memory
    b. CPU utilization
    c. Swap space
    d. Disk utilization

Note: For Orchestrator VM deployments, CPU and memory must be reserved for the virtual machine.