Appliance Health Monitoring for Memory, CPU, and Tunnel Performance
Appliance Health – Memory
The appliance supports a REST API to retrieve memory statistics. This can be accessed through Orchestrator using the API described under Getting Memory Statistics below.
The statistics returned that are useful for monitoring are “free”, “buffers”, “cached”, and “swapUsed”, all of which are measured in kB. For all EC models, the general guidance is to check that
• free + buffers + cached > 250 MB, and
• swapUsed < 250 MB.
If either of these conditions is not met, it does not necessarily mean that the appliance is degraded or will misbehave; rather, it is an indicator that someone should examine the system to further evaluate the health of the memory. The EdgeConnect software is designed to maintain relatively constant memory consumption, so looking at memory trends in Orchestrator will help identify scenarios where memory consumption is increasing over time. If swap usage is impacting the performance of the device, the recommended action is to reboot the appliance.
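As a minimal illustration, the two guidelines above can be evaluated as in the sketch below, assuming the statistics have already been retrieved and parsed (see Getting Memory Statistics below); the function name and constants are illustrative only.

```python
# Minimal sketch: evaluate the two memory guidelines above.
# Assumes `mem` is the parsed output of the memory API (all values in kB).
KB_PER_MB = 1024

def memory_needs_attention(mem: dict) -> bool:
    """Return True if the appliance memory should be investigated further."""
    available_kb = mem["free"] + mem["buffers"] + mem["cached"]
    low_available = available_kb <= 250 * KB_PER_MB   # free + buffers + cached should exceed 250 MB
    high_swap = mem["swapUsed"] >= 250 * KB_PER_MB    # swapUsed should stay below 250 MB
    return low_available or high_swap
```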
Getting Memory Statistics
The memory API gives you an instant snapshot of current memory usage. Memory usage on an appliance normally changes very little, because EdgeConnect uses statically allocated memory for roughly 99% of its functions; occasionally, however, memory leaks can crop up. This API is useful for monitoring and detecting those situations.
GET /rest/json/memory
This will return data in the following format:
{
"total": 3932860,
"free": 885672,
"buffers": 135056,
"cached": 668484,
"used": 3047188,
"swapTotal": 3906244,
"swapFree": 3905476,
"swapUsed": 768
}
Each of these numbers is in kilobytes. The main thing to look for is “free”. This value should stay above 256,000 kB for EdgeConnect to be considered healthy. It can go as low as 50,000 kB and EdgeConnect will still function normally, but it is better to alarm at 256,000 kB.
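For illustration, the following is a minimal sketch of pulling these statistics through Orchestrator and alarming on the “free” guideline. The Orchestrator address, nePk value, authentication header, and the assumption that the appliance API is proxied using the same pattern shown in the Disk Usage section below are all placeholders for your environment.

```python
# Minimal sketch (placeholders throughout): pull memory stats through
# Orchestrator and alarm when "free" drops below the 256,000 kB guideline.
import requests

ORCH = "https://orchestrator.example.com"   # assumed Orchestrator address
NE_PK = "77.NE"                             # assumed appliance identifier (nePk)
HEADERS = {"X-Auth-Token": "<api-key>"}     # assumed authentication header

# Assumes the appliance proxy pattern shown in the Disk Usage section below.
url = f"{ORCH}/gms/rest/appliance/rest/{NE_PK}/memory"
mem = requests.get(url, headers=HEADERS, timeout=10).json()

if mem["free"] < 256_000:
    print(f"WARNING: free memory is {mem['free']} kB (below 256,000 kB)")
```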
Appliance Health – CPU
Getting CPU Statistics
CPU usage can also be monitored using REST APIs. Monitoring CPU usage and setting a threshold crossing alert on EdgeConnect is tricky. Many CPU cores work on compression, and if such a core is completely busy, the result is simply less compression. On other cores, however, 100% CPU usage will cause EdgeConnect to drop packets. Which cores are critical to monitor depends on the hardware model and software version and is not documented here.
GET /rest/json/cpustat?time=0
This returns the CPU statistics for the last 5 minutes in 5-second intervals (60 objects in total). This can be quite a bit of data, but the resolution is high enough to meet the needs of most customers.
The only statistic of interest to most customers is “pIdle”. All of the numbers are percentages.
We recommend setting the CPU utilization Threshold Crossing Alert to 95% for a warning and 99% for an alert (a sketch of applying these thresholds follows the sample response below).
{
"latestTimestamp": 1652470818069,
"data": [
{
"1652470527344": [
{
"cpu_number": "ALL",
"pIdle": "80.44",
"pUser": "14.67",
"pSys": "4.89",
"pIRQ": "0.00",
"pNice": "0.00"
},
{
"cpu_number": 0,
"pIdle": "76.86",
"pUser": "16.14",
"pSys": "7.01",
"pIRQ": "0.00",
"pNice": "0.00"
},
{
"cpu_number": 1,
"pIdle": "83.88",
"pUser": "13.27",
"pSys": "2.86",
"pIRQ": "0.00",
"pNice": "0.00"
}
]
},
{
"1652470532355": [
{
"cpu_number": "ALL",
"pIdle": "79.15",
"pUser": "16.49",
"pSys": "4.36",
"pIRQ": "0.00",
"pNice": "0.00"
},
{
"cpu_number": 0,
"pIdle": "74.68",
"pUser": "19.62",
"pSys": "5.70",
"pIRQ": "0.00",
"pNice": "0.00"
},
{
"cpu_number": 1,
"pIdle": "83.47",
"pUser": "13.47",
"pSys": "3.06",
"pIRQ": "0.00",
"pNice": "0.00"
}
]
},
........
{
"1652470818069": [
{
"cpu_number": "ALL",
"pIdle": "80.06",
"pUser": "16.03",
"pSys": "3.91",
"pIRQ": "0.00",
"pNice": "0.00"
},
{
"cpu_number": 0,
"pIdle": "75.99",
"pUser": "18.58",
"pSys": "5.43",
"pIRQ": "0.00",
"pNice": "0.00"
},
{
"cpu_number": 1,
"pIdle": "84.01",
"pUser": "13.56",
"pSys": "2.43",
"pIRQ": "0.00",
"pNice": "0.00"
}
]
}
]
}
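As a rough sketch of applying the recommended thresholds to this response: the snippet below takes the most recent interval, reads the system-wide pIdle, and compares utilization (100 - pIdle) against the 95% warning and 99% alert levels. It assumes the response has already been parsed, in the format shown above.

```python
# Minimal sketch: check the newest system-wide CPU utilization against the
# 95% warning / 99% alert guidelines. Assumes `stats` is the parsed JSON
# returned by /rest/json/cpustat?time=0, in the format shown above.
def check_cpu(stats: dict) -> None:
    # "data" is a list of {"<epoch-ms>": [per-CPU objects]}; find the newest entry.
    latest_key = str(stats["latestTimestamp"])
    latest = next(entry[latest_key] for entry in stats["data"] if latest_key in entry)
    overall = next(c for c in latest if c["cpu_number"] == "ALL")
    utilization = 100.0 - float(overall["pIdle"])
    if utilization >= 99.0:
        print(f"ALERT: CPU utilization {utilization:.1f}%")
    elif utilization >= 95.0:
        print(f"WARNING: CPU utilization {utilization:.1f}%")
```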
Appliance Health – Disk Usage
GET /gms/rest/appliance/rest/{nePk}/diskUsage
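No response example is reproduced here, so the following sketch simply substitutes an appliance identifier into the path and prints whatever the endpoint returns; the Orchestrator address, nePk value, and authentication header are illustrative assumptions.

```python
# Minimal sketch (placeholders throughout): fetch disk usage for one appliance.
import requests

ORCH = "https://orchestrator.example.com"   # assumed Orchestrator address
NE_PK = "77.NE"                             # assumed appliance identifier (nePk)
HEADERS = {"X-Auth-Token": "<api-key>"}     # assumed authentication header

resp = requests.get(f"{ORCH}/gms/rest/appliance/rest/{NE_PK}/diskUsage",
                    headers=HEADERS, timeout=10)
print(resp.json())   # response schema not reproduced in this document
```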
Tunnel Performance – Packet Loss
Based on aggregate stats, the packet loss % for a tunnel can be calculated as
Packet Loss Percent = (SUM_PRE_LOSS / (SUM_WRX_PKTS + SUM_PRE_LOSS + 1)) * 100
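Expressed as code, the calculation looks like the sketch below; the “+ 1” in the denominator appears to be there to keep the denominator non-zero when no packets were received in the period.

```python
# Minimal sketch: tunnel packet loss percentage from aggregate stats.
def packet_loss_percent(sum_pre_loss: int, sum_wrx_pkts: int) -> float:
    # "+ 1" keeps the denominator non-zero when no packets were received.
    return (sum_pre_loss / (sum_wrx_pkts + sum_pre_loss + 1)) * 100.0

# Example: 120 pre-FEC lost packets against 58,000 received WAN packets.
print(f"{packet_loss_percent(120, 58_000):.3f}%")   # ~0.206%
```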
For timeseries data (and trend charting), the relevant stats can be pulled from /gms/rest/stats/timeseries:
Index | Column Name in REST API | UI Field Name |
---|---|---|
77 | PRE_LOSS | Pre-FEC Packets |
78 | POST_LOSS | Post-FEC Packets |
81 | PRE_PCT_LOSS | Pre-FEC |
82 | POST_PCT_LOSS | Post-FEC |
93 | PRE_PCT_LOSS_MAX | Peak Pre-FEC |
95 | POST_PCT_LOSS_MAX | Peak Post-FEC |
Note: it is possible for SUM_PRE_LOSS to be greater than SUM_WRX_PKTS. This is because pre-loss is a delayed statistic based on the FEC engine; even without FEC, loss can only be measured after a timeout, because the packet might still be arriving. WAN RX packets, on the other hand, are the actual packets received in that period. The loss stats are accurate when packets are streaming, but if the network has very low throughput, loss should be calculated using minute or hourly aggregations.
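One way to follow that advice is to sum the counters across a longer window before dividing, as in the sketch below. The per-interval record layout, and the WRX_PKTS column name (assumed by analogy with SUM_WRX_PKTS in the aggregate formula), are illustrative assumptions.

```python
# Minimal sketch: compute loss over a longer window (e.g., one hour of
# per-minute records) before dividing, per the note above. WRX_PKTS is an
# assumed column name by analogy with SUM_WRX_PKTS in the aggregate formula.
def aggregated_loss_percent(intervals: list[dict]) -> float:
    pre_loss = sum(i["PRE_LOSS"] for i in intervals)
    wrx_pkts = sum(i["WRX_PKTS"] for i in intervals)
    return (pre_loss / (wrx_pkts + pre_loss + 1)) * 100.0
```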