Appliance Health Monitoring for Memory, CPU, and Tunnel Performance
Appliance Health – Memory
The appliance supports a REST API to retrieve memory statistics. This can be accessed through Orchestrator using the API described under Getting Memory Statistics below.
The statistics returned that are useful for monitoring are “free”, “buffers”, “cached”, and “swapUsed”, all of which are measured in kB. For all EC models, the general guidance is to check that
• free + buffers + cached > 250 MB, and
• swapUsed < 250 MB.
If either of these conditions is not met, it does not necessarily mean that the appliance is degraded or will misbehave; rather, it is an indicator that someone should examine the system to further evaluate the health of the memory. The EdgeConnect software is designed to maintain relatively constant memory consumption, so looking at memory trends in Orchestrator will help identify scenarios where memory consumption is increasing over time. If swap usage is impacting the performance of the device, the recommended action is to reboot the appliance.
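As a minimal illustration, the two guidelines above can be evaluated as in the sketch below, assuming the statistics have already been retrieved and parsed (see Getting Memory Statistics below); the function name and constants are illustrative only.

```python
# Minimal sketch: evaluate the two memory guidelines above.
# Assumes `mem` is the parsed output of the memory API (all values in kB).
KB_PER_MB = 1024

def memory_needs_attention(mem: dict) -> bool:
    """Return True if the appliance memory should be investigated further."""
    available_kb = mem["free"] + mem["buffers"] + mem["cached"]
    low_available = available_kb <= 250 * KB_PER_MB   # free + buffers + cached should exceed 250 MB
    high_swap = mem["swapUsed"] >= 250 * KB_PER_MB    # swapUsed should stay below 250 MB
    return low_available or high_swap
```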
Getting Memory Statistics
The memory API gives you an instant snapshot of current memory usage. Memory usage on an appliance normally changes very little, because EdgeConnect uses statically allocated memory for roughly 99% of its functions; occasionally, however, memory leaks can crop up. This API is useful for monitoring and detecting those situations.
GET /rest/json/memory
This will return data in the following format:
{
"total": 3932860,
"free": 885672,
"buffers": 135056,
"cached": 668484,
"used": 3047188,
"swapTotal": 3906244,
"swapFree": 3905476,
"swapUsed": 768
}
Each of these numbers is in kilobytes. The main thing to look for is “free”. This value should stay above 256,000 kB for EdgeConnect to be considered healthy. It can go as low as 50,000 kB and EdgeConnect will still function normally, but it is better to alarm at 256,000 kB.
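For illustration, the following is a minimal sketch of pulling these statistics through Orchestrator and alarming on the “free” guideline. The Orchestrator address, nePk value, authentication header, and the assumption that the appliance API is proxied using the same pattern shown in the Disk Usage section below are all placeholders for your environment.

```python
# Minimal sketch (placeholders throughout): pull memory stats through
# Orchestrator and alarm when "free" drops below the 256,000 kB guideline.
import requests

ORCH = "https://orchestrator.example.com"   # assumed Orchestrator address
NE_PK = "77.NE"                             # assumed appliance identifier (nePk)
HEADERS = {"X-Auth-Token": "<api-key>"}     # assumed authentication header

# Assumes the appliance proxy pattern shown in the Disk Usage section below.
url = f"{ORCH}/gms/rest/appliance/rest/{NE_PK}/memory"
mem = requests.get(url, headers=HEADERS, timeout=10).json()

if mem["free"] < 256_000:
    print(f"WARNING: free memory is {mem['free']} kB (below 256,000 kB)")
```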
Appliance Health – CPU
Getting CPU Statistics
CPU usage can also be monitored using REST APIs. Monitoring CPU usage and setting a threshold crossing alert on EdgeConnect is tricky. Many CPU cores work on compression, and if such a core is completely busy, the result is simply less compression. On other cores, however, 100% CPU usage will cause EdgeConnect to drop packets. Which cores are critical to monitor depends on the hardware model and software version and is not documented here.
GET /rest/json/cpustat?time=0
This returns the CPU statistics for the last 5 minutes in 5-second intervals (60 objects in total). This can be quite a bit of data, but the resolution is high enough to meet the needs of most customers.
The only statistic of interest to most customers is “pIdle”. All of the numbers are percentages.
We recommend setting the CPU utilization Threshold Crossing Alert to 95% for a warning and 99% for an alert (a sketch of applying these thresholds follows the sample response below).
{
"latestTimestamp": 1652470818069,
"data": [
{
"1652470527344": [
{
"cpu_number": "ALL",
"pIdle": "80.44",
"pUser": "14.67",
"pSys": "4.89",
"pIRQ": "0.00",
"pNice": "0.00"
},
{
"cpu_number": 0,
"pIdle": "76.86",
"pUser": "16.14",
"pSys": "7.01",
"pIRQ": "0.00",
"pNice": "0.00"
},
{
"cpu_number": 1,
"pIdle": "83.88",
"pUser": "13.27",
"pSys": "2.86",
"pIRQ": "0.00",
"pNice": "0.00"
}
]
},
{
"1652470532355": [
{
"cpu_number": "ALL",
"pIdle": "79.15",
"pUser": "16.49",
"pSys": "4.36",
"pIRQ": "0.00",
"pNice": "0.00"
},
{
"cpu_number": 0,
"pIdle": "74.68",
"pUser": "19.62",
"pSys": "5.70",
"pIRQ": "0.00",
"pNice": "0.00"
},
{
"cpu_number": 1,
"pIdle": "83.47",
"pUser": "13.47",
"pSys": "3.06",
"pIRQ": "0.00",
"pNice": "0.00"
}
]
},
........
{
"1652470818069": [
{
"cpu_number": "ALL",
"pIdle": "80.06",
"pUser": "16.03",
"pSys": "3.91",
"pIRQ": "0.00",
"pNice": "0.00"
},
{
"cpu_number": 0,
"pIdle": "75.99",
"pUser": "18.58",
"pSys": "5.43",
"pIRQ": "0.00",
"pNice": "0.00"
},
{
"cpu_number": 1,
"pIdle": "84.01",
"pUser": "13.56",
"pSys": "2.43",
"pIRQ": "0.00",
"pNice": "0.00"
}
]
}
]
}
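As a rough sketch of applying the recommended thresholds to this response: the snippet below takes the most recent interval, reads the system-wide pIdle, and compares utilization (100 - pIdle) against the 95% warning and 99% alert levels. It assumes the response has already been parsed, in the format shown above.

```python
# Minimal sketch: check the newest system-wide CPU utilization against the
# 95% warning / 99% alert guidelines. Assumes `stats` is the parsed JSON
# returned by /rest/json/cpustat?time=0, in the format shown above.
def check_cpu(stats: dict) -> None:
    # "data" is a list of {"<epoch-ms>": [per-CPU objects]}; find the newest entry.
    latest_key = str(stats["latestTimestamp"])
    latest = next(entry[latest_key] for entry in stats["data"] if latest_key in entry)
    overall = next(c for c in latest if c["cpu_number"] == "ALL")
    utilization = 100.0 - float(overall["pIdle"])
    if utilization >= 99.0:
        print(f"ALERT: CPU utilization {utilization:.1f}%")
    elif utilization >= 95.0:
        print(f"WARNING: CPU utilization {utilization:.1f}%")
```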
Appliance Health – Disk Usage
GET /gms/rest/appliance/rest/{nePk}/diskUsage
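No response example is reproduced here, so the following sketch simply substitutes an appliance identifier into the path and prints whatever the endpoint returns; the Orchestrator address, nePk value, and authentication header are illustrative assumptions.

```python
# Minimal sketch (placeholders throughout): fetch disk usage for one appliance.
import requests

ORCH = "https://orchestrator.example.com"   # assumed Orchestrator address
NE_PK = "77.NE"                             # assumed appliance identifier (nePk)
HEADERS = {"X-Auth-Token": "<api-key>"}     # assumed authentication header

resp = requests.get(f"{ORCH}/gms/rest/appliance/rest/{NE_PK}/diskUsage",
                    headers=HEADERS, timeout=10)
print(resp.json())   # response schema not reproduced in this document
```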
Tunnel Performance – Packet Loss
Based on aggregate stats, the packet loss % for a tunnel can be calculated as
Packet Loss Percent = (SUM_PRE_LOSS / (SUM_WRX_PKTS + SUM_PRE_LOSS + 1)) * 100
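Expressed as code, the calculation looks like the sketch below; the “+ 1” in the denominator appears to be there to keep the denominator non-zero when no packets were received in the period.

```python
# Minimal sketch: tunnel packet loss percentage from aggregate stats.
def packet_loss_percent(sum_pre_loss: int, sum_wrx_pkts: int) -> float:
    # "+ 1" keeps the denominator non-zero when no packets were received.
    return (sum_pre_loss / (sum_wrx_pkts + sum_pre_loss + 1)) * 100.0

# Example: 120 pre-FEC lost packets against 58,000 received WAN packets.
print(f"{packet_loss_percent(120, 58_000):.3f}%")   # ~0.206%
```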
For timeseries data (and trend charting), the relevant stats can be pulled from /gms/rest/stats/timeseries:
Index | Column Name in REST API | UI Field Name |
---|---|---|
77 | PRE_LOSS | Pre-FEC Packets |
78 | POST_LOSS | Post-FEC Packets |
81 | PRE_PCT_LOSS | Pre-FEC |
82 | POST_PCT_LOSS | Post-FEC |
93 | PRE_PCT_LOSS_MAX | Peak Pre-FEC |
95 | POST_PCT_LOSS_MAX | Peak Post-FEC |
Note: it is possible for SUM_PRE_LOSS to be greater than SUM_WRX_PKTS. This is because pre-loss is a delayed statistic based on the FEC engine; even without FEC, loss can only be measured after a timeout, because the packet might still be arriving. WAN RX packets, on the other hand, are the actual packets received in that period. The loss stats are accurate when packets are streaming, but if the network has very low throughput, loss should be calculated using minute or hourly aggregations.
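One way to follow that advice is to sum the counters across a longer window before dividing, as in the sketch below. The per-interval record layout, and the WRX_PKTS column name (assumed by analogy with SUM_WRX_PKTS in the aggregate formula), are illustrative assumptions.

```python
# Minimal sketch: compute loss over a longer window (e.g., one hour of
# per-minute records) before dividing, per the note above. WRX_PKTS is an
# assumed column name by analogy with SUM_WRX_PKTS in the aggregate formula.
def aggregated_loss_percent(intervals: list[dict]) -> float:
    pre_loss = sum(i["PRE_LOSS"] for i in intervals)
    wrx_pkts = sum(i["WRX_PKTS"] for i in intervals)
    return (pre_loss / (wrx_pkts + pre_loss + 1)) * 100.0
```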