Websocket Examples

Monitoring the Websocket Connection

The health and status of the websocket connection can be monitored using the Orchestrator REST API request below:

GET /gms/rest/remoteLogReceiver/websocket/status

The request returns a JSON result such as the following example:

{
  : {
    "connected": false,
    "receiverId": ,
    "clientIP": "127.0.0.1",
    "establishedOn": 1588978904,
    "lastPongReceivedOn": 1588979204,
    "lastMessageSent": "some message...",
    "lastMessageSentOn": 1588978952
  }
}
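
As a rough illustration of how this status result might be consumed, the Python sketch below flags entries that look unhealthy. It rests on several assumptions that the example above does not confirm: that the result is an object mapping some key to a status object, that the timestamps are epoch seconds, and that a five-minute pong age is a sensible staleness limit. The key "example-receiver", the receiverId value 1, the function name unhealthy_receivers, and the max_pong_age parameter are all hypothetical.

```python
import json
import time


def unhealthy_receivers(status, now=None, max_pong_age=300):
    """Return {key: reason} for status entries that look unhealthy.

    Assumptions (not confirmed by the API documentation):
    - `status` maps a key to a status object, as in the example result
    - timestamps are epoch seconds
    - `max_pong_age` (seconds) is an arbitrary illustrative limit
    """
    now = time.time() if now is None else now
    problems = {}
    for key, entry in status.items():
        if not entry.get("connected", False):
            problems[key] = "not connected"
            continue
        last_pong = entry.get("lastPongReceivedOn")
        if last_pong is None or now - last_pong > max_pong_age:
            problems[key] = "stale pong"
    return problems


# Sample modeled on the example result; the top-level key and the
# receiverId value are hypothetical placeholders.
sample = json.loads("""
{
  "example-receiver": {
    "connected": false,
    "receiverId": 1,
    "clientIP": "127.0.0.1",
    "establishedOn": 1588978904,
    "lastPongReceivedOn": 1588979204,
    "lastMessageSent": "some message...",
    "lastMessageSentOn": 1588978952
  }
}
""")

print(unhealthy_receivers(sample, now=1588979210))
```

In practice the dictionary would come from the GET request above rather than from a literal string; authentication and host details are omitted here because they depend on the deployment.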

Alarm Correlation Rules

The appliance raises alarms based on its own state and the state of the network. The following basic alarm conditions are worth noting when correlating alarms.

If all underlay tunnel endpoints at EC1 that point to EC2 are in the Down state, then all overlays they support will be Down.
An overlay tunnel cannot be Down while at least one of its underlay tunnels is in the Up state.
If all overlays between EC1 and EC2 are in the Down state, and both EC1 and EC2 are reachable, then the root cause can be isolated to the underlay tunnels corresponding to the overlays that are Down.
The two endpoints of a tunnel are eventually consistent in their operational state, but at any given moment the state of one tunnel endpoint may be inconsistent with the state of the other.
If an interface fails, all tunnels supported by that interface go to the Down state. The remote endpoints of those tunnels eventually go to the Down state as well, based on a configured threshold.
If an appliance is completely unreachable, the data plane may still be active. To identify this scenario, look at all tunnels pointing to the unreachable appliance. If all of those tunnels are Down (after x minutes), then the root cause is a failed appliance. If only a subset of the tunnels failed, but those tunnels all share a common destination interface ID, then the root cause is likely a failed interface.
When there are many tunnels between two appliances, are individual tunnel failures likely, or do the tunnels tend to behave as a group?
Is there any degradation on the tunnel (e.g., loss or latency) that we measure and use as a trigger for changing the operational state from Up to Down? If so, what are those thresholds, and are they visible via the API?
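
The correlation rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the Tunnel record, its field names, and the functions overlay_is_up and root_cause are hypothetical and do not correspond to any Orchestrator API. The sketch also assumes the tunnel states have already settled, so the "after x minutes" wait and the eventual consistency between endpoints are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tunnel:
    """Hypothetical tunnel record; field names are illustrative only."""
    name: str
    dest_appliance: str
    dest_interface: str
    up: bool


def overlay_is_up(underlays):
    """An overlay is Up if at least one supporting underlay is Up."""
    return any(t.up for t in underlays)


def root_cause(tunnels, appliance):
    """Classify failures among tunnels pointing to `appliance`."""
    targeted = [t for t in tunnels if t.dest_appliance == appliance]
    down = [t for t in targeted if not t.up]
    if not down:
        return "no failure"
    if len(down) == len(targeted):
        # Every tunnel to the appliance is Down.
        return "appliance failed"
    if len({t.dest_interface for t in down}) == 1:
        # Only a subset is Down, and it shares one destination interface.
        return "interface failed: " + down[0].dest_interface
    return "undetermined"


tunnels = [
    Tunnel("t1", "EC2", "wan0", up=False),
    Tunnel("t2", "EC2", "wan0", up=False),
    Tunnel("t3", "EC2", "wan1", up=True),
]
print(overlay_is_up(tunnels))      # True
print(root_cause(tunnels, "EC2"))  # interface failed: wan0
```

A real correlator would additionally hold a Down observation for the configured interval before declaring a failed appliance, and would confirm appliance reachability separately from tunnel state.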