Background:
My Firebrick has
<ppp/> elements in its XML config file, one for each modem, with the following attributes set:
<ppp …
lcp-rate="1" lcp-timeout="5" …
/>
The modem LCP-pings the other end with an LCP echo-request PDU every
lcp-rate seconds, and my understanding is that a link-failure event is generated if there is no reply within
lcp-timeout seconds. Does that sound about right?
I did some experimentation with these values, which turned out to be a disastrous failure. If I remember rightly, the values given above were suggested by AA. One idea was to tune the
lcp-timeout value so as to get a real (total) link failure reported as quickly as possible, or to have a ‘dodgy’, semi-working link reported as a failure if packets were being randomly dropped.
Shortening the lcp-timeout value was a disaster. It seems it was very well tuned already. However, it is perhaps a bit
too closely tuned: every once in a very long while, say once every month or two, a link-failure event lasting roughly one second is reported, and I assume the link is dropped out of the bonded set for that time.
Now the question: how do we look at this? Do we regard it as a problem, something that needs to be fixed? (‘We have the value set too short, so it’s reporting red herrings.’) Or do we say that since the link has been dropping LCP echo replies, or not receiving the echoes in the first place, something somewhere is broken anyway, so we assume
the link is broken in the sense that it is not reliably transporting L3 SDUs in either direction. In that case, dropping it out of the set for a while (
how long?) is much better than risking packet loss that is this bad link’s fault. So which way do we go?
Do we leave the value as is and say that a very rare one-second drop is not a problem to be fixed but a sign that the mechanism is doing its best to protect us with a closely-tuned minimal setting? Or do we try lengthening the timeout value to remove these events? Lengthening the timeout will possibly reduce reliability: if we had lengthened the value, the next upstream PDU might have gone out on the possibly-still-bad link
n and could have been lost, whereas if we shorten the timeout, the moment at which link
n stops being used for upstream is brought forward and future upstream PDUs go out on link
m instead.
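To make that trade-off concrete, here is a minimal back-of-envelope sketch in Python. The semantics are my assumptions, not confirmed FireBrick behaviour: one LCP echo per lcp-rate seconds, link declared down after lcp-timeout seconds of silence, and independent random loss per echo. The function names are mine.

```python
def detection_delay(lcp_rate, lcp_timeout):
    # Worst-case time to declare a hard link failure: the link can die
    # just after a good echo reply, so we wait up to lcp_rate for the
    # next echo to be sent, then lcp_timeout for silence. (Assumed.)
    return lcp_rate + lcp_timeout

def false_drop_prob(loss_rate, lcp_rate, lcp_timeout):
    # Chance that a working-but-lossy link gets declared down: every
    # echo sent within the timeout window must be lost, assuming
    # independent loss per echo.
    echoes = max(1, lcp_timeout // lcp_rate)
    return loss_rate ** echoes

# With lcp-rate=1, lcp-timeout=5 and 10% random loss:
# detection_delay(1, 5)  -> 6 seconds worst case
# false_drop_prob(0.10, 1, 5) -> about 1e-05, i.e. rare spurious drops
```

The asymmetry is the point: lengthening lcp-timeout shrinks the spurious-drop probability geometrically but only adds linearly to how long a genuinely dead link keeps eating upstream PDUs.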
My current thinking is as follows: leave the value as it is. Even if any possible one-second drops are considered to be ‘a problem’, we surely need to know about them, not ‘fix’ them by making them go away. The question then is why we might be getting this timeout in the first place. What is happening: something not transmitting, something not getting through, or something not being heard? I don’t know which. From looking at stats around the times these events have been reported, it seems that either a droop or a sudden drop in SNRM is very often involved. So presumably corrupt packets, with PhyR L2 retransmission overworked.
But 5 s is a long time for a link to not be working properly. What about a scenario where the upstream tx queues are getting overfilled because the Firebrick is overdriving a modem? Would we perhaps see very high max latency figures reported in CQM as a symptom? If so, ensuring that latency max values do not get out of control might also put a stop to these events, provided corruption is not a contributing fault as well.
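For the overdriving scenario, a rough sanity check of how much queuing latency a rate mismatch could build up. This is a toy model with hypothetical names, not anything FireBrick-specific: constant offered rate, fixed modem sync rate, optional finite buffer.

```python
def queue_latency_ms(offered_kbps, sync_kbps, seconds, buffer_kb=None):
    # Backlog (KB) that accumulates if traffic is offered to the modem
    # faster than its upstream sync rate drains it.
    backlog_kb = max(0.0, (offered_kbps - sync_kbps) * seconds / 8)
    # A real modem has a finite buffer; cap the backlog if one is given.
    if buffer_kb is not None:
        backlog_kb = min(backlog_kb, buffer_kb)
    # Queuing delay seen by the next PDU, in milliseconds.
    return backlog_kb * 8 / sync_kbps * 1000

# Overdriving an 800 kb/s upstream by 200 kb/s for one second:
# queue_latency_ms(1000, 800, 1) -> 250.0 ms of added latency
```

Even a modest sustained overdrive produces latency on the order of hundreds of milliseconds within a second or two, which is exactly the kind of spike CQM max-latency graphs should make visible; echoes delayed (or tail-dropped) behind such a queue could plausibly trip a 5 s timeout.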
I would welcome some comments/guidance. I know I can ask A&A, who sold me the Firebrick, but I would like some independent opinions first.