Background:
My Firebrick has
<ppp/> elements in its XML config file, one for each modem, with the following attributes set:
<ppp …
lcp-rate="1" lcp-timeout="5" …
/>
The modem LCP-pings the other end with an LCP echo-request PDU every
lcp-rate seconds, and my understanding is that a link-failure event is generated if there is no reply within
lcp-timeout seconds. Does that sound about right?
I did some experimentation with these values, which turned out to be a disastrous failure. If I remember rightly, the values given above were suggested by AA. One idea was to tune the
lcp-timeout value so as to get a real (total) link failure reported as quickly as possible, or to have a ‘dodgy’, semi-working link reported as a failure if packets were being randomly dropped.
Shortening the lcp-timeout value was a disaster. It seems it was very well tuned already. However, it is perhaps a bit
too closely tuned: every once in a very long while, say once every month or two, a link-failure event lasting roughly one second is reported, and I assume the link is dropped out of the bonded set for that time.
Now the question: how do we look at this? Do we regard it as a problem, something that needs to be fixed? (‘We have the value set too short, so it’s reporting red herrings.’) Or do we say that since the link has been dropping LCP echo replies, or not receiving the echoes in the first place, something somewhere is broken anyway, so we assume
the link is broken in the sense that it is not reliably transporting L3 SDUs in either direction. In that case, dropping it out of the set for a while (
how long?) is much better than risking packet loss that is this bad link’s fault. So which way do we go?
Do we leave the value as is and say that a very rare one-second drop is not a problem to be fixed but a sign that the mechanism is doing its best to protect us with a closely-tuned minimal setting? Or do we try lengthening the timeout value to remove these events? Lengthening the timeout will possibly reduce reliability: if we had lengthened the value, the next upstream PDU might have gone out on the possibly-still-bad link
n and could have been lost, whereas if we shorten the timeout, the moment at which link
n stops being used for upstream is brought forward and future upstream PDUs go out on link
m instead.
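To make that trade-off concrete, here is a minimal back-of-envelope sketch in Python. The semantics are my assumptions, not confirmed FireBrick behaviour: one LCP echo per lcp-rate seconds, link declared down after lcp-timeout seconds of silence, and independent random loss per echo. The function names are mine.

```python
def detection_delay(lcp_rate, lcp_timeout):
    # Worst-case time to declare a hard link failure: the link can die
    # just after a good echo reply, so we wait up to lcp_rate for the
    # next echo to be sent, then lcp_timeout for silence. (Assumed.)
    return lcp_rate + lcp_timeout

def false_drop_prob(loss_rate, lcp_rate, lcp_timeout):
    # Chance that a working-but-lossy link gets declared down: every
    # echo sent within the timeout window must be lost, assuming
    # independent loss per echo.
    echoes = max(1, lcp_timeout // lcp_rate)
    return loss_rate ** echoes

# With lcp-rate=1, lcp-timeout=5 and 10% random loss:
# detection_delay(1, 5)  -> 6 seconds worst case
# false_drop_prob(0.10, 1, 5) -> about 1e-05, i.e. rare spurious drops
```

The asymmetry is the point: lengthening lcp-timeout shrinks the spurious-drop probability geometrically but only adds linearly to how long a genuinely dead link keeps eating upstream PDUs.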
My current thinking is as follows: leave the value as it is. Even if any possible one-second drops are considered to be ‘a problem’, we surely need to know about them, not ‘fix’ them by making them go away. The question then is why we might be getting this timeout in the first place. What is happening: something not transmitting, something not getting through, or something not being heard? I don’t know which. From looking at stats around the times these events have been reported, it seems that either a droop or a sudden drop in SNRM is very often involved. So presumably corrupt packets, with PhyR L2 retransmission overworked.
But 5 s is a long time for a link to not be working properly. What about a scenario where the upstream tx queues are getting overfilled because the Firebrick is overdriving a modem? Would we perhaps see very high max latency figures reported in CQM as a symptom? If so, ensuring that latency max values do not get out of control might also put a stop to these events, provided corruption is not a contributing fault as well.
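For the overdriving scenario, a rough sanity check of how much queuing latency a rate mismatch could build up. This is a toy model with hypothetical names, not anything FireBrick-specific: constant offered rate, fixed modem sync rate, optional finite buffer.

```python
def queue_latency_ms(offered_kbps, sync_kbps, seconds, buffer_kb=None):
    # Backlog (KB) that accumulates if traffic is offered to the modem
    # faster than its upstream sync rate drains it.
    backlog_kb = max(0.0, (offered_kbps - sync_kbps) * seconds / 8)
    # A real modem has a finite buffer; cap the backlog if one is given.
    if buffer_kb is not None:
        backlog_kb = min(backlog_kb, buffer_kb)
    # Queuing delay seen by the next PDU, in milliseconds.
    return backlog_kb * 8 / sync_kbps * 1000

# Overdriving an 800 kb/s upstream by 200 kb/s for one second:
# queue_latency_ms(1000, 800, 1) -> 250.0 ms of added latency
```

Even a modest sustained overdrive produces latency on the order of hundreds of milliseconds within a second or two, which is exactly the kind of spike CQM max-latency graphs should make visible; echoes delayed (or tail-dropped) behind such a queue could plausibly trip a 5 s timeout.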
I would welcome some comments/guidance. I know I can ask A&A, who sold me the Firebrick, but I would like some independent opinions first.