Topic: Packet loss and AA CQM (Read 3602 times)

Weaver · « **on:** March 22, 2019, 04:27:11 AM »

[Apologies in advance for this post which has grown into a monster. I am trying to understand what I am seeing here, I am assuming I have a problem, and I may have misunderstood many things.]

Below I have a piece of a screenshot taken from my ISP AA’s control and monitoring server clueless.aa.net.uk. There’s a poor quality jpeg preview image at the end of this monster post, but if you click on the image you can get the full quality version. It shows traffic on my four bonded DSL lines. The green line indicates downstream traffic. Dark red is upstream traffic.

Take a look at the section from 20:30 to 21:10 (very approx). The raised green line over that period shows that there is a modest level of downstream traffic on all four links. Now for the first link at the very top of the yellow background area, notice that at 21:03 there are a few bright red pixels, occasionally clustered into the beginnings of tiny stalactites. Easiest to see in the case of the first link, but a tiny few such bright red pixels can be seen in every one of the per-line sections.

* The problem: in that 20:30 to 21:10 period, the tiny sprinkling of bright red pixels at, for example, line 1 around 21:00 indicates packet loss and at that point there was a ‘packet loss’ figure of 2%.

(This means that as compared with the number of test ping-type packets sent - PPP LCP echo request packets - to test the link, 2% of the expected responses were not received during some short interval around that sampling time.)
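To make the meaning of that figure concrete, here is a minimal sketch of the arithmetic, assuming the loss figure is simply missing replies divided by echoes sent within one sampling interval. The function name and the sample counts are hypothetical; how AA's CQM actually buckets its samples is not stated in this thread.

```python
def loss_percent(echoes_sent: int, replies_received: int) -> float:
    """Percentage of LCP echo requests that got no reply in one interval."""
    if echoes_sent <= 0:
        raise ValueError("no echoes were sent in this interval")
    if not 0 <= replies_received <= echoes_sent:
        raise ValueError("replies must be between 0 and echoes_sent")
    lost = echoes_sent - replies_received
    return 100.0 * lost / echoes_sent


# Hypothetical interval: 100 echo requests sent, 98 replies seen.
print(f"{loss_percent(100, 98):.1f}% loss")
```

Note that a missing reply counts as loss whether it was the request (downstream) or the reply (upstream) that went astray, so this figure alone cannot say in which direction the problem lies.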

This downstream data rate is far too low to saturate the pipes, so overloading the link cannot be blamed. So why is this happening?

Comparison situation: Over on the left, at 03:02, we can see a green line, high up, that shows a download in progress for a couple of hours, which is occasionally flat-out, sometimes maxing out, as at 03:02, and at other times it seems to be following some medium-term flat, set rate. That was a supposedly flat-out download of video from Amazon - as opposed to live streaming. Occasionally, as in the example of 03:02, the downstream traffic does appear to be saturating the links. It is hitting the high horizontal solid black line (which is immediately above a dotted line for 2.0 Mbps downstream) that shows the downstream rate limit.

* Despite the fact that one might reasonably imagine there could be some chance of fully loading the downstream links at some point during that period, I cannot see any packet loss during that download period 02:30 to 04:30.

So:

Is this a real issue or a red herring?
What are the possible causes?
Is data corruption, causing ‘real packet’ loss, a realistic possibility ? (That is, corruption of PPP frames containing IP data with a user data payload, not just mere test pings or responses.)

The downstream SNRM was really quite low on some of the links, eg 1.9dB, when I noted the stats on Thursday evening at around 21:30.

* Line 4 modem : In the case of line 4, somehow the upstream SNRM was in fact only ~2.4dB, rather than the expected 6dB upstream target. That line was not showing more packet loss red pixels or worse loss figures though. Looking at the upstream stats it had ES≈44, CRC≈80 total over the preceding 21-hour period from midnight and roughly the same over the 24 hour day before that. Is that good or bad?

Line 4 modem stats:

Code:

xdslctl: ADSL driver and PHY status
Status: Showtime
Last Retrain Reason:	8000
Last initialization procedure status:	0
Max:	Upstream rate = 406 Kbps, Downstream rate = 3364 Kbps
Bearer:	0, Upstream rate = 550 Kbps, Downstream rate = 3220 Kbps

Link Power State:	L0
Mode:			ADSL2 Annex A
TPS-TC:			ATM Mode(0x0)
Trellis:		U:ON /D:ON
Line Status:		No Defect
Training Status:	Showtime
		Down		Up
SNR (dB):	 2.3		 2.4
Attn(dB):	 63.5		 39.9
Pwr(dBm):	 18.4		 12.4

			ADSL2 framing
			Bearer 0
MSGc:		52		12
B:		37		62
M:		4		1
T:		3		1
R:		10		14
S:		1.4879		3.6023
L:		871		171
D:		2		8

			Counters
			Bearer 0
SF:		232851021		11248
SFErr:		1469		2663
RS:		1539084850		4161783
RSCorr:		3511040		3606794
RSUnCorr:	17400		0

ReXmt:		116232		0
ReXmtCorr:	104619		0
ReXmtUnCorr:	17734		0

			Bearer 0
HEC:		11455		2785
OCD:		561		0
LCD:		561		0
Total Cells:	3024582802		629535452
Data Cells:	1827922763		213965480
Drop Cells:	0
Bit Errors:	262477		366891

ES:		563		1514
SES:		5		0
UAS:		51		51
AS:		3791833

			Bearer 0
INP:		26.00		2.50
INPRein:	0.00		0.00
delay:		8		7
PER:		16.18		16.21
OR:		28.67		8.88
AgR:		3234.79	557.45

Bitswap:	761778/775012		31231/31231

Total time = 43 days 21 hours 18 min 4 sec
FEC:		3511040		3606794
CRC:		1469		2663
ES:		563		1514
SES:		5		0
UAS:		51		51
LOS:		0		0
LOF:		0		0
LOM:		1		0
Latest 15 minutes time = 3 min 4 sec
FEC:		395		346
CRC:		0		0
ES:		0		0
SES:		0		0
UAS:		0		0
LOS:		0		0
LOF:		0		0
LOM:		0		0
Previous 15 minutes time = 15 min 0 sec
FEC:		2018		1818
CRC:		0		0
ES:		0		0
SES:		0		0
UAS:		0		0
LOS:		0		0
LOF:		0		0
LOM:		0		0
Latest 1 day time = 21 hours 18 min 4 sec
FEC:		80272		170427
CRC:		0		85
ES:		0		44
SES:		0		0
UAS:		0		0
LOS:		0		0
LOF:		0		0
LOM:		0		0
Previous 1 day time = 24 hours 0 sec
FEC:		56188		165286
CRC:		1		83
ES:		1		44
SES:		0		0
UAS:		0		0
LOS:		0		0
LOF:		0		0
LOM:		0		0
Since Link time = 43 days 21 hours 17 min 13 sec
FEC:		3511040		3606794
CRC:		1469		2663
ES:		563		1514
SES:		5		0
UAS:		0		0
LOS:		0		0
LOF:		0		0
LOM:		1		0
NTR: mipsCntAtNtr=0 ncoCntAtNtr=0
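For anyone wanting to track these counters over time rather than eyeball them, here is a rough sketch of pulling the two-column (Down / Up) counters out of a dump like the one above, grouped by their "... time = ..." section headers. The sample text is a short excerpt copied from the dump; the function name and the parsing rules are my own assumptions about the layout, not anything documented for xdslctl.

```python
import re

# Short excerpt from the dump above (tab-separated: name, down, up).
SAMPLE = """\
Latest 1 day time = 21 hours 18 min 4 sec
FEC:\t\t80272\t\t170427
CRC:\t\t0\t\t85
ES:\t\t0\t\t44
SES:\t\t0\t\t0
Previous 1 day time = 24 hours 0 sec
FEC:\t\t56188\t\t165286
CRC:\t\t1\t\t83
ES:\t\t1\t\t44
"""

# A counter line is "NAME:" followed by exactly two integers.
COUNTER = re.compile(r"^(\w+):\s+(\d+)\s+(\d+)\s*$")


def parse_sections(text):
    """Return {section name: {counter name: (down, up)}}."""
    sections = {}
    current = None
    for line in text.splitlines():
        if " time = " in line:
            # e.g. "Latest 1 day time = ..." becomes section "Latest 1 day"
            current = line.split(" time = ")[0].strip()
            sections[current] = {}
            continue
        match = COUNTER.match(line)
        if match and current is not None:
            name, down, up = match.groups()
            sections[current][name] = (int(down), int(up))
    return sections


stats = parse_sections(SAMPLE)
down_crc, up_crc = stats["Latest 1 day"]["CRC"]
print(f"Latest day CRC: down={down_crc}, up={up_crc}")
```

Lines with only one value (such as AS) or with non-integer values are simply skipped by this sketch.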

* PhyR upstream? - I am thinking there’s no upstream L2 retx ‘PhyR’ in operation, for some reason, only downstream? Is that right? If so, then that’s really annoying. If there’s a data corruption problem, is that where it is lurking, in corruption of upstream data because there’s no L2 retx mechanism to compensate for the ridiculously low upstream SNRM?

* An idea for a fix? Resync everything ? - I thought anyway that I would put the modems back to their target SNRMs by telling each one to reboot itself. When I restarted the modem on line 2, it dropped a couple of hundred kbps downstream which says to me that the noise level happened to be much much worse than it was when it previously synced up. I don’t know if that was just due to the choice of time of day (late on Thursday evening). It could be that the line 2 modem had been struggling a lot with greatly worsened noise conditions and a downstream SNRM which had ended up drooping way too low. I wish I had noted everything down from that modem before I stupidly told it to reboot.

* So:

Have I found and fixed a problem ? If the red pixels are a real significant problem, then is this data corruption at work? Am I suffering real problems due to running with SNRM targets that are too low, too aggressive ?

Given the weird upstream SNRM diurnal variation - discussed in earlier threads - that is still a problem on some lines, am I getting screwed by upstream data corruption ? - because the daily cycle means I will have to have a ridiculously huge upstream target SNRM or some intelligence about when and how to resync, just to ensure initial upstream SNRM is enough to still be large enough for the later bad half of the day?

aesmith · « **Reply #1 on:** March 22, 2019, 10:14:07 AM »

Quote from: Weaver on March 22, 2019, 04:27:11 AM

...
Looking at the upstream stats it had ES≈44, CRC≈80 total over the preceding 21-hour period from midnight and roughly the same over the 24 hour day before that. Is that good or bad?

Good. BT threshold before a line is classed as bad is either one error every 10 seconds, or one every 5 depending on who you ask. In the real world I have found that even 60 CRC per minute has no effect whatsoever on real world throughput (although it would be enough for BT to start applying slow-down measures if A&A hadn't disabled DLM).
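As a rough check of that comparison, here is the arithmetic using the line 4 upstream figures from the stats dump (85 CRC over 21 hours 18 min 4 sec). It is only a sketch: it assumes the errors are spread evenly, and it treats the "one error every 10 (or 5) seconds" threshold as a simple mean interval, which may not be how BT actually evaluates it.

```python
def mean_seconds_between_errors(period_seconds, error_count):
    """Average gap between errors, assuming they are evenly spread."""
    if error_count == 0:
        return float("inf")
    return period_seconds / error_count


# "Latest 1 day time = 21 hours 18 min 4 sec", upstream CRC = 85
period = 21 * 3600 + 18 * 60 + 4
crc_interval = mean_seconds_between_errors(period, 85)
print(f"about one CRC every {crc_interval:.0f} seconds")

# Thresholds quoted above: one error every 10 s, or every 5 s.
for threshold in (10, 5):
    ratio = crc_interval / threshold
    print(f"roughly {ratio:.0f} times better than one per {threshold} s")
```

On those assumptions the line is averaging one CRC roughly every quarter of an hour, a long way inside either threshold.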

As for why you're losing LCP pings, I wonder if this could be a shaping mismatch. If I understand correctly these pings would be addressed to and replied to by your firewall rather than the modem itself. The firewall connects by Ethernet which, even if it's only 10 meg, is still miles faster than the modem upload speed. So could the firewall be occasionally sending stuff faster than the modem can forward it?
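To illustrate the kind of mismatch meant here, below is a toy model: a burst arrives at Ethernet speed into a tail-drop FIFO that drains at the DSL upstream rate. The 10 Mbps and 550 kbps figures come from this thread; the 16,000 byte buffer, the 1500 byte packets and the burst sizes are purely hypothetical, and a real modem's queueing may behave quite differently.

```python
def simulate_burst(burst_packets, packet_bytes, in_rate_bps,
                   out_rate_bps, buffer_bytes):
    """Count packets dropped by a tail-drop FIFO fed faster than it drains."""
    queued = 0.0   # bytes currently waiting in the buffer
    dropped = 0
    # Seconds between packet arrivals on the fast (Ethernet) side.
    gap = packet_bytes * 8 / in_rate_bps
    for _ in range(burst_packets):
        # Drain what the slow (DSL) side sent since the last arrival.
        queued = max(0.0, queued - out_rate_bps * gap / 8)
        if queued + packet_bytes > buffer_bytes:
            dropped += 1          # no room: tail drop
        else:
            queued += packet_bytes
    return dropped


ETHERNET_BPS = 10_000_000   # "even if it's only 10 meg"
UPSTREAM_BPS = 550_000      # line 4 upstream bearer rate from the stats
BUFFER_BYTES = 16_000       # hypothetical modem buffer size

for burst in (5, 50):
    lost = simulate_burst(burst, 1500, ETHERNET_BPS, UPSTREAM_BPS, BUFFER_BYTES)
    print(f"burst of {burst} packets: {lost} dropped")
```

In this model a short burst fits in the buffer, while a longer one overflows it even though the average load is low. An LCP echo reply caught in such a burst would be lost unless the sender shapes to the upstream rate or prioritises the reply.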

Weaver · « **Reply #2 on:** March 23, 2019, 02:36:53 AM »

Remember that in this case these are not in fact normal pings, ie ICMP echo request/echo reply, but a PPP-level thing, in PPP itself, part of the PPP LCP function, so below IP. The abuse of the term ping is merely metaphorical.

AA can generate real ICMP pings for CQM testing as an option, so it says somewhere. I seem to remember reading that only the AA high priests can enable this.

But I don’t know why it doesn’t prioritise PPP LCP pings over user data. Perhaps that was an intentional design choice. Having it one way or the other, high priority or low priority, reveals different information. It would have been good to hide some kind of priority flag in there so that each of the behaviours could be selected, and then we would see two lots of information. But perhaps the hardware can’t do priority queuing or multiple queues or queue-jumping. Could be that it just had to be that way, I think.

It does say in this page that even a tiny amount of packet loss (red pixels) on a link is bad news unless the link is maxed out, in which case red pixels are normal.

Btw, the purple column at 11:20-11:40 approx shows an outage due to either BT or AA.

At 09:15 approx there is a moderately high dark red peak which indicates flat-out upload, and I think this is an iPad doing an automatic daily backup to the Apple iCloud service. I think that service always runs tx flat out, but the upstream is very slow relative to the downstream so the height of that peak is quite modest.

[Moderator edited to merge three successive "tweet"-like posts into one homogeneous whole.]

johnson · « **Reply #3 on:** March 23, 2019, 04:09:08 AM »

Quote from: Weaver on March 23, 2019, 02:36:53 AM

Remember that in this case these are not in fact normal pings, ie ICMP echo request/echo reply, but a PPP-level thing, in PPP itself, part of the PPP LCP function, so below IP. The abuse of the term ping is merely metaphorical.

Reading that reminded me that I have to change the LCP timeout settings on my router to avoid the PPP session sometimes being dropped under load. I have to set a failure threshold and increase the time interval for LCP echoes or the link drops under full load, so I assume if my ISP offered monitoring at such a level it would be full of red blips, whereas with ICMP monitoring it's clean.

Not sure if that helps at all; it's possible I should not have to do this and it represents an issue somewhere.

Weaver · « **Reply #4 on:** March 23, 2019, 04:31:04 AM »

Actually a very good point. I have set these numbers too low. I was trying to get 3G failover to kick in more swiftly, so that there’s less of a chance of noticing a tiny delay during the changeover. I have clearly gone too far, just as you say. Very rarely I have had a clueless KCI alert email saying ‘link down 1 second’. I had already told myself that this was due to the incorrect timeout figures.

In my Firebrick router config file I am using the following tweaks on the <ppp /> XML element I have

lcp-rate="1" lcp-timeout="5"

which I got from the AA support webpage on ‘dongles’ for 4G/3G.

Originally I was just using defaults, and I don’t know what the default behaviour is.

I saw that tip in that page mentioned earlier, and decided, rightly or wrongly, that it applies to any multiple bonded lines setup, aside from the case of dongle failover, and so adopted it as is. Maybe those settings are a bit aggressive but I think being non-responsive for five seconds is not good and you would reasonably think that’s a sensible figure. I did test even more aggressive shorter timeouts, an experiment which was immediately revealed as useless.

I’m assuming that when transmitting (upstream) a Firebrick will switch off a link that has by this criterion been seen to have gone bad and should then route traffic to the other links. So if my understanding is right, it helps with swifter failure detection in general and has nothing specifically to do with 3G/4G USB dongles; it just helps tx recovery.

These 1 second things are very rare and so I’m inclined to leave the timeout as it is, since increasing it does also have a downside.
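The trade-off being discussed can be sketched as follows. This assumes lcp-rate is the interval between echo requests and lcp-timeout is how many seconds of silence are tolerated before the link is declared down; that reading of the two attributes is an assumption, not something checked against the Firebrick manual, and the function and the reply times are made up for illustration.

```python
def first_timeout(reply_times, lcp_timeout, end):
    """Return when the link would be declared down, or None if it never is.

    reply_times: seconds at which echo replies arrived (observation starts at 0)
    lcp_timeout: seconds of silence tolerated before declaring the link down
    end:         time at which we stop observing
    """
    last_heard = 0.0
    for t in sorted(reply_times) + [end]:
        if t - last_heard > lcp_timeout:
            return last_heard + lcp_timeout
        last_heard = t
    return None


# Hypothetical trace: echoes every second (lcp-rate="1"); the replies
# due at 11, 12 and 13 seconds are lost, then replies resume.
replies = list(range(1, 11)) + list(range(14, 21))

print(first_timeout(replies, lcp_timeout=5, end=20))  # rides out the gap
print(first_timeout(replies, lcp_timeout=3, end=20))  # trips during the gap
```

With a 5 second timeout the brief loss is ridden out. With 3 seconds the link is declared down even though it recovers a moment later, which is the downside of making the timeout too aggressive.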

[Moderator edited, ditto, as above.]

aesmith · « **Reply #5 on:** March 23, 2019, 07:43:47 AM »

Remember that the lost LCP pings could be the replies rather than the queries, so even if A&A prioritise their pings outbound to you, the replies could be lost if in turn your gear doesn't prioritise the replies. Or as mentioned if there's a traffic shaping mismatch. Not being totally familiar with all the elements of PPP, I'm assuming the LCP ping is replied to by your firewall, rather than by the modem. Is that correct?

The config you quote (lcp-rate="1" lcp-timeout="5") looks like it applies to queries originating from your firewall, presumably to monitor whether a particular connection is up. I wouldn't expect to see a setting at your end affecting timeouts when replying to pings; all your box can do is decide whether to reply or not, and it's up to the originator (A&A) to keep track of RTT and success rate.

Weaver · « **Reply #6 on:** March 25, 2019, 10:25:16 AM »

I was not talking about ICMP pings here.

The PPP LCP ‘pings’ are being replied to by my Firebrick firewall-router.

That’s a good point, replies could be getting corrupted if anything is.

DiggerOfHoles · « **Reply #7 on:** May 06, 2019, 05:34:56 PM »

Quote from: Weaver on March 25, 2019, 10:25:16 AM

I was not talking about ICMP pings here.

The PPP LCP ‘pings’ are being replied to by my Firebrick firewall-router.

That’s a good point, replies could be getting corrupted if anything is.

WTF Weaver, you're talking about the same thing I am. PPP 23 years out of date. Are you 23?

Has anyone actually read https://www.btplc.com/SINet/SINs/pdf/472v2p9.pdf

If your ISP uses WBC then this is what you are actually talking about.

If AA use a non-WBC product then accept my apologies.

The PPP LCP ‘pings’ you speak of are what keeps your session alive. Note: NOT the xDSL/DSLx session.

PPP is layer 2 of the OSI model; google it.

burakkucat · « **Reply #8 on:** May 06, 2019, 06:18:49 PM »

<Cough!> (Pesky fur-ball.)

Please perform a little research on the subject of A&A, the ISP, and the service they provide to all their end users.

DiggerOfHoles · « **Reply #9 on:** May 06, 2019, 06:36:17 PM »

So every ISP uses PPP LCP echo. That's how the technology works.

The hype, the service they provide, that you refer me to is exactly that.

If you don't understand the hype then don't believe it to be true.

Marketing techno babble tosh!

kitz · « **Reply #10 on:** May 07, 2019, 10:57:44 AM »

>>> So every ISP uses PPP LCP echo.

The graphs and additional monitoring tools are provided exclusively by AAISP for their EU's, and have mostly been developed and coded by RevK. I'm not aware of any other UK ISP that makes anything else like this available for their EU's.
They provide a suite of additional monitoring tools including the CQM Graphs which use LCP echo. AAISP even allow the EU's to run line tests or reset the BTw adsl DLM themselves using the custom CP.

DiggerOfHoles · « **Reply #11 on:** May 08, 2019, 09:51:05 AM »

If you make a PPP connection to your ISP, is this not monitored using LCP PPP echo?

If n echoes are not replied to, close the PPP session and attempt to start a new one?

Timer T203 referred to in the ITU spec?

I suspect RevK may be using a spare/empty attribute in the echo frame to convey info from your router back to AA, to be collated and fed back to the user?

Fair point?

As for AA they sound like my kind of ISP. I'll check them out further. Thanks.

DiggerOfHoles · « **Reply #12 on:** May 08, 2019, 10:32:44 AM »

Well some drooling and a tightening of the wallet later..

AAISP too expensive for me.

Firebrick manual fascinating. Data stored automatically at LNS/radius. Sorry RevK.

Anyway, if all LCP PPP echoes get dumped somewhere, will this bring down the PPP session?

Especially if one is being sent every second. Would WBC not filter/dump some of these at times of congestion?

dee.jay · « **Reply #13 on:** May 08, 2019, 01:13:40 PM »

Quote from: DiggerOfHoles on May 08, 2019, 10:32:44 AM

AAISP too expensive for me.

You get what you pay for
