Kitz ADSL Broadband Information
Author Topic: Packet loss and AA CQM  (Read 923 times)

Weaver

  • Addicted Kitizen
  • *****
  • Posts: 7460
  • Retd sw dev; A&A; 4 × 7km ADSL2; IPv6; Firebrick
Packet loss and AA CQM
« on: March 22, 2019, 04:27:11 AM »

[Apologies in advance for this post which has grown into a monster. I am trying to understand what I am seeing here, I am assuming I have a problem, and I may have misunderstood many things.]

Below I have a piece of a screenshot taken from my ISP AA’s control and monitoring server clueless.aa.net.uk. There’s a poor quality jpeg preview image at the end of this monster post  :-[, but if you click on the image you can get to the full quality version. It shows traffic on my four bonded DSL lines. The green line indicates downstream traffic. Dark red is upstream traffic.

Take a look at the section from 20:30 to 21:10 (very approx). The raised green line over that period shows that there is a modest level of downstream traffic on all four links. Now for the first link at the very top of the yellow background area, notice that at 21:03 there are a few bright red pixels, occasionally clustered into the beginnings of tiny stalactites. Easiest to see in the case of the first link, but a tiny few such bright red pixels can be seen in every one of the per-line sections.

* The problem: in that 20:30 to 21:10 period, the tiny sprinkling of bright red pixels at, for example, line 1 around 21:00 indicates packet loss and at that point there was a ‘packet loss’ figure of 2%.

(This means that as compared with the number of test ping-type packets sent - PPP LCP echo request packets - to test the link, 2% of the expected responses were not received during some short interval around that sampling time.)
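As a rough sketch of what that figure means, the loss percentage for one sample is just the share of echo requests that got no reply. The sample size of 100 echoes used here is an assumption for illustration only; I have not checked how many echoes AA's CQM actually sends per sample.

```python
def loss_percent(sent: int, received: int) -> float:
    """Percentage of LCP echo requests that got no reply in one sample."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= received <= sent:
        raise ValueError("received must be between 0 and sent")
    return 100.0 * (sent - received) / sent


# Hypothetical sample: 100 echo requests sent, 98 replies seen.
print(loss_percent(100, 98))  # 2.0
```

Note that a missing reply says nothing about direction: either the request or the reply could have been the one that went astray.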

This downstream data rate is way, way too low to saturate the pipes, so overloading the link cannot be blamed. So why is this happening?

Comparison situation: Over on the left, at 03:02, we can see a green line, high up, that shows a download in progress for a couple of hours, which is occasionally flat-out, sometimes maxing out, as at 03:02, and at other times it seems to be following some medium-term flat, set rate. That was a supposedly flat-out download of video from Amazon - as opposed to live streaming. Occasionally, as in the example of 03:02, the downstream traffic does appear to be saturating the links. It is hitting the high horizontal solid black line (which is immediately above a dotted line for 2.0 Mbps downstream) that shows the downstream rate limit.

* Despite the fact that one might reasonably imagine there could be some chance of fully loading the downstream links at some point during that period, I cannot see any packet loss during that download period 02:30 to 04:30.

So:
  • Is this a real issue or a red herring?
  • What are the possible causes?
  • Is data corruption, causing ‘real packet’ loss, a realistic possibility ? (That is, corruption of PPP frames containing IP data with a user data payload, not just mere test pings or responses.)

The downstream SNRM was really quite low on some of the links, eg 1.9dB, when I noted the stats on Thursday evening at around 21:30.

* Line 4 modem : In the case of line 4, the upstream SNRM was somehow only ~2.4dB, rather than the expected 6dB upstream target. That line was not showing more packet loss red pixels or worse loss figures though. Looking at the upstream stats it had ES≈44, CRC≈80 total over the preceding 21-hour period from midnight, and roughly the same over the 24-hour day before that. Is that good or bad?

Line 4 modem stats:

Code:
xdslctl: ADSL driver and PHY status
Status: Showtime
Last Retrain Reason: 8000
Last initialization procedure status: 0
Max: Upstream rate = 406 Kbps, Downstream rate = 3364 Kbps
Bearer: 0, Upstream rate = 550 Kbps, Downstream rate = 3220 Kbps

Link Power State: L0
Mode: ADSL2 Annex A
TPS-TC: ATM Mode(0x0)
Trellis: U:ON /D:ON
Line Status: No Defect
Training Status: Showtime
Down Up
SNR (dB): 2.3 2.4
Attn(dB): 63.5 39.9
Pwr(dBm): 18.4 12.4

ADSL2 framing
Bearer 0
MSGc: 52 12
B: 37 62
M: 4 1
T: 3 1
R: 10 14
S: 1.4879 3.6023
L: 871 171
D: 2 8

Counters
Bearer 0
SF: 232851021 11248
SFErr: 1469 2663
RS: 1539084850 4161783
RSCorr: 3511040 3606794
RSUnCorr: 17400 0

ReXmt: 116232 0
ReXmtCorr: 104619 0
ReXmtUnCorr: 17734 0

Bearer 0
HEC: 11455 2785
OCD: 561 0
LCD: 561 0
Total Cells: 3024582802 629535452
Data Cells: 1827922763 213965480
Drop Cells: 0
Bit Errors: 262477 366891

ES: 563 1514
SES: 5 0
UAS: 51 51
AS: 3791833

Bearer 0
INP: 26.00 2.50
INPRein: 0.00 0.00
delay: 8 7
PER: 16.18 16.21
OR: 28.67 8.88
AgR: 3234.79 557.45

Bitswap: 761778/775012 31231/31231

Total time = 43 days 21 hours 18 min 4 sec
FEC: 3511040 3606794
CRC: 1469 2663
ES: 563 1514
SES: 5 0
UAS: 51 51
LOS: 0 0
LOF: 0 0
LOM: 1 0
Latest 15 minutes time = 3 min 4 sec
FEC: 395 346
CRC: 0 0
ES: 0 0
SES: 0 0
UAS: 0 0
LOS: 0 0
LOF: 0 0
LOM: 0 0
Previous 15 minutes time = 15 min 0 sec
FEC: 2018 1818
CRC: 0 0
ES: 0 0
SES: 0 0
UAS: 0 0
LOS: 0 0
LOF: 0 0
LOM: 0 0
Latest 1 day time = 21 hours 18 min 4 sec
FEC: 80272 170427
CRC: 0 85
ES: 0 44
SES: 0 0
UAS: 0 0
LOS: 0 0
LOF: 0 0
LOM: 0 0
Previous 1 day time = 24 hours 0 sec
FEC: 56188 165286
CRC: 1 83
ES: 1 44
SES: 0 0
UAS: 0 0
LOS: 0 0
LOF: 0 0
LOM: 0 0
Since Link time = 43 days 21 hours 17 min 13 sec
FEC: 3511040 3606794
CRC: 1469 2663
ES: 563 1514
SES: 5 0
UAS: 0 0
LOS: 0 0
LOF: 0 0
LOM: 1 0
NTR: mipsCntAtNtr=0 ncoCntAtNtr=0
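
For anyone wanting to crunch a dump like the one above, here is a minimal sketch that pulls the simple "name: down up" counter lines into a dictionary. The regular expression and the function name `parse_pairs` are my own inventions, not part of xdslctl, and lines with units in the label or with slashes in the values are simply skipped.

```python
import re

# Matches counter lines such as "CRC: 1469 2663" (downstream, then upstream).
PAIR = re.compile(r"^(\w+):\s+(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s*$")


def parse_pairs(text: str) -> dict:
    """Return {name: (down, up)}, keeping the first occurrence of each name."""
    stats = {}
    for line in text.splitlines():
        m = PAIR.match(line.strip())
        if m and m.group(1) not in stats:
            stats[m.group(1)] = (float(m.group(2)), float(m.group(3)))
    return stats


# A few lines copied from the line 4 dump above.
sample = """\
RS: 1539084850 4161783
RSCorr: 3511040 3606794
RSUnCorr: 17400 0
ReXmt: 116232 0
"""

stats = parse_pairs(sample)
rs_down = stats["RS"][0]
uncorr_down = stats["RSUnCorr"][0]
print(f"downstream RS words left uncorrected: {uncorr_down / rs_down:.6%}")
# A zero upstream ReXmt count is consistent with no upstream retransmission,
# but it does not prove it.
print("upstream retransmissions seen:", stats["ReXmt"][1] > 0)
```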

* PhyR upstream? - I am thinking there’s no upstream L2 retx ‘PhyR’ in operation, for some reason, only downstream? Is that right? If so, then that’s really annoying. If there’s a data corruption problem, is that where it is lurking, in corruption of upstream data because there’s no L2 retx mechanism to compensate for the ridiculously low upstream SNRM?

* An idea for a fix? Resync everything ? - I thought anyway that I would put the modems back to their target SNRMs by telling each one to reboot itself. When I restarted the modem on line 2, it dropped a couple of hundred kbps downstream which says to me that the noise level happened to be much much worse than it was when it previously synced up. I don’t know if that was just due to the choice of time of day (late on Thursday evening). It could be that the line 2 modem had been struggling a lot with greatly worsened noise conditions and a downstream SNRM which had ended up drooping way too low. I wish I had noted everything down from that modem before I stupidly told it to reboot.

* So:

Have I found and fixed a problem ? If the red pixels are a real significant problem, then is this data corruption at work? Am I suffering real problems due to running with SNRM targets that are too low, too aggressive ?

Given the weird upstream SNRM diurnal variation - discussed in earlier threads - that is still a problem on some lines, am I getting screwed by upstream data corruption? The daily cycle means I would need either a ridiculously huge upstream target SNRM, or some intelligence about when and how to resync, just to ensure that the initial upstream SNRM is still large enough for the later, bad half of the day.




Logged

aesmith

  • Reg Member
  • ***
  • Posts: 867
Re: Packet loss and AA CQM
« Reply #1 on: March 22, 2019, 10:14:07 AM »

...
Looking at the upstream stats it had ES≈44, CRC≈80 total over the preceding 21 hour-long period from midnight and roughly the same over the 24 hour day before that. Is that good or bad?

Good.  BT threshold before a line is classed as bad is either one error every 10 seconds, or one every 5 depending on who you ask.   In the real world I have found that even 60 CRC per minute has no effect whatsoever on real world throughput (although it would be enough for BT to start applying slow-down measures if A&A hadn't disabled DLM).
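
To put numbers on that, here is a small sketch using the upstream figures from the "Latest 1 day" section of the line 4 stats (CRC 85 over 21 hours 18 min 4 sec). The 10 second and 5 second thresholds are just the ones quoted above, not something checked against a BT document.

```python
def seconds_per_error(errors: int, hours: int, minutes: int = 0, seconds: int = 0) -> float:
    """Average number of seconds between errors over the given period."""
    period = hours * 3600 + minutes * 60 + seconds
    if errors == 0:
        return float("inf")
    return period / errors


up_crc = seconds_per_error(85, 21, 18, 4)
print(f"upstream CRC: roughly one every {up_crc:.0f} s")  # about 902 s

# Compare with the two thresholds mentioned above.
for threshold_s in (10, 5):
    verdict = "fine" if up_crc > threshold_s else "over threshold"
    print(f"one error per {threshold_s} s threshold: {verdict}")
```

So that line is averaging one upstream CRC error every quarter of an hour or so, a long way inside either threshold.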

As for why you're losing LCP pings, I wonder if this could be a shaping mismatch.   If I understand correctly these pings would be addressed to and replied to by your firewall rather than the modem itself.  The firewall connects by Ethernet, which even if it's only 10 meg is still miles faster than the modem upload speed.  So could the firewall be occasionally sending stuff faster than the modem can forward it?
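
A back-of-envelope sketch of that idea: if a burst of full-size packets lands in the modem ahead of an LCP reply, how long does the reply sit in the queue? This assumes a plain FIFO queue in the modem, AAL5 over ATM, and a guessed 16 bytes of per-packet overhead; the 550 kbps figure is the upstream sync rate from the line 4 stats. The modem's real buffer size and queueing behaviour are unknown to me.

```python
import math

ATM_CELL = 53      # bytes on the wire per ATM cell
ATM_PAYLOAD = 48   # payload bytes carried per cell


def wire_bytes(packet_bytes: int, overhead: int = 16) -> int:
    """Bytes on the line for one packet, assuming AAL5 over ATM.

    `overhead` is an assumed per-packet figure (encapsulation header plus
    the 8-byte AAL5 trailer); the real value depends on the encapsulation.
    """
    cells = math.ceil((packet_bytes + overhead) / ATM_PAYLOAD)
    return cells * ATM_CELL


def queue_delay_ms(burst_packets: int, packet_bytes: int, sync_kbps: float) -> float:
    """Time for a FIFO queue of equal-sized packets to drain at the sync rate."""
    bits = burst_packets * wire_bytes(packet_bytes) * 8
    return bits / (sync_kbps * 1000) * 1000


for burst in (1, 10, 40):
    delay = queue_delay_ms(burst, 1500, 550)
    print(f"{burst:3d} x 1500-byte packets ahead: {delay:.0f} ms wait")
```

If the modem's buffer is smaller than the burst, the excess is dropped rather than delayed, and an unprioritised LCP reply is as likely to be dropped as anything else.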
Logged

Weaver

  • Addicted Kitizen
  • *****
  • Posts: 7460
  • Retd sw dev; A&A; 4 × 7km ADSL2; IPv6; Firebrick
Re: Packet loss and AA CQM
« Reply #2 on: March 23, 2019, 02:36:53 AM »

Remember that in this case these are not in fact normal pings, ie ICMP echo request/echo reply, but a PPP-level thing, in PPP itself, part of the PPP LCP function, so below IP. The abuse of the term ping is merely metaphorical.

AA can generate real ICMP pings for CQM testing as an option, so it says somewhere. I seem to remember reading that only the AA high priests can enable this.

But I don’t know why it doesn’t prioritise PPP LCP pings over user data. Perhaps that was an intentional design choice. Having it one way or the other, high priority or low priority, reveals different information. It would have been good to hide some kind of priority flag in there so that each of the behaviours could be selected, and then we would see two lots of information. But perhaps the hardware can’t do priority queuing or multiple queues or queue-jumping. Could be that it just had to be that way, I think.

It does say in this page that even a tiny amount of packet loss (red pixels) on a link is bad news unless the link is maxed-out, in which case red pixels are normal, so it also says.

Btw, the purple column at 11:20-11:40 approx shows an outage due to either BT or AA.

At 09:15 approx there is a moderately high dark red peak which indicates flat-out upload, and I think this is an iPad doing an automatic daily backup to the Apple iCloud service. I think that service always runs tx flat out, but the upstream is very slow relative to the downstream so the height of that peak is quite modest.

[Moderator edited to merge three successive "tweet"-like posts into one homogeneous whole.]
« Last Edit: March 23, 2019, 08:06:32 PM by burakkucat »
Logged

johnson

  • Reg Member
  • ***
  • Posts: 695
Re: Packet loss and AA CQM
« Reply #3 on: March 23, 2019, 04:09:08 AM »

Remember that in this case these are not in fact normal pings, ie ICMP echo request/echo reply, but a PPP-level thing, in PPP itself, part of the PPP LCP function, so below IP. The abuse of the term ping is merely metaphorical.

Reading that reminded me that I have to change the LCP timeout settings on my router to avoid the PPP session sometimes being dropped under load. I have to set a failure threshold and increase the time interval for LCP echoes or the link drops under full load, so I assume that if my ISP offered monitoring at such a level it would be full of red blips, whereas with ICMP monitoring it's clean:




Not sure if that helps at all; it's possible I should not have to do this and it represents an issue somewhere.
Logged

Weaver

  • Addicted Kitizen
  • *****
  • Posts: 7460
  • Retd sw dev; A&A; 4 × 7km ADSL2; IPv6; Firebrick
Re: Packet loss and AA CQM
« Reply #4 on: March 23, 2019, 04:31:04 AM »

Actually a very good point. I have set these numbers too low. I was trying to get 3G failover to kick in more swiftly, so that there’s less of a chance of noticing a tiny delay during the changeover. I have clearly gone too far, just as you say. Very rarely I have had a clueless KCI alert email saying ‘link down 1 second’. I had already told myself that this was due to the incorrect timeout figures.

In my Firebrick router config file I am using the following tweaks on the <ppp /> XML element I have

    lcp-rate="1" lcp-timeout="5"

which I got from the AA support webpage on ‘dongles’ for 4G/3G.

Originally I was just using defaults, and I don’t know what the default behaviour is.

I saw that tip in that page mentioned earlier, and decided, rightly or wrongly, that it applies to any multiple bonded lines setup, aside from  the case of dongle failover, and so adopted it as is. Maybe those settings are a bit aggressive but I think being non-responsive for five seconds is not good and you would reasonably think that’s a sensible figure. I did test even more aggressive shorter timeouts, an experiment which was immediately revealed as useless.

I’m assuming that when transmitting (upstream) a Firebrick will switch off a link that has, by this criterion, been seen to have gone bad, and should then route traffic to the other links. So in fact, if my understanding is right, it helps with swifter failure detection and has nothing to do with 3G/4G USB dongles; it just helps tx recovery.

These 1 second things are very rare and so I’m inclined to leave the timeout as it is, since increasing it does also have a downside.
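
A rough sketch of why such events should be rare, under two loudly stated assumptions: that lcp-rate="1" means one echo per second and lcp-timeout="5" means the link is declared down after five seconds with no reply (my reading, not checked against the manual), and that echo losses are independent of one another, which is optimistic because real losses tend to come in bursts.

```python
import math


def echoes_to_trip(lcp_rate_s: float, lcp_timeout_s: float) -> int:
    """Consecutive unanswered echoes needed before the timeout expires."""
    return max(1, math.ceil(lcp_timeout_s / lcp_rate_s))


def p_trip(loss: float, lcp_rate_s: float, lcp_timeout_s: float) -> float:
    """Chance that a given run of echoes is lost in its entirety,
    assuming each echo is lost independently with probability `loss`."""
    return loss ** echoes_to_trip(lcp_rate_s, lcp_timeout_s)


n = echoes_to_trip(1, 5)
print(f"echoes that must all be lost: {n}")
print(f"chance at 2% random loss: {p_trip(0.02, 1, 5):.1e}")
```

With independent 2% loss the chance of losing five in a row is vanishingly small, so any timeout that does fire points at a burst of loss (congestion or a noise hit) rather than at the background loss rate.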

[Moderator edited, ditto, as above.]
« Last Edit: March 23, 2019, 08:12:49 PM by burakkucat »
Logged

aesmith

  • Reg Member
  • ***
  • Posts: 867
Re: Packet loss and AA CQM
« Reply #5 on: March 23, 2019, 07:43:47 AM »

Remember that the lost LCP pings could be the replies rather than the queries, so even if A&A prioritise their pings outbound to you, the replies could be lost if in turn your gear doesn't prioritise the replies.  Or as mentioned if there's a traffic shaping mismatch.   Not being totally familiar with all the elements of PPP, I'm assuming the LCP ping is replied to by your firewall, rather than by the modem.  Is that correct?

The config you quote (lcp-rate="1" lcp-timeout="5") looks like it applies to queries originating from your firewall, presumably to monitor whether a particular connection is up.  I wouldn't expect to see a setting at your end affecting timeouts when replying to pings; all your box can do is decide whether to reply or not, and it's up to the originator (A&A) to keep track of RTT and success rate.
Logged

Weaver

  • Addicted Kitizen
  • *****
  • Posts: 7460
  • Retd sw dev; A&A; 4 × 7km ADSL2; IPv6; Firebrick
Re: Packet loss and AA CQM
« Reply #6 on: March 25, 2019, 10:25:16 AM »

I was not talking about ICMP pings here.

The PPP LCP ‘pings’ are being replied to by my Firebrick firewall-router.

That’s a good point, replies could be getting corrupted if anything is.
Logged

DiggerOfHoles

  • Member
  • **
  • Posts: 64
Re: Packet loss and AA CQM
« Reply #7 on: May 06, 2019, 05:34:56 PM »

I was not talking about ICMP pings here.

The PPP LCP ‘pings’ are being replied to by my Firebrick firewall-router.

That’s a good point, replies could be getting corrupted if anything is.

WTF Weaver you're talking about the same thing I am. PPP 23 years out of date. Are you 23?

Has anyone actually read https://www.btplc.com/SINet/SINs/pdf/472v2p9.pdf

If your ISP uses WBC then this is what you are actually talking about

If AA use non WBC product then accept my apologies.

The PPP LCP ‘pings’ you speak of are what keeps your session alive. Note NOT xDSL/DSLx session.

PPP layer 2 of the OSI model google it.
Logged

burakkucat

  • Global Moderator
  • Senior Kitizen
  • *
  • Posts: 26941
  • Over the Rainbow Bridge
    • The ELRepo Project
Re: Packet loss and AA CQM
« Reply #8 on: May 06, 2019, 06:18:49 PM »

<Cough!> (Pesky fur-ball.)

Please perform a little research on the subject of A&A, the ISP, and the service they provide to all their end users.
Logged
:cat:  100% Linux and, previously, Unix. Co-founder of the ELRepo Project.


DiggerOfHoles

  • Member
  • **
  • Posts: 64
Re: Packet loss and AA CQM
« Reply #9 on: May 06, 2019, 06:36:17 PM »

So every ISP uses PPP LCP echo. That's how the technology works.

The hype, the service you refer me to, is exactly that.

If you don't understand the hype then don't believe it to be true.

Marketing techno babble tosh!
Logged

kitz

  • Administrator
  • Senior Kitizen
  • *
  • Posts: 32112
  • Trinity: Most guys do.
    • http://www.kitz.co.uk
Re: Packet loss and AA CQM
« Reply #10 on: May 07, 2019, 10:57:44 AM »

>>> So every ISP uses PPP LCP echo.

The graphs and additional monitoring tools are provided exclusively by AAISP for their EUs (end users), and have mostly been developed and coded by RevK.  I'm not aware of any other UK ISP that makes anything like this available for their EUs.
They provide a suite of additional monitoring tools including the CQM Graphs which use LCP echo.  AAISP even allow the EUs to run line tests or reset the BTw adsl DLM themselves using the custom CP.
Logged
Please do not PM me with queries for broadband help as I may not be able to respond.
-----
How to get your router line stats :: ADSL Exchange Checker

DiggerOfHoles

  • Member
  • **
  • Posts: 64
Re: Packet loss and AA CQM
« Reply #11 on: May 08, 2019, 09:51:05 AM »

If you make a PPP connection to your ISP, is this not monitored using LCP PPP echo?

n echoes not replied to, close PPP session, attempt to start a new one?
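
That keepalive logic can be sketched as a simple counter of consecutive misses. This is only an illustration of the idea; the class name `EchoMonitor` and the parameter `max_missed` are made up, and it is not taken from any particular router or LNS implementation.

```python
class EchoMonitor:
    """Counts consecutive unanswered LCP echo requests."""

    def __init__(self, max_missed: int):
        self.max_missed = max_missed  # the "n" in the question above
        self.missed = 0
        self.up = True

    def tick(self, reply_received: bool) -> bool:
        """Record one echo interval; return whether the session is still up."""
        if not self.up:
            return False  # already closed; a new session would start afresh
        if reply_received:
            self.missed = 0  # any reply resets the count
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.up = False  # n in a row missed: close the PPP session
        return self.up


monitor = EchoMonitor(max_missed=3)
outcomes = [True, False, False, True, False, False, False, True]
states = [monitor.tick(ok) for ok in outcomes]
print(states)
```

Two misses followed by a reply do no harm here; only an unbroken run of three takes the session down.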

Timer T203 referred to in ITU spec?

I suspect RevK may be using a spare/empty attribute in the echo frame to convey info from your router back to AA, to be collated and fed back to the user?

Fair point?

As for AA they sound like my kind of ISP. I'll check them out further. Thanks.
Logged

DiggerOfHoles

  • Member
  • **
  • Posts: 64
Re: Packet loss and AA CQM
« Reply #12 on: May 08, 2019, 10:32:44 AM »

Well some drooling and a tightening of the wallet later..

AAISP too expensive for me.

Firebrick manual fascinating. Data stored automatically at LNS/radius. Sorry RevK.

Anyway, if all LCP PPP echoes get dumped somewhere, will this bring down the PPP session?

Esp. if one is being sent every second. Would WBC not filter/dump some of these at times of congestion?

Logged

dee.jay

  • Reg Member
  • ***
  • Posts: 267
Re: Packet loss and AA CQM
« Reply #13 on: May 08, 2019, 01:13:40 PM »

AAISP too expensive for me.

You get what you pay for :)
Logged
Sky FTTC 69/17 @ 3dB SNR - HG612
AAISP FTTC 67/16 @ 3dB SNR - HG612
Routed by pfSense on VMware ESX 6.0
~600M Huawei cabinet