I will definitely reduce the auto-percent value and retest. And I need to do the quiet line test as advised by meritez.
Line 1 has been coming up with errors: sporadic packet loss, as well as the aforementioned heavy packet loss whenever there is a flat-out upload. The latter may actually be OK (or not) if it's simply a necessary part of the workings of TCP, but packet loss when there's no flat-out transfer is of course not OK. I have also very occasionally had a notification "line down - line up (7 s)", a period so short that it can't really be a case of the modem dropping sync at all: nothing actually went into the 'down' state, and a modem couldn't possibly reconnect that quickly; a resync takes ~70-80 secs iirc. Despite these 7-second 'down' events, I am not seeing any ES in the modem stats. It seems to me that enough data is occasionally getting corrupted that n successive LCP echo replies are being lost, or alternatively the downstream is bad and the incoming LCP echo requests are being corrupted, but I would put my money on the former.
The downstream side has a 6 dB target SNRM and layer-2 retransmission (L2ReTX, in the form of Broadcom PhyR, similar to G.INP), so it is very heavily protected against corruption; the upstream side has no L2ReTX, so if anything is getting corrupted it would most likely be on the upstream side anyway. Upstream was also set to a 6 dB target SNRM, but in view of the bogus 7-second reports I am increasing line 1's upstream SNRM target from 6 dB to 9 dB, which should help. It's a real shame there's no upstream L2ReTX, because it would speed up the upstream and/or make it vastly more reliable.
The 7-second duration could be because I have set the Firebrick's upstream line probing interval to 6 s: it sends out regular PPP LCP echo request probes to check that a line is really up and truly working. (In the opposite direction, AA's servers send downstream PPP LCP echo requests as well, for a similar reason; their success or failure to get a reply also shows up on the clueless.aa.net.uk CQM graphs as bright red dots at the top of the graph, so-called "dripping blood", meaning 'failure'.) Coming back to the Firebrick: if a certain number of upstream echo requests get no replies (I forget the exact details) then the Firebrick puts the link into the down state, and where bonding is in use it takes that link out of the bonded set so its share of upstream traffic can go to the other links rather than being sent into a non-functioning link to go nowhere. I have set the timings to be as short as possible without triggering bogus reports when everything is working 100%, because I want bad links switched out of the bonded set as quickly as possible so the other links can take over. I can only set it that short because I have multiple lines; with a single line I would want a longer timeout, so that the line doesn't go into the down state but hangs on until things improve.
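The keepalive-and-bonding behaviour described above can be sketched roughly as a small state machine. This is only an illustrative sketch, not Firebrick firmware or its actual configuration: the class names, the miss threshold of 3, and the reset-on-first-reply behaviour are all my own assumptions.

```python
# Illustrative sketch (my assumptions, not Firebrick's actual logic):
# a per-link LCP-style keepalive that drops a link out of a bonded set
# after N consecutive missed echo replies, and restores it when replies resume.

class Link:
    def __init__(self, name, miss_threshold=3):
        self.name = name
        self.miss_threshold = miss_threshold  # consecutive misses before "down"
        self.missed = 0
        self.up = True

    def on_echo_reply(self):
        # A reply arrived in time: reset the miss counter and restore the link.
        self.missed = 0
        self.up = True

    def on_echo_timeout(self):
        # No reply within the probe interval: count the miss;
        # declare the link down once the threshold is reached.
        self.missed += 1
        if self.missed >= self.miss_threshold:
            self.up = False


class BondedSet:
    def __init__(self, links):
        self.links = links

    def active(self):
        # Upstream traffic is shared only across links currently considered up.
        return [l for l in self.links if l.up]


# Example: with a 6 s probe interval and a threshold of 3, a dead line is
# ejected within ~18 s worst case; one good reply brings it back.
bond = BondedSet([Link("line1"), Link("line2")])
for _ in range(3):
    bond.links[0].on_echo_timeout()
print([l.name for l in bond.active()])  # ['line2']
bond.links[0].on_echo_reply()
print([l.name for l in bond.active()])  # ['line1', 'line2']
```

The trade-off in the text falls out of the two parameters: a short interval and low threshold eject a bad line quickly (good when other lines can take over), but make bogus 'down' reports more likely on a single line.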
If you look at the top of the graph very carefully, you will see the occasional faint bright red single pixel in the yellow; that is the sporadic packet loss problem. Where there is heavy activity you will also see distinct larger patches of dripping blood, and I think this happens only when there is upstream traffic activity (key on the right: dark red is upstream traffic, darkish green is downstream). I can also separate these out into multiple graphs, each showing traffic in one direction only, or showing latency only, so that the display is not as crowded as it is in this combined graph.
I’m supposed to be able to click on the display with my non-existent mouse to get a reading of all the graph data values at that point, but due to the iPad-unfriendly nature of the design, tapping on the graph does bring up this data but also switches to a new page showing that line only, rather than the current ‘all lines overview’ page, so I immediately lose sight of the data it has just displayed. AA really needs to sort out that website for modern, device-agnostic behaviour, given the number of mouseless machines out there now.
The dark red upstream transfers are due to machines backing up their state to Apple iCloud. On the right there’s a white marker, shaped a bit like a balloon, which marks the resync when I changed the upstream target from 6 dB to 9 dB. That caused line 1’s upstream sync rate to drop quite a bit, but it’s still very fast. Line 1’s upstream sync rate was definitely the fastest before, and I suspect it was just running too hot.
It seems to me that without upstream L2ReTX, a 6 dB upstream target SNRM is only just adequate if you want a really low error rate, and perhaps 9 dB is always required for total upstream reliability. If a user only ever uses TCP then total reliability is arguably not a sensible goal, as you lose so much speed trying to achieve it and TCP fixes any problems anyway. But even a supposed ‘total TCP’ user will use other protocols, such as DNS lookups, which may not run over TCP. And now that I’m using Zoom all the time, a non-TCP application afaik, I’m suddenly no longer in the all-TCP user category: I don’t think I can really use a 3 dB downstream target SNRM any more, and I may also have to look closely at upstream reliability and review whether 6 dB upstream is enough in every case.