I will definitely reduce the auto-percent value and retest. And I need to do the quiet line test as advised by meritez.
Line 1 has been coming up with errors: sporadic packet loss, as well as the aforementioned heavy packet loss whenever there is a flat-out upload. The latter may actually be OK (or not) if it's simply a necessary part of the workings of TCP, but packet loss when there's no flat-out transfer is of course not OK. I have also very occasionally had a notification "line down - line up (7 s)", a period so short that it can't really be a case of the modem dropping sync at all: nothing actually went into the 'down' state, and a modem couldn't possibly reconnect that quickly; a resync takes ~70-80 secs iirc. Despite these 7-second 'down' events, I am not seeing any ES in the modem stats. It seems to me that enough data is occasionally getting corrupted that n successive LCP echo replies are being lost, or alternatively the downstream is bad and the incoming LCP echo requests are being corrupted, but I would put my money on the former.
The downstream side has a 6 dB target SNRM and layer-2 retransmission (L2ReTX, in the form of Broadcom PhyR, similar to G.INP), so it is very heavily protected against corruption; the upstream side has no L2ReTX, so if anything is getting corrupted it would most likely be on the upstream side anyway. Upstream was also set to a 6 dB target SNRM, but in view of the bogus 7-second reports I am increasing line 1's upstream SNRM target from 6 dB to 9 dB, which should help. It's a real shame there's no upstream L2ReTX, because it would speed up the upstream and/or make it vastly more reliable.
The 7-second duration could be because I have set the Firebrick's upstream line probing interval to 6 s: it sends out regular PPP LCP echo request probes to check that a line is really up and truly working. (In the opposite direction, AA's servers send downstream PPP LCP echo requests as well, for a similar reason; their success or failure to get a reply also shows up on the clueless.aa.net.uk CQM graphs as bright red dots at the top of the graph, so-called "dripping blood", meaning 'failure'.) Coming back to the Firebrick: if a certain number of upstream echo requests get no replies (I forget the exact details) then the Firebrick puts the link into the down state, and where bonding is in use it takes that link out of the bonded set so its share of upstream traffic can go to the other links rather than being sent into a non-functioning link to go nowhere. I have set the timings to be as short as possible without triggering bogus reports when everything is working 100%, because I want bad links switched out of the bonded set as quickly as possible so the other links can take over. I can only set it that short because I have multiple lines; with a single line I would want a longer timeout, so that the line doesn't go into the down state but hangs on until things improve.
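The keepalive-and-bonding behaviour described above can be sketched roughly as a small state machine. This is only an illustrative sketch, not Firebrick firmware or its actual configuration: the class names, the miss threshold of 3, and the reset-on-first-reply behaviour are all my own assumptions.

```python
# Illustrative sketch (my assumptions, not Firebrick's actual logic):
# a per-link LCP-style keepalive that drops a link out of a bonded set
# after N consecutive missed echo replies, and restores it when replies resume.

class Link:
    def __init__(self, name, miss_threshold=3):
        self.name = name
        self.miss_threshold = miss_threshold  # consecutive misses before "down"
        self.missed = 0
        self.up = True

    def on_echo_reply(self):
        # A reply arrived in time: reset the miss counter and restore the link.
        self.missed = 0
        self.up = True

    def on_echo_timeout(self):
        # No reply within the probe interval: count the miss;
        # declare the link down once the threshold is reached.
        self.missed += 1
        if self.missed >= self.miss_threshold:
            self.up = False


class BondedSet:
    def __init__(self, links):
        self.links = links

    def active(self):
        # Upstream traffic is shared only across links currently considered up.
        return [l for l in self.links if l.up]


# Example: with a 6 s probe interval and a threshold of 3, a dead line is
# ejected within ~18 s worst case; one good reply brings it back.
bond = BondedSet([Link("line1"), Link("line2")])
for _ in range(3):
    bond.links[0].on_echo_timeout()
print([l.name for l in bond.active()])  # ['line2']
bond.links[0].on_echo_reply()
print([l.name for l in bond.active()])  # ['line1', 'line2']
```

The trade-off in the text falls out of the two parameters: a short interval and low threshold eject a bad line quickly (good when other lines can take over), but make bogus 'down' reports more likely on a single line.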
If you look at the top of the graph very carefully, you will see the occasional faint bright red single pixel in the yellow; that is the sporadic packet loss problem. Where there is heavy activity you will also see distinct larger patches of dripping blood, and I think this happens only when there is upstream traffic activity (key on the right: dark red is upstream traffic, darkish green is downstream). I can also separate these out into multiple graphs, each showing traffic in one direction only, or showing latency only, so that the display is not as crowded as it is in this combined graph.
I’m supposed to be able to click on the display with my non-existent mouse to get a reading of all the graph data values at that point, but due to the iPad-unfriendly nature of the design, tapping on the graph does bring up this data but also switches to a new page showing that line only, rather than the current ‘all lines overview’ page, so I immediately lose sight of the data it has just displayed. AA really needs to sort out that website for modern, device-agnostic behaviour, given the number of mouseless machines out there now.
The dark red upstream transfers are due to machines backing up their state to Apple iCloud. On the right there’s a white marker, shaped a bit like a balloon, which marks the resync when I changed the upstream target from 6 dB to 9 dB. That caused line 1’s upstream sync rate to drop quite a bit, but it’s still very fast. Line 1’s upstream sync rate was definitely the fastest before, and I suspect it was just running too hot.
It seems to me that without upstream L2ReTX, a 6 dB upstream target SNRM is only just adequate if you want a really low error rate, and perhaps 9 dB is always required for total upstream reliability. If a user only ever uses TCP then total reliability is arguably not a sensible goal, as you lose so much speed trying to achieve it and TCP fixes any problems anyway. But even a supposed ‘total TCP’ user will use other protocols, such as DNS lookups, which may not run over TCP. And now that I’m using Zoom all the time, a non-TCP application afaik, I’m suddenly no longer in the all-TCP user category: I don’t think I can really use a 3 dB downstream target SNRM any more, and I may also have to look closely at upstream reliability and review whether 6 dB upstream is enough in every case.