[Apologies in advance for this post which has grown into a monster. I am trying to understand what I am seeing here, I am assuming I have a problem, and I may have misunderstood many things.]
Below I have a piece of a screenshot taken from my ISP AA’s control and monitoring server, clueless.aa.net.uk. There’s a poor-quality JPEG preview image at the end of this monster post, but if you click on the image you can get the full-quality version. It shows traffic on my four bonded DSL lines. The green line indicates downstream traffic. Dark red is upstream traffic.
Take a look at the section from 20:30 to 21:10 (very approximately). The raised green line over that period shows that there is a modest level of downstream traffic on all four links. Now, for the first link, at the very top of the yellow background area, notice that at 21:03 there are a few bright red pixels, occasionally clustered into the beginnings of tiny stalactites. They are easiest to see in the case of the first link, but a few such bright red pixels can be seen in every one of the per-line sections.
* The problem: in that 20:30 to 21:10 period, the tiny sprinkling of bright red pixels at, for example, line 1 around 21:00 indicates packet loss, and at that point there was a ‘packet loss’ figure of 2%.
(This means that as compared with the number of test ping-type packets sent - PPP LCP echo request packets - to test the link, 2% of the expected responses were not received during some short interval around that sampling time.)
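To make the 2% figure concrete, here is a minimal sketch of how such a loss percentage falls out of counts of LCP echo requests and replies. The function name and the counts (100 sent, 98 answered) are made up for illustration; the actual sampling interval and sample size used by the monitoring server are not known here.

```python
def loss_percent(echoes_sent: int, replies_received: int) -> float:
    """Percentage of LCP echo requests that got no reply in one sampling interval."""
    if echoes_sent <= 0:
        raise ValueError("need at least one echo request")
    missing = echoes_sent - replies_received
    return 100.0 * missing / echoes_sent


# Hypothetical counts: 100 echo requests sent, 98 replies seen.
print(loss_percent(100, 98))  # 2.0
```

Note that with a small number of test packets per interval, a single missing reply already shows up as a visible percentage.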
This downstream data rate is far too low to saturate the pipes, so overloading the link cannot be blamed.
So why is this happening?
Comparison situation: Over on the left, around 03:02, we can see a green line, high up, that shows a download in progress for a couple of hours. It is occasionally flat-out, maxing out as at 03:02, and at other times it seems to be following some medium-term flat, set rate. That was a supposedly flat-out download of video from Amazon - as opposed to live streaming. Occasionally, as in the example of 03:02, the downstream traffic does appear to be saturating the links: it hits the high horizontal solid black line that shows the downstream rate limit (which is immediately above a dotted line for 2.0 Mbps downstream).
* Despite the fact that one might reasonably imagine there could be some chance of fully loading the downstream links at some point during that period, I cannot see any packet loss during that download period, 02:30 to 04:30.
So:
- Is this a real issue or a red herring?
- What are the possible causes?
- Is data corruption, causing ‘real packet’ loss, a realistic possibility? (That is, corruption of PPP frames containing IP data with a user data payload, not just test pings or responses.)
The downstream SNRM was really quite low on some of the links, e.g. 1.9dB, when I noted the stats on Thursday evening at around 21:30.
* Line 4 modem: In the case of line 4, somehow the upstream SNRM was in fact only ~2.4dB, rather than the expected 6dB upstream target. That line was not showing more packet loss red pixels or worse loss figures though. Looking at the upstream stats, it had ES≈44, CRC≈80 total over the preceding 21-hour period from midnight, and roughly the same over the 24-hour day before that. Is that good or bad?
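To put those counts on a comparable footing, they can be turned into rates. The sketch below uses the upstream figures from the "Latest 1 day" block of the line 4 stats (CRC 85, ES 44 over 21 hours 18 min 4 sec); the function name is made up for illustration, and the result says nothing by itself about what threshold counts as good or bad.

```python
def error_rates(crc: int, es: int, hours: int, minutes: int, seconds: int):
    """Return (CRC errors per hour, percentage of seconds that were errored)."""
    elapsed = hours * 3600 + minutes * 60 + seconds  # total seconds in the period
    crc_per_hour = crc * 3600 / elapsed
    es_percent = 100.0 * es / elapsed
    return crc_per_hour, es_percent


# Upstream figures from the "Latest 1 day" block of the line 4 stats.
crc_per_hour, es_percent = error_rates(crc=85, es=44, hours=21, minutes=18, seconds=4)
print(f"{crc_per_hour:.2f} CRC errors per hour")
print(f"{es_percent:.3f}% of seconds errored")
```

That works out to about four upstream CRC errors an hour, with well under a tenth of a percent of seconds errored.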
Line 4 modem stats:
xdslctl: ADSL driver and PHY status
Status: Showtime
Last Retrain Reason: 8000
Last initialization procedure status: 0
Max: Upstream rate = 406 Kbps, Downstream rate = 3364 Kbps
Bearer: 0, Upstream rate = 550 Kbps, Downstream rate = 3220 Kbps
Link Power State: L0
Mode: ADSL2 Annex A
TPS-TC: ATM Mode(0x0)
Trellis: U:ON /D:ON
Line Status: No Defect
Training Status: Showtime
Down Up
SNR (dB): 2.3 2.4
Attn(dB): 63.5 39.9
Pwr(dBm): 18.4 12.4
ADSL2 framing
Bearer 0
MSGc: 52 12
B: 37 62
M: 4 1
T: 3 1
R: 10 14
S: 1.4879 3.6023
L: 871 171
D: 2 8
Counters
Bearer 0
SF: 232851021 11248
SFErr: 1469 2663
RS: 1539084850 4161783
RSCorr: 3511040 3606794
RSUnCorr: 17400 0
ReXmt: 116232 0
ReXmtCorr: 104619 0
ReXmtUnCorr: 17734 0
Bearer 0
HEC: 11455 2785
OCD: 561 0
LCD: 561 0
Total Cells: 3024582802 629535452
Data Cells: 1827922763 213965480
Drop Cells: 0
Bit Errors: 262477 366891
ES: 563 1514
SES: 5 0
UAS: 51 51
AS: 3791833
Bearer 0
INP: 26.00 2.50
INPRein: 0.00 0.00
delay: 8 7
PER: 16.18 16.21
OR: 28.67 8.88
AgR: 3234.79 557.45
Bitswap: 761778/775012 31231/31231
Total time = 43 days 21 hours 18 min 4 sec
FEC: 3511040 3606794
CRC: 1469 2663
ES: 563 1514
SES: 5 0
UAS: 51 51
LOS: 0 0
LOF: 0 0
LOM: 1 0
Latest 15 minutes time = 3 min 4 sec
FEC: 395 346
CRC: 0 0
ES: 0 0
SES: 0 0
UAS: 0 0
LOS: 0 0
LOF: 0 0
LOM: 0 0
Previous 15 minutes time = 15 min 0 sec
FEC: 2018 1818
CRC: 0 0
ES: 0 0
SES: 0 0
UAS: 0 0
LOS: 0 0
LOF: 0 0
LOM: 0 0
Latest 1 day time = 21 hours 18 min 4 sec
FEC: 80272 170427
CRC: 0 85
ES: 0 44
SES: 0 0
UAS: 0 0
LOS: 0 0
LOF: 0 0
LOM: 0 0
Previous 1 day time = 24 hours 0 sec
FEC: 56188 165286
CRC: 1 83
ES: 1 44
SES: 0 0
UAS: 0 0
LOS: 0 0
LOF: 0 0
LOM: 0 0
Since Link time = 43 days 21 hours 17 min 13 sec
FEC: 3511040 3606794
CRC: 1469 2663
ES: 563 1514
SES: 5 0
UAS: 0 0
LOS: 0 0
LOF: 0 0
LOM: 1 0
NTR: mipsCntAtNtr=0 ncoCntAtNtr=0
* PhyR upstream? - I am thinking there’s no upstream L2 retx ‘PhyR’ in operation, for some reason, only downstream? Is that right? If so, then that’s really annoying. If there’s a data corruption problem, is that where it is lurking, in corruption of upstream data because there’s no L2 retx mechanism to compensate for the ridiculously low upstream SNRM?
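One way to look at this is through the ReXmt counters in the line 4 stats, where the second column is zero on every retransmission line. The sketch below pulls those pairs out of the dump text. It assumes the two columns are Down then Up, as in the SNR header of the dump, and `parse_pairs` is a hypothetical helper, not part of any modem tool. All-zero upstream counters are consistent with no upstream retransmission being active, but they do not prove it.

```python
import re


def parse_pairs(text: str) -> dict:
    """Map each 'Label: down up' line to a (down, up) tuple of ints.

    Only the first occurrence of a label is kept, since labels such as
    FEC and CRC repeat in later time-window blocks of the dump.
    """
    pairs = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\w+):\s+(\d+)\s+(\d+)\s*$", line)
        if m:
            pairs.setdefault(m.group(1), (int(m.group(2)), int(m.group(3))))
    return pairs


# Lines copied from the line 4 counters section.
sample = """\
RSUnCorr: 17400 0
ReXmt: 116232 0
ReXmtCorr: 104619 0
ReXmtUnCorr: 17734 0
"""
counters = parse_pairs(sample)
upstream_retx_idle = all(
    counters[name][1] == 0 for name in ("ReXmt", "ReXmtCorr", "ReXmtUnCorr")
)
print(upstream_retx_idle)  # True
```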
* An idea for a fix? Resync everything? - I thought anyway that I would put the modems back to their target SNRMs by telling each one to reboot itself. When I restarted the modem on line 2, it dropped a couple of hundred kbps downstream, which says to me that the noise level happened to be much worse than it was when it previously synced up. I don’t know if that was just due to the choice of time of day (late on Thursday evening). It could be that the line 2 modem had been struggling a lot with greatly worsened noise conditions and a downstream SNRM which had ended up drooping way too low. I wish I had noted everything down from that modem before I stupidly told it to reboot.
* So:
Have I found and fixed a problem? If the red pixels are a real, significant problem, then is this data corruption at work? Am I suffering real problems due to running with SNRM targets that are too low, too aggressive?
Given the weird upstream SNRM diurnal variation - discussed in earlier threads - that is still a problem on some lines, am I getting screwed by upstream data corruption? The daily cycle means I will need either a ridiculously huge upstream target SNRM or some intelligence about when and how to resync, just to ensure that the initial upstream SNRM is large enough to remain adequate through the later, bad half of the day.