Sum up - thoughts on AA, CQM, SNRM and stuff
I am very much taking on board what EJS said. I am still thinking about it and am a bit befuddled concerning the statistical definitions, though we completely agree on the real-world picture of error occurrences. I am a bit surprised at the revelation that HEC-mechanism single-bit errors are not corrected, and need to think some more about that.
I'm getting reported downstream HEC errors from one of my three AA-supplied Dlink DSL-320B-Z1 modems at what seems to be a modest rate but is actually scary when you do the maths. Upstream is fine. The second modem is 10 times better (lower) per cell and at one stage the third was 100 times better per cell.
The 'bad' modem gives downstream 0.3% HEC errors per cell.
I don't know if the modem’s reporting is genuine. If the figure for the total number of cells is bogus then my fraction is bogus too, but then why the factor of 10 and 100 difference between modems? That would seem to indicate that something odd is going on even if the modem’s figures are not right: either one modem is duff, or line #1 is bad, or the modem 1 settings are wrong (insufficient FEC for some reason, for example, or the SNRM wrong).
Is the modem mad? Or am I? The denominator problem.
My derivation of the figure of 0.3% is HEC_errors / xx_cells. What is xx_cells? Unfortunately the modem quotes two user-facing numbers, one called ‘total cells’ and one called ‘data cells’. I have no idea what these two mean. Maybe you would expect a true total cells number and perhaps a count for true data-bearing cells or idle cells or all three, but in any case the true total would obviously be the largest and would be the sum of the other two.
modem 1
'total' 1576618
'data' 549514
modem 2
'total' 1419050
'data' 14950880
The modems have not had comparable uptimes here, as modem 1 was resynched recently. Even so, the two numbers in a pair belonging to one modem do not have values in a constant relationship when compared with the other modem: on modem 2 the 'data' figure is more than ten times the 'total'.
Because I have no idea what is going on, I just chose the max of the two values for the denominator xx_cells. An alternative would have been to add the two together.
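To see how much the choice of denominator matters, here is a minimal sketch that computes both candidates (max and sum) from the cell counters quoted above. The function name is just something I made up for illustration, and no HEC error counts are included because I have not quoted them here.

```python
# Cell counters as quoted by the two modems above.
counters = {
    "modem 1": {"total": 1576618, "data": 549514},
    "modem 2": {"total": 1419050, "data": 14950880},
}

def candidate_denominators(total, data):
    """Return the two candidate values for xx_cells: max and sum."""
    return {"max": max(total, data), "sum": total + data}

for name, c in counters.items():
    d = candidate_denominators(c["total"], c["data"])
    # sum/max shows how far apart the two choices are for this modem
    print(name, d, "sum/max =", round(d["sum"] / d["max"], 3))
```

Since the sum can never be more than twice the max, switching between the two choices changes the HEC ratio by less than a factor of two, so it cannot by itself account for a factor of 10 or 100 between modems.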
I have not yet worked out how to cross-check the believability of the modem’s HEC ratio numbers by comparing them with the rate of RS uncorrected errors per (x) bits of data. The only other thing I have to go on is the downstream superframe count, and I don't know how to convert that into n bits without discovering some more parameter values and reading the standards docs again.
Anyway.
Hypothetically, let's suppose the downstream HEC errors per cell rate of 0.3% obtained as described above is genuine. Then we have one line that is ten times worse than line 2. What does this error rate mean in practice?
I am using the following equation 1 - ( 1 - p )**32 (where ** denotes exponentiation and p is the per-cell HEC error probability, here 0.003) to obtain the probability of a corrupt 1500 byte IP PDU (=32 ATM cells). This assumes that cell errors are independent of one another. It comes out to about 9% (ie per IP PDU on that line). So on the one line, TCP will need to retx about one in every 11 max-length packets, which is not good at all. I have three lines sharing the traffic, so overall it will be about 1 in 33 packets, and that will cover the bad line up.
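The arithmetic can be checked with a short sketch. It assumes, as the equation does, that cell errors are independent, that a packet is lost if any one of its cells has a HEC error, and (for the three-line figure) that traffic is split evenly and the other two lines contribute negligible errors. The function names are mine.

```python
def packet_corruption_probability(p_cell, n_cells):
    """Probability that at least one of n_cells cells is hit,
    assuming independent per-cell error probability p_cell."""
    return 1.0 - (1.0 - p_cell) ** n_cells

def diluted_rate(bad_line_rate, n_lines):
    """Overall rate if traffic is spread evenly over n_lines
    and only one of them is bad (an assumption)."""
    return bad_line_rate / n_lines

p = 0.003   # 0.3% HEC errors per cell, from modem 1
n = 32      # cells per 1500-byte IP PDU, as used above

bad = packet_corruption_probability(p, n)
overall = diluted_rate(bad, 3)

print(f"bad line: {bad:.2%}, about 1 in {1 / bad:.0f} packets")
print(f"three lines: {overall:.2%}, about 1 in {1 / overall:.0f} packets")
```

If errors actually come in bursts rather than independently, several bad cells would tend to land in the same packet, and the per-packet figure would be lower than this estimate.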
Could someone tell me if my maths is completely wrong? (We do not have to believe the modem, especially given that the labels on the figures make absolutely no sense, they can't possibly be ‘totals’, but I'd just like some help with a sanity check.)
If I am right then a HEC error rate that does not sound too scary turns into a big deal on a flat-out TCP download, where 9% of your packets are corrupted.
One other possibility. It could be that the modem is reporting all cells that have errors even if these errors are successfully corrected by the HEC single-bit correction mechanism. There are not two figures labelled ‘HEC’, uncorrectable and corrected.
I asked AA about this, hoping for a sanity check, and asked about the 9% corrupt TCP rate. I didn't get any help at all; staff just refused to be drawn in. I explained that I wasn’t trying to assert that it’s a line issue, and was aware that the whole thing could be bogus. If the modems are not lying, modem 1 is not broken and the settings on line 1 are not off, then line 1 just happens to be a lot more noisy than the other two lines, and that's just what happens sometimes. If the numbers and my maths are not bogus, though, then I would say that it is AA’s problem, or ought to be, because they also sold me the modem and you would hope that they want to deliver a working service. That includes non-corrupted PPP frames, and non-corrupted IP packets for traffic not relying on TCP (what about UDP-only protocols?).
I need to:
1. Swap out modem 1 in case it is broken.
2. Do a test download and capture it in order to try and spot a high rate of TCP retx. Unfortunately I don't have the tools myself, but no matter: clueless can capture it, so clueless saves the day.
3. Look at the SNRM on line 1 (again).
SNRM and target SNRM
All lines had a 3dB downstream target SNRM. I was expecting this to cause problems of exactly this type, which is why I took a look at the modems’ reported numbers. I increased the d/s target on line 1 from 3dB to the usual 6dB. Weirdly this never seems to do much: the actual SNRM was 2.3dB after a resynch soon after the change, and the sync rate did not change. This is very odd and I don't understand it. Changing the target again to 9dB chopped 20% off the sync rate, a ‘cure’ that is worse than the disease and probably not a sensible option, since sustaining a slightly higher actual SNRM might be all that is needed. But getting the actual SNRM rather nearer to 6dB, or even a bit less, instead of the status quo of a varying 0.6-2.5dB, does not seem to be happening.
What real current SNRM do other users see when they have chosen a 6dB d/s target with one of these modems? Is it something weird about the modem model, a naughty margin tweak? That doesn't explain why the lines are so different though, because the SNRM values reported are about the same, varying quite a lot.
Packet loss, CQM and 'pings'
AA staff asked me if I was seeing any packet loss and I said no. They then lost all interest, but this doesn't seem reasonable to me.
It seems to me, once you have done the maths and the to-the-power-32 thing, that a CQM PPP ping loss rate of 0.3% (because one PPP ping equals one ATM cell) isn't going to show up much in clueless. It is not going to show you that you actually have a really bad problem, a 9% per-packet corruption rate for full-size IP PDUs. And I have three lines, which helps to cover up the bad problem that TCP has to deal with. AA needs to successfully deliver 1508-byte PPP frames, not just short PPP pings. VoIP won't show it so much either.
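To illustrate how the same per-cell error rate looks for different packet sizes, here is a rough sketch. The 10-byte per-packet overhead (2 bytes of PPP plus the 8-byte AAL5 trailer) and the 200-byte VoIP packet size are my own assumptions for illustration, not measured values, and cell errors are again assumed independent.

```python
import math

CELL_PAYLOAD = 48  # payload bytes carried by one ATM cell

def cells_for(ip_bytes, overhead=10):
    """Cells needed for one packet; overhead is an assumed
    2 bytes of PPP plus the 8-byte AAL5 trailer."""
    return math.ceil((ip_bytes + overhead) / CELL_PAYLOAD)

def loss_probability(p_cell, n_cells):
    """Chance that at least one cell of the packet has an error."""
    return 1.0 - (1.0 - p_cell) ** n_cells

p = 0.003  # per-cell HEC error rate from modem 1

# Sizes in bytes; the ping and VoIP sizes are illustrative guesses.
examples = {"PPP ping": 8, "VoIP packet": 200, "full-size IP PDU": 1500}

for label, size in examples.items():
    n = cells_for(size)
    print(f"{label}: {n} cells, loss {loss_probability(p, n):.2%}")
```

With these assumptions the ping sees roughly 0.3%, the VoIP-sized packet about 1.5% and the full-size packet about 9%, which is why a ping-based measure understates the problem for bulk TCP.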
So it seems to me that we have something beyond what CQM as we have it currently can ‘see’, and we could put too much trust in it. (Like radiation monitoring equipment that could not see gamma rays.)
If someone wanted to do something very fancy, an L4-snoop function in an FB6000 could count TCP retxs and give a packet loss rate counter, which might be interesting. It could even have an optional and configurable alarm on it if the value went weird, way out of the established ‘learned’ norm of that link or outside some global sanity limit. There are drawbacks. It would mean more load on an FB6000 CPU. I'm aware that VPNs would make it impossible. Not everything is TCP, so you wouldn't be covering other L4 protocols without additional L4-snooper modules that spoke certain real-time UDP protocols or, for example, SCTP. But so what - you would just do what you could when you could, and if some people couldn't get the benefit sometimes then so be it. It's an enhancement, not an essential. Also, this would not specifically detect corruption: it would pick up a lot of natural packet loss that is essential in the internet as part of congestion-related behaviour and even congestion control. And multiple-line users such as myself would see problems like this diluted down - by a factor of three in my case - so this kind of corruption would be that bit harder to detect by L4 snoop.
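As a very rough sketch of the idea, and not a description of anything the FB6000 actually does, a retx counter could track the highest sequence number seen per flow and count data segments that start below it. The input format here, a list of (flow_id, seq, payload_len) tuples in capture order, is hypothetical. The sketch ignores sequence number wraparound and cannot tell a retransmission from simple reordering.

```python
from collections import defaultdict

def count_retransmissions(segments):
    """segments: iterable of (flow_id, seq, payload_len) in capture order.
    Returns per-flow counts of data segments and suspected retransmissions."""
    highest_end = {}  # flow_id -> highest (seq + payload_len) seen so far
    stats = defaultdict(lambda: {"segments": 0, "retx": 0})
    for flow_id, seq, payload_len in segments:
        if payload_len == 0:
            continue  # pure ACKs carry no data, so skip them
        s = stats[flow_id]
        s["segments"] += 1
        # Data starting below what we have already seen is old data again.
        if flow_id in highest_end and seq < highest_end[flow_id]:
            s["retx"] += 1
        end = seq + payload_len
        highest_end[flow_id] = max(highest_end.get(flow_id, 0), end)
    return dict(stats)

# Made-up example: the third segment repeats the first.
example = [("a", 1000, 100), ("a", 1100, 100), ("a", 1000, 100),
           ("a", 1200, 100), ("a", 1300, 0)]
print(count_retransmissions(example))
```

Dividing retx by segments per flow would give the kind of loss-rate counter described above, subject to all the caveats already listed.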
Another Firebrick feature, turning now to the FB2x00 series, that is very interesting would be a counter for the number of bad L3 (or L4?) TCP/IP checksums seen. I don't know whether this is in there already?
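For what it's worth, the check behind such a counter is simple. Below is a minimal sketch of the standard Internet checksum (ones' complement sum of 16-bit words, as in RFC 1071) applied to an IPv4 header. The function names are mine and this says nothing about how a Firebrick implements it. The sample header is a made-up 20-byte header with a valid checksum.

```python
def internet_checksum(data: bytes) -> int:
    """Ones' complement of the ones' complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return ~total & 0xFFFF

def ipv4_header_ok(header: bytes) -> bool:
    """A header including its checksum field should sum to zero."""
    return internet_checksum(header) == 0

good = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
bad = bytes([good[0] ^ 0x01]) + good[1:]  # flip one bit in the first byte

bad_checksums = sum(0 if ipv4_header_ok(h) else 1 for h in (good, bad))
print("bad checksums seen:", bad_checksums)
```

A counter like this would only catch corruption that got past the lower-layer checks, so a low count would not by itself prove the line is clean.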
AA help
I really could do with a little help from AA. A hand, with the following:
1. Sanity check me, my maths, assumptions
2. Help check whether (their own) modems are lying / insane ie bugged
3. If so then we could advise customers that the stats on them are just bugged and not to be fooled thus saving everyone from a waste of time
4. Help find out what is going on with the instantaneous d/s SNRM and the target SNRM as I should be able to have the _option_ of getting the real instantaneous SNRM up to something nearer to 6dB without having to go crazy and set a horrible 9dB d/s target.
I was a little disappointed. I understand AA staff are bound to rightly take this as low priority compared to people who are really stuffed. But I still would appreciate a helping hand, whenever someone could take a while to look over it seriously and just be a second pair of eyes.
AA, you just got a five-star vote from me in the recent best ISP contest, so don't let me down.
Any other users who would be kind enough to sanity-check me, report their own experiences or offer advice, please do help if you are able.