Topic: HEC errors (Read 3371 times)

Weaver · « **on:** May 03, 2018, 12:36:26 AM »

I'm getting a lot of ATM HEC errors.

(I don't know what counts as ‘a lot’, what is your opinion? More than zero is a lot to me.)

This is modem 1 after running for about a day, not heavily used, little traffic

Mode:   ADSL2
Type:   ANNEX_A
Status:   Showtime

    Downstream   Upstream
Rate (Kbps):   2862 kbps   566 kbps

SNR Margin (dB):   1.6   5.6
Attenuation (dB):   64.9   42.1
Output Power (dBm):   18.2   12.2

Super Frames:   3625851   3625859
RS Correctable Errors:   10263   20793
RS Uncorrectable Errors:   1839   197

HEC Errors:   25146   0
Total Cells:   55237   16863
Data Cells:   5229814   1260748
Bit Errors:   0   0

= 0.481% per cell (HEC errors / data cells), probability of 32 good cells 85.7% [!], so by my reckoning one failure every 6.9 1500-byte IP PDUs per modem (not counting ACKs, which makes it worse)

Opinions?

What should I do about it? Obviously can resynch the modem, or can increase the target SNRM.

Modem 2

Mode:   ADSL2
Type:   ANNEX_A
Status:   Showtime

    Downstream   Upstream
Rate (Kbps):   2816 kbps   439 kbps

SNR Margin (dB):   1.1   5.8
Attenuation (dB):   65.5   42.2
Output Power (dBm):   18.1   12.2

Super Frames:   3413876   3413883
RS Correctable Errors:   48846   6723
RS Uncorrectable Errors:   374   2

HEC Errors:   1981   0
Total Cells:   54006   36901
Data Cells:   4471444   933414
Bit Errors:   0   0

= 0.044% per cell, probability of 32 good cells 98.6%

And modem 3

Mode:   ADSL2
Type:   ANNEX_A
Status:   Showtime

    Downstream   Upstream
Rate (Kbps):   2819 kbps   496 kbps

SNR Margin (dB):   1.8   5.8
Attenuation (dB):   65.2   42.5
Output Power (dBm):   17.8   12.2

Super Frames:   5051937   5051945
RS Correctable Errors:   54953   14436
RS Uncorrectable Errors:   448   24

HEC Errors:   3608   0
Total Cells:   67890   58116
Data Cells:   31525175   3990276
Bit Errors:   0   0

= 0.0114% per cell, probability of 32 good cells 99.63%
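For anyone wanting to check the arithmetic, here is a minimal sketch that recomputes the three per-cell figures from the counters quoted above. It assumes, as the post does, that the denominator is the downstream 'Data Cells' counter, that a 1500-byte IP PDU occupies 32 cells, and that cell errors are independent. The function names are made up for illustration.

```python
# Counters copied from the three stat dumps above:
# (downstream HEC errors, downstream "Data Cells").
MODEMS = {
    "modem 1": (25146, 5229814),
    "modem 2": (1981, 4471444),
    "modem 3": (3608, 31525175),
}

CELLS_PER_PACKET = 32  # assumption from the post: one 1500-byte IP PDU = 32 ATM cells


def cell_error_rate(hec_errors, cells):
    """Fraction of cells reported with a HEC error."""
    return hec_errors / cells


def p_all_cells_good(p_cell, n_cells=CELLS_PER_PACKET):
    """Probability that n_cells consecutive cells are all good,
    assuming cell errors are independent of each other."""
    return (1.0 - p_cell) ** n_cells


for name, (hec, cells) in MODEMS.items():
    p = cell_error_rate(hec, cells)
    good = p_all_cells_good(p)
    print(f"{name}: {100 * p:.4f}% per cell, {100 * good:.2f}% chance of 32 good cells")
```

If the independence assumption does not hold (for example because errors come in bursts), the second figure will not be a good estimate of the real packet loss.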

burakkucat · « **Reply #1 on:** May 03, 2018, 12:59:34 AM »

b*cat smiles, enigmatically, and murmurs "that is what comes from looking at the device's administration interface".

I would suggest that you record the values for all three modems, once per quantum-period^*, and see if a trend can be observed.

^* insert unit of your choice.

Weaver · « **Reply #2 on:** May 03, 2018, 01:11:03 AM »

I am in agreement with Burakkucat’s suggestion. That is exactly the doom that has now befallen me from having this new data to faff with constantly.

Q. Is my maths correct?

What kind of HEC error rates do others get?

If my maths is right then I may have speeded up things by x % by reducing the target SNRM from 6dB to 3dB and getting increased sync rates, but then I just lose it all, and possibly then some, because of a ~4.8% TCP ( = 1 / ( 3 * 6.9 ), ignoring modems 2 and 3 and loss of ACKs ) packet retransmission rate on a full-MTU TCP download, plus occasional annoying random failures such as DNS failure if a lookup fails due to corruption.

Btw, does anyone know what the likely definition of these error counts is? Does the error figure include successfully corrected single-bit errors or is it only multibit uncorrected errors? I realise that I was just assuming that it was the latter in my ideas about failures and retransmissions. It would be nice to see both.

Since both figures, single-bit and multibit, are not given then my guess is that the reported figure might be the latter as that is critical to know.

I since changed line 1’s downstream target SNRM to 6dB instead of the usual 3dB. Very surprisingly it did not slow it down, it has still picked a really high sync rate of 2816k. Can anyone help me understand that - why no change? (Actually slightly faster.)

Very soon after the resynch the reported instantaneous d/s SNRM dropped from the initial 6.0 to 3.3dB. What is that all about?

More weirdness, as an experiment, I also did the same d/s target SNRM change back up to 6dB for line cwcc@a.3 (modem 2) and that caused the d/s sync rate to drop by about 100k from 2800kbps to around 2700kbps.

This means that using 3dB target SNRM is for me either a really useful performance tweak or possibly a disaster depending on the line or who knows what, if my understanding of the HEC error count is correct and my basic maths is still intact after 40 years of decay. Going for 3dB without visibility of at least fairly detailed stats seems not to be such a good idea.

But I have zero idea of what is going on here. It seems that the combined system of modem and DSLAM is not ending up with enough error correction or using too many bits per bin in places or both, or that variability over time is not being taken into account enough.

But why then are the other two lines doing a bit or a lot better? Perhaps it's down to their different experiences of noise over time.

[Moderator edited to merge successive "tweet-like" posts into one.]

ejs · « **Reply #3 on:** May 03, 2018, 05:48:41 PM »

Single bit errors in the ATM cell header are never corrected on ADSL.

I think you're considering this issue under an incorrect assumption that the individual bit errors are randomly distributed. Errors are expected to arrive in bursts. That's the explanation given in the G.992.x documents for not bothering to correct a single bit error in an ATM cell header.

I'm not sure if frames with a CRC error get passed up to the ATM layer or not. From a quick look at some of my recorded stats, an increase in the HEC count seems to coincide with an increase in the CRC count, and the CRC count can increase without the HEC count increasing. I suspect that the CRC and HEC counts are both reporting the same bursts of noise.

I think an issue with FEC/interleaving is that if the errors exceed the error correction capability, the errors have still been spread over lots of data, so the whole 8ms (delay) worth of data will have been lost.

Weaver · « **Reply #4 on:** May 03, 2018, 06:54:01 PM »

I didn't know that the single bit error thing is just not used. Many thanks for clearing that up.

I am aware that burst noise is an important type and very much expected to be an oft-occurring phenomenon. I think you are right surely about losing a whole 8ms some times. On my slow line, if my arithmetic is correct, that would mean losing two or three adjacent 1500-byte packets not one in a flat-out download. Maybe they were thinking about a scenario with lower data rates?
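A rough check of the "two or three packets" estimate, as a sketch only. It uses the 2862 kbps downstream sync rate from the modem 1 dump and ignores ATM and PPP overhead, so the real number of packets in the window would be somewhat lower. The function name is hypothetical.

```python
def packets_in_window(rate_kbps, window_ms, packet_bytes=1500):
    """How many packet_bytes-sized packets fit into window_ms of line time
    at rate_kbps (taking 1 kbps = 1000 bit/s, no per-cell overhead)."""
    bits_in_window = rate_kbps * 1000 * (window_ms / 1000.0)
    return bits_in_window / (packet_bytes * 8)


# 8 ms of interleaving delay at the modem 1 downstream sync rate
print(f"{packets_in_window(2862, 8):.2f} packets of 1500 bytes in 8 ms")
```

So on this line a lost 8 ms window covers roughly two full-size packets, allowing for a packet straddling the edge of the window.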

I realise that my maths could be right but my statistics could be wrong, because I have got some of the basic underlying assumptions wrong, for example about independent random variables.

I'm still not sure about the impact of your important point concerning burst noise on my numbers though because the presence of an error in one cell does not cause a change in the probability of an error in the next cell. But I may very well be misunderstanding the subject.

What is your view, do you think I can't simply apply multiplication of probabilities to get the conditional probability?

I used the formula
1 - (1 - p)ⁿ
for the probability of badness in a packet, where p is the probability of a HEC error in a cell, n = 32 cells in a PPPoEoA IP = 1500 byte packet.

If you get general thermal noise type errors, then de-interleaving could occasionally bunch those errors up into an apparent spike in the errors in the post-processed data stream. Iirc a CD doesn't like ‘smeary’ crud or condensation because CD’s de-interleaving is less than helpful.

ejs · « **Reply #5 on:** May 03, 2018, 07:57:03 PM »

With errors arriving in bursts, and without FEC+interleaving, then if you've got one bit error, then it's more likely that there'll be more errors around it.

With FEC+interleaving, then these are all errors that the FEC+interleaving process has been unable to correct. If the burst of noise is short enough for the FEC+interleaving to fix, then you'll see no CRC or HEC errors. If it can't be fixed, then the burst of noise would have been spread out over a lot of data by the interleaving.

As I see it, both with and without FEC+interleaving, if you've detected one bit error, then it is more likely that there are more errors close to it. The characteristics of the noise bursts causing the errors are such that a single burst will usually damage far more than a single bit.
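To illustrate why clustering matters for the per-packet sums, here is a toy model, not a description of how real DSL noise behaves. It assumes each cell position independently starts a burst with probability p_cell / burst_cells, and that each burst corrupts burst_cells consecutive cells, so the long-run fraction of bad cells stays roughly p_cell. With a burst length of 1 it reduces to the 1 - (1 - p)^32 formula used earlier in the thread.

```python
def p_packet_hit(p_cell, burst_cells, packet_cells=32):
    """Probability that a packet of packet_cells cells overlaps at least one burst.

    Toy model: a burst starting in any of (packet_cells + burst_cells - 1)
    positions overlaps the packet, and each position starts a burst
    independently with probability q = p_cell / burst_cells.
    """
    q = p_cell / burst_cells
    window = packet_cells + burst_cells - 1
    return 1.0 - (1.0 - q) ** window


P_CELL = 0.0048  # roughly the modem 1 figure quoted in the first post

for burst in (1, 4, 32):
    print(f"burst of {burst:2d} cells: {100 * p_packet_hit(P_CELL, burst):.2f}% of packets hit")
```

The same count of bad cells damages far fewer packets when the bad cells arrive together, which is one reason a HEC count on its own may overstate the retransmission rate.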

Weaver · « **Reply #6 on:** May 03, 2018, 08:26:14 PM »

I am very ready to defer to you given my huge ignorance here. I will explain my dubious thinking.

You're absolutely correct about the probability distribution, indeed, given that a cell is bad there is a greater than normal likelihood that the next one is bad. I don't think that makes the variables non-independent though? Because the first error does not cause the second, rather the two errors both have a common cause. Is that how statistics actually works? But perhaps that is the definition of ‘independent’ random variables. I know next to nothing about the subject. Clearly this isn't like a real-world coin toss though, where there is no pattern.

Weaver · « **Reply #7 on:** May 09, 2018, 02:11:52 AM »

Any idea what DLink might mean by the 'total cells' and 'data cells' entries exactly?

Their nomenclature makes no sense at all, indeed it's impossible because 'total' is less than 'data' above.

I wondered if they had some classification as user data vs management / control info, and data vs idle cells. I assume that ATM here transmits idle cells continuously ? Is that correct ?

If that is right then it introduces an additional tx latency of max 53 * 8 bit times even if the tx queue is empty because you can't start transmitting a new cell immediately whenever you want, only when the end of the currently-being-wound-out tx cell is reached. But the whole cell concept minimises this delay by forcing the use of short cells instead of long frames.
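To put a number on the "53 * 8 bit times" point, a small sketch using the 566 kbps upstream rate from the modem 1 dump. The 1500-byte comparison is just an illustration of why short cells keep this wait small; the helper name is made up.

```python
ATM_CELL_BITS = 53 * 8  # one ATM cell: 5-byte header + 48-byte payload


def serialisation_ms(bits, rate_kbps):
    """Time in milliseconds to clock `bits` onto a link running at rate_kbps."""
    return bits / (rate_kbps * 1000) * 1000


# Worst-case wait for the cell currently being sent to finish
print(f"one cell at 566 kbps:        {serialisation_ms(ATM_CELL_BITS, 566):.2f} ms")
# For comparison: waiting behind a whole 1500-byte frame instead
print(f"1500-byte frame at 566 kbps: {serialisation_ms(1500 * 8, 566):.2f} ms")
```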

But if I am understanding this Cisco document about ATM correctly, it says that a lower layer does not send idle cells to an upper layer, which makes sense.

I am wondering if corrupt idle cells get included in the (total) error counts or not. Of course if they are sufficiently corrupt then you don't know whether they were idle or not, but you could rely on a higher layer to give you context, if you are in the midst of a run that forms a packet then you could assume that a cell is not idle. I don't think you would want to consider anything like that.

If you get corrupt cells then idle or not you have a problem and if the corruption happens to be on an idle cell then that is just good luck, you can't ignore it and you still want to know about it just as much because the corrupt cell count is a warning of problems.

Anyway, the reason I ask is that I would like to work out some kind of reliability measure. It would be N_HEC_error_cells / t but what might t be? t might be uptime which is a bad idea and time since last resync would be meaningful. I don't know this at all though, not without telnet where I may have seen it but can't remember.

Much simpler would be for me to just use t = the total cell count. But what is that, referring to the confusing names of quoted figures? I could take the max of the weirdly named two values. I could add the two weird numbers together.

Any suggestions?

Anyway a value of N_HEC_error_cells / true_total_cells would be useful and I could monitor that. What values do others get for these numbers ?

I suppose I ought to take differences in these numbers over time and take the ratio of the per-period differences in hec error cells and total cells, so I could spot a new problem period instead of seeing an old one forever.

If any of the numbers go down then I would just assume that there had been a resync and restart the differencing process with the latest figures.
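The differencing idea in the last two paragraphs could be sketched like this. The sample numbers below are invented purely to exercise the code, and which modem counter to use as the cell total is still the open question from this post.

```python
def interval_error_rates(samples):
    """Per-interval HEC error ratio from cumulative counters.

    samples: list of (hec_errors, total_cells) readings, oldest first.
    Returns one entry per pair of consecutive readings: the ratio of the
    HEC delta to the cell delta, or None when a counter went down
    (assumed to be a resync) or no cells were counted in the interval.
    """
    rates = []
    for (prev_hec, prev_cells), (hec, cells) in zip(samples, samples[1:]):
        if hec < prev_hec or cells < prev_cells:
            # Counter decreased: assume a resync. The next interval is then
            # measured from this latest reading, so differencing restarts.
            rates.append(None)
            continue
        delta_cells = cells - prev_cells
        rates.append((hec - prev_hec) / delta_cells if delta_cells else None)
    return rates


# Invented readings: two normal intervals, a resync, then one more interval.
readings = [(0, 0), (10, 1000), (15, 2000), (2, 100), (3, 200)]
print(interval_error_rates(readings))
```

Looking at per-interval ratios rather than the running totals means an old bad patch stops dominating the figure once it has passed.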

Weaver · « **Reply #8 on:** May 11, 2018, 03:38:33 AM »

I have been monitoring this for a number of days now and am beginning to think there is a problem. To begin with, I just diagnosed a new case of stats-obsession, caused by the first-time availability of detailed stats to fiddle with.

The downstream HEC error rate reported by modem 1 is always ten times worse than that from two and three. Sometimes modem three is 100 times better. Upstream is excellent. The downstream error rate currently for modem 1 is 0.15% per cell which in my view is very bad.

Opinions? Could do with some advice.

To try and fix it I increased the target SNRM from 3dB to 6dB, retested for a period, and then went up to 9dB. The change to 6dB target didn't work, because, for some reason, the actual SNRM was not that much higher, only raised from earlier 1.5dB to 1.5-3.3dB. 6dB did not reduce the speed at all either. Changing to 9dB cut a disastrous 400k (~20%) off the sync speed but did reduce the HEC errors to nothing as you would expect, but that is no solution.

I don't have a way of slightly raising the SNRM which may be what that line needs.

It seems that it just ignores the 6dB target SNRM setting.

Need to consider swapping the modem out.

Weaver · « **Reply #9 on:** May 11, 2018, 11:09:04 AM »

Could do with a little help on this if someone has a free moment.

* Do you think I need to pester the poor ISP about it?

(Luckily it's their modem too, they sold it to me. I rather wanted a one-stop-shop indeed - so as to know exactly where the buck stops because they sold me the router and modems.)

Weaver · « **Reply #10 on:** May 11, 2018, 11:54:24 AM »

Decided to pester AA on IRC about it. Got absolutely nowhere at all, only interested in ‘line faults’ but it's also their modem. Rather disappointing response. Got the feeling they simply didn't want to get drawn in as they perhaps have too much on their plate already.

Had an idea, could do a TCP flat-out download and capture it and look for retx rate. AA has a traffic capture tool.

I'm wondering if the modem is being mad and perhaps the HEC figure includes all cells that had an error even those that were successfully corrected by the single-bit error mechanism that HEC itself provides.

Someone said however, don't know why, that they believe modems do not implement HEC single bit error correction, which would be very bad if true, a seriously wasted opportunity.

Weaver · « **Reply #11 on:** May 11, 2018, 12:02:50 PM »

I pointed out that this is beyond the limits of AA’s CQM and ICMP ping-box technologies too. Their PPP pings and ICMP pings are too short, so that they will only see an error rate of say 0.3% per ping if a ping equals one ATM cell. So they will think that 99.7% success rate on pings is pretty good, but when you do the maths I should have a 9% failure rate per IP 1500 PDU (=a run of 32 successive cells).

Weaver · « **Reply #12 on:** May 11, 2018, 04:33:49 PM »

Sum up - thoughts on AA, CQM, SNRM and stuff

I am taking what EJS said on board very much. Am still thinking about it, am a bit befuddled concerning the statistical definitions, we completely agree on the real-world picture of error occurrences. I am a bit surprised at the revelation that HEC-mechanism single-bit errors are not corrected. Am needing to think some more about that.

I'm getting reported downstream HEC errors from one of my three AA-supplied Dlink DSL-320B-Z1 modems at what seems to be a modest rate but is actually scary when you do the maths. Upstream is fine. The second modem is 10 times better (lower) per cell and at one stage the third was 100 times better per cell.

The 'bad' modem gives downstream 0.3% HEC errors per cell.

I don't know if the modem’s reporting is genuine and if the figure for the total number of cells is bogus then my fraction is bogus, but then still why the factor of 10 and 100 difference? That would seem to indicate that something odd is going on even if the modem’s figures are not right, either one modem is duff, or line #1 is bad or the modem 1 settings are wrong (insufficient FEC for some reason for example, or SNRM wrong).

Is the modem mad? Or am I? The denominator problem.

My derivation of the figure of 0.3% is HEC_errors / xx_cells. What is xx_cells? The modem quotes two user-looking numbers unfortunately one called ‘total cells’ and one called ‘data cells’. I have no idea what these two mean. Maybe you would expect a true total cells number and perhaps a count for true data-bearing cells or idle cells or all three, but anyway the true total would obviously be the largest and would be the sum of the other two.

modem 1
'total'   1576618
'data'    549514

modem 2
'total'     1419050
'data'   14950880

The modems have not had comparable uptimes here, modem 1 was resynched recently. The two numbers in a pair belonging to one modem do not even have values in a constant relationship when compared with a different modem!

Because I have no idea what is going on, I just chose the max of the two values for the denominator xx_cells. An alternative would have been to add the two together.

I have not yet worked out how to cross-check the believability of the modem’s HEC ratio numbers using a comparison of the rate of RS uncorrected errors per (x)-bits of data as the only other thing I have to go on is the downstream superframe count and I don't know how to convert that into n bits without discovering some more parameter values and reading the standards docs again.
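One possible way to turn the superframe count into bits, offered as a rough sketch under loud assumptions: that the modem's 'Super Frames' counter counts standard ADSL superframes of about 17 ms each (68 data symbols plus one sync symbol), and that the sync rate stayed constant since the last resync. Neither is confirmed for this firmware. The figures are those from the modem 1 dump at the top of the thread.

```python
SUPERFRAME_SECONDS = 0.017  # assumption: standard ADSL superframe duration
ATM_CELL_BITS = 53 * 8


def elapsed_seconds(superframes):
    """Approximate time in showtime implied by the superframe counter."""
    return superframes * SUPERFRAME_SECONDS


def cell_capacity(superframes, rate_kbps):
    """Approximate number of cell slots (data + idle) the line could have
    carried in that time at a constant rate_kbps sync rate."""
    return elapsed_seconds(superframes) * rate_kbps * 1000 / ATM_CELL_BITS


# Modem 1, first dump: 3625851 downstream superframes at 2862 kbps
hours = elapsed_seconds(3625851) / 3600
slots = cell_capacity(3625851, 2862)
print(f"about {hours:.1f} hours in sync, about {slots:.3g} cell slots")
print(f"'Data Cells' as a share of slots: {100 * 5229814 / slots:.2f}%")
```

If those assumptions are anywhere near right, neither of the modem's two cell counters looks like a count of all cell slots including idle ones, which may be relevant to the denominator question.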

Anyway.

Hypothetically, let's suppose the downstream HEC errors per cell rate of 0.3% obtained as described above is genuine. Then we have one line that is ten times worse than line 2. What does this error rate mean in practice?

I am using the following equation 1 - ( 1 - p )**32 (where ** denotes exponentiation) to obtain the probability of a corrupt 1500 byte IP PDU (=32 ATM cells). This comes out to about 9% (ie per IP PDU per line). So on the one line, TCP will need to retx one in every 11 max-length packets, which is not good at all. I have three lines so it will be 1 in 33 packets and that will cover the bad line up.

Could someone tell me if my maths is completely wrong? (We do not have to believe the modem, especially given that the labels on the figures make absolutely no sense, they can't possibly be ‘totals’, but I'd just like some help with a sanity check.)

If I am right then a HEC error rate that does not sound too scary means a big deal on a flat-out TCP download if 9% of your packets are corrupted.

One other possibility. It could be that the modem is reporting all cells that have errors even if these errors are successfully corrected by the HEC single-bit correction mechanism. There are not two figures labelled ‘HEC’, uncorrectable and corrected.

I asked AA about this, hoping for a sanity check, and asked about the 9% corrupt TCP rate. I didn't get any help at all, staff just refused to be drawn in. I explained that I wasn’t trying to assert that it’s a line issue, and was aware that the whole thing could be bogus, although if the modems are not lying and modem 1 is not broken and the settings on line 1 are not off then line 1 just happens to be a lot more noisy than the other two lines, but then that's just what happens sometimes. If the numbers and my maths are not bogus then I would say that it is AA’s problem though, or ought to be, because they also sold me the modem and you would hope that they want to deliver a working service that includes non-corrupted PPP frames and that includes non-corrupted IP packets not relying on TCP (what about UDP-only protocols?).

I need to :
1. swap out modem 1 in case it is broken.
2. Do a test download and capture it in order to try and spot a high rate of TCP retx. Clueless can capture it, unfortunately I don't have the tools myself but no matter, luckily clueless saves the day.
3. Look at the SNRM on line 1 (again).

SNRM and target SNRM

All lines had a 3dB downstream target SNRM. I was expecting this to cause problems of exactly this type, which is why I took a look at the modems’ reported numbers. I increased the d/s target on line 1 from 3dB to the usual 6dB. Weirdly this never seems to do much, the actual SNRM was 2.3dB after a resynch soon after the change and the sync rate did not change. This is very odd and I don't understand it. Changing the target again to 9dB chopped 20% off the sync rate, a ‘cure’ that is worse than the disease and probably not a sensible option since sustaining a slightly higher actual SNRM might be all that is needed, but trying to get rather nearer to 6dB or a bit less rather than the status quo, which is 0.6-2.5 dB varying, does not seem to be happening.

What real current SNRM do other users see when they have chosen a 6dB d/s target with one of these modems? Is it something weird about the modem model, a naughty margin tweak? Doesn't explain why the lines are so different though, because the SNRM values reported are about the same, varying quite a lot.

Packet loss, CQM and 'pings'

AA staff asked me if I was seeing any packet loss and I said no. They then lost all interest, but this doesn't seem to be reasonable.

It seems to me, when you do the maths and have done the to-the-power-32 thing, that you realise that a CQM ppp ping loss rate of 0.3% - because one PPP ping equals one ATM cell - isn't going to show up in clueless much. This is not going to show you that you actually have a really bad problem of 9% per packet corruption rate of full-size IP PDUs. And I have three lines, and that helps to cover up the bad problem that TCP has to deal with. AA needs to successfully deliver 1508 byte-long PPP frames not just short PPP pings. Also VoIP won't show it so much.
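The size dependence described above can be sketched as follows. It assumes a 0.3% per-cell error rate, independent cell errors, a ping that fits in one cell, a full-size PDU of 32 cells, and traffic split evenly over three lines of which only one is bad. All of these are the assumptions made in this post rather than measured facts.

```python
def p_loss(p_cell, n_cells):
    """Probability that at least one of n_cells cells is bad,
    assuming independent cell errors."""
    return 1.0 - (1.0 - p_cell) ** n_cells


P_CELL = 0.003  # the 0.3% per-cell figure used in this post

ping = p_loss(P_CELL, 1)    # short ping: one cell
full = p_loss(P_CELL, 32)   # full-size IP PDU: 32 cells
diluted = full / 3          # bad line carries a third of the packets, others assumed clean

print(f"1-cell ping loss:        {100 * ping:.2f}%")
print(f"32-cell PDU loss:        {100 * full:.2f}%  (about 1 in {1 / full:.0f})")
print(f"spread over three lines: {100 * diluted:.2f}%  (about 1 in {1 / diluted:.0f})")
```

A monitor that only ever sends one-cell probes would see the first figure, while a bulk TCP transfer would experience something closer to the second or third.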

So it seems to me that we have something beyond what CQM as we have it currently can ‘see’, and we could put too much trust in it. (Like radiation monitoring equipment that could not see gamma rays.)

If someone wanted to do something very fancy, a L4-snoop function in an FB6000 could count TCP retxs and give a packet loss rate counter which might be interesting? Could even have an optional and configurable alarm on it if the value went weird, way out of the established ‘learned’ norm of that link or outside some global sanity limit. More load on an FB6000 CPU. Also I'm aware that VPNs would make it impossible. Also not everything is TCP. You wouldn't be covering other L4 protocols without additional L4-snooper modules that spoke certain real time UDP protocols or for example SCTP. But so what - you would just do what you could when you could and if some people couldn't get the benefit some times then so be it. It's an enhancement not an essential. And this would not specifically detect corruption, it would pick up a lot of natural packet loss that is essential in the internet as part of congestion-related behaviour and even congestion control. Also multiple line users such as myself would see problems like this diluted down - by a factor of three in my case - so this kind of corruption would be that bit harder to detect by L4 snoop.

Another Firebrick feature, turning now to the FB2x00 series, that is very interesting would be a counter for the number of bad L3 (or L4?) TCP/IP checksums seen. I don't know whether this is in there already?

AA help

I really could do with a little help from AA. A hand. In the following -

1. Sanity check me, my maths, assumptions
2. Help check whether (their own) modems are lying / insane ie bugged
3. If so then we could advise customers that the stats on them are just bugged and not to be fooled thus saving everyone from a waste of time
4. Help find out what is going on with the instantaneous d/s SNRM and the target SNRM as I should be able to have the _option_ of getting the real instantaneous SNRM up to something nearer to 6dB without having to go crazy and set a horrible 9dB d/s target.

I was a little disappointed. I understand AA staff are bound to rightly take this as low priority compared to people who are really stuffed. But I still would appreciate a helping hand, whenever someone could take a while to look over it seriously and just be a second pair of eyes.

AA, you just got a five-star vote from me in the recent best ISP contest, so don't let me down.

Any other users who would be kind enough to sanity-check me, report their own experiences or advise please do help if you are able.

burakkucat · « **Reply #13 on:** May 11, 2018, 06:25:01 PM »

I really don't know what to suggest . . . other than to ask for the results of some simple tests.

Take a hypothetical fibre-tip marker pen and mark each modem, in sequence, "M1", "M2" & "M3". Now mark each NTE5, in sequence, "NTE5-1", "NTE5-2" & "NTE5-3".

I shall now assume that the current physical set-up is "NTE5-1 & M1", "NTE5-2 & M2" and "NTE5-3 & M3". If one was to perform a modem right-shift, i.e. "NTE5-1 & M3", "NTE5-2 & M1" and "NTE5-3 & M2", and then another two modem right-shifts, in sequence, does the anomaly move with modem "M1" or does the anomaly remain associated with "NTE5-1"?

(In performing the modem right-shift, I am expecting that the two cables associated with each modem (i.e. those connecting it to the NTE5 and the FireBrick port) move with the modem.)

NTE5-1 <---> M1 <---> FB2700-1, NTE5-2 <---> M2 <---> FB2700-2, NTE5-3 <---> M3 <---> FB2700-3

NTE5-1 <---> M3 <---> FB2700-1, NTE5-2 <---> M1 <---> FB2700-2, NTE5-3 <---> M2 <---> FB2700-3

NTE5-1 <---> M2 <---> FB2700-1, NTE5-2 <---> M3 <---> FB2700-2, NTE5-3 <---> M1 <---> FB2700-3

In a proverbial nutshell, the question being asked is whether the anomaly is associated with line 1, modem 1 or FireBrick port 1?

ejs · « **Reply #14 on:** May 11, 2018, 09:29:48 PM »

I'd like to point out that the HEC error detection only covers the ATM cell header, not the data content of the ATM cell. So if all those HEC errors are errors detected in the cell headers, and working on the basis of bit errors at random, you'd expect the same per-bit probability of errors in the data portion of the cell as in the header portion. Considering that the data portion of the cell is much larger than the header, then there ought to be even more bit errors, not visible in the HEC count, but occurring in the data portion of the cell. So either the probability of a whole ATM cell being error free would be even worse than what you've calculated, or all this maths is not applicable, there's no actual problem, and there isn't some awfully low probability of 32 cells all being error free.
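To put rough numbers on that argument, here is a sketch that takes the 'bit errors at random' premise literally. It assumes a 0.3% HEC error rate per cell (the figure used earlier in the thread), a 5-byte header within a 53-byte cell, and fully independent bit errors. It is meant to show where that premise leads, not to describe the real line.

```python
HEADER_BITS = 5 * 8   # ATM cell header, the only part covered by the HEC
CELL_BITS = 53 * 8    # whole cell: header + 48-byte payload


def implied_cell_error(p_hec):
    """If bit errors were independent, P(header clean) = (1 - b)**40 = 1 - p_hec,
    so P(whole cell clean) = (1 - p_hec)**(424 / 40)."""
    return 1.0 - (1.0 - p_hec) ** (CELL_BITS / HEADER_BITS)


def implied_packet_error(p_hec, n_cells=32):
    """Same reasoning extended to a packet of n_cells cells."""
    return 1.0 - (1.0 - p_hec) ** (n_cells * CELL_BITS / HEADER_BITS)


P_HEC = 0.003
print(f"implied damaged cells:   {100 * implied_cell_error(P_HEC):.1f}%")
print(f"implied damaged packets: {100 * implied_packet_error(P_HEC):.0f}%")
```

Under that premise well over half of all full-size packets would arrive damaged, which would be impossible to miss in everyday use, so the premise of independent random bit errors looks like the weak link.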

If you don't have a problem with actual real-world download throughput, my advice is to ignore it. You appear to be looking at the stats, doing some maths, and deciding that there must be some serious problem, which you're now trying to fix. However, I think that if there really were a problem of the magnitude you've calculated, you would have noticed it long ago without looking at any stats.

Some lines will be better or worse than others. If they are operating within spec, there's not a lot that can be done about it.
