Kitz Forum

Broadband Related => ADSL Issues => Topic started by: Weaver on August 09, 2021, 09:58:11 AM

Title: Afternoon burst of errors
Post by: Weaver on August 09, 2021, 09:58:11 AM
The other day, I noticed that one of my ZyXEL VMG1312-B10A modems’ Johnson graph-plotting function (https://github.com/johnson442/custom-zyxel-firmware/releases) was showing a large burst in errors per unit time during the afternoon of the 7th. Janet had a workman at the house for that time interval although the time he was here was much longer than the duration of the interference period. When I asked her about it she said, "but he wasn’t on the same mains supply!", which is good thinking, and is true, because the site where he was working, an outbuilding, has its own separate mains feed, it isn’t fed from the house. I explained to her the concept of RF interference and had to tell her something about what radio frequency EM energy is, then I think the penny dropped.

Anyway, I have some pretty pictures below, one of "FECs" and one of "CRC errors".

Could someone help me understand how these images were arrived at? I’m being really, really stupid again.

There’s also the question of how/where the PhyR L2ReTX protocol fits in and how it affects the definition of these terms - ie pre- and post- L2ReTX recovery vs protocol ReTX timeout.


FECs:

(https://i.postimg.cc/XvzCt9L8/50-F97-F5-B-9-E30-4-AA1-B384-C9230-FFD1-DC9.jpg)



CRCs:

(https://i.postimg.cc/rmsDJzLr/B849031-E-7-DC2-4841-A680-14-D171-C2083-E.jpg)

And what kind of tools and what kind of badness in those tools (if any) causes RF interference like this?

Normally the CRC rate is between 0 - 4 per time interval, so very clean. And once again I have forgotten what the unit time interval is in these Johnson graphs; Burakkucat and friends must have reminded me at least twice already - so I will need yet another reminder if you all would be so kind and then I’ll write it down somewhere prominent. Trying a search of the Kitz forum proved too confusing.
Title: Re: Afternoon burst of errors
Post by: Alex Atkin UK on August 09, 2021, 02:15:26 PM
What was he doing?  I work on the assumption that any kind of inductive load can spit out a ton of RF on various frequencies.
Title: Re: Afternoon burst of errors
Post by: burakkucat on August 09, 2021, 05:13:43 PM
And once again I have forgotten what the unit time interval is in these Johnson graphs; Burakkucat and friends must have reminded me at least twice already - so I will need yet another reminder if you all would be so kind and then I’ll write it down somewhere prominent. Trying a search of the Kitz forum proved too confusing.

Unless it is shown as otherwise then, with such plots, I would always assume the time quantum is "per minute". (Per second and per hour would be somewhat illogical.)

The question to ask yourself is "What is the sampling rate?" and the answer is "Once per minute". The counters are read once per minute and the delta between two adjacent reads of the counter in question is then plotted.
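
As a rough illustration of the delta calculation described above, here is a minimal Python sketch. It is not taken from the Johnson firmware; the function name, the sample values and the handling of a counter reset are all assumptions made for the example, and it works the same whatever the sampling interval turns out to be.

```python
def counter_deltas(samples):
    """Return per-interval increments from a list of cumulative counter reads.

    Assumption: a negative difference means the counter was reset (for
    example after a resync), so the new reading itself is used as the
    increment for that interval.
    """
    deltas = []
    for previous, current in zip(samples, samples[1:]):
        diff = current - previous
        deltas.append(diff if diff >= 0 else current)
    return deltas


# Hypothetical cumulative CRC counter, read once per sampling interval.
crc_reads = [100, 100, 103, 137, 137, 2]
print(counter_deltas(crc_reads))  # one value per interval between adjacent reads
```

Each plotted point would then be one element of the returned list, so a spike on the graph corresponds to a large jump between two adjacent reads.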
Title: Re: Afternoon burst of errors
Post by: Weaver on August 09, 2021, 07:44:56 PM
I didn’t explain my confusion properly at all. My apologies. I should have asked about the exact definition of CRCs and FECs.

As for the time quantum being ‘per minute’, that is my best guess having zoomed in on the graphs of certain suitable, moderately ‘busy’, alternating 0-1, 50% duty-cycle datasets.

@Alex - he was repairing a floor that was damaged by a flood caused by sheep-attack. No I’m not joking, sheep who were scratching rubbed on a plastic water pipe and managed to damage a connector or something, causing a huge flood. Several floorboards were cut out and replaced and then (amazingly good-looking) fake tongue-and-grooved plastic ‘wood’ was laid over the whole floor.

It is our first experience of a sheep-attack. Donkeys also scratch their backs on things. Janet has purchased some strong, stiff wire fencing material to keep sheep (and haggis) out of where they shouldn’t be.
Title: Re: Afternoon burst of errors
Post by: tubaman on August 10, 2021, 07:44:10 AM
Serious RF noise like that can be caused by a badly suppressed motor in an electric drill or the like. Another possibility is a badly suppressed petrol engine - from a standalone generator perhaps?
 :)
Title: Re: Afternoon burst of errors
Post by: Weaver on August 10, 2021, 05:24:39 PM
At the point when a "CRC" event is counted, as a result of corrupt data received that cannot be recovered by the Reed-Solomon method, is it after that that the PhyR or G.INP algorithms kick in and save the day? If that’s true and the ‘CRCs’ are counted before L2ReTX has corrected errors then the ‘CRC’ count will be a dramatic exaggeration of the true number of corrupted PDUs received, no?

Or is the CRC event counted after PhyR or G.INP algorithms have recovered your data by retransmission(s)?

That was my confusion about the detailed stats in the pictures.
Title: Re: Afternoon burst of errors
Post by: burakkucat on August 10, 2021, 05:39:02 PM
I have always understood that the CRC count is that of the "residue" after all corrective methods have been attempted. (But I could well be wrong.) It's not something to which I have given a great deal of thought . . .
Title: Re: Afternoon burst of errors
Post by: Alex Atkin UK on August 10, 2021, 11:28:13 PM
You'd think CRC would count BEFORE retransmission, as technically you did lose that packet, but if it does - who knows.
Title: Re: Afternoon burst of errors
Post by: tubaman on August 11, 2021, 08:16:48 AM
I have always understood that the CRC count is that of the "residue" after all corrective methods have been attempted. (But I could well be wrong.) It's not something to which I have given a great deal of thought . . .

That is my understanding too, but I've never seen it actually documented anywhere.
Title: Re: Afternoon burst of errors
Post by: kitz on August 20, 2021, 02:43:12 AM
I have always understood that the CRC count is that of the "residue" after all corrective methods have been attempted. (But I could well be wrong.)

True.   CRC isn't error correction.  It's error detection for corrupt data packets that have not been sent. (lost packets or packet loss)
It doesn't record errors that may have been corrected by 'lower level' error correction methods.

Remember how broadly speaking* we say that FECs have the potential to be CRCs if the line wasn't Interleaved?   

>>  as technically you did lose that packet, but if it does - who knows.

If it was corrected by any of the lower level error correction methods then no - because it was successfully recovered. 
The relevant error correction methods keep counters of which method it was recovered by.   There's a big difference between error correction and error detection and it's why since G.INP I specifically say error detection or error correction as opposed to Interleaving.
   
CRCs go waaay back higher up the stack for recovery such as ARQ.  Sufficient packet loss at this level has the potential to grind a connection to a noticeable crawl.




*Technically it should be Interleaved and using RS encoding.
Title: Re: Afternoon burst of errors
Post by: Alex Atkin UK on August 20, 2021, 02:48:02 AM
That's where it's unclear, because G.INP is error detection, the packet IS lost, it's just able to more efficiently get re-transmitted.  So presumably if that counts as CRC or not would depend on if it occurs lower down the chain so it can't be detected as a CRC.

My guess would be it probably doesn't count as a CRC because the whole point is it corrects the problem long before that point, to avoid it being seen as packet loss which in turn could trigger a loss of sync?
Title: Re: Afternoon burst of errors
Post by: kitz on August 20, 2021, 03:30:14 AM
>>> PhyR L2ReTX

I'm not sure as the mention of L2 is confusing me here, so ignoring that bit.....
Would have thought that since it's the modem performing this function then it would be the same level as G.998.4 and 'beneath' RS encoding. 
 
As above,  Retransmission and RS (and TCM) are types of error correction used with DSL.  CRC is a counter of errors that could not be fixed by any of the error correction methods employed by the xDSL modem.

Retransmission uses RS methods for protection.  The level of protection is set using parameters in a similar way to standard INP.   If the DTU doesn't arrive then a CRC is recorded and dealt with at a higher level in the protocol stack.


Further reading on this topic on the main site under
 - Data Transmission
 - Error Correction
 - G.INP Retransmission
Title: Re: Afternoon burst of errors
Post by: kitz on August 20, 2021, 03:52:58 AM
That's where it's unclear, because G.INP is error detection, the packet IS lost, it's just able to more efficiently get re-transmitted.  So presumably if that counts as CRC or not would depend on if it occurs lower down the chain so it can't be detected as a CRC.

My guess would be it probably doesn't count as a CRC because the whole point is it corrects the problem long before that point, to avoid it being seen as packet loss which in turn could trigger a loss of sync?

G.INP is classed as error protection.   
OK I get what you say about it being error detection, but all error correction methods first have to be able to perform error detection to know when & which  packets need to be corrected.


If I say that CRCs are errors that haven't been able to be corrected at the 'DSL layer' rather than modem,  does that help to clarify it a bit better?
CRCs are passed higher up the chain and have to be dealt with higher up in the (TCP/IP) protocol stack ie by the 'network card'.   

The whole packet has to be resent - unlike with TCM, RS & G.INP which can attempt to repair the packet.   
OK again that's a bit of a basic explanation because g.inp can re-request a packet...  but purely at the 'DSL layer' between the modem and dslam....  rather than the more traditional method of between the PC and remote server.     That's why we say a CRC is something that hasn't been fixed by the modem and has to be passed higher up the network chain - as it is irrespective if g.inp is in use or not.
Title: Re: Afternoon burst of errors
Post by: Alex Atkin UK on August 20, 2021, 04:43:44 AM
Yes, that makes sense.  Plus presumably a DSL packet is much smaller than say a TCP packet, so you're retransmitting much less data if it's fixed at the DSL layer?
Title: Re: Afternoon burst of errors
Post by: Weaver on August 20, 2021, 05:15:43 AM
I’m getting confused about PDUs here. How large is a DTU in PhyR ? (Need to read Kitz’ article again thoroughly.)

So is an incoming AAL5 PDU which is seen as a string of PhyR DTUs checked, retransmission performed if needed, and if the retransmission process fails then that is a "CRC error" event?
Title: Re: Afternoon burst of errors
Post by: niemand on August 20, 2021, 12:53:07 PM
DTU doesn't have a fixed size. It's a multiple of either 53 byte ATM cells or 65 byte PTM codewords wrapped in some overhead.

I don't think these DTUs will be packaged inside ATM. They're generated by the remote modem and carry ATM so presumably are part of the DSL layer and will be transported as superframes.
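
To make the "multiple of cells or codewords plus overhead" idea concrete, here is a small Python sketch. Only the 53-byte and 65-byte figures come from the post above; the function name, the multiples used and the overhead parameter are hypothetical, since the thread does not say how large the overhead is or which multiples are used in practice.

```python
ATM_CELL_BYTES = 53      # size of one ATM cell, as stated in the post above
PTM_CODEWORD_BYTES = 65  # size of one PTM codeword, as stated in the post above


def dtu_bytes(multiple, unit_bytes, overhead_bytes=0):
    """Size of a DTU carrying `multiple` whole units plus some overhead.

    `overhead_bytes` is a placeholder: the thread does not give the size of
    the DTU overhead, so it defaults to 0 (payload only).
    """
    if multiple < 1:
        raise ValueError("a DTU carries at least one unit")
    return multiple * unit_bytes + overhead_bytes


for multiple in (1, 2, 4):  # hypothetical multiples, for illustration only
    print(multiple,
          dtu_bytes(multiple, ATM_CELL_BYTES),
          dtu_bytes(multiple, PTM_CODEWORD_BYTES))
```

The point of the sketch is only that the payload always contains whole cells or whole codewords, never a fraction of one.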
Title: Re: Afternoon burst of errors
Post by: Weaver on August 20, 2021, 08:36:42 PM
I’m digging out some more reading matter.

I found this interesting. https://doc.lagout.org/electronics/doc/ikanos/DO-435935-WP-1_Improved-Impulse-Noise-Protection_ReTx1.pdf
Have only just started reading it as not feeling very well this evening, fuzzy and headachy.

I have been misled by the experience of old dial-up modems’ retx protocols which used very small DTUs (perhaps 64 bytes iirc, but it’s been thirty years and I just can’t remember).

@Kitz I used the term L2 in L2ReTX from OSI L2 to distinguish it from TCP ReTX at L4. Sorry if that created more confusion than clarification. There are so very many layers, layers within layers. I ought to start using the DSL-internal layer terminology.
Title: Re: Afternoon burst of errors
Post by: burakkucat on August 21, 2021, 12:42:37 AM
I found this interesting. https://doc.lagout.org/electronics/doc/ikanos/DO-435935-WP-1_Improved-Impulse-Noise-Protection_ReTx1.pdf

Have downloaded a copy to consider . . .
Title: Re: Afternoon burst of errors
Post by: Weaver on January 11, 2022, 07:32:33 AM
[I have a lot of questions in the following. If friends would be kind enough to answer individual ones one at a time then as usual I will be very very grateful indeed, many thanks.]

Is PhyR more-or-less exactly the same thing in detail as G.INP (not just in concept or high-level aim), or are there differences in the protocol?

Kitz wrote:
> Thats why we say a CRC is something that hasn't been fixed by the modem and has to be passed higher up the network chain - as it is irrespective if g.inp is in use or not.

@Kitz - This is important. So a CRC error count is counted even if G.INP kicks in and recovers the situation by successfully retransmitting DTUs as needed. In other words, if PhyR or G.INP fixes the problem situation by retx this doesn’t affect the CRC error event count. Is that correct? So if we see CRC counts we may have a serious problem but these errors could in fact be getting fixed by G.INP.

And is that just as true if I substitute ‘PhyR’ for G.INP?

So going back to my graphs at the start (https://forum.kitz.co.uk/index.php/topic,26211.msg439753.html#msg439753), where I see 34 CRC events per time quantum, which is per minute as Burakkucat has explained, we are seeing 34 corrupt RS frames RX per minute and an RS frame’s size is determined by the DSL framing parameters (B * M iirc). We can’t tell from that anything about how many IP PDUs are being corrupted.

Is all of that correct?

G.INP proper is allowed in conjunction with G.992.3 / G.992.5 isn’t it? But BT’s DSLAMs/MSANs don’t generally support it outside of some FTTC cabs ?

I’m assuming that I am only lucky enough to have PhyR because I have a ZyXEL modem that has a Broadcom chipset in it now which supports PhyR and so does the now six year-old 21CN DSLAM at NSBFD. Is that correct ? And other ADSL2 / 2+ users are not so lucky if they don’t have the matching pair of good hardware in their modem and DSLAM ? That’s what I’ve been telling myself.

If all of that is correct, then that would agree with my feeling that the ZyXEL with its Broadcom chipset is much more robust under poor line conditions than my MediaTek-based DLink DSL-320B-Z1 modems at a low SNRM. I’ve been telling myself that the support for PhyR is the reason why I’m very very happy (no CRCs and no ES at all) even at a downstream target SNRM of 3dB. Compare this with upstream, which I believe has no PhyR support so is not entirely happy even at 6dB upstream target SNRM.

Another question: if you are getting a lot of CRC errors and have PhyR or G.INP, do you get a measurable slowdown due to the retransmissions, if you look at things carefully?
Title: Re: Afternoon burst of errors
Post by: Chrysalis on January 12, 2022, 11:50:29 AM
To me a CRC is an uncorrected error. Any kind of error correction, if it fixes errors, will not result in a CRC error being counted; in all the implementations of error correction I have seen, I have never seen corrected errors tallied as CRC.

Those with G.INP can hopefully help you more on your last point. I do think, though, that G.INP has its own set of stats which may show how many times it's had to do retransmissions (a latency increase hit).
Title: Re: Afternoon burst of errors
Post by: tubaman on January 12, 2022, 01:46:23 PM
To me a CRC is an uncorrected error. Any kind of error correction, if it fixes errors, will not result in a CRC error being counted; in all the implementations of error correction I have seen, I have never seen corrected errors tallied as CRC.

Those with G.INP can hopefully help you more on your last point. I do think, though, that G.INP has its own set of stats which may show how many times it's had to do retransmissions (a latency increase hit).

Quite agree, and yes, G.INP has its own stats counters. A quick search found me this explanation of them (https://www.manualsdir.com/manuals/736472/exfo-maxtester-max-630.html?page=92)

"G.INP RTX_TX is the number of frames retransmitted by the transmitter. The Local number is what is retransmitted from the CPE to the DSLAM. The Remote value is what is retransmitted from the DSLAM to the CPE.

Note: The RTX_TX number may contain retransmits not requested by the receiver (that is, the retransmission request channel got corrupted and a retransmission was sent automatically) resulting in an incremental value even though the same frame ID was retransmitted multiple times.

G.INP RTX_C is a counter that is increased each time a frame is detected in error and has successfully been corrected by a retransmission. The Local number is received by the CPE, the Remote value received by the DSLAM.

G.INP RTX_UC is a counter that is increased each time a frame is detected in error and has not been corrected by one or more retransmissions within the maximum delay period. The Local number is received by the CPE, the Remote value received by the DSLAM."
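
As a rough sketch of how those three counters could be combined, the Python below computes the share of errored frames that retransmission managed to rescue. The class and function names are made up for the example and the counter values are invented; only the meaning of RTX_TX, RTX_C and RTX_UC is taken from the quoted explanation.

```python
from dataclasses import dataclass


@dataclass
class RetxCounters:
    rtx_tx: int  # frames retransmitted by the transmitter
    rtx_c: int   # errored frames corrected by a retransmission
    rtx_uc: int  # errored frames not corrected within the maximum delay


def corrected_fraction(counters):
    """Fraction of errored frames that retransmission recovered.

    Returns None when no errored frames were seen, because the ratio is
    undefined in that case.
    """
    errored = counters.rtx_c + counters.rtx_uc
    if errored == 0:
        return None
    return counters.rtx_c / errored


# Invented values, purely for illustration.
sample = RetxCounters(rtx_tx=120, rtx_c=95, rtx_uc=5)
print(corrected_fraction(sample))
```

Note that RTX_TX is deliberately left out of the ratio, since the quoted note says it can include retransmissions the receiver never asked for.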
Title: Re: Afternoon burst of errors
Post by: Weaver on January 22, 2022, 05:27:45 AM
The key point for me is that there can be an error detected by one software/hardware component, here RS CRCs, but is fixed by a later component, here PhyR or G.INP. And if there is an RS CRC detected error, then that counts one stats error counter event even if the problem is subsequently fixed by PhyR or G.INP ReTx. So you have a stats counter error even if there was no fault seen at L3, no practical problem except for a small latency increase due to the L2 retx.
Title: Re: Afternoon burst of errors
Post by: Weaver on April 29, 2022, 11:58:55 AM
I checked with Mr Johnson. Burakkucat’s guess as to the time quantum in the Johnson graphs was not correct, just logical :) The quantum is in fact 30 s, so the count is "events per 30 s". Johnson has very kindly made a new release for me which shows a " / 30 s" on the vertical axis for people like me who can’t remember anything.
Title: Re: Afternoon burst of errors
Post by: Weaver on January 15, 2023, 10:45:44 AM
DTU doesn't have a fixed size. It's a multiple of either 53 byte ATM cells or 65 byte PTM codewords wrapped in some overhead.

That makes sense, because the DTU is then a very appropriate unit such that the code that stitches together newly received retransmitted DTUs into the saved existing good parts of the earlier large PDU does not have to deal with for example a part of an ATM cell’s worth of new DTU payload content or half a PTM 65-byte codeword. Perhaps that’s not really an important issue for simplifying the code, but an appropriate DTU size choice can’t hurt.

Can anyone tell me what the likely value or range of values of the multiple is ? And what determines the choice of the value ?

I do wonder if I ought to work out the DTU size in the current situation because then I could say "there are n uncorrected errors per b bytes of received data and each error means the loss/corruption of d bytes" with an exact report on the data loss and a summary percentage shown to the user if I give out full details in ‘verbose’ mode. If there is no ARQ, then you would have to report different numbers: CRC count and the total size of the lost DSL PDUs vs the received total download size.
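
The "n errors per b bytes, each costing d bytes" report could be sketched like this in Python. The function name and the numbers in the example are hypothetical, and the sketch assumes every uncorrected error costs exactly the same number of bytes, which may well not hold in practice.

```python
def loss_percent(errors, bytes_per_error, bytes_received):
    """Percentage of received data lost to uncorrected errors.

    Assumes each uncorrected error loses exactly `bytes_per_error` bytes.
    """
    if bytes_received <= 0:
        raise ValueError("bytes_received must be positive")
    return 100.0 * errors * bytes_per_error / bytes_received


# Hypothetical figures: 34 errors, 212 bytes lost per error, 1 MB received.
print(f"{loss_percent(34, 212, 1_000_000):.4f} %")
```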

But going back to the PhyR situation, one bad DTU means one entire (possibly larger) DSL layer-x PDU lost as half a PDU is no use. I suspect that a bad DTU might stuff up more than one DSL layer-x PDU where multiple layer-x DSUs are packed into one layer-x DTU, is that right ? I need to re-read the ADSL2 spec on this point and I also need to remind myself of what the various correct DSL sublayers’ names are, so the above could be made less vague and confusing. Mea culpa maxima.

I’m sure I have been told this, but once again I forget. Are there going to be more errors if a download is in progress than if the link is idle ? I’m assuming the answer could be very different for the cases of VDSL2, ADSL2+PTM and ADSL2+ATM ?

I’m thinking that the link is always busy in ATM even if there’s no user data in transit ? I have no idea about PTM though ?

Looking back, I see that at various places I expressed opinions that were wrong and I think that was because I hadn’t spotted or had not understood some of Kitz’s points.

Just to once again check my understanding, I’m assuming that ES counting is exactly driven by the CRC count increment events, and the value of CRCs and ES is different because you can occasionally get several CRC events within one 1 s time quantum and that whole group of CRCs still only counts as one ES.
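
That understanding of the CRC-to-ES relationship can be written down as a tiny Python sketch. It is only a model of the assumption stated above, namely that ES is driven purely by CRC events; a real modem's ES counter may also react to other defect conditions, which this sketch ignores. The function name and timestamps are invented.

```python
import math


def errored_seconds(crc_event_times):
    """Count the distinct whole seconds that contain at least one CRC event.

    `crc_event_times` are event timestamps in seconds. Several events that
    fall inside the same 1 s window contribute only one errored second.
    """
    return len({math.floor(t) for t in crc_event_times})


# Invented timestamps: six CRC events clustered into three separate seconds.
events = [0.1, 0.4, 0.9, 5.2, 5.3, 9.0]
print(len(events), "CRCs ->", errored_seconds(events), "ES")
```

Under this model the ES count can never exceed the CRC count, and the two are equal only when no second contains more than one CRC event.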

I presume that I made the correct choice by picking ES as the health metric in my modem-stats DSL link wellness assessment program, but CRCs would have been a reasonable choice too. Do you agree ?

And even PhyR or G.INP retransmission-corrected errors are not shown in the CRC or ES value ? That is, if there is a RS-uncorrected error then it’s not shown in the ES or CRC total if PhyR or G.INP successfully recovers from the situation within a specified max time limit ?

In my iPad wellness program, I thought about looking at the PhyR-related stats that my modems show. This would be a pain because some modems won’t have PhyR and in some situations the DSLAM won’t support PhyR or G.INP, so I would have to deal with all those alternatives, although that wouldn’t be a big pain in the code. The syntax of these stats might vary between modems and will probably be different for G.INP stats compared with PhyR stats, and that parsing would be a pain to handle. I certainly wasn’t keen to do a lot of work unless there was some real reason why looking at these stats was essential, and if ES or CRC count does effectively summarise the health of the link including absolutely all L2 error correction techniques available, then there’s no reason to do any unnecessary extra work. When I had the idea that the CRC count might show exactly how many RS-uncorrected PDUs there are, never mind those later corrected by L2 retx ie ARQ ie PhyR or G.INP, then I did think about looking at the PhyR/G.INP stats.
Title: Re: Afternoon burst of errors
Post by: burakkucat on January 15, 2023, 03:30:05 PM
Can anyone tell me what the likely value or range of values of the multiple is ? And what determines the choice of the value ?

No. Sorry, not I.

Quote
I’m sure I have been told this, but once again I forget. Are there going to be more errors if a download is in progress than if the link is idle ? I’m assuming the answer could be very different for the cases of VDSL2, ADSL2+PTM and ADSL2+ATM ?

The xDSL link is established and has various protocols in place to help maintain that established state. Why would the presence or absence of "the stuff" being carried by the xDSL link make any difference? (I guess I'm answering "no" and "no" to your two preceding questions.)

Quote
I presume that I made the correct choice by picking ES as the health metric in my modem-stats DSL link wellness assessment program, but CRCs would have been a reasonable choice too. Do you agree ?

Yes, that seems to be a sensible choice.

Quote
And even PhyR or G.INP retransmission-corrected errors are not shown in the CRC or ES value ? That is, if there is a RS-uncorrected error then it’s not shown in the ES or CRC total if PhyR or G.INP successfully recovers from the situation within a specified max time limit ?

"Yes" and "yes" is my response (until such time as I am corrected by a more knowledgable member).
Title: Re: Afternoon burst of errors
Post by: Weaver on January 15, 2023, 06:05:57 PM
> Why would the presence or absence of "the stuff" being carried by the xDSL link make any difference?

I can think of an example. From the Stone Age (1970s). RS232 links have the line in one of two states (oversimplified) which I’m going to call idle and frame, because I don’t know the official names if any such names exist. Being in the idle state means that no data is being currently sent, and the line is just unchanging, always high voltage. In the frame state the line goes up and down for each bit in a short block of between 7 and 9 payload bits plus some extra bits for protocol framing overhead. Now if there is an EM noise spike while the link is in the idle state, ie no data currently being sent, then depending on the hardware possibly no error condition is indicated and certainly there will be no corrupted byte or bytes received, because we are in the idle state ie no data being transmitted so nothing is currently being received that can get corrupted. (Methinks there are other possible problems that could occur in such a situation but I’ll leave that analysis to the aged experts amongst us.)

Anyway, with this physical layer there is a concept of currently idle link state vs currently busy link state (my ‘frame’ state in this particular case). You might not see any evidence of badness produced when a spike occurs in the idle state but this is dependent on several unknowns. If you are doing a big download flat-out then possibly you would pick up a certain number of error events per unit time whereas if you transmitted nothing, then depending on these various unknown system-dependent factors you might have zero error events.
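
The idle-versus-frame argument can be illustrated with a toy Python model. It is a deliberate oversimplification under the same assumptions as the description above: a noise spike only matters if it lands while a frame is being sent, and anything that happens during idle is ignored (including the possibility of a spike being misread as the start of a frame). The function name, spike times and frame intervals are invented.

```python
def spikes_hitting_frames(spike_times, frames):
    """Count noise spikes that fall inside any transmitted frame.

    `frames` is a list of (start, end) times, treated as half-open
    intervals: a spike at exactly `end` is counted as landing in idle time.
    """
    return sum(
        any(start <= t < end for start, end in frames)
        for t in spike_times
    )


spikes = [1.0, 2.5, 4.0]        # invented spike times, in seconds
busy_link = [(0.0, 5.0)]        # link sending continuously for 5 s
quiet_link = [(2.0, 2.1)]       # a single short frame, idle otherwise

print(spikes_hitting_frames(spikes, busy_link))
print(spikes_hitting_frames(spikes, quiet_link))
```

With the same three spikes, the busy link sees all of them while the mostly idle link sees none, which is the effect being described for this particular kind of physical layer.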
Title: Re: Afternoon burst of errors
Post by: burakkucat on January 17, 2023, 10:17:28 PM
In your case, your xDSL links are never quiescent, for you have the incessant 1 Hz Firebrick <---> Firebrick ping-pong running over the links.