In many situations I have seen how SNRM, maybe especially downstream, goes down over a certain fairly long period, with increasing error rates, until either DLM kicks in and takes action because of the error rate, or the SNRM gets so low that the modem decides to give up and drops the link, at a point depending on the modem.
Questions:
1. what’s your experience of this problem?
2. How long does it take to get down to the point of death - ie ridiculously low SNRM where it drops the link?
3. Is upstream less vulnerable or is there no difference? (Although the target SNRM (ie initial) values may differ between up- and downstream.)
I’m wondering if I could look through the log file in my ZyXEL modem that is constantly written by the custom Mr Kitizen Johnson ZyXEL firmware that is installed and watch the downstream SNRM, or maybe up- and downstream, and try and detect any slow decline. I’m up for some guidance about the algorithm for this.
There’s a lot of data, so it would perhaps not be good if I had to search through it all, and I’m wondering if I could get away with only looking at every nth sample. I’m assuming that I start from the latest time sample and go backwards in time, no? A couple of awkward things though: I only have 24 hrs’ worth of data, and that isn’t ideal.
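To make the idea concrete, here’s a minimal sketch in Python of the thinning step (the real thing would be Shortcuts actions, so this is purely illustrative; `thin_newest_first`, `samples` and the stride value are names I’ve made up):

```python
# Sketch: thin the data by keeping only every nth sample, walking backwards
# from the newest reading, as described above. Names are illustrative.
def thin_newest_first(samples, n):
    """Start from the latest sample and keep every nth one going back in time."""
    return samples[::-n]

# Hypothetical SNRM readings in dB, oldest first:
samples = [6.2, 6.1, 6.3, 6.0, 5.9, 5.8, 6.0, 5.7]
print(thin_newest_first(samples, 3))  # newest first: [5.7, 5.9, 6.1]
```

One nice property for a slow runtime like Shortcuts: the loop count drops by a factor of n, which matters far more than the per-sample arithmetic.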
I would like to find the start of the downward slide, but it might be before the start of the data set. I don’t know what the target SNRM values up/down are: although I do have config file information about this, it’s not necessarily up to date, and my code has to work in all circumstances, or else say honestly that it can’t make a determination; it must never deliver bogus error reports. Maybe it’s not the end of the world if the start point of a decline is ‘off the screen to the left’, so to speak. I think absolute levels aren’t that important; it’s all about diagonal average slopes, averaged out over a very wide range.
Another tricky thing is that I need to deal with day-night variation. Perhaps a wide-enough range averaging process will do the job. Alternatively, linear regression (fitting the best straight line using the least-squares method) would do the job. I wrote this kind of code, but much more sophisticated, as one of my first mainframe programs 42 years ago, but I can’t remember how to do it now.
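For reference, the least-squares slope needs only a handful of running sums, so it maps onto a small, fixed number of arithmetic actions per sample. A minimal Python sketch (the function name, the sample data, and the dB-per-hour framing are my own illustrative assumptions):

```python
# Sketch: best-fit straight-line slope via the classic least-squares
# formulas, using only running sums -- cheap per-sample arithmetic.
def slope(times, values):
    """Slope of the least-squares line fitted to (time, value) pairs."""
    n = len(times)
    sum_t = sum(times)
    sum_v = sum(values)
    sum_tt = sum(t * t for t in times)
    sum_tv = sum(t * v for t, v in zip(times, values))
    return (n * sum_tv - sum_t * sum_v) / (n * sum_tt - sum_t * sum_t)

# Hypothetical SNRM readings (dB) at 4-hourly intervals:
hours = [0, 4, 8, 12, 16, 20]
snrm = [6.3, 6.1, 6.0, 5.8, 5.7, 5.5]
print(slope(hours, snrm))  # a small negative number: droop in dB per hour
```

A steadily negative slope over a wide range is the droop signal; the sign test is then a single comparison against a threshold.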
It absolutely has to be a cheap fast algorithm, as this code is being written in iPadOS Shortcuts which is horribly slow, but arithmetic is cheap as that will be done by machine code in the runtime library and the only thing that matters as regards slowness is the number of lines of Shortcuts’ code we run through (the number of so-called ‘actions’). So loop count is everything.
I think that in certain situations we will need to apply a simple averaging noise filter that smooths out all the random variation. We can’t apply this everywhere for reasons that will become clear below, and also we don’t want to do runs of two different averaging processes over the same data range, if we can help it, as it’s a waste of CPU time.
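The filter itself can be as simple as a centred moving average; a Python sketch of the idea (the window size is a tuning assumption, and `smooth` is just an illustrative name):

```python
# Sketch: a simple centred moving average, shrinking the window at the
# edges so the output has the same length as the input.
def smooth(values, window):
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

print(smooth([1, 2, 3, 4, 5], 3))  # [1.5, 2.0, 3.0, 4.0, 4.5]
```

This is one pass over the data, which keeps the Shortcuts action count down, but note it is still one loop iteration per sample, so it should only be run over the patches that actually need it.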
One more tricky thing is that the code has to detect discontinuities - rising step functions. Sharp vertical jumps in the SNRM of a height greater than some value to be worked out are to be counted as resyncs, provided they are not spikes - that is, provided the upward jump is not immediately followed by a downward jump. A jump that is immediately undone we define as a spike and classify as very bad noise. We shouldn’t be seeing too many spikes like that, or else we need to do something about our noise filtering. Also, since we are looking for a rising step function, we need to check that before and after the step the derivative of the heavily smoothed data (with short-interval averaging applied in a patch before and after the step) has a really low value, so the function is nearly flat on both sides of the discontinuity. My line 2 upstream has a lot of steps every day, so the algorithm won’t work on upstream for this line. The presence of falling steps tells us to forget it: the value has to drop so that it can jump up again. Why this happens has been a mystery for a long time. Something to do with a noise source being switched on or off, or perhaps more likely something bad in that line is a noise detector which somehow gets disturbed and flicks between something like enabled and disabled, or sensitive and insensitive, states.
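The step-vs-spike classification described above could be sketched like this in Python (the `JUMP` and `FLAT` thresholds are placeholder values still to be worked out, and the slopes are assumed to come from short smoothed patches either side of the jump):

```python
# Sketch: classify a sharp upward jump as a resync step, a spike, or neither.
# Threshold values are placeholders, to be tuned against real log data.
JUMP = 3.0   # dB: minimum upward jump to count as a candidate step
FLAT = 0.05  # dB/sample: max slope of smoothed data either side of the step

def classify_jump(values, i, pre_slope, post_slope):
    """values[i] -> values[i+1] is the candidate jump; pre_slope/post_slope
    are derivatives of smoothed patches before and after it."""
    rise = values[i + 1] - values[i]
    if rise < JUMP:
        return "no step"
    # A spike: the jump up is immediately undone by a jump back down.
    if i + 2 < len(values) and values[i + 1] - values[i + 2] >= JUMP:
        return "spike"
    # A genuine resync step: nearly flat on both sides of the discontinuity.
    if abs(pre_slope) < FLAT and abs(post_slope) < FLAT:
        return "resync step"
    return "unclear"
```

So for example a trace like 6, 6, 12, 12, 12 with flat sides would come back as a resync step, while 6, 6, 12, 6, 6 would come back as a spike.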
We need to detect discontinuities first, before we do any long-range smoothing or linear regression. Firstly, we don’t want to try to smooth out a vertical jump, a step. And secondly, we are looking at potentially two different diagonal (or not) lines: one for the situation before the step and one for after it, ie after the resync.
So I need to give some more thought to the order of operations, since getting that wrong would, as already mentioned, be a disaster. I think down-step detection might be a good idea to run first, to rule out a mad situation such as my line 2 upstream, as the presence of a nasty like that would confuse the algorithm of the resync detector / rising step detector phase.
When I find a resync, I then have two situations: a before and an after period. Do I then split the whole range into two, build up a list of ‘inter-resync situations/periods’, and ultimately have a loop looking at each one? Then I would have to report multiple judgement decisions, one per period. I don’t think I want to go down this path: too much code and complexity, and splitting the data range into, say, two 12-hr periods (or an even worse case where they are of unequal length) means one period could easily be way too short to run the algorithms properly, as they need a wide range to work well. So perhaps keep it simple and give up if there’s a resync. Or only do the after period, and only then if the length of that range is above some minimum, so that we have plenty of data available and the algorithms can work well over a wide enough time range.
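The ‘keep it simple’ option could look something like this (a Python sketch; `MIN_SAMPLES` and the function name are my own placeholder choices, and the honest give-up is signalled by returning nothing):

```python
# Sketch: after a resync step is detected, keep only the data after it,
# and give up honestly if what's left is too short to analyse.
MIN_SAMPLES = 60  # placeholder: e.g. several hours at the log's sample rate

def usable_range(values, step_index):
    """step_index is the position of a detected resync step, or None."""
    after = values if step_index is None else values[step_index + 1:]
    return after if len(after) >= MIN_SAMPLES else None
```

Returning ‘no determination’ rather than a guess is in keeping with the rule above that the code must never deliver bogus error reports.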
If I do find SNRM droop, I report the fact, provided it isn’t merely 24-hr cycling, which can be identified because the values go down but then rise again later, and the period has to be exactly 24 hrs. Also, very wide-range smoothing might erase a 24-hr-wavelength sine wave. (Are we sure we want to make it quite that powerful though?)
Is it right that some modems have SNRM droop over time because they have a problem with the operation of bitswap? Or am I just misremembering that, or simply making something up? I seem to remember from when Kitz taught me about the ‘monitored tones’ hardware feature in modems that a cheap and nasty modem that doesn’t support monitored tones could be bad news with bitswap in unlucky circumstances. The noise spectrum shifts over time and the original bitloading chosen during training becomes no longer suitable for the new spectrum, so bitswap tries to do its thing, but if the new spectrum is far too different and the noise is so bad on some tones that they are marked as ‘unusable’, then they get knocked out of consideration. Without the monitored tones feature, by which the modem keeps listening to those unusable tones to see if they ever recover, we have the situation where some tones have been knocked out and will never get reinstated, while over time even more tones get knocked out. It’s a one-way process: without the monitored tones ‘rehabilitation’ to add tones back into the available set, it’s a highway straight down to hell. If any of that is correct, it sounds like a perfect cause of long-term SNRM decline: the combination of a cheapo modem’s lack of monitored tones plus a greatly varying noise spectrum.
Will a higher target SNRM help fix these problems? Presumably it depends on how far down the decline will ever go. If it stops at some lower level, then you carry on like that with a high error rate, which is not great, and nothing happens unless DLM chooses to kick in and force a retrain. Regarding this monitored tones theory, I don’t see why the downward slide would ever stop, not unless the spectrum variation ceases. The cure in that case is to go out and buy a proper modem, as I’m not certain that a higher target SNRM will save you. Does that sound right?
An understanding of the why behind a downward slide shapes the advice that my program will spit out when various different subtypes of error are detected. One thing I would like to think about is whether or not it’s possible to be selective and either advise or not advise raising the target SNRM. Mind you, if uncertain, I can always recommend a list of multiple ‘possibles’ to try.
My god, I cannot believe how long this post has become. Must be some kind of record, even for me.
If you made it this far, then many thanks. All advice and experience greatly welcomed.