Kitz Forum

Chat => Tech Chat => Topic started by: Weaver on March 22, 2021, 09:44:38 PM

Title: Goodness indicator from modems’ stats
Post by: Weaver on March 22, 2021, 09:44:38 PM
I’m wondering if I can write a quick program that takes full Broadcom stats and derives a summary from it all that gives an "all is well" or "all is not well" indicator. I can already extract stats from all my modems and am thinking of applying regexes to tweeze out the numbers I need. So the program would either come out with a binary outcome or maybe failing that a percentage on a goodness scale.
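A minimal sketch of the regex idea in Python. It assumes each counter appears on its own line as a name, a colon, and two integers with downstream first (for example "ES: 3 0"); the name `parse_counters` is hypothetical, and real modem output may need a looser pattern.

```python
import re

# Assumed line format: counter name, colon, downstream count, upstream count.
COUNTER_RE = re.compile(
    r"^[ \t]*(FEC|CRC|ES|SES|UAS):[ \t]+(\d+)[ \t]+(\d+)[ \t]*$",
    re.MULTILINE,
)

def parse_counters(stats_text):
    """Return {name: {"down": int, "up": int}} for the first match of each counter."""
    counters = {}
    for name, down, up in COUNTER_RE.findall(stats_text):
        # Keep only the first occurrence, i.e. the first bucket in the text.
        counters.setdefault(name, {"down": int(down), "up": int(up)})
    return counters

sample = """FEC: 11358 80
CRC: 3 0
ES: 3 0
SES: 0 0
"""
print(parse_counters(sample)["ES"])
```

Stats dumps usually repeat these counters once per time period, so a fuller version would first split the text into per-period sections and parse each one separately.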

Clearly I want to look at ES and SES counts, but I’m unsure about what number of events per unit time to use as a threshold value between well and unwell. Any suggestions? That or again a sliding scale value of some sort.

I should perhaps also check the current SNRMs too.
Title: Re: Goodness indicator from modems’ stats
Post by: g3uiss on March 22, 2021, 09:46:57 PM
Sounds a useful tool !
Title: Re: Goodness indicator from modems’ stats
Post by: Weaver on March 22, 2021, 10:02:17 PM
Any ideas about what number (per unit time) of ES or SES is reasonable ? Zero ? Or a count of CRCs?
Title: Re: Goodness indicator from modems’ stats
Post by: g3uiss on March 22, 2021, 10:28:14 PM
CRCs don’t always produce ES, so ES is a better measurement. For me, ES over a certain figure per hour, or say per 3 hrs, might relate to a danger of the DLM taking action. I appreciate there is now some debate over what that figure might be, but on a “speed” line it used to be 120/hr. Maybe a caution at 60/hr? SES should be really low; I don’t see any very often here on either VDSL or ADSL. Of course DSLstats has options for alerting at various triggers, but not everyone runs that 24/7.
Title: Re: Goodness indicator from modems’ stats
Post by: Weaver on March 22, 2021, 11:35:25 PM
So a threshold of 60 ES/hr might light up a severe warning light? What about a secondary warning if the count is non-zero?
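As a rough sketch of that two-level idea in Python: the 60 ES/hr figure is the caution level suggested above, while the level names and the function name `es_warning_level` are made up for illustration.

```python
SEVERE_ES_PER_HR = 60  # caution level suggested earlier in the thread

def es_warning_level(es_per_hr, severe=SEVERE_ES_PER_HR):
    """Classify an hourly ES rate as 'severe', 'non-zero' or 'ok'."""
    if es_per_hr >= severe:
        return "severe"      # the main warning light
    if es_per_hr > 0:
        return "non-zero"    # the secondary warning
    return "ok"

for rate in (0, 8, 44, 60, 72):
    print(rate, es_warning_level(rate))
```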

I thought I might ignore the SES value as it implies there will be at least that many ES and it is the latter that I’m triggering on. Sound reasonable?

Do you think it’s worth watching for abnormal SNRM? Both too low and too high? It’s a nuisance but I would have to configure the system to tell it what the expected target SNRM is.

Since I have four modems to check and I have to either mess about with a web browser, or run an existing all-modems report tool, and then scan through masses of pages of irrelevant detail, I’m hoping this will be a useful timesaver.
Title: Re: Goodness indicator from modems’ stats
Post by: Weaver on March 23, 2021, 08:20:56 AM
I’ve written part of the code. It collects the raw stats from all the modems, parses it crudely and pulls out ES up/down records, then converts these numbers into ES per hr so the ES rates are normalised. In my code, it refers to one of the, say, 15 min (or less) ES-counting periods, or up-to-24hr periods, as “buckets”.
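The normalisation step can be sketched in Python as below. This is only an illustration, not the actual program; the function name `es_per_hour` and the handling of a zero-length bucket are assumptions.

```python
def es_per_hour(es_count, bucket_seconds):
    """Scale an ES count seen over bucket_seconds to an hourly rate."""
    if bucket_seconds <= 0:
        # Assumption: a bucket with no elapsed time carries no information.
        return 0.0
    return es_count * 3600 / bucket_seconds

print(es_per_hour(11, 900))  # a full 15-minute bucket
print(es_per_hour(3, 305))   # a partial bucket, about 5 minutes old
```

One thing to watch with this approach: a very young bucket extrapolates heavily, so a single ES in the first 30 seconds already reads as 120 ES/hr.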

Here’s some of the debugging output it produces:

Code:
FEC: 6 5
CRC: 0 0
ES: 0 0
SES: 0 0
UAS: 0 0
LOS: 0 0
LOF: 0 0
LOM: 0 0
modem=[1]; bucket=[Latest 15 minutes time]; dir=[down]; ES_count=[0]; bucket duration =[308]; ES/hr = [0]
modem=[1]; bucket=[Latest 15 minutes time]; dir=[up]; ES_count=[0]; bucket duration =[308]; ES/hr = [0]
FEC: 27 49
CRC: 0 0
ES: 0 0
SES: 0 0
UAS: 0 0
LOS: 0 0
LOF: 0 0
LOM: 0 0
modem=[1]; bucket=[Previous 15 minutes time]; dir=[down]; ES_count=[0]; bucket duration =[900]; ES/hr = [0]
modem=[1]; bucket=[Previous 15 minutes time]; dir=[up]; ES_count=[0]; bucket duration =[900]; ES/hr = [0]
FEC: 104 49
CRC: 0 0
ES: 0 0
SES: 0 0
UAS: 0 0
LOS: 0 0
LOF: 0 0
LOM: 0 0
modem=[2]; bucket=[Latest 15 minutes time]; dir=[down]; ES_count=[0]; bucket duration =[323]; ES/hr = [0]
modem=[2]; bucket=[Latest 15 minutes time]; dir=[up]; ES_count=[0]; bucket duration =[323]; ES/hr = [0]
FEC: 304 59
CRC: 0 0
ES: 0 0
SES: 0 0
UAS: 0 0
LOS: 0 0
LOF: 0 0
LOM: 0 0
modem=[2]; bucket=[Previous 15 minutes time]; dir=[down]; ES_count=[0]; bucket duration =[900]; ES/hr = [0]
modem=[2]; bucket=[Previous 15 minutes time]; dir=[up]; ES_count=[0]; bucket duration =[900]; ES/hr = [0]
FEC: 11 0
CRC: 0 0
ES: 0 0
SES: 0 0
UAS: 0 0
LOS: 0 0
LOF: 0 0
LOM: 0 0
modem=[3]; bucket=[Latest 15 minutes time]; dir=[down]; ES_count=[0]; bucket duration =[127]; ES/hr = [0]
modem=[3]; bucket=[Latest 15 minutes time]; dir=[up]; ES_count=[0]; bucket duration =[127]; ES/hr = [0]
FEC: 98 13
CRC: 0 0
ES: 0 0
SES: 0 0
UAS: 0 0
LOS: 0 0
LOF: 0 0
LOM: 0 0
modem=[3]; bucket=[Previous 15 minutes time]; dir=[down]; ES_count=[0]; bucket duration =[900]; ES/hr = [0]
modem=[3]; bucket=[Previous 15 minutes time]; dir=[up]; ES_count=[0]; bucket duration =[900]; ES/hr = [0]
FEC: 11358 80
CRC: 3 0
ES: 3 0
SES: 0 0
UAS: 0 0
LOS: 0 0
LOF: 0 0
LOM: 0 0
modem=[4]; bucket=[Latest 15 minutes time]; dir=[down]; ES_count=[3]; bucket duration =[305]; ES/hr = [35.40983606557377]
modem=[4]; bucket=[Latest 15 minutes time]; dir=[up]; ES_count=[0]; bucket duration =[305]; ES/hr = [0]
FEC: 31316 241
CRC: 11 0
ES: 11 0
SES: 0 0
UAS: 0 0
LOS: 0 0
LOF: 0 0
LOM: 0 0
modem=[4]; bucket=[Previous 15 minutes time]; dir=[down]; ES_count=[11]; bucket duration =[900]; ES/hr = [44]
modem=[4]; bucket=[Previous 15 minutes time]; dir=[up]; ES_count=[0]; bucket duration =[900]; ES/hr = [0]
--
{"modem":1,"bucket label in stats":"Latest 15 minutes time","dir":"down","ES per hr":0}
{"modem":1,"bucket label in stats":"Latest 15 minutes time","dir":"up","ES per hr":0}
{"modem":1,"bucket label in stats":"Previous 15 minutes time","dir":"down","ES per hr":0}
{"modem":1,"bucket label in stats":"Previous 15 minutes time","dir":"up","ES per hr":0}
{"modem":2,"bucket label in stats":"Latest 15 minutes time","dir":"down","ES per hr":0}
{"modem":2,"bucket label in stats":"Latest 15 minutes time","dir":"up","ES per hr":0}
{"modem":2,"bucket label in stats":"Previous 15 minutes time","dir":"down","ES per hr":0}
{"modem":2,"bucket label in stats":"Previous 15 minutes time","dir":"up","ES per hr":0}
{"modem":3,"bucket label in stats":"Latest 15 minutes time","dir":"down","ES per hr":0}
{"modem":3,"bucket label in stats":"Latest 15 minutes time","dir":"up","ES per hr":0}
{"modem":3,"bucket label in stats":"Previous 15 minutes time","dir":"down","ES per hr":0}
{"modem":3,"bucket label in stats":"Previous 15 minutes time","dir":"up","ES per hr":0}
{"modem":4,"bucket label in stats":"Latest 15 minutes time","dir":"down","ES per hr":35.40983606557377}
{"modem":4,"bucket label in stats":"Latest 15 minutes time","dir":"up","ES per hr":0}
{"modem":4,"bucket label in stats":"Previous 15 minutes time","dir":"down","ES per hr":44}
{"modem":4,"bucket label in stats":"Previous 15 minutes time","dir":"up","ES per hr":0}
--
Severely bad (>=60ES / hr):

--
Non-zero:
{"modem":4,"bucket label in stats":"Latest 15 minutes time","dir":"down","ES per hr":35.40983606557377}
{"modem":4,"bucket label in stats":"Previous 15 minutes time","dir":"down","ES per hr":44}

Currently the program looks only at the two most recent 15 min (max) duration buckets, and doesn’t inspect the longer eg. 24-hour ones. I think I should do something about this. I had initially thought that "most recent is most relevant" but now I’m having my doubts. What do you think?

It has already proven its worth because it has detected a problem with line 4; the downstream SNRM is down to ~1.4 dB which appears to be a bad thing. You can see the problem in the ES downstream for line 4.

I need to distill the output down further as well, producing an additional summary that clearly shows an “action vs no action” indicator to the user, plus individual warning light-type indicators per modem. I could do with some guidance on this. Anyone up for helping me out?
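One possible shape for that summary, sketched in Python. It assumes records shaped like the JSON lines in the debugging output (with "modem" and "ES per hr" keys), takes the worst level seen per modem as that modem's warning light, and treats any non-zero rate as "action". The names `level` and `summarise` and the light labels are hypothetical.

```python
from collections import defaultdict

RANK = {"ok": 0, "warn": 1, "severe": 2}

def level(es_per_hr, severe=60):
    """Map an hourly ES rate to a warning-light level."""
    if es_per_hr >= severe:
        return "severe"
    return "warn" if es_per_hr > 0 else "ok"

def summarise(records, severe=60):
    """Return (action_needed, {modem: worst level seen across its buckets})."""
    lights = defaultdict(lambda: "ok")
    for rec in records:
        lvl = level(rec["ES per hr"], severe)
        # Reading lights[...] also registers modems that are entirely clean.
        if RANK[lvl] > RANK[lights[rec["modem"]]]:
            lights[rec["modem"]] = lvl
    action_needed = any(lvl != "ok" for lvl in lights.values())
    return action_needed, dict(lights)

records = [
    {"modem": 1, "dir": "down", "ES per hr": 0},
    {"modem": 4, "dir": "down", "ES per hr": 35.4},
    {"modem": 4, "dir": "down", "ES per hr": 44},
]
print(summarise(records))
```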
Title: Re: Goodness indicator from modems’ stats
Post by: Weaver on March 23, 2021, 08:45:21 PM
I’ve stripped it down to remove all the debugging output and this is what the output looks like now:

Code:
Severely bad (>=60ES / hr):
{modem: 3, Previous 15 minutes time, dir: up, ES per hr: 72}

--
Non-zero:
{modem: 1, Latest 15 minutes time, dir: down, ES per hr: 4}
{modem: 1, Previous 15 minutes time, dir: down, ES per hr: 32}
{modem: 2, Previous 15 minutes time, dir: down, ES per hr: 8}
{modem: 3, Latest 15 minutes time, dir: up, ES per hr: 45}
{modem: 3, Previous 15 minutes time, dir: down, ES per hr: 8}
{modem: 3, Previous 15 minutes time, dir: up, ES per hr: 72}
{modem: 4, Previous 15 minutes time, dir: down, ES per hr: 28}

I still have to write something to assess the overall state and assess the SNRMs down and upstream.
Title: Re: Goodness indicator from modems’ stats
Post by: Weaver on March 24, 2021, 10:23:05 PM
Extended it and cleaned up the clutter in this v1.00 beta. Note that it has found a problem with line 3.

* Summary of DSL links’ wellbeing and error counts
---------------------------------------------------

*** There is some badness; all is not well ! ***

* Modems with severe error problems:  ≥ 60 ES / hr:  None

--
* Modems with SNRM too low/high:  None
(Assuming an expected target SNRM of 6 dB downstream, 6 dB upstream)

--
* Modems with a few errors:
modem: 3 downstream, ES per hr: 1.4, Latest 15 minutes time
modem: 3 upstream, ES per hr: 8.6, Latest 15 minutes time
modem: 3 downstream, ES per hr: 4.4, Previous 15 minutes time
modem: 3 upstream, ES per hr: 11.2, Previous 15 minutes time
Title: Re: Goodness indicator from modems’ stats
Post by: Weaver on March 25, 2021, 08:30:18 PM
In that last example output, I had written the code to assess the SNRM figures as too low or too high. There are warnings if below 67% or above 150% of the target SNRM in each case for down or upstream. Do those percentages seem reasonable?

So if the target d/s SNRM is 6 dB and the current SNRM drops below 4 dB then you get a warning listed. And if the SNRM goes up from 6 dB to above 9 dB then you also get a warning.
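The check described above can be sketched in Python as follows. The 0.67 and 1.5 ratios are the ones quoted in this post; the function name `snrm_status` and the strict comparisons at the boundaries are assumptions.

```python
def snrm_status(current_db, target_db, low_ratio=0.67, high_ratio=1.5):
    """Compare the current SNRM with the expected target SNRM."""
    if current_db < target_db * low_ratio:
        return "too low"
    if current_db > target_db * high_ratio:
        return "too high"
    return "ok"

for current in (3.9, 6.0, 9.0, 9.1):
    print(current, snrm_status(current, target_db=6.0))
```

With strict comparisons, a reading of exactly 9 dB against a 6 dB target is still "ok"; only values above 9 dB are flagged.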

For the low threshold, I just picked some number that seems reasonable from experience with 3dB d/s: if the level drops from 3dB down to below 2dB then that’s when I start to sometimes see CRC errors, hence the 67%. The low threshold must not be so high that it triggers in routine daily variation just because it’s nighttime unless there is some kind of abnormal variation going on due to a fault or interference. Does 67% sound reasonable for the daily variation thing? What’s the minimum daily downward variation you see that’s reasonable when all is normal? (Such as that in the droop during the nighttime.)

I’m also not so sure about the latter high threshold ratio of 1.5; I just picked a number out of thin air and I could do with some guidance. The way I wanted to look at it for the latter was: when is the SNR too high because of DLM having taken action? - because that is something that you do want to be alerted to. I might reduce this ratio to 1.2 because it occurs to me that going from normal 9dB to 12dB could be the next step up due to DLM, which is a ratio of 1.333 so would not trigger a warning, which is the wrong behaviour.
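To see how the two candidate ratios behave, here is a small Python check. It assumes that one DLM step raises the target SNRM by 3 dB (as in the 6 to 9 and 9 to 12 examples) and that the warning fires only when the SNRM is strictly above target times ratio; `step_triggers` is a hypothetical name.

```python
DLM_STEP_DB = 3  # assumption: one DLM step raises the target SNRM by 3 dB

def step_triggers(target_db, high_ratio, step_db=DLM_STEP_DB):
    """True if an SNRM one DLM step above target_db exceeds the high threshold."""
    return (target_db + step_db) > target_db * high_ratio

for ratio in (1.5, 1.2):
    for target in (3, 6, 9):
        print(f"ratio={ratio} target={target} dB flagged={step_triggers(target, ratio)}")
```

Under these assumptions a ratio of 1.5 misses the 9 to 12 dB step, while 1.2 catches a step from any of the three targets. A fixed margin in dB above the target might behave more evenly across targets than a ratio, but that is only a suggestion.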

Any guidance as to how much higher than the target SNRM values are seen at certain times? For example, after a resync, or when you had a resync in the night and then later it’s daytime and conditions have improved so the SNR goes right up. I don’t want to report a warning on this latter type of normal variation.
Title: Re: Goodness indicator from modems’ stats
Post by: Weaver on March 30, 2021, 06:15:49 AM
I’m wondering if I should be interested in the "since link time" error count ? What do you think?

It would be interesting to be alerted to past serious problems that have since gone away and so don’t show up in the two most recent 15-mins-max collection buckets. I thought about using the "since link time" count rather than the 1 day counts because the 1 day count might include a short period of intense errors to do with a link dropping or being forcibly brought down, and I don’t want to include error counts that arose only from the link drop and won’t be relevant later on.
Title: Re: Goodness indicator from modems’ stats
Post by: Weaver on April 05, 2021, 11:40:20 PM
At the moment I’m only looking at ES counts and not SES. See however Chrysalis’ opinion at https://forum.kitz.co.uk/index.php/topic,25752.msg433233.html#msg433233

Quote
To me the SES is what stands out. In my experience of years and years on DSL, my rule of thumb is ES on their own are not usually service affecting (unless they trigger DLM), but SES usually are, and yep you have SES on that upstream.

If you were getting a steady flow of ES but each ES was maybe just 1 or 2 CRC, you probably would have no red on your graph and wouldnt notice it, but looks like it’s coming in large bursts.

So should I really be considering SES too? I’m assuming that wherever there are SES there are also ES. (By definition??) But I have a ‘more severe’ category for an ES count above a certain threshold and I’m wondering if any non-zero SES counts should automatically generate a report in that same ‘more severe’ category?
Title: Re: Goodness indicator from modems’ stats
Post by: kitz on April 07, 2021, 11:44:35 PM
The main item that I am personally concerned about on my line is E/Secs.   
As to what is acceptable, then I'd use the figures applicable to your relevant DLM profile as a guideline and then take a bit more off for your own safety margin - See the coloured table DLM - categorising your line (https://kitz.co.uk/adsl/DLM.htm#DLM_categorising_the_line) 
 
SES indicates a concern that the line is struggling with >30% packet loss.  It's not unusual for a line to drop out after a few consecutive SES.    By the time you get a warning about SES (unless its the odd one from burst noise) then the line may have already lost sync.

iirc your line has re-tx so you may want to monitor for LEFTRS, which is kind of the G.INP version of E/S.

Warnings about SNRM are useful.  What level you set it at depends upon how your line performs.   It's probably of interest if the line is consistently swinging or if it changes by several dB and stays there - indicating that a manual resync may be of benefit.
Title: Re: Goodness indicator from modems’ stats
Post by: Weaver on April 08, 2021, 12:47:35 AM
Thanks for getting back to me, Lesley. I don’t have LEFTRS, presumably because I only have PhyR, not the full standard G.INP.

1. I wondered if I should count situations where there is an SES event, but no ES event. Is that even possible? I don’t count SES events now, so such an event would go unnoticed.

2. I also wondered if I should consider either the weighted totals, for some value of k (but I have no idea what), of either (ES + SES * k),  where k ≥ 0 and k ≠ 1 or where k ≥ 0
Title: Re: Goodness indicator from modems’ stats
Post by: burakkucat on April 08, 2021, 04:10:53 PM
Quote
1. I wondered if I should count situations where there is an SES event, but no ES event. Is that even possible?

I don't think that can be possible. If there is an SES event then there must also be ES.

Quote
2. I also wondered if I should consider either the weighted totals, for some value of k (but I have no idea what), of either (ES + SES * k),  where k ≥ 0 and k ≠ 1 or where k ≥ 0

Surely you are just suggesting where k != 1 . . . b*cat is a little confused.  ???
Title: Re: Goodness indicator from modems’ stats
Post by: Weaver on April 09, 2021, 02:24:48 AM
> I don't think that can be possible. If there is an SES event then there must also be ES.

Agreed. An SES cannot contain anything other than what is also recorded as an ES.

The idea is that some constant k ≠ 1 might be used to prioritise an SES event as being more important than a mere ES, which is taken from a suggestion by Chrysalis; see Chrysalis’ opinion at https://forum.kitz.co.uk/index.php/topic,25752.msg433233.html#msg433233
Quote
To me the SES is what stands out. In my experience of years and years on DSL, my rule of thumb is ES on their own are not usually service affecting (unless they trigger DLM), but SES usually are, and yep you have SES on that upstream.

A weighted badness score, something like w = ES + SES * k, where k > 1. Each event collection bucket’s w would then fall into one of three ranges: 0 < w < w1, then w1 ≤ w < w2, and finally w ≥ w2. The report would draw attention to the presence or absence of any buckets in each of those ranges.
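A sketch of that weighted score in Python. The values k = 10, w1 = 20 and w2 = 60 are placeholders chosen purely for illustration, since no actual numbers have been settled; the function and band names are hypothetical too.

```python
def badness(es, ses, k=10):
    """Weighted badness w = ES + k * SES; k = 10 is only a placeholder."""
    return es + k * ses

def band(w, w1=20, w2=60):
    """Place a badness score into one of the ranges described above."""
    if w <= 0:
        return "none"
    if w < w1:
        return "mild"       # 0 < w < w1
    if w < w2:
        return "moderate"   # w1 <= w < w2
    return "severe"         # w >= w2

for es, ses in ((0, 0), (5, 0), (5, 2), (40, 3)):
    w = badness(es, ses)
    print(es, ses, w, band(w))
```

If every SES is also counted as an ES, as suggested above, then each SES effectively contributes 1 + k to w.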
Title: Re: Goodness indicator from modems’ stats
Post by: Weaver on April 09, 2021, 02:51:49 AM
Currently I have only four collection buckets for ES, and none for SES; the first two counts are (i) for zero ES in the most recent 15 mins (max) bucket, and (ii) the preceding 15 mins bucket then (iii) ES below a certain threshold, finally (iv) ES above that threshold in those buckets. So, to summarise: error categories: (a) zero errors, (b) mild errors and (c) more serious error counts. But currently I don’t take into account long-term event counts that are no longer relevant, because I don’t want to include error bursts that are very old news or error bursts that are related only to link-up or link-down events. The problem with that approach is that it doesn’t take into account repeating problem bursts which might be old news now but whose causes will return to create further trouble later on. Separate from that I report deviant SNRM values.
Title: Re: Goodness indicator from modems’ stats
Post by: kitz on April 09, 2021, 11:43:37 PM
Sorry you totally lost me with what you were trying to do. Brain fog as I try to read it...  which I've done no less than a dozen times :(

Yes SES contain at least 1 ES....  in the same way that ES will contain coding violations - CRC or a LOS.
Do you need to add them? Why not treat them separately?

Whilst I understand what Chrys is saying in that SES stand out.   But there are many a time when E/S become service affecting on my line without recording any SES.    In fact before I set up DSLstat alarms, I would realise something was wrong with my line just through general browsing and how long it takes for pages to load.   Even now if I'm seeing patches of slow and unresponsive websites I still automatically go and check to make sure I'm not getting ES and my line isn't stuck generating hundreds of Err Secs.     Ironically just this morning I was experiencing one of those patchy times when surfing became slow.   I looked at Err Secs and hardly had any so I was puzzled...   then I saw I was flatlining 300,000 FECs per min.   It's probably higher because 300,000 is the max DSLstats will graph at.    I rebooted my modem and all was well again.   This line does do well on g.inp - it performs even worse when I get the high stuck error problem  - everything backs up datawise and surfing can practically grind to a halt.  Yet it doesn't cause a single SES.     So in my experience, it's the amount of errors per min that matters.
Title: Re: Goodness indicator from modems’ stats
Post by: Weaver on April 10, 2021, 02:12:19 PM
> But there are many a time when E/S become service affecting on my line without recording any SES.    In fact before I set up DSLstat alarms, I would realise something was wrong with my line just through general browsing and how long it takes for pages to load.   

That’s good to know. That was my feeling too.

Apologies for the brain fog, not a prize for great writing. :) I was just trying to isolate the higher ES rate problems from the very mild problems per line, then broken down according to how recent the ES were - in the most recent up-to-15 mins (‘bucket’) period or the 15 min period preceding that.

Here’s an example of the output:
Code:

 *** Summary of DSL links’ wellbeing and error counts ***
============================================================

-----------------------------------------------------
  *** There is some BADNESS; all is not well !  : -(  ***
-----------------------------------------------------

* Modems with severe error problems:  ≥ 60 ES / hr:  None

--
* Modems with SNRM too low/high:  None
(Assuming an expected target SNRM of 6 dB downstream, 6.5 dB upstream)

--
* Modems with a few errors:
modem: 1 downstream, ES per hr: 8, Previous 15 minutes time
modem: 1 upstream,      ES per hr: 16, Previous 15 minutes time



Title: Re: Goodness indicator from modems’ stats
Post by: kitz on April 11, 2021, 03:47:39 PM
Sounds good...  this is what I currently do for my alerts.


Whilst it's ES that I'm really interested in and not bothered too much about individual CRCs..... on my own line I've tended to find that if my modem has a large spike of CRCs per min, then that is the type of "badness" to trigger the type of ES I need to know about immediately.  If I have 2 consecutive alarm emails then I know for sure something has gone wrong.    These alert emails go to my phone and I can also reboot the modem remotely if I'm out.

If I was monitoring it by ES per 15min...  then I could have gone too far into that time slot to be able 'rescue' the line from the effects of DLM. 
Title: Re: Goodness indicator from modems’ stats
Post by: Weaver on April 12, 2021, 10:03:49 AM
Wow, that’s really sophisticated. I didn’t know about the capabilities of that application, but then I don’t have a windows box any more - only iPads and a Raspberry Pi.
Title: Re: Goodness indicator from modems’ stats
Post by: kitz on April 12, 2021, 10:27:40 AM
I'm pretty sure DSLstats works on RPi the same way.
Title: Re: Goodness indicator from modems’ stats
Post by: kitz on April 12, 2021, 10:47:54 AM
When it comes to alarms, I don't think there's a single strict set of criteria for what is good or bad that applies to all lines; it's more about knowing what is normal for your own line.

SNRM swing is the main parameter that will vary from line to line, but when it comes to error alerts, I had to think how best to be advised asap that my line had entered into one of its phases of oscillating errors.       By the time I got a warning about x ES in a 15 min period it would often be too late to salvage the line.   I wasn't interested in just the odd ES either.   So for me it was looking at the type of CRCs that caused the ES, and then realising that if the CRCs constantly went above 200 per min, then those types of errors needed intervention to stop DLM noticing before I'd had a chance to do something about it first.
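A per-minute CRC alarm along those lines could be sketched in Python like this. The 200 CRC/min threshold and the "two consecutive alarms" rule come from the posts above; treating each per-minute sample as one alarm, and the function name `consecutive_alarms`, are assumptions for illustration and not how DSLstats itself works.

```python
def consecutive_alarms(crc_per_min, threshold=200, needed=2):
    """Return the index of the sample completing `needed` consecutive
    over-threshold minutes, or None if that never happens."""
    run = 0
    for i, crc in enumerate(crc_per_min):
        # Extend the run on an over-threshold minute, otherwise reset it.
        run = run + 1 if crc > threshold else 0
        if run >= needed:
            return i
    return None

samples = [0, 250, 10, 300, 400, 5]  # made-up CRC counts per minute
print(consecutive_alarms(samples))
```

Because this looks at one-minute samples, it can react within a couple of minutes instead of waiting for a 15 minute ES bucket to fill.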
Title: Re: Goodness indicator from modems’ stats
Post by: Weaver on April 12, 2021, 11:37:39 PM
Good to know about the Pi. I don’t have a display or keyboard connected to mine, it’s telnet-only. And I have four modems. That’s why our own Johnson’s firmware for the ZyXEL modems is so helpful to me; I only need a web browser. I’ve written a lot of tools in iOS Shortcuts for the iPad which make use of the Johnson protocol, and this overall state summary report is just the latest.

If you had to put an ES per hour figure into one of several categories, say, approximately

1. no errors, zero
2. low error count that is non-zero
3. problem error count, and maybe also
4. severe problem error count

what might those numbers be for your line, and how many /which of those categories do you think might be useful? Probably too many?