Topic: Design of overall DSL health check / summary / alerts program (Read 2198 times)

Weaver · « **on:** May 30, 2019, 05:39:22 AM »

I’m thinking of writing a routine that will scan the stats received from all my modems and do a summary check that detects any problems that are current or are in recent history.

This is the Broadcom CLI xdslctl full stats summary command applied to my ZyXEL VMG 1312 B10A devices, which have our Johnson’s custom firmware and support the ‘Easy Stats’ (as I named it) stats-querying protocol (http-based, using port 8000, login-free, read-only). This is the familiar raw command output with current state, then counters including error counters over various time periods.

I was thinking that I might have to remember previous error counts in order to difference them, but I can see a lot of problems with this dangerous idea. What happens if modems get swapped out, or they get reinitialised so that the state of counters gets reset? How do I even know that I’m seeing the same modem that I saw before? And having to store persistent state info is problematic - would I need to wipe the stored state under certain circumstances? So I’m thinking ‘let the modem do the work’ so I get a much more robust result as I don’t have to try and detect disruptive events such as swap-outs and resets.

I should use the ‘Previous 15-min’ and ‘Latest 1 day’ and ‘Previous 1 day’ counter values. I need some guidance on these, so any help or suggestions? What about alarms / alerts at various levels? Including low levels.

I need to look at the current state at the top of the stats listing. Any suggestions - there are so many things I could be looking out for there. Do I need to look out for ‘abnormal for him’ values on a per-modem basis? Such threshold values, if per modem, could be time-varying or fixed? Suggestions please.

(A doctor told me the secret abbreviation code on the bottom of a patient’s chart which stands for ‘normal for Norfolk’ and sincere apologies for the racism.)

So basically the core problem is which metrics, and which threshold values for them, should trigger alerts at each level, whatever the levels are: ‘yellow alert’, ‘defcon 3’, I don’t know. Suggestions for the number and types of ‘health conditions’ please, as well as the numbers.
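One way to frame the question is as a small table of per-metric thresholds that map a reading to an alert level. The sketch below is only that: the metric names, the three levels and especially the numbers are hypothetical placeholders, not recommendations from anyone in this thread.

```python
from enum import IntEnum


class Level(IntEnum):
    GREEN = 0
    YELLOW = 1
    RED = 2


# Hypothetical placeholder thresholds: (yellow_at, red_at) per metric.
# These numbers are illustrative only and would need tuning per line.
THRESHOLDS = {
    "es_per_day": (1, 500),
    "retrains_per_day": (1, 3),
}


def classify(metric, value):
    """Map one reading to an alert level using the table above."""
    yellow_at, red_at = THRESHOLDS[metric]
    if value >= red_at:
        return Level.RED
    if value >= yellow_at:
        return Level.YELLOW
    return Level.GREEN


def overall(readings):
    """The overall health of a modem is its worst individual level."""
    return max((classify(m, v) for m, v in readings.items()),
               default=Level.GREEN)
```

Per-modem ‘normal for him’ values would fit the same shape by keeping one such table per modem.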

I’m a bit lost and drowning in ignorance and indecision. I need to do a fair bit of thinking and guided problem definition before speccing anything and must resist the hacker’s usual bad habit of charging in and writing code.

I am going to be using regexes a lot to isolate the fields I need and their values. Lovely.
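As a rough illustration of the regex approach, here is a minimal Python sketch that splits the counter part of the output into its time-period sections and pulls out the per-counter values. The sample text is an assumption about what the xdslctl stats output looks like (section header lines ending in "time = ...", then "NAME: value value" rows), and treating the first column as downstream is also an assumption; check both against real output before relying on it.

```python
import re

# Assumed shape of the counter sections; real output may differ.
SAMPLE = """\
Previous 15 minutes time = 15 min 0 sec
FEC:    120    3
CRC:    2      0
ES:     1      0
Latest 1 day time = 5 hours 2 min 10 sec
FEC:    4800   40
CRC:    13     1
ES:     9      1
"""

# A section header such as "Latest 1 day time = 5 hours 2 min 10 sec".
SECTION_RE = re.compile(r"^(?P<name>[A-Za-z0-9 ]+?) time = .*$", re.MULTILINE)
# A counter row such as "CRC:    13     1" (assumed: downstream, then upstream).
COUNTER_RE = re.compile(
    r"^(?P<key>[A-Za-z]+):[ \t]+(?P<down>\d+)[ \t]+(?P<up>\d+)[ \t]*$",
    re.MULTILINE,
)


def parse_sections(text):
    """Return {section name: {counter: (first column, second column)}}."""
    sections = {}
    headers = list(SECTION_RE.finditer(text))
    for i, header in enumerate(headers):
        # A section's rows run until the next header or the end of the text.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[header.end():end]
        sections[header.group("name")] = {
            m.group("key"): (int(m.group("down")), int(m.group("up")))
            for m in COUNTER_RE.finditer(body)
        }
    return sections
```

With the sample above, `parse_sections(SAMPLE)["Latest 1 day"]["CRC"]` gives `(13, 1)`.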

I am also thinking about something to detect the condition ‘link x has dropped recently’ whatever recently means. I could do with a bit of help on this. Unexpected retrains are interesting and instability with lots of retrains during some period is a red alert. Unfortunately I ought to do something about the case where I power off a modem or reset / reboot it myself using CLI commands or whatever. That could all get fairly nightmarish. It would be a real pain having to have a special ‘forget it’, whatever that means, command that tells the system that a modem has been interfered with by me in some way, and that tampering had created the impression that there had been some problem.

Weaver · « **Reply #1 on:** May 30, 2019, 05:02:34 PM »

A question: would any of us signal a very low-level alert if CRC count per some period (what period?) or other was non-zero? Or are many of us used to seeing a certain number of errors and writing that off as ‘normality’?

Errored seconds is the other metric, perhaps fairer because the CRC count is affected by the amount of traffic? Is that correct? Perhaps multiple CRC errors in one short time interval do not really matter?
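To make the difference between the two metrics concrete, here is a small sketch. It uses a simplified definition, assumed for illustration: an errored second is any one-second interval containing at least one CRC error. A burst of errors inside one second therefore raises the CRC count a lot but adds only one errored second.

```python
def crc_and_es(error_times):
    """error_times: timestamps in seconds at which a CRC error occurred.

    Returns (CRC count, errored seconds) under the simplified definition
    that a second is 'errored' if it contains one or more CRC errors.
    """
    times = list(error_times)
    crc_count = len(times)
    errored_seconds = len({int(t) for t in times})
    return crc_count, errored_seconds


# Three errors bunched inside second 10, plus one isolated error at 75 s:
print(crc_and_es([10.1, 10.2, 10.9, 75.0]))  # (4, 2)
```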

If the link is quiet then we don’t want a metric that says all is good simply because there hasn’t been any traffic. Am I misunderstanding this or not?

At least in my case, the link is never totally quiet for very long because of AA’s CQM PPP LCP ‘pinging’.

j0hn · « **Reply #2 on:** May 30, 2019, 05:25:17 PM »

Quote

the CRC count is affected by the amount of traffic?

I'm not familiar with ADSL/ATM but that's definitely not how VDSL2/PTM works.
The amount of traffic makes not 1 iota of difference to the CRC count.

I get similar CRC/ES with a bridge modem in sync with no router connected (so zero traffic) that I get with a router connected and heavily utilising the link.

DslStats can be configured to send email alerts if CRC/ES/FEC per min (or per sample, which is customizable) exceeds the threshold you set.

aesmith · « **Reply #3 on:** May 31, 2019, 04:01:23 PM »

Based on my experience CRC rates are independent of user traffic.

If your lines have worse SNR overnight, then one thing your supervisory control could usefully do is trigger a retrain on high noise margin. That way, if any circuit retrains overnight to a lower speed, your retrain would reset it back to normal.

PhilipD · « **Reply #4 on:** May 31, 2019, 06:19:46 PM »

Hi

Yep, ADSL and VDSL are like a constantly running scheduled train service: even if there are no passengers, the train runs empty. ADSL/VDSL works in the same way; packets are always travelling back and forth at the speed the line supports. When we need to send or receive data, the TCP/IP or UDP packet "gets on board" and then arrives at the other end.

Errors are counted regardless of whether the packet contains data or not, which is why error rates often seem to cause little issue: there is a very good chance they happen on an empty packet, and if they do hit a packet with a data payload, then with TCP/IP it gets requested and sent again anyway.

Regards

Phil

re0 · « **Reply #5 on:** June 01, 2019, 12:23:21 PM »

Except when power saving is involved, then you have L0, L2 and L3 states (only L0, L3 for VDSL AFAIK) which can reduce or stop transmissions.

Based on Phil's analogy:

L0 - being full power - is where the train has maximum amount of carriages that can be hauled based on conditions. Maximum sync with specified parameters from DLM.

L2 - low-power state - is where the trains are still operating, but with the minimum amount of carriages to satisfy the demand of travellers. Reduced sync and reduced transmit power when little or no traffic - power saving for DSLAM. L0->L2 initiated by DSLAM then co-ordinated by DSLAM and CPE to reduce rate and power.

L3 - idle state - is where train services have ceased, but can resume when necessary. No transmissions (no sync) - power saving for DSLAM and CPE. L0->L3 can be initiated from either DSLAM or CPE.

I'm not aware of any state other than L0 being used in the UK, at least in the present day. I just thought it would serve as a good addition to the thread.

Weaver · « **Reply #6 on:** June 01, 2019, 03:44:49 PM »

The thing about such power-saving designs is that the whole system has to be able to work fine in the absolute worst case where everything is running at maximum power consumption and heat generation is maximal. A designer cannot sensibly just hope that they will always be lucky and if system performance is only ok in average conditions then that will always be good enough. I am not even sure how anyone is going to save any money on the electricity bill. Making sure the thing works ok in the worst case means having to incur more costs to meet all the challenges involved in delivering that much power, dealing with that much heat and so on. It sounds like a good way of reducing reliability, by adding complexity, making things more difficult and expensive to test and making the job of reproducing scenarios more difficult. Grrr.

But what do I know.

IanG · « **Reply #7 on:** June 08, 2019, 02:00:18 PM »

Interesting questions, Weaver. I don't know of any way of reading a modem serial number, but if the modem is on a local IPv4 network you could use ARP to recover its MAC address, and use that as an identifier. At least in principle. The details might be tricky.
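As a sketch of the ARP idea on a Linux monitoring host, the kernel's ARP cache can be read as text from /proc/net/arp and searched for the modem's IP address. The sample table and addresses below are made up for illustration, and an entry only appears once the host has recently talked to that address.

```python
# Made-up sample in the layout of Linux's /proc/net/arp.
SAMPLE_ARP = """\
IP address       HW type     Flags       HW address            Mask     Device
192.168.1.1      0x1         0x2         02:00:00:00:00:01     *        eth0
192.168.2.1      0x1         0x0         00:00:00:00:00:00     *        eth0
"""


def mac_for_ip(arp_table_text, ip):
    """Return the MAC address recorded for ip, or None if unknown."""
    for line in arp_table_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 4 and fields[0] == ip:
            mac = fields[3].lower()
            # An all-zero address marks an incomplete (unresolved) entry.
            return None if mac == "00:00:00:00:00:00" else mac
    return None
```

On a real host the text would come from something like `open("/proc/net/arp").read()`.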

As you no doubt know, the Hlog spectrum (among other things) changes when retraining. Otherwise, quoting the G.992.3 standard, "The [Hlog] measurement shall not be updated during showtime". So the change is an unambiguous flag that something has happened.
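One cheap way to spot that change, sketched here under the assumption that the Hlog values have already been extracted as a list of numbers, is to keep a hash of the last Hlog seen and compare it with the current one. The function names are hypothetical, and it does mean storing one small piece of state per modem.

```python
import hashlib


def hlog_fingerprint(hlog_values):
    """Hash a list of per-tone Hlog values (dB) into a short fingerprint."""
    joined = ",".join(f"{v:.4f}" for v in hlog_values)
    return hashlib.sha256(joined.encode("ascii")).hexdigest()


def retrained(previous_fingerprint, hlog_values):
    """Return (changed, new_fingerprint).

    changed is True only when there is an earlier fingerprint to compare
    against and the Hlog data no longer matches it.
    """
    fingerprint = hlog_fingerprint(hlog_values)
    changed = (previous_fingerprint is not None
               and fingerprint != previous_fingerprint)
    return changed, fingerprint
```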

As for recognising anomalies, others know much more than I do. In my limited experience, SNR is the most reliable indicator that performance is becoming marginal. SNR from time to time drops towards a level that would cause the modem to retrain, but then recovers within a few seconds. In those cases, it might not be productive to sound an alarm.
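A simple way to ignore those brief dips is to alarm only when the SNR stays low for several samples in a row. This is a sketch only; the threshold and the number of consecutive samples are parameters you would have to choose, and the values in the example are arbitrary.

```python
def sustained_low_snr(samples, threshold_db, min_consecutive):
    """True if SNR stayed below threshold_db for min_consecutive samples in a row."""
    run = 0
    for snr in samples:
        # Extend the current low run, or start again once the SNR recovers.
        run = run + 1 if snr < threshold_db else 0
        if run >= min_consecutive:
            return True
    return False


# A one-sample dip that recovers does not alarm; a sustained one does.
print(sustained_low_snr([6.1, 2.0, 6.0, 6.2], 3.0, 3))       # False
print(sustained_low_snr([6.1, 2.0, 2.1, 1.9, 6.0], 3.0, 3))  # True
```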

For counters, I use "Total" counts, rather than the latest 15 minutes or latest hour, and log the difference between the current total and the previous total. If the total ever falls (to zero), I reset things. During training, by the way, counts are not reported (on my particular modem).
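The differencing scheme described above can be sketched roughly as follows. The class name is hypothetical; a falling total is treated as a counter reset, so no difference is reported for that sample and the baseline starts again from the new value.

```python
class TotalCounter:
    """Track one 'Total' counter and report the change between samples."""

    def __init__(self):
        self.previous = None

    def update(self, total):
        """Return the increase since the last sample, or None when there is
        no usable baseline (first sample, or the total fell, i.e. a reset)."""
        previous, self.previous = self.previous, total
        if previous is None or total < previous:
            return None
        return total - previous


crc = TotalCounter()
print(crc.update(100))  # None (first sample, nothing to compare with)
print(crc.update(113))  # 13
print(crc.update(0))    # None (total fell, so the baseline is reset)
print(crc.update(4))    # 4
```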

You asked about actual numbers. I am looking now at a 15 hour log with:
   downstream SNR 6.3 dB,
   downstream data rate 15.3 Mbps
   FEC 120 (corrected errors) per second
   13 downstream CRC errors (i.e. roughly one per hour)

A few days earlier, before it retrained to an SNR of 6 dB, a 24 hour record of the same modem showed it running consistently at:
   downstream SNR 3.1 dB
   downstream data rate 17.6 Mbps
   FEC 3000 per second
   1256 downstream CRC errors (i.e. just under one a minute)

Although the error numbers differ greatly, I regard both as acceptable. It shows the tradeoff between error rate and data rate.
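For anyone wanting to check the rates quoted for the two logs, the arithmetic is just the count divided by the log duration:

```python
def rate(count, duration_hours):
    """Return (errors per hour, errors per minute) for a log."""
    per_hour = count / duration_hours
    return per_hour, per_hour / 60


# 13 CRC errors in the 15 hour log, 1256 in the 24 hour log.
print("%.2f per hour" % rate(13, 15)[0])      # 0.87 per hour
print("%.2f per minute" % rate(1256, 24)[1])  # 0.87 per minute
```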

On Regexes, can I suggest that you write a benchmark first, as some expressions can take a lot of CPU to evaluate. I found that out the hard way.
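A benchmark of that kind can be quite small. This sketch uses Python's standard timeit module to compare the average search time of candidate patterns against the same text; the patterns and the sample text are only illustrative.

```python
import re
import timeit


def benchmark(pattern, text, repeats=1000):
    """Average time in seconds for one search of text with pattern."""
    compiled = re.compile(pattern, re.MULTILINE)
    total = timeit.timeit(lambda: compiled.search(text), number=repeats)
    return total / repeats


# Illustrative input: many filler lines, then the line of interest.
text = "FEC: 1 2\n" * 2000 + "CRC: 13 1\n"

for pattern in (r"^CRC:\s+(\d+)", r"(.*\n)*CRC:\s+(\d+)"):
    print(f"{pattern!r}: {benchmark(pattern, text, repeats=100):.6f} s")
```

Running the candidates against a full-size real stats dump, rather than a toy string, is what will show whether any of them is unexpectedly expensive.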

I look forward to you finding a useful warning signal. Until then, I will fall back on my personal alarm of complaints that the internet's not working.

aesmith · « **Reply #8 on:** June 11, 2019, 03:56:59 PM »

Quote from: re0 on June 01, 2019, 12:23:21 PM

I'm not aware of any state other than L0 being used in the UK, at least in the present day. I just thought it would serve as a good addition to the thread.

Does it go into low power in fault conditions? Just wondering because it's one of the parameters returned by BT's KBD, for example "Line operated in Low Power consumption mode (L2) for 0% of analysis period."
