Design of overall DSL health check / summary / alerts program


Weaver:
I’m thinking of writing a routine that will scan the stats received from all my modems and do a summary check that detects any problems that are current or are in recent history.

The stats come from the Broadcom CLI xdslctl full stats summary command, run on my ZyXEL VMG 1312 B10A devices, which have our Johnson’s custom firmware and support the ‘Easy Stats’ (as I named it) stats-querying protocol (HTTP-based, using port 8000, login-free, read-only). This is the familiar raw command output: current state first, then counters, including error counters, over various time periods.

I was thinking that I might have to remember previous error counts in order to difference them, but I can see a lot of problems with this dangerous idea. What happens if modems get swapped out, or they get reinitialised so that the state of counters gets reset? How do I even know that I’m seeing the same modem that I saw before? And having to store persistent state info is problematic - would I need to wipe the stored state under certain circumstances? So I’m thinking ‘let the modem do the work’ so I get a much more robust result as I don’t have to try and detect disruptive events such as swap-outs and resets.

I should use the ‘Previous 15-min’ and ‘Latest 1 day’ and ‘Previous 1 day’ counter values. I need some guidance on these, so any help or suggestions? What about alarms / alerts at various levels? Including low levels.
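A minimal sketch of pulling the per-period counters out of the raw text with regexes. The sample text, the header wording ("Previous 15 minutes time = ...") and the column order (downstream first, then upstream) are assumptions about the xdslctl layout and should be checked against real output from the modem; `parse_periods` is a made-up name.

```python
import re

# Illustrative sample only: the real xdslctl layout is assumed, not quoted.
SAMPLE = """\
Previous 15 minutes time = 15 min 0 sec
FEC:    310    9
CRC:    0      0
ES:     0      0
Latest 15 minutes time = 7 min 12 sec
FEC:    120    4
CRC:    3      0
ES:     2      0
Latest 1 day time = 9 hours 7 min 12 sec
FEC:    9051   212
CRC:    41     1
ES:     17     1
Previous 1 day time = 24 hours 0 sec
FEC:    20114  530
CRC:    12     0
ES:     9      0
Since Link time = 2 days 3 hours 1 min 5 sec
FEC:    50000  900
CRC:    60     2
ES:     30     2
"""

HEADER = re.compile(r"^(Latest|Previous) (15 minutes|1 day) time = ")
COUNTER = re.compile(r"^(FEC|CRC|ES|SES|UAS):\s+(\d+)\s+(\d+)\s*$")


def parse_periods(text):
    """Return {period name: {counter name: (downstream, upstream)}}."""
    periods = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            name = f"{header.group(1)} {header.group(2)}"
            current = periods.setdefault(name, {})
            continue
        if " time = " in line:
            # Some other section (e.g. "Since Link"): stop collecting.
            current = None
            continue
        counter = COUNTER.match(line)
        if counter and current is not None:
            current[counter.group(1)] = (int(counter.group(2)), int(counter.group(3)))
    return periods


if __name__ == "__main__":
    for name, counters in parse_periods(SAMPLE).items():
        print(name, counters)
```

Because each period arrives already totalled by the modem, nothing has to be remembered between polls.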

I need to look at the current state at the top of the stats listing. Any suggestions? There are so many things I could be looking out for there. Do I need to look out for ‘abnormal for him’ values on a per-modem basis? Should such thresholds, if per modem, be time-varying or fixed? Suggestions please.

(A doctor told me the secret abbreviation code on the bottom of a patient’s chart which stands for ‘normal for Norfolk’ and sincere apologies for the racism.)
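One stateless way to approach ‘abnormal for him’ is to let each modem supply its own baseline: compare the error rate in ‘Latest 1 day’ with the rate in ‘Previous 1 day’. The sketch below assumes the elapsed time of each period is available as text such as "9 hours 7 min 12 sec" (an assumed format). The function names, the factor of 3 and the floor of one event per hour are placeholders for illustration, not recommendations.

```python
import re

UNIT_SECONDS = {
    "day": 86400, "days": 86400,
    "hour": 3600, "hours": 3600,
    "min": 60,
    "sec": 1,
}


def parse_duration(text):
    """Convert e.g. '9 hours 7 min 12 sec' to a number of seconds."""
    total = 0
    for value, unit in re.findall(r"(\d+)\s*(days?|hours?|min|sec)", text):
        total += int(value) * UNIT_SECONDS[unit]
    return total


def abnormal_for_this_modem(latest_count, latest_secs,
                            previous_count, previous_secs,
                            factor=3.0, floor_per_hour=1.0):
    """True if the latest hourly rate exceeds `factor` times the baseline.

    The baseline is the previous period's hourly rate, but never less than
    `floor_per_hour`, so a very clean previous day does not turn a handful
    of errors into an alert.
    """
    if latest_secs <= 0 or previous_secs <= 0:
        return False
    latest_rate = latest_count * 3600 / latest_secs
    baseline = max(previous_count * 3600 / previous_secs, floor_per_hour)
    return latest_rate > factor * baseline


if __name__ == "__main__":
    latest = parse_duration("9 hours 7 min 12 sec")
    previous = parse_duration("24 hours 0 sec")
    print(abnormal_for_this_modem(17, latest, 9, previous))
    print(abnormal_for_this_modem(200, latest, 9, previous))
```

This gives a threshold that varies with each line's own recent behaviour, at the cost of missing a slow deterioration spread over many days.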

So basically the core problem is: what thresholds, on which values, should trigger alerts at each level, whatever those levels are (‘yellow alert’, ‘defcon 3’, I don’t know)? Suggestions please for the number and types of ‘health conditions’ as well as the numbers.
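As a starting point for discussion, the levels can be kept in a small rule table so that the numbers are easy to argue about and change. Everything here is hypothetical: the four level names, the choice of metrics and every threshold value are placeholders, not figures taken from any standard or from experience on these lines.

```python
from enum import IntEnum


class Level(IntEnum):
    GREEN = 0
    YELLOW = 1
    AMBER = 2
    RED = 3


# Placeholder rules for one 15-minute period:
# (metric, minimum count that triggers the rule, resulting level)
RULES_15_MIN = [
    ("SES", 1, Level.RED),
    ("ES", 30, Level.AMBER),
    ("ES", 1, Level.YELLOW),
    ("CRC", 1, Level.YELLOW),
]


def classify(counters, rules=RULES_15_MIN):
    """Return the worst level triggered by any rule; GREEN if none fire."""
    worst = Level.GREEN
    for metric, minimum, level in rules:
        if counters.get(metric, 0) >= minimum:
            worst = max(worst, level)
    return worst


if __name__ == "__main__":
    print(classify({"ES": 0, "CRC": 0}).name)
    print(classify({"ES": 2, "CRC": 5}).name)
    print(classify({"ES": 45, "SES": 0}).name)
    print(classify({"SES": 3}).name)
```

A separate rule table per period (15 minutes, 1 day) and per direction would follow the same pattern.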

I’m a bit lost and drowning in ignorance and indecision. I need to do a fair bit of thinking and guided problem definition before speccing anything and must resist the hacker’s usual bad habit of charging in and writing code.

I am going to be using regexes a lot to isolate the fields I need and their values. Lovely.

I am also thinking about something to detect the condition ‘link x has dropped recently’ whatever recently means. I could do with a bit of help on this. Unexpected retrains are interesting and instability with lots of retrains during some period is a red alert. Unfortunately I ought to do something about the case where I power off a modem or reset / reboot it myself using CLI commands or whatever. That could all get fairly nightmarish. It would be a real pain having to have a special ‘forget it’, whatever that means, command that tells the system that a modem has been interfered with by me in some way, and that tampering had created the impression that there had been some problem.
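One possible way to define ‘dropped recently’ without storing anything is to compare how long the link has been up with how long the modem itself has been up. If both are short and nearly equal, the link came up straight after a boot, which suggests a reboot or power cycle rather than a retrain on a running modem. The sketch assumes both uptimes can be obtained in seconds (for example from a ‘Since Link’ time and the system uptime, neither of which is confirmed here). The names and the default windows are hypothetical.

```python
def classify_link_event(link_up_secs, modem_up_secs,
                        recent_secs=3600, boot_slack_secs=300):
    """Classify the most recent link start.

    'stable'  : the link has been up longer than the 'recent' window
    'reboot'  : the link came up within boot_slack_secs of the modem booting
    'retrain' : the link restarted while the modem stayed up
    """
    if link_up_secs >= recent_secs:
        return "stable"
    if modem_up_secs - link_up_secs <= boot_slack_secs:
        return "reboot"
    return "retrain"


if __name__ == "__main__":
    print(classify_link_event(7200, 90000))   # link up for two hours
    print(classify_link_event(600, 700))      # modem booted ~12 minutes ago
    print(classify_link_event(600, 90000))    # modem up a day, link only 10 min
```

This cannot tell a deliberate power-off from a power cut, and a retrain issued by hand from the CLI still looks like a retrain, so it reduces the need for a ‘forget it’ command rather than removing it.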

Weaver:
A question: would any of us signal a very low-level alert if CRC count per some period (what period?) or other was non-zero? Or are many of us used to seeing a certain number of errors and writing that off as ‘normality’?

Errored seconds is the other metric, perhaps fairer because the CRC count is affected by the amount of traffic? Is that correct? Perhaps multiple CRC errors in one short time interval do not really matter?

If the link is quiet then we don’t want a metric that says all is good simply because there hasn’t been any traffic. Am I misunderstanding this or not?

At least in my case, the link is never totally quiet for very long because of AA’s CQM PPP LCP ‘pinging’.

j0hn:

--- Quote ---the CRC count is affected by the amount of traffic?
--- End quote ---

I'm not familiar with ADSL/ATM but that's definitely not how VDSL2/PTM works.
The amount of traffic makes not 1 iota of difference to the CRC count.

I get similar CRC/ES with a bridge modem in sync with no router connected (so zero traffic) as I get with a router connected and heavily utilising the link.

DslStats can be configured to send email alerts if CRC/ES/FEC per min (or per sample, which is customizable) exceeds the threshold you set.

aesmith:
Based on my experience CRC rates are independent of user traffic.

If your lines have worse SNR overnight then one thing your supervisory control could usefully do is trigger a re-train on high noise margin, that way if any circuit retrains overnight to a lower speed, your retrain would reset it back to normal.
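A rough sketch of that decision, assuming the current noise margin, the target margin, the current sync rate and the line's usual sync rate are all known. The rate check is an addition of this sketch (so that a line already at its usual speed is left alone), and the 3 dB and 5% figures are placeholders. How the retrain is actually triggered on the modem is deliberately left out.

```python
def should_retrain(current_margin_db, target_margin_db,
                   current_rate_kbps, usual_rate_kbps,
                   margin_excess_db=3.0, rate_shortfall=0.05):
    """Suggest a retrain when the margin is well above target AND the
    sync rate is noticeably below what this line usually achieves."""
    margin_high = (current_margin_db - target_margin_db) >= margin_excess_db
    rate_low = current_rate_kbps < usual_rate_kbps * (1 - rate_shortfall)
    return margin_high and rate_low


if __name__ == "__main__":
    # Synced low overnight; noise has since cleared, margin has risen.
    print(should_retrain(9.8, 6.0, 2900, 3300))
    # Margin is high but the line is already at its usual speed.
    print(should_retrain(9.8, 6.0, 3300, 3300))
```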

PhilipD:
Hi

Yep, ADSL and VDSL are like a constantly running scheduled train service: even if there are no passengers, the train runs empty. ADSL/VDSL works in the same way: packets are always travelling back and forth at the speed the line supports. When we need to send or receive data, the TCP/IP or UDP packet "gets on board" and then arrives at the other end.

Errors are counted regardless of whether the packet contains data or not, which is why error rates often seem to cause little trouble: there is a very good chance they happen on an empty packet, and if they do hit a packet with a data payload then, with TCP/IP, it gets requested and sent again anyway.

Regards

Phil
