Topic: Algorithm for resyncs + modem reboots count assessment and alerting (Read 1622 times)

Weaver · « **on:** June 27, 2022, 02:10:06 AM »

[Sorry that this post has grown sooo long.]

I’ve been thinking about a method of showing a modem-monitoring alert if there has been one or more resyncs since some reference time and also any reboots due to crashes or power loss. Could anyone advise me about a suitable algorithm?

I’m not at all sure that this is feasible.

I would very much like to develop some mechanism of declaring certain reboots, power loss instances, or forced resyncs as ‘intended’. That’s not going to be at all easy.

Firstly the tools I have available for this in the framework I’m using, "Shortcuts" on my iPad, are a bit limited. My software that’s doing this, running on an iPad, is not running continuously, and the iPad is sometimes asleep, so I can’t implement any continuous monitoring function. I use custom ZyXEL modem-internal software written by kitizen Johnson, and this runs continuously in the modem as a service process, monitoring stats, writing information in a logfile, and providing a (second) http server which my iPad can query. I mention this because this limits the possible techniques I can use in an algorithm.

Secondly there’s the practical consideration that I have to remember to declare intended forced resyncs and reboots as not unexpected. That would mean somehow setting the reference or ‘since’-time to now after the intentional resync/reboot. I would also want to declare the reason for, or cause of the resync/reboot.
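A minimal sketch of the "declare as intended" step described above, written in Python only to pin down the logic (it would need translating into whatever Shortcuts allows). The names `MonitorState`, `IntendedEvent` and `declare_intended` are hypothetical, not part of any existing tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IntendedEvent:
    kind: str            # e.g. "resync" or "reboot"
    reason: str          # free-text cause supplied by the user
    declared_at: datetime


@dataclass
class MonitorState:
    reference_time: datetime                  # the 'since' time used by later checks
    intended: list = field(default_factory=list)

    def declare_intended(self, kind, reason, now=None):
        """Record an intentional event and move the reference time to now."""
        now = now or datetime.now(timezone.utc)
        self.intended.append(IntendedEvent(kind, reason, now))
        self.reference_time = now
        return self.reference_time
```

This assumes the declaration is made after the intentional resync or reboot has finished; declaring it beforehand would leave the event itself later than the reference time.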

Then there would be some procedure for checking whether there has been an unexpected event more recently than the stored ‘reference time’. My modems give link-uptime, so detecting recent resyncs is easy. I’m not sure about obtaining uptime since cold boot and/or warm-reboot.
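The link-uptime check could be sketched as follows, assuming the modem reports link uptime in seconds and that the checking device's own clock supplies "now". The function names and the `slack_s` allowance are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def link_up_since(now, link_uptime_s):
    """Wall-clock time at which the current DSL link came up."""
    return now - timedelta(seconds=link_uptime_s)


def resync_since(reference_time, now, link_uptime_s, slack_s=5):
    """True if the link came up later than the reference time.

    slack_s absorbs small timing errors between reading the uptime
    and reading the clock.
    """
    return link_up_since(now, link_uptime_s) > reference_time + timedelta(seconds=slack_s)
```

This only says that at least one resync happened after the reference time. It cannot count several resyncs, because link uptime only reflects the most recent one.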

I’m also not at all sure about counting multiple such events and their times; not sure if I can obtain that info reliably. I’m not polling the modem’s status, as mentioned earlier. The custom Johnson software that I’m running in the modems does so though, and maintains a log file, which might provide part or all of the answer to this problem.

I need to also have a means of optionally setting another kind of ‘since when’ / reference time, so that I can ignore events that have already been seen and noted. When one unexpected event has been noted, then I want to know if there is a later second instance, rather than noting the same one twice. Saving the time of every unexpected badness event will help, and perhaps I need to save the ‘last time I checked’ time.
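One possible way to avoid noting the same event twice is to keep a single "seen up to" marker alongside the saved event times. A sketch, with hypothetical function names and event times held as epoch seconds:

```python
def unseen_events(event_times, seen_up_to):
    """Events strictly later than the last acknowledged time, oldest first."""
    return sorted(t for t in event_times if t > seen_up_to)


def mark_seen(event_times, seen_up_to):
    """New 'seen up to' marker after the user has noted the given events."""
    return max([seen_up_to, *event_times])
```

For example, with events at 100, 200 and 300 and a marker of 150, only 200 and 300 are reported; after acknowledging them the marker moves to 300 and nothing is reported again until a later event appears.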

I’m thinking about detecting reboots / power loss even if the modem doesn’t provide enough info in the standard modem stats (which I need to review first). I can look at the start of the regularly updated modem-internal logfile written by the Johnson custom modem software.

My modems do, I’m pretty sure, have access to the true UTC time, not just time from boot, as they can get this from NTP, if I remember correctly. If that is correct, I can’t remember how they manage that. Presumably some modems won’t have this luxury. My modems cannot talk to the internet for various reasons. They can only talk to my Firebrick router and can reply to inbound packets from machines within the LAN by means of redirection arranged by the router’s NAT capability. My router is an NTP server itself, so perhaps that’s how modems get the true time, but I forget. If that’s how it works, then the initial period before the modem acquires correct time from NTP should be really short and hopefully I won’t have to suffer too much of an initial sequence of bogus low timestamp values in logs.

So if the value of the start timestamp in the modem-internal Johnson log file is not as expected, then that would mean either a reboot or cold boot. Need to decide what reference time to compare that start time with.
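The comparison might look something like this, assuming the log's first timestamp can be read as a UTC datetime and that the logfile starts afresh on each boot (neither is confirmed here). `SANITY_FLOOR` is an arbitrary assumed cut-off for "the clock had not yet been set from NTP".

```python
from datetime import datetime, timezone

# Assumption: any timestamp before this date means the clock was not yet set.
SANITY_FLOOR = datetime(2020, 1, 1, tzinfo=timezone.utc)


def classify_log_start(log_start, last_known_start):
    """Compare the log's start time with the one stored at the previous check."""
    if log_start < SANITY_FLOOR:
        return "clock-not-set"      # bogus low timestamp, cannot be compared
    if last_known_start is None:
        return "first-check"        # nothing stored yet
    if log_start > last_known_start:
        return "restarted"          # log began again: reboot or cold boot
    if log_start == last_known_start:
        return "unchanged"
    return "inconsistent"           # start time went backwards
```

Here the reference is simply the start time saved at the previous check, which is one candidate answer to the question of what to compare against.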

I’m wondering if that Johnson logfile could provide me with a way of detecting multiple events missed: multiple resyncs or multiple reboots since the last time I checked. Obviously not if the logfile is wiped by a cold boot, or even warm reboot[?]. So I don’t think there’s ever going to be a full solution.

For the first time ever, I saw a mystery modem reboot a few days ago, discussed in another thread. This is the kind of thing that I really want to detect, classify and display an alert for. ‘Mystery’ in the strongest sense is here defined as not caused by any visible drop in SNRM, a big noise burst, or DLM action. I got the feeling, though it is not at all a reliable conclusion, that it might have been a modem crash (a transient hardware failure or a software crash) or a bad PSU with voltage droop. The latter was also suggested by a fellow kitizen, and I would think it would be due to transient insufficient current delivery or insufficient dI/dt.

I doubt there’s any way of detecting DLM action, is there? Symptoms to be expected might be: high ES count or CRC count; middle of the night [?] as that’s deemed a convenient time, less disruptive; and low SNRM before the DLM-resync. Anything else? Can I turn the problem around here and consider a number of rule-out conditions, such as: ‘can’t happen in the middle of the day’; ‘not if the error counts are low’; ‘not if the SNRM values are all ok’? In the case of an unexpected resync, I think I could look at the Johnson SNRM vs time log and get an understanding from that. What can I decide from the presence or absence of a long slow downward slide in SNRM, or a sudden drop in SNRM? And a drop down to what value? Do I need to consider upstream when doing DLM-action detection? Does DLM even care about upstream? Do I need to take the presence or absence of PhyR or G.INP into account as a factor in the resync reason classification algorithm?
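The rule-out idea could be sketched as below. Every threshold here (`night_hours`, `es_threshold`, `snrm_floor_db`) is a placeholder assumption, not a known property of DLM, and the function name is hypothetical.

```python
def dlm_rule_outs(resync_hour, es_count, min_snrm_db, *,
                  night_hours=range(0, 6),   # assumed 'middle of the night'
                  es_threshold=100,          # assumed 'high' errored-seconds count
                  snrm_floor_db=3.0):        # assumed 'low' SNRM in dB
    """Return the list of rule-out conditions that this resync satisfies."""
    reasons = []
    if resync_hour not in night_hours:
        reasons.append("daytime")
    if es_count < es_threshold:
        reasons.append("low errors")
    if min_snrm_db >= snrm_floor_db:
        reasons.append("SNRM ok")
    return reasons
```

An empty list means nothing rules DLM out; it does not prove DLM was responsible. Whether a single satisfied condition is enough to rule DLM out, or whether several should be required, is left open.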

I can detect spikes in the number of RS CRCs. That information is provided by the Johnson modem-internal log files, but I’m not sure if I can use this to classify any resync immediately following such a spike. Is that something that we see? That’s a kind of badness, link badness, and we know that it’s only a mere resync and of the ‘these things happen’ kind, not a modem fault at all. How important is it that we get our error count comparison value right here? What are the implications of getting the value a bit out in either direction? I’m not sure.

These kinds of considerations might give more information to the user as to a ‘why’, in ‘why did x event happen?’ and help classify unexpected events according to their cause.
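A rough sketch of a spike test that compares the error count just before a resync with an earlier baseline. It assumes the log can be reduced to `(epoch_seconds, crc_count_in_interval)` pairs; the window, factor and minimum count are arbitrary placeholder values.

```python
from statistics import median


def crc_spike_before(samples, resync_time, window_s=300, factor=10, min_count=50):
    """True if the CRC count peaked sharply in the window before the resync.

    samples: list of (epoch_s, crc_count) pairs, one per logging interval.
    """
    before = [c for t, c in samples if t < resync_time - window_s]
    recent = [c for t, c in samples if resync_time - window_s <= t < resync_time]
    if not before or not recent:
        return False                      # not enough data to judge
    baseline = median(before)
    peak = max(recent)
    # Require both an absolute minimum and a multiple of the usual level.
    return peak >= min_count and peak > factor * max(baseline, 1)
```

Using both a relative and an absolute threshold makes the result less sensitive to getting either number slightly wrong: a quiet line with a baseline near zero does not trigger on a handful of errors.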

I also want to report events that are happening ‘too frequently’ and events that indicate various kinds of problem conditions that mean unreliable links or modems. This means I need duration values for the various ‘too frequent’ decisions. These numbers tend to be arbitrary and that always worries me. I’d welcome suggestions for the values of these kinds of durations: link uptimes (time between unexpected resyncs) and modem uptimes (time between reboots). That’s a reliability thing, and, as someone pointed out to me very recently, power supplies are a factor as well as modems’ hardware and software reliability. I’d appreciate advice on these numbers. I think that it might be very important to exclude certain types of cold boot and reboot according to expected/unexpected type, and then consider each (re)boot reason, if possible, otherwise the reliability numbers could be completely messed up.
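A ‘too frequent’ test can be reduced to counting unexpected events inside a sliding window, leaving the window length and limit as parameters until better numbers are agreed. A sketch, assuming events are stored as `(epoch_seconds, intended)` pairs; the function name is hypothetical.

```python
def too_frequent(events, now_s, window_s, max_events):
    """True if more than max_events unexpected events fall in the last window_s.

    events: iterable of (epoch_s, intended) pairs; intended events are ignored
    so that deliberate reboots do not distort the reliability figure.
    """
    recent_unexpected = [
        t for t, intended in events
        if not intended and 0 <= now_s - t <= window_s
    ]
    return len(recent_unexpected) > max_events
```

The same function would serve for resyncs and for reboots, each with its own window and limit.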

This is about as far as I’ve got with my musings. I need to think things through carefully to see if it’s even going to work before I waste a lot of time writing code.

burakkucat · « **Reply #1 on:** June 27, 2022, 01:17:58 PM »

<b*cat blinks and rubs his eyes>

I would suggest that you perform a simple investigation --

Log into one of the modems, as "supervisor", from the command line. (Either ssh or telnet.)
At the ZyXEL prompt (" > ") issue a "sh" command to gain the underlying BusyBox shell.
Issue an "uptime" command and analyse the result.
Issue a "cat /proc/uptime" command and analyse the result.

As an example, this is what I see when I perform the above --
    ~ # uptime
    13:15:34 up 5:03, load average: 0.95, 0.94, 0.92
    ~ # cat /proc/uptime
    18222.48 10.63
    ~ #
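On Linux the first field of `/proc/uptime` is seconds since boot and the second is accumulated idle time, so the o/s boot time can be derived from one reading plus the current time. A small sketch (function names are illustrative):

```python
from datetime import datetime, timedelta, timezone


def parse_proc_uptime(text):
    """Return (uptime_seconds, idle_seconds) from the text of /proc/uptime."""
    up, idle = text.split()[:2]
    return float(up), float(idle)


def boot_time(now, proc_uptime_text):
    """Wall-clock time at which the o/s booted, given a reading taken at 'now'."""
    up, _idle = parse_proc_uptime(proc_uptime_text)
    return now - timedelta(seconds=up)
```

In the example above, 18222.48 seconds is about 5 hours 3 minutes, which agrees with the `up 5:03` reported by `uptime`.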

Weaver · « **Reply #2 on:** June 27, 2022, 09:30:57 PM »

Thank you, that’s very valuable. And that’s uptime of the o/s, nothing to do with DSL, so that’s one additional piece of the puzzle. I really really need to work out how to do telnet in the stupid iOS Shortcuts framework. If ‘uptime’ is an executable, which I assume it is, can soon check, then I think I can achieve what I want by using the ‘Run Remote’ library function in Shortcuts; will have to refresh my memory.

No wonder you’re rubbing your eyes :-)

Do you all think this whole thing is achievable or not?

Can we work out from the o/s uptime whether or not there has been one or more reboots? If it is seen to have decreased, then you know there has been a reboot. But the real world time interval since the last o/s uptime check has to be considered in other cases, no?
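One way to take the real-world interval into account is to convert each uptime reading into an implied boot time (check time minus uptime) and compare those, rather than comparing the raw uptimes. A sketch, with times in epoch seconds taken from the checking device's clock and an assumed tolerance:

```python
def rebooted_between(prev_check_s, prev_uptime_s, now_s, uptime_s, tolerance_s=30):
    """True if the o/s has booted again since the previous check.

    Each reading implies a boot time; if the implied boot time has moved
    forward by more than the tolerance, at least one reboot has happened.
    """
    prev_boot = prev_check_s - prev_uptime_s
    boot = now_s - uptime_s
    return boot - prev_boot > tolerance_s
```

This catches the case where the uptime is larger than last time yet smaller than it should be given the elapsed time. It still cannot say whether there was one reboot or several in between.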

burakkucat · « **Reply #3 on:** June 27, 2022, 10:38:00 PM »

Having just checked, to be absolutely sure, this is what I see --
    Script started on 2022-06-27 22:06:04+01:00 [COMMAND="telnet" TERM="xterm-256color" TTY="/dev/pts/1" COLUMNS="190" LINES="51"]
    telnet> open ap
    Trying 10.10.10.254...
    Connected to ap (10.10.10.254).
    Escape character is '^]'.
    VMG1312-B10A
    Login: supervisor
    Password:
    > uptime
    0D 13H 54M 18S
    > sh
    ~ # uptime
    22:06:36 up 13:54, load average: 0.94, 1.02, 1.08
    ~ # cat /proc/uptime
    50107.84 31.90
    ~ # exit
    > exit
    CLI exiting now.
    Bye bye. Have a nice day!!!
    Connection closed by foreign host.
    Script done on 2022-06-27 22:07:18+01:00 [COMMAND_EXIT_CODE="1"]
The ZyXEL shell will respond to the "uptime" command and the BusyBox shell will also respond to the "uptime" command. Each gives the basic power-on time of the hardware.
    ~ # find -name 'uptime' | sort
    ./proc/uptime
    ./usr/bin/uptime
    ~ # ls -l /proc/uptime
    -r--r--r-- 1 supervis root 0 Jun 27 22:18 /proc/uptime
    ~ # ls -l /usr/bin/uptime
    lrwxrwxrwx 1 supervis root 17 Nov 14 2018 /usr/bin/uptime -> ../../bin/busybox
    ~ #
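If only the ZyXEL CLI is reachable, its `uptime` reply (`0D 13H 54M 18S` in the session above) can be turned into seconds with a small parser. This sketch assumes the reply always has that days/hours/minutes/seconds shape; the function name is hypothetical.

```python
import re

# Matches the ZyXEL CLI form seen above, e.g. "0D 13H 54M 18S".
_ZYXEL_UPTIME = re.compile(r"(\d+)D\s+(\d+)H\s+(\d+)M\s+(\d+)S")


def parse_zyxel_uptime(text):
    """Convert a ZyXEL CLI uptime reply into whole seconds."""
    m = _ZYXEL_UPTIME.search(text)
    if m is None:
        raise ValueError(f"unrecognised uptime reply: {text!r}")
    days, hours, minutes, seconds = map(int, m.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds
```

`/proc/uptime` is easier to parse and has finer resolution, so this is only worth using when the BusyBox shell is not available.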
I think you will need to be careful as to what you actually mean. For example, "reboot". Do you want to know how long the hardware has been powered on and ready to operate? Do you want to know when a forced re-train was caused on the xDSL circuit?

I recall that some months ago we privately discussed the initial records saved in the brcm.log file and that they were date & time stamped relative to the hardware's understanding of its "epoch". Having deliberately "conditioned" the hardware to a known "epoch" we could then see that there was around 17 seconds before the OS began to produce output. If the hardware could ultimately synchronise with an NTP server then there was a step-jump in the concept of "now", otherwise the time just incremented, second by second, from the hardware's "epoch".
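The step-jump described above can be located in a list of log timestamps and used to back-date the first record. This sketch assumes the timestamps are available as seconds in file order and that a jump of a day or more between consecutive records can only be the NTP correction; both are assumptions, as are the function names.

```python
def find_clock_step(timestamps, min_jump_s=86_400):
    """Index of the first record written after the clock stepped, or None."""
    for i in range(1, len(timestamps)):
        if timestamps[i] - timestamps[i - 1] >= min_jump_s:
            return i
    return None


def backdate_first_record(timestamps, min_jump_s=86_400):
    """Estimate the true time of the first record from the post-step clock.

    Ignores the (unknown) gap between the last pre-step record and the
    first post-step record, so the estimate is slightly late.
    """
    i = find_clock_step(timestamps, min_jump_s)
    if i is None:
        return None
    elapsed_before_step = timestamps[i - 1] - timestamps[0]
    return timestamps[i] - elapsed_before_step
```

If no step is found, either the clock never synchronised or it was already correct when logging began; the timestamps alone cannot distinguish the two.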

Weaver · « **Reply #4 on:** June 27, 2022, 11:39:26 PM »

> I recall that some months ago we privately discussed the initial records saved in the brcm.log file and that they were date & time stamped relative to the hardware's understanding of its "epoch". Having deliberately "conditioned" the hardware to a known "epoch" we could then see that there was around 17 seconds before the OS began to produce output. If the hardware could ultimately synchronise with an NTP server then there was a step-jump in the concept of "now", otherwise the time just incremented, second by second, from the hardware's "epoch".

Agreed.

To answer your question about what I mean by reboot: I would say that I am more interested in o/s start or restart time, i.e. o/s boot, rather than machine start from cold following power being applied or lost and reapplied. I am equally interested in DSL link startup, i.e. retrain/resync: the start time of that process and the time of entry to showtime state. Both are matters of concern to the user as they illustrate quite different possible illnesses: DSL link badness and/or hardware/software/PSU unreliability.

burakkucat · « **Reply #5 on:** June 28, 2022, 04:10:16 PM »

b*cat makes a note to think about the requirements . . .
