[Sorry that this post has grown sooo long.]
I’ve been thinking about a method of showing a modem-monitoring alert if there have been one or more resyncs, or any reboots due to crashes or power loss, since some reference time. Could anyone advise me about a suitable algorithm?
I’m not at all sure that this is feasible.
I would very much like to develop some mechanism of declaring certain reboots, power loss instances, or forced resyncs as ‘intended’. That’s not going to be at all easy.
Firstly the tools I have available for this in the framework I’m using, "Shortcuts" on my iPad, are a bit limited. My software that’s doing this, running on an iPad, is not running continuously, and the iPad is sometimes asleep, so I can’t implement any continuous monitoring function. I use custom ZyXEL modem-internal software written by kitizen Johnson, and this runs continuously in the modem as a service process, monitoring stats, writing information in a logfile, and providing a (second) http server which my iPad can query. I mention this because this limits the possible techniques I can use in an algorithm.
Secondly there’s the practical consideration that I have to remember to declare intended forced resyncs and reboots as not unexpected. That would mean somehow setting the reference or ‘since’-time to now after the intentional resync/reboot. I would also want to declare the reason for, or cause of the resync/reboot.
Then there would be some procedure for checking whether there has been an unexpected event more recently than the stored ‘reference time’. My modems give link-uptime, so detecting recent resyncs is easy. I’m not sure about obtaining uptime since cold boot and/or warm-reboot.
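As a rough illustration of the link-uptime check (in Python rather than Shortcuts, purely to show the arithmetic): the time of the most recent resync is just 'now' minus the link uptime, and that can be compared with the stored reference time. The function names and the 60-second slack for clock skew are assumptions, not anything the modem or the Johnson software provides. Note that this only ever sees the most recent resync, not earlier ones.

```python
from datetime import datetime, timedelta, timezone


def last_resync_time(now: datetime, link_uptime_s: int) -> datetime:
    """Time at which the current DSL link came up."""
    return now - timedelta(seconds=link_uptime_s)


def resync_since(reference: datetime, now: datetime, link_uptime_s: int,
                 slack_s: int = 60) -> bool:
    """True if the link came up after the reference time.

    slack_s (an assumed value) absorbs small timing differences between
    the moment the reference was stored and the moment the link came up.
    """
    came_up = last_resync_time(now, link_uptime_s)
    return came_up > reference + timedelta(seconds=slack_s)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    reference = now - timedelta(hours=6)
    # Link has been up for one hour, so it resynced after the reference time.
    print(resync_since(reference, now, link_uptime_s=3600))
```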
I’m also not at all sure about counting multiple such events and their times; not sure if I can obtain that info reliably. I’m not polling the modem’s status, as mentioned earlier. The custom Johnson software that I’m running in the modems does so though, and maintains a log file, which might provide part or all of the answer to this problem.
I need to also have a means of optionally setting another kind of ‘since when’ / reference time, so that I can ignore events that have already been seen and noted. When one unexpected event has been noted, then I want to know if there is a later second instance, rather than noting the same one twice. Saving the time of every unexpected badness event will help, and perhaps I need to save the ‘last time I checked’ time.
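One possible shape for the saved state, sketched in Python: a reference time, a 'last time I checked' time, and a list of events already noted. Everything here (the `MonitorState` name, the field names, the 120-second duplicate tolerance) is hypothetical. The tolerance is there because an event time derived from an uptime figure will differ slightly from one check to the next, so the same event should not be counted twice.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class MonitorState:
    reference_time: float = 0.0   # epoch seconds; events at or before this are ignored
    last_checked: float = 0.0     # epoch seconds of the previous check
    seen: list = field(default_factory=list)  # events already noted


def note_new_events(state, events, now, tolerance_s=120):
    """Return only events that are newer than the reference time and not
    already noted, and record them in the state."""
    fresh = []
    for ev in events:
        if ev["time"] <= state.reference_time:
            continue
        duplicate = any(
            s["kind"] == ev["kind"] and abs(s["time"] - ev["time"]) <= tolerance_s
            for s in state.seen
        )
        if not duplicate:
            fresh.append(ev)
            state.seen.append(ev)
    state.last_checked = now
    return fresh


def declare_intended(state, now, reason):
    """Record an intentional resync/reboot and move the reference time to now."""
    state.seen.append({"time": now, "kind": "intended", "reason": reason})
    state.reference_time = now


if __name__ == "__main__":
    state = MonitorState(reference_time=1000.0)
    print(note_new_events(state, [{"time": 2000.0, "kind": "resync"}], now=2500.0))
    print(json.dumps(asdict(state)))  # the state can be saved as JSON between runs
```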
I’m thinking about detecting reboots / power loss even if the modem doesn’t provide enough info in the standard modem stats (which I need to review first). I could look at the start of the regularly updated modem-internal logfile written by the Johnson custom modem software.
My modems do, I’m pretty sure, have access to the true UTC time, not just time from boot, as they can get this from NTP, if I remember correctly. If that is correct, I can’t remember how they manage that. Presumably some modems won’t have this luxury. My modems cannot talk to the internet for various reasons. They can only talk to my Firebrick router and can reply to inbound packets from machines within the LAN by means of redirection arranged by the router’s NAT capability. My router is an NTP server itself, so perhaps that’s how modems get the true time, but I forget. If that’s how it works, then the initial period before the modem acquires correct time from NTP should be really short and hopefully I won’t have to suffer too much of an initial sequence of bogus low timestamp values in logs.
So if the value of the start timestamp in the modem-internal Johnson log file is not as expected, then that would mean either a reboot or cold boot. Need to decide what reference time to compare that start time with.
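A minimal sketch of that comparison, in Python, assuming the logfile's first timestamp marks when the Johnson service started and that the logfile is not rotated or truncated for any other reason (both unverified assumptions). The comparison value here is simply the start timestamp saved at the previous check. The sanity floor is an arbitrary cut-off for spotting timestamps written before the modem had acquired the time from NTP.

```python
from datetime import datetime, timezone

# Assumed cut-off: any timestamp earlier than this is treated as
# "clock not yet set by NTP" rather than as a real time.
CLOCK_SANITY_FLOOR = datetime(2020, 1, 1, tzinfo=timezone.utc).timestamp()


def check_log_start(log_start, previous_log_start, tolerance_s=5.0):
    """Compare the log's start timestamp with the one saved last time.

    Returns (restarted, trustworthy):
      restarted   - the start timestamp changed, so the modem (or at least
                    the logging service) restarted since the last check
      trustworthy - the start timestamp looks like real UTC time
    """
    trustworthy = log_start >= CLOCK_SANITY_FLOOR
    if previous_log_start is None:
        return False, trustworthy  # first ever check: nothing to compare with
    restarted = abs(log_start - previous_log_start) > tolerance_s
    return restarted, trustworthy


if __name__ == "__main__":
    saved = datetime(2024, 3, 1, tzinfo=timezone.utc).timestamp()
    print(check_log_start(saved + 86400, saved))  # start time moved: restarted
```

This cannot tell a cold boot from a warm reboot; it only says that the log began again.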
I’m wondering if that Johnson logfile could provide me with a way of detecting multiple events missed: multiple resyncs or multiple reboots since the last time I checked. Obviously not if the logfile is wiped by a cold boot, or even warm reboot[?]. So I don’t think there’s ever going to be a full solution.
For the first time ever, I saw a mystery modem reboot a few days ago, discussed in another thread. This is the kind of thing that I really want to detect, classify and display an alert for. ‘Mystery’ in the strongest sense is defined here as not caused by any visible drop in SNRM, a big noise burst, or DLM action. My feeling, though it’s not at all a reliable judgement, is that it might have been a modem crash (a transient hardware failure or a software crash) or a bad PSU with voltage droop. The latter was also suggested by a fellow kitizen, and I would think it would be due to transient insufficient current delivery or insufficient dI/dt.
I doubt there’s any way of detecting DLM action, is there? Symptoms to be expected might be: high ES count or CRC count; middle of the night [?] as that’s deemed a convenient time, less disruptive; and low SNRM before the DLM-resync. Anything else? Can I turn the problem around here and consider a number of rule-out conditions, such as: ‘can’t happen in the middle of the day’; ‘not if the error counts are low’; ‘not if the SNRM values are all ok’? In the case of an unexpected resync, I think I could look at the Johnson SNRM vs time log and get an understanding from that. What can I decide from the presence or absence of a long slow downward slide in SNRM, or a sudden drop in SNRM? And a drop down to what value? Do I need to consider upstream when doing DLM-action detection? Does DLM even care about upstream? Do I need to take the presence or absence of PhyR or G.INP into account as a factor in the resync reason classification algorithm?
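The rule-out idea could be sketched like this in Python. All three thresholds (the night window, the ES count, the SNRM floor) are placeholder guesses, not known facts about how DLM behaves, and `ResyncContext` is a hypothetical structure that would have to be filled in from the Johnson logs. An empty result means only that DLM has not been ruled out, not that DLM is confirmed.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ResyncContext:
    when: datetime        # local time of the resync
    es_prev_24h: int      # errored seconds in the 24 hours before the resync
    min_snrm_db: float    # lowest downstream SNRM seen before the resync


def dlm_rule_outs(ctx, night_hours=(0, 6), es_threshold=100, snrm_floor_db=3.0):
    """Return the list of reasons why this resync is unlikely to be DLM action.

    All thresholds are placeholder assumptions to be tuned.
    """
    reasons = []
    start, end = night_hours
    if not (start <= ctx.when.hour < end):
        reasons.append("outside night window")
    if ctx.es_prev_24h < es_threshold:
        reasons.append("error counts low")
    if ctx.min_snrm_db >= snrm_floor_db:
        reasons.append("SNRM ok")
    return reasons


if __name__ == "__main__":
    ctx = ResyncContext(datetime(2024, 1, 1, 14, 0), es_prev_24h=5, min_snrm_db=6.5)
    print(dlm_rule_outs(ctx))
```

Whether one rule-out alone should be enough, or whether several need to agree, is an open design choice.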
I can detect spikes in the number of RS CRCs. That information is provided by the Johnson modem-internal log files, but I’m not sure if I can use it to classify a resync that immediately follows such a spike. Is that something that we see? That would be a kind of link badness, and we would know that it’s a mere resync of the ‘these things happen’ kind, not a modem fault at all. How important is it that we get our error count comparison value right here? What are the implications of getting the value a bit out in either direction? I’m not sure.
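One way to avoid a single fixed comparison value is to compare the CRC counts just before the resync with the line's own earlier baseline. The sketch below, in Python, assumes the log can be reduced to (time, CRC count per interval) pairs; the window length, the multiplier and the minimum count are all arbitrary assumptions. Set too low, ordinary noise gets labelled as the cause; set too high, a genuine noise-induced resync gets labelled a mystery.

```python
from statistics import median


def crc_spike_before(samples, resync_time, window_s=300, factor=10.0, min_count=20):
    """samples: list of (epoch_seconds, crc_count_in_interval) pairs.

    True if any sample in the window just before the resync reaches
    max(min_count, factor * median of the earlier samples).
    """
    window_start = resync_time - window_s
    recent = [c for t, c in samples if window_start <= t < resync_time]
    baseline = [c for t, c in samples if t < window_start]
    if not recent or not baseline:
        return False  # not enough data to decide either way
    threshold = max(min_count, factor * median(baseline))
    return max(recent) >= threshold


if __name__ == "__main__":
    quiet = [(i * 60, 2) for i in range(60)]   # one sample per minute, 2 CRCs each
    noisy = quiet + [(3600, 500)]              # burst just before the resync
    print(crc_spike_before(noisy, resync_time=3660))
```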
These kinds of considerations might give more information to the user as to the ‘why’ in ‘why did x event happen?’ and help classify unexpected events according to their cause.
I also want to report events that are happening ‘too frequently’ and events that indicate various kinds of problem conditions that mean unreliable links or modems. This means I need duration values for the various ‘too frequent’ decisions. These numbers tend to be arbitrary and that always worries me. I’d welcome suggestions for the values of these kinds of durations: link uptimes (time between unexpected resyncs) and modem uptimes (time between reboots). That’s a reliability thing, and, as someone pointed out to me very recently, power supplies are a factor as well as modems’ hardware and software reliability. I’d appreciate advice on these numbers. I think that it might be very important to exclude certain types of cold boot and reboot according to expected/unexpected type, and then consider each (re)boot reason, if possible, otherwise the reliability numbers could be completely messed up.
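A minimal Python sketch of a ‘too frequent’ test over saved event times, with expected events excluded first so that they don’t distort the figures. The event structure, the function names and the example limits are all assumptions; the window and limit are exactly the arbitrary numbers in question, kept as parameters so that they can be changed easily.

```python
DAY = 86400  # seconds


def unexpected_times(events):
    """Sorted times of events not marked as expected."""
    return sorted(e["time"] for e in events if not e.get("expected", False))


def too_frequent(times, now, window_s, max_events):
    """True if more than max_events fall within the last window_s seconds."""
    recent = [t for t in times if now - window_s <= t <= now]
    return len(recent) > max_events


def mean_interval(times):
    """Average time between consecutive events, or None with fewer than two."""
    ts = sorted(times)
    if len(ts) < 2:
        return None
    return (ts[-1] - ts[0]) / (len(ts) - 1)


if __name__ == "__main__":
    events = [
        {"time": 0},
        {"time": 1 * DAY, "expected": True},  # declared as intended, so excluded
        {"time": 2 * DAY},
        {"time": 3 * DAY},
    ]
    times = unexpected_times(events)
    print(too_frequent(times, now=3 * DAY, window_s=7 * DAY, max_events=2))
    print(mean_interval(times) / DAY, "days between unexpected events")
```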
This is about as far as I’ve got with my musings. I need to think things through carefully to see if it’s even going to work before I waste a lot of time writing code.