I have been trying to debug two problems plus something that is possibly a red herring.
A while back there was a thunderstorm about 15 mi south of me, and I heard the thunder in the distance. Stupid hardware lightning alarm unit didn’t sound an alert and I didn’t see it flash (whether or not it did so). My lightning alert app was not running, which is my own stupid fault plus Sod’s law. So it was lucky that I just heard it.
Being very nervous about such things, I asked the poor sleeping Mrs Weaver if she would kindly unplug the DSL lines to protect them. At this point my Firebrick router should have failed over to 3G via a USB dongle automatically, but for some unknown reason the 3G link was down, so the main internet connection went down as a result. I checked a few things over and decided to force reinitialisation of the 3G link by rebooting the Firebrick, which, although extreme, was the quickest and easiest way. This fixed the problem: 3G link up, failover working and main internet connection restored.
But the question then was why had the 3G link been down anyway? Talking to AA, my ISP, it seems from the evidence of the logs that the 3G link had been down for several days and I had somehow not noticed.
So that was the first problem, why had the link failed?
Other questions were: how best to detect such a problem in future? AA’s clueless server should alert me, in theory anyway, so that should be fine. For extra insurance I thought about adding something into the Firebrick config to continually ping-test the 3G link, but I have no idea how to do that.
I set up an external server using the mouselike.org ping test server (which uses a Firebrick ping tester box), thanks to a wonderful tip in another thread. This monitored the wan IPv4 address of the 3G link.
Now it turned out that ICMP-pinging the WAN IPv4 address of the 3G link, either from the mouselike.org server or from other external source addresses, just failed.
The question is why? Is the link not really ‘up’?
AA staff and I tested failover to 3G by faking all the DSL lines going down, by temporarily tampering with the config file. Failover worked ok.
So the next question is: do I really need to worry about the mystery of not being able to ping the 3G dongle? Does this inability have anything to do with the 3G link actually not working when it comes to a failover situation?
I suppose I should ask whether this is something unknown about Firebrick behaviour, or the behaviour of AA’s servers at their end, or both. Since my Firebrick and the AA routers know (or at least could in theory know) that the 3G link is meant to be used in failover only, perhaps one end or the other is either dropping the link or disabling downstream routing to it once a period of failover use has ended.
So at the moment, for reasons unknown, I cannot use the excellent mouselike.org ping monitoring facility as a double check that the 3G link is really working.
I’m worried that the 3G link might go down again at some point for reasons unknown. And what if it should happen without me knowing about it, and possibly without AA’s clueless.aa.net.uk server spotting it and warning me? The clueless.aa.net.uk server should, I think, be continuously PPP LCP-ping testing that link, which proves that the link is really working, not just claiming to be up. If that system is all good then I need have no worries about missing out on alerts.
It may be that I didn’t spot an alert concerning the 3G link going down because I confused it with alerts relating to DSL modem links dropping; those are all too frequent and sometimes tend to get casually binned. I perhaps need to think of a way of conditionally highlighting any specific emails from AA’s monitoring systems that are about that one particular link.
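As a rough idea of what that conditional highlighting could look like, here is a minimal Python sketch that decides whether an email is an alert about the 3G link. The sender string and the subject keywords are pure assumptions on my part; I have not checked what AA’s alert emails actually contain, so they would need adjusting to match the real messages.

```python
from email import message_from_string
from email.message import Message

# Assumptions: the monitoring emails have this string in the From header,
# and the subject mentions the link by one of these (hypothetical) keywords.
MONITOR_SENDER = "clueless.aa.net.uk"
LINK_KEYWORDS = ("3g", "dongle")


def is_3g_alert(msg: Message) -> bool:
    """Return True if msg looks like a monitoring alert about the 3G link."""
    sender = (msg.get("From") or "").lower()
    subject = (msg.get("Subject") or "").lower()
    if MONITOR_SENDER not in sender:
        return False
    return any(keyword in subject for keyword in LINK_KEYWORDS)


if __name__ == "__main__":
    # Made-up sample message, for illustration only.
    raw = "From: alerts@clueless.aa.net.uk\nSubject: Line down: 3G backup\n\nbody\n"
    print(is_3g_alert(message_from_string(raw)))
```

Something like this could sit in a mail filter script and flag or move matching messages so they do not get lost among the routine DSL drop alerts.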
The other remaining problem is that if I find out that the 3G link really is down, then how do I debug it? And also how do I capture enough information about what badness it was that made it go down at that time?
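On the question of capturing when it went down, one simple approach is to log every change of state with a timestamp, so there is at least a precise time to compare against the Firebrick and AA logs. The sketch below is only that: the class name is made up, and it assumes some separate check (not shown, and not yet known to exist) that can say whether the 3G link is up.

```python
from datetime import datetime, timezone
from typing import List, Optional, Tuple


class LinkStateLog:
    """Hypothetical helper: records timestamped up/down transitions of a link."""

    def __init__(self) -> None:
        self.last_state: Optional[bool] = None
        self.transitions: List[Tuple[datetime, bool]] = []

    def record(self, is_up: bool, when: Optional[datetime] = None) -> bool:
        """Store the result of one check; return True if the state changed."""
        when = when or datetime.now(timezone.utc)
        if is_up == self.last_state:
            return False
        # The first observation also counts as a change, so the log always
        # begins with the initial state.
        self.transitions.append((when, is_up))
        self.last_state = is_up
        return True
```

Calling record() with the result of each periodic check would leave a list of exactly when the link changed state, which is a starting point for asking what else happened at that moment.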