Kitz Forum
Broadband Related => ADSL Issues => Topic started by: Weaver on March 30, 2016, 05:09:29 AM
-
Started last Thursday evening around 19:15 very suddenly, packet loss on one line out of three, as indicated by loss of responses to PPP LCP "ping" Echo Requests issued by the ISP's CQM system. See attached image, the red 'dripping blood' hanging down from the top is the indicator.
Why should it start so suddenly, and what might be behind it?
-
This line is approximately 4.55 miles long (est.) by road. The BT DSL availability website gives the line length as ~6200m.
-
I have been very out of it, so I will ask my wife to email Andrews and Arnold support tomorrow to get the ball rolling. I did a ‘copper line test’ at the weekend, which did not reveal the source of the problem.
-
I've done obvious things such as power-cycling the modem. I haven't left it turned off across a DLM fifteen minute boundary period.
-
The simple thing to do if you suspect the line is the cause is to monitor the error stats on the modem.
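A minimal sketch of what "monitor the error stats" amounts to in practice: take two snapshots of the modem's counters some minutes apart and see which ones are climbing. The counter names below (ES, SES, CRC, FEC) are typical ADSL stats; the exact names and how you read them off depend entirely on the modem's firmware, so treat this as an assumed data shape.

```python
# Sketch: compare two snapshots of modem error counters and report which
# counters are still climbing. Counter names here are illustrative; read
# the real values from the modem's status page or CLI.

def rising_counters(before, after):
    """Return {name: increase} for every counter that grew between snapshots."""
    return {name: after[name] - before[name]
            for name in before
            if after.get(name, before[name]) > before[name]}

# Example: two readings taken fifteen minutes apart (made-up numbers).
snapshot_1 = {"ES": 120, "SES": 3, "CRC": 4510, "FEC": 90210}
snapshot_2 = {"ES": 145, "SES": 3, "CRC": 4890, "FEC": 90210}

print(rising_counters(snapshot_1, snapshot_2))  # ES and CRC are climbing
```

A healthy long line can still accumulate some FEC corrections; it's steadily rising errored seconds (ES/SES) that point at a real problem.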
-
I have plenty of spare preconfigured modems as well, that might be worth trying, just as a rule-out.
-
@weaver... is that the phone number in the top RHS of the graph? We do know you only have DSL on the line, but...
Ian
-
It is. (There's no voice service on the line. (I think.) It used to be an option as to what Andrews and Arnold would do with voice, but now I don't really understand what the story is. AA are desperate to stop all voice services and prevent users getting charged for any such thing.)
-
Weaver, I wonder if this is related? I decided to have a read of RevK's blog, and it seems they have had to do some work to relieve capacity issues, although the most recent post suggests it is now resolved.
http://www.revk.uk/2016/03/growing-pains-next-step.html
-
There was a nasty two hour outage, not properly warned, early that Thursday morning. Then that evening one line went suddenly bad and has remained bad ever since. Why it has stayed bad, and why only the one line, these are perplexing questions. The lines all seem to go to the same BRAS in ?Falkirk, so no obvious difference there.
Is it my imagination or is there some sort of repeating temporal pattern in the dripping blood?
-
There have been a few errored seconds per day some days, which is most unusual. The other two lines are zero errored seconds in each day, as expected.
-
Had two faults this year so far: 1) a piece of wire or solder or something fallen down somewhere it shouldn't be in the exchange, and 2) knackered wiring shallowly buried near the fank at Harrapul, possibly after a lorry went over it, shorting wires out against one another or something, iirc.
-
Andrews and Arnold had a brief look at the line this morning. Did an SNR reset to 6dB d/s instead of my dangerous 3dB, a change that has had no effect on the dripping blood at all.
And it didn't explain the sudden change last Thursday evening: the line had worked fine on a really, really low target SNRM before then, just like the other two lines.
The problem is that I don't think anything is showing up in BT's tests.
Here's the CQM graph for last Thursday 2016-03-24. Scroll to the right to see the sudden onset of the problem:
-
We agreed that the d/s target SNRM 6dB was a waste of time, I think it's back to 3dB now. Listened to the line on a POTS phone, was all good.
Then swapped lines @a.1 and a.3 over at the wallsocket link stage, to see if the visible dripping blood would move from 1 to 3. An equipment rule-out.
Agreed to leave it in swapped over state until tomorrow morning when we reconvene.
-
The dripping blood 'moved' when the lines were swapped over. Thank goodness. Now hopefully we can make some actual progress tomorrow.
It's my fault that this has gone on so long; I have been really out of it, very ill with overwhelming fatigue, so I didn't notice the dripping blood until Sunday, and then there was the bank holiday. AA didn't pick up on my requests for support delivered via Twitter, so I lost more time, as I had to get my wife to email tech support, which again I did belatedly because I was so under the weather.
-
I wonder if someone (me) could write a program to get at clueless' live CQM data and continuously be on the lookout for dripping blood, sounding an alarm in some way: sending an email, an SMS or a tweet, or an alert via an SNMP trap, etc.
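A rough sketch of what the alarm logic might look like. I don't know of a documented machine-readable feed for the clueless CQM graphs, so this assumes a list of per-interval LCP echo loss percentages has been obtained somehow; the fetch side is deliberately left out.

```python
# Sketch of a "dripping blood" watcher. Input is assumed to be a list of
# per-interval loss percentages scraped or fetched from the CQM data by
# some means not shown here.

def should_alarm(loss_percentages, threshold=1.0, min_intervals=3):
    """Alarm if at least `min_intervals` intervals each exceed `threshold`
    percent LCP echo loss, so one-off blips don't page anyone."""
    bad = [p for p in loss_percentages if p > threshold]
    return len(bad) >= min_intervals

# A clean line versus a bleeding one (made-up numbers):
print(should_alarm([0.0, 0.0, 0.2, 0.0]))        # no alarm
print(should_alarm([5.0, 12.0, 0.0, 8.0, 6.0]))  # alarm
```

The actual alert delivery (email, SMS, SNMP trap) would hang off the True result; the thresholds here are arbitrary and would want tuning against what a normal day's graph looks like.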
-
On Wednesday, the dripping blood ‘moved’ when the DSL cables from two of the wall sockets to the modems were swapped. Fair enough. So it's the case that we have a bad modem #1, sending its output now into line @a.3. Surely?
But - I put things back the way they were originally yesterday afternoon and the dripping blood is back on line @a.1 - no surprise there.
This morning I swapped out modem #1 for a known good spare. And the dripping blood continues. So we have a bad line 1. Sanity check.
-
I've even swapped out the RJ11-RJ11 cable that was in use on line 1, just in case. No change, as expected.
(Don't have any idea why Wednesday's perplexing result happened. Everything else seems to say “line1 bad” unless I'm going crackers, which is a distinct possibility.)
-
Second brain needed.
-
Have now even replaced the straight-through RJ11 faceplate with a dangly microfilter plugged straight into the test socket.
-
Andrews and Arnold have finally completed all the rule out tests that need to be done at my end. They've submitted something to BT.
===
Andrews and Arnold Ltd
BT Knowledge Based Diagnostics (auto reload on update)
Line BBEU20700042   01/04/2016 14:25:02   QDMNN   Today 14:24:45   andy@a   CLOSED COMPLETED
Problem:
The End User has a working session.
This is a BT diagnostics tool. If you need any help with understanding this please contact support. Where we go on to report a fault based on this we expect you to have answered all the questions accurately. If we are charged by BT as a result of incorrect answers we will charge you.
Today 14:25:42: No fault identified in BTW network. Do you wish to continue with further KBD diagnostics? (Note: To see the KM lite Analysis go to the Drill Down view. If you wish to continue with KBD diagnostics please note intrusive testing may be involved and please ensure the End User Modem is Switched-On and connected.)
: Yes
Today 14:25:42: Please confirm the type of problem being experienced with the service.
: Connection
Today 14:27:58: Do you have your own End User set up checks process? Please note: You must complete all CP and End User checks before (continuing with KBD diagnostics). Please confirm this have been carried out and no customer fault identified. If not please select the 'NO' option below.
: YES
Today 14:27:58: Has the issue been resolved?
: NO
Resolution: ISP15
Please confirm all CP and End User checks have been completed. KBD tests indicate no BTW network fault. This is a CCSFI enabled outcome.
Test Results / Notes
Product Info WBC End User ACCESS
Profile Info WBC 160K - 24M Medium delay (INP 1) 3dB Downstream, UC Medium delay (INP 2) 6dB Upstream (ADSL2+)
BRAS Profile adsl2000-b
RRT:
RRT:Prognosis for a period of 14 days from 18-Mar-2016 to 31-Mar-2016. Line operated in Low Power consumption mode (L2) for 0% of analysis period. The circuit was in sync throughout the specified analysis period. Please refer to the other sub tests within the KBD including the Status Check to confirm whether the circuit is currently in sync and logged on. If the circuit is currently out of sync please carry out internal wiring, filters, modem/router checks with the End User at the master socket where possible. The circuit has no dropping syncs. Please carry out internal wiring, filters and modem/router checks with the End User at the master socket where possible. Also, refer to the other sub tests within the KBD and use the Performance Tester. This Line is not flapping today (01-Apr).
Down:The line rate has varied by a small amount and frequently on most of the days during the analysis period. The line rate is within acceptable limits. Please see the average value. MIN=2699 AVG=2800 MAX=2848
Down:The noise margin is constant throughout the analysis period. The average margin for this line is at the bare minimum, which can cause dropping connections. MIN=1 AVG=1 MAX=3
Down:There have been insignificant errors on the line during almost all parts of the day. This behaviour happens on almost all days during the analysis period. This is normal behaviour for a DSL product and is not affecting the service in any way. MIN=78 AVG=1859 MAX=3600
Down:This is an extremely long line, and lower line rates can be expected. MIN=65 AVG=65 MAX=65
Down:There have been a few initializations on the line during just a few parts of the day. This behaviour happens on most of the days during the analysis period. MIN=0 AVG=1 MAX=5
Down:This circuit is up for an average 85.0% of the time. MIN=900 AVG=73549 MAX=86400
Up:The line rate has varied by a small amount and frequently on almost all days during the analysis period. The line rate is very high (good). Please see the average value. MIN=525 AVG=530 MAX=534
Up:The noise margin is constant throughout the analysis period. The Noise margins are low. Please see the average value. MIN=5 AVG=6 MAX=7
Up:There have been insignificant errors on the line during almost all parts of the day. This behaviour happens on almost all days during the analysis period. This is normal behaviour for a DSL product and is not affecting the service in any way. MIN=102 AVG=2681 MAX=3600
Up:This is an extremely long line, and lower line rates can be expected. MIN=41 AVG=42 MAX=42
RADIUS:WORKED CWCC@A.1 2016-03-30T03:31:08.000
Status:Circuit In Sync NTE /PowerOn MUX Up LL=41.9 SNR=5.9 522kb/s Down LL=64.5 SNR=1.8 2809kb/s
Copper:Line Test OK - End User Equipment detected ACap=nF BCap=nF DPDist=m DNDist=m
-
b*cat asks the question --
Did you swap out the modem's power supply as part of your hardware checking?
-
No, I didn't swap out the power supply. I'd better do so ASAP. This will mean calling upon my long-suffering assistant once more.
-
No, I didn't swap out the power supply. I'd better do so ASAP.
Ah . . .
This will mean calling upon my long-suffering assistant once more.
Nods knowingly. ;)
I have read through the A&A information (listed above) and am at a loss as to: (1) what it means (2) what they are requesting that BT (in whichever guise) do . . . Very confused. ???
-
I am assuming that this stuff I've quoted is a report by BT's 'intelligent' case analysis system sent to AA. I was told that AA have booked a BT engineer visit.
AA now have tech support formally available on Saturdays. Hurray! Before, there would often be someone hanging around in IRC just in case though.
-
The thing that strikes me is that you're reporting packet loss, but the BT blurb seems to be saying very low error rates. Is that the case? I'm not completely sure; for example, if your drops were in bursts, and BT is averaging over 24 hours, then maybe the two do tie up. If that's not the case, and you have a high drop rate but a low error rate, then it doesn't sound like a line fault, does it?
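The arithmetic behind the "bursts averaged over 24 hours" point is worth spelling out: a burst that looks dramatic on a live loss graph nearly vanishes in a daily average. The numbers below are made up for illustration.

```python
# A five-minute burst of total LCP echo loss, averaged over a whole day:
burst_seconds = 300          # five minutes of 100% loss (illustrative)
day_seconds = 24 * 60 * 60   # 86400 seconds in a day

daily_average_loss = 100 * burst_seconds / day_seconds
print(daily_average_loss)    # about 0.35% -- easily read as "insignificant"
```

So a tool reporting against a 24-hour window could genuinely describe a line as having "insignificant errors" while the user sees unmistakable red bursts on a fine-grained graph.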
-
It doesn't have to be a DSL fault at all. What if it's further upstream?
-
Indeed, I was jumping to the conclusion that the BT engineer was booked to look at the local line. Could it be a DSLAM card issue?
-
My feeling, knowing A&A's reputation, is that an appropriate support request has been made to attend to the appropriate entity, but the feedback, as seen by Weaver and reproduced here, is garbled. :-\
-
One of my earlier faults this year was in the exchange, iirc.
-
Packet loss in PPP LCP pings continues just the same, bright red ‘dripping blood’ from the top of the graph down.
-
I've now swapped lines @a.4 and @a.1 at the wallsocket points. The fault has now moved to graph 4 in clueless. Sanity check: so this means a bad modem #1, doesn't it? (Brain failing due to exhaustion, apol.)
Status now: graph 1 currently clear, graph 4 dripping blood.
Suggestion: the earlier perplexing results would just mean that either it's something to do with duff PSUs, which have not been swapped out (thanks, Burakkucat, for pointing this out), or there's more than one bad modem in the spares pool?
-
Hi weaver
Why not save time and swap the power supply from a known-good router onto the perceived bad router?
If the issue then appears on the good router running on the original power supply, you know it's the PSU.
If the issue stays with the perceived bad router, swap in the good router to see if that resolves the issue (keep it on the same line, though).
If the issue remains, can you factory-default the bad router, set it back up, and then test?
Your theory about a bad spare router seems OK, but I would test it by swapping a good router for that spare and seeing if the issue arises.
Many thanks
John
-
@d2d4j good point. Will do, delayed because I have been waiting a short while to let A&A collect a little more data.
-
FYI: I must not put these devices back to factory settings, as they need to be in modem-only mode (bridge mode), which is a non-default setting preconfigured by AA before delivery.
(Btw, as routers they are shite, a security hazard, and in any case unusable, as what I need is a straight PPPoE modem, not a router. See the earlier thread investigating the innards of the D-Link DSL-320B-Z1:
http://forum.kitz.co.uk/index.php/topic,17065.msg313922.html#msg31392
)
-
This afternoon, I changed the modem out yet again, for yet another spare. (I have a lot of fairly new spares.) I also swapped out the PSU along with it.
I then swapped the lines back at the Wallsockets to the way they were originally.
-
As long as you are keeping notes of each experiment performed and the result obtained, then you should be able to rule out (or in) any particular item of local hardware.
I have to confess that now I am completely ker-fuddled as to "what's what". ???
-
The next step was to question the Firebrick itself, as its port operations haven't been investigated. I swapped the Ethernet cables between modems #1 and #3 (line @a.4) where they go into the Firebrick's ports.
Amazingly, this has cured about 99% of the problem, leaving a microscopic amount of dripping blood just before every hour. This time pattern suggests perhaps a software bug.
I looked back at the Firebrick software upgrade status and noted that there had been a software upgrade a few days before the onset of the fault, but there's no obvious way to explain the time gap between the installation and the start of the fault.
One other thing I should look at is the network cables between Firebrick and modems just in case they could have been damaged.
So the current state is that the dripping blood is spectacularly reduced, down to acceptable levels. Because the cables are into the wrong ports compared with the line identifiers @a.1, @a.3 etc, I will have to clean up somehow.
At the moment the Firebrick config is wrong, because it specifies an upstream traffic limit in bps for each line, and these have got scrambled. Line 1 has traditionally had a rather higher upstream traffic allowance than the others; this is now incorrectly assigned to line @a.4, and vice versa. I wonder if this upstream rate-limiting system could have something to do with it. It would have to have been hacked in during the most recent release.
-
All of this work carried out by my beautiful assistant, thanks Janet.
-
It's very hard to guess remotely what might be going on, but the fact that cable swapping improves things would suggest to me that the line isn't at fault.
Swapping the cables on the Firebrick would also perhaps serve to re-initiate the connections, flushing buffers and so forth, meaning it could be something like a memory leak which got resolved when the port was reset. So another idea: if it starts getting worse, don't swap any cables but just reboot the Firebrick, to see if that also yields a temporary improvement.
-
I think the A&A monitoring is by LCP ping, so I was wondering whether that would preempt any traffic shaping on your PPPoE router. It depends how the shaping and queuing are done on the Firebrick. Does the A&A graph show packet loss if you max out your links with real traffic?
-
No, it doesn't seem to show bright red in the period of a speed test.
-
And it's back. Bad as ever, after about 22 hrs. It came back, showing on line @a.4, at around 15:30. This line is the one that was originally connected to port #3 of the Firebrick; now it's connected to port #1 on the FB.
So it seems (sanity check pls?) that the trouble follows Firebrick port #1 around.
-
So it seems (sanity check pls?) that the trouble follows Firebrick port #1 around.
For the first 24 hour period, connect the equipment as --
- Line 1 > Port 1
- Line 2 > Port 2
- Line 3 > Port 3
For the second 24 hour period, connect the equipment as --
- Line 1 > Port 2
- Line 2 > Port 3
- Line 3 > Port 1
Then, finally, for the third 24 hour period, connect the equipment as --
- Line 1 > Port 3
- Line 2 > Port 1
- Line 3 > Port 2
For each of the three 24 hour periods note which line shows the problem. If it is Line 1, Line 3 and Line 2 then that does appear to point to Port 1 on the Firebrick.
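The deduction in that rotation plan can be sketched as a few lines of code, just to make the logic explicit. The rotations below mirror the three wirings above; which graph bleeds in each period is the observation you feed in.

```python
# Three 24-hour periods, each with a different line-to-port wiring,
# matching the rotation plan described above.
rotations = [
    {1: 1, 2: 2, 3: 3},   # period 1: line -> Firebrick port
    {1: 2, 2: 3, 3: 1},   # period 2
    {1: 3, 2: 1, 3: 2},   # period 3
]

def diagnose(bleeding_lines):
    """bleeding_lines: which line's graph bled in each of the three periods.
    Returns ('line', n) if one line is always bad, ('port', n) if the fault
    follows one Firebrick port, else None (inconclusive)."""
    if len(set(bleeding_lines)) == 1:
        return ("line", bleeding_lines[0])
    ports = {rot[line] for rot, line in zip(rotations, bleeding_lines)}
    if len(ports) == 1:
        return ("port", ports.pop())
    return None

print(diagnose([1, 3, 2]))  # the Line 1 / Line 3 / Line 2 pattern: port 1
print(diagnose([1, 1, 1]))  # same line bleeding every period: line 1 itself
```

Any other observed pattern comes back inconclusive, which would itself be useful information: it would suggest the fault isn't pinned to a single line or a single port.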
Perhaps you should suggest to Adrian Kennard that he might like to spend a long weekend at Skye Shepherd Huts and, whilst in the vicinity, take a look at your Firebrick FB2500?
-
> Perhaps you should suggest to Adrian Kennard that he might like to spend a long weekend at Skye Shepherd Huts and, whilst in the vicinity, take a look at your Firebrick FB2500?
That's an excellent idea. I'm told it's very agreeable there!
-
This week I've only been awake a few hours every day, so my fault-finding abilities and the stamina-reserves of my beloved assistant have become quite limited.
-
Did you reboot after it came back, as I advised?