Topic: Upstream performance (again) - either Firebrick packet scheduling or line 3 (Read 2436 times)

Weaver · « **on:** March 17, 2017, 06:01:35 PM »

This has been discussed in earlier threads; maybe I could find some of them. All comments in the following apply only to upstream.

As far as I can see, I either have a Firebrick problem or a problem with upstream performance on line 3. The multi-line upstream throughput of my Firebrick appears really bad, at only 77% of the expected three-line combined value after taking out all overheads due to protocol bloat such as AAL5, ATM and so on.

This has held true despite every kind of tweak I can think of and days wasted fiddling around with parameters in the config file, because you are supposed to be able to rate-limit each sub-pipe to each individual PPPoE modem. If you don't want to do that, and just want it to do the right thing and run each line as fast as it can go, scheduling packets correctly according to the differences in line speeds between the pipes, then I think you should be able to. I don't know what you are supposed to do if you just want intelligent maximised performance with no configuration.

I've tried simply setting all the rate limits to huge values to see if that is good. It doesn't break things, but it doesn't get you more performance than using explicit numbers that look plausible and are derived from reported DSL sync rates by some sort of guesswork. I'm assuming that the rates you are supposed to put in the config file are actual bit rates that are derived from L3 PDUs (= [eg IP] headers and payload) by ATM bitrate for PPPoEoA = L3_PDU_Bitrate / 0.85, where the latter fudge factor is for ATM overhead in expanding 1500 bytes of IP PDU into 33 ATM cells worth of bytes = 33 * 53 bytes. But fiddling about with the factor used never gets me any more throughput: pushing it harder doesn't help, although I assume that having a really low number means that I could succeed in crippling it. That is all old news.
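For what it's worth, the 0.85 factor can be sketched from AAL5 cell arithmetic. This is only an illustration: the 48-byte cell payload and 8-byte AAL5 trailer are standard, but the 32 bytes of per-packet encapsulation (PPP, PPPoE, Ethernet header and LLC/SNAP bridging header) is an assumption not stated above, and the function names are made up for the example.

```python
import math

CELL_BYTES = 53        # ATM cell size on the wire
CELL_PAYLOAD = 48      # payload bytes carried by each cell
AAL5_TRAILER = 8       # AAL5 trailer added before padding to whole cells
ENCAP_OVERHEAD = 32    # ASSUMPTION: PPP + PPPoE + Ethernet + LLC/SNAP headers


def cells_for(ip_bytes, encap=ENCAP_OVERHEAD):
    """Number of ATM cells needed to carry one IP packet of ip_bytes."""
    return math.ceil((ip_bytes + encap + AAL5_TRAILER) / CELL_PAYLOAD)


def efficiency(ip_bytes, encap=ENCAP_OVERHEAD):
    """Fraction of the ATM bit rate that ends up as IP bytes."""
    return ip_bytes / (cells_for(ip_bytes, encap) * CELL_BYTES)


if __name__ == "__main__":
    for size in (40, 576, 1500):
        print(size, cells_for(size), round(efficiency(size), 3))
```

Under these assumptions a 1500-byte packet needs 33 cells, matching the figure above, while small packets such as bare ACKs do noticeably worse, so the 0.85 factor is a best case for full-size packets.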

* But there's something new: because of this week’s outage I've discovered that the speedtester-reported upstream throughput with three lines is only about 15% better than it is with two lines active. Note that with only two lines up - cwcc@a.1 and @a.4 - the measured performance is 100% of that predicted! So could this be an n_lines == 3 thing? Or n_lines > 2? Or else it's that cwcc@a.3 is bad.

Now a word of warning: this could be to do with the behaviour of the speed tester. Perhaps it simply doesn't like something about the three-line ‘packet pattern’, is stupid and doesn't cope with out-of-order packet delivery or something. If this is a general, representative effect, and normal servers taking an upload from the usual kind of single TCP connection will be unhappy with this behaviour, then that is not ideal, and it is perhaps something that could be improved in the Firebrick’s scheduler to make it more per-TCP-connection friendly, or in general more 'flow aware'. (Using the word flow in the sense in which it is used in discussing the IPv6 flow label: packets identified by a six-tuple of source and destination addresses + L4 ports + IP protocol + IP version.) This might involve timing to order the arrival times correctly by considering line rates and packet lengths and choosing packet departure times appropriately, while of course trying to keep everything busy. And if multiple upstream flows are in use then it makes the scheduling a lot easier because it can try and assign packets to the relevant modems according to the flow they belong to. I suppose that it might be interesting to try the two-line vs three-line ratio thing with a different speed tester.
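To make the scheduling idea concrete, here is a toy model. It is not the Firebrick's algorithm, which is not described here. It compares plain round-robin with an "earliest finish" rule that considers line rates and packet lengths. It ignores latency differences between lines, assumes unlimited queueing, and uses the per-line rates quoted later in the thread; all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Line:
    name: str
    rate_bps: float       # usable rate of this line in bits per second
    free_at: float = 0.0  # time at which the line has drained its queue


def make_lines():
    # Per-line rates taken from the figures quoted later in the thread.
    return [Line("a", 460548), Line("b", 367924), Line("c", 377358)]


def round_robin(sizes, lines):
    """Send packet k on line k mod n, regardless of line speed."""
    out = []
    for seq, size in enumerate(sizes):
        line = lines[seq % len(lines)]
        line.free_at += size * 8 / line.rate_bps
        out.append((seq, line.name, line.free_at))
    return out


def earliest_finish(sizes, lines):
    """Send each packet on the line that would finish sending it soonest."""
    out = []
    for seq, size in enumerate(sizes):
        line = min(lines, key=lambda l: l.free_at + size * 8 / l.rate_bps)
        line.free_at += size * 8 / line.rate_bps
        out.append((seq, line.name, line.free_at))
    return out


def count_reordered(assignments):
    """Count packets that finish before some earlier-numbered packet."""
    latest, n = float("-inf"), 0
    for _seq, _name, finish in assignments:
        if finish < latest:
            n += 1
        latest = max(latest, finish)
    return n


if __name__ == "__main__":
    sizes = [1500] * 300
    print("round robin reordered:", count_reordered(round_robin(sizes, make_lines())))
    print("earliest finish reordered:", count_reordered(earliest_finish(sizes, make_lines())))
```

In this simplified model, with equal-size packets, the earliest-finish rule never delivers a packet ahead of an earlier one, whereas round-robin across unequal lines does. Real lines add differing latency and jitter, so this says nothing definite about what the Firebrick actually does.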

Now in case it isn't a server-end ordering sensitivity thing, but a real case of bubbles in some of the pipes, then that would be a scheduler bug. I could try and look at the utilisation of each link using clueless. Reading it off a packet capture would be confusing but I suppose that would be the hard-core option.

Really I need a test method where I do a really big long upload so that I can look at it in clueless, and then I could simply time a flat-out file upload that uses TCP, say. The tests that speed testers run are way, way too short to be usable with inspections on clueless, which only takes a snapshot every n seconds, where n is far too large. With a long upload I could check whether it is ignoring an entire modem.

Weaver · « **Reply #1 on:** March 18, 2017, 02:33:40 AM »

What's the easiest way of getting a big upload going so I can time it? I could fire up an ftp server. I'm thinking of something even lazier.
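One lazy option, sketched here under the assumption that there is some machine out on the internet where a throwaway listener can be run: a sink that discards whatever it receives, and a sender that pushes a fixed number of bytes over a single TCP connection and times it. The function names are made up for illustration, and the demo at the bottom only runs over loopback.

```python
import socket
import threading
import time


def run_sink(server_sock):
    """Accept one connection and throw away everything it sends."""
    conn, _addr = server_sock.accept()
    with conn:
        while conn.recv(65536):
            pass


def timed_upload(host, port, total_bytes, chunk=64 * 1024):
    """Send total_bytes over one TCP connection; return (seconds, bits/s)."""
    payload = bytes(chunk)
    sent = 0
    start = time.monotonic()
    with socket.create_connection((host, port)) as s:
        while sent < total_bytes:
            n = min(chunk, total_bytes - sent)
            s.sendall(payload[:n])
            sent += n
        s.shutdown(socket.SHUT_WR)
        # Wait for the sink to close, so the timing covers delivery
        # and not just filling the local send buffer.
        s.recv(1)
    elapsed = max(time.monotonic() - start, 1e-9)
    return elapsed, sent * 8 / elapsed


if __name__ == "__main__":
    # Loopback demo only; for a real test the sink would run remotely.
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    threading.Thread(target=run_sink, args=(srv,), daemon=True).start()
    secs, bps = timed_upload("127.0.0.1", srv.getsockname()[1], 5_000_000)
    print(f"{secs:.3f} s, {bps / 1e6:.2f} Mbit/s")
    srv.close()
```

For a run long enough to show up in clueless, total_bytes would need to be large: at around 1.2 Mbit/s, 100 MB takes roughly eleven minutes.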

Chrysalis · « **Reply #2 on:** March 18, 2017, 09:41:55 AM »

I have only messed with bonding once and that was me bonding multiple gigabit uplinks together on a server. I found it problematic not only in performance but also I had to be careful not to compromise stability, I have certainly never tried it at home.

My gut feeling is you have a packet ordering issue. I expect that the more lines are being bonded, the higher the tolerance needed to packets being out of order. If the configured tolerance is exceeded then retransmits will occur.

Have you ever asked these types of questions in the aaisp irc channel where you might find other people with similar setups?

Weaver · « **Reply #3 on:** March 19, 2017, 09:34:10 PM »

No, I haven't asked about the upstream thing. I just thought that I was asking too much out of it. But recently I realised how many odd things there are. Firstly that two channels work well, and the third doesn't, unless that is just a bad pipe and I should test that next.

Secondly, downstream works superbly well, so it is not that AA's routers are no good, as I'm assuming it is the exact same software just in a 6000-series Firebrick employed in the downstream direction.

Third thing is that if there is some kind of ordering phenomenon, one that appears only with three links and not two, then the effects would presumably depend on the design of the tester or its TCP stack if it is using TCP. Goals for a tester could be either: firstly, realism - be like normal life, so use TCP and perhaps no cheating by using multiple TCP streams to check if a bit more oomph can be squeezed out of the pipe. Or secondly: measure the link, not the software design at the tester, so try and remove all software-design-dependent aspects, so do not use TCP, just keep firing more and more packets down the link until you can't push any more through. Anyway, comparing the two approaches, you might expect different figures depending on TCP vs non-TCP, single vs multiple connections or good stacks that tolerate out-of-order arrival well vs stacks that perform badly.

I thought that networks were supposed to handle out-of-order packets, but maybe they only cope with it and don't like it. Perhaps no-one reacts well, I don't know. Anyway, accusing the FB 2x00 schedulers of causing problems doesn't make sense given that downstream does well, unless it is something to do with having some server o/s as a sulky receiver for upstream tests and in my case Apple iOS or the iOS tester app as a happy receiver for the downstream tests.

All very confusing.

Chrysalis · « **Reply #4 on:** March 24, 2017, 02:08:08 PM »

Out-of-order packets can be tolerated but only to a limited amount. Both Linux and FreeBSD have tunables to make it more lenient; I am unaware of such a setting on Windows.
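On Linux, the relevant settings are, as far as I know, the sysctls net.ipv4.tcp_reordering and (on newer kernels) net.ipv4.tcp_max_reordering. A small sketch for reading them follows; the helper name and the idea of passing a different root directory are just for illustration, and it returns None when the tunable is absent (for example on a non-Linux system).

```python
from pathlib import Path


def read_tunable(name, root="/proc/sys"):
    """Return a sysctl such as 'net.ipv4.tcp_reordering' as an int, or None."""
    path = Path(root) / name.replace(".", "/")
    try:
        return int(path.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


if __name__ == "__main__":
    for name in ("net.ipv4.tcp_reordering", "net.ipv4.tcp_max_reordering"):
        print(name, "=", read_tunable(name))
```

This only inspects the machine it runs on, so it helps with a server you control, not with a public speed tester's host.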

Really you need to ask either aaisp themselves (since it's their hardware doing the bonding) or in the aaisp irc channel, as the number of people bonding 3 lines at home is probably really tiny, especially on Firebrick units.

Think about it: if the out-of-order limit were really high, then the network stack would have to wait XXX packets before it could mark a segment as missing, meaning retransmissions and so forth would be delayed.

Also I think the tunables only affect ingress traffic, so you are sort of at the mercy of the recipient as to how well upstream bonding works. If the recipient only tolerates low numbers of out-of-order packets then you will have to retransmit when it's excessive and hence get lower throughput.

Weaver · « **Reply #5 on:** March 24, 2017, 11:52:48 PM »

Thing is, I get no performance loss on downstream, so iOS for example and any other client-end receiving o/s would have to be out-of-order friendly, and the receiving speed-tester system would have to be very non-friendly and I would have to be generating an out-of-order sequence (bad scheduling). What is perplexing is the loss of performance that is not consistent, i.e. upstream-only. However perhaps some system tolerates two links but not three.

Weaver · « **Reply #6 on:** March 31, 2017, 06:50:33 AM »

Headline: Gone away, sort of, after several years

And now something weird.

I tried a speedof.me performance test at some unearthly time in the morning, 05:45 - 06:45, and guess what. I got the expected upstream result, or is it unexpected?

Upstream and downstream throughput was reported as
Speedof.me :
   1.21 / 7.71 Mbps
   1.25 / 7.69 Mbps
   1.29 / 7.77 Mbps
   1.22 / 7.71 Mbps

The upstream is normally around 0.90 as I mentioned earlier. Perhaps the test servers or their inbound links are exceptionally quiet and that makes the difference? I don't think that it is due to a quiet local end because I have tested it so many times.

On this occasion the Firebrick PPPoE pipes were accurately tuned with speed limiter rates set according to the upstream sync rates, as opposed to just using really high rates that would not limit things. The rates were derived by multiplying the sync rate by (1500/(33×53)) in each case. For the three lines we had on this occasion

Sync = 537k, Tx rate = 460548
Sync = 429k, Tx rate = 367924
Sync = 440k, Tx rate = 377358
Total Tx rate = 1205830

In the case of the highest reported figure this is 107% of the expectation based on the total sync rate times the overheads-related expansion factor mentioned earlier.
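The arithmetic above can be reproduced with a few lines. This is only a sketch: it assumes "k" means 1000 bit/s and that each result is truncated to a whole number of bits per second, which is what matches the quoted figures.

```python
CELL_BYTES = 53        # bytes per ATM cell
CELLS_PER_PACKET = 33  # cells needed for a 1500-byte IP packet


def tx_rate(sync_bps):
    """Limiter rate for one line: sync rate times 1500 / (33 * 53), truncated."""
    return sync_bps * 1500 // (CELLS_PER_PACKET * CELL_BYTES)


sync_rates = [537_000, 429_000, 440_000]  # upstream sync rates in bit/s
rates = [tx_rate(s) for s in sync_rates]
total = sum(rates)

best_measured = 1.29e6                    # highest speedof.me upstream figure
ratio = best_measured / total

if __name__ == "__main__":
    print(rates, total, f"{ratio:.1%}")
```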

Testing with the thinkbroadband tester gave an upstream figure of 0.90 Mbps and various downstream figures around 7.0 Mbps. But then this tester always reads lower than speedof.me. However, the Ookla speed-tester app set to use their test server at Coreix in London (which routinely gives the highest results among their pool of servers) returned 1.18 / 6.97 and 1.17 / 6.95.

So what on earth to make of all this? It is clearly not simply the case that speedof.me has changed in some way, because Ookla supports their new figures. I can think of one thing that has changed, new iOS release since the last test - I am now on AArch64 iOS 10.3. However the thinkbroadband upstream is still low, although I can't remember if it is improved.

I am at a loss. There have been so very many low upstream results at around the 0.90 Mbps mark from speedof.me.

Weaver · « **Reply #7 on:** March 31, 2017, 10:10:27 AM »

I tried a speedof.me upstream test on an iOS 10.2.1 iPhone ARM 32-bit and the result was 1.14 Mbps so it is not the iOS upgrade that is responsible for the big performance hike.

Weaver · « **Reply #8 on:** March 31, 2017, 10:51:22 PM »

Talking to some people tonight made me wonder if the Firebrick router's o/s has been upgraded, but no.

Now I am wondering if BT have done something (located where) that has improved effective throughput for me, perhaps by improving jitter, or packet loss or something that relates to relative packet arrival times across the lines and reordering effects. Minimum RTT has not changed, I would have noticed that from the clueless graphs and also I would have spotted packet loss. However a change in jitter or a change in average and max RTT would be something that I would not have noticed. Although the latter is available for inspection, it is simply all over the place.

Definitely a very good kind of problem to have, a 35 - 43% improvement in upstream.

Weaver · « **Reply #9 on:** April 01, 2017, 08:57:04 AM »

Unfortunately the corollary of this is that something was seriously broken before, beyond a doubt. This absolutely proves that asking for this level of bonding efficiency upstream as well as downstream was not simply unrealistic. I did have a 'fault', I would have to say, given that performance was down to something like 69%-78% (variation in tester results) of what was achievable.

I have now done something like thirty speed tests and the results are all consistent - the huge improvement is not going away and is not confined to one speed tester. Thinkbroadband’s tester at a new 0.90 Mbps value (IPv6 selected) remains low compared to ookla and speedof.me, but comparing thinkbroadband new results with historical ones there has been a big improvement there too but from a lower base of something like 0.6-0.7 Mbps. (I should see if thinkbroadband can do an IPv4 test as well, preferably without my having to go to the trouble of turning IPv6 off just to force the test.)

The speedof.me results lie in the region 1.17 - 1.31 Mbps, with the great majority being above 1.20 Mbps. Stats after 30 samples are
Mean    1.24
σ       0.038
σ / mean   0.0305
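For anyone wanting to repeat this kind of summary, here is a minimal sketch. The thirty samples themselves are not listed here, so it uses only the four speedof.me upstream figures from the 31 March post, and it assumes the sample standard deviation is what is wanted for σ.

```python
import statistics

# The four upstream results (Mbps) quoted in the 31 March post only.
samples = [1.21, 1.25, 1.29, 1.22]

mean = statistics.mean(samples)
sigma = statistics.stdev(samples)  # sample standard deviation (n - 1)
relative = sigma / mean            # coefficient of variation

if __name__ == "__main__":
    print(f"mean {mean:.3f}  sigma {sigma:.3f}  sigma/mean {relative:.4f}")
```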

Weaver · « **Reply #10 on:** April 02, 2017, 01:40:14 AM »

Was talking to RevK about what I considered to be a fault, in this case a possible BT fault that was screwing up my upstream massively. What I said was:

Explaining my thinking: The improvement of 35-43% over the old upstream throughput is so huge that I would have to say that, just as a matter of definition, there has to have been a 'fault' previously, given that the current performance is by definition 'achievable' and 'realistic' because it is the reality, and the optimum was not remotely being achieved before. So my chosen definition of a fault, one free of analysis of underlying pathology, is "not being at the realistic optimum, taking into account all of the normal conditions for that exact case, not just comparison with general cases". Indeed if someone's downstream were >30% down, then that would be an FTR breach according to BT, wouldn't it? Perhaps I am wrong? So although it's my choice of word, and there are others that are also appropriate terms, it seems to me that I'm not alone in my definition, indeed in good company. That's just my thinking anyway. Now v happy. Actually I realise that it's not just as simple as blaming a single DSLAM, as the low speed has been around for so long that it predates the upgrade to 21CN and ADSL2 about fifteen months ago.

Fair enough?

I now remember that there was a period some years back when I was getting >1.2 Mbps upstream figures from speed testers. It took a while to come back to me though. And at the time I just thought it was some fluke defect in the speed testers. Then later on I forgot all about that good period.
