Tuesday, June 2, 2015

HTB rate limiting not quite lining up

I've noticed this off and on, first with the WNDR3800 and now with the WRT1900AC. The rates I enter for the sqm_scripts aren't being met, and I don't think CPU load is the cause; it looks more like something in the rate limiter's book-keeping.

Here's a set of tcp_download tests on the WNDR3800; the ingress rate limits are in the legend:

The WNDR3800 scales linearly up to 90Mbps, after which it clearly falls apart. Below that point, the measured "good-put" is an eyeballed 95% of the rate set in the limiter. That gap is likely just the expected TCP overhead relative to the raw line bit-rate (which is what the limiter operates on).
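As a rough sanity check on that 95% figure, here is a minimal Python sketch of the expected payload fraction for full-size TCP segments. The header sizes are assumptions, not measurements from these tests: IPv4, TCP with 12 bytes of timestamp options, a 1500-byte MTU, and a limiter that counts the 14-byte Ethernet header with each packet. The function name and its parameters are made up for illustration.

```python
def goodput_fraction(mtu=1500, ip_header=20, tcp_header=20,
                     tcp_options=12, link_overhead=14):
    """Fraction of the bytes counted by the limiter that is TCP payload.

    All defaults are assumptions (IPv4, TCP timestamps enabled, Ethernet
    header counted by the limiter); adjust them to match the real link.
    """
    payload = mtu - ip_header - tcp_header - tcp_options
    accounted = mtu + link_overhead
    return payload / accounted


if __name__ == "__main__":
    fraction = goodput_fraction()
    # Show the expected good-put for a few example limiter settings.
    for limit_mbps in (10, 50, 90):
        print(f"{limit_mbps} Mbps limit: about {limit_mbps * fraction:.1f} Mbps good-put")
```

With these assumptions the fraction comes out near 0.956, in the same neighborhood as the eyeballed 95%. Counting full on-wire Ethernet framing (38 bytes) and no TCP options instead gives about 0.949.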

However, on the WRT1900AC, it's off rather significantly:

Maybe 80% of the target ingress rate?

+Dave Taht suggested I turn off TCP offloads. With them off, the results got less linear: worse at the low end, better at the high end.

This is definitely going to take some more testing (and longer test runs) to map out what the issue(s) might be.


Correction: This post previously stated that the WNDR3800 was falling short, but after talking with some other people, I think that's likely just the expected overhead of TCP. The difference between the raw line rate and the "good-put" becomes more obvious as bandwidths go up (a 5Mbps gap is easier to see than a 500Kbps one).

sqm_scripts: before and after at 160Mbps+

Apparently I've been upgraded.  I did a baseline test today (with sqm off), and saw that the new download rate was up around 160-175Mbps from 120-130.  That's some very impressive over-provisioning from Comcast.

Unfortunately, the upgrade also comes with a good deal of bufferbloat. That's a surprising change for the worse, as the service, when initially installed with the same modem, was actually quite good by "retail" standards (though still awful compared to what it should be).

The ugly (but fast):

Classic bufferbloat.  At idle, the target endpoint is maybe 10-12ms away.  200+ms of latency is pretty awful, and drags the "effective" performance of the service from >150Mbps down to what "feels" like a couple Mbps.
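For a sense of scale, the amount of data sitting in that queue can be estimated from the numbers above. This is a back-of-the-envelope sketch in Python; it assumes all of the added delay is queueing in a single buffer that drains at the download rate, and the function name is hypothetical.

```python
def queued_bytes(rate_mbps, loaded_latency_ms, idle_latency_ms):
    """Estimate the standing queue, in bytes, behind the added latency.

    Assumes the extra delay under load is entirely queueing in one
    buffer draining at rate_mbps (a simplification).
    """
    extra_seconds = (loaded_latency_ms - idle_latency_ms) / 1000.0
    bytes_per_second = rate_mbps * 1e6 / 8
    return bytes_per_second * extra_seconds


if __name__ == "__main__":
    backlog = queued_bytes(rate_mbps=160, loaded_latency_ms=200, idle_latency_ms=12)
    print(f"approximate standing queue: {backlog / 1e6:.2f} MB")
```

At 160Mbps, with latency going from about 12ms to 200ms, that works out to roughly 3.8MB of queued data.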

After raising the limits in sqm and turning off the TCP stack offloads, I ended up with this:

So total available bandwidth has dropped to about 140-150Mbps (still more than the 120Mbps the service claims to be). But the added latency is basically gone: fq_codel holds its 5ms target rather nicely.

To make that latency difference more apparent:

200Mbps ingress limit (something is odd with the math on this, clearly)
12Mbps egress limit
ethtool -K eth1 tso off gso off gro off
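To put a number on how odd that math is, here is a small Python sketch that compares the measured rate against what the limit would be expected to deliver. The 95% expected fraction is borrowed from the eyeballed WNDR3800 result above and is an assumption here, as is the function name.

```python
def unexplained_shortfall_mbps(limit_mbps, measured_mbps, expected_fraction=0.95):
    """Gap between the good-put the limit should allow and what was measured.

    expected_fraction is an assumed good-put ratio (eyeballed, not derived).
    """
    return limit_mbps * expected_fraction - measured_mbps


if __name__ == "__main__":
    # Measured range from the post: 140-150 Mbps under a 200 Mbps ingress limit.
    for measured in (140, 150):
        gap = unexplained_shortfall_mbps(200, measured)
        print(f"measured {measured} Mbps: about {gap:.0f} Mbps below expectation")
```

For the 200Mbps limit and a measured 140-150Mbps, that leaves roughly 40-50Mbps unaccounted for, though some of that may simply be the line itself topping out around 160-175Mbps.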

TCP Offloads: more harm than good

+Dave Taht has been saying for a while that TCP offloads do more harm than good, especially when mixed with fq_codel and the ingress rate limiter in the sqm_scripts package. That limiter replaces the large inbound buffers in the modem and CMTS with a much smaller buffer (at nearly the same bandwidth) that is under the router's control.

I finally put some numbers on that tonight.

The first dataset (green plots) is without gro, tso, and gso. The second is with all of those offloads re-enabled. So enabling offloads:

1) slows it down
2) increases latency


Yeah, I'm keeping all the offloads turned off (and I've adjusted my router startup scripts to turn them off again each time the simple.qos script runs).