The bufferbloat project is tackling overbloated systems in two ways:
- Removing the bloat everywhere that we can
- Moving bottlenecks to places where we can control the queues, and keep them from getting bloated
sqm-scripts, and now cake, are part of the latter. They work by restricting the bandwidth that flows through an interface (ingress, egress, or both), and then carefully managing the queue so that it doesn't add any (or much) latency.
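The pattern sqm-scripts implements can be sketched with tc. This is a simplified, hand-hedged version of what the scripts actually generate (interface names and rates here are placeholders, not my link's settings): shape egress directly, and redirect ingress through an ifb pseudo-device so downloads can be shaped too.

```shell
# Egress: rate-limit uploads with HTB, then let fq_codel manage the queue.
tc qdisc add dev eth0 root handle 1: htb default 11
tc class add dev eth0 parent 1: classid 1:11 htb rate 10mbit
tc qdisc add dev eth0 parent 1:11 fq_codel

# Ingress: mirror incoming packets onto an ifb device and shape them there,
# so downloads queue in a buffer we control instead of the ISP's.
ip link add ifb0 type ifb
ip link set ifb0 up
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol all u32 match u32 0 0 \
    action mirred egress redirect dev ifb0
tc qdisc add dev ifb0 root handle 1: htb default 11
tc class add dev ifb0 parent 1: classid 1:11 htb rate 100mbit
tc qdisc add dev ifb0 parent 1:11 fq_codel
```

The key trick is setting the HTB rates slightly below the real link rates, which moves the bottleneck (and therefore the queue) onto the router.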
More details on how cake works can be read HERE.
Cake was meant to perform well on lower-end CPUs like those in home routers. So the test results that follow are all on a Netgear WNDR3800. This was a fairly high-end router, 5 years ago when it was new. Now, its dual 802.11n radios are falling behind the times, and its 680MHz MIPS CPU is distinctly slow compared to the >1GHz multi-core ARM CPUs that are currently in many home routers.
All the tests that follow were taken using the same piece of hardware.
I'm starting with the final results, and then we'll compare the various revisions of settings and software that led to this.
The final setup takes a link that showed 100s of ms of latency under load, and delivers:
- ~135 Mbps download speed
- no additional latency vs idle conditions
What's really impressive is how smooth the incoming streams are. They really are doing well. Upstream is also pretty good (although not great, this is the edge of what the CPU can manage). But what's simply amazing is the latency graph. It doesn't change between an idle or fully-in-use connection.
And the CDF plot really shows that. There's no step between the idle and loaded operation, just a near-vertical line around the link latency (which is almost entirely between the modem and the head-end).
How To Get There
Base Service (No SQM)
First, we'll start with the raw connection, as it currently stands from Comcast, with no SQM of any kind.
Using the DSL Reports speed test, we get:
Fast, but the bufferbloat grade gets an F (upload latency went to nearly 2 seconds, twice).
The RRUL test was a mess:
SQM-Scripts with HTB+fq_codel
Previously, the best we had was to use HTB to limit the bandwidth (thereby moving the bottleneck to one whose buffer we controlled), and then using fq_codel to keep that buffer under control.
But HTB is known to be CPU intensive. And so the WNDR3800 could only be set to about 100Mbps (and that was honestly pushing things).
Set to 100Mbps downstream, it's only actually achieving about 80Mbps. It's separating the traffic classes on upstream, but not doing a great job of keeping the downstream fair across all the streams. Still, it's significantly better than running with no SQM at all.
Latency, however, is fantastic. Unloaded latency is ~15ms, and it rises by 5ms to ~20ms under load, as we'd expect from fq_codel with its 5ms latency target.
But on a 180Mbps service, capping at 100Mbps leaves a lot on the table, even though it's really quite fast. And that brings us to cake.
Getting to Cake
The previous results (if you look at the date, you'll see that was from August of 2014) were from CeroWRT, running a 3.1x kernel.
To get cake, in a reasonably "easy" manner, we just need to grab the LEDE project's firmware for the WNDR3800.
LEDE is another fork of OpenWRT, but unlike CeroWRT, which was experimental, the LEDE project is a fork by the OpenWRT community itself, and it's working toward its v1.0 stable release. Today's results are based on the 2016-12-22 "snapshot" build of LEDE.
So a quick router upgrade (and factory reset) later, and I'm running a 4.4 kernel.
opkg install luci kmod-sched-cake luci-app-sqm
Then log in and set up the settings.
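On LEDE, luci-app-sqm stores what you enter in the GUI in /etc/config/sqm. Roughly, the setup looks like the fragment below; the interface name and bandwidth values (in kbit/s) are illustrative placeholders, not the exact numbers from my link.

```
config queue 'eth1'
        option enabled '1'
        option interface 'eth1'
        option download '170000'
        option upload '10000'
        option qdisc 'cake'
        option script 'piece_of_cake.qos'
        option linklayer 'none'
```

Swapping `piece_of_cake.qos` for `layer_cake.qos` is all it takes to move between the two setups discussed below.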
First I tried the 3-queue equivalent to what I'd been running, which is called "layer cake". This performed well, but with some oddities.
I liked how well it separated the traffic classes, but I didn't like that it wasn't smooth. And since this is a lower-end platform (by today's standards), I moved on to the simpler "piece of cake" setup.
Piece of Cake
Piece of cake is a simple setup, with only a single queue for all traffic classes. Lightweight, but it's fast. And very smooth.
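Under the hood, the difference between the two setups comes down to cake's tin keywords: layer_cake runs cake with multiple priority tins, while piece_of_cake collapses everything into a single best-effort tin. A rough sketch (interface name and rate are placeholders):

```shell
# "layer cake": three diffserv tins, so traffic classes are kept separate
tc qdisc replace dev eth1 root cake bandwidth 170mbit diffserv3

# "piece of cake": one best-effort tin, less classification work per packet
tc qdisc replace dev eth1 root cake bandwidth 170mbit besteffort
```

On a slow CPU, skipping the per-packet tin classification is part of why piece of cake runs smoother.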
Comparing the generations
By comparing the various setups, it's clear just how much of an improvement there is with cake vs. HTB+fq_codel, and how much better both are at controlling latency vs. the base (unlimited) setup.
What's striking to me is how radically different the interquartile ranges are on these datasets. The unlimited ranges are huge, with very long whiskers. Moving to fq_codel, they all collapse around the median.
My next planned tests are running the same LEDE build on my Linksys WRT1900AC. With a dual-core 1.2GHz ARM CPU, it should be able to push packets at a far higher rate.
The other set of tests that I want to do are to test Toke's airtime fairness patches, now that my WNDR3800 has them (as they're in the LEDE snapshot builds as of a few days ago).
Also on the list, IRQ affinity on the WRT1900AC, as it's clearly not spreading across the CPUs:
```
# cat /proc/interrupts
            CPU0       CPU1
 16:   573410369  576621993  armada_370_xp_irq   5  armada_370_xp_per_cpu_tick
 18:   101119566          0  armada_370_xp_irq  31  mv64xxx_i2c
 19:          21          0  armada_370_xp_irq  41  serial
 25:           0          0  armada_370_xp_irq  45  ehci_hcd:usb1
 26:    12069694          0  armada_370_xp_irq   8  mvneta
 27:   155872495          0  armada_370_xp_irq  10  mvneta
 28:           0          0  armada_370_xp_irq  55  f10a0000.sata
 29:       20241          0  armada_370_xp_irq 113  f10d0000.nand
 69:           0          0  f1018140.gpio       0  gpio_keys
 70:           0          0  f1018140.gpio       1  gpio_keys
 87:   869054656          0  armada_370_xp_irq  59  mwlwifi
 88:   592552941          0  armada_370_xp_irq  60  mwlwifi
 89:           2          0  armada_370_xp_irq  51  f1060900.xor
 90:           2          0  armada_370_xp_irq  52  f1060900.xor
 91:           2          0  armada_370_xp_irq  94  f10f0900.xor
 92:           2          0  armada_370_xp_irq  95  f10f0900.xor
 93:           0          0  armada_370_xp_msi_irq 0  xhci_hcd
IPI0:          0          0  CPU wakeup interrupts
IPI1:          0          0  Timer broadcast interrupts
IPI2:    3043894   72886072  Rescheduling interrupts
IPI3:          0          0  Function call interrupts
IPI4:     500197   79942402  Single function call interrupts
IPI5:          0          0  CPU stop interrupts
IPI6:          0          0  IRQ work interrupts
IPI7:          0          0  completion interrupts
```
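Steering an interrupt to a specific CPU is done by writing a hex CPU bitmask to that IRQ's smp_affinity file. For example, to move one of the mvneta (ethernet) interrupts onto CPU1 (IRQ 27 is taken from the table above; whether the driver honors the mask depends on the platform):

```shell
# The mask is a hex bitmask of CPUs: 1 = CPU0, 2 = CPU1, 3 = both.
cat /proc/irq/27/smp_affinity        # show which CPUs may handle IRQ 27
echo 2 > /proc/irq/27/smp_affinity   # steer IRQ 27 (mvneta) to CPU1
```

Spreading the ethernet and wifi IRQs across both cores would keep one CPU from being the bottleneck while the other sits mostly idle.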