Burnt Chrome: Cake: the latest in sqm (QoS) schedulers

Today I finally had the opportunity to try out Cake, the new replacement for the combination of HTB+fq_codel that the bufferbloat project developed as part of CeroWRT's sqm-scripts package.

Background

The bufferbloat project is tackling overbloated systems in two ways:

Removing the bloat everywhere that we can
Moving bottlenecks to places where we can control the queues, and keep them from getting bloated

sqm-scripts, and now cake, are part of the latter. They work by restricting the bandwidth that flows through an interface (ingress, egress, or both), and then carefully managing the queue so that it doesn't add any (or much) latency.
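To illustrate the ingress half of that idea, here is a minimal sketch of the usual Linux approach: packets arriving on the WAN interface are redirected to an IFB device, where a shaper can be attached as if they were egress traffic. The interface names (eth1, ifb4eth1) and the function name are placeholders assumed for illustration, not taken from sqm-scripts, and the TC/IP variables exist only so the commands can be printed instead of run.

```shell
# Sketch only: redirect traffic arriving on a WAN interface to an IFB device.
# TC and IP can be overridden (e.g. TC=echo IP=echo) for a dry run.
TC="${TC:-tc}"
IP="${IP:-ip}"

# redirect_ingress WAN_IFACE IFB_IFACE   (hypothetical helper name)
redirect_ingress() {
    wan="$1"
    ifb="$2"
    # Create the intermediate functional block device and bring it up.
    $IP link add name "$ifb" type ifb
    $IP link set dev "$ifb" up
    # Attach the ingress hook to the WAN interface.
    $TC qdisc add dev "$wan" handle ffff: ingress
    # Match every packet and hand it to the IFB device's egress path.
    $TC filter add dev "$wan" parent ffff: protocol all u32 match u32 0 0 \
        action mirred egress redirect dev "$ifb"
}

# Example (needs root; names are placeholders):
# redirect_ingress eth1 ifb4eth1
```

Running this for real needs root plus the ifb, act_mirred and cls_u32 kernel modules. The shaping qdisc then goes on the root of the IFB device.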

More details on how cake works can be read HERE.

The WNDR3800

Cake was meant to perform well on lower-end CPUs like those in home routers, so the test results that follow are all from a Netgear WNDR3800. This was a fairly high-end router when it was new, 5 years ago. Now its dual 802.11n radios are falling behind the times, and its 680MHz MIPS CPU is distinctly slow compared to the >1GHz multi-core ARM CPUs that are currently in many home routers.

All the tests that follow were taken using the same piece of hardware.

Final Results

I'm starting with the final results, and then we'll compare the various revisions of settings and software that led to this.

Comcast Service Speeds:

180Mbps download

12Mbps upload

100s of ms of added latency under load

Cake's shaping limits (before the CPU is maxed out):

~135 Mbps download speed

12Mbps upload

no additional latency vs idle conditions

What's really impressive is how smooth the incoming streams are. Upstream is also pretty good (although not great; this is the edge of what the CPU can manage). But what's simply amazing is the latency graph: it doesn't change between an idle and a fully loaded connection.

And the CDF plot really shows that. There's no step between the idle and loaded operation, just a near-vertical line around the link latency (which is almost entirely between the modem and the head-end).

How To Get There

Base Service (No SQM)

First, we'll start with the raw connection, as it currently stands from Comcast, with no sqm of any kind.

Using the DSL Reports speed test, we get:

Fast, but the bufferbloat gets an F (upload latency went to nearly 2 seconds, twice).

The RRUL test was a mess:

The stream performance is all over the map, and the latency jumps by 200ms.
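For anyone wanting to reproduce a run like this: the post doesn't name the tool, but RRUL tests are commonly driven with flent, so the following is a minimal sketch under that assumption. It presumes flent and netperf are installed on the client and that a netperf server is reachable. The host name is a placeholder and the function name is hypothetical.

```shell
# Sketch only: run a 60-second RRUL test with flent and save a plot.
# FLENT can be overridden (e.g. FLENT=echo) to print the command instead.
FLENT="${FLENT:-flent}"

# run_rrul HOST LABEL   (hypothetical helper name)
# HOST  : a netperf server to test against (placeholder in the example)
# LABEL : short tag used for the plot title and output file name
run_rrul() {
    host="$1"
    label="$2"
    $FLENT rrul -p all_scaled -l 60 -H "$host" -t "$label" -o "rrul-$label.png"
}

# Example:
# run_rrul netperf.example.net no-sqm
```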

SQM-Scripts with HTB+fq_codel

Previously, the best we had was to use HTB to limit the bandwidth (thereby moving the bottleneck to one whose buffer we controlled), and then use fq_codel to keep that buffer under control.
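In tc terms, that arrangement boils down to something like the following. This is a deliberately simplified single-class sketch, not the actual sqm-scripts setup (which also separates traffic into classes). The interface name, rate, and function name are assumptions for illustration.

```shell
# Sketch only: cap egress with HTB, then let fq_codel manage the queue.
# TC can be overridden (e.g. TC=echo) to print commands instead of running them.
TC="${TC:-tc}"

# shape_htb_fq_codel IFACE RATE_KBIT   (hypothetical helper name)
shape_htb_fq_codel() {
    iface="$1"
    rate="$2"
    # Root HTB qdisc; unclassified traffic falls into class 1:10.
    $TC qdisc replace dev "$iface" root handle 1: htb default 10
    # One class capped at the shaped rate, so the bottleneck lives here.
    $TC class add dev "$iface" parent 1: classid 1:10 htb rate "${rate}kbit" ceil "${rate}kbit"
    # fq_codel on the leaf keeps that queue short and fair across flows.
    $TC qdisc add dev "$iface" parent 1:10 fq_codel
}

# Example (needs root; eth1 is a placeholder):
# shape_htb_fq_codel eth1 100000
```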

But HTB is known to be CPU intensive. And so the WNDR3800 could only be set to about 100Mbps (and that was honestly pushing things).

Set to 100Mbps downstream, it's only actually achieving about 80Mbps. It's separating the traffic classes on upstream, but not doing a great job of keeping the downstream fair across all the streams. But it's significantly better than the previous version.

Latency, however, is fantastic. Unloaded latency is ~15ms, and it goes up 5ms to ~20ms under load, as we'd expect from fq_codel with its 5ms latency target.

But, from 180Mbps, that's leaving a lot on the table, even though it's really quite fast. And that brings us to cake.

Getting to Cake

The previous results (if you look at the date, you'll see that was from August of 2014) were from CeroWRT, running a 3.1x kernel.

To get cake, in a reasonably "easy" manner, we just need to grab the LEDE project's firmware for the WNDR3800.

LEDE is another fork of OpenWRT, but unlike CeroWRT, which was experimental, the LEDE project is a fork of the OpenWRT community itself, and it's working toward its v1.0 stable release. Today's results are based on the 2016-12-22 "snapshot" build of LEDE.

So a quick router upgrade (and factory reset) later, and I'm running a 4.4 kernel.

Install Cake:

opkg update
opkg install luci kmod-sched-cake luci-app-sqm

Then log in and configure the settings.
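The same settings can also be applied from the shell instead of the web UI. A minimal sketch, assuming the stock /etc/config/sqm with a single queue section and eth1 as the WAN interface (both assumptions; check your own config). Rates are in kbit/s, and the function name is hypothetical.

```shell
# Sketch only: write SQM settings via uci.
# UCI can be overridden (e.g. UCI=echo) to print commands instead of running them.
UCI="${UCI:-uci}"

# configure_sqm IFACE DOWN_KBIT UP_KBIT SCRIPT   (hypothetical helper name)
configure_sqm() {
    # Quote each argument: the [0] would otherwise be treated as a glob.
    $UCI set "sqm.@queue[0].interface=$1"
    $UCI set "sqm.@queue[0].download=$2"
    $UCI set "sqm.@queue[0].upload=$3"
    $UCI set "sqm.@queue[0].qdisc=cake"
    $UCI set "sqm.@queue[0].script=$4"
    $UCI set "sqm.@queue[0].enabled=1"
    $UCI commit sqm
}

# Example, followed by a restart of the service:
# configure_sqm eth1 135000 12000 piece_of_cake.qos
# /etc/init.d/sqm restart
```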

Layer Cake

First I tried the 3-queue equivalent to what I'd been running, which is called "layer cake". This performed well, but with some oddities.

I liked how well it separated the traffic classes, but I didn't like that it wasn't smooth. And since this is a lower-end platform (by today's standards), I moved on to the simpler "piece of cake" setup.

Piece of Cake

Piece of cake is a simple setup, with only a single queue for all traffic classes. It's lightweight, fast, and very smooth.
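At the tc level, the difference between the two setups comes down to one keyword on the cake qdisc. A minimal sketch, assuming that "piece of cake" corresponds to cake's besteffort mode (one tin) and "layer cake" to diffserv3 (three tins). The interface name and function name are placeholders, and the real sqm scripts pass additional options.

```shell
# Sketch only: attach cake as the root qdisc with a bandwidth limit.
# TC can be overridden (e.g. TC=echo) to print the command instead of running it.
TC="${TC:-tc}"

# shape_cake IFACE RATE_KBIT [MODE]   (hypothetical helper name)
# MODE: besteffort (single queue, the default here) or diffserv3 (three tins)
shape_cake() {
    iface="$1"
    rate="$2"
    mode="${3:-besteffort}"
    case "$mode" in
        besteffort|diffserv3) ;;
        *) echo "unsupported mode: $mode" >&2; return 1 ;;
    esac
    # Cake shapes and manages the queue itself, so no HTB parent is needed.
    $TC qdisc replace dev "$iface" root cake bandwidth "${rate}kbit" "$mode"
}

# Examples (need root; eth1 is a placeholder):
# shape_cake eth1 12000              # single queue
# shape_cake eth1 12000 diffserv3    # three queues
```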

Comparing the generations

By comparing the various setups, it's clear just how much of an improvement there is with cake vs. HTB+fq_codel, and how much better both are at controlling latency vs. the base (unlimited) setup.

What's striking to me is how radically different the interquartile ranges are on these datasets. The unlimited ranges are huge, with very long whiskers. Move to fq_codel, and they all collapse around the median.

Next Tests

Next, I plan to run the same LEDE build on my Linksys WRT1900AC. With its dual-core 1.2GHz ARM CPU, it should be able to push packets at a far higher rate.

I also want to test Toke's airtime fairness patches, now that my WNDR3800 has them (they've been in the LEDE snapshot builds as of a few days ago).

Also on the list: IRQ affinity on the WRT1900AC, as interrupts clearly aren't being spread across the CPUs:

# cat /proc/interrupts
            CPU0       CPU1
  16:  573410369  576621993  armada_370_xp_irq        5  armada_370_xp_per_cpu_tick
  18:  101119566          0  armada_370_xp_irq       31  mv64xxx_i2c
  19:         21          0  armada_370_xp_irq       41  serial
  25:          0          0  armada_370_xp_irq       45  ehci_hcd:usb1
  26:   12069694          0  armada_370_xp_irq        8  mvneta
  27:  155872495          0  armada_370_xp_irq       10  mvneta
  28:          0          0  armada_370_xp_irq       55  f10a0000.sata
  29:      20241          0  armada_370_xp_irq      113  f10d0000.nand
  69:          0          0  f1018140.gpio            0  gpio_keys
  70:          0          0  f1018140.gpio            1  gpio_keys
  87:  869054656          0  armada_370_xp_irq       59  mwlwifi
  88:  592552941          0  armada_370_xp_irq       60  mwlwifi
  89:          2          0  armada_370_xp_irq       51  f1060900.xor
  90:          2          0  armada_370_xp_irq       52  f1060900.xor
  91:          2          0  armada_370_xp_irq       94  f10f0900.xor
  92:          2          0  armada_370_xp_irq       95  f10f0900.xor
  93:          0          0  armada_370_xp_msi_irq    0  xhci_hcd
IPI0:          0          0  CPU wakeup interrupts
IPI1:          0          0  Timer broadcast interrupts
IPI2:    3043894   72886072  Rescheduling interrupts
IPI3:          0          0  Function call interrupts
IPI4:     500197   79942402  Single function call interrupts
IPI5:          0          0  CPU stop interrupts
IPI6:          0          0  IRQ work interrupts
IPI7:          0          0  completion interrupts
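To narrow down which interrupts are worth moving, a small filter over that output helps. A minimal sketch, assuming a two-CPU system and the column layout shown above. The function name and the default threshold are arbitrary choices for illustration.

```shell
# Sketch only: list busy numbered IRQs that have never fired on CPU1.
# Reads /proc/interrupts-style text on stdin; assumes exactly two CPU columns.

# unbalanced_irqs [MIN_COUNT]   (hypothetical helper name)
# Prints "IRQ COUNT NAME" for each IRQ with at least MIN_COUNT hits on CPU0
# and zero on CPU1. IPI rows and the header are skipped.
unbalanced_irqs() {
    min="${1:-1000000}"
    awk -v min="$min" '
        $1 ~ /^[0-9]+:$/ && $2 >= min && $3 == 0 {
            irq = $1
            sub(/:$/, "", irq)
            print irq, $2, $NF
        }'
}

# Example:
# unbalanced_irqs < /proc/interrupts
#
# A candidate can then be pinned to CPU1 by writing a CPU bitmask (2 = CPU1),
# e.g. for IRQ 27 (needs root; not every interrupt controller accepts this):
# echo 2 > /proc/irq/27/smp_affinity
```

On the output above this would pick out the mvneta and mwlwifi interrupts, which are the ones carrying network traffic.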


Friday, December 23, 2016
