Friday, December 23, 2016

Cake: the latest in sqm (QoS) schedulers

Today I finally had the opportunity to try out Cake, the new replacement for the combination of HTB+fq_codel that the bufferbloat project developed as part of CeroWRT's sqm-scripts package.


The bufferbloat project is tackling overbloated systems in two ways:
  1. Removing the bloat everywhere that we can
  2. Moving bottlenecks to places where we can control the queues, and keep them from getting bloated
sqm-scripts, and now cake, are part of the latter.  They work by restricting the bandwidth that flows through an interface (ingress, egress, or both), and then carefully managing the queue so that it doesn't add any (or much) latency.
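
For a rough idea of what that looks like underneath, here is a minimal sketch of the tc/ip commands involved. It is not what sqm-scripts literally runs; the interface name (eth1), the ifb4 prefix, and the rates are assumptions for illustration. The function only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: print the commands that shape one interface with cake.
# Nothing is executed here. Interface name and rates are placeholders.
cake_commands() {
    iface=$1; down_kbit=$2; up_kbit=$3
    ifb="ifb4$iface"   # hypothetical name for the helper IFB device

    # Egress: cake shapes traffic leaving the interface directly.
    echo "tc qdisc replace dev $iface root cake bandwidth ${up_kbit}kbit"

    # Ingress: redirect incoming packets to an IFB device and shape them there.
    echo "ip link add name $ifb type ifb"
    echo "ip link set dev $ifb up"
    echo "tc qdisc add dev $iface handle ffff: ingress"
    echo "tc filter add dev $iface parent ffff: protocol all prio 10 u32 match u32 0 0 flowid 1:1 action mirred egress redirect dev $ifb"
    echo "tc qdisc replace dev $ifb root cake bandwidth ${down_kbit}kbit"
}
```

Something like "cake_commands eth1 135000 12000 | sh", run as root, would apply it. Setting both rates a little below what the link can really deliver is what moves the bottleneck into the router.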

More details on how cake works can be read HERE.

The WNDR3800

Cake was meant to perform well on lower-end CPUs like those in home routers.  So the test results that follow are all on a Netgear WNDR3800.  This was a fairly high-end router 5 years ago, when it was new.  Now, its dual 802.11n radios are falling behind the times, and its 680MHz MIPS CPU is distinctly slow compared to the >1GHz multi-core ARM CPUs that are currently in many home routers.

All the tests that follow were taken using the same piece of hardware.

Final Results

I'm starting with the final results, and then we'll compare the various revisions of settings and software that led to this.

Comcast Service Speeds:
180Mbps download
12Mbps upload
100s of ms of latency

Cake's shaping limits (before the CPU is maxed out):
~135 Mbps download speed
12Mbps upload
no additional latency vs idle conditions

What's really impressive is how smooth the incoming streams are.  They really are doing well.  Upstream is also pretty good (although not great, this is the edge of what the CPU can manage).  But what's simply amazing is the latency graph.  It doesn't change between an idle or fully-in-use connection.

And the CDF plot really shows that.  There's no step between the idle and loaded operation, just a near-vertical line around the link latency (which is almost entirely between the modem and the head-end).

How To Get There

Base Service (No SQM)

First, we'll start with the raw connection, as it currently stands from Comcast, with no sqm of any kind.

Using the DSL Reports speed test, we get:

Fast, but the bufferbloat gets an F (upload latency went to nearly 2 seconds, twice).

The RRUL test was a mess:

The stream performance is all over the map, and the latency jumps by 200ms.

SQM-Scripts with HTB+fq_codel

Previously, the best we had was to use HTB to limit the bandwidth (thereby moving the bottleneck to one whose buffer we controlled), and then use fq_codel to keep that buffer under control.

But HTB is known to be CPU intensive.  And so the WNDR3800 could only be set to about 100Mbps (and that was honestly pushing things).

Set to 100Mbps downstream, it's only actually achieving about 80Mbps.  It's separating the traffic classes on upstream, but not doing a great job of keeping the downstream fair across all the streams.  Still, it's significantly better than the unshaped connection.

Latency, however, is fantastic.  Unloaded latency is ~15ms, and it goes up 5ms to ~20ms under load, as we'd expect from fq_codel with a 5ms latency target.

But, from 180Mbps, that's leaving a lot on the table, even though it's really quite fast.  And that brings us to cake.

Getting to Cake

The previous results (if you look at the date, you'll see that was from August of 2014) were from CeroWRT, running a 3.1x kernel.

To get cake, in a reasonably "easy" manner, we just need to grab the LEDE project's firmware for the WNDR3800.

LEDE is another fork of OpenWRT.  Unlike CeroWRT, which was experimental, LEDE is a fork of the OpenWRT community itself, and it's working toward its v1.0 stable release.  Today's results are based on the 2016-12-22 "snapshot" build of LEDE.

So a quick router upgrade (and factory reset) later, and I'm running a 4.4 kernel.

Install Cake:
opkg update
opkg install luci kmod-sched-cake luci-app-sqm

Then log in and set up the settings.
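
If you'd rather skip the web UI, the same settings can be written with uci.  This is a minimal sketch, assuming the stock /etc/config/sqm with a single queue section; the interface name (eth1) and the rates (in kbit/s) are placeholders, and the script names are the ones shipped with sqm-scripts.  The function only prints "uci batch" commands, so nothing changes until you pipe them in.

```shell
#!/bin/sh
# Sketch: emit `uci batch` commands for the first queue section of
# /etc/config/sqm. Rates are in kbit/s. Interface and rates are placeholders.
sqm_batch() {
    iface=$1; down_kbit=$2; up_kbit=$3; script=$4
    cat <<EOF
set sqm.@queue[0].enabled='1'
set sqm.@queue[0].interface='$iface'
set sqm.@queue[0].download='$down_kbit'
set sqm.@queue[0].upload='$up_kbit'
set sqm.@queue[0].qdisc='cake'
set sqm.@queue[0].script='$script'
commit sqm
EOF
}
```

For example, "sqm_batch eth1 135000 12000 piece_of_cake.qos | uci batch" followed by "/etc/init.d/sqm restart" should apply it; swap in layer_cake.qos for the 3-queue variant.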

Layer Cake

First I tried the 3-queue equivalent to what I'd been running, which is called "layer cake".  This performed well, but with some oddities.

I liked how well it separated the traffic classes, but I didn't like that it wasn't smooth.  And since this is a lower-end platform (by today's standards), I moved on to the simpler "piece of cake" setup.

Piece of Cake

Piece of cake is a simple setup, with only a single queue for all traffic classes.  It's lightweight, fast, and very smooth.

Comparing the generations

By comparing the various setups, it's clear just how much of an improvement there is with cake vs. HTB+fq_codel, and how much better both are at controlling latency vs. the base (unlimited) setup.

What's striking to me is how radically different the interquartile ranges are on these datasets.  The unlimited ranges are huge, with very long whiskers.  Move to fq_codel, and they all collapse around the median.

Next Tests

Next, I plan on running the same LEDE build on my Linksys WRT1900AC.  It's a dual-core 1.2GHz ARM router, so it should be able to push packets at a far higher rate.

I also want to test Toke's airtime fairness patches, now that my WNDR3800 has them (they've been in the LEDE snapshot builds as of a few days ago).

Also on the list, IRQ affinity on the WRT1900AC, as it's clearly not spreading across the CPUs:

# cat /proc/interrupts 
           CPU0       CPU1       
 16:  573410369  576621993  armada_370_xp_irq   5  armada_370_xp_per_cpu_tick
 18:  101119566          0  armada_370_xp_irq  31  mv64xxx_i2c
 19:         21          0  armada_370_xp_irq  41  serial
 25:          0          0  armada_370_xp_irq  45  ehci_hcd:usb1
 26:   12069694          0  armada_370_xp_irq   8  mvneta
 27:  155872495          0  armada_370_xp_irq  10  mvneta
 28:          0          0  armada_370_xp_irq  55  f10a0000.sata
 29:      20241          0  armada_370_xp_irq 113  f10d0000.nand
 69:          0          0  f1018140.gpio   0  gpio_keys
 70:          0          0  f1018140.gpio   1  gpio_keys
 87:  869054656          0  armada_370_xp_irq  59  mwlwifi
 88:  592552941          0  armada_370_xp_irq  60  mwlwifi
 89:          2          0  armada_370_xp_irq  51  f1060900.xor
 90:          2          0  armada_370_xp_irq  52  f1060900.xor
 91:          2          0  armada_370_xp_irq  94  f10f0900.xor
 92:          2          0  armada_370_xp_irq  95  f10f0900.xor
 93:          0          0  armada_370_xp_msi_irq   0  xhci_hcd
IPI0:          0          0  CPU wakeup interrupts
IPI1:          0          0  Timer broadcast interrupts
IPI2:    3043894   72886072  Rescheduling interrupts
IPI3:          0          0  Function call interrupts
IPI4:     500197   79942402  Single function call interrupts
IPI5:          0          0  CPU stop interrupts
IPI6:          0          0  IRQ work interrupts
IPI7:          0          0  completion interrupts
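
As a starting point for that, here is a small sketch that picks out the numbered IRQs whose interrupts have all landed on CPU0, plus a helper for the hex CPU mask that /proc/irq/<n>/smp_affinity expects.  It assumes the two-CPU column layout shown above; the function names are my own.

```shell
#!/bin/sh
# List numbered IRQs that have only ever fired on CPU0.
# Reads /proc/interrupts-style text on stdin (two-CPU layout assumed).
# Prints: <irq number> <CPU0 count> <last field of the line>
cpu0_only_irqs() {
    awk '$1 ~ /^[0-9]+:$/ && $2 > 0 && $3 == 0 {
        irq = $1
        sub(/:$/, "", irq)
        print irq, $2, $NF
    }'
}

# Hex bitmask for a single CPU: CPU0 -> 1, CPU1 -> 2.
cpu_mask() {
    printf '%x\n' $((1 << $1))
}

# Usage (as root on the router):
#   cpu0_only_irqs < /proc/interrupts
#   cpu_mask 1 > /proc/irq/27/smp_affinity   # try moving IRQ 27 to CPU1
```

Whether the write to smp_affinity actually sticks depends on the interrupt controller and driver, so that's something the tests will have to show.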