Friday, December 23, 2016

Cake: the latest in sqm (QoS) schedulers

Today I finally had the opportunity to try out Cake, the new replacement for the combination of HTB+fq_codel that the bufferbloat project developed as part of CeroWRT's sqm-scripts package.


The bufferbloat project is tackling overbloated systems in two ways:
  1. Removing the bloat everywhere that we can
  2. Moving bottlenecks to places where we can control the queues, and keep them from getting bloated
sqm-scripts, and now cake, are part of the latter.  They work by restricting the bandwidth that flows through an interface (ingress, egress, or both), and then carefully managing the queue so that it doesn't add any (or much) latency.
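For a concrete picture of the idea, here's a simplified sketch of the tc plumbing involved (these are not the actual sqm-scripts commands, and the interface names and rates are assumptions for illustration).  Egress gets a shaper directly; ingress traffic has to be redirected through an ifb pseudo-device so a shaper can be attached to it too:

```shell
# Egress: shape to the upload rate, with cake managing the queue.
tc qdisc add dev eth0 root cake bandwidth 12mbit

# Ingress: redirect inbound traffic through an ifb device, then shape it
# to just under the download rate so the queue forms here, not at the modem.
ip link add name ifb0 type ifb
ip link set ifb0 up
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol all u32 match u32 0 0 \
    action mirred egress redirect dev ifb0
tc qdisc add dev ifb0 root cake bandwidth 135mbit
```

The key trick is the ingress side: you can't queue packets that have already arrived, but by shaping slightly below the ISP's rate you starve the modem's bloated buffer and keep the queue local.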

More details on how cake works can be read HERE.

The WNDR3800

Cake was meant to perform well on lower-end CPUs like those in home routers.  So the test results that follow are all on a Netgear WNDR3800.  This was a fairly high-end router, 5 years ago when it was new.  Now, its dual 802.11n radios are falling behind the times, and its 680MHz MIPS CPU is distinctly slow compared to the >1GHz multi-core ARM CPUs that are currently in many home routers.

All the tests that follow were taken using the same piece of hardware.

Final Results

I'm starting with the final results, and then we'll compare the various revisions of settings and software that led to this.

Comcast Service Speeds:
180Mbps download
12Mbps upload
100s of ms of latency

Cake's shaping limits (before the CPU is maxed out):
~135 Mbps download speed
12Mbps upload
no additional latency vs idle conditions

What's really impressive is how smooth the incoming streams are.  They really are doing well.  Upstream is also pretty good (although not great, this is the edge of what the CPU can manage).  But what's simply amazing is the latency graph.  It doesn't change between an idle or fully-in-use connection.

And the CDF plot really shows that.  There's no step between the idle and loaded operation, just a near-vertical line around the link latency (which is almost entirely between the modem and the head-end).

How To Get There

Base Service (No SQM)

First, we'll start with the raw connection, as it currently stands from Comcast, with no sqm of any kind.

Using the DSL Reports speed test, we get:

Fast, but the bufferbloat gets an F (upload latency went to nearly 2 seconds, twice).

The RRUL test was a mess:

The stream performance is all over the map, and the latency jumps by 200ms

SQM-Scripts with HTB+fq_codel

Previously, the best we had was to use HTB to limit the bandwidth (thereby moving the bottleneck to one whose buffer we controlled), and then using fq_codel to keep that buffer under control.
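The egress half of that looks roughly like the following (a simplified sketch; the real sqm-scripts add ingress handling and a number of tuning options, and the interface name and rate here are assumptions):

```shell
# HTB enforces the bandwidth limit, moving the bottleneck to this box...
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 12mbit
# ...and fq_codel keeps the resulting queue short and flow-fair.
tc qdisc add dev eth0 parent 1:10 fq_codel
```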

But HTB is known to be CPU intensive.  And so the WNDR3800 could only be set to about 100Mbps (and that was honestly pushing things).

Set to 100Mbps downstream, it's only actually achieving about 80Mbps.  It's separating the traffic classes on upstream, but not doing a great job of keeping the downstream fair across all the streams.  But it's significantly better than the previous version.

Latency, however, is fantastic.  Unloaded latency is ~15ms, and it goes up 5ms to ~20ms, as we'd expect from fq_codel with a 5ms latency target.

But, from 180Mbps, that's leaving a lot on the table, even though it's really quite fast.  And that brings us to cake.

Getting to Cake

The previous results (if you look at the date, you'll see that was from August of 2014) were from CeroWRT, running a 3.1x kernel.

To get cake, in a reasonably "easy" manner, we just need to grab the LEDE project's firmware for the WNDR3800.

LEDE is another fork of OpenWRT, but unlike CeroWRT, which was experimental, LEDE is a fork of the OpenWRT community itself, and it's working on making its v1.0 stable release.  Today's results are based on the 2016-12-22 "snapshot" build of LEDE.

So a quick router upgrade (and factory reset) later, and I'm running a 4.4 kernel.

Install Cake:
opkg update
opkg install luci kmod-sched-cake luci-app-sqm

Then log in and configure the SQM settings.
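The same settings can also be made from the shell via UCI, rather than through LuCI.  This is roughly what my final setup amounts to (the WAN interface name is an assumption; rates are in kbit/s):

```shell
# Define an sqm queue on the WAN interface with the rates from the
# "Final Results" above, using cake's single-queue "piece of cake" script.
uci set sqm.eth1=queue
uci set sqm.eth1.interface='eth1'
uci set sqm.eth1.download='135000'
uci set sqm.eth1.upload='12000'
uci set sqm.eth1.qdisc='cake'
uci set sqm.eth1.script='piece_of_cake.qos'
uci set sqm.eth1.enabled='1'
uci commit sqm
/etc/init.d/sqm restart
```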

Layer Cake

First I tried the 3-queue equivalent to what I'd been running, which is called "layer cake".  This performed well, but with some oddities.

I liked how well it separated the traffic classes, but I didn't like that it wasn't smooth.  And since this is a lower-end platform (by today's standards), I moved on to the simpler "piece of cake" setup.

Piece of Cake

Piece of cake is a simple setup, with only a single queue for all traffic classes.  Lightweight, but it's fast.  And very smooth.
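Under the hood, the difference between the two setups boils down to cake's tin keywords.  Illustrative commands only (the sqm scripts issue the real ones, with more options; the interface name and rate are assumptions):

```shell
# "layer cake": three diffserv-based priority tins
tc qdisc replace dev eth0 root cake bandwidth 12mbit diffserv3
# "piece of cake": one best-effort tin for everything
tc qdisc replace dev eth0 root cake bandwidth 12mbit besteffort
```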

Comparing the generations

By comparing the various setups, it's clear just how much of an improvement there is with cake vs. HTB+fq_codel, and how much better both are at controlling latency vs. the base (unlimited) setup.

What's striking to me is how radically different the inter-quartile ranges are on these datasets.  The unlimited ranges are huge, with very long whiskers.  Moving to fq_codel, they all collapse around the median.

Next Tests

My next planned test is running the same LEDE build on my Linksys WRT1900AC.  It's a dual-core 1.2GHz ARM router, so it should be able to push packets at a far higher rate.

The other set of tests that I want to do are to test Toke's airtime fairness patches, now that my WNDR3800 has them (as they're in the LEDE snapshot builds as of a few days ago).

Also on the list, IRQ affinity on the WRT1900AC, as it's clearly not spreading across the CPUs:

# cat /proc/interrupts 
           CPU0       CPU1       
 16:  573410369  576621993  armada_370_xp_irq   5  armada_370_xp_per_cpu_tick
 18:  101119566          0  armada_370_xp_irq  31  mv64xxx_i2c
 19:         21          0  armada_370_xp_irq  41  serial
 25:          0          0  armada_370_xp_irq  45  ehci_hcd:usb1
 26:   12069694          0  armada_370_xp_irq   8  mvneta
 27:  155872495          0  armada_370_xp_irq  10  mvneta
 28:          0          0  armada_370_xp_irq  55  f10a0000.sata
 29:      20241          0  armada_370_xp_irq 113  f10d0000.nand
 69:          0          0  f1018140.gpio   0  gpio_keys
 70:          0          0  f1018140.gpio   1  gpio_keys
 87:  869054656          0  armada_370_xp_irq  59  mwlwifi
 88:  592552941          0  armada_370_xp_irq  60  mwlwifi
 89:          2          0  armada_370_xp_irq  51  f1060900.xor
 90:          2          0  armada_370_xp_irq  52  f1060900.xor
 91:          2          0  armada_370_xp_irq  94  f10f0900.xor
 92:          2          0  armada_370_xp_irq  95  f10f0900.xor
 93:          0          0  armada_370_xp_msi_irq   0  xhci_hcd
IPI0:          0          0  CPU wakeup interrupts
IPI1:          0          0  Timer broadcast interrupts
IPI2:    3043894   72886072  Rescheduling interrupts
IPI3:          0          0  Function call interrupts
IPI4:     500197   79942402  Single function call interrupts
IPI5:          0          0  CPU stop interrupts
IPI6:          0          0  IRQ work interrupts
IPI7:          0          0  completion interrupts
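When I get to it, the fix should be along these lines: write a CPU bitmask into the IRQ's smp_affinity file.  A hypothetical example, using IRQ 27 (the busy mvneta ethernet interrupt in the table above):

```shell
# Move IRQ 27 to CPU1 (bitmask 0x2), leaving CPU0 for the wifi interrupts.
echo 2 > /proc/irq/27/smp_affinity
# Verify the mask took, then watch /proc/interrupts to see counts move.
cat /proc/irq/27/smp_affinity
```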

Saturday, October 8, 2016

Gear Review: Brevite Camera Backpack

As I mentioned in my previous post, while the Crumpler bag is great for travel, it's not so great for day-to-day use, and in particular isn't great on a hike.  But then, neither are most hiking back-packs, being designed to be out of the way when you're hiking.

So when I saw the Brevite kickstarter, I decided to take a chance and back it, and see if it would work out better for me.

Unfortunately, it's not well suited to me and my cameras.  But I think it would be fine for many other people.  An important clue in all of this was in the kickstarter campaign, and had I been paying closer attention, I would have realized in advance the issue I ran into.

But first let's talk about the bag, and what it does well.

Styling-wise, it's clearly pulling from the JanSport classic student backpack, albeit a bit nicer looking with the leather accents.  Honestly, it's a bit too nice for hiking, it's the wrong sort of style.

Size-wise, it's very close to the Crumpler, which as I said in the last post, is pretty close to the typical student backpack size.

I'd not actually set them side-by-side before I took this pic, and was surprised at how close they are in external dimensions.  Because the Crumpler feels like a larger bag, with all the space it has.  And the Brevite feels like a little bag, due to how little it has.  But it also feels "short" while wearing it.

Comparing published dimensions, it's 2" shallower than a JanSport, so maybe that's part of the issue.

Starting at the back, and going forward, there's the laptop and document pocket.

The two-layer pocket and smooth fabric works well for laptops and documents.  The next zipper is for the main, upper, pocket.

It's unfortunately too small for my needs.  It can fit a single rolled up sweater or fleece, a bottle of water, or other similarly sized items.  At least, that's all when the padded camera insert is inside it (which is visible at the bottom of the main pocket).

The front of the bag has two pockets.  The front-most is small and simple.  The second has a bunch of small loops that work for holding pens, smaller-diameter filters, memory cards, and the like.

Behind the front pockets is the removable, padded, camera insert.  It holds a body and multiple lenses with ease.

And it has a second way in (in the photo below).  This was the main selling feature of the bag.  A quick access into the camera pocket.  Two zippers (one on the bag, one on the camera insert), and it's open.

And it's too small for an FX-sized body, even one as small as a D600.  My old D50 just barely fits, with the viewfinder catching on the opening.

And this is what I missed in the campaign video.  They only show it being used with smaller form-factor cameras, like classic film bodies.  The campaign photographs show it with a larger DSLR and a big 70-200/2.8.  But it doesn't show how hard it is to get such large pieces in/out of the backpack.

That is where it just doesn't work for me, unfortunately.  It fails at being able to work with my standard kit, a D600 with either a prime or a versatile zoom like the 24-120/4.

Gear Review: Crumpler Proper Roady Photo Full Backpack

3 years ago, when living in Paris, I needed something for carrying camera gear on planes that was better than my 10yo daypack.

So I ordered this, having seen the Amazon Basics knock-off version of it:

Proper Roady Photo Full Backpack, by Crumpler

It's been utterly fantastic for carting camera gear and a laptop on planes (or any other time I need to travel with the "full kit").  It holds a ton of camera gear and a 15" laptop, fits under-seat if it needs to, and is fairly comfortable once all the straps are adjusted right.

Here it is:

Size-wise, I consider it a medium-sized back-pack.  It's only a few cm wider and deeper than a standard student pack like the ubiquitous JanSport.

The grey and orange color was on clearance, now it seems to only come in black.  The outside has held up well, after many flights and photowalk trips.  The grey, with its light texture, seems to hide any dirt that it's picked up quite well.

Inside, the main compartment is covered with a separate mesh cover, so as you open up the bag to get at the inner flap, you don't need to worry about lenses coming out.

Another handy feature of the mesh panel and its second pair of zippers is that you can open it up in just one spot to reach in for filters, batteries, etc.  But I've found that I tend to put almost everything in this main area, especially for a big trip where I'm carrying chargers, filters, multiple lenses, my laptop power adapter, the speedlight, etc.

Here it's carrying:

  • FX-sized body + 24-120 f/4 (well, except that I'm shooting with that one)
  • DX-sized body
  • 100mm f/2.8 prime
  • 18-200 DX zoom
  • 50mm f/1.8 prime
  • spare hoods
  • filters
  • charger
  • cleaning cloths
I no longer travel with the DX body, but in its place I can put a speedlight (in its case), spare batteries, an external battery, cable, and charger for the phone, and a portable external hard drive (photo backups).  It's amazing how much fits into it.

I've never really figured out the right way to use one of the dividers, however, so I keep it curled up, and the space on the other side of where my FX body goes is filled by the camera strap.

The main pouch's inner flap also has a mesh cover with a zipper, giving access to an area that's good for small flat things.

This has little pouches for stuff like SD cards and the Nikon IR remote.

On the outside, there are compression straps that can be used for big bulky things like tripods.  I've never really used them except as extra closure insurance while traveling (as it makes it much harder to open the bag).

Hidden on the outside of the main flap is another pocket.  This is good for 1-2 books (3 if they're thin paperbacks).  But it works well enough for reading material when flying.  If you've switched to a kindle, it's probably perfect, but it's a bit small for larger format books or thick books.

At the back, there's a very soft, well-padded pocket that readily fits a 15" MacBook Pro.  I'm not sure if it would deal as well with a chunkier laptop like a ThinkPad.

Construction-wise, I've been worried about the stress on that zipper, since the straps anchor right at that zipper.  But it's never given any signs of being stressed, even when the bag has 20lbs of stuff in it.

Strap-wise, it has good shoulder straps, with a chest strap, and a hip belt.  It also has a mesh-covered padded back, to help with ventilation.

Unfortunately, the straps don't stow, so it can be sort of like an octopus when you get it out of the overhead bins on an airplane.  The curved straps and the chest strap together make it pretty comfortable, and I only bother with the hip belt when I'm going to be standing with it for a very long time.  It's not a great hip belt, but it works well enough to get some weight off your shoulders.

Overall, it's a solid bag.  But it has a few limitations.

It really can't do anything other than carry camera gear.  No side pockets, no document pockets, etc.  It's great when you're taking the camera (and laptop), but it leaves you needing another bag for anything else (books, clothes, knitting, etc).

What's especially frustrating is that it doesn't deal well with paper.  Paper either goes in the main compartment, between the layers of mesh (where it's mashed out of shape by the stuff in the bag), or in with the laptop.  But the laptop sleeve is so plush that paper really doesn't want to go in there, unless it's in a folder.

Getting stuff out can be slow.

One of the reasons I put the camera with lens at the top of the bag is so that I can get a camera out by just opening the inner and outer zippers a bit, and reaching into the upright bag.  Perfect for stowing in the rear-driver-side footwell, and having ready access to the camera.

But if you need anything else, it has to be laid down flat, and the two sets of zippers opened.  Then everything is right there, but it's on a table or on the ground, and then back on your back...  It makes for slow cumbersome lens changes, and really limits where you want to open it up if the ground is at all dirty or muddy.

And so it's not at all a good bag for hiking.  It's actually pretty abysmal at that.  And it's not that great for photowalks, either.  But it is great for traveling by plane, train, or automobile when you need to carry a bunch of stuff.

So if that's what you're looking for, it's fantastic, and highly recommended.  But if your needs include other sorts of travel, with quick access to the camera (or other stuff), it's not the best.

Thursday, September 22, 2016

iperf3 and microbursts, part II

Previously, I talked about iperf3's microbursts, and how changing the frequency of the timer could smooth things out.

One of the first pieces of feedback I got, which lined up with some musings of my own, was whether it might be better to calculate the timer based on the smoothest possible packet rate:

PPS = Rate / Packet Size

This is, btw, what iperf3 does automatically on Linux, when the fq socket scheduler is available.  So this is really just seeing if we can fake it from user-land on systems that don't have it (like my OSX system).
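Working through the arithmetic for the 500Mbps case shown below (assuming 1000-byte UDP payloads):

```shell
rate_bps=500000000     # target rate: 500Mbps
pkt_bytes=1000         # UDP payload size
pps=$(( rate_bps / (pkt_bytes * 8) ))   # packets per second
interval_us=$(( 1000000 / pps ))        # microseconds between packets
echo "${pps} pps -> ${interval_us}us between packets"
# -> 62500 pps -> 16us between packets
```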

Luckily, adding this to iPerf's code is pretty easy.


To recap, using a 1ms timer and 100Mbps with a 1K udp packet size results in the following:
Zooming in:

Switching from 1ms pacing to 819us pacing (the results of the calculated pacing), nets:
And zooming in:

And, I should be careful, because I'm quantizing this into buckets that are the same size as the timer...  I should probably be subdividing more, or much less, to get a better view of what's going on.  But I'm going to stick with the 1ms buckets for this analysis (for data presentation consistency).


I've only been showing 100Mbps results, but I really should document how it works at higher speeds, especially something closer to the line rate.  So here's what 500Mbps looks like through all these changes.


1ms timer:

And then using a calculated 16µs timer:

Much, much smoother pacing.

However, there are still some big overshoots, but that's due to how the iperf3 red-light/green-light algorithm reacts to stumbles (or late-firing timers).  It sends more packets until it catches back up.  At the micro-scale, this isn't a big deal, but it can cause the tool to stick in "green-light" mode when testing through congested links and it can't actually maintain the desired rate.

I've set up a new branch on my GitHub fork to play around with this.  Including capping the maximum frequency (with a command-line param to change it).  The cap is specified in µs, as that's what the POSIX api lets you use.

Now to get some captures from a Linux box with fq pacing, and show how well it performs.

Tuesday, September 20, 2016

iperf3 and microbursts

This is in the "Always know how your tools work" category.


We were doing some end-to-end network testing at work using iperf3.  While TCP would happily fill the link, UDP was giving us just miserable results.  It wouldn't achieve more than 8Mbps of 'goodput' on a 50Mbps link (which we'd just verified using TCP).  Extra packets over 8Mbps were getting dropped.

The setup at the traffic generation end was:

PC -> switch1 -> switch2 -> 50mbps link to ISP

  • PC to switch1 was 1Gbps
  • switch1 to switch2 was 100Mbps
If we connected the PC running iperf3 directly to switch2, we'd get much better throughput (or, rather, much less packet loss for the same traffic load).  But then the PC was transmitting into a 100Mbps port, not a 1Gbps port...

I thought that this sounded like perhaps packet bursts were exceeding some buffers, and got curious as to how iperf3 actually generates traffic and performs pacing with the -b option.

Packet Bursts

I personally didn't have access to the test setup, so I did some quick code inspection.  What I found was that it turns transmission on using a 100ms timer, turning transmission back off when it's achieved the right long-term average transmit rate (specified with -b ).

What this ends up looking like for a 100Mbps rate transmitting into a much higher rate interface, using 1KB packets, is below.  This is a graph of MB sent per 1ms period, scaled into Mbps.
Here's the first second of that, using the same Y scale (1Gbps):

So yeah, bursty.  Not exactly "micro" bursts, either.  More like millibursts.

iperf3 and Packet Pacing

iperf3 uses the fq socket pacing option by default, when it's available (see the tc-fq(8) man page).  But we were using OSX, where it's not available.

When it's not available, iperf3 uses the following algorithm to throttle the rate it transmits at:

while (testing)
    sleep 100ms
    while (total_bytes_sent / total_elapsed_time < target_rate)
        transmit a buffer of data

This results in big rate/10 bursts of data on the wire.  If the local interface rate is the same as the end-to-end network path's rate, then there's no issue.  If the local interface rate is wildly different, then issues start to arise.  Like at a switch that's doing a 10:1 rate change between ports.

Switches don't have 100ms of buffers, nor do we want them to (because that leads to TCP bufferbloat).
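To put a number on it: each 100ms "green light" interval carries a tenth of a second's worth of the target rate, blasted out at the local line rate.  At 100Mbps that's 1.25MB per burst:

```shell
rate_bps=100000000    # target rate: 100Mbps
burst_ms=100          # iperf3's timer period
burst_bytes=$(( rate_bps / 8 * burst_ms / 1000 ))
echo "${burst_bytes} bytes per burst"   # 1250000 bytes, ~1.25MB
```

That's the lump a switch would need to buffer when stepping down from 1Gbps to 100Mbps, which is far more than typical switch buffers hold.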

So I started experimenting with faster timers to see how smooth the results would be.

10ms timer

Using a 10ms timer brought the peaks down a little, and vastly shortened them in the process.
But it's still not "smooth".

1ms timer

The real fq packet scheduler sets a timer based on the number of µs between packets to achieve the right rate, to give a very smooth pacing.  That's probably a better solution than the 1ms timer that I ended up using, but the 1ms timer works fairly well:

It's still not ideal, but it's quite good.  And in my testing, it seems to have a minor impact on CPU load, but not enough to cause issues (I can get 45-50Gbps between processes running locally using both timer values).


Know what your tools actually do.  Especially in networking where rate limiting is really an exercise in pulse-width modulation.