Adventures in deployment with Propellor, Docker, and Fabric

by on Jun.01, 2015


After playing around with Docker a bit, I decided that it would make an ideal deployment platform for my work services (previously we were using some ad-hoc isolation using unix users and not much else). While Docker’s security is…suspect…compared to a complete virtualization solution (see Xen), I’m not so much worried about complete isolation between my services, as things like easy resource limits and imaging. You can build this yourself out of cgroups, chroot, etc. but in the end you’re just reinventing the Docker wheel, so I went with Docker instead.

However, Docker by itself is not a complete solution. You still need some way to configure the Docker host, and you also need to build Docker images, so I added Propellor (which I recently discovered) and Fabric to the mix.


Propellor is a configuration management system (in the sense of Puppet, Chef, Salt, et al.) written in Haskell, where your configuration itself is Haskell code. For someone coming from a systems administration background, the flexibility and breadth offered by a real programming language like Haskell may be quite daunting, but as a programmer, I find it far more straightforward to just write code that does what I want, extracting common pieces into functions and so on. Our previous incarnation of things used Puppet for configuration management, but it always felt very awkward to work with; another problem is that Puppet was introduced after a bunch of the infrastructure was in place, meaning a lot of things were not actually managed by Puppet because somebody forgot. Propellor was used to configure a new server from scratch, ensuring that nothing was done ad-hoc, and while I won’t go into too much detail about Propellor, I am liking it a lot so far.

The role of Propellor in the new order is to configure things to provide the expected platform. This includes installing Docker, installing admin user accounts, SSH keys, groups, and so on.


The Docker workflow I adopted is based on the one described by Glyph. I would strongly recommend you go read his excellent post for the long explanation, but the short version is that instead of building a single container image, you instead build three: A “build” container used to produce the built artifacts from your sources (eg. Python wheels, Java/Clojure JARs), a “run” container which is built by installing the artifacts produced by running the “build” container, and thus does not need to contain your toolchain and -dev packages (keeping the size down), and a “base” container which contains the things shared by the “build” and “run” containers, allowing for even more efficiency of disk usage.

While I can’t show the Docker bits for our proprietary codebases, you can see the bits for one of our free software codebases, including instructions for building and running the images. The relative simplicity of the .docker files is no accident; rather than trying to shoehorn any complex build processes into the Docker image build, all of the heavy lifting is done by standard build and install tools (in the case of Documint: apt/dpkg, pip, and setuptools). Following this principal will save you a lot of pain and tears.


The steps outlined for building the Docker images are relatively straightforward, but copy/pasting shell command lines from a README into a terminal is still not a great experience. In addition, our developers are typically working from internet connections where downloading multi-megabyte Docker images / packages / etc. is a somewhat tedious experience, and uploading the resulting images is ten times worse (literally ten times worse; my connection at home is 10M down / 1M up ADSL, for example). Rather than doing this locally, this should instead run on one of our servers which has much better connectivity and a stable / well-defined platform configuration (thanks to Propellor). So now the process would be “copy/paste shell command lines from a README into an ssh session” — no thanks. (For comparison, our current deployment processes use some ad-hoc shell scripts lying around on the relevant servers; a bit better than copy/pasting into an ssh session, but not by much.)

At this point, froztbyte reminded me of Fabric (which I knew about previously, but hadn’t thoughto f in this context). So instead I wrote some fairly simple Fabric tasks to automate the process of building new containers, and also deploying them. For final production use, I will probably be setting up a scheduled task that automatically deploys from our “prod” branch (much like our current workflow does), but for testing purposes, we want a deploy to happen whenever somebody merges something into our testing release branch (eventually I’d like to deploy test environments on demand for separate branches, but this poses some challenges which are outside of the scope of this blog post). I could build some automated deployment system triggered by webhooks from BitBucket (where our private source code is hosted), but since everyone with access to merge things into that branch also has direct SSH access to our servers, Fabric was the easiest solution; no need to add another pile of moving parts to the system.

My Fabric tasks look like this (censored slightly to remove hostnames):

def build_uat_documint():
    with settings(warn_only=True):
        if run('test -d /srv/build/documint').failed:
            run('git clone --quiet -- /srv/build/documint')
    with cd('/srv/build/documint'):
        run('git pull --quiet')
        run('docker build --tag=fusionapp/documint-base --file=docker/base.docker .')
        run('docker build --tag=fusionapp/documint-build --file=docker/build.docker .')
        run('docker run --rm --tty --interactive --volume="/srv/build/documint:/application" --volume="/srv/build/documint/wheelhouse:/wheelhouse" fusionapp/documint-build')
        run('cp /srv/build/clj-neon/src/target/uberjar/clj-neon-*-standalone.jar bin/clj-neon.jar')
        run('docker build --tag=fusionapp/documint --file=docker/run.docker .')

def deploy_uat_documint():
    with settings(warn_only=True):
        run('docker stop --time=30 documint')
        run('docker rm --volumes --force documint')
    run('docker run --detach --restart=always --name=documint --publish=8750:8750 fusionapp/documint')

Developers can now deploy a new version of Documint (for example) by simply running fab build_uat_documint deploy_uat_documint. Incidentally, the unit tests are run during the container build (from the .docker file), so deploying a busted code version by accident shouldn’t happen.

Leave a Comment :, , , more...

Axiom benchmark results on PyPy 2.5.0

by on Mar.07, 2015

This is a followup to a post I made about 1.5 years ago, benchmarking Axiom on PyPy 2.1.0. Not too much has changed in Axiom since then (we fixed two nasty bugs that mainly affected PyPy, but I don’t expect those changes to have had much impact on performance), but PyPy (now at 2.5.0) has had plenty of work done on it since then, so let’s see what that means for Axiom performance!

Unlike my previous post, I’m basically just going to show the results here without much commentary:

Graph of Axiom performance

A few notes:

  • I didn’t redo the old benchmark results, but the hardware/software I ran the benchmarks on is not significantly different, so I think the results are still valid as far as broad comparisons go (as you can see, the CPython results match fairly closely).
  • The benchmark harness I’m using now is improved over the last time, using some statistical techniques to determine how long to run the benchmark, rather than relying on some hardcoded values to achieve JIT warmup and performance stability. Still could use some work (eg. outputting a kernel density estimate / error bars, rather than just a single mean time value).
  • There is one new benchmark relative to the last time, powerup-loading; PyPy really shines here, cutting out a ton of overhead. There’s still room for a few more benchmarks of critical functions such as actually running and loading query results (as opposed to just constructing query objects).
  • The branch I used to run these benchmarks is available on Github.
  • The horizontal axis is cut off at 1.0 so you can’t actually see how store-opening lines up, but the raw data shows that PyPy 2.1.0 took about 53% longer on this benchmark, whil PyPy 2.5.0 only takes about 2% longer.
1 Comment more...

BitBucket migration

by on Sep.28, 2013

At work, we are currently using Launchpad for project hosting of our proprietary codebase. Launchpad charges $250/year/project for hosting of proprietary projects which is a little steep, and Launchpad/bzr has been falling behind the alternatives in terms of tooling / development / support, so when our Launchpad subscription came up for renewal at the beginning of the month, this caused our somewhat vague plans to switch to something else to crystalize.

I initially assumed that Github/git would be the obvious way to go, but after looking into BitBucket/hg, I was pleasantly surprised to discover that things like hosted CI were available there too. Nobody on our team is much of a git enthusiast to begin with, so using hg seemed like a far more attractive option. This meant figuring out how to do two things: 1) migrate all of our existing bugs, and 2) migrate our existing bzr branches to hg. The former proved to be relatively straightforward: Jonathan wrote a script that used launchpadlib to access the Launchpad API, retrieve the bug data + metadata, and write it out in the BitBucket import format (more on this in another post, or on Jonathan’s blog, depending on if I can convince him to write it up or not).

The bzr to hg conversion turned out to be a little more complex. A simple “hg convert” of our trunk branch worked surprisingly well; the trunk history converted correctly (at least as far as I could tell), but more (pleasantly) surprisingly, the branches which were merged into trunk were also reconstructed by the conversion, along with the merges. The conversion relies on the bzr branch nick; this works somewhat like hg branches (it’s associated with a commit at the time that you commit), but as bzr does not place as much importance on this label as hg, it is more likely to be wrong by accident (by default the branch nick is just taken from the last component of the path to the branch you are committing in, I believe, and in our case I suspect nobody has ever set the branch nick manually). Among other things, this resulted in 4 different branch names for “trunk”, as well as some other oddities like misspelled feature branch names.

(As an aside, I’d like to mention that `hg log` has far more utility than `bzr log`, mostly due to the “revsets” feature. Almost all of the inspection I did while debugging the conversion was done on the converted repo using hg, not on the original bzr repo, simply because it was far easier to get the information that way.)

A “branchmap” file solved the problem with differing branches; mapping the different names for “trunk” to “default” made the revision history graph look a lot more reasonable than when I originally did the conversion. I also switched to using --datesort for the conversion at this point; the documentation warns that this may produce a much larger repository than --branchsort (the default), but in my case, the size difference was trivial. I suspect this may only apply in scenarios with back-and-forth merges between long-lived branches, rather than the short-lived topic branches that form the majority of our workflow. I also created an “authormap” file at this point to reconcile differing author identities over the history of our repository. The bzr author identity is a full name/email (eg. “Tristan Seligmann <>”, but again, there were various historical oddities here; BitBucket also has the ability to map author identities to BitBucket users, but I decided normalizing during the conversion was a good idea anyway.

The biggest problem I had to deal with (although this was actually one of the first problems I noticed) was that all of these merged branches were still open. Mercurial has the concept of “open” and “closed” branches, with closed branches being hidden by default in most places. A “closed” branch is simply one whose head revision is marked as closing the branch; which of course, none of my branches had, due to being converted from bzr which does not have an equivalent concept. Committing a closing revision to each branch was simple enough to script, but that only lead to more difficulties: 1) a gigantic pile of noise revisions in the history, and 2) a bunch of dangling heads as the new “close” revision was not part of the merge to trunk. Scripting a merge of all of the dangling heads would have produced even more noise, so I looked for a different solution.

Eventually I ran across a patch on the Mercurial mailing list; unfortunately the thread in which it was posted never went anywhere, but the patch still worked. What this patch allowed me to do was after the initial conversion, run another hg-to-hg conversion in which I marked the last branch revision before the merge to trunk as closing the branch. The full conversion process now looked something like this:

hg convert --datesort --branchmap branchmap --authormap authormap Fusion Fusion-hg-unspliced
cd Fusion-hg-unspliced
hg log --template "{node} close\n" -r "head() and not branch(default)" > ../splicemap
cd ..
PYTHONPATH=$HOME/hg-patched python $HOME/hg-patched/hg convert --splicemap splicemap Fusion-hg-unspliced Fusion-hg

This was good enough for a trunk conversion, but what about open branches that aren’t yet merged into trunk? We could have held off until we were able to merge all of these branches, but that seemed like a lot of work (although we did merge as many outstanding branches as possible). Fortunately hg convert can operate in an incremental way; during the conversion, the mapping from source revs to destination revs is stored in dest/.hg/shamap; the only wrinkle was my two stage-conversion process. What I needed was a way to map the original bzr revisions to the hg revisions in the *second* repository. In order to accomplish this, I wrote a small Python script to merge the two mappings:

import sys
revmap2 = {}
with open(sys.argv[2]) as fMap2:
    for line in fMap2:
        a, b = line.rstrip('\n').split(' ')
        revmap2[a] = b

with open(sys.argv[1]) as fMap1:
    for line in fMap1:
        a, b = line.rstrip('\n').split(' ')
        print '%s %s' % (a, revmap2[b])

With the help of this script, I could now convert other branches:

# Only take the bzr revisions
grep '@' Fusion-hg-unspliced/.hg/shamap > shamap
python shamap Fusion-hg/.hg/shamap > shamap-spliced
mv shamap-spliced Fusion-hg/.hg/shamap
# Now let's convert a branch
hg convert --branchmap branchmap --authormap authormap Fusion-some-branch Fusion-hg

In summary, the process, while hardly trivial, worked out a lot better than I had initially expected.

EDIT: I forgot to mention in the original draft: We first started thinking about moving away from Launchpad at the beginning of September, and completed the migration in the last week, so the entire process took us less than a month of part-time discussion / work.

Leave a Comment more...

Axiom benchmark results on PyPy

by on Aug.04, 2013

EDIT: Updated version now available.

EDIT: Fixed the issue with the store-opening benchmark

Axiom conveniently includes a few microbenchmarks; I thought I’d use them to give an idea of the speed increase made possible by running Axiom on PyPy. In order to do this, however, I’m going to have to modify the benchmarks a little. To understand why this is necessary, one has to understand how PyPy achieves the speed it does: namely, through the use of JIT (Just-In-Time) compilation techniques. In short, these techniques mean that PyPy is compiling code during the execution of a program; it does this “just in time” to run the code (or actually, if I understand correctly, in some cases only after the code has been run). This means that when a PyPy program has just started up, there is a lot of performance overhead in the form of the time taken up by JIT compilation running, as well as time taken up by code being interpreted slowly because it has not yet been compiled. While this performance hit is quite significant for command-line tools and other short-lived programs, many applications making use of Axiom are long-lived server processes; for these, any startup overhead is mostly unimportant, the performance that interests us is the performance achieved once the startup cost has already been paid. The Axiom microbenchmarks mostly take the form of performing a certain operation N times, recording the time taken, then dividing that time by N to get an average time per single operation. I have made two modifications to the microbenchmarks in order to demonstrate the performance on PyPy; first, I have increased the value of “N”; second, I have modified the benchmarks to run the entire benchmark twice, throwing away the results from the first run and only reporting the second run. This serves to exclude startup/”warmup” costs from the benchmark.

All of the results below are from my desktop machine running Debian unstable on amd64, CPython 2.7.5, and PyPy 2.1.0 on a Core i7-2600K running at 3.40GHz. I tried to keep the system mostly quiet during benchmarking, but I did have a web browser and other typical desktop applications running at the same time. Here’s a graph of the results; see the rest of the post for the details, especially regarding the store-opening benchmark (which is actually slower on PyPy).

[graph removed, see the new post instead]

To get an example of how much of a difference this makes, let’s take a look at the first benchmark I’m going to run, item-creation 15. This benchmark constructs an Item type with 15 integer attributes, then runs 10 transactions where each transaction creates 1000 items of that type. In its initial form, the results look like this:

mithrandi@lorien> python item-creation 15
mithrandi@lorien> pypy item-creation 15

That’s about 165µs per item creation on CPython, and 301µs on PyPy, nearly 83% slower; not exactly what we were hoping for. If I increase the length of the outer loop (number of transactions) from 10 to 1000, and introduce the double benchmark run, the results look a lot more encouraging:

mithrandi@lorien> python item-creation 15
mithrandi@lorien> pypy item-creation 15

That’s about 159µs per item creation on CPython, and only 87µs on PyPy; that’s a 45% speed increase. The PyPy speed-up is welcome, but it’s also interesting to note that CPython benefits slightly from the changes to the benchmark. I don’t have any immediate explanation for why this might be, but the difference is only about 3%, so it doesn’t matter too much.

The second benchmark is inmemory-setting. This benchmark constructs 10,000 items with 5 inmemory attributes (actually, the number of attributes is hardcoded, due to a limitation in the benchmark code), and then times how long it takes to set all 5 attributes to new values on each of the 10,000 items. I decreased the number of items to 1000, wrapped a loop around the attribute setting to repeat it 1000 times, and introduced the double benchmark run:

mithrandi@lorien> python inmemory-setting
mithrandi@lorien> pypy inmemory-setting

That’s 486ns to set an attribute on CPython, and 129ns on PyPy, for a 74% speed increase. Note that this benchmark is extremely sensitive to small fluctuations since the operation being measured is such a fast one, so the results can vary a fair amount between benchmarks run. For interest’s sake, I repeated the benchmark except with a normal Python class substituted for Item, in order to compare the overhead of setting an inmemory attribute as compared with normal Python attribute access. The result was 61ns to set an attribute on CPython (making an inmemory attribute about 700% slower), and 2ns on PyPy (inmemory is 5700% slower). The speed difference on PyPy is more to do with how fast setting a normal attribute is on PyPy, than to do with Axiom being slow.

The third benchmark is integer-setting. This benchmark is similar to inmemory-setting except that it uses integer attributes instead of inmemory attributes. I performed the same modifications, except with an outer loop of 100 iterations:

mithrandi@lorien> python integer-setting
mithrandi@lorien> pypy integer-setting

That’s 12.3µs to set an attribute on CPython, and 3.8µs on PyPy, a 69% speed increase.

The fourth benchmark is item-loading 15. This benchmark creates 10,000 items with 15 integer attributes each, then times how long it takes to load an item from the database. On CPython, the items are deallocated and removed from the item cache immediately thanks to refcounting, but on PyPy a gc.collect() after creating the items is necessary to force them to be garbage collected. In addition, I increased the number of items to 100,000 and introduced the double benchmark run:

mithrandi@lorien> python item-loading 15
mithrandi@lorien> pypy item-loading 15

That’s 90µs to load an item on CPython, and 57µs on PyPy, for a modest 37% speed increase.

The fifth benchmark is multiquery-creation 5 15. This benchmark constructs (but does not run) an Axiom query involving 5 different types, each with 15 attributes (such a query requires Axiom to construct SQL that mentions each item table, and each column in those tables) 10,000 times. I increased the number of queries constructed to 100,000 and introduced the double benchmark run:

mithrandi@lorien> python multiquery-creation 5 15
mithrandi@lorien> pypy multiquery-creation 5 15

55µs to construct a query on CPython; 8µs on PyPy; 86% speed increase.

The sixth benchmark is query-creation 15. This benchmark is the same as multiquery-creation, except for queries involving only a single item type. I increased the number of queries constructed to 1,000,000 and introduced the double benchmark run:

mithrandi@lorien> python query-creation 15
mithrandi@lorien> pypy query-creation 15

15.5µs to construct a query on CPython; 1.6µs on PyPy; 90% speed increase.

The final benchmark is store-opening 20 15. This benchmark simply times how long it takes to open a store containing 20 different item types, each with 15 attributes (opening a store requires Axiom to load the schema from the database, among other things). I increased the number of iterations from 100 to 10,000; due to a bug in Axiom, the benchmark will run out of file descriptors partway, so I had to work around this. I also introduced the double benchmark run:

mithrandi@lorien> python store-opening 20 15
mithrandi@lorien> pypy store-opening 20 15

1.41ms to open a store on CPython; 2.02ms on PyPy; 44% slowdown. I’m not sure what the cause of the slowdown is.

A bzr branch containing all of my modifications is available at lp:~mithrandi/

1 Comment :, , more...

Divmod / PyPy status update

by on Aug.04, 2013

Just a quick status update:

  • Epsilon test suite passes on PyPy.
  • Nevow test suite passes on PyPy.
  • Axiom (trunk) test suite has two failures on PyPy, fixed by this branch (which just deletes the tests); I don’t expect this to affect any application code.
  • Mantissa test suite mostly fails. This is due to modules that indirectly import xmantissa.terminal which imports PyCrypto directly or indirectly (via twisted.conch) — PyCrypto does not build on PyPy.
  • I haven’t looked at Combinator; the test suite has a whole bunch of test failures on CPython due to a change in Subversion, and there’s no real reason to run Combinator with PyPy (I would expect it to be slower than on CPython).
  • I haven’t looked at Quotient or Imaginary yet, as they depend on Mantissa.
  • I haven’t looked at Hyperbola, Prime, Reverend, or Sine — I doubt anyone cares about these (I don’t even know what Prime does, and it doesn’t have a test suite).

The next thing I’m going to work on is making the Mantissa dependency on PyCrypto optional; while having an SSH server is nice, there is plenty of functionality in Mantissa that does not depend on or interact with the SSH server in any way, so it’ll still be useful for most applications. With any luck, once this is fixed, the entire test suite will pass; it’s hard to predict given how little of the test suite is currently runnable.

Leave a Comment :, , , more...

A Brief History of Bufferbloat

by on Feb.27, 2013


The “bufferbloat” issue has now been explained and documented in many places by many people (most recently, and famously, by Jim Gettys), but I’m going to present my own explanation by way of introduction. I’m going to consider the case of a home network with a single router connecting the LAN to the internet (most likely via an ADSL or cable internet connection); this is not the only place where the issue arises, but it is the situation that most people are familiar with.

Why buffer?

To understand the problem with buffering, we first have to understand why buffering is being done in the first place. Generally speaking, buffering at least one packet is necessary to successfully forward traffic from the LAN interface to the WAN interface; if you can’t buffer at least one packet, you can’t receive and route any packets because you don’t know where to send them until you’ve received and processed them. However, most routers will buffer far more than just one packet; and the reason for this is throughput. Incoming traffic does not always arrive at a steady rate, so by keeping a reasonably-sized buffer of incoming traffic, the router can provide a steady stream of outgoing traffic keeping the outgoing link at  maximum utilization despite fluctuations on the input side. To some extent, the more you buffer, the better throughput you can achieve, and as historically the focus has been on maximum throughput on an internet connection, buffers have been sized very generously for some time now, to the point where they are frequently far larger than they have to be in order to achieve maximum throughput. This brings us to the next question:

Why not buffer?

The problem with large buffers is that while they may improve throughput, they also increase latency. Thus, while “bulk” flows (file uploads and downloads) experienced improved performance, “interactive” flows such as gaming, VoIP traffic, and so on suffers. To understand why this is, let us consider some example figures. The default settings for an ethernet interface on Linux are to use the “pfifo_fast” queueing discipline (which is basically just a first-in-first-out queue, as the name suggests), with a qlen of 1000 (so the queue can grow up to 1000 packets long). Standard Ethernet MTU is 1500 bytes, which means that if the queue fills with traffic from a file upload, we will have 1,500,000 bytes (~1.43 MB) of data in the queue. Ordinarily the LAN interface will be running at 100M or 1G while the outgoing ADSL / cable connection will be much slower (let’s use a 1Mbps uplink in this illustration), causing the queue to fill up under normal circumstances before packets start being dropped.

Now, let’s say that while this file upload is occurring, you are also trying to play a real-time strategy game. When you issue a command to some of your units, the game sends out a small command packet encoding the command that you just issued. This packet arrives at the router and joins the queue behind all of the traffic currently in the queue. Only once it reaches the head of the queue will it actually be transmitted across your internet connection, so there will be a delay before it is even sent out onto the internet; this delay will be cumulative with the normal latency between you and the server. How long will this delay be? Transferring 1,500,000 bytes at 1Mbps will take 12 seconds! Obviously this is a ridiculous amount of added latency, resulting in a completely unplayable game as many gamers can attest to.

This increase in latency can even affect throughput if you are trying to download and upload at the same time; the ACK traffic for the upload will get caught in the bufferbloat caused by the download, and the ACK traffic for the download will get caught in the bufferbloat caused by the upload, causing everything to slow down. (Attempting to use a bittorrent client without setting ratelimits often leads one to encounter this problem, although many modern torrent clients have “smart ratelimiting” to try to work around this problem)

Now what?

So, how do we solve this problem? We can reduce the size of the queue in the router, as often it is massively oversized, but beyond a certain point, making the queue smaller will start to hurt throughput (resulting in slower downloads and uploads), forcing us to make a trade-off between latency and throughput. In addition, manually adjusting the size of the queue is a very difficult task, especially in the face of changing network conditions; the bandwidth available to your ADSL connection can vary greatly depending on congestion at your ISP, “turbo boost” ISP products that allow you to temporarily burst above your normal bandwidth limit, and so on. Can we do better than this?

In fact, we can do better. The answer lies in more advanced queue management: we want to queue as much as necessary to maintain throughput, but no more. The latest and greatest in this field is the CoDel (“controlled delay”) queue management algorithm, designed by Kathleen Nichols and Van Jacobson, which aims to achieve reasonable behaviour with very little tuning; in other words, it can be deployed on ADSL/cable routers in a standard configuration with no end-user tuning required. In brief, CoDel looks at a sliding window of time (100ms by default), and determines the minimum delay experienced by packets traversing the queue; if this increases above a certain target value (5ms by default), CoDel enters a dropping mode. (Side note: “dropping” can mean actual packet dropping, or simply ECN marking; end-users will normally want dropping, but users in more controlled environments, like datacenters, may get more reliable behaviour with packet marking).

CoDel (implemented by the “codel” qdisc in the Linux kernel) thus allows us to manage the size of the queue, but we still have the problem of multiple flows interfering with each other; if I start a file upload, that will still interfere with my IRC connection or VoIP call. What we need is called “fair queuing”; we want to share the internet connection “fairly” between all of the flows going over the connection, rather than allowing a few of them to hog the connection at the expense of others. The “fq_codel” qdisc gives us a way to do this (although there are other ways to accomplish the same thing); essentially, it classifies the incoming packets into separate flows and maintains a separate queue for each flow, managed by CoDel. (Actually, it uses a hashing scheme, so the more flows you have, the more likely it is that some of them will share the same queue, but this is necessary to avoid out-of-control resource usage in the presence of many flows.) Traffic is drawn from each separate queue “fairly”, so essentially this allows your interactive traffic (games, IRC, VoIP, etc.) to “skip the queue” as their individual queues will be small, instead of being stuck in a queue behind bulk flows which can build up a longer queue.


Unfortunately there are still some problems facing early adopters wanting to take advantage of these improvements in queue management algorithms.

The first problem is that queues will only build up at the hop where the path bottleneck is; if this occurs at one of your ISP’s routers rather than your own router, then any queue management you do on your own router will have little effect. In order to ensure that the bottleneck is your router, you will need to ratelimit traffic through your router to below the speeds available to you through your ISP, thus reducing available bandwidth by a small amount, and also making it impossible to take advantage of any kind of “bursting” capacity your ISP might make available to you. In addition to this, if the available bandwidth drops below your ratelimiting due to network conditions (which may vary based on time of day, phase of moon, etc.), your queue management will once again become ineffective. The real solution here is for your ISP to implement CoDel (or some other active queue management that accomplishes the same thing), but for most people, that is just a pipe dream that is unlikely to be realised in the near future.

The second problem is that there are actually many different buffers lurking in many different places such as your ethernet switch, ADSL modem, Linux ethernet driver, etc. Some work has been done recently (see BQL) to deal with the sources of additional bufferbloat in the Linux kernel, but many drivers have not yet been updated to support BQL, so the problem remains. If your queue management algorithm thinks the packet has actually been transmitted, but it’s just stuck in another buffer farther down the stack, then it is unlikely to perform as expected. Send / receive offloads can produce similar problems as suddenly you can be waiting for a “packet” (which will actually be split into many packets by the ethernet hardware) to be transmitted that is far larger than the normal MTU, producing a much longer delay than expected; thus turning these off is essential (and unlikely to have any downside at typical home internet speeds, or even 1Gbps ethernet speeds, on modern hardware).

The third problem is a little more fundamental. At any given time, even if a packet has been placed at the front of all of your queues, you may already be in the process of transmitting a packet; thus the minimum delay that can be achieved is constrained by the time taken to transmit a single packet across your internet connection. Assuming a standard ethernet MTU of 1500 bytes (this size will actually be slightly higher in practice, due to ethernet frame / ADSL / ATM / etc. overheads), on a 100Mbps uplink, this will be a delay of 0.12ms; this is unlikely to be of concern for many people. However, on slower uplinks, this starts to become more of a problem: at 10Mbps the delay increases to 1.2ms; at 1Mbps the delay is 12ms (requiring the CoDel target to be increased, as reducing the delay below the default target of 5ms is now impossible); and at 512kbps the delay is 24ms. This figure represents not only an increase in the maximum delay experienced, but also the variance between minimum and maximum delay. If you are playing a game on a server that is 25ms away, having your latency fluctuate between 25ms and 49ms (nearly doubling) in an unpredictable fashion is far harder to deal with than a stable, predictable delay of, say, 60ms would be. Thus people on slower uplinks have little recourse other than to hopefully upgrade to a faster internet connection.

Leave a Comment :, , , , more...

(Fair) queuing, buffering, and “bufferbloat”

by on Dec.24, 2012

This is basically just a teaser, but I’ve been playing around with things on my home internet connection (PPPoE over ADSL) and have achieved some fairly good results (within the limitations of my internet connection). I’ve learned quite a lot about various bits and pieces along the way, so I decided I should blog it all for my own future reference as well as others. I’ll probably be duplicating some existing material along the way, but half of the challenge was just finding all the bits and pieces scattered around and putting them all together, so hopefully I’ll save somebody else some time in the future. I’ll also be uploading my actual test scripts, as well as the test results, for reference.

Leave a Comment :, , , , more...

Debian Developer

by on Aug.05, 2012

I’m now a Debian Developer. Woot! And it only took me just 6 years and 17 days… although the overwhelming majority of the delay was due to my own lack of time / energy, rather than anyone else’s fault. On a related matter, my newish PGP key is sadly lacking in signatures. If you’re in or travelling to Johannesburg / Gauteng area in South Africa and want to exchange signatures, please get in touch (especially if you’re a Debian Developer or Debian Maintainer, but even if you’re not).

Leave a Comment :, , more...

An introduction to Mantissa (part 3): Navigation powerups and other friends

by on Mar.04, 2012

This is the fourth post in a series of articles about Mantissa.

In the previous article I described how an offering can provide powerups to be included in a product, which will then be installed on a user store; in this installment, I will discuss what form these powerups can actually take, and how they allow you to expose your application’s functionality to the user.

One of the most commonly-implemented powerup interfaces in Mantissa is INavigableElement. Mantissa has a somewhat generalized idea of “navigation”, whereby a nested menu structure can be defined through INavigableElement powerups, and then displayed by different implementations for different protocols; for example, the web view has a visual dropdown menu system, whereas the SSH server presents a textual menu system. A typical INavigableElement powerup implementation will look something like this:

from zope.interface import implements
from axiom.item import Item
from axiom.attributes import integer
from xmantissa.ixmantissa import INavigableElement
from xmantissa.webnav import Tab

class PonyCreator(Item):
    Powerup for creating and managing ponies.
    powerupInterfaces = [INavigableElement]

    ponyQuota = integer(allowNone=False, default=10)

    def getTabs(self):
        return [Tab('ZOMG PONIES!', self.storeID, 1.0)]

INavigableElement only has one method, getTabs, which returns a list of “tabs” or menu items to be presented in the nav. The primary components of a tab are a title (which is how the item is displayed in the UI), the storeID of an item in the same store which the tab points to, and a float between 0.0 and 1.0 indicating the sort priority of the tab (higher values sort sooner). In this case, we have the tab pointing directly at the PonyCreator item itself; in order for this to work, we’ll need some extra code to allow PonyCreator to be exposed via the web.

In order for an item in a user’s store to be privately accessible via the web by that user, it needs to be adaptable to the (somewhat poorly-named) INavigableFragment interface. This is almost always done by defining an adapter from the item type to INavigableFragment:

from twisted.python.components import registerAdapter
from nevow.athena import LiveElement
from xmantissa.webtheme import ThemedDocumentFactory
from xmantissa.ixmantissa import INavigableFragment, ITemplateNameResolver

class PonyCreatorView(LiveElement):
    Web view for Pony management.

    title = u'Pony management'
    docFactory = ThemedDocumentFactory('pony-creator', 'resolver')

    def __init__(self, ponyCreator):
        super(PonyCreatorView, self).__init__()
        self.ponyCreator = ponyCreator
        self.resolver = ITemplateNameResolver(

registerAdapter(PonyCreatorView, PonyCreator, INavigableFragment)

Our element will be wrapped in the Mantissa shell when it is rendered, so we cannot control the page title directly from the template, but the title attribute provides a way for our element to specify the page title. ThemedDocumentFactory is used to retrieve the template through the theme system; the arguments are the name of the template (‘pony-creator’) and the name of the attribute holding the ITemplateNameResolver implementation used to retrieve the template. This attribute is set in __init__ using a slightly awkward method; the template resolver should really be passed in by Mantissa somehow, but currently there is no mechanism for doing this, so instead we retrieve the default resolver ourselves from the site store.

This is all that is needed for hooking some code up to the web view; any further UI behaviour would be implemented in HTML / JavaScript in PonyCreatorView, usually by invoking additional methods defined on PonyCreator.

Next up: Sharing, or “How do I publish public / shared content?”

Leave a Comment :, , more...

NAT connection pinning with iproute2 / iptables

by on Oct.23, 2011

My home network has a somewhat complicated setup where I have multiple PPPoE sessions across my ADSL connection, with various different ISPs. This allows me to take advantage of varying ISP properties such as cost and latency, by routing different traffic over different connections. Naturally, each of these connections only affords me a single IPv4 address, so I make use of NAT to allow the rest of my network access to the Internet. A potential problem arises, however, when connections go down and come back up. In the simple case, with only one connection, MASQUERADE takes care of all the details; when the interface goes down, all of the NAT entries associated with the connection are removed, so when it comes back up, it’s not a problem that your IP address has changed, because all of the NAT entries associated with the old address are gone. This works just as well in the multiple connections scenario; if an interface goes down resulting in traffic being routed over another interface, all of the old NAT entries have been dropped, so new ones will be established associated with the interface they are now travelling over. The problem arises when the interface that went down comes back up; traffic will now be routed over the first interface again, while still being rewritten to the second interface’s address, and this traffic is almost guaranteed to be dropped by either your ISP, or their upstream provider.

What’s the solution? Well, if you absolutely definitely want to start routing traffic over the first interface as soon as it comes back up, you’re going to need to flush the associated conntrack NAT entries as soon as it comes up, and let all your users reconnect (since their connections will be interrupted); I’m not entirely sure how to do this. In my case, however, I’m more concerned with maintaining existing connections without interruption, even if that means continuing to route them over the “wrong” interface. This also applies to incoming connections; ordinarily if somebody tries to establish a connection to the public IP address of one of your connections, they will need to connect to the same interface that outbound traffic to them would be routed over, which can be somewhat inconvenient.

My solution is something I’m going to call “connection pinning”. The idea is that once an outbound interface has been selected for a particular connection (by the Linux routing table), we “pin” the connection to that interface, so that traffic associated with that connection always travels over that interface even if the routing table changes. In order to achieve this, we can use a combination of Linux policy routing (ip rule), as well as firewall / conntrack packet marking. When a connection is first established, we set a connmark, which is a value stored in the conntrack table entry for that connection. In the case of an incoming connection, we set the mark based on the interface the packet arrived on; in the case of an outgoing connection, we set the mark in POSTROUTING based on the outbound interface already selected by the routing table. Then, for future outgoing traffic associated with that connection (as determined by conntrack), we set an fwmark based on the connmark, and bypass the normal routing table using policy rules for traffic marked thusly.

This is implemented in three parts. Firewall rules added using iptables, for the netfilter/conntrack bits; an ip-up script for establishing policy rules and routes when a PPP connection is established; and an ip-down script for flushing them again when the PPP connection is terminated.

First, the firewall rules (using the excellent ferm tool):

@def $DEV_PRIVATE = eth0;
@def $NET_PRIVATE_V4 =;

domain ip table mangle {
    # Only match new connections; established connections should
    # already have a connmark, which should not be overwritten.
    chain (INPUT FORWARD) {
        # Unfortunately the set-mark rules need to be duplicated for
        # each ppp interface we have.
        mod conntrack ctstate NEW {
            interface ppp0 CONNMARK set-mark 1;
            interface ppp1 CONNMARK set-mark 2;
            interface ppp2 CONNMARK set-mark 3;
            interface ppp3 CONNMARK set-mark 4;
            interface ppp4 CONNMARK set-mark 5;
    chain POSTROUTING {
        mod conntrack ctstate NEW {
            outerface ppp0 CONNMARK set-mark 1;
            outerface ppp1 CONNMARK set-mark 2;
            outerface ppp2 CONNMARK set-mark 3;
            outerface ppp3 CONNMARK set-mark 4;
            outerface ppp4 CONNMARK set-mark 5;
    chain PREROUTING {
        # Copy the connmark to the fwmark in order to activate the
        # policy rules for connection pinning. Only do this for
        # traffic originating from the local network; other traffic
        # (such as traffic going *to* the local network) should be
        # left unmodified, to allow return traffic to be routed over
        # the correct interface.

        interface $DEV_PRIVATE daddr ! $NET_PRIVATE_V4 CONNMARK restore-mark;
    chain OUTPUT {
        # Same as above, but for locally originating traffic.

        daddr ! $NET_PRIVATE_V4 CONNMARK restore-mark;

# I am assuming you already have something like this:
domain ip table nat {
    chain POSTROUTING outerface (ppp0 ppp1 ppp2 ppp3 ppp4) MASQUERADE;

If you’re not using ferm, here’s what the raw iptables commands would be (these are exactly what ferm will install given the above, so this is just more verbose):

iptables -t mangle -A FORWARD --match conntrack --ctstate NEW --in-interface ppp0 --jump CONNMARK --set-mark 1
iptables -t mangle -A FORWARD --match conntrack --ctstate NEW --in-interface ppp1 --jump CONNMARK --set-mark 2
iptables -t mangle -A FORWARD --match conntrack --ctstate NEW --in-interface ppp2 --jump CONNMARK --set-mark 3
iptables -t mangle -A FORWARD --match conntrack --ctstate NEW --in-interface ppp3 --jump CONNMARK --set-mark 4
iptables -t mangle -A FORWARD --match conntrack --ctstate NEW --in-interface ppp4 --jump CONNMARK --set-mark 5
iptables -t mangle -A INPUT --match conntrack --ctstate NEW --in-interface ppp0 --jump CONNMARK --set-mark 1
iptables -t mangle -A INPUT --match conntrack --ctstate NEW --in-interface ppp1 --jump CONNMARK --set-mark 2
iptables -t mangle -A INPUT --match conntrack --ctstate NEW --in-interface ppp2 --jump CONNMARK --set-mark 3
iptables -t mangle -A INPUT --match conntrack --ctstate NEW --in-interface ppp3 --jump CONNMARK --set-mark 4
iptables -t mangle -A INPUT --match conntrack --ctstate NEW --in-interface ppp4 --jump CONNMARK --set-mark 5
iptables -t mangle -A POSTROUTING --match conntrack --ctstate NEW --out-interface ppp0 --jump CONNMARK --set-mark 1
iptables -t mangle -A POSTROUTING --match conntrack --ctstate NEW --out-interface ppp1 --jump CONNMARK --set-mark 2
iptables -t mangle -A POSTROUTING --match conntrack --ctstate NEW --out-interface ppp2 --jump CONNMARK --set-mark 3
iptables -t mangle -A POSTROUTING --match conntrack --ctstate NEW --out-interface ppp3 --jump CONNMARK --set-mark 4
iptables -t mangle -A POSTROUTING --match conntrack --ctstate NEW --out-interface ppp4 --jump CONNMARK --set-mark 5
iptables -t mangle -A PREROUTING --in-interface eth0 ! --destination --jump CONNMARK --restore-mark
iptables -t mangle -A OUTPUT ! --destination --jump CONNMARK --restore-mark

iptables -t nat -A POSTROUTING --out-interface ppp0 --jump MASQUERADE
iptables -t nat -A POSTROUTING --out-interface ppp1 --jump MASQUERADE
iptables -t nat -A POSTROUTING --out-interface ppp2 --jump MASQUERADE
iptables -t nat -A POSTROUTING --out-interface ppp3 --jump MASQUERADE
iptables -t nat -A POSTROUTING --out-interface ppp4 --jump MASQUERADE

Next, the ip-up script (to be placed in /etc/ppp/ip-up.d/ and made executable):

MARK=$((${PPP_IFACE##ppp} + 1))
ip rule del lookup "$TABLE"
ip route flush table "$TABLE"
ip route add default dev "$PPP_IFACE" table "$TABLE"
ip rule add fwmark "$MARK" table "$TABLE"

Finally, the ip-down script (to be placed in /etc/ppp/ip-down.d/ and made executable):

ip rule del lookup "$TABLE"
ip route flush table "$TABLE"

There are a couple of changes you will need to make to adapt these for your own network. In particular, you’ll need to duplicate the pppN iptables rules for each of the PPP interfaces you want to apply this to. Also, if you are already doing packet marking for some other reason, you’ll need to change the fwmark values I’ve used to ones that don’t interfere with your existing marks. I suspect there’s a better way to only mark outbound traffic than what I do above, but I wasn’t able to figure it out. If you have any improvements to suggest, feel free to mention them in the comments; I will try to keep this post updated with any improvements I make (either on my own, or based on other people’s suggestions).

Leave a Comment :, , , more...