When the blue whale sinks

3 Jun 2017 • on sysadmin • by bcdonadio

A bug in the Linux kernel has been affecting thousands of people for more than 3 years, and so far there’s no complete fix. Read on to learn what to expect if you encounter this issue (hopefully you won’t) in production.

Let’s get started with the biggest PITA I’ve ever experienced with software in production, one that, at least 3 years after it was first reported, is still far from being completely fixed. The issue isn’t in the Docker code itself, but in the Linux kernel. It affects not only Docker, but any software that frequently uses the Linux network stack to create devices and namespaces, such as LXC, OpenStack, Rkt, Proxmox, etc.

The telltale sign of hitting this bug is a message similar to this one showing up every 10 seconds on the VT/syslog/journal of your server:

unregister_netdevice: waiting for veth1 to become free. Usage count = 1

The message varies: the interface may be any one currently being manipulated by Docker (most frequently the lo interface of the container), and the usage count may be larger than one.
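If you want to watch for this in collected logs, here is a minimal sketch in Python that assumes the message format quoted above. The function name, the regular expression and the sample lines are my own illustration, not part of any Docker or kernel tooling.

```python
import re

# Pattern built from the message format quoted above. The interface name and
# the usage count are the two parts that vary.
UNREGISTER_RE = re.compile(
    r"unregister_netdevice: waiting for (?P<iface>\S+) to become free\. "
    r"Usage count = (?P<count>\d+)"
)


def find_stuck_devices(log_lines):
    """Return (interface, usage_count) pairs for every matching log line."""
    hits = []
    for line in log_lines:
        match = UNREGISTER_RE.search(line)
        if match:
            hits.append((match.group("iface"), int(match.group("count"))))
    return hits


if __name__ == "__main__":
    # Made-up sample input: the quoted message plus one variation of it.
    sample = [
        "unregister_netdevice: waiting for veth1 to become free. Usage count = 1",
        "unregister_netdevice: waiting for lo to become free. Usage count = 2",
        "some unrelated kernel message",
    ]
    for iface, count in find_stuck_devices(sample):
        print(iface, count)
```

You could feed it the lines of whatever log source you already collect. How those lines are fetched is left out on purpose, since it depends on your setup.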

If you’ve had some experience with multithreaded programming, you can instantly diagnose this as a race condition, and even cringe at the memory of debugging that kind of issue. However, this issue has affected so many people for so long (more than 3 years) that you may think it has already been fixed, right? Nope.

Once the bug is triggered, the situation is the following: you’re unable to create, delete or change any network device on the whole system, rendering Docker basically useless, as it needs to do exactly this to create and delete containers. There’s absolutely no fix for the bug so far, only a few mitigations and the Windows-style workaround: reboot your computer. Seriously.

Even worse, reproducing this issue is far from straightforward. You hit this bug basically by creating and deleting containers frequently, but the frequency needed to hit the issue varies wildly. Some people simply never hit the bug, and others get systems frozen by it multiple times a day. The Red Hat kernel (including CentOS) seems to be especially prone to this problem, but all other distributions have reports of being affected as well. Ironically, the sosreport tool used by Red Hat to collect information and logs on the system in order to diagnose the situation simply doesn’t work after the issue is triggered.

There are, however, partial mitigations for the issue: you can put your virtual bridge docker0 in promiscuous mode, which delays the interface teardown just enough to avoid the race condition (promiscuous mode has no other relation to the problem than this), or you can disable IPv6 support in the kernel, since the IPv6 stack is more prone to hitting the bug than the IPv4 stack.
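As a rough sketch, the two mitigations above could be scripted like this in Python. The function names are hypothetical. I’m assuming the iproute2 `ip` command for the bridge and the runtime `disable_ipv6` sysctls for IPv6, and booting with `ipv6.disable=1` on the kernel command line would be another way to do the latter. Neither of these is a fix, they only make the bug less likely to trigger.

```python
import subprocess


def mitigation_commands(bridge="docker0"):
    """Build the commands for the two mitigations described above."""
    return [
        # Put the bridge in promiscuous mode (iproute2).
        ["ip", "link", "set", "dev", bridge, "promisc", "on"],
        # Assumption: disable IPv6 at runtime through sysctl, both for
        # existing interfaces and for ones created later.
        ["sysctl", "-w", "net.ipv6.conf.all.disable_ipv6=1"],
        ["sysctl", "-w", "net.ipv6.conf.default.disable_ipv6=1"],
    ]


def apply_mitigations(bridge="docker0", dry_run=True):
    """Print each command; run it only when dry_run is False (needs root)."""
    for cmd in mitigation_commands(bridge):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default, so nothing on the system is changed.
    apply_mitigations(dry_run=True)
```

Note that these settings don’t survive a reboot, so you would have to apply them again at boot time if they help in your environment.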

As of the time of this publication, a fix has already been released for the variant of the bug that is triggered in the IPv6 stack, but people are still getting the annoying message and the frozen behaviour, suggesting that the bug has multiple causes spread all over the network subsystem of the kernel.

This particular fix was released in a Linux 4.8 commit and backported for RHEL/CentOS in the kernel-3.10.0-514.21.1.el7 package, as you can follow in RHSA#3034221 (RHN access needed).
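To check whether a machine runs at least one of the versions named above, something like the following Python sketch could do. The function name and the parsing are my own simplification: it only understands el7 and plain mainline version strings, and it knows nothing about backports in other distributions. A True result only means the IPv6 fix should be present, not that the bug is gone.

```python
import platform
import re

# Versions named in the text above.
EL7_FIXED = ((3, 10, 0), (514, 21, 1))
MAINLINE_FIXED = (4, 8)


def has_ipv6_fix(release):
    """Rough check of a `uname -r` style string against the versions above."""
    # el7 kernels look like 3.10.0-514.21.1.el7.x86_64
    el7 = re.match(r"^(\d+)\.(\d+)\.(\d+)-(\d+(?:\.\d+)*)\.el7", release)
    if el7:
        base = tuple(int(x) for x in el7.group(1, 2, 3))
        build = tuple(int(x) for x in el7.group(4).split("."))
        return (base, build) >= EL7_FIXED
    # Anything else is treated as a mainline-style major.minor version.
    mainline = re.match(r"^(\d+)\.(\d+)", release)
    if mainline:
        return tuple(int(x) for x in mainline.groups()) >= MAINLINE_FIXED
    return False


if __name__ == "__main__":
    release = platform.release()
    print(release, has_ipv6_fix(release))
```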

A very scary, but also very plausible, scenario: if you have a PaaS system that auto-heals (like OpenShift or tsuru), you can unleash a chain reaction by triggering the bug on one system. When the auto-heal function starts the containers on other machines, you can trigger the bug again on those systems. Then the thing grows. Exponentially.

Until this issue is completely fixed, I’m very cautious about using Docker for anything other than stateless applications with at least double redundancy.