Why networked software should expire.
If your business is still writing letters in WordStar and printing them out on an original Apple Laserwriter, good for you. If you're writing your own LISP in vi on a PDP 10, best of luck. However, if you're still using IE6, that's just rude.
Networked software (by which I just mean programs that talk to other programs) on public networks have a different cost model than the first two examples, but our mental models haven't caught up with that fact yet. We're stuck with the idea that what software you run is your own business and that's preventing needed changes. Here's one example:
ECN (Explicit Congestion Notification) is a modification to TCP and IP which allows routers to indicate congestion by altering packets as they pass though. Early routers dropped packets only when their buffers overflowed and this was taken as an indication of congestion. It was soon noticed that a more probabilistic method of indicating congestion performed better. So routers starting using RED (random early drop) where, approximately, if a buffer is 50% full, a packet has a 50% chance of getting dropped. This gives an indication of congestion sooner and prevented cases where TCP timeouts for many different hosts would start to synchronise and resonate.
To indication congestion, RED drops a packet that has already traversed part of the network; throwing away information. So ECN was developed to indicate congestion without dropping the packets. Network simulations and small scale testing showed a small, but significant benefit from it.
But when ECN was enabled for vger.kernel.org, the mailing list server which handles Linux kernel mailing lists, many people suddenly noticed that their mails weren't getting though. It turned out that many buggy routers and firewalls simply dropped all packets which were using ECN. This was clearly against the specifications and, in terms of code, an easy fix.
ECN wasn't enabled by default in order to give time for the routers to get fixed. In a year or so, it was hoped, it could start to be used and the Internet could start to benefit.
That was over eight years ago now. ECN is still not enabled by default in any major OS. The latest numbers I've seen (which I collected) suggest that 0.5% of destinations will still stop working if you enable ECN and the number of hosts supporting ECN appears to be dropping.
The world has payed a price for not having ECN for the past eight years. Not a lot, but downloads have been a little slower and maybe more infrastructure has been build than was really needed. But who actually paid? Every user of the Internet did, a little bit. But that cost was imposed by router manufactures who didn't test their products and network operators who didn't install updates. Those people saved money by doing less and everyone else paid the price.
These problems are multiplying with the increasing amount of network middleware (routers, firewalls etc) getting deployed; often in homes and owned by people who don't know of care about them.
Recently, Linux 2.6.27 was released and broke Internet access for, probably, thousands of people. Ubuntu Intrepid released with it and had to disable TCP timestamps as a work around while the issue was fixed.
But the issue wasn't a bug in 2.6.27. It was a bug in many home routers (WiFi access points and the like) which was triggered by a perfectly innocent change in Linux that caused the order of TCP options to change. (I felt specifically aggrieved about this because I made that change.) The order was soon changed back and everything started working again.
But, for the future, this now means that the order cannot change. It's not written down anywhere, it's a rule written in bugs. This imposes costs on anyone who might write new TCP stacks in the future, by requiring increased testing and reduced sales as some customers find that it wont work with their routers. These are all costs created by router manufactures and paid by others.
Economists calls these sorts of costs externalities and they are seen as a failure which needs to be addressed. Often, in other areas, they are addressed by regulation or privitisation. Neither of those options appeal in this case.
An uncontroversial suggestion that I'm going to make is that we require better test suites. As a router manufacturer, testing involves checking that your equipment works with a couple of flavors of Windows and, if we're lucky, Linux and some BSDs too. This is much too small of a testing surface. There needs to be an open source test suite designed to test every corner of the RFCs. The NFS connectathons are similar in spirit and probably saved millions of man-hours of debugging over their lifetimes. Likewise, the ACID tests for web browsers focused attention on areas where they were poorly implementing the standards.
And, although my two examples above are both IP/TCP related, I don't want to suggest that the problem stops there. Every common RFC should have such a test suite. HTTP may be a simple protocol but I'll bet that most implementations can't cope with continued header lines. It's those corners which a test suite should address.
Testing should help, but I don't think that it'll be enough. Problems will slip through. Testing against specifications will also never catch problems with the specification itself.
DNS requests can carry multiple questions. There's a big counter in the packet to say how many questions you are asking. However, the reply format can only hold one response code. Thus, I don't know of any DNS server which handles multiple questions (most consider the request to be invalid).
The ability to ask multiple questions would be very helpful. Just look at the number of places which suggest that you turn off IPv6 to make your networking faster. That's because software will otherwise ask a single IPv6 question of DNS, wait for the reply and then ask the IPv4 question. This delay, caused by not being able to request both results in a single request, is causing people to report slowdowns and disable IPv6.
We need to fix DNS, but we never can because one cannot afford break the world. We can't even start a backwards compatible transition because of broken implementations.
That's why networked software should have an expiry date. After the expiry date, the code should make it very clear that it's time to upgrade. For a router, print a big banner when an administrator connects. Flash all the error lights. For software, pop up a dialog every time you start. For home routers, beep and flash a big indicator.
We don't need everyone to update and, as manufacturers fold, maybe there won't be any firmware updates or software upgrades. Almost certainly the device shouldn't stop working. But we need to make more of an effort to recognise that large populations of old code hold everyone else back.
If we can know that nearly all the old code is going to be gone by some date, maybe we can make progress.
(Thanks to Evan for first putting this idea in my mind.)
rwb0fuz1024 included in eBATS
rwb0fuz1024 (pronounced 'robo-fuzz') has been included in the eBATS benchmarking suite. Not all of the test systems have been run with it yet, but here's one which has. It's the fastest verification by fa
r
.
(Full results here)
Sandboxing on Linux
This blog post has been brought about because of the issues of sandboxing Chromium on Linux (no, it's not ready and wont be for months).
Chromium uses a multiprocess model where each tab (roughly) is a separate process which performs IPCs to a UI process. This means that we can do parallel rendering and withstand crashes in the renderer. It also means that we should be able to sandbox the renderers.
Since the renderer are parsing HTML, CSS, Javascript, running plugins etc, sandboxing them would be very desirable. There's a lot of scope for buffer overflows and other issues in a code base that large and a good sandbox would dramatically reduce the scope of any exploits.
Traditional sandboxes: chroot, resource limits
People have been using chroot jails for many years. A chroot call changes the root of the filesystem for the current process. Once that has happened the process cannot interact with any of the filesystem outside the jail. As long as the process cannot gain root access, it's a good security measure.
Resource limits prevent denial of service attacks by, say, trying to use up all the memory on the system. See the getrlimit manpage for details.
These two mechanisms are supported by most UNIX like systems. However, there are some limitations:
Network access, for one, is not mediated by the filesystem on these platforms, so a compromised process could spew spam or launch attacks on an internal network. Also, the chroot call requires root access. Traditionally this has been done with a small SUID helper binary, but then root access is needed to install etc.
ptrace jails
The ptrace call is used by the strace utility which shows a trace of all the system calls that a child makes. It can also be used to mediate those system calls.
It works like this: the untrusted child is traced by a trusted parent and the kernel arranges that all system calls that the child makes cause a SIGTRAP, stopping the child. The parent can then read the registers and memory of the child and decide if the system call is allowed, permitting it or simulating an error if not.
The first issue is that some system calls take pointers to userspace memory which needs to be validated. Take open, which passes a pointer to the filename to be opened. If the parent wishes to validate the filename it has to read the child's memory and check that it's within limits. That's perfectly doable with ptrace.
The issue comes when there are multiple threads in the untrusted address space. In between the parent validating the filename and the kernel reading it, another thread can change its contents. In the case of open that means that the validator in the parent see one (safe) filename but the kernel actually acts on another. Because of this, either multithreaded children need to be prohibited, or the validator must forbid all system calls which take a pointer to a buffer which needs to be validated.
When calls like open have been prohibited, there's another trick which can be used to securely replace it:
UNIX domain sockets are able to transmit file descriptors between processes. Not just the integer value, but a reference to the actual descriptor (which will almost certainly have a different integer value in the other process). For details see the unix and cmsg manpages.
With this ability an untrusted child can securely open a file by making a request, over a UNIX domain socket to a trusted broker. The broker can validate the filename requested in safety: because it's in another address space the filename is safe between validation and use by the kernel. The broker can then return the file descriptor over the socket to the untrusted child.
The major problem with ptrace jails is that they have a high cost at every system call. On my 2.33GHz Core2 a simple getpid call takes 128ns. When a process is ptraced, that rises to 13,800ns (a factor of 100x slower). Additionally, Chromium on Linux is a 32-bit process because of our JIT, so getting the current time is a system call too.
Seccomp
Seccomp has a rather messy past (see the linked Wikipedia page for details). It's a Linux specific mode which a process can request whereby only read, write, exit and sigreturn system calls are allowed. Making any system call not on the permitted list results in immediate termination of the process.
This is a very tight jail, designed for pure computation and is perfect for that. It's enabled by default in kernel builds (although some distributions disable it I believe). It used to be enabled via a file in /proc but, in order to save space, it's now a prctl.
This issue is that the jail is too tight. It's great that read and write calls are enabled without overhead because that's much of what one of our rendering processes will use, but many other system calls would be nice (brk and mmap for memory allocation, gettimeofday etc). We would have to use the broker model for all of them.
For some calls the broker model has to be updated. Allocating memory to an address space isn't something which can be performed outside that address space so, in this case, the broker for these calls has to be in the same address space. This means that there's an untrusted thread running under seccomp and a trusted thread, not running seccomped, in the same process. The untrusted thread can request more memory by making an request over a pipe to the trusted thread. The trusted thread can then perform the allocation in the same address space.
This presents some issues when writing the trusted code. Because untrusted code has access to the memory the only thing the trusted thread can trust are its registers. That means no stack nor heap usage. Basically the trusted code has to be written in assembly and has to be pretty simple. That's not a huge problem for us however.
But we will be making lots of these other system calls, not just the memory allocation ones, but time calls, poll etc. All have to use a broker model.
To recap, a basic system call (getpid) on my 2.33GHz Core2 takes about 128ns. Performing the same operation over a pipe to another thread takes 7,775ns and to another process takes 8,423ns, roughly a factor of 60x slower.
Again, this is a very painful slowdown given the volume of such calls that we expect to make.
SELinux
Fedora, rightfully, makes a lot of noise about the fact that they have SELinux. It's a huge beast and Fedora's work has mostly been a process of taming the complexity and dealing with the fact that very little is written with SELinux in mind.
I don't have Fedora installed anywhere, but this may be a very nice solution to our issues. However, I suspect that root access will be required, again, to configure it. I speak mostly from a position of ignorance here, however. I should install Fedora at some point and have a play.
The Other Man's Grass
Recent releases of OSX have a system call actually called sandbox_init. It's a little half-baked at the moment, but shows great promise.
It's a feature from TrustedBSD and, in the limit, allows for a Scheme like language to give a detailed specification of the shape of the sandbox which is compiled to bytecode and loaded into the kernel. You can see some examples of the profile language in the slides for this USENIX talk. But, for the moment, I believe that just a few preset profiles are provided (see the manpage).
Rolling one's own
SELinux is implemented atop of LSM which is a general framework for hooking security decisions in the Linux kernel. It's conceivable that one could write a sandboxing module using these hooks.
It would require root access to install, but then so do many of the other solutions. It would probably play badly with other LSM users too, but Fedora is the only major distribution to be using them as far as I know. However, it would also be a large distraction.
Summary of data
| Platform | Simple system call | ... via a broker thread | ... via a broker process | ... when ptraced |
|---|---|---|---|---|
| 32-bit | 136.9ns | 8161.4ns | 8327.3ns | 14087.0ns |
| 64-bit | 128.7ns | 7775.0ns | 8423.3ns | 13779.9ns |
Obfuscated TCP
It's now in its 3rd iteration, Obfuscated TCP now has an updated site, mostly working code etc. I need people to go to the site, look at the docs, watch the video, build the code, try stuff out etc. Tell me what works and what doesn't. Email address is at the top of the page. Thanks to all who do, and remember that you don't just have to email if you have problems, positive reports are good too!
Google datacenters
There are different levels of secrets at Google. Almost everything unreleased is “confidential” - which means that we don't talk about it to the outside world. Then there is the “top secret” stuff - stuff that you don't even talk about to other Googlers.
Now, top secret stuff is rare because it's a little poisonous. An environment where lots of things are secret between coworkers isn't a pleasant one. How we cool our data centers was one of those items and I was sworn to secrecy when I was lucky enough to be given a guided tour of our Oregon operations.
But, for whatever reasons, this information is now public! Seriously, this is some of the coolest (no pun intended) stuff that Google does: go read about evaporative cooling.
A Rabin-Williams signature scheme: rwb0fuz1024
I wrote a Rabin-Williams signature scheme [source]:
- Verification speeds 4x RSA (on a Core2 2.33GHz, at least)
- Signatures are half the size of RSA for the same security
- A hash generic attach is provably as hard as factoring
Crit-bit trees
I wrote up djb's implementation of crit-bit trees for strings here [pdf]. Crit-bit trees have several nice properties:
- Fast: only a single string compare per lookup.
- For finite sets (like 32-bit ints) the depth of the tree is bounded by the length of the longest element.
- Simple code - no complex balancing operations
- Supports the usual tree operations: successor, minimum, prefix set etc.
Several groups of Linux kernel papers have been published recently. Here's my pick of them:
First we have the Proceedings of the 2008 Linux Symposium (these are in some order of order, favourite first):
- 'Real Time' vs. 'Real Fast': How to Choose?
- Korset: Automated, Zero False-Alarm Intrusion Detection for the Linux Kernel
- Low Power MPEG4 Player
- I/O Containment
- Bazillions of pages: The future of memory management under Linux
- Linux capabilities: making them work
- Keeping The Linux Kernel Honest (Testing Kernel.org kernels)
Next there's the ACM SIGOPS Operating Systems Review. These papers are about much more experimental developments in the kernel and are thus more fun, even if they are less likely to see the light of day:
- Extending futex for kernel to user notification
- PipesFS: fast Linux I/O in the unix tradition
- Plan 9 authentication in Linux
I've just released two new curve25519 implementations: one in C and one in x86-64 assembly. The latter is 10% faster than djb's implementation.
curve25519 is an elliptic curve, developed by Dan Bernstein, for fast Diffie-Hellman key agreement. DJB's original implementation was written in a language of his own devising called qhasm. The original qhasm source isn't available, only the x86 32-bit assembly output.
Since many x86 systems are now 64-bit, and portability is important, this project provides alternative implementations for other platforms.
| Implementation | Platform | Author | 32-bit speed | 64-bit speed |
| curve25519 | x86 32-bit | djb | 265µs | N/A |
| curve25519-donna-x86-64 | x86 64-bit | agl | N/A | 240µs |
| curve25591-donna | Portable C | agl | 2179µs | 628µs |
(All tests run on a 2.33GHz Intel Core2)
Google has, at last, open sourced Protocol buffers. My, very minor contribution to this is that I wrote the basis for the encoding documentation.
Protocol buffers pretty much hit the sweet spot of complexity and capability. (See XML and ASN.1 for examples of attempts which missed.) I have the beginnings of a protocol buffer compiler for Haskell that I wrote for internal apps. Now that the C/Java/Python versions are out, I should probably clean that up and put it on Hackage. But every coder should consider protocol buffers for their serialisation needs from now on.
Older entries can be found in the archives

