Why Pipes Suck (22 Aug 2003)
I'm going to have to do something about this problem at some point, but for the moment I'm going to settle with describing it.
I'm considering the design of some status monitoring for the servers in DoC. At the moment we have some pretty complex triggers setup on our admin Postgres server that allows you to insert values into a table and have per-minute and per-hour tables filled out automatically with the min, max and average. This is all very nice, but very slow. Postgres just can't handle it so we need something different.
We want to be able to set alarms on values over any averaging time and we want to record the per-hour, per-minute, per-day etc data for long term analysis of server load and so forth.
I've written a small C program that parses /proc/stat and pulls useful information out of it. Every bit of information is a name-value pair like servername.load, 2.3. I don't want to have to bother with authenticating raw TCP connections so I'm going to have the status server ssh out and invoke the monitoring program to trusted servers.
That's all just background.
Now I have lots of incoming streams from the servers and I need to demultiplex them into a stream with all the data. I'm a good UNIX programmer so I want everything to be as modular as possible. Let's say that I collect the data with a command line like: ssh -t -t -x ... servername /usr/bin/monitor_program | domcat /var/status_data (domcat is like netcat, but for UNIX domain sockets). Now I need a program that can merge the incoming streams and allow people to connect and receive the total stream.
If I was being a poor UNIX programmer I would pass a couple of TCP port numbers to this program. It would take all the input from anyone who connected to the first port, merge it and throw it out to everyone connected to the second port.
But the decision to use TCP shouldn't be ingrained (authentication nightmare), nor should the splitting of streams (it's just data). All this program should do is use it's protocol specific knowledge to merge streams into one. Thankfully, I already have a program called conguardian that just passes file descriptors to the stdin of it's child and accepts (and authenticates) connections from a named UNIX domain socket. So, the command line is looking like: conguardian /var/status-data merger_program.
But how do we get the data out of it? We write a program called splitter that just takes an input stream from stdin and copies it to everyone who connects. Thankfully, conguardian already abstracts the business of accepting and authenticating connections. So we say conguardian /var/status-data merger_program | conguardian /var/status-data-out splitter.
Opps! conguardian passes file descriptors in via stdin and we are trying to pipe data into stdin. How well do you know your shell syntax? Can you even pipe the output of one program into a numbered fd input of another? Are you going to have a headache by the time you have finished?
I'm always finding that I can't connect programs together with anything like the flexibility I want. How do you do bidirectional pipes? You put make programs' name and arguments, arguments of the first and write special fork handling code in the first. And if you want two bidirectional inputs to the second program? Oh dear.
(The above may be clearer if I include the conguardian manpage)
CONGUARDIAN(1) CONGUARDIAN(1) NAME conguardian - Access control for UNIX domain sockets SYNOPSIS conguardian <path to socket> <child process> [<child argu ments>]... DESCRIPTION conguardian attempts to unlink the given socket path if it exists and is a socket. If it is not a socket then it will fail to bind to it and give up. conguardian accepts connections on the given socket and checks the UID of the other end against an internal list of allowed usernames. UID 0 is always allowed and the internal list is initially empty. If not on the list, the connection is terminated. If the client is allowed and sends an ASCII NUL as the first byte, the connection is passed to the child over UNIX domain DGRAM socket on stdin. If the client sends a 0x01 byte and is root, it can upload a new username list. ENVIRONMENT IDENT given in all syslog messages AUTHOR Adam Langley <agl@imperialviolet.org> CONGUARDIAN(1)
As O'Reilly books go, this is a pretty small one. It's list price ($25) is more than I would value it at, but I got it from the library .
The book is in three sections: an introduction to Perl 6, an introduction to Parrot and a primer in Parrot assembly. The last one is highly skimmable unless you are actually programming in Parrot assembly (in which case you probably already have a far better knowledge of Parrot).
Now Perl 6 looks quite cool, it fixes a couple of things that I don't like about Perl 5. Assigning a hash or array to a scalar produces a reference to it...
Interlude: I'm half watching a program about asteroid impacts and I've just seen a (poor) computer graphics simulation of an impact on right on Imperial College, of all the places in the world. I'm a little gob smacked...
... which makes a lot more sense than assigning the phase of the moon or something. And the rules system (even more complex regular expressions) looks very powerful and a little more sane than Perl 5 regexps.
Also, we have an intelligent equality (~~) operator, which looks neat and leads to a nice switch operator (given). But I'm a little concerned about the number of different things it does depending on the types of it's arguments, but that's very Perlish. And the book lists 12 different contexts in which anything can be evaluated.
Less cosmetically, Perl 6 might gain continuation and coroutine support from Parrot. I don't know if Perl 6 will actually expose these, but Parrot can do them. And Parrot looks like it could really do wonders for open source scripting languages. It looks fast, and has been designed to support Perl 6, Ruby, Parrot, Scheme and others. Intercalling between them might allow us to get rid of some of the terrible C glue code that we have at the moment.
One thing that does worry me about Parrot is that it's basic integer type is a signed 32-bit. If you want anything else you have to use PMCs, which is a vtable based data type that allows for arrays and hashs and so forth, and is much slower. Now there are many applications for which 31-bits isn't enough. File offsets are obvious, but how about UID and device numbers? Both of these are looking like that are going to be 32-bit unsigned ints. You can fit this into a Parrot signed int, but it's going to cause huge headaches.
I've been dealing with APC UPSs a fair bit this week. A quick Google search will turn up the serial protocol that they use and it's really quite nice. A lot of devices (APC MasterSwitches for one) have a fancy vt100 menu interface which is totally unusable for an automated system. The UPSs, on the other hand have a simple byte code protocol and hopefully I'll have the servers shutting down neatly in a power failure. Software like that already exists but it's generally too far simple minded. We have many servers on any given UPS and some servers on several.
APC do loose points for their serial cables however. APC supply (for quite a price) so called 'smart cables' that are specially pinned out and nothing else uses the same interface. Thankfully, after looking at diagrams for about an hour I stuck 3 pins into a D9->RJ45 converter and it worked first time!