Friday, March 31, 2006

asynchronous replies in mutt

David Woodhouse talks about Jack's rhetorical post "why did we ever abandon Mutt and Pine?", and says he still uses pine on handheld devices. The second of his two main reasons for defecting to Evolution is:
Secondly, I very much like composing new email in separate windows, rather than in the main mail reader window. I have a habit of hitting 'reply', perhaps making a half-hearted attempt to respond to an email, and then getting distracted and leaving it for days before I eventually find the composer window on my desktop again, finish it off and send it. If I didn't do that, then stuff would just get lost and I'd never reply to it.
Well, I'm a committed mutt user and I have a similar problem with replies; I often defer them because I can't do them justice right now, and never return. The message gets buried.

There was a recent thread on the mutt lists, started by Jamie Rollins, about "pseudo multi-threading": wanting to dispatch mail composition in a separate window. It turns out a few people do things like that, and there are a few scripts.

For myself, I don't want a separate window. I want always to start in-line with my reply, but perhaps abandon it and resume later. So... screen! Now I have a script called muttedit that I use as the mail editor; my .muttrc now says "set editor=muttedit". It copies the temp file mutt makes and runs screen, invoking "mutt -H" with the copy. If I complete the reply and quit, I'm still in my original mutt as if I had used a plain editor. To defer it, I detach from screen. I'm back in mutt, and there's a screen session lying around holding the pending reply. Muttedit takes a bit of care to give the screen session a nice fat title with the reply subject in it.
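Roughly, a sketch of the idea (not the real muttedit; the holding directory and the subject munging are my own guesses, and it relies on "set edit_headers" so that the temp file carries the headers that "mutt -H" needs):

    #!/bin/sh
    # muttedit - sketch only; mutt invokes this as: muttedit /tmp/mutt-XXXX
    tmpf=$1
    copy=$HOME/tmp/reply-$$.draft       # assumed holding area for drafts
    cp "$tmpf" "$copy" || exit 1
    # pull the Subject: line out to label the screen session
    # (underscores keep the session name shell-friendly)
    subj=`sed -n 's/^[Ss]ubject: *//p' "$copy" | sed 1q | tr ' ' '_'`
    # run a detachable screen session; "mutt -H" resumes composition
    # from the draft copy and sends it when done
    exec screen -S "reply_$subj" mutt -H "$copy"

Completing the inner mutt ends the screen session and drops me back in the outer mutt, whose own now-unmodified composition can simply be abandoned; detaching instead leaves the session running with the draft.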

We'll see how it goes.

Saturday, March 18, 2006

how to scuttle a counter-terrorism hotline

I see in today's ABC News "Terrorism hotline callers may be monitored". I presume it's the same hotline promoted in the frequent TV advertisements for reporting suspicious stuff, reassuringly touted with the words "and you can remain anonymous". Clearly that's voided now. Very clever. It can only dampen the response from anyone with insider knowledge of some operation.

moving from fetchmail to getmail

As mooted previously, my mail collection is largely bound by the bandwidth from my ISP to me. However, it's not entirely bound by that. Some of it is bound by the delivery cost of each message at my end.

Until this morning I had been using fetchmail to collect my email. Generally I'm very happy with it. It has a concise and human-friendly configuration file, it is fairly easy to use, and it flexibly delivers via the program of your choice, typically procmail for people working this way. However, it only delivers via an external program. The default is the local sendmail on your system, and most people choose procmail if they override that default. This basically means a program fire-up per message. It is not a big cost, but it's noticeable. Because fetchmail is quite careful about ensuring delivery before deleting anything from your POP mail spool, this cost sits inline with the data transfer, and thus adds a little latency.
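For reference, the old arrangement amounted to something like this (a sketch; the server name and credentials are placeholders):

    # ~/.fetchmailrc - sketch of the previous setup
    poll pop.example-isp.net protocol pop3
        user "me" password "XXXXXX"
        mda "/usr/bin/procmail -d %T"

That mda line is the per-message program fire-up.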

My mail delivery is a bunch of decoupled programs: something fetches from my ISP and drops messages into a spool Maildir folder (see the Maildir core documentation); another script collects messages from there and files the spam in a spam folder and the ham in the spool-in folder; a third script scatters things from there to the final destination folders according to my rules. It may seem overly complex, but it keeps the tasks nicely separated for easy tinkering and works quite well. For example, I can refile messages simply by copying them from whatever folder I'm reading into the spool-in folder; there is no fear of having them miscategorised as spam and I don't need to invoke some special program or incantation to kick off the refile.

Still, that's not today's point. The issue here is that the initial delivery goes unconditionally into a single mail folder. I have been using a trite procmailrc with no rules and a DEFAULT=$MAILDIR/spool/ line, but procmail must still be invoked once per message. If fetchmail could deliver directly to a Maildir I'd just be doing that. However, fetchmail restrains itself to collecting from POP or IMAP and handing off to a local mail agent such as sendmail or procmail, and does not sully itself with direct mail folder delivery.
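That trite procmailrc is essentially the whole file; something like:

    # ~/.procmailrc - no rules at all, just unconditional delivery
    MAILDIR=$HOME/mail
    # the trailing "/" tells procmail to deliver in Maildir format
    DEFAULT=$MAILDIR/spool/
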

In consequence, and only because my fetch delivers to a mail folder without any other smarts, I have moved to using getmail. Like fetchmail it can hand off to a mail agent for delivery, but it can also deliver directly to a mail folder. So now I do that. The fetches are now slightly quicker - enough to notice.
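The getmail configuration is about this simple (a sketch with placeholder server and account details; getmail wants the trailing slash on a Maildir path):

    # ~/.getmail/getmailrc - sketch
    [retriever]
    type = SimplePOP3Retriever
    server = pop.example-isp.net
    username = me
    password = XXXXXX

    [destination]
    type = Maildir
    path = ~/mail/spool/

No intermediate program at all: messages land straight in the spool Maildir.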

bogofilter and circumventing fsync()

For some time now I have been using bogofilter to categorise incoming email as spam or non-spam (ham). It is quite effective and generally I have been well pleased. However, it was not fast. It does have a batch mode, avoiding the need to fire it up per message, but even so it was managing about 1 message per second.

Why is this so? Bogofilter usually uses Berkeley DB for its back end, and for consistency in the face of shared use of the database it seems to call fsync() frequently. Very frequently. It's easily seen using strace. Each fsync() requires the OS to commit the data to the disc, not merely to the I/O queue, and this basically throttles performance to that of a hard drive, which is as nothing compared to memory.
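Something like this shows the stream of syncs (assuming a sample message on stdin; the exact syscall mix may vary with the DB version):

    # trace just the sync calls while classifying one message
    strace -e trace=fsync,fdatasync bogofilter -u <message.txt
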

I am not very concerned about this degree of data integrity for a database that is, in essence, a collection of junk-mail keyword frequencies. High value data indeed!

I was loath to patch bogofilter, and more loath to dig into the db library or change it, and so for many months I have simply lived with the slowdown. Email delivery rarely needs to be truly instant; if I get a message a minute after dispatch instead of a few seconds I will usually not care. Besides, there's plenty of delivery latency already in the inter-poll delays of my fetches from the POP mail spool.

There are, however, two circumstances where I care about bogofilter's speed; one annoying and the other probably a showstopper.

The annoying one is the morning catchup delay. I often turn off the regular POP fetch when I sleep, or at other times when I'll be away from my laptop for several hours. Why load my ISP or the laptop's hard drive with wasted work? Therefore, on return to The World I kick off a large mail fetch, typically over 1000 messages after a night of inactivity. The POP fetch itself takes as long as it takes - it's largely constrained by bandwidth and there is little that can be done about it (not entirely true - see the next post, as yet unwritten). However, the bogofilter run then takes quite a long time and adds substantially to the subsequent email filing.

The showstopper stems from a probably-failing hard drive. Our main machine has a pair of decent sized hard drives in RAID1 supplying the /home area. I embarked on moving my mail storage from my laptop to the main machine the other day, and soon after the machine crashed. It's done it three times so far, and the error messages suggest one of the drives in the RAID1 is to blame, possibly coupled with a Linux kernel bug (well, let's be frank - definitely a kernel bug - the whole point of RAID1 is to be able to survive a drive failure without losing the machine). Anyway, the third crash seemed circumstantially tied to my morning bogofilter run; the very high disc seek numbers suggest to me that the DMA timeouts stem from the hard drive simply taking too long to do things, or internally getting out of step with the kernel to the point that they no longer talk to each other.

What to do, what to do?

Obviously we'll be running some diagnostics on the drive this weekend, and probably returning it (it's nearly new).

Secondarily, this is the spur for me to make bogofilter less hard on the physical hardware, and hopefully faster into the bargain.

So last night I wrote this script, "bogof". We use a RAM disc: modern Linuxes tend to ship with one attached as /dev/shm, and of course Solaris has run /tmp that way for years. I'm using Linux, thus the script's defaults.

The bogof script wants a copy of the wordlist.db file on the RAM disc and makes such a copy at need. Naturally, the first run of the script incurs the cost of copying the db to the RAM disc, but that runs at disc read speed. Even on a laptop that's 4 or 5 MB/s, so it's 20-25s for my current db - equivalent to the cost of 20-25 new messages, well below the size of the big mail fetch.

After the first run the copy already exists, so bogof just sets $BOGOFILTER_DIR to point at the copy and execs bogofilter. Since fsync() on a RAM disc is close to a no-op, bogofilter runs much faster. Close to two orders of magnitude faster, in fact.
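The script is tiny; a sketch of the idea (the real bogof doubtless differs in detail, and the directory choices here are assumptions):

    #!/bin/sh
    # bogof - run bogofilter against a RAM disc copy of the wordlist
    master=$HOME/.bogofilter            # where wordlist.db normally lives
    ramdir=/dev/shm/bogofilter-$USER    # per-user spot on the RAM disc
    if [ ! -f "$ramdir/wordlist.db" ]
    then
        # first run: pay the one-off cost of copying the db into RAM
        mkdir -p "$ramdir" || exit 1
        cp "$master/wordlist.db" "$ramdir/" || exit 1
    fi
    # point bogofilter at the RAM copy; its fsync()s are now nearly free
    BOGOFILTER_DIR=$ramdir
    export BOGOFILTER_DIR
    exec bogofilter ${1+"$@"}
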

Still, the data do need to get back to the hard drive at some point or bogofilter will never learn about new spam. I run my POP fetch in a small shell script that basically runs a fetch delivering to a spool Maildir folder and then sleeps, repeating. Another script pulls stuff from the spool folder and spam filters it, running bogofilter in batch mode over a chunk of messages at once. This is now quite fast courtesy of the bogof script. Once categorised, the messages are then either filed in the spam folder for final sanity checking or passed to the main mail filer that parcels things out to my many mail folders. After the spam run I just copy the wordlist.db file back to the master copy. This runs much faster than disc read speed because it's coming from a copy in RAM and going to the buffer pool, also in RAM. In due course the OS will get it to the disc.
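The tail of the spam-filter script thus amounts to something like this (again a sketch, not my actual script; -b is bogofilter's bulk mode, taking message file names on stdin, and the paths are assumptions):

    # classify a whole chunk of spooled messages in one bogofilter run,
    # updating the RAM copy of the wordlist as it goes
    # (the per-file output lines drive the spam/ham filing)
    find $HOME/mail/spool/new -type f | bogof -b -u
    # then push the updated wordlist back to the master on the hard
    # drive; RAM to buffer cache, so this is quick
    cp /dev/shm/bogofilter-$USER/wordlist.db $HOME/.bogofilter/wordlist.db
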

This simple change has greatly sped my mail processing and greatly eased the burden on my hard drive activity light. I'm happy!