Saturday, March 18, 2006

bogofilter and circumventing fsync()

For some time now I have been using bogofilter to categorise incoming email as spam or non-spam (ham). It is quite effective and generally I have been well pleased. However, it was not fast. It does have a batch mode, avoiding the need to fire it up per-message, but even so it was managing about 1 message per second.

Why is this so? Bogofilter usually uses the Berkeley DB library for its back end, and for consistency when the db is shared it seems to call fsync() frequently. Very frequently. It's easily seen using strace. Each fsync() requires the OS to commit the data to the disc itself, not merely to the I/O queue, and that basically ties performance to hard drive speed, which is as nothing compared to memory.
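For the curious, a quick way to see this for yourself is a counting strace run over a single classification. This is just an illustration; "message.txt" stands in for any saved sample message.

    # Count the fsync()/fdatasync() calls made while classifying and
    # registering a single message read from stdin.
    strace -c -e trace=fsync,fdatasync bogofilter -u < message.txt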

I am not very concerned about this degree of data integrity for a database that is, in essence, a tally of junk mail keyword frequencies. High value data indeed!

I was loath to patch bogofilter, and even more loath to dig into the db library or change it, and so for many months I have simply lived with the slowdown. Email delivery rarely needs to be truly instant; if I get a message a minute after dispatch instead of a few seconds I will usually not care. Besides, there's plenty of delivery latency already in the inter-poll delays of my fetches from the POP mail spool.

There are, however, two circumstances where I care about bogofilter's speed; one annoying and the other probably a showstopper.

The annoying one is the morning catchup delay. I often turn off the regular POP fetch when I sleep or at other times when I'll be away from my laptop for several hours. Why load my ISP or the laptop's hard drive with wasted work? Therefore, on return to The World I kick off a large mail fetch, typically over 1000 messages after a night of inactivity. The POP fetch itself takes as long as it takes - it's largely constrained by bandwidth and there is little that can be done about it (not entirely true - see the next post, as yet unwritten). However, the bogofilter run then takes quite a long time and adds substantially to the delay before the mail is filed.

The showstopper stems from a probably-failing hard drive. Our main machine has a pair of decent-sized hard drives in RAID1 supplying the /home area. I embarked on moving my mail storage from my laptop to the main machine the other day, and soon afterwards the machine crashed. It has done so three times so far, and the error messages suggest one of the drives in the RAID1 is to blame, possibly coupled with a Linux kernel bug (well, let's be frank - definitely a kernel bug - the whole point of RAID1 is to be able to survive a drive failure without losing the machine). Anyway, the third crash seemed circumstantially tied to my morning bogofilter run; the very high disc seek counts suggest to me that the DMA timeouts stem from the hard drive simply taking too long to do things, or internally getting so far out of step with the kernel that they no longer talk to each other.

What to do, what to do?

Obviously we'll be running some diagnostics on the drive this weekend, and probably returning it (it's nearly new).

Secondly, this is the spur for me to make bogofilter easier on the physical hardware and hopefully faster into the bargain.

So last night I wrote this script, "bogof". We use a RAM disc. Modern Linuxes tend to ship with one mounted as /dev/shm, and of course Solaris has run /tmp that way for years. I'm using Linux, hence the script's defaults.

The bogof script wants a copy of the wordlist.db file on the RAM disc and makes such a copy at need. Naturally, the first run of the script incurs the cost of copying the db to the RAM disc, but that runs at disc read speed. Even on a laptop that's 4 or 5 MB/s, so it's 20-25s for my current db - equivalent to the cost of processing 20-25 new messages, well below the size of the big mail fetch.

After the first run the copy already exists, so bogof just sets $BOGOFILTER_DIR to point at the copy and execs bogofilter. Since fsync() on a RAM disc is probably close to a no-op, bogofilter runs much faster. Close to two orders of magnitude faster, in fact.
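For the record, bogof amounts to something like the sketch below. Treat it as illustrative rather than verbatim: the wordlist location and the name of the RAM disc directory are my own choices.

    #!/bin/sh
    # bogof - run bogofilter against a RAM disc copy of the wordlist.
    # Assumes the master wordlist lives in $HOME/.bogofilter and the RAM
    # disc is mounted at /dev/shm; adjust to taste.

    master=$HOME/.bogofilter
    ramdir=/dev/shm/bogofilter.$USER

    # Make the RAM disc copy if it isn't already there.
    if [ ! -f "$ramdir/wordlist.db" ]
    then
        mkdir -p "$ramdir"                  || exit 1
        cp "$master/wordlist.db" "$ramdir/" || exit 1
    fi

    # Point bogofilter at the copy and hand over to it.
    BOGOFILTER_DIR=$ramdir
    export BOGOFILTER_DIR
    exec bogofilter "$@"

It is a drop-in wrapper: anywhere I would have said "bogofilter -u" or "bogofilter -s" I now say "bogof -u" or "bogof -s" and the arguments pass straight through.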

Still, the data do need to get back to the hard drive at some point or bogofilter will never learn about new spam. I run my POP fetch in a small shell script that basically runs a fetch delivering to a spool Maildir folder and then sleeps, repeating. Another script pulls messages from the spool folder and spam filters them, running bogofilter in batch mode over a chunk of messages at once. This is now quite fast courtesy of the bogof script. Once categorised, the messages are either filed in the spam folder for a final sanity check or passed to the main mail filer that parcels things out to my many mail folders. After the spam run I just copy the wordlist.db file back to the master copy. This runs much faster than disc read speed because it's coming from a copy in RAM and going to the buffer pool, also in RAM. In due course the OS will get it to the disc.
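The copy-back at the end of the spam run is just a cp in the other direction, roughly (same hypothetical paths as in the bogof sketch above):

    # Push the updated wordlist back to the master copy on the hard drive;
    # the OS will write it out to the platters in its own time.
    cp /dev/shm/bogofilter.$USER/wordlist.db "$HOME/.bogofilter/wordlist.db"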

This simple change has greatly sped my mail processing and greatly eased the burden on my hard drive activity light. I'm happy!
