Saturday, March 18, 2006

bogofilter and circumventing fsync()

For some time now I have been using bogofilter to categorise incoming email as spam or non-spam (ham). It is quite effective and generally I have been well pleased. However, it was not fast. It does have a batch mode, avoiding the need to fire it up per-message, but even so it was managing about 1 message per second.

Why is this so? Bogofilter usually uses the dbm library for its back end, and for consistency when the db is shared it seems to call fsync() frequently. Very frequently. It's easily seen using strace. Each fsync() requires the OS to commit the data to the disc itself, not merely to the I/O queue, and that basically throttles performance to the speed of a hard drive, which is as nothing compared to memory.

I am not very concerned about this degree of data integrity for a database that is, in essence, a collection of junk-mail keyword frequencies. High value data indeed!

I was loath to patch bogofilter, and more loath to dig into dbm or change it, and so for many months I simply lived with the slowdown. Email delivery rarely needs to be truly instant; if I get a message a minute after dispatch instead of a few seconds I will usually not care. Besides, there's plenty of delivery latency already in the inter-poll delays of my fetches from the POP mail spool.

There are, however, two circumstances where I care about bogofilter's speed; one annoying and the other probably a showstopper.

The annoying one is the morning catchup delay. I often turn off the regular POP fetch when I sleep or at other times when I'll be away from my laptop for several hours. Why load my ISP or the laptop's hard drive with wasted work? Therefore, on return to The World I kick off a large mail fetch - typically over 1000 messages after a night of inactivity. The POP fetch itself takes as long as it takes; it's largely constrained by bandwidth and there is little that can be done about it (not entirely true - see the next post, as yet unwritten). However, the bogofilter run that follows takes quite a long time, and that adds substantially to the delay before the mail is filed.

The showstopper stems from a probably-failing hard drive. Our main machine has a pair of decent-sized hard drives in RAID1 supplying the /home area. I embarked on moving my mail storage from my laptop to the main machine the other day, and soon after the machine crashed. It's done it three times so far, and the error messages suggest one of the drives in the RAID1 is to blame, possibly coupled with a Linux kernel bug (well, let's be frank - definitely a kernel bug; the whole point of RAID1 is to survive a drive failure without loss of the machine). Anyway, the third crash seemed circumstantially tied to my morning bogofilter run; the very high disc seek numbers suggest to me that the DMA timeouts stem from the hard drive simply taking too long to do things, or internally getting out of step with the kernel to the point that the two no longer talk to each other.

What to do, what to do?

Obviously we'll be running some diagnostics on the drive this weekend, and probably returning it (it's nearly new).

Secondarily, this is the spur to me to make bogofilter less hard on physical hardware and hopefully faster into the bargain.

So last night I wrote this script, "bogof". We use a RAM disc. Modern Linuxes tend to ship with one mounted at /dev/shm, and of course Solaris has run /tmp that way for years. I'm using Linux, hence the script's defaults.

The bogof script wants a copy of the wordlist.db file on the RAM disc and makes such a copy as needed. Naturally, the first run of the script incurs the cost of copying the db to the RAM disc, but that runs at disc read speed. Even on a laptop that's 4 or 5 MB/s, so it's 20-25s for my current db - equivalent to the cost of 20-25 new messages, well below the size of the big mail fetch.

After the first run the copy already exists, so bogof just sets $BOGOFILTER_DIR to point at the copy and execs bogofilter. Since fsync() on a RAM disc is probably close to a no-op, bogofilter runs much faster - close to two orders of magnitude faster, in fact.
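The behaviour described above can be sketched as a small shell function. This is an illustrative sketch, not the original bogof script: the master db location ($HOME/.bogofilter) and the MASTER_DIR/RAM_DIR override variables are my assumptions.

```shell
# Sketch of the bogof logic. Assumptions (not from the original script):
# the master db lives in $HOME/.bogofilter, and MASTER_DIR/RAM_DIR are
# illustrative override knobs.
bogof() {
    master=${MASTER_DIR:-$HOME/.bogofilter}       # on-disc master db
    ramdir=${RAM_DIR:-/dev/shm/$USER/bogofilter}  # RAM-disc working copy
    mkdir -p "$ramdir" || return 1
    # First run: copy the db onto the RAM disc; later runs find it there.
    [ -s "$ramdir/wordlist.db" ] \
        || cp "$master/wordlist.db" "$ramdir/" || return 1
    # Point bogofilter at the RAM copy and hand over to it.
    BOGOFILTER_DIR=$ramdir
    export BOGOFILTER_DIR
    exec bogofilter ${1+"$@"}
}
```

Note the exec: run this as a standalone script (or in a subshell), since it replaces the calling process with bogofilter.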

Still, the data do need to get back to the hard drive at some point or bogofilter will never learn about new spam. I run my POP fetch in a small shell script that basically runs a fetch delivering to a spool Maildir folder, then sleeps, and repeats. Another script pulls messages from the spool folder and spam-filters them, running bogofilter in batch mode over a chunk of messages at once. This is now quite fast courtesy of the bogof script. Once categorised, the messages are either filed in the spam folder for a final sanity check or passed to the main mail filer that parcels things out to my many mail folders. After the spam run I just copy the wordlist.db file back to the master copy. This runs much faster than disc read speed because it's coming from a copy in RAM and going to the buffer pool, also in RAM. In due course the OS will get it to the disc.
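The copy-back step after the batch run amounts to this. Again a sketch under assumed paths (MASTER_DIR/RAM_DIR and $HOME/.bogofilter are illustrative, not the author's actual setup):

```shell
# After a batch bogofilter run, push the RAM-disc wordlist back to the
# on-disc master. Paths and the MASTER_DIR/RAM_DIR variables are
# illustrative assumptions, not the author's actual script.
bogof_sync() {
    master=${MASTER_DIR:-$HOME/.bogofilter}       # on-disc master db
    ramdir=${RAM_DIR:-/dev/shm/$USER/bogofilter}  # RAM-disc working copy
    # cp writes into the buffer cache; the OS flushes it to disc later,
    # so this step runs at memory speed rather than disc speed.
    [ -s "$ramdir/wordlist.db" ] && cp "$ramdir/wordlist.db" "$master/"
}
```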

This simple change has greatly sped my mail processing and greatly eased the burden on my hard drive activity light. I'm happy!

Sunday, February 19, 2006

source code != documentation

I was looking at Wesner Moise's "Smart Software" blog entry about the EU vs MS thing. Basically he's saying that Microsoft, required to make interoperability possible for third parties, is offering both source code and documentation, and that this should suffice regardless of the sufficiency of the documentation. He even says, "source code is relevant; it often is the best documentation". This is fallacious.

Why is code not interoperation documentation? Because code makes no distinction between what is stable and needed for interoperability, and what is just implementation - free to change. Without documentation Microsoft can simply pull the same forced-upgrade stunt they've been pulling for years: change the code and break the interoperating competition. Because of Microsoft's monopoly position, this has the pragmatic effect of forcing a fresh round of Microsoft software purchases; even users of other, previously interoperable, tools must upgrade as soon as they have to work with partners using the new Microsoft release. I have seen this with Office releases even among Microsoft-only users.

With the interoperation requirements specified in documentation, a Microsoft change that breaks the interoperating competition can be clearly recognised as one of three things: conformant with the pre-existing doco and thus a bug in the competition; in violation of the doco and thus a bug in the Microsoft code; or an ambiguity or gap in the doco that Microsoft must fill to clarify the required interoperation. Source code is not the best documentation; it is merely a specification of what happens to happen just at the moment. For interoperability, it is the _worst_ documentation.

Wednesday, May 04, 2005

headphones as productivity aid

I'm sure a bazillion people already know this: headphones let you work better. I used to listen to a lot of music when sysadmining at UNSW, but that changed when I escaped into the outer world. At the CSIRO I had my own office, which was good for concentration, though I still found it easy to be distracted. At my current workplace I'm in cubicle land (pretty good cubicle land actually, but nonetheless). I find other people talking very distracting; I'm someone who literally can't think if there's a TV on in the same room; many waiting rooms drive me nuts - I can't even sit and read:-(

I recently got around to buying a decent set of enclosing headphones (Sennheiser HD270s [review @ dansdata]) and am ripping a bunch of CDs to Ogg format (using cdrip [manual] if you care). I'm now in my own little world and actually getting much more done.

Friday, April 29, 2005

just because you can do something doesn't mean you should

Rant: today I tripped over yet another busted web site that hasn't mastered the basic tool of the web - the hyperlink. There I was, glancing at the Night Watch screenshots, and as is my habit I middle-clicked all four shots to pop them up in new tabs, knowing they'd take a little while to load and intending to read the text while that happened.

What transpired? I have four new tabs named "(Untitled)". This is The Clue.

The links attached to the screenshot thumbnails say:

javascript:popUp('press/174/nightwatch_shot25_uk.jpg')
Embedded in the page is a pointless javascript function that opens a new window containing the screenshot, an ad banner and a pointless "close this window" link, also javascript.

So what have we here? Another idiot web designer who thinks he/she knows what the reader should be doing.

Of course, what should be on that link is this:

http://www.worthplaying.com/kiwi_popup.php?img=press/174/nightwatch_shot26_uk.jpg
or even this:
http://www.worthplaying.com/press/174/nightwatch_shot26_uk.jpg
Why? Because the reader has their own ideas. If I want a new window I can make one, thank you. If I want to point a scraper at the page to grab the shots, perhaps to view them conveniently in some handy image viewer I like, a plain http: URL can be grabbed trivially. Maybe I'm using a text web browser, or have javascript disabled (since it's widely abused). A plain URL is portable and flexible.

If the web author wants to hint that I should get a new window he can always put a target="_new" attribute on the anchor. Most browsers will open a new window for that, in some form or other.

Which brings me to the new window itself. As mentioned earlier, it's got a "Close this window" link. What a pointless waste. The user already has a way to close the window, usually at least two ways: they can press the browser's close-window button or they can use their window manager's close-window facility (the X in the top right for most desktops, Alt-Delete for me). And the beauty of both these things is that the user's hands already know how to do it, without thought. The "close window" link is a pointless, annoying and insulting waste of space.

I take it as a sign that the web author is incompetent. It at least makes me feel better than thinking they're trying to alienate their readers.

Wednesday, April 27, 2005

switched to rxvt-unicode from aterm

I'm trying out rxvt-unicode. I'm a long-term aterm user. It's small, does pseudotransparency and generally just works. However, it has some minor annoyances: it can be fiddly to build (or it used to be - it depended on WindowMaker for some libraries, and I'm not a WindowMaker user) but, more importantly, it has always had some unremovable borders.

Since I have a rather disciplined Zen desktop environment this is a bit annoying; I'd pop up a new terminal and find the text slightly inset from the screen edges. rxvt-unicode seems to honour its borderwidth settings and also has a neat "stay aligned to the nearest screen corner" setting that keeps things neat.

Besides, though I tend to live in a "C" locale, I like the idea of properly rendered glyphs. The other neat setting is the vanishing mouse cursor, which will disappear after a tunable time if not moved.
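For reference, the border and pointer settings mentioned above map onto X resources roughly like this. A sketch only: the resource names (internalBorder, externalBorder, pointerBlank, pointerBlankDelay) are assumed from the urxvt(1) man page, and the corner-alignment option is not shown.

```
! Hedged ~/.Xresources sketch for rxvt-unicode; resource names are
! assumed from the urxvt(1) man page, values are illustrative.
URxvt.internalBorder:    0
URxvt.externalBorder:    0
URxvt.pointerBlank:      true
URxvt.pointerBlankDelay: 2
```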

Thursday, May 13, 2004

opening post

bootstrap - actual content later