spamassassin + spampd + bayesian filtering

I have been running my own mail server for an eternity. For the last decade or two, it’s been Postfix with Spamassassin (invoked via spampd) for spam control. I then ran the most-excellent SpamSieve on my laptop to catch the spam that Spamassassin missed.

I didn’t run Spamassassin’s bayesian filter because SpamSieve did such an excellent job, and I could never get Spamassassin’s filter to work properly.

Lately, several things have happened. I realized that other users on my mail system weren’t running their own local spam filters, and thus were getting inundated with spam. I’ve been traveling more, and have had to read mail on mobile devices while my laptop wasn’t running to do the filtering. And the total volume of spam (including dangerous phishing and malware-carrying spam) was just out of control.

I finally made a concerted effort to get Spamassassin, spampd, and Postfix playing nice together with bayesian filtering on.

I’m running Ubuntu, but any Debian derivative should behave the same. The problems turned out to involve the usual suspects: where the files were placed, who owned them, and their permissions.

I run spampd as its own user (cleverly, spampd). The documentation implies that it should keep the bayesian data in the /home/spampd/ directory, but the files actually end up in /var/cache/spampd/. I’m nervous about just using /var/cache/, though, so I tried editing /etc/spampd.conf to add:

bayes_path /home/spampd/.spamassassin
auto_whitelist_path /home/spampd/awl
maxsize 1024000

That seems to work at first, but somehow, somewhere, some versions of the files end up back in /var/cache/spampd/ and filtering then stops working. After hundreds of iterations, I realized that some code was going to follow the .conf file, some code was going to use /var/cache no matter what, and absent a specific directive in .conf some code would default to /home/spampd/. The only workable solution was to use the bayes_path /home/spampd/.spamassassin directive in the .conf file and symlink from /home/spampd/.spamassassin back to /var/cache/spampd/.

# ll -a /home/spampd
total 16K
drwxr-xr-x  2 spampd spampd 4.0K Nov  3 20:51 awl/
lrwxrwxrwx  1 root   root     17 Dec 11 07:27 .spamassassin -> /var/cache/spampd/
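The symlink in the listing above can be created along these lines. This is a minimal sketch, assuming the two paths used in this post; the function name link_bayes_dir is made up for illustration, and it is meant to be run as root while spampd is stopped.

```shell
#!/bin/sh
# Hypothetical helper: point the configured Bayes path at the directory
# spampd actually writes to. Paths default to the ones used in this post.
link_bayes_dir() {
    target="${1:-/var/cache/spampd}"
    link="${2:-/home/spampd/.spamassassin}"

    # Refuse to clobber a real directory that may already hold Bayes data.
    if [ -e "$link" ] && [ ! -L "$link" ]; then
        echo "refusing to replace non-symlink: $link" >&2
        return 1
    fi

    # -s symbolic, -f replace an existing link, -n do not follow it
    ln -sfn "$target" "$link"
}

# Usage (as root): link_bayes_dir
```

The guard matters if an earlier attempt left a real .spamassassin directory behind: move its contents aside by hand before linking.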

That worked great for a few days, then stopped. For some reason, the ownership of at least one of the bayes data files (nominally spampd:spampd) would change to root:root, killing the filtering.
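A quick way to spot this kind of ownership drift is to list anything in the data directory that does not belong to the expected user. The following is a minimal sketch, assuming the directory and user from this post; the function name list_wrong_owner is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list anything under the Bayes directory that is not
# owned by the expected user. Prints nothing when ownership is correct.
list_wrong_owner() {
    dir="${1:-/var/cache/spampd}"
    user="${2:-spampd}"
    find "$dir" ! -user "$user" -print
}

# Usage: list_wrong_owner
```

Any path it prints is a candidate for the file that stopped the filtering.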

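The culprit is probably the training session itself, which runs overnight from a cron job as root: according to its man page, sa-learn’s --username option does not actually switch to that user, so any file it recreates ends up owned by root. The solution was to force ownership back to reality after each training session. So the crontab code ends up looking like this: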
/usr/local/bin/sa-learn --username=spampd --mbox --spam /home/ron/mail/spamtrain && \
/usr/local/bin/sa-learn --username=spampd --mbox --ham /home/ron/mail/arc && \
/etc/init.d/spampd restart ; \
chown -v spampd:spampd /var/cache/spampd/* ; \
ls -al /var/cache/spampd/
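To confirm that the training is actually landing in the database, the message counts can be read from the output of sa-learn --dump magic. SpamAssassin only lets Bayes take part in scoring once it has learned at least 200 spam and 200 ham messages (the defaults for bayes_min_spam_num and bayes_min_ham_num). The following is a minimal sketch; the function name bayes_ready is hypothetical, and it assumes the usual dump layout in which the count is the third column of the nspam and nham lines.

```shell
#!/bin/sh
# Hypothetical helper: read `sa-learn --dump magic` output on stdin, print
# the learned message counts, and return success only if both reach the
# minimum (default 200, SpamAssassin's default threshold).
bayes_ready() {
    awk -v min="${1:-200}" '
        / nspam$/ { nspam = $3 }
        / nham$/  { nham  = $3 }
        END {
            printf "nspam=%d nham=%d\n", nspam, nham
            exit !(nspam >= min && nham >= min)
        }'
}

# Usage: sa-learn --username=spampd --dump magic | bayes_ready
```

If the counts stay at zero after a training run, sa-learn and spampd are most likely looking at different copies of the files.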

You’ve probably figured out that I move all spam messages that get through into the spamtrain mailbox. I’m also a heap-and-search kind of guy, so nearly all my non-spam email ends up in the arc mailbox.

I hope someone can benefit from this information, and not have to spend the hours I did getting these fabulous programs to play nice together.