spamassassin + spampd + bayesian filtering

I have been running my own mail server for an eternity. For the last decade or two, it’s been Postfix with Spamassassin (invoked via spampd) for spam control. I then ran the most-excellent SpamSieve on my laptop to catch the spam that Spamassassin missed.

I didn’t run Spamassassin’s bayesian filter because SpamSieve did such an excellent job, and I could never get Spamassassin’s filter to work properly.

Lately, several things have happened. I realized that other users on my mail system weren’t making use of their own local spam filters, and thus were getting inundated with spam. I’ve been traveling a bit more, and have had to access mail from mobile devices when my laptop wasn’t able to run and do the filtering. And the total volume of spam (including dangerous phishing and malware-carrying spam) was just out of control.

I finally made a concerted effort to get Spamassassin, spampd, and Postfix playing nice together with bayesian filtering on.

I’m running Ubuntu, but any Debian derivative should work the same. The problems turned out to involve where the files were placed, their ownership, and their permissions: the usual suspects.

I run spampd as its own user (cleverly, spampd). The documentation implies that it should keep the bayesian data in the /home/spampd/ directory, but the files actually end up in /var/cache/spampd/. I’m nervous about just using /var/cache/, though, so I tried editing /etc/spampd.conf to add:

bayes_path /home/spampd/.spamassassin
auto_whitelist_path /home/spampd/awl
maxsize 1024000

That seems to work at first, but somehow, somewhere, some versions of the files end up back in /var/cache/spampd/ and filtering then stops working. After hundreds of iterations, I realized that some code follows the .conf file, some code uses /var/cache no matter what, and, absent a specific directive in .conf, some code defaults to /home/spampd/. The only workable solution was to keep the bayes_path /home/spampd/.spamassassin directive in the .conf file and make /home/spampd/.spamassassin a symlink to /var/cache/spampd/.

# ll -a /home/spampd
total 16K
drwxr-xr-x  2 spampd spampd 4.0K Nov  3 20:51 awl/
lrwxrwxrwx  1 root   root     17 Dec 11 07:27 .spamassassin -> /var/cache/spampd/
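
For anyone reproducing the layout in the listing above, here is a minimal sketch of creating that symlink without clobbering anything. The function name link_bayes_dir is made up for illustration, and the sketch assumes GNU ln (for the -n flag), as found on Ubuntu. It refuses to replace a real directory, since that directory might already hold bayes data.

```shell
# Hypothetical helper, not part of spampd or Spamassassin.
# Usage: link_bayes_dir TARGET LINK
link_bayes_dir() {
    target=$1
    link=$2
    # A real file or directory at LINK may hold bayes data; leave it alone.
    if [ -e "$link" ] && [ ! -L "$link" ]; then
        echo "refusing: $link exists and is not a symlink" >&2
        return 1
    fi
    # -s: make a symlink, -f: replace an existing link,
    # -n: treat an existing link to a directory as a file, not a place to descend into.
    ln -sfn "$target" "$link"
}
```

Run as root, the call matching the listing would be: link_bayes_dir /var/cache/spampd /home/spampd/.spamassassin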

That worked great for a few days, then stopped. For some reason, ownership of at least one of the bayes data files (nominally spampd:spampd) would change to root:root, killing the filtering.

The solution was to force ownership back to reality after each training session (which is done overnight via a cron job). So the crontab code ends up looking like this:

/usr/local/bin/sa-learn --username=spampd --mbox --spam /home/ron/mail/spamtrain && \
/usr/local/bin/sa-learn --username=spampd --mbox --ham /home/ron/mail/arc && \
/etc/init.d/spampd restart ; \
chown -v spampd:spampd /var/cache/spampd/* ; \
ls -al /var/cache/spampd/
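
To see when the ownership actually flips, rather than only resetting it blindly, a small check can list the offending files. This is a sketch under assumptions: bayes_wrong_owner is a hypothetical name, and the guess that the training job itself recreates a file as root (because it runs from root’s crontab) is plausible but not confirmed anywhere in this post.

```shell
# Hypothetical helper: print every regular file under DIR
# that is not owned by USER.
# Usage: bayes_wrong_owner DIR USER
bayes_wrong_owner() {
    # find accepts either a user name or a numeric uid after -user.
    find "$1" -type f ! -user "$2" -print
}

# Possible use right after the training run:
#   bad=$(bayes_wrong_owner /var/cache/spampd spampd)
#   [ -n "$bad" ] && printf 'not owned by spampd:\n%s\n' "$bad"
```

If the list is non-empty only on mornings after training, that points at the cron job as the culprit; if it fills up at other times, something else is rewriting the files.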

You’ve probably figured out that I move all spam messages that get through into the spamtrain mailbox. I’m also a heap-and-search kind of guy, so nearly all my non-spam email ends up in the arc mailbox.

I hope someone can benefit from this information, and not have to spend the hours I did getting these fabulous programs to play nice together.

update 19 January 2019

It’s working fantastically well. Still, every few days, the bayes files lose their ownership and then get set back by the nightly training cronjob.

Things were going so well that I backed the spam threshold down to 4. Spam messages get bounced, so I added a link to explanatory text in the bounce message:

WE THINK YOU ARE SENDING SPAM.
See https://risley.net/spam for more information.

Bouncing the spam (rather than just quarantining it) is important, as it helps cut down the total quantity of spam on the net (“why” is the subject of another post). Of course, I worry about bouncing legitimate mail with such a low threshold.

This nightly cron job shows me the from:, to:, and spam score of all messages that were bounced with a spam score of 4 or more but below 8:

grep "$(date +"%b %_d" -d "yesterday")" /var/log/mail.log | grep 'score=[4]\.' | sed -e 's/^\(...............\).*\( score=...\).*\( from=[^ ]*\).*\( to=[^ ]*\).*/\1\2\4\3/' ; echo ; grep "$(date +"%b %_d" -d "yesterday")" /var/log/mail.log | grep 'score=[567]\.' | sed -e 's/^\(...............\).*\( score=...\).*\( from=[^ ]*\).*\( to=[^ ]*\).*/\1\2\4\3/'
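
The one-liner runs the same grep and sed pipeline twice, once per score band. Here is a sketch of the same report with that pipeline factored into a function. The name spam_report is hypothetical, and the log layout it relies on (a 15-character syslog timestamp, then score=, from=, and to= fields in that order) is inferred from the sed expression above, not taken from a real log.

```shell
# Hypothetical helper. Usage: spam_report LOGFILE DAY SCORE_CLASS
#   DAY is a syslog-style date such as "Jan 18".
#   SCORE_CLASS is a bracket expression for the score's integer digit,
#   e.g. '[4]' or '[567]'.
spam_report() {
    grep -- "^$2" "$1" |
        grep -- "score=$3\." |
        sed -e 's/^\(...............\).*\( score=...\).*\( from=[^ ]*\).*\( to=[^ ]*\).*/\1\2\4\3/'
}

# Nightly use (GNU date), mirroring the two passes of the one-liner:
#   day=$(date +"%b %_d" -d yesterday)
#   spam_report /var/log/mail.log "$day" '[4]'; echo
#   spam_report /var/log/mail.log "$day" '[567]'
```

Unlike the original, this version anchors the date at the start of the line, so a date string that happens to appear inside a message subject does not pull in lines from other days.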

-*”*-.,,.-*”*-.,,.-*”*-.,,.-*”*-.,,.-*”*-.,,.-*”*-.,,.-*”*-

An important part of Spamassassin is checking with URIBL to see if the email references any blacklisted URLs. If you use common DNS servers like Google (8.8.8.8), Cloudflare (1.1.1.1), Quad9 (9.9.9.9), or OpenDNS, the URIBL lookup might fail because it sees too many requests from those servers and brands them as high-volume commercial users who need to purchase a license.

If your ISP gives you some DNS server addresses, those are more likely to work. You could change DNS for your whole network, but that’s probably a bad idea in that some ISPs don’t do a great job with DNS or even censor the net by blocking some DNS lookups. Instead, you can tell just Spamassassin to use your local DNS by putting a directive in /etc/spampd.conf:

dns_server 9.9.9.9

(Of course, replace the 9.9.9.9 with your local ISP’s DNS server; Quad9 itself is one of the public resolvers that may be refused.)
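
To check whether a given resolver is being refused, you can query a URIBL test record through it and look at the answer. The sketch below assumes two things that are not in this post: that 2.0.0.127.multi.uribl.com is URIBL’s test point, and that an answer of 127.0.0.1 means the query was blocked (which is what Spamassassin’s URIBL_BLOCKED rule keys on). The function name uribl_status and the address 192.0.2.53 are placeholders.

```shell
# Hypothetical helper: classify the first A record returned by a
# URIBL test lookup. Usage: uribl_status ANSWER
uribl_status() {
    case "$1" in
        "")         echo "no-answer" ;;   # resolver returned nothing
        127.0.0.1)  echo "blocked" ;;     # assumed "query refused" marker
        127.0.0.*)  echo "ok" ;;          # any other listing code
        *)          echo "unexpected" ;;
    esac
}

# Possible use, with 192.0.2.53 standing in for the resolver under test:
#   answer=$(dig +short @192.0.2.53 2.0.0.127.multi.uribl.com A | head -n 1)
#   uribl_status "$answer"
```

A result of "blocked" for a resolver suggests it is not a good candidate for the dns_server directive.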