reCAPTCHA as a Service

All I wanted was an implementation of reCAPTCHA for my TWiki. What I got is reCAPTCHA almost anywhere.

If you host any kind of web site that allows your users to post content — blogs, wikis, calendars, comment pages, guest books, content management systems — then you’ve probably encountered comment spam. Automated processes clog your content with links to try to improve their search engine ranking. Even on a low-volume site, keeping up with deleting the dreck can be taxing.

That’s why sites use CAPTCHAs to prevent automated abuse. One popular source for CAPTCHAs is reCAPTCHA. Read about it: it’s CAPTCHA with a conscience.

I naively thought that I could use reCAPTCHA by just putting a snippet of HTML widget code on my sign-in pages. If you read the reCAPTCHA instructions, though, you’ll find it’s considerably more complex than that. I’m no stranger to Perl and CGI coding, but hacking TWiki code isn’t trivial, and hacks often break with new releases. (To do it properly would require writing a TWiki plugin or add-on. I’m a doctor, Jim, not a TWiki developer!) I’d also like to be able to trivially add reCAPTCHA to some other platforms I use. What about people who can’t write their own CGIs? Or who are hosted on sites where they’re not permitted to run their own code? Or who want to add a CAPTCHA to a site but can’t afford the time and effort to integrate reCAPTCHA into existing code?

Why can’t reCAPTCHA be easy?

Here’s an experiment: reCAPTCHA as a web service. No coding or CGI scripts. No PHP or ASP or .NET. Just a small modification to the HTML on your sign-in/posting page and a change in the access restrictions for your sign-in/posting script file (on Apache, this just means creating a small text file named .htaccess). It’s a little more complex than just putting a widget in an iframe, but it’s still pretty darned simple.

modify your HTML

First, figure out where you want the CAPTCHA to appear in the work flow. Usually, it will be right after submitting the form for a comment or registration request. Find the HTML code for the form. Somewhere in there will be a form tag, which will look vaguely like:

<form method=post action="/cgi-bin/addcommentscript">

The form tag can have lots of variations, but the important piece is what’s in quotes after “action=”. Here’s what you do:

  1. replace the action parameter in the form tag ("/cgi-bin/addcommentscript" in this example) with "https://captcha.sacdoc.org:9080/cgi/captcha.pl"
  2. below the form tag, add <input type="hidden" name="ultimate_destination_url" value="/cgi-bin/addcommentscript">
  3. if the value parameter doesn’t start with "http://" or "https://", then look at the address bar of your browser when you’re viewing the comment or registration page. The URL should start with something like "https://yourdomain.com/". Insert that text at the front of the value parameter (the original action), taking care not to double up the slash. You should end up with something like this:
<form method=post action="https://captcha.sacdoc.org:9080/cgi/captcha.pl">
<input type="hidden" name="ultimate_destination_url" value="https://yourdomain.com/cgi-bin/addcommentscript">
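
If you have several forms to convert, the three steps above can be sketched in Python. This is only an illustration, not part of the service: the function name rewrite_form and the page URL "https://yourdomain.com/comments.html" are made up for the example, and only the captcha.sacdoc.org address and the ultimate_destination_url field name come from the steps above.

```python
from html import escape
from urllib.parse import urljoin

# Service endpoint as given in the steps above; everything else is illustrative.
CAPTCHA_URL = "https://captcha.sacdoc.org:9080/cgi/captcha.pl"


def rewrite_form(page_url: str, original_action: str) -> str:
    """Return the replacement form tag plus the hidden destination field.

    page_url is the address of the page that holds the form (hypothetical here),
    original_action is whatever was in the form's action parameter.
    """
    # Step 3: resolve a relative action against the page URL.
    # An action that is already absolute is returned unchanged by urljoin.
    destination = urljoin(page_url, original_action)
    # Steps 1 and 2: point the form at the CAPTCHA service and carry the
    # original destination along in a hidden field.
    return (
        f'<form method=post action="{CAPTCHA_URL}">\n'
        f'<input type="hidden" name="ultimate_destination_url" '
        f'value="{escape(destination, quote=True)}">'
    )


print(rewrite_form("https://yourdomain.com/comments.html", "/cgi-bin/addcommentscript"))
```

Using urljoin rather than plain string concatenation avoids the doubled slash you would get from gluing "https://yourdomain.com/" onto "/cgi-bin/addcommentscript".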

Give it a try! After you submit a comment or registration request, you should be presented with a reCAPTCHA page. Prove you’re sentient, and you should be sent to the next step in your posting/registration process.

setting access restrictions

If this is all you do, it might stop some ‘bots. It’s just security-by-obscurity, though. Anyone who knows the location of the actual posting/registration script can just bypass the CAPTCHA screen. In fact, for popular platforms, ‘bots already bypass the entry form and go straight for the script. In any case, your modified entry form has the address of the posting/registration script in plain text, so it would be trivial to bypass it. What to do?

Fortunately, all legitimate posting/registration requests will now be coming from my IP address (currently 69.62.162.196). If your hosting site uses an Apache server (most do), look in the directory that has your posting/registration script. Check if there is a file there named .htaccess. If there isn’t one, create one. Then add this to the end of the file:

<FilesMatch "^addcommentscript.*">
SetHandler cgi-script
order deny,allow
deny from all
allow from 69.62.162.196
Satisfy any
</FilesMatch>

The "addcommentscript" should be replaced by just the file name from the value parameter above. In the example, the value parameter was "https://yourdomain.com/cgi-bin/addcommentscript", so you’d use the addcommentscript part in the .htaccess file.
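
Pulling the file name out of the value parameter and dropping it into the block can be sketched the same way. Again, this is a rough illustration under assumptions: htaccess_block is a made-up helper name, the IP address is the one quoted above and may change, and escaping the file name for the regular expression is an extra precaution that the hand-written example doesn’t need.

```python
import posixpath
import re
from urllib.parse import urlparse

# IP address quoted in the text above; it may change, so it is a parameter.
SERVICE_IP = "69.62.162.196"


def htaccess_block(destination_url: str, allowed_ip: str = SERVICE_IP) -> str:
    """Build the FilesMatch block for the script named in destination_url."""
    # Keep only the last path component, e.g. "addcommentscript".
    script_name = posixpath.basename(urlparse(destination_url).path)
    if not script_name:
        raise ValueError("destination URL has no file name")
    # Escape regex metacharacters such as the dot in "post.cgi".
    pattern = re.escape(script_name)
    return "\n".join([
        f'<FilesMatch "^{pattern}.*">',
        "SetHandler cgi-script",
        "order deny,allow",
        "deny from all",
        f"allow from {allowed_ip}",
        "Satisfy any",
        "</FilesMatch>",
    ])


print(htaccess_block("https://yourdomain.com/cgi-bin/addcommentscript"))
```

Append the printed text to the end of your .htaccess file, just as if you had typed it by hand.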

I have no idea if Microsoft’s web server (IIS) supports .htaccess files or the equivalent, so you’re on your own if you’re hosted on an IIS server.

important caveats

privacy

The astute reader has already recognized that this scheme will pass all data from the posting/registration form through my servers. For some sites, that might include username/password pairs, hidden access tokens (but not cookies), or the contents of posts to private blogs. I could try to convince you that I am a person of such honesty and integrity that you should trust me. Instead, let me state that I just might sell all the data to the highest bidder. If that makes you uncomfortable, find a different solution.

reliability

Over the past ten years, my servers have achieved a 99.98% scheduled availability level for mission-critical services. But reCAPTCHA as a service isn’t mission critical. I might decide not to host it any more. The captcha.sacdoc.org IP address (which has to be hard-coded into .htaccess files) might change without notice at the whim of my ISP (they’ve done that once in a decade, in spite of my paying for static addresses). I might die or sell out or just arbitrarily stop supporting reCAPTCHA as a service. No guarantee is expressed or implied. Use at your own risk. Don’t blame me if you get hurt.

If you use this service in any serious way, I strongly suggest that you email me at r@risley.net letting me know how to contact you if things change. I might not bother to do so, but at least this way you’ll have a chance.

Ron – 27 Dec 2009