Robin Whittleton

The future of CAPTCHA

CAPTCHA (standing for Completely Automated Public Turing test to tell Computers and Humans Apart) must have seemed like a good idea when it was first invented in 2000. Spam was becoming a major problem on the web and a method was needed to fight back. At first glance CAPTCHA seems ideal: a distorted image that is instantly recognisable by humans yet incomprehensible to machines. Place some letters in the image, get the user to type them back, and bingo: you’ve stopped your spam problem.

Real life, though, is rarely so easy. The problem is that spam is profitable, and because of that it’s worthwhile to write programs that try to crack CAPTCHAs. The original CAPTCHA examples are now trivial for current algorithms to recognise, and the only option developers had was to increase the complexity of the distortion. Successive CAPTCHA systems have added more distortion, extraneous lines and shapes, fuzz on the letters, multiple colours and different sizes, all in an attempt to stay ahead of the spammers. This has led to the current situation where CAPTCHAs are so complex that it’s difficult if not impossible for a large proportion of humans to recognise any particular one, yet a sizeable proportion of CAPTCHA-breaking bots can solve that same one.

There is a further problem with CAPTCHAs: they are a complete block to many web users who have visual difficulties. Web standards demand alternative text for any image that contains information, but that would completely break the system here. By using these tests we (as an industry) ghettoise a whole section of web users. Various workarounds have been proposed and implemented – for example reCAPTCHA’s audio equivalent – but these tend to be extremely difficult to use as well.

So is it possible to make CAPTCHA better? In its current form I’d have to say no: we’ve now reached the state where computers are so good at letter recognition that any system that lets the majority of humans through is going to be susceptible to bots. More recently, researchers have concentrated their efforts on upping the difficulty of recognition by switching from words to photos. Microsoft’s Asirra project was one of the first to attempt this: it uses a database of cat and dog photos and asks the user to select the cats from a randomly chosen set of twelve.

At first glance this seems like a good solution, but it too has major problems. The first is sample set size: although Asirra has a set of around three million photos, this isn’t big enough to provide a completely new image every time one is presented. Given a bit of time a spammer (possibly co-operating with other spammers) could easily build a database mapping each photo to its animal (analogous to rainbow tables in password cracking). This can be worked around by programmatically generating the images – see this recent attempt – but both approaches fall prey to the ultimate problem for any CAPTCHA system: using humans to solve them instead. The basic idea with this approach is either to use a pay-for-services system like Amazon’s Mechanical Turk or simply to offer something of small value, like a mobile phone ringtone, in return for a solution. The spammer simply passes the CAPTCHAs they want cracked through to the human workforce and receives the answers in return.
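To make the rainbow-table analogy concrete, here’s a rough sketch in Python of the attacker’s side. The ask_human stub stands in for a hypothetical paid human-solving service, and it assumes the server serves identical bytes for repeated images; in practice a perceptual hash would be more robust:

    import hashlib

    # Hypothetical attacker-side cache: maps an image fingerprint to the
    # answer a human solver previously gave for it ("cat" or "dog").
    solved = {}

    def image_key(image_bytes):
        """Fingerprint an image so that repeats can be recognised."""
        return hashlib.sha256(image_bytes).hexdigest()

    def ask_human(image_bytes):
        """Stand-in for a paid human-solving service (hypothetical)."""
        raise NotImplementedError

    def solve(image_bytes):
        key = image_key(image_bytes)
        if key not in solved:
            # First sighting: pay a human once, answer all repeats for free.
            solved[key] = ask_human(image_bytes)
        return solved[key]

With a finite image set, the cost of human solving is paid once per image rather than once per challenge, which is why a three-million-photo pool isn’t enough.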

If we can’t rely on CAPTCHAs, how can we stop spammers from abusing the services we want to provide? There are various alternatives, none of which is 100% effective at blocking spam, but which used in combination can remove the majority while keeping the service accessible. First off, it’s a good idea to blacklist previous offenders and to lay traps in the form of hidden form controls that, when filled in, automatically invalidate the submission: form-filling robots might not notice that they’re not meant to touch them. We can also use common-sense questions to filter humans from bots (the classic example is “What colour is an orange?”), but more complex examples can be difficult for users with cognitive difficulties, and you have to make sure your collection of questions is large enough if you’re worried that spammers might focus on your site rather than just trying it as part of a random sweep.
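A minimal sketch of these pre-filters, assuming a simple dict of submitted fields (the blacklist entries, field names and question pool are purely illustrative):

    # Illustrative pre-filters: not a real implementation.
    BLACKLIST = {"192.0.2.1"}  # IPs of previous offenders
    QUESTIONS = {"What colour is an orange?": "orange"}  # rotate a large pool

    def reject_submission(remote_ip, form):
        """Return True if a submission should be silently discarded."""
        if remote_ip in BLACKLIST:
            return True
        # Honeypot: the 'website' field is hidden with CSS (and labelled
        # "leave this blank" for screen-reader users), so any value here
        # almost certainly came from a form-filling robot.
        if form.get("website", "").strip():
            return True
        # Common-sense question: compare the answer against the question
        # that was asked when the form was rendered.
        question = form.get("question", "")
        if form.get("answer", "").strip().lower() != QUESTIONS.get(question):
            return True
        return False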

Assuming you’re trying to protect a content submission form rather than a user account generator, the best solution is to base your reckoning of whether a user is a spammer on the content they submit. Anti-spam email services work in much the same way: using Bayesian-style filtering, we assign a points value to each of a set of rules that define whether something is spam or not. Every time a piece of content is submitted we check it against each rule, and if an arbitrary points total is reached we ignore it. Examples of these rules could be “Does it contain the word Viagra?”, “Is there a web link?” and “Is this the first time the user has commented?”. Individually these are unlikely to be a problem; together they could be a sign that the user is a spammer.
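A hedged sketch of such a scoring scheme; the rules, weights and threshold are invented for illustration, and a real system would tune them against real spam:

    import re

    # Each rule is (points, test); tests receive the comment text and a
    # flag saying whether this is the user's first comment.
    RULES = [
        (5, lambda text, first: re.search(r"\bviagra\b", text, re.I) is not None),
        (2, lambda text, first: "http://" in text or "https://" in text),
        (1, lambda text, first: first),
    ]

    SPAM_THRESHOLD = 6  # the "arbitrary points total" mentioned above

    def is_spam(text, first_time_commenter):
        score = sum(points for points, test in RULES
                    if test(text, first_time_commenter))
        return score >= SPAM_THRESHOLD

A first-time comment with a link scores 3 and is let through; add “viagra” and it scores 8 and is silently ignored.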

Of course, if you know that your target audience is going to be limited to a particular set of users then you can do something like RBI’s signup form!

Tags: website


Comments: 3

Paul
commented on

On my own site I've managed to block spam pretty well with the following...

1. Check against a black list of words
2. Don't allow the form to be accepted if it's submitted too quickly (sketched after this list)
3. Check there isn't forum markup included
4. Check the name field doesn't contain a link
5. Include a hidden field and reject it if it's filled in
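A minimal, stateless sketch of check 2, signing a render-time timestamp into a hidden field (the secret, field format and threshold are illustrative):

    import hashlib, hmac, time

    SECRET = b"change-me"  # illustrative server-side secret

    def stamp():
        """Render-time: a signed timestamp to embed in a hidden field."""
        ts = str(int(time.time()))
        sig = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
        return f"{ts}:{sig}"

    def too_quick(stamp_value, minimum_seconds=5):
        """Submit-time: reject anything filled in faster than a human types."""
        try:
            ts, sig = stamp_value.split(":")
            expected = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
            if not hmac.compare_digest(sig, expected):
                return True  # tampered timestamp: treat as a bot
            return time.time() - int(ts) < minimum_seconds
        except ValueError:
            return True  # malformed field: treat as a bot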

piers
commented on

Ha ha, love RBI's signup form! Agree that it's a smorgasbord of techniques that will keep your web forms spam-free...limiting the number of characters in a field also seems to help but as you suggest it will always be an on-going battle...the positive side being that it may bring us that bit closer to AI.

piers
commented on

After many refreshes I still haven't found a question on the RBI site that I am capable of answering...good way of ensuring that you only get responses from users with the right background.

