Introduction

My friend and I have been working on a project that requires me to extract information from websites. The usual issues have come up: data in the real world comes in multiple forms, whereas in the ideal world of databases it can have only one form. Have you realized how platonic information technology is?

Not every site likes the idea of someone grabbing all their information. Enter the captcha. Webmasters require you to type in what you read in an image. The image is usually so hard to read that even a human has to refresh it a couple of times before getting it right. Fortunately, the one site I needed to scrape didn't have a very hard captcha.

In fact, the captcha seemed so easy that I decided to take a shot at it, even though I had never tried anything like it before and don't really have that strong a background in image processing. See it for yourself:

Piece of cake, huh? Well, as it turned out, reaching a satisfactory success rate wasn't that difficult, but it was a little tougher than I envisioned.

After much struggle I finally reached 91% (+/- 1%) accuracy for the whole captcha. To find that value I ran a 10-fold cross validation test. Thanks to scikit-learn, doing so was a breeze.
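Here is roughly what that measurement looks like. This is a minimal sketch, assuming the digit feature vectors and their labels are already sitting in X and y (how I build those is described below); the import path is the one recent scikit-learn versions use.

```python
from sklearn import svm
from sklearn.model_selection import cross_val_score

clf = svm.SVC()  # the exact configuration I settled on is described later
scores = cross_val_score(clf, X, y, cv=10)  # one accuracy score per fold
print("%0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
```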

Separating the digits

Looking at the captcha I had to break, you can see that the numbers are clearly separated. In some captchas the letters and numbers are all mixed together, one on top of the other. Thank God that wasn't the case here.

So I figured my first task was to separate the digits. I didn't use machine learning for this one. Just plain old human trial and error.

For handling the image I used a Python library called PIL. I opened the captcha image and converted it to grayscale, which means each pixel has a value from 0 to 255: 0 is black, 255 is white, and anything in between is a shade of gray.
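Loading and converting is a one-liner in PIL. A minimal sketch (the file name is just a stand-in):

```python
from PIL import Image

img = Image.open("captcha.png").convert("L")  # "L" = 8-bit grayscale
pixels = img.load()  # pixels[x, y] is an int from 0 (black) to 255 (white)
```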

I wrote a few filters to make the image a little cleaner. The image has two noise sources: thin lines behind the numbers and random dark blocks. To fix the random blocks I simply set every pixel with a value above 127 to white (255) and every pixel at or below it to black (0). The thin lines were a little trickier.
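Continuing from the snippet above, the block filter is just a threshold:

```python
# push every pixel to pure black or pure white
img = img.point(lambda v: 255 if v > 127 else 0)
```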

To remove the noisy lines I wrote a function that iterates over all pixels of the image. If a pixel is a neighbor to a black pixel, I color it black as well. That thins out every line in the image. The background lines, being thinner, tend to disappear, while the strokes of the digits remain.
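In image-processing terms that's an erosion of the white regions. Here is a sketch of the filter as I understand it; the exact neighborhood the original code checks may differ:

```python
def remove_thin_lines(img):
    """Turn every white pixel that touches a black pixel black."""
    src = img.load()
    out = img.copy()
    dst = out.load()
    w, h = img.size
    for x in range(w):
        for y in range(h):
            if src[x, y] != 255:
                continue
            # look at the 8 surrounding pixels, clipped at the borders
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nx, ny = x + dx, y + dy
                    if 0 <= nx < w and 0 <= ny < h and src[nx, ny] == 0:
                        dst[x, y] = 0
    return out
```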

After those filters, this is how the image looks:

I iterated over every column of the image (what are columns? If the image is 100 pixels wide and 150 pixels tall, it has 100 columns of 150 pixels each). If 99% of a column's pixels are black, I considered it an empty (black) column; otherwise it's a filled (white) column.

After that, I found the ranges of continuous white columns. For example, if I had 3 empty columns, followed by 6 white columns, followed by 3 empty columns, I would have the black intervals [0,3) and [9,12) and the white interval [3,9). In that case, I'm interested only in the white interval, [3,9).
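Putting the last two steps together, a sketch of the detection (the 99% threshold is the one mentioned above):

```python
def white_intervals(img, ratio=0.99):
    """Return half-open [start, end) ranges of consecutive filled columns."""
    pixels = img.load()
    w, h = img.size
    intervals, start = [], None
    for x in range(w):
        blacks = sum(1 for y in range(h) if pixels[x, y] == 0)
        filled = blacks < ratio * h  # not "almost all black" => filled
        if filled and start is None:
            start = x
        elif not filled and start is not None:
            intervals.append((start, x))
            start = None
    if start is not None:
        intervals.append((start, w))
    return intervals
```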

White intervals are possible parts of digits; black intervals are empty space. My first try was to consider the 4 longest white intervals (we have 4 digits) as numbers. That approach worked reasonably, but it wasn't great. In the final version I'm still taking the top 4 white intervals, but I do a few things to the intervals before that.

I found out that, sometimes, my system would take more than one digit as a single digit. So now, if a range is wider than a certain value I divide it in half; if it's even wider, I divide it in three, and so on.
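Something along these lines; the maximum plausible digit width is a value I would tune by hand, so the number below is only a placeholder:

```python
def split_wide(interval, max_width=20):  # 20 is a hypothetical width
    """Split a range into equal parts if it spans several digits."""
    start, end = interval
    parts = max(1, -(-(end - start) // max_width))  # ceiling division
    step = (end - start) / parts
    return [(start + round(i * step), start + round((i + 1) * step))
            for i in range(parts)]
```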

Another issue was that many 8s and 0s were being cut in half. Especially 0s. That happens because a 0 has two thin lines at the top and bottom and a lot of empty space in the middle, so sometimes an empty column shows up in the middle of the digit. What I did here was to compare each white range to its adjacent white range. If they are similar to each other (I used the euclidean distance between the two halves as the similarity measure), I merge them: it's probably the two halves of a 0 or an 8, which are symmetric.
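My sketch of that merge test is below; the column profiles, the mirroring, and the tolerance are my own reconstruction, not necessarily what the original code does:

```python
import math

def column_profile(img, interval):
    """White-pixel count per column inside a range."""
    pixels = img.load()
    _, h = img.size
    return [sum(1 for y in range(h) if pixels[x, y] == 255)
            for x in range(*interval)]

def looks_like_two_halves(img, left, right, tolerance=15.0):
    """Adjacent ranges whose profiles mirror each other are merge candidates."""
    a, b = column_profile(img, left), column_profile(img, right)
    n = min(len(a), len(b))
    # compare the left half against the mirrored right half, since the
    # two halves of a 0 or an 8 are roughly mirror images
    dist = math.sqrt(sum((a[i] - b[n - 1 - i]) ** 2 for i in range(n)))
    return dist < tolerance
```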

The digit 6 also has a problem similar to the one described above; I mitigate it by concatenating two small, adjacent white ranges. To avoid false positives (think about how narrow the digit 1 is), I only consider very small ranges here. So the trick above is still helpful (both halves of a 0 can be wider than a 1).

Finally, I expand each range by 3 pixels. I do that because sometimes my algorithm cuts off parts of a digit (for instance, the top of a 1 is sometimes a very thin line).
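The expansion and the final crop might look like this; whether the margin goes on both sides is a detail I'm guessing at:

```python
def crop_digit(img, interval, margin=3):
    """Cut a single digit out of the captcha, with a safety margin."""
    start, end = interval
    w, h = img.size
    return img.crop((max(0, start - margin), 0, min(w, end + margin), h))
```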

So that was the pre-processing. I had a program that could divide an image into its digits, which already made me happy, as I had never done anything like that before. Here are the extracted digits:

Training

Now comes the manual part of the job. I downloaded a large number of captchas from the site I was analyzing and named each image file after what's written in the captcha. I labeled 908 captchas.
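Reading the labels back is then just a matter of parsing file names. A sketch, assuming the images sit in a "captchas" directory and are named like "1234.png":

```python
import os

# maps each file name to the four digits written in that captcha
labels = {name: os.path.splitext(name)[0]
          for name in os.listdir("captchas")}
```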

I divide each image into digit images and then extract the features described below from each digit. The features extracted from each digit go to a classification algorithm (I ended up using SVM). The algorithm tells me which number it thinks each image is.

Features

All machine learning algorithms I know take sets of features identified by labels and create a model that's capable of predicting the label of a set of features it hasn't seen before. If that seems abstract, don't worry, I just wasn't able to write a decent introduction.

For example, the climate may have two features: atmospheric pressure and the amount of water in the air. The labels may be: it's raining or it's not raining. If today it's raining, you take today's (pressure, moisture) vector and label it raining. If tomorrow it's not raining, you take tomorrow's (pressure, moisture) vector and label it not raining. And so on, until you have a relevant data set. Then a machine learning algorithm looks at your dataset and creates a model. That model will be able to predict whether other days are rainy or not based on the day's (pressure, moisture) vector.
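As a toy scikit-learn session (the numbers are made up, purely for illustration):

```python
from sklearn import svm

# (pressure in hPa, moisture as a fraction) for four observed days
days = [(1012, 0.9), (1025, 0.3), (1008, 0.8), (1030, 0.2)]
labels = ["raining", "not raining", "raining", "not raining"]

model = svm.SVC()
model.fit(days, labels)
print(model.predict([(1010, 0.85)]))  # hopefully "raining"
```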

I had to come up with such features for my system. The most obvious one was the value of each pixel of the image. Other features I tried were: vertical and horizontal silhouette, number of pixels, number of white pixels, x-axis histogram, y-axis histogram, horizontal symmetry, and vertical symmetry. As it turns out, the value of each pixel was the best feature; all the others didn't help much or made things worse (except with the RandomForest classifier, where it was best to use all of them).

I found that SVM works best if you make every digit the same size, so I scaled all digits to 16x16 using cubic interpolation. That idea alone bumped my results from 80% to 90%.
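A sketch of the scaling plus the pixel features, in the dict format that DictVectorizer (mentioned below) consumes; the key names are my own invention:

```python
from PIL import Image

def pixel_features(digit_img):
    """Scale a digit to 16x16 and use every pixel value as a feature."""
    scaled = digit_img.resize((16, 16), Image.BICUBIC)
    pixels = scaled.load()
    return {"%d,%d" % (x, y): pixels[x, y]
            for x in range(16) for y in range(16)}
```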

Algorithms

I knew about scikit-learn. It's a nice library of machine learning algorithms, and since I had done everything else in Python, it was the obvious choice. I was a bit turned off by it not having a neural network. I know neural networks aren't that hot anymore, but I would have liked to try one.

I transformed my features into a format the classifiers would understand (using scikit-learn's DictVectorizer) and tried all the algorithms there were. I started with Naive Bayes and Nearest Neighbors, which I already knew and understood. They yielded good results, but the algorithm that performed best was SVM. I kind of expected that, because I had already heard so much about SVM being this very good thing. It was indeed good. Unfortunately, I don't understand it too well (especially the kernel trick part); maybe I'll learn more about it for another post.
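The comparison loop, sketched with a held-out test set; `samples` (a list of feature dicts) and `labels` are assumed to come from the labeled captchas above:

```python
from sklearn import svm
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

vec = DictVectorizer()
X = vec.fit_transform(samples).toarray()  # dense, since GaussianNB wants it
X_train, X_test, y_train, y_test = train_test_split(X, labels)

for clf in (GaussianNB(), KNeighborsClassifier(), svm.SVC()):
    clf.fit(X_train, y_train)
    print(clf.__class__.__name__, clf.score(X_test, y_test))
```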

In the end, I used SVM on the pixel values of the scaled image with a second degree polynomial kernel. It yielded a precision of 91% (+/- 1%). That's reasonable. The performance was great too: each request to classify an unseen captcha took around 10ms on my machine, and my machine is not really that good, a mere Core 2 Duo laptop with 2GB of RAM.
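The winning configuration is a couple of lines of scikit-learn (X and labels as in the previous snippet; X_fresh stands in for the features of a newly downloaded captcha):

```python
from sklearn import svm

clf = svm.SVC(kernel='poly', degree=2)
clf.fit(X, labels)
guesses = clf.predict(X_fresh)  # the four digits of an unseen captcha
```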

Making it better

Later on I may study a few possibilities to make the digit separation part better. Maybe I can come up with better features as well, although I don't have my hopes up for that one.

Another thing I have been thinking about is using a separate set of features and a separate classifier for each digit. Each classifier would have two labels: 0 or 1, where 0 means the digit is not of that class and 1 means it is. I think that could really help. I'm a little out of time, though. Maybe someday.
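Half of that idea (the per-class binary classifiers, though not the per-class features) ships with scikit-learn as a wrapper, so a first experiment could be as short as:

```python
from sklearn import svm
from sklearn.multiclass import OneVsRestClassifier

# trains one binary classifier per digit class under the hood
clf = OneVsRestClassifier(svm.SVC(kernel='poly', degree=2))
clf.fit(X, labels)
```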

Downloading the code

If you are so inclined, you can take a look at my code on GitHub. It doesn't have many comments, but it's a very small program. If you know Python and read scikit-learn's documentation you're set. I'm releasing it under the X11 license (with my name in the copyright notice, obviously).

Datasets included :)