pages tagged machine-learningThe illusion of self consciencehttp://rafael.kontesti.me/blog/tags/machine-learning/The illusion of self conscienceikiwiki2012-09-18T22:21:57ZBreaking a captchahttp://rafael.kontesti.me/blog/posts/Breaking_a_captcha/2012-09-18T22:21:57Z2012-09-09T22:38:03Z
<h1>Introduction</h1>
<p>My friend and I have been doing a project which requires me to extract
information from websites. The usual stuff have come up: data in the
real world comes in multiple forms, whereas in the ideal world of
databases it can have only one form. Have you realised how platonic
information technology is?</p>
<p>Not every site likes the idea of someone grabbing all their information.
Enter the . Webmasters require you to type down what you
read in an image. The image is usually so hard to read that even a human
have to refresh it a couple times before getting it right. Fortunately,
the one site I needed to scrape didn't have a very hard captcha.</p>
<p>In fact, the captcha seemed so easy that I decided to take a shot at it.
Even though I never tried to do it before and I don't really have that
strong background in image processing. See it for yourself:</p>
<p><a href="http://rafael.kontesti.me/blog/tags/machine-learning/../../posts/captcha.jpg"><img src="http://rafael.kontesti.me/blog/tags/machine-learning/../../posts/captcha.jpg" width="175" height="55" class="img" /></a></p>
<p>Piece of cake, huh? Well, as it turned out it wasn't that difficult to
have a satisfactory success rate, but it was a little bit tougher than I
envisioned.</p>
<p>After much struggle I have finally reached the 91% (+/- 1%) accuracy for the
whole captcha. In order to find that value I did a 10-fold cross validation
test. Thanks to scikit-learn doing so was a brease.</p>
<h1>Separating the digits</h1>
<p>Looking at the captcha I had to break, you can see that the numbers are
clearly separated. Sometimes you see some captchas where the letters and
numbers are all mixed together, one on top of the other. Thank God that
wasn't the case.</p>
<p>So I figured my first task was to separate the digits. I didn't use
machine learning for this one. Just plain old human trial and error.</p>
<p>For handling the image I used a Python library called PIL. I opened the
captcha's image converting it to black and white. That means each pixel
has a value from 0 to 255. 0 being black and 255 white. Anything in
between is a shade of gray.</p>
<p>I wrote a few filters to make the image a little cleaner. We can see the
image has two noise sources: thin lines behind the numbers and random
dark blocks. In order to fix random blocks I simply set every pixel
with value above 127 to white (255) and any pixel below that to black
(0). The thin lines were a little bit trickier.</p>
<p>In order to remove the noisy lines I wrote a function that iterates over
all pixels of the image. If it's neighbor to a black pixel, then I color
it black as well. That reduces all lines in the image. The background
lines, being thiner, tend to disapear, while the lines of the digits
remain.</p>
<p>After those filters this is how the image looks like now:</p>
<p><a href="http://rafael.kontesti.me/blog/tags/machine-learning/../../posts/filtered_captcha.jpg"><img src="http://rafael.kontesti.me/blog/tags/machine-learning/../../posts/filtered_captcha.jpg" width="175" height="55" class="img" /></a></p>
<p>I iterated over every column of the image (what are columns? if the
image has a width of 100 and height of 150, then it has 100 columns of
150 pixels each). If 99% of the column pixels are black, then I
considered it an empty (black) column, otherwise it's a filled (white)
column.</p>
<p>After that, I found the ranges of continous white columns. For example,
if I had 3 empty columns, followed by 6 white columns, followed by 3
empty columns, then I would have the following black intervals: [0,3),
[6,9) and the following white interval: [3,6). In that case, I'll be
interested only in the white interval, [3,6).</p>
<p>White intervals are possible part of digits, black intervals are empty
space. My first try was to consider the 4 longest white intervals (we
have 4 digits) as numbers. That approach worked reasonably, but it
wasn't great. In the final version I'm still taking the top 4 white
intervals, but I do somethings to the intervals before that.</p>
<p>I found out that, sometimes, my system would take more than one digit as
a single digit. So now, if the range is wider than a certain value I
divide it in half, if it's even larger, then I divide it in 3 and so on.</p>
<p>Another issue was that many 8s and 0s were being cut in half. Specially
0s. That happens because 0s have two thin lines on top and bottom, and a
lot of empty space in the middle. So, sometimes, an empty column would
show up in the middle of the digit. What I did here was to compare one
white range to its adjacent white range. If they are similar to each
other (I used euclidean distance on each half for similarity), then I'll
merge them (it's probably the two halfs of a 0 or 8, they are symmetric).</p>
<p>The digit 6 also has a similar problem described above, I mitigate it by
considering two small and adjacent white ranges to be concatenated. In
order to avoid false positives (think about how small the number 1 is),
I had to consider very small ranges. So the above trick is still helpful
(both halves of 0 can be bigger than number 1 width).</p>
<p>Finally, I expand each range by 3 bytes. I do that because sometimes my
algorithm cuts off parts of the number (for instance, the top of the
number 1 is sometimes a very thin line.)</p>
<p>So that was the pre-processing I've done. I had a program that could
divide an image into its digits. It made me happy already, as I've never
done anything like that before. Here are the extracted digits:</p>
<p><a href="http://rafael.kontesti.me/blog/tags/machine-learning/../../posts/captcha-0.jpg"><img src="http://rafael.kontesti.me/blog/tags/machine-learning/../../posts/captcha-0.jpg" width="38" height="55" class="img" /></a>
<a href="http://rafael.kontesti.me/blog/tags/machine-learning/../../posts/captcha-1.jpg"><img src="http://rafael.kontesti.me/blog/tags/machine-learning/../../posts/captcha-1.jpg" width="32" height="55" class="img" /></a>
<a href="http://rafael.kontesti.me/blog/tags/machine-learning/../../posts/captcha-2.jpg"><img src="http://rafael.kontesti.me/blog/tags/machine-learning/../../posts/captcha-2.jpg" width="39" height="55" class="img" /></a>
<a href="http://rafael.kontesti.me/blog/tags/machine-learning/../../posts/captcha-3.jpg"><img src="http://rafael.kontesti.me/blog/tags/machine-learning/../../posts/captcha-3.jpg" width="46" height="55" class="img" /></a></p>
<h1>Training</h1>
<p>Now it comes the manual part of the job. I have downloaded a large
number of captchas from the site I was analysing. I named each image
after what's written on the captcha. I labeled 908 captchas.</p>
<p>I divide each image into digit images and then I extract the features
described below from each digit. The features extracted from each digit
goes to a classification algorithm (I ended up using SVM). The algorithm
tells me what number it thinks each image is.</p>
<h1>Features</h1>
<p>All machine learning algorithms I know take a set of features identified
by a label and create a model that's capable of predicting the label of
a set of features it haven't seen before. If that seems abstract, don't
worry, I just wasn't able to write a decent introduction.</p>
<p>For example, the climate may have two features: atmosphere pressure and
the amount of water in the air. The labels may be: it's raing or it's
not raining. If today is raining, you take today's (pressure, moist)
vector and label it raining. If tomorrow is not raining you take
tomorrow's (pressure, moist) vector and label it not raining. And so on,
until you have a relevant data set. Then, a machine learning algorithm
looks your dataset and creates a model. That model will be able to
predict if other days are rainy or not based the day's (pressure, moist)
vector.</p>
<p>I had to come up with such features for my system. The most obvious one
was the value of each pixel of the image. Other features I've tried
were: vertical and horizontal silhouette, number of pixels, number of
white pixels, x-axis histogram, y-axis histogram, horizontal symmetry,
and vertical symmetry. As it turns out, the value of each pixel was the
best feature, all the others didn't help much or made it worse (except
if I used RandomForest classifier, then it was best to use all of them).</p>
<p>I have found out that SVM works best if you make every digit the same
size. So I scaled all digits to 16x16 using cubic scaling. That idea
alone bumped my results from 80% to 90%.</p>
<h1>Algorithms</h1>
<p>I knew about <a href="http://scikit-learn.org/stable/">scikit-learn</a>. It's a
nice library for machine learning algorithms. Since I
have done everything else in Python, that library came as an obvious
choice. I was a bit turned off by it not having a neural network. I know
it's not that hot anymore, but I'd like to have tried it on.</p>
<p>I transformed my features into the format
classifiers would understand (using scikit-learn's DictVectorizer) and
tried all algorithms there were. I started by Naive Bayes and Nearest
Neighbor, which I already knew and understood. They yielded good
results. But the algorithm which performed the best was SVM. I kind of
knew that could be the case because I have already heard so much about
SVM being this very good thing. It was indeed good. Unfortunately, I
don't understand it too well (specially the kernel trick part), maybe
I'll learn more of it for another post.</p>
<p>In the end, I used SVM with the pixel values of the scaled image and a
second degree polynomial kernel. It yielded a precision of 91% (+/- 1%).
That's reasonable. Also, the performance was great, each request to
classify an unseen captcha took around 10ms in my machine. And my
machine is not really that good. A mere core 2 duo laptop with 2GB of
ram.</p>
<h1>Making it better</h1>
<p>Later on I may study a few possibilities to make the digit separtion
part better. Maybe I can come up with better features as well, although
I don't have my hopes up for this one.</p>
<p>Other thing I have been thinking is on using a separate set of features
and classifier for each digit. Each classifier would have two labels 0
or 1. The label 0 means the digit is not of that class, 1 means it is. I
think that could have really helped. I'm a little out of time, though.
Maybe someday.</p>
<h1>Downloading the code</h1>
<p>If you are so inclined, you can take a look at my code at
<a href="https://github.com/aflag/captcha-study">github</a>. It doesn't have many
comments, but it's a very small program. If you know python and read
scikit-learn's documentation you're set. I'm releasing it under <a href="http://www.xfree86.org/3.3.6/COPYRIGHT2.html#3">X11
license</a> (with my name
in the copyright notice, obviously).</p>
<p>Datasets included <img src="http://rafael.kontesti.me/blog/tags/machine-learning/../../smileys/smile.png" alt=":)" /></p>