{"title": "Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)", "book": "Advances in Neural Information Processing Systems", "page_first": 265, "page_last": 272, "abstract": null, "full_text": " \n\n     Using Machine Learning to Break Visual \n          Huma n Inte raction  Proofs (HIPs) \n\n\n  Kumar \n                                Chellapilla                  Patrice Y. Simard \n                  Microsoft Research                         Microsoft Research \n                  One Microsoft Way                          One Microsoft Way \n                 Redmond, WA 98052                      Redmond, WA 98052 \n                kumarc@microsoft.com patrice@microsoft.com \n                                                                                     \n\n\n\n                                                Abstract \n\n          Machine learning is often used to automatically solve human tasks. \n          In this paper, we look for tasks where machine learning algorithms \n          are not as good as humans with the hope of gaining insight into \n          their current limitations. We studied various Human Interactive \n          Proofs (HIPs) on the market, because they are systems designed to \n          tell computers and humans apart by posing challenges presumably \n          too hard for computers. We found that most HIPs are pure \n          recognition tasks which can easily be broken using machine \n          learning. The harder HIPs use a combination of segmentation and \n          recognition tasks. From this observation, we found that building \n          segmentation tasks is the most effective way to confuse machine \n          learning algorithms. This has enabled us to build effective HIPs \n          (which we deployed in MSN Passport), as well as design \n          challenging segmentation tasks for machine learning algorithms. 
\n\n\n1   Introduction\n\nThe OCR problem for high-resolution printed text was virtually solved ten years ago [1]. On the other hand, cursive handwriting recognition today is still too poor for most people to rely on. Is there a fundamental difference between these two seemingly similar problems?\n\nTo shed more light on this question, we study problems that have been designed to be difficult for computers. The hope is that we will gain some insight into what the stumbling blocks are for machine learning and devise appropriate tests to further understand their similarities and differences.\n\nWork on distinguishing computers from humans traces back to the original Turing Test [2], which asks that a human distinguish between another human and a machine by asking questions of both. Recent interest has turned to developing systems that allow a computer to distinguish between another computer and a human. These systems enable the construction of automatic filters that can be used to prevent automated scripts from utilizing services intended for humans [4]. Such systems have been termed Human Interactive Proofs (HIPs) [3] or Completely Automated Public Turing Tests to Tell Computers and Humans Apart (CAPTCHAs) [4]. An overview of the work in this area can be found in [5]. Construction of HIPs that are of practical value is difficult because it is not sufficient to develop challenges at which humans are somewhat more successful than machines. This is because the cost of failure for an automatic attacker is minimal compared to the cost of failure for humans. Ideally, a HIP should be solved by humans more than 80% of the time, while an automatic script with reasonable resource use should succeed less than 0.01% of the time. 
This latter ratio (1 in 10,000) is a function of the cost of an automatic trial divided by the cost of having a human perform the attack.\n\nThis constraint of generating tasks that are failed 99.99% of the time by all automated algorithms has generated various solutions which can easily be sampled on the internet. Seven different HIPs, namely Mailblocks, MSN (before April 28th, 2004), Ticketmaster, Yahoo, Yahoo v2 (after Sept. '04), Register, and Google, will be given as examples in the next section. We will show in Section 3 that machine-learning-based attacks are far more successful than 1 in 10,000. Yet, some of these HIPs are harder than others and could be made even harder by identifying the recognition and segmentation parts, and emphasizing the latter. Section 4 presents examples of more difficult HIPs which are much more respectable challenges for machine learning, and yet surprisingly easy for humans. The final section discusses a (known) weakness of machine learning algorithms and suggests designing simple artificial datasets for studying this weakness.\n\n\n2   Examples of HIPs\n\nThe HIPs explored in this paper are made of characters (or symbols) rendered to an image and presented to the user. Solving the HIP requires identifying all characters in the correct order. 
The following HIPs can be sampled from the web:\n\nMailblocks: While signing up for free email service with Mailblocks (www.mailblocks.com), you will find HIP challenges of the type:\n\n[sample Mailblocks HIP image]\n\nMSN: While signing up for free e-mail with MSN Hotmail (www.hotmail.com), you will find HIP challenges of the type:\n\n[sample MSN HIP image]\n\nRegister.com: While requesting a whois lookup for a domain at www.register.com, you will receive HIP challenges of the type:\n\n[sample Register.com HIP image]\n\nYahoo!/EZ-Gimpy (CMU): While signing up for free e-mail service with Yahoo! (www.yahoo.com), you will receive HIP challenges of the type:\n\n[sample Yahoo!/EZ-Gimpy HIP image]\n\nYahoo! (version 2): Starting in August 2004, Yahoo! introduced their second-generation HIP. Three examples are presented below:\n\n[three sample Yahoo! v2 HIP images]\n\nTicketmaster: While looking for concert tickets at www.ticketmaster.com, you will receive HIP challenges of the type:\n\n[sample Ticketmaster HIP image]\n\nGoogle/Gmail: While signing up for free e-mail with Gmail at www.google.com, one will receive HIP challenges of the type:\n\n[sample Google HIP image]\n\nWhile solutions to Yahoo HIPs are common English words, those for Ticketmaster and Google do not necessarily belong to the English dictionary. 
They appear to have been created using a phonetic generator [8].\n\n\n3   Using machine learning to break HIPs\n\nBreaking HIPs is not new. Mori and Malik [7] have successfully broken the EZ-Gimpy (92% success) and Gimpy (33% success) HIPs from CMU. Our approach aims at an automatic process for solving multiple HIPs with minimum human intervention, using machine learning. In this paper, our main goal is to learn more about the common strengths and weaknesses of these HIPs rather than to prove that we can break any one HIP in particular with the highest possible success rate. We have results for six different HIPs: EZ-Gimpy/Yahoo, Yahoo v2, Mailblocks, Register, Ticketmaster, and Google.\n\nTo simplify our study, we will not be using language models in our attempt to break HIPs. For example, there are only about 600 words in the EZ-Gimpy dictionary [7], which means that a random-guess attack would get a success rate of 1 in 600 (more than enough to break the HIP, i.e., greater than 0.01% success). HIPs become harder when no language model is used. Similarly, when a HIP uses a language model to generate challenges, the success rate of attacks can be significantly improved by incorporating the language model. Further, since the language model is not common to all HIPs studied, it was not used in this paper.\n\nOur generic method for breaking all of these HIPs is to write a custom algorithm to locate the characters, and then use machine learning for recognition. Surprisingly, segmentation, or finding the characters, is simple for many HIPs, which makes the process of breaking the HIP particularly easy. Gimpy uses a single constant predictable color (black) for letters even though the background color changes. We quickly realized that once the segmentation problem is solved, solving the HIP becomes a pure recognition problem, and it can trivially be solved using machine learning. 
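As an illustration of the segmentation primitive used throughout this section, here is a minimal pure-Python sketch of our own (not the code actually used in the attacks) of binarization followed by connected-component extraction:

```python
from collections import deque

def connected_components(img, threshold=128):
    # Binarize: pixels darker than the threshold count as ink (foreground).
    h, w = len(img), len(img[0])
    fg = [[img[y][x] < threshold for x in range(w)] for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    components = []
    for sy in range(h):
        for sx in range(w):
            if not fg[sy][sx] or seen[sy][sx]:
                continue
            # Flood-fill one component with breadth-first search.
            queue, comp = deque([(sy, sx)]), []
            seen[sy][sx] = True
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and fg[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(comp)
    return components

# Toy grayscale image: two dark blobs (0) on a light background (255).
toy = [
    [255, 0, 0, 255, 255, 255],
    [255, 0, 0, 255, 0, 255],
    [255, 255, 255, 255, 0, 255],
]
ccs = connected_components(toy)   # two components, of sizes 4 and 2
```

An attack built on this primitive would keep only components whose sizes are close to the expected character dimensions and feed each one, recentered, to the recognizer.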
Our recognition engine is based on neural networks [6][9]. It yielded a 0.4% error rate on the MNIST database, uses little memory, and is very fast for recognition (important for breaking HIPs).\n\nFor each HIP, we have a segmentation step, followed by a recognition step. It should be stressed that we are not trying to solve every HIP of a given type, i.e., our goal is not a 100% success rate, but something efficient that can achieve much better than 0.01%.\n\nIn each of the following experiments, 2500 HIPs were hand labeled and used as follows: (a) recognition (1600 for training, 200 for validation, and 200 for testing), and (b) segmentation (500 for testing segmentation). For each of the five HIPs, a convolutional neural network, identical to the one described in [6], was trained and tested on gray-level character images centered on the guessed character positions (see below). The trained neural network became the recognizer.\n\n3.1   Mailblocks\n\nTo solve the HIP, we select the red channel, binarize and erode it, extract the largest connected components (CCs), and break up CCs that are too large into two or three adjacent CCs. Further, vertically overlapping half-character-size CCs are merged. The resulting rough segmentation works most of the time. Here is an example:\n\n[segmented Mailblocks HIP image]\n\nFor instance, in the example above, the NN would be trained and tested on the following images:\n\n[centered character images]\n\nThe end-to-end success rate is 88.8% for segmentation, 95.9% for recognition (given correct segmentation), and (0.888)*(0.959)^7 = 66.2% total. Note that most of the errors come from segmentation, even though this is where all the custom programming was invested.\n\n3.2   Register\n\nThe procedure to solve HIPs is very similar. The image was smoothed, binarized, and the largest 5 connected components were identified. Two examples are presented below:\n\n[two segmented Register.com HIP images]\n\nThe end-to-end success rate is 95.4% for segmentation, 87.1% for recognition (given correct segmentation), and (0.954)*(0.871)^5 = 47.8% total.\n\n3.3   Yahoo/EZ-Gimpy\n\nUnlike the Mailblocks and Register HIPs, the Yahoo/EZ-Gimpy HIPs are richer in that a variety of backgrounds and clutter are possible. Though some amount of text warping is present, the text color, size, and font have low variability. Three simple segmentation algorithms were designed, with associated rules to identify which algorithm to use. The goal was to keep these simple yet effective:\n\na) No mesh: Convert to grayscale image, threshold to black and white, select large CCs with sizes close to HIP char sizes. One example:\n\n[example image]\n\nb) Black mesh: Convert to grayscale image, threshold to black and white, remove vertical and horizontal line pixels that don't have neighboring pixels, select large CCs with sizes close to HIP char sizes. One example:\n\n[example image]\n\nc) White mesh: Convert to grayscale image, threshold to black and white, add black pixels (in white line locations) if there exist neighboring pixels, select large CCs with sizes close to HIP char sizes. 
One example:\n\n[example image]\n\nTests for black and white meshes were performed to determine which segmentation algorithm to use. The end-to-end success rate was 56.2% for segmentation (38.2% came from a), 11.8% from b), and 6.2% from c)), 90.3% for recognition (given correct segmentation), and (0.562)*(0.903)^4.8 = 34.4% total. The average length of a Yahoo HIP solution is 4.8 characters.\n\n3.4   Ticketmaster\n\nThe procedure that solved the Yahoo HIP is fairly successful at solving some of the Ticketmaster HIPs. These HIPs are characterized by criss-crossing lines at random angles clustered around 0, 45, 90, and 135 degrees. A multipronged attack as in the Yahoo case (Section 3.3) has potential. In the interests of simplicity, a single attack was developed: Convert to grayscale, threshold to black and white, up-sample the image, dilate first, then erode, select large CCs with sizes close to HIP char sizes. One example:\n\n[example image]\n\nThe dilate-erode combination causes the lines to be removed (along with any thin objects) but retains solid thick characters. This single attack achieves an end-to-end success rate of 16.6% for segmentation; the recognition rate was 82.3% (in spite of interfering lines), and (0.166)*(0.823)^6.23 = 4.9% total. The average HIP solution length is 6.23 characters.\n\n3.5   Yahoo version 2\n\nThe second-generation HIP from Yahoo had several changes: a) it did not use words from a dictionary or even a phonetic generator, b) it uses only black and white colors, c) it uses both letters and digits, and d) it uses connected lines and arcs as clutter. 
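The morphological preprocessing used in the Ticketmaster attack above, and again for this HIP, can be sketched in pure Python. This is our own toy illustration, not the attack code, and the polarity convention is assumed: the variant below applies erosion followed by dilation to the ink, which is the pass that deletes strokes thinner than the 3x3 structuring element while restoring thick character strokes.

```python
def erode(fg):
    # A pixel survives erosion only if its full 3x3 neighborhood is ink.
    h, w = len(fg), len(fg[0])
    out = [[False] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = all(fg[y + dy][x + dx]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
    return out

def dilate(fg):
    # A pixel is ink after dilation if any 3x3 neighbor is ink.
    h, w = len(fg), len(fg[0])
    out = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = any(fg[ny][nx]
                            for ny in (y - 1, y, y + 1)
                            for nx in (x - 1, x, x + 1)
                            if 0 <= ny < h and 0 <= nx < w)
    return out

# Toy bitmap: a thick 4x4 stroke (a character) plus a 1-pixel clutter line.
H, W = 8, 10
fg = [[False] * W for _ in range(H)]
for y in range(1, 5):
    for x in range(1, 5):
        fg[y][x] = True
for x in range(W):
    fg[6][x] = True

opened = dilate(erode(fg))   # thin line removed, thick stroke kept
```

On this toy input, the 1-pixel line vanishes under erosion and never comes back, while the 4x4 stroke shrinks to its 2x2 core and is restored by the dilation.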
\nThe HIP is somewhat similar to the MSN/Passport HIP, which does not use a dictionary, uses two colors, uses letters and digits, and uses background and foreground arcs as clutter. Unlike the MSN/Passport HIP, several different fonts are used. A single segmentation attack was developed: Remove the 6-pixel border, up-sample, dilate first, then erode, select large CCs with sizes close to HIP char sizes. The attack is practically identical to that used for the Ticketmaster HIP, with different preprocessing stages and slightly modified parameters. Two examples:\n\n[two segmented Yahoo! v2 HIP images]\n\nThis single attack achieves an end-to-end success rate of 58.4% for segmentation; the recognition rate was 95.2%, and (0.584)*(0.952)^5 = 45.7% total. The average HIP solution length is 5 characters.\n\n3.6   Google/Gmail\n\nThe Google HIP is unique in that it uses only image warp as a means of distorting the characters. Similar to the MSN/Passport and Yahoo version 2 HIPs, it is also two-color. The HIP characters are arranged close to one another (they often touch) and follow a curved baseline. The following very simple attack was used to segment Google HIPs: Convert to grayscale, up-sample, threshold, and separate connected components.\n\n[two segmented Google HIP images, a) and b)]\n\nThis very simple attack gives an end-to-end success rate of 10.2% for segmentation; the recognition rate was 89.3%, giving (0.102)*(0.893)^6.5 = 4.89% total probability of breaking a HIP. The average Google HIP solution length is 6.5 characters. 
This can be significantly improved upon by judicious use of the dilate-erode attack. A direct application does not do as well as it did on the Ticketmaster and Yahoo HIPs (because of the shear and warp of the baseline of the word). More successful and complicated attacks might estimate and counter the shear and warp of the baseline to achieve better success rates.\n\n\n4   Lessons learned from breaking HIPs\n\nFrom the previous section, it is clear that most of the errors come from incorrect segmentations, even though most of the development time is spent devising custom segmentation schemes. This observation raises the following questions: Why is segmentation a hard problem? Can we devise harder HIPs and datasets? Can we build an automatic segmentor? Can we compare classification algorithms based on how useful they are for segmentation?\n\n4.1   The segmentation problem\n\nAs a review, segmentation is difficult for the following reasons:\n\n1. Segmentation is computationally expensive. In order to find valid patterns, a recognizer must attempt recognition at many different candidate locations.\n\n2. The segmentation function is complex. To segment successfully, the system must learn to identify which patterns are valid among the set of all possible valid and non-valid patterns. This task is intrinsically more difficult than classification because the space of inputs is considerably larger. Unlike the space of valid patterns, the space of non-valid patterns is typically too vast to sample. This is a problem for many learning algorithms, which yield too many false positives when presented with non-valid patterns.\n\n3. Identifying valid characters among a set of valid and invalid candidates is a combinatorial problem. 
For example, correctly identifying which 8 characters among 20 candidates are valid (assuming 12 false positives) has a 1 in 125,970 (20 choose 8) chance of success by random guessing.\n\n4.2   Building better/harder HIPs\n\nWe can use what we have learned to build better HIPs. For instance, the HIP below was designed to make segmentation difficult, and a similar version has been deployed by MSN Passport for Hotmail registrations (www.hotmail.com):\n\n[example MSN Passport HIP image]\n\nThe idea is that the additional arcs are themselves good candidates for false characters. The previous segmentation attacks would fail on this HIP. Furthermore, a simple change of fonts, distortions, or arc types would require extensive work for the attacker to adjust to. We believe HIPs that emphasize the segmentation problem, such as the above example, are much stronger than the HIPs we examined in this paper, which rely on recognition being difficult. Pushing this to the extreme, we can easily generate the following HIPs:\n\n[examples of extreme segmentation-heavy HIPs]\n\nDespite the apparent difficulty of these HIPs, humans are surprisingly good at solving these, indicating that humans are far better than computers at segmentation. This approach of adding several competing false positives can in principle be used to automatically create difficult segmentation problems or benchmarks to test classification algorithms.\n\n4.3   Building an automatic segmentor\n\nTo build an automatic segmentor, we could use the following procedure. Label characters based on their correct position and train a recognizer. 
Apply the trained recognizer at all locations in the HIP image. Collect all candidate characters identified with high confidence by the recognizer. Compute the probability of each combination of candidates (going from left to right), and output the solution string with the highest probability. This is better illustrated with an example.\n\nConsider the following HIP. The trained neural network has maps (warm colors indicate recognition) that show that K, Y, and so on are correctly identified. However, the maps for 7 and 9 show several false positives.\n\n[example HIP image with per-character recognition maps for K, Y, B, 7, and 9]\n\nIn general, we would get the following color-coded map for all the different candidates:\n\n[color-coded candidate map]\n\nWith a threshold of 0.5 on the network's outputs, the map obtained is:\n\n[thresholded candidate map]\n\nWe note that there are several false positives for each true positive. The number of false positives per true positive character was found to be between 1 and 4, giving between a 1 in C(16,8) = 12,870 and a 1 in C(32,8) = 10,518,300 random chance of guessing the correct segmentation for the HIP characters. These numbers can be improved upon by constraining solution strings to flow sequentially from left to right and by restricting overlap. For each combination, we compute a probability by multiplying the 8 probabilities of the classifier for each position. The combination with the highest probability is the one proposed by the classifier. 
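To make the combinatorics concrete, the sketch below (ours, with candidate positions and confidences invented purely for illustration) checks the quoted odds with math.comb and brute-forces the left-to-right combination scoring just described over a small hypothetical candidate set:

```python
import itertools
import math

# The random-guessing odds quoted above.
odds_easy = math.comb(16, 8)    # 12,870
odds_hard = math.comb(32, 8)    # 10,518,300

# Hypothetical recognizer output: (x position, character, confidence).
candidates = [
    (2, 'K', 0.9), (5, '7', 0.3), (6, 'Y', 0.8),
    (7, '9', 0.2), (11, 'B', 0.7), (12, '7', 0.4),
]

def best_string(candidates, length, min_gap=3):
    # Score every left-to-right, non-overlapping combination by the
    # product of its classifier confidences and keep the best one.
    best, best_p = None, 0.0
    for combo in itertools.combinations(sorted(candidates), length):
        if any(b[0] - a[0] < min_gap for a, b in zip(combo, combo[1:])):
            continue  # candidates overlap too much; skip this combination
        p = math.prod(conf for _, _, conf in combo)
        if p > best_p:
            best, best_p = ''.join(ch for _, ch, _ in combo), p
    return best, best_p

solution, prob = best_string(candidates, 3)
```

The overlap constraint prunes most of the C(n, k) combinations, and the surviving maximum-product string corresponds to the segmentation the classifier would propose.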
We do not have results for such an automatic segmentor at this time. It is interesting to note that with such a method, a classifier that is robust to false positives would do far better than one that is not. This suggests another axis for comparing classifiers.\n\n\n5   Conclusion\n\nIn this paper, we have successfully applied machine learning to the problem of solving HIPs. We have learned that decomposing the HIP problem into segmentation and recognition greatly simplifies analysis. Recognition on even unprocessed images (given that segmentation is solved) can be done automatically using neural networks. Segmentation, on the other hand, is the difficulty differentiator between weaker and stronger HIPs and requires custom intervention for each HIP. We have used this observation to design new HIPs and new tests for machine learning algorithms with the hope of improving them.\n\nAcknowledgements\n\nWe would like to acknowledge Chau Luu and Eric Meltzer for their help with labeling and segmenting various HIPs. We would also like to acknowledge Josh Benaloh and Cem Paya for stimulating discussions on HIP security.\n\nReferences\n\n[1] Baird HS (1992), \"Anatomy of a versatile page reader,\" Proc. of the IEEE, v. 80, pp. 1059-1065.\n\n[2] Turing AM (1950), \"Computing Machinery and Intelligence,\" Mind, 59:236, pp. 433-460.\n\n[3] First Workshop on Human Interactive Proofs, Palo Alto, CA, January 2002.\n\n[4] Von Ahn L, Blum M, and Langford J, The CAPTCHA Project. http://www.captcha.net\n\n[5] Baird HS and Popat K (2002), \"Human Interactive Proofs and Document Image Analysis,\" Proc. IAPR 2002 Workshop on Document Analysis Systems, Princeton, NJ.\n\n[6] Simard PY, Steinkraus D, and Platt J (2003), \"Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis,\" in International Conference on Document Analysis and Recognition (ICDAR), pp. 
958-962, IEEE Computer Society, Los Alamitos, CA.\n\n[7] Mori G and Malik J (2003), \"Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA,\" Proc. of the Computer Vision and Pattern Recognition (CVPR) Conference, IEEE Computer Society, vol. 1, pp. I-134 - I-141, June 18-20, 2003.\n\n[8] Chew M and Baird HS (2003), \"BaffleText: a Human Interactive Proof,\" Proc. 10th IS&T/SPIE Document Recognition & Retrieval Conf., Santa Clara, CA, Jan. 22.\n\n[9] LeCun Y, Bottou L, Bengio Y, and Haffner P (1998), \"Gradient-based learning applied to document recognition,\" Proceedings of the IEEE, Nov. 1998.\n", "award": [], "sourceid": 2571, "authors": [{"given_name": "Kumar", "family_name": "Chellapilla", "institution": null}, {"given_name": "Patrice", "family_name": "Simard", "institution": null}]}