{"title": "A Topographic Support Vector Machine: Classification Using Local Label Configurations", "book": "Advances in Neural Information Processing Systems", "page_first": 929, "page_last": 936, "abstract": null, "full_text": "       A Topographic Support Vector Machine:\n Classification Using Local Label Configurations\n\n\n\n                                      Johannes Mohr\n                          Clinic for Psychiatry and Psychotherapy\n                                   Charite Medical School\n                                            and\n                 Bernstein Center for Computational Neuroscience Berlin\n                                   10117 Berlin, Germany\n\n\n                                     Klaus Obermayer\n               Department of Electrical Engineering and Computer Science\n                              Berlin University of Technology\n                                            and\n                 Bernstein Center for Computational Neuroscience Berlin\n                                   10587 Berlin, Germany\n\n\n             johann@cs.tu-berlin.de, oby@cs.tu-berlin.de\n\n\n\n\n                                         Abstract\n\n         The standard approach to the classification of objects is to consider the\n         examples as independent and identically distributed (iid). In many real\n         world settings, however, this assumption is not valid, because a topo-\n         graphical relationship exists between the objects. In this contribution we\n         consider the special case of image segmentation, where the objects are\n         pixels and where the underlying topography is a 2D regular rectangular\n         grid. We introduce a classification method which not only uses measured\n         vectorial feature information but also the label configuration within a to-\n         pographic neighborhood. 
Due to the resulting dependence between the labels of neighboring pixels, a collective classification of a set of pixels becomes necessary. We propose a new method called 'Topographic Support Vector Machine' (TSVM), which is based on a topographic kernel and a self-consistent solution to the label assignment, shown to be equivalent to a recurrent neural network. The performance of the algorithm is compared to a conventional SVM on a cell image segmentation task.\n\n\n1    Introduction\n\nThe segmentation of natural images into semantically meaningful subdivisions can be considered as one or more binary pixel classification problems, where two classes of pixels are characterized by some measurement data (features). For each binary problem the task is to assign a set of new pixels to one of the two classes using a classifier trained on a set of labeled pixels (training data).\n\nIn conventional classification approaches the assumption of iid examples is usually made, so the classification result is determined solely by the measurement data. Natural images, however, possess a topographic structure, in which there are dependencies between the labels of topographic neighbors, making the data non-iid. Therefore, not only the measurement data but also the labels of the topographic neighbors can be used in the classification of a pixel. It has been shown for a number of problems that modeling dependencies between instances can improve model accuracy. A Conditional Random Field approach has been used for labeling text sequences [1]. Combining this idea with local discriminative models, [2] used a discriminative random field to model the dependencies between the labels of image blocks in a probabilistic framework. A collective classification relational dependency network was used in [3] for movie box-office receipts prediction and paper topic classification. 
The maximization of the per-label margin of pairwise Markov networks was applied in [4] to handwritten character recognition and collective hypertext classification. There, the number of variables and constraints of the quadratic programming problem was polynomial in the number of labels.\n\nIn this work, we propose a method which is also based on margin maximization and allows the collective assignment of a large number of binary labels which have a regular grid topography. In contrast to [4], the number of constraints and variables does not depend on the number of labels. The method, called topographic support vector machine (TSVM), is based on the assumption that knowledge about the local label configuration can improve the classification of a single data point. Consider as an example the segmentation of a collection of images depicting physical objects of similar shape, but high variability in gray level and texture. In this case, the measurements are dissimilar, while the local label configurations show high similarity.\n\nHere, we apply the TSVM to the supervised bottom-up segmentation of microscopic images of Papanicolaou-stained cervical cell nuclei from the CSSIP pap smear dataset1. Segmentation of these images is important for the detection of cervical cancer or precancerous cells. The final goal is to use so-called malignancy associated changes (MACs), e.g. a slight shift of the distribution of nuclear size not yet visible to the human observer, in order to detect cancer at an early stage [5]. A previously used bottom-up segmentation approach for this data using morphological watersheds was reported to have difficulties with weak gradients and with the presence of other large gradients adjacent to the target [5]. 
Top-down methods like active contour models have successfully been used [6], but require heuristic initialization and error correction procedures.\n\n\n2     Classification using a Topographic Support Vector Machine\n\nLet O = {o1, ..., on} be a set of n sites on a 2D pixel grid and G = {G_o, o ∈ O} be a neighborhood system for O, where G_o is the set of neighbors of o and neighborhood is defined by o ∉ G_o and o ∈ G_p ⇔ p ∈ G_o. For each pixel site o_i from the set O, a binary label y_i ∈ {−1, +1} giving the class assignment is assumed to be known. To simplify the notation, in the following we make use of multi-indices written in the form of vectors, referring to pairs of indices on a two-dimensional grid. We define the neighborhood of order c as G^c = {G_i, i ∈ O}; G_i = {k ∈ O : 0 < ‖k − i‖² ≤ c}. This way, G¹ describes the first-order neighborhood system (4 neighbors), G² the second-order system (8 neighbors), and so on. Each pixel site is characterized by some measurement vector. This could for example be the vector of gray value intensities at a pixel site, the gray value patch around a central pixel location, or the responses to a bank of linear or nonlinear filters (e.g. Gabor coefficients). Using a training set composed of (possibly several) sets of pixel sites, each accompanied by a set of measurement vectors X = {x_i, i ∈ [1..n]} and a set of labels Y = {y_i, i ∈ [1..n]} (e.g. a manually labeled image), the task of classification is to assign class labels to a set of ℓ pixel sites U = {u1, ..., u_ℓ} of an unlabeled image, for which a set of measurements X̃ = {x̃_i, i ∈ [1..ℓ]} is available.\n\n     1Centre for Sensor Signal and Information Processing, University of Queensland\n\n
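As an illustration of the neighborhood definition above, the following sketch (hypothetical code, not from the paper) enumerates the offsets s with 0 < ‖s‖² ≤ c that make up G^c and confirms the neighbor counts quoted in the text, including q = 100 for the order-32 neighborhood used in the experiments.

```python
# Offsets s with 0 < ||s||^2 <= c form the order-c neighborhood G^c
# of a pixel on the 2D grid (boundary effects are ignored in this sketch).
def neighborhood_offsets(c):
    r = int(c ** 0.5)
    return [(a, b) for a in range(-r, r + 1)
                   for b in range(-r, r + 1)
                   if 0 < a * a + b * b <= c]

print(len(neighborhood_offsets(1)))   # 4 neighbors (first order)
print(len(neighborhood_offsets(2)))   # 8 neighbors (second order)
print(len(neighborhood_offsets(32)))  # 100 neighbors (order 32, Section 3)
```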
For the classification we will use a support vector machine.\n\n\n2.1    Support Vector Classification\n\nIn Support Vector Classification (SVC) methods [7], a kernel is used to solve a complex classification task in a usually high-dimensional feature space via a separating hyperplane. Results from statistical learning theory [8] show that maximizing the margin (the distance of the closest data point to the hyperplane) leads to improved generalization abilities. In practice, the optimal margin hyperplane can be obtained by solving a quadratic programming problem. Several schemes have been introduced to deal with noisy measurements via the introduction of slack variables. In the following we shortly review one such scheme, the C-SVM, which is later used in the experiments. For a canonical separating hyperplane (w, b) in a higher-dimensional feature space H, to which the n variables x_i are mapped by Φ(x), and n slack variables ξ_i, the primal objective function of a C-SVM can be formulated as\n\n    min_{w ∈ H, ξ ∈ ℝⁿ}  (1/2) ‖w‖² + (C/n) Σ_{i=1..n} ξ_i,                    (1)\n\nsubject to y_i (wᵀΦ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  C > 0,  i = 1, ..., n.\n\nIn order to classify a new object h with unknown label, the following decision rule is evaluated:\n\n    f(x_h) = sgn( Σ_{i=1..m} α_i y_i K(x_h, x_i) + b ),                        (2)\n\nwhere the sum runs over all m support vectors.\n\n\n2.2    Topographic Kernel\n\nWe now assume that the label of each pixel is determined by both the measurement and the set of labels of its 
topographic neighbors. We define a vector y_{G_h} in which the labels of the q topographic neighbors of the pixel h are concatenated in an arbitrary, but fixed order. We propose a support vector classifier using an extended kernel, which in addition to the measurement vector x_h also includes the vector y_{G_h}:\n\n    K(x_h, x_j, y_{G_h}, y_{G_j}) = K1(x_h, x_j) + λ K2(y_{G_h}, y_{G_j}),      (3)\n\nwhere λ is a hyper-parameter. Kernel K1 can be an arbitrary kernel working on the measurements. For kernel K2 an arbitrary dot-product kernel might be used. In the following we restrict ourselves to a linear kernel (corresponding to the normalized Hamming distance between the local label configurations)\n\n    K2(y_{G_h}, y_{G_j}) = (1/q) ⟨y_{G_h} | y_{G_j}⟩,                           (4)\n\nwhere ⟨...|...⟩ denotes a scalar product. The kernel K2 defined in eq. (4) thus consists of a dot-product between these vectors divided by their length. 
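A minimal sketch of this combined kernel (hypothetical code; an RBF kernel stands in for the arbitrary K1, and the value of λ is chosen arbitrarily):

```python
import numpy as np

def k1_rbf(x1, x2, s=1.0):
    # Example measurement kernel K1: an RBF kernel, as used in the experiments.
    return float(np.exp(-np.sum((np.asarray(x1) - np.asarray(x2)) ** 2) / s ** 2))

def k2_linear(yg_h, yg_j):
    # Linear label kernel K2 of eq. (4): normalized dot product of the
    # concatenated neighbor labels (entries in {-1, +1}).
    q = len(yg_h)
    return float(np.dot(yg_h, yg_j)) / q

def topographic_kernel(xh, xj, yg_h, yg_j, lam=0.5):
    # Extended kernel of eq. (3): K = K1 + lambda * K2.
    return k1_rbf(xh, xj) + lam * k2_linear(yg_h, yg_j)

# Identical label configurations give K2 = +1, inverted ones K2 = -1.
y = np.array([1, -1, 1, 1])
print(k2_linear(y, y))    # 1.0
print(k2_linear(y, -y))   # -1.0
```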
For a neighborhood G^c of order c we obtain\n\n    K2(y_{G_h}, y_{G_j}) = (1/q) Σ_{0 < ‖s‖² ≤ c}  y_{h+s} · y_{j+s}.           (5)\n\nThe linear kernel K2 in (4) takes on its maximum value if the label configurations are identical, and its lowest value if the label configuration is inverted.\n\n\n2.3    Learning phase\n\nIf an SVM is trained using the topographic kernel (3), the topographic label configuration is included in the learning process. The resulting support vectors will still contain the relevant information about the measurements, but additionally the label neighborhood information relevant for a good distinction of the classes.\n\n\n2.4    Classification phase\n\nIn order to collectively classify a set of ℓ new pixel sites with unknown topographic label configuration, we propose the following iterative approach to achieve a self-consistent solution to the classification problem. We denote the labels at step τ as y_h(τ), ∀h. At each step τ new labels are assigned according to\n\n    y_h(τ) = sgn( Σ_{j=1..m} α_j y_{v(j)} K(x_h, x_{v(j)}, y_{G_h}(τ−1), y_{G_{v(j)}}) + b ),  ∀h.   (6)\n\nThe sum runs over all m support vectors, whose indices on the 2D grid are denoted by the vector v(j). Since initially the labels are unknown, we use at step τ = 0 the results from a conventional support vector machine (λ = 0) as initialization for the labels. 
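The self-consistent iteration of eq. (6) can be sketched as follows (hypothetical code; a toy majority-vote rule on a small ring stands in for the full kernel decision function, and attractor cycles are not handled):

```python
import numpy as np

def collective_classify(decision, n_pixels, y_init, max_steps=50):
    # Iterate eq. (6): re-evaluate every pixel's label using the label
    # estimates of the previous step until a fixed point y(t) == y(t-1)
    # is reached (synchronous update).
    y = np.array(y_init, dtype=int)
    for _ in range(max_steps):
        y_new = np.array([int(np.sign(decision(h, y))) for h in range(n_pixels)])
        if np.array_equal(y_new, y):
            break
        y = y_new
    return y

# Toy decision function: a pixel follows the majority of its two ring
# neighbors; a small self-bias breaks ties.  This merely stands in for
# the support-vector expansion of eq. (6).
def toy_decision(h, y):
    return y[(h - 1) % len(y)] + y[(h + 1) % len(y)] + 0.1 * y[h]

# An isolated -1 label is flipped by its neighbors within a few steps.
labels = collective_classify(toy_decision, 6, [1, 1, -1, 1, 1, 1])
print(labels)
```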
For the following steps, estimates of the neighboring labels are available from the previous iteration. Using this new topographic label information in addition to the measurement information, a new assignment decision for the labels is made via (6). This leads to an iterative assignment of new labels.\n\nIf we write the contributions from kernel K1, which depend only on x and do not change with τ, as c_h(j) = α_j y_{v(j)} K1(x_h, x_{v(j)}), equation (6) becomes\n\n    y_h(τ) = sgn( Σ_{j=1..m} [ λ α_j y_{v(j)} K2(y_{G_h}(τ−1), y_{G_{v(j)}}) + c_h(j) ] + b ),  ∀h.   (7)\n\nInserting the linear kernel from equation (5), we get\n\n    y_h(τ) = sgn( Σ_{j=1..m} [ (λ/q) α_j y_{v(j)} Σ_{0<‖s‖²≤c} y_{h+s}(τ−1) y_{v(j)+s} + c_h(j) ] + b ),  ∀h.   (8)\n\nInterchanging the sums and using the definitions\n\n    w_{h,k} = (λ/q) Σ_{j=1..m} α_j y_{v(j)} y_{v(j)+(k−h)}   for k ∈ G_h,   w_{h,k} = 0 for k ∉ G_h,   (9)\n\nand\n\n    θ_h = −( Σ_{j=1..m} c_h(j) + b ),                                         (10)\n\nwe obtain\n\n    y_h(τ) = sgn( Σ_k y_k(τ−1) w_{h,k} − θ_h ),  ∀h.                          (11)\n\nThis corresponds to the equations describing the dynamics of a recurrent neural network composed of McCulloch-Pitts neurons [9]. The condition for symmetric weights w_{h,k} = w_{k,h} is equivalent to an inversion symmetry of the label configurations of the support vectors in the neighborhood topology; therefore the weights in equation (9) are not necessarily symmetric. A suitable stopping criterion for the iteration is that the net reaches either a fixed point y_h(τ) = y_h(τ−1), ∀h, or an attractor cycle y_h(τ) = y_h(τ̃), τ̃ < τ − 1, ∀h.\n\nThe network described by eq. (11) corresponds to a diluted network of ℓ binary neurons with no self-interaction and asymmetric weights. One can see from eq. (9) that the network has only local connections, corresponding to the topographic neighborhood G_h. The measurement x_h only influences the individual unit threshold θ_h of the network, via the weighted sum over all support vectors of the contributions from kernel K1 (eq. (10)). The label configurations of the support vectors, on the other hand, are contained in the network weights via eq. (9). The weights are multiplied by the hyper-parameter λ, which determines how much the label configuration influences the class decision in comparison to the measurements. It has to be adjusted to yield optimal results for a class of data sets. For λ = 0 the TSVM becomes a conventional SVM.\n\n\n2.5         Symmetrization of the weights\n\nIn order to ensure convergence, we suggest using an inversion-symmetric version K2^sym of kernel K2. 
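To see why the weights of eq. (9) are generally asymmetric while the symmetrized weights of eq. (13) are not, one can compare the two formulas on random support-vector labels (hypothetical code; all data, the grid size, and the constants λ and q are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, q = 1.0, 4.0   # arbitrary illustrative constants

# Made-up support data: sites v(j) in the interior of a random label
# image, with random dual coefficients alpha_j.
labels = rng.choice([-1, 1], size=(12, 12))
sites = [(r, c) for r in range(2, 10) for c in range(2, 10)][:20]
alpha = rng.random(len(sites))

def w(d):
    # Weights of eq. (9): they depend on the offset d = k - h only.
    return (lam / q) * sum(a * labels[v] * labels[v[0] + d[0], v[1] + d[1]]
                           for a, v in zip(alpha, sites))

def w_sym(d):
    # Symmetrized weights of eq. (13): even in d by construction.
    return w(d) + w((-d[0], -d[1]))

d = (1, 0)
print(w(d) == w((-1, 0)))                    # generally False: w is asymmetric
print(np.isclose(w_sym(d), w_sym((-1, 0))))  # True: w_sym is symmetric
```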
For the pixel grid we can define the inversion operation as l + t ↦ l − t, t ∈ ℕ², l + t ∈ G_l, and denote the inverse of a by ā. Taking the inverse of the vector y_{G_l}, in which the set y_{G_l} is concatenated in an arbitrary but fixed order, leads to a reordering of the components of the vector. The benefit of the chosen inversion-symmetric kernel is that the self-consistency equations for the labels will turn out to be equivalent to a Hopfield net, which has proven convergence properties. We define the new kernel as\n\n    K2^sym(y_{G_h}, y_{G_j}) = (1/q) ( ⟨y_{G_h} | y_{G_j}⟩ + ⟨y_{G_h} | ȳ_{G_j}⟩ ).   (12)\n\nAlthough only the second argument is inverted within the kernel, the value of this kernel does not depend on the order of the arguments.\n\nProof   It follows from the definition of the inversion operator and the dot product that ⟨y_{G_h}|y_{G_j}⟩ = ⟨ȳ_{G_h}|ȳ_{G_j}⟩ = ⟨y_{G_j}|y_{G_h}⟩ = ⟨ȳ_{G_j}|ȳ_{G_h}⟩ and ⟨ȳ_{G_h}|y_{G_j}⟩ = ⟨y_{G_h}|ȳ_{G_j}⟩ = ⟨ȳ_{G_j}|y_{G_h}⟩ = ⟨y_{G_j}|ȳ_{G_h}⟩. Therefore,\n\n    K2^sym(y_{G_h}, y_{G_j}) = (1/q)( ⟨y_{G_h}|y_{G_j}⟩ + ⟨y_{G_h}|ȳ_{G_j}⟩ ) = (1/q)( ⟨y_{G_j}|y_{G_h}⟩ + ⟨ȳ_{G_j}|y_{G_h}⟩ ) = (1/q)( ⟨y_{G_j}|y_{G_h}⟩ + ⟨y_{G_j}|ȳ_{G_h}⟩ ) = K2^sym(y_{G_j}, y_{G_h}).\n\nInserting kernel (12) into eq. (7) and defining\n\n    w_{h,k}^sym = (λ/q) Σ_{j=1..m} α_j y_{v(j)} ( y_{v(j)+(k−h)} + y_{v(j)−(k−h)} )   for k ∈ G_h,   w_{h,k}^sym = 0 for k ∉ G_h,   (13)\n\nwe get\n\n    y_h(τ) = sgn( Σ_k y_k(τ−1) w_{h,k}^sym − θ_h ),  ∀h.                       (14)\n\nSince the network weights w_{h,k}^sym defined in eq. (13) are symmetric, this corresponds to the equation describing the dynamics during the retrieval phase of a Hopfield network [10]. Instead of taking the sum over all patterns, the sum is taken over all support vectors. The weight between two neurons in the original Hopfield net corresponds to the correlation between two components (over all fundamental patterns). In (13) the weight only depends on the difference vector k − h between the two neurons on the 2D grid and is proportional to the correlation (over all support vectors) between the label of a support vector and the label at distance k − h.\n\nTable 1: Average misclassification rate R and the standard deviation of the mean σ at optimal hyper-parameters C, S and λ.\n\n    algorithm    log2 C    log2 S    λ      R[%]    σ[%]\n    SVM          4         0.5      0      2.23    0.05\n    STSVM        4         0.5      1.2    1.96    0.06\n    TSVM         2         0.5      1.4    1.86    0.05\n\n\n3         Experiments\n\nWe applied the above algorithms to the binary classification of pixel sites of cell images from the CSSIP pap smear dataset. The goal was to assign the label +1 to the pixels belonging to the nucleus, and −1 to all others. The dataset contains three manual segmentations of the nucleus boundaries, from which we generated a 'ground truth' label for the area of the nucleus using majority voting. Only the first 300 images were used in the experiments. 
As a measurement vector we took a column ordering of a 3x3 gray value patch centered on a pixel site. In order to measure the classification performance for a non-iid data set, we estimated the test error based on the collective classification of all pixels in several randomly chosen test images. We compared three algorithms: a conventional SVM, the 'TSVM' with the topographic kernel K2 from eq. (4), and the 'STSVM' with the inversion-symmetric topographic kernel K2^sym from eq. (12). In the experiments we used a label neighborhood of order 32, which corresponds to q = 100 neighbors. For kernel K1 we used an RBF kernel K1(x1, x2) = exp(−‖x1 − x2‖² / S²) with hyper-parameter S. Since the data set was very large, no cross-validation or resampling techniques were required, and only a small subset of the available training data could be used for training. We randomly sampled several disjoint training sets in order to improve the accuracy of the error estimation. First, the hyper-parameters S and C (for TSVM and STSVM also λ) were optimized via a grid search in parameter space. This was done by measuring the average test error over 20 test images and 5 training sets. Then, the test of the classifiers was conducted at the respective optimal hyper-parameters for 20 previously unused test images and 50 randomly sampled disjoint training sets. In all experiments using synchronous update, either a fixed point or an attractor cycle of length two was reached. The average number of iterations needed was 12 (TSVM) and 13 (STSVM). Although the convergence properties have only been formally proven for the symmetric-weight STSVM, experimental evidence suggests the same convergence properties for the TSVM. The results for synchronous update are shown in Table 1 (results using asynchronous update differed only by 0.01%). 
The performance of both topographic algorithms is superior to the conventional SVM, while the TSVM performed slightly better than the STSVM. For the top-down method in [6] the results were only qualitatively assessed by a human expert, not quantitatively compared to a manual segmentation; therefore a direct comparison to our results was not possible. To illustrate the role of the hyper-parameter λ, fig. 1 shows 10 typical test images and their segmentations achieved by an STSVM at different values of λ for fixed S and C. For increasing λ the label images become less noisy, and at λ = 0.4 most artifacts have disappeared. This is caused by the increasing weight put on the label configuration via kernel K2^sym. Increasing λ even further will eventually lead to the appearance of spurious artifacts, as the influence of the label configuration will dominate the classification decision.\n\nFigure 1: Final labels assigned by the STSVM at fixed hyper-parameters C = 2⁶, S = 2². (a) λ = 0, (b) λ = 0.1, (c) λ = 0.2, (d) λ = 0.3, (e) λ = 0.4.\n\n\n4         Conclusions\n\nWe have presented a classification method for a special case of non-iid data in which the objects are linked by a regular grid structure. The proposed algorithm is composed of two components: The first part is a topographic kernel which integrates conventional feature information and the information of the label configurations within a topographic neighborhood. The second part consists of a collective classification with recurrent neural network dynamics which lets local label configurations converge to attractors determined by the label configurations of the support vectors. For the asymmetric-weight TSVM, the dimensionality of the problem is increased by the neighborhood size as compared to a conventional SVM (twice the neighborhood size for the symmetric-weight STSVM). 
However, the number of variables and constraints does not increase with the number of data points to be labeled. Therefore, the TSVM and the STSVM can be applied to image segmentation problems where a large number of pixel labels have to be assigned simultaneously.\n\nThe algorithms were applied to the bottom-up cell nucleus segmentation in pap smear images needed for the detection of cervical cancer. The classification performance of the TSVM and STSVM was compared to a conventional SVM, and it was shown that the inclusion of the topographic label configuration led to a substantial decrease in the average misclassification rate. The two topographic algorithms were much more resistant to noise and smaller artifacts. A removal of artifacts which have similar size and the same measurement features as some of the nuclei cannot be achieved by a pure bottom-up method, as this requires a priori model knowledge. In practice, the lower-dimensional TSVM is to be preferred over the STSVM, since it is faster and performed slightly better.\n\n\nAcknowledgments\n\nThis work was funded by the BMBF (grant 01GQ0411). We thank Sepp Hochreiter for useful discussions.\n\n\nReferences\n\n [1] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. Int. Conf. on Machine Learning, 2001.\n\n [2] S. Kumar and M. Hebert. Discriminative fields for modeling spatial dependencies in natural images. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.\n\n [3] J. Neville and D. Jensen. Collective classification with relational dependency networks. In Proc. 2nd Multi-Relational Data Mining Workshop, 9th ACM SIGKDD Intern. Conf. Knowledge Discovery and Data Mining, 2003.\n\n [4] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. 
In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.\n\n [5] P. Bamford. Segmentation of Cell Images with Application to Cervical Cancer Screening. PhD thesis, University of Queensland, 1999.\n\n [6] P. Bamford and B. Lovell. Unsupervised cell nucleus segmentation with active contours. Signal Processing Special Issue: Deformable Models and Techniques for Image and Signal Processing, 71(2):203–213, 1998.\n\n [7] B. Schölkopf and A. Smola. Learning with Kernels. The MIT Press, 2002.\n\n [8] V. Vapnik. Statistical Learning Theory. Springer, New York, 1998.\n\n [9] W. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943.\n\n[10] S. Haykin. Neural Networks. Macmillan College Publishing Company Inc., 1994.\n", "award": [], "sourceid": 2653, "authors": [{"given_name": "Johannes", "family_name": "Mohr", "institution": null}, {"given_name": "Klaus", "family_name": "Obermayer", "institution": null}]}