{"title": "A Neural Network Based Head Tracking System", "book": "Advances in Neural Information Processing Systems", "page_first": 908, "page_last": 914, "abstract": "We have constructed an inexpensive video-based motorized tracking system that learns to track a head. It uses real-time graphical user inputs or an auxiliary infrared detector as supervisory signals to train a convolutional neural network. The inputs to the neural network consist of normalized luminance and chrominance images and motion information from frame differences. Subsampled images are also used to provide scale invariance. During the online training phase, the neural network rapidly adjusts the input weights depending upon the reliability of the different channels in the surrounding environment. This quick adaptation allows the system to robustly track a head even when other objects are moving within a cluttered background.", "full_text": "A Neural Network Based\n\nHead Tracking System\n\nD. D. Lee and H. S. Seung\n\nBell Laboratories, Lucent Technologies\n\n700 Mountain Ave.\n\nMurray Hill, NJ 07974\n\n{ddlee|seung}@bell-labs.com\n\nAbstract\n\nWe have constructed an inexpensive, video-based, motorized tracking system that learns to track a head. It uses real-time graphical user inputs or an auxiliary infrared detector as supervisory signals to train a convolutional neural network. The inputs to the neural network consist of normalized luminance and chrominance images and motion information from frame differences. Subsampled images are also used to provide scale invariance. During the online training phase, the neural network rapidly adjusts the input weights depending upon the reliability of the different channels in the surrounding environment. 
This quick adaptation allows the system to robustly track a head even when other objects are moving within a cluttered background.\n\n1 Introduction\n\nWith the proliferation of inexpensive multimedia computers and peripheral equipment, video conferencing finally appears ready to enter the mainstream. But personal video conferencing systems typically use a stationary camera, tying the user to a fixed location much as a corded telephone tethers one to the telephone jack. A simple solution to this problem is to use a motorized video camera that can track a specific person as he or she moves about. However, this presents the difficulty of having to continually control the movements of the camera while one is communicating. In this paper, we present a prototype, neural network based system that learns the characteristics of a person\u2019s head in real time and automatically tracks it around the room, thus relieving the user of much of this burden.\n\n[Figure 1: Schematic hardware diagram of Marvin, our head tracking system. The diagram shows a color CCD camera (eye), directional microphones (ears), servo motors (oculomotor muscles), and a PC with serial port, frame grabber, and sound card, together with reinforcement signals from an IR detector and a GUI mouse.]\n\nThe camera movements in this video conferencing system closely resemble the movements of human eyes. The task of the biological oculomotor system is to direct \"interesting\" parts of the visual world onto the small, high-resolution areas of the retinas. For this task, complex neural circuits have evolved in order to control the eye movements. Some examples include the saccadic and smooth pursuit systems that allow the eyes to rapidly acquire and track moving objects [1, 2]. Similarly, an active video conferencing system also needs to determine the appropriate face or feature to follow in the video stream. 
Then the camera must track that person\u2019s movements over time and transmit the image to the other party.\n\nIn the past few years, the problem of face detection in images and video has attracted considerable attention [3, 4, 5]. Rule-based methods have concentrated on looking for generic characteristics of faces such as oval shapes or skin hue. Since these types of algorithms are fairly simple to implement, they are commonly found in real-time systems [6, 7]. But because other objects have shapes and colors similar to faces, these systems can also be easily fooled. A potentially more robust approach is to use a convolutional neural network to learn the appropriate features of a face [8, 9]. Because most such implementations learn in batch mode, they are beset by the difficulty of constructing a large enough training set of labelled images with and without faces. In this paper, we present a video-based system that uses online supervisory signals to train a convolutional neural network. Fast online adaptation of the network\u2019s weights allows the neural network to learn how to discriminate an individual head at the beginning of a session. This enables the system to robustly track the head even in the presence of other moving objects.\n\n2 Hardware Implementation\n\nFigure 1 shows a schematic of the tracking system we have constructed and have named \"Marvin\" because of an early version\u2019s similarity to a cartoon character. Marvin\u2019s eye consists of a small CCD camera with a 65\u00b0 field of view that is attached to a motorized platform. Two RC servo motors give Marvin the ability to rapidly pan and tilt over a wide range of viewing angles, with a typical maximum velocity of 300 deg/sec. The system also includes two microphones or ears that give Marvin the ability to locate auditory cues. 
Integrating auditory information with visual inputs allows the system to find salient objects better than with either sound or video alone. But these proceedings will focus exclusively on how a visual representation is learned.\n\n[Figure 2: Preprocessing of the video stream. Luminance, chromatic, and motion information are separately represented in the Y, U, V, D channels at multiple resolutions.]\n\nMarvin is able to learn to track a visual target using two different sources of supervisory signals. One method of training uses a small 38 kHz modulated infrared light emitter (\u03bb \u2248 900 nm) attached to the object that needs to be tracked. A heat filter renders the infrared light invisible to Marvin\u2019s video camera so that the system does not merely learn to follow this signal. But mounted next to the CCD camera and moving with it is a small infrared detector with a collimating lens that signals when the object is located within a narrow angular cone in the direction that the camera is pointing. This reinforcement signal can then be used to train the weights of the neural network. Another, more natural way for the system to learn occurs in an actual video conferencing scenario. In this situation, a user who is actively watching the video stream has manual override control of the camera using graphical user interface inputs. Whenever the user repositions the camera to a new location, the neural network then adjusts its weights to track whatever is in the center portion of the image.\n\nSince Marvin was built from readily available commercial components, the cost of the system not including the PC was under $500. The input devices and motors are all controlled by the computer using custom-written Matlab drivers that are available for both Microsoft Windows and the Linux operating system. 
The image processing computations as well as the graphical user interface are then easily implemented as simple Matlab operations and function calls. The following section describes the head tracking neural network in more detail.\n\n3 Neural Network Architecture\n\nMarvin uses a convolutional neural network architecture to detect a head within its field of view. The video stream from the CCD camera is first digitized with a video capture board into a series of raw 120 \u00d7 160 RGB images as shown in Figure 2. Each RGB color image is then converted into its YUV representation, and a difference (D) image is also computed as the absolute value of the difference from the preceding frame. Of the four resulting images, the Y component represents the luminance or grayscale information while the U and V channels contain the chromatic or color information. Motion information in the video stream is captured by the D image, where moving objects appear highlighted.\n\nThe four YUVD channels are then subsampled successively to yield representations at lower and lower resolutions. The resulting \"image pyramids\" allow the network to achieve recognition invariance across many different scales without having to train separate neural networks for each resolution. Instead, a single neural network with the same set of weights is run with the different resolutions as inputs, and the maximally active resolution and position is selected.\n\n[Figure 3: The neural network uses a convolutional architecture to integrate the different sources of information and determine the maximally salient object. The Y, U, V, D inputs are filtered by kernels W_Y, W_U, W_V, W_D to form hidden units, which are combined into a saliency map followed by a winner-take-all stage.]\n\nMarvin uses the convolutional neural network architecture shown in Figure 3 to locate salient objects at the different resolutions. 
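As a concrete illustration, the YUVD preprocessing and subsampling described above can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' Matlab implementation; the conversion coefficients and a three-level pyramid are illustrative assumptions.

```python
import numpy as np

def preprocess(rgb, prev_y=None, levels=3):
    # Convert an RGB frame (H x W x 3, floats in [0, 1]) to the four
    # YUVD channels: luminance, two chrominance maps, and a frame difference.
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b          # luminance (illustrative weights)
    u = 0.492 * (b - y)                            # chrominance
    v = 0.877 * (r - y)
    d = np.abs(y - prev_y) if prev_y is not None else np.zeros_like(y)
    # Build an image pyramid per channel by repeated 2x2 average subsampling.
    pyramids = []
    for ch in (y, u, v, d):
        pyr, cur = [ch], ch
        for _ in range(levels - 1):
            h, w = cur.shape[0] // 2 * 2, cur.shape[1] // 2 * 2
            cur = cur[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
            pyr.append(cur)
        pyramids.append(pyr)
    return pyramids  # [Y, U, V, D], each a list of arrays, finest scale first
```

Each channel is halved in resolution at every level, so a single set of kernels scanned over the pyramid responds to heads of different apparent sizes.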
The YUVD input images are filtered with separate 16 \u00d7 16 kernels, denoted by W_Y, W_U, W_V, and W_D respectively. This results in the filtered images \\tilde{Y}^s, \\tilde{U}^s, \\tilde{V}^s, \\tilde{D}^s:\n\n\\tilde{A}^s(i,j) = W_A \\ast A^s = \\sum_{i',j'} W_A(i',j') A^s(i+i', j+j')   (1)\n\nwhere s denotes the scale resolution of the inputs, and A is any of the Y, U, V, or D channels. These filtered images represent a single layer of hidden units in the neural network. These hidden units are then combined to form the saliency map X^s in the following manner:\n\nX^s(i,j) = c_Y g[\\tilde{Y}^s(i,j)] + c_U g[\\tilde{U}^s(i,j)] + c_V g[\\tilde{V}^s(i,j)] + c_D g[\\tilde{D}^s(i,j)] + c.   (2)\n\nSince g(x) = tanh(x) is sigmoidal, the saliency X^s is computed as a nonlinear, pixel-by-pixel combination of the hidden units. The scalar variables c_Y, c_U, c_V, and c_D represent the relative importance of the different luminance, chromatic, and motion channels in the overall saliency of an object.\n\nWith the bias term c, the function g[X^s(i,j)] may then be thought of as the relative probability that a head exists at location (i,j) at input resolution s. 
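The filtering and combination steps of Eqs. (1)-(2) amount to a cross-correlation of each channel with its kernel, followed by a weighted tanh sum. A minimal NumPy sketch (the shapes are hypothetical; this is not the authors' Matlab code):

```python
import numpy as np

def filter_image(a, w):
    # Valid cross-correlation of channel image a with kernel w, Eq. (1):
    # out(i, j) = sum_{i', j'} w(i', j') * a(i + i', j + j')
    kh, kw = w.shape
    out_h, out_w = a.shape[0] - kh + 1, a.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for di in range(kh):
        for dj in range(kw):
            out += w[di, dj] * a[di:di + out_h, dj:dj + out_w]
    return out

def saliency_map(channels, kernels, c, bias):
    # Eq. (2): X(i, j) = sum over channels of c_A * tanh(filtered_A(i, j)) + bias
    x = bias
    for a, w, ca in zip(channels, kernels, c):
        x = x + ca * np.tanh(filter_image(a, w))
    return x
```

The winner-take-all output of Eq. (3) is then simply the argmax of g[X^s(i,j)] over locations and scales, e.g. np.unravel_index(np.argmax(x), x.shape) at each scale.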
The final output of the neural network is then determined in a competitive manner by finding the location (i_m, j_m) and scale s_m of the best possible match:\n\ng[X_m] = g[X^{s_m}(i_m, j_m)] = \\max_{i,j,s} g[X^s(i,j)].   (3)\n\nAfter processing the visual inputs in this manner, saccadic camera movements are generated in order to keep the maximally salient object located near the center of the field of view.\n\n4 Training and Results\n\nEither GUI user inputs or the infrared detector may be used as a supervisory signal to train the kernels W_A and scalar weights c_A of the neural network. The neural network is updated when the maximally salient location (i_m, j_m) of the neural network does not correspond to the desired object\u2019s true position (i_n, j_n) as identified by the external supervisory signal. A cost function proportional to the sum of the squared error terms at the maximal location and the new desired location is used for training:\n\ne_m^2 = |g_m - g[X^{s_m}(i_m, j_m)]|^2,   (4)\n\ne_n^2 = \\min_s |g_n - g[X^s(i_n, j_n)]|^2.   (5)\n\nIn the following examples, the constants g_m = 0 and g_n = 1 are used. The gradients of Eqs. (4)-(5) are then backpropagated through the convolutional network [8, 10], resulting in the following update rules:\n\n\\Delta c_A = \\alpha e_m g'(X_m) g[\\tilde{A}(i_m, j_m)] + \\alpha e_n g'(X_n) g[\\tilde{A}(i_n, j_n)],   (6)\n\n\\Delta W_A = \\alpha e_m g'(X_m) g'(\\tilde{A}_m) c_A A_m + \\alpha e_n g'(X_n) g'(\\tilde{A}_n) c_A A_n.   (7)\n\nIn typical batch learning applications of neural networks, the learning rate \\alpha is set to be some small positive number. 
However, in this case it is desirable for Marvin to learn to track a head in a new environment as quickly as possible. Thus, rapid adaptation of the weights during even a single training example is needed. A natural way of doing this is to use a fairly large learning rate (\\alpha = 0.1), and to repeatedly apply the update rules in Eqs. (6)-(7) until the calculated maximally salient location is very close to the actual desired position.\n\nAn example of how quickly Marvin is able to learn to track one of the authors as he moved around his office is given by the learning curve in Figure 4. The weights were first initialized to small random values, and Marvin was corrected in an online fashion using mouse inputs to look at the author\u2019s head. After only a few seconds of training with a processing time loop of around 200 ms, the system was able to locate the head to within four pixels of accuracy, as determined by hand labelling the video data afterwards.\n\n[Figure 4: Fast online adaptation of the neural network. The head location error in pixels in a 120 \u00d7 160 image is plotted as a function of frame number (5 frames/sec).]\n\nAs saccadic eye movements were initiated at the times indicated by the arrows in Fig. 4, new environments of the office were sampled and an occasional large error is seen. However, over time as these errors are corrected, the neural network learns to robustly discriminate the head from the office surroundings.\n\n5 Discussion\n\nFigure 5 shows the inputs and weights of the network after a minute of training as the author walked around his office. 
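The online training step of Eqs. (4)-(7) can be sketched as a gradient update applied at a single image location, once at the wrongly salient spot with target g_m = 0 and once at the true head position with target g_n = 1. This is a minimal NumPy illustration under assumed 16 x 16 patch framing; the variable names are hypothetical and this is not the authors' implementation.

```python
import numpy as np

def dg(x):
    # Derivative of g(x) = tanh(x).
    return 1.0 - np.tanh(x) ** 2

def update_at(patches, kernels, c, bias, g_target, lr=0.1):
    # One update of Eqs. (6)-(7) at a single location. patches holds the four
    # 16x16 YUVD inputs there; g_target is 0 (wrongly salient spot) or 1 (head).
    tilde = [np.sum(w * p) for w, p in zip(kernels, patches)]    # Eq. (1) at one pixel
    x = bias + sum(ca * np.tanh(t) for ca, t in zip(c, tilde))   # Eq. (2)
    e = g_target - np.tanh(x)                                    # signed error, cf. Eqs. (4)-(5)
    for a in range(4):
        step = lr * e * dg(x)
        kernels[a] += step * dg(tilde[a]) * c[a] * patches[a]    # Eq. (7)
        c[a] += step * np.tanh(tilde[a])                         # Eq. (6)
    return e
```

With the large learning rate lr = 0.1, repeated applications of this step at the desired location drive the saliency output toward the target, mirroring the repeated updates described above.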
The kernels necessarily appear a little smeared because they are invariant to slight changes in head position, rotation, and scale. But they clearly depict the dark hair, facial features, and skin color of the head. The relative weighting (c_Y, c_U, c_V > c_D) of the different input channels shows that the luminance and color information are the most reliable for tracking the head. This is probably because it is relatively difficult to distinguish the head from other moving body parts in the frame difference images.\n\nWe are currently considering more complicated neural network architectures for combining the different input streams to give better tracking performance. However, this example shows how a simple convolutional architecture can be used to automatically integrate different visual cues to robustly track a head. Moreover, by using fast online adaptation of the neural network weights, the system is able to learn without needing large hand-labelled training sets and is also able to rapidly accommodate changing environments. Future improvements in hardware and neural network architectures and algorithms are still necessary, however, in order to approach human speeds and performance in this type of sensory processing and recognition task.\n\nWe acknowledge the support of Bell Laboratories, Lucent Technologies. We also thank M. Fee, A. Jacquin, S. Levinson, E. Petajan, G. Pingali, and E. Rietman for helpful discussions.\n\n[Figure 5: Example showing the inputs and weights used in tracking a head. The head position as calculated by the neural network is marked with a box. The learned channel weights are c_Y = 0.15, c_U = 0.12, c_V = 0.11, and c_D = 0.08.]\n\nReferences\n\n[1] Horiuchi, TK, Bishofberger, B & Koch, C (1994). An analog VLSI saccadic eye movement system. 
Advances in Neural Information Processing Systems 6, 582-589.\n\n[2] Rao, RPN, Zelinsky, GJ, Hayhoe, MM & Ballard, DH (1996). Modeling saccadic targeting in visual search. Advances in Neural Information Processing Systems 8, 830-836.\n\n[3] Sung, KK & Poggio, T (1994). Example-based learning for view-based human face detection. Proc. 23rd Image Understanding Workshop, 843-850.\n\n[4] Eleftheriadis, A & Jacquin, A (1995). Automatic face location detection and tracking for model-assisted coding of video teleconferencing sequences at low bit-rates. Signal Processing: Image Communication 7, 231.\n\n[5] Petajan, E & Graf, HP (1996). Robust face feature analysis for automatic speechreading and character animation. Proc. 2nd Int. Conf. Automatic Face and Gesture Recognition, 357-362.\n\n[6] Darrell, T, Maes, P, Blumberg, B & Pentland, AP (1994). A novel environment for situated vision and behavior. Proc. IEEE Workshop for Visual Behaviors, 68-72.\n\n[7] Yang, J & Waibel, A (1996). A real-time face tracker. Proc. 3rd IEEE Workshop on Application of Computer Vision, 142-147.\n\n[8] Nowlan, SJ & Platt, JC (1995). A convolutional neural network hand tracker. Advances in Neural Information Processing Systems 7, 901-908.\n\n[9] Rowley, HA, Baluja, S & Kanade, T (1996). Human face detection in visual scenes. Advances in Neural Information Processing Systems 8, 875-881.\n\n[10] Le Cun, Y, et al. (1990). 
Handwritten digit recognition with a back-propagation network. Advances in Neural Information Processing Systems 2, 396-404.\n", "award": [], "sourceid": 5221, "authors": [{"given_name": "Daniel", "family_name": "Lee", "institution": ""}, {"given_name": "H.", "family_name": "Seung", "institution": ""}]}