{"title": "Learning Generative Models with the Up Propagation Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 605, "page_last": 611, "abstract": "Up-\u0002propagation is an algorithm for inverting and learning neural network\r\ngenerative models\u0003 Sensory input is processed by inverting a model that\r\ngenerates patterns from hidden variables using top\u0002down connections\u0003\r\nThe inversion process is iterative\u0004 utilizing a negative feedback loop that\r\ndepends on an error signal propagated by bottom\u0002up connections\u0003 The\r\nerror signal is also used to learn the generative model from examples\u0003\r\nThe algorithm is benchmarked against principal component analysis in\r\nexperiments on images of handwritten digits\u0003.", "full_text": "Learning Generative Models with the\n\nUp(cid:2)Propagation Algorithm\n\nJong(cid:2)Hoon Oh and H(cid:3) Sebastian Seung\n\nBell Labs(cid:2) Lucent Technologies\n\nMurray Hill(cid:2) NJ \u0007\t\u0007\u0004\n\nfjhoh(cid:2)seungg(cid:3)bell(cid:4)labs(cid:5)com\n\nAbstract\n\nUp(cid:2)propagation is an algorithm for inverting and learning neural network\ngenerative models(cid:3) Sensory input is processed by inverting a model that\ngenerates patterns from hidden variables using top(cid:2)down connections(cid:3)\nThe inversion process is iterative(cid:4) utilizing a negative feedback loop that\ndepends on an error signal propagated by bottom(cid:2)up connections(cid:3) The\nerror signal is also used to learn the generative model from examples(cid:3)\nThe algorithm is benchmarked against principal component analysis in\nexperiments on images of handwritten digits(cid:3)\n\nIn his doctrine of unconscious inference(cid:2) Helmholtz argued that perceptions are\nformed by the interaction of bottom(cid:7)up sensory data with top(cid:7)down expectations(cid:8)\nAccording to one interpretation of this doctrine(cid:2) perception is a procedure of sequen(cid:7)\ntial hypothesis 
testing. We propose a new algorithm, called up-propagation, that realizes this interpretation in layered neural networks. It uses top-down connections to generate hypotheses, and bottom-up connections to revise them.

It is important to understand the difference between up-propagation and its ancestor, the backpropagation algorithm [1]. Backpropagation is a learning algorithm for recognition models. As shown in Figure 1a, bottom-up connections recognize patterns, while top-down connections propagate an error signal that is used to learn the recognition model.

In contrast, up-propagation is an algorithm for inverting and learning generative models, as shown in Figure 1b. Top-down connections generate patterns from a set of hidden variables. Sensory input is processed by inverting the generative model, recovering hidden variables that could have generated the sensory data. This operation is called either pattern recognition or pattern analysis, depending on the meaning of the hidden variables. Inversion of the generative model is done iteratively, through a negative feedback loop driven by an error signal from the bottom-up connections. The error signal is also used for learning the connections in the generative model.

Figure 1: Bottom-up and top-down processing in neural networks. (a) Backprop network: bottom-up connections perform recognition, top-down connections carry the error signal. (b) Up-prop network: top-down connections perform generation, bottom-up connections carry the error signal.

Up-propagation can be regarded as a generalization of principal component analysis (PCA) and its variants like Conic [2] to nonlinear, multilayer generative models. Our experiments with images of handwritten digits
demonstrate that up-propagation learns a global, nonlinear model of a pattern manifold. With its global parametrization, this model is distinct from locally linear models of pattern manifolds [3].

1 INVERTING THE GENERATIVE MODEL

The generative model is a network of L + 1 layers of neurons, with layer 0 at the bottom and layer L at the top. The vectors x_t, t = 0, ..., L, are the activations of the layers. The pattern x_0 is generated from the hidden variables x_L by a top-down pass through the network,

    x_{t-1} = f(W_t x_t),   t = L, ..., 1.   (1)

The nonlinear function f acts on vectors component by component. The matrix W_t contains the synaptic connections from the neurons in layer t to the neurons in layer t - 1. A bias term b_{t-1} can be added to the argument of f, but is omitted here. It is convenient to define auxiliary variables x̄_t by x_t = f(x̄_t). In terms of these auxiliary variables, the top-down pass is written as

    x̄_{t-1} = W_t f(x̄_t).   (2)

Given a sensory input d, the top-down generative model can be inverted by finding hidden variables x_L that generate a pattern x_0 matching d. If some of the hidden variables represent the identity of the pattern, the inversion operation is called recognition. Alternatively, the hidden variables may just be a more compact representation of the pattern, in which case the operation is called analysis or encoding. The inversion is done iteratively, as described below.

In the following, the operator
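⊙ will be used for the elementwise product of two vectors.

The top-down pass (1) itself is straightforward to state in code. The following is a minimal NumPy sketch; the function names and layer shapes are illustrative assumptions, and the logistic choice of f matches the paper's experiments:

```python
import numpy as np

def f(z):
    # logistic nonlinearity, as used in the paper's experiments
    return 1.0 / (1.0 + np.exp(-z))

def generate(Ws, xL):
    """Top-down pass (eq. 1): x_{t-1} = f(W_t x_t) for t = L, ..., 1.
    Ws = [W_1, ..., W_L], where W_t maps layer t down to layer t - 1."""
    x = xL
    for W in reversed(Ws):   # t = L, ..., 1
        x = f(W @ x)
    return x                 # the generated pattern x_0
```

For example, a 256-25-9 architecture would use Ws = [W_1, W_2] with W_1 of shape (256, 25) and W_2 of shape (25, 9). As stated next in the text, the operator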
⊙ denotes elementwise multiplication of two vectors, so that z = x ⊙ y means z_i = x_i y_i for all i. The bottom-up pass starts with the mismatch between the sensory data d and the generated pattern x_0,

    δ_0 = f'(x̄_0) ⊙ (d - x_0),   (3)

which is propagated upwards by

    δ_t = f'(x̄_t) ⊙ (W_t^T δ_{t-1}).   (4)

When the error signal reaches the top of the network, it is used to update the hidden variables x_L,

    Δx_L ∝ W_L^T δ_{L-1}.   (5)

This update closes the negative feedback loop. Then a new pattern x_0 is generated by a top-down pass (1), and the process starts over again.

This iterative inversion process performs gradient descent on the cost function (1/2)|d - x_0|^2, subject to the constraints (1). This can be proved using the chain rule, as in the traditional derivation of the backprop algorithm. Another method of proof is to add the equations (1) as constraints, using Lagrange multipliers:

    (1/2)|d - f(x̄_0)|^2 + Σ_{t=1}^{L} δ_{t-1}^T [x̄_{t-1} - W_t f(x̄_t)].   (6)

This derivation has the advantage that the bottom-up activations δ_t have an interpretation as Lagrange multipliers.

Inverting the generative model by negative feedback can be interpreted as a process of sequential hypothesis testing. The top-down connections generate a hypothesis about the sensory data. The bottom-up connections
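carry the resulting mismatch back toward the top.

The full generate-test-revise loop of equations (1) and (3)-(5) can be sketched in NumPy as follows. This is a minimal illustration rather than the paper's code: the names invert, eta, and steps are hypothetical, f is taken to be logistic as in the experiments, and the stepsize is simply held fixed (as it was for the dynamics shown in Figure 3):

```python
import numpy as np

def f(z):  return 1.0 / (1.0 + np.exp(-z))   # logistic nonlinearity
def df(z): return f(z) * (1.0 - f(z))        # its derivative f'

def invert(Ws, d, xL, eta=0.5, steps=200):
    """Iterative inversion: adjust the hidden variables xL until the
    generated pattern matches the sensory data d (eqs. 1, 3-5)."""
    L = len(Ws)                      # Ws = [W_1, ..., W_L]
    for _ in range(steps):
        # top-down generation (eq. 1), keeping the pre-activations xbar_t
        xbar = [None] * L            # xbar[t] holds xbar_t for t = 0, ..., L-1
        x = xL
        for t in range(L, 0, -1):
            xbar[t - 1] = Ws[t - 1] @ x   # xbar_{t-1} = W_t x_t
            x = f(xbar[t - 1])
        # bottom-up error propagation (eqs. 3-4)
        delta = df(xbar[0]) * (d - x)                    # delta_0
        for t in range(1, L):
            delta = df(xbar[t]) * (Ws[t - 1].T @ delta)  # delta_t
        # update the hidden variables (eq. 5)
        xL = xL + eta * (Ws[L - 1].T @ delta)
    return xL
```

Each pass of this loop is one cycle of hypothesis testing; within it, the bottom-up connections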
propagate an error signal that is the disagreement between the hypothesis and the data. When the error signal reaches the top, it is used to generate a revised hypothesis, and the generate-test-revise cycle starts all over again. Perception is the convergence of this feedback loop to the hypothesis that is most consistent with the data.

2 LEARNING THE GENERATIVE MODEL

The synaptic weights W_t determine the types of patterns that the network is able to generate. To learn from examples, the weights are adjusted to improve the network's generation ability. A suitable cost function for learning is the reconstruction error (1/2)|d - x_0|^2, averaged over an ensemble of examples. Online gradient descent with respect to the synaptic weights is performed by a learning rule of the form

    ΔW_t ∝ δ_{t-1} x_t^T.   (7)

The same error signal δ that was used to invert the generative model is also used to learn it.

The batch form of the optimization is compactly written using matrix notation. To do this, we define the matrices D, X_0, ..., X_L whose columns are the vectors d, x_0, ..., x_L corresponding to examples in the training set. Then computation and learning are the minimization of

    min_{X_L, W_t} (1/2)|D - X_0|^2,   (8)

subject to the constraint that

    X_{t-1} = f(W_t X_t),   t = 1, ..., L.   (9)

In other words, up-prop is a dual minimization with respect to hidden variables and synaptic connections. Computation minimizes with respect to the hidden variables X_L, and learning minimizes with respect to the synaptic
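weight matrices.

The online weight update (7) can be sketched in the same style. This is an illustrative NumPy fragment under assumed names (learn_step, lr) and a logistic f; a single call performs one top-down pass, one bottom-up pass, and one application of ΔW_t ∝ δ_{t-1} x_t^T:

```python
import numpy as np

def f(z):  return 1.0 / (1.0 + np.exp(-z))   # logistic nonlinearity
def df(z): return f(z) * (1.0 - f(z))        # its derivative f'

def learn_step(Ws, d, xL, lr=0.1):
    """One online up-prop learning step (eq. 7), reusing the same
    error signal delta that drives inversion.  Ws = [W_1, ..., W_L]."""
    L = len(Ws)
    xs, xbar = [None] * (L + 1), [None] * L
    xs[L] = xL                           # hidden variables at the top
    for t in range(L, 0, -1):            # top-down pass (eq. 1)
        xbar[t - 1] = Ws[t - 1] @ xs[t]
        xs[t - 1] = f(xbar[t - 1])
    deltas = [None] * L                  # bottom-up pass (eqs. 3-4)
    deltas[0] = df(xbar[0]) * (d - xs[0])
    for t in range(1, L):
        deltas[t] = df(xbar[t]) * (Ws[t - 1].T @ deltas[t - 1])
    for t in range(1, L + 1):            # eq. 7: Delta W_t ~ delta_{t-1} x_t^T
        Ws[t - 1] += lr * np.outer(deltas[t - 1], xs[t])
    return Ws
```

In the full algorithm, this weight update alternates with updates of the hidden variables; computation varies the hidden variables, while learning varies the synaptic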
weight matrices W_t.

From the geometric viewpoint, up-propagation is an algorithm for learning pattern manifolds. The top-down pass (1) maps an n_L-dimensional vector x_L to an n_0-dimensional vector x_0. Thus the generative model parametrizes a continuous n_L-dimensional manifold embedded in n_0-dimensional space. Inverting the generative model is equivalent to finding the point on the manifold that is closest to the sensory data. Learning the generative model is equivalent to deforming the manifold to fit a database of examples.

Figure 2: One-step generation of handwritten digits. Weights of the 256-9 up-prop network (left) versus the top 9 principal components (right).

Figure 3: Iterative inversion of a generative model as sequential hypothesis testing. A fully trained 256-9 network is inverted to generate an approximation to a target image that was not previously seen during training. Panels show the generated image x_0 and the top-layer activities x_1 at t = 0, 1, 10, 100, 1000, together with the target image. The stepsize of the dynamics was fixed to 0.2 to show the time evolution of the system.

Pattern manifolds are relevant when patterns vary continuously. For example, the variations in the image of a three-dimensional object produced by changes of viewpoint are clearly continuous, and can be described by the action of a transformation group on a prototype pattern. Other types of variation, such as deformations in the shape of the object, are also continuous, even though they may not be readily describable in
terms of transformation groups. Continuous variability is clearly not confined to visual images, but is present in many other domains. Many existing techniques for modeling pattern manifolds, such as PCA or PCA mixtures [3], depend on linear or locally linear approximations to the manifold. Up-prop constructs a globally parametrized, nonlinear manifold.

3 ONE-STEP GENERATION

The simplest generative model of the form (1) has just one step (L = 1). Up-propagation minimizes the cost function

    min_{X_1, W_1} (1/2)|D - f(W_1 X_1)|^2.   (10)

For a linear f this reduces to PCA, as the cost function is minimized when the vectors in the weight matrix W_1 span the same space as the top principal components of the data D.

Up-propagation with a one-step generative model was applied to the USPS database [4], which consists of example images of handwritten digits. Each of the 7291 training and 2007 testing images was normalized to a 16 x 16 grid with pixel intensities in the range [0, 1]. A separate model was trained for each digit class. The nonlinearity f was the logistic function. Batch optimization of (10) was done by

Figure 4: Reconstruction error for 256-n networks as a function of n (training and test curves for both up-prop and PCA). The error of PCA with n principal components is shown for comparison. The
up-prop network performs better on both the training set and the test set.

gradient descent with adaptive stepsize control by the Armijo rule [5]. In most cases, the stepsize varied between 10^-1 and 10^-3, and the optimization usually converged within 10^3 epochs. Figure 2 shows the weights of a 256-9 network that was trained on 731 different images of the digit "two." Each of the 9 subimages is the weight vector of a top-level neuron. The top 9 principal components are also shown for comparison.

Figure 3 shows the time evolution of a fully trained 256-9 network during iterative inversion. The error signal from the bottom layer x_0 quickly activates the top layer x_1. At early times, all the top-layer neurons have similar activation levels. However, the neurons with weight vectors more relevant to the target image soon become dominant, and the other neurons are deactivated.

The reconstruction error (10) of the up-prop network was much better than that of PCA. We trained 10 different up-prop networks, one for each digit, and these were compared with 10 corresponding PCA models. Figure 4 shows the average squared error per pixel that resulted. A 256-12 up-prop network performed as well as PCA with 36 principal components.

4 TWO-STEP GENERATION

Two-step generation is a richer model, and is learned using the cost function

    min_{X_2, W_1, W_2} (1/2)|D - f(W_1 f(W_2 X_2))|^2.   (11)

Note that a nonlinear f is necessary for
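two-step generation to add representational power.

This is easy to check directly: with a linear f the two weight matrices collapse into a single one, so a linear two-step model can generate nothing that a one-step model cannot. A short NumPy verification (the dimensions 256, 25, and 9 echo the networks used in the paper; the random matrices are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((256, 25))   # maps layer 1 down to layer 0
W2 = rng.standard_normal((25, 9))     # maps layer 2 down to layer 1
X2 = rng.standard_normal((9, 50))     # 50 columns of hidden variables

# With a linear f (f(z) = z), two-step generation collapses to one step:
two_step = W1 @ (W2 @ X2)
one_step = (W1 @ W2) @ X2             # a single 256 x 9 weight matrix
assert np.allclose(two_step, one_step)
```

A nonlinearity between the layers prevents this collapse, which is why a nonlinear f is needed for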
two-step generation to have more representational power than one-step generation. When this two-step generative model was trained on the USPS database, the weight vectors in W_1 learned features resembling principal components. The activities of the X_1 neurons tended to be close to their saturated values of one or zero.

The reconstruction error of the two-step generative network was compared to that of the one-step generative network with the same number of neurons in the top layer. Our 256-25-9 network outperformed our 256-9 network on the test set, though both networks used nine hidden variables to encode the sensory data. However, the learning time was much longer, and iterative inversion was also slow. While up-prop for one-step generation converged within several hundred epochs, up-prop for two-step generation often needed several thousand epochs or more to converge. We often found long plateaus in the learning curves, which may be due to the permutation symmetry of the network architecture [6].

5 DISCUSSION

To summarize the experiments discussed above, we constructed separate generative models, one for each digit class. Relative to PCA, each generative model was superior at encoding digits from its corresponding class. This enhanced generative ability was due to the use of nonlinearity.

We also tried to use these generative models for recognition. A test digit was classified by inverting all the generative models, and then choosing the one best able to generate the digit. Our tests of this recognition method were not encouraging. The nonlinearity of up-propagation tended to improve the generation ability of models
corresponding to all classes, not just the model corresponding to the correct classification of the digit. Therefore the improved encoding performance did not immediately transfer to improved recognition.

We have not tried the experiment of training one generative model on all the digits, with some of the hidden variables representing the digit class. In this case, pattern recognition could be done by inverting a single generative model. It remains to be seen whether this method will work.

Iterative inversion was surprisingly fast, as shown in Figure 3, and gave solutions of surprisingly good quality in spite of potential problems with local minima, as shown in Figure 4. In spite of these virtues, iterative inversion is still a problematic method. We do not know whether it will perform well if a single generative model is trained on multiple pattern classes. Furthermore, it seems a rather indirect way of doing pattern recognition.

The up-prop generative model is deterministic, which handicaps its modeling of pattern variability. The model can be dressed up in probabilistic language by defining a prior distribution P(x_L) for the hidden variables, and adding Gaussian noise to x_0 to generate the sensory data. However, this probabilistic appearance is only skin deep, as the sequence of transformations from x_L to x_0 is still completely deterministic. In a truly probabilistic model, like a belief network, every layer of the generation process adds variability.

In conclusion, we briefly compare up-propagation to other algorithms and architectures.

1. In backpropagation [1], only the recognition model is explicit. Iterative gradient descent
methods can be used to invert the recognition model, though this implicit generative model generally appears to be inaccurate [7, 8].

2. Up-propagation has an explicit generative model, and recognition is done by inverting the generative model. The accuracy of this implicit recognition model has not yet been tested empirically. Iterative inversion of generative models has also been proposed for linear networks [2, 9] and probabilistic belief networks [10].

3. In the autoencoder [11] and the Helmholtz machine [12], there are separate models of recognition and generation, both explicit. Recognition uses only bottom-up connections, and generation uses only top-down connections. Neither process is iterative. Both processes can operate completely independently; they only interact during learning.

4. In attractor neural networks [13, 14] and the Boltzmann machine [15], recognition and generation are performed by the same recurrent network. Each process is iterative, and each utilizes both bottom-up and top-down connections. Computation in these networks is chiefly based on positive, rather than negative, feedback.

Backprop and up-prop suffer from a lack of balance in their treatment of bottom-up and top-down processing. The autoencoder and the Helmholtz machine suffer from an inability to use iterative dynamics for computation. Attractor neural networks lack these deficiencies, so there is incentive to solve the problem of learning attractors [14].

This work was supported by Bell
Laboratories. JHO was partly supported by the Research Professorship of the LG-Yonam Foundation. We are grateful to Dan Lee for helpful discussions.

References

[1] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by back-propagating errors. Nature, 323:533-536, 1986.

[2] D. D. Lee and H. S. Seung. Unsupervised learning by convex and conic coding. Adv. Neural Info. Proc. Syst., 9:515-521, 1997.

[3] G. E. Hinton, P. Dayan, and M. Revow. Modeling the manifolds of images of handwritten digits. IEEE Trans. Neural Networks, 8:65-74, 1997.

[4] Y. LeCun et al. Learning algorithms for classification: a comparison on handwritten digit recognition. In J.-H. Oh, C. Kwon, and S. Cho, editors, Neural Networks: The Statistical Mechanics Perspective, pages 261-276, Singapore, 1995. World Scientific.

[5] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.

[6] K. Kang, J.-H. Oh, C. Kwon, and Y. Park. Generalization in a two-layer neural network. Phys. Rev. E, 48:4805-4809, 1993.

[7] J. Kindermann and A. Linden. Inversion of neural networks by gradient descent. Parallel Computing, 14:277-286, 1990.

[8] Y. Lee. Handwritten digit recognition using K nearest-neighbor, radial-basis function, and backpropagation neural networks. Neural Comput., 3:441-449, 1991.

[9] R. P. N. Rao and D. H. Ballard. Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Comput., 9:721-763, 1997.

[10] L. K. Saul, T. Jaakkola, and M. I. Jordan. Mean field theory for sigmoid belief networks. J. Artif. Intell. Res., 4:61-76, 1996.

[11] G. W. Cottrell, P. Munro, and D. Zipser. Image compression by back propagation: an example of extensional programming. In N. E. Sharkey, editor, Models of Cognition: A Review of Cognitive Science. Ablex, Norwood, NJ, 1989.

[12] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. The \"wake-sleep\" algorithm for unsupervised neural networks. Science, 268:1158-1161, 1995.

[13] H. S. Seung. Pattern analysis and synthesis in attractor neural networks. In K.-Y. M. Wong, I. King, and D.-Y. Yeung, editors, Theoretical Aspects of Neural Computation: A Multidisciplinary Perspective, Singapore, 1997. Springer-Verlag.

[14] H. S. Seung. Learning continuous attractors in recurrent networks. Adv. Neural Info. Proc. Syst., 11, 1998.

[15] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9:147-169, 1985.
", "award": [], "sourceid": 5220, "authors": [{"given_name": "Jong-Hoon", "family_name": "Oh", "institution": "Bell Labs"}, {"given_name": "H. Sebastian", "family_name": "Seung", "institution": "Lucent Technologies"}]}