{"title": "Constant-Time Loading of Shallow 1-Dimensional Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 863, "page_last": 870, "abstract": null, "full_text": "Constant-Time Loading of Shallow 1-Dimensional \n\nNetworks \n\nStephen Judd \n\nSiemens Corporate Research, \n\n755 College Rd. E., \nPrinceton, NJ 08540 \n\njudd@learning.siemens.com \n\nAbstract \n\nThe complexity of learning in shallow I-Dimensional neural networks has \nbeen shown elsewhere to be linear in the size of the network. However, \nwhen the network has a huge number of units (as cortex has) even linear \ntime might be unacceptable. Furthermore, the algorithm that was given to \nachieve this time was based on a single serial processor and was biologically \nimplausible. \nIn this work we consider the more natural parallel model of processing \nand demonstrate an expected-time complexity that is constant (i.e. \ndependent of the size of the network). This holds even when inter-node \ncommunication channels are short and local, thus adhering to more bio(cid:173)\nlogical and VLSI constraints. \n\nin(cid:173)\n\n1 \n\nIntroduction \n\nShallow neural networks are defined in [J ud90]; the definition effectively limits the \ndepth of networks while allowing the width to grow arbitrarily, and it is used as a \nmodel of neurological tissue like cortex where neurons are arranged in arrays tens \nof millions of neurons wide but only tens of neurons deep. Figure I exemplifies \na family of networks which are not only shallow but \"I-dimensional\" as well-we \nallow the network to be extended as far as one liked in width (i.e. to the right) by \nrepeating the design segments shown. The question we address is how learning time \nscales with the width. In [Jud88], it was proved that the worst case time complexity \n863 \n\n\f864 \n\nJudd \n\nof training this family is linear in the width. 
But the proof involved an algorithm that was biologically very implausible, and it is this objection that will be somewhat redressed in this paper.

The problem with the given algorithm is that it operates only on a monolithic serial computer; the single-CPU model of computing has no overt constraints on communication capacities and is therefore too liberal a model to be relevant to our neural machinery. Furthermore, the algorithm reveals very little about how to do the processing in a parallel and distributed fashion. In this paper we alter the model of computing to attain a degree of biological plausibility. We allow a linear number of processors and put explicit constraints on the time required to communicate between processors. Both of these changes make the model much more biological (and also closer to the connectionist style of processing).

This change alone, however, does not alter the time complexity: the worst-case training time is still linear. But when we change the complexity question being asked, a different answer is obtained. We define a class of tasks (viz. training data) that are drawn at random, and then ask for the expected time to load these tasks rather than the worst-case time. This alteration makes the question much more environmentally relevant. It also leads us into a different domain of algorithms and yields fast loading times.

2 Shallow 1-D Loading

2.1 Loading

The family of example shallow 1-dimensional architectures that we shall examine is characterized solely by an integer, d, which defines the depth of each architecture in the family. An example is shown in figure 1 for d = 3. The example also happens to have a fixed fan-in of 2 and a very regular structure, but this is not essential. A member of the family is specified by giving the width n, which we will take to be the number of output nodes.
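The exact wiring of Figure 1 cannot be recovered from the text alone, but the sizes it fixes can be sketched in code. The following is a minimal stand-in, assuming a d-by-n grid of nodes in which each node draws its fan-in of 2 from the same and the next column of the layer below (chosen so that output node i ends up depending on a window of d + 1 stimulus bits, as the model of computation below requires); the real figure may wire its segments differently.

```python
def architecture(d, n, fan_in=2):
    """A minimal stand-in for one member of the family: nodes indexed
    by (layer, column), d layers deep and n columns wide.  Each node
    above layer 1 is wired to fan_in nodes in the layer below, at the
    same column and the columns just to its right.  (Assumed wiring;
    only the depth, width, and fan-in are fixed by the text.)"""
    nodes = [(layer, col) for layer in range(1, d + 1) for col in range(n)]
    wires = {(layer, col): [(layer - 1, col + k) for k in range(fan_in)]
             for layer in range(2, d + 1) for col in range(n)}
    return nodes, wires
```

Widening the network (increasing n) just repeats columns, which is the "repeating the design segments" property the introduction describes.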
\n\nA task is a set of pairs of binary vectors, each specifying an stimulus to a net and \nits desired response. A random task of size t is a set of t pairs of independently \ndrawn random strings; there is no guarantee it is a function. \n\nOur primary question has to do with the following problem, which is parameterized \nby some fixed depth d, and by a node function set (which is the collection of different \ntransfer functions that a node can be tuned to perform): \n\nShallow 1-D Loading: \n\nInstance: An integer n, and a task. \nObjective: Find a function (from the node function set) for each node in the \nnetwork in the shallow I-D architecture defined by d and n such that the \nresulting circuit maps all the stimuli in the task to their associated responses. \n\n\fConstant-Time Loading of Shallow I-Dimensional Networks \n\n865 \n\nFigure 1: A Example Shallow 1-D Architecture \n\n2.2 Model of Computation \n\nOur machine model for solving this question is the following: For an instance of \nshallow 1-D loading of width n, we allow n processors. Each one has access to \na piece of the task, namely processor i has access to bits i through i + d of each \nstimulus, and to bit i of each response. Each processor i has a communication link \nonly to its two neighbours, namely processors i-I and i + 1. (The first and nth \nprocessors have only one neighbour.) It takes one time step to communicate a fixed \namount of data between neighbours. There is no charge for computation, but this is \nnot an unreasonable cheat because we can show that a matrix multiply is sufficient \nfor this problem, and the size of the matrix is a function only of d (which is fixed). \n\nThis definition accepts the usual connectionist ideal of having the processor closely \nidentified with the network nodes for which it is \"finding the weights\", and data \navailable at the processor is restricted to the same \"local\" data that connectionist \nmachines have. 
\n\nThis sort of computation sets the stage for a complexity question, \n\n2.3 Question and Approach \n\nWe wish to demonstrate that \n\nClaim 1 This parallel machine solves shallow J-D loading where each processor is \nfinished in constant expected time The constant is dependent on the depth of the \narchitecture and on the size of the task, but not on the width. The expectation is \nover the tasks. \n\n\f866 \n\nJudd \n\nFor simplicity we shall focus on one particular processor-the one at the leftmost \nend-and we shall further restrict our at tention to finding a node function for one \nparticular node. \n\nTo operate in parallel, it is necessary and sufficient for each processor to make its \nlocal decisions in a \"safe\" manner-that is, it must make choices for its nodes in \nsuch a way as to facilitate a global solution. Constant-time loading precludes being \nable to see all the data; and if only local data is accessible to a processor, then \nits plight is essentially to find an assignment that is compatible with all nonlocal \nsatisfying assignments. \n\nTheorem 2 The expected communication complexity of finding a \"safe\" node func(cid:173)\ntion assignment for a particular node in a shallow l-D architecture is a constant \ndependent on d and t, but not on n. \n\nIf decisions about assignments to single nodes can be made easily and essentially \nwithout having to communicate with most of the network, then the induced parti(cid:173)\ntioning of the problem admits of fast parallel computation. There are some com(cid:173)\nplications to the details because all these decisions must be made in a coordinated \nfashion, but we omit these details here and claim they are secondary issues that do \nnot affect the gross complexity measurements. \n\nThe proof of the theorem comes in two pieces. First, we define a computational \nproblem called path finding and the graph-theoretic notion of domination which \nis its fundamental core. 
Then we argue that the loading problem can be reduced to path finding in constant parallel time, and give an upper bound for determining domination.

3 Path Finding

The following problem is parameterized by an integer K, which is fixed.

Path finding:

Instance: An integer n defining the number of parts in a partite graph, and a series of I