{"title": "Improved Expressivity Through Dendritic Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 8057, "page_last": 8068, "abstract": "A typical biological neuron, such as a pyramidal neuron of the neocortex, receives thousands of afferent synaptic inputs on its dendrite tree and sends the efferent axonal output downstream. In typical artificial neural networks, dendrite trees are modeled as linear structures that funnel weighted synaptic inputs to the cell bodies. However, numerous experimental and theoretical studies have shown that dendritic arbors are far more than simple linear accumulators. That is, synaptic inputs can actively modulate their neighboring synaptic activities; therefore, the dendritic structures are highly nonlinear. In this study, we model such local nonlinearity of dendritic trees with our dendritic neural network (DENN) structure and apply this structure to typical machine learning tasks. Equipped with localized nonlinearities, DENNs can attain greater model expressivity than regular neural networks while maintaining efficient network inference. Such strength is evidenced by the increased fitting power when we train DENNs with supervised machine learning tasks. We also empirically show that the locality structure can improve the generalization performance of DENNs, as exemplified by DENNs outranking naive deep neural network architectures when tested on 121 classification tasks from the UCI machine learning repository.", "full_text": "Improved Expressivity Through Dendritic Neural\n\nNetworks\n\nXundong Wu\n\nXiangwen Liu\n\nWei Li\n\nQing Wu\n\nSchool of Computer Science and Technology\nHangzhou Dianzi University, Hangzhou, China\n\nwuxundong@gmail.com, wuq@hdu.edu.cn\n\nA typical biological neuron, such as a pyramidal neuron of the neocortex, receives thousands of\nafferent synaptic inputs on its dendrite tree and sends the efferent axonal output downstream. 
In typical artificial neural networks, dendrite trees are modeled as linear structures that funnel weighted synaptic inputs to the cell bodies. However, numerous experimental and theoretical studies have shown that dendritic arbors are far more than simple linear accumulators. That is, synaptic inputs can actively modulate their neighboring synaptic activities; therefore, the dendritic structures are highly nonlinear. In this study, we model such local nonlinearity of dendritic trees with our dendritic neural network (DENN) structure and apply this structure to typical machine learning tasks. Equipped with localized nonlinearities, DENNs can attain greater model expressivity than regular neural networks while maintaining efficient network inference. Such strength is evidenced by the increased fitting power when we train DENNs with supervised machine learning tasks. We also empirically show that the locality structure of DENNs can improve the generalization performance, as exemplified by DENNs outranking naive deep neural network architectures when tested on classification tasks from the UCI machine learning repository.

1 Introduction

Deep learning algorithms have made remarkable achievements in a vast array of fields over the past few years. Notably, deep convolutional neural networks have revolutionized computer vision research and real-world applications. Inspired by biological neuronal networks in our brains, originally in the form of a perceptron [37], an artificial neural network unit is typically constructed as a simple weighted sum of synaptic inputs followed by feeding the summation result through an activation function. Typical neural network units can be depicted as σ(∑_{i=1}^{m} w_i x_i) and exemplified as components in Fig. 1(a).
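The weighted-sum-and-activation unit above can be sketched in a couple of lines (a generic illustration of σ(∑ w_i x_i), not code from the paper):

```python
import numpy as np

def unit(x, w, activation=lambda z: np.maximum(z, 0.0)):
    """A classic artificial neuron: sigma(sum_i w_i * x_i).

    x, w : (m,) inputs and synaptic weights
    activation : the nonlinearity sigma; ReLU by default
    """
    return activation(np.dot(w, x))
```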
Such a scheme has been used in almost all modern neural network models.
In the physiological realm, however, as revealed by both experimental and modeling studies, biological neurons are far more complicated than the simple weighted sum model described above. It has been shown that dendritic arbors (see Fig. A1 in Appendix A for a schematic diagram of neurons and their dendrite arbors) contain an abundance of ion channels that are active, in other words, superlinear [41, 24, 23, 39, 25]. The existence of those active channels, combined with the leaky nature of the dendrite membrane, suggests that one synaptic input can have a nonlinear influence on synapses in its close proximity. In addition, studies on synapse plasticity show strong evidence that the biological machinery for plasticity also acts locally inside dendrites [24, 2]. Such properties greatly elevate the contribution of local nonlinear components in neuronal outputs and empower neuronal networks with much greater information processing ability [41, 17, 33, 25, 45, 32].
Despite the continual progress of deep learning research, the performance of artificial neural network models is still very much inferior to that of biological neural networks, especially in small data learning regimes. Naturally, we want to return to our biological brains for new inspiration. Could the active dendrite structure be part of the inductive bias that gives our brains superior learning ability? Here, we introduce a neural network structure that aims to model the localized nonlinearity and plasticity of dendrite trees and explore the advantages that active dendrites can bring to supervised learning tasks.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

To extract the local and nonlinear nature of dendritic structures, we design our dendritic neural network (DENN) model as shown in Fig. 1(b).
Constructed to replace a standard, fully connected, feedforward neural network (FNN) model as in Fig. 1(a), every neuron in this model also receives a full set of outputs from the earlier network layer or input data, as in a standard FNN. However, in this case, the connection maps at the dendrite branch level are sparse. The synaptic inputs are first summed at each branch, followed by nonlinear integration to form the output of a neuron (see Section 3 for details). We also ensure that learning events are isolated inside each branch during the learning of each pattern.
We test DENN models on typical supervised machine learning tasks. It is revealed that our DENN structure can give neural network models a major boost in expressivity under a fixed parameter size and network depth. At the same time, it is empirically shown that the DENN structure can improve generalization performance on certain machine learning tasks. When tested on 121 UCI machine learning repository datasets, DENN models outrank naive standard FNN models.
Our proposed DENN structure also avoids the typical computing inefficiency associated with sparse neural networks and can enable efficient sparse neural network inference. This structure, which we call intragroup sparsity, can potentially be adopted in general deep neural networks.

(a)

(b)

(c)

Figure 1: (a) A simple standard FNN with one hidden layer; (b) a DENN structure; (c) this DENN is a universal approximator. The hidden layer in the standard FNN is decomposed into two stages for the DENN. At the first stage, each dendrite branch performs a linear weighted sum of its sparsely connected inputs. At the second stage, the outputs of all branches are nonlinearly integrated to form the neuron output.
The last layer of the DENN is kept the same as that in a regular FNN.

2 Related Work

Dendritic neural networks: Driven by the disparity between classic artificial neural networks (ANNs) and biological neuronal networks, a notable amount of work has been done on integrating nonlinear dendrites into ANNs. Computational neuroscience research indicates that the nonlinear dendritic structure can strengthen the information processing capacity of neural networks [41, 32, 33, 45, 17, 13, 9, 38]. In [32], the authors map pyramidal neurons onto two-layer neural networks. In the first layer, sigmoidal subunits are driven by synaptic inputs. In the second layer, the subunit outputs are then summed and thresholded at the cell output. Poirazi and Mel [33] show that dendritic structures can access a large capacity through a structural learning rule and random synapse formation. In [35] the authors introduce and develop a mathematical model of dendritic computation in a morphological neuron based on lattice algebra. Such neurons with dendrites are able to approximate any compact region in Euclidean space to within any desired degree of accuracy. Hussain et al. [13] propose a margin-based multiclass classifier using dendritic neurons, in which the value of synapses is binary. Their networks learn the data by modifying the structure, the connections between input and dendrites, not the weights. The network used in our study differs from earlier studies in that it is structured to abstract both the nonlinearity and the localized learning nature of dendritic arbors, and our network architecture is designed with inference efficiency in mind.
We also apply this model to typical machine learning tasks with good results.

Deep neural network: The remarkable success of deep feedforward neural networks has driven many researchers to attempt to understand the reason behind such achievement. One essential factor, the high expressive power of a deep network, is believed to underlie this success.
In [34, 27, 26, 36], the authors study neural network expressivity by measuring the number of linear regions or transition events between linear regions in the function space. A DENN can also be considered a network with additional depth. In this study, we also investigate the expressive power of DENNs, specifically, how the size of dendrite branches affects DENN performance.
Sparse neural network: There has been much work on sparse neural networks [4, 22, 7, 20, 29]. Sparsity can endow neural network models with the advantage of lower computational, communication, and memory requirements. In a large learning system, such as the human brain, which contains 10^11 neurons, sparse networks are the only feasible design. The very successful convolutional neural network can also be considered a sparse network with synaptic connections restricted to local neighborhoods. The sparsity of neural networks can be built by direct construction, by sparsity-inducing regularization [40], or by post-learning weight pruning [11, 21, 44]. Our model belongs to the category of sparsity through construction. Several important features make our model stand out from other sparse models. First of all, our model is actually not sparse at the neuron level; sparsity only appears at the dendrite level. Second, practically no extra space is required to store the connection maps of DENNs. Furthermore, the regularity structure associated with DENNs can enable efficient computing of our networks, in contrast with typical sparse neural networks. We will elaborate on those properties in the following sections.
Learned piecewise activation functions: Learned piecewise activation functions [28, 8, 10] have been essential building blocks for the recent success of deep neural networks. Most relevant to this work, in [8], the authors propose the Maxout output function for piecewise linear neural networks with enhanced performance.
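A Maxout unit [8] outputs the maximum of k affine maps of its input; a minimal sketch (shapes and names are our own, not from the original paper):

```python
import numpy as np

def maxout_unit(x, W, b):
    """Maxout: the maximum over k affine components of the input.

    x : (m,) input vector
    W : (k, m) one weight vector per component
    b : (k,) per-component biases
    """
    return np.max(W @ x + b)
```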
In our model, we adopt the Maxout function to enable localized nonlinear synaptic integration and divisional learning units. However, our model is very different from the original Maxout network since every branch in a dendrite of our model receives mutually exclusive inputs.
Random feature subsets: In our model, the inputs to a specific dendrite branch are randomly selected. Such an approach resembles the practice in traditional machine learning algorithms that use feature subsets to build an ensemble of models [49, 3]. A model ensemble built in this way can have better generalization performance. In our model, feature subsets are used to construct features of the dendrite subunits of hidden layers.

3 Dendritic Neural Network

3.1 Definitions

A standard feedforward neural network (FNN) defines a function F : R^{n_0} → R^{out} of the form

F(x) = f_out ∘ f_L ∘ ··· ∘ f_1(x).

A layer l of a standard feedforward neural network consists of computational units that define a function f_l : R^{n_{l−1}} → R^{n_l} of the form

f_l(x_{l−1}) = [f_{l,1}(x_{l−1}), . . . , f_{l,n_l}(x_{l−1})]^⊤,
f_{l,i}(x_{l−1}) = σ_r(W_{l,i} x_{l−1} + b_{l,i}), i ∈ [n_l],

where f_{l,i} : R^{n_{l−1}} → R is the function of the ith output unit in layer l, W_l ∈ R^{n_l×n_{l−1}} is the input weight matrix, x_{l−1} is the output vector of layer l − 1 and the input vector of layer l, b_l ∈ R^{n_l} is the bias vector, σ_r is the activation function for each layer l ∈ [L], L is the number of layers of a network, and {1, 2, . . . , L} is denoted by [L]. We mainly consider rectifier units σ_r(x) = max{0, x}.
In comparison, a DENN is a composition of dendritic layers that defines a function F_D : R^{n_0} → R^{out} given by

F_D(x) = f_out ∘ f^D_L ∘ ··· ∘ f^D_1(x),

where f^D_l : R^{n_{l−1}} → R^{n_l} is the function of the dendritic layer l.
Layer l has n_l dendrite units and each dendrite unit is associated with one neuron output, which is the maximum of d dendritic branches. Each branch makes connections with k inputs from the last layer. The selection strategy is that each branch randomly chooses k = n_{l−1}/d connections without replacement from the n_{l−1} inputs. This strategy assures that every input feature is connected to a dendrite unit only once, to avoid redundant copies, and that the synapse sets of the branches of each neuron are mutually exclusive. Such a selection strategy is also backed up by physiology studies that suggest axons avoid making more than one synapse connection to the dendritic arbor of each pyramidal neuron [47, 5]. The mutually exclusive connection selection strategy is also the basis for efficient network inference. Note that we should let k < n_{l−1}; k is called the branch size and d is called the branch number. The branch h_{l,i,j} is given by h_{l,i,j} : R^{n_{l−1}} → R of the form

h_{l,i,j}(x_{l−1}) = ∑_{m=1}^{n_{l−1}} (S_{l,i,j,m} · W_{l,i,j,m}) x_{l−1,m}, j ∈ [d],

where W_l ∈ R^{n_l×n_{l−1}×d} are the weight matrices between input units and branches, and S_l ∈ R^{n_l×n_{l−1}×d} are the mask matrices that represent whether a branch has a connection with an input; the values of S_l are binary (1 or 0). S_l are generated with a deterministic pseudorandom number generator from a preassigned random seed. That is, we can reproduce S_l from the original random seed alone. Therefore, with a proper algorithm, S_l will not incur extra model storage and transfer cost.
The output of each neuron g_{l,i} can be formulated by g_{l,i} : R^d → R, g_{l,i}(x_{l−1}) = max_j(h_{l,i,j}(x_{l−1})) + b_{l,i}, i ∈ [n_l], j ∈ [d], where b_l ∈ R^{n_l} is the bias vector.
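The branch and neuron computations above can be sketched in a few lines of NumPy (a minimal illustration of one dendritic layer under our own shapes and names, not the authors' implementation):

```python
import numpy as np

def dendritic_layer(x, W, S, b):
    """One DENN layer: branch-wise masked sums followed by a max over branches.

    x : (n_prev,) layer input
    W : (n, n_prev, d) weights, one slice per branch
    S : (n, n_prev, d) binary masks; each input feeds exactly one branch per neuron
    b : (n,) per-neuron bias added after the max
    """
    # h[i, j] = sum_m (S[i, m, j] * W[i, m, j]) * x[m]  -- the branch sums
    h = np.einsum('imj,m->ij', S * W, x)
    # Maxout-style integration: each neuron outputs its largest branch sum plus a bias.
    return h.max(axis=1) + b

def make_masks(n, n_prev, d, seed=0):
    """Branch masks from a fixed seed: each neuron's n_prev inputs are split
    into d mutually exclusive groups of size k = n_prev // d."""
    rng = np.random.default_rng(seed)
    S = np.zeros((n, n_prev, d))
    k = n_prev // d
    for i in range(n):
        perm = rng.permutation(n_prev)
        for j in range(d):
            S[i, perm[j * k:(j + 1) * k], j] = 1.0
    return S
```

Because the masks come from a fixed seed, only the seed needs to be stored to reconstruct S_l, which is the storage argument made above.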
A dendritic layer output f^D_l is the composition of the branches and can be formulated by

f^D_l(x_{l−1}) = [g_{l,1}(x_{l−1}), . . . , g_{l,n_l}(x_{l−1})]^⊤.

In particular, when d = 1, the dendritic layer is the same as an affine fully connected layer with linear activation functions. The output layer f_out of a dendritic neural network is the same as the output layer of a feedforward neural network. For some cases, we use a fully connected layer with nonlinear activation functions as the first layer of a DENN.

3.2 A Universal Approximator

A feedforward network with one hidden layer of a finite number of units is a universal approximator [12]. As shown in [8], a Maxout neural network with two hidden units of arbitrarily many components can also approximate any continuous function. Similarly, a DENN can also be a universal approximator. Consider a DENN with an input layer of n_0 units, whose first layer has n_1 output units and d_1 ≤ n_0, d_1 ∈ N+, and whose second layer has two output units and d_2 ≤ n_1, d_2 ∈ N+. Then, we use a virtual unit to produce the difference of the two output units, as shown in Fig. 1(c). Now consider the continuous piecewise linear (PWL) function g(x) and the convex PWL function h(x) consisting of d_2 locally affine regions on x ∈ R^n. Each affine region of h(x) is determined by the parameter vectors [W_i, b_i], i ∈ [1, d_2].
Then, we prove that a two-layer DENN can approximate any continuous function f(x), x ∈ R^n arbitrarily well with sufficiently large n_1 and d_2.
Proposition 1 (Wang [43]): For any positive integer n, there exist two groups of n + 1 dimensional real-valued parameter vectors [W_{1i}, b_{1i}] and [W_{2i}, b_{2i}], i ∈ [1, d_2], such that g(x) = h_1(x) − h_2(x). That is, any continuous PWL function can be expressed as the difference of two convex PWL functions [8, 43].
Proposition 2 (Stone-Weierstrass Approximation Theorem): Let C ⊂ R^n be a compact domain, f : C → R be a continuous function, and ε > 0 be any positive real number. Then, there exists a continuous PWL function g such that for all x ∈ C, |f(x) − g(x)| < ε [8].
Theorem 3 (Universal Approximator Theorem): Any continuous function f can be approximated arbitrarily well on a compact domain C ⊂ R^{n_0} by a two-layer dendritic neural network with d_1 = 1 and n_1 output units in the first layer and d_2 ≤ n_1, d_2 ∈ N+ in the second layer, with sufficiently large n_1 and d_2.

Proof Sketch: Let there be a two-layer dendritic neural network with a first layer of n_1 output units and d_1 = 1. The second layer has two output units, and d_2 ≤ n_1, d_2 ∈ N+. Then, we construct a virtual unit to output the final result, the difference of the two output units. An output unit in the first layer is a hyperplane in R^n, and a branch in the second layer is also a hyperplane because it is the sum of n_1/d_2 hyperplanes. An output unit in the second layer is the maximum of d_2 branch units. In other words, the output unit is the upper envelope of d_2 hyperplanes, and it represents a convex PWL function. From Proposition 1, any continuous PWL function can be expressed by the virtual unit, a difference of two convex PWL functions.
Then, according to Proposition 2, any continuous function can be approximated arbitrarily well by this network, to within the desired degree ε, with sufficiently large n_1 and d_2 on the compact domain C. Consequently, a two-layer dendritic neural network satisfies Theorem 3 and the theorem holds. In general, as ε → 0, we have n_1, d_2 → ∞.
Theorem 3 is restrictive, with d_1 limited to 1. Here, we extend it to allow d_1 ≤ n_0 with d_1 ∈ N+.

Theorem 4 (Generalized Universal Approximator Theorem): Any continuous function f can be approximated arbitrarily well on a compact domain C ⊂ R^{n_0} by a two-layer dendritic neural network with d_1 ≤ n_0 and n_1 output units in the first layer and d_2 ≤ n_1, d_2 ∈ N+ in the second layer, with sufficiently large n_1 and d_2.

Proof Sketch: From [8] we know that a Maxout neural network with two hidden units of arbitrarily many components can approximate any continuous function. Assume that such a Maxout neural network approximates the target function with m components in each hidden unit. We denote the two hidden Maxout units as the left unit and the right unit. We prove Theorem 4 by constructing a network that is equivalent to such a Maxout neural network. We set d_2 = m and the second-layer dendrite size k = n_0. From there we arrive at n_1 = 2n_0m. We assume all branches of the two second-layer neurons receive mutually exclusive inputs from first-layer neurons. This can be achieved simply by separating the first-layer neurons into two independent groups, one for each second-layer neuron. In this way, every branch of a second-layer neuron receives inputs from exactly n_0 first-layer neurons. We denote the first-layer input neurons as h_{ij}, where i ∈ {1, 2, . . . , 2m} indexes the branches of the second-layer neurons and j ∈ {1, 2, . . . , n_0}. Denote the input of the network as x_{0k}.
For every first-layer neuron h_{ij}, we set all input weights in those neurons to 0, except the weight connection from x_{0k} when k = j. The biases of the branches with all-zero weights are then set to negative infinity. The bias of the branch with the nonzero weight is then set to zero. With this construction, each branch of the DENN second-layer neurons receives a full set of n_0 network inputs, with weights corresponding to each input that are independent of the other branches. Therefore, we have a network equivalent to the original Maxout network. Thus, we prove that a general two-layer DENN is a universal approximator.

3.3 Intragroup sparsity

Inside a DENN layer, the connection map between the layer inputs and the dendritic arbors is generated randomly and is sparse. Because of the poor data locality of generic sparse network structures, performing inference with those networks is generally inefficient [44]. We propose a design called intragroup sparsity. In this design, every DENN weight is given a single branch index, which tells us which branch the weight feeds. With this arrangement, each weight is first multiplied with the corresponding layer input as usual, followed by an accumulation procedure directed by the branch index. That is, the multiplication result is added to the branch sum addressed by the index. Given that the number of branches used in our model is typically small, we only need a few bits to encode such an index. In this way, the intragroup sparsity structure avoids the poor data locality issue associated with typical sparse neural networks.

3.4 Network Complexity

A wealth of literature has argued that deep neural networks can be exponentially more efficient at representing certain functions than shallow networks [27, 34, 30, 31, 26, 36]. In [27], the author proposes that deep neural networks gain model complexity by mapping different parts of space into the same output.
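The index-directed accumulation of Section 3.3 can be sketched with a scatter-add (a minimal NumPy illustration under our own naming, not the authors' implementation):

```python
import numpy as np

def intragroup_forward(x, w, branch_idx, d):
    """Dense-style inference for one DENN neuron under intragroup sparsity.

    x          : (n_prev,) layer input
    w          : (n_prev,) one weight per input (the neuron is dense over inputs)
    branch_idx : (n_prev,) integer branch index per weight, values in [0, d)
    d          : number of branches
    """
    prod = w * x                       # multiply every weight with its input, as usual
    sums = np.zeros(d)
    np.add.at(sums, branch_idx, prod)  # route each product to its branch accumulator
    return sums.max()                  # Maxout-style integration over branches
```

With d small, branch_idx fits in a few bits per weight, which is the low-storage argument made in Section 3.3.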
A network gains exponential efficiency through the reuse of such maps by the filters of later layers. That is, neural networks having more linear regions can represent functions of higher complexity and have stronger expressive power. More specifically, a perceptron with the ReLU nonlinearity can divide the input space into two linear regions by one hyperplane. A DENN neuron with d branches contains h = d(d − 1)/2 divisional hyperplanes. According to the Zaslavsky theorem [48], those hyperplanes can then divide the input space into a number of regions bounded above by ∑_{s=0}^{n_0} C(h, s). Empirical tests indeed show that DENNs can have boosted model complexity, as highlighted by the number of region transitions (see Appendix B). Can such model complexity translate into better model fitting power? We test this hypothesis in the next section.

4 Result

In the last section, we showed that a DENN can divide the function space into more linear regions than a standard FNN. We would like to verify whether dividing the function space into a greater number of regions actually translates into greater expressive power when the DENN is used to model real-world data. We also would like to determine whether the inductive bias, an imitation of biological dendritic arbors, built into DENNs can be beneficial for machine learning tasks, i.e., yield improved generalization performance. Hence, we empirically compare the performance of DENNs with several FNN architectures on permutation-invariant CIFAR-10/100 and Fashion-MNIST [46] data and on 121 machine learning classification datasets from the UCI repository.

4.1 Permutation-invariant image datasets

We first test our models on permutation-invariant image datasets: the Fashion-MNIST [46], CIFAR-10, and CIFAR-100 datasets [18].
The Fashion-MNIST dataset consists of 60,000 training and 10,000 test examples, and each example is a 28 × 28 pixel grayscale article image labeled with one of ten classes. In this experiment, the inputs of the network are of 28 × 28 = 784 dimensions. The CIFAR-10/100 datasets consist of 50,000 training and 10,000 test images, and each image is a 32 × 32 color image associated with one of 10 or 100 classes respectively. For permutation-invariant learning, models should be unaware of the 2-D structure of the images used in learning.
The baseline standard FNN used in this experiment is composed of three layers, with n units for the first two layers and 10/100 category outputs for the last layer. The activation function used for the hidden units is the popular ReLU function [19]. The Softmax function is used to generate the network outputs, and the cross entropy loss is then calculated between the Softmax output and the ground-truth labels. To facilitate model training, we also insert batch normalization [14] or layer normalization [1] in the first two layers of the standard FNNs (indeed, we are giving the control models an extra advantage here). We construct a DENN and a Maxout network to compare against the standard FNNs. For a proper comparison, we keep the number of synaptic parameters constant across the different architectures. That is, for DENNs, the number of hidden units n is kept the same as in the baseline models; when we vary the number of dendrite branches d in each neuron, the number of synaptic weights per branch is set to n/d to keep the number of synapses the same as in the standard FNN hidden layer. For the Maxout network, when increasing the number of kernels in a Maxout unit, the number of units is reduced accordingly to keep the total number of synaptic parameters constant.
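The constant-parameter bookkeeping above can be checked with a few lines (our own sketch; the layer sizes are illustrative):

```python
def synapse_count_fnn(n_in, n_units):
    """Dense hidden layer: every unit connects to every input."""
    return n_in * n_units

def synapse_count_denn(n_in, n_units, d):
    """DENN layer: each of the d branches of a unit receives a mutually
    exclusive subset of n_in / d inputs, so per-unit synapses still total n_in."""
    k = n_in // d  # branch size
    return n_units * d * k
```

For n_in = n_units = 512 the two counts agree for every branch number d that divides 512, which is how the DENN and baseline models in this comparison are matched in size.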
The output layers of the DENN and Maxout models are kept the same as those of the standard FNNs.
We train all models in this comparison with the Adam optimizer [15] for 100 epochs. The learning rate used is decayed exponentially from 0.01 to 1e−5 unless otherwise stated. To show the plain fitting power of the models, we do not apply any kind of regularization in our model training. For this test, data are preprocessed with per-image standardization.
We start by training models on the Fashion-MNIST and CIFAR-10 datasets to test the fitting power of models with the layer size n set to 512. As shown in Fig. 2, DENNs can attain much lower training loss than the other network architectures on both datasets. More specifically, in Fig. 2(a) and Fig. 2(c), the lowest training loss values of the DENNs and Maxout networks with different branch numbers d (kernel numbers for Maxout) are compared against the lowest loss from standard FNNs with batch normalization or layer normalization (the losses of the regular FNN without normalization and of self-normalizing neural networks (SNNs) are much higher and not shown). As we can observe, DENNs with mid-size dendrite branch numbers can attain much lower loss than standard FNNs. The performance of DENNs with extreme branch numbers, for example, 2 and 256 branches, is comparable with that of standard FNNs. Regular Maxout models are known to benefit from the extra parameters of additional kernels. In this experiment, constrained by the constant synaptic weight number requirement, the performance of the Maxout networks deteriorates as we decrease the number of hidden units to compensate for the increased number of kernels per unit. Typical sets of training loss curves are also shown in Fig. 2(b) and Fig. 2(d), respectively. Clearly, DENNs attain much lower training loss than the other FNNs.
We did not show the training accuracy curves here because they tend to reach quite close to 100% accuracy due to overparameterization and thus are not truly informative.

(a) Lowest training loss on Fashion-MNIST

(b) Fashion-MNIST training loss curves

(c) Lowest training loss on CIFAR-10

(d) CIFAR-10 training loss curves

Figure 2: Training loss results on the Fashion-MNIST and CIFAR-10 datasets for ReLU FNNs (batch normalization-ReLU (BN-ReLU) and layer normalization-ReLU (LN-ReLU)), DENNs, and Maxout networks. The Y axis is set to the logarithmic scale.

To obtain a better understanding, we perform further experiments to test models with different model sizes, where they are under more capacity pressure.
In Fig. 3 we show the results from training models on the CIFAR-100 dataset with the layer size n set to 64, 128, 256, and 512 respectively. DENNs show a clear advantage over the other FNN models when under capacity pressure, observable in both the training accuracy and loss curves.

(a) Training accuracy on CIFAR-100 dataset

(b) Training loss on CIFAR-100

Figure 3: Training accuracy and training loss results on the CIFAR-100 dataset for DENNs with different branch numbers, along with results of ReLU FNNs (batch normalization-ReLU, i.e., BN-ReLU, and layer normalization-ReLU, i.e., LN-ReLU).

We also perform experiments on DENNs with two nonlinear activation functions other than the Maxout function. In the first architecture, synaptic inputs are first summed, and each individual dendrite summation result is passed through the ReLU function; we then sum the dendritic outputs and pass the sum through a secondary ReLU function. The second architecture differs from the first in that no secondary ReLU is used. The DENN with Maxout nonlinearities clearly outperforms the two ReLU-based architectures (see Appendix C, Fig. A10-13).
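The two ReLU-based variants can be sketched as follows (a minimal NumPy illustration of one neuron with our own naming, not the authors' code):

```python
import numpy as np

def dendritic_relu_unit(branch_sums, secondary_relu=True):
    """Dendritic integration with ReLU nonlinearities instead of Maxout.

    branch_sums : (d,) the d per-branch weighted sums of one neuron
    Variant 1 (secondary_relu=True): sum of rectified branches, rectified again.
    Variant 2 (secondary_relu=False): plain sum of rectified branches.
    """
    out = np.maximum(branch_sums, 0.0).sum()  # per-branch ReLU, then sum
    return max(out, 0.0) if secondary_relu else out
```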
It is worth mentioning that even though those two architectures underperform the standard DENNs, they also show rather interesting learning behaviors. We will explore this further in future work.
To obtain a better understanding of the learning behaviors of DENNs, we also train networks on data synthesized from a random Gaussian noise generator (see Fig. A14 of Appendix C). Learning random data means models are less likely to take advantage of redundancy in the data and thus require more model capacity. DENN models appear to hold less of an advantage over the standard FNN than when they are trained with regular image data.
This result might indicate that part of the superior performance of DENNs over standard FNNs comes from taking advantage of the correlations within the data [22, 4].
Additional experimental results on permutation-invariant image datasets, including results on the test sets, can be found in Appendix C.

4.2

121 UCI Classification Datasets

In addition to evaluating the fitting power of DENNs on image datasets, we also evaluate their generalization performance on a collection of 121 machine learning classification tasks from the UCI repository, as used in [16, 6, 42]. The collection covers various application areas, such as biology, geology, and physics. We compare the generalization performance of DENNs to that of the standard FNN architectures tested on this collection in [16]. To obtain the benchmark result for a specific standard FNN method, a grid search over architectures and hyperparameters is performed on a separate validation set, and the best model is then evaluated for accuracy on the predefined test set. We repeat the experiment on self-normalizing neural networks (SNNs) and obtain results very similar to those in [16]. For all standard FNN architectures other than DENNs, we use the benchmark results included in [16] for the rest of the experiments. A detailed list of test results for those standard FNN networks and their corresponding grid-search hyperparameter spaces can be found in Appendix A.4 of [16].
For this part of the experiment, the DENNs used are all composed of three network layers. For the first layer, a fully connected ReLU input layer is used to accommodate a wide range of input feature dimensionalities. The second layer is a DENN layer, followed by the third layer, a fully connected softmax output layer.
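A forward pass through this three-layer layout can be sketched as below. This is a sketch under assumed details, not the authors' implementation: the layer widths are scaled down, the weight scales are arbitrary, and routing inputs to branches via a random permutation per unit is our assumed realization of the fixed sparsity mask.

```python
import numpy as np

# Illustrative forward pass: ReLU input layer -> DENN layer -> softmax output.
rng = np.random.default_rng(0)
n_in, width, d, n_cls = 20, 64, 8, 3   # feature dim, layer width, branches, classes

W1 = 0.1 * rng.standard_normal((n_in, width))
Wd = 0.1 * rng.standard_normal((width, d, width // d))  # branch weights per DENN unit
Wo = 0.1 * rng.standard_normal((width, n_cls))

# Fixed sparsity mask, generated once at initialization: each DENN unit sees all
# `width` hidden activations in a random order, split into d disjoint branches.
perm = np.stack([rng.permutation(width) for _ in range(width)])  # (width, width)

def forward(x):
    h = np.maximum(x @ W1, 0)                  # fully connected ReLU input layer
    g = h[perm].reshape(width, d, width // d)  # route inputs to dendrite branches
    s = (Wd * g).sum(axis=2)                   # per-branch linear summations
    u = s.max(axis=1)                          # DENN unit output: max over branches
    z = u @ Wo                                 # fully connected output layer
    e = np.exp(z - z.max())
    return e / e.sum()                         # softmax class probabilities

p = forward(rng.standard_normal(n_in))
```

Because the mask is fixed at initialization, only the branch index per weight needs to be stored, which is what enables the efficient inference discussed later.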
The width of the input layer and the DENN layer are both set to 512 units, and the output layer width is set equal to the number of classes in the corresponding dataset. For the DENN layer, we optimize the model architecture over the number of dendritic branches d per hidden unit, with d set to one of 2^1, 2^2, ..., 2^8 for each model. Correspondingly, the number of active weights in each dendrite branch is 512/d. In addition to the regular DENN models, we also train models with a dropout rate of 0.2. In total, for each task, we perform a grid search over 16 different settings (eight branch counts, each with and without dropout).
As in [16], we train each model for 100 epochs with a fixed learning rate of 0.01. Unlike [16], we use the Adam optimizer instead of the SGD optimizer for model training because DENNs are generally harder to train due to their sparse structure. For comparison, we also train SNN networks with the Adam optimizer, which gives worse results than training them with the SGD optimizer. For the rest of the experiment, we follow the same procedure as in [16].
The accuracy results obtained with the DENNs can be found in Table A1. We rank DENNs and the other standard FNN architectures by their accuracy on each dataset and compare their average ranks (Table 1). DENNs outperform the other network architectures in pairwise comparisons (paired Wilcoxon tests across datasets). The comparison results are reported in the right column of Table 1.

Table 1: Comparison of DENNs with seven different FNN architectures on 121 UCI classification tasks. The first column lists the architecture names, the second column shows the average rank, and the third column gives the p-value of a paired Wilcoxon signed-rank test against DENNs. Asterisks denote statistical significance.
DENNs achieve the best rank among the compared network models.

    Method       Avg. rank   p-value
    DENNs        3.08        --
    SNN          3.50        1.37e-01
    MSRAinit     4.05*       2.53e-03
    LayerNorm    4.10*       2.50e-03
    Highway      4.38*       9.08e-05
    ResNet       4.55*       1.16e-05
    BatchNorm    4.84*       7.64e-07
    WeightNorm   4.86*       3.14e-07

5 Discussion

In this paper, we have proposed the DENN, a neural network architecture constructed to study the advantage that the dendrite tree structure of biological neurons might offer in learning. Specifically, we apply DENNs to supervised machine learning tasks. Through theoretical and empirical studies, we show that DENNs can provide greater expressivity than traditional neural networks. In addition, we show that DENNs can improve generalization performance for certain data, as evidenced by the improved test accuracy on a large collection of machine learning tasks. Our DENN model is also designed to allow efficient network inference, which is generally not accessible to typical sparse neural networks due to their built-in irregularity. In a DENN, each connection weight is associated with one and only one low bit-width branch index. Such regularity in its structure allows efficient network computing with properly designed software and/or hardware. While this sparse structure, which we call intragroup sparsity, was designed for DENNs, it can also be extended beyond DENNs to allow multiple network units to share one set (or a few sets) of inputs and thus enable a novel kind of inference-efficient sparse network architecture.
We recall that DENNs exhibit varying expressive power across branch sizes when tested on permutation-invariant image datasets. Here, we give a tentative explanation for this behavior.
According to Theorem 3, any continuous function can be approximated by a continuous PWL function, which can in turn be expressed as the difference of two convex PWL functions. A dendrite unit is the upper envelope of d real-valued linear functions, and the approximation error ε between the objective function and the output of the dendrite unit goes to zero as d → ∞. However, this result relies on the implicit condition that each linear function is unconstrained. As each dendrite branch connects to fewer inputs, the constraints on the linear functions become stricter, and the approximation error between the dendrite unit and the requested convex function goes up. Such a trade-off between the branch size and the branch number leads to optimal expressivity at mid-range dendrite sizes.
Thus far, we have not observed a clear advantage for DENN models in generalization performance on the image datasets we tested (see Appendix C for results on test set accuracy). It is possible that we could obtain superior generalization performance on those data by imposing additional regularization; given the operating space granted by the substantially greater fitting power of DENN models, obtaining better generalization performance is certainly a possibility, and we will test this in follow-up work. Another possibility is to explore nonlinear functions other than the Maxout and ReLU functions we tested.
The network architecture used in this study is a boiled-down version of the biological neuronal network. DENNs model the essence of dendritic structure by enforcing localized nonlinearities and compartmentalized learning in dendrite branches. Such a structure certainly does not cover the full complexity of dendritic arbors. In reality, we know that the influence of a synapse's activity on its neighborhood decays quickly along the length of the branch, and that this influence is modulated by meta-plasticity learning.
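Returning to the upper-envelope argument above, the branch-number side of the trade-off can be illustrated with a toy numerical sketch (our illustration, not the construction used in the proof): the maximum of d linear functions, here tangents of the convex function f(x) = x^2 at evenly spaced points, approximates f with a worst-case error that shrinks as d grows.

```python
import numpy as np

# Toy sketch: approximate the convex function f(x) = x^2 on [-1, 1] by the
# upper envelope (max) of d linear functions, taken as tangents at evenly
# spaced points (an assumed, illustrative choice of linear functions).
def envelope_error(d):
    xs = np.linspace(-1.0, 1.0, 1001)                       # evaluation grid
    t = np.linspace(-1.0, 1.0, d)                           # tangent points
    lines = 2.0 * t[:, None] * xs[None, :] - t[:, None]**2  # tangent of x^2 at t
    return np.abs(xs**2 - lines.max(axis=0)).max()          # sup-norm error

errors = [envelope_error(d) for d in (2, 4, 8, 16)]         # strictly decreasing
```

With unconstrained lines the error vanishes as d grows; restricting each line to a subset of the inputs, as a small dendrite branch does, would block this convergence, which is the constraint discussed above.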
In addition, the molecular machinery underlying synaptic plasticity relies on the diffusion of ion channels and signaling molecules. Therefore, with respect to plasticity, the idea of treating a single dendritic branch as a learning compartment is another simplification. We also know that the size of dendritic branches can vary significantly, which can have a compelling influence on synaptic integration and localized plasticity inside dendrites. In this study, we used fixed sparsity masks generated at the initialization stage of the neural networks. Such an approach offers the advantage that virtually no extra space is required to store the mask. As revealed by many earlier studies [33, 11], structural plasticity, i.e., changing the connection map, can greatly improve model capacity. We will explore this topic in our future work.
The complexity of biological neurons suggests that we might need a learning apparatus with more built-in mechanisms or inductive bias to improve our neural network models. The following is a potential avenue for our future research: to train a neural network with more than tens of thousands of parameters, generally only first-order gradient descent methods are feasible, as higher-order methods are impractical due to their prohibitively high computational complexity. With a compartmentalized learning subunit, calculating localized higher-order gradients becomes feasible and might help us train models better and faster. Another potential approach could be building local mechanisms [45] to improve model learning by implicitly utilizing higher-order gradient information.

Acknowledgement

We sincerely thank the anonymous reviewers for their insightful comments and suggestions. We also thank Dr. Günter Klambauer for his generous help with the UCI-datasets experiment.

References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton.
Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[2] Tiago Branco and Michael Häusser. The single dendritic branch as a fundamental functional unit in the nervous system. Current Opinion in Neurobiology, 20(4):494–502, 2010.

[3] Robert Bryll, Ricardo Gutierrez-Osuna, and Francis Quek. Attribute bagging: improving accuracy of classifier ensembles by using random feature subsets. Pattern Recognition, 36(6):1291–1302, 2003.

[4] N Alex Cayco-Gajic, Claudia Clopath, and R Angus Silver. Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature Communications, 8(1):1116, 2017.

[5] Dmitri B Chklovskii, BW Mel, and K Svoboda. Cortical rewiring and information storage. Nature, 431(7010):782, 2004.

[6] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15(1):3133–3181, 2014.

[7] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.

[8] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.

[9] Jordan Guerguiev, Timothy P Lillicrap, and Blake A Richards. Towards deep learning with segregated dendrites. eLife, 6, 2017.

[10] Caglar Gulcehre, Marcin Moczulski, Misha Denil, and Yoshua Bengio. Noisy activation functions. In International Conference on Machine Learning, pages 3059–3068, 2016.

[11] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.

[12] Kurt Hornik.
Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

[13] Shaista Hussain, Shih-Chii Liu, and Arindam Basu. Improved margin multi-class classification using dendritic neurons with morphological learning. In 2014 IEEE International Symposium on Circuits and Systems (ISCAS), pages 2640–2643. IEEE, 2014.

[14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[15] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[16] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pages 972–981, 2017.

[17] Christof Koch, Tomaso Poggio, and Vincent Torre. Nonlinear interactions in a dendritic tree: localization, timing, and role in information processing. Proceedings of the National Academy of Sciences, 80(9):2799–2802, 1983.

[18] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[20] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems, pages 801–808, 2007.

[21] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.

[22] Ashok Litwin-Kumar, Kameron Decker Harris, Richard Axel, Haim Sompolinsky, and LF Abbott. Optimal degrees of synaptic connectivity.
Neuron, 93(5):1153–1164, 2017.

[23] Attila Losonczy and Jeffrey C Magee. Integrative properties of radial oblique dendrites in hippocampal CA1 pyramidal neurons. Neuron, 50(2):291–307, 2006.

[24] Attila Losonczy, Judit K Makara, and Jeffrey C Magee. Compartmentalized dendritic plasticity and input feature storage in neurons. Nature, 452(7186):436, 2008.

[25] Bartlett W Mel. Synaptic integration in an excitable dendritic tree. Journal of Neurophysiology, 70(3):1086–1101, 1993.

[26] Hrushikesh Mhaskar, Qianli Liao, and Tomaso A Poggio. When and why are deep networks better than shallow ones? In AAAI, pages 2343–2349, 2017.

[27] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924–2932, 2014.

[28] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pages 807–814. Omnipress, 2010.

[29] Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.

[30] Razvan Pascanu, Guido Montufar, and Yoshua Bengio. On the number of response regions of deep feed forward networks with piece-wise linear activations. arXiv preprint arXiv:1312.6098, 2013.

[31] Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. International Journal of Automation and Computing, 14(5):503–519, 2017.

[32] Panayiota Poirazi, Terrence Brannon, and Bartlett W Mel. Pyramidal neuron as two-layer neural network. Neuron, 37(6):989–999, 2003.

[33] Panayiota Poirazi and Bartlett W Mel.
Impact of active dendrites and structural plasticity on the memory capacity of neural tissue. Neuron, 29(3):779–796, 2001.

[34] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In International Conference on Machine Learning, pages 2847–2854, 2017.

[35] Gerhard X Ritter and Gonzalo Urcid. Lattice algebra approach to single-neuron computation. IEEE Transactions on Neural Networks, 14(2):282–295, 2003.

[36] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. arXiv preprint arXiv:1705.05502, 2017.

[37] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.

[38] Joao Sacramento, Rui Ponte Costa, Yoshua Bengio, and Walter Senn. Dendritic error backpropagation in deep cortical microcircuits. arXiv preprint arXiv:1801.00062, 2017.

[39] Jackie Schiller, Guy Major, Helmut J Koester, and Yitzhak Schiller. NMDA spikes in basal dendrites of cortical pyramidal neurons. Nature, 404(6775):285, 2000.

[40] Suraj Srinivas, Akshayvarun Subramanya, and R Venkatesh Babu. Training sparse neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 455–462. IEEE, 2017.

[41] Greg Stuart, Nelson Spruston, and Michael Häusser. Dendrites. Oxford University Press, 2016.

[42] Michael Wainberg, Babak Alipanahi, and Brendan J Frey. Are random forests truly the best classifiers? The Journal of Machine Learning Research, 17(1):3837–3841, 2016.

[43] Shuning Wang. General constructive representations for continuous piecewise-linear functions. IEEE Transactions on Circuits and Systems I: Regular Papers, 51(9):1889–1896, 2004.

[44] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li.
Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.

[45] Xundong E Wu and Bartlett W Mel. Capacity-enhancing synaptic learning rules in a medial temporal lobe online learning model. Neuron, 62(1):31–41, 2009.

[46] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

[47] Rafael Yuste. Dendritic spines and distributed circuits. Neuron, 71(5):772–781, 2011.

[48] Thomas Zaslavsky. Facing up to Arrangements: Face-Count Formulas for Partitions of Space by Hyperplanes, volume 154. American Mathematical Society, 1975.

[49] Gabriele Zenobi and Padraig Cunningham. Using diversity in preparing ensembles of classifiers based on different feature subsets to minimize generalization error. In European Conference on Machine Learning, pages 576–587. Springer, 2001.