{"title": "Constructing Deep Neural Networks by Bayesian Network Structure Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3047, "page_last": 3058, "abstract": "We introduce a principled approach for unsupervised structure learning of deep neural networks. We propose a new interpretation for depth and inter-layer connectivity where conditional independencies in the input distribution are encoded hierarchically in the network structure. Thus, the depth of the network is determined inherently. The proposed method casts the problem of neural network structure learning as a problem of Bayesian network structure learning. Then, instead of directly learning the discriminative structure, it learns a generative graph, constructs its stochastic inverse, and then constructs a discriminative graph. We prove that conditional-dependency relations among the latent variables in the generative graph are preserved in the class-conditional discriminative graph. We demonstrate on image classification benchmarks that the deepest layers (convolutional and dense) of common networks can be replaced by significantly smaller learned structures, while maintaining classification accuracy---state-of-the-art on tested benchmarks. Our structure learning algorithm requires a small computational cost and runs efficiently on a standard desktop CPU.", "full_text": "Constructing Deep Neural Networks by Bayesian\n\nNetwork Structure Learning\n\nRaanan Y. Rohekar\n\nIntel AI Lab\n\nraanan.yehezkel@intel.com\n\nShami Nisimov\n\nIntel AI Lab\n\nshami.nisimov@intel.com\n\nYaniv Gurwicz\n\nIntel AI Lab\n\nyaniv.gurwicz@intel.com\n\nGuy Koren\nIntel AI Lab\n\nguy.koren@intel.com\n\nGal Novik\nIntel AI Lab\n\ngal.novik@intel.com\n\nAbstract\n\nWe introduce a principled approach for unsupervised structure learning of deep\nneural networks. 
We propose a new interpretation for depth and inter-layer con-\nnectivity where conditional independencies in the input distribution are encoded\nhierarchically in the network structure. Thus, the depth of the network is determined\ninherently. The proposed method casts the problem of neural network structure\nlearning as a problem of Bayesian network structure learning. Then, instead of\ndirectly learning the discriminative structure, it learns a generative graph, constructs\nits stochastic inverse, and then constructs a discriminative graph. We prove that\nconditional-dependency relations among the latent variables in the generative graph\nare preserved in the class-conditional discriminative graph. We demonstrate on\nimage classi\ufb01cation benchmarks that the deepest layers (convolutional and dense)\nof common networks can be replaced by signi\ufb01cantly smaller learned structures,\nwhile maintaining classi\ufb01cation accuracy\u2014state-of-the-art on tested benchmarks.\nOur structure learning algorithm requires a small computational cost and runs\nef\ufb01ciently on a standard desktop CPU.\n\n1\n\nIntroduction\n\nOver the last decade, deep neural networks have proven their effectiveness in solving many chal-\nlenging problems in various domains such as speech recognition (Graves & Schmidhuber, 2005),\ncomputer vision (Krizhevsky et al., 2012; Girshick et al., 2014; Szegedy et al., 2015) and machine\ntranslation (Collobert et al., 2011). As compute resources became more available, large scale models\nhaving millions of parameters could be trained on massive volumes of data, to achieve state-of-the-art\nsolutions. Building these models requires various design choices such as network topology, cost\nfunction, optimization technique, and the con\ufb01guration of related hyper-parameters.\nIn this paper, we focus on the design of network topology\u2014structure learning. 
Generally, exploration\nof this design space is a time consuming iterative process that requires close supervision by a human\nexpert. Many studies provide guidelines for design choices such as network depth (Simonyan &\nZisserman, 2014), layer width (Zagoruyko & Komodakis, 2016), building blocks (Szegedy et al.,\n2015), and connectivity (He et al., 2016; Huang et al., 2016). Based on these guidelines, these studies\npropose several meta-architectures, trained on huge volumes of data. These were applied to other\ntasks by leveraging the representational power of their convolutional layers and \ufb01ne-tuning their\ndeepest layers for the task at hand (Donahue et al., 2014; Hinton et al., 2015; Long et al., 2015; Chen\net al., 2015; Liu et al., 2015). However, these meta-architectures may be unnecessarily large and\nrequire large computational power and memory for training and inference.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fThe problem of model structure learning has been widely researched for many years in the proba-\nbilistic graphical models domain. Speci\ufb01cally, Bayesian networks for density estimation and causal\ndiscovery (Pearl, 2009; Spirtes et al., 2000). Two main approaches were studied: score-based and\nconstraint-based. Score-based approaches combine a scoring function, such as BDe (Cooper &\nHerskovits, 1992), with a strategy for searching in the space of structures, such as greedy equivalence\nsearch (Chickering, 2002). Adams et al. (2010) introduced an algorithm for sampling deep belief\nnetworks (generative model) and demonstrated its applicability to high-dimensional image datasets.\nConstraint-based approaches (Pearl, 2009; Spirtes et al., 2000) \ufb01nd the optimal structures in the large\nsample limit by testing conditional independence (CI) between pairs of variables. 
They are generally faster than score-based approaches (Yehezkel & Lerner, 2009) and have a well-defined stopping criterion (e.g., maximal order of conditional independence). However, these methods are sensitive to errors in the independence tests, especially in the case of high-order CI tests and small training sets.

Motivated by these methods, we propose a new interpretation for depth and inter-layer connectivity in deep neural networks. We derive a structure learning algorithm such that a hierarchy of independencies in the input distribution is encoded in a deep generative graph, where lower-order independencies are encoded in deeper layers. Thus, the number of layers is automatically determined, which is a desirable virtue in any architecture learning method. We then convert the generative graph into a discriminative graph, demonstrating the ability of the latter to mimic the former (i.e., to preserve its conditional dependencies). In the resulting structure, a neuron in a layer is allowed to connect to neurons in deeper layers, skipping intermediate layers. This is similar to a shortcut connection (Raiko et al., 2012), but our method derives it automatically. Moreover, neurons in deeper layers represent low-order (small condition set) independencies and have a wide scope of the input, whereas neurons in the first layers represent higher-order (larger condition set) independencies and have a narrower scope. An example of a learned structure, for MNIST, is given in Figure 1 (X are image pixels).

Figure 1: An example of a structure learned by our algorithm (classifying MNIST digits, 99.07% accuracy). Neurons in a layer may connect to neurons in any deeper layer. Depth is determined automatically. Each gather layer selects a subset of the input, where each input variable is gathered only once. 
A neural route, starting with a gather layer, passes through densely connected layers where it may split (copy) and merge (concatenate) with other routes in correspondence with the hierarchy of independencies identified by the algorithm. All routes merge into the final output layer.

The paper is organized as follows. We discuss related work in Section 2. In Section 3 we describe our method; its correctness is proved in supplementary material Sec. A. We provide experimental results in Section 4, and conclude in Section 5.

2 Related Work

Recent studies have focused on automating the exploration of the design space, posing it as a hyper-parameter optimization problem and proposing various approaches to solve it. Miconi (2016) learns the topology of an RNN by introducing structural parameters into the model and optimizing them along with the model weights by common gradient descent methods. Smith et al. (2016) take a similar approach, incorporating the structure learning into the parameter learning scheme and gradually growing the network up to a maximum size.

A common approach is to define the design space in a way that enables a feasible exploration process and to design an effective method for exploring it. Zoph & Le (2016) (NAS) first define a set of hyper-parameters characterizing a layer (number of filters, kernel size, stride). Then they use a controller-RNN for finding the optimal sequence of layer configurations for a "trainee network". This is done using policy gradients (REINFORCE) for optimizing the objective function that is based on the accuracy achieved by the "trainee" on a validation set. Although this work demonstrates capabilities to solve large-scale problems (ImageNet), it comes with a huge computational cost. In a following work, Zoph et al. (2017) address the same problem but apply a hierarchical approach. 
They use NAS to design network modules on a small-scale dataset (CIFAR-10) and transfer this knowledge to a large-scale problem by learning the optimal topology composed of these modules. Baker et al. (2016) use reinforcement learning as well, applying Q-learning with an epsilon-greedy exploration strategy and experience replay. Negrinho & Gordon (2017) propose a language that allows a human expert to compactly represent a complex search space over architectures and hyper-parameters as a tree, and then use methods such as MCTS or SMBO to traverse this tree. Smithson et al. (2016) present a multi-objective design-space exploration, taking into account not only the classification accuracy but also the computational cost. In order to reduce the cost involved in evaluating the network's accuracy, they train a Response Surface Model that predicts the accuracy at a much lower cost, reducing the number of candidates that go through actual validation-accuracy evaluation. Another common approach for architecture search is based on evolutionary strategies to define and search the design space. Real et al. (2017) and Miikkulainen et al. (2017) use evolutionary algorithms to evolve an initial model or blueprint based on its validation performance.

Common to all these recent studies is the fact that structure learning is done in a supervised manner, eventually learning a discriminative model. Moreover, these approaches require huge compute resources, rendering the solution infeasible for most applications given limited compute and time.

3 Proposed Method

Preliminaries. Consider X = {Xi}_{i=1}^{N}, a set of observed (input) random variables, H a set of latent variables, and Y a target (classification or regression) variable. Each variable is represented by a single node, and a single edge connects two distinct nodes. The parent set of a node X in G is denoted Pa(X; G), and the children set is denoted Ch(X; G). 
Consider four graphical models, G, Ginv, Gdis, and gX. Graph G is a generative DAG defined over X ∪ H, where Ch(X; G) = ∅, ∀X ∈ X. Graph G can be described as a layered deep Bayesian network where the parents of a node can be in any deeper layer and are not restricted to the previous layer¹. In a graph with m latent layers, we index the deepest layer as 0 and the layer connected to the input as m − 1. The root nodes (parentless) are latent, H(0), and the leaves (childless) are the observed nodes, X, with Pa(X; G) ⊂ H. Graph Ginv is called a stochastic inverse of G, defined over X ∪ H, where Pa(X; Ginv) = ∅, ∀X ∈ X. Graph Gdis is a discriminative graph defined over X ∪ H ∪ Y, where Pa(X; Gdis) = ∅, ∀X ∈ X and Ch(Y; Gdis) = ∅. Graph gX is a CPDAG (a family of Markov-equivalent Bayesian networks) defined over X. Graph gX is generated and maintained as an internal state of the algorithm, serving as an auxiliary graph. The order of an independence relation between two variables is defined to be the condition set size. For example, if X1 and X2 are independent given X3, X4, and X5 (d-separated in the faithful DAG, X1 ⊥⊥ X2 | {X3, X4, X5}), then the independence order is |{X3, X4, X5}| = 3.

3.1 Key Idea

We cast the problem of learning the structure of a deep neural network as a problem of learning the structure of a deep (discriminative) probabilistic graphical model, Gdis. That is, a graph of the form X ⇝ H(m−1) ⇝ ··· ⇝ H(0) → Y, where "⇝" represents a sparse connectivity which we learn, and "→" represents full connectivity. The joint probability factorizes as P(X)P(H|X)P(Y|H(0)), and the posterior is P(Y|X) = ∫ P(H|X)P(Y|H(0)) dH, where H = {H(i)}_{i=0}^{m−1}. 
We refer to the P(H|X) part of the equation as the recognition network of an unknown "true" underlying generative model, P(X|H). That is, the network corresponding to P(H|X) approximates the posterior (e.g., as in amortized inference). The key idea is to approximate the latents H that generated the observed X, and then use these values of H(0) for classification. That is, we avoid learning Gdis directly; instead, we learn a generative structure H ⇝ X, then reverse the flow by constructing a stochastic inverse (Stuhlmüller et al., 2013), X ⇝ H. Finally, we add Y and modify the graph to preserve conditional dependencies (Gdis can mimic G; Gdis does not include sparsity that is not supported by G). Lastly, Gdis is converted into a deep neural network by replacing each latent variable by a neural layer. We call this method B2N (Bayesian to Neural), as it learns the connectivity of a deep neural network through Bayesian network structure learning.

¹This differs from the common definition of deep belief networks (Hinton et al., 2006; Adams et al., 2010), where the parents are restricted to the next layer.

3.2 Constructing a Deep Generative Graph

The key idea of constructing G, the generative graph, is to recursively introduce a new latent layer, H(n), after testing n-th order conditional independence in X, and to connect it, as a parent, to latent layers created by subsequent recursive calls that test conditional independence of order n + 1. To better understand why deeper layers represent lower-order (smaller condition set) independencies, consider an ancestral sampling of the generative graph. First, the values of nodes in the deepest layer, corresponding to marginal independence, are sampled: each node is sampled independently. In the next layer, nodes can be sampled independently given the values of deeper nodes. This enables gradually factorizing ("disentangling") the joint distribution over X. 
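As a concrete refresher on the d-separation statements being tested, the standard moral-ancestral-graph criterion can be checked mechanically. The sketch below is ours, not the paper's code, and the toy DAG is our own construction chosen to be consistent with the independencies described for Figure 3 (A ⊥⊥ B marginally; C ⊥⊥ D | {A, B}, an independence of order 2):

```python
# Toy d-separation check via the moral ancestral graph criterion
# (a standard result; code and example DAG are illustrative, ours).
from collections import deque

def d_separated(parents, xs, ys, zs):
    """True iff xs and ys are d-separated given zs in the DAG `parents`
    (a dict: node -> set of parent nodes)."""
    # 1. ancestral closure of xs | ys | zs
    anc, stack = set(), list(xs | ys | zs)
    while stack:
        v = stack.pop()
        if v not in anc:
            anc.add(v)
            stack.extend(parents.get(v, ()))
    # 2. moralize: link each node to its parents, and co-parents to each other
    adj = {v: set() for v in anc}
    for v in anc:
        ps = [p for p in parents.get(v, ()) if p in anc]
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for a in ps:
            for b in ps:
                if a != b:
                    adj[a].add(b)
    # 3. delete the conditioning set and test reachability from xs to ys
    seen, queue = set(xs), deque(xs)
    while queue:
        v = queue.popleft()
        for w in adj[v] - zs - seen:
            seen.add(w); queue.append(w)
    return not (seen & ys)

# Toy DAG: A and B are parents of C and D; C and D are parents of E.
dag = {"C": {"A", "B"}, "D": {"A", "B"}, "E": {"C", "D"}}
```

Here `d_separated(dag, {"C"}, {"D"}, {"A", "B"})` holds with a condition set of size 2, while conditioning on the collider E opens the path C → E ← D.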
Hence, approximating the values of the latents, H, in the deepest layer provides us with statistically independent features of the data, which can be fed into a single-layer linear classifier. Yehezkel & Lerner (2009) introduced an efficient algorithm (RAI) for constructing a CPDAG over X by a recursive application of conditional independence tests with increasing condition set sizes. Our algorithm is based on this framework for testing independence in X and updating the auxiliary graph gX.

Our proposed recursive algorithm for constructing G is presented in Algorithm 1 (DeepGen), and a flow chart is shown in the supplementary material Sec. B. The algorithm starts with condition set size n = 0, gX a complete graph (defined over X), and a set of exogenous nodes, Xex = ∅. The set Xex is exogenous to gX and consists of parents of X. Note that there are two exit points, lines 4 and 14. Also, there are multiple recursive calls, lines 8 (within a loop) and 9, leading to multiple parallel recursive traces, which will construct multiple generative flows rooted at some deeper layer.

The algorithm starts by testing the exit condition (line 2). It is satisfied if there are not enough nodes in X for a condition set of size n. In this case, the maximal depth is reached and an empty graph is returned (a layer composed of observed nodes). 
From this point, the recursive procedure will trace back, adding latent parent layers.

Algorithm 1: G ← DeepGen(gX, X, Xex, n)
Input: an initial CPDAG gX over endogenous observed nodes X and exogenous observed nodes Xex, and a desired resolution n.
Output: G, a latent structure over X and H
1  DeepGen(gX, X, Xex, n)
2    if the maximal indegree of gX(X) is below n + 1 then          ▹ exit condition
3      G ← an empty graph over X                                   ▹ create a gather layer
4      return G
5    g′X ← IncSeparation(gX, n)                                    ▹ n-th order independencies
6    {XD, XA1, ..., XAk} ← SplitAutonomous(X, g′X)                 ▹ identify autonomies
7    for i ∈ {1 ... k} do
8      GAi ← DeepGen(g′X, XAi, Xex, n + 1)                         ▹ a recursive call
9    GD ← DeepGen(g′X, XD, Xex ∪ {XAi}_{i=1}^{k}, n + 1)           ▹ a recursive call
10   G ← (∪_{i=1}^{k} GAi) ∪ GD                                    ▹ merge results
11   Create in G k latent nodes H(n) = {H(n)_1, ..., H(n)_k}       ▹ create a latent layer
12   Let H(n+1)_Ai and H(n+1)_D be the sets of parentless nodes in GAi and GD, respectively.
13   Set each H(n)_i to be a parent of {H(n+1)_Ai ∪ H(n+1)_D}      ▹ connect
14   return G

The procedure IncSeparation (line 5) disconnects (in gX) conditionally independent variables in two steps. First, it tests dependency between Xex and X, i.e., X ⊥⊥ X′ | S for every connected pair X ∈ X and X′ ∈ Xex, given a condition set S ⊂ {Xex ∪ X} of size n. Next, it tests dependency within X, i.e., Xi ⊥⊥ Xj | S for every connected pair Xi, Xj ∈ X, given a condition set S ⊂ {Xex ∪ X} of size n. 
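The CI tests invoked by IncSeparation are a black box to the algorithm. For continuous inputs, one common instantiation is a partial-correlation (Fisher z) test; this is our illustrative choice, not something the text prescribes, and all function names below are ours:

```python
# Illustrative partial-correlation CI test (Fisher z). One common way to
# instantiate the black-box CI oracle used by IncSeparation; treat this
# as a hedged sketch, not the paper's implementation.
import numpy as np

def partial_corr(data, i, j, cond):
    """Partial correlation of columns i and j of `data` (n x d array)
    given the columns in `cond`, read off the precision matrix."""
    idx = [i, j] + list(cond)
    prec = np.linalg.inv(np.cov(data[:, idx], rowvar=False))
    return -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])

def is_independent(data, i, j, cond, z_crit=1.96):
    """Declare X_i independent of X_j given X_cond when the Fisher z
    statistic falls below the critical value (1.96 ~ alpha = 0.05)."""
    r = np.clip(partial_corr(data, i, j, cond), -0.9999, 0.9999)
    n = data.shape[0]
    z = np.sqrt(n - len(cond) - 3) * 0.5 * np.log((1 + r) / (1 - r))
    return abs(z) < z_crit
```

Note that `len(cond)` is exactly the order n of the test in Algorithm 1, and the exit condition reflects that a size-n condition set needs enough remaining neighbors.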
After removing the corresponding edges, the remaining edges are\ndirected by applying two rules (Pearl, 2009; Spirtes et al., 2000). First, v-structures are identi\ufb01ed\nand directed. Then, edges are continually directed, by avoiding the creation of new v-structures and\ndirected cycles, until no more edges can be directed. Following the terminology of Yehezkel & Lerner\n(2009), we say that this function increases the graph d-separation resolution from n \u2212 1 to n.\nThe procedure SplitAutonomous (line 6) identi\ufb01es autonomous sets, one descendant set, X D,\nand k ancestor sets, X A1, . . . , X Ak in two steps. First, the nodes having the lowest topological\norder are grouped into X D. Then, X D is removed (temporarily) from gX revealing unconnected\nsub-structures. The number of unconnected sub-structures is denoted by k and the nodes set of each\nsub-structure is denoted by X Ai (i \u2208 {1 . . . k}).\nAn autonomous set in gX includes all its nodes\u2019 parents (complying with the Markov property) and\ntherefore a corresponding latent structure can be further learned independently, using a recursive call.\nThus, the algorithm is called recursively and independently for the k ancestor sets (line 8), and then\nfor the descendant set, treating the ancestor sets as exogenous (line 9). This recursive decomposition\nof X is illustrated in Figure 2. Each recursive call returns a latent structure for each autonomous\nset. Recall that each latent structure encodes a generative distribution over the observed variables\nwhere layer H (n+1), the last added layer (parentless nodes), is a representation of some input subset\nX(cid:48) \u2282 X. Thus, latent variables, H (n), are introduced as parents of the H (n+1) layers (lines 11\u201313).\nIt is important to note that conditional independence is tested only between input variables, X, and\ncondition sets do not include latent variables. 
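The control flow of Algorithm 1 (exit test, autonomous split, the two kinds of recursive calls, and latent-layer creation) can be condensed into a short schematic sketch. This is our reading of the pseudocode, not the authors' implementation; the CPDAG procedures are injected as stand-ins, since a faithful IncSeparation/SplitAutonomous is beyond a few lines:

```python
# Schematic skeleton of the DeepGen recursion (our sketch of Algorithm 1).
# `inc_separation`, `split_autonomous`, and `max_indegree` are injected
# stand-ins for the real CPDAG procedures.

def deep_gen(g, X, X_ex, n, inc_separation, split_autonomous, max_indegree):
    if max_indegree(g, X) < n + 1:            # exit condition (line 2)
        return {"gather": sorted(X)}          # a layer of observed nodes
    g2 = inc_separation(g, n)                 # n-th order independencies
    X_D, ancestors = split_autonomous(g2, X)  # identify autonomies
    parts = [deep_gen(g2, A, X_ex, n + 1, inc_separation,
                      split_autonomous, max_indegree) for A in ancestors]
    exo = set(X_ex).union(*ancestors)         # ancestors become exogenous
    parts.append(deep_gen(g2, X_D, exo, n + 1, inc_separation,
                          split_autonomous, max_indegree))
    # one latent node H(n) per autonomous set, each parenting the
    # parentless nodes of the returned sub-structures (lines 10-13)
    return {"latent_layer": n, "k": len(ancestors), "children": parts}
```

With trivially wired stubs (a toy "graph" whose indegree drops after one separation round, and a split into ancestor set {A, B} and descendant set {C, D}), the recursion bottoms out at n = 1 and returns one latent layer over two gather layers, showing how depth falls out of the recursion rather than being chosen in advance.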
Conditioning on latent variables or testing independence between them is not required by our approach. A 2-layer toy example is given in Figure 3.

Figure 2: An example of a recursive decomposition of the observed set, X. Each circle represents a distinct subset of observed variables (e.g., X(1)_A1 in different circles represents different subsets). At n = 0, a single circle represents all the variables. Each set of variables is split into autonomous ancestor X(n)_Ai and descendant X(n)_D subsets. An arrow indicates a recursive call. Best viewed in color.

Figure 3: An example of learning a 2-layer generative model. [a] An example Bayesian network encoding the underlying independencies in X. [b] gX after marginal independence testing (n = 0). Only A and B are marginally independent (A ⊥⊥ B). [c] gX after a recursive call to learn the structure of nodes {C, D, E} with n = 2 (C ⊥⊥ D | {A, B}). The exit condition is met in subsequent recursive calls, and thus latent variables are added to G at n = 2 [d], and then at n = 0 [e] (the final structure).

3.3 Constructing a Discriminative Graph

We now describe how to convert G into a discriminative graph, Gdis, with a target variable, Y (classification/regression). First, we construct Ginv, a graphical model that preserves all conditional dependencies in G but has a different node ordering in which the observed variables, X, have the highest topological order (parentless); this is a stochastic inverse of G. Stuhlmüller et al. (2013) and Paige & Wood (2016) presented a heuristic algorithm for constructing such stochastic inverses. However, limiting Ginv to a DAG, although preserving all conditional dependencies, may omit many independencies and add new edges between layers. Instead, we allow it to be a projection of a latent structure (Pearl, 2009). 
That is, we assume the presence of additional hidden variables Q that are not in Ginv but induce dependency (for example, "interactive forks" (Pearl, 2009)) among H. For clarity, we omit these variables from the graph and use bi-directional edges to represent the dependency induced by them. Ginv is constructed in two steps:

1. Invert the direction of all the edges in G (invert the inter-layer connectivity).
2. Connect each pair of latent variables sharing a common child in G with a bi-directional edge.

These steps ensure the preservation of conditional dependence.

Proposition 1. Graph Ginv preserves all conditional dependencies in G (i.e., G ⪯ Ginv).

Note that conditional dependencies among X are not required to be preserved in Ginv and Gdis, as these are observed variables (Paige & Wood, 2016).

Finally, a discriminative graph Gdis is constructed by replacing the bi-directional dependency relations in Ginv (induced by Q) with explaining-away relations, which are provided by adding the observed class variable Y. Node Y is set in Gdis to be the common child of the leaves in Ginv (latents introduced after testing marginal independencies in X). See an example in Figure 4. This ensures the preservation of conditional dependency relations in Ginv. That is, Gdis, given Y, can mimic Ginv.

Figure 4: An example of the three graphs constructed by our algorithm: [a] a generative deep latent structure G, [b] its stochastic inverse Ginv (Stuhlmüller et al., 2013; Paige & Wood, 2016), and [c] a discriminative structure Gdis (target node Y is added).

Proposition 2. Graph Gdis, conditioned on Y, preserves all conditional dependencies in Ginv (i.e., Ginv ⪯ Gdis|Y).

It follows that G ⪯ Ginv ⪯ Gdis conditioned on Y.

Proposition 3. 
Graph Gdis, conditioned on Y, preserves all conditional dependencies in G (i.e., G ⪯ Gdis).

Details and proofs for all the propositions are provided in supplementary material Sec. A.

3.4 Constructing a Feed-Forward Neural Network

We construct a neural network based on the connectivity in Gdis. Sigmoid belief networks (Neal, 1992) have been shown to be powerful neural network density estimators (Larochelle & Murray, 2011; Germain et al., 2015). In these networks, conditional probabilities are defined as logistic regressors. Similarly, for Gdis we may define, for each latent variable H′ ∈ H, p(H′ = 1|X′) = sigm(W′X′ + b′), where sigm(x) = 1/(1 + exp(−x)), X′ = Pa(H′; Gdis), and (W′, b′) are the parameters of the neural network. Nair & Hinton (2010) proposed replacing each binary stochastic node H′ by an infinite number of copies having the same weights but with bias offsets decreasing by one. They showed that this infinite set can be approximated by ∑_{i=1}^{N} sigm(v − i + 0.5) ≈ log(1 + e^v), where v = W′X′ + b′. They further approximate this function by max(0, v + ε), where ε is a zero-centered Gaussian noise. Following these approximations, they provide an approximate probabilistic interpretation for the ReLU function, max(0, v). As demonstrated by Jarrett et al. (2009) and Nair & Hinton (2010), these units are able to learn better features for object classification in images. In order to further increase the representational power, we represent each H′ by a set of neurons having ReLU activation functions. That is, each latent variable H′ in Gdis is represented in the neural network by a fully-connected layer. 
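As a quick numerical aside (ours, not from the paper): even a truncated sum of shifted sigmoids tracks softplus, log(1 + e^v), closely, while ReLU deviates from softplus by at most log 2, attained at v = 0.

```python
# Numerical check of the approximation chain: a sum of N sigmoids with
# biases offset by one ~ softplus(v) = log(1 + e^v) ~ ReLU(v).
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

v = np.linspace(-5.0, 5.0, 101)
stepped_sum = sum(sigm(v - i + 0.5) for i in range(1, 101))  # N = 100 copies
softplus = np.log1p(np.exp(v))

# the truncated sum stays within a few hundredths of softplus on this grid
gap_sum = float(np.max(np.abs(stepped_sum - softplus)))
# ReLU differs from softplus by at most log(2), at v = 0
gap_relu = float(np.max(np.abs(np.maximum(0.0, v) - softplus)))
```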
Finally, the class node Y is represented by a softmax layer.

4 Experiments

Our structure learning algorithm is implemented using BNT (Murphy, 2001) and runs efficiently on a standard desktop CPU (excluding neural network parameter learning). For the learned structures, all layers were allocated an equal number of neurons. The threshold for the independence tests and the number of neurons per layer were selected using a validation set. In all the experiments, we used ReLU activations, ADAM (Kingma & Ba, 2015) optimization, batch normalization (Ioffe & Szegedy, 2015), and dropout (Srivastava et al., 2014) in all the dense layers. All optimization hyper-parameters that were tuned for the vanilla topologies were also used, without additional tuning, for the learned structures. In all the experiments, parameter learning was repeated five times, and the average and standard deviation of the classification accuracy were recorded. Only test-set accuracy is reported.

4.1 Learning the Structure of the Deepest Layers in Common Topologies

We evaluate the quality of our learned structures using five image classification benchmarks and seven common topologies (and simpler hand-crafted structures), which we call "vanilla topologies". The benchmarks and vanilla topologies are described in Table 1. Similarly to Li et al. (2017), we used the VGG-16 network that was previously modified and adapted for the CIFAR-10 dataset. This VGG-16 version contains significantly fewer parameters than the original one.

Table 1: Benchmarks and vanilla topologies used in our experiments. MNIST-Man and SVHN-Man topologies were manually created by us. MNIST-Man has two convolutional layers (32 and 64 filters each) and one dense layer with 128 neurons. SVHN-Man was created as a small network reference having reasonable accuracy (Acc.) 
compared to Maxout-NiN.

Dataset | Id. | Vanilla Topology | Description | Size | Acc.
MNIST (LeCun et al., 1998) | A | MNIST-Man | 32-64-FC:128 | 127K | 99.35
SVHN (Netzer et al., 2011) | B | Maxout NiN | (Chang & Chen, 2015) | 1.6M | 98.10
SVHN (Netzer et al., 2011) | C | SVHN-Man | 16-16-32-32-64-FC:256 | 105K | 97.10
CIFAR 10 (Krizhevsky & Hinton, 2009) | D | VGG-16 | (Simonyan & Zisserman, 2014) | 15M | 92.32
CIFAR 10 (Krizhevsky & Hinton, 2009) | E | WRN-40-4 | (Zagoruyko & Komodakis, 2016) | 9M | 95.09
CIFAR 100 (Krizhevsky & Hinton, 2009) | F | VGG-16 | (Simonyan & Zisserman, 2014) | 15M | 68.86
ImageNet (Deng et al., 2009) | G | AlexNet | (Krizhevsky et al., 2012) | 61M | 57.20

In preliminary experiments we found that, for SVHN and ImageNet, a small subset of the training data is sufficient for learning the structure. As a result, for SVHN only the basic training data is used (without the extra data), i.e., 13% of the available training data, and for ImageNet 5% of the training data is used. Parameters were optimized using all of the training data.

Convolutional layers are powerful feature extractors for images, exploiting spatial smoothness properties, translational invariance, and symmetry. We therefore evaluate our algorithm by using the first convolutional layers of the vanilla topologies as "feature extractors" (mostly below 50% of the vanilla network size) and then learning a deep structure, the "learned head", from their output. That is, the deepest layers of the vanilla network, the "vanilla head", are removed and replaced by a structure which is learned, in an unsupervised manner, by our algorithm². This results in a new architecture which we train end-to-end. Finally, a softmax layer is added and the entire network's parameters are optimized.

First, we evaluate the accuracy of the learned structure as a function of the number of parameters and compare it to a densely connected network (fully connected layers) having the same depth and size (Figure 5). 
For SVHN, we used the Batch Normalized Maxout Network-in-Network topology (Chang & Chen, 2015) and removed the deepest layers starting from the output of the second NiN block (MMLP-2-2). For CIFAR-10, we used the VGG-16 and removed the deepest layers starting from the output of the conv.7 layer. It is evident that the accuracies of the learned structures are significantly higher (error bars represent 2 standard deviations) than those produced by a set of fully connected layers, especially in cases where the network is limited to a small number of parameters.

Figure 5: Classification accuracy of MNIST, SVHN, and CIFAR-10, as a function of network size (learned structure vs. fully connected layers). Error bars indicate two standard deviations.

Next, in Figure 6 and Table 2 we provide a summary of network sizes and classification accuracies, achieved by replacing the deepest layers of common topologies (vanilla) with a learned structure. In all the cases, the size of the learned structure is significantly smaller than that of the vanilla topology.

Figure 6: A comparison between the vanilla and our learned structure (B2N), in terms of normalized number of parameters. The first few layers of the vanilla topology are used for feature extraction. Stacked bars refer to either the vanilla or our learned structure. 
The total number of parameters of the vanilla network is indicated on top of each stacked bar.

²We also learned a structure for classifying MNIST digits directly from image pixels, without using convolutional layers for feature extraction. The resulting network structure (Figure 1) achieves an accuracy of 99.07%, whereas a network with 3 fully-connected layers achieves 98.75%.

4.2 Comparison to Other Methods

Our structure learning algorithm runs efficiently on a standard desktop CPU, while providing structures with competitive classification accuracies and network sizes. First, we compare our method to the NAS algorithm (Zoph & Le, 2016). For CIFAR-10, NAS achieves an error rate of 5.5% with a network of size 4.2M. Our method, using the feature extraction of the WRN-40-4 network, achieves this same error rate with a 26% smaller network (3.1M total size). Using the same feature extraction, the lowest classification error rate achieved by our algorithm for CIFAR-10 is 4.58% with a network of size 6M, whereas the NAS algorithm achieves an error rate of 4.47% with a network of size 7.1M. Recall that the NAS algorithm requires training thousands of networks using hundreds of GPUs, which is impractical for most real-world applications.

When compared to recent pruning methods, which focus on reducing the number of parameters in a pre-trained network, our method demonstrates state-of-the-art reduction in parameters. Recently reported results are summarized in Table 3. It is important to note that although these methods prune all the network layers whereas our method replaces only the network head, our method was found to be significantly superior. 
Moreover, pruning can be applied to the feature extraction part of the network, which may further improve parameter reduction.

Table 2: Parameter reduction ratio (vanilla size/learned size) and difference in classification accuracy (Acc. Diff. = learned − vanilla, higher is better). "Full" = feature extraction + head.

Id.  Param. Reduc. (Head)  Param. Reduc. (Full)  Acc. Diff.
A    4.2×                  2.7×                  +0.10 ± 0.04
B    10.0×                 1.4×                  −0.40 ± 0.05
C    3.5×                  2.5×                  −0.86 ± 0.05
D    28.3×                 7.0×                  +0.29 ± 0.14
E    2.8×                  1.5×                  +0.33 ± 0.14
F    53.2×                 7.7×                  +0.05 ± 0.17
G    23.0×                 13.3×                 +0.00 ± 0.03

Table 3: Parameter reduction ratio (vanilla/learned size) compared to recent pruning methods (reducing the size of a pre-trained network with minimal accuracy degradation). Results marked "acc. deg." correspond to accuracy degradation after pruning.

Network             Method                    Reduction
VGG-16 (CIFAR-10)   Li et al. (2017)          3×
                    Ayinde & Zurada (2018)    4.6×
                    Ding et al. (2018)        5.4× (acc. deg.)
                    Huang et al. (2018)       6× (acc. deg.)
                    B2N (our)                 7×
AlexNet (ImageNet)  Denton et al. (2014)      5×
                    Yang et al. (2015)        3.2×
                    Han et al. (2015, 2016)   9×
                    Manessi et al. (2017)     12× (acc. deg.)
                    B2N (our)                 13.3×

5 Conclusions

We presented a principled approach for learning the structure of deep neural networks. Our proposed algorithm learns in an unsupervised manner and requires only a small computational cost. The resulting structures encode a hierarchy of independencies in the input distribution, where a node in one layer may connect to another node in any deeper layer, and network depth is determined automatically. We demonstrated that our algorithm learns small structures and maintains classification accuracies for common image classification benchmarks.
It is also demonstrated that while convolutional layers are very useful for exploiting domain knowledge, such as spatial smoothness, translational invariance, and symmetry, in some cases they are outperformed by a learned structure in the deeper layers. Moreover, while reusing common topologies (meta-architectures) across a variety of classification tasks is computationally inefficient, we expect our approach to learn a smaller and more accurate network, tailored uniquely to each classification task.

As only unlabeled data is required for learning the structure, we expect our approach to be practical for many domains beyond image classification, such as knowledge discovery, and we plan to explore the interpretability of the learned structures. Casting the problem of learning the connectivity of a deep neural network as a Bayesian network structure learning problem enables the development of new principled and efficient approaches. This can lead to the development of new topologies and connectivity models, and can provide a greater understanding of the domain. One possible extension to our work, which we plan to explore, is learning the connectivity between feature maps in convolutional layers.

References

Adams, Ryan, Wallach, Hanna, and Ghahramani, Zoubin. Learning the structure of deep sparse graphical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 1-8, 2010.

Ayinde, Babajide O. and Zurada, Jacek M. Building efficient convnets using redundant feature pruning. In Workshop Track of the International Conference on Learning Representations (ICLR), 2018.

Baker, Bowen, Gupta, Otkrist, Naik, Nikhil, and Raskar, Ramesh. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.

Chang, Jia-Ren and Chen, Yong-Sheng. Batch-normalized maxout network in network.
arXiv preprint arXiv:1511.02583, 2015.

Chen, Tianqi, Goodfellow, Ian, and Shlens, Jonathon. Net2Net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641, 2015.

Chickering, David Maxwell. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507-554, 2002.

Collobert, Ronan, Weston, Jason, Bottou, Léon, Karlen, Michael, Kavukcuoglu, Koray, and Kuksa, Pavel. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493-2537, 2011.

Cooper, Gregory F and Herskovits, Edward. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4):309-347, 1992.

Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248-255. IEEE, 2009.

Denton, Emily L, Zaremba, Wojciech, Bruna, Joan, LeCun, Yann, and Fergus, Rob. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pp. 1269-1277, 2014.

Ding, Xiaohan, Ding, Guiguang, Han, Jungong, and Tang, Sheng. Auto-balanced filter pruning for efficient convolutional neural networks. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), 2018.

Donahue, Jeff, Jia, Yangqing, Vinyals, Oriol, Hoffman, Judy, Zhang, Ning, Tzeng, Eric, and Darrell, Trevor. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, volume 32, pp. 647-655, 2014.

Germain, Mathieu, Gregor, Karol, Murray, Iain, and Larochelle, Hugo. MADE: Masked autoencoder for distribution estimation. In ICML, pp. 881-889, 2015.

Girshick, Ross, Donahue, Jeff, Darrell, Trevor, and Malik, Jitendra.
Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580-587, 2014.

Graves, Alex and Schmidhuber, Jürgen. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602-610, 2005.

Han, Song, Pool, Jeff, Tran, John, and Dally, William. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems, pp. 1135-1143, 2015.

Han, Song, Mao, Huizi, and Dally, William J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Hinton, Geoffrey, Vinyals, Oriol, and Dean, Jeff. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Hinton, Geoffrey E, Osindero, Simon, and Teh, Yee-Whye. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.

Huang, Gao, Liu, Zhuang, Weinberger, Kilian Q, and van der Maaten, Laurens. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.

Huang, Qiangui, Zhou, Kevin, You, Suya, and Neumann, Ulrich. Learning to prune filters in convolutional neural networks. arXiv preprint arXiv:1801.07365, 2018.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448-456, 2015.

Jarrett, Kevin, Kavukcuoglu, Koray, LeCun, Yann, et al.
What is the best multi-stage architecture for object recognition? In Computer Vision, 2009 IEEE 12th International Conference on, pp. 2146-2153. IEEE, 2009.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.

Larochelle, Hugo and Murray, Iain. The neural autoregressive distribution estimator. In AISTATS, volume 1, pp. 2, 2011.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Li, Hao, Kadav, Asim, Durdanovic, Igor, Samet, Hanan, and Graf, Hans Peter. Pruning filters for efficient convnets. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.

Liu, Baoyuan, Wang, Min, Foroosh, Hassan, Tappen, Marshall, and Pensky, Marianna. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806-814, 2015.

Long, Mingsheng, Cao, Yue, Wang, Jianmin, and Jordan, Michael. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pp. 97-105, 2015.

Manessi, Franco, Rozza, Alessandro, Bianco, Simone, Napoletano, Paolo, and Schettini, Raimondo. Automated pruning for deep neural network compression. arXiv preprint arXiv:1712.01721, 2017.

Miconi, Thomas. Neural networks with differentiable structure.
arXiv preprint arXiv:1606.06216, 2016.

Miikkulainen, Risto, Liang, Jason, Meyerson, Elliot, Rawal, Aditya, Fink, Dan, Francon, Olivier, Raju, Bala, Navruzyan, Arshak, Duffy, Nigel, and Hodjat, Babak. Evolving deep neural networks. arXiv preprint arXiv:1703.00548, 2017.

Murphy, K. The Bayes net toolbox for Matlab. Computing Science and Statistics, 33:331-350, 2001.

Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807-814, 2010.

Neal, Radford M. Connectionist learning of belief networks. Artificial Intelligence, 56(1):71-113, 1992.

Negrinho, Renato and Gordon, Geoff. DeepArchitect: Automatically designing and training deep architectures. arXiv preprint arXiv:1704.08792, 2017.

Netzer, Yuval, Wang, Tao, Coates, Adam, Bissacco, Alessandro, Wu, Bo, and Ng, Andrew Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, pp. 5, 2011.

Paige, Brooks and Wood, Frank. Inference networks for sequential Monte Carlo in graphical models. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of JMLR, 2016.

Pearl, Judea. Causality: Models, Reasoning, and Inference. Cambridge University Press, second edition, 2009.

Raiko, Tapani, Valpola, Harri, and LeCun, Yann. Deep learning made easier by linear transformations in perceptrons. In Artificial Intelligence and Statistics, pp. 924-932, 2012.

Real, Esteban, Moore, Sherry, Selle, Andrew, Saxena, Saurabh, Suematsu, Yutaka Leon, Le, Quoc, and Kurakin, Alex. Large-scale evolution of image classifiers. arXiv preprint arXiv:1703.01041, 2017.

Simonyan, Karen and Zisserman, Andrew.
Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Smith, Leslie N, Hand, Emily M, and Doster, Timothy. Gradual DropIn of layers to train very deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4763-4771, 2016.

Smithson, Sean C, Yang, Guang, Gross, Warren J, and Meyer, Brett H. Neural networks designing neural networks: Multi-objective hyper-parameter optimization. In Computer-Aided Design (ICCAD), 2016 IEEE/ACM International Conference on, pp. 1-8. IEEE, 2016.

Spirtes, P., Glymour, C., and Scheines, R. Causation, Prediction and Search. MIT Press, 2nd edition, 2000.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.

Stuhlmüller, Andreas, Taylor, Jacob, and Goodman, Noah. Learning stochastic inverses. In Advances in Neural Information Processing Systems, pp. 3048-3056, 2013.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.

Yang, Zichao, Moczulski, Marcin, Denil, Misha, de Freitas, Nando, Smola, Alex, Song, Le, and Wang, Ziyu. Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1476-1483, 2015.

Yehezkel, Raanan and Lerner, Boaz. Bayesian network structure learning by recursive autonomy identification. Journal of Machine Learning Research, 10(Jul):1527-1570, 2009.

Zagoruyko, Sergey and Komodakis, Nikos. Wide residual networks.
arXiv preprint arXiv:1605.07146, 2016.

Zoph, Barret and Le, Quoc V. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

Zoph, Barret, Vasudevan, Vijay, Shlens, Jonathon, and Le, Quoc V. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2017.