{"title": "The streaming rollout of deep networks - towards fully model-parallel execution", "book": "Advances in Neural Information Processing Systems", "page_first": 4039, "page_last": 4050, "abstract": "Deep neural networks, and in particular recurrent networks, are promising candidates to control autonomous agents that interact in real-time with the physical world. However, this requires a seamless integration of temporal features into the network\u2019s architecture. For the training of and inference with recurrent neural networks, they are usually rolled out over time, and different rollouts exist. Conventionally during inference, the layers of a network are computed in a sequential manner resulting in sparse temporal integration of information and long response times. In this study, we present a theoretical framework to describe rollouts, the level of model-parallelization they induce, and demonstrate differences in solving specific tasks. We prove that certain rollouts, also for networks with only skip and no recurrent connections, enable earlier and more frequent responses, and show empirically that these early responses have better performance. The streaming rollout maximizes these properties and enables a fully parallel execution of the network reducing runtime on massively parallel devices. 
Finally, we provide an open-source toolbox to design, train, evaluate, and interact with streaming rollouts.", "full_text": "The streaming rollout of deep networks - towards fully model-parallel execution

Volker Fischer
Bosch Center for Artificial Intelligence
Renningen, Germany
volker.fischer@de.bosch.com

Jan K\u00f6hler
Bosch Center for Artificial Intelligence
Renningen, Germany
jan.koehler@de.bosch.com

Thomas Pfeil
Bosch Center for Artificial Intelligence
Renningen, Germany
thomas.pfeil@de.bosch.com

Abstract

Deep neural networks, and in particular recurrent networks, are promising candidates to control autonomous agents that interact in real-time with the physical world. However, this requires a seamless integration of temporal features into the network\u2019s architecture. For the training of and inference with recurrent neural networks, they are usually rolled out over time, and different rollouts exist. Conventionally during inference, the layers of a network are computed in a sequential manner resulting in sparse temporal integration of information and long response times. In this study, we present a theoretical framework to describe rollouts, the level of model-parallelization they induce, and demonstrate differences in solving specific tasks. We prove that certain rollouts, also for networks with only skip and no recurrent connections, enable earlier and more frequent responses, and show empirically that these early responses have better performance. The streaming rollout maximizes these properties and enables a fully parallel execution of the network reducing runtime on massively parallel devices. 
Finally, we provide an open-source toolbox to design, train, evaluate, and interact with streaming rollouts.

1 Introduction

Over the last years, the combination of newly available large datasets, parallel computing power, and new techniques to implement and train deep neural networks has led to significant improvements in the fields of vision [1], speech [2], and reinforcement learning [3]. In the context of autonomous tasks, neural networks usually interact with the physical world in real-time, which renders it essential to integrate the processing of temporal information into the network\u2019s design.
Recurrent neural networks (RNNs) are one common approach to leverage temporal context and have gained increasing interest not only for speech [4] but also for vision tasks [5]. RNNs use neural activations to inform future computations, hence introducing a recursive dependency between neuron activations. This augments the network with a memory mechanism and allows it, unlike feed-forward neural networks, to exhibit dynamic behavior integrating a stream or sequence of inputs. For training and inference, backpropagation through time (BPTT) [6] or its truncated version [6, 7] is used, where the RNN is rolled out (or unrolled) through time, disentangling the recursive dependencies and transforming the recurrent network into a feed-forward network.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.

Since unrolling a cyclic graph is not well-defined [8], different possible rollouts exist for the same neural network. This is due to the rollout process itself, as there are several ways to unroll cycles of length greater than 1 (cycles larger than recurrent self-connections). More generally, there are two ways to unroll every edge (cf. Fig. 1): having the edge connect its source and target nodes at the same point in time (see, e.g., vertical edges in Fig. 
1b) or bridging time steps (see, e.g., Fig. 1c). Bridging is especially necessary for self-recurrent edges or larger cycles in the network, so that the rollout in fact becomes a feed-forward network. In a rollout, conventionally most edges are applied in the intra-frame non-bridging manner and bridge time steps only if necessary [9, 10, 11, 12]. We refer to these rollouts as sequential rollouts throughout this work. One contribution of this study is the proof that the number of rollouts increases exponentially with network complexity.
The main focus of this work is that different rollouts induce different levels of model-parallelism and different behaviors for an unrolled network. In rollouts inducing complete model-parallelism, which we call streaming, nodes of a certain time step in the unrolled network become computationally disentangled and can be computed in parallel (see Fig. 1c). This idea is not restricted to recurrent networks, but generalizes to a large variety of network architectures covered by the presented graph-theoretical framework in Sec. 3. In Sec. 4, we show experimental results that emphasize the difference of rollouts both for networks with recurrent and skip connections and for networks with only skip connections. In this study, we are not concerned with comparing performances between networks, but between different rollouts of a given network (e.g., Fig. 1b vs. c).
Our theoretical and empirical findings show that streaming rollouts enable fully model-parallel inference achieving low-latency and high-frequency responses. 
These features are particularly important for real-time applications such as autonomous cars [13] or UAV systems [14] in which the neural networks have to make complex decisions on high-dimensional and frequent input signals within a short time.

Figure 1: (best viewed in color) a: Neural network with skip and recurrent connections (SR) and different rollouts: b: the sequential rollout, c: the streaming rollout, and d: a hybrid rollout. Nodes represent layers, edges represent transformations, e.g., convolutions. Only one rollout step is shown and each column in (b-d) is one frame within the rollout.

To the best of our knowledge, up to this study, no general theory exists that compares different rollouts, and our contributions can be summarized as follows:

\u2022 We provide a theoretical framework to describe rollouts of deep neural networks and show that, and in some cases how, different rollouts lead to different levels of model-parallelism and network behavior.
\u2022 We formally introduce streaming rollouts enabling fully model-parallel network execution, and mathematically prove that streaming rollouts have the shortest response time to and highest sampling frequency of inputs.
\u2022 We empirically give examples underlining the theoretical statements and show that streaming rollouts can further outperform other rollouts by yielding better early and late performance.
\u2022 We provide an open-source toolbox specifically designed to study streaming rollouts of deep neural networks.

2 Related work

The idea of RNNs dates back to the mid-70s [15] and was popularized by [16]. 
RNNs and their variants, especially Long Short-Term Memory networks (LSTM) [17], considerably improved performance in different domains such as speech recognition [4], handwriting recognition [5], machine translation [18], optical character recognition (OCR) [19], text-to-speech synthesis [20], social signal classification [21], or online multi-target tracking [22]. The review [23] gives an overview of the history and benchmark records set by DNNs and RNNs.

Variants of RNNs: There are several variants of RNN architectures using different mechanisms to memorize and integrate temporal information. These include LSTM networks [17] and related architectures like Gated Recurrent Unit (GRU) networks [24] or recurrent highway networks [25]. Neural Turing Machines (NTM) [26] and Differentiable Neural Computers (DNC) [27] extend RNNs by an addressable external memory. Bi-directional RNNs (BRNNs) [28] incorporate the ability to model the dependency on future information. Numerous works extend and improve these RNN variants, creating architectures with advantages for training or certain data domains (e.g., [29, 30, 31, 32]).
Response time: While RNNs are the main reason to use network rollouts, in this work we also investigate rollouts for non-recurrent networks. Theoretical and experimental results suggest that different rollout types yield different behavior, especially for networks containing skip connections. The rollout pattern influences the response time of a network, which is the duration between input (stimulus) onset and network output (response).
Shortcut or skip connections can play an important role in decreasing response times. Shortcut branches attached to intermediate layers allow earlier predictions (e.g., BranchyNet [33]) and iterative predictions refine from early and coarse to late and fine class predictions (e.g., feedback networks [12]). 
In [34], the authors show that identity skip connections, as used in Residual Networks (ResNet) [1], can be interpreted as local network rollouts acting as filters, which could also be achieved through recurrent self-connections. The good performance of ResNets underlines the importance of local recurrent filters. The runtime of inference and training for the same network can also be reduced by network compression [35, 36] or by optimization of computational implementations [37, 38].
Rollouts: To train RNNs, different rollouts are applied in the literature, though lacking a theoretically founded background. One of the first to describe the transformation of a recurrent MLP into an equivalent feed-forward network, depicting it in a streaming rollout fashion, was [39, ch. 9.4]. The most common way in the literature to unroll networks over time is to duplicate the model for each time step as depicted in Fig. 1b [ch. 10.1 in 40, 9, 10, 11, 12, 41]. However, as we will show in this work, this rollout pattern is neither the only way to unroll a network nor the most efficient one.
The recent work of Carreira et al. [42] also addresses the idea of model-parallelization through dedicated network rollouts to reduce latency between input and network output by distributing computations over multiple GPUs. While their work shows promising empirical findings in the field of video processing, our work provides a theoretical formulation for a more general class of networks and their rollouts. 
Our work also differs in the way the delay between input and output, and network training, is addressed.
Besides the chosen rollout, other methods exist that modify the integration of temporal information: for example, temporal stacking (convolution over time), which imposes a fixed temporal receptive field (e.g., [43, 44]); clocks, where different parts of the network have different update frequencies (e.g., [45, 46, 47, 48, 49]); or predictive states, which try to compensate temporal delays between different network parts (e.g., [42]). For more details, please see also Sec. 5.

3 Graph representations of network rollouts

We describe dependencies inside a neural network N as a directed graph N = (V, E). The nodes v \u2208 V represent different layers and the edges e \u2208 E \u2282 V \u00d7 V represent transformations introducing direct dependencies between layers. We allow self-connections (v, v) \u2208 E and larger cycles in a network. Before stating the central definitions and propositions, we introduce notations used throughout this section and for the proofs in the appendix.
Let G = (V, E) be a directed graph with vertices (or nodes) v \u2208 V and edges e = (esrc, etgt) \u2208 E \u2282 V \u00d7 V. Since neural networks process input data, we denote the input of the graph as the set IG, consisting of all nodes without incoming edges:

IG ..= {v \u2208 V | \u2204u \u2208 V : (u, v) \u2208 E}. (1)

A path in G is a mapping p : {1, . . . , L} \u2192 E with p(i)tgt = p(i+1)src for i \u2208 {1, . . . , L\u22121}, where L \u2208 N is the length of p. We denote the length of a path p also as |p| and the number of elements in a set A as |A|. A path p is called a loop or cycle iff p(|p|)tgt = p(1)src, and it is called minimal iff p is injective. The set of all cycles is denoted as CG. Two paths are called non-overlapping iff they share no edges. 
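The input set of Eq. (1) is straightforward to compute. As a minimal sketch (the plain node-list/edge-set encoding below is our own illustrative choice, not the representation used by the authors' toolbox):

```python
# Compute the input set I_G of a directed graph (Eq. (1)): all nodes
# without incoming edges. Graph encoding is an illustrative assumption.

def input_set(nodes, edges):
    """Return I_G = {v in V | there is no u with (u, v) in E}."""
    targets = {tgt for (_, tgt) in edges}
    return {v for v in nodes if v not in targets}

# Toy network: a chain with one skip edge and one recurrent self-connection.
nodes = ["in", "h1", "h2", "out"]
edges = {("in", "h1"), ("h1", "h2"), ("h2", "out"),
         ("in", "h2"),   # skip connection
         ("h1", "h1")}   # recurrent self-connection
print(input_set(nodes, edges))  # -> {'in'}
```

Note that a self-connection (v, v) counts as an incoming edge, so a layer with only a recurrent self-loop would not be an input node.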
We say a graph is input-connected iff for every node v there exists a path p with p(|p|)tgt = v and p(1)src \u2208 IG. Now we proceed with our definition of a (neural) network.

Definition (network): A network is a directed and input-connected graph N = (V, E) for which 0 < |E| < \u221e.

For our claims, this abstract formulation is sufficient and, while excluding certain artificial cases, it ensures that a huge variety of neural network types is covered (see Fig. A1 for network examples). For deep neural networks, we give an explicit formulation of this abstraction in Sec. A1.2, which we also use for our experiments. Important concepts introduced here are illustrated in Fig. 2. In this work, we separate the concept of network rollouts into two parts: the temporal propagation scheme, which we call rollout pattern, and its associated rollout windows (see also Fig. 1 and Fig. 2):

Definition (rollout pattern and window): Let N = (V, E) be a network. We call a mapping R : E \u2192 {0, 1} a rollout pattern of N. For a rollout pattern R, the rollout window of size W \u2208 N is the directed graph RW = (VW, EW) with:

VW ..= {0, . . . , W} \u00d7 V, with elements written v = (i, v) \u2208 VW,
EW ..= {((i, u), (j, v)) \u2208 VW \u00d7 VW | (u, v) \u2208 E \u2227 j = i + R((u, v))}. (2)

Edges e \u2208 E with R(e) = 1 enable information to directly stream through time. In contrast, edges with R(e) = 0 cause information to be processed within frames, thus introducing sequential dependencies upon nodes inside a frame. We dropped the dependency of EW on the rollout pattern R in the notation. A rollout pattern and its rollout windows are called valid iff RW is acyclic for one and hence for all W \u2208 N. We denote the set of all valid rollout patterns as RN and call the rollout pattern R \u2261 1 the streaming rollout Rstream \u2208 RN. 
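The construction in Eq. (2) can be sketched directly: nodes of the rollout window are (frame, layer) pairs, and a network edge (u, v) connects frame i to frame i + R((u, v)). The encoding below is illustrative only, not the authors' implementation:

```python
# Build the rollout window R_W of Eq. (2) for a given rollout pattern R.
# Edges whose target frame would exceed W are simply not instantiated.

def rollout_window(nodes, edges, R, W):
    V_W = {(i, v) for i in range(W + 1) for v in nodes}
    E_W = {((i, u), (i + R[(u, v)], v))
           for i in range(W + 1) for (u, v) in edges
           if i + R[(u, v)] <= W}
    return V_W, E_W

nodes = ["in", "h", "out"]
edges = {("in", "h"), ("h", "out"), ("h", "h")}  # recurrent self-connection
streaming = {e: 1 for e in edges}                # R == 1 (streaming rollout)
V_W, E_W = rollout_window(nodes, edges, streaming, W=2)
# In the streaming rollout, every edge bridges exactly one frame,
# so the unrolled graph is trivially acyclic:
assert all(j == i + 1 for ((i, _), (j, _)) in E_W)
print(len(V_W), len(E_W))  # 3 frames x 3 nodes = 9, and 2 x 3 = 6 edges
```

The same helper applied to a pattern with R(e) = 0 on the self-connection would place both endpoints in one frame and create a cycle, illustrating why such patterns are invalid.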
We say two rollout patterns R and R\u2032 are equally model-parallel iff they are equal (R(e) = R\u2032(e)) for all edges e = (u, v) \u2208 E not originating in the network\u2019s input (u \u2209 IN). For i \u2208 {0, . . . , W}, the subset {i} \u00d7 V \u2282 VW is called the i-th frame.

Proof: In Sec. A1.3, we prove that the definition of valid rollout patterns is well-defined and is consistent with intuitions about rollouts, such as consistency over time. We also prove that the streaming rollout exists for every network and is always valid.

The most non-streaming rollout pattern R \u2261 0 is not necessarily valid, because if N contains loops then R \u2261 0 does not yield acyclic rollout windows. Commonly, recurrent networks are unrolled such that most edges operate inside the same frame (R(e) = 0), and only the necessary connections (e.g., recurrent or top-down connections) are unrolled (R(e) = 1). In contrast to this sequential rollout, the streaming rollout pattern unrolls all edges with R(e) = 1 (cf. top and third row in Fig. 2).

Lemma 1: Let N = (V, E) be a network. The number of valid rollout patterns |RN| is bounded by:

1 \u2264 n \u2264 |RN| \u2264 2^(|E|\u2212|Erec|), (3)

where Erec is the set of all self-connecting edges Erec ..= {(u, v) \u2208 E | u = v}, and n is either:
\u2022 n = 2^|Eforward|, with Eforward being the set of edges not contained in any cycle of N, or
\u2022 n = \u220f_{p\u2208C} (2^|p| \u2212 1), with C \u2282 CN being any set of minimal and pair-wise non-overlapping cycles.

Proof: See appendix Sec. A1.4.

Lemma 1 shows that the number of valid rollout patterns increases exponentially with network complexity. Inference of a rollout window is conducted in a sequential manner. This means the state of all nodes in the rollout window is successively computed depending on the availability of already computed source nodes1. 
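For small networks, the bounds of Lemma 1 can be checked by brute force: enumerate all 2^|E| patterns and keep those whose rollout window is acyclic. A sketch under our own illustrative encoding (not the authors' toolbox):

```python
# Brute-force check of Lemma 1 on a tiny network: count rollout patterns
# whose rollout window of size 1 is acyclic (i.e., the valid ones).
from itertools import product

def is_acyclic(nodes, edges):
    """Kahn's algorithm: True iff the directed graph has no cycle."""
    indeg = {v: 0 for v in nodes}
    for (_, v) in edges:
        indeg[v] += 1
    queue = [v for v in nodes if indeg[v] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for (a, b) in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen == len(nodes)

def count_valid_patterns(nodes, edges, W=1):
    edge_list = sorted(edges)
    count = 0
    for bits in product((0, 1), repeat=len(edge_list)):
        R = dict(zip(edge_list, bits))
        V_W = [(i, v) for i in range(W + 1) for v in nodes]
        E_W = [((i, u), (i + R[(u, v)], v)) for i in range(W + 1)
               for (u, v) in edges if i + R[(u, v)] <= W]
        count += is_acyclic(V_W, E_W)
    return count

# Network with a 2-cycle: in -> u, u -> v, v -> u. At least one cycle edge
# must stream, so 2 of the 8 patterns are invalid. Lemma 1 with the cycle
# set C = {u -> v -> u} gives (2^2 - 1) = 3 <= |R_N| <= 2^3 = 8.
nodes = ["in", "u", "v"]
edges = {("in", "u"), ("u", "v"), ("v", "u")}
print(count_valid_patterns(nodes, edges))  # -> 6
```

Here |R_N| = 6 sits strictly between the lower bound 3 and the upper bound 8, consistent with Eq. (3).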
The chosen rollout pattern determines the mathematical function this rollout represents, which may be different between rollouts, e.g., for skip connections. In addition, the chosen rollout pattern also determines the order in which nodes can be computed, leading to different runtimes to compute the full state of a rollout window.
We now introduce tools to compare these addressed differences between rollouts. States of the rollout window encode which nodes have been computed so far, and update steps determine the next state

1given the state of all input nodes at all frames and initial states for all nodes at the zero-th frame

Figure 2: (best viewed in color) a: Two different networks. b: Different rollout patterns R : E \u2192 {0, 1} for the two networks. Sequential (R(e) = 0) and streaming (R(e) = 1) edges are indicated with blue dotted and red solid arrows, respectively. For the first network (top to bottom), the most sequential, one hybrid, and the streaming rollout patterns are shown. 
For the second network, one out of its 3 most sequential rollout patterns is shown (either of the three edges of the cycle could be unrolled). c: Rollout windows of size W = 3. By definition (Eq. (2)), sequential and streaming edges propagate information within and to the next frames, respectively. d: States S(v) and inference tableau values T(v). The state S(v) of a node is indicated with black (already known) or white (not yet computed). From left to right: initial state Sinit, state after first update step U(Sinit), full state U^n(Sinit) = Sfull. The number of update steps n to reach the full state differs between rollouts. Numbers inside nodes v indicate values of the inference tableau (T(v)). Inference factors F(R) are indicated with square instead of circular nodes in the first frame of the full states.

based on the previous state. Update tableaus list after how many update steps nodes in the rollout window are computed. Update states, update steps, and inference tableaus are shown for example networks and rollouts in Fig. 2.

Definition (update state, update step, tableau, and factor): Let R be a valid rollout pattern of a network N = (V, E). A state of the rollout window RW is any mapping S : VW \u2192 {0, 1}. Let \u03a3W denote the set of all possible states. We define the full state Sfull and initial state Sinit as:

Sfull \u2261 1; Sinit((i, v)) = 1 \u21d0\u21d2 v \u2208 IN \u2228 i = 0. (4)

Further, we define the update step U, which updates states S. Because the updated state U(S) is again a state and hence a mapping, we define U by specifying the mapping U(S):

U : \u03a3W \u2192 \u03a3W; U(S) : VW \u2192 {0, 1},
U(S)(v) ..= 1 if S(v) = 1 or if S(u) = 1 for all (u, v) \u2208 EW, and 0 otherwise. (5)

We call the mapping T : VW \u2192 N the inference tableau:

T(v) ..= max_{p\u2208Pv} |p| = argmin_{n\u2208N} {U^n(Sinit)(v) = 1}, (6)

where U^n is the n-th recursive application of U and, for v \u2208 VW, Pv denotes the set of all paths in RW that end at v (i.e., p(|p|)tgt = v) and whose first edge may start but not end in the 0-th frame, p(1)tgt \u2209 {0} \u00d7 V. Hereby, we exclude edges (computational dependencies) which never have to be computed, because all nodes in the 0-th frame are initialized from the start. We dropped the dependencies of U and T on the rollout window RW in the notation; if needed, we will express them as URW and TRW. Further, we call the maximal value of T over the rollout window of size 1 the rollout pattern\u2019s inference factor:

F(R) ..= max_{v\u2208V1} TR1(v). (7)

Proof: In Sec. A1.6 we prove Eq. (6).

We also want to note that all rollout windows of a certain window size W have the same number of edges W \u00b7 |E|, independent of the chosen rollout pattern (ignoring edges inside the 0-th frame, because these are not used for updates). However, maximal path lengths in the rollout windows differ between different rollout patterns (cf. Eq. (6) and its proof, as well as tableau values in Fig. 2).
Inference of rollout windows starts with the initial state Sinit. Successive applications of the update step U update all nodes until the fully updated state Sfull is reached (cf. Fig. 2 and see Sec. A1.5 for a proof). 
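The update dynamics of Eqs. (4)-(5) can be simulated directly; counting the steps to the full state on a window of size W = 1 yields the inference factor of Eq. (7). A sketch (illustrative encoding; a valid rollout pattern is assumed, otherwise the loop would not terminate):

```python
# Simulate the update step U (Eq. (5)) from S_init (Eq. (4)) and count
# how many applications of U reach S_full.

def steps_to_full_state(nodes, edges, R, W, inputs):
    V_W = {(i, v) for i in range(W + 1) for v in nodes}
    E_W = {((i, u), (i + R[(u, v)], v)) for i in range(W + 1)
           for (u, v) in edges if i + R[(u, v)] <= W}
    preds = {v: {u for (u, w) in E_W if w == v} for v in V_W}
    # S_init: input nodes at all frames and all nodes of frame 0 (Eq. (4)).
    S = {(i, v): v in inputs or i == 0 for (i, v) in V_W}
    steps = 0
    while not all(S.values()):
        # Eq. (5): a node flips to 1 once all its sources are 1; all
        # updates of one step read the previous state S simultaneously.
        S = {v: S[v] or all(S[u] for u in preds[v]) for v in V_W}
        steps += 1
    return steps

# Feed-forward chain in -> h1 -> h2 -> out.
nodes = ["in", "h1", "h2", "out"]
edges = {("in", "h1"), ("h1", "h2"), ("h2", "out")}
streaming = {e: 1 for e in edges}
sequential = {e: 0 for e in edges}
# Steps for the W = 1 window correspond to the inference factor (Eq. (7)):
print(steps_to_full_state(nodes, edges, streaming, 1, {"in"}))   # 1
print(steps_to_full_state(nodes, edges, sequential, 1, {"in"}))  # 3
```

The streaming pattern finishes the frame in a single step because every frame-1 node depends only on (already known) frame-0 nodes, whereas the sequential pattern must walk the depth-3 chain within the frame.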
For a certain window size W, the number of operations to compute the full state is independent of the rollout pattern, but which updates can be done in parallel heavily depends on the chosen rollout pattern. We will use the number of required update steps to measure computation time. This number differs between different rollout patterns (e.g., F(R) in Fig. 2). In practice, the time needed for the update U(S) of a certain state S depends on S (i.e., which nodes can be updated next). For now, we will assume independence, but will address this issue in the discussion (Sec. 5).

Theorem 1: Let R be a valid rollout pattern for a network N = (V, E). Then the following statements are equivalent:

a) R and the streaming rollout pattern Rstream are equally model-parallel.
b) The first frame is updated entirely after the first update step: F(R) = 1.
c) For W \u2208 N, the i-th frame of RW is updated at the i-th update step: \u2200(i, v) \u2208 VW : T((i, v)) \u2264 i.
d) For W \u2208 N, the inference tableau of RW is minimal everywhere and over all rollout patterns. In other words, responses are earliest and most frequent: \u2200v \u2208 VW : TRW(v) = min_{R\u2032\u2208RN} TR\u2032W(v).

Proof: See appendix Sec. A1.7.

4 Experiments

To demonstrate the significance of the chosen rollouts w.r.t. the runtime for inference and achieved accuracy, we compare the two extreme rollouts: the most model-parallel, i.e., streaming rollout (R \u2261 1, results in red in Fig. 3), and the most sequential rollout2 (R(e) = 0 for the maximal number of edges, results in blue in Fig. 3).
In all experiments, we consider a response time task, in which the input is a sequence of images and the networks have to respond as quickly as possible with the correct class. 
We want to restate that we do not compare performances between networks but between rollout patterns of the same network. For all experiments and rollout patterns under consideration, we conduct inference on shallow rollouts (W = 1) and initialize the zero-th frame of the next rollout window with the last (i.e., first) frame of the preceding rollout window (see discussion Sec. 5). Hence, the inference factor of a rollout pattern is used to determine the number of update steps between responses (see F(Rstr), F(Rseq) in Fig. 3a).

2Here, the most sequential rollout is unique since the used networks do not contain cycles of length greater than 1. For sequential rollouts that are ambiguous, see the bottom row of Fig. 2.

Figure 3: (best viewed in color) Classification accuracy for sequential (in dashed blue), streaming (in solid red), and one hybrid (violet; only for SR network in a) rollout on MNIST, CIFAR10, and GTSRB (for networks and data see Figs. 1, 3d, A3, and A2). a-c: Average classification results on MNIST over computation time measured in the number of update steps of networks with skip + recurrent (SR, a), with skip (S, b), and only feed-forward (FF, c) connections. In a), scaling of the abscissa changes at the vertical dashed line for illustration purposes. d: The input (top row) is composed of digits (bottom row) and noise (middle row). Note that the input is aligned to the time axis in (a). Red diamonds and blue stars indicate inputs sampled by streaming and sequential rollouts, respectively. e: Classification results of the network DSR2 on CIFAR10. f: Accuracies at time of first output of sequential rollout (see (V) in e) over networks DSR0 - DSR6 (red and blue curves; left axis). Differences of first response times between streaming and sequential rollouts (see (IV) in e; black dotted curve; right axis). g: Average accuracies on GTSRB sequences starting at index 0 of the original sequences. h: Final classification accuracies (see (VI) in g) over the start index of the input sequence. Standard errors are shown in all plots except e and f and are too small to be visible in (a-c).

Datasets: Rollout patterns are evaluated on three datasets: MNIST [50], CIFAR10 [51], and the German traffic sign recognition benchmark (GTSRB) [52]. To highlight the differences between different rollout patterns, we apply noise (a different sample for each frame) to the data (see Fig. 3d and Fig. A3b, c). In contrast to data without noise, a single image is now not sufficient for a good classification performance anymore, and temporal integration is necessary. In the case of GTSRB, this noise can be seen as noise induced by the sensor, as predominant under poor lighting conditions. GTSRB contains tracks of 30 frames from which sections are used as input sequences.
Networks: We compare the behavior of streaming and sequential rollout patterns on MNIST for three different networks with two hidden layers (FF, S, SR; see Fig. 1 and Fig. A2). For evaluation on CIFAR10, we generate a sequence of 7 incrementally deeper networks (DSR0 - DSR6, see Fig. A3a) by adding layers to the blocks of a recurrent network with skip connections in a dense fashion (details in Fig. A3a). For evaluation on GTSRB, we used DSR4, leaving out the recurrent connection. Details about data, preprocessing, network architectures, and the training process are given in Sec. A2.
Results: Rollouts are compared on the basis of their test accuracies over the duration (measured in update steps) needed to achieve these accuracies (Fig. 3a-c, e, and g).
We show behavioral differences between streaming and sequential rollouts for increasingly complex networks on the MNIST dataset. 
In the case of neither recurrent nor skip connections (see FF in Fig. A2), the streaming rollout is mathematically identical to the sequential rollout. Neither rollout can integrate information over time and, hence, both perform classification on single images with the same response time for the first input image and the same accuracy (see Fig. 3c). However, due to the pipelined structure of computations in the streaming case, outputs are more frequent.
For networks with skip, but without recurrent connections (see S in Fig. A2), the behavioral difference between streaming and sequential rollouts can be shown best. While the sequential rollout still only performs classification on single images, the streaming rollout can integrate over two input images due to the skip connection that bridges time (see Fig. 3b).
In the streaming case, skip connections cause shallow shortcuts in time that can result in earlier (see (I) in Fig. 3a), but initially worse performance than for deep sequential rollouts. The streaming rollout responds 1 update step earlier than the sequential rollout since its shortest path is shorter by 1 (see Fig. 2). These early first estimations are later refined when longer paths and finally the longest path from input to output contribute to classification. For example, after 3 time steps in Fig. 3a, the streaming rollout uses the full network. 
This also applies to the sequential rollout, but instead of integrating over two images (frames 0 and 1), only the image of a single frame (frame 1) is used (cf. blue to red arrows connecting Fig. 3d and a).
Due to parallel computation of the entire frame in the streaming case, the sampling frequency of input images (every time step; see red diamonds in Fig. 3d) is maximal (F(Rstr) = 1 in Fig. 3a; see d in Theorem 1 in Sec. 3). In contrast, the sampling frequency of the sequential rollout decreases linearly with the length of the longest path (F(Rseq) = 3 in Fig. 3a; blue stars in Fig. 3d).
High sampling frequencies and shallow shortcuts via skip connections establish a high degree of temporal integration early on and result in better early performance (see (II) in Fig. 3a). In the long run, however, classification performances are comparable between streaming and sequential rollouts, and the same number of input images is integrated over (see (III) in Fig. 3a).
We repeat similar experiments for the CIFAR10 dataset to demonstrate the increasing advantages of streaming over sequential rollouts for deeper and more complex networks. For the network DSR2, with a shortest path of length 4 and a longest path of length 6, the first response of the streaming rollout is 2 update steps earlier than for the sequential rollout (see (IV) in Fig. 3e) and shows better early performance (see (V) in Fig. 3e). With increasing depth (length of the longest path) over the sequence of networks DSR0 - DSR6 (see Fig. A3a), the time to first response stays constant for streaming, but grows linearly with depth for sequential rollouts (see Fig. 3f, black curve). The difference in early performance (see (V) in Fig. 3e) widens with deeper networks (Fig. 3f).
For evaluation of rollouts on GTSRB, we consider the DSR4 network. 
Self-recurrence is omitted since the required short response times of this task cannot be achieved with sequential rollouts due to the very small sampling frequencies. Consequently, for a fair comparison, we calculate the classifications of the first 8 images in parallel for the sequential case. In this case, where both rollouts use the same amount of computations, performance for the sequential rollout increases over time due to less blurry input images, while the streaming rollout additionally performs temporal integration using skip connections and yields better performance (see (VI) in Fig. 3g). This results in better performance of streaming compared to sequential rollouts for more distant objects (Fig. 3h).

5 Discussion and Conclusion

The presented theory for network rollouts is generically applicable to a vast variety of deep neural networks (see Sec. A1.2) and is not constrained to recurrent networks but could also be used on forward (e.g., VGG [53], AlexNet [54]) or skipping networks (e.g., ResNet [1], DenseNet [55]). We restricted rollout patterns to have values R(e) ∈ {0, 1}, allowing neither edges that bridge more than one frame (R(e) > 1) nor edges pointing backwards in time (R(e) < 0). The first case is subsumed under the presented theory by using copy-nodes for longer forward connections; for R(e) < 0, rollouts with backward connections lose the real-time capability, because information from future frames would be used.

In this work, we primarily investigated differences between rollout patterns in terms of the level of parallelization they induce in their rollout windows. But using different rollout patterns is not a mere implementation issue. For some networks, all rollout patterns yield the same mathematical behavior (e.g., mere feed-forward networks without any skip or recurrent connections, cf. Fig. 3c). For other networks, different rollout patterns (see Sec.
3) may lead to differences in the behavior of their rollout windows (e.g., Fig. 3b). Hence, parameters between different rollout patterns might be incompatible. The theoretical analysis of behavioral equivalency of rollout patterns is a topic for future work.

One disadvantage of the streaming rollout pattern seems to be that deeper networks also require deeper rollout windows. Rollout windows should be at least as long as the longest minimal path connecting input to output, i.e., all paths have appeared at least once in the rollout window. For sequential rollout patterns this is not the case, since, e.g., for a feed-forward network the longest minimal path is already contained in the first frame. However, for inference with streaming rollouts, instead of using deep rollouts we propose to use shallow rollouts (e.g., W = 1) and to initialize the zero-th frame of the next rollout window with the last (i.e., first) frame of the preceding rollout window. This enables a potentially infinite memory for recurrent networks and minimizes the memory footprint of the rollout window during inference.

Throughout the experimental section, we measured runtime by the number of necessary update steps, assuming equal update time for every single node update. Without this assumption and given fully parallel hardware, streaming rollouts still manifest the best-case scenario in terms of maximal parallelization, and the inference of a single frame would take the runtime of the computationally most expensive node update. In contrast, sequential rollouts would not benefit from the assumed parallelism of such hardware, and inference of a single frame takes the summed runtime of all necessary node updates. The streaming rollout therefore favors network architectures with many nodes of approximately equal update times.
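This runtime argument can be made concrete with a toy calculation; the per-node update times below are hypothetical and purely illustrative:

```python
# Toy runtime model with hypothetical per-node update times (ms).
# On fully parallel hardware, a streaming rollout updates all nodes of a
# frame concurrently: per-frame latency = the slowest node update.
# A sequential rollout updates nodes one after another:
# per-frame latency = the sum of all node updates.
node_update_ms = {"conv1": 2.0, "conv2": 2.5, "conv3": 2.2, "fc": 0.8}

streaming_frame_ms = max(node_update_ms.values())   # 2.5
sequential_frame_ms = sum(node_update_ms.values())  # ~7.5
```

The gap between the two latencies is largest when node update times are balanced, which is why architectures with many nodes of roughly equal cost profit most from streaming execution.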
In this case, the above assumption approximately holds.

The difference in runtime between rollout patterns depends on the hardware used for execution. Although commonly used GPUs provide sufficient parallelism to speed up calculations of activations within a layer, they are often not parallel enough to enable the parallel computation of multiple layers. Novel massively parallel hardware architectures such as the TrueNorth chip [56, 57] allow storing and running full network rollouts on-chip, reducing the runtime of rollouts drastically and therefore making streaming rollouts highly attractive. The limited access to massively parallel hardware may be one reason why streaming rollouts have not been thoroughly discussed yet.

Furthermore, not only the hardware, but also the software frameworks must support the parallelization of independent nodes in their computation graph to exploit the advantages of streaming rollouts. This is usually not the case, and by default sequential rollouts are used. For the experiments presented here, we use the Keras toolbox to compare different rollout patterns. To realize arbitrary rollout patterns in Keras, instead of using Keras' built-in RNN functionalities, we created a dedicated model builder which explicitly generates the rollout windows. Additionally, we implemented an experimental toolbox (TensorFlow and Theano backends) to study (define, train, evaluate, and visualize) networks using the streaming rollout pattern (see Sec. A3). Both are available as open-source code3.

Similar to biological brains, synchronization of layers (nodes) plays an important role for the streaming rollout pattern. At a particular time (frame), different nodes may carry differently delayed information with respect to the input. In this work, we evaluate network accuracy as a function of the response delay.
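A fully parallel streaming update with a shallow, carried-over rollout window can be sketched in plain NumPy. This is an illustrative sketch, not the authors' toolbox code; the architecture, layer sizes, and random weights are hypothetical placeholders:

```python
import numpy as np

# Streaming inference with a shallow rollout window (W = 1): the zero-th
# frame of each window is initialized with the last frame of the
# preceding window, so state carries over across an input stream.
rng = np.random.default_rng(0)
sizes = {"in": 4, "h1": 8, "h2": 8, "out": 3}
W1 = rng.standard_normal((sizes["in"], sizes["h1"]))
W2 = rng.standard_normal((sizes["h1"], sizes["h2"]))
Wr = rng.standard_normal((sizes["h2"], sizes["h2"]))  # self-recurrence on h2
Wo = rng.standard_normal((sizes["h2"], sizes["out"]))

def step(prev, x):
    """One streaming update: every node of the new frame reads only from
    the previous frame, so all four updates could run in parallel."""
    return {
        "in": x,
        "h1": np.tanh(prev["in"] @ W1),
        "h2": np.tanh(prev["h1"] @ W2 + prev["h2"] @ Wr),
        "out": prev["h2"] @ Wo,
    }

# Rollout window of depth 1, carried over along a stream of 10 inputs;
# one output per input, i.e., sampling period 1.
state = {k: np.zeros(v) for k, v in sizes.items()}
for x in rng.standard_normal((10, sizes["in"])):
    state = step(state, x)
print(state["out"].shape)  # (3,)
```

Because no node of a frame depends on another node of the same frame, the update order within `step` is arbitrary, which is exactly the property that enables fully model-parallel execution.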
An interesting area for future research is the exploration of mechanisms to guide and control information flow in the context of streaming rollout patterns, e.g., through gated skip (bottom-up) and recurrent (top-down) connections. New sensory information should be distributed quickly into deeper layers, and high-level representations and knowledge of the network about its current task could stabilize, predict, and constrain lower-level representations.

A related concept to layer synchronization is that of clocks, where different layers, or more generally different parts of a network, are updated with different frequencies. In this work, all layers are updated equally often. In general, it is an open research question to what extent clocking and, more generally, synchronization mechanisms should be implicit parts of the network and hence learnable, or formulated as explicit a priori constraints.

Conclusion: We presented a theoretical framework for network rollouts and investigated differences in behavior and model-parallelism between different rollouts. We especially analysed the streaming rollout, which fully disentangles computational dependencies between nodes and hence enables fully model-parallel inference. We empirically demonstrated the superiority of streaming over non-streaming rollouts for different image datasets due to faster first responses to and higher sampling of inputs. We hope our work will encourage the scientific community to further study the advantages and behavioral differences of streaming rollouts in preparation for future massively parallel hardware.

3https://github.com/boschresearch/statestream

Acknowledgments

The authors would like to thank Bastian Bischoff, Dan Zhang, Jan-Hendrik Metzen, and Jörg Wagner for their valuable remarks and discussions.

References

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition.\nIn Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages\n770\u2013778, 2016.\n\n[2] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case,\nJared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech\nrecognition in english and mandarin. In International Conference on Machine Learning (ICML), pages\n173\u2013182, 2016.\n\n[3] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement\nlearning for continuous control. In International Conference on Machine Learning (ICML), pages 1329\u2013\n1338, 2016.\n\n[4] Santiago Fern\u00e1ndez, Alex Graves, and J\u00fcrgen Schmidhuber. An application of recurrent neural networks\nto discriminative keyword spotting. In International Conference on Arti\ufb01cial Neural Networks, pages\n220\u2013229, 2007.\n\n[5] Alex Graves and J\u00fcrgen Schmidhuber. Of\ufb02ine handwriting recognition with multidimensional recurrent\nneural networks. In Advances in Neural Information Processing Systems (NIPS), pages 545\u2013552, 2009.\n[6] Paul J Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural\n\nnetworks, 1(4):339\u2013356, 1988.\n\n[7] Ronald J Williams and David Zipser. Gradient-based learning algorithms for recurrent networks and their\ncomputational complexity. Backpropagation: Theory, architectures, and applications, 1:433\u2013486, 1995.\n[8] Qianli Liao and Tomaso Poggio. Bridging the gaps between residual learning, recurrent neural networks\n\nand visual cortex. arXiv:1604.03640, 2016.\n\n[9] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the dif\ufb01culty of training recurrent neural\n\nnetworks. In International Conference on Machine Learning (ICML), pages 1310\u20131318, 2013.\n\n[10] Ming Liang and Xiaolin Hu. 
Recurrent convolutional neural network for object recognition. In Proceedings\nof the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3367\u20133375, 2015.\n\n[11] Zhiheng Huang, Wei Xu, and Kai Yu. Bidirectional LSTM-CRF models for sequence tagging.\n\narXiv:1508.01991, 2015.\n\n[12] A. R. Zamir, T.-L. Wu, L. Sun, W. Shen, B. E. Shi, J. Malik, and S. Savarese. Feedback Networks. arXiv\n\n1612.09508, 2016.\n\n[13] Huazhe Xu, Yang Gao, Fisher Yu, and Trevor Darrell. End-to-end learning of driving models from large-\nscale video datasets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition\n(CVPR), pages 3530\u20133538, 2017.\n\n[14] Chih-Min Lin, Ching-Fu Tai, and Chang-Chih Chung. Intelligent control system design for uav using a\n\nrecurrent wavelet neural network. Neural Computing & Applications, 24(2):487\u2013496, 2014.\n\n[15] William A Little. The existence of persistent states in the brain. In From High-Temperature Superconduc-\n\ntivity to Microminiature Refrigeration, pages 145\u2013164. Springer, 1974.\n\n[16] John J Hop\ufb01eld. Neural networks and physical systems with emergent collective computational abilities.\n\nProceedings of the national academy of sciences, 79(8):2554\u20132558, 1982.\n\n[17] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735\u20131780,\n\n1997.\n\n[18] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In\n\nAdvances in Neural Information Processing Systems (NIPS), pages 3104\u20133112, 2014.\n\n[19] Thomas M Breuel, Adnan Ul-Hasan, Mayce Ali Al-Azawi, and Faisal Shafait. High-performance OCR for\nprinted English and Fraktur using LSTM networks. In International Conference on Document Analysis\nand Recognition (ICDAR), pages 683\u2013687, 2013.\n\n[20] Yuchen Fan, Yao Qian, Feng-Long Xie, and Frank K Soong. TTS synthesis with bidirectional LSTM based\nrecurrent neural networks. 
In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[21] Raymond Brueckner and Björn Schuller. Social signal classification using deep BLSTM recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4823–4827, 2014.

[22] Anton Milan, Seyed Hamid Rezatofighi, Anthony R Dick, Ian D Reid, and Konrad Schindler. Online multi-target tracking using recurrent neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 4225–4232, 2017.

[23] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 61:85–117, 2015.

[24] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259, 2014.

[25] Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent highway networks. arXiv:1607.03474, 2016.

[26] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv:1410.5401, 2014.

[27] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538:471–476, 2016.

[28] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.

[29] Víctor Campos, Brendan Jou, Xavier Giró-i Nieto, Jordi Torres, and Shih-Fu Chang. Skip RNN: Learning to skip state updates in recurrent neural networks. In International Conference on Learning Representations (ICLR), 2018.

[30] Golan Pundak and Tara N Sainath.
Highway-LSTM and recurrent highway networks for speech recognition.\n\nIn Proceedings of Interspeech, 2017.\n\n[31] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent\n\nneural networks. arXiv:1312.6026, 2013.\n\n[32] Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. In\n\nInternational Conference on Learning Representations (ICLR), 2017.\n\n[33] Surat Teerapittayanon, Bradley McDanel, and H.T. Kung. BranchyNet: Fast inference via early exiting\n\nfrom deep neural networks. In International Conference on Pattern Recognition (ICPR), 2016.\n\n[34] Klaus Greff, Rupesh K. Srivastava, and J\u00fcrgen Schmidhuber. Highway and residual networks learn unrolled\n\niterative estimation. In International Conference on Learning Representations (ICLR), 2017.\n\n[35] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with\npruning, trained quantization and huffman coding. In International Conference on Learning Representa-\ntions (ICLR), 2016.\n\n[36] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression\nof deep convolutional neural networks for fast and low power mobile applications. arXiv:1511.06530,\n2015.\n\n[37] Andrew Lavin and Scott Gray. Fast algorithms for convolutional neural networks. In Proceedings of the\n\nIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4013\u20134021, 2016.\n\n[38] Michael Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through ffts.\n\nIn International Conference on Learning Representations (ICLR), 2014.\n\n[39] Marvin Minsky and Seymour A. Papert. Perceptrons: An introduction to computational geometry. MIT\n\npress, 1969. retrieved from the 1988 reissue.\n\n[40] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press Cambridge, 2016.\n[41] Karol Gregor and Yann LeCun. 
Learning fast approximations of sparse coding. International Conference on Machine Learning (ICML), pages 399–406, 2010.

[42] Joao Carreira, Viorica Patraucean, Laurent Mazare, Andrew Zisserman, and Simon Osindero. Massively parallel video networks. Proceedings of the European Conference on Computer Vision (ECCV), pages 649–666, 2018.

[43] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, 2015.

[44] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 961–970, 2015.

[45] Jan Koutnik, Klaus Greff, Faustino Gomez, and Jürgen Schmidhuber. A clockwork RNN. Proceedings of the 31st International Conference on Machine Learning, PMLR, pages 1863–1871, 2014.

[46] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. International Conference on Machine Learning (ICML), 2017.

[47] Evan Shelhamer, Kate Rakelly, Judy Hoffman, and Trevor Darrell. Clockwork convnets for video semantic segmentation. ECCV Workshop, 2016.

[48] Michael Figurnov, Maxwell D. Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, and Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[49] Daniel Neil, Michael Pfeiffer, and Shih-Chii Liu. Phased LSTM: Accelerating recurrent network training for long or event-based sequences.
Advances in Neural Information Processing Systems (NIPS), 2016.\n\n[50] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to\n\ndocument recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[51] Alex Krizhevsky and Geoffrey E. Hinton. Learning multiple layers of features from tiny images. Technical\n\nreport, University of Toronto, 2009.\n\n[52] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The german traf\ufb01c sign recognition\nbenchmark: a multi-class classi\ufb01cation competition. In International Joint Conference on Neural Networks\n(IJCNN), pages 1453\u20131460. IEEE, 2011.\n\n[53] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni-\n\ntion. arXiv:1409.1556, 2014.\n\n[54] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classi\ufb01cation with deep convolutional\nneural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097\u20131105, 2012.\n[55] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convo-\nlutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition\n(CVPR), pages 2261\u20132269, 2017.\n\n[56] Paul A. Merolla, John V. Arthur, Rodrigo Alvarez-Icaza, Andrew S. Cassidy, Jun Sawada, Filipp Akopyan,\nBryan L. Jackson, Nabil Imam, Chen Guo, Yutaka Nakamura, Bernard Brezzo, Ivan Vo, Steven K. Esser,\nRathinakumar Appuswamy, Brian Taba, Arnon Amir, Myron D. Flickner, William P. Risk, Rajit Manohar,\nand Dharmendra S. Modha. A million spiking-neuron integrated circuit with a scalable communication\nnetwork and interface. Science, 345(6197):668\u2013673, 2014.\n\n[57] Steven K. Esser, Paul A. Merolla, John V. Arthur, Andrew S. Cassidy, Rathinakumar Appuswamy, Alexan-\nder Andreopoulos, David J. Berg, Jeffrey L. McKinstry, Timothy Melano, Davis R. 
Barch, Carmelo di Nolfo, Pallab Datta, Arnon Amir, Brian Taba, Myron D. Flickner, and Dharmendra S. Modha. Convolutional networks for fast, energy-efficient neuromorphic computing. Proceedings of the National Academy of Sciences, 113(41):11441–11446, 2016.

[58] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4, 2012.

[59] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv:1605.02688, 2016.

[60] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.