{"title": "Constrained deep neural network architecture search for IoT devices accounting for hardware calibration", "book": "Advances in Neural Information Processing Systems", "page_first": 6056, "page_last": 6066, "abstract": "Deep neural networks achieve outstanding results for challenging image classification tasks. However, the design of network topologies is a complex task, and the research community is conducting ongoing efforts to discover top-accuracy topologies, either manually or by employing expensive architecture searches. We propose a unique narrow-space architecture search that focuses on delivering low-cost and rapidly executing networks that respect strict memory and time requirements typical of Internet-of-Things (IoT) near-sensor computing platforms. Our approach provides solutions with classification latencies below 10~ms running on a low-cost device with 1~GB RAM and a peak performance of 5.6~GFLOPS. The narrow-space search of floating-point models improves the accuracy on CIFAR10 of an established IoT model from 70.64% to 74.87% within the same memory constraints. We further improve the accuracy to 82.07% by including 16-bit half types and obtain the highest accuracy of 83.45% by extending the search with model-optimized IEEE 754 reduced types. To the best of our knowledge, this is the first empirical demonstration of more than 3000 trained models that run with reduced precision and push the Pareto optimal front by a wide margin. 
Within a given memory constraint, accuracy is improved by more than 7 percentage points for half and more than 1 percentage point for the best individual model format.", "full_text": "Constrained deep neural network architecture search for IoT devices accounting for hardware calibration\n\nFlorian Scheidegger1,2\n\nLuca Benini1,3\n\nCostas Bekas2\n\nCristiano Malossi2\n\n1 ETH Z\u00fcrich, R\u00e4mistrasse 101, 8092 Z\u00fcrich, Switzerland\n\n2 IBM Research - Z\u00fcrich, S\u00e4umerstrasse 4, 8803 R\u00fcschlikon, Switzerland\n\n3 Universit\u00e0 di Bologna, Via Zamboni 33, 40126 Bologna, Italy\n\nAbstract\n\nDeep neural networks achieve outstanding results for challenging image classification tasks. However, the design of network topologies is a complex task, and the research community is conducting ongoing efforts to discover top-accuracy topologies, either manually or by employing expensive architecture searches. We propose a unique narrow-space architecture search that focuses on delivering low-cost and rapidly executing networks that respect strict memory and time requirements typical of Internet-of-Things (IoT) near-sensor computing platforms. Our approach provides solutions with classification latencies below 10 ms running on a low-cost device with 1 GB RAM and a peak performance of 5.6 GFLOPS. The narrow-space search of floating-point models improves the accuracy on CIFAR10 of an established IoT model from 70.64% to 74.87% within the same memory constraints. We further improve the accuracy to 82.07% by including 16-bit half types and obtain the highest accuracy of 83.45% by extending the search with model-optimized IEEE 754 reduced types. To the best of our knowledge, this is the first empirical demonstration of more than 3000 trained models that run with reduced precision and push the Pareto optimal front by a wide margin. 
Within a given memory constraint, accuracy is improved by more than 7 percentage points for half and more than 1 percentage point for the best individual model format.\n\n1 Introduction\n\nDesigning an economically viable artificial intelligence system has become a formidable challenge in view of the increasing number of published methods, data, models, and newly available deep-learning frameworks, as well as the hype surrounding special-purpose hardware accelerators as they become commercially available. The availability of large-scale datasets with known ground truths [12, 42, 13, 51, 28, 10, 33, 54, 9, 5, 34, 37] and the widespread commercial availability of higher computational performance\u2014usually achieved with graphics-processing units (GPUs)\u2014has driven the current growth of and strong interest in deep learning and the emergence of related new businesses. Smart homes [29], smart grids [15] and smart cities [17] trigger a natural demand for Internet-of-Things (IoT) products, which are designed to be low in cost and to feature low energy consumption and fast reaction times, owing to the inherent constraints of final applications that typically demand autonomy with long battery lifetimes or fast real-time operation. Experts estimate that there will be some 30 billion IoT devices in use by 2020 [35], many of which serve applications that benefit from artificial-intelligence deployment.\n\nIn this context, we propose an automatic way to design deep-learning models that satisfy user-defined constraints specifically tailored to match typical IoT requirements, such as inference latency bounds. Additionally, our approach is designed in a modular manner that allows future adaptations and specialization for novel network topology extensions to different IoT devices and lower precision contexts. \n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n
Our main contributions are the following:\n\n\u2022 We propose an end-to-end approach to synthesize models that satisfy IoT application and hardware constraints.\n\u2022 We propose a narrow-space architecture search algorithm to leverage knowledge from large reference models to generate a family of small and efficient models.\n\u2022 We evaluate reduced precision formats for more than 3000 models.\n\u2022 We isolate IoT device characteristics and demonstrate how our concepts operate with analytical network properties and map them to final platform-specific metrics.\n\nThe remainder of this paper is organized as follows. Section 2 describes related work, Section 3 introduces the core design procedures, Section 4 details and merges a full synthesis workflow, Section 5 presents and discusses the obtained results, and Section 6 concludes our findings.\n\n2 Related work\n\nAutomated architecture search has the potential to discover better models [31, 52, 53, 55, 56, 6, 4, 49]. However, traditional approaches require a vast amount of computing resources or cause excessive execution times due to the full training of candidate networks [38]. Early stopping based on learning-curve predictors [14] or transferring learned weights shortens run-times [48]. A method called Train-less Accuracy Predictor for Architecture Search (TAPAS) demonstrates how to generalize architecture-search results to new data without having to train during the search process [26]. Architecture searches face the common challenge of defining the search space. Historically, new networks developed independently with expert knowledge outperformed networks previously generated by architecture searches. In such cases, very expensive reconsiderations led to follow-up work to account correctly for a richer search space [36, 47]. 
Recent progress in the field, such as MnasNet [46] and FBNet [50], tailors the search by optimizing a multi-objective function that includes the inference time on smartphones. MnasNet trains a controller that adjusts to sample more optimal models with respect to the multiple objectives. FBNet trains a supernet by a differentiable neural architecture search (DNAS) in a single step and claims to be 420\u00d7 faster by avoiding additional model training steps. In contrast to solving a joint optimization problem in one step, our proposed union of narrow-space searches takes a modular approach that separates the search process of finding architectures that strictly satisfy constraints from the training of candidate networks. That way, we can analyze 10,000 architectures with no training cost and select only a small subset of suitable candidates for training.\n\nCompression, quantization and pruning techniques reduce heavy computational needs based on the inherent error resilience of deep neural networks [39]. Mobile nets [22] or low-rank expansions [27] change the topology into layers that require fewer weights and reduce workloads. Quantization studies the effect of using reduced precision floating-point or fixed-point formats [21, 30], whereas compression attempts to reduce the binary footprint of activation and weight maps [7]. Pruning approaches avoid computation by enforcing sparsity [3]. We use FloatX, an IEEE 754-compliant reduced precision library [16], to assess data format-specific aspects of networks. The novelty of our work is that we jointly evaluate network topologies in combination with reduced precision.\n\n3 Core design procedures\n\n3.1 Architecture search\n\nIt is challenging to define a space S that produces enough variation and simultaneously reduces the probability of sampling suboptimal networks. We propose narrow-space architecture searches, where results are obtained by aggregating n independent searches, S = \u222a_{i=1}^{n} S_i. As a good search space should satisfy S_r \u2282 S, where S_r = {M_1, ..., M_n} is a set of reference models, we construct S by designing narrow spaces that obey M_i \u2208 S_i in order to guarantee S_r \u2282 S. Instead of considering one large space, we have specialized search spaces that produce structures ranging from simple sequences with residual bypass operations (ResNets [19]) to high fan-out and convergent structures such as those occurring in the Inception module [44] or DenseNets [24]. Aggregation allows results to be extended easily with a tailored narrow-space search for new reference architectures. Next, we define a set of distribution law configurations L_1(S_i), ..., L_k(S_i) that allow samples to be drawn in a biased way such that models satisfy the properties of interest.\n\nFigure 1: Left: Three-layer architecture. Middle: Default configuration of the search space with restricted sampling laws. Right: Statistics of the number of parameters obtained by sampling up to one million networks from the base configuration space and 1000 networks from the restricted sampling laws.\n\nFigure 1 illustrates the advantages over a uniform distribution among valid networks. Consider a space of three-layer networks with allowed variations in kernel shapes in {1,3,5,7} and output channels in [1,128], leading to |S| = 4^6 \u00b7 128^3 \u2248 8.6 \u00b7 10^9 network configurations. Figure 1 shows the statistics for up to 10^6 samples compared with sampling only 1000 samples using the restricted samplers L_1, L_2 and L_3. Restricted random laws efficiently generate networks of interest, in contrast to a uniform sampler that fails to deliver high sampling densities in certain regions. 
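Such restricted sampling laws can be illustrated with a toy sketch over the three-layer space above; the law names, ranges, and helper functions below are our own assumptions for illustration, not the paper's exact configuration:

```python
import random

# Hypothetical restricted sampling laws: each law bounds the uniform
# ranges for kernel sizes and output channels of a three-layer network.
LAWS = {
    "L1": {"kernels": [1, 3], "channels": (1, 8)},      # biased towards tiny models
    "L2": {"kernels": [1, 3, 5], "channels": (8, 32)},
    "L3": {"kernels": [3, 5, 7], "channels": (32, 128)},
}

def sample_network(law, n_layers=3):
    """Draw one architecture: a kernel size and a channel width per layer."""
    cfg = []
    for _ in range(n_layers):
        k = random.choice(law["kernels"])
        c = random.randint(*law["channels"])
        cfg.append((k, c))
    return cfg

def num_parameters(cfg, in_channels=3):
    """Analytical parameter count of the convolutional layers (no bias terms)."""
    total, c_in = 0, in_channels
    for k, c_out in cfg:
        total += k * k * c_in * c_out
        c_in = c_out
    return total

random.seed(0)
for name, law in LAWS.items():
    sizes = [num_parameters(sample_network(law)) for _ in range(1000)]
    print(name, min(sizes), max(sizes))
```

Because each law confines both kernel sizes and channel widths, 1000 draws per law already populate a chosen decade of parameter counts densely; a single uniform law over the full ranges, by contrast, draws small networks only very rarely.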
For example, only 132 out of 10^6 networks have fewer than 1000 parameters.\n\nWe define each narrow-space architecture search and its sampling laws according to the following design goals: First, only valid models are generated, with a topology that resembles and includes the original model. Second, the main model-specific parameters are varied, and efficient models are obtained mainly by lowering channel widths in convolutional layers and reducing the number of topological replications. Third, all random laws are defined as uniform distributions over the available options, where the lower and upper limits are used to bias the models to span several orders of magnitude, targeting the range of parameter and flop counts relevant for IoT applications.\n\n3.2 Precision analysis\n\nPrecision analysis evaluates the accuracies of models having reduced precision representations. Following general methodology, we perform precision analyses on a backend device that has different execution capabilities than current or future targeted IoT devices. This methodology enforces emulated computation throughout the analysis to assess accuracy independent of the target hardware. Low precision can be applied to the model parameters, to the computations performed by the models, and to the activation maps that are passed between operators. Here we follow the extrinsic quantization approach [30], where we enforce the precision of a reduced type T_{w,t} of storage width 1 + w + t on all model parameters and all activation maps that are passed between operations. Our analysis follows the IEEE 754 standard [57], which defines the storage encoding, special cases (NaN, Inf), and rounding behavior of floating-point data. 
A sign s, an exponent e and a significand m represent a number v = (\u22121)^s \u00b7 2^e \u00b7 m, where the exponent field width w and the trailing significand field width t limit the dynamic range and precision. The types T_{5,10} and T_{8,23} correspond to the standard formats half and float. Our experiments are based on a PyTorch [1] integration of a GPU quantization kernel built on the high-performance FloatX library [16], which implements the type T_{w,t}. A fast precision analysis allows us to evaluate more than 3,000 models with a full grid search of 184 types (w \u2208 [1,8], t \u2208 [1,23]) on the entire validation data.\n\nFigure 2: Left: High correlations between two analytical properties of network architectures. Right: Runtime-dependent latency is best correlated with the workload when different search space-specific characteristics are present.\n\n3.3 Deployment on hardware and performance characterization\n\nTo evaluate model execution performance on the IoT target device, we perform a calibration to assess the execution speed of the models of interest. Despite the many choices of deep-learning frameworks, the ways of optimizing code depending on compilation or software version, and the several hardware platforms that accelerate deep-learning models, we formulate the performance characterization in a general manner, decoupled as far as possible from the topology architecture search and the precision analysis to facilitate subsequent extensions. Performance measurements on the IoT device are affected by explicit and implicit settings. We demonstrate our search algorithm with performance measurements featuring the fewest assumptions and requirements regarding the runtime. To that end, we selected the Raspberry Pi 3 (B+) as a representative low-cost IoT device. 
It features a Broadcom BCM2837B0 quad-core ARMv8 Cortex-A53 running at 1.4 GHz, and the board is equipped with 1 GB of LPDDR2 memory [2]. The Raspberry Pi 3 (B+) belongs to the general-purpose device category: it ships with peripherals (WiFi, LAN, Bluetooth, USB, and HDMI) and a full operating system (Raspbian, a Linux distribution), and it is available for about $35 [32].\n\nThroughout this work, we measure the model inference latency on the target device by averaging over ten repetitions. We used a batch size of one to minimize latency and internal memory requirements. The latency study covers many relevant use cases, for example the classification of sporadically arriving data within a short time to prolong battery lifetime, or the frame-wise processing of a video stream, where the classification must be completed before the next frame arrives.\n\nFor each model, we consider two analytical properties: the number of trainable parameters and the workload, measured as the number of floating-point operations required for inference. The calibration step relates these analytical properties to execution performance and allows us to estimate runtime metrics from them. Figure 2 shows high correlations between the number of parameters, the workload and the measured latency on the Raspberry Pi 3 (B+) device. Workload and parameters follow a similar scaling over five orders of magnitude with homogeneous variations. The dynamic range of the latency spans more than two orders of magnitude, with higher variations for larger models. However, owing to the compute-bound nature of the kernels, the workload is a better indicator of latency than the number of parameters.\n\n4 Fast cognitive design algorithms\n\nIn this section, we leverage the architecture search, the precision analysis, and the hardware calibration steps to synthesize case-specific solutions that satisfy given constraints. 
We address two tasks: First, the constraint search solves for the model that best satisfies given constraints. Second, the Pareto front elaboration provides insights into the tradeoffs over the entire solution space. The two tasks are related: solving the first task on a grid of constraints provides solutions to the second task, whereas filtering the latter based on the given constraints yields the former. Both tasks are solved both manually and automatically by defining the sampling law configurations on the same set of narrow-search spaces, as shown in Figure 3.\n\nFigure 3: Manual and automatic workflow. First, sampling laws are defined to generate models of interest. Second, models are calibrated to check latency on the IoT device, even if they are not yet trained. Third, models are trained to achieve accuracy. As training is the most expensive task, it is essential to limit the number of trained models to candidates of interest only.\n\nFigure 4: Left: Manually defined restricted sampling laws cover the entire space. Right: Automatic search finds sampling law configurations without human interaction, and the distribution covers a higher dynamic range than sampling uniformly in the entire space.\n\nIn the manual task, collected statistics of analytical network properties provide quick feedback to adapt the settings to cover the range of interest. For a fair comparison of the manual and automatic workflows, we assume throughout our experiments that the expert has no further feedback knowledge about model accuracy. Additionally, network runtime performance metrics can be measured on the target device or estimated from calibration measurements. 
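One simple way to estimate runtime from calibration measurements, given the strong workload-latency correlation reported in Figure 2, is a least-squares fit in log-log space; this estimator is our own illustrative assumption, not a method prescribed by the paper:

```python
import math

def fit_loglog(workloads, latencies_ms):
    """Least-squares fit of log(latency) = a * log(workload) + b
    on calibration measurements taken on the target device."""
    xs = [math.log(w) for w in workloads]
    ys = [math.log(t) for t in latencies_ms]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def predict_latency_ms(workload, a, b):
    """Estimate latency for an untrained candidate from its analytical workload."""
    return math.exp(a * math.log(workload) + b)
```

A fit like this lets thousands of sampled architectures be screened against a latency constraint using only their analytical workload, reserving on-device measurement for the shortlisted candidates.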
Next, depending on the task type, either a few candidate networks that satisfy the constraints or a full wave of networks are selected for training. Large-scale training takes the most time\u2014each training job is of complexity O(n_train \u00b7 C_model \u00b7 E), proportional to the amount of training data, the model complexity and the number of epochs for which the model is trained.\n\nWe designed a genetic and clustering-based algorithm to automate the design of sampling laws. We define the valid space with a list of variables with absolute minimal and maximal ratings. A sampling law L(S_i) is defined as an ordered set of uniform sampling laws L = (U_x[l_x, h_x], ...) with lower and upper limits l_x and h_x per variable x. The genetic algorithm automatically learns the search space-specific sampling law limits [l_x, h_x]. The cost function is defined in a two-step approach. First, the statistic (\u00b5_m, \u03c3_m) := E^n_m(L) is estimated by computing the mean and standard deviation of the metric m over the n generated topologies. Second, the cost is computed as c((\u00b5_m, \u03c3_m), (\u03c4_1, \u03c4_2)) := |\u00b5_m \u2212 \u03c3_m \u2212 \u03c4_1| + |\u00b5_m + \u03c3_m \u2212 \u03c4_2| so that the high-density range of the estimated distribution coincides with a given interval (\u03c4_1, \u03c4_2). We avoided definitions based on single-sided constraints such as \u00b5 < \u03c4 because such formulations might be satisfied trivially (using the smallest network) or by undesirable laws having overly wide or narrow variations. We used the tournament selection variant of genetic algorithms [18] and defined mutations by randomly adapting the sampling-law hyper-parameters l_x and h_x. 
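The two-step cost just described can be sketched in a few lines of Python; this is our own illustrative code (function names and the metric callback are assumptions), not the authors' released implementation:

```python
import statistics

def estimate_stats(sample_metric, law, n=10):
    """Step 1: estimate (mu, sigma) of a metric, e.g. the parameter count,
    over n topologies drawn from a sampling law."""
    values = [sample_metric(law) for _ in range(n)]
    return statistics.mean(values), statistics.pstdev(values)

def cost(mu, sigma, tau1, tau2):
    """Step 2: c = |mu - sigma - tau1| + |mu + sigma - tau2|.
    Zero cost means the high-density range [mu - sigma, mu + sigma]
    coincides exactly with the target interval (tau1, tau2)."""
    return abs(mu - sigma - tau1) + abs(mu + sigma - tau2)
```

A law whose samples concentrate around the target interval scores low, while both a law centered elsewhere and a law with a much wider or narrower spread are penalized, which is exactly why the single-sided form \u00b5 < \u03c4 was avoided.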
We used an initial population of n_init = 100 and ran the algorithm for n_steps = 900 steps, while using n_eval = 10 samples to estimate the mean and standard deviation per configuration. This way, one search considers (n_init + n_steps) \u00b7 n_eval = 10,000 networks. As the final population might contain different sampling laws of similar quality, we performed spectral clustering [43] to find k = 10 clusters of similar sampling laws. We assembled a list of the most diverse top-k laws by taking the best-fit law per cluster.\n\nTo elaborate the entire search space with a Pareto optimal front, we split each decade into three intervals, [\u03c4, 2\u03c4, 5\u03c4, 10\u03c4], and define a grid for \u03c4 = 10^3, 10^4, 10^5, 10^6, spanning five orders of magnitude. We ran the genetic search algorithm several times, setting the target bounds (\u03c4_1, \u03c4_2) in a sliding-window manner over consecutive values from the defined grid. Finally, we accumulated the results of twelve genetic searches, each of which found ten sampling laws, and sampled each law n_val = 100 times to obtain the statistics of 12,000 network architectures per narrow-space search.\n\nFigure 4 shows results for manually and automatically sampled networks. 
Even though the manual search covers the region of interest nicely, human expertise is required to define the parameters of the laws L_1 to L_6 correctly. The naive sampling approach in the entire search space produces a narrow distribution that is strongly skewed towards larger networks. In contrast, the genetic algorithm equalizes the distribution and provides samples that cover a much higher dynamic range, extending the scale especially towards smaller networks without manually restricting the architecture.\n\n5 Results\n\nTo study our algorithm, we ran full design-space explorations on the well-established CIFAR-10 [28] classification task and compared our results with those obtained with established reference models. Figure 5 shows the tradeoff between model size and accuracy, including manually and automatically generated results of the aggregate search spaces.\n\nFigure 5: Results of our architecture search compared with reference models. Each dot represents a model according to its size and the obtained accuracy on the CIFAR-10 validation set. Our search finds results over five orders of magnitude and, in particular, finds various models that are much smaller than out-of-the-box models. In the restricted IoT domain, our search delivers models that outperform the reference by a wide margin for fixed constraints.\n\nFigure 6: Left: Zoomed view of the direct comparison; manual and automatic searches perform equally well. Middle: Manual and automatic search results. In the manual case, clusters are visible, whereas the automatic search sampled in a more homogeneous manner. Right: Results for one narrow-space search with marked clusters matching Figure 4.\n\nFigure 7: Final result showing the achievable tradeoffs between the model latency measured on the IoT device and the model accuracy. Our search is able to deliver models that run below 10 ms on a Raspberry Pi 3 (B+), which we took as a representative low-cost IoT device.\n\nThe Pareto optimal front follows a smooth curve that saturates towards the best accuracy obtainable for large models; accuracy scales roughly linearly in the logarithm of the number of parameters. Even very small models with fewer than 1000 parameters can achieve accuracies greater than 45%. The accuracy increase per decade of added parameters is on the order of 30, 15, 3 and fewer than 2 percentage points, decreasing very quickly thereafter. This effect allows us to construct models having several orders of magnitude fewer parameters. It also provides economically interesting solutions for IoT devices that are powerful enough to process data in real time. We compare our results with three sources of reference models: (a) traditional reference models, (b) ProbeNets [40], which are designed to be small and fast, and (c) models designed to run on the parallel ultra-low power (PULP) platform [11]. Traditional models include 30 reference topologies with variants of VGG [41], ResNets [20], GoogLeNet [45], MobileNets [23], dual-path nets (DPNs) [8] and DenseNets [25], most of which (28/30) exceed 1 M parameters. 
ProbeNets were originally introduced to characterize classification difficulty and are considerably smaller by design [40]. They act as reference points for manually designed networks that cover the relevant lower tail in terms of parameters. In the IoT-relevant domain (<10 M parameters), our search outperforms all the listed reference models.\n\nFigure 8: We demonstrate the scalability of our approach by applying our search to three constraints on thirteen datasets. The best models per dataset and constraint are connected with a line.\n\nThe top three fronts in Figure 5 show the results of our precision analysis. For each trained model, we evaluated the effect of running the model with all configurations of the type T_{w,t} and plot the Pareto-optimal front. We considered three cases: (1) running all models with half precision, (2) running all models with the type T_{4,3}, which is the best choice among types that are 8 bits long, and (3) running each model with its individual best tradeoff type T_{w,t}. We demonstrate empirically that reduced precision pushes the Pareto optimal front. Under a given memory constraint, accuracy improves by more than 7 percentage points for half and by another percentage point or more for the model-individual format.\n\nFigure 6 shows details of the manual and automatic searches, both of which yield very similar results. The right-hand graphs show results obtained for one narrow-space search, where manually defined sampling laws led to clusters. The automatic search covered a similar range homogeneously. 
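To illustrate what such a T_{w,t} sweep does to individual values, round-to-nearest quantization into w exponent and t significand bits can be emulated in a few lines; this is our own sketch (assuming saturation at the largest normal number and ignoring subnormals), not the FloatX GPU kernel:

```python
import math

def quantize(v, w, t):
    """Round v to an IEEE-754-like format with w exponent and t significand bits."""
    if v == 0.0 or math.isnan(v) or math.isinf(v):
        return v
    bias = 2 ** (w - 1) - 1
    e = math.floor(math.log2(abs(v)))
    e = max(min(e, bias), 1 - bias)      # clamp exponent to the normal range
    scale = 2.0 ** (e - t)               # spacing of representable numbers at e
    q = round(v / scale) * scale         # round-to-nearest-even significand
    # Saturate instead of overflowing to infinity (an illustrative choice).
    largest = (2.0 - 2.0 ** -t) * 2.0 ** bias
    return max(min(q, largest), -largest)

# half precision corresponds to w=5, t=10
print(quantize(0.1, 5, 10))
```

Applying such a function element-wise to the weights and activations of an already-trained model, over the whole (w, t) grid, reproduces the spirit of the extrinsic-quantization sweep: each model's accuracy can then be read off per format without retraining.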
Figure 7 shows inference times when the same set of models is executed on a Raspberry Pi 3 (B+). Similarly, granting small models additional latency yields dominant accuracy gains, whereas for large models the accuracy improves only slightly, even when using far more complex models that require long evaluation times.\n\nFigure 8 demonstrates the scalability of our approach. We applied our search for three constraints \u03c4 = 10^3, 10^4, 10^5 on thirteen datasets [40], spending a training effort of ten architectures per dataset and constraint. The lines connect the best-performing architectures per constraint and dataset.\n\n6 Conclusion\n\nWe studied the synthesis of deep neural networks that are eligible candidates to run efficiently on IoT devices. We propose a narrow-space search approach that quickly leverages knowledge from existing architectures and that is modular enough to be adapted to new design patterns. Manually and automatically designed sampling laws allow models to be generated with parameter counts covering multiple orders of magnitude. We demonstrate that reduced precision improves top-1 accuracy by over 8 percentage points under a given weight-memory constraint in the IoT-relevant domain. A strong correlation between model size and latency enables us to create models small enough to provide superior inference response latencies below 10 ms on an edge device costing only about $35.\n\nAcknowledgments\n\nThis work was funded by the European Union\u2019s H2020 research and innovation program under grant agreement No 732631, project OPRECOMP.\n\nReferences\n\n[1] PyTorch. https://pytorch.org/. Accessed: 2019-05-22.\n\n[2] Raspberry Pi 3 Model B+ product description. https://www.raspberrypi.org/products/raspberry-pi-3-model-b-plus/. 
Accessed: 2019-05-14.\n\n[3] A. Ashiquzzaman, L. V. Ma, S. Kim, D. Lee, T. Um, and J. Kim. Compacting deep neural networks for light weight iot scada based applications with node pruning. In 2019 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pages 082\u2013085, Feb 2019.\n\n[4] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. CoRR, abs/1611.02167, 2016.\n\n[5] L. Bossard, M. Guillaumin, and L. Van Gool. Food-101 \u2013 mining discriminative components with random forests. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision \u2013 ECCV 2014, pages 446\u2013461, Cham, 2014. Springer International Publishing.\n\n[6] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang. Efficient architecture search by network transformation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.\n\n[7] L. Cavigelli and L. Benini. Extended bit-plane compression for convolutional neural network accelerators. CoRR, abs/1810.03979, 2018.\n\n[8] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4467\u20134475. Curran Associates, Inc., 2017.\n\n[9] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR \u201914, pages 3606\u20133613, Washington, DC, USA, 2014. IEEE Computer Society.\n\n[10] A. Coates, A. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In G. Gordon, D. Dunson, and M. 
Dud\u00edk, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 215\u2013223, Fort Lauderdale, FL, USA, 11\u201313 Apr 2011. PMLR.\n\n[11] F. Conti, D. Rossi, A. Pullini, I. Loi, and L. Benini. PULP: A ultra-low power parallel accelerator for energy-efficient and flexible embedded vision. Journal of Signal Processing Systems, 84(3):339\u2013354, Sep 2016.\n\n[12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE CVPR, pages 248\u2013255, 2009.\n\n[13] L. Deng. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141\u2013142, 2012.\n\n[14] T. Domhan, J. T. Springenberg, and F. Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.\n\n[15] G. Fenza, M. Gallo, and V. Loia. Drift-aware methodology for anomaly detection in smart grid. IEEE Access, 7:9645\u20139657, 2019.\n\n[16] G. Flegar, F. Scheidegger, V. Novakovic, G. Mariani, A. Tomas, C. Malossi, and E. Quintana-Ort\u00ed. FloatX: A C++ library for customized floating-point arithmetic. Submitted, 2019.\n\n[17] M. M. Gaber, A. Aneiba, S. Basurra, O. Batty, A. M. Elmisery, Y. Kovalchuk, and M. H. U. Rehman. Internet of things and data mining: From applications to techniques and systems. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(3):e1292, 2019.\n\n[18] D. E. Goldberg and K. Deb. A comparative analysis of selection schemes used in genetic algorithms. In Foundations of genetic algorithms, volume 1, pages 69\u201393. Elsevier, 1991.\n\n[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. 
In The IEEE Conference\n\non Computer Vision and Pattern Recognition (CVPR), June 2016.\n\n[20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the\n\nIEEE conference on computer vision and pattern recognition, pages 770\u2013778, 2016.\n\n9\n\n\f[21] P. Hill, B. Zamirai, S. Lu, Y. Chao, M. Laurenzano, M. Samadi, M. C. Papaefthymiou, S. A. Mahlke, T. F.\nWenisch, J. Deng, L. Tang, and J. Mars. Rethinking numerical representations for deep neural networks.\nCoRR, abs/1808.02513, 2018.\n\n[22] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam.\nMobilenets: Ef\ufb01cient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861,\n2017.\n\n[23] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam.\nMobilenets: Ef\ufb01cient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861,\n2017.\n\n[24] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks.\nIn Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700\u20134708,\n2017.\n\n[25] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks.\n\nIn The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.\n\n[26] R. Istrate, F. Scheidegger, G. Mariani, D. S. Nikolopoulos, C. Bekas, and A. C. I. Malossi. Tapas: Train-less\n\naccuracy predictor for architecture search, 2019.\n\n[27] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank\n\nexpansions. CoRR, abs/1405.3866, 2014.\n\n[28] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.\n\n[29] W. Li, T. Logenthiran, V.-T. Phan, and W. L. Woo. A novel smart energy theft system (sets) for iot based\n\nsmart home. IEEE Internet of Things Journal, 2019.\n\n[30] D. 
M. Loroch, F.-J. Pfreundt, N. Wehn, and J. Keuper. Tensorquant: A simulation toolbox for deep neural\nnetwork quantization. In Proceedings of the Machine Learning on HPC Environments, MLHPC\u201917, pages\n1:1\u20131:8, New York, NY, USA, 2017. ACM.\n\n[31] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad,\nA. Navruzyan, N. Duffy, and B. Hodjat. Chapter 15 - evolving deep neural networks. In R. Kozma,\nC. Alippi, Y. Choe, and F. C. Morabito, editors, Arti\ufb01cial Intelligence in the Age of Neural Networks and\nBrain Computing, pages 293 \u2013 312. Academic Press, 2019.\n\n[32] S. Mittal. A survey on optimized implementation of deep learning models on the nvidia jetson platform.\n\nJournal of Systems Architecture, 2019.\n\n[33] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with\nunsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning,\nvolume 2011, page 5, 2011.\n\n[34] M. E. Nilsback and A. Zisserman. Automated \ufb02ower classi\ufb01cation over a large number of classes. In 2008\n\nSixth Indian Conference on Computer Vision, Graphics Image Processing, pages 722\u2013729, Dec 2008.\n\n[35] A. Nordrum. The internet of fewer things [news]. IEEE Spectrum, 53(10):12\u201313, October 2016.\n\n[36] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Ef\ufb01cient neural architecture search via parameter\n\nsharing. CoRR, abs/1802.03268, 2018.\n\n[37] A. Quattoni and A. Torralba. Recognizing indoor scenes. In 2009 IEEE Conference on Computer Vision\n\nand Pattern Recognition, pages 413\u2013420, June 2009.\n\n[38] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin. Large-\nscale evolution of image classi\ufb01ers. In Proceedings of the 34th International Conference on Machine\nLearning-Volume 70, pages 2902\u20132911. JMLR. org, 2017.\n\n[39] V. Rybalkin, N. Wehn, M. R. Youse\ufb01, and D. 
Stricker. Hardware architecture of bidirectional long short-\nterm memory neural network for optical character recognition. In Proceedings of the Conference on Design,\nAutomation & Test in Europe, pages 1394\u20131399. European Design and Automation Association, 2017.\n\n[40] F. Scheidegger, R. Istrate, G. Mariani, L. Benini, C. Bekas, and C. Malossi. Ef\ufb01cient image dataset\n\nclassi\ufb01cation dif\ufb01culty estimation for predicting deep-learning accuracy. submitted, 2019.\n\n[41] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.\n\narXiv preprint arXiv:1409.1556, 2014.\n\n10\n\n\f[42] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. The german traf\ufb01c sign recognition benchmark: A\nmulti-class classi\ufb01cation competition. In The 2011 International Joint Conference on Neural Networks,\npages 1453\u20131460, July 2011.\n\n[43] X. Y. Stella and J. Shi. Multiclass spectral clustering. In null, page 313. IEEE, 2003.\n\n[44] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.\nGoing deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern\nrecognition, pages 1\u20139, 2015.\n\n[45] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for\ncomputer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June\n2016.\n\n[46] M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le. Mnasnet: Platform-aware neural architecture\n\nsearch for mobile. CoRR, abs/1807.11626, 2018.\n\n[47] Y. Weng, T. Zhou, L. Liu, and C. Xia. Automatic convolutional neural architecture search for image\n\nclassi\ufb01cation under different scenes. IEEE Access, 7:38495\u201338506, 2019.\n\n[48] M. Wistuba. Deep learning architecture search by neuro-cell-based evolution with function-preserving\nmutations. 
In Joint European Conference on Machine Learning and Knowledge Discovery in Databases,\npages 243\u2013258. Springer, 2018.\n\n[49] M. Wistuba, A. Rawat, and T. Pedapati. A survey on neural architecture search. arXiv preprint\n\narXiv:1905.01392, 2019.\n\n[50] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer. Fbnet: Hardware-\naware ef\ufb01cient convnet design via differentiable neural architecture search. CoRR, abs/1812.03443, 2018.\n\n[51] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine\n\nlearning algorithms, 2017.\n\n[52] L. Xie and A. Yuille. Genetic cnn. In Proceedings of the IEEE International Conference on Computer\n\nVision, pages 1379\u20131388, 2017.\n\n[53] Z. Zhong, J. Yan, and C. Liu. Practical network blocks design with q-learning. CoRR, abs/1708.05552,\n\n2017.\n\n[54] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for\n\nscene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.\n\n[55] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578,\n\n2016.\n\n[56] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image\nrecognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.\n\n[57] D. Zuras, M. Cowlishaw, A. Aiken, M. Applegate, D. Bailey, S. Bass, D. Bhandarkar, M. Bhat, D. Bindel,\n\nS. Boldo, et al. Ieee standard for \ufb02oating-point arithmetic. IEEE Std 754-2008, pages 1\u201370, 2008.\n\n11\n\n\f", "award": [], "sourceid": 3263, "authors": [{"given_name": "Florian", "family_name": "Scheidegger", "institution": "IBM Research -- Zurich"}, {"given_name": "Luca", "family_name": "Benini", "institution": "ETHZ, University of Bologna"}, {"given_name": "Costas", "family_name": "Bekas", "institution": "IBM Research GmbH"}, {"given_name": "A. 
Cristiano I.", "family_name": "Malossi", "institution": "IBM Research - Zurich"}]}