{"title": "Compiler Auto-Vectorization with Imitation Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 14625, "page_last": 14635, "abstract": "Modern microprocessors are equipped with single instruction multiple data (SIMD) or vector instruction sets which allow compilers to exploit fine-grained data level parallelism. To exploit this parallelism, compilers employ auto-vectorization techniques to automatically convert scalar code into vector code. Larsen & Amarasinghe (2000) first introduced superword level parallelism (SLP) based vectorization, which is one form of vectorization popularly used by compilers. Current compilers employ hand-crafted heuristics and typically only follow one SLP vectorization strategy which can be suboptimal. Recently, Mendis & Amarasinghe (2018) formulated the instruction packing problem of SLP vectorization by leveraging an integer linear programming (ILP) solver, achieving superior runtime performance. In this work, we explore whether it is feasible to imitate optimal decisions made by their ILP solution by fitting a graph neural network policy. We show that the learnt policy produces a vectorization scheme which is better than industry standard compiler heuristics both in terms of static measures and runtime performance. 
More specifically, the learnt agent produces a vectorization scheme which has a 22.6% higher average reduction in cost compared to LLVM compiler when measured using its own cost model and achieves a geometric mean runtime speedup of 1.015\u00d7 on the NAS benchmark suite when compared to LLVM\u2019s SLP vectorizer.", "full_text": "Compiler Auto-Vectorization with Imitation Learning\n\nCharith Mendis\n\nMIT CSAIL\n\ncharithm@mit.edu\n\nCambridge Yang\n\nMIT CSAIL\n\ncamyang@csail.mit.edu\n\nYewen Pu\nMIT CSAIL\n\nyewenpu@mit.edu\n\nSaman Amarasinghe\n\nMIT CSAIL\n\nsaman@csail.mit.edu\n\nMichael Carbin\n\nMIT CSAIL\n\nmcarbin@csail.mit.edu\n\nAbstract\n\nModern microprocessors are equipped with single instruction multiple data (SIMD)\nor vector instruction sets which allow compilers to exploit \ufb01ne-grained data level\nparallelism. To exploit this parallelism, compilers employ auto-vectorization\ntechniques to automatically convert scalar code into vector code. Larsen & Amaras-\ninghe (2000) \ufb01rst introduced superword level parallelism (SLP) based vectorization,\nwhich is a form of vectorization popularly used by compilers. Current compilers\nemploy hand-crafted heuristics and typically only follow one SLP vectorization\nstrategy which can be suboptimal. Recently, Mendis & Amarasinghe (2018) for-\nmulated the instruction packing problem of SLP vectorization by leveraging an\ninteger linear programming (ILP) solver, achieving superior runtime performance.\nIn this work, we explore whether it is feasible to imitate optimal decisions made by\ntheir ILP solution by \ufb01tting a graph neural network policy. We show that the learnt\npolicy, Vemal, produces a vectorization scheme that is better than the well-tuned\nheuristics used by the LLVM compiler. 
More specifically, the learnt agent produces a vectorization strategy that has a 22.6% higher average reduction in cost compared to the LLVM compiler when measured using its own cost model, and matches the runtime performance of the ILP-based solution in 5 out of 7 applications in the NAS benchmark suite.

1 Introduction

Modern microprocessors have introduced single instruction multiple data (SIMD) or vector units (e.g. Intel x86 AVX extensions1) to accelerate execution of performance-critical applications by performing computations on multiple data items in parallel. In order to use these vector units, programmers must either code using platform-specific vector assembly instructions (Figure 1(c)), which is tedious, error-prone and results in non-portable code, or use existing compiler auto-vectorization techniques to automatically discover data-parallel portions of programs and to transform scalar instructions (Figure 1(a)) in such regions into vector instructions (Figure 1(b),(c)).
Larsen & Amarasinghe (2000) first showed how to perform compiler auto-vectorization using fine-grained parallelism available in programs at the instruction level (superword level parallelism), targeting fixed-width vector instructions available in modern microprocessors. Superword level parallelism (SLP) based vectorization provided a more general alternative to the loop vectorization techniques proposed earlier (Allen & Kennedy, 1987; Sreraman & Govindarajan, 2000).
In order to exploit SLP for vectorization, compilers need to select a set of scalar instructions which can be executed in parallel and merge them into a single vector instruction. This process reduces to the optimal subset selection problem, which is known to be NP-hard.
1https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: (a) scalar code storing the result of adding elements loaded from arrays b,c into array a (b) compiler auto-vectorized code; expressions inside {.} are executed in parallel. (c) manually written vector assembly instructions / executable code generated by the compiler for the auto-vectorized code in (b)

Therefore, many existing SLP vectorization schemes are driven by hand-crafted heuristics (Liu et al., 2012; Shin et al., 2003, 2005, 2002), which can be suboptimal. Recently, goSLP (Mendis & Amarasinghe, 2018) introduced an SLP vectorization strategy with certain optimality guarantees, guided by an Integer Linear Programming (ILP) solver. It has shown superior vectorization schemes compared to heuristic-guided solutions, achieving end-to-end speedups on well-known compiler benchmark suites.
In this work, we propose a technique to learn how to vectorize by imitating the solution presented in goSLP (Mendis & Amarasinghe, 2018). Specifically, we formulate the decision procedure as a Markov Decision Process (MDP) and then use the DAGGER algorithm (Ross et al., 2011) to collect traces on how the ILP solver solves the SLP vectorization problem.
We use these trace aggregates as supervision to train a parametrized policy modeled by a Gated Graph Neural Network (Li et al., 2015; Allamanis et al., 2017).
We show that the learnt policy, Vemal, outperforms one of the well-tuned heuristics used in the LLVM compiler (Lattner & Adve, 2004) both in terms of static metrics and dynamic runtime performance, while matching the performance of goSLP in 5 out of 7 programs in the NAS benchmark suite held out for testing.
Specifically, we make the following contributions:

• Formulation of the SLP vectorization problem as a Markov Decision Process (MDP), where the vectorization strategy is constructed sequentially.
• Modeling of the agent policy that solves the MDP as a Gated Graph Neural Network (GGNN) (Li et al., 2015; Allamanis et al., 2017) and using imitation learning to train the policy network to mimic the optimal decisions made by the ILP solver.
• Evaluation of the learnt policy on representative compiler benchmark suites (SPEC2006fp C/C++ (Henning, 2006), SPEC2017fp C/C++ (Bucek et al., 2018) and the NAS benchmark suite (Bailey et al., 1991)). Specifically, we show that the learnt policy has an average static cost reduction which is 22.6% higher than LLVM's, as measured by LLVM's own static cost model. We also show that the learnt policy achieves a geometric mean runtime speedup of 1.015× on the NAS benchmark suite (held out during training) compared to LLVM, matching the runtime speedups of goSLP in 5 out of 7 applications.

In summary, we have shown that it is possible to learn end-to-end compiler optimization policies which surpass the performance of hand-crafted compiler heuristics by imitating an optimal solution.

2 The SLP Vectorization Problem

Superword Level Parallelism (SLP) is a type of fine-grained parallelism first introduced by Larsen & Amarasinghe (2000) that is suitable for vector code generation.
SLP is available in cases where two or more scalar instructions are independent (the second instruction does not require the output of the first instruction, and vice-versa) and isomorphic (of the same instruction type, such as addition or subtraction). For example, consider the code snippet from Figure 2(a). It computes intermediate values A1 up to A3 by dividing values loaded from array L. Next, these intermediate values are used in a series of subtraction operations and the results are stored in array S, which is disjoint from L. Both the division and subtraction instruction groups, and the adjacent loads and stores, are independent and isomorphic. Hence, they are amenable to SLP vectorization.
Such scalar instructions can be executed in parallel and hence can be packed to form a single vector instruction. We name a set of scalar instructions which are packed a vector pack. For example, considering the code snippet shown in Figure 2(a), we can form a vector pack out of A1 and A2 and perform the division in vector form as: {A1, A2} = {L[5], L[6]} / {L[2], L[3]}.
Forming vector instructions by packing multiple scalar instructions can lead to better runtime performance. However, not all packing opportunities are profitable. For instance, if a set of instructions is packed while their operands remain scalar, there will be an additional packing overhead to bring the operands also into packed form. On the other hand, if the output of packed instructions is used by other scalar instructions, there will be an unpacking overhead.

[Figure 1 panels: (a) a[0] = b[0] + c[0]; a[1] = b[1] + c[1] (b) {a[0],a[1]} = {b[0],b[1]} + {c[0],c[1]} (c) movdqa xmm0, c; paddq xmm0, b; movdqa a, xmm0]

The task of the compiler is to find the most profitable set of vector packs to form among all available packing opportunities.
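The profitability trade-off just described can be made concrete with a toy static cost model. The costs below are invented for illustration (LLVM's real cost model is target-specific); the point is only that pack/unpack overheads can outweigh the saving from merging scalars:

```python
# Toy illustration of the pack/unpack overhead trade-off; all costs are
# invented and per-instruction, as in a simple static cost model.
SCALAR_COST = 1   # assumed cost of one scalar instruction
VECTOR_COST = 1   # assumed cost of one 2-wide vector instruction
PACK_COST = 1     # assumed overhead of gathering two scalars into a vector
UNPACK_COST = 1   # assumed overhead of extracting a scalar from a vector

def packed_cost(n_vector, n_pack, n_unpack):
    """Static cost of a block after vectorization."""
    return n_vector * VECTOR_COST + n_pack * PACK_COST + n_unpack * UNPACK_COST

# Two isomorphic, independent scalar ops whose operands are already in
# packed form: one vector op replaces two scalar ops -> profitable.
assert packed_cost(n_vector=1, n_pack=0, n_unpack=0) < 2 * SCALAR_COST
# The same pack when both operand pairs stay scalar: two packing
# instructions outweigh the saving -> unprofitable.
assert packed_cost(n_vector=1, n_pack=2, n_unpack=0) > 2 * SCALAR_COST
```

Under this toy model a pack only pays off when enough of its operands and uses are themselves packed, which is exactly why the selection problem is global rather than per-pair.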
This combinatorial optimization task is not easy even in the case of pairwise\ninstruction packing (Mendis & Amarasinghe, 2018) and is NP-hard.\n2.1 Motivating Example\n\nFigure 2: Comparison of SLP vectorization strategies (a) code example; S and L are two disjoint\narrays, (b)-(c) show dependency graphs of packed instructions under each instruction packing strategy\n(b) under LLVM\u2019s SLP vectorization algorithm (c) optimal packing. Solid arrows show dependencies.\nGroupings with solid circles show vector packs. Groupings with dotted circles show packs created\nusing overhead packing instructions and dotted lines show unpacking of values from vector packs.\nWe compare LLVM\u2019s greedy packing strategy with goSLP\u2019s optimal packing strategy for the code\nsnippet shown in Figure 2(a) to illustrate how instruction packing affects the quality of vectorization.\nFigure 2(b) shows the packing solution of LLVM\u2019s SLP vectorization algorithm. LLVM\u2019s greedy\nalgorithm chooses to vectorize adjacent stores S[2] and S[3] \ufb01rst. Next, it tries to pack the operands\nof already packed values recursively. Using this procedure, operands of {S[2], S[3]} are packed next,\nforming vector packs {L[2], L[3]} and {A1, A3}. However, operands of {A1, A3} cannot be packed,\nsince they access non-adjacent memory locations, forcing LLVM to emit two overhead packing\ninstructions. Further, L[2] needs to be unpacked before it can be repacked with L[4]. Altogether,\nLLVM\u2019s strategy yields 3 vector instructions while incurring a cost of 3 overhead instructions.\nFigure 2(c) shows the optimal packing solution found by goSLP using an ILP solver. 
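The independence condition used informally above can be checked mechanically by reachability over the dependence graph. A minimal sketch for the Figure 2(a) snippet, with the direct dependence edges read off the figure (names like 'A1' and 'L5' are shorthand for the figure's instructions, not the paper's implementation):

```python
# Direct dependences for the Figure 2(a) snippet: an instruction depends
# on the instructions producing its operands.
deps = {
    'A1': {'L5', 'L2'}, 'A2': {'L6', 'L3'}, 'A3': {'L7', 'L4'},
    'S1': {'L1', 'A2'}, 'S2': {'L2', 'A3'}, 'S3': {'L3', 'A1'},
}

def depends_on(a, b):
    """True if instruction a directly or transitively depends on b."""
    worklist = set(deps.get(a, ()))
    seen = set()
    while worklist:
        x = worklist.pop()
        if x == b:
            return True
        if x not in seen:
            seen.add(x)
            worklist |= set(deps.get(x, ()))
    return False

def independent(a, b):
    return not depends_on(a, b) and not depends_on(b, a)

assert independent('A1', 'A2')      # both divisions may be packed
assert not independent('S3', 'A1')  # S[3] consumes A1, so they may not
```

Isomorphism and memory adjacency are checked separately; independence is the only condition that needs transitive closure.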
The optimal strategy creates vector pack {S[1], S[2]} instead of {S[2], S[3]}, which proves to be globally optimal. Altogether, this strategy yields 5 vector instructions and requires only 2 overhead instructions.
In this paper, we evaluate whether it is possible to learn a policy that imitates goSLP's ILP solution for the pairwise instruction packing problem. Note that we are not aiming to learn a general-purpose ILP solver, but to imitate the ILP solution for this problem domain. We will now formalize the pairwise instruction packing problem.

2.2 The Pairwise Instruction Packing Problem

Let I = {I1, I2, ..., In−1, In, I∅} be the set of n + 1 instructions, where I∅ is an artificial empty instruction. A valid instruction packing P̄ = {P1 . . . Pm} is a collection of pairs

Pi ∈ I × I ,   ∪i Pi = I

that satisfies two kinds of packing constraints: CI (within pairs) and CII (between pairs).
Within Pair. A legal pack P = (Ii, Ij) must satisfy the within-pair constraint CI(P):

• Ii and Ij must be isomorphic: they perform the same operation on the same data types, which results in values of the same type.
• Ii and Ij must be independent: Ii and Ij cannot be directly or transitively dependent, i.e. they cannot be reachable from one another in the data-dependency graph.
• If Ii and Ij require reordering, it should be possible under the hardware memory model.
• If Ii and Ij access memory, they must access adjacent memory locations.
• Ii ≺ Ij under some ordering of statements I1 ≺ ··· ≺ In ≺ I∅.

[Figure 2 panels: (a) A1 = L[5] / L[2]; A2 = L[6] / L[3]; A3 = L[7] / L[4]; S[1] = L[1] - A2; S[2] = L[2] - A3; S[3] = L[3] - A1 (b) LLVM's SLP algorithm: packs {S[2],S[3]}, {L[2],L[3]}, {A3,A1}, overhead packs {L[5],L[7]}, {L[2],L[4]}, unpack of L[2]; vector instructions: 3, overhead instructions: 3 (c) goSLP's optimal algorithm: packs {S[1],S[2]}, {L[1],L[2]}, {A2,A3}, {L[3],L[4]}, {L[6],L[7]}, unpacks of L[2], L[3]; vector instructions: 5, overhead instructions: 2]

Between Pairs. Packs Pi and Pj can both be created iff they satisfy the between-pairs constraint CII(Pi, Pj):

• Pi and Pj are schedulable: there shouldn't be any circular dependencies between the two packs. For example, if Ii,1, Ii,2 ∈ Pi and Ij,1, Ij,2 ∈ Pj, it shouldn't be the case that Ii,1 δ Ij,1 and Ij,2 δ Ii,2, where δ denotes dependency.
• Pi and Pj are not overlapping: ∀Ii ∈ Pi ⇒ Ii ∉ Pj. That is, a single statement can only belong to one pack.

We write CII(Pi, Pj) if the packs satisfy the between-pairs constraint.
Static Cost. We evaluate the efficacy of the packing strategies using LLVM's static cost model. The static cost model assigns costs for each vector and scalar instruction. The total cost of a block of code is calculated as the sum of these per-instruction costs.
Objective. Let F be the performance (measured using static cost) of executing the block of code with vectorized instructions formed by a packing. The goal of any SLP vectorization scheme is to find a packing P̄ that minimizes the cost (argminP̄ F(P̄)) subject to constraints CI and CII.

3 Learnt Solution - Vemal

Our learnt vectorization policy, Vemal, imitates goSLP's (Mendis & Amarasinghe, 2018) ILP solution to perform SLP vectorization. We cast the pairwise instruction packing problem as a Markov Decision Process (MDP), which we solve by a neural-network policy that imitates the ILP solution via imitation learning using the DAGGER algorithm (Ross et al., 2011).

3.1 MDP Formulation

We formulate the pairwise instruction packing problem as an MDP by iteratively forming one vector pack at a time following a particular instruction traversal policy:

• Bottom-up Traversal.
The learnt policy starts making packing decisions traversing the function in reverse, starting from the final instruction with valid packing candidates. At the end of each iteration i, we choose a predecessor of Ii with valid packing candidates to consider for packing next.
• Top-down Traversal. The learnt policy starts making packing decisions traversing downwards from the first instruction with valid packing candidates. At the end of each iteration i, we choose a successor of Ii with valid packing candidates to consider for packing next.

The instruction traversal policy selects a specific instruction Ii in each iteration to be considered for packing next. The learnt vectorization policy decides which valid packing candidate instruction Ij should Ii be packed with. We define the components of our MDP as follows:
State (Si). A state S in our MDP is a tuple S = (Ii, PB). Here, Ii represents the instruction selected by the fixed traversal order; PB = {P1 . . . Pk} represents the set of formed packs.
Start State (S1). S1 = (I1, {}). The current instruction to consider for packing is I1 and there are no vector packs formed yet. The traversal policy decides I1.
Action (A). On a given state S = (Ii, PB), the set of legal2 actions A[S] is given as follows:

A[(Ii, PB)] = {Ij such that Pi = {Ii, Ij}, CI(Pi) and ∀Pj ∈ PB, CII(Pi, Pj)}

Transition (T). On a given state (Ii, PB) and action A = I, the transition function T simply adds the newly formed pack to PB: T[(Ii, PB), I] = (Ii+1, PB ∪ {Ii, I}). The traversal policy decides on Ii+1.
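The legal-action set A[S] above amounts to filtering the packing candidates of the focus instruction against CI and, pairwise, against every pack formed so far. A schematic sketch, with the constraint predicates left as pluggable stand-ins (the toy opcode table is invented for illustration):

```python
def legal_actions(I_i, packed, candidates, c1, c2):
    """Instructions I_j that I_i may legally be packed with: the pair must
    satisfy the within-pair constraint c1 (CI) and be compatible (c2, CII)
    with every pack already formed."""
    actions = []
    for I_j in candidates:
        pair = (I_i, I_j)
        if c1(pair) and all(c2(pair, p) for p in packed):
            actions.append(I_j)
    return actions

# Tiny example: pretend only same-opcode pairs are legal (a stand-in for
# CI) and a pack may not reuse an already-packed instruction (part of CII).
opcode = {'a': 'add', 'b': 'add', 'c': 'mul'}
c1 = lambda pair: opcode[pair[0]] == opcode[pair[1]]
c2 = lambda pair, pack: not set(pair) & set(pack)

assert legal_actions('a', packed=[], candidates=['b', 'c'], c1=c1, c2=c2) == ['b']
assert legal_actions('a', packed=[('b', 'c')], candidates=['b'], c1=c1, c2=c2) == []
```

The empty instruction I∅ plays the role of a "pack with nothing" action, so the action set is never empty in the full formulation.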
If there are no instructions with valid packing opportunities, Ii+1 = I∅.
Reward (R). We define the reward function as follows:

R((Ii, PB)) = 0 if Ii ≠ I∅, and F(PB) if Ii = I∅.

2 satisfies CI and CII

Figure 3: Graph formulation of MDP states for the code snippet shown in (a) under the forward traversal policy. Nodes: C-Constant, F-Focus, P-Pack, UP-Unpack and all other nodes are instruction nodes. Edges: different types are shown in the figure. Edges with arrows are directed. Figure (b) shows the initial state and Figure (c) shows the state after packing instruction nodes {L[1], L[2]}.

3.2 Graph Formulation of the MDP State

We use a Gated Graph Neural Network (GGNN) (Li et al., 2015) as part of the policy network modeling to make packing decisions for each state. To leverage the GGNN, we formulate the state of our MDP as a graph as follows:
Nodes. We consider 5 types of nodes to encode the graph features of a state Si:

• Instruction Node: corresponds to each instruction with at least one valid packing opportunity or instructions which are already packed.
• Pack Node: common node representing overhead packing instructions.
• Unpack Node: common node representing overhead unpacking instructions.
• Constant Node: common node representing any constant value used by instructions.
• Focus Node: special node that is connected to the instruction that is considered for packing in this iteration (Ii).

Edges. Following are the 4 types of edges connecting the above nodes:

• Dependency Edge: encodes whether an instruction must be executed after another one in sequential order. Moreover, depending on the position of the argument the instruction depends on, a different dependency edge type is created, for a maximum of 5, with further additional arguments collapsed to the same dependency edge type.
If a vector pack is formed and it requires overhead packing instructions, a suitable dependency edge is added between the pack node and the two instruction nodes which form the vector pack. Similarly, if a vector pack needs to be used by a scalar, a suitable dependency edge is added between the relevant instruction node and the unpack node. Note that all dependency edges are directed.
• Possible Pack Edge: encodes whether two instructions can be packed together.
• Packed Edge: encodes instructions that are already packed together.
• Focus Edge: the focus edge connects the focus node to the instruction node that is considered for packing. This marks the node that we are making decisions on.

We illustrate our graph formulation for the code snippet shown in Figure 3(a) assuming a forward traversal policy. Figure 3(b) shows the initial state with the focus edge directed at L[1]. Further, it shows edges for dependencies for each operand position, possible packs ({A1, A2}, {A2, A3}, {A1, A3} and {L[1], L[2]}) as well as the mandatory packing of the non-adjacent load L[4] which is used by A3. Figure 3(c) shows the next state assuming pack {L[1], L[2]} is formed. Notice the packed edge between L[1] and L[2] and the update of the focus edge to A1, which is considered for packing next.

3.3 Neural Network Architecture

We use a GGNN to encode the aforementioned graph representation of the MDP state. GGNNs have shown promise in the field of code modeling (Brockschmidt et al., 2018), being able to capture intricate graph structures present in code sequences. The GGNN maintains a hidden state vector for each node. During forward simulation, it passes a fixed number of rounds of messages between nodes through their connecting edges.
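One such message-passing round, with an edge-type-specific linear transform producing messages that feed a GRU-style state update, can be sketched as follows. This is a minimal numpy sketch with invented sizes and a toy graph, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # hidden size (illustrative)
n_nodes = 4

# One message weight matrix per edge type, as in a GGNN.
edge_types = ['dependency', 'possible_pack']
W_msg = {t: rng.normal(scale=0.1, size=(d, d)) for t in edge_types}
# edges[t] lists (src, dst) pairs of edge type t for an assumed toy graph.
edges = {'dependency': [(0, 1), (1, 2)], 'possible_pack': [(2, 3), (3, 2)]}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A minimal GRU cell acting on the aggregated message m and state h.
Wz, Uz = rng.normal(scale=0.1, size=(d, d)), rng.normal(scale=0.1, size=(d, d))
Wr, Ur = rng.normal(scale=0.1, size=(d, d)), rng.normal(scale=0.1, size=(d, d))
Wh, Uh = rng.normal(scale=0.1, size=(d, d)), rng.normal(scale=0.1, size=(d, d))

def gru(h, m):
    z = sigmoid(m @ Wz + h @ Uz)            # update gate
    r = sigmoid(m @ Wr + h @ Ur)            # reset gate
    h_tilde = np.tanh(m @ Wh + (r * h) @ Uh)
    return (1 - z) * h + z * h_tilde

def message_passing_round(h):
    msgs = np.zeros_like(h)
    for t, pairs in edges.items():
        for src, dst in pairs:
            msgs[dst] += h[src] @ W_msg[t]  # edge-type-specific message
    return gru(h, msgs)

h = rng.normal(size=(n_nodes, d))
for _ in range(20):                          # fixed number of rounds
    h = message_passing_round(h)
assert h.shape == (n_nodes, d)
```

After the rounds complete, the hidden states of the focus node's possible-pack neighbors would be scored (MLP plus softmax in the paper) to pick a packing partner.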
The hidden state of a node is updated on each message passing iteration by first aggregating the messages passed from adjacent nodes and next by running the aggregate through a Gated Recurrent Unit (Cho et al., 2014) cell. Once message passing completes, we pass the hidden states of the neighbors connected through possible pack edges of the node selected for packing through a multi-layer perceptron. Finally, we feed its output through a softmax layer to produce the action probabilities indicating how likely the selected node will be packed with each of its neighbors.

[Figure 3 panels: (a) I1: A1 = C1 / L[1]; I2: A2 = C2 / L[2]; I3: A3 = L[4] / L[2] (b), (c) graph states with focus, dependency (positions 1 and 2), pack and possible-pack edges]

3.4 Imitation Learning

Vemal uses both supervised pre-training and imitation learning using the DAGGER algorithm (Ross et al., 2011) to imitate the packing decisions made by the ILP solver.
Supervised Pre-training. We first precompute optimal packing decisions using the ILP solver for the functions in our training set. We then pick a particular instruction traversal policy and create optimal trajectories of state-action pairs using the packing decisions made by the ILP solver, whilst updating the focus edge according to the traversal policy for each function. Here, the action is the selection of a node for packing among all valid packing candidates. The MDP state serves as the input to the GGNN-based policy. We use a cross-entropy loss to train the policy, encoding the target action as a 1-hot vector. We perform supervised pre-training for a designated number of batches.
Imitation Learning with DAGGER. In the imitation learning phase, we sample a specific roll-out of the GGNN policy using the chosen instruction traversal policy for a randomly sampled batch of functions.
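Schematically, each such pass rolls out the learner but labels every visited state with the expert's (ILP) action before aggregating. A minimal sketch over a toy MDP; the policy, expert and transition callables are stand-ins for the GGNN policy, the ILP solver and the packing MDP:

```python
def dagger_epoch(policy, expert, initial_states, step, is_terminal, dataset):
    """One DAGGER data-collection pass (schematic): roll out the current
    policy, but record the expert's action for every visited state."""
    for state in initial_states:
        while not is_terminal(state):
            dataset.append((state, expert(state)))  # expert label ...
            state = step(state, policy(state))      # ... but follow the learner
    return dataset

# Toy chain MDP: states count down to 0 and any action decrements.
# The 'expert' always answers 1, the learner always answers 0.
data = dagger_epoch(
    policy=lambda s: 0,
    expert=lambda s: 1,
    initial_states=[3],
    step=lambda s, a: s - 1,
    is_terminal=lambda s: s == 0,
    dataset=[],
)
assert data == [(3, 1), (2, 1), (1, 1)]
```

Training on the aggregated pairs (cross-entropy, as in pre-training) then happens once per epoch; the sketch covers only the data-collection step.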
For every MDP state visited by the GGNN policy, we encode the packing decisions already made into the ILP formulation, and use the ILP solver to provide optimal packings for the remaining instructions. We use this to find the optimal action for every state visited by the GGNN policy. The DAGGER algorithm augments the current dataset to include these state-action pairs. We then train on this augmented dataset similar to supervised pre-training. At each epoch, we continue to augment the dataset using the aforementioned strategy. This allows the GGNN policy to learn how to rectify its policy in cases where it falls out of the optimal trajectory.

3.5 Function Partitioning

Graph neural networks suffer from over-smoothing and scaling problems, especially when the graph's diameter is large (Zhou et al., 2018). Real-world functions can be arbitrarily sized, in some cases reaching more than 190,000 instructions in our dataset. To alleviate scaling problems in our formulation, we partition functions based on their instruction counts and learn vectorization for each partition separately, considering them as full functions using the formulation described in Sections 3.1 to 3.4. Note that we solve ILP problems considering each partition individually and use their solutions to train our agent. During inference, we take separate rollouts using our agent for each partition of the function and finally merge all the decisions to form the final vectorization scheme.

4 Dataset

We use the same set of benchmark programs as goSLP (Mendis & Amarasinghe, 2018) to train and evaluate our imitation learning approach. Our dataset is composed of all individual functions collected from the benchmark programs listed in Table 1.
The benchmark programs represent floating-point C/C++ programs from the SPEC2006 (Henning, 2006), SPEC2017 (Bucek et al., 2018) and NAS (Bailey et al., 1991) benchmark suites, which are well-known suites used for evaluating compiler optimizations.

Benchmark Suite | Benchmark Programs
SPEC2006 | 433.milc, 444.namd, 447.dealII, 450.soplex, 453.povray, 470.lbm, 482.sphinx3
SPEC2017 | 508.namd_r, 510.parest_r, 511.povray_r, 519.lbm_r, 538.imagick_r, 544.nab_r
NAS | BT, SP, LU, MG, FT, CG, EP

Table 1: Benchmark programs used for training and testing our learnt agent

4.1 Collection

We first compiled each source file to LLVM's intermediate representation (IR) just before LLVM's existing SLP vectorizer runs. This way, we obtain the same IR that would have been seen by the vectorizer during an end-to-end compilation. Each source file has a number of functions, and goSLP builds ILP problems considering a single function as the vectorization unit. Hence, we collected both the compiled LLVM IR (just before SLP vectorization) and the corresponding pairwise packing opportunities for each function for all programs in our benchmark suite.

4.2 Preparation

Using the methodology outlined in Section 4.1, we collected 35635 functions in total. However, only 3981 (11.17%) functions are vectorized by goSLP. If we use all collected functions during training, it induces a natural bias towards not vectorizing, due to the imbalance in our dataset, even for functions with abundant vectorizable opportunities. The goal of our learnt agent is to mimic goSLP as closely as possible when there are vectorizable opportunities.
In cases where our learnt agent suggests an unprofitable scheme, it can be eliminated by a cost model similar to the current LLVM SLP vectorizer's. This asymmetric learning objective and the imbalance in our collected functions motivated us to create a final dataset that is biased towards functions which are vectorized by goSLP. We select all functions which are vectorized by goSLP as well as a random subset of non-vectorized functions, such that 80% (3981) of our dataset consists of functions with profitable vectorization schemes and 20% (995) do not. Finally, we split the dataset into a training set (80%) and a test set (20%) such that the proportion of vectorized and non-vectorized functions remains the same for both. There are 3169 and 812 vectorized functions in our training and test sets respectively. Our training set does not include any functions from the NAS benchmark suite, which we use for evaluating the end-to-end runtimes of our learnt policy.
We evaluate 2 partitioning sizes, namely partitioning each function at 100 and 200 instruction counts. For each such partition, we create a new function with only the instructions from that partition and solve an ILP problem (goSLP's formulation) to retrieve the set of optimal actions for the partition. We report the final training and test set compositions for each partitioning scheme in Table 2.

Scheme Name | Partition Size | Train set vectorized | Train set non-vectorized | Test set vectorized | Test set non-vectorized
p100 | 100 | 7203 | 1802 | 1776 | 446
p200 | 200 | 5378 | 1346 | 1357 | 341

Table 2: Partitioned dataset statistics

5 Training and Testing

We now explain how we train Vemal's GGNN policy (Section 5.1) and use it in inference (Section 5.2) for making final vectorization decisions.

5.1 Training Setup

We learn the GGNN policy using the training set for each partition size for both backward and forward instruction
traversal policies. Initially, there are 144944 and 163618 optimal state-action pairs for partition sizes 100 and 200 respectively under the forward traversal policy. Under backward traversal the respective numbers are 144026 and 163448.
We pre-train each network using 3000 randomly sampled batches. At the beginning of each epoch, we randomly sample 400 partitioned functions and augment our dataset using rollouts obtained for those functions. We use a mixed student-teacher policy similar to that used by the original DAGGER algorithm (Ross et al., 2011) to take rollouts, with the probability of choosing the teacher agent (ILP solver) exponentially decayed by β = 0.9 at the beginning of each epoch. Finally, we use goSLP's ILP solution to compute the optimal actions for each state we encounter during rollouts.
We use 20 message passing iterations in our GGNN. We train the neural network using stochastic gradient descent with momentum of 0.9, an initial learning rate of 0.002 and an exponentially decaying learning rate schedule (decay of 0.95). We randomly sample 50 state-action pairs for each batch and sample (replay buffer size / batch size) batches for each epoch.

5.2 Evaluation Criteria

In order to evaluate whether our trained policy is better than LLVM's SLP vectorization algorithm, we use three different metrics in our experiments.

• Average cost reduction across all vectorized functions in the test set compared to scalar code.
• Geometric mean speedup across all vectorized functions in the test set compared to scalar code.
• Geometric mean speedup of actual runtimes for the NAS benchmark suite over LLVM SLP.

For the first two metrics we use values reported by LLVM's cost model. For the final metric we use actual wall clock times.
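The two aggregate metrics can be computed as follows. This is a sketch of the standard definitions; the exact normalization of the paper's "average cost reduction" is an assumption here (absolute reduction versus scalar):

```python
import math

def geomean_speedup(baseline_times, new_times):
    """Geometric mean of per-benchmark speedups (baseline / new)."""
    ratios = [b / n for b, n in zip(baseline_times, new_times)]
    return math.prod(ratios) ** (1.0 / len(ratios))

def avg_cost_reduction(scalar_costs, vector_costs):
    """Average reduction in static cost versus scalar code (assumed
    to be the absolute per-function reduction, averaged)."""
    return sum(s - v for s, v in zip(scalar_costs, vector_costs)) / len(scalar_costs)

assert abs(geomean_speedup([2.0, 8.0], [1.0, 1.0]) - 4.0) < 1e-9
assert avg_cost_reduction([10, 20], [8, 14]) == 4.0
```

The geometric mean is the appropriate aggregate for ratios such as speedups, since it is symmetric under swapping baseline and treatment.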
For each partition size and instruction traversal policy, we evaluate each policy both when it uses the action with the highest probability (argmax policy) for each state as well as when it uses the best trace among n rollouts (multi-rollout policy).

6 Experimental Results

We trained all four agents – partition sizes 100 and 200 and instruction traversal policies forward and backward – for 40 epochs. We use LLVM SLP (clang-6.0), goSLP and a random packing agent as our baselines for comparison. Note that we restrict goSLP and our learnt agents to only perform pairwise packing (vectorization factor = 2), whereas for LLVM SLP we do not restrict its vectorization factor and use its implementation without any change. The random packing agent chooses uniformly among the alternative actions for a given MDP state.

6.1 Static Results

Table 3 shows the average cost reduction and geometric mean speedup for the functions in the test set which are vectorized by goSLP (812).
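The multi-rollout policy described above amounts to best-of-n selection under the static cost model: sample n stochastic rollouts of the packing policy and keep the cheapest. A schematic sketch, with the rollout and cost callables as stand-ins:

```python
def best_of_n_rollouts(rollout, cost, n):
    """Multi-rollout policy (schematic): sample n rollouts and keep the
    packing with the lowest static cost. rollout/cost are stand-ins for
    sampling the GGNN policy and LLVM's cost model."""
    return min((rollout() for _ in range(n)), key=cost)

# Deterministic toy stand-ins: three canned "rollouts", and a cost model
# that simply prefers packings with more packs.
samples = iter([['p1'], ['p1', 'p2'], []])
rollout = lambda: next(samples)
cost = lambda packs: -len(packs)
assert best_of_n_rollouts(rollout, cost, 3) == ['p1', 'p2']
```

With n = 1 this degenerates to a single sampled trace; the argmax policy instead takes the single most probable action at every state.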
We include two additional comparison points, goSLP-p100 and goSLP-p200, which are policies that solve multiple ILP problems for partitioned functions using goSLP's formulation and merge the decisions to perform the final vectorization.

Vectorization Policy | Traversal Policy | # of rollouts | Average Cost Reduction (LLVM cost model) | Geo-mean Speedup (LLVM cost model)
goSLP | - | - | 23.6182 | 1.1226
goSLP-p100 | - | - | 19.4815 | 1.1148
goSLP-p200 | - | - | 21.2857 | 1.1193
LLVM SLP | - | - | 12.1872 | 1.0873
random | forward | - | 1.0320 | 1.0150
random | backward | - | 1.0567 | 1.0126
p100 | forward | 1 | 13.3633 | 1.0833
p100 | forward | 10 | 14.9421 | 1.1018
p100 | backward | 1 | 9.2180 | 1.0685
p100 | backward | 10 | 11.3227 | 1.0911
p200 | forward | 1 | 13.5259 | 1.0829
p200 | forward | 10 | 14.7340 | 1.1020
p200 | backward | 1 | 9.5801 | 1.0693
p200 | backward | 10 | 11.5631 | 1.0912

Table 3: Average cost reduction and geometric mean speedups for vectorized functions in our test set based on LLVM's cost model under different vectorization policies

We notice that the agents using forward traversal learn a better vectorization policy than those using backward traversal. Also, all learnt agents except those using backward traversal surpass LLVM SLP's average cost reduction. This effect is magnified with more rollouts, showing the efficacy of our learnt policy compared to LLVM SLP's greedy algorithm. The best performing agent (p100 with 10 rollouts) has an average cost reduction compared to scalar which is 22.6% higher than that of LLVM. This is despite the fact that LLVM is not restricted to pairwise packing.
Also, notice that the partitioned versions of goSLP achieve a lower average cost reduction than goSLP itself. This is because of the sub-optimality introduced by solving subproblems as opposed to solving vectorization for the entire function.
The maximum average cost reductions of the learnt agents p100 and p200 are capped at those of goSLP-p100 and goSLP-p200 respectively. We should therefore expect p200 to learn a better policy than p100. However, the learnt policies show only small differences in overall average cost reduction. This is because the GGNN is not as good at approximating goSLP's packing policy on larger graphs. This gives rise to a trade-off between the sub-optimality of the solution and the learnability of a packing strategy at various partition sizes.

6.2 Runtime Results

We use our learnt GGNN policy for both partition sizes 100 and 200 to perform end-to-end vectorization for the NAS benchmark suite. We use the agents learnt under forward traversal and evaluate both the argmax and multi-rollout policies on end-to-end runtimes. All benchmark programs are run on a Haswell Intel(R) Xeon(R) CPU E5-2680 v3 machine running at 2.50GHz with 32kB L1 and 256kB L2 caches. We use Class-A workloads in our evaluation. We run each benchmark program 3 times and report the median runtime, as is the common reporting method for compiler benchmarks. Figure 4 shows the runtime speedups for each program under goSLP and our learnt policies compared to LLVM SLP. The accompanying table shows the final geometric mean speedup for all policies compared to LLVM SLP.

Policy   Rollouts   Speedup over LLVM SLP
goSLP    -          1.041
p100     1          0.979
p100     10         1.015
p200     1          0.987
p200     10         1.003

Figure 4: Speedup of goSLP, p100 with 1 and 10 rollouts, and p200 with 1 and 10 rollouts compared to LLVM SLP for individual benchmarks in the NAS benchmark suite. The table shows the geometric mean speedups for the entire benchmark suite.

The best performing learnt agent achieves a 1.015× geometric mean speedup over all NAS benchmarks. In fact, both the p100 and p200 learnt agents were able to beat LLVM SLP's performance with only 10 rollouts.
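The reporting scheme above (median of three runs per benchmark, geometric mean of per-benchmark speedups over the LLVM SLP baseline) can be sketched as follows; the function names are our own and the benchmark data in the usage note is invented for illustration.

```python
import statistics

# Hedged sketch of the runtime reporting described above: each benchmark
# is run several times and its median runtime is kept; the suite-level
# number is the geometric mean of per-benchmark speedups over a baseline.

def geomean(xs):
    """Geometric mean of a non-empty list of positive numbers."""
    prod = 1.0
    for x in xs:
        prod *= x
    return prod ** (1.0 / len(xs))

def suite_speedup(baseline_runs, policy_runs):
    """Geometric mean speedup over the baseline.

    Both arguments map benchmark name -> list of measured runtimes;
    speedup per benchmark is median(baseline) / median(policy).
    """
    speedups = [statistics.median(baseline_runs[b]) / statistics.median(policy_runs[b])
                for b in baseline_runs]
    return geomean(speedups)
```

For instance, with baseline runs {'BT': [10, 11, 12]} and policy runs {'BT': [10, 10, 10]}, the BT speedup is 11/10. The geometric mean is the standard choice here because it composes ratios symmetrically: a 2× speedup on one benchmark and a 2× slowdown on another cancel out.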
These results show the efficacy of our learnt agents on end-to-end runtime performance. More notably, all agents except p100 with 1 rollout beat or match the substantial runtime speedup of goSLP over LLVM SLP on the BT benchmark. This signifies that the agents have learnt a non-trivial vectorization policy not covered by LLVM SLP. Also note that, for the SP and MG benchmarks, all agents consistently beat goSLP in terms of performance. Even though goSLP performs optimal packing according to LLVM's cost model, for these benchmarks there exist other vectorization strategies, as uncovered by our learnt agents, which are better in terms of runtime but suboptimal according to LLVM's cost model. This exposes inaccuracies in the LLVM cost model and supports our hypothesis that, in the future, a reinforcement-learning-based policy with a better cost model has the potential to learn an even better end-to-end vectorization policy than goSLP.

7 Related Work

Since the inception of vector machines, loop vectorization (Allen & Kennedy, 1987; Nuzman & Zaks, 2008) has been used to speed up scientific computing kernels. Larsen & Amarasinghe (2000) introduced superword level parallelism based vectorization targeting shorter SIMD-width machines. Subsequently, many techniques have emerged suggesting better heuristics to perform SLP vectorization (Porpodas & Jones, 2015; Porpodas et al., 2015; Liu et al., 2012; Shin et al., 2003, 2005, 2002). Recently, Mendis & Amarasinghe (2018) introduced a pairwise optimal packing algorithm using an ILP solver for statement packing which outperforms previous greedy approaches.

There has been work on finding better compiler heuristics using machine learning (Stephenson et al., 2003; Cummins et al., 2017). More specifically, there have been previous attempts at identifying better heuristics or program orders for vectorization using machine learning (Stock et al., 2012).
However, this work does not provide an end-to-end learnt solution for vectorization. Reinforcement learning has been used to perform compiler instruction scheduling (McGovern et al., 2002) prior to the era of deep neural networks. In this work, we have shown how to imitate an ILP-based solution using a graph neural network based policy, yielding the first end-to-end learnt auto-vectorizer.

In our formulation, we solve an NP-hard packing problem. Previously, reinforcement learning has been used to solve combinatorial optimization problems (Dai et al., 2017; Bello et al., 2016; Li et al., 2018). Comparatively, we have stronger supervision through an oracle with optimal actions.

8 Conclusion

Compiler auto-vectorization allows compilers to harness fine-grained parallelism within programs. Many greedy heuristic based solutions have been proposed, and recently Mendis & Amarasinghe (2018) introduced a tractable solution with optimality guarantees using an ILP solver. Our work shows the feasibility of learning an end-to-end vectorization policy by imitating this optimal solution, and we show that it outperforms the well-tuned compiler heuristics used by the LLVM compiler. This holds out the promise that learnt compiler optimizations can be a better alternative to their hand-written counterparts in the near future.

Acknowledgements

We would like to thank Darsh Shah, who was initially involved with this project, and all reviewers for insightful comments and suggestions. This research was supported by DARPA D3M Award #FA8750-17-2-0126, DARPA HACCS Award #HR0011-18-C-0059 and DARPA SDH Award #HR0011-18-3-0007. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

References

Allamanis, M., Brockschmidt, M., and Khademi, M.
Learning to represent programs with graphs. arXiv preprint arXiv:1711.00740, 2017.

Allen, R. and Kennedy, K. Automatic translation of Fortran programs to vector form. ACM Trans. Program. Lang. Syst., 9(4):491–542, October 1987. ISSN 0164-0925. doi: 10.1145/29873.29875. URL http://doi.acm.org/10.1145/29873.29875.

Bailey, D., Barszcz, E., Barton, J., Browning, D., Carter, R., Dagum, L., Fatoohi, R., Frederickson, P., Lasinski, T., Schreiber, R., Simon, H., Venkatakrishnan, V., and Weeratunga, S. The NAS parallel benchmarks. Int. J. High Perform. Comput. Appl., 5(3):63–73, September 1991. ISSN 1094-3420. doi: 10.1177/109434209100500306. URL http://dx.doi.org/10.1177/109434209100500306.

Bello, I., Pham, H., Le, Q. V., Norouzi, M., and Bengio, S. Neural combinatorial optimization with reinforcement learning. ArXiv, abs/1611.09940, 2016.

Brockschmidt, M., Allamanis, M., Gaunt, A. L., and Polozov, O. Generative code modeling with graphs. arXiv preprint arXiv:1805.08490, 2018.

Bucek, J., Lange, K.-D., and v. Kistowski, J. Spec cpu2017: Next-generation compute benchmark. In Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, ICPE '18, pp. 41–42, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5629-9. doi: 10.1145/3185768.3185771. URL http://doi.acm.org/10.1145/3185768.3185771.

Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

Cummins, C., Petoumenos, P., Wang, Z., and Leather, H. End-to-end deep learning of optimization heuristics. In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 219–232. IEEE, 2017.

Dai, H., Khalil, E. B., Zhang, Y., Dilkina, B., and Song, L. Learning combinatorial optimization algorithms over graphs.
In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS '17, pp. 6351–6361, USA, 2017. Curran Associates Inc. ISBN 978-1-5108-6096-4. URL http://dl.acm.org/citation.cfm?id=3295222.3295382.

Henning, J. L. Spec cpu2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34(4):1–17, September 2006. ISSN 0163-5964. doi: 10.1145/1186736.1186737. URL http://doi.acm.org/10.1145/1186736.1186737.

Larsen, S. and Amarasinghe, S. Exploiting superword level parallelism with multimedia instruction sets. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, PLDI '00, pp. 145–156, New York, NY, USA, 2000. ACM. ISBN 1-58113-199-2. doi: 10.1145/349299.349320. URL http://doi.acm.org/10.1145/349299.349320.

Lattner, C. and Adve, V. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO '04), Palo Alto, California, Mar 2004.

Li, Y., Tarlow, D., Brockschmidt, M., and Zemel, R. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.

Li, Z., Chen, Q., and Koltun, V. Combinatorial optimization with graph convolutional networks and guided tree search. In NeurIPS, 2018.

Liu, J., Zhang, Y., Jang, O., Ding, W., and Kandemir, M. A compiler framework for extracting superword level parallelism. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '12, pp. 347–358, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1205-9. doi: 10.1145/2254064.2254106. URL http://doi.acm.org/10.1145/2254064.2254106.

McGovern, A., Moss, E., and Barto, A. G. Building a basic block instruction scheduler with reinforcement learning and rollouts. Machine learning, 49(2-3):141–160, 2002.

Mendis, C. and Amarasinghe, S.
goslp: Globally optimized superword level parallelism framework.\nProc. ACM Program. Lang., 2(OOPSLA):110:1\u2013110:28, October 2018. ISSN 2475-1421. doi:\n10.1145/3276480. URL http://doi.acm.org/10.1145/3276480.\n\nNuzman, D. and Zaks, A. Outer-loop vectorization: Revisited for short simd architectures. In\nProceedings of the 17th International Conference on Parallel Architectures and Compilation\nTechniques, PACT \u201908, pp. 2\u201311, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-282-5.\ndoi: 10.1145/1454115.1454119. URL http://doi.acm.org/10.1145/1454115.1454119.\n\nPorpodas, V. and Jones, T. M. Throttling automatic vectorization: When less is more. In Proceedings\nof the 2015 International Conference on Parallel Architecture and Compilation (PACT), PACT \u201915,\npp. 432\u2013444, Washington, DC, USA, 2015. IEEE Computer Society. ISBN 978-1-4673-9524-3.\ndoi: 10.1109/PACT.2015.32. URL https://doi.org/10.1109/PACT.2015.32.\n\nPorpodas, V., Magni, A., and Jones, T. M. Pslp: Padded slp automatic vectorization. In Proceedings of\nthe 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO\n\u201915, pp. 190\u2013201, Washington, DC, USA, 2015. IEEE Computer Society. ISBN 978-1-4799-8161-8.\nURL http://dl.acm.org/citation.cfm?id=2738600.2738625.\n\nRoss, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to\nno-regret online learning. In Proceedings of the fourteenth international conference on arti\ufb01cial\nintelligence and statistics, pp. 627\u2013635, 2011.\n\nShin, J., Chame, J., and Hall, M. W. Compiler-controlled caching in superword register \ufb01les for\nmultimedia extension architectures. In Proceedings of the 2002 International Conference on\nParallel Architectures and Compilation Techniques, PACT \u201902, pp. 45\u201355, Washington, DC, USA,\n2002. IEEE Computer Society. ISBN 0-7695-1620-3. 
URL http://dl.acm.org/citation.cfm?id=645989.674318.

Shin, J., Chame, J., and Hall, M. Exploiting superword-level locality in multimedia extension architectures, volume 5, April 2003.

Shin, J., Hall, M., and Chame, J. Superword-level parallelism in the presence of control flow. In Proceedings of the International Symposium on Code Generation and Optimization, CGO '05, pp. 165–175, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2298-X. doi: 10.1109/CGO.2005.33. URL http://dx.doi.org/10.1109/CGO.2005.33.

Sreraman, N. and Govindarajan, R. A vectorizing compiler for multimedia extensions. Int. J. Parallel Program., 28(4):363–400, August 2000. ISSN 0885-7458. doi: 10.1023/A:1007559022013. URL http://dx.doi.org/10.1023/A:1007559022013.

Stephenson, M., Amarasinghe, S., Martin, M., and O'Reilly, U.-M. Meta optimization: improving compiler heuristics with machine learning. In ACM SIGPLAN Notices, volume 38, pp. 77–90. ACM, 2003.

Stock, K., Pouchet, L.-N., and Sadayappan, P. Using machine learning to improve automatic vectorization. ACM Transactions on Architecture and Code Optimization (TACO), 8(4):50, 2012.

Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., and Sun, M. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.