{"title": "Static Analysis of Binary Executables Using Structural SVMs", "book": "Advances in Neural Information Processing Systems", "page_first": 1063, "page_last": 1071, "abstract": "We cast the problem of identifying basic blocks of code in a binary executable as learning a mapping from a byte sequence to a segmentation of the sequence. In general, inference in segmentation models, such as semi-CRFs, can be cubic in the length of the sequence. By taking advantage of the structure of our problem, we derive a linear-time inference algorithm which makes our approach practical, given that even small programs are tens or hundreds of thousands of bytes long. Furthermore, we introduce two loss functions which are appropriate for our problem and show how to use structural SVMs to optimize the learned mapping for these losses. Finally, we present experimental results that demonstrate the advantages of our method against a strong baseline.", "full_text": "Static Analysis of Binary Executables Using\n\nStructural SVMs\n\nNikos Karampatziakis\u2217\n\nDepartment of Computer Science\n\nCornell University\nIthaca, NY 14853\n\nnk@cs.cornell.edu\n\nAbstract\n\nWe cast the problem of identifying basic blocks of code in a binary executable as\nlearning a mapping from a byte sequence to a segmentation of the sequence. In\ngeneral, inference in segmentation models, such as semi-CRFs, can be cubic in\nthe length of the sequence. By taking advantage of the structure of our problem,\nwe derive a linear-time inference algorithm which makes our approach practical,\ngiven that even small programs are tens or hundreds of thousands of bytes long. Fur-\nthermore, we introduce two loss functions which are appropriate for our problem\nand show how to use structural SVMs to optimize the learned mapping for these\nlosses. 
Finally, we present experimental results that demonstrate the advantages\nof our method against a strong baseline.\n\n1\n\nIntroduction\n\nIn this work, we are interested in the problem of extracting the CPU instructions that comprise a\nbinary executable \ufb01le. Solving this problem is an important step towards verifying many simple\nproperties of a given program. In particular, we are motivated by a computer security application, in\nwhich we want to detect whether a previously unseen executable contains malicious code. This is a\ntask that computer security experts have to solve many times every day because in the last few years\nthe volume of malicious software has witnessed an exponential increase (estimated at 50000 new\nmalicious code samples every day). However, the tools that analyze binary executables require a lot\nof manual effort in order to produce a correct analysis. This happens because the tools themselves\nare based on heuristics and make many assumptions about the way a binary executable is structured.\nBut why is it hard to \ufb01nd the instructions inside a binary executable? After all, when running\na program the CPU always knows which instructions it is executing. The caveat here is that we\nwant to extract the instructions from the executable without running it. On one hand, running the\nexecutable will in general reveal little information about all possible instructions in the program, and\non the other hand it may be dangerous or even misleading.1\nAnother issue that makes this task challenging is that binary executables contain many things\nother than the instructions they will execute.2 Furthermore, the executable does not contain any demar-\ncations about the locations of instructions in the \ufb01le.3 Nevertheless, an executable \ufb01le is organized\ninto sections such as a code section, a section with constants, a section containing global variables\netc. 
But even inside the code section, there is a lot more than just a stream of instructions. We will\n\n\u2217http://www.cs.cornell.edu/\u223cnk\n1Many malicious programs try to detect whether they are running under a controlled environment.\n2Here, we are focusing on Windows executables for the Intel x86 architecture, though everything carries\n\nover to any other modern operating system and any other architecture with a complex instruction set.\n\n3Executables that contain debugging information are an exception, but most software is released without it\n\n1\n\n\frefer to all instructions as code and to everything else as data. For example, the compiler may, for\nperformance reasons, prefer to pad a function with up to 3 data bytes so that the next function starts\nat an address that is a multiple of 4. Moreover, data can appear inside functions too. For example,\na \u201cswitch\u201d statement in C is usually implemented in assembly using a table of addresses, one for\neach \u201ccase\u201d statement. This table does not contain any instructions, yet it can be stored together\nwith the instructions that make up the function in which the \u201cswitch\u201d statement appears. Apart from\nthe compiler, the author of a malicious program can also insert data bytes in the code section of her\nprogram. The ultimate goal of this act is to confuse the heuristic tools via creative uses of data bytes.\n\n1.1 A text analogy\n\nTo convey more intuition about the dif\ufb01culties in our task we will use a text analogy. The following\nis an excerpt from a message sent to General Burgoyne during the American revolutionary war [1]:\n\nYou will have heard, Dr Sir I doubt not long before this can have reached you\nthat Sir W. Howe is gone from hence. The Rebels imagine that he is gone to\nthe Eastward. By this time however he has \ufb01lled Chesapeak bay with surprize\nand terror. Washington marched the greater part of the Rebels to Philadelphia in\norder to oppose Sir Wm\u2019s. 
army.\n\nThe sender also sent a mask via a different route that, when placed on top of the message, revealed\nonly the words that are shown here in bold. Our task can be thought of as learning what needs to be\nmasked so that the hidden message is revealed. In this sense, words play the role of instructions\nand letters play the role of bytes. For complex instruction sets like the Intel x86, instructions are\ncomposed of a variable number of bytes, as words are composed of a variable number of letters.\nThere are also some minor differences. For example, programs have control logic (i.e. execution\ncan jump from one point to another), while text is read sequentially. Moreover, programs do not\nhave spaces while most written languages do (exceptions are Chinese, Japanese, and Thai).\nThis analogy motivates tackling our problem as predicting a segmentation of the input sequence into\nblocks of code and blocks of data. An obvious \ufb01rst approach for this task would be to treat it as a\nsequence labeling problem and train, for example, a linear chain conditional random \ufb01eld (CRF) [2]\nto tag each byte in the sequence as being the beginning, inside, or outside of a data block. However,\nthis approach ignores much of the problem\u2019s structure, most importantly that transitions from code to\ndata can only occur at speci\ufb01c points. Instead, we will use a more \ufb02exible model which, in addition\nto sequence labeling features, can express features of whole code blocks. Inference in our model is as\nfast as for sequence labeling and we show a connection to weighted interval scheduling. This strikes\na balance between ef\ufb01cient but simple sequence labeling models such as linear chain CRFs, and\nexpressive but slow4 segmentation models such as semi-CRFs [3] and semi-Markov SVMs [4]. 
To\nlearn the parameters of the model, we will use structural SVMs to optimize performance according\nto loss functions that are appropriate for our task, such as the sum of incorrect plus missed CPU\ninstructions induced by the segmentation.\nBefore explaining our model in detail, we present some background on the workings of widely\nused tools for binary code analysis in section 2, which allows us to easily explain our approach\nin section 3. We empirically demonstrate the effectiveness of our model in section 4 and discuss\nrelated work and other applications in section 5. Finally, section 6 discusses future work and states\nour conclusions.\n\n2 Heuristic tools for analyzing binary executables\n\nTools for statically analyzing binary executables differ in the details of their workings but they all\nshare the same high level logic, which is called recursive disassembly.5 The tool starts by obtaining\nthe address of the \ufb01rst instruction from a speci\ufb01c location inside the executable. It then places this\naddress on a stack and executes the following steps while the stack is non-empty. It takes the next\n\n4More speci\ufb01cally, inference needs O(nL\u00b2) time where L is an a priori bound on the lengths of the segments\n(L = 2800 in our data) and n is the length of the sequence. With additional assumptions on the features, [5]\ngives an O(nM) algorithm where M is the maximum span of any edge in the CRF.\n\n5Two example tools are IdaPro (http://www.hex-rays.com/idapro) and OllyDbg (http://www.ollydbg.de)\n\n2\n\n\faddress from the stack and disassembles (i.e. decompiles to assembly) the sequence starting from\nthat address. All the disassembled instructions would execute one after the other until we reach\nan instruction that changes the \ufb02ow of execution. These control \ufb02ow instructions are conditional\nand unconditional jumps, calls, and returns. 
After the execution of an unconditional jump the next\ninstruction to be executed is at the address speci\ufb01ed by the jump\u2019s argument. Other control \ufb02ow\ninstructions are similar to the unconditional jump. A conditional jump also speci\ufb01es a condition and\ndoes nothing if the condition is false. A call saves the address of the next instruction and then jumps.\nA return jumps to the address saved by a call (and does not need an address as an argument). The\ntool places the arguments of control \ufb02ow instructions it encounters on the stack. If the control \ufb02ow\ninstruction is a conditional jump or a call, it continues disassembling; otherwise it takes the next\naddress, that has not yet been disassembled, from the stack and repeats.\nEven though recursive disassembly seems like a robust way of extracting the instructions from a\nprogram, there are many reasons that can make it fail [6]. Most importantly, the arguments of the\ncontrol \ufb02ow instructions do not have to be constants; they can be registers whose values are generally\nnot available during static analysis. Hence, recursive disassembly can run out of addresses well\nbefore all the instructions have been extracted. After this point, the tool has to resort to heuristics to\npopulate its stack. For example, a heuristic might check for positions in the sequence that match a\nhand-crafted regular expression. Furthermore, some heuristics have to be applied on multiple passes\nover the sequence. According to its documentation, OllyDbg does 12 passes over the sequence.\nRecursive disassembly can also fail because of its assumptions. Recall that after encountering a call\ninstruction, it continues disassembling the next instruction, assuming that the call will eventually\nreturn to execute it. Similarly for a conditional jump it assumes that both branches can potentially\nexecute. 
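The worklist logic described above can be sketched as follows. This is a minimal illustration, not a real disassembler: a toy instruction encoding stands in for x86 decoding, and `recursive_disassemble`, its argument names, and the `toy` program are all hypothetical.

```python
# Sketch of recursive disassembly over a toy instruction set (not real x86).
# Each "instruction" is (length, kind, target); kind is one of
# "fall" (ordinary), "jmp", "jcc" (conditional jump), "call", "ret".

def recursive_disassemble(program, entry):
    """Return the set of addresses reached by recursive descent.

    `program` maps address -> (length, kind, target). Unknown
    addresses (data bytes) simply stop the current decoding run.
    """
    seen = set()
    stack = [entry]                        # worklist of addresses to decode from
    while stack:
        addr = stack.pop()
        while addr in program and addr not in seen:
            seen.add(addr)
            length, kind, target = program[addr]
            if kind == "fall":             # ordinary instruction: fall through
                addr += length
            elif kind == "jmp":            # unconditional: continue only at target
                stack.append(target)
                break
            elif kind in ("jcc", "call"):  # assume both successors may execute
                stack.append(target)
                addr += length             # keep decoding past the instruction
            else:                          # "ret": no static successor
                break
    return seen

# A toy layout: code at 0, 2, 4 with a conditional jump to 8; byte 6 is data.
toy = {0: (2, "fall", None), 2: (2, "jcc", 8), 4: (2, "ret", None),
       8: (2, "jmp", 4)}
print(sorted(recursive_disassemble(toy, 0)))   # -> [0, 2, 4, 8]
```

Note how the data byte at address 6 is never visited: it is reachable neither by fall-through nor as a jump target, which is exactly why indirect (register) targets starve the worklist in practice.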
Though these assumptions are reasonable for most programs, malicious programs can\nexploit them to confuse the static analysis tools. For example, the author of a malicious program\ncan write a function that, say, adds 3 to the return address that was saved by the call instruction.\nThis means that if the call instruction was spanning positions a, . . . , a + \u2113 \u2212 1 of the sequence, upon\nthe function\u2019s return the next instruction will be at position a + \u2113 + 3, not at a + \u2113. This will give\na completely different decoding of the sequence and is called disassembly desynchronization. To\nreturn to a text analogy, recursive disassembly parses the sequence \u201cdriverballetterrace\u201d as [driver,\nballet, terrace] while the actual parsing, obtained by starting three positions down, is [xxx, verbal,\nletter, race], where x denotes junk data.\n\n3 A structured prediction model\n\nIn this section we will combine ideas from recursive disassembly and structured prediction to derive\nan expressive and ef\ufb01cient model for predicting the instructions inside a binary executable. As\nin recursive disassembly, if we are certain that code begins at position i we can unambiguously\ndisassemble the byte sequence starting from position i until we reach a control \ufb02ow instruction. But\nunlike recursive disassembly, we maintain a trellis graph, a directed graph that succinctly represents\nall possibilities. The trellis graph has vertices bi that denote the possibility that a code block starts at\nposition i. It also has vertices ej and edges (bi, ej) which denote that disassembling from position\ni yields a possible code block that spans positions i, . . . , j. Furthermore, vertices di denote the\npossibility that the i-th position is part of a data block. Edges (ej, bj+1) and (ej, dj+1) encode that\nthe next byte after a code block can either be the beginning of another code block, or a data byte\nrespectively. 
For data blocks no particular structure is assumed and we just use edges (dj, dj+1) and\n(dj, bj+1) to denote that a data byte can be followed either by another data byte or by the beginning\nof a code block respectively. Finally, we include vertices s and t and edges (s, b1), (s, d1), (dn, t)\nand (en, t) to encode that sequences can start and end either with code or data.\nAn example is shown in Figure 1. The graph encodes all possible valid6 segmentations of the\nsequence. In fact, there is a simple bijection P from any valid segmentation y to an s \u2212 t path P (y)\nin this graph. For example, the sequence in Figure 1 contains three code blocks that span positions\n1\u20137, 8\u20138, and 10\u201312. This segmentation can be encoded by the path s, b1, e7, b8, e8, d9, b10, e12, t.\n\n6Some subsequences will produce errors while decoding to assembly because some bytes may not corre-\nspond to any instructions. These could never be valid code blocks because they would crash the program. Also\nthe program cannot do something interesting and crash in the same code block; interesting things can only\nhappen with system calls which, being call instructions, have to be at the end of their code block\n\n3\n\n\fFigure 1: The top line shows an example byte sequence in hexadecimal. Below this, we show the\nactual x86 instructions with position 9 being a data byte. We show both the mnemonic instructions\nas well as the bytes they are composed of. Some alternative decodings of the sequence are shown on\nthe bottom. The decoding that starts from the second position is able to skip over two control \ufb02ow\ninstructions. 
In the middle we show the graph that captures all possible decodings of the sequence.\nDisassembling from positions 3, 5, and 12 leads to decoding errors.\n\nAs usual for predicting structured outputs [2] [7], we de\ufb01ne the score of a segmentation y for a\nsequence x to be the inner product w\u22a4\u03a8(x, y) where w \u2208 R^d are the parameters of our model and\n\u03a8(x, y) \u2208 R^d is a vector of features that captures the compatibility of the segmentation y and the\nsequence x. Given a sequence x and a vector of parameters w, the inference task is to \ufb01nd the\nhighest scoring segmentation\n\n\u02c6y = argmax_{y\u2208Y} w\u22a4\u03a8(x, y)    (1)\n\nwhere Y is the space of all valid segmentations of x. We will assume that \u03a8(x, y) decomposes as\n\n\u03a8(x, y) := \u2211_{(u,v)\u2208P(y)} \u03a6(u, v, x)\n\nwhere \u03a6(u, v, x) is a vector of features that can be computed using only the endpoints of edge (u, v)\nand the byte sequence. This assumption allows ef\ufb01cient inference because (1) can be rewritten as\n\n\u02c6y = argmax_{y\u2208Y} \u2211_{(u,v)\u2208P(y)} w\u22a4\u03a6(u, v, x)\n\nwhich we recognize as computing the heaviest path in the trellis graph with edge weights given by\nw\u22a4\u03a6(u, v, x). This problem can be solved ef\ufb01ciently with dynamic programming by visiting each\nvertex in topological order and updating the longest path to each of its neighbors.\nThe inference task can be viewed as a version of the weighted interval scheduling problem. Disas-\nsembling from position i in the sequence yields an interval [i, j] where j is the position where the\n\ufb01rst encountered control \ufb02ow instruction ends. In weighted interval scheduling we want to select a\nsubset of non-overlapping intervals with maximum total weight. 
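The topological-order dynamic program just described can be sketched as follows. The vertex names, the tiny trellis, and the numeric weights are illustrative stand-ins for the learned scores w⊤Φ(u, v, x); this is not the paper's implementation.

```python
# Sketch: heaviest s-t path by one pass over vertices in topological order.
# `edges` maps each vertex to a list of (successor, weight) pairs, where the
# weight stands in for the learned score of that edge.

def heaviest_path(order, edges, s, t):
    best = {v: float("-inf") for v in order}
    back = {}
    best[s] = 0.0
    for u in order:                  # topological order: best[u] is final here
        if best[u] == float("-inf"):
            continue
        for v, w in edges.get(u, []):
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                back[v] = u
    path, v = [t], t                 # recover the argmax path (segmentation)
    while v != s:
        v = back[v]
        path.append(v)
    return best[t], path[::-1]

# Tiny trellis: s -> b1 -> e2 -> t ("code block") versus s -> d1 -> d2 -> t ("data").
edges = {"s": [("b1", 1.0), ("d1", 0.2)],
         "b1": [("e2", 2.0)], "e2": [("t", 0.0)],
         "d1": [("d2", 0.1)], "d2": [("t", 0.0)]}
score, path = heaviest_path(["s", "b1", "d1", "e2", "d2", "t"], edges, "s", "t")
print(score, path)   # -> 3.0 ['s', 'b1', 'e2', 't']
```

Each vertex and edge is touched once, which is the source of the linear-time guarantee: the trellis has O(n) vertices because each position contributes at most one candidate block.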
Our inference problem is the same\nexcept we also have a cost for switching to the next interval, say the one that starts at position j + 2,\nwhich is captured by the cost of the path ej, dj+1, bj+2. Finally, the dynamic programming algo-\nrithm for solving this version is a simple modi\ufb01cation of the classic weighted interval scheduling\nalgorithm. Section 5 discusses other setups where this inference problem arises.\n\n3.1 Loss functions\n\nNow we introduce loss functions that measure how close an inferred segmentation \u02c6y is to the real\none y. First, we argue that Hamming loss, how well the bytes of the blocks in \u02c6y overlap with\nthe bytes of the blocks in y, is not appropriate for this task because, as we recall from the text\n\n4\n\n\fanalogy at the end of section 2, two blocks may be overlapping very well but they may lead to\ncompletely different decodings of the sequence. Hence, we introduce two loss functions which are\nmore appropriate for our task.\nThe \ufb01rst loss function, which we call block loss, comes from the observation that the beginnings of\nthe code blocks are necessary and suf\ufb01cient to describe the segmentation. Therefore, we let y and\n\u02c6y be the sets of positions where the code blocks start in the two segmentations and the block loss\ncounts how well these two sets overlap using the cardinality of their symmetric difference\n\n\u2206B(y, \u02c6y) = |y| + |\u02c6y| \u2212 2|y \u2229 \u02c6y|\n\nThe second loss function, which we call instruction loss, is a little less stringent. In the case where\nthe inferred \u02c6y identi\ufb01es, say, the second instruction in a block as its start, we would like to penalize\nthis less, since the disassembly is still synchronized, and only missed one instruction. 
Formally,\nwe let y and \u02c6y be the sets of positions where the instructions start in the two segmentations and we\nde\ufb01ne the instruction loss \u2206I(y, \u02c6y) to be the cardinality of their symmetric difference.\nAs an example, consider the segmentation which corresponds to path s, d1, b2, e12, t in Figure 1.\nTherefore \u02c6y = {2} and from the \ufb01gure we see that the segmentation in the top line has to pass\nthrough b1, b8, b10 i.e. y = {1, 8, 10}. Hence its block loss is 4 because it misses b1, b8, b10 and it in-\ntroduces b2. For the instruction loss, the positions of the real instructions are y = {1, 4, 6, 8, 10, 11}\nwhile the proposed segmentation has \u02c6y = {2, 9, 10, 11}. Taking the symmetric difference of these\nsets, we see that the instruction loss has value 6.\nFinally, a variation of these loss functions occurs when we aggregate the losses over a set of se-\nquences. If we simply sum the losses for each sequence then the losses in longer executables may\novershadow the losses on shorter ones. To represent each executable equally in the \ufb01nal measure we\ncan normalize our loss functions, for example we can de\ufb01ne the normalized instruction loss to be\n\n\u2206N I(y, \u02c6y) = (|y| + |\u02c6y| \u2212 2|y \u2229 \u02c6y|) / |y|\n\nand we similarly de\ufb01ne a normalized block loss \u2206N B. If |\u02c6y| = |y|, \u2206N I and \u2206N B are scaled\nversions of a popular loss function 1 \u2212 F1, where F1 is the harmonic mean of precision and recall.\n\n3.2 Training\n\nGiven a set of training pairs (xi, yi), i = 1, . . . , n, of sequences and segmentations we can learn a\nvector of parameters w that assigns a high score to segmentation yi and a low score to all other\npossible segmentations of xi. For this we will use the structural SVM formulation with margin\nrescaling [7] that solves the following problem:\n\nmin_{w,\u03be} (1/2)||w||\u00b2 + (C/n) \u2211_{i=1}^{n} \u03bei\n\ns.t. 
\u2200i \u2200\u00afy \u2208 Yi : w\u22a4\u03a8(xi, yi) \u2212 w\u22a4\u03a8(xi, \u00afy) \u2265 \u2206(yi, \u00afy) \u2212 \u03bei\n\nThe constraints of this optimization problem enforce that the difference in score between the correct\nsegmentation yi and any incorrect segmentation \u00afy is at least as large as the loss \u2206(yi, \u00afy). If \u02c6yi is the\ninferred segmentation then the slack variable \u03bei upper bounds \u2206(yi, \u02c6yi). Hence, the objective is a\ntradeoff between a small upper bound of the average training loss and a low-complexity hypothesis\nw. The tradeoff is controlled by C which is set using cross-validation. Since the sets of valid\nsegmentations Yi are exponentially large, we solve the optimization problem with a cutting plane\nalgorithm [7]. We start with an empty set of constraints and in each iteration we \ufb01nd the most\nviolated constraint for each example. We add these constraints to our optimization problem and\nre-optimize. We do this until there are no constraints which are violated by more than a prespeci\ufb01ed\ntolerance \u03b5. This procedure will terminate after O(1/\u03b5) iterations [8]. For a training pair (xi, yi) the\nmost violated constraint is:\n\n\u02c6y = argmax_{\u00afy\u2208Yi} w\u22a4\u03a8(xi, \u00afy) + \u2206(yi, \u00afy)    (2)\n\nApart from the addition of \u2206(yi, \u00afy), this is the same as the inference problem. For the losses we\nintroduced, we can solve the above problem with the same inference algorithm in a slightly modi\ufb01ed\n\n5\n\n\f        | Bytes | Blocks | Block length (bytes) | Block length (instructions)\nMaximum | 49152 | 3502   | 2794                 | 1009\nAverage | 16712 | 887    | 13                   | 4\n\nTable 1: Some statistics about the executable sections of the programs in the dataset\n\ntrellis graph. 
More precisely, for every vertex v we can de\ufb01ne a cost c(v) for visiting it (this can be\nabsorbed into the costs of v\u2019s incoming edges) and \ufb01nd the longest path in this modi\ufb01ed graph. This\nis possible because our losses decompose over the vertices of the graph. This is not true for losses\nsuch as 1 \u2212 F1 for which (2) seems to require time quadratic in the length of the sequence.\nFor the block loss, the costs are de\ufb01ned as follows. If bi \u2208 y then c(di) = 1. This encodes that\nusing di instead of bi misses the beginning of one block. If bi /\u2208 y then bi de\ufb01nes an incorrect code\nblock which spans bytes i, . . . , j and c(bi) = 1 + |{k|bk \u2208 y \u2227 i < k \u2264 j}|, capturing that we will\nintroduce one incorrect block and we will skip all the blocks that begin between positions i and j.\nAll other vertices in the graph have zero cost. In Figure 1 vertices d1, d8 and d10 have a cost of 1,\nwhile b2, b4, b6, b7, b9, and b11 have costs 3, 1, 1, 3, 2, and 1 respectively.\nFor the instruction loss, y is a set of instruction positions. Similarly to the block loss, if i \u2208 y\nthen c(di) = 1. If i /\u2208 y then bi is the beginning of an incorrect block that spans bytes i, . . . , j\nand produces instructions in a set of positions \u02dcyi. Let s be the \ufb01rst position in this block that gets\nsynchronized with the correct decoding i.e. s = min(\u02dcyi \u2229 y) with s = j if the intersection is empty.\nThen c(bi) = |{k|k \u2208 \u02dcyi \u2227 i \u2264 k < s}| + |{k|k \u2208 y \u2227 i < k < s}|. The \ufb01rst term captures\nthe number of incorrect instructions produced by treating bi as the start of a code block, while the\nsecond term captures the number of missed real instructions. All other vertices in the graph have\nzero cost. In Figure 1 vertices d1, d4, d6, d8, d10 and d11 have a cost of 1, while b2, b7, and b9 have\ncosts 5, 3, and 1 respectively. 
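The block-loss vertex costs can be sketched as follows. The function name and the `span_of` map (candidate block start → end position) are hypothetical, and the example spans are chosen only to mirror the Figure 1 discussion, not taken from real disassembly.

```python
# Sketch: vertex costs for loss-augmented inference with the block loss.
# `block_starts` is the set y of true code-block beginnings; `span_of[i]`
# gives the end j of the candidate block decoded from position i.

def block_loss_costs(block_starts, span_of, n):
    cost_d = {}          # c(d_i): cost of treating position i as data
    cost_b = {}          # c(b_i): cost of starting an incorrect block at i
    for i in range(1, n + 1):
        if i in block_starts:
            cost_d[i] = 1                  # missing a true block start
        elif i in span_of:
            j = span_of[i]                 # incorrect block spanning i..j
            skipped = sum(1 for k in block_starts if i < k <= j)
            cost_b[i] = 1 + skipped        # one bad block + all skipped starts
    return cost_d, cost_b

# Figure-1-like example: true blocks start at 1, 8, 10; the block decoded
# from position 2 would span 2..12 (spans here are hypothetical).
cost_d, cost_b = block_loss_costs({1, 8, 10}, {2: 12, 4: 5, 9: 9}, 12)
print(cost_d)       # -> {1: 1, 8: 1, 10: 1}
print(cost_b[2])    # -> 3  (one incorrect block, plus skipped starts 8 and 10)
```

Vertices absent from both dictionaries have zero cost, matching the text; absorbing these costs into incoming edge weights leaves the longest-path inference unchanged.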
For the normalized losses, we simply divide the costs by |y|.\n\n4 Experiments\n\nTo evaluate our model we tried two different ways of collecting data, since we could not \ufb01nd a\npublicly available set of programs together with their segmentations. First, we tried using debugging\ninformation, i.e. compiling a program with and without debugging information and using the debug\nannotations to identify the code blocks. This approach could not discover all code blocks, especially\nwhen the compiler was automatically inserting code that did not exist in the source, such as the\ncalls to destructors generated by C++ compilers. Therefore we resorted to treating the output of\nOllyDbg, a heuristic tool, as the ground truth. Since the executables we used were 200 common\nprograms from a typical installation of Windows XP, we believe that the outputs of heuristic tools\nshould have little noise. For a handful of programs we manually veri\ufb01ed that another heuristic tool,\nIdaPro, mostly agreed with OllyDbg. Of course, our model is a general statistical model and given\nan expressive feature map, it can learn any ground truth. In this view the experiments suggest the\nrelative performance of the compared models. The dataset, and an implementation of our model, are\navailable at http://www.cs.cornell.edu/\u223cnk/svmwis. Table 1 shows some statistics of the dataset.\nWe use two kinds of features, byte-level and instruction-level features. For each edge in the graph,\nthe byte-level features are extracted from an 11 byte window around the source of the edge (so if the\nsource vertex is at position i, the window spans positions i \u2212 5, . . . , i + 5). The features are which\nbytes and byte pairs appear in which position inside the window. An example feature is \u201cdoes byte c3\nappear in position i \u2212 1?\u201d. In x86 architectures, when the previous instruction is a return instruction\nthis feature \ufb01res. 
Of course, it also \ufb01res in other cases and that is why we need instruction-level\nfeatures. These are obtained from histograms of instructions that occur in candidate code blocks\n(i.e. edges of the form (bi, ej)). We use two kinds of histograms, one where we abstract the values\nof the arguments of the instructions but keep their type (register, memory location or constant), and\none where we completely discard all information about the arguments. An example of the former\ntype of feature would be \u201cnumber of times the instruction [add register, register] appears in this\nblock\u201d. An example of the latter type of feature would be \u201cnumber of times the instruction [mov]\nappears in this block\u201d. In total, we have 2.3 million features. Finally, we normalize the features by\ndividing them by the length of the sequence.\n\n6\n\n\f             | \u2206H     | \u00afL \u00b7 \u2206N H | \u2206I     | \u00afI \u00b7 \u2206N I | \u2206B     | \u00afB \u00b7 \u2206N B\nGreedy       | 1623.6 | 1916.6     | 2164.3 | 7045.2     | 1564.9 | 4747.2\nSVMhmm       | 236.2  | 201.3      | \u2014      | \u2014          | 45.1   | 46.9\nSVMwis \u2206I   | 98.8   | 115.6      | 44.6   | 98.0       | 26.1   | 41.1\nSVMwis \u2206N I | 104.3  | 103.7      | 45.5   | 79.7       | 30.5   | 35.5\nSVMwis \u2206B   | 86.5   | 98.2       | 39.6   | 80.2       | 21.5   | 32.1\nSVMwis \u2206N B | 85.2   | 87.2       | 40.6   | 75.4       | 23.4   | 29.8\n\nTable 2: Empirical results. \u2206H is Hamming loss. Normalized losses (\u2206N X) are multiplied by the\naverage number of bytes (\u00afL), instructions (\u00afI), or blocks (\u00afB) to bring all numbers to a similar scale.\n\nWe compare our model SVMwis (standing for weighted interval scheduling, to underscore that it\nis not a general segmentation model), trained to minimize the losses we introduced, with a very\nstrong baseline, a discriminatively trained HMM (using SVMhmm). This model uses only the byte-\nlevel features since it cannot express the instruction-level features. 
It tags each byte as being the\nbeginning, inside or outside of a code block using Viterbi and optimizes Hamming loss. Running a\ngeneral segmentation model [4] was impractical since inference depends quadratically on the max-\nimum length of the code blocks, which was 2800 in our data. Finally, it would be interesting to\ncompare with [5], but we could not \ufb01nd their inference algorithm available as ready-to-use soft-\nware. For all experiments we use \ufb01ve-fold cross-validation where three folds are used for training,\none fold for validation (selecting C), and one fold for testing.\nTable 2 shows the results of our comparison for different loss functions (columns): Hamming loss,\ninstruction loss, block loss, and their normalized counterparts. Results for normalized losses have\nbeen multiplied by the average number of bytes (\u00afL), instructions (\u00afI), or blocks (\u00afB) to bring all\nnumbers to a similar scale. To highlight the strength of our main baseline, SVMhmm, we have\nincluded a very simple baseline which we call greedy. Greedy starts decoding from the beginning of\nthe sequence and after decoding a block (bi, ej) it repeats at position j + 1. It only marks a byte as\ndata if the decoding fails, in which case it starts decoding from the next byte in the sequence. The\nresults suggest that just treating our task as a simple sequence labeling problem at the level of bytes\nalready goes a long way in terms of Hamming loss and block loss. SVMhmm sometimes predicts\nas the beginning of a code block a position that leads to a decoding error. Since it is not clear how\nto compute the instruction loss in this case, we do not report instruction losses for this model. The\nlast four rows of the table show the results for our model, trained to minimize the loss indicated on\neach line. We observe a further reduction in loss for all of our models. 
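The greedy baseline described above can be sketched as follows; `decode` is a hypothetical stand-in for a real disassembler that returns the end position of the block starting at a given byte, or `None` on a decoding error.

```python
# Sketch of the "greedy" baseline: decode blocks left to right, marking a
# byte as data only when decoding from it fails.

def greedy_segment(n, decode):
    labels = ["data"] * (n + 1)   # 1-indexed; slot 0 is unused
    i = 1
    while i <= n:
        j = decode(i)
        if j is None:             # decoding error: mark one data byte, move on
            i += 1
        else:                     # accept the block spanning i..j
            for k in range(i, j + 1):
                labels[k] = "code"
            i = j + 1
    return labels[1:]

# Toy decoder: decoding fails at positions 3 and 4; otherwise a block
# covers two bytes (clipped to the end of the sequence).
dec = lambda i: None if i in (3, 4) else min(i + 1, 6)
print(greedy_segment(6, dec))   # -> ['code', 'code', 'data', 'data', 'code', 'code']
```

Because greedy commits to the first block it can decode, a single desynchronized start propagates through the rest of the sequence, which is consistent with its large losses in Table 2.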
To assess this reduction,\nwe used paired Wilcoxon signed rank tests between the losses of SVMhmm\u2019s predictions and the\nlosses of our model\u2019s predictions (200 pairs). For all four models the tests suggest a statistically\nsigni\ufb01cant improvement over SVMhmm at the 1% level. For the block loss and its normalized\nversion \u2206N B, we see that the best performance is obtained for the model trained to minimize the\nrespective loss. However, this is not true for the other loss functions. For the Hamming loss, this is\nexpected since the SVMwis models are more expressive and a small block loss or instruction loss\nimplies a small Hamming loss, but not vice versa. For the instruction loss, we believe this occurs\nfor two reasons. First, our data consists of benign programs and for them learning to identify\nthe code blocks may be enough. Second, it may be harder to learn with the instruction loss since its\nvalue depends on how quickly each decoding synchronizes with another (the correct) decoding of\nthe stream, something that is not modeled in the feature map we are using. The end result is that the\nmodels trained for block loss also attain the smallest losses for all other loss functions.\n\n5 Related work and other applications\n\nThere are two lines of research which are relevant to this work: one is structured prediction ap-\nproaches for segmenting sequences and the other is research on static analysis techniques for \ufb01nding\ncode and data blocks in executables. Segmentation of sequences can be done via sequence labeling\ne.g. [9]. If features of whole segments are needed then more expressive models such as semi-CRFs\n[3] or semi-Markov SVMs [4] can be used. The latter work introduced training of segmentation\nmodels for speci\ufb01c losses. However, if the segments are allowed to be long enough, these models\nhave polynomial but impractical inference complexity. 
With additional assumptions on the features, [5] gives an efficient, though somewhat complicated, inference algorithm. In our model, inference takes linear time, is simple to implement, and does not depend on the length of the segments.
Previous techniques for identifying code blocks in executables have used no or very little statistical learning. For example, [10] and [11] use recursive disassembly and pattern heuristics, similarly to currently used tools such as OllyDbg and IdaPro. These heuristics make many assumptions about the data, which are lucidly explained in [6]. In that work, the authors use simple statistical models based on unigram and bigram instruction models in addition to the pattern heuristics. However, these approaches make independent decisions for every candidate code block and have a less principled way of dealing with equally plausible but overlapping code blocks.
Our work is most similar to [12], which uses a CRF to locate the entry points of functions. They use features that induce pairwise interactions between all possible positions in the executable, which makes their formulation intractable. They perform approximate inference with a custom iterative algorithm, but this is still slow. Our model can capture all the types of features that were used in that model except one. This feature encodes whether an address that is called by a function is not marked as a function, and including it in our structure would make exact inference NP-hard. One way to approximate this feature would be to count how many candidate code blocks have instructions that jump to or call the current position in the sequence. For their task, compiling with debugging information was enough to get real labels, and they showed that, according to these labels, heuristic tools are outperformed by their learning approach.
Finally, we conclude this section with a discussion on the broader impact of this work.
Our model is a general structured learning model and can be used in many sequence labeling problems. First, it can encode all the features of a linear-chain CRF and can simulate it by specifying a structure where each block is required to end at the same position where it starts. Furthermore, it can be used for any application where each position can yield at most one or a small number of arbitrarily long possible intervals and still have linear-time inference, while inference in segmentation models depends on the length of the segments. Applications of this form can arise in any kind of scheduling problem where we want to learn a scheduler from example schedules. For example, a news website may decide to show an ad on its front page together with its news stories. Each advertiser submits an ad along with the times at which they want the ad to be shown. The news website can train a model like the one we proposed based on past schedules and the observed total profit for each of those days. The profit may not be directly observable for each individual ad, depending on who serves the ads. When one or more ads change in the future, the model can still create a good schedule because its decisions depend on the features of the ads (such as the words in each ad), the time selected for displaying the ad, as well as the surrounding ads.

6 Conclusions

In this work we proposed a code segmentation model, SVMwis, that can help security experts in the static analysis of binary executables. We showed that inference in this model is as fast as for sequence labeling, even though our model can have features computed from entire blocks of code.
Moreover, our model is trained for the loss functions that are appropriate for the task. We also compared our model with a very strong baseline, a sequence labeling approach using a discriminatively trained HMM, and showed that we consistently outperform it.
In the future, we would like to use data annotated with real segmentations, which it might be possible to extract via a closer look at the compilation and linking process. We also want to look into richer features, such as some approximation of call consistency (since the actual constraints give rise to NP-hard inference), so that addresses which are targets of call or jump instructions from a code block do not lie inside data blocks. Finally, we plan to extend our model to allow for joint segmentation and classification of the executable as malicious or not.

Acknowledgments

I would like to thank Adam Siepel for bringing segmentation models to my attention and Thorsten Joachims, Dexter Kozen, Ainur Yessenalina, Chun-Nam Yu, and Yisong Yue for helpful discussions.

References

[1] F. B. Wrixon. Codes, Ciphers, Secrets and Cryptic Communication, page 490. Black Dog & Leventhal Publishers, 2005.

[2] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning, pages 282-289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[3] S. Sarawagi and W. W. Cohen. Semi-Markov conditional random fields for information extraction. Advances in Neural Information Processing Systems, 17:1185-1192, 2005.

[4] Q. Shi, Y. Altun, A. Smola, and S. V. N. Vishwanathan. Semi-Markov models for sequence segmentation. In Proceedings of the 2007 EMNLP-CoNLL.

[5] S. Sarawagi. Efficient inference on sequence segmentation models.
In Proceedings of the 23rd International Conference on Machine Learning, page 800. ACM, 2006.

[6] C. Kruegel, W. Robertson, F. Valeur, and G. Vigna. Static disassembly of obfuscated binaries. In Proceedings of the 13th USENIX Security Symposium, page 18. USENIX Association, 2004.

[7] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453-1484, 2005.

[8] T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27-59, 2009.

[9] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proceedings of HLT-NAACL, pages 213-220, 2003.

[10] H. Theiling. Extracting safe and precise control flow from binaries. In Seventh International Conference on Real-Time Computing Systems and Applications, pages 23-30, 2000.

[11] C. Cifuentes and M. Van Emmerik. UQBT: Adaptable binary translation at low cost. Computer, 33(3):60-66, 2000.

[12] N. Rosenblum, X. Zhu, B. Miller, and K. Hunt. Learning to analyze binary computer code. In Conference on Artificial Intelligence (AAAI 2008), Chicago, Illinois, 2008.